What happens when you break your site’s daily usage record… by 10x?

A retrospective on Fundfill’s biggest day

Yesterday was a bit of a crazy day! Since retweets of the fund for auditing TrueCrypt (fundfill.com/fund/TrueCryptAudited) began circulating and HackerNews featured the site, we’ve gotten roughly 35,000 views of the fund and more than 100,000 hits to our site. Our previous record was 3,000 views in a day, so we watched closely to see how the site would fare. I’ve read articles where a linked site goes down because its web server can’t handle the load, so I braced for the worst. No need: the site displayed pages with no speed issues at all! Yet we started getting reports of users unable to pledge, and the number of pledges wasn’t moving at all. Users were registering, but nobody was donating? Looking at the site myself, I noticed every 10th page view would display “An error has occurred.” Hrmmmmmm.

We had tested the site vigorously leading up to this. Load tests had worked really well after we fixed some issues 3 months ago, and our suite of UI tests, designed to exercise the user’s expected interaction with the site, passed every time we pushed out a new build. I had seen an error (regarding “unit of work”) about 1 time in 100 while working on the site, but I could never isolate it well enough to reproduce and fix it.

So, what happened?

Many users tried out our site for the first time yesterday. For the first three hours after the tweets started, there was a major problem in the code: users who tried pledging got “An error has occurred” multiple times, even though they could register. We identified the error and were able to fix it AND release a secure deploy within an hour after that. Unfortunately, there was also a one-time setting in the site I had changed to highlight the TrueCrypt fund. This “featured fund” setting was designed to expose the fund on the homepage. It had the adverse effect of temporarily changing the metadata to that of the previous “featured fund”, a bounty on a killer during Bay-to-Breakers, a local race in San Francisco. So, users visiting the site saw a different fund than the one they were expecting. This snafu was caught within an hour of my causing it and fixed soon thereafter.

The other bug, for those technically inclined, was a race condition in a unit of work. Units of work are context boundaries that allow database changes to be collected and executed at one time, which keeps code better separated and makes database access more efficient. There are two separate architectures in our code that create units of work. It turns out we had a race condition between these two pieces, causing the database write to fire earlier than expected under heavy load. Once we identified it, we reorganized the two pieces so that they close the database unit of work at the same time.
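For the curious, here is a minimal sketch of the idea in Python. It is not Fundfill’s actual code: the `UnitOfWork` class, the `owners` count, and the `pledge` and `audit_log` components are all made up for illustration. It assumes the fix amounts to making both pieces share one close, so the write only fires once the last piece has finished.

```python
import threading


class UnitOfWork:
    """Hypothetical unit of work shared by several components ("owners").

    Changes are collected in memory and written once, when the LAST owner
    closes. No single owner can fire the write early.
    """

    def __init__(self, owners):
        self._lock = threading.Lock()
        self._open_owners = owners
        self._pending = []
        self.committed = []    # stands in for the database
        self.commit_count = 0  # how many times the "database" was hit

    def register(self, change):
        """Queue a change; fails if the unit of work has already been flushed."""
        with self._lock:
            if self._open_owners == 0:
                raise RuntimeError("unit of work already closed")
            self._pending.append(change)

    def close(self):
        """Mark one owner as done; flush only when every owner is done."""
        with self._lock:
            if self._open_owners == 0:
                raise RuntimeError("unit of work already closed")
            self._open_owners -= 1
            if self._open_owners == 0:
                self.committed.extend(self._pending)
                self._pending.clear()
                self.commit_count += 1


def run_request():
    """Simulate one request handled by two components on separate threads."""
    uow = UnitOfWork(owners=2)

    def component(name):
        uow.register(f"{name}: change")
        uow.close()

    threads = [
        threading.Thread(target=component, args=(name,))
        for name in ("pledge", "audit_log")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return uow


if __name__ == "__main__":
    result = run_request()
    print(result.commit_count, sorted(result.committed))
```

Whichever component finishes first, its close only decrements the counter. The write happens once, after the second close, with both changes in it.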

The new release not only smoothed out the site’s errors, but it also allowed users who had problems to finish their pledges. By contacting some of those users, we were able to get back some of the pledges we would have lost. Overall, we had over an hour of the Stephen Martin / Bay-to-Breakers fund being shown in place of the actual TrueCrypt audit fund and three hours of users being unable to pledge. However, we’ve had zero interruptions in the last 16 hours, and we’ve tripled the previous day’s pledges to $1910. Given the current momentum, we expect another 2x to 4x increase in pledging by day’s end.

Lessons learned

For anyone out there with a website that needs to evolve rapidly, invest in continuous deployment. Have tests that verify everything about your site, and add one for any bug that pops up, as well. Not just unit tests, but UI tests that can confirm that any user can perform all the actions your website requires. Once the root causes were identified, they were fixed within an hour: 20 minutes for the metadata change, one hour for the pledging bug. Without the protections provided by these tests, we would have had no confidence pushing out another build. Instead, we were able to confidently push out a working site because we unit test and UI test every build before approving it.

Get feedback – One of the issues we faced was finding all the feedback actually going on. When something becomes popular very quickly, the internet has a way of transforming and creating communities to discuss problems – even if you don’t know they exist yet. Twitter is amazing at this, using its hashtags to create grassroots communities out of nothing in a matter of hours. It wasn’t twitter, though, that had the most valuable feedback. I discovered, an hour after the thread was posted, that HackerNews had started dissecting the fund. We were able to read it and gain valuable feedback about users’ experiences (and some much deserved criticism). By following a variety of terms on twitter and watching hackernews thereafter, we were able to see everything users were posting about the site and respond appropriately.

Communication – If a problem is behind-the-scenes and affects only one or two users, it’s best simply to fix the problem and contact the user who had the issue. In our case, everyone who visited the site knew that something was up, so not addressing it would have been a huge mistake. Green banners said “An error has occurred” and the site sometimes redirected to the homepage (the default behavior when the unit of work errors messed with the site). Reaching out to everyone and broadcasting our progress loudly was the best way to make sure everyone was aware of our issues. For the 5 users who added money but couldn’t pledge, I was able to contact them directly by phone, email, and twitter and get their issues resolved. Overall, we got a very positive reception from the users we were able to contact and in the forums where we addressed the issue. In business in general, I always advise people to be upfront, direct, and honest. If you’re not honest, I won’t work with you, so I have to hold us to the same level of accountability.

In the end, I was really proud of the systems we put in place long ago that allowed us to make a quick recovery. Fundfill is designed specifically for bounties and rewards, so while there may be other sites for crowdfunding, they don’t necessarily cater to awarding money to the person who wins. Furthermore, Fundfill allows users to vote for the winner, based on the amount of money they donated. Did you donate your money but don’t feel the person claiming the prize actually did what they were supposed to? You can reject their claim and demand the right work for your bounty money. Plus, unlike the iPhone TouchId bounty that was hosted and operated via twitter and a website put together from scratch, we handle all the operational details for you – pledging, updating the money, and informing everyone via Twitter.

We’re still working out some operational details like what happens to the money if nobody is able to fulfill the fund. The fund’s creators will decide how long the money is kept in escrow and what to do with it should there be a failed bounty. If you’d like to discuss this directly with me, please find me on twitter at @joebalfantz or @fundfill. If you’re interested in donating, please check our site – there’s a link on the homepage.

This entry was posted on October 10, 2013 at 4:22 pm.

Joe's hrmmmms