Prepare for your Moosey Fate
Posted On: 2010-11-12 09:14:43
We've made some minor changes to the architecture. Here's the before picture:
[before diagram]
And here's the after picture:
[after diagram]
As you can tell, we've thrown everything and the kitchen sink at the problem. Well, maybe not the kitchen sink, but at least 5 of the TF2 workstations. We have decided to implement the secret Shmoo High-availability Moose Cluster. It's a proper web cluster with a load balancer (HAProxy), a pile of web servers, a dedicated DB server, and an audit box. As you can see in the diagram below, if we need more capacity we'll just grow bigger antlers.
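For the curious, here's a rough sketch of what a frontend/backend split like this looks like in HAProxy. To be clear, this is a minimal illustration, not our actual config; the server names, addresses, and timeouts are all made up.

```
# haproxy.cfg sketch: one HTTP frontend balancing across a pile of web servers.
# All names/IPs/values are hypothetical.
global
    maxconn 4096

defaults
    mode    http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend moose_web

backend moose_web
    balance roundrobin
    option httpchk GET /          # health check each web server
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
    server web3 10.0.0.13:80 check
```

Adding capacity then really is just a matter of racking another box and adding one more `server` line.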
Seriously, we've been pouring hours a day into this setup. We built the software knowing we might want to go with a load-balanced architecture at some point, and based on last week's results, we realized we had to scale sooner rather than later.
We're still tuning and shaking out the bugs. As you can imagine, a lot changes when you move from a single host with all the software running in one place to a distributed setup like the Moose Cluster. We've made good progress, but we won't be ready this week.
That said, we do think we'll be ready on Tuesday, so we're scheduling the next attempt to sell tickets for noon EST (yes, EST now) on Tuesday, November 16th. Get F5 ready. We should be ready for you.
Posted On: 2010-11-08 13:11:00
Well, we said we were going to take some time to relax this weekend. Turns out that was a lie. We worked throughout the weekend building systems, tuning servers, and instrumenting code. The team made a lot of progress and we think we have an architecture that will work. We're in the process of deploying some systems and testing out what we've built. We'll post more details on the setup later once we've had a chance to kick the tires a bit.
The end result is that we're still a few days away from retrying the Nov 1 ticket sales. We will post another update on Wednesday and hopefully have a new sales date by then. Again, you'll have at least 24 hours' notice between the next update and the start of sales.
As we head into the weekend...
Posted On: 2010-11-04 18:25:09
Another night and day of testing and troubleshooting. We're making progress. There are several options on the table that we're implementing and testing, but for now we'll spare you the details. However, it's Friday and it's been a long week. We're going to take at least part of the weekend to regroup and rest a little. There will be another update Monday morning.
Latest news plus a thank you
Posted On: 2010-11-04 18:24:24
First off, we want to thank everyone for the response we've received today. There's been a lot of positive feedback and we really appreciate everyone being understanding of our situation. We're working our butts off trying to find a solution that gets tickets sold and everyone back to their normal lives.
As far as what we've found technically... we think we bumped up against socket limits on FreeBSD. We were running the machine with maxfiles at ~12,800 and net.inet.tcp.msl at 30000 (that value is in milliseconds, and TIME_WAIT lasts 2×MSL, i.e., 60 seconds). maxfiles is one of the variables that controls the number of open file descriptors, and therefore sockets, on the box. Both today and Monday, we saw that when we hit ~225 new connections a second, the box would freak out. Some quick math:
225 new connections/second × 60 seconds (the time the OS keeps a TCP session around in TIME_WAIT, even after it closes) = 13,500 sockets tied up.
That's more sockets than we had available (~12,800). This explains why our load actually _decreased_ when we hit the wall. It also explains why Apache was still servicing some, but not many, connections: it could only pick up sockets as old ones expired.
We cranked up the number of available sockets and cranked down the timer, so we think we've got that problem solved.
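If you're chasing the same wall, the FreeBSD knobs in question look roughly like this. The values below are illustrative only, not the exact settings we landed on:

```sh
# Illustrative FreeBSD tuning in the direction described above.
sysctl kern.maxfiles=65536          # raise the global file-descriptor (socket) ceiling
sysctl kern.maxfilesperproc=32768   # keep the per-process cap below the global one
sysctl net.inet.tcp.msl=5000        # 5s MSL -> 10s TIME_WAIT instead of 60s
```

Halving the MSL (or more) means closed sockets return to the pool much faster, which matters a lot more than raw capacity when connections churn at 225/second.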
That said, we now have a ton of information about how many people were trying to buy tickets and the rates they were hitting us with. We're estimating about 1,300 people were hitting the webserver during the attempted sale. We assume most of them were trying to buy tickets and not just window shopping ;) The numbers we got today will be useful for continued load testing. We're going to take some time, regroup a bit, do some more testing based on the new numbers, and get back to you.
Not the news we wanted to be posting but...
Posted On: 2010-11-04 13:00:06
First and foremost, our sincere apologies. We know this is not a joke. Just as we've put in hours trying to make this right, you've put in time trying to get tickets. Your time is valuable, and we wasted it twice this week. We don't take that lightly. We've heard your tweets, we've read your email. We understand.
Next, in the name of all things geeky, we feel we need to explain things a bit... if for no other reason than to help other people who may be trying to build high-capacity web servers, and so that you understand what we've been doing for the last four days.
On Monday, Apache died two minutes before noon. Load on the box was 8... not bad at all for a two-socket, quad-core Xeon. But Apache just fell over. Actually, it didn't fall over; it just stopped servicing requests. Load on the box fell, but that was because Apache wasn't doing anything useful, not because people were slowing down on F5.
After some investigation, we determined Apache needed some tuning. Here are some examples (there's a config sketch after the list):
- We cranked up the number of children. We had to drop 4 GB more RAM in the box to handle 200 children: Apache's resident footprint is ~20MB per child at start and ~40MB by the time a child has serviced 10k requests and we kill it, so 200 children works out to roughly 8GB.
- We implemented PHP caching. The main site is written in PHP, but the cart is written in Python (Django). This really helped page loads against /registration.
- We implemented AcceptFilter. On FreeBSD, this means the OS buffers an HTTP request until it's complete before handing it off to Apache.
- We tuned misc other variables. Safe to say we had some Apache heavyweights looking at the config.
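Here's what the first and third items look like in an Apache 2.2 prefork config. This is a sketch; beyond "200 children" and "recycle after 10k requests," the numbers are guesses, not our real config:

```apache
<IfModule prefork.c>
    StartServers         20
    MaxClients          200     # 200 children x ~40MB resident = ~8GB of RAM
    MaxRequestsPerChild 10000   # kill and respawn each child after 10k requests
</IfModule>

# accf_http(9): the FreeBSD kernel buffers a connection until a complete
# HTTP request has arrived, so children never sit blocked on slow clients.
# Requires the kernel module first: kldload accf_http
AcceptFilter http httpready
```

Recycling children caps the slow memory growth from ~20MB to ~40MB per child, which is what makes the 8GB budget predictable.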
We ran some load testing after these tweaks, hitting the machine with anywhere between 150 and 500 concurrent connections. Load stayed low and content was served very quickly.
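We won't claim this is exactly how we drove the tests, but ApacheBench is one easy way to generate that kind of concurrency (the URL here is a stand-in):

```sh
# 50k requests at 500 concurrent connections, with HTTP keep-alive.
ab -n 50000 -c 500 -k http://www.example.org/registration
```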
After we thought we had the Apache problem fixed, we moved on to Django, our code, and MySQL. We profiled what our code was doing to the DB and tuned MySQL to match. That resulted in ~75% of our requests to the DB being returned from cache, which definitely streamlined MySQL and reduced response time. We also restructured our code to consolidate DB calls and minimize compute time.
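For reference, the MySQL query cache is driven by a handful of my.cnf settings along these lines. The sizes below are illustrative; the post-mortem doesn't include our actual values:

```ini
# my.cnf sketch consistent with "~75% of DB requests returned from cache".
[mysqld]
query_cache_type  = 1      # cache SELECT results
query_cache_size  = 64M    # memory pool for cached result sets
query_cache_limit = 1M     # don't cache oversized results
```

A read-heavy workload like a ticket page, where thousands of people reload the same content, is the best case for this cache, since identical SELECTs are answered without touching the tables.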
We then ran more tests simulating various situations. We were able to peg the box (pushing load up to nearly 100), but pages were still loading and the machine stayed responsive while we simulated multiple registration sessions. We were able to reserve an entire sales cycle in under 30 seconds; the best we managed was 15. We saw bandwidth consumption of over 80Mbps during some of the tests.
We made a few final tweaks this morning. The biggest: we were inadvertently redirecting /registration to SSL, which wasn't helping matters (the real cart has a different URL).
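We haven't published the actual rule involved, but a hypothetical reconstruction of the fix, for a force-SSL setup in mod_rewrite, is an exclusion like this:

```apache
# Hypothetical: force SSL everywhere EXCEPT /registration
# (the real cart lives at a different URL).
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{REQUEST_URI} !^/registration
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```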
At five minutes to noon, load on the box was 0.5... WAY better than Monday. Pages were loading, things were looking good. Then at ~11:58, we had basically the same failure. *Poof.* Very low load, but the box started dropping/resetting connections.
Honestly, between all the people working on this, we probably invested at least 80 hours over the last 4 days addressing this. We really thought we had a solution that would work.
After looking at the stats so far, it seems that either Apache is getting overwhelmed by the absolute number of connections (which seems odd) or FreeBSD is somehow squashing connections at the OS level (potentially over-driving something on the server). We're not sure which, but we've got a number of people looking at it. We'll have raw data for you in a bit: number of IPs, number of connections, requests, etc.
We're not sure what the next step is right now. We need more time to dig through the data. We will have an update here at 6pm. And again, whatever we decide to do, we will give everyone at least 24 hours notice before the next round of sales.
Again, thanks for your patience. Feedback (good or bad) is always welcome at info at shmoocon dot org.