Not the news we wanted to be posting but...
Posted On: 2010-11-04 13:00:06
First and foremost, our sincere apologies. We know this is not a joke. As much as we've put in hours to make this right, you also are putting in time trying to get tickets. Your time is valuable, and we wasted it twice this week. We don't take that lightly. We've heard your tweets, we've read your email. We understand.
Next, we feel in the name of all things geeky, we need to explain things a bit... if for no other reason than to help other people who may be trying to build high-capacity web servers, and so that you understand what we've been doing for the last 4 days.
On Monday, Apache died at 2 minutes before noon. Load on the box was 8... not bad at all for a 2-socket, 4-core Xeon. But Apache just fell over. Actually, it didn't fall over, it just stopped servicing requests. Load on the box fell, but that was because Apache wasn't doing anything useful, not because people were slowing down on F5.
After some investigation, we determined Apache needed some tuning. Here are some examples:
- We cranked up the number of children. We had to drop 4 GB more in the box to handle 200 children. Our resident memory footprint for each Apache child is ~20MB at start and ~40MB by the time it has serviced 10k requests, at which point we kill it.
- We implemented PHP caching. The main server is written in PHP, but the cart is written in Python (Django). This really helped the page loads against /registration.
- We implemented AcceptFilter. On FreeBSD, this means the OS holds incoming connections until a full HTTP request has arrived, and only then hands them off to Apache.
- Tuned misc other variables. Safe to say we had some Apache heavy-weights looking at the config.
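For anyone sizing their own box, the memory arithmetic behind the 200-children number can be sketched as below. The function name is ours for illustration, and multiplying resident size by child count is a rough upper bound, since children share some pages.

```python
def apache_memory_budget_mb(children, start_mb=20, peak_mb=40):
    """Return (at_start, near_recycle) memory estimates in MB for all children.

    Defaults are the per-child resident sizes quoted above: ~20MB for a
    fresh child, ~40MB just before it is recycled at 10k requests.
    """
    return children * start_mb, children * peak_mb


at_start, near_recycle = apache_memory_budget_mb(200)
print(f"200 children: ~{at_start} MB at start, up to ~{near_recycle} MB near recycle")
```

At 200 children that works out to roughly 4,000 MB fresh and 8,000 MB worst case, before counting MySQL, PHP caches, or the OS itself.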
We ran some load testing after these tweaks. Using anywhere between 150 and 500 concurrent connections, we hit the machine. Load stayed low and content was served very quickly.
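If you want to try this kind of test yourself, here is a minimal sketch of a concurrent load generator using only the Python standard library. It is not the tool we used; the function names and the URL in the usage note are placeholders.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_once(url, timeout=10.0):
    """Fetch url once; return (status_or_None, seconds_elapsed)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except Exception:
        # Refused, reset, timed out, or an HTTP error status.
        status = None
    return status, time.perf_counter() - start


def run_load_test(url, concurrency, total_requests, fetch=fetch_once):
    """Issue total_requests GETs with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch, [url] * total_requests))
    ok = sum(1 for status, _ in results if status == 200)
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total_requests,
        "ok": ok,
        "failed": total_requests - ok,
        "median_s": latencies[len(latencies) // 2] if latencies else None,
    }
```

Something like `run_load_test("http://test-host.example/registration", 150, 5000)` would approximate the low end of the range above. Keep in mind that a single test client reuses a handful of source addresses, so it may not reproduce what thousands of distinct clients do to a server at the same instant.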
After we thought we had the Apache problem fixed, we moved on to Django, our code, and MySQL. We profiled what our code was doing to the DB and tuned MySQL to match. That resulted in ~75% of our requests to the DB being returned from cache. It definitely streamlined MySQL and reduced response time. We also restructured our code to consolidate DB calls and minimize compute time.
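One way to measure a number like that ~75%, assuming the cache in question is the MySQL query cache, is to compare two counters from SHOW GLOBAL STATUS: Qcache_hits (SELECTs answered from cache) and Com_select (SELECTs that were actually executed). A rough sketch, with a helper name of our own choosing:

```python
def query_cache_hit_ratio(status):
    """Fraction of SELECTs served from the MySQL query cache.

    `status` maps status variable names to values, as returned by
    SHOW GLOBAL STATUS. Values may arrive as strings, so cast them.
    """
    hits = int(status["Qcache_hits"])
    misses = int(status["Com_select"])  # cache hits do not increment this
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative counter values, not measurements from our server.
print(query_cache_hit_ratio({"Qcache_hits": "750", "Com_select": "250"}))
```

The counters are cumulative since server start, so for a meaningful number take two snapshots around a test run and feed in the differences.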
We then ran more tests simulating various situations. We were able to peg the box (putting the load up to nearly 100) but pages were still loading, the machine was responsive, and we simulated multiple registration sessions. We were able to reserve an entire sales cycle in <30 seconds. I think the best we had was 15. We saw bandwidth consumption at over 80Mbps during some of the tests.
We made a few final tunes this morning. The biggest was we were inadvertently redirecting /registration to SSL and that wasn't helping matters (the real cart has a different URL).
At 5 minutes to noon, load on the box was .5... WAY better than Monday. Pages were loading, things were looking good. Then at ~11:58, we had basically the same failure. *poof* Very low load, but the box started dropping/resetting connections.
Honestly, between all the people working on this, we probably invested at least 80 hours over the last 4 days addressing this. We really thought we had a solution that would work.
After looking at the stats so far, it seems that either Apache is getting overwhelmed by the absolute number of connections (which seems odd) or that FreeBSD is somehow squashing connections at the OS level (potentially over-driving something on the server). We're not sure which, but we've got a number of people looking at it. We'll have raw data for you in a bit as far as number of IPs, numbers of connections, requests, etc.
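For the curious, pulling those kinds of numbers out of an Apache access log can be sketched like this. It assumes the common log format and is not our actual analysis script; the sample lines in the usage below are made up and use documentation-only addresses.

```python
import re
from collections import Counter

# Common log format: ip ident user [timestamp] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)


def summarize(lines):
    """Count requests, unique client IPs, and the busiest one-second bucket."""
    per_ip = Counter()
    per_second = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        per_ip[m.group("ip")] += 1
        # Timestamp looks like "01/Jan/2000:11:58:01 -0400"; drop the zone.
        per_second[m.group("ts").split(" ")[0]] += 1
    busiest = per_second.most_common(1)[0] if per_second else (None, 0)
    return {
        "requests": sum(per_ip.values()),
        "unique_ips": len(per_ip),
        "top_ips": per_ip.most_common(5),
        "busiest_second": busiest,
    }


# Synthetic sample lines for illustration only.
sample = [
    '192.0.2.1 - - [01/Jan/2000:11:58:01 -0400] "GET /registration HTTP/1.1" 200 512',
    '192.0.2.1 - - [01/Jan/2000:11:58:01 -0400] "GET /registration HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2000:11:58:02 -0400] "GET / HTTP/1.1" 200 1024',
]
print(summarize(sample))
```

One caveat: the access log only records requests Apache actually handled. If connections are being dropped at the OS level before Apache sees them, they won't show up here, and OS-level counters are the place to look.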
We're not sure what the next step is right now. We need more time to dig through the data. We will have an update here at 6pm. And again, whatever we decide to do, we will give everyone at least 24 hours notice before the next round of sales.
Again, thanks for your patience. Feedback (good or bad) is always welcome to info at shmoocon dot org.