News

Not the news we wanted to be posting but...
Posted On: 2010-11-04 13:00:06
sigh.
First and foremost, our sincere apologies. We know this is not a joke. As much as we've put in hours to make this right, you also are putting in time trying to get tickets. Your time is valuable, and we wasted it twice this week. We don't take that lightly. We've heard your tweets, we've read your email. We understand.
Next, we feel in the name of all things geeky, we need to explain things a bit... if for no other reason than to help other ppl that may be trying to build high capacity web servers and so that you understand what we've been doing for the last 4 days.
On Monday, Apache died at 2 minutes before noon. Load on the box was 8... not bad at all for a 2-socket, 4-core Xeon. But Apache just fell over. Actually, it didn't fall over, it just stopped servicing requests. Load on the box fell, but it was because Apache wasn't doing anything useful, not because people were slowing down on F5.
After some investigation, we determined Apache needed some tuning. Here are some examples:
- We cranked up the number of children. We had to drop 4 GB more in the box to handle 200 children. Our resident memory footprint for Apache is ~20MB per child at start, and 40MB by the time each child has serviced 10k requests and we kill it.
- We implemented PHP caching. The main server is written in PHP, but the cart is written in Python (Django). This really helped the page loads against /registration.
- We implemented AcceptFilter. On FreeBSD, this means the OS will buffer incoming HTTP requests before they get handed off to Apache.
- We tuned misc other variables. Safe to say we had some Apache heavy-weights looking at the config.
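To put the memory numbers above in perspective, here is a rough back-of-the-envelope calculation. The child count and per-child sizes are the ones quoted in the list; the function name and the MB-to-GB conversion are just illustrative assumptions.

```python
# Rough memory budget for Apache children, using the figures quoted above.
# The child count and per-child sizes come from the post; the function name
# and the MB-to-GB conversion are illustrative assumptions.

def apache_memory_gb(children: int, mb_per_child: float) -> float:
    """Return the total resident memory, in GB, for `children` processes."""
    return children * mb_per_child / 1024


CHILDREN = 200
fresh = apache_memory_gb(CHILDREN, 20)  # each child right after it starts
worn = apache_memory_gb(CHILDREN, 40)   # each child near its 10k-request limit

print(f"fresh children: {fresh:.1f} GB")
print(f"worn children:  {worn:.1f} GB")
```

Resident size counts pages that children share with each other, so the real total is likely lower than this simple multiplication suggests. Treat it as an upper bound when sizing RAM.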
We ran some load testing after these tweaks. Using anywhere between 150 and 500 concurrent connections, we hit the machine. Load stayed low and content was served very quickly.
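For anyone who wants to try this kind of test themselves, here is a minimal sketch of a concurrent load generator. It is not the script we used; the function names (http_fetch, run_load) and the stats it reports are made up for illustration, and it only uses the Python standard library.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_fetch(url: str, timeout: float = 10.0) -> int:
    """Fetch `url` once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_load(fetch, target: str, concurrency: int, total_requests: int) -> dict:
    """Call fetch(target) `total_requests` times from `concurrency` threads."""

    def one_request(_):
        start = time.perf_counter()
        try:
            ok = fetch(target) == 200
        except Exception:
            # Timeouts, resets, and HTTP errors all count as failures.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "median_seconds": latencies[len(latencies) // 2],
        "max_seconds": latencies[-1],
    }
```

A run against a test box might look like run_load(http_fetch, "http://localhost:8080/registration", 150, 3000), where the URL is a placeholder. Only point something like this at servers you run yourself.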
After we thought we had the Apache problem fixed, we moved on to Django, our code, and MySQL. We profiled what our code was doing to the DB and tuned MySQL to match. As a result, ~75% of our requests to the DB are now returned from cache. That definitely streamlined MySQL and reduced response time. We also restructured our code to consolidate DB calls and minimize compute time.
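As an illustration of what consolidating DB calls can mean, here is a small sketch. It is not our cart code: the schema is hypothetical, and sqlite3 stands in for MySQL so the example runs anywhere. The first function issues one query per order; the second gets the same answer in a single query.

```python
import sqlite3

# Hypothetical schema; sqlite3 stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, buyer TEXT);
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, order_id INTEGER, kind TEXT);
    INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tickets VALUES
        (1, 1, 'general'), (2, 1, 'general'), (3, 2, 'student');
""")


def tickets_per_buyer_chatty(conn):
    """One query for the orders, then one more per order (N + 1 round trips)."""
    counts = {}
    orders = conn.execute("SELECT id, buyer FROM orders").fetchall()
    for order_id, buyer in orders:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM tickets WHERE order_id = ?", (order_id,)
        ).fetchone()
        counts[buyer] = n
    return counts


def tickets_per_buyer_consolidated(conn):
    """The same answer from a single JOIN + GROUP BY (one round trip)."""
    rows = conn.execute("""
        SELECT o.buyer, COUNT(t.id)
        FROM orders AS o
        LEFT JOIN tickets AS t ON t.order_id = o.id
        GROUP BY o.id, o.buyer
    """)
    return dict(rows)


print(tickets_per_buyer_chatty(conn))
print(tickets_per_buyer_consolidated(conn))
```

With two orders the difference is invisible, but under load every extra round trip to the database is multiplied by the number of concurrent requests.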
We then ran more tests simulating various situations. We were able to peg the box (putting the load up to nearly 100) but pages were still loading, the machine was responsive, and we simulated multiple registration sessions. We were able to reserve an entire sales cycle in <30 seconds. I think the best we had was 15. We saw bandwidth consumption at over 80Mbps during some of the tests.
We made a few final tweaks this morning. The biggest was that we were inadvertently redirecting /registration to SSL, and that wasn't helping matters (the real cart has a different URL).
At 5 minutes to noon, load on the box was .5... WAY better than Monday. Pages were loading, things were looking good. Then at ~11:58, we had basically the same failure. *poof* Very low load, but the box started dropping/resetting connections.
Honestly, between all the people working on this, we probably invested at least 80 hours over the last 4 days addressing this. We really thought we had a solution that would work.
Clearly not.
After looking at the stats so far, it seems that either Apache is getting overwhelmed by the absolute number of connections (which seems odd) or that FreeBSD is somehow squashing connections at the OS level (potentially over-driving something on the server). We're not sure which, but we've got a number of ppl looking at it. We'll have raw data for you in a bit as far as number of IPs, numbers of connections, requests, etc.
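For the curious, this is roughly how numbers like that can be pulled out of an access log. It is a sketch that assumes Apache's Common Log Format; the sample lines are made up (they use documentation-only IP ranges) and are not our real data.

```python
import re
from collections import Counter

# Matches the start of a Common Log Format line:
# client IP, identity, user, [timestamp], "METHOD path protocol", status.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

# Made-up sample lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [04/Nov/2010:11:58:01 -0400] "GET /registration HTTP/1.1" 200 5120',
    '203.0.113.5 - - [04/Nov/2010:11:58:02 -0400] "GET /registration HTTP/1.1" 200 5120',
    '198.51.100.7 - - [04/Nov/2010:11:57:59 -0400] "GET / HTTP/1.1" 503 312',
]


def summarize(lines):
    """Count requests, unique client IPs, statuses, and the busiest minute."""
    per_ip = Counter()
    per_minute = Counter()
    per_status = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        per_ip[m["ip"]] += 1
        per_minute[m["ts"][:17]] += 1  # e.g. "04/Nov/2010:11:58"
        per_status[m["status"]] += 1
    busiest = per_minute.most_common(1)
    return {
        "requests": sum(per_ip.values()),
        "unique_ips": len(per_ip),
        "statuses": dict(per_status),
        "busiest_minute": busiest[0] if busiest else None,
    }


print(summarize(SAMPLE))
```

Note that an access log only shows requests Apache actually handled. Connections dropped or reset before that point would have to be counted at the OS or network level.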
We're not sure what the next step is right now. We need more time to dig through the data. We will have an update here at 6pm. And again, whatever we decide to do, we will give everyone at least 24 hours notice before the next round of sales.
Again. Thanks for your patience. Feedback (good or bad) is always welcome to info at shmoocon dot org.
Another day, another update.
Posted On: 2010-11-03 11:33:53
Ready?
We've tuned Apache. We've tweaked the operating system. We've optimized our MySQL settings. We've reviewed the code and removed/consolidated/swore at the database calls. We've even sacrificed small animals. Ok, just joking about that last one.
Honestly, it's been a very busy 48 hours. We'd like to thank the folks at Intrepidus, The Guy Named Shmoo, Ben, and 3ric for their help, ideas, and time. A lot of changes have been made to the system. Each time we made a change, we threw a huge load at the system using a variety of custom Python scripts and ab, the Apache benchmarking tool. While the machine definitely still gets loaded, it does so gracefully now and the amount of traffic it takes to load the box is much higher.
We'll continue to make some minor modifications up through tomorrow morning, but we're still a "go" for noon on Thursday. To be clear, the first round of ticket sales will now commence at noon EDT on Thursday, November 4.
So come on already...tell us what's up!
Posted On: 2010-11-02 11:02:04
After a night of load testing and tweaking a number of settings, we're feeling much better about things. As it currently stands, we've streamlined our Apache installation and it seems to handle many more connections. It's not perfect, and if we were to go live right now we suspect that some of you would still get timeout errors if the demand is as high as it was on Monday.
We're also continuing to performance test the cart itself and the database. Remember, the server went belly up before we even had a chance to make ticket sales live. While we remain fairly confident that the cart will work as planned, we're taking advantage of the delay to do further testing.
The current plan is that ticket sales will go live on Thursday, November 4 at Noon EDT. We will post another update tomorrow morning.
6PM Update
Posted On: 2010-11-01 18:17:13
Ooo... Shiny Thing
ShmooCon ticket sales.... For those that have watched our 0wn the Con talks in years past, you know that the ticket sales process has been something we've struggled to get right. The team spends a lot of time learning from what went right and what went wrong and trying to do it better the next time.
After several years of extending our initial system (same basic systems from ShmooCon 2 with lots of upgrades) we decided to do a ground-up rebuild. One of the primary issues the team focused on was the queuing to make purchasing more fair. It's actually pretty complicated to fairly sell a limited number of tickets when people can buy 1 OR 2 tickets and there are more buyers than tickets. At the end of it all, the system we have now is highly customized to our selling and redemption process and should suit our needs very well.
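To give a feel for why the 1 OR 2 ticket wrinkle matters, here is a toy sketch of one possible first-come, first-served policy. To be clear, this is not our actual system: the function name, the data shapes, and the rule for handling the last ticket are all assumptions made for illustration.

```python
from collections import deque


def allocate(requests, tickets_available):
    """Serve (buyer, quantity) requests in arrival order; quantity is 1 or 2.

    A request that cannot be filled in full is skipped, so a later and
    smaller request can still take the last ticket.
    """
    queue = deque(requests)
    granted = {}
    remaining = tickets_available
    while queue and remaining > 0:
        buyer, quantity = queue.popleft()
        if quantity not in (1, 2):
            raise ValueError(f"{buyer} asked for {quantity} tickets")
        if quantity <= remaining:
            granted[buyer] = quantity
            remaining -= quantity
    return granted, remaining


# Three tickets left: a gets 2, b wants 2 but only 1 remains, c gets the last 1.
print(allocate([("a", 2), ("b", 2), ("c", 1)], 3))
```

Even this tiny version forces a policy decision: should b be skipped, or offered a single ticket instead? Questions like that are where most of the complexity comes from.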
After the code was written, we did some load testing based on last year's numbers from the Dec 1 sales run (typically our most aggressive). We had some decent stats and were able to do what we thought was a good approximation of last year's demand.
Unfortunately, two things happened. First, last year our main website was a static site that we updated from a code repository. This year, it's PHP. So the load on the server before we even turned on sales increased dramatically. But the real issue was the combination of PHP with our Apache configuration. Apache has some tuning parameters to try and prevent itself from freaking out when it gets loaded. We basically pushed our Apache config so hard that rather than queuing requests, it just dropped them instantly to save itself. Ooops. :( The bottom line: rather than failing gracefully under load, Apache basically pulled its head inside its shell and pretended it wasn't there.
We tuned Apache quite a bit this afternoon (really, we started tuning Apache a few minutes after noon when we realized what was going on). Honestly, the number of ppl who hit the site at 1pm for an update was greater than the number that were hitting the site 2 mins before sales were supposed to start at noon. And the webserver was up, responsive, and using much less memory and CPU. We've made a few minor changes since then based on further analysis and are continuing to look at the data.
So the lesson learned here is we spent a lot of time and effort on the hard CS problem at the core of selling tickets in the manner we do. Unfortunately, we took our eye off the ball on the sysadmin / operations side of the process. Now, we have a fighting chance of getting this done right. On the off chance we still need capacity, our new architecture allows us to scale across multiple boxes. However, that requires more infrastructure and more tweaks and more places for failure. We'll take our chances on the current hardware :)
Also, everyone always asks why we do this ourselves. There are a variety of reasons. The big one is when you pay someone else to handle ticket sales, they take a cut of the revenue for themselves. For example, EventBrite would cost us $9.50/ticket on a $150 ticket. Losing 6% off the top of our budget before we can even use it is a bummer. Plus, frankly, it's a heck of a learning experience.
Anyhoo, we're in the process of stress testing our changes. We're going to really push the box this time using the information we gathered today. We'll post another update tomorrow morning. We hope to have a new time for the first round of ticket sales. Again, you will have at least 24 hours between the time we post and the time sales start. That should give you enough time to tape down F5 again.
The Curse that is ShmooCon Ticket Sales....
Posted On: 2010-11-01 12:55:57
Alright, we are still in the process of digging through the chaos of the last hour. Here is what we know:
- The server buckled even before ticket sales went live.
- The good news is no tickets were sold, no one has missed out on anything.
- The bad news is even with our aggressive load testing, y'all managed to overdrive the box.
- We're focusing on the Apache configuration based on the logs and data we have.
- We'll make another announcement at 6pm EDT with regards to when we will go live with ticket sales. There will be at least 24 hours notice.
And because we still need to find humor in this somewhere:
- We had the top twitter post there for a bit.
- We must be official now as there is a FakeShmooCon twitter account.
And because we like to poke fun at ourselves we're going to run a contest. Email us your best "Curse of the ShmooCon Ticket Sales" poster/t-shirt design (think campy old horror flick) by Dec 1st. Winner will get their design printed on a t-shirt and two tickets to ShmooCon.
Questions? Email us at info @ shmoocon.org but please be patient with us as we continue to work on this issue.