little_shiba

Pirates of the Burning Sea - Stress Test Report

With the help of the fine folks at FilePlanet and SOE, we just ran our first Pirates of the Burning Sea stress test weekend. The event was an unqualified success. Despite all the painful downtime and very long days, we learned things about our servers that we simply couldn’t have learned any other way. Fortunately the stress testers were good sports the whole weekend and every time the servers came back up the testers were always ready for more.

Starting last Monday, FilePlanet began to distribute stress test keys and to allow people to download the 4GB client installer. Midday on Wednesday, they opened the test to all FilePlanet members (not just subscribers). On Thursday at 3pm Pacific time, SOE activated the keys, and we put up the servers, letting all 15,000 key holders log in for the first time. The response was immediate. Hundreds of people were in and playing within a few minutes. The numbers continued to climb for nearly three hours. Everything with the servers looked really good, and the hardware itself was actually pretty bored.

That’s when everything went horribly, gloriously wrong.

BigBrother

If you’ve read Brendan’s recent server tool devlogs, you’ve heard of this server process we have called BigBrother. BigBrother is responsible for keeping all the other server processes up and running. In fact, that’s pretty much all BigBrother does. Short of server crashes, most of the processes don’t go down, so BigBrother’s job is all about the Zone servers.

Each instance in PotBS requires a fresh zone server to start up. We keep a number of these around in the Idle state, so generally there’s one ready to go when someone enters an instance. Normally, the next idle server in the list is told to load the instance, and then the player zones in. If there aren’t any available for some reason, no big deal; BigBrother starts them constantly, so another will be ready soon.
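
To make the idle zone idea concrete, here is a minimal sketch in Python. It is not the actual BigBrother code; the class name, method names, and the target pool size are all made up for illustration. It only shows the pattern: keep a few zones pre-started, hand the next one out when a player enters an instance, and top the pool back up afterwards.

```python
from collections import deque

class IdleZonePool:
    """Hypothetical pool of pre-started zone servers waiting in the Idle state."""

    def __init__(self, target_idle):
        self.target_idle = target_idle   # how many idle zones we try to keep ready
        self.idle = deque()              # ids of zones that are started and idle
        self._next_id = 0

    def top_up(self):
        """Start zones until the pool is back at its target size."""
        started = 0
        while len(self.idle) < self.target_idle:
            self.idle.append(self._next_id)
            self._next_id += 1
            started += 1
        return started

    def claim(self):
        """Hand out the next idle zone, or None if the pool is empty."""
        return self.idle.popleft() if self.idle else None

pool = IdleZonePool(target_idle=3)
pool.top_up()
first = pool.claim()   # a zone is ready right away
pool.top_up()          # the pool is refilled for the next player
```

In a real cluster starting a zone takes time, so the pool can run dry if players claim zones faster than the refill can keep up.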

At least that’s the theory. Unfortunately once all those players hit the servers at the same time, it didn’t work out that way. A series of bugs conspired to keep idle zones from starting nearly quickly enough to keep up with demand. So three hours into the playtest we ran out of idle zones and very quickly every player on the cluster was waiting at a loading screen for an idle zone that would never come.

To make matters worse, we had another performance problem on the database that is responsible for keeping track of which processes BigBrother already had up and running. Shortly after we ran out of idle zones, about 75% of the processes in the cluster (including most of the BigBrother processes) shut down because they lost contact with the “server directory.” That’s exactly what they are supposed to do, as a means of preserving player data, but losing so much of the cluster disconnected most of the players.
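
The "shut down when you lose the server directory" behavior can be sketched roughly like this. The class, its methods, and the 30 second threshold are assumptions for the example; the post does not say how the real processes detect lost contact.

```python
class DirectoryWatchdog:
    """Hypothetical check a server process runs against the server directory."""

    def __init__(self, max_silence_s):
        self.max_silence_s = max_silence_s   # assumed threshold, not from the post
        self.last_contact = 0.0

    def heard_from_directory(self, now):
        """Record the time of the latest successful contact."""
        self.last_contact = now

    def should_shut_down(self, now):
        """True once the directory has been silent for longer than the threshold."""
        return now - self.last_contact > self.max_silence_s

watchdog = DirectoryWatchdog(max_silence_s=30)
watchdog.heard_from_directory(now=100)
healthy = watchdog.should_shut_down(now=120)   # 20 s of silence: keep running
lost = watchdog.should_shut_down(now=140)      # 40 s of silence: shut down
```

The catch is that a slow directory looks the same as a dead one, so a performance problem on that database can take down processes that were otherwise fine.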

We spent the next four days fixing the idle zone spawning problem and the server directory performance problem. We pushed out a dozen new versions of various servers (most of them BigBrother) over the course of the weekend. Misha, most of the programmers, and all the operations people were in until well after midnight every single night between Thursday and Sunday. We’ve all mostly recovered at this point, but boy were we tired by Sunday night. :)

Five Big Bugs

Five different bugs conspired to keep the idle zone processes from working correctly. Each of them individually wouldn’t have caused such a big problem, but when their powers combined, the infinite Waiting for Idle Zone screens began.

The first of them was pretty funny. If we ran out of idle zones at any point and had to wait for them to start up, we would put the people who needed to zone into a list and then pull them out one at a time as new zones became available. The problem is that we pulled them out in the opposite order from the one we put them in. If you were the first one to get in line for a new zone, you were the last to get one! If people were being added to the list more quickly than idle zones were starting up, only the last person in the list had ANY chance of ever getting a zone. Oops! Fortunately the fix for this was small and easy.
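
Here is a small Python simulation of that ordering bug. The function and its parameters are invented for the example and the numbers are arbitrary; it only contrasts pulling from the back of the list (the bug) with pulling from the front (the fix) while players arrive faster than zones start.

```python
from collections import deque

def serve(waiting, arrivals_per_tick, ticks, fifo):
    """Simulate one zone becoming available per tick while new players keep arriving.

    Returns the players who got a zone, in the order they were served.
    """
    queue = deque(waiting)
    next_player = len(waiting)
    served = []
    for _tick in range(ticks):
        # New players join the back of the line faster than zones start.
        for _ in range(arrivals_per_tick):
            queue.append(next_player)
            next_player += 1
        # One idle zone comes up; decide who gets it.
        served.append(queue.popleft() if fifo else queue.pop())
    return served

buggy = serve(waiting=[0, 1, 2], arrivals_per_tick=2, ticks=3, fifo=False)
fixed = serve(waiting=[0, 1, 2], arrivals_per_tick=2, ticks=3, fifo=True)
```

With these inputs `buggy` is `[4, 6, 8]` and `fixed` is `[0, 1, 2]`: in the buggy order, players 0, 1, and 2 never get a zone as long as new arrivals keep coming.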

The second major problem was actually more a configuration problem than a code problem. The logging database, which the zone servers connect to when they start up, was a little confused and was rejecting about half the attempts to connect to it and slowing the others way down. Normally an idle zone takes about 3 seconds to start up. With this database problem they were taking more like 30 seconds, and half the time they didn’t start up at all. A SQL Server restart fixed it, but we didn’t realize this was actually the problem until later in the weekend.
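
A rough back-of-the-envelope calculation shows how much those two numbers hurt when combined. It assumes, for simplicity, that a failed start also ties things up for about 30 seconds; the post does not give that figure, and the helper function is only illustrative.

```python
def seconds_per_ready_zone(start_time_s, success_rate):
    """Average time spent per zone that actually comes up idle."""
    return start_time_s / success_rate

healthy = seconds_per_ready_zone(start_time_s=3, success_rate=1.0)
degraded = seconds_per_ready_zone(start_time_s=30, success_rate=0.5)
slowdown = degraded / healthy
```

Under that assumption a ready zone costs about 60 seconds instead of 3, which is roughly a 20x drop in how fast idle zones can be produced.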

To make the matter of not connecting to the database worse, we had another bug that caused any idle zone that didn’t successfully connect to the flogger to hang when it tried to shut down. Not only did that take up load on the server, but it also made BigBrother think the server was still starting up.

The fourth major problem we encountered was that a zone that hung on startup was actually able to tie up a zone slot and eventually stop BigBrother from starting any new zones at all. To keep from overloading the servers, BigBrother is limited in the number of servers it will start at any time. Every time one of the zones failed to connect, it took away one from the total number of servers BigBrother would start. After a while, BigBrother stopped spawning new zones entirely because it thought all the previous requests it had sent were still pending. We didn’t realize the root of the problem (that the DB needed a restart) until later, so this bit us for the first couple days of the test.

There were a number of different cases where BigBrother could get confused about how many processes were actually spawning and eventually stop being able to start more. It took most of the weekend for us to investigate and fix these cases, but at this point BigBrother is spawning servers just as well on a loaded server as on an empty one.
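
The throttling failure can be sketched in a few lines of Python. Everything here is hypothetical: the class, the slot limit of 2, and the 60 second timeout are invented, and expiring stale requests is just one plausible way to stop hung zones from holding slots forever, not necessarily the fix that went into BigBrother.

```python
class SpawnThrottle:
    """Hypothetical limiter on how many zone starts may be in flight at once."""

    def __init__(self, max_pending, timeout_s):
        self.max_pending = max_pending
        self.timeout_s = timeout_s
        self.pending = {}   # zone id -> time the start was requested

    def try_start(self, zone_id, now):
        """Reserve a slot for a new zone start, or refuse if the limit is reached."""
        self.expire(now)
        if len(self.pending) >= self.max_pending:
            return False    # throttled: too many starts still look pending
        self.pending[zone_id] = now
        return True

    def finished(self, zone_id):
        """Called when a zone reports that it started (or failed cleanly)."""
        self.pending.pop(zone_id, None)

    def expire(self, now):
        """Give back slots held by zones that never reported in."""
        stale = [z for z, t in self.pending.items() if now - t > self.timeout_s]
        for z in stale:
            del self.pending[z]

throttle = SpawnThrottle(max_pending=2, timeout_s=60)
throttle.try_start("zone-a", now=0)                  # hangs, never calls finished()
throttle.try_start("zone-b", now=0)                  # hangs too
blocked = throttle.try_start("zone-c", now=10)       # both slots are stuck
recovered = throttle.try_start("zone-c", now=100)    # stale slots have expired
```

Without the `expire` step, `blocked` would stay the answer forever, which matches what we saw: no new zones at all once enough starts had hung.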

Finding the problems

The biggest problem we had during the test wasn’t actually figuring out how to fix the problems; it was figuring out what the problems were in the first place. If one of these problems cropped up on one of our development servers, we’d just pop open the debugger and look around in the various processes involved until we figured it out. Unfortunately that’s not an option in a cluster that’s under load, which the stress test cluster was all weekend.

The next best option is to turn on existing logging to figure it out from the logged events. We have logging scattered throughout the game and can selectively turn it on and off depending on what we’re debugging. Unfortunately the amount of logging in BigBrother at the start of the test was pretty meager.
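
For readers who haven't worked with selectively enabled logging, here is what the idea looks like using Python's standard logging module. This is only an analogy; our servers use their own logging system, and the logger names below are invented.

```python
import logging

logging.basicConfig(level=logging.WARNING)

# Hypothetical logger names, one per subsystem.
spawn_log = logging.getLogger("bigbrother.spawn")
combat_log = logging.getLogger("zone.combat")
combat_log.setLevel(logging.WARNING)

# Turn on detailed logging only for the subsystem being debugged.
spawn_log.setLevel(logging.DEBUG)

spawn_log.debug("start requested for zone %s", 42)   # emitted
combat_log.debug("damage roll resolved")             # suppressed
```

The limitation is the one we ran into: you can only switch on log statements that somebody already wrote.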

My least favorite option for finding the source of a bug is to modify the code to add additional logging, since anytime you add new code it’s a risk. But that’s exactly what we had to do with BigBrother during the stress test. On Thursday and Friday Brady was sending new versions of BigBrother up to the operations people every couple hours. The trouble with adding logs to look for a bug that you don’t yet understand is that you don’t actually know what logging to add. It was sometime on Friday before we even knew that BigBrother was stopping itself from spawning new zones because of the throttling code. The actual fix wasn’t ready to go until sometime on Saturday.

Fortunately, as with every other bug that we’ve added logging to track down, this new logging code stays in the game. The next time we need to track down a bug with process spawning in BigBrother it will be a piece of cake.

Long Wait Times

At a little after 1am Friday morning, Brady and I were no longer in any condition to write code. We went home at that point, and got some sleep before heading back to pick things up again Friday morning. Unfortunately Gray Noten, our operations lead, wasn’t so lucky. The GMs called him every 30 minutes or so all night long to reset the servers that had, once again, run out of idle zones.

To avoid him having to do that two nights in a row, we turned on the login queues. We dropped the maximum player count on the servers down to a point that we knew they could easily handle for the overnight stretches so we could get some sleep. This worked pretty well Friday night, so we did it again on Saturday, with a somewhat higher limit. By Sunday night we had solved enough of the problems that we were able to keep the limits off all night.

These login queues were very frustrating for a lot of people. They did allow the people who made it into the game 6-8 hours of uninterrupted server uptime, and gave our operations people a much needed rest. I think they were the right thing to do, but now that the clusters are supporting much higher populations, we will hopefully never need to do it again.
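
The login queue itself is a simple idea, sketched below in Python. The class, the player names, and the cap of 2 are placeholders; the real limits and the real queue code are not described in this post.

```python
from collections import deque

class LoginQueue:
    """Hypothetical login gate: cap the players online, queue everyone else."""

    def __init__(self, max_players):
        self.max_players = max_players
        self.online = set()
        self.waiting = deque()

    def login(self, player):
        """Let the player in if there is room, otherwise put them in line."""
        if len(self.online) < self.max_players:
            self.online.add(player)
            return True
        self.waiting.append(player)
        return False

    def logout(self, player):
        """Free a slot and admit whoever has been waiting longest."""
        self.online.discard(player)
        while self.waiting and len(self.online) < self.max_players:
            self.online.add(self.waiting.popleft())

queue = LoginQueue(max_players=2)
queue.login("anne")
queue.login("bart")
queued = not queue.login("cara")   # server is full, cara waits
queue.logout("anne")               # cara is admitted in anne's place
```

Note that the waiting list here is served first in, first out.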

Saint-like Patience

The big heroes in this story are the stress testers themselves. They braved constant server reboots and long overnight queue times all weekend long and just kept coming back for more. We did what we could to keep them informed about the state of things, and whenever the servers came back up they were immediately back to pound on things again.

I’m incredibly grateful for the patience shown by this weekend’s testers. The game is going to be much better as a direct result of your efforts. Your persistence helped us more than you can know. Those of you who didn’t make it into the stress test have these people to thank for the server stability you see when you do get a chance to play.

The Results

So with all these problems, how can I call this an “unqualified success”? Well, pushing the servers until they broke was the entire point of this test. That’s exactly what the weekend’s stress testers did. And they did it over and over and over.

We found about 6 major bugs (including the server directory slowness) and have put in fixes for every single one of them. By the end of the weekend we were able to keep up with the testers with no queues and no problems. We also learned a lot about exactly how high levels of stress affect the servers. We will use that knowledge to improve our automated testing to better simulate actual players and push the servers even further. That work is already underway this week and by next week, we expect to be running automated stress tests internally that will accurately recreate the conditions we had in the live stress test. That will let us bang on the servers a bunch more ourselves so we’ll be in even better shape before our next big public event.

We pushed the server architecture further this weekend than ever before. We hit concurrency numbers we’ve only hit with automation. We ended the weekend supporting way more players at the same time than we did when we started. This was a fantastic test for us and I can’t wait for the next one to see what happens when we push it even higher!

Tags:
Pirates of the Burning Sea
Article Url:
http://my.mmosite.com/9d6980b48458ea1c8cf0f65fa41cf342/blog/item/552b07e2917bee3ae66f121308edd1d2.html
