Friday, December 7, 2007

And ... we're back.

Hey guys -- just want to update you all to let you know that we successfully switched over to the new servers. While RPI players already noticed that we'd moved, we decided to wait a few days to make sure everything worked properly before officially announcing it. Over the past few days, we've re-added the dynamic map (when our servers couldn't handle the load, we had to replace the map with an image), 6X-ed the number of daily turns for RPI, and un-paused Rice. The result? We're functioning fine and considering re-launching the ILC soon.

A lot's changed, though. From a non-technical perspective (which is how this article is written), it probably sounds easy to just add more servers. But there's a good deal more to scaling out than throwing money at hardware. We had to rework parts of the back-end -- particularly the application's interactions with the database -- so that it could run across many machines much as it did on a single one.

So how is this done? I usually save the gory details for the second date, but here goes: we arranged our servers into what's known as a master/slave configuration. The slaves hold read-only copies of the data on the masters and constantly check with the masters for updates; the masters handle all the writes. So, any time you sign up for a new account, move a unit, or post a chat, your action is written to a master. When you load a page, chances are the content is being served from a slave. If you're curious for a longer explanation, by all means Google it (you should probably leave safe search on if you're in class).
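
If it helps to see the idea in code, here's a minimal sketch of the read/write split in Python. The hostnames and function are hypothetical, purely for illustration -- this isn't our actual code:

    import random

    # Hypothetical hostnames for illustration only.
    MASTER = "db-master.gxc.internal"
    SLAVES = ["db-slave1.gxc.internal", "db-slave2.gxc.internal"]

    def pick_host(query):
        """Route anything that modifies data to the master;
        spread plain reads across the read-only slaves."""
        is_read = query.lstrip().upper().startswith("SELECT")
        return random.choice(SLAVES) if is_read else MASTER

    if __name__ == "__main__":
        print(pick_host("INSERT INTO moves VALUES (...)"))  # always the master
        print(pick_host("SELECT * FROM territories"))       # one of the slaves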

Certain challenges must be overcome along the way. Unless appropriate precautions are taken, a system might fail if a user hits one server and then a different one upon reload, losing their session state along the way. Data can show up stale if a slave database hasn't yet copied changes from the master. A data collision can occur if two servers try to insert data at the same point -- say, two rows with the same ID. Or a single point of contention can bog everything down as traffic increases. GoCrossCampus was originally built to run on a single server, so we spent a lot of time planning, researching, and coding to make the transition as smooth as possible.
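
To give one concrete example: a common workaround for the stale-slave problem is "read your own writes" -- right after a user writes something, serve their reads from the master so they never see a slave that hasn't caught up. A rough sketch (the five-second window and all the names are illustrative, not our production code):

    import time

    PIN_SECONDS = 5.0    # illustrative; tune to your observed replication lag
    _last_write_at = {}  # user id -> timestamp of that user's last write

    def record_write(user_id):
        _last_write_at[user_id] = time.time()

    def should_read_from_master(user_id):
        """Pin a user's reads to the master right after they write,
        so they never read from a slave that hasn't replicated yet."""
        last = _last_write_at.get(user_id)
        return last is not None and time.time() - last < PIN_SECONDS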

Over the past month, we optimized, cached, and moved toward stateless interactions with the database. We added indexes to our tables, made code more efficient, fixed a lot of small bugs, separated out static content (CSS, JavaScript, images), and began serving compressed content.
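
Since caching does a lot of the heavy lifting, here's the general idea in miniature: a simple in-process time-to-live cache. In practice this is usually a shared cache like memcached, and the names here (including the example query) are illustrative:

    import time

    _cache = {}  # key -> (expires_at, value)

    def cached(key, ttl, compute):
        """Return the cached value for `key`, recomputing via `compute`
        once it's older than `ttl` seconds."""
        now = time.time()
        hit = _cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        _cache[key] = (now + ttl, value)
        return value

    # Usage: serve the leaderboard from memory, refreshing every 30 seconds.
    # leaderboard = cached("leaderboard", 30, run_expensive_query)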

We've had to handle a couple of minor glitches, and we had to take the site down for a few minutes once, but we haven't run into any major problems yet. Overall we've been very pleased. Since December 1st (the day we switched to the new servers), visits have more than doubled, page views have 5X-ed, and our servers are holding strong.

On that note, let the games (re-)begin!