For almost 2 years, Wordnik had been running its infrastructure on really fast, dedicated servers with lots of storage, RAM, and very impressive CPU + I/O. Seeing our API firing on all 8 cores and using 60GB of RAM in a single process was a beautiful thing, and end users got to enjoy the same benefits: a huge corpus of MongoDB data, hit anywhere, anytime, and fast. Using just one of these boxes, we could push thousands of API calls a second, hitting MySQL, MongoDB, memcached, Lucene, etc.
But we turned all that off almost 2 weeks ago and quietly moved to the humble virtual servers of Amazon’s AWS infrastructure. We now have our choice of relatively low-performance virtualized servers with significantly less horsepower. In the operations world of “CPU, memory, I/O” we can now “choose one of the above”. What gives? Is this all about saving money? And how will this affect Wordnik’s infrastructure?
Before I dig into the details, let me summarize a couple of facts. First, our system runs at almost exactly the same speed on EC2 as it did on those beefy servers (I’ll get to the reason why). Next, there’s no less data than before; in fact, we haven’t slowed the growth of our corpus of text at all. Finally, yep, we saved money. But that wasn’t the goal (nice side effect, though)!
So first off, the reason for the move is really quite simple. We needed multiple datacenters to run our system with higher availability. We also needed the ability to burst into a larger deployment for launches and heavy-traffic periods. And we needed to be able to perform incremental cluster upgrades and roll out features faster. Sounds like an advertisement for a cloud services company, right?
Well, when it comes down to it, your software either works or it doesn’t, and nothing puts it to a better test than switching to infrastructure that’s intrinsically less powerful on a per-unit basis. And the process of switching to the cloud is certainly not free unless you’re running a “hello world” application. If you wrote software to take advantage of monster physical servers, it will almost certainly fail to run efficiently in the cloud.
When we started the process of migrating back to EC2 (yes, we started here and left it; you can read more here), it was clear that a number of things had to change, or we’d have to run the largest, most expensive EC2 instances, which, believe it or not, actually cost more money to run than the big servers we were moving away from. The biggest driver was our database. We have been using MongoDB for 2 years now, but in that time a chunk of data had remained in MySQL: something on the order of 100GB of transactional tables. Simply restoring one of these tables and rebuilding its indexes takes days on a virtual server (I’m sure someone out there could make it faster). We knew that this had to change.
So this data was completely migrated to MongoDB. That was no small feat, as the code behind it was the oldest, crustiest Java relic in our software stack (we switched to Scala almost 18 months ago). One of our team members, Gregg Carrier, tirelessly ported this code to MongoDB and quietly migrated the data about a week before our datacenter move. Without that work, we couldn’t have made the move at all.
Also on the data front, we had a number of monster MongoDB databases, which have been operational ever since we first transitioned our corpus to MongoDB! Back to the previous point: these would have been a challenge to run with good performance on EC2 instances and their weak I/O. In our tests, disk seeks in the cloud came in at as little as 1/10th the performance of our dedicated hardware.
To address this, we made a significant architectural shift. We have split our application stack into something called Micro Services, a term I first heard from the folks at Netflix. The idea is that you can scale your software, your deployment, and your team better by having smaller, more focused units of software: take the library (jar) analogy and push it to the nth degree. If you consider your “distributable” software artifact to be a server, you can better manage its reliability, testability, and deployability, and you get an environment where the performance of any one portion of the stack can be understood and isolated from the rest of the system. Now the question of “whose pager should ring” when there’s an outage is easily answered! The owner of the service, of course.
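To make the “small, focused unit” idea concrete, here’s a deliberately tiny sketch of what a single-purpose service might look like, using nothing but the JDK’s built-in HTTP server from Scala. The service name, port, and JSON shape are purely illustrative, and our real services obviously do more than return an empty list:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

// Toy "definitions" service: one tiny process that owns exactly one concern.
object DefinitionsService {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/definitions", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        // The query string stands in for a real word lookup
        val query = Option(exchange.getRequestURI.getQuery).getOrElse("")
        val body  = s"""{"query": "$query", "definitions": []}""".getBytes("UTF-8")
        exchange.getResponseHeaders.set("Content-Type", "application/json")
        exchange.sendResponseHeaders(200, body.length)
        exchange.getResponseBody.write(body)
        exchange.getResponseBody.close()
      }
    })
    server.start()
  }
}
```

Because the whole service is this small, its failure modes, performance profile, and owner are all obvious at a glance.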
This translates to the data tier as well. We have low-cost servers, and they work extremely well when they stay relatively small; make them too big and things can go sour, quickly. So at the data tier, each service gets its own data cluster. This keeps services extremely focused, compact, and fast, and there’s almost no fear that some other consumer of a shared data tier is going to perform some ridiculously slow operation that craters runtime performance. Have you ever seen what happens when a BI tool is pointed at the runtime database? This is no different.
MongoDB has replica sets, and making these smaller clusters work is trivial: maybe 1/10th the effort of setting up a MySQL cluster. You add a server, it syncs, and it becomes available to services. When you don’t need it, you remove it from the set and shut it down. It just doesn’t get much easier than that.
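From the service side, “becomes available” really is that hands-off: the driver discovers replica-set members from a seed list and starts using them as they come online. Here’s a minimal sketch with the MongoDB Java driver from Scala; the hostnames, database, and collection names are made up:

```scala
import com.mongodb.{MongoClient, MongoClientOptions, ReadPreference, ServerAddress}
import scala.collection.JavaConverters._

object DefinitionsStore {
  def main(args: Array[String]): Unit = {
    // Seed list: any couple of members of the set. The driver discovers the rest,
    // including servers added to the replica set after this process started.
    val seeds = List(
      new ServerAddress("defs-db-1.internal", 27017),
      new ServerAddress("defs-db-2.internal", 27017)
    ).asJava

    val options = MongoClientOptions.builder()
      .readPreference(ReadPreference.secondaryPreferred()) // spread reads across members
      .build()

    val client = new MongoClient(seeds, options)
    val count  = client.getDatabase("definitions").getCollection("entries").count()
    println(s"entries visible to this service: $count")
    client.close()
  }
}
```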
We got the performance of these smaller clusters to a quite respectable level by running them on ephemeral (volatile) storage. To keep our precious data safe, we run one non-participating server on an EBS-backed volume in a different zone. Whoa, you might ask, what happens if your data gets too big for that ephemeral storage? Easy! If the data gets too big, it’s time to shard/partition. This is a self-imposed design constraint to keep the data tier both performant and manageable; if it gets too big, other problems will arise, cost being one of them.
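That “non-participating” EBS-backed copy maps naturally onto a hidden, priority-0 replica-set member: it replicates everything but never serves application reads and can never become primary. Here’s a rough sketch of that kind of configuration, expressed through the Java driver from Scala (in practice you’d more likely do this from the mongo shell; the set name and hostnames are invented):

```scala
import com.mongodb.{MongoClient, ServerAddress}
import org.bson.Document
import java.util.Arrays.asList

object ReconfigureSet {
  // Build one member entry of the replica-set config document
  private def member(id: Int, host: String) = new Document("_id", id).append("host", host)

  def main(args: Array[String]): Unit = {
    // The EBS-backed copy in another zone: replicates data, but is hidden from
    // clients and never eligible to become primary.
    val ebsBackup = member(2, "defs-ebs.us-west-1c.internal:27017")
      .append("priority", 0)
      .append("hidden", true)

    val config = new Document("_id", "defs-rs0")   // replica set name
      .append("version", 2)                        // must exceed the current config version
      .append("members", asList(
        member(0, "defs-db-1.internal:27017"),
        member(1, "defs-db-2.internal:27017"),
        ebsBackup))

    // replSetReconfig is an admin command and must run against the current primary
    val client = new MongoClient(new ServerAddress("defs-db-1.internal", 27017))
    client.getDatabase("admin").runCommand(new Document("replSetReconfig", config))
    client.close()
  }
}
```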
Finally, if you have all these services floating around, how do they all communicate? Isn’t this a communication and versioning nightmare?
Yes! It definitely can be. To solve the communication issues, we developed Swagger. Swagger defines a simple mechanism for describing how you communicate with our APIs, both internally and externally. If a service is going to be used in the cluster, it needs to expose this spec; then every consumer has a contract for how to talk to it. You can see more about Swagger here.
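To give a flavor of what exposing that spec can look like, here’s a rough sketch of a JAX-RS resource carrying Swagger annotations (from the swagger-core library) so that a machine-readable description can be generated for consumers. The path, operation, and response shape are invented for illustration, not our actual API:

```scala
import javax.ws.rs.{GET, Path, PathParam, Produces}
import javax.ws.rs.core.{MediaType, Response}
import com.wordnik.swagger.annotations.{Api, ApiOperation}

// Hypothetical resource: the annotations let swagger-core generate the spec
// that every consumer of this service uses as its contract.
@Path("/word.json")
@Api(value = "/word.json", description = "Operations on words")
@Produces(Array(MediaType.APPLICATION_JSON))
class WordResource {

  @GET
  @Path("/{word}/definitions")
  @ApiOperation(value = "Returns definitions for a word")
  def definitions(@PathParam("word") word: String): Response = {
    // A real implementation would delegate to the definitions service and its data cluster
    Response.ok(s"""{"word": "$word", "definitions": []}""").build()
  }
}
```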
For the versioning challenge, we developed an internal software package named Caprica. This is a provisioning and configuration tool that provides distributed configuration management based on version and cluster compatibility. That means a “production cluster” in us-west-1c will have different services to talk to than a “development cluster” in the same zone. I’ll cover more about Caprica in another post; it’s a really exciting way to think about infrastructure.
Hopefully this has been entertaining! Find out more about swagger-izing your infrastructure in this short deck.