Swagger codegen 2.0 Released!

Swagger codegen is now at 2.0! What does this mean for you?

Swagger codegen is a way to take advantage of the API declarations provided by Swagger. Yes, you get a beautiful user interface! But having a machine-readable description of your API allows for some amazing things, and Swagger codegen provides the structure to take this to the next level.

The code generation is NOT limited to client libraries. You can see how an interface-driven development process can facilitate your server code as well, through some very popular server frameworks (JavaScript with Node.js, Ruby with Sinatra, Scala with Scalatra). But these are just examples: you can easily plug your own framework into codegen. Even more, you can use the Mustache template structure to generate static documentation, supporting classes, test harnesses, etc. We think you’ll find it quite valuable.

Yes, the client library generation is hugely valuable. In the 2.0 release we’ve added support for Ruby, Python 3, and Objective-C clients. Again, the templates are YOURS to modify to suit your coding style.

Finally, and most importantly, swagger-ui, swagger-core, and swagger-codegen are open source. We have open-sourced this framework to give back to the OSS community, which has helped Wordnik so substantially. Plus, it’s WAY too cool to keep in-house only.

You can find swagger-codegen on Maven Central and GitHub.

Thanks, and as always, please feel free to post questions in our support group or on IRC at #swagger.

The Wordnik Engineering Team
http://developer.wordnik.com

With Software, Small is the new Big


For almost 2 years, Wordnik had been running its infrastructure on really fast, dedicated servers with lots of storage, RAM, and very impressive CPU + I/O. Seeing our API firing on all 8 cores and using 60GB of RAM in a single process was a beautiful thing, and end users got to enjoy the same benefits: a huge corpus of MongoDB data, hit anywhere, anytime, and fast. Using just one of these boxes, we could push thousands of API calls a second, hitting MySQL, MongoDB, memcached, Lucene, etc.

But we turned all that off almost 2 weeks ago and quietly moved to the humble virtual servers of Amazon’s AWS infrastructure. We now have our choice of relatively low-performance virtualized servers with significantly less horsepower. In the operations world of “CPU, memory, I/O” we can now “choose one of the above”. What gives? Is this all about saving money? And how will this affect Wordnik’s infrastructure?

Before I dig into the details, let me summarize a couple of facts. First, our system runs at almost exactly the same speed in EC2 as it did on those beefy servers (I’ll get to the reason why). Next, there’s no less data than before; in fact, we haven’t slowed the growth of our corpus of text. Finally, yep, we saved money. But that wasn’t the goal (nice side effect, though)!

So first off, the reason for the move is really quite simple. We needed multiple datacenters to run our system with higher availability. We also needed the ability to burst into a larger deployment for launches and heavy-traffic periods. And we needed to be able to perform incremental cluster upgrades and roll out features faster. Sounds like an advertisement for a cloud services company, right?

Well, when it comes down to it, your software either works or it doesn’t, and nothing puts it to a better test than switching infrastructure to something that’s intrinsically less powerful on a unit basis. And the process of switching to the cloud is certainly not free unless you’re running a “hello world” application. If you wrote software to take advantage of monster physical servers, it will almost certainly fail to run efficiently in the cloud.

When we started the process of migrating back to EC2 (yes, we started there and left it—you can read more here), it was clear that a number of things had to change, or we’d have to run the largest, most expensive EC2 instances, which, believe it or not, actually cost more money to run than the big servers we were moving away from. The biggest driver was our database. We have been using MongoDB for 2 years now, but in that time a chunk of data had remained in MySQL: something on the order of 100GB of transactional tables. Simply restoring one of these tables and rebuilding its indexes takes days on a virtual server (I’m sure someone out there could make it faster). We knew this had to change.

So this data was completely migrated to MongoDB. That was no small feat, as the code behind it was the oldest, crustiest Java relic in our software stack (we switched to Scala almost 18 months ago). One of our team members, Gregg Carrier, tirelessly worked this code over to MongoDB and quietly migrated the data about a week before our datacenter move. Without this, we couldn’t have made the move.
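
For a sense of the shape of that work, here is a highly simplified sketch (table, column, and collection names are invented, and the real job handled batching, retries, and verification):

```scala
import java.sql.DriverManager
import com.mongodb.{Mongo, BasicDBObject}

// source: the legacy MySQL transactional tables (connection details invented)
val mysql = DriverManager.getConnection("jdbc:mysql://db.example.com/wordnik", "user", "secret")

// destination: a MongoDB collection
val lists = new Mongo("mongo.example.com", 27017).getDB("wordnik").getCollection("wordLists")

// copy rows across, reshaping each one into a document
val rows = mysql.createStatement().executeQuery(
  "SELECT id, user_id, name, created_at FROM word_lists")
while (rows.next()) {
  lists.insert(new BasicDBObject("legacyId", rows.getLong("id"))
    .append("userId", rows.getLong("user_id"))
    .append("name", rows.getString("name"))
    .append("createdAt", rows.getTimestamp("created_at")))
}
```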

Also on the data front, we had a number of monster MongoDB databases, which had been operational ever since we first transitioned our corpus to MongoDB! Back to the previous point: these would have been a challenge to run with good performance on EC2 instances, given their weak I/O. In our tests, disk seeks in the cloud delivered as little as 1/10th the performance we were used to.

To address this, we made a significant architectural shift: we split our application stack into what we call Micro Services, a term I first heard from the folks at Netflix. The idea is that you can scale your software, your deployment, and your team better by having smaller, more focused units of software. It’s the library (jar) analogy pushed to the nth degree: if you consider your “distributable” software artifact to be a server, you can better manage its reliability, testability, and deployability, and you get an environment where the performance of any one portion of the stack can be understood and isolated from the rest of the system. Now the question of “whose pager should ring” when there’s an outage is easily answered: the owner of the service, of course.

This translates to the data tier as well. We have low-cost servers, and they work extremely well when they stay relatively small; make them too big and things can go sour, quickly. So in the data tier, each service gets its own data cluster. This keeps services extremely focused, compact, and fast, and there’s almost no fear that some other consumer of a shared data tier will perform some ridiculously slow operation that craters runtime performance. Have you ever seen what happens when a BI tool is pointed at the runtime database? This is no different.

MongoDB has replica sets, and making these smaller clusters work is trivial, maybe 1/10th the effort of setting up a MySQL cluster. You add a server, it syncs, and it becomes available to services. When you don’t need it anymore, you remove it from the set and shut it down. It just doesn’t get much easier than that.

We got quite respectable performance out of these smaller clusters by running them on ephemeral (volatile) storage. To keep our precious data safe, we run one non-participating server on an EBS-backed volume in a different zone. Whoa, you might ask, what happens if your data gets too big for that ephemeral storage? Easy! If the data gets too big, it’s time to shard/partition. This is a self-imposed design constraint to keep the data tier both performant and manageable. If it gets too big, other problems will arise, cost being one of them.
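
For the curious, one way to express that non-participating, EBS-backed member is as a hidden, priority-0 member of the replica set. Here is a rough sketch using the Java driver from Scala (hostnames and the set name are invented):

```scala
import com.mongodb.{Mongo, BasicDBObject, BasicDBList}

// run the reconfig against the current primary's admin database
val admin = new Mongo("mongo-a.example.com", 27017).getDB("admin")

def member(id: Int, host: String, hidden: Boolean): BasicDBObject = {
  val m = new BasicDBObject("_id", id).append("host", host)
  // a hidden, priority-0 member replicates everything but never serves queries
  // and never becomes primary: this is the "non-participating" copy
  if (hidden) m.append("hidden", true).append("priority", 0)
  m
}

val members = new BasicDBList()
members.add(member(0, "mongo-a.example.com:27017", hidden = false)) // ephemeral storage
members.add(member(1, "mongo-b.example.com:27017", hidden = false)) // ephemeral storage
members.add(member(2, "mongo-ebs.example.com:27017", hidden = true)) // EBS-backed, other zone

val config = new BasicDBObject("_id", "exampleService")
  .append("version", 2) // must be higher than the current config version
  .append("members", members)

admin.command(new BasicDBObject("replSetReconfig", config))
```

The EBS-backed copy stays in sync and survives instance loss, but it never takes traffic, so the ephemeral members keep their speed.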

Finally, if you have all these services floating around, how do you communicate? Isn’t this a communication and versioning nightmare?

Yes! It definitely can be. To solve the communication issue, we developed Swagger. Swagger defines a simple mechanism for describing and calling our APIs, both internally and externally. If your service is going to be used in the cluster, it needs to expose this spec; then every consumer has a contract for how to communicate with it. You can see more about Swagger here.

For the versioning challenge, we developed an internal software package named Caprica. This is a provisioning and configuration management tool which provides distributed configuration management based on version and cluster compatibility. That means a “production cluster” in us-west-1c will have different services to talk to than a “development cluster” in the same zone. I’ll cover more about Caprica in another post; it’s a really exciting way to think about infrastructure.

Hopefully this has been entertaining! Find out more about swagger-izing your infrastructure in this short deck.

What’s with the Swaggering?

If you’re a developer and have seen the Wordnik Developer documentation, you might have noticed some links to raw JSON like this. Behind all that lovely notation is the Swagger API framework. We needed to solve some recurring sources of pain at Wordnik, and we knew how our needs were going to evolve, so we took a proactive step and built Swagger.

First, our documentation was really tough to keep up to date. When you’re in a situation where your capabilities precede your documentation, you can end up in a tough spot: Your users are unable to benefit from the tools you’ve worked so hard to build, or they end up using them the wrong way. Both are bummers for developers, who typically have many choices and should get your best efforts.

As we made updates to our API, it got harder and harder to keep our client libraries current. Add an API, modify all your drivers. As with documentation, this is an unnecessarily tedious thing to do. Our developer community helped out tremendously by open-sourcing a number of different libraries, but this led to “driver drift”. Our developers shouldn’t have to worry about writing our code! It’s our job to make it easy.

Next, we needed a way to create APIs for our partners. Guess what? The same two previous issues apply. More busywork for us.

Finally, we needed a zero-code way to try out our API. A real sandbox — not white papers, video tours, slide decks. A full-featured mechanism to call the API without monkeying with code.

So that was our goal. The outcome is what we now call Swagger. So how does this thing work? Should you use it?

Our server produces a Resource Declaration. This is like a sitemap for the API: it tells you which APIs are available to the person asking. Who is asking? Well, if you pass your API key, the Swagger framework shows what is associated with that principal! That keeps sensitive APIs from being exposed, and it gives you a way to let people try out new features incrementally. It’s completely pluggable, and we provide some simple demo implementations.

Follow one of the paths to an API and you get all the operations available to the client, an operation being an HTTP method + a resource path. Look further and you’ll see something useful — the input and output models! Now you know what you’re going to get before you call. It’s a contract between your server and the client.
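
To make that concrete, here is a minimal sketch of pulling the spec down yourself. The paths and the demo api_key reflect how the public petstore sample was exposed at the time and should be treated as assumptions:

```scala
import scala.io.Source

val base   = "http://petstore.swagger.wordnik.com/api"
val apiKey = "special-key" // the petstore demo key; pass your own key against a real API

// 1. the Resource Declaration: a "sitemap" of the APIs visible to this key
println(Source.fromURL(base + "/resources.json?api_key=" + apiKey).mkString)

// 2. one resource's declaration: its operations (HTTP method + path) and its models
println(Source.fromURL(base + "/pet.json?api_key=" + apiKey).mkString)
```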

So that’s all fine and dandy but what’s the use? How do we address the pain points from above?

Well, once you know both how to call an API and what you get back, you can do some interesting things. First, how about a sandbox? There’s really not much to it — you can see it in action at petstore.swagger.wordnik.com and developer.wordnik.com. Neat! A simple sandbox that calls your API exactly the way you would from your code. You can try out the different levers of an API call without editing your source. Heck, even your boss can try out the API!

[Diagram: swagger-ui]

Next, how about clients? Well, we wrote a code generator which creates clients in a number of languages. Don’t like our codegen style? We’re not offended! It’s template driven. Make your own templates, or even your own code generator. Best of all, change your API and rebuild your client libraries. It’s all *automatic*. And for those folks who want special APIs? Build them their own client by passing their API key in. Really, it works.

[Diagram: swagger-codegen]

Finally, documentation. Wasn’t that the first point in this post? Yes! If you look in one of our sample Swagger apps, you’ll see how this is accomplished.
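
Here is a rough sketch, in Scala with JAX-RS, of what such an annotated resource method looks like. The class names are invented and the annotation attributes are approximations of the swagger-core annotations, so treat this as a sketch rather than a copy of the sample:

```scala
import javax.ws.rs.{GET, Path, Produces, QueryParam}
import com.wordnik.swagger.annotations.{Api, ApiOperation, ApiParam}

case class Pet(id: Long, name: String, status: String)

@Path("/pet")
@Api("/pet")
@Produces(Array("application/json"))
class PetResource {

  @GET
  @Path("/findByStatus")
  @ApiOperation(value = "Finds Pets by status", responseClass = "Pet")
  def findPetsByStatus(
    @ApiParam(value = "Status values to filter by",
              required = true,
              defaultValue = "available",
              allowableValues = "available,pending,sold")
    @QueryParam("status") status: String): java.util.List[Pet] = {
    // stubbed here; the real sample looks the pets up in its data store.
    // The point is that the annotations above are the single source for
    // input validation, the generated docs, and the sandbox.
    java.util.Collections.emptyList[Pet]()
  }
}
```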

See that code? The documentation is built in. This function defines the GET method on the /findByStatus path. There is a required query param with allowable values of “available, pending, sold” and a default value of “available”. It returns a Pet object. Best of all, it serves as both the input declaration and the documentation system. See it live here:

http://petstore.swagger.wordnik.com/#!/pet/findPetsByStatus_GET

All of Swagger is open-source. Check out swagger.wordnik.com for a list of repos. More on the Swagger roadmap in an upcoming post!

Introducing WordKit for iOS

The Wordnik team is proud to announce the WordKit SDK. In our quest to provide word information to any application, we are releasing both a UI and a Data SDK, giving developers the ability to easily add in-place word information to any iOS project. WordKit is a native iOS library which can be added to your application with minimal effort. Both iPhone and iPad formats are supported.


Take a look at our iOS WordKit Developer Page for full instructions on adding WordKit to your iOS application. Since the SDK is written in Objective-C, you can add Wordnik functionality to applications built with UITextView, UIWebView, or your own custom view implementations. When users look up the meaning of a word or phrase, they will see definitions, related words, and real-time example uses. Wordnik’s English-language corpus includes tens of billions of words, which means your users will find the widest range of word meanings available anywhere.

Included in our sample code is the skeleton of a full-featured iOS dictionary application, with auto-complete and Word List manipulation, which demonstrates the full Wordnik RESTful API. We have also included concise command-line examples for playing audio files and performing efficient batch requests.

If you don’t have a Wordnik API key, head over to our developer site and get one for free. We have a number of resources on the site, including a full-featured sandbox for the API which displays raw JSON and XML responses from our live API.

As always, please reach out to the Wordnik team with any questions, suggestions, or comments. You can reach us in a variety of ways, listed on our support page.

Happy Coding,

The Wordnik API Team

Wordnik API has gone Beta

This has been fun. We’ve had the API open for over a year and have received tens of millions of calls from developers. We’ve also gotten a lot of good advice on how to make it easier to use, and in christening our API as officially in “beta” we’ve tried to incorporate as many of these ideas as possible.

Since launching our API we have made major infrastructural changes, starting with our move to MongoDB. This allowed infinitely more freedom with the types of queries we could make under the hood, and also let us add more complex processing since the entire system was so much faster.  It was a giant win for Wordnik and we owe many thanks to 10gen.

We have also rewritten our REST API in Scala. This removed a ton of code and allowed us to reuse “traits” throughout the system. Standardization makes everyone’s lives better, and Scala has greatly facilitated this.

Finally, with Wordnik API libraries being built in most common languages (there’s even a Wordnik plugin for Emacs!) we’ve received feedback across the board about nuances in query parameters, ids, RESTful-ness, etc.

One surprise to all of us was the number of folks using the Wordnik API from mobile devices. Currently over 60% of our API traffic comes from mobile devices—double what we initially expected. So we now offer a latency-friendly Batch API to help folks out with this, and as a result we’ve been seeing great performance across both 2G and 3G mobile networks. Nowhere is this more evident than with biNu, a company providing a dictionary app for 2G feature phones. Their Wordnik-powered application has seen almost 700,000 downloads since September.

The downside to mobile is application deployment. It can be a relatively slow process to roll a new version of the Wordnik API into your mobile application, so the stability of the API signatures you depend on is more important than ever: hot-patching your app might not be as simple as redeploying some centralized server code. With that in mind, we’ve taken a sweep through our API signatures and formalized parameter naming, and we’ve deliberately put the API major version number in the URL.

That said, we are now enforcing compatibility through the version in the URL. Previously you could access /api to reach the latest production-stable API; this proved to be too confusing and made signature changes too difficult. From now on, new API versions will always be deployed under a /vX path, where X is the major version. Within a major version, signatures will always stay compatible (if we change a signature, we’ll support the old request syntax as well), but across major versions we reserve the right to change signatures if needed. Realizing this is work for everyone (from our testing to your updates), we’ll only do this as needed.

Next, we’ve standardized our attribute naming. Ambiguous terms like “startAt” and “count” have been replaced across the board with “skip” and “limit”, and at the suggestion of many developers we have moved from terms like “headword” and “wordstring” to simply “word”. You’ll also see that a number of “id” fields are now gone from the responses. Since we’ve moved from relational storage and retrieval to MongoDB, we can now use meaningful identifiers for data. Take the “wordObject”: you don’t really need an ID, because the word itself identifies it uniquely. So there’s less clutter in the models, less bandwidth to your client, and happiness across the board.
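
A quick illustrative call with the new naming; the /v4 path and the endpoint shown are an example of the /vX scheme rather than a prescription, and YOUR_API_KEY is a placeholder for your own key:

```scala
import scala.io.Source

// the major version lives in the path (/vX) and paging uses skip/limit
val url = "http://api.wordnik.com/v4/word.json/tintinnabulation/examples" +
          "?skip=0&limit=5&api_key=YOUR_API_KEY"
println(Source.fromURL(url).mkString)
```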

If you’re the HTTP-header-sniffing type, you’ll see the Wordnik API version + build number in the response headers, which can help you know exactly what you’re calling. Lots of folks have asked for this, and we’re happy to deliver it.

And last but not least, we’ve launched a new developer site. Please check out http://developer.wordnik.com for code libraries, an application showcase, support information and … drumroll… an interactive API shell. You can call the API and see raw XML/JSON responses from our server without having to stick a breakpoint in your code, add a println, or nag your hacker buddy with Wireshark. The developer site shows you live data from the API and lets you try out different parameters quickly and efficiently.

Thank you again for all your feedback and support! The communication lines are open so please feel free to reach out with any questions or further suggestions about the API.

The Wordnik Development Team

Have some software on us

Today Wordnik is at MongoSV, a conference held by MongoDB creator 10gen in Silicon Valley. We’re presenting tools and tricks for managing a large deployment of MongoDB. In addition, we’re releasing our home-grown MongoDB admin tools to the wild.

We’ve shared lots of information on how we use MongoDB in some blog posts and back at MongoSF in April. MongoDB plays an invaluable role in Wordnik’s application stack. It’s becoming the clear leader in the non-relational database world, and has enabled us to bring more to our users faster. Wordnik is now serving 10 million API requests per day, so performance is paramount. (You can sign up for Wordnik’s free API and see for yourself.)

We’ve had an excellent experience switching to MongoDB (you can read how we did it here). Managing it takes less effort than our previous MySQL deployment did, allowing us to add both features and content more quickly. So, to help other folks manage their own MongoDB deployments, we’re announcing Wordnik OSS Tools: a suite of tools for incremental backup, restoring from backups, and more.

And if you didn’t make it to the conference, check out our slide deck: Keeping the Lights on with MongoDB.

As always, questions and feedback welcome.

12 Months with MongoDB

Happy Monday everyone!

As previously blogged, Wordnik is a heavy user of 10gen’s MongoDB. One year ago today we started the investigation to find an alternative to MySQL for storing, finding, and retrieving our corpus data. After months of experimentation in the non-relational landscape (and running a scary number of nightly builds), we settled on MongoDB. To mark the one-year anniversary of what ended up being a great move for Wordnik, here’s a summary of how the migration has worked out for us.

Performance. The primary driver for migrating to MongoDB was performance. We had issues with MySQL for both storage and retrieval, and both were alleviated by MongoDB. Some statistics:

  • Mongo serves an average of 500k requests/hour for us (and that includes nights and weekends). We typically see 4x that during peak hours
  • We have > 12 billion documents in Mongo
  • Our storage is ~3TB per node
  • We easily sustain an insert speed of 8k documents/second, often bursting to 50k/sec (see the sketch after this list)
  • A single Java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
  • Every type of retrieval has become significantly faster than our MySQL implementation:
    – example fetch time reduced from 400ms to 60ms
    – dictionary entries from 20ms to 1ms
    – document metadata from 30ms to 0.1ms
    – spelling suggestions from 10ms to 1.2ms
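
As a taste of what those insert numbers look like from the driver’s side, here is a small, illustrative batched-insert loop (database and collection names invented; the figures above came from our production hardware, not from this sketch):

```scala
import scala.collection.JavaConverters._
import com.mongodb.{Mongo, BasicDBObject, DBObject}

val docs = new Mongo("localhost", 27017).getDB("bench").getCollection("docs")

// build one batch of 1,000 small documents and insert them in a single call
val batch: Seq[DBObject] =
  (1 to 1000).map(i => new BasicDBObject("_id", i).append("payload", "x" * 200))

val start = System.currentTimeMillis()
docs.insert(batch.asJava)
println("inserted 1000 docs in " + (System.currentTimeMillis() - start) + " ms")
```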

One wonderful benefit of Mongo’s built-in caching is that taking our memcached layer out actually sped up calls by 1-2ms/call under load. This also frees up many GB of RAM. We clearly cannot fit all our corpus data in RAM, so the 60ms average for examples includes disk access.

Flexibility. We’ve been able to add a lot of flexibility to our system since we can now efficiently execute queries against attributes deep in the object graph. You’d need to design a really ugly schema to do this in MySQL (although it can be done). Best of all, by essentially building indexes on object attributes, these queries are blazingly fast.
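
A minimal sketch of what that looks like through the Java driver from Scala (collection and field names invented):

```scala
import com.mongodb.{Mongo, BasicDBObject}

val examples = new Mongo("localhost", 27017).getDB("wordnik").getCollection("examples")

// index an attribute deep in the object graph...
examples.ensureIndex(new BasicDBObject("sentence.source.name", 1))

// ...then query directly against it; the index keeps this blazingly fast
val cursor = examples.find(new BasicDBObject("sentence.source.name", "project-gutenberg"))
while (cursor.hasNext) println(cursor.next())
```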

Other benefits:

  • We now store our audio files in MongoDB’s GridFS. Previously we used a clustered file system so files could be read and written from multiple servers. This created a huge amount of complexity from the IT operations point of view, and it meant that system backups (database + audio data) could get out of sync. Now that the files are in Mongo, we can reach them anywhere in the data center with the same Mongo driver, and backups are consistent across the system.
  • Capped collections. We keep trend data inside capped collections, which have been wonderful for keeping datasets from growing without bound. (A sketch of both of these follows below.)
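
A rough sketch of both of these, again through the Java driver from Scala (bucket, collection, and file names invented):

```scala
import java.io.File
import com.mongodb.{Mongo, BasicDBObject}
import com.mongodb.gridfs.GridFS

val db = new Mongo("localhost", 27017).getDB("wordnik")

// GridFS: any app server with a Mongo driver can read or write this file,
// and it gets backed up along with the rest of the database
val audio = new GridFS(db, "audio")
val file  = audio.createFile(new File("/tmp/tintinnabulation.mp3"))
file.setFilename("tintinnabulation.mp3")
file.save()

// capped collection: fixed size on disk, so old trend documents age out automatically
val trends = db.createCollection("trends",
  new BasicDBObject("capped", true).append("size", 1073741824L)) // ~1GB
trends.insert(new BasicDBObject("word", "tintinnabulation").append("count", 42))
```
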
Reliability. Of course, storing all your critical data in a relatively new technology has its risks. So far, we’ve done well from a reliability standpoint. Since April, we’ve had to restart Mongo twice. The first restart was to apply a patch on 1.4.2 (we’re currently running 1.4.4) to address some replication issues. The second was due to an outage in our data center. More on that in a bit.

Maintainability. This is one challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL. There is a blurry hand-off between engineering and IT Operations for this product, which is something worth noting. Luckily for all of us, there are plenty of hooks in Mongo to allow for good tools to be built, and without a doubt there will be a number of great applications to help manage Mongo.

The size of our database has required us to build some tools to help maintain Mongo, which I’ll be talking about at MongoSV in December. The bottom line is yes, you can run and maintain MongoDB, but it is important to understand the relationship between your server and your data.

The outage we had in our data center caused a major panic. We lost our DAS device during heavy writes to the server, which caused corruption on both the master and slave nodes. The master was busy flushing data to disk while the slave was applying operations via the oplog. When the DAS came back online, we had to run a repair on our master node, which took over 24 hours. The slave was compromised yet operable, so we were able to promote it to master while repairing the other system.

Restoring from tape was an option, but keep in mind that even a fast tape drive takes a long time to recover 3TB of data, to say nothing of losing everything written between the last backup and the outage. Luckily we didn’t have to go down this path. We also had an in-house incremental backup and point-in-time recovery tool, which we’ll be making open source before MongoSV.

Of course, there have been a few surprises in this process, and some good learnings to share.

Data size. At the MongoSF conference in April, I whined about the 4x disk space requirements of MongoDB. Later, the 10gen folks pointed out how collection-level padding works in Mongo; for our scenario (hundreds of collections with an average of 1GB of padding per collection) we were wasting a ton of disk on this alone. We were also able to embed a number of objects in subdocuments and drop indexes, which got our storage costs under control: now only about 1.5-2x that of our former MySQL deployment.

Locking. There are operations that will lock MongoDB at the database level. When you’re serving hundreds of requests a second, this can cause requests to pile up and create lots of problems. We’ve done the following optimizations to avoid locking:

  • If updating a record, we always query the record before issuing the update (see the sketch after this list). That gets the object into RAM, so the update will operate as fast as possible. The same logic has been added for master/slave deployments, where the slave can be run with “--pretouch”, which causes a query on the object before the update is applied
  • Multiple mongod processes. We have split up our database to run in multiple processes based on access patterns.
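
A minimal sketch of the pre-touch trick from the first bullet (names invented):

```scala
import com.mongodb.{Mongo, BasicDBObject}

val words = new Mongo("localhost", 27017).getDB("wordnik").getCollection("words")
val query = new BasicDBObject("word", "tintinnabulation")

// pre-touch: read the document first so it is resident in RAM...
words.findOne(query)

// ...then the update works against an in-memory document and holds the lock briefly
words.update(query, new BasicDBObject("$inc", new BasicDBObject("lookupCount", 1)))
```
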
In summary, life with MongoDB has been good for Wordnik. Our code is faster, more flexible, and dramatically smaller. We can code up tools to help out on the administrative side until other options surface.

Hope this has been informative and entertaining. You can always see MongoDB in action via our public API.

    Tony