Wordnik Tech Roundup

Wordnik Lead Engineer Kumanan Rajamanikkam recently wrote a great post for Cloudera’s blog about how we’re using Hadoop to process our corpus data–a real challenge, since the corpus can grow at up to 8,000 words per second. Since switching to Cloudera-flavored Hadoop, some analysis jobs that once took over three weeks to run can now be done in a day.

The blog High Scalability also wrote a good summary of blog posts and slideshows outlining how we use MongoDB and Scala, among other tools, to manage our data.

And, saving the best for last, we’d like to introduce our new Director of Engineering, Ramesh Pidikiti, who will be working on our API and our overall engineering project management. Prior to joining Wordnik, Ramesh was the VP of Engineering at Passenger, and before that he played lead roles at Certus Software and HCL. We’re very excited to have him–welcome, Ramesh!

Introducing WordKit for iOS

The Wordnik team is proud to announce the WordKit SDK. In our quest to provide word information to any application, we are releasing both a UI SDK and a Data SDK, which give developers the ability to easily add in-place word information to any iOS project. The WordKit SDK is a native iOS library which can be added to your application with a minimal amount of effort. Both iPhone and iPad formats are supported.

Take a look at our iOS WordKit Developer Page for full instructions on adding WordKit to your iOS application. Since the SDK is written in Objective-C, you can add Wordnik functionality to applications written with UITextView, UIWebView and your own custom view implementations. When users look up the meaning of a word or phrase they will see definitions, related words, and real-time example uses. Wordnik’s English language corpus includes tens of billions of words, which means your users will find the widest range of word meanings anywhere in the world.

Included in our sample code is the skeleton for a full-featured iOS dictionary application, including full auto-complete and Word List manipulation which demonstrates the full Wordnik RESTful API.  We have also included concise command-line examples for playing audio files and performing efficient batch requests.

If you don’t have a Wordnik API key, head over to our developer site and get one for free.  We have a number of resources on the site including a full-featured sandbox to the API which displays raw JSON and XML responses from our live API.

As always, please reach out to the Wordnik team with any questions, suggestions, or comments. You can reach us in a variety of ways, all listed on our support page.

Happy Coding,

The Wordnik API Team

Wordnik API has gone Beta

This has been fun. We’ve had the API open for over a year and have received tens of millions of calls from developers. We’ve also gotten a lot of good advice on how to make it easier to use, and in christening our API as officially in “beta” we’ve tried to incorporate as many of these ideas as possible.

Since launching our API we have made major infrastructural changes, starting with our move to MongoDB. This allowed infinitely more freedom with the types of queries we could make under the hood, and also let us add more complex processing since the entire system was so much faster.  It was a giant win for Wordnik and we owe many thanks to 10gen.

We have also rewritten our REST API in Scala. This has removed a ton of code and allowed us to reuse “traits” throughout the system. Standardization makes everyone’s lives better, and Scala has greatly facilitated this.

Finally, with Wordnik API libraries being built in most common languages (there’s even a Wordnik plugin for Emacs!) we’ve received feedback across the board about nuances in query parameters, ids, RESTful-ness, etc.

One surprise to all of us was the number of folks using the Wordnik API from mobile devices. Currently over 60% of our API traffic comes from mobile devices—double what we initially expected. So we now offer a latency-friendly Batch API to help folks out with this, and as a result we’ve been seeing great performance across both 2G + 3G mobile networks. Nowhere could this be more evident than with biNu, a company providing a dictionary app for 2G feature phones. Their Wordnik-powered application has almost 700,000 downloads since September.

The downside to mobile has become the deployment of applications. It can be a relatively slow process to move a new Wordnik API into your mobile application. Thus the stability of the API signatures that you depend upon is more important than ever, as hot-patching your app might not be as simple as a redeploy of some centralized server code. With that in mind, we’ve taken a sweep through our API signatures and formalized parameter naming, as well as deliberately put the API major version number in the URL.

So that said, we are enforcing compatibility with the API in the URL. Previously you could access /api to reach the latest production-stable API. This proved to be too confusing and made signature changes too difficult. From now on new API versions will always be deployed in a /vX path, where X is the major version. Within a major version the API will always stay compatible: if we change a signature, we’ll support the old request syntax as well. Across major versions we reserve the right to change signatures if needed. Realizing this is work for everyone (from our testing to your updates), we’ll only do this as needed.
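
As a minimal sketch of what pinning a client to one major version could look like, the helper below builds request URLs under a /vX path. Only the /vX convention comes from the post; the host name, the resource path and the function name are hypothetical and chosen purely for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://api.example.com/"


def versioned_url(major_version: int, resource: str, base_url: str = BASE_URL) -> str:
    """Build a URL pinned to one major API version, e.g. /v3/<resource>."""
    if major_version < 1:
        raise ValueError("major_version must be a positive integer")
    # Strip a leading slash so urljoin keeps the version prefix in the path.
    return urljoin(base_url, f"v{major_version}/{resource.lstrip('/')}")
```

Keeping the major version in one place like this means that moving to a new major version is a single, deliberate change in the client rather than something that happens silently on the server side.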

Next, we’ve standardized our attribute naming. Ambiguous terms like “startAt” and “count” have been replaced across the board with terms like “skip” and “limit”, and at the suggestion of many developers we have moved from terms like “headword” and “wordstring” to “word”. You’ll also see that a number of “id” fields are now gone from the response. Since we’ve moved to MongoDB from relational database storage & retrieval, we can now have meaningful identifiers for data. Take the “wordObject”: you don’t really need an ID, because the word itself identifies it uniquely. The result is less clutter in the models, less bandwidth to your client, and happiness across the board.
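
For client code that still builds requests with the old names, a small translation step is one way to migrate. The sketch below assumes a one-to-one pairing of old and new names inferred from the paragraph above; the exact pairing and the helper name are assumptions, so check them against the API documentation before relying on them.

```python
# Old-to-new parameter names, inferred from the renames described above.
# The exact pairing is an assumption, not a documented mapping.
RENAMED_PARAMS = {
    "startAt": "skip",
    "count": "limit",
    "headword": "word",
    "wordstring": "word",
}


def migrate_params(params: dict) -> dict:
    """Return a copy of params with legacy names replaced by the new ones."""
    migrated = {}
    for name, value in params.items():
        new_name = RENAMED_PARAMS.get(name, name)
        # Two legacy names can map to the same new name; refuse to guess.
        if new_name in migrated and migrated[new_name] != value:
            raise ValueError(f"conflicting values for {new_name!r}")
        migrated[new_name] = value
    return migrated
```

For example, `migrate_params({"startAt": 10, "count": 5, "headword": "cat"})` yields a dictionary with the keys `skip`, `limit` and `word`.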

If you’re the HTTP-header-sniffing-type, you’ll see the Wordnik API version + build number in the response, which can help you know what you’re calling. Lots of folks have asked for this, and we’re happy to deliver it.

And last but not least, we’ve launched a new developer site. Please check out http://developer.wordnik.com for code libraries, an application showcase, support information and … drumroll… an interactive API shell. You can call the API and see raw XML/JSON against our server without having to stick a breakpoint in your code, add a println or nag your hacker buddy with wireshark. The developer site shows you live data from the API and lets you try out different parameters quickly and efficiently.

Thank you again for all your feedback and support! The communication lines are open so please feel free to reach out with any questions or further suggestions about the API.

The Wordnik Development Team

Calling All Developers: Check Out Our New Developer Site!

Today we’re happy to announce the launch of our shiny new developer site at developer.wordnik.com. The site provides comprehensive documentation of our newly-updated API methods and features a sleek and powerful API sandbox for making requests right in your web browser! In addition to the new documentation, we’ve made it even easier for developers to start incorporating Wordnik features into their apps by offering code libraries in a variety of languages such as Objective-C, Ruby, Python, Java, PHP, and ActionScript/Flex. We’ve also added an application showcase to show off some of the nifty games, apps, and tools that people have already built using the Wordnik API.

You have the tools, now go build something cool! We’d love to feature your next app in our developer showcase.

Have some software on us

Today Wordnik is at MongoSV, a conference held by MongoDB creator 10gen in Silicon Valley. We’re presenting tools and tricks for managing a large deployment of MongoDB. In addition, we’re releasing our home-grown MongoDB admin tools to the wild.

We’ve shared lots of information on how we use MongoDB in some blog posts and back at MongoSF in April. MongoDB plays an invaluable role in Wordnik’s application stack. It’s becoming the clear leader in the non-relational database world, and has enabled us to bring more to our users faster. Wordnik is now serving 10 million API requests per day, so performance is paramount. (You can sign up for Wordnik’s free API and see for yourself.)

We’ve had an excellent experience switching to MongoDB (you can read how we did it here). Managing it takes less effort than our previous MySQL deployment, allowing us to add both features and content quickly. So to help other folks out there manage their MongoDB deployment we’re announcing Wordnik OSS Tools, a suite for managing MongoDB deployments, including tools for incremental backup, restoring from backups, and more.

And if you didn’t make it to the conference, check out our slide deck: Keeping the Lights on with MongoDB.

As always, questions and feedback welcome.

Happy API-versary!

Wordnik t-shirt

Do you have a Wordnik API key? This could be yours!

We’re a bit late on this … we just realized last week that the Wordnik API has been open for more than a year!

To celebrate our data-versary, we’re going to give away some spiffy Wordnik t-shirts. We’ll pick five lucky developers from everyone who has registered for a Wordnik API key by Monday, November 15th, and send them each a t-shirt.

If you haven’t signed up for a Wordnik API key yet, you might wonder: what can you do with the Wordnik API? You can power games, make mobile apps (including feature-phone apps), and more.

12 Months with MongoDB

Happy Monday everyone!

As previously blogged, Wordnik is a heavy user of 10gen’s MongoDB. One year ago today we started the investigation to find an alternative to MySQL to store, find, and retrieve our corpus data. After months of experimentation in the non-relational landscape (and running a scary number of nightly builds), we settled on MongoDB. To mark the one-year anniversary of what ended up being a great move for Wordnik, I’ll summarize how the migration has worked out for us.

Performance. The primary driver for migrating to MongoDB was performance. We had issues with MySQL for both storage and retrieval, and both were alleviated by MongoDB. Some statistics:

  • Mongo serves an average of 500k requests/hour for us (that does include nights and weekends). We typically see 4x that during peak hours
  • We have > 12 billion documents in Mongo
  • Our storage is ~3TB per node
  • We easily sustain an insert speed of 8k documents/second, often burst to 50k/sec
  • A single Java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
  • Every type of retrieval has become significantly faster than our MySQL implementation:
    – example fetch time reduced from 400ms to 60ms
    – dictionary entries from 20ms to 1ms
    – document metadata from 30ms to 0.1ms
    – spelling suggestions from 10ms to 1.2ms
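
To put the figures in the list above on a common scale, the short calculation below converts the hourly request rate to requests per second and turns each before/after latency pair into a speedup factor. All inputs are the numbers quoted above; only the variable names are made up.

```python
# Figures quoted in the list above.
avg_requests_per_hour = 500_000
peak_multiplier = 4

avg_rps = avg_requests_per_hour / 3600
peak_rps = avg_rps * peak_multiplier

# (MySQL ms, MongoDB ms) for each retrieval type in the list.
latencies_ms = {
    "example fetch": (400, 60),
    "dictionary entries": (20, 1),
    "document metadata": (30, 0.1),
    "spelling suggestions": (10, 1.2),
}
speedups = {name: before / after for name, (before, after) in latencies_ms.items()}

print(f"average: {avg_rps:.0f} req/s, peak: {peak_rps:.0f} req/s")
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

This works out to roughly 139 requests per second on average and about 556 at peak, with speedups ranging from about 6.7x for example fetches to about 300x for document metadata.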

    One wonderful benefit of the built-in caching in Mongo is that taking our memcached layer out actually sped up calls by 1-2ms/call under load. This also frees up many GB of RAM. We clearly cannot fit all our corpus data in RAM, so the 60ms average for examples includes disk access.

    Flexibility. We’ve been able to add a lot of flexibility to our system since we can now efficiently execute queries against attributes deep in the object graph. You’d need to design a really ugly schema to do this in MySQL (although it can be done). Best of all, by essentially building indexes on object attributes, these queries are blazingly fast.

    Other benefits:

  • We now store our audio files in MongoDB’s GridFS. Previously we used a clustered file system so files could be read and written from multiple servers. This created a huge amount of complexity from the IT operations point of view, and it meant that system backups (database + audio data) could get out of sync. Now that they’re in Mongo, we can reach them anywhere in the data center with the same mongo driver, and backups are consistent across the system.
  • Capped collections. We keep trend data inside capped collections, which have been wonderful for keeping datasets from unbounded growth.
  • Reliability. Of course, storing all your critical data in a relatively new technology has its risks. So far, we’ve done well from a reliability standpoint. Since April, we’ve had to restart Mongo twice. The first restart was to apply a patch on 1.4.2 (we’re currently running 1.4.4) to address some replication issues. The second was due to an outage in our data center. More on that in a bit.
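
The capped-collection behaviour mentioned in the list above can be pictured with a fixed-length queue: once it is full, each new entry pushes out the oldest one. The sketch below is an in-memory analogy only and does not use MongoDB. A real capped collection is bounded by size in bytes on the server, and the entry limit and sample words here are hypothetical.

```python
from collections import deque

# In-memory analogy only: a real capped collection is bounded by size in
# bytes on the server; here the bound is a (hypothetical) entry count.
MAX_TREND_ENTRIES = 3

trends = deque(maxlen=MAX_TREND_ENTRIES)
for word in ["alpha", "beta", "gamma", "delta", "epsilon"]:
    trends.append({"word": word})  # the oldest entry is evicted once full

recent_words = [entry["word"] for entry in trends]
print(recent_words)
```

After the loop only the three most recent entries remain, which is the property that keeps a trend dataset from growing without bound.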

    Maintainability. This is one challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL. There is a blurry hand-off between engineering and IT Operations for this product, which is something worth noting. Luckily for all of us, there are plenty of hooks in Mongo to allow for good tools to be built, and without a doubt there will be a number of great applications to help manage Mongo.

    The size of our database has required us to build some tools for helping to maintain Mongo, which I’ll be talking about at MongoSV in December. The bottom line is yes–you can run and maintain MongoDB, but it is important to understand the relationship between your server and your data.

    The outage we had in our data center caused a major panic. We lost our DAS device during heavy writes to the server–this caused corruption on both master and slave nodes. The master was busy flushing data to disk while the slave was applying operations via oplog. When the DAS came back online, we had to run a repair on our master node which took over 24 hours. The slave was compromised yet operable–we were able to promote that to being the master while repairing the other system.

    Restoring from tape was an option, but keep in mind that even a fast tape drive will take a chunk of time to recover 3TB of data, and we would still have lost everything written between the last backup and the outage. Luckily we didn’t have to go down this path. We also had an in-house incremental backup + point-in-time recovery tool which we’ll be making open-source before MongoSV.
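
For a rough sense of that "chunk of time", here is a back-of-the-envelope estimate. The 3TB figure is from the post; the sustained tape throughput of 120MB/sec is an assumed value for illustration, not a measurement from our setup, and the estimate ignores tape changes and any repair or index rebuild afterwards.

```python
# Data size from the post; the tape throughput is an assumed figure.
data_bytes = 3 * 10**12            # ~3TB per node
tape_bytes_per_sec = 120 * 10**6   # assumption: 120MB/sec sustained

restore_hours = data_bytes / tape_bytes_per_sec / 3600
print(f"estimated restore time: {restore_hours:.1f} hours")
```

Under that assumption the raw copy alone takes close to seven hours, before counting any of the data written since the last backup.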

    Of course, there have been a few surprises in this process, and some good learnings to share.

    Data size. At the MongoSF conference in April, I whined about the 4x disk space requirements of MongoDB. Later, the 10gen folks pointed out how collection-level padding works in Mongo and for our scenario–hundreds of collections with an average of 1GB padding/collection–we were wasting a ton of disk in this alone. We also were able to embed a number of objects in subdocuments and drop indexes–this got our storage costs under control–now only about 1.5-2x that of our former MySQL deployment.

    Locking. There are operations that will lock MongoDB at the database level. When you’re serving hundreds of requests a second, this can cause requests to pile up and create lots of problems. We’ve done the following optimizations to avoid locking:

  • If updating a record, we always query the record before issuing the update. That gets the object in RAM and the update will operate as fast as possible. The same logic has been added for master/slave deployments where the slave can be run with “--pretouch”, which causes a query on the object before issuing the update
  • Multiple mongod processes. We have split up our database to run in multiple processes based on access patterns.
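
The query-before-update pattern from the first point above can be sketched as a small helper. This is an illustration, not our production code: the function name is hypothetical, and `collection` is assumed to expose `find_one(query)` and `update_one(query, update)` the way PyMongo collections do.

```python
def touch_then_update(collection, query: dict, update: dict):
    """Read the matching document first, then issue the update.

    `collection` is assumed to expose find_one(query) and
    update_one(query, update), as PyMongo collections do.
    """
    # The read pulls the document into memory so the update that follows
    # spends as little time as possible holding the write lock.
    existing = collection.find_one(query)
    if existing is None:
        return None  # nothing to update
    return collection.update_one(query, update)
```

The extra read costs one round trip, but it moves the slow part (a possible disk fetch) outside the locked update.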
    In summary, life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. We can code up tools to help out the administrative side until other options surface.

    Hope this has been informative and entertaining–you can always see MongoDB in action via our public api.