B is for Billion

Only 10 years ago, having a structured database with 100 million records in it was quite a feat. Today Wordnik passed the 9 billion record mark with the open-source MongoDB from 10gen. But a record in an object store is quite different from a row in a circa-1999 relational database.

Object-oriented programming concepts flew right by the RDBMS long ago. Inner joins, left and right outer joins, unions, etc., have served us well, but how much of our data can we model in a tabular fashion? Have you ever tried doing anything complicated in Excel with just ONE sheet?

MongoDB removes an enormous amount of friction from the development process. Records shouldn’t be limited to things like the standard “user” table, with first_name, last_name, email, etc. They should be able to hold more meaningful and conceptually deep data, like “the frequency usage of a word across all time” or “the graph of all relationships to a word”, concepts that are difficult to express in tabular data. By using a document-oriented database, we at Wordnik don’t need to nag a DBA to add a field or column (well, we’re a startup, so it’s more like nagging the guy sitting next to you). If we can model it in software, MongoDB can store it, simple as that. And if MongoDB can store it, we can not only get it back (very important) but *find* it with very rich and flexible queries. Object-relational mapping (ORM) has been around about as long as OOP, but let’s face it: there is no ORM solution that (a) is flexible for the developer and (b) works in harmony with the storage system (i.e. performance doesn’t suck). MongoDB does both, easily, and it’s very, very fast.

So we hit 9 billion records, which is of course very exciting. Traffic to our public API keeps growing–MongoDB served 100M queries in the last week and didn’t break a sweat. And what’s most exciting is the number of features this helps us develop very rapidly, which we will be sneaking out over the next few weeks.

What has technology done for words lately?

There are two significant computing advancements which are enabling Wordnik to deliver more words and more information about them to you: eventual consistency and document-oriented storage.

Eventual consistency is a parallel computing concept that was first presented in the context of fault tolerance but is now completely applicable to engines like Wordnik. Why is eventual consistency important to us? Because we do a lot of counting. Since we add about 150 million words a day to the corpus, getting an accurate count of the current size is not only impossible but pointless. We can add 150 words every second.

In a traditional, transactional database, a counting-type operation will typically do one of two things: either it will lock the relevant database objects so that it can guarantee accuracy *right now*, or it will perform a number of isolation operations so that your count *was* accurate at a given point in time. Sometimes it’s important to have an exact number, like when you’re checking your account balance at an ATM. But at Wordnik, we’d rather give you a rough estimate and keep the data flowing in as fast as possible. More data is almost always better, and it’s our goal to have as much as we can. With eventual consistency, we count as many words as possible when we can, and add them all up when there’s a lag. The count’s always in the ballpark, and we never have to stop.
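To make the "count when we can, add up later" idea concrete, here is a minimal sketch in Python. It is not how Wordnik or MongoDB actually implements counting; the class name `ApproximateCounter` and its methods are hypothetical, invented only for illustration. Each writer keeps its own pending tally, reads return the last merged total, and a merge step folds the pending tallies in whenever there is a quiet moment.

```python
class ApproximateCounter:
    """Hypothetical sketch of an eventually consistent counter."""

    def __init__(self):
        self.merged_total = 0   # last total that was added up; may lag behind
        self.pending = {}       # writer name -> words counted but not yet merged

    def add(self, writer, n):
        # Each writer only touches its own slot, so ingestion never waits
        # on a global lock or on other writers.
        self.pending[writer] = self.pending.get(writer, 0) + n

    def estimate(self):
        # Cheap read: a ballpark figure that ignores whatever is still pending.
        return self.merged_total

    def merge(self):
        # Run during a lull: fold every pending tally into the total.
        for writer in list(self.pending):
            self.merged_total += self.pending.pop(writer)
        return self.merged_total


counter = ApproximateCounter()
counter.add("ingest-1", 150)
counter.add("ingest-2", 100)
print(counter.estimate())  # still the old total: nothing has been merged yet
print(counter.merge())     # the pending tallies are now included
```

The trade-off is the one described above: `estimate()` can be behind by however much is pending, but writers never have to stop for it.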

The next big computing advance that’s helping Wordnik is document-oriented storage. Hierarchy is part of most data structures, but storing hierarchical data in a flattened, tabular manner makes creative search and retrieval very difficult.

Take a dictionary entry. An entry’s hierarchy isn’t overly complex, but it does have a number of relationships–between the entry and the definitions, the parts of speech, pronunciations, citations, etc. Most software engineers have modeled hierarchical relationships in relational databases using primary and foreign keys, normalized tables, etc. But doesn’t it make more sense to look at a dictionary entry as a “document” rather than a set of related tables? It’s faster to find data with syntax like “dictionary.definitions.partOfSpeech='noun'” instead of with a series of complex (and often expensive) joins across dozens of tables.
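As a rough illustration of the "entry as a document" idea, the sketch below models two dictionary entries as nested Python dicts and looks them up with a dotted path. Everything here is an assumption made for the example: the field names, the sample entries, and the helpers `values_at` and `find` are hypothetical, and the matcher is a toy stand-in rather than MongoDB's query engine. The list named `dictionary` plays the role of the collection, so the path passed in is `definitions.partOfSpeech`.

```python
def values_at(doc, path):
    """Yield every value reachable by a dotted path, descending into lists."""
    current = [doc]
    for key in path.split("."):
        next_level = []
        for node in current:
            # A list of sub-documents is searched element by element.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    for value in current:
        if isinstance(value, list):
            yield from value
        else:
            yield value


def find(collection, path, expected):
    """Return the documents where any value at the path equals expected."""
    return [doc for doc in collection if expected in values_at(doc, path)]


# Hypothetical sample data: each entry carries its own definitions.
dictionary = [
    {
        "word": "run",
        "definitions": [
            {"partOfSpeech": "verb", "text": "to move quickly on foot"},
            {"partOfSpeech": "noun", "text": "an act of running"},
        ],
    },
    {
        "word": "quickly",
        "definitions": [
            {"partOfSpeech": "adverb", "text": "at a fast speed"},
        ],
    },
]

nouns = find(dictionary, "definitions.partOfSpeech", "noun")
print([entry["word"] for entry in nouns])  # ['run']
```

No join is involved: the definitions travel with the entry, so one pass over the documents answers the question.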

Luckily for us, the fine folks at 10gen have created MongoDB, an open-source, document-oriented database that solves these and many other technical challenges. Working with their system has been delightful and it has opened many doors for Wordnik, speeding up the development of new features!