What has technology done for words lately?

by tonytam on February 18, 2010

Two significant computing advances are enabling Wordnik to deliver more words, and more information about them, to you: eventual consistency and document-oriented storage.

Eventual consistency is a parallel computing concept that was first presented in the context of fault tolerance but now applies directly to engines like Wordnik. Why is eventual consistency important to us? Because we do a lot of counting. Since we add about 150 million words a day to the corpus, getting an accurate count of the current size is not only impossible but pointless. We can add 150 words every second.

In a traditional, transactional database, a counting-type operation will typically do one of two things: either it will lock the relevant database objects so that it can guarantee accuracy *right now*, or it will perform a number of isolation operations so that your count *was* accurate at a given point in time. Sometimes it’s important to have an exact number, like when you’re checking your account balance at an ATM. But at Wordnik, we’d rather give you a rough estimate and keep the data flowing in as fast as possible. More data is almost always better, and it’s our goal to have as much as we can. With eventual consistency, we count as many words as we can as they arrive, and add the partial counts up when there’s a lag. The count’s always in the ballpark, and we never have to stop.
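To make the counting idea concrete, here is a minimal sketch in Python. It is not Wordnik's actual implementation: the `ShardedCounter` class, its method names, and the numbers are all made up for illustration. Each writer bumps its own pending tally without locking anything shared, and a separate merge step folds those tallies into the published total.

```python
class ShardedCounter:
    """Hypothetical approximate counter: writers never wait on each other."""

    def __init__(self, num_shards):
        self.pending = [0] * num_shards  # per-writer tallies, not yet published
        self.total = 0                   # the published, possibly stale, count

    def add(self, shard, n=1):
        # A writer only touches its own slot, so no shared lock is needed.
        self.pending[shard] += n

    def merge(self):
        # Run whenever there is a quiet moment: fold pending tallies into the total.
        for shard, n in enumerate(self.pending):
            self.total += n
            self.pending[shard] = 0

    def estimate(self):
        # Readers get the last published total, which may lag behind reality.
        return self.total


counter = ShardedCounter(num_shards=3)
counter.add(0, 150)
counter.add(1, 200)
stale = counter.estimate()        # nothing merged yet, so this still reads 0

counter.merge()
counter.add(2, 50)                # arrives after the merge
after_merge = counter.estimate()  # 350: in the ballpark, missing the latest 50

counter.merge()
final = counter.estimate()        # 400: the count has caught up
```

The reader never blocks a writer and a writer never blocks a reader. The price is that `estimate()` can be behind by whatever has arrived since the last merge, which is exactly the trade described above.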

The next big computing advance that’s helping Wordnik is document-oriented storage. Hierarchy is part of most data structures, but storing hierarchical data in a flattened, tabular manner makes creative search and retrieval very difficult.

Take a dictionary entry. An entry’s hierarchy isn’t overly complex, but it does have a number of relationships between the entry and its definitions, parts of speech, pronunciations, citations, etc. Most software engineers have modeled hierarchical relationships in relational databases using primary and foreign keys, normalized tables, etc. But doesn’t it make more sense to look at a dictionary entry as a “document” rather than a set of related tables? It’s faster to find data with syntax like “dictionary.definitions.partOfSpeech=‘noun’” than with a series of complex (and often expensive) joins across dozens of tables.
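As a rough illustration of the “document” view, the sketch below stores each entry as one nested Python structure and looks it up by a dotted path. It uses plain Python rather than any database API. The sample entries, the field names other than `definitions` and `partOfSpeech`, and the helpers `values_at` and `matches` are assumptions made up for this example.

```python
# Each entry is a single nested document instead of rows spread across tables.
dictionary = [
    {
        "word": "run",
        "definitions": [
            {"partOfSpeech": "noun", "text": "An act of running."},
            {"partOfSpeech": "verb", "text": "To move quickly on foot."},
        ],
    },
    {
        "word": "quickly",
        "definitions": [
            {"partOfSpeech": "adverb", "text": "In a fast manner."},
        ],
    },
]


def values_at(node, path):
    """Yield every value found by following the path keys, fanning out over lists."""
    if isinstance(node, list):
        for item in node:
            yield from values_at(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from values_at(node[path[0]], path[1:])


def matches(doc, dotted_path, expected):
    """True if any value at the dotted path equals the expected value."""
    return any(v == expected for v in values_at(doc, dotted_path.split(".")))


# The dotted path reads much like the query in the text; no joins are involved.
nouns = [e["word"] for e in dictionary if matches(e, "definitions.partOfSpeech", "noun")]
```

Here `nouns` ends up containing only `"run"`, because that is the only entry with a noun definition. The whole entry comes back as one unit, so its definitions, parts of speech, and any other nested fields arrive together without being reassembled from separate tables.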

Luckily for us, the fine folks at 10gen have created MongoDB, an open-source, document-oriented database that solves these and many other technical challenges. Working with their system has been delightful, and it has opened many doors for Wordnik, speeding up the development of new features!
