What has technology done for words lately?

Two significant computing advances are enabling Wordnik to deliver more words, and more information about them, to you: eventual consistency and document-oriented storage.

Eventual consistency is a distributed computing concept that was first presented in the context of fault tolerance but now applies just as well to engines like Wordnik. Why is eventual consistency important to us? Because we do a lot of counting. Since we add about 150 million words a day to the corpus, getting an exact count of its current size is not only impossible but pointless. We can add 150 words every second.

In a traditional, transactional database, a counting-type operation will typically do one of two things: either it will lock the relevant database objects so that it can guarantee accuracy *right now*, or it will perform a number of isolation operations so that your count *was* accurate at a given point in time. Sometimes it’s important to have an exact number, like when you’re checking your account balance at an ATM. But at Wordnik, we’d rather give you a rough estimate and keep the data flowing in as fast as possible. More data is almost always better, and it’s our goal to have as much as we can. With eventual consistency, we count as many words as we can, whenever we can, and add them all up when there’s a lag. The count’s always in the ballpark, and we never have to stop.
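The idea can be sketched in a few lines of Python. This is only an illustration, not Wordnik's actual implementation: the class name `EventuallyConsistentCounter`, its methods, and the sample words are all hypothetical. Each ingest worker keeps its own tally, readers see the last merged total, and the tallies are folded together later.

```python
from collections import Counter


class EventuallyConsistentCounter:
    """Word counts kept per worker and folded into a shared total later."""

    def __init__(self, num_workers):
        # One private tally per ingest worker, so writers never contend.
        self.pending = [Counter() for _ in range(num_workers)]
        # The published total; it may trail the true count between merges.
        self.total = Counter()

    def record(self, worker_id, word):
        """Count a word on the worker's own tally; the total is not locked."""
        self.pending[worker_id][word] += 1

    def estimate(self, word):
        """Return the last merged count, which may be slightly stale."""
        return self.total[word]

    def merge(self):
        """Fold every pending tally into the total, e.g. during a quiet spell."""
        for tally in self.pending:
            self.total.update(tally)
            tally.clear()


counter = EventuallyConsistentCounter(num_workers=2)
counter.record(0, "lexicon")
counter.record(1, "lexicon")
counter.record(1, "corpus")

stale = counter.estimate("lexicon")  # read before the merge: behind the true count
counter.merge()
fresh = counter.estimate("lexicon")  # read after the merge: both workers included
```

The trade-off is visible in `stale` versus `fresh`: a read never waits for a writer, but it can lag until the next merge.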

The next big computing advance that’s helping Wordnik is document-oriented storage. Hierarchy is part of most data structures, but storing hierarchical data in a flattened, tabular manner makes creative search and retrieval very difficult.

Take a dictionary entry. An entry’s hierarchy isn’t overly complex, but it does have a number of relationships: between the entry and its definitions, parts of speech, pronunciations, citations, etc. Most software engineers have modeled hierarchical relationships in relational databases using primary and foreign keys, normalized tables, etc. But doesn’t it make more sense to look at a dictionary entry as a “document” rather than a set of related tables? It’s faster to find data with syntax like “dictionary.definitions.partOfSpeech='noun'” than with a series of complex (and often expensive) joins across dozens of tables.
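To make the “document” view concrete, here is a minimal sketch in plain Python. It is not MongoDB's API: the helpers `values_at` and `find`, the field names, and the sample entries with their placeholder definitions are all made up for illustration. Each entry is one nested structure, and a dotted path reaches into it without any joins.

```python
def values_at(doc, path):
    """Yield every value reachable from doc by a dotted path, descending into lists."""
    nodes = [doc]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            # A list of sub-documents is searched element by element.
            children = node if isinstance(node, list) else [node]
            for child in children:
                if isinstance(child, dict) and key in child:
                    next_nodes.append(child[key])
        nodes = next_nodes
    for node in nodes:
        if isinstance(node, list):
            yield from node
        else:
            yield node


def find(entries, path, expected):
    """Return the entries that have `expected` somewhere along the dotted path."""
    return [entry for entry in entries if expected in values_at(entry, path)]


# Hypothetical entries; each one carries its whole hierarchy with it.
dictionary = [
    {
        "word": "run",
        "definitions": [
            {"partOfSpeech": "verb", "text": "placeholder definition"},
            {"partOfSpeech": "noun", "text": "placeholder definition"},
        ],
    },
    {
        "word": "quickly",
        "definitions": [
            {"partOfSpeech": "adverb", "text": "placeholder definition"},
        ],
    },
]

nouns = [entry["word"] for entry in find(dictionary, "definitions.partOfSpeech", "noun")]
```

Here the list named `dictionary` plays the role of the collection, so the path inside each entry is just `definitions.partOfSpeech`.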

Luckily for us, the fine folks at 10gen have created MongoDB, an open-source, document-oriented database that solves these and many other technical challenges. Working with their system has been delightful, and it has opened many doors for Wordnik, speeding up the development of new features!

Announcing the new Wordnik alpha APIs! (UPDATED)

Today we’re happy to announce the alpha version of our new Wordnik APIs! UPDATE: See the video of the announcement we made at the Web 2.0 Summit.

Wordnik’s goal is not just to collect at least some information about every word in English — it’s also to make great information about words widely available, and our alpha APIs are a first step towards that goal.

Our new APIs include:

  • a definitions server, with definitions from The Century Dictionary (other dictionaries will be coming soon);
  • a “frequency” API, which returns a frequency number based on our initial API corpus*;
  • an “examples” API, which will return up to five example sentences for any word that appears in our initial API corpus;
  • the Wordnik word-of-the-day API (so you can create your own word-of-the-day wrapper or widget);
  • and it’s not really a standalone API, but we’re also throwing in an autocomplete API that is useful for making stuff with the other APIs.

You can sign up for our APIs here. Depending on demand, we may have to stagger approvals so as not to overwhelm the servers. If you want a better chance of being approved, give us as much detail as possible about how you plan to use our APIs. Coolness counts (but spelling doesn’t — since we haven’t released a spelling API yet).

Rudimentary documentation is here.

This is just a start — we’re hoping to release new APIs at regular intervals, so if there’s a kind of word data you’re longing to have access to, please let us know!

(* Our initial API corpus is about 3 billion words of running text. The API corpus is slightly different from the corpus that drives the Wordnik web site.)