There are two significant computing advances that are enabling Wordnik to deliver more words, and more information about them, to you: eventual consistency and document-oriented storage.
Eventual consistency is a parallel-computing concept that was first presented in the context of fault tolerance but is now directly applicable to engines like Wordnik. Why is eventual consistency important to us? Because we do a lot of counting. Since we add about 150 million words a day to the corpus, getting an accurate count of its current size is not only impossible but pointless: at that rate, words are arriving at more than 1,700 every second.
In a traditional, transactional database, a counting operation will typically do one of two things: either it will lock the relevant database objects so that it can guarantee accuracy *right now*, or it will perform a number of isolation operations so that your count *was* accurate at a given point in time. Sometimes it’s important to have an exact number — like when you’re checking your account balance at an ATM. But at Wordnik, we’d rather give you a rough estimate and keep the data flowing in as fast as possible. More data is almost always better, and it’s our goal to have as much as we can. With eventual consistency, we count words as they stream in and reconcile the totals whenever there’s a lag. The count is always in the ballpark, and we never have to stop.
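To make that concrete, here's a minimal sketch of the idea in Python (the names and sharding scheme here are for illustration, not Wordnik's actual implementation): each writer increments its own counter shard without taking a lock, and a reader sums the shards whenever it wants a ballpark figure.

```python
import threading
from collections import defaultdict

# One counter "shard" per writer thread; each writer touches only its
# own shard, so writers never block each other or the reader.
shard_counts = defaultdict(int)

def record_words(n):
    """Hot path: bump this thread's shard and move on -- no locking."""
    shard_counts[threading.get_ident()] += n

def ballpark_total():
    """Cold path: sum whatever the shards hold right now. The result
    may trail in-flight writes slightly -- that's the trade we want."""
    return sum(shard_counts.values())
```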
The next big computing advance that’s helping Wordnik is document-oriented storage. Hierarchy is part of most data structures, but storing hierarchical data in a flattened, tabular manner makes creative search and retrieval very difficult.
Take a dictionary entry. An entry’s hierarchy isn’t overly complex, but it does have a number of relationships: between the entry and its definitions, parts of speech, pronunciations, citations, etc. Most software engineers have modeled hierarchical relationships in relational databases using primary and foreign keys, normalized tables, etc. But doesn’t it make more sense to look at a dictionary entry as a “document” rather than a set of related tables? It’s faster to find data with syntax like “dictionary.definitions.partOfSpeech='noun'” than with a series of complex (and often expensive) joins across dozens of tables.
Luckily for us, the fine folks at 10gen have created MongoDB, an open-source, document-oriented database that solves these and many other technical challenges. Working with their system has been delightful, and it has opened many doors for Wordnik, speeding up the development of new features!
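To make the contrast with joins concrete, here's a minimal sketch using pymongo against a local MongoDB (the schema and names are illustrative, not Wordnik's actual data model):

```python
from pymongo import MongoClient

client = MongoClient()         # assumes a mongod running locally
entries = client.demo.entries  # hypothetical database and collection

# A whole dictionary entry as one nested document: the definitions,
# parts of speech, and pronunciations live inside the entry itself.
entries.insert_one({
    "word": "ballpark",
    "pronunciations": ["BAWL-pahrk"],
    "definitions": [
        {"partOfSpeech": "noun",
         "text": "A park where ball games are played."},
        {"partOfSpeech": "verb",
         "text": "To estimate roughly."},
    ],
})

# The dotted-path query from above: no joins, just reach into the
# nested definitions array and match on a field.
for entry in entries.find({"definitions.partOfSpeech": "noun"}):
    print(entry["word"])
```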
Interesting hearing the tech perspective on this. Cool approach….
Interesting. I never thought one could put so much thought into “technologizing” words
H. L. Mencken did a three-volume Dictionary of American Slang. You should incorporate it into your source material. I’m sure Blue balls will have a spot in his Dictionary. Taboo plays a big part in its makeup.
Mack Kelly
I wonder what long-term impact the Internet will have on language. Will it make it more concise, with words that don’t mean anything being corrected through easier access to sites like this (I hate the word “conversating,” for example), or will it make for more widespread use of slang and new words?
The Internet was supposed to make us more educated, but I’m afraid it’s making us dumber sometimes…
So I’m curious: do you have a “noun” document that contains all the properties of the noun part of speech, or do all definitions just have a partOfSpeech property, so you find all documents with the same value for that property?
Hi Jesse!
Your question is part of what we’re trying to answer with smartwords — what method makes more sense to you?
Eric:
Having a partOfSpeech document makes more sense to me, but I am thinking about it from an RDBMS perspective. I don’t know how much of a concern normalization is with document storage systems like couch or mongo. I think there is a reason for data normalization outside of RDBMS, but I don’t know how NoSQL addresses it, if at all. That’s what I’m trying to learn.
Hi Jesse, a lot of the benefits of normalization have gone out the window with virtually free disk space and different access patterns. So if you are constantly updating, say, the “partOfSpeech.name” attribute in a large set of documents (say you can’t decide between “noun” and “n.”), then normalization is great. But in heavy read-only scenarios, it’s often cheaper to incur the disk space hit of redundant data for fast access.
You could argue that normalization doesn’t work well at the NoSQL storage level because there are no joins, which means more application logic or nested queries. But if you step back and look at a hierarchical data structure like a dictionary, it’s incredibly powerful to query inside a document.
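To put the trade-off in concrete terms, here's a small pymongo sketch (the collection names and the rename scenario are illustrative):

```python
from pymongo import MongoClient

db = MongoClient().demo  # hypothetical database

# Denormalized: every definition embeds the part-of-speech string, so
# renaming "noun" to "n." means rewriting each affected document
# (array_filters requires MongoDB 3.6+).
db.entries.update_many(
    {"definitions.partOfSpeech": "noun"},
    {"$set": {"definitions.$[d].partOfSpeech": "n."}},
    array_filters=[{"d.partOfSpeech": "noun"}],
)

# Normalized: definitions store only a key, and the display name lives
# once in a small lookup collection -- a single write to rename, but
# every read pays for a second lookup (a "join" done in app code).
db.parts_of_speech.update_one(
    {"_id": "noun"}, {"$set": {"name": "n."}}, upsert=True
)
```

In a read-heavy corpus like a dictionary, the embedded shape usually wins: the expensive rename is rare, and the cheap reads happen constantly.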