from the post at Semantic Web:
You may know Wordnik from subscribing to its Word of the Day service (by the way, today that word is eloign). Or perhaps you know it from some of the apps that have used its API – such as Freebase WordNet Explorer, or one of the many mobile ones that let users access direct features of the system through their smart phones.
Now comes something new on the API front: Word Graph, the latest result of some three years of algorithm development around analyzing the digital text that Wordnik has collected from partners, with the aim of understanding the relationships between words in order to derive meaning. Word Graph matches content based on digital text from partners who need to understand more of what their content says, and it helps them and their services make decisions based on that understanding.
In that respect, it’s taking Wordnik’s API services closer to helping accomplish business requirements, rather than driving neat B-to-C apps, from crossword puzzles to jumble games to pronunciation voice services, which is where its APIs have mostly been employed so far.
The first partner to use the API is TaskRabbit, an online service that matches task creators (e.g. someone who needs child care) with task runners (e.g. babysitters). Before the API was integrated into its business logic tier, the key to-dos of the service were all accomplished manually, says Tony Tam, co-founder and vp of engineering at Wordnik. Submitted tasks, for instance, had to be manually categorized, but now the system has been trained, based on TaskRabbit content, to appropriately treat terms from its domains. That is, for example, to understand that babysitting and child care are roughly the same thing, and to automatically categorize together the tasks submitted with the various terms. Now it knows to show tasks that used either term to the task runners who perform those services; in fact, it can find those task runners who’ve done a certain type of job (whether it’s called babysitting, child care, mother’s helper, day care, and so on) multiple times, and tell them about new tasks in the same vein. Similarly, for task posters, the API is used to match relevant tasks, so that they can quickly see how others with categorically similar requirements have posted their tasks and what they’re offering as fees, and possibly revise their own job postings to be better and more competitive matches.
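The categorization described above can be illustrated with a toy sketch. This is not Wordnik's API or TaskRabbit's code: the `CATEGORY_TERMS` table, the `categorize` and `group_tasks` functions, and the sample tasks are all hypothetical, and a hand-written synonym table stands in for the word relationships that Word Graph is said to learn from content.

```python
from collections import defaultdict

# Hypothetical synonym table; a stand-in for learned word relationships.
CATEGORY_TERMS = {
    "child care": {"babysitting", "child care", "mother's helper", "day care"},
    "cleaning": {"house cleaning", "housekeeping", "maid service"},
}


def categorize(task_text):
    """Return the first category whose terms appear in the text, else None."""
    text = task_text.lower()
    for category, terms in CATEGORY_TERMS.items():
        if any(term in text for term in terms):
            return category
    return None


def group_tasks(tasks):
    """Group task descriptions under the category each one maps to."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[categorize(task)].append(task)
    return dict(grouped)


# Hypothetical task postings that use different terms for the same need.
tasks = [
    "Need babysitting on Friday night",
    "Looking for day care help this week",
    "Housekeeping for a two-bedroom apartment",
]
grouped = group_tasks(tasks)
```

Here the two postings that say “babysitting” and “day care” end up under the same category, which is the behavior the article attributes to the trained system. A real system would presumably derive the term groupings from data rather than from a fixed table.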
“The goal is to match these task posters with the task runners as efficiently as possible based on the content of that task,” says Tam. But he also sees potential for products in many other verticals to benefit from recommending or matching content based on digital text – online publishing among them, of course. What makes it different from other semantically-oriented attempts to do the same, Tam says, is that “our whole existence is built on the concept of marrying lexicography with computational linguistics.” Wordnik has captured billions of words of English over its lifetime to feed its Word Graph, and has developed analytical algorithms so that it can do very strong recommendations and matching without a large training set. “So the Graph itself is one of our strongest tools in our toolbox,” he says. “We are taking a very different approach as far as how we can apply user behavior and content from the digital text on top of each other. We may be looking at similar problems but our approach is radically different.”
One of the important capabilities around its Word Graph is accounting for how dynamic language is – even when meaning seems undercut by cacography (yes, it is too a word) or something else. Tam estimates that roughly 200 words are created in the online digital set every day – perhaps unintentionally because of a misspelling, perhaps thanks to a new Twitter hashtag, or maybe in response to something taking place in society – the branding of Charlie Sheen as a “shenius,” for instance, Tam offers. “That went into our Word Graph and kicked into our algorithms,” he says. “So when you analyze text with current events and words, if you don’t know that relationship, the ability to do real processing is severely hampered. That is really core about what Wordnik is doing and is essential in building out a graph of words so we can make those associations. It’s not fair for me to say I can match your content only as long as it’s perfect. By taking text shorthand, misspellings, Twitter hashtags and so on, and translating those into something that can be understood, now we can do real analysis on text.”
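To make the idea of translating shorthand, misspellings and hashtags concrete, here is a minimal sketch using only Python's standard library. It is not how Wordnik does it: the `VOCABULARY` and `SHORTHAND` tables and the `normalize_token` and `normalize` functions are hypothetical, and simple string similarity (`difflib.get_close_matches`) stands in for whatever the Word Graph algorithms actually do.

```python
import difflib
import re

# Hypothetical known vocabulary and shorthand table, for illustration only.
VOCABULARY = ["babysitting", "child", "care", "cleaning", "delivery"]
SHORTHAND = {"pls": "please", "thx": "thanks", "asap": "as soon as possible"}


def normalize_token(token):
    """Map a hashtag, shorthand, or misspelling to a known word if possible."""
    word = token.lower().lstrip("#")  # drop the hashtag marker
    if word in SHORTHAND:
        return SHORTHAND[word]
    if word in VOCABULARY:
        return word
    # Fall back to the closest vocabulary word by string similarity.
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    # Unknown words (new coinages, for example) are kept as they are.
    return matches[0] if matches else word


def normalize(text):
    """Normalize every word-like token in a piece of text."""
    tokens = re.findall(r"#?\w+", text)
    return " ".join(normalize_token(t) for t in tokens)


print(normalize("Need babysiting thx"))
```

Note that a coinage such as “shenius” passes through unchanged in this sketch, since nothing in the small vocabulary resembles it. The article's point is that a system should go further and learn what such new words relate to, which plain string matching cannot do.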
Wordnik also plans, in the near future, to open source its infrastructure to help solve real-world problems for API developers. Much of this will reflect the scaling expertise that Wordnik has been building, since it has had to deal with documents that are millions of words long and that need to be processed efficiently in real time. Tam’s own background is in that area, including expertise in federated data query technologies, and Wordnik aims to make the scale big enough that the graph can be used at run time for tens of millions of nodes and millions of edges. He notes that Wordnik is one of the larger known instances of MongoDB and uses the Scala programming language, which runs on the Java Virtual Machine platform, and that it also leverages cloud computing for locality requirements.