“New Word Graph API Takes Wordnik From Fun and Funky Apps to Some Serious Business Services”

from the post at Semantic Web:

You may know Wordnik from subscribing to its Word of the Day service (by the way, today that word is eloign). Or perhaps you know it from some of the apps that have used its API – such as Freebase WordNet Explorer, or one of the many mobile ones that let users access direct features of the system through their smart phones.

Now comes something new on the API front: Word Graph is the latest result of some three years of algorithm development around analyzing the digital text that Wordnik has collected from partners, to understand the relationship between words in order to derive meaning. Word Graph matches content based on digital text from partners who need to understand more of what their content says and is, and to help them and their services make decisions based on that understanding.

In that respect, it’s taking Wordnik’s API services closer to meeting business requirements, rather than driving the neat B-to-C apps, from crossword puzzles to jumble games to pronunciation voice services, where its APIs have mostly been employed so far.

The first partner to use the API is TaskRabbit, an online service that matches task creators (e.g. someone who needs child care) with task runners (e.g. babysitters). Before integrating the API into its business logic tier, the key to-dos of the service were all manually accomplished, says Tony Tam, co-founder and vp of engineering at Wordnik. Submitted tasks, for instance, would need to be manually categorized, but now the system has been trained, based on TaskRabbit content, to appropriately treat terms from its domains. That is, for example, to understand that babysitting and child care are roughly the same thing, and to automatically categorize together the tasks submitted with the various terms. Now it knows to show task runners who perform those services tasks that used either term; in fact, it can find those task runners who’ve done a certain type of job (whether it’s called babysitting, child care, mother’s helper, day care, and so on) multiple times, and tell them about new tasks in the same vein. Similarly, for task posters, the API is used to match relevant tasks, so that they can quickly see how others with categorically similar requirements have posted their tasks, what they’re offering as fees, and possibly revise their own job postings to be better and more competitive matches.
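To make the categorization idea concrete, here is a minimal Python sketch of grouping tasks whose wording differs but whose meaning is roughly the same. It is only an illustration: the `SYNONYM_GROUPS` table, the category names, and the `categorize` and `group_tasks` functions are all hypothetical, and a hand-written lookup stands in for the word relationships that, per the article, come from Word Graph after training on TaskRabbit content. It does not use or reflect Wordnik’s actual API.

```python
# Hypothetical synonym groups. In the article these relationships are learned
# by Word Graph; here they are a hand-written stand-in for illustration only.
SYNONYM_GROUPS = {
    "child care": {"babysitting", "child care", "mother's helper", "day care"},
    "cleaning": {"house cleaning", "housekeeping", "maid service"},
}


def categorize(task_text):
    """Return the first category with a term found in the task text, else None."""
    text = task_text.lower()
    for category, terms in SYNONYM_GROUPS.items():
        if any(term in text for term in terms):
            return category
    return None


def group_tasks(tasks):
    """Group task descriptions by category so similar postings sit together."""
    groups = {}
    for task in tasks:
        groups.setdefault(categorize(task), []).append(task)
    return groups
```

With this table, `categorize("Need babysitting on Friday night")` and `categorize("Looking for a mother's helper")` both land in the same hypothetical "child care" category, which is the behavior the article describes for matching runners and posters.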

“The goal is to match these task posters with the task runners as efficiently as possible based on the content of that task,” says Tam. But he sees potential for products in many different verticals to benefit from recommending or matching content based on digital text – online publishing among them, of course. What makes it different from other semantically-oriented attempts to do the same, Tam says, is that “our whole existence is built on the concept of marrying lexicography with computational linguistics.” Wordnik has captured billions of words of English over its lifetime to feed its Word Graph word-relationship graph, and has developed analytical algorithms so that it can make very strong recommendations and matches without a large training set. “So the Graph itself is one of our strongest tools in our toolbox,” he says. “We are taking a very different approach as far as how we can apply user behavior and content from the digital text on top of each other. We may be looking at similar problems but our approach is radically different.”

One of the important capabilities around its Word Graph is accounting for how dynamic language is – even when meaning seems undercut by cacography (yes, it is too a word) or something else. Tam estimates that roughly 200 words are created in the online digital set every day – perhaps unintentionally because of a misspelling, perhaps thanks to a new Twitter hashtag, or maybe in response to something taking place in society – the branding of Charlie Sheen as a “shenius,” for instance, Tam offers. “That went into our Word Graph and kicked into our algorithms,” he says. “So when you analyze text with current events and words, if you don’t know that relationship, the ability to do real processing is severely hampered. That is really core about what Wordnik is doing and is essential in building out a graph of words so we can make those associations. It’s not fair for me to say I can match your content only as long as it’s perfect. By taking text shorthand, misspellings, Twitter hashtags and so on, and translate those into something that can be understood, now we can do real analysis on text.”
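The translation step Tam describes, turning shorthand, misspellings, and hashtags into something analyzable, can be sketched roughly as follows. This is an assumption-laden illustration, not Wordnik’s method: the `VOCABULARY` and `SHORTHAND` tables and both function names are hypothetical, and simple fuzzy string matching from Python’s standard `difflib` module is swapped in for whatever Word Graph actually does.

```python
import difflib
import re

# Hypothetical known vocabulary and shorthand table, for illustration only.
VOCABULARY = ["babysitting", "child", "care", "cleaning", "delivery"]
SHORTHAND = {"asap": "as soon as possible", "thx": "thanks"}


def normalize_token(token):
    """Map a raw token to a known term where possible, else return it unchanged."""
    token = token.lower().lstrip("#")  # treat a hashtag as a plain word
    if token in SHORTHAND:
        return SHORTHAND[token]
    if token in VOCABULARY:
        return token
    # Fuzzy match catches near-miss spellings; the 0.8 cutoff is an arbitrary choice.
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


def normalize(text):
    """Normalize every word or hashtag in the text and rejoin with spaces."""
    tokens = re.findall(r"#?\w+", text)
    return " ".join(normalize_token(t) for t in tokens)
```

A genuinely new coinage such as “shenius” passes through unchanged in this sketch, which is exactly the gap the article says a continuously updated word graph is meant to close.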

What Wordnik is also doing in the next little while is open sourcing its infrastructure to help solve real-world problems for API developers. Much of this will reflect the scaling expertise that Wordnik has been building, having had to deal with documents that are millions of words long and that need to be processed efficiently in real time. Tam’s own background is in that area, including expertise in federated data query technologies, and Wordnik aims at making the scale big enough so it can be used at run time for tens of millions of nodes and millions of edges. He notes that Wordnik is one of the larger known instances of MongoDB and uses the Scala programming language, which runs on the Java Virtual Machine platform, and it also leverages cloud computing for locality requirements.

All-Star Words

In honor of tomorrow’s All-Star Game, we’re happy to announce a new word-of-the-day list from Paul Dickson, the author of The Dickson Baseball Dictionary, Third Edition (and more than fifty other books)!

Why are baseball words so hugely entertaining (even if you’re not a hard-core baseball fan)? Paul puts it best:

Baseball is a metaphoric circus.

The game has a particular infatuation with what one critic of sportswriters termed “the incorrect use of correct words.” There are hundreds of examples, but the point can be made by simply listing a selection of synonyms for the hard-hit ball or line drive. It is variously known as an aspirin, a BB, a bolt, a clothesline, a frozen rope, a pea, a rocket, and a seed. A player’s throwing arm seems to be called everything but an arm: gun, hose, rifle, soupbone, whip, and wing, to name just a few. The arm is not the only renamed body part. From top to bottom, players have lamps (eyes), a pipe (neck or throat), hooks (hands), wheels (legs), and tires (feet).
So many allusions are made to food and dining, including pitches that seem to fall off the table, that a fairly well-balanced diet suggests itself in terms like can of corn, cup of coffee, fish cakes, banana stalk, mustard, pretzel, rhubarb, green pea, juice, meat hand, grapefruit league, and tater. Among the many terms for the ball itself are apple, cantaloupe, egg, lemon, orange, pea, potato, and tomato. Implements? There is the plate (also known as the platter, pan, and dish) and, of course, the forkball. Dessert? The red abrasion from a slide into base is a strawberry and the fan’s time-honored sound of disapproval is a raspberry.
The game proudly displays its rustic roots and there is a tone to the language of the game that is remarkably pastoral. If any imagery dominates, it is that of rural America. Even under a dome, it is a game of fields and fences, where ducks [sit] on the pond and pitchers sit in the catbird seat. New players come out of the farm system and a farm hand who pitches may get to work in the bullpen.

You can sign up to receive Paul’s baseball words here; and if you have a Nook Color, we’ll be featuring baseball words this whole week via the Wordnik Nook Word of the Day app! [Want a hard copy of the Dictionary? You can find one at any of these fine booksellers.]

We hope you enjoy exploring this rich slice of English … and if you’ll be in Pasadena, CA this Sunday, the 17th of July, you can also see Paul Dickson at the Baseball Reliquary, where he will be presented with the 2011 Tony Salin Memorial Award for his commitment to the preservation of baseball history.

This Week’s Language Blog Roundup

It’s that time once again when we bring you the highlights from our favorite language blogs and the latest in word news.

The Economist’s language blog Johnson rang in the Fourth with American accents and Accigone, the accent eradicator, while the Dialect Blog provided British accent samples instead.  At Language Log, Mark Liberman took a look at things that aren’t what they are, namely Google’s recent bids for Nortel patents (“pi” and “the distance between the earth and the sun” are just a couple of examples); some verbal illusions (no one is too busy to read this post, right?); and some variations on the French oh la la.

Language Corner at the Columbia Journalism Review took issue with using words such as gonna and wanna to convey dialect, while The Economist explored the diverse world of voiceovers and dubbing in the Arab film industry, from “Syrian musalsalaat, or soap operas,” to Gulf Arabic for “dramas from India and its neighbours.”

In endangered languages, it appears that as elders die off, fluency in Maori is diminishing, even as the number of Maori speakers increases, while according to K International, indigenous people in Oaxaca, Mexico, are rethinking their strategy in maintaining their language.

K International also took a look at how one foundation is using technology to preserve languages, as well as some unlikely language preservationists – teenagers, namely those in southern Chile who have been “posting videos on YouTube of themselves rapping in a mixture of Spanish and Huilliche, an indigenous language with only about 2,000 speakers,” as well as teens texting in regional and indigenous languages in the Philippines (as mentioned in our last post) and Mexico.  Another online project gives a home to dying languages, while social networking may give Welsh a new lease on life.

Johnson also mused on color naming, while Lynneguist at Separated By a Common Language discussed making suggestions in different cultures.  Arnold Zwicky had fun with telephon- combining words; some porn-manteaus; and mishearing Navy SEALs as baby seals.  Headsup: The Blog asserted that serve and serve up cannot be used interchangeably, at least where people are concerned.

The Virtual Linguist blogged about naturists’ – or nudists’ – slang (for instance, “cotton-tails. . .are people with white bottoms ie non-naturists, or, at the very least, recent converts to naturism”); a several-hundred-year-old term for prostitute; and a couple of slang terms for money.  And the Dialect Blog recounted the evolution of the word douchebag.

This week we also learned of a chimp who recognizes synthetic speech; a scholar who is studying how the concept of time differs across languages; and that the prolific British Library is building a database of Britain’s most obscure words. Some of our favorites?  Dimpsy, “half light, just turning dark,” gurtlush, “the best,” and tittermatorter, “seesaw”.  We also found out that the third edition of The Encyclopedia of Science Fiction will be available later this year online for free, and then our heads exploded with excitement.

That’s it from here! Till next week, adios, au revoir, auf Wiedersehen!

Summer Watching

Last month we brought you some summer reading recommendations. Today we have some movies and videos that might be word-nerd-worthy.

“A group of ivory-tower lexicographers realize they need to hear how real people talk, and end up helping a beautiful singer escape from the Mob.”  Did you know such a movie existed?  Is it too good to be true?  It’s not. Ball of Fire stars Gary Cooper as that ivory-tower lexicographer, Professor Bertram Potts, and Barbara Stanwyck, in an Oscar-nominated role, as Sugarpuss O’Shea, that saucy nightclub singer. They meet while the professor is researching an article on slang, and as expected, end up falling for each other.

Love the Scripps National Spelling Bee? Chances are you already love Spellbound, a documentary which follows eight competitors in the 1999 Scripps National Spelling Bee. You may also want to check out The Girl Who Spelled Freedom, a 1986 made-for-TV movie based on the true story of Linn Yann, a Cambodian refugee who survived the Khmer Rouge labor camps and immigrated with her family to Chattanooga, Tennessee.  Four years later, “she won the countywide 1983 Chattanooga Times Spelling Bee,” before making it to the 1985 national finals of the Scripps Bee in Washington, DC.

Also available for your spelling-viewing pleasure are Akeelah and the Bee, about “a young girl from South Los Angeles [who] tries to make it to the National Spelling Bee,” and Bee Season, based on Myla Goldberg’s novel about “a wife and mother [who] begins a downward emotional spiral, as her husband avoids their collapsing marriage by immersing himself in his 11 year-old daughter’s quest to become a spelling bee champion.”

For you crossword-puzzle addicts, there’s Wordplay, which focuses on four crossword puzzle solvers competing in the American Crossword Puzzle Tournament, features Will Shortz, the editor of The New York Times crossword puzzle, and includes celebrity crossword-geek “confessions” from the likes of Jon Stewart, Bill Clinton, Bob Dole, and documentarian Ken Burns. If Scrabble is more to your taste, try Word Wars, which explores the world of competitive Scrabble playing, following “four players in the nine months leading up to the 2002 National Scrabble Championship.”

For you history buffs, there’s The Story of English, a nine-part series that appeared on PBS in the mid-1980s, and that includes such episodes as An English Speaking World, The Guid Scots Tongue, and Muvver Tongue.  Also be sure to check out the perfectly delightful History of English in 10 Minutes from Open University.

Want a walk down memory lane? Sing along with Schoolhouse Rock’s Grammar Rock, in particular Conjunction Junction (“What’s your function?”) and Lolly Lolly Lolly Get Your Adverbs Here.

Know of more word-nerd-friendly movies and videos?  Share them in the comments!  Till then, here’s to happy (air-conditioned) summer watching.

This Week’s Language Blog Roundup

Happy Friday before Fourth of July! It’s time again for another Language Blog Roundup, in which we bring you the highlights from our favorite language blogs and the latest in word news.

There was much hubbub in the Twitterverse this week over the loss of the Oxford comma, as stated in the University of Oxford’s style guide. However, it was soon determined that the Oxford comma wasn’t dead after all, and that the “only explicit permission to dispense with the Oxford comma. . .was in a guide for university staff on writing press releases and internal communications.” Whew! We’re calm, cool, and collected now.

In Shakespeare news, a group of scientists got the green light from the Church of England to exhume “the Bard of Avon’s remains to determine the cause of his death and, among other things, if the playwright had traces of pot pumping through his system.”  Meanwhile in politics, Vanity Fair deconstructed Michele Bachmann’s favorite metaphor, the three-legged stool.

Erin Gloria Ryan over at Jezebel wrote about her love affair with peppering her speech with “like” while Mark Liberman at Language Log questioned Ryan’s proposal that women may use “like” more often than men, and jokingly devised a possible solution, the iPeeve, an imaginary app that is “a speech recognizer with a style checker [that] will make [your smartphone] vibrate (or beep, or flash) whenever you indulge in any of the verbal tics that you’ve asked it to watch out for.”

In neologism news, The Economist’s language blog, Johnson, noticed incent, the verb form of incentive, while Stan Carey mused over preloved euphemisms.  Word Spy spotted omega male, “the man who is least likely to take on a dominant role in a social or professional situation”; teacup, “a college student with a fragile, easily shattered psyche”; and filter bubble, “search results, recommendations, and other online data that have been filtered to match your interests, thus preventing you from seeing data outside of those interests.”

But the unmapped word of the week, in our humble opinion, was humblebrag (brought to our attention by @mcintyrekm), a “type of bragging which masks the brag in a faux-humble guise.”

The Virtual Linguist took a look at “once every preston guild,” a Lancashire expression meaning “very rarely”; down, meaning “an area of high land”; a now-troubling word that once simply meant a “bundle of sticks”; and The Daily Mail’s taking Kate Middleton to task for using ’till instead of ’til in her wedding thank-you cards.  The Dialect Blog discussed the ever elusive English schwa; David Marsh at The Guardian demanded the termination of “railspeak”; and LeVar Burton is apparently “actively plotting” a “Rainbow Reading flashmob.”  Empirical Zeal blogged about dissecting the language of songbirds, while Buzzfeed cited a very British headline that is positively for the birds.

In library news, the Internet Archive announced that their eBook lending program has expanded to 1,000 libraries in six countries. Congrats! Meanwhile, George Mason University is busy archiving the world’s English accents.

The Book Haven at Stanford University profiled the exiled Chinese poet, Bei Dao, who stated that “each language keeps the secret code of a culture,” which may be just another reason to preserve endangered and disappearing languages such as Calo, “spoken by Romani people, sometimes referred to as Gypsies, in Spain”; Ayapanec, “which is thought to have descended from a language spoken by the Olmecs, a pre-Columbian civilization”; Eyak, “once spoken by a native tribe in Alaska”; and Kapampangan, a regional language in the Philippines, which may be preserved by “teenagers [who] think it’s ‘cool’ to send mobile phone text messages” in that language.

In videos of the week, check out this one from the Getty Museum about the structure of a medieval manuscript, and this thoroughly entertaining 10-minute history of the English language from Open University, brought to our attention by @MisterVerb via @Fritinancy.  Don’t have 10 minutes? Then at least check out chapter 7, the Age of the Dictionary.

Finally, happy Canada Day to our friends north of the border!  One Canadian living in Wales grieved the loss of her Canadian English, while our list of the day celebrates Canadianisms, and this one and this one honor Canadian places.

Till (or til?) next week, stay wordy my friends!