Improved Search: Wildcards and Lists

We’ve rolled out a handful of improvements to word search. There’s more on the way, but here’s a quick overview of two new features: wildcard searches and list search.

These can be used from our recently-added search results pages, which you can get to either from the ‘See all results for’ link at the bottom of the autocomplete results when you search from any page, or by going to http://www.wordnik.com/search directly.

The * wildcard matches any number of characters:
http://www.wordnik.com/search/*tacular

? matches any single character:
http://www.wordnik.com/search/f?t

Or you can limit single-character wildcards to just vowels or just consonants with @ and # respectively:
http://www.wordnik.com/search/f@rt
http://www.wordnik.com/search/#at
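If you’re curious how these wildcards behave, here’s a quick Python sketch of how such patterns map onto regular expressions. This is just an illustration of the matching rules above, not Wordnik’s actual implementation:

```python
import re

# Map the wildcard tokens described above onto regex fragments:
#   *  any run of characters      ?  any single character
#   @  any single vowel           #  any single consonant
TOKENS = {
    "*": ".*",
    "?": ".",
    "@": "[aeiou]",
    "#": "[bcdfghjklmnpqrstvwxyz]",
}

def wildcard_to_regex(pattern):
    """Compile a wildcard pattern into an anchored, case-insensitive regex."""
    body = "".join(TOKENS.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile("^" + body + "$", re.IGNORECASE)

words = ["spectacular", "fat", "fit", "fart", "cat", "chat"]
print([w for w in words if wildcard_to_regex("*tacular").match(w)])  # ['spectacular']
print([w for w in words if wildcard_to_regex("f?t").match(w)])       # ['fat', 'fit']
print([w for w in words if wildcard_to_regex("f@rt").match(w)])      # ['fart']
print([w for w in words if wildcard_to_regex("#at").match(w)])       # ['fat', 'cat']
```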

Searching without wildcards returns results similar to what you see from autocomplete, but includes results from lists, tags, and related words:
http://www.wordnik.com/search/cat

Or you can specifically focus on lists and see more results:
http://www.wordnik.com/search/lists/cat

Upcoming releases will allow regex-style searches and let you search other kinds of Wordnik content. If you’d like to see other search-related features, or have suggestions for how these should work, please let us know in the comments or through feedback@wordnik.com.

On (Y)our Mark …

On Your Mark!

Photo by, and licensed (CC BY-NC-ND 2.0) from, chicagoceli.

We are overjoyed to welcome Mark Wong-VanHaren as Wordnik’s new VP of R&D. He’ll be working with us to turn billions of words of delicious language data into cool products and tools that help folks know more about, and get more enjoyment from, English.

Mark was one of the original founders of Excite, which was arguably the first *real* search engine. In addition to being a world-renowned technologist, he casually types around 130 WPM on one of those crazy Kinesis keyboards.

Here’s Mark in his own words:

Mark Wong-VanHaren loves language. He has studied a half-dozen natural languages, and coded in far more programming ones. With friends from Stanford, he co-founded the pioneering search engine Excite, putting his Symbolic Systems skills to good use. He then worked with a handful of other start-ups, in addition to being an EIR at Charles River Ventures, before recently serving as CTO of Glyde, an e-commerce company. Mark loves Afro-Cuban music, playing hockey (both the ice and table varieties), functional programming, the Green Bay Packers, hiking with his family, and Almodóvar flicks.

Wordnik and Blekko!

How we feel about helping Blekko searchers

We feel downright ebullient to be powering the “/define” slashtag for the awesome guys at Blekko!

Slashtags are our favorite thing about Blekko — you can use one (like /define or /language) to limit your search to just the slice of sources most likely to give you the answer you need. You can make your own or use the preset slashtags that Blekko provides.

Blekko’s lots of fun — but even more useful. Try it, we think you’ll like it!

12 Months with MongoDB

Happy Monday everyone!

As previously blogged, Wordnik is a heavy user of 10gen’s MongoDB. One year ago today we started the investigation to find an alternative to MySQL to store, find, and retrieve our corpus data. After months of experimentation in the non-relational landscape (and running a scary number of nightly builds), we settled on MongoDB. To mark the one-year anniversary of what ended up being a great move for Wordnik, I’ll summarize how the migration has worked out for us.

Performance. The primary driver for migrating to MongoDB was performance. We had issues with MySQL for both storage and retrieval, and both were alleviated by MongoDB. Some statistics:

  • Mongo serves an average of 500k requests/hour for us (that does include nights and weekends). We typically see 4x that during peak hours
  • We have > 12 billion documents in Mongo
  • Our storage is ~3TB per node
  • We easily sustain an insert speed of 8k documents/second, often bursting to 50k/sec
  • A single java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
  • Every type of retrieval has become significantly faster than our MySQL implementation:
    – example fetch time reduced from 400ms to 60ms
    – dictionary entries from 20ms to 1ms
    – document metadata from 30ms to 0.1ms
    – spelling suggestions from 10ms to 1.2ms

One wonderful benefit of Mongo’s built-in caching is that removing our memcached layer actually sped up calls by 1-2ms/call under load. This also frees up many GB of RAM. We clearly cannot fit all our corpus data in RAM, so the 60ms average for examples includes disk access.

Flexibility. We’ve been able to add a lot of flexibility to our system, since we can now efficiently execute queries against attributes deep in the object graph. You’d need to design a really ugly schema to do this in MySQL (although it can be done). Best of all, by essentially building indexes on object attributes, these queries are blazingly fast.
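To make that concrete, here’s a minimal sketch with the Python driver; the database, collection, and field names are invented for illustration and aren’t our real schema:

```python
from pymongo import MongoClient, ASCENDING

db = MongoClient("localhost", 27017).corpus  # hypothetical database name

# Index an attribute deep in the object graph...
db.sentences.create_index([("source.publication.year", ASCENDING)])

# ...then query it directly with dot notation. In MySQL this would mean
# joins or a denormalized column; here the query hits the index directly.
for doc in db.sentences.find({"source.publication.year": 2010}).limit(5):
    print(doc["_id"])
```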

Other benefits:

  • We now store our audio files in MongoDB’s GridFS. Previously we used a clustered file system so files could be read and written from multiple servers. This created a huge amount of complexity from the IT operations point of view, and it meant that system backups (database + audio data) could get out of sync. Now that they’re in Mongo, we can reach them anywhere in the data center with the same Mongo driver, and backups are consistent across the system.
  • Capped collections. We keep trend data inside capped collections, which have been wonderful for keeping datasets from unbounded growth. (A short sketch of both patterns follows this list.)
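Here’s a minimal sketch of both patterns with the Python driver (the names, file, and sizes are hypothetical):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient("localhost", 27017).corpus  # hypothetical database name

# GridFS: store and fetch an audio file with the same driver used for
# everything else, so backups cover files and data together.
fs = gridfs.GridFS(db)
with open("pronunciation.mp3", "rb") as f:
    file_id = fs.put(f, filename="pronunciation.mp3")
audio_bytes = fs.get(file_id).read()

# Capped collection: a fixed-size circular buffer, so trend data
# can never grow without bound.
if "trends" not in db.list_collection_names():
    db.create_collection("trends", capped=True, size=512 * 1024 * 1024)
db.trends.insert_one({"word": "cat", "lookups": 42})
```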
Reliability. Of course, storing all your critical data in a relatively new technology has its risks. So far, we’ve done well from a reliability standpoint. Since April, we’ve had to restart Mongo twice. The first restart was to apply a patch on 1.4.2 (we’re currently running 1.4.4) to address some replication issues. The second was due to an outage in our data center. More on that in a bit.

Maintainability. This is one challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL. There is a blurry hand-off between engineering and IT Operations for this product, which is something worth noting. Luckily for all of us, there are plenty of hooks in Mongo to allow for good tools to be built, and without a doubt there will be a number of great applications to help manage Mongo.

The size of our database has required us to build some tools for helping to maintain Mongo, which I’ll be talking about at MongoSV in December. The bottom line is yes–you can run and maintain MongoDB, but it is important to understand the relationship between your server and your data.

The outage we had in our data center caused a major panic. We lost our DAS device during heavy writes to the server, which caused corruption on both master and slave nodes. The master was busy flushing data to disk while the slave was applying operations via the oplog. When the DAS came back online, we had to run a repair on our master node, which took over 24 hours. The slave was compromised yet operable, so we were able to promote it to master while repairing the other system.

Restoring from tape was an option, but keep in mind that even a fast tape drive takes a long time to restore 3TB of data, to say nothing of losing whatever was written between the last backup and the outage. Luckily we didn’t have to go down this path. We also had an in-house incremental backup + point-in-time recovery tool, which we’ll be making open source before MongoSV.

Of course, there have been a few surprises in this process, and some good lessons to share.

Data size. At the MongoSF conference in April, I whined about the 4x disk space requirements of MongoDB. Later, the 10gen folks pointed out how collection-level padding works in Mongo; for our scenario (hundreds of collections with an average of 1GB of padding per collection), we were wasting a ton of disk on this alone. We were also able to embed a number of objects in subdocuments and drop indexes, which got our storage costs under control: now only about 1.5-2x that of our former MySQL deployment.
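As a toy illustration of the embedding change (again, the schema here is invented for the example, not our real one):

```python
from pymongo import MongoClient

db = MongoClient("localhost", 27017).corpus  # hypothetical database name

# Before: small related objects in their own collections, linked by id,
# each carrying its own padding and its own index.
db.words.insert_one({"_id": "cat", "part_of_speech": "noun"})
db.pronunciations.insert_one({"word_id": "cat", "ipa": "kat"})

# After: embed objects that are always fetched together as subdocuments.
# One collection, one fetch, and no separate word_id index to store.
db.words.replace_one(
    {"_id": "cat"},
    {"_id": "cat", "part_of_speech": "noun", "pronunciations": [{"ipa": "kat"}]},
    upsert=True,
)
```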

Locking. There are operations that will lock MongoDB at the database level. When you’re serving hundreds of requests a second, this can cause requests to pile up and create lots of problems. We’ve made the following optimizations to avoid locking:

  • If updating a record, we always query the record before issuing the update. That gets the object into RAM, so the update operates as fast as possible. (There’s a sketch of this pattern after this list.) The same logic exists for master/slave deployments, where the slave can be run with the --pretouch flag, which causes a query on the object before the update is applied.
  • Multiple mongod processes. We have split up our database to run in multiple processes based on access patterns.
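Here’s what the read-before-write warm-up looks like with the Python driver (collection and field names hypothetical):

```python
from pymongo import MongoClient

coll = MongoClient("localhost", 27017).corpus.words  # hypothetical collection

def warm_update(word, new_count):
    # Touch the document first so it is pulled into RAM; the write lock
    # is then held only for the in-memory update, not for a page fault.
    coll.find_one({"_id": word})
    coll.update_one({"_id": word}, {"$set": {"lookup_count": new_count}})

warm_update("cat", 1234)
```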
In summary, life with MongoDB has been good for Wordnik. Our code is faster, more flexible, and dramatically smaller. We can code up tools to help out on the administrative side until other options surface.

Hope this has been informative and entertaining. You can always see MongoDB in action via our public API.
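For example, a request along these lines is served straight out of Mongo (see developer.wordnik.com for the exact endpoints and response fields; YOUR_API_KEY is a placeholder):

```python
import json
from urllib.request import urlopen

# Fetch example sentences for "cat" from Wordnik's public API.
url = ("https://api.wordnik.com/v4/word.json/cat/examples"
       "?limit=3&api_key=YOUR_API_KEY")
with urlopen(url) as resp:
    data = json.load(resp)

for example in data.get("examples", []):
    print(example.get("text"))
```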

Tony

Wordnik is Hiring! Know Ruby on Rails?

help wanted

[image by, and licensed from, Sekimura]

We are seeking a Senior Ruby on Rails Developer to help build Wordnik.com and deliver Wordnik functionality to our partners and developer community for both desktop and mobile applications.

Wordnik builds and supports Wordnik.com as well as a public RESTful API which serves millions of requests a day. We run noSQL databases, analyze data with Hadoop, and use a variety of programming languages, including Java, Python, Ruby, and Scala. Wordnik is an open-source-friendly environment and we encourage contributions to collaborative projects.

Required Experience:
* Strong development background with 3+ years of Ruby experience
* Ability to conceive and develop modern, usable UI/UX designs
* Familiarity with modern HTML/CSS design and frameworks
* Strong jQuery experience
* Integration with REST services and external web services
* Ruby performance tuning

You’ll do even better with:
* Mobile device development (iOS or Android)
* Integration with Facebook Connect, OAuth
* Experience in a fast-paced startup environment

We’d love to see:
* your GitHub account
* open source participation

We:
* Are doing something unique and meaningful
* Have interesting problems to solve
* Are backed by top-tier investors

More info:

The beginnings of Wordnik: Erin McKean’s 2007 TED talk.

More about Wordnik’s noSQL initiatives.

Wordnik is an equal-opportunity employer and we are committed to diversity in hiring.

Interested? Email us at feedback@wordnik.com.

biNu: Wordnik on almost any mobile phone

For all the Sturm und Drang about smartphones, most people still have what are called ‘feature phones.’ Feature phones are simpler than smartphones, but many can still run basic apps.

biNu is a company specializing in this enormous if little-heralded market, and they’ve used the Wordnik API to build a dictionary and translation app optimized for basic phones. It’s been downloaded almost 300,000 times in the month since it launched, making it both biNu’s most popular app and, other than Wordnik.com itself, one of the larger sources of traffic to Wordnik’s API.

We’re super excited to see Wordnik made available across an enormous array of devices and to a worldwide audience who might not have easy access to the web. If you have a feature phone, give it a shot and let us know what you think.

A Heartrending Moment: Orthoepy and The OED

This month marks a regrettable turn of events in orthoepic history – the meaning of orthoepy changed in the ongoing online edition of the Oxford English Dictionary (OED). The two earlier print editions (1933, 1989) defined orthoepy as “correct, accepted, or customary pronunciation.” The “draft revision” of September 2010 shortens that, brutally, to “accepted or customary pronunciation.”

Excising the word correct probably gave the editor who did it a frisson, but it cut the very heart out of this venerable word. The ortho- in orthoepy comes from the Greek orthos, “right, correct,” and “correct pronunciation, or the study of correct pronunciation” has been the core meaning of orthoepy since the earliest English orthoepists compiled their dictionaries of pronunciation in the 18th century. Indeed, the expunging of correct from the online OED’s definition of orthoepy would suggest that there’s nothing, or should be nothing, normative about pronunciation. Yet, curiously, the September 2010 online draft revision does not alter the original definition of orthography: “correct or proper spelling.”

How is it that spelling can be correct or incorrect but pronunciation now cannot? When the OED’s editors get around to revising the entry for cacoepy, currently defined as “bad or erroneous pronunciation; opposed to orthoepy,” will they dilute it to “unaccepted or unusual pronunciation”?

While it’s the proper business of modern descriptive dictionaries to record accepted or customary pronunciations, it’s the proper business of orthoepists to examine what is accepted or customary and opine on what passes muster and what does not. Sometimes what has been accepted by some is objectionable to others: for example, neesh for niche, zoo-ology for zoology, the prissy s instead of the traditional sh in negotiate.

And sometimes what is customary for certain speakers strikes others as slovenly: for example, nucular for nuclear, pronounciation for pronunciation, liberry for library.

Modern dictionaries profess to record pronunciations used by “educated speakers” (if I only had a nickel for every time I’ve heard an “educated” speaker mispronounce a word!) but that’s a deceptively broad category. It comprises anyone who possesses the credentials of an education, from a high school diploma to a Ph.D., and within it there is substantial variation. To the educated person who aspires to be a careful speaker — one whose pronunciation has been arrived at not by imitation, affectation, or conjecture but by careful consideration and prudent choice — a list of pronunciations used by educated speakers is of little help. It conveys only how the word has been spoken, not how it might best be spoken. That is where the orthoepist comes in: as an interpreter and arbiter of correct and cultivated speech.

Standards change over time, of course, but what abides is the natural and admirable human desire to speak in a way that will not attract undue notice or derision. As traditional pronunciations fall into disuse, faddish variants surge to prominence, and the forces of ignorance and pomposity vie for recognition, the orthoepist draws a bold line in the sand and tries, as the English elocutionist John Walker said in his Critical Pronouncing Dictionary of 1791, “to tempt the lovers of their language to incline to the side of propriety,” and “give such a display of the analogies of the language as may enable every inspector to decide for himself.”

In my next post, I will attempt to give you a capsule history of orthoepy, from Walker and his contemporaries to the present. Meanwhile, as always, I welcome your comments and your suggestions for pronunciations to record.