B is for Billion

Only 10 years ago, having a structured database with 100 million records in it was quite a feat. Today Wordnik passed the 9-billion-record mark with the open-source MongoDB from 10gen. But a record in an object store is quite different from a row in a circa-1999 relational database.

Object-oriented programming concepts flew right by the RDBMS long ago. Inner joins, left/right outer joins, unions, and the rest have served us well, but how much of our data can we really model in a tabular fashion? Have you ever tried doing anything complicated in Excel with just ONE sheet?

MongoDB removes an enormous amount of friction from the development process. Records shouldn’t be limited to things like the standard “user” table, with first_name, last_name, email, etc. They should be able to hold more meaningful and conceptually deep data, like “the usage frequency of a word across all time” or “the graph of all relationships to a word”, concepts that are difficult to express in tabular form. By using a document-oriented database, we at Wordnik don’t need to nag a DBA to add a field or column (well, we’re a startup, so it’s more like nagging the guy sitting next to you). If we can model it in software, MongoDB can store it, simple as that. And if MongoDB can store it, we can not only get it back (very important) but *find* it with very rich and flexible queries.

Object-relational mapping (ORM) has been around about as long as OOP, but let’s face it: there is no ORM solution that (a) is flexible for the developer and (b) works in harmony with the storage system (i.e., performance doesn’t suck). MongoDB does both, easily, and it’s very, very fast.
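To make the document model concrete, here is a minimal sketch (not Wordnik’s actual schema) of storing a word’s per-year usage frequency as a single nested document and then querying it with pymongo. The database and collection names, and the numbers, are made up for illustration, and it assumes a local mongod is running.

```python
# Minimal sketch of the document model: one word, with its whole
# frequency history nested inside it, stored and queried as a unit.
# (Hypothetical schema and data; assumes a local mongod is running.)
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
words = client.example_db.words  # hypothetical database/collection names

words.insert_one({
    "word": "billion",
    "frequencies": [                      # no join table needed
        {"year": 1999, "count": 1200},
        {"year": 2009, "count": 4800},
    ],
})

# Rich, flexible query: words whose 2009 usage count exceeds 1000.
cursor = words.find({
    "frequencies": {"$elemMatch": {"year": 2009, "count": {"$gt": 1000}}}
})
for doc in cursor:
    print(doc["word"])
```

The point is that the entire “frequency of a word across all time” concept lives in one document, so adding new fields later is a code change, not a schema migration.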

So we hit 9 billion records, which is of course very exciting. Traffic to our public API keeps growing: MongoDB served 100M queries in the last week and didn’t break a sweat. And what’s most exciting is how rapidly this helps us develop new features, which we’ll be sneaking out over the next few weeks.

4 thoughts on “B is for Billion”

  1. I’d be very interested to know a couple of things about your MongoDB setup:

    1. Are you using sharding at all? If so, how many records can you handle per shard?
    2. Are you doing any map/reduce queries across all of those 9bn records? How do you deal with MongoDB’s poor map/reduce performance?

  2. Hi Alex, we use MongoDB for a number of different data structures. One structure is not sharded and has 800M records in it. Another (the biggest) is sharded at the application layer (see the sketch after this reply). Distribution across the shards is pretty good, but there are a couple of hot spots of ~100M records. The sharding was primarily to improve bulk insert performance, not seek performance.

    We don’t use MR within MongoDB. It is powerful but we are treating the system as storage & (deep object) retrieval. We use Lucene for analytics.

    We’ll be using auto-sharding after 1.6 comes out and we get enough testing on it.
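For readers curious what application-layer sharding can look like, here is a minimal sketch, not Wordnik’s actual code: the application hashes the shard key (a word, say) to pick one of several independent mongod instances. The host names, database, and collection are hypothetical.

```python
# Minimal sketch of application-layer sharding: hash the key to pick
# one of N independent mongod instances. Hosts and names are made up.
import hashlib
from pymongo import MongoClient

SHARD_HOSTS = ["mongo0.example.com:27017",
               "mongo1.example.com:27017",
               "mongo2.example.com:27017"]
clients = [MongoClient("mongodb://%s/" % host) for host in SHARD_HOSTS]

def shard_for(word):
    """Deterministically map a word to one of the shard connections."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return clients[int(digest, 16) % len(clients)]

def record_usage(word, year, count):
    # All records for a given word land on the same shard, so lookups
    # for that word only ever touch one mongod.
    coll = shard_for(word).example_db.frequencies
    coll.insert_one({"word": word, "year": year, "count": count})
```

Because each word maps to exactly one shard, bulk inserts can be spread across the instances in parallel, which matches the comment’s point that the split was primarily for insert throughput rather than seek performance.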
