Word frequency charts

This spring, our word statistics pages quietly improved. We now show the frequency of a word and how it has changed over the last 200 years. Our new graphs show word occurrences for each year in counts per-million-words-of-text, which, for most words, will be no more than a handful.

It’s neat to look at how some words have appeared over time (Internet, a fad which will never catch on) or disappeared (e.g. hansom, a two-wheeled horse-drawn carriage). Also neat to see are words that have changed their sense: icon has a new meaning in the late 21st century, and this remarkably changes its frequency (from 1-3 per million up to 10+ in the last fifteen years). (We note that not all statistics are entirely safe for work.)

Since our corpus varies in its density (we have far more text available for the last twenty years than we do in the 150 before that), each year’s frequency estimate is shown with a 95% confidence interval*. (Sometimes that gives us unusually spiky plots, because the sparse years offer relatively little information.)

In future releases, we’d like to compare two words on the same plot (compare apple to Apple) or explore other aspects of the words’ appearance.

What would you like to see?

* Our confidence intervals use the Agresti-Coull approximation, which is probably too generous in its upper bound, especially for rare words. We’d like to fix that by including Bayesian priors on word frequencies in a future release.
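
As an illustration only (a sketch, not the site's actual code), an Agresti-Coull interval for one word in one year can be computed in Python roughly as follows. The function name and the example counts are hypothetical; the only inputs assumed are the word's occurrence count and the total number of words of text for that year.

```python
import math


def agresti_coull_per_million(count, total_words, z=1.96):
    """Return (low, estimate, high) in occurrences per million words.

    count       -- occurrences of the word in one year's text (assumed input)
    total_words -- total words of text available for that year (assumed input)
    z           -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")

    # Agresti-Coull smoothing: add z^2 pseudo-observations, half of them hits.
    n_adj = total_words + z * z
    p_adj = (count + z * z / 2.0) / n_adj

    # Normal-approximation half-width around the smoothed proportion.
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    low = max(0.0, p_adj - half_width)
    high = min(1.0, p_adj + half_width)

    scale = 1_000_000
    return low * scale, (count / total_words) * scale, high * scale


# Two made-up years with the same raw frequency (3 per million)
# but very different amounts of text.
sparse_year = agresti_coull_per_million(3, 1_000_000)
dense_year = agresti_coull_per_million(300, 100_000_000)
print(sparse_year)
print(dense_year)
```

Both hypothetical years have the same raw frequency, but the sparse year gets a much wider interval, which is the effect behind the spiky plots mentioned above.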

See also previous post on word-frequency visualization.

23 thoughts on “Word frequency charts”

  1. Since you asked, haha, here are some things I’d like to see:

    1. Plot related words. Say, 5 most frequent ditransitive verbs or verbs related in WordNet (I think psycholinguists would like to get their hands on something like that).

    2. Plot collocates. What are the most significant collocates for each year. That would tell us something interesting about semantic shifts.

    Great work! Keep it up!

  2. Chris, thanks for your comments!

    It sounds like you’re really interested in corpus-level statistics, which are really neat. We’re developing more interesting statistics every chance we can get, and looking ahead to more and more of them.

    I’m curious how you are thinking about “significant” for “most significant collocates” — as a statistician, I have one sense in mind, and as a lexicographer and linguist, I have another, somewhat less rigorous sense — care to clarify?

  3. You might consider something like box-and-whisker plots for the graphical display. Your desire to give an idea of the quality of the frequency estimate for any given word and year is commendable (truly – major kudos!), but confidence intervals might not be the most natural approach, for all the usual difficulties in the proper interpretation, as well as some conceptual difficulties about the sampling distribution (though the conceptual difficulties might be just mine).

    I guess I would just formulate the problem as having a bunch of empirical data for each word and time point and providing an informative graphical summary, without getting bogged down too much in inferential questions.

    But thanks for the existing graphs, which are already super-cool.

  4. @Sionnach: actually, we are providing box-and-whisker, but you can only see them clearly in the large-scale graph.

    As currently displayed, the box spans plus or minus 1 standard deviation (about 68%) and the whiskers span plus or minus 2 standard deviations (about 95%), using the normal approximation after smoothing that makes up the Agresti-Coull approximation.

    As for the empirical data: that is available through our API, if you’re interested. Are you planning your own research there? Please tell us about it!

  5. @Chris:
    I am familiar with Student’s t — as a tool for distinguishing (or determining indistinguishability) between two observed distributions, but I’m not familiar with MIS or MIS/T scores. I confess I’m not sure what you mean.

  6. One problem with this display is that low-confidence estimates, with large uncertainty span, look visually bigger than the more confident estimates. I would try to use color to combat this misperception, for example by painting the estimates progressively darker gray as they become more certain, so that the tall boxes will be light gray and not too visible to distract.

  7. You’re welcome – here are two more things I find slightly annoying after having used this feature:

    1. Often, overly tall estimates from the 19th century force too small a vertical scale, which means the more recent, meaningful, high-confidence part of the graph is squeezed flat, making it hard to discern any trends in it. You might want to filter out the low-confidence outliers, or present them only partially with a numerical indication of how high up they go.

    2. The time axis divides the span into equal-width marks, sometimes totally disregarding the decades. It’s hard to get an idea of where a point is on the time axis when the nearest marks you see there are 1958 and 1988. Can you force it to always subdivide the time axis at decades?

  8. Thanks very much! A great resource for those of us who really dig the use and history of language.
    Great site and obviously a great deal of work to make it so.

    Very cool! Thanks again.

    Bill

  9. Great that you’re providing this, thank you; interesting and potentially useful.

    You say “Our new graphs show word occurrences for each year in counts per-million-words-of-text…”
    This is not exactly apparent or easily discernible on the graphs.

    In general the presentation of the data could be improved to make it much more friendly and accessible.

    I have suggestions:
    1: Sort out the date scale to index individual years. At the moment, neither of the plots allows distinction of a particular year with any degree of certainty. Variability on the date scale, with some intervals representing greater time-spans (where data is sparse), further confounds the issue.
    2: Re-align your data so that all graphs reference the same intervals. Some use xxx0, some use xxx1, e.g. 1950-1980 or 1951-1981.
    3: Consider providing the actual figures for each column, when a user hovers their cursor over it.
    4: Consider allowing user scalability, or display options?

    Thanks,
    Sarah

  10. Someone said that you might mean “early 21st century”, but I’m pretty sure you meant “late 20th century” when mentioning the new meaning of the word “icon”.

  11. I’d like to understand sources you use to establish the frequency of word use. It is easy today with the web. But over time, starting from the 1800s, how do you count occurrences? And will using Bayesian help better with earlier occurrences of evidential probability of use?

  12. I would like to know where you get your corpora and how large each of them is. I presume you are using literary texts and that you have 200 or more corpora with several million words each. As the producer of word frequency lists for commercial use, I know that is a tremendous job and would like to know more about how you have accomplished it.

  13. Hi Robert, our data comes from our smartwords partners as well as from select web sites, blogs, and even OCR’d text. We retrieve and process this text with our internal team. It is a lot of work but well worth it.

  14. It’s time: Do you REALLY *READ* these comments for anything other than your corporate Monday morning “fun” and a good laugh?

    1- You’ve been called several times on your claim that “icon” has some occurrence in the “late 21st century”. I’m ‘curious’ about how you go about observing ANYthing in the *future*. It would seem obvious to anyone who parses words and/or phrases that the phrase “late 21st century” would refer to the period of time between 2051 and 2100AD (“inclusive”). One generous commenter offered an “easy out” by suggesting that you really meant “late 20th century”. I would also buy-in to that understanding, but nobody seems to be “listening”, or if you are, it’s obvious that nobody is taking any action on this egregious error in an article that EVERYONE who clicks on this explanatory file must STEP OVER. Come on folks, this kinda stuff in your DISPLAY window gets in EVERYONE’s face!

    2- Making the “graph” meaningful and *actually* usable: Have you tried logarithmic scales on X, Y, or X AND Y? Adding a numeric marker on the tops of peaks and bottoms of valleys would also be helpful. Check-out the financial market [charts and graphs] reporting of their numeric data for more ideas.

    This is a really cool idea, and clearly deserves a much more careful application of the proof-reader’s talent.

    -rw

  15. Jeremy,

    Gathering such statistics sounds like an invaluable endeavor for wordsmiths and historians alike.

    Would it be possible to incorporate a search capability on a range of dates? For example, I have a thing for old words that are not yet archaic but are on their way out. Would there be a function to look up those words?

    Thanks!

  16. Hi there,

    Congrats on such a nice API. I’m just wondering which sources you used to compute the usage of the words over the last 200 years. Cheers!

  17. “Word frequency charts” is a neat feature, though it would be more helpful if users were allowed to customize the date range. Anyway, I’d say wordnik is pretty good at being creative, which certainly adds more fun when users are looking up a word definition. Very impressive, loved it!

  18. These frequency estimates are a very interesting study. Congratulations! And all the best for future improvements to your database.

  19. Rick, while I admire your attention to detail and accuracy, give the site a bit of a break. Oftentimes, this type of work is done on an ad hoc basis by passionate individuals that are taking a break from more onerous and revenue generating work (“real work”) and sharing their interest and whimsy. This is a dynamic and evolving website, not a peer-reviewed scholarly tome.

    BTW, in the spirit of whimsy, your post prompted me to look up “condescend.” A quick glance at the graph led me to the conclusion that you’d favor life in the early 17th or mid 19th century. Hope I got that right.

    This is a great tool and service and I applaud the team for being open to input. Keep up the good work.

    Rick, just lighten up. “Egregious error” is a strong.

  20. Sorry, didn’t finish last post – should be: “Egregious error” is a strong statement, making a mountain of a molehill
