This spring, our word statistics pages have quietly improved. We now show how frequent a word is and how that frequency has changed over the last 200 years. Our new graphs plot each year's occurrences as a count per million words of text, which for most words will be no more than a handful.
It’s neat to look at how some words have appeared over time (Internet, a fad which will never catch on) or disappeared (e.g. hansom, a two-wheeled horse-drawn carriage). Also neat to see are words that have changed their sense: icon picked up a new meaning in the late 20th century, and its frequency changed markedly as a result (from 1-3 per million up to 10+ in the last fifteen years). (We note that not all statistics are entirely safe for work.)
Since our corpus varies in its density (we have far more text available for the last twenty years than we do in the 150 before that), we show each year's frequency with a 95% confidence interval*. (Sometimes that gives us unusually spiky plots, because the sparse years offer relatively little information.)
In future releases, we’d like to plot two words on the same graph (say, apple versus Apple) or explore other aspects of the words’ appearance.
What would you like to see?
* Our confidence intervals use the Agresti-Coull approximation, which is probably too generous in its upper bound, especially for rare words. We’d like to fix that by incorporating Bayesian priors on word frequencies in a future release.
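For readers curious how an interval like this can be computed, here is a minimal Python sketch of the standard Agresti-Coull formula applied to one year's counts and scaled to occurrences per million words. The function name, the z value of 1.96 for a 95% interval, and the example counts are assumptions made for illustration; this is not the code behind the graphs.

```python
import math

def agresti_coull_per_million(count, total_words, z=1.96):
    """Return (low, estimate, high) in occurrences per million words.

    count       -- occurrences of the word in one year (hypothetical input)
    total_words -- total words of text available for that year
    z           -- normal quantile; 1.96 is assumed here for a 95% interval
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")

    # Agresti-Coull: add z^2 pseudo-observations, half of them "successes".
    n_adj = total_words + z * z
    p_adj = (count + z * z / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)

    # Clamp to the valid range for a proportion.
    low = max(0.0, p_adj - half_width)
    high = min(1.0, p_adj + half_width)

    scale = 1_000_000
    return low * scale, count / total_words * scale, high * scale

# Two made-up years with the same raw frequency (5 per million):
sparse = agresti_coull_per_million(5, 1_000_000)      # little text available
dense = agresti_coull_per_million(500, 100_000_000)   # lots of text available
print(sparse)
print(dense)
```

Both made-up years have the same raw frequency, but the sparse year gets a much wider interval than the dense one, which is the effect that makes plots spiky in years with little text.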
See also previous post on word-frequency visualization.