Tuesday, December 21, 2010

First-Digit Law and Google Ngram

The first-digit law (Benford's Law) describes how the leading digit in count data tends to over-represent "1" and, to a decreasing extent, "2", "3", and so on, with each digit less common than the one before.  Google Ngram frequencies of numbers in the book corpus show a similar trend, which is interesting, since these values arise from such heterogeneous sources.
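
For reference, Benford's Law puts the expected share of leading digit d at log10(1 + 1/d). The short Python sketch below just tabulates those shares. The function name `benford_expected` is my own choice for illustration and has nothing to do with how the Ngram graphs were made.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of a leading digit (1-9) under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Tabulate the expected share for each possible leading digit.
expected = {d: benford_expected(d) for d in range(1, 10)}

for d, share in expected.items():
    print(f"{d}: {share:.3f}")
```

The shares fall from roughly 30% for "1" to under 5% for "9", and they sum to 1.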

I ran the same set for the hundreds, and the results are similar, although "800" behaves differently than expected.  One possible explanation is that the surplus "800"s come from 1-800 phone numbers.  Running the same query with "101" substituted for "100", and so on, eliminates the 800 bulge that starts in the 1980s.
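
One way to put a number on a bulge like the one at "800" is to compare each digit's observed share with its Benford share and see which digit sticks out most. The sketch below is only an illustration: the counts are invented (a Benford-shaped set with extra "8"s added on purpose) and are not Ngram data, and the helper names are hypothetical.

```python
import math


def benford_share(digit: int) -> float:
    """Expected share of a leading digit (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)


def excess_over_benford(counts: dict[int, int]) -> dict[int, float]:
    """Observed share minus Benford share, per leading digit."""
    total = sum(counts.values())
    return {d: counts[d] / total - benford_share(d) for d in counts}


# INVENTED counts for illustration only: start from a Benford-shaped
# distribution, then inflate "8" to mimic a surplus such as 1-800 numbers.
counts = {d: round(10_000 * benford_share(d)) for d in range(1, 10)}
counts[8] += 500

excess = excess_over_benford(counts)
most_over_represented = max(excess, key=excess.get)

for d in sorted(excess):
    print(f"{d}: {excess[d]:+.4f}")
print("largest excess at digit", most_over_represented)
```

With real counts, a positive excess at a single digit while the others sit at or below zero would be consistent with an outside source inflating that digit, though it would not say what that source is.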

It is exciting to think about the potential to ask more socially interesting questions of this data.  Note that I stopped the graph at the default end year (2000).  Although the data set extends to 2008, it seems that data must be missing after 2000, because many values that should not drop in concert do so.

