Solution

A mixed blessing of working with the Common Crawl dataset is its massive scale. The formula as defined by Shannon needed to be "tweaked" to make it more scalable. We rewrote it like this:

*(image: the rewritten entropy formulas)*
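The original image is not reproduced here; a plausible reconstruction, based on the description of the two runs below and using our own symbols (c(s,x) for the number of times the 7-character string s is followed by character x, c(s) the total count of s, and N the total count over all strings), is:

```latex
% Shannon's conditional entropy of the next character x given the preceding string s
H(X \mid S) = -\sum_{s} p(s) \sum_{x} p(x \mid s) \log_2 p(x \mid s)

% Rewritten in terms of raw counts, with c(s) = \sum_x c(s,x) and N = \sum_s c(s),
% so it can be computed from a wordcount-style dictionary in two passes:
H(X \mid S) = \frac{1}{N} \sum_{s} \Big( c(s)\,\log_2 c(s) - \sum_{x} c(s,x)\,\log_2 c(s,x) \Big)
```

The bracketed term is the entropy of a single string weighted by its number of occurrences, which is exactly what the small run accumulates before dividing by the total count in the cleanup step.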

This alternative notation made it easier to process the dataset in two runs:

##### 1. Big run (many terabytes of input)

Creates a wordcount-like dictionary of all occurrences of character combinations (of length N).
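A minimal sketch of what the mapper of this first pass could look like, assuming the standard Hadoop `Mapper` API, a hypothetical window length of N = 8 (the 7-character string plus the character that follows it), and a hypothetical `TextCleaner` helper (sketched further down); none of these names are taken from the project's source:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * First pass ("big run"): emit every N-character window of the cleaned
 * text with a count of 1, so the reduce side can build a wordcount-like
 * dictionary of character-combination occurrences.
 */
public class NGramCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int N = 8;                     // assumed window length
    private static final LongWritable ONE = new LongWritable(1);
    private final Text window = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical cleaning helper, see the normalization sketch further down.
        String cleaned = TextCleaner.clean(value.toString());
        for (int i = 0; i + N <= cleaned.length(); i++) {
            window.set(cleaned.substring(i, i + N));
            context.write(window, ONE);                 // reducer sums these into counts
        }
    }
}
```

The reduce side of this job would simply sum the counts per combination, as in the classic wordcount example.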

##### 2. Small run (a few gigabytes of input)

Sums the occurrences of 7-character strings, and uses these to calculate the chance of a certain character following each string. It then calculates the entropy of choosing the next character after that specific string, and sums these entropies weighted by the number of occurrences of the string.

In the cleanup function (near the end of the run), it divides the weighted sum of entropies by the sum of occurrences of all strings. This gives the weighted average of the entropies, which is the "total entropy".
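A rough sketch of how this second pass could be organized, assuming the counts from the first run are re-keyed by their 7-character prefix and that a single reducer is used so that `cleanup()` sees the global totals; the class and field names are hypothetical:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Second pass ("small run"): the key is a 7-character string, the values are
 * the counts of every combination starting with that string (one count per
 * possible next character). The reducer accumulates the occurrence-weighted
 * entropy, and cleanup() divides by the total number of occurrences to
 * obtain the weighted average ("total entropy").
 */
public class EntropyReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {

    private static final double LOG2 = Math.log(2.0);

    private double weightedEntropySum = 0.0;  // sum over strings of c(s) * H(next | s)
    private long totalOccurrences = 0L;       // sum over strings of c(s)

    @Override
    protected void reduce(Text prefix, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        List<Long> perNextChar = new ArrayList<>();
        long prefixCount = 0L;                // c(s)
        for (LongWritable c : counts) {
            perNextChar.add(c.get());
            prefixCount += c.get();
        }
        // H(next | s) = -sum_x p(x|s) * log2 p(x|s), computed from the raw counts
        double entropy = 0.0;
        for (long c : perNextChar) {
            double p = (double) c / prefixCount;
            entropy -= p * Math.log(p) / LOG2;
        }
        weightedEntropySum += prefixCount * entropy;
        totalOccurrences += prefixCount;
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Weighted average of the per-string entropies: the "total entropy".
        context.write(new Text("total entropy"),
                      new DoubleWritable(weightedEntropySum / totalOccurrences));
    }
}
```

With this split, the first job carries the terabyte-scale input, while the second only has to weight and sum the much smaller dictionary of counts.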

The meaning of this "total entropy" is explained further on the Interpretation page.

An important caveat is that we skipped what we considered "illegal" characters. Only a-z and spaces are accounted for: capitals are cast to lowercase, and other characters are encoded as spaces, with at most one consecutive illegal character (so a run of illegal characters becomes a single space). Thus the input strings "foo bar " and "Foo!=bar" give the same output results. We made this choice partly on grounds of complexity (processing an alphabet of 27 characters is easier than processing an alphabet of ~60 characters), and partly because other characters play only a very limited role in languages (or so we hope).
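A small sketch of this normalization; the `TextCleaner` name is ours, and the exact treatment of spaces adjacent to illegal characters is not specified above, so details may differ from the real cleaning code:

```java
/**
 * Normalizes raw text to the 27-character alphabet used for the entropy
 * calculation: lowercase a-z plus the space. Every other character is
 * replaced by a space, and a run of such characters produces only one space.
 */
public final class TextCleaner {

    private TextCleaner() {}

    public static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        boolean lastWasIllegal = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = Character.toLowerCase(raw.charAt(i));
            if ((c >= 'a' && c <= 'z') || c == ' ') {
                out.append(c);
                lastWasIllegal = false;
            } else if (!lastWasIllegal) {
                out.append(' ');          // first illegal character in a run becomes a space
                lastWasIllegal = true;    // further illegal characters in the run are dropped
            }
        }
        return out.toString();
    }
}
```

For example, `TextCleaner.clean("Foo!=bar")` returns `"foo bar"`: the capital is lowered and the run `!=` collapses into a single space.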

Next (the results)
Home
