Solution
A mixed blessing of working with the Common Crawl dataset is its massive scale. The formula as defined by Shannon needed to be "tweaked" to make it more scalable. We rewrote it like this:
This alternative notation made it easier to process the dataset in two runs:

##### 1. Big run (many terabytes of input)
Creates a wordcount-like dictionary of all occurrences of character combinations (of length N).
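The counting step of the big run can be sketched on a single machine as follows. This is only an illustration of the wordcount-like idea, not the distributed job itself; the function name `count_ngrams` and its parameters are hypothetical.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character combination of length n in text.

    Hypothetical helper: a single-machine stand-in for the big run.
    """
    # Slide a window of width n over the text; an input shorter than n
    # yields an empty range and therefore an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    print(count_ngrams("abab", 2))
```

Because the counts of separate inputs can simply be added together, partial dictionaries built from different parts of the dataset can be merged afterwards (in Python, `Counter` objects support `+` for this).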
##### 2. Small run (a few gigabytes of input)
Sums the occurrences of 7-character strings and uses these sums to calculate the probability of a certain character following each string. It then calculates the entropy of choosing the next character after that specific string, and sums these entropies, each multiplied by the number of occurrences of its string.
In the cleanup function (near the end of the run), it divides the weighted sum of entropies by the sum of occurrences of all strings. This gives the weighted average of the entropies, which is the "total entropy".
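The small run and its final division can be sketched as below. This is a minimal single-machine illustration under a few assumptions: the input is a dictionary mapping each counted character combination to its number of occurrences, the last character of each combination is treated as the "next character" and the rest as the preceding string, and the logarithm is base 2 (entropy in bits). The name `weighted_average_entropy` is hypothetical.

```python
import math
from collections import defaultdict


def weighted_average_entropy(ngram_counts: dict) -> float:
    """Weighted average of next-character entropies (hypothetical sketch)."""
    # Group the counts by preceding string: prefix -> {next char -> count}.
    followers = defaultdict(lambda: defaultdict(int))
    for ngram, count in ngram_counts.items():
        followers[ngram[:-1]][ngram[-1]] += count

    weighted_sum = 0.0  # sum of entropy * occurrences over all prefixes
    total = 0           # sum of occurrences over all prefixes
    for next_counts in followers.values():
        occurrences = sum(next_counts.values())
        # Entropy of the next character after this prefix, in bits
        # (base 2 is an assumption, not stated in the text).
        entropy = -sum(
            (c / occurrences) * math.log2(c / occurrences)
            for c in next_counts.values()
        )
        weighted_sum += entropy * occurrences
        total += occurrences

    # The "cleanup" step: divide the weighted sum by the total occurrences.
    return weighted_sum / total if total else 0.0


if __name__ == "__main__":
    # After "a" the next character is "a" or "b" equally often (1 bit);
    # after "b" it is always "a" (0 bits). Both prefixes occur twice.
    print(weighted_average_entropy({"aa": 1, "ab": 1, "ba": 2}))
```

In the toy input, the two prefixes occur equally often, so the result is the plain mean of 1 bit and 0 bits.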
The meaning of this "total entropy" is explained further in the interpretation.
An important caveat is that we skipped what we considered "illegal" characters. Only a-z and spaces are taken into account: capitals are converted to lowercase, and other characters are encoded as spaces, with a run of consecutive illegal characters producing at most one space. Thus the input strings "foo bar " and "Foo!=bar" give the same output results. We made this choice partly on grounds of complexity (processing an alphabet of 27 characters is easier than processing an alphabet of ~60 characters), and partly because other characters play only a very limited role in languages (or so we hope).
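A minimal sketch of this character filtering is shown below. The function name `normalize` is hypothetical, and one detail is an assumption: runs that mix real spaces with illegal characters are also collapsed into a single space, which the text does not state explicitly.

```python
import re

# Anything that is not a lowercase ASCII letter counts as a separator.
# Assumption: real spaces are collapsed together with illegal characters.
_NON_LETTERS = re.compile(r"[^a-z]+")


def normalize(text: str) -> str:
    """Reduce text to the 27-character alphabet (a-z plus space)."""
    # Lowercase first so that capitals survive as letters, then replace
    # every run of remaining non-letters with exactly one space.
    return _NON_LETTERS.sub(" ", text.lower())


if __name__ == "__main__":
    print(normalize("Foo!=bar"))
```

Applying the filter before counting keeps the dictionary of character combinations small, since every key is drawn from only 27 symbols.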