### Classifying documents with Naïve Bayes

Naïve Bayes is a really interesting model. It's somewhat similar to k-NN in the sense that it makes assumptions that can oversimplify reality, yet it still performs well in many cases.

#### Preparation

In this recipe, we'll use Naïve Bayes to do document classification with sklearn. An example I have personal experience of is using the words that make up an account descriptor in accounting, such as Accounts Payable, and determining whether it belongs to the Income Statement, Cash Flow Statement, or Balance Sheet.
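
As a rough illustration of that use case, here is a minimal sketch; the account descriptors and labels are made up for this example rather than taken from real chart-of-accounts data, and the vectorize-then-fit pattern is exactly what the rest of this recipe walks through:

<pre><code>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# a handful of hand-labeled account descriptors (illustrative only)
descriptors = ["accounts payable", "accounts receivable",
               "interest income", "depreciation expense",
               "cash from operating activities"]
labels = ["Balance Sheet", "Balance Sheet",
          "Income Statement", "Income Statement",
          "Cash Flow Statement"]

# turn the descriptors into word counts and fit a multinomial model
vec = CountVectorizer()
X = vec.fit_transform(descriptors)
clf = MultinomialNB().fit(X, labels)

# classify a new, unseen descriptor
print(clf.predict(vec.transform(["interest expense"])))
</code></pre>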

The basic idea is to use the word frequencies from a labeled training corpus to learn the classifications of the documents. Then, we can apply the model to a held-out test set and attempt to predict the labels.

We'll use the newsgroups dataset within sklearn to play with the Naïve Bayes model. It's a nontrivial amount of data, so we'll fetch it instead of loading it. We'll also limit the categories to rec.autos and rec.motorcycles:
<pre><code>
from sklearn.datasets import fetch_20newsgroups

categories = ["rec.autos", "rec.motorcycles"]
newgroups = fetch_20newsgroups(categories=categories)

# take a look at the first post
print "\n".join(newgroups.data[:1])
From: gregl@zimmer.CSUFresno.EDU (Greg Lewis)
Subject: Re: WARNING.....(please read)...
Keywords: BRICK, TRUCK, DANGER
Nntp-Posting-Host: zimmer.csufresno.edu
Organization: CSU Fresno
Lines: 33

[…]

newgroups.target_names[newgroups.target[0]]
'rec.autos'
</code></pre>
Now that we have the newsgroups data, we'll need to represent each document as a bag of words. This representation is what gives Naïve Bayes its name. The model is "naive" because documents are classified without regard for any intradocument word covariance; each word is treated as independent of the others given the class. This might be considered a flaw, but Naïve Bayes has been shown to work reasonably well.

We need to preprocess the data into a bag-of-words matrix. This is a sparse matrix that has an entry wherever a word is present in a document. This matrix can become quite large, as illustrated:

<pre><code>
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
bow = count_vec.fit_transform(newgroups.data)
</code></pre>

The matrix is sparse, with one row per document and one column per word in the vocabulary. Each entry is the frequency of that particular term in that document:
<pre><code>
bow
<1192x19177 sparse matrix of type '<type 'numpy.int64'>'
with 164296 stored elements in Compressed Sparse Row format>
</code></pre>

We'll actually need the matrix as a dense array for the Gaussian Naïve Bayes object, which doesn't accept sparse input. So, let's convert it back:
<pre><code>
import numpy as np

bow = np.array(bow.todense())
</code></pre>

Clearly, most of the entries are 0, but we might want to reconstruct the document counts as a sanity check:
<pre><code>
words = np.array(count_vec.get_feature_names())
words[bow[0] > 0][:5]
array([u'10pm', u'1qh336innfl5', u'33', u'93740', u'_____________________
______________________________________________'],
dtype='<U79')
</code></pre>

Now, are these words actually present in the first document? Let's check that using the following commands:
<pre><code>
'10pm' in newgroups.data[0].lower()
True
'1qh336innfl5' in newgroups.data[0].lower()
True
</code></pre>

### Implementation

Ok, so it took a bit longer than normal to get the data ready, but we're dealing with text data that isn't as quickly represented as a matrix as the data we're used to.

However, now that we're ready, we'll fire up the classifier, split the dataset into training and test sets, and fit our model:

<pre><code>
from sklearn import naive_bayes

clf = naive_bayes.GaussianNB()

# split into training and test sets with a random boolean mask
mask = np.random.choice([True, False], len(bow))
clf.fit(bow[mask], newgroups.target[mask])
predictions = clf.predict(bow[~mask])
</code></pre>

Now that we have fit a model on the training set and predicted on the test set in an attempt to determine which categories go with which articles, let's get a sense of the approximate accuracy:

<pre><code>
np.mean(predictions == newgroups.target[~mask])
0.92446043165467628
</code></pre>
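
Keep in mind that the random boolean mask gives a roughly 50/50 split that changes on every run, so the accuracy you see will vary a little. If you want a reproducible split, one alternative (a sketch, assuming a version of sklearn that provides sklearn.model_selection) is train_test_split:

<pre><code>
from sklearn.model_selection import train_test_split

# hold out 25% of the documents as a test set, with a fixed random seed
X_train, X_test, y_train, y_test = train_test_split(
    bow, newgroups.target, test_size=0.25, random_state=0)

clf = naive_bayes.GaussianNB()
clf.fit(X_train, y_train)
print(np.mean(clf.predict(X_test) == y_test))
</code></pre>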

### Theoretical Background

The fundamental idea of how Naïve Bayes works is that we can estimate the probability of a data point belonging to each class, given its feature vector. By Bayes' formula, this posterior is proportional to the likelihood of the feature vector under the class times the prior probability of the class. The MAP (maximum a posteriori) estimate then chooses the class for which this product is maximized.
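
In symbols, for a feature vector $x = (x_1, \ldots, x_n)$ and a candidate class $C$, Bayes' rule gives

$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)} \propto P(C)\prod_{i=1}^{n} P(x_i \mid C),$$

where the product comes from the naive assumption that the features are conditionally independent given the class. The MAP prediction is the class that maximizes $P(C)\prod_{i} P(x_i \mid C)$; the denominator $P(x)$ can be dropped because it is the same for every class.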

### There's more…

We can also extend Naïve Bayes to do multiclass work. Instead of assuming a Gaussian likelihood, we'll use a multinomial likelihood.

First, let's get a third category of data:

<pre><code>
from sklearn.datasets import fetch_20newsgroups

mn_categories = ["rec.autos", "rec.motorcycles",
                 "talk.politics.guns"]
mn_newgroups = fetch_20newsgroups(categories=mn_categories)
</code></pre>

We'll need to vectorize this just as in the two-class case:

<pre><code>
mn_bow = count_vec.fit_transform(mn_newgroups.data)
mn_bow = np.array(mn_bow.todense())
</code></pre>

Let's create a mask array to train and test:

<pre><code>
mn_mask = np.random.choice([True, False], len(mn_newgroups.data))
multinom = naive_bayes.MultinomialNB()
multinom.fit(mn_bow[mn_mask], mn_newgroups.target[mn_mask])
mn_predict = multinom.predict(mn_bow[~mn_mask])
np.mean(mn_predict == mn_newgroups.target[~mn_mask])
0.96594778660612934
</code></pre>

It's not completely surprising that we did well. We did fairly well in the two-class case, and since the talk.politics.guns category is fairly distinct from the other two, adding it shouldn't make the problem much harder; see the confusion-matrix sketch below for one way to verify this.
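
To check that intuition, one option (a quick sketch using sklearn's metrics module) is to look at the confusion matrix for the multiclass predictions and confirm that most of the mass sits on the diagonal:

<pre><code>
from sklearn.metrics import confusion_matrix

# rows are the true categories, columns the predicted ones
print(confusion_matrix(mn_newgroups.target[~mn_mask], mn_predict))
print(mn_newgroups.target_names)
</code></pre>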