### Classifying documents with Naïve Bayes
Naïve Bayes is a really interesting model. It's somewhat similar to k-NN in the sense that it makes some assumptions that might oversimplify reality, but it still performs well in many cases.
#### Preparation
In this recipe, we'll use Naïve Bayes to do document classification with sklearn. An example I have personal experience of is using the words that make up an account descriptor in accounting, such as Accounts Payable, and determining whether it belongs to the Income Statement, Cash Flow Statement, or Balance Sheet.

The basic idea is to use the word frequencies from a labeled training corpus to learn the classifications of the documents. Then, we can turn the model on a test set and attempt to predict the labels.

We'll use the newsgroups dataset within sklearn to play with the Naïve Bayes model. It's a nontrivial amount of data, so we'll fetch it instead of loading it. We'll also limit the categories to rec.autos and rec.motorcycles:
<pre><code>
from sklearn.datasets import fetch_20newsgroups

categories = ["rec.autos", "rec.motorcycles"]
newgroups = fetch_20newsgroups(categories=categories)

# take a look at the first document
print("\n".join(newgroups.data[:1]))
From: gregl@zimmer.CSUFresno.EDU (Greg Lewis)
Subject: Re: WARNING.....(please read)...
Keywords: BRICK, TRUCK, DANGER
Nntp-Posting-Host: zimmer.csufresno.edu
Organization: CSU Fresno
Lines: 33

[…]

newgroups.target_names[newgroups.target[0]]
'rec.autos'
</code></pre>
Now that we have newgroups, we'll need to represent each document as a bag of words. This representation is what gives Naïve Bayes its name: the model is "naive" because documents are classified without regard to any dependence between the words within a document (each word is treated as conditionally independent given the class). This might be considered a flaw, but Naïve Bayes has been shown to work reasonably well.

We need to preprocess the data into a bag-of-words matrix. This is a sparse matrix that has an entry wherever a word is present in a document. This matrix can become quite large, as illustrated:
<pre><code>
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
bow = count_vec.fit_transform(newgroups.data)
</code></pre>
The matrix is sparse, with one row per document and one column per word in the vocabulary. The value at each (document, word) position is the frequency of that term in that document:
<pre><code>
bow
<1192x19177 sparse matrix of type '<type 'numpy.int64'>'
    with 164296 stored elements in Compressed Sparse Row format>
</code></pre>
We'll actually need the matrix as a dense array for the Gaussian Naïve Bayes object, so let's convert it back:

<pre><code>
bow = np.array(bow.todense())
</code></pre>
Clearly, most of the entries are 0, but we might want to reconstruct the document counts as a sanity check:

<pre><code>
words = np.array(count_vec.get_feature_names())
words[bow[0] > 0][:5]
array([u'10pm', u'1qh336innfl5', u'33', u'93740', u'_____________________
______________________________________________'],
      dtype='<U79')
</code></pre>
Now, are these words actually in the first document? Let's check using the following commands:

<pre><code>
'10pm' in newgroups.data[0].lower()
True
'1qh336innfl5' in newgroups.data[0].lower()
True
</code></pre>
### Implementation
OK, so it took a bit longer than usual to get the data ready, but we're dealing with text data, which isn't as quickly represented as a matrix as the numeric data we're used to.

However, now that we're ready, we'll fire up the classifier and fit our model:
<pre><code>
from sklearn import naive_bayes

clf = naive_bayes.GaussianNB()

# before we fit the model, split the dataset into training and test sets
mask = np.random.choice([True, False], len(bow))
clf.fit(bow[mask], newgroups.target[mask])
predictions = clf.predict(bow[~mask])
</code></pre>
Now that we've fit a model on the training set and predicted the labels for the test set, let's get a sense of the approximate accuracy:

<pre><code>
np.mean(predictions == newgroups.target[~mask])
0.92446043165467628
</code></pre>
### Theoretical Background

The fundamental idea behind Naïve Bayes is that we can estimate the probability of a data point belonging to a class, given its feature vector. This can be rearranged via Bayes' formula to give the MAP (maximum a posteriori) estimate for the feature vector. The MAP estimate chooses the class for which the feature vector's posterior probability is maximized.
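To spell that out (this is the standard formulation, not something stated explicitly in the recipe): if C denotes a class and x = (x_1, ..., x_n) is the bag-of-words feature vector, Bayes' rule gives

$$P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\,P(C)}{P(\mathbf{x})}.$$

Since P(x) is the same for every class, it can be dropped from the maximization, and the naive conditional-independence assumption factors the likelihood into a product over individual word features:

$$\hat{y} = \underset{C}{\arg\max}\; P(C)\prod_{i=1}^{n} P(x_i \mid C).$$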
### There's more…

We can also extend Naïve Bayes to multiclass work. Instead of assuming a Gaussian likelihood for the word counts, we'll use a multinomial likelihood, which is a more natural fit for count data.
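Concretely (again, the standard formulation rather than something spelled out in the recipe), the multinomial likelihood models the word counts x = (x_1, ..., x_n) of a document in class C as

$$P(\mathbf{x} \mid C) \propto \prod_{i=1}^{n} \theta_{C,i}^{\,x_i},$$

where θ_{C,i} is the probability of word i under class C, estimated from the training counts.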
First, let's get a third category of data:

<pre><code>
from sklearn.datasets import fetch_20newsgroups

mn_categories = ["rec.autos", "rec.motorcycles",
                 "talk.politics.guns"]
mn_newgroups = fetch_20newsgroups(categories=mn_categories)
</code></pre>
We'll need to vectorize this just like the two-class case:
<pre><code>
mn_bow = count_vec.fit_transform(mn_newgroups.data)
mn_bow = np.array(mn_bow.todense())
</code></pre>
Let's create a mask array to split into training and test sets, then fit and evaluate the multinomial model:
<pre><code>
mn_mask = np.random.choice([True, False], len(mn_newgroups.data))
multinom = naive_bayes.MultinomialNB()
multinom.fit(mn_bow[mn_mask], mn_newgroups.target[mn_mask])
mn_predict = multinom.predict(mn_bow[~mn_mask])
np.mean(mn_predict == mn_newgroups.target[~mn_mask])
0.96594778660612934
</code></pre>
It's not completely surprising that we did well. We did fairly well in the two-class case, and since the talk.politics.guns category is presumably fairly orthogonal to the other two in vocabulary, we should expect to do pretty well here too.
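One way to check that intuition is to look at a confusion matrix on the held-out documents. The sketch below is not part of the original recipe; it assumes the mn_* variables defined above are still in scope:

<pre><code>
from sklearn.metrics import confusion_matrix

# rows are true categories, columns are predicted categories;
# large off-diagonal counts would show which pairs of classes get confused
print(confusion_matrix(mn_newgroups.target[~mn_mask], mn_predict))
print(mn_newgroups.target_names)
</code></pre>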
