Commit 620faba
Create Part_46_GaussianNB.md
1 parent 21442f8 commit 620faba

1 file changed: Part_46_GaussianNB.md (+60 −0)
## Classifying Data with scikit-learn

### Implementation
Ok, so it took a bit longer than normal to get the data ready, but we're dealing with text data, which isn't as readily represented as a matrix as the data we're used to.
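
As a reminder of where the objects used below come from, here's a minimal sketch of what that preparation step might look like. The two categories are an assumption (the earlier recipe defines the actual ones; the three-class extension later suggests ``rec.autos`` and ``rec.motorcycles``), but the sketch produces the ``newgroups``, ``count_vec``, and ``bow`` names the following code relies on:

<pre><code>
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Assumed two-category setup; the earlier recipe fixes the actual categories.
categories = ["rec.autos", "rec.motorcycles"]
newgroups = fetch_20newsgroups(categories=categories)

# Bag-of-words counts, densified so GaussianNB can consume them.
count_vec = CountVectorizer()
bow = np.array(count_vec.fit_transform(newgroups.data).todense())
</code></pre>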
However, now that we're ready, we'll fire up the classifier and fit our model:

<pre><code>
from sklearn import naive_bayes
clf = naive_bayes.GaussianNB()
</code></pre>
Before we fit the model, let's split the dataset into a training and test set:

<pre><code>
# Boolean mask: True rows form the training set, False rows the test set
mask = np.random.choice([True, False], len(bow))
clf.fit(bow[mask], newgroups.target[mask])
predictions = clf.predict(bow[~mask])
</code></pre>
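
The boolean mask is fine for a quick experiment; as an aside (not part of the original recipe), the same split can be done with scikit-learn's own helper, which keeps the features and targets aligned for you. A fresh estimator (``alt_clf``) is used here so the ``clf`` fit above is left untouched:

<pre><code>
from sklearn.model_selection import train_test_split

# Same idea as the mask: hold out roughly half of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    bow, newgroups.target, test_size=0.5, random_state=0)

alt_clf = naive_bayes.GaussianNB().fit(X_train, y_train)
alt_predictions = alt_clf.predict(X_test)
</code></pre>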
Now that we've fit the model on the training set and then predicted the test set in an attempt to determine which categories go with which articles, let's get a sense of the approximate accuracy:

<pre><code>
np.mean(predictions == newgroups.target[~mask])
0.92446043165467628
</code></pre>

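As an aside, the same number can come straight from the estimator, since scikit-learn classifiers expose mean accuracy through ``score``:

<pre><code>
# Mean accuracy on the held-out rows; should match the np.mean(...) above.
clf.score(bow[~mask], newgroups.target[~mask])
</code></pre>
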
### Theoretical Background
The fundamental idea of how Naïve Bayes works is that we can estimate the probability of a data point belonging to a class, given its feature vector.
This can be rearranged via the Bayes formula to give the MAP estimate of the class given the feature vector.
The MAP estimate chooses the class whose posterior probability is maximized, which, since the evidence term is the same for every class, amounts to maximizing the class prior times the likelihood of the feature vector under that class.
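
As a compact way to write that down, here is the decision rule in LaTeX (``C_k`` is a candidate class and ``x = (x_1, ..., x_n)`` the feature vector; this is the standard Naïve Bayes formulation, not anything specific to this recipe):

<pre><code>
% Bayes formula for the posterior of class C_k given feature vector x:
P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{P(x)}

% The naive independence assumption plus dropping the class-independent P(x)
% gives the MAP decision rule:
\hat{y} = \arg\max_{k}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)

% GaussianNB models each P(x_i | C_k) as a univariate Gaussian whose mean
% and variance are estimated per class from the training data.
</code></pre>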
### There's more…
We can also extend Naïve Bayes to do multiclass work. Instead of assuming a Gaussian likelihood, we'll use a multinomial likelihood.
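
For contrast with the Gaussian case, a quick sketch of the multinomial likelihood that ``MultinomialNB`` uses (``x_i`` is the count of token ``i`` in a document, ``\theta_{ki}`` the estimated probability of token ``i`` under class ``C_k``):

<pre><code>
% Multinomial likelihood of a count vector x under class C_k
% (up to a multinomial coefficient that doesn't depend on the class):
P(x \mid C_k) \propto \prod_{i=1}^{n} \theta_{ki}^{\,x_i}

% Smoothed estimate used by MultinomialNB (alpha defaults to 1):
\hat{\theta}_{ki} = \frac{N_{ki} + \alpha}{N_k + \alpha\, n}
% N_{ki}: count of token i in training documents of class k
% N_k: total token count in class k; n: vocabulary size
</code></pre>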
First, let's get a third category of data:

<pre><code>
from sklearn.datasets import fetch_20newsgroups
mn_categories = ["rec.autos", "rec.motorcycles",
                 "talk.politics.guns"]
mn_newgroups = fetch_20newsgroups(categories=mn_categories)
</code></pre>
We'll need to vectorize this just like the two-class case:
<pre><code>
# Re-fit the vectorizer on the three-category corpus and densify the counts
mn_bow = count_vec.fit_transform(mn_newgroups.data)
mn_bow = np.array(mn_bow.todense())
</code></pre>
Let's create a mask array to train and test:

<pre><code>
mn_mask = np.random.choice([True, False], len(mn_newgroups.data))
multinom = naive_bayes.MultinomialNB()
multinom.fit(mn_bow[mn_mask], mn_newgroups.target[mn_mask])
mn_predict = multinom.predict(mn_bow[~mn_mask])
np.mean(mn_predict == mn_newgroups.target[~mn_mask])
0.96594778660612934
</code></pre>
It's not completely surprising that we did well. We did fairly well in the two-class case, and since the ``talk.politics.guns`` category is presumably fairly orthogonal to the other two, we should expect to do pretty well here too.
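
To check the intuition that most of the confusion sits between the two ``rec.*`` groups, here is a quick sketch using scikit-learn's confusion matrix (it reuses the objects created above):

<pre><code>
from sklearn.metrics import confusion_matrix

# Rows are true categories, columns are predicted categories,
# in the order given by mn_newgroups.target_names.
print(mn_newgroups.target_names)
print(confusion_matrix(mn_newgroups.target[~mn_mask], mn_predict))
</code></pre>

If the off-diagonal counts cluster between ``rec.autos`` and ``rec.motorcycles``, that supports the orthogonality argument above.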
