\documentclass[SKL-MASTER.tex]{subfiles}
\section*{Using k-NN for regression}
Regression is covered elsewhere in the book, but we might also want to run a regression
on ``pockets'' of the feature space. We can imagine that our dataset is generated by several
distinct data processes. If that is true, training only on similar data points is a good idea.
\subsection*{Getting ready}
\begin{itemize}
\item Regression can be used in the context of clustering. Since regression is a supervised technique, we'll use k-Nearest Neighbors (k-NN) rather than an unsupervised method such as KMeans.
\item For the k-NN regression, we'll use the K closest points in the feature space to build the regression, rather than using the entire space as in regular regression (see the short sketch below).
\end{itemize}
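As a minimal sketch of the idea (toy data, not part of the recipe), the predicted value for a query point is simply the mean target of its K nearest training points:
\begin{framed}
\begin{verbatim}
>>> import numpy as np
>>> from sklearn.neighbors import KNeighborsRegressor
>>> # toy 1-D data: two well-separated groups
>>> X_toy = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])
>>> y_toy = np.array([0., 1., 2., 10., 11., 12.])
>>> knn = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
>>> # the 3 nearest neighbors of 1.5 are 0, 1 and 2, so the
>>> # prediction is their mean target: (0 + 1 + 2) / 3 = 1.0
>>> knn.predict([[1.5]])
\end{verbatim}
\end{framed}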
\subsection*{Example}
In this exercise, we'll use the iris dataset. If we want to predict something such as the petal
width for each flower, clustering by iris species can potentially give us better results. The k-NN
regression won't cluster by the species, but we'll work under the assumption that the Xs will
be close for flowers of the same species, and therefore so will the petal width.
We'll use the iris dataset for this recipe:
\begin{framed}
\begin{verbatim}
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
\end{verbatim}
\end{framed}
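The listings that follow use a feature matrix \texttt{X} and a target \texttt{y} that are never defined in the original code. Judging from the outputs below (Setosa targets of 0.2, i.e. the petal width), a reasonable reconstruction of the missing setup is:
\begin{framed}
\begin{verbatim}
>>> # assumed setup: sepal length/width as features, petal width as target
>>> X = iris.data[:, :2]
>>> y = iris.data[:, 3]
\end{verbatim}
\end{framed}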
We'll try to predict the petal width based on the sepal length and width. We'll also fit a regular
linear regression to see how well the k-NN regression does in comparison:
\begin{framed}
\begin{verbatim}
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(X, y)
>>> mse = np.power(y - lr.predict(X), 2).mean()
>>> print("The MSE is: {:.2}".format(mse))
The MSE is: 0.15
\end{verbatim}
\end{framed}
Now, for the k-NN regression, use the following code:
\begin{framed}
\begin{verbatim}
>>> from sklearn.neighbors import KNeighborsRegressor
>>> knnr = KNeighborsRegressor(n_neighbors=10)
>>> knnr.fit(X, y)
>>> mse = np.power(y - knnr.predict(X), 2).mean()
>>> print("The MSE is: {:.2}".format(mse))
The MSE is: 0.069
\end{verbatim}
\end{framed}
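The number of neighbors is the main parameter to tune. As a purely illustrative sketch (not part of the original recipe), we could compare a few settings; note that an in-sample MSE like this always favors smaller K, so a fair comparison would use held-out data:
\begin{framed}
\begin{verbatim}
>>> for k in (3, 10, 20):
...     knn_k = KNeighborsRegressor(n_neighbors=k).fit(X, y)
...     mse_k = np.power(y - knn_k.predict(X), 2).mean()
...     print("k = {}: MSE = {:.3f}".format(k, mse_k))
\end{verbatim}
\end{framed}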
Let's look at what the k-NN regression does when we tell it to use the closest 10 points
for regression:
\begin{framed}
\begin{verbatim}
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(nrows=2, figsize=(7, 10))
>>> ax[0].set_title("Predictions")
>>> ax[0].scatter(X[:, 0], X[:, 1], s=lr.predict(X)*80,
...               label='LR Predictions', color='c', edgecolors='black')
>>> ax[1].scatter(X[:, 0], X[:, 1], s=knnr.predict(X)*80,
...               label='k-NN Predictions', color='m', edgecolors='black')
>>> ax[0].legend()
>>> ax[1].legend()
\end{verbatim}
\end{framed}
The output is a pair of scatter plots of sepal length against sepal width: the upper panel shows
the linear regression predictions and the lower panel the k-NN predictions, with marker size
proportional to the predicted value.
The predictions are close for the most part, but let's look at
the predictions for the Setosa species compared to the actual values:
\begin{framed}
\begin{verbatim}
>>> setosa_idx = np.where(iris.target_names=='setosa')
>>> setosa_mask = iris.target == setosa_idx[0]
>>> y[setosa_mask][:5]
array([ 0.2, 0.2, 0.2, 0.2, 0.2])
>>>
>>> knnr.predict(X)[setosa_mask][:5]
array([ 0.28, 0.17, 0.21, 0.2 , 0.31])
>>>
>>> lr.predict(X)[setosa_mask][:5]
array([ 0.44636645, 0.53893889, 0.29846368, 0.27338255, 0.32612885])
\end{verbatim}
\end{framed}
Looking at the plots again, the Setosa species (upper-left cluster) is largely overestimated by
linear regression, and k-NN is fairly close to the actual values.
\subsection*{k-NN regression}
The k-NN regression is calculated simply by taking the average of the k points closest to the
point being tested.
Let's manually predict a single point. First, we need to get the 10 points closest to our \texttt{example\_point}:
\begin{framed}
\begin{verbatim}
>>> example_point = X[0]
>>> from sklearn.metrics import pairwise
>>> distances_to_example = pairwise.pairwise_distances(X)[0]
>>>
>>> ten_closest_points = X[np.argsort(distances_to_example)][:10]
>>> ten_closest_y = y[np.argsort(distances_to_example)][:10]
>>> ten_closest_y.mean()
0.28000
\end{verbatim}
\end{framed}
We can see that this matches the k-NN prediction for the first sample (0.28) shown earlier, as expected.
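As a cross-check (not part of the original recipe), the fitted estimator's \texttt{kneighbors} method returns the distances and indices of the K nearest training points directly, so the same average can be reproduced without building the distance matrix by hand:
\begin{framed}
\begin{verbatim}
>>> distances, indices = knnr.kneighbors(example_point.reshape(1, -1))
>>> # averaging the targets of those neighbors reproduces the
>>> # k-NN prediction for X[0] seen earlier (0.28)
>>> y[indices[0]].mean()
\end{verbatim}
\end{framed}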
\end{document}