\documentclass[SKL-MASTER.tex]{subfiles}
\textbf{Layout of Datasets}\\
Scikit-learn learns from one or more datasets, each represented as a 2D array. These arrays can be understood as lists of multi-dimensional observations: the first axis is the samples axis, while the second is the features axis.\\ When the data is not initially in the (\texttt{n\_samples}, \texttt{n\_features}) shape, it needs to be preprocessed before scikit-learn can use it.
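For instance, the packaged digits dataset ships as a set of $8\times 8$ images; a minimal sketch of flattening each image into a 64-feature row so it fits the (\texttt{n\_samples}, \texttt{n\_features}) layout:
\begin{framed}
\begin{verbatim}
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> digits.images.shape       # 1797 images, each 8x8 pixels
(1797, 8, 8)
>>> # flatten each image into a row of 64 features
>>> data = digits.images.reshape((digits.images.shape[0], -1))
>>> data.shape
(1797, 64)
\end{verbatim}
\end{framed}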
\textbf{Packaged Datasets}\\
The scikit-learn library is packaged with datasets. These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work. \\
% This recipe demonstrates how to load the famous Iris flowers dataset.
\newpage
A simple example dataset shipped with scikit-learn: the Iris dataset.
\begin{framed}
\begin{verbatim}
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
\end{verbatim}
\end{framed}
The Iris dataset consists of 150 observations of irises, each described by 4 features: sepal and petal length and width, as detailed in \texttt{iris.DESCR}.
%==========================================================================%
\newpage
scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
\begin{framed}
\begin{verbatim}
from sklearn.datasets import load_iris
## Load the packaged Iris flowers dataset
## (150 observations x 4 real-valued features,
##  3-class classification)
iris = load_iris()
print(iris)
iris.keys()
\end{verbatim}
\end{framed}
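The returned \texttt{Bunch} object exposes the data and its metadata as attributes; a short sketch of inspecting them:
\begin{framed}
\begin{verbatim}
>>> iris = load_iris()
>>> iris.data.shape           # (n_samples, n_features)
(150, 4)
>>> iris.target.shape         # one class label per sample
(150,)
>>> list(iris.target_names)   # the three species
['setosa', 'versicolor', 'virginica']
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)',
 'petal length (cm)', 'petal width (cm)']
\end{verbatim}
\end{framed}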
\newpage
\textbf{Classifying irises}\\
The iris dataset presents a classification task: identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width:
\begin{framed}
\begin{verbatim}
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>>
>>> iris_X = iris.data
>>> iris_y = iris.target
>>>
>>> np.unique(iris_y)
array([0, 1, 2])
>>>
>>> # Three Classes (Species)
\end{verbatim}
\end{framed}
Split the iris data into training and test sets using a random permutation:
\begin{framed}
\begin{verbatim}
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier()
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
\end{verbatim}
\end{framed}
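A minimal follow-up, checking accuracy on the held-out samples with the estimator's \texttt{score} method (nine of the ten predictions above match the true labels):
\begin{framed}
\begin{verbatim}
>>> # fraction of correct predictions on the 10 test samples
>>> knn.score(iris_X_test, iris_y_test)
0.9
\end{verbatim}
\end{framed}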
%============================================================================= %
\subsubsection{k-Nearest neighbors classifier}
The simplest possible classifier is the nearest neighbor: given a new observation \texttt{x\_test}, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector.
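To make the idea concrete, here is a minimal sketch of a 1-nearest-neighbor prediction written directly in NumPy; this is only an illustration of the principle, not scikit-learn's actual implementation:
\begin{framed}
\begin{verbatim}
import numpy as np

def predict_1nn(X_train, y_train, x_test):
    # Euclidean distance from x_test to every training point
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # label of the closest training observation
    return y_train[np.argmin(dists)]

# e.g. predict_1nn(iris_X_train, iris_y_train, iris_X_test[0])
\end{verbatim}
\end{framed}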
\newpage
\textbf{Load from CSV}\\
\begin{itemize}
\item In most scikit-learn examples, the packaged data is loaded as a \texttt{Bunch} object.
\item There are many examples in the tutorial where \texttt{load\_files()} or other functions are used to populate the \texttt{Bunch} object.
\item Functions like \texttt{load\_files()} expect the data to be present in a certain format. Suppose we have data stored in a different format.
\item It is very common to have a dataset as a CSV file, either on the local workstation or on a remote server.
\item Here we load a CSV file from a URL, in this case the Pima Indians diabetes classification dataset from the UCI Machine Learning Repository.
\item From the prepared \texttt{X} and \texttt{y} variables, you can train a machine learning model, as shown in the sketch after the loading code below.
\end{itemize}
%\item A CSV file with a bunch of strings for each field.
\newpage
\begin{framed}
\begin{verbatim}
# Pima Indians diabetes
# Load the dataset from CSV URL
import numpy as np
from urllib.request import urlopen
# URL for the Pima Indians Diabetes dataset
# (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the 8 input features from the target attribute
X = dataset[:, 0:8]
y = dataset[:, 8]
\end{verbatim}
\end{framed}
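From here, \texttt{X} and \texttt{y} can be passed straight to any scikit-learn estimator. A minimal sketch using logistic regression; the choice of model is just for illustration:
\begin{framed}
\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# fit a classifier on the Pima data loaded above
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
# mean accuracy on the training data
print(model.score(X, y))
\end{verbatim}
\end{framed}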
%
% \textbf{Summary}
%
%In this post you discovered that the scikit-learn library comes with packaged datasets, including the iris flowers dataset. These datasets can be loaded easily and used to explore and experiment with different machine learning models.
%
%You also saw how you can load CSV data with scikit-learn. You learned a way of opening CSV files from the web using the \textbf{urllib} library and how you can read that data as a NumPy matrix for use in scikit-learn.
\end{document}