Commit b189388

deploy: b011509
1 parent 4d27bc1 commit b189388

21 files changed

+350
-208
lines changed

_images/five_10_1.png

6.36 KB

_images/five_8_0.png

9.06 KB

_images/four_5_1.png

15.9 KB

_images/four_7_1.png

7.28 KB

_images/four_9_1.png

-7.53 KB
Binary file not shown.
File renamed without changes.

_images/six_2_1.png

-87.7 KB
Binary file not shown.

_images/three_2_1.png

-345 KB
Binary file not shown.

_sources/chapters/five.ipynb

+30-20
Large diffs are not rendered by default.

_sources/chapters/four.ipynb

+49-62
Large diffs are not rendered by default.

_sources/chapters/six.ipynb

+55-30
Large diffs are not rendered by default.

_sources/chapters/three.ipynb

+21-22
Large diffs are not rendered by default.

chapters/five.html

+21-7
Original file line number | Diff line number | Diff line change
@@ -525,6 +525,9 @@ <h2>5.1 Split data into training and testing subsets<a class="headerlink" href="
525525
<div class="cell docutils container">
526526
<div class="cell_input docutils container">
527527
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># read model input features and labels </span>
528+
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
529+
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
530+
528531
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;./data/samples/sample_100K.csv&#39;</span><span class="p">,</span> <span class="n">index_col</span> <span class="o">=</span> <span class="kc">False</span><span class="p">)</span>
529532
<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Sample dimensions:&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(),</span> <span class="n">data</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
530533
<span class="nb">print</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
@@ -553,7 +556,9 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
553556
<p>Now that we have the training subset and the optimal parameters, we can define the model with ‘RandomForestClassifier()’ using the code below:</p>
554557
<div class="cell docutils container">
555558
<div class="cell_input docutils container">
556-
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># define the model</span>
559+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>
560+
561+
<span class="c1"># define the model</span>
557562
<span class="n">model</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">max_depth</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">max_features</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
558563
</pre></div>
559564
</div>
@@ -562,7 +567,10 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
562567
<p>To evaluate the model performance, we conduct K-fold cross-validation using ‘RepeatedStratifiedKFold’ and ‘cross_val_score’ from ‘sklearn.model_selection’. Here, the training subset is randomly split into 10 folds of equal size, and each fold in turn is held out to test a model trained on the remaining 9 folds, until every one of the 10 folds has been used as the testing set. The average evaluation metric, here ‘accuracy’, is used to represent the model performance. This whole process is repeated 1000 times to obtain the final model performance reported below:</p>
563568
<div class="cell docutils container">
564569
<div class="cell_input docutils container">
565-
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># evaluate the model</span>
570+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">RepeatedStratifiedKFold</span>
571+
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
572+
573+
<span class="c1"># evaluate the model</span>
566574
<span class="n">cv</span> <span class="o">=</span> <span class="n">RepeatedStratifiedKFold</span><span class="p">(</span><span class="n">n_splits</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
567575
<span class="n">n_scores</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s1">&#39;accuracy&#39;</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">cv</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
568576
<span class="c1"># report model performance</span>
@@ -571,15 +579,17 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
571579
</div>
572580
</div>
573581
<div class="cell_output docutils container">
574-
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Mean Score: 0.998049 (SD: 0.002128)
582+
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Mean Score: 0.998038 (SD: 0.002173)
575583
</pre></div>
576584
</div>
577585
</div>
578586
</div>
579587
<p>The mean cross-validation accuracy is 0.998 with a standard deviation of 0.002 over the 1000 repeated cross-validations, indicating that on average only 0.2% of samples (pixels) are incorrectly classified. The distribution of the accuracy values shown below is clustered near 1.00, and all values are higher than 0.98, indicating that the model performance is both accurate and stable across folds.</p>
580588
<div class="cell docutils container">
581589
<div class="cell_input docutils container">
582-
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># the histogram of the scores</span>
590+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
591+
592+
<span class="c1"># the histogram of the scores</span>
583593
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">n_scores</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">&#39;blue&#39;</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.75</span><span class="p">)</span>
584594
<span class="n">plt</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.91</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="sa">r</span><span class="s1">&#39;mean = &#39;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">n_scores</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">6</span><span class="p">))</span> <span class="o">+</span> <span class="s1">&#39; &#39;</span><span class="o">+</span> <span class="s1">&#39;SD = &#39;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">n_scores</span><span class="o">.</span><span class="n">std</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">6</span><span class="p">)))</span>
585595
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">1.01</span><span class="p">)</span>
@@ -604,7 +614,9 @@ <h2>5.3 Feature importance<a class="headerlink" href="#feature-importance" title
604614
The result shows that the blue band provides the most important information for SCA mapping, while the other three bands are all much less important.</p>
605615
<div class="cell docutils container">
606616
<div class="cell_input docutils container">
607-
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
617+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.inspection</span> <span class="kn">import</span> <span class="n">permutation_importance</span>
618+
619+
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
608620
<span class="n">result</span> <span class="o">=</span> <span class="n">permutation_importance</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
609621
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Permutation importance - average:&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(),</span> <span class="n">X_train</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
610622
<span class="nb">print</span><span class="p">([</span><span class="nb">round</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">result</span><span class="o">.</span><span class="n">importances_mean</span><span class="p">])</span>
@@ -620,7 +632,7 @@ <h2>5.3 Feature importance<a class="headerlink" href="#feature-importance" title
620632
</div>
621633
<div class="cell_output docutils container">
622634
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Permutation importance - average: Index([&#39;blue&#39;, &#39;green&#39;, &#39;red&#39;, &#39;nir&#39;], dtype=&#39;object&#39;)
623-
[0.504763, 0.000225, 0.002684, 0.000224]
635+
[0.516662, 0.000393, 0.000746, 0.000474]
624636
</pre></div>
625637
</div>
626638
<img alt="../_images/five_10_1.png" src="../_images/five_10_1.png" />
@@ -632,7 +644,9 @@ <h2>5.4 Save the model<a class="headerlink" href="#save-the-model" title="Permal
632644
<p>We now have our model trained and evaluated. We can save the model using the ‘dump()’ function from the ‘joblib’ package as shown below, so that the next time we want to apply this model, we do not have to repeat the process described above. In the next section, we will discuss how we load this model and apply it to a satellite image.</p>
633645
<div class="cell docutils container">
634646
<div class="cell_input docutils container">
635-
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># save model </span>
647+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">joblib</span>
648+
649+
<span class="c1"># save model </span>
636650
<span class="n">dir_model</span> <span class="o">=</span> <span class="s2">&quot;./models/random_forest_SCA_binary.joblib&quot;</span>
637651
<span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">dir_model</span><span class="p">)</span>
638652
</pre></div>
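
The repeated K-fold evaluation described in section 5.2 of chapters/five.html can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `make_classification` stands in for the four-band sample CSV, the fold and repeat counts are reduced from the chapter's 10 x 1000 so the sketch runs quickly, and the `random_state` values are added here for reproducibility and are not part of the committed code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the four-band training subset (assumption, not the book's data)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Same hyperparameters as the chapter; random_state is added for this sketch only
model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4,
                               random_state=0)

# 5 folds x 3 repeats instead of the chapter's 10 folds x 1000 repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# One accuracy value per fold per repeat, i.e. n_splits * n_repeats values
print(scores.shape)
print("Mean Score: %.6f (SD: %.6f)" % (scores.mean(), scores.std()))
```

With the chapter's settings the same call would return 10 x 1000 = 10000 accuracy values, which is why the histogram in the diff summarises them with a mean and a standard deviation.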

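
The save step in section 5.4 can be checked with a small round trip: dump a fitted model with `joblib.dump()`, load it back with `joblib.load()`, and confirm that both objects give the same predictions. The sketch below assumes synthetic data and a temporary file named `model_sketch.joblib`; both are hypothetical and unrelated to the `./models/random_forest_SCA_binary.joblib` path used in the chapter.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training subset (assumption, not the book's data)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
model.fit(X, y)

# Write the model to a temporary location and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_sketch.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

A model saved this way is generally only safe to load with the same scikit-learn version that produced it, so recording the version next to the file is a reasonable precaution.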