@@ -525,6 +525,9 @@ <h2>5.1 Split data into training and testing subsets<a class="headerlink" href="
 <div class="cell docutils container">
 <div class="cell_input docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># read model input features and labels</span>
+<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
+<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
+
 <span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'./data/samples/sample_100K.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">(</span><span class="s2">"Sample dimensions:"</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
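The hunk above adds the ‘pandas’ and ‘train_test_split’ imports that this cell relies on. As a minimal, self-contained sketch of how the section's hold-out split works: the band names match the notebook's features, but the in-memory sample data, the ‘label’ column name, and the 80/20 split ratio are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy sample standing in for ./data/samples/sample_100K.csv
data = pd.DataFrame({
    "blue":  [0.1, 0.2, 0.8, 0.9] * 25,
    "green": [0.1, 0.2, 0.7, 0.8] * 25,
    "red":   [0.1, 0.2, 0.7, 0.9] * 25,
    "nir":   [0.1, 0.2, 0.6, 0.8] * 25,
    "label": [0, 0, 1, 1] * 25,  # assumed encoding: 1 = snow, 0 = no snow
})

X = data[["blue", "green", "red", "nir"]]
y = data["label"]

# hold out 20% for testing; stratify so both subsets keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

Stratifying is worth the extra keyword here because snow/no-snow pixels are rarely balanced in real scenes, and an unstratified split can leave one subset short of a class.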
@@ -553,7 +556,9 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
 <p>Now that we have the training subset and the optimal parameters, we can run ‘RandomForestClassifier()’ to train our model using the code below:</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
-<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># define the model</span>
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>
+
+<span class="c1"># define the model</span>
 <span class="n">model</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">max_depth</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">max_features</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
 </pre></div>
 </div>
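The cell changed above only defines the estimator; fitting happens later. A runnable sketch of define-then-fit with the same hyperparameters (10 trees, depth 10, all 4 bands considered at each split). The random toy features and the threshold rule that generates the labels are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# toy reflectances standing in for the blue, green, red, nir training features
X_train = rng.random((200, 4))
# assumed toy rule: a bright blue band means snow
y_train = (X_train[:, 0] > 0.5).astype(int)

# same hyperparameters as the notebook cell
model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # training accuracy
```

Note that `max_features=4` with four input bands means every tree considers every band at each split; with more features one would typically leave this at the default to decorrelate the trees.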
@@ -562,7 +567,10 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
 <p>To evaluate the model performance, we conduct K-fold cross-validation using ‘RepeatedStratifiedKFold’ and ‘cross_val_score’ from ‘sklearn.model_selection’. Here, the training subset is randomly split evenly into 10 folds, and each fold in turn is used to test a model trained on the remaining 9 folds of data, until every one of the 10 folds has served as the testing set. The average evaluation metric, here the ‘accuracy’, represents the model performance. This whole process is repeated 1000 times to obtain the final model performance reported below:</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
-<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># evaluate the model</span>
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">RepeatedStratifiedKFold</span>
+<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
+
+<span class="c1"># evaluate the model</span>
 <span class="n">cv</span> <span class="o">=</span> <span class="n">RepeatedStratifiedKFold</span><span class="p">(</span><span class="n">n_splits</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
 <span class="n">n_scores</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s1">'accuracy'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">cv</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
 <span class="c1"># report model performance</span>
@@ -571,15 +579,17 @@ <h2>5.2 Define the random forest model<a class="headerlink" href="#define-the-ra
 </div>
 </div>
 <div class="cell_output docutils container">
-<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Mean Score: 0.998049 (SD: 0.002128)
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Mean Score: 0.998038 (SD: 0.002173)
 </pre></div>
 </div>
 </div>
 </div>
 <p>The overall model training accuracy is 0.998 with a standard deviation of 0.002 over the 1000 repeated cross-validations, indicating that only 0.2% of samples, or pixels, on average are incorrectly classified. If we look at the distribution of the accuracy values shown below, most values cluster near 1.00 and all are above 0.98, indicating that the model training is very accurate and robust.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
-<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># the histogram of the scores</span>
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
+
+<span class="c1"># the histogram of the scores</span>
 <span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">n_scores</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'blue'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.75</span><span class="p">)</span>
 <span class="n">plt</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.91</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="s1">'mean = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">n_scores</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">6</span><span class="p">))</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="s1">'SD = '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">n_scores</span><span class="o">.</span><span class="n">std</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">6</span><span class="p">)))</span>
 <span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">1.01</span><span class="p">)</span>
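The repeated stratified cross-validation described in this hunk can be sketched end to end on toy data. The random features and threshold-based labels are assumptions for illustration, and ‘n_repeats’ is lowered from the notebook's 1000 to 3 so the sketch runs in seconds; the mechanics (10 stratified folds per repeat, one accuracy score per fold) are the same.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# toy stand-ins for the training features and labels
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4)

# the notebook uses n_repeats=1000; 3 keeps this sketch fast
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# one score per fold per repeat: 10 folds x 3 repeats = 30 scores
print("Mean Score: %.6f (SD: %.6f)" % (scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean, as the notebook does, is what lets the text claim robustness rather than a single lucky split.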
@@ -604,7 +614,9 @@ <h2>5.3 Feature importance<a class="headerlink" href="#feature-importance" title
 The result shows that the blue band provides the most important information for SCA mapping, while the other three bands are all much less important.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
-<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.inspection</span> <span class="kn">import</span> <span class="n">permutation_importance</span>
+
+<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
 <span class="n">result</span> <span class="o">=</span> <span class="n">permutation_importance</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">(</span><span class="s1">'Permutation importance - average:'</span><span class="p">,</span> <span class="n">X_train</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">([</span><span class="nb">round</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">result</span><span class="o">.</span><span class="n">importances_mean</span><span class="p">])</span>
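Permutation importance, as used in this hunk, measures how much the score drops when one feature's values are shuffled. A self-contained sketch on toy data whose label depends only on the first column, so that column should dominate the importances; the random features and the threshold rule are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((200, 4))             # toy columns: blue, green, red, nir
y = (X[:, 0] > 0.5).astype(int)      # label depends only on the first band

model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4)
model.fit(X, y)

# shuffle each feature 1000 times (as in the notebook) and measure the
# average accuracy drop; no refitting happens, only re-prediction
result = permutation_importance(model, X, y, n_repeats=1000, random_state=42)
print([round(v, 3) for v in result.importances_mean])
```

Because only predictions are recomputed, raising ‘n_repeats’ is far cheaper than repeating cross-validation, which is why 1000 repeats is practical here.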
@@ -620,7 +632,7 @@ <h2>5.3 Feature importance<a class="headerlink" href="#feature-importance" title
 </div>
 <div class="cell_output docutils container">
 <div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Permutation importance - average: Index(['blue', 'green', 'red', 'nir'], dtype='object')
-[0.504763, 0.000225, 0.002684, 0.000224]
+[0.516662, 0.000393, 0.000746, 0.000474]
 </pre></div>
 </div>
 <img alt="../_images/five_10_1.png" src="../_images/five_10_1.png" />
@@ -632,7 +644,9 @@ <h2>5.4 Save the model<a class="headerlink" href="#save-the-model" title="Permal
 <p>We now have our model trained and evaluated. We can save the model using the ‘dump()’ function from the ‘joblib’ package as shown below, so that the next time we want to apply this model, we do not have to run through the whole process again. In the next section, we will discuss how to load this model and apply it to a satellite image.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
-<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># save model</span>
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">joblib</span>
+
+<span class="c1"># save model</span>
 <span class="n">dir_model</span> <span class="o">=</span> <span class="s2">"./models/random_forest_SCA_binary.joblib"</span>
 <span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">dir_model</span><span class="p">)</span>
 </pre></div>
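The save step in this hunk pairs with a later ‘joblib.load()’. A round-trip sketch of dump-then-load; the tiny stand-in model, its toy training data, and the temporary file path (in place of ./models/random_forest_SCA_binary.joblib) are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier

# a tiny stand-in for the trained SCA model
model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4)
model.fit([[0, 0, 0, 0], [1, 1, 1, 1]] * 10, [0, 1] * 10)

# hypothetical path standing in for ./models/random_forest_SCA_binary.joblib
dir_model = Path(tempfile.mkdtemp()) / "random_forest_SCA_binary.joblib"
joblib.dump(model, dir_model)

# later (e.g. in the next section), reload and reuse without retraining
restored = joblib.load(dir_model)
print(restored.predict([[0.9, 0.9, 0.9, 0.9]]))
```

The restored estimator carries its fitted trees and hyperparameters, so ‘predict()’ works immediately; note that joblib files should be loaded with a scikit-learn version compatible with the one that wrote them.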