diff --git a/README.md b/README.md index 250ac3d..ce0749d 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ with lightgbm. The R version of this package may be found - Has efficient mean matching solutions. - Can utilize GPU training - **Flexible** - - Can impute pandas dataframes and numpy arrays + - Can impute pandas dataframes - Handles categorical data automatically - Fits into a sklearn pipeline - User can customize every aspect of the imputation process @@ -48,58 +48,37 @@ you can find #### Table of Contents: - - [Package - Meta](https://github.com/AnotherSamWilson/miceforest#Package-Meta) - - [The - Basics](https://github.com/AnotherSamWilson/miceforest#The-Basics) - - [Basic - Examples](https://github.com/AnotherSamWilson/miceforest#Basic-Examples) - - [Customizing LightGBM - Parameters](https://github.com/AnotherSamWilson/miceforest#Customizing-LightGBM-Parameters) - - [Available Mean Match - Schemes](https://github.com/AnotherSamWilson/miceforest#Controlling-Tree-Growth) - - [Imputing New Data with Existing - Models](https://github.com/AnotherSamWilson/miceforest#Imputing-New-Data-with-Existing-Models) - - [Saving and Loading - Kernels](https://github.com/AnotherSamWilson/miceforest#Saving-and-Loading-Kernels) - - [Implementing sklearn - Pipelines](https://github.com/AnotherSamWilson/miceforest#Implementing-sklearn-Pipelines) - - [Advanced - Features](https://github.com/AnotherSamWilson/miceforest#Advanced-Features) - - [Customizing the Imputation - Process](https://github.com/AnotherSamWilson/miceforest#Customizing-the-Imputation-Process) - - [Building Models on Nonmissing - Data](https://github.com/AnotherSamWilson/miceforest#Building-Models-on-Nonmissing-Data) - - [Tuning - Parameters](https://github.com/AnotherSamWilson/miceforest#Tuning-Parameters) - - [On - Reproducibility](https://github.com/AnotherSamWilson/miceforest#On-Reproducibility) - - [How to Make the Process - 
Faster](https://github.com/AnotherSamWilson/miceforest#How-to-Make-the-Process-Faster) - - [Imputing Data In - Place](https://github.com/AnotherSamWilson/miceforest#Imputing-Data-In-Place) - - [Diagnostic - Plotting](https://github.com/AnotherSamWilson/miceforest#Diagnostic-Plotting) - - [Imputed - Distributions](https://github.com/AnotherSamWilson/miceforest#Distribution-of-Imputed-Values) - - [Correlation - Convergence](https://github.com/AnotherSamWilson/miceforest#Convergence-of-Correlation) - - [Variable - Importance](https://github.com/AnotherSamWilson/miceforest#Variable-Importance) - - [Mean - Convergence](https://github.com/AnotherSamWilson/miceforest#Variable-Importance) - - [Benchmarks](https://github.com/AnotherSamWilson/miceforest#Benchmarks) - - [Using the Imputed - Data](https://github.com/AnotherSamWilson/miceforest#Using-the-Imputed-Data) - - [The MICE - Algorithm](https://github.com/AnotherSamWilson/miceforest#The-MICE-Algorithm) - - [Introduction](https://github.com/AnotherSamWilson/miceforest#The-MICE-Algorithm) - - [Common Use - Cases](https://github.com/AnotherSamWilson/miceforest#Common-Use-Cases) - - [Predictive Mean - Matching](https://github.com/AnotherSamWilson/miceforest#Predictive-Mean-Matching) - - [Effects of Mean - Matching](https://github.com/AnotherSamWilson/miceforest#Effects-of-Mean-Matching) +This document contains a thorough walkthrough of the package, +benchmarks, and an introduction to multiple imputation. More information +on MICE can be found in Stef van Buuren’s excellent online book, which +you can find +[here](https://stefvanbuuren.name/fimd/ch-introduction.html). 
+ +#### Table of Contents: + + - [Classes](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#classes) + - [Basic Usage](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#basic-usage) + - [Example](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#basic-usage) + - [Customizing LightGBM Parameters](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#customizing-lightgbm-parameters) + - [Available Mean Match Schemes](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#adjusting-the-mean-matching-scheme) + - [Imputing New Data with Existing Models](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#imputing-new-data-with-existing-models) + - [Saving and Loading Kernels](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#saving-and-loading-kernels) + - [Implementing sklearn Pipelines](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#saving-and-loading-kernels) + - [Advanced Features](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#advanced-features) + - [Building Models on Nonmissing Data](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#building-models-on-nonmissing-data) + - [Tuning Parameters](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#tuning-parameters) + - [On Reproducibility](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#on-reproducibility) + - [How to Make the Process Faster](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#how-to-make-the-process-faster) + - [Imputing Data In Place](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#imputing-data-in-place) + - [Diagnostic Plotting](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#diagnostic-plotting) + - [Feature Importance](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#feature-importance) + - [Imputed 
Distributions](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#plot-imputed-distributions) + - [Using the Imputed Data](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#using-the-imputed-data) + - [The MICE Algorithm](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#the-mice-algorithm) + - [Introduction](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#the-mice-algorithm) + - [Common Use Cases](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#common-use-cases) + - [Predictive Mean Matching](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#predictive-mean-matching) + - [Effects of Mean Matching](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#effects-of-mean-matching) ## Installation @@ -350,7 +329,7 @@ cust_kernel.mice( ) ``` -### Imputing New Data with Existing Models +## Imputing New Data with Existing Models Multiple Imputation can take a long time. If you wish to impute a dataset using the MICE algorithm, but don’t have time to train new @@ -434,6 +413,8 @@ assert not np.any(np.isnan(X_train_t)) assert not np.any(np.isnan(X_test_t)) ``` +# Advanced Features + ## Building Models on Nonmissing Data The MICE process itself is used to impute missing data in a dataset. @@ -843,9 +824,9 @@ print(iris_amp.isnull().sum(0)) dtype: int64 -## Diagnostic Plotting +# Diagnostic Plotting -As of now, there is 2 diagnostic plot available. More coming soon! +As of now, there are 2 diagnostic plots available. More coming soon! 
### Feature Importance diff --git a/README_gen.ipynb b/README_gen.ipynb index 8ded140..de36fca 100644 --- a/README_gen.ipynb +++ b/README_gen.ipynb @@ -67,58 +67,29 @@ "\n", "#### Table of Contents:\n", "\n", - " - [Package\n", - " Meta](https://github.com/AnotherSamWilson/miceforest#Package-Meta)\n", - " - [The\n", - " Basics](https://github.com/AnotherSamWilson/miceforest#The-Basics)\n", - " - [Basic\n", - " Examples](https://github.com/AnotherSamWilson/miceforest#Basic-Examples)\n", - " - [Customizing LightGBM\n", - " Parameters](https://github.com/AnotherSamWilson/miceforest#Customizing-LightGBM-Parameters)\n", - " - [Available Mean Match\n", - " Schemes](https://github.com/AnotherSamWilson/miceforest#Controlling-Tree-Growth)\n", - " - [Imputing New Data with Existing\n", - " Models](https://github.com/AnotherSamWilson/miceforest#Imputing-New-Data-with-Existing-Models)\n", - " - [Saving and Loading\n", - " Kernels](https://github.com/AnotherSamWilson/miceforest#Saving-and-Loading-Kernels)\n", - " - [Implementing sklearn\n", - " Pipelines](https://github.com/AnotherSamWilson/miceforest#Implementing-sklearn-Pipelines)\n", - " - [Advanced\n", - " Features](https://github.com/AnotherSamWilson/miceforest#Advanced-Features)\n", - " - [Customizing the Imputation\n", - " Process](https://github.com/AnotherSamWilson/miceforest#Customizing-the-Imputation-Process)\n", - " - [Building Models on Nonmissing\n", - " Data](https://github.com/AnotherSamWilson/miceforest#Building-Models-on-Nonmissing-Data)\n", - " - [Tuning\n", - " Parameters](https://github.com/AnotherSamWilson/miceforest#Tuning-Parameters)\n", - " - [On\n", - " Reproducibility](https://github.com/AnotherSamWilson/miceforest#On-Reproducibility)\n", - " - [How to Make the Process\n", - " Faster](https://github.com/AnotherSamWilson/miceforest#How-to-Make-the-Process-Faster)\n", - " - [Imputing Data In\n", - " Place](https://github.com/AnotherSamWilson/miceforest#Imputing-Data-In-Place)\n", - " - [Diagnostic\n", - 
" Plotting](https://github.com/AnotherSamWilson/miceforest#Diagnostic-Plotting)\n", - " - [Imputed\n", - " Distributions](https://github.com/AnotherSamWilson/miceforest#Distribution-of-Imputed-Values)\n", - " - [Correlation\n", - " Convergence](https://github.com/AnotherSamWilson/miceforest#Convergence-of-Correlation)\n", - " - [Variable\n", - " Importance](https://github.com/AnotherSamWilson/miceforest#Variable-Importance)\n", - " - [Mean\n", - " Convergence](https://github.com/AnotherSamWilson/miceforest#Variable-Importance)\n", - " - [Benchmarks](https://github.com/AnotherSamWilson/miceforest#Benchmarks)\n", - " - [Using the Imputed\n", - " Data](https://github.com/AnotherSamWilson/miceforest#Using-the-Imputed-Data)\n", - " - [The MICE\n", - " Algorithm](https://github.com/AnotherSamWilson/miceforest#The-MICE-Algorithm)\n", - " - [Introduction](https://github.com/AnotherSamWilson/miceforest#The-MICE-Algorithm)\n", - " - [Common Use\n", - " Cases](https://github.com/AnotherSamWilson/miceforest#Common-Use-Cases)\n", - " - [Predictive Mean\n", - " Matching](https://github.com/AnotherSamWilson/miceforest#Predictive-Mean-Matching)\n", - " - [Effects of Mean\n", - " Matching](https://github.com/AnotherSamWilson/miceforest#Effects-of-Mean-Matching)" + " - [Classes](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#classes)\n", + " - [Basic Usage](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#basic-usage)\n", + " - [Example](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#basic-usage)\n", + " - [Customizing LightGBM Parameters](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#customizing-lightgbm-parameters)\n", + " - [Available Mean Match Schemes](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#adjusting-the-mean-matching-scheme)\n", + " - [Imputing New Data with Existing 
Models](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#imputing-new-data-with-existing-models)\n", + " - [Saving and Loading Kernels](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#saving-and-loading-kernels)\n", + " - [Implementing sklearn Pipelines](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#saving-and-loading-kernels)\n", + " - [Advanced Features](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#advanced-features)\n", + " - [Building Models on Nonmissing Data](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#building-models-on-nonmissing-data)\n", + " - [Tuning Parameters](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#tuning-parameters)\n", + " - [On Reproducibility](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#on-reproducibility)\n", + " - [How to Make the Process Faster](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#how-to-make-the-process-faster)\n", + " - [Imputing Data In Place](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#imputing-data-in-place)\n", + " - [Diagnostic Plotting](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#diagnostic-plotting)\n", + " - [Feature Importance](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#feature-importance)\n", + " - [Imputed Distributions](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#plot-imputed-distributions)\n", + " - [Using the Imputed Data](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#using-the-imputed-data)\n", + " - [The MICE Algorithm](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#the-mice-algorithm)\n", + " - [Introduction](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#the-mice-algorithm)\n", + " - [Common Use Cases](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#common-use-cases)\n", + " - [Predictive 
Mean Matching](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#predictive-mean-matching)\n", + " - [Effects of Mean Matching](https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#effects-of-mean-matching)" ] }, { @@ -475,7 +446,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Imputing New Data with Existing Models\n", + "## Imputing New Data with Existing Models\n", "\n", "Multiple Imputation can take a long time. If you wish to impute a\n", "dataset using the MICE algorithm, but don’t have time to train new\n", @@ -586,6 +557,13 @@ "assert not np.any(np.isnan(X_test_t))" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Advanced Features" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1128,9 +1106,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Diagnostic Plotting\n", + "# Diagnostic Plotting\n", "\n", - "As of now, there is 2 diagnostic plot available. More coming soon!" + "As of now, there are 2 diagnostic plots available. More coming soon!" ] }, { diff --git a/docs/index.rst b/docs/index.rst index 1007dcb..f7aaf0e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -49,7 +49,7 @@ Equations (MICE) with lightgbm. The R version of this package may be found - Has efficient mean matching solutions. - Can utilize GPU training - **Flexible** - - Can impute pandas dataframes and numpy arrays + - Can impute pandas dataframes - Handles categorical data automatically - Fits into a sklearn pipeline - User can customize every aspect of the imputation process