Updating the project - my assignment #14

Open
wants to merge 1 commit into base: main
342 changes: 342 additions & 0 deletions Data Prep/.ipynb_checkpoints/DataPrep-checkpoint.ipynb
@@ -0,0 +1,342 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/KutayAkalin/ML_Course_9-11-20/blob/main/Data%20Prep/DataPrep.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" All rights reserved © Global AI Hub 2020 \n",
"![](img/logo.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building ML Project"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/flow.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Steps of Project\n",
"\n",
"- Gathering Data\n",
"- Preparing the Data\n",
"- Choosing Models\n",
"- Training\n",
"- Evaluation\n",
"- Hyperparameter Tuning\n",
"- Prediction\n",
"- Model Selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Gathering Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the problem definition, we need to obtain data which will be appropriate for our case. The quality and quantity of data that you gather will directly determine how good our predictive model can be."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Preparing the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data preparation, where we load our data into a suitable place and prepare it for use in our machine learning training. This is also a good time to do any pertinent visualizations of your data, to help you see if there are any relevant relationships between different variables you can take advantage of, as well as show you if there are any data imbalances."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploratory Data Analysis (EDA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics and graphical representations."
]
},
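{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal first-look EDA sketch, assuming the data has been loaded into a pandas DataFrame; the file name `data.csv` and the DataFrame name `df` are placeholders:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('data.csv')  # placeholder path\n",
"\n",
"df.info()        # column types and non-null counts\n",
"df.describe()    # summary statistics for numeric columns"
]
},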
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/wordcloud.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/hist2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/bar2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pre-Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Duplicate Values\n",
"In most cases, the duplicates are removed so as to not give that particular data object an advantage or bias, when running machine learning algorithms."
]
},
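{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of handling duplicates with pandas, assuming the DataFrame `df` from the EDA step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.duplicated().sum())   # number of exact duplicate rows\n",
"df = df.drop_duplicates()      # keep only the first occurrence of each row"
]
},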
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Imbalanced Data\n",
"An Imbalanced dataset is one where the number of instances of a class(es) are significantly higher than another class(es), thus leading to an imbalance and creating rarer class(es)."
]
},
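{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to inspect and reduce class imbalance is upsampling the minority class; the label column `target` and its classes 0 and 1 are hypothetical:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.utils import resample\n",
"\n",
"print(df['target'].value_counts())                    # class distribution\n",
"\n",
"majority = df[df['target'] == 0]\n",
"minority = df[df['target'] == 1]\n",
"minority_up = resample(minority, replace=True,\n",
"                       n_samples=len(majority), random_state=42)\n",
"df_balanced = pd.concat([majority, minority_up])      # roughly balanced classes"
]
},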
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/imbalance.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Missing Values\n",
"\n",
"- Eleminate missing values\n",
"- Filling with mean or median\n",
"\n",
"`df.isnull().sum() ` \n",
"`df.dropna()`"
]
},
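{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of both options above, assuming `df` contains a hypothetical numeric column `age`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.isnull().sum())                          # missing values per column\n",
"\n",
"df_dropped = df.dropna()                          # option 1: drop rows with missing values\n",
"df['age'] = df['age'].fillna(df['age'].median())  # option 2: fill with the median"
]
},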
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/skew.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Outlier Detection\n",
"\n",
"- Standart Deviation\n",
"- Box Plots / IQR Calculation\n",
"- Isolation Forest\n",
"\n",
"\n",
"`from sklearn.ensemble import IsolationForest`"
]
},
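{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of flagging outliers with the IQR rule and with Isolation Forest, again using the hypothetical `age` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import IsolationForest\n",
"\n",
"q1, q3 = df['age'].quantile([0.25, 0.75])\n",
"iqr = q3 - q1\n",
"iqr_outliers = df[(df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)]\n",
"\n",
"iso = IsolationForest(contamination=0.05, random_state=42)\n",
"df['is_outlier'] = iso.fit_predict(df[['age']])   # -1 marks predicted outliers"
]
},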
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/stddev.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/IQR.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Scaling\n",
"\n",
"- Standardization \n",
"$$ X_{new} = \\frac{X-\\mu}{\\sigma} $$ \n",
"- Normalization \n",
"$$X_{new} = \\frac{X-X_{min}}{X_{max} - X_{min}} $$ \n"
]
},
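{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of both scalings with scikit-learn; `num_cols` lists hypothetical numeric columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"num_cols = ['age', 'income']                           # hypothetical numeric columns\n",
"X_std = StandardScaler().fit_transform(df[num_cols])   # (X - mean) / std\n",
"X_norm = MinMaxScaler().fit_transform(df[num_cols])    # (X - min) / (max - min)"
]
},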
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/stndr.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/norm.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bucketing (Binning)\n",
"\n",
"Data binning, bucketing is a data pre-processing method used to minimize the effects of small observation errors (noisy data). The original data values are divided into small intervals known as bins and then they are replaced by a general value calculated for that bin. \n",
"\n",
"![](img/binning.png)"
]
},
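{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of binning the hypothetical `age` column into labelled intervals with pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 120],\n",
"                       labels=['child', 'young', 'adult', 'senior'])\n",
"# pd.qcut(df['age'], q=4) would instead create four equal-frequency bins"
]
},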
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Extraction\n",
"- Principle Components Analysis (PCA)\n",
"- Independent Component Analysis (ICA)\n",
"- Linear Discriminant Analysis (LDA)\n",
"- t-distributed Stochastic Neighbor Embedding (t-SNE)\n",
"\n",
"Example: \n",
"$$Profit = Revenue - Cost$$"
]
},
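{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of PCA, one of the techniques listed above, applied to hypothetical numeric columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"num_cols = ['age', 'income']                  # hypothetical numeric columns\n",
"pca = PCA(n_components=2)\n",
"components = pca.fit_transform(df[num_cols])  # project onto two principal components\n",
"print(pca.explained_variance_ratio_)          # variance explained by each component"
]
},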
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Encoding\n",
"Feature encoding is basically performing transformations on the data such that it can be easily accepted as input for machine learning algorithms while still retaining its original meaning.\n",
"\n",
"- **Nominal** : Any one-to-one mapping can be done which retains the meaning. For instance, a permutation of values like in One-Hot Encoding.\n",
"- **Ordinal** : An order-preserving change of values. The notion of small, medium and large can be represented equally well with the help of a new function. For example, we can encode this S, M and L sizes into {0, 1, 2} or maybe {1, 2, 3}."
]
},
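{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of one-hot (nominal) and ordinal encoding with pandas, using hypothetical `city` and `size` columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.get_dummies(df, columns=['city'])        # one-hot encode a nominal column\n",
"size_order = {'S': 0, 'M': 1, 'L': 2}\n",
"df['size_encoded'] = df['size'].map(size_order)  # order-preserving ordinal encoding"
]
},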
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/encode.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Train / Validation / Test Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But before we start deciding the algorithm which should be used, it is always advised to split the dataset into 2 or sometimes 3 parts. Machine Learning algorithms, or any algorithm for that matter, has to be first trained on the data distribution available and then validated and tested, before it can be deployed to deal with real-world data. \n",
"\n",
"- 60 / 20 / 20\n",
"- 70 / 30\n",
"\n",
"`from sklearn.model_selection import train_test_split`"
]
},
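{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of a 60 / 20 / 20 split, assuming the hypothetical label column `target`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop(columns=['target'])\n",
"y = df['target']\n",
"X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)"
]
},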
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/split.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Cross Validation\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/cross_valid.png)\n",
"`from sklearn.model_selection import cross_validate`\n",
"\n",
"https://scikit-learn.org/stable/modules/cross_validation.html"
]
},
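{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of 5-fold cross-validation with a simple model, reusing the hypothetical `X` and `y` from the split example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_validate\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5)\n",
"print(cv_results['test_score'].mean())   # average validation score across folds"
]
},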
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources\n",
"\n",
"https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d \n",
"https://developers.google.com/machine-learning/data-prep \n",
"https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623\n",
"https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/ \n",
"https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02 \n",
"https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2 \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
2 changes: 1 addition & 1 deletion Data Prep/DataPrep.ipynb
@@ -334,7 +334,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.9"
}
},
"nbformat": 4,