1. Data Exploratory Analysis

About The Project

Identity fraud is the act of misrepresenting who they are when they apply for a product or service, causing businesses close to $56 billions in 2021. This project will tackle identity fraud in credit card application using intensive data analytics and advanced machine learning method to save companies from financial losses.

Tools: Python (NumPy, Pandas, Scikit-learn, PySpark), MS Excel, Spark

Skills: Exploratory Data Analysis, Applied Statistics, Data Visualization, Machine Learning, Feature Engineering, Fraud Analytics

1. Data Exploratory Analysis

The data is about application for credit cards and cell phones with personal identifying information. There are 10 fields and a million records in the data. The data covers applications submitted in the year of 2017.

1.1 Summary Table

(1) Numerical Fields

(2) Categorical Fields

Note

There is no missing value

1.2 Data Quality

1.2.1 Field “SSN

Note

Pad any SSN with less than 9 digit with leading 0s.
Replace frivolous SSN value “999999999” with a record number string to create a unique SSN to avoid false alarm risk

1.2.2 Field “firstname” and “lastname”

Note

There is no data quality issue with the field First Name and Last Name

1.2.3 Field “address”

Note

The address value “123 MAIN ST” appears to be the placeholder or frivolous value, accounting for 0.1079% of the data but appearing 1,012.37% more than the second most popular address in the data. We don't want this value to be a false alarm to the model. Therefore, replace them with a distinct record number to create a unique address for an application.

1.2.4 Field “zip5”

Note

Pad any Zip5 with less than 5 digit with leading 0s.

1.2.5 Field “dob”

Note

The date of birth value '19070626' appears to be the placeholder or frivolous value, accounting for 12.6568% of the data and appearing 2,526.98% more than the second most popular date of birth in the data. We don't want this value to be a false alarm to the model. Therefore, assign them a distinct value for each of those applications.

1.2.6 Field “homephone”

Note

Pad any homephone with less than 10 digit with leading 0s.
Replace frivolous homephone value “9999999999” with a record number string to create a unique homephone to avoid false positive risk

1.3 More Data Visualizations

Note

The normalized daily good applications percentage seems to be more stably constant than the normalized daily bad applications percentage.
Fraud rate is higher later of the week (Wed-Sun) than the early of the week (Mon-Wed)
It’s interesting that many applicants has the age above 110. In fact, the number of applicants within the age group is more than double compared to other age groups. However, the % of fraud applications are quite evenly stable, indicating that the data is not biased fraud towards old applicants even intuitively applicants of that age is quite rare.

2. Features Engineering

Signal of Identity Fraud:

Fraudster originating many applications from the same place: Unusual number of applications with the same address and/or phone number
A compromised identity, perhaps available for purchase: A particular SSN, Name_DOB combination being used with many different addresses, phones combinations

Fraud Modes:

An individual fraudster has gotten a list of identity information (stolen, bought on internet…) and is going through this list applying for many products with many identities. He uses the victims’ core identity information (SSN,Name,DOB) and his own contact information (Address, Phone number).
- How many applications have we seen in the recent past that have that same address or phone number?
A victim’s identity was compromised in a data breach and his core identity information (SSN,Name,DOB) is being used by many fraudsters.
- How many applications have we seen in the recent past that have that SSN or Name_DOB or both?

Approaches:

Note

In total, generated more than 4,000 features tailored to identity theft dynamics, equipping model with targeted insights to effectively flag fraudulent applications.

3. Features Selection

Previously generated 4,000+ variables posed high dimensionality and potential multi-collinearity challenges. Our goal was removing weak predictors to minimize dimensionality while retaining predictive power. Here is the process of features selection:

Note

We measured the model performance using this process: The model scores the fraud scores for all application. We select 3% of applications with the highest score as fraud and measure the accuracy of fraud application captured within that 3% pool.

Below is the best performance by number of features:

Below is the the final optimal selected set of features:

4. Model Development

4.1 Preliminary Model Exploration

We developed and evaluated 7 Machine Learning Classification models, starting with simple and highly interpretable model like Logistic Regression. Then we experimented with non-linear models and adjusted its hype-parameters as well as number of variables to explore bias and variance behaviors. To develop the model, we split the data into train/test set which is the first 10 months and Out-of-time set which is the last 2 months. Each model was trained and evaluated on randomly split train/test set, in the meantime, recording the performance on Out-of-time, for 10 times. Eventually, models were fine-tuned to achieve the highest performance on the Out-of-time set.

The baseline model was Logistic Regression which achieved the average Fraud Detection Rate (FDR) of approximately 51.53% at 3% cutoff rate on Out-of-time (OOT) set. One of the best models can achieve slightly above 55%. Below is the performance result of 7 models:

4.2 Final Model Performance

Neural Network and Random Forest were amongst the model with high performance on Out-of-time. Neural Network performed a slightly better than the other; however, due to training a Neural Network model was more time-consuming, we didn’t select this model. Therefore, we selected Random Forest as the final model:

Below is the performance results of the best model:

Note

The best model was able to achieve 55.07% of Fraud Detection Rate while only rejecting the top 3% of all applications.

5. Estimate Impacts

Experian data from Q3 2022 shows American consumers had an average total credit limit of $28,930 across all revolving credit accounts. As of January 2024, the average American spends more than $1,500 per month on their credit card. Assume these numbers are true, if a fraudster successfully opens a credit card account, the company will lose $28,930 from the fraudster’s illegal spendings. However, if the legit customers wanted to open a credit line but was rejected due to False Positive error of the Fraud model, the company would lose the potential income from the customer’s credit card usage. We assume an average customer would spend $1,500 per month while the company makes 2.5% from the spending fee. J.D. Power found that 51% of Americans can't pay off their entire balance each month and instead let it revolve to the next month. Also, let assume the credit card APR is 22% (or 1.83% monthly). We estimate that the credit card company would lose $563.66/year per false positive error. Also, we assume most customers would use the card for 2 years before they want to cancel. Below is our estimate savings using the fraud model:

Note

We suggest using the model at the Fraud Cutoff Rate at 2% to achieve the optimal annual savings of $34,919,012.56 per 1 million applications with around 1.44% of fraud (from this dataset) according to above assumptions.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
1_EDA.ipynb		1_EDA.ipynb
2_Modeling.ipynb		2_Modeling.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About The Project

1. Data Exploratory Analysis

1.1 Summary Table

1.2 Data Quality

1.2.1 Field “SSN

1.2.2 Field “firstname” and “lastname”

1.2.3 Field “address”

1.2.4 Field “zip5”

1.2.5 Field “dob”

1.2.6 Field “homephone”

1.3 More Data Visualizations

2. Features Engineering

3. Features Selection

4. Model Development

4.1 Preliminary Model Exploration

4.2 Final Model Performance

5. Estimate Impacts

About

Languages

tomvdo29usc/Credit_Card_Application_Identity_Fraud_AIML_Model

Folders and files

Latest commit

History

Repository files navigation

About The Project

1. Data Exploratory Analysis

1.1 Summary Table

1.2 Data Quality

1.2.1 Field “SSN

1.2.2 Field “firstname” and “lastname”

1.2.3 Field “address”

1.2.4 Field “zip5”

1.2.5 Field “dob”

1.2.6 Field “homephone”

1.3 More Data Visualizations

2. Features Engineering

3. Features Selection

4. Model Development

4.1 Preliminary Model Exploration

4.2 Final Model Performance

5. Estimate Impacts

About

Topics

Resources

Stars

Watchers

Forks

Languages