The purpose of these packages (solutions expressed in Java and Spark) is to show how to implement basic machine learning algorithms (K-Means, Naive Bayes, Logistic Regression, Linear Regression, ...) using Spark and its MLlib library. Spark's MLlib offers a suite of machine learning algorithms, including
- Naive Bayes
- Logistic Regression
- K-Means
- Linear Regression
- "Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions." (source: https://en.wikipedia.org/wiki/Machine_learning)
- "Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it." (source: https://www.coursera.org/learn/machine-learning)
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition
K-Means is a clustering algorithm that partitions a dataset into K (where K > 1) clusters; it is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. Here we look at how to implement K-Means clustering in Spark to cluster the featurized Wikipedia dataset (a minimal training sketch follows the class list below).
For details on K-Means clustering, you should read
- A Tutorial on Clustering Algorithms
- K-means and Hierarchical Clustering
- K-Means: Step-By-Step Example
- org.dataalgorithms.machinelearning.kmeans.Featurization
This is a standalone Spark program to featurize the WikiStats (Wikipedia statistics) dataset.
- org.dataalgorithms.machinelearning.kmeans.WikipediaKMeansUsingUtilVector
This solution implements the K-Means algorithm using the org.apache.spark.util.Vector class.
- org.dataalgorithms.machinelearning.kmeans.WikipediaKMeansUsingMLlibVector
This solution implements the K-Means algorithm using the org.apache.spark.mllib.linalg.Vector interface.
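As a rough illustration of how the classes above use MLlib, here is a minimal K-Means training sketch in Java. The input path, the comma-separated feature format, and the values of k and maxIterations are assumptions for illustration only, not the exact logic of WikipediaKMeansUsingMLlibVector:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("KMeansSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // hypothetical input: each line is an already-featurized, comma-separated numeric vector
    JavaRDD<Vector> features = sc.textFile("featurized-wikistats.txt").map(line -> {
      String[] tokens = line.split(",");
      double[] values = new double[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        values[i] = Double.parseDouble(tokens[i]);
      }
      return Vectors.dense(values);
    });
    features.cache();

    int k = 8;               // number of clusters (assumption)
    int maxIterations = 20;  // maximum number of K-Means iterations
    KMeansModel model = KMeans.train(features.rdd(), k, maxIterations);

    // inspect the learned cluster centers and the within-set sum of squared errors
    for (Vector center : model.clusterCenters()) {
      System.out.println("cluster center: " + center);
    }
    System.out.println("WSSSE = " + model.computeCost(features.rdd()));

    sc.stop();
  }
}
```

Once the model is built, assigning a cluster to a new featurized record is just model.predict(vector).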
Use simple logistic regression when you have one nominal variable and one measurement variable, and you want to know whether variation in the measurement variable causes variation in the nominal variable.
For details on Logistic Regression, you should read
- Logistic Regression
- A Tutorial on Logistic Regression
- Logistic Regression Tutorial
- Logit Models for Binary Data
- Introduction to Logistic Regression
These Spark programs detect breast cancer using a Logistic Regression model (a minimal end-to-end sketch follows the class listings below):
- org.dataalgorithms.machinelearning.logistic.BreastCancerDetectionBuildModel
The class BreastCancerDetectionBuildModel builds the model from the given training data.
- org.dataalgorithms.machinelearning.logistic.BreastCancerDetection
This is the driver class, which uses the built model to classify new query data.
This solution classifies emails as spam or non-spam:
- org.dataalgorithms.machinelearning.logistic.EmailSpamDetectionBuildModel
The class EmailSpamDetectionBuildModel builds the model from the given training data.
- org.dataalgorithms.machinelearning.logistic.EmailSpamDetection
This is the driver class, which uses the built model to classify new query data.
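As a hedged sketch of the build-then-classify workflow these classes implement, the fragment below trains a binary logistic regression model with MLlib and applies it to a new record. The file format, the model save path, and the sample feature values are illustrative assumptions, not the actual data used by the classes above:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class LogisticRegressionSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LogisticRegressionSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // hypothetical training file: label (0.0 or 1.0) followed by comma-separated numeric features
    JavaRDD<LabeledPoint> training = sc.textFile("training.csv").map(line -> {
      String[] tokens = line.split(",");
      double label = Double.parseDouble(tokens[0]);
      double[] features = new double[tokens.length - 1];
      for (int i = 1; i < tokens.length; i++) {
        features[i - 1] = Double.parseDouble(tokens[i]);
      }
      return new LabeledPoint(label, Vectors.dense(features));
    });
    training.cache();

    // build a binary (2-class) logistic regression model from the training data
    LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(2)
        .run(training.rdd());

    // persist the model so a separate driver program can reload it via LogisticRegressionModel.load()
    model.save(sc.sc(), "/tmp/logistic-regression-model"); // hypothetical path

    // classify a new query record (feature values are made up for illustration)
    double prediction = model.predict(Vectors.dense(5.1, 3.5, 1.4, 0.2));
    System.out.println("predicted class = " + prediction);

    sc.stop();
  }
}
```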
"The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically. Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method." (source: http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)
The Naive Bayes classifier is based on Bayes' theorem, which relates the conditional probabilities of events a and b:
P(a|b) = (P(b|a) * P(a)) / P(b)
For details on Naive Bayes, you should read
The following Spark classes may be used to implement Naive Bayes (a minimal usage sketch follows the list):
- org.apache.spark.mllib.classification.NaiveBayes: trains a Naive Bayes model given an RDD of (label, features) pairs.
- org.apache.spark.mllib.classification.NaiveBayesModel: model for Naive Bayes classifiers.
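To make the two classes above concrete, here is a minimal, self-contained sketch that trains and applies a Naive Bayes model. The tiny in-memory training set and the smoothing parameter are illustrative assumptions; real data would come from a file of (label, features) records:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class NaiveBayesSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("NaiveBayesSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // tiny in-memory training set of (label, features) pairs, for illustration only
    JavaRDD<LabeledPoint> training = sc.parallelize(Arrays.asList(
        new LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
        new LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
        new LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.5)),
        new LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))
    ));

    // lambda = 1.0 is the additive (Laplace) smoothing parameter
    NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

    // classify a new feature vector
    double predicted = model.predict(Vectors.dense(0.0, 1.5, 0.7));
    System.out.println("predicted label = " + predicted);

    sc.stop();
  }
}
```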
Regression analysis is the art and science of fitting straight lines to patterns of data. Linear regression is a simple approach to supervised learning: it assumes that the dependence of Y on X1, X2, ..., Xn is linear. In reality, true regression functions are never exactly linear, but the linear model is often a simple and useful approximation (a minimal training sketch follows the reading list below).
For details on Linear Regression, you should read
- Blog: Predicting Car Prices: Linear Regression http://www.datasciencecentral.com/profiles/blogs/predicting-car-prices-part-1-linear-regression
- Training Data Set: https://github.com/datailluminations/PredictingToyotaPricesBlog/blob/master/ToyotaCorolla.csv
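As a hedged sketch of how such a price-prediction model could be trained with MLlib, the fragment below fits a linear regression model with stochastic gradient descent. It assumes the CSV has already been reduced to purely numeric columns with the price first; that assumption, the step size, and the sample query vector are illustrative, not taken from the blog or the dataset as published:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

public class LinearRegressionSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LinearRegressionSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // hypothetical preprocessed CSV: price (label) first, then comma-separated numeric features
    JavaRDD<LabeledPoint> training = sc.textFile("toyota-corolla-numeric.csv").map(line -> {
      String[] tokens = line.split(",");
      double price = Double.parseDouble(tokens[0]);
      double[] features = new double[tokens.length - 1];
      for (int i = 1; i < tokens.length; i++) {
        features[i - 1] = Double.parseDouble(tokens[i]);
      }
      return new LabeledPoint(price, Vectors.dense(features));
    });
    training.cache();

    int numIterations = 100;
    double stepSize = 0.0000001; // SGD is sensitive to step size; features usually need scaling
    LinearRegressionModel model =
        LinearRegressionWithSGD.train(training.rdd(), numIterations, stepSize);

    // predict the price of a new car; the vector must match the training feature layout
    double predictedPrice = model.predict(Vectors.dense(23.0, 46986.0, 90.0, 2000.0));
    System.out.println("predicted price = " + predictedPrice);

    sc.stop();
  }
}
```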
- View Mahmoud Parsian's profile on LinkedIn
- Please send me an email: [email protected]
- Twitter: @mahmoudparsian
Thank you!
best regards,
Mahmoud Parsian