Term project for the course 'Advanced Databases'.
The project objective was to process a large dataset stored on Apache Hadoop's distributed file system (HDFS) using the Apache Spark processing engine. The dataset covers crimes recorded by the Los Angeles Police Department (LAPD) from 2010 to 2024: each record includes the date and time of the crime, the crime type, the area where it was committed, and the type of location (e.g. street, residence).
- Install Hadoop + Spark (instructions at `installation`)
- To set up the VMs locally, follow the same installation steps, skipping the okeanos part and using virtualization software such as VirtualBox.
- Remember to put the VMs on the same network.
- Download the data by executing the `download_data.sh` script
- Start the services:
  ```
  start-dfs.sh
  start-yarn.sh
  start-master.sh
  start-worker.sh spark://master:7077
  ```
- Import the data into HDFS with:
  ```
  hdfs dfs -put advDatabases/data/*.csv hdfs://master:9000/user/user/data
  ```
- Execute `spark-submit --master spark://master:7077 advDatabases/Config_data.py` to create the complete dataset (a sketch of this step follows the list)
- Execute each query with `spark-submit --master spark://master:7077 advDatabases/q{i}_{DF,SQL,RDD}.py` (or execute `run_all.sh` to run all queries)
❗ Steps 3 and 4 can be skipped by executing the `reset_master.sh` script at the master node and `reset_worker.sh` at the worker node.
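For reference, a minimal sketch of what a preparation step like `Config_data.py` could look like, assuming the raw CSVs share a compatible schema; the output path and parquet format are assumptions, not necessarily what the project's script actually does:

```python
# Hypothetical sketch of a preparation step like Config_data.py: read the
# raw LAPD CSVs already pushed to HDFS and persist them as one dataset.
# The output path and the parquet format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Config_data").getOrCreate()

# Read every CSV under the data directory; header handling and schema
# inference keep the sketch short at the cost of an extra pass over the files.
crimes = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs://master:9000/user/user/data/*.csv"))

# Persist the combined dataset for the query scripts to read back.
crimes.write.mode("overwrite").parquet("hdfs://master:9000/user/user/crimes")

spark.stop()
```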
- Find, for each year, the 3 months with the highest number of recorded crimes, and print the total number of criminal activities recorded in each of those months along with the month's rank within the corresponding year (first sketch after this list).
- Sort the parts of the day by the number of crimes recorded on the street (STREET), where the day is divided into (second sketch after this list):
  - Morning: 05:00–11:59
  - Afternoon: 12:00–16:59
  - Evening: 17:00–20:59
  - Night: 21:00–04:59
- Find the descent of crime victims recorded in Los Angeles during 2015 in the 3 areas (ZIP codes) with the highest and the 3 areas (ZIP codes) with the lowest income per household (third sketch after this list).
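One way to express the first query with the DataFrame API is sketched below. The `DATE OCC` column and its `MM/dd/yyyy hh:mm:ss a` format follow the public LAPD schema; the parquet input path matches the preparation sketch above and is an assumption.

```python
# Sketch of the first query: per year, the 3 months with the most
# recorded crimes, each with its crime count and its rank in that year.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("q1_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")

# Parse the occurrence timestamp and count crimes per (year, month).
by_month = (crimes
            .withColumn("ts", F.to_timestamp(F.col("DATE OCC"),
                                             "MM/dd/yyyy hh:mm:ss a"))
            .groupBy(F.year("ts").alias("year"), F.month("ts").alias("month"))
            .agg(F.count("*").alias("crime_total")))

# Rank the months inside each year by crime count and keep the top 3.
w = Window.partitionBy("year").orderBy(F.desc("crime_total"))
(by_month
 .withColumn("rank", F.row_number().over(w))
 .filter(F.col("rank") <= 3)
 .orderBy("year", "rank")
 .show(45))

spark.stop()
```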
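A sketch for the second query, bucketing crimes on the street into the four parts of the day. `TIME OCC` (an hhmm integer) and `Premis Desc` follow the public LAPD schema; the input path is again assumed.

```python
# Sketch of the second query: rank the parts of the day by the number of
# crimes committed on the street.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("q2_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")

# Map the hhmm occurrence time onto the four buckets from the README.
t = F.col("TIME OCC").cast("int")
part_of_day = (F.when((t >= 500) & (t <= 1159), "Morning")
                .when((t >= 1200) & (t <= 1659), "Afternoon")
                .when((t >= 1700) & (t <= 2059), "Evening")
                .otherwise("Night"))  # 21:00-04:59 wraps around midnight

(crimes
 .filter(F.col("Premis Desc") == "STREET")
 .withColumn("part_of_day", part_of_day)
 .groupBy("part_of_day")
 .count()
 .orderBy(F.desc("count"))
 .show())

spark.stop()
```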
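A sketch for the third query. It assumes two auxiliary datasets that the LAPD data itself does not carry: a reverse-geocoding table mapping each record's coordinates to a ZIP code, and a 2015 household-income table keyed by ZIP code. All paths and every non-LAPD column name (`ZIPcode`, `income`) are hypothetical.

```python
# Sketch of the third query: victim descent in 2015 for the 3 highest-
# and 3 lowest-income ZIP codes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("q3_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")
zips = (spark.read.option("header", True).option("inferSchema", True)
        .csv("hdfs://master:9000/user/user/revgeocoding.csv"))       # LAT, LON, ZIPcode (assumed)
income = (spark.read.option("header", True).option("inferSchema", True)
          .csv("hdfs://master:9000/user/user/LA_income_2015.csv"))   # ZIPcode, income (assumed)

# The 3 richest and the 3 poorest ZIP codes by household income.
extremes = (income.orderBy(F.desc("income")).limit(3)
            .union(income.orderBy(F.asc("income")).limit(3)))

# Victim descent for crimes recorded in those areas during 2015.
(crimes
 .withColumn("ts", F.to_timestamp(F.col("DATE OCC"), "MM/dd/yyyy hh:mm:ss a"))
 .filter(F.year("ts") == 2015)
 .join(zips, ["LAT", "LON"])
 .join(extremes, "ZIPcode")
 .groupBy("Vict Descent")
 .count()
 .orderBy(F.desc("count"))
 .show())

spark.stop()
```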
- Apache Spark v3.5.0
- Apache Hadoop v3.3.6
- Python3 v3.10.12
- OpenJDK v11.0.21
- 1 Master Node: 192.168.64.9
  - 1 Master (Spark)
  - 1 Worker (Spark)
  - 1 Namenode (HDFS)
  - 1 Datanode (HDFS)
- 1 Worker Node: 192.168.64.10
  - 1 Worker (Spark)
  - 1 Datanode (HDFS)
To monitor the cluster's health and the applications' execution:
- Apache Spark : http://192.168.64.9:8080
- Apache Hadoop : http://192.168.64.9:9870
- Apache Hadoop YARN : http://192.168.64.9:8088