Term project for the course 'Advanced Databases'.
The project objective was to process a large dataset stored on Apache Hadoop's distributed file system (HDFS) using the Apache Spark processing engine. The dataset covers crimes recorded by the Los Angeles Police Department (LAPD) from 2010 to 2024: each record includes the date and time of the crime, the crime type, the area where it was committed, and the type of location (e.g. street, residence).
- Install Hadoop + Spark (instructions at `installation`)
- To set up the VMs locally, follow the same installation steps, skipping the okeanos part and using virtualization software such as VirtualBox.
- Remember to put the VMs on the same network.
- Download the data by executing the `download_data.sh` script
- Start the services:
  ```
  start-dfs.sh
  start-yarn.sh
  start-master.sh
  start-worker.sh spark://master:7077
  ```
- Import the data into HDFS with:
  ```
  hdfs dfs -put advDatabases/data/*.csv hdfs://master:9000/user/user/data
  ```
- Execute `spark-submit --master spark://master:7077 advDatabases/Config_data.py` to create the complete dataset (a sketch of this step follows the list)
- Execute each query with `spark-submit --master spark://master:7077 advDatabases/q{i}_{DF,SQL,RDD}.py` (or execute `run_all.sh` to run all queries)
❗ Steps 3 and 4 can be skipped by executing the `reset_master.sh` script at the master node and `reset_worker.sh` at the worker node.
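For reference, a minimal sketch of what a preparation step like `Config_data.py` could look like, assuming the raw CSVs share a compatible schema; the output path and parquet format are assumptions, not necessarily what the project's script actually does:

```python
# Hypothetical sketch of a preparation step like Config_data.py: read the
# raw LAPD CSVs already pushed to HDFS and persist them as one dataset.
# The output path and the parquet format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Config_data").getOrCreate()

# Read every CSV under the data directory; header handling and schema
# inference keep the sketch short at the cost of an extra pass over the files.
crimes = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs://master:9000/user/user/data/*.csv"))

# Persist the combined dataset for the query scripts to read back.
crimes.write.mode("overwrite").parquet("hdfs://master:9000/user/user/crimes")

spark.stop()
```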
- Find, for each year, the 3 months with the highest number of recorded crimes, and print the total number of criminal activities recorded in each of those months along with the month's rank within the corresponding year (first sketch after this list).
- Sort the parts of the day by the number of crimes recorded on the street (STREET), where the day is divided into (second sketch after this list):
  - Morning: 05:00–11:59
  - Afternoon: 12:00–16:59
  - Evening: 17:00–20:59
  - Night: 21:00–04:59
- Find the descent of crime victims recorded in Los Angeles during 2015 in the 3 areas (ZIP codes) with the highest and the 3 areas (ZIP codes) with the lowest income per household (third sketch after this list).
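One way to express the first query with the DataFrame API is sketched below. The `DATE OCC` column and its `MM/dd/yyyy hh:mm:ss a` format follow the public LAPD schema; the parquet input path matches the preparation sketch above and is an assumption.

```python
# Sketch of the first query: per year, the 3 months with the most
# recorded crimes, each with its crime count and its rank in that year.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("q1_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")

# Parse the occurrence timestamp and count crimes per (year, month).
by_month = (crimes
            .withColumn("ts", F.to_timestamp(F.col("DATE OCC"),
                                             "MM/dd/yyyy hh:mm:ss a"))
            .groupBy(F.year("ts").alias("year"), F.month("ts").alias("month"))
            .agg(F.count("*").alias("crime_total")))

# Rank the months inside each year by crime count and keep the top 3.
w = Window.partitionBy("year").orderBy(F.desc("crime_total"))
(by_month
 .withColumn("rank", F.row_number().over(w))
 .filter(F.col("rank") <= 3)
 .orderBy("year", "rank")
 .show(45))

spark.stop()
```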
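A sketch for the second query, bucketing crimes on the street into the four parts of the day. `TIME OCC` (an hhmm integer) and `Premis Desc` follow the public LAPD schema; the input path is again assumed.

```python
# Sketch of the second query: rank the parts of the day by the number of
# crimes committed on the street.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("q2_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")

# Map the hhmm occurrence time onto the four buckets from the README.
t = F.col("TIME OCC").cast("int")
part_of_day = (F.when((t >= 500) & (t <= 1159), "Morning")
                .when((t >= 1200) & (t <= 1659), "Afternoon")
                .when((t >= 1700) & (t <= 2059), "Evening")
                .otherwise("Night"))  # 21:00-04:59 wraps around midnight

(crimes
 .filter(F.col("Premis Desc") == "STREET")
 .withColumn("part_of_day", part_of_day)
 .groupBy("part_of_day")
 .count()
 .orderBy(F.desc("count"))
 .show())

spark.stop()
```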
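A sketch for the third query. It assumes two auxiliary datasets that the LAPD data itself does not carry: a reverse-geocoding table mapping each record's coordinates to a ZIP code, and a 2015 household-income table keyed by ZIP code. All paths and every non-LAPD column name (`ZIPcode`, `income`) are hypothetical.

```python
# Sketch of the third query: victim descent in 2015 for the 3 highest-
# and 3 lowest-income ZIP codes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("q3_DF").getOrCreate()

crimes = spark.read.parquet("hdfs://master:9000/user/user/crimes")
zips = (spark.read.option("header", True).option("inferSchema", True)
        .csv("hdfs://master:9000/user/user/revgeocoding.csv"))       # LAT, LON, ZIPcode (assumed)
income = (spark.read.option("header", True).option("inferSchema", True)
          .csv("hdfs://master:9000/user/user/LA_income_2015.csv"))   # ZIPcode, income (assumed)

# The 3 richest and the 3 poorest ZIP codes by household income.
extremes = (income.orderBy(F.desc("income")).limit(3)
            .union(income.orderBy(F.asc("income")).limit(3)))

# Victim descent for crimes recorded in those areas during 2015.
(crimes
 .withColumn("ts", F.to_timestamp(F.col("DATE OCC"), "MM/dd/yyyy hh:mm:ss a"))
 .filter(F.year("ts") == 2015)
 .join(zips, ["LAT", "LON"])
 .join(extremes, "ZIPcode")
 .groupBy("Vict Descent")
 .count()
 .orderBy(F.desc("count"))
 .show())

spark.stop()
```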
- Apache Spark v3.5.0
- Apache Hadoop v3.3.6
- Python3 v3.10.12
- OpenJDK v11.0.21
- 1 Master Node: 192.168.64.9
  - 1 Master (Spark)
  - 1 Worker (Spark)
  - 1 Namenode (HDFS)
  - 1 Datanode (HDFS)
- 1 Worker Node: 192.168.64.10
  - 1 Worker (Spark)
  - 1 Datanode (HDFS)
To monitor the cluster's health and the applications' execution:
- Apache Spark : http://192.168.64.9:8080
- Apache Hadoop : http://192.168.64.9:9870
- Apache Hadoop YARN : http://192.168.64.9:8088