Generate report from the public data set

spark-submit \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-s3:1.12.180,com.amazonaws:aws-java-sdk-core:1.12.180 \
  ny_tlc_report.py \
  --input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'

Notes:

  • The AnonymousAWSCredentialsProvider is required because nyc-tlc is a public S3 bucket, which this example reads without AWS credentials.
  • The --packages list contains the job dependencies for the AWS S3 connection.
  • Only one file, yellow_tripdata_2021-07.csv, is used for reporting in this example.
  • This example works with spark-3.1.3-bin-hadoop3.2 and Python 3.9.7. Other versions may require different dependency versions or even completely different dependencies.
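
The same two settings can also be supplied from Python instead of the command line. The following is a minimal sketch, not taken from ny_tlc_report.py: the helper names s3a_anonymous_conf and build_session and the application name are hypothetical, and it assumes that spark.jars.packages is set before the Spark JVM starts (for example when the script is launched with plain python rather than spark-submit).

```python
from typing import Dict, List

# Dependency coordinates copied from the --packages list above.
HADOOP_AWS_PACKAGES: List[str] = [
    "org.apache.hadoop:hadoop-aws:3.2.0",
    "com.amazonaws:aws-java-sdk-s3:1.12.180",
    "com.amazonaws:aws-java-sdk-core:1.12.180",
]


def s3a_anonymous_conf(packages: List[str]) -> Dict[str, str]:
    """Return Spark settings for anonymous S3A access (hypothetical helper)."""
    return {
        # Configuration-property counterpart of the --packages flag.
        "spark.jars.packages": ",".join(packages),
        # Same key and value as the --conf argument in the command above.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    }


def build_session(app_name: str = "ny-tlc-report"):
    """Create a SparkSession with the settings applied (hypothetical helper)."""
    # Imported lazily so the configuration helper works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_anonymous_conf(HADOOP_AWS_PACKAGES).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the package list in one place makes it easier to swap in the versions that match a different Spark or Hadoop build.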

Links

[0] TLC trip data set
[1] AWS Java SDK
[2] Hadoop AWS module