In this homework we'll put what we learned about Spark into practice.
We'll use the High Volume For-Hire Vehicles (HVFHV) dataset for that.
- Install Spark
- Run PySpark
- Create a local Spark session
- Execute `spark.version`
What's the output?
Download the HVFHV data for February 2021:
wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-02.csv
Read it with Spark using the same schema as we did in the lessons. We will use this dataset for all the remaining questions.
Repartition it to 24 partitions and save it to parquet.
What's the size of the folder with results (in MB)?
How many taxi trips were there on February 15?
Consider only trips that started on February 15.
Now calculate the duration for each trip.
Trip starting on which day was the longest?
Now find the most frequently occurring `dispatching_base_num` in this dataset.
How many stages does this Spark job have?
Note: the answer may depend on how you write the query, so there are multiple correct answers. Select the one you have.
Find the most common pickup-dropoff pair.
For example:
"Jamaica Bay / Clinton East"
Enter two zone names separated by a slash
If any of the zone names are unknown (missing), use "Unknown". For example, "Unknown / Clinton East".
(not graded)
For finding the answer to Q6, you'll need to perform a join.
What type of join is it?
And how many stages does your Spark job have?
- Form for submitting: https://forms.gle/dBkVK9yT8cSMDwuw7
- You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: 07 March (Monday), 22:00 CET