Robustness and fault tolerance

Many machines work together in my pipeline's ecosystem. The ones at least partly under my control are the EC2 instance that runs Kinesis Firehose, the S3 buckets, the EMR Spark clusters, and of course my local machine. The raw JSON objects streaming in from the Edmunds API go into S3, and consolidated DataFrame tables get stored as Parquet files as a backup. The Kinesis Firehose is a bit delicate, because a Python file runs continuously under nohup and makes an API call for a specific make-model every 60 minutes. When there's an error, it stores the object as an error and continues. The system could be improved by setting up alerts whenever some part of the pipeline goes down.
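The store-the-error-and-continue behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of edmunds_firehose.py: `fetch` and `store` are hypothetical callables standing in for the Edmunds API request and the write to the delivery stream, and the record layout is made up for the example.

```python
import json
import time


def poll_once(make_model, fetch, store):
    """Fetch one make-model and store either the payload or an error record.

    `fetch` and `store` are hypothetical stand-ins for the API call and the
    write to the stream; they are not names taken from the project.
    """
    try:
        payload = fetch(make_model)
        record = {"status": "ok", "make_model": make_model, "data": payload}
    except Exception as exc:  # any failure becomes a stored error record
        record = {"status": "error", "make_model": make_model, "error": repr(exc)}
    store(json.dumps(record))
    return record


def run(make_models, fetch, store, interval_seconds=3600, sleep=time.sleep):
    """Poll each make-model in turn, waiting `interval_seconds` after each call."""
    for make_model in make_models:
        poll_once(make_model, fetch, store)
        sleep(interval_seconds)
```

Because a failed call is turned into a record instead of an unhandled exception, one bad response does not stop the long-running process. Injecting `sleep` keeps the loop testable without waiting an hour.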

Low latency reads and updates

Reads are low latency because a complete .html file gets transferred to the local machine before invoking a boto connection to S3 to update the static website. An update to the Spark DataFrames and SQL tables first requires reading in all the raw JSON files in the S3 bucket, which takes some time. The spark.read.json("s3a://edmundsvehicle/2017////") call could be improved by implementing a fast update version that only reads in new objects added to the bucket since the last update. Occasionally though (at least once a year), the entire bucket would need to be read again in full, since car manufacturers come out with new models.
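One way the fast update could work is to filter the bucket listing by modification time and hand only the new paths to Spark. The sketch below assumes the listing has the shape boto3's list_objects_v2 returns under "Contents" (dicts with "Key" and "LastModified"); the function names and the idea of persisting a `last_update` timestamp between runs are assumptions, not part of the project.

```python
from datetime import datetime, timezone


def new_keys_since(objects, last_update):
    """Return the keys of objects modified strictly after `last_update`.

    `objects` is a list of dicts with "Key" and "LastModified" entries.
    """
    return sorted(o["Key"] for o in objects if o["LastModified"] > last_update)


def to_s3a_paths(bucket, keys):
    """Turn object keys into s3a:// paths that spark.read.json can accept."""
    return [f"s3a://{bucket}/{key}" for key in keys]


# Hypothetical listing; real entries would come from the S3 API.
listing = [
    {"Key": "2017/a.json", "LastModified": datetime(2017, 3, 1, tzinfo=timezone.utc)},
    {"Key": "2017/b.json", "LastModified": datetime(2017, 3, 9, tzinfo=timezone.utc)},
]
last_update = datetime(2017, 3, 5, tzinfo=timezone.utc)
paths = to_s3a_paths("edmundsvehicle", new_keys_since(listing, last_update))
```

Since spark.read.json accepts a list of paths, `paths` could be passed to it directly and the resulting rows appended to the existing tables. The yearly full refresh would simply skip the filter.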

Scalability

Since we are using Amazon EC2 clusters, we could easily scale up the system by adding more machines or upgrading the instance class. Thus, it would be able to maintain performance while handling increasing data load. The main deterrents to scalability are limited API calls and costs.

Generalization

Since the entirety of the vehicle data gets stored in third-normal-form tables, it's easy to pull data through standard SQL queries. This facilitates a wide range of applications for financial management, analytics, and market research. To further improve the app, historical data could be added, and if the API call limits were raised, I could also add dealer inventory info, images, and much more.

Extensibility

The system can be easily extended: the flowchart can be modified to add new features. To further extend this application, we could gather data from the prices and dealer inventory information to augment the vehicle data. The main limitation, though, is that the Edmunds API exploratory tier has a maximum limit of 25 API calls per day.
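Any extension has to live within that daily quota. A minimal sketch of a client-side budget guard is shown below; the class and method names are hypothetical, and the only figure taken from the text is the limit of 25 calls per day.

```python
from datetime import date


class DailyCallBudget:
    """Track API calls per calendar day and refuse calls past the limit."""

    def __init__(self, limit=25):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and count one call if budget remains for `today`."""
        today = today or date.today()
        if today != self.day:  # a new day resets the counter
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

A caller would check `try_acquire()` before each request and skip or defer the request when it returns False, so a new feature cannot silently consume the calls the hourly make-model polling depends on.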

Ad hoc queries

All the data gets stored in 3 SQL tables that satisfy third normal form (3NF). Ad hoc queries can be run by writing or editing standard SQL SELECT statements in Spark SQL. If we wanted to be a little more user friendly and allow connections to a database, we could set up a PostgreSQL database server and manage it with a tool such as pgAdmin.
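To give a feel for an ad hoc query over three normalized tables, here is a small self-contained sketch. It uses Python's built-in sqlite3 as a stand-in so it runs without a Spark cluster; the table names, columns, and sample rows are all hypothetical and do not reflect the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical 3NF layout: makes -> models -> styles.
conn.executescript("""
CREATE TABLE makes  (make_id INTEGER PRIMARY KEY, make_name TEXT NOT NULL);
CREATE TABLE models (model_id INTEGER PRIMARY KEY,
                     make_id INTEGER NOT NULL REFERENCES makes(make_id),
                     model_name TEXT NOT NULL);
CREATE TABLE styles (style_id INTEGER PRIMARY KEY,
                     model_id INTEGER NOT NULL REFERENCES models(model_id),
                     model_year INTEGER NOT NULL,
                     trim_name TEXT NOT NULL);
-- Made-up sample rows for illustration only.
INSERT INTO makes  VALUES (1, 'Honda'), (2, 'Toyota');
INSERT INTO models VALUES (1, 1, 'Civic'), (2, 1, 'Accord'), (3, 2, 'Camry');
INSERT INTO styles VALUES (1, 1, 2017, 'LX'), (2, 1, 2017, 'EX'),
                          (3, 2, 2017, 'Sport'), (4, 3, 2017, 'LE'),
                          (5, 3, 2016, 'LE');
""")

# Ad hoc question: how many 2017 styles does each make have?
rows = conn.execute("""
SELECT mk.make_name, COUNT(*) AS n_styles
FROM styles AS st
JOIN models AS mo ON mo.model_id = st.model_id
JOIN makes  AS mk ON mk.make_id  = mo.make_id
WHERE st.model_year = 2017
GROUP BY mk.make_name
ORDER BY n_styles DESC, mk.make_name
""").fetchall()
```

In Spark, a SELECT of the same shape could be passed to spark.sql(...) once the DataFrames are registered as temporary views.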

Minimal maintenance

The Kinesis Firehose runs every hour on an EC2 instance automatically. Errors get logged as objects, so they don't interrupt the process. To keep things simple, all the SQL work is done using Spark SQL, and the results are output to an HTML file. An update to the webpage requires running a program on a local machine to fetch the updated HTML file and push changes. Moving all scripts to cloud instances would be better than having to rely on a local laptop.

Debuggability

Errors get printed to the log and can also get stored as objects, making things easy to debug. Debuggability could be further improved by using a Lambda Architecture with a functional batch layer, using recomputation algorithms when possible.
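Recomputation here means rebuilding a derived view from all of the raw records on every run instead of patching it in place, so a bug in the view logic is fixed by correcting the function and rerunning it. A minimal sketch, assuming hypothetical record fields `status` and `make`:

```python
from collections import Counter


def recompute_view(raw_records):
    """Rebuild a per-make count from every raw record, skipping error records."""
    return Counter(r["make"] for r in raw_records if r["status"] == "ok")


# Hypothetical raw records; stored error objects carry status "error".
raw = [
    {"status": "ok", "make": "honda"},
    {"status": "error", "make": "honda"},
    {"status": "ok", "make": "toyota"},
    {"status": "ok", "make": "honda"},
]
view = recompute_view(raw)
```

Because the function is pure and the raw records are never modified, running it twice on the same input gives the same view, which makes a wrong result straightforward to reproduce and trace.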

https://databricks.com/glossary/lambda-architecture


byukan/Edmunds-Car-Data-Pipeline-sdk-python
