
Appendix: Productionalizing Models

Mikiko Bazeley edited this page Jul 16, 2019 · 5 revisions

How do you create a data pipeline for an application that scrapes zip files from a website and extracts their contents, which are tab-separated .txt files? Storing the .txt files as tables in a cloud service is also a requirement. Since the data is large, pandas doesn't scale well, so it's advisable to use an efficient distributed framework such as Spark to compute metrics over the tables.
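A minimal sketch of the extraction step described above, assuming each downloaded archive fits in memory. The function name and the idea of returning a name-to-contents dict are illustrative choices, not part of the question; downloading the zip (e.g. with `urllib.request`) and uploading the extracted files to cloud storage are omitted.

```python
import io
import zipfile


def extract_txt_members(zip_bytes: bytes) -> dict:
    """Return the tab-separated .txt members of a zip archive as {name: contents}."""
    extracted = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # Keep only the .txt members; skip any other files in the archive.
            if name.endswith(".txt"):
                extracted[name] = zf.read(name).decode("utf-8")
    return extracted
```

In the real pipeline each extracted file would then be written to object storage (S3, GCS, etc.) so Spark can read it as a table.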

The simplest solution, in my opinion, would be to run an Airflow schedule that uses Spark to extract the data from your system, process it, and store it in the desired format in cloud storage. If you are using AWS, check out EMR. If the ETL process doesn't run often and is computationally intensive, you can use transient clusters, which automatically shut down your EMR cluster once the ETL task finishes, saving you money. Hope it helps.
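The transient-cluster idea above can be sketched with the AWS CLI: `aws emr create-cluster` with `--auto-terminate` tears the cluster down after its steps finish. Everything here (cluster name, instance type and count, the S3 path to the ETL script) is a placeholder assumption, and the release label should match whatever EMR version you actually target.

```shell
# Launch a transient EMR cluster that runs one Spark ETL step and then
# shuts itself down (--auto-terminate), so you only pay while the job runs.
# Paths, names, and sizes below are illustrative placeholders.
aws emr create-cluster \
  --name "nightly-zip-etl" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --steps Type=Spark,Name=etl,ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://my-bucket/scripts/etl.py] \
  --auto-terminate
```

The step script itself would read the tab-separated files with Spark (e.g. `spark.read.option("sep", "\t").csv(...)`) and write them out as tables, with Airflow triggering this command on the desired schedule.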
