US Domestic Flights ETL flow with weather, geo, delay and airline data for the country's top 5 airports. Uses Logstash, Elasticsearch & Kibana (optionally with the Kibana plugin Timelion).
As an added bonus, there is a separate data set with 2014 TSA claims data.
Complete these steps:
- Download data: sh wget.sh
  - Data size is about 2.5 GB. Because of filtering, the size in Elasticsearch will be much lower, below 100 MB.
- Create Elasticsearch indices and templates:
  sh create_flight_template.sh URI [USERNAME:PASSWORD]
  sh create_tsaclaims_index.sh URI [USERNAME:PASSWORD]
- Ingest flight data into Elasticsearch with Logstash:
  - Optionally put a username/password/host in import_*.conf
  sh load_tsaclaims.sh && sh load_flights.sh
- Create an alias called flights, composed of all flights-* indices: sh create_flight_alias.sh
- Create the index patterns in Kibana:
  - tsaclaims with Date Received as time field
  - flights with FlightDateTime as time field
- Import Kibana visuals and dashboards:
  - In Kibana, go to Settings, then Objects, then Import kibana_import.json
  - Optional: Timelion is a time series graphing plugin for Kibana, developed by the people of Elastic. Read more about Timelion and how to get it here. Currently it is not possible to export or import Timelion sheets, so the charts have to be created by hand. Open Timelion and, for each of the six lines below, add a chart to the Timelion sheet and paste in the expression. Don't forget to save the sheet.
.es(index=flights).label("All Flights"), .es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights")
.static(55).color(red).label("Red Line"), .static(50).color(orange).label("Orange Line"), .es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights Percentage").divide(.es(index=flights)).multiply(100).color(navy).movingaverage(5)
.es(index=flights, metric=avg:tmax).color(orange).lines(width=2).movingaverage(5).label("Maximum Temperature (celsius) mavg=5"), .es(index=flights, metric=avg:tmin).color(lightblue).lines(width=2).movingaverage(5).label("Minimum Temperature (celsius) mavg=5"), .es(index=flights, metric=avg:WeatherDelay).color(Red).movingaverage(5).label("Weather Delay (in minutes) mavg=5")
.es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights").color(navy).movingaverage(10), .es(index=flights, metric=sum:terribility).label("Terribility Index").movingaverage(10)
.es(index=tsaclaims, timefield="Date Received").movingaverage(7).label("TSA Claims mavg(7)"), .es(index=flights).movingaverage(7).divide(10).label("Flights mavg(7) /10")
.es(index=flights, metric=avg:snowfall).divide(10).add(.es(index=flights, metric=avg:thunder)).sum(.es(index=flights, metric=avg:hail).multiply(3)).sum(.es(index=flights, metric=avg:glaze).multiply(2)).sum(.es(index=flights, metric=avg:fog).multiply(1)).sum(.es(index=flights, metric=avg:heavy_fog).multiply(5)).sum(.es(index=flights, metric=avg:dust_ash).multiply(10)).label("Average Terribility(R)").points(4).color(Navy), .es(index=flights, metric=avg:terribility).label("Ingested Terribility(R)")
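The last expression above rebuilds the terribility score from its weather components so it can be compared with the ingested value. Read as plain arithmetic, it is a weighted sum: snowfall divided by 10, plus thunder, plus hail times 3, glaze times 2, fog times 1, heavy_fog times 5 and dust_ash times 10. The following is only a minimal sketch of that arithmetic; the function name and argument order are made up for illustration, and whether the ingest pipeline uses exactly these weights is not stated here.

```shell
# Hypothetical helper mirroring the weights in the Timelion expression.
# Arguments (illustrative order): snowfall thunder hail glaze fog heavy_fog dust_ash
terribility() {
  awk -v snow="$1" -v thunder="$2" -v hail="$3" -v glaze="$4" \
      -v fog="$5" -v heavy_fog="$6" -v dust="$7" 'BEGIN {
    # snowfall is scaled down by 10; every other component has its own weight
    score = snow / 10 + thunder + 3 * hail + 2 * glaze + 1 * fog + 5 * heavy_fog + 10 * dust
    printf "%g\n", score
  }'
}

# Example call: terribility 20 1 0 0 1 0 0
```

If the two series in that chart diverge, the weights used at ingest time probably differ from the ones in the expression.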
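For the alias step, it may help to see what such a request looks like. This is a sketch of one way to create the flights alias over all flights-* indices with the Elasticsearch _aliases API. It is not the contents of create_flight_alias.sh; the function names and the URI and USERNAME:PASSWORD arguments are assumptions modelled on the other create scripts.

```shell
# Hypothetical sketch, not the actual create_flight_alias.sh.
# Builds the JSON body for the _aliases API.
# $1 = index pattern, $2 = alias name
alias_payload() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}' "$1" "$2"
}

# Sends the request to the cluster.
# $1 = Elasticsearch URI, $2 = optional USERNAME:PASSWORD
create_alias() {
  body=$(alias_payload 'flights-*' flights)
  if [ -n "${2:-}" ]; then
    curl -s -u "$2" -H 'Content-Type: application/json' -XPOST "$1/_aliases" -d "$body"
  else
    curl -s -H 'Content-Type: application/json' -XPOST "$1/_aliases" -d "$body"
  fi
}

# Example (needs a running cluster): create_alias http://localhost:9200
```

Keeping the payload in its own function makes it easy to inspect the request before sending anything to the cluster.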
- Elasticsearch 2.3
- Kibana 4.4
- Logstash 2.3
- Timelion 4.4 (optional)
Other versions may work but are untested. If another version works for you, please consider letting us know with a pull request on this README.
- create_*.sh: sets up Elasticsearch templates, mappings (actual mappings in mapping*.json) and aliases
- lookup_data/*: airport timezone and weather data for enriching the flight data
- logstash/filters/*.rb: four simple Logstash filters to join the lookup data
- load_*.sh: invoke Logstash to import the flat data files
- remove_indices.sh: remove all indices, mappings, templates and
- wget.sh: downloads the flight data files
- import_*.conf: configuration files for Logstash. The host is hardcoded in these files, so change it to suit your setup
- kibana_import.json: Two Dashboards and 43 Visualizations for Kibana
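Since the host in import_*.conf is hardcoded, changing it by hand in every file is easy to get wrong. Below is a small sketch of doing it with sed. It assumes the config files contain a line of the form hosts => ["localhost:9200"], which is the usual shape for the Logstash elasticsearch output but is not confirmed by this README; the function name is made up.

```shell
# Hypothetical helper: rewrite the Elasticsearch host in a Logstash config.
# Assumes a line like: hosts => ["localhost:9200"]
# $1 = config file, $2 = new host:port
set_es_host() {
  tmp="$1.tmp"
  # Replace whatever is between the brackets after "hosts =>" with the new host
  sed "s|hosts *=> *\[[^]]*]|hosts => [\"$2\"]|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example: for f in import_*.conf; do set_es_host "$f" "es.example.org:9200"; done
```

Writing to a temporary file and moving it into place avoids relying on sed -i, which behaves differently on GNU and BSD systems.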
- The airline data is taken from US BTS and is limited to 2014 and the 5 busiest airports: ATL, ORD, JFK, LAX and DFW. A flight qualifies only if both its origin and its destination are among these airports.
- The weather data is taken from NCEI. For each of the 5 airports I used the closest weather station; in all cases, the readings are taken at the airport itself.
- The timezone data was provided by jpatokal
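The rule for which flights qualify can be sketched as a simple filter. This example is not how the Logstash pipeline does it. It assumes a simplified comma-separated file with a header row containing Origin and Dest columns and no quoted fields with embedded commas, so real BTS files may need a proper CSV parser. The function name is made up.

```shell
# Hypothetical filter: keep rows whose Origin AND Dest are both top-5 airports.
# Assumes a plain CSV with a header row and no embedded commas in fields.
# $1 = input file
filter_top5() {
  awk -F, '
    BEGIN {
      n = split("ATL ORD JFK LAX DFW", codes, " ")
      for (i = 1; i <= n; i++) top[codes[i]] = 1
    }
    NR == 1 {
      # Map column names to positions, then pass the header through
      for (i = 1; i <= NF; i++) col[$i] = i
      print
      next
    }
    {
      origin = $(col["Origin"])
      dest = $(col["Dest"])
      if ((origin in top) && (dest in top)) print
    }
  ' "$1"
}

# Example: filter_top5 flights_sample.csv > flights_top5.csv
```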