THIS IS A MIRRORED VERSION OF AN INTERNAL PRODUCTION ETL REPOSITORY LAST UPDATED 2024-11-20
Principal Author: James Midkiff (CityGeo)
Author of dag_trigger.py: Roland MacDavid (CityGeo)
Secondary Author: Jophy Samuel (CCO)
First, download the latest asset data from https://phldoe.visium.io, upload that data to the table Assets in databridge, and update an Excel file on the SFTP server.
Next, for each asset that was upserted into databridge, get the observation history for those assets and upload that history to the table Asset_History.
Finally, join this Assets data to the Routers data via the table router_precincts to associate each asset with a router location by precinct, and insert the result into the table Asset_Router_Locations.
In each case, the pipeline initiates an Airflow DAG that back-syncs the data from Databridge-V2 to Oracle and uploads the data to AGO.
None of these tables are geo-registered with ArcGIS Pro - they will not have objectid fields.
Routers repository: https://github.com/CityOfPhiladelphia/phillyvotes-routers
This is sensitive data that should not be shared publicly.
- cco/assets
- cco/pollbooks
- Also triggers its DAG directly, so as to sync at a higher frequency than the Airflow migrated trigger DAG's 15-minute schedule
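For context, a DAG can be triggered programmatically through Airflow's stable REST API. The sketch below is illustrative only; the host, DAG id, and credentials are placeholders, and dag_trigger.py may use a different mechanism (its secrets come from Keeper):

# Illustrative DAG trigger via Airflow's stable REST API (placeholder host, DAG id,
# and credentials; dag_trigger.py may do this differently)
import requests

AIRFLOW_HOST = "https://airflow.example.internal"  # placeholder
DAG_ID = "example_backsync_dag"                    # placeholder DAG id

def trigger_dag(username: str, password: str) -> dict:
    response = requests.post(
        f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
        auth=(username, password),
        json={"conf": {}},  # optional run configuration
    )
    response.raise_for_status()
    return response.json()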
The Jenkins automation server will run this pipeline every 10 minutes:

python run.py

runs the script after having created and sourced a virtual environment. Options:

- --test - uses test database credentials and will not trigger DAGs or upload to SFTP.
- --log=<value> - changes the logging mode; choose one of ['error', 'warn', 'info', 'debug'].
- --drop - will drop the tables for the database whose credentials are used. Warning: destructive.
- --run_local - run this script on a local machine outside of our AWS environment. See below.
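For reference, the flags above could be wired up with argparse roughly as follows (a minimal sketch; the actual option handling in run.py may differ):

# Sketch of the command-line options described above (assumed, not the actual run.py code)
import argparse

parser = argparse.ArgumentParser(description="Assets ETL pipeline")
parser.add_argument("--test", action="store_true",
                    help="Use test database credentials; skip DAG triggers and SFTP upload")
parser.add_argument("--log", default="info", choices=["error", "warn", "info", "debug"],
                    help="Logging mode")
parser.add_argument("--drop", action="store_true",
                    help="Drop the tables for the selected database (destructive)")
parser.add_argument("--run_local", action="store_true",
                    help="Run outside the AWS environment (relaxed SSL, pgbouncer host)")
args = parser.parse_args()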
In its current format, the script can only be run by OIT CityGeo because it depends on access to CityGeo's Keeper password management account.
The access token for the Visium API expires every 15 days; the script automatically detects an expired token, generates a new one, and then processes the data.
Technically, requesting a new token does not appear to invalidate old credentials, which instead expire on their own schedule after 15 days. When you request a new token, you are simply doing that: requesting an additional valid token.
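Conceptually, the expired-token handling looks something like the sketch below (the endpoint paths, payloads, and field names are assumptions; only the base URL and the 15-day expiry come from this README):

# Sketch of expired-token detection and refresh (assumed endpoints and field names)
import requests

API_BASE = "https://phldoe.visium.io"

def request_new_token(credentials: dict) -> str:
    # Requesting a new token does not invalidate old ones; they lapse on their own after 15 days.
    response = requests.post(f"{API_BASE}/api/auth/token", json=credentials)  # assumed path
    response.raise_for_status()
    return response.json()["token"]  # assumed field name

def fetch_assets(token: str, credentials: dict) -> list:
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(f"{API_BASE}/api/assets", headers=headers)  # assumed path
    if response.status_code == 401:
        # Token expired: generate a new one and retry once.
        headers["Authorization"] = f"Bearer {request_new_token(credentials)}"
        response = requests.get(f"{API_BASE}/api/assets", headers=headers)
    response.raise_for_status()
    return response.json()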
Q: Can you give me a little more detail about how the asset history API relates to the assets API?
A: The Asset History API saves all asset observation data, so it updates more frequently (including when an asset is reported at the same location multiple times). The Assets API retrieves only the "LastSeen" data: it updates only periodically while an asset remains at one location, and otherwise only when the asset moves to another location.
The Assets "lastseentime" and the Asset_History "lastseentime" do not correspond exactly, so do not expect to see records from the former in the latter. Both tables may receive new records that differ from existing records only by "lastseentime".
Asset_History is processed using multithreading for efficiency. With single-threading, the script could process the history of roughly 1,200 - 1,800 assets in 10 minutes. With up to 20 threads (the maximum recommended by employees of InThing, the owner of Visium), the script can now process the history of roughly 7,400 - 8,000 assets in 10 minutes, an increase of roughly 4x-6x.
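A minimal sketch of the multithreaded history fetch, assuming a thread pool capped at 20 workers (the function and variable names are illustrative, not the actual run_asset_history.py code):

# Sketch of the multithreaded Asset_History fetch (illustrative names only)
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 20  # maximum recommended by InThing, the owner of Visium

def fetch_history(asset_id):
    """Placeholder for the per-asset Asset History API call."""
    ...

def fetch_all_histories(asset_ids):
    histories = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch_history, asset_id): asset_id for asset_id in asset_ids}
        for future in as_completed(futures):
            histories[futures[future]] = future.result()
    return histories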
This repository will automatically update api_update.timestamp to easily show when the latest Visium API Token was generated.
This relies on a git remote URL called "automated", a deploy key located in this repository for EC2-User, and matching information in EC2-User's .ssh config file for the server this script runs on (currently linux-scripts-aws3). For more information, run:
git remote -v
code ~/.ssh/config
ls ~/.ssh
(as EC2-User)
As a normal user, you can run git push or git pull as normal to pull from the origin URL. EC2-User can only run git push automated main and git pull automated main, because that is the only URL they have access to.
The Oracle database's time zone is set to UTC +0000. This means that all timestamp columns will by default display in that timezone (4 or 5 hours ahead of US-Eastern depending on daylight savings time). To see the timestamp columns in your local time zone, use the following query in Oracle:
-- Display timestamp fields in local timezone
SELECT a.* ,
a.LASTSEENTIME AT LOCAL AS LASTSEENTIME_LOCAL ,
a.UPDATED_ON AT LOCAL AS updated_on_local
FROM GIS_ELECTIONS.ASSETS a
ORDER BY a.UPDATED_ON DESC;
-- To check your local session timezone
SELECT SESSIONTIMEZONE FROM DUAL;
-- To set your local session timezone if results are unexpected
ALTER SESSION SET TIME_ZONE = 'America/New_York';
This is not a concern in the Databridge-V2 postgres database or in the Excel file uploaded to SFTP, which by default are in US-Eastern timezone.
- run.sh - Main bash script file that runs run.py and pushes any updates for api_update.timestamp to GitHub. See Repository Updates above.
- run.py - Main script file
- dag_trigger.py - File to trigger the Airflow pipeline DAG, with Keeper-related functions
- assetdetails.py - API and SFTP-related functions
- utils.py - Miscellaneous utility functions
- config.py - Configuration information
- models.py - Database table definition file using the peewee ORM (see the sketch after this list)
- api_update.timestamp - File to track the latest Assets API Token reset
- requirements.txt - Python module requirements
- extract_postgres_load_oracle.sh - Bash script to extract from postgres and load to Oracle manually. This script was only run to alleviate any concerns about uploading to AGO.
- run_asset_history.py - Main python file for Asset_History processing, triggered by run.py
- run_asset_router_locations.py - Main script file to join the assets and routers tables; triggered by run.py
- config_db.py - Database configuration information
- asset_router_locations.sql - SQL file with some useful data analysis; not needed by the scripts
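For reference, the table definitions in models.py follow the usual peewee pattern. The model below is a minimal illustrative sketch; the field names are assumptions drawn from columns mentioned elsewhere in this README, not the actual definitions:

# Illustrative peewee model (assumed fields; see models.py for the real definitions)
from peewee import Model, PostgresqlDatabase, CharField, DateTimeField

db = PostgresqlDatabase(None)  # initialized at runtime from the configured credentials

class Assets(Model):
    assetid = CharField()           # assumed identifier column
    lastseentime = DateTimeField()  # last observation time, as discussed above
    updated_on = DateTimeField()    # used for ordering in the Oracle query above

    class Meta:
        database = db
        table_name = 'assets'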
If you need to run this script locally, which hopefully you will never need to, then you must perform the following steps:
- Install python packages:
  - For Linux or Windows, see requirements.txt
  - You may encounter SSLCertVerificationErrors in pip. To bypass that, either:
    - Get off the City network and off the VPN, or
    - Run pip install --trusted-host pypi.org <package(s)>
- Pass --run_local to run.py (see the sketch below), which will:
  - Run all network requests (to Keeper, to the APIs) without verifying SSL certs
  - Suppress warnings about the lack of SSL certificate verification
  - Point the databridge-v2 host at the pgbouncer address rather than directly at the server
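As an illustration of what --run_local toggles, the flag might be wired roughly as below (a hedged sketch with placeholder host names, not the actual implementation):

# Sketch of the --run_local behavior (placeholder hosts; not the actual implementation)
import requests
import urllib3

def configure(run_local: bool) -> dict:
    settings = {"verify_ssl": True, "db_host": "databridge-v2-host"}  # placeholder host
    if run_local:
        settings["verify_ssl"] = False           # skip SSL certificate verification
        settings["db_host"] = "pgbouncer-host"   # placeholder: route through pgbouncer
        # Suppress the InsecureRequestWarning that urllib3 would otherwise emit on every request
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return settings

settings = configure(run_local=True)
response = requests.get("https://phldoe.visium.io", verify=settings["verify_ssl"])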
Documentation regarding the API V2 is available:
- Go to the Keeper secrets
- Open up the "api_documentation_url" and input the "api_documentation_username" and "api_documentation_password" from the secret
- Search for /api/item/items-overview-report/data
Each API instance (assets being separate from assets_history) has a rate limit of 200 calls per minute according to communications with InThing, the owner of Visium.
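To stay under that cap, the client can throttle itself. The sketch below shows one simple sliding-window approach that would work with the multithreaded fetch (the 200-per-minute figure comes from this README; everything else is an assumption):

# Simple client-side throttle for the 200-calls-per-minute limit (sketch only)
import threading
import time

class RateLimiter:
    """Allow at most max_calls acquisitions per period seconds, shared across threads."""

    def __init__(self, max_calls: int = 200, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = []            # timestamps of recent calls
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that have fallen out of the 60-second window.
                self.calls = [t for t in self.calls if now - t < self.period]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                wait = self.period - (now - self.calls[0])
            time.sleep(wait)

# Call limiter.acquire() before every API request, e.g. inside the history fetch.
limiter = RateLimiter()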