Impute missing lat, lon coordinates #113

Open
katie-lamb opened this issue Apr 13, 2023 · 1 comment

Comments

@katie-lamb
Member

There are records in the MSHA mines and EIA plants data that are missing latitude and longitude coordinates. Currently, these records are being excluded. Instead, try to impute the Census tract and county from the other locational data given for these records.

@zaneselvans
Member

We're (IMHO) overly aggressively dropping lat/lon values in the main PUDL ETL right now, and should make this fix as far upstream as possible. IIRC, right now we're basically treating the floating point values as strings and declaring them inconsistent if the digits aren't identical, which is not good.
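As a rough illustration of comparing coordinates numerically rather than as strings, here is a minimal sketch. The function name `coords_consistent` and the default tolerance are assumptions made for this example, not anything taken from the PUDL codebase. A tolerance of 0.001 degrees corresponds to roughly 111 m of latitude.

```python
import math


def coords_consistent(values, abs_tol=1e-3):
    """Return True if all usable reported values agree to within abs_tol degrees.

    Hypothetical helper: values may be floats or numeric strings; None and NaN
    entries are ignored. Returns False when no usable value remains.
    """
    vals = [
        float(v)
        for v in values
        if v is not None and not math.isnan(float(v))
    ]
    if not vals:
        return False
    # Compare the spread of the numeric values, not their digit strings.
    return max(vals) - min(vals) <= abs_tol


# "39.7392" and "39.73920" differ as strings but are the same number.
print(coords_consistent(["39.7392", "39.73920", 39.7392001]))
print(coords_consistent([39.7392, 41.0]))
```

With a check like this, values that differ only in trailing digits or float noise would be kept instead of dropped.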

Ideally we would use the haversine distance between all the different (lon, lat) points to estimate an "actual" location, identify any totally crazy outliers, and assign the actual location to them. Rectilinear coordinates are probably good enough though, and much simpler, since all the points should be very close to each other, and if they're not very close, it'll be obvious in either spherical or Euclidean coords.
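A minimal sketch of that idea, using only the standard library. The choice of the coordinate-wise median as the "actual" location, the 5 km threshold, and the names `haversine_km` and `consensus_location` are all assumptions made for illustration; nothing here is existing PUDL code.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def consensus_location(points, max_km=5.0):
    """Estimate one location from several reported (lon, lat) points.

    Assumption: the coordinate-wise median stands in for the "actual"
    location. Points farther than max_km from it are treated as outliers
    and replaced by it.
    """
    center = (median(p[0] for p in points), median(p[1] for p in points))
    cleaned = [
        p if haversine_km(*p, *center) <= max_km else center for p in points
    ]
    return center, cleaned


reported = [
    (-104.99, 39.74),
    (-104.99, 39.74),
    (-104.991, 39.741),
    (-104.99, 39.74),
    (-74.0, 40.7),  # obviously inconsistent with the others
]
center, cleaned = consensus_location(reported)
print(center)
print(cleaned)
```

Swapping `haversine_km` for a plain Euclidean distance in degrees would give the rectilinear variant; for points this close together the two should flag the same outliers.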

We could also convert (lon, lat) into a geopoint / tuple stored in a single column.
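One way the single-column representation might look, as a sketch only. `GeoPoint` and `to_geopoint` are hypothetical names, and treating out-of-range values as missing is an assumption of this example.

```python
import math
from typing import NamedTuple, Optional


class GeoPoint(NamedTuple):
    """A (lon, lat) pair in decimal degrees, stored as one value."""

    lon: float
    lat: float


def to_geopoint(lon, lat) -> Optional[GeoPoint]:
    """Combine lon and lat into a GeoPoint, or None if either is unusable."""
    if lon is None or lat is None:
        return None
    lon, lat = float(lon), float(lat)
    if math.isnan(lon) or math.isnan(lat):
        return None
    # Assumption: values outside valid ranges are treated as missing.
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return None
    return GeoPoint(lon, lat)


print(to_geopoint("-104.99", 39.74))
print(to_geopoint(None, 39.74))
```

Keeping the pair together means a location is either fully present or fully missing, never half-populated.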
