Develop a census data intake pipeline using the newly refactored flow #111

Open
MattTriano opened this issue Apr 9, 2023 · 4 comments

@MattTriano
Owner

Try to match the task grouping from issue #107

@MattTriano
Owner Author

Regarding implementation of the metadata collector, I think I can achieve this by recursively scraping the "file system" descended from the root node (https://www2.census.gov/). In the data structure on my end, I think I'll create a tree where the id of each node instance is the URL and the node class has methods to:

  • scrape the page
  • identify child_directories
  • identify child_files
  • compare the last_modified time of a directory or file to a cache of checks in the data warehouse
    • this should significantly accelerate subsequent runs if the check short-circuits whenever a high-level directory's cached last_modified matches the value seen in a fresh update check, so that directory's subtree is skipped.
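
A minimal sketch of the node idea above, assuming the pages are Apache-style directory listings where sub-directory links end in `/`. The names `ListingNode`, `parse_listing`, and `needs_rescrape` are hypothetical, and extracting `last_modified` from the listing rows is left out; the caller is assumed to set it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collect href values from <a> tags on a directory-listing page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


@dataclass
class ListingNode:
    url: str  # the URL doubles as the node id
    last_modified: str | None = None  # assumed to be filled in by the caller
    child_directories: list[ListingNode] = field(default_factory=list)
    child_files: list[ListingNode] = field(default_factory=list)

    def parse_listing(self, html: str) -> None:
        """Split the links on an already-fetched page into dirs and files."""
        parser = _LinkParser()
        parser.feed(html)
        base = self.url if self.url.endswith("/") else self.url + "/"
        for href in parser.hrefs:
            # Skip column-sort links, parent/absolute links, and anchors.
            if href.startswith(("?", "/", "#", "..")) or "://" in href:
                continue
            child = ListingNode(url=urljoin(base, href))
            if href.endswith("/"):
                self.child_directories.append(child)
            else:
                self.child_files.append(child)

    def needs_rescrape(self, cache: dict[str, str]) -> bool:
        """False when the cached last_modified matches, so the subtree can be skipped."""
        return cache.get(self.url) != self.last_modified
```

Keeping the HTTP request out of the class means the parsing and short-circuit logic can be exercised against saved pages without touching the census server.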

The census_metadata table will need columns:

  • metadata_url
  • last_modified
  • size
  • description
  • is_dir
  • is_file
  • check_time
  • updated_data_available

And likely some more fields as needed.
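
A rough sketch of that table, using an in-memory SQLite database purely for illustration (the actual data warehouse may use a different engine). The column types, the `(metadata_url, check_time)` primary key, and the helper names `record_check` and `latest_last_modified` are all assumptions.

```python
import sqlite3

# Column names come from the list above; the types and key are assumptions.
CREATE_CENSUS_METADATA = """
CREATE TABLE IF NOT EXISTS census_metadata (
    metadata_url            TEXT NOT NULL,
    last_modified           TEXT,
    size                    TEXT,
    description             TEXT,
    is_dir                  INTEGER NOT NULL,
    is_file                 INTEGER NOT NULL,
    check_time              TEXT NOT NULL,
    updated_data_available  INTEGER,
    PRIMARY KEY (metadata_url, check_time)
)
"""


def record_check(conn: sqlite3.Connection, row: dict) -> None:
    """Store the result of one update check for one URL."""
    conn.execute(
        "INSERT INTO census_metadata VALUES ("
        ":metadata_url, :last_modified, :size, :description, "
        ":is_dir, :is_file, :check_time, :updated_data_available)",
        row,
    )


def latest_last_modified(conn: sqlite3.Connection, metadata_url: str):
    """Return last_modified from the most recent check, or None if never checked."""
    found = conn.execute(
        "SELECT last_modified FROM census_metadata "
        "WHERE metadata_url = ? ORDER BY check_time DESC LIMIT 1",
        (metadata_url,),
    ).fetchone()
    return found[0] if found else None
```

Storing one row per check (rather than one row per URL) keeps a history of checks; the ordering relies on `check_time` being written in a sortable format such as ISO 8601.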

I don't want to automatically pull down all of the census data as that's a gigantic volume of data, and there are a lot of different file_types (most of which won't be handled until a need for the data or content arises).

@MattTriano
Owner Author

My scraper for www2.census.gov metadata has been running since yesterday (I put in a 0.75 * random(0,1) + 0.5 second sleep between requests to avoid just hammering their server) and I've collected over 1.15M URL endpoints, and it looks like I'm still a long way from scraping everything. Still, it shouldn't take anywhere near this long in the future.
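
For reference, the sleep described above could look roughly like this, reading `random(0,1)` as a uniform draw from [0, 1). The function names are hypothetical.

```python
import random
import time


def polite_delay(base: float = 0.5, jitter: float = 0.75) -> float:
    """Return a delay in seconds, uniform in [base, base + jitter)."""
    return jitter * random.random() + base


def sleep_between_requests() -> float:
    """Pause before the next request and report how long the pause was."""
    delay = polite_delay()
    time.sleep(delay)
    return delay
```

With the defaults, each pause falls between 0.5 and 1.25 seconds.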

I might want to just use this side to short-circuit unnecessary data updates and use the census APIs (here's a json dict of endpoints) as the data source (rather than the endpoints from the metadata-side URLs). I'll have to see how the data is formatted and organized, but using the APIs might allow me to sidestep a lot of explicit combining of files from the metadata-side endpoints.

@MattTriano
Owner Author

Oh hey, the API json data menu (https://api.census.gov/data.json, .html and .xml also work) includes a modified field that contains datelike values. That would probably be much easier to work with than the stuff I built yesterday (although that work still has a lot of value as it will produce a full list of the pdfs documenting these data sets).
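
A small sketch of using that `modified` field for update checks. It assumes the catalog is a JSON object with a top-level `dataset` list whose entries carry `identifier` and `modified` keys; only `modified` is confirmed above, so the other key names and the function names are assumptions. Values are compared as plain strings to avoid guessing at the date format.

```python
from __future__ import annotations

import json
from urllib.request import urlopen

DATA_CATALOG_URL = "https://api.census.gov/data.json"


def fetch_catalog(url: str = DATA_CATALOG_URL) -> dict:
    """Download and decode the API data catalog."""
    with urlopen(url) as response:
        return json.load(response)


def modified_by_identifier(catalog: dict) -> dict[str, str]:
    """Map each dataset identifier to its `modified` value (assumed keys)."""
    return {
        ds["identifier"]: ds["modified"]
        for ds in catalog.get("dataset", [])
        if "identifier" in ds and "modified" in ds
    }


def changed_datasets(catalog: dict, cached: dict[str, str]) -> list[str]:
    """List identifiers whose `modified` value differs from the cached one."""
    current = modified_by_identifier(catalog)
    return sorted(k for k, v in current.items() if cached.get(k) != v)
```

Anything returned by `changed_datasets` is either new or has a different `modified` value than the last check recorded.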

@MattTriano
Owner Author

The scraper ran through the week and I cut it off this morning. It still has the following URLs to scrape (listed below; it has been scraping in a depth-first search pattern):

[
'https://www2.census.gov/2020Census',
'https://www2.census.gov/EEO_2006_2010',
'https://www2.census.gov/EEO_2014_2018',
'https://www2.census.gov/EEO_Disability_2008-2010',
'https://www2.census.gov/Econ2001_And_Earlier',
'https://www2.census.gov/about',
'https://www2.census.gov/acs',
'https://www2.census.gov/acs2002',
'https://www2.census.gov/acs2003',
'https://www2.census.gov/acs2004',
'https://www2.census.gov/acs2005',
'https://www2.census.gov/acs2005_2007_3yr',
'https://www2.census.gov/acs2005_2009_5yr',
'https://www2.census.gov/acs2006',
'https://www2.census.gov/acs2006_2008_3yr',
'https://www2.census.gov/acs2007_1yr',
'https://www2.census.gov/acs2007_2009_3yr',
'https://www2.census.gov/acs2007_3yr',
'https://www2.census.gov/acs2008_1yr',
'https://www2.census.gov/acs2008_3yr',
'https://www2.census.gov/acs2009_1yr',
'https://www2.census.gov/acs2009_3yr',
'https://www2.census.gov/acs2009_5yr',
'https://www2.census.gov/acs2010_1yr',
'https://www2.census.gov/acs2010_3yr',
'https://www2.census.gov/acs2010_5yr',
'https://www2.census.gov/acs2010_SPT_AIAN',
'https://www2.census.gov/acs2011_1yr',
'https://www2.census.gov/acs2011_3yr',
'https://www2.census.gov/acs2011_5yr',
'https://www2.census.gov/acs2012_1yr',
'https://www2.census.gov/acs2012_3yr',
'https://www2.census.gov/acs2012_5yr',
'https://www2.census.gov/acs2013_1yr',
'https://www2.census.gov/acs2013_3yr',
'https://www2.census.gov/acs2013_5yr',
'https://www2.census.gov/acs_latest_data',
'https://www2.census.gov/acs_special_tabs',
'https://www2.census.gov/adrm',
'https://www2.census.gov/cac',
'https://www2.census.gov/census_1940',
'https://www2.census.gov/census_1980',
'https://www2.census.gov/census_1990',
'https://www2.census.gov/census_2000',
'https://www2.census.gov/census_2010',
'https://www2.census.gov/ces',
'https://www2.census.gov/data',
'https://www2.census.gov/decennial',
'https://www2.census.gov/desen002',
'https://www2.census.gov/dssd',
'https://www2.census.gov/econ',
'https://www2.census.gov/econ1977',
'https://www2.census.gov/econ1982',
'https://www2.census.gov/econ1987',
'https://www2.census.gov/econ1992',
'https://www2.census.gov/econ1997',
'https://www2.census.gov/econ2002',
'https://www2.census.gov/econ2003',
'https://www2.census.gov/econ2004',
'https://www2.census.gov/econ2005',
'https://www2.census.gov/econ2006',
'https://www2.census.gov/econ2007',
'https://www2.census.gov/econ2008',
'https://www2.census.gov/econ2009',
'https://www2.census.gov/econ2010',
'https://www2.census.gov/econ2011',
'https://www2.census.gov/econ2012',
'https://www2.census.gov/econ2013',
'https://www2.census.gov/econ2014',
'https://www2.census.gov/econ2015',
'https://www2.census.gov/econ2016',
'https://www2.census.gov/econ2017',
'https://www2.census.gov/foia',
'https://www2.census.gov/geo/maps/DC2010',
'https://www2.census.gov/geo/maps/DC2020/ACO20',
'https://www2.census.gov/geo/maps/DC2020/AIANWall2020',
'https://www2.census.gov/geo/maps/DC2020/DC20BLK',
'https://www2.census.gov/geo/maps/DC2020/IFAC',
'https://www2.census.gov/geo/maps/DC2020/MCS',
'https://www2.census.gov/geo/maps/DC2020/PL20',
'https://www2.census.gov/geo/maps/DC2020/PL20Proto',
'https://www2.census.gov/geo/maps/DC2020/PSAPV',
'https://www2.census.gov/geo/maps/DC2020/PUMA',
'https://www2.census.gov/geo/maps/DC2020/PopCenter',
'https://www2.census.gov/geo/maps/DC2020/PopDist_Nighttime',
'https://www2.census.gov/geo/maps/DC2020/SLD_RefMap',
'https://www2.census.gov/geo/maps/DC2020/SR20',
'https://www2.census.gov/geo/maps/DC2020/TEA'
]
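
One way a depth-first crawl can end up with a leftover list shaped like the one above (untouched top-level directories plus siblings deep in one branch) is an explicit stack that always takes the most recently added URL. This is only an illustrative sketch, not the scraper's actual code; `crawl_depth_first`, `list_child_dirs`, and `max_pages` are hypothetical names.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def crawl_depth_first(
    start_urls: Iterable[str],
    list_child_dirs: Callable[[str], Iterable[str]],
    max_pages: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Visit URLs depth-first; return (visited, still_to_scrape)."""
    stack = list(start_urls)
    visited: List[str] = []
    while stack and (max_pages is None or len(visited) < max_pages):
        url = stack.pop()  # most recently added URL first, so the crawl dives deep
        visited.append(url)
        stack.extend(sorted(list_child_dirs(url)))
    # Whatever is left on the stack is the frontier to resume from later.
    return visited, stack
```

Passing the returned frontier back in as `start_urls` resumes the crawl where it was cut off.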
