Develop a census data intake pipeline using the newly refactored flow #111

MattTriano · 2023-04-09T03:07:26Z

Try to match the task grouping from issue #107

MattTriano · 2023-04-10T17:53:24Z

Regarding implementation of the metadata collector, I think I can do this by recursively scraping the "file system" descended from the root node (https://www2.census.gov/). On my end, I think I'll build a tree where the id of each node instance is its URL and the node class has methods to:

- scrape the page
- identify child_directories
- identify child_files
- compare the last_modified time of a directory or file to a cache of checks in the data warehouse
  - this should significantly accelerate subsequent runs if it short-circuits update checks whenever a high-level directory's cached last_modified matches the value seen in the new check.
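
The node methods and the short-circuit idea above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: `Entry`, `find_updates`, and the `list_children` callable are hypothetical names, the cache is a plain dict standing in for the data warehouse table, and the listings are faked in memory so nothing hits www2.census.gov. It also assumes a directory's last_modified changes whenever anything beneath it changes, which would need to be verified against the real server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Entry:
    """One row of a directory listing (hypothetical shape)."""
    url: str
    last_modified: str
    is_dir: bool


def find_updates(
    entry: Entry,
    list_children: Callable[[str], List[Entry]],
    cache: Dict[str, str],
) -> List[Entry]:
    """Depth-first walk returning entries whose last_modified differs from
    the cached value (or that are missing from the cache)."""
    if cache.get(entry.url) == entry.last_modified:
        return []  # short circuit: nothing below this node gets scraped
    changed = [entry]
    if entry.is_dir:
        for child in list_children(entry.url):
            changed.extend(find_updates(child, list_children, cache))
    return changed


# Fake listings keyed by directory URL (stand-in for scraping a page).
BASE = "https://example.test/"
ROOT = Entry(BASE, "2023-04-10", True)
LISTINGS = {
    BASE: [
        Entry(BASE + "a/", "2023-04-01", True),
        Entry(BASE + "b/", "2023-04-10", True),
    ],
    BASE + "a/": [Entry(BASE + "a/x.csv", "2023-04-01", False)],
    BASE + "b/": [
        Entry(BASE + "b/y.csv", "2023-04-10", False),
        Entry(BASE + "b/z.pdf", "2023-03-01", False),
    ],
}
# Cache of last_modified values recorded by an earlier run.
CACHE = {
    BASE: "2023-04-01",
    BASE + "a/": "2023-04-01",
    BASE + "b/": "2023-03-01",
    BASE + "b/z.pdf": "2023-03-01",
}

scraped: List[str] = []  # records which directory pages were "requested"


def list_children(url: str) -> List[Entry]:
    scraped.append(url)
    return LISTINGS.get(url, [])


changed_urls = [e.url for e in find_updates(ROOT, list_children, CACHE)]
```

In this toy tree the unchanged `a/` directory is never listed, so only the root and `b/` pages are "requested".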

The census_metadata table will need columns:

- metadata_url
- last_modified
- size
- description
- is_dir
- is_file
- check_time
- updated_data_available

And likely some more fields as needed.
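
As a rough illustration of that table, here is a sketch using an in-memory SQLite database. The column names come from the list above; the column types, the NOT NULL choices, the example row, and the use of SQLite itself are all assumptions made only so the example runs anywhere, and the real warehouse schema may differ.

```python
import sqlite3

# Column names are from the list above; the types are assumptions.
DDL = """
CREATE TABLE census_metadata (
    metadata_url            TEXT NOT NULL,
    last_modified           TEXT,
    size                    TEXT,
    description             TEXT,
    is_dir                  INTEGER NOT NULL,
    is_file                 INTEGER NOT NULL,
    check_time              TEXT NOT NULL,
    updated_data_available  INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# One made-up row describing a directory seen during a check.
conn.execute(
    "INSERT INTO census_metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "https://example.test/some_dir/",
        "2023-04-01 12:00",
        None,
        None,
        1,
        0,
        "2023-04-10T17:53:24Z",
        0,
    ),
)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(census_metadata)")]
row_count = conn.execute("SELECT COUNT(*) FROM census_metadata").fetchone()[0]
```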

I don't want to automatically pull down all of the census data as that's a gigantic volume of data, and there are a lot of different file_types (most of which won't be handled until a need for the data or content arises).

MattTriano · 2023-04-11T13:43:11Z

My scraper for www2.census.gov metadata has been running since yesterday (I put in a sleep of 0.75 * random(0,1) + 0.5 seconds between requests to avoid hammering their server). It has collected over 1.15M URL endpoints so far, and it looks like I'm still a long way from scraping everything. Still, later runs shouldn't take anywhere near this long.
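
For reference, the delay formula works out to a wait of between 0.5 and 1.25 seconds per request. A minimal sketch, where the function names are hypothetical and `random(0,1)` is assumed to mean a uniform draw like Python's `random.random()`:

```python
import random
import time


def request_delay(rng: random.Random) -> float:
    """Seconds to wait: 0.75 * U[0, 1) + 0.5, i.e. between 0.5 and 1.25."""
    return 0.75 * rng.random() + 0.5


def sleep_between_requests(rng: random.Random) -> None:
    """Pause before the next request so the server isn't hammered."""
    time.sleep(request_delay(rng))


rng = random.Random(0)  # seeded only so the example is repeatable
delays = [request_delay(rng) for _ in range(1000)]
```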

I might want to use this metadata side only to short-circuit unnecessary data updates, and use the Census APIs (here's a json dict of endpoints) as the data source, rather than the endpoints from the metadata-side URL. I'll have to see how the data is formatted and organized, but using the APIs might let me sidestep a lot of explicit combining of files from the metadata-side endpoints.

MattTriano · 2023-04-11T13:48:18Z

Oh hey, the API json data menu (https://api.census.gov/data.json, .html and .xml also work) includes a modified field that contains datelike values. That would probably be much easier to work with than the stuff I built yesterday (although that work still has a lot of value as it will produce a full list of the pdfs documenting these data sets).
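
A sketch of how the modified field might drive update checks. Only the existence of a modified field with datelike values comes from the comment above; the top-level `dataset` list, the `title` key, the sample payload, and the assumption that each value starts with `YYYY-MM-DD` are all unverified guesses about the shape of the data, and `parse_modified` and `datasets_to_refresh` are hypothetical helpers.

```python
from datetime import date, datetime
from typing import Any, Dict, List, Optional


def parse_modified(value: Optional[str]) -> Optional[date]:
    """Parse the leading YYYY-MM-DD of a datelike string; None if it doesn't fit."""
    if not value:
        return None
    try:
        return datetime.strptime(value[:10], "%Y-%m-%d").date()
    except ValueError:
        return None


def datasets_to_refresh(catalog: Dict[str, Any], last_check: date) -> List[str]:
    """Titles of datasets modified after last_check.

    Entries with a missing or unparseable modified value are included too,
    so an unknown state is treated as "needs a check".
    """
    titles = []
    for dataset in catalog.get("dataset", []):
        modified = parse_modified(dataset.get("modified"))
        if modified is None or modified > last_check:
            titles.append(dataset.get("title", "<untitled>"))
    return titles


# Made-up payload in the assumed shape (not real Census data).
SAMPLE_CATALOG = {
    "dataset": [
        {"title": "Example dataset A", "modified": "2023-03-01 00:00:00.0"},
        {"title": "Example dataset B", "modified": "2022-01-15"},
        {"title": "Example dataset C"},
    ]
}

stale = datasets_to_refresh(SAMPLE_CATALOG, last_check=date(2023, 1, 1))
```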

MattTriano · 2023-04-17T12:11:22Z

The scraper ran through the week and I cut it off this morning. It has been scraping in a depth-first search pattern, and it still has the following URLs to scrape:

[
'https://www2.census.gov/2020Census',
'https://www2.census.gov/EEO_2006_2010',
'https://www2.census.gov/EEO_2014_2018',
'https://www2.census.gov/EEO_Disability_2008-2010',
'https://www2.census.gov/Econ2001_And_Earlier',
'https://www2.census.gov/about',
'https://www2.census.gov/acs',
'https://www2.census.gov/acs2002',
'https://www2.census.gov/acs2003',
'https://www2.census.gov/acs2004',
'https://www2.census.gov/acs2005',
'https://www2.census.gov/acs2005_2007_3yr',
'https://www2.census.gov/acs2005_2009_5yr',
'https://www2.census.gov/acs2006',
'https://www2.census.gov/acs2006_2008_3yr',
'https://www2.census.gov/acs2007_1yr',
'https://www2.census.gov/acs2007_2009_3yr',
'https://www2.census.gov/acs2007_3yr',
'https://www2.census.gov/acs2008_1yr',
'https://www2.census.gov/acs2008_3yr',
'https://www2.census.gov/acs2009_1yr',
'https://www2.census.gov/acs2009_3yr',
'https://www2.census.gov/acs2009_5yr',
'https://www2.census.gov/acs2010_1yr',
'https://www2.census.gov/acs2010_3yr',
'https://www2.census.gov/acs2010_5yr',
'https://www2.census.gov/acs2010_SPT_AIAN',
'https://www2.census.gov/acs2011_1yr',
'https://www2.census.gov/acs2011_3yr',
'https://www2.census.gov/acs2011_5yr',
'https://www2.census.gov/acs2012_1yr',
'https://www2.census.gov/acs2012_3yr',
'https://www2.census.gov/acs2012_5yr',
'https://www2.census.gov/acs2013_1yr',
'https://www2.census.gov/acs2013_3yr',
'https://www2.census.gov/acs2013_5yr',
'https://www2.census.gov/acs_latest_data',
'https://www2.census.gov/acs_special_tabs',
'https://www2.census.gov/adrm',
'https://www2.census.gov/cac',
'https://www2.census.gov/census_1940',
'https://www2.census.gov/census_1980',
'https://www2.census.gov/census_1990',
'https://www2.census.gov/census_2000',
'https://www2.census.gov/census_2010',
'https://www2.census.gov/ces',
'https://www2.census.gov/data',
'https://www2.census.gov/decennial',
'https://www2.census.gov/desen002',
'https://www2.census.gov/dssd',
'https://www2.census.gov/econ',
'https://www2.census.gov/econ1977',
'https://www2.census.gov/econ1982',
'https://www2.census.gov/econ1987',
'https://www2.census.gov/econ1992',
'https://www2.census.gov/econ1997',
'https://www2.census.gov/econ2002',
'https://www2.census.gov/econ2003',
'https://www2.census.gov/econ2004',
'https://www2.census.gov/econ2005',
'https://www2.census.gov/econ2006',
'https://www2.census.gov/econ2007',
'https://www2.census.gov/econ2008',
'https://www2.census.gov/econ2009',
'https://www2.census.gov/econ2010',
'https://www2.census.gov/econ2011',
'https://www2.census.gov/econ2012',
'https://www2.census.gov/econ2013',
'https://www2.census.gov/econ2014',
'https://www2.census.gov/econ2015',
'https://www2.census.gov/econ2016',
'https://www2.census.gov/econ2017',
'https://www2.census.gov/foia',
'https://www2.census.gov/geo/maps/DC2010',
'https://www2.census.gov/geo/maps/DC2020/ACO20',
'https://www2.census.gov/geo/maps/DC2020/AIANWall2020',
'https://www2.census.gov/geo/maps/DC2020/DC20BLK',
'https://www2.census.gov/geo/maps/DC2020/IFAC',
'https://www2.census.gov/geo/maps/DC2020/MCS',
'https://www2.census.gov/geo/maps/DC2020/PL20',
'https://www2.census.gov/geo/maps/DC2020/PL20Proto',
'https://www2.census.gov/geo/maps/DC2020/PSAPV',
'https://www2.census.gov/geo/maps/DC2020/PUMA',
'https://www2.census.gov/geo/maps/DC2020/PopCenter',
'https://www2.census.gov/geo/maps/DC2020/PopDist_Nighttime',
'https://www2.census.gov/geo/maps/DC2020/SLD_RefMap',
'https://www2.census.gov/geo/maps/DC2020/SR20',
'https://www2.census.gov/geo/maps/DC2020/TEA'
]
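
A list of remaining URLs like the one above is what an explicit-stack depth-first crawl leaves behind when it is stopped early, which also makes the crawl resumable. This is a minimal sketch under that assumption, not the scraper's actual code: `crawl_depth_first`, `list_child_dirs`, and the toy `TREE` are hypothetical, and the listing is faked in memory.

```python
from typing import Callable, Dict, List, Tuple


def crawl_depth_first(
    pending: List[str],
    list_child_dirs: Callable[[str], List[str]],
    max_requests: int,
) -> Tuple[List[str], List[str]]:
    """Visit URLs depth-first; return (visited, still_pending)."""
    stack = list(reversed(pending))  # the end of the list is the top of the stack
    visited: List[str] = []
    while stack and len(visited) < max_requests:
        url = stack.pop()
        visited.append(url)
        # Push children reversed so the first child is visited next.
        stack.extend(reversed(list_child_dirs(url)))
    return visited, list(reversed(stack))


# Toy directory tree standing in for scraped listings.
TREE: Dict[str, List[str]] = {
    "root": ["root/a", "root/b", "root/c"],
    "root/a": ["root/a/1", "root/a/2"],
}


def list_child_dirs(url: str) -> List[str]:
    return TREE.get(url, [])


# Stop after three requests, then resume from the saved pending list.
first_run, remaining = crawl_depth_first(["root"], list_child_dirs, max_requests=3)
second_run, leftover = crawl_depth_first(remaining, list_child_dirs, max_requests=10)
```

Saving `remaining` when the crawl is cut off means a later run can pick up where the first one stopped instead of starting again from the root.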

MattTriano mentioned this issue Apr 17, 2023

Implement a metadata collector for Census data sets #115

Closed
