Develop a census data intake pipeline using the newly refactored flow #111
Regarding implementation of the metadata collector: I think I can achieve this by recursively scraping the "file system" descended from the root node (https://www2.census.gov/). For the data structure on my end, I think I'll build a tree where each node instance's id is its URL, and the node class has methods to:
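As a rough sketch of that tree idea (the `Node` and `LinkParser` names and the link-filtering rules are my assumptions, not the actual implementation), each node could be keyed by its URL and extract child URLs from the Apache-style directory listing HTML:

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect hrefs from a directory-listing page, skipping sort/parent links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Apache listings use "?C=..." sort links and "/" parent links;
                # real children are relative hrefs like "programs-surveys/"
                if name == "href" and value and not value.startswith(("?", "/", "#")):
                    self.links.append(value)


class Node:
    """Tree node whose id is its URL; children are discovered from listing HTML."""

    def __init__(self, url):
        self.url = url
        self.children = {}  # child URL -> Node

    def child_urls(self, listing_html):
        parser = LinkParser()
        parser.feed(listing_html)
        return [self.url.rstrip("/") + "/" + href for href in parser.links]
```

Directories end in `/` in the listing, so a node could tell directories from leaf files by inspecting the href suffix.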
The census_metadata table will need columns:
I don't want to automatically pull down all of the census data, as that's a gigantic volume of data, and there are a lot of different file_types (most of which won't be handled until a need for the data or content arises).
My scraper for www2.census.gov metadata has been running since yesterday (I put in a 0.75 * random(0,1) + 0.5 second sleep between requests to avoid just hammering their server), and I've collected over 1.15M URL endpoints; it looks like I'm still a long way from scraping everything. Still, it shouldn't take anywhere near this long in the future. I might want to just use this side to short-circuit unnecessary data updates and use the census APIs (here's a json dict of endpoints) to act as the data source (rather than the endpoints from the metadata-side URLs). I'll have to see how the data is formatted and organized, but using the APIs might allow me to sidestep a lot of explicit combining of files from the metadata-side endpoints.
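The throttled depth-first crawl described above can be sketched as follows. The `0.75 * random(0,1) + 0.5` formula gives a uniform 0.5 to 1.25 second delay between requests; `scrape_dfs` and its injectable `sleep` parameter are my own illustrative names, not the actual scraper's API:

```python
import random
import time


def request_delay():
    # The delay from the comment above: 0.75 * random(0,1) + 0.5,
    # i.e. uniform in [0.5, 1.25) seconds between requests
    return 0.75 * random.random() + 0.5


def scrape_dfs(root, fetch_children, limit=None, sleep=time.sleep):
    """Depth-first walk of the directory tree.

    fetch_children(url) -> list of child URLs (a network call in practice);
    `sleep` is injectable so tests can skip the real delay.
    """
    seen, stack = set(), [root]
    while stack and (limit is None or len(seen) < limit):
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        sleep(request_delay())  # be polite to the census servers
        stack.extend(fetch_children(url))
    return seen
```

At roughly 0.875 s average per request, 1.15M requests works out to around 11–12 days of wall time, which is consistent with the crawl still running after a week.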
Oh hey, the API json data menu (https://api.census.gov/data.json; .html and .xml also work) includes a
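Fetching and indexing that menu could look like the sketch below. I'm assuming the catalog follows the Project Open Data / DCAT shape ("dataset" entries with "distribution" lists containing "accessURL"); the function names are hypothetical:

```python
import json
from urllib.request import urlopen


def load_census_catalog(url="https://api.census.gov/data.json"):
    """Fetch the API discovery document (assumed Project Open Data catalog)."""
    with urlopen(url) as resp:
        return json.load(resp)


def api_endpoints(catalog):
    """Map dataset titles to API access URLs.

    Keys ('dataset', 'distribution', 'accessURL') are assumptions based on
    the Project Open Data schema, not confirmed against the live document.
    """
    endpoints = {}
    for ds in catalog.get("dataset", []):
        for dist in ds.get("distribution", []):
            if "accessURL" in dist:
                endpoints[ds.get("title", "")] = dist["accessURL"]
    return endpoints
```

An index like this could back the short-circuit idea: compare catalog entries against census_metadata rows and only refresh datasets that actually changed.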
The scraper ran through the week and I cut it off this morning. It still has the following URLs to scrape (see list below; it has been scraping in a depth-first search pattern) [
Try to match the task grouping from issue #107