Improve seeders' understanding of what a data set is #9

Open
postfalk opened this issue Feb 11, 2017 · 1 comment

Comments

@postfalk

Currently, I see two different problems in the seeded URLs:

  1. News releases ABOUT data rather than the data itself. It would be better to track down the actual data set in the thicket of government web pages, even though that can be tedious at times.

  2. Some people log entire data portals. It would be better to break these down into single data products. (It might be nice to record the relations between them for code reuse.)

I would identify data products by the presence of metadata, scientific citations, and methods documents. I am aware that this might not always be feasible, but in many cases it would make the scraping task more manageable. It would also be great to store these documents alongside the data set.
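
As a rough illustration of the heuristic above, here is a minimal Python sketch that checks the link texts found on a seeded page for the three signals (metadata, citation, methods documents). The names `SIGNALS`, `find_signals`, and `looks_like_data_product`, the keyword patterns, and the threshold of two signals are all assumptions made for this example, not part of any existing seeder tool.

```python
import re

# Hypothetical keyword patterns; a real seeder would need to tune these per agency.
SIGNALS = {
    "metadata": re.compile(r"metadata|\.xml$|data dictionary", re.I),
    "citation": re.compile(r"\bdoi\b|cite|citation", re.I),
    "methods": re.compile(r"method|protocol|readme", re.I),
}


def find_signals(link_texts):
    """Return the set of signal names matched by any link text on a page."""
    found = set()
    for text in link_texts:
        for name, pattern in SIGNALS.items():
            if pattern.search(text):
                found.add(name)
    return found


def looks_like_data_product(link_texts, minimum=2):
    """Guess whether a page is a single data product (assumed threshold: 2 signals)."""
    return len(find_signals(link_texts)) >= minimum


# Example link texts, invented for illustration.
product_page = [
    "Download CSV",
    "FGDC metadata (XML)",
    "Methods documentation",
    "How to cite this data set",
]
news_page = ["Press release", "Contact us", "Read more"]

print(looks_like_data_product(product_page))
print(looks_like_data_product(news_page))
```

The matched link texts could also serve as the list of companion documents to store alongside the data set. A page that matches no signals is not necessarily a news release, so this would at best flag URLs for a human to review.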

@suchthis

Thanks for the feedback!
