Minor formatting and copy updates to seeders
Showing 1 changed file with 7 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,28 +1,28 @@ | ||
## Seeding and Sorting Overview | ||
# Seeding and Sorting Overview | ||
|
||
## What do Seeders/Sorters do? | ||
Seeders and Sorters canvass the resources of a given government agency, identifying important URLs and determining whether those URLs can be crawled by the Internet Archive's webcrawler. If the URLs are crawlable, the Seeders/Sorters nominate them to the End-of-Term (EOT) project; otherwise, they add them to the Uncrawlable spreadsheet using the project's Chrome Extension. | ||
|
||
## Choosing the website | ||
The Seeders/Sorters team will use the EDGI subprimer systems ([found here](https://envirodatagov.org/agency-forecasts/)), or a similar set of resources, to identify important/at risk data. Talk to the DataRescue organizers to learn more. | ||
The Seeders/Sorters team will use the [EDGI subprimers](https://envirodatagov.org/agency-forecasts/), or a similar set of resources, to identify important/at risk data. Talk to the DataRescue organizers to learn more. | ||
|
||
## Canvassing the website and evaluating content | ||
- Start exploring the website assigned, identifying important URLs. | ||
- Decide whether the data on a page or website subsection can be [automatically captured by the Internet Archive webcrawler](./what-heritrix-does.md). | ||
- The best source of information about the seeding and sorting process is available at [https://envirodatagov.org/](https://envirodatagov.org/); see: | ||
- [Understanding What the Internet Archive Webcrawler Does](https://docs.google.com/document/d/1PeWefW2toThs-Pbw0CMv2us7wxQI0gRrP1LGuwMp_UQ/edit) | ||
- [Seeding the Internet Archive’s Webcrawler](https://docs.google.com/document/d/1qpuNCmBmu4KcsS_hE2srewcCiP4f9P5cCyDfHmsSAVU/edit)) | ||
- [Understanding What the Internet Archive Webcrawler Does](https://docs.google.com/document/d/1PeWefW2toThs-Pbw0CMv2us7wxQI0gRrP1LGuwMp_UQ/edit) | ||
- [Seeding the Internet Archive’s Webcrawler](https://docs.google.com/document/d/1qpuNCmBmu4KcsS_hE2srewcCiP4f9P5cCyDfHmsSAVU/edit) | ||
|
||
## Crawlable URLs | ||
### Crawlable URLs | ||
- URLs judged to be possibly crawlable are "nominated" (equivalently, "seeded") to the End-Of-Term project (EOT), using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en) or | ||
[bookmarklet](http://digital2.library.unt.edu/nomination/eth2016/about/). | ||
|
||
**Wherever possible, add in the Agency Office Code.** Talk to the DataRescue organizers to learn more. | ||
|
||
## Uncrawlable URLs | ||
### Uncrawlable URLs | ||
- If a URL is judged not crawlable, add it to the "Uncrawlable" spreadsheet through the Chrome Extension. | ||
- In the spreadsheet, the URL is automatically associated with a universally unique identifier (UUID) that was generated in advance. | ||
- You can check whether the page or some files are rendered using the Internet Archive's [Wayback Machine Chrome Extension](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak) | ||
|
||
## Not sure? | ||
### Not sure? | ||
- This sorting is only provisional: when in doubt, seeders nominate the URL **and** mark it as possibly not crawlable.
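
The sorting rule described in the three subsections above can be sketched as a small decision function. This is only an illustration of the logic, not part of any EDGI tooling: the names `Verdict`, `SortResult`, and `sort_url` are hypothetical, and the pool of pre-generated UUIDs is passed in as a plain iterator. The source does not say whether an "unsure" URL also gets a spreadsheet row, so the sketch only sets a flag in that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Optional


class Verdict(Enum):
    """Hypothetical labels for a seeder's judgment about a URL."""
    CRAWLABLE = "crawlable"
    UNCRAWLABLE = "uncrawlable"
    UNSURE = "unsure"


@dataclass
class SortResult:
    url: str
    nominate: bool                        # nominate ("seed") to the EOT project
    possibly_uncrawlable: bool = False    # flag set when the seeder is unsure
    uncrawlable_id: Optional[str] = None  # UUID of the Uncrawlable spreadsheet row


def sort_url(url: str, verdict: Verdict, id_pool: Iterator[str]) -> SortResult:
    """Apply the provisional sorting rule to one URL.

    `id_pool` stands in for the UUIDs that were generated in advance.
    """
    if verdict is Verdict.CRAWLABLE:
        return SortResult(url, nominate=True)
    if verdict is Verdict.UNCRAWLABLE:
        # Not nominated; recorded in the spreadsheet under the next UUID.
        return SortResult(url, nominate=False, uncrawlable_id=next(id_pool))
    # Unsure: nominate the URL *and* mark it as possibly not crawlable.
    return SortResult(url, nominate=True, possibly_uncrawlable=True)
```

In practice these steps are carried out by hand with the Chrome extensions linked above; the function only makes the branching explicit.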