Merge pull request #30 from datarefugephilly/seeder-updates
Minor formatting and copy updates to seeders
dcwalk authored Feb 3, 2017
2 parents 58d62a9 + f0081fc commit c9659f4
Showing 1 changed file, seednsort.md, with 7 additions and 7 deletions.
# Seeding and Sorting Overview

## What do Seeders/Sorters do?
Seeders and Sorters canvass the web resources of a given government agency, identifying important URLs and determining whether those URLs can be captured by the Internet Archive's webcrawler. If a URL is crawlable, the Seeders/Sorters nominate it to the End-of-Term (EOT) project; otherwise, they add it to the Uncrawlable spreadsheet using the project's Chrome Extension.

## Choosing the website
The Seeders/Sorters team will use the [EDGI subprimers](https://envirodatagov.org/agency-forecasts/), or a similar set of resources, to identify important/at-risk data. Talk to the DataRescue organizers to learn more.

## Canvassing the website and evaluating content
- Start exploring the assigned website, identifying important URLs.
- Decide whether the data on a page or website subsection can be [automatically captured by the Internet Archive webcrawler](./what-heritrix-does.md).
- The best source of information about the seeding and sorting process is [https://envirodatagov.org/](https://envirodatagov.org/); see:
- [Understanding What the Internet Archive Webcrawler Does](https://docs.google.com/document/d/1PeWefW2toThs-Pbw0CMv2us7wxQI0gRrP1LGuwMp_UQ/edit)
- [Seeding the Internet Archive’s Webcrawler](https://docs.google.com/document/d/1qpuNCmBmu4KcsS_hE2srewcCiP4f9P5cCyDfHmsSAVU/edit)
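
Whether a webcrawler can reach a page depends on several factors (dynamically generated content, forms, and databases are common blockers); one quick illustrative signal — not part of the official workflow — is whether the site's `robots.txt` disallows crawling. A minimal sketch using only Python's standard library, with a hypothetical hard-coded `robots.txt` (a real check would fetch the agency site's own file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; a real check would
# fetch https://<agency-site>/robots.txt instead of a hard-coded string.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A generic crawler ("*") may fetch public pages but not /private/ ones.
print(parser.can_fetch("*", "https://www.example.gov/data/report.html"))  # True
print(parser.can_fetch("*", "https://www.example.gov/private/internal"))  # False
```

Note that a permissive `robots.txt` does not guarantee crawlability: content behind search forms or database queries can still be unreachable for the crawler.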

### Crawlable URLs
- URLs judged to be possibly crawlable are "nominated" (equivalently, "seeded") to the End-Of-Term project (EOT), using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en) or
[bookmarklet](http://digital2.library.unt.edu/nomination/eth2016/about/).

**Wherever possible, add in the Agency Office Code.** Talk to the DataRescue organizers to learn more.

### Uncrawlable URLs
- If a URL is judged not crawlable, add it to the "Uncrawlable" spreadsheet through the Chrome Extension.
- In the spreadsheet, each entry is automatically associated with a universally unique identifier (UUID) that was generated in advance.
- You can check whether the page or some of its files are already rendered using the Internet Archive's [Wayback Machine Chrome Extension](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak).
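
The pre-generated UUIDs attached to spreadsheet rows are produced by the project's own tooling; purely for illustration, identifiers of the same kind can be generated with Python's standard library (a sketch, not the project's actual generation code):

```python
import uuid

# Generate a random (version 4) universally unique identifier,
# the kind of value used to tag each uncrawlable-URL row in advance.
row_id = uuid.uuid4()

# UUIDs render as 36-character hyphenated hex strings,
# e.g. 9f1c2e4a-7b3d-4c5e-8f6a-1d2b3c4d5e6f (value differs every run).
print(row_id)
```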

### Not sure?
- This sorting is only provisional: when in doubt, seeders nominate the URL **and** mark it as possibly not crawlable.
