From f0081fc1836b1b96e51ce9077e2775afa6e28616 Mon Sep 17 00:00:00 2001
From: dcwalk
Date: Wed, 1 Feb 2017 10:28:49 -0500
Subject: [PATCH] Minor formatting and copy updates to seeders

---
 seednsort.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/seednsort.md b/seednsort.md
index 17295a6..402a5b7 100644
--- a/seednsort.md
+++ b/seednsort.md
@@ -1,28 +1,28 @@
-## Seeding and Sorting Overview
+# Seeding and Sorting Overview
 
 ## What do Seeders/Sorters do?
 Seeders and Sorters canvass the resources of a given government agency, identifying important URLs. They identify whether those URLs can be crawled by the Internet Archive's webcrawler. If the URLs are crawlable, the Seeders/Sorters nominate them to the End-of-Term (EOT) project, otherwise they add them to the Uncrawlable spreadsheet using the project's Chrome Extension.
 
 ## Choosing the website
-The Seeders/Sorters team will use the EDGI subprimer systems ([found here](https://envirodatagov.org/agency-forecasts/)), or a similar set of resources, to identify important/at risk data. Talk to the DataRescue organizers to learn more.
+The Seeders/Sorters team will use the [EDGI subprimers](https://envirodatagov.org/agency-forecasts/), or a similar set of resources, to identify important/at risk data. Talk to the DataRescue organizers to learn more.
 
 ## Canvassing the website and evaluating content
 - Start exploring the website assigned, identifying important URLs.
 - Decide whether the data on a page or website subsection can be [automatically captured by the Internet Archive webcrawler](./what-heritrix-does.md).
 - The best source of information about the seeding and sorting process is represented at [https://envirodatagov.org/](https://envirodatagov.org/), see:
-- [Understanding What the Internet Archive Webcrawler Does](https://docs.google.com/document/d/1PeWefW2toThs-Pbw0CMv2us7wxQI0gRrP1LGuwMp_UQ/edit)
-- [Seeding the Internet Archive’s Webcrawler](https://docs.google.com/document/d/1qpuNCmBmu4KcsS_hE2srewcCiP4f9P5cCyDfHmsSAVU/edit))
+ - [Understanding What the Internet Archive Webcrawler Does](https://docs.google.com/document/d/1PeWefW2toThs-Pbw0CMv2us7wxQI0gRrP1LGuwMp_UQ/edit)
+ - [Seeding the Internet Archive’s Webcrawler](https://docs.google.com/document/d/1qpuNCmBmu4KcsS_hE2srewcCiP4f9P5cCyDfHmsSAVU/edit)
 
-## Crawlable URLs
+### Crawlable URLs
 - URLs judged to be possibly crawlable are "nominated" (equivalently, "seeded") to the End-Of-Term project (EOT), using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en) or [bookmarklet](http://digital2.library.unt.edu/nomination/eth2016/about/). **Wherever possible, add in the Agency Office Code.** Talk to the DataRescue organizers to learn more.
 
-## Uncrawlable URLs
+### Uncrawlable URLs
 - If URL is judged not crawlable, add it to the "Uncrawlable" spreadsheet through the Chrome Extension.
 - In the spreadsheet is automatically associated with a universal unique identifyer (UUID) that was generated in advance.
 - You can check whether the page or some files are rendered using the Internet Archive's [Wayback Machine Chrome Extension](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak)
 
-## Not sure?
+### Not sure?
 - This sorting is only provisional: when in doubt seeders nominate the URL **and** mark it as possibly not crawlable.