GEOMG is a custom tool that functions as a backend metadata editor and manager for the GeoBlacklight application.

It is used by BTAA-GIN Operations technical staff at the University of Minnesota.

The BTAA Geoportal Lead Developer, Eric Larson, created GEOMG with direction from the BTAA-GIN. It is based on the Kithe framework.

Can other GeoBlacklight projects adopt it?

We are currently working on offering this tool as a plugin for GeoBlacklight. In the meantime, this presentation describes the motivation for building the tool and includes a few screencasts showing how it works:
This page describes some of the processes and terminology associated with extracting metadata from various sources.
Web scraping is the process of programmatically collecting and extracting information from websites using automated scripts or bots. Common web scraping tools include pandas, Beautiful Soup, and wget.
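As an illustration, here is a minimal scraping sketch using the requests and Beautiful Soup libraries. The URL is a placeholder, and a real script should also respect the target site's robots.txt and terms of use:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: list every hyperlink on a page.
# Replace the placeholder URL with a real target.
url = "https://example.org/datasets"
html = requests.get(url, timeout=30).text

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"], "-", link.get_text(strip=True))
```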
Data harvesting refers to the process of collecting large volumes of data from various sources, such as websites, social media, or other online platforms. This can involve using automated scripts or tools to extract structured or unstructured data, such as text, images, videos, or other types of content. The collected data can be used for various purposes, such as data analysis or content aggregation.

Metadata harvesting refers specifically to the process of collecting metadata from digital resources, such as websites, online databases, or digital libraries. Metadata is information that describes other data, such as the title, author, date, format, or subject of a document. Metadata harvesting involves extracting this information from digital resources and storing it in a structured format, such as a database or a metadata record.

Metadata harvesting is often used in the context of digital libraries, archives, or repositories, where the metadata is used to organize and manage large collections of digital resources. Metadata harvesting can be done using specialized tools or protocols, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which is a widely used standard for sharing metadata among digital repositories.
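For example, an OAI-PMH repository can be queried with plain HTTP requests, since the protocol defines verbs such as ListRecords. This sketch uses only the requests library and the standard XML parser; the endpoint URL is hypothetical:

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH base URL; substitute a real repository endpoint.
BASE_URL = "https://example.org/oai"

# ListRecords with the Dublin Core prefix is defined by the OAI-PMH spec.
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

root = ET.fromstring(response.content)
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Print the Dublin Core title of each harvested record.
for record in root.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    print(title.text if title is not None else "(no title)")
```

A real harvest would also handle the resumptionToken element that OAI-PMH uses to page through large result sets.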
The terms "harvesting" and "scraping" are often used interchangeably. However, there may be subtle differences in the way these terms are used, depending on the context.

In general, scraping refers to the process of programmatically extracting data from websites using automated scripts or bots. The term "scraping" often implies a more aggressive approach, where data is extracted without explicit permission from the website owner. Scraping may involve parsing HTML pages, following links, and using techniques such as web crawling or screen scraping to extract data from websites.

On the other hand, harvesting may refer to a more structured and systematic approach to extracting data from websites. The term "harvesting" often implies a more collaborative approach, where data is extracted with the explicit permission of the website owner or through APIs or web services provided by the website. Harvesting may involve using specialized software or tools to extract metadata, documents, or other resources from websites.
Web parsing refers to the process of scanning structured documents and extracting information. Although it usually refers to parsing HTML pages, it can also describe parsing XML or JSON documents. Tools designed for this purpose, such as Beautiful Soup, are often called "parsers."
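To show the distinction, the following sketch parses a small XML document without any network retrieval. The element names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Parsing works on document structure, independent of how the document
# was fetched. The element names here are hypothetical.
xml_doc = """
<dataset>
  <title>Hydrography</title>
  <format>Shapefile</format>
</dataset>
"""

root = ET.fromstring(xml_doc)
print(root.findtext("title"))   # Hydrography
print(root.findtext("format"))  # Shapefile
```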
ETL (Extract, Transform, Load) is a process of extracting data from one or more sources, transforming it to fit the target system's data model, and then loading it into the target system, such as a database or a data warehouse.

The ETL process typically involves three main stages:

- Extract: pull the raw data from one or more source systems.
- Transform: clean, normalize, and reshape the data to fit the target system's data model.
- Load: write the transformed data into the target system, such as a database or a data warehouse.
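A compact sketch of the three stages, using only the Python standard library; the file name, column name, and table layout are hypothetical:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize the hypothetical "title" column to match
    # the target system's conventions.
    return [{"title": row["title"].strip().title()} for row in rows]

def load(rows, db_path="catalog.db"):
    # Load: write the transformed rows into a SQLite table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records (title TEXT)")
    con.executemany("INSERT INTO records (title) VALUES (:title)", rows)
    con.commit()
    con.close()

load(transform(extract("records.csv")))
```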
Version 1.0 - April 18, 2018

Deprecated

This document has been replaced by the BTAA-GIN Scanning Guidelines for ArcGIS Hubs, version 2.0.
This document describes the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Open Data Portal platform and/or the BTAA GDP harvesting and remediation processes.
Policy: Any ArcGIS Open Data Portal (Arc-ODP) that serves public geospatial data is eligible for inclusion in the BTAA GDP Geoportal (“the Geoportal”). However, preference is given to portals that host original layers rather than federated portals that aggregate from other sites. Task Force members are responsible for finding and submitting Arc-ODPs for inclusion in the Geoportal. Each Arc-ODP will be assigned a Provenance value corresponding to the university that submitted it or that is geographically closest to the site.
Explanation: In order to avoid duplication, records that appear in multiple Arc-ODPs should only be accessioned from one instance. This also helps to avoid harvesting records that may be out of date or not yet aggregated within federated portals. Although the technical workers at the University of Minnesota will be performing the metadata processing, the Task Force members are expected to periodically monitor their records and make suggestions for edits or additions.
Policy: The only records that will be harvested from Arc-ODPs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following record types will not be accessioned on a regular basis:
Explanation: Arc-ODPs are structured to automatically create item records from submitted Esri REST services. However, Arc-ODP administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning. If these types of resources are approved by the Metadata Coordinator, they may be processed separately from the regular accessions.

Policy: The Arc-ODPs included in the Geoportal will be queried monthly to check for deleted and new items. The results of this query will be logged. Deleted items will be removed from the Geoportal immediately. New records from the Arc-ODPs will be accessioned and processed within two months of harvesting the metadata.

Explanation: Removing broken links is a priority for maintaining a positive user experience. However, accessioning and processing records requires remediation work that necessitates a variable time frame.

The records will be processed by the Metadata Coordinator and available UMN Graduate Research Assistants. The following metadata remediation steps will be undertaken:
Policy: Creating or linking to standards-based metadata files for Arc-ODPs is out of scope at this time.
Explanation: If metadata is enabled for an Arc-ODP, it will be available as ArcGIS Metadata Format 1.0 in XML, which is not a schema that GeoBlacklight can display. The metadata may also be available as FGDC or ISO HTML pages, but these types of links are not part of the current GeoBlacklight schema. Further, very few Arc-ODPs are taking advantage of this feature at this time.
Version 2.0 - April 24, 2023
This document describes the BTAA-GIN Scanning Guidelines for ArcGIS Hubs, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Hub platform and/or the BTAA-GIN harvesting and remediation processes.
Guideline: Any ArcGIS Hub (Hub) that serves public geospatial data is eligible for inclusion in the BTAA Geoportal (“the Geoportal”). Our scope includes public Hubs from the states in the BTAA geographic region and Hubs submitted by Team Members that are of interest to BTAA researchers.

Explanation: See the BTAA-GIN Collection Development Policy for more details.
Guideline: The only records that will be harvested from Hubs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following record types will not be accessioned on a regular basis:
Explanation: Hubs are structured to automatically create item records from submitted Esri REST services. However, Hub administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning.
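As a sketch of what a DCAT scan involves: many ArcGIS Hub sites publish a DCAT feed as JSON, commonly at /data.json, though the exact path can vary by site and API version. The Hub URL below is hypothetical:

```python
import requests

# Hypothetical Hub URL; the DCAT feed path may differ by site.
hub_url = "https://data.example.gov"
feed = requests.get(f"{hub_url}/data.json", timeout=30).json()

# Each DCAT dataset entry carries descriptive metadata and access links.
for dataset in feed.get("dataset", []):
    print(dataset.get("title"), "->", dataset.get("landingPage"))
```

A production harvest would then filter these entries down to eligible service types and remediate the metadata fields.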
Guideline: The Hubs included in the Geoportal will be scanned weekly to harvest complete lists of eligible records. The list will be published and will overwrite the previous scan.

Explanation: Broken links negatively impact user experience. Over the course of a week, as many as 10% of the ArcGIS Hub records in the Geoportal can break or include outdated information.
Guideline: The harvesting script, R-01_arcgis-hubs, programmatically performs all of the remediation for each record.
Explanation: We now scan a large number of ArcGIS Hubs, which makes manual remediation unrealistic. This is in contrast to the previous policy established in 2018, when our collection was smaller.
Info

This document replaces the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, version 1.0.