diff --git a/GEOMG/index.html b/GEOMG/index.html index e683c097..31a1f677 100644 --- a/GEOMG/index.html +++ b/GEOMG/index.html @@ -1680,11 +1680,11 @@ -
GEOMG is a custom tool that functions as a backend metadata editor and manager for the GeoBlacklight application.
+GEOMG is a custom tool that functions as a backend metadata editor and manager for the GeoBlacklight application.
The BTAA Geoportal Lead Developer, Eric Larson, created GEOMG, with direction from the BTAA-GIN. It is based upon the Kithe framework.
+The BTAA Geoportal Lead Developer, Eric Larson, created GEOMG, with direction from the BTAA-GIN. It is based upon the Kithe framework.
Can other GeoBlacklight projects adopt it?
diff --git a/b1g-custom-elements/index.html b/b1g-custom-elements/index.html index aa7ad3d3..fff399dd 100644 --- a/b1g-custom-elements/index.html +++ b/b1g-custom-elements/index.html @@ -2734,7 +2734,7 @@The GeoBTAA Metadata Template (https://z.umn.edu/b1g-template) is a set of spreadsheets that are formatted for our metadata editor, GEOMG.
+The GeoBTAA Metadata Template (https://z.umn.edu/b1g-template) is a set of spreadsheets that are formatted for our metadata editor, GBL Admin.
Users will need to make a copy of the spreadsheet to use for editing. In some cases, the Metadata Coordinator can provide a customized version of the sheets for specific collections.
The Template contains the following tabs:
Note
-The input format for some fields in this template may differ from how the field is documented in OpenGeoMetadata. These differences are intended to make it easier to enter values, which will be transformed when we upload the record to GEOMG.
+The input format for some fields in this template may differ from how the field is documented in OpenGeoMetadata. These differences are intended to make it easier to enter values, which will be transformed when we upload the record to GBL Admin.
Bounding Box coordinates should be entered as W,S,E,N
. The coordinates are automatically transformed to a different order ENVELOPE(W,E,N,S)
. Read more under the Local Input Guidelines.
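As an illustration of that reordering (a minimal sketch only, not the actual upload transformation; the helper name is hypothetical):
def bbox_to_envelope(bbox: str) -> str:
    """Convert a 'W,S,E,N' string into Solr ENVELOPE(W,E,N,S) syntax."""
    w, s, e, n = (part.strip() for part in bbox.split(","))
    return f"ENVELOPE({w},{e},{n},{s})"

print(bbox_to_envelope("-97.5,43.0,-89.0,49.5"))  # ENVELOPE(-97.5,-89.0,49.5,43.0)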
Team Members in the Big Ten Academic Alliance Geospatial Information Network (BTAA-GIN)
+Team Members in the Big Ten Academic Alliance Geospatial Information Network (BTAA-GIN)
Development & Operations Staff in the BTAA-GIN
@@ -1810,7 +1810,7 @@Changes for Version 4.3 (August 15, 2022)
Changes for Version 4.2 (March 24, 2022)
Changes for Version 4.0 (July 2021)
Changes for version 3.3 (May 13, 2020)
diff --git a/lifecycle/index.html b/lifecycle/index.html index 968c0bd6..9d45986d 100644 --- a/lifecycle/index.html +++ b/lifecycle/index.html @@ -1742,7 +1742,7 @@This stage involves batch processing of the records, including harvesting, transformations, and crosswalking information. This stage is carried out by the Metadata Coordinator, who may contact Team members for assistance.
-Regardless of the method used for acquiring the metadata, it is always transformed into a spreadsheet for editing. These spreadsheets are uploaded to GEOMG Metadata Editor.
+Regardless of the method used for acquiring the metadata, it is always transformed into a spreadsheet for editing. These spreadsheets are uploaded to GBL Admin Metadata Editor.
Because of the variety of platforms and standards, this process can take many forms. The Metadata Coordinator will contact Team members if they need to supply metadata directly.
Once the metadata is in spreadsheet form, it is ready to be normalized and augmented. UMN Staff will add template information and use spreadsheet functions or scripts to programmatically complete the metadata records.
@@ -1751,8 +1751,8 @@Once the editing spreadsheets are completed, UMN Staff uploads the records to GEOMG
, a metadata management tool. GEOMG validates records and performs any needed field transformations. Once the records are satisfactory, they are published and available in the BTAA Geoportal.
Read more on the GEOMG documentation page.
+Once the editing spreadsheets are completed, UMN Staff uploads the records to GBL Admin
, a metadata management tool. GBL Admin validates records and performs any needed field transformations. Once the records are satisfactory, they are published and available in the BTAA Geoportal.
Read more on the GBL Admin documentation page.
General Maintenance
All project team members are encouraged to review the geoportal records assigned to their institutions periodically to check for issues. Use the feedback form at the top of each page in the geoportal to report errors or suggestions. This submission will include the URL of the last page you were on, and it will be sent to the Metadata Coordinator.
diff --git a/recipes/R-01_arcgis-hubs/index.html b/recipes/R-01_arcgis-hubs/index.html index 142aec1a..bb1078fb 100644 --- a/recipes/R-01_arcgis-hubs/index.html +++ b/recipes/R-01_arcgis-hubs/index.html @@ -1846,7 +1846,7 @@-
This recipe includes steps that use the GBL Admin toolkit. Access to this tool is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
+This recipe includes steps that use the GBL Admin toolkit. Access to this tool is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB
@@ -1872,10 +1872,10 @@ PurposeStep 1: Download the list of active ArcGIS Hubs¶
-We maintain a list of active ArcGIS Hub sites in GEOMG.
+We maintain a list of active ArcGIS Hub sites in GBL Admin.
- Go to the Admin (https://geo.btaa.org/admin) dashboard
@@ -1889,7 +1889,7 @@ Step 1: Download the lis
Info
-Exporting from GEOMG will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
+Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the "Is Part Of" field for each dataset.
- Title: The name of the Hub. This is transferred to the "Provider" field for each dataset
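As a sketch, the exported CSV can be trimmed down to just those two fields with pandas (the file names and the exact column labels "ID" and "Title" are assumptions):
import pandas as pd

# Hypothetical file names; adjust to match the actual GBL Admin export.
hubs = pd.read_csv("gbl_admin_export.csv", usecols=["ID", "Title"])
hubs.to_csv("active_hubs.csv", index=False)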
@@ -1921,7 +1921,7 @@ Troubleshooting (as needed)Update ArcGIS Hubs list page for more guidance on how to edit the website record.
-- If a site is missing: Unpublish it from GEOMG, indicate the Date Retired, and make a note in the Status field.
+- If a site is missing: Unpublish it from GBL Admin, indicate the Date Retired, and make a note in the Status field.
- If a site is still live, but the JSON API link is not working: remove the value "DCAT US 1.1" from the Accrual Method field and make a note in the Status field.
- If the site has moved to a new URL, update the website record with the new information.
@@ -1929,7 +1929,7 @@ Troubleshooting (as needed)Step 3: Validate and Clean¶
-
Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GEOMG.
+Although the harvest notebook will produce valid metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GBL Admin.
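For illustration only (this is not the project's cleaning script), a minimal spot-check of the harvested CSV might look like the following, assuming pandas and column names from the GeoBTAA template such as "Title" and "Bounding Box":
import pandas as pd

records = pd.read_csv("arcgis_harvest.csv")  # hypothetical file name

# Flag rows missing values that the upload is likely to reject.
problems = records[records["Title"].isna() | records["Bounding Box"].isna()]
print(f"{len(problems)} record(s) need attention before upload")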
Step 4: Upload all records¶
- Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
diff --git a/recipes/R-02_socrata/index.html b/recipes/R-02_socrata/index.html
index 3c926386..41593dc1 100644
--- a/recipes/R-02_socrata/index.html
+++ b/recipes/R-02_socrata/index.html
@@ -1391,9 +1391,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1814,9 +1814,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1847,7 +1847,7 @@
Purpose
-This recipe includes steps that use the metadata toolkit GEOMG. Access to GEOMG is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
+This recipe includes steps that use the metadata toolkit GBL Admin. Access to GBL Admin is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB
@@ -1861,7 +1861,7 @@ PurposePurposeStep: Download the list of active Socrata Data Portals¶
-We maintain a list of active Socrata Hub sites in GEOMG.
+We maintain a list of active Socrata Hub sites in GBL Admin.
-- Go to the GEOMG dashboard
+- Go to the GBL Admin dashboard
- Use the Advanced Search to filter for items with these parameters:
- Format: "Socrata data portal"
@@ -1885,7 +1885,7 @@ Step: Download th
Info
-Exporting from GEOMG will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
+Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the "Is Part Of" field for each dataset.
- Title: The name of the Hub. This is transferred to the "Provider" field for each dataset
@@ -1906,14 +1906,14 @@ Troubleshooting (as needed)GBL Admin and indicate the Date Retired, and make a note in the Status field.
Start over from Step 1.
Step 3: Validate and Clean¶
-Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GEOMG.
-Step 4: Upload to GEOMG¶
+Although the harvest notebook will produce valid metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GBL Admin.
+Step 4: Upload to GBL Admin¶
- Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
- Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
diff --git a/recipes/R-03_ckan/index.html b/recipes/R-03_ckan/index.html
index c83658ca..f04a233b 100644
--- a/recipes/R-03_ckan/index.html
+++ b/recipes/R-03_ckan/index.html
@@ -1911,9 +1911,9 @@ Step 3: Edit the metadata for ne
- Title: Concatenate values in the Alternative Title column with the Spatial Coverage of the dataset.
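A minimal sketch of that concatenation with pandas (the file name, the bracketed title format, and the literal column labels are assumptions):
import pandas as pd

items = pd.read_csv("allNewItems.csv")  # hypothetical file name

# Combine the two columns into a single Title; the separator shown is illustrative.
items["Title"] = items["Alternative Title"] + " [" + items["Spatial Coverage"] + "]"
items.to_csv("allNewItems_edited.csv", index=False)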
Step 4: Upload metadata for new records¶
-Open GEOMG and upload the new items found in reports/allNewItems_{today's date}.csv
+Open GBL Admin and upload the new items found in reports/allNewItems_{today's date}.csv
Step 5: Delete metadata for retired records¶
-Unpublish records found in reports/allDeletedItems_{today's date}.csv
. This can be done in GEOMG manually (one by one) or with the GEOMG documents update script.
+Unpublish records found in reports/allDeletedItems_{today's date}.csv
. This can be done in GBL Admin manually (one by one) or with the GBL Admin documents update script.
diff --git a/recipes/R-05_iiif/index.html b/recipes/R-05_iiif/index.html
index d15f2693..a581731d 100644
--- a/recipes/R-05_iiif/index.html
+++ b/recipes/R-05_iiif/index.html
@@ -1445,9 +1445,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1796,9 +1796,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1846,7 +1846,7 @@
Step 2: Run the extraction script
Step 3: Merge the metadata¶
Although the Jupyter Notebook extracts the metadata to a flat CSV, we still need to merge this with any existing metadata for the records.
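One way to sketch that merge with pandas (the file names and the shared "ID" join column are assumptions):
import pandas as pd

extracted = pd.read_csv("iiif_extract.csv")      # output of the notebook
existing = pd.read_csv("existing_metadata.csv")  # previously edited records

# Keep every extracted row and pull in matching existing values by ID.
merged = extracted.merge(existing, on="ID", how="left", suffixes=("", "_existing"))
merged.to_csv("merged_metadata.csv", index=False)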
-Step 4: Upload to GEOMG¶
+Step 4: Upload to GBL Admin¶
diff --git a/recipes/R-06_mods/index.html b/recipes/R-06_mods/index.html
index 7157cca9..8169bc0a 100644
--- a/recipes/R-06_mods/index.html
+++ b/recipes/R-06_mods/index.html
@@ -1466,9 +1466,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1796,9 +1796,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1837,7 +1837,7 @@
Step 2: Run the extraction script
Step 3: Format as OpenGeoMetadata¶
Manually adjust the column names to match the fields in our GeoBTAA metadata template.
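A sketch of that renaming step (the mapping shown is illustrative only, not the full MODS-to-GeoBTAA crosswalk):
import pandas as pd

records = pd.read_csv("mods_extract.csv")  # hypothetical file name

# Illustrative mapping; consult the GeoBTAA metadata template for the real labels.
records = records.rename(columns={"mods_title": "Title", "mods_abstract": "Description"})
records.to_csv("mods_geobtaa.csv", index=False)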
-Step 4: Upload to GEOMG¶
+Step 4: Upload to GBL Admin¶
diff --git a/recipes/R-07_ogm/index.html b/recipes/R-07_ogm/index.html
index 5833de05..31872201 100644
--- a/recipes/R-07_ogm/index.html
+++ b/recipes/R-07_ogm/index.html
@@ -1487,9 +1487,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1796,9 +1796,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1846,7 +1846,7 @@
Step 2: Run the conversion script
Step 3: Edit the output CSV¶
The GeoBTAA Metadata Profile may have additional or different requirements. Consult with the Product Manager on which fields may need augmentation.
-Step 4: Upload to GEOMG¶
+Step 4: Upload to GBL Admin¶
diff --git a/recipes/R-08_pasda/index.html b/recipes/R-08_pasda/index.html
index d8f0498d..7a9607d7 100644
--- a/recipes/R-08_pasda/index.html
+++ b/recipes/R-08_pasda/index.html
@@ -1517,9 +1517,9 @@
-
-
+
- Step 5: Upload the CSV to GEOMG
+ Step 5: Upload the CSV to GBL Admin
@@ -1814,9 +1814,9 @@
-
-
+
- Step 5: Upload the CSV to GEOMG
+ Step 5: Upload the CSV to GBL Admin
@@ -1873,9 +1873,9 @@
Step 3: Query t
It will also pull the geometry type and keywords, if available.
Step 4: Add default and calculated values¶
This step will clean up the harvested metadata and add our administrative values to each row. At the end, there will be a CSV file in your directory named for today's date.
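As an illustration of that pattern (hypothetical file name and defaults; the recipe's script defines its own values):
from datetime import date
import pandas as pd

records = pd.read_csv("pasda_harvest.csv")  # hypothetical file name

# Add administrative defaults to every row.
records["Date Accessioned"] = date.today().isoformat()
records["Access Rights"] = "Public"

# Write a CSV named for today's date, e.g. 2023-09-25.csv
records.to_csv(f"{date.today().isoformat()}.csv", index=False)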
-Step 5: Upload the CSV to GEOMG¶
+Step 5: Upload the CSV to GBL Admin¶
-- Upload the new records to GEOMG
+- Upload the new records to GBL Admin
- Use the Date Accessioned field to search for records that were not present in the current harvest. Retire any records that have the code "08a-01" but were not part of this harvest.
diff --git a/recipes/R-09_umedia/index.html b/recipes/R-09_umedia/index.html
index 35f7ea82..0bc9df2c 100644
--- a/recipes/R-09_umedia/index.html
+++ b/recipes/R-09_umedia/index.html
@@ -1529,9 +1529,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1796,9 +1796,9 @@
-
-
+
- Step 4: Upload to GEOMG
+ Step 4: Upload to GBL Admin
@@ -1844,7 +1844,7 @@
Step 2: Run the harvesting scriptThe third code cell will ask for a date range. Select a month (in the form yyyy-mm
) based on the last time you ran the script.
Step 3: Edit the metadata¶
-Step 4: Upload to GEOMG¶
+Step 4: Upload to GBL Admin¶
diff --git a/recipes/add-bbox/index.html b/recipes/add-bbox/index.html
index 4faf7824..85ae5a4f 100644
--- a/recipes/add-bbox/index.html
+++ b/recipes/add-bbox/index.html
@@ -1812,7 +1812,7 @@ Option 2: Draw a shapePart C: Copy back to GeoBTAA metadata¶
- Click the “Copy to Clipboard” icon on the Klokan site.
-- Paste the coordinates into the Bounding Box field in the GeoBTAA metadata template or in the GEOMG metadata editor.
+- Paste the coordinates into the Bounding Box field in the GeoBTAA metadata template or in the GBL Admin metadata editor.
Programmatic method¶
OpenStreetMap offers an API that allows users to query place names and return a bounding box. Follow the Tutorial, Use OpenStreetMap to generate bounding boxes, for this method.
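A minimal sketch of that kind of query against the Nominatim search endpoint (the place name is only an example; see the linked tutorial for the project's actual workflow):
import requests

resp = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": "Clay County, Minnesota", "format": "json", "limit": 1},
    headers={"User-Agent": "btaa-geoportal-example"},  # Nominatim requests a UA string
    timeout=30,
)
south, north, west, east = resp.json()[0]["boundingbox"]
print(f"{west},{south},{east},{north}")  # W,S,E,N order used in the template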
diff --git a/recipes/secondary-tables/index.html b/recipes/secondary-tables/index.html
index d2703a93..52311a17 100644
--- a/recipes/secondary-tables/index.html
+++ b/recipes/secondary-tables/index.html
@@ -64,7 +64,7 @@
-
+
Skip to content
@@ -1772,11 +1772,11 @@
-How to upload links in secondary tables in GEOMG¶
+How to upload links in secondary tables in GBL Admin¶
- We use two compound metadata fields,
Multiple Download Links
and Institutional Access Links
, that include multiple links that are formatted with both a label + a link.
-- Because these fields are not regular JSON flat key:value pairs, they are stored in secondary tables within GEOMG.
-- When using GEOMG's Form view, these values can be entered by clicking into a side page linked from the record.
+- Because these fields are not regular JSON flat key:value pairs, they are stored in secondary tables within GBL Admin.
+- When using GBL Admin's Form view, these values can be entered by clicking into a side page linked from the record.
- For CSV uploads, these values are uploaded with a separate CSV from the one used for the main import template.
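As a purely hypothetical sketch of preparing such a separate CSV (the real column layout is defined by the GBL Admin import template, not by this example):
import csv

# Hypothetical columns: one row per label + link pair, keyed to the parent record.
rows = [
    {"record_id": "example-id", "label": "Shapefile", "url": "https://example.com/data.zip"},
    {"record_id": "example-id", "label": "GeoJSON", "url": "https://example.com/data.geojson"},
]

with open("multiple_download_links.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["record_id", "label", "url"])
    writer.writeheader()
    writer.writerows(rows)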
diff --git a/recipes/split-bbox/index.html b/recipes/split-bbox/index.html
index abf58cf5..4e56fdbd 100644
--- a/recipes/split-bbox/index.html
+++ b/recipes/split-bbox/index.html
@@ -1780,8 +1780,8 @@ ProblemSolution¶
One way to mitigate this is to create two bounding boxes for the OGM Aardvark Geometry
field. The Bounding Box value will be the same, but the Geometry field will have a multipolygon that is made up of two adjacent boxes.
The following script will scan a CSV of the records, identify which cross the 180th Meridian, and insert a multipolygon into a new column.
-The script was designed with the assumption that the input CSV will be in the OGM Aardvark format, likely exported from GEOMG. The CSV file must contain a field for Bounding Box
. It may contain a Geometry
field with some values that we do not want to overwrite.
-This script will create a new field called "Bounding Box (WKT)". Items that crossed the 180th Meridian will have a multipolygon in that field. Items that don't cross will not have a value in that field. Copy and paste only the new values into the Geometry
column and reupload the CSV to GEOMG.
+The script was designed with the assumption that the input CSV will be in the OGM Aardvark format, likely exported from GBL Admin. The CSV file must contain a field for Bounding Box
. It may contain a Geometry
field with some values that we do not want to overwrite.
+This script will create a new field called "Bounding Box (WKT)". Items that crossed the 180th Meridian will have a multipolygon in that field. Items that don't cross will not have a value in that field. Copy and paste only the new values into the Geometry
column and reupload the CSV to GBL Admin.
import csv
def split_coordinates(coordinate_str):
diff --git a/recipes/update-hub-list/index.html b/recipes/update-hub-list/index.html
index b948f9cf..00714d59 100644
--- a/recipes/update-hub-list/index.html
+++ b/recipes/update-hub-list/index.html
@@ -1659,9 +1659,9 @@
-
-
+
- How to create or troubleshoot an ArcGIS Hub website record in GEOMG
+ How to create or troubleshoot an ArcGIS Hub website record in GBL Admin
@@ -1757,9 +1757,9 @@
-
-
+
- How to create or troubleshoot an ArcGIS Hub website record in GEOMG
+ How to create or troubleshoot an ArcGIS Hub website record in GBL Admin
@@ -1796,19 +1796,19 @@
How to update the list of
Background¶
The BTAA Geoportal provides a central access point to find and browse public geospatial data. A large portion of these records come from ArcGIS Hubs that are maintained by states, counties, cities, and regional entities. These entities continually update the data and the website platforms. In turn, we need to continually update our links to these resources.
The ArcGIS Harvesting Recipe walks through how we programmatically query the ArcGIS Hubs to obtain the current list of datasets. This page describes how to keep our list of ArcGIS Hub websites updated.
-How to create or troubleshoot an ArcGIS Hub website record in GEOMG¶
+How to create or troubleshoot an ArcGIS Hub website record in GBL Admin¶
Info
The highlighted fields listed below are required for the ArcGIS harvesting script. If the script fails, check that these fields have been added.
-The underlined fields are used to query GEOMG and produce the list of Hubs that we regularly harvest.
+The underlined fields are used to query GBL Admin and produce the list of Hubs that we regularly harvest.
-Each Hub has its own entry in GEOMG. Manually create or update each record with the following parameters:
+Each Hub has its own entry in GBL Admin. Manually create or update each record with the following parameters:
- Title: The name of the site as shown on its homepage. This value will be transferred into the Provider field of each dataset.
- Description: Usually "Website for finding and accessing open data provided by " followed by the name of the administrative place or organization publishing the site. Additional descriptions are helpful.
- Language: 3-letter code as shown on the OpenGeoMetadata website.
- Publisher: The administrative place or organization publishing the site. This value will be concatenated into the title of each dataset. For place names, use the FAST format (e.g.
Minnesota--Clay County
).
-- Resource Class:
Websites
This value is used for filtering & finding the Hubs in GEOMG
+- Resource Class:
Websites
This value is used for filtering & finding the Hubs in GBL Admin
- Temporal Coverage:
Continually updated resource
- Spatial Coverage: Add place names using the FAST format as described for the B1G Profile.
- Bounding Box: If the Hub covers a specific area, include a bounding box for it using the manual method described in the Add bounding boxes recipe.
@@ -1817,9 +1817,9 @@ How
b0153110-e455-4ced-9114-9b13250a7093
(Research Institutes Geospatial Data Collection)
-- Format:
ArcGIS Hub
This value is used for filtering & finding the Hubs in GEOMG
+- Format:
ArcGIS Hub
This value is used for filtering & finding the Hubs in GBL Admin
- Links - Reference - "Full layer description" : link to the homepage for the Hub
-- ID and Code: Both of the values will be the same. Create a new code by following the description on the Code Naming Schema page. Use the Advanced Search in GEOMG to query which codes have already been used. If it is not clear what code to create, ask the Product Manager or use the UUID Generator website to create a random one. The ID value will be transferred into the Code field of each dataset.
+- ID and Code: Both of the values will be the same. Create a new code by following the description on the Code Naming Schema page. Use the Advanced Search in GBL Admin to query which codes have already been used. If it is not clear what code to create, ask the Product Manager or use the UUID Generator website to create a random one. The ID value will be transferred into the Code field of each dataset.
-
Identifier: If the record will be part of the monthly harvests, add this to the end of the baseUrl (usually the homepage): /api/feed/dcat-us/1.1.json
. The Identifier will be used to query the metadata for the website.
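For illustration, a short sketch of querying that Identifier with Python (assuming the standard DCAT-US JSON layout with a top-level "dataset" array; the base URL is hypothetical):
import requests

base_url = "https://data.example.gov"  # hypothetical Hub homepage
identifier = base_url + "/api/feed/dcat-us/1.1.json"

feed = requests.get(identifier, timeout=30).json()
for dataset in feed.get("dataset", []):
    print(dataset.get("title"))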
@@ -1830,12 +1830,12 @@ How
-
Access Rights: Public
for all ArcGIS Hubs.
-- Accrual Method:
DCAT US 1.1
. This value is used for filtering & finding the Hubs in GEOMG
+- Accrual Method:
DCAT US 1.1
. This value is used for filtering & finding the Hubs in GBL Admin
- Status: If the site is part of the ArcGIS Harvest, use the value
Indexed
. If the site is not part of the harvest, use Not indexed
. Other explanatory text can be included here, such as indicating if the site is broken.
- Publication State: When a new record is created it will automatically be assigned
Draft
. Change the state to published
when the metadata record is ready. If the site breaks or is deprecated, change this value to unpublished
.
How to remove broken or deprecated ArcGIS Hubs¶
-If a Hub site stalls the Harvesting script, it needs to be updated in GEOMG.
+If a Hub site stalls the Harvesting script, it needs to be updated in GBL Admin.
If the site is missing:¶
Try to find a replacement site. When a Hub is updated to a new version, sometimes the baseURL will change. If a new site is found, update:
diff --git a/resource-lifecycle/index.html b/resource-lifecycle/index.html
index a6bf998b..335bf9d4 100644
--- a/resource-lifecycle/index.html
+++ b/resource-lifecycle/index.html
@@ -1877,11 +1877,11 @@ 2. Harvest3. Edit¶
Graduate Research Assistants and Product Manager
When working with metadata, it is common to come across missing or corrupted values, which require troubleshooting and manual editing in our spreadsheets. Refer to the Collections Project Board for examples of this work.
-After compiling the metadata, we run a validation and cleaning script to ensure the records conform to the required elements of our schema. Finally, we upload the completed spreadsheet to GEOMG, which serves as the administrative interface for the Geoportal. If GEOMG detects any formatting errors, it will issue a warning and may reject the upload.
+After compiling the metadata, we run a validation and cleaning script to ensure the records conform to the required elements of our schema. Finally, we upload the completed spreadsheet to GBL Admin, which serves as the administrative interface for the Geoportal. If GBL Admin detects any formatting errors, it will issue a warning and may reject the upload.
4. Index¶
Product Manager
-Once the metadata is successfully uploaded to GEOMG, we can publish the records to the Geoportal. The technology that actually stores the records and enables searching is called Solr. The action of adding records is known as "Indexing."
-Periodically, we need to remove records from the Geoportal. To do this, we use GEOMG to either delete them or change their status to "unpublished."
+Once the metadata is successfully uploaded to GBL Admin, we can publish the records to the Geoportal. The technology that actually stores the records and enables searching is called Solr. The action of adding records is known as "Indexing."
+Periodically, we need to remove records from the Geoportal. To do this, we use GBL Admin to either delete them or change their status to "unpublished."
5. Maintain¶
BTAA-GIN Team Members, Graduate Research Assistants, and Product Manager
The Geoportal is programmatically checked for broken links on a monthly basis. They are fixed either by manually repairing them or by reharvesting from the source.
@@ -1895,7 +1895,7 @@ Sequence diagram of Resource Lif
actor Product Manager
participant GitHub
actor Research Assistant
- participant GEOMG
+ participant GBL Admin
participant Geoportal
@@ -1907,12 +1907,12 @@ Sequence diagram of Resource Lif
Note left of Research Assistant: HARVEST
Note left of Research Assistant: EDIT
- Research Assistant->>GEOMG: Upload records
+ Research Assistant->>GBL Admin: Upload records
Research Assistant ->>GitHub: Update GitHub issue
- Note right of GEOMG: PUBLISH
+ Note right of GBL Admin: PUBLISH
- Product Manager->>GEOMG: Publish records
- GEOMG->>Geoportal: Send records online
+ Product Manager->>GBL Admin: Publish records
+ GBL Admin->>Geoportal: Send records online
Product Manager->>GitHub: Close GitHub issue
Product Manager ->> Team Member: Share link to published records
diff --git a/search/search_index.json b/search/search_index.json
index 58e2f819..e34ed206 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"GeoBTAA Metadata Handbook","text":"
This handbook describes how to harvest and format metadata records for the BTAA Geoportal.
"},{"location":"#reference","title":"Reference","text":"Information about the GeoBTAA Metadata Application Profile and our harvest guidelines.
"},{"location":"#explanation","title":"Explanation","text":"Descriptions and clarifications of processes, content organization, policies, and tools
"},{"location":"#tutorials","title":"Tutorials","text":"Short, easy to complete exercises to help someone get the basics of running and writing scripts to harvest metadata.
"},{"location":"#recipes","title":"Recipes","text":"Multi-step workflows for harvesting and editing metadata for the BTAA Geoportal.
"},{"location":"#who-is-this-handbook-for","title":"Who is this handbook for?","text":" -
Team Members in the Big Ten Academic Alliance Geospatial Information Network (BTAA-GIN)
-
Development & Operations Staff in the BTAA-GIN
-
Users & developers of open-source geospatial projects, such as OpenGeoMetadata and GeoBlacklight
-
Contributors to the BTAA Geoportal
-
Users of the BTAA Geoportal
Metadata Handbook Version History Changes for Version 5.1 (September 25, 2023)
This version adds several new recipes and page cleanups.
- New recipes for:
- cleaning metadata
- adding bounding boxes
- normalizing creators
- updating our list of ArcGIS Hubs
- how to add multiple download links in GEOMG
- Updates the documentation for the ArcGIS, Socrata, and PASDA recipes.
- Updates the DCAT and CKAN documentation pages.
Changes for Version 5.0 (May 24, 2023)
This version incorporates the Harvesting Guide notebooks and includes documentation for harvesting metadata from different sources.
- New page describing the Tutorials in the Harvesting Guide
- Eight Recipe pages corresponding to Harvesting Guide
- Updated header design to match Geoportal
Changes for Version 4.6 (March 15, 2023)
- New page for manually adding bounding boxes
- Restructure using Diataxis framework
- Remove some GEOMG how to guidelines (moved to GEOMG Wiki)
- Clarify Editing Template differences from OGM-Aardvark documentation
- Added Collection Development Policy and Curation Priorities documents
- Update input guidelines for Spatial Coverage (FAST IDs)
Changes for Version 4.5.1 (February 28, 2023)
- Update version documentation
- Add link to generated PDF
Changes for Version 4.5 (February 28, 2023)
- Add Creator ID
- Update input guidelines for Creator, Creator ID
- Remove Harvesting Guide info (migrating to separate site)
- Edit Submitting Metadata page
- Minor copy editing
- Add PDF export capability
Changes for Version 4.4 (August 23, 2022)
- updated theme
- reorganized and expanded navigation menu
- new sections for Harvesting Guide and using GEOMG
Changes for Version 4.3 (August 15, 2022)
- migrate to MkDocs.org platform
- update bounding box entry guidelines
- add GEOMG page
Changes for Version 4.2 (March 24, 2022)
- New Entry and Usage Guidelines page
- Expands content organization model documentation
- Changes the name of the schema from 'Aardvark' to 'OpenGeoMetadata (version Aardvark)'
- Cleans up outdated links
Changes for Version 4.1 (Jan 2022)
- updates Status as optional; removes controlled vocabulary
- Clarifies relationship model
Changes for Version 4.0 (July 2021)
- Incorporation of GEOMG Metadata Editor
- Upgrade to Aardvark Metadata Schema for GeoBlacklight
Changes for version 3.3 (May 13, 2020)
- Added University of Nebraska
- Reorganized Metadata Elements to match editing template
- Updated the \u201cUpdate the Collections\u201d section to match new administrative process for tracking records
Changes for version 3.2 (Jan 8, 2020)
- Added Date Range element
Changes for version 3.1 (Dec 19, 2019)
- Added collection level records metadata schema
Changes for version 3 (Oct 2019)
- GeoNetwork and Omeka deprecated
- all GeoBlacklight records are stored in a spreadsheet in Google Sheets
- records are transformed from CSV to GeoBlacklight JSON with a Python script
- additional metadata fields were added for administrative purposes
- IsPartOf field now holds a code pointing to the collection record
- Administrative groupings such as \u201cState agencies geospatial data\u201d are now subjects, not a Collection
- updated editing templates available
- all supplemental metadata can be stored as XML or HTML in project hosted folder
- updated links to collections database
"},{"location":"GEOMG/","title":"About GEOMG","text":"What is it? GEOMG is a custom tool that functions as a backend metadata editor and manager for the GeoBlacklight application.
Who uses it? BTAA-GIN Operations technical staff at the University of Minnesota
Who developed it? The BTAA Geoportal Lead Developer, Eric Larson, created GEOMG, with direction from the BTAA-GIN. It is based upon the Kithe framework.
Can other GeoBlacklight projects adopt it?
We are currently working on offering this tool as a plugin for GeoBlacklight.
In the meantime, this presentation describes the motivation for building the tool and a few screencasts showing how it works:
"},{"location":"about-harvesting/","title":"About harvesting","text":"This page describes some of the processes and terminology associated with extracting metadata from various sources.
"},{"location":"about-harvesting/#what-is-web-scraping","title":"What is web scraping?","text":"Web scraping is the process of programmatically collecting and extracting information from websites using automated scripts or bots. Common web scraping tools include pandas, Beautiful Soup, and WGET.
"},{"location":"about-harvesting/#what-is-data-harvesting","title":"What is data harvesting?","text":"Data harvesting refers to the process of collecting large volumes of data from various sources, such as websites, social media, or other online platforms. This can involve using automated scripts or tools to extract structured or unstructured data, such as text, images, videos, or other types of content. The collected data can be used for various purposes, such as data analysis or content aggregation.
"},{"location":"about-harvesting/#what-is-metadata-harvesting","title":"What is metadata harvesting?","text":"Metadata harvesting refers specifically to the process of collecting metadata from digital resources, such as websites, online databases, or digital libraries. Metadata is information that describes other data, such as the title, author, date, format, or subject of a document. Metadata harvesting involves extracting this information from digital resources and storing it in a structured format, such as a database or a metadata record.
Metadata harvesting is often used in the context of digital libraries, archives, or repositories, where the metadata is used to organize and manage large collections of digital resources. Metadata harvesting can be done using specialized tools or protocols, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which is a widely used standard for sharing metadata among digital repositories.
"},{"location":"about-harvesting/#do-scraping-and-harvesting-mean-the-same-thing","title":"Do \"scraping\" and \"harvesting\" mean the same thing?","text":"The terms \"harvesting\" and \"scraping\" are often used interchangeably. However, there may be subtle differences in the way these terms are used, depending on the context.
In general, scraping refers to the process of programmatically extracting data from websites using automated scripts or bots. The term \"scraping\" often implies a more aggressive approach, where data is extracted without explicit permission from the website owner. Scraping may involve parsing HTML pages, following links, and using techniques such as web crawling or screen scraping to extract data from websites.
On the other hand, harvesting may refer to a more structured and systematic approach to extracting data from websites. The term \"harvesting\" often implies a more collaborative approach, where data is extracted with the explicit permission of the website owner or through APIs or web services provided by the website. Harvesting may involve using specialized software or tools to extract metadata, documents, or other resources from websites.
"},{"location":"about-harvesting/#what-is-web-parsing","title":"What is web parsing?","text":"Web parsing refers to the process of scanning structured documents and extracting information. Although, it usually refers to parsing HTML pages, it can also describe parsing XML or JSON documents. Tools designed for this purpose, such as Beautiful Soup, are often called \"parsers.\"
"},{"location":"about-harvesting/#what-is-extract-transform-load-etl","title":"What is Extract-Transform-Load (ETL)?","text":"ETL (Extract Transform Load) is a process of extracting data from one or more sources, transforming it to fit the target system's data model, and then loading it into the target system, such as a database or a data warehouse.
The ETL process typically involves three main stages:
- Extract: This stage involves retrieving data from one or more sources, which may include databases, files, APIs, web services, or other data sources.
- Transform: This stage involves converting, cleaning, and restructuring the extracted data to fit the target system's data model and business rules. This may include tasks such as data mapping, data validation, data enrichment, data aggregation, or data cleansing.
- Load: This stage involves inserting or updating the transformed data into the target system, such as a database or a data warehouse. This may involve tasks such as data partitioning, indexing, or data quality checks.
"},{"location":"arcgis-harvest-policy-2018/","title":"BTAA GDP Accession Guidelines for ArcGIS Open Data Portals","text":"Version 1.0 - April 18, 2018
Deprecated
This document has been replaced by the BTAA-GIN Scanning Guidelines for ArcGIS Hub, version 2.0
"},{"location":"arcgis-harvest-policy-2018/#overview","title":"OVERVIEW","text":"This document describes the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Open Data Portal platform and/or the BTAA GDP harvesting and remediation processes.
"},{"location":"arcgis-harvest-policy-2018/#eligible-sites","title":"ELIGIBLE SITES","text":"Policy: Any ArcGIS Open Data Portal (Arc-ODP) that serves public geospatial data is eligible for inclusion in the BTAA GDP Geoportal (\u201cthe Geoportal\u201d). However, preference is given to portals that are hosting original layers, not federated portals that aggregate from other sites. Task Force members are responsible for finding and submitting Arc-ODPs for inclusion in the Geoportal. Each Arc-ODP will be assigned a Provenance value to the university that submitted it or is closest geographically to the site.
Explanation: In order to avoid duplication, records that appear in multiple Arc-ODPs should only be accessioned from one instance. This also helps to avoid harvesting records that may be out of date or not yet aggregated within federated portals. Although the technical workers at the University of Minnesota will be performing the metadata processing, the Task Force members are expected to periodically monitor their records and make suggestions for edits or additions.
"},{"location":"arcgis-harvest-policy-2018/#eligible-records","title":"ELIGIBLE RECORDS","text":"Policy: The only records that will be harvested from Arc-ODPs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following records types will not be accessioned on a regular basis:
- Web applications
- Nonspatial data, tabular data, PDFs
- Single records that describe many different associated files to download, such as imagery services with a vast number of sublayers
Explanation: Arc-ODPs are structured to automatically create item records from submitted Esri REST services. However, Arc-ODP administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning. If these types of resources are approved by the Metadata Coordinator, they may be processed separately from the regular accessions.
"},{"location":"arcgis-harvest-policy-2018/#query-harvest-frequency","title":"QUERY & HARVEST FREQUENCY","text":"Policy: The Arc-ODPs included in the Geoportal will be queried monthly to check for deleted and new items. The results of this query will be logged. Deleted items will be removed from the geoportal immediately. New records from the Arc-ODPs will be accessioned and processed within two months of harvesting the metadata.
Explanation: Removing broken links is a priority for maintaining a positive user experience. However, accessioning and processing records requires remediation work that necessitates a variable time frame.
"},{"location":"arcgis-harvest-policy-2018/#remediation-work","title":"REMEDIATION WORK","text":"The records will be processed by the Metadata Coordinator and available UMN Graduate Research Assistants. The following metadata remediation steps will be undertaken:
"},{"location":"arcgis-harvest-policy-2018/#1-a-python-script-will-be-run-to-harvest-metadata-from-the-dcat-api-this-will-provide-the-following-elements-for-each-record","title":"1. A Python script will be run to harvest metadata from the DCAT API. This will provide the following elements for each record:","text":" - Identifier
- Title
- Description
- Date Issued
- Date Modified
- Bounding Box
- Publisher
- Keywords
- Landing Page
- Web Service link
"},{"location":"arcgis-harvest-policy-2018/#2-the-metadata-will-be-batch-augmented-with-administrative-template-values-in-the-following-elements","title":"2. The metadata will be batch augmented with Administrative template values in the following elements:","text":" - Collection
- Rights
- Type
- Format
- Provenance
- Language
- Centroid (derived from Bounding Box)
- Download Link (created from Landing Page)
- Tag (web service type)
- Thumbnail link (derived from server or Arc-ODP page)
"},{"location":"arcgis-harvest-policy-2018/#3-the-metadata-will-be-manually-augmented-with-descriptive-values-for-the-following-elements","title":"3. The metadata will be manually augmented with descriptive values for the following elements:","text":" - Subject (at least one ISO topic category)
- Geometry type (Vector or Raster)
- Spatial Coverage (place names written out to the nation level: \u201cMinneapolis, Minnesota, United States\u201d)
- Temporal Coverage (dates included in the title or description)
- Title (add place names, expand acronyms, and move dates to the end of the string)
- Description (remove html and non-UTF8 characters)
- Creator (if available)
- Solr Year (integer value based on temporal coverage or published date)
"},{"location":"arcgis-harvest-policy-2018/#4-the-metadata-will-not-be-fully-remediated-for-the-following-cases","title":"4. The metadata will not be fully remediated for the following cases:","text":" - Missing bounding box coordinates (0.00 values) will be defaulted to the bounding box of administrative level of the Arc-ODP or the record will be omitted.
- Missing or incomplete descriptions will be left alone or omitted from the record
- Individual records that require additional research in order to make the metadata record compliant, such as missing required elements or non-functioning links, will be omitted.
"},{"location":"arcgis-harvest-policy-2018/#standards-metadata","title":"STANDARDS METADATA","text":"Policy: Creating or linking to standards based metadata files for Arc-ODPs is out of scope at this time.
Explanation: If metadata is enabled for an Arc-ODP, it will be available as ArcGIS Metadata Format 1.0 in XML, which is not a schema that GeoBlacklight can display. The metadata may also be available as FGDC or ISO HTML pages, but these types of links are not part of the current GeoBlacklight schema. Further, very few Arc-ODPs are taking advantage of this feature at this time.
"},{"location":"arcgis-hub-guidelines/","title":"BTAA-GIN Scanning Guidelines for ArcGIS Hubs","text":"Version 2.0 - April 24, 2023
Info
This document replaces the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, version 1.0.
"},{"location":"arcgis-hub-guidelines/#overview","title":"Overview","text":"This document describes the BTAA-GIN Scanning Guidelines for ArcGIS Hubs, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Hub architecturs platform and/or the BTAA-GIN harvesting and remediation processes.
"},{"location":"arcgis-hub-guidelines/#eligible-sites","title":"Eligible sites","text":"Guideline: Any ArcGIS Hub (Hub) that serves public geospatial data is eligible for inclusion in the BTAA Geoportal (\u201cthe Geoportal\u201d). Our scope includes public Hubs from the states in the BTAA geographic region and Hubs submitted by Team Members that are of interest to BTAA researchers.
Explanation: See the BTAA-GIN Collection Development Policy for more details.
"},{"location":"arcgis-hub-guidelines/#eligible-records","title":"Eligible records","text":"Guideline: The only records that will be harvested from Hubs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following records types will not be accessioned on a regular basis:
- Web applications
- Nonspatial data, tabular data, PDFs
- Single records that describe many different associated files to download, such as imagery services with a vast number of sublayers
Explanation: Hubs are structured to automatically create item records from submitted Esri REST services. However, Hub administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning.
"},{"location":"arcgis-hub-guidelines/#frequency","title":"Frequency","text":"Guideline: The Hubs included in the Geoportal will be scanned weekly to harvest complete lists of eligible records. The list will be published and overwrite the previous scan.
Explanation: Broken links negatively impact user experience. Over the course of a week, as many as 10% of the ArcGIS Hub records in the Geoportal can break or include outdated information.
"},{"location":"arcgis-hub-guidelines/#metadata-remediation","title":"Metadata Remediation","text":"Guideline: The harvesting script, R-01_arcgis-hubs
, programmatically performs all of the remediation for each record.
Explanation: We now scan a large number of ArcGIS Hubs, which makes manual remediation unrealistic. This is in contrast to the previous policy established in 2018, when our collection was smaller.
"},{"location":"b1g-custom-elements/","title":"Custom Elements","text":"This page documents the custom metadata elements for the GeoBTAA Metadata Profile. These elements extend the official OpenGeoMetadata (Aardvark) schema.
b1g-id Label URI Obligation b1g-01 Code b1g_code_s
Required b1g-02 Status b1g_status_s
Optional b1g-03 Accrual Method b1g_dct_accrualMethod_s
Required b1g-04 Accrual Periodicity b1g_dct_accrualPeriodicity_s
Optional b1g-05 Date Accessioned b1g_dateAccessioned_s
Required b1g-06 Date Retired b1g_dateRetired_s
Conditional b1g-07 Child Record b1g_child_record_b
Conditional b1g-08 Mediator b1g_dct_mediator_sm
Conditional b1g-09 Access b1g_access_s
Conditional b1g-10 Image b1g_image_ss
Optional b1g-11 GeoNames b1g_geonames_sm
Optional b1g-12 Publication State b1g_publication_state_s
Required b1g-13 Language String b1g_language_sm
Required b1g-14 Creator ID b1g_creatorID_sm
Optional"},{"location":"b1g-custom-elements/#code","title":"Code","text":"Label Code URI b1g_code_s
Profile ID b1g-01 Obligation Required Multiplicity 1-1 Field type string Purpose To group records based upon their source Entry Guidelines Codes are developed by the metadata coordinator and indicate the provider, the type of institution hosting the resources, and a numeric sequence number. For more details, see Code Naming Schema Commentary This administrative field is used to group and track records based upon where they are harvested. This is frequently an identical value to \"Member Of\". The value will differ for records that are retired (these are removed from \"Member Of\") and records that are part of a subcollection. Controlled Vocabulary yes-strict Example value 12d-01 Element Set B1G"},{"location":"b1g-custom-elements/#status","title":"Status","text":"Label Status URI b1g_status_s
Profile ID b1g-02 Obligation Optional Multiplicity 0-1 Field type string Purpose To indicate if a record is currently active, retired, or unknown. It can also be used to indicate if individual data layers from website has been indexed in the Geoportal. Entry Guidelines Plain text string is acceptable Commentary This is a legacy admin field that was previously used to track published vs retired items. The current needs are still TBD. Controlled Vocabulary no Example value Active Element Set B1G"},{"location":"b1g-custom-elements/#accrual-method","title":"Accrual Method","text":"Label Accrual Method URI b1g_dct_accrualMethod_s
Profile ID b1g-03 Obligation Required Multiplicity 1-1 Field type string Purpose To describe how the record was obtained Entry Guidelines Some values, such as \"ArcGIS Hub\" should be entered consistently. Others may be more descriptive, such as \"Manually entered from text file.\" Commentary This allows us to find all of the ArcGIS records in one search. It also can help track records that have been harvested via different methods within the same collection. Controlled Vocabulary no Example value ArcGIS Hub Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#accrual-periodicity","title":"Accrual Periodicity","text":"Label Accrual Periodicity URI b1g_dct_accrualPeriodicity_s
Profile ID b1g-04 Obligation Optional Multiplicity 0-1 Field type string Purpose To indicate how often a collection is reaccessioned Entry Guidelines Enter one of the following values: Annually, Semiannually, Quarterly, Monthly, As Needed Commentary This field is primarily for collection level records. Controlled Vocabulary yes-not strict Example value As Needed Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#date-accessioned","title":"Date Accessioned","text":"Label Date Accessioned URI b1g_dateAccessioned_s
Profile ID b1g-05 Obligation Required Multiplicity 1-1 Field type string Purpose To store the date a record was harvested Entry Guidelines Enter the date a record was harvested OR when it was added to the geoportal using the format yyyy-mm-dd Commentary This field allows us to track how many records are ingested into the portal in a given time period and to which collections. Controlled Vocabulary no Example value 2021-01-01 Element Set B1G"},{"location":"b1g-custom-elements/#date-retired","title":"Date Retired","text":"Label Date Retired URI b1g_dateRetired_s
Profile ID b1g-06 Obligation Conditional Multiplicity 0-1 Field type string Purpose To store the date the record was removed from the geoportal public interface Entry Guidelines Enter the date a record was removed from the geoportal Commentary This field allows us to track how many records have been removed from the geoportal interface by time period and collection. Controlled Vocabulary no Example value 2021-01-02 Element Set B1G"},{"location":"b1g-custom-elements/#child-record","title":"Child Record","text":"Label Child Record URI b1g_child_record_b
Profile ID b1g-07 Obligation Optional Multiplicity 0-1 Field type string boolean Purpose To apply an algorithm to the record that causes it to appear lower in search results Entry Guidelines Only one of two values are allowed: true or false Commentary This is used to lower a record's placement in search results. This can be useful for a large collection with many similar metadata values that might clutter a user's experience. Controlled Vocabulary string boolean Example value true Element Set B1G"},{"location":"b1g-custom-elements/#mediator","title":"Mediator","text":"Label Mediator URI b1g_dct_mediator_sm
Profile ID b1g-08 Obligation Conditional Multiplicity 0-0 or 1-* Field type string Purpose To indicate the universities that have licensed access to a record Entry Guidelines The value for this field should be one of the names for each institution that have been coded in the GeoBlacklight application. Commentary This populates a facet on the search page so that users can filter to only databases that they are able log into based upon their institutional affiliation. Controlled Vocabulary yes Example value University of Wisconsin-Madison Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#access","title":"Access","text":"Label Access URI b1g_access_s
Profile ID b1g-09 Obligation Conditional Multiplicity 0-0 or 1-1 Field type string JSON Purpose To supply the links for restricted records Entry Guidelines The field value is an array of key/value pairs, with keys representing an insitution code and values the URL for the library catalog record. See the Access Template for entry. Commentary This field is challenging to construct manually, is it is a JSON string of institutional codes and URLs. The codes are used instead of the actual names in order to make the length of the field more manageable and to avoid spaces. Controlled Vocabulary no Example value {\\\"03\\\":\\\"https://purl.lib.uiowa.edu/PolicyMap\\\",\\\"04\\\":\\\"https://www.lib.umd.edu/dbfinder/id/UMD09180\\\",\\\"05\\\":\\\"https://primo.lib.umn.edu/permalink/f/1q7ssba/UMN_ALMA51581932400001701\\\",\\\"06\\\":\\\"http://catalog.lib.msu.edu/record=b10238077~S39a\\\",\\\"07\\\":\\\"https://search.lib.umich.edu/databases/record/39117\\\",\\\"09\\\":\\\"https://libraries.psu.edu/databases/psu01762\\\",\\\"10\\\":\\\"https://digital.library.wisc.edu/1711.web/policymap\\\",\\\"11\\\":\\\"https://library.ohio-state.edu/record=b7869979~S7\\\"} Element Set B1G"},{"location":"b1g-custom-elements/#image","title":"Image","text":"Label Image URI b1g_image_ss
Profile ID b1g-10 Obligation Optional Multiplicity 0-0 or 0-1 Field type stored string (URL) Purpose To show a thumbnail on the search results page Entry Guidelines Enter an image file using a secure link (https). Acceptable file types are JPEG or PNG Commentary This link is used to harvest an image into the Geoportal server for storage and display. Once it has been harvested, it will remain in storage, even if the orginal link to the image stops working. Controlled Vocabulary no Example value https://gis.allencountyohio.com/GIS/Image/countyseal.jpg Element Set B1G"},{"location":"b1g-custom-elements/#geonames","title":"GeoNames","text":"Label GeoNames URI b1g_geonames_sm
Profile ID b1g-11 Obligation Optional Multiplicity 0-* Field type stored string (URI) Purpose To indicate a URI for a place name from the GeoNames database Entry Guidelines Enter a value in the format \"http://sws.geonames.org/URI
\" Commentary This URI provides a linked data value for one or more place names. It is optional as there is currently no functionality tied to it in the GeoBlacklight application Controlled Vocabulary yes Example value https://sws.geonames.org/2988507 Element Set B1G"},{"location":"b1g-custom-elements/#publication-state","title":"Publication State","text":"Label Publication State URI b1g_publication_state_s
Profile ID b1g-12 Obligation Required Multiplicity 1-1 Field type string Purpose To communicate to Solr if the item is public or hidden Entry Guidelines Use the dropdown or batch editing functions to change the state Commentary When items are first added to GEOMG, they are set as Draft by default. When they are ready to be published, they can be manually changed to Published. If the record is retired or needs to be hidden, it can be changed to Unpublished Controlled Vocabulary yes Example value Draft Element Set B1G"},{"location":"b1g-custom-elements/#language-string","title":"Language string","text":"Label Language string URI b1g_language_sm
Profile ID b1g-13 Obligation Required Multiplicity 1-* Field type string Purpose To display the spelled out string (in English) of a language code to users Entry Guidelines This field is automatically generated from the Language field in the main form Commentary The OGM schema specified using a 3-digit code to indicate lanuage. In order to display this to users, it needs to be translated into a human-readable string. Controlled Vocabulary yes Example value French Element Set B1G"},{"location":"b1g-custom-elements/#creator-id","title":"Creator ID","text":"Label Creator ID URI b1g_creatorID_sm
Profile ID b1g-14 Obligation Optional Multiplicity 0-* Field type string Purpose To track the URI of a creator value Entry Guidelines This field is entered as a URI representing an authority record Commentary These best practices recommend consulting one or two name registries when deciding how to standardize names of creators: the Faceted Application of Subject Terminology (FAST) or the Library of Congress Name Authority File (LCNAF). FAST is a controlled vocabulary based on the Library of Congress Subject Headings (LCSH) that is well-suited to the faceted navigation of the Geoportal. The LCNAF is an authoritative list of names, events, geographic locations and organizations used by libraries and other organizations to collocate authorized creator names to make searching and browsing easier. Controlled Vocabulary yes Example value fst02013467 Element Set B1G"},{"location":"ckan/","title":"Overview of CKAN Data Portals and its APIs","text":"\"CKAN is a tool for making open data websites\". (https://docs.ckan.org/en/2.10/user-guide.html#what-is-ckan) CKAN is often utilized by governments and organizations and is an open-source alternative to platforms like ArcGIS Hubs.
"},{"location":"ckan/#content-organization","title":"Content Organization","text":"The content organization model of a CKAN site uses the term Datasets for each item record. A Dataset may have multiple Resources, such as downloadable files, thumbnails, supplemental metadata files, and external links. This model can give data providers flexibility on how they organize their files, but can be challenging for harvesting into the BTAA Geoportal.
Unlike CKAN, GeoBlacklight was designed to have only one data file per record, so it can be challenging to programmatically sort through all of the possible access points for a Dataset and attach them to a single record in GeoBlacklight. To mitigate this, we use the multiple downloads option when possible.
"},{"location":"ckan/#metadata","title":"Metadata","text":"CKAN metadata contains several basic fields (documented at https://ckan.org/features/metadata) along with an \"extras\" group that can be customized by site. Some portals have many custom fields in \"extras\" and some do not use them at all.
"},{"location":"ckan/#api","title":"API","text":"CKAN offers several types of APIs for sharing metadata. The most useful one for the BTAA Geoportal is the package_search
, which can be accessed by appending \"api/3/action/package_search\" to a base URL.
Example
https://demo.ckan.org/api/3/action/package_search
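A minimal sketch of calling package_search from Python with the requests library, using the demo endpoint above; the rows and start values are just example paging parameters:

```python
# Query a CKAN package_search endpoint and list the dataset names and titles.
import requests

base_url = "https://demo.ckan.org"
response = requests.get(
    f"{base_url}/api/3/action/package_search",
    params={"rows": 100, "start": 0},  # standard package_search paging parameters
    timeout=30,
)
response.raise_for_status()
payload = response.json()

# CKAN wraps search output in a "result" object containing a "results" list of Datasets.
for dataset in payload["result"]["results"]:
    print(dataset["name"], "-", dataset.get("title", ""))
```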
"},{"location":"codeNamingSchema/","title":"Code Naming Schema","text":"Each website / collection in the BTAA Geoportal has an alphanumeric code. This code is also added to each metadata record to facilitate administrative tasks and for grouping items by their source. Some of the Codes are randomly generated strings, but most are constructed with an administrative schema described below:
First part of string Contributing institution 01 Indiana University 02 University of Illinois Urbana-Champaign 03 University of Iowa 04 University of Maryland 05 University of Minnesota 06 Michigan State University 07 University of Michigan 08 Pennsylvania State University 09 Purdue University 10 University of Wisconsin-Madison 11 The Ohio State University 12 University of Chicago 13 University of Nebraska-Lincoln 14 Rutgers University Second part of string Type of organization hosting the datasets a State b County c Municipality d University f Other (ex. NGOs, Regional Groups, Collaborations) g Federal Third part of string The sequence number added in order of accession or a county FIPS code -01 First collection added from same institution and type of organization -02 Second collection added from same institution and type of organization -55079 County FIPS code for Milwaukee County, Wisconsin Example
code for a collection sourced from Milwaukee County: '10b-55079'
"},{"location":"collection-development-policy/","title":"BTAA Geoportal Collection Development Policy","text":" Authors: BTAA Collection Development & Education Outreach Committee
"},{"location":"collection-development-policy/#purpose","title":"Purpose","text":"The BTAA Geospatial Information Network is a collaborative project to enhance discoverability, facilitate access, and connect scholars across the Big Ten Academic Alliance (BTAA) to scanned maps, geospatial data, and aerial imagery resources. The project\u2019s main output is the BTAA Geoportal, which serves as a platform through which participating libraries can share materials from their collections to make them more easily discoverable and accessible to varied user communities. Collections within the Geoportal primarily support the research, teaching, learning, and information needs of faculty, staff, and students at participating institutions and beyond.
The project supports the creation and aggregation of discovery-focused metadata describing geospatial resources from participating institutions and public sources across the Big Ten region and makes those resources discoverable via an open source portal. For more information and a list of participating BTAA institutions, please visit our project site.
"},{"location":"collection-development-policy/#summary-of-collection-scope","title":"Summary of Collection Scope","text":"Access to the BTAA Geoportal is open to all. This collection consists of geospatial resources relevant to all disciplines. Access to resources is curated based on their authoritativeness, currency, comprehensiveness, ease of use, and relevancy. Materials included are generally publicly available geospatial datasets (vector/raster), scanned maps (georeferenced or with bounding box coordinates), and aerial imagery. Scanned maps protected by copyright are not included in the Geoportal. Access to licensed resources may be restricted to users affiliated with a participating institution.
- Geographic areas: Items in the collection vary in scale based on subject and range from global to local. Geographic areas vary based on subject and may refer to biomes/ecosystems, political boundaries, cultural boundaries, economic boundaries, or land use types. In addition to a geographic focus on the Big Ten region (i.e., the states where participating institutions are located), the collection will emphasize resources and topics relevant to faculty and student research interests and reflect the strengths of participating library collections.
- Time periods: All time periods are collected, with an ability to accommodate both current and historical versions of datasets.
- Format: The collection consists of geospatial datasets, georeferenced maps, scanned maps with bounding box coordinates, and aerial imagery. Records for web mapping applications may also be included, with priority given to applications with datasets that are also accessible for download through the Geoportal. Preference is given to open and open source formats, but other formats are accepted as required to facilitate ease of use. When possible, resources are presented in formats that allow for download capabilities. Additional software may be needed to view datasets after download.
- Language(s): The collection primarily consists of English language content. Some non-English language content may be available for certain regions, reflecting the collection strengths and research/curricular interests of participating institutions.
- Diversity: The Geoportal and its participating institutions aspire to collect and provide access to geospatial resources that represent diverse perspectives, abilities, and experience levels. We will strive to apply best practices for diverse collection development as they relate to geospatial resources, including but not limited to:
- considering resources from small, independent, and local producers
- seeking content created by and representative of marginalized and underrepresented groups.
- Preservation and life cycle: Digital file preservation for discovery metadata is managed by BTAA Geoportal staff. Digital file preservation for resources is the responsibility of the content provider. Resources may cease to be accessible through the Geoportal if access from the original provider is no longer available.
"},{"location":"collection-development-policy/#statement-of-communication","title":"Statement of Communication","text":"The members of the Geoportal project team will continue to communicate with the creators of other geoportals (GeoBlacklight Community, etc.), with data providers in our respective regions, and across Big Ten institutions to build a comprehensive and robust collection.
Implementation and Revision Schedule: This policy will be reviewed annually by the Collection Development & Education Outreach Committee and is considered effective on the date indicated below. It will be revised as needed to reflect new collection needs and identify new areas of study as well as those areas that may be excluded.
Updated: April 27, 2022
"},{"location":"curation-priorities/","title":"Curation Priorities","text":" Authors: BTAA Collection Development & Education Outreach Committee; Product Manager
There are three distinct but related aspects of prioritizing the addition of new collections: content/theme, administration, and technology.
These priorities will affect how quickly the items are processed or where they fall in line within our processing queue.
"},{"location":"curation-priorities/#contenttheme","title":"Content/Theme","text":"When it comes to scanned maps, prioritization based on content or theme is primarily a local effort. However, there are opportunities for internal collaborations, including with Special Collections librarians or other local digital collections initiatives. These collaborations can allow for unique and distinctive maps to be harvested into the geoportal across our universities.
For geospatial data, datasets created in association with research projects at our institutions may be a high priority based on content or theme. Additionally, resources that provide access to foundational datasets, such as administrative boundaries, parcels, road networks, address points, and land use, should also be considered.
Regardless of the content type, special consideration should be given to highly relevant content, especially to current events. For example, in April 2020, a call went out to all task force members to identify and submit content related to COVID-19 for harvesting into the geoportal. Content that aligns with other ongoing BTAA-GIN program efforts, such as the Diverse Collections Working Group, will also be a higher priority as these efforts are further developed.
"},{"location":"curation-priorities/#administration","title":"Administration","text":"Collections may be prioritized based on the organization responsible for creating and maintaining content, which impacts the types of maps or datasets available to be harvested, spatial and temporal coverage, and stability. Based on these considerations, current priorities in terms of administration are:
-
University libraries and archives
- Links to these resources are likely to be stable
- Resources will likely be documented with a metadata standard
- Represent our core audience
-
States and counties
- Produce most foundational geospatial datasets (e.g., roads and parcels) and are currently our largest source of geospatial data
- Technology and open data policies vary widely, resulting in patchwork coverage
-
Regional organizations and research institutes
- Often special organizations with funding to create geospatial data across political boundaries
- Higher risk of harvesting duplicate datasets, as these organizations sometimes aggregate records from cities, counties, or state agencies
-
Cities
- less likely to produce and share data in geospatial formats and more likely to share tabular data
- prioritized cities: major metropolitan areas and the locations of our university campuses
"},{"location":"curation-priorities/#technology","title":"Technology","text":"The source's hosting platform influences the ease of harvesting, the quality of the metadata, and the stability of the access links. Based on these considerations, current priorities in terms of technology are:
-
Published via known portal or digital library platforms, including:
- Blacklight/GeoBlacklight
- Islandora
- Preservica
- ArcGIS Hubs
- Socrata
- CKAN
- Sites with OAI-PMH enabled APIs
-
Custom portals
- each portal requires a customized script for HTML web parsing
- writing and maintaining custom scripts takes extra time
-
Static webpages with download links
- at a minimum, a title is required for each item
- static sites with nested links take a long time to process and may require an extensive amount of manual work
-
Database websites
- require the user to perform interactive queries to extract data
- not realistic to make Geoportal records for individual datasets
- usually results in a single \"website\" record in the Geoportal to represent the database
"},{"location":"dcat/","title":"DCAT Metadata","text":""},{"location":"dcat/#overview","title":"Overview","text":"DCAT (Data Catalog Vocabulary) is metadata schema for web-based data catalogs. It is intended to facilitate interoperability and many data platforms offer a DCAT API for metadata sharing.
The most up-to-date documentation of the schema can be found here: https://www.w3.org/TR/vocab-dcat-3/
Documentation that is older, but still in use for United States portals can be found here: https://resources.data.gov/resources/dcat-us/
"},{"location":"dcat/#json-structure","title":"JSON Structure","text":"Many of the data platforms in the United States use a DCAT profile documented as \"Project Open Data Catalog\". The following JSON template shows the generic structure of a DCAT JSON document:
{\n \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n \"id\": \"https://project-open-data.cio.gov/v1.1/schema/catalog.json#\",\n \"title\": \"Project Open Data Catalog\",\n \"description\": \"Validates an entire collection of Project Open Data metadata JSON objects. Agencies produce said collections in the form of Data.json files.\",\n \"type\": \"object\",\n \"dependencies\": {\n \"@type\": [\n \"@context\"\n ]\n },\n \"required\": [\n \"conformsTo\",\n \"dataset\"\n ],\n \"properties\": {\n \"@context\": {\n \"title\": \"Metadata Context\",\n \"description\": \"URL or JSON object for the JSON-LD Context that defines the schema used\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"@id\": {\n \"title\": \"Metadata Catalog ID\",\n \"description\": \"IRI for the JSON-LD Node Identifier of the Catalog. This should be the URL of the data.json file itself.\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"@type\": {\n \"title\": \"Metadata Context\",\n \"description\": \"IRI for the JSON-LD data type. This should be dcat:Catalog for the Catalog\",\n \"enum\": [\n \"dcat:Catalog\"\n ]\n },\n \"conformsTo\": {\n \"description\": \"Version of Schema\",\n \"title\": \"Version of Schema\",\n \"enum\": [\n \"https://project-open-data.cio.gov/v1.1/schema\"\n ]\n },\n \"describedBy\": {\n \"description\": \"URL for the JSON Schema file that defines the schema used\",\n \"title\": \"Data Dictionary\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"dataset\": {\n \"type\": \"array\",\n \"items\": {\n \"$ref\": \"dataset.json\",\n \"minItems\": 1,\n \"uniqueItems\": true\n }\n }\n }\n}\n
"},{"location":"dcat/#how-to-find-the-dcat-api","title":"How to find the DCAT API","text":"Most sites, including Socrata:
To find a data API, a good place to start is to try appending the string \"/data.json\" to the base URL. If available, your browser will display the data catalog as a JSON file.
ArcGIS Hubs:
- Version 1: append the string \"/api/feed/dcat-us/1.1.json\". Esri made this change in 2022 to differentiate the older DCAT version from 2.0. Our harvest recipe currently uses this version.
- Version 2: use the string \"api/feed/dcat-ap/2.0.1.json\". We plan to evaluate the newer format and will consider migrating our recipe in 2024.
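A minimal sketch of fetching a DCAT catalog with Python, assuming the endpoint paths described above; the portal URLs are placeholders:

```python
# Fetch a DCAT catalog as JSON and count its datasets.
import requests

def fetch_dcat(base_url: str, path: str = "/data.json") -> dict:
    """Return the parsed DCAT catalog, raising an error if the response is not valid JSON."""
    response = requests.get(base_url.rstrip("/") + path, timeout=30)
    response.raise_for_status()
    return response.json()

# Generic portals (including Socrata) usually expose /data.json;
# ArcGIS Hubs use the versioned DCAT-US feed path instead.
catalog = fetch_dcat("https://data.example.gov")                           # hypothetical portal
hub = fetch_dcat("https://hub.example.gov", "/api/feed/dcat-us/1.1.json")  # hypothetical Hub

print(len(catalog.get("dataset", [])), "datasets in the generic catalog")
print(len(hub.get("dataset", [])), "datasets in the Hub feed")
```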
"},{"location":"editingTemplate/","title":"Editing Template","text":"The GeoBTAA Metadata Template (https://z.umn.edu/b1g-template) is a set of spreadsheets that are formatted for our metadata editor, GEOMG.
Users will need to make a copy of the spreadsheet to use for editing. In some cases, the Metadata Coordinator can provide a customized version of the sheets for specific collections.
The Template contains the following tabs:
- Map Template
- Dataset Template
- Website Record Template
- Values: All of the controlled vocabulary values for the associated fields.
- Access Links and Multiple Downloads: Fields for adding secondary tables.
Note
The input format for some fields in this template may differ from how the field is documented in OpenGeoMetadata. These differences are intended to make it easier to enter values, which will be transformed when we upload the record to GBL Admin.
-
Bounding Box coordinates should be entered as W,S,E,N
. The coordinates are automatically transformed to a different order ENVELOPE(W,E,N,S)
. Read more under the Local Input Guidelines.
-
Date Range should be entered as yyyy-yyyy
. This is automatically transformed to [yyyy TO yyyy].
-
External links are added separately under column headers for the type of link. These are combined into the dct_references_s
field as a string of key:value pairs.
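A minimal sketch of that combination step; the template column names and the reference URI keys shown here are illustrative only, and the actual keys depend on the link type defined in the GeoBlacklight references vocabulary:

```python
# Combine separate link columns into a single dct_references_s JSON string.
import json

row = {
    "Download": "https://example.gov/data/parcels.zip",       # hypothetical values
    "Information": "https://example.gov/datasets/parcels",
}

# Map template column headers to reference URI keys (illustrative mapping).
key_map = {
    "Download": "http://schema.org/downloadUrl",
    "Information": "http://schema.org/url",
}

references = {key_map[column]: url for column, url in row.items() if url}
row["dct_references_s"] = json.dumps(references)
print(row["dct_references_s"])
```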
"},{"location":"ephemeral-data/","title":"The challenge of ephemeral data","text":"summary
Many of the resources in the BTAA Geoportal are from sites that continually update their datasets. As a result, we need to regularly re-harvest the metadata.
Government agencies now issue most geospatial information as digital data instead of as physical maps. However, many academic libraries have not yet expanded their collection scopes to include publicly available digital data, and are therefore no longer systematically capturing and storing the changing geospatial landscape for future researchers.
The BTAA Geoportal partially fills a gap in the geospatial data ecosystem by cataloging metadata records for current publicly available state, county, and municipal geospatial resources. The value of this data is high, as researchers routinely use it to form the base layers for web maps or geographic analysis. However, the the mandates and policies for providing this data varies considerably from state to state and from county to county. The lack of consistent policies at this level of government means that this data can be considered ephemeral, as providers regularly migrate, update, delete, and re-publish data without saving previous versions and without notification to the public.
The lack of standard policies at this level of government means that this data can be considered ephemeral. It may be updated, removed, or replaced without notification to the public. The rate at which datasets change or disappear is variable, but is often high.
This continual turnover creates a difficult environment for researchers to properly source data and replicate results. It also requires a great deal of dedicated labor to maintain the correct access links in the geoportal. As the geoportal\u2019s collection grows, the labor required to maintain it grows as well.
"},{"location":"geobtaa-metadata-application-profile/","title":"GeoBTAA Metadata Profile","text":"The GeoBTAA Metadata Application Profile consists of the following components:
"},{"location":"geobtaa-metadata-application-profile/#1-opengeometadata-elements","title":"1. OpenGeoMetadata Elements","text":" - The BTAA Geoportal uses the OpenGeoMetadata Schema for each resource.
- The current version of OpenGeoMetadata is called 'Aardvark'.
- This lightweight schema was designed specifically for the GeoBlacklight application and is geared towards discoverability.
- The GeoBTAA Metadata Profile aligns with all of the guidelines and recommendations in the official OpenGeoMetadata documentation.
- The schema is documented on the OpenGeoMetadata website .
"},{"location":"geobtaa-metadata-application-profile/#2-custom-elements","title":"2. Custom Elements","text":" - The GeoBTAA profile includes custom fields for lifecycle tracking and administration
- These elements are generally added to the record by admin staff. When they appear on editing templates, they are grayed out.
- They all start with the namespace
b1g
- See the Custom Elements page for more detail
"},{"location":"geobtaa-metadata-application-profile/#3-geobtaa-input-guidelines","title":"3. GeoBTAA Input Guidelines","text":" -
For the content in some fields, the GeoBTAA profile has specific guidelines that extends or differs from what is documented in the OpenGeoMetadata schema.
-
See the GeoBTAA Input Guidelines page for more detail
Info
The GeoBTAA Metadata Template can be found at https://z.umn.edu/b1g-template
"},{"location":"glossary/","title":"Glossary","text":""},{"location":"glossary/#python-and-scripting","title":"Python and scripting","text":""},{"location":"glossary/#apis","title":"APIs","text":"An API (Application Programming Interface) is a set of rules, protocols, and tools for building software applications. It specifies how different software components should interact with each other, allowing them to communicate and exchange information.
In the context of Python, APIs are often used to retrieve data from a web server or to interact with an external service. For example, the requests library is a popular Python package that simplifies making HTTP requests to APIs, while the json module provides an easy way to parse and encode JSON data.
"},{"location":"glossary/#beautiful-soup","title":"Beautiful Soup","text":"HTML and XML parser
"},{"location":"glossary/#conda","title":"Conda","text":" - Conda is an open source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
"},{"location":"glossary/#conda-package-manager","title":"Conda Package Manager","text":" - Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
For more information on Conda and environments, refer to this website: https://docs.conda.io/projects/conda/en/stable/user-guide/index.html
"},{"location":"glossary/#pandas","title":"Pandas","text":"Pandas is a Python library that contains many functions for analyzing data. For the GeoBTAA workflows, we are most interested in how it eases transformations between JSON and CSV files:
CSV files: Pandas can easily read and write CSV files using its read_csv()
and to_csv()
methods, respectively. These methods can handle many CSV formats, including different delimiter characters, header options, and data types. Once the CSV data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using Pandas' powerful data manipulation tools, such as filtering, grouping, and aggregation.
JSON data: Pandas can also read and write JSON data using its read_json()
and to_json()
methods. These methods can handle various JSON formats, such as normal JSON objects, JSON arrays, and JSON lines. Once the JSON data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using the same data manipulation tools used for CSV data.
pandas DataFrame A DataFrame is similar to a Python list or dictionary, but it has rows and columns, similar to a spreadsheet. This makes it a simpler task to convert between JSON and CSV. To review these Python terms, refer to the glossary.
"},{"location":"glossary/#pandas-dataframe","title":"Pandas DataFrame","text":"Pandas DataFrame is a 2-dimensional table-like data structure that is used for data manipulation and analysis. It is a powerful tool for handling and processing structured data. A DataFrame has rows and columns, similar to a spreadsheet. It can contain heterogeneous data types and can be indexed and sliced in various ways. It is part of the Pandas library and provides powerful features for data analysis and manipulation. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas-dataframe
"},{"location":"glossary/#python-list","title":"Python List","text":"A list is a basic data structure in Python that is used to store a collection of items of different data types. It is an ordered collection of elements, and each element is indexed by an integer starting from 0. A list can contain elements of different data types, including other lists and dictionaries. A list is mutable, meaning its elements can be added, removed, or modified. It is a simple, general-purpose data structure that is commonly used for storing and manipulating small to medium-sized data sets.
"},{"location":"glossary/#python-dictionary","title":"Python Dictionary","text":"A dictionary is another data structure in Python that is used to store data in the form of key-value pairs. It is an unordered collection of elements, where each element is identified by a unique key instead of an index. The keys can be of any hashable data type, and the values can be of any data type. A dictionary is also mutable, meaning its elements can be added, removed, or modified. It is commonly used to store and manipulate structured data, such as user profiles or configuration settings.
"},{"location":"glossary/#python-objects","title":"Python Object(s)","text":"In Python, everything is an object. An object is an instance of a class, which is a blueprint for creating objects. It contains data and functions (also called methods) that operate on that data. Objects are created dynamically, which means that you don't have to declare the type of a variable or allocate memory for it explicitly. When you assign a value to a variable, Python creates an object of the appropriate type and associates it with that variable.
For example, an integer in Python is an object of the int class, and a string is an object of the str class. Each object of a class has its own set of data attributes, which store the values of its properties, and methods, which operate on those properties.
"},{"location":"glossary/#python-interface","title":"Python Interface","text":"In Python, an interface refers to the set of methods that a class or an object exposes to the outside world. It defines the way in which an object can be interacted with, and the methods that are available to be called. An interface can be thought of as a contract that specifies how a class can be used, and what methods are available to a programmer when working with that class.
Python is an object-oriented programming language, and as such, it supports the concept of an interface. Python does not have a specific language construct for creating an interface. Instead, interfaces are implemented using a combination of abstract base classes and duck typing.
An abstract base class is a class that cannot be instantiated directly, but instead, is intended to be subclassed. It defines a set of abstract methods that must be implemented by any subclass. This allows for the creation of a common interface that can be shared among multiple classes.
Duck typing is a concept in Python that allows for the determination of an object's type based on its behavior, rather than its actual type. This means that if an object behaves like a certain type, it is considered to be of that type. This allows for more flexibility in programming, as it allows for the creation of classes that can be used interchangeably, as long as they implement the same methods.
"},{"location":"glossary/#spacy","title":"SpaCy","text":"SpaCy is a python module that uses natural language processing (NLP) to comb through text help us find and extract patterns. To extract place names, it uses named entity recognition (NER) by searching a list of place names. This list is called \"GPE\", which stands for Geopolitical Entity.
This article in the Code4Lib journal, From Text to Map: Combing Named Entity Recognition and Geographic Information Systems, explains the process we use to extract place names.
"},{"location":"glossary/#data-portals","title":"Data Portals","text":""},{"location":"glossary/#arcgis-hubs","title":"ArcGIS Hubs","text":"ArcGIS Hub is a data portal platform that allows users to share geospatial data as web services. It is especially popular with local governments that already use the Esri ArcGIS ecosystem.
"},{"location":"glossary/#ckan","title":"CKAN","text":"CKAN is a tool for making open data websites, it helps you manage and publish collections of data. For CKAN purposes, data is published in units called \u201cdatasets\u201d.
- A dataset contains two things:
Information or \u201cmetadata\u201d about the data. For example, the title and publisher, date, what formats it is available in, what license it is released under, etc.
A number of \u201cresources\u201d, which hold the data itself. CKAN does not mind what format the data is in. A resource can be a CSV or Excel spreadsheet, XML file, PDF document, image file, linked data in RDF format, etc. CKAN can store the resource internally, or store it simply as a link, the resource itself being elsewhere on the web. A dataset can contain any number of resources. For example, different resources might contain the data for different years, or they might contain the same data in different formats.
"},{"location":"glossary/#socrata","title":"Socrata","text":""},{"location":"glossary/#geoblacklight","title":"GeoBlacklight","text":""},{"location":"glossary/#metadata-standards","title":"Metadata Standards","text":"This chart provides links to documentation about the common metadata standards and schemas we encounter when harvesting.
{{ read_csv('../tables/metadataStandards.csv') }}
- Dublin Core: A set of metadata elements that are used to describe resources in a simple and standardized way. Dublin Core is widely used in library systems, archives, and other digital repositories.
- MODS (Metadata Object Description Schema): A flexible and extensible XML schema for describing a wide range of resources, including books, articles, and other types of digital content.
- METS (Metadata Encoding and Transmission Standard): A standard for encoding descriptive, administrative, and structural metadata for digital objects.
- MARC (Machine-Readable Cataloging): A metadata format used by libraries to describe bibliographic information about books, journals, and other materials.
"},{"location":"input-guidelines/","title":"Local Input Guidelines","text":"For the following elements, the GeoBTAA Metadata Profile has input guidelines beyond what is documented in the OpenGeoMetadata schema:
"},{"location":"input-guidelines/#title","title":"Title","text":"Maps: The title for scanned maps is generally left as it was originally cataloged by a participating library. MARC subfields are omitted and can be inserted in the Description field.
Datasets: Harvested datasets often are lacking descriptive titles and may need to be augmented with place names. Dates may also be added to the end, but if the dataset is subject to updates, the data should be left off. Acronyms should be spelled out. The preferred format for dataset titles is: Theme [place] {date}
. This punctuation allows for batch processing and splitting title elements.
"},{"location":"input-guidelines/#language","title":"Language","text":"Although Language is optional in the OGM schema, a three-digit code is required for the BTAA Geoportal.
"},{"location":"input-guidelines/#creator","title":"Creator","text":"When possible, Creators should be drawn from a value in the Faceted Application of Subject Terminology (FAST).
"},{"location":"input-guidelines/#creator-id","title":"Creator ID","text":"If the Creator value is from a name authority, insert the ID in this field.
"},{"location":"input-guidelines/#publisher","title":"Publisher","text":"Maps: Publisher values for maps are pulled from the original catalog record. Remove subfields for place names and dates.
Datasets: The BTAA Geoportal does not use the Publisher field for Datasets.
"},{"location":"input-guidelines/#provider","title":"Provider","text":"This is the name of the organization hosting the resources. If the organization is part of the BTAA library network, a university icon will display next to the resource's title. However, most Providers will not have an icon.
"},{"location":"input-guidelines/#spatial-coverage","title":"Spatial Coverage","text":"This should be in the format used by the Faceted Application of Subject Terminology (FAST).
For US counties and cities, the format should be state--county
or state--city
. The state itself should also be included. Examples:
Example
-
Wisconsin--Dane County
-
Wisconsin--Madison
-
Wisconsin
"},{"location":"input-guidelines/#bounding-box","title":"Bounding Box","text":"On the Metadata Editing Template, provide Bounding Boxes in this format: W,S,E,N This order matches the DCAT API and is how the Klokan Bounding Box provides coordinates with their \"CSV\" setting.
This format will be programmatically converted to other formats when it is published to the Geoportal:
-
The OpenGeoMetadata Bounding Box field (dcat_bbox_s
) uses this order: ENVELOPE(W,E,N,S)
-
The OpenGeoMetadata Geometry field (locn_geometry
) uses a WKT format and the coordinate order will be converted to this layout: POLYGON((W N, E N, E S, W S, W N))
-
The OpenGeoMetadata Centroid field (dcat_centroid
) will be calculated to display longitude,latitude.
Example
Metadata CSV: -120,10,-80,35
converts to
dcat_bbox_s:
ENVELOPE(-120,-80,35,10)
locn_geometry:
POLYGON((-120 35, -80 35, -80 10, -120 10, -120 35))
dcat_centroid
: \"22.5,-100.0\"
"},{"location":"lifecycle/","title":"Lifecycle","text":"Deprecation
This lifecycle documentation had been replaced by a newer version at Resource Lifecycle.
"},{"location":"lifecycle/#1-submit-records","title":"1. Submit Records","text":"It is the role of the Team members to seek out new content for the geoportal. See the page How to Submit Resources to the BTAA Geoportal for more information.
"},{"location":"lifecycle/#2-metadata-transition","title":"2. Metadata Transition","text":"This stage involves batch processing of the records, including harvesting, transformations, crosswalking information. This stage is carried out by the Metadata Coordinator, who may contact Team members for assistance.
Regardless of the method used for acquiring the metadata, it is always transformed into a spreadsheet for editing. These spreadsheets are uploaded to GEOMG Metadata Editor.
Because of the variety of platforms and standards, this process can take many forms. The Metadata Coordinator will contact Team members if they need to supply metadata directly.
"},{"location":"lifecycle/#3-edit-records","title":"3. Edit Records","text":"Once the metadata is in spreadsheet form, it is ready to be normalized and augmented. UMN Staff will add template information and use spreadsheet functions or scripts to programmatically complete the metadata records.
- The GeoBTAA Metadata Template is for creating GeoBlacklight metadata.
- Refer to the documentation for the OpenGeoMetadata, version Aardvark fields and the GeoBTAA Custom Elements for guidance on values and formats.
"},{"location":"lifecycle/#4-publish-records","title":"4. Publish Records","text":"Once the editing spreadsheets are completed, UMN Staff uploads the records to GEOMG
, a metadata management tool. GEOMG validates records and performs any needed field transformations. Once the records are satisfactory, they are published and available in the BTAA Geoportal.
Read more on the GEOMG documentation page.
"},{"location":"lifecycle/#5-maintenance","title":"5. Maintenance","text":"General Maintenance
All project team members are encouraged to review the geoportal records assigned to their institutions periodically to check for issues. Use the feedback form at the top of each page in the geoportal to report errors or suggestions. This submission will include the URL of the last page you were on, and it will be sent to the Metadata Coordinator.
Broken Links
The geoportal will be programmatically checked for broken links on a monthly basis. Systematic errors will be fixed by UMN Staff. Some records from this report may be referred back to Team Members for investigating broken links.
Subsequent Accessions
- Portals that utilize the DCAT metadata standard will be re-accessioned monthly.
- Other GIS data portals will be periodically re-accessioned by the Metadata Coordinator at least once per year.
- Team members may be asked to review this work and provide input on decisions for problematic records.
Retired Records
When an external resource has been moved, deleted, or versioned to a new access link, the original record is retired from the BTAA Geoportal. This is done by converting the Publication State of the record from 'Published' to 'Unpublished'. The record is not deleted from the database and can still be accessed via a direct link. However, it will not show up in any search queries.
"},{"location":"model/","title":"Content Organization Model","text":"GeoBlacklight organizes records with a network model rather than with a hierarchical model. It is a flat system whereby every database entry is a \"Layer\" and uses the same metadata fields. Unlike many digital library applications, it does not have different types of records for entities such as \"communities,\" \"collections,\" or \"groups.\" As a result, it does not present a breadcrumb navigation structure, and all records appear in the same catalog directory with the URL of https:geo.btaa.org/catalog/ID
.
Instead of a hierarchy, GeoBlacklight relates records via metadata fields. These fields include Member Of
, Is Part Of
, Is Version Of
, Source
, and a general Relation
. This flexibility allows records to be presented in several different ways. For example, records can have multiple parent/child/grandchild/sibling relationships. In addition, they can be nested (i.e., a collection can belong to another collection). They can also connect data layers about similar topics or represent different years in a series.
The following diagram illustrates how the BTAA Geoportal organizes records. The connecting arrow lines indicate the name of the relationship. The labels reflect each record's Resource Class (Collections, Websites, Datasets, Maps, Web services).
"},{"location":"purpose/","title":"About the Geoportal","text":"summary
The BTAA Geoportal is a catalog that makes it easier to find geospatial resources.
Geospatial data and tools are increasingly important in academic research and education. With the growing number of GIS datasets and digitized historical maps available, it can be challenging to locate the right resources since they are scattered across various platforms and not always tagged with the necessary metadata for easy discovery. To address this issue, the Big Ten Academic Alliance Geospatial Data Project launched in 2015 to connect scholars with geospatial resources.
One of the primary outputs of the project is the BTAA Geoportal, a comprehensive index of tens of thousands of geospatial resources from hundreds of sources. The Geoportal enables users to search by map, keyword, and category, providing access to scanned maps, digital GIS data, aerial imagery, and interactive mapping applications. All of the resources available through the Geoportal are free and open, and the scanned maps cover countries around the world. Most of the data in the catalog is sourced from local governments, such as states, counties, and cities.
The Geoportal is a useful tool because finding local geospatial data through a simple Google search can be difficult due to the lack of visibility of these datasets. The problem is that users need to know which agency is responsible for creating and distributing the data they are looking for and visit that agency's website to access the datasets. For instance, if you are researching a particular neighborhood in a city and need data on the roads, parks, parcels, and city council ward boundaries, you might need to check different state agencies, the city or the county website. But with the Geoportal, you can easily search by What, Where, and When without worrying about the Who or Why.
Overall, the BTAA Geoportal provides an easy and comprehensive way for scholars to find and access geospatial resources from various sources, enabling them to focus on their research rather than the time-consuming task of finding the right data.
"},{"location":"resource-lifecycle/","title":"Resource Lifecycle","text":"5 Stages of the Resource Lifecycle
flowchart LR\n\n I((1.<br> IDENTIFY)) --> H[/2. <br> HARVEST/] --> P[3. <br> EDIT] --> X[4. <br>INDEX] --> M{{5. <br>MAINTAIN}}--> H[/2. <br>HARVEST/]\n
"},{"location":"resource-lifecycle/#1-identify","title":"1. Identify","text":" BTAA-GIN Team Members and Product Manager
Team members seek out new content for the geoportal. See the page How to Submit Resources to the BTAA Geoportal for more information.
"},{"location":"resource-lifecycle/#2-harvest","title":"2. Harvest","text":" Graduate Research Assistants and Product Manager
This stage involves obtaining the metadata for resources. At a minimum, this will include a title and and access link. However, it will ideally also include descriptions, dates, authors, rights, keywords, and more.
Here are the most common ways that we obtain the metadata:
- a BTAA-GIN Team Member sends us the metadata values as individual documents or as a combined spreadsheet
- we are provided with (or are able to find) an API that will automatically generate the metadata in a structured file, such as JSON or XML
- we develop a customized script to scrape directly from the HTML on a source's website
- we manually copy and paste the metadata into a spreadsheet
- a combination of one or more of the above
This step also involves using a crosswalk to convert the metadata into the schema needed for the Geoportal. Our goal is to end up with a spreadsheet containing columns matching our metadata template.
Why do we rely on CSV?
CSV (Comma Separated Values) files organize tabular data in plain text format, where each row of data is separated by a line break, and each column of data is separated by a delimiter.
We have found this tabular format to be the most human-readable way to batch create, edit, and troubleshoot metadata records. We can visually scan large numbers of records at once and normalize the values in ways that would be difficult with native nested formats, like JSON or XML. Therefore, many of our workflow processes involve transforming things to and from CSV.
"},{"location":"resource-lifecycle/#3-edit","title":"3. Edit","text":" Graduate Research Assistants and Product Manager
When working with metadata, it is common to come across missing or corrupted values, which require troubleshooting and manual editing in our spreadsheets. Refer to the Collections Project Board for examples of this work.
After compiling the metadata, we run a validation and cleaning script to ensure the records conform to the required elements of our schema. Finally, we upload the completed spreadsheet to GEOMG, which serves as the administrative interface for the Geoportal. If GEOMG detects any formatting errors, it will issue a warning and may reject the upload.
"},{"location":"resource-lifecycle/#4-index","title":"4. Index","text":" Product Manager
Once the metadata is successfully uploaded to GEOMG, we can publish the records to the Geoportal. The technology that actually stores the records and enables searching is called Solr. The action of adding records is known as \"Indexing.\"
Periodically, we need to remove records from the Geoportal. To do this, we use GEOMG to either delete them or change their status to \"unpublished.\"
"},{"location":"resource-lifecycle/#5-maintain","title":"5. Maintain","text":" BTAA-GIN Team Members, Graduate Research Assistants, and Product Manager
The Geoportal is programmatically checked for broken links on a monthly basis. The are fixed either by manually repairing them or by reharvesting from the source.
"},{"location":"resource-lifecycle/#sequence-diagram-of-resource-lifecycle","title":"Sequence diagram of Resource Lifecycle","text":"\n\n\n\n sequenceDiagram\n actor Team Member\n actor Product Manager\n participant GitHub\n actor Research Assistant\n participant GEOMG\n participant Geoportal \n\n\n Note right of Team Member: IDENTIFY\n\n Team Member->>Product Manager: Submit Resources\n Product Manager->>GitHub: Create GitHub issue\n GitHub ->>Research Assistant: Assign issue\n Note left of Research Assistant: HARVEST\n Note left of Research Assistant: EDIT \n\n Research Assistant->>GEOMG: Upload records\n Research Assistant ->>GitHub: Update GitHub issue\n Note right of GEOMG: PUBLISH \n\n Product Manager->>GEOMG: Publish records\n GEOMG->>Geoportal: Send records online \n Product Manager->>GitHub: Close GitHub issue\n Product Manager ->> Team Member: Share link to published records\n\n Note left of Research Assistant: MAINTAIN \n
"},{"location":"resourceClasses/","title":"Resource Classes","text":""},{"location":"resourceClasses/#collections","title":"Collections","text":"The BTAA Geoportal interprets the Resource Class, Collections, as top-level, custom groupings. These reflect our curation activities and priorities.
Other records are linked to Collections using the Member Of
field. The ID of the parent record is added to the child record only. View all of the current Collections in the geoportal at this link: https://geo.btaa.org/?f%5Bgbl_resourceClass_sm%5D%5B%5D=Collections
"},{"location":"resourceClasses/#websites","title":"Websites","text":"The BTAA Geoportal uses the Resource Class, Websites, to create parent records for data portals, digital libraries, dashboards, and interactive maps. These often start off as standalone records. Once the items in a website have been indexed, they will have child records.
Individual Datasets, Maps, or Web services are linked to the Website they came from using the Is Part Of
field. The ID of the parent record is added to the child record only.
View all of the current Websites in the geoportal at this link: https://geo.btaa.org/?f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites
"},{"location":"resourceClasses/#datasets-maps-and-web-services","title":"Datasets, Maps, and Web services","text":"The items in this Resource Class represent individual data layers, scanned map files, and/or geospatial web services. (Some items may have multiple Resource Classes attached to the same record.)
This item class is likely to have the most relationships specified in the metadata. A typical Datasets record might have the following:
Member Of
a Collections record Is Part Of
a Websites record - If the data was digitized from a paper map in the geoportal, it can be linked to the Maps record via the
Source
relation - a general
Relation
to a research guide or similar dataset
"},{"location":"resourceClasses/#multipart-items","title":"Multipart Items","text":"Many items in the geoportal are multipart. There may be individual pages from an atlas, sublayers from a larger project, or datasets broken up into more than one download. In these cases, the Is Part Of
field is used.
As a result, these items may have multiple Is Part Of
relationships- (1) the parent for the multipart items and (2) the original website.
"},{"location":"schedule/","title":"Harvesting Schedule","text":"Established April 2023
"},{"location":"schedule/#weekly","title":"Weekly","text":" - ArcGIS Hubs
"},{"location":"schedule/#monthly","title":"Monthly","text":" - CKAN sites
- Minnesota Geospatial Commons
"},{"location":"schedule/#quarterly","title":"Quarterly","text":" - PASDA
- OpenGeoMetadata
- Illinois Geospatial Data Clearinghouse
- Minnesota Natural Resource Atlas
- Socrata
"},{"location":"schedule/#annually","title":"Annually","text":" - Licensed Resources
- Custom HTML sites for public data
"},{"location":"schedule/#as-needed","title":"As Needed","text":" - Any website that reports errors during the automated monthly broken link scan.
- Any website when we receive notification that new records are available.
Info
See the GitHub Project Board, Collections, to track harvests.
"},{"location":"submit-resources/","title":"How to submit resources to the BTAA Geoportal","text":""},{"location":"submit-resources/#1-identify-resources","title":"1. Identify Resources","text":"Places to find public domain collections
- State GIS clearinghouses
- State agencies (especially DNRs and DOTs)
- County or city GIS departments
- Library digital collections
- Research institutes
- Nonprofit organizations
Review the Curation Priorites and the Collection Development Policy for guidelines on selecting resources.
"},{"location":"submit-resources/#optional-contact-the-organization","title":"Optional: Contact the organization","text":"Use this template to inform the organization that we plan to include their resources in our geoportal.
Tip
If metadata for the resources are not readily available, the organization may be able to send you an API, metadata documents, or a spreadsheet export.
"},{"location":"submit-resources/#2-investigate-metadata-harvesting-options","title":"2. Investigate metadata harvesting options","text":"Metadata records can be submitted directly or we can harvest it using parsing and transformation scripts.
Here are the most common methods of obtaining metadata for the BTAA Geoportal:
"},{"location":"submit-resources/#spreadsheets","title":"Spreadsheets","text":"This method is preferred, because the submitters can control which metadata values are exported and because format transformations by UMN Staff are not necessary. The GeoBTAA Metadata Template shows all of the fields needed for the Geoportal.
"},{"location":"submit-resources/#api-harvesting-or-html-parsing","title":"API Harvesting or HTML Parsing","text":"Most data portals have APIs or HTML structures that can be programmatically parsed to obtain metadata for each record.
DCAT enabled portals
ArcGIS Open Data Portals (HUB), Socrata portals, and some others share metadata in the DCAT standard.
CKAN / DKAN portals
This application uses a custom metadata schema for their API.
HTML Parsing
If a data portal or website does not have an API, we may be able to parse the HTML pages to obtain the metadata needed to create GeoBlacklight schema records.
OAI-PMH
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that can be used to harvest metadata records from enabled repositories. The records are usually available as a simple Dublin Core XML format. If the protocol is not set up to include extra fields, such as the map image's download link or bounding box, this method may not be sufficient on its own.
"},{"location":"submit-resources/#individual-metadata-files","title":"Individual Metadata files","text":"Geospatial metadata standards are expressed in the XML or JSON format, which can be parsed to extract metadata needed to create GeoBlacklight schema records. Common standards for geospatial resources include:
- ISO 19139
- FGDC
- ArcGIS 1.0
- MARC
- MODS
Tip
The best way to transfer MARC records is to send a single file containing multiple records in the .MRC or MARC XML format. The Metadata Coordinator will use MarcEdit or XML parsing to transform the records.
"},{"location":"submit-resources/#3-contact-the-btaa-gin-product-manager","title":"3. Contact the BTAA-GIN Product Manager","text":"Send an email, Slack message to the Product Manager / Metadata Coordinator.
Minimum information to include:
- Title and Description of the collection
- a link to the website
- (If known) information about how to harvest the metadata or construct access links.
The submission will be added to our collections processing queue.
Info
Metadata processing tasks are tracked on our public GitHub project dashboard.
"},{"location":"supplementalMetadata/","title":"Supplemental Metadata","text":"All other forms of metadata, such as ISO 19139, FGDC Content Standard for Digital Geospatial Metadata, attribute table definitions, or custom schemas are treated as Supplemental Metadata.
- Supplemental Metadata is not usually edited directly for inclusion in the project.
- If this metadata is available as XML or HTML, it can be added as a hosted link for the Metadata preview tool in GeoBlacklight.
- XML or HTML files can be parsed to extract metadata that forms the basis for the item\u2019s GeoBlacklight schema record.
- The file formats that can be viewed within the geoportal application include:
- ISO 19139 XML
- FGDC XML
- MODS XML
- HTML (any standard)
"},{"location":"tutorials/","title":"Tutorials","text":"These tutorials are short, easy to complete exercises to help someone get the basics of running and writing scripts to harvest metadata. They are available as Jupyter Notebooks hosted in GitHub in the Harvesting Guide repository.
"},{"location":"tutorials/#1-setting-up-your-environment","title":"1. Setting up your environment","text":" - These tutorials will guide users on how to set up your environment for harvesting.
- Getting Started with GitHub: Provides an introduction to GitHub and a walkthrough of creating a repository.
- Getting Started with Jupyter Notebooks: Provides an introduction and overview of cell types.
"},{"location":"tutorials/#2-navigating-paths","title":"2. Navigating paths","text":"This tutorial shows how to navigate and list directories. Techniques covered include:
- Using the
cd
command in the terminal - Navigating to and from the home directory
- Listing the current path
- Listing documents within the current path
"},{"location":"tutorials/#3-iterating-over-files","title":"3. Iterating over files","text":" - This guide will assist users in how to open, read, and print the results of a CSV.
- The module
os.Walk
will be introduced to read through multiple directories to find files. - The pandas module will be used to display a CSV for records.
"},{"location":"tutorials/#4-merge-csv-files-based-on-a-shared-column","title":"4. Merge CSV files based on a shared column","text":"This tutorial will take two CSV files and combine them using a shared key.
"},{"location":"tutorials/#5-transform-a-batch-of-json-files-into-a-single-csv-file","title":"5. Transform a batch of JSON files into a single CSV file","text":"This tutorial uses the Python module pandas (Python Data Analysis Library) to open a batch of JSON files and transform the contents into a single CSV.
"},{"location":"tutorials/#6-extract-place-names","title":"6. Extract Place Names","text":"This tutorial scans the two columns from a CSV file ('Title' and 'Description') to look for known place names and writes the values to a separate field.
"},{"location":"tutorials/#7-parsing-html-with-beautiful-soup","title":"7. Parsing HTML with Beautiful Soup","text":"This tutorial will guide users through Hyper Text Mark-Up Language (HTML) site parsing using the BeautifulSoup Python module, including:
- how to install the BeautifulSoup module
- scan and list web pages
- return titles, descriptions, and dates
- writing parsed results to CSV format
"},{"location":"tutorials/#8-use-openstreetmap-to-generate-bounding-boxes","title":"8. Use OpenStreetMap to generate bounding boxes","text":"This tutorial demonstrates how to query the OpenStreetMap API using place names to return bounding box coordinates.
Credits
These tutorials were prepared by Alexander Danielson and Karen Majewicz in April 2023.
"},{"location":"recipes/","title":"About","text":"These recipes are step-by-step how-to guides for harvesting and processing metadata from the sites that we harvest the most frequently.
The scripts needed for these recipes are all in the form of Jupyter Notebooks. To get started, download, fork, or clone the Harvesting Guide repository.
Warning
These recipes are not guaranteed to work! Since they rely on external websites, the scripts are necessarily works-in-progress. They need to be regularly updated and reconfigured in response to changes at the source website, python updates, and adjustments to our metadata schema.
"},{"location":"recipes/R-01_arcgis-hubs/","title":"ArcGIS","text":""},{"location":"recipes/R-01_arcgis-hubs/#purpose","title":"Purpose","text":"To scan the DCAT 1.1 API of ArcGIS Hubs and return the metadata for all suitable items as a CSV file in the GeoBTAA Metadata Application Profile.
This recipe includes steps that use the GBL Admin toolkit. Access to this tool is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB\n\nA{{STEP 1. <br>Download arcHubs.csv}}:::green --> B[[STEP 2. <br>Run Jupyter Notebook harvest script]]:::green;\nB --> C{Did the script run successfully?};\nC --> |No| D[Troubleshoot]:::yellow;\nD --> H{Did the script stall because of a Hub?};\nH --> |Yes| I[Refer to the page Update ArcGIS Hubs]:::yellow;\nH --> |No & I can't figure it out.| F[Refer issue back to Product Manager]:::red;\nH --> |No| J[Try updating your Python modules or investigating the error]:::yellow;\nJ --> B;\nI --> A;\nC --> |Yes| K[[STEP 3. Validate and Clean]]:::green; \nK --> E[STEP 4. <br>Publish/unpublish records in GBL Admin]:::green; \n\n\nclassDef green fill:#E0FFE0\nclassDef yellow fill:#FAFAD2\nclassDef red fill:#E6C7C2\n\n\nclassDef questionCell fill:#fff,stroke:#333,stroke-width:2px;\nclass C,H questionCell;\n
"},{"location":"recipes/R-01_arcgis-hubs/#step-1-download-the-list-of-active-arcgis-hubs","title":"Step 1: Download the list of active ArcGIS Hubs","text":"We maintain a list of active ArcGIS Hub sites in GEOMG.
Shortcut
Pre-formatted GEOMG query link
- Go to the Admin (https://geo.btaa.org/admin) dashboard
- Filter for items with these parameters:
- Resource Class: Websites
- Accrual Method: DCAT US 1.1
- Select all the results and click Export -> CSV
- Download the CSV and rename it
arcHubs.csv
Info
Exporting from GEOMG will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the \"Is Part Of\" field for each dataset.
- Title: The name of the Hub. This is transferred to the \"Provider\" field for each dataset
- Publisher: The place or administration associated with the portal. This is applied to the title in each dataset in brackets
- Spatial Coverage: A list of place names. These are transferred to the Spatial Coverage for each dataset
- Member Of: a larger collection level record. Most of the Hubs are either part of our Government Open Geospatial Data Collection or the Research Institutes Geospatial Data Collection
However, it is not necessary to spend extra time manually removing the extra fields, because the Jupyter Notebook code will ignore them.
"},{"location":"recipes/R-01_arcgis-hubs/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-01_arcgis-hubs.ipynb
- Move the downloaded file
arcHubs.csv
into the same directory as the Jupyter Notebook. - Run all cells.
Expand to read about the R-01_arcgis-hubs.ipynb Jupyter Notebook This code reads data from hubFile.csv
using the csv.DictReader
function. It then iterates over each row in the file and extracts values from specific columns to be used later in the script.
For each row, the script also defines default values for a set of metadata fields. It then checks if the URL provided in the CSV file exists and is a valid JSON response. If the response is not valid, the script prints an error message and continues to the next row. Otherwise, it extracts dataset identifiers from the JSON response and passes the response along with the identifiers to a function called metadataNewItems.
It also includes a function to drop duplicate rows. ArcGIS Hub administrators can include datasets from other Hubs in their own site. As a result, some datasets are duplicated in other Hubs. However, they always have the same Identifier, so we can use pandas to detect and remove duplicate rows.
"},{"location":"recipes/R-01_arcgis-hubs/#troubleshooting-as-needed","title":"Troubleshooting (as needed)","text":"The Hub sites are fairly unstable and it is likely that one or more of them will occasionally fail and interrupt the script.
- Visit the URL for the Hub to check and see if the site is down, moved, etc.
- Refer to the Update ArcGIS Hubs list page for more guidance on how to edit the website record.
- If a site is missing: Unpublish it from GEOMG, indicate the Date Retired, and make a note in the Status field.
- If a site is still live, but the JSON API link is not working: remove the value \"DCAT US 1.1\" from the Accrual Method field and make a note in the Status field.
- If the site has moved to a new URL, update the website record with the new information.
- Start over from Step 1.
"},{"location":"recipes/R-01_arcgis-hubs/#step-3-validate-and-clean","title":"Step 3: Validate and Clean","text":"Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GEOMG.
"},{"location":"recipes/R-01_arcgis-hubs/#step-4-upload-all-records","title":"Step 4: Upload all records","text":" - Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
- Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
- Use the old Date Accessioned value to search for the previous harvest date. This example uses 2023-03-07: (https://geo.btaa.org/admin/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=ArcGIS+Hub&q=%222023-03-07%22&rows=20&sort=score+desc)
- Unpublish the ones that have the old date in the Date Accessioned field
- Record this number in the GitHub issue for the scan under Number Deleted
- Look for records in the uploaded batch that are still \"Draft\" - these are new records.
- Publish them and record this number in the GitHub issue under Number Added
"},{"location":"recipes/R-02_socrata/","title":"Socrata","text":""},{"location":"recipes/R-02_socrata/#purpose","title":"Purpose","text":"To scan the DCAT API of Socrata Data Portals and return the metadata for all suitable items as a CSV file in the GeoBTAA Metadata Application Profile.
Note: This recipe is very similar to the ArcGIS Hubs Scanner.
This recipe includes steps that use the metadata toolkit GEOMG. Access to GEOMG is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB\n\nA{{STEP 1. <br>Download socrataPortals.csv}}:::green --> B[[STEP 2. <br>Run Jupyter Notebook harvest script]]:::green;\nB --> C{Did the script run successfully?}:::white;\nC --> |No| D[Troubleshoot]:::yellow;\nD --> H{Did the script stall because of a portal?}:::white;\nH --> |Yes| I[Remove or update the portal from the list]:::yellow;\nH --> |No & I can't figure it out.| F[Refer issue back to Product Manager]:::red;\nH --> |No| J[Try updating your Python modules or investigating the error]:::yellow;\nJ --> B;\nI --> A;\nC --> |Yes| K[[STEP 3. Validate and Clean]]:::green; \nK --> E[STEP 4. <br>Publish/unpublish records in GEOMG]:::green; \n\nclassDef green fill:#E0FFE0\nclassDef yellow fill:#FAFAD2\nclassDef red fill:#E6C7C2\nclassDef white fill:#FFFFFF\n
"},{"location":"recipes/R-02_socrata/#step-download-the-list-of-active-socrata-data-portals","title":"Step: Download the list of active Socrata Data Portals","text":"We maintain a list of active Socrata Hub sites in GEOMG.
Shortcut
Pre-formatted GEOMG query link
- Go to the GEOMG dashboard
- Use the Advanced Search to filter for items with these parameters:
- Format: \"Socrata data portal\"
- Select all the results and click Export -> CSV
- Download the CSV and rename it
socrataPortals.csv
Info
Exporting from GEOMG will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the \"Is Part Of\" field for each dataset.
- Title: The name of the Hub. This is transferred to the \"Provider\" field for each dataset
- Publisher: The place or administration associated with the portal. This is applied to the title in each dataset in brackets
- Spatial Coverage: A list of place names. These are transferred to the Spatial Coverage for each dataset
- Member Of: a larger collection level record. Most of the Hubs are either part of our Government Open Geospatial Data Collection or the Research Institutes Geospatial Data Collection
It is not necessary to take extra time and manually remove the unused fields, because the Jupyter Notebook code will ignore them.
"},{"location":"recipes/R-02_socrata/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-02_socrata.ipynb
- Move the downloaded file
socrataPortals.csv
into the same directory as the Jupyter Notebook.
"},{"location":"recipes/R-02_socrata/#troubleshooting-as-needed","title":"Troubleshooting (as needed)","text":" - Visit the URL for the Socrata Portal to check and see if the site is down, moved, etc.
- If a site is missing
- Unpublish it from GEOMG and indicate the Date Retired, and make a note in the Status field.
- Start over from Step 1.
"},{"location":"recipes/R-02_socrata/#step-3-validate-and-clean","title":"Step 3: Validate and Clean","text":"Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GEOMG.
"},{"location":"recipes/R-02_socrata/#step-4-upload-to-geomg","title":"Step 4: Upload to GEOMG","text":" - Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
- Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
- Use the old Date Accessioned value to search for the previous harvest date.
- Unpublish the ones that have the old date in the Date Accessioned field 5. Record this number in the GitHub issue for the scan under Number Deleted
- Look for records in the uploaded batch that are still \"Draft\" - these are new records.
- Publish them and record this number in the GitHub issue under Number Added
"},{"location":"recipes/R-03_ckan/","title":"CKAN","text":""},{"location":"recipes/R-03_ckan/#purpose","title":"Purpose","text":"To scan the Action API for CKAN data portals and retrieve metadata for new items while returning a list of deleted items.
Warning
This batch CKAN recipe is being deprecated and replaced with recipes tailored to each site.
graph TB\n\nA((STEP 1. <br>Set up directories)) --> B[STEP 2. <br>Run Jupyter Notebook script] ;\nB --> C{Did the script run successfully?};\nC --> |No| D[Troubleshoot];\nD -->A;\nC --> |No & I can't figure it out.| F[Refer issue back to Product Manager];\nC --> |Yes| E[STEP 3. <br>Edit places names & titles]; \nE --> G[STEP 4. <br>Upload new records];\nG --> H[STEP 5. <br>Unpublish deleted records];\n\nclassDef goCell fill:#99d594,stroke:#333,stroke-width:2px\nclass A,B,C,E,G goCell;\nclassDef troubleCell fill:#ffffbf,stroke:#333,stroke-width:2px;\nclass D troubleCell;\nclassDef endCell fill:#fc8d59,stroke:#333,stroke-width:2px\nclass F,H endCell;\nclassDef questionCell fill:#fff,stroke:#333,stroke-width:2px;\nclass C questionCell;\n\n\n
"},{"location":"recipes/R-03_ckan/#step-1-set-up-your-directories","title":"Step 1: Set up your directories","text":" -
Navigate to your local Recipes directory for R-03_ckan.
-
Verify that there are two folders
resource
: contains a CSV for each portal per harvest that lists all of the dataset identifiers reports
: combined CSV metadata files for all new and deleted datasets per harvest
-
Review the CKANportals.csv file. Each active portal should have values in the following fields:
- portalName
- URL
- Provider
- Publisher
- Spatial Coverage
- Bounding Box
"},{"location":"recipes/R-03_ckan/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook
- Open your local copy of R-03_ckan.ipynb
Info
This script will harvest from a set of CKAN data portals. It saves a list of datasets found in each portal and will compare the output between runs. The result will be two CSVs: new items and deleted items.
The script only harvests items that can be identified as shapefiles or imagery.
"},{"location":"recipes/R-03_ckan/#step-3-edit-the-metadata-for-new-items","title":"Step 3: Edit the metadata for new items","text":"The new records can be found in reports/allNewItems_{today's date}.csv
and will need some manual editing.
- Spatial Coverage: Add place names related to the datasets.
- Title: Concatenate values in the Alternative Title column with the Spatial Coverage of the dataset.
"},{"location":"recipes/R-03_ckan/#step-4-upload-metadata-for-new-records","title":"Step 4: Upload metadata for new records","text":"Open GEOMG and upload the new items found in reports/allNewItems_{today's date}.csv
"},{"location":"recipes/R-03_ckan/#step-5-delete-metadata-for-retired-records","title":"Step 5: Delete metadata for retired records","text":"Unpublish records found in reports/allDeletedItems_{today's date}.csv
. This can be done in GEOMG manually (one by one) or with the GEOMG documents update script.
"},{"location":"recipes/R-04_oai-pmh/","title":"Harvest via OAI-PMH","text":"Using Illinois Library Digital Collections as example
Steps:
"},{"location":"recipes/R-04_oai-pmh/#part-1-get-the-files-via-oai","title":"Part 1: get the files via oai","text":" - Use this OAI-PMH validator tool at https://validator.oaipmh.com
- Go to the Download XML tab
- Enter the base URL (https://digital.library.illinois.edu/oai-pmh) and the set name (6ff64b00-072d-0130-c5bb-0019b9e633c5-2)
- Wait for the app to pull all the XML files and download them (ideally in a ZIP, but sometimes that doesn't work and you need to click on each file)
"},{"location":"recipes/R-04_oai-pmh/#part-2-turn-the-records-into-a-csv-via-openrefine","title":"Part 2: turn the records into a CSV via OpenRefine","text":" - start OpenRefine
- Choose \"Get Data from this Computer\" and upload the XML files
- From the parsing options, select from the Header \"record\"
"},{"location":"recipes/R-04_oai-pmh/#part-3-collapse-multivalued-cells","title":"Part 3: Collapse multivalued cells","text":" - The multi-valued cells will start out being grouped together by which XML file they came from. We don't want that, so remove the column called File.
- Now, they are grouped by a value \"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\" Leave this for now.
- There are multiple Identifiers (dc:identifier), so select that column, Edit Cells - Join multi-valued cells
- Move the Identifier column to the beginning so that items will be grouped by these unique values
- Collapse the remaining cells with the same Join Multi-valued cells function
- Export to CSV
"},{"location":"recipes/R-05_iiif/","title":"IIIF","text":""},{"location":"recipes/R-05_iiif/#purpose","title":"Purpose","text":"To extract metadata from IIIF JSON Manifests.
"},{"location":"recipes/R-05_iiif/#step-1-download-the-jsons","title":"Step 1: Download the JSONs","text":" - Create a CSV file called \"jsonUrls.csv\" with just the URLs of the JSONs.
- Navigate to the Terminal/Command Line and into a directory where you can save files
- Type:
wget -i jsonUrls.csv
- Review that all of the JSONs downloaded to a local directory
"},{"location":"recipes/R-05_iiif/#step-2-run-the-extraction-script","title":"Step 2: Run the extraction script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-05_iiif.ipynb
- Move the downloaded file
jsonUrls.csv
into the same directory as the Jupyter Notebook. - Run all cells
Warning
This will lump all the subsections into single fields and the user will still need to split them.
"},{"location":"recipes/R-05_iiif/#step-3-merge-the-metadata","title":"Step 3: Merge the metadata","text":"Although the Jupyter Notebook extracts the metadata to a flat CSV, we still need to merge this with any existing metadata for the records.
"},{"location":"recipes/R-05_iiif/#step-4-upload-to-geomg","title":"Step 4: Upload to GEOMG","text":""},{"location":"recipes/R-06_mods/","title":"MODS","text":""},{"location":"recipes/R-06_mods/#purpose","title":"Purpose:","text":"To extract metadata from XML files in the MODS metadata format.
"},{"location":"recipes/R-06_mods/#step-1-obtain-a-list-of-urls-for-the-xml-files","title":"Step 1: Obtain a list of URLs for the XML files","text":"This list may be supplied by the submitter or we may need to query the website to find them.
"},{"location":"recipes/R-06_mods/#step-2-run-the-extraction-script","title":"Step 2: Run the extraction script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-06-mods.ipynb
- Move the downloaded file
arcHubs.csv
into the same directory as the Jupyter Notebook. - Run all cells.
"},{"location":"recipes/R-06_mods/#step-3-format-as-opengeometadata","title":"Step 3: Format as OpenGeoMetadata","text":"Manually adjust the names of the columns to match metadata into our GeoBTAA metadata template.
"},{"location":"recipes/R-06_mods/#step-4-upload-to-geomg","title":"Step 4: Upload to GEOMG","text":""},{"location":"recipes/R-07_ogm/","title":"OpenGeoMetadata","text":""},{"location":"recipes/R-07_ogm/#purpose","title":"Purpose","text":"To convert OpenGeoMetadata (GeoBlacklight) JSONs to a CSV in the GeoBTAA Metadata Profile
"},{"location":"recipes/R-07_ogm/#step-1-obtain-the-jsons","title":"Step 1: Obtain the JSONs","text":"Collect the metadata JSONs. This type of metadata typically is obtain in one of the following ways:
- Direct submission (Team Members from Wisconsin)
- Via an OpenGeoMetadata repository
"},{"location":"recipes/R-07_ogm/#step-2-run-the-conversion-script","title":"Step 2: Run the conversion script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-07_openGeoMetadata.ipynb
- Move the folder with the metadata JSONs into the same directory as the Jupyter Notebook.
- Declare the paths and folder name in the Notebook.
- Run all cells
Tip
Depending upon the source, you may want to adjust the script to accomodate custom fields.
"},{"location":"recipes/R-07_ogm/#step-3-edit-the-output-csv","title":"Step 3: Edit the output CSV","text":"The GeoBTAA Metadata Profile may have additional or different requirements. Consult with the Product Manager on which fields may need augmentation.
"},{"location":"recipes/R-07_ogm/#step-4-upload-to-geomg","title":"Step 4: Upload to GEOMG","text":""},{"location":"recipes/R-08_pasda/","title":"PASDA","text":""},{"location":"recipes/R-08_pasda/#purpose","title":"Purpose","text":"To harvest metadata from Pennsylvania Spatial Data Access (PASDA), a custom data portal. To begin, start the Jupyter Notebook:
- Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-08_pasda.ipynb
"},{"location":"recipes/R-08_pasda/#step-1-obtain-a-list-of-landing-pages","title":"Step 1: Obtain a list of landing pages","text":"Run Part 1 of the Notebook to obtain a list of all of the records currently in PASDA. This list can be found by doing a blank search on the PASDA website: (https://www.pasda.psu.edu/uci/SearchResults.aspx?Keyword=+)
Since this is a large result list, the recipe recommends downloading the HTML file to your desktop (In the Safari browser, this is File-Save As - Save As Page Source)
Then, we can use the Beautiful Soup module to query this page and harvest the following values:
- Title
- Date Issued
- Publisher
- Description
- Metadata file link
- Download link
"},{"location":"recipes/R-08_pasda/#step-2-download-the-supplemental-metadata-files","title":"Step 2: Download the supplemental metadata files","text":"Context
Bounding boxes & keywords are not found in the landing pages, but most of the PASDA datasets have a supplemental metadata document, which does contain coordinates. The link to this document was scraped to the 'Metadata File\" column during the previous step.
Most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is found in the 'Metadata File\" column. Although these files are created as XMLs, the link is a rendered HTML.
There is additional information in these files that we want to scrape, including bounding boxes and geometry type.
At the end of Step 2, you will have a folder of HTML metadata files.
"},{"location":"recipes/R-08_pasda/#step-3-query-the-downloaded-supplemental-metadata-files","title":"Step 3: Query the downloaded supplemental metadata files","text":"This section of the script will scan each HTML metadata file. If it contains bounding box information, it will pull the coordinates. Otherwise, it will assign a default value of the State of Pennsylvania extents.
It will also pull the geometry type and keywords, if available.
"},{"location":"recipes/R-08_pasda/#step-4-add-default-and-calculated-values","title":"Step 4: Add default and calculated values","text":"This step will clean up the harvested metadata and add our administrative values to each row. At the end, there will be a CSV file in your directory named for today's date.
"},{"location":"recipes/R-08_pasda/#step-5-upload-the-csv-to-geomg","title":"Step 5: Upload the CSV to GEOMG","text":" - Upload the new records to GEOMG
- Use the Date Accessioned field to search for records that were not present in the current harvest. Retire any records that have the code \"08a-01\" but were not part of this harvest.
"},{"location":"recipes/R-09_umedia/","title":"UMedia","text":""},{"location":"recipes/R-09_umedia/#purpose","title":"Purpose","text":"To harvest new records added to the University Of Minnesota's UMedia Digital Library.
"},{"location":"recipes/R-09_umedia/#step-1-set-up-folders","title":"Step 1: Set up folders","text":" - Navigate to the UMedia Recipe directory at R-09_umedia.ipynb
- Verify the following folders are present:
requests
This folder stores all search results in JSON format for each reaccession as request_YYYYMMDD.json
.
jsons
This folder stores all JSON files by different added month for UMedia maps. After we get the search result JSON file from each reaccession, we will read this request_YYYYMMDD.json
file in detail to filter out the included maps by month, and store them to dateAdded_YYYYMM.json
individually.
reports
This folder stores all CSV files for metadata by month. Once we have JSON files for different month, we extract all useful metadata and contribute in the dateAdded_YYYYMM.csv
in this folder.
"},{"location":"recipes/R-09_umedia/#step-2-run-the-harvesting-script","title":"Step 2: Run the harvesting script","text":" - Start Jupyter Notebook and open R-09_umedia.ipynb
- The second code cell will ask for an input on how many map records you want to harvest.
- The third code cell will ask for a date range. Select a month (in the form
yyyy-mm
) based on the last time you ran the script.
"},{"location":"recipes/R-09_umedia/#step-3-edit-the-metadata","title":"Step 3: Edit the metadata","text":""},{"location":"recipes/R-09_umedia/#step-4-upload-to-geomg","title":"Step 4: Upload to GEOMG","text":""},{"location":"recipes/add-bbox/","title":"Add bounding boxes","text":"Summary
This page describes processes for obtaining bounding box coordinates for our scanned maps. The coordinates will be used for indexing the records in the Big Ten Academic Alliance Geoportal.
**About bounding box coordinates for the BTAA Geoportal **
- Bounding boxes enable users to search for items with a map interface.
- The format is 4 coordinates in decimal degrees
- Provide the coordinates in this order: West, South, East, North.
- The bounding boxes do not need to be exact, particularly with old maps that may not be very precise anyways.
"},{"location":"recipes/add-bbox/#manual-method","title":"Manual method","text":""},{"location":"recipes/add-bbox/#part-a-setup","title":"Part A: Setup","text":" - Open and inspect the image file.
- Try to identify a single / combined region that the map or atlas represents
- You can also check to see if the map has the bounding coordinates printed in the text anywhere or you are able to find the bounds by inspecting the edges.
- Open another window with the Klokan Bounding Box tool.
- Set the Copy & Paste section to CSV.
"},{"location":"recipes/add-bbox/#part-b-find-the-coordinates","title":"Part B: Find the coordinates","text":""},{"location":"recipes/add-bbox/#option-1-search-for-a-place-name","title":"Option 1: Search for a place name","text":" - Use the search boxes on the Klokan Bounding Box tool to zoom to the region. (For example, search for \u201cIllinois\u201d.
- Manually adjust the grey overlay box in the Klokan site to line up the edges to the edges of the map.
- Try to line it up reasonably closely
"},{"location":"recipes/add-bbox/#option-2-draw-a-shape","title":"Option 2: Draw a shape","text":" - Switch to the Polygon tool by clicking on the pentagon icon
- Click as many points on the screen as needed to approximate the map extent.
- Click on the first point to close the polygon
- The interface will display a dotted line showing the bounding box rectangle.
"},{"location":"recipes/add-bbox/#part-c-copy-back-to-geobtaa-metadata","title":"Part C: Copy back to GeoBTAA metadata","text":" - Click the \u201cCopy to Clipboard\u201d icon on the Klokan site.
- Paste the coordinates into the Bounding Box field in the GeoBTAA metadata template or in the GEOMG metadata editor.
"},{"location":"recipes/add-bbox/#programmatic-method","title":"Programmatic method","text":"The OpenStreetMap offers and API that allows users to query with place names and return a bounding box. Follow the Tutorial, Use OpenStreetMap to generate bounding boxes, for this method.
"},{"location":"recipes/clean/","title":"Validate and Clean Metadata","text":"Info
Find the cleaning script here: https://github.com/geobtaa/harvesting-guide/tree/main/recipes/R-00_clean
As a final step of Edit stage of the resource lifecycle, we run a cleaning script to fix common errors:
"},{"location":"recipes/clean/#required-fields","title":"Required fields","text":" - Resource Class: Checks that an entry exists and that it is one of the controlled values. If the field is empty, the cleaning script will insert
Datasets
as a default. - Access Rights: Checks that it contains either
Public
or Restricted
. If empty, the script will insert Public
as a default.
"},{"location":"recipes/clean/#conditionally-required-fields","title":"Conditionally required fields","text":" - Format: If the \"Download\" field has a value, Format must also be present. If empty, the script will insert
File
as a default.
"},{"location":"recipes/clean/#syntax","title":"Syntax","text":" - Date Range: If present, checks that it is valid with a range in the format
yyyy-yyyy
, where the second value is earlier than the first. If the second value is earlier (lower) than the first, the script will reorder them. - Bounding Box: There are numerous possible conditions that the script will fix:
- Rounding: The script will round all coordinates to two decimal points. (For some collections, we change this to three). This is done because the bounding boxes function as a finding aid and overly precise coordinates can be misleading. It also saves on the memory load for geometry and centroid calculations.
- non-degrees: If one of the coordinates exceeds the earth's coordinates (over 180 for longitude or 90 for latitude), the coordinates are considered invalid and the entire field will be cleared for that record.
- lines or points: If the script finds that easting and westing or north and south coordinates are equal, it will add a .001 to the end.
"},{"location":"recipes/clean/#reports","title":"Reports","text":"After cleaning, the script will produce two CSVS:
- Cleaned Metadata: All of the original rows with fixes applied. A new column called \"Cleaned\" will indicate if the row was edited by the script.
- Cleaning Log: A list of all the records and fields that were cleaned and what was done.
"},{"location":"recipes/secondary-tables/","title":"How to upload links in secondary tables in GEOMG","text":" - We use two compound metadata fields,
Multiple Download Links
and Institutional Access Links
, that include multiple links that are formatted with both a label + a link. - Because these fields are not regular JSON flat key:value pairs, they are stored in secondary tables within GEOMG.
- When using GEOMG's Form view, these values can be entered by clicking into a side page linked from the record.
- For CSV uploads, these values are uploaded with a separate CSV from the one used for the main import template.
Tip
See the opengeometadata.org page on multiple downloads for how these fields are formatted in JSON
"},{"location":"recipes/secondary-tables/#manual-entry","title":"Manual entry","text":""},{"location":"recipes/secondary-tables/#multiple-download-links","title":"Multiple Download Links","text":" - On the Form view, scroll down to the end of the Links section and click the text \"Multiple Download Links\"
- Click the New Download URL button
- Enter a label (i.e., \"Shapefile\") and the download URL
- Repeat for as many as needed
"},{"location":"recipes/secondary-tables/#institutional-access-links","title":"Institutional Access Links","text":" - On the Form view, scroll down to the bottom of the right-hand navigation and click the text \"Institutional Access Links\"
- Click the New Access URL button
- Select an institution code and the access URL
- Repeat for as many as needed
"},{"location":"recipes/secondary-tables/#csv-upload-for-either-type","title":"CSV Upload for either type","text":" - Go to Admin Tools - Multiple Downloads or Access Links
- Upload a CSV in on of these formats:
CSV field headers for secondary tables
Multiple DownloadsInstitutional Access Links | friendlier_id | label | value |\n |---------------------|------------------|------------|\n | ID of target record | any string | the link |\n
| friendlier_id | institution_code | access_URL |\n |---------------------|------------------|------------|\n | ID of target record | 2 digit code | the link |\n
"},{"location":"recipes/split-bbox/","title":"Split Bounding Boxes that cross the 180th Meridian","text":""},{"location":"recipes/split-bbox/#problem","title":"Problem","text":"The BTAA Geoportal does not display bounding boxes that cross the 180th meridian (also known as the International Date Line.) In these circumstances, the West coordinate will be a positive number, but the East coordinate will be negative.
"},{"location":"recipes/split-bbox/#solution","title":"Solution","text":"One way to mitigate this is to create two bounding boxes for the OGM Aardvark Geometry
field. The Bounding Box value will be the same, but the Geometry field will have a multipolygon that is made up of two adjacent boxes.
The following script will scan a CSV of the records, identify which cross the 180th Meridian, and insert a multipolygon into a new column.
The script was designed with the assumption that the input CSV will be in the OGM Aardvark format, likely exported from GEOMG. The CSV file must contain a field for Bounding Box
. It may contain a Geometry
field with some values that we do not want to overwrite.
This script will create a new field called \"Bounding Box (WKT)\". Items that crossed the 180th Meridian will have a multipolygon in that field. Items that don't cross will not have a value in that field. Copy and paste only the new values into the Geometry
column and reupload the CSV to GEOMG.
import csv\n\ndef split_coordinates(coordinate_str):\n if not coordinate_str:\n return ''\n\n coordinates = coordinate_str.split(',')\n\n west, south, east, north = map(float, coordinates)\n\n if west > 0 and east < 0:\n polygon1 = f'({west} {south}, {179.9} {south}, {179.9} {north}, {west} {north}, {west} {south})'\n polygon2 = f'({-179.9} {south}, {-179.9} {north}, {east} {north}, {east} {south}, {-179.9} {south})'\n return f'MULTIPOLYGON(({polygon1}), ({polygon2}))'\n\n return coordinate_str\n\n# Specify the input CSV file path\ninput_file = 'your_input_file.csv'\n\n# Specify the output CSV file path\noutput_file = 'your_output_file.csv'\n\n# Specify the name of the column with the Bounding Box coordinates\ncoordinate_column = 'Bounding Box'\n\n# Specify the name of the new column to store the updated coordinates\nnew_coordinate_column = 'New Bounding Box (WKT)'\n\n# Read the input CSV and process the coordinates\nwith open(input_file, 'r') as file:\n reader = csv.DictReader(file)\n fieldnames = reader.fieldnames + [new_coordinate_column]\n\n with open(output_file, 'w', newline='') as output:\n writer = csv.DictWriter(output, fieldnames=fieldnames)\n writer.writeheader()\n\n for row in reader:\n bounding_box = row[coordinate_column]\n new_bounding_box = split_coordinates(bounding_box)\n row[new_coordinate_column] = new_bounding_box\n writer.writerow(row)\n\nprint(\"CSV processing completed!\")\n
"},{"location":"recipes/standardize-creators/","title":"Best Practices for Standardizing Creator Field Data","text":" Authors: Creator Standardization Working Group
Date: 28 November 2022
Info
See this Journal Article for a more thorough description of this process:
Laura Kane McElfresh (2023) Creator Name Standardization Using Faceted Vocabularies in the BTAA Geoportal, Cataloging & Classification Quarterly, DOI: 10.1080/01639374.2023.2200430
Creator names are a critical access point for the discovery of geospatial information. Within the BTAA Geoportal, creator names\u2013whether names of persons or corporate bodies\u2013are displayed on landing pages and the citation widget, and are indexed and faceted for searching and browsing. Standardizing the names of resource creators makes search results more predictable, thereby producing a better experience for Geoportal users.
To ensure that the Geoportal\u2019s collocation functions operate properly, this document recommends using the formulation of personal and corporate body names as they are found in identity registries and, when creators are not available in those registries, provides guidance for formulating names of creators. We seek to provide consistency of creator names within our database through the recommendations provided below. This document does not address the manual creation or editing of identity registry records.
These best practices assume that standardization of names will occur after data is ingested into the BTAA Geoportal; however this document may be used to inform description choices made before records are ingested into the Geoportal.
"},{"location":"recipes/standardize-creators/#preferred-identity-registries","title":"Preferred Identity Registries","text":"These best practices recommend consulting one or two name registries when deciding how to standardize names of creators: the Faceted Application of Subject Terminology (FAST) or the Library of Congress Name Authority File (LCNAF). FAST is a controlled vocabulary based on the Library of Congress Subject Headings (LCSH) that is well-suited to the faceted navigation of the Geoportal. The LCNAF is an authoritative list of names, events, geographic locations and organizations used by libraries and other organizations to collocate authorized creator names to make searching and browsing easier.
"},{"location":"recipes/standardize-creators/#overview-of-the-process","title":"Overview of the Process","text":"These best practices present the following workflow for standardizing names of creators:
"},{"location":"recipes/standardize-creators/#search-fast-for-the-creators-name","title":"Search FAST for the creator\u2019s name","text":" - If the creator\u2019s name is found, then use the name as found in FAST
- If there\u2019s no match in FAST, then consult the Guidance for Formulating Creator Names Not Present in the Registries Noted Above
"},{"location":"recipes/standardize-creators/#searching-fast","title":"Searching FAST","text":"To search the FAST registry for a creator name, you may use either assignFAST or searchFAST. assignFAST is ideal for quick searches, while searchFAST allows for advanced searching.
"},{"location":"recipes/standardize-creators/#assignfast","title":"assignFAST","text":" - Go to http://experimental.worldcat.org/fast/assignfast/ and begin typing the creator name into the text box.
- assignFAST will suggest headings in FAST format. When the correct heading appears, click it in the list of suggestions.
- The selected heading will appear in the text box, highlighted for copying.
- For example, if you type in
St. Francis, Minnesota
, you will see suggestions including - \u201cMinnesota--St. Francis (Anoka County) USE Minnesota--Saint Francis (Anoka County)\u201d
- \u201cMinnesota--St. Francis (Anoka Co.) USE Minnesota--Saint Francis (Anoka County)\u201d.
- Click on either of those suggestions and you will receive the authorized form of the name:
Minnesota--Saint Francis (Anoka County)
.
- Copy the authorized name from the text box and paste it into the spreadsheet.
"},{"location":"recipes/standardize-creators/#searching-lcnaf","title":"Searching LCNAF","text":"When a name is not found in FAST, search the Library of Congress Name Authority File (LCNAF) for a match using the directions found in the Searching LCNAF section below.
If no match is found, there\u2019s no requirement to do intensive research. Continue to the next section, Guidance for Formulating Creator Names Not Present in the Registries Noted Above. If using these Best Practices in a metadata sprint, you may alternatively move onto the next name in the sprint spreadsheet.
"},{"location":"recipes/standardize-creators/#guidance-for-formulating-creator-names-not-present-in-the-registries-noted-above","title":"Guidance for Formulating Creator Names Not Present in the Registries Noted Above","text":"When a personal or corporate body name cannot be found in neither FAST nor LCNAF, follow the directions below.
"},{"location":"recipes/standardize-creators/#personal-names","title":"Personal Names","text":"Personal names should be formulated in inverted order (last name first) based on the information that appears on the item in the Geoportal.
Felsted, L. E.\n Ackley, Seth\n Colvert, DeLynn C.\n Griffey, Ken, Jr. \n
In cases where extra information is needed to distinguish a name, you may add a parenthetical at the end of the name, e.g., Surveyor, Cartographer, Draftsman, Geologist, Engraver.
Perry, Katy (Cartographer)\n
"},{"location":"recipes/standardize-creators/#corporate-body-names","title":"Corporate Body Names","text":""},{"location":"recipes/standardize-creators/#abbreviations-and-initialisms","title":"Abbreviations and Initialisms","text":"Regardless of how a name appears on the resource, always use the spelled out form of the name as opposed to abbreviations or initialisms for the purpose of being clear. For example, use
United States Geological Survey
U.S.G.S.
USGS
Cook County Geographic Information Systems
Cook County GIS
"},{"location":"recipes/standardize-creators/#subordinate-bodies","title":"Subordinate Bodies","text":"A \u201csubordinate body\u201d is a corporate entity that is part of another corporate entity. To avoid confusion for Geoportal users, always include the name of the larger \u201cparent\u201d entity. For instance:
Cheyenne Light, Fuel and Power Company. Engineering Department
Engineering Department
Canada. Department of the Interior
Department of the Interior
"},{"location":"recipes/standardize-creators/#jurisdictional-geographic-names-used-in-the-creator-field","title":"Jurisdictional Geographic Names Used in the Creator Field","text":"For background, \u201cjurisdictional\u201d place names are those that are defined legally by a set of boundaries and overseen by a governmental agency. In the United States these would be cities, towns, townships, boroughs, villages (mostly), counties, states and so forth. Non-jurisdictional places are of two types: either entities in nature that have been given a name such as the Mississippi River or Rocky Mountains, or are administrative component areas of a larger formal jurisdiction such as ranger districts within a national forest.
It will be extremely rare NOT to find an authorized form of a jurisdictional place name in FAST and LCNAF. However, if you encounter a place not found in these resources, follow the pattern used in FAST.
For a dataset in which the creator name is given as \u201cCity of Kenosha\u201d, FAST formulates the jurisdictional place name as:
Wisconsin--Kenosha\n
"},{"location":"recipes/standardize-creators/#directions-for-standardizing-metadata-records-for-the-btaa-geoportal","title":"Directions for standardizing metadata records for the BTAA Geoportal","text":"When a name is not found in FAST, search the Library of Congress Name Authority File (LCNAF) for a match following the directions in the Searching LCNAF section below. If a matching name is found in LCNAF, a FAST record may be requested, as explained below.
If no match is found, there\u2019s no requirement to do intensive research. Instead, follow the direction in the section, Guidance for Formulating Creator Names Not Present in the Registries noted above.
"},{"location":"recipes/standardize-creators/#searching-lcnaf_1","title":"Searching LCNAF","text":"To search the LCNAF for a creator name, you may use either the Library of Congress Authorities or the LC Linked Data Service. The LC Linked Data Service is ideal for quick keyword searches, while Library of Congress Authorities allows for browse searching.
Searching the LC Linked Data Service:
- Go to https://id.loc.gov/authorities/names.html
- Type the creator name into the text box and press enter or click on Go
- The result that appears in the \"Label\" column is the LCNAF authorized form of the name, for instance,
Cumberland County (Pa.)
- The number at the far right in the \"Identifier\" column is the Library of Congress control number (LCCN), which for Cumberland County Pennsylvania is
n81032665
Often, when searching for names of persons, several results appear to be possible matches. In those cases, click on a heading in the results list and look for the \"Sources\" heading to see a list of citations that have been associated with that entity.
When a name is found in the LCNAF, submit a request that the LCNAF name be added to FAST by following the steps below.
"},{"location":"recipes/standardize-creators/#request-additions-to-fast","title":"Request Additions to FAST","text":"You will use the importFAST Subject Headings utility to request LCNAF additions to FAST.
- First, select the type of LCNAF name.
- Personal Name
- Corporate Name
- Topical: We are unlikely to use this one for a creator.
- Copy the \u201cIdentifier\u201d (the Library of Congress control number) from the LCNAF record and paste it into the \u201center name LCCN\u201d text box. Click the \u201cImport\u201d button.
- The form will automatically populate using the LCNAF name. Check to make sure the correct name has been imported.
- Enter the email address
geoportal@btaa.org
(you do not need to fill out the \u201cAnything extra\u201d text box) and click \u201cSubmit Heading\u201d. - The new FAST heading will appear! Please copy the FAST heading number from the end of the string (e.g. \u201c
fst02013467
\u201d) and paste it into the spreadsheet to show that the heading has been added to FAST. - Enter the newly created FAST heading into the spreadsheet.
Geographic names cannot be requested through the importFAST Subject Headings tool. In the spreadsheet, place \u201cYes\u201d in the column labeled \u201cRequest FAST geographic name\u201d.
"},{"location":"recipes/update-hub-list/","title":"How to update the list of ArcGIS Hub websites.","text":""},{"location":"recipes/update-hub-list/#background","title":"Background","text":"The BTAA Geoportal provides a central access point to find and browse public geospatial data. A large portion of these records come from ArcGIS Hubs that are maintained by states, counties, cities, and regional entities. These entities continually update the data and the website platforms. In turn, we need to continually update our links to these resources.
The ArcGIS Harvesting Recipe walks through how we programmatically query the ArcGIS Hubs to obtain the current list of datasets. This page describes how to keep our list of ArcGIS Hub websites updated.
"},{"location":"recipes/update-hub-list/#how-to-create-or-troubleshoot-an-arcgis-hub-website-record-in-geomg","title":"How to create or troubleshoot an ArcGIS Hub website record in GEOMG","text":"Info
The highlighted fields listed below are required for the ArcGIS harvesting script. If the script fails, check that these fields have been added. The underlined fields are used to query GEOMG and produce the list of Hubs that we regularly harvest.
Each Hub has its own entry in GEOMG. Manually create or update each record with the following parameters:
- Title: The name of the site as shown on its homepage. This value will be transferred into the Provider field of each dataset.
- Description: Usually \"Website for finding and accessing open data provided by \" followed by the name of the administrative place or organization publishing the site. Additional descriptions are helpful.
- Language: 3-letter code as shown on the OpenGeoMetadata website.
- Publisher: The administrative place or organization publishing the site. This value will be concatenated into the title of each dataset. For place names, use the FAST format (i.e.
Minnesota--Clay County
. - Resource Class:
Websites
This value is used for filtering & finding the Hubs in GEOMG - Temporal Coverage:
Continually updated resource
- Spatial Coverage: Add place names using the FAST format as described for the B1G Profile.
- Bounding Box: If the Hub covers a specific area, include a bounding box for it using the manual method described in the Add bounding boxes recipe.
- Member Of: one of the following values:
ba5cc745-21c5-4ae9-954b-72dd8db6815a
(Government Open Geospatial Data) b0153110-e455-4ced-9114-9b13250a7093
(Research Institutes Geospatial Data Collection)
- Format:
ArcGIS Hub
This value is used for filtering & finding the Hubs in GEOMG - Links - Reference - \"Full layer description\" : link to the homepage for the Hub
- ID and Code: Both of the values will be the same. Create a new code by following the description on the Code Naming Schema page. Use the Advanced Search in GEOMG to query which codes have already been used. If it is not clear what code to create, ask the Product Manager or use the UUID Generator website to create a random one. The ID value will be transferred into the Code field of each dataset.
-
Identifier: If the record will be part of the monthly harvests, add this to the end of the baseUrl (usually the homepage): /api/feed/dcat-us/1.1.json
. The Identifier will be used to query the metadata for the website.
Warning
Always check the Identifier link! It should show a JSON API in your browser that displays all of the metadata for each dataset hosted by the website. The baseUrl may be slightly different than the landing page for the organization. For example, some entities may add the string \"data-\" to the beginning of their site URL. The best way to make sure you have the right URL is to look for a box labeled \"Search all data\". This will result in a link like \"baseUrl
/search\". Then, replace the \"/search\" with /api/feed/dcat-us/1.1.json
-
Access Rights: Public
for all ArcGIS Hubs.
- Accrual Method:
DCAT US 1.1
. This value is used for filtering & finding the Hubs in GEOMG - Status: If the site is part of the ArcGIS Harvest, use the value
Indexed
. If the site is not part of the harvest, use Not indexed
. Other explanatory text can be included here, such as indicating if the site is broken. - Publication State: When a new record is created it will automatically be assigned
Draft
. Change the state to published
when the metadata record is ready. If the site breaks or is deprecated, change this value to unpublished
.
"},{"location":"recipes/update-hub-list/#how-to-remove-broken-or-deprecated-arcgis-hubs","title":"How to remove broken or deprecated ArcGIS Hubs","text":"If a Hub site stalls the Harvesting script, it needs to be updated in GEOMG.
"},{"location":"recipes/update-hub-list/#if-the-site-is-missing","title":"If the site is missing:","text":"Try to find a replacement site. When a Hub is updated to a new version, sometimes the baseURL will change. If a new site is found, update:
- Links - Reference - \"Full layer description\" : new landing page
- Identifier : the new API (with the suffix
/api/feed/dcat-us/1.1.json
)
"},{"location":"recipes/update-hub-list/#if-the-site-is-present-but-the-api-is-returning-an-error","title":"If the site is present, but the API is returning an error:","text":"In this case, we freeze the website and the dataset records, but stop new harvests. Make these changes:
"},{"location":"recipes/update-hub-list/#website-record","title":"Website Record","text":" - Accrual Method: remove
DCAT US 1.1
and leave blank - Status: Change the value \"Indexed\" to \"Not indexed\". Leave a short explanation if the API is broken.
"},{"location":"recipes/update-hub-list/#dataset-records","title":"Dataset Records","text":" - Export all of the records using the
Code
field - Accrual Method: change from \"ArcGIS Hub\" to \"ArcGISHub-paused\"
Broken APIs
These steps will remove the records from our preset query to select active ArcGIS Hubs in GBL Admin. The effect will be to freeze them until the API starts working again. Once the API becomes accessible again, reverse the Accrual Method values.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"GeoBTAA Metadata Handbook","text":"This handbook describes how to harvest and format metadata records for the BTAA Geoportal.
"},{"location":"#reference","title":"Reference","text":"Information about the GeoBTAA Metadata Application Profile and our harvest guidelines.
"},{"location":"#explanation","title":"Explanation","text":"Descriptions and clarifications of processes, content organization, policies, and tools
"},{"location":"#tutorials","title":"Tutorials","text":"Short, easy to complete exercises to help someone get the basics of running and writing scripts to harvest metadata.
"},{"location":"#recipes","title":"Recipes","text":"Multi-step workflows for harvesting and editing metadata for the BTAA Geoportal.
"},{"location":"#who-is-this-handbook-for","title":"Who is this handbook for?","text":" -
Team Members in the Big Ten Academic Alliance Geospatial Information Network (BTAA-GIN)
-
Development & Operations Staff in the BTAA-GIN
-
Users & developers of open-source geospatial projects, such as OpenGeoMetadata and GeoBlacklight
-
Contributors to the BTAA Geoportal
-
Users of the BTAA Geoportal
Metadata Handbook Version History Changes for Version 5.1 (September 25, 2023)
This version adds several new recipes and page cleanups.
- New recipes for:
- cleaning metadata
- adding bounding boxes
- normalizing creators
- updating our list of ArcGIS Hubs
- how to add multiple download links in GEOMG
- Updates the documentation for the ArcGIS, Socrata, and PASDA recipes.
- Updates the DCAT and CKAN documentation pages.
Changes for Version 5.0 (May 24, 2023)
This version incorporates the Harvesting Guide notebooks and includes documentation for harvesting metadata from different sources.
- New page describing the Tutorials in the Harvesting Guide
- Eight Recipe pages corresponding to Harvesting Guide
- Updated header design to match Geoportal
Changes for Version 4.6 (March 15, 2023)
- New page for manually adding bounding boxes
- Restructure using Diataxis framework
- Remove some GEOMG how to guidelines (moved to GEOMG Wiki)
- Clarify Editing Template differences from OGM-Aaardvark documentation
- Added Collection Development Policy and Curation Priorities documents
- Update input guidelines for Spatial Coverage (FAST IDs)
Changes for Version 4.5.1 (February 28, 2023)
- Update version documentation
- Add link to generated PDF
Changes for Version 4.5 (February 28, 2023)
- Add Creator ID
- Update input guidelines for Creator, Creator ID
- Remove Harvesting Guide info (migrating to separate site)
- Edit Submitting Metadata page
- Minor copy editing
- Add PDF export capability
Changes for Version 4.4 (August 23, 2022)
- updated theme
- reorganized and expanded navigation menu
- new sections for Harvesting Guide and using GEOMG
Changes for Version 4.3 (August 15, 2022)
- migrate to MkDocs.org platform
- update bounding box entry guidelines
- add GEOMG page
Changes for Version 4.2 (March 24, 2022)
- New Entry and Usage Guidelines page
- Expands content organization model documentation
- Changes the name of the schema from 'Aardvark' to 'OpenGeoMetadata (version Aardvark)'
- Cleans up outdated links
Changes for Version 4.1 (Jan 2022)
- updates Status as optional; removes controlled vocabulary
- Clarifies relationship model
Changes for Version 4.0 (July 2021)
- Incorporation of GEOMG Metadata Editor
- Upgrade to Aardvark Metadata Schema for GeoBlacklight
Changes for version 3.3 (May 13, 2020)
- Added University of Nebraska
- Reorganized Metadata Elements to match editing template
- Updated the \u201cUpdate the Collections\u201d section to match new administrative process for tracking records
Changes for version 3.2 (Jan 8, 2020)
- Added Date Range element
Changes for version 3.1 (Dec 19, 2019)
- Added collection level records metadata schema
Changes for version 3 (Oct 2019)
- GeoNetwork and Omeka deprecated
- all GeoBlacklight records are stored in a spreadsheet in Google Sheets
- records are transformed from CSV to GeoBlacklight JSON with a Python script
- additional metadata fields were added for administrative purposes
- IsPartOf field now holds a code pointing to the collection record
- Administrative groupings such as \u201cState agencies geospatial data\u201d are now subjects, not a Collection
- updated editing templates available
- all supplemental metadata can be stored as XML or HTML in project hosted folder
- updated links to collections database
"},{"location":"GEOMG/","title":"About GEOMG","text":"What is it? GEOMG is a custom tool that functions as a backend metadata editor and manager for the GeoBlacklight application.
Who uses it? BTAA-GIN Operations technical staff at the University of Minnesota
Who developed it? The BTAA Geoportal Lead Developer, Eric Larson, created GEOMG, with direction from the BTAA-GIN. It is based upon the Kithe framework.
Can other GeoBlacklight projects adopt it?
We are currently working on offering this tool as a plugin for GeoBlacklight.
In the meantime, this presentation describes the motivation for building the tool and a few screencasts showing how it works:
"},{"location":"about-harvesting/","title":"About harvesting","text":"This page describes some of the processes and terminology associated with extracting metadata from various sources.
"},{"location":"about-harvesting/#what-is-web-scraping","title":"What is web scraping?","text":"Web scraping is the process of programmatically collecting and extracting information from websites using automated scripts or bots. Common web scraping tools include pandas, Beautiful Soup, and WGET.
"},{"location":"about-harvesting/#what-is-data-harvesting","title":"What is data harvesting?","text":"Data harvesting refers to the process of collecting large volumes of data from various sources, such as websites, social media, or other online platforms. This can involve using automated scripts or tools to extract structured or unstructured data, such as text, images, videos, or other types of content. The collected data can be used for various purposes, such as data analysis or content aggregation.
"},{"location":"about-harvesting/#what-is-metadata-harvesting","title":"What is metadata harvesting?","text":"Metadata harvesting refers specifically to the process of collecting metadata from digital resources, such as websites, online databases, or digital libraries. Metadata is information that describes other data, such as the title, author, date, format, or subject of a document. Metadata harvesting involves extracting this information from digital resources and storing it in a structured format, such as a database or a metadata record.
Metadata harvesting is often used in the context of digital libraries, archives, or repositories, where the metadata is used to organize and manage large collections of digital resources. Metadata harvesting can be done using specialized tools or protocols, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which is a widely used standard for sharing metadata among digital repositories.
"},{"location":"about-harvesting/#do-scraping-and-harvesting-mean-the-same-thing","title":"Do \"scraping\" and \"harvesting\" mean the same thing?","text":"The terms \"harvesting\" and \"scraping\" are often used interchangeably. However, there may be subtle differences in the way these terms are used, depending on the context.
In general, scraping refers to the process of programmatically extracting data from websites using automated scripts or bots. The term \"scraping\" often implies a more aggressive approach, where data is extracted without explicit permission from the website owner. Scraping may involve parsing HTML pages, following links, and using techniques such as web crawling or screen scraping to extract data from websites.
On the other hand, harvesting may refer to a more structured and systematic approach to extracting data from websites. The term \"harvesting\" often implies a more collaborative approach, where data is extracted with the explicit permission of the website owner or through APIs or web services provided by the website. Harvesting may involve using specialized software or tools to extract metadata, documents, or other resources from websites.
"},{"location":"about-harvesting/#what-is-web-parsing","title":"What is web parsing?","text":"Web parsing refers to the process of scanning structured documents and extracting information. Although, it usually refers to parsing HTML pages, it can also describe parsing XML or JSON documents. Tools designed for this purpose, such as Beautiful Soup, are often called \"parsers.\"
"},{"location":"about-harvesting/#what-is-extract-transform-load-etl","title":"What is Extract-Transform-Load (ETL)?","text":"ETL (Extract Transform Load) is a process of extracting data from one or more sources, transforming it to fit the target system's data model, and then loading it into the target system, such as a database or a data warehouse.
The ETL process typically involves three main stages:
- Extract: This stage involves retrieving data from one or more sources, which may include databases, files, APIs, web services, or other data sources.
- Transform: This stage involves converting, cleaning, and restructuring the extracted data to fit the target system's data model and business rules. This may include tasks such as data mapping, data validation, data enrichment, data aggregation, or data cleansing.
- Load: This stage involves inserting or updating the transformed data into the target system, such as a database or a data warehouse. This may involve tasks such as data partitioning, indexing, or data quality checks.
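A minimal ETL sketch in Python appears below; the endpoint, field names, and file name are hypothetical and are only meant to show the three stages in sequence.
# Illustrative ETL sketch: extract records from an API, transform a few
# fields, and load the result into a CSV file. All names are hypothetical.
import csv
import requests

# Extract: retrieve source records as JSON
records = requests.get("https://example.com/api/items").json()

# Transform: keep and clean only the fields the target system expects
rows = [
    {"id": r.get("identifier"), "title": (r.get("title") or "").strip()}
    for r in records
]

# Load: write the transformed rows to the target CSV file
with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title"])
    writer.writeheader()
    writer.writerows(rows)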
"},{"location":"arcgis-harvest-policy-2018/","title":"BTAA GDP Accession Guidelines for ArcGIS Open Data Portals","text":"Version 1.0 - April 18, 2018
Deprecated
This document has been replaced by the BTAA-GIN Scanning Guidelines for ArcGIS Hub, version 2.0
"},{"location":"arcgis-harvest-policy-2018/#overview","title":"OVERVIEW","text":"This document describes the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Open Data Portal platform and/or the BTAA GDP harvesting and remediation processes.
"},{"location":"arcgis-harvest-policy-2018/#eligible-sites","title":"ELIGIBLE SITES","text":"Policy: Any ArcGIS Open Data Portal (Arc-ODP) that serves public geospatial data is eligible for inclusion in the BTAA GDP Geoportal (\u201cthe Geoportal\u201d). However, preference is given to portals that are hosting original layers, not federated portals that aggregate from other sites. Task Force members are responsible for finding and submitting Arc-ODPs for inclusion in the Geoportal. Each Arc-ODP will be assigned a Provenance value to the university that submitted it or is closest geographically to the site.
Explanation: In order to avoid duplication, records that appear in multiple Arc-ODPs should only be accessioned from one instance. This also helps to avoid harvesting records that may be out of date or not yet aggregated within federated portals. Although the technical workers at the University of Minnesota will be performing the metadata processing, the Task Force members are expected to periodically monitor their records and make suggestions for edits or additions.
"},{"location":"arcgis-harvest-policy-2018/#eligible-records","title":"ELIGIBLE RECORDS","text":"Policy: The only records that will be harvested from Arc-ODPs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following records types will not be accessioned on a regular basis:
- Web applications
- Nonspatial data, tabular data, PDFs
- Single records that describe many different associated files to download, such as imagery services with a vast number of sublayers
Explanation: Arc-ODPs are structured to automatically create item records from submitted Esri REST services. However, Arc-ODP administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning. If these types of resources are approved by the Metadata Coordinator, they may be processed separately from the regular accessions.
"},{"location":"arcgis-harvest-policy-2018/#query-harvest-frequency","title":"QUERY & HARVEST FREQUENCY","text":"Policy: The Arc-ODPs included in the Geoportal will be queried monthly to check for deleted and new items. The results of this query will be logged. Deleted items will be removed from the geoportal immediately. New records from the Arc-ODPs will be accessioned and processed within two months of harvesting the metadata.
Explanation: Removing broken links is a priority for maintaining a positive user experience. However, accessioning and processing records requires remediation work that necessitates a variable time frame.
"},{"location":"arcgis-harvest-policy-2018/#remediation-work","title":"REMEDIATION WORK","text":"The records will be processed by the Metadata Coordinator and available UMN Graduate Research Assistants. The following metadata remediation steps will be undertaken:
"},{"location":"arcgis-harvest-policy-2018/#1-a-python-script-will-be-run-to-harvest-metadata-from-the-dcat-api-this-will-provide-the-following-elements-for-each-record","title":"1. A Python script will be run to harvest metadata from the DCAT API. This will provide the following elements for each record:","text":" - Identifier
- Title
- Description
- Date Issued
- Date Modified
- Bounding Box
- Publisher
- Keywords
- Landing Page
- Web Service link
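The production script is not reproduced here, but a hedged sketch of reading those elements from a portal's DCAT data.json feed could look like the following; the portal URL is a placeholder and the field names follow the Project Open Data DCAT profile.
# Sketch: pull the listed elements from a DCAT data.json feed.
# The portal URL is a placeholder; this is not the production script.
import requests

catalog = requests.get("https://opendata.example.gov/data.json").json()

for dataset in catalog.get("dataset", []):
    record = {
        "Identifier": dataset.get("identifier"),
        "Title": dataset.get("title"),
        "Description": dataset.get("description"),
        "Date Issued": dataset.get("issued"),
        "Date Modified": dataset.get("modified"),
        "Bounding Box": dataset.get("spatial"),
        "Publisher": (dataset.get("publisher") or {}).get("name"),
        "Keywords": dataset.get("keyword", []),
        "Landing Page": dataset.get("landingPage"),
        # Web Service links are usually listed under "distribution"
        "Distributions": dataset.get("distribution", []),
    }
    print(record["Title"])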
"},{"location":"arcgis-harvest-policy-2018/#2-the-metadata-will-be-batch-augmented-with-administrative-template-values-in-the-following-elements","title":"2. The metadata will be batch augmented with Administrative template values in the following elements:","text":" - Collection
- Rights
- Type
- Format
- Provenance
- Language
- Centroid (derived from Bounding Box)
- Download Link (created from Landing Page)
- Tag (web service type)
- Thumbnail link (derived from server or Arc-ODP page)
"},{"location":"arcgis-harvest-policy-2018/#3-the-metadata-will-be-manually-augmented-with-descriptive-values-for-the-following-elements","title":"3. The metadata will be manually augmented with descriptive values for the following elements:","text":" - Subject (at least one ISO topic category)
- Geometry type (Vector or Raster)
- Spatial Coverage (place names written out to the nation level: \u201cMinneapolis, Minnesota, United States\u201d)
- Temporal Coverage (dates included in the title or description)
- Title (add place names, expand acronyms, and move dates to the end of the string)
- Description (remove html and non-UTF8 characters)
- Creator (if available)
- Solr Year (integer value based on temporal coverage or published date)
"},{"location":"arcgis-harvest-policy-2018/#4-the-metadata-will-not-be-fully-remediated-for-the-following-cases","title":"4. The metadata will not be fully remediated for the following cases:","text":" - Missing bounding box coordinates (0.00 values) will be defaulted to the bounding box of administrative level of the Arc-ODP or the record will be omitted.
- Missing or incomplete descriptions will be left alone or omitted from the record
- Individual records that require additional research in order to make the metadata record compliant, such as missing required elements or non-functioning links, will be omitted.
"},{"location":"arcgis-harvest-policy-2018/#standards-metadata","title":"STANDARDS METADATA","text":"Policy: Creating or linking to standards based metadata files for Arc-ODPs is out of scope at this time.
Explanation: If metadata is enabled for an Arc-ODP, it will be available as ArcGIS Metadata Format 1.0 in XML, which is not a schema that GeoBlacklight can display. The metadata may also be available as FGDC or ISO HTML pages, but these types of links are not part of the current GeoBlacklight schema. Further, very few Arc-ODPs are taking advantage of this feature at this time.
"},{"location":"arcgis-hub-guidelines/","title":"BTAA-GIN Scanning Guidelines for ArcGIS Hubs","text":"Version 2.0 - April 24, 2023
Info
This document replaces the BTAA GDP Accession Guidelines for ArcGIS Open Data Portals, version 1.0.
"},{"location":"arcgis-hub-guidelines/#overview","title":"Overview","text":"This document describes the BTAA-GIN Scanning Guidelines for ArcGIS Hubs, including eligible sites and records, harvest schedules, and remediation work. This policy may change at any time in response to updates to the ArcGIS Hub architecturs platform and/or the BTAA-GIN harvesting and remediation processes.
"},{"location":"arcgis-hub-guidelines/#eligible-sites","title":"Eligible sites","text":"Guideline: Any ArcGIS Hub (Hub) that serves public geospatial data is eligible for inclusion in the BTAA Geoportal (\u201cthe Geoportal\u201d). Our scope includes public Hubs from the states in the BTAA geographic region and Hubs submitted by Team Members that are of interest to BTAA researchers.
Explanation: See the BTAA-GIN Collection Development Policy for more details.
"},{"location":"arcgis-hub-guidelines/#eligible-records","title":"Eligible records","text":"Guideline: The only records that will be harvested from Hubs are Esri REST Services of the type Map Service, Feature Service, or Image Service. This is further restricted to only items that are harvestable through the DCAT API. By default, the following records types will not be accessioned on a regular basis:
- Web applications
- Nonspatial data, tabular data, PDFs
- Single records that describe many different associated files to download, such as imagery services with a vast number of sublayers
Explanation: Hubs are structured to automatically create item records from submitted Esri REST services. However, Hub administrators are able to manually add records for other types of resources, such as external websites or documents. These may not be spatial datasets and may not have consistently formed metadata or access links, which impedes automated accessioning.
"},{"location":"arcgis-hub-guidelines/#frequency","title":"Frequency","text":"Guideline: The Hubs included in the Geoportal will be scanned weekly to harvest complete lists of eligible records. The list will be published and overwrite the previous scan.
Explanation: Broken links negatively impact user experience. Over the course of a week, as many as 10% of the ArcGIS Hub records in the Geoportal can break or include outdated information.
"},{"location":"arcgis-hub-guidelines/#metadata-remediation","title":"Metadata Remediation","text":"Guideline: The harvesting script, R-01_arcgis-hubs
, programmatically performs all of the remediation for each record.
Explanation: We now scan a large number of ArcGIS Hubs, which makes manual remediation unrealistic. This is in contrast to the previous policy established in 2018, when our collection was smaller.
"},{"location":"b1g-custom-elements/","title":"Custom Elements","text":"This page documents the custom metadata elements for the GeoBTAA Metadata Profile. These elements extend the official OpenGeoMetadata (Aardvark) schema.
b1g-id Label URI Obligation b1g-01 Code b1g_code_s
Required b1g-02 Status b1g_status_s
Optional b1g-03 Accrual Method b1g_dct_accrualMethod_s
Required b1g-04 Accrual Periodicity b1g_dct_accrualPeriodicity_s
Optional b1g-05 Date Accessioned b1g_dateAccessioned_s
Required b1g-06 Date Retired b1g_dateRetired_s
Conditional b1g-07 Child Record b1g_child_record_b
Conditional b1g-08 Mediator b1g_dct_mediator_sm
Conditional b1g-09 Access b1g_access_s
Conditional b1g-10 Image b1g_image_ss
Optional b1g-11 GeoNames b1g_geonames_sm
Optional b1g-12 Publication State b1g_publication_state_s
Required b1g-13 Language String b1g_language_sm
Required b1g-14 Creator ID b1g_creatorID_sm
Optional"},{"location":"b1g-custom-elements/#code","title":"Code","text":"Label Code URI b1g_code_s
Profile ID b1g-01 Obligation Required Multiplicity 1-1 Field type string Purpose To group records based upon their source Entry Guidelines Codes are developed by the metadata coordinator and indicate the provider, the type of institution hosting the resources, and a numeric sequence number. For more details, see Code Naming Schema Commentary This administrative field is used to group and track records based upon where they are harvested. This is frequently an identical value to \"Member Of\". The value will differ for records that are retired (these are removed from \"Member Of\") and records that are part of a subcollection. Controlled Vocabulary yes-strict Example value 12d-01 Element Set B1G"},{"location":"b1g-custom-elements/#status","title":"Status","text":"Label Status URI b1g_status_s
Profile ID b1g-02 Obligation Optional Multiplicity 0-1 Field type string Purpose To indicate if a record is currently active, retired, or unknown. It can also be used to indicate if individual data layers from a website have been indexed in the Geoportal. Entry Guidelines Plain text string is acceptable Commentary This is a legacy admin field that was previously used to track published vs retired items. The current needs are still TBD. Controlled Vocabulary no Example value Active Element Set B1G"},{"location":"b1g-custom-elements/#accrual-method","title":"Accrual Method","text":"Label Accrual Method URI b1g_dct_accrualMethod_s
Profile ID b1g-03 Obligation Required Multiplicity 1-1 Field type string Purpose To describe how the record was obtained Entry Guidelines Some values, such as \"ArcGIS Hub\" should be entered consistently. Others may be more descriptive, such as \"Manually entered from text file.\" Commentary This allows us to find all of the ArcGIS records in one search. It also can help track records that have been harvested via different methods within the same collection. Controlled Vocabulary no Example value ArcGIS Hub Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#accrual-periodicity","title":"Accrual Periodicity","text":"Label Accrual Periodicity URI b1g_dct_accrualPeriodicity_s
Profile ID b1g-04 Obligation Optional Multiplicity 0-1 Field type string Purpose To indicate how often a collection is reaccessioned Entry Guidelines Enter one of the following values: Annually, Semiannually, Quarterly, Monthly, As Needed Commentary This field is primarily for collection level records. Controlled Vocabulary yes-not strict Example value As Needed Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#date-accessioned","title":"Date Accessioned","text":"Label Date Accessioned URI b1g_dateAccessioned_s
Profile ID b1g-05 Obligation Required Multiplicity 1-1 Field type string Purpose To store the date a record was harvested Entry Guidelines Enter the date a record was harvested OR when it was added to the geoportal using the format yyyy-mm-dd Commentary This field allows us to track how many records are ingested into the portal in a given time period and to which collections. Controlled Vocabulary no Example value 2021-01-01 Element Set B1G"},{"location":"b1g-custom-elements/#date-retired","title":"Date Retired","text":"Label Date Retired URI b1g_dateRetired_s
Profile ID b1g-06 Obligation Conditional Multiplicity 0-1 Field type string Purpose To store the date the record was removed from the geoportal public interface Entry Guidelines Enter the date a record was removed from the geoportal Commentary This field allows us to track how many records have been removed from the geoportal interface by time period and collection. Controlled Vocabulary no Example value 2021-01-02 Element Set B1G"},{"location":"b1g-custom-elements/#child-record","title":"Child Record","text":"Label Child Record URI b1g_child_record_b
Profile ID b1g-07 Obligation Optional Multiplicity 0-1 Field type string boolean Purpose To apply an algorithm to the record that causes it to appear lower in search results Entry Guidelines Only one of two values are allowed: true or false Commentary This is used to lower a record's placement in search results. This can be useful for a large collection with many similar metadata values that might clutter a user's experience. Controlled Vocabulary string boolean Example value true Element Set B1G"},{"location":"b1g-custom-elements/#mediator","title":"Mediator","text":"Label Mediator URI b1g_dct_mediator_sm
Profile ID b1g-08 Obligation Conditional Multiplicity 0-0 or 1-* Field type string Purpose To indicate the universities that have licensed access to a record Entry Guidelines The value for this field should be one of the names for each institution that have been coded in the GeoBlacklight application. Commentary This populates a facet on the search page so that users can filter to only databases that they are able log into based upon their institutional affiliation. Controlled Vocabulary yes Example value University of Wisconsin-Madison Element Set B1G/ Dublin Core"},{"location":"b1g-custom-elements/#access","title":"Access","text":"Label Access URI b1g_access_s
Profile ID b1g-09 Obligation Conditional Multiplicity 0-0 or 1-1 Field type string JSON Purpose To supply the links for restricted records Entry Guidelines The field value is an array of key/value pairs, with keys representing an institution code and values the URL for the library catalog record. See the Access Template for entry. Commentary This field is challenging to construct manually, as it is a JSON string of institutional codes and URLs. The codes are used instead of the actual names in order to make the length of the field more manageable and to avoid spaces. Controlled Vocabulary no Example value {\\\"03\\\":\\\"https://purl.lib.uiowa.edu/PolicyMap\\\",\\\"04\\\":\\\"https://www.lib.umd.edu/dbfinder/id/UMD09180\\\",\\\"05\\\":\\\"https://primo.lib.umn.edu/permalink/f/1q7ssba/UMN_ALMA51581932400001701\\\",\\\"06\\\":\\\"http://catalog.lib.msu.edu/record=b10238077~S39a\\\",\\\"07\\\":\\\"https://search.lib.umich.edu/databases/record/39117\\\",\\\"09\\\":\\\"https://libraries.psu.edu/databases/psu01762\\\",\\\"10\\\":\\\"https://digital.library.wisc.edu/1711.web/policymap\\\",\\\"11\\\":\\\"https://library.ohio-state.edu/record=b7869979~S7\\\"} Element Set B1G"},{"location":"b1g-custom-elements/#image","title":"Image","text":"Label Image URI b1g_image_ss
Profile ID b1g-10 Obligation Optional Multiplicity 0-0 or 0-1 Field type stored string (URL) Purpose To show a thumbnail on the search results page Entry Guidelines Enter an image file using a secure link (https). Acceptable file types are JPEG or PNG Commentary This link is used to harvest an image into the Geoportal server for storage and display. Once it has been harvested, it will remain in storage, even if the original link to the image stops working. Controlled Vocabulary no Example value https://gis.allencountyohio.com/GIS/Image/countyseal.jpg Element Set B1G"},{"location":"b1g-custom-elements/#geonames","title":"GeoNames","text":"Label GeoNames URI b1g_geonames_sm
Profile ID b1g-11 Obligation Optional Multiplicity 0-* Field type stored string (URI) Purpose To indicate a URI for a place name from the GeoNames database Entry Guidelines Enter a value in the format \"http://sws.geonames.org/URI
\" Commentary This URI provides a linked data value for one or more place names. It is optional as there is currently no functionality tied to it in the GeoBlacklight application Controlled Vocabulary yes Example value https://sws.geonames.org/2988507 Element Set B1G"},{"location":"b1g-custom-elements/#publication-state","title":"Publication State","text":"Label Publication State URI b1g_publication_state_s
Profile ID b1g-12 Obligation Required Multiplicity 1-1 Field type string Purpose To communicate to Solr if the item is public or hidden Entry Guidelines Use the dropdown or batch editing functions to change the state Commentary When items are first added to GBL Admin, they are set as Draft by default. When they are ready to be published, they can be manually changed to Published. If the record is retired or needs to be hidden, it can be changed to Unpublished Controlled Vocabulary yes Example value Draft Element Set B1G"},{"location":"b1g-custom-elements/#language-string","title":"Language string","text":"Label Language string URI b1g_language_sm
Profile ID b1g-13 Obligation Required Multiplicity 1-* Field type string Purpose To display the spelled out string (in English) of a language code to users Entry Guidelines This field is automatically generated from the Language field in the main form Commentary The OGM schema specifies using a 3-digit code to indicate language. In order to display this to users, it needs to be translated into a human-readable string. Controlled Vocabulary yes Example value French Element Set B1G"},{"location":"b1g-custom-elements/#creator-id","title":"Creator ID","text":"Label Creator ID URI b1g_creatorID_sm
Profile ID b1g-14 Obligation Optional Multiplicity 0-* Field type string Purpose To track the URI of a creator value Entry Guidelines This field is entered as a URI representing an authority record Commentary These best practices recommend consulting one or two name registries when deciding how to standardize names of creators: the Faceted Application of Subject Terminology (FAST) or the Library of Congress Name Authority File (LCNAF). FAST is a controlled vocabulary based on the Library of Congress Subject Headings (LCSH) that is well-suited to the faceted navigation of the Geoportal. The LCNAF is an authoritative list of names, events, geographic locations and organizations used by libraries and other organizations to collocate authorized creator names to make searching and browsing easier. Controlled Vocabulary yes Example value fst02013467 Element Set B1G"},{"location":"ckan/","title":"Overview of CKAN Data Portals and its APIs","text":"\"CKAN is a tool for making open data websites\". (https://docs.ckan.org/en/2.10/user-guide.html#what-is-ckan) CKAN is often utilized by governments and organizations and is an open-source alternative to platforms like ArcGIS Hubs.
"},{"location":"ckan/#content-organization","title":"Content Organization","text":"The content organization model of a CKAN site uses the term Datasets for each item record. A Dataset may have multiple Resources, such as downloadable files, thumbnails, supplemental metadata files, and external links. This model can give data providers flexibility on how they organize their files, but can be challenging for harvesting into the BTAA Geoportal.
Unlike CKAN, GeoBlacklight was designed to have only one data file per record, so it can be challenging to programmatically sort through all of the possible access points for a Dataset and attach them to a single record in GeoBlacklight. To mitigate this, we use the multiple downloads option when possible.
"},{"location":"ckan/#metadata","title":"Metadata","text":"CKAN metadata contains several basic fields (documented at https://ckan.org/features/metadata) along with an \"extras\" group that can be customized by site. Some portals have many custom fields in \"extras\" and some do not use them at all.
"},{"location":"ckan/#api","title":"API","text":"CKAN offers several types of APIs for sharing metadata. The most useful one for the BTAA Geoportal is the package_search
, which can be accessed by appending \"api/3/action/package_search\" to a base URL.
Example
https://demo.ckan.org/api/3/action/package_search
"},{"location":"codeNamingSchema/","title":"Code Naming Schema","text":"Each website / collection in the BTAA Geoportal has an alphanumeric code. This code is also added to each metadata record to facilitate administrative tasks and for grouping items by their source. Some of the Codes are randomly generated strings, but most are constructed with an administrative schema described below:
First part of string Contributing institution 01 Indiana University 02 University of Illinois Urbana-Champaign 03 University of Iowa 04 University of Maryland 05 University of Minnesota 06 Michigan State University 07 University of Michigan 08 Pennsylvania State University 09 Purdue University 10 University of Wisconsin-Madison 11 The Ohio State University 12 University of Chicago 13 University of Nebraska-Lincoln 14 Rutgers University Second part of string Type of organization hosting the datasets a State b County c Municipality d University f Other (ex. NGOs, Regional Groups, Collaborations) g Federal Third part of string The sequence number added in order of accession or a county FIPS code -01 First collection added from same institution and type of organization -02 Second collection added from same institution and type of organization -55079 County FIPS code for Milwaukee County, Wisconsin Example
code for a collection sourced from Milwaukee County: '10b-55079'
"},{"location":"collection-development-policy/","title":"BTAA Geoportal Collection Development Policy","text":" Authors: BTAA Collection Development & Education Outreach Committee
"},{"location":"collection-development-policy/#purpose","title":"Purpose","text":"The BTAA Geospatial Information Network is a collaborative project to enhance discoverability, facilitate access, and connect scholars across the Big Ten Academic Alliance (BTAA) to scanned maps, geospatial data, and aerial imagery resources. The project\u2019s main output is the BTAA Geoportal, which serves as a platform through which participating libraries can share materials from their collections to make them more easily discoverable and accessible to varied user communities. Collections within the Geoportal primarily support the research, teaching, learning, and information needs of faculty, staff, and students at participating institutions and beyond.
The project supports the creation and aggregation of discovery-focused metadata describing geospatial resources from participating institutions and public sources across the Big Ten region and makes those resources discoverable via an open source portal. For more information and a list of participating BTAA institutions, please visit our project site.
"},{"location":"collection-development-policy/#summary-of-collection-scope","title":"Summary of Collection Scope","text":"Access to the BTAA Geoportal is open to all. This collection consists of geospatial resources relevant to all disciplines. Access to resources is curated based on their authoritativeness, currency, comprehensiveness, ease of use, and relevancy. Materials included are generally publicly available geospatial datasets (vector/raster), scanned maps (georeferenced or with bounding box coordinates), and aerial imagery. Scanned maps protected by copyright are not included in the Geoportal. Access to licensed resources may be restricted to users affiliated with a participating institution.
- Geographic areas: Items in the collection vary in scale based on subject and range from global to local. Geographic areas vary based on subject and may refer to biomes/ecosystems, political boundaries, cultural boundaries, economic boundaries, or land use types. In addition to a geographic focus on the Big Ten region (i.e., the states where participating institutions are located), the collection will emphasize resources and topics relevant to faculty and student research interests and reflect the strengths of participating library collections.
- Time periods: All time periods are collected, with an ability to accommodate both current and historical versions of datasets.
- Format: The collection consists of geospatial datasets, georeferenced maps, scanned maps with bounding box coordinates, and aerial imagery. Records for web mapping applications may also be included, with priority given to applications with datasets that are also accessible for download through the Geoportal. Preference is given to open and open source formats, but other formats are accepted as required to facilitate ease of use. When possible, resources are presented in formats that allow for download capabilities. Additional software may be needed to view datasets after download.
- Language(s): The collection primarily consists of English language content. Some non-English language content may be available for certain regions, reflecting the collection strengths and research/curricular interests of participating institutions.
- Diversity: The Geoportal and its participating institutions aspire to collect and provide access to geospatial resources that represent diverse perspectives, abilities, and experience levels. We will strive to apply best practices for diverse collection development as they relate to geospatial resources, including but not limited to:
- considering resources from small, independent, and local producers
- seeking content created by and representative of marginalized and underrepresented groups.
- Preservation and life cycle: Digital file preservation for discovery metadata is managed by BTAA Geoportal staff. Digital file preservation for resources is the responsibility of the content provider. Resources may cease to be accessible through the Geoportal if access from the original provider is no longer available.
"},{"location":"collection-development-policy/#statement-of-communication","title":"Statement of Communication","text":"The members of the Geoportal project team will continue to communicate with the creators of other geoportals (GeoBlacklight Community, etc.), with data providers in our respective regions, and across Big Ten institutions to build a comprehensive and robust collection.
Implementation and Revision Schedule: This policy will be reviewed annually by the Collection Development & Education Outreach Committee and is considered effective on the date indicated below. It will be revised as needed to reflect new collection needs and identify new areas of study as well as those areas that may be excluded.
Updated: April 27, 2022
"},{"location":"curation-priorities/","title":"Curation Priorities","text":" Authors: BTAA Collection Development & Education Outreach Committee; Product Manager
There are three distinct but related aspects of prioritizing the addition of new collections: content/theme, administration, and technology.
These priorities will affect how quickly the items are processed or where they fall in line within our processing queue.
"},{"location":"curation-priorities/#contenttheme","title":"Content/Theme","text":"When it comes to scanned maps, prioritization based on content or theme is primarily a local effort. However, there are opportunities for internal collaborations, including with Special Collections librarians or other local digital collections initiatives. These collaborations can allow for unique and distinctive maps to be harvested into the geoportal across our universities.
For geospatial data, datasets created in association with research projects at our institutions may be a high priority based on content or theme. Additionally, resources that provide access to foundational datasets, such as administrative boundaries, parcels, road networks, address points, and land use, should also be considered.
Regardless of the content type, special consideration should be given to highly relevant content, especially to current events. For example, in April 2020, a call went out to all task force members to identify and submit content related to COVID-19 for harvesting into the geoportal. Content that aligns with other ongoing BTAA-GIN program efforts, such as the Diverse Collections Working Group, will also be a higher priority as these efforts are further developed.
"},{"location":"curation-priorities/#administration","title":"Administration","text":"Collections may be prioritized based on the organization responsible for creating and maintaining content, which impacts the types of maps or datasets available to be harvested, spatial and temporal coverage, and stability. Based on these considerations, current priorities in terms of administration are:
-
University libraries and archives
- Links to these resources are likely to be stable
- Resources will likely be documented with a metadata standard
- Represent our core audience
-
States and counties
- Produce most foundational geospatial datasets (e.g., roads and parcels) and are currently our largest source of geospatial data
- Technology and open data policies vary widely resulting in patchwork coverage
-
Regional organizations and research institutes
- Often special organizations with funding to create geospatial data across political boundaries
- Higher risk of harvesting duplicate datasets, as these organizations sometimes aggregate records from cities, counties, or state agencies
-
Cities
- less likely to produce and share data in geospatial formats and more likely to share tabular data
- prioritized cities: major metropolitan areas and the locations of our university campuses
"},{"location":"curation-priorities/#technology","title":"Technology","text":"The source's hosting platform influences the ease of harvesting, the quality of the metadata, and the stability of the access links. Based on these considerations, current priorities in terms of technology are:
-
Published via known portal or digital library platforms, including:
- Blacklight/GeoBlacklight
- Islandora
- Preservica
- ArcGIS Hubs
- Socrata
- CKAN
- Sites with OAI-PMH enabled APIs
-
Custom portals
- each portal requires a customized script for HTML web parsing
- writing and maintaining custom scripts takes extra time
-
Static webpages with download links
- at a minimum, a title is required for each item
- static sites with nested links take a long time to process and may require an extensive amount of manual work
-
Database websites
- require the user to perform interactive queries to extract data
- not realistic to make Geoportal records for individual datasets
- usually results in a single \"website\" record in the Geoportal to represent the database
"},{"location":"dcat/","title":"DCAT Metadata","text":""},{"location":"dcat/#overview","title":"Overview","text":"DCAT (Data Catalog Vocabulary) is metadata schema for web-based data catalogs. It is intended to facilitate interoperability and many data platforms offer a DCAT API for metadata sharing.
The most up-to-date documentation of the schema can be found here: https://www.w3.org/TR/vocab-dcat-3/
Documentation that is older, but still in use for United States portals can be found here: https://resources.data.gov/resources/dcat-us/
"},{"location":"dcat/#json-structure","title":"JSON Structure","text":"Many of the data platforms in the United States use a DCAT profile documented as \"Project Open Data Catalog\". The following JSON template shows the generic structure of a DCAT JSON document:
{\n \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n \"id\": \"https://project-open-data.cio.gov/v1.1/schema/catalog.json#\",\n \"title\": \"Project Open Data Catalog\",\n \"description\": \"Validates an entire collection of Project Open Data metadata JSON objects. Agencies produce said collections in the form of Data.json files.\",\n \"type\": \"object\",\n \"dependencies\": {\n \"@type\": [\n \"@context\"\n ]\n },\n \"required\": [\n \"conformsTo\",\n \"dataset\"\n ],\n \"properties\": {\n \"@context\": {\n \"title\": \"Metadata Context\",\n \"description\": \"URL or JSON object for the JSON-LD Context that defines the schema used\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"@id\": {\n \"title\": \"Metadata Catalog ID\",\n \"description\": \"IRI for the JSON-LD Node Identifier of the Catalog. This should be the URL of the data.json file itself.\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"@type\": {\n \"title\": \"Metadata Context\",\n \"description\": \"IRI for the JSON-LD data type. This should be dcat:Catalog for the Catalog\",\n \"enum\": [\n \"dcat:Catalog\"\n ]\n },\n \"conformsTo\": {\n \"description\": \"Version of Schema\",\n \"title\": \"Version of Schema\",\n \"enum\": [\n \"https://project-open-data.cio.gov/v1.1/schema\"\n ]\n },\n \"describedBy\": {\n \"description\": \"URL for the JSON Schema file that defines the schema used\",\n \"title\": \"Data Dictionary\",\n \"type\": \"string\",\n \"format\": \"uri\"\n },\n \"dataset\": {\n \"type\": \"array\",\n \"items\": {\n \"$ref\": \"dataset.json\",\n \"minItems\": 1,\n \"uniqueItems\": true\n }\n }\n }\n}\n
"},{"location":"dcat/#how-to-find-the-dcat-api","title":"How to find the DCAT API","text":"Most sites, including Socrata:
To find a data API, a good place to start is to try appending the string \"/data.json\" to the base URL. If available, your browser will display the data catalog as a JSON file.
ArcGIS Hubs:
- Version 1: append the string \"/api/feed/dcat-us/1.1.json\". Esri made this change was made in 2022 to differentiate the older DCAT version from 2.0. Our harvest recipe current uses this version.
- Version 2: use the string \"api/feed/dcat-ap/2.0.1.json\". We plan to evaluate the newer format and will consider migrating our recipe in 2024.
"},{"location":"editingTemplate/","title":"Editing Template","text":"The GeoBTAA Metadata Template (https://z.umn.edu/b1g-template) is a set of spreadsheets that are formatted for our metadata editor, GBL Admin.
Users will need to make a copy of the spreadsheet to use for editing. In some cases, the Metadata Coordinator can provide a customized version of the sheets for specific collections.
The Template contains the following tabs:
- Map Template
- Dataset Template
- Website Record Template
- Values: All of the controlled vocabulary values for the associated fields.
- Access Links and Multiple Downloads: Fields for adding secondary tables.
Note
The input format for some fields in this template may differ from how the field is documented in OpenGeoMetadata. These differences are intended to make it easier to enter values, which will be transformed when we upload the record to GBL Admin.
-
Bounding Box coordinates should be entered as W,S,E,N
. The coordinates are automatically transformed to a different order ENVELOPE(W,E,N,S)
. Read more under the Local Input Guidelines.
-
Date Range should be entered as yyyy-yyyy
. This is automatically transformed to [yyyy TO yyyy].
-
External links are added separately under column headers for the type of link. These are combined into the dct_references_s
field as a string of key:value pairs.
"},{"location":"ephemeral-data/","title":"The challenge of ephemeral data","text":"summary
Many of the resources in the BTAA Geoportal are from sites that continually update their datasets. As a result, we need to regularly re-harvest the metadata.
Government agencies now issue most geospatial information as digital data instead of as physical maps. However, many academic libraries have not yet expanded their collection scopes to include publicly available digital data, and are therefore no longer systematically capturing and storing the changing geospatial landscape for future researchers.
The BTAA Geoportal partially fills a gap in the geospatial data ecosystem by cataloging metadata records for current publicly available state, county, and municipal geospatial resources. The value of this data is high, as researchers routinely use it to form the base layers for web maps or geographic analysis. However, the the mandates and policies for providing this data varies considerably from state to state and from county to county. The lack of consistent policies at this level of government means that this data can be considered ephemeral, as providers regularly migrate, update, delete, and re-publish data without saving previous versions and without notification to the public.
The lack of standard policies at this level of government means that this data can be considered ephemeral. It may be updated, removed, or replaced without notification to the public. The rate at which datasets change or disappear is variable, but is often high.
This continual turnover creates a difficult environment for researchers to properly source data and replicate results. It also requires a great deal of dedicated labor to maintain the correct access links in the geoportal. As the geoportal\u2019s collection grows, the labor required to maintain it grows as well.
"},{"location":"geobtaa-metadata-application-profile/","title":"GeoBTAA Metadata Profile","text":"The GeoBTAA Metadata Application Profile consists of the following components:
"},{"location":"geobtaa-metadata-application-profile/#1-opengeometadata-elements","title":"1. OpenGeoMetadata Elements","text":" - The BTAA Geoportal uses the OpenGeoMetadata Schema for each resource.
- The current version of OpenGeoMetadata is called 'Aardvark'.
- This lightweight schema was designed specifically for the GeoBlacklight application and is geared towards discoverability.
- The GeoBTAA Metadata Profile aligns with all of the guidelines and recommendations in the official OpenGeoMetadata documentation.
- The schema is documented on the OpenGeoMetadata website .
"},{"location":"geobtaa-metadata-application-profile/#2-custom-elements","title":"2. Custom Elements","text":" - The GeoBTAA profile includes custom fields for lifecycle tracking and administration
- These elements are generally added to the record by admin staff. When they appear on editing templates, they are grayed out.
- They all start with the namespace
b1g
- See the Custom Elements page for more detail
"},{"location":"geobtaa-metadata-application-profile/#3-geobtaa-input-guidelines","title":"3. GeoBTAA Input Guidelines","text":" -
For the content in some fields, the GeoBTAA profile has specific guidelines that extends or differs from what is documented in the OpenGeoMetadata schema.
-
See the GeoBTAA Input Guidelines page for more detail
Info
The GeoBTAA Metadata Template can be found at https://z.umn.edu/b1g-template
"},{"location":"glossary/","title":"Glossary","text":""},{"location":"glossary/#python-and-scripting","title":"Python and scripting","text":""},{"location":"glossary/#apis","title":"APIs","text":"An API (Application Programming Interface) is a set of rules, protocols, and tools for building software applications. It specifies how different software components should interact with each other, allowing them to communicate and exchange information.
In the context of Python, APIs are often used to retrieve data from a web server or to interact with an external service. For example, the requests library is a popular Python package that simplifies making HTTP requests to APIs, while the json module provides an easy way to parse and encode JSON data.
"},{"location":"glossary/#beautiful-soup","title":"Beautiful Soup","text":"HTML and XML parser
"},{"location":"glossary/#conda","title":"Conda","text":" - Conda is an open source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
"},{"location":"glossary/#conda-package-manager","title":"Conda Package Manager","text":" - Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
For more information on Conda and environments, refer to this website: https://docs.conda.io/projects/conda/en/stable/user-guide/index.html
"},{"location":"glossary/#pandas","title":"Pandas","text":"Pandas is a Python library that contains many functions for analyzing data. For the GeoBTAA workflows, we are most interested in how it eases transformations between JSON and CSV files:
CSV files: Pandas can easily read and write CSV files using its read_csv()
and to_csv()
methods, respectively. These methods can handle many CSV formats, including different delimiter characters, header options, and data types. Once the CSV data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using Pandas' powerful data manipulation tools, such as filtering, grouping, and aggregation.
JSON data: Pandas can also read and write JSON data using its read_json()
and to_json()
methods. These methods can handle various JSON formats, such as normal JSON objects, JSON arrays, and JSON lines. Once the JSON data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using the same data manipulation tools used for CSV data.
pandas DataFrame A DataFrame is similar to a Python list or dictionary, but it has rows and columns, similar to a spreadsheet. This makes it a simpler task to convert between JSON and CSV. To review these Python terms, refer to the glossary.
"},{"location":"glossary/#pandas-dataframe","title":"Pandas DataFrame","text":"Pandas DataFrame is a 2-dimensional table-like data structure that is used for data manipulation and analysis. It is a powerful tool for handling and processing structured data. A DataFrame has rows and columns, similar to a spreadsheet. It can contain heterogeneous data types and can be indexed and sliced in various ways. It is part of the Pandas library and provides powerful features for data analysis and manipulation. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas-dataframe
"},{"location":"glossary/#python-list","title":"Python List","text":"A list is a basic data structure in Python that is used to store a collection of items of different data types. It is an ordered collection of elements, and each element is indexed by an integer starting from 0. A list can contain elements of different data types, including other lists and dictionaries. A list is mutable, meaning its elements can be added, removed, or modified. It is a simple, general-purpose data structure that is commonly used for storing and manipulating small to medium-sized data sets.
"},{"location":"glossary/#python-dictionary","title":"Python Dictionary","text":"A dictionary is another data structure in Python that is used to store data in the form of key-value pairs. It is an unordered collection of elements, where each element is identified by a unique key instead of an index. The keys can be of any hashable data type, and the values can be of any data type. A dictionary is also mutable, meaning its elements can be added, removed, or modified. It is commonly used to store and manipulate structured data, such as user profiles or configuration settings.
"},{"location":"glossary/#python-objects","title":"Python Object(s)","text":"In Python, everything is an object. An object is an instance of a class, which is a blueprint for creating objects. It contains data and functions (also called methods) that operate on that data. Objects are created dynamically, which means that you don't have to declare the type of a variable or allocate memory for it explicitly. When you assign a value to a variable, Python creates an object of the appropriate type and associates it with that variable.
For example, an integer in Python is an object of the int class, and a string is an object of the str class. Each object of a class has its own set of data attributes, which store the values of its properties, and methods, which operate on those properties.
"},{"location":"glossary/#python-interface","title":"Python Interface","text":"In Python, an interface refers to the set of methods that a class or an object exposes to the outside world. It defines the way in which an object can be interacted with, and the methods that are available to be called. An interface can be thought of as a contract that specifies how a class can be used, and what methods are available to a programmer when working with that class.
Python is an object-oriented programming language, and as such, it supports the concept of an interface. Python does not have a specific language construct for creating an interface. Instead, interfaces are implemented using a combination of abstract base classes and duck typing.
An abstract base class is a class that cannot be instantiated directly, but instead, is intended to be subclassed. It defines a set of abstract methods that must be implemented by any subclass. This allows for the creation of a common interface that can be shared among multiple classes.
Duck typing is a concept in Python that allows for the determination of an object's type based on its behavior, rather than its actual type. This means that if an object behaves like a certain type, it is considered to be of that type. This allows for more flexibility in programming, as it allows for the creation of classes that can be used interchangeably, as long as they implement the same methods.
"},{"location":"glossary/#spacy","title":"SpaCy","text":"SpaCy is a python module that uses natural language processing (NLP) to comb through text help us find and extract patterns. To extract place names, it uses named entity recognition (NER) by searching a list of place names. This list is called \"GPE\", which stands for Geopolitical Entity.
This article in the Code4Lib journal, From Text to Map: Combing Named Entity Recognition and Geographic Information Systems, explains the process we use to extract place names.
"},{"location":"glossary/#data-portals","title":"Data Portals","text":""},{"location":"glossary/#arcgis-hubs","title":"ArcGIS Hubs","text":"ArcGIS Hub is a data portal platform that allows users to share geospatial data as web services. It is especially popular with local governments that already use the Esri ArcGIS ecosystem.
"},{"location":"glossary/#ckan","title":"CKAN","text":"CKAN is a tool for making open data websites, it helps you manage and publish collections of data. For CKAN purposes, data is published in units called \u201cdatasets\u201d.
- A dataset contains two things:
Information or \u201cmetadata\u201d about the data. For example, the title and publisher, date, what formats it is available in, what license it is released under, etc.
A number of \u201cresources\u201d, which hold the data itself. CKAN does not mind what format the data is in. A resource can be a CSV or Excel spreadsheet, XML file, PDF document, image file, linked data in RDF format, etc. CKAN can store the resource internally, or store it simply as a link, the resource itself being elsewhere on the web. A dataset can contain any number of resources. For example, different resources might contain the data for different years, or they might contain the same data in different formats.
"},{"location":"glossary/#socrata","title":"Socrata","text":""},{"location":"glossary/#geoblacklight","title":"GeoBlacklight","text":""},{"location":"glossary/#metadata-standards","title":"Metadata Standards","text":"This chart provides links to documentation about the common metadata standards and schemas we encounter when harvesting.
{{ read_csv('../tables/metadataStandards.csv') }}
- Dublin Core: A set of metadata elements that are used to describe resources in a simple and standardized way. Dublin Core is widely used in library systems, archives, and other digital repositories.
- MODS (Metadata Object Description Schema): A flexible and extensible XML schema for describing a wide range of resources, including books, articles, and other types of digital content.
- METS (Metadata Encoding and Transmission Standard): A standard for encoding descriptive, administrative, and structural metadata for digital objects.
- MARC (Machine-Readable Cataloging): A metadata format used by libraries to describe bibliographic information about books, journals, and other materials.
"},{"location":"input-guidelines/","title":"Local Input Guidelines","text":"For the following elements, the GeoBTAA Metadata Profile has input guidelines beyond what is documented in the OpenGeoMetadata schema:
"},{"location":"input-guidelines/#title","title":"Title","text":"Maps: The title for scanned maps is generally left as it was originally cataloged by a participating library. MARC subfields are omitted and can be inserted in the Description field.
Datasets: Harvested datasets often are lacking descriptive titles and may need to be augmented with place names. Dates may also be added to the end, but if the dataset is subject to updates, the data should be left off. Acronyms should be spelled out. The preferred format for dataset titles is: Theme [place] {date}
. This punctuation allows for batch processing and splitting title elements.
"},{"location":"input-guidelines/#language","title":"Language","text":"Although Language is optional in the OGM schema, a three-digit code is required for the BTAA Geoportal.
"},{"location":"input-guidelines/#creator","title":"Creator","text":"When possible, Creators should be drawn from a value in the Faceted Application of Subject Terminology (FAST).
"},{"location":"input-guidelines/#creator-id","title":"Creator ID","text":"If the Creator value is from a name authority, insert the ID in this field.
"},{"location":"input-guidelines/#publisher","title":"Publisher","text":"Maps: Publisher values for maps are pulled from the original catalog record. Remove subfields for place names and dates.
Datasets: The BTAA Geoportal does not use the Publisher field for Datasets.
"},{"location":"input-guidelines/#provider","title":"Provider","text":"This is the name of the organization hosting the resources. If the organization is part of the BTAA library network, a university icon will display next to the resource's title. However, most Providers will not have an icon.
"},{"location":"input-guidelines/#spatial-coverage","title":"Spatial Coverage","text":"This should be in the format used by the Faceted Application of Subject Terminology (FAST).
For US counties and cities, the format should be state--county
or state--city
. The state itself should also be included. Examples:
Example
-
Wisconsin--Dane County
-
Wisconsin--Madison
-
Wisconsin
"},{"location":"input-guidelines/#bounding-box","title":"Bounding Box","text":"On the Metadata Editing Template, provide Bounding Boxes in this format: W,S,E,N This order matches the DCAT API and is how the Klokan Bounding Box provides coordinates with their \"CSV\" setting.
This format will be programmatically converted to other formats when it is published to the Geoportal:
-
The OpenGeoMetadata Bounding Box field (dcat_bbox_s
) uses this order: ENVELOPE(W,E,N,S)
-
The OpenGeoMetadata Geometry field (locn_geometry
) uses a WKT format and the coordinate order will be converted to this layout: POLYGON((W N, E N, E S, W S, W N))
-
The OpenGeoMetadata Centroid field (dcat_centroid
) will be calculated to display longitude,latitude.
Example
Metadata CSV: -120,10,-80,35
converts to
dcat_bbox_s:
ENVELOPE(-120,-80,35,10)
locn_geometry:
POLYGON((-120 35, -80 35, -80 10, -120 10, -120 35))
dcat_centroid
: \"22.5,-100.0\"
"},{"location":"lifecycle/","title":"Lifecycle","text":"Deprecation
This lifecycle documentation had been replaced by a newer version at Resource Lifecycle.
"},{"location":"lifecycle/#1-submit-records","title":"1. Submit Records","text":"It is the role of the Team members to seek out new content for the geoportal. See the page How to Submit Resources to the BTAA Geoportal for more information.
"},{"location":"lifecycle/#2-metadata-transition","title":"2. Metadata Transition","text":"This stage involves batch processing of the records, including harvesting, transformations, crosswalking information. This stage is carried out by the Metadata Coordinator, who may contact Team members for assistance.
Regardless of the method used for acquiring the metadata, it is always transformed into a spreadsheet for editing. These spreadsheets are uploaded to GBL Admin Metadata Editor.
Because of the variety of platforms and standards, this process can take many forms. The Metadata Coordinator will contact Team members if they need to supply metadata directly.
"},{"location":"lifecycle/#3-edit-records","title":"3. Edit Records","text":"Once the metadata is in spreadsheet form, it is ready to be normalized and augmented. UMN Staff will add template information and use spreadsheet functions or scripts to programmatically complete the metadata records.
- The GeoBTAA Metadata Template is for creating GeoBlacklight metadata.
- Refer to the documentation for the OpenGeoMetadata, version Aardvark fields and the GeoBTAA Custom Elements for guidance on values and formats.
"},{"location":"lifecycle/#4-publish-records","title":"4. Publish Records","text":"Once the editing spreadsheets are completed, UMN Staff uploads the records to GBL Admin
, a metadata management tool. GBL Admin validates records and performs any needed field transformations. Once the records are satisfactory, they are published and available in the BTAA Geoportal.
Read more on the GBL Admin documentation page.
"},{"location":"lifecycle/#5-maintenance","title":"5. Maintenance","text":"General Maintenance
All project team members are encouraged to review the geoportal records assigned to their institutions periodically to check for issues. Use the feedback form at the top of each page in the geoportal to report errors or suggestions. This submission will include the URL of the last page you were on, and it will be sent to the Metadata Coordinator.
Broken Links
The geoportal will be programmatically checked for broken links on a monthly basis. Systematic errors will be fixed by UMN Staff. Some records from this report may be referred back to Team Members for investigating broken links.
Subsequent Accessions
- Portals that utilize the DCAT metadata standard will be re-accessioned monthly.
- Other GIS data portals will be periodically re-accessioned by the Metadata Coordinator at least once per year.
- Team members may be asked to review this work and provide input on decisions for problematic records.
Retired Records
When an external resource has been moved, deleted, or versioned to a new access link, the original record is retired from the BTAA Geoportal. This is done by converting the Publication State of the record from 'Published' to 'Unpublished'. The record is not deleted from the database and can still be accessed via a direct link. However, it will not show up in any search queries.
"},{"location":"model/","title":"Content Organization Model","text":"GeoBlacklight organizes records with a network model rather than with a hierarchical model. It is a flat system whereby every database entry is a \"Layer\" and uses the same metadata fields. Unlike many digital library applications, it does not have different types of records for entities such as \"communities,\" \"collections,\" or \"groups.\" As a result, it does not present a breadcrumb navigation structure, and all records appear in the same catalog directory with the URL of https:geo.btaa.org/catalog/ID
.
Instead of a hierarchy, GeoBlacklight relates records via metadata fields. These fields include Member Of
, Is Part Of
, Is Version Of
, Source
, and a general Relation
. This flexibility allows records to be presented in several different ways. For example, records can have multiple parent/child/grandchild/sibling relationships. In addition, they can be nested (i.e., a collection can belong to another collection). They can also connect data layers about similar topics or represent different years in a series.
The following diagram illustrates how the BTAA Geoportal organizes records. The connecting arrow lines indicate the name of the relationship. The labels reflect each record's Resource Class (Collections, Websites, Datasets, Maps, Web services).
"},{"location":"purpose/","title":"About the Geoportal","text":"summary
The BTAA Geoportal is a catalog that makes it easier to find geospatial resources.
Geospatial data and tools are increasingly important in academic research and education. With the growing number of GIS datasets and digitized historical maps available, it can be challenging to locate the right resources since they are scattered across various platforms and not always tagged with the necessary metadata for easy discovery. To address this issue, the Big Ten Academic Alliance Geospatial Data Project launched in 2015 to connect scholars with geospatial resources.
One of the primary outputs of the project is the BTAA Geoportal, a comprehensive index of tens of thousands of geospatial resources from hundreds of sources. The Geoportal enables users to search by map, keyword, and category, providing access to scanned maps, digital GIS data, aerial imagery, and interactive mapping applications. All of the resources available through the Geoportal are free and open, and the scanned maps cover countries around the world. Most of the data in the catalog is sourced from local governments, such as states, counties, and cities.
The Geoportal is a useful tool because finding local geospatial data through a simple Google search can be difficult due to the lack of visibility of these datasets. The problem is that users need to know which agency is responsible for creating and distributing the data they are looking for and visit that agency's website to access the datasets. For instance, if you are researching a particular neighborhood in a city and need data on the roads, parks, parcels, and city council ward boundaries, you might need to check different state agencies, the city or the county website. But with the Geoportal, you can easily search by What, Where, and When without worrying about the Who or Why.
Overall, the BTAA Geoportal provides an easy and comprehensive way for scholars to find and access geospatial resources from various sources, enabling them to focus on their research rather than the time-consuming task of finding the right data.
"},{"location":"resource-lifecycle/","title":"Resource Lifecycle","text":"5 Stages of the Resource Lifecycle
flowchart LR\n\n I((1.<br> IDENTIFY)) --> H[/2. <br> HARVEST/] --> P[3. <br> EDIT] --> X[4. <br>INDEX] --> M{{5. <br>MAINTAIN}}--> H[/2. <br>HARVEST/]\n
"},{"location":"resource-lifecycle/#1-identify","title":"1. Identify","text":" BTAA-GIN Team Members and Product Manager
Team members seek out new content for the geoportal. See the page How to Submit Resources to the BTAA Geoportal for more information.
"},{"location":"resource-lifecycle/#2-harvest","title":"2. Harvest","text":" Graduate Research Assistants and Product Manager
This stage involves obtaining the metadata for resources. At a minimum, this will include a title and an access link. However, it will ideally also include descriptions, dates, authors, rights, keywords, and more.
Here are the most common ways that we obtain the metadata:
- a BTAA-GIN Team Member sends us the metadata values as individual documents or as a combined spreadsheet
- we are provided with (or are able to find) an API that will automatically generate the metadata in a structured file, such as JSON or XML
- we develop a customized script to scrape directly from the HTML on a source's website
- we manually copy and paste the metadata into a spreadsheet
- a combination of one or more of the above
This step also involves using a crosswalk to convert the metadata into the schema needed for the Geoportal. Our goal is to end up with a spreadsheet containing columns matching our metadata template.
Why do we rely on CSV?
CSV (Comma Separated Values) files organize tabular data in plain text format, where each row of data is separated by a line break, and each column of data is separated by a delimiter.
We have found this tabular format to be the most human-readable way to batch create, edit, and troubleshoot metadata records. We can visually scan large numbers of records at once and normalize the values in ways that would be difficult with native nested formats, like JSON or XML. Therefore, many of our workflow processes involve transforming things to and from CSV.
"},{"location":"resource-lifecycle/#3-edit","title":"3. Edit","text":" Graduate Research Assistants and Product Manager
When working with metadata, it is common to come across missing or corrupted values, which require troubleshooting and manual editing in our spreadsheets. Refer to the Collections Project Board for examples of this work.
After compiling the metadata, we run a validation and cleaning script to ensure the records conform to the required elements of our schema. Finally, we upload the completed spreadsheet to GBL Admin, which serves as the administrative interface for the Geoportal. If GBL Admin detects any formatting errors, it will issue a warning and may reject the upload.
"},{"location":"resource-lifecycle/#4-index","title":"4. Index","text":" Product Manager
Once the metadata is successfully uploaded to GBL Admin, we can publish the records to the Geoportal. The technology that actually stores the records and enables searching is called Solr. The action of adding records is known as \"Indexing.\"
Periodically, we need to remove records from the Geoportal. To do this, we use GBL Admin to either delete them or change their status to \"unpublished.\"
"},{"location":"resource-lifecycle/#5-maintain","title":"5. Maintain","text":" BTAA-GIN Team Members, Graduate Research Assistants, and Product Manager
The Geoportal is programmatically checked for broken links on a monthly basis. Broken links are fixed either by manually repairing them or by reharvesting from the source.
"},{"location":"resource-lifecycle/#sequence-diagram-of-resource-lifecycle","title":"Sequence diagram of Resource Lifecycle","text":"\n\n\n\n sequenceDiagram\n actor Team Member\n actor Product Manager\n participant GitHub\n actor Research Assistant\n participant GBL Admin\n participant Geoportal \n\n\n Note right of Team Member: IDENTIFY\n\n Team Member->>Product Manager: Submit Resources\n Product Manager->>GitHub: Create GitHub issue\n GitHub ->>Research Assistant: Assign issue\n Note left of Research Assistant: HARVEST\n Note left of Research Assistant: EDIT \n\n Research Assistant->>GBL Admin: Upload records\n Research Assistant ->>GitHub: Update GitHub issue\n Note right of GBL Admin: PUBLISH \n\n Product Manager->>GBL Admin: Publish records\n GBL Admin->>Geoportal: Send records online \n Product Manager->>GitHub: Close GitHub issue\n Product Manager ->> Team Member: Share link to published records\n\n Note left of Research Assistant: MAINTAIN \n
"},{"location":"resourceClasses/","title":"Resource Classes","text":""},{"location":"resourceClasses/#collections","title":"Collections","text":"The BTAA Geoportal interprets the Resource Class, Collections, as top-level, custom groupings. These reflect our curation activities and priorities.
Other records are linked to Collections using the Member Of
field. The ID of the parent record is added to the child record only. View all of the current Collections in the geoportal at this link: https://geo.btaa.org/?f%5Bgbl_resourceClass_sm%5D%5B%5D=Collections
"},{"location":"resourceClasses/#websites","title":"Websites","text":"The BTAA Geoportal uses the Resource Class, Websites, to create parent records for data portals, digital libraries, dashboards, and interactive maps. These often start off as standalone records. Once the items in a website have been indexed, they will have child records.
Individual Datasets, Maps, or Web services are linked to the Website they came from using the Is Part Of
field. The ID of the parent record is added to the child record only.
View all of the current Websites in the geoportal at this link: https://geo.btaa.org/?f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites
"},{"location":"resourceClasses/#datasets-maps-and-web-services","title":"Datasets, Maps, and Web services","text":"The items in this Resource Class represent individual data layers, scanned map files, and/or geospatial web services. (Some items may have multiple Resource Classes attached to the same record.)
This item class is likely to have the most relationships specified in the metadata. A typical Datasets record might have the following:
Member Of
a Collections record Is Part Of
a Websites record - If the data was digitized from a paper map in the geoportal, it can be linked to the Maps record via the
Source
relation - a general
Relation
to a research guide or similar dataset
"},{"location":"resourceClasses/#multipart-items","title":"Multipart Items","text":"Many items in the geoportal are multipart. There may be individual pages from an atlas, sublayers from a larger project, or datasets broken up into more than one download. In these cases, the Is Part Of
field is used.
As a result, these items may have multiple Is Part Of
 relationships: (1) the parent for the multipart items and (2) the original website.
"},{"location":"schedule/","title":"Harvesting Schedule","text":"Established April 2023
"},{"location":"schedule/#weekly","title":"Weekly","text":" - ArcGIS Hubs
"},{"location":"schedule/#monthly","title":"Monthly","text":" - CKAN sites
- Minnesota Geospatial Commons
"},{"location":"schedule/#quarterly","title":"Quarterly","text":" - PASDA
- OpenGeoMetadata
- Illinois Geospatial Data Clearinghouse
- Minnesota Natural Resource Atlas
- Socrata
"},{"location":"schedule/#annually","title":"Annually","text":" - Licensed Resources
- Custom HTML sites for public data
"},{"location":"schedule/#as-needed","title":"As Needed","text":" - Any website that reports errors during the automated monthly broken link scan.
- Any website when we receive notification that new records are available.
Info
See the GitHub Project Board, Collections, to track harvests.
"},{"location":"submit-resources/","title":"How to submit resources to the BTAA Geoportal","text":""},{"location":"submit-resources/#1-identify-resources","title":"1. Identify Resources","text":"Places to find public domain collections
- State GIS clearinghouses
- State agencies (especially DNRs and DOTs)
- County or city GIS departments
- Library digital collections
- Research institutes
- Nonprofit organizations
Review the Curation Priorites and the Collection Development Policy for guidelines on selecting resources.
"},{"location":"submit-resources/#optional-contact-the-organization","title":"Optional: Contact the organization","text":"Use this template to inform the organization that we plan to include their resources in our geoportal.
Tip
If metadata for the resources are not readily available, the organization may be able to send you an API, metadata documents, or a spreadsheet export.
"},{"location":"submit-resources/#2-investigate-metadata-harvesting-options","title":"2. Investigate metadata harvesting options","text":"Metadata records can be submitted directly or we can harvest it using parsing and transformation scripts.
Here are the most common methods of obtaining metadata for the BTAA Geoportal:
"},{"location":"submit-resources/#spreadsheets","title":"Spreadsheets","text":"This method is preferred, because the submitters can control which metadata values are exported and because format transformations by UMN Staff are not necessary. The GeoBTAA Metadata Template shows all of the fields needed for the Geoportal.
"},{"location":"submit-resources/#api-harvesting-or-html-parsing","title":"API Harvesting or HTML Parsing","text":"Most data portals have APIs or HTML structures that can be programmatically parsed to obtain metadata for each record.
DCAT enabled portals
ArcGIS Open Data Portals (HUB), Socrata portals, and some others share metadata in the DCAT standard.
CKAN / DKAN portals
This application uses a custom metadata schema for their API.
HTML Parsing
If a data portal or website does not have an API, we may be able to parse the HTML pages to obtain the metadata needed to create GeoBlacklight schema records.
OAI-PMH
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that can be used to harvest metadata records from enabled repositories. The records are usually available as a simple Dublin Core XML format. If the protocol is not set up to include extra fields, such as the map image's download link or bounding box, this method may not be sufficient on its own.
"},{"location":"submit-resources/#individual-metadata-files","title":"Individual Metadata files","text":"Geospatial metadata standards are expressed in the XML or JSON format, which can be parsed to extract metadata needed to create GeoBlacklight schema records. Common standards for geospatial resources include:
- ISO 19139
- FGDC
- ArcGIS 1.0
- MARC
- MODS
Tip
The best way to transfer MARC records is to send a single file containing multiple records in the .MRC or MARC XML format. The Metadata Coordinator will use MarcEdit or XML parsing to transform the records.
"},{"location":"submit-resources/#3-contact-the-btaa-gin-product-manager","title":"3. Contact the BTAA-GIN Product Manager","text":"Send an email, Slack message to the Product Manager / Metadata Coordinator.
Minimum information to include:
- Title and Description of the collection
- a link to the website
- (If known) information about how to harvest the metadata or construct access links.
The submission will be added to our collections processing queue.
Info
Metadata processing tasks are tracked on our public GitHub project dashboard.
"},{"location":"supplementalMetadata/","title":"Supplemental Metadata","text":"All other forms of metadata, such as ISO 19139, FGDC Content Standard for Digital Geospatial Metadata, attribute table definitions, or custom schemas are treated as Supplemental Metadata.
- Supplemental Metadata is not usually edited directly for inclusion in the project.
- If this metadata is available as XML or HTML, it can be added as a hosted link for the Metadata preview tool in GeoBlacklight.
- XML or HTML files can be parsed to extract metadata that forms the basis for the item\u2019s GeoBlacklight schema record.
- The file formats that can be viewed within the geoportal application include:
- ISO 19139 XML
- FGDC XML
- MODS XML
- HTML (any standard)
"},{"location":"tutorials/","title":"Tutorials","text":"These tutorials are short, easy to complete exercises to help someone get the basics of running and writing scripts to harvest metadata. They are available as Jupyter Notebooks hosted in GitHub in the Harvesting Guide repository.
"},{"location":"tutorials/#1-setting-up-your-environment","title":"1. Setting up your environment","text":" - These tutorials will guide users on how to set up your environment for harvesting.
- Getting Started with GitHub: Provides an introduction to GitHub and a walkthrough of creating a repository.
- Getting Started with Jupyter Notebooks: Provides an introduction and overview of cell types.
"},{"location":"tutorials/#2-navigating-paths","title":"2. Navigating paths","text":"This tutorial shows how to navigate and list directories. Techniques covered include:
- Using the
cd
command in the terminal - Navigating to and from the home directory
- Listing the current path
- Listing documents within the current path
"},{"location":"tutorials/#3-iterating-over-files","title":"3. Iterating over files","text":" - This guide will assist users in how to open, read, and print the results of a CSV.
- The module
os.Walk
will be introduced to read through multiple directories to find files. - The pandas module will be used to display a CSV for records.
"},{"location":"tutorials/#4-merge-csv-files-based-on-a-shared-column","title":"4. Merge CSV files based on a shared column","text":"This tutorial will take two CSV files and combine them using a shared key.
"},{"location":"tutorials/#5-transform-a-batch-of-json-files-into-a-single-csv-file","title":"5. Transform a batch of JSON files into a single CSV file","text":"This tutorial uses the Python module pandas (Python Data Analysis Library) to open a batch of JSON files and transform the contents into a single CSV.
"},{"location":"tutorials/#6-extract-place-names","title":"6. Extract Place Names","text":"This tutorial scans the two columns from a CSV file ('Title' and 'Description') to look for known place names and writes the values to a separate field.
"},{"location":"tutorials/#7-parsing-html-with-beautiful-soup","title":"7. Parsing HTML with Beautiful Soup","text":"This tutorial will guide users through Hyper Text Mark-Up Language (HTML) site parsing using the BeautifulSoup Python module, including:
- how to install the BeautifulSoup module
- scan and list web pages
- return titles, descriptions, and dates
- writing parsed results to CSV format
"},{"location":"tutorials/#8-use-openstreetmap-to-generate-bounding-boxes","title":"8. Use OpenStreetMap to generate bounding boxes","text":"This tutorial demonstrates how to query the OpenStreetMap API using place names to return bounding box coordinates.
Credits
These tutorials were prepared by Alexander Danielson and Karen Majewicz in April 2023.
"},{"location":"recipes/","title":"About","text":"These recipes are step-by-step how-to guides for harvesting and processing metadata from the sites that we harvest the most frequently.
The scripts needed for these recipes are all in the form of Jupyter Notebooks. To get started, download, fork, or clone the Harvesting Guide repository.
Warning
These recipes are not guaranteed to work! Since they rely on external websites, the scripts are necessarily works-in-progress. They need to be regularly updated and reconfigured in response to changes at the source website, python updates, and adjustments to our metadata schema.
"},{"location":"recipes/R-01_arcgis-hubs/","title":"ArcGIS","text":""},{"location":"recipes/R-01_arcgis-hubs/#purpose","title":"Purpose","text":"To scan the DCAT 1.1 API of ArcGIS Hubs and return the metadata for all suitable items as a CSV file in the GeoBTAA Metadata Application Profile.
This recipe includes steps that use the GBL Admin toolkit. Access to this tool is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB\n\nA{{STEP 1. <br>Download arcHubs.csv}}:::green --> B[[STEP 2. <br>Run Jupyter Notebook harvest script]]:::green;\nB --> C{Did the script run successfully?};\nC --> |No| D[Troubleshoot]:::yellow;\nD --> H{Did the script stall because of a Hub?};\nH --> |Yes| I[Refer to the page Update ArcGIS Hubs]:::yellow;\nH --> |No & I can't figure it out.| F[Refer issue back to Product Manager]:::red;\nH --> |No| J[Try updating your Python modules or investigating the error]:::yellow;\nJ --> B;\nI --> A;\nC --> |Yes| K[[STEP 3. Validate and Clean]]:::green; \nK --> E[STEP 4. <br>Publish/unpublish records in GBL Admin]:::green; \n\n\nclassDef green fill:#E0FFE0\nclassDef yellow fill:#FAFAD2\nclassDef red fill:#E6C7C2\n\n\nclassDef questionCell fill:#fff,stroke:#333,stroke-width:2px;\nclass C,H questionCell;\n
"},{"location":"recipes/R-01_arcgis-hubs/#step-1-download-the-list-of-active-arcgis-hubs","title":"Step 1: Download the list of active ArcGIS Hubs","text":"We maintain a list of active ArcGIS Hub sites in GBL Admin.
Shortcut
Pre-formatted GBL Admin query link
- Go to the Admin (https://geo.btaa.org/admin) dashboard
- Filter for items with these parameters:
- Resource Class: Websites
- Accrual Method: DCAT US 1.1
- Select all the results and click Export -> CSV
- Download the CSV and rename it
arcHubs.csv
Info
Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the \"Is Part Of\" field for each dataset.
- Title: The name of the Hub. This is transferred to the \"Provider\" field for each dataset
- Publisher: The place or administration associated with the portal. This is applied to the title in each dataset in brackets
- Spatial Coverage: A list of place names. These are transferred to the Spatial Coverage for each dataset
- Member Of: a larger collection level record. Most of the Hubs are either part of our Government Open Geospatial Data Collection or the Research Institutes Geospatial Data Collection
However, it is not necessary to take extra time and manually remove the extra fields, because the Jupyter Notebook code will ignore them.
"},{"location":"recipes/R-01_arcgis-hubs/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-01_arcgis-hubs.ipynb
- Move the downloaded file
arcHubs.csv
into the same directory as the Jupyter Notebook. - Run all cells.
Expand to read about the R-01_arcgis-hubs.ipynb Jupyter Notebook This code reads data from hubFile.csv
using the csv.DictReader
function. It then iterates over each row in the file and extracts values from specific columns to be used later in the script.
For each row, the script also defines default values for a set of metadata fields. It then checks if the URL provided in the CSV file exists and is a valid JSON response. If the response is not valid, the script prints an error message and continues to the next row. Otherwise, it extracts dataset identifiers from the JSON response and passes the response along with the identifiers to a function called metadataNewItems.
It also includes a function to drop duplicate rows. ArcGIS Hub administrators can include datasets from other Hubs in their own site. As a result, some datasets are duplicated in other Hubs. However, they always have the same Identifier, so we can use pandas to detect and remove duplicate rows.
"},{"location":"recipes/R-01_arcgis-hubs/#troubleshooting-as-needed","title":"Troubleshooting (as needed)","text":"The Hub sites are fairly unstable and it is likely that one or more of them will occasionally fail and interrupt the script.
- Visit the URL for the Hub to check and see if the site is down, moved, etc.
- Refer to the Update ArcGIS Hubs list page for more guidance on how to edit the website record.
- If a site is missing: Unpublish it from GBL Admin, indicate the Date Retired, and make a note in the Status field.
- If a site is still live, but the JSON API link is not working: remove the value \"DCAT US 1.1\" from the Accrual Method field and make a note in the Status field.
- If the site has moved to a new URL, update the website record with the new information.
- Start over from Step 1.
"},{"location":"recipes/R-01_arcgis-hubs/#step-3-validate-and-clean","title":"Step 3: Validate and Clean","text":"Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GBL Admin.
"},{"location":"recipes/R-01_arcgis-hubs/#step-4-upload-all-records","title":"Step 4: Upload all records","text":" - Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
- Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
- Use the old Date Accessioned value to search for the previous harvest date. This example uses 2023-03-07: (https://geo.btaa.org/admin/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=ArcGIS+Hub&q=%222023-03-07%22&rows=20&sort=score+desc)
- Unpublish the ones that have the old date in the Date Accessioned field
- Record this number in the GitHub issue for the scan under Number Deleted
- Look for records in the uploaded batch that are still \"Draft\" - these are new records.
- Publish them and record this number in the GitHub issue under Number Added
"},{"location":"recipes/R-02_socrata/","title":"Socrata","text":""},{"location":"recipes/R-02_socrata/#purpose","title":"Purpose","text":"To scan the DCAT API of Socrata Data Portals and return the metadata for all suitable items as a CSV file in the GeoBTAA Metadata Application Profile.
Note: This recipe is very similar to the ArcGIS Hubs Scanner.
This recipe includes steps that use the metadata toolkit GBL Admin. Access to GBL Admin is restricted to UMN BTAA-GIN staff and requires a login account. External users can create their own list or use one provided in this repository.
graph TB\n\nA{{STEP 1. <br>Download socrataPortals.csv}}:::green --> B[[STEP 2. <br>Run Jupyter Notebook harvest script]]:::green;\nB --> C{Did the script run successfully?}:::white;\nC --> |No| D[Troubleshoot]:::yellow;\nD --> H{Did the script stall because of a portal?}:::white;\nH --> |Yes| I[Remove or update the portal from the list]:::yellow;\nH --> |No & I can't figure it out.| F[Refer issue back to Product Manager]:::red;\nH --> |No| J[Try updating your Python modules or investigating the error]:::yellow;\nJ --> B;\nI --> A;\nC --> |Yes| K[[STEP 3. Validate and Clean]]:::green; \nK --> E[STEP 4. <br>Publish/unpublish records in GBL Admin]:::green; \n\nclassDef green fill:#E0FFE0\nclassDef yellow fill:#FAFAD2\nclassDef red fill:#E6C7C2\nclassDef white fill:#FFFFFF\n
"},{"location":"recipes/R-02_socrata/#step-download-the-list-of-active-socrata-data-portals","title":"Step: Download the list of active Socrata Data Portals","text":"We maintain a list of active Socrata Hub sites in GBL Admin.
Shortcut
Pre-formatted GBL Admin query link
- Go to the GBL Admin dashboard
- Use the Advanced Search to filter for items with these parameters:
- Format: \"Socrata data portal\"
- Select all the results and click Export -> CSV
- Download the CSV and rename it
socrataPortals.csv
Info
Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
- ID: Unique code assigned to each portal. This is transferred to the \"Is Part Of\" field for each dataset.
- Title: The name of the Hub. This is transferred to the \"Provider\" field for each dataset
- Publisher: The place or administration associated with the portal. This is applied to the title in each dataset in brackets
- Spatial Coverage: A list of place names. These are transferred to the Spatial Coverage for each dataset
- Member Of: a larger collection level record. Most of the Hubs are either part of our Government Open Geospatial Data Collection or the Research Institutes Geospatial Data Collection
It is not necessary to take extra time and manually remove the unused fields, because the Jupyter Notebook code will ignore them.
"},{"location":"recipes/R-02_socrata/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-02_socrata.ipynb
- Move the downloaded file
socrataPortals.csv
into the same directory as the Jupyter Notebook.
"},{"location":"recipes/R-02_socrata/#troubleshooting-as-needed","title":"Troubleshooting (as needed)","text":" - Visit the URL for the Socrata Portal to check and see if the site is down, moved, etc.
- If a site is missing
- Unpublish it from GBL Admin and indicate the Date Retired, and make a note in the Status field.
- Start over from Step 1.
"},{"location":"recipes/R-02_socrata/#step-3-validate-and-clean","title":"Step 3: Validate and Clean","text":"Although the harvest notebook will produce valide metadata for most of the items, there may still be some errors. Run the cleaning script to ensure that the records are valid before we try to ingest them into GBL Admin.
"},{"location":"recipes/R-02_socrata/#step-4-upload-to-gbl-admin","title":"Step 4: Upload to GBL Admin","text":" - Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
- Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
- Use the old Date Accessioned value to search for the previous harvest date.
- Unpublish the ones that have the old date in the Date Accessioned field 5. Record this number in the GitHub issue for the scan under Number Deleted
- Look for records in the uploaded batch that are still \"Draft\" - these are new records.
- Publish them and record this number in the GitHub issue under Number Added
"},{"location":"recipes/R-03_ckan/","title":"CKAN","text":""},{"location":"recipes/R-03_ckan/#purpose","title":"Purpose","text":"To scan the Action API for CKAN data portals and retrieve metadata for new items while returning a list of deleted items.
Warning
This batch CKAN recipe is being deprecated and replaced with recipes tailored to each site.
graph TB\n\nA((STEP 1. <br>Set up directories)) --> B[STEP 2. <br>Run Jupyter Notebook script] ;\nB --> C{Did the script run successfully?};\nC --> |No| D[Troubleshoot];\nD -->A;\nC --> |No & I can't figure it out.| F[Refer issue back to Product Manager];\nC --> |Yes| E[STEP 3. <br>Edit places names & titles]; \nE --> G[STEP 4. <br>Upload new records];\nG --> H[STEP 5. <br>Unpublish deleted records];\n\nclassDef goCell fill:#99d594,stroke:#333,stroke-width:2px\nclass A,B,C,E,G goCell;\nclassDef troubleCell fill:#ffffbf,stroke:#333,stroke-width:2px;\nclass D troubleCell;\nclassDef endCell fill:#fc8d59,stroke:#333,stroke-width:2px\nclass F,H endCell;\nclassDef questionCell fill:#fff,stroke:#333,stroke-width:2px;\nclass C questionCell;\n\n\n
"},{"location":"recipes/R-03_ckan/#step-1-set-up-your-directories","title":"Step 1: Set up your directories","text":" -
Navigate to your local Recipes directory for R-03_ckan.
-
Verify that there are two folders
resource
: contains a CSV for each portal per harvest that lists all of the dataset identifiers reports
: combined CSV metadata files for all new and deleted datasets per harvest
-
Review the CKANportals.csv file. Each active portal should have values in the following fields:
- portalName
- URL
- Provider
- Publisher
- Spatial Coverage
- Bounding Box
"},{"location":"recipes/R-03_ckan/#step-2-run-the-harvest-script","title":"Step 2: Run the harvest script","text":" - Start Jupyter Notebook
- Open your local copy of R-03_ckan.ipynb
Info
This script will harvest from a set of CKAN data portals. It saves a list of datasets found in each portal and will compare the output between runs. The result will be two CSVs: new items and deleted items.
The script only harvests items that can be identified as shapefiles or imagery.
"},{"location":"recipes/R-03_ckan/#step-3-edit-the-metadata-for-new-items","title":"Step 3: Edit the metadata for new items","text":"The new records can be found in reports/allNewItems_{today's date}.csv
and will need some manual editing.
- Spatial Coverage: Add place names related to the datasets.
- Title: Concatenate values in the Alternative Title column with the Spatial Coverage of the dataset.
"},{"location":"recipes/R-03_ckan/#step-4-upload-metadata-for-new-records","title":"Step 4: Upload metadata for new records","text":"Open GBL Admin and upload the new items found in reports/allNewItems_{today's date}.csv
"},{"location":"recipes/R-03_ckan/#step-5-delete-metadata-for-retired-records","title":"Step 5: Delete metadata for retired records","text":"Unpublish records found in reports/allDeletedItems_{today's date}.csv
. This can be done in GBL Admin manually (one by one) or with the GBL Admin documents update script.
"},{"location":"recipes/R-04_oai-pmh/","title":"Harvest via OAI-PMH","text":"Using Illinois Library Digital Collections as example
Steps:
"},{"location":"recipes/R-04_oai-pmh/#part-1-get-the-files-via-oai","title":"Part 1: get the files via oai","text":" - Use this OAI-PMH validator tool at https://validator.oaipmh.com
- Go to the Download XML tab
- Enter the base URL (https://digital.library.illinois.edu/oai-pmh) and the set name (6ff64b00-072d-0130-c5bb-0019b9e633c5-2)
- Wait for the app to pull all the XML files and download them (ideally in a ZIP, but sometimes that doesn't work and you need to click on each file)
"},{"location":"recipes/R-04_oai-pmh/#part-2-turn-the-records-into-a-csv-via-openrefine","title":"Part 2: turn the records into a CSV via OpenRefine","text":" - start OpenRefine
- Choose \"Get Data from this Computer\" and upload the XML files
- From the parsing options, select from the Header \"record\"
"},{"location":"recipes/R-04_oai-pmh/#part-3-collapse-multivalued-cells","title":"Part 3: Collapse multivalued cells","text":" - The multi-valued cells will start out being grouped together by which XML file they came from. We don't want that, so remove the column called File.
- Now, they are grouped by a value \"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\" Leave this for now.
- There are multiple Identifiers (dc:identifier), so select that column, Edit Cells - Join multi-valued cells
- Move the Identifier column to the beginning so that items will be grouped by these unique values
- Collapse the remaining cells with the same Join Multi-valued cells function
- Export to CSV
"},{"location":"recipes/R-05_iiif/","title":"IIIF","text":""},{"location":"recipes/R-05_iiif/#purpose","title":"Purpose","text":"To extract metadata from IIIF JSON Manifests.
"},{"location":"recipes/R-05_iiif/#step-1-download-the-jsons","title":"Step 1: Download the JSONs","text":" - Create a CSV file called \"jsonUrls.csv\" with just the URLs of the JSONs.
- Navigate to the Terminal/Command Line and into a directory where you can save files
- Type:
wget -i jsonUrls.csv
- Review that all of the JSONs downloaded to a local directory
"},{"location":"recipes/R-05_iiif/#step-2-run-the-extraction-script","title":"Step 2: Run the extraction script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-05_iiif.ipynb
- Move the downloaded file
jsonUrls.csv
into the same directory as the Jupyter Notebook. - Run all cells
Warning
This will lump all the subsections into single fields and the user will still need to split them.
"},{"location":"recipes/R-05_iiif/#step-3-merge-the-metadata","title":"Step 3: Merge the metadata","text":"Although the Jupyter Notebook extracts the metadata to a flat CSV, we still need to merge this with any existing metadata for the records.
"},{"location":"recipes/R-05_iiif/#step-4-upload-to-gbl-admin","title":"Step 4: Upload to GBL Admin","text":""},{"location":"recipes/R-06_mods/","title":"MODS","text":""},{"location":"recipes/R-06_mods/#purpose","title":"Purpose:","text":"To extract metadata from XML files in the MODS metadata format.
"},{"location":"recipes/R-06_mods/#step-1-obtain-a-list-of-urls-for-the-xml-files","title":"Step 1: Obtain a list of URLs for the XML files","text":"This list may be supplied by the submitter or we may need to query the website to find them.
"},{"location":"recipes/R-06_mods/#step-2-run-the-extraction-script","title":"Step 2: Run the extraction script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-06-mods.ipynb
- Move the downloaded file
arcHubs.csv
into the same directory as the Jupyter Notebook. - Run all cells.
"},{"location":"recipes/R-06_mods/#step-3-format-as-opengeometadata","title":"Step 3: Format as OpenGeoMetadata","text":"Manually adjust the names of the columns to match metadata into our GeoBTAA metadata template.
"},{"location":"recipes/R-06_mods/#step-4-upload-to-gbl-admin","title":"Step 4: Upload to GBL Admin","text":""},{"location":"recipes/R-07_ogm/","title":"OpenGeoMetadata","text":""},{"location":"recipes/R-07_ogm/#purpose","title":"Purpose","text":"To convert OpenGeoMetadata (GeoBlacklight) JSONs to a CSV in the GeoBTAA Metadata Profile
"},{"location":"recipes/R-07_ogm/#step-1-obtain-the-jsons","title":"Step 1: Obtain the JSONs","text":"Collect the metadata JSONs. This type of metadata typically is obtain in one of the following ways:
- Direct submission (Team Members from Wisconsin)
- Via an OpenGeoMetadata repository
"},{"location":"recipes/R-07_ogm/#step-2-run-the-conversion-script","title":"Step 2: Run the conversion script","text":" - Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-07_openGeoMetadata.ipynb
- Move the folder with the metadata JSONs into the same directory as the Jupyter Notebook.
- Declare the paths and folder name in the Notebook.
- Run all cells
Tip
Depending upon the source, you may want to adjust the script to accomodate custom fields.
"},{"location":"recipes/R-07_ogm/#step-3-edit-the-output-csv","title":"Step 3: Edit the output CSV","text":"The GeoBTAA Metadata Profile may have additional or different requirements. Consult with the Product Manager on which fields may need augmentation.
"},{"location":"recipes/R-07_ogm/#step-4-upload-to-gbl-admin","title":"Step 4: Upload to GBL Admin","text":""},{"location":"recipes/R-08_pasda/","title":"PASDA","text":""},{"location":"recipes/R-08_pasda/#purpose","title":"Purpose","text":"To harvest metadata from Pennsylvania Spatial Data Access (PASDA), a custom data portal. To begin, start the Jupyter Notebook:
- Start Jupyter Notebook and navigate to the Recipes directory.
- Open R-08_pasda.ipynb
"},{"location":"recipes/R-08_pasda/#step-1-obtain-a-list-of-landing-pages","title":"Step 1: Obtain a list of landing pages","text":"Run Part 1 of the Notebook to obtain a list of all of the records currently in PASDA. This list can be found by doing a blank search on the PASDA website: (https://www.pasda.psu.edu/uci/SearchResults.aspx?Keyword=+)
Since this is a large result list, the recipe recommends downloading the HTML file to your desktop (In the Safari browser, this is File-Save As - Save As Page Source)
Then, we can use the Beautiful Soup module to query this page and harvest the following values:
- Title
- Date Issued
- Publisher
- Description
- Metadata file link
- Download link
"},{"location":"recipes/R-08_pasda/#step-2-download-the-supplemental-metadata-files","title":"Step 2: Download the supplemental metadata files","text":"Context
Bounding boxes & keywords are not found in the landing pages, but most of the PASDA datasets have a supplemental metadata document, which does contain coordinates. The link to this document was scraped to the 'Metadata File\" column during the previous step.
Most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is found in the 'Metadata File\" column. Although these files are created as XMLs, the link is a rendered HTML.
There is additional information in these files that we want to scrape, including bounding boxes and geometry type.
At the end of Step 2, you will have a folder of HTML metadata files.
"},{"location":"recipes/R-08_pasda/#step-3-query-the-downloaded-supplemental-metadata-files","title":"Step 3: Query the downloaded supplemental metadata files","text":"This section of the script will scan each HTML metadata file. If it contains bounding box information, it will pull the coordinates. Otherwise, it will assign a default value of the State of Pennsylvania extents.
It will also pull the geometry type and keywords, if available.
"},{"location":"recipes/R-08_pasda/#step-4-add-default-and-calculated-values","title":"Step 4: Add default and calculated values","text":"This step will clean up the harvested metadata and add our administrative values to each row. At the end, there will be a CSV file in your directory named for today's date.
"},{"location":"recipes/R-08_pasda/#step-5-upload-the-csv-to-gbl-admin","title":"Step 5: Upload the CSV to GBL Admin","text":" - Upload the new records to GBL Admin
- Use the Date Accessioned field to search for records that were not present in the current harvest. Retire any records that have the code \"08a-01\" but were not part of this harvest.
"},{"location":"recipes/R-09_umedia/","title":"UMedia","text":""},{"location":"recipes/R-09_umedia/#purpose","title":"Purpose","text":"To harvest new records added to the University Of Minnesota's UMedia Digital Library.
"},{"location":"recipes/R-09_umedia/#step-1-set-up-folders","title":"Step 1: Set up folders","text":" - Navigate to the UMedia Recipe directory at R-09_umedia.ipynb
- Verify the following folders are present:
requests
This folder stores all search results in JSON format for each reaccession as request_YYYYMMDD.json
.
jsons
This folder stores all JSON files by different added month for UMedia maps. After we get the search result JSON file from each reaccession, we will read this request_YYYYMMDD.json
file in detail to filter out the included maps by month, and store them to dateAdded_YYYYMM.json
individually.
reports
This folder stores all CSV files for metadata by month. Once we have JSON files for different month, we extract all useful metadata and contribute in the dateAdded_YYYYMM.csv
in this folder.
"},{"location":"recipes/R-09_umedia/#step-2-run-the-harvesting-script","title":"Step 2: Run the harvesting script","text":" - Start Jupyter Notebook and open R-09_umedia.ipynb
- The second code cell will ask for an input on how many map records you want to harvest.
- The third code cell will ask for a date range. Select a month (in the form
yyyy-mm
) based on the last time you ran the script.
"},{"location":"recipes/R-09_umedia/#step-3-edit-the-metadata","title":"Step 3: Edit the metadata","text":""},{"location":"recipes/R-09_umedia/#step-4-upload-to-gbl-admin","title":"Step 4: Upload to GBL Admin","text":""},{"location":"recipes/add-bbox/","title":"Add bounding boxes","text":"Summary
This page describes processes for obtaining bounding box coordinates for our scanned maps. The coordinates will be used for indexing the records in the Big Ten Academic Alliance Geoportal.
**About bounding box coordinates for the BTAA Geoportal **
- Bounding boxes enable users to search for items with a map interface.
- The format is 4 coordinates in decimal degrees
- Provide the coordinates in this order: West, South, East, North.
- The bounding boxes do not need to be exact, particularly with old maps that may not be very precise anyways.
"},{"location":"recipes/add-bbox/#manual-method","title":"Manual method","text":""},{"location":"recipes/add-bbox/#part-a-setup","title":"Part A: Setup","text":" - Open and inspect the image file.
- Try to identify a single / combined region that the map or atlas represents
- You can also check to see if the map has the bounding coordinates printed in the text anywhere or you are able to find the bounds by inspecting the edges.
- Open another window with the Klokan Bounding Box tool.
- Set the Copy & Paste section to CSV.
"},{"location":"recipes/add-bbox/#part-b-find-the-coordinates","title":"Part B: Find the coordinates","text":""},{"location":"recipes/add-bbox/#option-1-search-for-a-place-name","title":"Option 1: Search for a place name","text":" - Use the search boxes on the Klokan Bounding Box tool to zoom to the region. (For example, search for \u201cIllinois\u201d.
- Manually adjust the grey overlay box in the Klokan site to line up the edges to the edges of the map.
- Try to line it up reasonably closely
"},{"location":"recipes/add-bbox/#option-2-draw-a-shape","title":"Option 2: Draw a shape","text":" - Switch to the Polygon tool by clicking on the pentagon icon
- Click as many points on the screen as needed to approximate the map extent.
- Click on the first point to close the polygon
- The interface will display a dotted line showing the bounding box rectangle.
"},{"location":"recipes/add-bbox/#part-c-copy-back-to-geobtaa-metadata","title":"Part C: Copy back to GeoBTAA metadata","text":" - Click the \u201cCopy to Clipboard\u201d icon on the Klokan site.
- Paste the coordinates into the Bounding Box field in the GeoBTAA metadata template or in the GBL Admin metadata editor.
"},{"location":"recipes/add-bbox/#programmatic-method","title":"Programmatic method","text":"The OpenStreetMap offers and API that allows users to query with place names and return a bounding box. Follow the Tutorial, Use OpenStreetMap to generate bounding boxes, for this method.
"},{"location":"recipes/clean/","title":"Validate and Clean Metadata","text":"Info
Find the cleaning script here: https://github.com/geobtaa/harvesting-guide/tree/main/recipes/R-00_clean
As a final step of Edit stage of the resource lifecycle, we run a cleaning script to fix common errors:
"},{"location":"recipes/clean/#required-fields","title":"Required fields","text":" - Resource Class: Checks that an entry exists and that it is one of the controlled values. If the field is empty, the cleaning script will insert
Datasets
as a default. - Access Rights: Checks that it contains either
Public
or Restricted
. If empty, the script will insert Public
as a default.
"},{"location":"recipes/clean/#conditionally-required-fields","title":"Conditionally required fields","text":" - Format: If the \"Download\" field has a value, Format must also be present. If empty, the script will insert
File
as a default.
"},{"location":"recipes/clean/#syntax","title":"Syntax","text":" - Date Range: If present, checks that it is valid with a range in the format
yyyy-yyyy
, where the second value is earlier than the first. If the second value is earlier (lower) than the first, the script will reorder them. - Bounding Box: There are numerous possible conditions that the script will fix:
- Rounding: The script will round all coordinates to two decimal points. (For some collections, we change this to three). This is done because the bounding boxes function as a finding aid and overly precise coordinates can be misleading. It also saves on the memory load for geometry and centroid calculations.
- non-degrees: If one of the coordinates exceeds the earth's coordinates (over 180 for longitude or 90 for latitude), the coordinates are considered invalid and the entire field will be cleared for that record.
- lines or points: If the script finds that easting and westing or north and south coordinates are equal, it will add a .001 to the end.
"},{"location":"recipes/clean/#reports","title":"Reports","text":"After cleaning, the script will produce two CSVS:
- Cleaned Metadata: All of the original rows with fixes applied. A new column called \"Cleaned\" will indicate if the row was edited by the script.
- Cleaning Log: A list of all the records and fields that were cleaned and what was done.
"},{"location":"recipes/secondary-tables/","title":"How to upload links in secondary tables in GBL Admin","text":" - We use two compound metadata fields,
Multiple Download Links
and Institutional Access Links
, that include multiple links that are formatted with both a label + a link. - Because these fields are not regular JSON flat key:value pairs, they are stored in secondary tables within GBL Admin.
- When using GBL Admin's Form view, these values can be entered by clicking into a side page linked from the record.
- For CSV uploads, these values are uploaded with a separate CSV from the one used for the main import template.
Tip
See the opengeometadata.org page on multiple downloads for how these fields are formatted in JSON
"},{"location":"recipes/secondary-tables/#manual-entry","title":"Manual entry","text":""},{"location":"recipes/secondary-tables/#multiple-download-links","title":"Multiple Download Links","text":" - On the Form view, scroll down to the end of the Links section and click the text \"Multiple Download Links\"
- Click the New Download URL button
- Enter a label (i.e., \"Shapefile\") and the download URL
- Repeat for as many as needed
"},{"location":"recipes/secondary-tables/#institutional-access-links","title":"Institutional Access Links","text":" - On the Form view, scroll down to the bottom of the right-hand navigation and click the text \"Institutional Access Links\"
- Click the New Access URL button
- Select an institution code and the access URL
- Repeat for as many as needed
"},{"location":"recipes/secondary-tables/#csv-upload-for-either-type","title":"CSV Upload for either type","text":" - Go to Admin Tools - Multiple Downloads or Access Links
- Upload a CSV in on of these formats:
CSV field headers for secondary tables
Multiple DownloadsInstitutional Access Links | friendlier_id | label | value |\n |---------------------|------------------|------------|\n | ID of target record | any string | the link |\n
| friendlier_id | institution_code | access_URL |\n |---------------------|------------------|------------|\n | ID of target record | 2 digit code | the link |\n
"},{"location":"recipes/split-bbox/","title":"Split Bounding Boxes that cross the 180th Meridian","text":""},{"location":"recipes/split-bbox/#problem","title":"Problem","text":"The BTAA Geoportal does not display bounding boxes that cross the 180th meridian (also known as the International Date Line.) In these circumstances, the West coordinate will be a positive number, but the East coordinate will be negative.
"},{"location":"recipes/split-bbox/#solution","title":"Solution","text":"One way to mitigate this is to create two bounding boxes for the OGM Aardvark Geometry
field. The Bounding Box value will be the same, but the Geometry field will have a multipolygon that is made up of two adjacent boxes.
The following script will scan a CSV of the records, identify which cross the 180th Meridian, and insert a multipolygon into a new column.
The script was designed with the assumption that the input CSV will be in the OGM Aardvark format, likely exported from GBL Admin. The CSV file must contain a field for Bounding Box
. It may contain a Geometry
field with some values that we do not want to overwrite.
This script will create a new field called \"Bounding Box (WKT)\". Items that crossed the 180th Meridian will have a multipolygon in that field. Items that don't cross will not have a value in that field. Copy and paste only the new values into the Geometry
column and reupload the CSV to GBL Admin.
import csv\n\ndef split_coordinates(coordinate_str):\n if not coordinate_str:\n return ''\n\n coordinates = coordinate_str.split(',')\n\n west, south, east, north = map(float, coordinates)\n\n if west > 0 and east < 0:\n polygon1 = f'({west} {south}, {179.9} {south}, {179.9} {north}, {west} {north}, {west} {south})'\n polygon2 = f'({-179.9} {south}, {-179.9} {north}, {east} {north}, {east} {south}, {-179.9} {south})'\n return f'MULTIPOLYGON(({polygon1}), ({polygon2}))'\n\n return coordinate_str\n\n# Specify the input CSV file path\ninput_file = 'your_input_file.csv'\n\n# Specify the output CSV file path\noutput_file = 'your_output_file.csv'\n\n# Specify the name of the column with the Bounding Box coordinates\ncoordinate_column = 'Bounding Box'\n\n# Specify the name of the new column to store the updated coordinates\nnew_coordinate_column = 'New Bounding Box (WKT)'\n\n# Read the input CSV and process the coordinates\nwith open(input_file, 'r') as file:\n reader = csv.DictReader(file)\n fieldnames = reader.fieldnames + [new_coordinate_column]\n\n with open(output_file, 'w', newline='') as output:\n writer = csv.DictWriter(output, fieldnames=fieldnames)\n writer.writeheader()\n\n for row in reader:\n bounding_box = row[coordinate_column]\n new_bounding_box = split_coordinates(bounding_box)\n row[new_coordinate_column] = new_bounding_box\n writer.writerow(row)\n\nprint(\"CSV processing completed!\")\n
"},{"location":"recipes/standardize-creators/","title":"Best Practices for Standardizing Creator Field Data","text":" Authors: Creator Standardization Working Group
Date: 28 November 2022
Info
See this Journal Article for a more thorough description of this process:
Laura Kane McElfresh (2023) Creator Name Standardization Using Faceted Vocabularies in the BTAA Geoportal, Cataloging & Classification Quarterly, DOI: 10.1080/01639374.2023.2200430
Creator names are a critical access point for the discovery of geospatial information. Within the BTAA Geoportal, creator names\u2013whether names of persons or corporate bodies\u2013are displayed on landing pages and the citation widget, and are indexed and faceted for searching and browsing. Standardizing the names of resource creators makes search results more predictable, thereby producing a better experience for Geoportal users.
To ensure that the Geoportal\u2019s collocation functions operate properly, this document recommends using the formulation of personal and corporate body names as they are found in identity registries and, when creators are not available in those registries, provides guidance for formulating names of creators. We seek to provide consistency of creator names within our database through the recommendations provided below. This document does not address the manual creation or editing of identity registry records.
These best practices assume that standardization of names will occur after data is ingested into the BTAA Geoportal; however this document may be used to inform description choices made before records are ingested into the Geoportal.
"},{"location":"recipes/standardize-creators/#preferred-identity-registries","title":"Preferred Identity Registries","text":"These best practices recommend consulting one or two name registries when deciding how to standardize names of creators: the Faceted Application of Subject Terminology (FAST) or the Library of Congress Name Authority File (LCNAF). FAST is a controlled vocabulary based on the Library of Congress Subject Headings (LCSH) that is well-suited to the faceted navigation of the Geoportal. The LCNAF is an authoritative list of names, events, geographic locations and organizations used by libraries and other organizations to collocate authorized creator names to make searching and browsing easier.
"},{"location":"recipes/standardize-creators/#overview-of-the-process","title":"Overview of the Process","text":"These best practices present the following workflow for standardizing names of creators:
"},{"location":"recipes/standardize-creators/#search-fast-for-the-creators-name","title":"Search FAST for the creator\u2019s name","text":" - If the creator\u2019s name is found, then use the name as found in FAST
- If there\u2019s no match in FAST, then consult the Guidance for Formulating Creator Names Not Present in the Registries Noted Above
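For batch cleanup work, the same decision flow can be written as a small routing function. This is only an illustrative sketch: the three lookup callables are hypothetical placeholders for whatever manual or scripted steps your team actually uses, and the LCNAF fallback reflects the fuller workflow described later on this page.
def standardize_creator(name, fast_lookup, lcnaf_lookup, formulate_name):
    '''Route one creator name through the standardization workflow.

    fast_lookup, lcnaf_lookup, and formulate_name are placeholder callables;
    each should return a standardized string, or None if nothing was found.
    '''
    # 1. Prefer the FAST form of the name when one exists.
    fast_form = fast_lookup(name)
    if fast_form:
        return fast_form
    # 2. Otherwise fall back to the LCNAF form (and request a FAST record).
    lcnaf_form = lcnaf_lookup(name)
    if lcnaf_form:
        return lcnaf_form
    # 3. Otherwise formulate the name following the guidance below.
    return formulate_name(name)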
"},{"location":"recipes/standardize-creators/#searching-fast","title":"Searching FAST","text":"To search the FAST registry for a creator name, you may use either assignFAST or searchFAST. assignFAST is ideal for quick searches, while searchFAST allows for advanced searching.
"},{"location":"recipes/standardize-creators/#assignfast","title":"assignFAST","text":" - Go to http://experimental.worldcat.org/fast/assignfast/ and begin typing the creator name into the text box.
- assignFAST will suggest headings in FAST format. When the correct heading appears, click it in the list of suggestions.
- The selected heading will appear in the text box, highlighted for copying.
- For example, if you type in St. Francis, Minnesota, you will see suggestions including \u201cMinnesota--St. Francis (Anoka County) USE Minnesota--Saint Francis (Anoka County)\u201d and \u201cMinnesota--St. Francis (Anoka Co.) USE Minnesota--Saint Francis (Anoka County)\u201d.
- Click on either of those suggestions and you will receive the authorized form of the name: Minnesota--Saint Francis (Anoka County).
- Copy the authorized name from the text box and paste it into the spreadsheet.
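If you are checking many names at once, assignFAST suggestions can also be retrieved programmatically. This is a hedged sketch only: the endpoint, parameter names, and response shape are assumptions drawn from OCLC's published assignFAST web service examples and should be verified before relying on the results.
import requests

ASSIGNFAST_URL = 'https://fast.oclc.org/searchfast/fastsuggest'  # assumed endpoint

def suggest_fast_headings(name, rows=5):
    params = {
        'query': name,
        'queryIndex': 'suggestall',              # match against all heading types
        'queryReturn': 'suggestall,idroot,auth',
        'suggest': 'autoSubject',
        'rows': rows,
    }
    response = requests.get(ASSIGNFAST_URL, params=params, timeout=30)
    response.raise_for_status()
    docs = response.json().get('response', {}).get('docs', [])
    # 'auth' is assumed to hold the authorized heading; 'idroot' the fst identifier.
    return [(doc.get('auth'), doc.get('idroot')) for doc in docs]

# Example call: suggest_fast_headings('St. Francis, Minnesota')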
"},{"location":"recipes/standardize-creators/#searching-lcnaf","title":"Searching LCNAF","text":"When a name is not found in FAST, search the Library of Congress Name Authority File (LCNAF) for a match using the directions found in the Searching LCNAF section below.
If no match is found, there\u2019s no requirement to do intensive research. Continue to the next section, Guidance for Formulating Creator Names Not Present in the Registries Noted Above. If using these Best Practices in a metadata sprint, you may alternatively move on to the next name in the sprint spreadsheet.
"},{"location":"recipes/standardize-creators/#guidance-for-formulating-creator-names-not-present-in-the-registries-noted-above","title":"Guidance for Formulating Creator Names Not Present in the Registries Noted Above","text":"When a personal or corporate body name cannot be found in neither FAST nor LCNAF, follow the directions below.
"},{"location":"recipes/standardize-creators/#personal-names","title":"Personal Names","text":"Personal names should be formulated in inverted order (last name first) based on the information that appears on the item in the Geoportal.
Felsted, L. E.\n Ackley, Seth\n Colvert, DeLynn C.\n Griffey, Ken, Jr. \n
In cases where extra information is needed to distinguish a name, you may add a parenthetical at the end of the name, e.g., Surveyor, Cartographer, Draftsman, Geologist, Engraver.
Perry, Katy (Cartographer)\n
"},{"location":"recipes/standardize-creators/#corporate-body-names","title":"Corporate Body Names","text":""},{"location":"recipes/standardize-creators/#abbreviations-and-initialisms","title":"Abbreviations and Initialisms","text":"Regardless of how a name appears on the resource, always use the spelled out form of the name as opposed to abbreviations or initialisms for the purpose of being clear. For example, use
United States Geological Survey
U.S.G.S.
USGS
Cook County Geographic Information Systems
Cook County GIS
"},{"location":"recipes/standardize-creators/#subordinate-bodies","title":"Subordinate Bodies","text":"A \u201csubordinate body\u201d is a corporate entity that is part of another corporate entity. To avoid confusion for Geoportal users, always include the name of the larger \u201cparent\u201d entity. For instance:
Use Cheyenne Light, Fuel and Power Company. Engineering Department rather than Engineering Department alone.
Use Canada. Department of the Interior rather than Department of the Interior alone.
"},{"location":"recipes/standardize-creators/#jurisdictional-geographic-names-used-in-the-creator-field","title":"Jurisdictional Geographic Names Used in the Creator Field","text":"For background, \u201cjurisdictional\u201d place names are those that are defined legally by a set of boundaries and overseen by a governmental agency. In the United States these would be cities, towns, townships, boroughs, villages (mostly), counties, states and so forth. Non-jurisdictional places are of two types: either entities in nature that have been given a name such as the Mississippi River or Rocky Mountains, or are administrative component areas of a larger formal jurisdiction such as ranger districts within a national forest.
It will be extremely rare NOT to find an authorized form of a jurisdictional place name in FAST and LCNAF. However, if you encounter a place not found in these resources, follow the pattern used in FAST.
For a dataset in which the creator name is given as \u201cCity of Kenosha\u201d, FAST formulates the jurisdictional place name as:
Wisconsin--Kenosha\n
"},{"location":"recipes/standardize-creators/#directions-for-standardizing-metadata-records-for-the-btaa-geoportal","title":"Directions for standardizing metadata records for the BTAA Geoportal","text":"When a name is not found in FAST, search the Library of Congress Name Authority File (LCNAF) for a match following the directions in the Searching LCNAF section below. If a matching name is found in LCNAF, a FAST record may be requested, as explained below.
If no match is found, there\u2019s no requirement to do intensive research. Instead, follow the directions in the section Guidance for Formulating Creator Names Not Present in the Registries Noted Above.
"},{"location":"recipes/standardize-creators/#searching-lcnaf_1","title":"Searching LCNAF","text":"To search the LCNAF for a creator name, you may use either the Library of Congress Authorities or the LC Linked Data Service. The LC Linked Data Service is ideal for quick keyword searches, while Library of Congress Authorities allows for browse searching.
Searching the LC Linked Data Service:
- Go to https://id.loc.gov/authorities/names.html
- Type the creator name into the text box and press enter or click on Go
- The result that appears in the \"Label\" column is the LCNAF authorized form of the name, for instance, Cumberland County (Pa.)
- The number at the far right in the \"Identifier\" column is the Library of Congress control number (LCCN), which for Cumberland County, Pennsylvania, is n81032665
Often, when searching for names of persons, several results appear to be possible matches. In those cases, click on a heading in the results list and look for the \"Sources\" heading to see a list of citations that have been associated with that entity.
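For larger batches, the LC Linked Data Service also supports programmatic lookups. This is a hedged sketch only: the suggest URL and the OpenSearch-style response shape shown here are assumptions to verify against the id.loc.gov documentation before use.
import requests

LCNAF_SUGGEST_URL = 'https://id.loc.gov/authorities/names/suggest/'  # assumed endpoint

def suggest_lcnaf(label):
    response = requests.get(LCNAF_SUGGEST_URL, params={'q': label}, timeout=30)
    response.raise_for_status()
    # Assumed response shape: [query, [labels], [notes], [URIs]]
    _query, labels, _notes, uris = response.json()
    # The LCCN is taken as the last path segment of each URI, e.g. 'n81032665'.
    return [(lbl, uri.rsplit('/', 1)[-1]) for lbl, uri in zip(labels, uris)]

# Example call: suggest_lcnaf('Cumberland County (Pa.)')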
When a name is found in the LCNAF, submit a request that the LCNAF name be added to FAST by following the steps below.
"},{"location":"recipes/standardize-creators/#request-additions-to-fast","title":"Request Additions to FAST","text":"You will use the importFAST Subject Headings utility to request LCNAF additions to FAST.
- First, select the type of LCNAF name.
- Personal Name
- Corporate Name
- Topical: We are unlikely to use this one for a creator.
- Copy the \u201cIdentifier\u201d (the Library of Congress control number) from the LCNAF record and paste it into the \u201center name LCCN\u201d text box. Click the \u201cImport\u201d button.
- The form will automatically populate using the LCNAF name. Check to make sure the correct name has been imported.
- Enter the email address geoportal@btaa.org (you do not need to fill out the \u201cAnything extra\u201d text box) and click \u201cSubmit Heading\u201d.
- The new FAST heading will appear! Please copy the FAST heading number from the end of the string (e.g. \u201cfst02013467\u201d) and paste it into the spreadsheet to show that the heading has been added to FAST.
- Enter the newly created FAST heading into the spreadsheet.
Geographic names cannot be requested through the importFAST Subject Headings tool. In the spreadsheet, place \u201cYes\u201d in the column labeled \u201cRequest FAST geographic name\u201d.
"},{"location":"recipes/update-hub-list/","title":"How to update the list of ArcGIS Hub websites.","text":""},{"location":"recipes/update-hub-list/#background","title":"Background","text":"The BTAA Geoportal provides a central access point to find and browse public geospatial data. A large portion of these records come from ArcGIS Hubs that are maintained by states, counties, cities, and regional entities. These entities continually update the data and the website platforms. In turn, we need to continually update our links to these resources.
The ArcGIS Harvesting Recipe walks through how we programmatically query the ArcGIS Hubs to obtain the current list of datasets. This page describes how to keep our list of ArcGIS Hub websites updated.
"},{"location":"recipes/update-hub-list/#how-to-create-or-troubleshoot-an-arcgis-hub-website-record-in-gbl-admin","title":"How to create or troubleshoot an ArcGIS Hub website record in GBL Admin","text":"Info
The highlighted fields listed below are required for the ArcGIS harvesting script. If the script fails, check that these fields have been added. The underlined fields are used to query GBL Admin and produce the list of Hubs that we regularly harvest.
Each Hub has its own entry in GBL Admin. Manually create or update each record with the following parameters:
- Title: The name of the site as shown on its homepage. This value will be transferred into the Provider field of each dataset.
- Description: Usually \"Website for finding and accessing open data provided by \" followed by the name of the administrative place or organization publishing the site. Additional descriptions are helpful.
- Language: 3-letter code as shown on the OpenGeoMetadata website.
- Publisher: The administrative place or organization publishing the site. This value will be concatenated into the title of each dataset. For place names, use the FAST format (e.g., Minnesota--Clay County).
- Resource Class: Websites. This value is used for filtering & finding the Hubs in GBL Admin.
- Temporal Coverage: Continually updated resource
- Spatial Coverage: Add place names using the FAST format as described for the B1G Profile.
- Bounding Box: If the Hub covers a specific area, include a bounding box for it using the manual method described in the Add bounding boxes recipe.
- Member Of: one of the following values: ba5cc745-21c5-4ae9-954b-72dd8db6815a (Government Open Geospatial Data) or b0153110-e455-4ced-9114-9b13250a7093 (Research Institutes Geospatial Data Collection)
- Format: ArcGIS Hub. This value is used for filtering & finding the Hubs in GBL Admin.
- Links - Reference - \"Full layer description\": link to the homepage for the Hub
-
Identifier: If the record will be part of the monthly harvests, add this to the end of the baseUrl (usually the homepage): /api/feed/dcat-us/1.1.json
. The Identifier will be used to query the metadata for the website.
Warning
Always check the Identifier link! It should show a JSON API in your browser that displays all of the metadata for each dataset hosted by the website. The baseUrl may be slightly different than the landing page for the organization. For example, some entities may add the string \"data-\" to the beginning of their site URL. The best way to make sure you have the right URL is to look for a box labeled \"Search all data\". This will result in a link like \"baseUrl
/search\". Then, replace the \"/search\" with /api/feed/dcat-us/1.1.json
-
Access Rights: Public
for all ArcGIS Hubs.
- Accrual Method:
DCAT US 1.1
. This value is used for filtering & finding the Hubs in GBL Admin - Status: If the site is part of the ArcGIS Harvest, use the value
Indexed
. If the site is not part of the harvest, use Not indexed
. Other explanatory text can be included here, such as indicating if the site is broken. - Publication State: When a new record is created it will automatically be assigned
Draft
. Change the state to published
when the metadata record is ready. If the site breaks or is deprecated, change this value to unpublished
.
"},{"location":"recipes/update-hub-list/#how-to-remove-broken-or-deprecated-arcgis-hubs","title":"How to remove broken or deprecated ArcGIS Hubs","text":"If a Hub site stalls the Harvesting script, it needs to be updated in GBL Admin.
"},{"location":"recipes/update-hub-list/#if-the-site-is-missing","title":"If the site is missing:","text":"Try to find a replacement site. When a Hub is updated to a new version, sometimes the baseURL will change. If a new site is found, update:
- Links - Reference - \"Full layer description\" : new landing page
- Identifier : the new API (with the suffix
/api/feed/dcat-us/1.1.json
)
"},{"location":"recipes/update-hub-list/#if-the-site-is-present-but-the-api-is-returning-an-error","title":"If the site is present, but the API is returning an error:","text":"In this case, we freeze the website and the dataset records, but stop new harvests. Make these changes:
"},{"location":"recipes/update-hub-list/#website-record","title":"Website Record","text":" - Accrual Method: remove
DCAT US 1.1
and leave blank - Status: Change the value \"Indexed\" to \"Not indexed\". Leave a short explanation if the API is broken.
"},{"location":"recipes/update-hub-list/#dataset-records","title":"Dataset Records","text":" - Export all of the records using the
Code
field - Accrual Method: change from \"ArcGIS Hub\" to \"ArcGISHub-paused\"
Broken APIs
These steps will remove the records from our preset query to select active ArcGIS Hubs in GBL Admin. The effect will be to freeze them until the API starts working again. Once the API becomes accessible again, reverse the Accrual Method values.
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index e1958b5e..cf394ee7 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ