Documentation / Unify harvesters configuration
josegar74 committed Jul 11, 2024
1 parent 7d7d4a3 commit e70534e
Showing 21 changed files with 346 additions and 211 deletions.
46 changes: 34 additions & 12 deletions docs/manual/docs/user-guide/harvesting/harvesting-csw.md
Expand Up @@ -4,16 +4,38 @@ This harvester will connect to a remote CSW server and retrieve metadata records

## Adding a CSW harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the CSW harvester.
- *Service URL* - The URL of the capabilities document of the CSW server to be harvested, e.g. <http://geonetwork-site.com/srv/eng/csw?service=CSW&request=GetCapabilities&version=2.0.2>. This document is used to discover the location of the services to call to query and retrieve metadata.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- *Use account* - Account credentials for basic HTTP authentication on the CSW server.
- **Search criteria** - Using the Add button, you can add several search criteria. You can query only the fields recognised by the CSW protocol.
- **Options** - Scheduling options.
- **Options** - Specific harvesting options for this harvester.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
To create a CSW harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `CSW`:

![](img/add-csw-harvester.png)

Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to OGC CSW 2.0.2**
- *Service URL*: The URL of the capabilities document of the CSW server to be harvested, e.g. <http://geonetwork-site.com/srv/eng/csw?service=CSW&request=GetCapabilities&version=2.0.2>. This document is used to discover the location of the services to call to query and retrieve metadata.
- *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the CSW server.
- *Search filter*: (Optional) Define the search criteria below to restrict the records to harvest.
- *Search options*:
- *Sort by*: Defines the sort order used to retrieve the results. For example, `identifier:A` sorts by UUID in ascending alphabetical order. Any CSW queryable can be combined with `A` (ascending) or `D` (descending) to set the ordering.
- *Output Schema*: The metadata standard in which the records are requested from the CSW server.
- *Distributed search*: Enables distributed search on the remote server (if the remote server supports it). When this option is enabled, the remote catalog cascades the search to the federated CSW servers it has configured.

- **Configure response processing for CSW**
- *Action on UUID collision*: When the harvester finds a record whose UUID matches a record added by another method (another harvester, importer, dashboard editor, ...), this option determines whether the record is skipped (default), overridden, or assigned a new UUID.
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *Check for duplicate resources based on the resource identifier*: If checked, ignores metadata with a resource identifier (`gmd:identificationInfo/*/gmd:citation/gmd:CI_Citation/gmd:identifier/*/gmd:code/gco:CharacterString`) that is already assigned to another metadata record in the catalog. It only applies to records in ISO19139 or ISO profiles.
- *XPath filter*: (Optional) When a record is retrieved from the remote server, an XPath expression is evaluated to accept or discard the record.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
- *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
- *Category*: (Optional) A GeoNetwork category to assign to each metadata record.

- **Privileges** - Assign privileges to harvested metadata.
- **Categories**
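
The *Service URL* and *Sort by* options described above can be illustrated with a minimal Python sketch. The helper names `capabilities_url` and `parse_sort_by` are hypothetical and not part of GeoNetwork; the sketch only assumes the standard CSW 2.0.2 `GetCapabilities` key-value parameters and the `queryable:A` / `queryable:D` convention mentioned in the option description.

```python
from typing import Tuple
from urllib.parse import urlencode


def capabilities_url(endpoint: str) -> str:
    """Append the CSW 2.0.2 GetCapabilities parameters to a CSW endpoint.

    Hypothetical helper for illustration; it assumes the endpoint has no
    query string of its own.
    """
    params = {"service": "CSW", "request": "GetCapabilities", "version": "2.0.2"}
    return f"{endpoint}?{urlencode(params)}"


def parse_sort_by(value: str) -> Tuple[str, bool]:
    """Split 'queryable:A' or 'queryable:D' into (queryable, ascending)."""
    queryable, _, direction = value.rpartition(":")
    if not queryable or direction not in ("A", "D"):
        raise ValueError(f"expected 'queryable:A' or 'queryable:D', got {value!r}")
    return queryable, direction == "A"


if __name__ == "__main__":
    print(capabilities_url("http://geonetwork-site.com/srv/eng/csw"))
    print(parse_sort_by("identifier:A"))
```

Building the URL from a parameter dictionary rather than typing it by hand avoids misspelling the `request` value, which would cause the server to reject the request.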
46 changes: 30 additions & 16 deletions docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md
Expand Up @@ -4,21 +4,35 @@ This harvester will harvest metadata as XML files from a filesystem available on

## Adding a Local File System harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the filesystem harvester. It will be shown in the harvesting main page as the name for this instance of the Local Filesystem harvester.
- *Directory* - The path name of the directory containing the metadata (as XML files) to be harvested.
- *Recurse* - If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
- *Keep local if deleted at source* - If checked then metadata records that have already been harvested will be kept even if they have been deleted from the *Directory* specified.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- **Options** - Scheduling options.
- **Harvested Content** - Options that are applied to harvested content.
- *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
- **Privileges** - Assign privileges to harvested metadata.
- **Categories**
To create a Local File System harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `Directory`:

![](img/add-filesystem-harvester.png)

Provide the following information:

!!! Notes
- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- in order to be successfully harvested, metadata records retrieved from the file system must match a metadata schema in the local GeoNetwork instance
- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to Directory**
- *Directory*: The path name of the directory containing the metadata (as XML files) to be harvested. The directory must be accessible by GeoNetwork.
- *Also search in subfolders*: If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
- *Script to run before harvesting*
- *Type of record*

- **Configure response processing for filesystem**
- *Action on UUID collision*: When the harvester finds a record whose UUID matches a record added by another method (another harvester, importer, dashboard editor, ...), this option determines whether the record is skipped (default), overridden, or assigned a new UUID.
- *Update catalog record only if file was updated*
- *Keep local even if deleted at source*: If checked then metadata records that have already been harvested will be kept even if they have been deleted from the *Directory* specified.
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
- *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
- *Category*: (Optional) A GeoNetwork category to assign to each metadata record.

- **Privileges** - Assign privileges to harvested metadata.
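
The three outcomes of *Action on UUID collision* can be sketched as follows. This is an illustration only, not GeoNetwork's implementation: the catalog is modelled as a plain dictionary keyed by UUID, the names `CollisionAction` and `resolve_collision` are hypothetical, and any existing UUID is treated as a collision (the real harvester only considers records collected by another method).

```python
import uuid
from enum import Enum
from typing import Dict, Optional


class CollisionAction(Enum):
    SKIP = "skip"            # default: leave the existing record untouched
    OVERRIDE = "override"    # replace the existing record
    NEW_UUID = "new_uuid"    # store the harvested record under a fresh UUID


def resolve_collision(
    catalog: Dict[str, str],
    record_uuid: str,
    xml: str,
    action: CollisionAction = CollisionAction.SKIP,
) -> Optional[str]:
    """Store a harvested record according to the collision policy.

    Returns the UUID the record was stored under, or None if it was skipped.
    """
    if record_uuid not in catalog:
        # No collision: store the record as-is.
        catalog[record_uuid] = xml
        return record_uuid
    if action is CollisionAction.SKIP:
        return None
    if action is CollisionAction.OVERRIDE:
        catalog[record_uuid] = xml
        return record_uuid
    # NEW_UUID: keep both records by assigning a random UUID to the new one.
    new_id = str(uuid.uuid4())
    catalog[new_id] = xml
    return new_id
```

With the default `SKIP` policy a record already present in the catalog is never modified, which is the safest choice when the same metadata may arrive through several channels.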
Expand Up @@ -11,11 +11,11 @@ To create a GeoNetwork 2.1-3.X harvester go to `Admin console` > `Harvesting` an
Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and optionally a logo to assign to the harvester.
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester should be executed manually from the harvesters page. If enabled a schedule expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to GeoNetwork (from 2.1 to 3.x)**
- *Catalog URL*:
Expand All @@ -35,6 +35,9 @@ Providing the following information:

It can include parameters, which are passed to the XSL transformation, using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&[email protected]&thesaurus=MYORGONLYTHEASURUS`

- *Validate records before import*: If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.

- **Privileges** - Assign privileges to harvested metadata.
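
The three *Validate records before import* choices amount to increasingly strict acceptance rules. The sketch below shows that decision logic under the assumption that the XSD and schematron results are already available as booleans; `ValidationPolicy` and `accept_record` are hypothetical names, and the actual validation is not shown.

```python
from enum import Enum


class ValidationPolicy(Enum):
    NONE = "accept all metadata without validation"
    XSD = "accept metadata that are XSD valid"
    XSD_AND_SCHEMATRON = "accept metadata that are XSD and schematron valid"


def accept_record(policy: ValidationPolicy, xsd_valid: bool, schematron_valid: bool) -> bool:
    """Return True if a harvested record passes the configured policy."""
    if policy is ValidationPolicy.NONE:
        return True
    if policy is ValidationPolicy.XSD:
        return xsd_valid
    # Strictest option: both checks must pass.
    return xsd_valid and schematron_valid
```

A record that is structurally valid but fails the schematron rules is therefore imported under the first two options and rejected only under the third.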
44 changes: 29 additions & 15 deletions docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md
Expand Up @@ -4,24 +4,38 @@ This harvester will connect to a remote GeoPortal version 9.3.x or 10.x server a

## Adding a GeoPortal REST harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the GeoPortal REST harvester.
- *Base URL* - The base URL of the GeoPortal server to be harvested. eg. <http://yourhost.com/geoportal>. The harvester will add the additional path required to access the REST services on the GeoPortal server.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- **Search criteria** - Using the Add button, you can add several search criteria. You can query any field on the GeoPortal server using the Lucene query syntax described at <http://webhelp.esri.com/geoportal_extension/9.3.1/index.htm#srch_lucene.htm>.
- **Options** - Scheduling options.
- **Harvested Content** - Options that are applied to harvested content.
- *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format. See notes section below for typical usage.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
To create a GeoPortal REST harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoPortal REST`:

![](img/add-geoportalrest-harvester.png)

Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to GeoPortal REST**
- *URL*: The base URL of the GeoPortal server to be harvested. eg. <http://yourhost.com/geoportal>. The harvester will add the additional path required to access the REST services on the GeoPortal server.
- *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the server.
- *Search filter*: (Optional) You can query any field on the GeoPortal server using the Lucene query syntax described at <http://webhelp.esri.com/geoportal_extension/9.3.1/index.htm#srch_lucene.htm>.

- **Configure response processing for geoPREST**
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.

- **Privileges** - Assign privileges to harvested metadata.
- **Categories**


!!! Notes

- this harvester uses two REST services from the GeoPortal API:
- This harvester uses two REST services from the GeoPortal API:
- `rest/find/document` with searchText parameter to return an RSS listing of metadata records that meet the search criteria (maximum 100000)
- `rest/document` with id parameter from each result returned in the RSS listing
- this harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
- typically ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, please select the ``Add-gmd-prefix`` XSLT for the *Apply this XSLT to harvested records* in the **Harvested Content** set of options described earlier
- This harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
- Typically, ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, select the ``Add-gmd-prefix`` XSLT for the *XSL transformation to apply* option in the **Configure response processing for geoPREST** set of options described earlier.
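
As a rough illustration of the two REST services named in the notes, the following Python sketch builds the request URLs from the base URL. The function names are hypothetical, and only the `searchText` and `id` parameters mentioned above are included; the real harvester may send additional parameters.

```python
from urllib.parse import urlencode


def find_documents_url(base_url: str, search_text: str) -> str:
    """URL of the RSS listing of records that match a Lucene search text."""
    query = urlencode({"searchText": search_text})
    return f"{base_url.rstrip('/')}/rest/find/document?{query}"


def document_url(base_url: str, record_id: str) -> str:
    """URL of a single metadata record, using an id taken from the RSS listing."""
    query = urlencode({"id": record_id})
    return f"{base_url.rstrip('/')}/rest/document?{query}"


if __name__ == "__main__":
    base = "http://yourhost.com/geoportal"
    print(find_documents_url(base, "title:water"))
    print(document_url(base, "{ABC-123}"))
```

`urlencode` percent-encodes characters such as `:` and `{`, so Lucene queries and GeoPortal identifiers can be passed through unchanged.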