Align main-feature/form-tabs #88

Merged
merged 52 commits into from
Aug 2, 2024
2d39b14
Merge pull request #44 from mjanez/develop
mjanez Apr 2, 2024
743941f
Merge pull request #45 from mjanez/develop
mjanez Apr 3, 2024
05841ef
Merge pull request #46 from mjanez/develop
mjanez Apr 4, 2024
7143d9a
Merge pull request #48 from mjanez/develop
mjanez Apr 9, 2024
bdd7146
Merge pull request #49 from mjanez/develop
mjanez Apr 10, 2024
aedab05
Merge pull request #51 from mjanez/develop
mjanez Apr 11, 2024
d41aa16
Merge pull request #52 from mjanez/develop
mjanez Apr 27, 2024
36608f3
Merge pull request #53 from mjanez/develop
mjanez May 16, 2024
6a13509
Merge pull request #54 from mjanez/develop
mjanez May 16, 2024
41b5afb
Merge pull request #55 from mjanez/develop
mjanez May 17, 2024
627a25e
Merge pull request #56 from mjanez/develop
mjanez May 17, 2024
861a05a
Merge pull request #57 from mjanez/develop
mjanez May 19, 2024
aac5a64
Merge pull request #58 from mjanez/develop
mjanez May 19, 2024
5026846
Merge pull request #59 from mjanez/develop
mjanez May 19, 2024
c880770
Merge pull request #60 from mjanez/develop
mjanez May 21, 2024
447bcc2
Merge pull request #61 from mjanez/develop
mjanez May 21, 2024
b541759
Merge pull request #62 from mjanez/develop
mjanez May 27, 2024
1681777
Merge pull request #63 from mjanez/develop
mjanez May 27, 2024
feb7b5f
Merge pull request #65 from mjanez/develop
mjanez May 28, 2024
1f112d1
Merge pull request #66 from mjanez/develop
mjanez May 28, 2024
3d227b0
Merge pull request #67 from mjanez/develop
mjanez May 28, 2024
6794880
Merge pull request #68 from mjanez/develop
mjanez Jun 7, 2024
3381418
Merge pull request #69 from mjanez/develop
mjanez Jun 7, 2024
19a3478
Merge pull request #70 from mjanez/develop
mjanez Jun 7, 2024
132c50a
Merge pull request #71 from mjanez/develop
mjanez Jun 7, 2024
b56146a
Merge pull request #72 from mjanez/develop
mjanez Jun 7, 2024
6bf6f59
Merge pull request #73 from mjanez/develop
mjanez Jun 13, 2024
e4869d4
Merge pull request #74 from mjanez/develop
mjanez Jun 15, 2024
7a11b96
Merge pull request #75 from mjanez/develop
mjanez Jun 15, 2024
1aa2bf2
Merge pull request #76 from mjanez/develop
mjanez Jun 17, 2024
6413e6f
Merge pull request #77 from mjanez/develop
mjanez Jun 17, 2024
4531fa4
Merge pull request #78 from mjanez/develop
mjanez Jun 17, 2024
fd91797
Merge pull request #79 from mjanez/develop
mjanez Jun 17, 2024
ff061e5
Merge pull request #81 from mjanez/develop
mjanez Jun 27, 2024
a67b050
Merge pull request #82 from mjanez/develop
mjanez Jul 2, 2024
e777613
Merge pull request #83 from mjanez/develop
mjanez Jul 9, 2024
28163ed
Merge pull request #84 from mjanez/develop
mjanez Jul 10, 2024
e7aa229
First approach
mjanez Jul 26, 2024
8ad1c74
Improve ckan harvester
mjanez Jul 29, 2024
3beccce
Improve ckan harvester
mjanez Jul 29, 2024
3661760
Merge pull request #86 from mjanez/feature/ckan-harvester-improve
mjanez Jul 29, 2024
85b79e4
Merge pull request #85 from mjanez/develop
mjanez Jul 30, 2024
36a298e
Fix bug when schemingdcat.endpoints_yaml is None
mjanez Jul 30, 2024
9c9a45e
Merge pull request #89 from mjanez/develop
mjanez Jul 30, 2024
32d7901
Fix file_size in resource metadata info
mjanez Jul 30, 2024
3952322
Fix CKAN harvester search functionality
mjanez Jul 31, 2024
a226240
Improve clean_tags
mjanez Jul 31, 2024
d516044
Merge pull request #90 from mjanez/feature/ckan-harvester-improve
mjanez Jul 31, 2024
28c3d3d
Merge pull request #91 from mjanez/develop
mjanez Jul 31, 2024
ea133a3
Add licenses.json
mjanez Aug 1, 2024
b363a37
Merge pull request #94 from mjanez/feature/dcat-ap-schemas
mjanez Aug 1, 2024
57736d4
Merge branch 'feature/form-tabs' into main
mjanez Aug 2, 2024
129 changes: 125 additions & 4 deletions README.md
Expand Up @@ -346,18 +346,20 @@ To use it, you need to add the `schemingdcat_ckan_harvester` plugin to your opti

The Scheming DCAT CKAN Harvester supports the same configuration options as the [CKAN Harvester](https://github.com/ckan/ckanext-harvest#the-ckan-harvester), plus the following additional options:

* `dataset_field_mapping/distribution_field_mapping` (Optional): Mapping field names from local to remote instance, all info at: [Field mapping structure](#field-mapping-structure)
* `dataset_field_mapping/distribution_field_mapping` (Optional): Mapping field names from local to remote instance, all info at: [CKAN Harvester Field mapping structure](#field-mapping-structure)
* `field_mapping_schema_version` (**Mandatory if** `dataset_field_mapping/distribution_field_mapping` **is set**): Schema version of the field mapping, used to ensure compatibility with older schemas. The default is `2`.
* `schema` (Optional): The name of the schema to use for the harvested datasets. This is the `schema_name` as defined in the scheming file. The remote and local instances must have the same dataset schema. If not provided, the local instance schema will be used.
* `schema` (Optional): The name of the schema to use for the harvested datasets. This is the `schema_name` as defined in the scheming file. The remote and local instances must have the same dataset schema. If not provided, `dataset_field_mapping/distribution_field_mapping` is needed to map the fields.
* `allow_harvest_datasets` (Optional): If `true`, the harvester will create new records even if the package type is from the harvest source. If `false`, the harvester will only create records that originate from the instance. Default is `false`.
* `remote_orgs` (Optional): [WIP]. Only `only_local`.
* `remote_groups` (Optional): [WIP]. Only `only_local`.
* `clean_tags`: By default, tags are stripped of accent characters, spaces and capital letters for display. Setting this option to `False` will keep the original tag names. Default is `True`.
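
To make the `clean_tags` option concrete, here is a minimal Python sketch of the kind of normalisation described above (accents removed, capitals lowered, spaces dropped). The helper names `clean_tag` and `normalize_tags` are hypothetical, and replacing whitespace with hyphens is an assumption; the plugin's actual implementation may differ.

```python
import unicodedata


def clean_tag(name):
    """Hypothetical helper: drop accents, lowercase, and join words with hyphens."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption: runs of whitespace become a single hyphen.
    return "-".join(without_accents.lower().split())


def normalize_tags(tags, clean_tags=True):
    """Return copies of the tag dicts, cleaned only when clean_tags is enabled."""
    if not clean_tags:
        return [dict(tag) for tag in tags]
    return [{**tag, "name": clean_tag(tag["name"])} for tag in tags]
```

With `clean_tags` set to `False`, as in the example configuration, the tag names would pass through unchanged.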

An example configuration might look like this:

```json
{
"api_version": 2,
"clean_tags": false,
"default_tags": [{"name": "inspire"}, {"name": "geodcatap"}],
"default_groups": ["transportation", "hb"],
"default_extras": {"encoding":"utf8", "harvest_description":"Harvesting from Sample Catalog", "harvest_url": "{harvest_source_url}/dataset/{dataset_id}"},
Expand Down Expand Up @@ -400,10 +402,129 @@ And example configuration might look like this:
// "field_value" extends the original list of values retrieved from the remote file for all records.
"field_value": ["https://www.example.org/codelist/a","https://www.example.org/codelist/b", "https://www.example.org/codelist/c"]
},
"my_custom_field": {
// If you need to map a field in a remote dict to the "extras" dict, use the "extras_" prefix to indicate that the field is there.
"field_name": "extras_remote_custom_field"
}
}
}
```
#### Field mapping structure
The `dataset_field_mapping`/`distribution_field_mapping` is structured as follows (multilingual version):

```json
{
...
"field_mapping_schema_version": 2,
"<dataset_field_mapping>/<distribution_field_mapping>": {
"<schema_field_name>": {
"languages": {
"<language>": {
<"field_value": "<fixed_value>/<fixed_value_list>">,/<"field_name": "<field_name>/<field_name_list>">
},
...
},
...
},
...
}
}
```

* `<schema_field_name>`: The name of the field in the CKAN schema.
* `<language>`: (Optional) The language code for multilingual fields. This should be a valid [ISO 639-1 language code](https://localizely.com/iso-639-1-list/). This is now nested under the `languages` key.
* `<fixed_value>/<fixed_value_list>`: (Optional) A fixed value or a list of fixed values that will be assigned to the field for all records.
* **Field labels**: Field name:
    * `<field_name>/<field_name_list>`: (Optional) The name of the field in the remote file or a list of field names.

For fields that are not multilingual, you can directly use `field_name` without the `languages` key. For example:

```json
{
...
"field_mapping_schema_version": 2,
"<dataset_field_mapping>/<distribution_field_mapping>": {
"<schema_field_name>": {
<"field_value": "<fixed_value>/<fixed_value_list>">,/<"field_name": "<field_name>/<field_name_list>">
},
...
}
}
```

>[!IMPORTANT]
>The field mapping can be done either at the dataset level using `dataset_field_mapping` or at the resource level using `distribution_field_mapping`. The structure and options are the same for both. The `field_mapping_schema_version` is `2` by default, but needs to be set to avoid errors.
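
As an illustration of the note above, the following Python sketch checks a harvest source configuration before it is used. The function `check_field_mapping_config` is hypothetical and not part of the plugin; it only encodes the rules stated in this section (a version must accompany any mapping, and every entry needs `field_name` or `field_value`).

```python
MAPPING_KEYS = ("dataset_field_mapping", "distribution_field_mapping")


def check_field_mapping_config(config):
    """Hypothetical check: raise ValueError when a mapping config is incomplete."""
    has_mapping = any(config.get(key) for key in MAPPING_KEYS)
    if has_mapping and "field_mapping_schema_version" not in config:
        raise ValueError(
            "field_mapping_schema_version must be set when a field mapping is used"
        )
    for key in MAPPING_KEYS:
        for field, spec in (config.get(key) or {}).items():
            # Multilingual fields keep one entry per language under "languages".
            if "languages" in spec:
                entries = list(spec["languages"].values())
            else:
                entries = [spec]
            for entry in entries:
                if "field_name" not in entry and "field_value" not in entry:
                    raise ValueError(
                        f"{key}.{field} needs 'field_name' or 'field_value'"
                    )
```

A check like this fails early, instead of surfacing as a harvesting error later in the job.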

#### Field Types
There are two types of fields that can be defined in the configuration:

1. **Regular fields**: These fields have a field label to define the mapping or a fixed value for all its records.
    - **Properties**: A field can have the following properties:
        - **Fixed value fields (`field_value`)**: These fields have a fixed value that is assigned to all records. This is defined using the `field_value` property. If `field_value` is a list, `field_name` can be set at the same time, and `field_value` then extends the list obtained from the remote field.
        - **Field labels**: Field name:
            - **Name based fields (`field_name`)**: These fields are defined by their name in the remote instance. This is defined using the `field_name` property. If you need to map a field in a remote dict to the `extras` dict, use the `extras_` prefix to indicate that the field is there.
2. **Multilingual Fields (`languages`)**: These fields have different values for different languages. Each language is represented as a separate object (`es`, `en`, ...) under the `languages` key of the field object. The language object can have `field_value` and `field_name` properties, just like a normal field.
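
The rules above can be sketched in Python as follows. This is an illustrative reading of the field types, not the harvester's code: `lookup_remote`, `resolve_entry` and `apply_mapping` are hypothetical names, and two behaviours are assumptions: that `extras_` names are read from a CKAN-style `extras` list of `{"key", "value"}` dicts, and that a list of `field_name` values is collected into a list.

```python
def lookup_remote(remote, name):
    """Read one remote field; "extras_"-prefixed names are read from remote["extras"]."""
    if name.startswith("extras_"):
        wanted = name[len("extras_"):]
        # Assumption: extras is a list of {"key": ..., "value": ...} dicts.
        for extra in remote.get("extras", []):
            if extra.get("key") == wanted:
                return extra.get("value")
        return None
    return remote.get(name)


def resolve_entry(remote, entry):
    """Resolve one mapping entry (top-level or per-language) to a value."""
    names = entry.get("field_name")
    fixed = entry.get("field_value")
    if names is None:
        return fixed  # fixed value only
    if isinstance(names, list):
        # Assumption: several remote fields are gathered into one list.
        found = (lookup_remote(remote, n) for n in names)
        value = [v for v in found if v is not None]
    else:
        value = lookup_remote(remote, names)
    if isinstance(fixed, list):
        # A list "field_value" extends the values obtained from the remote field.
        if value is None:
            value = []
        elif not isinstance(value, list):
            value = [value]
        return value + fixed
    return value


def apply_mapping(remote, mapping):
    """Build a local dict from a remote dict using a field mapping."""
    result = {}
    for field, spec in mapping.items():
        if "languages" in spec:
            result[field] = {
                lang: resolve_entry(remote, entry)
                for lang, entry in spec["languages"].items()
            }
        else:
            result[field] = resolve_entry(remote, spec)
    return result
```

Keeping the per-entry logic in one function lets regular and multilingual fields share the same `field_name`/`field_value` handling.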


**Example**
Here is an example configuration file:

* *Field names*: With `field_name` to define the mapping based on names of attributes in the remote instance (`my_title`, `org_identifier`, `keywords`).

```json
{
"api_version": 2,
"clean_tags": false,

...
// other properties
...

"field_mapping_schema_version": 2,
"dataset_field_mapping": {
"title": {
"field_name": "my_title"
},
"title_translated": {
"languages": {
"en": {
"field_name": "my_title-en"
},
"de": {
"field_value": ""
},
"es": {
"field_name": "my_title"
}
}
},
"private": {
"field_name": "private"
},
"theme": {
"field_name": ["theme", "theme_eu"]
},
"tag_custom": {
"field_name": "keywords"
},
"tag_string": {
"field_name": ["theme_a", "theme_b", "theme_c"]
},
"theme_es": {
"field_value": "http://datos.gob.es/kos/sector-publico/sector/medio-ambiente"
},
"tag_uri": {
"field_name": "keyword_uri",
// "field_value" extends the original list of values retrieved from the remote file for all records.
"field_value": ["https://www.example.org/codelist/a","https://www.example.org/codelist/b", "https://www.example.org/codelist/c"]
},
"my_custom_field": {
// If you need to map a field in a remote dict to the "extras" dict, use the "extras_" prefix to indicate that the field is there.
"field_name": "extras_remote_custom_field"
}
}
}
```

### TODO: Scheming DCAT CSW INSPIRE Harvester
A harvester for remote CSW catalogues using the INSPIRE ISO 19139 metadata profile. This harvester is a subclass of the CSW Harvester provided by `ckanext-spatial` and is designed to work with the `schemingdcat` plugin to provide a more versatile and customizable harvester for CSW endpoints and GeoDCAT-AP CKAN instances.
Expand All @@ -429,7 +550,7 @@ Remote Google Sheet/Onedrive Excel metadata upload Harvester supports the follow
* `storage_type` - **Mandatory**: The type of storage to use for the harvested datasets as `onedrive` or `gspread`. Default is `onedrive`.
* `dataset_sheet` - **Mandatory**: The name of the sheet in the Excel file that contains the dataset records.
* `field_mapping_schema_version`: Schema version of the field_mapping to ensure compatibility with older schemas. The default is `2`.
* `dataset_field_mapping/distribution_field_mapping`: Mapping field names from local to remote instance, all info at: [Field mapping structure](#field-mapping-structure)
* `dataset_field_mapping/distribution_field_mapping`: Mapping field names from local to remote instance, all info at: [Field mapping structure](#field-mapping-structure-sheets-harvester)
* `credentials`: The `credentials` parameter should be used to provide the authentication credentials. The credentials depend on the `storage_type` used.
* For `onedrive`: The credentials parameter should be a dictionary with the following keys: `username`: A string representing the username. `password`: A string representing the password.
* For `gspread` or `gdrive`: The credentials parameter should be a string containing the credentials in `JSON` format. You can obtain the credentials by following the instructions provided in the [Google Workspace documentation.](https://developers.google.com/workspace/guides/create-credentials?hl=es-419)
Expand All @@ -452,7 +573,7 @@ Remote Google Sheet/Onedrive Excel metadata upload Harvester supports the follow
* `clean_tags`: By default, tags are stripped of accent characters, spaces and capital letters for display. Setting this option to `False` will keep the original tag names. Default is `True`.
* `source_date_format`: By default the harvester uses [`dateutil`](https://dateutil.readthedocs.io/en/stable/parser.html) to parse the date, but if the date format of the strings is particularly different you can use this parameter to specify the format, e.g. `%d/%m/%Y`. Accepted formats are: [COMMON_DATE_FORMATS](https://github.com/mjanez/ckanext-schemingdcat/blob/main/ckanext/schemingdcat/config.py#L185-L200)
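
To show why an explicit `source_date_format` can matter, here is a small standard-library sketch. It does not reproduce the harvester's `dateutil`-based parsing; `parse_source_date` and `FALLBACK_FORMATS` are hypothetical names, and the fallback list is illustrative rather than the plugin's `COMMON_DATE_FORMATS`.

```python
from datetime import datetime

# Illustrative fallback list; NOT the plugin's actual COMMON_DATE_FORMATS.
FALLBACK_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%dT%H:%M:%S")


def parse_source_date(value, source_date_format=None):
    """Parse a date string, preferring an explicitly configured format."""
    formats = (source_date_format,) if source_date_format else FALLBACK_FORMATS
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Could not parse date {value!r} with formats {formats}")
```

A string such as `02/08/2024` is ambiguous between day-first and month-first conventions, so passing `%d/%m/%Y` removes the guesswork.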

#### Field mapping structure
#### Field mapping structure (Sheets harvester)
The `dataset_field_mapping`/`distribution_field_mapping` is structured as follows (multilingual version):

```json
Expand Down