merge from Master
jmckenna committed Oct 16, 2024
2 parents 5f7bc56 + d6cb43c commit cf8221b
Showing 70 changed files with 2,249 additions and 59 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/json-yaml-validate.yml
@@ -15,7 +15,7 @@ jobs:
uses: actions/checkout@v4

- name: json-yaml-validate
uses: GrantBirki/json-yaml-validate@v2.7.1
uses: GrantBirki/json-yaml-validate@v3.2.1
with:
base_dir: "./dataGraphs/thematics"

1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
.idea/
/secret/
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@
DO NOT REFERENCE

This is a temporary demo repository. It is being
used to expore external references to data graphs for use
used to explore external references to data graphs for use
in the OIH Book.


129 changes: 129 additions & 0 deletions SPARQL/OBIS/README.md
@@ -0,0 +1,129 @@
# OBIS Depth review

## About

Notes and comments on the OBIS data as it relates to depth values that might align
with the guidance at: XYZ

A simple query to look for the term "depth" in variable names for OBIS

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
```

Find the unique instances of these names. Note that this matches only the string
value, so it may include false positives.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
```

This produces example output like the following.

| Object Name | Count |
|-------------------|------------------- |
| water depth | "36"^^xsd:integer |
| Depth (m) | "14"^^xsd:integer |
| sampling_depth | "12"^^xsd:integer |
| Water depth | "9"^^xsd:integer |
| MinimumDepth_cm | "8"^^xsd:integer |
| MaximumDepth_cm | "8"^^xsd:integer |
| sampling_depth_min| "7"^^xsd:integer |


In terms of numerical values, there are no uses
of maxValue, minValue, or value in the types
referenced by variableMeasured.

We can start to look at the data associated with the metadata
records.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
```

We can also use the above SPARQL to find resources that
both mention depth and have distribution links.


The OBIS records have depth values in the measurement tables, but
in most cases depth will be in two Darwin Core fields
that currently do not feed into variableMeasured.

The depth data is part of the archive referenced in the distribution.
See https://ipt.vliz.be/eurobis/archive.do?r=smhi-zoobenthos-reg
for an example, which has >1M measurements.

These are Darwin Core Archives containing the metadata
and table values. Tooling exists for reading these archives, such as
https://python-dwca-reader.readthedocs.io/en/latest/tutorial.html.

After installing and experimenting with this package for a few minutes,
it was able to read the archives and scan for Darwin Core terms such as
http://rs.tdwg.org/dwc/terms/maximumDepthInMeters.
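
A minimal sketch of that workflow, assuming an archive has been downloaded
locally (the `archive.zip` filename is a placeholder). It sets up the
`core_df` used in the snippet further below.

```python
from dwca.read import DwCAReader

# Sketch: read a downloaded Darwin Core Archive and scan its core table
# for the maximumDepthInMeters term. The archive path is a placeholder.
with DwCAReader("archive.zip") as dwca:
    core_df = dwca.pd_read(dwca.core_file_location)
    if "maximumDepthInMeters" in core_df.columns:
        print(core_df["maximumDepthInMeters"].max())
```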

Doing this for all archives in OBIS might be a bit much, though I
don't know how many DwC-A files there are in the distribution
links. It can be discussed further if needed.

At that point, doing things like

```python
core_df['maximumDepthInMeters'].max() # 353.5 meters
```

is straightforward.

## OBIS API leveraging

There is currently no way to get depth statistics by dataset from the API except
by going through all records, which I wouldn't recommend.
One thing you could do is get dataset lists for depth slices,
e.g. https://api.obis.org/dataset?startdepth=5000&enddepth=6000.
This is not the best approach, since you have to query by ranges and then fetch the related resources.
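
As a sketch, such a depth-slice listing could be pulled as follows. The endpoint
and parameters come from the example URL above; the response shape assumed here
(a `results` list of datasets with `id` and `title` fields) is an assumption to
verify against the API.

```python
import requests

# Sketch: list datasets that have records in a given depth slice.
# The response structure is assumed and should be checked.
resp = requests.get(
    "https://api.obis.org/dataset",
    params={"startdepth": 5000, "enddepth": 6000},
    timeout=60,
)
resp.raise_for_status()
for ds in resp.json().get("results", []):
    print(ds.get("id"), ds.get("title"))
```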
However, there is a Parquet (and CSV) export at https://obis.org/data/access/. Pieter said
that the Parquet has depth in the form of the Darwin Core
fields minimumDepthInMeters and maximumDepthInMeters, so this might be the best route.
Pieter doesn't have time to work on this right away, but it might be easy for us to make an
"auxiliary" graph that we can test with and also share with Pieter, in the hope that it helps
him integrate the values into the production service eventually.
I am hoping that the id in the Parquet is the JSON-LD @id, like https://obis.org/dataset/24e96d02-8909-4431-bc61-8cf8eadc9b7a.
If that is the case, this will be very easy! I am currently pulling down the Parquet (18 GB) and
will report what I find.
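
If that holds, a first pass over the export might look like the following sketch;
the file name is a placeholder, and grouping by `id` assumes it really is the
dataset @id, which still needs to be confirmed.

```python
import duckdb

# Sketch: per-dataset depth ranges from the OBIS Parquet export.
# "obis.parquet" is a placeholder path; the id column is assumed to
# be the dataset identifier.
con = duckdb.connect()
depths = con.execute("""
    SELECT id,
           MIN(minimumDepthInMeters) AS min_depth_m,
           MAX(maximumDepthInMeters) AS max_depth_m
    FROM read_parquet('obis.parquet')
    GROUP BY id
    ORDER BY max_depth_m DESC
""").df()
print(depths.head())
```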


## References

* https://github.com/iodepo/odis-arch/blob/master/book/thematics/depth/index.md
* https://github.com/iodepo/odis-in/tree/master/SPARQL/OBIS (this document)
11 changes: 11 additions & 0 deletions SPARQL/OBIS/depth.rq
@@ -0,0 +1,11 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
12 changes: 12 additions & 0 deletions SPARQL/OBIS/distributions.rq
@@ -0,0 +1,12 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
12 changes: 12 additions & 0 deletions SPARQL/OBIS/scratch.rq
@@ -0,0 +1,12 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
1 change: 0 additions & 1 deletion SPARQL/interop/orcidFind.rq
@@ -10,7 +10,6 @@ PREFIX schemax: <http://schema.org/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>



SELECT ( COUNT (?obj) as ?count)
WHERE {
?s schema:identifier ?obj .
149 changes: 149 additions & 0 deletions SQL/README.md
@@ -0,0 +1,149 @@
# DuckDB Notes


## Notes

### July 9

Focus on oihv2 since it has the fuller set of tables in it. However, we really need the
ability to set that up from the S3 store; a sketch of that follows the commands below.

```sql

CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
CREATE TABLE course (id VARCHAR, type VARCHAR, txt_location VARCHAR);
CREATE TABLE person (id VARCHAR, type VARCHAR, address VARCHAR, txt_knowsAbout VARCHAR, txt_knowsLanguage VARCHAR);
CREATE TABLE sup_geo (id VARCHAR, type VARCHAR, placename VARCHAR, geotype VARCHAR, geompred VARCHAR, geom VARCHAR, lat VARCHAR, long VARCHAR, g VARCHAR );

COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_temporal.parquet';
COPY course FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_course.parquet';
COPY person FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_person.parquet';
COPY sup_geo FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_geo.parquet';


```
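
A hedged sketch of the S3 route, using DuckDB's httpfs extension from Python;
the endpoint, bucket path, and credentials here are placeholders, not the real
store.

```python
import duckdb

# Sketch: build the base table straight from Parquet files on S3.
# Endpoint, bucket, and keys below are placeholders.
con = duckdb.connect("oihv2.duckdb")
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint = 's3.example.org';")
con.execute("SET s3_access_key_id = 'PLACEHOLDER_KEY';")
con.execute("SET s3_secret_access_key = 'PLACEHOLDER_SECRET';")
con.execute("""
    CREATE TABLE base AS
    SELECT * FROM read_parquet('s3://example-bucket/active/*_baseQuery.parquet');
""")
```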

The old examples used union_by_name, but I am not sure what its value is here.

```sql
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```



### Older

Need to get to: [solrExample.json](solrExample.json)

This SQL statement will return all columns where there's a matching id in both table1 and table2.
```sql
SELECT *
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

If you want to include all records from one of the tables,
even if there's no matching id in the other table, you would use a LEFT JOIN or RIGHT JOIN:
```sql
SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;
```

NOTE: If the id column exists in both of your tables,
you will need to use an alias to distinguish between them in your SELECT statement, like so:
```sql
SELECT table1.id AS id1, table2.id AS id2, ...
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

## Console Commands
Don't create schemas, just tables for each Parquet file. This is the simpler approach.

Console commands 2:
```
CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet';
CREATE TABLE course AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```

The following approach didn't seem to work as well, or to be necessary, since I can
make tables, load the Parquet files into the tables, and then join across those.

Console commands 1:
```
CREATE SCHEMA base;
CREATE SCHEMA course;
CREATE SCHEMA dataset;
CREATE SCHEMA person;
CREATE SCHEMA sup_geo;
CREATE SCHEMA sup_time;
CREATE TABLE dataset.data (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE base.data (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
COPY dataset.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY base.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
CREATE TABLE course.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
CREATE TABLE sup_time.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet', union_by_name=true);
```

## Columns we need

```
"id",
"type",
"txt_creator",
"txt_dateModified",
"txt_datePublished",
"description",
"txt_distribution",
"id_includedInDataCatalog",
"txt_includedInDataCatalog",
"txt_keywords",
"txt_license",
"name",
"id_provider",
"txt_provider",
"id_publisher",
"txt_publisher",
"geom_type",
"has_geom",
"geojson_point",
"geojson_simple",
"geojson_geom",
"geom_area",
"geom_length",
"the_geom",
"dt_startDate",
"n_startYear",
"dt_endDate",
"n_endYear",
"txt_temporalCoverage",
"txt_url",
"txt_variableMeasured",
"txt_version"
```
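
As a sketch of how the joined tables could be pushed toward these fields,
assuming the database built above; only a few of the columns are shown, and the
column mapping here is illustrative rather than final.

```python
import duckdb

# Sketch: export joined rows under Solr-style field names.
# Only a handful of the target columns are mapped here.
con = duckdb.connect("oihv2.duckdb")
docs = con.execute("""
    SELECT base.id          AS id,
           base.type        AS type,
           base.name        AS name,
           base.url         AS txt_url,
           base.description AS description,
           dataset.keyword  AS txt_keywords
    FROM base
    JOIN dataset ON base.id = dataset.id
""").df()
docs.to_json("solrDocs.json", orient="records")
```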

4 changes: 4 additions & 0 deletions SQL/aggBase.sql
@@ -0,0 +1,4 @@
SELECT id, ANY_VALUE(type) AS type, ANY_VALUE(name) AS name, ANY_VALUE(url) AS url, ANY_VALUE(description) AS description
FROM base
GROUP BY base.id

4 changes: 4 additions & 0 deletions SQL/aggDataset.sql
@@ -0,0 +1,4 @@
SELECT id, ANY_VALUE(includedInDataCatalog) AS includedInDataCatalog, STRING_AGG(keyword, ', ') AS kw_list
FROM dataset
GROUP BY dataset.id

14 changes: 14 additions & 0 deletions SQL/aggUnion.sql
@@ -0,0 +1,14 @@
SELECT base_agg.id, base_agg.type_list, base_agg.name_list, dataset_agg.kw_list, base_agg.b_url, base_agg.b_desc, base_agg.b_headline
FROM (
SELECT id, STRING_AGG(DISTINCT type, ', ') AS type_list, STRING_AGG(DISTINCT name, ', ') AS name_list,
ANY_VALUE(url) AS b_url, ANY_VALUE(description) AS b_desc, ANY_VALUE(headline) AS b_headline
FROM base
GROUP BY id
) AS base_agg
JOIN (
SELECT id, ANY_VALUE(includedInDataCatalog) AS includedInDataCatalog, STRING_AGG(DISTINCT keyword, ', ') AS kw_list
FROM dataset
GROUP BY id
) AS dataset_agg
ON base_agg.id = dataset_agg.id
ORDER BY base_agg.id;
4 changes: 4 additions & 0 deletions SQL/innerJoin.sql
@@ -0,0 +1,4 @@
SELECT *
FROM dataset.data
INNER JOIN base.data
ON dataset.data.id = base.data.id;
6 changes: 6 additions & 0 deletions SQL/join.sql
@@ -0,0 +1,6 @@
SELECT base.id, base.type, base.name, dataset.keyword
FROM base
INNER JOIN dataset
ON base.id = dataset.id
GROUP BY base.id, base.type, base.name, dataset.keyword
LIMIT 1000;
1 change: 1 addition & 0 deletions SQL/pragma.sql
@@ -0,0 +1 @@
SELECT * FROM PRAGMA_table_info('sup_time');
4 changes: 4 additions & 0 deletions SQL/solr1.sql
@@ -0,0 +1,4 @@
SELECT dataset.id, dataset.type, dataset.license, dataset.keyword
FROM dataset
JOIN base
ON dataset.id = base.id;
