merge from Master
jmckenna committed Oct 16, 2024
2 parents 5f7bc56 + d6cb43c commit cf8221b
Showing 70 changed files with 2,249 additions and 59 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/json-yaml-validate.yml
@@ -15,7 +15,7 @@ jobs:
uses: actions/checkout@v4

- name: json-yaml-validate
uses: GrantBirki/json-yaml-validate@v2.7.1
uses: GrantBirki/json-yaml-validate@v3.2.1
with:
base_dir: "./dataGraphs/thematics"

1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
.idea/
/secret/
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@
DO NOT REFERENCE

This is a temporary demo repository. It is being
used to expore external references to data graphs for use
used to explore external references to data graphs for use
in the OIH Book.


129 changes: 129 additions & 0 deletions SPARQL/OBIS/README.md
@@ -0,0 +1,129 @@
# OBIS Depth review

## About

Notes and comments on the OBIS data as it relates to depth values that might align
with the guidance at: XYZ

A simple query to look for the term "depth" in variable names for OBIS

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
```

Find the unique instances of these names. Note that this matches only the string
value, so it may include false positives.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
```

This produces example output like the following.

| Object Name | Count |
|-------------------|------------------- |
| water depth | "36"^^xsd:integer |
| Depth (m) | "14"^^xsd:integer |
| sampling_depth | "12"^^xsd:integer |
| Water depth | "9"^^xsd:integer |
| MinimumDepth_cm | "8"^^xsd:integer |
| MaximumDepth_cm | "8"^^xsd:integer |
| sampling_depth_min| "7"^^xsd:integer |


In terms of numerical values, there are no uses
of maxValue, minValue, or value in the types
referenced by variableMeasured.

We can start to look at the data associated with the metadata
records.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
```

We can also use the above SPARQL to find resources that
both mention depth and have distribution links.


The OBIS records have depth values in the measurement tables, but
in most cases depth will be in two Darwin Core fields
that currently do not feed into variableMeasured.

The depth data is part of the archive referenced in the distribution.
See https://ipt.vliz.be/eurobis/archive.do?r=smhi-zoobenthos-reg
for an example, which has >1M measurements.

These are Darwin Core Archives containing the metadata
and table values. Tooling exists for reading these archives, such as
https://python-dwca-reader.readthedocs.io/en/latest/tutorial.html.

After installing and experimenting with this package for a few minutes,
it was able to read the archives and scan for Darwin Core terms such as
http://rs.tdwg.org/dwc/terms/maximumDepthInMeters.
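
A minimal sketch of that workflow, assuming an archive has been downloaded
locally (the `archive.zip` filename is a placeholder). It sets up the
`core_df` used in the snippet further below.

```python
from dwca.read import DwCAReader

# Sketch: read a downloaded Darwin Core Archive and scan its core table
# for the maximumDepthInMeters term. The archive path is a placeholder.
with DwCAReader("archive.zip") as dwca:
    core_df = dwca.pd_read(dwca.core_file_location)
    if "maximumDepthInMeters" in core_df.columns:
        print(core_df["maximumDepthInMeters"].max())
```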

Doing this for all archives in OBIS might be a bit much, though I
don't know how many DwC-A files there are in the distribution
links. It can be discussed further if needed.

At that point, doing things like

```python
core_df['maximumDepthInMeters'].max() # 353.5 meters
```

is straightforward.

## OBIS API leveraging

There is currently no way to get depth statistics by dataset from the API except
by going through all records, which I wouldn't recommend.
One thing you could do is get dataset lists for depth slices,
e.g. https://api.obis.org/dataset?startdepth=5000&enddepth=6000.
This is not the best approach, since you have to query by ranges and then fetch the related resources.
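
As a sketch, such a depth-slice listing could be pulled as follows. The endpoint
and parameters come from the example URL above; the response shape assumed here
(a `results` list of datasets with `id` and `title` fields) is an assumption to
verify against the API.

```python
import requests

# Sketch: list datasets that have records in a given depth slice.
# The response structure is assumed and should be checked.
resp = requests.get(
    "https://api.obis.org/dataset",
    params={"startdepth": 5000, "enddepth": 6000},
    timeout=60,
)
resp.raise_for_status()
for ds in resp.json().get("results", []):
    print(ds.get("id"), ds.get("title"))
```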
However, there is a Parquet (and CSV) export at https://obis.org/data/access/. Pieter said
that the Parquet has depth in the form of the Darwin Core
fields minimumDepthInMeters and maximumDepthInMeters, so this might be the best route.
Pieter doesn't have time to work on this right away, but it might be easy for us to make an
"auxiliary" graph that we can test with and also share with Pieter, in the hope that it helps
him integrate the values into the production service eventually.
I am hoping that the id in the Parquet is the JSON-LD @id, like https://obis.org/dataset/24e96d02-8909-4431-bc61-8cf8eadc9b7a.
If that is the case, this will be very easy! I am currently pulling down the Parquet (18 GB) and
will report what I find.
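
If that holds, a first pass over the export might look like the following sketch;
the file name is a placeholder, and grouping by `id` assumes it really is the
dataset @id, which still needs to be confirmed.

```python
import duckdb

# Sketch: per-dataset depth ranges from the OBIS Parquet export.
# "obis.parquet" is a placeholder path; the id column is assumed to
# be the dataset identifier.
con = duckdb.connect()
depths = con.execute("""
    SELECT id,
           MIN(minimumDepthInMeters) AS min_depth_m,
           MAX(maximumDepthInMeters) AS max_depth_m
    FROM read_parquet('obis.parquet')
    GROUP BY id
    ORDER BY max_depth_m DESC
""").df()
print(depths.head())
```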


## References

* https://github.com/iodepo/odis-arch/blob/master/book/thematics/depth/index.md
* https://github.com/iodepo/odis-in/tree/master/SPARQL/OBIS (this document)
11 changes: 11 additions & 0 deletions SPARQL/OBIS/depth.rq
@@ -0,0 +1,11 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
12 changes: 12 additions & 0 deletions SPARQL/OBIS/distributions.rq
@@ -0,0 +1,12 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
12 changes: 12 additions & 0 deletions SPARQL/OBIS/scratch.rq
@@ -0,0 +1,12 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?sid ?name ?url
WHERE {
?sid schema:distribution ?dist .
?dist schema:contentUrl ?url .
?sid schema:variableMeasured ?prop .
?prop schema:name ?name .
FILTER regex(str(?name), "depth", "i")
}
1 change: 0 additions & 1 deletion SPARQL/interop/orcidFind.rq
@@ -10,7 +10,6 @@ PREFIX schemax: <http://schema.org/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>



SELECT ( COUNT (?obj) as ?count)
WHERE {
?s schema:identifier ?obj .
149 changes: 149 additions & 0 deletions SQL/README.md
@@ -0,0 +1,149 @@
# DuckDB Notes


## Notes

### July 9

Focus on oihv2 since it has the fuller set of tables in it. However, we really need the
ability to set that up from the S3 store; a sketch of that follows the commands below.

```sql

CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
CREATE TABLE course (id VARCHAR, type VARCHAR, txt_location VARCHAR);
CREATE TABLE person (id VARCHAR, type VARCHAR, address VARCHAR, txt_knowsAbout VARCHAR, txt_knowsLanguage VARCHAR);
CREATE TABLE sup_geo (id VARCHAR, type VARCHAR, placename VARCHAR, geotype VARCHAR, geompred VARCHAR, geom VARCHAR, lat VARCHAR, long VARCHAR, g VARCHAR );

COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_temporal.parquet';
COPY course FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_course.parquet';
COPY person FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_person.parquet';
COPY sup_geo FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_geo.parquet';


```
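
A hedged sketch of the S3 route, using DuckDB's httpfs extension from Python;
the endpoint, bucket path, and credentials here are placeholders, not the real
store.

```python
import duckdb

# Sketch: build the base table straight from Parquet files on S3.
# Endpoint, bucket, and keys below are placeholders.
con = duckdb.connect("oihv2.duckdb")
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint = 's3.example.org';")
con.execute("SET s3_access_key_id = 'PLACEHOLDER_KEY';")
con.execute("SET s3_secret_access_key = 'PLACEHOLDER_SECRET';")
con.execute("""
    CREATE TABLE base AS
    SELECT * FROM read_parquet('s3://example-bucket/active/*_baseQuery.parquet');
""")
```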

The old examples used union_by_name, but I am not sure what its value is here.

```sql
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```



### Older

Need to get to: [solrExample.json](solrExample.json)

This SQL statement will return all columns where there's a matching id in both table1 and table2.
```sql
SELECT *
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

If you want to include all records from one of the tables,
even if there's no matching id in the other table, you would use a LEFT JOIN or RIGHT JOIN:
```sql
SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;
```

NOTE: If the id column exists in both of your tables,
you will need to use an alias to distinguish between them in your SELECT statement, like so:
```sql
SELECT table1.id AS id1, table2.id AS id2, ...
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

## Console Commands
Don't create schemas, just tables for each Parquet file. This is the simpler approach.

Console commands 2:
```
CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet';
CREATE TABLE course AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```

The following approach didn't seem to work as well, or to be necessary, since I can
make tables, load the Parquet files into the tables, and then join across those.

Console commands 1:
```
CREATE SCHEMA base;
CREATE SCHEMA course;
CREATE SCHEMA dataset;
CREATE SCHEMA person;
CREATE SCHEMA sup_geo;
CREATE SCHEMA sup_time;
CREATE TABLE dataset.data (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE base.data (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR );
COPY dataset.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY base.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
CREATE TABLE course.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
CREATE TABLE sup_time.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet', union_by_name=true);
```

## Columns we need

```
"id",
"type",
"txt_creator",
"txt_dateModified",
"txt_datePublished",
"description",
"txt_distribution",
"id_includedInDataCatalog",
"txt_includedInDataCatalog",
"txt_keywords",
"txt_license",
"name",
"id_provider",
"txt_provider",
"id_publisher",
"txt_publisher",
"geom_type",
"has_geom",
"geojson_point",
"geojson_simple",
"geojson_geom",
"geom_area",
"geom_length",
"the_geom",
"dt_startDate",
"n_startYear",
"dt_endDate",
"n_endYear",
"txt_temporalCoverage",
"txt_url",
"txt_variableMeasured",
"txt_version"
```
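
As a sketch of how the joined tables could be pushed toward these fields,
assuming the database built above; only a few of the columns are shown, and the
column mapping here is illustrative rather than final.

```python
import duckdb

# Sketch: export joined rows under Solr-style field names.
# Only a handful of the target columns are mapped here.
con = duckdb.connect("oihv2.duckdb")
docs = con.execute("""
    SELECT base.id          AS id,
           base.type        AS type,
           base.name        AS name,
           base.url         AS txt_url,
           base.description AS description,
           dataset.keyword  AS txt_keywords
    FROM base
    JOIN dataset ON base.id = dataset.id
""").df()
docs.to_json("solrDocs.json", orient="records")
```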

4 changes: 4 additions & 0 deletions SQL/aggBase.sql
@@ -0,0 +1,4 @@
SELECT id, ANY_VALUE(type) AS type, ANY_VALUE(name) AS name, ANY_VALUE(url) AS url, ANY_VALUE(description) AS description
FROM base
GROUP BY base.id

4 changes: 4 additions & 0 deletions SQL/aggDataset.sql
@@ -0,0 +1,4 @@
SELECT id, ANY_VALUE(includedInDataCatalog) AS includedInDataCatalog, STRING_AGG(keyword, ', ') AS kw_list
FROM dataset
GROUP BY dataset.id

14 changes: 14 additions & 0 deletions SQL/aggUnion.sql
@@ -0,0 +1,14 @@
SELECT base_agg.id, base_agg.type_list, base_agg.name_list, dataset_agg.kw_list, base_agg.b_url, base_agg.b_desc, base_agg.b_headline
FROM (
SELECT id, STRING_AGG(DISTINCT type, ', ') AS type_list, STRING_AGG(DISTINCT name, ', ') AS name_list,
ANY_VALUE(url) AS b_url, ANY_VALUE(description) AS b_desc, ANY_VALUE(headline) AS b_headline
FROM base
GROUP BY id
) AS base_agg
JOIN (
SELECT id, ANY_VALUE(includedInDataCatalog) AS includedInDataCatalog, STRING_AGG(DISTINCT keyword, ', ') AS kw_list
FROM dataset
GROUP BY id
) AS dataset_agg
ON base_agg.id = dataset_agg.id
ORDER BY base_agg.id;
4 changes: 4 additions & 0 deletions SQL/innerJoin.sql
@@ -0,0 +1,4 @@
SELECT *
FROM dataset.data
INNER JOIN base.data
ON dataset.data.id = base.data.id;
6 changes: 6 additions & 0 deletions SQL/join.sql
@@ -0,0 +1,6 @@
SELECT base.id, base.type, base.name, dataset.keyword
FROM base
INNER JOIN dataset
ON base.id = dataset.id
GROUP BY base.id, base.type, base.name, dataset.keyword
LIMIT 1000;
1 change: 1 addition & 0 deletions SQL/pragma.sql
@@ -0,0 +1 @@
SELECT * FROM PRAGMA_table_info('sup_time');
4 changes: 4 additions & 0 deletions SQL/solr1.sql
@@ -0,0 +1,4 @@
SELECT dataset.id, dataset.type, dataset.license, dataset.keyword
FROM dataset
JOIN base
ON dataset.id = base.id;
