Showing 70 changed files with 2,249 additions and 59 deletions.
```diff
@@ -1 +1,2 @@
 .idea/
+/secret/
```
# OBIS Depth review

## About

Notes and comments on the OBIS data as related to depth values that might align
with the guidance at: XYZ

A simple query to look for the term "depth" in variable names in OBIS:
```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name
WHERE {
  ?sid schema:variableMeasured ?prop .
  ?prop schema:name ?name .
  FILTER regex(str(?name), "depth", "i")
}
```

Find the unique instances of these names. Note that this matches only on the
string value, so it may include false matches.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
  ?sid schema:variableMeasured ?prop .
  ?prop schema:name ?name .
  FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
```

This produces example output like the following.

| Object Name        | Count |
|--------------------|-------|
| water depth        | 36    |
| Depth (m)          | 14    |
| sampling_depth     | 12    |
| Water depth        | 9     |
| MinimumDepth_cm    | 8     |
| MaximumDepth_cm    | 8     |
| sampling_depth_min | 7     |

In terms of numerical values, there are no uses of maxValue, minValue, or
value in the types referenced by variableMeasured.

We can start to look at the data associated with the metadata records.

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
SELECT ?sid ?name ?url
WHERE {
  ?sid schema:distribution ?dist .
  ?dist schema:contentUrl ?url .
  ?sid schema:variableMeasured ?prop .
  ?prop schema:name ?name .
  FILTER regex(str(?name), "depth", "i")
}
```

The above SPARQL finds resources that both mention depth and have
distribution links.

These resources have depth values in their measurement tables, but in most
cases depth is recorded in two Darwin Core fields that currently do not feed
into variableMeasured.

The depth data is part of the archive referenced in the distribution. See,
for example, https://ipt.vliz.be/eurobis/archive.do?r=smhi-zoobenthos-reg,
which has >1M measurements.

These are Darwin Core Archives containing both the metadata and the table
values. There is tooling for reading them, such as
https://python-dwca-reader.readthedocs.io/en/latest/tutorial.html, which can
read and interpret these archives.

After installing and experimenting with this package for a few minutes, it
was able to read the archives and scan Darwin Core terms such as
http://rs.tdwg.org/dwc/terms/maximumDepthInMeters.
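
As a minimal sketch (assuming python-dwca-reader and pandas are installed,
and with `dwca.zip` as a placeholder name for a locally downloaded archive),
loading the core table looks like:

```python
# Minimal sketch: 'dwca.zip' is a placeholder for a downloaded archive.
from dwca.read import DwCAReader

with DwCAReader("dwca.zip") as dwca:
    # core_file_location is the path of the core data file inside the archive;
    # pd_read loads it into a pandas DataFrame whose columns are the short
    # Darwin Core term names (e.g. maximumDepthInMeters).
    core_df = dwca.pd_read(dwca.core_file_location, parse_dates=True)
```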

Doing this for all archives in OBIS might be a bit much, though I don't know
how many DwC-A files there are in the distribution links. It can be discussed
further if needed.

At that point, things like

```python
core_df['maximumDepthInMeters'].max()  # 353.5 meters
```

are straightforward to compute.

## Leveraging the OBIS API

There is currently no way to get depth statistics by dataset from the API
except by going through all records, which I wouldn't recommend. One thing
you could do is get dataset lists for depth slices, e.g.
https://api.obis.org/dataset?startdepth=5000&enddepth=6000. This is not the
best approach, since it seems you have to query by ranges and then fetch the
related resources.

However, there is a Parquet (and CSV) export at https://obis.org/data/access/.
Pieter said that the Parquet has depth in the form of the Darwin Core fields
minimumDepthInMeters and maximumDepthInMeters, so this might be the best
route. Pieter doesn't have time to work on this right away, but it might be
easy for us to make an "auxiliary" graph that we can test with and also share
with Pieter, in the hope that it helps him integrate the values into the
production service eventually.

I am hoping that the id in the Parquet is the JSON-LD @id, like
https://obis.org/dataset/24e96d02-8909-4431-bc61-8cf8eadc9b7a. If that is the
case, this will be very easy! I am currently pulling down the Parquet (18 GB)
and will report what I find.
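
If the Darwin Core depth fields are present as described, a per-dataset depth
summary could be computed directly from the export with DuckDB. This is a
sketch only: the `id` and depth column names are assumptions based on the
description above, and `obis.parquet` is a placeholder path.

```sql
-- Sketch: column names and 'obis.parquet' are assumptions, not confirmed.
SELECT id,
       MIN(minimumDepthInMeters) AS min_depth,
       MAX(maximumDepthInMeters) AS max_depth,
       COUNT(*) AS record_count
FROM read_parquet('obis.parquet')
GROUP BY id
ORDER BY max_depth DESC;
```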

## References

* https://github.com/iodepo/odis-arch/blob/master/book/thematics/depth/index.md
* https://github.com/iodepo/odis-in/tree/master/SPARQL/OBIS (this document)

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?name (COUNT(DISTINCT ?sid) as ?count)
WHERE {
  ?sid schema:variableMeasured ?prop .
  ?prop schema:name ?name .
  FILTER regex(str(?name), "depth", "i")
}
GROUP BY ?name ORDER BY DESC(?count)
```

```SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

SELECT ?sid ?name ?url
WHERE {
  ?sid schema:distribution ?dist .
  ?dist schema:contentUrl ?url .
  ?sid schema:variableMeasured ?prop .
  ?prop schema:name ?name .
  FILTER regex(str(?name), "depth", "i")
}
```
# DuckDB Notes

## Notes

### July 9

Focus on oihv2, since it has the fuller set of tables in it. However, we
really need the ability to set this up from the S3 store; a sketch of that
follows the block below.

```sql
CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR);
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
CREATE TABLE course (id VARCHAR, type VARCHAR, txt_location VARCHAR);
CREATE TABLE person (id VARCHAR, type VARCHAR, address VARCHAR, txt_knowsAbout VARCHAR, txt_knowsLanguage VARCHAR);
CREATE TABLE sup_geo (id VARCHAR, type VARCHAR, placename VARCHAR, geotype VARCHAR, geompred VARCHAR, geom VARCHAR, lat VARCHAR, long VARCHAR, g VARCHAR);

COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_temporal.parquet';
COPY course FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_course.parquet';
COPY person FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_person.parquet';
COPY sup_geo FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/active/*_sup_geo.parquet';
```

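For the S3 variant mentioned above, a minimal sketch (assuming the DuckDB
httpfs extension; the bucket path below is a placeholder, and region and
credentials would need to match the real store):

```sql
-- Sketch: 's3://oih-graph/...' is a placeholder path, not the real bucket.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';  -- assumption; set to the bucket's actual region
CREATE TABLE base AS
  SELECT * FROM read_parquet('s3://oih-graph/output/active/*_baseQuery.parquet');
```
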
The old examples used union_by_name, which unifies the schemas of Parquet
files whose column sets differ; I am not sure it adds value here.

```sql
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```

### Older

Need to get to: [solrExample.json](solrExample.json)

This SQL statement will return all columns where there's a matching id in both table1 and table2.

```sql
SELECT *
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

If you want to include all records from one of the tables, even if there's no
matching id in the other table, you would use a LEFT JOIN or RIGHT JOIN:

```sql
SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;
```

NOTE: If the id column exists in both of your tables, you will need to use an
alias to distinguish between them in your SELECT statement, like so:

```sql
SELECT table1.id AS id1, table2.id AS id2, ...
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```

## Console Commands

Don't create schemas; just create tables for each Parquet file. This is the
simpler approach (console commands 2):

```sql
CREATE TABLE base (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR);
CREATE TABLE dataset (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE sup_time (id VARCHAR, type VARCHAR, time VARCHAR, temporalCoverage VARCHAR, dateModified VARCHAR, datePublished VARCHAR);
COPY base FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
COPY dataset FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY sup_time FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet';
CREATE TABLE course AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
```

The following approach didn't seem to work as well, or to be necessary, since
I can make tables, load the Parquet files into them, and then join across
those (console commands 1):

```sql
CREATE SCHEMA base;
CREATE SCHEMA course;
CREATE SCHEMA dataset;
CREATE SCHEMA person;
CREATE SCHEMA sup_geo;
CREATE SCHEMA sup_time;
CREATE TABLE dataset.data (id VARCHAR, type VARCHAR, sameAs VARCHAR, license VARCHAR, citation VARCHAR, keyword VARCHAR, includedInDataCatalog VARCHAR, distribution VARCHAR, region VARCHAR, provider VARCHAR, publisher VARCHAR, creator VARCHAR);
CREATE TABLE base.data (id VARCHAR, type VARCHAR, name VARCHAR, url VARCHAR, description VARCHAR, headline VARCHAR, g VARCHAR);
COPY dataset.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_dataset.parquet';
COPY base.data FROM '/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_baseQuery.parquet';
CREATE TABLE course.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_course.parquet', union_by_name=true);
CREATE TABLE person.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_person.parquet', union_by_name=true);
CREATE TABLE sup_geo.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_geo.parquet', union_by_name=true);
CREATE TABLE sup_time.data AS SELECT * FROM read_parquet('/home/fils/src/Projects/OIH/odis-arch/graphOps/extraction/mdp/output/*_sup_temporal.parquet', union_by_name=true);
```

## Columns we need

```
"id",
"type",
"txt_creator",
"txt_dateModified",
"txt_datePublished",
"description",
"txt_distribution",
"id_includedInDataCatalog",
"txt_includedInDataCatalog",
"txt_keywords",
"txt_license",
"name",
"id_provider",
"txt_provider",
"id_publisher",
"txt_publisher",
"geom_type",
"has_geom",
"geojson_point",
"geojson_simple",
"geojson_geom",
"geom_area",
"geom_length",
"the_geom",
"dt_startDate",
"n_startYear",
"dt_endDate",
"n_endYear",
"txt_temporalCoverage",
"txt_url",
"txt_variableMeasured",
"txt_version"
```
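
As a hypothetical sketch of how the tables above might be projected onto a few
of these target columns (the mapping below is illustrative, not the actual
one; extending it to the full list follows the same pattern):

```sql
-- Illustrative mapping only: which source column feeds which target field
-- has not been confirmed.
SELECT base.id         AS "id",
       base.type       AS "type",
       dataset.creator AS "txt_creator",
       dataset.license AS "txt_license",
       base.name       AS "name",
       base.url        AS "txt_url"
FROM base
JOIN dataset ON base.id = dataset.id;
```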

```sql
SELECT id, any_value(type), any_value(name), any_value(url), any_value(description)
FROM base
GROUP BY base.id
```

```sql
SELECT id, ANY_VALUE(includedInDataCatalog), STRING_AGG(keyword, ', ') AS kw_list
FROM dataset
GROUP BY dataset.id
```

```sql
SELECT base_agg.id, base_agg.type_list, base_agg.name_list, dataset_agg.kw_list, base_agg.b_url, base_agg.b_desc, base_agg.b_headline
FROM (
  SELECT id, STRING_AGG(DISTINCT type, ', ') AS type_list, STRING_AGG(DISTINCT name, ', ') AS name_list,
         any_value(url) AS b_url, any_value(description) AS b_desc, any_value(headline) AS b_headline
  FROM base
  GROUP BY id
) AS base_agg
JOIN (
  SELECT id, ANY_VALUE(includedInDataCatalog), STRING_AGG(DISTINCT keyword, ', ') AS kw_list
  FROM dataset
  GROUP BY id
) AS dataset_agg
ON base_agg.id = dataset_agg.id
ORDER BY base_agg.id;
```

```sql
SELECT *
FROM dataset.data
INNER JOIN base.data
ON dataset.data.id = base.data.id;
```

```sql
-- Group by all selected columns (acts like DISTINCT); grouping by
-- dataset.keyword alone is invalid with non-aggregated select columns.
SELECT base.id, base.type, base.name, dataset.keyword
FROM base
INNER JOIN dataset
ON base.id = dataset.id
GROUP BY base.id, base.type, base.name, dataset.keyword
LIMIT 1000;
```

```sql
SELECT * FROM PRAGMA_table_info('sup_time');
```

```sql
SELECT dataset.id, dataset.type, dataset.license, dataset.keyword
FROM dataset
JOIN base
ON dataset.id = base.id;
```