Provide data in CLDF format #2

xrotwang · 2017-05-30T07:52:58Z

It may be worthwhile to change the data format in this repo to CLDF. As far as I can tell, not too many changes would be required to do so:

Adopt the CLDF standard column names in the csv files.
Convert the YAML metadata files to JSON-LD according to the W3 recommendations for tabular data on the web.
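
A rough sketch of what the second step could produce, assuming the YAML has already been parsed into a Python dict. The `yaml_like` layout and its field names are hypothetical (the actual AUTOTYP YAML structure is not shown in this thread); the `@context`, `dc:conformsTo` and `propertyUrl` values are the ones CLDF metadata normally uses for a ValueTable.

```python
import json

CLDF = "http://cldf.clld.org/v1.0/terms.rdf#"

# Hypothetical stand-in for one parsed YAML metadata file.
# The real AUTOTYP YAML layout may look quite different.
yaml_like = {
    "dataset": "values.csv",
    "fields": [
        {"name": "ID", "description": "Row identifier"},
        {"name": "Language_ID", "description": "Language the value belongs to"},
        {"name": "Parameter_ID", "description": "Variable being coded"},
        {"name": "Value", "description": "Coded value"},
    ],
}

# CLDF standard column names mapped to their ontology terms.
PROPERTY_URLS = {
    "ID": CLDF + "id",
    "Language_ID": CLDF + "languageReference",
    "Parameter_ID": CLDF + "parameterReference",
    "Value": CLDF + "value",
}


def to_csvw(meta):
    """Build a minimal CSVW/CLDF metadata description for one table."""
    columns = []
    for field in meta["fields"]:
        column = {"name": field["name"], "dc:description": field["description"]}
        if field["name"] in PROPERTY_URLS:
            column["propertyUrl"] = PROPERTY_URLS[field["name"]]
        columns.append(column)
    return {
        "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
        "dc:conformsTo": CLDF + "StructureDataset",
        "tables": [{
            "url": meta["dataset"],
            "dc:conformsTo": CLDF + "ValueTable",
            "tableSchema": {"columns": columns, "primaryKey": ["ID"]},
        }],
    }


metadata = to_csvw(yaml_like)
print(json.dumps(metadata, indent=2))
```

In practice pycldf can write this metadata itself; the sketch only shows how little is needed beyond the standard column names.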

What you would gain:

Some of the documentation could be offloaded to the CLDF spec.
Better machine readability, in particular of the metadata (while YAML has good support in many languages, support for JSON is still better; the additional benefit of using JSON-LD is machine readable info about license, proper citation, etc.).
A python API.
A first (big) step towards serving the autotyp data from a clld app.

tzakharko · 2017-06-07T11:12:52Z

For the initial export, we chose YAML because of its human-readability. This was done to make the data more accessible to a wider audience. We aim to provide supplementary JSON-LD metadata in future releases that expose more of the database internal structure.

In the meantime, if there is interest, it should be possible to create a pipeline that converts the current CSV/YAML format to CLDF CSV/JSON-LD automatically. We will gladly accept community help in setting this up.

@ALL: Please comment in this thread if you are interested in creating such a pipeline, we could use it to draft a roadmap.

xrotwang · 2017-06-22T14:08:24Z

I will have a look into this. Could be a good example for the refactored CLDF structure dataset spec.

xflr6 · 2018-01-10T12:47:10Z

First stab at the conversion is here: https://github.com/clld/autotyp-data/blob/cldf/autotyp_to_cldf.py
To work around #9 and #10, the ill-formed data was removed, cf. the commits in the issues branch
Result as a ZIP-file: autotyp-cldf.zip

tzakharko · 2022-02-11T08:53:48Z

First of all, apologies that it took a while: our previous database pipeline was unmaintainable, so we had to redesign and rebuild it from scratch. With the new pipeline we are better equipped for tracking the dependencies between datasets (not explicitly part of the metadata yet, but will be soonish), so it is a good time to revisit this issue and chart a way to provide a robust solution.

One potential difficulty I see is that we decided to go with nested/repeated data for some datasets, as it simplifies handling and conceptualisation in practice. What would be a good way of mapping this kind of data model to CLDF? If I understand correctly, there is some support for repeated simple values, but what about nested records?

xrotwang · 2022-02-11T09:53:36Z

A relatively straightforward way to handle this is using JSON serialized as string as values, and adding enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

from csvw.metadata import Datatype
from pycldf import StructureDataset

ds = StructureDataset.in_dir('ds')
ds.add_component('ParameterTable', {'name': 'datatype', 'datatype': 'json'})
ds.write(
    ParameterTable=[dict(ID='pid', datatype='json')],
    ValueTable=[dict(ID='1', Language_ID='l', Parameter_ID='pid', Value='{"a": 2}')])

dt = Datatype.fromvalue(ds.get_object('ParameterTable', 'pid').data['datatype'])
for v in ds['ValueTable']:
    v = dt.parse(v['Value'])
    assert isinstance(v, dict)
    print(v['a'])

Here, we add a column datatype to ParameterTable, and mark it as JSON column (which is understood by csvw). When reading data from ValueTable, we first instantiate a csvw.metadata.Datatype instance from the datatype spec in ParameterTable, and then use this object to parse the value accordingly.
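
For the nested-records case specifically, here is a minimal standard-library sketch of the same idea without pycldf. The record layout (`slots`, `position`, `roles`) is hypothetical and only stands in for whatever nested structure a dataset uses; the point is that a nested record survives a CSV round trip once it is serialized as a JSON string in the Value column.

```python
import csv
import io
import json

# Hypothetical nested record; the field names are illustrative only.
nested = {
    "slots": [
        {"position": 1, "roles": ["A", "P"]},
        {"position": 2, "roles": ["S"]},
    ]
}

# Write one ValueTable-like row with the record serialized as a JSON string.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["ID", "Language_ID", "Parameter_ID", "Value"])
writer.writeheader()
writer.writerow({
    "ID": "1",
    "Language_ID": "l",
    "Parameter_ID": "pid",
    # sort_keys keeps the serialization stable across runs
    "Value": json.dumps(nested, sort_keys=True),
})

# Read it back: the csv module handles the quoting of embedded commas
# and double quotes, and json.loads restores the nested structure.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
decoded = json.loads(rows[0]["Value"])
print(decoded["slots"][0]["roles"])
```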

xrotwang · 2022-02-11T10:00:50Z

Btw. I'm in the process of putting together a conversion from the AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

tzakharko · 2022-02-11T11:25:32Z

> A relatively straightforward way to handle this is using JSON serialized as string as values, and adding enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

That's neat! But at this point, what is the value of using CSV at all? Why not just go all JSON?

> Btw. I'm in the process of putting together a conversion from the AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

Keep them coming :) One problem is that the published YAML metadata is just a subset of the much richer metadata we maintain internally for the export pipeline, and the mapping is not perfect. There are many improvements planned here, e.g. relationships between fields, more precise types and constraints etc. — these things unfortunately didn't make it for the big release.

xrotwang · 2022-02-11T11:47:57Z

As soon as a particular data type for values becomes more wide-spread - including standard analysis methods - it becomes a candidate for "more" standardisation in CLDF. Putting it into CLDF now basically puts it "on track" for this. Also, CSV - even if it includes smallish JSON snippets - plays nicer with version control, because it doesn't have (mostly) the "unspecified whitespace" and "attribute order" issues of JSON or XML.
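
The version-control point can be illustrated with a small sketch: two JSON texts that encode the same data can differ in whitespace and attribute order, so a line-based diff reports changes where there are none. Sorting keys and fixing the separators is one way (an assumption here, not something CLDF prescribes) to keep embedded JSON snippets stable.

```python
import json

# Same data, different whitespace and attribute order.
a = '{"b": 1, "a": [1, 2]}'
b = '{\n  "a": [1, 2],\n  "b": 1\n}'

same_data = json.loads(a) == json.loads(b)
same_text = a == b


def canonical(text):
    """Re-serialize JSON with sorted keys and no optional whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


print(same_data, same_text, canonical(a) == canonical(b))
```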

I should add that CLDF comes with "built-in" validation. E.g. stuff like invalid values for categorical data, non-existent Glottocodes, etc. will be flagged "out of the box". And generating human readable metadata descriptions is easy, e.g. with cldfbench (see e.g. https://github.com/glottolog/glottolog-cldf/blob/master/cldf/README.md). So arguably, making CLDF the target release format for AUTOTYP might solve some of the issues here.
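
To give an idea of what such a check amounts to, here is a standard-library sketch that flags values whose code is not listed for their parameter. It is not how pycldf implements validation (there, the `cldf validate` command checks a dataset against its metadata); the two miniature tables and the helper names are made up for illustration.

```python
import csv
import io

# Hypothetical miniature tables, not AUTOTYP data.
codes_csv = "ID,Parameter_ID,Name\npid-yes,pid,yes\npid-no,pid,no\n"
values_csv = (
    "ID,Language_ID,Parameter_ID,Code_ID\n"
    "1,l1,pid,pid-yes\n"
    "2,l2,pid,pid-maybe\n"
)


def read(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def invalid_codes(values, codes):
    """Return IDs of value rows whose Code_ID is not defined for the parameter."""
    allowed = {}
    for code in codes:
        allowed.setdefault(code["Parameter_ID"], set()).add(code["ID"])
    return [
        row["ID"] for row in values
        if row["Code_ID"] not in allowed.get(row["Parameter_ID"], set())
    ]


problems = invalid_codes(read(values_csv), read(codes_csv))
print(problems)
```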

tzakharko · 2022-03-21T08:23:04Z

@xrotwang could you share your CLDF conversion pipeline with me? I would like to add it to the build system, so that we have CLDF as a first-class target.

xrotwang · 2022-03-21T08:35:04Z

It's here: https://github.com/cldf-datasets/autotypcldf
Using https://github.com/cldf/cldfbench
autotyp-data is pulled in as git submodule, see https://github.com/cldf-datasets/autotypcldf/tree/main/raw
And the conversion is run via

cldfbench makecldf cldfbench_autotyp.py --glottolog-version v4.5

which basically runs the code in https://github.com/cldf-datasets/autotypcldf/blob/main/cldfbench_autotypcldf.py

tzakharko · 2022-04-07T16:01:15Z

CLDF dataset is now available in the cldf-export branch

The python dataset classes are here. I have copied your code verbatim, just adjusted the file paths and removed the bibliography fix since it is not necessary anymore.

Could you have a look whether the CLDF data is ok like this? If there are no concerns I can draft a 1.1.0 release.

xrotwang · 2022-04-07T16:16:54Z

Looks ok:

$ cldf stats StructureDataset-metadata.json 
<cldf:v1.0:StructureDataset at .>
                     value
-------------------  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dc:conformsTo        http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
dc:source            sources.bib
prov:wasDerivedFrom  [{'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}, {'rdf:about': 'https://github.com/glottolog/glottolog', 'rdf:type': 'prov:Entity', 'dc:created': 'v4.5', 'dc:title': 'Glottolog'}, {'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}]
prov:wasGeneratedBy  [{'dc:title': 'python', 'dc:description': '3.9.10'}, {'dc:title': 'python-packages', 'dc:relation': 'requirements.txt'}]
rdf:ID               autotyp
rdf:type             http://www.w3.org/ns/dcat#Distribution

                   Type                 Rows
-----------------  -----------------  ------
values.csv         ValueTable         278536
languages.csv      LanguageTable        3053
contributions.csv  ContributionTable      46
parameters.csv     ParameterTable       1013
codes.csv          CodeTable            1402
sources.bib        Sources              5001

and creating a SQLite db from it works as well.

So, looks good to me.

nataliacp · 2023-05-24T08:43:55Z

I have posted a comment on closed issue #51 (which I can't reopen), so I am copying it here as it is relevant for the conversion to the CLDF format. It is about the synthesis module, but it could be applicable to other complex modules too.

I have a proposal to increase data reusability in CLDF. Right now, the variables listed in the first comment in this thread are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables that could be incorporated straightforwardly in the CLDF format. The only problem is that the values for these variables can be trusted only for languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think @tzakharko and @xrotwang?
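
A possible shape for this, as a hedged sketch: unpack the JSON value into one row per binary variable, but only for languages where both housekeeping variables are TRUE. The sub-variable names (`HasVariableA`, `HasVariableB`), the language IDs and the way the housekeeping flags are stored are all assumptions for illustration; only the three variable names from the proposal come from the dataset.

```python
import json

HOUSEKEEPING = (
    "IsVerbAgreementSurveyComplete",
    "IsVerbInflectionSurveyComplete",
)

# Hypothetical per-language housekeeping flags.
languages = {
    "l1": {"IsVerbAgreementSurveyComplete": True,
           "IsVerbInflectionSurveyComplete": True},
    "l2": {"IsVerbAgreementSurveyComplete": True,
           "IsVerbInflectionSurveyComplete": False},
}

# Hypothetical JSON-valued rows; the nested variable names are made up.
json_values = [
    {"Language_ID": "l1", "Parameter_ID": "MaximallyInflectedVerbSynthesis",
     "Value": '{"HasVariableA": true, "HasVariableB": false}'},
    {"Language_ID": "l2", "Parameter_ID": "MaximallyInflectedVerbSynthesis",
     "Value": '{"HasVariableA": false, "HasVariableB": false}'},
]


def flatten(json_values, languages):
    """Turn binary entries of the JSON value into plain per-language rows."""
    rows = []
    for entry in json_values:
        flags = languages.get(entry["Language_ID"], {})
        # Skip languages unless both surveys are marked complete.
        if not all(flags.get(name) is True for name in HOUSEKEEPING):
            continue
        for variable, value in sorted(json.loads(entry["Value"]).items()):
            if isinstance(value, bool):  # keep only the simple binary variables
                rows.append({
                    "Language_ID": entry["Language_ID"],
                    "Parameter_ID": variable,
                    "Value": value,
                })
    return rows


flat_rows = flatten(json_values, languages)
for row in flat_rows:
    print(row)
```

Dropping the untrusted languages entirely is only one option; keeping their rows with an explicit marker for incomplete surveys would be another.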

tzakharko self-assigned this Jun 7, 2017

tzakharko added enhancement help wanted labels Jun 7, 2017

tzakharko mentioned this issue Mar 22, 2023

MaximallyInflectedVerbSynthesis - variables lost in CLDF #51

Closed
