Provide data in CLDF format #2
For the initial export we chose YAML because of its human-readability, to make the data accessible to a wider audience. We aim to provide supplementary JSON-LD metadata in future releases that exposes more of the database-internal structure. In the meantime, if there is interest, it should be possible to create a pipeline that converts the current CSV/YAML format to CLDF CSV/JSON-LD automatically. We will gladly accept community help in setting this up. @ALL: Please comment in this thread if you are interested in creating such a pipeline; we could use it to draft a roadmap.
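To illustrate, here is a minimal sketch of what such a pipeline could look like, using pycldf as the target library. The file names, the `LID` language-ID column, and the YAML metadata layout are hypothetical stand-ins for the actual export format:

```python
# Sketch of a YAML/CSV -> CLDF pipeline. File names, the 'LID' column and the
# YAML layout are hypothetical; the real AUTOTYP export may differ.
import csv

import yaml
from pycldf import StructureDataset

ds = StructureDataset.in_dir('cldf')

# One CLDF parameter per field described in the YAML metadata.
with open('Gender.yaml') as f:
    meta = yaml.safe_load(f)
params = [
    dict(ID=name, Name=name, Description=(spec or {}).get('description', ''))
    for name, spec in meta.items()]

# One value row per (language, field) cell in the CSV.
values = []
with open('Gender.csv') as f:
    for row in csv.DictReader(f):
        for p in meta:
            values.append(dict(
                ID=str(len(values) + 1),
                Language_ID=row['LID'],
                Parameter_ID=p,
                Value=row.get(p)))

ds.write(ParameterTable=params, ValueTable=values)
```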
I will have a look into this. Could be a good example for the refactored CLDF structure dataset spec.
First stab at the conversion is here: https://github.com/clld/autotyp-data/blob/cldf/autotyp_to_cldf.py
First of all, apologies that it took a while: our previous database pipeline was unmaintainable, so we had to redesign and rebuild it from scratch. With the new pipeline we are better equipped to track the dependencies between datasets (not explicitly part of the metadata yet, but will be soonish), so it is a good time to revisit this issue and chart a way toward a robust solution. One potential difficulty I see is that we decided to go with nested/repeated data for some datasets, as it simplifies handling and conceptualisation in practice. What would be a good way of mapping this kind of data model to CLDF? If I understand correctly, there is some support for repeated simple values, but what about nested records?
A relatively straightforward way to handle this is using JSON serialized as strings for the values, and adding enough metadata to the ParameterTable to make this transparent. A complete example using pycldf and csvw:

```python
from csvw.metadata import Datatype
from pycldf import StructureDataset

ds = StructureDataset.in_dir('ds')
ds.add_component('ParameterTable', {'name': 'datatype', 'datatype': 'json'})
ds.write(
    ParameterTable=[dict(ID='pid', datatype='json')],
    ValueTable=[dict(ID='1', Language_ID='l', Parameter_ID='pid', Value='{"a": 2}')])

dt = Datatype.fromvalue(ds.get_object('ParameterTable', 'pid').data['datatype'])
for v in ds['ValueTable']:
    v = dt.parse(v['Value'])
    assert isinstance(v, dict)
    print(v['a'])
```

Here, we add a column `datatype` to the ParameterTable, so that the way a parameter's values should be parsed is documented in the metadata and consumers can reconstruct the nested data.
Btw. I'm in the process of putting together a conversion of AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)
That's neat! But at this point, what is the value of using CSV at all? Why not just go all JSON?
Keep them coming :) One problem is that the published YAML metadata is just a subset of the much richer metadata we maintain internally for the export pipeline, and the mapping is not perfect. There are many improvements planned here, e.g. relationships between fields, more precise types and constraints, etc.; these things unfortunately didn't make it into the big release.
As soon as a particular data type for values becomes more widespread - including standard analysis methods - it becomes a candidate for "more" standardisation in CLDF. Putting it into CLDF now basically puts it "on track" for this. Also, CSV - even if it includes smallish JSON snippets - plays nicer with version control, because it (mostly) doesn't have the "unspecified whitespace" and "attribute order" issues of JSON or XML.

I should add that CLDF comes with "built-in" validation: stuff like invalid values for categorical data, non-existent Glottocodes, etc. will be flagged out of the box. And generating human-readable metadata descriptions is easy, e.g. with cldfbench (see e.g. https://github.com/glottolog/glottolog-cldf/blob/master/cldf/README.md). So arguably, making CLDF the target release format for AUTOTYP might solve some of the issues here.
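For example, validation can also be run programmatically; a minimal sketch with pycldf, assuming a standard metadata file location:

```python
# Sketch: validating a CLDF dataset programmatically; the path is a placeholder.
from pycldf import Dataset

ds = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
ds.validate()  # flags e.g. invalid values for categorical data
```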
@xrotwang could you share your CLDF conversion pipeline with me? I would like to add it to the build system, so that we have CLDF as a first-class target.
It's here: https://github.com/cldf-datasets/autotypcldf

The dataset is built by running

```
cldfbench makecldf cldfbench_autotyp.py --glottolog-version v4.5
```

which basically runs the code in https://github.com/cldf-datasets/autotypcldf/blob/main/cldfbench_autotypcldf.py
The CLDF dataset is now available in the cldf-export branch. The Python dataset classes are here. I have copied your code verbatim, just adjusted the file paths and removed the bibliography fix, since it is no longer necessary. Could you have a look at whether the CLDF data is ok like this? If there are no concerns, I can draft a 1.1.0 release.
Looks ok, and creating a SQLite db from it works as well. So, looks good to me.
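For the record, a sketch of the SQLite conversion via pycldf's db module, assuming a standard metadata file location (paths are placeholders):

```python
# Sketch: loading the CLDF dataset into a SQLite db with pycldf.
from pycldf import Dataset
from pycldf.db import Database

ds = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
db = Database(ds, fname='autotyp.sqlite')
db.write_from_tg()  # creates the db and loads all CLDF tables
```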
I posted a comment on closed issue #51 (which I can't reopen), so I am copying it here, as it is relevant for the conversion to the CLDF format. It is about the synthesis module, but it could be applicable to other complex modules too. I have a proposal to increase data reusability in CLDF. Right now, the variables listed in the first comment in this thread are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these, though, are simple binary per-language variables, and they could be incorporated straightforwardly into the CLDF format. The only problem is that the only values that can be trusted are those for languages where both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete) are TRUE. What do you think @tzakharko and @xrotwang?
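To make the proposal concrete, here is a sketch of the flattening I have in mind. The variable names are the ones quoted above; the record layout, in particular that the housekeeping flags live inside the same JSON blob, is an assumption:

```python
# Sketch: flattening MaximallyInflectedVerbSynthesis into per-language binary
# variables. The record layout is an assumption; adjust to the real export.
import json

def trusted_binary_values(rows):
    """Yield (language, variable, value) for trustworthy binary variables."""
    for row in rows:
        data = json.loads(row['MaximallyInflectedVerbSynthesis'])
        # Only trust languages where both housekeeping surveys are complete.
        if not (data.get('IsVerbAgreementSurveyComplete')
                and data.get('IsVerbInflectionSurveyComplete')):
            continue
        for var, value in data.items():
            if isinstance(value, bool):
                yield row['Language_ID'], var, value
```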
It may be worthwhile to change the data format in this repo to CLDF. As far as I can tell, not too many changes would be required to do so:
What you would gain: