Bugfix release 1.0.1
- improved JSON output
- improved and corrected the metadata for multiple variables of the type value list
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to
  Robert Forkel for doing this work)
- minor data fixes (duplicate entries in datasets `Alienability`, `Gender` and `NumeralClassifiers`)

Many thanks to Robert Forkel for reporting many of these issues and curating the bibliography
files!

Detailed changes:

- Fixed the DOI badge (now points to last released version 1.0.0)
- Added data type `logical` to the list of valid variable types
- Clarified that `value-list` is not actually a list
- Fixed an issue with JSON export where missing values were silently dropped
  by the serializer; they are now exported as `null`
- If a value list variable has no values (all missing), the JSON value list metadata
  is now serialized as an empty dictionary `{}` for consistency
- `NPStructurePresence` is no longer classified as a `PerLanguageSummaries` dataset
- `LID` field was sometimes serialized as string, fixed
- Missing glottocodes were sometimes serialized as explicit "NA" string, fixed
- Removed duplicate data entries from `Alienability`
- Removed duplicate data entries from `Gender`
- Removed duplicate data entries from `NumeralClassifiers`
- Added maps illustrating the geographical breakdown (by continent and area)
- Improved the bibliography data, added Glottolog language and reference IDs (many thanks to
  Robert Forkel for doing this work)
- Multiple metadata fixes:
  - Added value list descriptions for `PhonologicalFusion::FusionBinned6` and all variables that
    rely on it (such as `GrammaticalMarkers::MarkerFusionBinned6`)
  - Added value list descriptions for `PositionalBehavior::MarkerBehaviorBinned4` and all variables
    that rely on it (such as `GrammaticalMarkers::MarkerBehaviorBinned4`)
  - Value list description for `LocusOfMarking::LocusOfMarkingBinned5` was missing the value
    'FloatingorClitic', fixed (this also fixes all the variables that rely on it, such as
    `GrammaticalMarkers::LocusOfMarkingBinned5`)
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned4`
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned5`
  - Fixed data type of `GrammaticalMarkers::MarkerExpressesMultipleCategories` to be `logical`
  - Added value list descriptions for `ClauseLinkage::IntuitiveClassification`; value "?" is now
    recoded as NA (missing)
  - Added value list descriptions for multiple fields in `ClauseLinkage` where they were missing.
    The fields are: `AnticipatoryArgumentMarking`, `CataphoraConstraints`, `CategoricalSymmetry`,
    `ClauseLayer`, `ClausePosition`, `Embedding`, `ExtractionConstraints`, `FinitenessSimplified`,
    `FocusMarkingInDependent`, `FocusMarking`, `IllocutionaryMarking`, `IllocutionaryScope`,
    `InterpropositionalSemanticRelation`, `ReferenceTrackingSystem`, `TenseMarking` and
    `TenseScope`
  - Fixed the value list description for `ClauseWordOrder::WordOrderAPLex`
  - Fixed the value list description for `SemanticClass::SemanticClassBinned`
  - Removed invalid values from `GrammaticalRelationsRaw::SelectedArguments::SemanticCondition`
  - Fixed the value list description for `Register::OriginContinent`
  - Computed variables in `GrammaticalMarkersPerLanguage` now have correct value list metadata
  - Computed variables in `LocusOfMarkingPerLanguage` now have correct value list metadata
  - Computed variables `MorphologyPerLanguage::HasAny*` are now correctly annotated as logical
  - Computed variables `NPStructurePerLanguage::NPHas*` are now correctly annotated as logical
  - `NPStructurePerLanguage::NPStructureID` is now correctly annotated as integer
  - Computed variables in `VerbInflection*` summary datasets now have correct value list metadata
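
To illustrate the JSON export fix listed above (missing values written as `null` instead of being dropped), here is a minimal sketch. It assumes the `jsonlite` package, which the commit does not name as the serializer, and a made-up two-row table; `register`, `dropped` and `explicit` are hypothetical names.

```r
library(jsonlite)  # assumption: not necessarily the serializer used by AUTOTYP

# Hypothetical records; the second one has no glottocode
register <- data.frame(
  LID = c(1L, 2L),
  Glottocode = c("abcd1234", NA),
  stringsAsFactors = FALSE
)

# Default behaviour for data frames: the missing field is left out of the record
dropped <- toJSON(register)

# Explicit nulls: every record carries the same set of keys
explicit <- toJSON(register, na = "null")

print(dropped)
print(explicit)
```

With explicit nulls a consumer can tell "field absent from the schema" apart from "value missing for this language", which is presumably the motivation for the change.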
tzakharko committed Feb 24, 2022
1 parent bf28c24 commit 4481151
Showing 136 changed files with 84,895 additions and 67,298 deletions.
56 changes: 55 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,61 @@
-# AUTOTYP (in progress)
+# AUTOTYP 1.0.1

This is a bugfix release that focuses on JSON output and improving metadata for variables of type
value list. Notable changes:

- improved JSON output
- improved and corrected the metadata for multiple variables of the type value list
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to
Robert Forkel for doing this work)
- minor data fixes (duplicate entries in datasets `Alienability`, `Gender` and `NumeralClassifiers`)

Many thanks to Robert Forkel for reporting many of these issues and curating the bibliography
files!

Detailed changes:

- Fixed the DOI badge (now points to last released version 1.0.0)
- Added data type `logical` to the list of valid variable types
- Clarified that `value-list` is not actually a list
- Fixed an issue with JSON export where missing values were silently dropped
by the serializer, they are now exported as `null`
- If a value list variable has no values (all missing), the json value list metadata
is now serialized as an empty dictionary `{}` for consistency
- `NPStructurePresence` is no longer classified as a `PerLanguageSummaries` dataset
- `LID` field was sometimes serialized as string, fixed
- Missing glottocodes were sometimes serialized as explicit "NA" string, fixed
- Removed duplicate data entries from `Alienability`
- Removed duplicate data entries from `Gender`
- Removed duplicate data entries from `NumeralClassifiers`
- Added maps illustrating the geographical breakdown (by continent and area)
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to
Robert Forkel for doing this work)
- Multiple metadata fixes:
  - Added value list descriptions for `PhonologicalFusion::FusionBinned6` and all variables that
    rely on it (such as `GrammaticalMarkers::MarkerFusionBinned6`)
  - Added value list descriptions for `PositionalBehavior::MarkerBehaviorBinned4` and all variables
    that rely on it (such as `GrammaticalMarkers::MarkerBehaviorBinned4`)
  - Value list description for `LocusOfMarking::LocusOfMarkingBinned5` was missing the value
    'FloatingorClitic', fixed (this also fixes all the variables that rely on it, such as
    `GrammaticalMarkers::LocusOfMarkingBinned5`)
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned4`
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned5`
  - Fixed data type of `GrammaticalMarkers::MarkerExpressesMultipleCategories` to be `logical`
  - Added value list descriptions for `ClauseLinkage::IntuitiveClassification`, value "?" is now
    recoded as NA (missing)
  - Added value list descriptions for multiple fields in `ClauseLinkage` where they were missing.
    The fields are: `AnticipatoryArgumentMarking`, `CataphoraConstraints`, `CategoricalSymmetry`,
    `ClauseLayer`, `ClausePosition`, `Embedding`, `ExtractionConstraints`, `FinitenessSimplified`,
    `FocusMarkingInDependent`, `FocusMarking`, `IllocutionaryMarking`, `IllocutionaryScope`,
    `InterpropositionalSemanticRelation`, `ReferenceTrackingSystem`, `TenseMarking` and
    `TenseScope`
  - Fixed the value list description for `ClauseWordOrder::WordOrderAPLex`
  - Fixed the value list description for `SemanticClass::SemanticClassBinned`
  - Removed invalid values from `GrammaticalRelationsRaw::SelectedArguments::SemanticCondition`
  - Fixed the value list description for `Register::OriginContinent`
  - Computed variables in `GrammaticalMarkersPerLanguage` now have correct value list metadata
  - Computed variables in `LocusOfMarkingPerLanguage` now have correct value list metadata
  - Computed variables `MorphologyPerLanguage::HasAny*` are now correctly annotated as logical
  - Computed variables `NPStructurePerLanguage::NPHas*` are now correctly annotated as logical
  - `NPStructurePerLanguage::NPStructureID` is now correctly annotated as integer
  - Computed variables in `VerbInflection*` summary datasets now have correct value list metadata
39 changes: 23 additions & 16 deletions aggregation-scripts/Alignment.R
@@ -984,7 +984,17 @@ GR_roles <- GR_roles %>%
filter(!SelectorID %in% no_agreement_ID) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
-select(LID, Glottocode, Language, everything())
+select(LID, Glottocode, Language, everything()) %>%
+# drop unused factor levels
+mutate(
+  ReferentialCondition = fct_drop(ReferentialCondition),
+  CoargumentAtr = fct_drop(CoargumentAtr),
+  CoargumentP = fct_drop(CoargumentP),
+  ClauseRankCondition =fct_drop(ClauseRankCondition),
+  CategoryCondition = fct_drop(CategoryCondition),
+  SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
+  PolarityCondition = fct_drop(PolarityCondition)
+)


alignments <- alignments %>%
@@ -1000,27 +1010,24 @@ alignments <- alignments %>%
) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
-select(LID, Glottocode, Language, everything())
+select(LID, Glottocode, Language, everything()) %>%
+# drop unused factor levels
+mutate(
+  ReferentialCondition = fct_drop(ReferentialCondition),
+  CoargumentAtr = fct_drop(CoargumentAtr),
+  CoargumentP = fct_drop(CoargumentP),
+  ClauseRankCondition =fct_drop(ClauseRankCondition),
+  CategoryCondition = fct_drop(CategoryCondition),
+  SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
+  PolarityCondition = fct_drop(PolarityCondition)
+)


AlignmentForDefaultPredicatesPerLanguage <- AlignmentForDefaultPredicatesPerLanguage %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
select(LID, Glottocode, Language, everything())


-fix_metadata_levels <- function(desc, values) {
-  values <- as.character(unique(unlist(values)))
-  values <- values[!is.na(values)]
-  if(!all(values %in% desc$levels$level)) {
-    unknown_levels <- setdiff(values, desc$levels$level)
-    arg <- caller_arg(values)
-    cli::cli_abort("unknown values in {arg}: {unknown_levels}")
-  }
-  desc$levels <- filter(desc$levels, level %in% values)
-  desc
-}


descriptor <- describe_data(
ptype = tibble(),
description = "
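
The `fix_metadata_levels()` helper that appears in the hunk above prunes a descriptor's level table to the values that actually occur in the data, and aborts when the data contain a value the table does not know. A minimal base-R sketch of that logic follows. `prune_levels`, the toy descriptor and its level names are hypothetical stand-ins; the real helper works on the project's `describe_data()` objects and reports errors through `cli`.

```r
# Hypothetical descriptor: a list holding a table of known levels
desc <- list(
  levels = data.frame(
    level = c("head", "dependent", "double"),
    description = c("marking on the head",
                    "marking on the dependent",
                    "marking on both"),
    stringsAsFactors = FALSE
  )
)

# Base-R sketch of the pruning logic (not the project's implementation)
prune_levels <- function(desc, values) {
  values <- as.character(unique(unlist(values)))
  values <- values[!is.na(values)]

  # refuse values that the level table does not describe
  unknown <- setdiff(values, desc$levels$level)
  if (length(unknown) > 0) {
    stop("unknown values: ", paste(unknown, collapse = ", "))
  }

  # keep only the levels that are actually observed
  desc$levels <- desc$levels[desc$levels$level %in% values, , drop = FALSE]
  desc
}

observed <- factor(c("head", NA, "double", "head"))
pruned <- prune_levels(desc, observed)
print(pruned$levels)
```

Failing loudly on unknown values, rather than dropping them, keeps the exported metadata and the data from drifting apart unnoticed.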
17 changes: 17 additions & 0 deletions aggregation-scripts/GrammaticalMarkers.R
@@ -97,6 +97,7 @@ GrammaticalMarkersPerLanguage <- GrammaticalMarkers %>%




descriptor <- describe_data(
ptype = tibble(),
description = "
@@ -111,12 +112,28 @@ descriptor <- describe_data(
new_variables %>%
rowwise() %>%
group_map(~ {
+# build the descriptor
descriptor <- .metadata$GrammaticalMarkers$fields[[.$Variable]]
descriptor$description <- format_inline(
"Value of `GrammaticalMarkers::{.$Variable}` for exemplar {.q {.$MarkerExemplar}}"
)
descriptor$computed <- "GrammaticalMarkers.R"
descriptor


+# fix factors
+if(is.factor(descriptor$ptype)) {
+  descriptor <- fix_metadata_levels(
+    descriptor,
+    GrammaticalMarkersPerLanguage[[.$NewVariable]]
+  )
+  GrammaticalMarkersPerLanguage[[.$NewVariable]] <<- factor(
+    as.character(GrammaticalMarkersPerLanguage[[.$NewVariable]]),
+    levels = levels(descriptor$ptype)
+  )
+}
+
+descriptor
}) %>% set_names(new_variables$NewVariable)
)
)
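
The "fix factors" step above rebuilds each computed column as a factor over an explicit level set, and the `fct_drop()` calls in `Alignment.R` remove levels that no longer occur. Both ideas can be sketched in base R. The vector and level names below are invented for illustration, and `droplevels()` is used here as the base-R counterpart of `forcats::fct_drop()`.

```r
# Toy factor that declares more levels than it uses
x <- factor(
  c("prefix", "suffix", NA, "prefix"),
  levels = c("prefix", "suffix", "infix", "clitic")
)

# Drop levels that never occur in the data
used <- droplevels(x)

# Rebuild the factor against an explicit level set;
# any value outside that set would become NA
releveled <- factor(as.character(x), levels = c("suffix", "prefix"))

print(levels(used))
print(levels(releveled))
print(table(releveled, useNA = "ifany"))
```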
29 changes: 26 additions & 3 deletions aggregation-scripts/LocusOfMarking.R
@@ -159,7 +159,10 @@ MarkingPerMicrorelation <- LocusOfMarkingPerMicrorelation %>%
names_from=RoleCatLabel,
values_from=c(LocusOfMarking, LocusOfMarkingBinned5, LocusOfMarkingBinned6),
names_glue = "{.value}For{RoleCatLabel}",
-values_fn = function(x) str_flatten(unique(x), "/"),
+values_fn = function(x) {
+  x <- unique(x)
+  if(length(x) > 1) "multiple" else x
+},
values_fill = NA
)

@@ -175,7 +178,6 @@ LocusOfMarkingPerLanguage <- inner_join(
arrange(LID, Language)



# TODO: improve this
descriptor <- describe_data(
ptype = tibble(),
@@ -184,11 +186,32 @@ descriptor <- describe_data(
fields = c(
.metadata$Register$fields[c("LID", "Language", "Glottocode")],
map(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")), ~ {
-describe_data(
+descriptor <- describe_data(
ptype = if(is.logical(LocusOfMarkingPerLanguage[[.]])) logical() else factor(),
computed = "LocusOfMarking.R",
description = "<pending>"
)

+# fix factors
+if(is.factor(descriptor$ptype)) {
+  # variable name
+  var <- gsub("For.+$", "", .)
+
+  dd <- .metadata$LocusOfMarkingPerMicrorelation$fields$LocusOfMarking$element$fields[[var]]
+  !is_null(dd) || abort("Unknown variable {var}")
+
+  descriptor$levels <- add_row(dd$levels,
+    level = "multiple", description = "multiple different loci"
+  )
+  descriptor <- fix_metadata_levels(descriptor, LocusOfMarkingPerLanguage[[.]])
+
+  LocusOfMarkingPerLanguage[[.]] <<- factor(
+    as.character(LocusOfMarkingPerLanguage[[.]]),
+    levels = levels(descriptor$ptype)
+  )
+}
+
+descriptor
}) %>% set_names(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")))
)
)
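
The new `values_fn` in this file replaces slash-joined combinations such as `a/b` with a single `"multiple"` value, which is what allows one extra level to be added to the value list metadata. A minimal base-R sketch of that collapsing rule, applied per language, is shown below. The data frame and the names `collapse_values`, `loci` and `per_language` are hypothetical, and `split()` plus `vapply()` stand in for the grouping that `pivot_wider()` does in the script.

```r
# Same rule as the new values_fn: one distinct value is kept as is,
# several distinct values collapse to the single label "multiple"
collapse_values <- function(x) {
  x <- unique(x)
  if (length(x) > 1) "multiple" else x
}

# Hypothetical long-format data: language 1 is uniform, language 2 is not
loci <- data.frame(
  LID = c(1L, 1L, 2L, 2L),
  LocusOfMarking = c("head", "head", "head", "dependent"),
  stringsAsFactors = FALSE
)

per_language <- vapply(
  split(loci$LocusOfMarking, loci$LID),
  collapse_values,
  character(1)
)
print(per_language)
```

Note that, as written, the rule treats `NA` as a distinct value, so a group containing one known value and one `NA` also collapses to `"multiple"`.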
18 changes: 9 additions & 9 deletions aggregation-scripts/MorphologyPerLanguage.R
@@ -393,56 +393,56 @@ descriptor <- describe_data(
"
),
HasAnyPrefixes = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are prefixes (restricted preposed formatives) present in the language"
),
HasAnySuffixes = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are suffixes (restricted postposed formatives) present in the language"
),
HasAnyInfixes = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are infixes (restricted interposed formatives) present in the language"
),
HasAnyProclitics = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are proclitics (unrestricted or semirestricted preposed formatives) present
in the language
"
),
HasAnyEnclitics = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are enclitics (unrestricted or semirestricted postposed formatives) present
in the language
"
),
HasAnyEndoclitics = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are endoclitics (unrestricted or semirestricted interposed formatives) present
in the language
"
),
HasAnyPreposedFormatives = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any preposed formatives present in the language"
),
HasAnyPostposedFormatives = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any postposed formatives present in the language"
),
HasAnyInterposedFormatives = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any interposed formatives present in the language"
),
20 changes: 10 additions & 10 deletions aggregation-scripts/NPStructurePerLanguage.R
@@ -31,6 +31,8 @@ to_camel_case <- function(x) {
}




# ███████╗██╗ ██╗███╗ ███╗███╗ ███╗ █████╗ ██████╗ ██╗ ██╗
# ██╔════╝██║ ██║████╗ ████║████╗ ████║██╔══██╗██╔══██╗╚██╗ ██╔╝
# ███████╗██║ ██║██╔████╔██║██╔████╔██║███████║██████╔╝ ╚████╔╝
@@ -354,9 +356,8 @@ NPStructurePresence <- NPStructure %>%
left_join(head_macrosem_constraints_presence, by = c("LID", "NPStructureID")) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
-select(LID, Glottocode, Language, everything()) %>%
-arrange(LID, Language)
-
+select(LID, Glottocode, Language, NPStructureID, everything()) %>%
+arrange(LID, Language, NPStructureID)


descriptor <- describe_data(
@@ -387,7 +388,7 @@ descriptor <- describe_data(
"
),
NPHasGovernment = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "
NPs with some kind of marker which is governed/assigned by the head
@@ -421,7 +422,7 @@ descriptor <- describe_data(
"
),
NPHasAdjGovernment = describe_data(
-ptype = integer(),
+ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "
Adjective attribution with some kind of marker which is governed/assigned
@@ -438,22 +439,21 @@ descriptor <- describe_data(

export_dataset("NPStructurePerLanguage", NPStructurePerLanguage, descriptor, c("PerLanguageSummaries", "NP"))



descriptor <- describe_data(
ptype = tibble(),
description = "Per-language presence of NP properties",
computed = "NPStructurePerLanguage.R",
fields = c(
.metadata$Register$fields[c("LID", "Language", "Glottocode")],
-map(names(NPStructurePresence)[-(1:3)], ~ {
+list(NPStructureID = .metadata$NPStructure$fields$NPStructureID),
+map(names(NPStructurePresence)[-(1:4)], ~ {
describe_data(
ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "<pending>"
)
-}) %>% set_names(names(NPStructurePresence)[-(1:3)])
+}) %>% set_names(names(NPStructurePresence)[-(1:4)])
)
)

-export_dataset("NPStructurePresence", NPStructurePresence, descriptor, c("PerLanguageSummaries", "NP"))
+export_dataset("NPStructurePresence", NPStructurePresence, descriptor, "NP")