Bugfix release 1.0.1

- improved JSON output
- improved and corrected the metadata for multiple variables of the type value list
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to Robert Forkel for doing this work)
- minor data fixes (duplicate entries in datasets `Alienability`, `Gender` and `NumeralClassifiers`)

Many thanks to Robert Forkel for reporting many of these issues and curating the bibliography files!

Detailed changes:

- Fixed the DOI badge (now points to last released version 1.0.0)
- Added data type `logical` to the list of valid variable types
- Clarified that `value-list` is not actually a list
- Fixed an issue with JSON export where missing values were silently dropped by the serializer, they are now exported as `null`
- If a value list variable has no values (all missing), the json value list metadata is now serialized as an empty dictionary `{}` for consistency
- `NPStructurePresence` is no longer classified as a `PerLanguageSummaries` dataset
- `LID` field was sometimes serialized as string, fixed
- Missing glottocodes were sometimes serialized as explicit "NA" string, fixed
- Removed duplicate data entries from `Alienability`
- Removed duplicate data entries from `Gender`
- Removed duplicate data entries from `NumeralClassifiers`
- Added maps illustrating the geographical breakdown (by continent and area)
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to Robert Forkel for doing this work)
- Multiple metadata fixes:
  - Added value list descriptions for `PhonologicalFusion::FusionBinned6` and all variables that rely on it (such as `GrammaticalMarkers::MarkerFusionBinned6`)
  - Added value list descriptions for `PositionalBehavior::MarkerBehaviorBinned4` and all variables that rely on it (such as `GrammaticalMarkers::MarkerBehaviorBinned4`)
  - Value list description for `LocusOfMarking::LocusOfMarkingBinned5` was missing the value 'FloatingorClitic', fixed (this also fixes all the variables that rely on it, such as `GrammaticalMarkers::LocusOfMarkingBinned5`)
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned4`
  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned5`
  - Fixed data type of `GrammaticalMarkers::MarkerExpressesMultipleCategories` to be `logical`
  - Added value list descriptions for `ClauseLinkage::IntuitiveClassification`, value "?" is now recoded as NA (missing)
  - Added value list descriptions for multiple fields in `ClauseLinkage` where they were missing. The fields are: `AnticipatoryArgumentMarking`, `CataphoraConstraints`, `CategoricalSymmetry`, `ClauseLayer`, `ClausePosition`, `Embedding`, `ExtractionConstraints`, `FinitenessSimplified`, `FocusMarkingInDependent`, `FocusMarking`, `IllocutionaryMarking`, `IllocutionaryScope`, `InterpropositionalSemanticRelation`, `ReferenceTrackingSystem`, `TenseMarking` and `TenseScope`
  - Fixed the value list description for `ClauseWordOrder::WordOrderAPLex`
  - Fixed the value list description for `SemanticClass::SemanticClassBinned`
  - Removed invalid values from `GrammaticalRelationsRaw::SelectedArguments::SemanticCondition`
  - Fixed the value list description for `Register::OriginContinent`
  - Computed variables in `GrammaticalMarkersPerLanguage` now have correct value list metadata
  - Computed variables in `LocusOfMarkingPerLanguage` now have correct value list metadata
  - Computed variables `MorphologyPerLanguage::HasAny*` are now correctly annotated as logical
  - Computed variables `NPStructurePerLanguage::NPHas*` are now correctly annotated as logical
  - `NPStructurePerLanguage::NPStructureID` is now correctly annotated as integer
  - Computed variables in `VerbInflection*` summary datasets now have correct value list metadata

autotyp · Feb 24, 2022 · 4481151
1 parent bf28c24
commit 4481151
Showing 136 changed files with 84,895 additions and 67,298 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,7 +1,61 @@
-# AUTOTYP (in progress)
+# AUTOTYP 1.0.1
+
+This is a bugfix release that focuses on JSON output and improving metadata for variables of type
+value list. Notable changes:
+
+- improved JSON output
+- improved and corrected the metadata for multiple variables of the type value list
+- improved the bibliography data, added Glottolog language and reference IDs (many thanks to 
+  Robert Forkel for doing this work)
+- minor data fixes (duplicate entries in datasets `Alienability`, `Gender` and `NumeralClassifiers`)
+
+Many thanks to Robert Forkel for reporting many of these issues and curating the bibliography 
+files!
+
+Detailed changes:
 
 - Fixed the DOI badge (now points to last released version 1.0.0)
 - Added data type `logical` to the list of valid variable types
 - Clarified that `value-list` is not actually a list
 - Fixed an issue with JSON export where missing values were silently dropped
   by the serializer, they are now exported as `null` 
+- If a value list variable has no values (all missing), the json value list metadata 
+  is now serialized as an empty dictionary `{}` for consistency
+- `NPStructurePresence` is no longer classified as a `PerLanguageSummaries` dataset
+- `LID` field was sometimes serialized as string, fixed
+- Missing glottocodes were sometimes serialized as explicit "NA" string, fixed
+- Removed duplicate data entries from `Alienability`
+- Removed duplicate data entries from `Gender`
+- Removed duplicate data entries from `NumeralClassifiers`
+- Added maps illustrating the geographical breakdown (by continent and area)
+- improved the bibliography data, added Glottolog language and reference IDs (many thanks to 
+  Robert Forkel for doing this work)
+- Multiple metadata fixes:
+  - Added value list descriptions for `PhonologicalFusion::FusionBinned6` and all variables that 
+    rely on it (such as `GrammaticalMarkers::MarkerFusionBinned6`)
+  - Added value list descriptions for `PositionalBehavior::MarkerBehaviorBinned4` and all variables 
+    that rely on it (such as `GrammaticalMarkers::MarkerBehaviorBinned4`)
+  - Value list description for `LocusOfMarking::LocusOfMarkingBinned5` was missing the value 
+    'FloatingorClitic', fixed (this also fixes all the variables that rely on it, such as 
+    `GrammaticalMarkers::LocusOfMarkingBinned5`)
+  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned4`
+  - Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned5`
+  - Fixed data type of `GrammaticalMarkers::MarkerExpressesMultipleCategories` to be `logical`
+  - Added value list descriptions for `ClauseLinkage::IntuitiveClassification`, value "?" is now
+    recoded as NA (missing)  
+  - Added value list descriptions for multiple fields in `ClauseLinkage` where they were missing. 
+    The fields are: `AnticipatoryArgumentMarking`, `CataphoraConstraints`, `CategoricalSymmetry`, 
+    `ClauseLayer`, `ClausePosition`, `Embedding`, `ExtractionConstraints`, `FinitenessSimplified`, 
+    `FocusMarkingInDependent`, `FocusMarking`, `IllocutionaryMarking`, `IllocutionaryScope`,  
+    `InterpropositionalSemanticRelation`, `ReferenceTrackingSystem`, `TenseMarking` and  
+    `TenseScope`
+  - Fixed the value list description for `ClauseWordOrder::WordOrderAPLex`
+  - Fixed the value list description for `SemanticClass::SemanticClassBinned`
+  - Removed invalid values from `GrammaticalRelationsRaw::SelectedArguments::SemanticCondition`
+  - Fixed the value list description for `Register::OriginContinent`
+  - Computed variables in `GrammaticalMarkersPerLanguage` now have correct value list metadata
+  - Computed variables in `LocusOfMarkingPerLanguage` now have correct value list metadata
+  - Computed variables `MorphologyPerLanguage::HasAny*` are now correctly annotated as logical
+  - Computed variables `NPStructurePerLanguage::NPHas*` are now correctly annotated as logical
+  - `NPStructurePerLanguage::NPStructureID` is now correctly annotated as integer
+  - Computed variables in `VerbInflection*` summary datasets now have correct value list metadata
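
The JSON-related entries in this changelog can be illustrated with a minimal sketch. It assumes the `jsonlite` package, which this diff does not show the project using, and the records are invented for illustration. By default, `jsonlite::toJSON()` omits `NA` fields from data-frame records, which is one way missing values can end up silently dropped; passing `na = "null"` keeps them as explicit `null`.

```r
# Assumption: jsonlite is installed; the project's actual serializer
# is not shown in this diff.
library(jsonlite)

# Hypothetical records: the second language has no glottocode.
records <- data.frame(
  LID = 1:2,
  Glottocode = c("abcd1234", NA),
  stringsAsFactors = FALSE
)

# Default behaviour: the NA field is omitted from the second record.
json_default <- toJSON(records)

# Explicit nulls: the NA field is kept and written as null.
json_null <- toJSON(records, na = "null")

# An empty *named* list is written as a JSON object ({}),
# whereas an empty unnamed list is written as an array ([]).
empty_named <- setNames(list(), character(0))
json_empty <- toJSON(empty_named)

print(json_default)
print(json_null)
print(json_empty)
```

The last call mirrors the changelog entry about value lists with no values: an empty dictionary keeps the metadata shape the same whether or not any levels are present.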
diff --git a/aggregation-scripts/Alignment.R b/aggregation-scripts/Alignment.R
@@ -984,7 +984,17 @@ GR_roles <- GR_roles %>%
   filter(!SelectorID %in% no_agreement_ID) %>%
   # add glottocodes
   left_join(select(Register, LID, Glottocode), by = "LID") %>%
-  select(LID, Glottocode, Language, everything())
+  select(LID, Glottocode, Language, everything()) %>%
+  # drop unused factor levels
+  mutate(
+    ReferentialCondition = fct_drop(ReferentialCondition),
+    CoargumentAtr = fct_drop(CoargumentAtr),
+    CoargumentP = fct_drop(CoargumentP),
+    ClauseRankCondition = fct_drop(ClauseRankCondition),
+    CategoryCondition = fct_drop(CategoryCondition),
+    SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
+    PolarityCondition = fct_drop(PolarityCondition)
+  )
 
 
 alignments <- alignments %>%
@@ -1000,27 +1010,24 @@ alignments <- alignments %>%
   ) %>%
   # add glottocodes
   left_join(select(Register, LID, Glottocode), by = "LID") %>%
-  select(LID, Glottocode, Language, everything())
+  select(LID, Glottocode, Language, everything()) %>%
+  # drop unused factor levels
+  mutate(
+    ReferentialCondition = fct_drop(ReferentialCondition),
+    CoargumentAtr = fct_drop(CoargumentAtr),
+    CoargumentP = fct_drop(CoargumentP),
+    ClauseRankCondition = fct_drop(ClauseRankCondition),
+    CategoryCondition = fct_drop(CategoryCondition),
+    SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
+    PolarityCondition = fct_drop(PolarityCondition)
+  )
+
 
 AlignmentForDefaultPredicatesPerLanguage <- AlignmentForDefaultPredicatesPerLanguage %>%
   # add glottocodes
   left_join(select(Register, LID, Glottocode), by = "LID") %>%
   select(LID, Glottocode, Language, everything())
 
-
-fix_metadata_levels <- function(desc, values) {
-  values <- as.character(unique(unlist(values)))
-  values <- values[!is.na(values)]
-  if(!all(values %in% desc$levels$level)) {
-    unknown_levels <- setdiff(values, desc$levels$level)
-    arg <- caller_arg(values)
-    cli::cli_abort("unknown values in {arg}: {unknown_levels}")
-  }
-  desc$levels <- filter(desc$levels, level %in% values)
-  desc
-}
-
-
 descriptor <- describe_data(
   ptype = tibble(),
   description = "
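
As a rough illustration of what the `fct_drop()` calls added in this file do: a factor keeps every declared level even after the rows that used it are filtered out, so exported value lists can advertise values that never occur. The sketch below uses base R's `droplevels()`, which has a comparable effect on a plain factor; the level names are invented.

```r
# Hypothetical condition column with three declared levels.
condition <- factor(
  c("main", "dependent", "main"),
  levels = c("main", "dependent", "any")
)

# After filtering, only "main" occurs, but all three levels are still declared.
kept <- condition[condition == "main"]
print(levels(kept))

# droplevels() removes levels with no observations
# (the diff itself uses forcats::fct_drop() for this).
cleaned <- droplevels(kept)
print(levels(cleaned))
```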

diff --git a/aggregation-scripts/GrammaticalMarkers.R b/aggregation-scripts/GrammaticalMarkers.R
@@ -97,6 +97,7 @@ GrammaticalMarkersPerLanguage <- GrammaticalMarkers %>%
 
 
 
+
 descriptor <- describe_data(
   ptype = tibble(),
   description = "
@@ -111,12 +112,28 @@ descriptor <- describe_data(
     new_variables %>%
     rowwise() %>%
     group_map(~ {
+      # build the descriptor
       descriptor <- .metadata$GrammaticalMarkers$fields[[.$Variable]]
       descriptor$description <- format_inline(
         "Value of `GrammaticalMarkers::{.$Variable}` for exemplar {.q {.$MarkerExemplar}}"
       )
       descriptor$computed <- "GrammaticalMarkers.R"
       descriptor
+
+
+      # fix factors
+      if(is.factor(descriptor$ptype)) {
+        descriptor <- fix_metadata_levels(
+          descriptor,
+          GrammaticalMarkersPerLanguage[[.$NewVariable]]
+        )
+        GrammaticalMarkersPerLanguage[[.$NewVariable]] <<- factor(
+          as.character(GrammaticalMarkersPerLanguage[[.$NewVariable]]),
+          levels = levels(descriptor$ptype)
+        )
+      }
+
+      descriptor
     }) %>% set_names(new_variables$NewVariable)
   )
 )
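
The level-fixing step used in this hunk can be sketched in base R. This is a simplified stand-in for `fix_metadata_levels()` as it appears in the lines removed from `Alignment.R` in this commit, not the project's implementation: `restrict_levels()` and the example descriptor are hypothetical, and the descriptor is modelled as a plain list holding a `levels` data frame.

```r
# Hypothetical stand-in for fix_metadata_levels(): keep only the documented
# levels that actually occur in the data, and fail on undocumented values.
restrict_levels <- function(desc, values) {
  values <- as.character(unique(unlist(values)))
  values <- values[!is.na(values)]
  unknown <- setdiff(values, desc$levels$level)
  if (length(unknown) > 0) {
    stop("unknown values: ", paste(unknown, collapse = ", "))
  }
  desc$levels <- desc$levels[desc$levels$level %in% values, , drop = FALSE]
  desc
}

# Made-up descriptor with three documented levels.
desc <- list(
  levels = data.frame(
    level = c("prefix", "suffix", "infix"),
    description = c("preposed", "postposed", "interposed"),
    stringsAsFactors = FALSE
  )
)

observed <- c("suffix", "prefix", NA, "suffix")
fixed <- restrict_levels(desc, observed)

# Re-level the data so its factor levels match the remaining metadata.
observed_factor <- factor(observed, levels = fixed$levels$level)
print(levels(observed_factor))
```

Failing on unknown values, instead of dropping them quietly, means a mismatch between data and metadata surfaces when the dataset is built.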

diff --git a/aggregation-scripts/LocusOfMarking.R b/aggregation-scripts/LocusOfMarking.R
@@ -159,7 +159,10 @@ MarkingPerMicrorelation <- LocusOfMarkingPerMicrorelation %>%
     names_from=RoleCatLabel,
     values_from=c(LocusOfMarking, LocusOfMarkingBinned5, LocusOfMarkingBinned6),
     names_glue = "{.value}For{RoleCatLabel}",
-    values_fn = function(x) str_flatten(unique(x), "/"),
+    values_fn = function(x) {
+      x <- unique(x)
+      if(length(x) > 1) "multiple" else x
+    },
     values_fill = NA
   )
 
@@ -175,7 +178,6 @@ LocusOfMarkingPerLanguage <- inner_join(
   arrange(LID, Language)
 
 
-
 # TODO: improve this
 descriptor <- describe_data(
   ptype = tibble(),
@@ -184,11 +186,32 @@ descriptor <- describe_data(
   fields = c(
     .metadata$Register$fields[c("LID", "Language", "Glottocode")],
     map(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")), ~ {
-      describe_data(
+      descriptor <- describe_data(
         ptype = if(is.logical(LocusOfMarkingPerLanguage[[.]])) logical() else factor(),
         computed = "LocusOfMarking.R",
         description = "<pending>"
       )
+
+      # fix factors
+      if(is.factor(descriptor$ptype)) {
+        # variable name
+        var <- gsub("For.+$", "", .)
+
+        dd <- .metadata$LocusOfMarkingPerMicrorelation$fields$LocusOfMarking$element$fields[[var]]
+        !is_null(dd) || cli::cli_abort("Unknown variable {var}")
+
+        descriptor$levels <- add_row(dd$levels,
+          level = "multiple", description = "multiple different loci"
+        )
+        descriptor <- fix_metadata_levels(descriptor, LocusOfMarkingPerLanguage[[.]])
+
+        LocusOfMarkingPerLanguage[[.]] <<- factor(
+          as.character(LocusOfMarkingPerLanguage[[.]]),
+          levels = levels(descriptor$ptype)
+        )
+      }
+
+      descriptor
     }) %>% set_names(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")))
   )
 )
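
The new `values_fn` in this file replaces the slash-joined string with a single `multiple` value whenever a cell receives more than one distinct locus. A minimal base R sketch of that rule, using invented data and a hypothetical helper name `collapse_locus()`:

```r
# Hypothetical helper mirroring the values_fn in the diff:
# one distinct value is kept as is, several become "multiple".
collapse_locus <- function(x) {
  x <- unique(x)
  if (length(x) > 1) "multiple" else x
}

# Invented example: language 1 is consistent, language 2 is not.
loci <- data.frame(
  LID = c(1, 1, 2, 2),
  Locus = c("Head", "Head", "Head", "Dependent"),
  stringsAsFactors = FALSE
)

# Apply the rule per language.
per_language <- vapply(
  split(loci$Locus, loci$LID),
  collapse_locus,
  character(1)
)
print(per_language)
```

Because `unique()` keeps `NA` as a distinct value, a cell that mixes one locus with a missing value is also reported as `multiple` under this rule.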

diff --git a/aggregation-scripts/MorphologyPerLanguage.R b/aggregation-scripts/MorphologyPerLanguage.R
@@ -393,56 +393,56 @@ descriptor <- describe_data(
       "
     ),
     HasAnyPrefixes = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are prefixes (restricted preposed formatives) present in the language"
     ),
     HasAnySuffixes = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are suffixes (restricted postposed formatives) present in the language"
     ),
     HasAnyInfixes = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are infixes (restricted interposed formatives) present in the language"
     ),
     HasAnyProclitics = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "
         Are proclitics (unrestricted or semirestricted preposed formatives) present
         in the language
       "
     ),
     HasAnyEnclitics = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "
         Are enclitics (unrestricted or semirestricted postposed formatives) present
         in the language
       "
     ),
     HasAnyEndoclitics = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "
         Are endoclitics (unrestricted or semirestricted interposed formatives) present
         in the language
       "
     ),
     HasAnyPreposedFormatives = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are any preposed formatives present in the language"
     ),
     HasAnyPostposedFormatives = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are any postposed formatives present in the language"
     ),
     HasAnyInterposedFormatives = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "MorphologyPerLanguage.R",
       description = "Are any interposed formatives present in the language"
     ),

diff --git a/aggregation-scripts/NPStructurePerLanguage.R b/aggregation-scripts/NPStructurePerLanguage.R
@@ -31,6 +31,8 @@ to_camel_case <- function(x) {
 }
 
 
+
+
 # ███████╗██╗   ██╗███╗   ███╗███╗   ███╗ █████╗ ██████╗ ██╗   ██╗
 # ██╔════╝██║   ██║████╗ ████║████╗ ████║██╔══██╗██╔══██╗╚██╗ ██╔╝
 # ███████╗██║   ██║██╔████╔██║██╔████╔██║███████║██████╔╝ ╚████╔╝
@@ -354,9 +356,8 @@ NPStructurePresence <- NPStructure %>%
   left_join(head_macrosem_constraints_presence, by = c("LID", "NPStructureID")) %>%
   # add glottocodes
   left_join(select(Register, LID, Glottocode), by = "LID") %>%
-  select(LID, Glottocode, Language, everything()) %>%
-  arrange(LID, Language)
-
+  select(LID, Glottocode, Language, NPStructureID, everything()) %>%
+  arrange(LID, Language, NPStructureID)
 
 
 descriptor <- describe_data(
@@ -387,7 +388,7 @@ descriptor <- describe_data(
       "
     ),
     NPHasGovernment = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "NPStructurePerLanguage.R",
       description = "
         NPs with some kind of marker which is governed/assigned by the head
@@ -421,7 +422,7 @@ descriptor <- describe_data(
       "
     ),
     NPHasAdjGovernment = describe_data(
-      ptype = integer(),
+      ptype = logical(),
       computed = "NPStructurePerLanguage.R",
       description = "
         Adjective attribution with some kind of marker which is governed/assigned
@@ -438,22 +439,21 @@ descriptor <- describe_data(
 
 export_dataset("NPStructurePerLanguage", NPStructurePerLanguage, descriptor, c("PerLanguageSummaries", "NP"))
 
-
-
 descriptor <- describe_data(
   ptype = tibble(),
   description = "Per-language presence of NP properties",
   computed = "NPStructurePerLanguage.R",
   fields = c(
     .metadata$Register$fields[c("LID", "Language", "Glottocode")],
-    map(names(NPStructurePresence)[-(1:3)], ~ {
+    list(NPStructureID = .metadata$NPStructure$fields$NPStructureID),
+    map(names(NPStructurePresence)[-(1:4)], ~ {
       describe_data(
         ptype = logical(),
         computed = "NPStructurePerLanguage.R",
         description = "<pending>"
       )
-    }) %>% set_names(names(NPStructurePresence)[-(1:3)])
+    }) %>% set_names(names(NPStructurePresence)[-(1:4)])
   )
 )
 
-export_dataset("NPStructurePresence", NPStructurePresence, descriptor, c("PerLanguageSummaries", "NP"))
+export_dataset("NPStructurePresence", NPStructurePresence, descriptor, "NP")