Implement additional LD measures #27

kbenoit · 2018-12-10T20:34:11Z

These would include:

(vocd-)D
HD-D

See McCarthy, Philip M, and Scott Jarvis. 2010. “MTLD, Vocd-D, and HD-D: a Validation Study of Sophisticated Approaches to Lexical Diversity Assessment.” Behavior Research Methods 42(2): 381–92.

Also for testing the implementations

Related to quanteda/quanteda#1508

jiongweilua · 2018-12-12T00:38:03Z

I built a simple function for computing the D of vocd_d in this commit

Some issues I encountered:

I was trying to find a simple way to sample features from a tokens object, but it seems tokens_sample only supports sampling documents, since size cannot be > ndoc(x). Is this something we ought to fix in tokens_sample?
As 'D' is a parameter to be estimated, I relied on the stats::nls function. Is this an acceptable dependency, or must we find an alternative?

Also see McKee, G., Malvern, D., & Richards, B. (2000). Measuring vocabulary diversity using dedicated software. Literary and linguistic computing, 15(3), 323-338.

Think it's the original paper for vocd-D
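
For readers following along, here is a minimal sketch of how D might be estimated with stats::nls. It is not the code from the commit mentioned above. It assumes the usual vocd procedure as I understand it: draw repeated samples of 35 to 50 tokens, average the type-token ratio (TTR) at each sample size, then fit TTR = (D/N) * (sqrt(1 + 2N/D) - 1). The function name vocd_d_sketch, the toy text, and the closed-form starting value are all assumptions made for illustration.

```r
# Hypothetical helper: estimate vocd-D for one character vector of tokens.
vocd_d_sketch <- function(toks, sizes = 35:50, n_samples = 100) {
    stopifnot(length(toks) >= max(sizes))
    # Mean TTR over repeated samples (without replacement) at each sample size
    mean_ttr <- vapply(sizes, function(n) {
        mean(replicate(n_samples, length(unique(sample(toks, n))) / n))
    }, numeric(1))
    dat <- data.frame(N = sizes, TTR = mean_ttr)
    # Starting value (assumption): solve the model for D at each N, then average
    d_start <- mean(dat$N * dat$TTR^2 / (2 * (1 - dat$TTR)))
    # Fit the single parameter D by nonlinear least squares
    fit <- stats::nls(TTR ~ (D / N) * (sqrt(1 + 2 * N / D) - 1),
                      data = dat, start = list(D = d_start))
    unname(stats::coef(fit)["D"])
}

# Toy text: 500 tokens drawn from a 200-word vocabulary with Zipf-like weights
set.seed(100)
vocab <- paste0("w", 1:200)
toks <- sample(vocab, 500, replace = TRUE, prob = 1 / seq_along(vocab))
d_hat <- vocd_d_sketch(toks)
```

This is a single pass; the vocd software reportedly repeats the whole procedure and averages the resulting D values, which this sketch leaves out.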

kbenoit · 2018-12-12T02:06:38Z

On the first, you can add this function:

library("quanteda")

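# Sample `size` tokens from within each document of a tokens object.
tokens_samplefrom <- function(x, size, replace = FALSE) {
    # Save names, types, and class so they can be restored after sampling
    attrs <- attributes(x)
    # Each document is a vector of integer token IDs; sample within each one
    result <- lapply(unclass(x), sample, size = size, replace = replace)
    attributes(result) <- attrs
    # Internal (unexported) function, hence `:::`
    quanteda:::tokens_recompile(result)
}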

toks <- tokens(c("a b c d e f", "q r s t u v w x"))
set.seed(100)
tokens_samplefrom(toks, size = 3)
## tokens from 2 documents.
## text1 :
## [1] "b" "f" "c"
## 
## text2 :
## [1] "q" "t" "s"

nls() is fine because it's in the (always loaded) stats package.
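
Two caveats with per-document sampling are worth noting, sketched here on a plain list of integer token IDs rather than a real tokens object. Base R's sample() treats a single number n as 1:n, and it raises an error when size exceeds the document length with replace = FALSE. The helper name sample_within is hypothetical, and capping size at the document length is only one possible policy.

```r
# Hypothetical helper: sample up to `size` elements from each list element.
sample_within <- function(x, size) {
    lapply(x, function(doc) {
        # Index via sample.int() so a length-1 numeric document is not
        # expanded to 1:doc, and cap size at the document length
        doc[sample.int(length(doc), min(size, length(doc)))]
    })
}

docs <- list(text1 = c(3L, 1L, 4L, 1L, 5L, 9L), text2 = 7L)
set.seed(100)
res <- sample_within(docs, size = 3)
lengths(res)  # three elements from text1, one from text2
```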

jiongweilua · 2018-12-14T16:23:36Z

Prof. @kbenoit,

See commit e0f90d0 for my outline code for vocd-D after incorporating tokens_samplefrom and apply, and see commit 67b11c6 for my outline code for HD-D.

Would be great if you could:

Review the HD-D code: the formula for HD-D is never explicitly specified in McCarthy & Jarvis (2010), but based on McCarthy & Jarvis (2007), I understood that HD-D := sum_over_all_sampsize(ATTR_sampsize * 1/samp_size). I am not 100% sure, though.
Advise on how I can construct unit tests for vocd-D: since vocd-D involves sampling, there will be some variability between how R samples (even with set.seed) and how the online platforms do. My guess is that we try a large number of samples and specify a tight threshold for how much D can vary?
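
As a point of comparison for the HD-D question, here is a minimal sketch of the hypergeometric formulation as it is commonly described: for each type, take the probability that it appears at least once in a random sample of 42 tokens drawn without replacement, scale by 1/42, and sum over types. This is an assumption about the intended measure, not the code in commit 67b11c6. The name hdd_sketch is hypothetical, and some implementations may report the unscaled sum instead.

```r
# Hypothetical helper: HD-D for one character vector of tokens.
hdd_sketch <- function(toks, sample_size = 42) {
    n_tokens <- length(toks)
    stopifnot(n_tokens >= sample_size)
    freqs <- as.vector(table(toks))  # token count of each type
    # P(type appears at least once in the sample) = 1 - P(it is drawn zero times)
    p_present <- 1 - stats::dhyper(0, freqs, n_tokens - freqs, sample_size)
    # Sum of each type's contribution to the expected TTR of such a sample
    sum(p_present / sample_size)
}

toks <- c(rep("a", 30), rep("b", 20), paste0("w", 1:50))
hdd_value <- hdd_sketch(toks)
```

Because the probabilities are exact, this version involves no sampling, so its result does not depend on set.seed().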

kbenoit · 2018-12-15T07:51:13Z

For tests, or examples with anything stochastic, use set.seed().
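
A minimal sketch of what that might look like in base R, assuming a hypothetical stochastic statistic mean_ttr: resetting the seed before each call makes the two results identical, so a test can assert exact equality within one R session.

```r
# Hypothetical stochastic statistic: mean TTR over repeated samples of n tokens
mean_ttr <- function(toks, n = 35, reps = 100) {
    mean(replicate(reps, length(unique(sample(toks, n))) / n))
}

toks <- rep(letters, times = 4)

set.seed(100)
first <- mean_ttr(toks)
set.seed(100)
second <- mean_ttr(toks)

# Same seed, same sequence of draws, same result
stopifnot(identical(first, second))
```

Note that R 3.6.0 changed the default algorithm used by sample(), so a value recorded under an older R version may not reproduce under a newer one even with the same seed. Comparisons against other software would still need a tolerance rather than exact equality.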

On the HD-D code, I will return to the LD stuff, but if @koheiw and I can agree on the structure of a new function (see quanteda/quanteda#1520 (comment)), then this will make writing those functions different (and easier). Let's wait on that issue before I return to this code. However, I will try to take a look at McCarthy & Jarvis (2007) to understand HD-D. I think there is code on the Internet somewhere for this, the vocd software perhaps?

kbenoit · 2019-01-06T18:36:23Z

Working branch for this is dev-MTLD.

jiongweilua · 2019-01-07T08:47:18Z

@kbenoit Acknowledged!

jiongweilua self-assigned this Dec 11, 2018

kbenoit closed this as completed Jan 6, 2019

kbenoit reopened this Jan 6, 2019

kbenoit transferred this issue from quanteda/quanteda Dec 14, 2020
