Implement additional LD measures #27

Open
kbenoit opened this issue Dec 10, 2018 · 6 comments
@kbenoit
Contributor

kbenoit commented Dec 10, 2018

These would include:

  • (vocd-)D
  • HD-D

See McCarthy, Philip M, and Scott Jarvis. 2010. “MTLD, Vocd-D, and HD-D: a Validation Study of Sophisticated Approaches to Lexical Diversity Assessment.” Behavior Research Methods 42(2): 381–92.

The paper should also be useful for testing the implementations.

Related to quanteda/quanteda#1508

@jiongweilua jiongweilua self-assigned this Dec 11, 2018
@jiongweilua

I built a simple function for computing the D of vocd-D in this commit.

Some issues I encountered:

  • I was trying to find a simple way to sample features from a tokens object, but tokens_sample seems to support only sampling documents, since size cannot be > ndoc(x). Is this something we ought to fix in tokens_sample?
  • Since D is a parameter to be estimated, I relied on the stats::nls function. Is this an acceptable dependency, or must we find an alternative?

Also see McKee, G., Malvern, D., & Richards, B. (2000). Measuring vocabulary diversity using dedicated software. Literary and linguistic computing, 15(3), 323-338.

I think it is the original paper for vocd-D.
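For readers following along, here is a minimal base-R sketch of the kind of fit described above. It is not the code from the commit. It assumes the curve TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1) from McKee, Malvern & Richards (2000), subsample sizes of 35 to 50 tokens, and 100 subsamples per size. The names `ttr_curve` and `estimate_vocd_d`, the starting value D = 10, and the toy data are all assumptions made for illustration.

```r
# Sketch only: `ttr_curve()` and `estimate_vocd_d()` are hypothetical names,
# not quanteda functions.

# Theoretical type-token ratio for a sample of n tokens, given parameter D
ttr_curve <- function(n, D) (D / n) * (sqrt(1 + 2 * n / D) - 1)

estimate_vocd_d <- function(toks, sizes = 35:50, n_samples = 100) {
    stopifnot(is.character(toks), length(toks) >= max(sizes))
    # Mean empirical TTR over repeated subsamples (without replacement)
    # of each size
    mean_ttr <- vapply(sizes, function(n) {
        mean(replicate(n_samples, length(unique(sample(toks, n))) / n))
    }, numeric(1))
    dat <- data.frame(n = sizes, ttr = mean_ttr)
    # Least-squares fit of D; the starting value is an arbitrary assumption
    fit <- stats::nls(ttr ~ ttr_curve(n, D), data = dat, start = list(D = 10))
    unname(stats::coef(fit)["D"])
}

# Toy data: 500 tokens drawn from a 200-word vocabulary with skewed frequencies
set.seed(42)
toy_toks <- sample(paste0("w", 1:200), 500, replace = TRUE, prob = 1 / (1:200))
d_hat <- estimate_vocd_d(toy_toks)
```

Because the estimate depends on random subsamples, `d_hat` will change from run to run unless the seed is fixed before the call.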

@kbenoit
Contributor Author

kbenoit commented Dec 12, 2018

On the first, you can add this function:

library("quanteda")

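tokens_samplefrom <- function(x, size, replace = FALSE) {
    # keep the tokens attributes so they can be restored after sampling
    attrs <- attributes(x)
    # sample `size` tokens within each document
    result <- lapply(unclass(x), sample, size = size, replace = replace)
    attributes(result) <- attrs
    # internal (unexported) quanteda function that recompiles the tokens object
    quanteda:::tokens_recompile(result)
}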

toks <- tokens(c("a b c d e f", "q r s t u v w x"))
set.seed(100)
tokens_samplefrom(toks, size = 3)
## tokens from 2 documents.
## text1 :
## [1] "b" "f" "c"
## 
## text2 :
## [1] "q" "t" "s"

nls() is fine because it's in the (always loaded) stats package.

@jiongweilua

Prof. @kbenoit,

See commit e0f90d0 for my outline code for vocd-D after incorporating tokens_samplefrom and apply, and see commit 67b11c6 for my outline code for HD-D.

Would be great if you could:

  • Review the HD-D code: the formula for HD-D is never explicitly specified in McCarthy & Jarvis (2010), but based on McCarthy & Jarvis (2007), my understanding is that HD-D := sum_over_all_sampsize(ATTR_sampsize * 1/samp_size). I am not 100% sure, though.

  • Advise on how I can construct unit tests for vocd-D: since vocd-D involves sampling, results from R (even with set.seed) will differ somewhat from those of the online platforms. My guess is that we use a large number of samples and specify a tight tolerance for how much D can vary?
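As a point of comparison for the HD-D question, here is a sketch of one common reading of the measure. It is not the code in commit 67b11c6, and it differs from the formula quoted above. In this reading, each type contributes its hypergeometric probability of appearing at least once in a random sample of 42 tokens, scaled by 1/42. The function name `hdd_sketch` and the default sample size are assumptions for illustration, and the reading itself should be checked against McCarthy & Jarvis (2007).

```r
# Sketch only: `hdd_sketch()` is a hypothetical name, and this reading of
# HD-D is an assumption to verify against the paper.
hdd_sketch <- function(toks, sample_size = 42) {
    stopifnot(length(toks) >= sample_size)
    freqs <- as.vector(table(toks))  # frequency of each type
    n_tokens <- length(toks)
    # Probability that a type is absent from a sample drawn without replacement
    p_absent <- stats::dhyper(0, m = freqs, n = n_tokens - freqs,
                              k = sample_size)
    # Each type contributes P(present) * 1 / sample_size
    sum((1 - p_absent) / sample_size)
}

# Limiting cases that hold by construction under this reading:
all_same <- hdd_sketch(rep("a", 100))        # one type only: 1 / 42
all_unique <- hdd_sketch(paste0("w", 1:42))  # every token distinct: 1
```

This version involves no sampling, so it is deterministic and can be tested against exact values, unlike vocd-D.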

@kbenoit
Contributor Author

kbenoit commented Dec 15, 2018

For tests, or examples with anything stochastic, use set.seed().
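A minimal illustration of that advice, using only base R. The helper `draw_tokens` is a hypothetical stand-in for any stochastic step, such as the token sampling inside a vocd-D implementation.

```r
# Hypothetical stand-in for any stochastic step, such as token sampling
draw_tokens <- function(toks, size) sample(toks, size)

toks <- c("a", "b", "c", "d", "e", "f")

# Resetting the seed before each call makes the two draws identical
set.seed(100)
first <- draw_tokens(toks, 3)
set.seed(100)
second <- draw_tokens(toks, 3)

stopifnot(identical(first, second))
```

Exact draws for a given seed can still differ across R versions, since the default sampling algorithm changed in R 3.6.0. A test that hard-codes expected values may therefore need a tolerance or a pinned `RNGkind()`.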

On the HD-D code: I will return to the LD stuff, but if @koheiw and I can agree on the structure of a new function (see quanteda/quanteda#1520 (comment)), then writing those functions will be different (and easier). Let's wait on that issue before I return to this code. In the meantime, I will try to look at McCarthy & Jarvis (2007) to understand HD-D. I think there is code for this somewhere on the Internet, perhaps in the vocd software?

@kbenoit
Contributor Author

kbenoit commented Jan 6, 2019

Working branch for this is dev-MTLD.

@kbenoit kbenoit closed this as completed Jan 6, 2019
@kbenoit kbenoit reopened this Jan 6, 2019
@jiongweilua

@kbenoit Acknowledged!

@kbenoit kbenoit transferred this issue from quanteda/quanteda Dec 14, 2020