MATTR calculation defaults to wrong window length if provided window length exceeds document length #60

Open
mweylandt opened this issue Sep 1, 2023 · 4 comments


@mweylandt

mweylandt commented Sep 1, 2023

Hello,

Apologies if my issue is based on a misunderstanding.

When I use textstat_lexdiv to calculate MATTR, and a document is shorter than the MATTR_window specified as an argument to the function, the function throws an error.

This is because the function (compute_mattr) checks for this case and resets MATTR_window to the length of the longest document in the corpus. Passing that value to tokens_ngrams further down produces a list with empty entries for every shorter document, which trips up the TTR calculation and causes the error.
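To make the failure mode concrete, here is a minimal base-R sketch of how many full windows (n-grams) fit in a document. The helper `n_windows` and the two document lengths are hypothetical and chosen only for illustration; they are not taken from quanteda or from the example corpus.

```r
# Hypothetical helper: number of contiguous windows of size `window`
# that fit in a document of `n_tokens` tokens (never below zero).
n_windows <- function(n_tokens, window) pmax(n_tokens - window + 1L, 0L)

# Assumed token counts for two illustrative documents
lens <- c(short = 20L, long = 33L)

n_windows(lens, window = max(lens))  # short: 0, long: 1
n_windows(lens, window = min(lens))  # short: 1, long: 14
```

With the window reset to the longest length, the shorter document is left with zero windows, which corresponds to the empty entry that the TTR step cannot handle.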

I believe the window should instead be set to the length of the shortest document in the corpus. Since MATTR is calculated by averaging the TTRs of a moving window across the document, it seems reasonable for that window to be no longer than the shortest document. An alternative would be to rewrite the function so that it returns NA for documents that are too short to calculate this value.
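For reference, the moving-window average described here can be sketched in base R for a single document. `mattr_one` is a hypothetical helper written for illustration, not the quanteda.textstats implementation, and it assumes the document is already a character vector of tokens at least as long as the window.

```r
# Hypothetical helper: MATTR of one tokenized document.
# Assumes 1 <= window <= length(toks).
mattr_one <- function(toks, window) {
  stopifnot(window >= 1L, length(toks) >= window)
  starts <- seq_len(length(toks) - window + 1L)
  # TTR of each window = distinct tokens in the window / window size
  ttrs <- vapply(starts, function(i) {
    length(unique(toks[i:(i + window - 1L)])) / window
  }, numeric(1))
  mean(ttrs)
}

toks <- c("shrimp", "stew", "shrimp", "salad", "shrimp", "soup")
mattr_one(toks, window = 3L)  # window TTRs 2/3, 1, 2/3, 1; mean 5/6
```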

Reproducible Example

library(quanteda)            # tokens()
library(quanteda.textstats)  # textstat_lexdiv()

txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can
          barbecue it, boil it, broil it, bake it, saute it.",
         "There's shrimp-kabobs,
          shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
          pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
          shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
          sandwich.")
# The default MATTR_window (100) exceeds both documents' token counts
textstat_lexdiv(tokens(txt), measure = c("TTR", "CTTR", "K", "MATTR"))

# Error in textstat_lexdiv.dfm(dfm(tokens(y)), "TTR") : 
#  dfm must have at least one non-zero value
# In addition: Warning message:
# MATTR_window exceeds some documents' token lengths, resetting to 33 

More Details

I'm including the original function below, with the suggested fix in a comment.

function (x, MATTR_window = 100L) 
{
  if (MATTR_window < 1) 
    stop("MATTR_window must be positive")
  if (any(ntoken(x) < MATTR_window)) {
    MATTR_window <- max(ntoken(x)) # this should be min(ntoken(x))
    warning("MATTR_window exceeds some documents' token lengths, resetting to ", 
            MATTR_window, call. = FALSE)
  }
  x <- tokens_ngrams(x, n = MATTR_window, concatenator = " ")  
  temp <- lapply(as.list(x), function(y) textstat_lexdiv(dfm(tokens(y)), 
                                                         "TTR")[["TTR"]])
  result <- unlist(lapply(temp, mean))
  return(result)
}

Again, if I've misunderstood any conceptual issue (which may well be the case, as the same process is applied to MSTTR), apologies; I'm new to these text diversity measures. If not, I'm happy to open a pull request if that saves you some time!

@kbenoit
Contributor

kbenoit commented Sep 5, 2023

Thanks for pointing this out, I'll fix it asap.

@mweylandt
Author

Hello, thanks for responding!

I've thought about this some more, and the fix I suggested may also be inadequate. One could come across a case (as I have recently) where document lengths vary very widely. In practical terms, the calculation could then fall back to a window size of 1 or 2 for the whole corpus, which would render MATTR and MSTTR meaningless as well. I wonder if it would make sense to write it so that any document with fewer tokens than the window width simply doesn't get a MATTR/MSTTR, rather than one based on a window equal to the minimum document length.
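A small base-R sketch of why a tiny fallback window is uninformative, using a made-up document rather than real data: with a window of 1, every window holds exactly one distinct token, so each window TTR is 1 and the moving average is 1 regardless of how repetitive the text is.

```r
# Made-up document: the same token repeated 20 times
doc <- rep("shrimp", 20)
window <- 1L

# TTR of each size-1 window is always 1/1
ttrs <- vapply(seq_len(length(doc) - window + 1L), function(i) {
  length(unique(doc[i:(i + window - 1L)])) / window
}, numeric(1))

mattr_w1 <- mean(ttrs)                        # 1
plain_ttr <- length(unique(doc)) / length(doc)  # 0.05
```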

Would appreciate your thoughts, and be happy to assist if it is possible to do so! Thanks for a great set of tools.

@kbenoit
Contributor

kbenoit commented Sep 5, 2023

That's a good idea - set a minimum document length below which a document has an NA returned for a moving average measure.
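One way this could look, as a rough base-R sketch rather than a proposal for the actual quanteda.textstats code: keep the requested window for every document and return NA where the document is shorter than it. The function name `mattr_or_na` and the list-of-token-vectors input are assumptions made for illustration.

```r
# Hypothetical sketch: per-document MATTR with NA for documents that are
# shorter than the window. `docs` is assumed to be a named list of
# character vectors (one vector of tokens per document).
mattr_or_na <- function(docs, window = 100L) {
  stopifnot(is.list(docs), window >= 1L)
  vapply(docs, function(toks) {
    n <- length(toks)
    if (n < window) return(NA_real_)  # too short: no full window fits
    ttrs <- vapply(seq_len(n - window + 1L), function(i) {
      length(unique(toks[i:(i + window - 1L)])) / window
    }, numeric(1))
    mean(ttrs)
  }, numeric(1))
}

docs <- list(
  text1 = c("fish", "sticks"),
  text2 = c("shrimp", "stew", "shrimp", "salad", "shrimp", "soup")
)
res <- mattr_or_na(docs, window = 3L)  # text1 is NA, text2 is 5/6
```

The window is never shrunk here, so values stay comparable across the documents that do get one.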

@mweylandt
Author

I thought I would share how I ended up doing it for my project, in case it's helpful.

I simply check whether the dfm is empty in the function that calculates MATTR, and then return NA for it.

library(quanteda)  # ntoken(), tokens_ngrams(), dfm(), tokens()

compute_mattr <- function(x, MATTR_window = 100L, min_window = 5L)
{
  if (MATTR_window < 1) 
    stop("MATTR_window must be positive")
  if (any(ntoken(x) < MATTR_window)) {
    MATTR_window <- min_window
    warning("MATTR_window exceeds some documents' token lengths, resetting to minimum window size: ", 
            min_window, call. = FALSE)
  }
  if (any(ntoken(x) < min_window)) {
    warning("min_window exceeds some documents' token lengths, these documents will return NA", 
            call. = FALSE)
  }
  
  
  x <- tokens_ngrams(x, n = MATTR_window, concatenator = " ")  
  
  # check whether the dfm is empty and return NA, else go on as previously
  check_dfm <- function(y) {
    txdfm <- dfm(tokens(y))
    if (!sum(txdfm)) return(NA)
    quanteda.textstats::textstat_lexdiv(txdfm, "TTR")[["TTR"]]
  }
  
  temp <- lapply(as.list(x), check_dfm)
  result <- unlist(lapply(temp, mean))
  
  return(result)
}

txt <- c("fish sticks",
         "Anyway, like I was sayin', shrimp is the fruit of the sea. You can
          barbecue it, boil it, broil it, bake it, saute it.",
         "There's shrimp-kabobs,
          shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
          pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
          shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
          sandwich.")

toks <- tokens(txt)

#> compute_mattr(toks, MATTR_window = 35, min_window = 5)
#    text1     text2     text3 
#       NA 0.9057471 0.8574074 
# Warning messages:
# 1: MATTR_window exceeds some documents' token lengths, resetting to minimum window size: 5 
# 2: min_window exceeds some documents' token lengths, these documents will return NA 

I worried that these checks would slow the function down on large corpora, but in my (limited) tests it seems fine.

The other alternative is to allow textstat_lexdiv to pass an empty dfm along to compute_lexdiv_dfm_stats. Currently textstat_lexdiv checks for this and throws an error, and in my tests so far compute_lexdiv_dfm_stats can't handle an empty dfm in any case.

Just thought I'd put this here in case it's helpful.
