option to include 0s when converting textstat_simil to data.frame #18

Open
dhalpern opened this issue Jul 15, 2019 · 3 comments

Comments

@dhalpern

Requested feature

Currently, when a textstat_simil matrix is converted to a data.frame, any pairs with a similarity of 0 are dropped. Zeros can be substantively important, though, so it would be nice to have an option that keeps them.

This is the current behavior:

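library(dplyr)               # tibble() and %>%
library(tidytext)            # cast_dfm()
library(quanteda)
library(quanteda.textstats)  # provides textstat_simil() in quanteda >= 3.0

# Five documents with two words each; every word occurs once per document
dat <- tibble(doc = rep(1:5, each = 2),
              word = c("a", "b",
                       "a", "c",
                       "a", "c",
                       "b", "e",
                       "b", "f"),
              count = rep(1, 10))

# Cast to a dfm, then compute pairwise cosine similarity between documents
tstat_mat <- dat %>%
    cast_dfm(doc, word, count) %>%
    textstat_simil(method = "cosine", margin = "documents")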

tstat_mat
textstat_simil object; method = "cosine"
    1   2   3   4   5
1 1.0 0.5 0.5 0.5 0.5
2 0.5 1.0 1.0   0   0
3 0.5 1.0 1.0   0   0
4 0.5   0   0 1.0 0.5
5 0.5   0   0 0.5 1.0

tstat_mat %>% as.data.frame()
  document1 document2 cosine
1         1         2    0.5
2         1         3    0.5
3         2         3    1.0
4         1         4    0.5
5         1         5    0.5
6         4         5    0.5

It would be great to have an option for the data.frame to include all document pairs, with 0s where the similarity is zero.
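
Until such an option exists, one possible workaround is to build the pair list from a dense matrix directly. The sketch below uses base R only and rebuilds the similarity matrix by hand from the printed output above; with a real object, something like as.matrix(tstat_mat) would presumably supply it, but that is an assumption here. The helper name simil_to_df_with_zeros() is hypothetical and not part of quanteda.

```r
# Similarity matrix copied from the printed output in this issue (zeros kept).
sim <- matrix(
  c(1.0, 0.5, 0.5, 0.5, 0.5,
    0.5, 1.0, 1.0, 0.0, 0.0,
    0.5, 1.0, 1.0, 0.0, 0.0,
    0.5, 0.0, 0.0, 1.0, 0.5,
    0.5, 0.0, 0.0, 0.5, 1.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), as.character(1:5))
)

# Hypothetical helper (not a quanteda function): list every unique document
# pair from a symmetric similarity matrix, including pairs that score 0.
simil_to_df_with_zeros <- function(m, value_name = "cosine") {
  # (row, col) positions above the diagonal, i.e. each pair exactly once
  idx <- which(upper.tri(m), arr.ind = TRUE)
  out <- data.frame(
    document1 = rownames(m)[idx[, "row"]],
    document2 = colnames(m)[idx[, "col"]],
    stringsAsFactors = FALSE
  )
  out[[value_name]] <- m[idx]  # matrix indexing picks one value per pair
  out
}

pairs_df <- simil_to_df_with_zeros(sim)
pairs_df
```

For five documents this gives all ten pairs, with the four zero-similarity pairs (2-4, 3-4, 2-5, 3-5) present rather than dropped.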

Use case

Similarities of 0 might be substantively interesting

Additional context

@kbenoit
Contributor

kbenoit commented Jul 25, 2019

@koheiw we could probably add this as an option (to keep 0s) to proxy2triplet(), and then add an include_zeros = FALSE argument to as.data.frame.textstat_proxy().

@koheiw
Collaborator

koheiw commented Jul 30, 2019

I think when min_simil is not used, as.data.frame.textstat_proxy() should return all the values.

@kbenoit
Contributor

kbenoit commented Jul 31, 2019

Makes sense to me, and this treats the . as missing rather than zero when min_simil is used.
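
As a rough illustration of the behaviour agreed on in this thread, the base R sketch below returns every pair (zeros included) when no threshold is given, and omits pairs below the threshold when one is. It is not quanteda's implementation: pairs_from_simil() and the toy matrix are hypothetical, and only the argument name min_simil is taken from the discussion.

```r
# Hypothetical sketch of the proposed semantics; not quanteda code.
pairs_from_simil <- function(m, min_simil = NULL) {
  idx <- which(upper.tri(m), arr.ind = TRUE)  # each unique pair once
  out <- data.frame(
    document1 = rownames(m)[idx[, "row"]],
    document2 = colnames(m)[idx[, "col"]],
    simil = m[idx],
    stringsAsFactors = FALSE
  )
  if (!is.null(min_simil)) {
    # With a threshold, pairs below it count as missing, so they are
    # omitted instead of being reported as 0.
    out <- out[out$simil >= min_simil, , drop = FALSE]
    rownames(out) <- NULL
  }
  out
}

# Toy 3-document similarity matrix (made up for this example)
docs <- c("d1", "d2", "d3")
toy <- matrix(c(1.0, 0.5, 0.0,
                0.5, 1.0, 0.0,
                0.0, 0.0, 1.0),
              nrow = 3, byrow = TRUE, dimnames = list(docs, docs))

all_pairs   <- pairs_from_simil(toy)                   # zeros kept
thresholded <- pairs_from_simil(toy, min_simil = 0.4)  # low pairs omitted
```

Here all_pairs has three rows, two of them with a similarity of 0, while thresholded keeps only the d1-d2 pair.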

@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020