Speed improvement for bind_tf_idf #237
Comments
There have been some really big improvements in vctrs and dplyr since this code was originally written, so it would be a great idea for us to update it. 👍
I started working on this today but I noticed that using dplyr more directly is slower in the cases I have tested out:

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)
book_words <- austen_books() |>
  unnest_tokens(word, text) |>
  count(book, word, sort = TRUE)
bench::mark(
  current_tidytext = bind_tf_idf(book_words, word, book, n),
  use_dplyr = book_words |>
    mutate(tf = n / sum(n), .by = "book") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(book) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 28.6ms 29.1ms 33.8 9.13MB 7.25
#> 2 use_dplyr 46.3ms 46.7ms 21.4 6.24MB 25.7

Created on 2023-07-03 with reprex v2.0.2

Let me find a convenient dataset with a lot more short texts to compare.
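For context, both expressions in the benchmark compute the standard tf-idf: the term frequency tf is a term's count divided by its document's total token count, and idf = log(n_documents / n_documents_containing_term). A tiny made-up corpus (hypothetical data, not from this thread) makes the arithmetic easy to check by hand:

```r
library(tibble)
library(tidytext)

# Hypothetical two-document corpus, one row per (book, word) pair,
# in the same shape that count() produces above.
toy <- tibble(
  book = c("A", "A", "B"),
  word = c("cat", "dog", "cat"),
  n    = c(2L, 1L, 3L)
)

bind_tf_idf(toy, word, book, n)
# "cat" appears in both documents, so idf = log(2/2) = 0 and tf_idf = 0.
# "dog" appears only in book A, so idf = log(2/1) = log(2), and
# tf_idf = (1/3) * log(2), because "dog" is 1 of book A's 3 tokens.
```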
Hmmm, it still looks faster to keep as is, even with shorter and more numerous documents:

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
word_counts <- modeldata::tate_text |>
  unnest_tokens(word, title) |>
  count(id, word, sort = TRUE)
bench::mark(
  current_tidytext = bind_tf_idf(word_counts, word, id, n),
  use_dplyr = word_counts |>
    mutate(tf = n / sum(n), .by = "id") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 19.8ms 20.3ms 49.2 4.27MB 6.71
#> 2 use_dplyr 22.6ms 22.7ms 44.0 2.83MB 249.

Created on 2023-07-03 with reprex v2.0.2

@sometimesabird can you show me an example where this would be faster?
Hi, I came across this issue a bit randomly but thought I'd give it a try. I used the text preparation steps described here to get a large enough word count, and I added the collapse package for comparison:

suppressPackageStartupMessages({
library(dplyr)
library(collapse)
library(sotu)
library(readtext)
library(tidytext)
})
file_paths <- sotu_dir()
sotu_texts <- readtext(file_paths)
sotu_whole <- sotu_meta |>
  arrange(president) |>      # sort metadata
  bind_cols(sotu_texts) |>   # combine with texts
  as_tibble()

tidy_sotu <- sotu_whole |>
  unnest_tokens(word, text) |>
  fcount(doc_id, word, sort = TRUE, name = "n")
bench::mark(
  current_tidytext = bind_tf_idf(tidy_sotu, word, doc_id, n),
  use_collapse = tidy_sotu |>
    fgroup_by(doc_id) |>
    fmutate(tf = n / sum(n)) |>
    fungroup() |>
    fcount(word, name = "doc_total", add = TRUE) |>
    fmutate(idf = log(n_distinct(doc_id) / doc_total),
            tf_idf = tf * idf) |>
    fselect(-doc_total),
  use_dplyr = tidy_sotu |>
    mutate(tf = n / sum(n), .by = "doc_id") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(doc_id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 415.5ms 421.6ms 2.37 73.5MB 2.37
#> 2 use_collapse 29.2ms 35.8ms 22.7 27.5MB 11.3
#> 3 use_dplyr 331.5ms 351.5ms 2.84 46.4MB 2.84
Thanks @etiennebacher! I also should try out using vctrs directly for comparison.
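For what it's worth, here is a rough sketch of what a vctrs/base-flavored version could look like. `sketch_tf_idf()` is a hypothetical name, and this is not what `bind_tf_idf()` actually does internally; it leans on `vctrs::vec_group_id()` plus base `rowsum()`, and assumes one row per document-term pair, as `count()` produces:

```r
library(vctrs)

# Sketch only: tf-idf via integer group ids and base rowsum().
sketch_tf_idf <- function(tbl, term, document, n) {
  doc_id  <- vec_group_id(tbl[[document]])  # ids 1..k, by first appearance
  term_id <- vec_group_id(tbl[[term]])

  # rowsum() returns one row per group, ordered by group id, so we can
  # index straight back into it with doc_id / term_id.
  doc_totals <- rowsum(tbl[[n]], doc_id)             # tokens per document
  doc_count  <- rowsum(rep(1L, nrow(tbl)), term_id)  # documents per term

  tbl$tf     <- tbl[[n]] / doc_totals[doc_id]
  tbl$idf    <- log(max(doc_id) / doc_count[term_id])
  tbl$tf_idf <- tbl$tf * tbl$idf
  tbl
}
```

On the data frames above this should agree with the `bind_tf_idf()` values, though I have not benchmarked it against the collapse version.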
Hey, I noticed that bind_tf_idf() doesn't really use dplyr, which has better performance relative to base R. I saw a 30% improvement in speed when computing tf-idf for a corpus of 100,000 tweets using this code: