NASA data.json has changed #72
Comments
Hi @juliasilge, the NASA case study requires just a little correction to work, primarily the dataset ids. Please refer to the comments in the code block below:
library(tidyverse)
library(tidytext)
library(jsonlite)
library(widyr)
library(igraph)
library(ggraph)
set.seed(1234)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
# previously metadata$dataset$`_id`$`$oid`
ids = metadata$dataset$identifier
nasa_title <- tibble(id = ids, title = metadata$dataset$title)
nasa_title <- nasa_title %>%
unnest_tokens(word, title) %>%
anti_join(stop_words, by = "word") %>%
# remove terms v1.0, l2, 0.500, i, ii, ...
filter(!str_detect(word, "^[v|l][0-9]?[\\.[0-9]?]"),
!str_detect(word, "^[0-9]+[\\.[0-9]+]*$"),
!str_detect(word, "^[i]+$"))
# sample outcome
nasa_title %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE) %>%
# reduce threshold from 250 to 150
filter(n > 150) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n),
edge_colour = "navyblue",
show.legend = FALSE) +
geom_node_point(size = 3, col = "darkblue") +
geom_node_text(
aes(label = name),
repel = TRUE,
family = "Menlo",
size = 3,
point.padding = unit(0.2, "lines")
) +
theme_void()
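For anyone whose script has to run against both layouts, here is a minimal sketch of picking the id column by checking which field is present. The two JSON strings are made-up stand-ins for the old and new data.json layouts, and `get_ids()` is a hypothetical helper, not something from the book or from jsonlite.

```r
library(jsonlite)

# Tiny, invented catalogs that mimic the old (`_id`$`$oid`) and new (identifier) layouts.
old_json <- '{"dataset": [{"_id": {"$oid": "abc123"}, "title": "Old record"}]}'
new_json <- '{"dataset": [{"identifier": "xyz789", "title": "New record"}]}'

# Hypothetical helper: return the ids from whichever field the catalog provides.
get_ids <- function(metadata) {
  ds <- metadata$dataset
  if ("identifier" %in% names(ds)) {
    ds$identifier
  } else if ("_id" %in% names(ds)) {
    ds[["_id"]][["$oid"]]
  } else {
    stop("No known id field found in metadata$dataset")
  }
}

old_ids <- get_ids(fromJSON(old_json))
new_ids <- get_ids(fromJSON(new_json))
```

With the real file, the same check would be applied to the result of fromJSON("https://data.nasa.gov/data.json"), assuming the schema still has one of these two fields.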
Thanks so much @tmasjc!
I'm guessing this is related to the JSON change, but I'm not sure. The map of the title_word_pairs works fine, but the map of the desc_word_pairs does not. I can't figure out why. All of this code up to this point works:
##NASA Datamining
But then when I go for the second graph:
It gives this error:
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): edge_alpha = n, edge_width = n. Did you mistype the name of a data column or forget to add after_stat()?
I tried to see whether something was different about the description tibble versus the other tibble, but they look pretty much identical to me:
@walinchus it's because in the new version of the JSON available from NASA's website, the variable is now called description:
library(tidyverse)
library(tidytext)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
metadata <- fromJSON("https://data.nasa.gov/data.json")
## notice `description` now!!
names(metadata$dataset)
#> [1] "accessLevel" "landingPage"
#> [3] "bureauCode" "issued"
#> [5] "@type" "modified"
#> [7] "references" "keyword"
#> [9] "contactPoint" "publisher"
#> [11] "identifier" "description"
#> [13] "title" "programCode"
#> [15] "distribution" "accrualPeriodicity"
#> [17] "theme" "citation"
#> [19] "temporal" "spatial"
#> [21] "language" "data-presentation-form"
#> [23] "release-place" "series-name"
#> [25] "creator" "graphic-preview-description"
#> [27] "graphic-preview-file" "editor"
#> [29] "issue-identification" "describedBy"
#> [31] "dataQuality" "describedByType"
#> [33] "license" "rights"
metadata_wrangled <- as_tibble(metadata$dataset) %>%
select(title, description, keyword) %>%
mutate(id = row_number())
library(widyr)
desc_word_pairs <- metadata_wrangled %>%
unnest_tokens(word, description) %>%
anti_join(get_stopwords()) %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
#> Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
#> Please use `distinct()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Joining, by = "word"
desc_word_pairs
#> # A tibble: 23,412,441 x 3
#> item1 item2 n
#> <chr> <chr> <dbl>
#> 1 data set 4982
#> 2 contains data 4414
#> 3 data 2 4394
#> 4 data system 4219
#> 5 data product 4132
#> 6 data using 4122
#> 7 data 1 4039
#> 8 data used 3899
#> 9 data resolution 3889
#> 10 data instrument 3725
#> # … with 23,412,431 more rows
Created on 2021-01-22 by the reprex package (v0.3.0)
Ah great thanks.
Oh no wait I already switched out "description" for "desc." (See my first post). Hmm any other ideas?
Ah, I apologize; it wasn't quite clear where things were going wrong. The key to finding where things were going wrong is to look at
If you instead filter down to things above 2000, you get a more reasonable plot:
library(tidyverse)
library(tidytext)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
metadata <- fromJSON("https://data.nasa.gov/data.json")
metadata_wrangled <- as_tibble(metadata$dataset) %>%
select(title, description, keyword) %>%
mutate(id = row_number())
library(widyr)
desc_word_pairs <- metadata_wrangled %>%
unnest_tokens(word, description) %>%
anti_join(get_stopwords()) %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
#> Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
#> Please use `distinct()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Joining, by = "word"
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:dplyr':
#>
#> as_data_frame, groups, union
#> The following objects are masked from 'package:purrr':
#>
#> compose, simplify
#> The following object is masked from 'package:tidyr':
#>
#> crossing
#> The following object is masked from 'package:tibble':
#>
#> as_data_frame
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
library(ggraph)
set.seed(1234)
desc_word_pairs %>%
filter(n >= 2000) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Created on 2021-01-22 by the reprex package (v0.3.0)
In the future, it would be great to create a reprex (a minimal reproducible example) for something like this. The goal of a reprex is to make it easier for someone to recreate your problem so that they/I can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:
install.packages("reprex")
Thanks! 🙌
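Since the fix above comes down to choosing a filter threshold, one way to pick it is to count how many word pairs survive each candidate cutoff before plotting. This is only a sketch on invented toy data; `toy_words`, `toy_pairs`, and the thresholds are assumptions for illustration, not values from the NASA metadata.

```r
library(dplyr)
library(tibble)
library(widyr)

# Invented toy corpus: three "documents" with a few words each.
toy_words <- tibble(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  word = c("data", "set", "nasa", "data", "set", "data", "nasa", "set")
)

# Same call as in the thread, applied to the toy data.
toy_pairs <- toy_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

# For each candidate threshold, count the pairs (graph edges) that would remain.
thresholds <- c(1, 2, 3)
edges_kept <- sapply(thresholds, function(t) sum(toy_pairs$n >= t))
threshold_summary <- tibble(threshold = thresholds, edges_kept = edges_kept)
threshold_summary
```

On the real desc_word_pairs, the same summary with thresholds such as 250, 1000, and 2000 would show how quickly the edge count drops, which helps choose a cutoff that gives a readable network.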
Will do thanks. I haven't heard of reprex before but it sounds great. I am still learning R thanks to the help of great books like this one!
Update id definition to reflect current NASA metadata (metadata$dataset$identifier) as per issue dgrtwo#72
Good afternoon,
Mistake: It seems to me that the problem with replacing _id with identifier is still haunting us
@Oleh-Zaritskyi I believe a many-to-many relationship here is expected, so you will want to specify that. Also, note that the NASA
library(tidyverse)
library(tidytext)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
metadata <- fromJSON("https://data.nasa.gov/data.json")
metadata_wrangled <- as_tibble(metadata$dataset) |>
select(title, description, keyword) |>
mutate(id = row_number())
desc_tf_idf <- metadata_wrangled |>
unnest_tokens(word, description) |>
anti_join(stop_words) |>
count(id, word, sort = TRUE) |>
bind_tf_idf(word, id, n)
#> Joining with `by = join_by(word)`
nasa_keyword <- metadata_wrangled |>
unnest(keyword) |>
select(id, keyword)
full_join(desc_tf_idf, nasa_keyword, relationship = "many-to-many")
#> Joining with `by = join_by(id)`
#> # A tibble: 6,194,658 × 7
#> id word n tf idf tf_idf keyword
#> <int> <chr> <int> <dbl> <dbl> <dbl> <chr>
#> 1 9987 gt 96 0.201 5.29 1.06 active
#> 2 9987 gt 96 0.201 5.29 1.06 gmat
#> 3 9987 gt 96 0.201 5.29 1.06 goddard space flight center
#> 4 9987 gt 96 0.201 5.29 1.06 project
#> 5 9987 lt 96 0.201 5.33 1.07 active
#> 6 9987 lt 96 0.201 5.33 1.07 gmat
#> 7 9987 lt 96 0.201 5.33 1.07 goddard space flight center
#> 8 9987 lt 96 0.201 5.33 1.07 project
#> 9 16591 gt 94 0.188 5.29 0.997 sbir/sttr
#> 10 16591 gt 94 0.188 5.29 0.997 nasa headquarters
#> # ℹ 6,194,648 more rows
Created on 2024-05-15 with reprex v2.1.0
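To see why a many-to-many relationship is expected here, a small sketch on invented data may help: each id has several words and several keywords, so every word row pairs with every keyword row for that id. The tibbles below are made up, and the `relationship` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Invented stand-ins for desc_tf_idf and nasa_keyword: ids repeat on both sides.
toy_tf_idf  <- tibble(id = c(1L, 1L, 2L), word = c("gt", "lt", "gt"))
toy_keyword <- tibble(id = c(1L, 1L, 2L), keyword = c("active", "gmat", "project"))

# Declaring the relationship tells dplyr the row multiplication is intended.
joined <- full_join(toy_tf_idf, toy_keyword,
                    by = "id", relationship = "many-to-many")
joined
```

Id 1 contributes 2 words times 2 keywords and id 2 contributes 1 word times 1 keyword, which is the same multiplication that produces the several million rows in the output above.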
@juliasilge Thank you very much. I'll try. Thank you for your great work, sorry for disturbing you
@juliasilge Thank you, I have almost reached the final topic.
@Oleh-Zaritskyi You need to convert one of those columns to be the same type as the other one, using
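The function named in the comment above did not survive in this copy of the thread, so the following is only a sketch of the general idea under my own assumptions: the tibbles are invented, and `as.integer()` inside `mutate()` is one possible conversion, not necessarily the one originally suggested.

```r
library(dplyr)
library(tibble)

# Invented example: the key is an integer on one side and a character on the other.
left  <- tibble(id = 1:3, word = c("data", "set", "nasa"))
right <- tibble(id = c("1", "2", "3"), keyword = c("earth", "ocean", "project"))

# Joining left and right directly would fail because the id types are incompatible.
# Assumed fix: convert one key so both sides share the same type.
right_fixed <- right %>% mutate(id = as.integer(id))

joined <- inner_join(left, right_fixed, by = "id")
joined
```

Converting in the other direction with as.character() on the integer side would work just as well; which one fits depends on what the ids look like in the real data.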
@juliasilge Thank you, everything works correctly
The data.json made available by NASA has changed its schema so we likely want to update the analysis at some point.