renaming nearest_to to closest_to for back-compatibility

bmschmidt · Mar 13, 2017 · b479c76
1 parent 12dcfa5
commit b479c76
Showing 19 changed files with 218 additions and 173 deletions.
diff --git a/NAMESPACE b/NAMESPACE
@@ -2,6 +2,7 @@
 
 export("%>%")
 export(as.VectorSpaceModel)
+export(closest_to)
 export(cosineDist)
 export(cosineSimilarity)
 export(distend)

diff --git a/NEWS.md b/NEWS.md
@@ -1,26 +1,29 @@
 # VERSION 2.0
 
-Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `nearest_to`, which now returns a data.frame.
+Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `closest_to`, which now returns a data.frame.
 
 # Changes
 
-## Change in nearest_to behavior.
+## New default function: closest_to.
 
-There's a change in `nearest_to` that will break some existing code. Now it returns a data.frame instead of a list. The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot. There are flags to return to the old behavior (`as_df=FALSE`).
+`nearest_to` was previously the easiest way to interact with cosine similarity functions. It has been deprecated
+in favor of a new function, `closest_to`, which returns a data.frame. (I would have kept the old name, but changing what it returns would have broken existing code.)
+The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot.
+`nearest_to` is now just a wrapper around the new function.
 
 ## New syntax for vector addition.
 
 This package now allows formula scoping for the most common operations, and string inputs that are looked up in the context of a particular matrix. This makes the bread-and-butter word2vec operations much nicer to write.
 
 For instance, instead of writing 
 ```R
-vectors %>% nearest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])
+vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])
 ```
 
 (whew!), you can write
 
 ```R
-vectors %>% nearest_to(~"king" - "man" + "woman")
+vectors %>% closest_to(~"king" - "man" + "woman")
 ```
 
 

diff --git a/R/matrixFunctions.R b/R/matrixFunctions.R
@@ -12,11 +12,11 @@
 #'
 #' @examples
 #'
-#' nearest_to(demo_vectors,"great")
+#' closest_to(demo_vectors,"great")
 #' # stopwords like "and" and "very" are no longer top ten.
 #' # I don't know if this is really better, though.
 #'
-#' nearest_to(improve_vectorspace(demo_vectors),"great")
+#' closest_to(improve_vectorspace(demo_vectors),"great")
 #'
 improve_vectorspace = function(vectorspace,D=round(ncol(vectorspace)/100)) {
   mean = methods::new("VectorSpaceModel",
@@ -531,9 +531,9 @@ filter_to_rownames <- function(matrix,words) {
 #' subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=FALSE]]
 #' similarities = cosineSimilarity(subjects,subjects)
 #'
-#' # Use 'nearest_to' to build up a large list of similar words to a seed set.
+#' # Use 'closest_to' to build up a large list of similar words to a seed set.
 #' subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=TRUE]]
-#' new_subject_list = nearest_to(demo_vectors,subjects,20)
+#' new_subject_list = closest_to(demo_vectors,subjects,20)
 #' new_subjects = demo_vectors[[new_subject_list$word,average=FALSE]]
 #'
 #' # Plot the cosineDistance of these as a dendrogram.
@@ -637,10 +637,10 @@ project = function(matrix,vector) {
 #' See `project` for more details.
 #'
 #' @examples
-#' nearest_to(demo_vectors,demo_vectors[["man"]])
+#' closest_to(demo_vectors,demo_vectors[["man"]])
 #'
 #' genderless = reject(demo_vectors,demo_vectors[["he"]] - demo_vectors[["she"]])
-#' nearest_to(genderless,genderless[["man"]])
+#' closest_to(genderless,genderless[["man"]])
 #'
 #' @export
 reject = function(matrix,vector) {
@@ -673,12 +673,12 @@ reject = function(matrix,vector) {
 #' See `project` for more details and usage.
 #'
 #' @examples
-#' nearest_to(demo_vectors,"sweet")
+#' closest_to(demo_vectors,"sweet")
 #'
 #' # Stretch out the vectorspace 4x longer along the gender direction.
 #' more_sexist = distend(demo_vectors, ~ "man" + "he" - "she" -"woman", 4)
 #'
-#' nearest_to(more_sexist,"sweet")
+#' closest_to(more_sexist,"sweet")
 #'
 #' @export
 distend = function(matrix,vector, multiplier) {
@@ -692,7 +692,6 @@ distend = function(matrix,vector, multiplier) {
 #' @param vector  A vector (or a string or a formula coercible to a vector)
 #' of the same length as the VectorSpaceModel. See below.
 #' @param n The number of closest words to include.
-#' @param as_df Return as a data.frame? If false, returns a named vector, for back-compatibility.
 #' @param fancy_names If true (the default) the data frame will have descriptive names like
 #' 'similarity to "king+queen-man"'; otherwise, just 'similarity.' The default can speed up
 #'  interactive exploration.
@@ -704,8 +703,8 @@ distend = function(matrix,vector, multiplier) {
 #' 'cosineSimilarity'; the listing of several words similar to a given vector.
 #' Unlike cosineSimilarity, it returns a data.frame object instead of a matrix.
 #' cosineSimilarity is more powerful, because it can compare two matrices to
-#' each other; nearest_to can only take a vector or vectorlike object as its second argument.
-#' But with (or without) the argument n=Inf, nearest_to is often better for
+#' each other; closest_to can only take a vector or vectorlike object as its second argument.
+#' But with (or without) the argument n=Inf, closest_to is often better for
 #' plugging directly into a plot.
 #'
 #' As with cosineSimilarity, the second argument can take several forms. If it's a vector or
@@ -717,26 +716,26 @@ distend = function(matrix,vector, multiplier) {
 #' @examples
 #'
 #' # Synonyms and similar words
-#' nearest_to(demo_vectors,demo_vectors[["good"]])
+#' closest_to(demo_vectors,demo_vectors[["good"]])
 #'
 #' # If 'matrix' is a VectorSpaceModel object,
 #' # you can also just enter a string directly, and
 #' # it will be evaluated in the context of the passed matrix.
 #'
-#' nearest_to(demo_vectors,"good")
+#' closest_to(demo_vectors,"good")
 #'
 #' # You can also express more complicated formulas.
 #'
-#' nearest_to(demo_vectors,"good")
+#' closest_to(demo_vectors,"good")
 #'
 #' # Something close to the classic king:man::queen:woman;
 #' # What's the equivalent word for a female teacher that "guy" is for
 #' # a male one?
 #'
-#' nearest_to(demo_vectors,~ "guy" - "man" + "woman")
+#' closest_to(demo_vectors,~ "guy" - "man" + "woman")
 #'
 #' @export
-nearest_to = function(matrix, vector, n=10, as_df = TRUE, fancy_names = TRUE) {
+closest_to = function(matrix, vector, n=10, fancy_names = TRUE) {
   label = deparse(substitute(vector),width.cutoff=500)
   if (substr(label,1,1)=="~") {label = substr(label,2,500)}
 
@@ -749,20 +748,37 @@ nearest_to = function(matrix, vector, n=10, as_df = TRUE, fancy_names = TRUE) {
   # For sorting.
   ords = order(-sims[,1])
 
-  if (!as_df) {
-    structure(
-      1-sims[ords[1:n]], # Convert from similarity to distance.
-      names=rownames(sims)[ords[1:n]])
+  return_val = data.frame(rownames(sims)[ords[1:n]], sims[ords[1:n]],stringsAsFactors=FALSE)
+  if (fancy_names) {
+    names(return_val) = c("word", paste("similarity to", label))
   } else {
-    return_val = data.frame(rownames(sims)[ords[1:n]], sims[ords[1:n]],stringsAsFactors=FALSE)
-    if (fancy_names) {
-      names(return_val) = c("word", paste("similarity to", label))
-    } else {
-      names(return_val) = c("word","similarity")
-    }
-    rownames(return_val) = NULL
-    return_val
+    names(return_val) = c("word","similarity")
   }
+  rownames(return_val) = NULL
+  return_val
 }
 
 
+#' Nearest vectors to a word
+#'
+#' @description This is a wrapper around closest_to, included for back-compatibility. Use
+#' closest_to for new applications.
+#' @param ... See `closest_to`
+#'
+#' @return a named vector of cosine distances (1 - similarity). See 'closest_to' for more details.
+#' @export
+#'
+#' @examples
+#'
+#' # Recommended usage in 1.0:
+#' nearest_to(demo_vectors, demo_vectors[["good"]])
+#'
+#' # Recommended usage in 2.0:
+#' demo_vectors %>% closest_to("good")
+#'
+nearest_to = function(...) {
+  vals = closest_to(...,fancy_names = FALSE)
+  returnable = 1 - vals$similarity
+  names(returnable) = vals$word
+  returnable
+}
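
The relationship between the two return shapes can be sketched in self-contained base R. This is only an illustration: `toy_closest_to` and `toy_nearest_to` are hypothetical stand-ins written for this note, not the package's functions, and the toy matrix values are invented.

```r
# Toy 2-d "embedding"; rows are words. Values are made up for illustration.
toy = matrix(c( 1, 0,
                1, 1,
                0, 1,
               -1, 0),
             ncol = 2, byrow = TRUE,
             dimnames = list(c("east", "northeast", "north", "west"), NULL))

# Hypothetical stand-in for the new-style function: returns a data.frame
# of the n rows with the highest cosine similarity to `vector`.
toy_closest_to = function(matrix, vector, n = 10) {
  sims = as.vector(matrix %*% vector) /
    (sqrt(rowSums(matrix^2)) * sqrt(sum(vector^2)))
  ords = head(order(-sims), n)  # head() guards against n > nrow(matrix)
  data.frame(word = rownames(matrix)[ords],
             similarity = sims[ords],
             stringsAsFactors = FALSE)
}

# Hypothetical stand-in for the legacy wrapper: same ranking, but returned
# as a named vector of distances (1 - similarity).
toy_nearest_to = function(...) {
  vals = toy_closest_to(...)
  stats::setNames(1 - vals$similarity, vals$word)
}

new_style = toy_closest_to(toy, toy["east", ], n = 3)
old_style = toy_nearest_to(toy, toy["east", ], n = 3)
```

Here `new_style` is a data.frame with `word` and `similarity` columns, while `old_style` carries the same words as names and stores distances rather than similarities, which is the difference existing callers of the old function would rely on.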
diff --git a/README.md b/README.md
@@ -9,8 +9,8 @@ An R package for building and exploring word embedding models.
 This package does three major things to make it easier to work with word2vec and other vectorspace models of language.
 
 1. [Trains word2vec models](#creating-text-vectors) using an extended version of Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only *part* of a model (rows or columns) so you can explore a model in memory-limited situations.
-2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing `model[rownames(model)=="king",]`, you can write `model[["king"]]`, and instead of writing `vectors %>% nearest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), you can write
-`vectors %>% nearest_to(~"king" - "man" + "woman")`.
+2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing `model[rownames(model)=="king",]`, you can write `model[["king"]]`, and instead of writing `vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), you can write
+`vectors %>% closest_to(~"king" - "man" + "woman")`.
 3. [Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection](#useful-matrix-operations) with some caching that makes them much faster than the simplest implementations.
 
 ### Quick start
@@ -85,7 +85,7 @@ Each takes a `VectorSpaceModel` as its first argument. Sometimes, it's appropria
 
   * `cosineSimilarity(VSM_1,VSM_2)` calculates the cosine similarity of every vector in one vector space model to every vector in another. This is `n^2` complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest.
   * `cosineDistance(VSM_1,VSM_2)` is the inverse of cosineSimilarity. It's not really a distance metric, but can be used as one for clustering and the like.
-  * `nearest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`, of finding the top `n` terms in a `VectorSpaceModel` closest to term m
+  * `closest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`, of finding the top `n` terms in a `VectorSpaceModel` closest to term m
   * `project(VSM,vector)` takes a `VectorSpaceModel` and returns the portion parallel to the vector `vector`. 
   * `reject(VSM,vector)` is the inverse of `project`; it takes a `VectorSpaceModel` and returns the portion orthogonal to the vector `vector`. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning.
   * `magnitudes` calculates the magnitude of each element in a VSM. This is useful in many operations.
@@ -98,15 +98,15 @@ not_that_kind_of_bank = chronam_vectors[["bank"]] %>%
       reject(chronam_vectors[["cashier"]]) %>% 
       reject(chronam_vectors[["depositors"]]) %>%   
       reject(chronam_vectors[["check"]])
-chronam_vectors %>% nearest_to(not_that_kind_of_bank)
+chronam_vectors %>% closest_to(not_that_kind_of_bank)
 ```
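
The rejection chained in the example above can be illustrated for a single pair of plain numeric vectors. This is a minimal sketch in base R: `project_vec` and `reject_vec` are hypothetical helper names, and the package's own `project` and `reject` operate on `VectorSpaceModel` objects and may differ in detail.

```r
# Component of `a` parallel to `b`.
project_vec = function(a, b) (sum(a * b) / sum(b * b)) * b

# Component of `a` orthogonal to `b`: what is left after removing the projection.
reject_vec = function(a, b) a - project_vec(a, b)

a = c(3, 4)
b = c(1, 0)

parallel_part   = project_vec(a, b)  # lies along b
orthogonal_part = reject_vec(a, b)   # has zero dot product with b
```

The two parts add back up to `a`, so rejecting a direction discards exactly the portion of the vector that points along it and nothing else.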
 
 These functions also allow an additional layer of syntactic sugar when working with word vectors. 
 
 If you're working entirely with a single model, you don't have to keep referring to the model to look up each word; instead, you can use a formula interface to reduce typing and increase clarity.
 
 ```{r}
-vectors %>% nearest_to(~ "king" - "man" + "woman")
+vectors %>% closest_to(~ "king" - "man" + "woman")
 ```
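
One way a formula interface like this could work is to walk the formula's expression and swap each string literal for the matching row of the model before evaluating it. The sketch below is an assumption about the general technique, not the package's implementation; `substitute_words`, `eval_word_formula`, and the toy matrix are all hypothetical.

```r
# Toy embedding; values are invented for illustration.
toy = rbind(king = c(1, 1), man = c(1, 0), woman = c(0, 1), queen = c(0, 2))

# Replace every string literal in an expression with the matching row of `m`.
substitute_words = function(expr, m) {
  if (is.character(expr)) return(m[expr, ])
  if (is.call(expr)) {
    # Keep the function itself (e.g. `+`, `-`, `(`) and recurse into its arguments.
    for (i in seq_along(expr)[-1]) {
      expr[[i]] = substitute_words(expr[[i]], m)
    }
  }
  expr
}

eval_word_formula = function(m, formula) {
  rhs = formula[[2]]  # for a one-sided formula, element 2 is the right-hand side
  eval(substitute_words(rhs, m))
}

result  = eval_word_formula(toy, ~ "king" - "man" + "woman")
grouped = eval_word_formula(toy, ~ "king" + ("woman" - "man"))
```

Because the strings are resolved against whichever matrix is passed in, the same formula can be reused across models without naming the model inside it.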
 
 

diff --git a/inst/doc/exploration.R b/inst/doc/exploration.R
@@ -6,33 +6,33 @@ library(magrittr)
 demo_vectors[["good"]]
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(demo_vectors[["good"]])
+demo_vectors %>% closest_to(demo_vectors[["good"]])
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to("bad")
+demo_vectors %>% closest_to("bad")
 
 ## ------------------------------------------------------------------------
 
-demo_vectors %>% nearest_to(~"good"+"bad")
+demo_vectors %>% closest_to(~"good"+"bad")
 
 # The same thing could be written as:
-# demo_vectors %>% nearest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
+# demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(~"good" - "bad")
+demo_vectors %>% closest_to(~"good" - "bad")
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(~ "bad" - "good")
+demo_vectors %>% closest_to(~ "bad" - "good")
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(~ "he" - "she")
-demo_vectors %>% nearest_to(~ "she" - "he")
+demo_vectors %>% closest_to(~ "he" - "she")
+demo_vectors %>% closest_to(~ "she" - "he")
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(~ "guy" - "he" + "she")
+demo_vectors %>% closest_to(~ "guy" - "he" + "she")
 
 ## ------------------------------------------------------------------------
-demo_vectors %>% nearest_to(~ "guy" + ("she" - "he"))
+demo_vectors %>% closest_to(~ "guy" + ("she" - "he"))
 
 ## ------------------------------------------------------------------------
 
@@ -42,13 +42,13 @@ demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>%
 
 ## ------------------------------------------------------------------------
 top_evaluative_words = demo_vectors %>% 
-   nearest_to(~ "good"+"bad",n=75)
+   closest_to(~ "good"+"bad",n=75)
 
 goodness = demo_vectors %>% 
-  nearest_to(~ "good"-"bad",n=Inf) 
+  closest_to(~ "good"-"bad",n=Inf) 
 
 femininity = demo_vectors %>% 
-  nearest_to(~ "she" - "he", n=Inf)
+  closest_to(~ "she" - "he", n=Inf)
 
 ## ------------------------------------------------------------------------
 library(ggplot2)

diff --git a/inst/doc/exploration.Rmd b/inst/doc/exploration.Rmd
@@ -48,15 +48,15 @@ demo_vectors[["good"]]
 These numbers are meaningless on their own. But in the vector space, we can find similar words.
 
 ```{r}
-demo_vectors %>% nearest_to(demo_vectors[["good"]])
+demo_vectors %>% closest_to(demo_vectors[["good"]])
 ```
 
 The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good."
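
The three reference points for these scores can be checked by hand with a small base-R sketch. The `cosine` helper here is hypothetical and defined only for this illustration; it is not the package's `cosineSimilarity`.

```r
# Cosine similarity of two numeric vectors: dot product over the product of norms.
cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

same_direction = cosine(c(1, 2), c(2, 4))    # parallel vectors
orthogonal     = cosine(c(1, 0), c(0, 3))    # no shared direction
opposed        = cosine(c(1, 2), c(-1, -2))  # exactly opposite directions
```

Up to floating-point error, these come out at 1, 0 and -1 respectively; only the direction of the vectors matters, not their length.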
 
 When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even pass the string directly, like so.
 
 ```{r}
-demo_vectors %>% nearest_to("bad")
+demo_vectors %>% closest_to("bad")
 ```
 
 ## Vector math
@@ -65,16 +65,16 @@ The tildes are necessary syntax where things get interesting--you can do **math*
 
 ```{r}
 
-demo_vectors %>% nearest_to(~"good"+"bad")
+demo_vectors %>% closest_to(~"good"+"bad")
 
 # The same thing could be written as:
-# demo_vectors %>% nearest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
+# demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
 ```
 
 Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction.
 
 ```{r}
-demo_vectors %>% nearest_to(~"good" - "bad")
+demo_vectors %>% closest_to(~"good" - "bad")
 ```
 
 > What does this "subtraction" vector mean? 
@@ -88,14 +88,14 @@ demo_vectors %>% nearest_to(~"good" - "bad")
 Again, you can easily switch the order to the opposite: here are a bunch of bad words:
 
 ```{r}
-demo_vectors %>% nearest_to(~ "bad" - "good")
+demo_vectors %>% closest_to(~ "bad" - "good")
 ```
 
 All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get traits attributed only to men ("arrogant") or fields where there are very few women in the university ("physics").
 
 ```{r}
-demo_vectors %>% nearest_to(~ "he" - "she")
-demo_vectors %>% nearest_to(~ "she" - "he")
+demo_vectors %>% closest_to(~ "he" - "she")
+demo_vectors %>% closest_to(~ "she" - "he")
 ```
 
 ## Analogies
@@ -112,7 +112,7 @@ removing its similarity to "he", and adding a similarity to "she".
 This yields the answer: the most similar term to "guy" for a woman is "lady."
 
 ```{r}
-demo_vectors %>% nearest_to(~ "guy" - "he" + "she")
+demo_vectors %>% closest_to(~ "guy" - "he" + "she")
 ```
 
 If you're using the other mental framework, of thinking of these as real vectors, 
@@ -122,7 +122,7 @@ to femininity. You can then add this vector to "guy", and that will take you to
 only the grouping is different.
 
 ```{r}
-demo_vectors %>% nearest_to(~ "guy" + ("she" - "he"))
+demo_vectors %>% closest_to(~ "guy" + ("she" - "he"))
 ```
 
 Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction.
@@ -140,13 +140,13 @@ First we build up three data_frames: first, a list of the 50 top evaluative word
 
 ```{r}
 top_evaluative_words = demo_vectors %>% 
-   nearest_to(~ "good"+"bad",n=75)
+   closest_to(~ "good"+"bad",n=75)
 
 goodness = demo_vectors %>% 
-  nearest_to(~ "good"-"bad",n=Inf) 
+  closest_to(~ "good"-"bad",n=Inf) 
 
 femininity = demo_vectors %>% 
-  nearest_to(~ "she" - "he", n=Inf)
+  closest_to(~ "she" - "he", n=Inf)
 ```
 
 Then we can use tidyverse packages to join and plot these.