renaming nearest_to to closest_to for back-compatibility
bmschmidt committed Mar 13, 2017
1 parent 12dcfa5 commit b479c76
Showing 19 changed files with 218 additions and 173 deletions.
1 change: 1 addition & 0 deletions NAMESPACE
@@ -2,6 +2,7 @@

export("%>%")
export(as.VectorSpaceModel)
export(closest_to)
export(cosineDist)
export(cosineSimilarity)
export(distend)
13 changes: 8 additions & 5 deletions NEWS.md
@@ -1,26 +1,29 @@
# VERSION 2.0

Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `nearest_to`, which now returns a data.frame.
Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `closest_to`, which now returns a data.frame.

# Changes

## Change in nearest_to behavior.
## New default function: closest_to.

There's a change in `nearest_to` that will break some existing code. Now it returns a data.frame instead of a list. The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot. There are flags to return to the old behavior (`as_df=FALSE`).
`nearest_to` was previously the easiest way to interact with cosine similarity functions. It has been deprecated
in favor of a new function, `closest_to`, which returns a data.frame. (I would have kept the old name, but changing what it returns would break existing code.)
The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot.
`nearest_to` is now just a wrapped version of the new function.
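
The relationship between the two functions can be sketched in base R. This is only an illustration under stated assumptions: `toy_vectors`, `cosine_to`, `toy_closest_to`, and `toy_nearest_to` are hypothetical stand-ins, not the package's implementation.

```R
# Hypothetical toy data: four "words" in a two-dimensional space.
toy_vectors <- rbind(
  good  = c(1.0, 0.2),
  great = c(0.9, 0.3),
  bad   = c(0.8, -0.6),
  tree  = c(-0.1, 1.0)
)

# Cosine similarity of every row of m to the vector v.
cosine_to <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# 2.0-style result: a data.frame of words and similarities, most similar first.
toy_closest_to <- function(m, v, n = 10) {
  sims <- cosine_to(m, v)
  ords <- order(-sims)[seq_len(min(n, nrow(m)))]
  data.frame(word = rownames(m)[ords], similarity = sims[ords],
             stringsAsFactors = FALSE)
}

# 1.0-style result: a named vector of distances (1 - similarity),
# built by wrapping the new function.
toy_nearest_to <- function(...) {
  vals <- toy_closest_to(...)
  stats::setNames(1 - vals$similarity, vals$word)
}

new_style <- toy_closest_to(toy_vectors, toy_vectors["good", ], n = 3)
old_style <- toy_nearest_to(toy_vectors, toy_vectors["good", ], n = 3)
```

Code written against the old return value reads `names(old_style)`; code written against the new one reads `new_style$word`.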

## New syntax for vector addition.

This package now allows formula scoping for the most common operations, and string inputs to access in the context of a particular matrix. This makes this much nicer for handling the bread and butter word2vec operations.

For instance, instead of writing
```R
vectors %>% nearest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])
vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])
```

(whew!), you can write

```R
vectors %>% nearest_to(~"king" - "man" + "woman")
vectors %>% closest_to(~"king" - "man" + "woman")
```
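
One way to picture how a formula of quoted words could be turned into vector arithmetic is a small recursive evaluator. This is a sketch only: `eval_words` and `toy` are hypothetical names, and the package's own mechanism may differ.

```R
# Hypothetical toy matrix with one row per word.
toy <- rbind(
  king  = c(0.9, 0.8, 0.1),
  man   = c(0.1, 0.9, 0.0),
  woman = c(0.1, 0.1, 0.9)
)

# Walk the expression: strings become row vectors, calls such as
# `-`, `+` and `(` are applied to their already-evaluated arguments.
eval_words <- function(e, m) {
  if (is.character(e)) return(m[e, ])
  if (is.call(e)) {
    args <- lapply(as.list(e)[-1], eval_words, m = m)
    return(do.call(as.character(e[[1]]), args))
  }
  e  # numeric constants pass through unchanged
}

f <- ~ "king" - "man" + "woman"
# For a one-sided formula, element 2 is the right-hand side.
result <- eval_words(f[[2]], toy)
```

The result is the same vector you would get by writing out `toy["king", ] - toy["man", ] + toy["woman", ]` by hand.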


72 changes: 44 additions & 28 deletions R/matrixFunctions.R
@@ -12,11 +12,11 @@
#'
#' @examples
#'
#' nearest_to(demo_vectors,"great")
#' closest_to(demo_vectors,"great")
#' # stopwords like "and" and "very" are no longer top ten.
#' # I don't know if this is really better, though.
#'
#' nearest_to(improve_vectorspace(demo_vectors),"great")
#' closest_to(improve_vectorspace(demo_vectors),"great")
#'
improve_vectorspace = function(vectorspace,D=round(ncol(vectorspace)/100)) {
mean = methods::new("VectorSpaceModel",
@@ -531,9 +531,9 @@ filter_to_rownames <- function(matrix,words) {
#' subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=FALSE]]
#' similarities = cosineSimilarity(subjects,subjects)
#'
#' # Use 'nearest_to' to build up a large list of similar words to a seed set.
#' # Use 'closest_to' to build up a large list of similar words to a seed set.
#' subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=TRUE]]
#' new_subject_list = nearest_to(demo_vectors,subjects,20)
#' new_subject_list = closest_to(demo_vectors,subjects,20)
#' new_subjects = demo_vectors[[new_subject_list$word,average=FALSE]]
#'
#' # Plot the cosineDistance of these as a dendrogram.
@@ -637,10 +637,10 @@ project = function(matrix,vector) {
#' See `project` for more details.
#'
#' @examples
#' nearest_to(demo_vectors,demo_vectors[["man"]])
#' closest_to(demo_vectors,demo_vectors[["man"]])
#'
#' genderless = reject(demo_vectors,demo_vectors[["he"]] - demo_vectors[["she"]])
#' nearest_to(genderless,genderless[["man"]])
#' closest_to(genderless,genderless[["man"]])
#'
#' @export
reject = function(matrix,vector) {
@@ -673,12 +673,12 @@ reject = function(matrix,vector) {
#' See `project` for more details and usage.
#'
#' @examples
#' nearest_to(demo_vectors,"sweet")
#' closest_to(demo_vectors,"sweet")
#'
#' # Stretch out the vectorspace 4x longer along the gender direction.
#' more_sexist = distend(demo_vectors, ~ "man" + "he" - "she" -"woman", 4)
#'
#' nearest_to(more_sexist,"sweet")
#' closest_to(more_sexist,"sweet")
#'
#' @export
distend = function(matrix,vector, multiplier) {
@@ -692,7 +692,6 @@ distend = function(matrix,vector, multiplier) {
#' @param vector A vector (or a string or a formula coercible to a vector)
#' of the same length as the VectorSpaceModel. See below.
#' @param n The number of closest words to include.
#' @param as_df Return as a data.frame? If false, returns a named vector, for back-compatibility.
#' @param fancy_names If true (the default) the data frame will have descriptive names like
#' 'similarity to "king+queen-man"'; otherwise, just 'similarity.' The default can speed up
#' interactive exploration.
@@ -704,8 +703,8 @@ distend = function(matrix,vector, multiplier) {
#' 'cosineSimilarity'; the listing of several words similar to a given vector.
#' Unlike cosineSimilarity, it returns a data.frame object instead of a matrix.
#' cosineSimilarity is more powerful, because it can compare two matrices to
#' each other; nearest_to can only take a vector or vectorlike object as its second argument.
#' But with (or without) the argument n=Inf, nearest_to is often better for
#' each other; closest_to can only take a vector or vectorlike object as its second argument.
#' But with (or without) the argument n=Inf, closest_to is often better for
#' plugging directly into a plot.
#'
#' As with cosineSimilarity, the second argument can take several forms. If it's a vector or
@@ -717,26 +716,26 @@ distend = function(matrix,vector, multiplier) {
#' @examples
#'
#' # Synonyms and similar words
#' nearest_to(demo_vectors,demo_vectors[["good"]])
#' closest_to(demo_vectors,demo_vectors[["good"]])
#'
#' # If 'matrix' is a VectorSpaceModel object,
#' # you can also just enter a string directly, and
#' # it will be evaluated in the context of the passed matrix.
#'
#' nearest_to(demo_vectors,"good")
#' closest_to(demo_vectors,"good")
#'
#' # You can also express more complicated formulas.
#'
#' nearest_to(demo_vectors,"good")
#' closest_to(demo_vectors,"good")
#'
#' # Something close to the classic king:man::queen:woman;
#' # What's the equivalent word for a female teacher that "guy" is for
#' # a male one?
#'
#' nearest_to(demo_vectors,~ "guy" - "man" + "woman")
#' closest_to(demo_vectors,~ "guy" - "man" + "woman")
#'
#' @export
nearest_to = function(matrix, vector, n=10, as_df = TRUE, fancy_names = TRUE) {
closest_to = function(matrix, vector, n=10, fancy_names = TRUE) {
label = deparse(substitute(vector),width.cutoff=500)
if (substr(label,1,1)=="~") {label = substr(label,2,500)}

@@ -749,20 +748,37 @@ nearest_to = function(matrix, vector, n=10, as_df = TRUE, fancy_names = TRUE) {
# For sorting.
ords = order(-sims[,1])

if (!as_df) {
structure(
1-sims[ords[1:n]], # Convert from similarity to distance.
names=rownames(sims)[ords[1:n]])
return_val = data.frame(rownames(sims)[ords[1:n]], sims[ords[1:n]],stringsAsFactors=FALSE)
if (fancy_names) {
names(return_val) = c("word", paste("similarity to", label))
} else {
return_val = data.frame(rownames(sims)[ords[1:n]], sims[ords[1:n]],stringsAsFactors=FALSE)
if (fancy_names) {
names(return_val) = c("word", paste("similarity to", label))
} else {
names(return_val) = c("word","similarity")
}
rownames(return_val) = NULL
return_val
names(return_val) = c("word","similarity")
}
rownames(return_val) = NULL
return_val
}


#' Nearest vectors to a word
#'
#' @description This is a wrapper around closest_to, included for back-compatibility. Use
#' closest_to for new applications.
#' @param ... See `closest_to`
#'
#' @return A named vector of cosine distances (1 - similarity). See 'closest_to' for more details.
#' @export
#'
#' @examples
#'
#' # Recommended usage in 1.0:
#' nearest_to(demo_vectors, demo_vectors[["good"]])
#'
#' # Recommended usage in 2.0:
#' demo_vectors %>% closest_to("good")
#'
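nearest_to = function(...) {
  # Delegate to closest_to, asking for the plain column names "word" and "similarity".
  vals = closest_to(..., fancy_names = FALSE)
  # Convert from similarity to distance, as the 1.0 version did.
  returnable = 1 - vals$similarity
  names(returnable) = vals$word
  returnable
}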
10 changes: 5 additions & 5 deletions README.md
@@ -9,8 +9,8 @@ An R package for building and exploring word embedding models.
This package does three major things to make it easier to work with word2vec and other vectorspace models of language.

1. [Trains word2vec models](#creating-text-vectors) using an extended version of Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only *part* of a model (rows or columns) so you can explore a model in memory-limited situations.
2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing `model[rownames(model)=="king",]`, you can write `model[["king"]]`, and instead of writing `vectors %>% nearest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), you can write
`vectors %>% nearest_to(~"king" - "man" + "woman")`.
2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing `model[rownames(model)=="king",]`, you can write `model[["king"]]`, and instead of writing `vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), you can write
`vectors %>% closest_to(~"king" - "man" + "woman")`.
3. [Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection](#useful-matrix-operations) with some caching that makes them much faster than the simplest implementations.

### Quick start
@@ -85,7 +85,7 @@ Each takes a `VectorSpaceModel` as its first argument. Sometimes, it's appropria

* `cosineSimilarity(VSM_1,VSM_2)` calculates the cosine similarity of every vector in one vector space model to every vector in another. This has `n^2` complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest.
* `cosineDist(VSM_1,VSM_2)` is one minus `cosineSimilarity`. It's not strictly a distance metric, but can be used as one for clustering and the like.
* `nearest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`, of finding the top `n` terms in a `VectorSpaceModel` closest to term m
* `closest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`: finding the top `n` terms in a `VectorSpaceModel` closest to `vector`.
* `project(VSM,vector)` takes a `VectorSpaceModel` and returns the portion parallel to the vector `vector`.
* `reject(VSM,vector)` is the inverse of `project`; it takes a `VectorSpaceModel` and returns the portion orthogonal to the vector `vector`. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning.
* `magnitudes` calculates the magnitude of each element in a VSM. This is useful in many operations.
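
The projection and rejection described in the list above can be sketched on a plain matrix in base R. This is a minimal illustration, assuming rows are word vectors; `project_onto`, `reject_from`, `toy`, and `direction` are hypothetical names, not the package's functions.

```R
# Hypothetical toy matrix: three word vectors in two dimensions.
toy <- rbind(
  a = c(1, 2),
  b = c(3, -1),
  c = c(0, 4)
)
direction <- c(1, 1)

# Portion of each row that lies parallel to v.
project_onto <- function(m, v) {
  coefs <- as.vector(m %*% v) / sum(v^2)  # one scalar coefficient per row
  outer(coefs, v)                         # each row becomes coef * v
}

# Portion of each row that is orthogonal to v: whatever is left over.
reject_from <- function(m, v) {
  m - project_onto(m, v)
}

parallel <- project_onto(toy, direction)
rejected <- reject_from(toy, direction)
```

Every row of `rejected` has a dot product of (approximately) zero with `direction`, and `parallel + rejected` recovers the original matrix.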
@@ -98,15 +98,15 @@ not_that_kind_of_bank = chronam_vectors[["bank"]] %>%
reject(chronam_vectors[["cashier"]]) %>%
reject(chronam_vectors[["depositors"]]) %>%
reject(chronam_vectors[["check"]])
chronam_vectors %>% nearest_to(not_that_kind_of_bank)
chronam_vectors %>% closest_to(not_that_kind_of_bank)
```

These functions also allow an additional layer of syntactic sugar when working with word vectors.

If you're working entirely with a single model, you don't have to keep referring to it by name for each word; instead, you can use a formula interface to reduce typing and increase clarity.

```{r}
vectors %>% nearest_to(~ "king" - "man" + "woman")
vectors %>% closest_to(~ "king" - "man" + "woman")
```


26 changes: 13 additions & 13 deletions inst/doc/exploration.R
@@ -6,33 +6,33 @@ library(magrittr)
demo_vectors[["good"]]

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(demo_vectors[["good"]])
demo_vectors %>% closest_to(demo_vectors[["good"]])

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to("bad")
demo_vectors %>% closest_to("bad")

## ------------------------------------------------------------------------

demo_vectors %>% nearest_to(~"good"+"bad")
demo_vectors %>% closest_to(~"good"+"bad")

# The same thing could be written as:
# demo_vectors %>% nearest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
# demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(~"good" - "bad")
demo_vectors %>% closest_to(~"good" - "bad")

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(~ "bad" - "good")
demo_vectors %>% closest_to(~ "bad" - "good")

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(~ "he" - "she")
demo_vectors %>% nearest_to(~ "she" - "he")
demo_vectors %>% closest_to(~ "he" - "she")
demo_vectors %>% closest_to(~ "she" - "he")

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(~ "guy" - "he" + "she")
demo_vectors %>% closest_to(~ "guy" - "he" + "she")

## ------------------------------------------------------------------------
demo_vectors %>% nearest_to(~ "guy" + ("she" - "he"))
demo_vectors %>% closest_to(~ "guy" + ("she" - "he"))

## ------------------------------------------------------------------------

@@ -42,13 +42,13 @@ demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>%

## ------------------------------------------------------------------------
top_evaluative_words = demo_vectors %>%
nearest_to(~ "good"+"bad",n=75)
closest_to(~ "good"+"bad",n=75)

goodness = demo_vectors %>%
nearest_to(~ "good"-"bad",n=Inf)
closest_to(~ "good"-"bad",n=Inf)

femininity = demo_vectors %>%
nearest_to(~ "she" - "he", n=Inf)
closest_to(~ "she" - "he", n=Inf)

## ------------------------------------------------------------------------
library(ggplot2)
26 changes: 13 additions & 13 deletions inst/doc/exploration.Rmd
@@ -48,15 +48,15 @@ demo_vectors[["good"]]
These numbers are meaningless on their own. But in the vector space, we can find similar words.

```{r}
demo_vectors %>% nearest_to(demo_vectors[["good"]])
demo_vectors %>% closest_to(demo_vectors[["good"]])
```

The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good."
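
The three reference points for cosine similarity can be checked with a few lines of base R. This is a small worked example on made-up vectors; `cosine_similarity` is a hypothetical helper, not a function from the package.

```R
# Cosine similarity of two numeric vectors: dot product over product of lengths.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 2)

same_direction <- cosine_similarity(a, 3 * a)      # approximately  1
orthogonal     <- cosine_similarity(a, c(-2, 1))   # approximately  0
opposite       <- cosine_similarity(a, -a)         # approximately -1
```

Note that scaling a vector (here by 3) does not change the score: only direction matters.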

When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even access it directly, like so.

```{r}
demo_vectors %>% nearest_to("bad")
demo_vectors %>% closest_to("bad")
```

## Vector math
@@ -65,16 +65,16 @@ The tildes are necessary syntax where things get interesting--you can do **math*

```{r}
demo_vectors %>% nearest_to(~"good"+"bad")
demo_vectors %>% closest_to(~"good"+"bad")
# The same thing could be written as:
# demo_vectors %>% nearest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
# demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
```

Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction.

```{r}
demo_vectors %>% nearest_to(~"good" - "bad")
demo_vectors %>% closest_to(~"good" - "bad")
```

> What does this "subtraction" vector mean?
@@ -88,14 +88,14 @@ demo_vectors %>% nearest_to(~"good" - "bad")
Again, you can easily switch the order to the opposite: here are a bunch of bad words:

```{r}
demo_vectors %>% nearest_to(~ "bad" - "good")
demo_vectors %>% closest_to(~ "bad" - "good")
```

All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get things that only men are ("arrogant") or where there are very few women in the university ("physics")

```{r}
demo_vectors %>% nearest_to(~ "he" - "she")
demo_vectors %>% nearest_to(~ "she" - "he")
demo_vectors %>% closest_to(~ "he" - "she")
demo_vectors %>% closest_to(~ "she" - "he")
```

## Analogies
@@ -112,7 +112,7 @@ removing its similarity to "he", and additing a similarity to "she".
This yields the answer: the most similar term to "guy" for a woman is "lady."

```{r}
demo_vectors %>% nearest_to(~ "guy" - "he" + "she")
demo_vectors %>% closest_to(~ "guy" - "he" + "she")
```

If you're using the other mental framework, of thinking of these as real vectors,
@@ -122,7 +122,7 @@ to femininity. You can then add this vector to "guy", and that will take you to
only the grouping is different.

```{r}
demo_vectors %>% nearest_to(~ "guy" + ("she" - "he"))
demo_vectors %>% closest_to(~ "guy" + ("she" - "he"))
```
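
That the two groupings give the same vector can be confirmed on made-up numbers. The vectors below are hypothetical three-dimensional stand-ins, not values from `demo_vectors`.

```R
# Hypothetical stand-ins for the three word vectors.
guy <- c(0.5, 0.1, 0.3)
he  <- c(0.2, 0.7, 0.1)
she <- c(0.1, 0.6, 0.8)

# "Remove he from guy, then add she."
stepwise <- guy - he + she

# "Build the he-to-she direction first, then add it to guy."
gender_direction <- she - he
shifted <- guy + gender_direction

# Equal up to floating-point tolerance.
same_vector <- isTRUE(all.equal(stepwise, shifted))
```

Because vector addition is associative, the choice between the two framings is purely a matter of which mental model is more convenient.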

Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction.
@@ -140,13 +140,13 @@ First we build up three data_frames: first, a list of the 50 top evaluative word

```{r}
top_evaluative_words = demo_vectors %>%
nearest_to(~ "good"+"bad",n=75)
closest_to(~ "good"+"bad",n=75)
goodness = demo_vectors %>%
nearest_to(~ "good"-"bad",n=Inf)
closest_to(~ "good"-"bad",n=Inf)
femininity = demo_vectors %>%
nearest_to(~ "she" - "he", n=Inf)
closest_to(~ "she" - "he", n=Inf)
```

Then we can use tidyverse packages to join and plot these.