Merge pull request colearendt#67 from jeremystan/document
Document
Jeremy Stanley authored Sep 6, 2016
2 parents 62891a8 + a6a92c7 commit adff7b5
Showing 29 changed files with 980 additions and 384 deletions.
55 changes: 37 additions & 18 deletions R/append_values.R
@@ -1,27 +1,46 @@
#' Appends all values with a specified type as a new column
#' Appends all JSON values with a specified type as a new column
#'
#' The append_values_X functions let you take any remaining JSON and add it as
#' a column X (for X in "string", "number", "logical") insofar as it is of the
#' JSON type specified.
#' The \code{append_values} functions let you take any scalar JSON values
#' of a given type ("string", "number", "logical") and add them as a new
#' column named \code{column.name}. This is particularly useful after using
#' \code{\link{gather_keys}} to stack many objects.
#'
#' Any values that do not conform to the type specified will be NA in the resulting
#' column. This includes other scalar types (e.g., numbers or logicals if you are
#' using append_values_string) and *also* any rows where the JSON is still an
#' object or an array.
#' Any values that cannot be converted to the specified type will be \code{NA} in
#' the resulting column. This includes other scalar types (e.g., numbers or
#' logicals if you are using \code{append_values_string}) and *also* any rows
#' where the JSON is NULL or an object or array.
#'
#' Note that the \code{append_values} functions do not alter the JSON
#' attribute of the \code{tbl_json} object in any way.
#'
#' @name append_values
#' @param .x a json string or tbl_json object
#' @param column.name the column.name to append the values into the data.frame
#' under
#' @param force parameter that determines if the variable type should be computed or not
#' if force is FALSE, then the function may take more memory
#' @param recursive logical indicating whether to extract a single value from a
#' nested object. Only used when force = TRUE. If force = FALSE, and
#' recursive=TRUE, throws an error.
#' @seealso \code{\link{gather_keys}} to gather all object keys first,
#' \code{\link{spread_all}} to spread values into new columns
#' @param .x a json string or \code{\link{tbl_json}} object
#' @param column.name the name of the column to append values as
#' @param force should values be coerced to the appropriate type
#' when possible? If \code{FALSE}, types are checked first (requires more
#' memory)
#' @param recursive logical indicating whether to recursively extract a single
#' value from a nested object. Only used when \code{force = TRUE}. If
#' \code{force = FALSE}, and \code{recursive = TRUE}, throws an error.
#' @return a \code{\link{tbl_json}} object
#' @examples
#'
#' # Stack names
#' '{"first": "bob", "last": "jones"}' %>%
#' gather_keys() %>%
#' append_values_string()
#' gather_keys %>%
#' append_values_string
#'
#' # This is most useful when data is stored in keys and values
#' # For example, tags in recipes:
#' recipes <- c('{"name": "pie", "tags": {"apple": 10, "pie": 2, "flour": 5}}',
#' '{"name": "cookie", "tags": {"chocolate": 2, "cookie": 1}}')
#' recipes %>%
#' spread_values(name = jstring("name")) %>%
#' enter_object("tags") %>%
#' gather_keys("tag") %>%
#' append_values_number("count")
NULL
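
Reviewer note: the "non-conforming values become NA" rule documented above can be illustrated without tidyjson. The following is a minimal base R sketch, assuming the JSON has already been parsed into a nested list; `append_values_sketch` is a hypothetical helper, not the package's implementation, and it only mirrors the strict type-check outcome (no coercion).

```r
# Hypothetical helper (not part of tidyjson): keep scalar values of the
# requested type and fill everything else with NA.
append_values_sketch <- function(values, type = c("string", "number", "logical")) {
  type <- match.arg(type)
  is_type <- switch(type,
    string  = is.character,
    number  = is.numeric,
    logical = is.logical
  )
  na_value <- switch(type,
    string  = NA_character_,
    number  = NA_real_,
    logical = NA
  )
  out <- lapply(values, function(v) {
    # Objects, arrays and null all fail the scalar check and become NA
    if (length(v) == 1 && !is.list(v) && is_type(v)) v else na_value
  })
  unname(unlist(out))
}

# Parsed form of {"name": "pie", "count": 2, "tags": {"apple": 10}, "note": null}
values <- list(name = "pie", count = 2, tags = list(apple = 10), note = NULL)

append_values_sketch(values, "string")  # only "name" survives
append_values_sketch(values, "number")  # only "count" survives
```

The nested object and the null both end up as `NA`, matching the behavior described in the documentation text.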

#' Creates the append_values_* functions
64 changes: 48 additions & 16 deletions R/enter_object.R
@@ -1,26 +1,58 @@
#' Dive into a specific object "key"
#' Enter into a specific object and discard all other JSON data
#'
#' JSON can contain nested objects, such as {"key1": {"key2": [1, 2, 3]}}. The
#' function enter_object() can be used to access the array nested under "key1"
#' and "key2". After using enter_object(), all further tidyjson calls happen
#' inside the referenced object (all other JSON data outside the object
#' is discarded). If the object doesn't exist for a given row / index, then that
#' data.frame row will be discarded.
#' When manipulating a JSON object, \code{enter_object} lets you navigate to
#' a specific value of the object by referencing its key. JSON can contain
#' nested objects, and you can pass in more than one character string into
#' \code{enter_object} to navigate through multiple objects simultaneously.
#'
#' This is useful when you want to limit your data to just information found in
#' a specific key. Use the ... to specify a sequence of keys that you want to
#' enter into. Keep in mind that any rows with JSON that do not contain the key
#' will be discarded by this function.
#' After using \code{enter_object}, all further tidyjson calls happen inside the
#' referenced object (all other JSON data outside the object is discarded).
#' If the object doesn't exist for a given row / index, then that row will be
#' discarded.
#'
#' In pipelines, \code{enter_object} is often preceded by \code{gather_keys} and
#' followed by \code{gather_array} if the key contains an array, or
#' \code{spread_all} if the key contains an object.
#'
#' @seealso \code{\link{gather_keys}} to access keys that could be entered
#' into, \code{\link{gather_array}} to gather an array in an object and
#' \code{\link{spread_all}} to spread values in an object.
#' @param .x a json string or tbl_json object
#' @param ... path to filter
#' @param ... a sequence of character strings designating the object key or
#' sequences of keys you wish to enter
#' @return a \code{\link{tbl_json}} object
#' @export
#' @examples
#' c('{"name": "bob", "children": ["sally", "george"]}', '{"name": "anne"}') %>%
#' spread_values(parent.name = jstring("name")) %>%
#' enter_object("children") %>%
#'
#' # Let's start with a simple example of parents and children
#' json <- c('{"parent": "bob", "children": ["sally", "george"]}',
#' '{"parent": "fred", "children": ["billy"]}',
#' '{"parent": "anne"}')
#'
#' # We can see the keys and types in each
#' json %>% gather_keys %>% json_types
#'
#' # Let's capture the parent first and then enter in the children object
#' json %>% spread_all %>% enter_object("children")
#'
#' # Notice that "anne" was discarded, as she has no children
#'
#' # We can now use gather_array to stack the array
#' json %>% spread_all %>% enter_object("children") %>%
#' gather_array("child.num")
#'
#' # And append_values_string to add the children names
#' json %>% spread_all %>% enter_object("children") %>%
#' gather_array("child.num") %>%
#' append_values_string("child")
#'
#' # A more realistic example with companies data
#' library(dplyr)
#' companies %>%
#' enter_object("acquisitions") %>%
#' gather_array %>%
#' append_values_string("children")
#' spread_all %>%
#' glimpse
enter_object <- function(.x, ...) {

if (!is.tbl_json(.x)) .x <- as.tbl_json(.x)
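
Reviewer note: the row-discarding behavior described in the new `enter_object` documentation is easy to miss. Below is a minimal base R sketch, assuming documents are already parsed into nested lists; `enter_object_sketch` is a hypothetical helper written for illustration and is not how tidyjson implements the function.

```r
# Hypothetical helper (not tidyjson's implementation): follow a path of keys
# through nested lists and drop documents where the path does not exist.
enter_object_sketch <- function(docs, ...) {
  path <- c(...)
  entered <- lapply(docs, function(doc) {
    for (key in path) {
      if (!is.list(doc) || is.null(doc[[key]])) return(NULL)
      doc <- doc[[key]]
    }
    doc
  })
  keep <- !vapply(entered, is.null, logical(1))
  list(document.id = which(keep), json = entered[keep])
}

docs <- list(
  list(parent = "bob",  children = list("sally", "george")),
  list(parent = "fred", children = list("billy")),
  list(parent = "anne")
)

result <- enter_object_sketch(docs, "children")
result$document.id  # documents 1 and 2 remain; "anne" is dropped
```

Passing several keys, for example `enter_object_sketch(docs, "a", "b")`, walks through nested objects one key at a time, which is the multi-key navigation the documentation describes.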
144 changes: 108 additions & 36 deletions R/gather.R
@@ -52,51 +52,123 @@ gather_factory <- function(default.column.name, default.column.empty,

}

#' Stack a JSON {"key": value} object
#' Gather a JSON object into key-value pairs
#'
#' Given a JSON key value structure, like {"key1": 1, "key2": 2}, the
#' gather_keys() function duplicates the rows of the tbl_json data.frame for
#' every key, adds a new column (default name "key") to capture the key names,
#' and then dives into the JSON values to enable further manipulation with
#' downstream tidyjson functions.
#' \code{gather_keys} collapses a JSON object into key-value pairs, creating
#' a new column \code{'key'} to store the object key names, and storing the
#' values in the \code{'JSON'} attribute for further tidyjson manipulation.
#' All other columns are duplicated as necessary. This allows you to access the
#' keys of the objects just like \code{\link{gather_array}} lets you access the
#' values of an array.
#'
#' This allows you to *enter into* the keys of the objects just like \code{gather_array}
#' let you enter elements of the array.
#' \code{gather_keys} is often followed by \code{\link{enter_object}} to enter
#' into a value that is an object, by \code{\link{append_values}} to append all
#' scalar values as a new column or \code{\link{json_types}} to determine the
#' types of the keys.
#'
#' @param .x a json string or tbl_json object whose JSON attribute should always be an object
#' @seealso \code{\link{gather_array}} to gather a JSON array,
#' \code{\link{enter_object}} to enter into an object,
#' \code{\link[tidyr]{gather}} to gather key-value pairs in a data
#' frame
#' @param .x a JSON string or \code{tbl_json} object whose JSON attribute should
#' always be an object
#' @param column.name the name to give to the column of key names created
#' @return a tbl_json with a new column (column.name) that captures the keys
#' and JSON attribute of the associated value data
#' @return a \code{\link{tbl_json}} object
#' @export
#' @examples
#' '{"name": "bob", "age": 32}' %>% gather_keys %>% json_types
#'
#' # Let's start with a very simple example
#' json <- '{"name": "bob", "age": 32, "gender": "male"}'
#'
#' # Check that this is an object
#' json %>% json_types
#'
#' # Gather keys and check types
#' json %>% gather_keys %>% json_types
#'
#' # Sometimes data is stored in key names
#' json <- '{"2014": 32, "2015": 56, "2016": 14}'
#'
#' # Then we can use the column.name argument to change the name of the keys
#' json %>% gather_keys("year")
#'
#' # We can also use append_values_number to capture the values, since they are
#' # all of the same type
#' json %>% gather_keys("year") %>% append_values_number("count")
#'
#' # This can even work with a more complex, nested example
#' json <- '{"2015": {"1": 10, "3": 1, "11": 5}, "2016": {"2": 3, "5": 15}}'
#' json %>% gather_keys("year") %>% gather_keys("month") %>%
#' append_values_number("count")
#'
#' # Most JSON starts out as an object (or an array of objects), and gather_keys
#' # can be used to inspect the top level (or 2nd level) keys and their structure
#' library(dplyr)
#' worldbank %>% gather_keys %>% json_types %>% count(key, type)
gather_keys <- gather_factory("key", character(0), names, "object")
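
Reviewer note: the definition above passes `names` to `gather_factory`, so each key becomes one output row. The following base R sketch shows that idea on already-parsed lists; `gather_keys_sketch` is a hypothetical stand-in, and the real `gather_factory` internals are not visible in this hunk.

```r
# Hypothetical helper: one output row per key, with the values carried along
# in a list attribute (playing the role of the JSON attribute).
gather_keys_sketch <- function(docs, column.name = "key") {
  counts <- lengths(docs)
  out <- data.frame(document.id = rep(seq_along(docs), counts))
  out[[column.name]] <- unlist(lapply(docs, names))
  # Flatten one level only, so nested objects stay intact as values
  attr(out, "JSON") <- unlist(unname(docs), recursive = FALSE, use.names = FALSE)
  out
}

# Parsed form of {"2014": 32, "2015": 56, "2016": 14}
docs <- list(list("2014" = 32, "2015" = 56, "2016" = 14))

gathered <- gather_keys_sketch(docs, "year")
gathered$year                   # the key names, one per row
unlist(attr(gathered, "JSON"))  # the values kept for further processing
```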

#' Stack a JSON array
#'
#' Given a JSON array, such as [1, 2, 3], gather_array will "stack" the array in
#' the tbl_json data.frame, by replicating each row of the data.frame by the
#' length of the corresponding JSON array. A new column (by default called
#' "array.index") will be added to keep track of the referenced position in the
#' array for each row of the resulting data.frame.
#'
#' JSON can contain arrays of data, which can be simple vectors (fixed or varying
#' length integer, character or logical vectors). But they also often contain
#' lists of other objects (like a list of purchases for a user). The function
#' gather_array() takes JSON arrays and duplicates the rows in the data.frame to
#' correspond to the indices of the array, and puts the elements of
#' the array into the JSON attribute. This is equivalent to "stacking" the array
#' in the data.frame, and lets you continue to manipulate the remaining JSON
#' in the elements of the array. For simple arrays, use append_values_* to
#' capture all of the values of the array. For more complex arrays (where the
#' values are themselves objects or arrays), continue using other tidyjson
#' functions to structure the data as needed.
#'
#' @param .x a json string or tbl_json object whose JSON attribute should always be an array
#' Gather a JSON array into index-value pairs
#'
#' \code{gather_array} collapses a JSON array into index-value pairs, creating
#' a new column \code{'array.index'} to store the index of the array, and
#' storing values in the \code{'JSON'} attribute for further tidyjson
#' manipulation. All other columns are duplicated as necessary. This allows you
#' to access the values of the array just like \code{\link{gather_keys}} lets
#' you access the values of an object.
#'
#' JSON arrays can be simple vectors (fixed or varying length number, string
#' or logical vectors with or without null values). But they also often contain
#' lists of other objects (like a list of purchases for a user). Thus, the
#' best analogy in R for a JSON array is an unnamed list.
#'
#' \code{gather_array} is often preceded by \code{\link{enter_object}} when the
#' array is nested under a JSON object, and is often followed by
#' \code{\link{gather_keys}} or \code{\link{enter_object}} if the array values
#' are objects, or by \code{\link{append_values}} to append all scalar values
#' as a new column or \code{\link{json_types}} to determine the types of the
#' array elements (JSON does not guarantee they are the same type).
#'
#' @seealso \code{\link{gather_keys}} to gather a JSON object,
#' \code{\link{enter_object}} to enter into an object,
#' \code{\link[tidyr]{gather}} to gather key-value pairs in a data
#' frame
#' @param .x a json string or tbl_json object whose JSON attribute should always
#' be an array
#' @param column.name the name to give to the array index column created
#' @return a tbl_json with a new column (column.name) that captures the array
#' index and JSON attribute extracted from the array
#' @return a \code{\link{tbl_json}} object
#' @export
#' @examples
#' '[1, "a", {"k": "v"}]' %>% gather_array %>% json_types
#'
#' # A simple character array example
#' json <- '["a", "b", "c"]'
#'
#' # Check that this is an array
#' json %>% json_types
#'
#' # Gather array and check types
#' json %>% gather_array %>% json_types
#'
#' # Extract string values
#' json %>% gather_array %>% append_values_string
#'
#' # A more complex mixed type example
#' json <- '["a", 1, true, null, {"key": "value"}]'
#'
#' # Check the type of each array element
#' json %>% gather_array %>% json_types
#'
#' # A nested array
#' json <- '[["a", "b", "c"], ["a", "d"], ["b", "c"]]'
#'
#' # Extract both levels
#' json %>% gather_array("index.1") %>% gather_array("index.2") %>%
#' append_values_string
#'
#' # Some JSON begins as an array
#' commits %>% gather_array
#'
#' # We can use spread_all to capture all keys (where recursive = FALSE is used
#' # to limit the depth to just top level keys)
#' library(dplyr)
#' commits %>% gather_array %>% spread_all(recursive = FALSE) %>% glimpse
gather_array <- gather_factory("array.index", integer(0), seq_along, "array")
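
Reviewer note: the documentation points out that JSON does not guarantee array elements share a type. The base R sketch below illustrates this, assuming the mixed-type array from the examples has been parsed into an unnamed list; `json_type_sketch` is a hypothetical classifier, not the package's `json_types`. The index column uses `seq_along`, the same function passed to `gather_factory` above.

```r
# Hypothetical classifier for parsed JSON values (not tidyjson's json_types)
json_type_sketch <- function(v) {
  if (is.null(v)) {
    "null"
  } else if (is.list(v)) {
    if (is.null(names(v))) "array" else "object"
  } else if (is.character(v)) {
    "string"
  } else if (is.numeric(v)) {
    "number"
  } else if (is.logical(v)) {
    "logical"
  } else {
    NA_character_
  }
}

# Parsed form of ["a", 1, true, null, {"key": "value"}]
arr <- list("a", 1, TRUE, NULL, list(key = "value"))

# One row per array element, indexed by position
gathered <- data.frame(
  array.index = seq_along(arr),
  type = vapply(arr, json_type_sketch, character(1)),
  stringsAsFactors = FALSE
)
gathered
```

One limitation of this sketch: an empty list is classified as an array, since a parsed empty object and empty array are indistinguishable in this representation.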
31 changes: 22 additions & 9 deletions R/json_complexity.R
@@ -1,19 +1,32 @@
#' Add a column that contains the complexity (recursively unlisted length) of the JSON data
#' Compute the complexity (recursively unlisted length) of JSON data
#'
#' When investigating complex JSON data it can be helpful to identify the
#' complexity of deeply nested documents. The json_complexity() function adds a
#' column (default name "complexity") that contains the 'complexity' of the JSON
#' associated with each row. Essentially, every non-null scalar value is found in the
#' object by recursively stripping away all objects or arrays, and the complexity
#' is the count of these scalar values. Note that 'null' has complexity 0.
#' complexity of deeply nested documents. The \code{json_complexity} function
#' adds a column (default name \code{"complexity"}) that contains the
#' 'complexity' of the JSON associated with each row. Essentially, every non-null
#' scalar value is found in the object by recursively stripping away all objects
#' or arrays, and the complexity is the count of these scalar values. Note that
#' 'null' has complexity 0, as do empty objects and arrays.
#'
#' @seealso \code{\link{json_lengths}} to compute the length of each value
#' @param .x a json string or tbl_json object
#' @param column.name the name to specify for the complexity column
#' @return a tbl_json object with column.name column that tells the length
#' @return a \code{\link{tbl_json}} object
#' @export
#' @examples
#' c('[1, 2, [3, 4]]', '{"k1": 1, "k2": [2, [3, 4]]}', '1', {}) %>%
#' json_lengths %>% json_complexity
#'
#' # A simple example
#' json <- c('[1, 2, [3, 4]]', '{"k1": 1, "k2": [2, [3, 4]]}', '1', 'null')
#'
#' # Complexity is larger than length for nested objects
#' json %>% json_lengths %>% json_complexity
#'
#' # Worldbank has complexity ranging from 8 to 17
#' library(magrittr)
#' worldbank %>% json_complexity %$% table(complexity)
#'
#' # Commits are much more regular
#' commits %>% gather_array %>% json_complexity %$% table(complexity)
json_complexity <- function(.x, column.name = "complexity") {

if (!is.tbl_json(.x)) .x <- as.tbl_json(.x)
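
Reviewer note: the title describes complexity as the "recursively unlisted length", which can be checked directly in base R. This is a minimal sketch, assuming the example documents have been parsed into R lists; both helper names are hypothetical and the functions are not the package's implementations.

```r
# Length counts top-level elements; complexity counts non-null scalars
# after recursively flattening all objects and arrays.
json_length_sketch <- function(v) length(v)
json_complexity_sketch <- function(v) length(unlist(v, recursive = TRUE))

docs <- list(
  list(1, 2, list(3, 4)),                  # [1, 2, [3, 4]]
  list(k1 = 1, k2 = list(2, list(3, 4))),  # {"k1": 1, "k2": [2, [3, 4]]}
  1,                                       # 1
  NULL                                     # null
)

lengths_out    <- vapply(docs, json_length_sketch, integer(1))
complexity_out <- vapply(docs, json_complexity_sketch, integer(1))

data.frame(length = lengths_out, complexity = complexity_out)
```

The first two documents have lengths 3 and 2 but both have complexity 4, while the scalar has length and complexity 1 and null has 0 for both, consistent with the documented behavior.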
31 changes: 23 additions & 8 deletions R/json_lengths.R
@@ -1,18 +1,33 @@
#' Add a column that contains the length of the JSON data
#' Compute the length of JSON data
#'
#' When investigating JSON data it can be helpful to identify the lengths of the
#' JSON objects or arrays, especially when they are 'ragged' across documents. The
#' json_lengths() function adds a column (default name "length") that contains
#' the 'length' of the JSON associated with each row. For objects, this will
#' be equal to the number of keys. For arrays, this will be equal to the length
#' of the array. All scalar values will be of length 1.
#' JSON objects or arrays, especially when they are 'ragged' across documents.
#' The \code{json_lengths} function adds a column (default name \code{"length"})
#' that contains the 'length' of the JSON associated with each row. For objects,
#' this will be equal to the number of keys. For arrays, this will be equal to
#' the length of the array. All scalar values will be of length 1, and null
#' will have length 0.
#'
#' @seealso \code{\link{json_complexity}} to compute the recursive length of
#' each value
#' @param .x a json string or tbl_json object
#' @param column.name the name to specify for the length column
#' @return a tbl_json object with column.name column that tells the length
#' @return a \code{\link{tbl_json}} object
#' @export
#' @examples
#' c('[1, 2, 3]', '{"k1": 1, "k2": 2}', '1', {}) %>% json_lengths
#'
#' # A simple example
#' json <- c('[1, 2, 3]', '{"k1": 1, "k2": 2}', '1', 'null')
#'
#' # Length of an array, an object, a scalar and null
#' json %>% json_lengths
#'
#' # Worldbank objects are either length 7 or 8
#' library(magrittr)
#' worldbank %>% json_lengths %$% table(length)
#'
#' # All commits are length 8
#' commits %>% gather_array %>% json_lengths %$% table(length)
json_lengths <- function(.x, column.name = "length") {

if (!is.tbl_json(.x)) .x <- as.tbl_json(.x)