jailbreakr

Warning: This project is in the early scoping stages; do not use for anything other than amusement/frustration purposes

Data Liberator. To extract tabular data people put in nontabular structures in a program designed to hold tables.

Installation

Requires the development version of xml2 (for xml_find_lgl) as well as cellrangr and linen. Chances are you'll want rexcel too.

devtools::install_github(c("hadley/xml2",
                           "rsheets/linen",
                           "rsheets/cellranger",
                           "rsheets/rexcel",
                           "rsheets/jailbreakr"))

Goals

There are two large excel spreadsheet corpora; it would be nice to use these to get a feel for what fraction of spreadsheets we can handle or the range of non-table-like data out there.

The first is the EUSES corpus of 4,447 spreadsheets (16,853 worksheets). This is all xls files (rather than xlsx) and therefore need either an xls -> xlsx conversion or support in jailbreakr for xls files.

The second, larger, one is the Enron corpus of 15,770 spreadsheets (79,983)

Roadmap

data structure package:
- linen? General representation of spreadsheet data, plus some limited low-level operations on that data
- depends on cell ranger, tibble
- constructor function
- print methods
- subsetting, range extraction etc.
- plot method - for quickly getting a feel for structure, or a shiny app
- summary: this has n sheets, no formulae, 3 plots, etc, things about the references between the sheets?
- where it came from (excel, googlesheet, etc), with filenames, reference ids etc.
- probably needs references to handle multiple sheets and formulae within them, definitely if we need to do things with plots, but make them immutable at first?
- md5 or other "id" so that we can see if the upstream source has changed. This is different for googlesheets where the id is properly baked into the sheet
low level packages:
- googlesheets
- rexcel
- these depend on linen, and will have to provide things like ids and filenames to satisfy all the features that linen will do.
jailbreakr
- uses output in linen format that is provided by googlesheets or rexcel

Ideas

Can we feed things through openrefine or something?

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
R		R
internal-projects		internal-projects
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
Makefile		Makefile
NAMESPACE		NAMESPACE
README.md		README.md
appveyor.yml		appveyor.yml
jailbreakr.Rproj		jailbreakr.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

jailbreakr

Installation

Goals

Roadmap

Ideas

About

Releases

Packages

Contributors 2

Languages

rsheets/jailbreakr

Folders and files

Latest commit

History

Repository files navigation

jailbreakr

Installation

Goals

Roadmap

Ideas

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages