Warning: This project is in the early scoping stages; do not use for anything other than amusement/frustration purposes
Data Liberator. To extract tabular data people put in nontabular structures in a program designed to hold tables.
Requires the development version of xml2 (for xml_find_lgl
) as well as cellrangr and linen. Chances are you'll want rexcel too.
devtools::install_github(c("hadley/xml2",
"rsheets/linen",
"rsheets/cellranger",
"rsheets/rexcel",
"rsheets/jailbreakr"))
There are two large excel spreadsheet corpora; it would be nice to use these to get a feel for what fraction of spreadsheets we can handle or the range of non-table-like data out there.
The first is the EUSES corpus of 4,447 spreadsheets (16,853 worksheets). This is all xls files (rather than xlsx) and therefore need either an xls -> xlsx conversion or support in jailbreakr for xls files.
The second, larger, one is the Enron corpus of 15,770 spreadsheets (79,983)
-
data structure package:
- linen? General representation of spreadsheet data, plus some limited low-level operations on that data
- depends on cell ranger, tibble
- constructor function
- print methods
- subsetting, range extraction etc.
- plot method - for quickly getting a feel for structure, or a shiny app
- summary: this has n sheets, no formulae, 3 plots, etc, things about the references between the sheets?
- where it came from (excel, googlesheet, etc), with filenames, reference ids etc.
- probably needs references to handle multiple sheets and formulae within them, definitely if we need to do things with plots, but make them immutable at first?
- md5 or other "id" so that we can see if the upstream source has changed. This is different for googlesheets where the id is properly baked into the sheet
-
low level packages:
- googlesheets
- rexcel
- these depend on linen, and will have to provide things like ids and filenames to satisfy all the features that linen will do.
-
jailbreakr
- uses output in linen format that is provided by googlesheets or rexcel
Can we feed things through openrefine or something?