From 4a6cae601b2c058d7683545f27ed1dae6c6c4b35 Mon Sep 17 00:00:00 2001
From: Anthony Cowley
Date: Sun, 22 Oct 2023 14:28:26 -0400
Subject: [PATCH] Updated links to point to the main branch

---
 README.md  | 12 ++++++------
 README.org | 14 +++++++-------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index f38edd0..dabb182 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ For a running example, we will use variations of the [prestige.csv](http://vince
 
 If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. `Frames` provides `TemplateHaskell` machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.
 
-We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs).
+We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs).
 
 ```haskell
 {-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -45,7 +45,7 @@ averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
 
 ### Missing Header Row
 
-Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs)
+Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs)
 
 ```haskell
 {-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -84,7 +84,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t
 
     "athletes",11.44,8206,8.13,,3373,NA
 
-We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs).
+We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs).
 
 ```haskell
 {-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
@@ -127,7 +127,7 @@ For comparison to working with data frames in other languages, see the [tutorial
 
 ## Demos
 
-There are various [demos](https://github.com/acowley/Frames/tree/master/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.
+There are various [demos](https://github.com/acowley/Frames/tree/main/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.
 
 ## Contribute
 
@@ -146,9 +146,9 @@ To get just ghc and cabal in your shell, a simple `nix develop` will do.
 
 ## Benchmarks
 
-The [benchmark](https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.
+The [benchmark](https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.
 
-Another [demo](https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/master/benchmarks/panda.py) of a similar program is also provided for comparison.
+Another [demo](https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/main/benchmarks/panda.py) of a similar program is also provided for comparison.
 
 This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.
 
diff --git a/README.org b/README.org
index 95918aa..4e0898a 100644
--- a/README.org
+++ b/README.org
@@ -21,12 +21,12 @@ For a running example, we will use variations of the
 *** Clean Data
 If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. =Frames= provides =TemplateHaskell= machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.
 
-We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs][program]].
+We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs][program]].
 
 #+INCLUDE: "test/UncurryFold.hs" src haskell
 
 *** Missing Header Row
-Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs][Link to code.]]
+Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs][Link to code.]]
 
 #+INCLUDE: "test/UncurryFoldNoHeader.hs" src haskell
 
@@ -37,7 +37,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t
 "athletes",11.44,8206,8.13,,3373,NA
 #+END_EXAMPLE
 
-We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs][Link to code]].
+We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs][Link to code]].
 
 #+INCLUDE: "test/UncurryFoldPartialData.hs" src haskell
 
@@ -47,15 +47,15 @@ For comparison to working with data frames in other languages, see the
 
 ** Demos
 There are various
-[[https://github.com/acowley/Frames/tree/master/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.
+[[https://github.com/acowley/Frames/tree/main/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.
 
 ** Benchmarks
-The [[https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
+The [[https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
 dealing with data when you want to perform multiple traversals.
 
-Another [[https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
+Another [[https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
 passes into one so that the full data set is never resident in
-memory. A [[https://github.com/acowley/Frames/tree/master/benchmarks/panda.py][Pandas version]] of a similar program
+memory. A [[https://github.com/acowley/Frames/tree/main/benchmarks/panda.py][Pandas version]] of a similar program
 is also provided for comparison. This is a trivial program, but shows that performance is comparable to