Commit 457100d

Improvements to docs (#222)
1 parent 6065256 commit 457100d

File tree

6 files changed

+77
-51
lines changed

annotations/README.md

+5-5
@@ -8,7 +8,7 @@ The parallelizability study informed the design of the annotation language, whic
88

99
> _N.b.: We welcome contributions to the study and annotations for common commands._
1010
11-
#### Main Parallelizability Classes
11+
## Main Parallelizability Classes
1212

1313
PaSh introduces four major parallelizability classes:
1414

@@ -35,7 +35,7 @@ If parallelized on a single input, each stage would need to wait on the results
3535
The last class, `side-effectful`, contains commands that have side-effects across the system -- for example, updating environment variables, interacting with the filesystem, and accessing the network.
3636
Such commands are not parallelizable without finer-grained concurrency control mechanisms that can detect side-effects across the system.
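As a rough illustration of how the four class names might be represented in code, here is a minimal sketch. The `ParallelizabilityClass` enum and the `is_parallelizable` helper are hypothetical names, not part of PaSh, and the sketch assumes that only `stateless` and `pure` commands are candidates for parallelization.

```python
from enum import Enum


class ParallelizabilityClass(Enum):
    """The four class names used in this README (hypothetical representation)."""
    STATELESS = "stateless"
    PURE = "pure"
    NON_PARALLELIZABLE = "non-parallelizable"
    SIDE_EFFECTFUL = "side-effectful"


# Assumption: only the first two classes are candidates for data parallelism.
PARALLELIZABLE = {ParallelizabilityClass.STATELESS, ParallelizabilityClass.PURE}


def is_parallelizable(class_name: str) -> bool:
    """Return True if a command in the named class may be parallelized."""
    return ParallelizabilityClass(class_name) in PARALLELIZABLE
```

For example, `is_parallelizable("side-effectful")` evaluates to `False` under this assumption.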
3737

38-
#### Parallelizability Study of Commands in GNU & POSIX
38+
## Parallelizability Study of Commands in GNU & POSIX
3939

4040
The parallelizability study of commands in GNU and POSIX consists of two parts: a coarse-grained parallelizability study, and a set of annotations for commands.
4141

@@ -47,7 +47,7 @@ Annotations can be thought of as defining a bidirectional correspondence between
4747
Since command behaviors (and correspondence) can change based on their arguments, annotations contain a sequence of predicates.
4848
Each predicate is accompanied by information that instantiates the correspondence between a command and a dataflow node.
4949

50-
#### A Simple Example: `chmod`
50+
## A Simple Example: `chmod`
5151

5252
As a first example, below we present the annotations for `chmod`.
5353

@@ -65,7 +65,7 @@ As a first example, below we present the annotations for `chmod`.
6565

6666
The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node.
6767

68-
#### Another Example: `cut`
68+
## Another Example: `cut`
6969

7070
As another example, below we present the annotations for `cut`.
7171

@@ -136,7 +136,7 @@ Inputs are always assigned to the non-option arguments and the output is always
136136
The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents stdin, and the option "empty-args-stdin" indicates that if non-option arguments are empty, then the command reads from its stdin.
137137
The list identified by "short-long" contains a correspondence of short and long argument names for this command.
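To make the two input-related options concrete, the following is a minimal sketch of how they could be interpreted. The function `resolve_inputs` and the `"stdin"` placeholder are hypothetical and are not taken from PaSh's annotation code; only the option names come from the description above.

```python
def resolve_inputs(non_option_args, options):
    """Map a command's non-option arguments to input sources.

    `options` is a list of option strings such as "stdin-hyphen" and
    "empty-args-stdin" (the helper itself is illustrative, not PaSh code).
    """
    # "empty-args-stdin": no non-option arguments means "read from stdin".
    if not non_option_args and "empty-args-stdin" in options:
        return ["stdin"]
    inputs = []
    for arg in non_option_args:
        # "stdin-hyphen": a lone dash stands for stdin.
        if arg == "-" and "stdin-hyphen" in options:
            inputs.append("stdin")
        else:
            inputs.append(arg)
    return inputs
```

With both options enabled, `resolve_inputs(["a.txt", "-"], ["stdin-hyphen", "empty-args-stdin"])` yields `["a.txt", "stdin"]`.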
138138

139-
#### How to Annotate a Command
139+
## How to Annotate a Command
140140

141141
The first step to annotating a command is to identify its default class: `stateless`, `pure`, `non-parallelizable`, or `side-effectful`. How does the command behave without any inputs?
142142
The next step is to identify the set of inputs and their order.

compiler/README.md

+51-22
@@ -1,37 +1,57 @@
11
# The PaSh Compiler
2+
Quick Jump: [intro](#introduction) | [overview](#compiler-overview) | [details](#zooming-into-fragments) | [earlier versions](#earlier-versions)
23

3-
A diagram of the compiler is shown below:
4+
## Introduction
45

5-
A correspondence between blocks in the diagram and Python modules is shown below:
6+
PaSh has recently shifted away from ahead-of-time compilation and towards just-in-time compilation intermixed with the execution of a script.
7+
This shift brings many benefits, allowing PaSh to correctly handle expansion and other important details -- but complicates the clear exposition of the two phases.
8+
A high-level diagram of PaSh's end-to-end operation is shown below:
9+
10+
<img src="https://docs.google.com/drawings/d/e/2PACX-1vSIuacgBR_QFOzawoAdJMmTjgsdnDUkp1DbSjLVlrowlhL6kxqckXXsL7SPoRXKfaC1hw9HQzJitmDP/pub?w=1364&amp;h=454">
11+
12+
PaSh pre-processes a sequential script to insert calls to `pash_runtime.py`.
13+
It then invokes the script, switching between evaluation, execution, and parallelization at runtime:
14+
(i) it first parses the script, creating an abstract syntax tree (AST);
15+
(ii) it then expands the nodes of the AST, often calling the shell which performs that expansion;
16+
(iii) it compiles dataflow regions, parts of the AST that are potentially parallelizable, through an iterative optimization procedure applied over a dataflow graph (DFG); and
17+
(iv) it finally emits the parallel script by translating the DFG to an AST and unparsing the AST back to a shell script.
18+
The compilation takes into account information about individual commands through [annotations](../annotations), and the emitted parallel script uses additional constructs provided by PaSh's [runtime library](../runtime).
619

7-
- PaSh Preprocessor -- [pash.py](../compiler/pash.py)
8-
- Expand, Compile -- [ast_to_ir.py](../compiler/ast_to_ir.py)
9-
- Annotations -- [annotations.py](../compiler/annotations.py), [command_categories.py](../compiler/command_categories.py)
10-
- Optimize -- [pash_runtime.py](../compiler/pash_runtime.py)
20+
A correspondence between blocks in the diagram and Python modules is shown below:
1121

12-
**Note:** At the time of the paper submission, PaSh did not have a preprocessing component, and didn't handle variable expansion. These changes significantly improve the practical applicability of PaSh since it can be used on scripts where the environment variables are modified throughout the script.
22+
- Preprocessing: [pash.py](./pash.py)
23+
- Expansion and compilation: [ast_to_ir.py](./ast_to_ir.py)
24+
- Dealing with annotations: [annotations.py](./annotations.py), [command_categories.py](./command_categories.py)
25+
- Optimization: [pash_runtime.py](./pash_runtime.py)
1326

14-
First, there is the parser in [compiler/parser](../compiler/parser), which is a port of [libdash](https://github.com/mgree/), the dash parser extended with OCaml bindings, extended with ocaml2json and json2ocaml code to interface with PaSh.
27+
## Compiler Overview
1528

16-
Now let's get to the compiler. It's entry point is [compiler/pash.py](../compiler/pash.py) that parses a script and replaces potentially parallelizable regions with calls to [compiler/pash_runtime.sh](../compiler/pash_runtime.sh). It then executes the script.
29+
Now let's get to the compiler.
30+
Its entry point is [pash.py](./pash.py), which parses a script and replaces potentially parallelizable regions with calls to [pash_runtime.sh](./pash_runtime.sh).
31+
It then executes the script.
1732
This allows invoking the compiler during the runtime to have information about the values of environment variables.
1833

19-
The runtime script [compiler/pash_runtime.sh](../compiler/pash_runtime.sh) simply invokes the compiler [compiler/pash_runtime.py](../compiler/pash_runtime.py) and if it succeeds it executes the optimized script, otherwise it executes the original script.
34+
The [pash_runtime.sh](./pash_runtime.sh) script simply invokes the [pash_runtime.py](./pash_runtime.py) compiler:
35+
if it succeeds it executes the optimized script, otherwise it executes the original script.
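The fallback behavior can be sketched in a few lines. This is an illustration only: `run_with_fallback`, `compile_fn`, and `execute_fn` are hypothetical names, and the real logic lives in a shell script rather than in Python.

```python
def run_with_fallback(script, compile_fn, execute_fn):
    """Execute the optimized script if compilation succeeds, else the original."""
    try:
        optimized = compile_fn(script)
    except Exception:
        # Compilation failed: fall back to the original, sequential script.
        return execute_fn(script)
    return execute_fn(optimized)
```

The point of this design is that a compilation failure never prevents the script from running; it only forgoes the speedup.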
2036

21-
Now the compiler has several stages:
37+
The compiler has several stages:
2238

23-
1. It expands words in the AST and then it turns it into our dataflow model (guided by annotations)
24-
- The expansion and translation happens in [ast_to_ir.py](../compiler/ast_to_ir.py)
25-
- The dataflow model is mostly defined in [ir.py](../compiler/ir.py)
26-
- The annotations are processed in [annotations.py](../compiler/annotations.py) and [command_categories.py](../compiler/command_categories.py)
39+
1. It expands words in the AST and then it turns it into our dataflow model (guided by [annotations](../annotations))
40+
- The expansion and translation happen in [ast_to_ir.py](./ast_to_ir.py)
41+
- The dataflow model is defined mostly in [ir.py](./ir.py)
42+
- The annotations are processed in [annotations.py](./annotations.py) and [command_categories.py](./command_categories.py)
2743
2. It performs transformations on the dataflow graph to expose parallelism (guided by annotations)
28-
- Translations happen in [pash_runtime.py](../compiler/pash_runtime.py)
44+
- Translations happen in [pash_runtime.py](./pash_runtime.py)
2945
3. It then translates the dataflow graph back to a shell script to execute it with bash
30-
- The `dfg2shell` translation happens in [ir_to_ast.py](../compiler/ir_to_ast.py)
46+
- The `dfg2shell` translation happens in [ir_to_ast.py](./ir_to_ast.py)
47+
48+
[//]: # (TODO: the parsing/unparsing components need update)
49+
50+
## Zooming into Fragments
3151

32-
A few interesting fragments are shown below.
52+
A few interesting fragments are outlined below.
3353

34-
The [ast_to_ir.py](https://github.com/andromeda/pash/blob/main/compiler/ast_to_ir.py) contains a case statement that essentially pattern-matches on constructs of the shells script AST and then compiles them accordingly.
54+
The [ast_to_ir.py](./ast_to_ir.py) file contains a case statement that essentially pattern-matches on constructs of the shell script AST and then compiles them accordingly.
3555
```Python
3656
compile_cases = {
3757
"Pipe": (lambda fileIdGen, config:
@@ -43,11 +63,12 @@ Now the compiler has several stages:
4363
"Or": (lambda fileIdGen, config:
4464
lambda ast_node: compile_node_and_or_semi(ast_node, fileIdGen, config))
4565
# ... more code ...
46-
4766
```
4867

68+
The following function from [ir.py](./ir.py) is responsible for parallelizing a single node (_i.e._, a command) in the dataflow graph.
69+
The schematic in the comments starting [on line 637](./ir.py#L637), not shown below, gives a high-level overview of what this function does.
4970

50-
The following function from [ir.py](https://github.com/andromeda/pash/blob/main/compiler/ir.py) is responsible for parallelizing a single node (i.e., command) in the dataflow graph. Look at the schematic in the comments starting [on line 637](https://github.com/andromeda/pash/blob/main/compiler/ir.py#L637) that gives the high-level overview of what this function does (not shown below).
71+
[//]: # (TODO: Add schematic here)
5172

5273
```Python
5374
# See comment on line 637
@@ -60,7 +81,7 @@ The following function from [ir.py](https://github.com/andromeda/pash/blob/main/
6081
# ... more code ...
6182
```
6283

63-
Another interesting fragment is in [ir_to_ast.py](https://github.com/andromeda/pash/blob/main/compiler/ir_to_ast.py), which translates the parallel dataflow graph back to an AST.
84+
Another interesting fragment is in [ir_to_ast.py](./ir_to_ast.py), which translates the parallel dataflow graph back to an AST.
6485

6586
```Python
6687
def ir2ast(ir, args):
@@ -82,3 +103,11 @@ def ir2ast(ir, args):
82103

83104
This AST is then unparsed back into a (parallel) shell script.
84105

106+
## Earlier Versions
107+
108+
The compiler is outlined in the [EuroSys paper](https://arxiv.org/pdf/2007.09436.pdf), but has evolved considerably since then:
109+
110+
* PaSh originally did not have a preprocessing component, and didn't handle variable expansion. It now does both, significantly improving its practical applicability since it can be used on scripts where the environment variables are modified throughout the script.
111+
112+
* PaSh originally used code in [parser](./parser) -- a port of [libdash](https://github.com/mgree/), the `dash` parser extended with OCaml bindings -- and specifically the `ocaml2json` and `json2ocaml` binaries to interface with PaSh. PaSh now uses a custom parser written in Python, avoiding any dependency on OCaml and simplifying dependency management.
113+

compiler/parser/libdash

docs/README.md

+11-15
@@ -1,14 +1,15 @@
11
# PaSh Documentation
2+
Quick Jump: [using pash](#using-pash) | [videos](#videos--video-presentations) | [papers](#academic-papers)
23

3-
## Introductory Material
4+
## Using PaSh
45

56
The following resources offer overviews of important PaSh components.
67

78
* Short tutorial: [introduction](./tutorial.md#introduction), [installation](./tutorial.md#installation), [execution](./tutorial.md#running-scripts), and [next steps](./tutorial.md#what-next)
8-
* Annotations: [parallelizability](../annotations#main-parallelizability-classes) | [study](../annotations#parallelizability-study-of-commands-in-gnu--posix) | [example 1](../annotations#a-simple-example-chmod) | [example 1](../annotations#another-example-cut) | [howto](../annotations#how-to-annotate-a-command)
9-
* Compiler: [overview](../compiler)
10-
* Runtime: [overview](../runtime)
11-
* Scripts: [oneliners](../runtime)
9+
* Annotations: [parallelizability](../annotations#main-parallelizability-classes), [study](../annotations#parallelizability-study-of-commands-in-gnu--posix), [example 1](../annotations#a-simple-example-chmod), [example 2](../annotations#another-example-cut), [howto](../annotations#how-to-annotate-a-command)
10+
* Compiler: [intro](../compiler#introduction), [overview](../compiler#compiler-overview), [details](../compiler#zooming-into-fragments), [earlier versions](../compiler#earlier-versions)
11+
* Runtime: [split](../runtime#stream-splitting), [eager](../runtime#eager-stream-polling), [cleanup](../runtime#cleanup-logic), [aggregate](../runtime#aggregators)
12+
* Scripts: [one-liners](#common-unix-one-liners), [unix50](#unix-50-from-bell-labs), [weather analysis](#noaa-weather-analysis), [web indexing](#wikipedia-web-indexing)
1213

1314
## Videos & Video Presentations
1415

@@ -22,17 +23,16 @@ The following presentations offer short PaSh introductions:
2223

2324
The following papers present or use PaSh.
2425

25-
#### An Order-aware Dataflow Model for Extracting Shell Script Parallelism
26+
**An Order-aware Dataflow Model for Extracting Shell Script Parallelism**
2627
Shivam Handa, Konstantinos Kallas, Nikos Vasilakis, Martin Rinard
2728
pdf | bibtex
2829

29-
#### Automatic Synthesis of Parallel and Distributed Unix Commands with KumQuat
30+
**Automatic Synthesis of Parallel and Distributed Unix Commands with KumQuat**
3031
Nikos Vasilakis*, Jiasi Shen*, Martin Rinard
3132
pdf | bibtex
3233

33-
#### The Once and Future Shell
34+
**The Once and Future Shell**
3435
Michael Greenberg, Konstantinos Kallas, Nikos Vasilakis
35-
[pdf]() | <details><summary>bibtex</summary>
3636
```bibtex
3737
@inproceedings{pash:hotos:21,
3838
author = {Greenberg, Michael and Kallas, Konstantinos and Vasilakis, Nikos},
@@ -43,11 +43,10 @@ Michael Greenberg, Konstantinos Kallas, Nikos Vasilakis
4343
series = {HotOS '19}
4444
}
4545
```
46-
</details>
4746

48-
#### PaSh: Light-touch Data-Parallel Shell Processing
47+
**PaSh: Light-touch Data-Parallel Shell Processing**
4948
Nikos Vasilakis*, Konstantinos Kallas*, Konstantinos Mamouras, Achilles Benetopoulos, Lazar Cvetković
50-
[pdf](https://arxiv.org/pdf/2007.09436.pdf) | <details><summary>bibtex</summary>
49+
[arxiv](https://arxiv.org/pdf/2007.09436.pdf) | acm | video
5150
```bibtex
5251
@inproceedings{pash:eurosys:21,
5352
author = {Vasilakis, Nikos and Kallas, Konstantinos and Mamouras, Konstantinos and Benetopoulos, Achilles and Cvetkovi\'{c}, Lazar},
@@ -63,6 +62,3 @@ Nikos Vasilakis*, Konstantinos Kallas*, Konstantinos Mamouras, Achilles Benetopo
6362
series = {EuroSys '21}
6463
}
6564
```
66-
</details>
67-
68-

evaluation/benchmarks/README.md

+5-4
@@ -1,9 +1,10 @@
11
# Experimental Evaluation
2+
Quick Jump: [one-liners](#common-unix-one-liners) | [unix50](#unix-50-from-bell-labs) | [weather analysis](#noaa-weather-analysis) | [web indexing](#wikipedia-web-indexing)
23

34
_Most benchmark sets in the evaluation infrastructure include an `input/setup.sh` script for fetching inputs and setting up the experiment appropriately._
45
See [Running other script]() later.
56

6-
#### Section 6.1: Common Unix one-liners
7+
#### Common Unix one-liners
78

89
The one-liner scripts are included in [evaluation/microbenchmarks](../evaluation/microbenchmarks).
910
The list of scripts (and their correspondence to the names in the paper) is shown below:
@@ -61,7 +62,7 @@ Note that `-m` supersedes `-s` but `-l` does not supersede any of the two.
6162
Also note that if you run a script partially, it might end up saving partial results,
6263
therefore having 0 speedups in some points of the plots.
6364

64-
#### Section 6.2: Unix50 from Bell Labs
65+
#### Unix50 from Bell Labs
6566

6667
All of the Unix50 pipelines are in [evaluation/unix50/unix50.sh](../evaluation/unix50/unix50.sh).
6768
The inputs of the pipelines are in [evaluation/unix50/](../evaluation/unix50/).
@@ -112,7 +113,7 @@ These differences are due to the evolution of PaSh and the refinement of its ann
112113
The issue with these splits is that they do not manage to split the file (since there is only one line)
113114
leaving the rest of the script to run sequentially.
114115

115-
#### Section 6.3: Use Case: NOAA Weather Analysis
116+
#### NOAA Weather Analysis
116117

117118
Note that input files that are needed by this script
118119
are `curl`ed from a server in the local network and therefore
@@ -169,7 +170,7 @@ is actually higher than what is reported in the paper since it doesn't
169170
have to write the intermediate files (between preprocessing and processing) to disk.
170171

171172

172-
#### Section 6.4: Use Case: Wikipedia Web Indexing
173+
#### Wikipedia Web Indexing
173174

174175
Note that input files that are needed by this script (complete Wikipedia)
175176
are saved locally on the server and therefore this program cannot be run from elsewhere.

runtime/README.md

+4-4
@@ -3,20 +3,20 @@ Quick Jump: [Stream Splitting](#stream-splitting) | [Eager Stream Polling](#eage
33

44
PaSh includes a small library of runtime primitives supporting the runtime execution of parallel scripts emitted by the compiler.
55

6-
### Stream Splitting
6+
## Stream Splitting
77

88
The PaSh compiler inserts `split` nodes to expose parallelism when parallelizable nodes only have one input.
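To give a feel for what a `split` node does, here is a minimal sketch that divides a list of input lines into contiguous chunks, one per parallel copy of a command. The function `split_lines` is hypothetical, and splitting on line boundaries into contiguous chunks is an assumption of this sketch rather than a description of PaSh's actual `split` implementation.

```python
def split_lines(lines, n):
    """Split `lines` into `n` contiguous, order-preserving chunks."""
    if n <= 0:
        raise ValueError("n must be positive")
    size, extra = divmod(len(lines), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks
```

Concatenating the chunks in order reproduces the original input, which is what lets a later stage reassemble the outputs in order. Note that an input with a single line cannot be divided further under this scheme.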
99

10-
### Eager Stream Polling
10+
## Eager Stream Polling
1111

1212
To overcome the laziness challenges outlined in Sec. 5, PaSh inserts and instantiates `eager` nodes on streams.
1313

14-
### Cleanup Logic
14+
## Cleanup Logic
1515

1616
PaSh contains cleanup logic for dealing with dangling FIFOs.
1717
This is implemented in `wait_for_output_and_sigpipe_rest.sh`.
1818

19-
### Aggregators
19+
## Aggregators
2020

2121
There is a small custom aggregator library provided in [agg/py/](agg/py/).
2222
These aggregators are used to merge partial results from the parallel scripts.
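As an illustration of what merging partial results means, the sketch below shows two simple aggregation strategies. Both functions are hypothetical examples and are not taken from the library in `agg/py/`: one assumes partial line counts (as a `wc -l`-style command would produce), the other assumes partial outputs that are each already sorted.

```python
import heapq


def aggregate_counts(partial_counts):
    """Merge partial line counts by summing them."""
    return sum(partial_counts)


def aggregate_sorted(partial_outputs):
    """Merge already-sorted partial outputs into one sorted list."""
    return list(heapq.merge(*partial_outputs))
```

The right strategy depends on the command: counts add up, whereas sorted chunks must be interleaved rather than concatenated.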
