The parallelizability study informed the design of the annotation language, which is described below.
> _N.b.: We welcome contributions to the study and annotations for common commands._
## Main Parallelizability Classes
PaSh introduces four major parallelizability classes:
If parallelized on a single input, each stage would need to wait on the results of the previous stage.
The last class, `side-effectful`, contains commands that have side-effects across the system -- for example, updating environment variables, interacting with the filesystem, and accessing the network.
Such commands are not parallelizable without finer-grained concurrency control mechanisms that can detect side-effects across the system.
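To make the classification concrete, the sketch below models it as a toy lookup. The command classifications mirror examples from the study, but the mapping and the helper function are hypothetical illustrations, not PaSh's actual annotation database:

```Python
# Illustrative sketch only: a toy mapping of commands to the four classes.
# The helper below is hypothetical and not part of PaSh's codebase.
STATELESS = "stateless"            # each input line handled independently (e.g., tr)
PURE = "pure"                      # deterministic, but needs all of its input (e.g., sort)
NON_PARALLELIZABLE = "non-parallelizable"
SIDE_EFFECTFUL = "side-effectful"  # affects system state (e.g., chmod)

EXAMPLE_CLASSES = {
    "tr": STATELESS,
    "grep": STATELESS,
    "sort": PURE,
    "wc": PURE,
    "chmod": SIDE_EFFECTFUL,
}

def is_parallelizable(command):
    """Stateless and pure commands are the ones amenable to parallelization."""
    return EXAMPLE_CLASSES.get(command) in (STATELESS, PURE)
```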
## Parallelizability Study of Commands in GNU & POSIX
The parallelizability study of commands in GNU and POSIX comprises two parts: a coarse-grained parallelizability study and a set of annotations for commands.
Annotations can be thought of as defining a bidirectional correspondence between a command and a dataflow node.
Since command behaviors (and correspondence) can change based on their arguments, annotations contain a sequence of predicates.
Each predicate is accompanied by information that instantiates the correspondence between a command and a dataflow node.
## A Simple Example: `chmod`
As a first example, below we present the annotations for `chmod`.
The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node.
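The annotation file itself is not reproduced here; the dictionary below is a hedged sketch of its likely shape. The field names (`command`, `cases`, `predicate`, `class`) are assumptions for illustration, not the exact schema:

```Python
# Hypothetical sketch of a chmod annotation as a Python dict; field names
# are assumed for illustration, not PaSh's exact annotation schema.
chmod_annotation = {
    "command": "chmod",
    "cases": [
        {
            "predicate": "default",     # applies regardless of arguments
            "class": "side-effectful",  # hence: never a dataflow node
        }
    ],
}

def default_class(annotation):
    """Return the class of the default (first) case in the sketch above."""
    return annotation["cases"][0]["class"]
```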
## Another Example: `cut`
As another example, below we present the annotations for `cut`.
Inputs are always assigned to the non-option arguments and the output is always the command's stdout.
The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents the stdin, and the option "empty-args-stdin" indicates that if non-option arguments are empty, then the command reads from its stdin.
The list identified by "short-long" contains a correspondence of short and long argument names for this command.
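To illustrate these options, here is a hedged sketch of how a consumer of the annotation might resolve a command's inputs; `resolve_inputs` and `SHORT_LONG` are hypothetical helpers, not PaSh's implementation (though the flag pairs shown are real `cut` flags):

```Python
def resolve_inputs(non_option_args, stdin_hyphen=True, empty_args_stdin=True):
    """Map non-option arguments to input sources, mirroring the two options
    described above: a lone '-' stands for stdin, and no non-option
    arguments at all means the command reads from stdin."""
    if not non_option_args and empty_args_stdin:
        return ["stdin"]
    return ["stdin" if (arg == "-" and stdin_hyphen) else arg
            for arg in non_option_args]

# A short/long correspondence like the one identified by "short-long"
# (these particular pairs are real flags of `cut`):
SHORT_LONG = {"-d": "--delimiter", "-f": "--fields", "-c": "--characters"}
```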
## How to Annotate a Command
The first step to annotating a command is to identify its default class: `stateless`, `pure`, `non-parallelizable`, or `side-effectful`. How does the command behave without any inputs?
The next step is to identify the set of inputs and their order.
PaSh has recently shifted away from ahead-of-time compilation and towards just-in-time compilation intermixed with the execution of a script.
This shift brings many benefits, allowing PaSh to correctly handle expansion and other important details -- but it complicates the clear exposition of the two phases.
A high-level diagram of PaSh's end-to-end operation is shown below:

PaSh pre-processes a sequential script to insert calls to `pash_runtime.py`.
It then invokes the script, switching between evaluation, execution, and parallelization at runtime:
(i) it first parses the script, creating an abstract syntax tree (AST);
(ii) it then expands the nodes of the AST, often calling the shell to perform that expansion;
(iii) it compiles dataflow regions, i.e., parts of the AST that are potentially parallelizable, through an iterative optimization procedure applied over a dataflow graph (DFG); and
(iv) it finally emits the parallel script by translating the DFG to an AST and unparsing that AST back to a shell script.
The compilation takes into account information about individual commands through [annotations](../annotations), and the emitted parallel script uses additional constructs provided by PaSh's [runtime library](../runtime).
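The four runtime steps can be sketched as a toy driver; every helper below is an illustrative stand-in, not PaSh's actual API:

```Python
# Toy sketch of steps (i)-(iv); all helpers are illustrative stand-ins.

def parse(src):
    # (i) parse the script into a (toy) AST node
    return {"kind": "command", "text": src}

def expand(ast):
    # (ii) expand words in the AST (a no-op placeholder here)
    return ast

def compile_to_dfg(ast):
    # (iii) build and optimize a (toy) dataflow graph
    return {"nodes": [ast["text"]], "edges": []}

def emit(dfg):
    # (iv) translate the DFG back to a shell-script string
    return " | ".join(dfg["nodes"])

def jit_compile_region(region_source):
    """Run one region through all four steps."""
    return emit(compile_to_dfg(expand(parse(region_source))))
```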
A correspondence between blocks in the diagram and Python modules is shown below:
- Preprocessing: [pash.py](./pash.py)
- Expansion and compilation: [ast_to_ir.py](./ast_to_ir.py)
- Dealing with annotations: [annotations.py](./annotations.py), [command_categories.py](./command_categories.py)
First, there is the parser in [compiler/parser](../compiler/parser), a port of [libdash](https://github.com/mgree/) -- the dash parser extended with OCaml bindings -- further extended with `ocaml2json` and `json2ocaml` code to interface with PaSh.
## Compiler Overview
Now let's get to the compiler.
Its entry point is [pash.py](./pash.py), which parses a script and replaces potentially parallelizable regions with calls to [pash_runtime.sh](./pash_runtime.sh).
It then executes the script.
This allows the compiler to be invoked at runtime, when the values of environment variables are known.
The [pash_runtime.sh](./pash_runtime.sh) script simply invokes the [pash.py](./pash.py) compiler:
if compilation succeeds, it executes the optimized script; otherwise, it executes the original script.
The compiler has several stages:
1. It expands words in the AST and then it turns it into our dataflow model (guided by [annotations](../annotations))
   - The expansion and translation happen in [ast_to_ir.py](./ast_to_ir.py)
   - The dataflow model is defined mostly in [ir.py](./ir.py)
   - The annotations are processed in [annotations.py](./annotations.py) and [command_categories.py](./command_categories.py)
2. It performs transformations on the dataflow graph to expose parallelism (guided by annotations)
   - Translations happen in [pash_runtime.py](./pash_runtime.py)
3. It then translates the dataflow graph back to a shell script to execute it with bash
   - The `dfg2shell` translation happens in [ir_to_ast.py](./ir_to_ast.py)
[//]: # (TODO: the parsing/unparsing components need update)
## Zooming into Fragments
A few interesting fragments are outlined below.
The [ast_to_ir.py](./ast_to_ir.py) file contains a case statement that essentially pattern-matches on constructs of the shell script's AST and then compiles them accordingly.
```Python
compile_cases = {
    "Pipe": (lambda fileIdGen, config:
    # ... more code ...
```
The following function from [ir.py](./ir.py) is responsible for parallelizing a single node (_i.e._, a command) in the dataflow graph.
Look at the schematic in the comments starting [on line 637](./ir.py#L637) that gives the high-level overview of what this function does (not shown below).
[//]: # (TODO: Add schematic here)
```Python
# See comment on line 637
# ... more code ...
```
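As a rough intuition for what that parallelization step does, the toy sketch below replaces one stateless node with a split, n parallel copies of the command, and a merge; the names and the graph representation are illustrative only, not PaSh's actual data structures:

```Python
def parallelize_node(node_cmd, n):
    """Toy sketch of node parallelization: a split node feeds n parallel
    copies of the command, whose outputs are combined by a merge node.
    All node names and fields are illustrative."""
    split = {"cmd": "split", "outputs": n}
    workers = [{"cmd": node_cmd, "id": i} for i in range(n)]
    merge = {"cmd": "merge", "inputs": n}
    return [split] + workers + [merge]
```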
Another interesting fragment is in [ir_to_ast.py](./ir_to_ast.py), which translates the parallel dataflow graph back to an AST.
```Python
def ir2ast(ir, args):
    # ... more code ...
```
This AST is then unparsed back into a (parallel) shell script.
## Earlier Versions
The compiler is outlined in the [EuroSys paper](https://arxiv.org/pdf/2007.09436.pdf), but has evolved considerably since then:
* PaSh originally did not have a preprocessing component and did not handle variable expansion. It now does both, significantly improving its practical applicability, since it can be used on scripts where environment variables are modified throughout the script.
* PaSh originally used the code in [parser](./parser) -- a port of [libdash](https://github.com/mgree/), the `dash` parser extended with OCaml bindings -- and specifically the `ocaml2json` and `json2ocaml` binaries to interface with PaSh. PaSh now uses a custom parser written in Python, avoiding any dependency on OCaml and simplifying dependency management.
The following resources offer overviews of important PaSh components.
* Short tutorial: [introduction](./tutorial.md#introduction), [installation](./tutorial.md#installation), [execution](./tutorial.md#running-scripts), and [next steps](./tutorial.md#what-next)
_Most benchmark sets in the evaluation infrastructure include a `input/setup.sh` script for fetching inputs and setting up the experiment appropriately._
See [Running other scripts]() later.
#### Common Unix one-liners
The one-liner scripts are included in [evaluation/microbenchmarks](../evaluation/microbenchmarks).
The list of scripts (and their correspondence to the names in the paper) is shown below:
Note that `-m` supersedes `-s`, but `-l` does not supersede either of the two.
Also note that if you run a script partially, it might save partial results, and therefore show 0 speedups at some points in the plots.
#### Unix50 from Bell Labs
All of the Unix50 pipelines are in [evaluation/unix50/unix50.sh](../evaluation/unix50/unix50.sh).
The inputs of the pipelines are in [evaluation/unix50/](../evaluation/unix50/).
These differences are due to the evolution of PaSh and the refinement of its annotations.
The issue with these splits is that they do not manage to split the file (since there is only one line), leaving the rest of the script to run sequentially.
#### NOAA Weather Analysis
Note that the input files needed by this script are `curl`ed from a server on the local network, and therefore this program cannot be run from elsewhere.
The speedup observed here is actually higher than what is reported in the paper, since PaSh doesn't have to write the intermediate files (between preprocessing and processing) to disk.
#### Wikipedia Web Indexing
Note that the input files needed by this script (the complete Wikipedia) are saved locally on the server, and therefore this program cannot be run from elsewhere.