Commit

release
fabuzaid21 committed Dec 5, 2017
1 parent bff24a4 commit 7b61f67
Showing 189 changed files with 206 additions and 57 deletions.
2 changes: 1 addition & 1 deletion bin/macrodiff → bin/macrobase-sql
@@ -1,7 +1,7 @@
#!/usr/bin/env bash

BIN=`dirname "$0"`
-BASE=$BIN/../repl
+BASE=$BIN/../sql
java -Xmx4g -cp "$BASE/target/classes:$BASE/target/*" \
edu.stanford.futuredata.macrobase.sql.MacrobaseSQLRepl "$@"

2 changes: 1 addition & 1 deletion docs/mkdocs.yml
@@ -7,7 +7,7 @@ pages:
- 'Tutorial': 'user-guide/tutorial.md'
- 'Common Parameters': 'user-guide/parameters.md'
- 'Advanced Configuration': 'user-guide/advanced-configuration.md'
-- 'MacroDiff': 'user-guide/macrodiff.md'
+- 'MacroBase SQL': 'user-guide/sql.md'

theme:
name: 'readthedocs'
50 changes: 0 additions & 50 deletions docs/source/user-guide/macrodiff.md

This file was deleted.

199 changes: 199 additions & 0 deletions docs/source/user-guide/sql.md
@@ -0,0 +1,199 @@
# MacroBase SQL

In addition to the UI and Java API, you can also write your MacroBase jobs as
SQL queries from the command line, using our custom shell. This enables
interactive exploration of your data using our MacroBase operators.

## New SQL Operator: DIFF

To support MacroBase queries in SQL, we've extended the standard SQL syntax to
include our own operator, called `DIFF`.

### DIFF Inputs

- `DIFF` takes in two relations as arguments: the first relation is the
**outliers** relation, while the second is the **inliers** relation. This
represents the classification stage of the MacroBase pipeline. Important:
both relations must share the same schema!

- The `DIFF` operator also requires you to specify which columns in the given
table you want to consider for explanations. This is done using the `ON`
keyword in SQL; `ON location, version, hw_model` means that MacroBase will
use those three columns for explanation generation. If you want to consider
all possible columns, use `ON *`.

- `COMPARE BY` specifies a **ratio metric function**, such as `risk_ratio`,
`global_ratio`, or `prevalence_ratio`. The ratio metric wraps an **aggregation
function** (such as `COUNT` or `AVG`) that takes a column (or multiple
columns) as an argument, for example `risk_ratio(COUNT(*))`.

- You can also specify an optional keyword called `MAX COMBO` that sets the
maximum order of the generated explanations, i.e. the largest number of
columns combined in a single explanation (`MAX COMBO [number]`).

- Remember: a `DIFF` query is just SQL! So you can include any other standard
SQL clause: you can add `WHERE` clauses, `ORDER BY`s, and `LIMIT`s, for
example. (`GROUP BY` and `HAVING` are not yet supported.) For example, if you
want to prune out results with low support (e.g., below 0.2), simply add `WHERE support > 0.2`
to your SQL query. (By default, `DIFF` queries prune out all
results with support less than 0.2 and ratios less than 1.5.)
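
The default pruning described in the last bullet can be pictured as a simple filter over explanation rows. The following is a minimal sketch, not MacroBase's actual code: the `DefaultPruning` class, its nested `Row` type, and the sample rows are hypothetical and exist only for illustration. Only the two thresholds (0.2 and 1.5) come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; these are not MacroBase classes. */
final class DefaultPruning {
    // Default thresholds quoted in the documentation above.
    static final double MIN_SUPPORT = 0.2;
    static final double MIN_RATIO = 1.5;

    /** Hypothetical holder for one explanation row. */
    static final class Row {
        final String attributes;
        final double ratio;
        final double support;

        Row(String attributes, double ratio, double support) {
            this.attributes = attributes;
            this.ratio = ratio;
            this.support = support;
        }
    }

    /** Keeps rows whose support and ratio both reach the given minimums. */
    static List<Row> prune(List<Row> rows, double minSupport, double minRatio) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            if (r.support >= minSupport && r.ratio >= minRatio) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows: B fails the ratio threshold, C fails the support threshold.
        List<Row> rows = Arrays.asList(
                new Row("location=A", 3.0, 0.5),
                new Row("location=B", 1.2, 0.6),
                new Row("location=C", 4.0, 0.1));
        for (Row r : prune(rows, MIN_SUPPORT, MIN_RATIO)) {
            System.out.println(r.attributes + " ratio=" + r.ratio + " support=" + r.support);
        }
    }
}
```

An explicit `WHERE support > ...` clause in the query would then act as a further filter on the rows that survive these defaults.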

#### Summary
Overall, the formal definition of a `DIFF` query looks something like this:
```sql
SELECT <column_name>,..., <column_name>
FROM DIFF (<relation>, <relation>)
ON { <column_name>,..., <column_name> | * }
COMPARE BY { <ratio_metric_fn>(<aggregation_fn>(<column_name> | *)) }
[ MAX COMBO <number> ]
[ WHERE <boolean_expression> ]
[ ORDER BY <column_name> ( ASC | DESC ) ]
[ LIMIT { <number> | ALL } ];
```

### DIFF Output

In order for MacroBase to be compatible with traditional SQL queries, the `DIFF`
operator has to be _composable_: it has to seamlessly integrate with other SQL
commands. (This is because SQL is fundamentally a [relational
algebra](https://en.wikipedia.org/wiki/Relational_algebra), except that it
operates on multi-sets instead of sets.)

This means that the `DIFF` operator has to output a relation, just like the
output of any other SQL operator: a `SELECT`, a `WHERE`, a `JOIN`, or a `GROUP BY`
always outputs a relation, so that users can layer additional queries or
clauses downstream. The `DIFF` operator does, in fact, output a relation---it
outputs the columns specified in the `ON` clause, plus three additional
columns: the ratio column, the **support** column, and the outlier counts
column.
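
To make the ratio and support columns more concrete, here is a minimal sketch of how such values could be computed from raw counts. It uses the textbook relative-risk definition and treats support as the fraction of outlier rows that match an explanation, which is consistent with the example output further down (30 of 38 outliers gives a support of .789474). Whether MacroBase's `risk_ratio` handles edge cases exactly this way is an assumption. The `DiffMetrics` class and the inlier counts in `main` are hypothetical.

```java
/** Illustrative sketch only; this is not a MacroBase class. */
final class DiffMetrics {

    /** Fraction of all outlier rows that match the explanation. */
    static double support(double matchedOutliers, double totalOutliers) {
        return matchedOutliers / totalOutliers;
    }

    /**
     * Textbook relative risk: the outlier rate among rows matching the
     * explanation, divided by the outlier rate among rows that do not match.
     */
    static double riskRatio(double matchedOutliers, double matchedInliers,
                            double totalOutliers, double totalInliers) {
        double matched = matchedOutliers + matchedInliers;
        double unmatchedOutliers = totalOutliers - matchedOutliers;
        double unmatched = (totalOutliers + totalInliers) - matched;

        if (matched == 0) {
            return 0.0; // convention chosen for this sketch: nothing matched
        }
        if (unmatchedOutliers == 0) {
            return Double.POSITIVE_INFINITY; // every outlier matches the explanation
        }
        return (matchedOutliers / matched) / (unmatchedOutliers / unmatched);
    }

    public static void main(String[] args) {
        // 30 and 38 mirror the outlier counts in the example output;
        // 49 and 1019 are hypothetical inlier counts chosen for illustration.
        System.out.println("support    = " + support(30, 38));
        System.out.println("risk ratio = " + riskRatio(30, 49, 38, 1019));
    }
}
```

When every outlier matches the explanation, the denominator is zero and the sketch returns infinity, which is one way to arrive at a `∞` ratio like the one in the example output.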

For example, suppose we ran the following SQL query in MacroBase-SQL:

```sql
SELECT * FROM
DIFF
(SELECT * FROM sample WHERE usage > 1000.0) outliers,
(SELECT * FROM sample WHERE usage < 1000.0) inliers
ON
location, version
COMPARE BY
risk_ratio(COUNT(*)) ORDER BY support;
```

Then, the output of that query will look something like this:

```
-------------------------------------------------------------------------------------------
| location | version | risk_ratio | support | outlier_count |
-------------------------------------------------------------------------------------------
| CAN | v2 | 1.5173 | .210526 | 8.0 |
| null | v1 | 10.456989 | .789474 | 30.0 |
| CAN | v1 | 46.424051 | .789474 | 30.0 |
| CAN | null | ∞ | 1.0 | 38.0 |
-------------------------------------------------------------------------------------------
```

Here, the `null` values correspond to an explanation that does not include a
value for that particular column; if you've written `GROUP BY` or `CUBE`
queries in SQL, these `null`s mean the exact same thing. (In database
parlance, the inclusion of these `null`s means we are returning a normalized
relation.) For example, the last row in the output above is examining the
explanation results for `location=CAN`; the row immediately above examines the
results for `location=CAN && version=v1`.
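
One way to read these `null`s is as wildcards: a row matches an explanation if it agrees on every column where the explanation has a value. The sketch below illustrates that reading only; the `ExplanationMatch` class and its helper are hypothetical and are not part of MacroBase.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; this is not a MacroBase class. */
final class ExplanationMatch {

    /** Builds a two-column record; either value may be null. */
    static Map<String, String> of(String location, String version) {
        Map<String, String> m = new HashMap<>(); // HashMap permits null values
        m.put("location", location);
        m.put("version", version);
        return m;
    }

    /** A null in the explanation means "any value" for that column. */
    static boolean matches(Map<String, String> explanation, Map<String, String> row) {
        for (Map.Entry<String, String> e : explanation.entrySet()) {
            String wanted = e.getValue();
            if (wanted == null) {
                continue; // column is not part of this explanation
            }
            if (!wanted.equals(row.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> canOnly = of("CAN", null); // location=CAN
        Map<String, String> canV1 = of("CAN", "v1");   // location=CAN && version=v1
        Map<String, String> row = of("CAN", "v2");

        System.out.println(matches(canOnly, row));
        System.out.println(matches(canV1, row));
    }
}
```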


## Building MacroBase SQL

MacroBase-SQL depends on `macrobase-lib`, so build and install the `lib/`
directory if you haven't already: `cd lib && mvn install && cd -`. Then build
`macrobase-sql`: `cd sql && mvn package && cd -`.

## Running MacroBase SQL

To run MacroBase-SQL, run `bin/macrobase-sql`. You can also import a .sql file
with pre-written SQL queries; just run `bin/macrobase-sql -f [path/to/file]`.

You should see this:

```
Welcome to
__ ___ ____
/ |/ /___ _______________ / __ )____ _________
/ /|_/ / __ `/ ___/ ___/ __ \/ __ / __ `/ ___/ _ \
/ / / / /_/ / /__/ / / /_/ / /_/ / /_/ (__ ) __/
/_/ /_/\__,_/\___/_/ \____/_____/\__,_/____/\___/
macrobase-sql>
```

### Demo

If you run `bin/macrobase-sql -f sql/demo.sql`, you should see the following output:

```
Welcome to
__ ___ ____
/ |/ /___ _______________ / __ )____ _________
/ /|_/ / __ `/ ___/ ___/ __ \/ __ / __ `/ ___/ _ \
/ / / / /_/ / /__/ / / /_/ / /_/ / /_/ (__ ) __/
/_/ /_/\__,_/\___/_/ \____/_____/\__,_/____/\___/
IMPORT FROM CSV FILE 'core/demo/sample.csv' INTO sample(usage double, latency double, location string, version string);
1057 rows
-----------------------------------------------------
| usage | latency | location | version |
-----------------------------------------------------
| 30.77 | 238.0 | CAN | v2 |
| 31.28 | 611.0 | CAN | v2 |
| 31.17 | 768.0 | RUS | v4 |
| 30.94 | 192.0 | AUS | v3 |
| 35.36 | 401.0 | UK | v3 |
| 39.12 | 531.0 | RUS | v4 |
| 33.9 | 223.0 | UK | v3 |
...
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
| 1000.77 | 864.0 | CAN | v2 |
-----------------------------------------------------
SELECT * FROM
DIFF
(SELECT * FROM sample WHERE usage > 1000.0) outliers,
(SELECT * FROM sample WHERE usage < 1000.0) inliers
ON
location, version
COMPARE BY
risk_ratio(COUNT(*)) ORDER BY support;
4 rows
-------------------------------------------------------------------------------------------
| location | version | risk_ratio | support | outlier_count |
-------------------------------------------------------------------------------------------
| CAN | v2 | 1.5173 | .210526 | 8.0 |
| null | v1 | 10.456989 | .789474 | 30.0 |
| CAN | v1 | 46.424051 | .789474 | 30.0 |
| CAN | null | ∞ | 1.0 | 38.0 |
-------------------------------------------------------------------------------------------
SELECT * FROM
DIFF
(SELECT * FROM sample WHERE usage > 1000.0) outliers,
(SELECT * FROM sample WHERE usage < 1000.0) inliers
ON
location, version
COMPARE BY
global_ratio(COUNT(*)) WHERE global_ratio > 10.0;
1 row
-------------------------------------------------------------------------------------------
| location | version | global_ratio | support | outlier_count |
-------------------------------------------------------------------------------------------
| CAN | v1 | 10.562958 | .789474 | 30.0 |
-------------------------------------------------------------------------------------------
```
@@ -117,7 +117,7 @@ public String toString() {
* @param maxNumToPrint maximum number of rows from the DataFrame to print
*/
public void prettyPrint(final int maxNumToPrint) {
-System.out.println(numRows + (numRows == 1 ? "row" : " rows"));
+System.out.println(numRows + (numRows == 1 ? " row" : " rows"));

final int maxColNameLength = schema.getColumnNames().stream()
.reduce("", (x, y) -> x.length() > y.length() ? x : y).length() + 4; // 2 extra spaces on both sides
2 changes: 1 addition & 1 deletion pom.xml
@@ -41,7 +41,7 @@
<module>legacy</module>
<module>frontend</module>
<!-- <module>contrib</module> -->
-<module>repl</module>
+<module>sql</module>
<module>lib</module>
<module>core</module>
</modules>
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -34,7 +34,7 @@ public static DataFrame diff(final DataFrame outliers, final DataFrame inliers,
summarizer.setAttributes(cols);
summarizer.setRatioMetric(ExplanationMetric.getMetricFn(ratioMetricStr));
summarizer.setMinSupport(0.2); // TODO:
-summarizer.setMinRatioMetric(2.0); // TODO:
+summarizer.setMinRatioMetric(1.5); // TODO:

summarizer.process(combined);
final APExplanation explanations = summarizer.getResults();
@@ -109,7 +109,7 @@ private void runRepl() throws IOException {
}

private String readConsoleInput() throws IOException {
-reader.setPrompt("macrodiff> ");
+reader.setPrompt("macrobase-sql> ");
String line = reader.readLine();
if (line == null || line.equalsIgnoreCase("quit") || line.equalsIgnoreCase("exit")) {
return "";
@@ -127,7 +127,7 @@ private String readConsoleInput() throws IOException {
commandBuilder.append("\n");
commandBuilder.append(line);
}
-reader.setPrompt("macrodiff> ");
+reader.setPrompt("macrobase-sql> ");
return commandBuilder.toString();
}

File renamed without changes.
File renamed without changes.