-
Notifications
You must be signed in to change notification settings - Fork 32
Tutorial
This tutorial walks through sq's features.
If you haven't installed sq
, see here. If you've already installed sq
, check that you're on a recent version like so:
$ sq --version
sq 0.14.5 #4faf1608 2020-12-30T19:38:47Z
If your version is older than that, please upgrade. Then start with sq help
.
Let's set out with an example, in this case an Excel file.
# No data source added to sq yet, so "sq ls" is empty.
$ sq ls
$ wget https://sq.io/testdata/xl_demo.xlsx
$ sq add xl_demo.xlsx --opts header=true --handle @xl_demo
@xl_demo xlsx xl_demo.xlsx
$ sq ls
@xl_demo* xlsx xl_demo.xlsx
$ sq ls -v
HANDLE DRIVER LOCATION OPTIONS
@xl_demo* xlsx /Users/neilotoole/sq/xl_demo.xlsx header=true
$ sq inspect @xl_demo
HANDLE DRIVER NAME SIZE TABLES LOCATION
@xl_demo xlsx xl_demo.xlsx 9.7KB 2 /Users/neilotoole/sq/xl_demo.xlsx
TABLE ROWS SIZE NUM COLS COL NAMES COL TYPES
person 7 - 4 uid, username, email, address_id numeric, string, string, string
address 2 - 6 address_id, street, city, state, zip, country numeric, string, string, string, numeric, string
Let's step through the above:
-
sq ls
: list the current sources. There are none. -
wget
: download an Excel file to use for this demo. -
sq add
: add a source. The type is inferred (or can be specified with--driver=xlsx
) to bexlsx
, and we give this source the handle@xl_demo
. This Excel file has a header row, so we also specify--opts=header=true
to indicate that the actual data begins on row 1, not row 0 (which is the header row). -
sq ls
: lists the sources again; this time we see our new@xl_demo
source. The asterisk beside the handle (@xl_demo*
) indicates that this is the active source. -
sq ls -v
: lists the sources yet again, this time verbosely (-v
). -
sq inspect
: inspects@xl_demo
, showing the structure of the source.
Now that we have added a source to sq
, we can query it.
$ sq @xl_demo.person
uid username email address_id
1 neilotoole [email protected] 1
2 ksoze [email protected] 2
3 kubla [email protected] NULL
4 tesla [email protected] 1
5 augustus [email protected] 2
6 julius [email protected] NULL
7 plato [email protected] 1
That listed the contents of the person
table (which for Excel, a table means a sheet). Being that @xl_demo
is the active source, sq @xl_demo.person
can also be accomplished by sq .person
.
The same query can be executed in the scratch
db's native SQL dialect like this (SQLite by default):
$ sq sql "SELECT * FROM person"
uid username email address_id
1 neilotoole [email protected] 1
2 ksoze [email protected] 2
3 kubla [email protected] NULL
4 tesla [email protected] 1
5 augustus [email protected] 2
6 julius [email protected] NULL
7 plato [email protected] 1
Here's some examples of using the SLQ query language:
$ sq '.person | .uid == 3'
uid username email address_id
3 kubla [email protected] NULL
It should be obvious that the above query effectively does a WHERE uid = 3
.
$ sq '.person | .[2:5] | .uid, .email, .address_id'
uid email address_id
3 [email protected] NULL
4 [email protected] 1
5 [email protected] 2
Above we select (zero-indexed) rows 2-5, and output columns uid
and email
. The same could be accomplished by:
$ sq sql 'SELECT uid, email, address_id FROM person LIMIT 3 OFFSET 2'
uid email address_id
3 [email protected] NULL
4 [email protected] 1
5 [email protected] 2
We could also output in JSON using sq --json '.person | .[2:5] | .uid, .email, .address_id'
:
[
{
"uid": 3,
"email": "[email protected]",
"address_id": null
},
{
"uid": 4,
"email": "[email protected]",
"address_id": 1
},
{
"uid": 5,
"email": "[email protected]",
"address_id": 2
}
]
In addition to standard JSON, you can output in JSON Lines format with the
--jsonl
flag:{"uid": 3, "email": "[email protected]", "address_id": null} {"uid": 4, "email": "[email protected]", "address_id": 1} {"uid": 5, "email": "[email protected]", "address_id": 2}Or in JSON Array format with
--jsona
:[3, "[email protected]", null] [4, "[email protected]", 1] [5, "[email protected]", 2]Or output in other formats: XML, HTML, Markdown, CSV, Excel, etc..
Let's look at the other "table" in the Excel spreadsheet:
$ sq .address
address_id street city state zip country
1 1600 Penn Washington DC 12345 US
2 999 Coleridge St Ulan Bator UB 888 MN
We can join across the tables (sheets) of @xl_demo
:
$ sq '.person, .address | join(.address_id) | .email, .city'
email city
[email protected] Washington
[email protected] Ulan Bator
[email protected] Washington
[email protected] Ulan Bator
[email protected] Washington
Let's grab another data source, this time in CSV. We'll download the file.
$ wget https://sq.io/testdata/person.csv
Now let's take a look at it:
$ cat person.csv | sq inspect
HANDLE DRIVER NAME SIZE TABLES LOCATION
@stdin csv @stdin 199.0B 1 @stdin
TABLE ROWS SIZE NUM COLS COL NAMES COL TYPES
data 7 - 4 A, B, C, D INTEGER, TEXT, TEXT, INTEGER
Note that because CSV is "monotable" (only has one table of data), its data is represented as a single table named
data
. Also note that because this particular CSV file doesn't have a header row, its columns are given namesA
,B
,C
, etc., following what Excel would do.
We can pipe that CSV file to sq
and performs the usual actions on it:
$ cat person.csv | sq '.data | .[2:5]'
A B C D
3 kubla [email protected] NULL
4 tesla [email protected] 1
5 augustus [email protected] 2
We could continue to cat
the CSV file to sq
, but being that we'll use it later in this tutorial, we'll add it as a source:
$ sq add person.csv -h @csv_demo
@csv_demo csv person.csv
As with the sheets of the Excel file, we can also join across sources:
$ sq '@csv_demo.data, @xl_demo.address | join(.D == .address_id) | .C, .city'
C city
[email protected] Washington
[email protected] Ulan Bator
[email protected] Washington
[email protected] Ulan Bator
[email protected] Washington
Now that we've added multiple sources, let's see what sq ls
has to say:
$ sq ls
@xl_demo* xlsx xl_demo.xlsx
@csv_demo csv person.csv
Note that @xl_demo
is the active source (it has an asterisk beside it). We can do this with @csv_demo
:
$ sq '@csv_demo.data | .[0:1]'
A B C D
1 neilotoole [email protected] 1
But not this:
$ sq '.data | .[0:1]'
sq: SQL query against @xl_demo failed: SELECT * FROM "data" LIMIT 1 OFFSET 0: no such table: data
Because the active source is still @xl_demo
. To see the active source:
$ sq src
@xl_demo xlsx xl_demo.xlsx
Let's switch the active source to the CSV file:
$ sq src @csv_demo
@csv_demo csv person.csv
Now we can use the shorthand form (omit the @csv_demo
handle) and sq
will look for table .data
in the active source (which is now @csv_demo
):
$ sq '.data | .[0:1]'
A B C D
1 neilotoole [email protected] 1
A useful feature is to ping the sources to verify that they're accessible:
$ sq ping all
@csv_demo 1ms pong
@xl_demo 3ms pong
Or we could ping just one:
$ sq ping @xl_demo
@xl_demo 4ms pong
Having read this far, you can be forgiven for thinking that sq
only deals with document-type formats such as Excel or CSV, but that is not the case. Let's add some SQL databases.
First we'll do postgres; we'll start a pre-built Sakila database via docker on port (note that it will take a moment for the Postgres container to start up):
$ docker run -p 5432:5432 sakiladb/postgres:latest
Let's add that Postgres database as a source:
$ sq add "postgres://sakila:p_ssW0rd@localhost/sakila?sslmode=disable"
@sakila_pg postgres sakila@localhost/sakila
If you don't want to type the password on the command line, you could instead use an environment variable:
$ export DEMO_PW=p_ssW0rd $ sq add "postgres://sakila:$DEMO_PW@localhost/sakila?sslmode=disable" @sakila_pg postgres sakila@localhost/sakila
The new source should show up in sq ls
:
$ sq ls
@xl_demo xlsx xl_demo.xlsx
@csv_demo csv person.csv
@sakila_pg* postgres sakila@localhost/sakila
Ping the new source just for fun:
$ sq ping @sakila_pg
@sakila_pg 9ms pong
We'll inspect a single table (just to keep the output small):
$ sq inspect @sakila_pg.actor
TABLE ROWS SIZE NUM COLS COL NAMES COL TYPES
actor 200 72.0KB 4 actor_id, first_name, last_name, last_update integer, character varying, character varying, timestamp without time zone
And take a little taste of that data:
$ sq '@sakila_pg.actor | .[0:2]'
actor_id first_name last_name last_update
1 PENELOPE GUINESS 2006-02-15 04:34:33 +0000 UTC
2 NICK WAHLBERG 2006-02-15 04:34:33 +0000 UTC
Now we'll add and start a MySQL instance of Sakila:
$ docker run -p 3306:3306 sakiladb/mysql:latest
$ sq add "mysql://sakila:p_ssW0rd@localhost/sakila"
@sakila_my mysql sakila@localhost/sakila
And get some data from @sakila_my
:
$ sq '@sakila_my.actor | .[0:2]'
actor_id first_name last_name last_update
1 PENELOPE GUINESS 2006-02-15 04:34:33 +0000 UTC
2 NICK WAHLBERG 2006-02-15 04:34:33 +0000 UTC
Note that as before
sq
can join across sources:sq '@sakila_pg.city, @sakila_my.country | join(.country_id) | .city, .country | .[0:3]' city country A Corua (La Corua) Spain Abha Saudi Arabia Abu Dhabi United Arab Emirates
In addition to JSON, CSV, etc., sq
can write query results to database tables.
We'll use the film_category
table as an example: the table is in both @sakila_pg
and @sakila_my
. We're going to truncate the table in @sakila_my
and then repopulate via a query against @sakila_pg
.
$ sq '@sakila_pg.film_category | count()'
count
1000
$ sq '@sakila_my.film_category | count()'
COUNT(*)
1000
Make a copy of the table as a backup.
$ sq tbl copy .film_category .film_category_bak
Copied table: @sakila_my.film_category --> @sakila_my.film_category_bak (1000 rows copied)
Note that the
sq tbl copy
makes use each database's own table copy functionality. Thussq tbl copy
can't be used to directly copy a table from one database to another. Butsq
provides a means of doing this, keep reading.
Truncate the @sakila_my.film_category
table:
$ sq tbl truncate @sakila_my.film_category
Truncated 1000 rows from @sakila_my.film_category
$ sq '@sakila_my.film_category | count()'
COUNT(*)
0
The @sakila_my.film_category
table is now empty. But we can repopulate it via a query against @sakila_pg
. For this example, we'll just do 500
rows.
$ sq '@sakila_pg.film_category | .[0:500]' --insert @sakila_my.film_category
Inserted 500 rows into @sakila_my.film_category
$ sq '@sakila_my.film_category | count()'
COUNT(*)
500
We can now use the sq tbl
commands to restore the previous state.
# Set @sakila_my as the active source just for brevity.
$ sq src @sakila_my
$ sq tbl drop .film_category
Dropped table @sakila_my.film_category
# Restore the film_category table from the backup table we made earlier
$ sq tbl copy .film_category_bak .film_category
Copied table: @sakila_my.film_category_bak --> @sakila_my.film_category (1000 rows copied)
# Verify that the table is restored
$ sq '.film_category | count()'
COUNT(*)
1000
# Get rid of the backup table
$ sq tbl drop .film_category_bak
Dropped table @sakila_my.film_category_bak
Note that sq
plays nicely with jq
. For example, list the names of the columns in table @sakila_pg.actor
:
$ sq inspect --json @sakila_pg.actor | jq -r '.columns[] | .name'
actor_id
first_name
last_name
last_update