Merge pull request #3012 from szarnyasg/nits-20240608a

Minor fixes

szarnyasg authored Jun 8, 2024
2 parents 9a25a46 + a75d698 commit dbc171c
Showing 8 changed files with 81 additions and 51 deletions.
88 changes: 53 additions & 35 deletions docs/api/python/conversion.md
@@ -11,14 +11,14 @@ This page documents the rules for converting [Python objects to DuckDB](#object-

This is a mapping of Python object types to DuckDB [Logical Types](../../sql/data_types/overview):

* `None` -> `NULL`
* `bool` -> `BOOLEAN`
* `datetime.timedelta` -> `INTERVAL`
* `str` -> `VARCHAR`
* `bytearray` -> `BLOB`
* `memoryview` -> `BLOB`
* `decimal.Decimal` -> `DECIMAL` / `DOUBLE`
* `uuid.UUID` -> `UUID`
* `None` → `NULL`
* `bool` → `BOOLEAN`
* `datetime.timedelta` → `INTERVAL`
* `str` → `VARCHAR`
* `bytearray` → `BLOB`
* `memoryview` → `BLOB`
* `decimal.Decimal` → `DECIMAL` / `DOUBLE`
* `uuid.UUID` → `UUID`

The rest of the conversion rules are as follows.
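
As a minimal sketch of the mapping above, you can bind Python values as prepared-statement parameters and inspect the resulting DuckDB types with the `typeof()` SQL function (the exact `DECIMAL` width shown depends on the value):

```python
import datetime
import decimal
import uuid

import duckdb

# Bind Python values as parameters and let typeof() report the
# DuckDB type each one was converted to.
res = duckdb.execute(
    "SELECT typeof(?), typeof(?), typeof(?), typeof(?)",
    [True, datetime.timedelta(days=1), decimal.Decimal("1.23"), uuid.uuid4()],
).fetchall()
print(res)  # e.g., [('BOOLEAN', 'INTERVAL', 'DECIMAL(3,2)', 'UUID')]
```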

@@ -156,39 +156,57 @@ DuckDB's Python client provides multiple additional methods that can be used to

* `pl()` fetches the data as a Polars DataFrame

Below are some examples using this functionality. See the Python [guides](../../guides/index#python-client) for more examples.
### Examples

Below are some examples using this functionality. See the [Python guides](../../guides/index#python-client) for more examples.

Fetch as Pandas DataFrame:

```python
# fetch as Pandas DataFrame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
# item value count
# 0 jeans 20.0 1
# 1 hammer 42.2 2
# 2 laptop 2000.0 1
# 3 chainsaw 500.0 10
# 4 iphone 300.0 2

# fetch as dictionary of numpy arrays
```

```text
item value count
0 jeans 20.0 1
1 hammer 42.2 2
2 laptop 2000.0 1
3 chainsaw 500.0 10
4 iphone 300.0 2
```

Fetch as dictionary of NumPy arrays:

```python
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
# mask=[False, False, False, False, False],
# fill_value='?',
# dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
# mask=[False, False, False, False, False],
# fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
# mask=[False, False, False, False, False],
# fill_value=999999,
# dtype=int32)}

# fetch as an Arrow table. Converting to Pandas afterwards just for pretty printing
```

```text
{'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
mask=[False, False, False, False, False],
fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
mask=[False, False, False, False, False],
fill_value=999999,
dtype=int32)}
```

Fetch as an Arrow table, converting to Pandas afterwards just for pretty printing:

```python
tbl = con.execute("SELECT * FROM items").fetch_arrow_table()
print(tbl.to_pandas())
# item value count
# 0 jeans 20.00 1
# 1 hammer 42.20 2
# 2 laptop 2000.00 1
# 3 chainsaw 500.00 10
# 4 iphone 300.00 2
```

```text
item value count
0 jeans 20.00 1
1 hammer 42.20 2
2 laptop 2000.00 1
3 chainsaw 500.00 10
4 iphone 300.00 2
```
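
The `pl()` method mentioned above works the same way; a short sketch, assuming the `polars` package is installed:

```python
# Fetch as a Polars DataFrame (requires the polars package).
df = con.execute("SELECT * FROM items").pl()
print(df)
```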
9 changes: 6 additions & 3 deletions docs/api/python/data_ingestion.md
@@ -135,8 +135,11 @@ DuckDB supports querying multiple types of Apache Arrow objects including [table
import duckdb
import pandas as pd
test_df = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
duckdb.sql("SELECT * FROM test_df").fetchall()
# [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
print(duckdb.sql("SELECT * FROM test_df").fetchall())
```

```text
[(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
```

DuckDB also supports "registering" a DataFrame or Arrow object as a virtual table, comparable to a SQL `VIEW`. This is useful when querying a DataFrame/Arrow object that is stored in another way (as a class variable, or a value in a dictionary). Below is a Pandas example:
@@ -149,7 +152,7 @@ import pandas as pd
my_dictionary = {}
my_dictionary["test_df"] = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
duckdb.register("test_df_view", my_dictionary["test_df"])
duckdb.sql("SELECT * FROM test_df_view").fetchall()
print(duckdb.sql("SELECT * FROM test_df_view").fetchall())
```

```text
[(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
```
3 changes: 2 additions & 1 deletion docs/data/csv/auto_detection.md
@@ -82,6 +82,7 @@ FlightDate|UniqueCarrier|OriginCityName|DestCityName
```

In this file, the dialect detection works as follows:

* If we split by a `|`, every row is split into `4` columns
* If we split by a `,`, rows 2–4 are split into `3` columns, while the first row is split into `1` column
* If we split by `;`, every row is split into `1` column
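
To inspect what the sniffer concluded, DuckDB's `sniff_csv` table function can be called from the Python client; a sketch, assuming `flights.csv` is in the working directory:

```python
import duckdb

# Report the detected delimiter, quoting, header flag, and column types.
duckdb.sql("FROM sniff_csv('flights.csv')").show()
```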
@@ -109,7 +110,7 @@ The type detection works by attempting to convert the values in each column to t

Note that everything can be cast to `VARCHAR`. This type has the lowest priority – i.e., columns are converted to `VARCHAR` if they cannot be cast to anything else. In [`flights.csv`](/data/flights.csv) the `FlightDate` column will be cast to a `DATE`, while the other columns will be cast to `VARCHAR`.

The detected types can be individually overridden using the `types` option. This option takes either a list of types (e.g., `types = [INTEGER, VARCHAR, DATE]`) which overrides the types of the columns in-order of occurrence in the CSV file. Alternatively, `types` takes a `name -> type` map which overrides options of individual columns (e.g., `types = {'quarter': INTEGER}`).
The detected types can be individually overridden using the `types` option. This option takes either a list of types (e.g., `types = [INTEGER, VARCHAR, DATE]`), which overrides the column types in order of occurrence in the CSV file, or a `name` → `type` map, which overrides the types of individual columns (e.g., `types = {'quarter': INTEGER}`).
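
A sketch of the list form through the Python client, assuming the four-column `flights.csv` shown earlier:

```python
import duckdb

# Override all column types in order of occurrence.
rel = duckdb.sql("""
    SELECT *
    FROM read_csv('flights.csv',
                  types = ['DATE', 'VARCHAR', 'VARCHAR', 'VARCHAR'])
""")
print(rel.types)  # e.g., [DATE, VARCHAR, VARCHAR, VARCHAR]
```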

The type detection can be entirely disabled by using the `all_varchar` option. If this is set all columns will remain as `VARCHAR` (as they originally occur in the CSV file).

4 changes: 2 additions & 2 deletions docs/data/csv/overview.md
@@ -64,13 +64,13 @@ CREATE TABLE ontime AS
Write the result of a query to a CSV file.

```sql
COPY (SELECT * FROM ontime) TO 'flights.csv' WITH (HEADER true, DELIMITER '|');
COPY (SELECT * FROM ontime) TO 'flights.csv' WITH (HEADER, DELIMITER '|');
```

If we serialize the entire table, we can simply refer to it by its name.

```sql
COPY ontime TO 'flights.csv' WITH (HEADER true, DELIMITER '|');
COPY ontime TO 'flights.csv' WITH (HEADER, DELIMITER '|');
```

## CSV Loading
2 changes: 1 addition & 1 deletion docs/data/csv/tips.md
@@ -23,7 +23,7 @@ SELECT * FROM read_csv('flights.csv', names = ['DateOfFlight', 'CarrierName']);

## Override the Types of Specific Columns

The `types` flag can be used to override types of only certain columns by providing a struct of `name -> type` mappings.
The `types` flag can be used to override the types of specific columns by providing a struct of `name` → `type` mappings.

```sql
SELECT * FROM read_csv('flights.csv', types = {'FlightDate': 'DATE'});
12 changes: 9 additions & 3 deletions docs/guides/python/execute_sql.md
@@ -15,16 +15,22 @@ By default this will create a relation object. The result can be converted to va
```python
results = duckdb.sql("SELECT 42").fetchall()
print(results)
# [(42,)]
```

```text
[(42,)]
```

Several other result objects exist. For example, you can use `df` to convert the result to a Pandas DataFrame.

```python
results = duckdb.sql("SELECT 42").df()
print(results)
# 42
# 0 42
```

```text
42
0 42
```

By default, a global in-memory connection will be used. Any data stored in it will be lost after shutting down the program. A connection to a persistent database can be created using the `connect` function.
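
A minimal sketch, with a hypothetical file name:

```python
import duckdb

# Open (or create) a persistent database file; data survives restarts.
con = duckdb.connect("my_database.db")  # hypothetical file name
con.sql("CREATE OR REPLACE TABLE test AS SELECT 42 AS i")
print(con.sql("SELECT * FROM test").fetchall())  # [(42,)]
con.close()
```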
10 changes: 6 additions & 4 deletions docs/internals/overview.md
@@ -38,15 +38,17 @@ The SQLStatement represents a complete SQL statement. The type of the SQL Statem
## Binder

The binder converts all nodes into their **bound** equivalents. In the binder phase:

* The tables and columns are resolved using the catalog
* Types are resolved
* Aggregate/window functions are extracted

The following conversions happen:
* SQLStatement -> [`BoundStatement`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_statement.hpp)
* QueryNode -> [`BoundQueryNode`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_query_node.hpp)
* TableRef -> [`BoundTableRef`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_tableref.hpp)
* ParsedExpression -> [`Expression`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/expression.hpp)

* SQLStatement → [`BoundStatement`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_statement.hpp)
* QueryNode → [`BoundQueryNode`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_query_node.hpp)
* TableRef → [`BoundTableRef`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/bound_tableref.hpp)
* ParsedExpression → [`Expression`](https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/expression.hpp)

## Logical Planner

4 changes: 2 additions & 2 deletions docs/sql/data_types/union.md
@@ -86,8 +86,8 @@ The only exception to this is when casting a `UNION` to `VARCHAR`, in which case

A type can always be implicitly cast to a `UNION` if it can be implicitly cast to one of the `UNION` member types.

* If there are multiple candidates, the built in implicit casting priority rules determine the target type. For example, a `FLOAT -> UNION(i INTEGER, v VARCHAR)` cast will always cast the `FLOAT` to the `INTEGER` member before `VARCHAR`.
* If the cast still is ambiguous, i.e., there are multiple candidates with the same implicit casting priority, an error is raised. This usually happens when the `UNION` contains multiple members of the same type, e.g., a `FLOAT -> UNION(i INTEGER, num INTEGER)` is always ambiguous.
* If there are multiple candidates, the built-in implicit casting priority rules determine the target type. For example, a `FLOAT` → `UNION(i INTEGER, v VARCHAR)` cast will always cast the `FLOAT` to the `INTEGER` member before `VARCHAR`.
* If the cast is still ambiguous, i.e., there are multiple candidates with the same implicit casting priority, an error is raised. This usually happens when the `UNION` contains multiple members of the same type, e.g., a `FLOAT` → `UNION(i INTEGER, num INTEGER)` is always ambiguous.

So how do we disambiguate if we want to create a `UNION` with multiple members of the same type? By using the `union_value` function, which takes a keyword argument specifying the tag. For example, `union_value(num := 2::INTEGER)` will create a `UNION` with a single member of type `INTEGER` with the tag `num`. This can then be used to disambiguate in an explicit (or implicit, read on below!) `UNION` to `UNION` cast, like `CAST(union_value(b := 2) AS UNION(a INTEGER, b INTEGER))`.
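
Run through the Python client, that disambiguating cast looks roughly as follows; a sketch (the exact Python rendering of a `UNION` value may differ):

```python
import duckdb

# Tag the value explicitly with union_value, then cast to the target UNION.
res = duckdb.sql(
    "SELECT CAST(union_value(b := 2) AS UNION(a INTEGER, b INTEGER)) AS u"
).fetchall()
print(res)  # e.g., [(2,)]
```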

