Why does querying this Parquet file report "Scanning of nested columns in Parquet files is disabled"? #102

Closed
l1t1 opened this issue Sep 7, 2023 · 1 comment


l1t1 commented Sep 7, 2023

Data: https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet (7,134,977,202 bytes)
I took the Python script from #78 and modified it to print the elapsed time of each query:

#!/usr/bin/env python3
import readline  # imported for its side effect: line editing and history for input()
import time
from argparse import ArgumentParser
from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode, HyperException
# hyperapi-cli: an interactive HyperAPI SQL CLI.
# This script lets you execute SQL commands interactively via HyperAPI.
#
# Usage:
#   ./hyperapi-cli.py [optional hyper database file]

def main():
    parser = ArgumentParser("HyperAPI interactive cli.")
    parser.add_argument("database", type=str, nargs='?',
                        help="A Hyper file to attach on startup")

    args = parser.parse_args()
    create_mode = CreateMode.CREATE_IF_NOT_EXISTS if args.database else CreateMode.NONE

    with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper_process:
        try:
            with Connection(hyper_process.endpoint, args.database, create_mode) as connection:
                while True:
                    try:
                        sql = input("> ")
                    except (EOFError, KeyboardInterrupt):
                        return
                    try:
                        # Time the whole round trip: execution, fetching, and printing.
                        start = time.perf_counter()
                        with connection.execute_query(sql) as result:
                            print("\t".join(str(column.name)
                                  for column in result.schema.columns))
                            for row in result:
                                print("\t".join(str(column) for column in row))
                        print(round(time.perf_counter() - start, 3), "s\n")
                    except HyperException as exception:
                        print(f"Error executing SQL: {exception}")
        except HyperException as exception:
            print(f"Unable to connect to the database: {exception}")


if __name__ == "__main__":
    main()

Query results:

> select count(*) from external('./hacknernews.parquet');
"count"
28737557
0.779 s

> select * from external('./hacknernews.parquet') limit 1;
Error executing SQL: Scanning of nested columns in Parquet files is disabled.
Hint: Do not select group column kids when scanning the file
Context: 0xfa6b0e2f

DuckDB can run select * on the same file:

D describe select * from 'd:/hacknernews.parquet';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ BIGINT      │ YES     │         │         │         │
│ deleted     │ UTINYINT    │ YES     │         │         │         │
│ type        │ BLOB        │ YES     │         │         │         │
│ by          │ BLOB        │ YES     │         │         │         │
│ time        │ BIGINT      │ YES     │         │         │         │
│ text        │ BLOB        │ YES     │         │         │         │
│ dead        │ UTINYINT    │ YES     │         │         │         │
│ parent      │ BIGINT      │ YES     │         │         │         │
│ poll        │ BIGINT      │ YES     │         │         │         │
│ kids        │ BIGINT[]    │ YES     │         │         │         │
│ url         │ BLOB        │ YES     │         │         │         │
│ score       │ INTEGER     │ YES     │         │         │         │
│ title       │ BLOB        │ YES     │         │         │         │
│ parts       │ BIGINT[]    │ YES     │         │         │         │
│ descendants │ INTEGER     │ YES     │         │         │         │
├─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 15 rows                                                 6 columns │
└───────────────────────────────────────────────────────────────────┘
l1t1 changed the title from "why count is similar, sum is much slower vs duckdb query same big parquet file?" to "why query this parquet file reports Scanning of nested columns in Parquet files is disabled?" on Sep 7, 2023

l1t1 commented Sep 7, 2023

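I found the answer in the Hyper documentation at https://tableau.github.io/hyper-db/docs/sql/external/formats#external-format-parquet: "Nested columns and therefore the nested types MAP and LIST are not supported." The kids column named in the error hint is such a list column (DuckDB reports it, and parts, as BIGINT[]), so select * fails on this file.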
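
Following the error hint ("Do not select group column kids when scanning the file"), one possible workaround is to list the flat columns explicitly instead of using select *. The sketch below only builds such a query string; it is not verified against Hyper. The column names are copied from the DuckDB describe output above, the helper names (build_flat_select, quote_identifier, quote_literal) are hypothetical, and treating both kids and parts as nested is an assumption based on their BIGINT[] type in DuckDB.

```python
# Column names copied from the DuckDB describe output in this issue.
ALL_COLUMNS = [
    "id", "deleted", "type", "by", "time", "text", "dead", "parent",
    "poll", "kids", "url", "score", "title", "parts", "descendants",
]
# Assumption: the two BIGINT[] columns are the nested (LIST) ones to skip.
NESTED_COLUMNS = {"kids", "parts"}


def quote_identifier(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote an SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_flat_select(path, columns=ALL_COLUMNS, nested=NESTED_COLUMNS, limit=None):
    """Build a select over external(path) that names only the non-nested columns."""
    flat = [c for c in columns if c not in nested]
    column_list = ", ".join(quote_identifier(c) for c in flat)
    sql = f"select {column_list} from external({quote_literal(path)})"
    if limit is not None:
        sql += f" limit {int(limit)}"
    return sql


if __name__ == "__main__":
    # Paste the printed statement at the "> " prompt of the script above.
    print(build_flat_select("./hacknernews.parquet", limit=1))
```

Every identifier is quoted because names such as by, time, and type may collide with SQL keywords. If the query is built inside a HyperAPI program rather than pasted into the prompt, the library's own escaping helpers are likely the better choice over these hand-written ones.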

l1t1 closed this as completed on Sep 7, 2023