diff --git a/docs/source/index.rst b/docs/source/index.rst
index 3ecdab37..34eb23b2 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -71,9 +71,10 @@ Example
 
    user-guide/introduction
    user-guide/basics
-   user-guide/configuration
+   user-guide/data-sources
    user-guide/common-operations/index
    user-guide/io/index
+   user-guide/configuration
    user-guide/sql
 
 
diff --git a/docs/source/user-guide/common-operations/index.rst b/docs/source/user-guide/common-operations/index.rst
index b15b04c6..d7c708c2 100644
--- a/docs/source/user-guide/common-operations/index.rst
+++ b/docs/source/user-guide/common-operations/index.rst
@@ -18,6 +18,8 @@
 Common Operations
 =================
 
+The contents of this section are designed to guide a new user through how to use DataFusion.
+
 .. toctree::
    :maxdepth: 2
 
diff --git a/docs/source/user-guide/data-sources.rst b/docs/source/user-guide/data-sources.rst
new file mode 100644
index 00000000..b7f10b41
--- /dev/null
+++ b/docs/source/user-guide/data-sources.rst
@@ -0,0 +1,185 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _user_guide_data_sources:
+
+Data Sources
+============
+
+DataFusion provides a wide variety of ways to get data into a DataFrame to perform operations.
+
+Local file
+----------
+
+DataFusion has the ability to read from a variety of popular file formats, such as :ref:`Parquet <io_parquet>`,
+:ref:`CSV <io_csv>`, :ref:`JSON <io_json>`, and :ref:`Avro <io_avro>`.
+
+.. ipython:: python
+
+    from datafusion import SessionContext
+    ctx = SessionContext()
+    df = ctx.read_csv("pokemon.csv")
+    df.show()
+
+Create in-memory
+----------------
+
+Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object.
+To do this in DataFusion, you can use one of the three functions
+:py:func:`~datafusion.context.SessionContext.from_pydict`,
+:py:func:`~datafusion.context.SessionContext.from_pylist`, or
+:py:func:`~datafusion.context.SessionContext.create_dataframe`.
+
+As their names suggest, ``from_pydict`` and ``from_pylist`` will create DataFrames from Python
+dictionary and list objects, respectively. ``create_dataframe`` assumes you will pass in a list
+of lists of `PyArrow Record Batches `_.
+
+The following three examples will all create identical DataFrames:
+
+.. ipython:: python
+
+    import pyarrow as pa
+
+    ctx.from_pylist([
+        { "a": 1, "b": 10.0, "c": "alpha" },
+        { "a": 2, "b": 20.0, "c": "beta" },
+        { "a": 3, "b": 30.0, "c": "gamma" },
+    ]).show()
+
+    ctx.from_pydict({
+        "a": [1, 2, 3],
+        "b": [10.0, 20.0, 30.0],
+        "c": ["alpha", "beta", "gamma"],
+    }).show()
+
+    batch = pa.RecordBatch.from_arrays(
+        [
+            pa.array([1, 2, 3]),
+            pa.array([10.0, 20.0, 30.0]),
+            pa.array(["alpha", "beta", "gamma"]),
+        ],
+        names=["a", "b", "c"],
+    )
+
+    ctx.create_dataframe([[batch]]).show()
+
+
+Object Store
+------------
+
+DataFusion has support for multiple storage options in addition to local files.
+The example below requires an appropriate S3 account with access credentials.
+
+Supported Object Stores are
+
+- :py:class:`~datafusion.object_store.AmazonS3`
+- :py:class:`~datafusion.object_store.GoogleCloud`
+- :py:class:`~datafusion.object_store.Http`
+- :py:class:`~datafusion.object_store.LocalFileSystem`
+- :py:class:`~datafusion.object_store.MicrosoftAzure`
+
+.. code-block:: python
+
+    import os  # needed for os.getenv below
+
+    from datafusion.object_store import AmazonS3
+
+    region = "us-east-1"
+    bucket_name = "yellow-trips"
+
+    s3 = AmazonS3(
+        bucket_name=bucket_name,
+        region=region,
+        access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
+        secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
+    )
+
+    path = f"s3://{bucket_name}/"
+    ctx.register_object_store("s3://", s3, None)
+    ctx.register_parquet("trips", path)
+    ctx.table("trips").show()
+
+Other DataFrame Libraries
+-------------------------
+
+DataFusion can import DataFrames directly from other libraries, such as
+`Polars `_ and `Pandas `_.
+Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule
+interface can be imported to DataFusion using the
+:py:func:`~datafusion.context.SessionContext.from_arrow` function. Older versions of Polars may
+not support the arrow interface. In those cases, you can still import via the
+:py:func:`~datafusion.context.SessionContext.from_polars` function.
+
+.. ipython:: python
+
+    import pandas as pd
+
+    data = { "a": [1, 2, 3], "b": [10.0, 20.0, 30.0], "c": ["alpha", "beta", "gamma"] }
+    pandas_df = pd.DataFrame(data)
+
+    datafusion_df = ctx.from_arrow(pandas_df)
+    datafusion_df.show()
+
+    import polars as pl
+    polars_df = pl.DataFrame(data)
+
+    datafusion_df = ctx.from_arrow(polars_df)
+    datafusion_df.show()
+
+Delta Lake
+----------
+
+DataFusion 43.0.0 and later support the ability to register table providers from sources such
+as Delta Lake. This will require a recent version of
+`deltalake `_ to provide the required interfaces.
+
+.. code-block:: python
+
+    from deltalake import DeltaTable
+
+    delta_table = DeltaTable("path_to_table")
+    ctx.register_table_provider("my_delta_table", delta_table)
+    df = ctx.table("my_delta_table")
+    df.show()
+
+On older versions of ``deltalake`` (prior to 0.22) you can use the
+`Arrow DataSet `_
+interface to import into DataFusion, but this does not support features such as filter pushdown,
+which can lead to a significant performance difference.
+
+.. code-block:: python
+
+    from deltalake import DeltaTable
+
+    delta_table = DeltaTable("path_to_table")
+    ctx.register_dataset("my_delta_table", delta_table.to_pyarrow_dataset())
+    df = ctx.table("my_delta_table")
+    df.show()
+
+Iceberg
+-------
+
+Coming soon!
+
+Custom Table Provider
+---------------------
+
+You can implement a custom Data Provider in Rust and expose it to DataFusion through the
+interface as described in the :ref:`Custom Table Provider <io_custom_table_provider>`
+section. This is an advanced topic, but a
+`user example `_
+is provided in the DataFusion repository.
diff --git a/docs/source/user-guide/io/avro.rst b/docs/source/user-guide/io/avro.rst
index 5f1ff728..66398ac7 100644
--- a/docs/source/user-guide/io/avro.rst
+++ b/docs/source/user-guide/io/avro.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _io_avro:
+
 Avro
 ====
 
diff --git a/docs/source/user-guide/io/csv.rst b/docs/source/user-guide/io/csv.rst
index d2a62bfe..144b6615 100644
--- a/docs/source/user-guide/io/csv.rst
+++ b/docs/source/user-guide/io/csv.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _io_csv:
+
 CSV
 ===
 
diff --git a/docs/source/user-guide/io/json.rst b/docs/source/user-guide/io/json.rst
index f9da3755..39030db7 100644
--- a/docs/source/user-guide/io/json.rst
+++ b/docs/source/user-guide/io/json.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _io_json:
+
 JSON
 ====
 `JSON `_ (JavaScript Object Notation) is a lightweight data-interchange format.
diff --git a/docs/source/user-guide/io/parquet.rst b/docs/source/user-guide/io/parquet.rst
index 75bc981c..c5b9ca3d 100644
--- a/docs/source/user-guide/io/parquet.rst
+++ b/docs/source/user-guide/io/parquet.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _io_parquet:
+
 Parquet
 =======
 
@@ -22,7 +24,6 @@ It is quite simple to read a parquet file using the :py:func:`~datafusion.contex
 
 .. code-block:: python
 
-
     from datafusion import SessionContext
 
     ctx = SessionContext()
diff --git a/docs/source/user-guide/io/table_provider.rst b/docs/source/user-guide/io/table_provider.rst
index 2ff9ae46..bd1d6b80 100644
--- a/docs/source/user-guide/io/table_provider.rst
+++ b/docs/source/user-guide/io/table_provider.rst
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _io_custom_table_provider:
+
 Custom Table Provider
 =====================
 
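
Note on the in-memory examples added in `data-sources.rst`: `from_pylist` takes row-oriented data (a list of dictionaries) while `from_pydict` takes column-oriented data (a dictionary of lists), and the new page states that both yield identical DataFrames. The sketch below illustrates why the two inputs carry the same information. It uses plain Python only and makes no DataFusion calls; the helper name `rows_to_columns` is hypothetical, introduced purely for illustration, and it assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Hypothetical helper, not part of DataFusion. Assumes all rows
    share the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Append each value to the list for its column name.
            columns.setdefault(name, []).append(value)
    return columns


# Row-oriented input, in the shape passed to from_pylist in the new page.
rows = [
    {"a": 1, "b": 10.0, "c": "alpha"},
    {"a": 2, "b": 20.0, "c": "beta"},
    {"a": 3, "b": 30.0, "c": "gamma"},
]

# Column-oriented input, in the shape passed to from_pydict in the new page.
columns = {
    "a": [1, 2, 3],
    "b": [10.0, 20.0, 30.0],
    "c": ["alpha", "beta", "gamma"],
}

# The pivoted rows match the column layout exactly.
print(rows_to_columns(rows) == columns)  # prints True
```

Which form to use is mostly a matter of how the data already exists in the calling code: records collected one at a time fit the list form, while precomputed columns fit the dictionary form.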