venkata91/SparkArrowFlight

SparkArrowFlight

How to Use

This example demonstrates a basic Apache Arrow Flight data service with Apache Spark and TensorFlow clients. The service uses a simple producer backed by the InMemoryStore from the Arrow Flight examples, which lets clients put Arrow streams into an in-memory store and get them back.

The Spark client maps over the partitions of an existing DataFrame and produces one Arrow stream per partition, which is put to the service under a string-based FlightDescriptor. A PyArrow client then reads each Arrow stream to produce a Pandas DataFrame. Optionally, if TensorFlow is installed, a TensorFlow client reads the Arrow streams one at a time into an ArrowStreamDataset so that records can be iterated over as Tensors.

Python Prerequisites

  • Python 3.8
  • PySpark 3.1.1
  • PyArrow 2.0.0
  • Optionally: TensorFlow, TensorFlow I/O

Usage

In one terminal, start the Arrow Flight service on port 8888:

$ bin/run_flight_server.sh

In another terminal, run the example, which launches a PySpark application to put data to the service and then creates PyArrow and TensorFlow clients to consume it:

$ bin/run_flight_example.sh
