Shapefile Data Source for Apache Spark

A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 2.0 or later.

Using with Spark shell

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.8

Features

This package reads shapefiles from a local or distributed filesystem into Spark DataFrames. When reading files, the API accepts the following options:

  • path The location of the shapefile(s). As with other Spark data sources, standard Hadoop globbing expressions are accepted.
  • shape An optional name of the shape column. Default value is shape.
  • columns An optional comma-separated list of attribute column names. Default value is blank, which selects all attribute fields.
  • format An optional parameter to define the output format of the shape field. Default value is SHP. Possible values are:

SQL API

CREATE TABLE gps
USING com.esri.spark.shp
OPTIONS (path "data/gps.shp")
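
The statement above can also be assembled from Python when the table name or options vary. The following is a minimal sketch: create_table_sql is a hypothetical helper, not part of this library, and it assumes that the reader options listed under Features can be passed through the OPTIONS clause in the same way as path.

```python
def create_table_sql(table, path, **options):
    """Build a CREATE TABLE statement for the com.esri.spark.shp source.

    Hypothetical helper for illustration; not part of the library.
    Values are inserted as-is, so they must not contain double quotes.
    """
    # Put path first, followed by any extra options in the order given.
    opts = {"path": path, **options}
    clause = ", ".join(f'{key} "{value}"' for key, value in opts.items())
    return f"CREATE TABLE {table}\nUSING com.esri.spark.shp\nOPTIONS ({clause})"


print(create_table_sql("gps", "data/gps.shp"))
```

With only a table name and a path, this prints the same three-line statement shown above. The resulting string could then be passed to spark.sql(...), assuming an active SparkSession named spark.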

Python API

df = spark.read \
    .format("com.esri.spark.shp") \
    .options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
    .load() \
    .cache()
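
When only some options are needed, it can help to collect them in a dictionary and leave out the unset ones so that the defaults described under Features apply. The following is a minimal sketch: shp_options is a hypothetical helper, not part of this library, and the option names are the ones listed under Features.

```python
def shp_options(path, shape=None, columns=None, format=None):
    """Collect reader options, omitting unset ones so the defaults apply.

    Hypothetical helper for illustration; not part of the library.
    `columns` is a list of attribute column names.
    """
    options = {"path": path}
    if shape is not None:
        options["shape"] = shape
    if columns:
        # The reader expects a single comma-separated string.
        options["columns"] = ",".join(columns)
    if format is not None:
        options["format"] = format
    return options


opts = shp_options("data/gps.shp", columns=["atext", "adate"], format="GEOJSON")
print(opts)
# {'path': 'data/gps.shp', 'columns': 'atext,adate', 'format': 'GEOJSON'}
```

The dictionary can then be unpacked into the reader, for example spark.read.format("com.esri.spark.shp").options(**opts).load(), assuming an active SparkSession named spark.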

Building From Source

This library is built using Apache Maven. To build the jar, execute the following command:

mvn clean install

Data

Create Conda Env

export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge\
    jupyterlab\
    tqdm\
    future\
    matplotlib=3.1\
    gdal=2.4\
    pyproj=2.2\
    shapely=1.6\
    pyshp=2.1