This repository has been archived by the owner on Dec 31, 2020. It is now read-only.
Eron Wright edited this page Dec 23, 2015 · 6 revisions

MNIST

Using the Data Reader

Package

The MNIST reader is defined in the ai.cookie.spark.sql.sources.mnist package.

Classes

| Class | Description |
|---|---|
| DefaultSource | The default Spark Data Source (parameters below). |
| MnistDataFrameReader | A convenience class providing a read function based on the default data source. |

Data Source Parameters

| Parameter | Description |
|---|---|
| labelsPath | An absolute or relative URI to an MNIST labels data file (binary variant, e.g. t10k-labels-idx1-ubyte). |
| imagesPath | An absolute or relative URI to an MNIST images data file (binary variant, e.g. t10k-images-idx3-ubyte). |
| maxSplitSize | An integer value constraining the maximum size (in bytes) of a DataFrame partition (default 10 MB). |
| maxRecords | An integer value limiting the total number of records to produce. |
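The parameters above can also be passed through Spark's generic data source reader instead of the `mnist` convenience function. The following is a minimal sketch, not part of the library: `MnistOptions` is a hypothetical helper that collects the documented parameter names into the string map Spark readers accept. It assumes the data source is addressed by its package name and parses integer options from their string form.

```scala
// Hypothetical helper (not provided by the library): gathers the documented
// data source parameters into a Map[String, String] of reader options.
object MnistOptions {
  def build(
      imagesPath: String,
      labelsPath: String,
      maxSplitSize: Option[Int] = None, // maximum partition size, in bytes
      maxRecords: Option[Int] = None    // cap on the total number of records
  ): Map[String, String] = {
    val required = Map("imagesPath" -> imagesPath, "labelsPath" -> labelsPath)
    // Only include optional parameters that were explicitly set.
    val optional = Seq(
      maxSplitSize.map(v => "maxSplitSize" -> v.toString),
      maxRecords.map(v => "maxRecords" -> v.toString)
    ).flatten
    required ++ optional
  }
}

// In spark-shell (assumes the Spark 1.4+ DataFrameReader API):
// val opts = MnistOptions.build(
//   imagesPath = "src/test/resources/t10k-images-idx3-ubyte",
//   labelsPath = "src/test/resources/t10k-labels-idx1-ubyte",
//   maxRecords = Some(100))
// val df = sqlContext.read
//   .format("ai.cookie.spark.sql.sources.mnist")
//   .options(opts)
//   .load()
```

Leaving `maxSplitSize` and `maxRecords` unset keeps them out of the map, so the data source falls back to its own defaults.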

Schema

| Column | Description |
|---|---|
| label | The digit represented by the image. |
| features | A raw image of a digit. |

Feature Data

The feature data is encoded in a Vector for interoperability with Spark ML.

Metadata

The label and features columns contain relevant metadata, based on the MNIST dataset definition.

| Column | Description |
|---|---|
| label | None |
| features | (experimental) A metadata value named 'shape' (as Array[Long]) indicating the height x width of each image. |
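As an illustration of how the 'shape' metadata might be used, the sketch below reshapes a flat feature array into image rows. `ImageShape.toRows` is a hypothetical helper, not part of the library, and it assumes the pixels are stored in row-major order with `shape` given as Array(height, width). The commented lines show how the inputs could be obtained in spark-shell, assuming a Spark 1.x `Vector` type.

```scala
// Hypothetical helper (not provided by the library).
object ImageShape {
  // Splits a flat pixel array into `height` rows of `width` pixels each.
  // Assumes row-major pixel order and shape = Array(height, width).
  def toRows(pixels: Array[Double], shape: Array[Long]): Array[Array[Double]] = {
    require(shape.length == 2, "expected shape = Array(height, width)")
    val height = shape(0).toInt
    val width = shape(1).toInt
    require(height > 0 && width > 0, "shape dimensions must be positive")
    require(pixels.length == height * width, "pixel count does not match shape")
    pixels.grouped(width).toArray
  }
}

// In spark-shell, with df loaded from the MNIST data source:
// val shape  = df.schema("features").metadata.getLongArray("shape")
// val pixels = df.first().getAs[org.apache.spark.mllib.linalg.Vector]("features").toArray
// val image  = ImageShape.toRows(pixels, shape)
```

Because the metadata is marked experimental, code that depends on it should probably check that the 'shape' key is present before reading it.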

Walkthrough

Coming soon

$ spark-shell --packages "ai.cookie:cookie-datasets_2.10:0.1.0"
scala> import ai.cookie.spark.sql.sources.mnist._
scala> val df = sqlContext.read.mnist("src/test/resources/t10k-images-idx3-ubyte", "src/test/resources/t10k-labels-idx1-ubyte")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.show
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  7.0|[0.0,0.0,0.0,0.0,...|
|  2.0|[0.0,0.0,0.0,0.0,...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[0.0,0.0,0.0,0.0,...|
|  4.0|[0.0,0.0,0.0,0.0,...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  4.0|[0.0,0.0,0.0,0.0,...|
|  9.0|[0.0,0.0,0.0,0.0,...|
|  5.0|[0.0,0.0,0.0,0.0,...|
|  9.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[0.0,0.0,0.0,0.0,...|
|  6.0|[0.0,0.0,0.0,0.0,...|
|  9.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[0.0,0.0,0.0,0.0,...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  5.0|[0.0,0.0,0.0,0.0,...|
|  9.0|[0.0,0.0,0.0,0.0,...|
|  7.0|[0.0,0.0,0.0,0.0,...|
|  3.0|[0.0,0.0,0.0,0.0,...|
|  4.0|[0.0,0.0,0.0,0.0,...|
+-----+--------------------+
only showing top 20 rows
