This repository has been archived by the owner on Dec 31, 2020. It is now read-only.
MNIST
Eron Wright edited this page Dec 23, 2015
The MNIST reader is defined in the ai.cookie.spark.sql.sources.mnist package.
Class | Description |
---|---|
DefaultSource | The default Spark Data Source (parameters below). |
MnistDataFrameReader | A convenience class providing a read function based on the default data source. |
Parameter | Description |
---|---|
labelsPath | An absolute (or relative) URI to an MNIST labels data file (binary variant, e.g. t10k-labels-idx1-ubyte). |
imagesPath | An absolute (or relative) URI to an MNIST images data file (binary variant, e.g. t10k-images-idx3-ubyte). |
maxSplitSize | An integer value constraining the maximum size (in bytes) of a DataFrame partition (default 10MB). |
maxRecords | An integer value limiting the total number of records to produce. |
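The parameters above can also be passed through Spark's generic `DataFrameReader` API. The following is a minimal sketch, assuming a `spark-shell` session (Spark 1.x) started with the cookie-datasets package on the classpath, so that `sqlContext` is predefined. It also assumes the data source accepts its options as strings and needs no path argument to `load()`. The file paths are the test resources used elsewhere on this page, and the option values are illustrative.

```scala
// Assumes spark-shell (Spark 1.x), where `sqlContext` is predefined.
// Spark resolves the package name to the DefaultSource class inside it.
val df = sqlContext.read
  .format("ai.cookie.spark.sql.sources.mnist")
  .option("imagesPath", "src/test/resources/t10k-images-idx3-ubyte")
  .option("labelsPath", "src/test/resources/t10k-labels-idx1-ubyte")
  .option("maxRecords", "100")                         // illustrative cap on rows
  .option("maxSplitSize", (5 * 1024 * 1024).toString)  // illustrative: 5 MB per partition
  .load()

df.printSchema()
```

Passing every value as a string keeps the sketch compatible with older `DataFrameReader.option` signatures, which only accept string values.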
Column | Description |
---|---|
label | The digit represented by the image. |
features | A raw image of a digit. |
The feature data is encoded in a Vector for interoperability with Spark ML.
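Because `features` is a vector column, it can be handed directly to a Spark ML transformer. A minimal sketch, assuming `df` is a DataFrame loaded by this reader in a Spark 1.x `spark-shell` session. The choice of `Normalizer` and the output column name `normFeatures` are illustrative only and are not prescribed by this package.

```scala
import org.apache.spark.ml.feature.Normalizer

// Scale each image vector to unit L2 norm.
// `df` is assumed to come from the MNIST reader.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures") // hypothetical column name
  .setP(2.0)

val normalized = normalizer.transform(df)
normalized.select("label", "normFeatures").show(5)
```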
The `label` and `features` columns contain relevant metadata, based on the MNIST dataset definition.
Column | Description |
---|---|
label | None |
features | (experimental) A metadata value named 'shape' (an `Array[Long]`) giving the height x width of each image. |
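The column metadata can be inspected through the DataFrame schema. A minimal sketch, assuming `df` is a DataFrame loaded by this reader. Since the 'shape' entry is marked experimental, the code checks for its presence rather than assuming it exists.

```scala
// `df` is assumed to be a DataFrame loaded by the MNIST reader.
val meta = df.schema("features").metadata

// The 'shape' entry is experimental, so check for it before reading.
if (meta.contains("shape")) {
  val shape = meta.getLongArray("shape") // height x width, per the table above
  println(s"image shape: ${shape.mkString(" x ")}")
} else {
  println("no 'shape' metadata on the features column")
}
```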
Coming soon
    $ spark-shell --packages "ai.cookie:cookie-datasets_2.10:0.1.0"

    scala> import ai.cookie.spark.sql.sources.mnist._

    scala> val df = sqlContext.read.mnist("src/test/resources/t10k-images-idx3-ubyte", "src/test/resources/t10k-labels-idx1-ubyte")
    df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

    scala> df.show
    +-----+--------------------+
    |label|            features|
    +-----+--------------------+
    |  7.0|[0.0,0.0,0.0,0.0,...|
    |  2.0|[0.0,0.0,0.0,0.0,...|
    |  1.0|[0.0,0.0,0.0,0.0,...|
    |  0.0|[0.0,0.0,0.0,0.0,...|
    |  4.0|[0.0,0.0,0.0,0.0,...|
    |  1.0|[0.0,0.0,0.0,0.0,...|
    |  4.0|[0.0,0.0,0.0,0.0,...|
    |  9.0|[0.0,0.0,0.0,0.0,...|
    |  5.0|[0.0,0.0,0.0,0.0,...|
    |  9.0|[0.0,0.0,0.0,0.0,...|
    |  0.0|[0.0,0.0,0.0,0.0,...|
    |  6.0|[0.0,0.0,0.0,0.0,...|
    |  9.0|[0.0,0.0,0.0,0.0,...|
    |  0.0|[0.0,0.0,0.0,0.0,...|
    |  1.0|[0.0,0.0,0.0,0.0,...|
    |  5.0|[0.0,0.0,0.0,0.0,...|
    |  9.0|[0.0,0.0,0.0,0.0,...|
    |  7.0|[0.0,0.0,0.0,0.0,...|
    |  3.0|[0.0,0.0,0.0,0.0,...|
    |  4.0|[0.0,0.0,0.0,0.0,...|
    +-----+--------------------+
    only showing top 20 rows