scetta

Prototype of CDL mapping in Scala & Spark

Data

Original records for the California Digital Library are available here

Prerequisites

  • You need Java 8 of some variety installed.
  • brew install sbt

Building

sbt clean package

Running

sbt "run-main la.dp.scetta.StandaloneMain /Users/michael/Downloads/cdl /Users/michael/Downloads/cdl-output"

The first time you do this, if you haven't run any sort of build ahead of time, sbt will download a Scala install and a bunch of dependencies. That's not part of the normal operation of the code, so don't include it in any timings if it happens.

Please note that I did nothing to make this user-friendly. Both directories must already exist, and <in_directory> should be full of CDL JSON files. It does filter for files that end in .json, however.
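
For illustration, a minimal sketch (not code from this repo) of the kind of .json filtering described above; the helper name and directory handling are hypothetical:

import java.io.File

// Hypothetical helper: list the files in the input directory,
// keeping only names that end in .json, as the prototype does.
def jsonFiles(inDir: String): Seq[File] =
  Option(new File(inDir).listFiles)
    .map(_.toSeq)
    .getOrElse(Seq.empty)
    .filter(_.getName.endsWith(".json"))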

There are some warnings about exhaustiveness checks. Those are actually cool, because it's Scala's type system warning us that we haven't handled all the potential cases in the matches against the JSON path expressions. They're left in because they're illustrative of Scala's type checking.
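
As an illustration (not code from this repo), a non-exhaustive match on a sealed hierarchy is enough to trigger that kind of warning:

// scalac warns "match may not be exhaustive" here because JNull
// (and the empty-array case) is never handled.
sealed trait JValue
case class JString(s: String) extends JValue
case class JArray(items: List[JValue]) extends JValue
case object JNull extends JValue

def firstString(v: JValue): Option[String] = v match {
  case JString(s)     => Some(s)
  case JArray(x :: _) => firstString(x)
}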

The output you should see on the CLI is the total runtime, the number of files processed, and a list of any files with problems along with their stack traces.

Spark Version

The SequenceFile version of our data is in S3 here: s3://dpla-mdpdb/cdl.seq

To run this directly against S3:

sbt "run-main la.dp.scetta.SparkMain local[4] s3n://access_key_id:secret_key@dpla-mdpdb/cdl.seq <outfile.seq>"

Basically, the only step required to generate that file was to point the tar-to-seq program (available here) at the cdl_original_record.tar.gz file. The process took 46 seconds on my laptop.

A SequenceFile is a record-based, splittable file format from the Hadoop world. Each record in the file is a key/value pair: the key is the original filename, and the value contains that file's bytes. The whole file is split into blocks and compressed internally, which means that software like Hadoop and Spark can split it into chunks representing batches of documents and operate over them with separate workers, without ever having to unpack the file on a local filesystem. This also makes loading it into memory quicker, since the I/O can be pipelined.
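
For example, here's a minimal sketch of reading such a file with Spark. The Text/BytesWritable key and value classes are assumptions about how the file was written; the types SparkMain actually uses may differ.

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SeqFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seqfile-sketch").setMaster("local[4]"))
    // Each record is (original filename, file bytes), per the description above.
    val docs = sc.sequenceFile(args(0), classOf[Text], classOf[BytesWritable])
      .map { case (name, bytes) => (name.toString, new String(bytes.copyBytes(), "UTF-8")) }
    println(s"Loaded ${docs.count()} documents")
    sc.stop()
  }
}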

Spark can be run in a local mode that uses only the software embedded in the project, so no separate Spark install should be necessary. Just do:

sbt "run-main la.dp.scetta.SparkMain local[4] <infile.seq> <outfile.seq>"

The parameter local[4] identifies the Spark cluster you're connecting to. In this case, we're specifying one embedded in the launching process by saying local, and telling it to use 4 worker threads with [4]. In practice, we could run this on a cluster of multiple machines launched by Amazon EMR purely by specifying the master node in place of this parameter.
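
For reference, a small sketch of how a master string like this is typically consumed when building a Spark context; SparkMain's actual wiring may differ, and the app name here is made up.

import org.apache.spark.{SparkConf, SparkContext}

// "local[4]" means an embedded Spark master with 4 worker threads; on a real
// cluster you would pass the master URL instead, e.g. "spark://master-host:7077".
val conf = new SparkConf().setAppName("scetta-sketch").setMaster("local[4]")
val sc   = new SparkContext(conf)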

If you catch it while it's running, there's a Spark dashboard available at http://localhost:4040/jobs/.
