How to create a Track

Guide to creating a dead simple Track in Satisfaction

Here's how to write a very simple track with Satisfaction. This guide assumes that you are using Hive and want to implement an ETL workflow from an HDFS directory or Hive table into a Hive table.

Create a Scala Project

The first step is to create a project for the code to live in. You'll probably want to create a git repo as well, and you'll need to have Scala and SBT installed on your machine. The naming convention we use is to prefix track project names with satisfy-, but you can of course name your project whatever you want.

mkdir satisfy-mytrack
cd satisfy-mytrack
git init

Create a build.sbt file

Next, create the SBT build file build.sbt. You'll need to define satisfaction as a dependency and import the plugin. You'll also need to add the sbt-satisfy plugin in your project/plugins.sbt file.

This would be in your project/plugins.sbt file.

  addSbtPlugin("com.tagged.satisfaction" % "sbt-satisfy" % "0.15")

This would be your build.sbt file.

 import sbtSatisfy._
 import SatisfyKeys._

 name := "satisfy-mytrack"

 trackName := "MyTrack"

 version := "0.0.1"

 organization := "org.mytoplevel.myorg"

 scalaVersion := "2.10.2"

 libraryDependencies ++= Seq(
    "com.tagged.satisfaction" %% "satisfaction-core" % "2.5.11",
    "com.tagged.satisfaction" %% "satisfaction-hadoop" % "2.5.11",
    "com.tagged.satisfaction" %% "satisfaction-hive-ms" % "2.5.11",
    "com.tagged.satisfaction" %% "satisfaction-hive" % "2.5.11",
    "com.klout" % "brickhouse" % "0.7.8-jdb-SNAPSHOT"
 )

For Hive tracks, you'll want to depend on the core, hadoop, hive-ms, and hive modules (at whatever the current release is). You'll probably also want to include the magnificent Brickhouse project, for its helpful Hive UDFs.

Define your Track

Next, you'll need to write the Scala class that defines your Track. If your workflow consists mostly of Hive scripts, you'll want to define it as a Hive track.

In the directory src/main/scala/myorg/satisfy/mytrack, create a file MyTrack.scala, which would look similar to this:

package myorg
package satisfy
package mytrack


import satisfaction._
import satisfaction.hadoop.hive._
import satisfaction.hadoop.hive.ms._
import satisfaction.notifier.EmailNotified
import satisfaction.retry.Retryable

class MyTrack extends HiveTrack(TrackDescriptor("MyTrack")) with Hourly with EmailNotified with Retryable {

     override def timeOffset = minuteOfHour(15)

     override def notifyOnFailure = true
     override def notifyOnSuccess = false
     override def retryNotifier = Some(notifier)

     override def recipients = Set("[email protected]")

}

Notice how the track mixes in the Hourly trait. This schedules the track to run every hour, at 15 minutes past the hour. Other frequencies can be defined, or the track can be scheduled according to a crontab-style specification.
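
For illustration only, a crontab-style variant might look like the sketch below. The Cron trait and cronString method shown here are assumptions made for this example, not confirmed parts of the Satisfaction API, so check the satisfaction-core source for the actual scheduling traits.

// Hypothetical sketch -- the Cron trait name and cronString method are
// assumed for illustration; only Hourly and its helpers appear in this guide.
class MyCronTrack extends HiveTrack(TrackDescriptor("MyCronTrack")) with Cron {

     // run at minute 15 of every hour, expressed as a crontab specification
     override def cronString = "15 * * * *"

}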

Notice also that it mixes in the EmailNotified trait. This allows you to be notified by email if a job fails for whatever reason. (These things do happen occasionally, and you might want to find out about it 🎱.) The Retryable trait also causes the track to be restarted automatically on failure, so that the job can overcome transient issues like network problems or a temporary lack of disk space.

Define your init method

There's probably a more Scala-like way to do this (and this may change in future releases, if there is a more elegant solution), but right now the semantics of your workflow are defined by a method called init, which gets invoked right after the Track is created. In the init method, you define the Hive tables you are reading from and the ones you are creating. You also define the "goals" for the track, by referencing the Hive HQL scripts which actually create your tables.
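
As a very rough sketch, under the assumption that helpers along the lines of HiveTable, HiveGoal, and addTopLevelGoal exist (these names are illustrative and may differ from the actual satisfaction-hive API), an init method might look something like this:

// Hypothetical sketch -- HiveTable, HiveGoal, and addTopLevelGoal are assumed
// helper names for illustration; consult the satisfaction-hive module for the
// real factory methods and goal definitions.
override def init {

     // the source table we read from, and the destination table we create
     val rawEvents     = HiveTable("mydb", "raw_events")
     val hourlySummary = HiveTable("mydb", "hourly_summary")

     // a goal tied to the HQL script that actually populates the destination table
     val summaryGoal = HiveGoal(
          name = "HourlySummary",
          queryResource = "hourly_summary.hql",
          table = hourlySummary,
          depends = Set(rawEvents)
     )

     addTopLevelGoal(summaryGoal)
}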

For the sake of this example,