How to create a Track
Here's how to write a very simple track with Satisfaction. This guide assumes that you are using Hive, and that you wish to implement an ETL workflow that reads from an HDFS directory or Hive table and writes into a Hive table.
The first step is to create a project for the code to live in. You'll probably want to create a git repo as well, and you'll need to have scala and sbt installed on your machine. The naming convention that we use is to prefix the names of track projects with satisfy-, but you can of course name your project whatever you want.
mkdir satisfy-mytrack
cd satisfy-mytrack
git init
Next, create the Scala SBT build file build.sbt. You'll need to define satisfaction as a dependency and import the plugin. You'll also need to add the satisfy SBT plugin in your plugins.sbt file.
This would be in your project/plugins.sbt file.
addSbtPlugin("com.tagged.satisfaction" % "sbt-satisfy" % "0.15")
This would be your build.sbt file.
import sbtSatisfy._
import SatisfyKeys._
name := "satisfy-mytrack"
trackName := "MyTrack"
version := "0.0.1"
organization := "org.mytoplevel.myorg"
scalaVersion := "2.10.2"
libraryDependencies ++= Seq(
"com.tagged.satisfaction" %% "satisfaction-core" % "2.5.11" ,
"com.tagged.satisfaction" %% "satisfaction-hadoop" % "2.5.11",
"com.tagged.satisfaction" %% "satisfaction-hive-ms" % "2.5.11",
"com.tagged.satisfaction" %% "satisfaction-hive" % "2.5.11",
"com.klout" % "brickhouse" % "0.7.8-jdb-SNAPSHOT"
)
For Hive tracks, you'll want to depend on the core, hadoop, hive-ms, and hive modules (at whatever the current release is). You'll probably also want to include the magnificent Brickhouse project, for its helpful Hive UDFs.
Next, you'll need to define the Scala class for defining your Track. If your workflow consists mostly of Hive scripts, then you'll want to define it as a Hive track.
In the directory src/main/scala/myorg/satisfy/mytrack, create a file MyTrack.scala, which would look similar to this:
package myorg
package satisfy
package mytrack
import satisfaction._
import satisfaction.hadoop.hive._
import satisfaction.hadoop.hive.ms._
import satisfaction.notifier.EmailNotified
import satisfaction.retry.Retryable
class MyTrack extends HiveTrack(TrackDescriptor("MyTrack")) with Hourly with EmailNotified with Retryable {
  override def timeOffset = minuteOfHour(15)
  override def notifyOnFailure = true
  override def notifyOnSuccess = false
  override def retryNotifier = Some(notifier)
  override def recipients = Set("[email protected]")
}
Notice how the track extends the Hourly trait. This defines the track to run every hour, at 15 minutes past the hour. Other frequencies can be defined, or the track could be scheduled according to a crontab specification.
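The Hourly trait handles this scheduling internally; as a rough, self-contained sketch of the idea (this is a hand-rolled model of what a `minuteOfHour(15)` offset means, not the library's actual code), computing the next run time might look like:

```scala
import java.time.LocalDateTime

// Conceptual model of an hourly schedule with a fixed minute offset,
// similar in spirit to `Hourly` with `timeOffset = minuteOfHour(15)`.
// This is an illustration only, not the satisfaction library's API.
object HourlySchedule {
  // Next time at `minuteOffset` minutes past the hour, strictly after `now`.
  def nextRun(now: LocalDateTime, minuteOffset: Int): LocalDateTime = {
    val candidate = now.withMinute(minuteOffset).withSecond(0).withNano(0)
    if (candidate.isAfter(now)) candidate else candidate.plusHours(1)
  }

  def main(args: Array[String]): Unit = {
    val now = LocalDateTime.of(2014, 6, 1, 10, 30)
    println(nextRun(now, 15)) // prints 2014-06-01T11:15
  }
}
```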
Notice also that it has the EmailNotified trait. This allows you to be notified by email if a job fails for whatever reason. (These things do happen occasionally, and you might want to find out about it 🎱.) On job failures, we also define the track to be restarted automatically, via the Retryable trait, so that the job can overcome transitory issues like network problems or a temporary lack of disk space.
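Conceptually, the retry-then-notify behavior works like the sketch below. This is a hand-rolled model of the semantics (the names `runWithRetries` and `notifyFailure` are illustrative assumptions, not the library's API):

```scala
// Conceptual sketch of retry-with-notification semantics, similar in
// spirit to the `Retryable` and `EmailNotified` traits.
object RetryModel {
  // Run `job` up to `maxRetries + 1` times; invoke `notifyFailure`
  // only if every attempt fails.
  def runWithRetries(maxRetries: Int)(job: () => Boolean)(notifyFailure: () => Unit): Boolean = {
    val succeeded = (0 to maxRetries).exists(_ => job())
    if (!succeeded) notifyFailure()
    succeeded
  }

  def main(args: Array[String]): Unit = {
    var attempts = 0
    val ok = runWithRetries(maxRetries = 2) { () =>
      attempts += 1
      attempts == 3 // succeeds on the third attempt
    } { () => println("sending failure email") }
    println(s"succeeded after $attempts attempts: $ok")
  }
}
```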
There's probably a more Scala-like way to do this (and this may change in future releases, if there is a more elegant solution), but right now the semantics of your workflow are defined by a method called init, which gets invoked right after the Track is created. In the init method, you define the Hive tables that you are reading from and the ones you are creating. You also define the "goals" for the track, by referencing the Hive HQL scripts which actually create your tables.
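As a conceptual model of what init wires together (every name here, such as HiveTableRef and Goal, is a hypothetical stand-in for illustration, not the satisfaction library's actual API), the shape of the idea is: declare input and output tables, then declare a goal that names the HQL script producing the output.

```scala
// Hand-rolled model of a Track's `init` wiring: input tables, output
// tables, and a goal referencing the HQL script that builds the output.
// These types are illustrative assumptions, not the library's API.
object TrackModel {
  case class HiveTableRef(db: String, name: String)
  case class Goal(name: String, hql: String, produces: HiveTableRef, dependsOn: Seq[HiveTableRef])

  def init(): Goal = {
    val source = HiveTableRef("mydb", "raw_events")      // table we read from
    val output = HiveTableRef("mydb", "hourly_rollup")   // table we create
    Goal("HourlyRollup", "hourly_rollup.hql", output, Seq(source))
  }

  def main(args: Array[String]): Unit = {
    val goal = init()
    println(s"${goal.name}: ${goal.dependsOn.map(_.name).mkString(",")} -> ${goal.produces.name} via ${goal.hql}")
  }
}
```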
For the sake of this example,