RDDGenerator
RDDGenerator provides an easy way to generate arbitrary RDDs, so that you can check any property against them.
If you don't know ScalaCheck, I suggest you read about it first to understand the concepts of properties and generators.
To generate RDDs, use the following method:
genRDD[T: ClassTag](sc: SparkContext, minPartitions: Int = 1)(getGenerator: => Gen[T]): Gen[RDD[T]]
This generates an RDD of the desired type. The method attempts to vary the number of partitions so as to catch problems with empty partitions, etc.
minPartitions defaults to 1, but when generating data too large for a single machine, choose a larger value.
getGenerator is used to create the generator. This function will be called to create the generator as many times as required.
Just create a generator for your required RDD type, or use one of the generators that are supported by default.
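The `getGenerator: => Gen[T]` parameter in the signature above is a Scala by-name parameter, which is what allows the generator expression to be evaluated more than once. The following is a minimal sketch of that language feature only; `FakeGen`, `ByNameSketch`, and `buildMany` are hypothetical stand-ins invented for illustration and are not part of spark-testing-base or ScalaCheck.

```scala
// Hypothetical stand-in for org.scalacheck.Gen; for illustration only.
final case class FakeGen[T](sample: () => T)

object ByNameSketch {
  // Counts how many times the by-name argument has been evaluated.
  var evaluations = 0

  // `getGenerator` is by-name (`=> FakeGen[T]`), so every reference to it
  // re-runs the caller's expression and builds a fresh generator.
  def buildMany[T](times: Int)(getGenerator: => FakeGen[T]): Seq[T] =
    (1 to times).map(_ => getGenerator.sample())

  def run(): Seq[Int] =
    buildMany(3) {
      evaluations += 1 // side effect makes the re-evaluation visible
      FakeGen(() => 42)
    }
}
```

Because the block passed to `buildMany` is re-evaluated on each reference, avoid putting expensive or stateful setup inside the expression you hand to `genRDD`.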
Example: (Use supported generator)
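import com.holdenkarau.spark.testing.{RDDGenerator, SharedSparkContext}
import org.scalacheck.Arbitrary
import org.scalacheck.Prop.forAll
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers

class RDDsCheck extends FunSuite with SharedSparkContext with Checkers {
  test("map should not change number of elements") {
    // Generate RDD[String] values using the default String generator.
    val property =
      forAll(RDDGenerator.genRDD[String](sc)(Arbitrary.arbitrary[String])) {
        rdd => rdd.map(_.length).count() == rdd.count()
      }
    check(property)
  }
}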
Example: (Custom Generator)
class RDDsCheck extends FunSuite with SharedSparkContext with Checkers {
  test("custom generator") {
    val property =
      forAll(RDDGenerator.genRDD[Person](sc) {
        // Build a Person from an arbitrary name and an arbitrary age.
        val generator: Gen[Person] = for {
          name <- Arbitrary.arbitrary[String]
          age <- Arbitrary.arbitrary[Int]
        } yield Person(name, age)
        generator
      }) {
        rdd => rdd.map(_.age).count() == rdd.count()
      }
    check(property)
  }
}

case class Person(name: String, age: Int)
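A custom generator can also restrict the values it produces. The sketch below is one possible variant, assuming ScalaCheck is on the classpath; the `PersonGenerators` object and the chosen bounds (ages 0 to 120, non-empty alphabetic names) are illustrative assumptions, not something spark-testing-base prescribes.

```scala
import org.scalacheck.Gen

case class Person(name: String, age: Int)

// Hypothetical holder object; the name is not part of any library.
object PersonGenerators {
  // Assumption: non-empty alphabetic names and ages in 0..120 are the
  // "valid" domain for the property under test; adjust as needed.
  val genPerson: Gen[Person] = for {
    name <- Gen.nonEmptyListOf(Gen.alphaChar).map(_.mkString)
    age  <- Gen.choose(0, 120)
  } yield Person(name, age)
}
```

Such a generator could then be passed as the second argument list, for example `RDDGenerator.genRDD[Person](sc)(PersonGenerators.genPerson)`.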
You can specify the size of the RDDs using an implicit PropertyCheckConfig.
Example:
class RDDsCheck extends FunSuite with SharedSparkContext with Checkers {
  test("generate rdd of specific size") {
    // Bound the ScalaCheck size parameter used while generating RDDs.
    implicit val generatorDrivenConfig =
      PropertyCheckConfig(minSize = 10, maxSize = 20)
    val prop = forAll(RDDGenerator.genRDD[String](sc)(Arbitrary.arbitrary[String])) {
      rdd => rdd.count() <= 20
    }
    check(prop)
  }
}
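For background on what a size bound means, the sketch below shows ScalaCheck's size parameter capping a plain collection generator, independent of Spark. It assumes ScalaCheck is on the classpath; `SizeSketch` and `boundedLists` are hypothetical names, and how `genRDD` consumes the size internally is not shown here.

```scala
import org.scalacheck.Gen

// Hypothetical holder object; for illustration only.
object SizeSketch {
  // Gen.listOf picks a length between 0 and the current size parameter,
  // and Gen.resize fixes that size parameter to 20 for this generator.
  val boundedLists: Gen[List[String]] =
    Gen.resize(20, Gen.listOf(Gen.alphaStr))
}
```

Sampling `SizeSketch.boundedLists` should therefore never yield a list longer than 20 elements, which mirrors the `rdd.count() <= 20` property in the example above.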