cache CIM classes as pairRDD #2

derrickoswald · 2017-02-12T10:08:47Z

One of the most common operations on a CIM class RDD is to generate a pairRDD for join operations with:

XXX.keyBy (_.id)

It may be advantageous to formalize this use-case by storing pre-keyed pairRDDs in the persistent RDD cache pool instead of plain CIM object RDDs, since the id (CIM rdf:ID = mRID) is the unique identifier for each CIM object.

Unfortunately, this has pervasive downstream consequences. Each operation to "get" an RDD by name, which is used extensively in CIMScala and dependent code like CIMApplication, would need to be modified to take advantage of this - or to work around it if the keyBy (_.id) is not required.

For example:

val elements = get ("Elements").asInstanceOf[RDD[Element]].keyBy (_.id).join (...

becomes

val elements = get ("Elements").asInstanceOf[RDD[(String, Element)]].join (...

and

val terms = get ("Terminal").asInstanceOf[RDD[Terminal]].keyBy (_.ConductingEquipment).join (...

becomes

val terms = get ("Terminal").asInstanceOf[RDD[(String, Terminal)]].values.keyBy (_.ConductingEquipment).join (...
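To make the two cases concrete, here is a minimal sketch of what a pre-keyed cache could look like. It is not the CIMScala implementation: the Element and Terminal case classes are cut-down stand-ins, and putKeyed, getKeyed and run are hypothetical helper names introduced only for illustration. It assumes the named lookup can be done through SparkContext.getPersistentRDDs and that ids are Strings.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the CIM classes; the real ones carry many more fields.
case class Element (id: String)
case class Terminal (id: String, ConductingEquipment: String)

object KeyedCacheSketch
{
    // Hypothetical helper: key the RDD by id, name it, and mark it for caching.
    def putKeyed[T] (name: String, rdd: RDD[T]) (id: T => String): RDD[(String, T)] =
        rdd.keyBy (id).setName (name).cache ()

    // Hypothetical helper: look up the pre-keyed RDD by name in the persistent RDD pool.
    // The cast is unchecked, just like the asInstanceOf calls in the examples above.
    def getKeyed[T] (sc: SparkContext, name: String): Option[RDD[(String, T)]] =
        sc.getPersistentRDDs.values.find (_.name == name).map (_.asInstanceOf[RDD[(String, T)]])

    // Store two tiny RDDs pre-keyed, then join elements to their terminals.
    def run (sc: SparkContext): Array[(String, (Element, Terminal))] =
    {
        putKeyed ("Elements", sc.parallelize (Seq (Element ("a"), Element ("b")))) (_.id)
        putKeyed ("Terminal", sc.parallelize (Seq (
            Terminal ("t1", "a"), Terminal ("t2", "a"), Terminal ("t3", "b")))) (_.id)

        // Join on id: the cached RDD is already keyed, so no keyBy is needed.
        val elements = getKeyed[Element] (sc, "Elements").get
        // Join on another attribute: drop the cached key with values, then re-key.
        val terms = getKeyed[Terminal] (sc, "Terminal").get.values.keyBy (_.ConductingEquipment)
        elements.join (terms).collect ()
    }

    def main (args: Array[String]): Unit =
    {
        val sc = new SparkContext (new SparkConf ().setMaster ("local[2]").setAppName ("KeyedCacheSketch"))
        try run (sc).foreach (println)
        finally sc.stop ()
    }
}
```

The second case shows the cost side of the proposal: any join on something other than id pays for an extra values step that the unkeyed cache does not need.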

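This also has effects on partitioning. I believe the hash code of the pair's first element (the key) is what Spark's default HashPartitioner uses as the partition function for a pairRDD, and hence caching a pairRDD partitioned by id would trigger a shuffle as objects are moved to the partition (and hence the machine) that "owns" them. Note that keyBy on its own does not assign a partitioner, so this shuffle is only incurred up front if the pairRDD is explicitly partitioned (for example with partitionBy) before it is cached; otherwise it is deferred to the join, as it is today.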
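The partitioning question can be checked directly by inspecting the partitioner of the RDD before and after an explicit partitionBy. The following is only a sketch under assumptions: Element is a cut-down stand-in for the CIM class, the helper names keyed and partitioned are hypothetical, and the partition count of 4 is arbitrary.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for a CIM class.
case class Element (id: String)

object PartitionerSketch
{
    // Key by id only: a map-like operation that moves no data and sets no partitioner.
    def keyed (sc: SparkContext): RDD[(String, Element)] =
        sc.parallelize (Seq ("a", "b", "c", "d").map (Element (_))).keyBy (_.id)

    // Key by id, then hash-partition on the key before caching.
    // The partitionBy step is where the up-front shuffle would be paid.
    def partitioned (sc: SparkContext, partitions: Int): RDD[(String, Element)] =
        keyed (sc).partitionBy (new HashPartitioner (partitions)).cache ()

    def main (args: Array[String]): Unit =
    {
        val sc = new SparkContext (new SparkConf ().setMaster ("local[2]").setAppName ("PartitionerSketch"))
        try
        {
            // No partitioner is expected here, since keyBy alone does not assign one.
            println (keyed (sc).partitioner)
            // A HashPartitioner is expected here, since it was requested explicitly.
            println (partitioned (sc, 4).partitioner)
        }
        finally sc.stop ()
    }
}
```

Whether the explicitly partitioned variant pays off depends on how many later joins are on id and can reuse that partitioning, which is what the benchmarks would need to show.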

Benchmarks should be performed before and after this change to determine if there is an actual speed improvement with typical use-case scenarios.

derrickoswald added the enhancement label Feb 12, 2017
