One of the most common operations for CIM class RDD is to generate a pairRDD for join operations with:
XXX.keyBy (_.id)
It may be advantageous to formalize this use-case by storing pre-keyed pairRDD in the persistent RDD cache pool instead of just CIM object RDD, since the id (CIM rdf:ID = mRID) is the unique identifier for each CIM object.
Unfortunately, this has pervasive downstream consequences. Each operation to "get" an RDD by name, which is used extensively in CIMScala and dependent code like CIMApplication, would need to be modified to take advantage of this, or to work around it where keyBy (_.id) is not the key required.
For example:
val elements = get ("Elements").asInstanceOf[RDD[Element]].keyBy (_.id).join (...
becomes
val elements = get ("Elements").asInstanceOf[RDD[(String, Element)]].join (...
and
val terms = get ("Terminal").asInstanceOf[RDD[Terminal]].keyBy (_.ConductingEquipment).join (...
becomes
val terms = get ("Terminal").asInstanceOf[RDD[(String, Terminal)]].values.keyBy (_.ConductingEquipment).join (...
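To make the two call-site patterns concrete, here is a minimal sketch of a join written against pre-keyed RDDs. The `Element` and `Terminal` case classes are hypothetical stand-ins for the CIMScala classes (only the fields used here), and `KeyedAccess.terminalsWithEquipment` is an illustrative name, not an existing CIMScala function.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the CIMScala classes; only the fields used here.
case class Element (id: String)
case class Terminal (id: String, ConductingEquipment: String)

object KeyedAccess
{
    // Join Terminals to their equipment when both RDDs are cached pre-keyed by id.
    def terminalsWithEquipment (
        elements: RDD[(String, Element)],
        terminals: RDD[(String, Terminal)]): RDD[(Terminal, Element)] =
    {
        terminals.values                       // drop the id key, it is not the join key
            .keyBy (_.ConductingEquipment)     // re-key by the foreign key
            .join (elements)                   // elements is already keyed by id, no keyBy needed
            .values                            // keep only the (Terminal, Element) pairs
    }
}
```

The side that joins on its own id gets shorter, while the side that joins on a foreign key gains a `.values` step, which is the trade-off described above.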
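This also has effects on partitioning. I believe the default partitioner for a pairRDD (HashPartitioner) assigns each pair to a partition from the hash code of its key (the first element of the pair). keyBy on its own does not set a partitioner, so to benefit at join time the cached pairRDD would also have to be partitioned by key, which triggers a one-time shuffle as objects are moved to the partition that "owns" them.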
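A rough sketch of what caching a keyed and partitioned RDD might look like, assuming Spark's standard `HashPartitioner`. The `KeyedCaching.cacheKeyed` helper, its parameters and the choice of storage level are assumptions for illustration, not part of CIMScala.

```scala
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object KeyedCaching
{
    // Key by id, hash-partition on that key, and persist the result.
    // partitionBy shuffles the data once; later joins that use the same
    // partitioner can reuse this layout instead of shuffling this side again.
    def cacheKeyed[T: ClassTag] (rdd: RDD[T], id: T => String, partitions: Int): RDD[(String, T)] =
        rdd.keyBy (id)
            .partitionBy (new HashPartitioner (partitions))
            .persist (StorageLevel.MEMORY_AND_DISK)
}
```

Whether the one-time shuffle pays for itself depends on how many subsequent joins actually use the id as the key.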
Benchmarks should be performed before and after this change to determine if there is an actual speed improvement with typical use-case scenarios.
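For the before/after comparison, a simple wall-clock timer may be enough as a first pass. This is only a sketch; `Benchmark.time` is a hypothetical helper, not an existing utility in CIMScala.

```scala
object Benchmark
{
    // Run `block` `runs` times and return the elapsed wall-clock time of each run in milliseconds.
    def time[A] (runs: Int) (block: => A): Seq[Double] =
        (1 to runs).map { _ =>
            val start = System.nanoTime ()
            block
            (System.nanoTime () - start) / 1e6
        }
}
```

Since Spark transformations are lazy, the timed block needs to end in an action such as `count ()`, otherwise only the construction of the lineage is measured. The first run usually includes cache population, so it is worth reporting separately from the later runs.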