Change in the implementation of reduceByKey's map-side combine #61

Open
Angryrou opened this issue May 3, 2017 · 0 comments

Angryrou commented May 3, 2017

Thank you very much for your articles! While reading and studying them I also wrote some demos, and I would like to ask a question:

In Chapter 2, Job 逻辑执行图 (the job's logical execution plan), in the part on generating the logical plan, you mention that the actual number of RDDs is a bit larger than we would imagine. Following that, I ran two experiments, one with groupByKey and one with reduceByKey, and the results did not match that expectation (see the figures below and my experiment). Both produce only a ParallelCollectionRDD and a ShuffledRDD; the intermediate RDDs never show up. I then compared the implementation in PairRDDFunctions.scala in the current source and found that it has indeed changed.
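
Concretely, reduceByKey now goes through combineByKeyWithClassTag, whose shuffle branch looks roughly like the sketch below (my own abridged paraphrase of PairRDDFunctions.scala in 2.1.0, not the verbatim source):

// Abridged from PairRDDFunctions.combineByKeyWithClassTag (Spark 2.1.0).
val aggregator = new Aggregator[K, V, C](
  self.context.clean(createCombiner),
  self.context.clean(mergeValue),
  self.context.clean(mergeCombiners))

if (self.partitioner == Some(partitioner)) {
  // Already partitioned as requested: combine within each partition, no shuffle.
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
} else {
  // The aggregator is attached to the ShuffledRDD itself instead of
  // being materialized as a separate upstream RDD.
  new ShuffledRDD[K, V, C](self, partitioner)
    .setSerializer(serializer)
    .setAggregator(aggregator)
    .setMapSideCombine(mapSideCombine)
}

So the map-side combining seems to have been pushed down into the shuffle path via setAggregator / setMapSideCombine, which would explain why no extra RDD shows up in the lineage.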

Given that, my question is: how exactly is the map-side combine work carried out now?

The job logical execution plans from the book:
(figures: groupbykey, reducebykey)

My experiment code (Spark 2.1.0):

import org.apache.spark.SparkContext

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Test")

    // Eight (Char, Int) pairs with duplicate keys, spread over 3 partitions.
    val data = Array[(Char, Int)](('A', 1), ('B', 1), ('C', 1), ('B', 1), ('C', 1), ('D', 1), ('C', 1), ('A', 1))
    val a = sc.parallelize(data, 3)

    val groupByKeyRDD = a.groupByKey()
    val reduceByKeyRDD = a.reduceByKey(_ + _)

    reduceByKeyRDD.foreach(println)
    groupByKeyRDD.foreach(println)

    // Print each lineage to see which intermediate RDDs were created.
    println(groupByKeyRDD.toDebugString)
    println(reduceByKeyRDD.toDebugString)

    sc.stop()
  }
}

Output:

(figure: console output screenshot)
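
In case the screenshot does not load: both debug strings show only the two RDDs, roughly of the following shape (paraphrased by me, not the verbatim console output; the call-site locations are elided):

(3) ShuffledRDD[2] at groupByKey at Test.scala:... []
 +-(3) ParallelCollectionRDD[0] at parallelize at Test.scala:... []
(3) ShuffledRDD[1] at reduceByKey at Test.scala:... []
 +-(3) ParallelCollectionRDD[0] at parallelize at Test.scala:... []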
