Change in the implementation of reduceByKey's map-side combine #61

Open
Angryrou opened this issue May 3, 2017 · 0 comments

Angryrou commented May 3, 2017

Thank you very much for your articles! While reading and studying them I also wrote some demos, and I would like to ask a question:

In Chapter 2, Job 逻辑执行图 (the job's logical execution plan), in the part on generating the logical plan, you mention that the actual number of RDDs is a bit larger than we would imagine. Following that, I ran two experiments, one with groupByKey and one with reduceByKey, and the results did not match that expectation (see the figures below and my experiment). Both produce only a ParallelCollectionRDD and a ShuffledRDD; the intermediate RDDs never show up. I then compared the implementation in PairRDDFunctions.scala in the current source and found that it has indeed changed.
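
Concretely, reduceByKey now goes through combineByKeyWithClassTag, whose shuffle branch looks roughly like the sketch below (my own abridged paraphrase of PairRDDFunctions.scala in 2.1.0, not the verbatim source):

// Abridged from PairRDDFunctions.combineByKeyWithClassTag (Spark 2.1.0).
val aggregator = new Aggregator[K, V, C](
  self.context.clean(createCombiner),
  self.context.clean(mergeValue),
  self.context.clean(mergeCombiners))

if (self.partitioner == Some(partitioner)) {
  // Already partitioned as requested: combine within each partition, no shuffle.
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
} else {
  // The aggregator is attached to the ShuffledRDD itself instead of
  // being materialized as a separate upstream RDD.
  new ShuffledRDD[K, V, C](self, partitioner)
    .setSerializer(serializer)
    .setAggregator(aggregator)
    .setMapSideCombine(mapSideCombine)
}

So the map-side combining seems to have been pushed down into the shuffle path via setAggregator / setMapSideCombine, which would explain why no extra RDD shows up in the lineage.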

Given that, my question is: how exactly is the map-side combine work carried out now?

The job logical execution plans from the book:
(figures: groupbykey, reducebykey)

My experiment code (Spark 2.1.0):

import org.apache.spark.SparkContext

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "Test")

    // Eight (Char, Int) pairs with duplicate keys, spread over 3 partitions.
    val data = Array[(Char, Int)](('A', 1), ('B', 1), ('C', 1), ('B', 1), ('C', 1), ('D', 1), ('C', 1), ('A', 1))
    val a = sc.parallelize(data, 3)

    val groupByKeyRDD = a.groupByKey()
    val reduceByKeyRDD = a.reduceByKey(_ + _)

    reduceByKeyRDD.foreach(println)
    groupByKeyRDD.foreach(println)

    // Print each lineage to see which intermediate RDDs were created.
    println(groupByKeyRDD.toDebugString)
    println(reduceByKeyRDD.toDebugString)

    sc.stop()
  }
}

Output:

(figure: console output screenshot)
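
In case the screenshot does not load: both debug strings show only the two RDDs, roughly of the following shape (paraphrased by me, not the verbatim console output; the call-site locations are elided):

(3) ShuffledRDD[2] at groupByKey at Test.scala:... []
 +-(3) ParallelCollectionRDD[0] at parallelize at Test.scala:... []
(3) ShuffledRDD[1] at reduceByKey at Test.scala:... []
 +-(3) ParallelCollectionRDD[0] at parallelize at Test.scala:... []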
