
Dataframe #451

Open · wants to merge 11 commits into master
Conversation

@lguo (Contributor) commented Feb 12, 2020

The following changes from the design doc are covered:

  1. Training datasets will be created directly before training a coordinate:
       - FixedEffectDataset is merged into FixedEffectCoordinate;
       - RandomEffectDataset is merged into RandomEffectCoordinate.
  2. Scores are changed to use DataFrame.
  3. Residuals will be computed using a UDF on the training DataFrame. For random effects, the per-entity models will first be joined to the DataFrame by REID. A single UDF will do all scoring for fixed and random effects at once, and will also sum the residuals and offsets. Directly before aggregation, the DataFrame will be converted to an RDD, and aggregation will proceed unmodified (a sketch follows below).
  4. Model scoring will work like coordinate scoring.
  5. Random effect vector projection will be disabled.
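
A minimal Spark sketch of the scoring described in item 3; the column names (reid, features, coefficients, offset, score), the score UDF, and both input DataFrames are assumptions for illustration, not the PR's actual code:

  import org.apache.spark.ml.linalg.Vector
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, udf}

  // Hypothetical UDF: dot product of a feature vector with per-entity coefficients.
  val score = udf { (features: Vector, coefficients: Vector) =>
    var s = 0.0
    features.foreachActive((i, v) => s += v * coefficients(i))
    s
  }

  // Join per-entity models by REID, score every row in one pass, and fold the
  // new scores into the offsets before any conversion back to an RDD.
  def scoreRandomEffect(training: DataFrame, perEntityModels: DataFrame): DataFrame =
    training
      .join(perEntityModels, "reid")
      .withColumn("score", score(col("features"), col("coefficients")))
      .withColumn("offset", col("offset") + col("score"))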
@ashelkovnykov (Contributor) left a comment

Initial comments on WIP

@@ -653,25 +492,31 @@ class GameEstimator(val sc: SparkContext, implicit val logger: Logger) extends P
     val interceptIndices = getOrDefault(coordinateInterceptIndices)

     // Create the optimization coordinates for each component model
-    val coordinates: Map[CoordinateId, C forSome { type C <: Coordinate[_] }] =
+    val coordinates: Map[CoordinateId, C forSome { type C <: Coordinate }] =
Contributor:

There's nothing wrong here, but it might be easier to keep the Dataset objects like we did for the tests to wrap the DataFrame of training data (once it is generated) and the feature shard ID.

@ashelkovnykov (Contributor)

Forgot to comment: since all of these commits are related to one task and don't seem to have any logical separation, would you kindly squash them into one commit?

@ashelkovnykov (Contributor) left a comment

I skipped reviewing much of the scoring changes as they looked like they were still early WIP and subject to many changes.

    optimizationProblem: DistributedOptimizationProblem[Objective],
    featureShardId: FeatureShardId,
    inputColumnsNames: InputColumnsNames)
  extends Coordinate {
Contributor:

This isn't what I was picturing when writing the design document - I was thinking of something more like generateDataset in the proof-of-concept tests (a condensed sketch in code follows this list):

  • CoordinateDescent calls train in FixedEffectCoordinate with training DataFrame
  • train calls generateDataset
  • generateDataset drops unnecessary columns and trains a FixedEffectModel
  • CoordinateDescent calls score in FixedEffectCoordinate with training DataFrame and FixedEffectModel
  • score returns a new DataFrame with a scores column
  • CoordinateDescent calls train in RandomEffectCoordinate with the scored DataFrame
  • train calls generateDataset
  • generateDataset merges the offset column with the scores column, then drops unnecessary columns and trains RandomEffectModel
  • CoordinateDescent calls score in RandomEffectCoordinate with the scored DataFrame and RandomEffectModel
  • score returns a new DataFrame with another scores column
  • etc.
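
A condensed sketch of that flow; the Coordinate trait and method signatures below are simplified assumptions for illustration, not the PR's actual API:

  import org.apache.spark.sql.DataFrame

  // Simplified stand-ins for the real photon-ml types.
  trait Model
  trait Coordinate {
    def train(data: DataFrame): Model                    // calls generateDataset internally
    def score(data: DataFrame, model: Model): DataFrame  // returns data plus a new scores column
  }

  // One pass of coordinate descent over a fixed-effect and a random-effect coordinate.
  def onePass(data: DataFrame, fixed: Coordinate, random: Coordinate): DataFrame = {
    val fixedModel = fixed.train(data)
    val scoredByFixed = fixed.score(data, fixedModel)  // adds the fixed-effect scores
    val randomModel = random.train(scoredByFixed)      // generateDataset merges offset + scores
    random.score(scoredByFixed, randomModel)           // adds the random-effect scores
  }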

@lguo (Author)

What you described is exactly what is implemented in the commits, but I put the corresponding logic in different methods (instead of the single one you suggested).

See CoordinateDescent lines 192-208:

  logger.debug(s"Updating coordinate of class ${coordinate.getClass}")

  // Compute scores using the previous coordinate model and update offsets
  prevModelOpt.map(model => coordinate.updateOffset(model))

  // Train a new model
  val (model, tracker) = initialModelOpt.map(
    initialModel => Timed(s"Train new model using existing model as starting point") {
      coordinate.trainModel(initialModel)
    }).getOrElse(
    Timed(s"Train new model") {
      coordinate.trainModel()
    })

  // Log summary
  logOptimizationSummary(logger, coordinateId, model, tracker)

val modelBroadcast: Broadcast[GeneralizedLinearModel],
val featureShardId: String)
Contributor:

These two lines should be indented once more

@lguo (Author) commented Feb 24, 2020

> I skipped reviewing much of the scoring changes as they looked like they were still early WIP and subject to many changes.

FixedEffectCoordinate.updateOffset (and RandomEffectCoordinate.updateOffset) are used to compute scores, instead of merging scores back into the original dataset.

case (
fEDataset: FixedEffectDataset,
None,
Comment:

Why does this become None? Is this for the fixed-effect case?

if (hasOffsetField && hasCoordinateScoreField) {
  // offset = offset - old_coordinateScore + new_coordinateScore
  dataset.withColumn(offset, col(offset) - col(SCORE_FIELD))
  fixedEffectModel.computeScore(dataset, SCORE_FIELD)
Comment:

Where is the new score saved?

Collaborator:

I think the new score is saved in SCORE_FIELD.
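
For context, withColumn returns a new DataFrame rather than mutating dataset, so the subtraction result has to be chained into the rescoring call. A minimal sketch, assuming computeScore overwrites scoreField with the new score (the helper below is hypothetical, not the PR's code):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Remove the old coordinate score from the offset, then rescore. The result
  // of withColumn must be carried forward explicitly, or the update is lost.
  def refreshScore(
      dataset: DataFrame,
      offset: String,
      scoreField: String,
      computeScore: DataFrame => DataFrame): DataFrame = {
    val withoutOldScore = dataset.withColumn(offset, col(offset) - col(scoreField))
    computeScore(withoutOldScore) // writes the new score into scoreField
  }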

if (modelsRDD.first()._2.coefficients.variancesOption.isDefined) {
  stringBuilder.append(s"\nVariance: ${modelsRDD.values.map(_.coefficients.variancesL2NormOption.get).stats()}")
}
// stringBuilder.append(s"\nLength: ${modelsRDD.values.map(_.coefficients.means.length).stats()}")
Comment:

Why not delete them if they are not used?

Comment:

Does this mean we don't have stats in the log file any more?

case (_, model: RandomEffectModel) => model.unpersistRDD()
case _ =>
}
// gameModel.toMap.foreach {
Comment:

Delete it if it's not used.

@@ -34,35 +37,40 @@ import com.linkedin.photon.ml.util.PhotonBroadcast
 object CoordinateFactory {

   /**
-   * Creates a [[Coordinate]] of the appropriate type, given the input [[Dataset]],
+   * Creates a [[Coordinate]] of the appropriate type, given the input data set,
Collaborator:

Let's keep "dataset" as one word.

* @param dataset The input data to use for training
* @param featureShardId
Collaborator:

Missing parameter description.

* @param dataset The input data to use for training
* @param featureShardId
* @param inputColumnsNames
Collaborator:

Missing parameter description.

* @param coordinateOptConfig The optimization settings for training
* @param lossFunctionFactoryConstructor A constructor for the loss function factory function
* @param glmConstructor A constructor for the type of [[GeneralizedLinearModel]] being trained
* @param downSamplerFactory A factory function for the [[DownSampler]] (if down-sampling is enabled)
* @param normalizationContext The [[NormalizationContext]]
* @param varianceComputationType Should the trained coefficient variances be computed in addition to the means?
* @param interceptIndexOpt The index of the intercept, if one is present
* @return A [[Coordinate]] for the [[Dataset]] of type [[D]]
* @param rETypeOpt
Collaborator:

Missing parameter description.


val lossFunctionFactory = lossFunctionFactoryConstructor(coordinateOptConfig)

-    (dataset, coordinateOptConfig, lossFunctionFactory) match {
+    (rETypeOpt, coordinateOptConfig, lossFunctionFactory) match {
Collaborator:

Can we just do a match on (coordinateOptConfig, lossFunctionFactory)? This rETypeOpt seems to be redundant.

val optimizationTracker = new FixedEffectOptimizationTracker(optimizationProblem.getStatesTracker)

(updatedFixedEffectModel, optimizationTracker)
override protected[algorithm] def updateOffset(model: DatumScoringModel) = {
Collaborator:

Comments are missing for updateOffset.
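
A sketch of the kind of doc comment being asked for; the wording is an assumption about updateOffset's behaviour inferred from this thread, and DatumScoringModel is stubbed to keep the snippet self-contained:

  // Stub standing in for photon-ml's model type, for illustration only.
  trait DatumScoringModel

  /**
   * Compute scores for this coordinate's training data using the given model,
   * and fold the new scores into the offset column so that the next coordinate
   * trains on the residuals.
   *
   * @param model The model with which to score the training data
   */
  def updateOffset(model: DatumScoringModel): Unit = ???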


new CoordinateDataScores(scores)
def updateOffset(
Collaborator:

Missing comments for updateOffset.

  fixedEffectModel.computeScore(dataset, SCORE_FIELD)
    .withColumn(offset, col(offset) + col(SCORE_FIELD))
} else {
  throw new UnsupportedOperationException("It shouldn't happen!")
Collaborator:

Can you make the error message more explicit?
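
One possible, more descriptive wording, as a sketch (the column-name parameters are assumptions):

  import org.apache.spark.sql.DataFrame

  // Sketch of a more explicit failure message for the missing-column case.
  def missingColumnsError(dataset: DataFrame, offset: String, scoreField: String): Nothing =
    throw new UnsupportedOperationException(
      s"Cannot update offsets: expected columns '$offset' and '$scoreField', " +
        s"but the dataset only contains [${dataset.columns.mkString(", ")}]")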

}

object FixedEffectCoordinate {

  def SCORE_FIELD = "fixed_score"
Collaborator:

Better to call it fixed_effect_score.

}
}

def toDataFrame(input: RDD[(REType, GeneralizedLinearModel)]): DataFrame = {
Collaborator:

Missing comments.
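
For reference, a hedged sketch of such a conversion; the schema, the column names, and the simplification of GeneralizedLinearModel down to its coefficient array (with REType as String) are all assumptions:

  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Sketch: flatten each per-entity model into an (REID, coefficients) row.
  def toDataFrame(input: RDD[(String, Array[Double])])(implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    input
      .map { case (reid, means) => (reid, Vectors.dense(means)) }
      .toDF("reid", "model")
  }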

@@ -17,7 +17,6 @@ package com.linkedin.photon.ml.optimization
import breeze.linalg.{Vector, cholesky, diag}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

Collaborator:

This line is needed.

import org.apache.spark.rdd.RDD

Collaborator:

This line is needed.


score
})

Collaborator:

This line is not necessary.


var score = 0D

coefficients match {
Collaborator:

If the features are dense, then the coefficients are usually dense. If the features are sparse (as for random effects), then the coefficients are sparse. So it seems that

features.foreachActive { case (index, value) => score += value * denseCoef(index) }

is good enough. Will there be cases where the coefficients are sparse but the features are dense?
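
A minimal breeze sketch of the suggested active-entry dot product (the helper is hypothetical, not the PR's code):

  import breeze.linalg.{DenseVector, Vector}

  // Iterate only the stored (active) entries of the feature vector; this covers
  // both dense and sparse features, as long as the coefficients are dense.
  def dot(features: Vector[Double], denseCoef: DenseVector[Double]): Double = {
    var score = 0.0
    features.activeIterator.foreach { case (index, value) =>
      score += value * denseCoef(index)
    }
    score
  }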

.reduceByKey(_ + _)
.values
.stats()
.groupBy(idTag).agg(count("*").alias("cnt"))
Collaborator:

Indent two spaces back.

@@ -0,0 +1,25 @@
/*
* Copyright 2017 LinkedIn Corp. All rights reserved.
Collaborator:

Copyright 2020

@@ -15,10 +15,9 @@
package com.linkedin.photon.ml.sampling

import java.util.Random

import com.linkedin.photon.ml.Types.UniqueSampleId
Collaborator:

Please reorder this import.
