
[WIP]add the confidence interval computation #387

Open

YazhiGao wants to merge 1 commit into master from ratio_random_effect_select

Conversation

@YazhiGao commented Aug 14, 2018

This is the current implementation of ratio modeling for feature selection of random effects in Photon.
I follow the algorithm described in the original publication, with some adjustments made per discussions with Yiming and Alex.

  • Unit tests all pass.

  • Integration tests all pass.

The algorithm in practice (closely tied to the codebase, rather than just the mathematical formulation) is as follows:

1. Pass in the featureStatisticSummary.
2. Identify the binomial columns.
3. Compute the lower bound for the binomial columns based on the t value.
4. Select features based only on the following lower-bound criterion (non-binomial and intercept columns are kept automatically):

if (t < 1) {
  T_l = 1 / T_u
}

if (T_l > 1D) {
  //  select feature
}
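
For concreteness, here is a self-contained sketch of steps 3 and 4. All names are illustrative rather than this PR's actual API; the formulas mirror the computeMeanAndVariance / computeLowerBound / computeUpperBound helpers discussed in the review below, and z = 2.575 corresponds roughly to a two-sided 99% confidence interval.

// Illustrative sketch only -- not the photon-ml API.
// Per the discussion below: m = number of random-effect instances carrying the
// binary feature, x = how many of those m have a positive label; n and y are
// the global analogues (so X is always a subset of Y).
object RatioCIBoundSketch {

  // Ratio of the two positive rates, plus its log-scale variance estimate
  def meanAndVariance(x: Double, m: Double, y: Double, n: Double): (Double, Double) = {
    val t = (x / m) / (y / n)
    val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
    (t, variance)
  }

  def lowerBound(t: Double, variance: Double, zScore: Double): Double =
    t * math.exp(-math.sqrt(variance) * zScore)

  def upperBound(t: Double, variance: Double, zScore: Double): Double =
    t * math.exp(math.sqrt(variance) * zScore)

  // Step 4: keep the feature only if the (possibly mirrored) lower bound exceeds 1
  def selectFeature(x: Double, m: Double, y: Double, n: Double, zScore: Double): Boolean = {
    val (t, variance) = meanAndVariance(x, m, y, n)
    val tL =
      if (t < 1.0) 1.0 / upperBound(t, variance, zScore)
      else lowerBound(t, variance, zScore)
    tL > 1.0
  }
}

For example, x = 10 positives out of m = 40 feature-carrying instances in the random effect, against y = 100 out of n = 1000 globally, gives t = 2.5 and a lower bound of about 1.19 at zScore = 2.575, so the feature would be kept.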

As a WIP commit, there are things to polish in the near future, since we are currently focused on the feasibility of this experimental method and on minimizing user-side changes:

  • Unit tests do not yet cover all feature-selection scenarios. Currently the binomial cases are not selected; we need to craft data that covers all cases.

  • The binomial-feature-column identification predicate needs to be stronger. The current solution is inherently flawed; we need more computation at the feature-summary stage to make it reliable.

  • Hyperparameter interface design: for convenience, the user-side interface for passing in the normal-distribution quartile and the lower-bound threshold needs a redesign.

  • The relationship with Pearson-correlation feature selection: we need another parameter to decide which feature-selection algorithm to use, or to mix the two at a later stage.

  • The crafted test data needs some changes; currently some unneeded feature-summary entries are not carefully addressed.

  • Further experiment and benchmark reports will follow after the regression tests.

  • The way we currently keep non-binary and intercept columns does not work well for the feature-ranking report planned later; it needs a redesign.

@joshvfleming @ashelkovnykov

import breeze.linalg.{SparseVector, Vector}

import com.linkedin.photon.ml.Types.FeatureShardStatisticsMap
Collaborator

Add an empty line here.

Author

fixed

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.{StorageLevel => SparkStorageLevel}
import org.apache.spark.{Partitioner, SparkContext}

import com.linkedin.photon.ml.Types.{FeatureShardId, REId, REType, UniqueSampleId}
import com.linkedin.photon.ml.Types._
Collaborator

Empty line.

Author

fixed

@@ -535,7 +541,11 @@ class GameEstimator(val sc: SparkContext, implicit val logger: Logger) extends P
None
}

val rawRandomEffectDataSet = RandomEffectDataSet(gameDataSet, reConfig, partitioner, existingModelKeysRddOpt)
val rawRandomEffectDataSet = RandomEffectDataSet(gameDataSet,
Collaborator

First argument in a new line.

Author

fixed

def filterFeaturesByRatioCIBound(
intervalBound: Double,
percentage: Double,
globalFeatureShardStats: BasicStatisticalSummary): LocalDataSet = {
Contributor

Args should be indented 4 spaces.

protected[ml] def computeRatioCILowerBound(
randomLabelAndFeatures: Array[(Double, Vector[Double])],
quartile: Double,
globalFeatureShardStats: BasicStatisticalSummary): Map[Int, Double] = {
Contributor

Same here.

randomEffectNumSamples += 1
features.activeIterator.foreach { case (key, value) =>
randomFeatureFirstOrderSums.update(key, randomFeatureFirstOrderSums.getOrElse(key, 0.0) + value)}
}
Contributor

It would be faster / cleaner to do a vector sum reduce here, instead of manually iterating through the vector, e.g.:

  val randomFeatureFirstOrderSums = randomLabelAndFeatures
    .map(_._2)
    .reduce(_ + _)

Also, generally speaking mutability is frowned upon in the Scala world.
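
As a toy, self-contained illustration of that suggestion (the data here is invented; the reduce itself is what the PR's final code ends up doing):

import breeze.linalg.{DenseVector, Vector}

val labelAndFeatures: Array[(Double, Vector[Double])] = Array(
  (1.0, DenseVector(1.0, 0.0, 2.0)),
  (0.0, DenseVector(0.0, 1.0, 1.0)),
  (1.0, DenseVector(1.0, 1.0, 0.0)))

// Element-wise sum of all feature vectors, with no mutable accumulator
val featureSums: Vector[Double] = labelAndFeatures.map(_._2).toSeq.reduce(_ + _)
// featureSums == DenseVector(2.0, 2.0, 3.0)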

upperBound match {
case Some(upperBound) => None
case _ => {
upperBound = Some(t * math.exp(math.sqrt(variance) * quartile))
Contributor

This should also get its own function.

lowerBound match {
case Some(lowerBound) => None
case _ => {
lowerBound = Some(t * math.exp(-math.sqrt(variance) * quartile))
Contributor

Same function as above.

case _ => py = Some(globalPopulationFirstOrderMeans(key))
}
val t = ( x / m ) / py.get
val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
Contributor

We should extract this out to a function that returns (mean, variance). There's too much going on here in-line.


randomFeatureFirstOrderSums.keySet.foreach { key =>
// Do computation on only binary and non-intercept features
if ((globallFeatureNonZero(key) * 1.0) / globalPopulationNumSamples == globalMean(key) && key != lastColumn) {
Contributor

Let's put this in an isBinary detector function to make the code easier to scan. Also, this check for intercept is not quite right -- we might not have an intercept, in which case this will ignore an actual feature.

Author

This is the temporary solution that I mentioned in the to-do list; we will figure this out through further discussion.

val labelAndFeatures = dataPoints.map { case (_, labeledPoint) => (labeledPoint.label, labeledPoint.features) }
val lowerBounds = LocalDataSet.computeRatioCILowerBound(labelAndFeatures, percentage, globalFeatureShardStats)
val filteredFeaturesIndexSet = lowerBounds.toArray
.filter(_._2 > intervalBound).map(_._1).toSet
Contributor

Each of these chained calls should get its own line.

@YazhiGao force-pushed the ratio_random_effect_select branch 2 times, most recently from 070d006 to 93d78d3, on August 15, 2018 20:59
@@ -192,6 +248,61 @@ class LocalDataSetTest {
}
}

/**
Contributor

The comment style is:

/**
 *
 **/

)

val computed = LocalDataSet.computeRatioCILowerBound(labelAndFeatures, 2.575, globalStats)
println(computed)
Contributor

redundant println

@@ -126,6 +126,36 @@ protected[ml] case class LocalDataSet(dataPoints: Array[(Long, LabeledPoint)]) {
}
}

/**
Contributor

comments are off. Same for line 245.

Contributor

Should look like:

/**
 * text text text
 *
 */

Contributor

This is resolved

* Compute Ratio Confidence Interval lower bounds.
*
* @param randomLabelAndFeatures An array of (label, feature) tuples
* @param quartile The quartile score of standard normal distribution
Contributor

what's quartile score?

Contributor

This parameter should be called zScore, since that's what it is

Author

fixed

@YazhiGao changed the title from "add the confidence interval computation WIP" to "[WIP]add the confidence interval computation" on Aug 16, 2018
val randomEffectNumSamples: Long = randomLabelAndFeatures.length.toLong
val randomFeatureFirstOrderSums = randomLabelAndFeatures.map(_._2).toSeq.reduce(_ + _)

val m: Long = randomEffectNumSamples

m is not the count of all samples in the random effect; it's the number of instances having binary feature f, and x is how many of those m are positive. The same goes for y and n at the global level.

Contributor

@YazhiGao This one is my mistake: I misremembered the formulation when explaining it to you. Ping me if it's not clear what @mayiming means here.

Contributor

@ashelkovnykov left a comment

Please apply the Photon Style doc in the root of the project, it will reduce the number of style issues that need to be manually fixed.

Answering your 7 questions in order:

  1. TODO - please fix the requested locations first and then we'll reconsider the unit tests
  2. See comments
  3. Ignore this for now: lock the lower-bound threshold to 1.0 and the z-score to 2.575. We can adjust both after PoC testing.
  4. It's one or the other. During testing, we can compare how the Pearson correlation filtering performs. Based on past experience, I don't think it helps very much and I would personally prefer to remove it entirely. This would be in a separate commit.
  5. See 1
  6. After unit tests have been resolved we can introduce logging code for testing purposes
  7. See comments

Thanks

featureShardStats match {
case Some(featureShardStats) => featureShardStatsInMap = Some(featureShardStats.toMap[FeatureShardId,BasicStatisticalSummary])
case None => None
}
Contributor

In Scala, var means variable and val means constant. As a general rule, var should not be used in Scala unless using a val would be inefficient or logically difficult.

Looking at how featureShardStats is used, this whole block should be scrapped and calculateAndSaveFeatureShardStats should be modified to return a Map instead of an Iterable.
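
A sketch of the val-based alternative, assuming featureShardStats is an Option over (FeatureShardId, BasicStatisticalSummary) pairs as the block above suggests:

// Map over the Option instead of mutating a var through a match
val featureShardStatsInMap: Option[Map[FeatureShardId, BasicStatisticalSummary]] =
  featureShardStats.map(_.toMap[FeatureShardId, BasicStatisticalSummary])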

Author

fixed

@@ -41,4 +43,6 @@ object Types {
// A "feature shard" is an arbitrary set of "feature bags"
// A random effect model corresponds to a single feature shard
type FeatureShardId = String
type FeatureShardStatisticsMap = Map[FeatureShardId, BasicStatisticalSummary]
type FeatureShardStatisticsMapOpt = Option[FeatureShardStatisticsMap]
Contributor

These types should be scrapped. The FeatureShardStatistics type in the GameTrainingDriver should replace Iterable with Map (see comments on GameTrainingDriver).

Author

fixed

@@ -15,17 +15,15 @@
package com.linkedin.photon.ml.estimators

import scala.language.existentials

Contributor

This whitespace should not be removed

Author

fixed

import org.apache.spark.SparkContext
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.slf4j.Logger

Contributor

This whitespace should not be removed

Author

fixed

@@ -43,6 +41,7 @@ import com.linkedin.photon.ml.supervised.classification.{LogisticRegressionModel
import com.linkedin.photon.ml.supervised.regression.{LinearRegressionModel, PoissonRegressionModel}
import com.linkedin.photon.ml.util._


Contributor

Extra whitespace

Author

fixed

@@ -441,6 +447,7 @@ object GameTrainingDriver extends GameDriver {
.setCoordinateDescentIterations(getRequiredParam(coordinateDescentIterations))
.setComputeVariance(getOrDefault(computeVariance))
.setIgnoreThresholdForNewModels(getOrDefault(ignoreThresholdForNewModels))
.setFeatureStats(featureShardStatsInMap)
Contributor

We should only set this if the training task is logistic regression.

Author

fixed

val t = ( x / m ) / py
val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
(t,variance)
}
Contributor

py should not be an input

Author

fixed

quartile: Double
): Double = {
t * math.exp(-math.sqrt(variance) * quartile)
}
Contributor

Unnecessary braces
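
For reference, a brace-free single-expression form of the two helpers could look like this (a sketch, with the parameter renamed to zScore per the earlier comment):

def computeLowerBound(t: Double, variance: Double, zScore: Double): Double =
  t * math.exp(-math.sqrt(variance) * zScore)

def computeUpperBound(t: Double, variance: Double, zScore: Double): Double =
  t * math.exp(math.sqrt(variance) * zScore)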

Author

fixed

quartile: Double
): Double = {
t * math.exp(math.sqrt(variance) * quartile)
}
Contributor

Unnecessary braces

Author

fixed

key: Int
): Boolean = {
(globallFeatureNonZero(key) * 1.0) / globalPopulationNumSamples == globalMean(key)
}
Contributor

This function shouldn't need to exist (see other comments), but if it did:

  • Passing an Array and an index to access within the Array is poor craftsmanship.
  • Why multiply by 1D?
  • Unnecessary braces.
  • Whitespace is needed after the function definition.
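
If the predicate does stay, one way to address those points is to pass the scalar values rather than the arrays (hypothetical signature, not the PR's final code):

// A 0/1 column's mean equals its fraction of non-zero entries
def isBinary(nonZeroCount: Double, numSamples: Long, mean: Double): Boolean =
  nonZeroCount / numSamples == mean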

Author

This function is necessary, as Josh pointed out, to avoid writing too much inline. As we further refine our predicate, we can make it faster.

@YazhiGao
Author

Fixed the issues, but the unit tests are not done yet; this is an early update for pending reviews.

@YazhiGao force-pushed the ratio_random_effect_select branch 2 times, most recently from 16e0374 to 6dace03, on August 27, 2018 22:46
@YazhiGao
Author

Pushed with all tests passing and issues resolved.

featureShardStatsOpt match {
case Some(featureShardStats) => {
val featureShardId = randomEffectDataConfiguration.featureShardId
val (binaryIndices, nonBinaryIndices) = filterBinaryFeatures(featureShardStats(featureShardId))
Contributor

"Filter" isn't a good name for this, because it's not really filtering a collection. When you see filter, you expect it to be (seq in, seq out).

Contributor

@joshvfleming segregateBinaryFeatures?

@YazhiGao We can get around the intercept issue here, for now. Since we still need to run experiments, this code doesn't need to be 100% production ready yet, so what we can do is this:

val (binaryIndices, rawNonBinaryIndices) = filterBinaryFeatures(featureShardStats(featureShardId))
val nonBinaryIndices = rawNonBinaryIndices + globalFeatureShardStats.mean.length

Since we know that the intercept index is always the last index (for now), we can insert it into the set of non-binary indices to guarantee that it is not filtered.

I spoke with @joshvfleming offline and we agreed that we need a refactor to move the concept of the intercept into the GameEstimator, so this should be a reasonable hack for now, until that refactor takes place.

Contributor

@ashelkovnykov left a comment

Have not reviewed test cases yet.

}
}

lowerBounds
Contributor

As I mentioned in my previous review, we don't need all of these cases. We don't have to apply the paper exactly as it is written, since we have a special case where X ⊂ Y, always.

This entire block, from line 281 to line 316 can be replaced with the following:

binaryIndices
  .map { key =>
    val x_col = x(key)
    val m_col = m(key)
    val y_col = y(key)
    val n_col = n(key)

    val lowerBound = if (y_col == 0 || x_col == 0 || x_col == m_col) {
      0D
    } else {
      val (t, variance) = computeMeanAndVariance(x_col, m_col, y_col, n_col)
  
      if (t < 1D) {
        1D / computeUpperBound(t, variance, zScore)
      } else {
        computeLowerBound(t, variance, zScore)
      }
    }
    
    (key, lowerBound)
  }
  .toMap

Author

fixed

Author

We shouldn't add x_col == m_col as a condition for the 0D output, right?

Author

After some thinking, I believe the original code is right. We should only eliminate the x != 0 && y == 0 case; that is the only impossible case in our data, and the original code already excludes it.

case None => featuresIndices
}
featuresAsBV.map{v => v(featuresIndicesWithoutIntercept.toIndexedSeq).toVector}
}
Contributor

This intercept issue has become larger than we thought. As mentioned elsewhere, we should just hardcode the logic for now and return to it in a separate intercept refactor. I propose we use this block of code for now:

  private def calculateStatistics(
      data: DataFrame,
      featureIndexMapLoaders: Map[FeatureShardId, IndexMapLoader]): Map[FeatureShardId, BasicStatisticalSummary] =
    featureIndexMapLoaders
      .map { case (featureShardId, indexMapLoader) =>
        // Calling rdd explicitly here to avoid a typed encoder lookup in Spark 2.1
        val featuresAsBV: RDD[BreezeVector[Double]] = data
          .select(featureShardId)
          .rdd
          .map(x => VectorUtils.mlToBreeze(x.getAs[SparkMLVector](0)))

        val featuresForSummary = indexMapLoader
          .indexMapForDriver()
          .get(Constants.INTERCEPT_KEY)
          .map(_ => featuresAsBV.map(dropIntercept))
          .getOrElse(featuresAsBV)

        (featureShardId, BasicStatisticalSummary(featuresForSummary))
      }

  private def dropIntercept(baseVector: BreezeVector[Double]): BreezeVector[Double] = baseVector match {
    case dense: DenseVector[Double] =>
      new DenseVector[Double](dense.data, dense.offset, dense.stride, dense.length - 1)

    case sparse: SparseVector[Double] =>
      // Keep only the active entries that precede the intercept (last) index, and
      // drop the matching data values so that indices and data stay aligned
      val (keptIndices, keptData) = sparse.index
        .zip(sparse.data)
        .filter { case (index, _) => index < sparse.length - 1 }
        .unzip
      new SparseVector[Double](keptIndices, keptData, sparse.length - 1)
  }

Author

What if there is no intercept?

Author

How can dropIntercept handle the case with no intercept?

Contributor

It only gets called if there is an intercept.

@YazhiGao force-pushed the ratio_random_effect_select branch 2 times, most recently from 46fc425 to 3122532, on September 7, 2018 22:24
@YazhiGao
Author

YazhiGao commented Sep 7, 2018

Unit tests and integration tests passed.
Rebased #390 onto this commit.

@YazhiGao
Author

YazhiGao commented Sep 7, 2018

Should we take the case where x == m && y == n seriously, as the original paper does, and approximate the result using x = m - 0.5, y = n - 0.5? Or should we follow @ashelkovnykov's opinion that such a feature is intuitively useless and we should output a 0 lower bound?
@mayiming @joshvfleming

@YazhiGao force-pushed the ratio_random_effect_select branch 2 times, most recently from 79d405e to 2e73bb6, on September 10, 2018 17:49
@ashelkovnykov
Contributor

@YazhiGao @joshvfleming @mayiming

I'm set in my position that following the paper 100% is incorrect, due to our specific application:

| X | Y | Lower bound |
| --- | --- | --- |
| 0 | 0 | Hardcode to 0: this feature is never used |
| x | 0 | Impossible, since X ⊂ Y |
| m | 0 | Impossible, since X ⊂ Y |
| 0 | y | Regular case: needs a special value so there is no divide-by-zero error during the variance computation |
| x | y | Regular case |
| m | y | Regular case |
| 0 | n | Impossible, since X ⊂ Y |
| x | n | Impossible, since X ⊂ Y |
| m | n | Hardcode to 0: the lower bound approaches 1 from below, so the feature will never be selected |

Thus, the code can look something like this:

if (y == 0 || y == n) {
  0
} else {
  val x = max(x_raw, X_MIN)
  ...
}

The X_MIN in the paper is 0.5 - this value seems arbitrary and I wonder whether or not we should decrease it.
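
Putting the table and the X_MIN clamp together with the t < 1 mirroring from the PR, the computation might look like this (a sketch; the formulas are the ones quoted elsewhere in this thread):

def guardedLowerBound(xRaw: Double, m: Double, y: Double, n: Double, zScore: Double): Double =
  if (y == 0.0 || y == n) {
    // Feature never used globally, or present in every sample: never selected
    0.0
  } else {
    val x = math.max(xRaw, 0.5) // X_MIN from the paper; avoids 1 / 0 when x = 0
    val t = (x / m) / (y / n)
    val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
    if (t < 1.0) 1.0 / (t * math.exp(math.sqrt(variance) * zScore))
    else t * math.exp(-math.sqrt(variance) * zScore)
  }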

@joshvfleming
Contributor

@ashelkovnykov is right. The formulation in the paper is between any two ratios, but in our case one set is a subset of the other. So there are cases that can't arise, and because of this, the code can be much simpler.

)
val x_0_y_0_g_positive = Array(0.0)

// Second case x == 0 and y != 0
Contributor

See my comments about splitting the responsibilities of the tests. The cases for testComputeRatioCILowerBound should be:

  1. x = 0, y = 0
  2. x = m, y = n
  3. x = 0, 0 < y < n, epsilon = default
  4. x = 0, 0 < y < n, epsilon = something else
  5. x = m, 0 < y < n
  6. 0 < x < m, 0 < y < n
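
A sketch of how the two hardcoded-zero cases from this list might be asserted, in the TestNG style photon-ml uses (the lowerBound stand-in below is hypothetical and mirrors the guard discussed above):

import org.testng.Assert.assertEquals
import org.testng.annotations.Test

class RatioCILowerBoundSketchTest {

  // Hypothetical stand-in for the helper under test
  private def lowerBound(x: Double, m: Double, y: Double, n: Double, zScore: Double): Double =
    if (y == 0.0 || y == n) {
      0.0
    } else {
      val xAdj = math.max(x, 0.5)
      val t = (xAdj / m) / (y / n)
      val variance = 1.0 / xAdj - 1.0 / m + 1.0 / y - 1.0 / n
      t * math.exp(-math.sqrt(variance) * zScore)
    }

  @Test
  def testHardcodedZeroCases(): Unit = {
    // Case 1: x = 0, y = 0 -- the feature never appears, bound hardcoded to 0
    assertEquals(lowerBound(0.0, 10.0, 0.0, 100.0, 2.575), 0.0, 0.0)
    // Case 2: x = m, y = n -- the feature appears everywhere, bound hardcoded to 0
    assertEquals(lowerBound(10.0, 10.0, 100.0, 100.0, 2.575), 0.0, 0.0)
  }
}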

@YazhiGao
Author

Fixed all issues except the test-case redesign; all current tests pass.

Contributor

@ashelkovnykov left a comment

Ready for testing - @YazhiGao and I will work on the unit tests later, pending good experimental results from Baolei

@YazhiGao
Author

Ready for experiments now.

@@ -212,6 +254,116 @@ object LocalDataSet {
new SparseVector(filteredIndexBuilder.result(), filteredDataBuilder.result(), features.length)
}

/**
* Compute Ratio Confidence Interval lower bounds.

Suggest adding more description, and in particular referencing the paper/book we follow.

val n_col = n(key)

val lowerBound = if (y_col == 0.0 || y_col == n_col) {
0D

For m_col = 0, we may also want to return 0D here, since entering the next branch would produce Infinity.
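
A minimal form of that guard, reusing the helper names from the earlier review comment (sketch):

val lowerBound = if (y_col == 0.0 || y_col == n_col || m_col == 0.0) {
  // m_col == 0 would otherwise make the 1.0 / m_col variance term infinite
  0D
} else {
  val (t, variance) = computeMeanAndVariance(x_col, m_col, y_col, n_col)
  if (t < 1D) 1D / computeUpperBound(t, variance, zScore)
  else computeLowerBound(t, variance, zScore)
}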
