[WIP] Add the confidence interval computation #387
base: master
Conversation
import breeze.linalg.{SparseVector, Vector}
import com.linkedin.photon.ml.Types.FeatureShardStatisticsMap
Add an empty line here.
fixed
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.{StorageLevel => SparkStorageLevel}
import org.apache.spark.{Partitioner, SparkContext}

-import com.linkedin.photon.ml.Types.{FeatureShardId, REId, REType, UniqueSampleId}
+import com.linkedin.photon.ml.Types._
Empty line.
fixed
@@ -535,7 +541,11 @@ class GameEstimator(val sc: SparkContext, implicit val logger: Logger) extends P
       None
     }

-    val rawRandomEffectDataSet = RandomEffectDataSet(gameDataSet, reConfig, partitioner, existingModelKeysRddOpt)
+    val rawRandomEffectDataSet = RandomEffectDataSet(gameDataSet,
First argument on a new line.
fixed
def filterFeaturesByRatioCIBound(
  intervalBound: Double,
  percentage: Double,
  globalFeatureShardStats: BasicStatisticalSummary): LocalDataSet = {
Args should be indented 4 spaces.
protected[ml] def computeRatioCILowerBound(
  randomLabelAndFeatures: Array[(Double, Vector[Double])],
  quartile: Double,
  globalFeatureShardStats: BasicStatisticalSummary): Map[Int, Double] = {
Same here.
randomEffectNumSamples += 1
features.activeIterator.foreach { case (key, value) =>
  randomFeatureFirstOrderSums.update(key, randomFeatureFirstOrderSums.getOrElse(key, 0.0) + value)
}
It would be faster / cleaner to do a vector sum reduce here, instead of manually iterating through the vector, e.g.:
val randomFeatureFirstOrderSums = randomLabelAndFeatures
.map(_._2)
.reduce(_ + _)
Also, generally speaking mutability is frowned upon in the Scala world.
upperBound match {
  case Some(upperBound) => None
  case _ => {
    upperBound = Some(t * math.exp(math.sqrt(variance) * quartile))
This should also get its own function.
lowerBound match {
  case Some(lowerBound) => None
  case _ => {
    lowerBound = Some(t * math.exp(-math.sqrt(variance) * quartile))
Same function as above.
  case _ => py = Some(globalPopulationFirstOrderMeans(key))
}
val t = (x / m) / py.get
val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
We should extract this out to a function that returns (mean, variance). There's too much going on here in-line.
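A minimal sketch of the suggested extraction (hypothetical object and parameter names, not the actual photon-ml API): x of m random-effect instances are positive for the feature, and y of n global instances are positive.

```scala
// Hypothetical sketch: compute the ratio estimate t and its log-scale variance
// from the per-feature counts, instead of doing it in-line.
object RatioCISketch {

  // x of m random-effect instances are positive; y of n global instances are positive
  def computeMeanAndVariance(x: Double, m: Double, y: Double, n: Double): (Double, Double) = {
    val t = (x / m) / (y / n)
    val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
    (t, variance)
  }
}
```

This matches the helper the PR later extracts, which takes the counts and returns (t, variance).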
randomFeatureFirstOrderSums.keySet.foreach { key =>
  // Do computation on only binary and non-intercept features
  if ((globallFeatureNonZero(key) * 1.0) / globalPopulationNumSamples == globalMean(key) && key != lastColumn) {
Let's put this in an isBinary detector function to make the code easier to scan. Also, this check for intercept is not quite right -- we might not have an intercept, in which case this will ignore an actual feature.
Is this the temporary solution I mentioned in the to-do list? We will figure this out through discussion.
val labelAndFeatures = dataPoints.map { case (_, labeledPoint) => (labeledPoint.label, labeledPoint.features) }
val lowerBounds = LocalDataSet.computeRatioCILowerBound(labelAndFeatures, percentage, globalFeatureShardStats)
val filteredFeaturesIndexSet = lowerBounds.toArray
  .filter(_._2 > intervalBound).map(_._1).toSet
Each of these chained calls should get its own line.
Force-pushed from 070d006 to 93d78d3.
@@ -192,6 +248,61 @@ class LocalDataSetTest {
   }
 }

+  /**
Comment style is:
/**
 *
 */
)

val computed = LocalDataSet.computeRatioCILowerBound(labelAndFeatures, 2.575, globalStats)
println(computed)
redundant println
@@ -126,6 +126,36 @@ protected[ml] case class LocalDataSet(dataPoints: Array[(Long, LabeledPoint)]) {
   }
 }

+  /**
comments are off. Same for line 245.
Should look like:
/**
* text text text
*
*/
This is resolved
 * Compute Ratio Confidence Interval lower bounds.
 *
 * @param randomLabelAndFeatures An array of (label, feature) tuples
 * @param quartile The quartile score of standard normal distribution
What's a quartile score?
This parameter should be called zScore, since that's what it is.
fixed
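To make the zScore naming concrete, a hedged sketch (illustrative names only, not the PR's actual helpers) of how the bounds use it; the interval is computed on the log scale, so z = 2.575 corresponds to roughly a 99% two-sided confidence interval:

```scala
object RatioCIBounds {

  // Lower bound of the ratio CI: shrink t on the log scale by z standard deviations
  def lowerBound(t: Double, variance: Double, zScore: Double): Double =
    t * math.exp(-math.sqrt(variance) * zScore)

  // Upper bound: grow t on the log scale by z standard deviations
  def upperBound(t: Double, variance: Double, zScore: Double): Double =
    t * math.exp(math.sqrt(variance) * zScore)
}
```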
val randomEffectNumSamples: Long = randomLabelAndFeatures.length.toLong
val randomFeatureFirstOrderSums = randomLabelAndFeatures.map(_._2).toSeq.reduce(_ + _)

val m: Long = randomEffectNumSamples
m is not all the samples in the random effect; it's the number of instances having the binary feature f, and x is how many of those m are positive. The same holds for y and n in the global population.
Please apply the Photon Style doc in the root of the project; it will reduce the number of style issues that need to be manually fixed.
Answering your 7 questions in order:
- TODO - please fix the requested locations first and then we'll reconsider the unit tests
- See comments
- Ignore this for now - lock the lower bound threshold to 1.0 and the z-score to 2.575. We can adjust them after PoC testing.
- It's one or the other. During testing, we can compare how the Pearson correlation filtering performs. Based on past experience, I don't think it helps very much and I would personally prefer to remove it entirely. This would be in a separate commit.
- See 1
- After unit tests have been resolved we can introduce logging code for testing purposes
- See comments
Thanks
featureShardStats match {
  case Some(featureShardStats) => featureShardStatsInMap = Some(featureShardStats.toMap[FeatureShardId, BasicStatisticalSummary])
  case None => None
}
In Scala, var means variable and val means constant. As a general rule, var should not be used in Scala unless using a val would be inefficient or logically difficult.
Looking at how featureShardStats is used, this whole block should be scrapped, and calculateAndSaveFeatureShardStats should be modified to return a Map instead of an Iterable.
fixed
@@ -41,4 +43,6 @@ object Types {
   // A "feature shard" is an arbitrary set of "feature bags"
   // A random effect model corresponds to a single feature shard
   type FeatureShardId = String
+  type FeatureShardStatisticsMap = Map[FeatureShardId, BasicStatisticalSummary]
+  type FeatureShardStatisticsMapOpt = Option[FeatureShardStatisticsMap]
These types should be scrapped. The FeatureShardStatistics type in the GameTrainingDriver should replace Iterable with Map (see comments on GameTrainingDriver).
fixed
@@ -15,17 +15,15 @@
 package com.linkedin.photon.ml.estimators

 import scala.language.existentials

This whitespace should not be removed
fixed
import org.apache.spark.SparkContext
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.slf4j.Logger

This whitespace should not be removed
fixed
@@ -43,6 +41,7 @@ import com.linkedin.photon.ml.supervised.classification.{LogisticRegressionModel
 import com.linkedin.photon.ml.supervised.regression.{LinearRegressionModel, PoissonRegressionModel}
 import com.linkedin.photon.ml.util._

Extra whitespace
fixed
@@ -441,6 +447,7 @@ object GameTrainingDriver extends GameDriver {
   .setCoordinateDescentIterations(getRequiredParam(coordinateDescentIterations))
   .setComputeVariance(getOrDefault(computeVariance))
   .setIgnoreThresholdForNewModels(getOrDefault(ignoreThresholdForNewModels))
+  .setFeatureStats(featureShardStatsInMap)
We should only set this if the training task is logistic regression.
fixed
val t = (x / m) / py
val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
(t, variance)
}
py should not be an input.
fixed
quartile: Double
): Double = {
  t * math.exp(-math.sqrt(variance) * quartile)
}
Unnecessary braces
fixed
quartile: Double
): Double = {
  t * math.exp(math.sqrt(variance) * quartile)
}
Unnecessary braces
fixed
key: Int
): Boolean = {
  (globallFeatureNonZero(key) * 1.0) / globalPopulationNumSamples == globalMean(key)
}
This function shouldn't need to exist (see other comments), but if it did:
- Passing an Array and an index to access within the Array is poor craftsmanship.
- Why multiply by 1D?
- Unnecessary braces.
- Need whitespace after the function definition.
This function is necessary, as Josh pointed out, to avoid writing too much inline. As we further refine our predicate, we can make it faster.
Force-pushed from 93d78d3 to 8e8eb55.
Fixed issues, but unit tests are not done; this is an early update for pending reviews.
Force-pushed from 16e0374 to 6dace03.
Pushed with all tests passing and issues resolved.
Resolved review threads (outdated):
- photon-api/src/main/scala/com/linkedin/photon/ml/data/RandomEffectDataSet.scala (×3)
- photon-api/src/main/scala/com/linkedin/photon/ml/data/LocalDataSet.scala (×3)
featureShardStatsOpt match {
  case Some(featureShardStats) => {
    val featureShardId = randomEffectDataConfiguration.featureShardId
    val (binaryIndices, nonBinaryIndices) = filterBinaryFeatures(featureShardStats(featureShardId))
"Filter" isn't a good name for this, because it's not really filtering a collection. When you see filter, you expect it to be (seq in, seq out).
@joshvfleming segregateBinaryFeatures?
@YazhiGao We can get around the intercept issue here, for now. Since we still need to run experiments, this code doesn't need to be 100% production ready yet, so what we can do is this:
val (binaryIndices, rawNonBinaryIndices) = filterBinaryFeatures(featureShardStats(featureShardId))
val nonBinaryIndices = rawNonBinaryIndices + globalFeatureShardStats.mean.length
Since we know that the intercept index is always the last index (for now), we can insert it into the set of non-binary indices to guarantee that it is not filtered.
I spoke with @joshvfleming offline and we agreed that we need a refactor to move the concept of the intercept into the GameEstimator
, so this should be a reasonable hack for now, until that refactor takes place.
photon-api/src/main/scala/com/linkedin/photon/ml/data/RandomEffectDataSet.scala (resolved, outdated)
Have not reviewed test cases yet.
Resolved review threads (outdated):
- photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/GameTrainingDriver.scala
- photon-lib/src/main/scala/com/linkedin/photon/ml/stat/BasicStatisticalSummary.scala (×3)
- photon-api/src/main/scala/com/linkedin/photon/ml/data/LocalDataSet.scala (×2)
  }
}

lowerBounds
As I mentioned in my previous review, we don't need all of these cases. We don't have to apply the paper exactly as it is written, since we have a special case where X ⊂ Y, always.
This entire block, from line 281 to line 316 can be replaced with the following:
binaryIndices
.map { key =>
val x_col = x(key)
val m_col = m(key)
val y_col = y(key)
val n_col = n(key)
val lowerBound = if (y_col == 0 || x_col == 0 || x_col == m_col) {
0D
} else {
val (t, variance) = computeMeanAndVariance(x_col, m_col, y_col, n_col)
if (t < 1D) {
1D / computeUpperBound(t, variance, zScore)
} else {
computeLowerBound(t, variance, zScore)
}
}
(key, lowerBound)
}
.toMap
fixed
We shouldn't add x_col == m_col as a condition for the 0D output, right?
After some thinking, I believe the original code is right. We should only eliminate the x != 0 && y == 0 case; this is the only impossible case in our data, and the original code already excludes it.
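A hedged sketch of this case analysis (hypothetical function name, assuming the random-effect samples are a subset of the global population, so x > 0 with y == 0 cannot occur):

```scala
object RatioCICases {

  // Returns the CI lower bound, defaulting to 0.0 when there is no evidence
  // (x == 0 or y == 0); note x == m is intentionally NOT a default-0 case.
  def ciLowerBoundOrDefault(x: Double, m: Double, y: Double, n: Double, z: Double): Double =
    if (x == 0.0 || y == 0.0) {
      0.0
    } else {
      val t = (x / m) / (y / n)
      val variance = 1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n
      // When t < 1, use the reciprocal of the upper bound so the result is comparable
      if (t < 1.0) 1.0 / (t * math.exp(math.sqrt(variance) * z))
      else t * math.exp(-math.sqrt(variance) * z)
    }
}
```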
photon-api/src/main/scala/com/linkedin/photon/ml/data/LocalDataSet.scala (resolved, outdated)
  case None => featuresIndices
}
featuresAsBV.map { v => v(featuresIndicesWithoutIntercept.toIndexedSeq).toVector }
}
This intercept issue has become larger than we thought. As mentioned elsewhere, we should just hardcode the logic for now and return to it in a separate intercept refactor. I propose we use this block of code for now:
private def calculateStatistics(
data: DataFrame,
featureIndexMapLoaders: Map[FeatureShardId, IndexMapLoader]): Map[FeatureShardId, BasicStatisticalSummary] =
featureIndexMapLoaders
.map { case (featureShardId, indexMapLoader) =>
// Calling rdd explicitly here to avoid a typed encoder lookup in Spark 2.1
val featuresAsBV: RDD[BreezeVector[Double]] = data
.select(featureShardId)
.rdd
.map(x => VectorUtils.mlToBreeze(x.getAs[SparkMLVector](0)))
val featuresForSummary = indexMapLoader
.indexMapForDriver()
.get(Constants.INTERCEPT_KEY)
.map(_ => featuresAsBV.map(dropIntercept))
.getOrElse(featuresAsBV)
(featureShardId, BasicStatisticalSummary(featuresForSummary))
}
private def dropIntercept(baseVector: BreezeVector[Double]): BreezeVector[Double] = baseVector match {
case dense: DenseVector[Double] =>
new DenseVector[Double](dense.data, dense.offset, dense.stride, dense.length - 1)
case sparse: SparseVector[Double] =>
new SparseVector[Double](sparse.index.filterNot(_ < sparse.length - 1), sparse.data, sparse.length - 1)
}
what if there is no intercept?
How can dropIntercept handle the case with no intercept?
It only gets called if there is an intercept.
Force-pushed from 46fc425 to 3122532.
Unit tests and integration tests passed.
Should we take the case where
Force-pushed from 79d405e to 2e73bb6.
@YazhiGao @joshvfleming @mayiming I'm set in my position that following the paper 100% is incorrect, due to our specific application.
Thus, the code can look something like this:
The X_MIN in the paper is
@ashelkovnykov is right. The formulation in the paper is between any two ratios, but in our case one set is a subset of the other. So there are cases that can't arise, and because of this, the code can be much simpler.
Resolved review threads:
- photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/GameTrainingDriver.scala
- photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/GameTrainingDriver.scala (outdated, ×2)
- ...n-lib/src/integTest/scala/com/linkedin/photon/ml/stat/BasicStatisticalSummaryIntegTest.scala (outdated)
- photon-lib/src/main/scala/com/linkedin/photon/ml/stat/BasicStatisticalSummary.scala (outdated)
- photon-api/src/test/scala/com/linkedin/photon/ml/data/LocalDataSetTest.scala (outdated, ×3)
- photon-api/src/test/scala/com/linkedin/photon/ml/data/LocalDataSetTest.scala
)
val x_0_y_0_g_positive = Array(0.0)

// Second case x == 0 and y != 0
See my comments about splitting the responsibilities of the tests. The cases for testComputeRatioCILowerBound should be:
- x = 0, y = 0
- x = m, y = n
- x = 0, 0 < y < n, epsilon = default
- x = 0, 0 < y < n, epsilon = something else
- x = m, 0 < y < n
- 0 < x < m, 0 < y < n
Force-pushed from 2e73bb6 to e562b9b.
Fixed issues except for the test-case redesign; all current tests passed.
Ready for testing - @YazhiGao and I will work on the unit tests later, pending good experimental results from Baolei
Resolved review threads:
- photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/GameTrainingDriver.scala
- photon-api/src/main/scala/com/linkedin/photon/ml/data/LocalDataSet.scala (outdated, ×6)
Force-pushed from e562b9b to ffc2a44.
Ready for experiments now.
@@ -212,6 +254,116 @@ object LocalDataSet {
     new SparseVector(filteredIndexBuilder.result(), filteredDataBuilder.result(), features.length)
   }

+  /**
+   * Compute Ratio Confidence Interval lower bounds.
Suggest adding more description, and in particular referencing the paper/book we refer to.
val n_col = n(key)

val lowerBound = if (y_col == 0.0 || y_col == n_col) {
  0D
For m_col = 0, we may also want to return 0D here, since entering the next branch would produce Infinity.
This is currently the implementation of ratio modeling for feature selection of random effects in Photon.
I follow the algorithm described in the original publication, but some twists were made based on discussions with Yiming and Alex.
Unit tests all pass.
Integration tests all pass.
The algorithm in practice (closely tied to the codebase rather than just the mathematical expression) is as follows:
1. Pass in the featureStatisticSummary.
2. Identify the binomial columns.
3. Compute the lower bound for the binomial columns based on the t value.
4. Select features based only on the lower-bound criterion (non-binomial and intercept columns are kept automatically).
As a WIP commit, there are things to polish in the near future, since we currently focus on the feasibility of this experimental method and try to minimize user-side changes:
- Unit tests do not fully cover all scenarios of feature selection. Currently the binomial cases are not selected; we need to craft data covering all cases.
- The binomial feature column identification predicate needs to be stronger. The current solution is inherently flawed; we need more computation at the feature summary stage to ensure this.
- Hyperparameter interface design: for convenience, the user-side interface for passing in the normal distribution quartile and the lower-bound threshold needs a redesign.
- The relationship with Pearson correlation feature selection: we need another parameter to choose between the feature selection algorithms, or to mix them at a later stage.
- The crafted test data needs some changes; currently some unneeded feature summary entries are not carefully addressed.
- Further experiment and benchmark reports after regression tests.
- The way we currently keep non-binary and intercept columns is not good for the planned feature ranking report; it needs a redesign.
@joshvfleming @ashelkovnykov