Update scale_mgm() function using pooled SD and bump version.

DavisLaboratory · Mar 29, 2024 · dfe6155
1 parent 8ff99c3
commit dfe6155
Showing 5 changed files with 43 additions and 11 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: smartid
 Title: Scoring and Marker Selection Method Based on Modified TF-IDF
-Version: 0.99.4
+Version: 0.99.5
 Authors@R: 
     person("Jinjin", "Chen", email = "[email protected]", role = c("aut", "cre"),
            comment = c(ORCID = "0000-0001-7923-5723"))

diff --git a/NEWS.md b/NEWS.md
@@ -8,7 +8,7 @@
 
 # smartid 0.99.2
 
-* Added test for `gs_score` function.
+* Added test for `gs_score()` function.
 
 # smartid 0.99.3
 
@@ -17,3 +17,7 @@
 # smartid 0.99.4
 
 * Add details for TF, IDF, IAE functions.
+
+# smartid 0.99.5
+
+* Update `scale_mgm()` function, using pooled SD.
diff --git a/R/scale_mgm.R b/R/scale_mgm.R
@@ -1,4 +1,8 @@
-#' scale by mean of group mean in case extreme unbalanced data
+#' Scale by the mean of group means for imbalanced data
+#'
+#' @details
+#' \deqn{z=\frac{x-\frac{\sum_{k=1}^{n_D}\mu_k}{n_D}}{s_{pooled}}}
+#' where \eqn{s_{pooled}=\sqrt{\frac{\sum_{k=1}^{n_D}(n_k-1)s_k^2}{\sum_{k=1}^{n_D}n_k-n_D}}} and \eqn{n_D} is the number of groups
 #'
 #' @param expr matrix
 #' @param label a vector of group label
@@ -9,12 +13,28 @@
 #' @examples
 #' scale_mgm(matrix(rnorm(100), 10), label = rep(letters[1:2], 5))
 scale_mgm <- function(expr, label) {
-  ## compute sds
-  sds <- sparseMatrixStats::rowSds(expr, na.rm = TRUE)
-  # sds <- sapply(unique(label), \(i)
-  #               sparseMatrixStats::rowSds(expr[, label == i], na.rm = TRUE)
+  # ## compute overall sds
+  # sds <- sparseMatrixStats::rowSds(expr, na.rm = TRUE)
+
+  # ## compute group sds
+  # sds <- vapply(unique(label), \(i)
+  #               sparseMatrixStats::rowSds(expr[, label == i, drop = FALSE],
+  #                                         na.rm = TRUE),
+  #               rep(1, nrow(expr))
   #        ) # get sds of each group
-  # colnames(sds) <- unique(label)
+  # sds <- sparseMatrixStats::rowMeans2(sds)
+
+  ## compute pooled sds
+  sds <- vapply(
+    unique(label), \(i)
+    sparseMatrixStats::rowVars(expr[, label == i, drop = FALSE],
+      na.rm = TRUE
+    ),
+    rep(1, nrow(expr))
+  ) # get vars of each group
+  ng <- table(label)[as.character(unique(label))] # group sizes, indexed by name to keep the same order
+  sds <- sds %*% cbind(ng - 1) # sum of group variances weighted by (n_k - 1)
+  sds <- as.numeric(sqrt(sds / sum(ng - 1))) # pooled SD per row
 
   ## compute group means
   mgm <- vapply(

diff --git a/man/scale_mgm.Rd b/man/scale_mgm.Rd
diff --git a/vignettes/smartid_Demo.Rmd b/vignettes/smartid_Demo.Rmd
@@ -155,7 +155,11 @@ names(metadata(data_sim))
 
 ## Scale and Transform Score
 
-Scaling is needed to find the markers specific to the group, however, standard scaling might fail due to the rare populations. Here `smartid` uses a special scaling strategy `scale_mgm()`, which can scale imbalanced data by given group labels.
+Scaling is needed to find the markers specific to each group. However, standard scaling might fail when rare populations are present. Here `smartid` uses a dedicated scaling strategy, `scale_mgm()`, which scales imbalanced data using the given group labels. This avoids biasing feature selection towards features with larger numerical ranges.
+
+The scaling method is defined as follows:
+
+$$z=\frac{x-\frac{\sum_{k=1}^{n_D}\mu_k}{n_D}}{s_{pooled}},\ s_{pooled}=\sqrt{\frac{\sum_{k=1}^{n_D}(n_k-1)s_k^2}{\sum_{k=1}^{n_D}n_k-n_D}}$$
 
 The score will be transformed using softmax before passing to EM algorithm.
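
To make the pooled-SD scaling in this commit concrete, here is a minimal base-R sketch of the same idea: centre each feature on the mean of its group means, then divide by the pooled SD. The function name `pooled_scale` is hypothetical and the code avoids `sparseMatrixStats`, so it should be read as an illustration under those assumptions rather than as the package implementation.

```r
# Illustrative sketch in base R; `pooled_scale` is a hypothetical name,
# not a function exported by smartid.
pooled_scale <- function(expr, label) {
  label <- as.character(label)
  groups <- unique(label)

  # Row-wise statistic within each group (one column per group).
  per_group <- function(stat) {
    do.call(cbind, lapply(groups, function(g) {
      apply(expr[, label == g, drop = FALSE], 1, stat)
    }))
  }
  vars <- per_group(var)
  means <- per_group(mean)

  # Group sizes, looked up by name so the order matches `groups`.
  n_k <- as.numeric(table(label)[groups])

  # Pooled SD: sqrt( sum_k (n_k - 1) s_k^2 / (sum_k n_k - n_D) ).
  s_pooled <- sqrt(as.numeric(vars %*% (n_k - 1)) / sum(n_k - 1))

  # Mean of group means, so large groups do not dominate the centre.
  mgm <- rowMeans(means)

  # Vectors recycle down columns, so row i uses mgm[i] and s_pooled[i].
  (expr - mgm) / s_pooled
}

# One feature, two groups of size 3 and 2.
expr <- rbind(c(1, 2, 3, 10, 14))
label <- c("a", "a", "a", "b", "b")
z <- pooled_scale(expr, label)
```

In this toy input the group means are 2 and 12, so the centre is 7 regardless of group size, and the group variances 1 and 8 pool to (2·1 + 1·8) / 3 = 10/3. A group with a single sample has an undefined variance, so this sketch returns `NA` for such inputs.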