implement svychisq for multiply-imputed data #236

Open
ajdamico opened this issue May 15, 2017 · 16 comments
@ajdamico
Owner

https://stats.stackexchange.com/questions/78479/how-to-run-chi-squared-test-on-imputed-data

I am not totally sure how to do it (one commenter says there's a mistake). mitools:::MIcombine.default does not properly deal with a list of htest objects (the this_result object below).

If you think it's something you can implement, we can temporarily put it in lodown (which has lots of other dataset-specific helper functions, like pnad_postStratify) and ask dr. lumley whether he would prefer for it to go in library(mitools). I currently have two others implemented at: https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R

thanks

library(mitools)
library(survey)

imp1 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp2 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )

imp.design <- svydesign( ~1 , data = imputationList( list( imp1 , imp2 ) ) )

# svytotal works fine
MIcombine( with( imp.design , svytotal( ~col1 ) ) )

this_result <- with( imp.design , svychisq( ~col1+col2 ) )

# svychisq fails inside of mitools:::MIcombine.default
MIcombine( this_result )

# object types
class( this_result )
lapply( this_result , class )
ajdamico added this to the v0.1.0 milestone May 15, 2017
@ajdamico
Owner Author

@guilhermejacob could you:

(1) add your code here with maybe two easily-runnable test cases
(2) describe what you've done to djalma and see if he agrees with your math
(3) integrate your code into https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R
(4) decide whether to propose it to lumley as an addition to library(mitools)

@guilhermejacob
Contributor

Ok, I was able to replicate this example (I don't think it's great, but it works).
There are a lot of warnings about this method.

And I don't know how to integrate it so that it will run on all Wald statistics.
Anyways, I think it's kind of good.

So, answering your points in the previous comment:

(1) If you want to give it a try, run this:

# our test function: pool m chi-square statistics following Li et al. (1991)
test_fun <- function (results, variances, call = sys.call(), df.complete = Inf, ...) {
  m <- length(results)
  if ( m != 3 ) { warning( "This test was designed for m = 3. Use it cautiously.") }
  oldcall <- attr(results, "call")

  # numerator degrees of freedom, taken from the first imputation's test
  ndf <- results[[1]]$parameter[[1]]

  # extract the chi-square statistics and their square roots
  if (missing(variances)) {
    results <- lapply(results, function(x) x$statistic )
    variances <- sqrt( unlist(results) )
  }

  # average of the statistics across the m imputations
  cbar <- results[[1]]
  for (i in 2:m) {
    cbar <- cbar + results[[i]]
  }
  cbar <- cbar/m

  # between-imputation variability: sample variance of the square roots
  evar <- var(variances)
  r <- (1 + 1/m) * evar

  # D_2 statistic (eq. 2.2) and its denominator degrees of freedom
  D_2 <- ( cbar/ndf - r * ( m + 1 ) / ( m - 1) ) / ( 1 + r )
  v3 <- (ndf^(-3/m)) * (m-1) * ( 1 + 1/r)^2

  # p-value from the F(ndf, v3) reference distribution
  pval <- stats::pf( D_2 , df1 = ndf , df2 = v3 , lower.tail = FALSE )[[1]]

  rval <- list(statistic = D_2 , ndf = ndf , ddf = v3 , p.value = pval )
  warning( "The real p-value could be half or twice the one presented below." )
  rval
}


library(mitools)
library(survey)

imp1 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp2 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )

imp.design <- svydesign( ~1 , data = imputationList( list( imp1 , imp2 ) ) )
# imp.design <- as.svrepdesign( imp.design , replicates = 100 )

# svytotal works fine
MIcombine( with( imp.design , svytotal( ~col1 ) ) )

# svymean also works fine
MIcombine( with( imp.design , svymean( ~col1 ) ) )

this_result <- with( imp.design , svychisq( ~col1+col2 , statistic = "Chisq" ) )

# svychisq fails inside of mitools:::MIcombine.default
MIcombine( this_result )

# pool the Rao-Scott chi-square statistics with the function above
results <- with( imp.design , svychisq( ~col1+col2 , statistic = "Chisq" ) )

test_fun( results = results )

(2) It takes the test statistics, calculates the variance of the statistics' square roots across the m imputations, then calculates the statistic in Li et al. (1991). From that we get the p-value using the degrees of freedom from the chi-square test and the formula's denominator degrees of freedom. (A sketch of the combining rule follows this list.)

(3) We have to think about that. The only real improvement this function offers is minimal: it gets the dfs from the chi-square test. If you just supply it with the test statistics, you could use another package like miceadds to do it.

(4) Discussion ahead.
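
For reference, here is a sketch of the combining rule that test_fun follows, as I read Li et al. (1991); the LaTeX notation below is mine, so please check it against the article:

% m chi-square statistics d_1, ..., d_m, each with k numerator degrees of freedom
\bar{d} = \frac{1}{m} \sum_{i=1}^{m} d_i , \qquad
r = \left( 1 + \frac{1}{m} \right) \widehat{\operatorname{Var}}\left( \sqrt{d_1}, \ldots, \sqrt{d_m} \right)

% D_2 is referred to an F distribution with k and \nu degrees of freedom
D_2 = \frac{ \bar{d}/k - \frac{m+1}{m-1} \, r }{ 1 + r } \sim F_{k, \nu} , \qquad
\nu = k^{-3/m} (m - 1) \left( 1 + \frac{1}{r} \right)^{2}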

@guilhermejacob
Contributor

The test was meant for cases when m = 3. So, we have the following warnings:

The pooled Chi^2-test can be used when k is large, if U and B cannot be retrieved, or if only Chi^2-statistics are available. Compared to the other three methods, however, the results from the Chi^2-test are considerably less reliable. The results were optimized for m = 3 and, unlike the other tests, do not necessarily improve for larger m. According to Li et al. (1991a) the true result could be within a range of one half to twice the obtained p-value. This test should only be used as a rough guide. - Van Buuren (2012, p.159).

and

Li, Meng, et al. (1991) used Monte Carlo simulations to study the performance of D2 statistic under a variety of conditions. Their results suggest that type I error rates can either be too high or too low, depending on the fraction of missing information (e.g., when the fraction of missing information was less than 20%, type I errors dropped below the nominal 0.05 level). Their simulations also indicate that D2 has lower power than D1. Considered as a whole, these simulation results suggest that D2 does not yield accurate inferences, and the authors recommend using the procedure “primarily as a screening test statistic” (p. 83). - Enders (2010, p.240).

The original article can be found here: Li et al. (1991).

@DjalmaPessoa
Collaborator

The step below is not clear to me:

variances <- sqrt( unlist(results))

Could you please cite the formula in Li et al. (1991) that you are using to combine the results?

I believe a better way of testing the independence in a two-way table could be to fit a loglinear model using the function svyloglin and testing whether the interaction effect is null. It might be possible to use MIcombine for that.

@guilhermejacob
Contributor

I think it's equation 2.2 on page 74. I'm taking the square roots of every imputed result.
The part inside the square brackets in 2.2 is the sample variance of the square roots of each result.
Makes sense, right?

@DjalmaPessoa
Collaborator

DjalmaPessoa commented May 29, 2017 via email

@DjalmaPessoa
Collaborator

Where is the distribution of the test statistic in 2.1 defined? What are the degrees of freedom in the numerator and in the denominator of the F distribution?

@guilhermejacob
Contributor

Yes, I'm using the correct formula. It's also the same in the books above.
The numerator degrees of freedom k are those from the chi-square test (as on p. 67 of Li et al., 1991). The denominator degrees of freedom are defined in equations 2.16 and 2.17 of Li et al. (1991).

@DjalmaPessoa
Collaborator

You mentioned that:

The test was meant for cases when m = 3?

In your example k = (c-1) x (r-1) = 16; does the value of k have no influence on its applicability?

@guilhermejacob
Contributor

It doesn't seem to be a problem, judging from the books I read. Section 2.4 in the article might give some additional information about this.

@DjalmaPessoa
Collaborator

DjalmaPessoa commented May 30, 2017 via email

@guilhermejacob
Contributor

Sometimes, it's a matter of faith haha
It should work with Wald too. But X^2 uses the Rao-Scott adjustment, if that's what you're asking.

@DjalmaPessoa
Collaborator

I just said I'd rather use Wald's test!

@guilhermejacob
Contributor

guilhermejacob commented Jun 6, 2017

@DjalmaPessoa , you are absolutely right.
After rereading your comments, I decided to start from scratch.
Instead of writing it myself, I'll just borrow the miceadds::micombine.chisquare function, passing it just two arguments: the pooled results from svychisq and the chi-square degrees of freedom.

(1) add your code here with maybe two easily-runnable test cases

MIsvychisq <- function( formula , design , statistic = "Chisq" , ... ) {
  
  if ( !( statistic %in% c( "Chisq" ) ) ) { stop( "This method is only implemented for `statistic = 'Chisq'`." ) }
  
  # run svychisq on each imputed design
  m <- with( design , svychisq( formula , statistic = statistic ) )
  
  # extract the chi-square statistic and degrees of freedom from each test
  dk <- as.numeric( sapply( m , FUN = function( x ) x[["statistic"]] ) )
  df <- as.numeric( sapply( m , FUN = function( x ) x[["parameter"]][ "df" ] ) )
  
  # pool them with the Li et al. (1991) rule implemented in miceadds
  return( miceadds::micombine.chisquare( dk = dk , df = df[[1]] , display = TRUE , version = 1 ) )
  
}

library(mitools)
library(survey)

imp1 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp2 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp3 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )

imp.design <- svydesign( ~1 , data = imputationList( list( imp1 , imp2 , imp3 ) ) )

MIsvychisq( formula = ~col1+col2 , imp.design )

It defaults to statistic = 'Chisq', as I'm not sure how it works with F-statistics.
(2) describe what you've done to djalma and see if he agrees with your math
Well, there's no math on my part here.
(3) integrate your code into https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R
Will do.
(4) decide whether to propose it to lumley as an addition to library(mitools)
This function is too small to be a meaningful addition to survey.

@guilhermejacob
Contributor

For pooled F-statistics, this should help: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029775/pdf/nihms560910.pdf

@DjalmaPessoa
Collaborator

Example using Lumley's suggestion:

library(mitools)
data.dir<-system.file("dta",package="mitools")
files.men<-list.files(data.dir,pattern="m.\\.dta$",full=TRUE)
men<-imputationList(lapply(files.men, foreign::read.dta))
files.women<-list.files(data.dir,pattern="f.\\.dta$",full=TRUE)
women<-imputationList(lapply(files.women, foreign::read.dta))
men<-update(men, sex=1)
women<-update(women,sex=0)
all<-rbind(men,women)
library(survey)
designs<-svydesign(id=~id, strata=~sex, data=all)
results_loglinear <- with(designs, svyloglin(~sex*alcdos))
MIcombine(results_loglinear)

But if we want to test whether the interaction effect between sex and alcdos is null, we need to get the p-values for the combined object.
This would correspond to the test using svychisq with the Rao-Scott correction.
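
One rough way to get such a p-value from the combined object, just as a sketch: it assumes the interaction coefficients can be picked out by a ":" in their names, and it ignores the usual MIcombine degrees-of-freedom adjustment, so the result is only approximate.

# sketch: Wald test of the pooled sex:alcdos interaction terms
combined <- MIcombine( results_loglinear )
b <- coef( combined )
V <- vcov( combined )
idx <- grepl( ":" , names( b ) )   # assumption: interaction terms contain ":"
W <- t( b[ idx ] ) %*% solve( V[ idx , idx ] ) %*% b[ idx ]
# one degree of freedom per interaction coefficient
pchisq( as.numeric( W ) , df = sum( idx ) , lower.tail = FALSE )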
