
Using PanelMatch with large DFs #89

Closed
LuMesserschmidt opened this issue Jan 13, 2022 · 1 comment

@LuMesserschmidt

Dear colleagues,

Thank you for providing such an innovative public good to the community. I am researching how FDI projects affect local nighttime light development. I have seen your answers on issues #53 and #46. Working with more than 15 million rows, I am struggling with memory issues and system abort errors (even though I work on a 500 GB RAM cloud with 20 nodes), and I hope that you can help me overcome them:

**Let me provide a bit more background on the data:**
I have divided the world into raster cells (~900k), and for each cell I have 17 years of observations (2002–2018): how light pollution developed ("lights"), whether the cell was treated in that year ("treatment"), and how much FDI it received ("fdi_volume"). Moreover, I control for the population size ("hyde") in each raster cell. Many cells have never been treated, and the distribution of FDI projects is extremely uneven.

Here is a small reproducible example:
```r
library(tidyverse)

# Set the seed before any random draws so the example is fully reproducible.
set.seed(1000)

# Panel dimensions: 17 years x 3 countries x 5 projects = 255 rows.
year <- as.numeric(2002:2018)
country <- c("AFG", "ALB", "Country")
project_num <- 1:5

treatment <- sample(c(0, 1), 255, replace = TRUE)  # treatment dummy
lights <- runif(255, 1, 63)                        # nighttime light intensity
hyde <- runif(255, 1000, 200000)                   # population size
fdi_volume <- runif(255, 1, 200)                   # FDI volume

# Build the full year x country x project grid, then attach the simulated variables.
dt <- merge(year, country) %>% dplyr::rename(year = x, country = y)
dt <- merge(dt, project_num) %>%
  dplyr::rename(project_num = y) %>%
  mutate(id = paste(country, project_num, sep = "-"))
dt <- cbind(dt, treatment, lights, hyde, fdi_volume)
```
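
For reference, here is roughly how I call PanelMatch and PanelEstimate on data of this shape. This is only a sketch following the package README; the integer conversion of the identifiers, the covariate formula, and the lag/lead choices are assumptions on my side, not a verified specification.

```r
library(PanelMatch)

# PanelMatch expects integer unit and time identifiers (assumption on my side),
# so convert the string id ("AFG-1", ...) and the year column first.
dt$unit_id <- as.integer(as.factor(dt$id))
dt$year <- as.integer(dt$year)

# Matching step, mirroring the structure of the package README example.
pm <- PanelMatch(lag = 4, time.id = "year", unit.id = "unit_id",
                 treatment = "treatment", refinement.method = "mahalanobis",
                 data = dt, match.missing = TRUE,
                 covs.formula = ~ I(lag("lights", 1:4)) + I(lag("hyde", 1:4)),
                 size.match = 5, qoi = "att", outcome.var = "lights",
                 lead = 0:4, forbid.treatment.reversal = FALSE)

# Estimation step on the matched sets.
pe <- PanelEstimate(sets = pm, data = dt)
summary(pe)
```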

**What solutions have you discovered for working with large datasets?**
I found that Mahalanobis distance matching worked under specific circumstances, while propensity score matching and weighting always failed. I tried to find workarounds by splitting the sample or writing a loop, but I haven't yet come up with a satisfactory solution (I read your wiki on Matched Set Objects).

**Alternatives**
If a loop is not feasible, there might be another workaround: so far, I include the country as a covariate. One idea would be to split the dataset by country and run PanelMatch on each country individually (a minimal sketch of this follows the list below). But here I have some doubts:

  • First, wouldn't it bias my results if I do not take treated cells from other countries into account?
  • Second, given that the distribution of treated cells is extremely uneven, some countries yield only very small numbers of matched sets, while bigger countries yield many. Would splitting by country hurt the robustness of the estimates in smaller countries because of missing matches?
  • Third, how can I calculate the average treatment effect when countries differ in the number, volume, and size of FDI projects? Should I simply take the average of all country-level effects, or what would be the most sophisticated strategy (similar to issue Is it possible to weight the ATT estimates? #75)?
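
To make the split-by-country idea concrete, this is the kind of loop I have in mind. The PanelMatch arguments simply mirror the sketch above and are assumptions, not a tested specification:

```r
# Run PanelMatch separately on each country's subset and keep the
# per-country estimation objects for later pooling.
results <- list()
for (ctry in unique(dt$country)) {
  sub <- dt[dt$country == ctry, ]
  sub$unit_id <- as.integer(as.factor(sub$id))
  pm_c <- PanelMatch(lag = 4, time.id = "year", unit.id = "unit_id",
                     treatment = "treatment", refinement.method = "mahalanobis",
                     data = sub, match.missing = TRUE,
                     covs.formula = ~ I(lag("lights", 1:4)) + I(lag("hyde", 1:4)),
                     size.match = 5, qoi = "att", outcome.var = "lights",
                     lead = 0:4, forbid.treatment.reversal = FALSE)
  results[[ctry]] <- PanelEstimate(sets = pm_c, data = sub)
}
```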

If you allow me, let me post a few more questions here instead of opening new issues:

  • Has there been any development on issue Extend method to continuous treatments #61? I do have the FDI volume for each project, and it would certainly be more robust to use this instead of a treatment dummy.
  • I have a democracy variable that I wanted to include in the covariates, but given the high collinearity, it seems to result in errors. Do you agree that it is reasonable to exclude this dummy, given that regime shifts barely occurred between 2002 and 2018 and that the information is thus already captured by the country dummy?
  • Choosing lead, lag, and size.match still feels a bit arbitrary to me. I can understand that you don't want to give one-size-fits-all answers, but have you developed any additional guidelines or rules of thumb that would satisfy political science reviewers? Otherwise, I will run the function for several lags and leads and combine the treatment effects as point estimates with confidence intervals in a ggplot (a rough sketch of what I mean follows below).
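
To illustrate the last point, here is roughly what I mean: rerun the matching for a few lag lengths, collect the lead-specific point estimates and confidence intervals, and overlay them in one plot. The way I pull the numbers out of summary() is an assumption about the returned object structure and may need adjusting; everything else mirrors the sketch above.

```r
library(ggplot2)

# Sketch: rerun matching/estimation for several lag lengths and collect the
# lead-specific estimates (reusing dt with the integer ids from above).
plot_df <- do.call(rbind, lapply(2:5, function(L) {
  pm_l <- PanelMatch(lag = L, time.id = "year", unit.id = "unit_id",
                     treatment = "treatment", refinement.method = "mahalanobis",
                     data = dt, match.missing = TRUE,
                     covs.formula = ~ I(lag("lights", 1:2)) + I(lag("hyde", 1:2)),
                     size.match = 5, qoi = "att", outcome.var = "lights",
                     lead = 0:4, forbid.treatment.reversal = FALSE)
  pe_l <- PanelEstimate(sets = pm_l, data = dt)
  est <- summary(pe_l)$summary  # assumed accessor; check str(summary(pe_l)) in your version
  data.frame(lag = L, lead = rownames(est), estimate = est[, "estimate"],
             lower = est[, "2.5%"], upper = est[, "97.5%"])
}))

ggplot(plot_df, aes(x = lead, y = estimate, colour = factor(lag))) +
  geom_pointrange(aes(ymin = lower, ymax = upper),
                  position = position_dodge(width = 0.4)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(colour = "lag")
```
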
@LuMesserschmidt (Author)

To give a brief follow-up on my case:

  • I was able to run most of the models on a 360 GB RAM cloud, but at some point it still crashes. I am now moving to our 9 TB RAM cluster and will hopefully make progress. Observing the memory usage, I noticed that the PanelMatch functions run smoothly most of the time but create memory peaks that force the termination of the process. One idea to overcome this would be to provide a function that allows for parallelization after the matched sets have been created. To give an example:

I looped the PanelMatch function by country (as described above) and calculated the treatment effect for every country. I then calculated the pooled mean and variance (https://www.ncbi.nlm.nih.gov/books/NBK56512/). This somewhat inflates the standard errors, but the effect estimates are nearly the same. Do you have any opinion on whether this looping violates any of your model assumptions?
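
For transparency, this is roughly how I pooled the per-country results, following the pooled mean/variance formulas in the linked reference. How the per-country estimate, standard deviation, and group size are extracted is left open here; only the pooling arithmetic itself is shown.

```r
# Pool per-country point estimates, assuming vectors of estimates (est),
# their standard deviations (sd), and group sizes (n) are already extracted.
pool_mean_var <- function(est, sd, n) {
  m_pooled <- sum(n * est) / sum(n)  # size-weighted pooled mean
  # Pooled variance = within-group component + between-group component.
  var_pooled <- (sum((n - 1) * sd^2) + sum(n * (est - m_pooled)^2)) / (sum(n) - 1)
  c(mean = m_pooled, var = var_pooled)
}

# Hypothetical example with three countries:
pool_mean_var(est = c(0.8, 1.1, 0.6), sd = c(0.3, 0.4, 0.2), n = c(120, 45, 300))
```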

Thanks!
