V64 getting duplicate ind_ids when building kibbleset #248
Our long-term plan is to make it OK for there to be duplicate ind_ids in cohorts, because we want to support multiple differentiated values for any given variable for each individual. This would allow us to support longitudinal studies. Adding this support will, unfortunately, take more time than we currently have available, so I'm looking into a short-term "fix" that avoids the problem without adversely affecting our long-term plans. The first step is to be certain that our current duplicates issue is what we suspect it is. Thus far, we have:
Just so it's on the record somewhere other than my work diary - the "unable to preview download" error Jake's been running into is related to this issue: V2239 (one of those identified above) is part of the variable set he's trying to download. So resolving that would make his test pass.
DATABASE UPDATES for this problem: UPDATE equivalence_groups SET first_member = 'ANSYMMANY' WHERE first_member = 'V2239' AND variable_name <> 'V2239';
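For anyone who wants to dry-run an update like the one above before touching the real database, here's a minimal sketch using an in-memory SQLite table. The two-column schema, the helper name `preview_and_update`, and the toy rows are all assumptions for illustration; only the table name, the column names, and the WHERE clause come from the statement above.

```python
import sqlite3

# Same predicate as the UPDATE above: rows in the old group, minus the raw variable itself.
WHERE = "first_member = ? AND variable_name <> ?"

def preview_and_update(conn, old_first, new_first):
    """List the rows the UPDATE would touch, then apply it in one transaction."""
    affected = conn.execute(
        f"SELECT variable_name FROM equivalence_groups WHERE {WHERE} "
        "ORDER BY variable_name",
        (old_first, old_first),
    ).fetchall()
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            f"UPDATE equivalence_groups SET first_member = ? WHERE {WHERE}",
            (new_first, old_first, old_first),
        )
        if cur.rowcount != len(affected):
            raise RuntimeError("row count changed between preview and update")
    return [name for (name,) in affected]

# Hypothetical toy data, NOT the real contents of equivalence_groups.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE equivalence_groups (first_member TEXT, variable_name TEXT)"
)
conn.executemany(
    "INSERT INTO equivalence_groups VALUES (?, ?)",
    [
        ("V2239", "V2239"),
        ("V2239", "NRGR_V2239"),
        ("V2239", "ANSYMMANY"),
        ("V100", "V100"),
    ],
)
changed = preview_and_update(conn, "V2239", "ANSYMMANY")
remaining = conn.execute(
    "SELECT variable_name FROM equivalence_groups WHERE first_member = 'V2239'"
).fetchall()
```

In this toy run the raw V2239 row is the only one left under the old first_member, which is the effect the `variable_name <> 'V2239'` condition is there to produce.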
Changing the milestone on this one: while we can't solve the whole problem right away, we're trying to at least resolve substantial parts of it before final release, since it's bitten our release candidate testers.
Switching this back to Second Public Release, as the immediate stuff (or at least the biggest step towards same) is tracked in #259. To sum up: per discussion with @WValenti (and he can feel free to clarify further as needed) there are three classes of issue here, described as best I can recollect:
For example, the "V2239" problem that Jake ran into was an example of case 2 above. There were two values that were actually semantically the same but encoded differently; Bill had created a "NRGR_V2239" to fix the encoding issue so that V2239's information could still be found in the equivalence group/Fully Harmonized Variable, but then the original "raw" V2239 (with the original, different encoding) was also inadvertently included in the equivalence group. He removed it and All Was Well - but for that variable only, and these things have to be gone through one at a time, and not all of them are going to be that "easily" resolved.

Case 3 above, in particular, is a really thorny problem, and the "ideal fix" - support for longitudinal data - is potentially years away. It's an enhancement we absolutely want to add someday, of course, but for now we don't have it, and any decision on which value should ultimately "win" is arguably a group discussion, potentially on a variable-by-variable basis. And that by itself could possibly take weeks.

So arguably what we need to do is get a complete handle on the scope of the problem, and then decide how to proceed for the release. For the time being, we're dealing with the common case (case 1 above) in #259, as we believe the number of variables in cases 2 and 3 is relatively small.

(For the record - and I've also noted this in #259 - @WValenti was in favor of an approach that would mask all these cases in the meantime, so things would continue to appear to work without any crashes, with some known documented caveats. We didn't do that because I objected, out of concern about "side effects" that he - and I hope/assume I'm representing his perspective on this accurately - considered extremely unlikely. Said approach might be revisited if this turns out to be even nastier, but I'd want to have group buy-in first.)
Forgot to mention in the above - @WValenti is planning on going through those variables and fixing them where they can be quickly fixed, so we're not just "leaving it at that". :)
Stumbled across this entirely by chance. This makes the variable unusable in cohorts (or anything that uses cohorts internally), because we then get a "primary key duplicate" error trying to add the results to cohortInds.
@WValenti's initial working theory is that it has to do with some kind of "overlap" between interview variables and distribution variables - and since the fix for something like that involves a clinical decision, that would have to be discussed with collaborators.
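To make the "primary key duplicate" failure easier to diagnose, here's a rough sketch of a pre-check that could be run on a variable's results before they're added to cohortInds. The row shape (`(ind_id, value)` pairs), the function name `duplicate_ind_ids`, and the sample rows are assumptions for illustration, not how the kibbleset code actually represents its results.

```python
from collections import defaultdict

def duplicate_ind_ids(rows):
    """Map each ind_id that appears more than once to the list of values seen for it."""
    values_by_id = defaultdict(list)
    for ind_id, value in rows:
        values_by_id[ind_id].append(value)
    return {ind_id: vals for ind_id, vals in values_by_id.items() if len(vals) > 1}

# Hypothetical rows: ind_id 1 repeats with the same value,
# ind_id 3 repeats with conflicting values.
rows = [(1, "A"), (2, "B"), (1, "A"), (3, "C"), (3, "D")]
dups = duplicate_ind_ids(rows)
same_value = [i for i, vals in dups.items() if len(set(vals)) == 1]
conflicting = [i for i, vals in dups.items() if len(set(vals)) > 1]
```

Splitting the duplicates into "same value repeated" and "conflicting values" is only a first pass at sizing the problem: repeats with identical values could in principle be collapsed mechanically, while conflicting ones would still need someone to decide which value applies.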