V64 getting duplicate ind_ids when building kibbleset #248

Open
Viqsi opened this issue Jul 29, 2024 · 6 comments
Assignees
Labels
FDD Foundational Document / Decision - issue with record/reference of a major design decision being made

Comments

@Viqsi
Member

Viqsi commented Jul 29, 2024

image

Stumbled across this entirely by chance. This makes the variable unusable in cohorts (or anything that uses cohorts internally), because we then get a "primary key duplicate" error trying to add the results to cohortInds.
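For anyone who wants to see the failure mode in isolation, here is a minimal sketch using an in-memory SQLite database. The column layout of cohortInds and the sample rows are assumptions made for illustration; the real schema is not shown in this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: the real cohortInds columns are not shown in this issue.
conn.execute(
    "CREATE TABLE cohortInds ("
    " cohort_id INTEGER, ind_id INTEGER,"
    " PRIMARY KEY (cohort_id, ind_id))"
)

# Invented variable results as (ind_id, value); ind_id 101 appears twice.
results = [(101, 34), (101, 34), (102, 51)]


def add_to_cohort(conn, cohort_id, rows):
    """Insert every result row; report a primary key collision instead of raising."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO cohortInds (cohort_id, ind_id) VALUES (?, ?)",
                [(cohort_id, ind_id) for ind_id, _value in rows],
            )
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"
    return "ok"


status = add_to_cohort(conn, 1, results)
inserted = conn.execute("SELECT COUNT(*) FROM cohortInds").fetchone()[0]
print(status, inserted)
```

Because the insert runs in one transaction, a single duplicated ind_id rejects the whole batch, which is why the variable becomes unusable rather than merely noisy.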

@WValenti's initial working theory is that it has to do with some kind of "overlap" between interview variables and distribution variables - and since the fix for something like that involves a clinical decision, that would have to be discussed with collaborators.

@WValenti
Member

WValenti commented Oct 9, 2024

Our long-term plan is to make it OK for there to be duplicate ind_ids in cohorts because we want to support multiple differentiated values for any given variable for each individual. This would allow us to support longitudinal studies. Adding this support will, unfortunately, take more time than we currently have available, so I'm looking into a short-term "fix" that avoids the problem without adversely affecting our long-term plans.

The first step is to be certain that our current duplicates issue is what we suspect it is. Thus far, we have:

  1. study variables that have intrinsic duplicates, such as the study that provided multiple versions of the interview instrument for a few individuals depending upon their level of progress and corrections made by the interviewers. These have theoretically been removed by use of an "interview_to_use" flag in DIGS_interviews, but it should be checked. CHECKED and we're good, no duplicates here.
  2. name-harmonized variables where study variable duplication propagates out. This tier of harmonization will not introduce new problems by definition. WRONG - SEX should show up here. I should have found this problem when I originally standardized DI-PAD, as same-name-different-range warrants RENAMING in the raw.
  3. equivalence grouped variables where name-harmonized variable duplication propagates out, OR where the list of variable_names equivalenced includes multiple representations of the same information for some ind_ids. There are currently 27 of those:
    I10410 (=value) - components 3.0r7 DCOCTHINKCLEAR / PSMR both from tsid 16, sects J & K, pgs 79 / 96, PSMR should not be in group. variable_name_equivalence_pairs shows the association is a typo, so need to move PSMR to PSYCHOSIS_W__MANIA.
    I14720 (<>value) - components ADMIT_MED_HOSP_AFTER_NUM1 and i14720 are both from tsid 109. i14720 is ANY hospital type, so MED hosp variants need their own equivalence group, just like PSYCH hosp variants apparently already have.
    I1510 (=value) - two ind_ids from tsid 5 have both V185 and V216, so that's wrong. Holy crap. They answered both the in-person and telephone sections for the question. The harmonization is good. Need guidance.
    I1520 (<>value) - same as i1510.
    I17447 (<>value) - tsid 5 variables v2432 and v2433 are both in the same group. ASPD and ADHD. Since i17447 is ASPD, v2433 is in the wrong group and should be moved.
    I17611 (<>value) - global suspiciousness i17611 (sec. M, 0-6) and i17636 (sec. W, 0-4) are both legit and in different sections of the same interviews. Sigh. Need guidance.
    I20070 (<>value) - NRGR_I20070 and NRGR_V101 are both present for many ind_ids. i20070 is head injury and v101 is angina or MI, so v101 and NRGR_V101 get split out into their own group.
    I20370 (=value) - same as i14720, but with V191 and V224.
    I20750 (<>value) - everyone has i20750 AND SEV_MANIA_NUM_MIXED_EPS in tsid 109. i20750 is severe mania mixed number of SYMPTOMS, while SEV_MANIA_NUM_MIXED_EPS is severe mania mixed number of EPISODES. Replace with SEV_MANIA___MIXED_SX_S in I20750 group.
    I3420 (=value)
    I4540 (=value)
    I4550 (<>value)
    I8480 (<>value)
    I8760 (<>value)
    NRGR_VERSION (=value) NRGR_
    SEX (<>value) NRGR_, but not as simple as expected since SEX is a name-harmonized variable across many instruments.
    V1537 (<>value) NRGR_
    V1541 (<>value) NRGR_
    V1543 (<>value) NRGR_
    V2196 (<>value) NRGR_
    V2239 (<>value) NRGR_ - Jake's issue. Has both V2239 and NRGR_V2239 and should only have one. NRGR_V2239 was created to convert the true/false values of V2239 to the 0/1 values of all other (DIGS V3-related) variables in the group, so V2239 should be excluded (not just moved elsewhere).
    V535 (<>value)
    V64 (=value)
    V701 (<>value)
    V838 (<>value) NRGR_
    V9 (=value)
    V973 (<>value) NRGR_
    ...so 10 are NOT equal and NOT easy NRGR_ fixes. The rest are either easy or equal so no rush.
    Note this list includes V64, which is the first member of AGE, which is the variable that was reported for this issue.
  4. DIVER UNIONs of variable_names that combine multiple variables representing the same underlying information for some ind_ids.
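A check along the lines of item 3 could be sketched as below. This is not the query actually used to produce the list. The equivalence_groups columns (first_member, variable_name) are taken from the UPDATE statement later in this thread; the variable_values table, the V64_COPY member, and all sample rows are invented for illustration. The `=value` / `<>value` tags are read here as "the duplicated members agree" versus "they disagree", which is an interpretation, not something the list states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE equivalence_groups (first_member TEXT, variable_name TEXT);
-- Hypothetical table of per-individual values; not a real table name.
CREATE TABLE variable_values (ind_id INTEGER, variable_name TEXT, value TEXT);

INSERT INTO equivalence_groups VALUES
  ('V2239', 'V2239'), ('V2239', 'NRGR_V2239'),
  ('V64', 'V64'), ('V64', 'V64_COPY');   -- V64_COPY is an invented member

INSERT INTO variable_values VALUES
  (7, 'V2239', 'true'), (7, 'NRGR_V2239', '1'),  -- same fact, two encodings
  (8, 'V64', '42'), (8, 'V64_COPY', '42'),       -- same value twice
  (9, 'V64', '30');                              -- no duplication
""")

# For each group and individual, count how many member variables supplied a
# value and whether those values agree.
duplicates = conn.execute("""
SELECT g.first_member,
       v.ind_id,
       CASE WHEN COUNT(DISTINCT v.value) = 1 THEN '=value' ELSE '<>value' END
FROM equivalence_groups AS g
JOIN variable_values AS v ON v.variable_name = g.variable_name
GROUP BY g.first_member, v.ind_id
HAVING COUNT(DISTINCT v.variable_name) > 1
ORDER BY g.first_member, v.ind_id
""").fetchall()

for row in duplicates:
    print(row)
```

Rows tagged `<>value` are the ones that need a decision; rows tagged `=value` only need deduplication.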

@Viqsi Viqsi added the FDD Foundational Document / Decision - issue with record/reference of a major design decision being made label Oct 14, 2024
@Viqsi
Member Author

Viqsi commented Oct 15, 2024

Just so it's on the record somewhere other than my work diary - the "unable to preview download" error Jake's been running into related to this issue is because V2239 (one of those identified above) is part of the variable set he's trying to download. So resolution of that would make his test pass.

@WValenti
Member

DATABASE UPDATES for this problem:

UPDATE equivalence_groups SET first_member = 'ANSYMMANY' WHERE first_member = 'V2239' AND variable_name <> 'V2239';
-- NOTE THAT changing variable_name_equivalence_pairs is not a simple process anymore because of things like this, so I want to say that we no longer generate equivalence_groups from variable_name_equivalence_pairs.
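To make the effect of that UPDATE easier to follow, here is a small sketch that runs it against an in-memory SQLite table. The three sample rows are assumptions (in particular, that ANSYMMANY was already a member of the V2239 group); only the UPDATE statement itself comes from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE equivalence_groups (first_member TEXT, variable_name TEXT)")

# Invented membership rows for the group currently keyed by V2239.
conn.executemany(
    "INSERT INTO equivalence_groups VALUES (?, ?)",
    [("V2239", "V2239"), ("V2239", "NRGR_V2239"), ("V2239", "ANSYMMANY")],
)

# The statement from the comment above, unchanged.
conn.execute(
    "UPDATE equivalence_groups SET first_member = 'ANSYMMANY' "
    "WHERE first_member = 'V2239' AND variable_name <> 'V2239'"
)

groups = conn.execute(
    "SELECT first_member, variable_name FROM equivalence_groups "
    "ORDER BY first_member, variable_name"
).fetchall()
print(groups)
```

With these sample rows, every member except the raw V2239 moves to a group keyed by ANSYMMANY, and V2239 is left on its own, so it no longer contributes a second row per ind_id to that group.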

@Viqsi
Member Author

Viqsi commented Oct 15, 2024

Changing the milestone on this one: while we can't solve the whole problem right away, we're trying to at least resolve substantial parts of it before final release, since it has bitten our release candidate testers.

@Viqsi
Member Author

Viqsi commented Oct 15, 2024

Switching this back to Second Public Release, as the immediate stuff (or at least the biggest step towards same) is tracked in #259.

To sum up: per discussion with @WValenti (and he can feel free to clarify further as needed) there are three classes of issue here, described here as best I can recollect:

  1. Situations in which we end up with multiple instances of an individual with identical values for each instance. (This is being addressed in #259, "Insert operations into cohortInds should always SELECT DISTINCT".)
  2. Situations in which we have harmonization problems, where variables that should not have been harmonized were harmonized in error, in some cases leading to multiple instances of an individual with distinct (or distinct-appearing) values.
  3. Situations that involve genuine longitudinal data, in which there are multiple instances of an individual with distinct values, and both are valid for different times and/or situations.
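As a rough illustration of how the three classes behave, the sketch below shows that SELECT DISTINCT collapses case 1 but leaves cases 2 and 3 as remaining collisions. The results staging table and its rows are invented; the data alone cannot tell case 2 from case 3, so the sketch only reports which ind_ids still need a human decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical staging table of variable results headed for cohortInds.
CREATE TABLE results (ind_id INTEGER, value TEXT);
INSERT INTO results VALUES
  (1, 'A'), (1, 'A'),   -- case 1: identical duplicates
  (2, 'B'),             -- no duplication
  (3, 'C'), (3, 'D');   -- case 2 or 3: distinct values for one individual
""")

# Case 1 disappears once identical rows are collapsed.
deduped = conn.execute(
    "SELECT DISTINCT ind_id, value FROM results ORDER BY ind_id, value"
).fetchall()

# Anything still appearing more than once would collide on ind_id.
still_colliding = [
    row[0]
    for row in conn.execute("""
        SELECT ind_id
        FROM (SELECT DISTINCT ind_id, value FROM results) AS d
        GROUP BY ind_id
        HAVING COUNT(*) > 1
        ORDER BY ind_id
    """)
]

print(deduped)
print(still_colliding)
```

Here ind_id 1 is fixed by deduplication, while ind_id 3 remains and would need either a harmonization fix (case 2) or a decision about which value applies (case 3).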

For example, the "V2239" problem that Jake ran into was an instance of case 2 above. There were two values that were actually semantically the same but encoded differently; Bill had created a "NRGR_V2239" to fix the encoding issue so that V2239's information could still be found in the equivalence group/Fully Harmonized Variable, but then the original "raw" V2239 (with the original, different encoding) was also inadvertently included in the equivalence group. He removed it and All Was Well - but for that variable only, and these things have to be gone through one-at-a-time, and not all of them are going to be that "easily" resolved.

Case 3 above, in particular, is a really thorny problem, and the "ideal fix" - support for longitudinal data - is potentially years away. It's an enhancement we absolutely want to add someday, of course, but for now we don't have it, and any decision on which value should ultimately "win" is arguably technically a group discussion, potentially on a variable-by-variable basis. And that by itself could possibly take weeks.

So arguably what we need to do is get a complete handle on the scope of the problem, and then decide how to proceed for the release. For the time being, we're dealing with the common case (case 1 above) in #259, as we believe the number of variables in cases 2 and 3 is relatively small.

(For the record - and I've also noted this in #259 - @WValenti was in favor of an approach that would mask all these cases in the meantime, so things would continue to appear to work without any crashes, with some known documented caveats. We didn't do that as I objected because of concerns about "side effects" that he - and I hope/assume I'm representing his perspective on this accurately - considered extremely unlikely. Said approach might be revisited if this turns out to be even nastier, but I'd want to have group buy-in first.)

@Viqsi
Member Author

Viqsi commented Oct 15, 2024

Forgot to mention in the above - @WValenti is planning on going through those variables and fixing them where they can be quickly fixed, so we're not just "leaving it at that". :)
