Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export createConceptCountsTable() #1067

Open
cebarboza opened this issue Aug 8, 2023 · 8 comments
Open

export createConceptCountsTable() #1067

cebarboza opened this issue Aug 8, 2023 · 8 comments

Comments

@cebarboza
Copy link
Collaborator

At Darwin EU we are trying to run CohortDiagnostics in a more efficient way when running it for multiple studies. We believe it would be a good enhancement to export the function createConceptCountsTable() to generate the concept_counts before executing the diagnostics.

By doing this, we can perform this calculation just once for a specific vocabulary_version, instead of repeating this process for each study.

In our fork darwin-eu-dev/CohortDiagnostics, the user can createConceptCountsTable(). This table is saved in the cohortDatabaseSchema.

CohortDiagnostics::createConceptCountsTable(connectionDetails = connectionDetails,
                                            cdmDatabaseSchema = cdmDatabaseSchema,
                                            conceptCountsDatabaseSchema = cohortDatabaseSchema,
                                            conceptCountsTable = "concept_counts",
                                            removeCurrentTable = TRUE)

Then we use the parameter useExternalConceptCountsTable in executeDiagnostics(). If TRUE, executeDiagnostics() uses the concept_counts created previously in the cohortDatabaseSchema. The user should specify the name of the external concept counts table, generally concept_counts

CohortDiagnostics::executeDiagnostics(cohortDefinitionSet =  cohortDefinitionSet,
                                      connectionDetails = connectionDetails,
                                      cohortTable = cohortTable,
                                      cohortDatabaseSchema = cohortDatabaseSchema,
                                      cdmDatabaseSchema = cdmDatabaseSchema,
                                      conceptCountsTable = "concept_counts",
                                      exportFolder = exportFolder,
                                      databaseId = "Eunomia",
                                      minCellCount = 5,
                                      useExternalConceptCountsTable = FALSE)

We also modified the CreateConceptCountTable.sql file, to add a new column with the vocabulary_version.

https://github.com/darwin-eu-dev/CohortDiagnostics/blob/ca6d9074bb097b9ce60b7bc6bf72e68a84f650fe/inst/sql/sql_server/CreateConceptCountTable.sql#L102C1-L106C2

{@table_is_temp} ? {} : { 
ALTER TABLE @work_database_schema.@concept_counts_table
ADD vocabulary_version VARCHAR(20) NULL;
UPDATE @work_database_schema.@concept_counts_table SET vocabulary_version = (SELECT vocabulary_version FROM @cdm_database_schema.vocabulary WHERE vocabulary_id = 'None');
}

Then, there are checks in place that evaluate if the vocabulary_version in the concept_counts table is equal to the version of the database the user is running the diagnostics.

https://github.com/darwin-eu-dev/CohortDiagnostics/blob/ca6d9074bb097b9ce60b7bc6bf72e68a84f650fe/R/RunDiagnostics.R#L679C1-L708C4

 # Defines variables and checks version of external concept counts table -----
  if (useExternalConceptCountsTable == FALSE) {
    conceptCountsTableIsTemp <- TRUE
    if (conceptCountsTable != "#concept_counts") {
      conceptCountsTable <- "#concept_counts"
    }
  } else {
    if (conceptCountsTable == "#concept_counts") {
      stop("Temporary conceptCountsTable name. Please provide a valid external ConceptCountsTable name")
    }
    conceptCountsTableIsTemp <- FALSE
    conceptCountsTable <- conceptCountsTable
    dataSourceInfo <- getCdmDataSourceInformation(connection = connection, cdmDatabaseSchema = cdmDatabaseSchema)
    vocabVersion <- dataSourceInfo$vocabularyVersion
    vocabVersionExternalConceptCountsTable <- DatabaseConnector::renderTranslateQuerySql(
      connection = connection,
      sql = "SELECT DISTINCT vocabulary_version FROM @work_database_schema.@concept_counts_table;",
      work_database_schema = cohortDatabaseSchema,
      concept_counts_table = conceptCountsTable,
      snakeCaseToCamelCase = TRUE,
      tempEmulationSchema = getOption("sqlRenderTempEmulationSchena")
    )
    if (!identical(vocabVersion, vocabVersionExternalConceptCountsTable[1,1])) {
      stop(paste0("External concept counts table (", 
                 vocabVersionExternalConceptCountsTable, 
                 ") does not match database (", 
                 vocabVersion, 
                 "). Update concept_counts with createConceptCountsTable()"))
    }
  }

There's also a vignette to explain how to run this functions UseExternalConceptTable.Rmd. We have been testing this approach but we wanted to discuss this before sending a pull request.

@azimov
Copy link
Collaborator

azimov commented Sep 19, 2023

Sorry for the lateness in reply I missed this issue when you first raised it.

This approach seems reasonable, but I wonder if the solution would be better included inside CohortGenerator as an extra step following cohort creation? It seems like this solution might be generally more useful than just for cohort diagnostics - tagging @anthonysena for visibility.

However, I'm happy to look at adopting this change into the main code. Just submit a PR and I will gladly test it out.

@cebarboza
Copy link
Collaborator Author

No worries and thanks for the response. I'll send you the PR as soon as possible.

I would argue that the function createConceptCountsTable() is already part of CohortDiagnostics' workflow.

We also used the variable useExternalConceptCountsTable inside CD, so it was more natural for us to develop this, avoiding any major modification.

@katy-sadowski
Copy link

Did anything ever come out of this? If not, I'd be willing to help out.

I'm curious why useExternalConceptCountsTable is no longer supported? Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

@azimov
Copy link
Collaborator

azimov commented May 2, 2024

Did anything ever come out of this? If not, I'd be willing to help out.

I'm curious why useExternalConceptCountsTable is no longer supported? Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

This was a fork that I never got a PR for. I expect that this could be useful, especially in contexts where the Achilles counts are there. If you're interested in following through I will gladly work to get the PR included.

@cebarboza
Copy link
Collaborator Author

Hi @azimov I sent the PR for exporting the function to useExternalConceptCountsTable.

@ablack3
Copy link
Collaborator

ablack3 commented Jun 12, 2024

Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

Good idea @katy-sadowski!

@cebarboza and I started working on the idea to speed up CohortDiagnstics by using the pre-computed concept counts in Achilles.

So instead of building the concepts counts table CohortDiagnostics would have the option to use the data that is already available in Achilles.

Here is a SQL query that would pull the data for

-- note this is postgresql sql

-- Achilles analysis ids used for the concept counts table
-- condition: 400 (persons), 401 (standard concepts), 425 (source concepts)
-- drug: 700 (persons), 701 (standard concepts), 725 (source concepts)
-- procedure: 600 (persons), 601 (standard concepts), 625 (source concepts)
-- measurement: 1800 (persons), 1801 (standard concept), 1825 (source concepts)
-- observation: 800 (persons), 801 (standard concepts), 825 (source concepts)

SELECT
	q1.concept_id, 
	concept_count, 
	concept_subjects
INTO @concept_counts_table
FROM (
	SELECT 
		CAST(stratum_1 as int) as concept_id,
		count_value as concept_count
	FROM results.achilles_results
	WHERE analysis_id IN (401,601,701,801,1801,425,625,725,825,1825) AND stratum_1 != '0'
) q1
LEFT JOIN 
(
	SELECT 
		CAST(stratum_1 as INT) as concept_id,
		count_value as concept_subjects
	FROM results.achilles_results
	WHERE analysis_id IN (400,600,700,800,1800) AND stratum_1 != '0'
) q2
ON q1.concept_id = q2.concept_id

I ran this on IPCI and the CohortDiagnostics sql script on IPCI and compared the results.

image

df is created by the sql script in cohort diagnostics and df2 is coming from the sql pasted above.

Note there is a huge difference in the number of rows. This is likely because a lot of data in IPCI falls outside the observation period. Achilles does not count concepts outside observation period. Cohort diagnostics does include concepts outside observation period.

Another difference is the inclusion of concept id 0. this concept id occurs in multiple cdm tables and so is repeated in the concept table created by cohort diagnostics. I removed these rows. In the table created by cohort diagnostics there is no way to tell which domain the rows with concept id 0 came from.

image

@azimov, @katy-sadowski what do you think about adding achilles as an option for getting concept counts in cohort diagnostics?

Also what do you think about making the counts consistent by excluding concepts outside observation period and removing concept id 0?

@ablack3
Copy link
Collaborator

ablack3 commented Jun 12, 2024

I created a pr for use of the achilles tables. Not tested yet but wanted to get some feedback on the approach first. #1120

@ablack3
Copy link
Collaborator

ablack3 commented Sep 16, 2024

Achilles has an optional table that has concept counts in a format that we could possibly use out of the box without any reformmating or copying.

The table is called achilles_result_concept_count and if the optimizeAchillesCache = TRUE then it will be created.

https://github.com/OHDSI/Achilles/blob/ea478a34179f52c13bb798abd41d7ad96d1b1fc2/R/Achilles.R#L145

https://github.com/OHDSI/Achilles/blob/main/inst/sql/sql_server/analyses/create_result_concept_table.sql

I'm not sure yet if this is exactly what we want because I would need to better understand the sql script and the analyses that it is pulling from but it seems like we could possibly use this table as is for concept counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants