Home

Welcome to the py-icare wiki!

Example data

Example datasets are provided in the data/ directory of this repository. Use them to explore the different features of iCARE and to examine the outputs they generate.

breast_cancer_covariates_info.json: the metadata associated with the risk factors present in the covariate dataset (reference_covariate_data.csv). For each risk factor, the metadata specifies the data type (type; either "continuous" or "discrete"), a list of categories if the variable is discrete (levels), and optionally the reference category of the discrete variable (ref). If a reference category is not provided for a discrete variable, the first category listed under levels is assumed to be the reference.
breast_cancer_model_formula.txt: a patsy formula, which is a symbolic description of the covariate model to be fitted. Patsy is a Python substitute for R's formula class objects. If you are an R programmer, please read the patsy manual since patsy is not a perfect drop-in replacement for R's formula syntax.
breast_cancer_72_snps_info.csv: published information (SNP name, odds ratio, and allele frequency) on 72 breast cancer-associated SNPs. Reference: Michailidou, Kyriaki, et al. "Association analysis identifies 65 new breast cancer risk loci." Nature 551.7678 (2017): 92-94.
breast_cancer_model_log_odds_ratios.json: published log odds ratios associated with each risk factor in the breast cancer covariate model (breast_cancer_model_formula.txt). These were obtained from a logistic regression model adjusted for cohort and fine categories of age.
breast_cancer_model_log_odds_ratios_post_50.json:
reference_covariate_data.csv: a simulated reference dataset specifying some of the breast cancer-associated risk factors (see table below) for 14,137 individuals. The simulation is based on the National Health Interview Survey (NHIS) and National Health and Nutrition Examination Survey (NHANES). This dataset is representative of the US population. Reference: 1) 2010 National Health Interview Survey (NHIS) Public Use Data Release, NHIS Survey Description. 2011. (Accessed at ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2010/srvydesc.pdf.); 2) Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2010.

Variable name	Description	Value encoding
`id`	Subject ID	A unique identifier for each individual.
`famhist`	Family history (of breast cancer among first degree relatives)	`{0: "absence" (reference), 1: "presence"}`
`menarche_dec`	Age at menarche (years)	`{1: <=11 (reference), 2: 11-11.5, 3: 11.5-12, 5: 12-13, 8: 13-14, 9: 14-15, 10: >=15}`
`parity`	Parity (number of full-term pregnancies)	`{0: nulliparous (reference), 1: 1, 2: 2, 3: 3, 4: >=4}`
`birth_dec`	Age at first child birth (years)	`{1: <=19 (reference), 2: 19-22, 3: 22-23, 4: 23-25, 7: 25-27, 8: 27-30, 9: 30-34, 10: 34-38, 11: >=38}`
`agemeno_dec`	Age at menopause (years)	`{1: <=40 (reference), 2: 40-45, 3: 45-47, 4: 47-48, 5: 48-50, 6: 50-51, 7: 51-52, 8: 52-53, 9: 53-55, 10: >=55}`
`height_dec`	Height (meters)	`{1: <=1.55 (reference), 2: 1.55-1.57, 3: 1.57-1.60, 4: 1.60-1.61, 5: 1.61-1.63, 6: 1.63-1.65, 7: 1.65-1.66, 8: 1.66-1.68, 9: 1.68-1.71, 10: >=1.71}`
`bmi_dec`	Body mass index (kg/m²)	`{1: <=21.5 (reference), 2: 21.5-23, 3: 23-24.2, 4: 24.2-25.3, 5: 25.3-26.5, 6: 26.5-27.8, 7: 27.8-29.3, 8: 29.3-31.4, 9: 31.4-34.6, 10: >=34.6}`
`rd_menohrt`	Use of Hormone Replacement Therapy (HRT)	`{0: "pre-menopausal" (reference), 1: "post-menopausal and never HRT user", 2: "post-menopausal and ever HRT user"}`
`rd2_everhrt_e`	Use of estrogen-only therapy	`{0: "otherwise" (reference), 1: "post-menopausal and ever user of estrogen-only therapy"}`
`rd2_everhrt_c`	Use of estrogen + progesterone combined therapy	`{0: "otherwise" (reference), 1: "post-menopausal and ever user of combined therapy"}`
`rd2_currhrt`	Current use of HRT	`{0: "otherwise" (reference), 1: "post-menopausal and current HRT user"}`
`alcoholdweek_dec`	Alcohol (drinks/week)	`{1: "none" (reference), 4: 0-0.4, 5: 0.4-0.8, 6: 0.8-1.5, 7: 1.5-3.2, 8: 3.2-5.7, 9: 5.7-9.8, 10: >9.8}`
`ever_smoke`	Smoking status	`{0: "never" (reference), 1: "ever"}`
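
The reference-category rule described for breast_cancer_covariates_info.json can be sketched in a few lines of Python. This is a minimal illustration, not the py-icare implementation: the inline JSON is a hypothetical excerpt, and its layout (a list of objects with a `name` key) is an assumption. Only the `type`, `levels`, and `ref` fields come from the description above, and `example_continuous` is a made-up variable name.

```python
import json

# Hypothetical excerpt; the real file may be laid out differently.
# The "name" key and the list-of-objects shape are assumptions.
covariates_info_json = """
[
  {"name": "rd_menohrt", "type": "discrete", "levels": [0, 1, 2], "ref": 0},
  {"name": "menarche_dec", "type": "discrete", "levels": [1, 2, 3, 5, 8, 9, 10]},
  {"name": "example_continuous", "type": "continuous"}
]
"""


def reference_level(entry):
    """Return the reference category of a discrete covariate, or None if continuous."""
    if entry["type"] != "discrete":
        return None
    # When "ref" is omitted, fall back to the first listed level.
    return entry.get("ref", entry["levels"][0])


covariates = json.loads(covariates_info_json)
references = {entry["name"]: reference_level(entry) for entry in covariates}
print(references)
```

Here `menarche_dec` has no `ref`, so its first level (1) is taken as the reference, which matches the "(reference)" marker in the table above.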

reference_covariate_data_post_50.csv:
age_specific_breast_cancer_incidence_rates.csv: age-specific breast cancer incidence rates. Reference: Surveillance, Epidemiology, and End Results (SEER) Program SEER*Stat Database: Incidence - SEER 18 Regs Research Data, Nov 2011 Sub, Vintage 2009 Pops (2000-2009) <Katrina/Rita Population Adjustment> - Linked To County Attributes - Total U.S., 1969-2010 Counties. In: National Cancer Institute D, Surveillance Research Program, Surveillance Systems Branch, ed. SEER18 ed.
age_specific_all_cause_mortality_rates.csv: age-specific all-cause mortality rates. Reference: Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). Underlying Cause of Death 1999-2011 on CDC WONDER Online Database, released 2014. Data are from the Multiple Cause of Death Files, 1999-2011, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at http://wonder.cdc.gov/ucd-icd10.html on Aug 26, 2014.
query_covariates_profile.csv: a query dataset specifying the risk factors (same variables as in the reference covariate dataset reference_covariate_data.csv) for three hypothetical individuals. Missing values, if present, are handled by iCARE.
query_snp_profile.csv: a query dataset specifying the allele dosages for the breast cancer-associated SNPs (same SNPs as in the breast_cancer_72_snps_info.csv file) for three hypothetical individuals. Some SNP dosages are missing for some individuals; iCARE handles these missing values.
validation_cohort_data.csv: a simulated dataset of a full cohort study of 50,000 individuals. This dataset helps illustrate the model validation capabilities of iCARE. The variables in this dataset are as follows.

Variable name	Description	Value encoding
`study_entry_age`	Age at study entry (years)	continuous (integer)
`study_exit_age`	Age at study exit (years)	continuous (integer)
`observed_outcome`	Disease status	`{0: "normal", 1: "case"}`
`time_of_onset`	Time (in years) from study entry to the development of the disease. Set to `Inf` if the subject did not develop the disease during the follow-up period.	continuous (float)
`observed_followup`	Number of years that the subject was followed up in the study, i.e. the difference between the age at study exit and the age at study entry.	continuous (integer)
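
The relationships between these columns can be checked with a short script before running a validation. The sketch below uses only the Python standard library and three made-up rows in place of validation_cohort_data.csv; the `check_row` helper is hypothetical and is not part of py-icare. It tests the two properties stated in the table: `observed_followup` equals the exit age minus the entry age, and `time_of_onset` is `Inf` for subjects who did not develop the disease.

```python
import csv
import io
import math

# Made-up rows using the column names from the table above.
sample_csv = """study_entry_age,study_exit_age,observed_outcome,time_of_onset,observed_followup
50,60,0,Inf,10
55,58,1,2.4,3
62,70,0,Inf,9
"""


def check_row(row):
    """Return a list of inconsistencies found in one cohort record."""
    problems = []
    entry_age = int(row["study_entry_age"])
    exit_age = int(row["study_exit_age"])
    followup = int(row["observed_followup"])
    outcome = int(row["observed_outcome"])
    onset = float(row["time_of_onset"])  # float("Inf") parses to infinity

    if followup != exit_age - entry_age:
        problems.append("observed_followup != study_exit_age - study_entry_age")
    if outcome == 0 and not math.isinf(onset):
        problems.append("non-case with finite time_of_onset")
    return problems


rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = {index: check_row(row) for index, row in enumerate(rows)}
print(report)
```

The third row is deliberately inconsistent (a follow-up of 9 years for ages 62 to 70), so it is the only one reported with a problem.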

validation_nested_case_control_data.csv: a simulated dataset of a case-control study of 5,285 individuals, nested within a cohort study. Inclusion?
