Skip to content

5.2.An example of extracting ICD codes from UK Biobank tabular data

Xuan Mai PHAM edited this page Aug 22, 2024 · 12 revisions

The purpose of this exercise is to extract data from the tabular data in Beluga of subjects (participants) encoded U071 (COVID-19, virus identified) correlating with the second-scan date using the xsv command. Please find a Step-by-Step process for your reference below.

Step-by-step

The ICD (International Classification Disease) and IDP(imaging-derived phenotypes) information are available in current.csv file, a big CSV table (separate from the bulk data).

1. current.csv

Available on /lustre03/project/6008063/neurohub/UKB/Tabular/current.csv, you will find the most updated UK Biobank data set available on Beluga.

2. xsv with the option headers

The command will show the headers of the CSV file, the complete list of data fields with their column number.

xsv headers current.csv

Currently 21,407 data fields are associated with about 0.5 million UK Biobank subjects.

3. Diagnoses - main ICD10 (Data field 41202)

The first question would be to know if the data field is available in our data set.

You can search for the data field using the grep command:

xsv headers current.csv | grep 41202

Files that contain data on an individual participant are named in this fashion:

<Datafield column number>''<FIELD-ID>''<INSTANCE-ID>''<ARRAY-ID>

The outcome will be the 41202 data field associated with the array, here a total of 79 arrays. (The first column is the data field column number.)

   18504  41202-0.1
   18505  41202-0.2
   18506  41202-0.3
   18507  41202-0.4
.......
   18579  41202-0.76
   18580  41202-0.77
   18581  41202-0.78
   18582  41202-0.79

4. Extract the 41202 data

You can use xsv and select the data from the range of column associated, here column 18503 to 18582.

xsv select 18503-18582 current.csv

To get the subject number (EID) you can select 1, (as the patient number is the first column of the file)

xsv select 1,18503-18582 current.csv

5. Subjects with code U071

xsv select 1,18503-18582 current.csv | grep U071

6. Extract subjects correlating with the second-scan date (Data field 41280)

xsv headers current.csv | grep 41280

grep patients diagnosed in 2021 for example:

xsv select 1,19888-20146 current.csv | grep 2021

7. Combine both results by selecting subjects coded U071 after 2021-06 (for example)

xsv select 1,18503-18582,1,19888-20146 current.csv | grep U071 | grep 2021-06

8. What we have in Beluga

5061 participants with an ICD code of U07.1 (covid tested positive)

Clone this wiki locally