HARE File Descriptions and Data Dictionaries

Olivia Smith
osmith@utexas.edu
GitHub: @ossmith

This document describes each file created by the HARE pipeline and provides a data dictionary for those that need one.

hare.reference.assets.tar.gz

HAR BED Files

harsRichard2020.GRCh37.bed and harsRichard2020.GRCh38.bed are BED files containing the human accelerated regions (HARs) discovered or annotated in several publications. They are sourced from the supplement of Richard et al., 2020. The original HAR file was created using human genome reference GRCh37. The GRCh38 HAR BED file was created using the UCSC Genome Browser liftover tool. Any file in BED format can be used to build the elements of interest set.

Column Name	Data Type	Description
CHR	int	Chromosome
CHR_START	int	Starting position of the HAR (bp)
CHR_END	int	End position of the HAR (bp)
PUB	string	Publication which identified/annotated the HAR
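
As a minimal sketch of how a four-column file with this layout could be read, the Python below parses tab-separated lines into named records. The `HarRecord` type, the `parse_har_bed` helper, and the sample rows are hypothetical and not part of HARE. The sketch assumes tab-delimited input with no header line, and it strips an optional `chr` prefix because the dictionary lists CHR as an int.

```python
from typing import Iterable, Iterator, NamedTuple


class HarRecord(NamedTuple):
    chrom: int  # CHR
    start: int  # CHR_START (bp)
    end: int    # CHR_END (bp)
    pub: str    # PUB


def parse_har_bed(lines: Iterable[str]) -> Iterator[HarRecord]:
    """Yield one record per non-empty, non-comment BED line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, start, end, pub = line.split("\t")[:4]
        # Assumption: chromosomes may be written as "chr1" or "1".
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        yield HarRecord(int(chrom), int(start), int(end), pub)


# Made-up rows for illustration only; these are not real HAR coordinates.
sample = ["chr1\t1000\t1200\tpubA\n", "2\t500\t650\tpubB\n"]
records = list(parse_har_bed(sample))
```

To read a real file, pass an open file handle instead of the sample list, for example `parse_har_bed(open("harsRichard2020.GRCh37.bed"))`.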

UCSC Genome Annotations

UCSC.GRCh37.annotation.autosomes.bed.gz and UCSC.GRCh38.annotation.autosomes.bed.gz are gzipped BED files containing all gene annotations for the human autosomes in GRCh37 and GRCh38, respectively. Each autosome was downloaded separately from the UCSC Genome Browser and then assembled into a single autosome file. Unzip the file for viewing or for use in the pipeline with the following command (shown for GRCh37):

gunzip UCSC.GRCh37.annotation.autosomes.bed.gz

Data dictionary descriptions are taken from the UCSC Genome Browser at time of publication.

Column Name	Data Type	Description
CHR	int	Chromosome
CHR_START	int	Starting position of the feature (bp)
CHR_END	int	End position of the feature (bp)
NAME	string	Feature name
SCORE	int	A score between 0 and 1000. See UCSC Genome Browser format documentation for more details.
STRAND	string	Defines the strand. Either "." (=no strand) or "+" or "-".
THICK_START	int	The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, THICK_START and THICK_END are usually set to the CHR_START position.
THICK_END	int	The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
ITEM_RGB	string	An RGB value of the form R,G,B (e.g. 255,0,0). See UCSC Genome Browser format documentation for more details.
BLOCK_COUNT	int	The number of blocks (exons) in the BED line.
BLOCK_SIZE	string	A comma-separated list of the block sizes (integers). The number of items in this list should correspond to BLOCK_COUNT.
BLOCK_START	string	A comma-separated list of block starts (integers). All of the BLOCK_START positions should be calculated relative to CHR_START. The number of items in this list should correspond to BLOCK_COUNT.
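
Because BLOCK_START values are relative to CHR_START, recovering genomic block (exon) coordinates takes one addition per block. The sketch below illustrates that arithmetic on a single 12-column line. The `block_intervals` helper and the sample line are hypothetical, and the code assumes tab-delimited fields in the column order listed above.

```python
from typing import List, Tuple


def block_intervals(bed12_line: str) -> List[Tuple[int, int]]:
    """Return (start, end) genomic coordinates for each block in a BED12 line."""
    fields = bed12_line.rstrip("\n").split("\t")
    chr_start = int(fields[1])    # CHR_START
    block_count = int(fields[9])  # BLOCK_COUNT
    # The lists may end with a trailing comma, so empty items are dropped.
    sizes = [int(s) for s in fields[10].split(",") if s]   # BLOCK_SIZE
    starts = [int(s) for s in fields[11].split(",") if s]  # BLOCK_START
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("BLOCK_SIZE/BLOCK_START length does not match BLOCK_COUNT")
    # Shift each relative start by CHR_START, then add the block size.
    return [(chr_start + rel, chr_start + rel + size)
            for rel, size in zip(starts, sizes)]


# Made-up feature for illustration only.
line = "1\t1000\t5000\tgeneX\t0\t+\t1200\t4800\t0\t2\t300,400,\t0,3600,\n"
blocks = block_intervals(line)
```

For this sample the last block ends at CHR_END, which is a useful sanity check on real records as well.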

[OUT_STEM].snps

A list of SNPs with genome-wide association to the phenotype in VCF file format. See SAMtools documentation for more details on this file format.

[OUT_STEM].annotation

Output from the Ensembl Variant Effect Predictor command line tool. Comments are preceded by #, including column headers. For data dictionary, see the 'default VEP output documentation'.

[OUT_STEM].biomart

Output from the BioMart location finding for the elements annotated by VEP ([OUT_STEM].annotation file). Headers are included in this file.

Column Name	Data Type	Description
ENSEMBL_ID	string	Ensembl ID for the feature
START	int	First position of feature
END	int	Last position of feature
CHR	int	Chromosome the feature is located on
GENE_NAME	string	Gene name associated with the feature in Ensembl
STRAND	int	Strand on which the feature is located. `1` is for forward and `-1` is for reverse.

[OUT_STEM].locations.bed

BED file containing only the locations of the elements annotated by VEP, which are intersected against the genomic elements of interest.

Column Name	Data Type	Description
CHR	int	Chromosome
START	int	Starting position of the feature (bp)
END	int	End position of the feature (bp)

[OUT_STEM].intersections

This file contains the calculations of the intersections/bp for the simulation and phenotype-associated element sets.

Column Name	Data Type	Description
category	string	Category for the calculation, which specifies whether the element set is a `simulation` or the `test_set` (phenotype-associated).
int_per_bp	float	Intersections per base pair computed across the entire element set.
set_size	int	Number of elements present in the element set. This number should be the same across all simulations associated with a given phenotype-associated element set.
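
The rows described above can be reduced to a few summary values, which is sketched below under some assumptions. The `summarize` helper and the sample rows are hypothetical; the rows are held in memory as tuples because the file's delimiter is not specified here. The check on `set_size` follows the note that it should be identical across all sets for a given phenotype.

```python
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[str, float, int]  # (category, int_per_bp, set_size)


def summarize(rows: List[Row]) -> Dict[str, float]:
    """Collapse intersection rows into simulation and test-set summaries."""
    sizes = {size for _, _, size in rows}
    if len(sizes) != 1:
        raise ValueError(f"expected a single set_size, found {sorted(sizes)}")
    sims = [ipb for cat, ipb, _ in rows if cat == "simulation"]
    tests = [ipb for cat, ipb, _ in rows if cat == "test_set"]
    if not sims or len(tests) != 1:
        raise ValueError("need at least one simulation and exactly one test_set row")
    return {
        "set_size": sizes.pop(),
        "n_simulations": len(sims),
        "sim_ipb": mean(sims),  # mean intersections/bp across simulations
        "set_ipb": tests[0],    # intersections/bp of the phenotype-associated set
    }


# Made-up values for illustration only.
rows = [("simulation", 0.001, 50), ("simulation", 0.003, 50), ("test_set", 0.004, 50)]
summary = summarize(rows)
```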

[OUT_STEM].stats

sigtest results file which contains information about the run parameters, model fitting, and hypothesis testing (including the p-value) of the intersect results.

Column Name	Data Type	Description
FILENAME	string	Category for the calculation which specifies whether the element set is either "simulation" or "test_set" (phenotype-associated).
SET_SIZE	int	Number of elements present in the element set. This number applies both to the phenotype-associated and simulation element sets.
N_SIMULATIONS	int	Number of simulations used to generate the background distribution.
SIM_IPB	float	The mean intersections/bp across all the simulations.
SET_IPB	float	The intersections/bp in the phenotype-associated element set.
P_EMPIRICAL	float	Empirical p-value calculated as the fraction of the simulations which had higher intersections/bp than the phenotype-associated element set.
WEIBULL_SHAPE	float	Shape parameter of the fitted Weibull distribution.
WEIBULL_SCALE	float	Scale parameter of the fitted Weibull distribution.
P_WEIBULL	float	P-value calculated from the fit to the Weibull distribution (one-tailed).
ADJUSTED_P	float	Empirical p-value adjusted for multiple hypothesis testing using the Benjamini-Hochberg method. Note that this value should only be considered valid if all tests are provided as input in one run (use a comma-separated list; see README for details).
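
To make the P_EMPIRICAL and ADJUSTED_P definitions concrete, here is a minimal sketch of both calculations. It is not taken from the HARE source: the function names and sample numbers are hypothetical, the empirical p-value uses a strict "higher than" comparison as worded in the table, and the adjustment is the standard Benjamini-Hochberg step-up procedure.

```python
from typing import List, Sequence


def empirical_p(sim_ipb: Sequence[float], set_ipb: float) -> float:
    """Fraction of simulations with strictly higher intersections/bp than the test set."""
    return sum(1 for x in sim_ipb if x > set_ipb) / len(sim_ipb)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up values for illustration only.
p_emp = empirical_p([0.1, 0.2, 0.3, 0.4], 0.25)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

The adjustment depends on the number of tests `m`, which is why ADJUSTED_P is only meaningful when all tests are supplied in a single run.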

[OUT_STEM].rnk

prerank results file which contains a ranked list of genes and a score for each (either minimum or mean, depending on the score method used). This file does not contain headers.

Column	Data Type	Description
1	string	HGNC symbol for the feature or gene
2	float	Associated score. Computed as the minimum or mean -log10(p) of positions associated with that feature.
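
The score in column 2 can be illustrated with a short sketch that groups p-values by gene and reduces their -log10(p) values with either the minimum or the mean. The `gene_scores` helper, the gene names, and the p-values are hypothetical, and sorting from highest to lowest score is an assumption about the ranking order rather than documented HARE behavior.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def gene_scores(hits: Iterable[Tuple[str, float]],
                method: str = "min") -> List[Tuple[str, float]]:
    """Score each gene by the min or mean -log10(p) of its associated positions."""
    logs: Dict[str, List[float]] = defaultdict(list)
    for gene, p in hits:
        logs[gene].append(-math.log10(p))
    reducer = {"min": min, "mean": mean}[method]
    scored = [(gene, reducer(values)) for gene, values in logs.items()]
    # Assumption: rank from highest to lowest score.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up (gene, p-value) pairs for illustration only.
hits = [("GENE_A", 1e-8), ("GENE_A", 1e-10), ("GENE_B", 1e-9)]
ranked_min = gene_scores(hits, method="min")
ranked_mean = gene_scores(hits, method="mean")
```

Note that the minimum of -log10(p) corresponds to the least significant position for a gene, so the two methods can rank genes differently.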
