Skip to content
/ hicberg Public

Statistical profiling based program for contact (Hi-C) and pair ended genomic data reconstruction

License

Notifications You must be signed in to change notification settings

sebgra/hicberg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HiC-BERG

Badges

MIT License GPLv3 License AGPL License codecov

Python package to reconstruct Hi-C contact maps.

Documentation

Documentation and tutorial: https://sebgra.github.io/hicberg/

Table of contents

Environment and dependencies

Environnement

Create environment by using following command :

mamba env create -n [ENV_NAME] -f hicberg.yaml;

Dependencies

To ensure that HiC-BERG is correctly working, Bowtie2, Samtools, bedGraphToBigWig and BedTools have to be installed. These can be install through :

mamba install bowtie2 -c bioconda;
mamba install samtools -c bioconda;
mamba install -c bioconda ucsc-bedgraphtobigwig;
mamba install bedtools -c bioconda;

Installation

Install my-project with pip:

  pip install hicberg

or in developper mode:

  mamba activate [ENV_NAME];
  pip install -e . ;

pip

Install HiC-BERG locally by using

  pip install -e .

Conda / Mamba

We highly recommend installing HiC-BERG through Mamba.

conda install -c bioconda hicberg
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
mamba create -n hicberg python=3.11.4
mamba activate hicberg
mamba install bioconda::bowtie2
mamba install bioconda::samtools
mamba install bioconda::bedtools
mamba install bioconda::ucsc-bedgraphtobigwig

TO BE COMPLETED

Docker

HiC-BERG can be used via Docker to limit dependency compatibility problems. More information about container such as Docker can be found there.

To be used through Docker, the HiC-BERG container as to be build using :

# Pick user and group variables
export HOST_UID=$(id -u);                                                                
export HOST_GID=$(id -g);

# Build container
sudo docker build --build-arg UID=$HOST_UID --build-arg GID=$HOST_GID -t hicberg:0.0.1 .

HiC-BERG can therefore be used with usual command detailed further using :

sudo docker run   -v $PWD:$PWD  -w <working directory> -u $(id -u):$(id -g)  hicberg:0.0.1 <hicberg_command>

N.B : working_directory and output (-o argument) of HiC-BERG command have to be the same.

For instance if you need to access the help of HiC-BERG through Docker :

sudo docker run -v $PWD:$PWD  -w <working directory> -u $(id -u):$(id -g)  hicberg:0.0.1 hicberg --help

HiC-BERG can also be used through Docker in interactive mode using :

sudo docker run -v $PWD:$PWD  -w <working directory> -u $(id -u):$(id -g) -it  hicberg:0.0.1

Then, the user will be directly placed in an interactive shell where HiC-BERG can be use directly, by typing any of HiC-BERG command such as further examples.

Usage/Examples

Full pipeline

HiC-BERG

All components of the pipeline can be run at once using the hicberg pipeline command. This allows to generate a contact matrix and its reconstruction from reads in a single command.
By default, the output is in COOL format. More detailed documentation can be found on the readthedocs website:

https://sebgra.github.io/hicberg/

hicberg pipeline [--enzyme=["DpnII", "HinfI"]] [--distance=1000]
[--rate=1.0] [--cpus=1] [--mode="full"] [--max-alignments=None] [--sensitivity="very-sensitive"]
[--bins=2000] [--circular=""] [--mapq=35] [--kernel-size=11] [--deviation=0.5] [--start-stage="fastq"]
[--exit-stage=None] [--output=DIR] [--index=None] [--blacklist=STR] [--force]  <genome> <input1> <input2>

For example, to run the pipeline using 8 threads, in default mode, using ARIMA Hi-C kit enzymes (DpnII & HinfI) without blacklisting and generate a matrix and its reconstruction in the directory out:

hicberg pipeline -e DpnII -e HinfI --cpus 8 -o out/  genome.fa  reads_for.fq  rev_reads.fq 

Snakemake usage

Configuration

Several parameters and Hi-C banks can be set in the file config.yaml. The parameters are the following:

samples: "config/samples.csv"
base_dir: Path to the base directory containing data
out_dir: Path where to save results
ref_dir: Path to the folder containing genomes 
fastq_dir:  Path to the folder containing fastq files

name : library_name
    ref: Path to the reference genome from ref_dir 
    R1: Path to the foward reads file from fastq_dir
    R2: Path to the reverse read file from fastq_dir
    circular: Coma separated list of circular chromosomes
    enzymes : Coma separated list of enzymes used for the experiment
    sampling_rate: Sampling rate of the restriction map
    res: Resolution of the contact matrix (in bp)

The samples.csv file can be used to set the parameters for each library. The file is a csv file with the following columns:

library;species;sampling_rates;enzymes;kernel_sizes;deviations;modes;resolutions;max_reports;circularity

#name_1
library_name_1;species_1;sampling_rate_1;enzymes_1;kernel_sizes_1;deviations_1;mode_1;resolutions_1;max_reports_1

#name_2
library_name_2;species_2;sampling_rate_2;enzymes_2;kernel_sizes_2;deviations_2;mode_2;resolutions_2;max_reports_2

Run

Locally

The HiC-BERG pipeline can be run using Snakemake. The pipeline is defined in the file Snakefile. The pipeline can be run using the following command:

snakemake --cores [cpus]

Cluster

HiC-BERG can also be run on a cluster using Snakemake. The cluster configuration is defined in the file cluster_slurm.yaml. The pipeline can be run using the following command:

snakemake --cluster "sbatch --mem {cluster.mem} -p {cluster.partition} --qos {cluster.queue} -c {cluster.ncpus} 
-n {cluster.ntasks} -J {cluster.name} -o {cluster.output} -e {cluster.error}" --cluster-config config/cluster_slurm.yaml -j 16 --rerun-incomplete

As for the local run, the libraries to process are defined in the file samples.csv. Computational resources can be set in the file cluster_slurm.yaml. In this file, the different parameters have to be specified by rules. For instance:

hicberg_step_0:
    queue: normal
    partition: common
    ncpus: 16
    mem: 32G
    ntasks: 1
    name: hicberg.{rule}.{wildcards}
    output: logs/cluster/{rule}.{wildcards}.out
    error: logs/cluster/{rule}.{wildcards}.err

Jobs will be sent to the cluster as usual sbatch commands. The fields queue, partition, ncpus, mem, ntasks, name, output and error are mandatory.

Considering the previous example, the following command will be sent to the cluster:

sbatch --mem 32G -p common -c 16 -n 1 -J hicberg.hicberg_step_0.{wildcards} -o logs/cluster/hicberg_step_0.{wildcards}.out -e logs/cluster/hicberg_step_0.{wildcards}.err

And such for each rule defined in the file cluster_slurm.yaml, through all the libraries specified in sample.csv.

Log files will be saved in the folder logs/cluster. The output and error files will be named as the following: {rule}.{wildcards}.out and {rule}.{wildcards}.err. The wildcards are the parameters specified in the file samples.csv.

_N.B: The parameter --rerun-incomplete is used to restart the pipeline from the last step if it has been interrupted.

N.B 2: The parameter -j is used to specify the number of jobs to run in parallel. It is recommended to set this parameter to the number of libraries to process.

N.B 3: The ressources allocated to each job can be modified in the file cluster_slurm.yaml. The parameters mem and ncpus are used to specify the memory and the number of CPUs to allocate to each job. The parameter ntasks is used to specify the number of tasks to run in parallel. The parameter queue is used to specify the queue to use. The parameter partition is used to specify the partition to use. The parameters name, output and error are used to specify the name of the job and the output and error files._

Individual components

I/O

Create folder

Be careful to create a folder before running the pipeline. The folder can be created using the following command:

hicberg create-folder --output=DIR [--name="folder_name"] [--force]

For example to create a folder named "test" on the desktop:

hicberg create-folder -o ~/Desktop/ -n test

The folders architecture will be the following:

├── output
│   ├── alignments
│   ├── contacts
│   ├── index
│   ├── plots
│   ├── statistics

Preprocessing

After having created a folder with the previous command mentioned in create folder, the genome can be processed to generate fragment file fragment_fixed_sizes.txt and the dictionary of chromosomes' sizes chromosome_sizes.npy using the following command:

hicberg get-tables --output=DIR --genome=FILE [--bins=2000] 

For example to these files in a folder named "test" previously created on the desktop with a binning size of 2000 bp :

hicberg get-tables -o ~/Desktop/test/  --bins 2000 <genome>

The files fragment_fixed_sizes.txt and chromosome_sizes.npy will be generated in the folder output/.

Alignment

After having created a folder with the previous command mentioned in create folder and performed the creation of fragment file fragment_fixed_sizes.txt and the dictionary of chromosomes' sizes chromosome_sizes.npy , the reads can be aligned using the following command:

hicberg alignment  --output=DIR [--cpus=1] [--max-alignments=None] [--sensitivity="very-sensitive"] [--index=index]
  [--verbosity] <genome> <forward> <reverse>

For example to align reads in a folder named "test" previously created on the desktop with 8 threads:

hicberg alignment -o ~/Desktop/test/ --cpus 8  <genome.fa>  <reads_for.fq>  <rev_reads.fq>

If the user have already created the index, the following command can be used:

hicberg alignment -o ~/Desktop/test/ --cpus 8 --index index_prefix  <genome.fa>  <reads_for.fq>  <rev_reads.fq>

The files XXX.btl2, 1.sorted.bam and 2.sorted.bam will be created.

Classification

hicberg classify --output=DIR [--mapq=35] 

Considering the previous example, to classify the reads in a folder named "test" previously created on the desktop:

hicberg classify -o ~/Desktop/test/

The files created are:

  • group0.1.bam and group0.2.bam : bam files containing the reads of group0 e.g. where at least one read of the pair is unaligned.
  • group1.1.bam and group1.2.bam : bam files containing the reads of group1 e.g. where both reads of the pair are aligned only one time.
  • group2.1.bam and group2.2.bam : bam files containing the reads of group2 e.g. where at least one reads of the pair are aligned more than one time.

Pairs and matrix building

Build pairs

After having aligned the reads, the pairs file group1.pairs can be built using the following command:

hicberg build-pairs --output=DIR [--recover]

If the flag argument recover is used, the pairs file will be built from the last step of the analysis e.g. after having computed the statistics and re-attributed reads from group2 bam files.

Considering the previous example, to build the matrix in a folder named "test" previously created on the desktop:

hicberg build-pairs -o ~/Desktop/test/ 

The file group1.pairs will be created.

If the pairs file has to be built after reads of group2 reassignment, the following command can be used:

hicberg build-pairs -o ~/Desktop/test/ --recover

Thus, the built pairs file will be all_group.pairs.

Build matrix

After having aligned the reads and built the pairs file group1.pairs, the cooler matrix unrescued_map.cool can be built using the following command:

hicberg build-matrix  --output=DIR [--recover]

If the flag argument recover is used, the matrix file will be built from the last step of the analyis e.g. after having computed the statistics and re-attributed reads from group2 bam files.

Considering the previous example, to build the matrix in a folder named "test" previously created on the desktop:

hicberg build-matrix -o ~/Desktop/test/ 

The file unrescued_map.cool will be created.

If the cooler file has to be built after reads of group2 re-assignament, the following command can be used:

hicberg build-matrix -o ~/Desktop/test/ --recover

Thus, the built matrix file will be rescued_map.cool.

Statistics

After having aligned the reads and built the pairs file group1.pairs, the cooler matrix unrescued_map.cool, the statistical laws for the reassignment of the reads from group2 can be learnt by using the following command:

hicberg statistics --output=DIR [--bins=bins_number] [--circular=""] [--rate=1.0] [--mode="standard"]
[--kernel-size=11] [--deviation=0.5] [--balcklist=STR] <genome>

Considering the previous example, to get the statistical laws (with respect of ARIMA kit enzymes) and default parameters for density estimation, without sub-sampling the restriction map and without blacklisting regions and considering "chrM" as circular in a folder named "test" previously created on the desktop:

hicberg statistics  -e DpnII -e HinfI -c "chrM" -o ~/Desktop/test/ <genome.fa>

The statistical laws are going to be saved as:

  • xs.npy : dictionary containing the log binned genome as dictionary such as {chromosome: [log bins]}
  • uncuts.npy, loops.npy, weirds.npy : dictionary containing the distribution of uncuts, loops and weirds as dictionary such as {chromosome: [distribution]}
  • pseudo_ps.pdf : plot of the distribution of the pseudo ps, i.e. ps equivalent for trans-chromosomal cases, extracted from the reads of group1.
  • coverage: dictionary containing the coverage of the genome as dictionary such as {chromosome: [coverage]}
  • d1d2.npy: np.array containing the d1d2 law as dictionary such as [distribution]
  • density_map.npy : dictionary containing the density map as dictionary such as {chromosome_pair: [density map]}
Blacklisting regions

The user can specify regions to blacklist in the analysis. The regions to blacklist have to be specified in a bed file or a coma separated list of genomic coordinates considering UCSC format. The bed file has to be formatted as follow:

chr1 200000 220000
chr1 308000 314000
...
chr3 100000 120000

List of blacklisted regions provided as a string can be specified as follow:

chr1:200000-220000,chr1:308000-314000,...,chr3:100000-120000

So the learning of the statistical laws, as previously described, with blacklisting regions for P(s) computing can be done using the following command:

  • Using a bed file:
hicberg statistics  -e DpnII -e HinfI -c "chrM" -o ~/Desktop/test/ -B <blacklist.bed> <genome.fa>
  • Using a string:
hicberg statistics -o ~/Desktop/test/ -B "chr1:200000-220000,chr1:308000-314000,chr3:100000-120000" <genome>

Omics mode

The omics mode can be used to reconstruct any pair-ended sequenced genomic data. For such, the model used relies on the $P(s)$ and the coverage. After reconstruction, the data is located in the folder statistics/. The files generated are:

  • coverage.bed : bed file containing the coverage of the genome
  • coverage.bedgraph : bedgraph file containing the coverage of the genome
  • signal.bw : bigwig file containing the reconstructed signal (pair-ended data)

The omics mode can be used using the following command:

hicberg pipeline  -o <out folder> -t <cpus> -m omics  -s <alignment sensitivity>   genome.fa  reads_for.fq  rev_reads.fq 

Reconstruction

After having learnt the statistical laws (based on reads of group1), the reads from group2 can be reassigned using the following command:

hicberg rescue --output=DIR  [--enzyme=["DpnII", "HinfI"]] [--mode="full"] [--cpus=1] <genome>

Considering the previous example, to reassign the reads from group2 in a folder named "test" previously created on the desktop:

hicberg rescue -e DpnII -e HinfI -o ~/Desktop/test/ <genome>

The files group2.1.rescued.bam and group2.2.rescued.bam will be created.

Plot

To plot all the information about the analysis, the following command can be used:

hicberg plot --output=DIR  [--bins=2000] <genome>

Considering all the previous analysis, with 2000bp as bin size to plot all the information in a folder named "test" previously created on the desktop:

hicberg plot -o ~/Desktop/test/ -b 2000 <genome>

The plots created are:

  • patterns_distribution_X.pdf : plot of the distribution of the different patterns extracted from the reads of group1.
  • coverage_X.png : plot of the genome coverage extracted from the reads of group1.
  • d1d2.pdf : plot of the d1d2 law extracted from the reads of group1.
  • density_X-Y.pdf : plot of the density map extracted from the reads of group1.
  • Couple_sizes_distribution.pdf : plot of the distribution of the plausible couple sizes extracted from the reads of group2.
  • chr_X.pdf : plot of the original map and reconstructed one for each chromosome.

Tidy folder

To tidy the folder, the following command can be used:

hicberg tidy --output=DIR

Considering all the previous analysis, to tidy the folder in a folder named "test" previously created on the desktop:

hicberg tidy -o ~/Desktop/test/ 

After tidying the folders architecture will be the following:

/home/sardine/Bureau/sample_name
├── alignments
│   ├── group0.1.bam
│   ├── group0.2.bam
│   ├── group1.1.bam
│   ├── group1.2.bam
│   ├── group2.1.bam
│   ├── group2.1.rescued.bam
│   ├── group2.2.bam
│   └── group2.2.rescued.bam
├── chunks
│   ├── chunk_for_X.bam
│   └── chunk_rev_X.bam
├── contacts
│   ├── matrices
│   │   ├── rescued_map.cool
│   │   └── unrescued_map.cool
│   └── pairs
│       ├── all_group.pairs
│       └── group1.pairs
├── fragments_fixed_sizes.txt
├── chromosome_sizes.txt
├── index
│   ├── index.1.bt2l
│   ├── index.2.bt2l
│   ├── index.3.bt2l
│   ├── index.4.bt2l
│   ├── index.rev.1.bt2l
│   └── index.rev.2.bt2l
├── plots
│   ├── chr_X.pdf
│   ├── Couple_sizes_distribution.pdf
│   ├── coverage_X.pdf
│   ├── patterns_distribution_X.pdf
│   ├── pseudo_ps.pdf
│   └── density_X-Y.pdf
└── statistics
    ├── chromosome_sizes.npy
    ├── coverage.npy
    ├── d1d2.npy
    ├── density_map.npy
    ├── dist.frag.npy
    ├── loops.npy
    ├── restriction_map.npy
    ├── trans_ps.npy
    ├── uncuts.npy
    ├── weirds.npy
    ├── xs.npy
    ├── chromsome_sizes.bed(*) 
    ├── coverage.bed(*)
    ├── coverage.bedgraph(*)
    └── signal.bw(*)

(*) : files generated by hicberg in omics mode

Chaining pipeline steps

It is possible to chain the different steps of the pipeline by using the following command:

# 0. Prepare analysis
hicberg pipeline  -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage fastq  --exit-stage bam <genome> <input1> <input2>

# 1. Align reads
hicberg pipeline -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage bam  --exit-stage groups <genome> <input1> <input2>

# 2. Group reads
hicberg pipeline -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage groups  --exit-stage build <genome> <input1> <input2>

# 3. Build pairs & cool
hicberg pipeline -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage build  --exit-stage stats <genome> <input1> <input2>

# 4. Compute statistics
hicberg pipeline -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage stats  --exit-stage rescue <genome> <input1> <input2>

# 5. Reassign ambiguous reads, build pairs & cool then get results
hicberg pipeline -o --output=DIR  [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME  --start-stage rescue  --exit-stage final <genome> <input1> <input2>

Evaluating the model

Evaluating the model

General principle

HiC-BERG provide a method to evaluate the inferred reconstructed maps. The evaluation is based on first a split of the original uniquely mapping reads into two sets :

  • group1.X.out.bam : alignment files where selected reads are complementary with the group1.X.in.bam from the alignment files. Thus the reads are uniquely mapped (as the original alignment files) and used to learn the statistics for read couple inference.
  • group1.X_in.bam: alignment files where selected reads are duplicated between all possible genomic intervals defined by the user. Thus ambiguity is introduced in the alignment of the reads.

Hence, the most plausible couple from fake duplicated reads in group1.X.in.bam is inferred and the corresponding Hi-C contact matrix is built and compared to the one built from the original reads in group1.X.bam (unrescued_map.cool). The two matrices are then compared (modified bins through duplication) compared using the Pearson correlation coefficient that relates the quality of the reconstruction. The closest the coefficient is to 1, the better the reconstruction is.

The evaluation pipeline can be illustrated as follow :

HiC-BERG Evaluation

The genomic intervals used to duplicate the reads are defined by the user through the definition of source and target intervals.The source interval is set through the parameters --chromosome , --position and --bins. The target intervals are set through the parameters --strides and eventually --trans_chromosome with --trans_position.

So in a basic example considering only one chromosome and two artificial duplicated sequence, it is necessary to define a source interval corresponding to the chromosome of interest and a target interval corresponding to the duplicated sequence. The source interval is defined by the chromosome name (chromosome), the position (--position) and the width of the interval in number of bins (bins).

Thus the source interval is defined as $[chromosome:position-binsbin size ; chromosome:position+bins * binsize]$ and the target interval as $[chromosome:(position-binsbinsize) + stride ; chromosome:(position+bins*binsize) + stride]$.

For example, if the source interval is chromosome 1, position 68000 and strides set as [0, 50000] with a bin size of 2000bp, the source interval is defined as chr1:68000-70000 and the target interval is defined as chr1:118000-120000.

The files group1.1.in.bam, group1.2.in.bam, group1.1.out.bam and group1.2.out.bam will be created.

The duplicated aligned reads should look like this :

group1.1.in.bam :

NS500150:487:HNLLNBGXC:1:11101:1071:2862        0       chr1    69227   255     35M     *       0       0       ATCTGTTGTGNNGAAGGATACTCCCAGAACTCGTT     AAAAAEEEAE##EEEEEEEEEEEEEEEEEEEEEEE     AS:i:-2 XN:i:0  XM:i:2  XO:i:0  NM:i:2  MD:Z:10G0A23    YT:Z:UU   XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:1071:2862        0       chr1    119227  255     35M     *       0       0       ATCTGTTGTGNNGAAGGATACTCCCAGAACTCGTT     AAAAAEEEAE##EEEEEEEEEEEEEEEEEEEEEEE     AS:i:-2 XN:i:0  XM:i:2  XO:i:0  NM:i:2  MD:Z:10G0A23    YT:Z:UU   XG:i:230218     XF:Z:Fake
NS500150:487:HNLLNBGXC:1:11101:3001:19423       16      chr1    118866  255     35M     *       0       0       GAAAAAGGATTGGTCCAATAAGTGGGAAAAAAGAT     EEAEEEAEE/EAE/EEEEEEEE/EEEEEE6AAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:3001:19423       16      chr1    68866   255     35M     *       0       0       GAAAAAGGATTGGTCCAATAAGTGGGAAAAAAGAT     EEAEEEAEE/EAE/EEEEEEEE/EEEEEE6AAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218       XF:Z:Fake
NS500150:487:HNLLNBGXC:1:11101:4986:15168       16      chr1    69239   255     35M     *       0       0       GAAGGATACTCCCAGAACTCGTTACTGTCTGGACT     EEEEEEEEEEEEEEEEEEEEEEEAEEEAEEAAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:4986:15168       16      chr1    119239  255     35M     *       0       0       GAAGGATACTCCCAGAACTCGTTACTGTCTGGACT     EEEEEEEEEEEEEEEEEEEEEEEAEEEAEEAAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218       XF:Z:Fake
...

group1.2.in.bam :

NS500150:487:HNLLNBGXC:1:11101:1071:2862        16      chr1    103994  255     35M     *       0       0       TGCTTTTTTGGGATTGGGAATGATTTTTCCTCCTT     EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:3001:19423       16      chr1    121776  255     35M     *       0       0       GGTCAAGAAATGGTTTTCACAGGCGAAATCATTGG     EEEEEEEEEEE<EEEE/EEEEEEEEEEAEEAAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:4986:15168       0       chr1    86626   255     35M     *       0       0       GATCTAGGGGTACCTCCTCGGGAAACATCCAGCCC     AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE     AS:i:0  XN:i:0  XM:i:0  XO:i:0  NM:i:0  MD:Z:35 YT:Z:UU XG:i:230218
...

The XF:Z:Fake signed read duplication.

In the case of trans chromosomal duplications, the user has to specify the names of trans chromosomes and the relative positions for each trans chromosome selected. The user has to provide as many position as the number of chromosome names selected.

For example, if the source interval is chromosome 1, position 100000 and strides set as [0, 50000] with a bin size of 2000bp and the specified trans chromosomes and trans positions are respectively [chr2, chr8] and [70000, 130000], the source interval is defined as chr1:100000-102000 and the target intervals are defined as chr1:150000-152000, chr2:70000-72000 and chr8:130000-132000.

The stride is the number of bins between the first bin of the source interval and the first bin of the target interval. The stride can be negative or positive. If the stride is negative, the target interval is located before the source interval. If the stride is positive, the target interval is located after the source interval. The stride can be set to 0, in this case the target interval is the same as the source interval. The target interval can be located on the same chromosome as the source interval or on another chromosome. In this case, the chromosome name and the position of the first bin of the target interval must be specified. All the parameters --position, --strides, --trans-chromosome and --trans-position should be provided as coma separated lists.

The benchmark can be performed considering several modes. The modes are defined by the parameter --modes. The modes are defined as a list of strings separated by comas. The modes are the same as the one used for the reconstruction :

  • full
  • random
  • ps
  • cov
  • d1d2
  • density
  • standard (ps and cov)
  • one_enzyme (ps, cov and d1d2)
  • omics (ps, cov)

N.B : depending on the modes selected for the benchmark, if one of the mode is not included in the list of modes selected for the reconstruction, the reconstruction will not be performed for this mode, and the corresponding statistics will not be computed.

The evaluation can be run using the following command :

hicberg benchmark  --output=DIR [--chromosome=STR] [--position=INT] [--trans-chromosome=STR][--trans-position=INT] [--stride=INT] [--bins=INT] [--auto=INT] [--rounds=INT] [--magnitude=FLOAT] [--modes=STR] [--pattern=STR] [--threshold=FLOAT] [--jitter=INT] [--trend] [--top=INT] [--genome=STR][--force]

Considering a benchmark with 4 artificially duplicated sequences set at chr1:100000-102000 (source), chr1:200000-202000 (target 1), chr4:50000-52000 (target 2) and chr7:300000-302000 (target 3), with 2000bp as bin size and considering full and ps_only modes to get the performance of the reconstructions considering a folder named "test" previously created on the desktop containing the original alignment files and the unreconstructed maps, the command line is the following :

hicberg benchmark  -o ~/Desktop/test/ -c chr1 -p 100000 -s 0,100000 -C chr4,chr7 -P 50000,30000 -m full,ps_only -g <genome>

It is also possible to let the source and target intervals being picked at random. However in such cases, the empty bins are not considered in the evaluation. The random mode is activated by setting the parameter --auto to the number of desired artificially duplicated sequences.

Thus, considering a benchmark with 100 artificially duplicated sequences , with 2000bp as bin size and considering full and ps_only modes to get the performance of the reconstructions considering a folder named "test" previously created on the desktop containing the original alignment files and the unreconstructed maps, the command line is the following :

hicberg benchmark  -o ~/Desktop/test/ -a 100 -m full,ps_only  -g <genome.fa>

Pattern based evaluation

A pattern based strategy can be set. Such strategy is similar to the strategy where genomic intervals have to be defined as mentioned above where user has to provide th genomic coordinates of the intervals for read selections whereas in pattern base strategy, only pattern type has to be specified to set the genomic intervals. The pattern is defined by the parameter --pattern.

Such patterns are going to be detected from the original Hi-C map using Chromosight. Then genomic coordinates of the detected patterns are going to be used to select the reads for the evaluation. The number of duplication are going to be adjusted by specifying the chromosome name with --chromosome parameter, the --threshold parameter to set the minimum Pearson score to consider a pattern as detected and the --top parameter to eventually keep the top k% to the remaining detected patterns. Thus, the genomic intervals are going to be defined as the genomic coordinates of the detected patterns.

Thus, the same read strategy will be applied. After reconstruction, the evaluation will be performed using the Pearson correlation coefficient between the original and reconstructed bins selected from the original map. Furthermore, a second pattern detection with Chromosight will be performed on the reconstructed map. Then, the precision, recall and F1 score will be computed to evaluate the reconstruction, by comparing the number of retrieved patterns while identifying false positives and false negatives.

N.B. : Because of the stochasticity of Chromosight while splitting the Hi-C map for pattern detection, some patterns can be considered as false positives and negatives because their coordinates after reconstruction are aside of the original one. We recommend using the --jitter parameter to allow pattern to be considered as retrieved post-reconstruction if they are detected at j bins from the original coordinates.

Considering a benchmark based on loops patterns on chromosome 7 with a threshold of 0.5 with all the detected patterns after thresholding (i.e. 100% rate) and a jitter of 0 with detrend, in full mode, the command line is the following :

hicberg benchmark  -o ~/Desktop/test/ -c chr7 -p 100000 -S loops -t 0.5 -k 100 -j 0 -T -m random -g <genome.fa>

Python usage

All components of the hicberg program can be used as python modules. See the documentation on readthedocs. The expected contact map format for the library is a simple COOL file, and the objects handled by the library are simple Numpy arrays through Cooler. The various submodules of hicberg contain various utilities.

import hicberg.io #Functions for I/O and folder management.
import hicberg.align #Functions for sequence alignment steps
import hicberg.utils #Functions for handling reads and alignment
import hicberg.statistics #Functions for extract and create statistical models
import hicberg.omics #Functions to treat non Hi-C data. 
import hicberg.pipeline #Functions to run end to end Hi-C map reconstruction.

Connecting the modules

All the steps described here are handled automatically when running the hicberg pipeline. But if you want to connect the different modules manually, the intermediate input and output files can be processed using some python scripting.

File formats

  • pair files: This format is used for all intermediate files in the pipeline and is also used by hicberg build_matrix. It is a tab-separated format holding information about Hi-C pairs. It has an official specification defined by the 4D Nucleome data coordination and integration center.
    • readID: Read (pair) identifier.

    • chr1: Chromosome identifier of the forward read of the pair.

    • pos1: 0-based position of the forward mate, in base pairs.

    • chr2: Chromosome identifier of the reverse read of the pair.

    • pos2: 0-based position of the reverse mate, in base pairs.

    • strand1: Orientation of the aligned forward mate.

    • strand2: Orientation of the aligned reverse mate.

## pairs format v1.0
#columns: readID chr1 pos1 strand1 chr2 pos2 strand2
#chromsize: chr1 230218
#chromsize: chr2 813184
NS500150:487:HNLLNBGXC:1:11101:3066:7109        chr2    683994  chr2    684725  -       -
NS500150:487:HNLLNBGXC:1:11101:6114:4800        chr2    795379  chr2    796279  +       +
NS500150:487:HNLLNBGXC:1:11101:6488:14927       chr2    379433  chr2    379138  -       +
...
  • cool files: This format is used to store genomic interaction data such as Hi-C contact matrices. These file can be handled using cooler Python library.

  • npy files: This format is used to store dictionaries containing information about genomic coordinates, binning or statistical laws. Dictionaries are stores with chromosome as key and arrays as values. Such file can be handled using numpy Python library.

    • chromosome_sizes.npy : This file is used to store the size of each chromosome. Structure is the following : {chromosome: size}
    • xs.npy : This file is used to store the log binned genome. Structure is the following : {chromosome: [log bins]} with log bins a list of integers.
    • uncuts.npy : This file is used to store the distribution of uncuts. Structure is the following : {chromosome: [distribution]} with distribution a list of integers.
    • loops.npy : This file is used to store the distribution of loops. Structure is the following : {chromosome: [distribution]} with distribution a list of integers.
    • weirds.npy : This file is used to store the distribution of weirds. Structure is the following : {chromosome: [distribution]} with distribution a list of integers.
    • pseudo_ps.npy : This file is used to store the distribution of pseudo ps. Structure is the following : {(chrom1, chrom2): [map]} with (chrom1, chrom2) a tuple of chromosomes where chrom1 is different than chrom2 and map a float value.
    • coverage.npy : This file is used to store the coverage of the genome. Structure is the following : {chromosome: [coverage]} with coverage a list of integers.
    • d1d2.npy : This file is used to store the d1d2 law. Structure is the following : [distribution] with distribution a list of integers.
    • density_map.npy : This file is used to store the density map. Structure is the following : {(chrom1, chrom2): [density map]} with (chrom1, chrom2) a tuple of chromosomes density map a 2D numpy array.
  • bt2l files: Thi format is used to store index of genomes performer using Bowtie2.

  • bam files: This format is used to built analyses on, by several functions of hicberg. It is a compressed standard alignment format file providing multiple information about read alignments performer by Bowtie2. Such files can be handled through Samtools and it's Python wrapper PySam. More details about SAM and BAM format can be found here.

  • bed files: This format is used to store genomic intervals. It is a tab-separated format holding information about genomic intervals. It is a standard format used by the UCSC genome browser.

chr4    150    200
chr6    300    400
chr4    800    900
chr2    680000 684000
...
  • chromosome_sizes.bed : This file is used to store the size of each chromosome. Structure is the following : chromosome start end

  • coverage.bed : This file is used to store the coverage of the genome. Structure is the following : chromosome start end coverage

  • coverage.bedgraph : This file is used to store the coverage of the genome. Structure is the following : chromosome start end coverage

  • signal.bw : This file is used to store the coverage of the genome. Structure is the following : chromosome start end coverage

  • fragments_fixed_sizes.txt:

    • chrom: Chromosome identifier. Order should be the same as in pairs files.

    • start: 0-based start of fragment, in base pairs.

    • end: 0-based end of fragment, in base pairs.

chrom start end
chr1   0       2000
chr1   2000    4000
chr1   4000    6000
...
chr1   14000   16000
...
chr2   0   2000
chr2  2000  4000
...

Contributing

All contributions are welcome, in the form of bug reports, suggestions, documentation or pull requests. We use the Numpy standard for docstrings when documenting functions.

The code formatting standard we use is black, with --line-length=79 to follow PEP8 recommendations. We use pytest with the pytest-doctest and pytest-pylint plugins as our testing framework. Ideally, new functions should have associated unit tests, placed in the tests folder. To test the code, you can run:

coverage run --source=hicberg -m pytest -v tests --cov-report=xml

Authors

Citation

About

Statistical profiling based program for contact (Hi-C) and pair ended genomic data reconstruction

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published