Merge pull request #93 from apriha/develop
v4.1.0
apriha authored Apr 14, 2021
2 parents 7d9fc5c + 5c6c503 commit 5a47dc0
Showing 12 changed files with 769 additions and 156 deletions.
17 changes: 1 addition & 16 deletions .github/workflows/ci.yml
@@ -74,9 +74,6 @@ jobs:
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Set default for downloads
shell: bash
run: echo "DOWNLOADS_ENABLED=false" >> $GITHUB_ENV
- name: Determine if downloads are enabled for this job
# for testing, limit downloads from the resource servers to only the selected job for
# PRs and the master branch; note that the master branch is tested weekly via `cron`,
@@ -94,21 +91,9 @@ jobs:
echo "DOWNLOADS_ENABLED=true" >> $GITHUB_ENV
fi
- name: Install dependencies
shell: bash
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
run: |
pip install pytest-cov awscli
pip install pytest-cov
pip install .
if [[ $DOWNLOADS_ENABLED == "false" ]]; then
# use cached resources on Amazon S3
aws s3 cp s3://lineage-resources/resources.tar.gz resources.tar.gz
if [[ -f resources.tar.gz ]]; then
tar -xzf resources.tar.gz
rm resources.tar.gz
fi
fi
- name: Ensure Python and source code are on same drive (Windows)
if: ${{ matrix.os == 'windows-latest' }}
shell: cmd
Expand Down
6 changes: 6 additions & 0 deletions CONTRIBUTING.rst
@@ -56,6 +56,12 @@ To set up ``lineage`` for local development:

$ pipenv run pytest --cov-report=html --cov=lineage tests

.. note:: Downloads during tests are disabled by default. To enable downloads, set
the environment variable ``DOWNLOADS_ENABLED=true``.

.. note:: If you receive errors when running the tests, you may need to specify the temporary
directory with an environment variable, e.g., ``TMPDIR="/path/to/tmp/dir"``.

.. note:: After running the tests, a coverage report can be viewed by opening
``htmlcov/index.html`` in a browser.
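
The ``DOWNLOADS_ENABLED`` note above describes an opt-in switch that is off by default. A minimal sketch of how test code might read such a switch is shown here; the helper name ``downloads_enabled`` and its case-insensitive comparison are assumptions for illustration, not taken from the project's test suite.

```python
import os


def downloads_enabled(environ=None):
    """Return True only when DOWNLOADS_ENABLED is set to "true".

    Hypothetical helper for illustration; the project's tests may read
    the variable differently.
    """
    env = os.environ if environ is None else environ
    # Downloads are disabled by default, so a missing variable means False.
    value = env.get("DOWNLOADS_ENABLED", "false")
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() == "true"


if __name__ == "__main__":
    print(downloads_enabled({}))
    print(downloads_enabled({"DOWNLOADS_ENABLED": "true"}))
```

Passing a plain dictionary instead of ``os.environ`` keeps the check easy to exercise without changing the real environment.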

Expand Down
45 changes: 26 additions & 19 deletions README.rst
@@ -9,9 +9,9 @@ lineage

Capabilities
------------
- Compute centiMorgans (cMs) of shared DNA between individuals using the HapMap Phase II genetic map
- Find shared DNA and genes between individuals
- Compute centiMorgans (cMs) of shared DNA using a variety of genetic maps (e.g., HapMap Phase II, 1000 Genomes Project)
- Plot shared DNA between individuals
- Determine genes shared between individuals (i.e., genes transcribed from shared DNA segments)
- Find discordant SNPs between child and parent(s)
- Read, write, merge, and remap SNPs for an individual via the `snps <https://github.com/apriha/snps>`_ package

@@ -139,20 +139,21 @@ Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets
Find Shared DNA
'''''''''''''''
``lineage`` uses the probabilistic recombination rates throughout the human genome from the
`International HapMap Project <https://www.genome.gov/10001688/international-hapmap-project/>`_ to
compute the shared DNA (in centiMorgans) between two individuals. Additionally, ``lineage``
denotes when the shared DNA is shared on either one or both chromosomes in a pair. For example,
when siblings share a segment of DNA on both chromosomes, they inherited the same DNA from their
mother and father for that segment.
`International HapMap Project <https://www.genome.gov/10001688/international-hapmap-project/>`_
and the `1000 Genomes Project <https://www.internationalgenome.org>`_ to compute the shared DNA
(in centiMorgans) between two individuals. Additionally, ``lineage`` denotes when the shared DNA
is shared on either one or both chromosomes in a pair. For example, when siblings share a segment
of DNA on both chromosomes, they inherited the same DNA from their mother and father for that
segment.

With that background, let's find the shared DNA between the ``User662`` and ``User663`` datasets,
calculating the centiMorgans of shared DNA and plotting the results:

>>> results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100)
Downloading resources/genetic_map_HapMapII_GRCh37.tar.gz
Downloading resources/cytoBand_hg19.txt.gz
Saving output/shared_dna_User662_User663.png
Saving output/shared_dna_one_chrom_User662_User663_GRCh37.csv
Saving output/shared_dna_User662_User663_HapMap2.png
Saving output/shared_dna_one_chrom_User662_User663_GRCh37_HapMap2.csv

Notice that the centiMorgan and SNP thresholds for each DNA segment can be tuned. Additionally,
notice that two files were downloaded to facilitate the analysis and plotting - future analyses
@@ -177,11 +178,11 @@ created; these files are detailed in the documentation and their generation can
``save_output=False`` argument. In this example, the output files consist of a CSV file that
details the shared segments of DNA on one chromosome and a plot that illustrates the shared DNA:

.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663.png
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png

Find Shared Genes
'''''''''''''''''
The `Central Dogma of Molecular Biology <https://www.nature.com/nature/focus/crick/pdf/crick227.pdf>`_
The `Central Dogma of Molecular Biology <https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology>`_
states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into
mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology
after all), but generally, one mRNA produces one protein, and the mRNA / protein is considered a
@@ -205,21 +206,27 @@ Loading SNPs('resources/4583.ftdna-illumina.3482.csv.gz')
>>> user4584 = l.create_individual('User4584', 'resources/4584.ftdna-illumina.3483.csv.gz')
Loading SNPs('resources/4584.ftdna-illumina.3483.csv.gz')

Now let's find the shared genes:
Now let's find the shared genes, specifying a
`population-specific <https://www.internationalgenome.org/faq/which-populations-are-part-your-study/>`_
1000 Genomes Project genetic map (e.g., as predicted by `ezancestry <https://github.com/arvkevi/ezancestry>`_!):

>>> results = l.find_shared_dna([user4583, user4584], shared_genes=True)
>>> results = l.find_shared_dna([user4583, user4584], shared_genes=True, genetic_map="CEU")
Downloading resources/CEU_omni_recombination_20130507.tar
Downloading resources/knownGene_hg19.txt.gz
Downloading resources/kgXref_hg19.txt.gz
Saving output/shared_dna_User4583_User4584.png
Saving output/shared_dna_one_chrom_User4583_User4584_GRCh37.csv
Saving output/shared_dna_two_chroms_User4583_User4584_GRCh37.csv
Saving output/shared_genes_one_chrom_User4583_User4584_GRCh37.csv
Saving output/shared_genes_two_chroms_User4583_User4584_GRCh37.csv
Saving output/shared_dna_User4583_User4584_CEU.png
Saving output/shared_dna_one_chrom_User4583_User4584_GRCh37_CEU.csv
Saving output/shared_dna_two_chroms_User4583_User4584_GRCh37_CEU.csv
Saving output/shared_genes_one_chrom_User4583_User4584_GRCh37_CEU.csv
Saving output/shared_genes_two_chroms_User4583_User4584_GRCh37_CEU.csv

The plot that illustrates the shared DNA is shown below. Note that in addition to outputting the
shared DNA segments on either one or both chromosomes, the shared genes on either one or both
chromosomes are also output.

.. note:: Shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic
maps since the X chromosome is not included in these genetic maps.

In this example, there are 15,976 shared genes on both chromosomes transcribed from 36 segments
of shared DNA:

@@ -228,7 +235,7 @@ of shared DNA:
>>> len(results['two_chrom_shared_dna'])
36

.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584.png
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584_CEU.png
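
The ``genetic_map`` argument used in this README example accepts ``HapMap2`` or one of the 1000 Genomes Project population codes listed in the ``find_shared_dna`` docstring in this commit. The sketch below shows one way a caller might check a name before use and decide whether to expect X chromosome results. The helper ``x_chromosome_skipped`` is hypothetical, and treating ``HapMap2`` as covering the X chromosome is an inference from the note above rather than something the diff states directly.

```python
# Names taken from the updated ``find_shared_dna`` docstring.
HAPMAP2 = "HapMap2"
POPULATION_MAPS = frozenset(
    {
        "ACB", "ASW", "CDX", "CEU", "CHB", "CHS", "CLM", "FIN", "GBR", "GIH",
        "IBS", "JPT", "KHV", "LWK", "MKK", "MXL", "PEL", "PUR", "TSI", "YRI",
    }
)


def x_chromosome_skipped(genetic_map):
    """Return True if shared DNA is not computed on the X chromosome.

    Hypothetical helper: the 1000 Genomes Project maps do not include
    the X chromosome, so only ``HapMap2`` is assumed to cover it.
    """
    if genetic_map == HAPMAP2:
        return False
    if genetic_map in POPULATION_MAPS:
        return True
    # Anything else is not a name listed in the docstring.
    raise ValueError(f"unknown genetic map: {genetic_map!r}")


if __name__ == "__main__":
    for name in ("HapMap2", "CEU"):
        print(name, x_chromosome_skipped(name))
```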

Documentation
-------------
Expand Down
Binary file removed docs/images/shared_dna_User4583_User4584.png
Binary file not shown.
File renamed without changes
33 changes: 17 additions & 16 deletions docs/output_files.rst
@@ -57,25 +57,26 @@ In the filenames below, ``name1`` is the name of the first
:class:`~lineage.individual.Individual` and ``name2`` is the name of the second
:class:`~lineage.individual.Individual`. (If more individuals are compared, all
:class:`~lineage.individual.Individual` names will be included in the filenames and plot titles
using the same conventions.)
using the same conventions.) Additionally, ``genetic_map`` corresponds to the genetic map used
in the calculations of shared DNA, specified as a parameter to :meth:`~lineage.Lineage.find_shared_dna`.

.. note:: Genetic maps do not have recombination rates for the Y chromosome since the Y
chromosome does not recombine. Therefore, shared DNA will not be shown on the Y
chromosome.

shared_dna_<name1>_<name2>.png
``````````````````````````````
shared_dna_<name1>_<name2>_<genetic_map>.png
````````````````````````````````````````````
This plot illustrates shared DNA (i.e., no shared DNA, shared DNA on one chromosome, and shared
DNA on both chromosomes). The centromere for each chromosome is also detailed. Two examples of
this plot are shown below.

.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663.png
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png

In the above plot, note that the two individuals only share DNA on one chromosome. In this plot,
the larger regions where "No shared DNA" is indicated are due to SNPs not being available in
those regions (i.e., SNPs were not tested in those regions).

.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584.png
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584_CEU.png

In the above plot, the areas where "No shared DNA" is indicated are the regions where SNPs were
not tested or where DNA is not shared. The areas where "One chromosome shared" is indicated are
@@ -85,8 +86,8 @@ shared" is indicated are regions where the individuals share DNA on both chromos
Note that the regions where DNA is shared on both chromosomes are a subset of the regions where
one chromosome is shared.

shared_dna_one_chrom_<name1>_<name2>_GRCh37.csv
```````````````````````````````````````````````
shared_dna_one_chrom_<name1>_<name2>_GRCh37_<genetic_map>.csv
`````````````````````````````````````````````````````````````
If DNA is shared on one chromosome, a CSV file details the shared segments of DNA.

======= ===========
@@ -100,8 +101,8 @@ cMs CentiMorgans of matching DNA segment
snps Number of SNPs in matching DNA segment
======= ===========

shared_dna_two_chroms_<name1>_<name2>_GRCh37.csv
````````````````````````````````````````````````
shared_dna_two_chroms_<name1>_<name2>_GRCh37_<genetic_map>.csv
``````````````````````````````````````````````````````````````
If DNA is shared on two chromosomes, a CSV file details the shared segments of DNA.

======= ===========
@@ -128,11 +129,11 @@ In the filenames below, ``name1`` is the name of the first
:class:`~lineage.individual.Individual` names will be included in the filenames using the same
convention.)

shared_genes_one_chrom_<name1>_<name2>_GRCh37.csv
`````````````````````````````````````````````````
shared_genes_one_chrom_<name1>_<name2>_GRCh37_<genetic_map>.csv
```````````````````````````````````````````````````````````````
If DNA is shared on one chromosome, this file details the genes shared between the individuals
on at least one chromosome; these genes are located in the shared DNA segments specified in
`shared_dna_one_chrom_<name1>_<name2>_GRCh37.csv`_.
`shared_dna_one_chrom_<name1>_<name2>_GRCh37_<genetic_map>.csv`_.

=========== ============
Column* Description*
@@ -151,10 +152,10 @@ description Description
\* `UCSC Genome Browser <http://genome.ucsc.edu>`_ /
`UCSC Table Browser <http://genome.ucsc.edu/cgi-bin/hgTables>`_

shared_genes_two_chroms_<name1>_<name2>_GRCh37.csv
``````````````````````````````````````````````````
shared_genes_two_chroms_<name1>_<name2>_GRCh37_<genetic_map>.csv
````````````````````````````````````````````````````````````````
If DNA is shared on both chromosomes in a pair, this file details the genes shared between the
individuals on both chromosomes; these genes are located in the shared DNA segments specified in
`shared_dna_two_chroms_<name1>_<name2>_GRCh37.csv`_.
`shared_dna_two_chroms_<name1>_<name2>_GRCh37_<genetic_map>.csv`_.

The file has the same columns as `shared_genes_one_chrom_<name1>_<name2>_GRCh37.csv`_.
The file has the same columns as `shared_genes_one_chrom_<name1>_<name2>_GRCh37_<genetic_map>.csv`_.
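
The renamed files in this document all follow one pattern: the genetic map name is appended as a final suffix. A small sketch of that convention is given below. The function ``possible_output_files`` is hypothetical, and joining individual names with an underscore is an assumption based on the example filenames in the README; the CSV files are only written when the corresponding result is non-empty, so the list shows what may appear rather than what always will.

```python
def possible_output_files(names, genetic_map="HapMap2", shared_genes=False):
    """List the filenames that a shared DNA analysis may produce.

    Hypothetical helper that mirrors the naming convention documented
    above; it does not call ``lineage`` or touch the file system.
    """
    # Assumption: individual names are joined with underscores.
    individuals = "_".join(names)
    files = [
        f"shared_dna_{individuals}_{genetic_map}.png",
        f"shared_dna_one_chrom_{individuals}_GRCh37_{genetic_map}.csv",
        f"shared_dna_two_chroms_{individuals}_GRCh37_{genetic_map}.csv",
    ]
    if shared_genes:
        files.append(
            f"shared_genes_one_chrom_{individuals}_GRCh37_{genetic_map}.csv"
        )
        files.append(
            f"shared_genes_two_chroms_{individuals}_GRCh37_{genetic_map}.csv"
        )
    return files


if __name__ == "__main__":
    for name in possible_output_files(
        ["User4583", "User4584"], genetic_map="CEU", shared_genes=True
    ):
        print(name)
```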
54 changes: 41 additions & 13 deletions src/lineage/__init__.py
@@ -280,12 +280,13 @@ def find_shared_dna(
snp_threshold=1100,
shared_genes=False,
save_output=True,
genetic_map="HapMap2",
):
""" Find the shared DNA between individuals.
Computes the genetic distance in centiMorgans (cMs) between SNPs using the HapMap Phase II
GRCh37 genetic map. Applies thresholds to determine the shared DNA. Plots shared DNA.
Optionally determines shared genes (i.e., genes that are transcribed from the shared DNA).
Computes the genetic distance in centiMorgans (cMs) between SNPs using the specified genetic
map. Applies thresholds to determine the shared DNA. Plots shared DNA. Optionally determines
shared genes (i.e., genes transcribed from the shared DNA).
All output is saved to the output directory as `CSV` or `PNG` files.
@@ -300,6 +301,16 @@ def find_shared_dna(
determine shared genes
save_output : bool
specifies whether to save output files in the output directory
genetic_map : {'HapMap2', 'ACB', 'ASW', 'CDX', 'CEU', 'CHB', 'CHS', 'CLM', 'FIN', 'GBR', 'GIH', 'IBS', 'JPT', 'KHV', 'LWK', 'MKK', 'MXL', 'PEL', 'PUR', 'TSI', 'YRI'}
genetic map to use for computation of shared DNA; `HapMap2` corresponds to the HapMap
Phase II genetic map from the
`International HapMap Project <https://www.genome.gov/10001688/international-hapmap-project/>`_
and all others correspond to the
`population-specific <https://www.internationalgenome.org/faq/which-populations-are-part-your-study/>`_
genetic maps generated from the
`1000 Genomes Project <https://www.internationalgenome.org>`_ phased OMNI data.
Note that shared DNA is not computed on the X chromosome with the 1000 Genomes
Project genetic maps since the X chromosome is not included in these genetic maps.
Returns
-------
@@ -339,6 +350,18 @@ def find_shared_dna(
two_chrom_discrepant_snps,
)

genetic_map_dfs = self._resources.get_genetic_map(genetic_map)

if len(genetic_map_dfs) == 0:
return self._find_shared_dna_return_helper(
one_chrom_shared_dna,
two_chrom_shared_dna,
one_chrom_shared_genes,
two_chrom_shared_genes,
one_chrom_discrepant_snps,
two_chrom_discrepant_snps,
)

cols = ["genotype{}".format(str(i)) for i in range(len(individuals))]

df = individuals[0].snps
@@ -351,19 +374,17 @@ def find_shared_dna(

one_x_chrom = self._is_one_individual_male(individuals)

genetic_map = self._resources.get_genetic_map_HapMapII_GRCh37()

tasks = []

chroms_to_drop = []
for chrom in df["chrom"].unique():
if chrom not in genetic_map.keys():
if chrom not in genetic_map_dfs.keys():
chroms_to_drop.append(chrom)
continue

tasks.append(
{
"genetic_map": genetic_map[chrom],
"genetic_map": genetic_map_dfs[chrom],
# get positions for the current chromosome
"snps": pd.DataFrame(df.loc[(df["chrom"] == chrom)]["pos"]),
}
@@ -433,6 +454,7 @@ def find_shared_dna(
two_chrom_shared_dna,
one_chrom_shared_genes,
two_chrom_shared_genes,
genetic_map,
)

return self._find_shared_dna_return_helper(
@@ -479,6 +501,7 @@ def _find_shared_dna_output_helper(
two_chrom_shared_dna,
one_chrom_shared_genes,
two_chrom_shared_genes,
genetic_map,
):
cytobands = self._resources.get_cytoBand_hg19()

@@ -498,14 +521,17 @@ def _find_shared_dna_output_helper(
two_chrom_shared_dna,
cytobands,
os.path.join(
self._output_dir, "shared_dna_{}.png".format(individuals_filename)
self._output_dir,
f"shared_dna_{individuals_filename}_{genetic_map}.png",
),
"{} shared DNA".format(individuals_plot_title),
f"{individuals_plot_title} shared DNA",
37,
)

if len(one_chrom_shared_dna) > 0:
file = "shared_dna_one_chrom_{}_GRCh37.csv".format(individuals_filename)
file = (
f"shared_dna_one_chrom_{individuals_filename}_GRCh37_{genetic_map}.csv"
)
save_df_as_csv(
one_chrom_shared_dna,
self._output_dir,
@@ -516,7 +542,9 @@ def _find_shared_dna_output_helper(
)

if len(two_chrom_shared_dna) > 0:
file = "shared_dna_two_chroms_{}_GRCh37.csv".format(individuals_filename)
file = (
f"shared_dna_two_chroms_{individuals_filename}_GRCh37_{genetic_map}.csv"
)
save_df_as_csv(
two_chrom_shared_dna,
self._output_dir,
@@ -527,7 +555,7 @@ def _find_shared_dna_output_helper(
)

if len(one_chrom_shared_genes) > 0:
file = "shared_genes_one_chrom_{}_GRCh37.csv".format(individuals_filename)
file = f"shared_genes_one_chrom_{individuals_filename}_GRCh37_{genetic_map}.csv"
save_df_as_csv(
one_chrom_shared_genes,
self._output_dir,
@@ -537,7 +565,7 @@ def _find_shared_dna_output_helper(
)

if len(two_chrom_shared_genes) > 0:
file = "shared_genes_two_chroms_{}_GRCh37.csv".format(individuals_filename)
file = f"shared_genes_two_chroms_{individuals_filename}_GRCh37_{genetic_map}.csv"
save_df_as_csv(
two_chrom_shared_genes,
self._output_dir,
Expand Down
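
The ``find_shared_dna`` hunks above pair each chromosome in the SNP data with its entry in ``genetic_map_dfs`` and set aside chromosomes the selected map does not cover. A simplified sketch of that loop follows. It uses plain dictionaries and lists in place of the pandas DataFrames in the real code, and the function name ``split_chromosomes`` is hypothetical.

```python
def split_chromosomes(snp_positions, genetic_map_dfs):
    """Pair SNP positions with per-chromosome genetic maps.

    Simplified stand-in for the loop in ``find_shared_dna``: both
    arguments are dictionaries keyed by chromosome name.

    Returns a (tasks, chroms_to_drop) tuple.
    """
    tasks = []
    chroms_to_drop = []
    for chrom, positions in snp_positions.items():
        if chrom not in genetic_map_dfs:
            # The selected genetic map has no data for this chromosome.
            chroms_to_drop.append(chrom)
            continue
        tasks.append(
            {
                "genetic_map": genetic_map_dfs[chrom],
                "snps": positions,
            }
        )
    return tasks, chroms_to_drop


if __name__ == "__main__":
    snps = {"1": [100, 200], "X": [300], "MT": [5]}
    # A map without "X" behaves like the 1000 Genomes Project maps.
    tasks, dropped = split_chromosomes(snps, {"1": "map for chromosome 1"})
    print(len(tasks), dropped)
```

With an empty map dictionary every chromosome ends up in ``chroms_to_drop``; the real method avoids that work by returning empty results as soon as ``get_genetic_map`` yields nothing.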