Vcf support #308

Merged 34 commits on Oct 25, 2024
Commits
a9ae332
finalised changes in GenotypeData.hs
stschiff Aug 2, 2024
5a76fe5
in the middle of adapting to new GenotypeData
stschiff Aug 2, 2024
2137153
adapted forge
stschiff Aug 2, 2024
2c0908d
in the middle of fixing compiler bugs
stschiff Aug 5, 2024
7df42a4
fixed forge and genoconvert
stschiff Aug 5, 2024
1287f1d
in the middle of fixing rectify
stschiff Aug 5, 2024
4b83f0a
fixed Rectify
stschiff Aug 5, 2024
832d50d
fixed serve
stschiff Aug 5, 2024
fc7d6bc
fixed survey
stschiff Aug 5, 2024
b50e384
fixed validate
stschiff Aug 5, 2024
c8ec12c
fixed optparse module
stschiff Aug 5, 2024
d239c1c
fixed pedantic errors
stschiff Aug 5, 2024
f828d8c
test compile, but fail
stschiff Sep 6, 2024
af97954
fixed some test bugs
stschiff Sep 6, 2024
fe31593
Merge branch 'gzip-support-reading' into vcf_support
stschiff Sep 6, 2024
d95f866
Merge branch 'gzip-support-reading' into vcf_support
stschiff Sep 6, 2024
6054801
fixed more tests
stschiff Sep 6, 2024
c68c13d
updated golden tests
stschiff Sep 6, 2024
fa23dea
Merge branch 'master' into vcf_support
stschiff Sep 9, 2024
b63eb01
added VCF read test
stschiff Oct 23, 2024
3aef6a3
stylish-haskell
stschiff Oct 23, 2024
e269c58
moved getFormat to GenotypeData.hs
stschiff Oct 24, 2024
fb59083
added help texts for genofile input
stschiff Oct 24, 2024
abb2ff8
refactored test packages
stschiff Oct 24, 2024
3a376aa
added VCF to init golden-test
stschiff Oct 25, 2024
b9d2fcd
added genoconvert golden test with VCF
stschiff Oct 25, 2024
06dd173
added forge golden test with VCF
stschiff Oct 25, 2024
fe70c20
stylish-haskell
stschiff Oct 25, 2024
d2ab538
Merge branch 'master' into vcf_support
stschiff Oct 25, 2024
d8d2da8
fixed some minors after merging master
stschiff Oct 25, 2024
855614b
fixed some import layout with stylish-haskell
stschiff Oct 25, 2024
cb4a017
bumped version nr and updated changelogs
stschiff Oct 25, 2024
7015965
added note about lack of writing support for gzip and VCF.
stschiff Oct 25, 2024
d10a59f
some small changes in the release changelog
nevrome Oct 25, 2024
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
docs/_build/
.stack-work/
- dist-newstyle/
+ dist-newstyle/
+ .DS_Store
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,6 @@
- V 1.5.7.0:
- Added support for VCF files (Variant Call Format) in Poseidon packages.
- Restructured the test packages, affecting some of the unit and golden tests.
- V 1.5.6.0:
- Introduced individual `Janno...` types for every .janno column (except Poseidon_ID) in a new module `ColumnTypes`. This was done to improve .janno validation error messages.
- Defined a typeclass `Makeable` with a function `make` to write smart constructors for the column types.
30 changes: 27 additions & 3 deletions CHANGELOGRELEASE.md
@@ -1,6 +1,6 @@
- ### V 1.5.6.0
+ ### V 1.5.7.0

- This release further improves `.janno` parsing error messages and adds reading support for gzipped PLINK (`.bed` and `.bim`) and EIGENSTRAT (`.geno` and `.snp`) files.
+ This release further improves `.janno` parsing error messages and adds reading support for gzipped PLINK (`.bed` and `.bim`) and EIGENSTRAT (`.geno` and `.snp`) files. We also added (experimental) support for reading VCF files.

#### Better .janno error messages

@@ -26,8 +26,32 @@ The error messages now include the relevant column name and are more concrete an

#### Reading support for gzipped genotype data

...
Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain gzipped genotype files. Specifically, for EIGENSTRAT-formatted genotype data, the genotype matrix file (`.geno`) and the SNP list file (`.snp`) can now also be gzipped. This strictly requires the `.gz` file ending, so `.geno.gz` and `.snp.gz`, respectively. Similarly, for PLINK-formatted genotype data, we now also accept `.bed.gz` and `.bim.gz`. Any such file with the `.gz` ending is assumed to be gzipped and is decoded on the fly using stream processing. Gzipped and unzipped files can also be mixed within the same package.

For commands that support the `--genoOne` option (`init`, `forge` and `genoconvert`), note that we make some assumptions, which are summarised in the help text for the option:

```
-p,--genoOne FILE One of the input genotype data files. Expects .bed,
.bed.gz, .bim, .bim.gz or .fam for PLINK, or .geno,
.geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT. The
other files must be in the same directory and must
have the same base name. If a gzipped file is given,
it is assumed that the file pairs (.geno.gz, .snp.gz)
or (.bim.gz, .bed.gz) are both zipped, but not the
.fam or .ind file. If a .ind or .fam file is given,
it is assumed that none of the file triples is
zipped. For VCF please see option --vcfFile
```

At this point, `genoconvert` and `forge` do _not_ support writing of gzipped files. This will be added in the future.
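
The file-ending assumptions in the help text above can be sketched as a small classifier. This is only an illustration under stated assumptions: the `Format` type and the `classifyGenoOne` function are hypothetical names invented for this sketch and are not taken from the poseidon-hs code base.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical type for illustration; not the poseidon-hs definition.
data Format = Eigenstrat | Plink deriving (Eq, Show)

-- | Classify a --genoOne style path by its file ending.
-- Returns the format and whether the genotype/SNP file pair is assumed
-- to be gzipped. The .ind and .fam files are never assumed to be gzipped.
classifyGenoOne :: FilePath -> Maybe (Format, Bool)
classifyGenoOne path
  | hasAny [".geno.gz", ".snp.gz"]   = Just (Eigenstrat, True)
  | hasAny [".geno", ".snp", ".ind"] = Just (Eigenstrat, False)
  | hasAny [".bed.gz", ".bim.gz"]    = Just (Plink, True)
  | hasAny [".bed", ".bim", ".fam"]  = Just (Plink, False)
  | otherwise                        = Nothing  -- e.g. VCF, handled via --vcfFile
  where
    hasAny = any (`isSuffixOf` path)
```

The gzipped endings are checked first, so the sketch reports a gzipped pair for `pac/data.geno.gz` and an unzipped triple for `pac/data.ind`.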

#### VCF support for genotype data

Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain VCF (Variant Call Format) files as genotype data, optionally gzipped. In contrast to the EIGENSTRAT and PLINK formats, which require triples of files, the VCF format requires just one file with the ending `.vcf` or `.vcf.gz`. VCF files contain sample names, but no information about genetic sex or group names. This information is usually provided in `.janno` files, so there is no loss of information in Poseidon packages. For `trident init`, which constructs a minimal `.janno` file from the genotype file, we set the `Genetic_Sex` column to "U" and the `Group_Name` column to "unknown".

The VCF file format is very flexible and can encode a large amount of information (see https://samtools.github.io/hts-specs/VCFv4.2.pdf). We do not consider our parsing of VCF files to be complete. The feature is for now experimental, since future users may encounter valid VCF files that cause parsing errors in edge cases. Do not hesitate to file an issue in such a case: https://github.com/poseidon-framework/poseidon-hs/issues.

At this point, `genoconvert` and `forge` do _not_ support writing of VCF files. This will be added in the future.
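
As a rough illustration of the defaults described above, the following sketch reads the sample names from a VCF `#CHROM` header line and fills in the placeholder values "U" and "unknown". It is not the parser used in this pull request. `MinimalRow`, `splitTabs`, `sampleNames` and `minimalRows` are hypothetical names, and the sketch assumes a tab-separated header line that includes the `FORMAT` column.

```haskell
-- Hypothetical minimal row for illustration; a real .janno has many more columns.
data MinimalRow = MinimalRow
  { poseidonID :: String
  , geneticSex :: String
  , groupName  :: String
  } deriving (Eq, Show)

-- | Split a line on tab characters.
splitTabs :: String -> [String]
splitTabs s = case break (== '\t') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitTabs rest

-- | Extract sample names from the "#CHROM ..." header line of a VCF file.
-- The eight columns after #CHROM (POS, ID, REF, ALT, QUAL, FILTER, INFO,
-- FORMAT) are fixed; everything after them is a sample name.
sampleNames :: String -> Maybe [String]
sampleNames line = case splitTabs line of
  ("#CHROM" : rest) | length rest >= 8 -> Just (drop 8 rest)
  _                                    -> Nothing

-- | Build minimal rows with the defaults named in the release notes.
minimalRows :: [String] -> [MinimalRow]
minimalRows = map (\name -> MinimalRow name "U" "unknown")
```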

### V 1.5.4.0

2 changes: 1 addition & 1 deletion poseidon-hs.cabal
@@ -1,5 +1,5 @@
name: poseidon-hs
- version: 1.5.6.0
+ version: 1.5.7.0
synopsis: A package with tools for working with Poseidon genotype data
description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals.
license: MIT
63 changes: 37 additions & 26 deletions src/Poseidon/CLI/Forge.hs
@@ -15,13 +15,14 @@
resolveUniqueEntityIndices)
import Poseidon.GenotypeData (GenoDataSource (..),
GenotypeDataSpec (..),
- GenotypeFormatSpec (..),
+ GenotypeFileSpec (..),
SNPSetSpec (..),
printSNPCopyProgress,
selectIndices, snpSetMergeList)
import Poseidon.Janno (JannoRow (..), JannoRows (..),
ListColumn (..),
getMaybeListColumn,
+ jannoRows2EigenstratIndEntries,
writeJannoFile)
import Poseidon.Package (PackageReadOptions (..),
PoseidonPackage (..),
@@ -41,9 +42,8 @@
import Poseidon.Utils (PoseidonException (..),
PoseidonIO, checkFile,
determinePackageOutName,
- envErrorLength, envInputPlinkMode,
- envLogAction, logInfo, logWarning,
- uniqueRO)
+ envErrorLength, envLogAction,
+ logInfo, logWarning, uniqueRO)

import Control.Exception (catch, throwIO)
import Control.Monad (filterM, forM, forM_, unless,
@@ -76,7 +76,7 @@
, _forgeEntityInput :: [EntityInput SignedEntity] -- Empty list = forge all packages
, _forgeSnpFile :: Maybe FilePath
, _forgeIntersect :: Bool
- , _forgeOutFormat :: GenotypeFormatSpec
+ , _forgeOutFormat :: String

Codecov / codecov/patch: added line #L79 was not covered by tests (src/Poseidon/CLI/Forge.hs#L79)
, _forgeOutMode :: ForgeOutMode
, _forgeOutPacPath :: FilePath
, _forgeOutPacName :: Maybe String
@@ -114,7 +114,7 @@
) = do

-- load packages --
- properPackages <- readPoseidonPackageCollection pacReadOpts $ [getPacBaseDirs x | x@PacBaseDir {} <- genoSources]
+ properPackages <- readPoseidonPackageCollection pacReadOpts $ [getPacBaseDir x | x@PacBaseDir {} <- genoSources]
pseudoPackages <- mapM makePseudoPackageFromGenotypeData [getGenoDirect x | x@GenoDirect {} <- genoSources]
logInfo $ "Unpackaged genotype data files loaded: " ++ show (length pseudoPackages)
let allPackages = properPackages ++ pseudoPackages
@@ -177,29 +177,37 @@
-- create new directory
logInfo $ "Writing to directory (will be created if missing): " ++ outPath
liftIO $ createDirectoryIfMissing True outPath
- -- compile genotype data structure
- let (outInd, outSnp, outGeno) = case outFormat of
- GenotypeFormatEigenstrat -> (outName <.> ".ind", outName <.> ".snp", outName <.> ".geno")
- GenotypeFormatPlink -> (outName <.> ".fam", outName <.> ".bim", outName <.> ".bed")
-- output warning if any snpSet is set to Other
snpSetList <- fillMissingSnpSets relevantPackages
let newSNPSet = case
maybeSnpFile of
Nothing -> snpSetMergeList snpSetList intersect_
Just _ -> SNPSetOther
- let genotypeData = GenotypeDataSpec outFormat outGeno Nothing outSnp Nothing outInd Nothing (Just newSNPSet)
+ -- compile genotype data structure
+ genotypeFileData <- case outFormat of
+ "EIGENSTRAT" -> return $
+ GenotypeEigenstrat (outName <.> ".geno") Nothing
+ (outName <.> ".snp") Nothing
+ (outName <.> ".ind") Nothing
+ "PLINK" -> return $
+ GenotypePlink (outName <.> ".bed") Nothing
+ (outName <.> ".bim") Nothing
+ (outName <.> ".fam") Nothing
+ _ -> liftIO . throwIO $
+ PoseidonGenericException ("Illegal outFormat " ++ outFormat ++ ". Only Outformats EIGENSTRAT or PLINK are allowed at the moment")

Codecov / codecov/patch: added lines #L196 - L197 were not covered by tests (src/Poseidon/CLI/Forge.hs#L196-L197)
+ let genotypeData = GenotypeDataSpec genotypeFileData (Just newSNPSet)

-- assemble and write result depending on outMode --
logInfo "Creating new package entity"
let pacSource = head relevantPackages
case outMode of
GenoOut -> do
- _ <- compileGenotypeData outPath (outInd, outSnp, outGeno) relevantPackages relevantIndices
+ _ <- compileGenotypeData outPath genotypeFileData relevantPackages relevantIndices
return ()
MinimalOut -> do
- let pac = newMinimalPackageTemplate outPath outName genotypeData
+ pac <- newMinimalPackageTemplate outPath outName genotypeData
writePoseidonYmlFile pac
- _ <- compileGenotypeData outPath (outInd, outSnp, outGeno) relevantPackages relevantIndices
+ _ <- compileGenotypeData outPath genotypeFileData relevantPackages relevantIndices
return ()
PreservePymlOut -> do
normalPac <- newPackageTemplate outPath outName genotypeData
@@ -217,15 +225,15 @@
writeBibFile outPath outName relevantBibEntries
copyREADMEFile outPath pacSource
copyCHANGELOGFile outPath pacSource
- newNrSnps <- compileGenotypeData outPath (outInd, outSnp, outGeno) relevantPackages relevantIndices
+ newNrSnps <- compileGenotypeData outPath genotypeFileData relevantPackages relevantIndices
writingJannoFile outPath outName newNrSnps relevantJannoRows
NormalOut -> do
pac <- newPackageTemplate outPath outName genotypeData
(Just (Right newJanno)) relevantSeqSourceRows relevantBibEntries
writePoseidonYmlFile pac
writeSSFile outPath outName relevantSeqSourceRows
writeBibFile outPath outName relevantBibEntries
- newNrSnps <- compileGenotypeData outPath (outInd, outSnp, outGeno) relevantPackages relevantIndices
+ newNrSnps <- compileGenotypeData outPath genotypeFileData relevantPackages relevantIndices
writingJannoFile outPath outName newNrSnps relevantJannoRows

where
@@ -262,22 +270,25 @@
let fullSourcePath = posPacBaseDir pacSource </> path
liftIO $ checkFile fullSourcePath Nothing
liftIO $ copyFile fullSourcePath $ outPath </> path
- compileGenotypeData :: FilePath -> (String,String,String) -> [PoseidonPackage] -> [Int] -> PoseidonIO (VUM.IOVector Int)
- compileGenotypeData outPath (outInd, outSnp, outGeno) relevantPackages relevantIndices = do
+ compileGenotypeData :: FilePath -> GenotypeFileSpec -> [PoseidonPackage] -> [Int] -> PoseidonIO (VUM.IOVector Int)
+ compileGenotypeData outPath gFileSpec relevantPackages relevantIndices = do
logInfo "Compiling genotype data"
logInfo "Processing SNPs..."
logA <- envLogAction
- inPlinkPopMode <- envInputPlinkMode
currentTime <- liftIO getCurrentTime
errLength <- envErrorLength
newNrSNPs <- liftIO $ catch (
runSafeT $ do
- (eigenstratIndEntries, eigenstratProd) <- getJointGenotypeData logA intersect_ inPlinkPopMode relevantPackages maybeSnpFile
+ eigenstratProd <- getJointGenotypeData logA intersect_ relevantPackages maybeSnpFile
+ let eigenstratIndEntries = jannoRows2EigenstratIndEntries . getJointJanno $ relevantPackages
let newEigenstratIndEntries = map (eigenstratIndEntries !!) relevantIndices
- let (outG, outS, outI) = (outPath </> outGeno, outPath </> outSnp, outPath </> outInd)
- let outConsumer = case outFormat of
- GenotypeFormatEigenstrat -> writeEigenstrat outG outS outI newEigenstratIndEntries
- GenotypeFormatPlink -> writePlink outG outS outI (map (eigenstratInd2PlinkFam outPlinkPopMode) newEigenstratIndEntries)
+ let outConsumer = case gFileSpec of
+ GenotypeEigenstrat outG _ outS _ outI _ ->
+ writeEigenstrat (outPath </> outG) (outPath </> outS) (outPath </> outI) newEigenstratIndEntries
+ GenotypePlink outG _ outS _ outI _ ->
+ writePlink (outPath </> outG) (outPath </> outS) (outPath </> outI) (map (eigenstratInd2PlinkFam outPlinkPopMode) newEigenstratIndEntries)
+ _ -> liftIO . throwIO $
+ PoseidonGenericException "only Outformats EIGENSTRAT or PLINK are allowed at the moment"

Codecov / codecov/patch: added lines #L290 - L291 were not covered by tests (src/Poseidon/CLI/Forge.hs#L290-L291)
let extractPipe = if packageWise then cat else P.map (selectIndices relevantIndices)
-- define main forge pipe including file output.
-- The final tee forwards the results to be used in the snpCounting-fold
@@ -290,7 +301,7 @@
) (throwIO . PoseidonGenotypeExceptionForward errLength)
logInfo "Done"
return newNrSNPs
- writingJannoFile :: FilePath -> String -> (VUM.MVector VUM.RealWorld Int) -> [JannoRow] -> PoseidonIO ()
+ writingJannoFile :: FilePath -> String -> VUM.MVector VUM.RealWorld Int -> [JannoRow] -> PoseidonIO ()
writingJannoFile outPath outName newNrSNPs rows = do
logInfo "Creating .janno file"
snpList <- liftIO $ VU.freeze newNrSNPs
@@ -328,7 +339,7 @@
fillMissingSnpSets :: [PoseidonPackage] -> PoseidonIO [SNPSetSpec]
fillMissingSnpSets packages = forM packages $ \pac -> do
let pac_ = posPacNameAndVersion pac
- maybeSnpSet = snpSet . posPacGenotypeData $ pac
+ maybeSnpSet = genotypeSnpSet . posPacGenotypeData $ pac
case maybeSnpSet of
Just s -> return s
Nothing -> do
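
The output-format dispatch in the `forge` hunks above matches the `outFormat` string against "EIGENSTRAT" and "PLINK" and rejects anything else. The following self-contained sketch shows the same idea with a pure `Either` result instead of a thrown exception. It is a simplified stand-in, not the code from this pull request: `OutSpec` and `outSpecFor` are hypothetical names, the extra `Nothing` arguments of the real constructors are left out, and plain `++` replaces `<.>` to avoid a dependency on `filepath`.

```haskell
-- Simplified stand-in for the GenotypeFileSpec constructors in the diff.
-- The real constructors take additional optional arguments (passed as
-- Nothing in the diff), which are omitted here.
data OutSpec
  = OutEigenstrat FilePath FilePath FilePath  -- geno, snp, ind
  | OutPlink FilePath FilePath FilePath       -- bed, bim, fam
  deriving (Eq, Show)

-- | Map an output format name and a base name to the files to be written.
-- Unknown formats (including VCF, which cannot be written yet) are rejected.
outSpecFor :: String -> String -> Either String OutSpec
outSpecFor outFormat outName = case outFormat of
  "EIGENSTRAT" -> Right $
    OutEigenstrat (outName ++ ".geno") (outName ++ ".snp") (outName ++ ".ind")
  "PLINK" -> Right $
    OutPlink (outName ++ ".bed") (outName ++ ".bim") (outName ++ ".fam")
  other -> Left $
    "Illegal outFormat " ++ other ++ ". Only EIGENSTRAT or PLINK are allowed at the moment"
```

Returning `Either` keeps the sketch testable without IO. The diff itself throws a `PoseidonGenericException` in the fallback branch.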