diff --git a/CHANGELOG.md b/CHANGELOG.md index a332dba4..92e5f3f1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,5 @@ +- V 1.4.1.0: + - Added new tool `trident jannocoalesce`, which merges information from a source .janno file to a target .janno file. - V 1.4.0.4: - Added better error messages for generic cassava parsing (e.g. for broken Int and Double fields) in .janno files. - Added better error handling and messages for inconsistent `Date_*`, `Contamination_*` and `Relation_*` columns in .janno files using an `Except` & `Writer` monad stack. diff --git a/CHANGELOGRELEASE.md b/CHANGELOGRELEASE.md index 353cd2db..7a6cafd8 100644 --- a/CHANGELOGRELEASE.md +++ b/CHANGELOGRELEASE.md @@ -1,3 +1,49 @@ +### V 1.4.1.0 + +This release adds an entirely new subcommand to merge two `.janno` files (`jannocoalesce`) and improves the error messages for broken `.janno` files. + +#### Merging `.janno` files with `jannocoalesce` + +The need for a tool to combine the information of two `.janno` files arose in the Poseidon ecosystem as we started to conceptualize the Poseidon [Minotaur Archive](https://github.com/poseidon-framework/minotaur-archive). This archive will be populated by paper-wise Poseidon packages for which the genotype data was regenerated through the Minotaur workflow (work in progress). We plan to reprocess various packages that are already in the [Poseidon Community Archive](https://github.com/poseidon-framework/community-archive) and for these packages we want to copy e.g. spatiotemporal information from the already available `.janno` files. `jannocoalesce` is the answer to this specific need, but can also be useful for various other applications. + +It generally works by reading a source `.janno` file with `-s|--sourceFile` (or all `.janno` files in a `-d|--baseDir`) and a target `.janno` file with `-t|--targetFile`. It then merges these files by a key column, which can be selected with `--sourceKey` and `--targetKey`. 
The default for both of these key columns is the `Poseidon_ID`. In case the entries in the key columns slightly and systematically differ, e.g. because the `Poseidon_ID`s in either have a special suffix (for example `_SG`), then the `--stripIdRegex` option allows stripping these with a regular expression. + +`jannocoalesce` generally attempts to fill **all** empty cells in the target `.janno` file with information from the source. `--includeColumns` and `--excludeColumns` allow selecting specific columns for which this should be done. In some cases it may be desirable to not just fill empty fields in the target, but overwrite the information already there with the `-f|--force` option. If the target file should be preserved, then the output can be directed to a new output `.janno` file with `-o|--outFile`. + +#### Better error messages for broken `.janno` files + +`.janno` file validation is a core feature of `trident`. With this release we try to improve the error messages for two common situations: + +1. Broken number fields. This can happen if some text or wrong character ends up in a number field. + +So far the error messages for this case have been pretty technical. Here, for example, is the message for an integer field filled with `430;`, where the integer number `430` is accidentally written with a trailing `;`: + +``` +parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59])) +``` + +The new error message is clearer: + +``` +parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";") +``` + +2. Inconsistent `Date_*`, `Contamination_*` and `Relation_*` columns. These sets of columns have to be cross-consistent, following a logic that is especially complex for the `Date_*` fields (see [here](https://www.poseidon-adna.org/#/janno_details?id=the-columns-in-detail)). 
+ +So far any inconsistency was reported with this generic error message: + +``` +The Date_* columns are not consistent +``` + +Now we include far more precise messages, like e.g.: + +``` +Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty. +``` + +This should simplify tedious `.janno` file debugging in the future. + ### V 1.4.0.3 This small release fixes a performance issue related to finding the latest version of all packages. The bug had severe detrimental effects on `forge` and `fetch`, which are now resolved. diff --git a/poseidon-hs.cabal b/poseidon-hs.cabal index 939c4518..cab929fd 100644 --- a/poseidon-hs.cabal +++ b/poseidon-hs.cabal @@ -1,5 +1,5 @@ name: poseidon-hs -version: 1.4.0.4 +version: 1.4.1.0 synopsis: A package with tools for working with Poseidon Genotype Data description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals. 
license: MIT @@ -21,7 +21,8 @@ library Poseidon.CLI.Summarise, Poseidon.CLI.Validate, Poseidon.Utils, Poseidon.CLI.Survey, Poseidon.CLI.Forge, Poseidon.CLI.Init, Poseidon.CLI.Rectify, Poseidon.CLI.Fetch, Poseidon.CLI.Genoconvert, - Poseidon.CLI.OptparseApplicativeParsers, Poseidon.CLI.Timetravel + Poseidon.CLI.OptparseApplicativeParsers, Poseidon.CLI.Timetravel, + Poseidon.CLI.Jannocoalesce other-modules: Paths_poseidon_hs hs-source-dirs: src build-depends: base >= 4.7 && < 5, sequence-formats>=1.6.1, text, time, pipes-safe, @@ -52,7 +53,7 @@ Test-Suite poseidon-tools-tests filepath, pipes, pipes-safe, pipes-ordered-zip, unordered-containers, cassava, containers, process other-modules: Poseidon.PackageSpec, Poseidon.JannoSpec, - Poseidon.BibFileSpec, Poseidon.MathHelpersSpec, + Poseidon.BibFileSpec, Poseidon.MathHelpersSpec, Poseidon.JannocoalesceSpec, Poseidon.SummariseSpec, Poseidon.SurveySpec, Poseidon.GenotypeDataSpec, Poseidon.EntitiesListSpec, PoseidonGoldenTests.GoldenTestsValidateChecksumsSpec, PoseidonGoldenTests.GoldenTestsRunCommands, Poseidon.ChronicleSpec, diff --git a/src-executables/Main-trident.hs b/src-executables/Main-trident.hs index 28be33b2..b04b7a06 100644 --- a/src-executables/Main-trident.hs +++ b/src-executables/Main-trident.hs @@ -11,6 +11,8 @@ import Poseidon.CLI.Genoconvert (GenoconvertOptions (.. 
runGenoconvert) import Poseidon.CLI.Init (InitOptions (..), runInit) +import Poseidon.CLI.Jannocoalesce (JannoCoalesceOptions (..), + runJannocoalesce) import Poseidon.CLI.List (ListOptions (..), runList) import Poseidon.CLI.OptparseApplicativeParsers @@ -68,6 +70,7 @@ data Subcommand = | CmdChronicle ChronicleOptions | CmdTimetravel TimetravelOptions | CmdServe ServeOptions + | CmdJannoCoalesce JannoCoalesceOptions main :: IO () main = do @@ -88,18 +91,20 @@ main = do runCmd :: Subcommand -> PoseidonIO () runCmd o = case o of - CmdInit opts -> runInit opts - CmdList opts -> runList opts - CmdFetch opts -> runFetch opts - CmdForge opts -> runForge opts - CmdGenoconvert opts -> runGenoconvert opts - CmdSummarise opts -> runSummarise opts - CmdSurvey opts -> runSurvey opts - CmdRectify opts -> runRectify opts - CmdValidate opts -> runValidate opts - CmdChronicle opts -> runChronicle opts - CmdTimetravel opts -> runTimetravel opts - CmdServe opts -> runServerMainThread opts + -- alphabetic order + CmdChronicle opts -> runChronicle opts + CmdFetch opts -> runFetch opts + CmdForge opts -> runForge opts + CmdGenoconvert opts -> runGenoconvert opts + CmdJannoCoalesce opts -> runJannocoalesce opts + CmdInit opts -> runInit opts + CmdList opts -> runList opts + CmdRectify opts -> runRectify opts + CmdServe opts -> runServerMainThread opts + CmdSummarise opts -> runSummarise opts + CmdSurvey opts -> runSurvey opts + CmdTimetravel opts -> runTimetravel opts + CmdValidate opts -> runValidate opts optParserInfo :: OP.ParserInfo Options optParserInfo = OP.info ( @@ -131,6 +136,7 @@ subcommandParser = OP.subparser ( OP.command "fetch" fetchOptInfo <> OP.command "forge" forgeOptInfo <> OP.command "genoconvert" genoconvertOptInfo <> + OP.command "jannocoalesce" jannocoalesceOptInfo <> OP.command "rectify" rectifyOptInfo <> OP.commandGroup "Package creation and manipulation commands:" ) <|> @@ -182,6 +188,8 @@ subcommandParser = OP.subparser ( (OP.progDesc "Construct package 
directories from chronicle files") serveOptInfo = OP.info (OP.helper <*> (CmdServe <$> serveOptParser)) (OP.progDesc "Serve Poseidon packages via HTTP or HTTPS") + jannocoalesceOptInfo = OP.info (OP.helper <*> (CmdJannoCoalesce <$> jannocoalesceOptParser)) + (OP.progDesc "Coalesce information from one or multiple janno files to another one") initOptParser :: OP.Parser InitOptions initOptParser = InitOptions <$> parseInGenotypeDataset @@ -260,3 +268,13 @@ serveOptParser = ServeOptions <$> parseArchiveBasePaths <*> parsePort <*> parseIgnoreChecksums <*> parseMaybeCertFiles + +jannocoalesceOptParser :: OP.Parser JannoCoalesceOptions +jannocoalesceOptParser = JannoCoalesceOptions <$> parseJannocoalSourceSpec + <*> parseJannocoalTargetFile + <*> parseJannocoalOutSpec + <*> parseJannocoalJannoColumns + <*> parseJannocoalOverride + <*> parseJannocoalSourceKey + <*> parseJannocoalTargetKey + <*> parseJannocoalIdStripRegex diff --git a/src/Poseidon/CLI/Jannocoalesce.hs b/src/Poseidon/CLI/Jannocoalesce.hs new file mode 100644 index 00000000..8a5cc64e --- /dev/null +++ b/src/Poseidon/CLI/Jannocoalesce.hs @@ -0,0 +1,161 @@ +{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE TupleSections #-} + +module Poseidon.CLI.Jannocoalesce where + +import Poseidon.Janno (JannoRow (..), JannoRows (..), + readJannoFile, writeJannoFile) +import Poseidon.Package (PackageReadOptions (..), + defaultPackageReadOptions, + getJointJanno, + readPoseidonPackageCollection) +import Poseidon.Utils (PoseidonException (..), PoseidonIO, + logDebug, logInfo, logWarning) + +import Control.Monad (filterM, forM_, when) +import Control.Monad.Catch (MonadThrow, throwM) +import Control.Monad.IO.Class (liftIO) +import qualified Data.ByteString.Char8 as BSC +import qualified Data.Csv as Csv +import qualified Data.HashMap.Strict as HM +import qualified Data.IORef as R +import Data.List ((\\)) +import Data.Text (pack, replace, unpack) +import System.Directory (createDirectoryIfMissing) +import System.FilePath 
(takeDirectory) +import Text.Regex.TDFA ((=~)) + +-- the source can be a single janno file, or a set of base directories as usual. +data JannoSourceSpec = JannoSourceSingle FilePath | JannoSourceBaseDirs [FilePath] + +data CoalesceJannoColumnSpec = + AllJannoColumns + | IncludeJannoColumns [BSC.ByteString] + | ExcludeJannoColumns [BSC.ByteString] + +data JannoCoalesceOptions = JannoCoalesceOptions + { _jannocoalesceSource :: JannoSourceSpec + , _jannocoalesceTarget :: FilePath + , _jannocoalesceOutSpec :: Maybe FilePath -- Nothing means "in place" + , _jannocoalesceJannoColumns :: CoalesceJannoColumnSpec + , _jannocoalesceOverwriteColumns :: Bool + , _jannocoalesceSourceKey :: String -- by default set to "Poseidon_ID" + , _jannocoalesceTargetKey :: String -- by default set to "Poseidon_ID" + , _jannocoalesceIdStrip :: Maybe String -- an optional regex to strip from target and source keys + } + +runJannocoalesce :: JannoCoalesceOptions -> PoseidonIO () +runJannocoalesce (JannoCoalesceOptions sourceSpec target outSpec fields overwrite sKey tKey maybeStrip) = do + JannoRows sourceRows <- case sourceSpec of + JannoSourceSingle sourceFile -> readJannoFile sourceFile + JannoSourceBaseDirs sourceDirs -> do + let pacReadOpts = defaultPackageReadOptions { + _readOptIgnoreChecksums = True + , _readOptGenoCheck = False + , _readOptIgnoreGeno = True + , _readOptOnlyLatest = True + } + getJointJanno <$> readPoseidonPackageCollection pacReadOpts sourceDirs + JannoRows targetRows <- readJannoFile target + + newJanno <- makeNewJannoRows sourceRows targetRows fields overwrite sKey tKey maybeStrip + + let outPath = maybe target id outSpec + logInfo $ "Writing to file (directory will be created if missing): " ++ outPath + liftIO $ do + createDirectoryIfMissing True (takeDirectory outPath) + writeJannoFile outPath (JannoRows newJanno) + +type CounterMismatches = R.IORef Int +type CounterCopied = R.IORef Int + +makeNewJannoRows :: [JannoRow] -> [JannoRow] -> CoalesceJannoColumnSpec -> 
Bool -> String -> String -> Maybe String -> PoseidonIO [JannoRow] +makeNewJannoRows sourceRows targetRows fields overwrite sKey tKey maybeStrip = do + logInfo "Starting to coalesce..." + counterMismatches <- liftIO $ R.newIORef 0 + counterCopied <- liftIO $ R.newIORef 0 + newRows <- mapM (makeNewJannoRow counterMismatches counterCopied) targetRows + counterCopiedVal <- liftIO $ R.readIORef counterCopied + counterMismatchesVal <- liftIO $ R.readIORef counterMismatches + logInfo $ "Copied " ++ show counterCopiedVal ++ " values" + when (counterMismatchesVal > 0) $ + logWarning $ "Failed to find matches for " ++ show counterMismatchesVal ++ " target rows in source" + return newRows + where + makeNewJannoRow :: CounterMismatches -> CounterCopied -> JannoRow -> PoseidonIO JannoRow + makeNewJannoRow cm cp targetRow = do + posId <- getKeyFromJanno targetRow tKey + sourceRowCandidates <- filterM (\r -> (matchWithOptionalStrip maybeStrip posId) <$> getKeyFromJanno r sKey) sourceRows + case sourceRowCandidates of + [] -> do + logWarning $ "no match for target " ++ posId ++ " in source" + liftIO $ R.modifyIORef cm (+1) + return targetRow + [keyRow] -> mergeRow cp targetRow keyRow fields overwrite sKey tKey + _ -> throwM $ PoseidonGenericException $ "source file contains multiple rows with key " ++ posId + +getKeyFromJanno :: (MonadThrow m) => JannoRow -> String -> m String +getKeyFromJanno jannoRow key = do + let jannoRowDict = Csv.toNamedRecord jannoRow + case jannoRowDict HM.!? 
(BSC.pack key) of + Nothing -> throwM $ PoseidonGenericException ("Key " ++ key ++ " not present in .janno file") + Just r -> return $ BSC.unpack r + +matchWithOptionalStrip :: (Maybe String) -> String -> String -> Bool +matchWithOptionalStrip maybeRegex id1 id2 = + case maybeRegex of + Nothing -> id1 == id2 + Just r -> + let id1stripped = stripR r id1 + id2stripped = stripR r id2 + in id1stripped == id2stripped + where + stripR :: String -> String -> String + stripR r s = + let match = s =~ r + in if null match then s else unpack $ replace (pack match) "" (pack s) + +mergeRow :: CounterCopied -> JannoRow -> JannoRow -> CoalesceJannoColumnSpec -> Bool -> String -> String -> PoseidonIO JannoRow +mergeRow cp targetRow sourceRow fields overwrite sKey tKey = do + let sourceKeys = HM.keys sourceRowRecord + sourceKeysDesired = determineDesiredSourceKeys sourceKeys fields + -- fill in the target row with dummy values for desired fields that might not be present yet + targetComplete = HM.union targetRowRecord (HM.fromList $ map (, BSC.empty) sourceKeysDesired) + newRowRecord = HM.mapWithKey fillFromSource targetComplete + parseResult = Csv.runParser . Csv.parseNamedRecord $ newRowRecord + logInfo $ "matched target " ++ BSC.unpack (targetComplete HM.! BSC.pack tKey) ++ + " with source " ++ BSC.unpack (sourceRowRecord HM.! BSC.pack sKey) + case parseResult of + Left err -> throwM . 
PoseidonGenericException $ ".janno row-merge error: " ++ err + Right r -> do + let newFields = HM.differenceWith (\v1 v2 -> if v1 == v2 then Nothing else Just v1) newRowRecord targetComplete + if HM.null newFields then do + logDebug "-- no changes" + else do + forM_ (HM.toList newFields) $ \(key, val) -> do + liftIO $ R.modifyIORef cp (+1) + logDebug $ "-- copied \"" ++ BSC.unpack val ++ "\" from column " ++ BSC.unpack key + return r + where + targetRowRecord :: Csv.NamedRecord + targetRowRecord = Csv.toNamedRecord targetRow + sourceRowRecord :: Csv.NamedRecord + sourceRowRecord = Csv.toNamedRecord sourceRow + determineDesiredSourceKeys :: [BSC.ByteString] -> CoalesceJannoColumnSpec -> [BSC.ByteString] + determineDesiredSourceKeys keys AllJannoColumns = keys + determineDesiredSourceKeys _ (IncludeJannoColumns included) = included + determineDesiredSourceKeys keys (ExcludeJannoColumns excluded) = keys \\ excluded + fillFromSource :: BSC.ByteString -> BSC.ByteString -> BSC.ByteString + fillFromSource key targetVal = + -- don't overwrite key + if key /= BSC.pack tKey + -- overwrite field only if it's requested + && includeField key fields + -- overwrite only empty fields, except overwrite is set + && (targetVal `elem` ["n/a", "", BSC.empty] || overwrite) + then HM.findWithDefault "" key sourceRowRecord + else targetVal + includeField :: BSC.ByteString -> CoalesceJannoColumnSpec -> Bool + includeField _ AllJannoColumns = True + includeField key (IncludeJannoColumns xs) = key `elem` xs + includeField key (ExcludeJannoColumns xs) = key `notElem` xs diff --git a/src/Poseidon/CLI/OptparseApplicativeParsers.hs b/src/Poseidon/CLI/OptparseApplicativeParsers.hs index 50176dc8..6c3a40a3 100644 --- a/src/Poseidon/CLI/OptparseApplicativeParsers.hs +++ b/src/Poseidon/CLI/OptparseApplicativeParsers.hs @@ -2,33 +2,39 @@ module Poseidon.CLI.OptparseApplicativeParsers where -import Poseidon.CLI.Chronicle (ChronOperation (..)) -import Poseidon.CLI.List (ListEntity (..), 
RepoLocationSpec (..)) -import Poseidon.CLI.Rectify (ChecksumsToRectify (..), - PackageVersionUpdate (..)) -import Poseidon.CLI.Validate (ValidatePlan (..)) -import Poseidon.Contributor (ContributorSpec (..), - contributorSpecParser) -import Poseidon.EntityTypes (EntitiesList, EntityInput (..), - PoseidonEntity, SignedEntitiesList, - SignedEntity, readEntitiesFromString) -import Poseidon.GenotypeData (GenoDataSource (..), - GenotypeDataSpec (..), - GenotypeFormatSpec (..), - SNPSetSpec (..)) -import Poseidon.ServerClient (ArchiveEndpoint (..)) -import Poseidon.Utils (LogMode (..), TestMode (..)) -import Poseidon.Version (VersionComponent (..), parseVersion) - -import Control.Applicative ((<|>)) -import Data.List.Split (splitOn) -import Data.Version (Version) -import qualified Options.Applicative as OP -import SequenceFormats.Plink (PlinkPopNameMode (PlinkPopNameAsBoth, PlinkPopNameAsFamily, PlinkPopNameAsPhenotype)) -import System.FilePath (dropExtension, takeExtension, (<.>)) -import qualified Text.Parsec as P -import Text.Read (readMaybe) - +import Poseidon.CLI.Chronicle (ChronOperation (..)) +import Poseidon.CLI.Jannocoalesce (CoalesceJannoColumnSpec (..), + JannoSourceSpec (..)) +import Poseidon.CLI.List (ListEntity (..), + RepoLocationSpec (..)) +import Poseidon.CLI.Rectify (ChecksumsToRectify (..), + PackageVersionUpdate (..)) +import Poseidon.CLI.Validate (ValidatePlan (..)) +import Poseidon.Contributor (ContributorSpec (..), + contributorSpecParser) +import Poseidon.EntityTypes (EntitiesList, EntityInput (..), + PoseidonEntity, SignedEntitiesList, + SignedEntity, + readEntitiesFromString) +import Poseidon.GenotypeData (GenoDataSource (..), + GenotypeDataSpec (..), + GenotypeFormatSpec (..), + SNPSetSpec (..)) +import Poseidon.ServerClient (ArchiveEndpoint (..)) +import Poseidon.Utils (LogMode (..), TestMode (..)) +import Poseidon.Version (VersionComponent (..), + parseVersion) + +import Control.Applicative ((<|>)) +import qualified Data.ByteString.Char8 as 
BSC +import Data.List.Split (splitOn) +import Data.Version (Version) +import qualified Options.Applicative as OP +import SequenceFormats.Plink (PlinkPopNameMode (PlinkPopNameAsBoth, PlinkPopNameAsFamily, PlinkPopNameAsPhenotype)) +import System.FilePath (dropExtension, takeExtension, + (<.>)) +import qualified Text.Parsec as P +import Text.Read (readMaybe) parseChronOperation :: OP.Parser ChronOperation parseChronOperation = (CreateChron <$> parseChronOutPath) <|> (UpdateChron <$> parseChronUpdatePath) @@ -762,3 +768,81 @@ parseMaybeArchiveName = OP.option (Just <$> OP.str) ( OP.value Nothing <> OP.showDefault ) + +parseJannocoalSourceSpec :: OP.Parser JannoSourceSpec +parseJannocoalSourceSpec = parseJannocoalSingleSource <|> (JannoSourceBaseDirs <$> parseBasePaths) + where + parseJannocoalSingleSource = OP.option (JannoSourceSingle <$> OP.str) ( + OP.long "sourceFile" <> + OP.short 's' <> + OP.metavar "FILE" <> + OP.help "The source .janno file." + ) + +parseJannocoalTargetFile :: OP.Parser FilePath +parseJannocoalTargetFile = OP.strOption ( + OP.long "targetFile" <> + OP.short 't' <> + OP.metavar "FILE" <> + OP.help "The target .janno file to fill." + ) + +parseJannocoalOutSpec :: OP.Parser (Maybe FilePath) +parseJannocoalOutSpec = OP.option (Just <$> OP.str) ( + OP.long "outFile" <> + OP.short 'o' <> + OP.metavar "FILE" <> + OP.value Nothing <> + OP.showDefault <> + OP.help "An optional file to write the results to. \ + \If not specified, change the target file in place." + ) + +parseJannocoalJannoColumns :: OP.Parser CoalesceJannoColumnSpec +parseJannocoalJannoColumns = includeJannoColumns OP.<|> excludeJannoColumns OP.<|> pure AllJannoColumns + where + includeJannoColumns = OP.option (IncludeJannoColumns . map BSC.pack . splitOn "," <$> OP.str) ( + OP.long "includeColumns" <> + OP.help "A comma-separated list of .janno column names to coalesce. \ + \If not specified, all columns that can be found in the source \ + \and target will get filled." 
+ ) + excludeJannoColumns = OP.option (ExcludeJannoColumns . map BSC.pack . splitOn "," <$> OP.str) ( + OP.long "excludeColumns" <> + OP.help "A comma-separated list of .janno column names NOT to coalesce. \ + \All columns that can be found in the source and target will get filled, \ + \except the ones listed here." + ) + +parseJannocoalOverride :: OP.Parser Bool +parseJannocoalOverride = OP.switch ( + OP.long "force" <> + OP.short 'f' <> + OP.help "With this option, potential non-missing content in target columns gets overridden \ + \with non-missing content in source columns. By default, only missing data gets filled-in." + ) + +parseJannocoalSourceKey :: OP.Parser String +parseJannocoalSourceKey = OP.strOption ( + OP.long "sourceKey" <> + OP.help "The .janno column to use as the source key." <> + OP.value "Poseidon_ID" <> + OP.showDefault + ) + +parseJannocoalTargetKey :: OP.Parser String +parseJannocoalTargetKey = OP.strOption ( + OP.long "targetKey" <> + OP.help "The .janno column to use as the target key." <> + OP.value "Poseidon_ID" <> + OP.showDefault + ) + +parseJannocoalIdStripRegex :: OP.Parser (Maybe String) +parseJannocoalIdStripRegex = OP.option (Just <$> OP.str) ( + OP.long "stripIdRegex" <> + OP.help "An optional regular expression to identify parts of the IDs to strip \ + \before matching between source and target. Uses POSIX Extended regular expressions." 
<> + OP.value Nothing + ) + diff --git a/src/Poseidon/Janno.hs b/src/Poseidon/Janno.hs index 26f42809..036080dd 100644 --- a/src/Poseidon/Janno.hs +++ b/src/Poseidon/Janno.hs @@ -27,9 +27,11 @@ module Poseidon.Janno ( JannoLibraryBuilt (..), AccessionID (..), makeAccessionID, + makeLatitude, makeLongitude, writeJannoFile, readJannoFile, createMinimalJanno, + createMinimalSample, jannoHeaderString, CsvNamedRecord (..), JannoRows (..), diff --git a/test/Poseidon/JannocoalesceSpec.hs b/test/Poseidon/JannocoalesceSpec.hs new file mode 100644 index 00000000..2f088453 --- /dev/null +++ b/test/Poseidon/JannocoalesceSpec.hs @@ -0,0 +1,144 @@ +{-# LANGUAGE OverloadedStrings #-} +module Poseidon.JannocoalesceSpec (spec) where + +import Poseidon.CLI.Jannocoalesce (CoalesceJannoColumnSpec (..), + makeNewJannoRows, mergeRow) +import Poseidon.Janno (CsvNamedRecord (..), + JannoList (..), JannoRow (..), + createMinimalSample, makeLatitude, + makeLongitude) +import Poseidon.Utils (testLog) + +import Control.Monad.IO.Class (liftIO) +import qualified Data.HashMap.Strict as HM +import qualified Data.IORef as R +import SequenceFormats.Eigenstrat (EigenstratIndEntry (..), Sex (..)) +import Test.Hspec + +spec :: Spec +spec = do + testMergeSingleRow + testCoalesceMultipleRows + +jannoTargetRow :: JannoRow +jannoTargetRow = + let row = createMinimalSample (EigenstratIndEntry "Name" Male "SamplePop") + in row { + jCountry = Just "Austria", + jSite = Just "Vienna", + jDateNote = Just "dating didn't work", + jAdditionalColumns = CsvNamedRecord $ HM.fromList [ + ("AdditionalColumn2", "C") + ] + } + +jannoSourceRow :: JannoRow +jannoSourceRow = + let row = createMinimalSample (EigenstratIndEntry "Name" Male "SamplePop2") + in row { + jCountry = Just "Austria", + jSite = Just "Salzburg", + jLatitude = makeLatitude 30.0, + jLongitude = makeLongitude 30.0, + jAdditionalColumns = CsvNamedRecord $ HM.fromList [ + ("AdditionalColumn1", "A"), + ("AdditionalColumn2", "B") + ] + } + +jannoTargetRows :: 
[JannoRow] +jannoTargetRows = + let row1 = createMinimalSample (EigenstratIndEntry "Ind1" Male "SamplePop") + row2 = createMinimalSample (EigenstratIndEntry "Ind2" Male "SamplePop") + row3 = createMinimalSample (EigenstratIndEntry "Ind3_AB" Male "SamplePop") + in [row1 {jCountry = Just "Germany"}, row2, row3] + +jannoSourceRows :: [JannoRow] +jannoSourceRows = + let row1 = createMinimalSample (EigenstratIndEntry "Ind1" Male "SamplePop") + row2 = createMinimalSample (EigenstratIndEntry "Ind2" Male "SamplePop") + row3 = createMinimalSample (EigenstratIndEntry "Ind3" Male "SamplePop") + in [ + row1 {jCountry = Just "Austria", jAdditionalColumns = CsvNamedRecord $ HM.fromList [("Poseidon_ID_alt", "Ind1")]}, + row2 {jCountry = Just "Austria", jAdditionalColumns = CsvNamedRecord $ HM.fromList [("Poseidon_ID_alt", "Ind2")]}, + row3 {jCountry = Just "Austria", jAdditionalColumns = CsvNamedRecord $ HM.fromList [("Poseidon_ID_alt", "Ind3_AB")]}] + +testMergeSingleRow :: Spec +testMergeSingleRow = + describe "Poseidon.Jannocoalesce.mergeRow" $ do + it "should correctly merge without fields and no override" $ do + cp <- liftIO $ R.newIORef 0 + merged <- testLog $ mergeRow cp jannoTargetRow jannoSourceRow AllJannoColumns False "Poseidon_ID" "Poseidon_ID" + jSite merged `shouldBe` Just "Vienna" + jGroupName merged `shouldBe` JannoList ["SamplePop"] + jLatitude merged `shouldBe` makeLatitude 30.0 + jLongitude merged `shouldBe` makeLongitude 30.0 + jAdditionalColumns merged `shouldBe` CsvNamedRecord (HM.fromList [ + ("AdditionalColumn1", "A"), + ("AdditionalColumn2", "C") + ]) + it "should correctly merge without fields and override" $ do + cp <- liftIO $ R.newIORef 0 + merged <- testLog $ mergeRow cp jannoTargetRow jannoSourceRow AllJannoColumns True "Poseidon_ID" "Poseidon_ID" + jSite merged `shouldBe` Just "Salzburg" + jGroupName merged `shouldBe` JannoList ["SamplePop2"] + jLatitude merged `shouldBe` makeLatitude 30.0 + jLongitude merged `shouldBe` makeLongitude 30.0 + 
jAdditionalColumns merged `shouldBe` CsvNamedRecord (HM.fromList [ + ("AdditionalColumn1", "A"), + ("AdditionalColumn2", "B") + ]) + it "should correctly merge with fields selection and no override" $ do + cp <- liftIO $ R.newIORef 0 + merged <- testLog $ mergeRow cp jannoTargetRow jannoSourceRow (IncludeJannoColumns ["Group_Name", "Latitude"]) False "Poseidon_ID" "Poseidon_ID" + jSite merged `shouldBe` Just "Vienna" + jGroupName merged `shouldBe` JannoList ["SamplePop"] + jLatitude merged `shouldBe` makeLatitude 30.0 + jLongitude merged `shouldBe` Nothing + jAdditionalColumns merged `shouldBe` CsvNamedRecord (HM.fromList [ + ("AdditionalColumn2", "C") + ]) + it "should correctly merge with negative field selection and no override" $ do + cp <- liftIO $ R.newIORef 0 + merged <- testLog $ mergeRow cp jannoTargetRow jannoSourceRow (ExcludeJannoColumns ["Latitude"]) False "Poseidon_ID" "Poseidon_ID" + jSite merged `shouldBe` Just "Vienna" + jGroupName merged `shouldBe` JannoList ["SamplePop"] + jLatitude merged `shouldBe` Nothing + jLongitude merged `shouldBe` makeLongitude 30.0 + jAdditionalColumns merged `shouldBe` CsvNamedRecord (HM.fromList [ + ("AdditionalColumn1", "A"), + ("AdditionalColumn2", "C") + ]) + it "should correctly merge with fields and override" $ do + cp <- liftIO $ R.newIORef 0 + merged <- testLog $ mergeRow cp jannoTargetRow jannoSourceRow (IncludeJannoColumns ["Group_Name", "Latitude"]) True "Poseidon_ID" "Poseidon_ID" + jSite merged `shouldBe` Just "Vienna" + jGroupName merged `shouldBe` JannoList ["SamplePop2"] + jLatitude merged `shouldBe` makeLatitude 30.0 + jLongitude merged `shouldBe` Nothing + jAdditionalColumns merged `shouldBe` CsvNamedRecord (HM.fromList [ + ("AdditionalColumn2", "C") + ]) + +testCoalesceMultipleRows :: Spec +testCoalesceMultipleRows = describe "Poseidon.Jannocoalesce.makeNewJannoRows" $ do + it "should correctly copy with simple matching" $ do + newJ <- testLog $ makeNewJannoRows jannoSourceRows jannoTargetRows 
AllJannoColumns False "Poseidon_ID" "Poseidon_ID" Nothing + jCountry (newJ !! 0) `shouldBe` Just "Germany" + jCountry (newJ !! 1) `shouldBe` Just "Austria" + jCountry (newJ !! 2) `shouldBe` Nothing + it "should correctly copy with simple matching and overwrite" $ do + newJ <- testLog $ makeNewJannoRows jannoSourceRows jannoTargetRows AllJannoColumns True "Poseidon_ID" "Poseidon_ID" Nothing + jCountry (newJ !! 0) `shouldBe` Just "Austria" + jCountry (newJ !! 1) `shouldBe` Just "Austria" + it "should throw with duplicate source keys" $ do + let s = jannoSourceRows ++ [jannoSourceRows !! 1] + testLog (makeNewJannoRows s jannoTargetRows AllJannoColumns False "Poseidon_ID" "Poseidon_ID" Nothing) `shouldThrow` anyException + it "should correctly copy with suffix strip" $ do + newJ <- testLog $ makeNewJannoRows jannoSourceRows jannoTargetRows AllJannoColumns False "Poseidon_ID" "Poseidon_ID" (Just "_AB") + jCountry (newJ !! 0) `shouldBe` Just "Germany" + jCountry (newJ !! 1) `shouldBe` Just "Austria" + jCountry (newJ !! 2) `shouldBe` Just "Austria" + it "should correctly copy with alternative ID column" $ do + newJ <- testLog $ makeNewJannoRows jannoSourceRows jannoTargetRows AllJannoColumns False "Poseidon_ID_alt" "Poseidon_ID" Nothing + jCountry (newJ !! 2) `shouldBe` Just "Austria" diff --git a/test/PoseidonGoldenTests/GoldenTestCheckSumFile.txt b/test/PoseidonGoldenTests/GoldenTestCheckSumFile.txt index 6c6899b4..e1929b66 100644 --- a/test/PoseidonGoldenTests/GoldenTestCheckSumFile.txt +++ b/test/PoseidonGoldenTests/GoldenTestCheckSumFile.txt @@ -112,4 +112,7 @@ b43da4d5734371c0648553120f812466 fetch fetch/multi_packages_2/Lamnidis_2018-1.0. 
1d2a588b88e6d1017147c01f19d0b878 listRemote listRemote/listRemote1 0ddad9ea097bca0253e0c3c6157efa68 listRemote listRemote/listRemote2 b2286cf9af7c6c8757b8109a1f58e2d9 listRemote listRemote/listRemote3 -0433b2a80ee5a2eb5bf8c6404130e562 listRemote listRemote/listRemote4 \ No newline at end of file +0433b2a80ee5a2eb5bf8c6404130e562 listRemote listRemote/listRemote4 +282cedf121f37e81c1e45ec0dfb97560 jannocoalesce jannocoalesce/target1.janno +df34d0542c0a94cf9556619bff2e301d jannocoalesce jannocoalesce/target2.janno +a202f0c1636d55258454ad0a0dfea977 jannocoalesce jannocoalesce/target3.janno \ No newline at end of file diff --git a/test/PoseidonGoldenTests/GoldenTestData/chronicle/chronicle2.yml b/test/PoseidonGoldenTests/GoldenTestData/chronicle/chronicle2.yml index 99865099..8a07b513 100644 --- a/test/PoseidonGoldenTests/GoldenTestData/chronicle/chronicle2.yml +++ b/test/PoseidonGoldenTests/GoldenTestData/chronicle/chronicle2.yml @@ -1,29 +1,29 @@ title: Chronicle title description: Chronicle description chronicleVersion: 0.2.0 -lastModified: 2023-09-22 +lastModified: 2024-02-22 packages: - title: Lamnidis_2018 version: 1.0.0 - commit: eb2e7c2af61b6738f0ad8862645c23dd57bf0bd1 + commit: 14f4b6a0b1670ebe4b2544eeb1d74efd4f618b28 path: Lamnidis_2018 - title: Lamnidis_2018 version: 1.0.1 - commit: eb2e7c2af61b6738f0ad8862645c23dd57bf0bd1 + commit: 14f4b6a0b1670ebe4b2544eeb1d74efd4f618b28 path: Lamnidis_2018_newVersion - title: Schiffels version: 1.1.1 - commit: fa2e92af97376489b32ce8b6874428c958d55f3f + commit: de7c7af1794902b9f85bc57975a0af960d5c24bc path: Schiffels - title: Schiffels_2016 version: 1.0.1 - commit: eb2e7c2af61b6738f0ad8862645c23dd57bf0bd1 + commit: 14f4b6a0b1670ebe4b2544eeb1d74efd4f618b28 path: Schiffels_2016 - title: Schmid_2028 version: 1.0.0 - commit: eb2e7c2af61b6738f0ad8862645c23dd57bf0bd1 + commit: 14f4b6a0b1670ebe4b2544eeb1d74efd4f618b28 path: Schmid_2028 - title: Wang_2020 version: 0.1.0 - commit: eb2e7c2af61b6738f0ad8862645c23dd57bf0bd1 + commit: 
14f4b6a0b1670ebe4b2544eeb1d74efd4f618b28 path: Wang_2020 diff --git a/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target1.janno b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target1.janno new file mode 100644 index 00000000..8760de51 --- /dev/null +++ b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target1.janno @@ -0,0 +1,4 @@ +Poseidon_ID Genetic_Sex Group_Name Alternative_IDs Relation_To Relation_Degree Relation_Type Relation_Note Collection_ID Country Country_ISO Location Site Latitude Longitude Date_Type Date_C14_Labnr Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords AdditionalColumn1 AdditionalColumn2 +XXX011 M POP1 Paul;Peter XXX012;I1234 first;second father_of;grandfather_of yyy n/a xxx DE xxx xxx 0.0 0.0 C14 A-1;A-2;A-3 3000;3100;2900 30;40;20 -1200 -1000 -800 x x x A C xxx;yyy 2 Lib1;Lib2 Shotgun;1240K minus ds diploid ftp://test.test 0.0 0 0.0 0.0 10 1 ANGSD v.123 n/a Ich;mag;Kekse Ich unpublished This is a fine sample Hutschnur test1 test2 +XXX012 F POP2 n/a XXX011 first daughter_of n/a n/a xxx FR xxx xxx -90.0 -180.0 contextual n/a n/a n/a -5500 -5000 -4500 yyy B B xxx 0 Lib3 1240K half ss haploid https://www.google.de 0.0 0 0.0 100.0 20;50;70 2;5;7.4 Schmutzi v145;Zwiebel;other xxx aus Du PaulNature2026 Cheesecake n/a test3 test4 +XXX013 M POP1 Skeleton Joe XXX011 sixthToTenth n/a xxx n/a xxx EG xxx xxx 90.0 180.0 modern n/a n/a n/a 2000 2000 2000 n/a C A xxx 0 n/a ReferenceGenome plus mixed diploid http://huhu.org/23&test 0.0 0 0.0 50.0 n/a n/a n/a n/a der Dose Müllers Kuh BovineCell1618 n/a A;B;C test5 test6 diff --git 
a/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target2.janno b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target2.janno new file mode 100644 index 00000000..65080c7c --- /dev/null +++ b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target2.janno @@ -0,0 +1,4 @@ +Poseidon_ID Genetic_Sex Group_Name Alternative_IDs Relation_To Relation_Degree Relation_Type Relation_Note Collection_ID Country Country_ISO Location Site Latitude Longitude Date_Type Date_C14_Labnr Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords +XXX011 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a 0.0 0.0 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a +XXX012 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a -90.0 -180.0 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a +XXX013 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a 90.0 180.0 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a diff --git a/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target3.janno b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target3.janno new file mode 100644 index 00000000..866bce4a --- /dev/null +++ b/test/PoseidonGoldenTests/GoldenTestData/jannocoalesce/target3.janno @@ -0,0 +1,4 @@ +Poseidon_ID Genetic_Sex Group_Name Alternative_IDs Relation_To Relation_Degree Relation_Type Relation_Note Collection_ID Country Country_ISO Location Site Latitude Longitude Date_Type Date_C14_Labnr 
Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords AdditionalColumn1 AdditionalColumn2 +XXX011 M POP1 Paul;Peter XXX012;I1234 first;second father_of;grandfather_of yyy n/a xxx DE xxx xxx n/a n/a C14 A-1;A-2;A-3 3000;3100;2900 30;40;20 -1200 -1000 -800 x x x A C xxx;yyy 2 Lib1;Lib2 Shotgun;1240K minus ds diploid ftp://test.test 0.0 0 0.0 0.0 10 1 ANGSD v.123 n/a Ich;mag;Kekse Ich unpublished This is a fine sample Hutschnur test1 test2 +XXX012 F POP2 n/a XXX011 first daughter_of n/a n/a xxx FR xxx xxx n/a n/a contextual n/a n/a n/a -5500 -5000 -4500 yyy B B xxx 0 Lib3 1240K half ss haploid https://www.google.de 0.0 0 0.0 100.0 20;50;70 2;5;7.4 Schmutzi v145;Zwiebel;other xxx aus Du PaulNature2026 Cheesecake n/a test3 test4 +XXX013 M POP1 Skeleton Joe XXX011 sixthToTenth n/a xxx n/a xxx EG xxx xxx n/a n/a modern n/a n/a n/a 2000 2000 2000 n/a C A xxx 0 n/a ReferenceGenome plus mixed diploid http://huhu.org/23&test 0.0 0 0.0 50.0 n/a n/a n/a n/a der Dose Müllers Kuh BovineCell1618 n/a A;B;C test5 test6 diff --git a/test/PoseidonGoldenTests/GoldenTestsRunCommands.hs b/test/PoseidonGoldenTests/GoldenTestsRunCommands.hs index 9613413f..b86c1ffd 100644 --- a/test/PoseidonGoldenTests/GoldenTestsRunCommands.hs +++ b/test/PoseidonGoldenTests/GoldenTestsRunCommands.hs @@ -4,59 +4,65 @@ module PoseidonGoldenTests.GoldenTestsRunCommands ( createStaticCheckSumFile, createDynamicCheckSumFile, staticCheckSumFile, dynamicCheckSumFile ) where -import Poseidon.CLI.Fetch (FetchOptions (..), runFetch) -import Poseidon.CLI.Forge (ForgeOptions (..), runForge) -import Poseidon.CLI.Genoconvert 
(GenoconvertOptions (..), - runGenoconvert) -import Poseidon.CLI.Init (InitOptions (..), runInit) -import Poseidon.CLI.List (ListEntity (..), ListOptions (..), - RepoLocationSpec (..), runList) -import Poseidon.CLI.Rectify (ChecksumsToRectify (..), - PackageVersionUpdate (..), - RectifyOptions (..), runRectify) -import Poseidon.CLI.Serve (ServeOptions (..), runServer) -import Poseidon.CLI.Summarise (SummariseOptions (..), runSummarise) -import Poseidon.CLI.Survey (SurveyOptions (..), runSurvey) -import Poseidon.CLI.Timetravel (TimetravelOptions (..), - runTimetravel) -import Poseidon.CLI.Validate (ValidateOptions (..), - ValidatePlan (..), runValidate) -import Poseidon.Contributor (ContributorSpec (..)) -import Poseidon.EntityTypes (EntityInput (..), - PoseidonEntity (..), - readEntitiesFromString) -import Poseidon.GenotypeData (GenoDataSource (..), - GenotypeDataSpec (..), - GenotypeFormatSpec (..), - SNPSetSpec (..)) -import Poseidon.ServerClient (ArchiveEndpoint (..)) -import Poseidon.Utils (LogMode (..), TestMode (..), - getChecksum, testLog, - usePoseidonLogger) -import Poseidon.Version (VersionComponent (..)) +import Poseidon.CLI.Fetch (FetchOptions (..), runFetch) +import Poseidon.CLI.Forge (ForgeOptions (..), runForge) +import Poseidon.CLI.Genoconvert (GenoconvertOptions (..), + runGenoconvert) +import Poseidon.CLI.Init (InitOptions (..), runInit) +import Poseidon.CLI.Jannocoalesce (CoalesceJannoColumnSpec (..), + JannoCoalesceOptions (..), + JannoSourceSpec (..), + runJannocoalesce) +import Poseidon.CLI.List (ListEntity (..), ListOptions (..), + RepoLocationSpec (..), runList) +import Poseidon.CLI.Rectify (ChecksumsToRectify (..), + PackageVersionUpdate (..), + RectifyOptions (..), runRectify) +import Poseidon.CLI.Serve (ServeOptions (..), runServer) +import Poseidon.CLI.Summarise (SummariseOptions (..), + runSummarise) +import Poseidon.CLI.Survey (SurveyOptions (..), runSurvey) +import Poseidon.CLI.Timetravel (TimetravelOptions (..), + runTimetravel) 
+import Poseidon.CLI.Validate (ValidateOptions (..), + ValidatePlan (..), runValidate) +import Poseidon.Contributor (ContributorSpec (..)) +import Poseidon.EntityTypes (EntityInput (..), + PoseidonEntity (..), + readEntitiesFromString) +import Poseidon.GenotypeData (GenoDataSource (..), + GenotypeDataSpec (..), + GenotypeFormatSpec (..), + SNPSetSpec (..)) +import Poseidon.ServerClient (ArchiveEndpoint (..)) +import Poseidon.Utils (LogMode (..), TestMode (..), + getChecksum, testLog, + usePoseidonLogger) +import Poseidon.Version (VersionComponent (..)) -import Control.Concurrent (forkIO, killThread, newEmptyMVar) -import Control.Concurrent.MVar (takeMVar) -import Control.Exception (finally) -import Control.Monad (forM_, unless, when) -import Data.Either (fromRight) -import Data.Function ((&)) -import qualified Data.Text as T -import qualified Data.Text.IO as T -import Data.Version (makeVersion) -import GHC.IO.Handle (hClose, hDuplicate, hDuplicateTo) -import Poseidon.CLI.Chronicle (ChronOperation (..), - ChronicleOptions (..), runChronicle) -import Poseidon.EntityTypes (PacNameAndVersion (..)) -import SequenceFormats.Plink (PlinkPopNameMode (..)) -import System.Directory (copyFile, createDirectory, - createDirectoryIfMissing, - doesDirectoryExist, listDirectory, - removeDirectoryRecursive) -import System.FilePath.Posix ((</>)) -import System.IO (IOMode (WriteMode), hPutStrLn, - openFile, stderr, stdout, withFile) -import System.Process (callCommand) +import Control.Concurrent (forkIO, killThread, newEmptyMVar) +import Control.Concurrent.MVar (takeMVar) +import Control.Exception (finally) +import Control.Monad (forM_, unless, when) +import Data.Either (fromRight) +import Data.Function ((&)) +import qualified Data.Text as T +import qualified Data.Text.IO as T +import Data.Version (makeVersion) +import GHC.IO.Handle (hClose, hDuplicate, hDuplicateTo) +import Poseidon.CLI.Chronicle (ChronOperation (..), + ChronicleOptions (..), + runChronicle) +import Poseidon.EntityTypes
(PacNameAndVersion (..)) +import SequenceFormats.Plink (PlinkPopNameMode (..)) +import System.Directory (copyFile, createDirectory, + createDirectoryIfMissing, + doesDirectoryExist, listDirectory, + removeDirectoryRecursive) +import System.FilePath.Posix ((</>)) +import System.IO (IOMode (WriteMode), hPutStrLn, + openFile, stderr, stdout, withFile) +import System.Process (callCommand) -- file paths -- @@ -182,6 +188,8 @@ runCLICommands interactive testDir checkFilePath = do testPipelineFetch testDir checkFilePath hPutStrLn stderr "--- list --remote" testPipelineListRemote testDir checkFilePath + hPutStrLn stderr "--- jannocoalesce" + testPipelineJannocoalesce testDir checkFilePath -- close error sink hClose devNull unless interactive $ hDuplicateTo stderr_old stderr @@ -1076,3 +1084,48 @@ testPipelineListRemote testDir checkFilePath = do ) ( killThread threadID ) + +testPipelineJannocoalesce :: FilePath -> FilePath -> IO () +testPipelineJannocoalesce testDir checkFilePath = do + -- simple coalesce + let jannocoalesceOpts1 = JannoCoalesceOptions { + _jannocoalesceSource = JannoSourceSingle "test/testDat/testJannoFiles/normal_full.janno", + _jannocoalesceTarget = "test/testDat/testJannoFiles/minimal_full.janno", + _jannocoalesceOutSpec = Just (testDir </> "jannocoalesce" </> "target1.janno"), + _jannocoalesceJannoColumns = AllJannoColumns, + _jannocoalesceOverwriteColumns = False, + _jannocoalesceSourceKey = "Poseidon_ID", + _jannocoalesceTargetKey = "Poseidon_ID", + _jannocoalesceIdStrip = Nothing + } + runAndChecksumFiles checkFilePath testDir (testLog $ runJannocoalesce jannocoalesceOpts1) "jannocoalesce" [ + "jannocoalesce" </> "target1.janno" + ] + -- only coalesce certain columns (--includeColumns) + let jannocoalesceOpts2 = JannoCoalesceOptions { + _jannocoalesceSource = JannoSourceSingle "test/testDat/testJannoFiles/normal_full.janno", + _jannocoalesceTarget = "test/testDat/testJannoFiles/minimal_full.janno", + _jannocoalesceOutSpec = Just (testDir </> "jannocoalesce" </>
"target2.janno"), + _jannocoalesceJannoColumns = IncludeJannoColumns ["Latitude", "Longitude"], + _jannocoalesceOverwriteColumns = False, + _jannocoalesceSourceKey = "Poseidon_ID", + _jannocoalesceTargetKey = "Poseidon_ID", + _jannocoalesceIdStrip = Nothing + } + runAndChecksumFiles checkFilePath testDir (testLog $ runJannocoalesce jannocoalesceOpts2) "jannocoalesce" [ + "jannocoalesce" </> "target2.janno" + ] + -- do not coalesce certain columns (--excludeColumns) + let jannocoalesceOpts3 = JannoCoalesceOptions { + _jannocoalesceSource = JannoSourceSingle "test/testDat/testJannoFiles/normal_full.janno", + _jannocoalesceTarget = "test/testDat/testJannoFiles/minimal_full.janno", + _jannocoalesceOutSpec = Just (testDir </> "jannocoalesce" </> "target3.janno"), + _jannocoalesceJannoColumns = ExcludeJannoColumns ["Latitude", "Longitude"], + _jannocoalesceOverwriteColumns = False, + _jannocoalesceSourceKey = "Poseidon_ID", + _jannocoalesceTargetKey = "Poseidon_ID", + _jannocoalesceIdStrip = Nothing + } + runAndChecksumFiles checkFilePath testDir (testLog $ runJannocoalesce jannocoalesceOpts3) "jannocoalesce" [ + "jannocoalesce" </> "target3.janno" + ] diff --git a/test/testDat/testJannoFiles/normal_subset.janno b/test/testDat/testJannoFiles/normal_subset.janno new file mode 100644 index 00000000..ea7a635d --- /dev/null +++ b/test/testDat/testJannoFiles/normal_subset.janno @@ -0,0 +1,4 @@ +Poseidon_ID Genetic_Sex Group_Name Location Site Latitude Longitude Date_Type Date_C14_Labnr Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords AdditionalColumn2 AdditionalColumn1 +XXX011 M POP1;POP3 Aachen xxx 0 0 C14
A-1;A-2;A-3 3000;3100;2900 30;40;20 -1200 -1000 -800 x x x A C xxx;yyy 2 Lib1;Lib2 Shotgun;1240K minus ds diploid ftp://test.test 0 0 0 0 10 1 ANGSD v.123 Ich;mag;Kekse Ich unpublished This is a fine sample Hutschnur test2 test1 +XXX012 F POP2 xxx Cologne -90 -180 contextual n/a n/a n/a -5500 -5000 -4500 yyy B B xxx 0 Lib3 1240K half ss haploid https://www.google.de 0 0 0 100 20;50;70 2;5;7.4 Schmutzi v145;Zwiebel;other xxx aus Du PaulNature2026 Cheesecake n/a test4 test3 +XXX013 M POP1 xxx xxx 90 180 modern n/a n/a n/a 2000 2000 2000 n/a C A xxx 0 ReferenceGenome plus mixed diploid http://huhu.org/23&test 0 0 0 50 n/a n/a n/a n/a der Dose Müllers Kuh BovineCell1618 n/a A;B;C test6 test5