Jannocoalesce #282

Merged
merged 37 commits into from
Feb 26, 2024
6a7859e
started jannoCoalesce
stschiff Nov 8, 2023
8877211
minor
stschiff Nov 8, 2023
6db282c
first simple implementation, no compile, no test yet
stschiff Nov 15, 2023
cdaa4e5
continued
stschiff Nov 21, 2023
0e57116
finished API and compiles. Not tested yet
stschiff Nov 30, 2023
ade046a
added unit tests
stschiff Dec 4, 2023
fc567f6
added golden test for jannocoalesce
stschiff Dec 5, 2023
0d019f0
bumped version, added Changelog
stschiff Dec 5, 2023
6ee7de7
stylish-haskell
stschiff Dec 6, 2023
7a8c331
continued
stschiff Dec 18, 2023
21db6a4
added tests
stschiff Dec 19, 2023
8f5b769
changed some option short forms
stschiff Dec 19, 2023
d859957
added log outputs, not tested yet
stschiff Dec 19, 2023
4486b34
pedantic fixes
stschiff Dec 19, 2023
c760e83
fixed tests
nevrome Dec 20, 2023
d2dc568
stylish haskell
nevrome Dec 20, 2023
1bb95a9
various minor suggestions
nevrome Dec 20, 2023
d2d97cb
stylish haskell
nevrome Dec 20, 2023
7453873
added message to indicate the start of the coalescing process
nevrome Dec 20, 2023
9d44eb1
Merge pull request #284 from poseidon-framework/jannocoalesceMinorAdj…
nevrome Jan 9, 2024
a4b30ba
Merge branch 'master' into jannocoalesce
nevrome Feb 21, 2024
899433a
changed the interface of jannocoalesce to allow excluding columns
nevrome Feb 21, 2024
07fe33a
added some more tests for the new column selection functionality in j…
nevrome Feb 21, 2024
3769a4c
added test cases for additional columns in jannocoalesce - currently …
nevrome Feb 21, 2024
8254653
work towards fixing the additional column issue
nevrome Feb 22, 2024
7e5e213
fixed the expected test results
nevrome Feb 22, 2024
35c3df1
an idea for an alternative implementation of jannocoalesce - seems to…
nevrome Feb 22, 2024
f478969
update of golden tests output
nevrome Feb 22, 2024
a1a796f
stylish-haskell
nevrome Feb 22, 2024
240e05d
some improvements to the jannocoalesce code
nevrome Feb 22, 2024
4df6d49
added some helpful summary statistics for jannocoalesce with an IORef
nevrome Feb 23, 2024
8460a36
fixed tests, which call mergeRow directly
nevrome Feb 23, 2024
b3cff29
stylish-haskell
nevrome Feb 23, 2024
fb22b43
added a single comment to mergeRow
stschiff Feb 23, 2024
04c557e
Merge pull request #288 from poseidon-framework/jannocoalesceMinorAdj…
nevrome Feb 26, 2024
12bbf35
work on the release changelog
nevrome Feb 26, 2024
d81a951
completed the release-changelog
nevrome Feb 26, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,5 @@
- V 1.4.1.0:
    - Added new tool `trident jannocoalesce`, which merges information from a source .janno file to a target .janno file.
- V 1.4.0.4:
    - Added better error messages for generic cassava parsing (e.g. for broken Int and Double fields) in .janno files.
    - Added better error handling and messages for inconsistent `Date_*`, `Contamination_*` and `Relation_*` columns in .janno files using an `Except` & `Writer` monad stack.
46 changes: 46 additions & 0 deletions CHANGELOGRELEASE.md
@@ -1,3 +1,49 @@
### V 1.4.1.0

This release adds an entirely new subcommand to merge two `.janno` files (`jannocoalesce`) and improves the error messages for broken `.janno` files.

#### Merging `.janno` files with `jannocoalesce`

The need for a tool to combine the information of two `.janno` files arose in the Poseidon ecosystem as we started to conceptualize the Poseidon [Minotaur Archive](https://github.com/poseidon-framework/minotaur-archive). This archive will be populated by paper-wise Poseidon packages for which the genotype data was regenerated through the Minotaur workflow (work in progress). We plan to reprocess various packages that are already in the [Poseidon Community Archive](https://github.com/poseidon-framework/community-archive) and for these packages we want to copy e.g. spatiotemporal information from the already available `.janno` files. `jannocoalesce` is the answer to this specific need, but can also be useful for various other applications.

It generally works by reading a source `.janno` file with `-s|--sourceFile` (or all `.janno` files in a `-d|--baseDir`) and a target `.janno` file with `-t|--targetFile`. It then merges these files by a key column, which can be selected with `--sourceKey` and `--targetKey`. The default for both of these key columns is the `Poseidon_ID`. If the entries in the key columns differ slightly but systematically, e.g. because the `Poseidon_ID`s in one of the files carry a special suffix (for example `_SG`), the `--stripIdRegex` option allows stripping these with a regular expression.
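
The key matching described above can be illustrated with a minimal sketch. It is not the implementation in `trident`: to stay dependency-free it assumes a literal suffix instead of a regular expression, and the names `stripSuffixLit` and `keysMatch` are hypothetical.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical simplification: remove a literal suffix (not a regex match).
stripSuffixLit :: String -> String -> String
stripSuffixLit suffix s
    | not (null suffix) && suffix `isSuffixOf` s = take (length s - length suffix) s
    | otherwise                                  = s

-- Two keys match if they are equal after the optional stripping,
-- which is applied to both the target and the source key.
keysMatch :: Maybe String -> String -> String -> Bool
keysMatch Nothing       a b = a == b
keysMatch (Just suffix) a b = stripSuffixLit suffix a == stripSuffixLit suffix b

main :: IO ()
main = do
    print (keysMatch Nothing      "Ind001_SG" "Ind001") -- False
    print (keysMatch (Just "_SG") "Ind001_SG" "Ind001") -- True
```

Stripping both keys before comparing means it does not matter which of the two files carries the suffix.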

`jannocoalesce` generally attempts to fill **all** empty cells in the target `.janno` file with information from the source. `--includeColumns` and `--excludeColumns` allow selecting specific columns for which this should be done. In some cases it may be desirable not just to fill empty fields in the target, but to overwrite the information already there, which the `-f|--force` option enables. If the target file should be preserved, the output can be directed to a new `.janno` file with `-o|--outFile`.
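
The fill rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code shipped in `trident`: rows are modelled as plain lists of (column, value) pairs, `ColumnSpec`, `selected`, `isEmptyCell` and `coalesceRow` are hypothetical names, and columns that exist only in the source are not added to the target here.

```haskell
import Data.Maybe (fromMaybe)

-- Assumption: a row is a list of (column name, cell value) pairs.
type Row = [(String, String)]

-- Hypothetical stand-in for the column selection options.
data ColumnSpec = AllColumns | Include [String] | Exclude [String]

selected :: ColumnSpec -> String -> Bool
selected AllColumns   _ = True
selected (Include cs) c = c `elem` cs
selected (Exclude cs) c = c `notElem` cs

-- A cell counts as empty if it is blank or holds "n/a".
isEmptyCell :: String -> Bool
isEmptyCell v = v `elem` ["", "n/a"]

-- Fill target cells from the source row. The key column is never touched;
-- non-empty cells are only replaced if force is True.
coalesceRow :: ColumnSpec -> Bool -> String -> Row -> Row -> Row
coalesceRow spec force keyCol source = map fill
  where
    fill (col, val)
        | col /= keyCol && selected spec col && (isEmptyCell val || force)
            = (col, fromMaybe "" (lookup col source))
        | otherwise = (col, val)

main :: IO ()
main = do
    let source = [("Poseidon_ID", "Ind001"), ("Country", "Italy"), ("Site", "Rome")]
        target = [("Poseidon_ID", "Ind001"), ("Country", "n/a"),   ("Site", "Roma")]
    print (coalesceRow AllColumns            False "Poseidon_ID" source target)
    print (coalesceRow AllColumns            True  "Poseidon_ID" source target)
    print (coalesceRow (Exclude ["Country"]) True  "Poseidon_ID" source target)
```

With the sample rows in `main`, the first call fills only `Country`, the second also replaces `Site`, and the third leaves `Country` untouched because it is excluded.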

#### Better error messages for broken `.janno` files

`.janno` file validation is a core feature of `trident`. With this release we try to improve the error messages for two common situations:

1. Broken number fields. This can happen if some text or a wrong character ends up in a number field.

So far the error messages for this case have been quite technical. For example, this is the message for an integer field that contains `430;`, where the integer `430` is accidentally followed by a trailing `;`:

```
parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59]))
```

The new error message is clearer:

```
parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";")
```

2. Inconsistent `Date_*`, `Contamination_*` and `Relation_*` columns. These sets of columns have to be cross-consistent, following a logic that is especially complex for the `Date_*` fields (see [here](https://www.poseidon-adna.org/#/janno_details?id=the-columns-in-detail)).

So far any inconsistency was reported with this generic error message:

```
The Date_* columns are not consistent
```

Now we include far more precise messages, for example:

```
Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty.
```

This should simplify tedious `.janno` file debugging in the future.
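
As a side note on the first case: the `leftover: [59]` part of the old message lists raw byte values, while the new message shows the corresponding characters. The relation between the two can be sketched like this, assuming ASCII input; `leftoverChars` is a hypothetical helper and not necessarily how `trident` derives its message.

```haskell
import Data.Char (chr)

-- Turn leftover byte values into the characters they encode (ASCII assumed).
leftoverChars :: [Int] -> String
leftoverChars = map chr

main :: IO ()
main = putStrLn (leftoverChars [59]) -- prints ;
```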

### V 1.4.0.3

This small release fixes a performance issue related to finding the latest version of all packages. The bug had severe detrimental effects on `forge` and `fetch`, which are now resolved.
7 changes: 4 additions & 3 deletions poseidon-hs.cabal
@@ -1,5 +1,5 @@
name: poseidon-hs
-version: 1.4.0.4
+version: 1.4.1.0
synopsis: A package with tools for working with Poseidon Genotype Data
description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals.
license: MIT
@@ -21,7 +21,8 @@ library
Poseidon.CLI.Summarise, Poseidon.CLI.Validate, Poseidon.Utils,
Poseidon.CLI.Survey, Poseidon.CLI.Forge, Poseidon.CLI.Init,
Poseidon.CLI.Rectify, Poseidon.CLI.Fetch, Poseidon.CLI.Genoconvert,
-Poseidon.CLI.OptparseApplicativeParsers, Poseidon.CLI.Timetravel
+Poseidon.CLI.OptparseApplicativeParsers, Poseidon.CLI.Timetravel,
+Poseidon.CLI.Jannocoalesce
other-modules: Paths_poseidon_hs
hs-source-dirs: src
build-depends: base >= 4.7 && < 5, sequence-formats>=1.6.1, text, time, pipes-safe,
@@ -52,7 +53,7 @@ Test-Suite poseidon-tools-tests
filepath, pipes, pipes-safe, pipes-ordered-zip,
unordered-containers, cassava, containers, process
other-modules: Poseidon.PackageSpec, Poseidon.JannoSpec,
-Poseidon.BibFileSpec, Poseidon.MathHelpersSpec,
+Poseidon.BibFileSpec, Poseidon.MathHelpersSpec, Poseidon.JannocoalesceSpec,
Poseidon.SummariseSpec, Poseidon.SurveySpec, Poseidon.GenotypeDataSpec,
Poseidon.EntitiesListSpec, PoseidonGoldenTests.GoldenTestsValidateChecksumsSpec,
PoseidonGoldenTests.GoldenTestsRunCommands, Poseidon.ChronicleSpec,
42 changes: 30 additions & 12 deletions src-executables/Main-trident.hs
@@ -11,6 +11,8 @@ import Poseidon.CLI.Genoconvert (GenoconvertOptions (..
runGenoconvert)
import Poseidon.CLI.Init (InitOptions (..),
runInit)
import Poseidon.CLI.Jannocoalesce (JannoCoalesceOptions (..),
runJannocoalesce)
import Poseidon.CLI.List (ListOptions (..),
runList)
import Poseidon.CLI.OptparseApplicativeParsers
@@ -68,6 +70,7 @@ data Subcommand =
| CmdChronicle ChronicleOptions
| CmdTimetravel TimetravelOptions
| CmdServe ServeOptions
| CmdJannoCoalesce JannoCoalesceOptions

main :: IO ()
main = do
@@ -88,18 +91,20 @@ main = do

runCmd :: Subcommand -> PoseidonIO ()
runCmd o = case o of
-    CmdInit opts        -> runInit opts
-    CmdList opts        -> runList opts
-    CmdFetch opts       -> runFetch opts
-    CmdForge opts       -> runForge opts
-    CmdGenoconvert opts -> runGenoconvert opts
-    CmdSummarise opts   -> runSummarise opts
-    CmdSurvey opts      -> runSurvey opts
-    CmdRectify opts     -> runRectify opts
-    CmdValidate opts    -> runValidate opts
-    CmdChronicle opts   -> runChronicle opts
-    CmdTimetravel opts  -> runTimetravel opts
-    CmdServe opts       -> runServerMainThread opts
+    -- alphabetic order
+    CmdChronicle opts     -> runChronicle opts
+    CmdFetch opts         -> runFetch opts
+    CmdForge opts         -> runForge opts
+    CmdGenoconvert opts   -> runGenoconvert opts
+    CmdJannoCoalesce opts -> runJannocoalesce opts
+    CmdInit opts          -> runInit opts
+    CmdList opts          -> runList opts
+    CmdRectify opts       -> runRectify opts
+    CmdServe opts         -> runServerMainThread opts
+    CmdSummarise opts     -> runSummarise opts
+    CmdSurvey opts        -> runSurvey opts
+    CmdTimetravel opts    -> runTimetravel opts
+    CmdValidate opts      -> runValidate opts

optParserInfo :: OP.ParserInfo Options
optParserInfo = OP.info (
@@ -131,6 +136,7 @@ subcommandParser = OP.subparser (
OP.command "fetch" fetchOptInfo <>
OP.command "forge" forgeOptInfo <>
OP.command "genoconvert" genoconvertOptInfo <>
OP.command "jannocoalesce" jannocoalesceOptInfo <>
OP.command "rectify" rectifyOptInfo <>
OP.commandGroup "Package creation and manipulation commands:"
) <|>
@@ -182,6 +188,8 @@ subcommandParser = OP.subparser (
(OP.progDesc "Construct package directories from chronicle files")
serveOptInfo = OP.info (OP.helper <*> (CmdServe <$> serveOptParser))
(OP.progDesc "Serve Poseidon packages via HTTP or HTTPS")
jannocoalesceOptInfo = OP.info (OP.helper <*> (CmdJannoCoalesce <$> jannocoalesceOptParser))
(OP.progDesc "Coalesce information from one or multiple janno files to another one")

initOptParser :: OP.Parser InitOptions
initOptParser = InitOptions <$> parseInGenotypeDataset
@@ -260,3 +268,13 @@ serveOptParser = ServeOptions <$> parseArchiveBasePaths
<*> parsePort
<*> parseIgnoreChecksums
<*> parseMaybeCertFiles

jannocoalesceOptParser :: OP.Parser JannoCoalesceOptions
jannocoalesceOptParser = JannoCoalesceOptions <$> parseJannocoalSourceSpec
                                              <*> parseJannocoalTargetFile
                                              <*> parseJannocoalOutSpec
                                              <*> parseJannocoalJannoColumns
                                              <*> parseJannocoalOverride
                                              <*> parseJannocoalSourceKey
                                              <*> parseJannocoalTargetKey
                                              <*> parseJannocoalIdStripRegex
161 changes: 161 additions & 0 deletions src/Poseidon/CLI/Jannocoalesce.hs
@@ -0,0 +1,161 @@
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections     #-}

module Poseidon.CLI.Jannocoalesce where

import           Poseidon.Janno         (JannoRow (..), JannoRows (..),
                                         readJannoFile, writeJannoFile)
import           Poseidon.Package       (PackageReadOptions (..),
                                         defaultPackageReadOptions,
                                         getJointJanno,
                                         readPoseidonPackageCollection)
import           Poseidon.Utils         (PoseidonException (..), PoseidonIO,
                                         logDebug, logInfo, logWarning)

import           Control.Monad          (filterM, forM_, when)
import           Control.Monad.Catch    (MonadThrow, throwM)
import           Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Char8  as BSC
import qualified Data.Csv               as Csv
import qualified Data.HashMap.Strict    as HM
import qualified Data.IORef             as R
import           Data.List              ((\\))
import           Data.Text              (pack, replace, unpack)
import           System.Directory       (createDirectoryIfMissing)
import           System.FilePath        (takeDirectory)
import           Text.Regex.TDFA        ((=~))

-- the source can be a single janno file, or a set of base directories as usual.
data JannoSourceSpec = JannoSourceSingle FilePath | JannoSourceBaseDirs [FilePath]

data CoalesceJannoColumnSpec =
      AllJannoColumns
    | IncludeJannoColumns [BSC.ByteString]
    | ExcludeJannoColumns [BSC.ByteString]

data JannoCoalesceOptions = JannoCoalesceOptions
    { _jannocoalesceSource           :: JannoSourceSpec
    , _jannocoalesceTarget           :: FilePath
    , _jannocoalesceOutSpec          :: Maybe FilePath -- Nothing means "in place"
    , _jannocoalesceJannoColumns     :: CoalesceJannoColumnSpec
    , _jannocoalesceOverwriteColumns :: Bool
    , _jannocoalesceSourceKey        :: String -- by default set to "Poseidon_ID"
    , _jannocoalesceTargetKey        :: String -- by default set to "Poseidon_ID"
    , _jannocoalesceIdStrip          :: Maybe String -- an optional regex to strip from target and source keys
    }

runJannocoalesce :: JannoCoalesceOptions -> PoseidonIO ()
runJannocoalesce (JannoCoalesceOptions sourceSpec target outSpec fields overwrite sKey tKey maybeStrip) = do
    JannoRows sourceRows <- case sourceSpec of
        JannoSourceSingle sourceFile -> readJannoFile sourceFile
        JannoSourceBaseDirs sourceDirs -> do
            let pacReadOpts = defaultPackageReadOptions {
                      _readOptIgnoreChecksums = True
                    , _readOptGenoCheck       = False
                    , _readOptIgnoreGeno      = True
                    , _readOptOnlyLatest      = True
                    }
            getJointJanno <$> readPoseidonPackageCollection pacReadOpts sourceDirs
    JannoRows targetRows <- readJannoFile target

    newJanno <- makeNewJannoRows sourceRows targetRows fields overwrite sKey tKey maybeStrip

    let outPath = maybe target id outSpec
    logInfo $ "Writing to file (directory will be created if missing): " ++ outPath
    liftIO $ do
        createDirectoryIfMissing True (takeDirectory outPath)
        writeJannoFile outPath (JannoRows newJanno)

type CounterMismatches = R.IORef Int
type CounterCopied     = R.IORef Int

makeNewJannoRows :: [JannoRow] -> [JannoRow] -> CoalesceJannoColumnSpec -> Bool -> String -> String -> Maybe String -> PoseidonIO [JannoRow]
makeNewJannoRows sourceRows targetRows fields overwrite sKey tKey maybeStrip = do
    logInfo "Starting to coalesce..."
    counterMismatches <- liftIO $ R.newIORef 0
    counterCopied <- liftIO $ R.newIORef 0
    newRows <- mapM (makeNewJannoRow counterMismatches counterCopied) targetRows
    counterCopiedVal <- liftIO $ R.readIORef counterCopied
    counterMismatchesVal <- liftIO $ R.readIORef counterMismatches
    logInfo $ "Copied " ++ show counterCopiedVal ++ " values"
    when (counterMismatchesVal > 0) $
        logWarning $ "Failed to find matches for " ++ show counterMismatchesVal ++ " target rows in source"
    return newRows
    where
        makeNewJannoRow :: CounterMismatches -> CounterCopied -> JannoRow -> PoseidonIO JannoRow
        makeNewJannoRow cm cp targetRow = do
            posId <- getKeyFromJanno targetRow tKey
            sourceRowCandidates <- filterM (\r -> (matchWithOptionalStrip maybeStrip posId) <$> getKeyFromJanno r sKey) sourceRows
            case sourceRowCandidates of
                [] -> do
                    logWarning $ "no match for target " ++ posId ++ " in source"
                    liftIO $ R.modifyIORef cm (+1)
                    return targetRow
                [keyRow] -> mergeRow cp targetRow keyRow fields overwrite sKey tKey
                _ -> throwM $ PoseidonGenericException $ "source file contains multiple rows with key " ++ posId

getKeyFromJanno :: (MonadThrow m) => JannoRow -> String -> m String
getKeyFromJanno jannoRow key = do
    let jannoRowDict = Csv.toNamedRecord jannoRow
    case jannoRowDict HM.!? (BSC.pack key) of
        Nothing -> throwM $ PoseidonGenericException ("Key " ++ key ++ " not present in .janno file")
        Just r  -> return $ BSC.unpack r

matchWithOptionalStrip :: (Maybe String) -> String -> String -> Bool
matchWithOptionalStrip maybeRegex id1 id2 =
    case maybeRegex of
        Nothing -> id1 == id2
        Just r ->
            let id1stripped = stripR r id1
                id2stripped = stripR r id2
            in  id1stripped == id2stripped
    where
        stripR :: String -> String -> String
        stripR r s =
            let match = s =~ r
            in  if null match then s else unpack $ replace (pack match) "" (pack s)

mergeRow :: CounterCopied -> JannoRow -> JannoRow -> CoalesceJannoColumnSpec -> Bool -> String -> String -> PoseidonIO JannoRow
mergeRow cp targetRow sourceRow fields overwrite sKey tKey = do
    let sourceKeys        = HM.keys sourceRowRecord
        sourceKeysDesired = determineDesiredSourceKeys sourceKeys fields
        -- fill in the target row with dummy values for desired fields that might not be present yet
        targetComplete    = HM.union targetRowRecord (HM.fromList $ map (, BSC.empty) sourceKeysDesired)
        newRowRecord      = HM.mapWithKey fillFromSource targetComplete
        parseResult       = Csv.runParser . Csv.parseNamedRecord $ newRowRecord
    logInfo $ "matched target " ++ BSC.unpack (targetComplete HM.! BSC.pack tKey) ++
              " with source " ++ BSC.unpack (sourceRowRecord HM.! BSC.pack sKey)
    case parseResult of
        Left err -> throwM . PoseidonGenericException $ ".janno row-merge error: " ++ err
        Right r  -> do
            let newFields = HM.differenceWith (\v1 v2 -> if v1 == v2 then Nothing else Just v1) newRowRecord targetComplete
            if HM.null newFields then do
                logDebug "-- no changes"
            else do
                forM_ (HM.toList newFields) $ \(key, val) -> do
                    liftIO $ R.modifyIORef cp (+1)
                    logDebug $ "-- copied \"" ++ BSC.unpack val ++ "\" from column " ++ BSC.unpack key
            return r
    where
        targetRowRecord :: Csv.NamedRecord
        targetRowRecord = Csv.toNamedRecord targetRow
        sourceRowRecord :: Csv.NamedRecord
        sourceRowRecord = Csv.toNamedRecord sourceRow
        determineDesiredSourceKeys :: [BSC.ByteString] -> CoalesceJannoColumnSpec -> [BSC.ByteString]
        determineDesiredSourceKeys keys  AllJannoColumns               = keys
        determineDesiredSourceKeys _    (IncludeJannoColumns included) = included
        determineDesiredSourceKeys keys (ExcludeJannoColumns excluded) = keys \\ excluded
        fillFromSource :: BSC.ByteString -> BSC.ByteString -> BSC.ByteString
        fillFromSource key targetVal =
            -- don't overwrite key
            if key /= BSC.pack tKey
                -- overwrite field only if it's requested
                && includeField key fields
                -- overwrite only empty fields, except overwrite is set
                && (targetVal `elem` ["n/a", "", BSC.empty] || overwrite)
            then HM.findWithDefault "" key sourceRowRecord
            else targetVal
        includeField :: BSC.ByteString -> CoalesceJannoColumnSpec -> Bool
        includeField _    AllJannoColumns         = True
        includeField key (IncludeJannoColumns xs) = key `elem` xs
        includeField key (ExcludeJannoColumns xs) = key `notElem` xs