new feature janno-coalesce or janno-join #278

Closed
stschiff opened this issue Oct 20, 2023 · 5 comments

@stschiff
Member

We decided that trident needs a new feature to merge janno files. The new command, e.g. named trident janno-coalesce, would take a source package and a target package and match rows on the basis of an ID match. By default this could use the PoseidonIDs in both packages, but alternatively it could allow other janno columns in the first and second file, similar to join functions elsewhere, e.g. in the tidyverse. It would then fill any fields missing in the target but filled in the source, and perhaps report warnings for conflicting information in the two janno files.
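
To make the intended behaviour concrete, here is a minimal Python sketch of the coalesce idea, not the trident implementation. It assumes rows are already loaded as dictionaries, that `n/a` and empty strings count as missing, and that only columns present in the target row are filled. The names `coalesce`, `is_missing` and `MISSING` are hypothetical.

```python
MISSING = {"", "n/a"}  # assumption: values treated as "missing"


def is_missing(value):
    """True for None, empty strings and the assumed missing-value marker."""
    return value is None or value.strip() in MISSING


def coalesce(target_rows, source_rows,
             target_key="Poseidon_ID", source_key="Poseidon_ID"):
    """Fill gaps in target rows from source rows with a matching key.

    Returns (filled_rows, conflicts); the inputs are not modified.
    """
    source_by_id = {row[source_key]: row for row in source_rows}
    filled, conflicts = [], []
    for row in target_rows:
        new_row = dict(row)
        match = source_by_id.get(row[target_key])
        if match is not None:
            for column, target_value in row.items():
                if column == target_key:
                    continue
                source_value = match.get(column)
                if is_missing(source_value):
                    continue  # nothing to take from the source
                if is_missing(target_value):
                    new_row[column] = source_value
                elif target_value != source_value:
                    # keep the target value, but remember the disagreement
                    conflicts.append(
                        (row[target_key], column, target_value, source_value)
                    )
        filled.append(new_row)
    return filled, conflicts


target = [
    {"Poseidon_ID": "Ind1", "Country": "n/a", "Latitude": "48.2"},
    {"Poseidon_ID": "Ind2", "Country": "Austria", "Latitude": "n/a"},
]
source = [{"Poseidon_ID": "Ind1", "Country": "Germany", "Latitude": "50.0"}]

filled, conflicts = coalesce(target, source)
for conflict in conflicts:
    print("conflicting values:", conflict)
```

In this example Ind1 receives its Country from the source, its Latitude is kept and reported as a conflict, and Ind2 stays unchanged because it has no match.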

@nevrome
Member

nevrome commented Nov 8, 2023

A suggestion for the syntax:

trident jannocoalesce \
  --targetFile file/that/should/be/filled.janno \
  --sourceFile file/that/should/be/queried.janno \
  --outFile new/completed/file.janno \
  --fillColumns "Country,Latitude,Longitude,..." \
  --overwriteColumns

  1. jannocoalesce rather than janno-coalesce, because we already have genoconvert without a hyphen
  2. --targetFile, --sourceFile and --outFile are mandatory; --outFile can be identical to --targetFile (for brave people and automation)
  3. --fillColumns defines the columns that should be merged; the default (All) is to merge all overlapping columns
  4. --overwriteColumns allows overwriting the columns completely from the source, even if the target has some values filled; the default (False) would be to preserve all available information in the target
  5. Merging can only be done by Poseidon_ID; I think allowing merges by other columns is pretty complex. Users who want to do that should use qjanno or the janno R package
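
As an illustration of points 3 and 4, the following Python sketch shows how the column selection and the overwrite switch could interact for a single pair of matched rows. It is a hedged sketch, not trident code: `resolve_fill_columns`, `fill_row` and `is_missing` are hypothetical helpers, `n/a` as the missing marker is an assumption, and raising an error for requested columns that do not overlap is just one possible design.

```python
MISSING = {"", "n/a"}  # assumption: values treated as "missing"


def is_missing(value):
    return value is None or value.strip() in MISSING


def resolve_fill_columns(target_columns, source_columns,
                         fill_columns=None, key="Poseidon_ID"):
    """Return the columns to merge, given the --fillColumns string."""
    overlapping = [c for c in target_columns
                   if c in source_columns and c != key]
    if fill_columns is None:  # default: all overlapping columns
        return overlapping
    requested = [c.strip() for c in fill_columns.split(",") if c.strip()]
    unknown = [c for c in requested if c not in overlapping]
    if unknown:  # assumed behaviour: refuse columns missing from either file
        raise ValueError(f"columns not present in both files: {unknown}")
    return requested


def fill_row(target_row, source_row, columns, overwrite=False):
    """Merge the selected columns of one matched row pair."""
    new_row = dict(target_row)
    for column in columns:
        if overwrite:
            # point 4 taken literally: the source column wins completely
            new_row[column] = source_row[column]
        elif is_missing(new_row[column]) and not is_missing(source_row[column]):
            new_row[column] = source_row[column]
    return new_row


target_row = {"Poseidon_ID": "Ind1", "Country": "n/a", "Latitude": "48.2"}
source_row = {"Poseidon_ID": "Ind1", "Country": "Germany", "Latitude": "50.0"}

columns = resolve_fill_columns(target_row.keys(), source_row.keys())
print(fill_row(target_row, source_row, columns))
print(fill_row(target_row, source_row, columns, overwrite=True))
```

With the default settings only the missing Country is filled; with `overwrite=True` the Latitude is replaced as well.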

@stschiff
Member Author

stschiff commented Nov 8, 2023

Very nice! I have some minor comments, but they can be discussed after a first go. I hope to be able to get to it this week.

@stschiff
Member Author

stschiff commented Nov 8, 2023

Just some quick ideas:

  • I would add an option --inplace, which is just a shortcut for setting the output file equal to the targetFile (only one of --outFile or --inplace can be set).
  • I think --overwriteColumns should be "smart" and not overwrite information if the sourceFile has missing information. I think that's clear, I just wanted to make this explicit here once.
  • Interestingly, jannocoalesce can lead to invalid janno files, even if both the source and the target were valid. This is because we might end up with inconsistent column info (for example, if one selected only the contamination column, but not its error bar). I think that is OK and the responsibility of the user, but should we consider running validate automatically on the outFile? Perhaps even before we actually write it out? Not sure...
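
The first two ideas can be sketched in Python as follows. This is only an illustration under assumptions: `pick_value`, `build_parser`, `output_path` and the program name are hypothetical, `n/a` as the missing marker is assumed, and the argparse parser merely mimics the proposed options rather than reproducing trident's actual command-line parser.

```python
import argparse

MISSING = {"", "n/a"}  # assumption: values treated as "missing"


def is_missing(value):
    return value is None or value.strip() in MISSING


def pick_value(target_value, source_value, overwrite=False):
    """'Smart' overwrite: a missing source value never replaces anything."""
    if is_missing(source_value):
        return target_value
    if overwrite or is_missing(target_value):
        return source_value
    return target_value


def build_parser():
    parser = argparse.ArgumentParser(prog="jannocoalesce-sketch")
    parser.add_argument("--targetFile", required=True)
    parser.add_argument("--sourceFile", required=True)
    # exactly one of --outFile / --inplace has to be given
    out = parser.add_mutually_exclusive_group(required=True)
    out.add_argument("--outFile")
    out.add_argument("--inplace", action="store_true")
    return parser


def output_path(args):
    """--inplace is a shortcut for writing back to the target file."""
    return args.targetFile if args.inplace else args.outFile


args = build_parser().parse_args(
    ["--targetFile", "target.janno", "--sourceFile", "source.janno", "--inplace"]
)
print(output_path(args))
print(pick_value("Austria", "n/a", overwrite=True))
```

Passing both --outFile and --inplace to this parser is rejected by argparse, which matches the "only one of the two" rule.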

@nevrome
Member

nevrome commented Nov 8, 2023

These are good observations! Point 1 is a neat idea and point 2 is indeed what I had in mind. For point 3, I think we should not validate the output. There might be workflows where the user does not actually need a valid .janno file in the end. And for pipelines and automation it's probably better to keep the two steps clearly separate for error reporting.

@nevrome
Member

nevrome commented Dec 7, 2023

I'm closing this now, because the discussion has moved to the concrete PR implementing the feature: #282

@nevrome closed this as completed on Dec 7, 2023