-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Real-world data sets #18
Comments
There may be additional benefits from restricting analyses to specific regions. The chemical characteristics of CDRs and FRs can be quite different. It may be useful to confine an analysis to CDR3, for example, even if full-length sequences are available. |
Thanks for the elaborate summary of our discussion, @williamdlees! I'll incorporate some of our email conversation here along with my proposed solution(s). It appears that much of the issue can be solved by allowing the user to specify custom columns for Having said this, I also agree that
My thought is to perform each of these filters by default, and tell the user that we are doing so, but allow the user to toggle each one off if they wish. The first one might be tricky as IgBlast, e.g., contains Practically speaking, I don't see the integration of IMGT alignments with Does this sound like a reasonable approach? Looking forward to hearing further thoughts or suggestions. |
Personal bias, but I think the easiest approach for IMGT/HighV-QUEST is to just use MakeDb-imgt from Change-O with |
@javh - Thanks for the tip! Would this work with any AIRR-formatted table (csv), e.g. one I supply after renaming all of |
MakeDb is a parser for IMGT/HighV-QUEST, IgBLAST and iMMune-align data, so it'll output an AIRR tsv from the raw data of those alignment tools. We don't have explicit partis support because (a) there is an option for changeo output in partis and (b) adding AIRR support is on the partis todo list (I don't know if this was finished). Ideally, I imagine sumrep just accepting an AIRR tsv file without caring about the source. Trying to harmonize the custom output of different alignment tools is serious chore... |
Ah, I see. Yeah, that's the approach I've been using for retrieving IgBlast annotations. You're correct that |
Hi Branden, Thanks very much for the detailed reply. I wrote an initial response just now, which may have reached you by email, but subsequently deleted it because I overlooked the germline_alignment field. Do you have a spec of which functions use 'sequence', which use 'sequence_alignment' and which use 'germline_alignment'? I'm finding it hard to propose a solution without knowing a little bit more about that. |
Apologies for the multiple messages. I looked at the code and it seems to me that you use 'sequence_alignment' and 'germline_alignment' in functions that explicitly reference the germline, and 'sequence' in those that don't. Is that correct? Are gapped alignments accepted? Is there a particular reason not to use sequence_alignment throughout? |
Hi William, Sorry for the confusion here. It might be helpful to revisit our spreadsheet from a while back which details each of our summary statistics and partitions them into four levels of assumption -- naked sequences, V(D)J-aligned sequences, inferred clonal families, and inferred phylogenetic trees. So, any statistic in the first level (naked sequences) makes use of the
The assumptions under which each summary is operating are included in the documentation, but I could add some more text explaining which columns are expected for each category, or even another table that maps assumptions to column defaults. Perhaps the most thorough thing to do would be include an "Expected columns" in the summary functions table; the main downside I see here is that it will be easy forget to maintain this anytime a default changes, but that shouldn't be an excuse not to do it. It's also fairly easy for the user to enter the command Gapped alignments are not explicitly supported and would likely not be until post-manuscript, though I'm not sure I see the benefit over using I hope this clears up some of the confusion -- looking forward to your response! |
Thanks Branden, that's very helpful. I see that you use the aligned sequences with some shazam functions. The mutability matrix and substitution matrix functions require IMGT-gapped alignments (by the way, the docs also warn explicitly that ambiguous nucleotides must not be present). I am not sure how well those functions would work with some other alignment, but given their nature I'd expect that they would at least require in-frame codon-based alignments. The docs for calcBaseline are not explicit but I suspect it may also be expecting an IMGT alignment. Perhaps there is some marginal utility in being able to run the tool on unannotated raw sequences but I can't see it myself. What use is a distance comparison when the sequences compared could vary considerably in length, with some extending potentially into the UTRs and others into the constant region? Whatever benefit there might be is, to me, greatly outweighed by the corresponding difficulty today in running the length and hotspot analyses on a defined region following annotation. A number of functions calculate only over the junction - GRAVY, Atchley and so on. I think that needs to be drawn out a little more in the documentation. I suggest that the documentation should explicitly state that the aligned sequence and germline must be IMGT-aligned, and that analyses (such as distance calculations) that won't handle a gapped alignment are run on a de-gapped copy of the aligned sequence, rather than the raw sequence. This is a small change that would mainly affect the documentation, but it would ensure that analyses are run on a defined region - the variable region - which is the region that's really of interest for the time being. A small enhancement, which would address the length issue, would be to allow the user to specify a mask, and to run calculations on that mask rather than on the entire aligned sequence. For example the user could say that they wanted calculations run on codons 56-65 of the IMGT-aligned sequences (which happens to correspond to IMGT CDR2). That would also allow (as a further enhancement) those calculations which are currently restricted to the junction to be run on user-defined regions, which, to me, would be a useful extension. Putting the column dependencies into the help as opposed to higher-level documentation feels like a good way to go, particularly if they can be automatically derived from the code in some way so that they are guaranteed to be up to date. But we should also try to reduce the hard-coded dependencies on specific columns as much as possible and I've tried to suggest a way that could be done with relatively small impact on the code base. Let me know what you think. All the best William |
Dear William-- Thank you very much for thinking about these issues. I wanted to clarify that our original intention in taking unaligned sequences was to have some statistics that could not be biased by further processing. For example, aligned sequences can be messed up by having the wrong germline set. However, I think that you are right that utility overrides principles in this case. I personally hadn't thought about length heterogeneity, and there's no real way to deal with that without having alignments. Branden and I chatted and he's going to look into moving into using the aligned sequences here. He may be reaching out to you for help. Again, thanks. |
Thanks Erick, and Branden. I can understand the concern re. bias. Very happy to help in any way I can to come up with a straightforward approach to address all these issues, which I think is achievable. |
Just to chime in briefly here, most things in alakazam/shazam expect IMGT numbered sequences, unless they operate exclusively on the junction region (in which case they require identical length junctions). You can get around the IMGT numbering requirement in The |
I may be wrong about the 5-mer functions needing a multiple alignment... A pairwise alignment between the inferred germline and observed sequence might be sufficient, as long as the non-CDR3 V segment ends at 312. I'll have to check. |
Thanks for the information, @javh ! I'll address the In light of this along with @williamdlees 's suggestions, @matsen and I propose the following changes:
While I understand the appeal of the position masking feature, I'm hesitant to incorporate this approach for a couple of reasons:
As always, I look forward to any comments or feedback. |
These columns would be the The alakazam/shazam functions default to Change-O style column names, but they should all take an argument letting you specify the column name and use the AIRR defined names through that mechanism. If one of them doesn't have an argument for the input fields, then just drop an issue into the appropriate Bitbucket repo and someone will fix it. |
Thanks Branden. It would be worth specifying sequence_alignment closely, I think. For example
I'm not sure the particular answers to these questions matter that much but I do think it's important to document them. As we've discussed, it's important that the sequences in both sets are compared over a span that is covered by the reads in both sets. I think it's reasonable to me that we leave it to the user to ensure this by choosing the region spanned by sequence_alignment and masking it appropriately, but it would be helpful to warn, for example by comparing average read lengths in the two alignments. |
We added a few functions to alakazam to deal with these situations (eg, maskSeqEnds which creates equal end masking to avoid artifacts of low "coverage" positions, padSeqEnds, etc). I definitely think we should be wary of feature bloat in sumrep, but some preprocessing to make sure sequences are comparable seems fair. As for gaps, we treat (For indels, you might instead want to use |
Ok -- after lots of helpful discussion with @matsen, @javh, and @williamdlees, I think we're ready to commit to a specification of
I think treating both
I think gaps make the most sense in this situation. For the purposes of sumrep, positions not covered by the read will not impact any of the summaries, so these positions are only of interest if the user wants to mask sequences to a particular region. Masking is still something sumrep won't do for the user, although I can include a remark/example somewhere on how this can be done.
This is a great idea, although I'll put it on the back-burner for now. (I'll file an issue to make sure I don't forget about it!) Looking forward to any feedback! |
Sounds good to me. The only reason to treat
If this is out of score for sumrep, which I think it is, then I think |
Thanks Branden, your outlined approach sounds good to me. William |
This is a marker for the issues we discussed on the AIRR-SW-WG call today.
Some characteristics of real-world data sets:
Issues:
Possible approaches, or components of an approach:
The text was updated successfully, but these errors were encountered: