Commit

nsheff committed Aug 9, 2024
1 parent fece5ad commit deca138
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/seqcol_rationale.md
@@ -47,7 +47,7 @@ The split into two parts thus provides some important modularity, but it doesn't

The second strategy is the comparison function. In short, the goal here is to move away from comparing collections by simply checking if their digests are identical; instead, we want a more powerful comparison that can answer multiple questions using a single digest.

-In more detail: each of the scenarios described above can be viewed as constructing a digest to make it really easy to ask a particular comparison question. For example, in the first use case, the question is "do these two collections have exactly the same sequence content, regardless or order?" In the second example, the question is more general: "do these two collections have exactly the same sequence content and sequence names, in the same order?". The third question is "Do these collections have the same coordinate system, in the same order?". To answer any of these questions, if you had the bespoke digest, you'd simply see if two digests are identical. If they are, the comparison question is satisfied. This is convenient *if your question happens to be the one used to construct the digest*. But the problem is that the final digest representing a sequence collection can only answer *one* such question. A simple string check approach simply offers only a single comparison: are these two things identical, or not? Therefore, to accommodate our complex use cases, where we have multiple things we want to compare in different scenarios, we need something more than just comparing digests. We need a single digest to be able to answer *all* of the above questions, and more.
+In more detail: each of the scenarios described above can be viewed as constructing a digest to make it really easy to ask a particular comparison question. For example, in the first use case, the question is "do these two collections have exactly the same sequence content, regardless of order?" In the second example, the question is more general: "do these two collections have exactly the same sequence content and sequence names, in the same order?". The third question is "Do these collections have the same coordinate system, in the same order?". To answer any of these questions, if you had the bespoke digest, you'd simply see if two digests are identical. If they are, the comparison question is satisfied. This is convenient *if your question happens to be the one used to construct the digest*. But the problem is that the final digest representing a sequence collection can only answer *one* such question. A simple string check approach simply offers only a single comparison: are these two things identical, or not? Therefore, to accommodate our complex use cases, where we have multiple things we want to compare in different scenarios, we need something more than just comparing digests. We need a single digest to be able to answer *all* of the above questions, and more.

Here's our alternative: instead of a digest-matching query, we design a *comparison function*. The comparison function goes beyond simply comparing digest strings; it provides a comparison of every attribute in the collection, including how many elements match, whether they are in the same order, and more. The output lets you answer all the questions posed above, plus more. You can tell if two collections have the same sequences, whether they are in the same order, whether their names match, whether the sequences differ but the lengths match, etc. It also allows you to determine more complex comparisons: is one sequence collection a subset of another? Do they have at least some sequences or names in common? Are their coordinate systems compatible, even if the sequence content differs? All of these questions are immediately answerable from the result of the comparison function.
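
To make the idea concrete, here is a minimal Python sketch of what such a comparison function might compute. It is an illustration only, not the specification's algorithm or output format: the function name `compare_collections`, the report keys (`a_only`, `b_only`, `a_and_b`, `a_count`, `b_count`, `overlap`, `same_order`), and the toy collections are all assumptions made for this example, and each collection is assumed to be a plain dict mapping an attribute name to a list of hashable values.

```python
from collections import Counter


def compare_collections(a, b):
    """Hypothetical sketch: compare two collections attribute by attribute.

    `a` and `b` are assumed to be dicts mapping an attribute name
    (e.g. "names", "lengths", "sequences") to a list of hashable values.
    """
    shared = [attr for attr in a if attr in b]
    report = {
        "attributes": {
            "a_only": [attr for attr in a if attr not in b],
            "b_only": [attr for attr in b if attr not in a],
            "a_and_b": shared,
        },
        "elements": {},
    }
    for attr in shared:
        xs, ys = a[attr], b[attr]
        # Multiset intersection, so repeated values are counted correctly.
        overlap = sum((Counter(xs) & Counter(ys)).values())
        # Order check: restrict each list to the values both sides share,
        # then see whether the two restricted lists are identical.
        common = set(xs) & set(ys)
        same_order = (
            [x for x in xs if x in common] == [y for y in ys if y in common]
            if common
            else None  # nothing shared, so order is undefined
        )
        report["elements"][attr] = {
            "a_count": len(xs),
            "b_count": len(ys),
            "overlap": overlap,
            "same_order": same_order,
        }
    return report


# Toy data: b holds two of a's three sequences, in a different order.
a = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [100, 200, 300],
    "sequences": ["seq-A", "seq-B", "seq-C"],
}
b = {
    "names": ["chr2", "chr1"],
    "lengths": [200, 100],
    "sequences": ["seq-B", "seq-A"],
}

report = compare_collections(a, b)
seqs = report["elements"]["sequences"]
# One report answers several questions at once, for example:
b_is_subset_of_a = seqs["overlap"] == seqs["b_count"]
identical_content_and_order = (
    seqs["a_count"] == seqs["b_count"] == seqs["overlap"] and seqs["same_order"]
)
```

With a report of this shape, questions such as "is one collection a subset of the other?" or "do they share the same elements in the same order?" reduce to simple checks on the counts and the order flag, rather than requiring a separate bespoke digest for each question.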
