Enable comparing different pipeline runs #634

Open
Tracked by #568
PhilippeMoussalli opened this issue Nov 15, 2023 · 0 comments

Comments

@PhilippeMoussalli
Contributor

PhilippeMoussalli commented Nov 15, 2023

Users should be able to compare pipeline runs. The initial focus should be on comparing two runs; future improvements could include comparing multiple runs.

Ideally the user could select two pipeline runs and a given component and compare the columns side by side. This comparison would require merging the two selected columns together based on ID:

  • For components with a one-to-one transformation this should be doable
  • For components that change/extend the index it might make less sense to provide this comparison:
    • Chunking example: the new chunk ids depend on the chunking overlap, so two runs can end up with a different number of rows
    • For the LAION dataset the retrieved ids are similar, so a comparison does not provide an added advantage
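The difference between the two cases can be sketched with an ID-based inner join. This is only an illustration: plain dicts mapping id to column value stand in for the datasets, and `compare_column` as well as the sample ids and values are hypothetical names, not part of the project.

```python
def compare_column(run_a, run_b):
    """Inner join on id: keep only ids present in both runs, values side by side."""
    shared_ids = sorted(run_a.keys() & run_b.keys())
    return {id_: (run_a[id_], run_b[id_]) for id_ in shared_ids}


# One-to-one component: both runs keep the same ids, so every row can be compared.
captions_run_1 = {"img_1": "a cat", "img_2": "a dog"}
captions_run_2 = {"img_1": "a small cat", "img_2": "a dog"}
one_to_one = compare_column(captions_run_1, captions_run_2)

# Chunking component: a different overlap produces a different set of chunk ids,
# so the runs differ in row count and only the overlapping ids survive the join.
chunks_run_1 = {"doc_1_0": "first half", "doc_1_1": "second half"}
chunks_run_2 = {"doc_1_0": "first third", "doc_1_1": "middle", "doc_1_2": "last third"}
chunked = compare_column(chunks_run_1, chunks_run_2)

for id_, (left, right) in one_to_one.items():
    print(id_, "|", left, "|", right)
```

In the chunking case the join still returns rows, but chunks that share an id no longer cover the same text, which is why the comparison is less meaningful there.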

Proposed approach: we can do an inner join between the datasets from both runs. To mitigate OOM issues, we can limit the number of partitions loaded from both datasets. We have no guarantee that the indexes will be aligned across partitions (e.g. partition 1 from the first run might have different ids than partition 2 due to various factors), but this might not be an issue if we steer the focus towards comparing a fraction of the examples, since comparing all the samples is not feasible for large datasets. For better performance, we can suggest running the explorer on a more performant machine.
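A rough sketch of the partition-limited join, under the assumption that each run is a sequence of partitions and each partition is a dict mapping id to value. `load_ids`, `compare_runs`, and `max_partitions` are hypothetical names chosen for the example; a real implementation would presumably operate on dataframe partitions instead.

```python
from itertools import islice


def load_ids(partitions, max_partitions):
    """Load at most `max_partitions` partitions to bound memory use."""
    loaded = {}
    for partition in islice(partitions, max_partitions):
        loaded.update(partition)
    return loaded


def compare_runs(partitions_a, partitions_b, max_partitions):
    """Inner join on id over the loaded subset of both runs."""
    a = load_ids(partitions_a, max_partitions)
    b = load_ids(partitions_b, max_partitions)
    return {id_: (a[id_], b[id_]) for id_ in sorted(a.keys() & b.keys())}


# The same ids end up in different partitions in each run.
run_1 = [{"a": 1, "b": 2}, {"c": 3, "d": 4}]
run_2 = [{"a": 10, "c": 30}, {"b": 20, "d": 40}]

sample = compare_runs(run_1, run_2, max_partitions=1)
full = compare_runs(run_1, run_2, max_partitions=2)
print(sample)
print(full)
```

With one partition per run, only the ids that happen to appear in both loaded partitions are compared. That is the misalignment described above, and it is acceptable if the goal is to inspect a fraction of the examples.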

Alternatively, we can also offer the possibility to search for a specific id across all partitions and compare, in case the user wants to investigate a specific example. This can be achieved with label-based indexing such as loc, since the lookup is by id rather than by position.
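A minimal sketch of the per-id lookup, again assuming dict-based partitions. `find_example` and `compare_example` are hypothetical helpers; in a dataframe-backed implementation the membership test would become a lookup on the index.

```python
def find_example(partitions, example_id):
    """Scan partitions one at a time and return the value for `example_id`, if any."""
    for partition in partitions:
        if example_id in partition:
            return partition[example_id]
    return None


def compare_example(partitions_a, partitions_b, example_id):
    """Return the values from both runs for a single id (None where it is missing)."""
    return (
        find_example(partitions_a, example_id),
        find_example(partitions_b, example_id),
    )


run_1 = [{"a": 1, "b": 2}, {"c": 3, "d": 4}]
run_2 = [{"a": 10, "c": 30}, {"b": 20, "d": 40}]

found = compare_example(run_1, run_2, "d")
missing = compare_example(run_1, run_2, "z")
print(found, missing)
```

Because only one partition per run is held at a time, this avoids the memory cost of the full join, at the price of scanning every partition in the worst case.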

Projects
Status: On hold
Development

No branches or pull requests

2 participants