Enable comparing different pipeline runs #634

PhilippeMoussalli · 2023-11-15T10:05:49Z

Users should be able to compare pipeline runs. The initial focus should be on comparing two runs; comparing more than two runs could be a future improvement.

Ideally, the user could select two pipeline runs and a given component, and compare the columns side by side. This comparison would require merging the two selected columns on their ID:

- For components with a one-to-one transformation, this should be doable.
- For components that change or extend the index, the comparison might make less sense:
  - Chunking example: the new chunk IDs depend on the chunking overlap, so two runs can end up with a different number of rows.
  - For the laion dataset, the retrieved IDs are similar, so the comparison does not provide an added advantage.

Proposed approach: we can do an inner join between the datasets from both runs. To mitigate OOM issues, we can limit the number of partitions loaded from both datasets. We don't have a guarantee that the indexes will be aligned across partitions (e.g. partition 1 from the first run might have different IDs than partition 2 due to various factors). This might not be an issue if we steer the focus towards comparing a fraction of the examples, since it's not feasible to compare all the samples for large datasets anyway. For better performance, we can suggest running the explorer on a more performant machine.
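
A minimal sketch of the proposed join, not the explorer's actual implementation. It assumes each partition is available as a pandas DataFrame indexed by the sample ID; `compare_runs`, `max_partitions`, and the toy data are hypothetical names introduced for illustration.

```python
import pandas as pd


def compare_runs(partitions_a, partitions_b, column, max_partitions=2):
    """Inner-join one column from two runs on the shared index (the ID)."""
    # Load only the first few partitions of each run to bound memory use.
    left = pd.concat(partitions_a[:max_partitions])[[column]]
    right = pd.concat(partitions_b[:max_partitions])[[column]]
    # The inner join keeps only the IDs present in both loaded samples.
    return left.join(right, how="inner", lsuffix="_run_a", rsuffix="_run_b")


# Toy data (assumption): two partitions per run, indexed by "id".
run_a = [
    pd.DataFrame({"text": ["a1", "a2"]}, index=pd.Index(["id1", "id2"], name="id")),
    pd.DataFrame({"text": ["a3"]}, index=pd.Index(["id3"], name="id")),
]
run_b = [
    pd.DataFrame({"text": ["b2", "b3"]}, index=pd.Index(["id2", "id3"], name="id")),
    pd.DataFrame({"text": ["b4"]}, index=pd.Index(["id4"], name="id")),
]

comparison = compare_runs(run_a, run_b, column="text")
print(comparison)
```

Because the IDs are not guaranteed to be aligned across partitions, lowering `max_partitions` can shrink the overlap: with `max_partitions=1`, only `id2` would survive the join in this toy data.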

Alternatively, we can offer the possibility to search for a specific ID across all partitions and compare the results, in case the user wants to investigate a specific example. Since the lookup is by ID rather than by row position, this can be achieved with a label-based lookup such as loc.
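
The ID search could look roughly like the sketch below. It again assumes pandas DataFrames per partition with unique IDs as the index, and uses a label-based `loc` lookup (a positional `iloc` lookup would require knowing the row position up front). `find_by_id` and `compare_sample` are hypothetical helpers, not existing explorer functions.

```python
import pandas as pd


def find_by_id(partitions, sample_id):
    """Scan the partitions and return the row with the given ID, or None."""
    for partition in partitions:
        if sample_id in partition.index:
            # Label-based lookup; returns a Series when the ID is unique.
            return partition.loc[sample_id]
    return None


def compare_sample(partitions_a, partitions_b, sample_id):
    """Place the same sample from both runs side by side, one row per column."""
    row_a = find_by_id(partitions_a, sample_id)
    row_b = find_by_id(partitions_b, sample_id)
    if row_a is None or row_b is None:
        return None  # The ID is missing from at least one run.
    return pd.DataFrame({"run_a": row_a, "run_b": row_b})


# Toy data (assumption): two partitions per run, indexed by "id".
run_a = [
    pd.DataFrame({"text": ["a1", "a2"]}, index=pd.Index(["id1", "id2"], name="id")),
    pd.DataFrame({"text": ["a3"]}, index=pd.Index(["id3"], name="id")),
]
run_b = [
    pd.DataFrame({"text": ["b2", "b3"]}, index=pd.Index(["id2", "id3"], name="id")),
    pd.DataFrame({"text": ["b4"]}, index=pd.Index(["id4"], name="id")),
]

result = compare_sample(run_a, run_b, "id3")
print(result)
```

Unlike the partition-limited join, this scans every partition, so it trades speed for the guarantee that a given ID is found whenever it exists in both runs.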

PhilippeMoussalli mentioned this issue Nov 15, 2023

Extend data explorer for document-based data #568

Closed

github-project-automation bot added this to Fondant development Nov 15, 2023

github-project-automation bot moved this to Backlog in Fondant development Nov 15, 2023

janvanlooyml6 moved this from Backlog to In Progress in Fondant development Nov 24, 2023

RobbeSneyders moved this from In Progress to On hold in Fondant development Nov 30, 2023

RobbeSneyders added the Data explorer label Dec 18, 2023
