Users should be able to compare pipeline runs. The initial focus should be on comparing two runs; future improvements could include comparing multiple runs.
Ideally, the user could select two pipeline runs and a given component, then compare the columns side by side. This comparison would require merging the two selected columns based on ID:
- For components with a one-to-one transformation, this should be doable.
- For components that change or extend the index, it might make less sense to provide this comparison:
  - Chunking example: the new chunk IDs depend on the chunking overlap, so two runs can have a mismatch in the number of rows.
  - For the LAION dataset, the retrieved IDs are similar, so a comparison does not provide an added advantage.
Proposed approach: we can do an inner join between the datasets from both runs. To mitigate OOM issues, we can limit the number of partitions loaded from each dataset. We have no guarantee that the indexes will be aligned across partitions (e.g. a partition from the first run might hold different IDs than the corresponding partition from the second run, due to various factors). This might not be an issue if we steer the focus towards comparing a fraction of the examples, since comparing all the samples is not feasible for large datasets. For better performance, we can suggest running the explorer on a more powerful machine.
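A minimal sketch of the inner join on a limited number of partitions, assuming each partition is available as a pandas DataFrame indexed by ID. The function name `compare_runs`, the `max_partitions` parameter, and the sample data are hypothetical and only illustrate the idea; the actual explorer may load data differently (e.g. through Dask).

```python
import pandas as pd


def compare_runs(partitions_a, partitions_b, column, max_partitions=2):
    """Inner-join one column from two runs on the shared index (ID)."""
    # Load only the first few partitions of each run to bound memory use.
    left = pd.concat(partitions_a[:max_partitions])[[column]]
    right = pd.concat(partitions_b[:max_partitions])[[column]]
    # The inner join keeps only the IDs present in both loaded samples.
    return left.join(right, how="inner", lsuffix="_run_a", rsuffix="_run_b")


# Hypothetical data: two partitions per run, indexed by ID.
run_a = [
    pd.DataFrame({"text": ["a1", "a2"]}, index=pd.Index(["id1", "id2"], name="id")),
    pd.DataFrame({"text": ["a3"]}, index=pd.Index(["id3"], name="id")),
]
run_b = [
    pd.DataFrame({"text": ["b2", "b3"]}, index=pd.Index(["id2", "id3"], name="id")),
    pd.DataFrame({"text": ["b4"]}, index=pd.Index(["id4"], name="id")),
]

side_by_side = compare_runs(run_a, run_b, column="text")
print(side_by_side)
```

Because the IDs are not guaranteed to line up across partitions, loading fewer partitions can shrink the overlap: with `max_partitions=1` in this toy data, only one ID is shared between the two samples.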
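Alternatively, we can offer the possibility to search for a specific ID across all partitions and compare the results, in case the user wants to investigate a specific example. This can be achieved with a label-based lookup on the index (loc rather than iloc, since iloc selects rows by integer position).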
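The per-ID lookup could be sketched as follows, again assuming pandas DataFrames indexed by ID as partitions. The helper `find_by_id` and the sample data are hypothetical. Note that the sketch uses `loc`, which selects by index label, whereas `iloc` selects by integer position and would not find a row by its ID.

```python
import pandas as pd


def find_by_id(partitions, row_id):
    """Return the row labelled row_id from the first partition containing it, else None."""
    for partition in partitions:
        if row_id in partition.index:
            # loc selects by index label; iloc would select by integer position.
            return partition.loc[row_id]
    return None


# Hypothetical data: the same ID lives in different partitions in each run.
run_a = [
    pd.DataFrame({"text": ["a1", "a2"]}, index=pd.Index(["id1", "id2"], name="id")),
    pd.DataFrame({"text": ["a3"]}, index=pd.Index(["id3"], name="id")),
]
run_b = [
    pd.DataFrame({"text": ["b3"]}, index=pd.Index(["id3"], name="id")),
    pd.DataFrame({"text": ["b4"]}, index=pd.Index(["id4"], name="id")),
]

row_a = find_by_id(run_a, "id3")
row_b = find_by_id(run_b, "id3")
if row_a is not None and row_b is not None:
    # Place the two matching rows side by side for comparison.
    print(pd.DataFrame({"run_a": row_a, "run_b": row_b}))
```

Scanning partitions one at a time keeps only a single partition in memory, at the cost of reading every partition in the worst case.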