Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integration tests using example files from apache/orc #65

Merged
merged 3 commits into from
Mar 13, 2024

Conversation

progval
Copy link
Contributor

@progval progval commented Mar 10, 2024

Should I use apache/orc as a git submodule instead of copying data files here? I figured it's not worth the overhead of submodules, considering they only weigh 24MB.

Some tests are failing, I tried to annotate the reason when I understood. Some (commented with // Why?) look like actual bugs.

Resolves #27

Copy link

codecov bot commented Mar 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.52%. Comparing base (424b021) to head (f6f615b).
Report is 45 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #65      +/-   ##
==========================================
+ Coverage   77.22%   80.52%   +3.29%     
==========================================
  Files          34       30       -4     
  Lines        3302     3106     -196     
==========================================
- Hits         2550     2501      -49     
+ Misses        752      605     -147     

Copy link
Collaborator

@WenyXu WenyXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@WenyXu WenyXu requested a review from Jefffrey March 10, 2024 15:02
@Jefffrey
Copy link
Collaborator

Thanks for this, I'll take a look soon 👍

Copy link
Collaborator

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a README to the tests/integration/data directory just to indicate where the data came from? For easier viewing from the repo (instead of needing to track down the PR)

Comment on lines +65 to +68
// pretty_assertions consumes too much RAM and CPU on large diffs,
// and it's unreadable anyway
assert_eq!(lines[0..1000], expected_lines[0..1000]);
assert!(lines == expected_lines);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can parse the expected JSON into Arrow first with arrow_json then compare on RecordBatches

Assuming the schema inference works in our favour 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would avoid the issue of different decimal representation.

However, it makes the tests a little unreliable, as they wouldn't detect data lost both by arrow_json and datafusion_orc. Probably not a big deal

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point. I guess we're a bit handicapped by the expected data being in JSON form which can make it harder for us to be rigorous. I'll raise an issue for exploring ways around this 👍

Comment on lines 83 to 93
#[ignore] // Why?
fn metaData() {
test_expected_file("TestOrcFile.metaData");
}
#[test]
#[ignore] // Why?
fn test1() {
test_expected_file("TestOrcFile.test1");
}
#[test]
#[should_panic] // Incorrect timezone + representation differs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for leaving these annotations, gives us a goal to work towards

For all tests that fail we can just #[ignore] them and leave TODO comments to address these

@Jefffrey Jefffrey merged commit 8e254a6 into datafusion-contrib:main Mar 13, 2024
9 checks passed
@Jefffrey
Copy link
Collaborator

Thanks for this @progval ❤️

waynexia pushed a commit that referenced this pull request Oct 24, 2024
* Add integration tests using example files from apache/orc

* s/should_panic/ignore/ and add TODO

* Add README to data files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integration tests that using example files from apache/orc
3 participants