-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow to select a subset of columns #189
Conversation
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
closes #172 Signed-off-by: Luka Peschke <[email protected]>
c11fc28
to
c5e6f68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As always great tests and comments.
|
||
for sheet_name_or_idx in [0, "January"]: | ||
for col in ["Month", "Year"]: | ||
sheet = excel_reader_single_sheet.load_sheet(0, use_columns=[col]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_sheet(sheet_name_or_idx)
?
Same below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops good catch 8ff4192
def test_single_sheet_invalid_column_indices_column_does_not_exist_int( | ||
excel_reader_single_sheet_with_unnamed_columns: fastexcel.ExcelReader, | ||
) -> None: | ||
expected_message = """column at index 42 not found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to add to ExcelSheet
the range of indices and list of columns. Currently we only have the width
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
045a114 . I wonder wether the property should be called use_columns
to match the parameter though 🤔
for (idx, name) in column_names.iter().enumerate() { | ||
// If we have an index for the given column, extract it and add it to the schema. Otherwise, | ||
// just ignore it | ||
if let Some(col_idx) = match selected_columns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon the code is not optimal performance-wise here but we probably don't care to build the schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was wondering if I should remove the TryFrom<&ExcelSheet>
implem for RecordBatch
to avoid re-instantiating column_names
as well, but this isn't really a hot path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it troubles me a bit but we can go on with this for now and rethink this in the future.
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
…op in PR previews Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
@PrettyWood I'm looking at the implem for column ranges right now, and I'm thinking about wether column names should be loaded and validated on sheet instatiation. However, this would have an impactt:
Any thoughts about this ? |
Yeah this is what I wanted originally when I said "Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width" For your 4th point I have no idea to be honest. Clearly lifetimes are a mess with pyo3 (cf the discussion we had on bytes) |
scriptimport argparse
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
use_columns = args.column or None
for sheet_name in excel_file.sheet_names:
excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
if __name__ == "__main__":
main() beforeinvalid columns
valid
afterinvalid
valid
So the perf impact seems neglectible to me. If the code LGTY, I'll expose the columns though an |
Signed-off-by: Luka Peschke <[email protected]>
ab49e55
to
1edc77a
Compare
Perfect 🎉 |
Great, I'll do it later this afternoon then. Anything else I need to change ? |
Nope I'll test the final PR tonight and will probably merge. Great work |
Signed-off-by: Luka Peschke <[email protected]>
closes #172 . Depends on #186