pull sheet names as part of text from xlsx files? #14
Comments
I'd be happy to create a pull request for this, but I'm not sure where you would prefer to place such a boolean. If you let me know, I'll put one together.
Hi, thanks for opening this issue! I would definitely like this to be the default behaviour of the library, not sure why I hadn't done this in the first place. A PR that appends the sheet name near the I suppose we could add a boolean option to configure this, in the constructor of the Regards,
I know this could be a breaking change, which was the intent of the boolean. I'm not sure how many of your users depend on the output format staying exactly as it is. Speaking of formats, what format is this? I see the row-by-row conversion to YAML and the row / sheet separators. I know '---' is the YAML document start marker, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say, CSV? I'm not being critical here - I'm just curious what you had in mind and what the use cases are. I want to make sure that whatever I put in aligns with your plans.
I have no idea either, but you're right - it is a breaking change, and to be safe we should hide it behind a boolean flag that is false by default.
I did not follow a format, I made my own 😅 The use of --- to separate rows and === to separate sheets is completely arbitrary. I chose YAML because it maintains a text sense of structure instead of a grid sense of structure, i.e., col-header:value instead of value,value,value. The 'text-based structure' was actually useful to me in the project I wrote this package for, where I was extracting text from files to identify 'topics', primarily based on the position and frequency of words.
That said, I am open to adding more options that configure the format of the output.
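To make the format described in this comment concrete, here is a minimal sketch of the row and sheet layout: each row becomes col-header: value lines, --- sits between rows, and === sits between sheets. The type and function names (Row, formatRow, formatSheet, formatWorkbook) are hypothetical and not the library's actual code, and the exact whitespace and value quoting are assumptions.

```typescript
// Hypothetical sketch of the output layout; not office-text-extractor's real code.
type Row = Record<string, string | number | boolean | null>;

// One row becomes "header: value" lines, one per column.
function formatRow(row: Row): string {
  return Object.keys(row)
    .map((header) => {
      const value = row[header];
      return `${header}: ${value === null ? '' : String(value)}`;
    })
    .join('\n');
}

// Rows within a sheet are separated by a line containing '---'.
function formatSheet(rows: Row[]): string {
  return rows.map(formatRow).join('\n---\n');
}

// Sheets within a workbook are separated by a line containing '==='.
function formatWorkbook(sheets: Row[][]): string {
  return sheets.map(formatSheet).join('\n===\n');
}
```

Note that nothing in this layout records which sheet a block of rows came from, which is the gap the issue is about.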
Description
When pulling text from a spreadsheet, the current extractor does not return the sheet names in the text. It would be GREAT if there was an option to preface each sheet's text with the sheet name.
Why
Often, important contextual information is included in sheet names.
It would be easy to implement - in the office-text-extractor code, you are pulling them already, as the sheet data is accessed via the sheet name. Adding a simple boolean flag on whether or not to output the sheet names into the === separator denoting new sheet text could be a solution? It could be set to false by default for backward compatibility.
Alternatives
I mean, I love the all-in-one nature of office-text-extractor, but I could process the files myself instead.
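For reference, the boolean flag proposed under "Why" could look roughly like the sketch below. The option name includeSheetNames, the SheetText shape, and the joinSheets helper are all hypothetical; where the option would actually live in the library is exactly the open question in this issue. The flag defaults to false so existing output stays unchanged.

```typescript
// Hypothetical option; the library's real constructor options may differ.
interface SheetTextOptions {
  includeSheetNames?: boolean; // assumed default: false, for backward compatibility
}

// Hypothetical shape: a sheet's name plus its already-extracted text.
interface SheetText {
  name: string;
  text: string;
}

function joinSheets(sheets: SheetText[], options: SheetTextOptions = {}): string {
  const includeNames = options.includeSheetNames === true;
  return sheets
    .map((sheet, index) => {
      if (includeNames) {
        // Flag on: every sheet gets a '=== <name>' header line.
        return `=== ${sheet.name}\n${sheet.text}`;
      }
      // Flag off: a bare '===' line appears only between sheets.
      return index === 0 ? sheet.text : `===\n${sheet.text}`;
    })
    .join('\n');
}
```

One design choice to settle in a PR: with the flag on, this sketch also prefixes the first sheet with a header line, so every sheet name appears in the text.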