-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
24 common csv parser #35
Conversation
Moved the read_functions dictionary inside of the read_data_to_pandas function to avoid creation of a global variable.
Allows for easier future modifications if necessary.
@sgoldenCS , this looks good. thank you. I have two suggestions (that I am happy to discuss):
|
@dlersch I'm happy to rename the unit test. It's just a name anyway. As for removing the pandas_parser, I'm hesitant to do that. While the utility I've implemented does the same work, I really question the idea of having multiple essentially duplicate parsers. I know @schr476 believes that it's too complicated, but if the goal of the modules is to reduce unnecessary work, I don't see how writing 150 lines of duplicate code for every file format makes any sense. The code is almost the same size as the full "pandas_parser"! It also fragments the code base and introduces more opportunities for bugs. If we want a registered version for each file type, that could be solved by properly addressing #32 and using the pandas_parser. If we're unhappy with the "pandas_parser" name, we can agree on a more generic name like "base_parser_to_pandas". Ultimately, no good rationale for making these changes has been offered besides complaints about naming... |
@sgoldenCS I do see your points. If you could just rename the unit-test, that would be great and I will accept your pull-request. As for a more generic name for the pandas_parser -> How about dataframe_parser ? The object returned is technically a dataframe . This is just a suggestion. I have no attachments to the name |
I think we agreeed that the goal of the parser is to read a particular data format. |
I don't remember if we agreed to that, but I don't think any agreement invalidates my points. We can still have a registered version of the parser for each data format without having to rewrite all of the code each time we need a new file format. |
@sgoldenCS , If we conform to one output format (Pandas) then we can write converters to satisfy other needed. |
I'm not sure I follow. I haven't mentioned output data types at all, so why are you talking about output converters? I feel this needs to be an actual conversation, not just GitHub comments. I don't believe having these disagreements on a (semi-)public platform is fair to me or anyone else who receives notifications for it. |
Pandas is the output type for the CSVParser. |
I've removed the registration of the PandasParser and updated the name of the code file. I also removed all references to Pandas in the file that weren't required. I will be preparing a small presentation to explain how I believe the "parser_to_dataframe.py" implementation will allow for much faster development while still maintaining good versioning practices. @schr476 and I had a discussion this morning and I like the idea of defining a Parser Plugin to formalize the process of creating read_functions for the parser_to_dataframe.py module. If we agree on this approach, we can make an issue/branch and implement it. |
The registry now uses the "parser_to_dataframe" as a base class with file_format given as a registry kwarg. The unit test has been updated to fix naming issues and output logging info to stdout to verify modules are working appropriately with the registry.
After my conversation with @schr476 yesterday, I have updated a few things.
The unit test for the CSVParser still works correctly and completes all tests, so none of these changes should break anything. A few things to note:
|
Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)
The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing.
We can complete this pull request and its associated issues. Any further changes can be made through a future issue/pull if we need to make them, but this code should serve as a good starting point. I've deleted the csv_parser_v0.py file to avoid confusion as it is not registered and its functionality is fully implemented in Parser2DataFrame. The parser_utilities.py file in the utils folder is not being used, but the functions defined there could still be useful to someone in the future, so I'll leave them there for now. |
commit 60441a1 Author: Steven Goldenberg <[email protected]> Date: Thu May 23 11:59:21 2024 -0400 Upload of basic Tensorflow model Still needs a unit test, but the model works well for relatively standard models. Subclassed models may need specialized modules. commit b009362 Author: Steven Goldenberg <[email protected]> Date: Fri May 10 11:46:50 2024 -0400 Squashed commit of the following: commit 7ff070a Author: Steven Goldenberg <[email protected]> Date: Thu May 9 17:07:04 2024 -0400 Format pandas_standard_scaler using black Unittests still work and changes seem to mostly be single quotes to double and removing spaces. commit dac7a31 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 16:17:17 2024 -0400 Make unittest for config IO commit 447a444 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 15:35:09 2024 -0400 Fix more documentation in pandas_standard_scaler commit 1a46420 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 14:35:00 2024 -0400 Update pandas_standard_scaler.py Adding lots of additional documentation. Some functions aren't fully documented, but this is a very good start. commit de25bd6 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 13:21:29 2024 -0400 Rename IO functions for yaml configurations [save/load]_config --> [save/load]_yaml_config to avoid confusion if we want another version for JSON or something else. commit a001284 Author: Steven Goldenberg <[email protected]> Date: Wed May 8 14:52:09 2024 -0400 Simplify save/load config Saving and loading configurations are now done by utility functions in the utils folder. This simplifies the modules and allows for unit testing on the config I/O functions outside of any module. Utility functions try to properly handle FileNotFound and FileExists errors. commit 83f019f Merge: a3f758f c63e630 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:25:19 2024 -0400 Merge branch 'main' into 36-make-pandasstandardscaler-for-hugs commit a3f758f Author: Steven Goldenberg <[email protected]> Date: Mon Apr 29 15:12:56 2024 -0400 Fix reverse() and add unit test commit 79566d2 Author: Steven Goldenberg <[email protected]> Date: Mon Apr 29 14:53:35 2024 -0400 Full implementation of scaler with unittests The implementation avoids using scikit-learn entirely for a number of reasons including no option for axis changes and no option for changing the epsilon value when dealing with small variances. commit ca21fcf Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 17:35:18 2024 -0400 Started implementation of PandasStandardScaler Using scikit-learn's StandardScaler as a base. Saving and loading is a bit tricky as the internal state of the scikit-learn implementation isn't easily saved. It looks like you can save it's internal __dict__ and then set the attributes on load, but I'm not sure if this is robust... commit 574d0ac Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 16:17:59 2024 -0400 Fix tab bug in pandas_standard_scaler.py commit 676d640 Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 16:07:31 2024 -0400 Create pandas_standard_scaler.py Inital upload of a standard scaler that supports pandas dataframes as input. Maybe it should be renamed to DataFrame scaler to match the parser_to_dataframe.py file... commit c63e630 Merge: eea09a5 ff117c0 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:22:45 2024 -0400 Merge pull request #35 from JeffersonLab/24-common-csv-parser 24 common csv parser commit ff117c0 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:18:31 2024 -0400 Delete csv_parser_v0.py The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing. commit e11864f Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:09:36 2024 -0400 Save_Load Unit Tests Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)
This pull request should close #24 and close #33. The implementation of CSVParser is essentially the PandasParser with the file_format field forced to 'csv'.
Personally, I'm not too fond of this solution as it will require a different parser for every file format and will result in a large amount of code duplication. However, I don't see another way to address #24 and #33 since it's apparent the current solution is not acceptable....