24 common csv parser #35

sgoldenCS · 2024-04-22T17:17:41Z

This pull request should close #24 and close #33. The implementation of CSVParser is essentially the PandasParser with the file_format field forced to 'csv'.

Personally, I'm not too fond of this solution as it will require a different parser for every file format and will result in a large amount of code duplication. However, I don't see another way to address #24 and #33 since it's apparent the current solution is not acceptable....

Moved the read_functions dictionary inside of the read_data_to_pandas function to avoid creation of a global variable.

Allows for easier future modifications if necessary.

dlersch · 2024-04-23T02:00:27Z

@sgoldenCS , this looks good. thank you. I have two suggestions (that I am happy to discuss):

Rename the utest_pandas_parser_v0.py to: utest_csv_parser.py just so that it is consistent with everything else in the repo and is consistent with the parser name
Remove the pandas_parser_v0.py You implemented a nice utility that does everything that the pandas_parser did. If people need a json_parser for example, they can just use your util and write their own parser, following your example

sgoldenCS · 2024-04-23T16:11:27Z

@dlersch I'm happy to rename the unit test. It's just a name anyway.

As for removing the pandas_parser, I'm hesitant to do that. While the utility I've implemented does the same work, I really question the idea of having multiple essentially duplicate parsers. I know @schr476 believes that it's too complicated, but if the goal of the modules is to reduce unnecessary work, I don't see how writing 150 lines of duplicate code for every file format makes any sense. The code is almost the same size as the full "pandas_parser"! It also fragments the code base and introduces more opportunities for bugs.

If we want a registered version for each file type, that could be solved by properly addressing #32 and using the pandas_parser. If we're unhappy with the "pandas_parser" name, we can agree on a more generic name like "base_parser_to_pandas". Ultimately, no good rationale for making these changes has been offered besides complaints about naming...

dlersch · 2024-04-23T16:21:30Z

@sgoldenCS I do see your points. If you could just rename the unit-test, that would be great and I will accept your pull-request. As for a more generic name for the pandas_parser -> How about dataframe_parser ? The object returned is technically a dataframe . This is just a suggestion. I have no attachments to the name

schr476 · 2024-04-23T16:35:05Z

@dlersch I'm happy to rename the unit test. It's just a name anyway.

As for removing the pandas_parser, I'm hesitant to do that. While the utility I've implemented does the same work, I really question the idea of having multiple essentially duplicate parsers. I know @schr476 believes that it's too complicated, but if the goal of the modules is to reduce unnecessary work, I don't see how writing 150 lines of duplicate code for every file format makes any sense. The code is almost the same size as the full "pandas_parser"! It also fragments the code base and introduces more opportunities for bugs.

If we want a registered version for each file type, that could be solved by properly addressing #32 and using the pandas_parser. If we're unhappy with the "pandas_parser" name, we can agree on a more generic name like "base_parser_to_pandas". Ultimately, no good rationale for making these changes has been offered besides complaints about naming...

I think we agreeed that the goal of the parser is to read a particular data format.
We can make the default output format to be Pandas and converters to another format as needed.

sgoldenCS · 2024-04-23T17:16:50Z

I don't remember if we agreed to that, but I don't think any agreement invalidates my points. We can still have a registered version of the parser for each data format without having to rewrite all of the code each time we need a new file format.

schr476 · 2024-04-23T17:22:10Z

@sgoldenCS , If we conform to one output format (Pandas) then we can write converters to satisfy other needed.
Therefore you write a parser per data type and converters as needed (which I would assume is relatively small)

sgoldenCS · 2024-04-23T17:35:51Z

I'm not sure I follow. I haven't mentioned output data types at all, so why are you talking about output converters?

I feel this needs to be an actual conversation, not just GitHub comments. I don't believe having these disagreements on a (semi-)public platform is fair to me or anyone else who receives notifications for it.

schr476 · 2024-04-23T18:39:04Z

I'm not sure I follow. I haven't mentioned output data types at all, so why are you talking about output converters?

I feel this needs to be an actual conversation, not just GitHub comments. I don't believe having these disagreements on a (semi-)public platform is fair to me or anyone else who receives notifications for it.

Pandas is the output type for the CSVParser.
For simplicity, I propose we use Pandas DFs as a common output format and move forward.

sgoldenCS · 2024-04-24T14:56:00Z

I've removed the registration of the PandasParser and updated the name of the code file. I also removed all references to Pandas in the file that weren't required.

I will be preparing a small presentation to explain how I believe the "parser_to_dataframe.py" implementation will allow for much faster development while still maintaining good versioning practices.

@schr476 and I had a discussion this morning and I like the idea of defining a Parser Plugin to formalize the process of creating read_functions for the parser_to_dataframe.py module. If we agree on this approach, we can make an issue/branch and implement it.

The registry now uses the "parser_to_dataframe" as a base class with file_format given as a registry kwarg. The unit test has been updated to fix naming issues and output logging info to stdout to verify modules are working appropriately with the registry.

sgoldenCS · 2024-04-26T18:43:17Z

After my conversation with @schr476 yesterday, I have updated a few things.

The unit test has had naming issues fixed
The unit test now prints all logging messages (debug and above) to stdout
The parser_to_dataframe now accepts a "registry_config"
Documentation has been added in the comments of parser_to_dataframe to explain the configuration priorities
- Highest priority is anything in the "config" that should be used by the user
- Next is anything defined in the "registry_config", which should be reserved for the registry to pass kwargs to the module
- Any remaining unspecified arguments are set by a default config defined in the module
The registry now registers the CSVParser using Parser2DataFrame with kwargs={'registry_config': {'file_format': 'csv'}}

The unit test for the CSVParser still works correctly and completes all tests, so none of these changes should break anything.

A few things to note:

The csv_parser_v0.py code has not been deleted but is no longer registered. Since the new registered CSVParser implements all of the same functionality, we could delete it if people agree.
The parser_to_dataframe.py code could use the new utility functions instead of the implementations in the module (they're identical, but using utilities could be nice)
Using the registry_config works for now, but we should agree on an approach (see Registry Kwargs Updating Issue #32)

Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)

The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing.

sgoldenCS · 2024-05-02T13:22:27Z

We can complete this pull request and its associated issues. Any further changes can be made through a future issue/pull if we need to make them, but this code should serve as a good starting point.

I've deleted the csv_parser_v0.py file to avoid confusion as it is not registered and its functionality is fully implemented in Parser2DataFrame. The parser_utilities.py file in the utils folder is not being used, but the functions defined there could still be useful to someone in the future, so I'll leave them there for now.

commit 60441a1 Author: Steven Goldenberg <[email protected]> Date: Thu May 23 11:59:21 2024 -0400 Upload of basic Tensorflow model Still needs a unit test, but the model works well for relatively standard models. Subclassed models may need specialized modules. commit b009362 Author: Steven Goldenberg <[email protected]> Date: Fri May 10 11:46:50 2024 -0400 Squashed commit of the following: commit 7ff070a Author: Steven Goldenberg <[email protected]> Date: Thu May 9 17:07:04 2024 -0400 Format pandas_standard_scaler using black Unittests still work and changes seem to mostly be single quotes to double and removing spaces. commit dac7a31 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 16:17:17 2024 -0400 Make unittest for config IO commit 447a444 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 15:35:09 2024 -0400 Fix more documentation in pandas_standard_scaler commit 1a46420 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 14:35:00 2024 -0400 Update pandas_standard_scaler.py Adding lots of additional documentation. Some functions aren't fully documented, but this is a very good start. commit de25bd6 Author: Steven Goldenberg <[email protected]> Date: Thu May 9 13:21:29 2024 -0400 Rename IO functions for yaml configurations [save/load]_config --> [save/load]_yaml_config to avoid confusion if we want another version for JSON or something else. commit a001284 Author: Steven Goldenberg <[email protected]> Date: Wed May 8 14:52:09 2024 -0400 Simplify save/load config Saving and loading configurations are now done by utility functions in the utils folder. This simplifies the modules and allows for unit testing on the config I/O functions outside of any module. Utility functions try to properly handle FileNotFound and FileExists errors. commit 83f019f Merge: a3f758f c63e630 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:25:19 2024 -0400 Merge branch 'main' into 36-make-pandasstandardscaler-for-hugs commit a3f758f Author: Steven Goldenberg <[email protected]> Date: Mon Apr 29 15:12:56 2024 -0400 Fix reverse() and add unit test commit 79566d2 Author: Steven Goldenberg <[email protected]> Date: Mon Apr 29 14:53:35 2024 -0400 Full implementation of scaler with unittests The implementation avoids using scikit-learn entirely for a number of reasons including no option for axis changes and no option for changing the epsilon value when dealing with small variances. commit ca21fcf Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 17:35:18 2024 -0400 Started implementation of PandasStandardScaler Using scikit-learn's StandardScaler as a base. Saving and loading is a bit tricky as the internal state of the scikit-learn implementation isn't easily saved. It looks like you can save it's internal __dict__ and then set the attributes on load, but I'm not sure if this is robust... commit 574d0ac Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 16:17:59 2024 -0400 Fix tab bug in pandas_standard_scaler.py commit 676d640 Author: Steven Goldenberg <[email protected]> Date: Fri Apr 26 16:07:31 2024 -0400 Create pandas_standard_scaler.py Inital upload of a standard scaler that supports pandas dataframes as input. Maybe it should be renamed to DataFrame scaler to match the parser_to_dataframe.py file... commit c63e630 Merge: eea09a5 ff117c0 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:22:45 2024 -0400 Merge pull request #35 from JeffersonLab/24-common-csv-parser 24 common csv parser commit ff117c0 Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:18:31 2024 -0400 Delete csv_parser_v0.py The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing. commit e11864f Author: Steven Goldenberg <[email protected]> Date: Thu May 2 09:09:36 2024 -0400 Save_Load Unit Tests Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)

sgoldenCS added 4 commits April 18, 2024 13:18

Upload of CSV parser

cfaddfe

Update parser_utilities.py

49e4a7a

Moved the read_functions dictionary inside of the read_data_to_pandas function to avoid creation of a global variable.

Change parser_id to global

927192d

Allows for easier future modifications if necessary.

Fix typo

9cc304b

sgoldenCS requested review from dlersch and Kishanrajput April 22, 2024 17:17

sgoldenCS assigned dlersch, sgoldenCS and Kishanrajput Apr 22, 2024

sgoldenCS linked an issue Apr 22, 2024 that may be closed by this pull request

Common CSV parser #24

Closed

1 task

sgoldenCS requested a review from schr476 April 22, 2024 17:17

Rename utest_pandas_parser_v0.py to utest_csv_parser.py

06da065

sgoldenCS and others added 2 commits April 24, 2024 10:44

Removed Registration of PandasParser

3b114af

Update and rename pandas_parser_v0.py to parser_to_dataframe.py

652e506

sgoldenCS added 2 commits May 2, 2024 09:09

Save_Load Unit Tests

e11864f

Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)

Delete csv_parser_v0.py

ff117c0

The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing.

sgoldenCS merged commit c63e630 into main May 2, 2024

sgoldenCS deleted the 24-common-csv-parser branch November 4, 2024 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

24 common csv parser #35

24 common csv parser #35

sgoldenCS commented Apr 22, 2024

dlersch commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

dlersch commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 24, 2024

sgoldenCS commented Apr 26, 2024

sgoldenCS commented May 2, 2024

24 common csv parser #35

24 common csv parser #35

Conversation

sgoldenCS commented Apr 22, 2024

dlersch commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

dlersch commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 23, 2024

schr476 commented Apr 23, 2024

sgoldenCS commented Apr 24, 2024

sgoldenCS commented Apr 26, 2024

sgoldenCS commented May 2, 2024