Data Parsers

Steven Goldenberg edited this page May 2, 2024 · 4 revisions

Data Parser Base Class

Required Functions

Currently, all parsers must implement the following functions:

  • get_info()
    • Prints general information about the module and its use. The preferred implementation uses print(inspect.getdoc(self)) to print the class docstring.
  • save_config(path: str) and load_config(path: str)
    • Saves/Loads all information needed to rebuild a fresh version of this module. When calling save_config(path), a config.yaml file will be created at that path.
  • save(path: str) and load(path: str)
    • Saves/Loads the full module so you can "pick up where you left off" from a previous run. Normally, these functions need to save/load the configuration and any internal state. Parsers generally have no internal state (no trainable parameters), so these functions can be implemented through calls to save_config() and load_config(). Unlike save_config(), save() should require a fresh directory by default.
  • load_data()
    • Returns data based on the current configuration of the module. Instead of passing filepaths as an argument, this information is stored by the configuration. For now, Pandas DataFrames are the only supported output type.
  • save_data()
    • This function may not need to be implemented and could possibly be removed from the core abstract class (see #16). Currently, the module is not responsible for maintaining an internal state that includes data to save.
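The required interface above can be sketched as an abstract base class. The class and method bodies below are illustrative only, not the project's actual implementation:

```python
import inspect
import os
from abc import ABC, abstractmethod


class DataParser(ABC):
    """Illustrative sketch of the parser interface; the real base class may differ."""

    def get_info(self):
        # Preferred implementation: print the class docstring.
        print(inspect.getdoc(self))

    @abstractmethod
    def save_config(self, path: str):
        """Write a config.yaml at `path` with everything needed to rebuild the module."""

    @abstractmethod
    def load_config(self, path: str):
        """Restore configuration from a config.yaml at `path`."""

    def save(self, path: str):
        # Parsers hold no state beyond their configuration, so saving the
        # config suffices. Unlike save_config(), save() requires a fresh
        # directory by default.
        os.makedirs(path)  # raises if `path` already exists
        self.save_config(path)

    def load(self, path: str):
        self.load_config(path)

    @abstractmethod
    def load_data(self):
        """Return data (currently a pandas DataFrame) per the current configuration."""
```

A concrete parser then only needs to implement save_config(), load_config(), and load_data(); save() and load() come for free.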

Suggested functions/properties

  • @property name()
    • Defines a generic name for the module that could be used in the future for loading modules by name or automatic registration.

Data Parser Configuration File

A configuration for data parsers should include:

  • filepaths: str | list[str]: Paths to files the module should parse. Defaults to [], which produces a warning when load_data() is called.
  • file_format: str = 'csv': Format of files to parse. Currently supports csv, feather, json, and pickle. Defaults to csv. Alternatively, use a registered version that sets this parameter.
  • read_kwargs: dict = {}: Arguments to be passed to the read function.
  • concat_kwargs: dict = {}: Arguments to be passed to the concatenate function.
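Taken together, these four fields are enough to drive a generic load_data(). A hypothetical standalone version (not the actual Parser2DataFrame code) might look like:

```python
import warnings

import pandas as pd

# Map of supported file_format values to pandas readers (assumed backends).
READERS = {
    'csv': pd.read_csv,
    'feather': pd.read_feather,
    'json': pd.read_json,
    'pickle': pd.read_pickle,
}


def load_data(config: dict) -> pd.DataFrame:
    """Hypothetical standalone load_data() driven by the config fields above."""
    filepaths = config.get('filepaths', [])
    if isinstance(filepaths, str):
        filepaths = [filepaths]
    if not filepaths:
        warnings.warn('No filepaths configured; returning an empty DataFrame.')
        return pd.DataFrame()
    reader = READERS[config.get('file_format', 'csv')]
    frames = [reader(path, **config.get('read_kwargs', {})) for path in filepaths]
    return pd.concat(frames, **config.get('concat_kwargs', {}))
```

Each file is read with read_kwargs forwarded to the pandas reader, and the resulting frames are combined with pd.concat() using concat_kwargs.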

Available Data Parsers

Currently Registered Parsers:

  • CSVParser_v0

Parsers To Be Registered Soon:

  • FeatherParser_v0
  • JSONParser_v0
  • PickleParser_v0

Parsers To Be Developed:

  • NumpyParser_v0

CSV Data Parser

CSVParser_v0 is a registered version of Parser2DataFrame with file_format='csv' preconfigured.

The backend function used for reading files is Pandas' pd.read_csv(). The configuration argument read_kwargs accepts any keyword argument that pd.read_csv() does.

CSVParser_v0 Usage Examples

Basic example where config is defined/loaded in driver:

# Example configuration. 
config = dict(
    filepaths = './path_to_file.csv',
    read_kwargs = dict(
        usecols = ['column1', 'column3'],
        index_col = 'Date',
        parse_dates = True)
)

parser = make('CSVParser_v0', config=config)
data = parser.load_data()
parser.save('./path_to_save_module')

Load a parser from a saved configuration:

# Make default parser
parser = make('CSVParser_v0')

# Load saved parser from file
# Assumes './example_parser/config.yaml' exists...
parser.load('./example_parser') 
data = parser.load_data()