Data Parsers
Currently, all parsers must implement the following functions:
- `get_info()`: Prints general information about the module and its use. The preferred implementation uses `print(inspect.getdoc(self))` to print the class docstring.
- `save_config(path: str)` and `load_config(path: str)`: Saves/loads all information needed to rebuild a fresh version of this module. When calling `save(path)`, a config.yaml file will be created at that path.
- `save(path: str)` and `load(path: str)`: Saves/loads the full module so you can "pick up where you left off" from a previous run. Normally, these functions will need to save/load the configuration and any internal state. Parsers generally do not have an internal state (no trainable parameters), so these functions can be implemented through a call to `save_config()` or `load_config()`. Unlike `save_config()`, `save()` should require a fresh directory to be made by default.
- `load_data()`: Returns data based on the current configuration of the module. Instead of passing `filepaths` as an argument, this information is stored in the configuration. For now, Pandas DataFrames are the only supported output type.
- `save_data()`: This function may not need to be implemented and could possibly be removed from the core abstract class (see #16). Currently, the module is not responsible for maintaining an internal state that includes data to save.
- `name` (a `@property`): Defines a generic name for the module that could be used in the future for loading modules by name or automatic registration.
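As a rough sketch of how these pieces fit together, the skeleton below implements the interface above. The class name `ExampleParser`, the YAML handling, and the CSV-only `load_data()` body are assumptions for illustration; only the method names come from the list above.

```python
import inspect
import os
import warnings

import pandas as pd
import yaml


class ExampleParser:
    """Hypothetical parser skeleton implementing the required interface."""

    def __init__(self, config=None):
        self.config = config or {'filepaths': []}

    def get_info(self):
        # Preferred implementation: print the class docstring.
        print(inspect.getdoc(self))

    def save_config(self, path: str):
        # Persist everything needed to rebuild a fresh module as config.yaml.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, 'config.yaml'), 'w') as f:
            yaml.safe_dump(self.config, f)

    def load_config(self, path: str):
        with open(os.path.join(path, 'config.yaml')) as f:
            self.config = yaml.safe_load(f)

    def save(self, path: str):
        # Parsers carry no internal state, so saving the config is enough.
        # Unlike save_config(), require a fresh directory by default.
        os.makedirs(path, exist_ok=False)
        self.save_config(path)

    def load(self, path: str):
        self.load_config(path)

    def load_data(self) -> pd.DataFrame:
        # Filepaths come from the configuration rather than an argument.
        paths = self.config['filepaths']
        if isinstance(paths, str):
            paths = [paths]
        if not paths:
            warnings.warn('No filepaths configured; returning an empty DataFrame.')
            return pd.DataFrame()
        frames = [pd.read_csv(p, **self.config.get('read_kwargs', {}))
                  for p in paths]
        return pd.concat(frames, **self.config.get('concat_kwargs', {}))

    @property
    def name(self) -> str:
        return 'ExampleParser_v0'
```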
A configuration for data parsers should include:
- `filepaths: str | list[str]`: Paths to files the module should parse. Defaults to `[]`, which produces a warning when `load_data()` is called.
- `file_format: str = 'csv'`: Format of files to parse. Currently supports csv, feather, json, and pickle. Defaults to csv. Alternatively, use a registered version that sets this parameter.
- `read_kwargs: dict = {}`: Arguments to be passed to the read function.
- `concat_kwargs: dict = {}`: Arguments to be passed to the concatenate function.
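Putting the four fields together, a fully specified configuration might look like the sketch below; the file paths and keyword values are placeholders, not library defaults.

```python
# Illustrative configuration covering all four fields (values are placeholders).
config = dict(
    filepaths=['./data/part1.csv', './data/part2.csv'],
    file_format='csv',
    read_kwargs=dict(index_col='Date', parse_dates=True),
    concat_kwargs=dict(ignore_index=False),
)
```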
Currently Registered Parsers:
- CSVParser_v0
Parsers To Be Registered Soon:
- FeatherParser_v0
- JSONParser_v0
- PickleParser_v0
Parsers To Be Developed:
- NumpyParser_v0
CSVParser_v0 is a registered version of Parser2DataFrame with `file_format='csv'` preconfigured. The backend function used for reading files is Pandas' `pd.read_csv()`. The configuration argument `read_kwargs` accepts any keyword argument that `pd.read_csv()` does.
Basic example where the config is defined/loaded in the driver:

```python
# Example configuration.
config = dict(
    filepaths='./path_to_file.csv',
    read_kwargs=dict(
        usecols=['Date', 'column1', 'column3'],
        index_col='Date',
        parse_dates=True,
    ),
)

parser = make('CSVParser_v0', config=config)
data = parser.load_data()
parser.save('./path_to_save_module')
```
Load a parser from a saved configuration:
```python
# Make a default parser.
parser = make('CSVParser_v0')

# Load a saved parser from file.
# Assumes './example_parser/config.yaml' exists...
parser.load('./example_parser')
data = parser.load_data()
```
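When several files are configured, `concat_kwargs` is forwarded to the concatenation step. A sketch under that assumption, with placeholder paths:

```python
# Parse and combine multiple CSV files in one call.
config = dict(
    filepaths=['./data/2022.csv', './data/2023.csv'],
    read_kwargs=dict(index_col='Date', parse_dates=True),
    concat_kwargs=dict(verify_integrity=True),  # e.g. guard against duplicate index values
)
parser = make('CSVParser_v0', config=config)
data = parser.load_data()
```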