CLI function to export data sheets to JSON #125

Merged 14 commits on Jul 17, 2024.

README.md (3 additions, 2 deletions)

# Command Line Interface (CLI)

The CLI supports the following subcommands:

- `create_flows`: create RapidPro flows (in JSON format) from spreadsheets using a content index
- `flows_to_sheets`: convert RapidPro flows (in JSON format) into spreadsheets
- `convert`: save input spreadsheets as JSON
- `save_data_sheets`: save input spreadsheets as nested JSON using a content index (experimental and likely to change)

Full details of the available options for each can be found via the help feature:

docs/components.md (4 additions, 4 deletions)

This toolkit consists of three components.

`rpft.parsers.common` is RapidPro-agnostic and takes care of reading spreadsheets and converting them into internal data models and other output formats; see [sheets.md](sheets.md).

`rpft.parsers.creation` defines data models for a spreadsheet format for RapidPro flows, and processes spreadsheets into RapidPro flows (and back) using `rpft.parsers.common`.

`rpft.rapidpro` defines internal representations of RapidPro flows, and reads and writes a JSON format that can be imported to and exported from RapidPro. It is partially entangled with `rpft.parsers.creation`, as it needs to be aware of that component's data models to convert RapidPro flows into the spreadsheet format.

The latter two components are [documented](rapidpro.md).

docs/models.md (39 additions, 45 deletions)

# Models

`RowModel`s are subclasses of [pydantic.BaseModel], and may contain basic types, lists and other models as attributes, nested arbitrarily deep. Every `Sheet` can only be parsed in the context of a given `RowModel` (which can, however, be automatically inferred from the sheet headers, if desired).

Technically, there is no `RowModel` class; instead it is called `ParserModel` and is defined in `rpft.parsers.common.rowparser`. `ParserModel` attributes have to be basic types, lists or `ParserModel`s. The only additions to `pydantic.BaseModel` are the optional methods:

- `header_name_to_field_name`
- `field_name_to_header_name`
- `header_name_to_field_name_with_context` (for full row models)

These methods allow remapping column header names to different model attributes, for example:

```python
class SubModel(ParserModel):
    word: str = ""
    number: int = 0


class MyModel(ParserModel):
    numbers: List[int] = []
    sub: SubModel = SubModel()
```

The following table could be parsed into an instance of `MyModel`:

|numbers.1 | numbers.2 | sub.word | sub.number |
|----------|-----------|----------|------------|
| 42 | 16 | hello | 24 |

Each column contains a basic type, in this case, `int`, `int`, `str`, `int`. However, the table could be expressed differently.

|numbers | sub |
|--------|-----------------------|
| 42;16 | word;hello\|number;24 |

The first column has type `List[int]`, the second `SubModel`. How sheets and their column headers correspond to `RowModel`s is specified in the [RapidPro sheet specification].
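
As a rough illustration (a hypothetical sketch, not the toolkit's actual cell parser, and ignoring escaping and edge cases), the compact cells above correspond to the nested model like this:

```python
from typing import List


def parse_numbers(cell: str) -> List[int]:
    # "42;16" -> [42, 16]
    return [int(item) for item in cell.split(";")]


def parse_sub(cell: str) -> SubModel:
    # "word;hello|number;24" -> SubModel(word="hello", number=24)
    fields = dict(part.split(";", 1) for part in cell.split("|"))
    return SubModel(word=fields["word"], number=int(fields["number"]))


row = MyModel(numbers=parse_numbers("42;16"), sub=parse_sub("word;hello|number;24"))
assert row.numbers == [42, 16] and row.sub.word == "hello" and row.sub.number == 24
```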

More examples can be found in the tests:

- `tests.test_rowparser`
- `tests.test_full_rows`
- `tests.test_differentways`

`header_name_to_field_name` and the related `ParserModel` methods can be used to map column headers to fields of a different name, for example:

```python
class MyModel(ParserModel):
    number: int = 0
    first_name_and_surname: str = ""

    def header_name_to_field_name(header):
        if header == "name":
            return "first_name_and_surname"
        return header

    def field_name_to_header_name(field):
        if field == "first_name_and_surname":
            return "name"
        return field
```

Then, the following would be a valid table that can be converted into `MyModel` instances.

| number | name |
|--------|------|
| 42 | John |
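
For illustration (a hypothetical sketch rather than the toolkit's parsing API), that row corresponds to:

```python
# Hypothetical sketch: the row above populates the model like this,
# with the "name" header remapped to the first_name_and_surname field.
row = MyModel(number=42, first_name_and_surname="John")

assert MyModel.header_name_to_field_name("name") == "first_name_and_surname"
assert MyModel.field_name_to_header_name("first_name_and_surname") == "name"
```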

The original motivation for this feature was that the original flow sheet format had a column named 'from', which is a keyword in Python and therefore could not be used as a field name, so it had to be remapped.

There is also a more complex use case where we have a list of conditions, each condition being a model with multiple attributes, such as value, variable and name (when we think of it from an OOP standpoint). However, the original sheet format had columns 'condition', 'condition\_variable', 'condition\_name', etc., each containing a list of the value/variable/name fields respectively, so technically their headers should have been 'condition.\*.value', 'condition.\*.variable' and 'condition.\*.name'. The remapping feature is used to map the short forms to the accurate forms.
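
As a hypothetical illustration of that remapping (the actual mapping lives in `rpft.parsers.creation.flowrowmodel`):

```python
# Hypothetical illustration: short-form column headers and the full forms they stand for.
CONDITION_HEADER_REMAP = {
    "condition": "condition.*.value",
    "condition_variable": "condition.*.variable",
    "condition_name": "condition.*.name",
}
```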

Then there is context-specific remapping, where the remapping takes the content of the row into account. In practice, we remap certain headers based on a row type (encoded in a type column). This is used when different row types have different attributes (so really it is arguable whether they should be in a spreadsheet at all), and, for compactness, we map some of their attributes to the same column header. In particular, each row type has a 'main argument', which may be of a different type for each row type, and these all get mapped to the 'message_text' column header.

The module `rpft.parsers.creation.flowrowmodel` shows all of these use cases.


## Automatic model inference

Models of sheets can now be automatically inferred if no explicit model is provided, see [model inference].

This is done exclusively by parsing the header row of a sheet. Headers can be annotated with basic types and `list`; `dict` and existing models are currently not supported. If no annotation is present, the column is assumed to be a string.

Examples of what the data in a column can represent:
- `field`: no annotation; type assumed to be `str`
- `field:int`: integer
- `field:list`: list
- `field:List[int]`: list of integers
- `field.1`: first entry in a list
- `field.1:int`: first entry in list; integer
- `field.subfield`: subfield; string
- `field.subfield:int`: integer subfield
- `field.1.subfield`: list of objects with string subfield; first item of list

Intermediate models like in the last three examples are created automatically. Field name remapping cannot be done when using automated model inference. `*`-notation is also not currently supported, but could be done in principle.
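
For example, a sketch of the model that inference would produce for a hypothetical header row (the names below are illustrative, not taken from the toolkit):

```python
from typing import List

from pydantic import BaseModel

# Hypothetical header row: title | priority:int | scores.1:int | scores.2:int | author.name
# The inferred model would be roughly equivalent to writing by hand:


class Author(BaseModel):
    name: str = ""


class InferredRowModel(BaseModel):
    title: str = ""            # "title": no annotation, assumed str
    priority: int = 0          # "priority:int": int
    scores: List[int] = []     # "scores.1:int", "scores.2:int": list of int
    author: Author = Author()  # "author.name": intermediate model with a str subfield
```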


[model inference]: /src/rpft/parsers/common/model_inference.py
[pydantic.BaseModel]: https://docs.pydantic.dev/latest/concepts/models/#basic-model-usage
[RapidPro sheet specification]: https://docs.google.com/document/d/1m2yrzZS8kRGihUkPW0YjMkT_Fmz_L7Gl53WjD0AJRV0/edit?usp=sharing

docs/rapidpro.md (11 additions, 17 deletions)

Used for parsing collections of flows (with templating). Flow-specific features are omitted here. We only give the general idea of a content index and its parser.


The class `rpft.parsers.creation.contentindexparser.ContentIndexParser` takes a [SheetReader](sheets.md), looks for one or multiple sheets called `content_index`, and processes them in the order provided. Rows of a content index generally reference other sheets, together with additional meta information. These may themselves be content index sheets, in which case they are parsed recursively (from a parsing order perspective, their rows are parsed right in between the rows above and the rows below of the containing content index).

In essence, for each type of sheet, the content index maintains dictionaries (one per sheet type) mapping sheet names to the actual sheets. When a content index sheet is processed, each row is inspected and the referenced sheet added to the relevant (type-specific) dictionary. If an entry with a given name already exists, it is overwritten. Thus it is possible to have a parent content index containing some data, and a (later) child content index replacing some of that data. There is also an `ignore_row` type indicating that a previously referenced sheet should be deleted from its respective index.

Sheets can also be renamed before being added to the respective dict using the `new_name` column.
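
A minimal sketch of that bookkeeping, assuming hypothetical row type names (this is not the actual `ContentIndexParser` implementation):

```python
from collections import OrderedDict

# One index per sheet type; the row type names here are hypothetical.
sheets_by_type = {
    "data_sheet": OrderedDict(),
    "template_definition": OrderedDict(),
}


def process_row(row_type, sheet_name, sheet, new_name=""):
    name = new_name or sheet_name
    if row_type == "ignore_row":
        # A previously referenced sheet is removed from its index.
        for index in sheets_by_type.values():
            index.pop(name, None)
    else:
        # Later rows (e.g. from a child content index) overwrite earlier entries.
        sheets_by_type[row_type][name] = sheet
```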

The content index sheet format is detailed in the [New features documentation].

There are two sheet types of particular interest.

- `rpft.parsers.creation.contentindexparser.DataSheet`: Similar to a [RowDataSheet](sheets.md), but assumes that the `RowModel` has an `ID` field, and, rather than storing a list of rows, stores an ordered `dict` of rows, indexed by their ID.
- `rpft.parsers.creation.contentindexparser.TemplateSheet`: Wrapper around `tablib.Dataset`, with template arguments.
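
A rough sketch of the distinction (hypothetical class shapes, not the actual definitions):

```python
from collections import OrderedDict

import tablib


class DataSheetSketch:
    def __init__(self, rows_by_id: OrderedDict):
        # RowModel instances keyed by their ID, preserving order.
        self.rows = rows_by_id


class TemplateSheetSketch:
    def __init__(self, table: tablib.Dataset, template_arguments: list):
        # Raw tabular content plus the arguments the template expects.
        self.table = table
        self.template_arguments = template_arguments
```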

Note: It may be worthwhile to unify the data structures used here, to be consistent with `Sheet` and `RowDataSheet` as documented in [sheets](sheets.md). Also see the discussion there of why `DataSheet`s can be exported to nested JSON, while `TemplateSheet`s can only be exported to flat JSON.

`DataSheet`s are often used to instantiate `TemplateSheet`s, and the `ContentIndexParser` has mechanisms for this; see the [New features documentation]. Furthermore, `DataSheet`s can also be concatenated, filtered and sorted via the `operation` column; see [Data sheet operations].



Relevant code: `rpft.parsers.creation.contentindexparser.ContentIndexParser.parse_all_flows`.

Examples:


## FlowParser

See `rpft.parsers.creation.flowparser` and the [RapidPro sheet specification]. This parser turns sheets in the standard format (documentation TBD) into RapidPro flows. See `/src/tests/input` and `/src/tests/output` for some examples.

Examples:


## RapidPro models

See `rpft.rapidpro.models`. These are models for flows, nodes, etc., with convenience functions to assemble RapidPro flows. Each model has a `render` method to render the model into a dictionary that can be exported to a JSON file whose fields are consistent with the format used by RapidPro.
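
For example, a hedged usage sketch (assuming `flow` is an already-assembled flow model; the surrounding export structure is illustrative):

```python
import json

# `flow` is assumed to be an already-assembled flow model; the wrapper
# structure around the rendered dictionary is illustrative, not exhaustive.
flow_dict = flow.render()  # dictionary with fields matching RapidPro's flow format

with open("flows.json", "w") as handle:
    json.dump({"flows": [flow_dict]}, handle, indent=2)
```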


[Data sheet operations]: https://docs.google.com/document/d/1Onx2RhNoWKW9BQvFrgTc5R5hcwDy1OMsLKnNB7YxQH0/edit#heading=h.c93jouk7sqq
[RapidPro sheet specification]: https://docs.google.com/document/d/1m2yrzZS8kRGihUkPW0YjMkT_Fmz_L7Gl53WjD0AJRV0/edit?usp=sharing
[New features documentation]: https://docs.google.com/document/d/1Onx2RhNoWKW9BQvFrgTc5R5hcwDy1OMsLKnNB7YxQH0/edit?usp=sharing