Mechanisms to reference parent repositories #125
I am finding it difficult to see how this all fits together. Would you be able to provide a more complete example, i.e. with all the elements above combined?
As an example to give context without any inheritance, you can see the refactored config for parenttext-crisis: https://github.com/IDEMSInternational/parenttext-crisis/blob/pipeline-refactor/config.json

Let's try to build an example showcasing inheritance. Assume we have a grandparent repo living at some URL.
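Its config could define a source of input files along these lines (a minimal sketch; the field names sources and files_list are hypothetical placeholders, not the actual schema):

{
  "sources": {
    "flow_definitions": {
      "files_list": ["<sheet_id_A>", "<sheet_id_B>"]
    }
  }
}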
Next, assume we have a parent repo living at some URL, whose flow data source inherits from the grandparent's. Let it also define a pipeline that produces flows. Then the config could look like this:
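Again as a sketch, with hypothetical field names (parents, parent_sources, steps):

{
  "parents": {
    "grandparent": {
      "location": "git+https://<grandparent_repo_url>"
    }
  },
  "sources": {
    "flow_definitions": {
      "parent_sources": ["grandparent.flow_definitions"],
      "files_list": ["<sheet_id_C>"]
    },
    "safeguarding": {
      "files_dict": {
        "fra": "<sheet_id_D>"
      }
    }
  },
  "steps": [
    {
      "name": "create_flows",
      "sources": ["flow_definitions"]
    }
  ]
}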
Finally, let's define the child repo. Its two data sources inherit from the parent, and its flow data source recursively inherits from the grandparent. Again, the lists of input files are concatenated and used in the pipeline step referencing these sources, but when pulling data, only the child's own data is stored locally. For the second data source, we have a dict rather than a list of files, which is merged with the parent dict.

(Note: This may not be what we want, because in case of a collision of dict keys, the child entry overwrites the parent entry. We might prefer to have each dict value be a list of files, so that in case of a key collision we can concatenate them. The use case for dicts is steps that have specific input files, where each key has a semantic meaning.)
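A sketch of the child config, under the same hypothetical field names as above:

{
  "parents": {
    "parent": {
      "location": "git+https://<parent_repo_url>"
    }
  },
  "sources": {
    "flow_definitions": {
      "parent_sources": ["parent.flow_definitions"],
      "files_list": ["<sheet_id_E>"]
    },
    "safeguarding": {
      "parent_sources": ["parent.safeguarding"],
      "files_dict": {
        "fra": "<sheet_id_F>"
      }
    }
  }
}

When pulling, only <sheet_id_E> and <sheet_id_F> would be stored in the child repo; the parent and grandparent files are fetched from their own repos when the pipeline runs.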
I was also envisioning that sources could be composed of other sources. For example:
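A possible sketch, assuming a source can list other sources that it aggregates:

{
  "sources": {
    "goals_content": {
      "files_list": ["<sheet_id_1>"]
    },
    "onboarding_content": {
      "files_list": ["<sheet_id_2>"]
    },
    "all_content": {
      "sources": ["goals_content", "onboarding_content"]
    }
  },
  "steps": [
    {
      "name": "create_flows",
      "sources": ["all_content"]
    }
  ]
}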
This particular example can also be achieved by the step referencing multiple sources, but there may be reasons to use one or the other depending on the use case.
Just want to flesh out the 'single sources list' concept a bit, so that it might be considered at this stage, even if it is not implemented now. An example:

{
"sources": [
{
"location": "git+https://.../[email protected]",
"labels": ["flow_template", "flow_content"]
},
{
"location": "<sheet_id_2>",
"labels": ["flow_content"],
"annotations": {
"idems.alias": "N_onboarding.json"
}
},
{
"location": "safeguarding.xlsx",
"labels": ["flow_safeguarding"],
"annotations": {
"idems.safeguarding.key": "fra"
}
}
]
}

This doesn't capture all of the various use cases, and there might be some tricky edge cases, but what I want to highlight is:
Steps would query the sources list for the information they need.
The list structure inherently indicates the order of precedence.
To keep things simple, the merge strategy would be set globally for the deployment; the sources list would be fully merged/resolved before steps are allowed to query it.

I'm fairly confident that this format could handle our current needs (maybe with some tweaking) and is open to extension for our future needs. However, I have not contemplated every scenario, so let me know if you see any issues, and let me know what you think in any case.
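For illustration (assuming the global strategy is simply to append child entries after parent entries), a parent sources list

[
  {
    "location": "<parent_sheet_id>",
    "labels": ["flow_content"]
  }
]

and a child sources list

[
  {
    "location": "<child_sheet_id>",
    "labels": ["flow_content"]
  }
]

would be resolved, before any step queries them, into

[
  {
    "location": "<parent_sheet_id>",
    "labels": ["flow_content"]
  },
  {
    "location": "<child_sheet_id>",
    "labels": ["flow_content"]
  }
]

with the later (child) entry taking precedence.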
Thanks for the proposal, I like the streamlined structure with labels and annotations (which also means that a source can be referenced by multiple steps, even if we currently don't need that). There are a few things that are still a bit unclear to me:
One question that informs a lot of the other questions: I assume we still don't store (duplicated) parent content within the child repo when pulling, right? Or do you want a repo to store all parent data as well whenever we do a pull operation? Given the overwrite mechanism mentioned above, which comes for free, it seems like that may be a convenient option as well.
There needs to be a clear distinction between data that needs to be pulled for the purpose of storing it in git, and data that needs to be pulled for the purpose of creating a chatbot. I would expect that data that is in another git repository should only be pulled when creating the chatbot.

The pull-commit process should be able to infer which sources need to be acted on - essentially, sources where 'location' is a Google Sheets ID and 'labels' contains any label starting with 'flow_*' - but more labels or annotations could be added, to be more explicit. A source where the 'location' is a git URL and 'labels' has one or more 'flow_*' labels could be considered a parent repository.

The translation repo is a special case, but I would expect it not to have any 'flow_*' labels, or perhaps it would have a 'flow_translations' label, which would set it apart from other git repositories. At the moment, we assume there are po/pot files in translation repos, but my preference would be to try to infer the format, and as a last resort an annotation could specify the format.

When creating the chatbot, we would need a local copy of all sources, which means recursively processing any repository that has flow templates or content. If there is a conflict between sources, the tags/labels/annotations from the last source should replace all others. I prefer this because it seems straightforward and is probably good enough for now.
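To illustrate the inference with hypothetical entries, a sources list might contain:

[
  {
    "location": "<google_sheet_id>",
    "labels": ["flow_content"]
  },
  {
    "location": "git+https://<other_repo_url>@v1.0.0",
    "labels": ["flow_template"]
  },
  {
    "location": "git+https://<translations_repo_url>@main",
    "labels": ["flow_translations"]
  }
]

The first entry would be acted on by pull-commit (a Google Sheets ID with a 'flow_*' label), the second would be treated as a parent repository (a git URL with a 'flow_*' label), and the third, carrying the suggested 'flow_translations' label, would be treated as a translation repo.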
Ok, thanks for clarifying. I agree that only the local data should be stored within the repo, and parent repo data is fetched whenever needed for pipeline runs. I think we're on the same page regarding conflict resolution as well. I'm not entirely sure yet about the automatic inference of which sources need to be acted upon. I get the high-level idea, but I think this needs fleshing out in more detail (e.g., I'm not so confident about using a
Context
Many of our chatbots share common data. Rather than duplicating it for each chatbot, we want to split this shared data out into parent repositories.
We currently have two key operations to consider:
pull_data: Pull data from various data sources and store them in an input folder.
compile_flows: Execute a set of steps that compile RapidPro flows from the input data.

Each repository has a config.
The main parts of the config consist of subconfigs for pipeline steps, and data sources.
Most pipeline steps need data as input, and thus reference data sources that define groups of input files.
Steps
Steps can reference one source (or multiple -- is there a need for that?) for their input data.
Sources
A source is an aggregation of data in the same format. (Remark: same format for simplicity. Composition of sources to make new sources will be discussed later, if we want to combine.)
Steps can reference sources for their input data. Some steps have a list of inputs, while others have a specific set of inputs. Some steps may have both (not at this point though). This motivates the following:
Sources may have a list of data files as input (order matters).
Sources may also have a dictionary of data files as input.
Sources may have both.
When pulling data, each source gets its own local subfolder. The list entries (str) and dict keys determine the filenames of the locally stored files.
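A sketch of a single source, with hypothetical field names files_list and files_dict:

{
  "sources": {
    "safeguarding": {
      "files_list": ["<sheet_id_1>", "<sheet_id_2>"],
      "files_dict": {
        "fra": "safeguarding_fra.xlsx"
      }
    }
  }
}

When pulling, this source would get its own subfolder, and the list entries and the dict key would determine the filenames stored there.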
For Google Sheets, the sheet IDs are non-descriptive. Thus the config has an (optional) global field sheet_names, in which a mapping from names to sheet IDs can be provided. When a source references an input file, it first looks up whether the reference is in the sheet_names map, and in that case uses the respective value. (Currently, this is only done for Google Sheets. It seems sensible to extend this to other types of references as well, e.g. file paths?)
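For example (hypothetical IDs):

{
  "sheet_names": {
    "onboarding_content": "<sheet_id_1>",
    "safeguarding_words": "<sheet_id_2>"
  },
  "sources": {
    "flow_definitions": {
      "files_list": ["onboarding_content", "safeguarding_words"]
    }
  }
}

Here the source refers to the sheets by name, and the names also give the pulled files descriptive local filenames.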
Hierarchical design proposal

Parents
The config can specify parent repositories.
Design question: Does each repo have (at most) a single parent, or can it have multiple and reference them more like libraries (that we selectively take data from)?
Single parent:
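A sketch of the single-parent variant (the field name parent is hypothetical):

{
  "parent": {
    "location": "git+https://<parent_repo_url>@v1.0.0"
  }
}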
Libraries:
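A sketch of the libraries variant, with multiple named parents (the field name parents is hypothetical):

{
  "parents": {
    "core_content": {
      "location": "git+https://<core_repo_url>@v2.1.0"
    },
    "safeguarding": {
      "location": "git+https://<safeguarding_repo_url>@v1.0.0"
    }
  }
}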
Composition and Inheritance
In addition to a list/dict of file references, a source may reference other sources:
When pulling data, only data from the files list/dict is locally stored (in the folder for this source).
When running the flow compilation pipeline, source lists are taken from the parent repos (recursively), and file lists/dicts are composed as follows: file lists are concatenated, and file dicts are merged (on a key collision, the child entry overwrites the parent entry).
Thus when a pipeline step references a source, it has access to a joint files list/dict.
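For instance, if a parent source has

{
  "flow_definitions": {
    "files_list": ["<sheet_id_A>"]
  }
}

and the child source that references it has

{
  "flow_definitions": {
    "parent_sources": ["parent.flow_definitions"],
    "files_list": ["<sheet_id_B>"]
  }
}

then a step referencing flow_definitions would see the joint files list

["<sheet_id_A>", "<sheet_id_B>"]

(assuming parent entries come first), even though only <sheet_id_B> is stored locally in the child repo.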