Mechanisms to reference parent repositories #125

Open
geoo89 opened this issue Apr 24, 2024 · 7 comments

geoo89 commented Apr 24, 2024

Context

Many of our chatbots share common data. Rather than duplicating it for each chatbot, we want to split this shared data out into parent repositories.

We currently have two key operations to consider:

  • pull_data: Pull data from various data sources and store it in an input folder
  • compile_flows: Execute a set of steps that compile RapidPro flows from the input data.

Each repository has a config.
The main parts of the config are subconfigs for pipeline steps and for data sources.
Most pipeline steps need data as input and thus reference data sources, which define groups of input files.

Steps

Steps can reference one source (or multiple -- is there a need for that?) for their input data.

	{   
		"id": "edits_pretranslation",
		"type": "edits",
		"sources": ["edits_pretranslation"]
	}
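
A minimal sketch (in Python) of how a step's sources reference could be resolved against the config's sources section; the function and variable names are illustrative, not the actual pipeline API:

    def resolve_step_sources(step, sources):
        """Return the source definitions referenced by a step, in the order given."""
        return [sources[name] for name in step.get("sources", [])]

    step = {"id": "edits_pretranslation", "type": "edits", "sources": ["edits_pretranslation"]}
    sources = {"edits_pretranslation": {"format": "sheets", "files_list": ["ab_testing_sheet_ID"]}}
    print(resolve_step_sources(step, sources))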

Sources

Sources are aggregations of data in the same format. (Remark: same format for simplicity. Composition of sources to make new sources is discussed later, should we want to combine them.)

Steps reference sources for their input data. Some steps take an ordered list of inputs, while others take a specific set of named inputs. Some steps may eventually need both (none do at this point). This motivates the following:

Sources may have a list of data files as input (order matters).

"edits_pretranslation": {
	"format": "sheets",
	"subformat": "google_sheets",
	"files_list": [
		"ab_testing_sheet_ID",
		"localisation_sheet_ID"
	]
},

Sources may also have a dictionary of data files as input.

"qr_treatment": {
	"format": "json",
	"files_dict": {
		"select_phrases_file": "./edits/select_phrases.json",
		"special_words_file": "./edits/special_words.json"
	}
}

Sources may have both.

When pulling data, each source gets its own local subfolder. The list entries (str) and dict keys determine the filenames of the locally stored files.
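
As a rough sketch of the local layout (the input root, the filename derivation and the .json extension here are assumptions for illustration, not settled conventions):

    from pathlib import Path

    # Rough sketch only: derive local filenames for a source when pulling data.
    # Each source gets its own subfolder; list entries and dict keys become filenames.
    def local_paths(source_name, source, input_root="input"):
        folder = Path(input_root) / source_name
        names = list(source.get("files_list", [])) + list(source.get("files_dict", {}))
        return [folder / f"{name}.json" for name in names]

    print(local_paths("edits_pretranslation",
                      {"files_list": ["ab_testing_sheet_ID", "localisation_sheet_ID"]}))
    # e.g. [PosixPath('input/edits_pretranslation/ab_testing_sheet_ID.json'), ...]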

For Google Sheets, the sheet IDs are nondescript. The config therefore has an (optional) global field sheet_names, which provides a mapping from human-readable names to sheet IDs. When a source references an input file, it first checks whether the reference is in the sheet_names map and, if so, uses the corresponding value. (Currently, this is only done for Google Sheets. It seems sensible to extend this to other types of references as well, e.g. file paths?)

    "sheet_names" : {
        "localised_sheets" : "google_sheet_id1",
        "T_content" : "google_sheet_id2",
        "N_onboarding_data" : "google_sheet_id3",
        "T_onboarding" : "google_sheet_id4",
	}
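
A small sketch of this lookup (the helper name is hypothetical):

    # A reference is first looked up in sheet_names; if absent, it is used verbatim
    # (e.g. as a raw Google sheet ID or a file path).
    def resolve_reference(ref, sheet_names):
        return sheet_names.get(ref, ref)

    sheet_names = {"localised_sheets": "google_sheet_id1"}
    print(resolve_reference("localised_sheets", sheet_names))   # -> "google_sheet_id1"
    print(resolve_reference("some_raw_sheet_id", sheet_names))  # -> unchanged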

Hierarchical design proposal

Parents

The config can specify parent repositories.

Design question: Does each repo have (at most) a single parent, or can it have multiple and reference them more like libraries (that we selectively take data from)?

Single parent:

    "parent": {
        "repo_url": "",
        "branch": "",
        "commit_hash": "",
        "commit_tag": ""
    }

Libraries:

    "parents": {
        "parent1": {
            "repo_url": "",
            "branch": "",
            "commit_hash": "",
            "commit_tag": ""
        },
        "parent2": {
            "repo_url": "",
            "branch": "",
            "commit_hash": "",
            "commit_tag": ""
        }
    }

Composition and Inheritance

In addition to a list/dict of file references, a source may reference other sources:

"my_source": {
	"parent_sources": [
		"parent1.source1",
		"other_local_source",
	],
	"files_dict": {...},
	"files_list": [...],
}

When pulling data, only data from the files list/dict is locally stored (in the folder for this source).

When running the flow compilation pipeline, source lists are taken from the parent repos (recursively) and file lists/dicts are composed as follows:

  • The file lists of all parent_sources are concatenated (in order), with the repo's own files_list appended at the end.
  • The file dicts of all parent_sources and the repo's own files_dict are merged; in case of duplicate keys, the later value wins. (Note: This may not be what we want. The use case for dicts is steps that have specific input files, where each key has a semantic meaning. On a key collision, the child entry overwrites the parent entry, which means there is no clean way for a step to read multiple input files of the same kind (key) and merge them. We might prefer each dict value to be a list of files, so that on a key collision we can concatenate them.)

Thus when a pipeline step references a source, it has access to a joint files list/dict.
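
A minimal sketch of these composition rules, assuming all sources (local and from parent repos) have already been collected into one flat lookup keyed as in the example above (e.g. "parent1.source1"); names are illustrative:

    # Compose a source's effective inputs from its parent_sources (recursively),
    # then its own files_list/files_dict. Later dict values win on key collisions.
    def compose_source(name, all_sources):
        source = all_sources[name]
        files_list, files_dict = [], {}
        for parent_name in source.get("parent_sources", []):
            parent = compose_source(parent_name, all_sources)
            files_list += parent["files_list"]       # parent lists first, in order
            files_dict.update(parent["files_dict"])
        files_list += source.get("files_list", [])   # own list appended at the end
        files_dict.update(source.get("files_dict", {}))  # own entries win last
        return {"files_list": files_list, "files_dict": files_dict}

    all_sources = {
        "parent1.source1": {"files_list": ["parent_sheet"]},
        "my_source": {"parent_sources": ["parent1.source1"], "files_list": ["own_sheet"]},
    }
    print(compose_source("my_source", all_sources))  # parent_sheet first, then own_sheet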


istride commented Apr 29, 2024

I am finding it difficult to see how this all fits together. Would you be able to provide a more complete example, i.e. with all the elements above combined?


geoo89 commented Apr 29, 2024

As an example to give context without any inheritance, you can see the refactored config for parenttext-crisis: https://github.com/IDEMSInternational/parenttext-crisis/blob/pipeline-refactor/config.json

Let's try to build an example showcasing inheritance:

Assume we have a grandparent repo living at the URL grandparent_url. Let its config be:

{
    "parents": {
        "grandparent1": {
            "repo_url": "",
            ...
        }
    },
    "sheet_names" : {
        "grandparent_templates" : "google_sheet_id",
        ...
    },
    "sources": {
        "flow_definitions": {
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [
                "grandparent_templates",
                "some_other_google_sheet_id_which_we_didnt_give_a_name",
                ...
            ]
        }
    },
    "steps": [
        # maybe this repo defines a pipeline to produce flows,
        # but maybe it only defines data sources
        ...
    ]
}

Assume we have a parent repo living at the URL parent_url. It declares the grandparent repo to be its parent (optionally specifying a branch/tag/commit hash) and references some of its sources. It defines two data sources: one for flows, which references the grandparent's flow_definitions source, and one for flow expiration times. For the flows data source, the resulting input data (for the pipeline) is the grandparent's list of sheets concatenated with the repo's own list of sheets. However, when pulling data, only the repo's own data is stored locally (for archiving on GitHub).

Let it also define a pipeline that produces flows. Then the config could look like this:

{
    "parents": {
        "grandparent1": {
            "repo_url": "grandparent_url",
            "branch": "",
            "commit_hash": "",
            "commit_tag": "",
        }
    },
    "sheet_names" : {
        "base_templates" : "google_sheet_idA",
        "base_content" : "google_sheet_idB",
        ...
    },
    "sources": {
        "flow_definitions": {
            "parent_sources": ["grandparent1.flow_definitions"],
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [
                "base_templates",
                "base_content",
                ...
            ]
        },
        "expiration_times": {
            "format": "json",
            "files_dict": {
                "special_expiration_file": "./edits/specific_expiration.json"
            }
        }
    },
    "steps": [
        {   
            "id": "create_flows",
            "type": "create_flows",
            "sources": ["flow_definitions"],
            "models_module": "models.parenttext_models",
            "tags": [4,"response"]
        },
        {
            "id": "update_expiration_times",
            "type": "update_expiration_times",
            "sources": ["expiration_times"],
            "default_expiration_time": 1440
        }
    ]
}

Finally, let's define the child repo. Its two data sources inherit from the parent, and its flow data source recursively inherits from the grandparent. Again, these lists of input files are concatenated and used by the pipeline steps referencing these sources, but when pulling data, only the child data is stored locally. For the second data source, we have a dict rather than a list of files, which is merged with the parent dict. (Note: This may not be what we want, because on a collision in dict keys, the child entry overwrites the parent entry. We might prefer each dict value to be a list of files, so that on a key collision we can concatenate them. The use case for dicts is steps that have specific input files, where each key has a semantic meaning. Using special_expiration_file and special_expiration_file_child and having the step infer from a naming pattern that these are the same kind of data is ugly.)

{
    "parents": {
        "parent1": {
            "repo_url": "parent_url",
            ...
        }
    },
    "sheet_names" : {
        "localised_sheets" : "google_sheet_id1",
        "N_onboarding_data" : "google_sheet_id2",
        "T_onboarding" : "google_sheet_id3",
        ...
    },
    "sources": {
        "flow_definitions": {
            "parent_sources": ["parent1.flow_definitions"],
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [
                "N_onboarding_data",
                "T_onboarding",
                ...
                "localised_sheets"
            ]
        },
        "expiration_times": {
            "parent_sources": ["parent1.expiration_times"],
            "format": "json",
            "files_dict": {
                "special_expiration_file_child": "./edits/specific_expiration.json"
            }
        }
    },
    "steps": [
        {   
            "id": "create_flows",
            "type": "create_flows",
            "sources": ["flow_definitions"],
            ...
        },
        {
            "id": "update_expiration_times",
            "type": "update_expiration_times",
            "sources": ["expiration_times"],
            ...
        }
    ]
}
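
For concreteness, under the composition rules above, the effective input list for the child's flow_definitions source would be the recursive concatenation (grandparent first, then parent, then child); shown here with the elided entries omitted:

    effective_flow_definitions = [
        # from grandparent1.flow_definitions
        "grandparent_templates",
        "some_other_google_sheet_id_which_we_didnt_give_a_name",
        # from parent1.flow_definitions
        "base_templates",
        "base_content",
        # from the child repo itself
        "N_onboarding_data",
        "T_onboarding",
        "localised_sheets",
    ]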


geoo89 commented Apr 29, 2024

I was also envisioning that parent_sources can include sources from the repo itself, allowing composition. I suspect that this may complicate implementation, so it could be added as a feature only if needed.

For example:

{
    "parents": {
        "parent1": {...}
    },
    "sheet_names" : {
        ...
    },
    "sources": {
        "flow_template_definitions": {
            "parent_sources": ["parent1.flow_definitions"],
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [
                "N_onboarding_data",
                "T_onboarding",
                ...
                "localised_sheets"
            ]
        },
        "flow_data_sheets": {
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [
                "N_onboarding_data",
                ...
                "localised_sheets"
            ]
        },
        "flow_inputs": {
            "parent_sources": ["flow_template_definitions", "flow_data_sheets"],
            "format": "sheets",
        }
    },
    "steps": [
        {   
            "id": "create_flows",
            "type": "create_flows",
            "models_module": "models.parenttext_models",
            "sources": ["flow_inputs"],
            "tags": [4,"response"]
        },
    ]
}

This particular example can also be achieved by having the step reference multiple sources, but there may be reasons to prefer one or the other depending on the use case.

{
    "parents": {
        "parent1": {...}
    },
    "sheet_names" : {
        ...
    },
    "sources": {
        "flow_template_definitions": {
            "parent_sources": ["parent1.flow_definitions"],
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [...]
        },
        "flow_data_sheets": {
            "format": "sheets",
            "subformat": "google_sheets",
            "files_list": [...]
        }
    },
    "steps": [
        {   
            "id": "create_flows",
            "type": "create_flows",
            "models_module": "models.parenttext_models",
            "sources": ["flow_template_definitions", "flow_data_sheets"],
            "tags": [4,"response"]
        },
    ]
}


istride commented Apr 30, 2024

Just want to flesh out the 'single sources list' concept a bit, so that it might be considered at this stage, even if it is not implemented now.

An example:

{
  "sources": [
    {
      "location": "git+https://.../[email protected]",
      "labels": ["flow_template", "flow_content"]
    },
    {
      "location": "<sheet_id_2>",
      "labels": ["flow_content"],
      "annotations": {
        "idems.alias": "N_onboarding.json"
      }
    },
    {
      "location": "safeguarding.xlsx",
      "labels": ["flow_safeguarding"],
      "annotations": {
        "idems.safeguarding.key": "fra"
      }
    }
  ]
}

This doesn't capture all of the various use cases, and there might be some tricky edge cases, but what I want to highlight is:

  • location to locate all sources
    • repository URLs might be specified in the same way as in a pip requirements.txt file
    • format inferred from the file extension, a prefix ("git+"), a regex match for Google Sheets IDs, etc.
  • labels to allow steps or other processes to locate the sources that are of interest to them; could also be called 'tags'
  • annotations to provide extra key-value metadata to any step or process that might be interested
    • "idems.alias" could be used when pulling from Google Sheets and converting to JSON, and when creating flows (to locate the converted file in the local repository); it might make more sense to flip this around so that location references the local JSON file and an annotation indicates the Google Sheet ID
    • Annotations and labels are inspired by their use in Kubernetes manifests
    • if it is not possible to locate all sources by using a single string, annotations could be used to provide further location information

Steps would query the sources list for the information they need. The list structure inherently indicates the order of precedence. To keep things simple, the merge strategy would be set globally for the deployment; the sources list would be fully merged/resolved before steps are allowed to query it.
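
For instance, a step might filter the merged list by label (a hedged sketch; the helper name is illustrative):

    # Return the sources carrying a given label; list order indicates precedence.
    def sources_with_label(sources, label):
        return [s for s in sources if label in s.get("labels", [])]

    sources = [
        {"location": "git+https://.../[email protected]",
         "labels": ["flow_template", "flow_content"]},
        {"location": "<sheet_id_2>", "labels": ["flow_content"],
         "annotations": {"idems.alias": "N_onboarding.json"}},
    ]
    print(sources_with_label(sources, "flow_content"))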

I'm fairly confident that this format could handle our current needs (maybe with some tweaking) and is open to extension, for our future needs. However, I have not contemplated every scenario, so let me know if you see any issues, and let me know what you think in any case.


geoo89 commented May 8, 2024

Thanks for the proposal, I like the streamlined structure with labels and annotations (which also means that a source can be referenced by multiple steps, even if we currently don't need that). There's a few things that are still a bit unclear to me:

  • As sources don't have their own IDs, when pulling data (and converting it to JSON for storage), I assume the data from all sources will be stored in the same location (i.e. folder)? In particular, in case of name clashes, the latter overwrites the former?
    • If this is the case, then this idea automatically facilitates an overwrite mechanism simply by writing files locally in the order specified by the sources, which is neat. However, it also requires care with any commonly used file names.
  • Grouping of sheets from my proposal seemed useful to keep things tidy, but I guess it's not strictly necessary (or could be featured here as well).

One question that informs a lot of the other questions: I assume we still don't store (duplicated) parent content within the child repo when pulling, right? Or do you want a repo to store all parent data as well whenever we do a pull operation? Given the overwrite mechanism mentioned above that comes for free, it seems like that may be a convenient option as well.

  • If a source references a repository, how do we specify/infer whether the repository is a parent repository (i.e. with the above assumption, we don't store its content locally) or some other kind of repo? How do we determine what data to pull from a source repository (if any)?
    • e.g. we have the translation repo (with pot files or whatever else there is), which has data to be pulled and converted to json. How do we know that it is not a parent repo to ignore? How do we know what format the data stored there is in?
    • When we know that something is a parent repo (through whatever means), what exactly do we pull in preparation for a pipeline run? Do we process its sources and recursively pull all sources from the parent? Or do we just copy over the input file tree from the parent (this works if each repo also stores its parent data: then our other local sources overwrite some of these as specified)?
  • Are parent tags/annotations ignored? Or merged with the child tags somehow?


istride commented May 9, 2024

There needs to be a clear distinction made between data that needs to be pulled for the purpose of storing it in git, and that which needs to be pulled for the purpose of creating a chatbot. I would expect that data that is in another git repository should only be pulled when creating the chatbot.

The pull-commit process should be able to infer which sources need to be acted on: essentially, sources where 'location' is a Google Sheets ID and 'labels' contains any label matching 'flow_*', but more labels or annotations could be added to be more explicit.

A source where the 'location' is a git URL and 'labels' has one or more 'flow_*' labels could be considered a parent repository.
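
A rough sketch of those two heuristics (the Google Sheets ID regex and the helper names are assumptions for illustration, not an agreed format):

    import re

    # Google Sheets IDs are long base64-ish strings; the exact pattern is an assumption.
    SHEET_ID_RE = re.compile(r"^[A-Za-z0-9_-]{30,60}$")

    def has_flow_label(source):
        return any(label.startswith("flow_") for label in source.get("labels", []))

    def needs_pull_commit(source):
        # Pulled and committed to git: Google Sheets ID location plus a flow_* label.
        return bool(SHEET_ID_RE.match(source["location"])) and has_flow_label(source)

    def is_parent_repo(source):
        # Treated as a parent repository: git URL location plus a flow_* label.
        return source["location"].startswith("git+") and has_flow_label(source)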

The translation repo is a special case, but I would expect it not to have any 'flow_*' labels, or perhaps it would have a 'flow_translations' label, which would set it apart from other git repositories. At the moment, we assume there are po/pot files in translation repos, but my preference would be to try to infer the format, and as a last resort, an annotation could specify the format.

When creating the chatbot, we would need a local copy of all sources, which means recursively processing any repository that has flow templates or content.

If there is a conflict with sources, the tags/labels/annotations from the last source should replace all others. I prefer this because it seems straightforward and is probably good enough for now.


geoo89 commented May 15, 2024

Ok, thanks for clarifying. I agree that only the local data should be stored within the repo, and parent repo data is fetched whenever needed for pipeline runs. I think we're on the same page regarding conflict resolution as well.

I'm not entirely sure yet about the automatic inference of which sources need to be acted upon. I get the high-level idea, but I think this needs fleshing out in more detail (e.g., I'm not so confident about using a flow_ prefix for this; what about edits, for example?). I think due to time constraints I'll go with the initial proposal for now, as it's fully defined, and leave this issue open (or create a new one referencing this) so that at some point we can consider going for a new config format, which may replace the old one or be a meta-config on top of the old one; let's see.
