This PR relates to issue #216 and to the TODO list in the previous PR #222.
The main change adds node_type indexing in src/train/profiler/node_type_index.py. I introduce NodeTypeSpec and NodeTypeIndexCollection to group input machine data by spec for each pipeline training run.
At collection time, we autogenerate a NodeTypeSpec (processor, number of cores, number of chips, memory, and so on) and keep it in the data path. At training time, we read that value by machine ID and then index it in the NodeTypeIndexCollection.
If the same spec has already been indexed, the same index number is reused. However, we expect a step that appends data from the same group before training. For AWS instances, we expect a single profile per index. The machine index is kept under the pipeline folder in JSON format (node_type_index.json). We can read this file to generate the machine index on export.
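To illustrate the indexing behavior described above, here is a minimal sketch, not the code in this PR. The class names NodeTypeSpec and NodeTypeIndexCollection and the file name node_type_index.json come from the description; the field names (processor, cores, chips, memory_gb), the methods get_or_add_index and save, and the on-disk JSON layout are assumptions made only for this example.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class NodeTypeSpec:
    # Hypothetical fields; the PR only says "processor, #cores, #chips, memory and so on".
    processor: str
    cores: int
    chips: int
    memory_gb: int


class NodeTypeIndexCollection:
    FILENAME = "node_type_index.json"  # file name taken from the PR description

    def __init__(self, pipeline_path):
        self.pipeline_path = pipeline_path
        self.index_to_spec = {}
        path = os.path.join(pipeline_path, self.FILENAME)
        # Reload a previously saved index so numbers stay stable across runs.
        if os.path.exists(path):
            with open(path) as f:
                for key, value in json.load(f).items():
                    self.index_to_spec[int(key)] = NodeTypeSpec(**value)

    def get_or_add_index(self, spec):
        # Hypothetical method: reuse the index if the same spec is already known.
        for index, known in self.index_to_spec.items():
            if known == spec:
                return index
        new_index = max(self.index_to_spec, default=-1) + 1
        self.index_to_spec[new_index] = spec
        return new_index

    def save(self):
        # Hypothetical method: persist the index under the pipeline folder as JSON.
        path = os.path.join(self.pipeline_path, self.FILENAME)
        with open(path, "w") as f:
            json.dump({str(i): asdict(s) for i, s in self.index_to_spec.items()}, f, indent=2)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as pipeline_dir:
        collection = NodeTypeIndexCollection(pipeline_dir)
        spec = NodeTypeSpec("example_cpu", cores=32, chips=2, memory_gb=128)
        first = collection.get_or_add_index(spec)
        again = collection.get_or_add_index(spec)  # same spec, so same index
        collection.save()
        print(first, again)
```

In this sketch, two machines with identical specs map to one index, which is why data from the same group would need to be appended before training.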
In addition to the above enhancement, this PR includes multiple bug fixes to the CI workflow, including adding a complete-train pipeline run to the Tekton test.
Signed-off-by: Sunyanan Choochotkaew [email protected]