add node type indexing #225

Merged on Feb 1, 2024 (1 commit)


@sunya-ch sunya-ch commented Jan 29, 2024

This PR relates to issue #216 and to the TODO list in the previous PR #222.

The main change adds node type indexing in src/train/profiler/node_type_index.py.
It introduces NodeTypeSpec and NodeTypeIndexCollection, which group the input machine data by spec in each training pipeline.
As shown below, at collection time a NodeTypeSpec (processor, number of cores, number of chips, memory, and so on) is generated automatically and stored in the data path. At training time, that spec is read by machine ID and then indexed in the NodeTypeIndexCollection.
If the same spec has already been indexed, the existing index number is reused. However, we expect a step that appends data from the same group before training. For AWS instances, we expect a single profile per index. The machine index is kept under the pipeline folder in JSON format (node_type_index.json). This file can be read to generate the machine index on export.

[Image: node_type_index]
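
To make the indexing behaviour described above concrete, here is a minimal sketch of the idea, not the code in src/train/profiler/node_type_index.py. The class names `NodeTypeSpecSketch` and `NodeTypeIndexCollectionSketch`, the field names, and the JSON layout are all assumptions made for illustration. The real NodeTypeSpec and NodeTypeIndexCollection may differ in every detail.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class NodeTypeSpecSketch:
    """Hypothetical machine spec; the field names are assumptions."""
    processor: str
    cores: int
    chips: int
    memory_gb: int


class NodeTypeIndexCollectionSketch:
    """Hypothetical collection mapping each distinct spec to an integer index."""

    def __init__(self) -> None:
        self._index_by_spec: Dict[NodeTypeSpecSketch, int] = {}

    def index(self, spec: NodeTypeSpecSketch) -> int:
        # Reuse the existing index when the same spec was seen before;
        # otherwise assign the next free index.
        if spec not in self._index_by_spec:
            self._index_by_spec[spec] = len(self._index_by_spec)
        return self._index_by_spec[spec]

    def to_json(self) -> str:
        # Assumed layout: {"<index>": {<spec fields>}}.
        payload = {str(i): asdict(s) for s, i in self._index_by_spec.items()}
        return json.dumps(payload, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "NodeTypeIndexCollectionSketch":
        # Rebuild the collection from the assumed JSON layout.
        collection = cls()
        for key, fields in json.loads(text).items():
            collection._index_by_spec[NodeTypeSpecSketch(**fields)] = int(key)
        return collection
```

In this sketch, indexing two machines with identical specs returns the same number, and a collection restored with `from_json` keeps the earlier assignments, which mirrors how the PR describes reading node_type_index.json back on export.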

In addition to the enhancement above, this PR includes several bug fixes to the CI workflow, including adding a complete-train pipeline run to the Tekton test.

Signed-off-by: Sunyanan Choochotkaew [email protected]

@sunya-ch sunya-ch marked this pull request as draft February 1, 2024 08:29
Signed-off-by: Sunyanan Choochotkaew <[email protected]>

sunya-ch commented Feb 1, 2024

I will update the exporter to separate each node type. Here are examples of the exported values for the pipeline trained on SPECPower data.

https://github.com/sunya-ch/kepler-model-db/blob/specpower/models/v0.7/README.md

Pipeline README page

Screenshot 2024-02-01 at 18 08 26

Model error report page (per node_type)

Screenshot 2024-02-01 at 18 08 41

@sunya-ch sunya-ch marked this pull request as ready for review February 1, 2024 09:18

rootfs commented Feb 1, 2024

also cc @KaiyiLiu1234

@rootfs rootfs merged commit dc4d631 into sustainable-computing-io:main Feb 1, 2024
9 checks passed