Unable to use TPU on GKE using on-demand quota #621

Closed
@samos123

Description

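Currently, axlearn always adds exactly one of two nodeSelectors: `cloud.google.com/reservation-name` when tier is "0" and a reservation is configured, or `cloud.google.com/gke-spot: "true"` in every other case: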

        if tier == "0" and cfg.reservation is not None:
            logging.info("Found tier=%s in env. Using reservation=%s", tier, cfg.reservation)
            selector.update({"cloud.google.com/reservation-name": cfg.reservation})
        else:
            logging.info("Found tier=%s in env. Using spot quota", tier)
            selector.update({"cloud.google.com/gke-spot": "true"})
            tolerations.append(
                {
                    "key": "cloud.google.com/gke-spot",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            )

It should be possible to launch a job on on-demand TPUs. Today that is not possible unless you remove this line, because the else branch always pins the pod to spot nodes:

selector.update({"cloud.google.com/gke-spot": "true"})
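One possible shape for a fix, as a minimal sketch rather than axlearn's actual code: make the spot selector opt-out so that a third, on-demand path adds neither selector. The function name `build_tpu_selector` and the `use_spot` flag are hypothetical and exist only for illustration; the label keys and toleration are copied from the snippet above.

```python
import logging
from typing import Optional


def build_tpu_selector(
    tier: Optional[str],
    reservation: Optional[str],
    use_spot: bool = True,  # hypothetical flag, not an existing axlearn option
) -> tuple[dict, list]:
    """Returns (nodeSelector, tolerations) for reserved, spot, or on-demand TPUs."""
    selector: dict = {}
    tolerations: list = []
    if tier == "0" and reservation is not None:
        # Reserved capacity: same behavior as the existing code.
        logging.info("Found tier=%s in env. Using reservation=%s", tier, reservation)
        selector["cloud.google.com/reservation-name"] = reservation
    elif use_spot:
        # Spot capacity: same selector and toleration as the existing code.
        logging.info("Found tier=%s in env. Using spot quota", tier)
        selector["cloud.google.com/gke-spot"] = "true"
        tolerations.append(
            {
                "key": "cloud.google.com/gke-spot",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        )
    else:
        # On-demand capacity: add neither selector, so the pod is not
        # restricted to spot or reserved nodes.
        logging.info("Found tier=%s in env. Using on-demand quota", tier)
    return selector, tolerations
```

With this sketch, `build_tpu_selector("1", None, use_spot=False)` returns an empty selector and no tolerations, while the default call keeps today's spot behavior.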
