Enable consumption and configuration of specific hyperscaler resources [EPIC] #18195

Open · 15 tasks
varbanv opened this issue Sep 20, 2023 · 18 comments
Labels: area/control-plane (Related to all activities around Kyma Control Plane), Epic

Comments

@varbanv
Contributor

varbanv commented Sep 20, 2023

Description

Provide a way for end users to consume and be charged for a pre-defined set of hyperscaler resources:

  • specialized node types, for example GPU and ARM nodes, network-, memory-, and CPU-optimized nodes, etc.
  • additional types of storage, including SSD and read-write-many options.

Making standard machine types configurable in worker pools is addressed in the separate story #18709. It is expected that any additional node-specific settings will be added to that concept as a further option.
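
As a rough illustration only (not the agreed design), an additional worker pool with a specialized machine type could be expressed in the service instance parameters roughly as follows; all field names here are assumptions, and the actual shape will be defined by #18709:

```yaml
# Hypothetical Kyma runtime service instance parameters (sketch only).
# Field names such as additionalWorkerNodePools are assumptions; the
# final structure is defined by the worker pool story #18709.
name: my-kyma-runtime
region: eu-central-1
machineType: m6i.xlarge            # default worker pool keeps standard types
additionalWorkerNodePools:
  - name: gpu-pool
    machineType: g5.xlarge         # example GPU-backed instance type
    autoScalerMin: 1
    autoScalerMax: 3
```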

Context

Problem

Currently, Kyma is a layer on top of Kubernetes and as such provides a very limited set of infrastructure configuration options at provisioning time.
However, customers looking to adopt Kyma who already use hyperscaler offerings take advantage of specialized resources as part of their workloads (for example faster storage, GPU nodes, or network-optimized nodes). The lack of such options prevents those users from onboarding onto Kyma without re-engineering their workloads.

Benefits

For customers:

  • greater flexibility around infrastructure requirements
  • ability to meet requirements in order to move workloads to Kyma
  • reduced maintenance related to hyperscaler accounts

For us:

  • increase adoption
  • abstract and bundle infrastructure related requirements in one feature
  • mitigate the abandoned BYOC approach

Potential problems

  • billing could become more complex if we don't introduce an easy way for customers to track their costs

Gathering of Resources to support

  • advanced machine types
    • Mandatory, as this is the most requested feature
    • for example GPUs and, ultimately, all types that are available in Gardener; the machines are selectable in the service instance parameters as part of the worker pool setup
    • including ARM types, which will require an additional setting in the worker pool config
    • advanced machine types can have a very high price, and the price differs depending on multiple factors such as the platform; generic pricing based on the amount of CPU/memory does not seem adequate
  • Cloud Manager Module resources
    • Mandatory as it is about to go live and we need charging for it
    • All are requested via custom resources in the cluster (see the NFS sketch after this list)
    • NFS storage
      • Charged per volume size and type
    • VPC usage
      • no price per se
    • Redis instances
      • price per usage and data transfer rate
  • Application LoadBalancers
    • Some hyperscalers provide different load balancer types to be used in front of a Kubernetes cluster
    • Will be selected via an annotation on the Istio/Service resources (see the annotation sketch after this list)
    • Mandatory, as currently users can get multiple load balancers via Kubernetes Services of type LoadBalancer and will not get charged
  • Storage classes
    • Customers can create additional storage classes via k8s resources (available classes should be provided by default?)
    • Customers need to be charged based on the storage class
    • Mandatory, as currently expensive storage gets charged at the base price
  • Web/Application Firewall
    • Part of firewall package, usually in an enterprise tier
    • Paid with a monthly fee + a fee per amount of protected resources
    • Most probably configured via the Cloud Manager module?
    • Not mandatory at the moment for closing the epic, but should be considered in the concept
  • Assured workload (GCP)
    • Needed for restricted markets like KSA
    • The premium tier adds a 20% surcharge on machine types
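
To make the Cloud Manager request model above a bit more concrete, here is a minimal sketch of an NFS volume requested via an in-cluster custom resource. The API group, kind, and field names are assumptions for illustration and may not match the module's actual CRDs:

```yaml
# Hypothetical custom resource asking the Cloud Manager module for NFS storage.
# apiVersion, kind, and spec fields are illustrative assumptions only.
apiVersion: cloud-resources.kyma-project.io/v1beta1
kind: AwsNfsVolume
metadata:
  name: shared-data
  namespace: default
spec:
  capacity: 100Gi   # billing would be based on volume size and type
```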
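
Similarly, selecting a hyperscaler-specific load balancer type is typically done with an annotation on the Service. The sketch below uses the upstream AWS annotation for requesting a Network Load Balancer; which annotations Kyma will support and charge for is exactly what this epic has to define:

```yaml
# Sketch: a Service of type LoadBalancer annotated to request a specific
# hyperscaler load balancer type (here an AWS NLB via the upstream annotation).
apiVersion: v1
kind: Service
metadata:
  name: my-workload
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: my-workload
  ports:
    - port: 80
      targetPort: 8080
```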

Billing requirements

  • needs to stay simple because
    • on the one hand, it is very complex to add new items to the bill
    • the pricing model of CF is successful
    • different prices per region and item would make users' lives troublesome
  • at the moment only storage and compute are billed, and most probably that should stay that way, so all items need to be mapped to those two
  • storage and compute can have different entities listed, such as machine types; however, there should be no regional differences

Acceptance criteria

  • Users can enable/disable hyperscaler resource usage for their cluster from a pre-defined list of options
  • Users can enable/disable hyperscaler resource usage via BTP Cockpit, Kyma Dashboard and command line
  • Users will be billed for all additional usage in their clusters and the charges will be correctly reflected in their billing summary
  • Users cannot select or enable hyperscaler resources that are not part of the pre-defined list.
  • Kyma will provide a mechanism for end users to access relevant information related to the hyperscaler resource they use without providing direct access to the hyperscaler account
  • List of available resources should be based on region and hyperscaler availability
  • In case of quota/availability limitations for a specific resource, end users will be given reasonable information about why they cannot consume the desired resource

Tasks

  • business workshop to decide on the pricing model
  • technical workshop to define the architecture
  • decide which machine types/nodes to expose first
  • add new node type input to the worker pool config
  • KMC handles metering for new node type
  • KMC handles metering for non-basic storage types
  • provide feedback to the end user about errors during resource configuration changes - e.g. no GPUs available from the hyperscaler
  • document the pricing and calculation for the additional node types
@varbanv varbanv added the Epic label Sep 20, 2023
@Disper
Member

Disper commented Nov 6, 2023

  • That would potentially require rewriting the provisioning/de-provisioning logic in the infrastructure manager, as we want to move away from the old provisioner.
  • Question: would the implementation enable a specific set of resources, or a generic one that would work with any type of resource? What will that pre-defined list look like?

@marco-porru

The cKMS team is evaluating the usage of Kyma.
The team needs "confidential computing capabilities". This kind of machine is certainly available for Azure and GCP.

@marco-porru

SAP for Me would like to use m6g and m6in machine types

@valentinvieriu
Contributor

+1 for GPUs

@marco-porru

marco-porru commented Jan 29, 2024

+1 for GPUs

Thanks, Valentin, for reporting it. I think it's worth mentioning the context, and let me do it on your behalf for simplicity 😄: it's to make it possible for Core AI to run on Kyma (subject to future discussions and agreements).

@varbanv
Contributor Author

varbanv commented Mar 21, 2024

Had a preliminary workshop with @tobiscr and @PK85 and added a first set of tasks to work on.

@marco-porru

+1 team for GPU (ICN Munich)

@marco-porru

Enable more private connectivity (e.g. via a firewall), requested by no fewer than 3 teams (e.g. S/4HANA ABAP Machines)

@marco-porru

Enable assured workload GCP module (relevant for KSA), requested by BTP email service

@abbi-gaurav
Member

A customer is looking for very high IOPS storage.
For example, enabling ultra disks for their storage could help: https://learn.microsoft.com/en-us/azure/virtual-machines/disks-enable-ultra-ssd?tabs=azure-portal
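
As a hedged sketch of what enabling such storage could look like on the cluster side, a StorageClass backed by Azure ultra disks via the Azure disk CSI driver might be defined roughly as follows; the exact parameter names and values should be checked against the Azure documentation, and whether Kyma will expose such a class is still open:

```yaml
# Sketch: StorageClass for Azure ultra disks (very high IOPS storage).
# Parameter names follow the Azure disk CSI driver docs; values are examples.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ultra-disk
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS
  cachingMode: None            # ultra disks do not support host caching
  diskIopsReadWrite: "4000"
  diskMbpsReadWrite: "400"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```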

@abbi-gaurav
Member

At present, customers are able to use resources for which they are not charged, such as:

  • additional load balancers
  • non-default storage classes, e.g. for read-write-many access

We should somehow make the customers aware that they might have to pay for this in the future, so it should not come as a surprise for them.

@NHingerl , could you please help? IMHO, putting this info out might not need to wait until this epic is done.

@lanthoor

lanthoor commented Jun 9, 2024

SAP IPR would like to use g5 and r7i instance types along with other hyperscaler resources like ALB/NLB.

@pthd

pthd commented Jun 14, 2024

+1 for GPU support.
AI scenarios require GPU-powered instances.
More precisely, we want to leverage Transformer models, which run much faster on GPUs.

@MarcusNotheis

We would be interested in OpenSearch consumption

@tobiscr tobiscr changed the title Enable consumption and configuration of specific hyperscaler resources Enable consumption and configuration of specific hyperscaler resources [EPIC] Jul 8, 2024
@marco-porru

GPUs for the Product Services team (already LIVE), in particular from GCP: A100, H100, and H200 machines.

@marco-porru

GPUs requested also by NGS (already live)

@varbanv varbanv assigned a-thaler and unassigned k15r Jul 15, 2024

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2024
@a-thaler a-thaler removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2024
@a-thaler
Contributor

In the default worker pool, we will continue to support only the current machine types. With additional worker pools, we will support additional machine types. As soon as the worker pool feature is ready (#18709), we will start adding some compute-intensive types, followed by GPUs.

In parallel, we are already working on a concept to also emit non-billable metrics, bringing more transparency on what actually gets charged. That is still in a conceptual phase, identifying what is possible to achieve.

For the compute-intensive workloads, we are currently thinking of adding these (non-ARM based):
