
Instructions for Admins and Users

Prerequisites

Software required

The recommended minimum Linux kernel version for running the PMEM-CSI driver is 4.15. See Persistent Memory Programming for more details about supported kernel versions.
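
For example, the kernel version currently running on a node can be checked with:

$ uname -r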

Hardware required

Persistent memory device(s) are required for operation. However, some development and testing can be done using QEMU-emulated persistent memory devices. See the "QEMU and Kubernetes" section for the commands that create such a virtual test cluster.

Persistent memory pre-provisioning

The PMEM-CSI driver needs pre-provisioned regions on the NVDIMM device(s). Creating these regions is intentionally left to the administrator, who can decide how much PMEM is used for PMEM-CSI and how it is configured.

Beware that the PMEM-CSI driver runs without errors on a node where PMEM was not prepared for it. It then reports zero local storage for that node, which is currently only visible in the log files.
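
To check for this, inspect the log output of the PMEM-CSI pod running on that node. For example (the pod name here is only a placeholder; use the actual names shown by kubectl get pods during the deployment steps below):

$ kubectl logs pmem-csi-node-8kmxf --all-containers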

When running the Kubernetes cluster and PMEM-CSI on bare metal, the ipmctl utility can be used to create regions. App Direct Mode has two configuration options, interleaved or non-interleaved. In the non-interleaved configuration, one region is created per NVDIMM; as a consequence, a PMEM-CSI volume cannot be larger than one NVDIMM.

Example of creating regions without interleaving, using all NVDIMMs:

# ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved

Alternatively, multiple NVDIMMs can be combined to form an interleaved set. This stripes the data across multiple NVDIMM devices, which improves read/write performance and allows one region (and thus one PMEM-CSI volume) to be larger than a single NVDIMM.

Example of creating regions in interleaved mode, using all NVDIMMs:

# ipmctl create -goal PersistentMemoryType=AppDirect
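
In both cases ipmctl records a provisioning goal that the platform firmware applies on the next reboot. After rebooting, the resulting regions can be inspected with ndctl, for example:

# ndctl list -R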

When running inside virtual machines, each virtual machine typically already gets access to one region, and ipmctl is not needed inside the virtual machine. Instead, that region must be made available for use with PMEM-CSI: when the virtual machine comes up for the first time, the entire region is already allocated for use as a single block device:

# ndctl list -RN
{
  "regions":[
    {
      "dev":"region0",
      "size":34357641216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":34357641216,
          "sector_size":512,
          "blockdev":"pmem0"
        }
      ]
    }
  ]
}
# ls -l /dev/pmem*
brw-rw---- 1 root disk 259, 0 Jun  4 16:41 /dev/pmem0

Labels must be initialized in such a region. This has to be done once, after the first boot:

# ndctl disable-region region0
disabled 1 region
# ndctl init-labels nmem0
initialized 1 nmem
# ndctl enable-region region0
enabled 1 region
# ndctl list -RN
[
  {
    "dev":"region0",
    "size":34357641216,
    "available_size":34357641216,
    "max_available_extent":34357641216,
    "type":"pmem",
    "iset_id":10248187106440278,
    "persistence_domain":"unknown"
  }
]
# ls -l /dev/pmem*
ls: cannot access '/dev/pmem*': No such file or directory

Installation and setup

Get source code

PMEM-CSI uses Go modules and thus can be checked out and, if desired, built anywhere in the filesystem. Pre-built container images are available, so users do not need to build from source, but they will still need some additional files from the repository. To get the source code, use:

git clone https://github.com/intel/pmem-csi

Run PMEM-CSI on Kubernetes

This section assumes that a Kubernetes cluster is already available with at least one node that has persistent memory device(s). For development or testing, it is also possible to use a cluster that runs on QEMU virtual machines; see the "QEMU and Kubernetes" section.

  • Make sure that the alpha feature gates CSINodeInfo and CSIDriverRegistry are enabled

The method to configure alpha feature gates may vary, depending on the Kubernetes deployment. It may not be necessary anymore when the feature has reached beta state, which depends on the Kubernetes version.
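
As an illustration only (the exact mechanism and the affected components depend on how the cluster was deployed), feature gates are typically enabled by passing a command-line flag to the relevant Kubernetes components, for example:

--feature-gates=CSINodeInfo=true,CSIDriverRegistry=true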

  • Label the cluster nodes that provide persistent memory device(s)
    $ kubectl label node <your node> storage=pmem
  • Set up certificates

Certificates are required as explained in Security. If you are not using the test cluster described in Starting and stopping a test cluster, where certificates are created automatically, you must set up certificates manually. This can be done by running the ./test/setup-ca-kubernetes.sh script for your cluster. The script requires the cfssl tools, which can be downloaded as shown below. These are the steps for manually setting up the certificates:

  • Download cfssl tools
   $ curl -L https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 -o _work/bin/cfssl --create-dirs
   $ curl -L https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 -o _work/bin/cfssljson --create-dirs
   $ chmod a+x _work/bin/cfssl _work/bin/cfssljson
  • Run the certificate set-up script
   $ KUBECONFIG="<your cluster kubeconfig path>" PATH="$PATH:$PWD/_work/bin" ./test/setup-ca-kubernetes.sh
  • Deploy the driver to Kubernetes

The deploy/kubernetes-<kubernetes version> directory contains pmem-csi*.yaml files which can be used to deploy the driver on that Kubernetes version. The files in the directory with the highest Kubernetes version might also work for more recent Kubernetes releases. All of these deployments use images published by Intel on Docker Hub.

For each Kubernetes version, four different deployment variants are provided:

  • direct or lvm: one uses direct device mode, the other LVM device mode.
  • testing: the variants with testing in the name enable debugging features and shouldn't be used in production.

For example, to deploy for production with LVM device mode onto Kubernetes 1.14, use:

    $ kubectl create -f deploy/kubernetes-1.14/pmem-csi-lvm.yaml

The PMEM-CSI scheduler extender and webhook are not enabled in this basic installation. See below for instructions about that.

These variants were generated with kustomize. kubectl >= 1.14 includes some support for that. The sub-directories of deploy/kustomize-<kubernetes version> can be used as bases for kubectl kustomize. For example:

  • Change namespace:

    $ mkdir -p my-pmem-csi-deployment
    $ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
    namespace: pmem-csi
    bases:
      - ../deploy/kubernetes-1.14/lvm
    EOF
    $ kubectl create namespace pmem-csi
    $ kubectl create --kustomize my-pmem-csi-deployment
    
  • Configure how much PMEM is used by PMEM-CSI for LVM (see Namespace modes in LVM device mode):

    $ mkdir -p my-pmem-csi-deployment
    $ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
    bases:
      - ../deploy/kubernetes-1.14/lvm
    patchesJson6902:
      - target:
          group: apps
          version: v1
          kind: DaemonSet
          name: pmem-csi-node
        path: lvm-parameters-patch.yaml
    EOF
    $ cat >my-pmem-csi-deployment/lvm-parameters-patch.yaml <<EOF
    # pmem-ns-init is in the init container #0. Append arguments at the end.
    - op: add
      path: /spec/template/spec/initContainers/0/args/-
      value: "--useforfsdax=90"
    EOF
    $ kubectl create --kustomize my-pmem-csi-deployment
    
  • Wait until all pods reach 'Running' status

    $ kubectl get pods
    NAME                    READY   STATUS    RESTARTS   AGE
    pmem-csi-node-8kmxf     2/2     Running   0          3m15s
    pmem-csi-node-bvx7m     2/2     Running   0          3m15s
    pmem-csi-controller-0   2/2     Running   0          3m15s
    pmem-csi-node-fbmpg     2/2     Running   0          3m15s
  • Verify that the node labels have been configured correctly
    $ kubectl get nodes --show-labels

The command output must indicate that every node with PMEM has these two labels:

pmem-csi.intel.com/node=<NODE-NAME>,storage=pmem

If storage=pmem is missing, label manually as described above. If pmem-csi.intel.com/node is missing, then double-check that the alpha feature gates are enabled, that the CSI driver is running on the node, and that the driver's log output doesn't contain errors.

  • Define two storage classes using the driver
    $ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-storageclass-ext4.yaml
    $ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-storageclass-xfs.yaml
  • Provision two pmem-csi volumes
    $ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-pvc.yaml
  • Verify two Persistent Volume Claims have 'Bound' status
    $ kubectl get pvc
    NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
    pmem-csi-pvc-ext4   Bound    pvc-f70f7b36-6b36-11e9-bf09-deadbeef0100   4Gi        RWO            pmem-csi-sc-ext4   16s
    pmem-csi-pvc-xfs    Bound    pvc-f7101fd2-6b36-11e9-bf09-deadbeef0100   4Gi        RWO            pmem-csi-sc-xfs    16s
  • Start two applications requesting one provisioned volume each
    $ kubectl create -f deploy/kubernetes-<kubernetes version>/pmem-app.yaml

These applications use storage: pmem in their nodeSelector list to ensure scheduling onto a node that provides PMEM devices. Each requests a mount of a volume, one with an ext4 file system and the other with xfs.
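
For illustration, the relevant parts of such a pod specification look roughly like the following sketch. The pod name matches the output below, but the image and command are placeholders; the authoritative definition is in deploy/kubernetes-<kubernetes version>/pmem-app.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: my-csi-app-1
spec:
  nodeSelector:
    storage: pmem                    # schedule only onto nodes labeled for PMEM
  containers:
    - name: my-app
      image: busybox                 # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pmem-csi-pvc-ext4 # PVC created in the previous step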

  • Verify two application pods reach 'Running' status
    $ kubectl get po my-csi-app-1 my-csi-app-2
    NAME           READY   STATUS    RESTARTS   AGE
    my-csi-app-1   1/1     Running   0          6m5s
    NAME           READY   STATUS    RESTARTS   AGE
    my-csi-app-2   1/1     Running   0          6m1s
  • Check that applications have a pmem volume mounted with added dax option
    $ kubectl exec my-csi-app-1 -- df /data
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/ndbus0region0fsdax/5ccaa889-551d-11e9-a584-928299ac4b17
                           4062912     16376   3820440   0% /data
    $ kubectl exec my-csi-app-2 -- df /data
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/ndbus0region0fsdax/5cc9b19e-551d-11e9-a584-928299ac4b17
                           4184064     37264   4146800   1% /data

    $ kubectl exec my-csi-app-1 -- mount |grep /data
    /dev/ndbus0region0fsdax/5ccaa889-551d-11e9-a584-928299ac4b17 on /data type ext4 (rw,relatime,dax)
    $ kubectl exec my-csi-app-2 -- mount |grep /data
    /dev/ndbus0region0fsdax/5cc9b19e-551d-11e9-a584-928299ac4b17 on /data type xfs (rw,relatime,attr2,dax,inode64,noquota)

Expose persistent and cache volumes to applications

Kubernetes cluster administrators can expose persistent and cache volumes to applications using StorageClass Parameters. An optional persistencyModel parameter differentiates how the provisioned volume can be used:

  • no persistencyModel parameter or persistencyModel: normal in StorageClass

    A normal Kubernetes persistent volume. In this case PMEM-CSI creates a PMEM volume on a node, and the application that claims this volume is expected to be scheduled onto that node by Kubernetes. The choice of node depends on the StorageClass volumeBindingMode. With volumeBindingMode: Immediate, PMEM-CSI chooses a node randomly; with volumeBindingMode: WaitForFirstConsumer (also known as late binding), Kubernetes first chooses a node for scheduling the application, and PMEM-CSI then creates the volume on that node. Applications which claim a normal persistent volume have to use ReadWriteOnce in their accessModes list. This diagram illustrates how a normal persistent volume gets provisioned in Kubernetes using the PMEM-CSI driver.

  • persistencyModel: cache

    Volumes of this type shall be used in combination with volumeBindingMode: Immediate. In this case, PMEM-CSI creates a set of PMEM volumes, one volume on each of several different nodes. The number of PMEM volumes to create can be specified with the cacheSize StorageClass parameter. Applications which claim a cache volume can use ReadWriteMany in their accessModes list. Check the provided cache StorageClass example; a sketch of such a class is also shown after the warning below. This diagram illustrates how a cache volume gets provisioned in Kubernetes using the PMEM-CSI driver.

NOTE: Cache volumes are associated with a node, not a pod. Multiple pods using the same cache volume on the same node will not get their own instance but will end up sharing the same PMEM volume instead. Application deployments have to take this into account and use available Kubernetes mechanisms such as node anti-affinity. Check the provided cache application example.

WARNING: late binding (volumeBindingMode: WaitForFirstConsumer) has some caveats:

  • Pod creation may get stuck when there isn't enough capacity left for the volumes; see the next section for details.
  • A node is only chosen the first time a pod starts. After that it will always restart on that node, because that is where the persistent volume was created.
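
For illustration only, a cache StorageClass might look like the following sketch (the class name and cacheSize value are placeholders; the example provided in the repository is the authoritative reference):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-cache      # placeholder name
provisioner: pmem-csi.intel.com
volumeBindingMode: Immediate   # required for cache volumes
parameters:
  persistencyModel: cache
  cacheSize: "2"               # create the volume on two different nodes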

Volume requests embedded in the pod spec are provisioned as ephemeral volumes. The volume request can use the following fields as volumeAttributes:

| key        | meaning                                                                                                             | optional | values                 |
|------------|---------------------------------------------------------------------------------------------------------------------|----------|------------------------|
| size       | Size of the requested ephemeral volume as Kubernetes memory string ("1Mi" = 1024*1024 bytes, "1e3K" = 1000000 bytes) | No       |                        |
| eraseAfter | Clear all data after use and before deleting the volume                                                              | Yes      | true (default), false  |

Check the provided example application for ephemeral volume usage; a sketch of an inline volume request follows below.
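
For illustration only, an inline volume request in a pod spec could look like the following sketch (pod name, image, and the requested size are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: my-csi-app-inline      # placeholder name
spec:
  nodeSelector:
    storage: pmem
  containers:
    - name: my-app
      image: busybox           # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
  volumes:
    - name: scratch
      csi:
        driver: pmem-csi.intel.com
        fsType: ext4
        volumeAttributes:
          size: 2Gi
          eraseAfter: "true"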

Raw block volumes

Applications can use volumes provisioned by PMEM-CSI as raw block devices. Such volumes use the same "fsdax" namespace mode as filesystem volumes and therefore are block devices. That mode only supports dax (= mmap(MAP_SYNC)) through a filesystem. Pages mapped on the raw block device go through the Linux page cache. Applications have to format and mount the raw block volume themselves if they want dax. The advantage then is that they have full control over that part.

For provisioning a PMEM volume as a raw block device, one has to create a PersistentVolumeClaim with volumeMode: Block. See the example PVC and application for usage reference; a minimal PVC sketch is also shown after the list below.

That example demonstrates how to handle some details:

  • mkfs.ext4 needs -b 4096 to produce volumes that support dax; without it, the automatic block size detection may end up choosing an unsuitable value depending on the volume size.
  • Kubernetes bug #85624 must be worked around to format and mount the raw block device.
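
For illustration only, such a PVC could look like the following sketch (the claim name is a placeholder and the storage class is one of the classes created earlier; the example PVC in the repository is the authoritative reference):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pmem-csi-pvc-block     # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block            # request a raw block device instead of a mounted file system
  resources:
    requests:
      storage: 4Gi
  storageClassName: pmem-csi-sc-ext4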

Enable scheduler extensions

The PMEM-CSI scheduler extender and admission webhook are provided by the PMEM-CSI controller. They need to be enabled during deployment via the --schedulerListen=[<listen address>]:<port> parameter. The listen address is optional and can be left out. The port is where an HTTPS server will run. It uses the same certificates as the internal gRPC service. When using the CA creation script described above, they will contain alternative names for the URLs described in this section (service names, 127.0.0.1 IP address).

This parameter can be added to one of the existing deployment files with kustomize. All of the following examples assume that the current directory contains the deploy directory from the PMEM-CSI repository. It is also possible to reference the base via a URL.

mkdir my-pmem-csi-deployment

cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
bases:
  - ../deploy/kubernetes-1.16/lvm
patchesJson6902:
  - target:
      group: apps
      version: v1
      kind: StatefulSet
      name: pmem-csi-controller
    path: scheduler-patch.yaml
EOF

cat >my-pmem-csi-deployment/scheduler-patch.yaml <<EOF
- op: add
  path: /spec/template/spec/containers/0/command/-
  value: "--schedulerListen=:8000"
EOF

kubectl create --kustomize my-pmem-csi-deployment

To enable the PMEM-CSI scheduler extender, a configuration file and an additional --config parameter for kube-scheduler must be added to the cluster control plane, or, if there is already such a configuration file, one new entry must be added to the extenders array. A full example is presented below.

The kube-scheduler must be able to connect to the PMEM-CSI controller via the urlPrefix in its configuration. In some clusters it is possible to use cluster DNS and thus a symbolic service name. If that is the case, then deploy the scheduler service as-is and use https://pmem-csi-scheduler.default.svc as urlPrefix. If the PMEM-CSI driver is deployed in a namespace, replace default with the name of that namespace.

In a cluster created with kubeadm, kube-scheduler is unable to use cluster DNS because the pod it runs in is configured with hostNetwork: true and without dnsPolicy. Therefore the cluster DNS servers are ignored. There also is no special dialer as in other clusters. As a workaround, the PMEM-CSI service can be exposed via a fixed node port like 32000 on all nodes. Then https://127.0.0.1:32000 needs to be used as urlPrefix. Here's how the service can be created with that node port:

mkdir my-scheduler

cat >my-scheduler/kustomization.yaml <<EOF
bases:
  - ../deploy/kustomize/scheduler
patchesJson6902:
  - target:
      version: v1
      kind: Service
      name: pmem-csi-scheduler
    path: node-port-patch.yaml
EOF

cat >my-scheduler/node-port-patch.yaml <<EOF
- op: add
  path: /spec/ports/0/nodePort
  value: 32000
EOF

kubectl create --kustomize my-scheduler

How to (re)configure kube-scheduler depends on the cluster. With kubeadm it is possible to set all necessary options in advance, before creating the master node with kubeadm init. One additional complication with kubeadm is that kube-scheduler by default doesn't trust any root CA. The following kubeadm config file solves this by bind-mounting the root certificate that was used to sign the scheduler extender's certificate into the location where the Go runtime will find it, and at the same time enables the scheduler configuration:

sudo mkdir -p /var/lib/scheduler/
sudo cp _work/pmem-ca/ca.pem /var/lib/scheduler/ca.crt

sudo sh -c 'cat >/var/lib/scheduler/scheduler-policy.cfg' <<EOF
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "extenders" :
    [{
      "urlPrefix": "https://<service name or IP>:<port>",
      "filterVerb": "filter",
      "prioritizeVerb": "prioritize",
      "nodeCacheCapable": false,
      "weight": 1,
      "managedResources":
      [{
        "name": "pmem-csi.intel.com/scheduler",
        "ignoredByScheduler": true
      }]
    }]
}
EOF
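
The kubeadm configuration below points kube-scheduler at /var/lib/scheduler/scheduler-config.yaml. A minimal configuration file that refers to the policy file written above might look like the following sketch (the KubeSchedulerConfiguration apiVersion depends on the Kubernetes version; v1alpha1 and the kubeconfig path are assumptions here):

sudo sh -c 'cat >/var/lib/scheduler/scheduler-config.yaml' <<EOF
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  # kubeconfig written by kubeadm for the scheduler
  kubeconfig: /etc/kubernetes/scheduler.conf
algorithmSource:
  policy:
    file:
      path: /var/lib/scheduler/scheduler-policy.cfg
EOF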

cat >kubeadm.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
scheduler:
  extraVolumes:
    - name: config
      hostPath: /var/lib/scheduler
      mountPath: /var/lib/scheduler
      readOnly: true
    - name: cluster-root-ca
      hostPath: /var/lib/scheduler/ca.crt
      mountPath: /etc/ssl/certs/ca.crt
      readOnly: true
  extraArgs:
    config: /var/lib/scheduler/scheduler-config.yaml
EOF

kubeadm init --config=kubeadm.config

It is possible to stop here without enabling the pod admission webhook. To also enable it, continue as follows.

First of all, it is recommended to exclude all system pods from passing through the webhook. This ensures that they can still be created even when PMEM-CSI is down:

kubectl label ns kube-system pmem-csi.intel.com/webhook=ignore

This special label is configured in the provided webhook definition. On Kubernetes >= 1.15, it can also be used to let individual pods bypass the webhook by adding that label to them. The CA gets configured explicitly, which is supported for webhooks.

mkdir my-webhook

cat >my-webhook/kustomization.yaml <<EOF
bases:
  - ../deploy/kustomize/webhook
patchesJson6902:
  - target:
      group: admissionregistration.k8s.io
      version: v1beta1
      kind: MutatingWebhookConfiguration
      name: pmem-csi-hook
    path: webhook-patch.yaml
EOF

cat >my-webhook/webhook-patch.yaml <<EOF
- op: replace
  path: /webhooks/0/clientConfig/caBundle
  value: $(base64 -w 0 _work/pmem-ca/ca.pem)
EOF

kubectl create --kustomize my-webhook

Filing issues and contributing

Report a bug by filing a new issue.

Before making your first contribution, be sure to read the development documentation for guidance on code quality and branches.

Contribute by opening a pull request.

Learn about pull requests.

Reporting a Potential Security Vulnerability: If you have discovered a potential security vulnerability in PMEM-CSI, please send an e-mail to [email protected]. For issues related to Intel Products, please visit Intel Security Center.

It is important to include the following details:

  • The projects and versions affected
  • Detailed description of the vulnerability
  • Information on known exploits

Vulnerability information is extremely sensitive. Please encrypt all security vulnerability reports using our PGP key.

A member of the Intel Product Security Team will review your e-mail and contact you to collaborate on resolving the issue. For more information on how Intel works to resolve security issues, see: vulnerability handling guidelines.