XRd Control Plane on OpenShift

Introduction

This page contains resources to guide you through running XRd Control Plane in Red Hat OpenShift. Starting from an operational OpenShift cluster, the instructions explain how to configure machines in the cluster to be suitable for running the XRd Control Plane workload and how to deploy an XRd Control Plane instance.

OpenShift is Red Hat’s Kubernetes offering.

The OpenShift documentation should be referred to alongside this guide for more information.

These instructions walk through how to configure worker nodes in an OpenShift cluster for running XRd, and how to deploy XRd on those nodes.

Content

The values used in this document when deploying XRd Control Plane, such as the number of CPUs and the memory allocation, are representative of possible requirements for production workloads. However, for an actual production deployment these values should be tuned to the requirements of that deployment, such as its scale.

Resources

In addition to the documentation here, other key resources are available for bringing up XRd deployments.

Prerequisites

These instructions assume a pre-existing OpenShift setup that meets the following requirements:

  • There is an operational and reachable OpenShift cluster
  • The cluster has all the operators that are installed by default with an OpenShift installation, such as the Node Tuning Operator and the Cluster Network Operator
  • The cluster contains at least one worker node suitable for running XRd. That is, there must be a worker node that is able to meet the other requirements listed in XRd Control Plane requirements
  • IP addresses on the cluster’s internal network are reachable, for example by using a Kubernetes Service (see Kubernetes documentation) or by allowing direct SSH access to worker nodes in the cluster
  • A PersistentVolume must be set up and usable by Pods on the worker node(s) to be used by XRd (see Kubernetes documentation and the sketch after this list)
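
As an illustration of the last prerequisite, a minimal PersistentVolume sketch is shown below. It uses a hostPath volume, which is only appropriate for single-node experimentation; the name xrd-pv, the storage class xrd-storage, and the host path are placeholder values, and a production cluster would normally use a properly provisioned storage backend instead.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: xrd-pv
spec:
  capacity:
    storage: 6Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: xrd-storage
  hostPath:
    path: /var/xrd-pv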

These instructions use the OpenShift CLI (oc) (see OpenShift documentation) to interact with the cluster.

In order to configure worker nodes for running XRd, some information about the worker machine is required. These instructions use the oc debug command to gather this information, but other methods, such as direct SSH access to the worker machine, would also work.
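
For example, a debug session on a worker node can be started, and the host's root filesystem entered, with the following pair of commands (<node name> is substituted as described in the next section):

oc debug node/<node name>
chroot /host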

These instructions are for OpenShift 4.14 and 4.16, but more recent OpenShift versions are expected to be similar.

Running commands from the instructions

Throughout the instructions several bits of output from OpenShift commands should be noted down for use in future commands. Values in angle brackets, e.g. <node name>, that are present in code blocks should be substituted with values taken from earlier command output, or with other values as specified.

XRd Control Plane requirements

XRd Control Plane places hardware requirements on the host. The requirements are summarized in this section to give context for the OpenShift configuration done in subsequent sections.

The requirements of XRd Control Plane fall into the following categories:

Host kernel requirements

XRd Control Plane has various required host kernel parameters (i.e. sysctl settings) that must be set on the host machine.

CPUs

XRd Control Plane requires that the host has an x86_64 CPU with at least two CPU cores. Furthermore, each XRd Control Plane instance running on the host requires at least one CPU core.

In this guide, 4 CPU cores are requested with a limit of 8. In deployments, the CPU allocation should vary according to BGP scale.

Memory

XRd Control Plane requires that the host has at least 4GB of memory. Furthermore, each XRd Control Plane instance running on the host requires at least 2GB of memory. (Higher scale production deployments will require more memory).

In this guide, 8GB of memory is requested with a limit of 12GB.

Disk space

Each XRd instance requires at least 3GB of disk space. This can either be in a persistent volume or ephemeral. These instructions assume that a persistent volume with read-write access and sufficient capacity is available on the worker nodes where XRd is to be deployed. If deploying more than one XRd Control Plane instance, either a persistent volume with a "many" access mode (e.g. ReadWriteMany) or a dedicated persistent volume per XRd instance is required.

Core file handling

When running XRd, the host machine must have a robust core handling system in place to avoid disk exhaustion and availability issues. One of the considerations of this strategy is how much disk space is available on each worker node. For XRd Control Plane, the worker node must have at least three times the maximum memory allocation, plus the disk size, of all deployed XRd Control Plane instances.

In these instructions, per-worker-node core file handling is set up using systemd. Users should consider what core file handling strategy suits their needs in multi-node deployments (see this Red Hat blog).
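
When systemd-coredump is in use (as configured via the TuneD profile later in this guide), core files captured on a node can be listed from a debug shell. A minimal check, assuming coredumpctl is available on the node image:

oc debug node/<node name>
chroot /host coredumpctl list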

Configuring a node for XRd Control Plane

XRd needs some of the default values used by OpenShift to be tuned (both machine configuration and deployment options) to meet the requirements listed in XRd Control Plane requirements. This section details the additional configuration required to tune worker nodes in an OpenShift cluster for use with XRd.

All these configurations are compatible with standard OpenShift deployments. They are not expected to conflict with the requirements of any other (non-XRd) Pods, so should not prevent deployment of other Pods on the same worker node.

To meet XRd’s requirements, two configurations must be applied to the worker nodes to modify kernel parameters, as detailed in the following subsections.

In these instructions, the node configuration is applied to all worker nodes in the cluster. Custom Machine Config Pools (see OpenShift documentation for more information) can be used to target the configuration if this behavior is not desired.

Machine Config

The first step in configuring the nodes is to create a Machine Config that raises the limits on inotify max user watches and instances.

Create a file machine_config.yaml containing the following:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-<config name>
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/sysctl.d/inotify.conf
          contents:
            source: data:,fs.inotify.max_user_watches%20%3D%2065536%0Afs.inotify.max_user_instances%20%3D%2065536%0A
          mode: 420
          overwrite: true

where <config name> is a unique name that shall be used for all the configuration of the worker nodes. This sets both fs.inotify.max_user_watches and fs.inotify.max_user_instances to 65536. Note that the requirement is 4000 per XRd Control Plane instance, so 65536 comfortably supports up to 16 instances per node.

Then apply this config with (the node will restart):

oc apply -f machine_config.yaml

When the Machine Config is applied, the following command will show that the Machine Config Pool is updating:

$ oc get machineconfigpool worker
NAME     CONFIG                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-(random string)   False     True       False      1              0                   0                     0                      34d

Once the Machine Config Pool has finished updating, the same command will show that the config is UPDATED.
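
Once the nodes have rebooted, the new settings can be spot-checked from a debug shell on a worker node; the reported values should match those set above:

oc debug node/<node name>
chroot /host sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances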

TuneD

A TuneD profile needs to be applied to set host kernel parameters to meet the requirements discussed in Host kernel requirements.

Create a file tuned.yaml containing the following:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: <config name>
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration to set sysctl settings
      [sysctl]
      kernel.randomize_va_space=2
      net.core.rmem_max=67108864
      net.core.wmem_max=67108864
      net.core.rmem_default=67108864
      net.core.wmem_default=67108864
      net.core.netdev_max_backlog=300000
      net.core.optmem_max=67108864
      net.ipv4.udp_mem="1124736 10000000 67108864"
      kernel.core_pattern="|/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h"
    name: <config name>
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker"
    priority: 19
    profile: <config name>

The kernel.core_pattern argument sets up core file handling using systemd as described in the requirements section.

Apply this TuneD profile with:

oc apply -f tuned.yaml

Verify that the TuneD profile has been applied successfully with:

$ oc get profile <node name> -n openshift-cluster-node-tuning-operator
NAME           TUNED           APPLIED   DEGRADED   AGE
(node name)    (config name)   True      False      63m

where <node name> is the name of a worker node in the cluster (the list of all available nodes in the cluster is given by the command oc get nodes). This should be checked for all worker nodes. The profile should be successfully applied (i.e. DEGRADED should be False). The worker nodes are now set up so that XRd can be deployed on them. However, before doing so we shall create SR-IOV network resources to use for networking.

SR-IOV network resources

In OpenShift deployments, XRd Control Plane is able to use SR-IOV virtual functions (VFs) for its data and management interfaces. To use VFs in OpenShift, SR-IOV networking resource pools must first be created. This section runs through the creation of SR-IOV network resource pools to be used by XRd Control Plane.

The following example uses the OpenShift SR-IOV Network Operator. Instructions for installing this operator can be found here.

In this example, we will create a resource pool of four VFs on a single physical interface.

We will create resource pools on a per-node basis. First, we must gather information about the available interfaces.

Gathering information

The oc debug command is used to gather interface information. Enter the debug pod for the worker node with

oc debug node/<node name>

To get a list of physical networking devices available on the worker node, run

ls -l /sys/class/net/ | grep -v virtual | grep -oE "net/(.*)" | sed -e 's_net/__'

This returns a list of the Linux interface names of the physical networking devices; make a note of these. To find the device type and PCI address of an interface named <interface name>, run

ls /sys/bus/pci/devices/*/net | grep <interface name> -B 1 | grep -oE '([0-9a-fA-F]{2}\:[0-9a-fA-F]{2}\.[0-9])' | xargs lspci -s
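
The output is the standard lspci description of the matching device. For an Intel E810 NIC, for example, it would look something like the following (illustrative values only):

51:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP

Once this information is noted, leave the debug shell by running exit.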

Creating SR-IOV resource pools

Create the file <network node policy name>.netnodepolicy.yaml, where <network node policy name> is a unique name for the policy. Into this file copy the following:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <network node policy name>
  namespace: openshift-sriov-network-operator
spec:
  resourceName: <resource name>
  nodeSelector:
    kubernetes.io/hostname: <node name>
  numVfs: 4
  nicSelector:
    pfNames:
    - <interface name>
  deviceType: netdevice

where <resource name> is the name of the SR-IOV resource that will be created and <interface name> is the Linux name of the physical interface that is to be used from the list of available interfaces found above. Note that more than one interface can be included in the same SR-IOV resource.

To create SR-IOV resources on multiple worker nodes, create SR-IOV Network Node Policies for each node.

When the SriovNetworkNodePolicy is applied, the operator creates numVfs VFs on the available NICs that meet the criteria specified under the nicSelector key. Other options for the NIC selector include the device PCI address(es). See the OpenShift documentation for more information on the options available in the SriovNetworkNodePolicy.
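
For example, a sketch of an alternative nicSelector that selects the device by PCI address instead of interface name (the address shown is a placeholder):

  nicSelector:
    rootDevices:
    - "0000:51:00.0"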

Apply the SriovNetworkNodePolicy with

oc apply -f <network node policy name>.netnodepolicy.yaml

Running oc get node <node name> -o jsonpath='{.status.allocatable}' will show the number of each resource available. Once the SR-IOV network resources are created, you should see that there are four of "openshift.io/<resource name>" available, i.e.

$ oc get node <node name> -o jsonpath='{.status.allocatable}'
{...,"openshift.io/xrd_sriov_resource":"4",...}

Namespace and Service Account

XRd must run in a custom Namespace and needs to run as a privileged Pod to access some of the host’s resources. To achieve this, we will use a Service Account with access to a privileged Security Context Constraint.

This section describes how to set up a Namespace and Service Account for XRd to use.

Namespace

Kubernetes uses Namespaces to provide logical isolation of resources. The following manifest will be used to create a Namespace called “xrd”. Create a file xrd.namespace.yaml containing

apiVersion: v1
kind: Namespace
metadata:
  name: xrd
  labels:
    kubernetes.io/metadata.name: xrd

Apply this with:

oc apply -f xrd.namespace.yaml

Once the manifest is applied, verify that the Namespace is created:

$ oc get project | grep xrd
xrd                               Active

Switch to the newly created Namespace (so that future commands use this Namespace by default) by entering:

$ oc project xrd
Now using project "xrd" on server "(server address)".

Service Account

XRd needs privileged access to some of the host’s resources and so needs to be run as a privileged Pod. To do this a Service Account with access to a privileged Security Context Constraint is required. Create a file xrd.sa.yaml containing

apiVersion: v1
kind: ServiceAccount
metadata:
  name: xrd-sa
  namespace: xrd

Applying this will create a Service Account named “xrd-sa”. Apply it with

oc apply -f xrd.sa.yaml

Verify that the Service Account has been created with

$ oc get serviceaccount xrd-sa
NAME       SECRETS   AGE
xrd-sa     0         1s

Next, we bind the xrd-sa Service Account to a privileged role. To do this, we use the default privileged Security Context Constraint in OpenShift. Create a file xrd.rolebinding.yaml containing:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:openshift:scc:privileged
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
- kind: ServiceAccount
  name: xrd-sa
  namespace: xrd

Apply the manifest with:

oc apply -f xrd.rolebinding.yaml

Verify that xrd-sa is now bound to the privileged role:

$  oc get clusterrolebinding system:openshift:scc:privileged -o wide
NAME                                                                        ROLE                                                                                    AGE   USERS                                                            GROUPS                                         SERVICEACCOUNTS
system:openshift:scc:privileged                                             ClusterRole/system:openshift:scc:privileged                                             34m                                                                                                                   xrd/xrd-sa

Under the SERVICEACCOUNTS heading xrd/xrd-sa should be listed.
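
Alternatively, the same privileged SCC access can be granted without writing the ClusterRoleBinding manifest by hand, using the oc adm policy helper:

oc adm policy add-scc-to-user privileged -z xrd-sa -n xrd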

Running XRd Control Plane

We have now configured the worker nodes and created OpenShift resources required to run XRd. This section describes how to run XRd Pods in this environment.

First, add the XRd Helm repository to the machine from which you are interacting with the cluster with

helm repo add xrd https://ios-xr.github.io/xrd-helm

Verify that the XRd Helm repository has been successfully added with

$ helm repo list
NAME        	URL
xrd         	https://ios-xr.github.io/xrd-helm

The xrd repo should be present.

Installing XRd Control Plane

Create a file xrd.yaml containing the following

image:
  repository: <repository uri containing Control Plane image>
  tag: <tag>
  pullSecrets:
  - name: <image pull secrets>

config:
  username: <username>
  password: <password>
  # ASCII XR configuration to be applied on XR boot, including configuration for SSH which will be accessible over a Mgmt interface.
  ascii: |
    hostname xrd-1
    ssh server v2

# XRd line interfaces.
interfaces:
- type: sriov
  resource: openshift.io/<resource name>
  config:
    type: sriov
    trust: "on"
    spoofChk: "off"

# Management interfaces. snoopIpv4Address adds detected IP address to XR config to allow SSH access.
mgmtInterfaces:
- type: defaultCni
  chksum: true
  snoopIpv4Address: true

resources:
  requests:
    cpu: '4'
    memory: '8Gi'
  limits:
    cpu: '8'
    memory: '12Gi'

serviceAccountName: xrd-sa

# Persistent storage
persistence:
  enabled: true
  size: 3Gi
  accessModes:
  - ReadWriteOnce
  existingVolume: <persistent volume name>
  storageClass: <storage class name>

where:

  • <repository uri containing Control Plane image> is an image repository containing XRd Control Plane images
  • <tag> is the image tag to use
  • <image pull secrets> is a standard Kubernetes Pod imagePullSecrets array (see here)
  • <username> is a username to log in to XR with
  • <password> is a password to log in to XR with
  • <persistent volume name> is the name of the Persistent Volume
  • <storage class name> is the name of the Storage Class to which the Persistent Volume belongs

The Storage Class and Persistent Volume specified under persistence must match the prerequisite Persistent Volume which must have at least 3GB free capacity. In this example, a dedicated (per XRd) Persistent Volume with an access mode of ReadWriteOnce is used.

Note that if no persistent storage is given, for example if the persistence section is omitted, then ephemeral storage is used instead. This is not suitable for production deployment scenarios due to high-availability considerations: XR configuration and state are lost whenever the Pod restarts.

The xrd.yaml Helm values file will be used to install an XRd Control Plane instance as a Burstable Pod that requests 4 CPU cores and 8GB of memory, with limits of 8 CPU cores and 12GB of memory respectively. If desired, XRd Control Plane can be deployed as a Guaranteed Pod by specifying equal requests and limits for both the number of CPU cores and the amount of memory.

The XRd Control Plane instance will have a single interface drawn from the <resource name> SR-IOV network resource pool created in the SR-IOV network resources section. Setting config.trust to "on" for VFs is required to be able to receive multicast traffic (and is therefore required for multicast-based protocols, such as IPv6 ND, to work). Setting config.spoofChk to "off" for VFs is required to send packets from other unicast MAC addresses (such as the VRRP vMAC), and is therefore required for protocols such as VRRP.
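
These VF settings can be spot-checked from a debug shell on the worker node once XRd is running; each VF line in the ip link output should report spoof checking off and trust on:

oc debug node/<node name>
chroot /host ip link show <interface name>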

The XRd instance will also have a management interface that uses the Pod’s default veth interface on the cluster network. The applied configuration allows SSH access to XRd Control Plane at the Pod’s IP address.

Multiple interfaces and management interfaces can be requested from multiple SR-IOV resource pools by listing the desired interfaces under the interfaces and mgmtInterfaces keys, for example:

# XRd line interfaces.
interfaces:
- type: sriov
  resource: openshift.io/<resource name>
  config:
    type: sriov
    trust: "on"
    spoofChk: "off"
- type: sriov
  resource: openshift.io/<resource name of another resource pool>
  config:
    type: sriov
    trust: "on"
    spoofChk: "off"

# XRd management interfaces.
mgmtInterfaces:
- type: sriov
  resource: openshift.io/<resource name of another resource pool>
  config:
    type: sriov
    trust: "on"

The full range of options supported in the XRd Control Plane Helm values file is documented here.

Then, install XRd Control Plane with

helm install xrd xrd/xrd-control-plane -f xrd.yaml
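
If values in xrd.yaml are changed later, the running release can be updated in place, or removed entirely, with the standard Helm lifecycle commands:

helm upgrade xrd xrd/xrd-control-plane -f xrd.yaml
helm uninstall xrd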

Accessing XRd

The installation steps above start an XRd Pod on a worker node. It will take around a minute to come up as the image is pulled from the repository and XRd boots.

The progress can be monitored manually by looking for the status of the XRd Pod using:

$ oc get pod xrd-xrd-control-plane-0
NAME                      READY   STATUS    RESTARTS   AGE
xrd-xrd-control-plane-0   1/1     Running   0          4m36s

Once the Pod is in Running state, we can connect to the Pod using SSH. To do so, first identify the IP address assigned to the Pod by running

oc get pod xrd-xrd-control-plane-0 --template '{{.status.podIP}}'

Make a note of the returned IP address. Then, wait for XR to finish booting. To see the current status, run

oc logs xrd-xrd-control-plane-0

Retry this until the following log is seen:

$ oc logs xrd-xrd-control-plane-0
(omitted output)
RP/0/RP0/CPU0:Oct 22 16:21:29.194 UTC: ifmgr[229]: %PKT_INFRA-LINK-3-UPDOWN : Interface MgmtEth0/RP0/CPU0/0, changed state to Up

Now, the Pod can be accessed via SSH from an endpoint with access to the cluster-internal network (for example if the network has been exposed via a Kubernetes Service (see Kubernetes documentation) or by using the node as a jump host) using

ssh <username>@<Pod IP>

where <Pod IP> is the address returned in the previous step. Enter the password <password> for access.

Once logged in, the status of the XR interfaces can be checked by running show ip interfaces brief, which should show a management interface and a data interface in the output, similar to:

RP/0/RP0/CPU0:xrd-1#show ip interfaces brief
Wed Jul 10 12:55:05.810 UTC

Interface                      IP-Address      Status          Protocol Vrf-Name
MgmtEth0/RP0/CPU0/0            (Pod IP)        Up              Up       default
GigabitEthernet0/0/0/0         unassigned      Shutdown        Down     default

The SSH connection can be closed by running exit.

Summary

XRd Control Plane will now be running in Red Hat OpenShift.

This was covered in four steps from a functioning OpenShift cluster:

  1. The machine was set up for running XRd using Machine Config and TuneD
  2. SR-IOV networking resources were created
  3. A namespace and Service Account for running XRd were created
  4. An XRd Control Plane workload was deployed on a worker node

Appendix A: Using physical functions (PFs)

This section replaces the steps for creating SR-IOV network resources if physical functions (PFs), instead of VFs, are to be used for network resources. The OpenShift SR-IOV Network Operator cannot be used with PFs. Instead, the SR-IOV Network Device Plugin must be used directly. This section assumes that the SR-IOV Network Device Plugin, the Host-Device CNI, and the Network Resources Injector are all installed on the node.

PF resource bundles are created by configuring the SR-IOV Network Device Plugin using a ConfigMap such as

apiVersion: v1
kind: ConfigMap
metadata:
  name: <config name>
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [{
          "resourceName": "<resource name>",
          "resourcePrefix": "<resource prefix>",
          "selectors": [{
            "pciAddresses": ["<pci address>"]
          }]
        }
      ]
    }

where:

  • <config name> is a name for the Config Map
  • <resource name> is a unique name for the resource in the scope of the below resource prefix
  • <resource prefix> is a prefix for the resource name
    • If none is specified, ‘intel.com’ is used by default
  • <pci address> is the PCI address of the PF
    • Alternative resource selectors can be used, see here

Note that resource pools of pre-existing VFs can also be created using this method.
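
The plugin itself runs as a DaemonSet. As a sketch, the ConfigMap above and the upstream DaemonSet manifest can be applied as follows (the manifest path is taken from the upstream sriov-network-device-plugin repository and may differ between versions):

oc apply -f <config name>.configmap.yaml
oc apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/sriov-network-device-plugin/master/deployments/sriovdp-daemonset.yaml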

Deploying the SR-IOV Network Device Plugin with this ConfigMap then creates the SR-IOV resource bundles. If the deployment is successful, running oc get node <node name> -o json | jq '.status.allocatable' will show the created SR-IOV resources along with the number available:

$ oc get node <node name> -o json | jq '.status.allocatable'
{
  ...
  "(resource prefix)/(resource name)": "1",
  ...
}

When using the SR-IOV Network Device Plugin directly, the desired network resources must be requested in the Helm values file under both the resources.requests and resources.limits keys, in addition to the interface specification. For example

...

resources:
  requests:
    <resource prefix>/<resource name>: 1
    (other requests)
  limits:
    <resource prefix>/<resource name>: 1
    (other limits)

interfaces:
- type: sriov
  resource: <resource prefix>/<resource name>
  config:
    # The SR-IOV CNI cannot be used for PFs, so instead use Host Device CNI
    type: host-device

...

in addition to the other config in xrd.yaml above. (The requirement to request resources explicitly can be relaxed if the Network Resources Injector is used.)

Appendix B: Troubleshooting

OpenShift SR-IOV Network Operator install fails with message Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline

This is a known Red Hat bug.

Workaround:

  1. Find the corresponding job and configmap (usually named the same) in the openshift-marketplace namespace by searching for the operator name or keyword in their contents.
$ oc get job -n openshift-marketplace -o json | jq -r '.items[] | select(.spec.template.spec.containers[].env[].value|contains ("sriov")) | .metadata.name'
  2. Delete the job and the corresponding configmap (named the same as the job) in the openshift-marketplace namespace.
$ oc delete job <job-name> -n openshift-marketplace
$ oc delete configmap <job-name> -n openshift-marketplace
  3. Delete the subscription.
$ oc delete sub sriov-network-operator-subscription -n openshift-sriov-network-operator

Then try to reinstall the subscription.
