Pod Scheduling for Multi-cluster Deployments
SUMMARY Juniper Cloud-Native Contrail Networking (CN2) release 23.2 supports network-aware pod scheduling for multi-cluster deployments. This article provides information about the changes to this feature from the previous release (23.1), and additional components for multi-cluster support.
Pod Scheduling in CN2
This article provides information about network-aware pod scheduling for multi-cluster deployments. For information about this feature for single-cluster deployments, see Pod Scheduling.
CN2's custom scheduler, contrail-scheduler, schedules
pods based on the following network metrics:
- Number of active ingress/egress traffic flows
- Bandwidth utilization
- Number of virtual machine interfaces (VMIs)
Network-Aware Pod Scheduling Overview
Many high-performance applications have bandwidth or network interface requirements as well as the typical CPU or VMI requirements. If contrail-scheduler assigns a pod to a
node with low bandwidth availability, that application cannot run optimally. CN2 release 23.1
addressed this issue with the introduction of a metrics-collector, a
central-collector, and custom scheduler plugins. These components collect,
store, and process network metrics so that the contrail-scheduler schedules
pods based on these metrics. CN2 release 23.2 enables multi-cluster deployments to take
advantage of this feature.
Custom Controllers
In 23.2, CN2 introduces a MetricsConfig CR and a CentralCollector CR. The following two new controllers manage these CRs:
- MetricsConfig controller: This controller reconciles and monitors the MetricsConfig custom resource (CR). This controller writes config information for the metrics-collectors in the central and distributed clusters. This controller also listens to create, read, update, delete (CRUD) events of kubemanager resources on the central cluster. The config file of a metrics-collector contains the following information:
  - The writeLocation field, which references the metrics-collector configMap details for mounting: the name, namespace, and key of the configMap where the metrics-collector writes its configuration.
  - The receiving service's IP address and port. The metrics-collector uses this service IP and port to forward requested metrics data.
  - Configuration data regarding Transport Layer Security (TLS), if required.
  - A metrics section, which defines the types of metrics the receiving service is requesting.
The following is an example of a metrics-collector config:
receivers:
- encoding: json
  metrics:
  - <relevant_metrics>
  - <relevant_metrics>
  serviceName: <service_name_or_ip>
  port: <receiver_listener_port>
writeLocation:
  name: mc_configmap
  namespace: contrail
  key: config.yaml
- CentralCollector controller: This controller reconciles and monitors the CentralCollector CR and manages the life cycle management (LCM) of the central-collector. Additionally, it creates a MetricsConfig CR, which in turn drives the metrics-collector configuration. The CentralCollector controller also creates the clusterIP service for the central-collector. Unlike 23.1, you do not need to create a central-collector Deployment or configMap. Create and apply the CR, and the CentralCollector controller manages all configuration, deployment, and LCM functions.
apiVersion: collectors.juniper.net/v1
kind: CentralCollector
metadata:
  name: central-collector
  namespace: contrail
spec:
  common:
    containers:
    - image: <REGISTRY>/central-collector:<TAG>
      name: central-collector
  metricsCollectorConfigmapLoc:
    name: "metrics-collector-configmap"
    namespace: contrail
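For illustration, the metrics-collector config structure described above can be modeled and sanity-checked as follows. This is a hypothetical Python sketch, not part of CN2: the metric names, service name, port, and the validate helper are invented; only the field names come from the example config.

```python
# Hypothetical model of the metrics-collector config shown earlier.
# Field names follow the example; values are invented placeholders.
config = {
    "receivers": [
        {
            "encoding": "json",
            "metrics": ["vmi_count", "flow_count"],  # example metric names
            "serviceName": "central-collector",      # receiving service
            "port": 50051,                           # example listener port
        }
    ],
    "writeLocation": {  # configMap the collector writes its config to
        "name": "mc_configmap",
        "namespace": "contrail",
        "key": "config.yaml",
    },
}

def validate(cfg: dict) -> bool:
    """Check for the fields the MetricsConfig controller is described as writing."""
    has_receiver = all(
        {"metrics", "serviceName", "port"} <= r.keys() for r in cfg["receivers"]
    )
    has_write_loc = {"name", "namespace", "key"} <= cfg["writeLocation"].keys()
    return has_receiver and has_write_loc
```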
Multi-cluster Pod Scheduling Components
Aside from the CentralCollector controller and the
MetricsConfig controller, the following components comprise CN2's
network-aware pod scheduling solution for multi-cluster deployments:
- Metrics-collector: This component runs in a container in the vRouter pod on each node in the central cluster and distributed clusters. The metrics-collector forwards requested data to the sinks specified in its configuration. The central-collector is one of these configured sinks and receives this data from the metrics-collector. This release adds an additional field in the config file of the metrics-collector. This field designates a cluster name, specifying which cluster the metrics-collector is collecting data from; the metrics-collector sends this name in the metadata to the receiver.
- Central-collector: This component acts as an aggregator and stores data received from all of the nodes in a cluster via the metrics-collector. The central-collector exposes gRPC endpoints which consumers use to request this data for nodes in a cluster.
- Contrail-scheduler: This custom scheduler introduces the following three custom plugins:
  - VMICapacity plugin (available from release 22.4 onwards): Implements the Filter, Score, and NormalizeScore extension points in the scheduler framework. The contrail-scheduler uses these extension points to determine the best node to assign a pod to based on active VMIs.
  - FlowsCapacity plugin: Determines the best node to schedule a pod on based on the number of active flows on a node. Too many traffic flows on a node means more competition for new pod traffic. Nodes with a lower flow count are ranked higher by the scheduler.
  - BandwidthUsage plugin: Determines the best node to assign a pod to based on the bandwidth usage of a node. The node with the least bandwidth usage (incoming and outgoing traffic) per second is ranked highest.
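The ranking idea behind the FlowsCapacity and BandwidthUsage plugins can be sketched as follows. This is an illustrative Python sketch of the Score/NormalizeScore concept, not CN2's implementation; the function, node names, and numbers are invented for the example.

```python
# Illustration of the NormalizeScore idea: per-node metrics where lower is
# better (active flows, bandwidth usage) are mapped to 0-100 scores where
# higher is better, so the scheduler can rank nodes.
def normalize_scores(raw_metrics: dict[str, float]) -> dict[str, int]:
    """Map raw per-node metrics (lower is better) to 0-100 scores (higher is better)."""
    worst = max(raw_metrics.values())
    if worst == 0:
        return {node: 100 for node in raw_metrics}
    return {
        node: round(100 * (worst - value) / worst)
        for node, value in raw_metrics.items()
    }

# Example: active flow counts reported per node (invented numbers).
flows = {"node-a": 120, "node-b": 30, "node-c": 0}
scores = normalize_scores(flows)
best = max(scores, key=scores.get)  # node-c has no active flows, so it ranks highest
```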
Metrics Collector Deployment
CN2 includes the metrics collector in vRouter pod deployments by default. The agent: default: field of the vRouter spec contains a collectors: field, which is configured with the metrics collector receiver address. The example below shows the value collectors: - localhost:6700. Since the metrics collector runs in the same pod as the vRouter agent, it can communicate over the localhost port. Note that port 6700 is fixed as the metrics collector receiver address and cannot be changed. The vRouter agent sends metrics data to this address.
The following is a section of a default vRouter deployment with the collector enabled:
apiVersion: dataplane.juniper.net/v1
kind: Vrouter
metadata:
  name: contrail-vrouter-nodes
  namespace: contrail
spec:
  agent:
    default:
      collectors:
      - localhost:6700
      xmppAuthEnable: true
      sandesh:
        introspectSslEnable: true
Create Permissions
After you configure the vRouter to send metrics to the metrics-collector
over port 6700, you must apply a Role-Based Access Control (RBAC) manifest. Applying this manifest
creates required permissions for the contrail-telemetry-controller. The
contrail-telemetry-controller reconciles the
MetricsConfig CR and creates the configMap for the
metrics-collector.
The following is an example RBAC manifest. Note that this manifest also creates the
namespace (contrail-analytics) for the metrics-collector.
apiVersion: v1
kind: Namespace
metadata:
  name: contrail-analytics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: contrail-telemetry-controller
  namespace: contrail-analytics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: contrail-telemetry-controller-role
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  - pods/status
  verbs:
  - get
  - list
  - watch
  - patch
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - get
  - update
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - telemetry.juniper.net
  resources:
  - metricsconfigs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - telemetry.juniper.net
  resources:
  - metricsconfigs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - configplane.juniper.net
  resources:
  - kubemanagers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - configplane.juniper.net
  resources:
  - kubemanagers/status
  verbs:
  - get
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: contrail-telemetry-controller-role-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: contrail-telemetry-controller-role
subjects:
- kind: ServiceAccount
  name: contrail-telemetry-controller
  namespace: contrail-analytics
After applying the RBAC manifest, you must create a Deployment for the
contrail-telemetry-controller. The following is an example
Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: contrail-telemetry-controller
  namespace: contrail-analytics
  labels:
    app: contrail-telemetry-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: contrail-telemetry-controller
  template:
    metadata:
      labels:
        app: contrail-telemetry-controller
    spec:
      serviceAccount: contrail-telemetry-controller
      securityContext:
        fsGroup: 2000
        runAsGroup: 3000
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: contrail-telemetry-controller
        image: <REGISTRY>/contrail-telemetry-controller:<TAG>
        command:
        - /contrail-telemetry-controller
Central Collector Deployment
Apply the CentralCollector CR. Once applied, the
CentralCollector controller creates all of the necessary objects for the
central-collector. The following is an example
CentralCollector CR.
apiVersion: collectors.juniper.net/v1
kind: CentralCollector
metadata:
  name: central-collector
  namespace: contrail
spec:
  common:
    containers:
    - image: <REGISTRY>/central-collector:<TAG>
      name: central-collector
  metricsCollectorConfigmapLoc:
    name: "metrics-collector-configmap"
    namespace: contrail
Create Configmaps
Perform the following steps to create configMaps for multi-cluster
pod-scheduling components.
Perform these steps in each of the clusters of your multi-cluster environment.
- Create a configMap called cluster-details:
  Applying the CentralCollector CR in the previous section also creates a configMap called cluster-details. The CR creates this configMap in the same namespace as the CR. You must replicate this configMap in the namespace where you intend to deploy contrail-scheduler. The configMap includes the following information:
  - Central-collector's service clusterIP
  - Metrics gRPC service port: Used by the contrail-scheduler to retrieve and process network metrics and schedule pods accordingly.
  - Name of the cluster where the configMap is located: Used to identify the cluster where contrail-scheduler is running.
  The following is an example cluster-details configMap:
clustername: <name_of_the_cluster>
centralcollectoraddress: <ip:port>
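As an illustration of how a consumer such as contrail-scheduler might read these two fields, the following hypothetical Python sketch splits centralcollectoraddress into a host and port. The parsing code and example values are invented; only the field names come from the cluster-details configMap above.

```python
# Hypothetical sketch (not CN2 code) of reading the cluster-details data.
def parse_cluster_details(details: dict[str, str]) -> tuple[str, str, int]:
    """Return (cluster name, central-collector host, metrics gRPC port)."""
    # rpartition splits on the last ":" so IPv4 addresses stay intact.
    host, _, port = details["centralcollectoraddress"].rpartition(":")
    return details["clustername"], host, int(port)

details = {
    "clustername": "distributed-cluster-1",        # example value
    "centralcollectoraddress": "10.0.0.12:50051",  # example ip:port
}
name, host, port = parse_cluster_details(details)
```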
- Create a vmi-config configMap: This configMap defines the maximum VMI count allowed on DPDK nodes. The following is an example configMap:
nodeLabels:
  "agent-mode": "dpdk"
maxVMICount: 64
- Create a configMap for the contrail-scheduler: This configMap defines the contrail-scheduler configuration. The following is an example configMap with VMICapacity, FlowsCapacity, and BandwidthUsage plugin information:
apiVersion: kubescheduler.config.k8s.io/v1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /tmp/config/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: no-plugins-scheduler
- schedulerName: vmi-scheduler
  pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: VMICapacityArgs
      config: /tmp/vmi/config.yaml
      clusterConfig: /tmp/cluster/config.yaml
    name: VMICapacity
  plugins:
    multiPoint:
      enabled:
      - name: VMICapacity
- schedulerName: flows-scheduler
  pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: FlowsCapacityArgs
      clusterConfig: /tmp/cluster/config.yaml
    name: FlowsCapacity
  plugins:
    multiPoint:
      enabled:
      - name: FlowsCapacity
- schedulerName: bandwidth-scheduler
  pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: BandwidthUsageArgs
      clusterConfig: /tmp/cluster/config.yaml
    name: BandwidthUsage
  plugins:
    multiPoint:
      enabled:
      - name: BandwidthUsage
- schedulerName: contrail-scheduler
  pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: VMICapacityArgs
      config: /tmp/vmi/config.yaml
      clusterConfig: /tmp/cluster/config.yaml
    name: VMICapacity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: FlowsCapacityArgs
      clusterConfig: /tmp/cluster/config.yaml
    name: FlowsCapacity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: BandwidthUsageArgs
      clusterConfig: /tmp/cluster/config.yaml
    name: BandwidthUsage
  plugins:
    multiPoint:
      enabled:
      - name: VMICapacity
        weight: 50
      - name: FlowsCapacity
        weight: 1
      - name: BandwidthUsage
        weight: 20
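The weights in the contrail-scheduler profile above (VMICapacity 50, FlowsCapacity 1, BandwidthUsage 20) determine how strongly each plugin influences node ranking. The following hypothetical Python sketch illustrates the weighted-sum idea the kube-scheduler framework applies to normalized plugin scores; the node names and score values are invented for the example.

```python
# Illustrative sketch (not CN2 code): each plugin's normalized 0-100 node
# score is multiplied by its weight from the scheduler profile, then
# summed per node. The weights match the contrail-scheduler profile above.
WEIGHTS = {"VMICapacity": 50, "FlowsCapacity": 1, "BandwidthUsage": 20}

def combined_score(plugin_scores: dict[str, dict[str, int]]) -> dict[str, int]:
    """plugin_scores maps plugin name -> {node: normalized 0-100 score}."""
    totals: dict[str, int] = {}
    for plugin, per_node in plugin_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + WEIGHTS[plugin] * score
    return totals

# Example normalized scores (invented numbers):
scores = {
    "VMICapacity":    {"node-a": 80, "node-b": 40},
    "FlowsCapacity":  {"node-a": 10, "node-b": 90},
    "BandwidthUsage": {"node-a": 30, "node-b": 70},
}
totals = combined_score(scores)
# node-a: 50*80 + 1*10 + 20*30 = 4610; node-b: 50*40 + 1*90 + 20*70 = 3490
```

With these weights, VMICapacity dominates the ranking, which is why node-a wins despite its high flow count.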
- Create a ServiceAccount object (required) and configure the ClusterRoles for the ServiceAccount. A ServiceAccount assigns a role to a pod or component within a cluster. In the example below, the ServiceAccount is granted the same permissions as the default Kubernetes scheduler (kube-scheduler). The following is an example ServiceAccount:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: contrail-scheduler
  namespace: contrail-scheduler
The following is an example of the ClusterRoleBinding objects:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: contrail-scheduler
subjects:
- kind: ServiceAccount
  name: contrail-scheduler
  namespace: contrail-scheduler
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: contrail-scheduler-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: contrail-scheduler
  namespace: contrail-scheduler
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: contrail-scheduler-extension-apiserver-authentication-reader
  namespace: contrail-scheduler
roleRef:
  kind: Role
  name: extension-apiserver-authentication-reader
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: contrail-scheduler
  namespace: contrail-scheduler
- Create a kubeconfig Secret. Mount the kubeconfig file within the scheduler container when you apply the contrail-scheduler Deployment.
kubectl create secret generic kubeconfig -n contrail-scheduler --from-file=kubeconfig=<path-to-kubeconfig-file>
- Create a Deployment for the contrail-scheduler. The following is an example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: contrail-scheduler
  namespace: contrail-scheduler
  labels:
    app: scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scheduler
  template:
    metadata:
      labels:
        app: scheduler
    spec:
      serviceAccountName: contrail-scheduler
      securityContext:
        fsGroup: 2000
        runAsGroup: 3000
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: contrail-scheduler
        image: <REGISTRY>/contrail-scheduler:<TAG>
        command:
        - /contrail-scheduler
        - --authentication-kubeconfig=/tmp/config/kubeconfig
        - --authorization-kubeconfig=/tmp/config/kubeconfig
        - --config=/tmp/scheduler/scheduler-config
        - --secure-port=10271
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 8
          httpGet:
            path: /healthz
            port: 10271
            scheme: HTTPS
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 30
        resources:
          requests:
            cpu: 100m
        startupProbe:
          failureThreshold: 24
          httpGet:
            path: /healthz
            port: 10271
            scheme: HTTPS
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 30
        volumeMounts:
        - mountPath: /tmp/config
          name: kubeconfig
          readOnly: true
        - mountPath: /tmp/scheduler
          name: scheduler-config
          readOnly: true
        - mountPath: /tmp/vmi
          name: vmi-config
          readOnly: true
        - mountPath: /tmp/cluster
          name: cluster-details
          readOnly: true
      hostPID: false
      volumes:
      - name: kubeconfig
        secret:
          secretName: kubeconfig
      - name: scheduler-config
        configMap:
          name: scheduler-config
      - name: vmi-config
        configMap:
          name: vmi-config
      - name: cluster-details
        configMap:
          name: cluster-details
Note: When creating Secrets or configMaps, ensure that the key used during creation matches the key used when mounting them in the Deployment. For example, in the Deployment above, the path /tmp/scheduler/scheduler-config is provided in the command section, and /tmp/scheduler is provided in the volume mount section. In this case, the key is "scheduler-config". If the keys do not match, you must explicitly specify a custom key in the volume mounts for custom file names. If no file name is given explicitly, use "config.yaml" as the key name.
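The key-matching rule in the note above can be illustrated with a short sketch: a configMap key mounted at mountPath appears inside the container at mountPath/key. The helper below is hypothetical, not part of CN2; it simply shows why the --config path in the scheduler command must line up with the mount path and key.

```python
# Illustration of the note above: a configMap key mounted at mountPath
# appears in the container filesystem as mountPath/key.
import posixpath

def mounted_file(mount_path: str, key: str) -> str:
    """Path at which a configMap key appears inside the container."""
    return posixpath.join(mount_path, key)

# From the example Deployment: --config=/tmp/scheduler/scheduler-config
expected = "/tmp/scheduler/scheduler-config"
actual = mounted_file("/tmp/scheduler", "scheduler-config")
assert actual == expected  # keys match, so the scheduler finds its config
```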