Is Your Infra Ready for Telemetry?
In our previous post we explained how to build your Telemetry Collection Stack using open source tools such as Pipeline, InfluxDB and others, and we provided the installation code for the stack for your convenience. This Telemetry Collection Stack will be the basis for the future use cases we will share here at xrdocs.io. But there is one crucial step to take before moving forward: when you start the installation of the Collection Stack, you will probably ask yourself what kind of server you need to store and process the counters. In this post, we will show you the utilization of the server in our scenario. It is not meant to be a full guide covering all possible scenarios, but it does describe a fairly scaled telemetry environment, and it should give you a reasonable understanding of how to select a server for your telemetry needs.
Telemetry Configuration Overview
Before moving to the server side, let's see what we have on the telemetry side. The main router used in our testing was an NCS5501 running IOS XR 6.3.2. The following sensor groups were configured:
sensor-group fib
sensor-path Cisco-IOS-XR-fib-common-oper:fib-statistics/nodes/node/drops
sensor-path Cisco-IOS-XR-fib-common-oper:fib/nodes/node/protocols/protocol/vrfs/vrf/summary
!
sensor-group brcm
sensor-path Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper:dpa/stats/nodes/node/hw-resources-datas/hw-resources-data
sensor-path Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper:dpa/stats/nodes/node/npu-numbers/npu-number/display/trap-ids/trap-id
sensor-path Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper:dpa/stats/nodes/node/asic-statistics/asic-statistics-for-npu-ids/asic-statistics-for-npu-id
!
sensor-group health
sensor-path Cisco-IOS-XR-shellutil-oper:system-time/uptime
sensor-path Cisco-IOS-XR-pfi-im-cmd-oper:interfaces/interface-summary
sensor-path Cisco-IOS-XR-wdsysmon-fd-oper:system-monitoring/cpu-utilization
sensor-path Cisco-IOS-XR-nto-misc-oper:memory-summary/nodes/node/summary
!
sensor-group optics
sensor-path Cisco-IOS-XR-controller-optics-oper:optics-oper/optics-ports/optics-port/optics-info
!
sensor-group mpls-te
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/te-mib/scalars
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/tunnels/summary
sensor-path Cisco-IOS-XR-ip-rsvp-oper:rsvp/interface-briefs/interface-brief
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/fast-reroute/protections/protection
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/signalling-counters/signalling-summary
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/p2p-p2mp-tunnel/tunnel-heads/tunnel-head
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/fast-reroute/backup-tunnels/backup-tunnel
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/topology/configured-srlgs/configured-srlg
sensor-path Cisco-IOS-XR-ip-rsvp-oper:rsvp/counters/interface-messages/interface-message
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/p2p-p2mp-tunnel/tunnel-remote-briefs/tunnel-remote-brief
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/signalling-counters/head-signalling-counters/head-signalling-counter
sensor-path Cisco-IOS-XR-mpls-te-oper:mpls-te/signalling-counters/remote-signalling-counters/remote-signalling-counter
!
sensor-group routing
sensor-path Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/statistics-global
sensor-path Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/neighbors/neighbor
sensor-path Cisco-IOS-XR-ip-rib-ipv4-oper:rib/rib-table-ids/rib-table-id/summary-protos/summary-proto
sensor-path Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/levels/level/adjacencies/adjacency
sensor-path Cisco-IOS-XR-ipv4-bgp-oper:bgp/instances/instance/instance-active/default-vrf/process-info
sensor-path Cisco-IOS-XR-ip-rib-ipv6-oper:ipv6-rib/rib-table-ids/rib-table-id/summary-protos/summary-proto
sensor-path Cisco-IOS-XR-ipv4-bgp-oper:bgp/instances/instance/instance-active/default-vrf/neighbors/neighbor
sensor-path Cisco-IOS-XR-ip-rib-ipv4-oper:rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/bgp/as/information
sensor-path Cisco-IOS-XR-ip-rib-ipv4-oper:rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/isis/as/information
sensor-path Cisco-IOS-XR-ip-rib-ipv6-oper:ipv6-rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/bgp/as/information
sensor-path Cisco-IOS-XR-ip-rib-ipv6-oper:ipv6-rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/isis/as/information
!
sensor-group if-stats
sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
!
sensor-group mpls-ldp
sensor-path Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/nodes/node/bindings-summary-all
sensor-path Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/global/active/default-vrf/summary
sensor-path Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/nodes/node/default-vrf/neighbors/neighbor
sensor-path Cisco-IOS-XR-mpls-ldp-oper:mpls-ldp/nodes/node/default-vrf/afs/af/interfaces/interface
!
sensor-group openconfig
sensor-path openconfig-bgp:bgp/neighbors
sensor-path openconfig-interfaces:interfaces/interface
!
sensor-group troubleshooting
sensor-path Cisco-IOS-XR-lpts-ifib-oper:lpts-ifib/nodes/node/slice-ids/slice-id
sensor-path Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface/statistics/statistic
sensor-path Cisco-IOS-XR-ipv4-arp-oper:arp/nodes/node/traffic-interfaces/traffic-interface
!
We're streaming counters from different functional areas:
- health: CPU utilization, memory, uptime and interface summary stats;
- optics: RX/TX power levels for transceivers;
- if-stats: interface counters (RX/TX bytes, packets, errors, etc.);
- routing: a large set of ISIS and BGP counters;
- fib: FIB stats;
- mpls-ldp: MPLS LDP stats (interfaces, bindings, neighbors);
- mpls-te: a large number of counters for MPLS-TE tunnels and RSVP-TE;
- brcm: NPU-related counters;
- troubleshooting: a set of stats about possible errors/drops on the router;
- openconfig: interface and BGP stats using OpenConfig models.
Our next step is to calculate the number of counters the router will push to the collector. This was done in several steps:
- The number of counters in every sensor path was determined.
- Each sensor path collects data per element (per NPU, per neighbor, per interface, etc.), so the element count was taken into account.
- The total is the sum, over all paths, of the counters per path multiplied by the number of elements (see the small sketch below).
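The last step is simple multiplication and addition. Here is a minimal sketch of the math (the two-column file "path_counts.txt", holding counters-per-path and element-count pairs, is hypothetical; the numbers in the first example come straight from the table below):
# 36 generic interface counters x 315 interfaces = 11,340 counters per push
echo $(( 36 * 315 ))
# summing "counters elements" pairs for all sensor paths gives the grand total
awk '{ total += $1 * $2 } END { print total }' path_counts.txt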
Here is the table with the per-path results and the summary:

Telemetry Sensor Paths | Counters per path | Collected per | Elements on the router | Counters streamed
---|---|---|---|---
**sensor-group fib** | | | |
Cisco-IOS-XR-fib-common-oper.yang --tree-path fib-statistics/nodes/node/drops | 23 | per node | 2 | 46
Cisco-IOS-XR-fib-common-oper.yang --tree-path fib/nodes/node/protocols/protocol/vrfs/vrf/summary | 85 | per node / per protocol | 6 | 510
**sensor-group brcm** | | | |
Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper.yang --tree-path dpa/stats/nodes/node/hw-resources-datas/hw-resources-data | 22 | per node / per table | 5 | 110
Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper.yang --tree-path dpa/stats/nodes/node/npu-numbers/npu-number/display/trap-ids/trap-id | 16 | per node / per NPU | 2 | 32
Cisco-IOS-XR-fretta-bcm-dpa-hw-resources-oper.yang --tree-path dpa/stats/nodes/node/asic-statistics/asic-statistics-for-npu-ids/asic-statistics-for-npu-id | 67 | per node / per NPU | 2 | 134
**sensor-group health** | | | |
Cisco-IOS-XR-shellutil-oper.yang --tree-path system-time/uptime | 2 | per device | 1 | 2
Cisco-IOS-XR-pfi-im-cmd-oper.yang --tree-path interfaces/interface-summary | 10 | per device | 1 | 10
Cisco-IOS-XR-wdsysmon-fd-oper.yang --tree-path system-monitoring/cpu-utilization | 9 | per node | 774 | 6,966
Cisco-IOS-XR-nto-misc-oper.yang --tree-path memory-summary/nodes/node/summary | 10 | per node | 2 | 20
**sensor-group optics** | | | |
Cisco-IOS-XR-controller-optics-oper.yang --tree-path optics-oper/optics-ports/optics-port/optics-info | 398 | per transceiver | 10 | 3,980
**sensor-group mpls-te** | | | |
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/te-mib/scalars | 5 | per device | 1 | 5
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/tunnels/summary | 186 | per device | 1 | 186
Cisco-IOS-XR-ip-rsvp-oper.yang --tree-path rsvp/interface-briefs/interface-brief | 17 | per interface | 15 | 255
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/fast-reroute/protections/protection | 42 | per FRR HE tunnel | 272 | 11,424
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/signalling-counters/signalling-summary | 24 | per device | 1 | 24
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/p2p-p2mp-tunnel/tunnel-heads/tunnel-head | 900 | per HE tunnel | 272 | 244,800
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/fast-reroute/backup-tunnels/backup-tunnel | 30 | per FRR backup tunnel | 10 | 300
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/topology/configured-srlgs/configured-srlg | 7 | per device | 1 | 7
Cisco-IOS-XR-ip-rsvp-oper.yang --tree-path rsvp/counters/interface-messages/interface-message | 56 | per interface | 15 | 840
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/p2p-p2mp-tunnel/tunnel-remote-briefs/tunnel-remote-brief | 32 | per tunnel (RE) | 164 | 5,248
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/signalling-counters/head-signalling-counters/head-signalling-counter | 81 | per tunnel (HE) | 272 | 22,032
Cisco-IOS-XR-mpls-te-oper.yang --tree-path mpls-te/signalling-counters/remote-signalling-counters/remote-signalling-counter | 61 | per tunnel (RE) | 164 | 10,004
**sensor-group routing** | | | |
Cisco-IOS-XR-clns-isis-oper.yang --tree-path isis/instances/instance/statistics-global | 49 | per instance | 1 | 49
Cisco-IOS-XR-clns-isis-oper.yang --tree-path isis/instances/instance/neighbors/neighbor | 73 | per instance / per neighbor | 5 | 365
Cisco-IOS-XR-ip-rib-ipv4-oper.yang --tree-path rib/rib-table-ids/rib-table-id/summary-protos/summary-proto | 75 | per table / per protocol | 7 | 525
Cisco-IOS-XR-clns-isis-oper.yang --tree-path isis/instances/instance/levels/level/adjacencies/adjacency | 88 | per instance / per level | 1 | 88
Cisco-IOS-XR-ipv4-bgp-oper.yang --tree-path bgp/instances/instance/instance-active/default-vrf/process-info | 244 | per instance | 1 | 244
Cisco-IOS-XR-ip-rib-ipv6-oper.yang --tree-path ipv6-rib/rib-table-ids/rib-table-id/summary-protos/summary-proto | 75 | per table / per protocol | 6 | 450
Cisco-IOS-XR-ipv4-bgp-oper.yang --tree-path bgp/instances/instance/instance-active/default-vrf/neighbors/neighbor | 432 | per instance / per neighbor | 12 | 5,184
Cisco-IOS-XR-ip-rib-ipv4-oper.yang --tree-path rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/bgp/as/information | 11 | per VRF/AF/SAF/TABLE/AS | 1 | 11
Cisco-IOS-XR-ip-rib-ipv4-oper.yang --tree-path rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/isis/as/information | 11 | per VRF/AF/SAF/TABLE/AS | 1 | 11
Cisco-IOS-XR-ip-rib-ipv6-oper.yang --tree-path ipv6-rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/bgp/as/information | 11 | per VRF/AF/SAF/TABLE/AS | 1 | 11
Cisco-IOS-XR-ip-rib-ipv6-oper.yang --tree-path ipv6-rib/vrfs/vrf/afs/af/safs/saf/ip-rib-route-table-names/ip-rib-route-table-name/protocol/isis/as/information | 11 | per VRF/AF/SAF/TABLE/AS | 1 | 11
**sensor-group if-stats** | | | |
Cisco-IOS-XR-infra-statsd-oper.yang --tree-path infra-statistics/interfaces/interface/latest/generic-counters | 36 | per interface (physical and virtual) | 315 | 11,340
**sensor-group mpls-ldp** | | | |
Cisco-IOS-XR-mpls-ldp-oper.yang --tree-path mpls-ldp/nodes/node/bindings-summary-all | 18 | per node | 2 | 36
Cisco-IOS-XR-mpls-ldp-oper.yang --tree-path mpls-ldp/global/active/default-vrf/summary | 24 | per node | 2 | 48
Cisco-IOS-XR-mpls-ldp-oper.yang --tree-path mpls-ldp/nodes/node/default-vrf/neighbors/neighbor | 95 | per neighbor | 5 | 475
Cisco-IOS-XR-mpls-ldp-oper.yang --tree-path mpls-ldp/nodes/node/default-vrf/afs/af/interfaces/interface | 13 | per node/AF/interface | 5 | 65
**sensor-group openconfig** | | | |
openconfig-bgp.yang --tree-path bgp/neighbors | 81 | per neighbor | 12 | 972
openconfig-interfaces.yang --tree-path interfaces/interface | 47 | per interface | 36 | 1,692
**sensor-group troubleshooting** | | | |
Cisco-IOS-XR-lpts-ifib-oper.yang --tree-path lpts-ifib/nodes/node/slice-ids/slice-id | 27 | per node / per slice | 105 | 2,835
Cisco-IOS-XR-drivers-media-eth-oper.yang --tree-path ethernet-interface/statistics/statistic | 56 | per interface (physical and virtual) | 315 | 17,640
Cisco-IOS-XR-ipv4-arp-oper.yang --tree-path arp/nodes/node/traffic-interfaces/traffic-interface | 30 | per node/interface | 55 | 1,650
**Total** | | | | **350,637**
The total number of counters is ~350k. The biggest contributor by far is the MPLS-TE head-end tunnel stats sensor path, which includes a wealth of essential and valuable counters (IOS XR is very MPLS-TE rich!).
To double-check the math, the "dump.txt" file with the content of a single push from all the collections was examined:
cisco@ubuntu:~/analytics/pipeline/bin$ cat dump.txt | wc -l
482514
The file also contains telemetry headers and lines without counters, so, roughly, it confirms the math.
For the tests, the router used a sample interval of five seconds for every subscription. You will most probably use longer sample intervals in your installation; the goal here was to emulate a scaled (and reasonably worst-case) scenario. Several subscriptions were configured to take advantage of multithreading.
With all that information about Telemetry on the router, let’s move on!
Testing Environment Overview
Before we jump to the results, let me cover the server used and the procedure.
My testing was done on Ubuntu 16.04, running as a VMware virtual machine:
- 10 vCPUs allocated from an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz.
- An Intel I350 NIC installed in the server, with a negotiated speed of 10Gbps.
- ~10GB of DRAM (DDR4 / 2133MHz).
- ~70GB of SSD (allocated from 2x SM1625 800GB 6Gbps 2.5” SAS SSDs).
The purpose of the testing was to check the following on the server side:
- Total and per-process CPU utilization
- DRAM utilization
- Hard disk utilization
- Hard disk write speed
- Network bandwidth
- Pipeline processing throughput
The whole testing was done in three stages:
- A single router pushing counters (to get the initial values)
- Two routers pushing counters (to find the difference and make assumptions)
- Five routers pushing counters (to confirm the assumptions and do the final checks)
For every critical component in the Stack, the goal was to collect data into a TSDB (for a historical overview) and to double-check the real-time view with a command on the Linux server itself (even if the collector gathers data the same way, it is worth verifying that correct information is really being collected). Telegraf was used as the collector for the server's counters. All the relevant changes to “/etc/telegraf/telegraf.conf” are covered below. Telegraf was configured to collect information every second (a 1s interval).
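Before trusting the graphs, it is worth a quick sanity check that Telegraf metrics are actually landing in InfluxDB. A minimal check with the influx CLI (the database name "telegraf" is the default of the InfluxDB output plugin and is an assumption here):
cisco@ubuntu:~$ systemctl status telegraf
cisco@ubuntu:~$ influx -execute "show measurements" -database="telegraf"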
And now we’re fully ready to jump over to the results!
Step One: One Router
At this step there was just a single router pushing ~350k counters every five seconds.
CPU Utilization
The first component to monitor is the total CPU (per core) utilization. You should have these lines in your Telegraf configuration file to have the collection active:
# Read metrics about cpu usage
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false
## If true, compute and report the sum of all non-idle CPU states.
report_active = false
Here is a snapshot from a dashboard with the total CPU per core usage:
One-second granularity is not enough to catch the instantaneous load of each core, but it shows that all cores are loaded equally, with spikes up to ~10-11% (in idle mode, before the testing, all cores were at about ~1-2%).
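If you want an instant per-core view from the shell to compare against Telegraf, mpstat from the sysstat package (assuming it is installed) shows the same data over a 5-second window:
cisco@ubuntu:~$ mpstat -P ALL 5 1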
Per Process Load
Having a general overview is nice, but we're more interested in the primary components of the stack: InfluxDB, Pipeline, and Grafana. Telegraf also makes it possible to monitor per-process load. Add this to the Telegraf configuration file to enable the collection:
[[inputs.procstat]]
# ## Must specify one of: pid_file, exe, or pattern
# ## PID file to monitor process
exe = "grafana"
[[inputs.procstat]]
exe = "telegraf"
[[inputs.procstat]]
exe = "influxd"
[[inputs.procstat]]
exe = "pipeline"
And here is a snapshot of the per-process load with a single active router:
InfluxDB consumes the most CPU across all the monitored processes, roughly ~120-140% of a core. Pipeline takes ~50%, and Grafana's load is almost nothing compared to the first two applications (which confirms the words of the developer). This picture seems reasonable: InfluxDB does reads, compression and writes, hence it takes the most power.
The final step for checking CPU is to get a snapshot from Linux itself; “htop” was used for this.
“htop” updates its data quickly, and every ~5 seconds it is possible to catch the peak load of InfluxDB as well as Pipeline. This confirms the Telegraf data seen before (a big spike was caught).
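If you prefer a one-shot, scriptable alternative to “htop”, something like this works as well (the process names are simply those of our stack components):
cisco@ubuntu:~$ top -b -n 1 | grep -E "influxd|pipeline|grafana|telegraf"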
DRAM Utilization
Our next component to look at is DRAM. To have DRAM collected with Telegraf you don’t need to configure a lot:
[[inputs.mem]]
# no configuration
It is no secret that InfluxDB reads and writes data according to its own internal algorithms, which means DRAM and hard disk utilization constantly move up and down. Hence, it is more helpful to look at how DRAM usage changes over a longer period.
Here is a snapshot of DRAM utilization over several hours:
In idle mode, about 1.3GB of DRAM was used. According to the graph, roughly 2.5GB is used now. The difference, ~1.2GB, is what it takes to process ~350k counters at a five-second interval.
Here is a quick check from the server itself:
cisco@ubuntu:~$ free -mh
total used free shared buff/cache available
Mem: 9.8G 2.0G 2.4G 103M 5.3G 7.3G
Swap: 9G 4.8M 9G
This value confirms the information collected with Telegraf.
Hard Disk Space
Our next stop is the hard disk. Before looking through the graphs, it is important to know the retention policy configured for the database. This information will be correlated with the results.
This is the retention configuration applied:
cisco@ubuntu:~$ influx -execute "show retention policies" -database="mdt_db"
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 3h0m0s 1h0m0s 1 true
So, at most, around 4 hours of data will be stored (before a one-hour chunk of data is deleted). A short retention period was selected for the convenience of the testing. You will most likely keep data longer, but simple math can be applied whenever needed; see the example below.
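If you decide to keep data longer, the policy can be changed with a single InfluxQL statement; for example, for three days of retention (the 72h value is just an example):
cisco@ubuntu:~$ influx -execute 'ALTER RETENTION POLICY "autogen" ON "mdt_db" DURATION 72h SHARD DURATION 1h DEFAULT'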
You need this to be configured in the Telegraf configuration file for the collection to start:
# Read metrics about disk usage by mount point
[[inputs.disk]]
## By default, telegraf gather stats for all mountpoints.
## Setting mountpoints will restrict the stats to the specified mountpoints.
# mount_points = ["/"]
## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
## present on /run, /var/run, /dev/shm or /dev).
ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
This monitors the whole disk. Nothing else was running on the server, so the initially used disk space was simply subtracted in the Grafana dashboard to track only the InfluxDB changes.
Here is a snapshot of the hard disk utilization based on two days of monitoring:
As you can see, it constantly goes up and down, with a midpoint of around 4GB. Here is an instant snapshot from the server itself:
cisco@ubuntu:~$ sudo du -sh /var/lib/influxdb/data/
3.5G /var/lib/influxdb/data/
This value confirms data seen with Telegraf.
Hard Disk Write Speed
This is an essential characteristic to know about. The write speed of the hard drive may seem like an obvious thing, yet it deserves special attention when it comes to streaming telemetry. Many different counters can be pushed from a router at a very high rate, and your disk(s) must be fast enough to write all the data. If the write speed is insufficient, you will end up in a situation where your Grafana graphs are not built in real time (see slide No.25 here).
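Before relying on the graphs, you can get a rough baseline of what your disk can sustain with a simple dd write test (a sketch only; the path and size are arbitrary, and oflag=direct bypasses the page cache so the device itself is measured):
cisco@ubuntu:~$ sudo dd if=/dev/zero of=/var/lib/influxdb/write_test bs=1M count=1024 oflag=direct
cisco@ubuntu:~$ sudo rm /var/lib/influxdb/write_test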
To have write speed monitoring added in Telegraf, you should have these lines in the configuration file:
# Read metrics about disk IO by device
[[inputs.diskio]]
## By default, telegraf will gather stats for all devices including
## disk partitions.
## Setting devices will restrict the stats to the specified devices.
devices = ["sda", "sdb", "mapper/ubuntu--vg-root"]
Here is a snapshot of the hard disk write speed with just a single router pushing data:
The write speed is within the range from ~60MBps to ~90MBps.
This can also be confirmed with output from the Linux server itself (the iotop tool was used to get this data):
This snapshot confirms the values we saw in Telegraf (the peak shows up once every ~5 seconds).
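For reference, such a snapshot can be reproduced in batch mode with something like this (the "-o" flag limits the output to processes actually doing I/O; the delay and iteration count are arbitrary):
cisco@ubuntu:~$ sudo iotop -o -b -d 5 -n 3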
Network Bandwidth
We're all networking people here, so bandwidth was examined with several different tools. The goal is to understand the traffic profile of streaming telemetry and to design the transport infrastructure accordingly.
The most straightforward way is to check the RX load on the ingress interface with Telegraf. This is the configuration you need to have in “telegraf.conf” (make sure to specify your interface name):
# # Read metrics about network interface usage
[[inputs.net]]
# ## By default, telegraf gathers stats from any up interface (excluding loopback)
# ## Setting interfaces will tell it to gather these explicit interfaces,
# ## regardless of status.
# ##
interfaces = ["ens160"]
Telegraf collects counters from “/proc/net/dev”, as can be seen here. This is the same data you get with “ifconfig” (the old way) or “ip -s link” (the new way).
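If you want a quick rate estimate without any extra tools, two samples of “/proc/net/dev” are enough. A minimal sketch (the interface name and the 5-second window are assumptions):
cisco@ubuntu:~$ IFACE=ens160
cisco@ubuntu:~$ RX1=$(awk -v i="$IFACE:" '$1 == i {print $2}' /proc/net/dev); sleep 5
cisco@ubuntu:~$ RX2=$(awk -v i="$IFACE:" '$1 == i {print $2}' /proc/net/dev)
cisco@ubuntu:~$ echo "RX rate: $(( (RX2 - RX1) * 8 / 5 / 1000000 )) Mbps"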
One might argue that this is pretty high in the Linux networking stack and that something closer to the NIC, such as “ethtool”, would be better; but with no filters, QoS, etc. configured, relying on “/proc/net/dev” was good enough. Also, during this testing, no attempt was made to steer flows from different gRPC sessions/routers to different queues and/or different CPUs handling those queues and SoftIRQs (besides, the I350 is not very flexible in that regard).
But even with the default configuration, there was some balancing happening:
cisco@ubuntu:~$ ethtool -S ens160
NIC statistics:
Tx Queue#: 0
TSO pkts tx: 5371
TSO bytes tx: 14265596
ucast pkts tx: 10244115
ucast bytes tx: 711616671
mcast pkts tx: 7
mcast bytes tx: 506
bcast pkts tx: 1
bcast bytes tx: 57
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 1
TSO pkts tx: 8523
TSO bytes tx: 23855746
ucast pkts tx: 5597962
ucast bytes tx: 405979501
mcast pkts tx: 2
mcast bytes tx: 156
bcast pkts tx: 2
bcast bytes tx: 116
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 2
TSO pkts tx: 15321
TSO bytes tx: 40884653
ucast pkts tx: 849676
ucast bytes tx: 104659814
mcast pkts tx: 689
mcast bytes tx: 60840
bcast pkts tx: 5
bcast bytes tx: 242
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 3
TSO pkts tx: 11981
TSO bytes tx: 30906375
ucast pkts tx: 7161148
ucast bytes tx: 520244572
mcast pkts tx: 678
mcast bytes tx: 72716
bcast pkts tx: 1
bcast bytes tx: 79
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 4
TSO pkts tx: 13939
TSO bytes tx: 35826029
ucast pkts tx: 2544772
ucast bytes tx: 210321037
mcast pkts tx: 0
mcast bytes tx: 0
bcast pkts tx: 0
bcast bytes tx: 0
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 5
TSO pkts tx: 4268
TSO bytes tx: 12138427
ucast pkts tx: 147058
ucast bytes tx: 26340175
mcast pkts tx: 2
mcast bytes tx: 156
bcast pkts tx: 0
bcast bytes tx: 0
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 6
TSO pkts tx: 133051
TSO bytes tx: 1742790147
ucast pkts tx: 172700036
ucast bytes tx: 13463528864
mcast pkts tx: 1
mcast bytes tx: 78
bcast pkts tx: 0
bcast bytes tx: 0
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Tx Queue#: 7
TSO pkts tx: 113109
TSO bytes tx: 1564030563
ucast pkts tx: 10729684
ucast bytes tx: 2296085621
mcast pkts tx: 0
mcast bytes tx: 0
bcast pkts tx: 0
bcast bytes tx: 0
pkts tx err: 0
pkts tx discard: 0
drv dropped tx total: 0
too many frags: 0
giant hdr: 0
hdr err: 0
tso: 0
ring full: 0
pkts linearized: 0
hdr cloned: 0
giant hdr: 0
Rx Queue#: 0
LRO pkts rx: 69503
LRO byte rx: 155537167
ucast pkts rx: 4899929
ucast bytes rx: 6933364483
mcast pkts rx: 664
mcast bytes rx: 71048
bcast pkts rx: 7690
bcast bytes rx: 461400
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 1
LRO pkts rx: 173207
LRO byte rx: 420063453
ucast pkts rx: 8744413
ucast bytes rx: 12400319120
mcast pkts rx: 0
mcast bytes rx: 0
bcast pkts rx: 0
bcast bytes rx: 0
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 2
LRO pkts rx: 68829
LRO byte rx: 179417502
ucast pkts rx: 7784799
ucast bytes rx: 11250828484
mcast pkts rx: 0
mcast bytes rx: 0
bcast pkts rx: 10080
bcast bytes rx: 1430784
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 3
LRO pkts rx: 175185
LRO byte rx: 512157733
ucast pkts rx: 12908488
ucast bytes rx: 18425489162
mcast pkts rx: 1329
mcast bytes rx: 128923
bcast pkts rx: 0
bcast bytes rx: 0
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 4
LRO pkts rx: 95519
LRO byte rx: 252147848
ucast pkts rx: 4410766
ucast bytes rx: 6185140629
mcast pkts rx: 0
mcast bytes rx: 0
bcast pkts rx: 0
bcast bytes rx: 0
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 5
LRO pkts rx: 3992421
LRO byte rx: 9493291192
ucast pkts rx: 342072378
ucast bytes rx: 490086127366
mcast pkts rx: 665
mcast bytes rx: 57855
bcast pkts rx: 6612
bcast bytes rx: 1748874
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 6
LRO pkts rx: 45801
LRO byte rx: 141305620
ucast pkts rx: 4268647
ucast bytes rx: 5801599902
mcast pkts rx: 0
mcast bytes rx: 0
bcast pkts rx: 0
bcast bytes rx: 0
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
Rx Queue#: 7
LRO pkts rx: 460650
LRO byte rx: 1279922500
ucast pkts rx: 28727343
ucast bytes rx: 41614846434
mcast pkts rx: 0
mcast bytes rx: 0
bcast pkts rx: 0
bcast bytes rx: 0
pkts rx OOB: 0
pkts rx err: 0
drv dropped rx total: 0
err: 0
fcs: 0
rx buf alloc fail: 0
tx timeout count: 0
This is a snapshot of the RX (and TX) load of the interface that streaming telemetry was pushed to:
As you can see, the bandwidth profile is close to the picture you might already have in mind. Every five seconds there are two spikes of bandwidth utilization. The first one is fairly small (~12Mbps; it carries a set of “fast” collections) and is followed by the big one (~73Mbps; it mostly carries MPLS-TE counters). This is expected: telemetry pushes every sample interval, and the amount of data is (roughly) the same each time, as nothing was changing on the router.
Let’s now check the transmission rate from the Management interface of the router used in the test:
The traffic profile is exactly the same! You can see the small spikes (fast collections) followed by the big spikes (MPLS-TE collections), with the same values.
You can also use any existing tool that collects counters from network interfaces to calculate the rate. “Speedometer” was used in the testing; since it also reads counters from /proc/net/dev, it is shown here just once, as a cross-check of Telegraf.
This graph gives slightly better granularity but overall confirms what we saw with Telegraf. A few peaks show a higher rate (83Mbps vs. 73Mbps), mostly because some packets from the smaller spikes were added to the big ones during the rate calculation.
And here is how the telemetry push looks over several hours of observation:
The Management interface load stays constant as expected.
Pipeline Throughput
The final stop in the first phase of the testing is Pipeline. Monitoring Pipeline is essential, as it helps you prevent overload situations (and hence either drops or pushback to the router). When you install the Telemetry Collection Stack, this monitoring is activated by default; all you need to do is follow the graphs.
Here is a snapshot of the Pipeline load while processing counters from a single router:
Throughput is around 2.2MBps (try to guess which subscription the pink color corresponds to ;) ). Not surprisingly, this load is stable across a couple of days:
Step Two: Two Routers
At this step, the goal was to add another router and measure the increments. The second router was also an NCS5501 with the same configuration, the same IOS XR version, and a very similar scale.
Let's go through the snapshots and do the math.
CPU Utilization
As before, let's start with the per-core CPU load. Here is a snapshot of the graph showing CPU load for the last 24 hours:
The second router was added at around “14:00” on that graph (the time is marked on this graph; follow the same marks on the graphs below). More spikes appear after the second router started pushing its telemetry data. The spikes now peak at around 25%, and the midpoint is approximately 15%. It is hard to analyze based on this graph alone, so let's look at the per-process load.
Per Process Load
Okay, let's check the situation with our three main processes:
As a reminder, with a single router we saw ~130% load for InfluxDB and ~50% for Pipeline. After adding the second router, Pipeline sits at around 100%, which suggests that Pipeline needs ~0.5 vCPU per router. The InfluxDB load grew as well, to ~250%, which leads to ~1.3 vCPU per router for InfluxDB. Grafana's load is still negligible compared to both Pipeline and InfluxDB.
Here is a snapshot for the 24h of per-process load monitoring:
InfluxDB midpoint is really ~250% (with random spikes to ~350%-400%), while Pipeline stayed almost flat around 100%.
And the final check from Linux itself:
The snapshot was taken at one of the highest spikes, and it confirms that InfluxDB goes up to ~290%, with Pipeline close to ~100%.
DRAM Utilization
A single router took around 1.2GB of DRAM on the server. Here is a snapshot of the DRAM stats for 24 hours:
DRAM utilization moved from ~2.5GB to ~3.6-3.7GB after the second router was added, an increase of about ~1.1-1.2GB for the new router (consistent with the first one).
A quick check from Linux:
cisco@ubuntu:~$ free -mh
total used free shared buff/cache available
Mem: 9.8G 3.3G 739M 98M 6.1G 6.0G
Swap: 9G 35M 9G
The result is pretty close to what we see with Telegraf.
Hard Disk Space
Storing the data from the first router required ~4GB of space. Keeping the same retention policy, here is a snapshot of two days of disk utilization monitoring after the second router was added:
Disk utilization is now around 8GB, which means that adding one more device of similar scale adds the same amount of disk usage (~4GB per router).
And a quick check from Linux at a random moment:
cisco@ubuntu:~$ sudo du -sh /var/lib/influxdb/data/
7.5G /var/lib/influxdb/data/
Hard Disk Write Speed
With the first router, the write speed was ~60-90MBps during the periods when counters were arriving at the server. Here is a snapshot of the write speed with two routers:
There are many spikes up to ~600MBps, but the dense part is now ~200-250MBps. It looks like each new router needs at least ~90MBps of write speed.
Here is one of the peaks caught from the Linux console:
iotop shows a smaller value, which is more representative of the normal mode (not the spikes).
Network Bandwidth
Whenever you add one more router, there are two possible situations:
- the sample intervals of the routers are aligned in time;
- the sample intervals of the routers are not aligned in time.
In the first case, the maximum peak value doubles. In the second case, you will see a profile with several peaks consistently spread in time (this is the more common case).
In the tests, the second situation was observed:
With the first router, the peak value was ~72Mbps. Now several collections happen to be aligned in time: the peak value for the aligned collections is ~90Mbps, and the second peak is around 80Mbps. (Again, the worst case would be full start-time alignment, with peak values up to ~150Mbps.)
There is no need to show the long-term snapshot; with streaming telemetry you will have a constant rate (unless there are drops, policing, etc. along the way!).
Pipeline Throughput
With the first router, we observed 2.2MBps of Pipeline throughput. Here is a snapshot of the load after adding the second one:
The volume of decoded messages grew exactly two times! This means every new similar router will add the same amount of processing (~2.2MBps).
Step Three: Five Routers
At this step, the plan was to verify our findings with five routers streaming almost the same number of counters. Three more routers were added to the testbed, all NCS5502s running IOS XR 6.3.2.
CPU Utilization
As before, let’s start with the total CPU load:
With two routers, we observed peak values of ~25% and a midpoint of ~15%. With five routers, the midpoint is ~22-25% and peaks reach up to 40%. This confirms that the load is balanced almost equally across the cores, with no disproportionate increase on just a subset of cores. More details are available in the per-process view.
Per Process Load
Let's jump directly to the per-process load over a long monitoring period:
Based on this graph, Pipeline now takes ~250% and InfluxDB around 650%. This confirms our earlier estimate that Pipeline needs approximately 50% (~0.5 vCPU) per router pushing ~350k counters every five seconds, while InfluxDB needs around 120-130% (~1.3 vCPU) per router.
DRAM Utilization
In the previous test, we saw that ~1.1-1.2GB of DRAM was needed to process streaming telemetry from one router. Let's see the graph with five routers:
Used DRAM moved from ~3.6GB to a midpoint of ~7.2-7.3GB. This confirms that ~1.1-1.2GB of DRAM is needed per router pushing ~350k counters every five seconds.
Hard Disk Space
According to the previous tests, we needed ~4GB to store data from a single router and around ~8GB for two of them. Let's see the disk utilization with five routers streaming telemetry data:
The utilization is around 20-25GB, which confirms our assumption that ~4GB of disk space is needed per router. With the configured retention policy (3h plus a 1h shard), this means that, roughly, one hour of storage for ~350k counters pushed every five seconds takes ~1GB of disk space.
Hard Disk Write Speed
Here is the graph with the write speed on the hard disk:
As you can see, the dense part “moved” from ~200MBps to ~400MBps. The increase in write activity is expected, but you cannot exceed the maximum speed of your drive; the system keeps writing while data is still held in memory (hence the denser area). Remember: if your write speed is not high enough to handle all the incoming data immediately, you may observe growing delays in Grafana's graphs.
Network Bandwidth
As with two routers, the sample intervals of the five routers may or may not be aligned at start time. Here is the graph from the tests:
Several routers happened to be aligned in their intervals, which is why you can see spikes of up to ~185Mbps. The takeaway is that the total bandwidth depends on the number of simultaneous pushes, with a single router contributing up to ~72Mbps.
Pipeline Throughput
The final piece to look at is Pipeline. Here is a snapshot:
Again, no surprise here: every new router added ~2MBps of load to the tool. You can also see that most of the processing comes from just a single subscription on every router. This graph also confirms that the number of counters from every router was almost the same!
So, What Is The Summary?
Based on the tests, you can refer to these numbers for your infrastructure designs.
For a router pushing ~350k counters every five seconds you need:
- DRAM: ~1.2GB (DDR4 / 2133MHz)
- Hard disk space: ~1GB per hour
- Hard disk write speed: ~90MBps, but it may grow non-linearly (SM1625 800GB 6Gbps 2.5” SAS SSD)
- InfluxDB process: ~1.5 vCPU (CPU E5-2697 v3 @ 2.60GHz)
- Pipeline process: ~0.5 vCPU (CPU E5-2697 v3 @ 2.60GHz)
- Pipeline throughput: ~2.2MBps
- Network bandwidth: ~75Mbps
Update this for your needs, and you’re good to go!
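If you want to put the numbers above into a quick calculator, here is a minimal sketch (the router count and the retention period are placeholders; the per-router values are the ones measured in this post):
ROUTERS=10               # number of routers, each pushing ~350k counters every 5 seconds
RETENTION_HOURS=168      # desired retention period, e.g. one week
awk -v r="$ROUTERS" -v h="$RETENTION_HOURS" 'BEGIN {
    printf "DRAM:       ~%.1f GB (plus the ~1.3 GB idle baseline)\n",  r * 1.2
    printf "Disk space: ~%d GB (about 1 GB per router per hour)\n",    r * h
    printf "vCPU:       ~%d (InfluxDB ~1.5 and Pipeline ~0.5 each)\n", r * 2
    printf "Bandwidth:  ~%d Mbps worst case (intervals aligned)\n",    r * 75
}'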
Before moving to the conclusion, let me show you the difference in bandwidth needs between the available encodings and transport protocols. All other resource needs stay roughly the same.
Peak bandwidth needs for ~350k counters:
- gRPC/KV-GPB: ~72.5 Mbps
- gRPC/GPB: ~9.6 Mbps
- gRPC/JSON: ~84.4 Mbps
- TCP/KV-GPB: ~72.6 Mbps
- TCP/GPB: ~9.6 Mbps
- TCP/JSON: ~84.5 Mbps
- UDP/KV-GPB: ~76.7 Mbps
- UDP/GPB: ~9.8 Mbps
- UDP/JSON: ~88.2 Mbps
Please use these values as a general reference, keeping in mind that your numbers might be slightly different.
Conclusion
The IOS XR Telemetry Collection Stack makes it possible to start collecting telemetry data from your routers. But before doing so, you need to plan your infrastructure properly. You don't want to end up in a situation where everything works fine, but you don't have enough space to keep the data, or your server is simply not powerful enough. Many recommendations exist from the owners of the components used in the Stack (e.g. InfluxDB), but I hope the results here help you better understand the requirements, check utilization, and move fast!