IOS-XR power consumption monitoring: an ephemeral telemetry stack use case

9 minutes read

Introduction

Power efficiency is a big topic as it can literally help customers save millions. Recently there has been several announcements about Cisco’s Silicon One and how it compares to a previous generation of chip on this specific point.
This post will describe how a docker-based ephemeral telemetry stack has been built to monitor Cisco IOS-XR device power consumption.

Context

As part of an upcoming CPOC (Customer Proof Of Concept) engagement, one of our customers wants to benchmark routers power consumption. Most of time, TCO (Total Cost of Ownership) calculations are made based on estimations. Cisco do provide such numbers through Cisco Power Calculator tool. However, they may vary based on several factors like ambient temperature, traffic load and pattern, optics, etc. For a fully loaded ASR 9906 with latest generation of RSP and linecards, there can be a 50% difference between the typical scenario (27C ambient temperature with 50% linerate IMIX traffic) and the worst-case scenario (50/55C)!
For this multi-vendor assessment, apples to apples comparison is required. A power meter or a smart PDU could be setup, but they require additional hardware to be purchased and installed on the premises. Instead we decided to use Model Driven Telemetry to monitor in real time Cisco’s device power consumption: it’s already available, free and we only need extra configuration. The last missing piece is a collector.

Docker-compose telemetry stack

To receive and consume power data from our routers, I decided to build an ephemeral telemetry stack which can be brought on-demand on a server or a laptop. As CPOC will be delivered over a one-week period, there is no need to build a full-blown and permanent solution. Purpose here is demonstration only.
I reused the same popular components: Telegraf, InfluxDB and Grafana. This time I decided to add Chronograf to quickly see the data stored in InfluxDB. I also leveraged Docker to build this stack and especially Docker Compose. As described in Docker’s documentation:

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

There was some work done by Jeff Kehres available on GitHub. I reused it to add Telegraf and I removed the docker volumes: as this stack is ephemeral, persistent storage is not required. Full code and documentation can be found here. Here is the raw docker-compose.yaml file:


version: "2"
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - '3000:3000'
    volumes:
      - ./grafana-provisioning/:/etc/grafana/provisioning
    depends_on:
      - influxdb
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
  influxdb:
    image: influxdb:1.8
    container_name: influxdb
    restart: always
    ports:
      - '8086:8086'
    environment:
      - INFLUXDB_DB=telemetry
      - INFLUXDB_USER=admin
      - INFLUXDB_PASSWORD=admin
      - INFLUXDB_ADMIN_ENABLED=true

      - INFLUXDB_ADMIN_USER=admin
      - INFLUXDB_ADMIN_PASSWORD=admin
  chronograf:
    image: chronograf:latest
    container_name: chronograf
    ports:
      - '127.0.0.1:8888:8888'
    depends_on:
      - influxdb
    environment:
      - INFLUXDB_URL=http://influxdb:8086
      - INFLUXDB_USERNAME=admin
      - INFLUXDB_PASSWORD=admin
  telegraf:
    image: telegraf
    container_name: telegraf
    restart: always
    ports:
      - '57100:57100'
      - '57500:57500'
    environment:
      HOST_PROC: /rootfs/proc
      HOST_SYS: /rootfs/sys
      HOST_ETC: /rootfs/etc
    volumes:
     - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
     - /var/run/docker.sock:/var/run/docker.sock:ro
     - /sys:/rootfs/sys:ro
     - /proc:/rootfs/proc:ro
     - /etc:/rootfs/etc:ro

To bring up the stack, simply run:

` docker-compose up -d `

This will spawn our 4 containers to create the ephemeral telemetry stack:


{21/12/20 11:58}fcuillers-MacBook-Pro:~/Dev/telemetry fcuiller% docker-compose up -d
Creating network "telemetry_default" with the default driver
Creating telegraf   ... done
Creating influxdb ... done
Creating chronograf ... done
Creating grafana    ... done

With current configuration, following ports are exposed:

Host PortService
3000Grafana
8086InfluxDB
57100, 57500Telegraf
127.0.0.1:8888Chronograf

Port 57100 is used for TCP transport while 57500 is used for gRPC dial-out. telegraf.conf file can be updated to fit your configuration and environment.

Once stack is started, we can check containers are up:


{21/12/20 15:27}fcuillers-MacBook-Pro:~/Dev/telemetry fcuiller% docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED         STATUS         PORTS                                                                              NAMES
fa0fa0f61c54   grafana/grafana:latest   "/run.sh"                3 seconds ago   Up 2 seconds   0.0.0.0:3000->3000/tcp                                                             grafana
88f308900623   chronograf:latest        "/entrypoint.sh chro…"   3 seconds ago   Up 2 seconds   0.0.0.0:8888->8888/tcp                                                             chronograf
032555f2ac94   telegraf                 "/entrypoint.sh tele…"   4 seconds ago   Up 3 seconds   8092/udp, 0.0.0.0:57100->57100/tcp, 8125/udp, 8094/tcp, 0.0.0.0:57500->57500/tcp   telegraf
320d88dbfcf3   influxdb                 "/entrypoint.sh infl…"   4 seconds ago   Up 3 seconds   0.0.0.0:8086->8086/tcp                                                             influxdb

When tests are done, you can bring down the stack with the following command:

` docker-compose down `

All metrics stored in InfluxDB and all dashboards created in Grafana will be lost.

IOS-XR models

Power statistics and counters are exposed through different Yang models.
On NCS 5500 and ASR 9000 running IOS-XR 64bit, this is achieved with Cisco-IOS-XR-sysadmin-envmon-ui:environment/oper sensor. This sensor contains information about temperatures, voltages, FAN and also power: all data you can usually find in ‘admin show environment’ command.

Here is a sample output taken from a NCS 5500 router:


RP/0/RP0/CPU0:NCS-5500#admin sh env power
Tue Dec 22 12:28:01.758 CET
================================================================================
CHASSIS LEVEL POWER INFO: 0
================================================================================
   Total output power capacity (Group 0 + Group 1) :    9000W +       0W
   Total output power required                     :    7748W
   Total power input                               :    2143W
   Total power output                              :    2003W

Power Group 0:
================================================================================
   Power       Supply     ------Input----   ------Output---      Status
   Module      Type        Volts     Amps    Volts     Amps
================================================================================
   0/PM0       3kW-AC      229.9      3.1     12.0     56.3    OK
   0/PM1       3kW-AC      231.4      3.3     12.0     58.1    OK
   0/PM2       3kW-AC      229.9      2.9     12.0     52.5    OK

Total of Power Group 0:       2143W/  9.3A      2003W/166.9A

================================================================================
   Location     Card Type            Power       Power       Status
                                     Allocated   Used
                                     Watts       Watts
================================================================================
   0/0          NC55-18H18F            456         273       ON
   0/1          NC55-24X100G-SE        940         486       ON
   0/2                 -               902           -       RESERVED
   0/3                 -                25           -       RESERVED
   0/4          NC55-36X100G-A-SE     1050         546       ON
   0/5                 -                25           -       RESERVED
   0/6                 -                25           -       RESERVED
   0/7                 -                25           -       RESERVED
   0/RP0        NC55-RP                 90          33       ON
   0/RP1               -                90           -       RESERVED
  0/FC0               -               250           -       RESERVED
   0/FC1        NC55-5508-FC2          475         191       ON
   0/FC2               -               250           -       RESERVED
   0/FC3               -               250           -       RESERVED
   0/FC4               -               250           -       RESERVED
   0/FC5               -               250           -       RESERVED
   0/FT0        NC55-5508-FAN2         775          84       ON
   0/FT1        NC55-5508-FAN2         775          83       ON
   0/FT2        NC55-5508-FAN2         775          83       ON
   0/SC0        NC55-SC                 35          15       ON
   0/SC1               -                35           -       RESERVED
RP/0/RP0/CPU0:NCS-5500#

What we are interested in is overall chassis power consumption, especially power input and power output. We would also like to get information on particular locations to monitor specific linecards.

On Cisco 8000, the sensor is different and power counters can be accessed with Cisco-IOS-XR-envmon-oper:power-management Yang model:


RP/0/RP0/CPU0:Cisco-8000#sh telemetry model-driven sensor-group POWER internal
Mon Dec 21 16:00:51.929 CET
  Sensor Group Id:POWER
    Sensor Path:        Cisco-IOS-XR-envmon-oper:power-management
    Sensor Path State:  Resolved
      Sysdb Path:       /oper/spi/gl/pwrmgmt/rack/\<spi_pwrmgmt_oper_Rack_rack\>/producers\<spi_pwrmgmt_oper_ProducerNode_nodeid\>
       Yang Path:       Cisco-IOS-XR-envmon-oper:power-management/rack/producers/producer-nodes/producer-node
      Sysdb Path:       /oper/spi/gl/pwrmgmt/rack/\<spi_pwrmgmt_oper_Rack_rack\>chassis
       Yang Path:       Cisco-IOS-XR-envmon-oper:power-management/rack/chassis
      Sysdb Path:       /oper/spi/gl/pwrmgmt/rack/\<spi_pwrmgmt_oper_Rack_rack\>/consumers\<spi_pwrmgmt_oper_ConsumerNode_nodeid\>
       Yang Path:       Cisco-IOS-XR-envmon-oper:power-management/rack/consumers/consumer-nodes/consumer-node 

Router configuration

On Cisco IOS-XR telemetry is configured in 3 main blocks: sensors, subscription and destination. This time I decided to use gRPC as transport and I will export environment data every 30s from my ASR 9000 and NCS 5500:


telemetry model-driven
 destination-group DEST-GROUP
  !
  address-family ipv4 10.209.198.44 port 57500
   encoding self-describing-gpb
   protocol grpc no-tls

sensor-group ENV-COUNTERS
  sensor-path Cisco-IOS-XR-sysadmin-envmon-ui:environment/oper

subscription SUB
  sensor-group-id ENV-COUNTERS sample-interval 30000
  destination-id DEST-GROUP

As I configure gRPC export through the out of band management port, I need additional TPA configuration:


tpa
 vrf default
  address-family ipv4
   default-route mgmt
   update-source dataports MgmtEth0/RP0/CPU0/0

If TPA routing table is not updated, gRPC session will not come up and following errors in telemetry traces will appear:


RP/0/RP0/CPU0:NCS5500#sh telemetry model-driven trace all | i 10.209.198.44
Dec 21 15:56:08.925 m2m/mdt/go-info 0/RP0/CPU0 t8076  1122925 [mdt_go_trace_info]: emsMdtConnEstablish:333 Dialer type 1, request of '10.209.198.44:57500'
Dec 21 15:56:08.925 m2m/mdt/go-info 0/RP0/CPU0 t14463  1122942 [mdt_go_trace_info]: mdtConnEstablish:171 1: Dialing out to 10.209.198.44:57500, req 503
Dec 21 15:56:08.925 m2m/mdt/go-info 0/RP0/CPU0 t14463  1122944 [mdt_go_trace_info]: mdtConnEstablish:240 dial: target 10.209.198.44:57500
Dec 21 15:56:08.925 m2m/mdt/go-info 0/RP0/CPU0 t14463  1122945 [mdt_go_trace_info]: mdtDialer:243 1: namespace /var/run/netns/global-vrf, args 10.209.198.44:57500
Dec 21 15:56:13.925 m2m/mdt/go-info 0/RP0/CPU0 t21431  1129419 [mdt_go_trace_info]: mdtDialer:243 1: namespace /var/run/netns/global-vrf, args 10.209.198.44:57500
Dec 21 15:56:18.929 m2m/mdt/go-info 0/RP0/CPU0 t21627  1136268 [mdt_go_trace_error]: mdtConnEstablish:267 1: grpc service call failed, ReqId 503, 10.209.198.44:57500, rpc error: code = Unavailable desc = all SubConns are in TransientFailure

As soon as TPA is updated, gRPC session is established to the ephemeral collector:


RP/0/RP0/CPU0:ASR9000#sh telemetry model-driven destination DEST-GROUP
Tue Dec 22 10:30:03.660 CET
  Destination Group:  DEST-GROUP
  -----------------
    Destination IP:       10.209.198.44
    Destination Port:     57500
    Subscription:         SUB
    State:                Active
    Encoding:             self-describing-gpb
    Transport:            grpc
    No TLS
    Total bytes sent:     212209
    Total packets sent:   6
    Last Sent time:       2020-12-22 10:30:02.941129092 +0100

    Collection Groups:
    ------------------
      Id: 2
      Sample Interval:      30000 ms
    Encoding:             self-describing-gpb
      Num of collection:    1553
      Collection time:      Min:   744 ms Max:  8162 ms
      Total time:           Min:   651 ms Max:  8205 ms Avg:  3958 ms
      Total Deferred:       0
      Total Send Errors:    107
      Total Send Drops:     0
      Total Other Errors:   0
    No data Instances:    8697
      Last Collection Start:2020-12-22 10:29:59.938084092 +0100
      Last Collection End:  2020-12-22 10:29:00.879691092 +0100
      Sensor Path:          Cisco-IOS-XR-sysadmin-envmon-ui:environment/oper

RP/0/RP0/CPU0:ASR9000#

This is also reflected in the telemetry traces:


RP/0/RP0/CPU0:NCS5500#sh telemetry model-driven trace all | i 10.209.198.44
Dec 21 15:57:08.921 m2m/mdt/go-info 0/RP0/CPU0 t8076  1201871 [mdt_go_trace_info]: emsMdtConnEstablish:333 Dialer type 1, request of '10.209.198.44:57500'
Dec 21 15:57:08.921 m2m/mdt/go-info 0/RP0/CPU0 t14463  1201888 [mdt_go_trace_info]: mdtConnEstablish:171 1: Dialing out to 10.209.198.44:57500, req 503
Dec 21 15:57:08.921 m2m/mdt/go-info 0/RP0/CPU0 t14463  1201890 [mdt_go_trace_info]: mdtConnEstablish:240 dial: target 10.209.198.44:57500
Dec 21 15:57:08.921 m2m/mdt/go-info 0/RP0/CPU0 t14463  1201891 [mdt_go_trace_info]: mdtDialer:243 1: namespace /var/run/netns/global-vrf, args 10.209.198.44:57500
Dec 21 15:57:09.088 m2m/mdt/go-info 0/RP0/CPU0 t14463  1201949 [mdt_go_trace_info]: mdtConnEstablish:287 1: 10.209.198.44:57500, chanstat buffered num 4000, gHardlimit 13000
Dec 21 15:57:09.088 m2m/mdt/go-info 0/RP0/CPU0 t14463  1201959 [mdt_go_trace_info]: mdtConnEstablish:299 1: Ready for mdt dialout data, req 503, chan 1, 10.209.198.44:57500
Dec 21 15:57:09.090 m2m/mdt/subdb 0/RP0/CPU0 t8074  1201964 [mdt_conn_process_establish]: Got resp from ipv4 10.209.198.44, port 57500

gRPC and TPA have been covered by Viktor and Shelly in previous articles.

I’ve got the power!

Counters are stored in the ‘telemetry’ database. We can check fields directly in influxDB:


{22/12/20 12:05}fcuillers-MacBook-Pro:~/Dev/telemetry fcuiller% docker exec -it influxdb sh
# influx
Connected to http://localhost:8086 version 1.8.3
InfluxDB shell version: 1.8.3
> show databases
name: databases
name
----
telemetry
_internal
> use telemetry
Using database telemetry
> show field keys
-- snip --
name: Cisco-IOS-XR-sysadmin-envmon-ui:environment/oper/power/location/pem_attributes_grpc
fieldKey                           fieldType
--------                           ---------
card_type/value                    string
confgd_power_redundancy_mode/value string
input_current/value                string
input_current_to_ps/value          string
input_power_to_ps                  integer
input_voltage/value                string
output_current/value               string
output_current_from_ps/value       string
output_footer                      integer
output_header                      integer
output_power_from_ps               integer
output_voltage/value               string
pem_id/value                       string
power_allocated                    integer
power_consumed/value               string
power_level                        integer
power_resrv_and_alloc              integer
power_status/value                 string
protection_power_capacity          integer
ps_sum_footer                      integer
ps_type/value                      string
shelf_num                          integer
status/value                       string
supply_type/value                  string
system_power_input                 integer
system_power_used                  integer
usable_power_capacity              integer

Chassis power consumption is located in system_power_input and system_power_used. For each location, IOS-XR will allocate a power budget stored in power_allocated. Last, current power utilization per location can be found in power_consumed/value.

As seen above, some of the reported values are stored as ‘string’. Extra processing might be required to transform those keys as integer. On telegraf, this can be achieved using a convertor processor. Here is a sample example for ASR 9000 where following configuration must be added to telegraf.conf:


[[processors.converter]]
  #apply only to specified measurement
  namepass = ["Cisco-IOS-XR-sysadmin-asr9k-envmon-ui:environment/oper/power/location/pem_attributes"]
  #convert value from string to int for grafana to be happy :)
  [processors.converter.fields]
    integer = ["power_consumed/value"]

Visualization has been tested on device located in Brussels IOS-XR TAC lab. A first simple overall power consumption dashboard has been built for an ASR 9000 and a NCS 5500:

overall-power

If we focus on one device, it’s interesting to compare the power budget provisioned by the router (worst case scenario) versus the ongoing power consumption. On this NCS 5500, while 7.7kW is provisioned only 2.2kW is drawn from power supplies:

power-allocated-and-used

Last but not least, we can also monitor power consumption per location (i.e. linecard, RP, fabric, FAN, etc.):

power-consumption-per-location

This last metric will help our customer to benchmark linecards and check the impact of different factors such traffic load, packet size, transceivers, etc.

FInally here is a configuration extract of this dashboard so you can build your own:

grafana-query

I use the location tag as an alias to dynamically display locations in the legend.

Conclusion

This was another simple and practical use case of telemetry, and we could imagine monitoring other KPI with Cisco IOS-XR environmental model like temperature, fan speed or voltages. Docker compose is a very flexible and agile solution when it comes to bring up an application stack. Combining both allows to quickly gain useful information and insights from the infrastructure.

Leave a Comment