Tuesday, August 4, 2015

Cisco Prime Infrastructure 2.x False Alarms on High Memory Utilization

Let's first look at the hardware specification of the Cisco Catalyst 3750-X series switch.
According to the following data sheet, the DRAM size of a WS-C3750X-48P-L is 256MB.

http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-3750-x-series-switches/data_sheet_c78-584733.html

Let's look at the output of the show version and show memory statistics commands from a WS-C3750X-48P-L with an uptime of 3 minutes.
Switch#show version | include IOS|Compiled|uptime|of memory
Cisco IOS Software, C3750E Software (C3750E-UNIVERSALK9-M), Version 12.2(55)SE10, RELEASE SOFTWARE (fc2)
Compiled Wed 11-Feb-15 11:17 by prod_rel_team
Switch uptime is 3 minutes
cisco WS-C3750X-48P (PowerPC405) processor (revision A0) with 262144K bytes of memory.
Switch#
Switch#show memory statistics
                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor    3EC3C38   187620168    44631960   142988208   142471592   126184352
      I/O    E000000    16777216    10787060     5990156     5990156     5979600
Driver te    2400000     4194304          44     4194260     4194260     4194260
Switch#

The output of the show version command confirms the 256MB of DRAM (262144K bytes).
From the output of the show memory statistics command, we can see that the Processor memory totals about 188MB (187,620,168 bytes), which is the largest portion of the switch's 256MB of total memory.
From the same output, we can also see that:
  • The utilization of the Processor memory = 44631960 / 187620168 x 100 ≈ 23.8%
  • The utilization of the I/O memory = 10787060 / 16777216 x 100 ≈ 64.3%
The formula for calculating the utilization percentage is Used(b) / Total(b) x 100.
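
The arithmetic above can also be scripted. The following is only a minimal sketch in Python, not anything provided by Cisco: the function name pool_utilization and the regular expression are my own assumptions, written against the column layout of the show memory statistics output pasted above. Other IOS releases may format the table differently.

```python
import re

# Output copied from the WS-C3750X-48P-L shown above.
SAMPLE = """\
                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor    3EC3C38   187620168    44631960   142988208   142471592   126184352
      I/O    E000000    16777216    10787060     5990156     5990156     5979600
Driver te    2400000     4194304          44     4194260     4194260     4194260
"""

# Assumed row layout: pool name, hex head address, then five decimal
# byte counters (Total, Used, Free, Lowest, Largest).
ROW = re.compile(
    r"^\s*(.+?)\s+([0-9A-Fa-f]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def pool_utilization(text):
    """Return {pool name: utilization in percent} using Used / Total x 100."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skips the header row and any blank lines
        name = match.group(1)
        total = int(match.group(3))
        used = int(match.group(4))
        result[name] = used / total * 100
    return result


for pool, percent in pool_utilization(SAMPLE).items():
    print(f"{pool}: {percent:.1f}%")
# Processor: 23.8%
# I/O: 64.3%
# Driver te: 0.0%
```

Reading the Processor row rather than the I/O row is what gives the figure that reflects the memory used by Cisco IOS itself.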


According to a Cisco Live presentation titled BRKCRS-3141 Troubleshooting Cisco Catalyst 2960, 3560 and 3750 Series Switches, there are 2 types of memory:
  • Processor memory is the memory used by Cisco IOS (the operating system of Cisco routers and switches).
  • I/O memory is used for traffic sent to the CPU.
    I/O memory is not used for normal packet switching, i.e. the forwarding of end user traffic (a.k.a. the data plane).
    I/O memory is used for packets bound for the CPU of the Cisco device, e.g. CDP packets, STP packets, OSPF packets, EIGRP packets, etc. (a.k.a. the control plane).
FYI, we can't tune the I/O memory allocation on Cisco Catalyst switches the way we can on Cisco routers: the memory-size iomem {i/o-memory-percentage} command is available on Cisco routers but not on Cisco Catalyst switches.


Now let's look at the Top N Memory Utilization graph from Cisco Prime Infrastructure 2.2.

It looks pretty scary and worrying, because the average memory utilization exceeds 75% and the bars are shown in an Alloy Orange color.

Now let's look at a bug / caveat - CSCuo31707.

The bug description tells us that:
    Cisco Prime Infrastructure (PI) version 2.0 has a known issue in which the Top N Memory Utilization dashlet, which we just saw, raises false alarms by always showing 100% utilization for Cisco IOS-XR devices.
    The Top N Memory Utilization dashlet should show the actual memory utilization of the Cisco IOS-XR device, based on the Processor memory, which reflects the real memory status of the device.
    For Cisco IOS devices, it is a known issue that Cisco Prime Infrastructure raises false alarms of high memory utilization by showing the utilization of I/O memory; it should instead show the actual memory utilization by referring to the utilization of Processor memory.

We can see that the status of the bug / caveat is still Open.
There are no known fixed software releases yet.
The severity of the bug / caveat is 6 Enhancement.


In my view, Cisco Prime Infrastructure reporting high memory utilization for Cisco Catalyst switches based on the I/O memory rather than the Processor memory is actually a very serious false alarm problem.
It monitors the wrong thing and then reports problems based on it, which gets people worked up and leaves me explaining the same thing to them again and again...

Imagine you bought a nice luxury car, and the temperature gauge shows Red at 80% after you have driven for 10 minutes.
You quickly stop the car at a safe place and have it towed to the service center.

The mechanic tells you...
Mechanic: Hi Mr. Customer, that is nothing to worry about. It is a software problem in the dashboard system. The actual car temperature is only 40%, although the dashboard shows 80%.
You: When will the software fix be available?
Mechanic: Sorry Mr. Customer, because this is a Severity 6 cosmetic bug, which is classified as an Enhancement, no date has been committed yet. Maybe our programmers will start to look into it after they have resolved all the other S1, S2, S3, S4, and S5 bugs.
You: Oh well, how can I know the actual car temperature for the time being?
Mechanic: Sorry Mr. Customer, there is no effective workaround available at the moment. Perhaps when you see smoke coming out, that most probably means the car has overheated.
You: ...


I came across this bug / caveat back in Nov/2014; below is the status of the same bug / caveat when I accessed it back then.
We can see that Cisco has known about this problem since at least 20/Oct/2014.
How difficult is it to change the code for memory utilization monitoring to refer to Processor memory instead of I/O memory?
It is now Aug/2015 and Cisco PI 2.2 is already available. How much longer do I have to wait?!?
Cisco, are you serious about getting network monitoring done right?
