Background
Just about 2 years ago I got my first look at CloudIQ and was immediately impressed.
I mean, who wouldn’t be impressed by a Cloud Native Application that enables our customers to view the Health, Configuration, Capacity and Performance information of their entire inventory of Unity, VMAX, SC, and XtremIO systems from a single pane of glass?
As I explored the early CloudIQ user interface, I remember thinking how amazing it would be if we could view SAN and host information from this interface as well. I believed this type of “end-to-end” information would make it much easier to troubleshoot and remediate things like the slow drain and congestion spreading events that have been experienced by many of our customers.
Little did I know, less than a year later my team would be given the opportunity to add support for Connectrix products into CloudIQ and help make this end-to-end view of infrastructure a reality.
In this post I’ll provide a very high-level overview of the data collection process used to support the Connectrix products, as well as a couple of the important features that I believe you will find very useful when Connectrix in CloudIQ becomes available at the end of May.
Connectrix in CloudIQ overview
Rather than immediately dive into the different features and views, I’d like to start with an overview of the process that we use to collect data.
Starting from the bottom of the diagram, you’ll note that we plan to support both B-Series and MDS Series Connectrix products. I should point out that we can collect data from any FC switch running FOS 8.2.1a or above, or NX-OS 8.3(2) or above (including non-Connectrix models). However, in order for the switches to be visible in CloudIQ, the product will need to be under an active service contract with Dell EMC. We validate the contract status using the product serial number.
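To make the two conditions above concrete, here is a minimal Python sketch of the eligibility rule: a switch can be collected from if its firmware meets the minimum level, and it shows up in CloudIQ only if it is also under an active contract. The function names, the version-parsing regex and the `contract_active` flag are all assumptions made for illustration; this is not how CloudIQ implements the check, and it only handles simple version strings such as `v8.2.1a` or `8.3(2)`.

```python
import re

# Minimum firmware levels quoted in the text.
MINIMUM_VERSION = {"FOS": "8.2.1a", "NX-OS": "8.3(2)"}

# Matches strings such as "v8.2.1a", "8.2.2" (FOS) or "8.3(2)", "8.4(1a)" (NX-OS).
_VERSION = re.compile(r"v?(\d+)\.(\d+)[.(](\d+)([a-z]*)\)?$")


def parse_version(text):
    """Turn a version string into a tuple that compares in release order."""
    match = _VERSION.match(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, suffix = match.groups()
    # An empty suffix sorts before "a", so 8.2.1 < 8.2.1a < 8.2.1b.
    return (int(major), int(minor), int(patch), suffix)


def is_collectable(os_family, version):
    """True if the firmware is at or above the minimum for its OS family."""
    return parse_version(version) >= parse_version(MINIMUM_VERSION[os_family])


def is_visible(os_family, version, contract_active):
    """Visible in CloudIQ only if collectable AND under an active contract.

    contract_active stands in for the serial-number contract lookup.
    """
    return is_collectable(os_family, version) and contract_active


print(is_collectable("FOS", "v8.2.1a"), is_collectable("NX-OS", "8.3(1)"))
```

Treating the version as a tuple keeps the comparison readable. Multi-character patch suffixes would need a more careful comparison than the plain string ordering used here.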
The actual data collection process is performed by the CloudIQ Collector roughly once every 5 minutes. Each time a collection is performed, the CloudIQ Collector gathers the required health, configuration, capacity and performance data from every switch in the environment via their REST API. This information is then pushed back to Dell EMC via Secure Remote Services (SRS) and makes its way to our CloudIQ backend, where it is processed and securely stored.
The CloudIQ Collector itself is fairly lightweight (<700MB of disk space is required) and is based on Dell EMC Storage Resource Manager (SRM). The collector also has the ability to collect VM information from vCenter.
One final point to note: the switches can also send alert information directly to our services organization (i.e., call home). This capability does not require the CloudIQ Collector, only SRS.
Important features
When we first started this project, we wanted to make sure that the information we provided was always accurate, relevant and actionable. To this end, we solicited a lot of input not only from our customers but also from our services team, since they troubleshoot SAN issues all day and have a really healthy understanding of what’s relevant when there’s a problem. Because of this input, you’ll notice a number of features in our first release that are intended to help you troubleshoot more efficiently.
I’ll review a couple of those features next.
Inventory of all Connectrix products including a product specific Health Score
Once you've started collecting data from your Connectrix products and logged into CloudIQ, you can immediately get a sense of the health of your Connectrix products by selecting “Inventory” from the navigation tree on the left side of the CloudIQ UI and then clicking on the “SAN” tab. As shown below, each SAN system (switch) will be displayed as a tile that includes the health score associated with that product.
Starting from the top, each tile includes:
- A health score that represents the HW status of the switch. In the short term, you can expect the health score calculation to include whether or not the switch is experiencing congestion.
- The name of the product (e.g., Production SAN Extension). This is the user-friendly name you assigned to the product in your environment.
- The model (e.g., Connectrix ED-DCX6-4B) and serial number (e.g., EAF300M001).
- The Firmware Version (e.g., v8.2.1a).
- Last Contact Time – The last time this product sent information back to CloudIQ.
- The Location and Site ID where the product was physically installed.
Note: There is also a list view available that provides the same type of information but in a table format.
If you want more detailed configuration information about the switch, you can always click on its name and you’ll be brought to a detailed configuration view as shown below.
You’ll notice that this screen consists of:
- A Summary Header that provides much of the same information as the previous view did as well as the Switch uptime, Management IP Address and the WWNs associated with the product.
- Five tabs that each provide detailed information about the Fabrics, Partitions (VFs or VSANs), Zones, Attached Devices and Components (e.g., FRUs) for this switch.
Finally, if you want even more detailed information about your switch or you want to perform active management of a product (e.g., add a zone), you can click on the launch Element Manager link at the top right of the screen. This will launch a browser pointed to the IP Address of the management interface of the product. Here you can access the native switch management tool and do whatever you need. BTW, I LOVE THIS FEATURE! It’s so simple, yet so convenient.
Capacity
If you’re looking for port level information about your products, start from the left-hand navigation tree and select Capacity -> System Capacity and then click the SAN tab. For each product you’re managing in CloudIQ, you’ll see a tile similar to what is being shown below:
The donut graph gives a visual representation of the number of ports on the switch as well as the number of ports that have been consumed (e.g., are online) or are in need of attention (e.g., in an Error state).
If you’d like additional detail, click on the switch name and the Detailed Capacity view will be displayed as shown below.
The upper left side of the user interface provides the same information as the previous “System Capacity” view did. However, depending on what type of interface you have selected (e.g., Online), the other views to the right as well as the table below the donut graph will be filtered accordingly. These filters can be combined to limit what is displayed in the table view at the bottom. For example, you can show only Online F_Ports that have registered as a storage port.
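One way to picture the combined filtering is as a set of simple field matches that must all hold for a port to stay in the table. The Python sketch below assumes a hypothetical `Port` record with `state`, `port_type` and `device_role` fields. Those names and the sample values are illustrative only and are not CloudIQ's actual data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    # Hypothetical record; field names are illustrative only.
    name: str
    state: str        # e.g., "Online", "Offline", "Error"
    port_type: str    # e.g., "F_Port", "E_Port", "" if not logged in
    device_role: str  # e.g., "storage", "host", "" if nothing registered


def filter_ports(ports, **criteria):
    """Return the ports whose fields match every supplied criterion."""
    return [
        port for port in ports
        if all(getattr(port, name) == value for name, value in criteria.items())
    ]


ports = [
    Port("0/1", "Online", "F_Port", "storage"),
    Port("0/2", "Online", "F_Port", "host"),
    Port("0/3", "Online", "E_Port", ""),
    Port("0/4", "Error", "F_Port", "storage"),
    Port("0/5", "Offline", "", ""),
]

# Donut-style summary: how many ports are in each state.
state_counts = Counter(port.state for port in ports)

# Combined filter: Online F_Ports that have registered as a storage port.
online_storage = filter_ports(
    ports, state="Online", port_type="F_Port", device_role="storage"
)
print(dict(state_counts), [port.name for port in online_storage])
```

Each extra keyword narrows the result, which mirrors how selecting another segment in the UI narrows the table.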
Performance
If you’re looking for performance information about your products, start from the left-hand navigation tree and select Performance -> System Performance and then click the SAN tab. For each product you’re managing in CloudIQ, you’ll see a tile similar to what is being shown below:
This view contains:
- System Bandwidth – This represents the average amount of data being transmitted by the switch over the past 24 hours. You’ll also note the graph at the bottom of the dialog provides a historical view of the amount being transmitted during each sample interval. Note, we’ve already been asked to update this to include a metric that represents what percentage of total system bandwidth is being consumed and we plan to do this in an upcoming release.
- Utilization >= 80% - Number of switch interfaces that were running at greater than or equal to 80% utilization when the last collection was taken from the switch.
- Congested - Number of switch interfaces that were congested (had a congestion ratio greater than or equal to 0.2) when the last collection was taken from the switch. NOTE: This is a REALLY COOL FEATURE. If you don’t know how to calculate the congestion ratio or why it’s important, see our recent white paper on Congestion Spreading and how to avoid it.
- Errors - Number of switch interfaces that met one of the following criteria when the last collection was taken from the switch:
  - A change in an interface error counter of greater than or equal to 100 occurred between the current sample and a previous sample.
  - Greater than or equal to 50% of the samples from a given interface over the past hour have incremented an error counter.
  - Greater than or equal to 50% of the hour intervals from a given interface over the past 24 hours have met either of the above criteria.
- Link Resets – Similar to Errors (above) but with specific criteria related to the reception of Link Reset primitives.
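The thresholds in the list above can be sketched in a few lines of Python. This is one reading of the criteria, not the CloudIQ implementation: the `InterfaceHistory` record, its field names and the way samples are grouped by hour are assumptions, and the congestion ratio is taken as an already computed input. Link Resets are left out because their exact criteria are not spelled out here.

```python
from dataclasses import dataclass, field

UTILIZATION_THRESHOLD = 0.80  # "Utilization >= 80%"
CONGESTION_THRESHOLD = 0.2    # "Congested"
ERROR_DELTA_THRESHOLD = 100   # first Errors criterion
FRACTION_THRESHOLD = 0.5      # second and third Errors criteria


@dataclass
class InterfaceHistory:
    # Hypothetical record; field names are illustrative only.
    name: str
    utilization: float       # fraction of link rate at the last collection
    congestion_ratio: float  # assumed to be computed elsewhere
    # Error-counter increases per sample, grouped by hour, oldest hour first.
    hourly_error_deltas: list = field(default_factory=list)


def fraction_incrementing(deltas):
    """Fraction of samples in which the error counter went up."""
    return sum(d > 0 for d in deltas) / len(deltas)


def hour_is_flagged(deltas):
    """True if one hour of samples meets the first or second criterion."""
    if not deltas:
        return False
    big_jump = any(d >= ERROR_DELTA_THRESHOLD for d in deltas)
    return big_jump or fraction_incrementing(deltas) >= FRACTION_THRESHOLD


def has_errors(history):
    """Apply the three Errors criteria to one interface."""
    hours = history.hourly_error_deltas
    if not hours or not hours[-1]:
        return False
    last_hour = hours[-1]
    jump_now = last_hour[-1] >= ERROR_DELTA_THRESHOLD
    busy_last_hour = fraction_incrementing(last_hour) >= FRACTION_THRESHOLD
    flagged_hours = sum(hour_is_flagged(h) for h in hours) / len(hours)
    return jump_now or busy_last_hour or flagged_hours >= FRACTION_THRESHOLD


def tile_counts(interfaces):
    """Count interfaces per tile metric at the last collection."""
    return {
        "utilization_ge_80": sum(i.utilization >= UTILIZATION_THRESHOLD for i in interfaces),
        "congested": sum(i.congestion_ratio >= CONGESTION_THRESHOLD for i in interfaces),
        "errors": sum(has_errors(i) for i in interfaces),
    }


quiet = [0] * 12  # twelve 5-minute samples with no new errors
interfaces = [
    InterfaceHistory("0/1", 0.85, 0.05, [quiet, quiet]),
    InterfaceHistory("0/2", 0.40, 0.25, [quiet, [0] * 11 + [150]]),
    InterfaceHistory("0/3", 0.10, 0.00, [[1] * 12, [1] * 12, quiet]),
]
counts = tile_counts(interfaces)
print(counts)
```

In this made-up data, interface 0/2 is counted under Errors because of a single jump of 150 in the latest sample, while 0/3 is counted because two of its three recorded hours were flagged, even though its most recent hour was clean.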
If, after looking through all of this information, you’d like to see more detailed performance information, you can click on the switch name and a number of graphs showing detailed performance information will be displayed. One of these graphs, related to congestion, is shown below.
Currently, this graph displays a number of data points each representing the sum of “time spent at zero transmit credit” across all interfaces during each sampling interval. This can be useful when trying to understand if you’ve had a SAN event that could have impacted overall performance.
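As a rough illustration of how each data point is formed, the Python sketch below sums a per-interface “time spent at zero transmit credit” value for each sampling interval. The input layout, the timestamps and the numbers are made up for the example, and the unit of the counter is left unspecified.

```python
def congestion_series(samples):
    """Sum time at zero transmit credit across all interfaces per interval.

    samples maps an interval label to {interface name: time at zero credit}.
    """
    return {
        interval: sum(per_interface.values())
        for interval, per_interface in sorted(samples.items())
    }


# Made-up values for three 5-minute intervals on a three-port switch.
samples = {
    "10:00": {"0/1": 0, "0/2": 12, "0/3": 3},
    "10:05": {"0/1": 0, "0/2": 480, "0/3": 95},
    "10:10": {"0/1": 0, "0/2": 7, "0/3": 0},
}

series = congestion_series(samples)
peak_interval = max(series, key=series.get)
print(series, peak_interval)
```

A spike such as the one at 10:05 in this sample data is the kind of event the graph is meant to surface. Because the points are sums, the graph alone does not show which interfaces contributed most.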
What’s next?
While we believe we’ve been able to provide some useful functionality, we also know that there’s a lot more we could do going forward. I expect we’ll be working in three areas:
- Customer originated enhancement requests. We built this tool with our end users in mind, so if you think we’re missing something or have a great idea for a feature, let us know!
- Predictive Analytics related to optic health. Shortly after we GA, we will start looking into how to predict optic failure based on TX and RX optical power, bit errors and other link events.
- More granular performance information. We would like to make it possible for you to zoom into a chart (e.g., the Congestion chart above) and allow you to easily view the top contributors.
All that said, we’re just super excited to reach this milestone and we look forward to helping you get the most from your SAN.
Let us know how we can help!
Thanks for reading!