Solving Congestion with Storage I/O Performance Monitoring

By Paresh Gupta, Edward Mazurek
Sample Chapter is provided courtesy of Cisco Press
Date: Apr 17, 2024

Chapter Description

This sample chapter from Detecting, Troubleshooting, and Preventing Congestion in Storage Networks explains the use of storage I/O performance monitoring for handling network congestion problems and includes practical case studies.

I/O Flow Metrics

The I/O flow metrics collected by Cisco SAN Analytics can be classified into the following categories:

Flow identity metrics: These metrics identify a flow, such as switchport, initiator, target, LUN, or namespace.
Metadata metrics: The metadata metrics provide additional insights into the traffic. For example:
- VSAN count: Number of VSANs carrying traffic on a switchport.
- Initiator count: Number of initiators exchanging I/O operations behind a switchport.
- Target count: Number of targets exchanging I/O operations behind a switchport.
- IT flow count: Number of pairs of initiators and targets exchanging I/O operations via a switchport.
- TL and TN flow count: Number of pairs of targets and LUNs/namespaces behind a switchport exchanging I/O operations.
- ITL and ITN flow count: Number of combinations of initiators, targets, and LUNs/namespaces exchanging I/O operations via a switchport.
- Metric collection time: Start time and end time of the I/O flow metrics in a specific export. This metric identifies the precise interval over which a metric was calculated at the link.
Latency metrics: Latency metrics identify the total time taken to complete an I/O operation and the time taken to complete various steps of an I/O operation. For example:
- Exchange Completion Time (ECT): Total time taken to complete an I/O operation.
- Data Access Latency (DAL): Time taken by a target to send the first response to an I/O operation. DAL is one component of ECT that’s caused by the target.
- Host Response Latency (HRL): Time taken by an initiator to send the response after learning that the target is ready to receive data for a write I/O operation. HRL is one component of ECT that’s caused by the initiator.
Performance metrics: These metrics measure the performance of I/O operations. For example:
- IOPS: Number of read and write I/O operations completed per second.
- Throughput: Amount of data transferred by read and write operations, in bytes per second.
- Outstanding I/O: The number of read and write I/O operations that were initiated but are yet to be completed.
- I/O size: The amount of data requested by a read or write I/O operation.
Error metrics: The error metrics indicate errors in read and write I/O operations (for example, Aborts, Failures, Check condition, Busy condition, Reservation Conflict, Queue Full, LBA out of range, Not ready, and Capacity exceeded).

An exhaustive explanation of all these metrics is beyond the scope of this chapter. This chapter is just a starting point for using end-to-end I/O flow metrics in solving congestion and other storage performance issues.

Latency Metrics

Latency is a generic term to convey storage performance. But as Figure 5-5 and Figure 5-6 show, there are multiple latency metrics, each conveying a specific meaning. Latency metrics are measured in time (microseconds, milliseconds, and so on).

Figure 5-5 Latency Metrics for a Read I/O Operation

Figure 5-6 Latency Metrics for a Write I/O Operation

Exchange Completion Time

Exchange Completion Time (ECT) is the time taken to complete an I/O operation. It is a measure of the time difference between the command (CMND) frame and the response (RSP) frame. In Fibre Channel, an I/O operation is carried out by an exchange, and hence it’s called Exchange Completion Time, but ECT can also be known as I/O completion time.

ECT is an overall measure of storage performance. In general, the lower the ECT, the better. This is because lower ECTs result in improved application performance.

At the same time, ECT does not correlate directly with application performance because the relationship depends on the application I/O profile. In general, when application performance degrades and ECT increases (degrades) at the same time, slower I/O performance is the likely reason for the degradation.

Data Access Latency

Data Access Latency (DAL) is the time taken by a storage array to send the first response after receiving a command (CMND) frame. For a read I/O operation, DAL is calculated as the time difference between the command (CMND) frame and the first data (DATA) frame. For a write I/O operation, DAL is calculated as the time difference between the command (CMND) frame and the transfer-ready (XFER_RDY) frame.

When a target receives a read I/O operation, if the data requested is not in cache, the target must first read the data from the storage media, which takes time. The amount of time it takes to retrieve the data from the media depends on several factors, such as overall system utilization and the type of storage media being used. Likewise, when a target receives a write I/O operation, it must process all the other operations ahead of this operation, which takes time. An increase in these time values leads to a large DAL.

In most cases, it’s best to investigate DAL while troubleshooting higher ECT because DAL may explain why ECT increased. An increase in both ECT and DAL indicates a slowdown within the storage array.

Host Response Latency

Host Response Latency (HRL), for a write I/O operation, is the time taken by a host to send the data after receiving the transfer ready. It is calculated as the time difference between the transfer-ready (XFER_RDY) frame and the first data (DATA) frame.

Because read I/O operations do not have transfer ready, HRL is not calculated for them.

In most cases, it’s best to investigate HRL while troubleshooting higher write ECT because HRL may explain why ECT increased. An increase in both write ECT and HRL indicates a slowdown within the host.
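
The three latency definitions above can be sketched as simple timestamp differences. This is only an illustration of the arithmetic, not how Cisco SAN Analytics is implemented; the class, field, and function names are hypothetical, and the timestamps are made-up values in microseconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExchangeTimestamps:
    """Frame timestamps (microseconds) for one I/O operation at one switchport.

    Hypothetical structure for illustration; not a SAN Analytics data model.
    """
    cmnd: float                       # command (CMND) frame
    first_data: float                 # first data (DATA) frame
    rsp: float                        # response (RSP) frame
    xfer_rdy: Optional[float] = None  # transfer-ready (XFER_RDY), writes only


def ect(ts: ExchangeTimestamps) -> float:
    """Exchange Completion Time: CMND to RSP."""
    return ts.rsp - ts.cmnd


def dal(ts: ExchangeTimestamps) -> float:
    """Data Access Latency: CMND to first DATA (read) or to XFER_RDY (write)."""
    if ts.xfer_rdy is None:
        return ts.first_data - ts.cmnd
    return ts.xfer_rdy - ts.cmnd


def hrl(ts: ExchangeTimestamps) -> Optional[float]:
    """Host Response Latency: XFER_RDY to first DATA. Not defined for reads."""
    if ts.xfer_rdy is None:
        return None
    return ts.first_data - ts.xfer_rdy


# Made-up example timestamps.
read_io = ExchangeTimestamps(cmnd=0.0, first_data=150.0, rsp=200.0)
write_io = ExchangeTimestamps(cmnd=0.0, xfer_rdy=100.0, first_data=130.0, rsp=400.0)

for name, io in (("read", read_io), ("write", write_io)):
    print(name, "ECT:", ect(io), "DAL:", dal(io), "HRL:", hrl(io))
```

In this sketch, a read is recognized by the absence of a transfer-ready timestamp, which mirrors the point that HRL is not calculated for read I/O operations.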

Using Latency Metrics

The following are important details to remember about latency metrics, such as ECT, DAL, and HRL, when addressing congestion in a storage network:

A good way of using ECT is to monitor it for a long duration and find any deviations from the baseline. For example, consider two applications with average ECTs of 200 μs and 400 μs over a week. The I/O flow path of the first application gets congested, increasing its ECT to 400 μs. At this moment, both applications report the same ECT, yet only the first application may be degraded because its ECT is double its baseline, while the second application remains at its baseline and is unaffected.
ECT measures the overall storage performance, but it doesn’t convey the source of the delay, which can be the host, network, or storage array. The delay caused by the host is measured by HRL, whereas the delay caused by the storage array is measured by DAL.
The delay caused by the network may be the direct result of congestion. For example, when a host-connected switchport has high TxWait, the frames can’t be delivered to it in a timely fashion. As a result, the time taken to complete the I/O operations (ECT) increases.
Although an increase in TxWait (or a similar network congestion metric) increases ECT, the reverse may not be correct. ECT may increase even when the network isn’t congested. ECT is an end-to-end metric. It may increase due to delays caused by hosts, network, or storage. The block I/O stack within a host involves multiple layers. Similarly, an I/O operation undergoes many steps within a storage array. The delay caused by any of these layers increases ECT.
Network congestion is one of the reasons for higher ECT. However, it’s not the only reason. Other network issues may increase ECT even without congestion (for example, network traffic flowing through suboptimal paths, long-distance links, or poorly designed networks).
All latency metrics increase under network congestion. This increase is seen in all the I/O flows whose paths are affected by congestion.
With dual fabrics and active/active multipath, if only one fabric is congested, only the I/Os using the congested fabric report increases in ECT. The average ECT as reported by the host may or may not show this difference, depending on how much ECT degrades. For example, consider an application that measures I/O completion time (ECT) as 200 μs. The application accesses storage via Fabric-A and Fabric-B, with I/Os split evenly between them. ECT over Fabric-A is 180 μs, whereas ECT over Fabric-B is 220 μs. If Fabric-A becomes congested, resulting in an increase in ECT from 180 to 270 μs (50% deviation), the average ECT as measured by the application increases to 245 μs, which is only a 22.5% increase.
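
The dual-fabric example in the last point can be reproduced with a few lines of arithmetic. The sketch below assumes that I/Os are split evenly across the two fabrics, so the host-level ECT is a plain average; the function names are hypothetical.

```python
def average_ect(ect_by_fabric):
    """Host-level ECT, assuming I/Os are split evenly across fabrics."""
    return sum(ect_by_fabric.values()) / len(ect_by_fabric)


def percent_change(before, after):
    """Deviation of `after` from `before`, in percent."""
    return (after - before) * 100.0 / before


# ECT values in microseconds, taken from the example in the text.
baseline = {"Fabric-A": 180.0, "Fabric-B": 220.0}
congested = {"Fabric-A": 270.0, "Fabric-B": 220.0}

fabric_a_change = percent_change(baseline["Fabric-A"], congested["Fabric-A"])
host_change = percent_change(average_ect(baseline), average_ect(congested))

print(f"Fabric-A ECT deviation: {fabric_a_change:.1f}%")
print(f"Host average ECT: {average_ect(baseline):.0f} -> {average_ect(congested):.0f} us")
print(f"Host average ECT deviation: {host_change:.1f}%")
```

The per-fabric view shows a 50% deviation, while the averaged view shows less than half of that, which is why per-path metrics are worth checking when only one fabric is suspected.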

How can you verify if an increase in ECT for an application is because of congestion or not? Here are some suggestions:

Check the metrics for the ports (such as TxWait) in the end-to-end data path.
Check the ECT of the I/O flows that use the same network path as the switchport. If ECT increases just for one I/O flow but the rest of the I/O flows don’t show an increase, it is not a network congestion issue because the network doesn’t give preferential treatment to any I/O flow. A fabric understands only frames, and it treats all frames equally.
Investigate other metrics, like I/O size, IOPS, and so on. A common example is an increase in I/O size because larger I/O size operations take longer to complete. Also, find any SCSI and NVMe errors and link-level errors.
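
The second suggestion above can be sketched as a rough heuristic: compare each flow's current ECT with its own baseline, and see whether every flow sharing the path deviates or only some of them do. The 20% threshold, the flow names, and the ECT values are assumptions made for illustration, and the result should be read together with port metrics such as TxWait rather than on its own.

```python
def deviating_flows(baseline_ect, current_ect, threshold_pct=20.0):
    """Return the flows whose ECT rose more than threshold_pct above baseline."""
    return [
        flow
        for flow, base in baseline_ect.items()
        if (current_ect[flow] - base) * 100.0 / base > threshold_pct
    ]


def consistent_with_network_congestion(baseline_ect, current_ect, threshold_pct=20.0):
    """True only when every flow sharing the path shows an ECT increase."""
    if not baseline_ect:
        return False
    deviating = deviating_flows(baseline_ect, current_ect, threshold_pct)
    return len(deviating) == len(baseline_ect)


# Hypothetical flows sharing one network path; ECT in microseconds.
baseline = {"flow-1": 200.0, "flow-2": 400.0, "flow-3": 250.0}

all_slower = {"flow-1": 300.0, "flow-2": 600.0, "flow-3": 380.0}
one_slower = {"flow-1": 300.0, "flow-2": 405.0, "flow-3": 250.0}

print(consistent_with_network_congestion(baseline, all_slower))  # every flow deviates
print(consistent_with_network_congestion(baseline, one_slower))  # only flow-1 deviates
print(deviating_flows(baseline, one_slower))
```

When only one flow deviates, the next step is to look at that flow's I/O size, IOPS, and errors, as the third suggestion describes.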

The Location for Measuring Latency Metrics

Cisco SAN Analytics calculates latency metrics by taking the time difference between relevant frames on the analytics-enabled switchports on MDS switches. As a result, the absolute value of these metrics may differ by a few microseconds, depending on the exact location of the measurement. For example, the ECT reported by a storage-connected switchport may be a few microseconds lower than the ECT reported by a host-connected switchport. This is because the storage-connected switchport sees the command frame a few microseconds after the host-connected switchport does, and it sees the response frames a few microseconds earlier than the host-connected switchport. When the time difference between the command frame and the response frame on the storage port is considered, it comes out to be less than the time difference between the command frame and the response frame on the host-connected switchport.

This difference in the value of latency metrics based on the location of measurement is marginal. It may be a matter of discussion in an academic exercise, but in a real-world production environment, the difference is very small and doesn’t change the end result. Accounting for it only increases complexity and makes it harder for various teams to understand the low-level details.

What is more important is to understand that in lossless networks, congestion spreads from end to end quickly. If this congestion increases ECT by 50% on the storage-connected switchport, the same percentage increase will be seen on the host-connected port also, although the absolute values may differ.

What happens if the congestion is only severe enough that the effect is limited to storage ports or host ports? In production environments, the spread of congestion can’t be predicted. More importantly, if the congestion has not spread from end to end, it’s not severe enough to act on. In such cases, it is best to monitor and use the metrics for future planning, but without an end-to-end spread, the effect of congestion is limited to a small subset of the fabric.

Performance Metrics

Performance metrics convey the rate of I/O operations, their pattern, and the amount of data transferred.

I/O Operations per Second (IOPS)

IOPS, as its name suggests, is the number of read or write I/O operations per second. Typically, IOPS is a function of the application I/O profile and the type of storage. For example, transactional applications have higher IOPS requirements than do backup applications. Also, SSDs provide higher IOPS than do HDDs.

It is not possible to infer the network traffic directly from IOPS. An I/O operation may result in a few or many frames, depending on the data transferred by that I/O operation. Likewise, the throughput caused by I/O operations depends on the amount of data transferred by those I/O operations. Hence, it’s difficult to predict the effect of higher IOPS on network congestion without accounting for I/O size, explained next.

On the other hand, network congestion typically results in reduced IOPS because the network can’t deliver frames to their destinations in a timely fashion and therefore transfers fewer frames.

I/O Size

The amount of data transferred by an I/O operation is known as its I/O size. I/O size is a function of the application’s I/O profile. For example, a transactional application may have an I/O size of 4 KB, whereas a backup job may use an I/O size of 1 MB.

This I/O size metric in the context of storage I/O performance monitoring or SAN Analytics is different from the amount of data that an application wants to transfer as part of an application-level transaction or operation. For example, an application may want to transfer 1 MB of data, but the host may decide to request this data using four I/O operations, each of size 256 KB. This difference is worth understanding, especially while investigating various layers within a host.

I/O size is encoded in the command frame of I/O operations. It has no dependency on network health. As a result, I/O size doesn’t change with or without congestion.

Large I/O size results in a higher number of frames, which in turn leads to higher network throughput. For example, a 2 KB read I/O operation results in just one Fibre Channel data frame of size 2 KB, whereas a 64 KB read I/O operation results in 32 Fibre Channel frames of size 2 KB. Because of this, I/O size directly affects the network link utilization and thus provides insights into why a host port or a host-connected switchport may be highly utilized. For example, a host link may not be highly utilized with an I/O size of 16 KB. But the same link may get highly utilized and thus become the source of congestion when the I/O size spikes to 1 MB.
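
The relationship between I/O size and frame count can be sketched as follows. The sketch uses the same simplification as the text, a data payload of 2 KB per Fibre Channel frame, and the function name is hypothetical.

```python
FRAME_PAYLOAD_BYTES = 2 * 1024  # 2 KB per data frame, the simplification used in the text
KB = 1024
MB = 1024 * KB


def data_frames(io_size_bytes, payload_bytes=FRAME_PAYLOAD_BYTES):
    """Number of data frames needed to carry one I/O operation (rounded up)."""
    return (io_size_bytes + payload_bytes - 1) // payload_bytes


for label, size in (("2 KB", 2 * KB), ("16 KB", 16 * KB), ("64 KB", 64 * KB), ("1 MB", 1 * MB)):
    print(f"{label} I/O -> {data_frames(size)} data frame(s)")
```

A jump from 16 KB to 1 MB I/O size multiplies the number of data frames per I/O operation by 64, even if IOPS stays the same.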

To understand the effect of I/O size on link utilization, consider the example in Figure 5-7. Two hosts, Host-1 and Host-2, connect to the switchports at 8 GFC to access storage from multiple arrays. Both hosts are doing 10,000 read I/O operations per second (IOPS). However, the I/O sizes used by the two hosts are different. Host-1 uses an I/O size of 4 KB, whereas Host-2 uses an I/O size of 128 KB.

Figure 5-7 Detecting and Predicting the Cause of Congestion Using I/O Size

Host-1, with 10,000 IOPS and 4 KB I/O size, results in a throughput of 40 MBps, whereas Host-2, with 10,000 IOPS and 128 KB I/O size, results in a throughput of 1280 MBps. Clearly, 1280 MBps can’t be transported via an 8 GFC link because its maximum data rate is 800 MBps. As a result, Host-2’s read I/O traffic causes congestion due to overutilization. Host-1 doesn’t cause congestion even though its read IOPS is the same as Host-2’s. I/O size is the differentiating factor here.
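
The Host-1 and Host-2 comparison can be worked through in a short sketch. It follows the rounding used in the text (1 MB taken as 1000 KB) and the 800 MBps maximum data rate quoted for 8 GFC; the function and variable names are hypothetical.

```python
LINK_MAX_MBPS = 800.0  # maximum data rate of an 8 GFC link, as stated in the text


def throughput_mbps(iops, io_size_kb):
    """Throughput = IOPS x I/O size, with 1 MB taken as 1000 KB as in the text."""
    return iops * io_size_kb / 1000.0


def utilization_pct(mbps, link_max_mbps=LINK_MAX_MBPS):
    """Requested throughput as a percentage of the link's maximum data rate."""
    return mbps * 100.0 / link_max_mbps


hosts = {"Host-1": (10_000, 4), "Host-2": (10_000, 128)}  # (read IOPS, I/O size in KB)

for name, (iops, io_size_kb) in hosts.items():
    mbps = throughput_mbps(iops, io_size_kb)
    pct = utilization_pct(mbps)
    verdict = "exceeds link capacity" if pct > 100.0 else "fits within link capacity"
    print(f"{name}: {mbps:.0f} MBps, {pct:.0f}% of 8 GFC, {verdict}")
```

Both hosts have identical IOPS, so the difference in the result comes entirely from the I/O size term.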

Throughput

Throughput is a generic term that has different meanings for different people. For measuring storage performance, throughput is measured as the amount of data transferred by I/O operations, in megabytes per second (MBps). On the other hand, for measuring network performance, throughput is measured in frames transferred per second and the amount of data transferred by those frames, in gigabits per second (Gbps).

Another important detail to remember is that the read and write I/O throughput may have a marginal difference when measured on the end devices versus on the network. Applications measure the total amount of data that they exchange with the storage volumes. However, the network throughput differs slightly because I/O operations have headers, such as Fibre Channel headers and SCSI/NVMe headers. For all practical purposes, this marginal difference can be ignored. Be aware that the throughput reported by various entities may differ, but don’t get carried away by these marginal differences.

Outstanding I/O

Outstanding I/O is the number of I/O operations that were initiated but are yet to be completed. In other words, an initiator sent a command frame, but it hasn’t received a response frame yet. Outstanding I/O is also known as open I/O or active I/O.
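
The definition above amounts to a running count: each command frame opens an I/O operation, and each response frame closes one. The sketch below illustrates that count over a made-up sequence of frames; the event labels and function name are hypothetical and do not represent a SAN Analytics export format.

```python
def outstanding_io_timeline(events):
    """Return the outstanding I/O count after each observed frame.

    `events` is a sequence of "CMND" or "RSP" labels in the order the
    frames were seen (a hypothetical input format for illustration).
    """
    outstanding = 0
    timeline = []
    for event in events:
        if event == "CMND":
            outstanding += 1   # an I/O operation was initiated
        elif event == "RSP":
            outstanding -= 1   # an I/O operation was completed
        else:
            raise ValueError(f"unexpected event: {event!r}")
        timeline.append(outstanding)
    return timeline


# Made-up frame sequence: new commands arrive while earlier ones are still open.
frames = ["CMND", "CMND", "RSP", "CMND", "CMND", "RSP", "RSP", "RSP"]
timeline = outstanding_io_timeline(frames)

print(timeline)
print("peak outstanding I/O:", max(timeline))
```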

In production environments, new I/Os are always being originated while previous I/Os are being completed because applications may be multithreaded or multiprocessed. Also, keeping some I/O operations open helps boost performance.

Outstanding I/O is directly related to the queue-depth value on a host as well as similar values on storage arrays. Different entities have different thresholds for outstanding I/O. For example, a host may stop initiating new I/O operations when the outstanding I/O reaches a threshold, such as 32. Likewise, a target may reject new incoming I/O operations when a large number of I/O operations (such as 2048) are already open (or outstanding), and the target is still processing them.

Congestion in a storage network may be a side effect of a large number of outstanding I/O operations.
