
Monitoring Disk Command Aborts on ESXi: Identifying Storage Overload | Lazy Admin Blog


When your storage subsystem is severely overloaded, it cannot process commands within the timeout defined by the Guest Operating System. The result? Disk Command Aborts. Windows VMs typically abort a command after waiting 60 seconds for a response from the storage array.

Aborted commands are a critical red flag indicating that your storage hardware is overwhelmed and unable to meet the host’s performance expectations. Monitoring this parameter is essential for proactive datacenter management.

Here is how you can track these aborts using two primary methods: the vSphere Client and esxtop.


💻 Method 1: vSphere Client (Graphical Interface)

This method provides a visual, historical look at command aborts across your infrastructure.

  1. Navigate to Hosts and Clusters.
  2. Select the object you want to monitor (Host or Cluster).
  3. Click on the Monitor tab, then Performance, and select Advanced.
  4. Click Chart Options.
  5. Switch the metric grouping to Disk.
  6. Select Commands aborted from the list of measurements.
  7. Click OK.

🛠️ Method 2: esxtop (Command Line Interface)

For real-time, granular troubleshooting, esxtop is the definitive tool. Its ABRTS/s (Aborts per Second) field reports SCSI command aborts for each device.

Steps to Configure esxtop for Aborts:

  1. Open PuTTY (or any other SSH client) and log in to your ESXi host via SSH.
  2. Type esxtop and press Enter.
  3. Type u to switch to the Disk Device view.
  4. Type f to change the field settings.
  5. Type L to select Error stats.
  6. Press Enter, then press W to save these settings for future sessions.

You will now see the ABRTS/s column. This number is the average number of SCSI commands aborted per second during the collection interval (esxtop refreshes every 5 seconds by default).
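If you would rather capture these numbers unattended, esxtop also has a batch mode: `esxtop -b -d 5 -n 12 > aborts.csv` writes 12 samples at 5-second intervals as CSV. The Python sketch below pulls the abort columns out of such a file. Treat it as a starting point only: the helper name `columns_matching` is made up for this example, and the header text you search for (for instance "Aborts/sec") is an assumption, so check the header row of your own capture first.

```python
import csv
import io


def columns_matching(csv_text, needle):
    """Return {header: [float, ...]} for every column whose header contains needle.

    The match is case-insensitive. Cells that are empty or non-numeric are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row of the capture holds the counter names
    wanted = [i for i, name in enumerate(header) if needle.lower() in name.lower()]
    result = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            if i < len(row) and row[i] != "":
                try:
                    result[header[i]].append(float(row[i]))
                except ValueError:
                    pass  # ignore cells that are not numbers
    return result
```

Usage would look like `columns_matching(open("aborts.csv").read(), "Aborts/sec")`, which gives you one list of samples per device column to inspect or graph.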


📈 Thresholds and Interpretation

If you are deploying a monitoring tool, the critical threshold for ABRTS/s is 1. A value of 1 or higher means SCSI commands are actively being aborted by the guest OS because the storage is not responding.

What is Ideal?

In an ideal scenario, ABRTS/s should always be 0.

What is Real-World?

In a busy production environment, you may see this value fluctuate between 0 and a fraction below 1 (0.xx). This typically happens during “peak hours,” for instance when multiple servers on the host run disk-intensive backups at the same time and temporarily saturate the storage. However, any sustained value of 1 or higher requires immediate investigation into path failures, array congestion, or complete storage unresponsiveness.
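If you are wiring this into a monitoring tool, the rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_abort_rate`, the labels it returns, and the `sustained=3` window (how many consecutive samples count as "consistent") are assumptions for this example, not values from VMware.

```python
def classify_abort_rate(samples, threshold=1.0, sustained=3):
    """Classify a series of ABRTS/s samples.

    Returns:
      "critical" if at least `sustained` consecutive samples are >= threshold,
      "watch"    if any sample is above 0 but the critical rule did not fire,
      "ok"       if every sample is 0.
    """
    run = 0  # length of the current streak of samples at or above the threshold
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= sustained:
            return "critical"
    if any(value > 0 for value in samples):
        return "watch"
    return "ok"
```

For example, `classify_abort_rate([0, 0.2, 0])` returns `"watch"` (a brief backup-window blip), while `classify_abort_rate([0, 1, 2, 1.5])` returns `"critical"`.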

Troubleshooting Storage Latency with esxtop: The Admin’s Guide


When “the server is slow,” the storage subsystem is usually the first suspect. While vCenter performance charts are great for history, esxtop gives you real-time data from the heart of the hypervisor.

🛠️ How to Configure esxtop for Storage Monitoring

You can monitor performance at three different levels depending on where you suspect the issue lies.

1. Per-HBA (Host Bus Adapter) Mode

  • Command: Type esxtop, then press d.
  • Tip: Press Shift + L and enter 36 to see the full device names.
  • Fields: Press f and ensure b, c, d, e, h, and j are selected.

2. Per-LUN (Device) Mode

  • Command: Type esxtop, then press u.
  • Why use this? To see if a specific volume on your SAN is being hammered.

3. Per-VM (Virtual Machine) Mode

  • Command: Type esxtop, then press v.
  • Why use this? To identify the “noisy neighbor”: the specific VM that is consuming all the IOPS.

🔍 Analyzing the “Big Three” Latency Columns

To understand storage health, you must look at these three columns. They tell you exactly where the delay is happening.

Column | Name           | What it represents                                     | Threshold
DAVG   | Device Latency | Time spent at the hardware level (HBA + SAN).          | < 10 ms
KAVG   | Kernel Latency | Time spent inside the VMware VMkernel.                 | < 1 ms
GAVG   | Guest Latency  | Total latency perceived by the Guest OS (DAVG + KAVG). | < 10 ms

What the numbers are telling you:

  • High DAVG: The problem is external to ESXi. Check your SAN controllers, disk spindles, or fabric switches.
  • High KAVG: The problem is inside the host. This usually means the host is overloaded or there is a queueing issue (e.g., Disk.SchedNumReqOutstanding is too low).
  • High GAVG: Your users are feeling the pain. If this exceeds 10–15ms consistently, application performance will suffer.
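To make the reading of these columns concrete, here is a small Python sketch that applies the thresholds from the table to one sample. The function name `diagnose_latency` and the short labels it returns are invented for this example. The limits are simply the ones quoted above, so adjust them to whatever your environment treats as acceptable.

```python
# Healthy limits in milliseconds, taken from the table above.
DAVG_LIMIT_MS = 10.0
KAVG_LIMIT_MS = 1.0
GAVG_LIMIT_MS = 10.0


def diagnose_latency(davg_ms, kavg_ms):
    """Return (gavg_ms, findings) for one esxtop latency sample.

    findings lists which layers are at or over their limit:
      "device" -> look outside ESXi (SAN controllers, spindles, fabric)
      "kernel" -> look inside the host (overload, queueing)
      "guest"  -> total latency is high enough for users to notice
    """
    gavg_ms = davg_ms + kavg_ms  # GAVG = DAVG + KAVG
    findings = []
    if davg_ms >= DAVG_LIMIT_MS:
        findings.append("device")
    if kavg_ms >= KAVG_LIMIT_MS:
        findings.append("kernel")
    if gavg_ms >= GAVG_LIMIT_MS:
        findings.append("guest")
    return gavg_ms, findings
```

A sample with DAVG 12.0 and KAVG 0.5 yields `(12.5, ["device", "guest"])`, pointing at the array rather than the host. A sample with DAVG 3.0 and KAVG 2.0 yields `(5.0, ["kernel"])`, pointing at queueing inside ESXi.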

⚠️ When to Panic: Timeouts and Logs

If a command's response time exceeds 5000 ms (5 seconds), ESXi will abort the command. If you see latency climbing toward that level in esxtop, immediately check your logs for SCSI aborts:

  • ESXi 5.x/6.x/7.x/8.x: /var/log/vmkernel.log
  • Legacy ESX 3.5/4.x: /var/log/vmkernel
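Once you have a copy of the log, a quick filter saves a lot of scrolling. The Python sketch below keeps only the lines that mention a keyword. The function name `find_abort_lines` is made up for this example, and the default keyword "abort" is an assumption: the exact wording of abort messages varies by ESXi version and driver, so widen or narrow the search to match what your logs actually contain.

```python
def find_abort_lines(lines, keyword="abort"):
    """Return the lines that contain keyword (case-insensitive), without trailing newlines."""
    key = keyword.lower()
    return [line.rstrip("\n") for line in lines if key in line.lower()]
```

For example, `with open("vmkernel.log", errors="replace") as log: hits = find_abort_lines(log)` collects the matching lines from a local copy of the log.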
