Performance Tuning

Monitoring Disk Command Aborts on ESXi: Identifying Storage Overload | Lazy Admin Blog

When your storage subsystem is severely overloaded, it cannot complete commands within the timeout defined by the guest operating system. The result? Disk command aborts. Windows VMs typically abort a command after 60 seconds without a response from the storage array.

Aborted commands are a critical red flag indicating that your storage hardware is overwhelmed and unable to meet the host’s performance expectations. Monitoring this parameter is essential for proactive datacenter management.

Here is how you can track these aborts using two primary methods: the vSphere Client and esxtop.


💻 Method 1: vSphere Client (Graphical Interface)

This method provides a visual, historical look at command aborts across your infrastructure.

  1. Navigate to Hosts and Clusters.
  2. Select the object you want to monitor (Host or Cluster).
  3. Click on the Monitor tab, then Performance, and select Advanced.
  4. Click Chart Options.
  5. Switch the metric grouping to Disk.
  6. Select Commands aborted from the list of measurements.
  7. Click OK.

🛠️ Method 2: esxtop (Command Line Interface)

For real-time, granular troubleshooting, esxtop is the definitive tool. Its ABRTS/s (Aborts per Second) field reports SCSI command aborts for each device.

Steps to Configure esxtop for Aborts:

  1. Open PuTTY (or any other SSH client) and log in to your ESXi host via SSH.
  2. Type esxtop and press Enter.
  3. Type u to switch to the Disk Device view.
  4. Type f to change the field settings.
  5. Type L to select Error stats.
  6. Press Enter, then press W to save these settings for future sessions.

You will now see the ABRTS/s column. This number is the rate of SCSI commands aborted by guest VMs per second, averaged over the esxtop refresh interval (5 seconds by default).
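For unattended collection, esxtop can also run in batch mode (for example `esxtop -b -d 5 -n 12 > aborts.csv`) and the resulting CSV can be scanned for abort counters. The sketch below is a minimal example under stated assumptions: the column suffix `Aborts/sec`, the `Physical Disk SCSI Device(...)` naming, the host name `esx01`, and the device name `naa.example1` are illustrative, so check them against the header row of your own capture.

```python
import csv
import io

def abort_rates(csv_text, counter="Aborts/sec"):
    """Return {device: [rate, ...]} for each disk-device column ending in `counter`.

    The counter name is an assumption about the batch-mode header; adjust it
    to match the header row of your own esxtop CSV.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {}
    for idx, name in enumerate(header):
        if "Physical Disk SCSI Device(" in name and name.endswith("\\" + counter):
            # The device identifier is the text between the parentheses.
            columns[idx] = name.split("(", 1)[1].rsplit(")", 1)[0]
    rates = {device: [] for device in columns.values()}
    for row in reader:
        if len(row) < len(header):
            continue  # skip blank or truncated rows
        for idx, device in columns.items():
            rates[device].append(float(row[idx]))
    return rates

# Hypothetical three-sample capture for one device.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Physical Disk SCSI Device(naa.example1)\Commands/sec","\\esx01\Physical Disk SCSI Device(naa.example1)\Aborts/sec"',
    '"01/01/2024 00:00:00","120.0","0.0"',
    '"01/01/2024 00:00:05","450.0","0.4"',
    '"01/01/2024 00:00:10","30.0","2.0"',
])

for device, values in abort_rates(SAMPLE).items():
    print(device, "max aborts/sec:", max(values))
```

Parsing by column suffix keeps the script independent of how many devices the host presents.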


📈 Thresholds and Interpretation

If you are deploying a monitoring tool, the critical threshold for ABRTS/s is 1. A value of 1 or higher means SCSI commands are actively being aborted by the guest OS because the storage is not responding.

What is Ideal?

In an ideal scenario, ABRTS/s should always be 0.

What is Real-World?

In a busy production environment, you may see this value fluctuate between 0 and a fraction of 1 (0.xx). This typically happens during peak hours, for instance when multiple VMs on the host run disk-intensive backup operations simultaneously and temporarily saturate the storage. However, any sustained value of 1 or higher requires immediate investigation into path failures, array congestion, or complete storage unresponsiveness.
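If you feed ABRTS/s samples into a monitoring script, the interpretation above could be encoded roughly as follows. This is only a sketch: the function name, the state labels, and the `sustained` parameter (how many consecutive samples at or above the threshold count as "sustained") are hypothetical choices, not values from VMware.

```python
def classify_abrts(samples, critical=1.0, sustained=3):
    """Classify a series of ABRTS/s samples.

    "ok"       - every sample is 0 (the ideal case)
    "critical" - at least `sustained` consecutive samples are >= `critical`
    "watch"    - anything else (fractional values or isolated spikes)

    `sustained=3` is an illustrative default; use 1 to alert on the first
    sample that reaches the threshold.
    """
    run = longest = 0
    for value in samples:
        run = run + 1 if value >= critical else 0
        longest = max(longest, run)
    if longest >= sustained:
        return "critical"
    if all(value == 0 for value in samples):
        return "ok"
    return "watch"

print(classify_abrts([0, 0, 0]))            # all zero
print(classify_abrts([0, 0.2, 0.05, 0]))    # fractional peak-hour values
print(classify_abrts([0, 1.5, 2.0, 1.0]))   # three consecutive samples >= 1
```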

Understanding Processor Queue Length

In simple terms, Processor Queue Length is the “waiting room” for your CPU. It represents the number of threads that are ready to be processed but are currently stuck waiting because the CPU is already busy handling other tasks.

🚦 The Core Concept: Threads in Waiting

Every action on your server, whether it’s a database query or a system background task, is broken down into threads. Each logical processor runs only one thread at a time, so when more threads are ready to run than there are processors available, the extra threads line up in the Processor Queue.

📉 Identifying a Bottleneck

A high CPU utilization percentage (e.g., 90%) doesn’t always mean there is a problem. The true indicator of a performance bottleneck is a sustained or recurring queue.

  • The Golden Rule: A sustained queue of more than two threads per processor is a clear symptom of a bottleneck.
  • The Exception: Queues can develop even when CPU utilization is below 90% if the requests are random and the processing time for each thread varies wildly.
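The golden rule above can be checked against captured counter data. One way to collect it on Windows is `typeperf "\System\Processor Queue Length" -si 1 -sc 60 -o queue.csv`. The sketch below parses that CSV and tests for a sustained queue; treating "sustained" as "at least 80% of samples over the limit" is an assumption made for illustration (`min_fraction`), and the server name and values in the sample are made up.

```python
import csv
import io

def queue_samples(csv_text):
    """Extract the numeric queue-length column from typeperf CSV output."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [float(row[1]) for row in reader if len(row) > 1 and row[1].strip()]

def is_bottleneck(samples, processors, per_cpu_limit=2, min_fraction=0.8):
    """True if the queue exceeds per_cpu_limit * processors in at least
    min_fraction of the samples (min_fraction is a hypothetical cutoff)."""
    if not samples:
        return False
    limit = per_cpu_limit * processors
    over = sum(1 for value in samples if value > limit)
    return over / len(samples) >= min_fraction

# Hypothetical five-second capture.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0)","\\SERVER01\System\Processor Queue Length"',
    '"01/01/2024 10:00:01.000","9.000000"',
    '"01/01/2024 10:00:02.000","10.000000"',
    '"01/01/2024 10:00:03.000","3.000000"',
    '"01/01/2024 10:00:04.000","11.000000"',
    '"01/01/2024 10:00:05.000","12.000000"',
])

samples = queue_samples(SAMPLE)
print("4 processors:", is_bottleneck(samples, processors=4))
print("8 processors:", is_bottleneck(samples, processors=8))
```

The same queue that signals a bottleneck on a four-processor host is within limits on an eight-processor one, which is why the threshold is expressed per processor.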

🔍 How to Troubleshoot a High Queue

If you notice frequent queueing, you need to dig into the specific processes causing the backlog.

  1. Check % Processor Time: Identify which specific processes are eating up CPU cycles.
  2. Monitor Thread Patterns: Use Performance Monitor (PerfMon) to see if a single process is spawning too many threads.
  3. Evaluate Priorities: Check if certain low-priority tasks are holding up high-priority ones. While you can adjust base priorities in Task Manager, this is usually a “band-aid” fix, not a permanent solution.

🖥️ Multiprocessor Systems: Calculating the Limit

The acceptable queue length scales with your hardware. To find your target range, multiply the number of processors (or cores) by the per-processor range, which is 1 to 3 ready threads on a busy system.

System Type      | Typical Usage (0–10% CPU) | Busy System (80–90% CPU)
Single Processor | 0 to 1 threads            | 1 to 3 threads
Dual Processor   | 0 to 1 threads            | 2 to 6 threads
Quad Processor   | 0 to 1 threads            | 4 to 12 threads
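The busy-system figures follow from multiplying the processor count by a per-processor range of 1 to 3 threads. A minimal sketch of that arithmetic, using a hypothetical helper name, extends it to other processor counts:

```python
def busy_queue_range(processors, low_per_cpu=1, high_per_cpu=3):
    """Return the (low, high) queue length expected on a busy system."""
    return (processors * low_per_cpu, processors * high_per_cpu)

for count in (1, 2, 4, 8):
    low, high = busy_queue_range(count)
    print(f"{count} processor(s): {low} to {high} threads")
```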

Note: For servers, also keep an eye on the Server Work Queues\Queue Length counter, which specifically tracks requests waiting for the server service.
