The processor queue is a collection of one or more threads that are ready to run but cannot, because another thread is currently running on the processor. The clearest symptom of a processor bottleneck is a sustained or recurring queue of more than two threads. Although queues are most likely to develop when the processor is very busy, they can develop when utilization is well below 90 percent. This can happen if requests for processor time arrive randomly and if threads demand irregular amounts of time from the processor.
If queues occur frequently, you need to investigate the processes that are running when threads collect in the queue.
To determine this:
- Identify the processes that are consuming processor time. Determine whether a single process or multiple processes are active during a bottleneck. Running processes appear in the Instance box when you select the Process\% Processor Time counter. For more information, see “Processes in a Bottleneck” later in this chapter.
- Scrutinize the processor-intensive processes. Determine how many threads run in the process and watch the patterns of thread activity during a bottleneck.
- Evaluate the priorities at which the process and its threads run. You might be able to eliminate a bottleneck merely by adjusting the base priority of the process or the current priorities of its threads. However, Microsoft does not recommend this as a long-term solution. Use Task Manager to find the base priority of the process.
Different guidelines apply for queue lengths on multiprocessor systems. For busy systems (those having processor utilization in the 80 to 90 percent range) that use thread scheduling, the queue length should range from one to three threads per processor. For example, on a four-processor system, the expected range of processor queue length on a system with high CPU activity is 4 to 12.
On systems with lower CPU utilization, the processor queue length is typically 0 or 1.
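The queue-length guideline above can be sketched as a small check. This is only an illustration: the function names are hypothetical, and treating the "0 or 1" figure for lightly loaded systems as a system-wide value (not per processor) is an assumption.

```python
from typing import Tuple


def expected_queue_range(processors: int, busy: bool = True) -> Tuple[int, int]:
    """Return the (low, high) queue-length guideline for the whole system."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if busy:
        # Busy systems (80 to 90 percent utilization): 1 to 3 threads per processor.
        return (processors, 3 * processors)
    # Lower utilization: the queue length is typically 0 or 1.
    return (0, 1)


def exceeds_guideline(queue_length: int, processors: int, busy: bool = True) -> bool:
    """True when the observed queue is longer than the guideline's upper bound."""
    _, high = expected_queue_range(processors, busy)
    return queue_length > high


print(expected_queue_range(4))    # (4, 12), matching the four-processor example
print(exceeds_guideline(15, 4))   # True: 15 is above the upper bound of 12
```

A sustained reading above the upper bound is the signal to start examining the processes described earlier; a single spike is not.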
There are other objects that track processor queue length. The Server Work Queues\Queue Length counter reports the number of requests in the queue for the processor on the selected server.
Syslog Server storage calculation:
I want to modify the settings so that the maximum log size increases from 2 MB to 10 MB and logs rotate after 40 files instead of 20. This requires some planning to confirm that there is enough free space.
Count hosts: 100
Current size log max: 2 MB
Current rotation count: 20
Total possible MB used: 100 x 2 x 20 = 4,000 MB (4 GB)
Count hosts: 100
Desired size log max: 10 MB
Desired rotation count: 40
Total possible MB used: 100 x 10 x 40 = 40,000 MB (40 GB)
So the drive where the logs are stored would need 40 GB free in the above example to be able to service future demands.
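The same arithmetic can be scripted so it is easy to rerun with other values. This is a minimal sketch: the function name is hypothetical, and it uses the same rounding as the example (1,000 MB per GB) and the same assumption that each host keeps at most the rotation count of full-size files.

```python
def max_syslog_storage_mb(hosts: int, max_log_size_mb: int, rotation_count: int) -> int:
    """Worst-case space used if every host fills every rotated log file."""
    return hosts * max_log_size_mb * rotation_count


current_mb = max_syslog_storage_mb(hosts=100, max_log_size_mb=2, rotation_count=20)
desired_mb = max_syslog_storage_mb(hosts=100, max_log_size_mb=10, rotation_count=40)

print(f"Current worst case: {current_mb:,} MB ({current_mb / 1000:g} GB)")
print(f"Desired worst case: {desired_mb:,} MB ({desired_mb / 1000:g} GB)")
```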
How to modify the VMware Syslog Collector configuration after it is installed:
- Make a backup of the file:
vCenter Server 5.5 and lower: %PROGRAMDATA%\VMware\VMware Syslog Collector\vmconfig-syslog.xml
vCenter Server 6.0: %PROGRAMDATA%\VMware\vCenterServer\cfg\vmsyslogcollector\config.xml
- Open the copied file using a text editor.
- Under <defaultValues>, change any of the options to the required values. For example, to increase the log file size to 10 MB and to decrease the number of files retained to 20, modify the attributes:
<defaultValues>
</defaultValues>
Note: This configuration in vCenter Server overrides the ESXi host configuration file.
- Save and close the file.
- Stop the VMware Syslog Collector service.
- Remove the file:
vCenter Server 5.5 and lower: %PROGRAMDATA%\VMware\VMware Syslog Collector\vmconfig-syslog.xml
vCenter Server 6.0: %PROGRAMDATA%\VMware\vCenterServer\cfg\vmsyslogcollector\config.xml
- Rename the copy of the modified file to:
vCenter Server 5.5 and lower: %PROGRAMDATA%\VMware\VMware Syslog Collector\vmconfig-syslog.xml
vCenter Server 6.0: %PROGRAMDATA%\VMware\vCenterServer\cfg\vmsyslogcollector\config.xml
- Start the VMware Syslog Collector service. It may be required to restart the syslog service on the ESXi host if logs are no longer updating on the Syslog Server. To restart the syslog service, see VMware ESXi 5.x host stops sending syslogs to remote server (2003127).
The maximum supported number of hosts for use with each vSphere Syslog Collector instance is 30. However, depending on the load generated by your environment, you may encounter issues below this number.
To work around this issue, you can deploy multiple instances of vSphere Syslog Collector on separate Windows machines, which allows you to distribute the load.
Restarting the Management agents on ESXi
To restart the management agents on ESXi:
From the Direct Console User Interface (DCUI):
- Connect to the console of your ESXi host.
- Press F2 to customize the system.
- Log in as root.
- Use the Up/Down arrows to navigate to Restart Management Agents.
Note: In ESXi 4.1 and ESXi 5.0, 5.1, 5.5 and 6.0 this option is available under Troubleshooting Options.
- Press F11 to restart the services.
- When the service has been restarted, press Enter.
- Press Esc to log out of the system.
From the Local Console or SSH:
- Log in to SSH or Local console as root.
- Run these commands:
Note: In ESXi 4.x, run this command to restart the vpxa agent:
service vmware-vpxa restart
- To reset the management network on a specific VMkernel interface, by default vmk0, run the command:
esxcli network ip interface set -e false -i vmk0; esxcli network ip interface set -e true -i vmk0
Note: Using a semicolon (;) between the two commands ensures the VMkernel interface is disabled and then re-enabled in succession. If the management interface is not running on vmk0, change the above command according to the VMkernel interface used.
- To restart all management agents on the host, run the command:
- Check if LACP is enabled on DVS for version 5.x and above. For more information, see the vSphere 5.0 Networking Guide.
- If LACP is enabled and configured, do not restart management services using the services.sh script. Instead, restart independent services using the /etc/init.d/module restart command.
- If the issue is not resolved, and you have to restart all the services that are a part of the services.sh script, schedule downtime before running the script.
Note: For more information about restarting the management service on an ESXi host, see Service mgmt-vmware restart may not restart hostd in ESX/ESXi (1005566).
Get the hardware serial number by running this command from a PuTTY (SSH) session:
esxcfg-info | grep "Serial N"
Alternatively, type the following command from the command line on the service console to get vendor details and serial number information:
/usr/sbin/dmidecode | grep -A4 "System Information"
These are referred to as Path Selection Plug-ins (PSP), and are also called Path Selection Policies.
These pathing policies can be used with VMware ESXi 5.x and ESXi/ESX 4.x:
- Most Recently Used (MRU): Selects the first working path, discovered at system boot time. If this path becomes unavailable, the ESXi/ESX host switches to an alternative path and continues to use the new path while it is available. This is the default policy for Logical Unit Numbers (LUNs) presented from an Active/Passive array. ESXi/ESX does not return to the previous path if, or when, it returns; it remains on the working path until it, for any reason, fails.
The preferred flag, while sometimes visible, is not applicable to the MRU pathing policy and can be disregarded.
- Fixed (Fixed): Uses the designated preferred path flag, if it has been configured. Otherwise, it uses the first working path discovered at system boot time. If the ESXi/ESX host cannot use the preferred path or it becomes unavailable, the ESXi/ESX host selects an alternative available path. The host automatically returns to the previously defined preferred path as soon as it becomes available again. This is the default policy for LUNs presented from an Active/Active storage array.
- Round Robin (RR): Uses an automatic path selection rotating through all available paths, enabling the distribution of load across the configured paths.
- For Active/Passive storage arrays, only the paths to the active controller will be used in the Round Robin policy.
- For Active/Active storage arrays, all paths will be used in the Round Robin policy.
Note: For logical units associated with Microsoft Cluster Service (MSCS) and Microsoft Failover Clustering virtual machines, the Round Robin pathing policy is supported only on ESXi 5.5 and later.
- Fixed path with Array Preference: The VMW_PSP_FIXED_AP policy was introduced in ESXi/ESX 4.1. It works for both Active/Active and Active/Passive storage arrays that support Asymmetric Logical Unit Access (ALUA). This policy queries the storage array for the preferred path based on the array’s preference. If no preferred path is specified by the user, the storage array selects the preferred path based on specific criteria.
The VMW_PSP_FIXED_AP policy has been removed from ESXi 5.0. For ALUA arrays in ESXi 5.0, the MRU Path Selection Policy (PSP) is normally selected but some storage arrays need to use Fixed. To check which PSP is recommended for your storage array, see the Storage/SAN section in the VMware Compatibility Guide or contact your storage vendor.
- These pathing policies apply to VMware’s Native Multipathing (NMP) Path Selection Plug-ins (PSP). Third-party PSPs have their own restrictions.
- Round Robin is not supported on all storage arrays. Please check with your array documentation or storage vendor to verify that Round Robin is supported and/or recommended for your array and configuration. Switching to an unsupported or undesirable pathing policy can result in connectivity issues to the LUNs (in a worst-case scenario, this can cause an outage).
Warning: VMware does not recommend changing the LUN policy from MRU, as the automatic selection of the pathing policy is based on the array that has been detected by the NMP PSP.
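The difference between the three main policies is easiest to see in a toy model. The sketch below is not how NMP is implemented: the function names and path names are hypothetical, and Round Robin is simplified to alternate on every I/O.

```python
from typing import List, Optional, Set


def first_working(paths: List[str], working: Set[str]) -> Optional[str]:
    """First path, in discovery order, that is currently usable."""
    return next((p for p in paths if p in working), None)


def select_mru(current: str, paths: List[str], working: Set[str]) -> Optional[str]:
    """MRU: stay on the current path while it works; never fail back on its own."""
    return current if current in working else first_working(paths, working)


def select_fixed(preferred: str, paths: List[str], working: Set[str]) -> Optional[str]:
    """Fixed: use the preferred path whenever it is available, else an alternative."""
    return preferred if preferred in working else first_working(paths, working)


def round_robin(paths: List[str], working: Set[str], io_count: int) -> List[str]:
    """Round Robin: rotate through all usable paths (simplified to one I/O per turn)."""
    usable = [p for p in paths if p in working]
    return [usable[i % len(usable)] for i in range(io_count)] if usable else []


paths = ["path_a", "path_b"]

# path_a fails: both policies move to path_b.
mru_path = select_mru("path_a", paths, {"path_b"})
fixed_path = select_fixed("path_a", paths, {"path_b"})

# path_a comes back: MRU stays on path_b, Fixed returns to the preferred path_a.
mru_after = select_mru(mru_path, paths, {"path_a", "path_b"})
fixed_after = select_fixed("path_a", paths, {"path_a", "path_b"})

print(mru_after, fixed_after)
print(round_robin(paths, {"path_a", "path_b"}, 4))
```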
The ExtPart utility provides support for online volume expansion of NTFS formatted basic disks.
This is a self-extracting file that will install the extpart.exe utility. No reboot is necessary.
The Cisco UCS Monitoring Resource Handbook is a monitoring reference guide that was developed to supplement this session.
Eric Williams, Moderator, Technical Marketing Engineer, Cisco
Jeff Foster, Technical Marketing Engineer, Cisco
Jason Shaw, Technical Marketing Engineer, Cisco
Links Relevant to this session:
UCS Manager MIB Reference Guide: http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/mib/b-series/b_UCS_MIBRef.html
UCS Manager Fault Reference Guide: http://www.cisco.com/en/US/docs/unified_computing/ucs/ts/faults/reference/UCSFaultsRef.pdf
C-Series MIB Reference Guide: http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/mib/c-series/b_UCS_Standalone_C-Series_MIBRef.pdf
C-Series Fault Reference Guide: http://www.cisco.com/en/US/docs/unified_computing/ucs/c/sw/fault/reference/guide/CIMC_Fault_codes.pdf
Monitoring UCS Manager with Syslog:
To learn more about Cisco UCS Manager and Standalone C-Series:
Cisco UCS Communities: http://communities.cisco.com/ucs
Cisco UCS Manager: http://www.cisco.com/en/US/products/ps10281/index.html
Cisco UCS Central: http://www.cisco.com/en/US/products/ps12502/index.html
Cisco UCS Management (Blog): http://blogs.cisco.com/datacenter/cisco-ucs-management/
‘Demystifying Monitoring for UCS Manager & C-Series’ Tech Talk available here:
Additional Cisco Monitoring Resources: (Cited within this document)
- UCS Manager MIB Reference Guide
- UCS Manager Fault Reference Guide
- C-Series MIB Reference Guide
- C-Series Fault Reference Guide
- Monitoring UCS Manager with Syslog
UCSM and Standalone C-Series Monitoring Overview:
UCS Manager Monitoring Background:
The core of UCS Manager is made up of three elements: the Data Management Engine (DME), the Application Gateway (AG), and the user-accessible northbound interfaces (SNMP, Syslog, XML API and UCS CLI). With UCS Manager there are three main ways of monitoring UCS servers: XML API, SNMP, and syslog. Both SNMP and Syslog are interfaces used only for monitoring, as they are “read-only” in nature and do not allow an end user to change the configuration. In contrast, the UCS XML API is a monitoring interface that is “read-write” in nature, which allows an end user both to monitor UCS and to change the configuration if needed.
Data Management Engine (DME) – The DME is the center of the UCS Manager universe, or the “queen bee” of the entire system. It is the maintainer of the UCS XML database which houses the inventory database of all physical elements (blade / rack mount servers, chassis, IO modules, fabric interconnects, etc.), the logical configuration data for profiles, policies, pools, vNIC / vHBA templates, and the various networking related configuration details (VLANs, VSANs, port channels, network uplinks, server downlinks, etc). It maintains the current health and state of all components of all physical and logical elements in a UCS Domain, and maintains the transition information of all Finite State Machine (FSM) tasks occurring. The inventory, health, and configuration data of managed end points stored in the UCS XML Database is always current and is delivered in near real time. As fault conditions are raised and cleared on end points, the DME will create, clear, and remove faults in the UCS XML database as those fault conditions are raised or mitigated. The faults stored in the UCS XML database are only the ones actively occurring, as the DME by default does not store a historical log of all faults that have occurred on a UCS Domain.
Application Gateway (AG) – The AGs are the software agents, or “worker bees”, that communicate directly with the end points to provide the health and state of the end points to the DME. AGs manage configuration changes from the current state to the desired state during FSM transitions when changes are made to the UCS XML database. AG managed end points include servers, chassis, IO Modules, fabric extenders, fabric interconnects, and NXOS. The server AGs actively monitor the server through the IPMI and SEL logs via the Cisco Integrated Management Controller (CIMC) to provide the DME with the health, state, configuration, and potential fault conditions of a device. The IO Module AG and chassis AG communicate with the Chassis Management Controller (CMC) to get information about the health, state, configuration, and fault conditions visible by the CMC. The fabric interconnect / NXOS AG communicates directly with NXOS to get information about the health, state, configuration, statistics, and fault conditions visible by NXOS on the fabric interconnects. All AGs provide inventory details about end points to the DME during the various discovery processes. The AGs perform the state changes necessary to configure an end point during FSM-triggered transitions, monitor the health and state of the end points, and notify the DME of any faults or conditions.
Northbound interfaces – The northbound interfaces include SNMP, Syslog, CLI and XML API. The XML API is present in the Apache web server layer and is used to send login, logout, query, and configuration requests via HTTP or HTTPS. SNMP and Syslog are both consumers of data from the DME. SNMP informs and traps are translated directly from the fault information stored in the UCS XML database. Conversely, SNMP GET requests are sent through the same object translation engine in reverse, where the DME receives a request from the object translation engine and the data is translated from XML data from the DME to an SNMP response. Syslog messages use the same object translation engine as SNMP, where the source of the data (faults, events, audit logs) is translated from XML into a UCS Manager formatted syslog message.
Standalone C-Series Monitoring Background:
Monitoring support for our Standalone C-Series Servers has evolved with each release. The current CIMC release, v1.5, supports our M3 Platforms including the C220 M3, C240 M3, C22 M3, C24 M3 and C420 M3 as well as our C260 M2 and C460 M2. While earlier versions of our CIMC supported Syslog and SNMP, the Fault Engine added support for SNMP v3 in CIMC v1.5. We have documented the internals of our monitoring subsystem in the graphic included below.
Fault Engine Overview:
While Cisco Standalone C-Series Servers do not support the DME/AG architecture described above in the UCS Manager section, many of the same concepts can be applied to the monitoring subsystem for Standalone Servers. The Fault Engine has become a central repository and clearinghouse for fault data as it is passed along to monitoring endpoints. The Fault Engine acts as a master repository for events within the system: it initiates alerts (SNMP Traps, Syslog messages, XML API events, etc.) and can also be queried via SNMP (GETs) or the XML API. This durability of fault information provides customers a mechanism not only to receive fault data, but also to use these interfaces to query system health data.
Within the system, the Fault Engine regularly polls component health status in the form of sensor data using IPMI and the Storage Daemon and these values are compared to threshold reference points. If a sensor value is outside one of the threshold values, an entry is created in the fault engine and notifications are sent as appropriate. As discussed earlier, multiple notification types are supported including SNMP (Traps and Informs), Syslog (Messages) and XML API (Event Subscription) and fault queries are supported through SNMP GET and XML API queries. Cisco has developed a number of integrations for 3rd Party Management solutions that leverage queries of the Fault Engine data to drive notifications in these management tools. The Fault Engine retains faults until they are mitigated or until the IMC is rebooted.
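The threshold comparison described above can be illustrated with a short sketch. It is not the CIMC implementation: the sensor names, threshold values, and function names are all hypothetical, and a real Fault Engine tracks more than one threshold level per sensor.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Threshold:
    """Hypothetical lower/upper limits for one sensor; None means 'no limit'."""
    lower: Optional[float] = None
    upper: Optional[float] = None


def out_of_range(value: float, threshold: Threshold) -> bool:
    """True when a reading falls outside either configured limit."""
    if threshold.lower is not None and value < threshold.lower:
        return True
    if threshold.upper is not None and value > threshold.upper:
        return True
    return False


def poll_once(readings: Dict[str, float], thresholds: Dict[str, Threshold]) -> List[str]:
    """Return the sensors that would get a fault entry on this polling pass."""
    return [
        name
        for name, value in readings.items()
        if name in thresholds and out_of_range(value, thresholds[name])
    ]


# Illustrative values only.
thresholds = {
    "inlet_temp_c": Threshold(upper=42.0),
    "fan1_rpm": Threshold(lower=2000.0),
}
readings = {"inlet_temp_c": 48.0, "fan1_rpm": 5200.0}

print(poll_once(readings, thresholds))
```

Each name returned would correspond to an entry created in the fault table, with notifications sent through whichever interfaces are configured.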
UCS Manager Best Practices:
The recommendation for monitoring a UCS Manager environment is to monitor all faults of severity critical or major that are not of type “FSM”. FSM related faults are transient in nature, as they are triggered when an FSM transition is occurring in UCS Manager. Generally speaking, FSM related faults will resolve themselves automatically as most are triggered after a task fails the first time, but will be successful on a subsequent try. An example of an FSM task failure would be when an FSM task waiting for a server to finish BIOS POST fails during a service profile association. This particular condition can happen when a server with many memory DIMMs takes longer to successfully finish POST than the default timeout of the FSM task. This timeout would raise an FSM fault on this task, but by default would keep retrying up to the defined FSM task retry limit. If a subsequent retry is successful, the FSM task fault raised will be cleared and removed. However, if subsequent retries are unsuccessful and the retry limit is hit, the FSM task will be faulted and another fault will be raised against the affected object. In this example, a configuration failure would be raised against the service profile, as the association process would have failed because the server did not perform a successful BIOS POST.
If you are looking for a list of the most critical fault codes to monitor, refer to the “Syslog Messages to Monitor” section in Chapter 3 of the “Monitoring UCS Manager with Syslog” guide below. The fault codes listed are the same codes for all interfaces (SNMP, syslog, or XML API).
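The filtering recommendation (critical or major severity, excluding type FSM) might look like this once faults have been collected from any of the interfaces. This is a sketch only: the dictionary keys and the fault codes are placeholders, not the actual attribute names or codes returned by UCS Manager.

```python
from typing import Dict, Iterable, List

# Severities worth alerting on, per the recommendation above.
MONITORED_SEVERITIES = {"critical", "major"}


def faults_to_alert_on(faults: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep critical/major faults and drop transient FSM faults."""
    return [
        fault
        for fault in faults
        if fault.get("severity", "").lower() in MONITORED_SEVERITIES
        and fault.get("type", "").lower() != "fsm"
    ]


# Placeholder records; real fault codes come from the fault reference guides.
sample_faults = [
    {"code": "EXAMPLE-1", "severity": "critical", "type": "equipment"},
    {"code": "EXAMPLE-2", "severity": "major", "type": "fsm"},
    {"code": "EXAMPLE-3", "severity": "warning", "type": "environmental"},
    {"code": "EXAMPLE-4", "severity": "major", "type": "connectivity"},
]

for fault in faults_to_alert_on(sample_faults):
    print(fault["code"], fault["severity"], fault["type"])
```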
C-Series Standalone Best Practices:
Filtering: As referenced above, the faults for our Standalone C-Series Servers are consistent with faults for UCS Manager. Because the concept of FSM (Finite State Machine) does not exist with Standalone C-Series, there is no reason to filter out FSM state changes when monitoring these systems. The recommendation is that filters not be applied to Standalone C-Series Servers as all raised faults are relevant to customers who are interested in monitoring/alerting capabilities. At present, there are approximately 85 faults that are included in the Fault Database for our Standalone C-Series Servers with CIMC 1.5(3).
SNMP vs. Platform Event Filters (PEF): As monitoring has evolved in these systems, support has been extended to include a number of notification mechanisms, and Cisco is planning to deprecate Platform Event Filters (PEF) and Platform Event Traps (PET) in a future CIMC release. Platform Event Traps are sent as IPMI v1 traps where filters (PEF) can be applied so only certain subsystem traps are sent to the NMS system. The variable bindings that are consistent across UCS Manager and Standalone C-Series servers do not apply to Platform Event Filters as they have their own nomenclature that is defined and maintained by Intel.
XML API Usage: As a more robust XML API has been implemented in Standalone C-Series Servers, this is the preferred mechanism for capturing faults sent by the system. The XML API supports Event Subscription which provides proactive alerting. The XML API also supports queries which can be used to collect data in the fault table on a regular basis.
Cisco UCS MIB Files:
Cisco MIBs are available at the following download site:
All Cisco UCS Manager and Standalone C-Series faults are available with SNMP using the cucsFaultTable table and the CISCO-UNIFIED-COMPUTING-FAULT-MIB. The table contains one entry for every fault instance. Each entry has variables to indicate the nature of a problem, such as its severity and type. The same object is used to model all Cisco UCS fault types, including equipment problems, FSM failures, configuration or environmental issues, and connectivity issues. The cucsFaultTable table includes all active faults (those that have been raised and need user attention), and all faults that have been cleared but not yet deleted because of the retention interval.
Important OIDs (Object Identifier):
In UCS Manager version 1.3 and later, Cisco UCS Manager sends a cucsFaultActiveNotif event notification whenever a fault is raised. There is one exception to this rule: Cisco UCS Manager does not send event notifications for FSM faults. The trap variables indicate the nature of the problem, including the fault type. Cisco UCS Manager sends a cucsFaultClearNotif event notification whenever a fault has been cleared. A fault is cleared when the underlying issue has been resolved.
In UCS Manager version 1.4 and later, the cucsFaultActiveNotif and cucsFaultClearNotif traps are defined in the CISCO-UNIFIED-COMPUTING-NOTIFS-MIB. All faults can be polled using SNMP GET operations on the cucsFaultTable, which is defined in the CISCO-UNIFIED-COMPUTING-FAULT-MIB.
Fault Attributes (Variable Bindings):
MIB Loading Order & Statistics Collection Details:
More details on MIB load ordering and statistics collection including a comprehensive list of Statistics OID and their corresponding Statistics tables are located in the following MIB Reference Guides:
MIB Reference for Cisco UCS Manager:
MIB Reference for Cisco UCS Standalone C-Series Servers:
UCS Manager and Standalone C-Series Faults:
In the Cisco UCS, a fault is a mutable object that is managed by the Cisco UCS Manager. Each fault represents a failure in the Cisco UCS instance or an alarm threshold that has been raised. During the lifecycle of a fault, it can change from one state or severity to another.
Each fault includes information about the operational state of the affected object at the time the fault was raised. If the fault is transitional and the failure is resolved, then the object transitions to a functional state. A fault remains in the Cisco UCS Manager until the fault is cleared and deleted according to the settings in the fault collection policy.
You can view all faults in the Cisco UCS instance from either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. You can also configure the fault collection policy to determine how a Cisco UCS instance collects and retains faults.
Fault Severities for UCS Manager and Standalone C-Series Servers include:
Types of faults for UCS Manager and Standalone C-Series Servers include:
The faults in Cisco UCS are stateful, and a fault raised in a Cisco UCS instance transitions through more than one state during its lifecycle. In addition, only one instance of a given fault can exist on each object. If the same fault occurs a second time, the Cisco UCS increases the number of occurrences by one.
A fault has the following lifecycle:
- A condition occurs in the system and the Cisco UCS raises a fault in the active state.
- If the fault is alleviated within a short period of time known as the flap interval, the fault severity remains at its original active value but the fault enters the soaking state. The soaking state indicates that the condition that raised the fault has cleared, but the system is waiting to see whether the fault condition reoccurs.
- If the condition reoccurs during the flap interval, the fault enters the flapping state. Flapping occurs when a fault is raised and cleared several times in rapid succession. If the condition does not reoccur during the flap interval, the fault is cleared.
- Once cleared, the fault enters the retention interval. This interval ensures that the fault reaches the attention of an administrator even if the condition that caused the fault has been alleviated, and that the fault is not deleted prematurely. The retention interval retains the cleared fault for the length of time specified in the fault collection policy.
- If the condition reoccurs during the retention interval, the fault returns to the active state. If the condition does not reoccur, the fault is deleted.
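The lifecycle above can be read as a small state machine. The sketch below models it for illustration only: the class and event names are hypothetical, and the transition out of the flapping state is an assumption, since the description does not say how a flapping fault settles.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    SOAKING = "soaking"
    FLAPPING = "flapping"
    CLEARED = "cleared"      # held for the retention interval
    DELETED = "deleted"


class Event(Enum):
    CONDITION_CLEARED = "condition cleared"
    CONDITION_REOCCURRED = "condition reoccurred"
    FLAP_INTERVAL_EXPIRED = "flap interval expired"
    RETENTION_EXPIRED = "retention interval expired"


TRANSITIONS = {
    (State.ACTIVE, Event.CONDITION_CLEARED): State.SOAKING,
    (State.SOAKING, Event.CONDITION_REOCCURRED): State.FLAPPING,
    (State.SOAKING, Event.FLAP_INTERVAL_EXPIRED): State.CLEARED,
    # Assumption: the text does not say how a fault leaves the flapping state.
    (State.FLAPPING, Event.CONDITION_CLEARED): State.SOAKING,
    (State.CLEARED, Event.CONDITION_REOCCURRED): State.ACTIVE,
    (State.CLEARED, Event.RETENTION_EXPIRED): State.DELETED,
}


class Fault:
    """One fault instance on one object; repeats bump the occurrence count."""

    def __init__(self) -> None:
        self.state = State.ACTIVE
        self.occurrences = 1

    def apply(self, event: Event) -> State:
        if event is Event.CONDITION_REOCCURRED and self.state is not State.DELETED:
            self.occurrences += 1
        # Events with no defined transition leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state


fault = Fault()
for event in (Event.CONDITION_CLEARED, Event.FLAP_INTERVAL_EXPIRED, Event.RETENTION_EXPIRED):
    print(f"{event.value} -> {fault.apply(event).value}")
```

The walk-through at the end follows the simplest path: the condition clears, does not reoccur during the flap interval, and the fault is deleted once the retention interval expires.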
RVTools is a Windows .NET 2.0 application which uses the VI SDK to display information about your virtual machines and ESX hosts. It interacts with VirtualCenter 2.5, ESX Server 3.5, ESX Server 3i, VirtualCenter 4.x, ESX Server 4.x, VirtualCenter 5.0, VirtualCenter Appliance, ESX Server 5.0, VirtualCenter 5.1, ESX Server 5.1, VirtualCenter 5.5, and ESX Server 5.5. RVTools is able to list information about VMs, CPU, Memory, Disks, Partitions, Network, Floppy drives, CD drives, Snapshots, VMware tools, Resource pools, Clusters, ESX hosts, HBAs, Nics, Switches, Ports, Distributed Switches, Distributed Ports, Service consoles, VM Kernels, Datastores, Multipath info and health checks. With RVTools you can disconnect the CD-ROM or floppy drives from the virtual machines, and RVTools is able to update the VMware Tools installed inside each virtual machine to the latest version.
Version 3.7 (March, 2015)
VI SDK reference changed from 5.0 to 5.5
Extended the timeout value from 10 to 20 minutes for really big environments
New field VM Folder on vCPU, vMemory, vDisk, vPartition, vNetwork, vFloppy, vCD, vSnapshot and vTools tabpages
On vDisk tabpage new Storage IO Allocation Information
On vHost tabpage new fields: service tag (serial #) and OEM specific string
On vNic tabpage new field: Name of (distributed) virtual switch
On vMultipath tabpage added multipath info for path 5, 6, 7 and 8
On vHealth tabpage new health check: Multipath operational state
On vHealth tabpage new health check: Virtual machine consolidation needed check
On vInfo tabpage new fields: boot options, firmware and Scheduled Hardware Upgrade Info
On statusbar last refresh date time stamp
On vHealth tabpage: Search datastore errors are now visible as health messages
You can now export the csv files separately from the command line interface (just like the xls export)
You can now set an auto refresh data interval in the preferences dialog box
All datetime columns are now formatted as yyyy/mm/dd hh:mm:ss
The export dir / filenames now have a formatted datetime stamp yyyy-mm-dd_hh:mm:ss
Bug fix: on dvPort tabpage not all networks are displayed
Overall improved debug information
Download link: http://robware.net/index.php/register