Recovering from Blue Screen Stops
There are two varieties of executables in Windows Server 2003: user-mode applications and kernel-mode services.
For the most part, it should be nearly impossible for a User-side application to crash a system. Oh, a user app can load the system outrageously or cause it to become unresponsive, but it should not cause a complete loss of system services.
Kernel services, on the other hand, are fully capable of causing drastic malfunctions. Rather than risk widespread memory and file corruption, when a kernel service misbehaves, the system is brought to a stop and information about the crash is displayed. This kernel-mode stop is commonly called a Blue Screen Of Death, or BSOD, due to the background color of the informational display. This blue screen display is handled by a kernel-mode routine called KeBugCheckEx, so it is often called a bugcheck.
Online Event Tracking
If a server crashes due to a bugcheck, or an application hangs and must be killed from Task Manager or by Dr. Watson, the system assembles a set of XML files. These files contain the names of the processes that were running at the time of the crash or hang, along with information about memory and CPU register contents similar to what you see at the blue screen.
These XML files are sent to Microsoft where they are added to a repository of failure information used to determine causes and help identify solutions for crashes. For example, if thousands of error reports flood into Microsoft that identify a particular driver as the culprit in a crash, Microsoft will work with the vendor to determine the cause of the instabilities.
You may not want this information to be transmitted to Microsoft. A set of group policies under Computer Configuration | Administrative Templates | System | Error Reporting controls the online error report settings. You can elect to block reporting completely, to report on selected applications, or to report only unplanned shutdown events.
The top lines of the blue screen contain bugcheck codes that identify the source of the stop, information about the stop that differs depending on the stop code, and oftentimes the name of the culprit. The information looks like this:
*** STOP: 0x0000001E (0xC0000005, 0x8041E9FB, 0x00000000, 0x00000030)
*** Address 8041E9FB base at 80400000, DateStamp 377509d0 - ntoskrnl.exe
The bugcheck codes are your best bet for quickly finding the cause of the crash. The rest of the stop screen usually (but not always) contains stack dump information listing the processes that were in memory at the time of the crash and what they were doing. Here's a brief explanation of the bugcheck information:
The first entry after STOP is the hex ID of the stop code. This corresponds to the name on the second line. If there is no name, the exception was so severe that the system was not able to refer to the lookup table to generate the name.
The next four entries are parameters that were passed to KeBugCheckEx when the STOP error was issued. The meaning and origin of these parameters vary depending on the type of error.
The line following the bugcheck code specifies the address where the exception occurred, the base address of the image that contains that address, a hex representation of the date stamp on the image, and the image name. In this case, the exception was thrown from within the kernel image, Ntoskrnl.exe.
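If you copy stop messages into a trouble ticket or a log, a few lines of script can split the first line into the fields just described. The following Python sketch is only an illustration. The function name parse_stop_line and the regular expression are made up for this example, and it assumes the line follows the same layout as the sample above.

```python
import re

# Matches lines laid out like the sample stop message:
# *** STOP: 0x0000001E (0xC0000005, 0x8041E9FB, 0x00000000, 0x00000030)
STOP_LINE = re.compile(
    r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(\s*"
    r"(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*,\s*"
    r"(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*\)"
)


def parse_stop_line(line):
    """Return (stop_code, [p1, p2, p3, p4]) as integers, or None if no match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    values = [int(group, 16) for group in match.groups()]
    return values[0], values[1:]


example = "*** STOP: 0x0000001E (0xC0000005, 0x8041E9FB, 0x00000000, 0x00000030)"
code, params = parse_stop_line(example)
print(f"stop code:   0x{code:08X}")
for index, value in enumerate(params, start=1):
    print(f"parameter {index}: 0x{value:08X}")
```

The script only separates the numbers. What the four parameters mean still depends on the stop code, so you need the Knowledge Base or the DDK documentation to interpret them.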
The fact that a particular executable is implicated by bugcheck does not necessarily mean that it was the actual perpetrator. In this game of blue screen Clue, you have to search through all the rooms to find out who killed Mr. Server. The name at the top of the bugcheck list might have just been a dupe used by the real culprit.
The Microsoft Knowledge Base is your best source for information about bugcheck codes. Start by searching for Q103059, which lists the stop codes and their names. Then check out Q192463 for ways to collect information without doing full-blown kernel debugging. For a full list of stop codes, download the Windows DDK from msdn.microsoft.com and take a look at the include file, bugcodes.h.
Online Error Reporting
In an effort to find and correct common sources of system hangs and bugchecks, Windows Server 2003 has an Error Reporting service, Ersvc, that collects kernel information from bugcheck and application information from Dr. Watson and sends them to Microsoft where they are cataloged and analyzed. Chapter 3, "Adding Hardware," discusses this feature in detail.
Common Stop Errors
Of the more than 200 kernel-mode stop codes, only a few are especially common. Here they are:
0x0000001E (KMODE_EXCEPTION_NOT_HANDLED).
This error says that an exception occurred in the kernel for which there was no error handler. In most cases, bugcheck can tell you the name of the misbehaving driver. This will be listed in the third line of the display.
0x0000000A (IRQL_NOT_LESS_OR_EQUAL).
When a thread issues a software interrupt, it does so at a particular interrupt request level (IRQL). There are 32 IRQLs, with higher numbers having higher priority. A 0x0A error occurs when a driver tries to access memory that is paged out or invalid while it is running at an IRQL too high to permit that access.
0x0000007F (UNEXPECTED_KERNEL_MODE_TRAP).
This is generally a hardware problem. Refer to Knowledge Base article Q137539 for a list of common culprits.
0x00000024 (NTFS_FILE_SYSTEM).
This is commonly caused by a virus, or sometimes an overly aggressive virus checker. It is also commonly caused by file system utilities that attempt to reach around the APIs to access the file system directly. It can also be caused by file system corruption.
0x00000050 (PAGE_FAULT_IN_NONPAGED_AREA).
This is also commonly caused by virus checkers. It has been tied to many TCP/IP problems as well. Some fairly notorious denial-of-service attacks result in a 0x50 stop error, so if you start getting this on your DMZ machines, you might try enabling auditing and applying a packet sniffer to see if you can capture the source of the problem.
0x0000007B (INACCESSIBLE_BOOT_DEVICE).
If this occurs when starting a system that has been in operation a while, it almost always indicates a failed drive, a failed drive controller, or a boot sector virus. If it occurs on a new installation, you may have drive sector translation problems or an improper host adapter driver. This error also occurs if you restart a system following a failure of the primary drive in a mirrored set.
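If you keep notes on crashes across several servers, a small lookup table saves trips to the Knowledge Base. The sketch below maps the six stop codes described above to their symbolic names as published in bugcodes.h. Pairing each description in this list with a particular code is an inference on my part, so confirm it against Q103059 before relying on it. The function name stop_code_name is invented for the example.

```python
# Symbolic names as they appear in bugcodes.h. Matching them to the
# descriptions in the text is an assumption that should be verified.
COMMON_STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000024: "NTFS_FILE_SYSTEM",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x0000007F: "UNEXPECTED_KERNEL_MODE_TRAP",
}


def stop_code_name(code):
    """Return the symbolic name for a stop code, or a hint to look it up."""
    return COMMON_STOP_CODES.get(code, "not in this table (check bugcodes.h)")


for stop_code in (0x1E, 0x50, 0xC2):
    print(f"0x{stop_code:08X}: {stop_code_name(stop_code)}")
```

The table is deliberately short. Anything it does not recognize still needs a look at the full list in the DDK.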
There are a variety of steps you can take to assess the cause of a bugcheck and try to prevent another like it:
If you get a stop error after installing a new piece of hardware, a driver upgrade, or a new application, your first step should be to restart and select the Last Known Good Configuration option. See "Restoring Functionality with the Last Known Good Configuration" for details.
If that doesn't work, boot to the Recovery Console and delete or rename the offending driver.
If you can't physically get to the server, you can set it up for out-of-band (OOB) access to see the bugcheck codes and restart. See "Using Emergency Management Services."
If you try all of these and are still unable to restore normal operation, you can capture the contents of memory at the time of the stop and send it off to Microsoft Product Support Services (PSS) to analyze. PSS charges a few hundred dollars per incident for this service, but when you compare that against the losses incurred from server downtime, it's often worth the expense.
Configuring Memory Dumps
By default, as part of the bugcheck, the contents of RAM are dumped to the paging file. After restart, the paging file is copied to a file called Memory.dmp in the \Windows folder.
For this full memory dump to succeed, the paging file must be at least the size of RAM plus 1MB for header information. The paging file must be in the root of the boot drive. (Microsoft calls this the System partition.) This is because the bugcheck routine cannot mount a file system, so it is limited to using bare INT13 calls. You can have other paging files on other drives, but they will not be used for the memory dump.
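The sizing rule is simple arithmetic, but it is easy to get wrong when you are planning partitions for a rack of servers with different memory loads. Here is a minimal Python sketch of the RAM-plus-1MB rule. The function names are hypothetical, and the RAM figures are values you would supply yourself.

```python
MB = 1024 * 1024


def min_paging_file_bytes(ram_bytes):
    """Smallest paging file that can hold a full dump: RAM plus 1MB of header."""
    return ram_bytes + 1 * MB


def can_hold_full_dump(ram_bytes, paging_file_bytes):
    """True if the paging file is large enough for a full memory dump."""
    return paging_file_bytes >= min_paging_file_bytes(ram_bytes)


ram = 4096 * MB  # a server with 4GB of RAM
print(min_paging_file_bytes(ram) // MB)       # 4097
print(can_hold_full_dump(ram, 4096 * MB))     # False
print(can_hold_full_dump(ram, 4097 * MB))     # True
```

A paging file set to exactly the size of RAM falls 1MB short, which is why the second check fails.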
If you have a fire-breathing server with many, many gigabytes of RAM, you probably don't want to give up gigabytes and gigabytes of real estate in your system partition for the paging file. Also, a multi-gigabyte dump file is not likely to have useful content unless the misbehaving driver leaves a known footprint. To avoid large memory dumps, you have two options:
Kernel memory dumps.
This dumps just that portion of RAM owned by the operating system. This is rarely over 1GB and cannot be more than 2GB, at least on IA32 systems. IA64 systems can have a larger footprint. You'll have to check Task Manager to find out how much dump space to set aside.
Small memory dumps.
This dumps just the stack space. This can be useful only if the offending driver leaves a very clear indication. Otherwise, it does not include sufficient information for a full diagnosis.
Memory dump options are controlled by System properties. Right-click the My Computer icon and select PROPERTIES from the flyout menu. The Advanced tab has a Startup and Recovery button that opens a window to access the memory dump settings. Figure 21.16 shows an example.
Figure 21.16. Startup and Recovery window showing default Recovery settings for handling system stop errors.
Several recovery options in this window are worth your attention. They are as follows:
Send an Administrative Alert uses the Alerter service, if it is still functioning after the crash, to put out a network broadcast to members of the Administrators domain local group to notify them of the stop error. If you have a trap management console of some sort (HP OpenView, for example), you should also load an SNMP agent on the server so it can send a trap when the failure occurs.
Write Debugging Information To specifies the name of a file that will hold the memory contents after the system reboots. The dump stays in the paging file until the system restarts successfully.
Write Kernel Information Only saves hard drive space that would go to waste if you have lots of RAM in a server. You can size the paging file by opening Task Manager, selecting the Performance tab, and looking at the total memory value under Kernel Memory.
Automatically Restart restarts the system after the memory dump has completed. It has the potential of causing a continuous loop if the cause of the blue screen stop doesn't go away after restart. For this reason, it is a good idea to monitor your servers with some sort of SNMP tool that will notify you when the server crashes.
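The options in this window are stored in the Registry, which is handy if you want to audit the settings on many servers. This chapter does not list the Registry entries. As far as I know, they live under HKLM | System | CurrentControlSet | Control | CrashControl, with values named CrashDumpEnabled, DumpFile, AutoReboot, and SendAlert. Treat those names and the numeric dump types below as assumptions to verify on your own system. The following Python sketch takes a hand-typed snapshot of those values and turns it into a readable summary. The function name and the snapshot are hypothetical.

```python
# Assumed meanings of the CrashDumpEnabled value (not listed in this chapter).
DUMP_TYPES = {
    0: "no memory dump",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def describe_crash_control(values):
    """Summarize a dict of CrashControl values as a list of readable lines."""
    dump_type = DUMP_TYPES.get(values.get("CrashDumpEnabled"), "unknown dump type")
    summary = [f"Dump type: {dump_type}"]
    if "DumpFile" in values:
        summary.append(f"Dump file: {values['DumpFile']}")
    summary.append("Automatic restart: " + ("on" if values.get("AutoReboot") else "off"))
    summary.append("Administrative alert: " + ("on" if values.get("SendAlert") else "off"))
    return summary


# Hypothetical snapshot, typed in by hand rather than read from a live server.
snapshot = {
    "CrashDumpEnabled": 1,
    "DumpFile": r"%SystemRoot%\MEMORY.DMP",
    "AutoReboot": 1,
    "SendAlert": 0,
}
for line in describe_crash_control(snapshot):
    print(line)
```

Working from a snapshot keeps the example independent of any particular way of reading the Registry. You could feed it values exported by whatever inventory tool you already use.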
Examining Memory Dumps
In production, if you have a server that is crashing regularly and you cannot figure out the problem, it's probably a good idea to spend the money to call Microsoft's Product Support Services. They may want a copy of the memory dump file. Before you burn a huge dump onto a CD for the overnight pouch to PSS, it's a good idea to make sure there is useful information in the dump file. The Windows Server 2003 CD has a utility called DUMPCHK that verifies the integrity of the file contents.
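DUMPCHK is the right tool for a real integrity check, but you can do a quick sanity test first by looking at the first few bytes of the file. To the best of my knowledge, kernel-mode dump files begin with the 8-byte signature PAGEDUMP on 32-bit systems and PAGEDU64 on 64-bit systems. That signature is not documented in this chapter, so treat it as an assumption. The Python sketch below reads the signature and reports what it finds. The function name identify_dump is invented, and the demonstration uses a fabricated stand-in file, not a real dump.

```python
import os
import tempfile

# Assumed 8-byte signatures at the start of a kernel-mode dump file.
DUMP_SIGNATURES = {
    b"PAGEDUMP": "32-bit kernel dump",
    b"PAGEDU64": "64-bit kernel dump",
}


def identify_dump(path):
    """Return a description of the dump type, or None if the signature is unknown."""
    with open(path, "rb") as handle:
        signature = handle.read(8)
    return DUMP_SIGNATURES.get(signature)


# Demonstration with a fabricated file that carries only the signature.
with tempfile.TemporaryDirectory() as folder:
    fake_dump = os.path.join(folder, "Memory.dmp")
    with open(fake_dump, "wb") as handle:
        handle.write(b"PAGEDUMP" + bytes(56))
    print(identify_dump(fake_dump))
```

A matching signature only tells you the file started life as a dump. It says nothing about whether the rest of the contents are intact, which is what DUMPCHK is for.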
If you'd like to do your own poking around in the dump file, you'll need a tool. The simplest and most flexible dump analysis tool is the Windows Debugger, Windbg. You can get this tool from the Windows Server 2003 CD, which comes with TechNet, or you can download it from the MSDN web site.
The Windows Server 2003 CD also has a fairly hefty symbols file that you need to install. These symbols help the debugger interpret what it sees in the dump file. There are two sets of symbol files, one for the retail version of the product and one for the debug version. Unless you are running the debug version of Windows Server 2003 from the MSDN library, use the retail symbols.
When you install Windbg, it will look for the symbols in the \Windows\Symbols folder. If you did not install the symbols into that default location, you must configure Windbg with the actual location. This is done in VIEW | OPTIONS | SYMBOLS.
Windbg has a huge number of uses and switches, all of which lie outside the scope of this book. The feature that can help with crash dump analysis, though, is pretty straightforward. Simply point the program at the crash dump file, \Windows\Memory.dmp, using FILE | OPEN CRASH DUMP. Figure 21.17 shows an example following a crash. I generated the crash in this example using a diagnostic feature in Windows. (If you sneered just now, you're a cynic.) To enable this feature, make the following Registry change to the keyboard driver:
Key: HKLM | System | CurrentControlSet | Services | i8042prt | Parameters
Value: CrashOnCtrlScroll
Data: 1 (REG_DWORD)
Figure 21.17. Windbg Debugger window showing results of loading a crash dump file.
After restarting with this setting, you can crash the system by holding down the right Ctrl key (not the left) and pressing the Scroll Lock key twice.
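If you want to apply this Registry change to several lab machines, a .reg file is easier to pass around than a set of manual instructions. The Python sketch below builds the text of such a file. It assumes the value name is CrashOnCtrlScroll, which is the name I know for this keyboard driver setting, and it uses the standard Version 5.00 .reg layout. The function name build_reg_file is hypothetical. Only use the result on test systems, because the whole point of the setting is to crash the machine on demand.

```python
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Services\i8042prt\Parameters"
)


def build_reg_file(value_name="CrashOnCtrlScroll", data=1):
    """Return the text of a .reg file that sets one REG_DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        f'"{value_name}"=dword:{data:08x}',
        "",
    ]
    return "\r\n".join(lines)


print(build_reg_file())
```

Setting the data to 0 produces a file that turns the feature back off, which is worth keeping next to the first one.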
The debugger will point right at the source of the crash if it can get a clear picture from the dump file about the events that led up to the bugcheck.
Following the restart, the Error Reporting Service will want to send Microsoft information about the crash. A notification window gives you the opportunity to decide whether or not to send the information. A hyperlink takes you to the details of what will be sent. Figure 21.18 shows an example.
Figure 21.18. Error Reporting window following a system crash.