Read Windows Server 2008 R2 Unleashed Online
Authors: Noel Morimoto
. After the issue is isolated or, at least, the scope of the issue is understood, the
network administrator should communicate the outage to the necessary managers
and/or business owners and, as necessary, open communication to outside support
vendors and ISP contacts to report the issue and create a trouble ticket. And no—this
should not go out in an email if the network is down.
. Create a logical action plan to resolve the issue and execute the plan.
. Create and distribute a summary of the cause and result of the issue and how it can
be avoided in the future. Close the trouble ticket as required.
Physical Site Failure
In the event a physical site or office cannot be accessed, a number of business operations
might be suspended. Planning how to mitigate issues related to physical site limitations
can be extensive, but should include the considerations discussed in the following sections.
Physical Site Access Is Limited but Site Is Functional
This section lists a few considerations for a situation where the site or office cannot be
accessed physically, but all systems are functional:
. Can the main and most critical phone lines be accessed or forwarded remotely?
. Is there a remote access solution to allow employees with or without
notebooks/laptop computers to connect to the organization’s network and perform
their work?
. Are there any other business operations that require onsite access that are tied to a
service-level agreement, such as responding to paper faxes or submitted customer
support emails, phone calls, or custom applications?
Physical Site Is Offline and Inaccessible
This section lists a few considerations for a situation where the resources in a site are
nonfunctional. This scenario assumes that the site resources cannot be accessed across the
network or Internet and the data center is offline with no chance of a quick recovery.
When planning for a scenario such as this, the following items should be considered:
. Can all services be restored in an alternate capacity—or at least the most critical
systems, such as the main phone lines, fax lines, devices, applications, system, and
remote access services?
. If systems are cut over to an alternate location, what is the impact in performance,
or what percentage of end-user load can the system support?
. If systems are cut over to an alternate location, will there be any data loss or will
only some data be accessible?
. If the decision to cut over to the alternate location is made, how long will it take to
cut over and restore the critical services?
. If the site outage is caused by power loss or network issues, how long of an outage
should be sustained before deciding to cut over services to an alternate location?
. When the original system is restored, if possible, what will it take to fail back or cut
the systems back to the main location, and is there any data loss or synchronization
of data involved?
These short lists merely scratch the surface when it comes to planning for or dealing
with a physical site outage, but, hopefully, they will spark some dialogue in the disaster
recovery planning process to lead the organization to the solution that meets its needs
and budget.
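The questions about how long a cutover takes and how long an outage should be tolerated can be turned into a simple decision rule. The following Python sketch is only an illustration under stated assumptions: the function name, its parameters, and the rule itself (cut over only when the expected outage exceeds both the cutover time and the outage the business has agreed to tolerate) are hypothetical and are not taken from the text. A real plan would also weigh data loss, failback effort, and reduced capacity at the alternate location.

```python
from datetime import timedelta


def should_cut_over(expected_outage: timedelta,
                    cutover_time: timedelta,
                    tolerated_outage: timedelta) -> bool:
    """Hypothetical rule of thumb for deciding on a site cutover.

    Assumption: cutting over is only worthwhile when the outage is expected
    to last longer than the cutover itself takes (otherwise service would not
    return any sooner) and longer than the business is willing to wait.
    """
    outlasts_cutover = expected_outage > cutover_time
    exceeds_tolerance = expected_outage > tolerated_outage
    return outlasts_cutover and exceeds_tolerance


# Example: an 8-hour power outage, a 2-hour cutover, a 4-hour tolerance.
print(should_cut_over(timedelta(hours=8), timedelta(hours=2), timedelta(hours=4)))
```

With these sample values the rule recommends cutting over. An outage expected to last only one hour would not justify it, because the cutover alone takes longer than waiting.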
Server or System Failure
When a server or system failure occurs, administrators must decide on which recovery
plan of action will be the most effective. Depending on the particular system, in some
cases, it might be more efficient to build a new system and restore the functionality or
data. In other cases, where rebuilding a system can take several hours, it might be more
prudent to troubleshoot and repair the problem.
Application or Service Failure
If a Windows Server 2008 R2 system is still operational but a particular application or
service on the system is nonfunctional, in most cases troubleshooting and attempting
repair or restoring the system to a previous backup state is the correct plan of action. The
Windows Server 2008 R2 event log is a much more useful tool than in previous
versions, and it should be one of the first places an administrator looks to determine the
cause of a validated issue. Following the troubleshooting or recovery procedures for the
particular application is the next logical step. For example, if an end user deleted a folder from a
CHAPTER 31
Recovering from a Disaster
network share, the preferred recovery method might be to use Shadow Copy backups to
restore the data instead of Windows Server Backup.
For Windows services, using Server Manager to review the status of the role and role
services assists administrators in identifying and isolating problems because the Server
Manager tool displays a filtered representation of Event Viewer items and service state for
each role installed on the system. Figure 31.1 shows that the File Services role on SERVER10
logged several errors and warnings in the last 24 hours.
FIGURE 31.1
File Services role and role status.
Data Corruption or Loss
When a report has been logged that the data on a server is missing, is corrupted, or has
been overwritten, Windows Server 2008 R2 administrators have a few options to deal with
this situation. Shadow Copies for Shared Folders can be used to restore previous versions
of selected files or folders and Windows Server Backup can be used to restore selected files,
folders, or the entire volume on a Windows disk. Using Shadow Copies for Shared Folders,
administrators and end users with the correct permissions can restore data right from their
workstation. Using the restore features of Windows Server Backup, administrators can
place the restored data back into the same folder by overwriting the existing data or
placing a copy of the data with a different name based on the backup schedule date and
time. For example, to restore a file called ClientProposal.docx that was backed up on
10-9-09 at 12:30 p.m., Windows Server Backup will restore the file as 2009-10-09 12-30
Copy of ClientProposal.docx, and the time will be expressed in the current time zone
of the server.
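The naming pattern in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes the pattern is exactly what the single example above shows (date, hour and minute separated by hyphens, then "Copy of" and the original name). The function name restored_copy_name is hypothetical, and the code only imitates the resulting name; it does not call Windows Server Backup.

```python
from datetime import datetime


def restored_copy_name(original_name: str, backup_time: datetime) -> str:
    """Build a name in the style shown in the example above.

    Assumption: the prefix is the backup date and time formatted as
    YYYY-MM-DD HH-MM, inferred from the one example in the text.
    """
    stamp = backup_time.strftime("%Y-%m-%d %H-%M")
    return f"{stamp} Copy of {original_name}"


# The backup from the example: October 9, 2009 at 12:30 p.m. (server time).
print(restored_copy_name("ClientProposal.docx", datetime(2009, 10, 9, 12, 30)))
# prints: 2009-10-09 12-30 Copy of ClientProposal.docx
```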
Hardware Failure
When hardware failure occurs, a number of issues and symptoms might result. The most
common issues related to hardware failures include system crashes, services or drivers
stopping unexpectedly, frozen (hung) systems, and systems that are in a constant reboot
cycle. When hardware on a Windows Server 2008 R2 system is suspected of having failed
or of failing, administrators should first review the event logs for any related system or
application event warnings and errors. If nothing apparent is logged, hardware manufacturers
usually provide several diagnostic utilities that can be used to test and verify
hardware configuration and functional state. Don’t wait to call Microsoft and involve
its professional support services department, because its engineers can work in conjunction
with your team to capture and review debugging data.
When a business-critical system is suspected of having hardware issues,
steps should be taken to migrate services or applications hosted on that system to an alternate
production system, or the system should be recovered to new hardware. Windows
Server 2008 R2 can tolerate a full system restore or a complete PC restore to alternate
hardware if the system is an exact or close hardware match with regard to the motherboard,
processors, hard disk controller, and network card. Even if the hardware is an exact match,
if the disk arrays, disk IDs, and volume or partition numbers do not match, a complete
PC restore to alternate hardware might fail unless additional steps are taken during the
restore or recovery process. This is detailed in a later section of this chapter named
“Complete PC Restore to Alternate Hardware.”
Recovering from a Server or System Failure
When a failure or issue is reported regarding a Windows Server 2008 R2 system, the
responsible administrator should first perform the standard validation tests to verify that
there is a real issue. The following sections include basic troubleshooting steps when
failure reports are based around data or application access issues, network issues, data
corruption, or recovery issues.
Access Issues
When end users report issues accessing a Windows Server 2008 R2 system but the system
is still online, this is categorized as an access issue. Administrators should start troubleshooting
access issues by first verifying that the system can be accessed from the
system console and then verifying that it can be accessed across the network. After that is
validated, the access issue should be tested to reveal whether it is affecting
everyone or just a set of users. Access issues can be system or network related, but they
can also be related to security configurations on the network or local system firewall or
application, share, and/or NTFS permissions. The following sections can be used to help
troubleshoot access issues.
Network Access Troubleshooting
Troubleshooting access to a system when the problem is suspected to be network related can
involve the networking group as well as the Windows Server 2008 R2 system administrators. When
networking is suspected, the protocol and system IP information should be noted before
any tests are performed. Tests should be performed from the Windows system console to
determine whether the system can access other devices on the local network and systems on
neighboring networks located across a gateway or router. Tests should be performed using
both the system DNS names and IP addresses and, if necessary, IPv6 (IP Next Generation)
addresses.
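Testing by name and by address can be scripted so that a name resolution problem is easy to tell apart from a connectivity problem. The following Python sketch uses only the standard socket module. The function names are hypothetical, and the choice of a TCP connection attempt as the test (rather than ping or another tool) is an assumption made for illustration.

```python
import socket
from typing import Dict, List


def resolve(name: str) -> List[str]:
    """Return the distinct IPv4 and IPv6 addresses that name resolves to."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # Name resolution failed; DNS is the first suspect.
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # Refused, unreachable, timed out, or unresolvable.


def compare_name_and_addresses(name: str, port: int) -> Dict[str, bool]:
    """Test the same service by DNS name and by each resolved address."""
    results = {name: can_connect(name, port)}
    for address in resolve(name):
        results[address] = can_connect(address, port)
    return results
```

For example, calling compare_name_and_addresses with a server name and port 445 (file sharing) returns one entry per target. If the addresses succeed but the name fails, name resolution deserves a closer look; if everything fails, routing or a firewall is the more likely cause.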
NOTE
Testing connectivity for web-based applications should be performed using system hostnames,
fully qualified domain names, and IP addresses to ensure that tests yield the
proper results. Many web servers and/or firewalls expect a properly formed host header
in the web GET request and will not respond to a request made from an IP-based
uniform resource locator (URL).
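To see why a name-based and an IP-based URL can behave differently, it helps to look at the request a client sends. The Python sketch below builds the text of a minimal HTTP/1.1 GET request for each form of URL; the only difference is the Host header, which is what a web server or firewall may match on. The function name and the sample host name and address are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Build the text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # IPv6 literals are bracketed in the Host header.
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    target = parts.path or "/"
    if parts.query:
        target = f"{target}?{parts.query}"
    lines = [f"GET {target} HTTP/1.1", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines)


# Hypothetical name and address for the same web server.
for url in ("http://intranet.example.com/reports", "http://192.168.1.50/reports"):
    print(build_get_request(url).splitlines()[1])
```

A site bound to a specific host name answers the first request but may reject or ignore the second, even though both reach the same IP address.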
If the system can communicate out but users still cannot access the system, possible
causes include an incorrect IP subnet mask, default gateway, or routing table, or a restriction
configured in the Windows or network firewall. Windows Firewall is enabled by
default on Windows Server 2008 R2 systems, and the new firewall supports multiple firewall
profiles simultaneously. If a network is identified incorrectly as a public network
instead of a domain network, depending on the firewall profile settings, this might restrict
access undesirably. When administrators follow the proper procedures for installing roles
and role services, exceptions are added to the firewall during the installation of the roles.
Administrators can review the settings using the Windows Firewall applet from
Control Panel, but to get very detailed firewall information, the Windows Firewall with
Advanced Security console should be used. This console is located in the Administrative
Tools program group.
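One of the causes listed above, an incorrect subnet mask or default gateway, can be checked with simple arithmetic: the default gateway must fall inside the subnet defined by the system's IP address and mask. The Python sketch below performs that check with the standard ipaddress module. The function name and the sample addresses are hypothetical, and the check covers only this one misconfiguration, not routing tables or firewall rules.

```python
import ipaddress


def gateway_on_local_subnet(ip: str, mask: str, gateway: str) -> bool:
    """Return True if gateway lies inside the subnet defined by ip and mask."""
    interface = ipaddress.ip_interface(f"{ip}/{mask}")
    return ipaddress.ip_address(gateway) in interface.network


# Hypothetical server settings: correct mask, then a mistyped mask.
print(gateway_on_local_subnet("192.168.1.10", "255.255.255.0", "192.168.1.1"))
print(gateway_on_local_subnet("192.168.1.10", "255.255.255.252", "192.168.1.1"))
```

With the correct mask the gateway is on the local subnet. With the mistyped mask the system's subnet shrinks to 192.168.1.8/30, which no longer contains the gateway, so traffic to other networks cannot be delivered.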
Share and NTFS Permissions Troubleshooting
If network connectivity and firewall configurations check out, the next step in troubleshooting
access issues is to validate the configured permissions to the affected application,
service, or shared folder. For application access troubleshooting, refer to the section,
“Application Access Troubleshooting,” and the application vendors’ administration and
troubleshooting guides. For Windows services and shared folder permission troubleshooting,
Event Viewer can assist tremendously, especially if auditing is enabled. Auditing can
be enabled within an Active Directory group policy or the Windows Server 2008 R2 local
computer policy, but auditing must also be enabled on the particular NTFS folder. For
information on local and domain Group Policies, refer to Chapter 27, “Group Policy
Management for Network Clients.” To troubleshoot share and NTFS permissions, please
Management for Network Clients.” To troubleshoot share and NTFS permissions, please