From the Archives: Why Exchange is cruel, part 1

A humble preface

I have been using computers since 1978, when I got my first puny little
TRS-80 Model I. I’ve been earning a living from them since about 1981.
During all that time, I’ve never had a serious hardware failure: no data loss,
no crashed drives, no nothing, not even when a
crazed squirrel bit
through our house power line and incinerated the power supply of my (then)
brand-new Mac Plus.  However, I am now a convert to the gospel of regular
backups and redundant hardware, in a different sort of way. At
my old job, I specialized in telling people what
to do to ensure the availability of their Exchange and Windows servers.
Furthermore, I write columns for two
magazines in which I teach these
principles. Unfortunately, my own application of them has been lacking. As it
says in the New Testament,
"For all have sinned, and come short of the glory of God."

The beginning

I have a small E2K organization in my home lab; it handles mail for the
robichaux.net and
exchangefaq.org domains. I use this
network as my primary mail system, as do several family members. There are
several interesting machines in the context of my tale of woe:

  • TORNADO is an Intergraph 
TD-30 (2 x Pentium-166). It’s an E2K front-end (FE) server and SMTP gateway, plus a domain controller (DC). It works
    well, although it’s not the fastest machine in the world. W2K SP2 + E2K SP2.
  • CYCLONE is the primary mailbox server. It used to be built with an
    Asus
    P2B-D
    motherboard (2 x Pentium-II 450). Now it has something different (more on that
    in a minute). This box was also the primary RRAS and
    file/print server for my home network. W2K SP2 + E2K SP2.
  • HURRICANE is my primary work machine; it’s a Tyan Tiger III (2 x
    Pentium-III 600), running W2K SP2 as a GC/DC.
  • THUNDERSTORM is a Dell PowerEdge 2500 (2
    x Pentium-III 933) with a bunch of disks. W2K SP2 + E2K SP2 + a bunch of
    Dell-specific drivers and so forth.

All of these machines have been stable from the get-go. The P2B-D had some
initial stability problems with betas of Windows 2000, but DATAC
(from whom I bought it) replaced the board with a later revision and
all was well. In October, I bought some cheap Crucial
RAM and loaded up all my servers; CYCLONE went from 256MB (not
enough) to 1GB (aaaah!). Everything seemed fine. Until, that is, I installed a
beta version of a service pack for a popular
product whose NDA doesn’t permit me to name it. This was right before
Thanksgiving; my initial experience with the beta was quite good. Then I went
out of town.

Thanksgiving of terror

Well, OK, it wasn’t really a terror. We went to
Ohio to visit my parents, and I left my
laptop behind. "No problem," sez I. "Outlook Web Access will do the trick." The
only problem was that OWA didn’t consistently work. I could occasionally VPN in
and use Outlook 2000, but that didn’t work consistently either. When I used
Terminal Services to log in and see what was what, I couldn’t reset the
machine; I had to use the reskit shutdown utility. When CYCLONE restarted, the
Information Store (IS) hung, but bouncing it manually restored it to normal
operation. This went on
for a few days: at random intervals the server would get wedged; eventually I
would notice and unwedge it. No big deal, right?
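
In case you ever find yourself doing the same dance, the unwedging went
roughly like this (a sketch, not gospel; the reskit shutdown.exe switches vary
a bit between versions, and MSExchangeIS is the Information Store’s service
name):

    :: Reboot the wedged server remotely with the resource kit shutdown tool
    shutdown \\CYCLONE /r /y /c /t:10

    :: Once it's back up, bounce the hung Information Store by hand
    net stop MSExchangeIS /y
    net start MSExchangeIS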

When I returned home, I examined the event log more carefully and found that
the NNTP service was reporting that it had a corrupted database, and that these
event IDs seemed to pop up right before a shutdown. No problem: turn off the
NNTP service. I did, and that seemed to fix the problem for a while. The system
even stayed up continuously during my
trip to France,
so I was pretty happy. This should have warned me: after all,
pride goeth before a fall.
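
For reference, turning off a misbehaving service like this is a two-step
affair (a sketch; NntpSvc is the NNTP service’s short name, and sc.exe here is
the resource kit flavor):

    :: Stop the NNTP service right now
    net stop NntpSvc

    :: Keep it from coming back on the next reboot
    sc config NntpSvc start= disabled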

Bad mobo! No donut!

When I returned to work, I noticed that CYCLONE was occasionally taking a
powder again. In fact, the MTBF seemed to be shrinking, so I did what any
thinking person would do: I opened the box up and added more RAM. No, just
kidding: I started an online backup of the mailbox database and stored it on
another server’s disk. However, during the backup the machine bugchecked with an
error I hadn’t seen before: PFN_LIST_CORRUPT. The one KB article
that mentions this error says that the most common cause is bad
device drivers, but it also says that RAM may be the culprit.
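
In case you haven’t scripted one before, an Exchange-aware online backup from
the command line looks roughly like this (a sketch: the storage group name
assumes the default, and the \\THUNDERSTORM share is a made-up stand-in for
wherever you park the .bkf file):

    :: Online (Exchange-aware) backup of the mailbox store to another server's disk
    ntbackup backup "\\CYCLONE\Microsoft Information Store\First Storage Group" /j "CYCLONE IS backup" /f "\\THUNDERSTORM\backups\cyclone-is.bkf"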

In my initial order from Crucial, I’d gotten a 512MB stick for HURRICANE that
turned out to be bad. Its replacement was also bad. "No sweat," I thought. "The
guys at Crucial had too much potato vodka the day they built my RAM." So,
I shut down the machine and pulled out all but the original stick. I succeeded
in rebooting and taking another (successful) online backup, then the real
trouble started: random BSODs every 3-45 minutes, depending on how busy the
machine was. Anything that caused lots of disk I/O would trash the machine,
which pretty much spelled doom for my attempts to take another backup or to copy
my .bkf files to another machine.

Digression: I don’t have a tape drive. Well, actually, I do have an
OnStream DI30, but I rarely use it
because it only works intermittently. Item #1 on my list is to get it fixed;
item #2 is to find a DLT stacker that I can afford and put it to work.

I struggled with the machine for a while before I was inspired to go back and
see what PFN_LIST_CORRUPT really meant. On re-reading the KB article, the
mention of bad RAM jumped out at me, and I bounced over to
download.com to look for RAM testers. I
found the excellent GoldMemory and put it
to work overnight. Lo and behold: every stick of RAM, in every slot, was showing
a bit error in the same place. This was true no matter which RAM I had in, so
clearly it wasn’t a problem with a single stick. So much for my theory about
Crucial.

It was clearly time for more expert expertise, so I called my pal
Bob Thompson. Bob wrote
the excellent PC Hardware In A Nutshell
for O’Reilly, and he’s a good friend, so
naturally I call him when something bizarre happens. He concurred that my mobo
was almost certainly at fault. So, I started calling around to local shops to
see who had dual-processor motherboards in stock. DATAC had an Asus CUV4X-D
in stock, so I zipped over there to pick it up, with a pair of
PIII-933s to keep it company. Visions of happy Exchange services were dancing in
my head.

"It’s toast"

I returned home, pulled CYCLONE onto the kitchen table, eviscerated it, and
installed the new motherboard. This seemed easy enough, except that when I got done, the
POST reported no CPU in slot 1 and a P-III 933 in slot 2. Since this was clearly
bogus, I swapped processors, and guess what? At that point, the machine wouldn’t
boot at all. I packed it up and carried it back to DATAC for some professional
action. They promised to check it out immediately; at this point, it was about
1630 on Wednesday.

Digression: see, this doesn’t happen to people who buy brand-name servers.
You just call Compaq, or whomever, and they show up with a truck full of parts
and fix it on the spot. On the other hand, white-box shops are the preferred
route for most small businesses, even though it means you don’t get the
assurance of dealing with a major PC-market playa.

Of course, "immediately" meant something a little different in their lexicon.
I didn’t hear anything from them by 1100 Thursday, so I called and was informed
that my tech was on a service call and would call me back shortly. He did, and
it was not happy news. "It’s toast," he said cheerfully. "It looks like it may
have been damaged when you installed the processor fans, but in any case it’s
dead. We can get you a new one tomorrow."

More fool I for believing them. "Tomorrow" really meant "by 4pm or so
tomorrow", putting me at 4pm
Friday with a non-working server. After some negotiation, they got the server
fixed, but it wasn’t until I went to pick it up that they (casually) mentioned
that W2K would no longer boot. One of the two CPUs was dead, too, but he
couldn’t tell whether it was DOA or whether the mobo had taken the PIII with it
in its death throes; they’d replaced that as well.

INACCESSIBLE_BOOT_DEVICE is not your friend

Whenever I tried to boot into W2K, I got a STOP BSOD reporting
INACCESSIBLE_BOOT_DEVICE. It took about 10 seconds for me to find KB article
Q271965, which basically says "don’t move system disks between motherboards
with different chipsets." Of course, I’d never seen any mention of this problem
before, even though I like to consider myself well-informed. I was reminded of a
scripture in which Jesus,
somewhat incredulously, asks a learned man how he can not know something basic.
Anyway, the article suggests making sure you have a good system state backup (I
didn’t) or moving the drive to another system with a similar chipset (I
couldn’t).

Q249694 blathers on about merging the registry, but none of this was
helping, so I turned to my friends at Google.
It turns out that STOP 0x7b errors are very popular among the UDMA-controller
crowd. One popular solution is to install a PCI IDE card in your working
machine, then swap motherboards, then put the IDE controller in the new
motherboard before booting the replacement disk. Since my original motherboard
had gone the way of Jimmy Hoffa, that didn’t help any either.
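
For completeness: the registry merge boils down to making sure the generic IDE
drivers load at boot time on the new chipset. A minimal sketch of the idea
(illustrative only; the real KB steps also cover the CriticalDeviceDatabase
entries and additional drivers):

    Windows Registry Editor Version 5.00

    ; Force the generic ATAPI/IDE drivers to load at boot (Start = 0)
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Atapi]
    "Start"=dword:00000000

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\IntelIde]
    "Start"=dword:00000000

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PciIde]
    "Start"=dword:00000000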

So, off I went with my W2K CD. First I tried doing an auto-repair, which
didn’t work. Then I tried a manual repair, with the same results. In some cases,
you can stop the ATAPI service from the W2K recovery console (see the sketch
after this paragraph), but that didn’t help me since that service wasn’t ever
getting loaded. Then it hit me: what if I chose the "reinstall" option? I
crossed my fingers
and reinstalled the OS; when setup asked me if I wanted to repair an existing
install, I said "yes" and went to bed while it finished. One oddity: setup
didn’t detect my NIC. However, when setup
finished the machine booted normally, and all of my existing software seemed to
be present; the machine was still a member of the same domain, and I could log
in with my cached credentials.
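
For the record, the Recovery Console service trick looks like this (a sketch;
atapi was the relevant service in my case):

    listsvc                          <- list services and drivers with their start types
    disable atapi                    <- prints the old start type, then sets SERVICE_DISABLED
    enable atapi SERVICE_BOOT_START  <- how you'd put it back afterward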

The case of the missing NIC

That’s when the real fun started. I got a popup telling me that at
least one service or device driver had failed. Checking the event log revealed
that Netlogon couldn’t start, so the Exchange SA couldn’t start, and so on. When
I checked the Network & Dial-up Connections folder, it was as empty as could be.
The Device Manager showed that the NIC was present, but it had the little yellow
exclamation point that we all dread. Checking the device properties told me that
code 35 was the problem. The KB
told me what that code meant: 

This error message is
displayed when a device does not have an entry in the BIOS MultiProcessor
Specification (MPS) table. You can only see this error message on MPS-capable
systems. This behavior usually indicates a BIOS bug. This is particularly
prevalent on MPS systems with multiple root PCI buses.

Great. A BIOS bug. What next: a plague of locusts? Like a good little
doobie, I checked the motherboard BIOS version, only to find that it was already
current. So much for the idea of just flashing over the bug. At this point, it
was about 10pm Friday night, so I went to bed with big plans for the morning.
After a good night’s sleep and a huge feeding at our annual
church Christmas breakfast, it was off to the new CompUSA
for a new NIC. I came home with an Intel inBusiness 10/100 card and
popped it into the server. Same problem.

Thus ensued a festival of NIC- and slot-swapping. Sometimes the Device
Manager would report code 35, sometimes code 10, and once even the rare and
dreaded code 19:

Your registry
might be corrupted. (Code 19)
