legitimate,
minor problem. It went into fault recovery routines, announced "I'm
going down," then announced, "I'm back, I'm OK." And this cheery
message then blasted throughout the network to many of its fellow 4ESS
switches.
Many of the switches, at first, completely escaped trouble. These
lucky switches were not hit by the coincidence of two phone calls
within a hundredth of a second. Their software did not fail--at first.
But three switches--in Atlanta, St. Louis, and Detroit--were unlucky,
and were caught with their hands full. And they went down. And they
came back up, almost immediately. And they too began to broadcast the
lethal message that they, too, were "OK" again, activating the lurking
software bug in yet other switches.
As more and more switches did have that bit of bad luck and collapsed,
the call-traffic became more and more densely packed in the remaining
switches, which were groaning to keep up with the load. And of course,
as the calls became more densely packed, the switches were MUCH MORE
LIKELY to be hit twice within a hundredth of a second.
It only took four seconds for a switch to get well. There was no
PHYSICAL damage of any kind to the switches, after all. Physically,
they were working perfectly. This situation was "only" a software
problem.
But the 4ESS switches were leaping up and down every four to six
seconds, in a virulent spreading wave all over America, in utter,
manic, mechanical stupidity. They kept KNOCKING one another down with
their contagious "OK" messages.
It took about ten minutes for the chain reaction to cripple the
network. Even then, switches would periodically luck out and manage to
resume their normal work. Many calls--millions of them--were managing
to get through. But millions weren't.
The switching stations that used System 6 were not directly affected.
Thanks to these old-fashioned switches, AT&T's national system avoided
complete collapse. This fact also made it clear to engineers that
System 7 was at fault.
Bell Labs engineers, working feverishly in New Jersey, Illinois, and
Ohio, first tried their entire repertoire of standard network remedies
on the malfunctioning System 7. None of the remedies worked, of
course, because nothing like this had ever happened to any phone system
before.
By cutting out the backup safety network entirely, they were able to
reduce the frenzy of "OK" messages by about half. The system then
began to rec