rid
itself of all its calls, drop everything temporarily, and re-boot its
software from scratch. Starting over from scratch will generally rid
the switch of any software problems that may have developed in the
course of running the system. Bugs that arise will be simply wiped out
by this process. It is a clever idea. This process of automatically
re-booting from scratch is known as the "normal fault recovery
routine." Since AT&T's software is in fact exceptionally stable,
systems rarely have to go into "fault recovery" in the first place; but
AT&T has always boasted of its "real world" reliability, and this
tactic is a belt-and-suspenders routine.
The 4ESS switch used its new software to monitor its fellow switches as
they recovered from faults. As other switches came back on line after
recovery, they would send their "OK" signals to the switch. The switch
would make a little note to that effect in its "status map,"
recognizing that the fellow switch was back and ready to go, and should
be sent some calls and put back to regular work.
Unfortunately, while it was busy bookkeeping with the status map, the
tiny flaw in the brand-new software came into play. The flaw caused
the 4ESS switch to interact, subtly but drastically, with incoming
telephone calls from human users. If--and only if--two incoming
phone-calls happened to hit the switch within a hundredth of a second,
then a small patch of data would be garbled by the flaw.
But the switch had been programmed to monitor itself constantly for any
possible damage to its data. When the switch perceived that its data
had been somehow garbled, then it too would go down, for swift repairs
to its software. It would signal its fellow switches not to send any
more work. It would go into the fault-recovery mode for four to six
seconds. And then the switch would be fine again, and would send out
its "OK, ready for work" signal.
However, the "OK, ready for work" signal was the VERY THING THAT HAD
CAUSED THE SWITCH TO GO DOWN IN THE FIRST PLACE. And ALL the System 7
switches had the same flaw in their status-map software. As soon as
they stopped to make the bookkeeping note that their fellow switch was
"OK," then they too would become vulnerable to the slight chance that
two phone-calls would hit them within a hundredth of a second.
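
The chain reaction described above can be sketched as a toy simulation. This is only an illustration of the feedback loop, not AT&T's actual 4ESS software: the function name, the fully connected network, the fixed five-second recovery time (the text says four to six seconds), and the single probability standing in for "two calls within a hundredth of a second" are all assumptions made for the sake of the sketch.

```python
import heapq
import random

# Assumed midpoint of the "four to six seconds" recovery window in the text.
RECOVERY_SECONDS = 5.0


def simulate(num_switches, p_collision, duration, seed=0):
    """Count fault recoveries completed within `duration` seconds.

    p_collision is the assumed chance that two calls land within a
    hundredth of a second while a switch is noting a peer's "OK"
    in its status map (hypothetical parameter, not from the source).
    """
    rng = random.Random(seed)
    # down_until[i] is the time at which switch i finishes recovering.
    down_until = [0.0] * num_switches
    # Switch 0 suffers the first fault at t=0 and recovers 5 seconds later.
    down_until[0] = RECOVERY_SECONDS
    # Each event is (time, switch): that switch recovers and broadcasts "OK".
    events = [(RECOVERY_SECONDS, 0)]
    recoveries = 0
    while events:
        now, sender = heapq.heappop(events)
        if now > duration:
            break
        recoveries += 1
        for peer in range(num_switches):
            # A switch that is still down, or recovering at this very
            # instant, is not doing status-map bookkeeping for this "OK".
            if peer == sender or down_until[peer] >= now:
                continue
            # The peer updates its status map; an unlucky pair of calls
            # garbles its data and sends it into fault recovery too.
            if rng.random() < p_collision:
                down_until[peer] = now + RECOVERY_SECONDS
                heapq.heappush(events, (down_until[peer], peer))
    return recoveries


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.5):
        print(p, simulate(num_switches=100, p_collision=p, duration=600.0))
```

With the collision chance set to zero, the first fault is the only one. As the chance rises, each recovering switch's "OK" knocks down more of its neighbors, whose own "OK" signals do the same in turn, and the network never settles.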
At approximately 2:25 P.M. EST on Monday, January 15, one of AT&T's
4ESS toll switching systems in New York City had an actual,