The following text is copyright 1997 by Network World. Permission is hereby given for reproduction, as long as attribution is given and this notice is included.

An Internet logic brownout

At about 11:30 a.m. on Friday, April 25th, a "yet unexplained twist of bits" caused a router at MAI Network Services to send some incorrect routing information into MAE-East, where it was picked up by Sprint and perhaps other major Internet service providers (ISPs). The result was that Sprint thought that MAI provided the best path to just about anywhere. Sprint announced the new, better path to its ISP peers, and they started sending their traffic to MAI. What happened after 11:30 demonstrates some of the strengths and weaknesses of today's Internet.
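
The reports so far do not say what form the bad routing information took. One way a single router's announcements can come to look like the best path to almost everywhere is for them to be more specific than the legitimate ones, since routers generally forward on the longest matching prefix. The sketch below illustrates only that general rule, under that assumption. The prefixes, the peer names and the best_next_hop helper are all hypothetical and are not taken from the incident.

```python
import ipaddress

def best_next_hop(routes, destination):
    """Return the next hop of the most specific route covering destination.

    routes is a list of (network, next_hop) pairs. This models only
    longest-prefix match, not a full routing protocol decision process.
    """
    addr = ipaddress.ip_address(destination)
    candidates = [(net, hop) for net, hop in routes if addr in net]
    if not candidates:
        return None
    net, hop = max(candidates, key=lambda c: c[0].prefixlen)
    return hop

# Hypothetical table: one legitimate, less specific route.
routes = [(ipaddress.ip_network("10.0.0.0/8"), "legitimate-peer")]
before = best_next_hop(routes, "10.1.2.3")

# Assumption: a bad, more specific announcement arrives from another peer.
routes.append((ipaddress.ip_network("10.1.2.0/24"), "mai"))
after = best_next_hop(routes, "10.1.2.3")
```

In this toy table the destination 10.1.2.3 is first reached through "legitimate-peer" and then, once the more specific route is present, through "mai", even though the original route was never withdrawn.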

Through the use of good monitoring techniques, MAI figured out within a few minutes that something was wrong. Their technical people started working on the problem and within 15 minutes shut down the routing connection to Sprint and called Sprint's network management people on the phone. This should have stopped the problem but, for some reason not yet understood, Sprint's router kept advertising the bad routing information even though MAI was no longer sending it to Sprint. This persisted even after MAI turned off their router and Sprint turned off the DS3 link to MAI, either of which should have fixed the problem. The problems did die out eventually, in most places within an hour or so, but they persisted in a few places for up to seven and a half hours.

The routing problems did cause, as one observer put it, an Internet brownout. Connectivity problems affected many Internet users. It was not that these users were cut off entirely but many people did experience problems reaching some other sites for a while. This was not any sort of catastrophic collapse but it was a real problem.

The basic problem here was not the introduction of the bad routing information, even though that should not have happened and was the trigger for the problems that followed. Nor was it the interaction between the technical people at the various ISPs: MAI and Sprint were on the phone soon after the events started, and the technical people from many of the affected ISPs got on a conference call soon thereafter. The basic problem seems to have been some bug or feature in the routing protocol that caused the bad information to continue to be distributed even after the source had gone away. Bugs happen. This one will soon be found and fixed, so this specific problem will be less likely to occur in the future.

MAI's network management process and the processes for sharing information among the ISPs' technical staff worked very well. The cognitive powers of some people did not fare so well.

Some pundits on nanog (a mailing list for North American network operators) seemed able to see only malfeasance and incompetence behind the episode. Some of the trade press could see only fragility in the structure, and failed to see that the net had survived yet another potentially major disruption with little visible effect. It is not an accident that the net kept running. The architecture of the net, the design of the protocols, and the cooperation between ISPs make it likely that, more often than not, this will be the normal outcome of such problems.

disclaimer: There are at least 4 opinions for every 3 pundits at Harvard; this is one of them.