Tuesday, September 20, 2005

The Challenger Launch Decision, Diane Vaughan

Title: The Challenger Launch Decision
Author: Diane Vaughan
Rating: OK

So I finally get to write a meatier book review, one about a book I have been slogging through for months now.

While probably not as bad as Foucault, this book has been driving me crazy. A brief summary for those of you too young to remember:

On January 28, 1986, the space shuttle Challenger blew up during liftoff. The cause of the failure proved to be escaping propellant gases from one of the solid rocket motors (SRMs) piercing the external shuttle fuel tank. The resulting explosion was caught on film and became (for me) the logical equivalent of the JFK assassination. I wasn't yet born when JFK was assassinated, so I get a bit tired of the "where were you when..." question on that topic. But I definitely remember where I was on January 28, 1986.

Years later the space shuttle Columbia burned up during reentry. I hope you youngsters remember that event.

I started reading The Challenger Launch Decision well after the Columbia disaster. That level of additional hindsight made me wonder all kinds of things while (and after) reading this book.

First, I should say that the book is almost certainly accurate in its descriptions of things going on around the Challenger event. The paper trail there is huge, and the author has been very thorough. The conclusions are accurate too, as far as I can tell.

That said, there were many times I wanted to scream while reading this thing; sometimes at the author, sometimes at others. Here are some of my thoughts and concerns:
  • The author admits her own lack of expertise and experience in dealing with highly technical matters, but clearly she didn't get the right set of people to review her manuscript before publishing. A single simple example: someone is quoted discussing dimensions in the SRM joints with wording something like this: "yadda yadda 30,000ths yadda yadda". What? Thirty thousandths of an inch is written 0.030", and measurements are written correctly all over the rest of the book. Why did it come out wrong there? The author should have had a few (more?) engineers review and correct her book before publication.

  • Sometimes I think the author uses the phrase "scientific positivism" as if it has negative connotations. This may be my own hypersensitivity, but it bugged me. Given one of her conclusions is that almost any technology can be risky, my interpretation here might be correct, but I really don't know.

  • Many times I wanted to yell "Duh! Of course!" as the author describes parts of the engineering environment at NASA and Morton Thiokol (the SRM manufacturer). I've been an engineer, and while I've never worked on man-rated software or on systems that might kill people, I know the environment. From my perspective, the system NASA had set up to qualify a shuttle for flight was about as good as it was going to get. There was at least one problem, though, and I'll describe that below.

  • The author clearly doesn't quite get the distinction between science and engineering (or scientists and engineers). There is a major difference between the two, and it bugged me that she didn't deal with it. Here is a quote, taken from page 401 in my copy:
    In language that staggers in its seeming prescience about events at NASA, Kuhn describes normal science as "an attempt to force nature into the preformed and relatively inflexible box that the paradigm supplies. No part of the aim of normal science is to call forth new sorts of phenomena; indeed those that will not fit the box are often not seen at all." According to Kuhn, the recalcitrance of the scientific paradigm is in the worldview it engenders; it is overturned when a crisis arises that precipitates a transformation of that worldview. [...]
    My response to that quote could be summed up in just one obscene word, but I'll avoid sharing that with you.

    Science is all about finding new data in nature, understanding it, and figuring out how it works with the rules we have. Finding new data may well require changing the rules. Yes, there is resistance to changing the rules, but they do change. (Find me any other system with documented methods for changing its internal rules in response to new data. Go on... I'll wait.) The best rules predict things we haven't yet observed. Then we go look for them to determine if the rule is valid. The idea that science doesn't want to "call forth new sorts of phenomena" is entirely wrong. This shows a fundamental misunderstanding about the process of science.

    Worse, Vaughan conflates science and engineering, so now all the actors in the Challenger story are tarred with that same brush. Engineers are scientists, therefore they don't want to see new data, therefore they "normalized" the warnings of impending disaster in the SRMs. This is only one of the things she uses to rationalize her conclusions, and while I do agree with much of her conclusion in the end, this sort of point bugged me a few times along the way to getting there.

  • Vaughan is (I think) a social scientist, concerned mainly with the behavior of groups of people. Her intent here is to show that any technology or system where groups of people are responsible for decision making within or about that technology can have (possibly serious or life-threatening) flaws. Once again, I wanted to scream "Duh!", but I am getting to be something of a Luddite in my old age.

  • A final critique of the author's work is that she's extremely repetitive. I'm an engineer, and thus attuned to certain things about the environment she's documenting, so perhaps points that are obvious to me are harder for others to absorb, but I doubt it. She hammers at the same points often enough that I sometimes had to put the book down and go do something else until the urge to strangle someone had passed.
Moving on to other things that came out of this for me, there were three aspects of the Challenger disaster that I found distressing. Whether changing them would have been enough to avoid the problem I don't know, but they bother me.
  1. Without going into all the details of what later commissions found to be the actual causes of the shuttle explosion, a couple of those causes were never mentioned or anticipated by the SRM engineers. Wind shear, for example, played a role in the Challenger explosion, but it was never considered by the engineers working on the SRMs. Excuse me? You have a vehicle plowing through an atmosphere that we all know is turbulent, and you didn't think about what that might do to the position of the O-rings in the SRM joints? That, frankly, was a failure of imagination, to use a recent phrase. I don't know why the people working on this didn't think the entire system through more deeply, but clearly there were issues they missed. That is never a good thing.

  2. The SRM joints followed a very successful design from another flight system, but almost immediately showed deviations in behavior. The engineers did review them, but I honestly don't think they were skeptical enough in that process. The things they were seeing shouldn't have happened, and the fact that they were happening should have raised a bigger red flag than it did. I admit I have the benefit of hindsight here, but I can't imagine sleeping the night before a shuttle launch given what they were seeing. My engineer's sense of "rightness" would be offended.

  3. As some may remember, there was a bunch of discussion on the eve of the Challenger launch about the cold weather and its impact on the O-rings in the SRMs. Some of the Thiokol engineers were worried about that, and recommended against launching, but they were overruled for various reasons that were seen as legitimate at the time. My problem with this is that no one stopped and said something like this:
    Wait a minute... we have about 25 shuttle launches so far, but none in this extreme cold. 25 launches is way too small a data set to give us any valid statistical data to go by, and we have no way to get that statistical data in the next few hours. Our best technical people say they are worried about the cold, but they can't justify it. Perhaps we should hold off a few hours until things warm up. That will give us a bigger safety margin on this flight. In addition we'll set up a task force to determine what the temperature criteria really should be.
    NASA had a second launch window in the afternoon - when things would have warmed up - and there would have been no huge cost associated with delaying a few hours. But they didn't do that. The reason -- driven home by this book -- is that only "hard" data was acceptable at NASA, and the SRM engineers had no hard data about the SRM joints in cold weather. So they launched, because it had always worked before. (I am condensing here... read the book to understand the entire rationale.) That can be viewed as reasonable, actually. Certainly no one intended to take a larger risk than necessary. But...

    In addition to being an engineer I've managed engineers. (There is a point coming. Bear with me.) I've come to value something that only experience - not engineering school - teaches: the gut feel. When I hire an engineer and have them working on something, I expect they will develop opinions about their task, and they'd better be able to share them, even if those opinions are only half-baked. They may be wrong, but if you squelch them, you may miss something important. Engineers should be able to give you an opinion of some sort about almost anything related to their discipline, even given little data. That opinion has some value, whether small or large, and it is the experience of the engineer that helps determine how much.

    In the NASA case, experienced engineers said it was too cold and bad things could happen. NASA culture said "You have no numbers to prove that. Go away." As a manager, I always tried to avoid that mistake. I'd always listen to those opinions and take them into account. In the case of the Challenger, I wouldn't have launched if it were my decision and I'd heard the launch-eve conversation. I'd have rescheduled and asked the engineers how long they'd need to get more temperature data. I believe that I'd have acted that way (given my current background) even without the benefit of hindsight. In the end I view a NASA culture that totally devalued non-numerical data as having a fundamental flaw. (For the statistically inclined, there's a quick sketch right after this list of just how little those 25 launches could actually tell anyone.)
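Since I've brought statistics into it, here is a back-of-the-envelope sketch - my own illustration, not anything from the book, with purely illustrative numbers - of just how thin that launch record was as a data set. A standard result says that if you observe zero failures in n independent trials, the 95% upper confidence bound on the per-trial failure probability works out to roughly 3/n. A few lines of Python make the point:

    # My own illustration, not anything from Vaughan's book; numbers are rough.
    # Question: after ~24 flights with no catastrophic joint failure, how high
    # could the per-flight failure probability still plausibly be?

    def upper_bound_zero_failures(n, confidence=0.95):
        # Exact upper confidence bound on failure probability p when 0
        # failures are seen in n trials: solve (1 - p)**n = 1 - confidence.
        return 1.0 - (1.0 - confidence) ** (1.0 / n)

    n_flights = 24  # shuttle flights before Challenger
    p_upper = upper_bound_zero_failures(n_flights)
    print(f"0 failures in {n_flights} flights: p could still be "
          f"as high as {p_upper:.1%} at 95% confidence.")
    # -> about 11.7%; the flight record alone could not rule out a
    #    failure rate of roughly 1-in-8 per flight.

And that weak bound only covers conditions already flown. The coldest prior launch was, if I remember right, around 53 F, so the record said exactly nothing about a far colder morning. "It had always worked before" was a claim based on a sample both too small and too warm.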
Finally, I need to say that one of my personal heroes (of a sort) takes one in the shorts here. Dr. Richard Feynman, the physicist, was a member of the President's commission that examined the Challenger disaster. Many of the conclusions that he and that commission came to were simply wrong. In yet another case of the distinction between engineering and science being lost, Feynman just didn't get it in places. Some of the opinions he espoused were out of touch with the reality of engineering in general, and of risky engineering (like space flight) in particular. In fact, the Presidential commission's findings were wrong in many places, and the later findings of the congressional commission were much more accurate. There was a rush to judgment immediately after the disaster, and I am sorry to say that it appears Dr. Feynman was sucked into that mess. Not everything the President's commission found was wrong, but enough was to make me wonder about it.

I've blathered enough here. This book is not for the faint of heart. It is basically correct in what it says and concludes, but the path it takes to get to the end is a tough one.