With today (6/6/6) bearing the number of the beast, my thoughts went back to the most recent scary date, 1/1/00, when we were promised TEOTWAWKI (the end of the world as we know it) thanks to the famous Y2K bug.
Oddly enough, although we seem to be overwhelmed with alleged sceptics on other topics, only a handful of people challenged the desirability of spending hundreds of billions of dollars to fix a problem which was not, on the face of it, any more serious than dozens of other bugs in computer systems. Admittedly not all the money was wasted, since lots of new computers were bought. But a lot of valuable equipment was prematurely scrapped and a vast amount of effort was devoted to compliance, when a far cheaper “fix on failure” approach would have sufficed for all but the most mission-critical of systems.
As far as I know, there was no proper peer-reviewed assessment of the seriousness of the problems published in the computer science literature. Most of the running was made by consultants with an axe to grind, and their scaremongering was endorsed by committees where no-one had any incentive to point out the nudity of the emperor.
Why was there so little scepticism on this issue? An obvious explanation is that no powerful interests were threatened and some, such as consultants and computer companies, stood to gain. I don’t think this is the whole story, and I tried to analyse the process here, but there’s no doubt that a reallocation of scepticism could have done us a lot of good.
Luis Villa 06.06.06 at 7:26 am
The problem with the Y2K problem is that every significant legacy software owner had to assume that they had a Y2K problem, and the only way to verify was to audit their own code. The nature of the problem was such that there was no way to make a ‘proper peer-reviewed assessment’ – that implies some sort of ur-program which could be analyzed and found to have Y2K bugs or no Y2K bugs.
taj 06.06.06 at 7:35 am
I know it’s highly unlikely (given historical evidence) that any collective human endeavour of this scale would go so smoothly, but could it be that we actually managed to handle the Y2K bug successfully? After all, the only reason we are sceptical at all is because the world failed to end.
JR 06.06.06 at 7:50 am
We were told that the lights would go out around the world as the power grid would fail in city after city. The New York Times ran a big magazine story about how US utilities were gasping for breath as they raced to meet the deadline – how would 3rd world utilities manage? Fire departments, building managers, and insurance companies worried about people being stuck in elevators, electronic doors not opening, sprinkler systems failing. Police departments prepared for riots. Just to take something from a quick Google search, Mass General Hospital told its employees:
“Prepare as though for a long holiday weekend by having at least a three-day supply of food and water (one gallon per person per day). Be sure to have adequate clothing, supplies, flashlights, batteries, a battery-powered radio and a first-aid kit. Make purchases early while stores are stocked.”
The University of Texas advised:
“Preparing for Y2K is like preparing for any natural disaster, such as a hurricane, flood, winter storm, tornado or fire.”
None of this happened, and it’s not good enough to say that it didn’t happen because we fixed the problem in time. It didn’t happen in the US, but it also didn’t happen in Moscow and Istanbul and Manila.
So it was all bogus. And after the New Year, it just evaporated – no stories on how the government and press could have been so wrong, nothing.
How could a mistake on this scale have happened? I’ve puzzled about this for years.
Matt 06.06.06 at 7:55 am
I think the Y2K business was a minor case of millennial madness. Some people really did think that the world would end. It didn’t, as usual. Not a big deal, and an improvement over rioting in the streets and burning heretics.
abb1 06.06.06 at 8:11 am
The thing is, though, it was very easy to test – just set the clocks on all your computers to 1/1/2000 and see what happens.
So, it would’ve been extremely negligent for power grid management and all the rest of them to screw up under these circumstances – a clearly defined and easily testable problem. So they tested it beforehand, they fixed whatever needed to be fixed, the IT people used the hype to get money for hardware and software upgrades, and that’s all there is to it.
abb1 06.06.06 at 8:13 am
…or, rather, set the clocks to 12/31/1999 11:59pm and see what happens.
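For anyone who never ran one of these tests: the failure mode being probed is mostly just two-digit year arithmetic going wrong at the rollover. A minimal Python sketch (the function and field names are invented for illustration; the real code was usually COBOL):

    def days_overdue(invoice_yy, today_yy):
        # naive legacy-style interval on two-digit years (365-day years for simplicity)
        return (today_yy - invoice_yy) * 365

    print(days_overdue(99, 99))  # 0      -- looks fine all through 1999
    print(days_overdue(99, 0))   # -36135 -- after the rollover the 1999 invoice
                                 #           appears to lie ~99 years in the future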
Randolph Fritz 06.06.06 at 8:17 am
I hope you are joking.
taj 06.06.06 at 8:30 am
jr,
There’s no doubt that the media hype outshined the actual problem, but when isn’t that the case?
Regarding the third world – as a current resident of the “third world” (India), I think it should be no surprise that we failed to have any problems with the computers we don’t have. We are not yet at the stage where we have microprocessors and microcontrollers in everything, so there was less to go wrong.
(Note that I don’t necessarily subscribe to the idea that all the money and hype was justified – just that at least part of it managed to avert the serious problems that could have come up)
james 06.06.06 at 8:39 am
The problem was easily proved for many systems. It existed. The catch was that for most systems the worst-case scenario was a misreporting of the date. Actual crashes due to this were rare. That’s not to say that the money was misspent. All sorts of bad things could happen if, for example, your credit card company registers the payment as several years late.
taj 06.06.06 at 8:40 am
What also helped was that there was a concrete “world is about to end” date and time. It became an event rather than a process. If we want to make people take climate change and alternative power seriously, it might make sense to make up some dates.
Phil 06.06.06 at 8:40 am
In another life, I was an active participant in the comp.software.year2000 newsgroup. One fairly casual post of mine triggered off a poll on the group assessing how bad we thought things would get: 1 = ‘the lights stay on’, 3 = ‘everyone does a lot of overtime and nothing much breaks’, 5 = ‘the end of civilisation as we know it’. The first time we ran the poll, opinions ranged from around 2.5 to 4.5, with a mean between 3 and 4. I’d stress that this was when the group was populated mainly by IT professionals who had a pretty good idea what the problem was and how to fix it. Later on (circa ’98) we got invaded by a bunch of survivalists who called anyone predicting a 4 or below a ‘Pollyanna’; that was millennial madness.
I have absolutely no idea how we got through it as cleanly as we did. (There must have been millions of lines of non-compliant code out there; I remember being told to build six-digit dates into a new program I was writing in 1987.) I can only assume that there was a bit of inverse Underpants-Gnomery going on: we didn’t realise it, but nobody specified the mechanism whereby botched date comparisons and sort orders actually made something stop working.
(Either that or remediation work was a lot more thorough, and more successful, than we thought.)
LowLife 06.06.06 at 8:55 am
I know that my company has endured two crappy customer service and billing programs since Y2K. It screwed up customer service so badly that they are now trying to get rid of the entire department. Luckily, though probably only temporarily, the Union has put the brakes on the flood of pink slips. The next contract may prove both interesting and disastrous.
chris y 06.06.06 at 8:58 am
The worst case scenario for most systems was actually that large organisations would fail to make payments on time. This happened a bit, but a lot less than it would have if the work hadn’t been done. You can argue that in the greater scheme of things the temporary financial embarrassment of small creditors isn’t very glamorous, and compared to the scare stories about planes falling out of the sky, you’d be right. But inconveniencing people who depend on your processes because you don’t like the way the situation is reported in the meejah doesn’t strike me as the way to go.
Tangurena 06.06.06 at 9:12 am
The first documented case of the Y2K bug hit in 1970. Most companies didn’t do anything at all until 1998. A few companies started looking into things in 1995 or earlier.
The fear and uncertainty about Y2K being the “end of the world” speaks more to people’s uncertainty about all the systems we depend upon for life. That point was the basis for the first version of the TV series Connections: that by the early 1960s, life in the technological west had become so interconnected and dependent upon one another that no one can control it, just as no one can understand how all the interconnections interact. The first episode opens with the host holding something slightly larger than a can of soda in his hand, a device that did what it was built to do, and when it did that, plunged millions of people into darkness. What if the power never came back on? What do you do?
And this sort of “it could have been worse” thinking was the basis for Perrow writing Normal Accidents:
We’ve been making petroleum refineries for over a hundred years. If you thought we understood, by now, how to keep them from bursting into flames, you’d be wrong. Humans have been making dams for over 2,000 years. If you thought that by now we understand how they work and how to keep them from failing, you’d be wrong.
In the US, we have an abhorrence of the word “luck.” That something can happen, or not, because of “luck” drives most people I’ve met completely bonkers. Underlying their hatred and denial of “luck” is their belief that if you cast the correct magic spell, you’ll be successful, and that if the spell didn’t work, something is fundamentally wrong with you, not the magic.
I think humans are hard-wired to look for some central control for things. Steven Johnson, in his book Emergence, calls this the Myth of the Ant Queen.
Things that aren’t controlled, or even controllable, scare most people. And that underlies the fear and loathing of evolution. Which is also driven by the pathological hatred of “luck.”
John Robb, over at Global Guerrillas, has been looking at network attacks – what they are, how to predict them, how to mitigate them.
Most companies put off triaging systems until it was too late to perform any rational actions. As a result, the “replace it all” mentality was just about the only behavior remaining.
In the end, we are a culture that ignores things, waits until the last minute and then panics. For exemplars, one need only observe the American response to terrorism: ignore it until it can’t be ignored further, then panic and invade. Our “security” measures are designed more to look macho and butch than to have any measurable effect on security. Bruce Schneier calls this “security theatre” or “security kabuki.” Such as requiring background checks on Bingo operators, because AlNeda might open one to finance terrorism, ya know. Just like we’ll ignore and deny human-induced climate changes until it is too late, and by then, the Kyoto accords will be pocket change. But we’ll have followed our magic spell of ignoring things until it is too late, then panicking.
People don’t, won’t or can’t think for themselves. That’s why advertising and propaganda work so well. People don’t, won’t or can’t calculate risk. That’s why the gambling and insurance industries are so wealthy.
Sebastian Holsclaw 06.06.06 at 9:26 am
Presuming that it wasn’t a problem that was fixed (which seems unlikely but I certainly have no way of judging) there are a couple of things that probably contribute.
1. The Y2K problem (as defined) was very large but very comprehensible. At worst it involved testing everything with a computer. It had a single cause which was easily understood.
2. Once detected, it was very fixable by what amounts to just throwing money at it, without any other major changes. Lots of overweight people in the US could be not particularly overweight through the well-understood mechanisms of eating less and engaging in moderate exercise. This is well understood but, to many people, annoying. Despite a well-understood lifestyle fix, many people are much more willing to throw huge amounts of money at unproven (and perhaps dangerous) pills instead. When something like Y2K came around (a big but comprehensible problem that needed nothing more than throwing a bit of money at it), people jumped on the fix because it didn’t involve any annoying long-term change.
Global warming, on the other hand, has many factors which aren’t well understood or modeled. It has many input factors, some of which are poorly understood. It doesn’t involve a single mistake which leads obviously to a particular problem. It involves many things that we think of as good combining in a feedback loop to create difficult-to-anticipate outcomes. Its proposed fixes involve some rather long-term changes that could be very annoying.
The mental allocation of risk problem is fascinating. It seems that at least once every other year a scare sweeps through the US involving something with at best a minuscule risk (the vaccine scare is a recent example), which people who risk food poisoning by eating fresh fish, or who drive to work every day, obsess over for a couple of months.
In each case I suspect you would find that the obsession does not have a solution which involves major day-to-day lifestyle changes. Choosing to avoid apples (the Alar scare) or one-time vaccinations doesn’t require a shift in habits. It seems possible that we can obsess over the easily fixable things because we don’t want to think about the hard-to-fix things. (I note also that people who obsess about a small risk of hard-to-fix things are often thought of as neurotic.)
Michael Sullivan 06.06.06 at 10:01 am
My comments are pretty much the same as Phil’s. While not an active contributor to the year2000 newsgroup, I checked in once in a while to that and other discussions, in my capacity as an IT manager. The total cost of my (small) company’s year-2000 response was a couple of books, an audit of mission-critical systems by me that took roughly a day, and adjusting our inventory schedule so that we’d be full up on everything at the beginning of the year, with double stocks for anything critical but cheap. My internal audit confirmed that none of our own computer systems had any obvious date-critical components.
On Dec. 23rd when we shut down, I would have said we had a ~2% chance of seeing significant difficulty with at least one major supplier and a similar chance of seeing power problems, with maybe a .1% chance of power problems that would shut us down for more than a day or two without a true fail-safe (our own generating capacity). Our power requirements are quite high, and I could not justify the cost of a fail-safe. I figured if we were shut down for long, so would be most of our customer base, so our ability to run wouldn’t matter.
Like Phil, I have no clue how this happened as cleanly as it did given the huge amount of broken code that was out there. But I think you’re right that (as I found in my operation), computer systems are nearly always embedded in larger human systems that provide fail-safes, so that computer failure means extra labor, rather than system failure. Which means fix on failure was clearly the right plan for any systems that were not both highly automated and mission critical. In reality, there were many failures, but almost none serious enough to cause any external problems at all.
Is it likely that many people spent much more money and time on insuring against a Y2K crisis than they needed to? Absolutely. And I could have told you that in 1999. Most systems needed only to be backed up.
Whenever anybody asked me what they should do, I always said they should do exactly the same thing they’d do to prepare for a big storm. Preparing for EOC (end-of-civilisation) scenarios is never cost-justified, because the difference between possible EOC outcomes is so tiny compared to the giant gulf between any BAU (business-as-usual) outcome and any EOC outcome.
I didn’t know anyone but cranks who did more than prepare for a worst-case BAU outcome.
Cian 06.06.06 at 10:01 am
Just because some people who were either consultants or professional doomsayers made some ridiculous claims does not mean that Y2K was not real, or a real problem.
And in the UK, at least, as a potential problem it was taken very seriously by both the government and the companies most likely to be affected (typically large companies, with interconnected systems). So it is quite possible if there was a potential problem, it was largely fixed.
There were a number of issues:
1) Complex software is unpredictable. So without doing proper audits (which is what a large part of the Y2K work was), it is impossible to know what systems will do when they fail, or whether they will fail at all.
2) Old software is normally kept running on a wing and a prayer, as the original developers have left (or retired), and new developers don’t know the legacy languages/environments particularly well. The older a system is, the harder the code base will be to understand. Consequently it is often easier just to scrap it, rather than audit it. This was a perfectly rational decision made by many companies – and typically the systems that were scrapped probably should have been scrapped many years previously.
3) A potential problem for some systems with the Y2K thing was that they might continue working, but in an unpredictable fashion. So one might not realise that they were failing until it had already become quite expensive (definitely a problem for financial systems, or inventory/ordering systems).
“only a handful of people challenged the desirability of spending hundreds of billions of dollars to fix a problem which was not, on the face of it, any more serious than dozens of other bugs in computer systems.”
The problem was not so much that there might be a few bugs, but rather that a lot of computer systems, many of them interconnected, would have failed simultaneously. This is hardly a typical situation (and when multiple computer systems do fail simultaneously, it’s normally fairly catastrophic for the institutions involved).
“But a lot of valuable equipment was prematurely scrapped”
I don’t think this is true. A lot of equipment was scrapped, but it was equipment that was aging and becoming harder and harder to maintain – particularly the software that was scrapped. Often the reason these things didn’t get replaced earlier was that the operations budget and the capital expenditure budget were separate.
“and a vast amount of effort was devoted to compliance, when a far cheaper “fix on failure” approach would have sufficed for all but the most mission-critical of systems.”
1) It is always far more expensive to fix a bug after it has happened, rather than in advance. My guess would be by about an order of magnitude.
2) If lots of systems were failing simultaneously, then the resources may not have been there to fix urgent bugs in a reasonable time scale.
3) For legacy systems, companies would have had to hire people to fix the systems, bidding against other companies who also needed COBOL programmers in a hurry.
“As far as I know, there was no proper peer-reviewed assessment of the seriousness of the problems published in the computer science literature.”
This is probably true, but then how would you do one? Who has the resources to inspect all that code? Which companies are going to open their code up? How do you predict the unpredictable?
“Most of the running was made by consultants with an axe to grind, and their scaremongering was endorsed by committees where no-one had any incentive to point out the nudity of the emperor.”
This is partially true. There were also people without an obvious axe to grind who also claimed that it was a problem. And there was a fairly wide range of opinions, with the median professional opinion that I encountered being that it was a problem, but one which could be fixed (so long as it was taken seriously, which it was).
Nathan Williams 06.06.06 at 10:05 am
I worked in a Y2K auditing shop during 1999. In the eight months I was there, something on the order of 10,000 distinct date-handling bugs went across my desk, mostly in large COBOL/RPG applications from large businesses and financial firms.
I agree that most of them were not crash-the-computer kind of bugs. In fact, a lot of them were secondary bugs from “pivot year” approaches to addressing the Y2K bug – that is, instead of considering 00-99 in a two-digit year field to represent 1900-1999, it would be considered to represent (say) 1930-2029. One effect of this is just to shift the date at which the application breaks – instead of everyone’s credit card bills being fouled up in January 2000, they’ll be fouled up at semi-random times from then to 2050. Another effect, though, is that if a single application doesn’t take a coordinated approach to this problem, it can introduce more bugs – the programmer for one module picks 30 as the pivot year, while another chooses 15; the inconsistency between these can cause a lot of problems.
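Roughly, the windowing approach and the inconsistency described here look like the sketch below – a minimal Python illustration, keeping the pivot values 30 and 15 from the example but with everything else (names, the use of Python rather than COBOL/RPG) invented:

    def expand_module_a(yy):
        # pivot 30: 00-29 -> 2000-2029, 30-99 -> 1930-1999
        return 2000 + yy if yy < 30 else 1900 + yy

    def expand_module_b(yy):
        # pivot 15: 00-14 -> 2000-2014, 15-99 -> 1915-1999
        return 2000 + yy if yy < 15 else 1900 + yy

    yy = 22  # the same two-digit field read by both modules
    print(expand_module_a(yy))  # 2022
    print(expand_module_b(yy))  # 1922 -- two interpretations, a century apart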
My ultimate conclusion was that Y2K bugs, as numerous as they were, were still just noise in the level of computer system bugs that we already have to deal with.
Cranky Observer 06.06.06 at 10:11 am
> I know it’s highly unlikely (given historical
> evidence) that any collective human endeavour of
> this scale is likely to be so, but could it be
> that we actually managed to handle the Y2K bug
> successfully?
There were always two Y2K problems: the real one, which information technology professionals (such as me) knew was (A) real and (B) serious, but which could be fixed.
And the millennial-scaremongering-consultant-feeding one, which was of course bunk from the beginning.
No real computer professional ever said that the Revelations-type scenarios were anything but baloney. They DID say that the actual problem was real, and needed to be addressed. Both of which were true. I found and replaced several vulnerable systems myself, but you don’t have to believe me: Alan Greenspan admitted that as a junior programmer he WROTE some of the vulnerable systems and that he knew for a fact they were still in production as of 1998.
So yeah, there was a problem. And we fixed it (mostly; see the yearly New Years Day adventures of the Finnish National Railway which _still_ hasn’t fixed all their code).
Interestingly enough, the leader who got it right was the then-Pope John Paul II, who gave a speech sometime around September 1999 saying in effect “the professionals have it under control. The apocalypse is NOT at hand. Have a fun party on New Year’s Eve”.
Unless you have some fairly deep understanding of both the technology and business of information processing, I would stay away from this one as you can easily make a fool of yourself.
Cranky
Cranky Observer 06.06.06 at 10:13 am
> As far as I know, there was no proper
> peer-reviewed assessment of the seriousness
> of the problems published in the computer
> science literature.
You have searched the entire ACM library? I no longer have my collection of back issues of “Communications” and “Journal”, but I distinctly remember several, from the mid- to late 1980s.
Cranky
No Nym 06.06.06 at 10:19 am
A more interesting question is, “given that nothing happened, why are the Usenet groups dedicated to Y2K still active?”
99 06.06.06 at 10:28 am
Aside from the budgeting issues that impacted most of the purchasing, there was also more rational and prudent behavior than one realizes, outside of the ‘lights will be out’ press (after all, sensationalistic stories are a rational product of the press). The bugs were often intractable, so it made no sense to scrap the systems until the very last minute, and outside of very antiquated systems, or truly mission-critical ones (say, the air traffic control system), the switch-over was a non-event, except as a budget expenditure. A bunch of DB fields had to be rewritten, and code redeployed. Surely this could have been done in 1992, but even then we understood Moore’s law well enough to assess the cost/benefit relationship: why invest in hardware or software in 1992, when the impact of new code and data (doubling the size of a date field in a financial transaction was a material consideration in ’92, I bet) was potentially a real cost, instead of pushing it back six years, expecting improvements in applications, storage and processing to make the effort far easier?
The thing the press is far more guilty of is not properly researching what remediation costs truly were. Anyone who has tried to sell IT (or any application based solution) to corporate hacks knows you have to scream the sky is falling to get them to eat into their profit sharing plan.
jim 06.06.06 at 10:43 am
“a problem which was not, on the face of it, any more serious than dozens of other bugs in computer systems”
No. Bugs are more or less serious depending on the failure modes they induce. The Y2K bug was serious because many programs could have failed catastrophically as a result of it.
An example: many real-time or near real-time systems (like those which control the power grid or the telephone network) ignore “old” input which is no longer relevant and shouldn’t be acted upon: the date-stamp on a message is checked, and if it’s outside a window, the message is discarded. Y2K had the potential to cause such systems to discard all input, to fail to act on any alert. That’s a very bad failure mode. A bug which might induce this is therefore a very serious bug.
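A toy illustration of that failure mode – two-digit years fed into a staleness window – with everything here (names, the crude day arithmetic) invented for the sake of the example; real control-system code looked nothing like this, but the comparison is the point:

    MAX_AGE_DAYS = 1

    def expand_year(yy):
        # the classic broken expansion: "00" becomes 1900
        return 1900 + yy

    def message_age_days(msg_yy, msg_doy, now_yy, now_doy):
        # crude arithmetic on (two-digit year, day-of-year) stamps, 365-day years
        return (expand_year(now_yy) - expand_year(msg_yy)) * 365 + (now_doy - msg_doy)

    def accept(msg_yy, msg_doy, now_yy, now_doy):
        age = message_age_days(msg_yy, msg_doy, now_yy, now_doy)
        return 0 <= age <= MAX_AGE_DAYS  # discard stale or "future" messages

    print(accept(99, 364, 99, 365))  # True:  31 Dec 1999 accepts a 30 Dec 1999 alert
    print(accept(99, 365, 0, 1))     # False: on 1 Jan 2000 "now" expands to 1900, so every
                                     #        recent alert looks like it comes from the future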
Another example: payroll systems might have decided that work performed in ’99 wasn’t relevant to paychecks cut in ’00. No-one getting paid is a very bad failure mode for a payroll system. HR can deal with paychecks being a few dollars off; they can deal with one or two people not getting paid; they can’t deal with everyone not getting paid.
In general, people work on fixing more serious bugs before working on fixing less serious bugs.
Most Y2K bugs with serious consequences, therefore, got fixed before Y2K. There was, nonetheless, nervousness when the clock ticked over because, as is well known, testing can only disclose the presence of bugs; it can’t confirm their absence. What if we missed one?
Cranky Observer 06.06.06 at 10:59 am
> and outside of very antiquated
> systems, or truly mission critical
> (say, the air traffic control system),
> the switch over was a non event, except
> as a budget expenditure.
Actually, I know of many careers and several whole organizations (good sized ones) that were destroyed in the attempt to replace their core business software. But that says more about their inability to manage projects and their executives’ total lack of understanding of business information management than it does about the Y2K situation.
Cranky
Yentz Mahogany 06.06.06 at 11:28 am
Probably because any action taken to counter a risk, and not certain doom, is already packaged with its own skepticism. I don’t think people in general were quite so fanatical as was implied here.
Chris 06.06.06 at 11:37 am
The global warming comparison is a good one.
All the ‘experts’ were agreed about Y2K, as they are now about global warming. Any sceptics are dismissed as cranks or the puppets of vested interests – notably on this blog amongst many others. Lace that with a large dollop of anti-capitalist self-indulgence and you have the perfect recipe for the best TEOTWAWKI scare yet.
Barry Freed 06.06.06 at 11:37 am
If we want to make people take climate change and alternative power seriously, it might make sense to make up some dates.
Great. I can hear the wingnuts now; “See, Gore fibs!”
Barry Freed 06.06.06 at 11:45 am
@26:
I got me some simultaneous CONFIRMATION!
Off to play the ponies now.
Chris 06.06.06 at 11:45 am
There you go – anyone sceptical about climate change = “wingnuts”.
dave heasman 06.06.06 at 11:51 am
“All the ‘experts’ were agreed about Y2K, as they are now about global warming.”
I don’t know if you mean to be ironic here, but this is true.
In run-of-the-mill non-crucial businesses, date processing runs from a few months back to a few months forward of “today”.
I was working on a system that booked newspaper advertising. If we hadn’t fixed the Y2K problem, the Daily Mirror wouldn’t have been able to book advertising space in 1999 for ads to be printed in 2000; the system would have failed with “this date is in the (distant) past”. Invoices couldn’t have been raised, cash-flow estimates couldn’t have been produced. Etc.
OK the world wouldn’t have ended, but over a thousand people would have lost their jobs, as the paper folded. (oops). Why does John Quiggin hate successful software projects?
Barry Freed 06.06.06 at 12:07 pm
There you go – anyone sceptical about climate change = “wingnuts”
Now, now. Prior to her merciful passing, I don’t believe that, or at least I failed to come across any reference among the many hundreds of pages I must have read, that Terri Schiavo was a “wingnut.”
AWOA: Careful swimming there; you’ll catch a nasty case of blood flukes.
Tangurena 06.06.06 at 12:08 pm
I agree. People wildly underestimate risks where they perceive that they have some control over the situation, and wildly overestimate risks where they perceive that they have no control. Just look at the perception of risks of driving vs the perception of risks of airline travel. By all objective measures, airline travel is far safer than driving, yet people fear flying, and don’t fear driving at all.
I’d recommend using your university’s library. I’m sure that they’ve got a subscription to the ACM website and to the IEEE.
A search at the ACM portal shows more than 200 results for “risk” + “neumann” (Peter Neumann is the risk moderator for the ACM publication Software Engineering Notes). His Inside Risk column in CACM has been running since 1990. He also moderates a mail list here. That ought to give you a place to start reading.
Some of the books I’ve been reading about risk, failure and why people make bad decisions (so that at least I can make less bad ones) include:
Sources of Power (Klein).
Logic of Failure.
Beyond Fear (Schneier).
Collapse (Diamond).
Normal Accidents (Perrow).
Innovation Gap.
First, you’re making the assumption that people will accurately recognize failures.
Second, you’re making the assumption that people will accurately recognize the correct action to take.
Third, you’re making the assumption that the repair can be made in a reasonably quick period of time.
Fourth, you’re making the assumption that resources exist to address the failure.
The book Collapse shows examples of civilizations that failed for each of these four assumptions. The US handling of the war in Iraq (and the proposed attack on Iran) are other examples of all four assumptions making an “ass” out of “u” and “me” when you “assume.” Especially when there is no “Plan B” because politics demand the denial of problems and the denial of the possibility of failure. Do you really think that all the idiots in the US are currently in the White House? Or that the failures in logic and risk management are limited only to that small herd of morons? Their stupidity does not spring forth from the head of Zeus, as Scott Adams’ Dilbert series testifies: that degree of idiocy runs rampant throughout American business, and it is only through sheer luck that we haven’t had disasters every week. The TV version of The Hitchhiker’s Guide to the Galaxy ends with the survivors of the spaceship crash failing each of the four assumptions above.
In addition, a company with “fix on failure” as a method of handling this problem would have been treated as a credit risk, and most likely placed on “cash with order” or “cash on delivery” as many folks would have felt that they weren’t likely to be around to pay their bills.
Computer systems are very complicated beasts. People horribly under-estimate the time it takes to build and deploy such things, and people horribly mis-estimate the requirements going into them. I’m constantly amazed when I meet folks who have no concept of what goes into writing software, or of what it takes to keep it running. Most of the time, the estimate for construction is pulled out from between some marketer’s hairy cheeks. I think the perfect example is the Denver Airport baggage handling system. The engineers and their managers said it would take 4 years to build such a system, based on past experience building such systems. The marketing department said you have 2 years to build the system, because that is when the airport opens. So the baggage handling system was finished 2 years after the airport opened. Was it on time? Was it 2 years too late? If you think it was late, are you the sort of person who thinks that 9 women can have 1 baby in 1 month? The end result is that the company that made the baggage handling system has gone out of business, mostly because of DIA.
There have been a few other “magic dates” where problems could exist, such as Sept 9, 1999. That’s because early programmers tended to use magic numbers like 9999 to represent the end of a file (or something that never ends – like a contract that never expires). Or, as told in the bio of the guy who was the first head of Visa, why many cities in Ecuador could not get their credit transactions processed: because when the computer read the line containing Quito, Ecuador, the program read the word quit, so it did.
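The magic-number trap is easy to sketch; here is a hypothetical Python version (real systems did this in COBOL record layouts, and the exact sentinel varied – all-nines date fields were the classic case):

    SENTINEL = "9/9/99"  # a date "that will never occur", used to mean end-of-data

    def process(records):
        for stamp, payload in records:
            if stamp == SENTINEL:   # magic-number check instead of a real end-of-file test
                break               # ...which stops working the day 9 September 1999 arrives
            print("processing", payload)

    process([("9/8/99", "contract A"), ("9/9/99", "contract B"), ("9/10/99", "contract C")])
    # contract B and everything after it is silently dropped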
croatoan 06.06.06 at 1:03 pm
We’re sorry, the Number of the Beast has been changed. The new number is 616. Please make a note of it.
Guest 06.06.06 at 2:01 pm
“An obvious explanation is that no powerful interests were threatened and some, such as consultants and computer companies, stood to gain.”
Well businesses had to spend a heck of a lot of money, which is analogous to climate change – it’s going to cost a heck of a lot of money to slow warming down.
Consultants and computer companies could also be held to be analogous to climatologists.
No Nym 06.06.06 at 2:14 pm
@32:
Another good book is Mission Improbable, which considers gov’t safety planning (such as, say, the evacuation of a large city during/after a disaster) as essentially fantasy documents with no actual meaningful content.
Mary R 06.06.06 at 3:06 pm
At the millennial First Night Boston, someone did an art installation which filled a small hotel exhibition room with computers, so that people could watch to see if they failed.
It was interesting to see the people who were ringing in the new millennium watching a room full of junked computers. Or maybe they didn’t stay either. My husband said we had to get home by 11 in case the subway stopped running.
Quo Vadis 06.06.06 at 3:11 pm
At the time of Y2K I was responsible for the development and operation of a b-to-b e-commerce system. A Y2K related failure on our part would not have brought down any airplanes, but it would have cost our clients a lot of time and money and probably put us out of business.
We conducted a special audit and testing program to verify that our systems were prepared and some of our clients actually conducted tests with us. We set up an entire environment with all the inputs to the system simulated including non-compliant external systems. Even though our code was written with Y2K in mind, we had to upgrade some third party software we used as compliant releases became available. In addition to testing we had contingency plans in place to cover, as well as possible, unanticipated problems.
What was the result of all this effort? Everyone complains about the big Y2K hoax rather than the big Y2K disaster. Sounds like success to me.
John Quiggin 06.06.06 at 4:35 pm
Cranky and others. Like you, I read articles about Y2K going back to the 1980s, and various columns including those of Peter Neumann.
What I didn’t see were reports of properly undertaken studies that gave credible estimates of the risk. To give just the most obvious unanswered question – what proportion of non-compliant computer systems would fail if the date were set to 1/1/00 and how badly would they fail? If there were experimental studies of this question, or rigorous theoretical analyses, I didn’t see them, though I saw and heard plenty of anecdotes on this point.
Similarly, there was little or no economic analysis of the options.
clew 06.06.06 at 4:39 pm
Not only were actual bugs fixed beforehand – and after – but a very high proportion of the world’s sysops were hovering over their logfiles for a day or so before and after; sending out cheerful email, on the whole, as midnight rolled over them without disaster. But if something needed to be rebooted or taken offline, there was someone there to do it.
Brett Bellmore 06.06.06 at 4:43 pm
“because it takes just the right combination of circumstances to produce a catastrophe, just as it takes the right combination of inevitable errors to produce an accident. I think this deserves to be called the ‘Union Carbide Factor’.”
IIRC, the most likely explanation of why a catastrophe took place in Bhopal, and not at all those other plants, was that all those other plants weren’t sabotaged. It’s substantially more difficult to protect against “deliberates” than against accidents.
Cranky Observer 06.06.06 at 5:11 pm
> What I didn’t see were reports of properly
> undertaken studies that gave credible
> estimates of the risk
Well, I did. Mostly in various ACM publications.
But the flip side is this: as much as I learned from the various Computer Science professors I encountered during my degree programs, and as much as I now appreciate some of the fundamental tools they taught me, the cold fact is this: academics in general, and Computer Science academics in particular, have exactly zero understanding of what goes on in the world of business software and business information management (Abnormal Psych professors might have a ghost of a chance). There is simply no frame of reference in common between teaching Compiler Theory, Discrete Mathematics, and Formal Program Verification in the one realm and actually getting your arms into the intestines of an ERP implementation up to the elbows in the other.
So I really don’t think there would have been much value in academics attempting to perform such studies. The analysis WAS done: by insurance companies, risk managers, CIOs, and federal regulators. And they came to the conclusion that the work (note: not the scaremongering; the work) needed to be done.
In any case it is a bit of a silly question: take a random sampling of professionals who were involved in real Y2K work and ask them how many potentially serious problems they uncovered. I would be willing to wager a fair amount the answer will be greater than 1.
Cranky
Henry (not the famous one) 06.06.06 at 6:00 pm
And, of course, it kept the staff of Initech busy.
Quo Vadis 06.06.06 at 7:39 pm
John Quiggen,
The problem is much more complex than simply setting the system clock forward. The number of dependencies in computer systems can be enormous and problems with any dependent system can be triggered by any of those it depends on. For example, one system in your enterprise, your billing system, interoperates with all of your customers and is connected to your order entry system, your CRM system and your inventory system which interoperates with your suppliers. It is configured by your sales staff, your accounting staff, your marketing staff, and your finance staff, who use a variety of different systems and applications to do so. Now take this example up and down the supply chain.
Risk assessments on systems that can be so complex and interdependent would be difficult to generalize since the systems vary so widely in configuration, complexity and implementation.
Tony Healy 06.06.06 at 7:45 pm
I think the question of why there seemed to be so little scepticism is extremely important. It’s relevant not just for Y2K but for numerous other issues in IT.
To delineate Y2K itself a bit more, the issue was the way the scare campaign targeted medium and small sized businesses that were never at significant risk from trivial problems with dates. In Australia, Government advertising was even spent scaring that market, essentially to benefit accounting firms and outsourcers, who were the main drivers of the campaign.
Large transaction-oriented businesses such as banks and airlines were certainly at risk, but they already knew, and had plans in place, as they do for hundreds of issues. Throughout the 90s, those businesses had used the impending arrival of the year 2000 as a good reason to undertake or bring forward expensive upgrades of their software systems that would have been required eventually anyway. So there was a problem, but not the one that was hyped to us.
As to why scepticism was muted, I think there were four reasons. First, Y2K gave IT managers fantastic leverage to gain approval for new projects and to upgrade equipment. Accordingly, there was little reason for managers to question Y2K.
Second, the Y2K lobbyists introduced the threat of liability as a powerful weapon. From 1998 the business press was full of warnings that boards or IT managers who failed to take appropriate preventative action would be liable if their systems suffered Y2K issues. Thus it was much easier for all concerned to just go through the motions.
Third, because it’s still a young field, there are no clear structures in IT or software for understanding who are the experts and who are the shills. Numerous groups exploit this confusion. In the case of Y2K, accounting firms and outsourcers could easily marginalise the occasional software developer or IT manager who dared question orthodoxy.
Fourth, academia in IT is superficial or worse. A lot of the study of information systems, which is the field that should have commented on Y2K, consists of pretentious efforts that aren’t informed by either technical expertise or actual experience.
paul 06.06.06 at 8:39 pm
I’m in the biz. I am skeptical that it was as serious as it was being pressed, but I think it was real. That said, I think it was a serious problem for a few that was probably only addressed by the attention of the many.
What I find interesting is one facet and one analogy. The dot-com boom under the covers was really a productivity boom based on a tidal wave of new spending on new equipment. It not only funded the dot-com follies but transformed the relationship between labor and productivity in ways we are still seeing. I don’t have a graph, but the growth of the internet would not have been as acute without Y2K.
My analogy is to say that everything you say about consultants with axes to grind, information sources that did not do their job in being overly trusting of the hype – it all sounds like what has happened in politics in the years since y2k.
A connection? I don’t know…
Just saying
John Quiggin 06.06.06 at 9:26 pm
qv, to say that risk assessment is complex doesn’t seem to me to be a good argument for spending billions of dollars without a risk assessment.
And, if you don’t like artificial tests, why wasn’t the prevalence of severe problems in forward-looking calculations before 2000 (negligible as far as I could tell at the time) taken into account in making policy? As I pointed out in 1999
Quo Vadis 06.07.06 at 12:32 am
The point I made was that risk assessments were difficult to generalize, not that they were not or should not be conducted. The assessments were conducted individually by enterprises large and small, and remedial action was taken to comply with whatever policies governed each system’s operation.
I don’t know what data you were looking at in 1999, but it likely relied upon generalizations that would have made it difficult to apply to a specific business and system. Every business has its own mission critical systems with varying levels of Y2K exposure and consequence. For some of those systems there would be acceptable fallback procedures in the event of failure. All of these things vary widely from one enterprise to another. Generalizing a prudent policy for a small business to a global airline reservation system would be inappropriate.
In my case, my company was contractually obligated to provide a certain level of service availability. My company was a small start-up so in addition to the contractual penalties, a failure would have meant the end of the company. This is not something that the Gartner Group could have accounted for in their analysis.
Jake McGuire 06.07.06 at 1:54 am
That’s an interesting shift from “nothing worth reporting” to “nothing”, especially since “small shipping company scrambles to correct invoices” is not particularly newsworthy.
Fixing the Y2K bug affects a huge number of code paths, and a non-trivial number of those code paths will probably be broken by fixing the Y2K bug. This means that there is a lot of QA involved, and bug fixes can’t be rolled out quickly with high confidence. When the Y2K bug strikes, it’s also likely to be in a subtle and not immediately obvious way. Waiting until you notice that you’ve been generating corrupt data for the past three months to make a wide-ranging fix with poorly understood implications is not a good strategy.
Also, the Y2K problem is actually an amazingly good example of the difference between IT and Computer Science. Who wants to plow through megabytes of poorly documented billing system code full of special cases to handle buggy vendor software and idiosyncrasies of former employees? I’ve done it; it sucks. Doing it elegantly and automatically is (way) beyond the state of current knowledge; hiring a bunch of consultants to wade through the code is not interesting. Hence few papers.
bad Jim 06.07.06 at 2:53 am
I have little to add to the contributions of other practitioners above. My company was acquired two years before the apocalypse, and the acquiring company spent millions it didn’t have migrating from a minicomputer-based MRP and accounting system, which wasn’t Y2K-compliant, to a minicomputer-based ERP system (PeopleSoft) which remained minimally functional by the time I bailed.
Recall the rapid obsolescence of computers at the time. Microsoft was forcing software upgrades at a furious pace, chasing the advances in hardware and this thing we call the Internet. For a company running its business on PC’s, the idea that we needed to schedule another round of upgrades hardly merited discussion.
That’s not to say that there was a shortage of nonsense in that situation. Many of our customers required us to certify that the computers embedded in our products were not at risk, so we produced a document admitting that our machines didn’t know or care what time it is.
dave heasman 06.07.06 at 3:39 am
John – “First, the judgement of the small businesses and schools, not to mention Third World countries, that have adopted a wait-and-see approach to the problem looks like being vindicated”
Yes. The primary example of their judgement was “don’t invest in IT systems in the ’60s and ’70s”. Small businesses, schools & 3rd world countries would have IT systems from the 90s, which would largely be Y2K compliant.
The National Westminster Bank, however, in 1997, used the Y2K project as a pretext to rewrite a part of their system for which the source code had long been lost, where the print buffers had been used as machine-code patch areas, and where one part of the system, not addressable in isolation, involved the translation of a monetary sum to a decimal value from pounds, shillings and pence.
abb1 06.07.06 at 4:42 am
#43:
The problem is much more complex than simply setting the system clock forward. The number of dependencies in computer systems can be enormous and problems with any dependent system can be triggered by any of those it depends on….
I think you exaggerate quite a bit here. Computer systems and subsystems in most cases communicate by exchanging files and accessing databases, thus you don’t necessarily have to investigate and modify each piece involved; often you can manage by simply adding intermediate steps to modify input or output files and database fields. If you know what I mean.
The problem with theoretical analysis here is that in most cases it’s impossible to predict whether you’ll have to rewrite a million lines of code or write a 100-line program to reformat the output file. So, as with everything else – you hope for the best and plan (and budget) for the worst…
Cranky Observer 06.07.06 at 6:05 am
Anyone who thinks that a mainframe-based banking system (to name just one example) can be tested by “setting the date forward” the way you do on your standalone PC at home really doesn’t have anything meaningful to contribute to this topic.
Cranky
Cranky Observer 06.07.06 at 7:24 am
To follow up on my last comment, and perhaps come back to the original post: I was both amused and horrified by the Y2K scaremongering. And disgusted by the big-dollar consultants who cashed in (most of whom are now “SarbOx consultants”, raking in big bucks forcing their clients to implement insanely extreme interpretations of the actual Sarbanes-Oxley regulations).
But looking back on it, particularly considering the discussion in this thread, I have to wonder: without the scaremongers and consultants, would the actual work needed ever have gotten done? Perhaps humans just can’t take on a serious long-term problem unless they are convinced it is a Death Dealing Flaming Asteroid which will cause Doom In 90 Days!!!!!
This is something to consider in relation to the global warming debate I think.
Cranky
bellatrys 06.07.06 at 7:38 am
Humans have been making dams for over 2,000 years. If you thought that by now we understand how they work and how to keep them from failing, you’d be wrong.
We do. We just don’t bother to (or simply can’t afford to, in a lot of cases) put the money and effort into the necessary monitoring and repair of them. (Or, in worst-case scenarios, into building them properly the first time, just like the scuzzy contractors who used sea salt when building cement high-rises – it wasn’t that “we don’t understand” how cement works, they just didn’t care, since they weren’t planning on living in them.) That’s one reason why we’re slowly dismantling a lot of dams in my state: they’re not used for anything practical any more, it costs too much to fix them, and they cause problems for the fish stocks. Unfortunately, taking down a dam safely *also* costs money, which we don’t have either. Catch-22. So when we have heavy rains and flooding like we recently did, we try to monitor the ones known to be in most danger round the clock and patch them up. Fortunately these are mostly dams that were built well – that is to say, nobody cheated or skimped on supplies or proper engineering and surveying beforehand – it’s just that in some cases they’re over a hundred years old and haven’t been kept up because, again, of a lack of money.
bellatrys 06.07.06 at 7:42 am
Personally, I’ve always thought a lot of the non-examination of why TEOTWAWKI didn’t happen in 2000 was due to embarrassment.
One thing people I knew never asked, and looked at me bugeyed when I asked them in turn, was *WHY* people would all go bonkers and turn into CHUDs or Mad Max if the power went out and computers stopped working. Partly this is because I live in a region where bad ice storms have been known to knock out massive amounts of systems for days on end, and guess what, we jury-rigged things and helped each other out and very few people died – why should it be any different? And nobody could answer why nobody would think of, and be able to, go in and manually override and get the lights back on and start traffic moving and just keep records on paper, the way we did when the power was out in some towns for a whole week.
–Too much bad television, I say.
Cian 06.07.06 at 8:01 am
Abb1: “I think you exaggerate quite a bit here. Computer systems and subsystems in most cases communicate by exchanging files and accessing databases, thus you don’t necessarily have to investigate and modify each piece involved; often you can manage by simply adding intermediate steps to modify input or output files and database fields. If you know what I mean.”
Not really, and I’m not convinced that you really know what you mean. If you’re arguing that code is modular and exists in black boxes (which can be easily modified), then no, as anybody who has ever had the misfortune to work on old/complex systems can testify.
To give an example. In the case of a database, you might not know what all the systems which access/rely upon that piece of data are. Or how they rely upon it. They might use it for radically different things, or in ways that you wouldn’t expect.
Cian 06.07.06 at 8:07 am
“What I didn’t see were reports of properly undertaken studies that gave credible estimates of the risk. To give just the most obvious unanswered question – what proportion of non-compliant computer systems would fail if the date were set to 1/1/00 and how badly would they fail?”
1) How many companies would be willing to allow academics access to their codebase (answer: very few, if any). If you’re not inspecting the codebase, then you’re guessing.
2) How many academics would be able to analyse such a codebase, such as to give a sufficiently accurate answer (skills required would include actual experience of reading code and maintaining old code, which isn’t one you acquire in academia – together with knowledge of many computer languages and environments). This would require a huge investment of time and energy, btw, as you’d have to fully understand how the code worked (and it is unlikely to be adequately documented). You can’t just glance through it looking for “date”. Even then, you might not catch all the problems.
3) Predicting how all the systems would interact in a company with complex computer systems (which would include financial institutions, large retail institutions, shipping companies, companies who run significant infrastructure) – would be impossible. We simply don’t have the tools to do this.
4) Even if the above could be managed, the investment of time and resources to do this would be very large indeed. Doing it for multiple companies would be even more difficult.
5) How generalisable would such information be, given the very different computer systems, dependencies, history, business models and culture of each and every company (I’m guessing not very). You also have to take into account that companies also rely (without realising it) on the computer systems of suppliers and vendors.
So in short, such a study would be a huge and expensive undertaking which would rely upon skills that are rare, and cooperation with companies that is rarely forthcoming. At the end of this study you’d have information that would not be particularly reliable, or generalisable.
chris y 06.07.06 at 8:15 am
This thread seems to have turned into a dialogue of the deaf between those who have experience of working in large commercial ICT departments and those who don’t. We all agree that the hype before Y2K was reprehensible. We all agree that it was never ever going to usher in the end of anything except the millennium (as understood by the majority of journalists who can’t count from 1).
But. There was a real issue affecting legacy systems. Nobody needed to commission studies to prove this, because any half educated COBOL monkey could understand the point without much prompting. Commissioning studies would really have cost money. Nobody lost their heads, except the victims of the hype merchants. The director of my principal client division at the time read our risk assessment and remarked, “Nobody’s going to die, then.”
But he authorised the work, because he wanted people to get paid, he wanted his staff to be able to do their jobs, and he wanted to deliver a quality service. So we did the work, and fucking dull it was too. But we did it right, because we identified about 250 instances of the date issue in two major systems and corrected them all. Don’t clap, just appreciate that this doesn’t make a good story in the Daily Mail.
And as to the various CTOs like Dave Heasman’s who seized the opportunity to replace other systems which were becoming unsafe, them you may applaud. Good managers all.
Cranky Observer 06.07.06 at 9:06 am
> actual experience of reading code and
> maintaining old code,
Just to clarify what “old code” means in this context: one entity I worked with had a business-critical application with an estimated 7 million lines of code. The core modules of that application dated from 1954. Yes, that is “54”, not “64”. There really wasn’t a full set of source code available anymore (heck, Microsoft claims it no longer has the source code for Windows 95a), but the languages that could be identified included:
* IBM 1401 machine language
* IBM 1401 Autocoder (assembly language)
* COBOL
* FORTRAN
* PL/I
* and the odd bit of Ada(tm) that crept in somewhere
Now that’s something that’s really easy to test and fix!
Cranky
Cranky Observer 06.07.06 at 9:11 am
PS I don’t know whether it should be classed as “funny” or “ironic” that the CT comment subsystem barfed up a really ugly error screen when I clicked [Post] on my #60, but then went ahead and posted the message anyway!
Cranky
abb1 06.07.06 at 9:55 am
Cian,
To give an example. In the case of a database, you might not know what all the systems which access/rely upon that piece of data are. Or how they rely upon it. They might use it for radically different things, or in ways that you wouldn’t expect.
Sure, but still there are often ways to deal with this situation at the database level without rewriting the code – for example, by adding triggers that reformat database fields in different ways for different applications and subsystems. A subsystem would still be writing a 2-digit year into the database, and the trigger would intervene, convert it and store the 4-digit value. And when you read it back, it would, again, present it to different clients in different formats. It’s not always possible, of course, but often it is. It’s much cheaper than re-writing everything.
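In database terms that’s a trigger or a view doing the translation; since nobody’s actual schema is available here, a hypothetical Python sketch of the same idea – widen years on the way in, narrow them on the way out for clients that still expect two digits (the pivot value is made up):

    PIVOT = 30  # hypothetical windowing pivot: 00-29 -> 20xx, 30-99 -> 19xx

    def widen(yy):
        # what the trigger does on write: store four digits
        return 2000 + yy if yy < PIVOT else 1900 + yy

    def narrow(yyyy):
        # what the read-side shim does for legacy clients expecting two digits
        return yyyy % 100

    stored = widen(5)      # a legacy subsystem writes '05'
    print(stored)          # 2005 goes into the database
    print(narrow(stored))  # 5 is handed back to clients that can't cope with four digits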
chris y 06.07.06 at 10:04 am
But, abb1, that’s the whole point. You can’t do that sort of thing on a 1970 vintage network database, or with a sheaf of temporary flat files, or with dates that are formatted by the code and stored in temporary indicators. For instance.
abb1 06.07.06 at 10:29 am
Well, I don’t know about old databases – you’re probably right – but often mainframe financial applications were designed as more or less a straight chain of mainframe jobs with one flat input file and one flat output file. Many of those jobs had no problem with a two-digit year, so we only had to modify some of them and sometimes reformat files between the jobs. This kinda thing.
Noumenon 06.10.06 at 6:00 am
Mr Quiggin, I just wanted to point out that there’s a typo in the first sentence of your abstract (“result”).