NO. NO. NO. This is not about Skynet or Ultron.
It was a moonless night, the early hours of April 10, 2014, in Washington State. A terrified Seattle woman dialed 911 thirty-seven times, and the line was dead. A stranger was lurking outside her home, trying to break into the house. After she had tried the emergency services in vain, the man managed to crawl through a window. She picked up a knife, and the stranger fled.
It later emerged that the emergency services for the entire state of Washington had been in radio silence for six straight hours.
The 911 outage, the largest ever reported at the time, was traced back to a rogue piece of code running on a server in Englewood, Colorado. The server was operated by a systems provider named Intrado, which was responsible for routing calls to 911 dispatchers across the country. Its software kept a running counter of how many calls it had routed, and Intrado's programmers had set a threshold for how high that counter could go: a number in the millions. Somehow, it seems, they were quite confident that one dark night in Washington could not produce more than a few million distress calls.
Shortly before midnight on April 10, the counter went past that pre-set number, resulting in a simple arithmetic failure. The counter was used to generate a unique identifier for each new call, so once it exceeded its threshold, new calls were rejected. The programmers hadn't anticipated this particular problem, so they hadn't devised alarms to call attention to it. Nobody knew it was happening.
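Intrado's actual code has never been published, so the following is a hedged sketch of the failure mode only: the class, the cap, and the names are all invented for illustration. The essence is a hard-coded ceiling on a counter that doubles as the source of unique call IDs.

```python
MAX_CALLS = 1_000_000  # invented hard-coded ceiling, "surely enough"

class CallRouter:
    """Toy model of a call-routing service; not Intrado's code."""

    def __init__(self):
        self.counter = 0

    def route(self, caller: str):
        """Assign a unique ID to the call, or silently reject it."""
        if self.counter >= MAX_CALLS:
            return None  # no alarm, no escalation: the call just dies
        self.counter += 1
        return self.counter  # the counter doubles as the unique call ID

router = CallRouter()
router.counter = MAX_CALLS  # simulate the night the threshold was crossed
assert router.route("911 caller") is None  # every new call is rejected
```

The fix is exactly as small as described below: change the one number (or, better, drop the artificial cap and alarm loudly when the counter saturates).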
911 dispatch centers in Washington, Florida, California, Minnesota and the Carolinas, serving almost 11 million Americans, struggled to make sense of strange reports of callers getting dead or busy signals. It was morning before they realized that Intrado's code on the Englewood server was responsible.
All it took to fix the issue was changing a single number in the code.
The gravity of the situation sank in later. For an entire night, 11 million Americans had been vulnerable, wide open to home invasions, murders and looting.
Are our control systems flawed?
Because our code is flawed.
Why? Because humans are flawed.
What's actually broken is the way we write software, not software itself. The tiny bits of ones and zeros we feed into an IDE do exactly what they are told to do. A software bug that evolves into a catastrophe is not an inherent flaw of the code. Code just works the way we tell it to. Mess up its semantics and it can kill someone; get it right and it can be your savior on a very unlucky day.
The world has become so connected and interdependent, and has shrunk beyond our expectations into one hell of a digital cocoon. In today's world, rogue lines of code create scenes redder than a Game of Thrones wedding.
Just imagine the following scenario.
There is a car crash and the airbags don't deploy. The software onboard didn't trigger the deployment, and so there was loss of life.
That's a problem with the software, isn't it? In plain sight?
Now let's dig deeper and discover what actually happened.
The software detects a collision with the aid of inputs from front and rear sensors. It was coded to deploy the airbags on a positive collision signal from the front sensors, but not the rear ones.
The car had taken a hit from the rear, and so, exactly as coded, the airbags didn't deploy. The software was just doing what it was told to do. Nothing more, nothing less. It was just told to do the wrong thing.
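Since this scenario is imagined, here is an equally imagined sketch of the flaw: the function and sensor names are invented, and the whole thing reduces to an incomplete conditional. The rear sensor is wired in and read, but never consulted.

```python
def should_deploy_airbag(front_collision: bool, rear_collision: bool) -> bool:
    """Toy airbag decision for the imagined crash; names are invented.

    The requirement as coded: deploy only on a front collision signal.
    The rear_collision input is accepted but never used, which is the
    whole bug.
    """
    return front_collision

# A rear-end hit: the code does exactly what it was told, nothing deploys.
assert should_deploy_airbag(front_collision=False, rear_collision=True) is False
assert should_deploy_airbag(front_collision=True, rear_collision=False) is True
```

Nothing in that function is "malfunctioning" in any mechanical sense; the requirement itself was wrong, which is the point the headlines miss.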
Next day, front pages proclaim: “Software error kills driver on freeway. Are machines taking over humanity?” Click-bait bloggers push borderline irritating articles titled “Machines are starting to take over our world. This is the apocalypse.”
The blame game.
The whole “it was a line of code gone rogue”, or “it was a bug”, is a brilliant case study in blame shifting. Let's approach this problem from a new angle. Code never produces bugs or deadlocks on its own. As much as a feature is intentional, a bug is also intentional. Or at least, it needs to be seen that way if we are to get to the bottom of this epidemic.
Just acknowledge the difficult truth: the bug in our code is not a side effect. Code doesn't create anything on its own, because it doesn't have any damn genetic material enclosed in a protein cover, destined for evolution. Try to see it in a new way: the bug was involuntarily designed, meticulously engineered and strategically placed by us. When we stop treating it as a side effect, we can start talking about why the world is so full of coders yet has only a very few good ones, and why there is always a coder scarcity. Despite the mighty growth of the coding empire, the number of unemployed coders is rising like the freaking edge of the climate-change graph presented at the Paris accord.
Pointed skywards towards Valhalla.
(Off topic: the same climate accord that President Trump later pulled the United States out of, having claimed global warming was a Chinese conspiracy invented to drive down American GDP. LOL)
Abstraction gone wrong
Mostly, programmers sit in their chairs, glance at their assigned tasks in a project management system, consume caffeine and chips, formulate complicated equations (well, not always) and write well-formatted, concise, scalable and documented code in a hurry to strike things off their pending list. For all they know, the code they write may be anything from a music player's new feature update to the safety routine of a radiation therapy machine. It can be new bonus levels in a silly game or the control system of an autonomous car. God help us, it can be security patches for a nuclear reactor, updates for emergency services or mid-air maintenance modules for a United Airlines flight system.
Between 1985 and 1987, an infamous radiation therapy machine named “Therac-25”, produced by Atomic Energy of Canada Limited, shot high-current electron-beam radiation straight into six cancer patients.
There was a critical bug in the safety software of the Therac-25 machine, due to which the high-current electron beam hit patients with around 100 times the intended dose of radiation, concentrated over a much narrower area, delivering a potentially lethal dose of beta radiation. The patients felt an intense electric shock, which caused them to scream and run out of the room. After several days, radiation burns appeared and the patients began to show symptoms of radiation poisoning.
Three of them eventually died miserably from the radiation injuries.
It was due to a fault in the programming of the Therac-25. The bug arose from concurrent programming errors: when an operator switched quickly between the two modes, direct electron-beam therapy and megavolt X-ray therapy, the safety software hit a race condition and ended up in an inconsistent state, exhibiting undocumented behavior, in other words, a bug.
Previous models had hardware interlocks to prevent such faults, but the Therac-25 had removed them, relying instead on software checks for safety. The over-confident engineers had reused code from previous versions of the machine, code that had run on different hardware, a practice sometimes derided as cargo-cult coding.
Somewhere inside that code there was a one-byte flag variable which, instead of being set to a fixed non-zero value, was incremented on every pass. Every so often an arithmetic overflow rolled the flag back to zero, leading the software to bypass its safety checks.
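The real Therac-25 code was PDP-11 assembly and is not reproduced here; this is a minimal Python reconstruction of the published description of the bug, with simplified, invented names. An eight-bit counter incremented on every pass rolls over to zero once every 256 passes, and the safety check keys off "non-zero means check needed":

```python
class3 = 0  # the one-byte flag, kept as an int masked to 8 bits

def setup_test_pass():
    """One pass of the setup check: increments the flag instead of setting it."""
    global class3
    class3 = (class3 + 1) & 0xFF  # emulate the 8-bit rollover

def safety_check_skipped() -> bool:
    # The hardware position check only ran when the flag was non-zero.
    return class3 == 0

skips = 0
for _ in range(256):
    setup_test_pass()
    if safety_check_skipped():
        skips += 1
assert skips == 1  # exactly one pass in every 256 silently bypasses the check
```

Setting the flag to a constant non-zero value instead of incrementing it would have removed the rollover entirely, which is essentially what the eventual fix did.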
The coders wrote and handled the code as if it were okay for the machine to vomit radiation into patients. The code was never ready for the world, or it looked strangely unfamiliar with the world it was written for. There was a disturbing abstraction blinding it from its target environment, and it led straight to agonizing deaths.
Whatever it may be, the lines of code are written into an IDE or Emacs without the whole picture in mind. The culture of “striking things off the list”, “pushing to production” and “shipping” has inadvertently disconnected the code from its target ecosystem. Obviously, project management, systematic collaboration and productivity have ushered in a new coding paradise; the world is pushing code more efficiently now than ever. But we are obsessed with splitting things into story points, and our frameworks instruct us to make those story points as independent as possible, to such an extent that we are told not to worry about the people across the cubicles, and that should any problem arise, it will be sorted out by Testing/QA. Therein the problem resides.
The whole picture. It's like hiring multiple architects for a big building, giving them a ritual lecture about your grandiose vision of the building against the evening skyline, and then sitting them in cubicles, stressing them to isolate, telling them to stop worrying about the big picture and to write story points with no dependencies on each other.
The increasing disconnect between the source and the target.
It is, in many ways, the holy grail of what programming is all about.
In September 2007, Jean Bookout was driving with her best friend on the highway in a Toyota Camry when the accelerator seemed to get stuck. Panicked, she took her foot off the pedal, but the car didn't slow down, and it was then she realized the brakes too had lost power. She took an off-ramp at 50 miles per hour before pulling the emergency brake. The car skidded 150 feet down the road before running into an embankment by the side of the road.
When she woke up in a hospital a month later, she learned that her friend had died in the crash.
Toyota blamed the malfunction on sticky pedals, but the incident was one among many in a decade-long investigation into unintended acceleration in Toyota cars. By then, there was growing speculation that, somewhere, software was responsible. The National Highway Traffic Safety Administration enlisted NASA experts to perform an in-depth review of Toyota's code. Ten months and many hiccups later, the NASA team couldn't establish that the software was the cause, but they couldn't prove it wasn't, either.
During the litigation of the Bookout accident, Michael Barr, an expert witness for the plaintiff, finally stumbled upon a convincing connection. His team of software experts spent 18 months with Toyota's code, picking up where NASA had left off. They found what is called “spaghetti code”, programmer lingo for software that has become a tangled mess. Code turns into spaghetti as it accumulates over many years, feature piling upon feature, until it becomes a labyrinth that is impossible to follow, or to test. Using the same Toyota model, the team demonstrated that there were almost 10 million ways for key tasks on the onboard computer system to fail, potentially leading to uncontrolled acceleration. A minor flip in the computer's memory, a variable becoming a one instead of a zero, could send the car on a highway rampage.
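That "minor flip" is literal: a single corrupted bit in memory changing a value the control flow depends on. A toy illustration only (none of this is Toyota's code; the function, modes and numbers are invented) of how one flipped bit in a mode variable changes what the software does with the very same pedal input:

```python
def throttle_opening(mode_flag: int, pedal: float) -> float:
    """Toy throttle logic; invented for illustration.

    mode_flag 0: normal driving, the throttle tracks the pedal.
    mode_flag 1: cruise hold, the throttle keeps a fixed opening
    and ignores the pedal entirely.
    """
    if mode_flag == 0:
        return pedal
    return 0.3  # held-open throttle

healthy = 0
corrupted = healthy ^ 0b1  # a single-event upset flips one bit in memory

assert throttle_opening(healthy, 0.0) == 0.0    # foot off the pedal: idle
assert throttle_opening(corrupted, 0.0) == 0.3  # same input, car keeps pulling
```

In a real safety-critical system this is exactly what redundant copies, checksums and watchdog tasks are supposed to catch; Barr's point was that many of those defenses were missing or themselves fallible.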
Barr's testimony led to a verdict for the plaintiff, and it resulted in $3 million in damages for Bookout and her friend's family. Toyota recalled more than 9 million cars from the market.
“The problem is that most software engineers no longer understand the problem they're trying to solve, and don't care to,” says Nancy Leveson, an MIT software-safety expert. A colossal amount of time is spent on getting the code to work, instead of thinking through and anticipating scenarios in the real world. “Software engineers like to provide all kinds of tools and stuff for coding errors,” she says, referring to IDEs. “The serious problems that have happened with software have to do with requirements, not coding errors.”
When you're coding a control system that manages a car's throttle, for instance, what matters is how it's going to play out in the real world: the rules of motion, the time-critical knowledge of when, how and how much the throttle needs to open. But over time these systems become a maze, and hardly anyone can read through them and understand them. “There's 100 million lines of code in cars now,” Leveson says. “You just cannot anticipate all these things.” “The problem,” Leveson wrote in a book, “is that we are attempting to build systems that are beyond our ability to intellectually manage.”
That's the disconnect right there. A huge wall between the programmer and the world.
More and more of these disconnected projects create more disgruntled coders; the relationship is as proportional as Ohm's law. With enough of those disconnected coders, the fate of the world is in jeopardy.
To fix that, we need to acknowledge the problem of disconnect.
If we are talking about cars: like everything else, the car has been heavily computerized to accommodate new features. When control software is in charge of our throttles and brakes, it can precisely manage fuel injection to help you save on gas, and it can slow you down when you are on a collision course. The same software can keep you in your lane when you start to drift, or park your car with more precision than you will ever have. You cannot build these features without code. If you ever tried to, your car would be a jangling mechanical mess weighing 40 tons, an immovable mass of clockwork with a million moving parts.
We have come a long way since the invention of the wheel in Mesopotamia around 3500 BC. But when we are inside a car, we barely notice all this complexity. The harmony of code and mechanics has enabled us to build the most intricate machines that have ever existed. Just because the complexity is hidden away from our eyes, packed into tiny silicon chips as millions and millions of lines of code, doesn't mean it has gone away. It just takes a different set of eyes, skills and understanding to create and maintain a codebase like that.
Everyone looks up at Tony Stark and how, within a bleak comic nightfall at the Stark Tower lab, he becomes a genius in artificial intelligence and just codes stuff that saves him from dark scenarios we could never see coming.
There is a takeaway here. He knows the world. He knows the way it works. He knows the bad guys are always coming from the other end of the wormhole, and that he needs to save the day irrespective of his paycheck. If there is anything coders can learn from Tony, it is that he knows his world well.
When there is a lack of connection between the code, the world for which the code is written, and the person who creates and manages the code, problems begin to erupt. And they are not just compilation errors, and they don't just stay there.
They kill people.
Well, not always, but they do, and then they make headlines in bold, and people drop their champagne glasses reading them.
They cost other people their lives.
Back in the old days, when control systems were electro-mechanical (safety valves, hardware control units), the people who manufactured those parts built them with the idea in mind that real people were going to use them. It's like: “You don't weld that pedal at right angles, or the user is going to go Wile E. Coyote down the road, and unlike in the cartoons, he won't just de-pancake.”
Replacing those parts with a sensor tied to some code running on silicon gates doesn't give you the liberty to abstract the real world away from your code. When something was hardware, one could test it exhaustively. The manufacturer knew exactly when a conveyor belt would break, be it after 2,000 or 3,000 hours.
Now that everything runs atop segments of code, somewhere in our sprint course we need to try to re-establish that connection with the real world, instead of splitting the project into independent stories. The world doesn't work that way. Maybe it's high time we did some methodological rethinking.