Error handling and you

Error handling is serious business. Or at least, this is how a lot of programmers see it. For decades, programmers have handled program errors in a wide variety of ways, from error levels to error codes, macros, exceptions and beyond, all based upon the core concept of detecting an error and then handling it. But this process may be inherently flawed, because it is a reactive, not proactive, formula that ultimately places a huge burden on software design.


Think digitally.
Computers are digital machines, but the world is not digital. No matter what we do on our desktops, our laptops, our phones, or any other computer we use (even that little embedded computer in your Blu-ray player), everything is bound by a set of digital rules that keep data intact and error-free. But what exactly does this mean? The world is analog; it is far from error-free, and it is not always quantifiable.

The digital mindset is that data is first quantified, then transferred, received, error-corrected and finally analyzed as if its location had never even changed - as if nothing happened. Regardless of how perilous the journey was, the data appears to be completely intact and just as quantifiable as it was when it was sent out. It doesn't matter whether it's a short bus trip to the nearest memory cell or thousands of miles across the world to another continent.

Such is the wonder of error correction. Because of the extensive use of error correction in our technology, in everything from our storage media such as hard disks to our communication methods like DSL, the data being transferred and stored is never "aware" that it has been damaged in any way during transit through our analog world. It never has to worry about handling the errors that it encounters; the surrounding protocols take care of it all. And although there are exceptions, for the most part the program handling the data also doesn't have to worry about it containing errors.

Error handling is not correction.
This is where it gets tricky. Programmers are quick to assume that just because they trap an error and then take an appropriate course of action, they have corrected it. But this way of thinking is risky, because it can lead to software containing many bugs over time as it is designed around - and itself becomes - an error-rich environment.

When you handle an error, you have integrated the error into your program and allowed it to propagate. It seems innocent at first: one little check to make sure you allocated the memory, then another check to make sure the function that allocated the memory succeeded, then another check for the function that checked that the function that allocated the memory succeeded, and so on.
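
To make that concrete, here is a minimal C++ sketch (all function names here are hypothetical, not from any particular codebase) of how a single allocation check breeds a check in every caller above it:

    #include <cstdlib>

    // Hypothetical low-level helper: allocate a buffer and report failure.
    bool allocate_buffer(char** out, std::size_t size) {
        *out = static_cast<char*>(std::malloc(size));
        return *out != nullptr;               // one little check...
    }

    // Its caller needs a check of its own so it can report failure upward.
    bool build_message(char** out) {
        if (!allocate_buffer(out, 256))       // ...then a check of that check...
            return false;
        return true;
    }

    // And the caller of that caller checks again, and so on toward main().
    bool prepare_response(char** out) {
        if (!build_message(out))              // ...and another, and another.
            return false;
        return true;
    }

    int main() {
        char* response = nullptr;
        if (!prepare_response(&response))     // the top-level logic ends up
            return 1;                         // handling a failure born far below
        std::free(response);
    }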

The error then propagates all the way up to the highest scope, such that your main logic loop eventually has to worry about some minor error that could've happened in some function ten scopes deep. Even though the error itself is hidden from parent scopes, the error handling code had to be implemented from the ground up, from the lowest utility function all the way up to the main loop.

Why? Simply because in this kind of reactive design, anything can fail.

You may see your application as robust, able to handle any situation. But in reality, you've very likely just written a god-awful mess that probably even has bugs in its error-handling code - bugs that can not only lead to additional errors, but also open up entirely new territory where one error generates many others. Any programmer will be familiar with this kind of behavior - it occurs every time you compile a program that contains even a single error. It starts off with just one bad line, and then the compiler is printing thousands of pages of errors. Don't let your applications become this, because...

Errors should not happen.
A thousand programmers just cried out in terror, and then they were silenced. Going back to our original subject of representing data in a digital way, and how error correction hides, abstracts, or "encapsulates" the data from the analog world, we can do the exact same thing with our software. In modern operating systems, everything from the kernel to the user space applications is usually written under the assumption that the computer just works. Well, because it should.

While there is a great deal of error handling involved at the lower levels, at a higher level this becomes error correction because the operating system and all of its abstraction layers have taken care of so much for you that you don't have to worry about much in the way of program failure - usually only in a few extreme scenarios like running out of memory, or if the user has broken drivers. And in these cases, they probably have far greater concerns than whether your application will continue running perfectly, like how OOM Killer is destroying all of their precious work.

The separation of layers between the system and the application has turned error handling into correction. The problem is that this cannot typically be applied to the user space application itself, because it is usually written by the programmer to handle every single possible failure scenario - no matter how unlikely - and as a result is mostly made up of code designed under the assumption that anything can fail. The result is that even when nothing goes wrong, you end up with a horribly written, inefficient application that likely isn't even easy to debug.

The solution to all of this is to not let this mindset poison your code, which leads us to...

Write code that cannot fail.
This might immediately sound like an impossibility; after all, anything can fail, right? But as I explained earlier, most failures on modern systems are borderline edge cases: things that happen when something is really going wrong, and in those cases you're not likely to be worried about a single application. There are exceptions, mind, like a web server that has to stay up no matter what kind of circumstances the hardware it is running on might have to deal with.

But it is very much possible to write code that cannot fail - to turn error handling into correction at the application level.

Get error free.
Purge those error codes, exceptions, and everything else. Unless it's a critical point in your application (e.g. an error while constructing a static object), forget about it. The only time you should generate an error is when something is so bad that execution cannot possibly continue and you must abort.
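
As a sketch of what that can look like in C++ (fatal() and must_alloc() are hypothetical helpers, not a prescribed API), the only concession to errors is a single abort path for situations where continuing is impossible:

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical helper: the one concession to errors, used only when
    // execution cannot possibly continue.
    [[noreturn]] void fatal(const char* what) {
        std::fprintf(stderr, "fatal: %s\n", what);
        std::abort();
    }

    // Allocation that either succeeds or ends the program.
    void* must_alloc(std::size_t size) {
        void* p = std::malloc(size);
        if (p == nullptr)
            fatal("out of memory");   // no error code to bubble upward;
        return p;                     // every caller may assume success
    }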

Utilize redundancy.
For everything that your application needs to do, make sure that you have more than one way to do it if any of the available methods has a realistic chance to fail. For example, if you are doing threading, use multiple threading libraries (or APIs) so that if a call to one library fails, you can attempt a call to another library for the same task.
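
For instance, here is a sketch of that idea on a POSIX system (do_work, work_trampoline and run_work_redundantly are hypothetical names): try std::thread first, fall back to the pthread API, and as a last resort run the work on the current thread rather than fail:

    #include <pthread.h>
    #include <system_error>
    #include <thread>

    void do_work() { /* the actual task */ }

    // Trampoline so the pthread API can run the same task.
    static void* work_trampoline(void*) { do_work(); return nullptr; }

    // Try one threading API, then another, then give up on concurrency
    // rather than give up on the task.
    void run_work_redundantly() {
        try {
            std::thread(do_work).detach();
            return;
        } catch (const std::system_error&) {
            // std::thread could not be created; try the next method
        }

        pthread_t tid;
        if (pthread_create(&tid, nullptr, work_trampoline, nullptr) == 0) {
            pthread_detach(tid);
            return;
        }

        do_work();   // last resort: no concurrency, but no failure either
    }

    int main() { run_work_redundantly(); }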

If the functionality provided by a third-party resource has a realistic chance to fail, write your own fail-safe version that is enabled by default, because you should...

Write for test-pass, not test-fail.
If something can fail, remove it from your core application entirely, and then add it back in as extra functionality that is enabled when errors do not occur. The core behavior of your application should not change based upon something that is not reliable. Instead of designing fall-backs, turn the fall-back code into the primary layer of execution and add additional layers when available.
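
A minimal C++ sketch of that inversion (the rendering functions below are hypothetical stand-ins): the always-works path is the core of the program, and the unreliable path is an optional layer enabled only when it comes up cleanly.

    #include <cstdio>

    // Hypothetical back-ends: the software path is the core and cannot fail;
    // the GPU path is optional extra functionality layered on top of it.
    void render_software() { std::puts("software frame"); }  // primary layer
    void render_gpu()      { std::puts("gpu frame"); }        // optional layer
    bool try_init_gpu()    { return false; }                  // stub: assume no GPU

    void render_frame(bool gpu_available) {
        if (gpu_available)
            render_gpu();        // enabled because nothing went wrong
        else
            render_software();   // core behavior never depends on the unreliable part
    }

    int main() {
        // The optional layer is added in; the "fallback" is the primary path.
        const bool gpu_available = try_init_gpu();
        render_frame(gpu_available);
    }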
