Fault Tolerance and Recovering from Unexpected Failures

How to deal with failures #

Recently I had a situation where I had a relatively simple Python script that would do some API calls to retrieve some data. When writing this script, I thought of what to do if something wrong and unexpected will happen (except REST API timeouts), but slowly the script grew organically and I have not gone into making it fault-proof. While my script did not have any problems running, the server on which was running suddenly restarted and thus my data collection came to an end stop. So how do I now recover from this error?

The script was already running for over a week (yes, I had a lot of data to collect) in a concurrent fashion. Luckily, I set up a log file and I was able to quickly determine which data was being collected when the server restarted. The issue I had was now to figure out how to restart the data collection without redoing all the work that was already done. The question at hand was: how much time do I want to invest in fixing this issue?

There were several options available to me:

Check manually which data was incomplete, delete that data, and restart the script. As input to that script I had an Excel file with all the repositories that were to be used to collect the data. I could have of course made a new Excel only with the remaining repositories where data was incomplete and use the new one as input.. Or I could have specified in the Python script to only use repositories [X:Y]. The issue with this option was that while I could have done it quickly (30-40 min), it was not really fixing the real issue of having a script that could not restart and figure out where to continue collecting the data, in case an unexpected event happened.
Fix the script and add some recovering capabilities. I chose this route as I soon realized that I could use the fact that the script required as input a start and end date for each repository. Data was collected on a monthly basis, meaning that I could check if my last data collected for a repository was the last expected month, according to the end date supplied as input.

The fix was then simple: I would check if an output file already existed for the current repository that was being analyzed, and if contained as the last entry in the CSV file the data for the last month as per the end date's input. If it did, then I would skip it and continue to next repository. If it did not, then I would redo the data collection for that specific repository. This was a simple and efficient way to recover from unexpected failures (such as environment errors) that only took me a few minutes to code, but gave me the guarantee that I can easily re-run the data collection in case of unexpected failures.

Lessons learned #

Sometimes, simple solutions are sufficient given all the assumptions and knowledge regarding the environment in which the program executes.
Never assume that the program will never fail. For example, how would one handle the case of running out of hard disk space (I saw it happening in real-world systems)?
Use more complex fail-recovery methods when it's needed. Write down all the assumptions, think of edge cases, think about what strategy works best, and make a cost-benefit analysis. From experience, everytime I might need to "re-run" an experiment, a data collection, or something involving data, it is better to make it easy to re-run -- and I must take care managing (create, delete, etc) any assets produced by the program to ensure output consistency.

Next: Catching exceptions in Threadpool executors