Last weekend we released a gigantic version of the system, which was presumably perfect regardless of its size. At least that's what the final tests revealed that Saturday.
At the beginning of this week, on Monday, the system was working quite well until around 9:30, when one of the components started to publish malformed messages of a certain type. Fortunately, that message type had no huge impact on the business, though it was very unpleasant for the team to end up in a situation like this.
In the evening, we went into the details of the failure and I came to some conclusions:
The change request came from the business, to improve this message type under certain conditions.
5 lines of code changed in svn before the move to git. (Nobody noticed the change, as there was no pull/merge request.)
0 unit tests (UT) for that change.
0 functional tests (FT) for that change.
0 preexisting tests for that message type.
Given these conditions, the software engineer who applied the change thought it was so easy that writing a test would not have brought more benefit than the cost of writing it. So he applied the change, and you now know what happened.
The system is quite modular and the main operation was not affected at all by this glitch, though it could be even more modular.
The malfunctioning module/service was rolled back, though that could have been done much, much faster.
We have shown management the benefits of changing the way software is built, and convinced them to adopt practices such as (A)TDD, CD, a DevOps culture, and automation, among others. In fact, some changes have already been implemented, for instance moving from svn to git, code inspections, katas, and a (small) reading club.
What could have prevented that glitch?
Certainly (A)TDD. In fact, TDD is being adopted, and software engineers have seen the huge benefits of working this way. Unfortunately, the glitch was introduced before this adoption and nobody caught it earlier.
I always remember a piece of code refactored by Uncle Bob in one of his books, "Clean Code: A Handbook of Agile Software Craftsmanship", where there was a method doing something with dates. He decided to clean that code up, so the first thing he did was to test the current implementation until he reached a hundred percent code coverage. Only then did he refactor the code, without any collateral damage.
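To make the idea of testing the current implementation before touching it more concrete, here is a minimal sketch in Python. It is not the code from the book: the two functions and the pinned cases are hypothetical, made up only for illustration. The point is that the same checks are run first against the legacy code, to pin down what it does today, and then against the refactored version.

```python
from datetime import date, timedelta


def last_day_of_month_legacy(year, month):
    """Hypothetical legacy code: clumsy, but it is what runs in production."""
    if month == 2:
        if (year % 4 == 0 and year % 100 != 0) or year % 400 == 0:
            return 29
        return 28
    if month in (4, 6, 9, 11):
        return 30
    return 31


def last_day_of_month_refactored(year, month):
    """Hypothetical cleaned-up version: the day before the 1st of the next month."""
    first_of_next_month = date(year + month // 12, month % 12 + 1, 1)
    return (first_of_next_month - timedelta(days=1)).day


# Behaviour pinned down BEFORE refactoring: (year, month, expected last day).
PINNED_CASES = [
    (2023, 1, 31),
    (2023, 2, 28),
    (2024, 2, 29),   # leap year
    (1900, 2, 28),   # divisible by 100 but not by 400
    (2000, 2, 29),   # divisible by 400
    (2023, 4, 30),
    (2023, 12, 31),  # December rolls over into the next year
]


def check_pinned_behaviour(implementation):
    """Fail loudly if an implementation disagrees with the pinned behaviour."""
    for year, month, expected in PINNED_CASES:
        actual = implementation(year, month)
        assert actual == expected, f"{year}-{month:02d}: got {actual}, expected {expected}"


if __name__ == "__main__":
    check_pinned_behaviour(last_day_of_month_legacy)      # step 1: pin the current behaviour
    check_pinned_behaviour(last_day_of_month_refactored)  # step 2: refactor under the same tests
    print("both implementations agree with the pinned behaviour")
```

In a real code base the pinned cases would come from exercising the existing code until the coverage tool shows no untested branches, not from a handful of hand-picked dates like these.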
Software inspections/pair programming. It surprises me how many people say software inspections and pair programming are a waste of time. Some of them have told me they tried them but saw no benefit; others say they just don't work and they don't have time, though they have plenty of time to fix bugs...
There are quite a few books that explain how to do software inspections and pair programming, and I feel I can help with this analogy: software is like a paper, report or document. It has a structure, and you are supposed to understand what the document, or the code, says or does. Once you finish, talking about inspections, you go to your supervisor and ask them to review (inspect) it. Your supervisor might suggest changing your wording or focusing more on the reader, and if they find anything difficult to understand, they will ask you to rephrase or reorder it. Exactly the same happens in software while pair programming or inspecting code: if someone does not understand something, I'm a hundred percent sure they will ask why it is like that, what it does, or raise some other question or improvement.
Continuous Delivery. Of course, if we had had continuous delivery, we could have restored the service in a few seconds or minutes without causing more damage to the business. We also wouldn't have deployed that gigantic release, where we didn't have control of all the changes included.
I'm being very brief on this topic, but do not get me wrong: CD is not just deploying software automatically, it goes far beyond that. There are quite a few books, blogs, and papers that can explain what CD is much better than I can.
Software craftsman mindset. This is a huge topic and I encourage reading "The Software Craftsman: Professionalism, Pragmatism, Pride".
DevOps culture. This topic also needs a dedicated post. I'm going to write more about this when I've got more experience and can share thoughts about implementing that cultural change in my work.
Unfortunately, when it comes to releasing software at the Mexican Stock Exchange, this is not the exception to the rule: this kind of release is very common, and releases only happen twice a year. Changes are taking place to give the stock exchange the ability to release more often and better. It is a very long road, but this is not the first company to travel it, and it won't be the last to take this road to happiness and professionalism.
About the glitch, I hope everybody understands the importance of anything we do, even if it's quite a small change. It might have catastrophic results, as history shows. Glitches like this have caused casualties and the loss of millions of pounds, dollars, Mexican pesos, euros, you name it.
And please, do not start coding if a test does not exist!