Friday, January 21, 2005

21st Century Bugs

(This article originally appeared in CMA Newsline March 2004)

This article is part two of a series intended to develop a theme for a presentation at the Enterprise Networks Conference 2004, an event which includes the CMA Plenary keynote session.

In part one, I suggested that the biggest obstacle to 21st Century Communications was the poor quality of Software Engineering, a heresy likely to get me drummed out of the British Computer Society. I have visions of the dishonourable discharge where the pony-tailed Technical Design Authority Prime removes the pencils from my breast pocket & ceremonially snaps them, breaks my keyboard over his knee, guillotines my CDs, removes the ball from my mouse, takes a hammer to my cathode ray tube & plunges my SecurID dongle into my Dilbert coffee mug!

How can I substantiate that sweeping statement? From a couple of decades in the Telecoms industry, on the fringes of (& sometimes in the thick of) the programming fraternity. Something I discovered very quickly is that most programmers love programming but have very little interest in the product they are developing, especially if it is for a business process or something not directly software-related, such as switching phone calls in a business communications system. A dream job for a programmer is probably either developing software tools or developing computer games, depending on inclination.

My first stint as a professional programmer involved three months of on-the-job training over in Canada in the mid-1980s. I had the advantage over many of the newbies (& lots of the oldbies as well) in that I had worked on the product for a number of years & knew what it could do. Understanding how it ticked under the hood gave me considerable insight into why it worked the way it did.

After exposure to the various aspects of the process, we were all given a number of simple bugs to investigate, fix & report back on. In a complex multi-user system a peer review is normally part of the process- you have to convince others that you recognise the problem and that your solution is appropriate before being allowed to re-integrate into the main software product.

Five of my six bugs were straightforward enough, but one of them was a real humdinger. The feature was something called dial-tone detection, where an outgoing Trunk line is checked for dial tone before sending the digits (this was before the days of ISDN, and supervision signalling could not always be relied upon in some Countries, ground start/earth calling PBX trunk lines being uncommon outside the UK & America). The high level design showed that there was a sophisticated algorithm (arcane programmer-speak for a recursive mathematical routine) that would carefully track overall Exchange response in order to isolate faulty Trunks and dial-tone detectors whilst allowing sufficient time to react to dial tone speed changes. There was a robust validation of the algorithm using data familiar to those of us who have had to use grade of service calculations derived from Erlang tables.
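For readers who haven't met grade of service calculations: the classic tool is the Erlang B formula, which gives the probability that an offered call finds every trunk busy. A minimal sketch in Python, using the standard numerically stable recurrence (this is a generic illustration, not the original algorithm's validation code):

```python
def erlang_b(traffic_erlangs: float, lines: int) -> float:
    """Erlang B blocking probability: the chance a call arrives
    to find all lines busy, given offered traffic in erlangs.
    Uses the standard recurrence B(E,k) = E*B(E,k-1) / (k + E*B(E,k-1))."""
    blocking = 1.0  # with zero lines, every call is blocked
    for k in range(1, lines + 1):
        blocking = (traffic_erlangs * blocking) / (k + traffic_erlangs * blocking)
    return blocking

# e.g. 2 erlangs of traffic offered to 2 trunks blocks 40% of call attempts
print(erlang_b(2.0, 2))  # → 0.4
```

Dimensioning tables of exactly this kind were what gave the design its air of rigour.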

The documentation was spot-on and the Author was the most respected senior pointy-head in the place (who had recently moved on to greater things within the Company). Why didn’t it work?

The answer came to me in a flash of insight a couple of days later whilst setting up breakpoints & traps, the software equivalent of an oscilloscope. The reason it didn’t work properly was that the algorithm had a major shortfall- it was designed based on precision numeric calculations to several decimal places but the system itself used integer maths- 42.99999997 was still 42.

Once this was grasped, it showed up several flat-spots on the response curve where certain scenarios could result in calls hunting between Trunks rather than waiting the appropriate time. The fix was reasonably straightforward- a couple of lines of compensatory code to smooth out the response hiccups.
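
This class of bug is easy to reproduce. The snippet below is a hypothetical reconstruction, not the original switch code: an exponentially smoothed average that converges correctly in floating point but sticks at a flat-spot under integer maths, because the truncated fraction never accumulates.

```python
def smooth(avg, sample, integer_maths):
    """One step of an exponentially weighted average (7/8 old, 1/8 new).
    Integer division truncates the fractional part, just as the
    system's fixed-point arithmetic did: 42.125 becomes 42."""
    if integer_maths:
        return (avg * 7 + sample) // 8  # truncating division
    return (avg * 7 + sample) / 8

avg_float, avg_int = 42.0, 42
for _ in range(100):
    avg_float = smooth(avg_float, 43, False)  # creeps up towards 43
    avg_int = smooth(avg_int, 43, True)       # stuck at 42 forever

print(round(avg_float, 3), avg_int)  # → 43.0 42
```

The integer version never responds to the changed input at all- exactly the sort of dead zone that left calls hunting between Trunks.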
What was considerably more enlightening, however, was getting the solution through code review. The experienced programmers had trouble accepting that such a mistake was possible from one so senior, such is the hubris of the profession. Whilst I was eventually able to persuade them, it took considerable escalation through the ranks, as there was also a certain level of denial- if it was so fundamentally flawed, it should have shown up during testing, regression, field trials etc. The reality was that the programmer had designed some very complex code as an intellectual exercise and someone else had then implemented it without too much thought.

Whilst this was just one (particularly memorable) bug from two decades ago, have things got better? To an extent. On the positive side, languages have evolved considerably, along with the tools and processes used to create and manage software. The downside, however, is that proprietary platforms and languages have largely given way to generic tools. This is a good thing in many ways, but it means that problems with functionality cannot always be identified and sorted out in-house, especially when the tool source code cannot be examined (the Windows vs. Linux argument).

A particular problem that hasn't changed much is something known as scope creep- where the definition of what the software is actually meant to do evolves during the course of development, often compromising the original design. There are many factors that cause this, starting with the business not actually knowing what it wants at the outset, the analysts interpreting the requirements to fit a solution, the developers creating specifications that are open to interpretation, and the business then changing priorities & functionality once it gets its hands on the beta software.

There is always pressure- pressure to deliver on time, pressure to deliver what is wanted (rather than what is asked for), pressure to maximise quality and pressure to minimise expenditure. These pressures are perfectly normal for any type of project but when it comes to complex systems where the project managers can’t actually wander round site and see what the chaps in hard hats are up to, then it is often a voyage into the unknown.

Another area that causes bugs is interpretation of standards. When Telecommunications was very standards-based, the specs were thorough, well researched and definitive. The CCITT would publish the manuals on a 4-year cycle and the colour of the cover determined the vintage. Suppliers would talk about Q.921 Blue Book & everyone knew what they meant. The standards bodies would adopt each other's standards and everything in the garden was more-or-less rosy. However, as time went on, the pace of change blew this out of the water. An example of this is ISDN in Europe, where the early introductions were country-specific, with standards such as DASS and 1TR6, before giving way to Euro-ISDN, which still has a number of flavours.

Nowadays, whilst there are still numerous standards for all sorts of things, the Internet (bizarrely enough) doesn't actually have any in the accepted sense of the word. Instead it has RFCs, or "Requests for Comments", maintained by the Internet Engineering Task Force. RFCs cover things such as routing, protocols and even accepted procedures for quoting text in email replies, something Microsoft chose to ignore when they created Outlook Express.

So, equipped with some programmers, some processes and some standards, all with varying levels of dodginess, our tame software house wants to write the ultimate killer application, the one that unifies the fractured communications landscape & makes them a shed-load of money in the process.

How well will they manage that? It depends on their background and to some extent whether the emphasis in ICT is on IT or CT. Developers of real-time carrier-class systems with expectations of 99.999% availability have different approaches to developers of billing systems requiring 99.999% accuracy.

The next article in the series looks at resilience, i.e. ensuring the IP equivalent of dial tone for all communications services.
