Friday, January 21, 2005

21st Century resilience

This article was part three of a series intended to develop a theme for a presentation at the Enterprise Networks Conference 2004, an event which included the CMA Plenary keynote session. See www.enterprisenetworks.co.uk for more details.

What do we mean by resilience? One dictionary definition is “able to quickly return to a previous good condition”, giving a rubber ball as an example. For people, we generally mean bouncing back after some hard knocks, whether physical or emotional. For Telecoms systems, it is perseverance in the face of adversity, the Pony Express rider ensuring the mail gets through whilst dodging bandits and arrows (or fires in cold-war underground tunnels!)

Phone systems are real-time in software terms- the most important thing they have to do is notice events happening and act on them in a timely fashion. These events are generally quite low level, e.g. a user has pushed a button on a featurephone, a DTMF digit has been detected on a trunk line, an E1 interface has received a signalling message, the technician has pressed a key on the system terminal. (In the 1970s, it was lower still- the processor was often keeping track of every dialled digit, validating it for speed, make/break ratio, inter-digit pause etc.)

What often isn’t appreciated is that in many systems letting go of a button on a featurephone is also an event- how else could it be possible to buzz-buzz your Secretary?

There is also a corresponding set of “non-events” to be handled, i.e. timeouts where something should have happened but didn’t. An example of this is a user being given dial tone but not doing anything about it. Often the hardware will watch for non-events from the software as well, using what is sometimes called a watchdog timer, to capture the situation where St. Vitus gets involved.
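As a rough illustration of how a non-event might be caught, here is a minimal Python sketch. It is not taken from any real phone system: the class name `TimeoutTable`, the extension labels and the 15-second dial tone timeout are all assumptions made for the example. The idea is simply that the software leaves itself a timed note whenever it expects something to happen, tears the note up if the event arrives, and periodically checks for notes that have gone stale.

```python
import heapq


class TimeoutTable:
    """Tracks timed notes: deadlines by which an expected event must arrive."""

    def __init__(self):
        self._heap = []      # entries are (deadline, sequence, key)
        self._pending = {}   # key -> sequence number of the live note
        self._seq = 0

    def expect(self, key, now, within):
        """Leave a note: `key` should see an event within `within` seconds."""
        self._seq += 1
        self._pending[key] = self._seq
        heapq.heappush(self._heap, (now + within, self._seq, key))

    def event_arrived(self, key):
        """The expected event happened, so the note is cancelled."""
        self._pending.pop(key, None)

    def expired(self, now):
        """Return keys whose deadline passed with no event (the non-events)."""
        late = []
        while self._heap and self._heap[0][0] <= now:
            _, seq, key = heapq.heappop(self._heap)
            if self._pending.get(key) == seq:  # skip cancelled or superseded notes
                del self._pending[key]
                late.append(key)
        return late


timers = TimeoutTable()
timers.expect("ext-201", now=0.0, within=15.0)  # dial tone given to two users
timers.expect("ext-202", now=1.0, within=15.0)
timers.event_arrived("ext-201")                 # 201 dials a digit in time
print(timers.expired(now=10.0))                 # nothing overdue yet
print(timers.expired(now=20.0))                 # 202 never dialled
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic. A hardware watchdog works on the same principle in reverse: the software has to keep resetting the timer, and silence is treated as the fault.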

The software architecture will normally revolve around something called a work scheduler- this is the main engine that allocates tasks according to their importance. Giving a user dial tone may involve several iterations around the loop- firstly recognising the event at a low level & ensuring it is placed into the correct signalling buffer, then recognising what the transition actually means, i.e. a user appears to be initiating a call and it isn’t a priority user so put it into the correct call processing buffer, then eventually allocating resources & providing dial tone when it reaches the head of the queue.
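The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular product does it: the three buffer names, the `WorkScheduler` class and the two task functions are hypothetical, and resource allocation is reduced to a log entry.

```python
from collections import deque

# Hypothetical priority levels: the lowest number is served first.
SIGNALLING, PRIORITY_CALLS, ORDINARY_CALLS = 0, 1, 2


class WorkScheduler:
    def __init__(self):
        levels = (SIGNALLING, PRIORITY_CALLS, ORDINARY_CALLS)
        self.buffers = {level: deque() for level in levels}
        self.log = []

    def post(self, level, task, *args):
        """Queue a task at the tail of the buffer for the given level."""
        self.buffers[level].append((task, args))

    def run_once(self):
        """One trip round the loop: run the oldest task in the most urgent
        non-empty buffer. Returns False when there is nothing left to do."""
        for level in sorted(self.buffers):
            if self.buffers[level]:
                task, args = self.buffers[level].popleft()
                task(self, *args)
                return True
        return False


def line_seized(sched, ext, is_priority_user):
    """Low-level event: work out what it means and queue the follow-up."""
    sched.log.append(f"{ext}: off-hook detected")
    level = PRIORITY_CALLS if is_priority_user else ORDINARY_CALLS
    sched.post(level, give_dial_tone, ext)


def give_dial_tone(sched, ext):
    """Reached the head of its queue: allocate resources, provide dial tone."""
    sched.log.append(f"{ext}: dial tone")


sched = WorkScheduler()
sched.post(SIGNALLING, line_seized, "ext-201", False)
sched.post(SIGNALLING, line_seized, "ext-100", True)
while sched.run_once():
    pass
print(sched.log)
```

In this run both off-hook events are recognised first, then the priority user on ext-100 gets dial tone ahead of ext-201 even though ext-201 lifted the handset first. A real scheduler would also have to stop busy signalling buffers from starving the call processing queues, which this sketch ignores.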

So our system is constantly working very hard looking for something to do & leaving imaginary timed notes for itself to check up things that don’t happen when they should. So far, this isn’t that different to what any computer system would be up to, be it an enormous supercomputer at the Met Office or a humble Gameboy in your child’s bedroom. The difference comes in the resilience- being able to keep going despite all odds.

Hardware resilience comes in the form of robustness- duplicate or triplicate processors that work in co-operation or hot standby to ensure call processing continues. Mirrored memory, dual interfaces, backplanes & power ensure that most likely hardware failures result in minimal interruption and automatic recovery.

Software resilience comes in the form of recognising hardware faults and recovering from them. It also means recognising unexpected software outcomes and recovering from them, as well as producing an audit trail so that the situation can be duplicated & a solution found. Providing tools and alerts for the maintainers to ensure optimal service delivery varies from product to product & is generally much more simplistic for a Business system than for a Public Exchange.

In the phone system, unexpected events are hopefully worked around, provided that the scenario has been recognised by the designers. As the only software that runs on the system is proprietary there is a high expectation that, outside of beta trials, the systems will run stably. Or, more accurately, they appear to do so- most systems run housekeeping routines that “tidy up” after the call processing routines and close scrutiny indicates that the garden is not totally rosy. Phone systems have generally evolved from older systems where memory capacity, processor throughput and intra-system bottlenecks have resulted in constraints. Stress any one of these and even the best known systems start to behave oddly during the busy hour. (This is inevitable as it is never possible to test for absolutely every eventuality in a complex system.)

Contrast this with Servers. High end Servers are available that have redundant power supplies, robust disk arrays, hot swappable cards etc. Whilst they are not totally duplicated internally in their architecture (don’t be fooled by the number of CPUs) they can be clustered together in order to make them more powerful and resilient. Unfortunately, however, it is all too easy to load any old software onto them.

Are they suitable for call processing? They can be, provided that call processing is the primary function, or indeed the only function in a Windows environment. It is also a good idea to have automatic routines to enforce the occasional out of hours reboot in order to minimise the impact of memory leaks. Similarly, call processing needs to be executed as a “service” so that it starts again automatically and the device needs to be tuned for optimal behaviour, something not particularly suited to the ICT policies of using a preferred Server build and load.
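The “starts again automatically” behaviour is normally supplied by the operating system’s own service manager, but the principle can be shown with a minimal Python sketch. Everything here is an assumption for illustration: `supervise`, the restart limit and the deliberately flaky task are hypothetical, not part of any real call processing product.

```python
def supervise(run_service, max_restarts=3):
    """Run `run_service` until it returns normally, restarting it on a crash.

    Returns the number of restarts that were needed. Once `max_restarts`
    is used up the last error is re-raised, so a persistent fault is
    escalated to the maintainers rather than looping forever.
    """
    restarts = 0
    while True:
        try:
            run_service()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1


# Hypothetical call processing task that falls over twice before settling.
attempts = {"count": 0}


def flaky_call_processing():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RuntimeError("call processing crashed")


print(supervise(flaky_call_processing))  # prints 2
```

The restart limit is the design choice worth noting: blind restarting hides faults, whereas giving up after a few attempts forces the problem into view.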

Another thing about clustering- it comes in more than one flavour and may require intervention to recover from fail-over. The most useful approach is where one Server can be taken off-line for patching or upgrade whilst the other does the work & then vice versa, preferably seamlessly. This doesn’t entirely exist in the real world, although high availability is possible, even if zero downtime isn’t. Also, be prepared for occasional intervention whilst the typical Server response to an unexpected event continues to be the “blue screen of death”.

Of course, there are benefits. Servers are streets ahead of legacy phone systems with Ethernet and network connectivity. Woe betide anyone who naively put their Option 11 onto the network without an intervening router, as the system would spend more time watching what was going on out beyond the RJ45 than attending to low priority call processing tasks. Servers enable applications to communicate with each other efficiently and effectively, whether across the LAN or the WAN.


Something that does need to be taken seriously on Servers is managing threats- without the latest patches they are vulnerable to attack, as are the intervening network devices, increasingly so if they are from Cisco.

We communicate most effectively using all of our senses, i.e. sight, hearing, touch, smell and taste, along with an awareness of acceleration and gravitational force through our balancing mechanisms. ICT mainly concentrates on sight and hearing (once you put aside oddities like Sensurround, Smellovision and Theme Park rides).

Voice communication has an immediacy & convenience factor that has made the humble telephone ubiquitous in the developed world and the penetration of the mobile telephone even more remarkable. The convenience factor is a double-edged sword, of course, as it is frequently inconvenient to receive calls and often downright intrusive. The unidirectional equivalent of the phone is of course Broadcast Radio, although the Tannoy system also fits the analogy.

Text communication has progressed through Telegraph, Telex, Fax and Email, with an exponential increase in content potential and effectiveness. Fax added imaging, the first “rich text format”. Email gave the ability to attach data, graphics, pictures, sounds and programs. However, it remains a unidirectional, non-real time channel, in that whilst it is possible to have email conversations, they are by their nature asynchronous. SMS and MMS are mobility approaches to un-tethered rich text channels and Instant Messaging is an interesting hybrid of immediacy that can be more effective in contacting someone than persistent phoning.

Combining sight and sound, we have Video Conferencing which is poised to become commonplace on most desktops, along with Video messaging, Video broadcast (multicast) and on-demand playback (unicast).

In Telecommunications there are also elements of communication not immediately obvious to the user, e.g. data flow from PCs to the Network, Clients to Servers, Applications to other Applications, housekeeping for connection management and billing.

This is definitely a fractured landscape for the poor old user, further complicated by Chinese walls between channels and devices for home and business use. (That sound you heard was the author giving himself a quick slap for jumping on the buzzword bandwagon).

So how can our new unified systems possibly hang together and work well? Let us take a look at the disparate communications channels that we might possibly want to bring together, along with the transport mechanisms, interfaces and deliverables. This diagram has had several iterations and adding further detail or subtlety has been abandoned for now. It is an exercise for the reader to join all the possible entries and associations together and not make it look a complete rat’s nest!

There are a number of challenges here. As the main theme is resilience, then let us consider what a UCI based system needs to do in a fairly straightforward set-up, say a single-site office. It needs to communicate with all of the existing communications systems in order to be aware of what is going on. It may need to communicate with the back-office applications where there are associations (or hooks) into functionality. It needs to communicate to all of the users in order to serve their needs & be aware of their current availability set. It will have to be scalable for multi-site and nomadic use.

The important bit is that it has to be able to maximise the user experience in the event of any of these complex interactions failing or behaving inconsistently. Getting everything onto an IP backbone has to make it easier and given time, the other systems will be developed to be aware of and support UCI.

Next article- 21st Century presence, what is it and how can it be tamed?
