The second speaker was Professor Justin Keen (University of Leeds), who also had an interesting perspective on the nature of complexity and how organisations can deal with ever-increasing levels of complexity in IT systems. They were joined by leading figures from the academic, medical and business communities to debate the issue.
Defining complexity and recognising the problems
Both speakers felt that the overriding trend of the past three decades has been unprecedented growth in everything electronic, most significantly in information technology.
Definitions of complex systems have to incorporate the numbers game: the higher the number of technical parts and people involved, the more likely a system is to be complex. With complex systems one can see the final result, but it is increasingly difficult to see or understand the inner workings of such a system.
From a real-time controls perspective, complexity jumped in the move from linear (continuous) systems, with their inherent constraints, to the realisation of these systems in digital form, which allowed us arbitrary and discontinuous relationships. This was the liberating step for flexibility - but the precipice with regard to complexity.
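That jump can be illustrated with a toy comparison between a linear control law and a digitally realised, discontinuous one. The gains, thresholds and dead band below are hypothetical, chosen only to show how the discontinuous version must be reasoned about region by region:

```python
def linear_gain(error):
    """Continuous: output is a fixed multiple of the input error."""
    return 2.0 * error

def scheduled_gain(error):
    """Discontinuous: digital logic can switch behaviour arbitrarily."""
    if error < 0.5:
        return 0.0           # dead band: ignore small errors
    elif error < 2.0:
        return 2.0 * error   # normal proportional region
    else:
        return 10.0          # saturate hard for large errors

# The linear law can be analysed globally in one step; the scheduled
# law needs a separate argument for every region and boundary.
print(linear_gain(1.0))      # 2.0
print(scheduled_gain(0.4))   # 0.0 - below the dead band
print(scheduled_gain(5.0))   # 10.0 - saturated
```

Each extra region is a small gain in fidelity and a disproportionate gain in the number of cases a verifier must consider.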
'There is complexity in requirements, as well as within the IT systems themselves. We need to be able to define the limits of projects. Civil engineers have developed the role of the architect, and IT modelling can be compared with this,' said one of the speakers.
'Similarly,' they continued, 'in communications the need for autonomy and "healing" in networked devices has driven a significant increase in complexity over statically defined and deployed networks - yet it is a feature we now expect as standard.'
The second speaker asked 'what is complexity? Where is it from, and what does it mean?'
The considered answer was that it is partly down to the relationship between the past and the present. The speaker highlighted the difficulties in specifying the future, where complexity and risk can be seen as two sides of the same coin.
They also put forward the notion that knowledge can actually decrease control. 'The more one integrates organisations with IT,' they said, 'the more likely you are to get people with different values and systems, which in turn increases the risk of misunderstanding.'
Unfortunately, real life is not simple. For example, attempting to model even a simple fuel injector in real time, across the hydraulic, electro-magnetic and mechanical domains, would involve not only the complex interaction of the systems in terms of actuation response, but also environmental factors (temperature, pressure), the dynamic changes to these parameters during actuation, mechanical wear-out, drift, compensation for original manufacturing tolerances, and much more.
In computing there is a higher risk of problems happening too fast for us to do anything to prevent them or control their spread.
The contraction of space and time is possible via new technologies, but this is a double-edged sword: in our rush to make life easier we have increased its unpredictability. Problems arise with more complex components, especially when they are taken outside the 'envelope' for which they were originally intended.
Engineers value simplicity. But the champions of simplicity frequently seem to lose out to the finesse argument - unless it crosses some immediate major cost breakpoint.
Usually it is combinations of smaller problems that have a cumulative effect, finally resulting in a catastrophic error or major problem. Complexity leads to things not being done right, which in turn leads to larger problems further down the line.
The consequences of IT failure for the average person have grown dramatically - a failing car computer can kill - so the issue of complexity in IT is a serious one.
So what continues to drive complexity today?
From modern electronics, but very definitely from IT, we expect increasing refinement at a lower cost with successive generations. We no longer accept 'undiagnosed' failure for anything other than cheap disposable commodities.
Consumers value well-engineered complex products and the diversity, performance or flexibility they enable. The mobile phone, for example, is no longer just a phone. It can be a camera, pager, message service, calculator, alarm clock, diary, music player, and broadcast radio receiver. As consumers we seem to place value on the functionality - even when we don't use it.
The 'capability' growth of electronics, both from a reliability standpoint, but more significantly from a computation standpoint, has enabled engineers to provide solutions that increasingly improve the fidelity of the required operation - a direct factor in increasing system complexity - and allowed us to push the boundaries of a system’s operating efficiency.
Complexity continues to grow. In automotive engine control, complexity has been doubling every four years since the advent of electronic control (irrespective of mitigating actions to simplify). In aero engine control the doubling takes seven years.
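Taking those doubling periods at face value, the compounding is stark. A minimal calculation, using the article's four- and seven-year figures over an illustrative 28-year span:

```python
def growth_factor(years, doubling_period):
    """Relative complexity after `years`, doubling every `doubling_period` years."""
    return 2 ** (years / doubling_period)

# Over 28 years: automotive control doubles 7 times, aero control 4 times.
auto = growth_factor(28, 4)   # 2**7 = 128x the starting complexity
aero = growth_factor(28, 7)   # 2**4 = 16x
print(auto, aero)
```

Even the slower aero curve is exponential; any linear improvement in process or tooling is eventually outrun.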
Most unreliability in software stems from the interaction of relatively simple components producing emergent behaviours, as the system is exercised through contexts that the designer had not predicted or foreseen.
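A classic, deterministic illustration: a least-recently-used cache and a sequential scan are each simple and individually correct, but combined they can produce a pathological result neither designer intended. The capacity and access pattern below are contrived for the demonstration:

```python
from collections import OrderedDict

class LRUCache:
    """A simple, correct component: evict the least recently used entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)     # mark as recently used
            self.hits += 1
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict oldest
            self.store[key] = True
            self.misses += 1

# Another simple, correct component: a loop scanning keys in order.
cache = LRUCache(capacity=3)
for _ in range(10):            # repeat the scan ten times
    for key in "abcd":         # working set is one item larger than the cache
        cache.get(key)

# Emergent result: every access misses, though neither part is 'wrong'.
print(cache.hits, cache.misses)  # 0 40
```

Each component passes its own tests; the failure exists only in their interaction, in a context neither designer had foreseen.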
Simple solutions are characterised by a small number of well-defined interfaces and coherent interface policies. Reliable systems tend to be simple (easy to validate) and the product of a uniform development philosophy (usually stemming from a single corporation, team, or even designer).
Some organisations are using IT to downsize their staffing requirements, which can be a dangerous exercise from a safety point of view alone.
Synthetically replicating the physical world will always be constrained by the technology available. Even if we could synthesise the behaviour faithfully (without introducing modelling error) we would still have to live with the approximation and rounding errors of the underlying machines.
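Those machine-level errors are easy to demonstrate: binary floating point cannot represent most decimal fractions exactly, and the discrepancies accumulate.

```python
# 0.1 and 0.2 have no exact binary representation, so their sum
# is not exactly 0.3.
print(0.1 + 0.2 == 0.3)   # False

# Accumulated drift: ten thousand additions of 0.1 do not give 1000.0.
total = sum(0.1 for _ in range(10_000))
print(total)              # close to, but not exactly, 1000.0
```

No amount of modelling fidelity removes this floor of error; it is a property of the substrate, not of the model.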
A system's complexity is not simply driven by its explicitly stated behaviour but also by its relationship with its environment. Relationships with other processes, whether they are executed by other systems or by people, are probably a higher driver of complexity than the requirements for what the system needs to achieve on its own.
Once the interface of the systems with people is accepted as part of the solution those people’s expectations for a suitable user interface drive the complexity still further. These expectations come from experiences with systems which are not generally high integrity such as office applications, mobile phones, set top boxes, satellite navigation systems. Such systems are sold on their user-friendly nature and often integrity is compromised.
Informed by other systems
The construction industry could be viewed as a system with its use of well-defined basic materials from many different suppliers, well defined use-cases, with inspections to legislated quality standards. We see specialist trades working on integrating solutions which are defined in detail by architects and underwritten by civil engineers to significant design margins.
If we expand this idea to systems of systems, i.e. towns or cities, we see continuous regeneration because of wear-out or obsolescence. Fundamental services (water, sanitation, power) and the overall design often trade the efficiency of individual systems for integration with the complete system - the town, in this case. Brunel, for instance, is rightly revered not only for his engineering skill but also for his foresight. This is the trait of a true system architect.
Larger scale systems imply more concurrent activity, more suppliers and different implementations. The effect of commercial competition and collaboration is easy to recognise in the construction domain - a system solution being a trade-off of price, performance and materials. This really shouldn't be any different in significant IT solutions.
However, economically one cannot afford to be too adventurous in producing complex systems. A fundamentally revolutionary approach is more likely to run into problems than one that evolves systems more slowly.
In other civil engineering areas - building, gas, electricity, glazing - there are rules and regulations that prevent designers from being too adventurous, reducing the chance of people getting hurt; CORGI-registered heating engineers are one example.
What can nature teach us?
Nature builds some of the most complex organisms we know. Most are customised to a particular environment; they are highly 'evolved', built on countless iterations, ruthlessly discarding failures, continuously mutating to find the best adaptation to particular environments.
The natural process takes many 'generations' with very few species actually ever truly stabilising. Prototypes are allowed to evolve in parallel to select optimal solutions, or all may be discarded if the mutation’s flaws outweigh its advantages at the time of assessment. And so it has to be with IT systems.
Given that we can generate potential solutions, replicate, mutate and iterate far faster than nature's programmes, is it not reasonable to suggest that 'goal seeking' solutions hold some value? Is stability an unnatural goal? Nature, it would appear, is way ahead of us.
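The replicate-mutate-select loop described above can be sketched in a few lines. This is a toy (1+1) hill climber, with a hypothetical bit-string 'environment' standing in for a real fitness landscape:

```python
import random

random.seed(42)  # deterministic run for the illustration

TARGET = [1, 0, 1, 1, 0, 0, 1, 0]   # the 'environment' to adapt to

def fitness(candidate):
    """How well the candidate matches its environment."""
    return sum(c == t for c, t in zip(candidate, TARGET))

def mutate(candidate):
    """Copy the parent and flip one randomly chosen bit."""
    child = candidate[:]
    i = random.randrange(len(child))
    child[i] ^= 1
    return child

best = [0] * len(TARGET)             # an unadapted starting organism
for generation in range(500):
    child = mutate(best)
    if fitness(child) >= fitness(best):  # ruthless selection: keep the fitter
        best = child
    if best == TARGET:
        break

print(best == TARGET)
```

Failures are discarded every generation and only improvements survive - exactly the iterative adaptation the speakers held up as nature's method.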
Why are we so unprepared?
One of the speakers felt that there was a great deal of empirical evidence to suggest that the complexity of current systems is at the limit of, or even accelerating beyond, our 'comfort zone', shaped by our own classic education and experience. Our ability to be confident in the solutions we generate therefore, is not keeping pace with complexity growth.
In trying to synthesise the physical world we will always have problems with rounding errors. There is always a compromise to be made, and the larger the system the more compromises will be made. Large scale systems, by their very nature, imply greater collaboration which often comes with greater levels of compromise.
Because average system size is rising rapidly - probably beyond the capability of our process improvements to mitigate - we need to find revolutionary, novel jumps and stretch the thinking of our engineers.
Another problem, which particularly affects large-scale projects, is that business today encourages us to have relationships with people all around the world whom we don't know well, producing systems run by remote, disembodied and less personalised relationships at their core.
Most of the debate delegates agreed that in order to make things more reliable it's better for people to know how they work. The more complex a system the less understanding people have of it, and the less reliable it becomes.
In principle, the processes of design and test that we use today characterise the residual error rates. Evolution of these processes, along with strict compliance, allows us a modest percentage increase in integrity versus cost. High-integrity developers, typical of the security, aerospace or mass transport industries, are using the strongest available mechanisms to assure themselves that they exceed their design goals.
We are still learning to live with complex systems but are still struggling to make large scale IT systems reliable. Perhaps the more complex the system the greater the need there is for design through modelling.
The problem with large-scale systems is that even when one sees things going wrong, it is sometimes better not to intervene, since one disaster can have a knock-on effect elsewhere. After six people died in the Hatfield rail crash, for example, an estimated 30 deaths occurred on the roads as travellers diverted from rail.
Instead organisations should be offering more training and regulations to deal with emergent technologies. For most IT projects it should be a mandatory feature to have a risk assessment of the project before it's put into place. IT is infinitely complex, hence the need for risk assessments.
However, if infinitely complex it would be impossible to predict all threats and risks inherent within the system. But who would scrutinise these risk assessments? Or do we just have to live with complex systems and all the associated problems that go with them?
Making people aware of what the problems are in the first place is a major step forward. Better training could also help, teaching system developers how to avoid designing things in such a complex manner.
One of the great challenges in the IT domain is the occurrences of unintended emergent behaviours. These can be lived with as long as component parts are reliable, risk assessments are made and regulations assist rather than hinder.
An incremental evolutionary approach may be the only way of dealing with increased complexity. Asking engineers to specify things they have no way of knowing is dangerous. It may be that the academic community should get more involved in trying to understand the implications of these complex systems. Ultimately, engineers should be able to design and build systems to deal with known unknowns.
Many at the debate felt that organisations should take more steps to learn from other source communities which have been successful.
Tool sets were perceived not to have moved fast enough to keep up with the complexity. It is possible that tools allowing us to work at a suitable level of abstraction could enable complex system development, but where required system integrity levels are high we are often not confident enough to work at such levels.
Ironically, the tools themselves become increasingly complex IT solutions, so perhaps it is wise not to be over-confident in our ability to build them. An analogy is the development of early compilers, which allowed us to work in FORTRAN rather than assembler.
In terms of a development process for complex systems there seems to be clear evidence that the successes are based on incremental or evolutionary approaches to a solution rather than revolutionary ones. Even revolutionary systems are often best achieved by incremental developments, in that we usually phase the project to first establish a framework, then some elements of a functioning system and then build towards full functionality.
Finally, it was generally agreed that an ideal role for the academics would be to get more involved in the actual project development work to witness the issues and formulate observations about them. The actual practitioners rarely get the time to really consider what the problems are and what the research should be to address them.