The design and implementation of highly available mission-critical systems require a more disciplined approach to project management than is usual in the IT industry.

At a talk given to the Birmingham Branch of BCS, Colin Butcher of XDelta discussed the factors governing system availability and performance and provided some good advice for IT project managers everywhere on how best to set up such important systems. Justin Richards reports.

According to Colin Butcher, mission-critical systems are systems that people rely on come rain or shine. The actual definition of a mission-critical system is ‘a system that someone somewhere relies on to get something done, without data loss or corruption and without stopping working when they need it to work’.

Unfortunately, he went on to say, people tend to be on the sloppy side and only really do what they’ve been asked to do. It’s this bare-minimum approach to project managing and system creation that has led to the delivery of some rather poor IT systems over the years; systems which fall down rapidly when they are pushed.

Objectives and architectural design

When designing a system, project managers need to have clear objectives and think ahead as far as possible in order to develop a well-structured systems architecture. They also have to understand the constraints and the absolute minimum that the proposed system has to achieve. It helps to focus on the system’s core functions first and to implement these as well as possible.

Leadership

Project managers need to ensure that everyone involved in developing the system has a clear and consistent idea of what the project is about and what it is trying to achieve. The key is to plan ahead and not wait for events to overtake you.

Budget and schedule

As with any project, the budget and schedule need to be appropriate for the problems that you’re trying to deal with. Budgets and schedules should never be set first. For example, there is no point in pricing hardware, let alone buying it, until you’ve established as much as possible about the system’s requirements and the project’s constraints.

Business continuity

Business continuity isn’t just about the systems; it’s about everything: the buildings, the people, equipment, governance and so on. Project managers have to be on top of this issue.

Disaster tolerance

An organisation’s system must be able to survive a major site outage without loss of service. This is known as disaster tolerance; the higher the tolerance the better the system.

Disaster recovery

Disaster recovery is the process of restarting sufficient system service to run the business after a loss of service, typically from another location. Basically, it’s what you do when you’ve lost a battle with the system.

A mission-critical system needs to be able to:

  • survive failures (resilience and failover);
  • survive changes (adapt and evolve);
  • survive people (simplify and automate);
  • never corrupt or lose critical data (data integrity).

One major point to bear in mind is that system requirements never remain static over extended periods; they always evolve, which is why it’s so important for mission-critical systems to be able to adapt over time. You also need to reduce the probability of failure and to minimise the consequences of failure.

You can do this by determining the critical components and people, defining the critical stages of a project, and understanding how system failure can happen and what it actually looks like when it does. It pays to be paranoid about your data!

The best way to reduce the chance of systems failure is to continually train people so that they are prepared to deal with worst-case scenarios.

100 per cent uptime

By their very nature, mission-critical systems require high levels of availability and have to be fail-safe. But do we really need 100 per cent uptime? Probably not.

It’s true that 24 x 365 mission-critical systems are fairly rare, as they leave no downtime window in which to make changes, fix faults and take backups. In such a scenario these would all have to be done ‘live’ and very carefully. It is therefore better to design something that can be worked on piecemeal.

Survivability

It pays to reward salespeople once a project is actually up and running rather than on a commission basis. This can make a difference to a system’s overall survivability. You need to design survivability into a range of factors including hardware and operating systems.

With this in mind, project managers need to ask themselves a number of searching questions:

  • How long have we got?
  • How much data can we afford to lose?
  • How can we model whole systems?
  • How quickly do we need to react to a failure?
  • How can we predict the overall availability?
  • What level of service outage can we tolerate?

After all, the closer you get to 100 per cent uptime the more expensive a satisfactory solution will become.
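To put numbers on the question ‘what level of service outage can we tolerate?’, it can help to convert an availability target into a yearly downtime allowance. The following is a minimal sketch rather than anything presented in the talk; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of outage per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Each extra 'nine' cuts the permitted outage by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Seeing how quickly the allowance shrinks makes it easier to discuss what each additional step towards 100 per cent is worth to the business.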

The design process

Design is an abstract process and yet it remains a structured one. It’s best to start off with an ‘ideal’ design: if you could ignore all the normal constraints, how would it all work? Then incorporate the constraints into the process and redesign, but only on paper at first, talking it through with as many stakeholders as possible.

One of the hardest aspects of systems design is to understand all the details and complexities involved and to consider all the possible interactions within the system and plan ahead to avoid potential problems. All through this process managers need to remain flexible and adapt to changes. It also pays to keep it simple, but don’t oversimplify.

Follow the data flows

All design decisions are compromises and require you to exercise judgement. By looking carefully at all the system’s proposed parts and how they interact with one another, you can work out how you’re going to bring the system into service.

However, never make assumptions as to how things are going to be implemented, as these will constrain your thinking. Be sure to document your decisions and the outline design, including what you got wrong first time around. And remember that you need to design in the ability to make changes.

Establish meaningful naming conventions

The art of choosing names for things is a difficult one. You shouldn’t reference a place or call something primary or secondary, as these can change over the course of time. Node names, network addresses and so on are almost impossible to change without major disruption, so try to use a sparsely populated name space with room for expansion.

It’s best to choose simple and logical names based on function, not location, and to have a structured way of creating names that allows for future expansion and change. Colin also advised against embedding real names of physical entities into the application.

When it comes, for example, to storage device naming, Colin suggested separating the data types (e.g. bootable system, static data etc.) and allocating LUN values (see reference 1), identifiers, volume names and device names in a logical manner.

Data storage and interchange

Data file internal structures may change over time, so to minimise the risk of corruption, tag the file structure with a version identifier. Colin also made the point that packet and buffer formats for exchanging data between application components may change over time, and that application components may need to move between machines over time. Project managers should be aware of both.
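As an illustration of tagging a data structure with a version identifier, the sketch below prefixes each record with a short header and refuses to read versions it does not understand. The signature bytes, header layout and function names are all hypothetical; they are not taken from the talk or from any particular product.

```python
import struct
from typing import Tuple

MAGIC = b"MCS1"                 # hypothetical 4-byte file signature
HEADER = struct.Struct(">4sH")  # signature + unsigned 16-bit format version


def pack_record(version: int, payload: bytes) -> bytes:
    """Prefix the payload with the signature and its format version."""
    return HEADER.pack(MAGIC, version) + payload


def unpack_record(blob: bytes, supported=(1, 2)) -> Tuple[int, bytes]:
    """Return (version, payload), rejecting anything we cannot interpret."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, version = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a recognised file")
    if version not in supported:
        # Failing loudly is safer than misreading a newer or older layout.
        raise ValueError(f"unsupported format version {version}")
    return version, blob[HEADER.size:]


if __name__ == "__main__":
    blob = pack_record(2, b"static data")
    print(unpack_record(blob))
```

The same idea applies to packets and buffers exchanged between application components: a reader that checks the version first can reject or convert data instead of silently corrupting it.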

Network segmentation and addressing

Colin advised using separate data flows (for example iLO traffic, inter-site storage traffic, user access etc.) and using 802.1Q VLANs to segment the network and separate out the different types of traffic flow. He also made a good case for mapping network addresses to functions rather than to machines and for always allocating protocol addresses in a logical manner.
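One way to keep such an addressing plan logical is to hold it as data and check it mechanically. The sketch below is only an illustration: the function names, VLAN IDs and subnets are invented for the example, and it uses Python’s standard ipaddress module to confirm that no two traffic classes share a VLAN ID or an overlapping subnet.

```python
import ipaddress
from itertools import combinations

# Hypothetical plan: each traffic class gets its own VLAN ID and subnet.
PLAN = {
    "management-ilo": (110, ipaddress.ip_network("10.10.110.0/24")),
    "intersite-storage": (120, ipaddress.ip_network("10.10.120.0/24")),
    "user-access": (130, ipaddress.ip_network("10.10.130.0/24")),
}


def check_plan(plan):
    """Return a list of problems: duplicate VLAN IDs or overlapping subnets."""
    problems = []
    for (name_a, (vlan_a, net_a)), (name_b, (vlan_b, net_b)) in combinations(plan.items(), 2):
        if vlan_a == vlan_b:
            problems.append(f"{name_a} and {name_b} share VLAN {vlan_a}")
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} and {name_b} have overlapping subnets")
    return problems


def service_address(plan, function, host_index):
    """Address for a function (not a machine): subnet base plus an index."""
    _, net = plan[function]
    address = net.network_address + host_index
    if address not in net:
        raise ValueError("host index falls outside the subnet")
    return address


if __name__ == "__main__":
    print(check_plan(PLAN) or "plan is consistent")
    print(service_address(PLAN, "user-access", 10))
```

Because the address belongs to the function, it can follow the service when an application component moves to a different machine.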

Availability and performance

Availability is defined as ‘the probability of a system being available for use at a given instant in time within the operational window’. It can also be seen as a function of both MTBF (mean time between failures, a measure of reliability) and MTTR (mean time to repair).

Performance issues are often the cause of transient system failures and disruption, so systems have to have sufficient capacity and performance to deal with the workload within acceptable periods of time under normal, failure and recovery conditions.
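The relationship between availability, MTBF and MTTR is commonly written as availability = MTBF / (MTBF + MTTR). This is the standard steady-state approximation rather than a formula quoted in the talk, and the function name below is chosen purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and repair time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same reliability, different repair times: faster repair raises availability.
    for mttr in (8.0, 4.0, 1.0):
        print(f"MTBF 2000 h, MTTR {mttr} h -> {availability(2000.0, mttr):.5f}")
```

The formula shows that there are two levers: making failures rarer and making repairs faster.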

Project managers need to test systems before they go into production and additionally test changes made within the system before implementing them live.

When testing systems, managers need to ask themselves a number of pertinent questions including:

  • What is the system required to do?
  • What are the consequences if the system fails?
  • What happens if the system is pushed beyond its limits?

If the system is close to the edge, project managers need to know how close to the edge the system is and what evidence to look for in order to ascertain this. They need to have some sort of ‘control’ version of the system in mind for comparison purposes and also have a way of measuring changes within the system.

Bandwidth and response time

It is bandwidth that determines throughput, but it’s not just speed that matters; it’s throughput in terms of ‘units of stuff per second’. Latency determines response time and how much ‘stuff’ is in transit through the system at any given instant. Ultimately, ‘stuff in transit’ is the data at risk if there is a failure in the system.

Variation of latency with respect to time (jitter) determines predictability of response. Understanding jitter is important for establishing timeout values; after all, latency fluctuations can cause system failure under peak loading.

According to queuing theory, latency will get worse and become more unpredictable as you approach the point of saturation. Therefore it’s important to know what else the system is sharing capacity with and what control (if any) can be had over it.
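The amount of ‘stuff in transit’ can be estimated by multiplying throughput by latency, a relationship known in queuing theory as Little’s law. The sketch below is an illustration only; the function name and the example figures are assumptions, not numbers from the talk.

```python
def units_in_transit(throughput_per_second: float, latency_seconds: float) -> float:
    """Little's law: average work in flight = arrival rate x time in system."""
    if throughput_per_second < 0 or latency_seconds < 0:
        raise ValueError("throughput and latency must not be negative")
    return throughput_per_second * latency_seconds


if __name__ == "__main__":
    # Hypothetical link: 2,000 transactions per second.
    for latency_ms in (5, 50, 500):
        in_flight = units_in_transit(2000.0, latency_ms / 1000.0)
        print(f"{latency_ms} ms latency -> about {in_flight:.0f} transactions at risk")
```

At a fixed throughput, a tenfold increase in latency means roughly ten times as much data is exposed if something fails mid-flight.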
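The way latency deteriorates near saturation can be seen in the simplest textbook queuing model, the single-server M/M/1 queue, where mean response time equals service time divided by (1 - utilisation). Real systems rarely match that model’s assumptions (random arrivals, one server), so the sketch below should be read as an indication of the shape of the curve, not as a prediction.

```python
def mm1_response_time(service_time: float, utilisation: float) -> float:
    """Mean response time of an M/M/1 queue at the given utilisation (0 <= u < 1)."""
    if service_time <= 0:
        raise ValueError("service time must be positive")
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be at least 0 and below 1")
    return service_time / (1.0 - utilisation)


if __name__ == "__main__":
    service_time_ms = 10.0  # hypothetical time to serve one request when idle
    for u in (0.50, 0.80, 0.90, 0.95, 0.99):
        print(f"{u:.0%} busy -> {mm1_response_time(service_time_ms, u):.0f} ms")
```

The steepness of the curve at high utilisation is why a system that appears to have spare capacity on average can still miss its response targets at peak.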

Scalability

Capacity can be increased by scaling up, otherwise known as vertical scaling, which refers to increasing capacity by adding more resources to a machine or buying a bigger machine.

By contrast, horizontal scaling refers to increasing capacity by adding more machines. Ultimately, scalability is dependent on how your workloads break down into parallel streams of execution and on what level of availability you need to achieve.

Project managers, therefore, need to understand how their system’s workload could break down into parallel streams of execution. Some will be capable of being split into many elements with little interaction while others may require very high levels of interconnectivity and interaction or high throughput single-stream processing.
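A common way to reason about how far horizontal scaling can help is Amdahl’s law, which limits the speed-up by the fraction of work that cannot run in parallel. The talk did not name this law; it is introduced here as an assumption, and the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Best-case speed-up when only part of the workload can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("at least one worker is required")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # A workload that is 90% parallel can never run more than 10x faster.
    for workers in (2, 8, 32, 128):
        print(f"{workers} machines -> {amdahl_speedup(0.90, workers):.2f}x")
```

Workloads with a large serial portion, such as high-throughput single-stream processing, therefore tend to favour scaling up rather than out.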

Designing for performance and availability

It is important to have the capacity to be able to cope with peaks in workload. This capacity can be increased by minimising so-called ‘wait states’ (through caches and parallelism, for example) and also contention for resources and data structures. Managers therefore need to understand the need for synchronisation and serialisation of access to data structures.

By considering such issues you might not make the system go faster, but you might manage to stop it going slower! Colin also suggested that by maximising ‘user mode’ and minimising other modes, system efficiency can be improved. He went on to say that project managers need to ask themselves:

  • Which parts of their systems are mission critical?
  • Which parts are safety-critical?
  • What state transitions occur during both failure and recovery?
  • What kind of failure do we prefer?

Being able to answer such questions will enable managers to better appreciate what they need to do to improve the design of their systems.

State transitions

Nothing happens instantaneously; there is always a ‘state of transition’, hence one must consider what possible states a pair of machines can be in and how long transitions last for. Managers must ensure that there are no timing windows or other flaws in their designs and consider what can be done while a state transition is in progress.
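A simple way to start thinking about the states of a pair of machines is to enumerate every combination and ask, for each one, whether service is still available. The four state names and the availability rule below are hypothetical simplifications made for the example; a real design would define its own states and also record how long each transition lasts.

```python
from enum import Enum
from itertools import product


class NodeState(Enum):
    UP = "up"
    FAILING = "failing"        # transition: on the way down
    DOWN = "down"
    RECOVERING = "recovering"  # transition: on the way back up


def pair_states():
    """Every combination of states for two machines (4 x 4 = 16)."""
    return list(product(NodeState, repeat=2))


def service_available(pair) -> bool:
    """Assumption for this sketch: service needs at least one fully UP node."""
    return NodeState.UP in pair


if __name__ == "__main__":
    for a, b in pair_states():
        status = "service" if service_available((a, b)) else "NO SERVICE"
        print(f"{a.value:>10} / {b.value:<10} {status}")
```

Listing the combinations explicitly makes it harder to overlook awkward cases, such as one machine failing while the other is still recovering.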

Multi-site issues

The effect of distance on network and storage protocols needs to be taken into consideration if your system has to operate across a number of different sites. Colin advised against booting across inter-site links and against automating decision making when a site fails.

He also suggested remote access for management and operations, and centralised (and duplicated) monitoring and alerting. Full environmental monitoring for lights-out sites was also recommended.

Risk

Risk is a fact of life and we have to deal with it as best we can. Risk of course can be minimised by good planning and sound technology.

Project management and technical design are both integral parts of systems engineering, which should be accompanied by techniques from other engineering disciplines to help managers analyse the situation and guide their thinking. Disaster-tolerant systems should therefore aim to minimise the risk of loss of service and loss of data as much as possible.

Most projects handle medium risk fairly well while other projects overspecify to cater for what are actually rather low risk issues. And some projects underspecify and therefore fail to cater for what are in fact high risk issues. Hence project managers need to ask themselves the following three questions when assessing the risk level:

  • What is the probability of a situation occurring?
  • What is the impact if that situation occurs?
  • What are the long-term consequences?
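The first two questions are often combined into a single ‘exposure’ figure, probability multiplied by impact, so that risks can be ranked. The sketch below is one hedged way of doing that; the scoring scale, the example risks and their numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # chance of occurring in the period, 0.0 to 1.0
    impact: int         # hypothetical scale: 1 (minor) to 5 (catastrophic)
    long_term: bool     # do the consequences outlast the incident itself?

    @property
    def exposure(self) -> float:
        return self.probability * self.impact


def rank(risks):
    """Highest exposure first; long-term consequences break ties."""
    return sorted(risks, key=lambda r: (r.exposure, r.long_term), reverse=True)


register = [
    Risk("single disk failure", 0.30, 1, False),
    Risk("inter-site link loss", 0.05, 4, False),
    Risk("silent data corruption", 0.01, 5, True),
]

if __name__ == "__main__":
    for risk in rank(register):
        flag = " (long-term consequences)" if risk.long_term else ""
        print(f"{risk.name}: exposure {risk.exposure:.2f}{flag}")
```

Note that a simple product can understate rare but severe events, which is one reason the third question about long-term consequences deserves separate attention.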

Once managers have identified the potential risks in their systems, they must ask themselves how they can look for single points of failure and modes of failure, and how they can identify specific scenarios of interest in their system. Finally, they must ask themselves whether they can test all the probable conditions and what could happen to the data.

Analysis techniques

There are many techniques that have evolved over the years and there are tools to help managers to apply them, including reliability block diagrams (RBD), fault tree analysis (FTA) and failure modes, effects and criticality analysis (FMECA). These and many other techniques can be applied with software tools available from a number of vendors.
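The arithmetic behind a reliability block diagram is straightforward for independent components: blocks in series must all work, so their availabilities multiply, while redundant blocks in parallel fail only if every one fails. The sketch below assumes independent failures, which real systems often violate, and the component figures are hypothetical.

```python
from math import prod


def series(availabilities):
    """All blocks are required: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one block is sufficient: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical site: a server (0.99) and its storage (0.995) in series.
    site = series([0.99, 0.995])
    # Two such sites, either of which can carry the service.
    two_sites = parallel([site, site])
    print(f"one site:  {site:.5f}")
    print(f"two sites: {two_sites:.5f}")
```

The independence assumption is the weak point: a shared power feed or a common software fault would make the two-site figure optimistic, which is where fault tree analysis and FMECA come in.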

Operational service

Transitioning from the old to the new can be fraught with difficulties and issues. However, ultimately it is about splitting the process into manageable steps and keeping things simple. It’s best to get as much as possible done in advance and allow for a back-up plan, such as being able to migrate back to the original system if need be.

Project managers need to minimise the risk of data loss and of loss of service. They also need to migrate user connectivity and both live and historic data, and to be in a position to convert data files if necessary.

It has to be said that mission-critical systems hardly ever fail, but we do need people responsible for their operation to have a good understanding and ‘feel’ for the way they work. They need to find out how the system behaves when it starts to fail, to know the warning signs of imminent failure.

And, most importantly, they need to know how to return the system to its normal operational state without data loss or data corruption. This can probably best be achieved through the use of a representative offline environment that can mirror the main system but which is only used for testing and trials.

Colin went on to say that sometimes the small stuff gets overlooked, but there should always be clear identification of equipment including colour coding and labelling, and the physical layout of equipment, with its connections between racks and between subsystems, should be carefully planned and scrutinised before the system is created.

Testing

During testing it needs to be demonstrated that service will continue with minimal disruption during failure and recovery. Hence project teams need to regularly rehearse and test their procedures and plans to ensure that they stay current.

Scalability, as well as functionality, needs to be tested and every aspect of the system and surrounding infrastructure needs to be examined under normal, failure and recovery conditions. The team have also got to decide what their acceptance criteria are.

Other factors to consider with regard to measuring system behaviour are: how to instrument the system, time synchronisation, generating a typical workload and representative data sets.

Project teams also need to consider where they will do the testing: in the production environment, a pre-production test environment or a pre-delivery test environment?

Continual monitoring and event logging are essential, as is knowledge of the whole system; otherwise finding problems and understanding them well enough to recreate them is very difficult.

At this point Colin offered a good case study: the NHS Blood and Transplant project called pulse renewal, which has production, archive and test environments, all with a shared common infrastructure and separated from the rest of the existing infrastructure.

The NHS system is split into a common infrastructure (SAN fabrics, private network interconnects etc.), a production environment (a split-site cluster with host-based volume shadowed storage), a test environment (a split-site cluster on a smaller scale) and an archive environment (a single node at site A). There are also duplicated monitoring and reporting facilities and external connectivity for users.

Ensuring successful delivery

So how can we ensure successful delivery? According to Colin, we can’t, but we can maximise our chances of success.

Unfortunately, you can’t buy high availability ‘off the shelf’. Instead managers have to establish their own ‘ideal design’ and then adapt it to suit the constraints that have been placed upon them.

Mission-critical systems are the same as any other system; project managers just need to demonstrate an appropriate level of care and attention to detail when managing them.

To start with they need to design, plan and test the transition to the new systems, design for maintainability and minimal risk of error and also to design for change on-the-fly with no loss of service. It’s also good to build a proof of concept model early on and learn from it.

When it comes to procurement, managers need to do their research to understand what is technically feasible and what is necessary, and to establish the scope of the project. Project leaders need to have clear objectives and acceptance criteria and be absolutely clear as to where the duty of care lies.

Managers must also find ways to protect their data and ensure correctness and consistency within their system(s). A good level of monitoring and quality assurance can help with this. It’s essential to test and check everything on a regular basis.

Finally, always use small groups of excellent people who work well together and who can communicate effectively. In the end good engineering, leadership, project management and collaborative working will help you deliver.

References

1. Virtual LUN: Virtual LUN technology enables tiered storage strategies by allowing manual ‘re-tiering’ of data as its value changes.