Seven steps to making the web perform

With a slogan like 'Try Something New Today', many of Sainsbury's internet customers were probably doing exactly that - shopping with Tesco or Waitrose - after its online service shut up shop recently. Matthew Goulden, Triometric, gives good advice on how to avoid this happening to you.

With the Sainsbury's website going down on the 17th of June, it is believed that the grocery store has lost in the region of £1m in sales, possibly more.

Individuals and businesses have come to rely on the internet for so many things - not just grocery shopping - that when it goes wrong, the consequences can be frustrating for the user and financially dire for the supplier.

For businesses running high value enterprise web applications, there are three major challenges: security issues, problems with reliability and availability, and poor response times.

Web security issues are relatively well-understood, and many solutions (e.g. firewalls, intrusion detectors, virus scanners) are on the market. High availability can be achieved, with effort, through the use of redundant servers and communications facilities, careful ISP selection and continuous monitoring.

Good performance, however, is a much more complex issue. Users - in this instance, frustrated Sainsbury's customers - see good performance in terms of quick, reliable response time. When users know a site is always available and will respond readily, they enjoy using it and keep coming back. When there are problems, however, a competitor's site is only a mouse click away.

Achieving good performance is complex because of the many factors that affect it. The response time of a typical web transaction is influenced by 10, 20 or more routers and communications links, as well as by firewalls, web servers and the user's browser.

A problem in any of these components can drive response time to unacceptable levels. Furthermore, the tools for measuring response time in a reliable, universal and non-intrusive way are not yet well-understood by all website managers.

Enterprise website requirements

In spite of its problems, the combination of universal access and low cost has converted the net into one of the most important marketing and customer support tools available. Internet technology is also increasing the productivity of companies and their suppliers through the use of intranets and extranets.

For all these applications, excellent performance by the website is crucial.

In particular, the following are key website requirements:

  1. Near-100 per cent availability - the site should be available to customers as close to 100 per cent of the time as possible.
  2. Good response times - the site should consistently achieve customer response times within defined targets.
  3. Customer confidence - the customer must believe that the site will be available, with an acceptable response time, whenever they need it.

It's not easy to meet these requirements. To understand why, it's necessary to look at the challenges of using the internet.

The environment of the World Wide Web

The internet is in many ways a hostile environment. A typical packet traveling from a company's web server to a customer's browser will encounter some or all of the following.

  • Router congestion. The high traffic levels in the internet can force routers to hold packets for relatively long periods of time until bandwidth is available to transmit them. This increases the latency in web transactions, slowing response time. Routers can run out of memory for holding packets, in which case they must discard them.

    This packet loss, while a normal part of the internet protocols, is especially bad for response time, since the client and server systems must wait for timers to expire before retransmission. And probably worst of all, latency and packet loss can vary considerably over time, which means that users cannot rely on the level of performance a website will provide.

  • Long distances and multiple hops - a typical packet from a server in London traveling to a client in Hong Kong must cross the Atlantic Ocean, the width of North America, and the Pacific Ocean to reach its destination. It may be handled by 20 or more routers. Even without congestion, the time to travel these distances (the propagation delay) significantly increases response time.

  • Many different ISPs - the packets in a typical web transaction will be passed through many different ISPs' networks. This makes it very difficult for website managers to control quality of service.

  • Low bandwidth connections at the client end - the web page that screams across the 100 megabit Ethernet at corporate headquarters doesn't download quite so fast on Joe Customer's dial-up line. Web designers are forced to calculate a trade-off between rich page design and minimising download time.

  • Protocol issues - there are a number of features of the web protocols that make it difficult to achieve good response time, including:
  1. High overhead - the Transmission Control Protocol (TCP) used by the Web was designed years ago for applications such as file transfer, e-mail and remote login. TCP typically requires that seven packets flow simply to start and end a connection. In traditional applications, this overhead was spread over a relatively long period of time. However, the Web uses many short transactions and this overhead becomes a serious performance issue.

  2. Persistent connection problems - the web's Hypertext Transfer Protocol (HTTP) includes a feature known as persistent connections, which is intended to reduce the impact of TCP's high overhead. The idea is to reuse TCP connections to avoid the setup and takedown cost. Unfortunately, many web servers and firewalls are not set up to correctly use this feature.

  3. TCP is 'polite' - a key feature of the TCP protocol is that it watches for signs of network congestion, such as high latency and packet loss. When it detects these, mechanisms such as exponential backoff and slow start cause TCP to run more slowly. While this protects the network, it often slows web transactions beyond what is necessary.
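
To get a feel for why connection overhead matters for short transactions, the rough Python sketch below compares fetching a page's objects over a new TCP connection each time with fetching them over one persistent connection. It is only an illustration under simplifying assumptions that are not from the article: connection setup costs about one round trip, each request and response costs about one round trip, objects are fetched one after another, and transfer time, slow start and parallel connections are ignored. The function name and the 0.3 second round trip are hypothetical.

```python
def fetch_time_seconds(num_objects: int, rtt_seconds: float, persistent: bool) -> float:
    """Rough time to fetch objects one after another over TCP.

    Assumptions (illustrative only): connection setup costs about one
    round trip, and each request/response costs about one round trip.
    Transfer time, slow start and parallel connections are ignored.
    """
    handshake_round_trips = 1
    request_round_trips = 1
    # A persistent connection pays the setup cost once; otherwise every object pays it.
    connections = 1 if persistent else num_objects
    round_trips = connections * handshake_round_trips + num_objects * request_round_trips
    return round_trips * rtt_seconds


if __name__ == "__main__":
    rtt = 0.3  # hypothetical long-haul round trip time, in seconds
    for persistent in (False, True):
        label = "persistent" if persistent else "new connection per object"
        print(f"{label}: {fetch_time_seconds(10, rtt, persistent):.1f} s")
```

Under these assumptions, reusing the connection removes one round trip per object after the first, which is why the saving grows with both the number of objects and the distance to the client.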

All these problems contribute to poor and unreliable response time. While solving them is a complex undertaking, it is achievable, as you will experience with websites on the internet that do perform well.

Website seventh heaven

Generally speaking, there are seven steps to website heaven which will ensure that your website is among the fastest on the internet.

Step one: Set objectives

As with any project, the first step is to set objectives. Without defined targets, you won't have any way of assessing performance other than vague reports of customer dissatisfaction, or the web design team's assurances that you have the greatest site since Yahoo.

An example of a response time target is 'Pages will download in less than 10 seconds 95 per cent of the time, and in less than 25 seconds 99 per cent of the time.' You may find it useful to:

  • establish targets based on geography - it's not realistic to expect users halfway round the globe to get the same download time as users in the same city as the server;
  • set exception targets for large, complex pages - it's unrealistic to expect a page crammed with frames, graphics and Java applets to download as fast as a page of text;
  • set exception targets for pages that require backend processing, for example, when an external database must be queried or updated.
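
A target written this way can be checked mechanically against a set of measured download times. The following Python sketch shows one possible way to do it, using the example figures quoted above; the names `TARGETS` and `check_targets` are hypothetical, and in practice you would probably keep separate target lists per geography or page type.

```python
# (required per cent of downloads, limit in seconds), from the example target above
TARGETS = [(95, 10.0), (99, 25.0)]


def check_targets(download_times, targets=TARGETS):
    """Return {required_percent: True/False} for a list of download times in seconds."""
    if not download_times:
        raise ValueError("no measurements to check")
    results = {}
    for required_percent, limit_seconds in targets:
        within = sum(1 for t in download_times if t < limit_seconds)
        # Integer comparison avoids floating point rounding at the boundary.
        results[required_percent] = within * 100 >= required_percent * len(download_times)
    return results


if __name__ == "__main__":
    # Made-up measurements: 94 fast downloads, 5 slow ones and 1 very slow one.
    samples = [1.0] * 94 + [12.0] * 5 + [30.0]
    print(check_targets(samples))
```

With the made-up samples shown, only 94 per cent of downloads finish in under 10 seconds, so the first target is missed while the second is met.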

Step two: Measure current situation

Once you have established performance targets, the next step is to measure the current response time situation against the targets, and identify the areas where the targets are not met. As mentioned above, you will probably need to break the analysis down by geography and by specific pages. At this point it may be necessary to adjust the targets. For example, you may have under- or over-estimated the response times your site is capable of achieving.

There are three basic methodologies for measuring your current performance against your targets:

  • Client-side monitoring. Here, special measurement software is installed into the client system. The software watches transactions, measures response times and periodically reports to a central management system. The main advantage of this approach is its accuracy. The main drawback is that it can be difficult or impossible to install the monitoring software in all client systems.

  • Synthetic transactions. Monitoring systems are deployed at key points in the network where they run transactions that simulate real clients. The monitoring systems measure the response from web servers and then report their results to a central system. This approach is useful for comparing performance with competitors' websites and also for before and after comparisons, for example, when upgrading a website. The main problems with the approach are that it measures response time from only a relatively small number of locations, and it can impose an unacceptable load on servers.

  • Server-side monitoring. A monitoring system is installed at the website where it watches all packets passing to and from the web servers. By performing detailed analysis of the HTTP, TCP and IP protocols, the monitor can estimate with high accuracy the end-to-end response time, as experienced by users. While this approach is slightly less accurate than client-side monitoring, it provides universal measurement of all users' transactions, is completely non-intrusive, and is very cost-effective to deploy since it does not require any special software in the client or server. The Triometric Web Analyzer uses this approach.
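
As an illustration of the synthetic transaction approach, the Python sketch below times a single page fetch using only the standard library. It is a minimal sketch rather than a monitoring product: the function name `timed_fetch` is hypothetical, it fetches only the URL given (not the images or other objects a browser would also request), and it measures from wherever the script happens to run.

```python
import time
import urllib.request


def timed_fetch(url: str, timeout_seconds: float = 30.0):
    """Fetch one URL and return (elapsed_seconds, body_size_in_bytes)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        body = response.read()  # read the whole body so transfer time is included
    elapsed = time.perf_counter() - start
    return elapsed, len(body)


if __name__ == "__main__":
    # A data: URL keeps the demonstration off the network; substitute a real page URL.
    elapsed, size = timed_fetch("data:text/plain,hello")
    print(f"{size} bytes in {elapsed:.4f} s")
```

If a script like this is run on a schedule, the interval should be chosen with care, since, as noted above, synthetic transactions can load the servers they measure.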

Figure: Triometric approach


Step three: Identify deficiencies

Once you've collected your measurement information, the next step is to identify the shortfalls between targets and actual performance. The type and distribution of target shortfalls can be useful in diagnosing the cause of the problem - for instance, a widespread problem will generally have a different cause than one that is limited to a single customer or set of customers.
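
One simple way to see the distribution of shortfalls is to group the measurements (by customer, ISP or region) and check each group against the target separately. The Python sketch below assumes the measurements are available as (group, seconds) pairs; that data layout, the function name and the group labels are all hypothetical.

```python
from collections import defaultdict


def groups_missing_target(measurements, limit_seconds, required_percent):
    """Return the sorted names of groups that miss the response time target.

    measurements: iterable of (group_name, response_time_seconds) pairs.
    A group meets the target when at least required_percent of its
    response times are below limit_seconds.
    """
    by_group = defaultdict(list)
    for group, seconds in measurements:
        by_group[group].append(seconds)

    missing = []
    for group, times in by_group.items():
        within = sum(1 for t in times if t < limit_seconds)
        if within * 100 < required_percent * len(times):
            missing.append(group)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up data: one group performs well, the other is slow half the time.
    data = [("isp-a", 2.0)] * 20 + [("isp-b", 2.0)] * 10 + [("isp-b", 15.0)] * 10
    print(groups_missing_target(data, limit_seconds=10.0, required_percent=95))
```

If nearly every group appears in the result, the cause is more likely to lie with the site itself; if only one or two do, the customer or ISP categories described below become more plausible.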

Performance problems can be categorised as follows:

Website/page design issues - performance shortfalls attributable to defects in website infrastructure or page design tend to have a widespread distribution, affecting all clients to some degree, regardless of ISP or geographic area. However, these types of problems tend to generate more complaints from users on dial links or who are geographically remote, while local users on high-speed connections usually will not perceive the problems as being as serious.

Page design is one of the most important factors in response time. It is discussed in more detail in the next section.

Problems with website infrastructure include lack of capacity in communications links, servers and firewalls. More serious are protocol problems, such as firewalls or web servers that do not support persistent connections. These are discussed in more detail below.

Diagnosing capacity problems requires tools that show the load in routers, firewalls and communications links. The most likely bottlenecks are WAN communications links, since their expense tends to mean websites are rarely oversupplied with bandwidth. Diagnosing protocol problems requires packet capture / protocol analysis tools to understand exactly how clients, servers and firewalls are interacting.

Customer issues - performance shortfalls attributable to a specific customer's access problems are characterised by good overall performance outside this customer. This type of problem can be pinpointed through response time measurements. The same types of infrastructure problems as discussed above for the website can also occur at the customer's site, including insufficient capacity in WAN links, routers and firewalls, and protocol problems.

ISP/internet issues - this is the most difficult type of problem to diagnose and resolve. Unfortunately, it is not uncommon, since current levels of ISP service are uneven and sometimes quite poor. The diagnosis is driven by analysing response time measurements. The particular ISP involved can be pinpointed by continuous response time measurements supplemented by tools such as traceroute to identify the location of the bottleneck. Additional evidence can be gathered by using continuous ICMP (Internet Control Message Protocol) ECHOs to points 'before' and 'after' the suspected bottleneck ISP.
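
The 'before' and 'after' comparison can be reduced to two numbers per measurement point: the loss rate and a typical round trip time. The Python sketch below assumes the echo results have already been collected as lists of round trip times in milliseconds, with `None` marking a lost reply; collecting them is left out because ping tools differ between platforms. The function names and sample figures are hypothetical.

```python
from statistics import median


def summarise_echoes(rtts_ms):
    """Summarise echo results; None in the list marks a lost reply."""
    if not rtts_ms:
        raise ValueError("no echo results")
    replies = [r for r in rtts_ms if r is not None]
    return {
        "loss": (len(rtts_ms) - len(replies)) / len(rtts_ms),
        "median_ms": median(replies) if replies else None,
    }


def compare_segment(before_ms, after_ms):
    """Estimate the latency and loss added between two measurement points."""
    before = summarise_echoes(before_ms)
    after = summarise_echoes(after_ms)
    added_latency = None
    if before["median_ms"] is not None and after["median_ms"] is not None:
        added_latency = after["median_ms"] - before["median_ms"]
    return {
        "added_latency_ms": added_latency,
        "added_loss": after["loss"] - before["loss"],
    }


if __name__ == "__main__":
    # Made-up samples for a point before and a point after the suspected ISP.
    print(compare_segment([20, 22, 21, 19], [80, None, 85, 90]))
```

A large jump in latency or loss between the two points is evidence, though not proof, that the segment in between is the bottleneck; repeating the comparison at different times of day helps confirm it.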

Figure: Website performance factors

Once you've identified your specific performance problem, you can begin to design and implement an appropriate solution.

Step four: Solve web site and page problems

There are a number of ways that websites and web pages can be optimised for performance. The foundation step is to look at your infrastructure and ensure that adequate capacity exists in servers, firewalls and ISP links.

As mentioned above, capacity problems occur most often in WAN links. A rule of thumb used by many network designers is that normal use of a link should not exceed 50 per cent utilisation. This is a very rough guideline, and in general the faster the link, the higher the acceptable utilisation. The delay caused by any link grows ever more steeply as utilisation increases, so you should certainly avoid utilisations above 90 per cent and will usually be quite safe if you stay below 50 per cent.
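
The shape of this relationship can be illustrated with a textbook single-queue model. The Python sketch below assumes the simplest such model (often called M/M/1), in which average delay is proportional to 1 / (1 - utilisation). That model is an assumption brought in for illustration, not something the article specifies, and real links with bursty traffic can behave worse.

```python
def delay_multiplier(utilisation: float) -> float:
    """Average delay relative to an idle link, under a simple single-queue model.

    Assumption (illustrative only): delay scales as 1 / (1 - utilisation),
    where utilisation is a fraction between 0 (idle) and 1 (saturated).
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be at least 0 and below 1")
    return 1 / (1 - utilisation)


if __name__ == "__main__":
    for percent in (10, 50, 75, 90, 95):
        print(f"{percent:>2} per cent utilised: {delay_multiplier(percent / 100):.1f}x delay")
```

Under this model, delay merely doubles at 50 per cent utilisation but is roughly ten times the idle figure at 90 per cent, which is consistent with the rule of thumb above.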

Capacity problems in web servers, firewalls and routers are rare, since their price performance is good enough that they usually have far greater capacity than is required. Still, shortages of RAM in routers and web servers can be a serious issue. Another problem can be complex database processing in backend servers.

After addressing capacity issues, the next step is to solve any protocol problems afflicting your website. In most cases, it's necessary to use protocol analysers to detect these problems. Problems to look for include:

  • Small packets. In most cases, larger packets (e.g. 1500 bytes) give better performance than smaller ones, because of reduced overhead.
  • Small TCP receive windows in web servers and firewalls. When these windows are too small, bandwidth is not effectively utilised.
  • Servers and firewalls not supporting persistent connections. This problem dramatically increases TCP overhead.
  • Servers and firewalls that don't support HTTP version 1.1. In general, version 1.1 is faster than version 1.0.
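
The effect of a small receive window can be estimated with simple arithmetic: a sender can have at most one window of data in flight per round trip, so throughput on a connection cannot exceed the window size divided by the round trip time. The Python sketch below works through that calculation; the function names and the example figures are hypothetical.

```python
def max_throughput_bits_per_second(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection's throughput for a given receive window."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bits_per_second: float, rtt_seconds: float) -> float:
    """Receive window needed to keep a link full (the bandwidth-delay product)."""
    return link_bits_per_second * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.2  # hypothetical round trip time in seconds
    ceiling = max_throughput_bits_per_second(8192, rtt)
    print(f"8 Kbyte window: at most {ceiling / 1000:.0f} kbit/s per connection")
    needed = window_needed_bytes(2_000_000, rtt)
    print(f"2 Mbit/s link: needs a window of about {needed / 1024:.0f} Kbytes")
```

The longer the round trip, the larger the window has to be, so a window that is adequate for local users can still throttle distant ones.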

If your website infrastructure has been optimised for performance, but you still find that a significant proportion of your clients are experiencing poor response times, there may be problems with your web page design. It is important to keep pages 'small', minimising use of graphic images, frames and Java applets.

It is also very important to keep the overall number of objects (e.g. graphic images, frames, applets) in the page small, to reduce protocol overhead. Since each individual image requires a separate transaction, it is actually more efficient to combine several small images into a single large image. In other words, one large image totalling 50 Kbytes will download faster than five small images of 10 Kbytes each. Think not just 'smaller is better' but 'fewer is better'.
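
The one-image-versus-five comparison can be put into rough numbers. The Python sketch below uses a deliberately crude model that is an assumption for illustration only: objects are fetched one after another, each request carries a fixed overhead, and the remaining time is simply bytes divided by bandwidth. The 56 kbit/s line speed and 0.5 second overhead are hypothetical figures.

```python
def page_time_seconds(num_objects: int, total_bytes: int,
                      bandwidth_bits_per_second: float,
                      per_request_overhead_seconds: float) -> float:
    """Crude download time: fixed overhead per object plus raw transfer time.

    Assumes objects are fetched sequentially and overhead is the same for each.
    """
    transfer = total_bytes * 8 / bandwidth_bits_per_second
    return num_objects * per_request_overhead_seconds + transfer


if __name__ == "__main__":
    total = 50 * 1024  # 50 Kbytes in both cases
    one_image = page_time_seconds(1, total, 56_000, 0.5)
    five_images = page_time_seconds(5, total, 56_000, 0.5)
    print(f"one 50 Kbyte image:   {one_image:.1f} s")
    print(f"five 10 Kbyte images: {five_images:.1f} s")
```

The transfer time is identical in both cases; the whole difference comes from the extra per-request overhead, which is the point of 'fewer is better'.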

The order of preference for good performance is as follows, from best to worst:

  • No images (not always a good aesthetic choice, of course);
  • Few small images;
  • Few large images;
  • Many small images;
  • Many large images.

Finally, consider using gzip or deflate compression. This technique is supported by most current browsers. One possible solution for browsers that do not support these compression methods is to maintain two sets of pages on the server, one compressed and one uncompressed. The server then delivers the appropriate set for the client's browser.
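
The two-sets approach can be sketched in a few lines of Python. Browsers that support compression say so in the `Accept-Encoding` request header, so the server can choose a variant per request. The function names below are hypothetical, and the header parsing is simplified: it ignores quality values such as `gzip;q=0`, which a production server would need to honour.

```python
import gzip


def build_variants(body: bytes) -> dict:
    """Prepare the uncompressed and gzip-compressed copies of a page once."""
    return {"identity": body, "gzip": gzip.compress(body)}


def select_variant(variants: dict, accept_encoding: str):
    """Return (payload, content_encoding) for a request's Accept-Encoding header.

    content_encoding is None when the uncompressed copy is sent.
    Simplification: quality values (e.g. 'gzip;q=0') are ignored.
    """
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    if "gzip" in offered:
        return variants["gzip"], "gzip"
    return variants["identity"], None


if __name__ == "__main__":
    page = b"<p>Try something new today</p>" * 200
    variants = build_variants(page)
    for header in ("gzip, deflate", ""):
        payload, encoding = select_variant(variants, header)
        print(f"Accept-Encoding={header!r}: {len(payload)} bytes, encoding={encoding}")
```

Repetitive markup such as the example page compresses very well; already-compressed content such as JPEG or GIF images gains little and is usually left alone.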

Step five: Manage ISPs

No matter how optimal your server configuration and site design, you can be let down by your ISP. If your analysis in step three led to a diagnosis of ISP problems, your action now depends on which ISP was at fault: your ISP, your client's ISP, or a transit ISP somewhere in the path. Again, the use of traceroute and ICMP ECHOs will help you pinpoint the culprit.

Figure: ISPs in client-server path

If the problem lies with your ISP, you have the advantage of having a direct relationship. You can negotiate for a better service agreement or consider changing providers.

If the problem lies with the client's ISP, you can pursue it through the customer; again, the customer can negotiate for better service, or may consider changing to a better-performing provider. From your response time measurements, you may be aware of a provider who could provide better service to the customer, and can suggest a change.

If the problem lies with a transit ISP somewhere else in the path, you may well have more difficulty exerting pressure.

If the transit ISP has a direct peering relationship with your ISP, you can ask your ISP to pursue the issue, and likewise if the transit ISP peers with your client's access provider. Otherwise, you should contact your ISP and discuss the problem with a view to having your ISP exert pressure on the transit ISP or change their peering arrangements.

Step six: Assist customers with internet use

Assume that your site design has been tuned to perfection, and there are no outstanding issues with service providers. Some of your customers may have specific problems that are preventing them from getting the best out of your site.

You may want to recommend browser software for optimum performance and usability. All browsers are not the same, and your page may perform better with a browser that offers particular capabilities.

Ensure that your customers understand the minimum hardware specifications required for best use of your site, e.g. modem speed, PC processor speed and memory requirements.

If a customer is especially valued, consider a more pro-active approach. Many customer problems can be diagnosed by protocol analysis on the server side. This is especially true of protocol issues, e.g. persistent connection problems.

Step seven: Continuous monitoring

Now you've solved all your own problems, and have even cracked ISP and customer-side problems. Your site is performing brilliantly. It's important at this point not to let down your guard. Web applications, protocols such as TCP and HTTP, and the internet interact in complex ways, and it's easy for problems to appear or reappear.

Most ISPs are pushing too much traffic over their expensive bandwidth, and few are, at this time, seriously managing end-to-end performance. You need to continue monitoring performance and availability against targets on a long-term basis. And let your ISP know you're watching.

One very important aspect of continuous monitoring is to measure response times for all transactions to your website and compare them to service level targets. From this foundation measurement, you can decide when there are problems you need to address.
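
A continuous check of this kind can be as simple as keeping a rolling window of recent response times and testing it against the target after each new measurement. The Python sketch below shows one possible shape for that; the class name, the window size and the figures used are hypothetical, and a real deployment would feed it from whatever measurement method was chosen in step two.

```python
from collections import deque


class ResponseTimeMonitor:
    """Track recent response times and report whether a target is being met."""

    def __init__(self, limit_seconds: float, required_percent: int, window_size: int = 1000):
        self.limit_seconds = limit_seconds
        self.required_percent = required_percent
        # deque with maxlen drops the oldest measurement automatically.
        self.samples = deque(maxlen=window_size)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def target_met(self) -> bool:
        if not self.samples:
            return True  # nothing measured yet, so nothing to report
        within = sum(1 for s in self.samples if s < self.limit_seconds)
        return within * 100 >= self.required_percent * len(self.samples)


if __name__ == "__main__":
    monitor = ResponseTimeMonitor(limit_seconds=10.0, required_percent=95, window_size=20)
    for seconds in [1.0] * 19 + [30.0, 30.0]:
        monitor.record(seconds)
        if not monitor.target_met():
            print(f"target missed after a {seconds:.0f} s response")
```

Because the window only holds recent measurements, the monitor reports the target as met again once performance recovers, which makes it suitable for driving alerts.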

Try something new today

The internet has enormous potential to provide flexible, universal access at low cost. However, to be truly useful for business applications it needs the high levels of availability and performance currently provided by well-managed private networks.

The key to successful enterprise use of the internet is to establish targets, monitor continuously, and take action when targets are missed. The web will not improve of its own accord; it requires active management.

Whatever the issue with Sainsbury's website, one thing is sure. It went down; they probably lost a lot of potential income and just may have put a few customers off choosing to buy from them again.

July 2008