Sinnathamby Vijayakumar CEng MBCS suggests a way of improving software cost estimation by using historical data.

Many of the models used to predict software development costs and timescales have produced significant underestimates. So-called calibrated models have been found to perform little better than uncalibrated ones.

Experience has shown that, if costing accuracy is to be improved and developed further, models must be validated against historical data on an ongoing basis. This data should correspond to the attributes of the new project, i.e. comparing like with like.

Models for cost estimation come in a variety of forms and can fill gaps in cost prediction that project managers cannot fill themselves. There needs to be an unbiased method for the technical appraisal of a cost-estimating model: a model that is not cross-checked against actual data cannot be relied on to predict costs accurately.

What is needed is a structured software costing database to gather, maintain and validate information, enabling the development of a “living” cost model that estimates software costs with greater accuracy.

Data collection

To develop a parametric model based on analogy, it is necessary to acquire historical cost data, schedule data and technical data on a range of similar projects.

I suggested a design for such a database in 1997. The data collected will allow the estimator to assess the risks inherent in any contractor proposal for a software project. Once data is available, statistical analysis can determine which factors are most highly correlated with cost, and the extent to which the various factors are independent. The most promising independent factors can then be used to derive cost estimating relationships (CERs), which should be scrutinised to ensure that they are plausible.
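As a rough sketch of what such an analysis might look like (the project figures below are invented, and the familiar effort = a × size^b form is assumed purely for illustration, not prescribed here), a candidate driver can be checked for correlation with actual effort and a simple CER fitted to the historical data:

```python
# Rough sketch only: the figures are invented and the CER form is assumed.
import numpy as np

size_kloc = np.array([12.0, 45.0, 8.0, 60.0, 23.0, 150.0])    # delivered size
effort_pm = np.array([30.0, 140.0, 18.0, 210.0, 70.0, 520.0])  # actual effort

# How strongly is this candidate driver correlated with actual effort?
r = np.corrcoef(size_kloc, effort_pm)[0, 1]

# Fit a simple log-linear CER, effort = a * size^b, by least squares in log space.
b, log_a = np.polyfit(np.log(size_kloc), np.log(effort_pm), 1)
a = np.exp(log_a)

print(f"correlation with effort: r = {r:.2f}")
print(f"candidate CER: effort = {a:.2f} * size_kloc ** {b:.2f}")
```

A high correlation alone is not enough; the fitted relationship still has to be judged plausible by someone who understands the projects behind the data.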

The CERs should include the source of the data involved, and the size and range of the database. Additional information that should be recorded includes: how the parameters were derived; what the model's limitations are; the timeframe of the data, so that the database remains valid; and how well the parametric model estimates its own database.
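One lightweight way of keeping that supporting information attached to each CER is sketched below; the structure and field names are illustrative assumptions rather than a prescribed format:

```python
# Illustrative sketch: one way to keep supporting information with each CER.
from dataclasses import dataclass

@dataclass
class CERRecord:
    formula: str                # e.g. "effort = 2.9 * size_kloc ** 1.05"
    data_source: str            # where the underlying project data came from
    sample_size: int            # number of projects behind the CER
    driver_range: tuple         # valid input range, e.g. (8.0, 150.0) KLOC
    derivation: str             # how the parameters were derived
    limitations: str            # environments or project types not covered
    data_timeframe: str         # period the data covers, so staleness can be judged
    self_estimate_error: float  # how well the CER estimates its own database
```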

Model structure

My proposed model for the prediction of software costs compares the input data with data on similar projects, subassemblies and components stored within the database, using analogy to extract the closest matches.
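A minimal sketch of such analogy-based retrieval is shown below; the attributes, scaling and figures are invented for illustration, and a real model would use drivers validated against the database:

```python
# Minimal sketch of analogy-based retrieval; data and attributes are invented.
import math

historical = {
    "billing system":   {"size_kloc": 45, "team": 8,  "reuse_pct": 20, "effort_pm": 140},
    "sensor interface": {"size_kloc": 12, "team": 4,  "reuse_pct": 50, "effort_pm": 30},
    "logistics portal": {"size_kloc": 60, "team": 10, "reuse_pct": 10, "effort_pm": 210},
}

def distance(new, past, keys=("size_kloc", "team", "reuse_pct")):
    """Crude relative Euclidean distance over the chosen attributes."""
    return math.sqrt(sum(((new[k] - past[k]) / max(new[k], 1)) ** 2 for k in keys))

def closest_matches(new_project, n=2):
    """Return the n historical projects most similar to the new one."""
    ranked = sorted(historical.items(), key=lambda item: distance(new_project, item[1]))
    return ranked[:n]

new = {"size_kloc": 50, "team": 9, "reuse_pct": 15}
for name, record in closest_matches(new):
    print(f"{name}: actual effort {record['effort_pm']} person-months")
```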

Provision should also be made for the users to select their own case studies and likely drivers. This will enable them to refine the solution if they have information available to them but not known to the expert system, or should the CERs appear invalid or inappropriate. It will also allow sensitivity analysis to be carried out. This requires the model to be re-run a number of times, varying key elements, to assess how sensitive the costs and timescales are to various assumptions.
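A sensitivity run need not be elaborate. The sketch below simply re-evaluates the illustrative CER from earlier (with assumed coefficients) while one key input is varied:

```python
# Sketch of a simple sensitivity run, re-using the illustrative CER form
# effort = a * size^b with assumed coefficients; all figures are invented.
def estimate_effort(size_kloc, a=2.9, b=1.05):
    return a * size_kloc ** b

baseline_size = 50.0
baseline = estimate_effort(baseline_size)

# Vary the size assumption by +/-20% and see how the estimate moves.
for factor in (0.8, 0.9, 1.0, 1.1, 1.2):
    estimate = estimate_effort(baseline_size * factor)
    change = (estimate / baseline - 1) * 100
    print(f"size x {factor:.1f}: {estimate:6.1f} person-months ({change:+.0f}%)")
```

The same pattern applies to any key driver: vary the assumption, re-run the model, and record how far the cost and timescale estimates move.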

For the model to remain accurate and up-to-date there is a continuing need to collect, store and analyse cost, time and reliability data from past and current projects to maintain an effective database. This is an essential foundation for reliably estimating future costs.

All types of model need to be validated. Validation is therefore the key driver in generating confidence in the model's estimates, and it determines:

  • Whether the model performs as intended, which basically consists of debugging the computer program;
  • Whether the model is an accurate representation of that aspect of the real world under study.

Before claiming that a particular software cost estimating model is a useful tool for estimating a new program, it is essential to check its consistency with historical data collected from recently developed software programs. Successful validation establishes a basis for confidence that the model will predict sufficiently accurately under new conditions.

Simple comparisons of model outputs with observed actual results, together with some detailed statistical analysis (to compare, for example, distributions), can then be used to show how well the model performs. Continuous collection of data on projects relevant to an organisation should eventually enable it to establish its own cost estimation model, one that is more accurate than the general-purpose models available off the shelf.
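As an illustration of such a comparison, the sketch below applies two commonly used accuracy measures, MMRE and PRED(25), to a handful of invented figures:

```python
# Sketch of comparing model output with observed actuals using two common
# accuracy measures, MMRE and PRED(25); all figures are invented.
predicted = [110.0, 40.0, 260.0, 75.0, 18.0]   # model estimates (person-months)
actual    = [140.0, 30.0, 210.0, 70.0, 22.0]   # recorded actuals

rel_errors = [abs(p - a) / a for p, a in zip(predicted, actual)]

mmre = sum(rel_errors) / len(rel_errors)                        # mean magnitude of relative error
pred_25 = sum(e <= 0.25 for e in rel_errors) / len(rel_errors)  # proportion within 25% of actual

print(f"MMRE = {mmre:.2f}, PRED(25) = {pred_25:.0%}")
```

A low MMRE and a high PRED(25) against recently completed projects give some grounds for confidence; poor scores are a signal to revisit the CERs or the relevance of the data.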

Historical data is the only reliable form of information, provided proper care is taken to evaluate its relevance. The difficulty lies in recognising the similarities and differences between the current project and the historical data, which include technology changes and personnel differences. Without proper judgement, historical data can be interpreted incorrectly and out of context, and errors will occur.

A review of published work on commercial software cost models shows that they often fail to predict costs and timescales accurately. Models may only apply to specific environments, and a contributing factor in their failure is that they are used outside their baseline environment.

When a commercial or an internally developed model is used for cost and schedule predictions, the outputs can be checked and validated using the common database. Differences in the estimates can be identified and appropriate CERs can be applied, where necessary, to improve estimates and identify areas of concern.

Conclusions

Any large organisation can and should derive accurate cost estimates relevant to its own environment, in a way that addresses the shortcomings of current software estimating models. This provides a coherent way for an organisation to capture and exploit the corporate knowledge of previous and current projects that is otherwise lost when personnel move on.

Thus, the main aspects of this proposed approach are:

  • Continual data collection employing a database to support data exploitation;
  • Continual model validation;
    • Monitoring of prediction performance;
    • The performance assessment of the model in estimating its own database;
  • The necessity for knowledge of basic statistics, modelling skills and analytical techniques in order to develop parametric cost estimating relationships.

When a customer requests contractors to provide estimates against a requirement, those estimates need to be verified before acceptance. They can be validated against a database that provides independent estimates of cost and time.

Once a contract is awarded, the contractor and the customer should separately track progress against the original estimate, monitoring and recording any unexpected events that were not included in the original estimate but which will affect the final cost.

The customer and contractor should maintain a common database for this monitoring, to ensure better data quality and a shared understanding. This will help both sides achieve better estimates, and greater confidence in them, for future projects.

Dr S Vijayakumar is with the Ministry of Defence (Defence Engineering and Support), Abbey Wood.