Data integration - the foundation of a robust enterprise architecture

Technology has come of age for more than just data warehousing, says Mark Mitchell of Informatica.

Gartner once reported that 'large enterprises should create a central competency centre to reduce the time and cost required to integrate application systems'.

More and more, companies are turning to data integration (DI) software as the foundation, or integration competency centre, upon which their overall enterprise data architectures reside.

This may surprise those who think of DI only in terms of data warehousing, yet DI provides capabilities and advantages that are simply not available from any other single integration technology.

These include the ability to perform complex transformations, focus on data quality and profiling, quickly move terabytes of data on a schedule or in response to events, leverage rich metadata capabilities, use codeless integration and utilise adaptive integration to keep pace dynamically with changing information requirements and environments.

Moreover, DI marries well with other integration technologies, such as enterprise application integration (EAI) and enterprise information integration (EII). And its unique attributes can be leveraged by them, and vice versa, in order to integrate, visualise and track any type of data, in any quantity to and from any platform, scheduled or event driven, for any enterprise data requirement.

This said, why should data integration software be the platform upon which to layer other integration technologies? The answer lies in how today's typical enterprise data architecture encompasses a variety of integration requirements, and in how those requirements can be best met to enable true business agility.

It's all about the data

The tasks for which DI is best suited are those that reside at the heart of today's fragmented information environments. DI can pull huge amounts of dissimilar data from any number of disparate sources, rapidly transport, transform and cleanse it, and integrate it so that it appears to have come from a single source.

High on the list of applications requiring this functionality are business intelligence type applications, explaining why DI is currently so closely associated with data warehousing.

Without DI there could be: no unified views of business data across multiple systems; no single views of customers and suppliers to drive customer relationship management (CRM) operations; and no trusted source of cleansed and normalised data for business intelligence and corporate performance management (CPM) applications.

But a recent Informatica World 2003 survey of senior IT executives found that 87 per cent use DI solutions for general integration, not just data warehousing. A recent TDWI survey reached the same conclusion, with 80 per cent using DI solutions for purposes beyond data warehousing, such as data migration, application integration and reference data projects.

DI is equally essential for migrating data between systems, replicating and synchronising large amounts of data across databases, meeting data-interchange requirements such as HIPAA in healthcare and SWIFT in financial services, or consolidating ERP and other data onto a single platform in the wake of a merger or acquisition.

DI's ability to move massive volumes of data quickly while simultaneously performing data transformations and data-quality operations comes into play in all of these scenarios.

There are also comparatively new arenas that require DI functionality: business activity monitoring (BAM), zero-latency enterprise (ZLE), and reference data hub initiatives. All of these initiatives require some sort of real-time integration component.

DI is often pigeonholed as a batch technology because batch was normally required to support traditional data warehousing. But as we'll discuss, DI can source transactional data, and integrate it in real time, to enable the real-time enterprise.

Data integration - architectural advantages

There are several architectural attributes of DI to explore. First, DI software is built for mass data sourcing and movement. As data sources multiply, volumes expand and transformations become more complex, a DI environment scales accordingly.

DI software will scale linearly and can transparently leverage parallel-processing technologies for enhanced performance and application server technologies for dynamic balancing of workloads. DI software is also quickly deployed and inherently easy to maintain, as it does not require hand coding.

Web services have also recently emerged as a significant DI trend. Within a DI environment, plug-and-play support for Web services enables DI solutions to adapt quickly to a company's existing and new Web services processes.

For example, an order application can invoke DI's transformation and integration functionality via Web services, while a DI solution can in turn invoke external Web services to receive data or to leverage external functionality, such as an external workflow manager.

Leveraging metadata

Also a strength of DI, metadata is important to making data usable and to driving effective information reuse.

From the integration perspective, metadata-driven transformations are extremely important because they are far less brittle than code-generated transformations.

With metadata, a single change can be reflected across myriad integration processes, whereas with a code-generating approach each of those processes must be inspected and changed by hand.
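To make the contrast concrete, here is a minimal sketch in Python of a metadata-driven transformation. It is an illustration only, not how any particular DI product works: the rule table FIELD_RULES, the helper apply_rules and the two loader functions are all hypothetical names. The point it shows is that both integration processes read the same rules, so a change made once in the metadata reaches both.

```python
# Hypothetical metadata: one shared rule set describing how source
# fields map to target fields and which transformation to apply.
FIELD_RULES = {
    "cust_name": {"target": "customer_name", "transform": str.strip},
    "cntry": {"target": "country", "transform": str.upper},
}


def apply_rules(record, rules):
    """Build a target record by applying each metadata rule to the source record."""
    out = {}
    for source_field, rule in rules.items():
        if source_field in record:
            out[rule["target"]] = rule["transform"](record[source_field])
    return out


# Two separate (hypothetical) integration processes reuse the same metadata,
# so neither contains its own hand-written copy of the mapping logic.
def load_crm(records):
    return [apply_rules(r, FIELD_RULES) for r in records]


def load_billing(records):
    return [apply_rules(r, FIELD_RULES) for r in records]


crm_rows = load_crm([{"cust_name": "  Acme Ltd ", "cntry": "uk"}])
billing_rows = load_billing([{"cust_name": "Bolt plc", "cntry": "de"}])
```

Renaming the target field or swapping the transformation in FIELD_RULES changes the output of both loaders without touching their code.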

Metadata is also important in a business intelligence and analytics context. Giga states: 'When combined with enlightened data management practices and business acumen, metadata-driven design is making possible significant benefits in terms of reuse, productivity improvements and reduced coordination costs'.

The total integration imperative

We've alluded to a new generation of strategic applications and constructs, such as BAM, ZLE and reference data hubs, that have a fundamental need for DI.

BAM is the new generation of business intelligence, capable of providing real-time dashboards and scorecards that enable users to keep their fingers on the pulse of enterprise events as they happen.

As Gartner states, 'Data is presented to the BAM recipient in much the same way as car performance data is presented to a driver via the dashboard of a car'. Everything that DI brings to business intelligence - the rapid transformations, the consistent data quality, the single views, the sourcing from numerous systems - is required by BAM.

A DI solution can integrate real-time data sourced from real-time data feeds such as EAI message queues. The real-time data can be transformed, de-duplicated and inserted into a real-time data repository by the DI solution as it emerges from the messaging pipeline. The DI solution also manages the immense amount of metadata that is integral to the effective use of data by BAM or any other business intelligence application.
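The flow described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the standard library queue.Queue stands in for an EAI message queue, a plain list stands in for the real-time repository, and the message fields (id, amount) and the function drain_into_repository are invented for the example.

```python
import queue


def drain_into_repository(messages, repository, seen_ids):
    """Transform and de-duplicate each message as it leaves the queue."""
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            break  # nothing left to process for now
        if msg["id"] in seen_ids:
            continue  # duplicate delivery: skip it
        seen_ids.add(msg["id"])
        # Example transformation: store the amount as whole pence.
        repository.append(
            {"id": msg["id"], "amount_pence": round(msg["amount"] * 100)}
        )


# Stand-in for an EAI message queue; message 1 is delivered twice.
messages = queue.Queue()
for m in [
    {"id": 1, "amount": 9.99},
    {"id": 2, "amount": 5.0},
    {"id": 1, "amount": 9.99},
]:
    messages.put(m)

repository = []
drain_into_repository(messages, repository, set())
```

A production feed would block on the queue rather than drain it once, and would keep the set of seen identifiers somewhere durable, but the transform, de-duplicate and insert steps are the same.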

Enterprise information hubs - whether called ZLE hubs, reference data repositories or real-time customer information stores - require a similar synergy. Unlike BAM, these constructs are essentially operational in nature.

These are single places for people and applications to find clean, trusted, consolidated enterprise information. Data is sourced from a wide variety of enterprise transaction systems, some of it in real time (and some not), and transformed, cleansed and aggregated by the DI solution. The real-time data is integrated from EAI or other real-time message queues.

The data in these hubs is, then, an amalgamation of real-time data and historical data, with the real-time data being kept continually up-to-date, often through DI-driven changed data capture.
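As a rough illustration of how changed data capture keeps a hub current, the Python sketch below applies a stream of change events to data that was loaded historically. The event layout (op, key, data) and the function apply_changes are assumptions made for the example, not the format of any specific product.

```python
def apply_changes(hub, changes):
    """Apply captured insert, update and delete events to the hub in order."""
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            hub.pop(key, None)  # ignore deletes for keys the hub never held
        else:
            # Insert or update: merge the changed fields over what is there.
            hub[key] = {**hub.get(key, {}), **change["data"]}
    return hub


# Historical data already loaded into the hub.
hub = {"C1": {"name": "Acme Ltd", "tier": "silver"}}

# Hypothetical change events captured from source systems.
changes = [
    {"op": "update", "key": "C1", "data": {"tier": "gold"}},
    {"op": "insert", "key": "C2", "data": {"name": "Bolt plc", "tier": "bronze"}},
    {"op": "delete", "key": "C3", "data": {}},
]

apply_changes(hub, changes)
```

Only the changed fields travel through the pipeline, which is why this approach suits hubs that must stay up to date without reloading their history.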

Once in the hub's repository, or 'hot cache' as it's sometimes called, data can be enriched by data that is already there or by new data flowing through the hub. And it can be pushed or pulled out to applications and users via EAI-enabled publish/subscribe messaging.

The consolidated data can also be used to populate downstream data marts and warehouses, and to feed BAM, traditional business intelligence, data mining and other analytic-type applications.

Why make DI the foundation?

BAM and information hub applications are characterised by very complex transformations that need to be performed very rapidly. The same is true for data cleansing and other data quality operations.

DI does this in a single process while sourcing from the full range of enterprise sources - relational and non-relational databases, legacy mainframe systems, real-time message systems, flat files, remote data sources and so on.

With these emerging applications, it is also necessary not just to integrate real-time data, but to integrate it in real time. A DI solution can bring numerous performance features to bear - from real-time pipelining to leveraging parallel processing - to do just that.

With a DI foundation in place, companies can incrementally and cost-effectively add EAI and/or EII capabilities to their data architectures as required, to do what they are best at.

EAI can be leveraged to enable reliable feeds of real-time transactional data for integration with historical data. Alternatively, EII might be leveraged for certain kinds of real-time unified views. For example, a call centre can use EII as a backend to get data for trouble ticket calls.

The EII solution gets aggregate historical numbers out of a warehouse where the heavy-lifting transformations, data quality processes and consolidations have already been done by a DI solution. The EII tool can then go to a real-time system for the current data that completes the single business view.
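The call centre case might look something like the following Python sketch. Everything in it is hypothetical: two dictionaries stand in for the warehouse and the real-time system, and the function customer_view and the field names are invented to show the shape of the combined view.

```python
def customer_view(customer_id, warehouse, live_system):
    """Combine pre-aggregated history with current data into a single view."""
    history = warehouse.get(customer_id, {})
    current = live_system.get(customer_id, {})
    return {
        "customer_id": customer_id,
        "tickets_last_year": history.get("tickets_last_year", 0),
        "open_ticket": current.get("open_ticket"),
    }


# Stand-in for a warehouse already populated and cleansed by a DI solution.
warehouse = {"C1": {"tickets_last_year": 7}}

# Stand-in for a real-time operational system.
live_system = {"C1": {"open_ticket": "T-1042"}}

view = customer_view("C1", warehouse, live_system)
```

The lookup stays light because the expensive aggregation has already been done upstream; the function only joins two small answers at request time.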

Meeting the integration imperative

As more and more companies set up integration competency centres, it's imperative they recognise the value of data integration. As stated earlier, the typical enterprise data architecture encompasses a variety of integration requirements.

No single integration technology can address them all. EAI, EII and DI all have a place, and DI - with its ability to get data into shape, move large amounts of it very quickly, and make it useful to end users - needs to be taken extremely seriously as the fundamental enabler of the agile enterprise.