A Data Warehouse Development Methodology

 

A Data Warehouse Development Methodology is a process for gathering requirements for the construction of a data warehouse then organizing the steps required to implement the data warehouse.

 

  

A Data Warehouse Development Methodology

Description:

The intent of this design methodology is not to describe the intimate details of every design activity. Instead the intent of this methodology is to describe:

   what design activities need to be done,

   the order in which they need to be done, and

   the output of the design activities.

 By not describing exactly how each activity is to be done, the design methodology is able to:

   remain finite in size,

   remain in a comprehensible state,

   be applicable to many warehouses.

The more explicit a methodology becomes, the less generalized it becomes. It is the intention of this methodology to provide a general overview, then to allow the developer to figure out exactly how he/she is going to accomplish the task.

There are many activities to the building of a data warehouse. This methodology describes one such approach to building the warehouse. Parts of the methodology are to be done once, such as the establishment of the technological infrastructure. Other aspects of the methodology are to be repeated in an iterative manner. Some activities are to be done sequentially and alone, other activities are to be done in parallel with other activities.

The diagram describes the general flow of the development of the data warehouse. The initial organization needs to be done only once, although the decisions made there can be revisited on an as needed basis.

Once the initial organization is established, then three activities need to be done in parallel. Those activities are:

   the definition of the technical infrastructure,

   preliminary design, and

   data modeling.

The output from these three steps can be aggregated into the next step  which is physical database design. Physical database design starts the iterative process of data warehouse development. The development methodology starting at physical data base design is an iterative one. Another way of describing the development methodology is that it is "spiral".

There is a normal flow throughout the spiral portion of the development methodology. The spiral processes go from:

   physical data base design to

   identification of the system of record to

   sourcing/population of the data bases to

   end user access and analysis of the data to

   respecification  to

   physical data base design.

Once the end user has established regularly occurring requirements, it is safe to establish a data mart. Or if the end user is starting to do regular exploration and data mining, then the end user can establish an exploration warehouse.

 These then are the steps required to build a data warehouse.

 It is strongly suggested that the viewer be familiar with the corporate information factory. Many terms and concepts will be foreign unless the viewer has a foundation in the CIF.

Issues:

There are many issues surrounding the usage of the methodology described in the diagram.

One issue is that of backflow. The arrows in the diagram suggest that there is a normal way to proceed. But with any iterative approach there are alternative design criteria that do not become apparent until the designer is already moving through the process. It is often necessary for the designer to reverse the direction of the flow of design. These arrows and lines are suggested by a dotted line.

Another issue is what to do if part of an activity is already done. For example, suppose that a shop dictates to its developers what hardware and software foundation to use. Then the designer simply accepts the choice and bypasses the appropriate parts of the methodology. The methodology does not suggest that an activity be redone after it has already been done once. Common sense is the best friend of the developer.  

 

Output:

The data warehouse built in an iterative, step by step pattern.

 

1       Initial Organization Development

Ownership

Objectives

Iterative Analysis

Identifiable Checkpoints

Architecture Overview

 

  

Initial Development

 

Description:

The initialization of the data warehouse development process is usually done informally. The normal motivations for the initiation of the development process include:

         the pain of not having integrated information,

         the frustration with not having historical information,

         the dissatisfaction with trying to report from legacy applications,

         the enlightenment that data marts do not solve the problems addressed by data warehousing, and/or

         the need for a foundation on which to do exploration and data mining.

These pains are usually not of the acute variety. They buildup gradually over time. Thus it is that organizations arrive at the conclusion that they need a data warehouse in a gradual manner. At some point in time, the pain builds up and beyond a threshold. It is at this point that the data warehouse begins. 

Issues:

There are many issues to be decided at this point. Some of them are: 

         who is to pay for data warehouse development and infrastructure,

         what technical infrastructure is to be chosen,

         what is the order in which the warehouse is to be developed,

         how can organizations not initially participating in the development of the warehouse ultimately be included?

         how large is the warehouse to grow,

         what is the order in which development activities should be done, and so forth.

Some of the issues should be addressed up front, as an initial part of design. Other issues should be addressed when they arise.  

Output:

There needs to be a single driving force for the warehouse effort. Once that force is established, the output from this activity includes:

         identification of ownership,

         statement of objectives,

         outline of how iterative development will be achieved, including a plan for the management of "scope creep"

         identifiable checkpoints established,

         an architectural outline developed for the purpose of mapping the corporate information factory to the organizations needs

1.1      Initialization - Ownership

Description:

Ownership of the data warehouse is usually an inherited thing.  In most cases the data warehouse is initiated by an end user community, such as marketing or sales. By the time the data warehouse grows to a formal size, the data warehouse ends up being owned by the IT function.

 Ownership of the warehouse entails the following:

         selection of the technological foundation,

         ongoing management of the technological function,

         design and population of the data warehouse,

         management of the architecture surrounding the data warehouse,

         population of the data warehouse,

         ongoing monitoring of the data warehouse.

In addition, all budgetary considerations are a part of ownership, including the measurement of return on investment.

 

Issues:

The ownership issues are:

         does the budget exist for the building of the data warehouse

         how quickly will the data warehouse grow

         who will use the data warehouse and in what manner

         how will the data warehouse be monitored

         what technology will be used  

Output:

None

1.2      Initialization  Objectives

  Description:

The objectives of the building of the data warehouse usually center around the support of the DSS needs of the organization.  However, simply stating that the data warehouse supports data warehouse is a very broad and general proclamation.

In order to be effective the statement of the objectives of the data warehouse need to be couched in terms of the business.  If the objectives are viewed as technical objectives, then the statement will be weak.

 A more focused statement of the objectives of the data warehouse might look like: 

         support of integrated and historical data for data marts,

         creation of a corporate foundation of information,

         support of data for exploration,

         providing a platform of reconcilability, a single version of the truth from

     which to do reporting and analysis.

Of course the objectives can become much more specific.  Specific data, specific reports, specific queries can be included in the objectives for the support of the data warehouse.

In short, anything supporting the relevance and the efficacy of the data warehouse can be listed as the support vehicle for the data warehouse.

An important aspect of the statement of objectives is to tie the objectives to the iterations of development. If the objectives are not tied to the schedule of development, then there is every likelihood of unmet expectations. 

Issues:

The issues of the statement of the objectives for the data warehouse include: 

         the terseness of the statement,

         the relevance to the business to the data warehouse,

         the general approach to the achievement of the building of the data warehouse.  

Output:

A statement of the objectives of the data warehouse.

1.3      Initialization  Iterative Analysis

 Description:

Iterative analysis is the process breaking up a development effort into more than one smaller efforts. There are some very good reasons why data warehouse development should be broken up into iterative steps:

   tangible deliverables can be delivered to the organization quickly,

   the developer can focus on a small, easily achievable deliverable,

   the design process need not attempt to create all of the data warehouse at once.

But perhaps the single most important reason why iterative development for the warehouse makes sense is that the end user does not know what he/she wants. The end user has the attitude of "give me what I say I want, then I can tell you that I really want". The end user does not know what the possibilities are until the end user experiences the data warehouse. Therefore it is guaranteed that change will occur to the design of the data warehouse AFTER it has been built.

Given that change will occur, the designer is ill advised to create the data warehouse on a large scale.

The iterative steps of development - in the ideal case - center around the different subject areas of the data warehouse. A typical subject area might be customer or transaction.

But sometimes the iterations cannot focus on a single subject.  Sometimes several subjects must be developed together to maintain business continuity. In this case - where multiple subjects are built in the same iteration of development -then subsets of the subjects are developed.

The iterations of development are spaced such that a deliverable takes no longer than six months to deliver.

Issues:

The issues of iterative development are these: 

         the size of the iteration,

         the order of the iterations,

         the timing of the deliverables,

         the business relevance of the deliverables,

         scope creep - where the scope of the deliverable changes after the statement of the deliverable is created and after development has commenced.

Output:

A plan for the different iterations of development.

1.4       Initialization  Architecture Overview

 Description:

The architecture surrounding the data warehouse is called the corporate information factory.  The corporate information factory is rigorously defined.  The components of the corporate information factory are: 

         operational applications,

         integration / transformation programs and processes,

         the data warehouse,

         data marts,

         ODS,

         exploration warehouses,

         data mining warehouses,

         DSS applications.

The development methodology focuses on the development of the data warehouse, but the development methodology is fully aware of the other architectural components of the corporate information factory.

In order to have a successful deployment of the data warehouse and the corporate information factory, the data warehouse administrator must know what the different components of the corporate information factory are and how they relate.  In addition, the data warehouse administrator must be aware of the different issues that relate to the corporate information factory such as why the data marts must be built from data that comes from the data warehouse, not from other sources.

Another subject of importance is that of the order of creation of the components of the corporate information factory.

For a more complete description of the corporate information factory, please refer to the book THE CORPORATE INFORMATION FACTORY, available from John Wiley and Sons. 

Issues:

The issues of the architecture surrounding the data warehouse are: 

         identification of the components of the corporate information factory,

         the order in which the components should be built,

         why data marts should not be built for applications,

         the speed with which data should be refreshed, and so forth.  

Output:

None

2       Technological Infrastructure

Hardware Selection

Service Level Agreement Specification

DBMS Selection

Purge Schedule

Access/Analysis Tool Selection

Data Quality Plan

Extract/Transform/Load Tool Selection

Capacity Projection

Near Line/Secondary Storage Plan

Access/Analysis Tool Indoctrination

Change Data Capture

 

 

 ERP Interface

Capacity Monitoring

 

 

Data Mart Replenishment

Online Activity Monitoring

 

 

Exploration Warehouse Plan

Historical Activity Monitoring

 

 

Data Mart Identification

 

 

  

Technological Infrastructure

 Description:

The technological infrastructure is the foundation that houses and manages the data warehouse. The technological infrastructure consists of:

Hardware:

   storage

   alternative storage

   processors

   communications

Software:

   operating system software

   basic DBMS management

   end user access and analysis software

   systems management software

       online activity monitoring

       historical activity monitoring

       charge back/budgetary

       capacity planning software

       data quality, control software

   extract/transform/load software

   metadata management software  

Issues:

The selection of the technological infrastructure is particularly important because the selection of the environment has a great impact on:

costs,

long-term capacity,

performance,

functionality, and so forth.

The data warehouse administrator must be particularly sensitive to the fact that the data warehouse environment will grow at a rate that the organization has never seen.

Another important issue is the fact that the measurement of success of the data warehouse environment is very different than the measurement of success of other environments such as the OLTP environment. The OLTP environment centers around response time while the data warehouse environment centers around the management of the volumes of data that aggregate in the data warehouse.

The management of large volumes entails the management of dormant data. Dormant data is that data that is loaded into the data warehouse and then is very infrequently or never used.  The management of dormant data starts with the recognition of the data then with the movement of dormant data to the alternative storage environment.

Output:

The selection and acquisition of the technological foundation for the data warehouse.

2.1      Hardware Selection

 Description:

The hardware selection centers around the platform, the communications devices, and the storage devices that will house the data warehouse.  There are many considerations to the selection of the hardware for the data warehouse environment:

         how does the hardware vendor get along with the organization? what has been the vendors track record at the shop? in other shops?

         can the hardware perform as advertised?

         does the hardware perform the functions as needed and as advertised?

         what is the cost of hardware? and so forth. (For a more complete list, please refer to the ISSUES section.)

The selection of hardware is rarely a one time experience. Usually the hardware vendor chosen is one that has a long history with the shop.

It is noteworthy that the hardware that is optimal for OLTP processing is usually not at all optimal for DSS, data warehouse processing. The following is a list of differences:

OLTP

 DSS

  • 2/3 second response time
  • 30 second - 24 response time
  • many queries
  • few queries
  • small queries
  • large queries
  • update processing
  • access and analysis
  • many users
  • few users
  • queries predetermined  
  • queries not predetermined
  • gigabytes of data  
  • terabytes of data

The selection of hardware can be a drawn out process or a very short process. 

Issues:

The issues of hardware selection for the data warehouse environment are these:

         the cost of the hardware,

         the capacity of the hardware,

         scalability of hardware,

         ability of hardware to interface with secondary and near line storage,

         compatibility of new hardware with old hardware,

         the functionality of the hardware,

         stability of the hardware vendor,

         the software that operates on the hardware,

         the connectivity of the hardware,

         the openness or the proprietary nature of the hardware,

         the operating costs of the hardware,

         vendor support for the hardware,

         the maintenance requirements of the hardware,

         the "foot print" of the hardware, and so forth.

Output:

None.

2.2      DBMS Selection

 Description:

The DBMS that runs the data warehouse must focus on the ability to handle industrial strength volumes of data. If the DBMS choice is based on ANYTHING else as the primary consideration, it is a mistake. Some of the characteristics the data warehouse administrator should look for in the DBMS are:

         the scalability of the DBMS,

         the ability to partition the DBMS,

         the ability to optimize the DBMS on both SMP and MPP platforms,

         the ability of the DBMS to support both secondary storage and near line storage,

         the ability to support standard SQL processing,

         the ability to load data efficiently,

         the vendors support of the product,

         the third party application and utility software available,

         the cost of acquisition,

         the cost of ongoing operation,

         the efficiency of operation,

         the ability to not incur overhead of update when doing data warehouse processing,

         the ability to index data efficiently, and so forth.

In addition, the general reputation the DBMS has as a basis for data warehouse support in the general IT community should be considered.

One item not on the list is support of star joins, fact tables and dimension tables. As a rule these structures are not found in the data warehouse. Instead they are much more common in the data mart environment. A data warehouse focused around star joins, fact tables, and dimension tables will have a heavy bias toward one set of users at the expense of all other users. Instead, the DBMS for the data warehouse needs to support normalized and lightly normalized structures.

In truth there are only a handful of DBMS software packages that have a wide following as a data base engine for the data warehouse environment:

         Oracle,

         UDB,

         DB2,

         Teradata,

         Informix.

Any other choice for an industrial strength data base engine for the volumes of data found in a data warehouse must be questioned. 

Issues:

None. 

Output:

The selection of the DBMS for the data warehouse.

2.3      Access/Analysis Tool Selection

 Description:

The access and analysis tools are the technology that the end user acquires and implements in order to access the data found in the DSS environment. The tools cover a wide range of territory, from spreadsheets, to multi dimensional tools, to full data marts servers capable of managing a large amount of data.

There will be more than one access and analysis tool in a shop to serve the needs of the users. There are several reasons why more than one type of tool will be used:

         tools typically come out of the end users budget. User A wants tool A and user B wants tool B. There is nothing to stop an end user from buying their own tool regardless of whether the tool is supported by the IT function,

         tools have many different capabilities. Some tools are for sophisticated users. Other tools are for the neophyte.  Some tools have specialized capabilities such as statistical capabilities. Other tools have the capability of accessing data over a wide range of technologies.  It is no surprise that there is a wide range of tools that are appealing to the community of users.

         tools come to the market place at different times. Some tools have been around for a long time. Other tools have been in the market place only a short while. There are many new capabilities that enter the marketplace all the time.

The data warehouse administrator is well advised to adopt a strategy of :

         many tools,

         allowing the end user to select their own tool (they are going to do it anyway),

         support of a select group of tools,

         providing a range of capabilities.

In addition, it is a mistake for the data warehouse administrator to carry the tools in the budget. The budget for the tools needs to be in the budget of the end user, where the final decision as to tools will be made.

One consideration the end user has is that of compatibility of the tool with the corporate information factory. The tool needs to fit comfortably with the corporate information factory in terms of being able to manage:

         data created in another place (most tools do this),

         metadata across the corporate information factory (most tools don't do this),

         volumes of data, and so forth.

The sorts of considerations the data warehouse administrator has are:

         the initial acquisition cost of the tool,

         the ongoing cost of the tool,

         the capabilities of the tool,

         the efficiency of the tool,

         the type of business problem addressed by the tool,

         the stability of the vendor,

         the complexity of the tool,

         the ability of the tool to fit into a corporate information factory framework, and so on.  

Issues:

None  

Output:

The selection of the end user access and analysis tools for the corporate information factory environment.

 

2.4      Extract/Transformation/Load Tool Selection 

Description:

The extract transform load tool marketplace is designed to access and convert application data so that it is ready to be placed in a data warehouse. Nearly all application data is unintegrated and unfit for direct placement into a data warehouse. The extract transform load tool is designed to lift the data out of the operational environment and move the data into the data warehouse in a transformed state.

There are many function performed by the extract transform load tool:

         the selection of application data,

         the summarization of application data when needed,

         the conversion of application data,

         the reformatting of application data,

         the merging of application data when necessary,

         the resequencing of application data,

         the pulling of application data from old DBMS technology,

         the creation of default values when there is no application source, and so on.

There are two types of extract transform load packages -those that create code for the purpose of integrating data, and those that read the data directly and write out data with creation of code. The former tends to be complex but very flexible, the latter tends to be easy to use but limited in scope and capability.

Whichever variety of extract transform load the organization chooses is superior to the writing of code in a manual fashion.

One of the functions of the extract transform load is the creation of metadata that describes the logic that is executed by the extract transform load tool.

Issues:

The issues of the selection of the extract transform load tool are these:

         the acquisition cost of the extract transform load tool,

         the ongoing cost of the usage of the extract transform load tool,

         the ability of the extract transform load tool to read older DBMS in a native mode,

         the complexity of the extract transform load tool,

         the training required to become productive in the extract transform load tool,

         the efficiency of the code created by the extract transform load tool,

         the support required for the extract transform load tool, and so forth.  

Output:

The selection of the extract transform load tool.

2.5       Near Line/Secondary Storage

Description:

As data warehouses grow larger, data separates into two classes; actively used data and dormant data. There is a need to move dormant data to storage that is less expensive than high performance disk storage. It makes no sense to place unused data on expensive technology.

There are two options for the placement of dormant data out of disk storage - near line storage and secondary storage. Near line storage is siloed tape storage and secondary storage is slower, less expensive disk storage.

In order to make either near line storage or siloed tape storage work some software is required:

         a monitor that tells what data is used and what data is not used. This intelligence tells the administrator what data to place in near line storage, and

         a cross media storage manager that allows data requested online that is not stored there to be moved to online storage.

Both of these pieces of software are required in order to make near line or secondary storage successful.

Note: near line and/or secondary storage is not needed for all shops. The need becomes apparent as the shop grows larger. 

Issues:

The issues of near line storage or secondary storage are these:

         the cost of the storage devices,

         the software required to make these devices work,

         the workload of the queries coming into the systems,

         compatibility of the near line/secondary storage devices with commonly used data base software,

         the ability to find dormant data,

         the volume of data to be retrieved back into the online processor once the data is requested,

         the stability of the vendor, and so forth.  

Output:

None

2.6         Changed Data Capture

Description:

One of the problems facing the dwa is that of finding data in the operational environment in order to do refreshment into the data warehouse. The simplistic approach is to read an entire operational file in order to find the data that has changed from the last refreshment. Unfortunately this approach is very wasteful in that in many cases data is read that is not needed for refreshment. The dwa may read an entire file looking for 1% of the records in the file. Then when the data needs to be refreshed again, the read is done again. It is very possible for a company to use many cycles of processing in a totally wasted manner. Another problem with this approach is that the operational online system must be up and running for the reads of the operational data to be accomplished.

The dwa needs a way to do refreshment with only the data that has changed since the last refreshment of the data warehouse.

The log or journal tape used for transactional processing is very useful for this purpose. The log or journal tape contains only the data that has changed, since all transactions that do update are found in the log tape.

A third party utility is run that reads the tape and extracts the data about the transaction from the tape. The extracted data is then entered into an ETL tool for further processing.

Note: changed data capture is necessary for large shops where there is a short batch window at night.

Note: there are other forms of changed data capture other than those outlined here.

 

 

Issues:  

The issues of changed data capture are these:

Output: 

An efficient method of accessing operational data for the purpose of refreshing data into the data warehouse.

2.7       Service Level Agreements

Description:

One of the cornerstones of successful operation of the data warehouse environment is that of the creation of and application of service level agreements. Service level agreements are agreements between the IT function that operates the data warehouse and the end users who use the data warehouse.  Service level agreements contain agreements on:

The service level agreement for the data warehouse environment is quite different from a service level agreement for the operational or OLTP environment. The operational transaction oriented environment does not have to take into account exploration processing or very lengthy queries, as are common in the data warehouse environment.  For this reason the data warehouse service level agreement is a bit broader than a service level agreement for other environments.

Issues:

The issues of data warehouse environment service level agreements are:

Output:

A  well defined and workable relationship between the IT organization and the end user.

2.8       Purge Schedule

Description:

The data placed in a warehouse operates in a cycle. The data is placed in a warehouse, it is used there, and then it is removed. Developers of the warehouse environment often times get to be carried away to the point that