- At the end of this chapter you should be able to:
- Distinguish a data warehouse from an operational database system, and
appreciate the need for developing a data warehouse for large corporations.
- Describe the problems and processes involved in the development of a data
- Explain the process of data mining and its importance.
- Understand different data mining techniques.
Rapid developments in information technology have resulted in the construction
of many business application systems in numerous areas. Within these systems,
databases often play an essential role. Data has become a critical resource in
many organisations, and therefore, efficient access to the data, sharing the data,
extracting information from the data, and making use of the information stored,
has become an urgent need. As a result, there have been many efforts on firstly
integrating the various data sources (e.g. databases) scattered across different
sites to build a corporate data warehouse, and then extracting information from
the warehouse in the form of patterns and trends.
A data warehouse is very much like a database system, but there are distinctions
between these two types of systems. A data warehouse brings together the
essential data from the underlying heterogeneous databases, so that a user only
needs to make queries to the warehouse instead of accessing individual databases.
The co-operation of several processing modules to process a complex query is
hidden from the user.
Essentially, a data warehouse is built to provide decision support functions for
an enterprise or an organisation. For example, while the individual data sources
may have the raw data, the data warehouse will have correlated data, summary
reports, and aggregate functions applied to the raw data. Thus, the warehouse
is able to provide useful information that cannot be obtained from any individual
databases. The differences between the data warehousing system and
operational databases are discussed later in the chapter.
We will also see what a data warehouse looks like – its architecture and other
design issues will be studied. Important issues include the role of metadata as
well as various access tools. Data warehouse development issues are discussed
with an emphasis on data transformation and data cleansing. Star schema, a
popular data modelling approach, is introduced. A brief analysis of the relationships
between database, data warehouse and data mining leads us to the second
part of this chapter - data mining.
Data mining is a process of extracting information and patterns, which are previously
unknown, from large quantities of data using various techniques ranging
from machine learning to statistical methods. Data could have been stored in
files, Relational or OO databases, or data warehouses. In this chapter, we will
introduce basic data mining concepts and describe the data mining process with
an emphasis on data preparation. We will also study a number of data mining
techniques, including decision trees and neural networks.
We will also study the basic concepts, principles and theories of data warehousing
and data mining techniques, followed by detailed discussions. Both theoretical and practical issues are covered. As this is a relatively new and popular topic in databases, you will be expected to do some extensive searching,
reading and discussion during the process of studying this chapter.
General introduction to data warehousing
In parallel with this chapter, you should read Chapter 31, Chapter 32 and Chapter
34 of Thomas Connolly and Carolyn Begg, “Database Systems A Practical
Approach to Design, Implementation, and Management”, (5th edn.).
What is a data warehouse?
A data warehouse is an environment, not a product. The motivation for building
a data warehouse is that corporate data is often scattered across different
databases and possibly in different formats. In order to obtain a complete piece
of information, it is necessary to access these heterogeneous databases, obtain
bits and pieces of partial information from each of them, and then put together
the bits and pieces to produce an overall picture. Obviously, this approach
(without a data warehouse) is cumbersome, inefficient, ineffective, error-prone,
and usually involves huge efforts of system analysts. All these difficulties deter
the effective use of complex corporate data, which usually represents a valuable
resource of an organisation.
In order to overcome these problems, it is considered necessary to have an environment
that can bring together the essential data from the underlying heterogeneous
databases. In addition, the environment should also provide facilities
for users to carry out queries on all the data without worrying where it actually
resides. Such an environment is called a data warehouse. All queries are
issued to the data warehouse as if it is a single database, and the warehouse
management system will handle the evaluation of the queries.
Different techniques are used in data warehouses, all aimed at effective integration
of operational databases into an environment that enables strategic use
of data. These techniques include Relational and multidimensional database
management systems, client-server architecture, metadata modelling and repositories,
graphical user interfaces, and much more.
A data warehouse system has the following characteristics:
- It provides a centralised utility of corporate data or information assets.
- It is contained in a well-managed environment.
- It has consistent and repeatable processes defined for loading operational
- It is built on an open and scalable architecture that will handle future
expansion of data.
- It provides tools that allow its users to effectively process the data into
information without a high degree of technical support
A data warehouse is conceptually similar to a traditional centralised warehouse
of products within the manufacturing industry. For example, a manufacturing
company may have a number of plants and a centralised warehouse. Different
plants use different raw materials and manufacturing processes to manufacture
goods. The finished products from the plants will then be transferred to and
stored in the warehouse. Any queries and deliveries will only be made to and
from the warehouse rather than the individual plants.
Using the above analogy, we can say that a data warehouse is a centralised place
to store data (i.e. the finished products) generated from different operational
systems (i.e. plants). For a big corporation, for example, there are normally
a number of different departments/divisions, each of which may have its own
operational system (e.g. database). These operational systems generate data day
in and day out, and the output from these individual systems can be transferred
to the data warehouse for further use. Such a transfer, however, is not just a
simple process of moving data from one place to another. It is a process involving
data transformation and possibly other operations as well. The purpose is
to ensure that heterogeneous data will conform to the same specification and
requirement of the data warehouse.
Building data warehouses has become a rapidly expanding requirement for most
information technology departments. The reason for growth in this area stems
from many places:
- With regard to data, most companies now have access to more than 20
years of data on managing the operational aspects of their business.
- With regard to user tools, the technology of user computing has reached
a point where corporations can now effectively allow the users to navigate
corporation databases without causing a heavy burden to technical
- With regard to corporate management, executives are realising that the
only way to sustain and gain an advantage in today’s economy is to better