Chapter 2 reviews the literature on data warehouses, OLAP multidimensional databases (MDDB) and data mining. We review the concepts, characteristics, design and implementation approaches of each of these technologies to identify a suitable data warehouse framework. This framework will support the integration of OLAP MDDB and data mining models.
Section 2.2 discusses the fundamentals of data warehousing, including data warehouse models and data processing techniques such as the extract, transform and load (ETL) processes. A comparative study of the data warehouse models introduced by William Inmon (Inmon, 1999), Ralph Kimball (Kimball, 1996) and Matthias Nicola (Nicola, 2000) was carried out to identify a suitable model, design and characteristics. Section 2.3 introduces the OLAP model and architecture, and discusses the concept of processing in OLAP-based MDDB, MDDB schema design and implementation. Section 2.4 introduces data mining techniques, methods and processes for OLAP mining (OLAM), which is used to mine MDDB. Section 2.5 concludes the literature review, with particular attention to the reasons for our decision to propose a new data warehouse model. Since we propose to use a Microsoft® product to implement the proposed model, we also present a product comparison to justify why the Microsoft® product was selected.
2.2 DATA WAREHOUSE
According to William Inmon, a data warehouse is a “subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management’s decision-making process” (Inmon, 1999). A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.
A data warehouse provides an effective integration of operational databases into an environment that enables the strategic use of data (Zhou, Hull, King and Franchitti, 1995). The technologies involved include relational and MDDB management systems, client/server architecture, metadata modelling and repositories, graphical user interfaces and much more (Hammer, Garcia-Molina, Labio, Widom, and Zhuge, 1995; Harinarayan, Rajaraman, and Ullman, 1996).
The emergence of cross-disciplinary domains such as knowledge management in finance, health and e-commerce has shown that vast amounts of data need to be analysed. The evolution of data in a data warehouse can provide multiple dataset dimensions for solving various problems. Thus, critical decision making on such datasets needs a suitable data warehouse model (Barquin and Edelstein, 1996).
The main proponents of the data warehouse are William Inmon (Inmon, 1999) and Ralph Kimball (Kimball, 1996), but they have different perspectives on its design and architecture. Inmon (1999) defined the data warehouse as a dependent data mart structure, while Kimball (1996) defined it as a bus-based data mart structure. Table 2.1 presents the differences in data warehouse structure between Inmon and Kimball.
A data warehouse is a read-only data source in which end users are not allowed to change values or data elements. Inmon’s (1999) data warehouse architecture strategy differs from Kimball’s (1996). In Inmon’s model, data marts are split off as copies of the data warehouse and distributed as an interface between the data warehouse and end users. Kimball views the data warehouse as a union of data marts: a collection of data marts combined into one central repository. Figure 2.1 illustrates the differences between Inmon’s and Kimball’s data warehouse architectures, adopted from Mailvaganam (2007).
Although Inmon and Kimball have different design views of the data warehouse, they agree that a successful data warehouse implementation depends on the effective collection of operational data and the validation of data marts. Database staging and ETL processes are indispensable components in both researchers’ data warehouse designs. Both believed that a dependent data warehouse architecture is necessary to fulfil the requirements of enterprise end users in terms of precision, timing and data relevance.
2.2.1 DATA WAREHOUSE ARCHITECTURE
Data warehouse architecture has a wide research scope and can be viewed from many perspectives. Thilini and Hugh (2005) and Eckerson (2003) provide some meaningful ways to view and analyse data warehouse architecture. Eckerson states that a successful data warehouse system depends on a database staging process which derives data from different integrated Online Transactional Processing (OLTP) systems. In this case, the ETL process plays a crucial role in making the database staging process workable. A survey of the factors that influence the selection of data warehouse architecture (Thilini, 2005) identifies five data warehouse architectures in common use, as shown in Table 2.2.
Independent Data Marts
Independent data marts are also known as localized or small-scale data warehouses. They are mainly used by departments or divisions of a company to provide individual operational databases. This type of data mart is simple, yet it takes different forms derived from multiple design structures across various inconsistent database designs, which complicates cross-data-mart analysis. Every organizational unit tends to build its own database, which operates as an independent data mart (Thilini and Hugh, 2005, citing the work of Winsberg, 1996 and Hoss, 2002). Such a data mart is best used as an ad hoc data warehouse or as a prototype before building a real data warehouse.
Data Mart Bus Architecture
Kimball (1996) pioneered the design and architecture of the data warehouse as a union of data marts, known as the bus architecture or virtual data warehouse. The bus architecture allows data marts to be located not only on one server but also on different servers. This allows the data warehouse to function in a more virtual mode, combining all data marts and processing them as one data warehouse.
Inmon (1999) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke mainly focuses on building a scalable and maintainable infrastructure for the data warehouse.
Centralized Data Warehouse Architecture
The centralized data warehouse architecture is built on the hub-and-spoke architecture but without the dependent data mart component. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources. According to Inmon (1999) and Kimball (1996), a central data warehouse should include database staging, also known as an operational data store, as an intermediate stage for the operational processing of data integration before the data is transformed into the data warehouse.
According to Hackney (2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems. The federated concept focuses on an integrated framework to make the data warehouse more reliable. Jindal (2004) concludes that the federated data warehouse is a practical approach, as it focuses on higher reliability and provides excellent value.
Thilini and Hugh (2005) conclude that the hub-and-spoke and centralized data warehouse architectures are similar. The centralized architecture is faster and easier to implement because no data marts are required. Accordingly, the centralized data warehouse architecture scored higher than hub-and-spoke where there was an urgent need for a relatively fast implementation approach.
In this work, it is very important to identify a data warehouse architecture that is robust and scalable for building and deploying enterprise-wide systems. Laney (2000) states that the selection of an appropriate data warehouse architecture must incorporate the successful characteristics of various data warehouse models. Two data warehouse architectures have proved to be popular, as shown by Thilini and Hugh (2005), Eckerson (2003) and Mailvaganam (2007). The first is the hub-and-spoke architecture proposed by Inmon (1999), a data warehouse with dependent data marts; the second is the data mart bus architecture with dimensional data marts proposed by Kimball (1996). The new proposed model will use the hub-and-spoke data warehouse architecture, which can be used for MDDB modelling.
2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING
The data warehouse architecture process begins with the ETL process, which ensures that the data passes the quality threshold. According to Evin (2001), it is essential to have the right dataset. ETL is an important component of the data warehouse environment, ensuring that datasets drawn from various OLTP systems are cleansed before they enter the data warehouse. ETL is also responsible for running scheduled tasks that extract data from OLTP systems. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, Colby, Cole, McKenna, Mulagund, and Wilhite, 2001). The complete ETL process is described in Table 2.3.
A data warehouse database can be populated from a wide variety of data sources in different locations, so collecting all the different datasets and storing them in one central location is an extremely challenging task (Calvanese, Giacomo, Lenzerini, Nardi, and Rosati, 2001). However, ETL processes reduce the complexity of data population through the simplified process depicted in Figure 2.2. The ETL process begins with data extraction from operational databases, where data cleansing and scrubbing are performed to ensure that all data is validated. The data is then transformed to meet the data warehouse standards before it is loaded into the data warehouse.
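The extract, cleanse, transform and load sequence described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the source rows, the validation rules and the table name visit_fact are hypothetical, and an in-memory SQLite database merely stands in for the data warehouse.

```python
import sqlite3

# Hypothetical operational rows: (patient_id, visit_date, fee), with typical defects.
SOURCE_ROWS = [
    ("P001", "2009-01-15", " 120.50 "),
    ("P002", "2009-01-16", "80"),
    ("P002", "2009-01-16", "80"),    # duplicate record
    ("P003", "", "45.00"),           # missing date, fails validation
    ("p004", "2009-01-17", "n/a"),   # non-numeric fee, fails validation
]


def extract():
    """Extract: read raw rows from the (simulated) operational source."""
    return list(SOURCE_ROWS)


def cleanse(rows):
    """Cleanse: drop duplicates and rows that fail basic validation."""
    seen, clean = set(), []
    for row in rows:
        patient_id, visit_date, fee = row
        if row in seen or not visit_date:
            continue
        try:
            float(fee)
        except ValueError:
            continue
        seen.add(row)
        clean.append(row)
    return clean


def transform(rows):
    """Transform: normalise identifiers and convert fees to numbers."""
    return [(pid.upper(), date, float(fee)) for pid, date, fee in rows]


def load(conn, rows):
    """Load: write the conformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visit_fact "
        "(patient_id TEXT, visit_date TEXT, fee REAL)"
    )
    conn.executemany("INSERT INTO visit_fact VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the four steps in order and return what reached the warehouse."""
    load(conn, transform(cleanse(extract())))
    return conn.execute(
        "SELECT patient_id, visit_date, fee FROM visit_fact ORDER BY patient_id"
    ).fetchall()
```

Calling run_etl(sqlite3.connect(":memory:")) loads only the two rows that pass validation; the duplicate and the two invalid rows are rejected before the load step.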
Zhou et al. (1995) state that during the data integration process in a data warehouse, ETL can assist in the import and export of operational data between heterogeneous data sources using an Object Linking and Embedding Database (OLE-DB) based architecture, in which the data is transformed so that all validated data is populated into the data warehouse.
Kimball’s (1996) data warehouse architecture, depicted in Figure 2.3, focuses on three important modules: “the back room”, “the presentation server” and “the front room”. ETL processes are implemented in the back room, where the data staging services gather the operational databases of all source systems and extract data from different file formats, systems and platforms. The second step is to run the transformation process, which removes inconsistencies to ensure data integrity. Finally, the data is loaded into data marts. The ETL processes are commonly executed from a job control via scheduled tasks. The presentation server is the data warehouse, where data marts are stored and processed. Data is stored in a star schema consisting of dimension and fact tables. The data is then processed in the front room, where it is accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.
Although ETL processes prove to be an essential component for ensuring data integrity in a data warehouse, complexity and scalability play an important role in deciding the type of data warehouse architecture. One way to achieve a scalable, non-complex solution is to adopt a “hub-and-spoke” architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency. A centralized data warehouse design can also influence how full access control over ETL processes is maintained.
ETL processing in a hub-and-spoke data warehouse architecture is recommended by Inmon (1999) and Kimball (1996). The hub is the data warehouse, populated after data has been processed from the operational databases through the staging database, and the spokes are the data marts that distribute the data. Sherman, R. (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions. The many-to-many approach, by comparison, is more complicated and costly.
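The difference in complexity can be made concrete with a small arithmetic sketch. It assumes, for illustration only, that every source system must feed every data mart; the function name and the counts are hypothetical. With a hub, each source and each mart needs one interface to the hub, whereas point-to-point integration needs one interface per source and mart pair.

```python
def interface_count(sources: int, marts: int) -> dict:
    """Count the interfaces needed to connect sources to data marts."""
    return {
        # Point-to-point: every source feeds every mart directly.
        "many_to_many": sources * marts,
        # Hub-and-spoke: each source feeds the hub, the hub feeds each mart.
        "hub_and_spoke": sources + marts,
    }
```

For 5 source systems and 8 data marts, this gives 40 point-to-point interfaces against 13 through the hub.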
2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS
Building a data warehouse is a challenging task, as a data warehouse project has unique characteristics that may influence the overall reliability and robustness of the data warehouse. These factors can be addressed during the analysis, design and implementation phases to ensure a successful data warehouse system. Section 2.2.3.1 focuses on the factors that influence data warehouse project failure. Section 2.2.3.2 discusses the success factors, including implementing the correct model to support a successful data warehouse project.
2.2.3.1 DATA WAREHOUSE FAILURE FACTORS
Hayen, Rutashobya, and Vetter (2007) show that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year. It is estimated that two-thirds of the efforts to set up data warehouse projects will eventually fail. Hayen et al. (2007), citing the work of Briggs (2002) and Vassiliadis (2004), identified three groups of factors behind the failure of data warehouse projects: environment, project and technical factors, as shown in Table 2.4.
Environmental factors relate to organizational changes in terms of business, politics, mergers and takeovers, and to a lack of top management support. They include human error, corporate culture, the decision-making process and poor change management (Watson, 2004; Hayen et al., 2007).
Poor technical knowledge of the requirements for data definitions and data quality across different organizational units may cause data warehouse failure. Insufficient knowledge of data integration, and poor selection of the data warehouse model and data warehouse analysis applications, may also cause major failures.
In spite of heavy investment in hardware, software and people, poor project management may lead to data warehouse project failure. For example, assigning a project manager who lacks knowledge and project experience in data warehousing may impede the quantification of the return on investment (ROI) and the achievement of the project triple constraint (cost, scope, time).
Data ownership and accessibility is another potential cause of data warehouse project failure. It is considered a sensitive issue within an organization, where sharing one’s own data or acquiring someone else’s data is seen as losing authority over the data (Vassiliadis, 2004). Restrictions are therefore placed on any department declaring total ownership of clean, error-free data, since such a declaration might cause problems over data ownership rights.
2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS
Hwang, M.I. (2007) stresses that data warehouse implementation is an important area of research and industrial practice, but that only a few studies have assessed the critical success factors for data warehouse implementations. He surveyed six data warehouse studies (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. He concluded his survey with a list of success factors that influence data warehouse implementation, as depicted in Figure 2.8. He identifies eight implementation factors that directly affect the six selected success variables.
The data warehouse success factors mentioned above provide an important guideline for implementing successful data warehouse projects. The study by Hwang, M.I. (2007) shows that an integrated selection of factors such as end user participation, top management support and the acquisition of quality source data with well-defined business needs plays a crucial role in data warehouse implementation. In addition, other factors highlighted by Hayen, R.L. (2007), citing the work of Briggs (2002), Vassiliadis (2004) and Watson (2004), such as project, environment and technical knowledge, also influence data warehouse implementation.
In the new proposed model, the hub-and-spoke architecture is used as a “central repository service”, as many scholars including Inmon, Kimball, Evin, Sherman and Nicola adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across a local or wide area network, depending on business requirements. In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL, a staging database or operational data store, data marts, MDDB, OLAP and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouses may overwrite old data with new data, while others may maintain a history and audit trail of all changes to the data.
2.3 ONLINE ANALYTICAL PROCESSING
The OLAP Council (1997) defines OLAP as a group of decision support systems that facilitate fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational datasets, mainly from a data warehouse, into MDDB, which allow optimal data retrieval and trend analysis.
According to Chaudhuri (1997), Burdick, D. et al. (2006) and Vassiliadis, P. (1999), OLAP is an important concept for strategic database analysis. OLAP has the ability to analyse large amounts of data to extract valuable information. Analytical development can take place in the business, education or medical sectors. The technologies of the data warehouse, OLAP and analysis tools support that ability.
OLAP enables the discovery of patterns and relationships contained in business activity by querying large volumes of data from multiple database source systems at one time (Nigel, P., 2008). Processing database information using OLAP requires an OLAP server to organize, transform and build the MDDB. The MDDB is then separated into cubes so that client OLAP tools can perform data analysis aimed at discovering new patterns and relationships between the cubes. Some popular OLAP server software programs include Oracle (C), IBM (C) and Microsoft (C).
Madeira (2003) supports the view that OLAP and the data warehouse are complementary technologies that blend together. The data warehouse stores and manages data, while OLAP transforms data warehouse datasets into strategic information. OLAP functions range from basic navigation and browsing (often known as “slice and dice”) to calculations and more serious analysis such as time series and complex modelling. As decision-makers implement more advanced OLAP capabilities, they move from basic data access to the creation of information and the discovery of new knowledge.
2.3.4 OLAP ARCHITECTURE
In comparison to the data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view to aggregate data and provide rapid access to strategic information for analysis. There are three types of OLAP architecture, distinguished by the method in which they store multidimensional data and perform analysis operations on that dataset (Nigel, P., 2008). The categories are multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP).
In MOLAP, as depicted in Diagram 2.11, datasets are stored and summarized in a multidimensional cube. The MOLAP architecture can perform faster than ROLAP and HOLAP (C). MOLAP cubes are designed and built for rapid data retrieval to support efficient slicing and dicing operations. MOLAP can perform complex calculations, which are pre-generated when the cube is created. MOLAP processing is restricted to the initial cube that was created and is not bound to any additional replication of the cube.
In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide the OLAP slicing and dicing functionality. ROLAP is the slowest of the OLAP flavours. ROLAP relies on manipulating the data directly in the relational database to give the appearance of conventional OLAP slicing and dicing functionality. Essentially, each slicing and dicing action is equivalent to adding a “WHERE” clause to the SQL statement. (C)
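The correspondence between slicing and a “WHERE” clause can be sketched as follows. This is an illustrative example only: the sales table, its columns and its rows are hypothetical, and SQLite is used purely as a convenient relational engine rather than as a ROLAP product.

```python
import sqlite3

ALLOWED_COLUMNS = {"region", "year", "product"}


def build_sales_db():
    """Create a small, hypothetical relational sales table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, year INTEGER, product TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [
            ("North", 2008, "A", 100.0),
            ("North", 2009, "A", 150.0),
            ("South", 2008, "A", 80.0),
            ("South", 2009, "B", 60.0),
            ("North", 2009, "B", 40.0),
        ],
    )
    return conn


def rolap_query(conn, **filters):
    """Total amount per product; each keyword filter adds one WHERE predicate."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    sql = "SELECT product, SUM(amount) FROM sales"
    if filters:
        # One slice or dice action corresponds to one extra predicate here.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += " GROUP BY product ORDER BY product"
    return conn.execute(sql, tuple(filters.values())).fetchall()
```

Calling rolap_query(conn, year=2009) slices on one dimension, and rolap_query(conn, year=2009, region="North") adds a second predicate. Every call issues a fresh SQL query, which is consistent with the performance concern raised for ROLAP.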
ROLAP can manage large amounts of data and has no inherent limitation on data size. ROLAP can leverage the intrinsic functionality of a relational database. ROLAP is slow because each ROLAP activity is essentially one or more SQL queries against the relational database. The query time and the number of SQL statements executed depend on the complexity of the SQL statements, and can become a bottleneck if the underlying dataset is large. ROLAP depends on generating SQL statements to query the relational database, and SQL does not cater for all needs, so ROLAP technology is conventionally limited by what SQL functionality can offer. (C)
HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP. Data is stored in ROLAP relational database tables and the aggregations are stored in a MOLAP cube. For summary information, HOLAP leverages cube technology for faster performance. To retrieve detailed information, HOLAP can drill down from the multidimensional cube into the underlying relational data. (C)
In OLAP architectures (MOLAP, ROLAP and HOLAP), datasets are organized in a multidimensional format, which involves the creation of multidimensional blocks called data cubes (Harinarayan, 1996). A cube in an OLAP architecture may have three or more axes (dimensions). Each axis (dimension) represents a logical category of data. One axis may, for example, represent the geographic location of the data, while others may indicate a period of time or a specific school. Each of the categories, which will be described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.
Cabibo (1997) states that OLAP partitions are normally stored on an OLAP server, with the relational database frequently stored on a separate server. The OLAP server must query across the network whenever it needs to access the relational tables to resolve a query. The impact of querying across the network depends on the performance characteristics of the network itself. Even when the relational database is placed on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With an OLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.
2.3.3 OLAP FUNCTIONALITY
OLAP functionality offers dynamic multidimensional analysis, supporting end users with analytical activities that include calculations and modelling applied across dimensions, trend analysis over time periods, slicing subsets for on-screen viewing and drilling down to deeper levels of records (OLAP Council, 1997). OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps the end user integrate enterprise information through comparative, customized viewing and through analysis of historical and present data in various “what-if” data model scenarios. This is achieved through the use of an OLAP server, as depicted in Diagram 2.9.
OLAP functionality is provided by an OLAP server. The design and data structures of the OLAP server are optimized for fast information retrieval in any orientation and for flexible calculation and transformation of raw data. The OLAP server may either physically stage the processed multidimensional information to deliver consistent and fast response times to end users, or populate its data structures in real time from relational databases, or offer a choice of both.
Essentially, OLAP creates information in cube form, which allows more complex analysis than a relational database. OLAP analysis techniques employ ‘slice and dice’ and ‘drilling’ methods to segregate data into units of information depending on given parameters. A slice fixes a single value for one or more dimensions and so selects a subset of the multidimensional array. The dice function applies the slice function on more than two dimensions of the multidimensional cube. The drilling function allows the end user to traverse between summarized data and the most detailed data unit, as depicted in Diagram 2.10.
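The three operations can be sketched on a tiny in-memory cube. The sketch below is an assumption-laden illustration, not an OLAP server: the dimension names, the cell values and the function names slice_cube, dice_cube and aggregate are all hypothetical, and the cube is simply a dictionary keyed by dimension members.

```python
from collections import defaultdict

# Hypothetical dimensions and cube cells: (region, year, quarter, product) -> amount.
DIMS = ("region", "year", "quarter", "product")

CELLS = {
    ("North", 2008, "Q1", "A"): 10,
    ("North", 2008, "Q2", "A"): 20,
    ("North", 2009, "Q1", "B"): 30,
    ("South", 2008, "Q1", "A"): 40,
    ("South", 2009, "Q2", "B"): 50,
}


def slice_cube(cells, dim, value):
    """Slice: fix a single member of one dimension."""
    i = DIMS.index(dim)
    return {key: amount for key, amount in cells.items() if key[i] == value}


def dice_cube(cells, **allowed):
    """Dice: restrict several dimensions at once to the given members."""
    wanted = {DIMS.index(dim): set(members) for dim, members in allowed.items()}
    return {
        key: amount
        for key, amount in cells.items()
        if all(key[i] in members for i, members in wanted.items())
    }


def aggregate(cells, *dims):
    """Summarise the cube at the level given by dims (fewer dims = drill up)."""
    idx = [DIMS.index(dim) for dim in dims]
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in idx)] += amount
    return dict(totals)
```

Here aggregate(CELLS, "year") gives the condensed yearly totals, and aggregate(CELLS, "year", "quarter") drills down one level of the time hierarchy to quarterly totals.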
2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA
The base of every data warehouse system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables, arranged as a star schema or snowflake schema (Kimball, 1999). A schema is a collection of database objects such as tables, views and indexes (Inmon, 1996). To aid understanding of dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling:
In designing data models for a data warehouse, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex: a simple star consists of one fact table, while a complex star can have more than one fact table.
Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer or foreign key to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. A tuple consists of a unit of data extracted from a cube, identified by a member from one or more dimension tables (C, http://msdn.microsoft.com/en-us/library/aa216769%28SQL.80%29.aspx). Each dimension table consists of columns that correspond to attributes of the dimension. Diagram 2.14 shows an example of a star schema for a Medical Informatics System.
Star schemas do not explicitly provide support for attribute hierarchies, which makes them less suitable for architectures such as MOLAP that require many hierarchies of dimension tables for efficient drilling of datasets.
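A star schema of this kind can be sketched with one fact table and two dimension tables. The sketch is illustrative and does not reproduce Diagram 2.14: the table names dim_patient, dim_date and fact_visit, their columns and the sample rows are assumptions chosen to resemble a medical setting, and SQLite is used only as a convenient relational engine.

```python
import sqlite3

# Hypothetical star schema: the fact table holds foreign keys and numeric measures.
STAR_DDL = """
CREATE TABLE dim_patient (
    patient_key INTEGER PRIMARY KEY, gender TEXT, age_group TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER
);
CREATE TABLE fact_visit (
    patient_key INTEGER REFERENCES dim_patient(patient_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    visit_count INTEGER,
    total_fee   REAL
);
"""


def build_star():
    """Create the schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_DDL)
    conn.executemany(
        "INSERT INTO dim_patient VALUES (?, ?, ?)",
        [(1, "F", "18-40"), (2, "M", "41-65")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20090101, 2009, 1), (20090201, 2009, 2)],
    )
    conn.executemany(
        "INSERT INTO fact_visit VALUES (?, ?, ?, ?)",
        [(1, 20090101, 2, 200.0), (2, 20090101, 1, 90.0), (1, 20090201, 1, 110.0)],
    )
    return conn


def fee_by_gender_and_month(conn):
    """Join the fact table to each dimension once and aggregate the measure."""
    return conn.execute(
        """
        SELECT p.gender, d.month, SUM(f.total_fee)
        FROM fact_visit AS f
        JOIN dim_patient AS p ON p.patient_key = f.patient_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY p.gender, d.month
        ORDER BY p.gender, d.month
        """
    ).fetchall()
```

Each fact row carries one key per dimension, which serves as its multidimensional coordinate, and the query needs only one join per dimension.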
Snowflake schemas provide a refinement of star schemas in which the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Diagram 2.15. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. (C)
Levene, M. (2003) stresses that, in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema mainly depends on business needs.
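Normalizing a dimension can be illustrated by moving a product category out of the product dimension into its own lookup table. The sketch below is hypothetical throughout: the tables dim_category, dim_product and fact_sales and their rows are assumptions, and it makes no claim about the relative performance of the two schema types.

```python
import sqlite3

# Hypothetical snowflake: the product dimension references a category lookup table.
SNOWFLAKE_DDL = """
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
"""


def build_snowflake():
    """Create the normalized schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SNOWFLAKE_DDL)
    conn.executemany(
        "INSERT INTO dim_category VALUES (?, ?)", [(1, "Drug"), (2, "Device")]
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(10, "Aspirin", 1), (11, "Ibuprofen", 1), (12, "Syringe", 2)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?)",
        [(10, 5.0), (11, 7.0), (12, 3.0), (10, 5.0)],
    )
    return conn


def sales_by_category(conn):
    """Walk the hierarchy fact -> product -> category through two joins."""
    return conn.execute(
        """
        SELECT c.category_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_category AS c ON c.category_key = p.category_key
        GROUP BY c.category_name
        ORDER BY c.category_name
        """
    ).fetchall()
```

The category name is stored once per category rather than once per product, at the cost of one more lookup table to maintain and one more join per query.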
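The simplest case, aggregating the fact table on one selected dimension, can be sketched as follows. The fact table, the summary table name summary_sales_by_year and the rows are hypothetical, and a plain table created from a grouped query is only one possible representation of summary data.

```python
import sqlite3


def build_summary():
    """Create a fact table and a summary table pre-aggregated by year."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (region TEXT, year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [("North", 2008, 100.0), ("South", 2008, 80.0), ("North", 2009, 150.0)],
    )
    # Pre-aggregate once, so later queries read the small summary table.
    conn.execute(
        """
        CREATE TABLE summary_sales_by_year AS
        SELECT year, SUM(amount) AS total_amount, COUNT(*) AS row_count
        FROM fact_sales
        GROUP BY year
        """
    )
    return conn


def total_for_year(conn, year):
    """Answer a yearly total from the summary table instead of the fact table."""
    row = conn.execute(
        "SELECT total_amount FROM summary_sales_by_year WHERE year = ?", (year,)
    ).fetchone()
    return row[0] if row else None
```

A summary table built this way must be refreshed whenever the fact table changes; the sketch does not address that step.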
2.3.2 OLAP Evaluation
As OLAP technology takes a prominent place in the data warehouse industry, there should be a suitable assessment tool to evaluate it. E.F. Codd not only introduced OLAP but also provided a set of criteria, known as the ‘Twelve Rules’, for assessing the capability of OLAP products, covering areas such as data manipulation, unlimited dimensions and aggregation levels and flexible reporting, as shown in Table 2.8 (Codd, 1993):
Codd’s twelve rules of OLAP provide an essential tool for verifying that the OLAP functions and OLAP models used are able to produce the desired results. Berson, A. (2001) stressed that a good OLAP system should also support complete database management tools, as an integrated, centralized utility that allows database administrators to distribute databases within the enterprise. OLAP’s drilling mechanism within the MDDB allows drill-down right to the source or root of the detail record level. This implies that the OLAP tool permits a smooth transition from the MDDB to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature for preventing stability issues in operations and usability problems as the size of the database increases.
2.3.1 OLTP and OLAP
The design of OLAP for multidimensional cubes is entirely different from that of OLTP for databases. OLTP is implemented in relational databases to support daily processing in an organization. The main function of an OLTP system is to capture data into computers. OLTP allows effective data manipulation and storage for daily operations, resulting in huge quantities of transactional data. Organisations build multiple OLTP systems to handle huge quantities of daily operational transaction data in a short period of time.
OLAP is designed for data access and analysis to support the strategic decision-making process of managerial users. OLAP technology focuses on aggregating datasets into a multidimensional view without hindering system performance. Han, J. (2001) characterizes OLTP systems as “customer oriented” and OLAP as “market oriented”. He summarized the major differences between OLTP and OLAP systems based on 17 key criteria, as shown in Table 2.7.
It is complicated to merge OLAP and OLTP into one centralized database system. The dimensional data design model used in OLAP is much more effective for querying than the relational design used in OLTP systems. OLAP may use one central database as its data source, whereas OLTP uses different data sources from different database sites. The dimensional design of OLAP is not suitable for OLTP systems, mainly due to redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems, one OLTP and one OLAP (Poe, V., 1997).
We can conclude that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers.
2.4 DATA MINING
Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J. et al., 1996; Frawley, 1992) have defined data mining as discovering hidden patterns in historical datasets by using pattern recognition, as it involves searching for specific, unknown information in a database. Chung, H. (1999) and Fayyad et al. (1996) referred to data mining as a step in knowledge discovery in databases: the process of analysing data and extracting knowledge from a large database, also known as a data warehouse (Han, J., 2000), and turning it into useful information.
Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a data warehouse. The results of the extraction uncover hidden patterns and inconsistencies that are not visible in the existing datasets. Such hidden patterns and data inconsistencies cannot be discovered using conventional data analysis and query tools. Data mining techniques differ from the conventional data analysis approach in that data mining extracts hidden patterns from a dataset, while conventional data analysis tools only work from assumptions about the results in a dataset.
Several data mining techniques are used in different areas of application. Data mining techniques cover association, classification, clustering and prediction (Citation). Freitas (2002) stressed that a data mining problem has to be considered in terms of the potential of different data mining techniques to solve it. Thus, to carry out a successful data mining application with the chosen techniques, a process model is required, as it includes a series of steps that guide the work towards acceptable results. Section 2.4.1 discusses the data mining techniques and the preferred techniques used in this study. Section 2.4.2 presents the detailed data mining process model and discusses the process model used throughout this research in deploying the experimental application tools, which are discussed further in Chapter 4.
2.4.1 Data Mining Techniques
In general, data mining is capable of predicting or forecasting future events based on a historical data set, and its purpose is to find hidden patterns in the database. Several data mining techniques are used and applied in different areas of data. Knowledge of how each data mining technique is used is essential for selecting a suitable technique for a specific area.
According to Mailvaganam (2007), data mining techniques fall into two types of model, descriptive and predictive, as described in Diagram 2.16. Descriptive models are generated by employing association rule discovery and clustering algorithms. Predictive models are generated by using classification and regression algorithms. Descriptive models reveal hidden relationships in a given data set; for example, in a student database, students who pass mathematics tend to pass science. Predictive models estimate future results from a given data set; for example, in marketing, a customer’s gender, age, and purchase history might predict the likelihood of a future sale.
Diagram 2.16 Descriptive and Predictive Model (adapted from Mailvaganam, 2007)
A data mining algorithm is the mechanism that generates a data mining model. In order to generate a data mining model, a data mining algorithm needs to be defined. The algorithm then analyses a given set of data to look for identifiable hidden patterns and trends. The results are used by the algorithm to define the parameters of the mining model. These parameters are then applied across the whole data set to extract actionable patterns and detailed statistics. The data mining algorithms are discussed in more detail as follows:
The association algorithm is a powerful correlation counting engine. It performs scalable and efficient analysis to identify items in a collection that occur together. (Citation)
Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. This is the decision tree algorithm, which covers both classification and regression. It can also build multiple trees in a single model to perform association analysis. (Citation)
The clustering algorithm includes two different clustering techniques: EM (expectation and maximization) and K-means. It automatically detects the number of natural clusters in the data set and discovers groups or categories. (Citation)
Prediction can be viewed as constructing and using a model to assess the class of an unlabelled sample, or the value ranges of an attribute that a given sample is likely to have. (Citation)
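The idea of association as a co-occurrence counting engine can be illustrated with a minimal Python sketch. It is not the algorithm of any particular product: the student records, the function names `pair_counts` and `rule_metrics`, and the use of support and confidence as measures are all assumptions made for illustration, reusing the earlier example of students who pass mathematics and science.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how often each unordered pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    support = both / total
    confidence = both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical records: the subjects each student passed.
students = [
    {"mathematics", "science"},
    {"mathematics", "science", "history"},
    {"mathematics"},
    {"science", "history"},
    {"mathematics", "science"},
]

counts = pair_counts(students)
support, confidence = rule_metrics(students, "mathematics", "science")
print(counts[("mathematics", "science")], support, confidence)
```

In this made-up data set, three of the five students passed both subjects, and three of the four students who passed mathematics also passed science, so the rule has a support of 0.6 and a confidence of 0.75.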
In data mining, the choice of the best algorithm is based on the specific business use case. Different data mining algorithms can be applied to the same business use case data set; each algorithm produces a different set of results, and some algorithms can produce more than one type of result. Data mining algorithms are flexible and do not have to be used separately. Within a single data mining solution, one algorithm can first be used to explore the data set, and another algorithm can then be used to predict a specific result based on the explored data (Citation). For example, an algorithm such as clustering can be used to explore the data, recognize patterns and break the data set into groups, and another algorithm, such as a decision tree model based on the classification algorithm, can then predict a specific outcome based on that data. (Citation)
Data mining models are used to predict values, find hidden trends and generate summaries of data. It is important to know which data mining algorithm to use for a given business use case. Table 2.11 shows suggestions on which algorithms to use for a data mining solution, adapted from Microsoft® (2009).
As a data mining model is built, deployed and trained, the details of its results are stored as data mining model nodes. Each node holds the attributes, description, probabilities and distribution information for the model element it represents, together with its relation to other nodes. Every data mining model node has an associated node type that helps to characterise the data mining model. The model node is the uppermost node, regardless of the actual structure of the model. All data mining models start with a model node. (Citation)
In this study, we used decision tree and clustering models as the main data mining techniques. These data mining techniques and models are discussed further in Chapters 3 and 4.
2.4.1.1 Decision Tree Model
Decision trees are a standard data mining model for classification and prediction (Citation). Decision trees are preferred over neural networks because they represent rules, and these rules can be easily expressed and understood. A decision tree model classifies an instance by starting at the root of the tree and moving down to a leaf node, which provides the classification of the instance.
A decision tree model is a tree-like structure built with classification techniques, in which each node represents a question used to further classify the data. A decision tree is efficient and can be built faster than other models, while in some cases achieving results of similar accuracy. It is therefore appropriate for large training data sets. Depending on its complexity, a decision tree model is easy to understand and interpret, and it handles non-numerical data. The various methods used to create decision trees have been used widely for decades, and there is a large body of work describing these statistical techniques (Citation). According to Witten et al (2000), the decision tree approach is known for its fast data mining modelling, as it uses a divide-and-conquer approach.
Witten et al (2000) describe the decision tree as being constructed recursively. A model is placed at the root node of the tree, and one or more tree nodes are created, one for each possible value. The training set is then split into subsets, one per tree node, which make up Decision Tree 1, Decision Tree 2 and so on. This process is repeated recursively for each branch until all instances at a node have the same classification, at which point construction of that branch stops. In other words, a leaf node that holds a single class, “true” or “false”, cannot be split further, and this ends the recursion. The objective is to build a decision tree model that is as simple as possible while producing good classification or predictive performance. (Citation)
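The recursive, divide-and-conquer construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the split attribute is chosen by lowest weighted entropy (the text above does not name a split criterion), and the attribute names, the toy student records and the functions `build_tree` and `classify` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def build_tree(rows, attributes, target):
    """Recursively split rows until every node holds a single class."""
    labels = [row[target] for row in rows]
    # Stop when all rows share one class, or no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(attribute):
        # Weighted entropy of the subsets produced by splitting on attribute.
        groups = {}
        for row in rows:
            groups.setdefault(row[attribute], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)  # assumed split criterion
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}


def classify(tree, row):
    """Walk from the root to a leaf; only handles values seen in training."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical training set.
rows = [
    {"attendance": "high", "assignments": "done", "passed": "true"},
    {"attendance": "high", "assignments": "missing", "passed": "true"},
    {"attendance": "low", "assignments": "done", "passed": "true"},
    {"attendance": "low", "assignments": "missing", "passed": "false"},
]
tree = build_tree(rows, ["attendance", "assignments"], "passed")
print(classify(tree, {"attendance": "low", "assignments": "missing"}))
```

In this toy data set the "high" attendance branch becomes a leaf immediately because both of its rows are "true", while the "low" branch needs one further split before each leaf holds a single class.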
In a decision tree-based model, as shown in Diagram 2.17, the model node serves as the root node of the tree. A decision tree model may have many tree nodes that make up the whole structure, but within each tree there is only one tree node from which all other nodes, such as interior and distribution nodes, descend. An interior node, also known as a “branch” node, represents a split in the tree model. A distribution node, also known as a “leaf” node, describes the distribution of values for one or more attributes according to the data represented by that node. A decision tree-based model always has one model node and at least one tree node (Citation).
2.4.1.2 Clustering Model
Clustering is a data mining technique used to separate a data set into groups, or clusters, based on the similarity between the data entities (Citation). The entities of a cluster share common features that differentiate them from other clusters. Similarity is measured in terms of the distance between elements or entities. Unlike classification, which has predefined labels (supervised learning), clustering is considered unsupervised learning because it comes up with the labels automatically (Citation).
According to Kogan, J. et al. (2006), clustering techniques are divided into partitioning and hierarchical methods. Partitioning methods construct various partitions of similar and dissimilar items into groups or clusters, which are evaluated against some criterion. Hierarchical methods build a hierarchical decomposition of the data set progressively, using either a top-down or a bottom-up approach (Citation). The top-down approach begins with one cluster containing all the data and breaks it down into smaller clusters, known as sub-clusters. The bottom-up approach begins with small clusters and combines them recursively into larger clusters in a nested manner. The advantage of hierarchical clustering over partitioning is its flexibility with regard to the level of granularity. Clustering techniques are assessed in terms of features related to the size of the cluster, the distance between its parts, or its shape. Clustering techniques support applications that require segmenting the data into common groups (Citation).
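The bottom-up approach can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the data are one-dimensional values, the distance between two clusters is taken as the distance between their closest members (single linkage, which is only one of several possible choices), and the function name `agglomerative` and the sample values are hypothetical.

```python
def agglomerative(points, target_clusters):
    """Bottom-up clustering of 1-D values using single linkage."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > max(target_clusters, 1):
        best = None
        # Find the two clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair into one larger cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical measurements.
values = [1.0, 1.5, 5.0, 5.2, 9.0]
fine = agglomerative(values, 3)
coarse = agglomerative(values, 2)
print(fine)
print(coarse)
```

Stopping the merging earlier or later yields a finer or coarser grouping of the same data, which is the flexibility in granularity that hierarchical methods offer.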
In a clustering-based model, as shown in Diagram 2.18, the model node serves as the root node of the clusters (Citation). A cluster node gathers the attributes and data for the abstraction of a specific cluster; that is, it gathers the set of distributions that make up a cluster of cases for the data mining model. A clustering-based model always has one model node and at least one cluster node. The user does not need to specify the number of clusters in advance. Instead, the user specifies how similar the records within an individual cluster should be, and the clustering process creates the number of clusters automatically. The clustering approach works best with categorical and non-repetitive variables.
2.4.2 Data Mining Process Model
A process model is required to implement a data mining project. It involves a sequence of steps designed to produce correct results. Examples of such process models are CRISP (Chapman et al, 2000) and TWOCROWS (Two Crows, 1999). In this study, the experimental application tools are based on the CRISP data mining process model. The different phases of the CRISP data mining process model are presented in Diagram 2.19 (Citation). The focus of this chapter is on the first three CRISP phases, which are relevant to the research objectives of this study.
According to Chapman et al (2000), the CRISP data mining model is a life cycle for a data mining project, which also includes the tasks and the relationships between them. The CRISP life cycle consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment. The arrows in the diagram indicate the most important and frequent dependencies between phases (Citation INCLUDE PAGE NUMBER).
The CRISP data mining process model begins with business understanding of the project objectives and requirements, which is needed to convert them into a data mining problem definition. The next step is data understanding, in which the data sets are examined to identify data quality problems and to discover interesting subsets from which to form hypotheses about hidden information. Once the data sets have been identified, the data preparation phase loads all data from the initial data sets into the modelling tools. This phase is executed multiple times to complete the transformation and cleaning of data for modelling. In the modelling phase, various techniques are applied to the data mining problem to obtain high-quality models for data analysis. In the evaluation phase, the model or models are thoroughly reviewed to ensure that they achieve the specified business objectives. Finally, the deployment phase produces anything from simple reports to complex data mining results; this phase is mainly triggered by the end users.
One of the major advantages of the CRISP data mining process model is that it is highly replicable, which supports this study. The process is flexible and can be applied to different types of data and used in any business area. It also provides a uniform framework for guidelines and documentation.
2.4.3 OLAP Mining
As mentioned in Section 2.4.2, data mining techniques are useful for mining hidden patterns in a relational database. According to Song Lin (2002), however, the combination of OLAP and data mining, known as OLAM or OLAP Mining, is a tool for mining hidden patterns in an MDDB. OLAM provides suggestions to the decision maker according to an internal model, using a few quantitative data mining methods such as clustering or classification. Data mining is introduced into OLAP as it stands; OLAM does not involve the development of any new data mining algorithm. OLAM is the process of applying intelligent methods to extract data patterns; it provides automatic data analysis and prediction, gathers hidden patterns and predicts unknown information with data mining tools. Diagram 2.20 (Citation) depicts the OLAM concept, in which the MDDB is integrated with data mining algorithms to produce effective reporting for decision makers.
According to Hans, J. (1997), the OLAM architecture provides a modular and systematic design for MDDB mining on a data warehouse. As shown in Diagram 2.21 (Citation), the OLAM architecture consists of four layers. Layer 1 is the database layer, which consists of systematically constructed relational databases and performs data cleaning, integration and consolidation in building the data warehouse. Layer 2 is the MDDB layer, which offers an MDDB for OLAP and data mining. Layer 3 is the crucial layer for data mining, as the OLAP and OLAM engines work together to process and mine the data. Lastly, layer 4 holds the graphical user interfaces, which allow users to build data warehouses and MDDBs, perform OLAP and mining, and visualize and explore the results. A proficient OLAM architecture should use the existing infrastructure in this way rather than constructing everything from scratch.
Diagram 2.21 Online Analytical Mining (adapted from Hans, 1998 page number)
The OLAM architecture benefits an OLAP-based system as it provides an exploratory data analysis environment using data mining technology (Citation). As depicted in layers 1, 2 and 3 of Diagram 2.21 (Citation), the integration of database, data warehouse, MDDB, OLAP and OLAM makes data mining possible on different subsets of data and at different levels of abstraction, by drilling, pivoting, slicing and dicing the MDDB and the intermediate data mining results. It thus simplifies interactive data mining functions such as decision trees and clustering, whose results can be viewed with flexible knowledge visualization tools.
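To make the idea of mining different subsets at different levels of abstraction concrete, the following Python sketch shows two basic cube operations on a tiny in-memory fact table. It is only an illustration: a real MDDB stores pre-aggregated cells, and the dimension names, the sales figures and the functions `slice_cube` and `roll_up` are assumptions introduced for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
DIMENSIONS = ("year", "region", "product")
facts = [
    (2008, "North", "Laptop", 120),
    (2008, "South", "Laptop", 80),
    (2008, "North", "Phone", 200),
    (2009, "North", "Laptop", 150),
    (2009, "South", "Phone", 90),
]


def slice_cube(rows, dimension, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    index = DIMENSIONS.index(dimension)
    return [row for row in rows if row[index] == value]


def roll_up(rows, keep):
    """Roll up: total the measure over every dimension not listed in keep."""
    indexes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in indexes)] += row[-1]
    return dict(totals)


by_year = roll_up(facts, ["year"])
north_by_product = roll_up(slice_cube(facts, "region", "North"), ["product"])
print(by_year)
print(north_by_product)
```

A mining algorithm such as clustering or a decision tree could then be run on `north_by_product` (a subset at product level) or on `by_year` (a higher level of abstraction), rather than on the raw fact rows.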
According to Hans (1998), OLAM integrates multiple data mining techniques, such as association, classification and clustering, to mine different portions of the data warehouse at different levels of abstraction. Data mining can be done with programs that analyse the data automatically. Businesses are using data mining to sift through the huge amounts of information they gather in order to better understand customer behaviour and preferences. Vacca (2002) summarized that OLAM is one of the different concepts and architectures for data mining systems: it combines OLAP with data mining and mines knowledge in MDDB cubes.
The results generated by the OLAP and data mining applications are evaluated using the precision evaluation method. It covers the decision tree and clustering techniques, as shown in Table 2.12.
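The text does not spell out how precision is computed, so the following Python sketch assumes the standard definition: of all the cases a model assigns to a class, precision is the fraction that truly belong to that class. The function name `precision` and the sample labels are hypothetical and are not taken from Table 2.12.

```python
def precision(actual, predicted, positive):
    """Fraction of cases predicted as `positive` that really are positive."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if p == positive and a == positive)
    false_pos = sum(1 for a, p in pairs if p == positive and a != positive)
    predicted_pos = true_pos + false_pos
    # Assumed convention: precision is 0.0 when nothing is predicted positive.
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical outcomes for five students.
actual = ["pass", "pass", "fail", "pass", "fail"]
predicted = ["pass", "fail", "pass", "pass", "fail"]
score = precision(actual, predicted, "pass")
print(round(score, 3))
```

Here three students are predicted to pass and two of them actually passed, so the precision for the "pass" class is two thirds.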
In conclusion, based on the literature review, we identified that the hub-and-spoke data warehouse model has emerged as a prominent model, with six components that provide an efficient architecture. Multiple OLAP models were discussed, and the MOLAP model was identified as suitable for the MDDB architecture. Various methods, such as query mining and data mining, are used to drill into and extract knowledge from an MDDB. Since query mining tools such as the cube browser, pivot table and Web-OLAP (WOLAP) are mostly embedded in OLAP applications, we discuss them in Chapter 3. In this chapter we discussed data mining, especially the classification and clustering techniques that will be used in the OLAM approach.
Finally, we justify the use of Microsoft® data warehouse, OLAP and data mining tools in this research. Microsoft® provides a comprehensive data warehouse, OLAP and data mining platform with multiple capabilities (Kramer, 2002). Kramer (2002) compared the data warehouse offerings of Microsoft with those of Hyperion Solutions, ORACLE Corporation and IBM Corporation. A summary of Kramer’s findings is shown in Table 3.3. His evaluation indicated that the Microsoft®-based data warehouse, OLAP and data mining components come out ahead of those from Hyperion Solutions, ORACLE Corporation and IBM Corporation.