DATA WAREHOUSES & DATA MINING
Term Paper in Management Support System

Submitted By: Chitransh Naman, A22-JK903, 10900100
Submitted To: Anita Ma’am, Lecturer, MSS

ABSTRACT :- A data warehouse is a collection of integrated, subject-oriented, time-variant and non-volatile data in support of management's decision-making process. It has been described as the "single point of truth", the "corporate memory", the sole historical register of virtually all transactions that occur in the life of an organization. A fundamental concept of a data warehouse is the distinction between data and information.
Data is composed of observable and recordable facts that are often found in operational or transactional systems. At Rutgers, these systems include the registrar’s data on students (widely known as the SRDB), human resource and payroll databases, course scheduling data, and data on financial aid. In a data warehouse environment, data only comes to have value to end-users when it is organized and presented as information. Information is an integrated collection of facts and is used as the basis for decision-making. For example, an academic unit needs diachronic information about the instructional output of its different faculty members to gauge whether it is becoming more or less reliant on part-time faculty.

INTRODUCTION :- “The data warehouse is always a physically separate store of data transformed from the application data found in the operational environment”.
Data entering the data warehouse comes from the operational environment in almost every case. Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that, with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs. Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition.
Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization’s operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis. Data warehousing is a more formalised version of techniques that are already in use. For example, many sales analysis systems and executive information systems (EIS) get their data from summary files rather than operational transaction files. The method of using summary files instead of operational data is in essence what data warehousing is all about.
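The summary-file idea can be illustrated with a minimal Python sketch. The transaction records, field names and figures below are hypothetical and simply stand in for an operational transaction file; the sketch only shows how such records might be rolled up into a monthly summary that an analysis system would read instead of the raw transactions.

```python
from collections import defaultdict

# Hypothetical operational transaction records (illustrative values only).
transactions = [
    {"date": "2010-01-05", "region": "North", "product": "A", "amount": 120.0},
    {"date": "2010-01-17", "region": "North", "product": "A", "amount": 80.0},
    {"date": "2010-01-20", "region": "South", "product": "B", "amount": 200.0},
    {"date": "2010-02-02", "region": "North", "product": "B", "amount": 50.0},
]


def build_summary(rows):
    """Aggregate transactions into total sales per (month, region)."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # keep only the "YYYY-MM" part of the date
        totals[(month, row["region"])] += row["amount"]
    return dict(totals)


summary = build_summary(transactions)
for (month, region), total in sorted(summary.items()):
    print(month, region, total)
```

A sales analysis or EIS tool would then read `summary` rather than scanning every individual transaction.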
Some data warehousing tools neglect the importance of modelling and building a data warehouse and focus on the storage and retrieval of data only. These tools might have strong analytical facilities, but lack the qualities you need to build and maintain a corporate-wide data warehouse. These tools belong on the PC rather than the host. Your corporate-wide (or division-wide) data warehouse needs to be scalable, secure, open and, above all, suitable for publication.
NEED OF DATA WAREHOUSE :-
Missing data: Decision support (DS) requires historical data, which operational DBs do not typically maintain.
Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs and external sources.
Data quality: Different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.

DATA WAREHOUSE ARCHITECTURE :-
Components :-
• OPERATIONAL DATA SOURCES: data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available DBs, or DBs associated with an organization’s suppliers or customers.
• OPERATIONAL DATA STORE (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
• LOAD MANAGER: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
• WAREHOUSE MANAGER: performs all the operations associated with the management of the data in the warehouse.
The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
• QUERY MANAGER: also called the backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
• END-USER ACCESS TOOLS: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.

DATA MART :- It is a subset of a data warehouse that supports the requirements of a particular department or business function.
The characteristics that differentiate data marts and data warehouses include:
• a data mart focuses on only the requirements of users associated with one department or business function
• as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated
• data marts do not normally contain detailed operational data, unlike data warehouses.

META DATA :- Metadata is about controlling the quality of data entering the data stream. Batch processes can be run to address data degradation or changes to data policy. Metadata policies are enhanced by using metadata repositories.

IMPORTANCE OF META DATA :- The key concern is the integration of meta-data, that is, “data about data”.
• Meta-data is used for a variety of purposes, and the management of it is a critical issue in achieving a fully integrated data warehouse
• The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse
• The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data
• The meta-data associated with data management describes the data as it is stored in the warehouse
• Meta-data is required by the query manager to generate appropriate queries, and is also associated with the users of queries
• The major integration issue is how to synchronize the various types of meta-data used throughout the data warehouse.
The challenge is to synchronize meta-data between different products from different vendors using different meta-data stores
• Two major standards for meta-data and modeling in the areas of data warehousing and component-based development are those of the MDC (Meta Data Coalition) and the OMG (Object Management Group)
• A data warehouse requires tools to support the administration and management of such a complex environment.
• For the various types of meta-data and the day-to-day operations of the data warehouse, the administration and management tools must be capable of supporting the following tasks:
  • monitoring data loading from multiple sources
  • data quality and integrity checks
  • managing and updating meta-data
  • monitoring database performance to ensure efficient query response times and resource utilization.

DATA WAREHOUSING PROCESSES :- The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. In addition, after the data warehouse (detailed data) is created, several further data warehousing processes relevant to implementing and using the data warehouse are needed, including data summarization and data warehouse maintenance.

Extraction in Data Warehouse :- Extraction is the operation of extracting data from a source system for future use in a data warehouse environment. This is the first step of the ETL process.
After extraction, data can be transformed and loaded into the data warehouse. The extraction process need not involve complex algebraic database operations, such as joins and aggregate functions. Its focus is determining which data needs to be extracted and bringing that data into the data warehouse, specifically into the staging area. The data normally has to be extracted not just once but several times, in a periodic manner, to supply all changed data to the data warehouse and keep it up to date. Thus, data extraction is used not only in the process of building the data warehouse, but also in the process of maintaining it.
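As a rough illustration of the steps described above, the following Python sketch extracts rows unchanged into a staging area and then transforms and loads them. The source rows, the field names and the dictionary standing in for the warehouse are all assumptions made for the example, not part of any particular product.

```python
from datetime import date

# Hypothetical rows as they might sit in an operational system.
SOURCE_ORDERS = [
    {"order_id": 1, "customer": " alice ", "amount": "120.50", "order_date": "2010-03-01"},
    {"order_id": 2, "customer": "BOB", "amount": "80.00", "order_date": "2010-03-02"},
]


def extract(source):
    """Copy source rows as-is into a staging area (no joins or aggregates)."""
    return [dict(row) for row in source]


def transform(staged):
    """Clean and type the staged rows so they fit the warehouse layout."""
    cleaned = []
    for row in staged:
        cleaned.append({
            "order_id": row["order_id"],
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
            "order_date": date.fromisoformat(row["order_date"]),
        })
    return cleaned


def load(warehouse, rows):
    """Store transformed rows in the warehouse, keyed by order_id."""
    for row in rows:
        warehouse[row["order_id"]] = row
    return warehouse


warehouse = {}                    # stand-in for a warehouse table
staging = extract(SOURCE_ORDERS)  # step 1: extraction into the staging area
load(warehouse, transform(staging))  # steps 2 and 3: transformation and loading
print(warehouse)
```

Running the three steps again on a later copy of the source is what keeps the warehouse up to date; keying the load by `order_id` means a repeated row overwrites its earlier version rather than being duplicated.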
Very often, entire documents or tables from the data sources are extracted to the data warehouse or staging area, so that the extracted data contains the complete information from the data sources. There are two kinds of logical extraction methods in data warehousing.

Full Extraction :- The data is extracted completely from the data sources. As this extraction reflects all the data currently available on the data source, there is no need to keep track of changes to the data source since the last successful extraction.
The source data will be provided as-is and no additional logical information is necessary on the source site.

Incremental Extraction :- At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. The event may be the time of the last extraction or a more complex business event such as the last sale day of a fiscal period. This change information can be provided either by the source data itself or by a change table, where an appropriate additional mechanism keeps track of the changes alongside the originating transaction. In most cases, using the latter method means adding extraction logic to the data source. To preserve the independence of data sources, many data warehouses do not use any change-capture technique as part of the extraction process and use full extraction logic instead.
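A minimal sketch of incremental extraction, assuming the source data itself carries a last-modified timestamp (the `modified_at` field and the sample rows are hypothetical). Only rows changed since the previous extraction are returned, together with a new high-water mark to remember for the next run.

```python
from datetime import datetime

# Hypothetical source table; "modified_at" is an assumed last-modified column.
source_rows = [
    {"id": 1, "name": "Widget", "modified_at": datetime(2010, 4, 1, 9, 0)},
    {"id": 2, "name": "Gadget", "modified_at": datetime(2010, 4, 3, 14, 30)},
    {"id": 3, "name": "Gizmo", "modified_at": datetime(2010, 4, 5, 8, 15)},
]


def extract_incremental(rows, last_extracted_at):
    """Return the rows changed after the previous extraction and the new watermark."""
    changed = [row for row in rows if row["modified_at"] > last_extracted_at]
    # If nothing changed, keep the old watermark so the next run starts from it.
    new_watermark = max(
        (row["modified_at"] for row in changed), default=last_extracted_at
    )
    return changed, new_watermark


# First run: the previous extraction is assumed to have happened on 2 April 2010.
changed, watermark = extract_incremental(source_rows, datetime(2010, 4, 2))
print([row["id"] for row in changed])

# A second run with the new watermark finds nothing further to extract.
changed_again, watermark_again = extract_incremental(source_rows, watermark)
print(changed_again)
```

The sketch relies on the source maintaining `modified_at` reliably; where it does not, a change table or a full extraction remains the alternative.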
After a full extraction, the entire extracted data set can be compared with the previously extracted data to identify the changed data. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the data source. Change Data Capture is typically the most challenging technical issue in data extraction.

DATA MINING :- Data mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques.
Data mining can also be defined as the process of extracting hidden knowledge from large volumes of raw data, i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Alternative names for data mining include knowledge discovery (mining) in databases (KDD), knowledge extraction, and data/pattern analysis.
The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies.

How Data Mining Works :- While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries.
Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The well-known beer-diaper example, in which the two products were reportedly found to sell together, is an instance of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends.
For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

DATA MINING MODELS :-
1. Predictive Model
   a. Prediction: determining how certain attributes will behave in the future
   b. Regression: mapping of a data item to a real-valued prediction variable
   c. Classification: categorization of data based on combinations of attributes
   d. Time Series analysis: examining values of attributes with respect to time
2. Descriptive Model
   a. Clustering: grouping the most closely related data together into clusters
   b. Data Summarization: extracting representative information about the database
   c. Association Rules: associativity defined between data items to form relationships
   d. Sequence Discovery: determining sequential patterns in data based on the time sequence of actions

APPLICATIONS OF DATA WAREHOUSE :-
Exploiting Data for Business Decisions
The value of a decision support system depends on its ability to provide the decision-maker with relevant information that can be acted upon at an appropriate time. This means that the information needs to be:
Applicable.
The information must be current, pertinent to the field of interest and at the correct level of detail to highlight any potential issues or benefits.
Conclusive. The information must be sufficient for the decision-maker to derive actions that will bring benefit to the organisation.
Timely. The information must be available in a time frame that allows decisions to be effective.
Decision Support through Data Warehousing
One approach to creating a decision support system is to implement a data warehouse, which integrates existing sources of data with accessible data analysis techniques. An organisation’s data sources are typically departmental or functional databases that have evolved to service specific and localised requirements. Integrating such highly focussed resources for decision support at the enterprise level requires the addition of other functional capabilities:
Fast query handling. Data sources are normally optimised for data storage and processing, not for their speed of response to queries.
Increased data depth. Many business conclusions are based on the comparison of current data with historical data.
Data sources are normally focussed on the present and so lack this depth.
Business language support. The decision-maker will typically have a background in business or management, not in database programming. It is important that such a person can request information using words and not syntax.

The proliferation of data warehouses is highlighted by the “customer loyalty” schemes that are now run by many leading retailers and airlines. These schemes illustrate the potential of the data warehouse for “micromarketing” and profitability calculations, but there are other applications of equal value, such as:
• Stock control
• Product category management
• Basket analysis
• Fraud analysis
All of these applications offer a direct payback to the customer by facilitating the identification of areas that require attention.
This payback, especially in the fields of fraud analysis and stock control, can be of high and immediate value.

APPLICATIONS OF DATA MINING :-
• Banking: loan/credit card approval
  • predict good customers based on old customers
• Customer relationship management:
  • identify those who are likely to leave for a competitor
• Targeted marketing:
  • identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
  • from an online stream of events, identify fraudulent events
• Manufacturing and production:
  • automatically adjust knobs when process parameters change
• Medicine: disease outcome, effectiveness of treatments
  • analyze patient disease history: find relationships between diseases
• Molecular/Pharmaceutical:
  • identify new drugs
• Scientific data analysis:
  • identify new galaxies by searching for sub-clusters
• Web site/store design and promotion:
  • find affinity of visitors to pages and modify layout

CONCLUSION :- What we are seeing is two-fold, depending on the retailer's strategy:
1) Most retailers build data warehouses to target specific markets and customer segments. They're trying to know their customers. It all starts with CDI (customer data integration).
By starting with CDI, the retailers can build the DW around the customer.
2) On the other side, there are retailers who have no idea who their customers are, or feel they don’t need to: the world is their customer, and low prices will keep the world loyal. They use their data warehouse to control inventory and negotiate with suppliers.

The future will bring real-time data warehouse updates, with the ability to give the retailer a minute-to-minute view of what is going on in a retail location and to take action, either manually or through a condition triggered by the data warehouse data. The future belongs to those who:
1) Possess knowledge of the customer, and
2) Effectively use that knowledge.