Data mining and warehousing
What is Data Mining?
Data mining is the process of extracting information from large sets of data.
In other words, data mining is the procedure of mining knowledge from data.
The extracted information or knowledge can be used for any of the following
applications −
· Market Analysis
· Fraud Detection
· Customer Retention
· Production Control
· Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains −
· Market Analysis and Management
· Corporate Analysis & Risk Management
· Fraud Detection
Apart from these, data mining can also be used in the areas of production
control, customer retention, science exploration, sports, astrology, and
Internet Web Surf-Aid.
Market Analysis and Management
Listed below are the various areas of marketing where data mining is used −
· Customer Profiling − Data mining helps determine what kind of people buy
what kind of products.
· Identifying Customer Requirements − Data mining helps in identifying the
best products for different customers. It uses prediction to find the factors
that may attract new customers.
· Cross Market Analysis − Data mining finds associations and correlations
between product sales.
· Target Marketing − Data mining helps to find clusters of model customers
who share the same characteristics such as interests, spending habits, income,
etc.
· Determining Customer Purchasing Patterns − Data mining helps in determining
customer purchasing patterns.
· Providing Summary Information − Data mining provides various
multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the corporate sector −
· Finance Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
· Resource Planning − It involves summarizing and comparing resources and
spending.
· Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in credit card services and telecommunications to
detect fraud. For fraudulent telephone calls, it helps to find the destination
of the call, the duration of the call, the time of the day or week, etc. It
also analyzes the patterns that deviate from expected norms.
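One simple way to picture "patterns that deviate from expected norms" is a
z-score check on call durations. The sketch below is only an illustration: the
function name, the threshold of 2.0, and the sample durations are all
assumptions, and real fraud detection uses far richer features.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return the indices of calls whose duration is far from the average.

    A call is flagged when it lies more than `threshold` sample standard
    deviations from the mean (the threshold is an assumed value).
    """
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls are identical, so nothing deviates
    return [i for i, d in enumerate(durations) if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 90, 4, 3]
suspicious = flag_unusual_calls(call_minutes)
print(suspicious)
```

The same idea extends to the other attributes mentioned above, such as the
destination or the time of day, by scoring each one against its usual range.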
Data mining deals with the kind of patterns that can be mined. On
the basis of the kind of data to be mined, there are two categories of
functions involved in Data Mining −
- Descriptive
- Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data
in the database. Here is the list of descriptive functions −
- Class/Concept Description
- Mining of Frequent Patterns
- Mining of Associations
- Mining of Correlations
- Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts.
For example, in a company, the classes of items for sale include computers and
printers, and the concepts of customers include big spenders and budget
spenders. Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived in the following two ways −
· Data Characterization − This refers to summarizing the data of the class
under study. The class under study is called the Target Class.
· Data Discrimination − This refers to comparing the target class with one or
more predefined contrasting groups or classes.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kinds of frequent patterns −
· Frequent Item Set − A set of items that frequently appear together, for
example, milk and bread.
· Frequent Subsequence − A sequence of patterns that occurs frequently, such
as purchasing a camera followed by a memory card.
· Frequent Substructure − Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item sets or
subsequences.
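To make the idea of a frequent item set concrete, the following sketch counts
how often each pair of items appears together and keeps the pairs that reach a
minimum support. The baskets, the function name, and the 0.5 threshold are
made-up assumptions; practical miners such as Apriori prune the search rather
than counting every pair.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs bought together often enough.

    Support is the fraction of transactions that contain both items.
    """
    counts = Counter()
    for basket in transactions:
        # Sort so that (bread, milk) and (milk, bread) count as one pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "biscuits"],
    ["milk", "bread", "biscuits"],
]
result = frequent_pairs(baskets, min_support=0.5)
print(result)
```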
Mining of Associations
Associations are used in retail sales to identify items that are frequently
purchased together. Association mining is the process of uncovering
relationships among data and determining association rules.
For example, a retailer generates an association rule that shows that milk is
sold with bread 70% of the time, while biscuits are sold with bread only 30%
of the time.
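The percentages in this example correspond to what is usually called the
confidence of a rule: of the transactions that contain bread, the share that
also contain the other item. The sketch below assumes ten invented bread
transactions chosen to reproduce the 70% and 30% figures; the function name is
hypothetical.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    It is the fraction of transactions containing the antecedent items
    that also contain the consequent items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)

# Invented data: 7 of 10 bread purchases include milk, 3 include biscuits.
transactions = [["bread", "milk"]] * 7 + [["bread", "biscuits"]] * 3

milk_conf = confidence(transactions, ["bread"], ["milk"])
biscuit_conf = confidence(transactions, ["bread"], ["biscuits"])
print(milk_conf, biscuit_conf)
```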
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between
two item sets, in order to determine whether they have a positive, negative,
or no effect on each other.
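One common measure for this kind of check, not named in the text above, is
lift: the observed rate at which two items occur together divided by the rate
expected if they were independent. A value above 1 suggests a positive
correlation, below 1 a negative one, and close to 1 no effect. The data and
function name in this sketch are assumptions.

```python
def lift(transactions, item_a, item_b):
    """Lift of two items: P(A and B) / (P(A) * P(B))."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if item_a in t)
    count_b = sum(1 for t in transactions if item_b in t)
    count_both = sum(1 for t in transactions if item_a in t and item_b in t)
    if count_a == 0 or count_b == 0:
        return 0.0  # an item that never occurs cannot be correlated
    # Algebraically equal to (both/n) / ((a/n) * (b/n)).
    return count_both * n / (count_a * count_b)

# Hypothetical transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "milk"],
    ["bread"],
    ["biscuits"],
]
bread_milk = lift(transactions, "bread", "milk")          # above 1: positive
bread_biscuits = lift(transactions, "bread", "biscuits")  # below 1: negative
print(bread_milk, bread_biscuits)
```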
Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different
from the objects in other clusters.
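As a small illustration, the sketch below groups customers by monthly spend
with a one-dimensional k-means loop. K-means is just one of many clustering
methods and is chosen here as an assumption, as are the spend values and the
starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest center
    and then moving every center to the mean of its cluster."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
monthly_spend = [20, 22, 25, 210, 200, 190]
centers, clusters = kmeans_1d(monthly_spend, centers=[0, 100])
print(clusters)
```

Here the low spenders and the high spenders end up in separate groups, which
matches the description above: similar within a cluster, different across
clusters.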
Classification and Prediction
Classification is the process of finding a model that describes
the data classes or concepts. The purpose is to be able to use this model to
predict the class of objects whose class label is unknown. This derived model
is based on the analysis of sets of training data. The derived model can be
presented in the following forms −
- Classification (IF-THEN) Rules
- Decision Trees
- Mathematical Formulae
- Neural Networks
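For instance, a model in the IF-THEN rule form could look like the sketch
below, which reuses the big spender and budget spender concepts from earlier.
The rules are written by hand here purely for illustration; in practice they
would be derived from training data, and the 5000 threshold and field name are
assumptions.

```python
# Each rule is a (condition, class label) pair, checked in order.
RULES = [
    (lambda c: c["annual_spend"] >= 5000, "big spender"),
    (lambda c: c["annual_spend"] < 5000, "budget spender"),
]

def classify(customer, rules=RULES, default="unknown"):
    """Return the label of the first rule whose IF part matches."""
    for condition, label in rules:
        if condition(customer):
            return label
    return default

# Customers whose class label is not yet known.
print(classify({"annual_spend": 8000}))
print(classify({"annual_spend": 1200}))
```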
The functions involved in these processes are as follows −
· Classification − It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e. data objects whose class labels are
known.
· Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression analysis is generally used for
prediction. Prediction can also be used to identify distribution trends based
on available data.
· Outlier Analysis − Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
· Evolution Analysis − Evolution analysis refers to the description and model
regula
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A
data warehouse is constructed by integrating data from multiple heterogeneous
sources, and it supports analytical reporting, structured and/or ad hoc
queries, and decision making. Data warehousing involves data cleaning, data
integration, and data consolidation.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available
in a data warehouse. These technologies help executives to use the warehouse
quickly and effectively. They can gather data, analyze it, and make decisions
based on the information present in the warehouse. The information gathered in
a warehouse can be used in any of the following domains −
· Tuning Production Strategies − Product strategies can be fine-tuned by
repositioning products and managing product portfolios, based on comparisons
of quarterly or yearly sales.
· Customer Analysis − Customer analysis is done by analyzing the customer's
buying preferences, buying time, budget cycles, etc.
· Operations Analysis − Data warehousing also helps in customer relationship
management and in making environmental corrections. The information also
allows us to analyze business operations.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches −
- Query-driven Approach
- Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. In
this approach, wrappers and integrators are built on top of multiple
heterogeneous databases. These integrators are also known as mediators.
Process of Query-Driven Approach
· When a query is issued at the client side, a metadata dictionary translates
the query into an appropriate form for each of the heterogeneous sites
involved.
· These queries are then mapped and sent to the local query processors.
· The results from the heterogeneous sites are integrated into a global answer
set.
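The three steps above can be sketched with two in-memory "sites" that store
the same kind of data under different field names. Everything in this sketch
(site names, field names, rows, and function names) is hypothetical; a real
mediator would talk to actual database systems through wrappers.

```python
# Two hypothetical heterogeneous sites with different local schemas.
SITES = {
    "sales": [{"cust": "Ann", "amt": 120}, {"cust": "Bob", "amt": 80}],
    "crm": [{"customer_name": "Ann", "total": 40},
            {"customer_name": "Cid", "total": 15}],
}

# Metadata dictionary: global field name -> local field name, per site.
METADATA = {
    "sales": {"customer": "cust", "amount": "amt"},
    "crm": {"customer": "customer_name", "amount": "total"},
}

def local_query(rows, field, value):
    """Stand-in for a site's local query processor."""
    return [row for row in rows if row[field] == value]

def mediated_query(customer):
    """Translate the query per site, run it locally, and merge the results."""
    answer = []
    for site, rows in SITES.items():
        mapping = METADATA[site]
        # Step 1 and 2: translate to the local field name and query the site.
        local_rows = local_query(rows, mapping["customer"], customer)
        # Step 3: map local results back into one global answer set.
        for row in local_rows:
            answer.append({
                "site": site,
                "customer": row[mapping["customer"]],
                "amount": row[mapping["amount"]],
            })
    return answer

print(mediated_query("Ann"))
```

Note that every call to `mediated_query` visits every site again, which is
where the cost of this approach comes from.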
Disadvantages
· The query-driven approach needs complex integration and filtering processes.
· It is inefficient, because every query must be translated and run against
the individual sources at query time.
· It is very expensive for frequent queries.
· It is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse
systems follow the update-driven approach rather than the query-driven
approach discussed earlier. In the update-driven approach, information from
multiple heterogeneous sources is integrated in advance and stored in a
warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
· This approach provides high performance.
· The data is copied, processed, integrated, annotated, summarized, and
restructured in a semantic data store in advance.
· Query processing does not require an interface to process data at local
sources.
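A minimal sketch of the update-driven idea, assuming two made-up sources with
different field names: the integration and summarizing happen once, ahead of
time, and later queries read only the warehouse. The function and field names
are illustrative, not part of any particular product.

```python
from collections import defaultdict

def build_warehouse(sources):
    """Integrate and summarize all sources before any query is asked.

    Each source is (rows, customer_field, amount_field), so the differing
    local field names are resolved here, once.
    """
    totals = defaultdict(int)
    for rows, customer_field, amount_field in sources:
        for row in rows:
            totals[row[customer_field]] += row[amount_field]
    return dict(totals)

# Hypothetical heterogeneous sources.
sources = [
    ([{"cust": "Ann", "amt": 120}, {"cust": "Bob", "amt": 80}], "cust", "amt"),
    ([{"customer_name": "Ann", "total": 40}], "customer_name", "total"),
]

warehouse = build_warehouse(sources)

# Query time: a direct lookup, with no access to the local sources.
ann_total = warehouse["Ann"]
print(ann_total)
```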
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and
utilities −
· Data Extraction − Involves gathering data from multiple heterogeneous
sources.
· Data Cleaning − Involves finding and correcting the errors in data.
· Data Transformation − Involves converting the data from legacy format to
warehouse format.
· Data Loading − Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
· Refreshing − Involves propagating updates from the data sources to the
warehouse.
Note − Data cleaning and data transformation are important steps
in improving the quality of data and data mining results.
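The five functions can be pictured as a small pipeline. The sketch below is a
toy version under stated assumptions: the record layout, the cleaning rule
(drop rows with a missing amount), and the "legacy" format (amounts stored as
strings) are all invented for illustration.

```python
def extract(sources):
    """Data extraction: gather rows from every source into one list."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop rows with no amount and trim stray whitespace."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue
        cleaned.append({"customer": row["customer"].strip(),
                        "amount": row["amount"]})
    return cleaned

def transform(rows):
    """Data transformation: legacy string amounts become floats and
    customer names get one consistent capitalization."""
    return [{"customer": r["customer"].title(), "amount": float(r["amount"])}
            for r in rows]

def load(rows, warehouse):
    """Data loading: sort the rows and index the amounts by customer."""
    for r in sorted(rows, key=lambda r: r["customer"]):
        warehouse.setdefault(r["customer"], []).append(r["amount"])
    return warehouse

def refresh(warehouse, new_sources):
    """Refreshing: push newly arrived source data through the same steps."""
    return load(transform(clean(extract(new_sources))), warehouse)

# Hypothetical sources with messy data.
source_a = [{"customer": " ann ", "amount": "120.50"},
            {"customer": "bob", "amount": None}]
source_b = [{"customer": "ANN", "amount": "40"}]

warehouse = load(transform(clean(extract([source_a, source_b]))), {})
warehouse = refresh(warehouse, [[{"customer": "cid", "amount": "15"}]])
print(warehouse)
```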