Brief Answer Question:
i. What are two types of data used in data mining?
Types of data used in data mining are:
 Records
 Graph and network.
 Ordered
 Multimedia
ii. Write formula for confidence used in association rule mining.
Confidence is the conditional probability that a transaction contains Y given that it contains X, i.e. Pr(Y | X):
confidence(X → Y) = σ(X ∪ Y) / σ (X)
= support(X ∪ Y) / support(X)
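As a sketch, the formula can be computed directly from transaction counts; the items and transactions below are made up for illustration:

```python
# Confidence of X -> Y computed from a toy transaction list (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y, transactions):
    """confidence(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread baskets -> 2/3
```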
iii. What are the parameters to be considered in capacity planning?
The parameters to be considered in capacity planning are:
 The number of times the process will be run,
 The number of I/Os the process will use,
 Whether there is an arrival peak to the processing,
 The expected response time.
iv. Why is data warehouse called nonvolatile?
A data warehouse is considered nonvolatile because previously loaded data is not erased when new data is added; the data is read-only and is only refreshed periodically.
v. What is time series data mining?
A time series is a collection of values obtained from measurements made in temporal order, i.e. an ordered sequence of data points recorded at uniform time intervals. Time series data mining exploits our natural ability to visualize the shape of data as it changes over time.
vi. Mention some possible misuses of data mining.
Some possible misuses of data mining are:
 Cost
 Security
 Privacy
 Information misuse
vii. What are the limitations of the K-Means algorithm?
Limitations of the K-Means algorithm are:
 Defining the number of clusters (i.e. the value of K) in advance.
 Determining the initial centroids, to which the final clustering is sensitive.
viii. Define data mart.
A data mart is a subset of the information content of a data warehouse that is stored in its own database. A data mart can improve query performance simply by reducing the volume of data that needs to be scanned to satisfy a query.
ix. What is web structure mining?
Web structure mining, one of the three categories of web mining, is used to identify the relationships between Web pages that are linked by information or by direct hyperlinks. It shows how different pages are linked together to form the Web.
Exercise Problems:
2. Explain Star scheme with example.
A star schema is a database organizational structure optimized for use in a data warehouse or business intelligence that uses a single large fact table to store transactional or measured data, and one or more smaller dimensional tables that store attributes about the data. It is called a star schema because the fact table sits at the center of the logical diagram, and the small dimensional tables branch off to form the points of the star.
How a star schema works
The fact table (central table in the star schema) stores two types of information: numeric values and dimension attribute values. Using a sales database as an example:
Numeric value cells are unique to each row or data point and do not relate to data stored in other rows. These are facts about a transaction, such as an order ID, total amount, net profit, order quantity or exact time.
The dimension attribute values do not directly store data, but they store the foreign key value for a row in a related dimensional table. Many rows in the fact table will reference this type of information. So, for example, it might store the sales employee ID, a date value, a product ID or a branch office ID.
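The fact/dimension split can be sketched in a few lines; the tables, column names, and values below are all invented for illustration:

```python
# Toy star schema: one central fact table plus a product dimension.
# All table contents and column names here are hypothetical.
dim_product = {  # dimension table, keyed by its primary key product_id
    1: {"name": "laptop", "category": "electronics"},
    2: {"name": "desk", "category": "furniture"},
}

fact_sales = [  # fact table: numeric measures plus foreign keys
    {"order_id": 100, "product_id": 1, "quantity": 2, "total": 1800.0},
    {"order_id": 101, "product_id": 2, "quantity": 1, "total": 250.0},
]

# A star-schema query joins each fact row to its dimension row
# through the foreign key, exactly one hop from the center.
report = [(dim_product[f["product_id"]]["name"], f["total"])
          for f in fact_sales]
print(report)  # [('laptop', 1800.0), ('desk', 250.0)]
```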
3. Describe DBSCAN algorithm.
Clustering analysis is an unsupervised learning method that separates the data points into several specific groups, such that points in the same group have similar properties and points in different groups have dissimilar properties in some sense. DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. DBSCAN discovers clusters of arbitrary shape in spatial databases with noise.
DBSCAN algorithm is given below:
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed
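The steps above can be sketched as a minimal pure-Python implementation; the sample points and parameter values are made up:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point:
    0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # All points within Eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                   # p is a core point: a new cluster is formed
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                   # expand via density-reachable points
            j = queue.pop()
            if labels[j] == -1:        # border point previously marked as noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_seeds)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # two clusters and one noise point
```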
4. Explain Load manager.
Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers −
 Load manager
 Warehouse manager
 Query manager
The load manager is the system component that performs all the operations necessary to support the extract and load process. It fast-loads the extracted data into a temporary data store and performs simple transformations into a structure similar to the one in the data warehouse; this process is also called ETL (Extract, Transform and Load).
The load manager performs the operations required to extract the data and load it into the database. The size and complexity of a load manager vary from one data warehouse solution to another.
Load Manager Architecture
The load manager performs the following functions −
 Extract data from the source system.
 Fast-load the extracted data into a temporary data store.
 Perform simple transformations into a structure similar to the one in the data warehouse.
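These three functions can be sketched as a toy pipeline; the source rows, column names, and helper names below are all hypothetical:

```python
# Hedged sketch of the three load-manager functions on invented data.
raw_source = [
    "101,laptop,2",   # order_id,product,quantity rows from a source system
    "102,mouse,5",
]

def extract(rows):
    """Pull delimited rows out of the source system."""
    return [r.split(",") for r in rows]

def fast_load(records):
    """Fast-load into a temporary data store (here just a plain list)."""
    return list(records)

def simple_transform(staging):
    """Shape staged rows into a structure similar to the warehouse table."""
    return [{"order_id": int(o), "product": p, "quantity": int(q)}
            for o, p, q in staging]

warehouse = simple_transform(fast_load(extract(raw_source)))
print(warehouse[0])  # {'order_id': 101, 'product': 'laptop', 'quantity': 2}
```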
5. Construct the FP tree for the following transactions.
TID  Items 
100  {a, b, c} 
200  {b, c, d} 
300  {a, c, e} 
400  {b, d, f} 
500  {a, b, e} 
600  {b, c, f} 
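The construction can be sketched on the transactions above: items are inserted along a shared prefix path in descending frequency order (ties broken alphabetically), which here is b, c, a, d, e, f.

```python
from collections import Counter

# The six transactions from the table above.
transactions = {
    100: {"a", "b", "c"}, 200: {"b", "c", "d"}, 300: {"a", "c", "e"},
    400: {"b", "d", "f"}, 500: {"a", "b", "e"}, 600: {"b", "c", "f"},
}

# Item frequencies: b:5, c:4, a:3, d:2, e:2, f:2.
counts = Counter(item for t in transactions.values() for item in t)
order = sorted(counts, key=lambda item: (-counts[item], item))

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Insert each transaction in frequency order, sharing common prefixes.
root = Node(None)
for items in transactions.values():
    node = root
    for item in sorted(items, key=order.index):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):
    """Print the tree, one 'item:count' node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # e.g. the root's b branch carries count 5, b->c carries 3
```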
6. Explain classification and prediction. How can a decision tree be used for classification of elements (data)?
Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. The set of input data and the corresponding outputs are given to the algorithm, so the training data set includes the input data and their associated class labels. Using the training data set, the algorithm derives a model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network.
Another process of data analysis is prediction, which is used to find a numerical output. As in classification, the training data set contains the inputs and the corresponding numerical output values, and the algorithm derives a model, or predictor, from it. The model should produce a numerical output when new data is given. Unlike classification, this method does not use a class label: the model predicts a continuous-valued function or ordered value.
A decision tree can be used as the classifier: an example is routed from the root down the tree, one feature test at a time, until it reaches a leaf whose label is returned.
Examples can be classified as follows:
 Look at the example’s value for the feature specified
 Move along the edge labeled with this value
 If you reach a leaf, return the label of the leaf
 Otherwise, repeat from step 1
So a new instance:
<rainy, hot, normal, true>: ?
will be classified as “no play”
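The traversal above can be sketched in code. The tree itself is not reproduced in the text, so the block below assumes the classic “play tennis” weather tree; its structure and labels are an assumption:

```python
# Hypothetical decision tree for the classic "play tennis" weather data
# (assumed here, since the text does not reproduce the tree).
# Internal nodes are (feature, {value: subtree}); leaves are class labels.
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no play", "normal": "play"}),
    "overcast": "play",
    "rainy":    ("windy", {"true": "no play", "false": "play"}),
})

def classify(tree, example):
    """Walk from the root: read the node's feature, follow the edge labeled
    with the example's value, repeat until a leaf (a plain label)."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

instance = {"outlook": "rainy", "temperature": "hot",
            "humidity": "normal", "windy": "true"}
print(classify(instance and tree, instance) if False else classify(tree, instance))  # no play
```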
Comprehensive Questions:
7. Consider the given training data set. Check whether the given person cheats or not using a Bayesian classifier: Refund (x, ‘yes’) ∧ Marital Status (x, Divorced) ∧ Income (x, <80K).
ID  Refund  Marital Status  Income  Cheat 
1  Yes  Single  >80K  No 
2  No  Married  >80K  No 
3  No  Single  <80K  No 
4  Yes  Married  >80K  No 
5  No  Divorced  >80K  Yes 
6  No  Married  <80K  No 
7  Yes  Divorced  >80K  No 
8  No  Single  >80K  Yes 
9  No  Married  <80K  No 
10  No  Single  >80K  Yes 
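The naive Bayesian computation on this table can be sketched as follows, assuming conditional independence of the attributes and taking the class prior and per-class attribute frequencies directly from the ten rows:

```python
# Training rows from the table: (Refund, Marital Status, Income, Cheat).
data = [
    ("Yes", "Single",   ">80K", "No"),  ("No", "Married",  ">80K", "No"),
    ("No",  "Single",   "<80K", "No"),  ("Yes", "Married", ">80K", "No"),
    ("No",  "Divorced", ">80K", "Yes"), ("No", "Married",  "<80K", "No"),
    ("Yes", "Divorced", ">80K", "No"),  ("No", "Single",   ">80K", "Yes"),
    ("No",  "Married",  "<80K", "No"),  ("No", "Single",   ">80K", "Yes"),
]
query = ("Yes", "Divorced", "<80K")  # Refund=Yes, Divorced, Income<80K

def score(cls):
    """P(X | class) * P(class) under the conditional independence assumption."""
    rows = [r for r in data if r[3] == cls]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for i, value in enumerate(query):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * likelihood

for cls in ("Yes", "No"):
    print(cls, score(cls))
# P(X|Yes)P(Yes) = 0.3 * 0 * (1/3) * 0        = 0
# P(X|No)P(No)   = 0.7 * (3/7) * (1/7) * (3/7) ~ 0.0184
# => the person is classified as Cheat = No.
```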
8. Describe types of OLAP operations. Explain the steps of KDD.
OLAP stands for On-Line Analytical Processing. OLAP is a class of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data, transformed from raw information to reflect the real dimensionality of the enterprise as understood by its users.
OLAP operations are:
 Roll-up (drill-up):
The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
 Drill-down:
Drill-down is the reverse of roll-up: it navigates from a higher-level summary to lower-level, more detailed data.
Drill-down can be performed either by:
 Stepping down a concept hierarchy for a dimension
 Introducing a new dimension.
 Slice and dice
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube with one fewer dimension; the dice operation defines a subcube by performing a selection on two or more dimensions.
 Pivot
Pivot is also known as rotate. It rotates the data axes to view the data from different perspectives.
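The roll-up and slice operations can be sketched on a toy cube; the dimensions and values below are invented for illustration:

```python
# Hypothetical 3-D sales cube stored as (year, city, product) -> amount.
cube = {
    (2023, "KTM", "tea"): 10, (2023, "KTM", "coffee"): 5,
    (2023, "PKR", "tea"): 7,  (2024, "KTM", "tea"): 12,
}

def roll_up(cube):
    """Roll up the location dimension away: aggregate amounts over city."""
    out = {}
    for (year, _city, product), amount in cube.items():
        key = (year, product)
        out[key] = out.get(key, 0) + amount
    return out

def slice_cube(cube, year):
    """Slice: select one value on the year dimension, giving a 2-D subcube."""
    return {(c, p): v for (y, c, p), v in cube.items() if y == year}

print(roll_up(cube))          # e.g. (2023, 'tea') -> 17
print(slice_cube(cube, 2024))
```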
Knowledge Discovery in Databases (KDD), often used interchangeably with data mining, refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data stored in databases.
Steps involved in KDD are:
 Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
 Cleaning in case of Missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with Data discrepancy detection and Data transformation tools.
 Data Integration: Data integration is defined as the combination of heterogeneous data from multiple sources into a common store (the data warehouse).
 Data integration using Data Migration tools.
 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract, Transform, Load) process.
 Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.
 Data selection using neural networks.
 Data selection using decision trees.
 Data selection using Naive Bayes.
 Data selection using Clustering, Regression, etc.
 Data Transformation: Data Transformation is defined as the process of transforming data into appropriate form required by mining procedure.
Data transformation is a two-step process:
 Data Mapping: Assigning elements from the source base to the destination to capture transformations.
 Code generation: Creation of the actual transformation program.
 Data Mining: Data mining is defined as the clever techniques that are applied to extract potentially useful patterns.
 Transforms task relevant data into patterns.
 Decides purpose of model using classification or characterization.
 Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given interestingness measures.
 Find interestingness score of each pattern.
 Uses summarization and Visualization to make data understandable by user.
 Knowledge representation: Knowledge representation is defined as technique which utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.
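The whole KDD chain can be sketched end to end; every data value, bin boundary, and helper name below is hypothetical:

```python
# Minimal sketch of the KDD steps on made-up data:
# clean -> integrate -> select -> transform -> mine.
source_a = [{"id": 1, "age": 25}, {"id": 2, "age": None}]  # has a missing value
source_b = [{"id": 3, "age": 40}]

def integrate(*sources):        # data integration: combine the sources
    return [row for s in sources for row in s]

def clean(rows):                # data cleaning: drop rows with missing values
    return [r for r in rows if r["age"] is not None]

def select(rows):               # data selection: keep only the analysed attribute
    return [r["age"] for r in rows]

def transform(ages):            # data transformation: discretise into bins
    return ["young" if a < 30 else "old" for a in ages]

def mine(labels):               # data mining: a trivial frequency pattern
    return {label: labels.count(label) for label in set(labels)}

patterns = mine(transform(select(clean(integrate(source_a, source_b)))))
print(patterns)  # {'young': 1, 'old': 1}
```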