Brief Answer Questions:
i. How does data mining differ from data warehousing?
The main difference is that data mining is the process of analyzing stored data to extract useful patterns and knowledge, whereas data warehousing is the process of extracting data from multiple sources and storing it in a central repository so that it can later be analyzed.
ii. What are the main parameters to be considered in capacity planning?
The parameters to be considered in capacity planning are:
- The number of times the process will be run,
- The number of I/Os the process will use,
- Whether there is an arrival peak to the processing,
- The expected response time.
iii. What is the role of the damping factor in a search engine?
The damping factor d models the probability that a random surfer keeps following links; with probability 1 - d the surfer jumps to a random page instead. Without damping, all web surfers would eventually end up on a few strongly linked pages (rank sinks), and all other pages would have PageRank zero. With damping, even a page with no outgoing links of its own effectively links to all pages in the web, so rank is not trapped.
iv. What are the complexities in multimedia data mining?
Multimedia data mining is an interdisciplinary field that integrates image processing and understanding, computer vision, data mining, and pattern recognition.
Complexities in multimedia data mining include content-based retrieval and similarity search, and generalization and multidimensional analysis.
v. List any two types of noise in data.
Types of noise in data are:
- electronic noise
- thermal noise
vi. Mention any two limitations of k-means algorithm.
Limitations of the k-means algorithm are:
- Defining the number of clusters (i.e., the value of K) in advance.
- Determining the initial centroids (the result is sensitive to this choice).
vii. Give any two applications of time series data mining.
Applications of time series data mining are:
- Trend analysis
- Forecasting
viii. What are the possible limitations of data mining?
Limitations of data mining are:
- Data snooping
- Grouping
ix. What is the role of minimum support?
- Minimum support is a parameter supplied to the Apriori algorithm in order to prune candidate itemsets, by specifying a lower bound on the support of the resulting association rules.
Exercise Problems:
2. Why is data preprocessing necessary before mining? Explain some techniques to clean the data.
Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data.
Preprocessing of data is mainly to check the data quality. The quality can be checked by the following:
- Accuracy: whether the data entered is correct.
- Completeness: whether all required data has been recorded, or whether values are missing.
- Consistency: whether copies of the same data stored in different places match.
- Timeliness: whether the data is kept up to date.
- Believability: whether the data can be trusted.
- Interpretability: how easily the data can be understood.
Major Tasks in Data Preprocessing:
- Data cleaning
- Data integration
- Data reduction
- Data transformation
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the dataset and of replacing missing values. Common data cleaning techniques are listed below; a short illustrative sketch follows the list.
- Missing values
  Missing data cannot simply be ignored, because many algorithms will not accept missing values. There are several ways to deal with them; none is optimal, but each can be considered:
  - Ignore the tuple.
  - Fill in the missing values manually.
  - Use a global constant to fill in the missing values.
  - Use a measure of central tendency for the attribute to fill in the missing values.
  - Use the attribute mean or median of all samples belonging to the same class as the given tuple.
  - Use the most probable value to fill in the missing value.
- Noisy data
  - Binning
    - First sort the data and partition it into (equal-frequency) bins.
    - Then smooth by bin means, bin medians, or bin boundaries.
  - Regression
    - Smooth by fitting the data to regression functions.
  - Clustering
    - Detect and remove outliers.
  - Combined computer and human inspection
    - Detect suspicious values and have a human check them (e.g., to deal with possible outliers).
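As a small illustration of the techniques above, the sketch below fills missing values with the attribute mean and smooths a numeric attribute by equal-frequency binning with bin means. The data, the column names, and the use of pandas are assumptions made for the example, not part of the question.

```python
# Illustrative data-cleaning sketch; the data, column names, and use of
# pandas are assumptions made for the example.
import pandas as pd

df = pd.DataFrame({"age": [23, 25, None, 31, 29, None, 40, 22],
                   "income": [200, 180, 950, 400, 310, 620, 870, 150]})

# Missing values: fill with a measure of central tendency (here, the mean).
df["age"] = df["age"].fillna(df["age"].mean())

# Noisy data: equal-frequency binning, then smoothing by bin means.
df["income_bin"] = pd.qcut(df["income"], q=4)
df["income_smooth"] = df.groupby("income_bin")["income"].transform("mean")
print(df)
```

Each income value is replaced by the mean of its bin, which corresponds to the "smooth by bin means" option listed above.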
3. Explain the Link Analysis page ranking algorithm.
The original Page Rank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Where
PR(A) is the Page Rank of page A,
PR(Ti) is the Page Rank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.
- Page Rank does not rank web sites as a whole, but is determined for each page individually. Further, the Page Rank of page A is recursively defined by the Page Ranks of those pages which link to page A.
- The PageRank of the pages Ti that link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A will benefit from a link to it on page T.
- The weighted Page Rank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A’s Page Rank.
- Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1. Thereby, the extent of the PageRank benefit that a page receives from another page linking to it is reduced.
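A minimal sketch of the iterative computation this formula implies is shown below; the example link graph, the number of iterations, and the value d = 0.85 are illustrative assumptions.

```python
# Iterative PageRank following PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti));
# the link graph and parameter values are illustrative assumptions.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial PageRank of every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(Ti)/C(Ti) over all pages Ti that link to this page.
            # (Pages with no outbound links are not handled in this sketch.)
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Example: A links to B and C, B links to C, C links back to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```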
4. Consider the following transaction data sets.
| TID | Items |
|-----|-------|
| 1 | E, A, D, B |
| 2 | D, A, C, E, B |
| 3 | C, A, B, E |
| 4 | B, A, D |
| 5 | D |
| 6 | D, B |
| 7 | A, D, E |
| 8 | B, C |
Solution:
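The question does not state what should be computed from these transactions; assuming it asks for frequent itemsets (for example, as the first steps of Apriori), a minimal counting sketch over the table above, with an assumed minimum support count of 3, is:

```python
from itertools import combinations

# Transactions from the table above; min_support = 3 is an assumed threshold,
# since the question does not state one.
transactions = [
    {"E", "A", "D", "B"}, {"D", "A", "C", "E", "B"}, {"C", "A", "B", "E"},
    {"B", "A", "D"}, {"D"}, {"D", "B"}, {"A", "D", "E"}, {"B", "C"},
]
min_support = 3

items = sorted(set().union(*transactions))
for size in (1, 2):                      # frequent 1-itemsets and 2-itemsets
    for itemset in combinations(items, size):
        count = sum(set(itemset) <= t for t in transactions)
        if count >= min_support:
            print(set(itemset), "support =", count)
```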
5. What are the roles of MinPts and epsilon in DBSCAN algorithm? Explain.
Clustering analysis is an unsupervised learning method that separates data points into specific groups such that points in the same group have similar properties and points in different groups have dissimilar properties. DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. DBSCAN discovers clusters of arbitrary shape in spatial databases with noise.
DBSCAN algorithm is given below:
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
- Eps (epsilon): the maximum radius of the neighbourhood around a point.
- MinPts: the minimum number of points required in a point's Eps-neighbourhood for that point to be a core point.
- N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
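A short sketch showing how Eps and MinPts are supplied in practice; the use of scikit-learn's DBSCAN, the sample points, and the parameter values are assumptions made for illustration.

```python
# Illustrative DBSCAN run; the library choice (scikit-learn), the sample
# points, and the parameter values are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # first dense group
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # second dense group
              [4.0, 15.0]])                          # isolated point

# eps is the neighbourhood radius (Eps); min_samples is MinPts.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise
```

Increasing eps or lowering min_samples makes it easier for points to become core points, so clusters grow and fewer points are labelled as noise.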
6. Consider 14 training examples, of which 9 are positive and 5 are negative. Suppose one of the attributes is Wind, which has the values Weak and Strong. There are 8 occurrences of weak wind and 6 occurrences of strong wind. For the weak winds, 6 examples are positive and 2 are negative; for the strong winds, 3 are positive and 3 are negative. Calculate the information gain of Wind.
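A sketch of the standard calculation, applying the entropy and information-gain formulas to the counts given in the question:

```python
from math import log2

# Entropy of a collection given its counts of positive and negative examples.
def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

e_s      = entropy(9, 5)   # whole set: 9 positive, 5 negative   -> about 0.940
e_weak   = entropy(6, 2)   # Wind = Weak: 6 positive, 2 negative -> about 0.811
e_strong = entropy(3, 3)   # Wind = Strong: 3 positive, 3 negative -> 1.000

# Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
gain = e_s - (8 / 14) * e_weak - (6 / 14) * e_strong
print(round(gain, 3))      # about 0.048
```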
Comprehensive Questions:
7. What is the need for a multidimensional data model? Describe web usage mining and web content mining.
The multidimensional data model is a method of organizing data in a database so that the contents are well arranged and can be viewed and analyzed along multiple dimensions.
The multidimensional data model allows users to pose analytical questions related to market or business trends, unlike relational databases, which let users access data through record-oriented queries. It allows users to receive answers to such requests quickly, because the data is organized so that it can be summarized and examined comparatively fast along its dimensions.
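As a small illustration of the kind of analytical question the model is meant to answer, the sketch below aggregates an assumed sales table along two dimensions (product and region); the data, the column names, and the use of pandas are assumptions.

```python
# Aggregating an assumed sales table along two dimensions (product, region).
import pandas as pd

sales = pd.DataFrame({
    "product": ["TV", "TV", "Radio", "Radio", "TV", "Radio"],
    "region":  ["East", "West", "East", "West", "East", "East"],
    "amount":  [500, 700, 120, 90, 650, 130],
})

cube = pd.pivot_table(sales, values="amount", index="product",
                      columns="region", aggfunc="sum", fill_value=0)
print(cube)   # total sales by product and region in one multidimensional view
```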

Web Usage Mining:
- Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.
- Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site.
- Web usage mining itself can be classified further depending on the kind of usage data considered:
- Web Server Data: The user logs are collected by the Web server. Typical data include the IP address, page reference, and access time.
- Application Server Data: Commercial application servers such as WebLogic and StoryServer have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
- Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these specially defined events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the above categories.
Web Content Mining:
- Web Content Mining is the process of extracting useful information from the contents of Web documents.
- Content data corresponds to the collection of facts a Web page was designed to convey to the users.
- May consist of text, images, audio, video, or structured records such as lists and tables.
- Web content has been the most widely researched area. Issues addressed in text mining include topic discovery, extraction of association patterns, clustering of web documents, and classification of Web pages.
8. Explain ETL process. Distinguish between OLAP and OLTP.
A data warehouse usually stores many years of data to support historical analysis. The data in a data warehouse is typically loaded through an extraction, transformation, and loading (ETL) process from multiple data sources. Modern data warehouses are moving toward an extract, load, and transformation (ELT) architecture in which all or most data transformation is performed on the database that hosts the data warehouse.
Extract
- Data extraction (retrieval) from the source system
- Incremental extract: identify modified records and extract only those
- Full extract: extract a full copy of the source data in the same format; changes are identified by comparing it with the previous copy
Transform
- Transform the data from the source format into the target format
- Conversion to the same dimensions, units, etc.
- Generating aggregates, sorting, validating, etc.
Load
- Load the data into a temporary staging area first and then perform simple transformations into a structure similar to the one in the data warehouse
- Consume as few resources as possible
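A minimal ETL sketch is given below; the file name, the column names, and the SQLite target are illustrative assumptions, and pandas is used only as a convenient library.

```python
# Minimal ETL sketch; file name, column names, and the SQLite target
# database are illustrative assumptions.
import sqlite3
import pandas as pd

# Extract: read a copy of the source data.
sales = pd.read_csv("sales_source.csv")          # assumed source file

# Transform: convert to a common unit, drop bad rows, and aggregate.
sales["amount_usd"] = sales["amount_local"] * sales["exchange_rate"]
daily = (sales.dropna(subset=["amount_usd"])
              .groupby("date", as_index=False)["amount_usd"].sum())

# Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="append", index=False)
```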
The differences between OLTP and OLAP are:
|  | OLTP | OLAP |
|---|---|---|
| users | clerk, IT professional | knowledge worker |
| function | day-to-day operations | decision support |
| DB design | application-oriented | subject-oriented |
| data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated |
| usage | repetitive | ad-hoc |
| access | read/write; index/hash on primary key | lots of scans |
| unit of work | short, simple transaction | complex query |
| # records accessed | tens | millions |
| # users | thousands | hundreds |
| DB size | 100 MB-GB | 100 GB-TB |
| metric | transaction throughput | query throughput, response time |