Brief Answer Questions:

i. What are two types of data used in data mining?

Types of data used in data mining are:

  • Records
  • Graph and network
  • Ordered
  • Multimedia

ii. Write formula for confidence used in association rule mining.

Confidence is the conditional probability that a transaction will contain Y given that it contains X, i.e., Pr(Y | X).

confidence(X → Y) = σ(X ∪ Y) / σ(X) = support(X ∪ Y) / support(X)
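As a quick illustration, here is a minimal Python sketch that computes support and confidence from a small transaction list; the transactions are invented purely for the example.

# Minimal sketch: confidence(X -> Y) = support(X ∪ Y) / support(X).
# The transaction list below is made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
confidence = support(X | Y) / support(X)
print(confidence)  # support({bread, milk}) / support({bread}) = 0.5 / 0.75 ≈ 0.67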

iii. What are the parameters to be considered in capacity planning?

The parameters to be considered in capacity planning are:

  • The number of times the process will be run
  • The number of I/Os the process will use
  • Whether there is an arrival peak to the processing
  • The expected response time

iv. Why is data warehouse called nonvolatile?

A data warehouse is considered nonvolatile because previously stored data is not erased when new data is added; the data is read-only and is only refreshed periodically.

v. What is time series data mining?

A time series is a collection of values or data points obtained from measurements made in logical order over time, i.e., an ordered sequence of data points at uniform time intervals. Time series data mining extracts patterns and knowledge from such data, making use of our natural ability to visualize the shape of data over time.

vi. Mention some possible misuses of data mining.

Some possible misuses of data mining are:

  • Cost
  • Security
  • Privacy
  • Information misuse

vii. What are the limitations of the K-Means algorithm?

Limitations of the K-Means algorithm are:

  • Defining the number of clusters (i.e., the value of K)
  • Determining the initial centroids

viii. Define data mart.

A data mart is a subset of the information content of a data warehouse that is stored in its own database. A data mart can improve query performance simply by reducing the volume of data that needs to be scanned to satisfy a query.

ix. What is web structure mining?

Web structure mining, one of the three categories of web mining, is used to identify the relationships between Web pages that are linked by information or by direct hyperlinks. It offers information about how different pages are linked together to form the huge Web.

Exercise Problems:

2. Explain the star schema with an example.

A star schema is a database organizational structure optimized for use in a data warehouse or for business intelligence. It uses a single large fact table to store transactional or measured data, and one or more smaller dimension tables that store attributes about the data. It is called a star schema because the fact table sits at the center of the logical diagram, and the small dimension tables branch off to form the points of the star.

How does a star schema work?

The fact table (central table in the star schema) stores two types of information: numeric values and dimension attribute values. Using a sales database as an example:

Numeric value cells are unique to each row or data point and do not correlate or relate to data stored in other rows. These might be facts about a transaction, such as an order ID, total amount, net profit, order quantity or exact time.

The dimension attribute values do not directly store data, but they store the foreign key value for a row in a related dimensional table. Many rows in the fact table will reference this type of information. So, for example, it might store the sales employee ID, a date value, a product ID or a branch office ID.
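As a rough illustration (not part of the original answer), the sketch below models a tiny sales fact table and a product dimension table with pandas and runs a typical star-schema query; the table names, column names, and data are assumptions made for the example.

# A rough star-schema sketch: a fact table holds numeric measures plus a
# product_id foreign key into a small dimension table (invented data).
import pandas as pd

fact_sales = pd.DataFrame({
    "order_id":     [1001, 1002, 1003],
    "product_id":   [1, 2, 1],          # foreign keys into dim_product
    "quantity":     [3, 1, 2],          # numeric measures
    "total_amount": [30.0, 15.0, 20.0],
})

dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "name":       ["Widget", "Gadget"],
    "category":   ["Hardware", "Hardware"],
})

# A typical star-schema query: join the fact table to a dimension table,
# then aggregate a numeric measure by a dimension attribute.
report = (fact_sales.merge(dim_product, on="product_id")
                    .groupby("name")["total_amount"].sum())
print(report)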

3. Describe the DBSCAN algorithm.

Clustering analysis is an unsupervised learning method that separates data points into specific groups, such that data points in the same group have similar properties and data points in different groups have, in some sense, different properties. DBSCAN relies on a density-based notion of a cluster: a cluster is defined as a maximal set of density-connected points. DBSCAN discovers clusters of arbitrary shape in spatial databases with noise.

The DBSCAN algorithm is given below; a short code sketch follows the steps.

  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed
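As a small sketch of the algorithm in practice, the example below runs scikit-learn's DBSCAN on a few invented 2-D points; Eps and MinPts correspond to the eps and min_samples parameters.

# Minimal DBSCAN sketch with scikit-learn (toy points are invented).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # a dense group
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # another dense group
              [4.5, 4.5]])                           # an isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # points in the same cluster share a label; noise is labelled -1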

4. Explain Load manager.

Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers:

  • Load manager
  • Warehouse manager
  • Query manager

The load manager is the system component that performs all the operations necessary to support the extract and load process. It fast-loads the extracted data into a temporary data store and performs simple transformations into a structure similar to the one in the data warehouse; this process is also called ETL (Extract, Transform and Load).

The load manager performs the operations required to extract the data and load it into the database. The size and complexity of a load manager varies from one data warehouse solution to another.

Load Manager Architecture

The load manager performs the following functions (a minimal ETL sketch follows the list):

  • Extract data from the source system.
  • Fast-load the extracted data into a temporary data store.
  • Perform simple transformations into a structure similar to the one in the data warehouse.
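The sketch below is a minimal, hypothetical illustration of these three functions in Python; the file names, column names, and the transformation itself are assumptions for the example, not part of any real load manager.

# Hypothetical load-manager sketch: extract from a CSV export, apply a
# simple transformation, and fast-load into a temporary staging store.
import csv

def extract(path):
    """Extract data from a source system (here, a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Perform simple transformations toward the warehouse structure
    (assumed source columns: id, amount)."""
    return [{"customer_id": int(r["id"]),
             "amount": round(float(r["amount"]), 2)} for r in rows]

def load(rows, staging):
    """Fast-load the transformed rows into a temporary (staging) data store."""
    with open(staging, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

# Example use (assumed file names):
# load(transform(extract("source_export.csv")), "staging_sales.csv")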

4. Construct the FP tree for the following transactions.

TID Items
100 {a, b, c}
200 {b, c, d}
300 {a, c, e}
400 {b, d, f}
500 {a, b, e}
600 {b, c, f}
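A rough Python sketch of the construction is given below, assuming no minimum-support threshold is applied and that ties in the frequency ordering (b:5, c:4, a:3, d:2, e:2, f:2) are broken alphabetically.

# Rough FP-tree construction sketch for the six transactions above.
from collections import Counter

transactions = [
    ["a", "b", "c"], ["b", "c", "d"], ["a", "c", "e"],
    ["b", "d", "f"], ["a", "b", "e"], ["b", "c", "f"],
]

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

# First pass: count item supports (b:5, c:4, a:3, d:2, e:2, f:2).
support = Counter(item for t in transactions for item in t)

# Second pass: reorder each transaction by descending support and insert it
# into the tree, sharing common prefixes and incrementing node counts.
root = Node(None, None)
for t in transactions:
    ordered = sorted(t, key=lambda i: (-support[i], i))
    node = root
    for item in ordered:
        if item in node.children:
            node.children[item].count += 1
        else:
            node.children[item] = Node(item, node)
        node = node.children[item]

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # e.g. the branch b:5 -> c:3 shows the shared prefix of T100, T200, T600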

5. Explain classification and prediction. How can a decision tree be used for classification of elements (data)?

Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. The set of input data and the corresponding outputs are given to the algorithm, so the training data set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network.

Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and the corresponding numerical output values. The algorithm derives the model, or predictor, from the training dataset, and the model should produce a numerical output when new data is given. Unlike classification, this method does not use a class label; the model predicts a continuous-valued function or ordered value.

A decision tree can be used as the classifier itself: a new example is classified by tracing a path from the root of the tree down to a leaf.

Examples can be classified as follows:

  1. Look at the example’s value for the feature specified
  2. Move along the edge labeled with this value
  3. If you reach a leaf, return the label of the leaf
  4. Otherwise, repeat from step 1

So a new instance

<rainy, hot, normal, true>: ?

will be classified as “no play” (using the classic weather/“play tennis” decision tree, in which outlook is tested at the root and windy is tested under the rainy branch).
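As a small sketch, the code below encodes that classic weather tree as nested dictionaries (the tree structure and the classify helper are illustrative assumptions) and classifies the instance above by following the steps listed earlier.

# Minimal sketch: classify an example with a hand-built decision tree,
# assuming the classic weather ("play tennis") tree: outlook at the root,
# humidity tested under "sunny" and windy tested under "rainy".
tree = {
    "feature": "outlook",
    "branches": {
        "sunny":    {"feature": "humidity",
                     "branches": {"high": "no play", "normal": "play"}},
        "overcast": "play",
        "rainy":    {"feature": "windy",
                     "branches": {"true": "no play", "false": "play"}},
    },
}

def classify(node, example):
    """Follow the steps above: test the feature at the current node, move
    along the matching edge, and return the label on reaching a leaf."""
    while isinstance(node, dict):
        node = node["branches"][example[node["feature"]]]
    return node

example = {"outlook": "rainy", "temperature": "hot",
           "humidity": "normal", "windy": "true"}
print(classify(tree, example))   # -> no play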

Comprehensive Questions:

7. Consider the given training data set. Check whether the given person cheats or not using a Bayesian classifier: Refund(x, ‘Yes’) ∧ Marital Status(x, ‘Divorced’) ∧ Income(x, ‘<80K’).

ID Refund Marital Status Income Cheat
1 Yes Single >80K No
2 No Married >80K No
3 No Single <80K No
4 Yes Married >80K No
5 No Divorced >80K Yes
6 No Married <80K No
7 Yes Divorced >80K No
8 No Single >80K Yes
9 No Married <80K No
10 No Single >80K Yes
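A minimal naive Bayes sketch over exactly this table is shown below (an assumption: raw frequencies are used, with no Laplace smoothing; the rows list and score helper are illustrative names). It predicts Cheat = No for the given person.

# Naive Bayes over the 10 training records above, for the query
# Refund = Yes, Marital Status = Divorced, Income = <80K.
rows = [
    ("Yes", "Single",   ">80K", "No"),
    ("No",  "Married",  ">80K", "No"),
    ("No",  "Single",   "<80K", "No"),
    ("Yes", "Married",  ">80K", "No"),
    ("No",  "Divorced", ">80K", "Yes"),
    ("No",  "Married",  "<80K", "No"),
    ("Yes", "Divorced", ">80K", "No"),
    ("No",  "Single",   ">80K", "Yes"),
    ("No",  "Married",  "<80K", "No"),
    ("No",  "Single",   ">80K", "Yes"),
]
query = ("Yes", "Divorced", "<80K")   # Refund, Marital Status, Income

def score(label):
    """P(label) * product over attributes of P(attribute value | label)."""
    subset = [r for r in rows if r[3] == label]
    prior = len(subset) / len(rows)
    likelihood = 1.0
    for i, value in enumerate(query):
        likelihood *= sum(1 for r in subset if r[i] == value) / len(subset)
    return prior * likelihood

for label in ("Yes", "No"):
    print(label, score(label))
# P(x|Yes)P(Yes) = 0 (no cheating record has Refund = Yes or Income = <80K),
# while P(x|No)P(No) = 0.7 * 3/7 * 1/7 * 3/7 ≈ 0.018,
# so the classifier predicts Cheat = No for this person.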

 

8. Describe types of OLAP operations. Explain the steps of KDD.

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

OLAP operations are:

  • Roll-up (drill-up):

The roll-up operation performs aggregation on a data cube either by climbing up the hierarchy or by dimension reduction.

  • Drill-down:

Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data, i.e., from a higher-level summary down to a lower-level, more detailed view.

Drill-down can be performed either by:

    • Stepping down a concept hierarchy for a dimension
    • By introducing a new dimension.
  • Slice and dice

The slice operation performs a selection on one dimension of the given cube, resulting in a subcube and reducing the dimensionality of the cube; the dice operation defines a subcube by performing a selection on two or more dimensions.

  • Pivot

Pivot is also known as rotate. It rotates the data axes to view the data from different perspectives.
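As a small, hypothetical illustration of the operations above, the sketch below performs a roll-up, a slice, and a pivot on a toy sales cube using pandas; the data and column names are invented for the example.

# Toy OLAP-style operations on a small sales cube (invented data).
import pandas as pd

cube = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Pokhara", "Kathmandu", "Pokhara", "Kathmandu"],
    "sales":   [100, 150, 120, 170],
})

rollup = cube.groupby("year")["sales"].sum()          # roll-up: quarter -> year
slice_ = cube[cube["year"] == 2023]                   # slice: fix one dimension
pivot  = cube.pivot_table(index="city", columns="quarter",
                          values="sales", aggfunc="sum")  # pivot/rotate the axes
print(rollup, slice_, pivot, sep="\n\n")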

Knowledge Discovery in Databases (KDD), also known as data mining, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

Steps involved in KDD are:

  • Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant data from collection.
    • Cleaning in case of Missing values.
    • Cleaning noisy data, where noise is a random or variance error.
    • Cleaning with Data discrepancy detection and Data transformation tools.
  • Data Integration: Data integration is defined as heterogeneous data from multiple sources combined in a common source (Data Warehouse).
    • Data integration using Data Migration tools.
    • Data integration using Data Synchronization tools.
    • Data integration using the ETL (Extract, Transform, Load) process.
  • Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.
    • Data selection using Neural network.
    • Data selection using Decision Trees.
    • Data selection using Naive Bayes.
    • Data selection using Clustering, Regression, etc.
  • Data Transformation: Data Transformation is defined as the process of transforming data into appropriate form required by mining procedure.

Data Transformation is a two-step process:

    • Data Mapping: Assigning elements from source base to destination to capture transformations.
    • Code generation: Creation of the actual transformation program.
  • Data Mining: Data mining is defined as the application of intelligent techniques to extract potentially useful patterns.
    • Transforms task relevant data into patterns.
    • Decides purpose of model using classification or characterization.
  • Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given interestingness measures.
    • Find interestingness score of each pattern.
    • Uses summarization and Visualization to make data understandable by user.
  • Knowledge representation: Knowledge representation is defined as a technique which utilizes visualization tools to represent the data mining results.
    • Generate reports.
    • Generate tables.
    • Generate discriminant rules, classification rules, characterization rules, etc.
Figure: Data mining as a step in Knowledge Discovery
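As a small illustration of the Data Transformation step above, the sketch below applies min-max normalization to a toy numeric attribute; the values are invented for the example.

# Min-max normalization of a numeric attribute into [0, 1] (invented values).
values = [20_000, 35_000, 50_000, 80_000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]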

 

 
