What is information?

• Information is data that has been processed into a form that is meaningful to the receiver.

What is data mining?

• Data mining is defined as the process of extracting useful patterns and knowledge from large sets of raw data.

Is privacy of data an issue in data mining? Give reason.

• Yes. Privacy of data is a major issue in data mining because a system that violates user privacy puts the safety and security of its users at risk and, eventually, undermines trust between people.

What is a search engine?

• A search engine is an online tool that is designed to search for websites on the internet based on the user’s search query.

What happens if we train the machine with false data?

• If we train the machine with false data, it will not perform its tasks as it is supposed to, and it will draw misleading conclusions about a problem.

List any two pitfalls of data mining.

• Pitfalls of data mining are:
• Sample bias
• Sample size
• Data snooping
• Grouping

What are some complex problems in multimedia mining?

• Complex problems in multimedia mining include:
• Content-based retrieval and Similarity search
• Multidimensional Analysis
• Classification and Prediction Analysis

Give an example of noisy and inconsistent data.

• Noisy data: Salary = -5000 or name = 1234

• Inconsistent data: Age = 5 years, Birthday = 06/06/1990, and current year = 2022

What is entropy?

• Entropy is a measure of the uncertainty associated with a random variable.

Entropy = −∑ᵢ pᵢ log₂(pᵢ)
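As a quick sketch, the formula above can be evaluated directly in Python (the helper name `entropy` is mine, not a library function):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over the non-zero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin has maximum uncertainty for two outcomes: 1 bit.
print(entropy([0.5, 0.5]))  # -> 1.0
```

Note that terms with p = 0 are skipped, since 0 · log₂(0) is taken to be 0 by convention.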

Define fact table.

• A fact table is the central table in a star schema of a data warehouse. A fact table stores quantitative information for analysis and is often denormalized.

Exercise Problems:

2. What do you mean by capacity planning? Explain how CPU requirement for data warehouse is calculated.

Capacity planning in the data warehouse environment centers around planning disk storage and processing resources. Capacity planning is as important for the data warehouse environment as it was (and still is!) for the operational environment. Capacity planning is done to save money and to keep tasks running smoothly.

There are several aspects to the data warehouse environment that make capacity planning for the data warehouse a unique exercise:

• The first factor is that the workload for the data warehouse environment is very variable.
• A second factor making capacity planning for the data warehouse a risky business is that the data warehouse normally entails much more data than was ever encountered in the operational environment.
• A third factor making capacity planning for the data warehouse environment a nontraditional exercise is that the data warehouse environment and the operational environments do not mix under the stress of a workload of any size at all.

The calculations for space are almost always done exclusively for the current detailed data in the data warehouse. The reason why the other levels of data are not included in this analysis is that:

• They consume much less storage than the current detailed level of data, and
• They are much harder to identify.

Therefore, the considerations of capacity planning for disk storage center around the current detailed level.

In order to estimate the processor requirement for the data warehouse, the work passing through the data warehouse must be classified into one of three categories.

These three categories are:

• background processing
• predictable DSS processing
• unpredictable DSS processing

The parameters of interest for the data warehouse designer (for both the background processing and the predictable DSS processing) are:

• The number of times the process will be run,
• The number of I/Os the process will use,
• Whether there is an arrival peak to the processing,
• The expected response time.

These metrics can be arrived at by examining the pattern of calls made to the DBMS and the interaction with data managed under the DBMS.

3. Give the following training set with class “Play Tennis” for decision tree classifier:

Day   Outlook    Temperature   Humidity   Wind     Play Tennis (Class)
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes

Calculate information gain for attribute Wind.
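One way to work this exercise is to script the calculation. The sketch below hard-codes the (Wind, Play Tennis) pairs from the table; the helper name `entropy` is mine:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

# (Wind, Play Tennis) pairs for D1..D5 from the table above.
rows = [("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"),
        ("Weak", "Yes"), ("Weak", "Yes")]

labels = [play for _, play in rows]
base = entropy(labels)  # H(S) for 3 Yes / 2 No

# Gain(S, Wind) = H(S) - sum over values of |Sv|/|S| * H(Sv)
gain = base
for value in set(wind for wind, _ in rows):
    subset = [play for wind, play in rows if wind == value]
    gain -= len(subset) / len(rows) * entropy(subset)

print(round(base, 3), round(gain, 3))  # -> 0.971 0.322
```

Here H(S) ≈ 0.971, H(Weak) ≈ 0.811 with weight 4/5, and H(Strong) = 0 with weight 1/5, so Gain(S, Wind) ≈ 0.971 − 0.649 ≈ 0.322.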

4. Define data processing. Describe its techniques.

Data processing is, generally, “the collection and manipulation of items of data to produce meaningful information.” Data processing is done to improve the quality of data and to remove noise and impurities.

Techniques of data processing are:

• Data cleaning:
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction (wavelet transforms, PCA, attribute subset selection)
• Numerosity reduction (parametric [model-based], non-parametric)
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
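As an illustration of one transformation step, here is a minimal min-max normalization sketch (the function name and sample salary values are mine, chosen for the example):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 45000, 90000]
print(min_max_normalize(salaries))  # -> [0.0, 0.25, 1.0]
```

Rescaling attributes to a common range like [0, 1] keeps large-valued attributes (e.g. salary) from dominating small-valued ones (e.g. age) in distance-based methods.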

5. Write the algorithm for K nearest neighbor algorithm.

K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. K-NN stores all the available data and classifies a new data point based on similarity; this means that when new data appears, it can easily be classified into a suitable category. K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance of K number of neighbors

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each category.

Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.

Step-6: Our model is ready.
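The six steps above can be sketched in plain Python (the helper name and toy dataset are mine, not a library API):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` following the steps above.

    `train` is a list of ((features...), label) pairs.
    """
    # Steps 1-3: compute Euclidean distances and keep the k nearest cases.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Steps 4-5: count labels among the k neighbors; the majority wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # -> A
```

Because all work happens at query time (there is no training phase beyond storing the data), K-NN is often called a lazy learner.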

6. Explain OLAP operations with examples.

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.

Some popular OLAP operations are:

• Roll-up (drill-up): –

The roll-up operation performs aggregation on a data cube either by climbing up the hierarchy or by dimension reduction.

Example:

Delhi, New York, Patiala and Los Angeles win 5, 2, 3 and 5 medals respectively. In this example, we roll up on Location from cities to countries.

More detailed data to less detailed data

• Drill-down: –

Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data, i.e., from a higher-level summary down to lower-level detail.

Drill-down can be performed either by:-

1. Stepping down a concept hierarchy for a dimension
2. By introducing a new dimension.

Consider an example:-

Drill-down on Location from countries to cities.

Less detailed data to more detailed data.

• Slice and dice

The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. It reduces the dimensionality of the cube by one.

For example, selecting the sub-cube where Medal = 5.

The dice operation defines a sub-cube by performing a selection on two or more dimensions.

For example, selecting the sub-cube where Medal = 3 or Location = New York.

• Pivot

Pivot is also known as rotate. It rotates the data axis to view the data from different perspectives.
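For illustration only, roll-up, slice, and dice can be mimicked over a flat fact list in plain Python; the medal figures echo the roll-up example above, and all variable names are mine:

```python
from collections import defaultdict

# City-level fact rows: (country, city, year, medals).
# Hypothetical data echoing the medal example above.
facts = [
    ("India", "Delhi",       2020, 5),
    ("USA",   "New York",    2020, 2),
    ("India", "Patiala",     2021, 3),
    ("USA",   "Los Angeles", 2021, 5),
]

# Roll-up: aggregate the Location dimension from city up to country.
rollup = defaultdict(int)
for country, _city, _year, medals in facts:
    rollup[country] += medals
print(dict(rollup))  # -> {'India': 8, 'USA': 7}

# Slice: fix a single dimension (year == 2020) to get a sub-cube.
slice_2020 = [row for row in facts if row[2] == 2020]

# Dice: select on two or more dimensions at once.
dice = [row for row in facts if row[3] >= 3 and row[0] == "India"]
```

A real OLAP server does this over pre-aggregated cubes rather than row scans, but the shape of each operation is the same.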

Comprehensive Questions:

7. Use the Apriori algorithm using candidate generation for finding frequent itemset and then evaluate the valid association rules:

TID    List of items
T100   A, C, D
T200   B, C, E
T300   A, B, C, E
T400   B, E
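A minimal level-wise Apriori sketch for this database is below; the function name is mine, and min_support = 2 is an assumed threshold since the exercise does not state one. Rule evaluation against a confidence threshold would then use these same support counts:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with candidate generation/pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k = 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count the support of each candidate in one pass.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates; prune any whose k-subset is infrequent.
        k += 1
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k
                           and all(frozenset(s) in level
                                   for s in combinations(a | b, k - 1))})
    return frequent

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
freq = apriori(db, min_support=2)
print(freq[frozenset("BCE")])  # -> 2
```

With min_support = 2 this yields the frequent itemsets {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E}, and {B,C,E}; for example, a rule such as B ∧ C ⇒ E would have confidence support(BCE)/support(BC) = 2/2 = 100%.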

8. Define warehouse manager. Write the functions of warehouse manager.

The data warehouse is the heart of the architected environment, and is the foundation of all DSS processing. The job of the DSS analyst in the data warehouse environment is massively easier than in the classical legacy environment because there is a single integrated source of data (the data warehouse) and because the granular data in the data warehouse is easily accessible.

It is the system component that performs analysis of data to ensure consistency. The data from various sources and temporary storage are merged into data warehouse by the warehouse manager. The job of backing-up and archiving data as well as creation of index is performed by this manager.

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of warehouse managers vary between specific solutions.