What is information?

• Information is data that has been processed into a form that is meaningful to the receiver.

What is data mining?

• Data mining is defined as the process of extracting useful patterns and knowledge from large sets of raw data.

Is privacy of data an issue in data mining? Give reason.

• Yes. Privacy of data is a major issue in data mining because a system that violates user privacy puts the safety and security of its users at risk and, eventually, undermines trust between people.

What is a search engine?

• A search engine is an online tool that is designed to search for websites on the internet based on the user’s search query.

What happens if we train the machine with false data?

• If we train the machine with false data, it will not perform its tasks as it is supposed to, and it will draw misleading conclusions about a problem.

List any two pitfalls of data mining.

• Pitfalls of data mining are:
• Sample bias
• Sample size
• Data snooping
• Grouping

What are some complex problems in multimedia mining?

• Complex problems in multimedia mining include:
• Content-based retrieval and Similarity search
• Multidimensional Analysis
• Classification and Prediction Analysis

Give an example of noisy and inconsistent data.

• Noisy data: Salary = -5000 or name = 1234

• Inconsistent data: Age = 5 years, Birthday = 06/06/1990, and current year = 2022

What is entropy?

• Entropy is a measure of the uncertainty associated with a random variable.

Entropy = −∑ᵢ pᵢ log₂(pᵢ)
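As a quick sketch, the formula above can be evaluated directly in Python (the helper name `entropy` is mine, not a library function):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over the non-zero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin has maximum uncertainty for two outcomes: 1 bit.
print(entropy([0.5, 0.5]))  # -> 1.0
```

Note that terms with p = 0 are skipped, since 0 · log₂(0) is taken to be 0 by convention.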

Define fact table.

• A fact table is the central table in a star schema of a data warehouse. A fact table stores quantitative information for analysis and is often denormalized.

Exercise Problems:

2. What do you mean by capacity planning? Explain how CPU requirement for data warehouse is calculated.

Capacity planning in the data warehouse environment centers around planning disk storage and processing resources. Capacity planning is as important for the data warehouse environment as it was (and still is!) for the operational environment. Capacity planning is done to save money and to keep tasks running smoothly.

There are several aspects to the data warehouse environment that make capacity planning for the data warehouse a unique exercise:

• The first factor is that the workload for the data warehouse environment is very variable.
• A second factor making capacity planning for the data warehouse a risky business is that the data warehouse normally entails much more data than was ever encountered in the operational environment.
• A third factor making capacity planning for the data warehouse environment a nontraditional exercise is that the data warehouse environment and the operational environments do not mix under the stress of a workload of any size at all.

The calculations for space are almost always done exclusively for the current detailed data in the data warehouse. The reason why the other levels of data are not included in this analysis is that:

• They consume much less storage than the current detailed level of data, and
• They are much harder to identify.

Therefore, the considerations of capacity planning for disk storage center around the current detailed level.

In order to estimate the processor requirement for the data warehouse, the work passing through the data warehouse must be classified into one of three categories.

These three categories are:

• background processing
• predictable DSS processing
• unpredictable DSS processing

The parameters of interest for the data warehouse designer (for both the background processing and the predictable DSS processing) are:

• The number of times the process will be run,
• The number of I/Os the process will use,
• Whether there is an arrival peak to the processing,
• The expected response time.

These metrics can be arrived at by examining the pattern of calls made to the DBMS and the interaction with data managed under the DBMS.

3. Give the following training set with class “Play Tennis” for decision tree classifier:

Day   Outlook    Temperature   Humidity   Wind     Play Tennis (Class)
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes

Calculate information gain for attribute Wind.
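One way to work this exercise is to script the calculation. The sketch below hard-codes the (Wind, Play Tennis) pairs from the table; the helper name `entropy` is mine:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

# (Wind, Play Tennis) pairs for D1..D5 from the table above.
rows = [("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"),
        ("Weak", "Yes"), ("Weak", "Yes")]

labels = [play for _, play in rows]
base = entropy(labels)  # H(S) for 3 Yes / 2 No

# Gain(S, Wind) = H(S) - sum over values of |Sv|/|S| * H(Sv)
gain = base
for value in set(wind for wind, _ in rows):
    subset = [play for wind, play in rows if wind == value]
    gain -= len(subset) / len(rows) * entropy(subset)

print(round(base, 3), round(gain, 3))  # -> 0.971 0.322
```

Here H(S) ≈ 0.971, H(Weak) ≈ 0.811 with weight 4/5, and H(Strong) = 0 with weight 1/5, so Gain(S, Wind) ≈ 0.971 − 0.649 ≈ 0.322.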

4. Define data processing. Describe its techniques.

Data processing is, generally, “the collection and manipulation of items of data to produce meaningful information.” Data processing is done to improve the quality of data and to remove noise and impurities.

Techniques of data processing are:

• Data cleaning:
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction (wavelet transforms, PCA, attribute subset selection)
• Numerosity reduction (parametric [model-based], non-parametric)
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
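As an illustration of one transformation step, here is a minimal min-max normalization sketch (the function name and sample salary values are mine, chosen for the example):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 45000, 90000]
print(min_max_normalize(salaries))  # -> [0.0, 0.25, 1.0]
```

Rescaling attributes to a common range like [0, 1] keeps large-valued attributes (e.g. salary) from dominating small-valued ones (e.g. age) in distance-based methods.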

5. Write the algorithm for K nearest neighbor algorithm.

K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. K-NN stores all the available data and classifies a new data point based on similarity; this means that when new data appears, it can easily be classified into a suitable category. K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance of K number of neighbors

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each category.

Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.

Step-6: Our model is ready.
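The six steps above can be sketched in plain Python (the helper name and toy dataset are mine, not a library API):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` following the steps above.

    `train` is a list of ((features...), label) pairs.
    """
    # Steps 1-3: compute Euclidean distances and keep the k nearest cases.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Steps 4-5: count labels among the k neighbors; the majority wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # -> A
```

Because all work happens at query time (there is no training phase beyond storing the data), K-NN is often called a lazy learner.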

6. Explain OLAP operations with examples.

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.

Some popular OLAP operations are:

• Roll-up (drill-up): –

The roll-up operation performs aggregation on a data cube either by climbing up the hierarchy or by dimension reduction.

Example:

Delhi, New York, Patiala and Los Angeles win 5, 2, 3 and 5 medals respectively. In this example, we roll up on Location from cities to countries.

More detailed data to less detailed data

• Drill-down: –

Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data, i.e., from a higher-level summary down to lower-level detail.

Drill-down can be performed either by:-

1. Stepping down a concept hierarchy for a dimension
2. By introducing a new dimension.

Consider an example:-

Drill-down on Location from countries to cities.

Less detailed data to more detailed data.

• Slice and dice

The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. It reduces the dimensionality of the cube by one.

For example, selecting the sub-cube where Medal = 5.

The dice operation defines a sub-cube by performing a selection on two or more dimensions.

For example, selecting the sub-cube where Medal = 3 or Location = New York.

• Pivot

Pivot is also known as rotate. It rotates the data axis to view the data from different perspectives.
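For illustration only, roll-up, slice, and dice can be mimicked over a flat fact list in plain Python; the medal figures echo the roll-up example above, and all variable names are mine:

```python
from collections import defaultdict

# City-level fact rows: (country, city, year, medals).
# Hypothetical data echoing the medal example above.
facts = [
    ("India", "Delhi",       2020, 5),
    ("USA",   "New York",    2020, 2),
    ("India", "Patiala",     2021, 3),
    ("USA",   "Los Angeles", 2021, 5),
]

# Roll-up: aggregate the Location dimension from city up to country.
rollup = defaultdict(int)
for country, _city, _year, medals in facts:
    rollup[country] += medals
print(dict(rollup))  # -> {'India': 8, 'USA': 7}

# Slice: fix a single dimension (year == 2020) to get a sub-cube.
slice_2020 = [row for row in facts if row[2] == 2020]

# Dice: select on two or more dimensions at once.
dice = [row for row in facts if row[3] >= 3 and row[0] == "India"]
```

A real OLAP server does this over pre-aggregated cubes rather than row scans, but the shape of each operation is the same.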

Comprehensive Questions:

7. Use the Apriori algorithm using candidate generation for finding frequent itemset and then evaluate the valid association rules:

TID    List of items
T100   A, C, D
T200   B, C, E
T300   A, B, C, E
T400   B, E
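A minimal level-wise Apriori sketch for this database is below; the function name is mine, and min_support = 2 is an assumed threshold since the exercise does not state one. Rule evaluation against a confidence threshold would then use these same support counts:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with candidate generation/pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k = 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count the support of each candidate in one pass.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates; prune any whose k-subset is infrequent.
        k += 1
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k
                           and all(frozenset(s) in level
                                   for s in combinations(a | b, k - 1))})
    return frequent

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
freq = apriori(db, min_support=2)
print(freq[frozenset("BCE")])  # -> 2
```

With min_support = 2 this yields the frequent itemsets {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E}, and {B,C,E}; for example, a rule such as B ∧ C ⇒ E would have confidence support(BCE)/support(BC) = 2/2 = 100%.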

8. Define warehouse manager. Write the functions of warehouse manager.

The data warehouse is the heart of the architected environment, and is the foundation of all DSS processing. The job of the DSS analyst in the data warehouse environment is massively easier than in the classical legacy environment because there is a single integrated source of data (the data warehouse) and because the granular data in the data warehouse is easily accessible.

It is the system component that performs analysis of data to ensure consistency. The data from various sources and temporary storage are merged into data warehouse by the warehouse manager. The job of backing-up and archiving data as well as creation of index is performed by this manager.

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of warehouse managers vary between specific solutions.