What is Data Mining?
Data mining (DM) is the process of discovering trends and patterns in large sets of data. It applies mathematical and statistical analysis to surface patterns and trends that already exist in the data. These patterns are usually hard to detect with traditional methods of data analysis, either because the associations are too complex or because the volume of data is too large. For example, in the banking sector, data mining helps in accurate customer profiling.
A data mining function can be used to identify customers who have not made any transaction in the past year. This helps the bank to identify the characteristics of such a group, which can then inform various strategic decisions.
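The banking example can be sketched in a few lines of Python. This is only an illustration: the function name `inactive_customers`, the customer IDs and the dates are all made up, and a real bank would run such a query against its data warehouse rather than an in-memory dictionary.

```python
from datetime import date, timedelta

def inactive_customers(last_transaction, today, days=365):
    """Return IDs of customers with no transaction in the past `days` days."""
    cutoff = today - timedelta(days=days)
    return sorted(cid for cid, last in last_transaction.items() if last < cutoff)

# Hypothetical data: customer ID -> date of most recent transaction.
last_transaction = {
    "C001": date(2024, 5, 10),
    "C002": date(2022, 11, 2),
    "C003": date(2023, 1, 15),
}

inactive = inactive_customers(last_transaction, today=date(2024, 6, 1))
print(inactive)  # ['C002', 'C003']
```

The resulting group can then be profiled further, for instance by age band or account type.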
Table of Contents
- 1 What is Data Mining?
- 2 Data Mining Parameters
- 3 Techniques for Data Mining
- 4 How Data Mining Works?
- 5 Architecture of Data Mining
- 6 Various Risks Involved in Data Mining
- 7 Advantages of Data Mining
- 8 Disadvantages of Data Mining
- 9 Data Mining vs Predictive Analytics
Data mining is accomplished by building models. A model is produced by running an algorithm over a set of data. Data mining models can be useful in specific scenarios, such as the following:
- Forecasting: Assessing future sales trends, predicting server traffic loads or server downtime.
- Determining risk and probability: Selecting customers for targeted marketing, determining the break-even point for risk scenarios and assigning probabilities to outcomes based on the trends and patterns generated through data mining.
- Providing recommendations: Recommending the products that are most likely to be sold.
- Finding sequences: Analysing customer choices in a shopping basket and using the information for predicting subsequent choices.
- Grouping: Segregating customers or events into clusters of related items, analysing and predicting associations and sequences.
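As a rough illustration of the forecasting scenario, the sketch below predicts the next value of a series as the average of its most recent observations. The function name, the window size and the sales figures are assumptions chosen for the example; production forecasting models are usually far more elaborate.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Hypothetical monthly sales figures.
monthly_sales = [100, 110, 120, 130]
forecast = moving_average_forecast(monthly_sales)
print(forecast)  # 120.0
```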
Data Mining Parameters
Data mining uses processes, based on parameters and rules, to extract critical information from vast amounts of data. Models may be parametric or non-parametric. In a parametric model, the form of the data distribution is known or assumed; a non-parametric model makes no such assumption.
A model may have both hyperparameters and parameters. Hyperparameters, such as the train-test split ratio, the learning rate and the choice of optimisation algorithm, are set before training. Parameters are usually weights that are estimated from the data during training.
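The difference between hyperparameters and parameters can be seen in a minimal sketch. Here the learning rate and the number of epochs are hyperparameters fixed in advance, while the weight `w` and bias `b` are parameters estimated from the data. The values and the toy data are assumptions for illustration only.

```python
# Hyperparameters: chosen by the analyst before training (illustrative values).
LEARNING_RATE = 0.05
EPOCHS = 2000

def fit_line(xs, ys, learning_rate=LEARNING_RATE, epochs=EPOCHS):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: estimated from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Changing `LEARNING_RATE` alters how quickly (or whether) training converges, but it is never learned from the data itself.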
Data mining parameters include the following:
- Sequence or path analysis: This technique identifies patterns in which one event leads to a subsequent event. For example, consumers may ask for a backpack or carry bag depending on the type and quantity of items they are buying.
- Classification: This technique assigns stored data to groups and surfaces previously unknown facts. For example, a restaurant could mine its customer data to identify when the most customers visit and what they order. Based on this information, special daily offers can be introduced to attract more customers and increase revenue.
- Forecasting: This technique discovers patterns in data that can lead to practicable predictions about the future. For example, life insurance companies frame policies on the basis of predictions about life expectancy.
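Sequence or path analysis can be illustrated by counting how often one event is immediately followed by another. The sketch below is a simplified stand-in for real sequence mining; the function name and the shopping sessions are hypothetical.

```python
from collections import Counter

def transition_counts(sessions):
    """Count how often one event is immediately followed by another."""
    counts = Counter()
    for events in sessions:
        # Pair each event with the one that follows it in the same session.
        for current, following in zip(events, events[1:]):
            counts[(current, following)] += 1
    return counts

# Hypothetical ordered events from four shopping sessions.
sessions = [
    ["groceries", "carry bag"],
    ["books", "backpack"],
    ["groceries", "snacks", "carry bag"],
    ["groceries", "carry bag"],
]
counts = transition_counts(sessions)
print(counts.most_common(1))  # [(('groceries', 'carry bag'), 2)]
```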
Thus, data mining systems enable us to deal with large amounts of data, and plan wisely. Data mining is concerned with uncovering information, and the organisation needs to have dedicated staff to ensure that the cost of unearthing this information is lower than the benefits it delivers.
Uncertainty and risk are two different things. Uncertainty refers to the possibility of getting a result that is not expected in the normal functioning of an organisation.
On the other hand, risk refers to conditions or circumstances that change around the subject and can, in turn, affect the underlying pattern. Commercial data mining is exposed to some of these risks. Using data mining techniques, an organisation can build a strong model for predicting its prospects in launching a product.
If, at the time of the product launch, a competitor launches a similar product at a much lower price, there is a high chance that the campaign will underperform, however sound the data mining model. This is the kind of risk that any data mining business model can face.
Techniques for Data Mining
Some techniques used for data mining are as follows:
Association
This technique identifies patterns in which two or more events are interconnected.
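One common way to quantify how strongly events are interconnected is through support and confidence, two standard association-rule measures. The sketch below computes both over a handful of made-up transactions; the function names and the data are assumptions for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical purchase baskets.
transactions = [
    {"sleeping bag", "backpack"},
    {"sleeping bag", "backpack", "tent"},
    {"sleeping bag"},
    {"tent"},
]

rule_support = support(transactions, {"sleeping bag", "backpack"})
rule_confidence = confidence(transactions, {"sleeping bag"}, {"backpack"})
print(rule_support, round(rule_confidence, 3))
```

Here the rule "sleeping bag implies backpack" holds in half of all baskets and in two out of three baskets that contain a sleeping bag.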
Clustering
This technique groups data items on the basis of consumer preferences or logical relationships. For example, data can be mined to identify patterns in consumer behaviour or market trends.
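Grouping of this kind is often done with k-means, shown here in a deliberately small one-dimensional form. The choice of k-means, the starting centroids and the spending figures are assumptions made for the example; the text does not prescribe a specific algorithm.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids)  # [11.0, 100.0]
```

The two centroids separate low spenders from high spenders without any labels being supplied.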
Classification
Stored data is used to assign items to predetermined groups. For example, a restaurant chain can mine customer purchase data to ascertain what customers usually order and when they visit. This information can be used to boost sales by introducing daily specials.
Regression
This technique helps determine the nature of the relationships between variables in a dataset. In some cases the associations may be causal, while in others they may be mere correlations. Regression is a simple white-box approach for determining how variables are related. Both forecasting and data modelling employ regression methods.
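The white-box nature of regression is easy to see in simple linear regression, where the fitted slope and intercept can be read directly. The sketch below uses the closed-form least-squares solution; the variable names and the data are invented for illustration.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical advertising spend and resulting sales.
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = least_squares(ad_spend, sales)
print(slope, intercept)  # 2.0 1.0
```

A slope of 2.0 says that each extra unit of spend is associated with two extra units of sales in this toy data; whether that link is causal is a separate question.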
Sequential patterns
Data is mined to predict behaviour trends and patterns. For example, a retailer of outdoor equipment can estimate the likelihood of a customer purchasing a backpack on the basis of earlier purchases of sleeping bags.
Prediction
Prediction is one of four branches of analytics and a particularly strong element of data mining. Predictive analytics works by extending trends observed in current or historical data into the future. As a result, it gives enterprises insight into the patterns that may emerge in their data. Predictive analytics can be carried out in a variety of ways.
Decision trees
A decision tree is a type of prediction model that allows businesses to mine data effectively. Although a decision tree is technically a machine learning model, it is commonly described as a white-box approach because of its simplicity: users can easily see how the data inputs affect the outputs. A random forest is a predictive model created by combining multiple decision trees.
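To show why a decision tree is easy to inspect, the sketch below finds a single split (a "decision stump") by minimising Gini impurity, one common splitting criterion. A full tree would apply such splits recursively. The function names and the age/purchase data are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best rule `value <= threshold`."""
    best = None
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical customers: age and whether they bought the product.
ages = [22, 25, 30, 41, 46, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
split = best_split(ages, bought)
print(split)  # (30, 0.0)
```

The learned rule reads directly as "customers older than 30 bought the product", which is the kind of transparency the white-box label refers to.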
Outlier analysis or anomaly analysis
Outlier detection identifies irregularities in datasets. When companies discover anomalies in their data, it becomes easier to understand why they occur and to plan for future occurrences in order to meet corporate goals.
For example, if credit card transaction volumes rise at a particular time of day, a business can investigate the cause and use that insight to optimise its sales for the remainder of the day.
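A simple way to flag such anomalies is the z-score: how many standard deviations a value lies from the mean. The sketch below is one of several possible approaches; the threshold of 2.0 and the hourly transaction counts are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transactions per hour; the last hour shows a spike.
hourly_transactions = [100, 102, 98, 101, 99, 103, 97, 250]
outliers = zscore_outliers(hourly_transactions)
print(outliers)  # [250]
```

Because an extreme value also inflates the mean and standard deviation, robust alternatives such as the interquartile range are often preferred on small samples.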
Neural networks
A neural network is a type of machine learning model that is frequently used in artificial intelligence and deep learning. Neural networks are among the most accurate machine learning models in use today. Although a neural network can be a useful tool in data mining, organisations should exercise caution when employing one, since some neural network models are so complex that it is difficult to understand how they arrived at a result.
How Data Mining Works?
DM includes a number of operations, each of which is supported by a variety of techniques, such as decision trees, rule-based induction, neural networks and clustering. The derived results are evaluated repeatedly against the developed models to minimise potential errors. In real-world applications, knowledge extraction involves the combined use of several data mining techniques and operations.
Data mining consists of the following six steps:
- Data is pulled from the source systems into the data warehouse through the ETL layer, where E stands for extract, T for transform and L for load.
- Based on the query generated, the data is explored for the matching patterns.
- The output is then forwarded to the analyst, who prepares a new set of questions to elaborate on certain aspects of the findings or to refine the search. This is an iterative search process.
- Once the analyst is through with the task, the final report generated by the data mining system is forwarded for human interpretation.
- The data is then represented in a functional format, such as a report, graph or table.
- Finally, it is passed on to the decision-makers to take appropriate action based on such findings.
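The extract, transform and load stage followed by a query can be sketched end to end in miniature. Everything here is hypothetical: the function names, the record fields and the use of a plain list as the "warehouse" are stand-ins for real ETL tooling and storage.

```python
def extract(raw_rows):
    """Extract: keep only rows that have every field we need."""
    return [r for r in raw_rows if r.get("customer") and r.get("amount") is not None]

def transform(rows):
    """Transform: normalise customer names and convert amounts to floats."""
    return [
        {"customer": r["customer"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: append the cleaned rows to the in-memory 'warehouse'."""
    warehouse.extend(rows)
    return warehouse

def query_total_by_customer(warehouse):
    """Explore: total spend per customer, a simple stand-in for a mining query."""
    totals = {}
    for row in warehouse:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

# Hypothetical raw records with inconsistent formatting and a missing amount.
raw = [
    {"customer": " Asha ", "amount": "120.50"},
    {"customer": "asha", "amount": "30"},
    {"customer": "Ben", "amount": None},
    {"customer": "Ben", "amount": "75"},
]
warehouse = load(transform(extract(raw)), [])
report = query_total_by_customer(warehouse)
print(report)  # {'asha': 150.5, 'ben': 75.0}
```

In practice the analyst would inspect such a report, refine the query and repeat, before the findings are formatted and passed to decision-makers.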
Architecture of Data Mining
Data mining extracts information from a data set and transforms it into a comprehensible structure for further use. The efficiency with which data is mined is governed by the architecture of the data mining system. Both the architecture and the algorithms play a significant role in the mining process. A data mining system integrates a variety of data mining algorithms to cater to the needs of different customers.
The different layers of the data mining process are explained as follows:
- Graphical user interface: It acts as an interface between the user and the data mining software. This layer permits the user to access the data stored in a data warehouse by means of different tools and queries.
- Pattern evaluation module: It keeps track of the interest measures based on prior executions and interacts with the data mining engine to guide the search accordingly.
- Data mining engine: It consists of a set of modules to execute various tasks, such as classification, association, characterisation and cluster analysis. The main function of a data mining engine is to schedule and execute tasks.
- Database or data warehouse server: It fetches the relevant data in sync with the user request.
- Database/data warehouse: It collects data from various data sources. This data is then subjected to data cleaning and data integration techniques.
- Knowledge base: It carries domain-specific knowledge to evaluate the acceptance of the resulting patterns or to guide the search. It works on the basis of developing hierarchies or arranging attributes into various abstraction levels.
The data generated through business transactions is stored in databases and/or data warehouse systems. To provide analytical reports that offer insight into business processes, the data mining system must be coupled with these databases and data warehouse systems.
Various Risks Involved in Data Mining
- Misinterpretation of results: When a result is interpreted in a context different from the one in which the data was collected.
- Privacy issues: When personal information, such as credit card details or medical records, is disclosed.
- False discoveries: When a conclusion is drawn on the basis of inaccurate data.
- Discriminating models: When a decision is biased because it is based on discriminatory factors, such as selecting a candidate for a job on the basis of gender, origin or ethnic background.
- Information asymmetry: When a decision is based on data collected from different sources whose sampling methods and parameters vary.
Advantages of Data Mining
Data mining aims to extract knowledge from data warehouses. The data from a database or data warehouse is first sorted to prepare the target data and then analysed to uncover the structure, correlations and meaning it contains.
The following are some advantages of data mining:
Marketing and retail
Data mining supports advertising and marketing professionals by giving them useful and accurate information about trends in their clients’ purchasing behaviour. Based on these trends, marketers can target their customers with more precision. Data mining may also help these professionals identify products and services that are less popular with their customers. On this basis, marketers can make suggestions or recommendations to their customers and enhance their shopping experience.
Finance and banking
Data mining aids financial institutions in areas such as credit default and loan delivery. It can also support credit card issuers in detecting potentially fraudulent credit card transactions.
Law enforcement
Data mining assists law enforcement agencies in identifying and catching criminal suspects by investigating trends in location, habits, crime type and other behaviour patterns.
Research
Data mining supports researchers by speeding up their data analysis, thus giving them more time to work on other projects.
Manufacturing
Data mining is widely applied in the manufacturing sector to determine the optimal range of control parameters. These control parameters are then used to manufacture products of the desired quality.
Government
Data mining supports government agencies by extracting and analysing records of financial transactions. For example, it helps banks to discover patterns that can indicate money laundering or other criminal activities.
Disadvantages of Data Mining
The disadvantages of data mining are as follows:
Privacy issues
Personal privacy has long been a major concern, despite the wide use of the Internet and its services across organisations. In recent years, the concern has grown significantly in view of data leaks from trusted organisations. Owing to privacy concerns, some people avoid shopping on the Internet.
There is a constant apprehension that customers’ personal information may be accessed and used unethically.
Security issues
Although companies have access to a considerable amount of personal information online, many do not have sufficient security systems in place to protect it.
Misuse of information/inaccurate information
Patterns obtained through data mining are intended to be used for marketing or other ethical purposes, but there is always a danger of them being misused. Unethical businesses or individuals may obtain this information through data mining and use it to harm individuals, groups or society.
Data Mining vs Predictive Analytics
Predictive analytics is the process of predicting possible outcomes using machine-learning techniques. Data mining is the process of discovering trends and patterns in large sets of data. Some of the differences between the two are shown in the table below:
|Data Mining|Predictive Analytics|
|Data mining refers to the act of analysing and identifying patterns in massive amounts of data stored in a company’s data warehouse.|Predictive analytics is used to analyse data in order to predict future occurrences.|
|It tries to obtain patterns and trends that already exist in the data.|It tries to forecast on the basis of previous data and scenarios.|
|Effective data mining necessitates a strong mathematical foundation. As a result, machine learning engineers and statisticians are ideally equipped for data mining tasks.|Predictive analytics necessitates a thorough understanding of business ideas as well as subject expertise. These jobs are best suited for business analysts and domain specialists.|