Data mining, the process of digging through data to discover hidden connections and predict future trends, has a long history. It is about identifying patterns and trends in information, and about processing data so that we can validate findings by applying the detected patterns to new subsets of data. Data mining has been around for many years, and because the volume of data keeps rising (big data), it is more prevalent than ever. Sometimes referred to as “knowledge discovery in databases,” the term “data mining” wasn’t coined until the 1990s. But its foundation comprises three intertwined scientific disciplines: statistics (the numeric study of data relationships), artificial intelligence (human-like intelligence displayed by software and/or machines) and machine learning (algorithms that can learn from data to make predictions). What was old is new again, as data mining technology keeps evolving to keep pace with the limitless potential of big data and affordable computing power. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Why is data mining important in today’s society? It is important because it allows us to sift through all the chaotic and repetitive noise in our data, understand what is relevant, make good use of that information to assess likely outcomes, and accelerate the pace of making informed decisions.

One of the most important tasks in data mining is choosing the correct technique. The right choice depends on the type of business and the type of problem, and it improves both the accuracy and the cost effectiveness of the analysis. There are a lot of data mining techniques, but I will discuss only one: correlation.

I chose correlation as the data modelling technique because it is relatively simple to construct, run and interpret, thus serving as an easy starting point upon which to build. Future models will become more complex, but continuing to develop your skills in RapidMiner and getting comfortable with the tools will make the more complex models easier for you to achieve as we move forward.

Correlation is a statistical measure of association between two variables: it tells us how strong the relationships are between attributes in a data set. The two most popular correlation coefficients are Spearman’s correlation coefficient rho and Pearson’s product-moment correlation coefficient. Spearman’s technique is used for calculating a correlation coefficient for ordinal data, while Pearson’s technique is used for interval or ratio-type data. Correlation is often used as a preliminary technique to discover relationships between variables. Pearson’s coefficient in particular measures the linear relationship between two variables.

Techniques in Determining Correlation
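
As a rough illustration of the two coefficients named above, here is a minimal Python sketch using only the standard library. The function names (`pearson`, `ranks`, `spearman`) and the `hours`/`score` data are hypothetical, made up for this example. Spearman's rho is computed here as the Pearson coefficient of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: score always rises with hours, but not in a straight line.
hours = [1, 2, 3, 4, 5, 6]
score = [h ** 3 for h in hours]

print("Pearson: ", round(pearson(hours, score), 3))
print("Spearman:", round(spearman(hours, score), 3))
```

Because the made-up scores always rise with the hours, the rank-based coefficient reports a perfect association, while the Pearson coefficient comes out somewhat lower because the relationship is not a straight line.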

There are several different correlation techniques. The Survey System’s optional Statistics Module includes the most common type, called the Pearson or product-moment correlation. The module also includes a variation on this type called partial correlation. The latter is useful when you want to look at the relationship between two variables while removing the effect of one or two other variables.
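
To make the idea of partial correlation concrete, here is a small Python sketch. It is not the Statistics Module's implementation. It uses the standard first-order formula for a single control variable, and the function names and the three data series are hypothetical, constructed so that `x` and `y` are related only through `z`.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    First-order formula (one control variable only):
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: z drives both x and y; the leftover variation in x
# and in y was chosen to be unrelated to each other and to z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 1, 4, 3, 4, 7, 6, 9]

print("plain correlation of x and y:", round(pearson(x, y), 3))
print("partial correlation given z: ", round(partial_correlation(x, y, z), 3))
```

In this made-up data the plain correlation between `x` and `y` is high, but it essentially vanishes once `z` is held constant, which is the situation partial correlation is meant to expose. Controlling for two variables at once needs a higher-order formula that this sketch does not cover.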

Like all statistical techniques, correlation is only appropriate for certain kinds of data. Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson product-moment correlation coefficient. It ranges from -1 to +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive, which means that higher levels of one variable are associated with higher levels of the other, or negative, which means that higher levels of one variable are associated with lower levels of the other.

According to Stigler (1989), the correlation coefficient can take values in the interval from negative one to positive one. The two extreme values of this interval represent a perfectly linear relation between the variables, negative in the first case and positive in the other. That is, -1 indicates a perfect negative correlation, while +1 indicates a perfect positive correlation. If there is a positive correlation between two variables, then as the value of one variable increases, the value of the other variable tends to increase as well. The value zero implies the absence of a linear relation. The standard error of a correlation coefficient is used to determine the confidence intervals around a true correlation of zero.

All correlations fall between -1 and 1. The closer a correlation coefficient is to 1 or to -1, the stronger it is. As a rule of thumb, from -1 to -0.8 there is a very strong correlation between the two variables, from -0.8 to -0.6 a strong correlation, from -0.6 to -0.4 some correlation, and from -0.4 to 0 little or no correlation. The positive side mirrors this: from 0 to 0.4 there is little or no correlation, from 0.4 to 0.6 some correlation, from 0.6 to 0.8 a strong correlation, and from 0.8 to 1 a very strong correlation.
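
The rule-of-thumb bands quoted above can be written down as a small Python helper. This is only a sketch: the function name `describe_correlation` and the wording of the labels are my own, and since the quoted bands share their end points, I have assumed that a boundary value such as 0.8 belongs to the stronger band.

```python
def describe_correlation(r):
    """Turn a correlation coefficient into a rough verbal description.

    Assumption: a value exactly on a boundary (0.4, 0.6, 0.8) is placed
    in the stronger of the two neighbouring bands.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")

    size = abs(r)
    if size >= 0.8:
        strength = "very strong"
    elif size >= 0.6:
        strength = "strong"
    elif size >= 0.4:
        strength = "some"
    else:
        # Below 0.4 the direction is not worth reporting.
        return "no correlation"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Hypothetical coefficients, chosen to land in different bands.
for r in (-0.92, -0.5, 0.1, 0.65):
    print(r, "->", describe_correlation(r))
```

The cut-off values are a convention rather than a statistical law, so the thresholds are worth adjusting to whatever is customary in your field.
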
According to The Concise Encyclopedia of Statistics, the concept of correlation originated in the 1880s with the work of Francis Galton. In his autobiography Memories of My Life (1908), he writes that he thought of this concept during a walk in the grounds of Naworth Castle, when a rain shower forced him to find shelter.

A key thing to remember when working with correlations is never to assume a correlation means that a change in one variable causes a change in another. Sales of personal computers and athletic shoes have both risen strongly in the last several years and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes (or vice versa).
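
A quick Python sketch shows why the coefficient alone cannot support a causal claim. The yearly sales figures below are invented for illustration and the function name is my own. Both series simply rise over time, which is enough to produce a very high correlation, and the coefficient is the same whichever variable is listed first.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented yearly sales (in millions of units); both rise over six years.
computer_sales = [10, 12, 15, 19, 24, 30]
shoe_sales = [50, 53, 55, 60, 64, 70]

r_forward = pearson(computer_sales, shoe_sales)
r_reverse = pearson(shoe_sales, computer_sales)

print("computers vs shoes:", round(r_forward, 3))
print("shoes vs computers:", round(r_reverse, 3))
```

Since the measure is symmetric, it carries no information about which variable, if either, drives the other. Here the shared upward trend over the years is the more plausible explanation.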

SOURCES:

Margaret Rouse, research and application related to data mining.

Sandro Saitta.

StatPac Inc. (2017), The Statistics Calculator: Correlation Coefficient.

The Concise Encyclopedia of Statistics, pp. 115-119.

Stigler, S.M. (1989).

Boston University School of Public Health (2013), Introduction to Correlation and Regression Analysis. Date last modified: January 17, 2013.

