Linear Regression

We will get familiar with a few statistics concepts, because Data Science is a multi-disciplinary field: a combination of statistics, algorithms, data analysis, visualization and much more.

Here are a few key terms to help you better understand the calculations we do in data analysis: when we predict the target, train a model, read a model summary, or compute the error in our model. Behind all of this, statistics is at work!

Let's get started!

Linear Regression: used when the target variable takes continuous values. It determines the linear relationship between a dependent variable and one or more independent variables.

Represented as: Y = aX + b, where Y is the dependent variable, X is the independent variable, a is the slope and b is the intercept.

Why do we need linear regression?

It is the correlation between the variables that we are interested in, because it helps us make decisions and predict or forecast what comes next. So we need a tool that helps us determine that relationship, backed by calculations, graphs and plots. This is where Linear Regression comes into the picture.

Simple example: the sale of ice cream as the temperature rises. We want to find out how sales are affected when the temperature rises. Let's see what it looks like on a scatter plot of quantity sold against temperature:

We observe that there is a positive relationship, i.e. as the temperature increases, the quantity of ice cream sold also increases. Notice that Quantity is the dependent variable and Temperature is the independent variable.

The scatter plot helps us visualize the relationship, but what if we want a numerical measure of it? For that we use the correlation coefficient (the Pearson product-moment correlation coefficient).

What does it do? The correlation coefficient measures the strength and direction of the linear relationship between two variables. It is denoted by ‘r’.

Properties of Correlation Coefficient

  • r ranges from -1 to +1
  • +1 indicates a perfect positive linear relationship
  • -1 indicates a perfect negative linear relationship
  • if the value of r is close to +1, there is a strong positive linear relationship
  • if the value of r is close to -1, there is a strong negative linear relationship
  • if the value of r is close to 0, there is little or no linear relationship.
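As a minimal sketch of how r can be computed, the snippet below implements the Pearson formula directly in plain Python. The temperature and sales figures are hypothetical values made up for illustration (they are not the data behind the plot above), and the function name pearson_r is likewise just a label chosen here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (co-variation of x and y).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variation of x and of y on their own).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data: temperature (deg C) and ice creams sold.
temperature = [20, 22, 25, 27, 30, 32, 35]
ice_creams_sold = [110, 125, 150, 160, 190, 205, 230]

r = pearson_r(temperature, ice_creams_sold)
print(f"r = {r:.3f}")
```

A value of r near +1 for data like this would match the upward trend described for the ice cream example. Reversing the direction of one variable flips the sign of r.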

But here the set of values we were given was small, and the correlation coefficient was also small. What if we have a large data set with a large (positive or negative) correlation coefficient? The next step is to fit a regression line that best fits, or models, the data. This is where a new term comes in: the line of best fit.

Line of best fit

The line of best fit is the one with the least error, i.e. the smallest overall distance between the actual values and the estimated (predicted) values.

The line of best fit represents the association between two variables and is used to model the data. Now, how do we determine this line? This takes us to the next term: Regression Analysis.

Regression Analysis

Regression Analysis is a predictive modeling technique that investigates the relationship between a dependent variable and one or more independent variables.

It helps to determine which line best fits the relationship. This regression line is usually called the line of best fit. Other uses of Regression Analysis include determining the strength of predictors, forecasting an effect and forecasting trends.

The regression equation is: ŷ (y hat) = ax + b, where a is the slope and b is the y-intercept. What does it do? It gives the predicted y value for a given x value. How do we find ‘a’ and ‘b’? Through least-squares analysis.

Least-squares analysis determines the values of the slope (a) and intercept (b) such that the regression equation best represents the relationship between x and y. It does so by minimizing the error sum of squares, i.e. the sum of the squared differences between the observed and predicted y values.
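To make the least-squares step concrete, here is a small sketch using the standard closed-form solution for a single predictor: the slope is the sum of products of deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The data values and the function name least_squares_fit are assumptions made for illustration only.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept: line passes through (mean_x, mean_y)
    return a, b

# Hypothetical illustration data: temperature (deg C) and ice creams sold.
temperature = [20, 22, 25, 27, 30, 32, 35]
ice_creams_sold = [110, 125, 150, 160, 190, 205, 230]

a, b = least_squares_fit(temperature, ice_creams_sold)

# Predicted quantity (y hat) for a temperature that is not in the data.
y_hat = a * 28 + b
print(f"slope a = {a:.2f}, intercept b = {b:.2f}, prediction at 28 = {y_hat:.1f}")
```

In practice a library routine would usually do this work, but writing the two formulas out shows that the fitted line depends only on the means and the deviations from them.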

Let us explore one more related term here: the Coefficient of Determination. It is a measure of how certain one can be when making predictions with the line of best fit, i.e. a statistical measure of how close the data are to the fitted regression line (goodness of fit).

  • The Coefficient of Determination measures the proportion of variability in the dependent variable (y) that is explained by the regression model through the independent variable (x).
  • Its symbol is r², and its value lies in the range 0 ≤ r² ≤ 1
  • If the value of r² is close to 1, the model explains most of the variation in the dependent variable and is a useful model
  • If the value of r² is close to 0, the model explains little of the variation in the dependent variable and is not a useful model.
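The properties above can be sketched in code by computing the coefficient of determination as 1 minus the ratio of unexplained variation (the error sum of squares) to total variation in y. The data and the helper names below are hypothetical and chosen only for illustration.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r_squared(xs, ys, a, b):
    """Proportion of the variation in ys explained by the line y = a*x + b."""
    mean_y = sum(ys) / len(ys)
    # Error sum of squares: variation left unexplained by the line.
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: variation of y around its own mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical illustration data: temperature (deg C) and ice creams sold.
temperature = [20, 22, 25, 27, 30, 32, 35]
ice_creams_sold = [110, 125, 150, 160, 190, 205, 230]

a, b = least_squares_fit(temperature, ice_creams_sold)
r2 = r_squared(temperature, ice_creams_sold, a, b)
print(f"coefficient of determination = {r2:.3f}")
```

For a straight-line fit with one independent variable, this value equals the square of the correlation coefficient r.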

Last term in this section: Residual Plots. Residuals are just the errors in our predictions of the dependent variable, and a residual plot displays them. Why is it useful? Because it gives an idea of how appropriate the model is. If the current model is not appropriate, we can move to a more suitable one.

Residual (error) = observed – predicted = y – ŷ
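As a rough sketch, the residuals for a fitted line can be listed as follows. The data and function names are again hypothetical. To turn the list into a residual plot, each residual would be plotted against its x value with any plotting tool.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def residuals(xs, ys, a, b):
    """Residual for each point: observed y minus predicted y hat."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Hypothetical illustration data: temperature (deg C) and ice creams sold.
temperature = [20, 22, 25, 27, 30, 32, 35]
ice_creams_sold = [110, 125, 150, 160, 190, 205, 230]

a, b = least_squares_fit(temperature, ice_creams_sold)
errors = residuals(temperature, ice_creams_sold, a, b)

for x, e in zip(temperature, errors):
    print(f"temperature {x}: residual {e:+.2f}")
```

If the residuals scatter randomly around zero with no visible pattern, a straight line is a reasonable model. A curve or funnel shape in the residuals suggests that a different model may be more appropriate.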
