class: center, middle, inverse, title-slide .title[ # Linear Regression ] .subtitle[ ## Session 04 ] .author[ ### Alejandro Ucan ] .date[ ### 2023-09-24 ] ---
# Goals * Given a set of data, find the best fitting line. <br/><br/> * Describe the correlation coefficients. <br/><br/> * Describe the coefficient of determination. <br/><br/> * Determine the parameters for the best linear model that fits. --- # Let's see the data <img src="index_files/figure-html/unnamed-chunk-1-1.png" width="100%" /><img src="index_files/figure-html/unnamed-chunk-1-2.png" width="100%" /> --- # Lets see the data of two variables <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> --- # Motivation > Sometimes the collected data (in real life) does not fit a linear model in the analitic form. <br/><br/> But it looks like a linear model or it has a linear behavior. <br/><br/> So, we need to find the best fitting line. <br/><br/> This is called **Linear Regression**. --- # Linear Regression > __Definition:__ A linear regression model is a model that assumes a linear relationship between the input variables (x) and the single output variable (y). <br/><br/> More specifically, that y can be calculated from a linear combination of the input variables (x). <br/><br/> When there is a single input variable (x), the method is referred to as simple linear regression. <br/><br/> When there are multiple input variables, literature from statistics often refers to the method as multiple linear regression. --- # Correlation How does we can assure that two variables are candidate to be linearly related? > __Definition:__ _correlation_ between two variables is a measure of the degree to which the variables are linearly related. <br/><br/> The correlation coefficient is a number between -1 and 1. <br/><br/> The closer the correlation coefficient is to 1 or -1, the more closely the two variables are related. <br/><br/> If the correlation coefficient is close to 0, the variables are not very closely related. --- ## Pearson's Correlation Coefficient > __Definition:__ The _Pearson's correlation coefficient_ between variables `\(X\)` and `\(Y\)` is defined as the covariance of `\(X\)` and `\(Y\)` divided by the product of their standard deviations, in symbols <br/><br/> `$$\rho_{X,Y} = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}.$$` --- ### Determination Coefficient ($R^2$) > __Definition:__ The _determination coefficient_ is a measure of how well the regression line represents the data. <br/><br/> It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). <br/><br/> It is defined as `$$R^2=\frac{\text{SSR}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}}$$` where <br/><br/> * `\(\text{SSR}=\sum_{i=1}^n (\hat{y}_i-\bar{y})^2\)` is the _regression sum of squares_. <br/><br/> * `\(\text{SSE}=\sum_{i=1}^n (y_i-\hat{y}_i)^2\)` is the _error sum of squares_. <br/><br/> * `\(\text{SST}=\sum_{i=1}^n (y_i-\bar{y})^2\)` is the _total sum of squares_. --- ## What can I obtain from linear regression? > If my variables are related, it is possible to __predict__ a response value from a predictor value more accurately. Also, * Examine how the response variable (y) changes as the explanatory variable (x). <br/><br/> * Predict the value of the response variable (y) for a given value of the explanatory variable (x). <br/><br/> --- ### The line that fits > The linear regression model of a `\(n-\)` sample points of `\(X\)` and `\(Y\)` is given by `$$y=mx+b+\epsilon$$` where * `\(\epsilon\)` is the error given by measurement. <br/><br/> * `\(m\)` is the slope of the line and it is given by `$$m=\frac{\sum_{i=1}^n xy-\overline{x}\overline{y}}{\sum_{i=1}^n x^2 -\frac{1}{n}\left(\sum_{i=1}^n x\right)^2}$$`<br/><br/> * `\(b\)` is the intercept of the line and it is given by `$$b=\overline{y}-m\overline{x}.$$` --- ### How to validate our linear regression model? > In order to validate if our linear regression model is acceptable we need to: <br/><br/> 1. Represent the data into a scatter plot. <br/><br/> 2. Compute the Pearson's correlation coefficient: If this number is closer to 1 or -1, the variables are more closely related. <br/><br/> 3. Compute the determination coefficient: if the number is closer to 1, the model is more accurate and if it the number is closer to 0 the model is less accurate. <br/><br/> --- # Example The following table contains information related to age and income of a group of people. Can we say if there is any relation between these two variables? | Age (years) | Income (USD) | |:-----------:|:------------:| | 18 | 20,000 | | 21 | 25,000 | | 25 | 30,000 | | 27 | 35,000 | | 30 | 40,000 | | 35 | 45,000 | | 40 | 50,000 | | 45 | 55,000 |