Predicting Continuous Outcomes
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as predictor variables).
- Dependent Variable ($Y$): The outcome we want to predict
- Independent Variable ($X$): The input used to make predictions
The goal is to find a linear equation that best predicts the dependent variable based on the values of the independent variables.
The Linear Regression Equation
In its simplest form, the linear regression equation is:
$$ Y = \beta_0 + \beta_1X + \epsilon $$
- $Y$: The dependent variable
- $X$: The independent variable
- $\beta_0$: The intercept, representing the value of $Y$ when $X$ is zero
- $\beta_1$: The slope, indicating how much $Y$ changes for a one-unit change in $X$
- $\epsilon$: The error term, accounting for the variation in $Y$ not explained by $X$
- Linear regression assumes a linear relationship between the dependent and independent variables.
- This means the change in $Y$ is proportional to the change in $X$; the sketch after this list shows how $\beta_0$ and $\beta_1$ are estimated from data.
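To make the equation concrete, here is a minimal sketch of estimating $\beta_0$ and $\beta_1$ by ordinary least squares in Python with NumPy. The data points are invented purely for illustration:

```python
import numpy as np

# Invented example data, purely for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

# Closed-form least-squares estimates for simple linear regression:
#   beta_1 = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
#   beta_0 = mean(Y) - beta_1 * mean(X)
x_mean, y_mean = X.mean(), Y.mean()
beta_1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"Intercept (beta_0): {beta_0:.3f}")
print(f"Slope (beta_1): {beta_1:.3f}")
```

The closed-form estimates used here are the standard least-squares solutions for simple (one-predictor) regression; libraries such as scikit-learn compute the same fit.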
Relationship Between Independent and Dependent Variables
- Independent Variables: These are the predictors. Their values are assumed to influence the dependent variable but are not influenced by it.
- Dependent Variable: This is the response. Its values are assumed to depend on the independent variables.
- Informally, think of the independent variable as the cause and the dependent variable as the effect, keeping in mind that regression alone does not establish causation.
- For example, in predicting house prices, the size of the house (independent variable) influences the price (dependent variable).
Significance of the Slope and Intercept
The Intercept: $\beta_0$
- Represents the expected value of the dependent variable when the independent variable is zero.
- It is where the regression line crosses the y-axis.
- Sometimes, the intercept may not have a practical interpretation.
- For example, predicting salary at zero years of experience might not make sense, but it is still a necessary part of the model.
The Slope: $\beta_1$
- Indicates how much the dependent variable changes for each one-unit increase in the independent variable (the steepness of the line).
- A positive slope means the dependent variable increases with the independent variable, while a negative slope means it decreases (indicates direction).
The slope directly measures the influence of the independent variable on the dependent variable; in models with several predictors, each slope is interpreted holding the other predictors constant.
Predicting House Price Based on Square Footage
- Intercept: Represents the base price of a house with zero square footage (often not realistic but mathematically necessary for the model).
- Slope: Indicates how much the house price increases for each additional square foot.
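As a concrete illustration, here is a hedged sketch using scikit-learn's `LinearRegression`; the square footage and price figures are made up for demonstration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: square footage and sale price, for demonstration only.
sqft = np.array([[800], [1200], [1500], [2000], [2400]])  # 2D: (samples, features)
price = np.array([160_000, 220_000, 265_000, 340_000, 400_000])

model = LinearRegression().fit(sqft, price)

# intercept_ estimates the base price at zero square feet;
# coef_[0] estimates the price increase per additional square foot.
print(f"Intercept: ${model.intercept_:,.0f}")
print(f"Slope: ${model.coef_[0]:,.2f} per square foot")
```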
How Well the Model Fits the Data
- In regression, the model fit refers to how closely the model’s predicted values match the actual observed values.
- This fit indicates how well the model captures patterns or relationships in the data set.
- A good fit means the predicted values are close to the actual data points, while a poor fit means the predictions deviate significantly.
- To evaluate this fit, we can use a common metric known as the coefficient of determination, denoted as $r^2$ (R-squared).
Understanding R-squared
- The formula for R-squared is: $r^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$ (a worked computation follows this list).
- $SS_{\text{res}}$ is the sum of squared residuals, which measures the differences between observed and predicted values.
- $SS_{\text{tot}}$ is the total sum of squares, representing the differences between observed values and the mean of those values.
- The $r^2$ values mean the following:
- A higher $r^2$ value indicates a better model fit, meaning the model explains more of the variability in the data.
- An $r^2$ value of 0 means the model explains none of the variability in the response variable.
- An $r^2$ value of 1 means the model explains all of the variability in the response variable.
- A high $r^2$ value suggests a good fit, but it does not guarantee the model is perfect.
- Always consider other factors, such as the distribution of residuals and potential overfitting.
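To tie the formula to code, here is a minimal sketch that computes $r^2$ from the two sums of squares; the observed and predicted values are invented for illustration:

```python
import numpy as np

# Invented observed values and model predictions, for illustration only.
y_actual = np.array([2.1, 4.3, 6.2, 8.4, 9.9])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# SS_res: sum of squared residuals (observed minus predicted).
ss_res = np.sum((y_actual - y_pred) ** 2)

# SS_tot: total sum of squares (observed minus the mean of observed values).
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")
```

In practice, `sklearn.metrics.r2_score` computes the same quantity directly from observed and predicted values.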