Regression

Example: Recall our scatterplot of manatee deaths versus the number of boat registrations in Florida.

We are now going to model these data with a straight line.

Below is the manatee death versus boat registration data with a line which models the overall trend.

This line-of-best-fit is called a regression line.

Some Straight-Line Basics: Recall that every non-vertical straight line can be expressed in the form $y=mx+b$.

For reasons we need not discuss yet, statisticians (and our text) prefer to write a generic line as $$y=a+bx.$$

Formula for the Regression Line

We can calculate the slope (written $b$ in our text) and $y$-intercept (written $a$ in our text):

the slope is $$b=r\frac{s_y}{s_x},$$

and the $y$-intercept is $$a=\bar{y}-b\bar{x}.$$

Our regression line has the form $$\hat{y}=a+bx$$

where $a$ and $b$ are given above.

The notation $\hat{y}$ is a reminder that our line gives us a prediction, or approximate guess based upon the data.
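The two formulas above can be sketched in a few lines of Python. This is only an illustration: the data set is a small made-up one (not the manatee data), and the helper names `correlation` and `regression_line` are our own.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

def regression_line(xs, ys):
    """Return (a, b) for the regression line y-hat = a + b*x."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)   # slope: b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)     # intercept: a = y-bar - b * x-bar
    return a, b

# Hypothetical toy data, for illustration only (NOT the manatee data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

For this toy data set the slope works out to $0.6$ and the intercept to $2.2$.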

For example, the regression line for the manatee data is $\hat{y}=0.129x-43.172$. Thus, if we know in advance the number of boat registrations for the year, we can roughly predict how many manatee deaths will occur in that year.

Example: Using the regression line for the manatee data, predict the number of manatee deaths that will occur in a year with 850 boat registrations.

In this case $\hat{y}=0.129(850)-43.172=66.478$.

Interpretation: We predict that approximately 66 manatees will die in a year in which there are 850 boat registrations.
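The same prediction can be checked with a short Python sketch. The function name `predict_manatee_deaths` is our own; the slope and intercept are the rounded values quoted above for the manatee regression line.

```python
def predict_manatee_deaths(boat_registrations):
    """Evaluate the manatee regression line y-hat = 0.129x - 43.172."""
    return 0.129 * boat_registrations - 43.172

prediction = predict_manatee_deaths(850)
print(round(prediction))  # prediction rounded to a whole number of manatee deaths
```

Because the coefficients are rounded, the prediction is only approximate, which is why we report it as "approximately 66."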

From Correlation to R-Squared: Recall that the correlation $r$ is a way of measuring the strength and direction of a linear relationship.

Squaring this value gives a more useful measure for our regression line: $r^2$ is the fraction of the variation in the $y$-values that is explained by our model.

The manatee data has a correlation of $r=0.951$. The amount of variation which is explained by our regression line is $r^2=0.905$, or $90.5\%.$
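To see what "variation explained" means numerically, here is a minimal Python sketch. It assumes that "variation" is measured by the sum of squared deviations of the $y$-values from their mean, and it uses a small made-up data set (not the manatee data). The variable names are our own.

```python
from statistics import mean

# Hypothetical toy data, for illustration only (NOT the manatee data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

r = s_xy / (s_xx * s_yy) ** 0.5            # correlation
b = s_xy / s_xx                            # slope (algebraically equal to r * s_y / s_x)
a = y_bar - b * x_bar                      # intercept

# Variation left over after fitting the line: squared distances from the line.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / s_yy

print(f"r^2                = {r ** 2:.3f}")
print(f"explained fraction = {explained_fraction:.3f}")
```

The two printed numbers agree, so in this example $r^2$ is exactly the share of the total variation in $y$ that the line accounts for.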

Fact: The value $r^2$ gives us a sense of how accurate our prediction is going to be.

Low values of $r^2$ indicate a weak relationship so that our regression will yield poor predictive results.

On the other hand, high values of $r^2$ indicate a strong relationship so that our regression will yield good predictive results.

[Figure: six example panels labeled $r^2=0.81$, $r^2=0.98$, $r^2=0.25$, $r^2=0.49$, $r^2=0$, and $r^2=0.09$.]

Fact: Whenever you present a regression line to your reader, you should also report the value of $r^2$.

Residuals: A residual is the difference between an observed value $y$ and the predicted value $\hat{y}$ at the same $x$.

Thus, each residual is calculated as $y-\hat{y}$ for any given value of $x$.
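As a small illustration, the Python sketch below computes the residuals for a made-up data set (not the manatee data) and a line $\hat{y}=2.2+0.6x$, which happens to be the least-squares line for these made-up points.

```python
# Hypothetical toy data, for illustration only (NOT the manatee data),
# together with its least-squares line y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# residual = observed y minus predicted y-hat, at the same x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    print(f"x = {x}: observed {y}, predicted {a + b * x:.1f}, residual {e:+.1f}")
```

A positive residual means the point lies above the line, and a negative residual means it lies below the line.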

Below is the residual plot for the manatee data:

Fact: Smaller residuals mean better predictive value of the regression line.

Fact: Our regression line formula minimizes the sum of the squares of the residuals.
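The following Python sketch is a spot check of this fact, not a proof. It uses a made-up data set (not the manatee data) whose least-squares line is $\hat{y}=2.2+0.6x$, and compares that line with a handful of slightly "nudged" lines. The function name `sum_squared_residuals` is our own.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, for illustration only (NOT the manatee data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for the toy data: y-hat = 2.2 + 0.6x
best = sum_squared_residuals(2.2, 0.6, xs, ys)

# Nudge the intercept and/or slope by 0.1 in each direction.
nudged = [
    sum_squared_residuals(2.2 + da, 0.6 + db, xs, ys)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(f"least-squares line: {best:.3f}")
print(f"best nudged line:   {min(nudged):.3f}")
```

Every nudged line has a larger sum of squared residuals than the regression line.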

Influential Observations: An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.

The result of a statistical calculation may be of little practical use if it depends strongly on a few influential observations.

Points that are outliers in either the $x$ or the $y$ direction of a scatterplot are often influential for the correlation. Points that are outliers in the $x$ direction are often influential for the least-squares regression line.
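To illustrate, the Python sketch below fits a least-squares slope to a made-up data set (not from our text) with and without a single point that is an outlier in the $x$ direction. The helper name `slope` and the outlier $(15, 3)$ are hypothetical choices for this illustration.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope b (algebraically equal to r * s_y / s_x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Hypothetical toy data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# The same data plus one made-up point far out in the x direction.
xs_with_outlier = xs + [15]
ys_with_outlier = ys + [3]

slope_without = slope(xs, ys)
slope_with = slope(xs_with_outlier, ys_with_outlier)

print(f"slope without the outlier: {slope_without:.3f}")
print(f"slope with the outlier:    {slope_with:.3f}")
```

In this example a single point changes the slope from clearly positive to slightly negative, so that point is influential for the regression line.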

Example: Recall the loss-aversion example we looked at in homework:

Savvy Citizen Fact #3: Correlation DOES NOT Imply Causation.

Example: Consider the relationship between lemon imports from Mexico and traffic deaths in the United States.

Beware Extrapolation: Using a statistical model to make predictions outside the range of the data used to build it is called extrapolation.

Example: Suppose that you have data on a child’s growth between 3 and 8 years of age. You find a strong linear relationship between age $x$ and height $y$. If you fit a regression line to these data and use it to predict height at age 25 years, you will predict that the child will be 8 feet tall.