Linear regression with a single predictor

Lecture 4

2025-02-10

Warm up

Goals

  • Modeling with a single predictor
  • Model parameters, estimates, and error terms
  • Interpreting slopes and intercepts

Setup

Correlation vs. causation

Spurious correlations

Linear regression with a single predictor

Read the data

df_raw_cattle_numbers <- read.csv('https://raw.githubusercontent.com/Bovi-analytics/minor-digital-agriculture/refs/heads/main/data/fao-cattle-numbers.csv')

Data prep

  • Select the columns needed: Year and Value
  • Rename Value to NumberOfCows, following the FAIR naming convention
df_cattle_numbers <- df_raw_cattle_numbers %>%
  dplyr::select(Year, Value) %>%
  dplyr::rename(
    NumberOfCows = Value
  )

Data grouping

  • Group the rows by Year
  • Sum the counts within each year to get the world total, expressed in billions
df_total_cattle_numbers <- df_cattle_numbers %>%
  dplyr::group_by(Year) %>%
  dplyr::summarise(
    TotalNumberOfCows = sum(NumberOfCows) / 1e9  # convert head counts to billions
  )

Writing the data

  • Write the data to csv
  • Import in Tableau
readr::write_csv(df_total_cattle_numbers, file = "df_total_cattle_numbers.csv")

Data overview

df_total_cattle_numbers %>%
  select(Year, TotalNumberOfCows)
# A tibble: 62 × 2
    Year TotalNumberOfCows
   <int>             <dbl>
 1  1961             0.992
 2  1962             1.00 
 3  1963             1.02 
 4  1964             1.04 
 5  1965             1.06 
 6  1966             1.08 
 7  1967             1.11 
 8  1968             1.12 
 9  1969             1.13 
10  1970             1.14 
# ℹ 52 more rows

Data visualization

Regression model

A regression model is a function that describes the relationship between the outcome, \(Y\), and the predictor, \(X\).

\[\begin{aligned} Y &= \textbf{Model} + \text{Error} \\[8pt] &= \mathbf{f(X)} + \epsilon \\[8pt] &= \boldsymbol{\mu_{Y|X}} + \epsilon \end{aligned}\]

Simple linear regression

Use simple linear regression to model the relationship between a quantitative outcome (\(Y\)) and a single quantitative predictor (\(X\)): \[\Large{Y = \beta_0 + \beta_1 X + \epsilon}\]

  • \(\beta_1\): True slope of the relationship between \(X\) and \(Y\)
  • \(\beta_0\): True intercept of the relationship between \(X\) and \(Y\)
  • \(\epsilon\): Error term (the part of \(Y\) that the line does not explain)

Simple linear regression

\[\Large{\hat{Y} = b_0 + b_1 X}\]

  • \(b_1\): Estimated slope of the relationship between \(X\) and \(Y\)
  • \(b_0\): Estimated intercept of the relationship between \(X\) and \(Y\)
  • No error term!
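The difference between the true parameters (\(\beta_0\), \(\beta_1\)) and their estimates (\(b_0\), \(b_1\)) can be seen in a small simulation. The sketch below is illustrative only: the "true" values, the sample size, and the error standard deviation are hypothetical choices, not taken from the cattle data.

```r
# Simulate data from a known line: Y = beta_0 + beta_1 * X + epsilon.
# The "true" values below are hypothetical, chosen only for illustration.
set.seed(42)
beta_0 <- 2
beta_1 <- 0.5
n <- 200

x <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)  # error term, unobserved in practice
y <- beta_0 + beta_1 * x + epsilon

# Estimate the intercept and slope from the simulated sample
fit <- lm(y ~ x)
b <- coef(fit)  # b[1] is b_0, b[2] is b_1

# The estimates should land near, but not exactly on, the true parameters
print(rbind(true = c(beta_0, beta_1), estimated = unname(b)))

# Predictions use the estimated line alone: no error term is added
y_hat <- b[1] + b[2] * x
```

In real data only `x` and `y` are observed, so \(\beta_0\), \(\beta_1\), and \(\epsilon\) stay unknown and the estimates are all we have.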

Choosing values for \(b_1\) and \(b_0\)

Residuals

\[\text{residual} = \text{observed} - \text{predicted} = y - \hat{y}\]

Least squares line

  • The residual for the \(i^{th}\) observation is

\[e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]

  • The sum of squared residuals is

\[e^2_1 + e^2_2 + \dots + e^2_n\]

  • The least squares line is the one that minimizes the sum of squared residuals
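To make the minimization concrete, the sketch below computes the sum of squared residuals for the least squares line and for two slightly perturbed lines. It uses only the first ten yearly totals shown in the data overview, so the fitted numbers will differ from a fit on all 62 years; the helper `ssr()` and the sizes of the perturbations are hypothetical choices for illustration.

```r
# First ten rows of the yearly totals from the data overview (billions)
cattle <- data.frame(
  Year = 1961:1970,
  TotalNumberOfCows = c(0.992, 1.00, 1.02, 1.04, 1.06,
                        1.08, 1.11, 1.12, 1.13, 1.14)
)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
ssr <- function(b0, b1, data = cattle) {
  predicted <- b0 + b1 * data$Year
  residual <- data$TotalNumberOfCows - predicted
  sum(residual^2)
}

# lm() returns the least squares intercept and slope
fit <- lm(TotalNumberOfCows ~ Year, data = cattle)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])

ssr_best <- ssr(b0, b1)

# Moving the intercept or the slope away from the fit increases the sum
ssr_shifted <- ssr(b0 + 0.01, b1)
ssr_tilted <- ssr(b0, b1 + 0.0001)

print(c(least_squares = ssr_best, shifted = ssr_shifted, tilted = ssr_tilted))
```

Any other choice of intercept and slope gives a larger sum than `ssr_best`, which is what "least squares" means.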

Least squares line

# A tibble: 2 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -16.1      0.510        -31.7 3.85e-39
2 Year          0.00879  0.000256      34.3 4.03e-41
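The slides do not show the call that produced this table. One plausible way to obtain a table of this kind, assuming the model was fitted with `lm()`, is sketched below in base R. It uses only the first ten yearly totals from the data overview, so its estimates will not match the table above, which is based on all 62 years.

```r
# First ten rows of the yearly totals from the data overview (billions)
cattle <- data.frame(
  Year = 1961:1970,
  TotalNumberOfCows = c(0.992, 1.00, 1.02, 1.04, 1.06,
                        1.08, 1.11, 1.12, 1.13, 1.14)
)

# Fit the simple linear regression of total cattle on year
fit <- lm(TotalNumberOfCows ~ Year, data = cattle)

# Coefficient table: estimate, standard error, t statistic, and p-value
coef_table <- coef(summary(fit))
print(coef_table)
```

The tibble layout above (columns `term`, `estimate`, `std.error`, `statistic`, `p.value`) matches what `broom::tidy(fit)` returns, so that function may have been used here, although this is an assumption.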

Slope and intercept

Properties of least squares regression

  • The regression line goes through the center of mass point (the coordinates corresponding to average \(X\) and average \(Y\)): \(b_0 = \bar{Y} - b_1~\bar{X}\)

  • Slope has the same sign as the correlation coefficient: \(b_1 = r \frac{s_Y}{s_X}\)

  • Sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\)

  • Residuals and \(X\) values are uncorrelated
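These four properties can be checked numerically. The sketch below does so on the first ten yearly totals from the data overview; the tolerances are arbitrary small values chosen to absorb floating point error.

```r
# First ten rows of the yearly totals from the data overview (billions)
x <- 1961:1970
y <- c(0.992, 1.00, 1.02, 1.04, 1.06, 1.08, 1.11, 1.12, 1.13, 1.14)

fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])
e <- unname(resid(fit))

# 1. The line passes through (mean(x), mean(y))
through_center <- isTRUE(all.equal(b0, mean(y) - b1 * mean(x)))

# 2. The slope equals r * s_Y / s_X, so it shares the sign of r
slope_from_r <- isTRUE(all.equal(b1, cor(x, y) * sd(y) / sd(x)))

# 3. The residuals sum to zero, up to floating point error
residuals_sum_zero <- abs(sum(e)) < 1e-10

# 4. The residuals are uncorrelated with the predictor
residuals_uncorrelated <- abs(cor(e, x)) < 1e-6

print(c(through_center, slope_from_r, residuals_sum_zero, residuals_uncorrelated))
```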

Interpreting the slope

The slope of the model for predicting the total number of cows (in billions) from Year is 0.008785434.

How to interpret

Interpreting slope & intercept

\[\widehat{\text{Total number of cows}} = -16.1 + 0.008785434 \times \text{Year}\]

  • Slope: For each additional year, we expect the total number of cows to be higher by 0.008785434 billion (about 8.8 million), on average.
  • Intercept: If Year is 0, we expect the total number of cows to be -16.1 billion.
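Both interpretations follow directly from the fitted equation, as the short sketch below shows. It uses the rounded estimates reported above; the helper `predict_total()` and the example years are hypothetical, and because the intercept is rounded to one decimal place the function is not accurate enough for real predictions.

```r
# Rounded estimates as reported in the fitted equation
b0 <- -16.1        # intercept, in billions
b1 <- 0.008785434  # slope, in billions per year

# Predicted total (billions) for a given year
predict_total <- function(year) b0 + b1 * year

# Slope: predictions one year apart differ by exactly b1
slope_check <- predict_total(2001) - predict_total(2000)

# The same slope expressed in millions per year
slope_millions <- b1 * 1000

# Intercept: the prediction at Year = 0 is b0 itself
intercept_check <- predict_total(0)

print(c(slope = slope_check, slope_millions = slope_millions,
        intercept = intercept_check))
```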

Is the intercept meaningful?

✅ The intercept is meaningful in context of the data if

  • the predictor can feasibly take values equal to or near zero or
  • the predictor has values near zero in the observed data

🛑 Otherwise, it might not be meaningful!