Simple Linear Regression

Introduction

  • Statistical model is quite important across a wide range of fields, providing researchers with tools for both explanation and prediction.
  • The most popular models of the statistical practice has been the general linear model (GLM).
  • The GLM finds the relation of a dependent and several independents variables that take form of tools as analysis of variance (ANOVA) and regression.

Simple Linear Regression

  • The simple linear regression model in population form is where \(y_i\) is the dependent variable for individual \(i\) in the data set and \(x_i\) is the independent variable for subject \(i\) (\(i\) = 1, …, \(N\)).
  • The terms \(\beta_0\) dan \(\beta_1\), are the intercept and slope of the model, respectively.

\[ y=\beta_{0}+\beta_{1}X_i+\varepsilon_i{} \qquad(1)\]

  • The intercept is the point at which the line in Equation 1 crosses the \(y\) axis at \(x\) = 0.
  • Thus, larger values of \(\beta_1\) (positive or negative) indicate a stronger linear relationship between \(y\) and \(x\).

Ilustration of Simple Regression

Imagine we have a dependent variable (y).

Ilustration of Simple Regression

We estimate a linear regression with no predictors and plot the intercept (null model).

\[ y = \beta_0 + \varepsilon_i{} \]

Ilustration of Random Error

  • Purple lines indicate the error term for each observation.

Random Error

  • Random error, represented by \(\varepsilon_i\), is inherent in any statistical model, including regression.
  • It expresses the fact that for any individual, \(i\), the model will not generally provide a perfect predicted value of \(y\), denoted \(\hat{y}_i\) and obtained by applying the regression model as

\[ \hat{y} = \beta_0 + \beta_iX_i \qquad(2)\]

  • Conceptually, this random error is representative of all factors that may influence the dependent variable other than \(x\).

Ilustration of Simple Regression

\[ y = \beta_0 + \beta_iX_i + \varepsilon_i{} \]

Estimating Regression with OLS

  • Ordinary least squares (OLS) is popular methods for obtaining estimated values of the regression model parameters (\(b_0\) and \(b_i\), respectively) given a set of \(x\) and \(y\).
  • \(\beta_0\) and \(\beta_i\) must be estimated using sample data taken from the population.
  • The goal of OLS is to minimize the sum of the squared differences between the observed values of \(y\) and the model-predicted values of \(y\), across the sample.
  • This difference, known as the residual, is written as

\[ e_i = y_i - \hat{y}_i \qquad(3)\]

  • Therefore, the method OLS seeks to minimize

\[ \Sigma_{i=1}^n e_i^2 = \Sigma_{i=1}^n (y_i - \hat{y}_i) \qquad(4)\]

OLS Criteria

  • It should be noted that in the context of simple linear regression, the OLS criteria reduce to the following equations, which can be used to obtain \(b_0\) and \(b_i\) as

\[ \beta_0 = r \left(\frac{S_y}{S_x} \right) \qquad(5)\]

and

\[ \beta_0 = \overline{y} - \beta_1\overline{x} \qquad(6)\]

Example of Simple Regression

  • In this example, we use the PISA 2022 data to analyze the impact of ESCS on the math achievement of Indonesian students.
  • The sample includes 1329 students who were assessed for both variables.
  • In this scenario, math achievement serves as the dependent variable, while ESCS is the independent variable.
  • Descriptive statistics for each variable, along with the correlations between them, are provided in Table 1.1.

Descriptive stat

     vars    n   mean    sd median trimmed   mad    min    max  range skew
ESCS    1 1297  -1.49  1.04  -1.56   -1.51  1.00  -4.63   1.52   6.15 0.18
MATH    2 1297 374.02 64.42 365.02  370.49 64.62 188.95 632.92 443.98 0.51
     kurtosis   se
ESCS    -0.34 0.03
MATH    -0.04 1.79

Correlation

Variable Mean Standar Deviasi Correlation
Math 367.28 58.34 0.244
ESCS -1.63 0.99

Beta 1

  • Using equations (1.4) and (1.5), we can use this information to obtain estimates for both the slope and the intercept of the regression model.
  • First, the slope of the regression is calculated as

\[ \beta_1 = 0.244 \left(\frac{58.34}{0.99} \right)=14.38 \qquad(7)\]

Beta 0

  • The results indicate that individuals with higher ESCS scores generally achieve higher math scores.

  • We can calculate an estimate of the intercept using the values in the table:

\[ \beta_0 = \overline{y} - \beta_1\overline{x} \qquad(8)\]

\[ \beta_0 = 367.28-(14.38)(-1.63)=390.72 \]

Full model

  • The resulting estimated regression equation for math and ESCS is:

\[ \hat {math}=390.72+14.38(ESCS). \]

  • This indicates that for a 1-point increase in ESCS score, math achievement would increase by 390.72 points.

Measure the strength

  • To assess the strength of the relationship between ESCS and math achievement, we should calculate the coefficient of determination.
  • This requires the values of \(SS_R\) and \(SS_T\).
  • The sum of squares due to regression (\(SS_R\)) or explained sum of squares (ESS) is the sum of the differences between the predicted value and the mean of the dependent variable.

Sum of squares due to regression

  • We can calculate the strength by this equation.

\[SS_R=\Sigma^n_{i=1}(\hat{y}_i-\bar{y})^2\]

  • Where \(\hat{y}_i\) is the predicted value of the dependent variable and \(\bar{y}\) is mean of the dependent variable.

Sum of squares error

  • The sum of squares error (\(SS_E\)) or residual sum of squares is the difference between the observed and predicted values.

\[SS_E=\Sigma^n_{i=1}\varepsilon^2_i\] Where \(\varepsilon_i\) is the difference between the actual value of the dependent variable and the predicted value:

\[\varepsilon_i=y_i-\hat{y}_i\]

Sum of squares total

  • The sum of squares total (\(SS_T\)) or the total sum of squares (TSS) is the sum \(SS_R\) and \(SS_E\).

\[SS_T=\Sigma^n_{i=1}(\hat{y}_i-\bar{y})^2 +\Sigma^n_{i=1}\varepsilon^2_i\]

Regression in R

To simplify the calculation we will calculate the \(R^2\) value using r. First, we create a regression model using the lm () function.

model <- lm(MATH ~ ESCS, data = pisa_idn)

Calculate by hand

  • We calculate the \(SS_R\), \(SS_E\), \(SS_T\), \(R^2\) value with the following command:
ssr <- sum((fitted(model)-mean(pisa_idn$MATH))^2)
ssr
[1] 424543.8
sse <- sum((fitted(model) - pisa_idn$MATH)^2)

sst <- ssr+sse
sst
[1] 5377820

Calculate R square

\[ R^2=\frac{SS_R}{SS_T}= \frac{269424.3}{4520484}=0.06 \]

  • The results indicate that about 6% of the difference in math achievement can be accounted for by the variance in ESCS scores.

Calculate F

  • With this \(R^2\) value, we can compute the F-statistic to test if any of the model slopes (in this instance, there is only one) are different from 0 in the population.

\[ F= (\frac {1329-1-1} {1})(\frac{0.06}{1-0.06})= 84.7 \]