ISLR

Summary

Skimmed this book out of curiosity & did not finish b/c of the lack of proofs and the overlap with ESL; see the ESL notes instead.


Ch.1 – Introduction

Notation:

Ch.2 – Statistical Learning

Observe a quantitative response $Y$ and $p$ different predictors $X_1, \ldots, X_p$. We assume there is some relationship between $Y$ and $X = (X_1, \ldots, X_p)$:

$$Y = f(X) + \epsilon.$$

Statistical learning refers to a set of approaches for estimating $f$.

2.1.1 Why estimate f?

Two main reasons: prediction & inference.

Prediction

In many situations, a set of inputs $X$ is readily available, but the output $Y$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $Y$ using

$$\hat Y = \hat f(X),$$

where $\hat f$ represents our estimate for $f$, and $\hat Y$ represents the resulting prediction for $Y$. $\hat f$ is often treated as a black box.

The accuracy of $\hat Y$ as a prediction for $Y$ depends on the reducible and irreducible error. In general, $\hat f$ will not be a perfect estimate for $f$; this error is reducible because we can potentially improve the accuracy. However, $Y$ itself is also a function of $\epsilon$, which cannot, by definition, be predicted using $X$. This is the irreducible error.

Consider a given estimate $\hat f$ and a set of predictors $X$, which yields the prediction $\hat Y = \hat f(X)$. Assume for now that both $\hat f$ and $X$ are fixed. Then,

$$\begin{aligned}
E(Y - \hat Y)^2 &= E[f(X) + \epsilon - \hat f(X)]^2 \\
&= [f(X) - \hat f(X)]^2 + \text{Var}(\epsilon),
\end{aligned}$$

where the first term is reducible and the second term is irreducible. $E(Y - \hat Y)^2$ is the expected value of the squared difference between the predicted and actual value of $Y$, and $\text{Var}(\epsilon)$ is the variance associated with the error term $\epsilon$.
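To see where the cross term goes (my own filler step, using $E[\epsilon] = 0$, $E[\epsilon^2] = \text{Var}(\epsilon)$, and the assumption that $\hat f$ and $X$, hence $f(X) - \hat f(X)$, are fixed):

$$E[f(X) + \epsilon - \hat f(X)]^2 = [f(X) - \hat f(X)]^2 + 2\,[f(X) - \hat f(X)]\,E[\epsilon] + E[\epsilon^2] = [f(X) - \hat f(X)]^2 + \text{Var}(\epsilon).$$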

Inference

We are often interested in understanding how $Y$ is affected as $X_1, \ldots, X_p$ change. We wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. Now $f$ cannot be treated as a black box, because we need to know its exact form. We might be interested in asking, for example, which predictors are associated with the response, and what the relationship is between the response and each predictor.

2.1.2 How do we estimate f?

We explore linear and nonlinear approaches to estimating $f$. These methods generally share certain characteristics; this section gives an overview of those shared characteristics.

We will assume we have observed $n$ different data points (the training data). Let $x_{ij}$ denote the value of the $j$-th predictor for observation $i$, where we have $n$ observations and $p$ features, and let $y_i$ denote the response variable for the $i$-th observation.

We want to find $\hat f$ such that $Y \approx \hat f(X)$ for any observation $(X, Y)$.

Parametric Methods

Parametric methods involve a two-step model-based approach: first assume a functional form for $f$ (e.g., linear), then use the training data to fit the model, i.e., estimate its parameters.

“Parametric” means we fix the number of parameters that need to be estimated. This simplifies the estimation, but may yield a poor estimate if the chosen model form is too far from the true $f$. In general, fitting a more flexible model requires estimating a greater number of parameters, and this can lead to overfitting, where the model follows the errors (noise) too closely.

Non-parametric Methods

Do not make explicit assumptions about the functional form of $f$. Instead, seek an estimate of $f$ that gets as close to the data points as possible without being too “rough or wiggly.” Such approaches are advantageous because they have the potential to accurately fit a wider range of possible shapes for $f$, but a very large number of observations is required for an accurate estimate.

2.1.3 Trade-off between Prediction Accuracy and Model Interpretability

One might ask: why would we ever choose to use a more restrictive method instead of a very flexible approach? There are several reasons we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest.

2.1.4 Supervised vs. Unsupervised Learning

Most statistical learning problems fall into one of two categories: supervised or unsupervised. Supervised learning means that for each observation of the predictor measurements there is an associated response measurement. Unsupervised learning is when for every observation $i = 1, \ldots, n$ we observe a vector of measurements $x_i$ but no associated response $y_i$. What sort of statistical analysis is then possible?

We can seek to understand the relationships between the variables or between the observations. One stat learning tool we might use is cluster analysis, or clustering. The goal is to ascertain, on the basis of $x_1, \ldots, x_n$, whether the observations fall into relatively distinct groups.

Semi-supervised learning occurs when we have predictor and response measurements for $m < n$ observations, and the remaining $n - m$ observations have no response measurement.

2.1.5 Regression vs. Classification

Variables can be quantitative or qualitative (categorical). Quantitative variables take on numerical values. We tend to refer to problems with a quantitative response as regression problems, and qualitative responses as classification problems.

2.2 Assessing Model Accuracy

2.2.1 Measuring Quality of Fit

In order to evaluate the performance of a stat learning method on a given data set, we need some way to measure how well its predictions match the observed data.

In the regression setting, the most commonly used measure is the mean squared error (MSE)

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat f(x_i)\big)^2,$$

where $\hat f(x_i)$ is the prediction $\hat f$ gives for the $i$-th observation. The MSE will be small if the predicted responses are very close to the true responses, and large if for some of the observations the predicted and true responses differ substantially.

The MSE above is computed using the training data and should be referred to as the training MSE. In general, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We would like to select a method that minimizes the test MSE.

In some settings, we may have a test data set available. If no test observations are available, one cannot just choose the approach minimizing the training MSE.

As the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the method and data set. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting.
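A minimal sketch of this pattern on synthetic data (the true $f$, noise level, and polynomial degrees below are my own choices, not the book's example): as the polynomial degree grows, the training MSE keeps falling while the test MSE eventually turns back up.

```python
# Sketch: training vs. test MSE as flexibility grows, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * x)                                 # hypothetical true f

x_tr = rng.uniform(0, 3, 50); y_tr = true_f(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 3, 50); y_te = true_f(x_te) + rng.normal(0, 0.3, 50)

for degree in [1, 3, 10, 15]:                            # increasing flexibility
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```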

Throughout this book, we discuss a variety of approaches that can be used in practice to estimate the minimum point of the test MSE curve.

2.2.2 The Bias-Variance Trade-Off

The U-shape observed in the test MSE curves turns out to be the result of two competing properties of statistical learning methods. The expected test MSE, for a given $x_0$, can always be decomposed into the sum of three fundamental quantities: the variance of $\hat f(x_0)$, the squared bias of $\hat f(x_0)$, and the variance of the error term $\epsilon$:

$$E\left(y_0 - \hat f(x_0)\right)^2 = \text{Var}(\hat f(x_0)) + [\text{Bias}(\hat f(x_0))]^2 + \text{Var}(\epsilon).$$

Here $E\left(y_0 - \hat f(x_0)\right)^2$ denotes the expected test MSE: the average test MSE we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each estimate at $x_0$. The overall expected test MSE can be computed by averaging $E\left(y_0 - \hat f(x_0)\right)^2$ over all possible values of $x_0$ in the test set.

In order to minimize the expected test error, we need to select a method that achieves both low variance and low bias.

Variance refers to the amount by which $\hat f$ would change if we estimated it using a different training data set. In general, more flexible statistical methods have higher variance.

Bias is introduced by approximating a real-life problem by a much simpler model. For example, linear regression assumes a linear relationship, which is unlikely in real life, resulting in bias in the estimation of $f$. Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. The relationship between bias, variance, and test set MSE is referred to as the bias-variance trade-off. The challenge lies in finding a method for which both the variance and squared bias are low.
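A quick simulation of the trade-off at a single point $x_0$ (synthetic data; the true $f$, noise SD, and flexibility levels are my own choices): refit the model on many simulated training sets and measure the variance and squared bias of $\hat f(x_0)$ directly.

```python
# Sketch: estimating Var(f_hat(x0)) and squared bias at one x0 by refitting
# on many simulated training sets.
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * x)

x0, sigma, n = 1.5, 0.3, 50

for degree in [1, 3, 10]:
    preds = np.empty(500)
    for b in range(500):                       # one fit per simulated training set
        x = rng.uniform(0, 3, n)
        y = true_f(x) + rng.normal(0, sigma, n)
        preds[b] = np.polyval(np.polyfit(x, y, degree), x0)
    var = preds.var()
    bias2 = (preds.mean() - true_f(x0)) ** 2
    print(f"degree {degree:2d}: variance {var:.4f}, squared bias {bias2:.4f}, "
          f"expected test MSE ~ {var + bias2 + sigma**2:.4f}")
```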

In a real-life situation in which $f$ is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a stat learning method.

2.2.3 The Classification Setting

Suppose we seek to estimate $f$ on the basis of $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_1, \ldots, y_n$ are qualitative. The most common approach for quantifying accuracy is the training error rate, the proportion of mistakes made if we apply our estimate $\hat f$ to the training observations:

$$\frac{1}{n}\sum_{i=1}^n I(y_i \neq \hat y_i).$$

The test error rate associated with a set of test observations of the form $(x_0, y_0)$ is given by

$$\text{Ave}\big(I(y_0 \neq \hat y_0)\big),$$

where $\hat y_0$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_0$.

The Bayes Classifier

It is possible to show that the test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector $x_0$ to the class $j$ for which

$$P(Y = j \mid X = x_0)$$

is largest. This simple classifier is called the Bayes classifier.

The Bayes classifier’s prediction is determined by the Bayes decision boundary.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier always chooses the class for which $P(Y = j \mid X = x_0)$ is largest, the error rate at $X = x_0$ will be $1 - \max_j P(Y = j \mid X = x_0)$. In general, the overall Bayes error rate is given by

$$1 - E\left(\max_j P(Y = j \mid X)\right),$$

where the expectation averages the probability over all possible values of $X$.
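A sketch of what this looks like when the conditional distribution is actually known: a toy 1-D problem with two equally likely classes and Gaussian class densities (my own example, not the book's simulated data).

```python
# Sketch: the Bayes classifier and Bayes error rate for a toy known-distribution problem.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 10001)
p1 = 0.5 * norm.pdf(x, loc=-1, scale=1)          # P(Y=1) * density of X given Y=1
p2 = 0.5 * norm.pdf(x, loc=+1, scale=1)          # P(Y=2) * density of X given Y=2
post1 = p1 / (p1 + p2)                           # P(Y = 1 | X = x)

bayes_class = np.where(post1 > 0.5, 1, 2)        # assign the most likely class
# Bayes error rate = 1 - E[ max_j P(Y=j | X) ], averaging over the density of X.
px = p1 + p2                                     # marginal density of X
err_x = 1 - np.maximum(post1, 1 - post1)         # error rate at each x
bayes_error = np.sum(err_x * px) * (x[1] - x[0]) # numerical integral over x
print(f"Bayes error rate ~ {bayes_error:.4f}")   # analytically 1 - Phi(1) ~ 0.159
```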

K-Nearest Neighbors

In theory, we would like to predict qualitative responses using the Bayes classifier. But for real data, we don't know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible. The Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches try to estimate the conditional distribution of $Y$ given $X$, then classify a given observation to the class with the highest estimated probability. One such example is the K-nearest neighbors (KNN) classifier.

Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$.

It then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:

$$P(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i = j).$$

Finally, KNN applies the Bayes rule and classifies the test observation $x_0$ to the class with the largest estimated probability.
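A bare-bones version of this classifier (toy data; the function name knn_predict is mine, and this is not the book's R lab code):

```python
# Sketch: a minimal KNN classifier implementing the estimate above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, K):
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances from x0 to all training points
    neighbors = np.argsort(dists)[:K]              # indices of N_0, the K nearest points
    votes = Counter(y_train[neighbors])            # class counts in N_0 (K * estimated P(Y=j | X=x0))
    return votes.most_common(1)[0][0]              # class with the largest estimated probability

# Tiny usage example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), K=3))   # -> 0
```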

notes

The choice of $K$ has a drastic effect on the KNN classifier obtained. As $K$ grows, the method becomes less flexible and produces a decision boundary that is close to linear (low variance, high bias).

notes

Ch. 3 – Linear Regression

3.1 Simple Linear Regression

Simple linear regression assumes there is an approximately linear relationship between $X$ and $Y$:

$$Y \approx \beta_0 + \beta_1 X.$$

Sometimes described as “regressing $Y$ on $X$” or “$Y$ onto $X$”. Once we have used our training data to produce estimates $\hat\beta_0, \hat\beta_1$ for the model coefficients, we can predict

$$\hat y = \hat\beta_0 + \hat\beta_1 x.$$

3.1.1 Estimating the Coefficients

In practice, the parameters are unknown, so we must use data to estimate the coefficients. Suppose we have $n$ observation pairs. We'll use least squares, the most common criterion for measuring closeness.

Let $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction for $Y$ based on the $i$-th value of $X$. Then $e_i = y_i - \hat y_i$ represents the $i$-th residual.

We define the residual sum of squares (RSS) as

$$\text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2,$$

or equivalently as

$$\text{RSS} = (y_1 - \hat\beta_0 - \hat\beta_1 x_1)^2 + (y_2 - \hat\beta_0 - \hat\beta_1 x_2)^2 + \cdots + (y_n - \hat\beta_0 - \hat\beta_1 x_n)^2.$$

The least squares approach chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the RSS. Using calculus, one can show that the minimizers are

$$\begin{aligned}
\hat\beta_1 &= \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \\
\hat\beta_0 &= \bar y - \hat\beta_1 \bar x,
\end{aligned} \tag{3.4}$$

where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ are the sample means. In other words, (3.4) defines the least squares coefficient estimates for simple linear regression.
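Computing (3.4) directly is a one-liner; a sketch on synthetic data (the "true" coefficients $\beta_0 = 1$, $\beta_1 = 2$ are my own choices):

```python
# Sketch: the least squares estimates in (3.4), cross-checked against np.polyfit.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)              # should land close to 1 and 2
print(np.polyfit(x, y, 1))               # cross-check: returns [beta1_hat, beta0_hat]
```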

3.1.2 Assessing the Accuracy of the Coefficient Estimates

We assume the true relationship between $X$ and $Y$ is $Y = f(X) + \epsilon$. If $f$ is to be approximated by a linear function, we can write this as

$$Y = \beta_0 + \beta_1 X + \epsilon. \tag{3.5}$$

The model given by (3.5) defines the population regression line: the best linear approximation to the true relationship between $X$ and $Y$. The least squares regression coefficient estimates characterize the least squares line. The true relationship is generally not known for real data, but the least squares line can always be computed using the estimates (3.4). (Think: population mean $\mu$ vs. sample mean $\hat\mu$.)

If we use the sample mean $\hat\mu$ to estimate $\mu$, this estimate is unbiased: on average, we expect $\hat\mu$ to equal $\mu$.

Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter.

Next (again with the population vs. sample mean analogy), recall how we assess the accuracy of the sample mean $\hat\mu$ as an estimate of $\mu$: how far off will a single estimate $\hat\mu$ be from $\mu$? Generally, we answer this with the standard error of $\hat\mu$, written $\text{SE}(\hat\mu)$:

$$\text{Var}(\hat\mu) = \text{SE}(\hat\mu)^2 = \frac{\sigma^2}{n},$$

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$. Roughly speaking, the standard error tells us the average amount that the estimate $\hat\mu$ differs from the actual value $\mu$. It also tells us how this deviation shrinks with $n$.

Similarly, how do we check how close $\hat\beta_0, \hat\beta_1$ are to the true values $\beta_0, \beta_1$? To compute the standard errors associated with $\hat\beta_0$ and $\hat\beta_1$, we use the following formulas:

$$\text{SE}(\hat\beta_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right], \qquad \text{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2},$$

where $\sigma^2 = \text{Var}(\epsilon)$. For these formulas to be strictly valid, we need to assume the errors for each observation are uncorrelated with common variance $\sigma^2$. In general, $\sigma^2$ is not known, but can be estimated from the data. This estimate is the residual standard error, given by

$$\text{RSE} = \sqrt{\text{RSS}/(n-2)}.$$

Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data.

For linear regression, the 95% confidence interval for $\beta_1$ is approximately $\hat\beta_1 \pm 2\cdot \text{SE}(\hat\beta_1)$, and similarly for $\beta_0$.
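Putting the SE formulas and the $\pm 2\,\text{SE}$ intervals together in a sketch (synthetic data; the true coefficients 1 and 2 are my own choices, and $\sigma$ is estimated by the RSE):

```python
# Sketch: standard errors and rough 95% intervals for the simple linear regression coefficients.
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rse = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))   # estimate of sigma
sxx = np.sum((x - x.mean()) ** 2)
se_b1 = rse / np.sqrt(sxx)
se_b0 = rse * np.sqrt(1 / n + x.mean() ** 2 / sxx)
print(f"beta1: {b1:.3f} +/- {2 * se_b1:.3f}")             # interval should cover 2
print(f"beta0: {b0:.3f} +/- {2 * se_b0:.3f}")             # interval should cover 1
```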

Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of

$$H_0: \text{There is no relationship between } X \text{ and } Y$$

vs the alternative hypothesis

$$H_a: \text{There is some relationship between } X \text{ and } Y.$$

Mathematically, this corresponds to testing

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0.$$

To test the null hypothesis, we need to determine whether our estimate is sufficiently far from zero to be confident that $\beta_1$ is non-zero. How far depends on the accuracy of our estimate. In practice, we compute a t-statistic:

$$t = \frac{\hat\beta_1 - 0}{\text{SE}(\hat\beta_1)}, \tag{3.14}$$

which measures the number of standard deviations that $\hat\beta_1$ is away from 0. If there is no relationship between $X$ and $Y$, we expect (3.14) to have a t-distribution with $n-2$ degrees of freedom. The t-distribution has a bell shape and, for $n$ greater than about 30, is very similar to the normal distribution. So it is easy to compute the probability of observing any value equal to $|t|$ or larger, assuming $\beta_1 = 0$. We call this probability the p-value.

Roughly speaking, a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response purely by chance, in the absence of any real association between them. Hence, if there is a small p-value, we can infer there is an association between the predictor and response: we reject the null hypothesis, i.e., declare a relationship to exist between $X$ and $Y$. Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%.
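A sketch of (3.14) and its two-sided p-value on synthetic data, cross-checked against scipy's built-in simple linear regression (my own example, not the book's):

```python
# Sketch: t-statistic and p-value for H0: beta1 = 0, vs. scipy.stats.linregress.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)

res = stats.linregress(x, y)                       # slope, intercept, stderr, pvalue, ...
t = (res.slope - 0) / res.stderr                   # number of SEs away from zero
p = 2 * stats.t.sf(abs(t), df=len(x) - 2)          # two-sided p-value under t_{n-2}
print(f"t = {t:.2f}, p = {p:.3g} (scipy reports {res.pvalue:.3g})")
```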

3.1.3 Assessing the Accuracy of the Model

After rejecting the null hypothesis we want to know the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R2R^2 statistic.

Recall that associated with each observation is an error term. The RSE is an estimate of the standard deviation of $\epsilon$ (the average amount that the response will deviate from the true regression line):

$$\text{RSE} = \sqrt{\frac{1}{n-2}\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat y_i)^2}.$$

The RSE measures a lack of fit. The $R^2$ statistic provides an alternative measure of fit: it is the proportion of variance explained, so it is always between 0 and 1.

$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}},$$

where $\text{TSS} = \sum (y_i - \bar y)^2$ is the total sum of squares. TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed. RSS measures the amount of variability left unexplained after performing the regression. So $R^2$ is the proportion of variability in $Y$ that can be explained using $X$.
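A short sketch computing the RSE and $R^2$ from their definitions (synthetic data with noise SD 1, so the RSE should come out near 1; my own example):

```python
# Sketch: RSE and R^2 for a simple least squares fit.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)

b1, b0 = np.polyfit(x, y, 1)                      # least squares fit
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print("RSE:", np.sqrt(rss / (len(x) - 2)))        # close to the noise SD used above
print("R^2:", 1 - rss / tss)                      # proportion of variance explained
```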

3.2 Multiple Linear Regression

Fitting a separate simple linear regression model for each predictor is not satisfactory (for example, it's unclear how to make a single prediction from several different regression equations).

A better approach is to extend the simple linear model to multiple linear regression, where each predictor gets a separate slope coefficient in a single model. Suppose we have $p$ distinct predictors. Then the multiple linear regression model takes the form

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.$$

We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.

3.2.1 Estimating the Regression Coefficients

As in simple linear reg, we want to make predictions using the formula

$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p.$$

The parameters are again estimated using the same least squares approach, choosing our regression coefficient estimates as those that minimize the sum of squared residuals

$$\begin{aligned}
\text{RSS} &= \sum_{i=1}^n (y_i - \hat y_i)^2 \\
&= \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip})^2.
\end{aligned}$$

The multiple regression coefficient estimates are more complex than the simple linear regression coefficients and are better represented in matrix form.
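In matrix form the least squares solution is $\hat\beta = (X^\top X)^{-1} X^\top y$; a sketch on synthetic data with $p = 3$ (the true coefficients are my own choices):

```python
# Sketch: multiple regression coefficients via the normal equations.
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])            # [beta0, beta1, beta2, beta3]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1, n)

X1 = np.column_stack([np.ones(n), X])                  # prepend an intercept column
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)        # minimizes the RSS above
print(beta_hat)                                        # close to beta_true
print(np.linalg.lstsq(X1, y, rcond=None)[0])           # same fit, numerically preferable
```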

3.2.2 Some important questions

3.3 Other Considerations in the Regression Model

3.3.2 Extensions of the Linear Model

Two important assumptions of the linear regression model that are often violated in practice are the additive assumption and the linearity assumption.

Additive: the effect of changes in a predictor $X_j$ on the response $Y$ is independent of the values of the other predictors.

Linear: the change in the response $Y$ due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.

3.3.3 Potential Problems

When we fit a linear regression model, common problems that may occur are: non-linearity of the response–predictor relationships, correlation of the error terms, non-constant variance of the error terms, outliers, high-leverage points, and collinearity.

3.5 Comparison of Linear Regression w/ KNN

One of the simplest non-parametric methods is K-nearest neighbors regression (KNN regression).

KNN regression is closely related to the KNN classifier. Given a value $K$ and a prediction point $x_0$, KNN regression first identifies the $K$ training observations closest to $x_0$, represented by $N_0$. It then estimates $f(x_0)$ using the average of all the training responses in $N_0$:

$$\hat f(x_0) = \frac{1}{K}\sum_{x_i \in N_0} y_i.$$
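A bare-bones version of this estimator (1-D toy data; the function name knn_regress is mine):

```python
# Sketch: KNN regression, averaging the K nearest training responses.
import numpy as np

def knn_regress(x_train, y_train, x0, K):
    neighbors = np.argsort(np.abs(x_train - x0))[:K]   # indices of N_0
    return y_train[neighbors].mean()                   # average of their responses

rng = np.random.default_rng(7)
x_train = rng.uniform(0, 3, 100)
y_train = 2.0 * x_train + rng.normal(0, 0.2, 100)      # true f happens to be linear here
print(knn_regress(x_train, y_train, 1.5, K=5))         # should be near 3.0
```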

In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression? The answer is simple: the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of $f$.

notes

Ch 4 – Classification

Ch 5 – Resampling Methods

5.1 Cross Validation

5.1.1 Validation Set Approach

5.1.2 Leave One Out Cross Validation

$$\text{CV}_n = \frac{1}{n}\sum_{i=1}^n \text{MSE}_i.$$

Advantages over the validation set approach: far less bias (each fit uses $n-1$ observations), and no randomness in the training/validation splits (LOOCV always yields the same result).

Note: LOOCV has the potential to be expensive to implement, since we have to fit the model $n$ times; this is time-consuming if $n$ is large. However, with least squares linear or polynomial regression, the cost of LOOCV is the same as that of a single model fit, because the following formula holds:

$$\text{CV}_n = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_i}\right)^2,$$

where $\hat y_i$ is the $i$-th fitted value from the original least squares fit, and $h_i$ is the leverage.
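A quick check of the shortcut against brute-force refitting, for simple linear regression on synthetic data (my own example; the two numbers should agree up to rounding):

```python
# Sketch: LOOCV via the leverage formula vs. n explicit refits.
import numpy as np

rng = np.random.default_rng(8)
n = 60
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# Shortcut: one fit plus the leverages h_i (diagonal of the hat matrix).
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)   # diag of X (X^T X)^{-1} X^T
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving out one observation.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    errs.append((y[i] - X[i] @ b) ** 2)
print(cv_shortcut, np.mean(errs))
```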

5.1.3 k-Fold Cross Validation

$$\text{CV}_k = \frac{1}{k}\sum_{i=1}^k \text{MSE}_i.$$

Clearly, LOOCV is a special case of $k$-fold CV where $k = n$. In practice, one typically does $k$-fold CV with $k = 5$ or $k = 10$.

Computational advantage over LOOCV.
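A plain-numpy sketch of $k$-fold CV for a least squares fit (synthetic data and $k = 5$ are my own choices):

```python
# Sketch: k-fold cross-validation estimate of the test MSE.
import numpy as np

rng = np.random.default_rng(9)
n, k = 100, 5
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

folds = np.array_split(rng.permutation(n), k)           # k roughly equal folds
mses = []
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)            # everything not in this fold
    b1, b0 = np.polyfit(x[train], y[train], 1)
    mses.append(np.mean((y[fold] - (b0 + b1 * x[fold])) ** 2))
print("CV_k estimate of the test MSE:", np.mean(mses))
```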

5.1.4 Bias-Variance Trade-Off for k-Fold CV

$k$-fold CV also often gives more accurate estimates of the test error rate than LOOCV, which has to do with a bias-variance trade-off.

Obviously, in terms of bias reduction, LOOCV is preferable to $k$-fold CV. But LOOCV has higher variance than $k$-fold CV with $k < n$: the $n$ fitted models in LOOCV are trained on nearly identical sets of observations, so their outputs are highly (positively) correlated, and the mean of many highly correlated quantities has higher variance than the mean of less correlated ones.

So, typically, considering this trade-off, one does $k$-fold CV with $k = 5$ or $k = 10$, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

5.2 The Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

However, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

Bootstrapping is a procedure for estimating the distribution of an estimator by resampling (often with replacement) one's data or a model estimated from the data. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

In the example in the book, suppose we want to invest a fixed sum of money in two financial assets yielding returns of $X$ and $Y$, respectively, where $X$ and $Y$ are random quantities. The example goes through finding the fraction $\alpha$ of the money we should invest in $X$ so as to minimize the variance of our overall investment. In reality, we end up needing to compute estimates using a data set of past measurements, and then estimate $\alpha$. To estimate the standard deviation of $\hat\alpha$, we would ideally repeat a process of simulating observations and estimating $\alpha$ many times. This can't actually be done in real life, because we can't generate new samples from the original population. The bootstrap approach lets us emulate the process of obtaining new sample sets by repeatedly sampling observations from the original data set with replacement.
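A sketch of that workflow (the return distribution parameters and the seed are my own choices; the variance-minimizing fraction is $\alpha = (\sigma_Y^2 - \sigma_{XY})/(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY})$):

```python
# Sketch: bootstrapping the standard error of alpha_hat from one observed data set.
import numpy as np

rng = np.random.default_rng(10)

def alpha_hat(x, y):
    cov = np.cov(x, y)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

# "Observed" data set of n past returns for the two assets.
n = 100
data = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=n)
x, y = data[:, 0], data[:, 1]

# Bootstrap: resample the n observations with replacement, re-estimate alpha each time.
B = 1000
alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    alphas[b] = alpha_hat(x[idx], y[idx])
print("alpha_hat:", alpha_hat(x, y), " bootstrap SE:", alphas.std(ddof=1))
```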

Ch 6 – Linear Model Selection & Regularization

Chapter discusses ways the simple linear model can be improved by replacing least squares fitting with alternative fitting procedures. Alternative fitting procedures can yield better prediction accuracy and better model interpretability.

Three important classes of methods discussed: subset selection, shrinkage (regularization), and dimension reduction.

6.1 Subset Selection

Best subset selection: fit a separate least squares regression for each possible combination of the $p$ predictors. (Obviously too computationally slow for large $p$, since there are $2^p$ possible models.)

Stepwise selection: forward and backward. Forward stepwise selection begins with a model containing no predictors, then adds predictors one at a time (at each step, the variable giving the greatest additional improvement to the fit is added to the model). Backward stepwise selection begins with the full least squares model containing all $p$ predictors and iteratively removes the least useful predictor. There are also hybrid approaches.
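A minimal forward stepwise sketch on synthetic data (my own code and data; model comparison across sizes would then use CV or the criteria in 6.1.3, not the training RSS):

```python
# Sketch: forward stepwise selection, greedily adding the predictor that most reduces the RSS.
import numpy as np

def rss_of(X_sub, y):
    X1 = np.column_stack([np.ones(len(y)), X_sub])     # intercept plus the chosen columns
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

rng = np.random.default_rng(11)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, n)    # only predictors 0 and 3 matter

selected, remaining = [], list(range(p))
while remaining:
    best = min(remaining, key=lambda j: rss_of(X[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)
    print("added predictor", best, "RSS =", round(rss_of(X[:, selected], y), 1))
```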

6.1.3 Choosing the Optimal Model

Choosing a model by RSS and $R^2$ is indicative only of training error (not test error). We want to estimate the test error. Two common approaches: indirectly estimate it by adjusting the training error for model size, or directly estimate it using a validation set or cross-validation.

Training set MSE is generally an underestimate of the test MSE. This is because when we fit a model to the training data using least squares, we specifically estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. In particular, the training error will decrease as more variables are included in the model, but the test error may not. Therefore, training set RSS and training set $R^2$ cannot be used to select from among a set of models with different numbers of variables.

Techniques for adjusting the training error for the model size: $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted $R^2$.

For a fitted least squares model containing $d$ predictors, the $C_p$ estimate of test MSE is

$$C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat\sigma^2\right),$$

where $\hat\sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. (This adds a penalty for the number of predictors.)

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing:

$$\text{AIC} = \frac{1}{n\hat\sigma^2}\left(\text{RSS} + 2d\hat\sigma^2\right).$$

For the least squares model with $d$ predictors, the BIC is, up to constants, given by

$$\text{BIC} = \frac{1}{n}\left(\text{RSS} + \log(n)\,d\hat\sigma^2\right).$$

The adjusted $R^2$ for a least squares model with $d$ variables is

$$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}.$$
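A sketch computing $C_p$, BIC, and adjusted $R^2$ for nested least squares fits that use the first $d$ predictors; estimating $\hat\sigma^2$ from the full model's residuals, the synthetic data, and the nesting are my own simplifications:

```python
# Sketch: model-size criteria for least squares fits with d = 1, ..., p predictors.
import numpy as np

rng = np.random.default_rng(12)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)    # only the first two predictors matter

def fit_rss(d):
    X1 = np.column_stack([np.ones(n), X[:, :d]])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

tss = np.sum((y - y.mean()) ** 2)
sigma2 = fit_rss(p) / (n - p - 1)                      # estimate of Var(eps) from the full model
for d in range(1, p + 1):
    rss = fit_rss(d)
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    print(f"d={d}: Cp={cp:.3f}  BIC={bic:.3f}  adjusted R^2={adj_r2:.3f}")
```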

6.2 Shrinkage Methods

See ESL for ridge, LASSO.

6.3 Dimension Reduction Methods

Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors, so

$$Z_m = \sum_{j=1}^p \phi_{jm} X_j,$$

for some constants $\phi_{1m}, \ldots, \phi_{pm}$, $m = 1, \ldots, M$. We can then fit the linear regression model

$$y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i$$

for $i = 1, \ldots, n$ using least squares. Fitting this model using least squares can lead to better results than fitting the full model using least squares. The term dimension reduction comes from reducing the problem of estimating $p+1$ coefficients to estimating $M+1$ coefficients, where $M < p$.
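A sketch of this recipe where the $\phi_{jm}$ are taken to be the top-$M$ principal component loadings (i.e., principal components regression, one common choice; the data and $M$ are my own choices):

```python
# Sketch: regress y on M < p linear combinations Z_1, ..., Z_M of the predictors.
import numpy as np

rng = np.random.default_rng(13)
n, p, M = 200, 10, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 1, n)

Xc = X - X.mean(axis=0)                             # center the predictors
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are the loading vectors
Z = Xc @ Vt[:M].T                                   # Z_m = sum_j phi_jm X_j
Z1 = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(Z1, y, rcond=None)      # fit y on Z_1, ..., Z_M by least squares
print(theta)                                        # theta_0, theta_1, ..., theta_M
```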

6.3.1 PCA

HERE ONWARDS JUST SEE ESL

Ch 7 – Moving Beyond Linearity

Ch 8 – Tree Based Methods

Ch 9 – Support Vector Machines

Ch 10 – Unsupervised Learning

Chapter focuses on PCA and clustering methods.

10.1 The Challenge of Unsupervised Learning

10.2 PCA

Principal components discussed in Chapter 6.