ISLR 4.6 Lab: Logistic Regression, LDA, QDA, and KNN-白红宇


Published: 2019-06-12


4.6.1 The Stock Market Data

> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative, which is a rather interesting detail.

> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
             Year         Lag1         Lag2         Lag3         Lag4
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
               Lag5      Volume        Today
Year    0.029787995  0.53900647  0.030095229
Lag1   -0.005674606  0.04090991 -0.026155045
Lag2   -0.003557949 -0.04338321 -0.010250033
Lag3   -0.018808338 -0.04182369 -0.002447647
Lag4   -0.027083641 -0.04841425 -0.006899527
Lag5    1.000000000 -0.02200231 -0.034860083
Volume -0.022002315  1.00000000  0.014591823
Today  -0.034860083  0.01459182  1.000000000
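The numeric-only restriction can also be handled without hard-coding the column index. Below is a minimal sketch on a small made-up data frame; the name toy and its columns are illustrative assumptions, not part of Smarket.

```r
# Toy data frame with one qualitative column (hypothetical data, not Smarket)
set.seed(1)
toy <- data.frame(
  a = rnorm(50),
  b = rnorm(50),
  label = factor(sample(c("Down", "Up"), 50, replace = TRUE))
)

# Keep only the numeric columns, then compute the pairwise correlations
numeric_cols <- sapply(toy, is.numeric)
cor_mat <- cor(toy[, numeric_cols])
print(round(cor_mat, 3))
```

Selecting columns by type rather than by position keeps the call working if the qualitative column moves.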

4.6.2 Logistic Regression

The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.446  -1.203   1.065   1.145   1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

Analysis:

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

Inspecting the fitted parameters

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.

> coef(glm.fit)
 (Intercept)         Lag1         Lag2         Lag3         Lag4
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938
        Lag5       Volume
 0.010313068  0.135440659
> summary(glm.fit)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004
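Since summary()$coef is an ordinary matrix, the p-values can be pulled out as a column. The sketch below uses simulated data so that it runs without the ISLR package; the variables x1, x2 and y are made up for illustration.

```r
# Simulated data standing in for Smarket (illustrative only)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(0.7 * x1), "Up", "Down"))
fit <- glm(y ~ x1 + x2, family = binomial)

coefs <- summary(fit)$coef        # matrix: Estimate, Std. Error, z value, Pr(>|z|)
p_values <- coefs[, "Pr(>|z|)"]   # fourth column, selected by name
p_values

# Predictor (excluding the intercept) with the smallest p-value
names(which.min(p_values[-1]))
```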

Predicting the outcome

The predict() function can be used to predict the probability that the market will go up, given values of the predictors.

The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.

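> attach(Smarket)
> glm.probs=predict(glm.fit,type="response")

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The contrasts() function shows that R has created a dummy variable with a 1 for Up, so the probabilities above are for the market going up.

> contrasts(Direction)
     Up
Down  0
Up    1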
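To see what type="response" changes, it helps to compare it with the logit scale directly. The following is a small sketch on simulated data (x and y are assumptions made up for this example), checking that the inverse logit of the linear predictor recovers the probabilities.

```r
# Simulated binary-response data (illustrative only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(0.5 * x), "Up", "Down"))

fit <- glm(y ~ x, family = binomial)

# type = "link" returns the log-odds; type = "response" returns P(Y = 1 | X)
log_odds <- predict(fit, type = "link")
probs <- predict(fit, type = "response")

# The inverse logit of the log-odds recovers the probabilities
all.equal(unname(plogis(log_odds)), unname(probs))

# contrasts() shows which level is coded as 1 (here "Up", the second level)
contrasts(y)
```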

Next:

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
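The share of correct predictions follows from the diagonal of the confusion matrix. A short sketch, with the counts copied from the table() output above and the object name conf chosen only for this example:

```r
# Confusion matrix counts copied from the table() output above
# rows = predicted class, columns = true Direction
conf <- matrix(c(145, 457, 141, 507), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction = c("Down", "Up")))

accuracy <- sum(diag(conf)) / sum(conf)   # correct predictions / all predictions
train_error <- 1 - accuracy

accuracy      # 0.5216
train_error   # 0.4784
```

Because the model was fit and evaluated on the same 1,250 observations, this is a training error rate and is likely to be optimistic.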

Validation on held-out data: to get a more realistic estimate of the error rate, we create a held-out data set of observations from 2005.

> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> Direction.2005=Direction[!train]

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set, that is, for the days in 2005.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
              data=Smarket,family=binomial,subset=train)

This part gets messy, so I will not continue with it.
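For readers who still want to see the hold-out pattern end to end, here is a sketch on simulated data rather than Smarket. The data frame sim, its columns and the coefficients used to generate it are all assumptions made for illustration, so the resulting numbers say nothing about the stock market.

```r
# Simulated data with a year column (illustrative, not Smarket)
set.seed(1)
n <- 500
sim <- data.frame(year = rep(2001:2005, each = 100),
                  x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(0.8 * sim$x1 - 0.5 * sim$x2),
                       "Up", "Down"))

train <- sim$year < 2005            # logical vector: TRUE for training rows
test_set <- sim[!train, ]

# Fit on the training rows only, via the subset argument
fit <- glm(y ~ x1 + x2, data = sim, family = binomial, subset = train)

# Predict on the held-out rows and convert probabilities to labels
probs <- predict(fit, newdata = test_set, type = "response")
pred <- ifelse(probs > 0.5, "Up", "Down")

table(pred, test_set$y)
mean(pred == test_set$y)            # held-out accuracy
```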

4.6.3 Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.

> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

The LDA output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.

If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing −0.642 × Lag1 − 0.514 × Lag2 for each of the training observations.
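The linear discriminant can be reproduced by hand from the fitted object. The sketch below uses simulated two-class data (d, x1 and x2 are made up for this example). It assumes, as I understand MASS's predict method for lda objects, that the predictors are first centred at the prior-weighted average of the group means before the coefficients are applied; the final check tests that assumption.

```r
library(MASS)

# Simulated two-class data (illustrative only)
set.seed(1)
n <- 200
cls <- factor(rep(c("Down", "Up"), each = n / 2))
d <- data.frame(cls = cls,
                x1 = rnorm(n, mean = ifelse(cls == "Up", -0.5, 0.5)),
                x2 = rnorm(n, mean = ifelse(cls == "Up", -0.3, 0.3)))

fit <- lda(cls ~ x1 + x2, data = d)

# Linear discriminant computed by hand: centre the predictors at the
# prior-weighted average of the group means, then apply the coefficients
centre <- colSums(fit$prior * fit$means)
X <- as.matrix(d[, c("x1", "x2")])
ld_manual <- scale(X, center = centre, scale = FALSE) %*% fit$scaling

# predict() returns the same scores in its x component
ld_pred <- predict(fit, d)$x
all.equal(as.vector(ld_manual), as.vector(ld_pred))
```

The centring only shifts every score by the same constant, so it does not change which observations score higher or lower.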

> lda.pred=predict(lda.fit,Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"

The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from equation (4.10) of ISLR. Finally, x contains the linear discriminants, described earlier.

> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106
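The class and posterior elements are tied together: with two classes, thresholding the posterior probability at 50% should give back the predicted class. A sketch on simulated data (all names here are made up for the example), followed by the accuracy implied by the 2005 table above:

```r
library(MASS)

# Simulated two-class data with one predictor (illustrative only)
set.seed(2)
n <- 200
cls <- factor(rep(c("Down", "Up"), each = n / 2))
d <- data.frame(cls = cls,
                x1 = rnorm(n, mean = ifelse(cls == "Up", -0.5, 0.5)))
fit <- lda(cls ~ x1, data = d)
pred <- predict(fit, d)

# posterior has one column per class, named after the factor levels
head(pred$posterior, 3)

# Thresholding P(Up | x) at 0.5 reproduces the class component
by_threshold <- ifelse(pred$posterior[, "Up"] > 0.5, "Up", "Down")
all(by_threshold == as.character(pred$class))

# Accuracy from the 2005 table above: correct predictions / 252 test days
(35 + 106) / 252
```

A threshold other than 0.5 can be applied to the posterior column in the same way when one class should be predicted only with higher confidence.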

4.6.4 Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library.

> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.
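A sketch of the predict step for QDA, again on simulated data so that it runs without ISLR. The data frame d and the alternating train/test split are assumptions for illustration only.

```r
library(MASS)

# Simulated data (illustrative only); the two classes have different
# spreads so that a quadratic boundary is plausible
set.seed(3)
n <- 200
cls <- factor(rep(c("Down", "Up"), each = n / 2))
d <- data.frame(cls = cls,
                x1 = rnorm(n, sd = ifelse(cls == "Up", 2, 1)),
                x2 = rnorm(n, sd = ifelse(cls == "Up", 2, 1)))
train <- rep(c(TRUE, FALSE), length.out = n)   # alternate rows: train / test

fit <- qda(cls ~ x1 + x2, data = d, subset = train)
pred <- predict(fit, d[!train, ])

names(pred)                        # "class" "posterior" (no x component)
table(pred$class, d$cls[!train])
mean(pred$class == d$cls[!train])  # held-out accuracy
```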

4.6.5 K-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library.

The function requires four inputs.

1. A matrix containing the predictors associated with the training data, labeled train.X below.

2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.

3. A vector containing the class labels for the training observations, labeled train.Direction below (train.Y in the Caravan example).

4. A value for K, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

Seed

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87
> mean(knn.pred==Direction.2005)
[1] 0.5357143
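The choice of K can be explored by repeating the same call for several values. A sketch on simulated data (the matrix X, the split and the list of K values are assumptions for illustration, so the accuracies are not comparable to the Smarket numbers):

```r
library(class)

# Simulated two-class data (illustrative only)
set.seed(1)
n <- 300
cls <- factor(rep(c("Down", "Up"), each = n / 2))
X <- cbind(x1 = rnorm(n, mean = ifelse(cls == "Up", 1, 0)),
           x2 = rnorm(n, mean = ifelse(cls == "Up", 1, 0)))
train <- rep(c(TRUE, TRUE, FALSE), length.out = n)  # 2/3 train, 1/3 test

train.X <- X[train, ]
test.X <- X[!train, ]
train.Y <- cls[train]
test.Y <- cls[!train]

# Held-out accuracy for several values of K
ks <- c(1, 3, 5, 10)
acc <- sapply(ks, function(k) {
  set.seed(1)                     # ties are broken at random, so fix the seed
  pred <- knn(train.X, test.X, train.Y, k = k)
  mean(pred == test.Y)
})
names(acc) <- paste0("k=", ks)
acc
```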

The KNN results are poor (about 54% accuracy with K = 3). Of the methods tried so far, QDA gives the best results on this type of data.

4.6.6 An Application to Caravan Insurance Data

The Caravan data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.

Limitations on KNN

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.

For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.

> standardized.X=scale(Caravan[,-86])
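The salary and age example can be made concrete with three hypothetical people. The values below are made up for illustration; the point is only how the Euclidean distances change once scale() is applied.

```r
# Three hypothetical people: salary in dollars, age in years
people <- rbind(A = c(salary = 50000, age = 25),
                B = c(salary = 51000, age = 25),   # differs from A by $1,000
                C = c(salary = 50000, age = 75))   # differs from A by 50 years

dist(people)              # raw Euclidean distances: A-B = 1000, A-C = 50

std <- scale(people)      # each column now has mean 0 and sd 1
round(colMeans(std), 10)
apply(std, 2, sd)
dist(std)                 # after scaling, neither variable dominates
```

On the raw scale, B looks twenty times farther from A than C does, purely because salary is measured in larger units.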

We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using K = 1, and evaluate its performance on the test data.

> attach(Caravan)
> test=1:1000
> train.X=standardized.X[-test,]
> test.X=standardized.X[test,]
> train.Y=Purchase[-test]
> test.Y=Purchase[test]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Y,k=1)

Reposted from: https://www.cnblogs.com/jiajiaxingxing/p/4684813.html
