吾生也有涯，而知也无涯，以有涯随无涯，殆已。

数据描述与背景导入

各变量介绍

Code

```{r}
df %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "black", background = "#98F5FF" )
```

变量名 \|数据	型 \|含义	\|
fixed acidity	连续型变量 \|固定酸度	\|
volatile acidity	连续型变量 \|挥发性酸	\|
citric acid	连续型变量 \|柠檬酸	\|
residual sugar	连续型变量 \|残糖	\|
chlorides	连续型变量 \|氯化物	\|
free sulfur dioxide	连续型变量 \|游离二氧	硫 \|
total sulfur dioxide	连续型变量 \|总二氧化	\|
density	连续型变量 \|密度	\|
pH	连续型变量 \|pH值	\|
sulphates	连续型变量 \|硫酸盐	\|
alcohol	连续型变量 \|酒精浓度	\|
quality	有序分类变量 \|酒质量3-	\|

CTG数据

数据预览

Code

```{r}
d %>%  
  mlmCorrs::corstars()
```

Variable	Mean	SD	1	2	3	4	5	6	7	8	9	10	11	12
Correlation Table
1. Fixed.acidity	8.32	1.74	--
2. Volatile.acidity	0.53	0.18	-.26**	--
3. Citric.acid	0.27	0.19	.67**	-.55**	--
4. Residual.sugar	2.54	1.41	.11**	.00	.14**	--
5. Chlorides	0.09	0.05	.09**	.06*	.20**	.06*	--
6. Free.sulfur.dioxide	15.87	10.46	-.15**	-.01	-.06*	.19**	.01	--
7. Total.sulfur.dioxide	46.47	32.90	-.11**	.08**	.04	.20**	.05	.67**	--
8. Density	1.00	0.00	.67**	.02	.36**	.36**	.20**	-.02	.07**	--
9. PH	3.31	0.15	-.68**	.23**	-.54**	-.09**	-.27**	.07**	-.07**	-.34**	--
10. Sulphates	0.66	0.17	.18**	-.26**	.31**	.01	.37**	.05*	.04	.15**	-.20**	--
11. Alcohol	10.42	1.07	-.06*	-.20**	.11**	.04	-.22**	-.07**	-.21**	-.50**	.21**	.09**	--
12. Quality	5.64	0.81	.12**	-.39**	.23**	.01	-.13**	-.05*	-.19**	-.17**	-.06*	.25**	.48**	--
Note. p < .05. *p < .01.

Code

```{r}
corrplot(corr=cor(d))
d$quality=factor(d$quality,ordered=TRUE)
d %>% select(1:6)%>% head(5) %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "white", background = "#696969" )
d %>% select(7:12)%>% head(5) %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "white", background = "#696969" )
skimr::skim(d)
```

fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide
7.4	0.70	0.00	1.9	0.076	11
7.8	0.88	0.00	2.6	0.098	25
7.8	0.76	0.04	2.3	0.092	15
11.2	0.28	0.56	1.9	0.075	17
7.4	0.70	0.00	1.9	0.076	11

CTG数据

CTG数据
total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
34	0.9978	3.51	0.56	9.4	5
67	0.9968	3.20	0.68	9.8	5
54	0.9970	3.26	0.65	9.8	5
60	0.9980	3.16	0.58	9.8	6
34	0.9978	3.51	0.56	9.4	5

Data summary
Name	d
Number of rows	1599
Number of columns	12
_______________________
Column type frequency:
factor	1
numeric	11
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
quality	0	1	TRUE	6	5: 681, 6: 638, 7: 199, 4: 53

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
fixed.acidity	1	8.32	1.74	4.60	7.10	7.90	9.20	15.90	▂▇▂▁▁
volatile.acidity	1	0.53	0.18	0.12	0.39	0.52	0.64	1.58	▅▇▂▁▁
citric.acid	1	0.27	0.19	0.00	0.09	0.26	0.42	1.00	▇▆▅▁▁
residual.sugar	1	2.54	1.41	0.90	1.90	2.20	2.60	15.50	▇▁▁▁▁
chlorides	1	0.09	0.05	0.01	0.07	0.08	0.09	0.61	▇▁▁▁▁
free.sulfur.dioxide	1	15.87	10.46	1.00	7.00	14.00	21.00	72.00	▇▅▁▁▁
total.sulfur.dioxide	1	46.47	32.90	6.00	22.00	38.00	62.00	289.00	▇▂▁▁▁
density	1	1.00	0.00	0.99	1.00	1.00	1.00	1.00	▁▃▇▂▁
pH	1	3.31	0.15	2.74	3.21	3.31	3.40	4.01	▁▅▇▂▁
sulphates	1	0.66	0.17	0.33	0.55	0.62	0.73	2.00	▇▅▁▁▁
alcohol	1	10.42	1.07	8.40	9.50	10.20	11.10	14.90	▇▇▃▁▁

可视化

Code

```{r}
ggplot(d,aes(quality,alcohol))+
  geom_boxplot(mapping=aes(fill=quality))+
  geom_jitter(aes(fill=quality),width =0.2,shape = 21,size=0.5)+
  theme_bw() + 
  theme(panel.grid.major =element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))+
  scale_fill_npg()+
  labs(title = "六种红酒质量下alcohol的分布",
    tag = "箱型图",
    caption = "数据来源：winequality")
```

非参数检验

多样本位置检验：六种质量下的红酒酒精浓度中位数是否不同？

\[ H_0:same\ vs\ H_1:not \]

Code

```{r}
kruskal.test(alcohol~quality,d)
```


    Kruskal-Wallis rank sum test

data:  alcohol by quality
Kruskal-Wallis chi-squared = 412.38, df = 5, p-value <
0.00000000000000022

\[ H_0:same\ vs\ H_1:increasing \]

Code

```{r}
JonckheereTerpstraTest( alcohol~quality,data=d,alternative = c( "increasing"))
```


    Jonckheere-Terpstra test

data:  alcohol by quality
JT = 602267, p-value < 0.00000000000000022
alternative hypothesis: increasing

独立两样本比较的Wilcoxon秩和检验

将红酒分成两组, 比较质量为3，4，5与6，7，8的红酒样本的酒精浓度中位数是否有显著性差异 \[ \\H_0:\Delta=0 \]

Code

```{r}
d1 <- d %>% filter(quality==5|quality==4|quality==3)
d2 <- d %>% filter(quality==6|quality==7|quality==8)
wilcox.test(d1$alcohol,d2$alcohol)
```


    Wilcoxon rank sum test with continuity correction

data:  d1$alcohol and d2$alcohol
W = 154807, p-value < 0.00000000000000022
alternative hypothesis: true location shift is not equal to 0

检验结果可见两组中位数确实有显著性差异

分布检验

Code

```{r}
ggplot(d)+
  geom_histogram(mapping=aes(x=density))+
  labs(title = "density的分布",
    tag = "直方图",
    caption = "数据来源：winequality")
```

红酒密度的分布是否符合正态

\[ H_0:norm\ vs\ H_1:unnorm \]

Code

```{r}
x <- d$density
mu=mean(x);sigma=sum((x-mean(x))^2)/length(x)
ks.test(x,"pnorm",mu,sigma)
```


    Asymptotic one-sample Kolmogorov-Smirnov test

data:  x
D = 0.49781, p-value < 0.00000000000000022
alternative hypothesis: two-sided

接下来比较一下先前分出的两组红酒的酒精浓度分布是否不同

\[ H_0:same\ pdf\ vs\ H_1:unsame \]

Code

```{r}
ks.test(d1$alcohol,d2$alcohol)
```


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  d1$alcohol and d2$alcohol
D = 0.4103, p-value < 0.00000000000000022
alternative hypothesis: two-sided

一元核密度估计

这里使用插入法计算宽窗

Code

```{r}
d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="gaussian",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=gaussian,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))

d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="epanechnikov",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=epanechnikov,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))

d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="rectangular",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=rectangular,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))
```

多元核密度估计

对酒精浓度与密度做二元核密度估计

Code

```{r}
par(mfrow=c(1,2))
fhat <- kde(d %>% select(alcohol,density))
plot(fhat,display="filled.contour2")
points(d %>% select(alcohol,density),cex=0.1,pch=5)
plot(fhat,diplay="persp")
```

分类

先分好训练集与测试集，且要保证好酒与坏酒的比例与原数据集相近

Code

```{r}
d <- read.csv("C:/Users/10156/Desktop/non/data/winequality-red.csv", sep=";")
d$quality[d$quality==3]=0
d$quality[d$quality==4]=0
d$quality[d$quality==5]=0
d$quality[d$quality==6]=1
d$quality[d$quality==7]=1
d$quality[d$quality==8]=1
d$quality=factor(d$quality)
library(rsample)
d_split <- initial_split(d, prop = 0.9,strata = quality)
d_train <- training(d_split)
d_test <- testing(d_split)
```

logistic回归

\[ cheap\ wine1(3,4,5)and\ good\ wine2(6,7,8)res\\ \begin{aligned} Y \sim & B(m, p) \\ g(p) =& \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \\ AIC\ backward\ to\ choose\ model \end{aligned} \]

Code

```{r}
fit=glm(quality~fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol,family = binomial,data=d_train,na.action = na.pass)
fit1=step(fit,direction = "backward")
```

Start:  AIC=1489.85
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol

                       Df Deviance    AIC
- pH                    1   1465.9 1487.9
- density               1   1466.7 1488.7
<none>                      1465.8 1489.8
- residual.sugar        1   1468.0 1490.0
- fixed.acidity         1   1468.7 1490.7
- citric.acid           1   1470.3 1492.3
- chlorides             1   1472.2 1494.2
- free.sulfur.dioxide   1   1474.3 1496.3
- sulphates             1   1500.0 1522.0
- total.sulfur.dioxide  1   1500.5 1522.5
- volatile.acidity      1   1511.7 1533.7
- alcohol               1   1534.5 1556.5

Step:  AIC=1487.86
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + sulphates + alcohol

                       Df Deviance    AIC
- density               1   1467.3 1487.3
<none>                      1465.9 1487.9
- residual.sugar        1   1468.5 1488.5
- citric.acid           1   1470.4 1490.4
- chlorides             1   1472.5 1492.5
- fixed.acidity         1   1472.8 1492.8
- free.sulfur.dioxide   1   1474.4 1494.4
- sulphates             1   1500.9 1520.9
- total.sulfur.dioxide  1   1501.6 1521.6
- volatile.acidity      1   1512.0 1532.0
- alcohol               1   1559.8 1579.8

Step:  AIC=1487.34
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    sulphates + alcohol

                       Df Deviance    AIC
- residual.sugar        1   1468.7 1486.7
<none>                      1467.3 1487.3
- citric.acid           1   1471.8 1489.8
- fixed.acidity         1   1473.2 1491.2
- chlorides             1   1473.6 1491.6
- free.sulfur.dioxide   1   1475.8 1493.8
- sulphates             1   1500.9 1518.9
- total.sulfur.dioxide  1   1502.6 1520.6
- volatile.acidity      1   1517.8 1535.8
- alcohol               1   1643.7 1661.7

Step:  AIC=1486.69
quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
    free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
    alcohol

                       Df Deviance    AIC
<none>                      1468.7 1486.7
- citric.acid           1   1472.9 1488.9
- chlorides             1   1474.7 1490.7
- fixed.acidity         1   1474.8 1490.8
- free.sulfur.dioxide   1   1477.7 1493.7
- sulphates             1   1501.4 1517.4
- total.sulfur.dioxide  1   1502.7 1518.7
- volatile.acidity      1   1518.5 1534.5
- alcohol               1   1649.5 1665.5

Code

```{r}
summary(fit1)
```


Call:
glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    sulphates + alcohol, family = binomial, data = d_train, na.action = na.pass)

Coefficients:
                      Estimate Std. Error z value             Pr(>|z|)    
(Intercept)          -9.681250   1.017330  -9.516 < 0.0000000000000002 ***
fixed.acidity         0.133308   0.054311   2.455              0.01411 *  
volatile.acidity     -3.361463   0.502698  -6.687      0.0000000000228 ***
citric.acid          -1.206145   0.594499  -2.029              0.04247 *  
chlorides            -3.861215   1.628817  -2.371              0.01776 *  
free.sulfur.dioxide   0.026007   0.008706   2.987              0.00281 ** 
total.sulfur.dioxide -0.016994   0.003053  -5.567      0.0000000258943 ***
sulphates             2.487972   0.453970   5.480      0.0000000424192 ***
alcohol               0.957603   0.079339  12.070 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1986.5  on 1437  degrees of freedom
Residual deviance: 1468.7  on 1429  degrees of freedom
AIC: 1486.7

Number of Fisher Scoring iterations: 4

模型预测：

Code

```{r}
plot_confusion = function(cm) {
  as.table(cm) %>% 
    as_tibble() %>% 
    mutate(response = factor(response),
           truth = factor(truth, rev(levels(response)))) %>% 
    ggplot(aes(response, truth, fill = n)) +
    geom_tile() +
    geom_text(aes(label = n)) +
    scale_fill_gradientn(colors = rev(hcl.colors(10, "Blues")), breaks = seq(0,10,2)) +
    coord_fixed() +
    theme_minimal()
}

# 测试数据真实值
truth <- d_test$quality
# 测试数据预测值
pred <- predict.glm(fit1,newdata = d_test,type="response")
pred[pred>0.5]=1
pred[pred<0.5]=0
# 计算混淆矩阵
confusionMatrix(table(pred,truth))
```

Confusion Matrix and Statistics

    truth
pred  0  1
   0 57 26
   1 18 60
                                         
               Accuracy : 0.7267         
                 95% CI : (0.651, 0.7939)
    No Information Rate : 0.5342         
    P-Value [Acc > NIR] : 0.0000004289   
                                         
                  Kappa : 0.4546         
                                         
 Mcnemar's Test P-Value : 0.2913         
                                         
            Sensitivity : 0.7600         
            Specificity : 0.6977         
         Pos Pred Value : 0.6867         
         Neg Pred Value : 0.7692         
             Prevalence : 0.4658         
         Detection Rate : 0.3540         
   Detection Prevalence : 0.5155         
      Balanced Accuracy : 0.7288         
                                         
       'Positive' Class : 0

KNN

Code

```{r}
library(caret)
# 设置10折交叉训练
control <- trainControl(method = 'cv',number = 10)
# knn模型训练
model <- train(quality~.,d_train,
               method = 'knn',
               preProcess = c('center','scale'),
               trControl = control,
               tuneLength = 5)

model
```

k-Nearest Neighbors 

1438 samples
  11 predictor
   2 classes: '0', '1' 

Pre-processing: centered (11), scaled (11) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1294, 1294, 1294, 1294, 1294, 1296, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.7253815  0.4441503
   7  0.7239534  0.4427284
   9  0.7217527  0.4388053
  11  0.7315043  0.4586259
  13  0.7329225  0.4614702

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 13.

查看训练好的模型，可知模型的准确率都超过了70%，且最优模型的k值13。

Code

```{r}
# 测试数据真实值
truth <- d_test$quality
# 测试数据预测值
pred <- predict(model,newdata = d_test)
# 计算混淆矩阵
confusionMatrix(table(pred,truth))



species = c("坏酒", "好酒")
cm = matrix(c(54, 21,
              12, 74), nrow = 2,
            dimnames = list(response = species, truth = species))

plot_confusion(cm)
```

Confusion Matrix and Statistics

    truth
pred  0  1
   0 53 20
   1 22 66
                                          
               Accuracy : 0.7391          
                 95% CI : (0.6642, 0.8051)
    No Information Rate : 0.5342          
    P-Value [Acc > NIR] : 0.00000007284   
                                          
                  Kappa : 0.4749          
                                          
 Mcnemar's Test P-Value : 0.8774          
                                          
            Sensitivity : 0.7067          
            Specificity : 0.7674          
         Pos Pred Value : 0.7260          
         Neg Pred Value : 0.7500          
             Prevalence : 0.4658          
         Detection Rate : 0.3292          
   Detection Prevalence : 0.4534          
      Balanced Accuracy : 0.7371          
                                          
       'Positive' Class : 0

回归

Code

```{r}
d <- read.csv("C:/Users/10156/Desktop/non/data/winequality-red.csv", sep=";")
ggplot(d)+
  geom_point(mapping=aes(x=alcohol,y=density))+
  facet_wrap(~quality)
```

样条光滑

Y = _{j=1}^p f_j(x_j) +

\[\begin{align} \min_{f(\cdot)} \left\{ \sum_i [y_i - f(x_i)]^2 + \lambda \int [f''(x)]^2 \,dx \right\} \end{align}\]

对质量7的红酒酒精浓度与酒密度进行样条光滑回归

Code

```{r}
nsamp <- 53
alcohol <- (d %>% filter(quality==7))$alcohol
density <- (d %>% filter(quality==7))$density
alcohol=sort(alcohol)
plot(alcohol,density)
library(splines)
res <- smooth.spline(alcohol, density)
lines(spline(res$x, res$y), col="skyblue",lwd=2)
```

Generalized Additive Model (GAM)线性可加模型

将density作为因变量，自变量为fixed.acidity, citric.acid, residual.sugar与alcohol

Code

```{r}
library(mgcv)

fit2 <- gam(
  log(density) ~ s(fixed.acidity) + s(citric.acid) + s(residual.sugar)+s(alcohol), data=d)
summary(fit2)
```


Family: gaussian 
Link function: identity 

Formula:
log(density) ~ s(fixed.acidity) + s(citric.acid) + s(residual.sugar) + 
    s(alcohol)

Parametric coefficients:
               Estimate  Std. Error t value            Pr(>|t|)    
(Intercept) -0.00326042  0.00002145    -152 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                    edf Ref.df      F              p-value    
s(fixed.acidity)  7.411  8.295 141.62 < 0.0000000000000002 ***
s(citric.acid)    4.002  5.003   3.56              0.00328 ** 
s(residual.sugar) 8.618  8.950 110.70 < 0.0000000000000002 ***
s(alcohol)        5.868  7.013 227.99 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.795   Deviance explained = 79.8%
GCV = 7.4852e-07  Scale est. = 7.3593e-07  n = 1599

Code

```{r}
plot(fit2)
```

变量名 \|数据	型 \|含义	\|
fixed acidity	连续型变量 \|固定酸度	\|
volatile acidity	连续型变量 \|挥发性酸	\|
citric acid	连续型变量 \|柠檬酸	\|
residual sugar	连续型变量 \|残糖	\|
chlorides	连续型变量 \|氯化物	\|
free sulfur dioxide	连续型变量 \|游离二氧	硫 \|
total sulfur dioxide	连续型变量 \|总二氧化	\|
density	连续型变量 \|密度	\|
pH	连续型变量 \|pH值	\|
sulphates	连续型变量 \|硫酸盐	\|
alcohol	连续型变量 \|酒精浓度	\|
quality	有序分类变量 \|酒质量3-	\|

non-parametric

数据描述与背景导入

各变量介绍

数据预览

可视化

非参数检验

多样本位置检验：六种质量下的红酒酒精浓度中位数是否不同？

独立两样本比较的Wilcoxon秩和检验

相关性检验

spearman

Kendall

分布检验

红酒密度的分布是否符合正态

一元核密度估计

多元核密度估计

分类

logistic回归

KNN

回归

样条光滑

Generalized Additive Model (GAM)线性可加模型