R
非参数
Author

Kili

Published

2023-12-05

吾生也有涯,而知也无涯,以有涯随无涯,殆已。

数据描述与背景导入

各变量介绍

Code
```{r}
df %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "black", background = "#98F5FF" )
```
变量名 |数据 型 |含义 |
fixed acidity 连续型变量 |固定酸度 |
volatile acidity 连续型变量 |挥发性酸 |
citric acid 连续型变量 |柠檬酸 |
residual sugar 连续型变量 |残糖 |
chlorides 连续型变量 |氯化物 |
free sulfur dioxide 连续型变量 |游离二氧 硫 |
total sulfur dioxide 连续型变量 |总二氧化 |
density 连续型变量 |密度 |
pH 连续型变量 |pH值 |
sulphates 连续型变量 |硫酸盐 |
alcohol 连续型变量 |酒精浓度 |
quality 有序分类变量 |酒质量3- |
CTG数据

数据预览

Code
```{r}
d %>%  
  mlmCorrs::corstars()
```
Correlation Table
Variable Mean SD 1 2 3 4 5 6 7 8 9 10 11 12
1. Fixed.acidity 8.32 1.74 --
2. Volatile.acidity 0.53 0.18 -.26** --
3. Citric.acid 0.27 0.19 .67** -.55** --
4. Residual.sugar 2.54 1.41 .11** .00 .14** --
5. Chlorides 0.09 0.05 .09** .06* .20** .06* --
6. Free.sulfur.dioxide 15.87 10.46 -.15** -.01 -.06* .19** .01 --
7. Total.sulfur.dioxide 46.47 32.90 -.11** .08** .04 .20** .05 .67** --
8. Density 1.00 0.00 .67** .02 .36** .36** .20** -.02 .07** --
9. PH 3.31 0.15 -.68** .23** -.54** -.09** -.27** .07** -.07** -.34** --
10. Sulphates 0.66 0.17 .18** -.26** .31** .01 .37** .05* .04 .15** -.20** --
11. Alcohol 10.42 1.07 -.06* -.20** .11** .04 -.22** -.07** -.21** -.50** .21** .09** --
12. Quality 5.64 0.81 .12** -.39** .23** .01 -.13** -.05* -.19** -.17** -.06* .25** .48** --
Note. *p < .05. **p < .01.
Code
```{r}
corrplot(corr=cor(d))
d$quality=factor(d$quality,ordered=TRUE)
d %>% select(1:6)%>% head(5) %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "white", background = "#696969" )
d %>% select(7:12)%>% head(5) %>% 
kable( booktabs = TRUE,caption = "CTG数据") %>% 
  kable_styling(bootstrap_options = "bordered") %>% 
  row_spec(0, color = "white", background = "#696969" )
skimr::skim(d)
```
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
7.4 0.70 0.00 1.9 0.076 11
7.8 0.88 0.00 2.6 0.098 25
7.8 0.76 0.04 2.3 0.092 15
11.2 0.28 0.56 1.9 0.075 17
7.4 0.70 0.00 1.9 0.076 11
CTG数据
CTG数据
total.sulfur.dioxide density pH sulphates alcohol quality
34 0.9978 3.51 0.56 9.4 5
67 0.9968 3.20 0.68 9.8 5
54 0.9970 3.26 0.65 9.8 5
60 0.9980 3.16 0.58 9.8 6
34 0.9978 3.51 0.56 9.4 5
Data summary
Name d
Number of rows 1599
Number of columns 12
_______________________
Column type frequency:
factor 1
numeric 11
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
quality 0 1 TRUE 6 5: 681, 6: 638, 7: 199, 4: 53

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fixed.acidity 0 1 8.32 1.74 4.60 7.10 7.90 9.20 15.90 ▂▇▂▁▁
volatile.acidity 0 1 0.53 0.18 0.12 0.39 0.52 0.64 1.58 ▅▇▂▁▁
citric.acid 0 1 0.27 0.19 0.00 0.09 0.26 0.42 1.00 ▇▆▅▁▁
residual.sugar 0 1 2.54 1.41 0.90 1.90 2.20 2.60 15.50 ▇▁▁▁▁
chlorides 0 1 0.09 0.05 0.01 0.07 0.08 0.09 0.61 ▇▁▁▁▁
free.sulfur.dioxide 0 1 15.87 10.46 1.00 7.00 14.00 21.00 72.00 ▇▅▁▁▁
total.sulfur.dioxide 0 1 46.47 32.90 6.00 22.00 38.00 62.00 289.00 ▇▂▁▁▁
density 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▃▇▂▁
pH 0 1 3.31 0.15 2.74 3.21 3.31 3.40 4.01 ▁▅▇▂▁
sulphates 0 1 0.66 0.17 0.33 0.55 0.62 0.73 2.00 ▇▅▁▁▁
alcohol 0 1 10.42 1.07 8.40 9.50 10.20 11.10 14.90 ▇▇▃▁▁

可视化

Code
```{r}
ggplot(d,aes(quality,alcohol))+
  geom_boxplot(mapping=aes(fill=quality))+
  geom_jitter(aes(fill=quality),width =0.2,shape = 21,size=0.5)+
  theme_bw() + 
  theme(panel.grid.major =element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))+
  scale_fill_npg()+
  labs(title = "六种红酒质量下alcohol的分布",
    tag = "箱型图",
    caption = "数据来源:winequality")
```

非参数检验

多样本位置检验:六种质量下的红酒酒精浓度中位数是否不同?

\[ H_0:same\ vs\ H_1:not \]

Code
```{r}
kruskal.test(alcohol~quality,d)
```

    Kruskal-Wallis rank sum test

data:  alcohol by quality
Kruskal-Wallis chi-squared = 412.38, df = 5, p-value <
0.00000000000000022

\[ H_0:same\ vs\ H_1:increasing \]

Code
```{r}
JonckheereTerpstraTest( alcohol~quality,data=d,alternative = c( "increasing"))
```

    Jonckheere-Terpstra test

data:  alcohol by quality
JT = 602267, p-value < 0.00000000000000022
alternative hypothesis: increasing

独立两样本比较的Wilcoxon秩和检验

将红酒分成两组, 比较质量为3,4,5与6,7,8的红酒样本的酒精浓度中位数是否有显著性 差异 \[ \\H_0:\Delta=0 \]

Code
```{r}
d1 <- d %>% filter(quality==5|quality==4|quality==3)
d2 <- d %>% filter(quality==6|quality==7|quality==8)
wilcox.test(d1$alcohol,d2$alcohol)
```

    Wilcoxon rank sum test with continuity correction

data:  d1$alcohol and d2$alcohol
W = 154807, p-value < 0.00000000000000022
alternative hypothesis: true location shift is not equal to 0

检验结果可见两组中位数确实有显著性差异

相关性检验

\[ H_0:unrelated\ vs\ H_1:related \]

spearman

Code
```{r}
cor.test(d$alcohol,as.numeric(d$quality),meth="spearman")
```

    Spearman's rank correlation rho

data:  d$alcohol and as.numeric(d$quality)
S = 355321833, p-value < 0.00000000000000022
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4785317 

Kendall

将酒精浓度转化为有序类别变量

\[ Kendall's \ \tau_b\\\ \]

Code
```{r}
cor.test(d$alcohol,as.numeric(d$quality),meth="kendall")
```

    Kendall's rank correlation tau

data:  d$alcohol and as.numeric(d$quality)
z = 19.416, p-value < 0.00000000000000022
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.3803673 

分布检验

Code
```{r}
ggplot(d)+
  geom_histogram(mapping=aes(x=density))+
  labs(title = "density的分布",
    tag = "直方图",
    caption = "数据来源:winequality")
```

红酒密度的分布是否符合正态

\[ H_0:norm\ vs\ H_1:unnorm \]

Code
```{r}
x <- d$density
mu=mean(x);sigma=sum((x-mean(x))^2)/length(x)
ks.test(x,"pnorm",mu,sigma)
```

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  x
D = 0.49781, p-value < 0.00000000000000022
alternative hypothesis: two-sided

接下来比较一下先前分出的两组红酒的酒精浓度分布是否不同

\[ H_0:same\ pdf\ vs\ H_1:unsame \]

Code
```{r}
ks.test(d1$alcohol,d2$alcohol)
```

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  d1$alcohol and d2$alcohol
D = 0.4103, p-value < 0.00000000000000022
alternative hypothesis: two-sided

一元核密度估计

这里使用插入法计算宽窗

Code
```{r}
d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="gaussian",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=gaussian,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))

d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="epanechnikov",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=epanechnikov,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))

d %>% ggplot(aes(x = alcohol)) +geom_density(kernel ="rectangular",size=1.1,bw=bw.SJ(d$alcohol))+labs(
  title = "alcohol核密度估计",
  subtitle = "kernel=rectangular,bw = 0.098"
)+geom_histogram(mapping = aes(y = ..density..,alpha = 0.6))
```

多元核密度估计

对酒精浓度与密度做二元核密度估计

Code
```{r}
par(mfrow=c(1,2))
fhat <- kde(d %>% select(alcohol,density))
plot(fhat,display="filled.contour2")
points(d %>% select(alcohol,density),cex=0.1,pch=5)
plot(fhat,diplay="persp")
```

分类

先分好训练集与测试集,且要保证好酒与坏酒的比例与原数据集相近

Code
```{r}
d <- read.csv("C:/Users/10156/Desktop/non/data/winequality-red.csv", sep=";")
d$quality[d$quality==3]=0
d$quality[d$quality==4]=0
d$quality[d$quality==5]=0
d$quality[d$quality==6]=1
d$quality[d$quality==7]=1
d$quality[d$quality==8]=1
d$quality=factor(d$quality)
library(rsample)
d_split <- initial_split(d, prop = 0.9,strata = quality)
d_train <- training(d_split)
d_test <- testing(d_split)
```

logistic回归

\[ cheap\ wine1(3,4,5)and\ good\ wine2(6,7,8)res\\ \begin{aligned} Y \sim & B(m, p) \\ g(p) =& \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \\ AIC\ backward\ to\ choose\ model \end{aligned} \]

Code
```{r}
fit=glm(quality~fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol,family = binomial,data=d_train,na.action = na.pass)
fit1=step(fit,direction = "backward")
```
Start:  AIC=1489.85
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol

                       Df Deviance    AIC
- pH                    1   1465.9 1487.9
- density               1   1466.7 1488.7
<none>                      1465.8 1489.8
- residual.sugar        1   1468.0 1490.0
- fixed.acidity         1   1468.7 1490.7
- citric.acid           1   1470.3 1492.3
- chlorides             1   1472.2 1494.2
- free.sulfur.dioxide   1   1474.3 1496.3
- sulphates             1   1500.0 1522.0
- total.sulfur.dioxide  1   1500.5 1522.5
- volatile.acidity      1   1511.7 1533.7
- alcohol               1   1534.5 1556.5

Step:  AIC=1487.86
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + sulphates + alcohol

                       Df Deviance    AIC
- density               1   1467.3 1487.3
<none>                      1465.9 1487.9
- residual.sugar        1   1468.5 1488.5
- citric.acid           1   1470.4 1490.4
- chlorides             1   1472.5 1492.5
- fixed.acidity         1   1472.8 1492.8
- free.sulfur.dioxide   1   1474.4 1494.4
- sulphates             1   1500.9 1520.9
- total.sulfur.dioxide  1   1501.6 1521.6
- volatile.acidity      1   1512.0 1532.0
- alcohol               1   1559.8 1579.8

Step:  AIC=1487.34
quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    sulphates + alcohol

                       Df Deviance    AIC
- residual.sugar        1   1468.7 1486.7
<none>                      1467.3 1487.3
- citric.acid           1   1471.8 1489.8
- fixed.acidity         1   1473.2 1491.2
- chlorides             1   1473.6 1491.6
- free.sulfur.dioxide   1   1475.8 1493.8
- sulphates             1   1500.9 1518.9
- total.sulfur.dioxide  1   1502.6 1520.6
- volatile.acidity      1   1517.8 1535.8
- alcohol               1   1643.7 1661.7

Step:  AIC=1486.69
quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
    free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
    alcohol

                       Df Deviance    AIC
<none>                      1468.7 1486.7
- citric.acid           1   1472.9 1488.9
- chlorides             1   1474.7 1490.7
- fixed.acidity         1   1474.8 1490.8
- free.sulfur.dioxide   1   1477.7 1493.7
- sulphates             1   1501.4 1517.4
- total.sulfur.dioxide  1   1502.7 1518.7
- volatile.acidity      1   1518.5 1534.5
- alcohol               1   1649.5 1665.5
Code
```{r}
summary(fit1)
```

Call:
glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    sulphates + alcohol, family = binomial, data = d_train, na.action = na.pass)

Coefficients:
                      Estimate Std. Error z value             Pr(>|z|)    
(Intercept)          -9.681250   1.017330  -9.516 < 0.0000000000000002 ***
fixed.acidity         0.133308   0.054311   2.455              0.01411 *  
volatile.acidity     -3.361463   0.502698  -6.687      0.0000000000228 ***
citric.acid          -1.206145   0.594499  -2.029              0.04247 *  
chlorides            -3.861215   1.628817  -2.371              0.01776 *  
free.sulfur.dioxide   0.026007   0.008706   2.987              0.00281 ** 
total.sulfur.dioxide -0.016994   0.003053  -5.567      0.0000000258943 ***
sulphates             2.487972   0.453970   5.480      0.0000000424192 ***
alcohol               0.957603   0.079339  12.070 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1986.5  on 1437  degrees of freedom
Residual deviance: 1468.7  on 1429  degrees of freedom
AIC: 1486.7

Number of Fisher Scoring iterations: 4

模型预测:

Code
```{r}
plot_confusion = function(cm) {
  as.table(cm) %>% 
    as_tibble() %>% 
    mutate(response = factor(response),
           truth = factor(truth, rev(levels(response)))) %>% 
    ggplot(aes(response, truth, fill = n)) +
    geom_tile() +
    geom_text(aes(label = n)) +
    scale_fill_gradientn(colors = rev(hcl.colors(10, "Blues")), breaks = seq(0,10,2)) +
    coord_fixed() +
    theme_minimal()
}

# 测试数据真实值
truth <- d_test$quality
# 测试数据预测值
pred <- predict.glm(fit1,newdata = d_test,type="response")
pred[pred>0.5]=1
pred[pred<0.5]=0
# 计算混淆矩阵
confusionMatrix(table(pred,truth))
```
Confusion Matrix and Statistics

    truth
pred  0  1
   0 57 26
   1 18 60
                                         
               Accuracy : 0.7267         
                 95% CI : (0.651, 0.7939)
    No Information Rate : 0.5342         
    P-Value [Acc > NIR] : 0.0000004289   
                                         
                  Kappa : 0.4546         
                                         
 Mcnemar's Test P-Value : 0.2913         
                                         
            Sensitivity : 0.7600         
            Specificity : 0.6977         
         Pos Pred Value : 0.6867         
         Neg Pred Value : 0.7692         
             Prevalence : 0.4658         
         Detection Rate : 0.3540         
   Detection Prevalence : 0.5155         
      Balanced Accuracy : 0.7288         
                                         
       'Positive' Class : 0              
                                         

KNN

Code
```{r}
library(caret)
# 设置10折交叉训练
control <- trainControl(method = 'cv',number = 10)
# knn模型训练
model <- train(quality~.,d_train,
               method = 'knn',
               preProcess = c('center','scale'),
               trControl = control,
               tuneLength = 5)

model
```
k-Nearest Neighbors 

1438 samples
  11 predictor
   2 classes: '0', '1' 

Pre-processing: centered (11), scaled (11) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1294, 1294, 1294, 1294, 1294, 1296, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.7253815  0.4441503
   7  0.7239534  0.4427284
   9  0.7217527  0.4388053
  11  0.7315043  0.4586259
  13  0.7329225  0.4614702

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 13.

查看训练好的模型,可知模型的准确率都超过了70%,且最优模型的k值13。

Code
```{r}
# 测试数据真实值
truth <- d_test$quality
# 测试数据预测值
pred <- predict(model,newdata = d_test)
# 计算混淆矩阵
confusionMatrix(table(pred,truth))



species = c("坏酒", "好酒")
cm = matrix(c(54, 21,
              12, 74), nrow = 2,
            dimnames = list(response = species, truth = species))

plot_confusion(cm)
```
Confusion Matrix and Statistics

    truth
pred  0  1
   0 53 20
   1 22 66
                                          
               Accuracy : 0.7391          
                 95% CI : (0.6642, 0.8051)
    No Information Rate : 0.5342          
    P-Value [Acc > NIR] : 0.00000007284   
                                          
                  Kappa : 0.4749          
                                          
 Mcnemar's Test P-Value : 0.8774          
                                          
            Sensitivity : 0.7067          
            Specificity : 0.7674          
         Pos Pred Value : 0.7260          
         Neg Pred Value : 0.7500          
             Prevalence : 0.4658          
         Detection Rate : 0.3292          
   Detection Prevalence : 0.4534          
      Balanced Accuracy : 0.7371          
                                          
       'Positive' Class : 0               
                                          

回归

Code
```{r}
d <- read.csv("C:/Users/10156/Desktop/non/data/winequality-red.csv", sep=";")
ggplot(d)+
  geom_point(mapping=aes(x=alcohol,y=density))+
  facet_wrap(~quality)
```

样条光滑

$$

Y = _{j=1}^p f_j(x_j) +

$$

$$

\[\begin{align} \min_{f(\cdot)} \left\{ \sum_i [y_i - f(x_i)]^2 + \lambda \int [f''(x)]^2 \,dx \right\} \end{align}\]

$$

对质量7的红酒酒精浓度与酒密度进行样条光滑回归

Code
```{r}
nsamp <- 53
alcohol <- (d %>% filter(quality==7))$alcohol
density <- (d %>% filter(quality==7))$density
alcohol=sort(alcohol)
plot(alcohol,density)
library(splines)
res <- smooth.spline(alcohol, density)
lines(spline(res$x, res$y), col="skyblue",lwd=2)
```

Generalized Additive Model (GAM)线性可加模型

density作为因变量,自变量为fixed.acidity, citric.acid, residual.sugar与alcohol

Code
```{r}
library(mgcv)

fit2 <- gam(
  log(density) ~ s(fixed.acidity) + s(citric.acid) + s(residual.sugar)+s(alcohol), data=d)
summary(fit2)
```

Family: gaussian 
Link function: identity 

Formula:
log(density) ~ s(fixed.acidity) + s(citric.acid) + s(residual.sugar) + 
    s(alcohol)

Parametric coefficients:
               Estimate  Std. Error t value            Pr(>|t|)    
(Intercept) -0.00326042  0.00002145    -152 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                    edf Ref.df      F              p-value    
s(fixed.acidity)  7.411  8.295 141.62 < 0.0000000000000002 ***
s(citric.acid)    4.002  5.003   3.56              0.00328 ** 
s(residual.sugar) 8.618  8.950 110.70 < 0.0000000000000002 ***
s(alcohol)        5.868  7.013 227.99 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.795   Deviance explained = 79.8%
GCV = 7.4852e-07  Scale est. = 7.3593e-07  n = 1599
Code
```{r}
plot(fit2)
```