本文是《R 语言实战》(第 2 版)中数据整合与重构的一个例子,书中用的是老旧的 reshape2
对数据进行了融合和重铸,有点像《变脸》里唱的那样,文化领域百花齐放,百家争鸣是对的,可是一张表表达的意思是一个意思,不同的人做出来就可能千差万别,如果你不能做到源头控制,那你就得掌握如何通过各种变换变成你需要的那张脸。

加载相关 R 包
library(tidyr)
library(dplyr)
library(knitr)
原始数据集
mydata = data.frame(
ID = c(1, 1, 2, 2),
Time = c(1, 2, 1, 2),
X1 = c(5, 3, 6, 2),
X2 = c(6, 5, 1, 4)
)
kable(mydata, align = "c")
ID | Time | X1 | X2 |
---|
1 | 1 | 5 | 6 |
1 | 2 | 3 | 5 |
2 | 1 | 6 | 1 |
2 | 2 | 2 | 4 |
原表转化为长表
mydata |>
pivot_longer(
X1:X2,
names_to = "variable",
values_to = "value"
) -> table_long
kable(table_long, align = "c")
ID | Time | variable | value |
---|
1 | 1 | X1 | 5 |
1 | 1 | X2 | 6 |
1 | 2 | X1 | 3 |
1 | 2 | X2 | 5 |
2 | 1 | X1 | 6 |
2 | 1 | X2 | 1 |
2 | 2 | X1 | 2 |
2 | 2 | X2 | 4 |
执行整合
转化会导致信息损失,就像麦子变成面粉一样,这就意味着转换后的表无法逆回到之前的状态。
长表转化为表 a
table_long |>
summarise(
value_mean = mean(value),
.by = c(ID, variable)
) |>
pivot_wider(
names_from = variable,
values_from = value_mean
) -> table_a
kable(table_a, align = "c")
长表转化为表 b
table_long |>
summarise(
value_mean = mean(value),
.by = c(Time, variable)
) |>
pivot_wider(
names_from = variable,
values_from = value_mean
) -> table_b
kable(table_b, align = "c")
长表转化为表 c
table_long |>
summarise(
value_mean = mean(value),
.by = c(ID, Time)
) |>
pivot_wider(
names_from = Time,
names_prefix = "Time",
values_from = value_mean
) -> table_c
kable(table_c, align = "c")
不执行整合
这就意味着信息必须是无损的,只是做了形式转换,转换后可以逆回到原来的状态。
长表转为表 d
table_long |>
pivot_wider(
names_from = variable,
values_from = value
) -> table_d
kable(table_d, align = "c")
ID | Time | X1 | X2 |
---|
1 | 1 | 5 | 6 |
1 | 2 | 3 | 5 |
2 | 1 | 6 | 1 |
2 | 2 | 2 | 4 |
表 d 逆回到长表
table_d |>
pivot_longer(
X1:X2,
names_to = "variable",
values_to = "value"
) |>
kable(align = "c")
ID | Time | variable | value |
---|
1 | 1 | X1 | 5 |
1 | 1 | X2 | 6 |
1 | 2 | X1 | 3 |
1 | 2 | X2 | 5 |
2 | 1 | X1 | 6 |
2 | 1 | X2 | 1 |
2 | 2 | X1 | 2 |
2 | 2 | X2 | 4 |
长表转为表 e
table_long |>
pivot_wider(
names_from = Time,
names_prefix = "Time",
values_from = value
) -> table_e
kable(table_e, align = "c")
ID | variable | Time1 | Time2 |
---|
1 | X1 | 5 | 3 |
1 | X2 | 6 | 5 |
2 | X1 | 6 | 2 |
2 | X2 | 1 | 4 |
表 e 逆回长表
table_e |>
pivot_longer(
cols = starts_with("Time"),
names_to = "Time",
names_transform = list(Time = readr::parse_number),
values_to = "value"
) |>
kable(align = "c")
ID | variable | Time | value |
---|
1 | X1 | 1 | 5 |
1 | X1 | 2 | 3 |
1 | X2 | 1 | 6 |
1 | X2 | 2 | 5 |
2 | X1 | 1 | 6 |
2 | X1 | 2 | 2 |
2 | X2 | 1 | 1 |
2 | X2 | 2 | 4 |
长表转为表 f
table_long |>
pivot_wider(
names_from = c(variable, Time),
names_glue = "{variable}_Time{Time}",
values_from = value
) -> table_f
kable(table_f, align = "c")
ID | X1_Time1 | X2_Time1 | X1_Time2 | X2_Time2 |
---|
1 | 5 | 6 | 3 | 5 |
2 | 6 | 1 | 2 | 4 |
表 f 逆回长表
table_f |>
pivot_longer(
-ID,
names_to = c("variable", "Time"),
names_sep = "_Time",
names_transform = list(Time = as.numeric),
values_to = "value"
) |>
kable(align = "c")
ID | variable | Time | value |
---|
1 | X1 | 1 | 5 |
1 | X2 | 1 | 6 |
1 | X1 | 2 | 3 |
1 | X2 | 2 | 5 |
2 | X1 | 1 | 6 |
2 | X2 | 1 | 1 |
2 | X1 | 2 | 2 |
2 | X2 | 2 | 4 |