Welcome to R Square

重塑一个数据集

楚新元 / 2025-07-09


本文是《R 语言实战》(第 2 版)中数据整合与重构的一个例子,书中用的是老旧的 reshape2 对数据进行了融合和重铸,有点像《变脸》里唱的那样,文化领域百花齐放,百家争鸣是对的,可是一张表表达的意思是一个意思,不同的人做出来就可能千差万别,如果你不能做到源头控制,那你就得掌握如何通过各种变换变成你需要的那张脸。

加载相关 R 包

library(tidyr)
library(dplyr)
library(knitr)

原始数据集

mydata = data.frame(
  ID = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1 = c(5, 3, 6, 2),
  X2 = c(6, 5, 1, 4)
)
kable(mydata, align = "c")
IDTimeX1X2
1156
1235
2161
2224

原表转化为长表

mydata |> 
  pivot_longer(
    X1:X2,
    names_to = "variable",
    values_to = "value"
  ) -> table_long
kable(table_long, align = "c")
IDTimevariablevalue
11X15
11X26
12X13
12X25
21X16
21X21
22X12
22X24

执行整合

转化会导致信息损失,就像麦子变成面粉一样,这就意味着转换后的表无法逆回到之前的状态。

长表转化为表 a

table_long |> 
  summarise(
    value_mean = mean(value),
    .by = c(ID, variable)
  ) |> 
  pivot_wider(
    names_from = variable,
    values_from = value_mean
  ) -> table_a
kable(table_a, align = "c")
IDX1X2
145.5
242.5

长表转化为表 b

table_long |> 
  summarise(
    value_mean = mean(value),
    .by = c(Time, variable)
  ) |> 
  pivot_wider(
    names_from = variable,
    values_from = value_mean
  ) -> table_b
kable(table_b, align = "c")
TimeX1X2
15.53.5
22.54.5

长表转化为表 c

table_long |> 
  summarise(
    value_mean = mean(value),
    .by = c(ID, Time)
  ) |> 
  pivot_wider(
    names_from = Time,
    names_prefix = "Time",
    values_from = value_mean
  ) -> table_c
kable(table_c, align = "c")
IDTime1Time2
15.54
23.53

不执行整合

这就意味着信息必须是无损的,只是做了形式转换,转换后可以逆回到原来的状态。

长表转为表 d

table_long |> 
  pivot_wider(
    names_from = variable,
    values_from = value
  ) -> table_d
kable(table_d, align = "c")
IDTimeX1X2
1156
1235
2161
2224

表 d 逆回到长表

table_d |> 
  pivot_longer(
    X1:X2,
    names_to = "variable",
    values_to = "value"
  ) |> 
  kable(align = "c")
IDTimevariablevalue
11X15
11X26
12X13
12X25
21X16
21X21
22X12
22X24

长表转为表 e

table_long |> 
  pivot_wider(
    names_from = Time,
    names_prefix = "Time",
    values_from = value
  ) -> table_e
kable(table_e, align = "c")
IDvariableTime1Time2
1X153
1X265
2X162
2X214

表 e 逆回长表

table_e |> 
  pivot_longer(
    cols = starts_with("Time"), 
    names_to = "Time",
    names_transform = list(Time = readr::parse_number),
    values_to = "value"
  ) |> 
  kable(align = "c")
IDvariableTimevalue
1X115
1X123
1X216
1X225
2X116
2X122
2X211
2X224

长表转为表 f

table_long |> 
  pivot_wider(
    names_from = c(variable, Time), 
    names_glue = "{variable}_Time{Time}",
    values_from = value
  ) -> table_f
kable(table_f, align = "c")
IDX1_Time1X2_Time1X1_Time2X2_Time2
15635
26124

表 f 逆回长表

table_f |> 
  pivot_longer(
    -ID,
    names_to = c("variable", "Time"),
    names_sep = "_Time",
    names_transform = list(Time = as.numeric),
    values_to = "value"
  ) |> 
  kable(align = "c")
IDvariableTimevalue
1X115
1X216
1X123
1X225
2X116
2X211
2X122
2X224