我对这个问题有点数据争论头痛。我在 Excel 中有一个通过从网站复制粘贴生成的表,其中每一行代表一个样本。在这一行中有一个包含可变数量单元格的特定字段。我附上一个屏幕截图,以便您可以轻松理解我的意思:
现在,我想做的是将这些字段中的每一个都放在一个单独的列中。如果这是固定数量的行,我将简单地通过“特殊粘贴”进行转置,然后在固定数量的字段处中断结果行,以便以整洁的方式处理所有内容。但是,每行的字段数量会发生变化;并非所有移动的样本都具有相同数量的属性,这意味着转置整个列的属性是不够的。
然而,这是乏味的,耗时的,因为我有超过 500 个条目将永远。
我一点也不流利的 Excel 脚本,但我有一个半体面的知识在基地 R。我必须解决的主要问题是,如果我导入这个表作为 R 中的文本文件,该列中的每个单元格将被分配一行,使折叠相当复杂。
这是我到目前为止提出的:
#This is the data I need to wrangle
tmp <- read.delim("pdata.txt", header = T, stringsAsFactors = F)
> head(tmp)
Title Source.name Disease.state Sex Age
1 LT000842RU_CTRL Flash frozen whole lung Control 1-Male 75
2 NA
3 NA
4 NA
5 LT001600RL_ILD Flash frozen whole lung Interstitial lung disease 1-Male 54
6 NA
Gold.stage Characteristics Ild.subtype Pneumocystis.colonization
1 0-At Risk smoker?: 2-Ever (>100)
2 %predicted fev1 (pre-bd): 96
3 %predicted fvc (pre-bd): 97
4 %predicted dlco: 78
5 %emphysema (f-950): 1.903 2-UIP/IPF
6 smoker?: 2-Ever (>100)
如您所见,“标题”列中有空格,这是由“特征”列具有相同标题的更多条目引起的。
#Extremely ugly series of gsub to make for compatible colnames (no spaces, no dashes, etc)
d = sapply(tmp[,7], function(x) {gsub(x, pattern = " ", replacement = "_", fixed = T)})
dd = sapply(d, function(x) {gsub(x, pattern = "%", replacement = "", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = "_(f-950):_", replacement = " ", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = "?:", replacement = "", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = "smoker_", replacement = "smoker ", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = ":_", replacement = " ", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = "(", replacement = "", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = ")", replacement = "", fixed = T)})
dd = sapply(dd, function(x) {gsub(x, pattern = "-", replacement = "_", fixed = T)})
#Use the character vector that has been gsubbed as the attributes column in the df
tmp[,7] = dd
#Take the rows that are not empty, i.e. those that have the name of the sample and will be the starting rows for the attributes
nonempty = which(tmp[,1] != "")
jumps = nonempty[2:length(nonempty)]-nonempty[1:length(nonempty)-1]
jumps = c(jumps, 0)
#Make dummy columns with the same names as the gsubbed attributes
tmp$emphysema = tmp[,1]
tmp$smoker = tmp[,1]
tmp$predicted_fcv_pre_bd = tmp[,1]
tmp$predicted_fev1_pre_bd = tmp[,1]
tmp$predicted_fev1_post_bd = tmp[,1]
tmp$predicted_fcv_post_bd = tmp[,1]
tmp$predicted_dlco = tmp[,1]
#This is a loop to fill in the columns with the values extracted from the gsubbed attributes column
for(i in 1:length(nonempty))
{
a = as.data.frame(tmp[seq(nonempty[i], (nonempty[i]+jumps[i]-1),by = 1),7])
chars = colnames(tmp[,10:ncol(tmp)])
for (j in chars)
{
gg = as.character(a[grep(pattern = j, x = a[,1]),1])
if(length(gg) != 0) tmp[nonempty[i],j] = as.character(unlist(strsplit(gg, split = " "))[2]) else tmp[nonempty[i],j] = NA
}
}
# Make the new df by only taking the rows with samples
tmp2 = tmp[nonempty,]
#This is the resulting data frame:
> head(tmp2)
Title Source.name
LT000216LL_ILD LT000216LL_ILD Flash frozen whole lung
LT000379LU_ILD LT000379LU_ILD Flash frozen whole lung
LT000842RU_CTRL LT000842RU_CTRL Flash frozen whole lung
LT001600RL_ILD LT001600RL_ILD Flash frozen whole lung
LT001796RU_CTRL LT001796RU_CTRL Flash frozen whole lung
LT002410RM_ILD LT002410RM_ILD Flash frozen whole lung
Disease.state Sex Age Gold.stage
LT000216LL_ILD Interstitial lung disease 2-Female 70
LT000379LU_ILD Interstitial lung disease 1-Male 63
LT000842RU_CTRL Control 1-Male 75 0-At Risk
LT001600RL_ILD Interstitial lung disease 1-Male 54
LT001796RU_CTRL Control 1-Male 48 0-At Risk
LT002410RM_ILD Interstitial lung disease 1-Male 52
Characteristics Ild.subtype
LT000216LL_ILD emphysema 0.051 2-UIP/IPF
LT000379LU_ILD smoker 2_Ever_>100 9-Hypersensitive Pneumonitis (HP)
LT000842RU_CTRL smoker 2_Ever_>100
LT001600RL_ILD emphysema 1.903 2-UIP/IPF
LT001796RU_CTRL smoker 2_Ever_>100
LT002410RM_ILD emphysema 0.03 2-UIP/IPF
Pneumocystis.colonization emphysema smoker
LT000216LL_ILD 0.051 3_Never
LT000379LU_ILD <NA> 2_Ever_>100
LT000842RU_CTRL <NA> 2_Ever_>100
LT001600RL_ILD 1.903 2_Ever_>100
LT001796RU_CTRL <NA> 2_Ever_>100
LT002410RM_ILD 0.03 2_Ever_>100
predicted_fcv_pre_bd predicted_fev1_pre_bd
LT000216LL_ILD <NA> 56
LT000379LU_ILD <NA> 67
LT000842RU_CTRL <NA> 96
LT001600RL_ILD <NA> 40
LT001796RU_CTRL <NA> 107
LT002410RM_ILD <NA> 56
predicted_fev1_post_bd predicted_fcv_post_bd predicted_dlco
LT000216LL_ILD <NA> <NA> 36
LT000379LU_ILD <NA> <NA> 42
LT000842RU_CTRL <NA> <NA> 78
LT001600RL_ILD <NA> <NA> 16
LT001796RU_CTRL 110 <NA> 107
LT002410RM_ILD 60 <NA> 63
现在,虽然我对这个结果相当满意,但我花了一些时间做饭,而且它相当不灵活(非常适合这个特定的数据集),所以我想知道以下内容:
对于这种情况,Excel 中是否有快速修复?
或者,或者实际上更好:是否有一个基本 / tidyverse 的方式来处理这个 R 的方式,而不是像我想出的那样愚蠢?
提前感谢!

假设原始数据中的大单元格是合并单元格,这在openxlsx和tidyr的帮助下非常简单。无论如何,我会第二个 @ RonRosenfeld 的建议(在评论中)设置到远程源的数据连接,而不是复制粘贴。
我创建了一个简化的示例文件 messy.xlsx(请参阅文章末尾的代码)来演示数据整理过程的一个选项:
您可以首先读取 Excel 文件read.xlsx()
,保留合并单元格所隐含的结构,然后separate()
和spread()
将不同的特征放入自己的列中:
library(openxlsx)
library(tidyr)
# Repeat merged cell value accross all cells
messy <- read.xlsx("messy.xlsx", fillMergedCells = TRUE)
# Create a column for each characteristic
messy %>%
separate(
characteristics,
into = c("variable", "value"),
sep = ": "
) %>%
spread(variable, value)
#> subject a b c
#> 1 101 1 2 3
#> 2 102 2 8 <NA>
示例数据:
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Sheet 1")
df <- data.frame(
subject = c(101, NA, NA, 102, NA),
characteristics = c("a: 1", "b: 2", "c: 3", "a: 2", "b: 8")
)
writeData(wb, 1, df)
mergeCells(wb, 1, 1, 2:4)
mergeCells(wb, 1, 1, 5:6)
saveWorkbook(wb, "messy.xlsx")
由reprex package(v0.2.0.9000) 于 2018-07-17 创建。
本站系公益性非盈利分享网址,本文来自用户投稿,不代表码文网立场,如若转载,请注明出处
评论列表(79条)