Extract text from first, second and third brackets in R
I have the following data frame in R:

    text <- c("[AAA]xxxx", "[AAA] yyyrrr", "[AAA][bbb] bla", "[AAA][bbb] cccvvv",
              "[AAA][bbb] bla", "[AAA][bbb][CcC] bla", "[AAA][bbb][CcC] xbbpr")
    value <- rnorm(7)
    df <- data.frame(text, value)

I would like to create three new variables in my data frame, holding the text contained in the first, second and third pairs of brackets respectively.
The desired output looks like this:

                       text       value Bracket1 Bracket2 Bracket3
    1             [AAA]xxxx -0.01819034      AAA       NA       NA
    2          [AAA] yyyrrr -0.24808460      AAA       NA       NA
    3        [AAA][bbb] bla -0.36293689      AAA      bbb       NA
    4     [AAA][bbb] cccvvv  1.27757055      AAA      bbb       NA
    5        [AAA][bbb] bla -0.46889715      AAA      bbb       NA
    6   [AAA][bbb][CcC] bla  0.07105410      AAA      bbb      CcC
    7 [AAA][bbb][CcC] xbbpr -0.26603845      AAA      bbb      CcC

I can't even extract the text from the first pair of brackets, let alone the second or third.
For example, I tried:

    df$Bracket1 <- gsub('.*\\[(.*)\\].*', '\\1', text)

and

    df$Bracket1 <- sub('.*\\[(.*)\\].*', '\\1', text)

but both of these produce:

                       text       value Bracket1
    1             [AAA]xxxx -0.01819034      AAA
    2          [AAA] yyyrrr -0.24808460      AAA
    3        [AAA][bbb] bla -0.36293689      bbb
    4     [AAA][bbb] cccvvv  1.27757055      bbb
    5        [AAA][bbb] bla -0.46889715      bbb
    6   [AAA][bbb][CcC] bla  0.07105410      CcC
    7 [AAA][bbb][CcC] xbbpr -0.26603845      CcC

I am new to regex and fairly new to R; thanks in advance for any suggestions.
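
A note on why both attempts return the last bracket pair: the leading `.*` is greedy, so it consumes as much of the string as it can and only backtracks to the last `[`. The sketch below reproduces that behaviour on a shortened copy of the sample data and contrasts it with a pattern that cannot skip past the first `[`. The variable names `greedy` and `first` are illustrative and not part of the original post.

```r
text <- c("[AAA]xxxx", "[AAA][bbb] bla", "[AAA][bbb][CcC] xbbpr")

# Greedy leading '.*' backtracks only as far as the LAST '[',
# so the capture group ends up holding the last bracket pair.
greedy <- sub('.*\\[(.*)\\].*', '\\1', text)

# '[^[]*' cannot cross a '[' and '[^]]*' cannot cross a ']',
# so the capture group is pinned to the FIRST bracket pair.
first <- sub('^[^[]*\\[([^]]*)\\].*', '\\1', text)

print(greedy)  # "AAA" "bbb" "CcC"
print(first)   # "AAA" "AAA" "AAA"
```

This only explains the symptom; it does not by itself produce the second and third columns.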
Using data.table (v1.9.6+), together with regmatches and gregexpr:

    require(data.table) # v1.9.6+
    dt = data.table(text, value)
    # text is character
    vals = regmatches(dt$text, gregexpr("(?<=\\[)[[:alpha:]]+(?=])", dt$text, perl=TRUE))
    dt[, paste0("Bracket", 1:3) := transpose(vals)]
    #                     text      value Bracket1 Bracket2 Bracket3
    # 1:             [AAA]xxxx -0.9285790      AAA       NA       NA
    # 2:          [AAA] yyyrrr  0.7928830      AAA       NA       NA
    # 3:        [AAA][bbb] bla  0.1177066      AAA      bbb       NA
    # 4:     [AAA][bbb] cccvvv  1.1818542      AAA      bbb       NA
    # 5:        [AAA][bbb] bla -0.4476371      AAA      bbb       NA
    # 6:   [AAA][bbb][CcC] bla  2.2992593      AAA      bbb      CcC
    # 7: [AAA][bbb][CcC] xbbpr  2.1161453      AAA      bbb      CcC

Here is an approach using regmatches and gregexpr:

    mtchs <- regmatches(df$text, gregexpr("\\[\\w+\\]", df$text))

Then just reorganize the output into the desired structure:

    library(plyr) # for rbind.fill
    df[,3:5] <- do.call(rbind.fill, lapply(mtchs, function(xx) {
      x <- data.frame(matrix(xx, nrow=1))          # one row per input string
      names(x) <- paste0("Bracket", seq_along(xx))
      x}))

    # or using dplyr's bind_rows:
    library(dplyr)
    df[,3:5] <- bind_rows(lapply(mtchs, function(xx) {
      x <- data.frame(matrix(xx, nrow=1))
      names(x) <- paste0("Bracket", seq_along(xx))
      x}))

    # or using data.table's rbindlist:
    library(data.table)
    df[,3:5] <- rbindlist(lapply(mtchs, function(xx) {
      x <- data.frame(matrix(xx, nrow=1))
      names(x) <- paste0("Bracket", seq_along(xx))
      x}), fill=TRUE)

If needed, you can change the regex to use lookarounds, so that the brackets themselves are not part of the matches:

    mtchs <- regmatches(df$text, gregexpr("(?<=\\[)\\w+(?=\\])", df$text, perl=TRUE))

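
For readers who want to avoid extra packages, the ragged list returned by `regmatches` can also be padded in base R. This is a minimal sketch, not part of the original answer: it relies on the fact that indexing a character vector past its end yields `NA`. The shortened `text`, the fixed `value` numbers and the name `padded` are assumptions made for illustration.

```r
text <- c("[AAA]xxxx", "[AAA][bbb] bla", "[AAA][bbb][CcC] xbbpr")
df <- data.frame(text, value = c(0.1, 0.2, 0.3), stringsAsFactors = FALSE)

# Lookarounds keep the brackets out of the matches
mtchs <- regmatches(df$text, gregexpr("(?<=\\[)\\w+(?=\\])", df$text, perl = TRUE))

# m[1:3] pads each element to length 3 with NA; vapply returns a 3 x n
# character matrix, so transpose it to get one row per input string
padded <- t(vapply(mtchs, function(m) m[1:3], character(3)))

# Assigning by name creates the three new columns with the desired names
df[paste0("Bracket", 1:3)] <- as.data.frame(padded, stringsAsFactors = FALSE)
print(df)
```

The hard-coded `1:3` assumes at most three bracket pairs are wanted; any further pairs are silently dropped.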
This one is based on gsub with a PCRE regex:

    df$Bracket1 <- gsub('(?:.*?\\[([^][]*)\\].*|.*)', '\\1', text, perl=TRUE)
    df$Bracket2 <- gsub('(?:.*?\\[[^][]*\\].*?\\[([^][]*)\\].*|.*)', '\\1', text, perl=TRUE)
    df$Bracket3 <- gsub('(?:.*?\\[[^][]*\\].*?\\[[^][]*\\].*?\\[([^][]*)\\].*|.*)', '\\1', text, perl=TRUE)

See the IDEONE demo.
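
One difference from the desired output is worth noting: when a string has no n-th bracket pair, the fallback branch `|.*` matches the whole string and the replacement is an empty string rather than `NA`. The sketch below, which is an addition and not part of the original answer, shows the second-bracket pattern on two sample strings and converts the empty results to `NA`. The name `b2` is illustrative.

```r
text <- c("[AAA]xxxx", "[AAA][bbb] bla")

# Second-bracket pattern: either capture the second pair,
# or fall back to '.*' and replace everything with an empty group
b2 <- gsub('(?:.*?\\[[^][]*\\].*?\\[([^][]*)\\].*|.*)', '\\1', text, perl = TRUE)
print(b2)  # "" "bbb"

# Turn "no match" into NA so it lines up with the desired output
b2[b2 == ""] <- NA_character_
print(b2)  # NA "bbb"
```

Note that a genuinely empty pair such as `[]` would also become `NA` under this conversion.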