Read an Excel file directly from an R script

How can I read an Excel file directly into R? Or should I first export the data to a text or CSV file and import that file into R?

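Related discussion

Let me reiterate what @Chase recommended: use XLConnect.

The reasons for using XLConnect are, in my opinion:

Cross-platform. XLConnect is written in Java and thus runs on Windows, Linux, and Mac with no change to your R code (except possibly path strings).

Nothing else to load. Just install XLConnect and get on with life.

You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Windows.

XLConnect is somewhat new compared to the other solutions, so it is less frequently mentioned in blog posts and reference docs. For me it has been very useful.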
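On the point about path strings: building paths with base R's file.path() instead of hard-coding separators is one way to keep the same script usable on all three platforms. A tiny sketch with made-up directory and file names:

```r
# file.path() joins components with "/", which R accepts on Windows, Linux and Mac
workbook <- file.path("data", "raw", "my-spreadsheet.xlsx")
workbook
# "data/raw/my-spreadsheet.xlsx"
```

The resulting string can then be handed to whichever reader function you use.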

There is now readxl:

The readxl package makes it easy to get data out of Excel and into R.
Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc)
readxl has no external dependencies so it's easy to install and use on
all operating systems. It is designed to work with tabular data stored
in a single sheet.

readxl is built on top of the libxls C library, which abstracts away
many of the complexities of the underlying binary format.

It supports both the legacy .xls format and .xlsx

readxl is available from CRAN, or you can install it from github with:

# install.packages("devtools")
devtools::install_github("hadley/readxl")

Usage

library(readxl)

# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")

# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)

# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")

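Note that although the description says "no external dependencies", it does require the Rcpp package, which in turn requires Rtools (on Windows) or Xcode (on OS X). These are dependencies external to R, although many people already have them installed for other reasons.

Related discussion

Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system, which is usually already the case on MacOS and Linux but takes an extra step on Windows, e.g. see http://strawberryperl.com/). Various caveats and alternatives are listed on the R wiki page.

The only reason I see not to do this directly is that you may want to examine the spreadsheet first to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., no comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc.) I generally have no problem with this process.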
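Looping over the worksheets, as mentioned above, might look like the following sketch. It is written against readxl (excel_sheets() and read_excel()) rather than gdata, purely for illustration. The function name read_all_sheets and its two reader arguments are assumptions of this example; they are exposed as parameters so the loop can be tried with stand-in functions.

```r
read_all_sheets <- function(path,
                            list_sheets = readxl::excel_sheets,
                            read_sheet  = readxl::read_excel) {
  sheets <- list_sheets(path)
  # One data frame per worksheet, named after the sheet it came from
  result <- lapply(sheets, function(s) read_sheet(path, sheet = s))
  names(result) <- sheets
  result
}

# Hypothetical usage (requires readxl and an existing workbook):
# all_data <- read_all_sheets("my-spreadsheet.xlsx")
# names(all_data)   # the worksheet names
```

Each sheet still needs the same sanity check as a single-sheet import, since headers and layout can differ from one sheet to the next.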

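Related discussion

EDIT October 2015: As others have commented here, the openxlsx and readxl packages are far faster than the xlsx package and actually manage to open larger Excel files (>1500 rows and >120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred, and that openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick openxlsx instead of xlsx.

Pre-2015: I have used the xlsx package. It changed my workflow with Excel and R. No more annoying pop-ups asking whether I am sure I want to save my Excel sheet in .txt format. The package also writes Excel files.

However, I find the read.xlsx function slow when opening large Excel files. The read.xlsx2 function is considerably faster, but it does not guess the vector classes of the data.frame columns. If you use read.xlsx2, you have to specify the desired column classes with the colClasses argument. Here is a practical example:

read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but it is very slow for large data sets. It also works for .xls files.

read.xlsx2("filename.xlsx", 1) is faster, but you have to define the column classes manually. A shortcut is to run the command twice (see the example below). The character specification converts your columns to factors. Use the Date and POSIXct options for time.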

library(xlsx)

# A function to see column numbers next to the column names
coln <- function(x) {
  y <- rbind(seq_len(ncol(x)))
  colnames(y) <- colnames(x)
  rownames(y) <- "col.number"
  y
}

data <- read.xlsx2("filename.xlsx", 1) # Open the file

coln(data) # Check the column numbers you want to have as factors

x <- 3 # Say you want columns 1-3 as factors, the rest numeric

# colClasses needs exactly one entry per column: x + (ncol(data) - x)
data <- read.xlsx2("filename.xlsx", 1,
                   colClasses = c(rep("character", x),
                                  rep("numeric", ncol(data) - x)))

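Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform best (in a few simple situations).

I myself have been using xlsx since I started using R, out of inertia if nothing else, and I recently noticed that there doesn't seem to be any objective information about which package works better.

Any benchmarking exercise is fraught with difficulties, as some packages are sure to handle certain situations better than others, and there are plenty of other caveats.

That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):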


set.seed(51423)
data.frame(
  str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
  str2 = sample(sprintf("%09d", 1:NN)),  #ID field 2
  #varying length string field--think names/addresses, etc.
  str3 =
    replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                         collapse = "")),
  #factor-like string field with 50 "levels"
  str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
  #factor-like string field with 17 levels, varying length
  str5 =
    sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                 collapse = "")), NN, TRUE),
  #lognormally distributed numeric
  num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
  #3 binary strings
  str6 = sample(c("Y", "N"), NN, TRUE),
  str7 = sample(c("M", "F"), NN, TRUE),
  str8 = sample(c("B", "W"), NN, TRUE),
  #right-skewed integer
  int1 = ceiling(rexp(NN)),
  #dates by month
  dat1 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  dat2 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
  #date by day
  dat3 =
    sample(seq(from = as.Date("2015-06-01"),
               to = as.Date("2015-07-15"), by = "day"),
           NN, TRUE),
  #lognormal numeric that can be positive or negative
  num3 =
    (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)

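I then wrote this to csv, opened it in LibreOffice, and saved it as an .xlsx file. I then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried versions with and without specified column types, but this didn't change the rankings).

I'm excluding RODBC because I'm on Linux; XLConnect because its primary purpose seems to be importing entire Excel workbooks rather than reading single sheets, so judging it on its reading capabilities alone seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (it seems to have been phased out).

I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. The gc call is primarily for xlsx, which I've found can at times clog memory. Without further ado, here are the results I found:

1,000-row Excel file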

benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex = 1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})

# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#      xlsx  194.1958  199.2662  214.1512  201.9063  212.7563  354.0327   100
#  openxlsx  142.2074  142.9028  151.9127  143.7239  148.0940  255.0124   100
#    readxl  122.0238  122.8448  132.4021  123.6964  130.2881  214.5138   100
#     gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345   100

So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:

#       expr   min    lq  mean median    uq   max
# 1     xlsx  1.59  1.62  1.62   1.63  1.63  1.65
# 2 openxlsx  1.17  1.16  1.15   1.16  1.14  1.19
# 3   readxl  1.00  1.00  1.00   1.00  1.00  1.00
# 4    gdata 16.43 16.62 15.77  16.67 16.25 11.31
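The relative table above can be reproduced in base R by dividing each timing column by its minimum. The sketch below retypes part of the 1,000-row summary into a plain data frame; the object names raw, num_cols, and relative are illustrative and do not come from the original benchmark code.

```r
# Summary timings (milliseconds) copied from the 1,000-row results above
raw <- data.frame(
  expr   = c("xlsx", "openxlsx", "readxl", "gdata"),
  min    = c(194.1958, 142.2074, 122.0238, 2004.4745),
  median = c(201.9063, 143.7239, 123.6964, 2062.5259),
  max    = c(354.0327, 255.0124, 214.5138, 2425.6345)
)

# Divide every timing column by its own minimum,
# so the fastest package reads 1.00 in each column
num_cols <- setdiff(names(raw), "expr")
relative <- raw
relative[num_cols] <- lapply(raw[num_cols],
                             function(col) round(col / min(col), 2))
relative
```

Normalising per column rather than per row keeps each statistic (min, median, max) comparable across packages.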

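We see that my own favorite, xlsx, is 60% slower than readxl.

25,000-row Excel file

Because of the time it takes, I only did 20 repetitions on the larger file; otherwise the commands were identical. Here's the raw data: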

# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#      xlsx  4451.9553  4539.4599  4738.6366  4762.1768  4941.2331  5091.0057    20
#  openxlsx   962.1579   981.0613   988.5006   986.1091   992.6017  1040.4158    20
#    readxl   341.0006   344.8904   347.0779   346.4518   348.9273   360.1808    20
#     gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826    20

Here's the relative data:

#       expr    min     lq   mean median     uq    max
# 1     xlsx  13.06  13.16  13.65  13.75  14.16  14.13
# 2 openxlsx   2.82   2.84   2.85   2.85   2.84   2.89
# 3   readxl   1.00   1.00   1.00   1.00   1.00   1.00
# 4    gdata 128.62 128.67 129.22 129.86 129.69 126.75

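So readxl is the clear winner when it comes to speed. gdata had better have something else going for it, as it is painfully slow at reading Excel files, and the problem only gets worse for larger tables.

Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback of readxl) the col_types argument in readxl, which (as of this writing) only accepts some non-standard R names: "text" instead of "character" and "date" instead of "Date".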
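If you already have a vector of R class names, the naming difference can be bridged with a small lookup. This is only a sketch: the helper to_readxl_types is hypothetical, and the target names ("text", "numeric", "date", "logical", "guess") are the values col_types is believed to accept in recent readxl releases, so check them against ?read_excel for your installed version.

```r
to_readxl_types <- function(r_classes) {
  # Map standard R class names to readxl-style col_types names
  lookup <- c(character = "text",    factor  = "text",
              numeric   = "numeric", integer = "numeric",
              Date      = "date",    POSIXct = "date",
              logical   = "logical")
  types <- unname(lookup[r_classes])
  # Anything not in the lookup falls back to letting the reader guess
  types[is.na(types)] <- "guess"
  types
}

to_readxl_types(c("character", "numeric", "Date", "complex"))
# "text" "numeric" "date" "guess"
```

The result could then be passed as the col_types argument of read_excel, assuming one entry per column in the sheet.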

Related discussion

I've had good luck with XLConnect: http://cran.r-project.org/web/packages/XLConnect/index.html

Related discussion

library(RODBC)
file.name <- "file.xls"
sheet.name <- "Sheet Name"

## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)

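Personally, I like RODBC and can recommend it.

Related discussion

Just gave the openxlsx package a try today. It worked really well (and fast).

http://cran.r-project.org/web/packages/openxlsx/index.html

Another solution is the xlsReadWrite package, which doesn't require additional installs but does require you to download an additional shlib before you use it for the first time, by: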

require(xlsReadWrite)
xls.getshlib()

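Forgetting this can cause utter frustration. Been there and all that...

On a side note: you might want to consider converting to a text-based format (e.g. csv) and reading in from there. There are a number of reasons for this:

- Whatever your solution (RODBC, gdata, xlsReadWrite), some strange things can happen when your data gets converted. Dates especially can be rather cumbersome. The HFWutils package has some tools to deal with EXCEL dates (per @Ben Bolker's comment).
- If you have large sheets, reading in text files is faster than reading in from EXCEL.
- For .xls and .xlsx files, different solutions might be necessary. For example, the xlsReadWrite package currently does not support .xlsx AFAIK. gdata requires you to install additional perl libraries for .xlsx support. The xlsx package can handle extensions of the same name.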
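To illustrate the date problem from the list above: Excel stores dates as day counts, and a reader that does not recognise the cell format may hand back the raw number. A minimal base-R sketch, assuming the workbook uses Excel's default 1900 date system (workbooks using the 1904 system need a different origin); the helper name excel_serial_to_date is made up for this example.

```r
excel_serial_to_date <- function(serial) {
  # For the 1900 date system, "1899-12-30" is the origin commonly used in R
  as.Date(serial, origin = "1899-12-30")
}

excel_serial_to_date(c(42005, 42370))
# "2015-01-01" "2016-01-01"
```

If a date column comes back as five-digit numbers like these, this conversion is usually the first thing to try.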

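Related discussion

Expanding on the answer provided by @Mikko, you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes, and then follow it up with read.xlsx2.

Example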

# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow=1, endRow=50)
df.real <- read.xlsx2("filename.xlsx", 1,
                      colClasses = as.vector(sapply(df.temp, mode)))
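One caveat when deriving column classes this way: mode() reports the storage mode, so factor and Date columns both come back as "numeric". Whether that matters depends on how read.xlsx typed the sample rows. A small base-R illustration on a made-up data frame (df.temp here is a stand-in built by hand, not read from a file):

```r
df.temp <- data.frame(
  id    = c("a01", "a02"),
  group = factor(c("x", "y")),
  when  = as.Date(c("2015-06-01", "2015-06-02")),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# id is "character"; group, when and value are all reported as "numeric"
sapply(df.temp, mode)

# Taking the first class of each column keeps factors and dates distinguishable:
# "character", "factor", "Date", "numeric"
sapply(df.temp, function(col) class(col)[1])
```

Not every class name is necessarily valid for colClasses, so it is worth checking the read.xlsx2 documentation before passing the second vector along.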

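Related discussion

As noted in many of the other answers, there are plenty of good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv file) to retrieve data from Excel. To see why, enter =1/3 into a cell in Excel. Now reduce the number of decimal places visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice that in both cases Excel has helpfully kept only the data that was visible to you through the interface, and you've lost all of the precision in your actual source data.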
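The effect can be imitated in base R without Excel at all. The sketch below only simulates the export: it writes the two-decimal display text of 1/3 to a temporary CSV, as a stand-in for what Excel would save, and reads it back. The variable names are illustrative.

```r
x <- 1 / 3                                       # full-precision value held in the cell
shown <- formatC(x, format = "f", digits = 2)    # what a two-decimal cell format displays

csv_path <- tempfile(fileext = ".csv")
writeLines(c("value", shown), csv_path)          # stand-in for exporting the displayed text
read_back <- read.csv(csv_path)$value

read_back       # 0.33
x - read_back   # roughly 0.0033 of the original value is gone
```

Reading the workbook directly with one of the packages discussed here avoids going through the displayed text.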

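Related discussion

An Excel file can be read directly into R as follows. Note that read.table only parses text, so this works only when the file is actually tab-delimited text saved with an .xls extension, not a real Excel workbook:

my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header = TRUE)

Read xls and xlsx files using the readxl package: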

library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")