R: Why use as.factor() instead of just factor()

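I recently saw Matt Dowle write some code with as.factor(), specifically

for (col in names_factors) set(dt, j=col, value=as.factor(dt[[col]]))

in a comment on this answer.

I used this snippet, but I needed to set the factor levels explicitly to make sure they appear in my desired order, so I had to change

as.factor(dt[[col]])

to

factor(dt[[col]], levels = my_levels)

This got me thinking: what, if any, is the benefit of using as.factor() over just factor()?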
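
To make the change described above concrete, here is a minimal sketch in base R. The vector `sizes` and the level order `my_levels` are made-up illustrations, not part of the original code. It shows that as.factor() accepts only the input vector, so a custom level order has to go through factor():

```r
# Hypothetical data; `sizes` and `my_levels` are illustrative names only.
sizes <- c("medium", "small", "large", "small")
my_levels <- c("small", "medium", "large")

# as.factor() has no `levels` argument, so the levels come out sorted.
lv_default <- levels(as.factor(sizes))

# factor() lets the caller fix the level order explicitly.
lv_custom <- levels(factor(sizes, levels = my_levels))

print(lv_default)  # sorted alphabetically
print(lv_custom)   # in the order given by my_levels
```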

Related discussion

as.factor is a wrapper for factor, but it returns immediately if the input vector is already a factor:

function (x)
{
    if (is.factor(x))
        x
    else if (!is.object(x) && is.integer(x)) {
        levels <- sort(unique.default(x))
        f <- match(x, levels)
        levels(f) <- as.character(levels)
        if (!is.null(nx <- names(x)))
            names(f) <- nx
        class(f) <- "factor"
        f
    }
    else factor(x)
}

Comment from Frank: it is not a mere wrapper, since this "quick return" leaves the factor levels as they are, while factor() does not:

f = factor("a", levels = c("a","b"))
#[1] a
#Levels: a b

factor(f)
#[1] a
#Levels: a

as.factor(f)
#[1] a
#Levels: a b

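Expanded answer two years later, covering the following:

What does the manual say?

Performance: as.factor > factor when the input is a factor

Performance: as.factor > factor when the input is an integer

Unused levels or NA levels

Caution when using R's group-by functions: watch out for unused or NA levels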

What does the manual say?

The documentation for ?factor mentions the following:

‘factor(x, exclude = NULL)’ applied to a factor without ‘NA’s is a
no-operation unless there are unused levels: in that case, a
factor with the reduced level set is returned.

‘as.factor’ coerces its argument to a factor. It is an
abbreviated (sometimes faster) form of ‘factor’.

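Performance: as.factor > factor when the input is a factor

The word "no-operation" is a bit ambiguous. Don't take it as "doing nothing"; in fact, it means "doing a lot of things but essentially changing nothing". Here is an example: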

set.seed(0)
## a randomized long factor with 1e+6 levels, each repeated 10 times
f <- sample(gl(1e+6, 10))

system.time(f1 <- factor(f)) ## default: exclude = NA
# user system elapsed
# 7.640 0.216 7.887

system.time(f2 <- factor(f, exclude = NULL))
# user system elapsed
# 7.764 0.028 7.791

system.time(f3 <- as.factor(f))
# user system elapsed
# 0 0 0

identical(f, f1)
#[1] TRUE

identical(f, f2)
#[1] TRUE

identical(f, f3)
#[1] TRUE

as.factor does return immediately, but factor is not a real "no-op". Let's profile factor to see what it does.

Rprof("factor.out")
f1 <- factor(f)
Rprof(NULL)
summaryRprof("factor.out")[c(1, 4)]
#$by.self
#                      self.time self.pct total.time total.pct
#"factor"                   4.70    58.90       7.98    100.00
#"unique.default"           1.30    16.29       4.42     55.39
#"as.character"             1.18    14.79       1.84     23.06
#"as.character.factor"      0.66     8.27       0.66      8.27
#"order"                    0.08     1.00       0.08      1.00
#"unique"                   0.06     0.75       4.54     56.89
#
#$sampling.time
#[1] 7.98

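It first sorts the unique values of the input vector f, then converts f to a character vector, and finally uses match to map that character vector onto the sorted levels, rebuilding the factor. Here is the source code of factor for confirmation.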

function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}
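
The main steps of that source can be traced by hand on a tiny input. This is only a sketch: the input `x` and the names `lev`, `codes` and `manual` are made up for illustration, and it follows only the path taken when `levels` is missing and the defaults apply.

```r
# Made-up input: a factor with one unused level, "c".
x <- factor(c("b", "a", "b"), levels = c("a", "b", "c"))

y     <- unique(x)                       # distinct values, still a factor
ind   <- sort.list(y)                    # ordering of those distinct values
lev   <- unique(as.character(y)[ind])    # sorted levels; unused "c" is gone
codes <- match(as.character(x), lev)     # integer codes for each element

# Reassemble the factor the same way the source does: codes + levels + class.
manual <- structure(codes, levels = lev, class = "factor")

print(identical(manual, factor(x)))
```

Every step touches the whole vector or its unique values, which is consistent with the profile shown earlier.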

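So the function factor is really designed to work with a character vector, and it applies as.character to its input to ensure that. We can learn at least two performance-related lessons from the above:

For a data frame DF, lapply(DF, as.factor) is much faster than lapply(DF, factor) for type conversion if many columns are already factors.

The slowness of factor can explain why some important R functions are slow, for example table; see the question "R: table function surprisingly slow".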
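
The first point can be sketched on a toy data frame. `DF` and its columns are made up for illustration, and the sketch shows the difference in results rather than timings:

```r
# Made-up data frame whose columns are already factors.
DF <- data.frame(
  g1 = factor(c("a", "b", "a"), levels = c("a", "b", "c")),
  g2 = factor(c("x", "x", "y"))
)

res_as  <- lapply(DF, as.factor)  # each column is returned untouched
res_fac <- lapply(DF, factor)     # each column is rebuilt from scratch

print(identical(res_as$g1, DF$g1))  # as.factor() left the column as it was
print(levels(res_fac$g1))           # factor() dropped the unused level "c"
```

Besides being slower, lapply(DF, factor) silently changes the level set, so the two calls are not interchangeable when unused levels matter.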

Performance: as.factor > factor when the input is an integer

A factor variable is a close relative of an integer variable.

unclass(gl(2, 2, labels = letters[1:2]))
#[1] 1 1 2 2
#attr(,"levels")
#[1] "a" "b"

storage.mode(gl(2, 2, labels = letters[1:2]))
#[1] "integer"

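This means that converting an integer vector to a factor is easier than converting a numeric (double) or character vector. as.factor takes advantage of this through its dedicated integer branch.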

x <- sample.int(1e+6, 1e+7, TRUE)

system.time(as.factor(x))
# user system elapsed
# 4.592 0.252 4.845

system.time(factor(x))
# user system elapsed
# 22.236 0.264 22.659
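
A small check, using made-up vectors `x_int` and `x_dbl`, suggests that the integer shortcut changes only the route taken and not the result, and that it applies only to integer storage:

```r
x_int <- c(3L, 1L, 2L, 3L)  # integer storage: as.factor() uses its fast branch
x_dbl <- c(3, 1, 2, 3)      # double storage: as.factor() falls back to factor()

f_fast <- as.factor(x_int)
f_slow <- factor(x_int)

# Same levels and same integer codes either way.
print(identical(levels(f_fast), levels(f_slow)))
print(identical(as.integer(f_fast), as.integer(f_slow)))

# The shortcut is conditional on is.integer(), which is FALSE for doubles.
print(is.integer(x_dbl))
```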

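Unused levels or NA levels

Now let's look at a few examples of how factor and as.factor affect factor levels when the input is already a factor. Frank has given one with an unused factor level; I will provide one with an NA level.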

f <- factor(c(1, NA), exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>

as.factor(f)
#[1] 1 <NA>
#Levels: 1 <NA>

factor(f, exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>

factor(f)
#[1] 1 <NA>
#Levels: 1

There is a (generic) function droplevels that can be used to drop unused levels of a factor. But an NA level is not dropped by default.

## "factor" method of `droplevels`
droplevels.factor
#function (x, exclude = if (anyNA(levels(x))) NULL else NA, ...)
#factor(x, exclude = exclude)

droplevels(f)
#[1] 1 <NA>
#Levels: 1 <NA>

droplevels(f, exclude = NA)
#[1] 1 <NA>
#Levels: 1

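Caution when using R's group-by functions: watch out for unused or NA levels

R functions that perform group-by operations, such as split and tapply, expect us to provide factor variables as the "by" variables. But often we just provide character or numeric variables. So internally these functions need to convert them to factors, and probably most of them use as.factor in the first place (at least this is the case for split.default and tapply). The table function looks like an exception: I spotted factor rather than as.factor inside it. There may be some special consideration behind this, which unfortunately was not obvious to me when I inspected its source code.

Since most group-by R functions use as.factor, if they are given a factor with unused or NA levels, those groups will appear in the result.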

x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

split(x, f)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#numeric(0)

tapply(x, f, FUN = mean)
# a b c
# 1 2 NA

Interestingly, although table does not rely on as.factor, it preserves those unused levels too:

table(f)
#a b c
#1 1 0

Sometimes this kind of behavior is undesirable. A classic example is barplot(table(f)), where the unused level c still takes up a slot on the axis with a zero-height bar.

If this is really not wanted, we need to remove the unused or NA levels from our factor variable manually, using droplevels or factor.
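
A minimal sketch of that clean-up, reusing the two example factors from earlier in this answer (the names `f_clean`, `g` and `g_clean` are made up for illustration):

```r
# Factor with an unused level "c", as in the split/tapply example.
f <- factor(letters[1:2], levels = letters[1:3])
f_clean <- droplevels(f)          # removes the unused level
print(levels(f_clean))
print(table(f_clean))             # no empty "c" cell any more

# Factor with an NA level, as in the earlier example.
g <- factor(c(1, NA), exclude = NULL)
g_clean <- factor(g)              # default exclude = NA removes the NA level
print(levels(g_clean))
```

Passing `f_clean` instead of `f` to barplot(table(...)) would then leave out the empty slot.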

Hints:

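split has an argument drop, which defaults to FALSE, hence as.factor is used; with drop = TRUE, factor is used instead.

aggregate relies on split, so it also has a drop argument, which defaults to TRUE.

tapply has no drop argument, although it also relies on split. In particular, the documentation ?tapply says that as.factor is (always) used.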
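
The hints about split and tapply can be sketched with the same toy data as before. Since tapply has no drop argument, the sketch applies droplevels to the grouping factor first, which is one possible workaround rather than the only one:

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

groups_all  <- names(split(x, f))               # keeps the empty group "c"
groups_used <- names(split(x, f, drop = TRUE))  # drops it
print(groups_all)
print(groups_used)

# tapply() has no `drop`, so clean the factor before grouping.
means_used <- tapply(x, droplevels(f), FUN = mean)
print(means_used)
```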
