In R, how to select rows based on a statistic of a column attribute?

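My table has tens of thousands of rows, grouped into 400 classes, and 12 columns.

The desired result is a table of 400 rows (one per class), chosen by the maximum value of column "z" within each class, with all of the original columns retained.

Here is a sample of my data. In this sample, I need R to extract only the second, fourth, seventh, and eighth rows (marked with * below).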

x y z cluster
1 712521.75 3637426.49 19.46 12
2 712520.69 3637426.47 19.66 12 *
3 712518.88 3637426.63 17.37 225
4 712518.4 3637426.48 19.42 225 *
5 712517.11 3637426.51 18.81 225
6 712515.7 3637426.58 17.8 17
7 712514.68 3637426.55 18.16 17 *
8 712513.58 3637426.55 18.23 50 *
9 712512.1 3637426.62 17.24 50
10 712513.93 3637426.88 18.08 50

I have tried many different combinations, including:

tapply(data$z, data$cluster, max) # returns only the max value and cluster columns
which.max(data$z) # returns only the index of the max value in the entire table

I also read up on the plyr package but did not find a solution.

A very simple approach is to use aggregate and merge:

> merge(aggregate(z ~ cluster, mydf, max), mydf)
cluster z x y
1 12 19.66 712520.7 3637426
2 17 18.16 712514.7 3637427
3 225 19.42 712518.4 3637426
4 50 18.23 712513.6 3637427
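
To reproduce the two steps separately, here is a minimal sketch. It assumes the ten sample rows from the question have been typed into a data frame named mydf (the name used in the answer above). The intermediate names max_z and result are only illustrative.

```r
# Sample rows copied from the question; the name mydf follows the answer above.
mydf <- data.frame(
  x = c(712521.75, 712520.69, 712518.88, 712518.40, 712517.11,
        712515.70, 712514.68, 712513.58, 712512.10, 712513.93),
  y = c(3637426.49, 3637426.47, 3637426.63, 3637426.48, 3637426.51,
        3637426.58, 3637426.55, 3637426.55, 3637426.62, 3637426.88),
  z = c(19.46, 19.66, 17.37, 19.42, 18.81, 17.80, 18.16, 18.23, 17.24, 18.08),
  cluster = c(12, 12, 225, 225, 225, 17, 17, 50, 50, 50)
)

# Step 1: one row per cluster holding the maximum z.
max_z <- aggregate(z ~ cluster, data = mydf, FUN = max)

# Step 2: merge on the shared columns (cluster and z) to recover x and y.
result <- merge(max_z, mydf)
print(result)
```

Because merge() joins on every column the two tables share, no explicit by argument is needed here. The row order depends on whether cluster is stored as a number or as a factor.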

You can even use the output of your tapply code to get what you need. Just turn it into a data.frame instead of a named vector.

> merge(mydf, data.frame(z = with(mydf, tapply(z, cluster, max))))
z x y cluster
1 18.16 712514.7 3637427 17
2 18.23 712513.6 3637427 50
3 19.42 712518.4 3637426 225
4 19.66 712520.7 3637426 12

For other options, see the answers to this question.
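
One further base R option, offered here only as a sketch and not taken from the answers referred to above, is to compare each z with its cluster maximum computed by ave(). The data frame and its id column below are hypothetical toy values; only the column names z and cluster follow the question.

```r
# Hypothetical toy data; the column names z and cluster follow the question.
mydf <- data.frame(
  id      = 1:6,
  z       = c(19.46, 19.66, 17.37, 19.42, 18.16, 18.23),
  cluster = c(12, 12, 225, 225, 17, 50)
)

# ave() returns, for every row, the maximum z within that row's cluster.
group_max <- ave(mydf$z, mydf$cluster, FUN = max)

# Keep the rows whose z equals their cluster maximum (tied rows are all kept).
top_rows <- mydf[mydf$z == group_max, ]
print(top_rows)
```

This keeps the original row order and column layout, since it filters the table in place rather than joining two tables.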

Related discussion

Thank you all for your help! aggregate() and merge() worked perfectly for me.

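One important point: aggregate() returns a single row per cluster (the maximum z), whereas merge() returns every point whose z equals that maximum, so points tied for the maximum within a cluster are all kept.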

That is ideal in this case, because the points are 3D and are not duplicates once their x and y coordinates are taken into account.
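
The difference is easy to see on a small example. The sketch below uses hypothetical data (the data frame pts and its values are invented for illustration) in which cluster 1 reaches its maximum z at two different points. The last step is an assumption about what one might want if ties were not welcome, not something the thread requires.

```r
# Hypothetical data with a tie: cluster 1 reaches its maximum z at two points.
pts <- data.frame(
  x       = c(1, 2, 3, 4),
  y       = c(5, 6, 7, 8),
  z       = c(20, 20, 15, 18),
  cluster = c(1, 1, 1, 2)
)

# One row per cluster: the tie collapses to a single maximum value.
max_z <- aggregate(z ~ cluster, data = pts, FUN = max)

# Joining back on cluster and z brings in both tied points of cluster 1.
merged <- merge(max_z, pts)

# If exactly one row per cluster were required, keep the first row of each.
one_each <- merged[!duplicated(merged$cluster), ]

print(nrow(max_z))
print(nrow(merged))
print(nrow(one_each))
```

Running table(merged$cluster) on the merged result is a quick way to find which clusters contain ties.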

Here is my solution:

df <- read.table("data.txt", header=TRUE, sep=",")
attach(df)
names(df)
[1] "Row" "x" "y" "z" "cluster"

head(df)
Row x y z cluster
1 1 712521.8 3637426 19.46 361
2 2 712520.7 3637426 19.66 361
3 3 712518.9 3637427 17.37 147
4 4 712518.4 3637426 19.42 147
5 5 712517.1 3637427 18.81 147
6 6 712515.7 3637427 17.80 42

new_table_a <- aggregate(z ~ cluster, df, max) # 400 rows: one maximum "z" per cluster
new_table_b <- merge(new_table_a, df) # 408 rows: includes points tied on the maximum "z"

head(new_table_b)
cluster z Row x y
1 1 20.44 6043 712416.2 3637478
2 10 26.09 1138 712458.4 3637511
3 100 19.39 6496 712423.4 3637485
4 101 25.74 2141 712521.2 3637488
5 102 17.33 2320 712508.2 3637484
6 103 21.01 6908 712462.2 3637493