Sorting in pandas for large datasets

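I want to sort my data by a given column (specifically, the p-value). The problem is that I cannot load the entire dataset into memory, so the following does not work, or works only for small datasets.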

data = data.sort_values(by="P_VALUE", ascending=True)

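Is there a fast way to sort my data by a given column that works on chunks of the data only and does not require loading the whole dataset into memory?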

Related discussion

As I mentioned in the comments, this answer already provides a possible solution. It is based on the HDF format.

As for the sorting problem, there are at least three ways to solve it with this approach.

First, you can try using pandas directly, querying the HDF-stored DataFrame.

Second, you can use PyTables, which pandas uses under the hood.

Francesc Alted gives a hint on the PyTables mailing list:

The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.

The documentation says:

sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used

Third, still with PyTables, you can use the method Table.itersorted().

From the docs:

Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)

Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
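The two PyTables options quoted above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `sorted_p_values`, the two-column layout, and the node names `data` and `data_sorted` are made up for the example, and it assumes a PyTables 3.x installation in which `Column.create_csindex()`, `Table.itersorted()`, and `Table.copy(sortby=...)` are available.

```python
import numpy as np
import tables


def sorted_p_values(h5_path, p_values):
    """Store p_values in an HDF5 table and read them back in sorted order in two ways."""
    # Hypothetical two-column layout, chosen only for this illustration.
    dtype = np.dtype([("col_1", "i8"), ("p_value", "f8")])
    records = np.empty(len(p_values), dtype=dtype)
    records["col_1"] = np.arange(len(p_values))
    records["p_value"] = p_values

    with tables.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "data", description=dtype)
        table.append(records)
        table.flush()

        # A completely sorted index (CSI) is what guarantees a fully sorted order.
        table.cols.p_value.create_csindex()

        # Table.itersorted(): walk the rows in the order of the index.
        # (Collected into a list here only so the result can be inspected.)
        via_iterator = [float(row["p_value"])
                        for row in table.itersorted("p_value", checkCSI=True)]

        # Table.copy(sortby=...): write a sorted copy of the table to disk.
        sorted_table = table.copy(newname="data_sorted", sortby="p_value", checkCSI=True)
        via_copy = sorted_table.col("p_value").tolist()

    return via_iterator, via_copy
```

In a real out-of-memory setting you would process the rows coming out of `itersorted()` one at a time, or read the sorted copy in slices, instead of collecting everything into a list.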

Another approach is to use an intermediate database. A detailed workflow can be seen in an IPython notebook published on plot.ly.

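This solves the sorting problem, along with other data analysis you might do with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.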

Introduction

This notebook explores a 3.9Gb CSV file.

This notebook is a primer on out-of-memory data analysis with

pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.

IPython notebook: An interface for writing and sharing python code, text, and plots.

SQLite: A self-contained, server-less database that's easy to set up and query from pandas.

Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.

Requirements

import pandas as pd
from sqlalchemy import create_engine  # database connection

Import the CSV data into SQLite

Load the CSV, chunk-by-chunk, into a DataFrame

Process the data a bit, strip out uninteresting columns

Append it to the SQLite database

disk_engine = create_engine('sqlite:///311_8M.db')  # Initializes database with filename 311_8M.db in current directory

chunksize = 20000
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

Query value counts and order the results

Housing and Development Dept receives the most complaints

df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)

Limit the number of sorted entries

Which 10 cities have the most complaints?

df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
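Applied to the original question, the same workflow can be used to sort by the p-value column: load the CSV chunk by chunk, let SQLite do the ordering on disk, and stream the sorted rows back out in chunks. The following is a minimal sketch under assumptions, not the notebook's code: the function name `sort_csv_via_sqlite`, the table name `data`, and the index name `idx_sort` are made up for illustration, and it passes a standard-library `sqlite3` connection to pandas in place of the SQLAlchemy engine used above.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def sort_csv_via_sqlite(csv_path, db_path, out_path, sort_column="P_VALUE", chunksize=20000):
    """Sort a CSV by one column via SQLite, holding only one chunk in memory at a time."""
    with closing(sqlite3.connect(db_path)) as conn:
        # 1. Load the CSV chunk by chunk into a table named "data".
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql("data", conn, if_exists="append", index=False)

        # 2. An index on the sort column lets SQLite return rows in order
        #    without sorting them all at query time.
        conn.execute(f'CREATE INDEX IF NOT EXISTS idx_sort ON data ("{sort_column}")')
        conn.commit()

        # 3. Stream the sorted rows back out, again chunk by chunk.
        query = f'SELECT * FROM data ORDER BY "{sort_column}" ASC'
        first = True
        for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False
```

A call such as `sort_csv_via_sqlite("input.csv", "scratch.db", "sorted.csv")` would leave the sorted rows in `sorted.csv`. Note that the table is appended to, so the database file should be a fresh one for each run.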

Possibly related and useful links

Pandas: in-memory sorting of HDF5 files
ptrepack sortby needs "full" index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html

Blaze might be the tool for you. It can handle pandas and CSV files out of core: http://blaze.readthedocs.org/en/latest/ooc.html

import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort()  # Uses Chunked Pandas

For faster processing, first load the data into a database that Blaze can drive. But if this is a one-off and you have some time, the posted code should do the job.

If your CSV file contains only structured data, I would suggest an approach that uses only Linux commands.

Assume the CSV file contains two columns, COL_1 and P_VALUE:

map.py:

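#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Put the p-value first so that `sort` can use it as the sort key.
    col_1, p_value = line.rstrip("\n").split(",")
    print("%f,%s" % (float(p_value), col_1))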

Then the following Linux command will generate a CSV file sorted by p-value:

cat input.csv | ./map.py | sort > output.csv

If you are familiar with Hadoop, you can reuse the map.py above and add a simple reduce.py to generate the sorted CSV file through Hadoop Streaming.
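The general technique behind sorting a file that does not fit in memory is an external merge sort: sort manageable chunks, write each one to disk, then merge the sorted runs. For readers who would rather stay in Python, here is a minimal standard-library sketch of that idea. The function name `external_sort_csv` and its parameters are hypothetical, and it assumes the CSV has a header row and a numeric sort column.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice


def external_sort_csv(in_path, out_path, key_column, chunk_rows=100_000):
    """Sort a CSV by a numeric column while holding at most chunk_rows rows in memory."""
    run_paths = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        key_index = header.index(key_column)

        def sort_key(row):
            return float(row[key_index])

        # Phase 1: cut the input into chunks, sort each in memory, write it out as a run.
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as run:
                csv.writer(run).writerows(chunk)
            run_paths.append(path)

    # Phase 2: merge the sorted runs lazily; rows are pulled from each run on demand.
    run_files = [open(path, newline="") for path in run_paths]
    try:
        with open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            runs = (csv.reader(f) for f in run_files)
            writer.writerows(heapq.merge(*runs, key=sort_key))
    finally:
        for f in run_files:
            f.close()
        for path in run_paths:
            os.remove(path)
```

Because the rows are written back unchanged, very small p-values such as `1e-10` keep their full precision. One design caveat: every run stays open during the merge, so a very small `chunk_rows` on a very large file could hit the operating system's limit on open files.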

Here are my honest suggestions. There are three options you can try.

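I like pandas for its rich documentation and features, but I have been advised to use NumPy, since it feels faster on larger datasets. You can also consider other tools to make the job easier.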

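If you are using Python 3, you can break your big chunk of data into sets and process them with concurrent threads. I am too lazy for this and it does not look elegant. Besides, I believe pandas, NumPy, and SciPy are built with hardware design in mind to enable multithreading.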

This is the option I prefer, since it is the simple and lazy technique. Check the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.dataframe.sort.html

You can also use the "kind" parameter of the pandas sort function you are using.
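As a small illustration of the `kind` parameter, assuming a recent pandas in which the sorting method is `DataFrame.sort_values` (the data below is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "COL_1": ["a", "b", "c", "d"],
    "P_VALUE": [0.05, 0.01, 0.05, 0.01],
})

# kind="mergesort" is a stable sort: rows with equal P_VALUE keep
# their original relative order.
stable = df.sort_values(by="P_VALUE", kind="mergesort")
print(stable["COL_1"].tolist())  # ['b', 'd', 'a', 'c']
```

Note that `kind` only selects the sorting algorithm. The DataFrame still has to fit in memory, so on its own this does not solve the out-of-memory case from the question.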

Godspeed, my friend.