How to reduce the time taken to load a pickle file in Python

I created a dictionary in Python and pickled it. The file came to about 300MB in size. Now I want to load that same pickle:

```python
import pickle

output = open('myfile.pkl', 'rb')
mydict = pickle.load(output)
output.close()
```

Loading this pickle takes about 15 seconds. How can I reduce that time?

Hardware specs: Ubuntu 14.04, 4GB RAM

The code below shows the time taken to dump and load a file with json, pickle and cPickle.

After dumping, the file size is around 300MB.

```python
import json, pickle, cPickle
import os, timeit

mydict = {all values to be added}

def dump_json():
    output = open('myfile1.json', 'wb')
    json.dump(mydict, output)
    output.close()

def dump_pickle():
    output = open('myfile2.pkl', 'wb')
    pickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def dump_cpickle():
    output = open('myfile3.pkl', 'wb')
    cPickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def load_json():
    output = open('myfile1.json', 'rb')
    mydict = json.load(output)
    output.close()

def load_pickle():
    output = open('myfile2.pkl', 'rb')
    mydict = pickle.load(output)
    output.close()

def load_cpickle():
    output = open('myfile3.pkl', 'rb')
    mydict = cPickle.load(output)
    output.close()


if __name__ == '__main__':
    print "Json dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "Pickle dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "cPickle dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "Json load:"
    t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "Pickle load:"
    t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "cPickle load:"
    t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'
```

Output:

```
Json dump:
42.5809804916

Pickle dump:
52.87407804489

cPickle dump:
1.1903790187836

Json load:
12.240660209656

pickle load:
24.48748306274

cPickle load:
24.4888298893
```

I have seen that cPickle takes less time to dump and load, but loading a file still takes a long time.


Try using the json library instead of pickle. This should be an option in your case, because you are dealing with a relatively simple object.

According to this site,

JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?

Upgrading Python or using the marshal module with a fixed Python version also helps boost speed (code adapted from here):

```python
try:
    import cPickle
except ImportError:
    import pickle as cPickle
import pickle
import json, marshal, random
from time import time
from hashlib import md5

test_runs = 1000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle, cPickle, marshal]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            if module.__name__ in ['pickle', 'cPickle']:
                for i in range(test_runs): serialized = module.dumps(data, protocol=-1)
            else:
                for i in range(test_runs): serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
```

Result:

```
C:\Python27\python.exe -u"serialization_benchmark.py"
json int W 0.125 R 0.156
pickle int W 2.808 R 1.139
cPickle int W 0.047 R 0.046
marshal int W 0.016 R 0.031
json float W 1.981 R 0.624
pickle float W 2.607 R 1.092
cPickle float W 0.063 R 0.062
marshal float W 0.047 R 0.031
json str W 0.172 R 0.437
pickle str W 5.149 R 2.309
cPickle str W 0.281 R 0.156
marshal str W 0.109 R 0.047

C:\pypy-1.6\pypy-c -u"serialization_benchmark.py"
json int W 0.515 R 0.452
pickle int W 0.546 R 0.219
cPickle int W 0.577 R 0.171
marshal int W 0.032 R 0.031
json float W 2.390 R 1.341
pickle float W 0.656 R 0.436
cPickle float W 0.593 R 0.406
marshal float W 0.327 R 0.203
json str W 1.141 R 1.186
pickle str W 0.702 R 0.546
cPickle str W 0.828 R 0.562
marshal str W 0.265 R 0.078

c:\Python34\python -u"serialization_benchmark.py"
json int W 0.203 R 0.140
pickle int W 0.047 R 0.062
pickle int W 0.031 R 0.062
marshal int W 0.031 R 0.047
json float W 1.935 R 0.749
pickle float W 0.047 R 0.062
pickle float W 0.047 R 0.062
marshal float W 0.047 R 0.047
json str W 0.281 R 0.187
pickle str W 0.125 R 0.140
pickle str W 0.125 R 0.140
marshal str W 0.094 R 0.078
```

Python 3.4 uses pickle protocol 3 by default, which gave no difference compared with protocol 4. Python 2 has protocol 2 as its highest pickle protocol (selected if a negative value is passed to dump), which is twice as slow as protocol 3.
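As a minimal illustration of why the protocol choice matters (Python 3; absolute numbers vary by machine and data, and the dictionary here is just a stand-in for the one in the question):

```python
import pickle

# A dictionary roughly like the one in the question, only smaller.
data = {i: float(i) for i in range(100000)}

# Protocol 0 is the old ASCII format; HIGHEST_PROTOCOL is the most
# compact binary format the running interpreter supports.
ascii_dump = pickle.dumps(data, protocol=0)
binary_dump = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# The binary dump is considerably smaller, and both dumping and
# loading it are faster - which is why passing protocol (or -1) helps.
print(len(ascii_dump), len(binary_dump))
```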


I have had good results reading huge files (e.g. a ~750 MB igraph object, a binary pickle file) with cPickle itself. This was achieved by simply wrapping the pickle load call as mentioned here.

An example snippet in your case would be something like:

```python
import timeit
import cPickle as pickle
import gc


def load_cpickle_gc():
    output = open('myfile3.pkl', 'rb')

    # disable garbage collector
    gc.disable()

    mydict = pickle.load(output)

    # enable garbage collector again
    gc.enable()
    output.close()


if __name__ == '__main__':
    print "cPickle load (with gc workaround):"
    t = timeit.Timer(stmt="pickle_wr.load_cpickle_gc()", setup="import pickle_wr")
    print t.timeit(1), '\n'
```

There may well be more apt ways to get the task done, but this workaround does cut the time required drastically (for me, from 843.04s down to 41.28s, roughly 20x).


If you are trying to store the dictionary in a single file, it is the load time for the large file that slows you down. One of the easiest things you can do is write the dictionary to a directory on disk, with each dictionary entry stored as an individual file. Then the files can be pickled and unpickled in multiple threads (or with multiprocessing). For a very large dictionary this should be much faster than reading from and writing to a single file, regardless of the serializer you choose. Some packages, like klepto and joblib, already do much (if not all) of this for you. I'd check those packages out. (Note: I am the klepto author. See https://github.com/uqfoundation/klepto.)