Python multiprocessing/multithreading to speed up file copying

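I have a program that copies large numbers of files from one location to another. I'm talking 100,000+ files (at the moment I'm copying 314 GB of image sequences). Both locations are on huge, very fast, heavily RAIDed network storage. I'm using shutil to copy the files sequentially and it takes a while, so I'm trying to find the best way to optimize this. I've noticed that some software I use multithreads its reads from the network very effectively, with huge gains in load times, so I'd like to try doing the same in Python.

I have no experience programming with multithreading/multiprocessing. Does this seem like the right direction to take? If so, what is the best way to do it? I've looked at a few other SO posts about threaded file copying in Python, and they all seem to say that you get no speed gain, but given my hardware I don't think that will be the case. I'm nowhere near my I/O capacity at the moment, and resource usage sits at around 1% (I have 40 cores and 64 GB of RAM locally).

Spencer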

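Update:

I never did get gevent working (first answer), because I couldn't install the module without an internet connection, which my workstation doesn't have. However, I was able to cut file copy times by a factor of 8 using only Python's built-in threading (which I have since learned how to use), and I wanted to post it as an additional answer for anyone interested. My code is below. Note that the 8x improvement will most likely differ from environment to environment, depending on your hardware and network setup.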

import os
import queue
import shutil
import threading

fileQueue = queue.Queue()
destPath = 'path/to/cop'

class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    lock = threading.Lock()

    def __init__(self):
        # filelist.txt holds one source file path per line
        with open("filelist.txt", "r") as txt:
            fileList = txt.read().splitlines()

        if not os.path.exists(destPath):
            os.mkdir(destPath)

        self.totalFiles = len(fileList)

        print(str(self.totalFiles) + " files to copy.")
        self.threadWorkerCopy(fileList)

    def CopyWorker(self):
        # Each worker pulls file names off the shared queue until the program exits.
        while True:
            fileName = fileQueue.get()
            try:
                shutil.copy(fileName, destPath)
                # The lock keeps the counter and the progress output consistent.
                with self.lock:
                    self.copyCount += 1
                    percent = (self.copyCount * 100) // self.totalFiles
                    print(str(percent) + " percent copied.")
            finally:
                # Always mark the item as done so fileQueue.join() cannot hang
                # if a copy fails.
                fileQueue.task_done()

    def threadWorkerCopy(self, fileNameList):
        # Start 16 daemon threads, then feed them the file names.
        for i in range(16):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        # Block until every queued file has been processed.
        fileQueue.join()

ThreadedCopy()
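
For readers on Python 3, the same queue-of-files and pool-of-workers idea can probably be expressed more compactly with the standard library's `concurrent.futures.ThreadPoolExecutor`. The following is only a minimal sketch under that assumption, not the code that produced the 8x figure above: `copy_files` and its parameters are hypothetical names, and the default of 16 workers simply mirrors the thread count used in the answer.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def copy_files(sources, dest_dir, max_workers=16):
    """Copy every path in `sources` into `dest_dir` using a thread pool.

    Hypothetical helper for illustration. Returns (copied, failed):
    a list of destination paths and a list of (source, exception) pairs.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one copy job per file and remember which source each future belongs to.
        futures = {pool.submit(shutil.copy, src, dest): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                copied.append(future.result())
            except OSError as exc:
                # A failed copy is recorded instead of stopping the other workers.
                failed.append((src, exc))
    return copied, failed
```

Calling `copy_files(open("filelist.txt").read().splitlines(), destPath)` would stand in for `ThreadedCopy()`. Collecting failures rather than raising is a design choice made for this sketch, so that one unreadable file does not abort a long copy run.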

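This can be parallelized by using gevent in Python.

I would recommend the following logic to speed up copying 100k+ files:

Put the names of all 100k+ files that need to be copied into a csv file, for example "input.csv".

Then create chunks from that csv file. The number of chunks should be based on the number of processors/cores in your machine.

Pass each of those chunks to a separate thread.

Each thread reads the file names in its chunk sequentially and copies each file from one location to the other.

Here is the Python code snippet: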

import sys
import os
import multiprocessing

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool

def _copyFile(file):
    # over here, you can put your own logic of copying a file from source to destination
    pass

def _worker(csv_file, chunk):
    # Read only this chunk's byte range and process each file name in it.
    with open(csv_file, "rb") as f:
        f.seek(chunk[0])
        for file in f.read(chunk[1]).decode().splitlines():
            _copyFile(file)

def _getChunks(file, size):
    # Yield (offset, length) pairs that split the file on line boundaries.
    # Binary mode is needed so that relative seeks are allowed.
    with open(file, "rb") as f:
        while True:
            start = f.tell()
            f.seek(size, 1)
            # Extend the chunk to the end of the line the seek landed in.
            s = f.readline()
            yield start, f.tell() - start
            if not s:
                break

if __name__ == "__main__":
    if len(sys.argv) > 1:
        csv_file_name = sys.argv[1]
    else:
        print("Please provide a csv file as an argument.")
        sys.exit()

    no_of_procs = multiprocessing.cpu_count() * 4

    file_size = os.stat(csv_file_name).st_size

    # Integer division: seek() requires a whole number of bytes.
    file_size_per_chunk = file_size // no_of_procs

    pool = Pool(no_of_procs)

    for chunk in _getChunks(csv_file_name, file_size_per_chunk):
        pool.apply_async(_worker, (csv_file_name, chunk))

    pool.join()

Save the file as file_copier.py. Open a terminal and run:

$ python file_copier.py input.csv
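
Since gevent may not be installable everywhere, the chunking steps described in this answer can also be sketched with the standard library alone. This is an illustrative variant, not the answer's own method: it splits the list of file names by line count rather than by byte offsets, uses plain `threading` instead of a gevent pool, and assumes `shutil.copy` as the copy logic. The names `split_into_chunks`, `copy_chunk`, and `copy_in_chunks` are hypothetical.

```python
import shutil
import threading
from pathlib import Path


def split_into_chunks(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous, near-equal lists."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def copy_chunk(chunk, dest_dir):
    """Copy the files of one chunk sequentially (runs inside one thread)."""
    for src in chunk:
        shutil.copy(src, dest_dir)  # assumed copy logic for this sketch


def copy_in_chunks(list_file, dest_dir, n_threads=8):
    """Read one path per line from `list_file` and copy the files chunk by chunk."""
    lines = Path(list_file).read_text().splitlines()
    names = [line for line in lines if line.strip()]
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    threads = [
        threading.Thread(target=copy_chunk, args=(chunk, dest_dir))
        for chunk in split_into_chunks(names, n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every chunk has been copied
    return len(names)
```

For example, `split_into_chunks(list(range(10)), 3)` gives `[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]`. Fixed chunks are simple, but if some files are much larger than others, one thread may finish long after the rest, which is the case where a shared queue tends to balance the work better.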