Python: Scrapy item pipelines: parallel or sequential execution of process_item

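I am developing a Scrapy spider that successfully yields some items. These items should be inserted into a database using pymysql. Because the data is relational, I have to execute several insert statements for each item.
I want to call connection.commit() after each complete insert, so that an error cannot leave inconsistent entries in the database.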
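The per-item commit described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name RelationalInsertPipeline and the author/book tables are hypothetical, and the standard-library sqlite3 module stands in for pymysql so the sketch runs without a database server. With pymysql the placeholders would be %s instead of ?, while commit() and rollback() are called on the connection in the same way.

```python
import sqlite3


class RelationalInsertPipeline:
    """Hypothetical pipeline that wraps each item in its own transaction."""

    def open_spider(self, spider):
        # Assumption: sqlite3 replaces pymysql here; the schema is made up.
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute(
            "CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )
        self.connection.execute(
            "CREATE TABLE book (title TEXT NOT NULL, author_id INTEGER NOT NULL)"
        )
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        try:
            # First insert: the parent row.
            cursor = self.connection.execute(
                "INSERT INTO author (name) VALUES (?)", (item["author"],)
            )
            # Second insert: the child row that references the parent.
            self.connection.execute(
                "INSERT INTO book (title, author_id) VALUES (?, ?)",
                (item["title"], cursor.lastrowid),
            )
            # Commit only once every statement for this item has succeeded.
            self.connection.commit()
        except Exception:
            # Undo the partial insert so no orphaned rows remain.
            self.connection.rollback()
            raise
        return item
```

The rollback in the except branch is what keeps a failed item from leaving a parent row without its child rows. It only gives that guarantee if no other item writes through the same connection between the first insert and the commit.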

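I am currently wondering whether Scrapy calls process_item for multiple items in parallel, or sequentially for one item after another. If the latter is the case, I can simply use the following approach: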

    def process_item(self, item, spider):
        # execute insert statements
        connection.commit()

If Scrapy executes several calls to process_item at the same time, the final call to commit() could run while another item has not yet been fully inserted.
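To make that concern concrete, here is a small sketch that simulates the interleaving by hand. It does not show what Scrapy actually does; it only shows what would go wrong on a shared connection if two items were interleaved. The author/book tables and the function name are hypothetical, and sqlite3 again stands in for pymysql.

```python
import sqlite3


def simulate_interleaved_items():
    """Hand-simulated interleaving of two items on one shared connection."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE author (name TEXT)")
    connection.execute("CREATE TABLE book (title TEXT, author TEXT)")
    connection.commit()

    # Item A: first insert only, its second insert has not happened yet.
    connection.execute("INSERT INTO author VALUES ('A')")

    # Item B: both inserts, followed by its commit on the same connection.
    connection.execute("INSERT INTO author VALUES ('B')")
    connection.execute("INSERT INTO book VALUES ('B-book', 'B')")
    connection.commit()  # this also commits item A's half-finished work

    # Item A now fails and rolls back, but its author row is already committed.
    connection.rollback()

    authors = [row[0] for row in connection.execute(
        "SELECT name FROM author ORDER BY name")]
    books = [row[0] for row in connection.execute(
        "SELECT title FROM book ORDER BY title")]
    connection.close()
    return authors, books


if __name__ == "__main__":
    print(simulate_interleaved_items())
```

In this simulation author 'A' ends up in the table without any book, which is exactly the inconsistent state the per-item commit is meant to prevent.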

The documentation for item pipelines states:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

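But I am not sure whether this means that process_item will never be executed in parallel, or only that the different pipelines are always executed one after another (e.g. dropping duplicates -> changing something -> DB insert).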

I think process_item is executed sequentially, because the documentation shows the following example:

    from scrapy.exceptions import DropItem


    class DuplicatesPipeline(object):

        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
                return item

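In this code, no synchronization is involved in adding the id to ids_seen, but I do not know whether the example was simplified because it only demonstrates how to use pipelines.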