Clojure pmap: why aren't I using all the cores?

I'm trying to use the Clojure pantomime library to extract text from a large number of TIFF documents (among other formats).

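My plan is to use pmap to map the extraction over a sequence of input rows (read from a Postgres database) and then update the same Postgres database with the Tika/Tesseract OCR output. This has been working fine, but I notice in htop that many of the cores sit idle at times.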

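Is there any way to fix this, and what steps can I take to find out why it might be blocking somewhere? Each task processes a single TIFF file, and the threads share no state with each other.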

Additional info:

Some Tika/Tesseract runs take 3 seconds, others take up to 90 seconds. Generally speaking, Tika is heavily CPU bound. According to htop, I have ample memory available.

Postgres shows no locking issues in session management, so I don't think that is what's holding me up.

Maybe a future is waiting on a deref somewhere? How can I tell?
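A rough way to look into the waiting-future question from a REPL, assuming the work runs on the thread pool behind `future` (which `pmap` uses), is to count those threads by state. The function name `future-pool-thread-states` is made up for illustration; the name prefix it filters on is the one Clojure gives the threads of its send-off pool. This is a sketch for a first look, not a full diagnosis.

```clojure
(require '[clojure.string :as str])

(defn future-pool-thread-states
  "Returns a map of thread-state name to thread count for the threads
  that back `future` (and therefore `pmap`) in this JVM."
  []
  (->> (keys (Thread/getAllStackTraces))
       ;; Keep only the threads of Clojure's send-off pool.
       (filter (fn [^Thread th]
                 (str/starts-with? (.getName th)
                                   "clojure-agent-send-off-pool")))
       ;; Thread$State enum to its name, e.g. "RUNNABLE" or "WAITING".
       (map (fn [^Thread th] (str (.getState th))))
       frequencies))
```

Calling `(future-pool-thread-states)` a few times while the batch runs shows how many pool threads are in each state. Threads in `WAITING` or `TIMED_WAITING` are parked rather than working. The JVM also reports a thread blocked in socket I/O as `RUNNABLE`, so a high `RUNNABLE` count alone does not prove the CPU is busy.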

Any tips appreciated, thanks. Code added below.

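;; Assumed namespace setup: the aliases below are used in the code but were
;; not declared in the post. The namespace name is a placeholder, `t` is
;; assumed to be taoensso.timbre, and `db-connection` is not shown here.
(ns my-app.core
  (:require [clojure.java.jdbc :as jdbc]
            [pantomime.extract :as extract]
            [pantomime.languages :as pl]
            [pantomime.mime :as pm]
            [taoensso.timbre :as t]))

(defn parse-a-path
  "Extracts the MIME type, text and language of one row's file."
  [{:keys [row_id file_path]}]
  (let [start        (System/currentTimeMillis)
        mime_type    (pm/mime-type-of file_path)
        file_content (-> file_path extract/parse :text)
        language     (pl/detect-language file_content)]
    {:mime_type             mime_type
     :file_content          file_content
     :language              language
     :row_id                row_id
     ;; Elapsed milliseconds converted to seconds.
     :parse_time_in_seconds (float (/ (- (System/currentTimeMillis) start) 1000))
     :record_status         "doc parsed"}))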

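(defn fetch-all-batch
  "Returns every row of the_table that should be parsed."
  []
  ;; With default options jdbc/query realizes the whole result set
  ;; before returning, so the rows are not fetched lazily.
  (t/info "Fetching all rows for batch.")
  (jdbc/query (db-connection)
              ["select row_id, file_path, file_extension
                from the_table"]))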

(defn update-a-row
  "Parses one row's file and writes the result back to the_table."
  [{:keys [row_id file_extension] :as row}]
  ;; parse-a-path runs outside the try, so parse errors reach the caller.
  (let [parse-out (parse-a-path row)]
    (try
      (jdbc/execute!
       (db-connection)
       ["update the_table
         set record_last_updated   = current_timestamp,
             file_content          = ?,
             mime_type             = ?,
             language              = ?,
             parse_time_in_seconds = ?,
             record_status         = ?
         where row_id = ?"
        (:file_content parse-out)
        (:mime_type parse-out)
        (:language parse-out)
        (:parse_time_in_seconds parse-out)
        (:record_status parse-out)
        row_id])
      (t/debug (str "updated row_id " (:row_id parse-out)
                    " (" file_extension ")"
                    " in " (:parse_time_in_seconds parse-out) " seconds."))
      ;; Any exception from the update or the logging is silently dropped.
      (catch Exception _))))

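(dorun
 (pmap
  (fn [row]
    (try
      (update-a-row row)
      ;; Only SQLException has a chained "next" exception to report.
      (catch java.sql.SQLException e
        (t/error (or (.getNextException e) e)))
      (catch Exception e
        (t/error e))))
  ;; Call the function: pmap needs the rows, not the function itself.
  (fetch-all-batch)))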
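One thing that may be worth ruling out, given the mix of 3-second and 90-second tasks, is pmap's bounded look-ahead. pmap is semi-lazy: its source starts only about "number of processors plus two" futures ahead of the element currently being consumed, so a slow task at the head can hold back fast tasks queued behind it. The sketch below is a toy experiment, not a diagnosis of the code above; `started-behind-slow-head` is a made-up name, the sleeping task stands in for the real Tika work, and the input is deliberately an unchunked seq.

```clojure
(defn started-behind-slow-head
  "Runs pmap over `n-tasks` items where only the first item is slow.
  Returns how many tasks had started by the time the slow one finished."
  [n-tasks slow-ms]
  (let [started (atom 0)
        seen    (atom nil)
        task    (fn [i]
                  (swap! started inc)
                  (when (zero? i)
                    ;; Simulate one long-running document at the head.
                    (Thread/sleep slow-ms)
                    ;; Snapshot how far pmap got while this task ran.
                    (reset! seen @started))
                  i)
        ;; (iterate inc 0) is not a chunked seq, so pmap's own
        ;; look-ahead is what limits how many futures get started.
        input   (take n-tasks (iterate inc 0))]
    (dorun (pmap task input))
    @seen))
```

Evaluating `(started-behind-slow-head 200 300)` should return a number close to the processor count plus a small constant rather than 200: the other tasks are instant, yet most of them are not started until the slow head task completes. If the real workload behaves the same way, idle cores in htop would be expected whenever a 90-second file sits at the front of the window.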