How can I read/parse the following text using Clojure?
The text is structured like this:

    Tag001
    0.1, 0.2, 0.3, 0.4
    0.5, 0.6, 0.7, 0.8
    ...
    Tag002
    1.1, 1.2, 1.3, 1.4
    1.5, 1.6, 1.7, 1.8
    ...

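The file can contain any number of TagXXX entries, and each tag can have any number of lines of CSV values.
==== PPPS. (Sorry for all these postscripts :-)
More improvements: parsing 31842 lines of data now takes about 1 second on my Atom laptop, which is 7 times faster than the original code. However, the C version is still 20 times faster than this.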

    (require 'clojure.string 'clojure.java.io)

    ;; Append a finished record to the accumulator, skipping the empty initial one.
    (defn add-parsed-code [accu code]
      (if (empty? code)
        accu
        (conj accu code)))

    ;; Append one row of (still unparsed) string components to the record.
    (defn add-values [code comps]
      (let [values comps
            old-values (:values code)
            new-values (if old-values
                         (conj old-values values)
                         [values])]
        (assoc code :values new-values)))

    ;; Read every line eagerly, then split each one on commas.
    (defn read-line-components [file]
      (map (fn [line] (clojure.string/split line #","))
           (with-open [rdr (clojure.java.io/reader file)]
             (doall (line-seq rdr)))))

    (defn parse-file [file]
      (let [line-comps (read-line-components file)]
        (loop [line-comps line-comps
               accu []
               curr {}]
          (if line-comps
            (let [comps (first line-comps)]
              (if (= (count comps) 1) ;; code line?
                (recur (next line-comps)
                       (add-parsed-code accu curr)
                       {:code (first comps)})
                (recur (next line-comps)
                       accu
                       (add-values curr comps))))
            (add-parsed-code accu curr)))))

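==== PPS.
I cannot figure out why the first version below is 10 times faster than the second, but using map and with-open instead of slurp does make the reading itself faster. The total read/process time has not dropped much, though (from 7 seconds to 6 seconds).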

    (require 'clojure.string 'clojure.java.io)

    ;; Version 1: line-seq realized inside with-open
    (time
     (let [lines (map (fn [line] line)
                      (with-open [rdr (clojure.java.io/reader "DATA.txt")]
                        (doall (line-seq rdr))))]
       (println (last lines))))

    ;; Version 2: slurp the whole file, then split it into lines
    (time
     (let [lines (clojure.string/split-lines
                  (slurp "DATA.txt"))]
       (println (last lines))))

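Both timed versions above hold the entire file in memory: `doall` realizes every line, and `slurp` reads the whole text before splitting. When only a reduction over the lines is needed, a possible alternative is to consume the lines while the reader is still open. The following is a minimal sketch under that assumption; `last-line` is a hypothetical helper name, not part of the original code, and no timing claim is made for it.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: returns the last line of a file (nil if the file is
;; empty) by reducing over the lines while the reader is open, so the full
;; line sequence is never retained.
(defn last-line [path]
  (with-open [rdr (io/reader path)]
    (reduce (fn [_ line] line) nil (line-seq rdr))))
```

Because `reduce` is eager, all reading finishes before `with-open` closes the reader, so no `doall` is needed here.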
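==== PS.
Skuro's solution does work. But parsing is not that fast, so I have to use a C-based parser to read the files and build the DB (it reads 400 files in 1 to 3 seconds, whereas Clojure takes 1-4 seconds for a single file; yes, the files are rather large), and I use Clojure only for the statistical analysis part.
The following parses the above file, keeping each line of values separate. If that is not what you want, you can change the code accordingly.
Update: side effects are now encapsulated in a dedicated `load-file` function.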

    (require 'clojure.string)

    (defn tag? [line]
      (re-matches #"Tag[0-9]*" line))

    ; potentially unsafe, you might want to change this:
    (defn parse-values [line]
      (read-string (str "[" line "]")))

    (defn add-parsed-tag [accu tag]
      (if (empty? tag)
        accu
        (conj accu tag)))

    (defn add-values [tag line]
      (let [values (parse-values line)
            old-values (:values tag)
            new-values (if old-values
                         (conj old-values values)
                         [values])]
        (assoc tag :values new-values)))

    ; note: this shadows clojure.core/load-file in the current namespace
    (defn load-file [path]
      (slurp path))

    (defn parse-file [file]
      (let [lines (clojure.string/split-lines file)]
        (loop [lines lines ; remaining lines
               accu []     ; already parsed tags
               curr {}]    ; current tag being parsed
          (if lines
            (let [line (first lines)]
              (if (tag? line)
                ; we recur after starting a new tag
                ; if curr is empty we don't add it to the accu (e.g. first iteration)
                (recur (next lines)
                       (add-parsed-tag accu curr)
                       {:tag line})
                ; we're parsing values for the current tag
                (recur (next lines)
                       accu
                       (add-values curr line))))
            ; if we were parsing a tag, we need to add it to the final result
            (add-parsed-tag accu curr)))))

I'm not very excited about the code above, but it gets the job done. Given a file such as:

    Tag001
    0.1, 0.2, 0.3, 0.4
    0.5, 0.6, 0.7, 0.8
    Tag002
    1.1, 1.2, 1.3, 1.4
    1.5, 1.6, 1.7, 1.8
    Tag003
    1.1, 1.2, 1.3, 1.4
    1.1, 1.2, 1.3, 1.4
    1.5, 1.6, 1.7, 1.8
    1.5, 1.6, 1.7, 1.8

it produces the following result:

    user=> (clojure.pprint/print-table [:tag :values] (parse-file (load-file "tags.txt")))
    ================================================================
    :tag   | :values
    ================================================================
    Tag001 | [[0.1 0.2 0.3 0.4] [0.5 0.6 0.7 0.8]]
    Tag002 | [[1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8]]
    Tag003 | [[1.1 1.2 1.3 1.4] [1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8] [1.5 1.6 1.7 1.8]]
    ================================================================

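The `parse-values` function is marked as potentially unsafe because `clojure.core/read-string` invokes the full Clojure reader, which can execute code embedded in the input (via `#=`) while `*read-eval*` is true, its default. One possible replacement, sketched below, splits each line on commas and parses every field as a double. `parse-values-safe` is a hypothetical name, and the sketch assumes every field is a plain decimal number; it throws `NumberFormatException` on anything else.

```clojure
(require '[clojure.string :as string])

;; Hypothetical replacement for parse-values: never calls the Clojure reader.
;; Assumes each comma-separated field is a decimal number.
(defn parse-values-safe [line]
  (mapv #(Double/parseDouble (string/trim %))
        (string/split line #",")))

(parse-values-safe "0.1, 0.2, 0.3, 0.4")
```

If reader syntax is still wanted, `clojure.edn/read-string` is another option, since it reads data only and does not evaluate anything.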
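This can be done with the partition-by function. It may look a bit cryptic, but the readability can easily be improved. This function runs in about 500 milliseconds on my Mac mini.
First, I created test data with the following function.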

    (require 'clojure.java.io)

    ; writes 7500 tags, each followed by four lines of values
    (defn write-data [fname]
      (with-open [wrtr (clojure.java.io/writer fname)]
        (dorun
         (for [x (take 7500 (range))]
           (do
             (.write wrtr (format "Tag%010d" x))
             (.write wrtr (str "\n1.1, 1.2, 1.3, 1.4"
                               "\n1.1, 1.2, 1.3, 1.4"
                               "\n1.5, 1.6, 1.7, 1.8"
                               "\n1.5, 1.6, 1.7, 1.8\n")))))))

    (write-data "my-data.txt")

    ; "a b c d" will be converted to [a b c d]
    (defn to-vec [st]
      (load-string (str "[" st "]")))

    (defn my-transform [fname]
      ; the atom holds the most recent tag, so every value line maps to the
      ; same partition key until the next tag line changes it
      (let [tag (atom {:tag nil})]
        (with-open [rdr (clojure.java.io/reader fname)]
          (doall
           (into {}
                 (map (fn [xs] {(first xs) (map to-vec (rest xs))})
                      (partition-by
                       (fn [y]
                         (if (.startsWith (str y) "Tag")
                           (swap! tag assoc :tag y)
                           @tag))
                       (line-seq rdr))))))))

    (time (count (my-transform "my-data.txt")))
    ;Elapsed time: 517.23 msecs

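As one way to improve the readability of the partition-by approach, the atom can be dropped by partitioning on a plain predicate and then pairing each tag group with the value group that follows it. This is only a sketch, not the answer's original method: `tag-line?` and `group-tags` are hypothetical names, it leaves the value lines as unparsed strings, and it assumes every tag is followed by at least one value line (two consecutive tag lines would fall into the same partition and break the pairing).

```clojure
(require '[clojure.string :as string])

;; Hypothetical predicate: a tag line starts with "Tag".
(defn tag-line? [line]
  (string/starts-with? line "Tag"))

;; Groups a sequence of lines into {tag [value-line ...]}.
;; Assumes tag lines and value lines strictly alternate in groups.
(defn group-tags [lines]
  (->> lines
       (partition-by tag-line?)   ; ("Tag001") ("0.1, 0.2" ...) ("Tag002") ...
       (partition 2)              ; pair each tag group with its value group
       (map (fn [[[tag] values]] [tag (vec values)]))
       (into {})))

(group-tags ["Tag001" "0.1, 0.2" "0.3, 0.4" "Tag002" "1.1, 1.2"])
```

The lines can come from `line-seq` inside `with-open`, as in `my-transform`; since `into` is eager, the result is fully built before the reader closes.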