Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?
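I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high-dimensional vectors (dim = 100) using the Kahan summation algorithm. I have done my best to optimize the summation, but it is still 20x slower than an equivalent C implementation. Profiling points at the reads and writes of the mutable vector (the `fV_read` and `fV_write` cost centres in the report below) as the culprits.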
Here are the two implementations. The Haskell is compiled with ghc-7.0.3 using the llvm backend. The C is compiled with llvm-gcc.
Kahan summation in Haskell:

    {-# LANGUAGE BangPatterns #-}
    module Test where

    import Control.Monad ( mapM_ )
    import Data.Vector.Unboxed ( Vector, Unbox )
    import Data.Vector.Unboxed.Mutable ( MVector )
    import qualified Data.Vector.Unboxed as U
    import qualified Data.Vector.Unboxed.Mutable as UM
    import Data.Word ( Word )
    import Data.Bits ( shiftL, shiftR, xor )

    prng :: Word -> Word
    prng w = w' where
        !w1 = w  `xor` (w  `shiftL` 13)
        !w2 = w1 `xor` (w1 `shiftR` 7)
        !w' = w2 `xor` (w2 `shiftL` 17)

    mkVect :: Word -> Vector Double
    mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

    foldV :: (Unbox a, Unbox b)
          => (a -> b -> a) -- componentwise function to fold
          -> Vector a      -- initial accumulator value
          -> [Vector b]    -- data vectors
          -> Vector a      -- final accumulator value
    foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
        liftV f acc = fV where
            fV v = go 0 where
                n = min (U.length v) (UM.length acc)
                go i | i < n     = step >> go (i + 1)
                     | otherwise = return ()
                     where
                         step = {-# SCC "fV_step" #-} do
                             a <- {-# SCC "fV_read"  #-} UM.unsafeRead acc i
                             b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
                             {-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b

    kahan :: [Vector Double] -> Vector Double
    kahan [] = U.singleton 0.0
    kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
        acc = U.map (\z -> (z, 0.0)) v

    kahanStep :: (Double, Double) -> Double -> (Double, Double)
    kahanStep (s, c) x = (s', c') where
        !y  = x - c
        !s' = s + y
        !c' = (s' - s) - y
    {-# NOINLINE kahanStep #-}

    zero :: U.Vector Double
    zero = U.replicate 100 0.0

    myLoop n = kahan $ map mkVect [1..n]

    main = print $ myLoop 100000

Compiled with ghc-7.0.3 using the llvm backend:

    ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
    time ./Test_hs

    real    0m1.948s
    user    0m1.936s
    sys     0m0.008s

Profiling information:

      16,710,594,992 bytes allocated in the heap
          33,047,064 bytes copied during GC
              35,464 bytes maximum residency (1 sample(s))
              23,888 bytes maximum slop
                   1 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0: 31907 collections,     0 parallel,  0.28s,  0.27s elapsed
      Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time   24.73s  ( 24.74s elapsed)
      GC    time    0.28s  (  0.27s elapsed)
      RP    time    0.00s  (  0.00s elapsed)
      PROF  time    0.00s  (  0.00s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time   25.01s  ( 25.02s elapsed)

      %GC time       1.1%  (1.1% elapsed)

      Alloc rate    675,607,179 bytes per MUT second

      Productivity  98.9% of total user, 98.9% of total elapsed

            Thu Feb 23 02:42 2012 Time and Allocation Profiling Report  (Final)

               Test_hs +RTS -s -p -RTS

            total time  =       24.60 secs   (1230 ticks @ 20 ms)
            total alloc = 8,608,188,392 bytes  (excludes profiling overheads)

    COST CENTRE                    MODULE               %time %alloc

    fV_write                       Test                  31.1   26.0
    fV_read                        Test                  27.2   23.2
    mkVect                         Test                  12.3   27.2
    fV_step                        Test                  11.7    0.0
    foldV                          Test                   5.9    5.7
    fV_index                       Test                   5.2    9.3
    kahanStep                      Test                   3.3    6.5
    prng                           Test                   2.2    1.8


                                                                             individual      inherited
    COST CENTRE                       MODULE                  no.   entries  %time %alloc   %time %alloc

    MAIN                              MAIN                      1         0    0.0    0.0   100.0  100.0
     CAF:main1                        Test                    339         1    0.0    0.0     0.0    0.0
      main                            Test                    346         1    0.0    0.0     0.0    0.0
     CAF:main2                        Test                    338         1    0.0    0.0   100.0  100.0
      main                            Test                    347         0    0.0    0.0   100.0  100.0
       myLoop                         Test                    348         1    0.2    0.2   100.0  100.0
        mkVect                        Test                    350    400000   12.3   27.2    14.5   29.0
         prng                         Test                    351   9900000    2.2    1.8     2.2    1.8
        kahan                         Test                    349       102    0.0    0.0    85.4   70.7
         foldV                        Test                    359         1    5.9    5.7    85.4   70.7
          fV_step                     Test                    360   9999900   11.7    0.0    79.5   65.1
           fV_write                   Test                    367  19999800   31.1   26.0    35.4   32.5
            fV_apply                  Test                    368   9999900    1.0    0.0     4.3    6.5
             kahanStep                Test                    369   9999900    3.3    6.5     3.3    6.5
           fV_index                   Test                    366   9999900    5.2    9.3     5.2    9.3
           fV_read                    Test                    361   9999900   27.2   23.2    27.2   23.2
     CAF:lvl19_r3ei                   Test                    337         1    0.0    0.0     0.0    0.0
      kahan                           Test                    358         0    0.0    0.0     0.0    0.0
     CAF:poly_$dPrimMonad3_r3eg       Test                    336         1    0.0    0.0     0.0    0.0
      kahan                           Test                    357         0    0.0    0.0     0.0    0.0
     CAF:$dMVector2_r3ee              Test                    335         1    0.0    0.0     0.0    0.0
     CAF:$dVector1_r3ec               Test                    334         1    0.0    0.0     0.0    0.0
     CAF:poly_$dMonad_r3ea            Test                    333         1    0.0    0.0     0.0    0.0
     CAF:$dMVector1_r3e2              Test                    330         1    0.0    0.0     0.0    0.0
     CAF:poly_$dPrimMonad2_r3e0       Test                    328         1    0.0    0.0     0.0    0.0
      foldV                           Test                    365         0    0.0    0.0     0.0    0.0
     CAF:lvl11_r3dM                   Test                    322         1    0.0    0.0     0.0    0.0
      kahan                           Test                    354         0    0.0    0.0     0.0    0.0
     CAF:lvl10_r3dK                   Test                    321         1    0.0    0.0     0.0    0.0
      kahan                           Test                    355         0    0.0    0.0     0.0    0.0
     CAF:$dMVector_r3dI               Test                    320         1    0.0    0.0     0.0    0.0
      kahan                           Test                    356         0    0.0    0.0     0.0    0.0
     CAF                              GHC.Float               297         1    0.0    0.0     0.0    0.0
     CAF                              GHC.IO.Handle.FD        256         2    0.0    0.0     0.0    0.0
     CAF                              GHC.IO.Encoding.Iconv   214         2    0.0    0.0     0.0    0.0
     CAF                              GHC.Conc.Signal         211         1    0.0    0.0     0.0    0.0
     CAF                              Data.Vector.Generic     182         1    0.0    0.0     0.0    0.0
     CAF                              Data.Vector.Unboxed     174         2    0.0    0.0     0.0    0.0

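… in the C version. It barely affects the performance. I hope we can now all acknowledge Amdahl's law and move on. As …
Since …
Also, I should say that, since I am interfacing with something that uses …
Update 2: I guess I was not clear enough in the original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counterintuitive profiling statistics, so that I can decide whether, regarding …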
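Your C version is not equivalent to your Haskell implementation. In C you have inlined the important Kahan summation step yourself; in Haskell you have created a polymorphic higher-order function that does much more work and takes the transformation step as a parameter. …
I made a C version that corresponds more closely to the Haskell version.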
kahan.h:
kahanStep.c:
main.c:

    #include <stdint.h>
    #include <stdio.h>
    #include "kahan.h"

    #define VDIM    100
    #define VNUM    100000

    /* xorshift pseudo-random number generator, same as the Haskell prng */
    uint64_t prng (uint64_t w) {
        w ^= w << 13;
        w ^= w >> 7;
        w ^= w << 17;
        return w;
    }

    /* The Kahan step is passed in as a function pointer, like the
       higher-order argument of foldV in the Haskell version. */
    void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
        for (int i = 1; i <= VNUM; i++) {
            uint64_t w = i;
            for (int j = 0; j < VDIM; j++) {
                DPair pr;
                pr.fst = s[j];
                pr.snd = c[j];
                pr = fun(pr,w);
                s[j] = pr.fst;
                c[j] = pr.snd;
                w = prng(w);
            }
        }
    }

    int main (int argc, char* argv[]) {
        double acc[VDIM], err[VDIM];
        for (int i = 0; i < VDIM; i++) {
            acc[i] = err[i] = 0.0;
        }
        kahan(acc, err, kahanStep);
        printf("[ ");
        for (int i = 0; i < VDIM; i++) {
            printf("%g ", acc[i]);
        }
        printf("]\n");
        return 0;
    }

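Compiled separately and linked, this runs about 25% slower than the first C version here (0.1s versus 0.079s).
Now you have a higher-order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of …
However, boxing and unboxing is the smaller part of the difference.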
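To make the boxing point concrete, here is a minimal sketch, not taken from the code above. `kahanStepBoxed` has the same shape as the question's `kahanStep` (a lazy tuple of `Double`s). `DPair` and `kahanStepStrict` are hypothetical names, borrowed from the C struct, for a strict pair with unpacked fields, which stores the two doubles directly in the constructor. Whether GHC removes the remaining allocation depends on optimisation flags and inlining, so treat this as an illustration rather than a measured fix.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List (foldl')

-- Same shape as the question's kahanStep: the state is a lazy tuple, so
-- unless GHC can inline the call, every step allocates a pair plus boxed
-- Doubles.
kahanStepBoxed :: (Double, Double) -> Double -> (Double, Double)
kahanStepBoxed (s, c) x = (s', c')
  where
    !y  = x - c
    !s' = s + y
    !c' = (s' - s) - y

-- Hypothetical alternative: strict, unpacked fields keep both Doubles
-- inline in the constructor instead of behind two extra pointers.
data DPair = DPair {-# UNPACK #-} !Double {-# UNPACK #-} !Double

kahanStepStrict :: DPair -> Double -> DPair
kahanStepStrict (DPair s c) x = DPair s' c'
  where
    y  = x - c
    s' = s + y
    c' = (s' - s) - y

sumBoxed :: [Double] -> Double
sumBoxed = fst . foldl' kahanStepBoxed (0, 0)

sumStrict :: [Double] -> Double
sumStrict xs = case foldl' kahanStepStrict (DPair 0 0) xs of
  DPair s _ -> s

main :: IO ()
main = do
  let xs = replicate 1000000 0.1
  print (sumBoxed xs)
  print (sumStrict xs)
```

Both functions perform the same floating-point operations in the same order, so they return the same sum; only the representation of the accumulator differs.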
If you use a more direct translation of the C to Haskell,

    {-# LANGUAGE CPP, BangPatterns #-}
    module Main (main) where

    #define VDIM    100
    #define VNUM    100000

    import Data.Array.Base
    import Data.Array.ST
    import Data.Array.Unboxed
    import Control.Monad.ST
    import GHC.Word
    import Control.Monad
    import Data.Bits

    prng :: Word -> Word
    prng w = w'
      where
        !w1 = w `xor` (w `shiftL` 13)
        !w2 = w1 `xor` (w1 `shiftR` 7)
        !w' = w2 `xor` (w2 `shiftL` 17)

    type Vec s = STUArray s Int Double

    kahan :: Vec s -> Vec s -> ST s ()
    kahan s c = do
        let inner w j
                | j < VDIM  = do
                    !cj <- unsafeRead c j
                    !sj <- unsafeRead s j
                    let !y = fromIntegral w - cj
                        !t = sj + y
                        !w' = prng w
                    unsafeWrite c j ((t-sj)-y)
                    unsafeWrite s j t
                    inner w' (j+1)
                | otherwise = return ()
        forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

    calc :: ST s (Vec s)
    calc = do
        s <- newArray (0,VDIM-1) 0
        c <- newArray (0,VDIM-1) 0
        kahan s c
        return s

    main :: IO ()
    main = print . elems $ runSTUArray calc

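it is much faster. Admittedly, it is still about three times slower than the C, but the original was 13 times slower than the C here (I don't have llvm installed, so I used vanilla gcc and GHC's native code generator; with llvm the results might differ slightly).
I don't think indexing is really the culprit. The vector package relies heavily on compiler magic, and compiling with profiling support interferes massively with that. For packages like …
In the Core, all reads and writes are translated to the fast primops …
So don't take profiling results as unquestionable truth. The more your code depends on optimisations (directly, or indirectly through the libraries it uses), the more vulnerable it is to misleading profiling results caused by disabled optimisations. This also applies to heap profiling for tracking down space leaks, but to a much lesser extent.
When you get a suspicious profiling result, check what happens when you remove some of the SCCs. If that leads to a big drop in run time, that SCC was not your main problem (it may become a problem again after other problems have been fixed).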
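A minimal sketch of that check, assuming a toy loop rather than the code from the question: the two functions below compute the same thing, one with manual `SCC` annotations and one without. The function and cost-centre names are made up for illustration. Built with `-prof` and run with `+RTS -p`, the run time of each variant can be compared; in a build without `-prof` the pragmas are ignored.

```haskell
module Main (main) where

import Data.List (foldl')

-- Annotated variant: each SCC pragma becomes a cost centre in a profiling
-- build, and optimisations may not cross its boundary.
sumSquaresAnnotated :: [Double] -> Double
sumSquaresAnnotated = foldl' step 0
  where
    step acc x =
      let sq = {-# SCC "step_square" #-} (x * x)
      in  {-# SCC "step_add" #-} (acc + sq)

-- Same computation with the manual cost centres removed.
sumSquaresPlain :: [Double] -> Double
sumSquaresPlain = foldl' step 0
  where
    step acc x = acc + x * x

main :: IO ()
main = do
  let xs = map fromIntegral [1 .. 1000000 :: Int]
  print (sumSquaresAnnotated xs)
  print (sumSquaresPlain xs)
```

If the plain variant is much faster under profiling, the annotated cost centres were measuring the cost of the annotations more than the cost of the work.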
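Looking at the Core generated for your program, what jumps out is that your …
When this came up again today on haskell-cafe, and somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the Core that GHC produced and found that GHC generated suboptimal code. From …
… of …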

    {-# LANGUAGE CPP #-}
    module Main (main) where

    #define VDIM    100
    #define VNUM    100000

    import Data.Array.Base
    import Data.Array.ST
    import Data.Array.Unboxed
    import Control.Monad.ST
    import GHC.Word
    import Control.Monad
    import Data.Bits
    import GHC.Float (int2Double)

    prng :: Word -> Word
    prng w = w'
      where
        w1 = w `xor` (w `shiftL` 13)
        w2 = w1 `xor` (w1 `shiftR` 7)
        w' = w2 `xor` (w2 `shiftL` 17)

    type Vec s = STUArray s Int Double

    kahan :: Vec s -> Vec s -> ST s ()
    kahan s c = do
        let inner w j
                | j < VDIM  = do
                    cj <- unsafeRead c j
                    sj <- unsafeRead s j
                    let y = word2Double w - cj
                        t = sj + y
                        w' = prng w
                    unsafeWrite c j ((t-sj)-y)
                    unsafeWrite s j t
                    inner w' (j+1)
                | otherwise = return ()
        forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

    calc :: ST s (Vec s)
    calc = do
        s <- newArray (0,VDIM-1) 0
        c <- newArray (0,VDIM-1) 0
        kahan s c
        return s

    correction :: Double
    correction = 2 * int2Double minBound

    word2Double :: Word -> Double
    word2Double w = case fromIntegral w of
                       i | i < 0 -> int2Double i - correction
                         | otherwise -> int2Double i

    main :: IO ()
    main = print . elems $ runSTUArray calc

In all of this seemingly …
with a proper use of …
then my run time drops by two-thirds, from …
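This came up on the mailing list, and I found that there is a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version works around the bug and is as fast as the C code on my machine: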

    {-# LANGUAGE CPP, BangPatterns, MagicHash #-}
    module Main (main) where

    #define VDIM 100
    #define VNUM 100000

    import Control.Monad.ST
    import Data.Array.Base
    import Data.Array.ST
    import Data.Bits
    import GHC.Word

    import GHC.Exts
    import GHC.Float (int2Double)  -- int2Double is not exported by GHC.Exts

    prng :: Word -> Word
    prng w = w'
      where
        w1 = w `xor` (w `shiftL` 13)
        w2 = w1 `xor` (w1 `shiftR` 7)
        w' = w2 `xor` (w2 `shiftL` 17)

    type Vec s = STUArray s Int Double

    kahan :: Vec s -> Vec s -> ST s ()
    kahan s c = do
        let inner !w j
                | j < VDIM  = do
                    cj <- unsafeRead c j
                    sj <- unsafeRead s j
                    let y = word2Double w - cj
                        t = sj + y
                        w' = prng w
                    unsafeWrite c j ((t-sj)-y)
                    unsafeWrite s j t
                    inner w' (j+1)
                | otherwise = return ()

            -- Explicit counting loop instead of forM_ over a list.
            outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
                    | otherwise = return ()
        outer (1 :: Int)

    calc :: ST s (Vec s)
    calc = do
        s <- newArray (0,VDIM-1) 0
        c <- newArray (0,VDIM-1) 0
        kahan s c
        return s

    main :: IO ()
    main = print . elems $ runSTUArray calc

    {- I originally used this function, which isn't quite correct.
    -- We need a real bug fix in GHC.
    word2Double :: Word -> Double
    word2Double (W# w) = D# (int2Double# (word2Int# w))
    -}

    correction :: Double
    correction = 2 * int2Double minBound

    word2Double :: Word -> Double
    word2Double w = case fromIntegral w of
                       i | i < 0 -> int2Double i - correction
                         | otherwise -> int2Double i

Besides working around the Word->Double bug, I also removed the remaining lists to match the C version more closely.
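The `correction` term may look odd, so here is a small standalone sketch of the arithmetic, assuming a two's-complement `Int` with the same width as `Word`. Reinterpreting a `Word` `w` in the upper half of the range as an `Int` gives `i = w - 2^n`, where `n` is the word size. Since `minBound = -2^(n-1)`, `correction = 2 * minBound = -2^n`, and therefore `int2Double i - correction` recovers `w`, up to the rounding of `Double`. The sample values in `main` are my own choice, not from the post.

```haskell
module Main (main) where

import Data.Word (Word)
import GHC.Float (int2Double)

-- -2^n for an n-bit Int, computed in Double.
correction :: Double
correction = 2 * int2Double minBound

-- Same definition as in the listing above, with the intermediate type
-- spelled out: values that wrap to a negative Int are shifted back up.
word2Double :: Word -> Double
word2Double w = case (fromIntegral w :: Int) of
  i | i < 0     -> int2Double i - correction
    | otherwise -> int2Double i

main :: IO ()
main = do
  let samples = [0, 1, 12345, maxBound `div` 2, maxBound `div` 2 + 1, maxBound] :: [Word]
  -- Print each sample next to its conversion and the library conversion.
  mapM_ (\w -> print (w, word2Double w, fromIntegral w :: Double)) samples
```

The naive primop version in the comment above maps the upper half of the `Word` range to negative doubles, which is why it "isn't quite correct".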
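I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one to …
If the loop (for example …
In this case that does not bring performance on par with C, because there are other things going on (as others have commented), but it is a step in the right direction (and it is a step you can take without having to look at profiling output every time).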
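As an illustration of turning an unknown call into a known one, here is a minimal sketch that assumes the loop is small enough to inline. `foldIndices` and `sumOfSquares` are hypothetical names, not part of the code above. The loop takes its step function as an argument, and the `INLINE` pragma lets GHC copy the loop into each call site, where the step function is known and can be inlined and unboxed in turn.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Hypothetical helper: applies f to every index in [0, n) while threading
-- a strict accumulator. The recursion lives in the local worker `go`, so
-- foldIndices itself is non-recursive and can be inlined.
foldIndices :: (Double -> Int -> Double) -> Double -> Int -> Double
foldIndices f z n = go z 0
  where
    go !acc i
      | i < n     = go (f acc i) (i + 1)
      | otherwise = acc
{-# INLINE foldIndices #-}

-- At this call site the step function is a known lambda, so after inlining
-- there is no unknown call left inside the loop.
sumOfSquares :: Int -> Double
sumOfSquares = foldIndices (\acc i -> acc + fromIntegral (i * i)) 0

main :: IO ()
main = print (sumOfSquares 1000)
```

Without the pragma GHC may still decide to inline a function this small, but marking it makes the intent explicit and does not depend on the inliner's size heuristics.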