CUDA: global load transaction count with coalesced memory access

I wrote a simple kernel for an NVIDIA GTX 980 to test coalesced memory access by observing the transaction counts. The kernel is:

__global__
void copy_coalesced(float * d_in, float * d_out)
{
    // One thread per element: consecutive threads read and write consecutive floats.
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    d_out[tid] = d_in[tid];
}

When I run it with the following launch configuration:

#define BLOCKSIZE 32

int data_size = 10240; // always a multiple of BLOCKSIZE
int gridSize = data_size / BLOCKSIZE;

copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
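For anyone who wants to reproduce the measurement, a minimal host-side harness might look like the sketch below. Only the kernel, `BLOCKSIZE`, `data_size`, `gridSize`, and the launch line come from the question; the allocation, the input values, the `check` and `count_mismatches` helpers, and the verification step are assumptions added for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define BLOCKSIZE 32

// Same kernel as in the question.
__global__ void copy_coalesced(float *d_in, float *d_out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    d_out[tid] = d_in[tid];
}

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: count positions where the copy differs from the input.
static int count_mismatches(const std::vector<float> &a, const std::vector<float> &b)
{
    int mismatches = 0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] != b[i]) ++mismatches;
    return mismatches;
}

int main()
{
    const int data_size = 10240;                 // a multiple of BLOCKSIZE, so no bounds check
    const int gridSize = data_size / BLOCKSIZE;  // 320 blocks of one warp each
    const size_t bytes = data_size * sizeof(float);

    // Assumed input: element i holds the value i.
    std::vector<float> h_in(data_size), h_out(data_size, 0.0f);
    for (int i = 0; i < data_size; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    check(cudaMalloc(&d_in, bytes), "cudaMalloc d_in");
    check(cudaMalloc(&d_out, bytes), "cudaMalloc d_out");
    check(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice), "copy to device");

    copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost), "copy to host");

    const int mismatches = count_mismatches(h_in, h_out);
    printf("mismatches: %d\n", mismatches);

    check(cudaFree(d_in), "cudaFree d_in");
    check(cudaFree(d_out), "cudaFree d_out");
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as `copy_coalesced.cu` (a hypothetical name), it could be built with `nvcc -arch=sm_52 -o copy_coalesced copy_coalesced.cu` and profiled with `nvprof --metrics gld_transactions,gst_transactions,gld_transactions_per_request,gst_transactions_per_request ./copy_coalesced`.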

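Since the data accesses in the kernel are fully coalesced and the data type is float (4 bytes), the expected number of load/store transactions can be worked out as follows:

Load transaction size = 32 bytes

Floats loaded per transaction = 32 bytes / 4 bytes = 8

Transactions needed to load 10240 elements = 10240 / 8 = 1280 transactions

The same number of transactions is expected for the stores.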
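The same arithmetic can be written as a small compile-time check. This is only a sketch of the estimate above: the 32-byte transaction size is the assumption made in the question, and the names `kTransactionBytes` and `expected_transactions` are hypothetical. It is host-only code, so it compiles with `nvcc` or any C++11 compiler.

```cuda
#include <cstddef>

// Assumption from the question: each global memory transaction moves 32 bytes.
constexpr std::size_t kTransactionBytes = 32;

// Transactions needed to move n elements of elem_bytes each, assuming the
// accesses are fully coalesced and the total size is a multiple of 32 bytes.
constexpr std::size_t expected_transactions(std::size_t n, std::size_t elem_bytes)
{
    return n * elem_bytes / kTransactionBytes;
}

// 32 bytes / 4 bytes = 8 floats per transaction.
static_assert(kTransactionBytes / sizeof(float) == 8, "8 floats per transaction");

// 10240 floats / 8 floats per transaction = 1280 transactions.
static_assert(expected_transactions(10240, sizeof(float)) == 1280,
              "expected 1280 transactions for 10240 floats");
```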

However, the nvprof metrics show the following:

gld_transactions 2560
gst_transactions 1280

gld_transactions_per_request 8.0
gst_transactions_per_request 4.0
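As a consistency check on these numbers, the request counts implied by the metrics can be compared with the number of warps launched. The sketch below assumes that one request corresponds to one load or store instruction executed by one warp, and that both counters use the same 32-byte unit; the second point is the question's assumption and is not confirmed here. All identifiers are hypothetical.

```cuda
#include <cstddef>

// Launch parameters from the question.
constexpr std::size_t kThreads  = 10240;
constexpr std::size_t kWarpSize = 32;

// Values reported by nvprof in the question.
constexpr std::size_t kGldTransactions = 2560;
constexpr std::size_t kGstTransactions = 1280;
constexpr std::size_t kGldPerRequest   = 8;
constexpr std::size_t kGstPerRequest   = 4;

// Number of requests implied by a transaction count and a per-request average.
constexpr std::size_t implied_requests(std::size_t transactions, std::size_t per_request)
{
    return transactions / per_request;
}

// Each warp issues one load and one store, so 10240 / 32 = 320 requests of each kind.
static_assert(kThreads / kWarpSize == 320, "320 warps launched");
static_assert(implied_requests(kGldTransactions, kGldPerRequest) == 320, "320 load requests");
static_assert(implied_requests(kGstTransactions, kGstPerRequest) == 320, "320 store requests");

// A warp requests 32 * 4 = 128 bytes; at 32 bytes per transaction that is 4 transactions,
// which matches the store metric but is half of the reported load metric.
static_assert(kWarpSize * sizeof(float) / 32 == kGstPerRequest, "stores match the estimate");
static_assert(kGldPerRequest == 2 * kGstPerRequest, "loads report twice as many");
```

So the request counts agree for loads and stores (320 each), and the discrepancy is entirely in the transactions counted per load request.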

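I don't understand why loading the data takes twice as many transactions as storing it. Yet the load and store efficiency metrics both report 100%.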

What am I missing here?