Does OpenCL support a randomly accessed global queue buffer?

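I am writing a kernel that processes combinatorial data. Problems of this kind usually have a very large search space in which most of the processed data is junk, so is there a way to do the following:

(1) If the computed data passes some condition, put it into a global output buffer.

(2) Once the output buffer is full, send the data back to the host.

(3) The host takes a copy of the data from the buffer and clears it.

(4) A new buffer is then created for the GPU to fill.

For simplicity, the example can be described as a selective inner product, by which I mean: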

__global int buffer_counter; // Counts

void put_onto_output_buffer(float value, __global float *buffer, int size)
{
    // Put this value onto the global buffer or send a signal to the host
}

__kernel void
inner_product(
    __global const float *threshold,        // threshold
    __global const float *first_vector,     // 10000 float vector
    __global const float *second_vector,    // 10000 float vector
    __global float *output_buffer,          // 100 float vector
    __global const int *output_buffer_size) // size of the output buffer -- 100
{
    int id = get_global_id(0);
    float value = first_vector[id] * second_vector[id];
    if (value >= threshold[0])
        put_onto_output_buffer(value, output_buffer, output_buffer_size[0]);
}

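It depends on how often output is produced. If the frequency is high (a work-item writes output more often than not), then buffer_counter becomes a source of contention and causes a slowdown (incidentally, it has to be updated with atomic operations, which is why it is slow). In that case you are better off always writing the output and sorting out the real results afterwards.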
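One possible reading of "always write the output and sort it out afterwards" is sketched below in plain C rather than OpenCL C, so it can be compiled and checked on the host. The names product_with_flags, compact_kept and the keep flag array are hypothetical and not part of the question. Each iteration of the first loop stands in for one work-item, and in a real kernel that loop body would be indexed by get_global_id(0). The second function is assumed to run on the host after the buffers have been read back.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel stand-in: every work-item writes exactly one slot of its own,
 * so no counter and no atomics are needed. */
void product_with_flags(const float *a, const float *b, size_t n,
                        float threshold, float *values, unsigned char *keep)
{
    assert(values != NULL && keep != NULL);
    for (size_t id = 0; id < n; ++id) {
        values[id] = a[id] * b[id];
        keep[id] = (unsigned char)(values[id] >= threshold);
    }
}

/* Host-side pass: move the kept values to the front of the array, in place,
 * preserving their original order. Returns how many values were kept. */
size_t compact_kept(float *values, const unsigned char *keep, size_t n)
{
    size_t kept = 0;
    for (size_t id = 0; id < n; ++id) {
        if (keep[id])
            values[kept++] = values[id]; /* kept <= id, so nothing is lost */
    }
    return kept;
}
```

The cost of this variant is an output buffer as large as the input, which is why it only pays off when most work-items actually produce a result.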

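On the other hand, if writing output is fairly rare, then using an atomic position index makes good sense. Most work-items will do their computation, decide they have no output, and exit. Only the infrequent ones that do have output will contend for the atomic output position index, increment it serially, and write their output at their unique position. Your output memory will then hold the results compactly (in no particular order, so store the work-item ID too if you need it).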
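The atomic position index described above can be modelled with C11 atomics, as in the minimal sketch below. It is a host-side stand-in, not the kernel itself: atomic_fetch_add plays the role that the OpenCL C built-in atomic_inc would play on a volatile __global int counter passed in as a kernel argument. The extra counter parameter, the return values and the selective_product driver are assumptions added for illustration. Values that arrive after the buffer is full are simply dropped here, and the final counter value tells the caller whether that happened.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Reserve one slot in `out` and store `value` there.
 * Returns 1 if the value was stored, 0 if the buffer was already full. */
int put_onto_output_buffer(float value, float *out, int capacity,
                           atomic_int *counter)
{
    int slot = atomic_fetch_add(counter, 1); /* returns the previous value */
    if (slot >= capacity)
        return 0; /* overflow: the counter still records the attempt */
    out[slot] = value;
    return 1;
}

/* Each loop iteration stands in for one work-item of the kernel.
 * Returns the total number of passing values, which may exceed capacity. */
int selective_product(const float *a, const float *b, size_t n,
                      float threshold, float *out, int capacity)
{
    atomic_int counter;
    assert(capacity > 0);
    atomic_init(&counter, 0);
    for (size_t id = 0; id < n; ++id) {
        float value = a[id] * b[id];
        if (value >= threshold)
            put_onto_output_buffer(value, out, capacity, &counter);
    }
    return atomic_load(&counter);
}
```

If the returned count is larger than the capacity, the host knows some results were lost and could rerun with a larger buffer or a smaller range. On a device the slot order would be arbitrary, unlike in this sequential model.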

Again, read up on atomics, because the index needs to be atomic.