cudaHostRegister returns cudaErrorInvalidValue on GPUs with compute capability 1.1

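I have a simple program that allocates an unsigned __int64 (8 bytes on the stack) and then tries to register that memory with the GPU using cudaHostRegister. The part of the program that makes this call is shown below: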

unsigned __int64 mem;              // 8 bytes on the stack
unsigned __int64 *pMem = &mem;
cudaError_t result;

// Try to register the 8-byte stack variable as mapped host memory.
result = cudaHostRegister(pMem, sizeof(unsigned __int64), cudaHostRegisterMapped);
if (result != cudaSuccess) {
    printf("Error in cudaHostRegister: %s.\n", cudaGetErrorString(result));
    return -1;
}

I compile in Visual Studio 2010 Premium with the nvcc flags compute_11 and sm_11. Everything works on my laptop, which has a Quadro K1000M with compute capability 3.0.

I recently switched to my desktop machine, where I tried running with a GeForce 8600 GT and a GeForce 9500 GT, both of which have compute capability 1.1.

According to NVIDIA's documentation for cudaHostRegister, cards with compute capability 1.1 or higher should support cudaHostRegisterMapped:

cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer(). This feature is available only on GPUs with compute capability greater than or equal to 1.1.

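After some searching, it appears that cudaHostRegisterMapped may require page-aligned memory. I thought this might be the difference between my 3.0 card and my 1.1 cards, so I masked the address to get a page-aligned address and passed the page size (4096 bytes) as the size argument, as shown below: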

unsigned __int64 mem;
unsigned __int64 *pMem = &mem;
unsigned __int64 memAddr = (unsigned __int64)pMem;
cudaError_t result;

// Round the address down to the start of its 4096-byte page.
pMem = (unsigned __int64 *)(memAddr & 0xFFFFFFFFFFFFF000);

// Register the whole page that contains mem.
result = cudaHostRegister(pMem, 4096, cudaHostRegisterMapped);
if (result != cudaSuccess) {
    printf("Error in cudaHostRegister: %s.\n", cudaGetErrorString(result));
    return -1;
}

The page-aligned version also works on my 3.0 card but fails on my 1.1 cards with the same result as before. cudaHostRegister returns cudaErrorInvalidValue, which is documented as:

one or more of the parameters passed to the API call is not within an acceptable range of values

I have not been able to find any more information on why this function might fail like this. Any help would be appreciated.
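For reference, the masking step above can be written a little more generally. The following is only a sketch: PageSpan and page_span are hypothetical names of my own, and it assumes the page size is a power of two (4096 bytes here). It rounds the start address down and the end address up, so the resulting range still covers an object that happens to cross a page boundary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: a page-aligned range that covers an object.
struct PageSpan {
    std::uintptr_t base;  // start address rounded down to a page boundary
    std::size_t bytes;    // length, a whole number of pages
};

// Assumes page_size is a power of two (4096 in this post) and bytes > 0.
inline PageSpan page_span(std::uintptr_t addr, std::size_t bytes,
                          std::size_t page_size) {
    const std::uintptr_t mask = static_cast<std::uintptr_t>(page_size) - 1;
    const std::uintptr_t base = addr & ~mask;                  // round down
    const std::uintptr_t end = (addr + bytes + mask) & ~mask;  // round up
    return PageSpan{base, static_cast<std::size_t>(end - base)};
}
```

With this helper, the call would take reinterpret_cast<void *>(span.base) as the pointer and span.bytes as the size. It only changes how the arguments are computed; I have no evidence that it changes the result on the 1.1 cards.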

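[Edit]
Following talonmies' reply, I used the deviceQuery executable that ships with NVIDIA's SDK to verify that at least one of my cards does support host memory mapping. That card is the 9500 GT; I have not run deviceQuery on the 8600 GT.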
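The same check can also be made from code rather than through deviceQuery. This is a minimal sketch, assuming the CUDA runtime headers and library are available. It prints the canMapHostMemory field of cudaDeviceProp for every device, which I believe is the property behind the deviceQuery line about host memory mapping.

```cpp
// Lists each CUDA device with its compute capability and whether it
// reports support for mapping host memory into the device address space.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::printf("Device %d: cudaGetDeviceProperties failed: %s\n",
                        dev, cudaGetErrorString(err));
            continue;
        }
        // canMapHostMemory is nonzero when the device can map host memory.
        std::printf("Device %d: %s, compute capability %d.%d, "
                    "canMapHostMemory = %d\n",
                    dev, prop.name, prop.major, prop.minor,
                    prop.canMapHostMemory);
    }
    return 0;
}
```

Running this on the desktop would show the value for the 8600 GT and the 9500 GT side by side, which covers the card I have not checked yet.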