Benchmarking: can 1 CUDA core process more than 1 floating-point instruction per clock (Maxwell)?

The Wikipedia article "List of Nvidia GPUs", in its GeForce 900 Series section, says:

Single precision performance is calculated as 2 times the number of
shaders multiplied by the base core clock speed.

For example, for the GeForce GTX 970 we can calculate the performance as:

1664 cores * 1050 MHz * 2 = 3494 GFlops peak (3,494,400 MFlops)

We can see this value in the column "Processing Power (Peak) GFLOPS - Single Precision".
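The arithmetic can be reproduced with a short Python sketch. The function name `peak_gflops` and its parameter names are made up for illustration; the formula is only the one quoted from the Wikipedia footnote, not a statement about how the hardware works.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_core_per_clock: float) -> float:
    """Peak throughput in GFlops: cores * clock (MHz) * flops per core per clock / 1000."""
    mflops = cores * clock_mhz * flops_per_core_per_clock
    return mflops / 1000.0


# GeForce GTX 970 figures as quoted in the question.
gtx970 = peak_gflops(cores=1664, clock_mhz=1050, flops_per_core_per_clock=2)
print(f"{gtx970:.1f} GFlops")  # 3494.4 GFlops
```

The only term in question is the last argument, the factor of 2.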

But why must we multiply by 2?

The following is written here: http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/

SMM uses a quadrant-based design with four 32-core processing blocks
each with a dedicated warp scheduler capable of dispatching two
instructions per clock.

OK, nVidia Maxwell is a superscalar architecture that dispatches two instructions per clock, but can 1 CUDA core (FP32 ALU) process more than 1 instruction per clock?
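To make the mismatch behind this question concrete, here is a rough count of issue slots versus FP32 units in one SMM. It assumes that each dispatched instruction is a warp-wide instruction covering 32 threads; the constant names are illustrative only, and the sketch says nothing about which execution units those instructions actually go to.

```python
# Figures from the quoted blog post: 4 processing blocks per SMM,
# 32 cores per block, 2 instructions dispatched per scheduler per clock.
BLOCKS_PER_SMM = 4
CORES_PER_BLOCK = 32
DISPATCH_PER_SCHEDULER = 2
WARP_SIZE = 32  # assumption: one dispatched instruction covers a 32-thread warp

fp32_cores_per_smm = BLOCKS_PER_SMM * CORES_PER_BLOCK
thread_instr_per_clock = BLOCKS_PER_SMM * DISPATCH_PER_SCHEDULER * WARP_SIZE

print(fp32_cores_per_smm)                           # 128
print(thread_instr_per_clock)                       # 256
print(thread_instr_per_clock / fp32_cores_per_smm)  # 2.0
```

Under that assumption the schedulers can issue twice as many thread-level instructions per clock as there are FP32 units, so the second instruction has to go somewhere other than an already busy FP32 ALU.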

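We know that 1 CUDA core contains two units: an FP32 unit and an INT unit. But the INT unit has nothing to do with GFlops (floating-point operations per second).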

That is, one SMM contains:

128 FP32 units
128 INT units
32 SFU units
32 LD/ST units

To reach peak GFlops we should count only the 128 FP32 units and the 32 SFU units.

That is, if the 128 FP32 units and the 32 SFU units are all used simultaneously, each SMM can execute 160 floating-point instructions per clock.

So we must multiply by 1.25 (= 160/128) instead of 2:

1664 cores * 1050 MHz * 1.25 = 2184 GFlops peak
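The alternative estimate can be checked the same way. This sketch takes the unit counts listed above at face value and divides by the 128 FP32 units without rounding. The variable names are illustrative, and the premise that all FP32 units and SFUs can be busy in the same clock is this question's own assumption, not a confirmed hardware fact.

```python
# Per-SMM unit counts as listed in the question.
FP32_UNITS = 128
SFU_UNITS = 32

fp_capable = FP32_UNITS + SFU_UNITS  # units that can execute floating-point work
factor = fp_capable / FP32_UNITS     # floating-point instructions per core per clock

cores, clock_mhz = 1664, 1050        # GeForce GTX 970
estimate_gflops = cores * clock_mhz * factor / 1000
wiki_gflops = cores * clock_mhz * 2 / 1000

print(fp_capable)       # 160
print(factor)           # 1.25
print(estimate_gflops)  # 2184.0
print(wiki_gflops)      # 3494.4
```

The two results differ by a factor of 1.6, which is the gap this question is asking about.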

Why does the Wiki say that we must multiply Cores * MHz by 2?
