How to count the number of set bits in a 32-bit integer?

The 8 bits representing the number 7 look like this:

00000111

Three bits are set.

What algorithms determine the number of set bits in a 32-bit integer?

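This is known as the "Hamming weight", "popcount" or "sideways addition".

The "best" algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests one bit per cycle (citation needed).

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these operations in a tight loop. However, it can suffer because of the expense of a "cache miss", where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0's or mostly 1's, then there are very efficient algorithms for these scenarios.

I believe a very good general-purpose algorithm is the following, known as the "parallel" or "variable-precision SWAR algorithm". I have expressed this in a C-like pseudo language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):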

int numberOfSetBits(int i)
{
    // Java: use >>> instead of >>
    // C or C++: use uint32_t
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

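This has the best worst-case behaviour of any of the algorithms discussed, so it will deal efficiently with any usage pattern or values you throw at it.

This bitwise-SWAR algorithm could be parallelized to work on multiple vector elements at once, instead of on a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (For example, x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually to use a variable shuffle to do a table lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16-entry table held in a vector register.)

On Intel CPUs, the hardware 64-bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false-dependency problem on Intel.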

References:

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/magic/population%20count%20(one%20count)

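Related discussion

+1. The first line of your NumberOfSetBits() is very cool: just 3 instructions, instead of the 4 you would need if you separately masked out the even- and odd-numbered bits and added them (appropriately shifted) together.
Ha! I love the NumberOfSetBits() function, but good luck getting that through a code review. :-)
Maybe it should use unsigned int, to show easily that it is free of any sign-bit complications. Would uint32_t also be safer, in that you get what you expect on all platforms?
While checking some things for an answer to this question stackoverflow.com/questions/2709430/… I thought there was a mistake. The last line: return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; should be changed to: return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24; (first the sum, and then the bitwise &).
@Maciej: Can you please provide a case where it does not return the expected result?
@Matt Howells I am sorry. I got lost in all those parentheses. I had an error in my own implementation: it did not work for numbers > 15. I checked the Wikipedia article and there were those parentheses, so I thought that was my problem. It started working once the parentheses were fixed; apparently I fixed something else along the way. But I gained something: trying to reproduce the "bug" made me check the operator precedence. Thank you for your reply, and I apologize for the mistake.
Is there a portable version of this? How does it behave with 9-bit bytes or other unusual architectures?
@nonnb: Actually, as written, the code is buggy and needs maintenance. >> is implementation-defined for negative values. The argument needs to be changed (or cast) to unsigned, and since the code is 32-bit-specific, it should probably be using uint32_t.
@R… Who said it was C++?
True, as written it was "write-only" code. As a challenge, I decoded it as far as I could. I am not sure exactly what the additions and the multiplication achieve in this context. I edited the explanation into the answer as far as I got; I call on anyone smarter than me to complete it.
@Peter Hosey: Sorry, but I felt your comments did not add much value in the form posted, and the explanation would perhaps be better given in the comments.
It's not really magic. It's adding sets of bits, but doing so with some clever optimizations. The Wikipedia link given in the answer does a good job of explaining what's going on, but I'll go line by line. 1) Count the number of set bits in every pair of bits, putting that count in that pair of bits (you'll have 00, 01, or 10); the "clever" bit here is the subtract that avoids one mask. 2) Add pairs of those sums of bit pairs into their corresponding nibbles; nothing clever here, but each nibble will now have a value 0-4. (cont'd)
3) Too much going on in this line, but the nibbles are added up into bytes, then the multiply adds the bytes all into the high byte, which is then shifted down, leaving the result.
Another note: this extends to 64- and 128-bit registers by simply extending the constants appropriately. Interestingly (to me), those constants are also ~0 / 3, 5, 17, and 255; the first three being 2^n + 1. This all makes more sense the more you stare at it. :)
Another note, about the complaint above that >> is unspecified for negative ints: no matter how many one bits are shifted in, the results of the >> are masked in such a way as to discard the extended bits. The exception is the final >>, which is fine since the upper 2 bits of its input are mathematically guaranteed to be zero. This may apply to the Java question as well.
What if the input is a byte? I don't believe this works for me when casting from an input byte to int.
The solution by bcdabcd987 is the same algorithm before the optimization made it somewhat illegible…
mpi-inf.mpg.de/departments/rg1/teaching/advancedc-ws08/script/…
For an explanation of this algorithm, see pages 179-180 of the Software Optimization Guide for AMD Athlon 64 and Opteron Processors.
@R: Greggo is right. If they are shifted in, they are masked out. The final shift follows the multiplication that sums each byte into the high byte. Since the high byte holds the sum of the set bits, it never exceeds 32, which needs at most 6 bits. So the top bit is never set and the number is never negative.
I find it odd that this answer is not closer to this other answer in upvotes. Performance-wise it seems really solid, depending on the number of bits set. Maybe I am misunderstanding how fast this is.
For Java, you can just call Integer.bitCount() (since 1.5, but that was a while ago).
You can also use variadic templates in C++ to generate a neat lookup table. Its performance still isn't as good as the built-in, though. bitback.org/ldiamond/popcount/src
If you cite where you got it from, e.g. Numerical Recipes or The Art of Computer Programming, and someone still rejects it in code review as unmaintainable, that person should not be reviewing code. There are some bibles/oracles that we simply trust as sources of good algorithms, so that we do not go reinventing things ourselves.
Even a citation does not mean the code was copied correctly, or that the original did not contain a bug in an edge case.
It is not hard to write a test suite that simply runs through all 2^32 possible inputs and compares against a transparent reference implementation. Given that there are no loops or conditionals, and no variable shifts (so no room for UB if unsigned is used), that counts as proof. Not so much for the 64-bit version, though. By the way, __builtin_popcount() (gcc, clang) will generate something similar when there is no popcount instruction.
Shift operators and masks must not be used on signed quantities.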
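The exhaustive check suggested in the comments above could be sketched roughly as follows. This is only an illustration: the names popcount_swar, popcount_reference and count_mismatches are assumptions made here, not part of the original answer, and the range is a parameter so that a quick run need not cover all 2^32 inputs.

```c
#include <stdint.h>

/* SWAR popcount from the answer, taking uint32_t to avoid signed shifts. */
static unsigned popcount_swar(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);
    return (unsigned)((uint32_t)(((i + (i >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
}

/* Transparent reference implementation: test one bit at a time. */
static unsigned popcount_reference(uint32_t v)
{
    unsigned count = 0;
    for (; v != 0; v >>= 1)
        count += v & 1u;
    return count;
}

/* Number of inputs in [first, last] on which the two functions disagree.
 * Requires first <= last; (0, 0xFFFFFFFF) covers every 32-bit input. */
static uint64_t count_mismatches(uint32_t first, uint32_t last)
{
    uint64_t mismatches = 0;
    uint32_t v = first;
    for (;;) {
        if (popcount_swar(v) != popcount_reference(v))
            ++mismatches;
        if (v == last)   /* checked before incrementing, so last = 0xFFFFFFFF is safe */
            break;
        ++v;
    }
    return mismatches;
}
```

Calling count_mismatches(0, 0xFFFFFFFFu) runs the full sweep; a result of 0 means the two implementations agree on every input.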

Also consider the built-in functions of your compilers.

On the GNU compiler, for example, you can just use:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a CPU instruction to do the same job faster.
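A minimal sketch of how the builtin might be wrapped so the same source still builds on other compilers. The wrapper name popcount32 is hypothetical, and the sketch assumes unsigned int is at least 32 bits wide, as it is on common desktop targets.

```c
#include <stdint.h>

/* Hypothetical wrapper: use the GCC/Clang builtin when it is available,
 * otherwise fall back to a portable loop. */
static unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
#endif
}
```

Whether the builtin becomes a single instruction or a library call still depends on the target options discussed next.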

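The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have had popcount for years.

On x86, you can tell the compiler that it can assume support for the popcnt instruction with -mpopcnt, or with -msse4.2 to also enable the vector instructions that were added in the same generation. See the GCC x86 options. -march=<cpu> (for whatever CPU you want your code to assume and tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc's it is really an intrinsic for the hardware instruction and requires hardware support.

Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt).

But at least you get something portable that works everywhere, and with gcc/clang and the right target options, you get hardware popcount for architectures that support it.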

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value, unsigned >::type
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11. Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;  // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See the asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

powerpc64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

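This source isn't x86-specific or GNU-specific at all, but it only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without a single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.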

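Related discussion

I agree that this is good practice in general, but on XCode/OSX/Intel I found it to generate slower code than most of the suggestions posted here. See my answer for details.
AFAIK the only x86 CPU that can do a pop count in a single instruction is the AMD Phenom/Barcelona (Family 10h). Something like 4 cycles of latency?
The Intel i5/i7 has the SSE4 instruction POPCNT which does it, using general-purpose registers. GCC on my system does not emit that instruction using this intrinsic, I guess because there is no -march=nehalem option yet.
My gcc 4.4.1 emits the popcnt instruction if I compile with -msse4.2.
@Nils Pipenbrinck, that's right, it works.
Use C++'s std::bitset::count(). After inlining, this compiles to a single __builtin_popcount call.
Cool. Didn't know that. Which compiler are you using?
The intrinsics mentioned (_popcnt32/64) are located in immintrin.h and are available if the CPU has the POPCNT CPUID feature flag. So it is not formally part of SSE (at least that is my interpretation of the information in Intel's Intrinsics Guide 3.0.1).
@nlucaroni, yes. Times are changing. I wrote this answer in 2008. Nowadays we have native popcount, and the intrinsic will compile down to a single assembler statement if the platform allows it.
Yes, I know. Just updating the information.
Unfortunately, call/return can be costly, so for inner loops I prefer an inlined version of Matt's NumberOfSetBits().
@Michael: __builtin functions are not real functions that get called. If you compile for a target that supports it as a single instruction (e.g. x86 with -mpopcnt), it simply inlines to that instruction. Without that option, however, gcc may emit a call to a libgcc helper function such as __popcountdi2 instead of inlining the instructions. This is up to the compiler, though; clang 4.0 chooses to inline, as it does for std::bitset.count(). godbolt.org/g/tuemqt
To update the "popcnt instruction performance status" mentioned in the comments above: recent generations of Intel CPUs can issue 1 popcnt per cycle with a latency of 3 cycles, and the AMD Zen architecture can issue 4 (!!) popcnt instructions per cycle with a latency of 1 cycle. So we can say that on modern hardware popcnt is "very fast", and in AMD's case as fast as simple instructions like or and add.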

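In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution, which some have already provided, but I prefer readability over cleverness any time.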

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

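These do rely on specific data type sizes, though, so they are not that portable. But since many performance optimisations are not portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.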
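Since the table above is shown with its middle rows elided, here is one possible way to build an equivalent 256-entry table at run time instead of typing it out. This is a sketch under assumptions: the names bits_in_byte, init_bits_in_byte and bits_in_u32 are hypothetical, and the recurrence count(i) = (i & 1) + count(i >> 1) is just one convenient way to fill the table.

```c
#include <stddef.h>

static unsigned char bits_in_byte[256];   /* hypothetical table name */

/* Fill the table: the count for i is its low bit plus the count for i >> 1,
 * which has already been computed because i >> 1 < i. */
static void init_bits_in_byte(void)
{
    bits_in_byte[0] = 0;
    for (size_t i = 1; i < 256; ++i)
        bits_in_byte[i] = (unsigned char)((i & 1u) + bits_in_byte[i >> 1]);
}

/* Count the set bits in the low 32 bits of x, one byte at a time.
 * unsigned long is used because it is guaranteed to hold at least 32 bits. */
static unsigned bits_in_u32(unsigned long x)
{
    return (unsigned)(bits_in_byte[x & 0xffu]
                    + bits_in_byte[(x >> 8) & 0xffu]
                    + bits_in_byte[(x >> 16) & 0xffu]
                    + bits_in_byte[(x >> 24) & 0xffu]);
}
```

init_bits_in_byte() must be called once before the first lookup. Masking each byte explicitly avoids depending on the exact widths of short and int.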

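Related discussion

Instead of dividing by 2 and commenting it as "shift bits…", you should just use the shift operator (>>) and leave out the comment.
No, then he would have to comment "divide by 2"…
Wouldn't it make more sense to replace if ((value & 1) == 1) { count++; } with count += value & 1?
No, the best solution is not the most readable one in this case. Here the best algorithm is the fastest one.
That's entirely your opinion, @nikic, although you're obviously free to downvote me. There was no mention in the question of how to quantify "best"; the words "performance" or "fast" can be seen nowhere. That's why I opted for readable.
@paxdiablo: That's true, I changed my vote. I had not read the question carefully. (I had to make a ghost edit so that I could change the vote. Hope that's okay with you.)
I am reading this answer three years later, and I find it the best answer because it is readable and has more comments. Period.
Four years later, I'd use this. Mainly because I can understand it well enough to reproduce it. Good answer.
Sorry, lookup tables are slow.
@Sambatyon, I'd take your argument more seriously if you actually backed it up.
Better than all the other answers. Clear, intuitive, and understandable by a human without the help of a debugger.
Where the lookup table is large, it may indeed be slower than computing at run time because of CPU cache misses, but these tables don't look very big.
Readability? int cnt=0; for( int b=0; b < 32; b++) if( (val>>b) & 1) cnt++; but I don't think this question is that interesting until speed becomes the priority.
Downvote. In single-function leaf code that needs no maintenance, readability is not an issue. Performance is.
@Aikendum, you're entitled to your opinion, but mine is that almost all code eventually needs maintenance of some description. I just pray I never have to maintain yours :-)
If you had to read any code I wrote where readability was sacrificed, you would find it well commented to compensate. If your line of business involves maximum throughput, you do not sacrifice the bottom line and the time of a great many customers to save a single future programmer the time needed to figure out (or read up on) how your algorithm counts the set bits in a value. Have you not worked at several levels of code, or in several jobs? Nobody should think such a very general rule of thumb applies in all cases. Learn some manners before you go insulting other people's skills.
I suspect the bottleneck in the highly readable version is the if branch, which is unpredictable. Ponkodle's suggestion is almost as readable as the original but gets rid of the branch; low-hanging fruit that stays easy to read should be picked, IMHO. Either way, I'm glad there is an answer that prioritizes readability, even if that is not the question I came looking for.
@Aikendum, there is absolutely nothing in the question (or its history) asking for performance. An earlier version asked for the best way (unwise, since "best" was not defined), and my position has always been that readability trumps performance unless there is a specific requirement. You should also learn to read smileys; the comment about maintaining your code was a gentle jab, not an intended insult. I'm sorry about your retaliation, which implied I lack experience, but I won't take offence at it.
I made it quite clear that the answer was largely based on my opinions and preferences, and I saw little point in duplicating the performance-based answers already present. The earlier comments have covered this ground and, clearly, 172 people agreed, although admittedly that is a tiny fraction of the whole community, so it may not carry that much weight :-) Bottom line, I have no issue with how you vote; I just wanted to make sure you at least understood why I gave the answer. I think I've explained that as best I can.
Note that the function bitCount() cannot be used with a signed int, because if the sign bit is set the while loop never ends, since on some implementations ones rather than zeros are shifted in.
@Fabian, hence the unsigned in the function prototype :-)
I wanted to downvote this simply because of the comment it drew from @johannesschoub litb, but passing newcomers would think I was downvoting on the merits.
Even trying hard to see your point, there's a problem here: you say "the one that can be read by another programmer (or the original programmer two years later)", and then you write while (value > 0) .. and if ((value & 1) == 1) .., which means that by "another programmer" you actually mean "another programmer from another language" or "the original programmer who has forgotten C in the meantime", while tempting the casual reader into thinking you are advocating a simpler algorithm. It is idiomatic and easier to read to write while (value) .. and if (value & 1) ..; your real position is "code like an idiot".
There. Was that so hard? I stopped writing C about 15 years ago, and that is easier to read. Given your reputation, the only justification I can think of for your original answer is that you were trolling and set up some kind of idiot honeypot to warn others, haha.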

From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

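Executes in about 20 instructions (architecture dependent), with no branching. Hacker's Delight is delightful! Highly recommended.

Related discussion

I think the fastest way, without using lookup tables or popcount, is the following. It counts the set bits with just 12 operations.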

int popcount(unsigned int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

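It works because you can count the total number of set bits by dividing the word into two halves, counting the number of set bits in both halves, and then adding them up. This is also known as the Divide and Conquer paradigm. Let's get into detail…

v = v - ((v >> 1) & 0x55555555);

The number of set bits in two bits can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits.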

-----------------------------------------------
|   v    |   x = (v >> 1) & 0b0101   |  v - x |
-----------------------------------------------
  0b00             0b00                 0b00
  0b01             0b00                 0b01
  0b10             0b01                 0b01
  0b11             0b01                 0b10

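This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit number is >= 2 (0b10), then the AND produces 0b01, otherwise it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits; now we sum up that count in every 4 bits.

v & 0b00110011         // masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above results, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further…

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know, because of our previous operations, that every nibble holds the count of set bits in it. Let's look at an example. Suppose we have the byte 0b01000010. It means the high nibble has a count of 4 and the low nibble has a count of 2. Now we add those nibbles together.

0b01000010 + 0b00000100

This gives us the count of set bits in the byte in the low nibble of 0b01000110, and therefore we mask off the high four bits of every byte in the number (discarding them).

0b01000110 & 0x0F = 0b00000110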

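Now every byte holds the count of set bits in it. We need to add them all up together. The trick is to multiply the result by 0x01010101, which has an interesting property. If our number has four bytes, A B C D, the multiplication produces a new number with the bytes A+B+C+D B+C+D C+D D. A 4-byte number can have at most 32 bits set, which can be represented as 0b00100000.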
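The multiplication step can be tried in isolation. The helper below is a hypothetical illustration (sum_of_bytes is not a name from the answer); it assumes the four bytes together sum to less than 256, which always holds for per-byte bit counts, since each is at most 8.

```c
#include <stdint.h>

/* Sum the four bytes of x with a single multiply.
 * x * 0x01010101 equals x + (x << 8) + (x << 16) + (x << 24) modulo 2^32,
 * so the top byte of the product receives A + B + C + D. */
static unsigned sum_of_bytes(uint32_t x)
{
    uint32_t product = (uint32_t)(x * 0x01010101u);
    return (unsigned)(product >> 24);
}
```

For example, sum_of_bytes(0x01020304) yields 1 + 2 + 3 + 4 = 10, and the largest case in this algorithm, 0x08080808, yields 32.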

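All we need now is the first byte, which holds the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32-bit words but can easily be modified for 64-bit words.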
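One way the 64-bit modification might look, as a sketch rather than the author's own code: each mask is widened to 64 bits and the final shift becomes 56 so that the top byte is extracted. The name popcount64 is an assumption made for this example.

```c
#include <stdint.h>

/* 64-bit variant of the same divide-and-conquer steps. */
static unsigned popcount64(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);                            /* 2-bit counts */
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);  /* 4-bit counts */
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            /* 8-bit counts */
    return (unsigned)((v * 0x0101010101010101ULL) >> 56);                  /* sum of the 8 bytes */
}
```

The largest possible result, 64, still fits in the top byte, so the multiply cannot overflow into a wrong answer.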

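Related discussion

If you happen to be using Java, the built-in method Integer.bitCount will do that.

Related discussion

I got bored, and timed a billion iterations of three approaches. The compiler is gcc -O3. The CPU is whatever they put in the first-generation MacBook Pro.

Fastest is the following, at 3.7 seconds: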

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}

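Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling "sideways addition" approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was much slower, and I got bored of waiting for it to complete.

So if you care about performance above all else, use the first approach. If you care, but not enough to spend 64 KB of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

Related discussion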

unsigned int count_bit(unsigned int x)
{
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
    x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
    return x;
}

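Let me explain this algorithm.

This algorithm is based on divide and conquer. Suppose there is an 8-bit integer 213 (11010101 in binary); the algorithm works like this (each time merging two neighbouring blocks):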

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | <- x
| 1 0 | 0 1 | 0 1 | 0 1 | <- first time merge
| 0 0 1 1 | 0 0 1 0 | <- second time merge
| 0 0 0 0 0 1 0 1 | <- third time ( answer = 00000101 = 5)
+-------------------------------+
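The diagram can be followed in code with an 8-bit version of the same merges. This is a sketch for illustration only; the function names merge_pairs, merge_nibbles and merge_byte are assumptions, and each function performs exactly one row of the diagram.

```c
/* One function per merge step, working on a value assumed to fit in 8 bits. */

/* First merge: every 2-bit field receives the count of its two bits. */
static unsigned merge_pairs(unsigned x)
{
    return (x & 0x55u) + ((x >> 1) & 0x55u);
}

/* Second merge: every 4-bit field receives the sum of its two 2-bit counts. */
static unsigned merge_nibbles(unsigned x)
{
    return (x & 0x33u) + ((x >> 2) & 0x33u);
}

/* Third merge: the byte receives the sum of its two 4-bit counts. */
static unsigned merge_byte(unsigned x)
{
    return (x & 0x0Fu) + ((x >> 4) & 0x0Fu);
}
```

Starting from 213 (11010101), the three calls give 10010101, then 00110010, then 00000101, matching the three merge rows of the diagram.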

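Related discussion

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3, using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, and using rdtsc for timing (clock-cycle precise).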

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

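The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5 s total elapsed for both together on a 2.4 GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128-bit SIMD registers there is yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced online in many places, and it's amenable to comprehensive unit tests. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8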

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y;
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32);
    return x & 0xFF;
}

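That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That latter number summed 128 billion bits and corresponds to 5.9 s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more out-of-order pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, and all combined sum to 256.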

inline int pop4(unsigned long x, unsigned long y,
                unsigned long u, unsigned long v)
{
    enum { m1 = 0x5555555555555555,
           m2 = 0x3333333333333333,
           m3 = 0x0F0F0F0F0F0F0F0F,
           m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y;
    u = u + v;
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u;
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4;
    x = x + (x >> 32);
    return x & 0x000001FF;
}

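I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6 gc, for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.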

hitime b4 = rdtsc();
for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i)
    sum += pop4 (i,  i^1, ~i, i|1);
hitime e4 = rdtsc();

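256 billion bits summed in 8.17 s elapsed. That works out to 1.02 s for 32 billion bits, as benchmarked in the 16-bit table lookup. It can't be compared directly, because the other bench doesn't give a clock speed, but it looks like I've soundly beaten the 64 KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. It came out to 22.8 gc, 384 billion bits summed in 9.5 s elapsed. So there's another 20%, now at 800 ms for 32 billion bits.

Related discussion

Why not iteratively divide by 2?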

count = 0
while n > 0
    if (n % 2) == 1
        count += 1
    n /= 2

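I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue that "best" should have an element of clarity.

Related discussion

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.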

unsigned int bitCount(unsigned int x)
{
    x = ((x >> 1) & 0b01010101010101010101010101010101)
       + (x       & 0b01010101010101010101010101010101);
    x = ((x >> 2) & 0b00110011001100110011001100110011)
       + (x       & 0b00110011001100110011001100110011);
    x = ((x >> 4) & 0b00001111000011110000111100001111)
       + (x       & 0b00001111000011110000111100001111);
    x = ((x >> 8) & 0b00000000111111110000000011111111)
       + (x       & 0b00000000111111110000000011111111);
    x = ((x >> 16)& 0b00000000000000001111111111111111)
       + (x       & 0b00000000000000001111111111111111);
    return x;
}

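The first step adds the even bits to the odd bits, producing a sum of bits in each pair. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until the final count takes up the entire int.

Related discussion

For a happy medium between a 2^32 lookup table and iterating through each bit individually: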

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/countbits

Related discussion

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

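It's not the fastest or best solution, but I came across the same question and started thinking it over. Finally I realized that if you look at the problem from the mathematical side and draw a graph, you find that it is a function with a periodic part, and then you notice the difference between the periods… so here you go: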

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}

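Related discussion

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp. 11-12 (although there was a brief reference in Volume 2, 4.6.3-(7)).

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960), Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., to have a small number of ones) and one for the opposite case.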
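To illustrate the sparse/dense distinction, here is a sketch of one common pair of loops. It is not claimed to be the exact formulation in Wegner's paper: the sparse loop clears one set bit per iteration, and the dense loop (an assumed counterpart) sets one clear bit per iteration and counts down from 32. The function names are hypothetical.

```c
#include <stdint.h>

/* Sparse case: one iteration per set bit. */
static unsigned popcount_sparse(uint32_t n)
{
    unsigned count = 0;
    while (n != 0) {
        n &= n - 1;          /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Dense case: one iteration per clear bit. */
static unsigned popcount_dense(uint32_t n)
{
    unsigned count = 32;
    while (n != 0xFFFFFFFFu) {
        n |= n + 1;          /* set the lowest clear bit */
        --count;
    }
    return count;
}
```

Both return the same value for every input; they differ only in how many iterations they need, so the better choice depends on how many ones the data is expected to contain.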

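A few open questions:

What if the number is negative?

If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows: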

count = 0
while n != 0
    if ((n % 2) == 1 || (n % 2) == -1)
        count += 1
    n /= 2
return count
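With C-style truncating division, the loop above effectively counts the bits of the magnitude, so -1 would give 1. If the intent is instead to count the bits of the two's-complement representation, one option is to convert to unsigned first. The sketch below assumes that intent, uses a hypothetical name (popcount_int), and assumes unsigned int has no padding bits.

```c
#include <limits.h>

/* Count the set bits in the representation of n, including the sign bit. */
static unsigned popcount_int(int n)
{
    unsigned v = (unsigned)n;   /* conversion to unsigned is well defined (modulo 2^N) */
    unsigned count = 0;
    while (v != 0) {
        count += v & 1u;
        v >>= 1;                /* logical shift: zeros come in from the left */
    }
    return count;
}
```

With a 32-bit int, popcount_int(-1) is 32, because every bit of the representation is set.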

Now, to overcome the second problem, we can write the algorithm like this:

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

For the complete reference see:

http://gursaha.freeoda.com/miscellaneous/integerbitcount.html

private int get_bits_set(int v)
{
    int c; // c accumulates the total bits set in v
    for (c = 0; v != 0; c++) // != 0 rather than > 0, so that negative inputs are counted too
    {
        v &= v - 1; // clear the least significant bit set
    }
    return c;
}

I use the code below, which is more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

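Logic: n & (n-1) resets the last set bit of n.

P.S.: I know this is not an O(1) solution, albeit an interesting one.

Related discussion

I think Brian Kernighan's method will be useful too… It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, it will go through the loop only once.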

int countSetBits(unsigned int n) {
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n > 0; n = n & (n - 1)) c++;
    return c;
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method"was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

If you're using C++, another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
    // return the least significant bit plus the result of calling ourselves with
    // .. the shifted value
    return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
    return val & 0x1;
}

Usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

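You could of course further expand this template to use different types (even auto-detecting the bit size), but I've kept it simple for clarity.

Edit: I forgot to mention that this is good because it should work in any C++ compiler, and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find).

Related discussion

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It assumes that small constants (such as 6, 12, 24) cost nothing to load into registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5 and 5 x 3 = 15 instructions is a quasi-lower bound.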

#define BitCount(X,Y)           \
    Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
    Y = ((Y + (Y >> 3)) & 030707070707); \
    Y =  (Y + (Y >> 6)); \
    Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is a secret to the first and most complex step:

input    output
AB       CD             Note
00       00             = AB
01       01             = AB
10       01             = AB - (A >> 1) & 0x1
11       10             = AB - (A >> 1) & 0x1

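So if I take the first column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.

Don Gillies

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and it has a constant execution time. The code is also very short.

But if speed is the major factor and not the code size, then I think the following can be faster: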

static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
static int bitCountOfByte( int value ){
    return BIT_COUNT[ value & 0xFF ];
}

static int bitCountOfInt( int value ){
    return bitCountOfByte( value )
         + bitCountOfByte( value >> 8 )
         + bitCountOfByte( value >> 16 )
         + bitCountOfByte( value >> 24 );
}

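I think this will not be faster for a 64-bit value, but it can be faster for a 32-bit value.

Related discussion

I always use this in competitive programming; it's easy to write and efficient: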

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}

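I found an implementation of bit counting in an array that uses SIMD instructions (SSSE3 and AVX2). It performs 2-2.5 times better than using the popcnt64 intrinsic function.

SSSE3 version: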

#include <smmintrin.h>
#include <stddef.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

//Note: size is assumed to be a multiple of 16 bytes
uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum = _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

Java JDK 1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

Also check,

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

//Beginning with the value 1, rotate left 16 times
n = 1;
for (int i = 0; i < 16; i++) {
    n = Integer.rotateLeft(n, 1);
    System.out.println(n);
}

Related discussion

I particularly like this example from the fortune file:

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777)  \
                            - (((x)>>2)&0x33333333)  \
                            - (((x)>>3)&0x11111111))

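I like it best because it's so pretty!

Related discussion

A fast C# solution using a pre-calculated table of byte bit counts, with branching on the input size.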

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS =
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

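Related discussion

Here is a portable module (ANSI C) which can benchmark each of your algorithms on any architecture.

Your CPU has 9-bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte-wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure out a way to make the "Hacker's Delight" algorithm portable, feel free to add it in.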

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Weight of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTable: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder.
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String representations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

#include <limits.h>

#include"bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
* values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
int count;

/* Starting with:
* 1100 - 1 == 1011, 1100 & 1011 == 1000
* 1000 - 1 == 0111, 1000 & 0111 == 0000
*/
for ( count = 0; val; ++count )
val &= val - 1;

return count;
}

/* Looks up each byte of the integer in a lookup table.
*
* The first time the function is called it initializes the lookup table.
*/
static int _tableBitCount( unsigned int val ) {
int bCount = 0;

if ( !_lookupTableInitialized ) {
unsigned int i;
for ( i = 0; i != UCHAR_MAX + 1; ++i )
_bitCountTable[i] =
( unsigned char )_defaultBitCount( i );

_lookupTableInitialized = 1;
}

for ( ; val; val >>= CHAR_BIT )
bCount += _bitCountTable[val & UCHAR_MAX];

return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = {"onTheFly","lookupTable" };

void setStrategy( enum strategy s ) {
switch ( s ) {
case onTheFly:
_bitcount = _defaultBitCount;
break;
case lookupTable:
_bitcount = _tableBitCount;
break;
case strategyCount:
break;
}
}

/* Just a forwarding function which will call whichever version of the
* algorithm has been selected by the client
*/
int bitcount( unsigned int val ) {
return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo random numbers to benchmark each Hamming
* Weight algorithm.
*/
void benchmark( int reps ) {
clock_t start, stop;
int i, j;
static const int iterations = 1000000;

for ( j = 0; j != strategyCount; ++j ) {
setStrategy( j );

srand( 257 );

start = clock( );

for ( i = 0; i != reps * iterations; ++i )
bitcount( rand( ) );

stop = clock( );

printf
("\n\t%d pseudo-random integers using %s: %f seconds\n\n",
reps * iterations, strategyNames[j],
( double )( stop - start ) / CLOCKS_PER_SEC );
}
}

int main( void ) {
int option;

while ( 1 ) {
printf("Menu Options\n"
"\t1.\tPrint the Hamming Weight of an Integer\n"
"\t2.\tBenchmark Hamming Weight implementations\n"
"\t3.\tExit ( or cntl-d )\n\n\t" );

if ( scanf("%d", &option ) == EOF )
break;

switch ( option ) {
case 1:
printf("Please enter the integer:" );
if ( scanf("%d", &option ) != EOF )
printf
("The Hamming Weight of %d ( 0x%X ) is %d\n\n",
option, option, bitcount( option ) );
break;
case 2:
printf
("Please select number of reps ( in millions ):" );
if ( scanf("%d", &option ) != EOF )
benchmark( option );
break;
case 3:
goto EXIT;
break;
default:
printf("Invalid option\n" );
}

}

EXIT:
printf("\n" );

return 0;
}

#endif

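Related discussion

There are many algorithms for counting the set bits, but I think the best one is the fastest one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions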

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;

// option 3, for at most 32-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) %
0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

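This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

32-bit or not? I just came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.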

public static int bitCount(int n) {
    int count = 0;
    for (int i = n; i != 0; i = i >>> 1) { // >>> so that negative inputs terminate
        count += i & 1;
    }
    return count;
}

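I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".

Related discussion

What you can do is

while(n){
    n=n&(n-1);
    count++;
}

The logic behind this is that the bits of n-1 are inverted from the rightmost set bit of n onwards. If n = 6, i.e. 110, then 5 is 101: the bits are inverted from the rightmost set bit of n. So if we AND these two, we make the rightmost set bit 0 in every iteration and always move on to the next rightmost set bit, hence counting the set bits. The worst-case time complexity is O(log n), when every bit is set.

Personally I use this: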

public static int myBitCount(long L) {
    int count = 0;
    while (L != 0) {
        count++;
        L ^= L & -L;
    }
    return count;
}
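The expression L & -L in the loop above isolates the lowest set bit, and the XOR then removes it. A C sketch of the same idea follows, using unsigned arithmetic so the negation is well defined; the names lowest_set_bit and popcount_by_isolation are assumptions made for this example.

```c
#include <stdint.h>

/* Isolate the lowest set bit of x (returns 0 when x is 0).
 * 0u - x is the unsigned counterpart of -L in the Java version. */
static uint64_t lowest_set_bit(uint64_t x)
{
    return x & (0u - x);
}

/* Count set bits by repeatedly removing the lowest one. */
static unsigned popcount_by_isolation(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x ^= lowest_set_bit(x);
        ++count;
    }
    return count;
}
```

For instance, lowest_set_bit(12) is 4, because 12 is 1100 in binary and its lowest set bit has the value 4.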

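Related discussion

Here is a solution that has not been mentioned so far, using bitfields. The following program counts the set bits in an array of 100000000 16-bit integers using 4 different methods. Timing results are given in parentheses (on MacOSX, with gcc -O3):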

#include <stdio.h>
#include <stdlib.h>

#define LENGTH 100000000

typedef struct {
unsigned char bit0 : 1;
unsigned char bit1 : 1;
unsigned char bit2 : 1;
unsigned char bit3 : 1;
unsigned char bit4 : 1;
unsigned char bit5 : 1;
unsigned char bit6 : 1;
unsigned char bit7 : 1;
} bits;

unsigned char sum_bits(const unsigned char x) {
const bits *b = (const bits*) &x;
return b->bit0 + b->bit1 + b->bit2 + b->bit3 \
+ b->bit4 + b->bit5 + b->bit6 + b->bit7;
}

int NumberOfSetBits(int i) {
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

#define out(s) \
printf("bits set: %lu\nbits counted: %lu\n", 8*LENGTH*sizeof(short)*3/4, s);

int main(int argc, char **argv) {
unsigned long i, s;
unsigned short *x = malloc(LENGTH*sizeof(short));
unsigned char lut[65536], *p;
unsigned short *ps;
int *pi;

/* set 3/4 of the bits */
for (i=0; i<LENGTH; ++i)
x[i] = 0xFFF0;

/* sum_bits (1.772s) */
for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=sum_bits(*p++));
out(s);

/* NumberOfSetBits (0.404s) */
for (i=LENGTH*sizeof(short)/sizeof(int), pi=(int*)x, s=0; i--; s+=NumberOfSetBits(*pi++));
out(s);

/* populate lookup table */
for (i=0, p=(unsigned char*) &i; i<sizeof(lut); ++i)
lut[i] = sum_bits(p[0]) + sum_bits(p[1]);

/* 256-bytes lookup table (0.317s) */
for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=lut[*p++]);
out(s);

/* 65536-bytes lookup table (0.250s) */
for (i=LENGTH, ps=x, s=0; i--; s+=lut[*ps++]);
out(s);

free(x);
return 0;
}

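While the bitfield version is very readable, the timing results show that it is more than 4x slower than NumberOfSetBits(). The lookup-table-based implementations are still somewhat faster, in particular with a 65 kB table.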
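
For readers who only want the lookup-table part, here is a minimal sketch of a 256-entry table applied to a 32-bit word. It is not taken from the program above: the names bits_in_byte, init_bits_in_byte and popcount_lut8 are made up for this example, and the table is filled with a simple recurrence instead of the bitfield helper.

```c
#include <stdint.h>

static unsigned char bits_in_byte[256];

/* Fill the table once: entry i is entry i/2 plus the lowest bit of i. */
void init_bits_in_byte(void)
{
    bits_in_byte[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_in_byte[i] = (unsigned char)(bits_in_byte[i >> 1] + (i & 1));
}

/* Four table lookups, one per byte of the 32-bit word. */
unsigned popcount_lut8(uint32_t x)
{
    return bits_in_byte[x & 0xff]
         + bits_in_byte[(x >> 8) & 0xff]
         + bits_in_byte[(x >> 16) & 0xff]
         + bits_in_byte[x >> 24];
}
```

init_bits_in_byte() has to be called once before the first lookup. A 256-byte table is much more likely to stay in cache than a 65536-byte one, which is the trade-off the timings above hint at.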

You can use the built-in function __builtin_popcount(). It is not part of standard C++, but it is a built-in function of the GCC compiler. It returns the number of set bits in an integer.

int __builtin_popcount (unsigned int x);

Reference: Bit Twiddling Hacks
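
A small sketch of how the builtin might be wrapped so the code still compiles elsewhere. The wrapper name popcount_builtin and the fallback loop are my own additions; only __builtin_popcount itself comes from the compiler.

```c
#include <stdint.h>

/* Uses the compiler builtin when it is available, otherwise a plain loop
   that clears one set bit per iteration. */
unsigned popcount_builtin(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
#endif
}
```

As far as I know, GCC on x86 only emits a single popcnt instruction for this when the target allows it (for example with -mpopcnt or a suitable -march); otherwise it falls back to a software routine.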

In Java 8 or 9, just call EDOCX1 0.

Another Hamming weight algorithm, if you are on a BMI2-capable CPU:

the_weight=__tzcnt_u64(~_pext_u64(data[i],data[i]));

Have fun!
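
To see why that one-liner counts bits, here is a portable sketch that emulates the two instructions in plain C. soft_pext and soft_tzcnt are hypothetical helpers written for this example, not the real intrinsics, and they are far slower than the hardware versions.

```c
#include <stdint.h>

/* Software model of pext: gather the bits of src selected by mask and
   pack them into the low end of the result. */
uint64_t soft_pext(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    unsigned k = 0;
    for (unsigned b = 0; b < 64; b++) {
        if ((mask >> b) & 1u) {
            out |= ((src >> b) & 1u) << k;
            k++;
        }
    }
    return out;
}

/* Software model of tzcnt: number of trailing zero bits, 64 for x == 0. */
unsigned soft_tzcnt(uint64_t x)
{
    unsigned n = 0;
    while (n < 64 && !((x >> n) & 1u))
        n++;
    return n;
}

/* pext(x, x) packs all set bits of x into the low end, so the result has
   exactly popcount(x) low ones; inverting it turns them into trailing zeros. */
unsigned weight_via_pext(uint64_t x)
{
    return soft_tzcnt(~soft_pext(x, x));
}
```

With the real intrinsics the same idea is a handful of instructions, but it only runs on CPUs that support BMI2.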

int countBits(int x)
{
    int n = 0;
    if (x) do n++;
    while (x = x & (x - 1));
    return n;
}

Or:

int countBits(int x) { return (x)? 1+countBits(x&(x-1)): 0; }

int bitcount(unsigned int n)
{
int count=0;
while(n)
{
count += n & 0x1u;
n >>= 1;
}
return count;
}

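The iterated "count" runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. It is useful if the 1s (the set bits) are sparse and sit among the least significant bits.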
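
To make the early-termination point concrete, this sketch counts how many times the body of such a shift loop runs. The function name shift_loop_iterations is made up for the example.

```c
#include <stdint.h>

/* Number of times the body of the shift loop runs for n: the position of
   the highest set bit plus one (0 for n == 0). */
unsigned shift_loop_iterations(uint32_t n)
{
    unsigned iterations = 0;
    while (n) {
        n >>= 1;
        iterations++;
    }
    return iterations;
}
```

0x5 and 0x80000001 both have two set bits, but the first needs 3 iterations and the second needs 32, which is why the loop only pays off when the set bits are near the low end.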

Here is some sample code that might be useful.

private static final int[] bitCountArr = new int[]{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
private static final int firstByteFF = 255;
public static final int getCountOfSetBits(int value){
int count = 0;
for(int i=0;i<4;i++){
if(value == 0) break;
count += bitCountArr[value & firstByteFF];
value >>>= 8;
}
return count;
}

#!/usr/local/bin/perl

$c=0x11BBBBAB;
$count=0;
$m=0x00000001;
for($i=0;$i<32;$i++)
{
$f=$c & $m;
if($f == 1)
{
$count++;
}
$c=$c >> 1;
}
printf("%d",$count);

I've done it with a Perl script. The number used is $c=0x11BBBBAB.
B = three 1s
A = two 1s
So in total, digit by digit:
1+1+3+3+3+3+2+3=19
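
The digit-by-digit sum above can also be done mechanically with a 16-entry table, one entry per hex digit. This is only a sketch of that idea in C, not a translation of the Perl script; bits_in_nibble and popcount_hex_digits are names I made up.

```c
#include <stdint.h>

/* Set bits in each hex digit 0x0..0xF (for example B = 1011 has 3). */
static const unsigned char bits_in_nibble[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Adds up the table entry for each of the eight hex digits of x. */
unsigned popcount_hex_digits(uint32_t x)
{
    unsigned count = 0;
    for (int digit = 0; digit < 8; digit++) {
        count += bits_in_nibble[x & 0xF];
        x >>= 4;
    }
    return count;
}
```

For 0x11BBBBAB this adds 3, 2, 3, 3, 3, 3, 1, 1 (lowest digit first) and arrives at the same 19.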

A simple way that works well for a small number of bits is something like this (for 4 bits in this example):

(I&1)+(I&2)/2+(I&4)/4+(I&8)/8

Would others recommend this as a simple solution for a small number of bits?
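
Written as a C function, the expression might look like this. The name count_low4 and the unsigned parameter type are assumptions for the sketch.

```c
/* Sum of the four low bits of i, written out term by term.
   Each term is 0 or 1: the masked bit divided by its place value. */
unsigned count_low4(unsigned i)
{
    return (i & 1) + (i & 2) / 2 + (i & 4) / 4 + (i & 8) / 8;
}
```

Bits above the fourth are ignored, so count_low4(0x10) is 0. Compilers typically turn the divisions by powers of two into shifts, so this costs about the same as writing (i >> k) & 1 for each bit.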

How about converting the integer to a binary string and counting the ones?

PHP solution:

substr_count( decbin($integer), '1' );

Here is something that works in PHP (on 32-bit builds, PHP integers are 32-bit signed, hence 31 usable bits):

function bits_population($nInteger)
{

$nPop=0;
while($nInteger)
{
$nInteger^=(1<<(floor(1+log($nInteger)/log(2))-1));
$nPop++;
}
return $nPop;
}

I haven't seen this approach anywhere:

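int nbits(unsigned char v) {
    /* Unsigned constants keep the multiplications in unsigned arithmetic,
       so an input such as 0xff does not overflow a signed int. */
    return ((((v - ((v >> 1) & 0x55)) * 0x1010101u) & 0x30c00c03u) * 0x10040041u) >> 0x1c;
}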

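It works per byte, so it has to be called four times for a 32-bit integer. It is derived from sideways addition, but it uses two 32-bit multiplications to reduce the number of instructions to only 7.

Most current C compilers will optimize this function using SIMD (SSE2) instructions when it is clear that the number of requests is a multiple of 4, and then it becomes quite competitive. It is portable, can be defined as a macro or inline function, and does not need data tables.

This approach can be extended to work on 16 bits at a time, using 64-bit multiplications. However, it fails when all 16 bits are set, returning zero, so it can only be used when the input value 0xffff does not occur. It is also slower because of the 64-bit operations and does not optimize well.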
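
Here is a sketch of the "call it four times" usage for a 32-bit value. It restates the per-byte function step by step in unsigned 32-bit arithmetic, which is my own adjustment, and the names nbits_u and popcount32_bytes are invented for the example.

```c
#include <stdint.h>

/* Per-byte count, same constants as above, one step per line. */
unsigned nbits_u(uint8_t v)
{
    uint32_t t = v - ((v >> 1) & 0x55u);  /* four 2-bit counts in one byte */
    t *= 0x01010101u;                     /* copy that byte into all four bytes */
    t &= 0x30c00c03u;                     /* keep a different 2-bit count from each byte */
    t *= 0x10040041u;                     /* add the four counts together in bits 28..31 */
    return t >> 28;
}

/* One call per byte of the 32-bit word. */
unsigned popcount32_bytes(uint32_t x)
{
    return nbits_u((uint8_t)x)
         + nbits_u((uint8_t)(x >> 8))
         + nbits_u((uint8_t)(x >> 16))
         + nbits_u((uint8_t)(x >> 24));
}
```

The explicit uint32_t makes the wrap-around of the multiplications well defined, which matters for bytes such as 0xff where the intermediate product has its top bit set.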

You can do something like this:

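#include <stdio.h>

int countSetBits(int n)
{
    n = ((n & 0xAAAAAAAA) >> 1) + (n & 0x55555555);
    n = ((n & 0xCCCCCCCC) >> 2) + (n & 0x33333333);
    n = ((n & 0xF0F0F0F0) >> 4) + (n & 0x0F0F0F0F);
    n = ((n & 0xFF00FF00) >> 8) + (n & 0x00FF00FF);
    /* Add the two 16-bit halves so that all 32 bits are counted. */
    n = ((n & 0xFFFF0000) >> 16) + (n & 0x0000FFFF);
    return n;
}

int main()
{
    int n = 10;
    printf("Number of set bits: %d", countSetBits(n));
    return 0;
}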

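See: http://ideone.com/jhwcx

It works as follows:

First, all the even bits are shifted right and added to the odd bits, which counts the bits in each group of two. Then we do the same with groups of two, then four, and so on.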
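
The grouping steps can also be written as a loop over the masks, which makes the doubling of the group width explicit. This is a sketch under my own naming (popcount_tree, mask); it uses uint32_t and runs all five steps needed for 32 bits.

```c
#include <stdint.h>

/* Step k adds neighbouring groups that are 2^k bits wide:
   1-bit groups into pairs, pairs into nibbles, and so on up to 16 + 16. */
uint32_t popcount_tree(uint32_t n)
{
    static const uint32_t mask[5] = {
        0x55555555u, 0x33333333u, 0x0F0F0F0Fu, 0x00FF00FFu, 0x0000FFFFu
    };
    for (int k = 0; k < 5; k++) {
        unsigned shift = 1u << k;
        n = ((n >> shift) & mask[k]) + (n & mask[k]);
    }
    return n;
}
```

For n = 10 (binary 1010) the first step gives 0101, two pairs holding one bit each, and the second step adds them to 2; the remaining steps leave the value unchanged.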

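// How about the following:
public int CountBits(int value)
{
    int count = 0;
    // Examine all 32 bits; a fixed count also handles negative values.
    for (int i = 0; i < 32; i++)
    {
        if ((value & 1) != 0)
            count++;
        value >>= 1; // shift right to bring the next bit into position 0
    }
    return count;
}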

I'll give two algorithms to answer this question:

package countSetBitsInAnInteger;

import java.util.Scanner;

public class UsingLoop {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        try {
            System.out.println("Enter an integer to check for set bits in it");
            int n = in.nextInt();
            System.out.println("Using a loop, we get the number of set bits as: " + usingLoop(n));
            System.out.println("Using Brian Kernighan's algorithm, we get the number of set bits as: " + usingBrianKernighan(n));
        } finally {
            in.close();
        }
    }

    private static int usingBrianKernighan(int n) {
        int count = 0;
        // n != 0 (rather than n > 0) so that negative inputs are counted too
        while (n != 0) {
            n &= (n - 1);
            count++;
        }
        return count;
    }
    /*
    Analysis:
    Time complexity = O(number of set bits), at most O(lg n)
    Space complexity = O(1)
    */

    private static int usingLoop(int n) {
        int count = 0;
        for (int i = 0; i < 32; i++) {
            if ((n & (1 << i)) != 0)
                count++;
        }
        return count;
    }
    /*
    Analysis:
    Time complexity = O(32), i.e. constant for a 32-bit int
    Space complexity = O(1)
    */
}

I use the following function. I haven't benchmarked it, but it works.

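// Note: for non-negative num this returns the index of the most significant
// set bit (for example 2 for num == 7), not the number of set bits.
int msb(int num)
{
    int m = 0;
    for (int i = 16; i > 0; i = i>>1)
    {
        // debug(i, num, m);
        if(num>>i)
        {
            m += i;
            num>>=i;
        }
    }
    return m;
}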

public class BinaryCounter {

private int N;

public BinaryCounter(int N) {
this.N = N;
}

public static void main(String[] args) {

BinaryCounter counter=new BinaryCounter(7);
System.out.println("Number of ones is " + counter.count());

}

public int count(){
if(N<=0) return 0;
int counter=0;
int K = 0;
do{
K = biggestPowerOfTwoSmallerThan(N);
N = N-K;
counter++;
}while (N != 0);
return counter;

}

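private int biggestPowerOfTwoSmallerThan(int N) {
// Returns the largest power of two that is <= N.
if(N==1) return 1;
// Loop up to i == 31 (rather than i < N) so that a result is found for
// every positive int; with i < N the case N == 2 returned 0 and count() never finished.
for(int i=0;i<=31;i++){
if(Math.pow(2, i) > N){
int power = i-1;
return (int) Math.pow(2, power);
}
}
return 0;
}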
}

This also works fine:

int ans = 0;
while(num){
ans += (num &1);
num = num >>1;
}
return ans;

Related discussion