One of my friends was wondering about the performance difference between large representations of bools, say one byte each, and packed representations, one bit each. Accessing a byte is easier, requiring fewer instructions, but it fills up your processor’s cache sooner. Accessing a bit is harder, requiring more instructions, but you can fit more of them in. If you’re going down a Data-oriented Design route, which should you choose? I’ve looked at bit manipulation before when talking about the journey, destination and story of code’s development. I thought this would be an easy investigation.
Accessing bits
If we want to compress bools into a smaller space, that probably means a collection of unsigned integers. Given an index for the bool we want, we have to work out which integer it lives in and then which bit within that integer; for example, with 32-bit integers index 70 maps to integer 2, bit 6. There are two options to consider: which unsigned integer to use and which collection to use. I can use some templates in my testing, although a final implementation might pick specific types.
template <typename Collection>
class BitCollection {
public:
    using Value = typename Collection::value_type;

    BitCollection(Collection& values) : m_Values(values) {}

    // Read the bit at the given index.
    bool GetBit(size_t index) const {
        auto offsets = IndexToOffsets(index);
        auto& value = m_Values[offsets.Value];
        auto mask = GetBitMask<Value>(offsets.Bit);
        return value & mask;
    }

    // Set or clear the bit at the given index.
    void SetBit(size_t index, bool bit) {
        auto offsets = IndexToOffsets(index);
        auto& value = m_Values[offsets.Value];
        auto mask = GetBitMask<Value>(offsets.Bit);
        if (bit) {
            value |= mask;
        } else {
            value &= ~mask;
        }
    }

private:
    Collection& m_Values;

    // Which integer in the collection and which bit within that integer.
    struct Offsets {
        size_t Value;
        uint8_t Bit;
    };

    static Offsets IndexToOffsets(size_t index) {
        auto bitCount = GetBitCount<Value>();
        return { index / bitCount, static_cast<uint8_t>(index % bitCount) };
    }
};
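GetBitCount and GetBitMask aren’t defined in that snippet; I’m assuming they are small helpers, perhaps from the earlier bit-manipulation post. A minimal sketch of what they could look like, using std::numeric_limits for the bit count and a shift for the mask:

#include <cstddef>
#include <cstdint>
#include <limits>

// Number of bits in the underlying integer type, e.g. 32 for uint32_t.
template <typename T>
constexpr size_t GetBitCount() {
    return std::numeric_limits<T>::digits;
}

// A value with only the requested bit set, e.g. bit 3 of a uint32_t is 0b1000.
template <typename T>
constexpr T GetBitMask(uint8_t bit) {
    return static_cast<T>(T{1} << bit);
}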
There are three basic operations here: checking, setting and clearing a bool. They all share the same index-to-offset setup, so we are doing the same thing a lot.
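As a quick illustration of how those operations look at the call site, assuming the BitCollection and helper sketches above are in scope; the container type and size here are mine, chosen just for the example:

#include <array>
#include <cstdint>

int main() {
    std::array<uint32_t, 1024> storage{};                // 1024 * 32 = 32768 bools in 4 KiB
    BitCollection<std::array<uint32_t, 1024>> bits(storage);

    bits.SetBit(70, true);                               // set:   word 70 / 32 = 2, bit 70 % 32 = 6
    bool isSet = bits.GetBit(70);                        // check: reads the same word and masks it
    bits.SetBit(70, false);                              // clear: ANDs with the inverted mask
    return isSet ? 0 : 1;
}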
How fast?
Let’s start by considering which underlying integer to use. They’re all of a similar speed, but it does make a difference: uint64_t and particularly uint32_t are better. Quick Bench shows the assembly as well, and to me it looks like the 32-bit code is a bit more compact. Maybe the instruction set or the optimiser is better suited for that.
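Here is a rough sketch of the kind of benchmark that comparison implies, using the Google Benchmark API that Quick Bench runs. The collection size, the toggle-every-bit access pattern and the reuse of the helper sketches above are my assumptions, not the exact benchmark from the post:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Toggle every bool once through BitCollection, templated on the word type.
template <typename Word>
static void BM_BitCollection(benchmark::State& state) {
    constexpr size_t boolCount = 1'000'000;
    std::vector<Word> storage(boolCount / GetBitCount<Word>() + 1);
    BitCollection<std::vector<Word>> bits(storage);

    for (auto _ : state) {
        for (size_t i = 0; i < boolCount; ++i) {
            bits.SetBit(i, !bits.GetBit(i));
        }
        benchmark::DoNotOptimize(storage.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK_TEMPLATE(BM_BitCollection, uint8_t);
BENCHMARK_TEMPLATE(BM_BitCollection, uint32_t);
BENCHMARK_TEMPLATE(BM_BitCollection, uint64_t);
// BENCHMARK_MAIN(); // needed for a local build; Quick Bench provides its own main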
Now we can look at how the different algorithms do. I’m going to compare my code to std::array and std::vector using bool, but also std::bitset, which has a compact representation.
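For reference, the same set, check and clear operations look like this on the containers being compared; the sizes here are illustrative, not the benchmark’s:

#include <array>
#include <bitset>
#include <cstddef>
#include <vector>

void Examples() {
    constexpr size_t count = 4096;

    std::array<bool, count> plainArray{};   // one byte per bool, size fixed at compile time
    plainArray[70] = true;                  // set
    bool a = plainArray[70];                // check
    plainArray[70] = false;                 // clear

    std::vector<bool> plainVector(count);   // resizable
    plainVector[70] = true;
    bool v = plainVector[70];
    plainVector[70] = false;

    std::bitset<count> packedBits;          // one bit per bool, size fixed at compile time
    packedBits.set(70);
    bool b = packedBits.test(70);
    packedBits.reset(70);

    (void)a; (void)v; (void)b;
}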
Which algorithm is best?
- Everything is within the same order of magnitude.
- An array of bools is fast.
- A vector of bools is slower. I would have thought it would have the same access patterns, so what’s slowing it down? (There’s a sketch of one likely cause after this list.)
- The standard bitset is the same as the vector. The extra bit manipulation takes some time but not too much.
- My BitCollection<std::array<uint32_t>> is the same as the standard bitset. We are probably doing similar things under the hood.
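One likely contributor, which is my note rather than anything measured in the post: std::vector<bool> is a packed specialisation, so operator[] returns a proxy object rather than a plain bool&, and every access already pays for some masking. A small sketch showing the proxy:

#include <cassert>
#include <type_traits>
#include <vector>

int main() {
    std::vector<bool> bits(1000);

    // operator[] returns a proxy object, not bool&: the element has no address of its own.
    auto element = bits[42];
    static_assert(!std::is_same_v<decltype(element), bool&>,
                  "vector<bool> is packed, so there is no real bool reference");

    element = true;          // writing through the proxy masks and ORs the underlying word
    assert(bits[42]);
    return 0;
}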
What about the cache?
At first glance std::array<bool> is the winner. This gives you the easiest access to each of the bools and is faster. The main limitation is that this type has a fixed size determined at compile time. If you’re sure of your maximum size and don’t mind wasting space, it’s great. If those are problems, then std::vector<bool> is a good runner-up: similarly easy access, and it matches the speed of the other options.
For these tests I even chose a large collection size. That’s more likely to blow out the cache and give an advantage to a compact representation. Does the cache make a difference? I had to hunt for it, but it does. When we get up to 8M bools things start slowing down. I’d go higher, but Quick Bench starts timing out. So how does that affect our algorithms? Basic bools are still the winner.
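A sketch of how that size sweep could be written, again with Google Benchmark; the 8M upper bound comes from the text, while everything else (container choice, summing loop, size range) is my assumption:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Walk every bool once at a range of collection sizes, looking for the point
// where the working set falls out of cache.
static void BM_PlainBools(benchmark::State& state) {
    const size_t count = static_cast<size_t>(state.range(0));
    std::vector<uint8_t> bools(count, 0);   // one byte per bool; swap for a packed type to compare
    for (auto _ : state) {
        size_t trueCount = 0;
        for (size_t i = 0; i < count; ++i) {
            trueCount += bools[i];
        }
        benchmark::DoNotOptimize(trueCount);
    }
}
// 1M up to 8M bools, doubling each time.
BENCHMARK(BM_PlainBools)->RangeMultiplier(2)->Range(1 << 20, 8 << 20);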
On balance
The idea of getting some savings by packing things in tightly is nice but didn’t pan out in this situation. While bit manipulation isn’t really slow, it does take a little more time in these tests. Different setups have different properties, and that might be enough to make a difference: maybe you’re on a machine with a small cache, or maybe you need to transfer the data between systems. If performance is critical in a situation like that, it could be worth giving the packed approach a go.
I’d say a more significant issue is the extra complexity. While we can package all of this up in a handy class, it adds more layers. Investigating or debugging code where everything has been packed makes it harder to inspect anything. In this case keeping it simple makes sense.