I’ve talked about data-oriented design before. It looks at structuring things to remove unnecessary data, keep cache refreshes down and thereby get faster code. Compilers are pretty good nowadays at optimising code. That is they can re-order code, inline functions and unroll loops but they’re not going to restructure your data for you.
Packing booleans
Here’s a contrived structure:
struct Test { bool B0; bool B1; ... };
I had a look and my compiler laid the data out at one byte per boolean.
Given that simple layout I’m going to use
std::array
which will do the same thing but is easier to test.
I then compared it to
std::bitset
which uses one bit per boolean,
plus a bit of wasted space at the end.
I made a little test to repeatedly scan through data:
bool accumulator = false; for (uint32_t i = 0; i < s_Count; i++) { for (uint32_t j = 0; j < s_Size; j++) { accumulator ^= values[j]; } }
With the array it will have to access a byte which is easy to do but takes up more space. With the bitset it will have to access a bit which is harder to do but takes up less space. Which is faster? Neither, they neck and neck.
However if you look at the assembly for the array code it’s huge. It’s done a lot of unrolling. The bitset version takes less data, less code and is just as fast. Given that should we always be using bitsets? Not really. It’s the same speed not faster. More importantly it makes the code more complicated. You wouldn’t be able to refer to a simple boolean any more. You’d have to access it via an index into the bitset. We also tend to put associated variables together rather than cluster them according to data requirements. Doing so is going to be harder to read, write and understand.
Optimising compiler
Could we keep the code looking the same and get the compiler to do the work? In theory it should be possible for the compiler to order the data in a structure however it likes. However, there’s certainly some C++ code out there that makes assumptions about how data is laid out. Files being read in and pointers to bytes being cast directly to structs. Other languages with stricter rules are less likely to have that problem.
That could involve packing booleans together into a single byte but it could also place a couple of shorts next to each other give a cumulative 4 byte alignment.
You could tell the compiler:
unordered struct Test { bool Alpha; char Beta; bool Gamma; short Epsilon; bool Zeta; int Eta; bool Theta; };
And it would lay it out thus:
Padding : 63..60 Alpha : 59 Gamma : 58 Zeta : 57 Theta : 56 Beta : 55..48 Epsilon : 47..32 Eta : 31..0
Or you could leave it in the hands of the programmer but separate the declaration of the data from the layout of the data. Perhaps you could choose to structure it differently in different parts of the program.
In the end
I didn’t know what to expect from the speed tests. That they so similar in speed is surprising. As before I think it’s best to focus on the code in general first. Make it easy to read, write and change. Think about algorithms and data structures because they are likely to have the biggest effects. If you need to get performance don’t make assumptions. Use a profiler, do tests, and find out what’s fastest.
P.S.
In my initial timing tests I used a really big array to fill up the processor cache and
see if the bitset would be faster. What about a smaller test? With 128 booleans, they are still neck and neck. With 64 booleans, the bitset pulls ahead. With 32 booleans, the bitset is 2000 times faster! With 16 booleans, they are neck and neck again. So, again, use a profiler and make meaningful tests because there are all sorts of surprises out there.
Leave a Reply