Saving pennies

Imagine your shopping at a different supermarket from normal. You notice that baked beans are 3p per tin cheaper than you’re use to. The news had a story about supermarket price wars on basic goods recently. It’s a bit out of your way but if you switch supermarkets you could save money on beans each month!

Does that sound familiar? No, probably not. Saving 3p on a tin of beans is unlikely to be worth it unless you eat a lot of beans. If the supermarket is out of your way then you might end up spending more money on getting there. It might make sense if you noticed the other supermarket sells steak or bottles of wine at half price. Otherwise it’s better to look at the overall cost for a trolley and decide based on that.

Optimisation is a bit like this. There are a few simple ways you can save a fraction here and there. There are even automated ways to detect those situations. While making your code a few percent faster is nice wouldn’t it be better to make it many times faster.

Inefficient vector operations

clang-tidy is a static analysis tool that can perform all sorts of checks on your code. One of them looks for inefficient vector operations. Lets say you want a list of numbers:

    std::vector<uint32_t> values;
    for (uint32_t i = 0; i < count; i++) {
      values.push_back(i);
    }

This will trigger a warning because, as values gradually expands, it could trigger multiple memory allocations and copies. If instead you pre-allocate the list then only a single allocation is required:

    std::vector<uint32_t> values;
    values.reserve(count);
    for (uint32_t i = 0; i < count; i++) {
      values.push_back(i);
    }

Does this really make a difference? I can guess but let’s have a look. It does, pre-allocating space can be 2.5 times faster. Or maybe 4.3 times faster or 1.5 times faster. So yes, it’s faster but details matters. The smaller loops show a greater speed increase. It matters more when you add 10 numbers and less add 1000 numbers. However remember that adding 10 numbers is going to fast anyway, there’s only 10 of them. Dealing with 1000 things is where you’re more likely to notice the difference but there the speed is more similar. All of these have been dealing with some of the simplest possible loops. What if they were slightly more complicated? Then pre-allocating space is only 1.1 times faster.

It’s not that much faster, it looks more complicated and it’s has more room for bugs:

If you change your upper bound you have to remember to change reserve.
If you change how many values to add per loop you have to remember to change reserve.

And if you do make a mistake it won’t break it will just silently slow down. In some ways that’s worse because it’s unlikely to be noticed and fixed.

I suppose the worst situation for you code is when you are adding, say, 10 numbers to a new array but you’re doing that hundreds of times. However, you can do even better than using reserve by thinking about the algorithm.

To my mind reserve is best used when you’re writing library functions that can get re-used everywhere:

template <typename T, typename Generator>
void append_n(std::vector<T>& values, size_t count, Generator generator) {
    values.reserve(values.size() + count);
    for (size_t i = 0; i < count; i++) {
        values.push_back(generator(i));
    }
}

That way everyone else can write something simple and automatically get the benefit:

    std::vector<uint32_t> values;
    append_n(values, count, [](const int i){ return i; });

This is just as fast as pre-allocating the list manually. It can also be easily repurposed to store only even numbers, or string representations of the numbers, or whatever you want to calculate. As it’s something discrete it’s easy to write unit tests for so should be more reliable. Your team will have to learn another set of library functions and what to expect from them. That’s a cost that comes with any new library whether you write it yourself or not.

Pre-increment vs post-increment

Sometimes you see advice saying to prefer pre-increment operators to get the best performance.

    // First increment and then use the new value.
    uint32_t preTotal = 0;
    for (uint32_t i = 0; i < s_Count;) {
      pretotal += ++i;
    }
    // Store the old value, increment, and then use the old value.
    uint32_t postTotal = 0;
    for (uint32_t i = 0; i < s_Count;) {
      postTotal += i++;
    }

So the first loop is faster than the second but did you notice anything special about the test? I had to switch the optimisation off. With optimisation switched on the both loops are the same. Indeed with a more typical loop:

    uint32_t total = 0;
    for (uint32_t i = 0; i < s_Count; i++) {
      total += i;
    }

It doesn’t even matter if you don’t use optimisation. Again, both loops are the same. You’ll also find if you do anything other than the most basic calculations which operator you use largely doesn’t matter.

So technically pre-increment operators can be faster than post-increment operators but in practice it hardly ever seems to be the case. It may be that this use to matter with older compilers but not today.

Really I have more of a problem with doing something to a value and using a value at the same time. If you can separate incrementing the value from using the value then it’s generally easier to understand what the code is doing. There are less likely to be edge cases that you miss. For me I’d prefer a single increment operator i++ that had the same behaviour as i += 1. That way we don’t have to worry about specific pre and post increment behaviours. Unfortunately I don’t get to re-design C++.

Function calls

Another idea is that the cost of function calls can add up. People might encourage you to duplicate code or specifically inline functions to avoid this. We can take a simple example of squaring a number. If you do it directly your code will be scattered with value * value. If you do it via function call you’ll have Square(value). What does the profiler say? Direct usage is about 10% faster than calling a function. Interesting inlining the function doesn’t seem to help. Maybe that’s because I had to switch off optimisation again. With optimisation they are all neck and neck.

Little vs big

Some of these techniques do make for faster code and some don’t. For some of them it matters what optimisation you use, perhaps which compiler as well. However I’d argue that all of the are relatively small gains. They’re small details that are easy to teach, easy to put in a tips & tricks list, and easy to write automatic checkers for.

What’s harder to teach is where to get big performance improvements. Generally speaking this comes down:

Clever algorithms
Organising you data
Switching your tools
Doing less work somehow

I’ve said it before and I’ll say it again, the right algorithm can make a huge difference. These can be complicated but they don’t have to be. Binary search is much faster than linear search. Here that’s 8.5 times faster, the biggest performance difference so far, and it’s not a huge example. With a bigger example it gets even more significant, 54 times faster! However it’s not as simple as using it everywhere. For small lists linear can be faster.

I’ve talked about organising your data before. At the time I wasn’t impressed by the book but saw that the technique could work. I looked at an example particle system where I kept related data together. Again it has a speed up an order of magnitude greater than the little ones we discussed earlier. Data that isn’t useful to the calculations at hand can clog up the CPU cache and cause more cache misses. It can also help you concentrate on doing only the work that you need to do, separate active and inactive data.

If you are using libraries to do some of your work you might want to investigate alternatives. Lets consider a fairly common activity, parsing JSON. Fortunately for me someone else has already looked at the numbers. The look at 43 libraries from ArduinoJson all the way to YAJL.

xychart-beta horizontal

x-axis ["gason", "RapidJSON", "RapidJSON_Insitu", "sajson", "ujson4c", "RapidJSON_AutoUTF", "RapidJSON_FullPrec", "Scheredom json.h", "mikeando/FastJson", "cJSON", "ujson", "taocpp/json", "udp/json-parser", "V8", "ccan/json", "Nlohmann", "SimpleJSON", "Configuru", "jsoncons", "Parson", "dropbox/json11", "JVar", "YAJL", "Vinenthz/libjson", "Jansson", "Qt", "C++ REST SDK", "json-c", "PicoJSON", "Jzon", "JsonCpp", "POCO", "hjiang/JSON++", "tunnuz/JSON++", "JsonBox", "jsmn", "nbsdx_SimpleJSON", "ArduinoJson", "CAJUN"]
y-axis "Parsing time (ms)" 0 --> 1200
bar [8, 8, 8, 9, 9, 16, 16, 17, 19, 25, 37, 41, 47, 53, 60, 72, 79, 81, 89, 90, 94, 94, 103, 107, 112, 125, 134, 135, 140, 149, 166, 219, 224, 317, 341, 395, 486, 568, 1133]

There’s a factor of about 140 times between the fastest and slowest. Personally I’m use to using Nlohmann/json which turns out to be middle of the road. If my project was slow switching to RapidJSON might be the way to go. (gason is the same speed but a different language.)

Are you doing calculations multiple times when you could get away with doing them once? Caching previous work and then re-using it later trades memory for time. I once inherited a project where potentially thousands of tasks where being performed but each triggered the loading and saving of one of dozens of files. That meant a single file might be loaded and saved 100 times on average. Re-organising the code allowed me to load one file, perform 100 tasks and then save the file. It was a massive speedup.

One of the most powerful options is just doing less work. Look at what the task actually requires. If it was “List the top 10 items in order cost”. You might total up the cost of every item, sort all the costs and take the first 10. That’s certainly the obvious way to do it. However, with the right algorithm, you could start sorting all the costs and then stop once you had the first 10. The order of the rest of them doesn’t matter.

In the end

There are lots of ways you can improve the performance of code. You might think it’s good to do them all. Surely fast code is good code? I’d say: clear, readable, flexible and functional code are all more important than just being fast. If you do need code that is faster the first thing to do is almost always to profile it. Once you know what parts of your code are slow you know where to focus your attention. You get a bigger impact speeding up slow code than speeding up something that is already fast. Start by looking for ways to save a lot of time rather than a little. We usually have limited time for development. It’s not a questions of “How fast can this code possibly be?” but “How fast does this code need to be?”

Inefficient vector operations

Pre-increment vs post-increment

Function calls

Little vs big

In the end

Comments

Leave a Reply Cancel reply