How often?

Don’t Repeat Yourself (DRY) is, ironically, said quite a lot. I think in general this is uncontroversial. If you have a choice between writing a lot of very similar code:

    std::vector<int> numbers;
    numbers.push_back(0);
    numbers.push_back(1);
    numbers.push_back(2);
    numbers.push_back(3);
    numbers.push_back(4);
    numbers.push_back(5);
    numbers.push_back(6);
    numbers.push_back(7);

Or something smaller that does the same job:

    std::vector<int> numbers;
    for (int i = 0; i < 8; i++) {
        numbers.push_back(i);
    }

You are probably going to choose the latter. Maybe you think this is an overly simple example, but I’ll come back to it later.

There are a number of reasons that DRY makes sense:

Reducing code size: Less to read and less to understand.
Identical behaviour: One piece of code has one behaviour.
Easy updates: One place to change and everything is updated.
Fewer bugs: One place for bugs, one place to fix bugs.
Easier testing: Less code to test.
Better code: Less code can be better checked and optimised.

However, this isn’t without costs, both initial and ongoing:

Sharing code: Everything must be able to access the master version.
Recognising duplication: Sometimes it’s not obvious what counts.
Removing duplication: Refactoring always takes time and effort.
Added complexity: A few similar functions can be easier to write than a single clever one.

So, while repetition can be bad, I don’t think it can be as simple as never repeat yourself.

How often is too often?

Over the years I’ve talked to a few people about this, and a common threshold is 3. That could be 3 lines, 3 functions or 3 files.

I’m going to go further and say that you should start thinking about it at 2. As soon as you notice you’re repeating something at all, start thinking about whether it would be better to refactor. All the standard reasons for DRY apply, admittedly with lower benefits for such a small number. However, the first time you repeat yourself isn’t necessarily the first time that code has been repeated. It’s very easy for many people to write similar code many times without anyone catching on. If you start looking for that repetition, the project may get more benefit than just a couple of functions.

Copy-paste is problematic

When you are just getting started with a problem it’s hard to know what the code should look like at the end. It’s very common to write a function then know you need something like it but different. A quick copy-paste and a few tweaks gets you a new but different function that does what you want. I do this as well.

Where it falls down is when you have two, three or more functions that are 90% the same. Maybe the codebase is stable and these don’t change. However my experience is that requirements change over time and so the code has to change. What are the chances that all these functions are going to be changed in the same way at the same time? Even if all the functions still work they have begun to diverge.

I have spent literally years, on and off, with a project trying to pull two sets of classes back together. One set of classes was obviously copied from another and then both were changed as required. They did very similar things. So much of the code was almost the same. However there can be a big difference between the same and almost the same. Every time I needed to make a change both sets of classes had to be updated separately! Eventually I was able to account for or remove all the differences. Almost everything could be done by a single code path with a few specific differences. Future changes were smaller and quicker.

As I said before, I often start with some copy-pasting. Once I have working code, and before I check it in, there’s a chance for review. It’s often easy to replace a few custom functions with a generic one which has a few extra parameters. Maybe you keep the custom functions around but they call the generic function to do the heavy work.
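As a minimal sketch of that kind of clean-up, assume two copy-pasted functions that sum the even values and the positive values of a vector. The names sum_if, sum_even and sum_positive are invented for illustration; the part that differed between the copies becomes a parameter of the generic function.

```cpp
#include <vector>

// Generic version: the condition that differed between the copies
// is passed in as a parameter.
template <typename Predicate>
int sum_if(const std::vector<int>& values, Predicate keep) {
    int total = 0;
    for (const int value : values) {
        if (keep(value)) {
            total += value;
        }
    }
    return total;
}

// The specific functions stay around as thin wrappers,
// so existing callers do not have to change.
int sum_even(const std::vector<int>& values) {
    return sum_if(values, [](int v) { return v % 2 == 0; });
}

int sum_positive(const std::vector<int>& values) {
    return sum_if(values, [](int v) { return v > 0; });
}
```

A fix to the loop now lands in one place, and a third variant is one more short wrapper rather than another copy.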

Thinking small

Although removing large duplications can lead to big benefits you can also remove lots of little duplications. In my initial example many people would be happy to leave it here:

    std::vector<int> numbers;
    for (int i = 0; i < 8; i++) {
        numbers.push_back(i);
    }

However, filling a vector with “stuff” probably happens in lots of places, so it is worth considering a standard algorithm:

    std::vector<int> numbers;
    std::generate_n(std::back_inserter(numbers), 8, [i = 0]() mutable { return i++; });

This has fewer significant lines but you could argue against it as the techniques it uses are more complicated. We can go further:

    const auto numbers = ext::generate_n<std::vector<int>>(8, ext::identity);

To me this is better as it uses one line to give us exactly what we want. We’ve moved beyond what the standard template library provides. I’m assuming our own ext library provides a templated generate_n function to create a collection and a templated identity function which returns the value it’s given. Those are both very general functions which can be used in lots of other situations.

A library might not provide exactly what you need, but that doesn’t have to mean hand-crafting things all the time. Removing very small duplications can be worth it:

All sorts of operations on all sorts of collections, e.g. custom comparison operations for your data.
A function wrapper that uses fewer parameters, e.g. replace log(INFO, ...) with info(...).
Your favourite string operations, e.g. std::string::contains only arrived in C++23!
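Two of these small helpers might look something like the sketch below, assuming C++17. The ext namespace, the Level enum and the log function are stand-ins invented for illustration, not any particular library’s API.

```cpp
#include <iostream>
#include <string_view>

namespace ext {

// Substring test for code that cannot rely on C++23's std::string::contains.
inline bool contains(std::string_view text, std::string_view part) {
    return text.find(part) != std::string_view::npos;
}

enum class Level { Info, Warning };

// Hypothetical general-purpose logger, standing in for log(INFO, ...).
inline void log(Level level, std::string_view message) {
    std::clog << (level == Level::Info ? "INFO: " : "WARNING: ")
              << message << '\n';
}

// Wrapper with one parameter fewer: the level is fixed at Info.
inline void info(std::string_view message) {
    ext::log(Level::Info, message);
}

}  // namespace ext
```

Each helper is only a line or two, but it replaces the same small idiom everywhere it would otherwise be retyped.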

When it’s hard

Sometimes it’s not easy to avoid duplication. Duplication can exist in very separate areas, whether that’s different libraries, languages, platforms or departments. These things should be identical but the only connection may be the specification.

In my last project we had to run code on two different platforms and ensure they were in sync. Initially the two sets of code were written separately in different languages. Essentially no functional changes were made to the code because of the overhead this created. Eventually I was able to use something like cross-compilation to simplify this. Suddenly changes to the code could be made easily… but there was a bigger build system to maintain.

I don’t know a general solution here because the reasons can be very different. If you can manage it then the normal benefits apply.

Repeating yourself

Every so often you’re going to come across a case where it makes most sense to repeat yourself. To me this probably comes down to overall complexity, both now and in the future. Most often this is going to be at the lower end of repetitions:

If it’s just a specific 2 or 3 times then it may not be worth factoring out a function or a lambda.
If things happen to be the same but are not required to be the same then merging them may not make sense; indeed, calling common code could lead to confusion.
If making common code adds a lot of complexity then a bit of repetition may be more understandable.
If the code is likely to change often then repeated refactoring may be too costly.
If the code is unlikely to ever change then maybe it can just wait.
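The “happen to be the same” case can be illustrated with a small sketch, assuming C++17. The two limits and both function names are invented for illustration: the checks look identical today, but they come from unrelated requirements.

```cpp
#include <cstddef>
#include <string_view>

// Equal today, but set by unrelated (hypothetical) requirements.
constexpr std::size_t max_username_length = 32;  // account policy
constexpr std::size_t max_filename_length = 32;  // storage limit

// The bodies look alike, yet each rule is free to change on its own.
constexpr bool is_valid_username(std::string_view name) {
    return !name.empty() && name.size() <= max_username_length;
}

constexpr bool is_valid_filename(std::string_view name) {
    return !name.empty() && name.size() <= max_filename_length;
}
```

Folding these into one shared function would suggest the two rules are linked, and a later change to one limit would have to untangle them again.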

Do remember the benefits, especially if it could be used elsewhere. The most basic guideline might be to use whichever version of the code is shorter… as long as it’s still understandable.

P.S.

I’ve noticed a few issues with my example:

    const auto numbers = ext::generate_n<std::vector<int>>(8, ext::identity);

Mistakes do happen but we can still try to improve. So what are the problems?

I intended to mimic std::generate_n here but it wouldn’t accept this sort of function. Instead it expects a function that takes no parameters. It might be better to choose a different name to avoid misleading people.
There isn’t enough type information for the compiler to work it all out. It is best to test these things.
My naming guidelines say to avoid non-standard contractions and here is ext. Depending on your code some namespaces can be referenced a lot. A long namespace identifier can end up weighing the code down. I might have to come up with a new exception for namespaces in my guidelines.
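As a rough sketch of how the first two problems might be addressed, assuming C++17: the name generate_indexed is made up here to avoid the clash with std::generate_n, and identity is written as a function object (similar in spirit to C++20’s std::identity) so the compiler has enough type information when it is passed by name. The container is assumed to be vector-like, with reserve and push_back.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

namespace ext {

// A function object rather than a function template, so it can be
// passed as an argument without spelling out a type.
struct identity_fn {
    template <typename T>
    constexpr T&& operator()(T&& value) const noexcept {
        return std::forward<T>(value);
    }
};
inline constexpr identity_fn identity{};

// Builds a container of `count` elements by calling make(i)
// for i = 0 .. count - 1. Unlike std::generate_n, the callable
// receives the index, hence the different (hypothetical) name.
template <typename Container, typename Function>
Container generate_indexed(std::size_t count, Function make) {
    Container result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        result.push_back(
            static_cast<typename Container::value_type>(make(i)));
    }
    return result;
}

}  // namespace ext
```

With these definitions the one-liner becomes const auto numbers = ext::generate_indexed<std::vector<int>>(8, ext::identity); and it compiles.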