Array iteration performance in C# — Branching and parallelization

O’Connell Bridge, Dublin, Ireland by aalmada

This is a third post in a series on performance in C#.

Modern CPUs, in a quest for better performance, try to maximize its throughput by executing instructions simultaneously and while waiting for the result of logic tests. They try their best to guess the behavior of our code but, many times, the code has to be adjusted to maximize that potencial. These optimizations are CPU dependent and results may vary.

To the ones that keep quoting Donald Knuth every time performance is mentioned, I’ll start with a quote from that exact same paper:

“Easily obtained improvements are never considered marginal.” — Donald Knuth

Performance depends not only on the logic of our own code but also, what the compiler does with it. We saw on the previous posts that Roslyn generates optimized code for foreach by using the indexer in the case of arrays or Span<T>. We saw that the JIT compiler can remove bounds checking for some cases. But, we also saw that it fails to detect some patterns and it needs some help from the developers.

Writing code that performs well usually means applying the same “recipes” over and over again. Most of these are very simple and can be considered “easily obtained”.

The challenge is that developers need to know that there are “better” ways of implementing a certain task. If they don’t know, they won’t apply it. This knowledge depends on learning from the experts in several fields. It can be found in books, blog posts, conferences, videos, pull requests, or simply in tweets.

I’m far from an expert in this. I just like to learn and try to apply it the best I can to my work. These posts help me structure my thoughts, are references for myself, and hopefully for others too. I also hope these work as validation. In case I got it wrong, I want to learn some more from others.

I want first to acknowledge that this post is mostly based on knowledge shared by Bartosz Adamczewski.

Logical Branch Removal

“A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order.” — Wikipedia

For example, lets now implement an overload for Sum() that only sums the items to which the given predicate returns true:

In this case, the CPU will try to guess if it should add the value or not, but it has no knowledge of the data in source or the logical condition in predicate. This makes it almost impossible for the CPU to infer an heuristic.

One possible way of removing the logical condition is to convert the boolean to a number. It can be done very efficiently by using the following extension method:

It converts false to 0 and true to 1, with no logical branches.

NOTE: This assumes a bool is internally represented as a byte. This may not be true for all implementations of .NET. It worked correctly on the ones that I tested it on.

The converted number can now be multiplied by the item. This results in 0 being added to the sum when predicate is false; otherwise, the item is added.

Sum() will now look like this:

So, how do these fair in terms of performance?

On an Intel Core i7–7567U 3.50GHz (Kaby Lake), it’s 2 times faster than the first version.

The benchmark uses as source, an array filled with random numbers, and as predicate, a lambda that returns true if the item is even. It also uses the memory randomization feature of BenchmarkDotNet. This simulates real-world usage as a library method where it’s hard for the CPU to predict branches and data may not be aligned.

Parallelization

We saw in the previous posts that CPUs can parallelize instruction execution by using SIMD, but it can also parallelize automatically in some other situations. These optimizations cannot achieve the performance of SIMD but can be used when SIMD is not available or on the remaining items that don’t fill up a Vector<T>.

Lets go back to the original implementation of Sum():

It adds all the items sequentially into a sum local variable. What if we use more than one local variable?

This version iterates the array every two items, adding each item to two separate local variables. If the length of the array is odd, it adds the remaining item to one of the local variables. Finally, it returns the addition of the two local variables.

Looking at it, it seems to involve a lot more instructions, but how does it fair in terms of performance?

It’s 1.21 times faster than the original.

It also makes a difference when applied to the overload with predicate and no branches. It’s almost 3 times faster than the original implementation of the overload with predicate.

Conclusions

Having a general knowledge of how a CPU works may allow huge improvements in performance.

If you’re developing a library, you never know how often and under what conditions a method is going to be called. You have no ideia if it’s the hot path or not.

The code may not be as readable but I’m sure that 2 or 3 times gain in performance cannot be considered marginal. Complexity can and should be encapsulated!

Benchmarks source code is available at https://github.com/aalmada/ArrayIteration.

Principal Engineer @ Farfetch - Future Retail Lab https://about.me/antao.almada