Krishna Sundarram

Optimising my Rust solutions for Advent of Code

16 min read

Advent of Code is a series of programming challenges that I love solving. It runs between 1st and 25th December each year, with 1 problem released each day.

To make it extra fun this year I set myself a goal of solving it with the fastest code I could possibly write. I managed it in the end - the code for all 25 days runs in about 84 milliseconds (3.3ms on average).

Each day I’d get a working solution and then spend some time optimising the code. The typical speedup was anywhere between 60% and 90% each time. I didn’t need to write any unsafe code, or do anything that made the code unreadable. For the most part this demonstrates how easy Rust and its ecosystem make it to optimise code.

A lot of the optimisations I used are general and would apply to any codebase. I learned these from The Rust Performance Book written by Nicholas Nethercote.

1. Benchmark thoroughly

Optimisation needs numbers you can rely on, especially when we’re trying to optimise small functions that may only take a few microseconds to run. Rust makes this easy - there’s a fantastic framework called criterion.rs that runs each function 10k times, discards the outliers and gives the mean and median run-times with confidence intervals. I’ve set it up so I only need to import the new day’s code in the benchmark file and run cargo bench -- "day{{DAY}} ".

Benchmarking leads to unexpected results at times. Applying all the other tricks mentioned here would mostly work, until they didn’t. That’s why every change needed to be followed by a cargo bench call. A few examples:

  1. Running the code on multiple threads is a great idea, unless the overhead of starting threads and joining their results is greater than the single threaded run-time.
  2. On day 19, replacing hundreds of thread local caches with a global thread-safe cache sounded like a good idea, but benchmarking showed otherwise.
  3. On day 7, code that “looked” worse because it was allocating memory left and right outperformed the supposedly efficient code by 90%. I didn’t expect this, and only figured out why it was faster after seeing the benchmark result - it was doing much less work.

If criterion isn’t your style, you can use hyperfine from the command line. Just remember to run your code in release mode!

Always measure!

Lastly, make sure you’re benchmarking the right thing. I don’t want to measure the time it takes to read the input file so I include the input in the binary with the include_str! macro.
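
Here’s a minimal sketch of what that setup can look like - the file path, the day01 names and the stand-in part1 function are placeholders, not the actual benchmark file from my repo:

use criterion::{criterion_group, criterion_main, Criterion};

// Bake the puzzle input into the binary so file IO never shows up in the measurement.
static INPUT: &str = include_str!("../inputs/day01.txt"); // hypothetical path

// Stand-in for the real solution being benchmarked.
fn part1(input: &str) -> usize {
    input.lines().count()
}

fn day01_benchmark(c: &mut Criterion) {
    c.bench_function("day01 part1", |b| b.iter(|| part1(INPUT)));
}

criterion_group!(benches, day01_benchmark);
criterion_main!(benches);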

2. Parallelise with Rayon

Rust has a superpower - fearless concurrency. You can run your code on any number of threads, and if you’re doing something potentially unsafe, like incrementing an integer from multiple threads without synchronisation, that will be a compile-time error you can fix immediately.

That’s not all though. Rust has a framework called Rayon that makes parallelising your code dead simple. My initial solution would iterate over the input string line-by-line with lines(). By adding just 4 characters - par_ - it processes the lines in parallel.

use rayon::prelude::*;

input
    // .lines()
    .par_lines()
    .map(|line| num_foo(line, &cache))
    .sum()

No other changes needed. Rayon figures everything out - the optimal number of threads to start on this hardware, reassigning work in case any thread finishes its work early and joining the results from all the threads. Minimal effort for us and our code remains as readable as before.

On some days this needed a bit more work if my solution wasn’t thread safe, meaning safe to call from multiple threads. For example, doing this would be a compile time error:

input
    .par_lines()
    // .map(|line| num_foo(line, &cache))
    .map(|line| num_foo(line, &mut cache)) // ❌
    .sum()

Unless cache was thread-safe, meaning protected by a Mutex or an equivalent, this code would not compile.
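
For example, the cache could be shared across Rayon’s worker threads by wrapping it in a Mutex. Here’s a rough sketch under assumed names (the per-line computation is a placeholder), not the actual day 19 code:

use std::sync::Mutex;

use ahash::AHashMap;
use rayon::prelude::*;

fn total(input: &str) -> u64 {
    // One cache, shared by every worker thread, guarded by a Mutex.
    let cache: Mutex<AHashMap<&str, u64>> = Mutex::new(AHashMap::new());
    input
        .par_lines()
        .map(|line| {
            if let Some(&hit) = cache.lock().unwrap().get(line) {
                return hit;
            }
            let value = line.len() as u64; // stand-in for the real per-line computation
            cache.lock().unwrap().insert(line, value);
            value
        })
        .sum()
}

As the day 19 example earlier showed, though, a single shared lock isn’t automatically faster than per-thread caches - benchmark it.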

On day 6 I had a 2D grid I was modifying on each iteration to keep track of some state (&mut Grid). I refactored this so the grid stayed the same on every iteration (&Grid). Then I could sprinkle the Rayon magic (par_) for an 86% speedup (29.75 seconds -> 4.09 seconds).

As mentioned earlier, multithreading can’t be used everywhere. There is a cost to spinning up new threads, as well as communicating the results between threads. If this overhead is comparable to the single-threaded run time, then don’t bother doing it. rayon only improved performance on 5 out of the 25 days.

That said, Rayon is a general purpose tool that can be applied anywhere, not just for small programs like these. Right now they’re experimenting with making the Rust compiler multi-threaded using Rayon:

The front-end (of the Rust compiler) is now capable of parallel execution. It uses Rayon to perform compilation tasks using fine-grained parallelism … The addition of parallelism was done by modifying a relatively small number of key points in the code. The vast majority of the front-end code did not need to be changed.

3. Use HashMap and HashSet alternatives

In general, Hash(Map|Set) inserts work by hashing the key and assigning the key to a specific bucket. At read time we look up the bucket associated with the key in constant time (O(1)). If k keys are assigned to the same bucket we have to scan up to k of them, taking O(k) time and slowing us down. In the worst case all n keys are assigned to the same bucket and reading any of them takes O(n) time rather than O(1). At this point reading from the HashMap is as slow as scanning a Vec.

Now imagine a web service that functions like a simple key-value store usable by anyone. It exposes /set and /get endpoints. The implementation is simple - it just inserts into/reads from a HashMap. If an attacker knew your implementation they could call /set with specially chosen keys, causing many collisions in your HashMap. Then a legitimate user calling /get might see their request time out because it took too long to read the HashMap.

This example might sound contrived, but that’s a Denial of Service (DoS) attack. Rust eliminates this problem by using a slower but DoS-resistant hashing algorithm (SipHash 1-3) in the standard library Hash(Map|Set). This default works well in the general case, say in a library which might find itself exploited in a DoS attack. But it’s needlessly slow if you’re only dealing with trusted data (like Advent of Code inputs).

The conclusion - use AHashMap and AHashSet from the ahash crate when speed is a priority and the data is from a trusted source. They’re drop-in replacements for Hash(Map|Set).

// use std::collections::{HashMap, HashSet};
use ahash::{AHashMap, AHashSet};
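
Nothing else needs to change - here’s a hypothetical usage example, with the same entry API you’d use on the standard HashMap:

use ahash::AHashMap;

// Count word occurrences; only the hasher differs from std's HashMap.
fn word_counts(input: &str) -> AHashMap<&str, u32> {
    let mut counts = AHashMap::new();
    for word in input.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}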

Further reading

  • I learned this from the Hashing chapter in the Rust Performance Book.
  • The ahash repo has a nice comparison doc with benchmarks where you can read more about the other options. FxHash is a commonly used alternative that does better on primitives like u64 but worse on Strings.
  • There have been discussions about providing an option to change the default hashing function globally similar to how we can change the memory allocator.

4. Use Const Generics

Rust generics are monomorphised at compile time - the compiler creates a copy of the function for each concrete value of the generic parameter. These copies are then optimised separately, yielding much more performant code.

Typically in Advent of Code the code for part 2 shares a lot with part 1. With a couple of small tweaks it can be reused to great effect. When I tried this for day 7, I noticed something unusual. I replaced the constant 2 in two expressions (% 2 and /= 2) with an integer parameter that could be 2 (part 1) or 3 (part 2). I didn’t expect a major regression, but that’s what I found from cargo bench.

// regression
// fn calculate_dfs(result: u64, operands: &[u64]) -> bool {
fn calculate_dfs(result: u64, operands: &[u64], n: u64) -> bool {

It is clear why this happens though - dividing by a constant 2 compiles down to a cheap bit shift, but dividing by an arbitrary run-time integer needs a much slower division instruction.

// back to original performance
// fn calculate_dfs(result: u64, operands: &[u64], n: u64) -> bool {
fn calculate_dfs<const N: u64>(result: u64, operands: &[u64]) -> bool {

Replacing n with <const N: u64> as a generic improved performance by 79% for part 1 and 13.5% for part 2.

Another day when it was useful was day 18, where both parts involved searching a 2D grid. Part 1 was simpler though, and didn’t need to track the path that got us to the destination. I added a const boolean as a generic param to the function and only inserted into the path vector if the boolean was false. This had no effect on part 2, but it eliminated the vector allocation and operations from part 1, reducing run time by 70%.
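
A sketch of the idea, with made-up names rather than the actual day 18 code - in the PART_ONE copy of the function the push is dead code and gets compiled out entirely:

// Monomorphised into two copies: one that records the path, one that doesn't.
fn walk<const PART_ONE: bool>(steps: &[(usize, usize)]) -> (usize, Vec<(usize, usize)>) {
    let mut path = Vec::new();
    let mut count = 0;
    for &pos in steps {
        count += 1;
        if !PART_ONE {
            path.push(pos); // only the part 2 copy keeps this
        }
    }
    (count, path)
}

// Part 1: walk::<true>(&steps);   Part 2: walk::<false>(&steps);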

With this trick we’re giving the compiler more information at compile time so it can optimise better. It’s equally possible to do this by copy-pasting and writing 2 separate functions. Choosing between const generics and copy-paste comes down to personal preference.

5. Cache aggressively

We’re optimising for run-time here, not memory usage. That means anything you might compute twice, compute once and throw it in a cache. The next time it’ll be there, waiting for you. In longer lived programs a Least-Recently-Used (LRU) cache is a better fit, but for Advent of Code a simple AHashMap will work great.

Day 11 is a good example. Without caching it would likely take several weeks to calculate. With caching it takes 2.84ms.
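
The shape of the caching is roughly this - a sketch rather than the actual day 11 code, with placeholder recursion rules:

use ahash::AHashMap;

// Memoise a recursive count so each (value, steps) pair is computed only once.
fn count(value: u64, steps: u32, cache: &mut AHashMap<(u64, u32), u64>) -> u64 {
    if steps == 0 {
        return 1;
    }
    if let Some(&cached) = cache.get(&(value, steps)) {
        return cached;
    }
    let result = if value == 0 {
        count(1, steps - 1, cache) // stand-in rule
    } else {
        count(value.wrapping_mul(2024), steps - 1, cache) // stand-in rule
    };
    cache.insert((value, steps), result);
    result
}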

But allocating memory has a cost too, which we handle in the next two sections.

6. Optimise memory allocation

A pattern I’d often use is to calculate something on a set of data and return a Vec:

fn foo(input: &str) -> Vec<u32> {
    input
        .lines()
        .map(|line| calculate(line))
        .collect()
}

And then the function bar() calling foo(input) would simply iterate over each u32. This is wasteful because we allocate memory for a Vec for no reason. We could save memory and time by returning an impl Iterator - an opaque type that implements the Iterator trait.

// fn foo(input: &str) -> Vec<u32> {
fn foo(input: &str) -> impl Iterator<Item = u32> + '_ {
    input
        .lines()
        .map(|line| calculate(line))
        // .collect() // Eliminate the Vec allocation
}

Changing a Vec to an impl Iterator was a small change (+3, -6) that reduced the runtime of day 19 by 25% (0.89ms -> 0.66ms).

I try to avoid allocating Strings too, preferring to borrow from the input: &str string slice. I know this borrow will always be valid because the compiler verifies it at compile time.
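
A hypothetical example of the difference - returning slices borrowed from input instead of freshly allocated Strings:

// Allocates one String per line:
// fn first_fields(input: &str) -> Vec<String>
// Borrows from `input` instead - no allocations beyond the Vec itself:
fn first_fields(input: &str) -> Vec<&str> {
    input
        .lines()
        .filter_map(|line| line.split(':').next())
        .collect()
}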

Lastly, .clone() calls can be expensive as well and should be reduced where possible. I typically make the solution work with as many clone()s as I want and optimise them out later, especially if they’re in a loop. Day 16’s runtime was reduced by 60% by eliminating a few clone()s that were called in a loop.

7. Reduce memory re-allocation

All growable containers like Vec and Hash(Map|Set) start with some amount of memory allocated to them - their capacity. Inserts are quick until the len of the container reaches its pre-allocated capacity. At that point we allocate a new buffer with twice the capacity and copy each value, one by one, over to it. Inserts are usually constant time, O(1), but when we have to re-allocate they become much slower - O(capacity).

We can eliminate this by guessing or measuring the final size of the container and initialising the container with that capacity. That way we allocate once and don’t do any expensive copies while our code runs.

// let v = Vec::new();
// let m = AHashMap::new();
let v = Vec::with_capacity(input.len());
let m = AHashMap::with_capacity(rows * columns);

On day 11 this was the only optimisation I could think of. It reduced the run-time by 35% on part 2 (4.43ms -> 2.84ms).

8. Choose the right sort() function

The Rust standard library offers 2 sort functions - sort() and sort_unstable(). The latter is slightly faster, at the cost of possibly re-ordering items that compare equal. That’s an acceptable price to pay in most cases. For day 1, sort() takes 11.6µs while sort_unstable() takes 10.6µs (8.6% faster).

To be honest this tip isn’t that useful for Advent of Code. Saving a microsecond only has an impact if you’re calling this in a loop.

One thing to remember is that Rust’s sort algorithms (driftsort and ipnsort) are general algorithms that strike a balance between run-time, compile-time and binary-size. It is possible to perform better in specific cases with a handwritten sort algorithm that prioritises run-time. Here’s a different repo where the author implemented a radix sort to become the fastest solution for day 1.

9. Choose your parser

There are several ways of parsing the input text, depending on what it looks like.

For a series of integers -> something simple like input.lines().flat_map(str::parse). If it’s a 2D grid, I’d read the characters into a Vec<char>. So far, so easy.
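
For the grid case, a sketch of what I mean (names are made up) - a flat Vec<char> plus the grid’s width is enough to index into any cell:

fn parse_grid(input: &str) -> (Vec<char>, usize) {
    let width = input.lines().next().map_or(0, |line| line.len());
    let cells: Vec<char> = input.lines().flat_map(|line| line.chars()).collect();
    (cells, width)
}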

Sometimes we get structured input, like day 13:

Button A: X+94, Y+34
Button B: X+22, Y+67
Prize: X=8400, Y=5400

My first approach was quick to write and easy to read. I used the scan_fmt crate like so:

let (x_op1, y_op1, x_op2, y_op2, x_result, y_result) = scan_fmt!(
    input,
    "Button A: X+{d}, Y+{d}
Button B: X+{d}, Y+{d}
Prize: X={d}, Y={d}",
    i64, i64, i64, i64, i64, i64
)?;

This performs identically to regular expressions, but the intent is clearer.

On day 13 in particular, I wanted to see if a second approach with a handwritten parser would be faster than scan_fmt. It was. The handwritten parser brought the total runtime down by 84%. For this problem the actual calculation time was tiny compared with parsing time.
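
The handwritten version might look something like this - a sketch along those lines, not the exact code from my repo. It pulls the six numbers out of one machine’s block without regexes or scan_fmt:

// Extract (x_op1, y_op1, x_op2, y_op2, x_result, y_result) from one day 13 block.
fn parse_machine(block: &str) -> Option<(i64, i64, i64, i64, i64, i64)> {
    // Every run of ASCII digits in the three lines, in order.
    let mut nums = block
        .split(|c: char| !c.is_ascii_digit())
        .filter(|s| !s.is_empty())
        .map(|s| s.parse::<i64>().ok());
    Some((
        nums.next()??,
        nums.next()??,
        nums.next()??,
        nums.next()??,
        nums.next()??,
        nums.next()??,
    ))
}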

Personally I’d prefer scan_fmt if I can get away with it, because it’s shorter, faster to write and easier to understand. It just works.

The only time where a handwritten parser is undoubtedly superior is when the input is both structured and irregular, like this one from 2022.

Valve BM has flow rate=24; tunnels lead to valves GH, NK, YH, OH
Valve QN has flow rate=25; tunnel leads to valve SD

For input like this a handwritten parser is definitely the way to go - the code is both fast and easy to understand.
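
Here’s roughly what that can look like for one valve line - an illustrative sketch, not the exact code from my repo:

fn parse_valve(line: &str) -> Option<(&str, u32, Vec<&str>)> {
    let name = line.get(6..8)?; // "Valve BM ..." -> "BM"
    let (flow_part, tunnels_part) = line.split_once(';')?;
    let rate: u32 = flow_part.split('=').nth(1)?.parse().ok()?;
    // Valve names are the two-letter uppercase runs after the ';'.
    let tunnels: Vec<&str> = tunnels_part
        .split(|c: char| !c.is_ascii_uppercase())
        .filter(|s| s.len() == 2)
        .collect();
    Some((name, rate, tunnels))
}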

10. Don’t be afraid to look at the assembly

I’ll show you 2 small functions that do the same thing and should perform the same, but don’t.

fn next_random(mut n: i64) -> i64 {
    n = ((n * 64) ^ n) % 16777216;
    n = ((n / 32) ^ n) % 16777216;
    ((n * 2048) ^ n) % 16777216
}

fn next_random_bitwise(mut n: i64) -> i64 {
    n = ((n << 6) ^ n) & 0xffffff;
    n = ((n >> 5) ^ n) & 0xffffff;
    ((n << 11) ^ n) & 0xffffff
}

next_random_bitwise is more than twice as fast. For day 22 part 1, where we run this function 4.5 million times, replacing next_random with next_random_bitwise reduced the run time by 64%. I thought the bitwise version was more efficient because the compiler wasn’t applying these obvious transformations for some reason. That wasn’t correct - the compiler was working exactly as intended.

Take a look at the assembly generated by the Compiler Explorer:

next_random Assembly
next_random:
mov rax, rdi
shl rax, 6
xor rax, rdi
lea rcx, [rax + 16777215]
test rax, rax
cmovns rcx, rax
and rcx, -16777216
sub rax, rcx
lea ecx, [rax + 31]
test eax, eax
cmovns ecx, eax
sar ecx, 5
movsxd rcx, ecx
xor rcx, rax
lea rax, [rcx + 16777215]
test rcx, rcx
cmovns rax, rcx
and rax, -16777216
sub rcx, rax
mov rax, rcx
shl rax, 11
xor rax, rcx
lea rcx, [rax + 16777215]
test rax, rax
cmovns rcx, rax
and rcx, -16777216
sub rax, rcx
ret
next_random_bitwise Assembly
next_random_bitwise:
mov rcx, rdi
shl rcx, 6
xor rcx, rdi
mov eax, ecx
shr eax, 5
and eax, 524287
and ecx, 16777215
xor rcx, rax
mov eax, ecx
shl eax, 11
and eax, 16775168
xor rax, rcx
ret

What the assembly tells us:

  • next_random_bitwise has fewer than half the assembly instructions (12 vs 27). That’s why it’s twice as fast.
  • The only difference is the modulo operation %. When we’re dealing with a signed integer (i64), the modulo operation needs to have different behaviour depending on whether n is positive or negative.
  • A bitwise AND operation & on the other hand, compiles down to one instruction whether n is signed (i64) or unsigned (u64).

Changing i64 to an unsigned u64 has the same effect. It is just as fast as next_random_bitwise, with nearly identical assembly.

// fn next_random(mut n: i64) -> i64 {
fn next_random(mut n: u64) -> u64 {
    n = ((n * 64) ^ n) % 16777216;
    n = ((n / 32) ^ n) % 16777216;
    ((n * 2048) ^ n) % 16777216
}
next_random Assembly
next_random_unsigned:
mov eax, edi
shl eax, 6
xor eax, edi
and eax, 16777215
mov ecx, eax
shr ecx, 5
xor ecx, eax
mov eax, ecx
shl eax, 11
and eax, 16775168
xor eax, ecx
ret

Shoutout to maneatingape for writing next_random_bitwise and sending me down this rabbit hole of assembly! Sometimes looking at the assembly output can be fascinating, although it is hard to read the output of larger functions.


Conclusion

I had a blast solving these problems and optimising them after. You can find all my solutions in this repo -> https://github.com/nindalf/advent. I’ve listed out the problems, with links to the solutions and the benchmark times, in the Readme.

Big thanks to the Rust project and the maintainers of various crates that pushed me into a pit of success. With minimal effort I was able to use tools like rayon and ahash to make my code blazingly fast.

One thing I didn’t cover is the impact of choosing good data structures and algorithms because others have covered it better than I possibly could. Better data structures and algorithms will always beat out any of the optimisations I’ve highlighted. For example, on day 9 I saw a 98.7% improvement, from 27ms to 364µs, by replacing an array scan with 10 binary heaps.

The crazy part is, all this was just the low hanging fruit. I got my solutions down to 84ms but this other repo runs the same solutions in 6 milliseconds. And a collaborative effort from the community ran in 1 millisecond - aoc-fastest. The latter used unsafe and SIMD features to show the full capability of modern languages, compilers and hardware. This post explains their optimisation process step-by-step.

If you enjoyed this essay and want to read more about optimisation, check out Nicholas Nethercote’s series on speeding up the Rust compiler over the last 10 years.

Thanks to Eric Wastl for building Advent of Code and the entire community for participating!
