Howdy folks, I wanted to bring your attention to a GitHub discussion over in the .NET runtime repository, where an experimental garbage collector called Satori has emerged that is producing very exciting numbers for those of us in the .NET performance crowd.
Quick Links:
- The comment introducing the new GC
- Benchmarks, Benchmarks, Benchmarks
TL;DR / Results
Compared to the traditional server GC in interactive mode, Satori offers improvements in synthetic benchmarks across several key metrics that range from impressive to downright shocking. How shocking?
- 50x improvement to Median Pause Time
- >100x improvement to 99th percentile pause times
- 3x improvement to Heap Size
So yeah… it is really a stunning development.
I would encourage anyone interested in writing high-performance .NET code to try out Satori (instructions below) on your own workloads to see if it offers any benefits. This feedback can help the folks at Microsoft prioritize investment in this experiment.
👇 You can find instructions on how to get it at the end of this article 👇
What is a garbage collector and why do I care?
Automated garbage collection is a way of managing memory in applications that is used by many (most?) popular languages such as C#/.NET, Java, Go, Ruby, JavaScript, PHP, and more. Notably, it isn’t used much in lower-level languages such as Rust, C/C++, and Zig, because programs in those languages need more precise control over their behavior.
Specifically, when you use a garbage collector, you are giving up control over how your application manages memory; you are also giving up some control over how your program executes in time. Most types of garbage collectors have to freeze your program in place during a phase called “stop the world” in order to walk all of the objects your program has created and see which ones are still reachable. The GC then removes the unreachable objects and does some other housekeeping, such as rearranging the surviving objects into more compact regions of memory so that your program’s memory usage doesn’t grow out of control over time.
As you can imagine, however, stopping everything your program is doing at seemingly random times, in the middle of operations, can be a disruptive event, and the longer the “pause”, the more disruptive it is. This unpredictability is a big part of why lower-level languages generally avoid garbage collection. Those languages instead require the developer to keep track of the objects they create and manually delete them from memory when they are done with them. This can be tricky to do correctly, and when a developer forgets to delete an object, a memory leak occurs, which can cause the amount of memory a program uses to grow until the program crashes.
Garbage-collected languages for the most part don’t have to worry about memory leaks (although other types of resource leaks can be just as bad or worse), and this is a big reason that GCs are so popular. They make programming safer and simpler, so most of the time you don’t have to think about them at all. They just work… until they don’t.
Large-scale and high-throughput applications, however, do have to take the behavior of the GC into account, because the pauses can become extremely disruptive. Imagine if your favorite app froze for 5 seconds every few minutes! Real-world GC pauses can last this long, or even longer, when there are large numbers of objects to track or very large amounts of memory to manage. Or sometimes a programmer might accidentally choose an algorithm that generates a lot of objects very quickly, causing the GC to “thrash”. In these cases, the amount of time the GC spends cleaning up memory can actually exceed the time spent running the application’s actual code! So a lot of care and optimization goes into making GCs as fast and efficient as possible.
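To make this concrete, .NET itself exposes a few simple APIs for observing GC activity. Here is a minimal sketch (the allocation loop is just a contrived way to generate garbage quickly, and `GC.GetTotalPauseDuration` requires .NET 7 or later):

```csharp
using System;
using System.Diagnostics;

class GcPressureDemo
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();

        // Allocate a flood of short-lived objects to create GC pressure.
        for (int i = 0; i < 1_000_000; i++)
        {
            var buffer = new byte[1024]; // becomes garbage immediately
        }

        sw.Stop();

        // Cumulative time the runtime has paused managed threads for GC,
        // compared against total elapsed time.
        Console.WriteLine($"Elapsed:          {sw.Elapsed}");
        Console.WriteLine($"Total GC pauses:  {GC.GetTotalPauseDuration()}");
        Console.WriteLine($"Gen0 collections: {GC.CollectionCount(0)}");
    }
}
```

Running something like this under different GC modes is a quick way to get a feel for how much of your wall-clock time is being eaten by collections.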
Some history of the .NET GC
The .NET garbage collector has a long history and has evolved quite a bit over the years, improving in multiple ways. Long ago, the workstation GC was created; it was essentially intended for desktop UI applications. It is more or less single-threaded and uses only a single heap to store managed objects.
As the demands of .NET applications grew, a new “server” garbage collector emerged that was designed to maximize application throughput at the expense of longer pause times. (It is generally more efficient for a GC to run a few larger collections rather than many smaller ones.) The first big difference was that the server GC used multiple heaps (generally one per CPU core) to store objects. This makes it easier for the .NET runtime to allocate and manage objects when running on very large machines with lots of CPU cores. Another difference was that the server GC would, by design, use a lot more of the computer’s memory. This was because it was trying to minimize the number of pauses by letting more objects build up over time and then doing one big collection of all of them.
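For reference, the server GC is opt-in. The usual way to enable it today is via project-file properties (a standard csproj fragment, nothing Satori-specific):

```xml
<!-- .csproj: opt in to the server GC; background (concurrent)
     collection is also controlled here and defaults to on -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
</PropertyGroup>
```

The same knobs can also be set in `runtimeconfig.json` (`System.GC.Server`) or via environment variables, which is handy when you can’t rebuild the app.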
As .NET matured, more advanced features arrived, such as the option to use concurrent garbage collection in both workstation and server modes. This allowed the GC to do part of its work while the application kept running, which helped make pause times smaller. Concurrent garbage collection was replaced with background garbage collection in later versions of .NET, which helped the garbage collector scale even better over time.
Even more innovations and features have been added over the years, largely driven by Maoni Stephens, who has been an incredible resource through her blog posts. She and the team have added more features in recent years, including a major one called DATAS, which trades a little application throughput for dramatically smaller heap sizes when using the server GC.
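If you want to try DATAS yourself on .NET 8 (where it is opt-in rather than the default), it can be enabled with a project setting on top of the server GC:

```xml
<!-- .csproj: enable DATAS (dynamic adaptation to application sizes)
     on .NET 8, where it must be opted into -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <GarbageCollectionAdaptationMode>1</GarbageCollectionAdaptationMode>
</PropertyGroup>
```

Equivalently, setting the environment variable `DOTNET_GCDynamicAdaptationMode=1` turns it on without rebuilding.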
But quietly, while this was happening in .NET, other ecosystems were also delivering innovations and improvements. Java has a vibrant ecosystem where developers have the option of swapping out their garbage collector completely rather than just changing a few options. And then Go made a quantum leap that made everyone pay attention. Advances in the Go garbage collector made pause times lightning fast, often less than 1 millisecond, which makes them pretty much invisible during request processing. This immediately made developers in other language ecosystems jealous, and the comparisons began pouring in.
As people drew comparisons between .NET and Go, the important thing to note was that although Go had smaller pause durations, .NET offered superior throughput. This was an intentional tradeoff made by the .NET team and has worked fairly well over the years. But pauses can still be painful, especially now that everything else in .NET has become dramatically faster in the last 10 years. A web request that used to take 250 milliseconds might now take only 10 milliseconds, and being interrupted by a GC pause, even a short one, shows up a lot more on monitoring dashboards than it used to.
So what is going on now?
For the last 10 or 15 years, when people would ask about alternative garbage collectors for .NET, the runtime team has patiently explained each time that .NET supports some more exotic features (such as interior pointers) that some other ecosystems (like Java) do not. They would also explain the tradeoffs and send the commenter on their way. Which is why, when I saw yet another thread asking about a pauseless GC in .NET, I hopped in to provide the standard answer.
Now skip forward two years. The thread had been untouched for 8 months when, all of a sudden, .NET runtime engineer Vladimir Sadov popped up and floated the idea that a GC could keep pauses down to 2-3 milliseconds, though that was just theoretical. Until it wasn’t. Vladimir had apparently been sitting on an experimental fork of the runtime that provides an alternative garbage collector (called Satori) that completely dominates the existing server and workstation GCs on pause latency, even with all their innovations over the years.
This is exciting for me as a .NET developer who works on financial processing systems, where latency is always a concern. The idea of having my cake and eating it too, getting both high throughput and low pause times, is really enticing, and the results in the short time so far have been nothing but encouraging.
How does it perform?
There is a drop in allocation throughput of about 15-20%, but the improvement in pause time places Satori amongst the lowest pause times in the industry. To pile on the wins, Satori keeps the heap much smaller than the server GC does. This has a significant impact on benchmark performance. For example, over a 30-second synthetic benchmark, the existing server GC spent about 2.6 seconds of that time collecting (about 8%); Satori needed only 156 milliseconds (about 0.5%). That is real time given back to the application.
Just take a look at this table from the GC stress benchmark (pause-time percentiles are in milliseconds):
| Mode | GC Count | GC Time % | Allocation Rate MB/s | p50 | p90 | p99 | pMax |
| --- | --- | --- | --- | --- | --- | --- | --- |
| workstation-batch | 38 | 88.42 | 39.1 | 971.571 | 1015.808 | 1211.597 | 1211.569 |
| workstation-interactive | 36 | 88.04 | 39.19 | 997.785 | 1037.926 | 2351.104 | 2351.104 |
| workstation-lowlatency | 2622 | 95.96 | 46.63 | 0.042 | 0.064 | 421.069 | 1131.315 |
| workstation-sustainedlowlatency | 39 | 88.69 | 37.92 | 985.497 | 1042.841 | 2156.134 | 2156.134 |
| server-batch | 19 | 10.61 | 172.7 | 157.594 | 495.616 | 495.616 | 495.616 |
| server-interactive | 20 | 11.03 | 174.46 | 148.48 | 153.6 | 772.915 | 772.915 |
| server-sustained-lowlatency | 19 | 11.23 | 172.91 | 165.888 | 801.178 | 801.178 | 801.178 |
| server-batch-datas | 78 | 42.55 | 112.94 | 154.522 | 171.622 | 491.52 | 1124.762 |
| server-interactive-datas | 49 | 96.38 | 23.61 | 1073.971 | 1116.57 | 1143.603 | 1143.603 |
| server-sustained-lowlatency-datas | 46 | 96.45 | 22.96 | 1102.643 | 1154.253 | 1171.456 | 1171.456 |
| satori-interactive | 21 | N/A | 144.75 | 0.203 | 31.166 | 27.853 | 27.853 |
| satori-lowlatency | 21 | N/A | 147.62 | 0.143 | 0.192 | 5.491 | 5.491 |
The numbers ultimately speak for themselves. In every benchmark run so far, Satori offers sub-millisecond median pause times, and sub-millisecond 90th and 99th percentile times in many. Max pause times improve on the existing server GC by anywhere from 20x to 100x or more in some cases. In case you’re wondering how the workstation GC (which was designed for low-latency workloads) fares, it would probably need to be rearchitected to even compete fairly; its single-threaded design leaves it far slower and more temperamental than the other options.
So how do I get it?
It is actually pretty simple.
- You need to build your app targeting .NET 8.0
- You need to publish your app in self-contained mode
- `dotnet publish --self-contained -c Release -o .\pub`
- Once published, copy the two modified DLLs into your publish folder. You can then run your application without any other changes.
- I have provided these for Windows here. (I’m working on building them for Linux…)
- Or, if you don’t want to download random DLLs off the internet, you can clone the .NET runtime repo and run `build.cmd clr -c Release`
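Putting the steps above together, the whole flow looks roughly like this (the source path for the replacement DLLs and the app name are placeholders; use the files from the thread or from your own build):

```shell
:: 1. Publish the app self-contained for .NET 8
dotnet publish --self-contained -c Release -o .\pub

:: 2. Overwrite the published runtime files with the two Satori DLLs
::    (source path is a placeholder -- point it at the downloaded
::     or locally built files)
copy path\to\satori-dlls\*.dll .\pub\

:: 3. Run the app as usual -- no code changes required
.\pub\MyApp.exe
```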
You can find the original instructions from Vladimir here.
Try it out, see how it performs for you, and report back to the thread if you find the results compelling. I hope you’ve found this article interesting!