A week with Satori, the experimental low-latency GC for .NET

About a week ago I wrote a post about Satori, an exciting experimental garbage collector that appeared to push the Pareto front on several key metrics at once. Put another way, this GC seems to offer a new level of near-minimal stop-the-world (STW) pause durations, minimal heap sizes and strong throughput in a number of synthetic benchmarks.

Initial benchmarking showed Satori as a shooting star of low latency, with some modest give-back on the throughput axis and great heap sizes, which seemed like a very attractive tradeoff for some classes of applications. But as more benchmarks have been run, we have begun to see some of the places where Satori doesn’t shine as brightly. So, after a week of testing, what have we learned so far? Read on below👇

By the way, if there are terms that you don’t understand, visit these annex pages for more details:

- What is a garbage collector “Generation”?

- An overview of .NET Garbage Collector Types

If you want to test Satori for yourself, you can grab binaries here (graciously built by the Osu! developers) and instructions here.

What Satori is good at

Here are the highlights:

- Some of the early synthetic benchmarks run by hez2010, huoyaoyuan and me show Satori delivering excellent (typically sub-millisecond) pause latency at the 99th and 99.9th percentiles, which is really fantastic.

- Satori also shows good results at minimizing application memory usage.

- Satori shows strong allocation throughput that is generally superior to WKS and can often rival SVR.

Here are some links to the benchmarks run so far:

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162716

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162726

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162728

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162732

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162742

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162743

- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162754

These initial results were enough to get some performance enthusiasts very excited about Satori and prompted my original blog post in the first place.

But synthetic benchmarks are not real-world tests and only cover very specific situations. They are still useful, though, because these extreme situations can occur in production systems, and almost by definition they occur at the worst possible times. You don’t want to find out under peak system load that the GC you are relying on goes into a death spiral when pushed into a corner.

A bit about how Satori works internally

One interesting Satori behavior is how it handles Generation 0 (Gen0) collections.

If you don’t know what this means, take a look at the end of this article for an overview.

Satori, under certain workloads, uses Gen0 collections extensively as opposed to the more invasive Gen1 or Gen2 collections. This is generally a good tradeoff because Gen0 collections do not have to pause the entire application.

Instead, Gen0s happen local to a specific thread whenever that thread allocates memory at a high rate, above a threshold of about 2MB/s. When this happens, Satori will occasionally pause a thread when that thread attempts to allocate a new object, in order to clean up the portion of memory that is exclusive to that thread. This can happen very, very quickly, but not instantly. I think this is a big part of how Satori achieves its high performance: it trades larger, mostly infrequent global pauses for more frequent thread-specific pauses.
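To make the idea concrete, here is a toy model of a thread-local collection. It is only a sketch of the behavior described above and not Satori's actual algorithm: the `ThreadLocalRegion` class, its fixed byte budget (a stand-in for the allocation-rate threshold) and the "reachable" flag are all assumptions made up for illustration.

```python
class ThreadLocalRegion:
    """Toy stand-in for the memory that belongs to exactly one thread."""

    def __init__(self, name, budget_bytes):
        self.name = name
        self.budget_bytes = budget_bytes  # hypothetical trigger, not a Satori setting
        self.objects = []                 # (size, still_reachable) pairs
        self.used_bytes = 0
        self.local_collections = 0

    def allocate(self, size, still_reachable=False):
        # The check runs on the allocating thread, at allocation time.
        if self.used_bytes + size > self.budget_bytes:
            self.collect_locally()
        self.objects.append((size, still_reachable))
        self.used_bytes += size

    def collect_locally(self):
        # Only this region is scanned; no other region (thread) is touched.
        self.objects = [(s, r) for (s, r) in self.objects if r]
        self.used_bytes = sum(s for s, _ in self.objects)
        self.local_collections += 1


busy = ThreadLocalRegion("worker", budget_bytes=1_000)
idle = ThreadLocalRegion("ui", budget_bytes=1_000)

for _ in range(100):
    busy.allocate(100)                    # short-lived garbage, allocated quickly
idle.allocate(100, still_reachable=True)  # a thread that barely allocates

print(busy.local_collections, idle.local_collections)  # prints: 9 0
```

The point of the sketch is only that the busy thread pays for its own garbage, many times over, while the quiet thread is never interrupted.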

Thread local pauses are better than whole application pauses for several reasons:

- Obviously, a thread-local pause doesn’t freeze your entire application, so the other threads can continue processing.

- Since you don’t have to wait on other threads, you can immediately jump into your collection tasks without delay and then return control back to the application with as little overhead as a function call.

These localized collections are really great in general, but our testing has found that Satori can be absolutely voracious when using Gen0s. Which leads to the next section…

What Satori isn’t so good at…

In fact, Satori can issue thousands of Gen0s per second across the application. Take a look at these numbers collected by hez2010:

*Metric*	Workstation GC	Server GC	DATAS GC	Satori GC	Satori GC LowLatency
Execution Time (ms)	63,611.3954	22,645.3525	24,881.6114	41,515.6333	40,642.3008
Peak WorkingSet (bytes)	1,442,217,984	4,314,828,800	2,076,291,072	1,734,955,008	1,537,855,488
WorkingSet After Test (bytes)	485,978,112	715,259,904	423,571,456	57,421,824	81,395,712
Max Pause Time (ms)	48.9107	259.9675	197.7212	6.5239	4.0979
Avg Pause Time (ms)	6.117282383	12.00785067	3.304014164	0.673435691	0.437758553
P99.9 Pause Time (ms)	46.8537	243.2844	172.3259	5.8535	3.6835
P99 Pause Time (ms)	44.0532	207.3627	57.4681	5.2661	3.2012
P95 Pause Time (ms)	39.4903	48.7269	8.92	3.0054	1.3854
P90 Pause Time (ms)	23.1327	21.4588	2.8013	1.7859	0.9204
P80 Pause Time (ms)	8.3317	4.7577	1.7581	0.8009	0.6006
Total Pause Time (ms)	31,216.492	1,801.1776	5,411.9752	209.4385	133.0786
Gen 0 Count	5,104	150	1,638	35,892	35,866
Gen 1 Count	1,707	33	203	311	304
Gen 2 Count	76	3	15	143	134

Satori performed over 35,800 Gen0 collections, whereas the SVR collector performed only 150. My own testing similarly showed Satori pausing threads up to 10,000 times per second (total across all process threads) compared to *only* ~2,000 times per second with SVR under heavy load.
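For a sense of scale, the Gen0 counts in hez2010's table can be turned into per-second rates. This is just a quick sketch using the table's own numbers; the dictionary layout and function name are mine.

```python
# Gen 0 counts and execution times copied from the table above.
runs = {
    "Server GC": {"gen0": 150, "exec_ms": 22_645.3525},
    "Satori GC": {"gen0": 35_892, "exec_ms": 41_515.6333},
}


def gen0_per_second(run):
    """Average number of Gen0 collections per second of execution time."""
    return run["gen0"] / (run["exec_ms"] / 1000.0)


for name, run in runs.items():
    print(f"{name}: {gen0_per_second(run):.1f} Gen0 collections per second")

count_ratio = runs["Satori GC"]["gen0"] / runs["Server GC"]["gen0"]
print(f"Satori ran about {count_ratio:.0f}x as many Gen0 collections as Server GC")
```

In this particular run that works out to roughly 865 Gen0s per second for Satori against fewer than 7 per second for SVR.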

Satori’s pauses are much shorter than SVR’s, but not so much shorter that they offset how many more of them there are. They are also hard to track, because normal GC metrics only count global STW pauses, so this “hidden” thread-local pause time doesn’t show up in normal analysis. In total, though, it definitely makes an impact.

In this test the application running Satori took almost twice as long to execute as the application running SVR (about 41.5 seconds versus 22.6 seconds). Giving up nearly half of your application’s throughput is a very expensive decision to make.
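The same table also shows why the reported pause numbers alone can't explain the slowdown. Here is a small sketch, again using only figures from the table (the variable names are mine), that compares the slowdown with the share of run time spent in reported pauses.

```python
# Execution time and total reported pause time, copied from the table above.
runs = {
    "Server GC": {"exec_ms": 22_645.3525, "pause_ms": 1_801.1776},
    "Satori GC": {"exec_ms": 41_515.6333, "pause_ms": 209.4385},
}
svr, satori = runs["Server GC"], runs["Satori GC"]

# How much longer Satori took, and the throughput that implies relative to SVR.
slowdown = satori["exec_ms"] / svr["exec_ms"]
relative_throughput = svr["exec_ms"] / satori["exec_ms"]


def pause_share(run):
    """Fraction of the run spent in pauses that the GC metrics actually report."""
    return run["pause_ms"] / run["exec_ms"]


print(f"Satori took {slowdown:.2f}x as long as SVR "
      f"({relative_throughput:.0%} of SVR's throughput)")
for name, run in runs.items():
    print(f"{name}: {pause_share(run):.2%} of run time in reported pauses")
```

Satori reports about 0.5% of its run time as pause time against about 8% for SVR, yet it finishes far later. Whatever is costing the extra time, it isn't visible in the reported pause totals.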

But there is a bit of hope. There is a switch to disable Gen0 collections (setting the environment variable DOTNET_gcGen0=0), and it holds many wonders for us:

*Metric*	Satori GC (No Gen0)
Execution Time (ms)	13,528.3383
Peak WorkingSet (bytes)	1,541,136,384
WorkingSet After Test (bytes)	58,818,560
Max Pause Time (ms)	1.2347
Avg Pause Time (ms)	0.13912
P99.9 Pause Time (ms)	0.9887
P99 Pause Time (ms)	0.5814
P95 Pause Time (ms)	0.3536
P90 Pause Time (ms)	0.2681
P80 Pause Time (ms)	0.1942
Total Pause Time (ms)	317.0574
Gen 0 Count	2297[1]
Gen 1 Count	2297
Gen 2 Count	1583
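Since the switch is an ordinary environment variable, it can be set for just the process under test instead of the whole machine. A minimal sketch of one way to do that from a script follows; the helper names are mine, the command you pass in is whatever launches your own application, and I'm assuming the variable only matters to a runtime built with Satori.

```python
import os
import subprocess


def env_with_gen0_disabled(base_env=None):
    """Return a copy of the environment with the Gen0 switch from the post set to 0."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_gcGen0"] = "0"  # the setting named above
    return env


def run_with_gen0_disabled(command):
    """Launch `command` (a list of arguments) with Gen0 disabled for that process only."""
    return subprocess.run(command, env=env_with_gen0_disabled(), check=False)
```

For example, `run_with_gen0_disabled(["dotnet", "MyApp.dll"])` would start a hypothetical `MyApp.dll` with the switch applied, leaving every other process untouched.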

With Gen0 disabled, this test crushed the competition, completing in 40% less time than SVR while improving 99.9th percentile pause times by an astonishing factor of roughly 246 (243.28 ms down to 0.99 ms). So there is definitely still work to be done, but there is hope for improvement through tuning.
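Those headline comparisons can be rechecked directly from the two tables. This is a quick sketch with the SVR and No Gen0 figures copied in by hand; the variable names are mine.

```python
# Server GC column from the first table, Satori (No Gen0) column from the second.
svr = {"exec_ms": 22_645.3525, "p999_ms": 243.2844, "max_ms": 259.9675}
no_gen0 = {"exec_ms": 13_528.3383, "p999_ms": 0.9887, "max_ms": 1.2347}

time_saved = 1 - no_gen0["exec_ms"] / svr["exec_ms"]
p999_improvement = svr["p999_ms"] / no_gen0["p999_ms"]
max_improvement = svr["max_ms"] / no_gen0["max_ms"]

print(f"Execution time reduced by {time_saved:.1%}")
print(f"P99.9 pause improved by {p999_improvement:.0f}x")
print(f"Max pause improved by {max_improvement:.0f}x")
```

The execution time saving comes out at just over 40%, and the worst-case (max) pause improves by a factor of about 210.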

But this behavior isn’t the most disappointing part: Satori doesn’t seem to help real world applications as much.

Satori in real applications is not as magical

As of this writing, two applications have tested Satori in their code and reported back results: the rhythm game “Osu!” and the modding application tModLoader.

The results from tModLoader show that Satori isn’t necessarily providing much benefit in an application that already allocates very little. When the application was modified to allocate much more as a test, Satori displayed more erratic performance than the existing WKS GC. That isn’t great. Part of the reason is that UI applications are mostly single-threaded, so Satori’s frequent Gen0s will primarily pause the main UI thread. It is possible that the application would respond better with Gen0 collections disabled, as in the above test, but that experiment hasn’t yet been run.

--EDIT--

Just before I posted this, I checked back with the discussion thread. Chicken-Bones (who is the lead performance engineer for the game Terraria [2]) has done some fantastic analysis and experimentation in the last few hours and has produced very compelling results with Satori by disabling Gen0 and tweaking its collection behavior to perform a collection once per frame instead of at random times. In these tests, Satori now performs the best of any collector tested!

--END EDIT--

The results from Osu! tell a more complicated story. On the one hand, Satori uses quite a bit more memory than WKS does: 2.2GB vs 1.3GB. But… this was measured on a 64GB workstation, so Satori wasn’t exactly running the system into the ground.

But the interesting note is about application performance when allocating a lot, such as when fast-scrolling a menu. You’ll have to watch the videos and draw your own conclusions, but to my eye, WKS seems to produce an average framerate of around 160 fps, sometimes dipping down to 80-90 fps. Satori, on the other hand, seems to float around 220-240 fps, sometimes surging to over 300 fps and dipping once down to 125 fps. This seems like a no-brainer, but here is the cold water you were probably expecting:

“During gameplay we're only allocating ~2MB/sec, so the GC isn't taking much away from the average but Satori is smoothing out the P99 frame times.”
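The quote talks about frame times while my numbers above are frame rates, so a small conversion helps when comparing the two. The sketch below uses only my eyeballed fps figures from the videos, so treat the results as rough; the labels are mine.

```python
def frame_time_ms(fps):
    """Average time per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Eyeballed frame rates from the menu-scrolling videos (not measured data).
observations = {
    "WKS typical": 160,
    "WKS dip": 80,
    "Satori typical": 220,
    "Satori dip": 125,
}

for label, fps in observations.items():
    print(f"{label}: {fps} fps is about {frame_time_ms(fps):.2f} ms per frame")
```

By this rough reckoning, the worst dip I saw with WKS is a 12.5 ms frame, while the worst dip with Satori is an 8 ms frame.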

So, real-world applications that are already sensitive to GC latency have already squeezed down the amount of work the GC has to do ~~so much that Satori doesn’t have much to offer that WKS isn’t already serving. The fear of GC pauses runs deep in this community~~. But Satori can still offer a compelling option at this stage with some tuning.

Conclusion

Overall, the expected but very unsexy answer is that more testing is needed to see how much impact this experimental GC can have on real-world applications. It might turn out that we don’t actually need it because SVR and WKS can carry the required workloads between them in existing applications. But the hope for Satori is that it can free application developers from needing to care quite so much about the behavior of the GC and give them back time to work on the actual problems that their software is trying to solve.

Footnotes

1. As an aside, you will note that this test still shows Gen0 collections. That is due to a quirk in how they are counted: every time a Gen1 collection happens, it also counts as a Gen0, and every Gen2 also counts as both a Gen1 and a Gen0.

2. Terraria is one of the highest selling games of all time.
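The counting quirk in footnote 1 can be undone with two subtractions. Here is a small sketch (the function name is mine) applied to the No Gen0 run, and to the default Satori run on the assumption that its counts follow the same cumulative convention.

```python
def exclusive_counts(gen0, gen1, gen2):
    """Split cumulative generation counts into collections of exactly each generation."""
    return {
        "gen0_only": gen0 - gen1,  # Gen0 count includes every Gen1 and Gen2
        "gen1_only": gen1 - gen2,  # Gen1 count includes every Gen2
        "gen2": gen2,
    }


print(exclusive_counts(2297, 2297, 1583))  # Satori GC (No Gen0)
print(exclusive_counts(35_892, 311, 143))  # Satori GC (default), same convention assumed
```

For the No Gen0 run this gives zero Gen0-only collections, 714 Gen1-only collections and 1,583 Gen2 collections, which matches what the switch is supposed to do.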

Applied Algorithms

A blog about software development