A week with Satori, the experimental low-latency GC for .NET

So, about a week ago I made a post about Satori, an exciting experimental garbage collector that showed Pareto-front advances in key metrics. Put another way, this GC seems to offer a new level of near-minimal stop-the-world (STW) pause durations, minimal heap sizes and strong throughput in a number of synthetic benchmarks.

Initial benchmarking showed Satori as a shooting star of low latency with some modest give back on the throughput axis and great heap sizes, which seemed like a very attractive tradeoff for some classes of applications. But as more benchmarks have been run, we begin to see some of the places where Satori doesn’t shine as brightly. So, after a week of testing what have we learned so far?

By the way, if there are terms that you don’t understand, visit these annex pages for more details:

-          What is a garbage collector “Generation”?

-          An overview of .NET Garbage Collector Types

And if you want to test out Satori for yourself you can grab binaries here, graciously built by the Osu! developers and instructions here.

What Satori is good at

Here are the highlights:

-          Some of the early synthetic benchmarks run by hez2010, huoyaoyuan and me show Satori delivering excellent (typically sub-millisecond) pause latency at the 99th and 99.9th percentiles, which is really fantastic.

-          Satori also shows good results at minimizing application memory usage

-          Satori shows strong allocation throughput that is generally superior to WKS and can often rival SVR.
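For readers less familiar with percentile figures such as P99 and P99.9, here is a minimal sketch of how they can be computed from a list of pause durations using the nearest-rank method. The sample data and the function name are made up for illustration. This is not how any of the benchmarks above were instrumented.

```python
import math
from fractions import Fraction


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (p as a string such as "99.9") by nearest rank."""
    ordered = sorted(samples)
    # Fraction avoids float rounding surprises for values like 99.9.
    rank = math.ceil(Fraction(p) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical pause durations in milliseconds (made up, not measured).
pauses_ms = [0.5] * 990 + [3.0] * 9 + [40.0]

for p in ("90", "99", "99.9"):
    print(f"P{p}: {percentile_nearest_rank(pauses_ms, p)} ms")
print(f"Max: {max(pauses_ms)} ms")
```

In this made-up sample a single 40 ms outlier leaves P99 and P99.9 untouched, which is why the maximum pause is usually reported alongside the percentiles.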

Here are some links to the benchmarks run so far:

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162716

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162726

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162728

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162732

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162742

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162743

-          https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162754

 

These initial results were enough to get some performance enthusiasts very excited about Satori and prompted my original blog post in the first place.

But synthetic benchmarks are not real-world tests and only cover very specific situations. They are still useful, though, because these extreme situations can occur in production systems, and almost by definition they occur at the worst possible times. You don’t want to find out under peak system load whether the GC you are relying on is going to go into a death spiral when pushed into a corner.

A bit about how Satori works internally

An interesting behavior that Satori has relates to how it performs in the Generation 0 collections.

If you don’t know what this means, take a look at the end of this article for an overview

Satori, under certain workloads, uses Gen0 collections extensively as opposed to the more invasive Gen1 or Gen2 collections. This is generally a good tradeoff because Gen0 collections do not have to pause the entire application.

Instead, Gen0s happen locally to a specific thread whenever that thread allocates memory at a high rate (above a threshold of about 2MB/s). When this happens, Satori will occasionally pause a thread when that thread attempts to allocate a new object, to clean up the portion of memory that is exclusive to that thread. This can happen very, very quickly, but not instantly. I think this is a big part of how Satori achieves its high performance: by trading larger, mostly infrequent global pauses for more frequent thread-specific pauses.

Thread local pauses are better than whole application pauses for several reasons:

-          Obviously, they don’t freeze your entire application, so it can continue processing overall.

-          Since you don’t have to wait on other threads, you can immediately jump into your collection tasks without delay and then return control back to the application with as little overhead as a function call.

These localized collections are really great in general, but our testing has found that Satori can be absolutely voracious when using Gen0s. Which leads to the next section…

What Satori isn’t so good at…

In fact, Satori can issue thousands of Gen0s per second across the application. Take a look at these numbers collected by hez2010:

| Metric | Workstation GC | Server GC | DATAS GC | Satori GC | Satori GC LowLatency |
|---|---|---|---|---|---|
| Execution Time (ms) | 63,611.3954 | 22,645.3525 | 24,881.6114 | 41,515.6333 | 40,642.3008 |
| Peak WorkingSet (bytes) | 1,442,217,984 | 4,314,828,800 | 2,076,291,072 | 1,734,955,008 | 1,537,855,488 |
| WorkingSet After Test (bytes) | 485,978,112 | 715,259,904 | 423,571,456 | 57,421,824 | 81,395,712 |
| Max Pause Time (ms) | 48.9107 | 259.9675 | 197.7212 | 6.5239 | 4.0979 |
| Avg Pause Time (ms) | 6.117282383 | 12.00785067 | 3.304014164 | 0.673435691 | 0.437758553 |
| P99.9 Pause Time (ms) | 46.8537 | 243.2844 | 172.3259 | 5.8535 | 3.6835 |
| P99 Pause Time (ms) | 44.0532 | 207.3627 | 57.4681 | 5.2661 | 3.2012 |
| P95 Pause Time (ms) | 39.4903 | 48.7269 | 8.92 | 3.0054 | 1.3854 |
| P90 Pause Time (ms) | 23.1327 | 21.4588 | 2.8013 | 1.7859 | 0.9204 |
| P80 Pause Time (ms) | 8.3317 | 4.7577 | 1.7581 | 0.8009 | 0.6006 |
| Total Pause Time (ms) | 31,216.492 | 1,801.1776 | 5,411.9752 | 209.4385 | 133.0786 |
| Gen 0 Count | 5,104 | 150 | 1,638 | 35,892 | 35,866 |
| Gen 1 Count | 1,707 | 33 | 203 | 311 | 304 |
| Gen 2 Count | 76 | 3 | 15 | 143 | 134 |


Satori performed over 35,800 Gen0 collections when the SVR collector performed only 150. My own testing similarly showed Satori pausing threads up to 10,000 times per second (total across all process threads) compared to *only* ~2000 times per second with SVR under heavy load.

Satori pauses are much shorter than SVR’s, but not that much shorter. These numbers are also hard to track because normal GC metrics only count global STW pauses, so this “hidden” pause time doesn’t show up in normal analysis. But in total, they definitely make an impact.

In this test, the application running Satori took almost twice as long to execute as the application running SVR. Giving up nearly half of your application’s throughput is a very expensive decision to make.
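As a sanity check, the ratios behind these statements can be re-derived from the Server GC and Satori GC columns of the table above. The snippet below is only arithmetic on those published figures. The dictionary layout and function names are my own.

```python
# Figures copied from the Server GC and Satori GC columns of the table above.
runs = {
    "Server GC": {"exec_ms": 22_645.3525, "gen0": 150, "total_pause_ms": 1_801.1776},
    "Satori GC": {"exec_ms": 41_515.6333, "gen0": 35_892, "total_pause_ms": 209.4385},
}


def gen0_per_second(run):
    """Recorded Gen0 collections per second of wall-clock execution time."""
    return run["gen0"] / (run["exec_ms"] / 1000.0)


def pause_fraction(run):
    """Share of the run spent in recorded pauses."""
    return run["total_pause_ms"] / run["exec_ms"]


svr, satori = runs["Server GC"], runs["Satori GC"]
slowdown = satori["exec_ms"] / svr["exec_ms"]
gen0_ratio = satori["gen0"] / svr["gen0"]

for name, run in runs.items():
    print(f"{name}: {gen0_per_second(run):.1f} Gen0/s, "
          f"{pause_fraction(run):.2%} of run time in recorded pauses")
print(f"Satori / SVR execution time: {slowdown:.2f}x")
print(f"Satori / SVR Gen0 count: {gen0_ratio:.0f}x")
```

Satori's recorded pauses come to well under 1% of its run time, yet the run takes about 1.8 times as long as SVR. That fits the point above that the cost of these collections does not show up in the normal pause metrics.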

 

But there is a bit of hope. There is a switch to disable Gen0 collections (setting the environment variable DOTNET_gcGen0=0), and it holds many wonders for us:

| Metric | Satori GC (No Gen0) |
|---|---|
| Execution Time (ms) | 13,528.3383 |
| Peak WorkingSet (bytes) | 1,541,136,384 |
| WorkingSet After Test (bytes) | 58,818,560 |
| Max Pause Time (ms) | 1.2347 |
| Avg Pause Time (ms) | 0.13912 |
| P99.9 Pause Time (ms) | 0.9887 |
| P99 Pause Time (ms) | 0.5814 |
| P95 Pause Time (ms) | 0.3536 |
| P90 Pause Time (ms) | 0.2681 |
| P80 Pause Time (ms) | 0.1942 |
| Total Pause Time (ms) | 317.0574 |
| Gen 0 Count | 2,297 [1] |
| Gen 1 Count | 2,297 |
| Gen 2 Count | 1,583 |

 

 

With Gen0 disabled, this test crushed the competition, completing in 40% less time than SVR while improving 99.9th percentile pause times by an astonishing factor of roughly 246 (243.28 ms vs. 0.99 ms). So, there is definitely still work to be done, but there is potentially hope for improvement through tuning.
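Both headline ratios can be re-derived from the two tables above. This is plain arithmetic on the published numbers, and the variable names are my own.

```python
# Server GC column of the first table vs. the Satori (No Gen0) table.
svr_exec_ms, satori_nogen0_exec_ms = 22_645.3525, 13_528.3383
svr_p999_ms, satori_nogen0_p999_ms = 243.2844, 0.9887

# Fraction of SVR's execution time saved by Satori with Gen0 disabled.
time_saved = 1 - satori_nogen0_exec_ms / svr_exec_ms
# How many times smaller the P99.9 pause is.
p999_improvement = svr_p999_ms / satori_nogen0_p999_ms

print(f"{time_saved:.1%} less time, P99.9 pause {p999_improvement:.0f}x shorter")
# prints: 40.3% less time, P99.9 pause 246x shorter
```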

But this behavior isn’t the most disappointing part: Satori doesn’t seem to help real world applications as much.

Satori in real applications is not as magical

As of this writing, two applications have tested Satori in their code and reported back the results: the rhythm game “Osu!” and the modding application tModLoader.

The results from tModLoader show that Satori isn’t necessarily providing much benefit in an application that already allocates very little. When the application was modified to allocate much more as a test, Satori displayed more erratic performance than the existing WKS GC. That isn’t great. Part of this is that UI applications are mostly single threaded, so Satori’s frequent Gen0s will primarily be pausing the main UI thread. It is possible that the application could respond better with Gen0 collections disabled as in the above test, but that experiment hasn’t yet been run.

--EDIT--

Just before I posted this, I checked back with the discussion thread. Chicken-Bones (who is the lead performance engineer for the game Terraria [2]) has done some fantastic analysis and experimentation in the last few hours and has produced very compelling results with Satori by tweaking its collection behavior, with Gen0 disabled, to perform a collection once per frame instead of randomly. In these tests, Satori now performs the best of any collector tested!

--END EDIT--

 

The results from Osu! tell a more complicated story. On the one hand, Satori uses quite a bit more memory than WKS does: 2.2GB vs 1.3GB. But… this was measured on a 64GB workstation, so Satori wasn’t exactly running the system into the ground.

But the interesting note is about the application performance when allocating a lot, such as when fast scrolling a menu. You’ll have to watch the videos and draw your own conclusions, but to my eye, WKS seems to produce an average framerate somewhere around 160 fps, sometimes dipping down to 80-90 fps. Satori, on the other hand, seems to float around 220-240 fps, sometimes surging to over 300 fps and dipping once down to 125 fps. This seems like a no-brainer, but here is the cold water you were probably expecting:

 “During gameplay we're only allocating ~2MB/sec, so the GC isn't taking much away from the average but Satori is smoothing out the P99 frame times.”

So, real-world applications that are sensitive to GC latency have already squeezed down the amount of work the GC has to do so much that Satori doesn’t have much to offer that WKS isn’t already delivering. The fear of GC pauses runs deep in this community. But Satori can still offer a compelling option at this stage with some tuning.

 

Conclusion

Overall, the expected but very unsexy answer is that more testing is needed in order to see how impactful this experimental GC can be to real world applications. It might turn out that we don’t actually need it because SVR and WKS can carry the required workloads between them in existing applications. But the hope for Satori is that it can free application developers from needing to care quite so much about the behavior of the GC and give them back time to work on the actual problems that their software is trying to solve. 

 

Footnotes

1.      As an aside, you will note that this test still shows Gen0 collections. That is due to a quirk in how they are counted: every time a Gen1 happens, it also counts as a Gen0, and every Gen2 also counts as both a Gen1 and a Gen0.

2.      Terraria is one of the highest selling games of all time.