About a week ago I wrote a post about Satori, an exciting experimental garbage collector that appeared to push the Pareto front on several key metrics at once. Put another way, this GC seems to offer near-minimal stop-the-world (STW) pause durations, small heap sizes and strong throughput in a number of synthetic benchmarks.
Initial benchmarking showed Satori as a shooting star of low latency with some modest give back on the throughput axis and great heap sizes, which seemed like a very attractive tradeoff for some classes of applications. But as more benchmarks have been run, we begin to see some of the places where Satori doesn’t shine as brightly. So, after a week of testing what have we learned so far?
By the way, if there are terms that you don’t understand, visit these annex pages for more details:
- What is a garbage collector “Generation”?
- An overview of .NET Garbage Collector Types
And if you want to test out Satori for yourself, you can grab binaries here (graciously built by the Osu! developers) and instructions here.
What Satori is good at
Here are the highlights:
- Some of the early synthetic benchmarks run by hez2010, huoyaoyuan and me show Satori delivering excellent (typically sub-millisecond) pause latency at the 99th and 99.9th percentiles, which is really fantastic.
- Satori also shows good results at minimizing application memory usage.
- Satori shows strong allocation throughput that is generally superior to WKS and can often rival SVR.
Here are some links to the benchmarks run so far:
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162716
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162726
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162728
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162732
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162742
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162743
- https://github.com/dotnet/runtime/discussions/115627#discussioncomment-13162754
These initial results were enough to get some performance enthusiasts very excited about Satori and prompted my original blog post in the first place.
But synthetic benchmarks are not real-world tests and only cover very specific situations. They are still useful, though, because these extreme situations can occur in production systems, almost by definition at the worst possible times. You don’t want to find out under peak system load that the GC you are relying on goes into a death spiral when pushed into a corner.
A bit about how Satori works internally
One interesting aspect of Satori is how it handles Generation 0 (Gen0) collections.
If you don’t know what this means, take a look at the end of this article for an overview.
Satori, under certain workloads, uses Gen0 collections extensively as opposed to the more invasive Gen1 or Gen2 collections. This is generally a good tradeoff because Gen0 collections do not have to pause the entire application.
Instead, Gen0s happen local to a specific thread whenever that thread allocates memory at a high rate (above a threshold of about 2MB/s). When this happens, Satori will occasionally pause the thread as it attempts to allocate a new object, in order to clean up the portion of memory that is exclusive to that thread. This can happen very, very quickly, but not instantly. I think this is a big part of how Satori achieves its high performance: it trades larger, mostly infrequent global pauses for more frequent thread-specific pauses.
Thread local pauses are better than whole application pauses for several reasons:
- A thread-local pause doesn’t freeze your entire application, so the other threads keep processing.
- Since the collector doesn’t have to wait on other threads, it can jump into its collection work immediately and then return control to the application with about as little overhead as a function call.
These localized collections are really great in general, but our testing has found that Satori can be absolutely voracious when using Gen0s. Which leads to the next section…
What Satori isn’t so good at…
In fact, it can issue thousands of Gen0s per second across the application. Take a look at these numbers collected by hez2010:
| Metric | Workstation GC | Server GC | DATAS GC | Satori GC | Satori GC LowLatency |
| --- | --- | --- | --- | --- | --- |
| Execution Time (ms) | 63,611.3954 | 22,645.3525 | 24,881.6114 | 41,515.6333 | 40,642.3008 |
| Peak WorkingSet (bytes) | 1,442,217,984 | 4,314,828,800 | 2,076,291,072 | 1,734,955,008 | 1,537,855,488 |
| WorkingSet After Test (bytes) | 485,978,112 | 715,259,904 | 423,571,456 | 57,421,824 | 81,395,712 |
| Max Pause Time (ms) | 48.9107 | 259.9675 | 197.7212 | 6.5239 | 4.0979 |
| Avg Pause Time (ms) | 6.117282383 | 12.00785067 | 3.304014164 | 0.673435691 | 0.437758553 |
| P99.9 Pause Time (ms) | 46.8537 | 243.2844 | 172.3259 | 5.8535 | 3.6835 |
| P99 Pause Time (ms) | 44.0532 | 207.3627 | 57.4681 | 5.2661 | 3.2012 |
| P95 Pause Time (ms) | 39.4903 | 48.7269 | 8.92 | 3.0054 | 1.3854 |
| P90 Pause Time (ms) | 23.1327 | 21.4588 | 2.8013 | 1.7859 | 0.9204 |
| P80 Pause Time (ms) | 8.3317 | 4.7577 | 1.7581 | 0.8009 | 0.6006 |
| Total Pause Time (ms) | 31,216.492 | 1,801.1776 | 5,411.9752 | 209.4385 | 133.0786 |
| Gen 0 Count | 5,104 | 150 | 1,638 | 35,892 | 35,866 |
| Gen 1 Count | 1,707 | 33 | 203 | 311 | 304 |
| Gen 2 Count | 76 | 3 | 15 | 143 | 134 |
Satori performed over 35,800 Gen0 collections where the SVR collector performed only 150. My own testing similarly showed Satori pausing threads up to 10,000 times per second (total across all process threads), compared to *only* ~2,000 times per second with SVR under heavy load.
Satori’s pauses are much shorter than SVR’s, but not so much shorter that they make up for how many more of them there are. These numbers are also hard to track because normal GC metrics only count global STW pauses, so this “hidden” pause time doesn’t show up in normal analysis. But in total, these pauses definitely make an impact.
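One way to see this “hidden” pause time in hez2010's table is to back out how many pauses each collector's metrics actually counted. The sketch below assumes the average pause is simply the total pause time divided by the number of counted pauses (my assumption, not something the benchmark documents) and compares that derived count with the reported Gen 0 and Gen 1 counts.

```python
# Figures copied from hez2010's table:
# (total pause ms, avg pause ms, Gen 0 count, Gen 1 count)
RESULTS = {
    "Workstation GC": (31216.492, 6.117282383, 5104, 1707),
    "Server GC": (1801.1776, 12.00785067, 150, 33),
    "DATAS GC": (5411.9752, 3.304014164, 1638, 203),
    "Satori GC": (209.4385, 0.673435691, 35892, 311),
    "Satori GC LowLatency": (133.0786, 0.437758553, 35866, 304),
}


def counted_pauses(total_ms, avg_ms):
    """Number of pauses implied by the metrics, assuming avg = total / count."""
    return round(total_ms / avg_ms)


for name, (total_ms, avg_ms, gen0, gen1) in RESULTS.items():
    pauses = counted_pauses(total_ms, avg_ms)
    print(f"{name}: counted pauses={pauses}, Gen 0 count={gen0}, Gen 1 count={gen1}")
```

Under that assumption, the counted pauses for WKS, SVR and DATAS line up with their Gen 0 counts (to within one), while for both Satori configurations they line up with the Gen 1 count instead. That is consistent with the thread-local Gen0s not being recorded as pauses at all.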
In this test the application running Satori took almost twice as long to execute as the application running SVR (about 41.5 seconds versus 22.6 seconds). Giving up nearly half of your application’s throughput is a very expensive decision to make.
But there is a bit of hope: a switch that disables Gen0 collections (setting the environment variable DOTNET_gcGen0=0), and it holds many wonders for us:
| Metric | Satori GC (No Gen0) |
| --- | --- |
| Execution Time (ms) | 13,528.3383 |
| Peak WorkingSet (bytes) | 1,541,136,384 |
| WorkingSet After Test (bytes) | 58,818,560 |
| Max Pause Time (ms) | 1.2347 |
| Avg Pause Time (ms) | 0.13912 |
| P99.9 Pause Time (ms) | 0.9887 |
| P99 Pause Time (ms) | 0.5814 |
| P95 Pause Time (ms) | 0.3536 |
| P90 Pause Time (ms) | 0.2681 |
| P80 Pause Time (ms) | 0.1942 |
| Total Pause Time (ms) | 317.0574 |
| Gen 0 Count | 2297[1] |
| Gen 1 Count | 2297 |
| Gen 2 Count | 1583 |
With Gen0 disabled, this test crushed the competition, completing in 40% less time than SVR while improving 99.9th percentile pause times by an astonishing factor of roughly 246x. So, there is definitely still work to be done, but there is hope for improvement through tuning.
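For anyone who wants to check those comparisons against the tables, here is a small sketch of the arithmetic. It only uses execution times and P99.9 pause times copied from the two tables; the function names are mine.

```python
# Execution time (ms) and P99.9 pause time (ms), copied from the two tables above.
EXEC_MS = {
    "Server GC": 22645.3525,
    "Satori GC": 41515.6333,
    "Satori GC (No Gen0)": 13528.3383,
}
P999_MS = {
    "Server GC": 243.2844,
    "Satori GC (No Gen0)": 0.9887,
}


def relative_time(name, baseline="Server GC"):
    """Execution time as a multiple of the baseline's execution time."""
    return EXEC_MS[name] / EXEC_MS[baseline]


def time_saved_pct(name, baseline="Server GC"):
    """Percentage of the baseline's execution time that was saved."""
    return (1 - relative_time(name, baseline)) * 100


def pause_improvement(name, baseline="Server GC"):
    """How many times smaller the P99.9 pause is than the baseline's."""
    return P999_MS[baseline] / P999_MS[name]


print(f"Satori vs SVR: {relative_time('Satori GC'):.2f}x the execution time")
print(f"Satori (No Gen0) vs SVR: {time_saved_pct('Satori GC (No Gen0)'):.1f}% less time")
print(f"Satori (No Gen0) vs SVR: {pause_improvement('Satori GC (No Gen0)'):.0f}x lower P99.9 pause")
```

Default Satori comes out at about 1.83x SVR's execution time, while the No Gen0 run saves about 40% and cuts the P99.9 pause by a factor of roughly 246.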
But this behavior isn’t the most disappointing part: Satori doesn’t seem to help real world applications as much.
Satori in real applications is not as magical
As of this writing, two applications have tested Satori in their code and reported back results: the rhythm game “Osu!” and the modding application tModLoader.
The results from tModLoader show that Satori isn’t necessarily providing much benefit in their application, which already allocates very little. When modified to allocate much more as a test, Satori displayed more erratic performance than the existing WKS GC. That isn’t great. Part of this is that UI applications are mostly single threaded, so Satori’s frequent Gen0s will primarily be pausing the main UI thread. It is possible that the application could respond better with Gen0 collections disabled as in the above test, but that experiment hasn’t yet been run.
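If you want to try the Gen0-disabled configuration on your own application, one way is to launch it from a small script that sets the switch mentioned earlier. This is only a sketch: `DOTNET_gcGen0=0` is the switch named in this post, the launch command is a placeholder you would replace with your own, and the variable should only matter when the app runs on a Satori-enabled runtime build.

```python
import os
import subprocess


def env_with_gen0_disabled(base_env=None):
    """Return a copy of the environment with the Gen0 switch set to 0."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_gcGen0"] = "0"  # switch named in this post
    return env


def run_with_gen0_disabled(command):
    """Launch `command` (a list of arguments) with Gen0 collections disabled.

    `command` is whatever starts your application, for example
    ["dotnet", "MyGame.dll"], where MyGame.dll is a hypothetical name.
    """
    return subprocess.run(command, env=env_with_gen0_disabled(), check=False)
```

Building the environment in a separate function keeps the original environment untouched, which makes it easy to run the same command with and without the switch when comparing results.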
--EDIT--
Just before I posted this, I checked back with the discussion thread. Chicken-Bones (who is the lead performance engineer for the game Terraria [2]) has done some fantastic analysis and experimentation in the last few hours and has produced very compelling results with Satori by disabling Gen0 and tweaking its collection behavior to perform a collection once per frame instead of at unpredictable times. In these tests, Satori now performs the best of any collector tested!
--END EDIT--
The results from Osu! tell a more complicated story. On the one hand, Satori uses quite a bit more memory than WKS does: 2.2GB vs 1.3GB. But… this was measured on a 64GB workstation, so Satori wasn’t exactly running the system into the ground.
The interesting note is about application performance when allocating a lot, such as when fast scrolling a menu. You’ll have to watch the videos and draw your own conclusions, but to my eye, WKS seems to produce an average framerate somewhere around 160 fps, sometimes dipping down to 80-90 fps. Satori, on the other hand, seems to float around 220-240 fps, sometimes surging to over 300 fps and dipping once down to 125 fps. This seems like a no-brainer, but here is the cold water you were probably expecting:
“During gameplay we're only allocating ~2MB/sec, so the GC isn't taking much away from the average but Satori is smoothing out the P99 frame times.”
So, real-world applications that are sensitive to GC latency have often already squeezed down the amount of work the GC has to do so much that Satori doesn’t have much to offer that WKS isn’t already delivering. The fear of GC pauses runs deep in this community. But Satori can still offer a compelling option at this stage with some tuning.
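To make the difference between “the average” and “the P99” in that quote concrete, here is a toy calculation. The frame times are made up purely for illustration (they are not measurements from Osu! or any other application), and the percentile uses the simple nearest-rank definition.

```python
def mean(values):
    return sum(values) / len(values)


def percentile(values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceiling of p * n / 100 using integer math
    return ordered[rank - 1]


# Hypothetical frame times in ms: 980 normal frames plus 20 frames hit by a GC pause.
spiky_frames = [4.0] * 980 + [20.0] * 20   # long pauses
smooth_frames = [4.0] * 980 + [6.0] * 20   # short pauses

for label, frames in (("spiky", spiky_frames), ("smooth", smooth_frames)):
    print(f"{label}: mean={mean(frames):.2f} ms, P99={percentile(frames, 99):.1f} ms")
```

In this made-up data the mean frame time barely moves (4.32 ms versus 4.04 ms) while the P99 drops from 20 ms to 6 ms, which is the kind of effect the quote describes: little change in the average, but smoother worst-case frames.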
Conclusion
Overall, the expected but very unsexy answer is that more testing is needed in order to see how impactful this experimental GC can be to real world applications. It might turn out that we don’t actually need it because SVR and WKS can carry the required workloads between them in existing applications. But the hope for Satori is that it can free application developers from needing to care quite so much about the behavior of the GC and give them back time to work on the actual problems that their software is trying to solve.
Footnotes
1. As an aside, you will note that this test still shows Gen0 collections. That is due to a quirk in how they are counted: every time a Gen1 happens, it also counts as a Gen0, and Gen2s likewise count as both Gen1s and Gen0s.
2. Terraria is one of the highest selling games of all time.
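The counting quirk in footnote 1 can be undone with a couple of subtractions. The sketch below assumes the reported counts are cumulative exactly as that footnote describes, and that the same convention applies to every collector in the tables (the second part is my assumption).

```python
def exclusive_counts(gen0, gen1, gen2):
    """Convert cumulative GC counts into collections that stopped at each generation."""
    return {
        "gen0_only": gen0 - gen1,  # collections that were only a Gen0
        "gen1_only": gen1 - gen2,  # collections that went up to Gen1 but not Gen2
        "gen2": gen2,              # Gen2 collections are reported as-is
    }


# Counts copied from the tables above.
no_gen0 = exclusive_counts(2297, 2297, 1583)
satori = exclusive_counts(35892, 311, 143)

print("Satori GC (No Gen0):", no_gen0)
print("Satori GC:", satori)
```

For the No Gen0 run this gives zero Gen0-only collections, as you would expect with Gen0 disabled, while default Satori ends up with 35,581 Gen0-only collections.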