Howdy! This is another chapter in my series about software testing. You don't have to have read any of the previous installments to understand this article, but "An Introduction to Testing the Impossible" covers some similar concepts and tools.
Today we are going to cover another advanced testing technique called Model-Based Testing that can help us find exotic bugs quickly without a lot of code. It can also go places that other testing strategies cannot, such as finding bugs that only occur after a magic sequence of operations.
If you like this one, check out some of the other pages on my blog for similar content.
Model-based testing (hereafter MBT) is a testing discipline that involves building a model (also known as a mock) of a system, then performing side-by-side tests of the system and the mock, looking for differences. The big takeaway here is that MBT is most effective at testing conformance to an idealized model of a system.
To add some more color, we can contrast this with property-based testing. With PBT, we are generally performing the same operation on a system many times with randomized inputs each time, checking to ensure that a specific behavior, also known as a "property", is upheld at all times. Model-based testing is also randomized, but instead of testing for adherence to a single hardcoded property, we are testing many or all observable behaviors of a system at the same time.
Also of note with MBT is a focus on the behavior of systems spanning multiple operations or API calls. You provide the MBT runtime with a list of possible operations on a system's API. The runtime then randomly generates sequences of these operations and executes them on the system under test (hereafter SUT) and the model side by side. In contrast, PBT typically resets the state of the SUT on each test iteration.
After each sequence of operations has executed, the model's state and the SUT's state are compared to ensure they are in agreement. If they are, the test iteration succeeds; if they are not, it fails. This "agreement" test can evaluate many properties of the system at a time, including internal state, which allows for higher-density testing.
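Conceptually (library details vary; every name below is an illustrative placeholder, not a real API), one test iteration looks something like this:

```csharp
// Conceptual sketch of one model-based test iteration.
// All names here are made-up placeholders, not a real library API.
var operations = GenerateRandomOperationSequence(); // e.g. [Put, Put, Get, Delete]
var sut = CreateSystemUnderTest();
var model = CreateModel();

foreach (var operation in operations)
{
    operation.ApplyTo(sut);   // run the same operation...
    operation.ApplyTo(model); // ...on both systems
}

// the iteration passes only if the two systems still agree
if (!StatesAgree(sut, model))
    ReportFailure(operations); // the failing sequence is the repro
```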
This multi-property testing per iteration, combined with the ability to search for bugs that only occur after many operations in a sequence, makes MBT one of the most efficient forms of test automation. Hopefully you find this as exciting as I do!
So, we've established that it is a powerful technique, but is it always appropriate? Well, it depends on how much work you want to put in. Here are a few guidelines:
- ✅ Consider MBT if your SUT has a complex implementation, but can be mocked easily
- ✅ Consider MBT if your SUT maintains internal state
- ✅ Consider MBT if you have concerns about complex interactions across your SUT's API
- 🤔 You might want to avoid MBT if your SUT has poorly defined API semantics
- 🤔 You might want to avoid MBT if your SUT has very slow performance
- 🤔 You might want to avoid MBT if your SUT's behavior changes a lot
Let's break these down. MBT works really well if you have a complicated system but a simple API. For example, consider this interface:
```csharp
public interface IKeyValueStore<TKey, TValue>
{
    TValue Get(TKey key);
    void Delete(TKey key);
    void Post(TKey key, TValue value);
    void Put(TKey key, TValue value);
}
```
This interface is pretty simple and has obvious semantics. But we can imagine that under the covers it could have a very complex implementation, using advanced data structures, value serializers and I/O operations. Meanwhile, we can mock it with minimal effort using just a `Dictionary` or map type. This is a great candidate for MBT because not only is the API a good fit, but the mock is trivial to set up.
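As a sketch (assuming `Post` means "insert new" and `Put` means "upsert" — your store's semantics may differ), the mock can be little more than a dictionary wrapper:

```csharp
// A sketch of a dictionary-backed mock. Assumes Post inserts new keys
// (throwing on duplicates) and Put upserts; adjust to your store's
// actual semantics.
public sealed class MockKeyValueStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
    where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _data = new();

    public TValue Get(TKey key) => _data[key];
    public void Delete(TKey key) => _data.Remove(key);
    public void Post(TKey key, TValue value) => _data.Add(key, value); // throws on duplicate
    public void Put(TKey key, TValue value) => _data[key] = value;     // upsert
}
```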
Next, let's look at another example:
```csharp
public record Person(int Id, string Name, string Email, ICalendar Calendar);

public record Location(string Name, ICalendar Calendar);

public record CalendarEvent(
    string Name,
    DateTimeOffset Start,
    DateTimeOffset End,
    string Description,
    Location Location,
    Person Organizer,
    ImmutableSortedSet<Person> Attendees
);

public record TimeSlot(DateTimeOffset Start, DateTimeOffset End);

public interface ICalendar
{
    public IEnumerable<CalendarEvent> GetEvents(
        DateTimeOffset start, DateTimeOffset end
    );
    public IEnumerable<TimeSlot> FindAvailableTimeSlots(
        DateTimeOffset start, DateTimeOffset end, TimeSpan duration
    );
    public void AddEvent(CalendarEvent calendarEvent);
    public void RemoveEvent(CalendarEvent calendarEvent);
}

public interface IOrganization
{
    public IEnumerable<Person> GetPeople();
    public IEnumerable<Location> GetLocations();
    public IEnumerable<TimeSlot> FindAvailableTimeSlots(
        Location location,
        IEnumerable<Person> attendees,
        DateTimeOffset searchStart,
        TimeSpan duration
    );
    public CalendarEvent BookEvent(
        string name,
        TimeSlot timeSlot,
        string description,
        Location location,
        IEnumerable<Person> attendees
    );
    public void CancelEvent(CalendarEvent calendarEvent);
}
```
In this example, the API gets a little more nuanced. The challenge here is that in order to create a mock/model of the system, we are going to have to implement some real code to match the semantics. In fact, the mock might get complicated enough that we could accidentally write a bug into the mock itself. At first, a potentially buggy mock might seem useless and a waste of time, but it can actually still help!
Just like in my article about metamorphic testing, we can still get some juice for the squeeze as long as we approach testing as an iterative, exploratory process. Now, if by chance the mock and the SUT share an identical bug that we don't have an explicit validation for, then we can't catch it. But if the two systems simply disagree because either one of them has a bug, we can catch it and investigate.
Let's look at another example:
```csharp
public interface ICompiler
{
    public byte[] Compile(string code);
}
```
Now in this example, MBT might not be as effective, for several reasons.
First, the core behavior of the system, compiling valid code, isn't really going to be exercised as a primary function of an MBT test. This is much better suited to property-based testing, which validates the output for various individual inputs rather than sequences of multiple inputs.
Second, we can reasonably expect that calling this API multiple times in a row will probably not yield many bugs. The SUT probably doesn't persist any data between calls, so there isn't much use in validating the state of the SUT after each operation, because there probably isn't any.
Third, if we are validating the output or the internal state of the SUT, then the mock will have to be of similar complexity to the SUT. This could be a huge undertaking because compilers are hellishly complex beasts. If we cut corners on the mock and make it really simple, then we might not gain insight into complex behaviors of the SUT. But if we invest large amounts of time into a high-fidelity mock, then we would still have to be able to make many assertions between the mock and the SUT to make the test exercise worthwhile overall.
This is a case that is probably better suited to property-based or metamorphic testing.
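For instance, a metamorphic-style property test can assert that a semantics-preserving rewrite of the input produces equivalent output, without needing a model at all. This is only a sketch: `MyCompiler` is a hypothetical implementation, and it assumes compilation is deterministic and insensitive to trailing whitespace, which a real compiler embedding debug info might violate.

```csharp
// Sketch of a metamorphic property for a compiler, using CsCheck.
// MyCompiler is a hypothetical ICompiler implementation; the property
// assumes deterministic, whitespace-insensitive output.
[Fact]
public void TrailingWhitespaceDoesNotChangeOutput()
{
    ICompiler compiler = new MyCompiler();
    Gen.String.Sample(code =>
    {
        var original = compiler.Compile(code);
        var padded = compiler.Compile(code + "   \n");
        return original.AsSpan().SequenceEqual(padded);
    });
}
```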
The last three rules are easier guidelines to follow.
So, does this randomized approach actually work? In short, yes, but with caveats. You might have a lingering question about how exactly random sequences of operations will produce meaningful results. In the first example, what if the MBT runtime picks a sequence of 3 `Get` operations? They will all obviously fail because the sequence never put any valid records into the database before querying it. Or perhaps a sequence did put a random record into the database, but it had a key like 237236127, and the odds of another random `Get` operation on a random key actually finding this record are basically zero. These are obvious gaps that randomized testing will struggle with. But we can employ some tactics to help the system along.
The most important thing to realize is that while it might be obvious to define an MBT operation for each of the methods in an API, we don't have to maintain a 1-to-1 relationship. We could define a composite operation that performs a `Put` for a key and then performs a `Get` for the same key. In this way we can make sure we are exercising this specific use case. Our MBT library may also allow us to define dependent sequences of operations, where the choice of the first operation determines the valid choices for the next operation.
Although composite and constrained operations are critical for ensuring efficient coverage of valid scenarios, even unconstrained random operations can hunt for bugs that you might not have been expecting. The only drawback is that you might have to trudge through some noisy data to separate real bugs from invalid scenarios.
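As a sketch, a composite put-then-get operation over the `IKeyValueStore` interface from the first example might look like this (assuming the same `Operation` extension shape that CsCheck provides, which is shown later in this article; since both sides implement the interface, one shared action serves the SUT and the mock):

```csharp
// Sketch: a composite operation that Puts a key and immediately Gets it
// back, guaranteeing the read path is exercised against freshly written
// data. Works against any IKeyValueStore implementation.
static void PutThenGet(IKeyValueStore<int, string> store, (int Key, string Value) args)
{
    store.Put(args.Key, args.Value);
    var readBack = store.Get(args.Key);
    if (readBack != args.Value)
        throw new InvalidOperationException(
            $"Put/Get round trip failed for key {args.Key}");
}

var putThenGetOp =
    Gen.Select(Gen.Int[0, 100], Gen.String)
       .Operation<IKeyValueStore<int, string>, IKeyValueStore<int, string>>(
            name: args => $"PutThenGet({args.Item1})",
            actual: PutThenGet,
            model: PutThenGet);
```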
Enough abstract explanations; let's get into a real example of how this saved me a lot of time and grief when validating what should have been a simple project.
.NET has a type called `ReadOnlySequence<T>` which is like a `List<T>`, but it might be made up of multiple linked buffers under the covers. This is useful for lots of reasons that are outside the scope of this document, but you can imagine some of them. Now, .NET defines the `ReadOnlySequence<T>`, but confusingly it doesn't define a way to actually create them 🤦‍♂️. So I defined an `ISequenceWriter<T>` that allows us to create sequences!
This seems pretty simple, but leave it to me to mess it up in subtle ways. I'm really bad at things like boundary conditions and overlapping buffers, and there is plenty of that in this type. Also, I'm doing some fancy footwork with buffer rentals and trying to maximize buffer utilization. But the interface is pretty simple:
```csharp
public interface IBufferWriter<T>
{
    void Advance(int count);
    Memory<T> GetMemory(int sizeHint = 0);
    Span<T> GetSpan(int sizeHint = 0);
}

public interface ISequenceWriter<T> : IBufferWriter<T>
{
    public ReadOnlySequence<T> ToSequence();
}
```
You rent some buffers, fill them with data, advance the writer's position, and then repeat until you eventually call `ToSequence()` to yield the final result. This is an example of where the API of the SUT doesn't make sense to implement as multiple operations. With an `IBufferWriter<T>`, you must first call either `GetSpan()` or `GetMemory()` and then call `Advance()` after you have written some data into the provided buffer. If we implemented the test as 3 separate operations (one for each method), then the MBT runtime would generate lots of invalid usages which would just throw exceptions. So instead, I've implemented a single operation that invokes the methods in a correct manner. This might seem less useful, but it increases the signal-to-noise ratio, because any exceptions we get are unexpected instead of just invalid API usage.
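In code, the intended rent-write-advance pattern looks roughly like this (the writer construction is a placeholder; my concrete type appears later):

```csharp
// Rough usage sketch: rent a buffer, write into it, advance, repeat,
// then materialize the sequence. CreateWriter() is a placeholder for
// constructing a concrete ISequenceWriter<int>.
ISequenceWriter<int> writer = CreateWriter();

var span = writer.GetSpan(sizeHint: 4); // rent a buffer of at least 4 items
span[0] = 10;
span[1] = 20;
writer.Advance(2);                      // commit only what was actually written

ReadOnlySequence<int> sequence = writer.ToSequence();
```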
Moving on, let's look at the design of the mock:
```csharp
private sealed class MockSequenceWriter<T> : ISequenceWriter<T>
{
    private readonly ArrayBufferWriter<T> _bufferWriter = new();

    public Span<T> GetSpan(int sizeHint = 0) => _bufferWriter.GetSpan(sizeHint);
    public Memory<T> GetMemory(int sizeHint = 0) => _bufferWriter.GetMemory(sizeHint);
    public void Advance(int count) => _bufferWriter.Advance(count);

    public ReadOnlySequence<T> ToSequence()
    {
        var data = _bufferWriter.WrittenSpan.ToArray();
        return new ReadOnlySequence<T>(data);
    }
}
```
This is super barebones, but it does the job. I kinda lied earlier when I said there is no implementation of a sequence writer in .NET. There is `ArrayBufferWriter<T>`, but it is super simple and no better than using a `List<T>`. But I can use it under the covers to simplify my mock. This is what you are looking for when writing a mock model in MBT: minimum code, maximum fidelity.
Next, we are going to need to get familiar with the testing framework. I am going to use the same library that I used for the metamorphic testing article: CsCheck. Let's refresh a little bit on some core concepts of CsCheck.
Central to everything in CsCheck is the concept of `Gen<T>`. Gen stands for generator, and it is a kind of factory for instances of `T`. It does some fancy stuff to make the testing process more ergonomic overall, but don't worry about that for now. There are a number of built-in primitive Gen instances for convenience, but we can transform or combine them to make more complicated structures. Here is a quick-start guide:
```csharp
[Fact]
public void Examples()
{
    var constant = Gen.Const(1);
    var randomChoice = Gen.OneOfConst(1, 2, 3, 4, 5);
    var anyInt = Gen.Int;
    var restrictedRangeInt = Gen.Int[0, 100];
    var arrayOfInt = Gen.Int.Array[1, 100];                   // length 1-100
    var simpleOnlyOddInts = Gen.Int.Where(i => i % 2 == 1);   // `Where` is wasteful
    var betterOnlyOddInts = Gen.Int.Select(i => 1 + (i * 2)); // `Select` is efficient
    var daysOfWeek = Gen.Enum<DayOfWeek>();
    var stringGen = Gen.String;
    var hexCharsOnlyString = Gen.String["1234567890abcdef"];
    var buildHexString = Gen.Char["1234567890abcdef"].Array[1, 10]
        .Select(chars => new string(chars));

    Gen<(int, char)> tupleGen = Gen.Select(Gen.Int, Gen.Char);

    // linq syntax is very convenient for composite gens
    Gen<(int, char)> linqSyntaxTupleGen =
        from i in Gen.Int
        from c in Gen.Char
        select (i, c);

    Gen<Color> complexType =
        from a in Gen.Byte
        from r in Gen.Byte
        from g in Gen.Byte
        from b in Gen.Byte
        select Color.FromArgb(a, r, g, b);

    // property based testing
    anyInt.Sample(
        iter: 100,   // optional: iteration count
        threads: -1, // optional: thread count
        predicate: i =>
        {
            // are all squares divisible by their factor? (the answer is 'no' btw)
            var square = i * i;
            return square % i == 0; // return false if assertion fails (or throw exception)
        });
}
```
Ok, now that we have gotten back up to speed on PBT, let's look at the extensions we need to make this model-based. First, let's look at the signature of the model sampling method:
```csharp
public static void SampleModelBased<Actual, Model>(
    this Gen<(Actual, Model)> initial,
    GenOperation<Actual, Model>[] operations, ...);
```
So basically, we need to provide a Gen that will build the initial states of our mock and our SUT. We are not going to do much with this in our example project, but you can imagine how it can be useful to also be able to randomize the initial state of our system before executing operations on it.
Then we need one or more "GenOperations" for the CsCheck runtime to arrange into sequences for us. Not too bad. I should note now that there are both an `actual` and a `model` type parameter on this method, so you can actually use completely different types for the SUT and the mock without the need for a unifying interface. Neat! But we are actually going to consolidate some of the implementations because we do in fact share an interface.
I thought about walking you through the process that I went through of refining the implementation down to a minimal form, but I think I am going to skip ahead some and just show the finished product with some comments added in for clarity.
```csharp
// this object holds all the arguments needed to perform an operation
private sealed record RentWriteAdvanceOpArgs(
    BufferType BufferType,
    int RentByteCount,
    int WriteByteCount
);

public enum BufferType { Span, Memory }

[Fact]
public void ModelConformanceTest()
{
    // this is the gen that creates the initial states of our SUT and mock.
    // for our test, we are always going to create the same initial objects,
    // so we use Const
    var initial = Gen.Const(() => (
        new SequenceWriter<int>(),
        new MockSequenceWriter<int>()
    ));

    // next we build the object that represents the arguments of the
    // operation; we will talk about what these args do later
    var rentAndWriteGen =
        from bufferType in Gen.Enum<BufferType>()
        from rentCount in Gen.Int[0, 5000]
        from writeCount in Gen.Int[0, rentCount]
        select new RentWriteAdvanceOpArgs(bufferType, rentCount, writeCount);

    // this is how we define the GenOperation type. it encapsulates the
    // logic of the operation and has 3 parts:
    //  - a function to label the instance of the operation (for debugging)
    //  - a function to perform the operation on the SUT/actual
    //  - a function to perform the operation on the mock/model
    var rentWriteAdvanceOp =
        rentAndWriteGen.Operation<ISequenceWriter<int>, ISequenceWriter<int>>(
            name: args => $"[{args.BufferType}, {args.RentByteCount}, {args.WriteByteCount}]",
            actual: RentWriteAdvancedAction,
            model: RentWriteAdvancedAction
        );

    // we wrap the operation(s) in an array
    var operations = new[] {
        rentWriteAdvanceOp
    };

    // finally we put it all together
    Check.SampleModelBased(
        initial, operations, TestEquality, printActual: Print, printModel: Print,
        threads: -1, iter: 100, seed: null // configuration
    );
}
```
To recap what we are seeing here, we first declare the generator for the initial state of the SUT and model. This is because the MBT runtime has support for starting the test iteration from arbitrary points and then running arbitrary sequences of operations on it.
The next detail is that we have defined an object called `RentWriteAdvanceOpArgs`, which encapsulates the arguments needed to perform a rent-write-advance operation. In this way CsCheck separates data from logic.
Next up, we define the actual operation for this test. A GenOperation object includes two operation functions, one for the actual/SUT and one for the model/mock. These hooks are functions that accept the test target object and the randomly generated argument object that parameterizes the operation. Typically, this is done by passing a lambda to each prong of the test (SUT and mock), but I have used a shared function to save code, because both my SUT and mock present the same interface.
It is important to note that even though there is just one defined operation, the MBT runtime will still chain together randomly generated sequences of randomized operations, so we are still exercising the API effectively.
The actual test function looks like this:
```csharp
// opArgs is a randomly generated set of arguments for the operation
private static void RentWriteAdvancedAction(
    ISequenceWriter<int> sequenceWriter, RentWriteAdvanceOpArgs opArgs)
{
    // pick a buffer type to rent from the sequence writer;
    // we do this so we are exercising both APIs
    var buffer = opArgs.BufferType switch
    {
        BufferType.Memory => sequenceWriter.GetMemory(opArgs.RentByteCount).Span,
        BufferType.Span => sequenceWriter.GetSpan(opArgs.RentByteCount),
        _ => throw new NotSupportedException()
    };

    // assert that the returned buffer is at least as large as the
    // requested size; this is a requirement of the API
    Guard.HasSizeGreaterThanOrEqualTo(buffer, opArgs.RentByteCount);

    // write known values to the buffer so that we can verify the
    // correctness of the data later
    for (var i = 0; i < opArgs.WriteByteCount; i++)
        buffer[i] = i;

    // advance the sequence writer, leaving it ready for the next operation
    sequenceWriter.Advance(opArgs.WriteByteCount);
}
```
Finally, after the setup, we call `SampleModelBased` in order to start the test.
```csharp
Check.SampleModelBased(
    initial, operations, TestEquality, printActual: Print, printModel: Print,
    threads: -1, iter: 100, seed: null // configuration
);

// ...

private static bool TestEquality<T>(ISequenceWriter<T> actual, ISequenceWriter<T> model)
{
    var actualData = actual.ToSequence().ToArray();
    var modelData = model.ToSequence().ToArray();
    return actualData.SequenceEqual(modelData);
}

private static string Print<T>(ISequenceWriter<T> writer)
{
    var data = writer.ToSequence();
    var builder = new StringBuilder();
    _ = builder.Append('[');
    foreach (var segment in data)
    {
        foreach (var item in segment.Span)
        {
            _ = builder.Append(item).Append(',');
        }
    }
    _ = builder.Append(']');
    return builder.ToString();
}
```
I am passing several optional arguments to the function for convenience of debugging, but the last required one is the `TestEquality` method, which is used by CsCheck to validate that the SUT and the mock are semantically equivalent. In my case, the only thing I care about is "is the data exactly the same?", so the method is pretty simple.
If our test fails, CsCheck will report back to us with a message that looks like this:
```text
CsCheck.CsCheckException : Set seed: "0nQm8eH9ps6c" or -e CsCheck_Seed=0nQm8eH9ps6c to reproduce (3 shrinks, 77 skipped, 100 total).
Operations: [{Span Rent=3448 Write=3119}, {Memory Rent=1162 Write=639}, {Span Rent=1131 Write=801}]
Initial Actual: []
Initial Model: []
Exception: System.ArgumentException: Parameter "buffer" (System.Span<int>) must have a size of at least 1162, had a size of <977>. (Parameter 'buffer')
   at ___.Core.Tests.SequenceWriterTests.RentWriteAdvancedAction(ISequenceWriter`1 sequenceWriter, RentWriteAdvanceOpArgs operation) in D:\Repos\___\___\src\tests\Core.Tests\SequenceWriterTests.cs:line 52
   at CsCheck.GenOperation.<>c__DisplayClass3_1`3.<Create>b__1(Actual a)
   at CsCheck.Check.<>c__DisplayClass62_0`2.<SampleModelBased>b__1(ModelBasedData`2 d)

CsCheck.CsCheckException : Set seed: "0yhbcum0j1Cf" or -e CsCheck_Seed=0yhbcum0j1Cf to reproduce (22 shrinks, 10,099,674 skipped, 10,260,565 total).
Operations: [{Span Rent=99 Write=93}, {Span Rent=44 Write=12}]
Initial Actual: []
Initial Model: []
Exception: System.ArgumentException: Parameter "buffer" (System.Span<int>) must have a size of at least 44, had a size of <35>. (Parameter 'buffer')
   at ___.Core.Tests.SequenceWriterTests.RentWriteAdvancedAction(ISequenceWriter`1 sequenceWriter, RentWriteAdvanceOpArgs operation) in D:\Repos\___\___\src\tests\Core.Tests\SequenceWriterTests.cs:line 52
   at CsCheck.Check.<>c__DisplayClass62_0`2.<SampleModelBased>b__1(ModelBasedData`2 d)
```
It worked shockingly well for this problem. This particular example of a sequence writer was almost tailor-made for MBT, since there are all kinds of weird edge cases around the buffer renting logic.
I'm going to be a bit vulnerable here and mention the bugs I found with this test:
That second bug was particularly nasty, because everything would work fine until a certain point. For example, if you asked `GetSpan` for a buffer of 600 bytes and then wrote 600, then asked for another buffer of 300 bytes and wrote 300, everything would work fine, because it just so happened that the buffer that actually got allocated was 1024 bytes, so it would hold everything you wrote in a single segment. It wasn't until you tried to rent a bigger chunk after renting a smaller chunk that things would start to break. And this bug might actually change in behavior depending on the `MemoryPool` that the `SequenceWriter` was constructed with. But my test caught it.
There are a hundred edge cases that might lead to incorrect behavior. If we were writing this test by hand, how soon would we get tired of coding these edge cases before we gave up and assumed it was working? CsCheck will produce sequences of up to 127 operations to find bugs. That is a lot more due diligence than most of us are willing to put into our manual test suites.
For the price of about 100 lines of code, we get a pretty extensive degree of correctness testing. Based on this test alone, I ended up making 40 lines of changes to my 250 line implementation. Some of those lines were simple fixes, but most were refactors in order to make the code more obvious and less error prone. Once my test passed cleanly with the default 100 iterations, I ran it with 100,000 and it also passed. With this single test, I feel pretty damn confident that my implementation is doing what it is supposed to for any kind of input that is thrown at it.
Lastly, it is also important to note that during this process I made several edits to the test itself. You see, each time the test failed, I had to ask, "is the test broken, or is the SUT?". This feedback loop is essential during the testing phase, and it becomes even more useful the earlier you apply it in the development lifecycle. The testing library can ask you questions about your design that you couldn't even think of, and this can dramatically enrich the robustness of your design and implementation.
I hope this has been a useful article. If so, let me know so that I know to write more like it!
"What is the difference between a class and a struct in C#?" I like this as a lead-in interview question because most candidates can offer up a quick answer that can help them get some confidence rolling in case they are nervous (we are not generally looking to assess candidates for performance under duress). There is a spectrum of correct answers that can shed light on how the candidate thinks about this fundamental component of the .NET type system. I'm going to lay out some of the ways that I think about structs in .NET.
Some folks have never heard of structs. A struct (also known as a Value Type) is an alternative way of defining a type in C# and other .NET languages like F# and VB.NET. They have various tradeoffs, but they have been around since the beginning of .NET. My speculation is that the .NET designers looked at how Java was designed to treat everything as an object and observed that the Java compiler wasn't good enough at optimizing the usage of normal classes for them to always be appropriate from a performance perspective.
They share many similarities with classes, but are completely different in other ways. So let's jump right in!
I hear this answer a lot and it does grasp the basic usage guideline of when to use structs. The traditional guidance from Microsoft was that structs were appropriate for data between 1 and 16 bytes per object. Larger than that, and you should consider using a class. Even after 20 years, this guidance is correct enough for beginners to pay attention to, so if you only remember one fact about structs and classes, it should probably be this one.
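For example, a small coordinate type (two doubles, 16 bytes of data) sits right at that guideline and is a classic struct candidate. The type here is made up for illustration:

```csharp
// Illustrative: a small, immutable value type that fits the ~16-byte
// guideline (two 8-byte doubles).
public readonly struct Point2D
{
    public readonly double X;
    public readonly double Y;
    public Point2D(double x, double y) => (X, Y) = (x, y);
}
```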
At the next level, a candidate might recognize another defining characteristic of structs: they can't participate in traditional inheritance like classes can [0]. Since inheritance is a pillar of most object-oriented programming languages, this would seem to be a major limitation! But it is very important to note that structs CAN implement interfaces, which gets us polymorphism, which is good enough for most purposes of object orientation.
In fact, if you take into account Default Interface Methods, which were introduced relatively recently in .NET, you can get the code-sharing advantages that used to be reserved for classes! But that is kind of a bonus fact that probably belongs at a higher level.
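To illustrate (the types here are made up for the example): a struct cannot inherit from another type, but it can implement an interface and be used polymorphically:

```csharp
// Illustrative only: structs cannot inherit from classes or other
// structs, but they can implement interfaces.
public interface IArea
{
    double Area();
}

public readonly struct Circle : IArea
{
    private readonly double _radius;
    public Circle(double radius) => _radius = radius;
    public double Area() => Math.PI * _radius * _radius;
}

// public struct FancyCircle : Circle { } // does not compile: no struct inheritance
```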
Ok, at this level the candidate recognizes that there are two major locations in memory where object instances live: the "heap" and the "stack". All class object instances live on the heap [1]. It is a general-purpose dumping ground for data and it can grow to near infinite sizes, which is part of the reason that we recommend designing large objects as classes instead of structs.
Another key detail about the heap is that objects allocated on the heap eventually have to be cleaned up by the Garbage Collector, and this has negative performance implications for high-throughput applications. So for small, commonly created objects, it is best to define them as structs so that we can avoid the overhead of garbage collection.
Lastly, in contrast to the heap, the "stack" is very limited in size: usually just 1-4 MB, compared to the heap, which can be many GB or more. You might think that this is why there is a recommended 16-byte limit on the size of structs, but it isn't (that is coming up next).
Now we are getting into real details that start to affect high-performance applications. Somewhere around skill levels 3 or 4, a candidate should also know that structs are pass-by-value and that classes are pass-by-reference. This is a subtle but very important detail that basically means that when we have an instance of a class stored in a variable, what we are actually storing under the covers is a "reference" (aka a pointer) to a location in the heap. This reference is very much like a postal address in the real world: a small piece of information that tells us where to find the real object we are looking for.
Structs are completely different. When we store a struct in a variable, we are actually copying the entire value of that object instance directly into the variable. If we pass this variable into a function, we copy the entire value of the struct instance into the method we are calling. With a class, we are just passing the reference, which is always limited to 8 bytes or less.
The big idea to take away at this level is that if we were to define a very large struct, let's say 100 bytes, then every time we passed that struct into a new method, the computer would have to copy a lot of data around again and again, which starts to eat away at the performance of our application. This is the real reason that Microsoft has that 16-byte guideline on the size of structs [2]. At this level we should also point out that Microsoft violates their own 16-byte limit in several structs in the BCL, and they do this because that limit is a general guideline and not a hard limit. If you want, you can define a struct that is 100 bytes or 100 kilobytes, and your program will just slow down more and more if you use it a lot. But be aware that at a certain point the limitations we talked about at the end of level 3 will become increasingly important.
This is a VERY important detail about structs. They can be a bug farm in your code. You see, because structs are pass-by-value they are implicitly copied every time you call a method or even assign them to a new variable name.
So let's say that you have a struct and you pass it into a child method that does some work, modifies a field or a property on the struct, and then returns. Simple enough, but it's a bug. When you pass the struct into the child method, the compiler implicitly makes a copy of your struct. So the struct instance that got modified in the child method is not the struct instance that exists in the parent method.
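A minimal repro of that bug (the type is made up for the example):

```csharp
// Demonstrates the copy-then-mutate bug: Increment mutates a copy,
// so the caller's struct is unchanged.
public struct Counter
{
    public int Value;
}

public static class CopyBugDemo
{
    static void Increment(Counter counter)
    {
        counter.Value++; // mutates the local copy only
    }

    public static void Run()
    {
        var c = new Counter();
        Increment(c);
        Console.WriteLine(c.Value); // prints 0, not 1
    }
}
```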
It is SUPER easy to introduce this kind of copy-then-mutate bug into your code, and as a result the use of mutable structs for storing state is STRONGLY discouraged by Microsoft and everyone else out there. But if you look at my code, you will see custom structs all over the place. So how do I do this and avoid these bugs? Immutability! I almost always define my structs as `readonly`, which prevents them from being mutated by any code. You can't have a copy-then-mutate bug if there is no mutation! We still have to take things like level 4 into consideration when choosing structs over classes, but this level's warning simply doesn't apply when using `readonly` structs.
Boxing is an important concept in .NET that can have big performance implications when you are using structs. Basically, it is an exception to level 3, where a struct can accidentally end up on the heap and become subject to garbage collection just like a class.
Here's how it works: let's say you have a struct that implements `IDisposable` and you pass it to a method that accepts an argument of type `IDisposable`. Because of "magic" ✨ in the .NET runtime, this works just fine, and you don't really notice anything different than passing a class that implements `IDisposable`. But this magic is complicated under the covers. The runtime writes special low-level code inside of our method that is designed to accept any type of object that implements the `IDisposable` interface. Specifically, the runtime performs a lookup on each object to find the exact method slot to call on that specific object. That lookup is called virtual dispatch, and it involves reading a section of the object called the object header. The problem is that only objects on the heap have object headers. Structs don't have headers; they are pure data, which is part of why structs are so lightweight.
But we just said that you CAN in fact pass a struct to a method that expects a heap object with a header, so what gives? Well, the runtime makes this work by allocating a special temporary object called a "box" on the heap, just like a class. The box has an object header and everything else needed to perform virtual dispatch. The runtime copies the data in the struct into the box and then passes the box to the method. Easy peasy.
But the solution creates another problem. When the method completes, the box has to be cleaned up just like a normal class object. So every time you invoke a method that expects an interface (or a parameter of type `object`), a new box gets created and then thrown away. Now imagine invoking that method thousands of times per second and you can see that it adds up quickly. On top of the pain for the garbage collector, there is the overhead of actually creating the box object and copying the data into it. It can drag application performance down.
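A tiny sketch of where the box sneaks in (hypothetical names):

```csharp
using System;

// A small struct that implements an interface.
struct Token : IDisposable
{
    public void Dispose() { }
}

class Program
{
    // The parameter type is the interface, so any struct passed in must
    // first be boxed: copied into a heap object that has an object header.
    static void CloseIt(IDisposable d) => d.Dispose();

    static void Main()
    {
        var t = new Token();
        CloseIt(t); // a fresh box is allocated on the heap for every call
    }
}
```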
Now the final bit to this level is that it isn't the whole story. In C# you can use a technique called "generic constraints" to change how the compiler reasons about your use of types. When you do this, .NET can elide creation of the box and use the struct directly in the child method.
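Continuing the hypothetical `Token` example from above, the constrained version looks like this:

```csharp
using System;

struct Token : IDisposable
{
    public void Dispose() { }
}

class Program
{
    // With a generic constraint, the runtime specializes this method for
    // each value type, so Dispose is called on the struct directly:
    // no box is allocated and the GC has nothing extra to clean up.
    static void CloseIt<T>(T d) where T : IDisposable => d.Dispose();

    static void Main()
    {
        CloseIt(new Token()); // T is inferred as Token; nothing is boxed
    }
}
```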
Up until this point, a candidate is displaying some decent depth of understanding of the limitations of structs and why they are not always a good choice. But it has all been drawbacks and warnings about using structs. At Level 7, they are getting into why structs are actually amazing. We have talked quite a bit about how method calls work under the covers in .NET. Now we need to learn that not all method calls are created equal.
We talked a little bit about the concept of dispatch in Level 6. This is a fundamental concept of compiler theory that every programming language needs to implement. In .NET there are actually many different types of method dispatch.
Generally, when you invoke a method on a class, you get a virtual dispatch like we talked about in Level 6. If you pass a class object of type `MyClass`, the compiler has to account for the fact that the actual type of the object could be `MyClass` or some other child or grandchild class that inherits from `MyClass`. So, every time we invoke the method, the runtime needs to peek under the covers to see the exact type of the actual object so that it can know which method slot to use.
But when you invoke a method on a struct (excluding boxed structs) you get a simpler, faster type of dispatch called "direct dispatch". Direct dispatch is faster because the runtime can determine the exact method slot to invoke ahead of time, before the method is called. This is because structs don't support inheritance, so if we pass a struct of type `MyStruct` to a function, the compiler knows ahead of time that it only needs to support objects of the exact type `MyStruct` and selects the correct method slot without checking.
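To make the contrast concrete, here is a small sketch (the types are made up for illustration):

```csharp
using System;

class Animal
{
    public virtual string Speak() => "...";
}

class Dog : Animal
{
    public override string Speak() => "Woof";
}

struct Cat
{
    public string Speak() => "Meow";
}

class Program
{
    static void Main()
    {
        Animal a = new Dog();
        // Virtual dispatch: 'a' could hold any Animal subtype, so the
        // runtime consults the object's method table to find the override.
        Console.WriteLine(a.Speak()); // prints "Woof"

        var c = new Cat();
        // Direct dispatch: structs cannot be inherited, so the exact
        // target method is known at compile time and called directly.
        Console.WriteLine(c.Speak()); // prints "Meow"
    }
}
```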
Now the difference between virtual and direct dispatch is just a few nanoseconds, but as we have said before, if you are calling a method thousands or millions of times per second, you will start to notice a difference.
Lastly, it is important to note that there are exceptions to these guidelines in both directions, boxing and sealed classes being two examples of when structs and classes use different forms of dispatch than normal. But using structs to gain access to direct dispatch is a reliable way to have the power of abstraction without the performance penalty.
At this level, we attain enlightenment. We no longer fear the use of large structs because we know that we can pass those structs around by reference using `in` and `ref` and avoid copying their data at all. And because of that, we recognize there are now times that we can ignore Level 3 to obtain optimal performance.
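A quick sketch of passing a large struct by readonly reference (sizes and names are illustrative):

```csharp
using System;

// A deliberately large struct: 64 bytes of fields.
public readonly struct BigValue
{
    public readonly long A, B, C, D, E, F, G, H;
    public BigValue(long v) { A = B = C = D = E = F = G = H = v; }
}

public static class Demo
{
    // Passed by value: all 64 bytes are copied on every call.
    static long SumByValue(BigValue v) => v.A + v.H;

    // Passed by 'in' (readonly reference): only a pointer-sized reference
    // is passed, and the callee cannot mutate the caller's value.
    static long SumByRef(in BigValue v) => v.A + v.H;

    public static void Main()
    {
        var big = new BigValue(21);
        Console.WriteLine(SumByValue(big));
        Console.WriteLine(SumByRef(in big)); // no 64-byte copy here
    }
}
```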
We learned about ref-structs and their ability to store managed references, which unlocks the ability to reference exotic forms of memory such as stack-allocated and unmanaged memory. We can control the exact way our data is represented in memory using `StructLayout` and `InlineArray` and that we can slice memory buffers into shapes that look like packet headers and data structures because we see the 1s and 0s underneath it all. We are pooling our buffers, using zero-copy I/O, using SkipLocalsInit, swapping out memory allocators and have become aware of memory alignment and SIMD operations. We have entered the realm of the fastest systems languages on earth.
And at the end of it all, we have realized the most important thing about this power. That we don't really need it. Classes work for 99.5% of all the code that we write on a daily basis. The .NET Garbage Collector is pretty damn good at managing memory. The walled garden of .NET memory management saves us a lot of pain that we never even had to learn about, such as memory fragmentation. The simple rule we learned in Level 1 ultimately stays intact. If you look at my code, you will see a lot of `readonly struct` and `readonly record struct`, but they are almost all very small and expected to be created and thrown away quickly. If I need a type to implement an interface, I just use a class 99% of the time rather than the struct + generic constraint technique from Level 6. It's just easier.
If I were writing database engines or high-throughput streaming data processors, then I would make very different choices. But in a world where my application is spending most of its time waiting on some external database or an obnoxiously slow 3rd party system, there is just no point in trying to optimize something from 1-2 milliseconds down to 1-2 microseconds. Nobody will notice it. So when in doubt, just take the easy way and use a class 🙂.
[0] If you're a smarty pants, you realize that all structs actually inherit from `System.ValueType`, which itself inherits from `System.Object`. But aside from this special case, we cannot define a struct that inherits from another struct or from a class.
[1] : In recent versions of .NET the runtime is now smart enough to stack allocate class instances in some limited cases, which bypasses the Garbage Collector, so this rule is no longer exactly correct!
[2] : So there is another reason, and that has to do with the compiler's historical ability to "enregister" struct values, which is a very low-level detail of something called a "Calling Convention" that .NET uses to tell the CPU exactly how to perform a method call. But don't worry about that too much 🙂.
This post is a beginner's walkthrough of how to use Dafny to build formally verified software. It is written at a junior or beginner developer level and should be approachable by anyone with a little experience in C#, Java, JavaScript or other similar languages.
If you are not familiar with Formal Methods or the Dafny Verification Language see my previous article for an introduction.
You might remember from last time that I promised to walk through my formal proof of a basic communication protocol. That's not gonna happen. Instead we are going to do something simpler than that. Today we are going to walk through how to formally prove a basic Mutex, more commonly called a thread lock. If you are not experienced with multithreaded programming, don't worry, we are not going to get deep into that, it's just used as a toy example.
You can follow along by downloading VSCode and installing the Dafny extension. Then create an empty file called Lock.dfy and you are ready to go! Dafny is a huge language, but it is well documented.
Source code for this post is available here.
Locks are critical to writing multi-threaded code correctly. They are also called Mutexes, which stands for "Mutual Exclusion". The idea is that you have an important piece of code called a critical section where, if multiple threads (or users) were to try to access it at the same time, "Bad Things™" would happen. (When we say "Bad Things™", we are talking about crashes, infinite loops, corrupt or incorrect data and other things that would negatively impact our software's operation.)
So we use a lock to check if someone else is currently accessing the critical section before we enter it. Now that we have introduced the concept, let's give it a definition.
The first, and perhaps most important, step of designing formal systems is specifying how the system should and should not behave. From there, we can design the abstraction that we need to fulfill that contract of behavior. For our lock, this will be pretty brief.
From here we can build an implementation that we expect to work.
A lock, for our purposes, is a structure that is equipped with two methods:
TryAcquireLock accepts the id of the thread that is attempting to acquire the lock and returns a bool that indicates if the lock was successfully acquired. A thread may acquire a lock if and only if no other thread has acquired it.
ReleaseLock, predictably, returns the lock to the default state, permitting another thread to subsequently acquire it. In almost all circumstances, this method is only called by the thread that successfully acquired the lock.
We are going to implement the lock with something called a compare-and-swap operation, also known as a compare-and-exchange. This is typically a special operation provided by the CPU that checks a variable and replaces it with a new value only if its current value matches some target value, and the CPU does this as a single atomic operation. Our compare-and-swap isn't the actual CPU instruction, but it will behave the same for the purposes of our proof.
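For reference, here is what a compare-and-swap based lock looks like in real C#. (Note that `Interlocked.CompareExchange` returns the ORIGINAL value of the variable, whereas the simplified CAS we use in this post returns the swapped-in value on success.)

```csharp
using System.Threading;

class SimpleLock
{
    private int _owner; // 0 means "no owner"

    public bool TryAcquire(int threadId)
    {
        // Atomically: if _owner == 0, set it to threadId. The original
        // value is returned, so acquisition succeeded iff that value was 0.
        return Interlocked.CompareExchange(ref _owner, threadId, 0) == 0;
    }

    public void Release()
    {
        // Return the lock to its default, unowned state.
        Interlocked.Exchange(ref _owner, 0);
    }
}
```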
Those familiar with locks might ask about advanced scenarios such as reentrancy, which means that a thread can re-acquire a lock it has already acquired. We are not going to support that for now. If a thread tries to enter a lock twice, we will reject it.
Technically this implementation is slightly different from the spec because it stores the owning thread in its state. But we can treat it the same from a verification standpoint, and it sets us up nicely for extending it in the future.
Rather than spending time walking through writing the Lock class, I am going to provide the implementation up front. The focus of this post is on proving that it is correct. Take a look:
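A minimal sketch of such a Lock class in Dafny, using the names discussed below (a reconstruction, so details may differ from the full listing):

```dafny
newtype ThreadId = i: nat | i <= 0xffff_ffff

class Lock {
  const NULL_THREAD_ID: ThreadId := 0
  var ownerThread: ThreadId

  constructor() {
    ownerThread := NULL_THREAD_ID;
  }

  // A simplified compare-and-swap: if ownerThread matches valueToCompare,
  // swap in valueToSwap and return it; otherwise return the current owner.
  method CompareAndSwap(valueToCompare: ThreadId, valueToSwap: ThreadId)
    returns (result: ThreadId)
    modifies this
  {
    if ownerThread == valueToCompare {
      ownerThread := valueToSwap;
      result := valueToSwap;
    } else {
      result := ownerThread;
    }
  }

  method TryAcquireLock(threadId: ThreadId) returns (success: bool)
    modifies this
  {
    var casResult := CompareAndSwap(NULL_THREAD_ID, threadId);
    return casResult == threadId;
  }

  method ReleaseLock()
    modifies this
  {
    ownerThread := NULL_THREAD_ID;
  }
}
```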
Ok, let's unpack this a bit.
newtype ThreadId = i: nat | i <= 0xffff_ffff
First we see the newtype declaration; this declares a new type that is convertible to, and works the same as, an underlying type. This is useful for keeping various values from getting mixed up. In this case the underlying type is "nat" with a restriction that only permits values up to 2^32 - 1, which is the domain of a uint32. So basically a ThreadId is a uint32 in our program.
In most programming languages we use specifically sized types like uint32 and uint64 to store numbers. This is because in a real program, we need to know how much space to use to store the variable in order to process data efficiently.
But Dafny operates in an abstract space where we are not constrained to finite datatypes. Specifically, nat refers to an abstract "natural number". Natural numbers are defined as 0 and all of the positive integers: 0, 1, 2, 3, ... We then restrict it down to a uint32, not because we "have to", but because we want this spec to model the real way that a C# or Java program would operate. This matters because weird bugs can happen at the lower and upper values of an integer's range (0 and 0xFFFFFFFF), and if we allow unlimited-sized numbers in Dafny, we will not catch bugs at the upper edge of the value range that might exist in C#, because there is no upper edge to "infinity".
Next up, we have a class declaration. Classes in Dafny behave much the same way that they do in most programming languages like C# and Java. They are objects that contain fields and may define methods. One big change from most OOP languages is that there are no accessibility keywords such as "private" or "protected". In Dafny, everything is public. This has to do with the concept of "transparency" that is important to how Dafny reasons about whole programs (but more on that later). If you don't intend for a method to be called from outside of the class, then just don't do that 🙂.
Moving on to the class definition, we define:
The syntax of the method signatures is a little different than in other languages, but it is still pretty clear. One important detail is that methods in Dafny can return multiple values. In C# we would do this with out parameters, but the concept is the same. (Actually, return values in Dafny are always "out parameters"!)
The next oddity is the "modifies this" statement. I won't dwell on this, but it essentially declares that a method is permitted to modify the state of the parent class. This is important to Dafny so that it can reason about when state is changing.
method TryAcquireLock(threadId : ThreadId) returns (success : bool)
modifies this //<- kinda weird 🤔
A big mental jump that we need to make with Dafny is how we think about testing our code. In C#, when we write a test, we write a small program that walks through specific steps, add some asserts along the way, and then run that program and see if it explodes.
Dafny can do this too, but the power of Dafny is in specifying the program in such a way that the program is "always" correct without us having to actually execute tests. But first, let's write a traditional-looking test.
Ok, here we have a basic test, much like a unit test. This creates a lock, then defines two different threads and then tries to do a few things with the lock. What might be a bit unusual is that we see some compiler errors before we have ever run the program. Just by writing the test we already get feedback from Dafny about its correctness.
Here, Dafny has determined that our program "might" not work. Put another way, Dafny can't prove for certain that it will work, but this is not a damning assessment of our code....yet. In most cases, we need to begin helping Dafny (and ourselves!) by being more specific about the intent of our program.
An important thing to mention is that methods are "opaque" to the verifier. That means that Dafny does not look directly inside of the body of methods that are called to determine their behavior and side effects. Instead, it uses the preconditions and postconditions (hereafter called the method's specification or spec) declared on each method to understand the intent and behavior of the method.
This works in two directions:
If we look at the first failing assert, Dafny cannot prove that the method will always return success=true. This is because the method spec doesn't say anything about what the method actually returns. In fact the method spec doesn't say anything yet!
Let's add a rule to say when the method returns success:
The function should return success if the lock is not currently held by another thread at the start of the method call.
That is a plain English rule that we are going to transform into a spec rule. Since this rule involves the method return value (which only matters after the method has finished), we are going to specify a postcondition using the ensures keyword.
This new rule is basically equivalent to our English rule. The `old(ownerThread)` means that we are referring to the value of ownerThread just BEFORE the method executes. We are saying that if that value is equal to our null thread id value, then that "implies" that the return value (called "success") is true.
The "fat arrow" ==> is the "implies" operator. It means that is the left side of the operator is true, then the right side is true. BUT, the reverse is not necessarily true; that is to say that if the right side is true, the left side may or may not be true. So to recap, success is true when ownerThread is "null" (actually NULL_THREAD_ID, which is zero) when the method starts. Neat!
Oh wait...
We have a new error on the return statement of the method now.
Hmm, "a postcondition could not be proved"? So Dafny now says "I see that you want this method to return true when this condition is met, but how are you gonna do that?". Ok, so we need to prove this somehow... Let's work backward.
The condition depends on the return value "success", which this method defines as casResult == threadId. So that means we have to somehow prove that this expression is true when we start with ownerThread is null. The variable threadId comes from the method argument, so we can't predict that, but casResult we can interpret. So perhaps if we tell Dafny what values CompareAndSwap returns and when, it can make this leap of faith with us. Time for a new rule:
If valueToCompare is equal to ownerThread when CompareAndSwap begins, then CompareAndSwap will return the value of valueToSwap
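Translated into a Dafny postcondition on CompareAndSwap (again a sketch):

```dafny
method CompareAndSwap(valueToCompare: ThreadId, valueToSwap: ThreadId)
  returns (result: ThreadId)
  modifies this
  // If the comparison matches at entry, the swap happens and the
  // swapped-in value is returned.
  ensures old(ownerThread) == valueToCompare ==> result == valueToSwap
```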
I'm going to pause here to answer a question that might be bubbling up already: "Why do I have to spell everything out for Dafny? Can't it just 'figure it out'?". Great question! There are two main reasons:
First, it helps the underlying engine by giving it guardrails on how much of the universe it needs to search and understand. By treating methods as "opaque" and only paying attention to the explicit rules that we have written, the number of things the solving engine has to keep track of is reduced by a massive amount. Without this, every possible observable behavior of your program would have to be considered by the engine regardless of whether it matters or not. The complexity explodes very quickly and it becomes computationally infeasible to analyze even small programs.
The next reason is more subtle. A distant cousin of Dafny is F#. F# is a language that has a powerful feature called "type inference" that allows us to just write code without specifying the type of each variable or method parameter. This is really cool because it gives us the ease of languages like JavaScript and Python, but with the type safety of languages like C# and Java. The compiler just "notices" your changes and you get correct compiler errors that tell you the other changes you need to make to bring your program back into a consistent state.
But it turns out that is too much of a good thing. Don Syme, the creator of the F# language, still recommends explicitly specifying the types of variables in most situations. This choice gives "structure" to a program and makes the types used explicit parts of the program contract. Without this structure, making a tiny change somewhere might cause a cascade of changes in behaviors and errors that can become overwhelming to understand.
The situation is the same in Dafny. We annotate our methods with static specification not only to help the engine figure things out, but also to make sure that the implementation code we are writing is actually doing what we want it to according to our method specifications. Many times, we make subtle mistakes in the implementation of our code and a key tactic is to add more preconditions and postconditions to our program until we notice the inconsistency. And that is exactly what we are going to do next!
So now Dafny can understand that after CompareAndSwap executes, its return value will end up fulfilling the contracts as we intended. And at this point the CompareAndSwap and TryAcquireLock methods both turn green! But... our first test assertion is still failing :(. Dafny still doesn't believe that TryAcquireLock will always return true after the first time it is called. I'm actually kinda stumped. When would this not be true? Well, it turns out Dafny can tell us by producing a counterexample! A counterexample is a concrete example that violates our specifications. We can study them to understand what Dafny is thinking.
You can enable counterexamples in VSCode by using the command palette.
Once we do, Dafny throws a lot of stuff at us
This is overwhelming at first, but let's unpack it. Each blue line is telling us the state of the program just before that line executes. We can also tell Dafny to copy these to the clipboard using the command palette. Let's look at the first one.
At "{" ...
defass:bool = false
lock:? = ?
thread1Id:int = 0
thread2Id:int = thread1Id
Ok, everything is uninitialized here because it is the opening curly brace of Main. The '?' means that the lock variable is uninitialized, and the threadIds are both zero at this point. The variable thread2Id is "equal" to thread1Id because the underlying engine consolidates all variables with identical values. This looks a little confusing, but it can help a lot with keeping track of the complicated states of our program and determining when they are the same.
The "defass" variable is special and refers to definite assignment. It indicates whether Dafny can prove that all of our variables have been assigned values yet. Don't worry too much about this one for now.
Let's skip forward to the next one:
At "var lock := new Lock();" ...
defass:bool = true
lock:_module.Lock = (NULL_THREAD_ID := 0, ownerThread := 39, lock#0 := true)
thread1Id:int = 0
thread2Id:int = thread1Id
Hmm, that's weird. it says that the ownerThread in our Lock object is equal to 39!? How the heck did that happen!? As a proving tactic, Dafny will sometimes try to plug in a random number as the value of a field or variable and see if it works.
So what Dafny is saying, is that after the constructor has finished, ownerThread could be anything because we forgot to specify that it will always be zero in the postcondition!
Recall that Dafny does not look inside of methods (and constructors) to see how they work unless we add a specification to that method, and we haven't done that for our constructor, so Dafny assumes that anything could be happening in there. This is a little annoying, but easy to fix.
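The fix is a postcondition on the constructor telling Dafny what state a freshly constructed Lock is in (sketch):

```dafny
constructor()
  // A new lock always starts out unowned.
  ensures ownerThread == NULL_THREAD_ID
```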
And now, like magic, our first assertion passes! But the second one is still failing 🙁.
Back to the counterexamples!
At "var success := lock.TryAcquireLock(thread1Id);" ...
defass:bool = true
lock:_module.Lock = (NULL_THREAD_ID := 0, ownerThread := 7720, lock#0 := true)
thread1Id:int = 1
thread2Id:int = 2
success:bool = defass
Hmm, ownerThread ended up with a crazy value again after we acquired the lock. I guess we need to constrain the behavior of TryAcquireLock some more. Let's add a new rule:
If TryAcquireLock succeeds, ownerThread will be equal to the threadId argument
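As a Dafny postcondition on TryAcquireLock, that rule is (sketch):

```dafny
// A successful acquisition means this thread is now the owner.
ensures success ==> ownerThread == threadId
```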
*Sigh*... that leads to a new error. I guess because CompareAndSwap doesn't specify its return value. Wait! Doesn't our postcondition on CompareAndSwap already say that its return value is specified correctly? So what gives? Shouldn't Dafny be able to understand that CompareAndSwap will make TryAcquireLock work correctly? Well, let's look at the counterexample:
At "var casResult := CompareAndSwap(NULL_THREAD_ID, threadId);"
this:_module.Lock = (NULL_THREAD_ID := 0, ownerThread := 38)
threadId:int = 39
casResult:int = threadId
Ok, so in this example we are in TryAcquireLock and Dafny has started us out with the threadId argument set to 39 and ownerThread equal to 38. So thread 39 is trying to acquire the lock, but it is already held by thread 38. This should fail. But wait:
At "return casResult == threadId;" :
this:_module.Lock = (NULL_THREAD_ID := 0, ownerThread := 38)
threadId:int = 39
casResult:int = threadId
defass:bool = true
success:bool = defass
Dafny says that CompareAndSwap returned the value of threadId into casResult. But that should not have happened because of the postcondition on CompareAndSwap that controls the return value... right? Nope. Let's look at our English postcondition for CAS again:
If valueToCompare is equal to ownerThread when CompareAndSwap begins, then CompareAndSwap will return the value of valueToSwap
This only defines what happens when CAS succeeds (valueToCompare == ownerThread)! We haven't covered what happens when it doesn't! I'm also going to skip some steps and mention that we didn't constrain that ownerThread must not change if CAS fails. And several other things... are you getting tired yet?
My final version looks like this:
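Roughly, the fully specified class looks like this. (This is a representative sketch rather than the exact listing; the postconditions can be phrased in several equivalent ways.)

```dafny
newtype ThreadId = i: nat | i <= 0xffff_ffff

class Lock {
  const NULL_THREAD_ID: ThreadId := 0
  var ownerThread: ThreadId

  constructor()
    ensures ownerThread == NULL_THREAD_ID
  {
    ownerThread := NULL_THREAD_ID;
  }

  method CompareAndSwap(valueToCompare: ThreadId, valueToSwap: ThreadId)
    returns (result: ThreadId)
    modifies this
    // On success: the swap happens and the swapped-in value is returned.
    ensures old(ownerThread) == valueToCompare ==>
              ownerThread == valueToSwap && result == valueToSwap
    // On failure: nothing changes and the unchanged owner is returned.
    ensures old(ownerThread) != valueToCompare ==>
              ownerThread == old(ownerThread) && result == old(ownerThread)
  {
    if ownerThread == valueToCompare {
      ownerThread := valueToSwap;
      result := valueToSwap;
    } else {
      result := ownerThread;
    }
  }

  method TryAcquireLock(threadId: ThreadId) returns (success: bool)
    // Nobody may masquerade as the "null" thread (the bug found above).
    requires threadId != NULL_THREAD_ID
    modifies this
    // An unowned lock is always acquired, and we become the owner.
    ensures old(ownerThread) == NULL_THREAD_ID ==>
              success && ownerThread == threadId
    // A lock owned by a different thread is never acquired or disturbed.
    ensures old(ownerThread) != NULL_THREAD_ID && old(ownerThread) != threadId ==>
              !success && ownerThread == old(ownerThread)
  {
    var casResult := CompareAndSwap(NULL_THREAD_ID, threadId);
    return casResult == threadId;
  }

  method ReleaseLock()
    modifies this
    ensures ownerThread == NULL_THREAD_ID
  {
    ownerThread := NULL_THREAD_ID;
  }
}
```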
Dafny has features that allow us to "pick" an arbitrary possible value for a variable. This is superior to manually specifying values such as "1" and "2" for our thread ids. So I have generalized those variables, which forces Dafny to prove that our tests work for all possible values instead of just the few that we can manually think of. This uncovered another bug where a bad user could pass 0 as the threadId to TryAcquireLock and think that they held a lock when it could still be taken from them 🙁.
As a note, this is different from randomized testing or property-based testing, which run tests again and again with many values. Those tools have to stop running combinations eventually and cannot guarantee that they have checked everything. But Dafny can!
As long as we have defined the desired behavior of the class in the test, then this code will work. Period. No corner cases, no surprises.
Programming correct code requires designing behavior and then implementing that behavior without deviating from the design. Dafny can help with both sides of this.
That being said, writing formal proofs is a humbling experience. The code you thought was simple turns out to be very subtle and nuanced. And modeling more complicated systems, like loop processing, client-server protocols and timer-based functionality, can be very tricky to even know how to start.
Even with "simple" code, the size of the proof specification can dwarf the actual code. Our Lock class took 1 precondition and 8 postconditions to prove correct. You will also notice that there are 9 lines of specification covering just 5 lines of implementation code. That is an unsurprising ratio in the world of formal methods.
This is frustrating, time-consuming and expensive work to apply to a project, and should generally be treated as an extreme step, taken after significant investments in other testing methods have been made. This example took me around 4 hours overall. Partly because I was rusty at Dafny, but mostly because it is hard. Damn hard.
But at the end of the day, Formal Verification can go places beyond any other software testing technique, and Dafny is the most approachable way to achieve this that I have yet seen. Give it a shot, you just might like it. Or maybe one day you will use it to prevent a bug that could mean life or death.
A few days ago I was having dinner with some close friends and the discussion drifted to the nature of humanity. My friend argued that the measure of humanity was a sliding scale that was weighted against the amount of survival pressure that a person is under. Further, that survival pressure is the primary driver of altruistic expression. Essentially, that when the chips are down, our mental risk assessment becomes dominated by external factors instead of an internal sense of morals. The conversation depressed me because it darkens my world view with the clouds of doubt that any of us are actually firmly tied to our sense of morals, but instead it is all about how we perceive and interpret risk.
Working in an engineering organization involves daily interactions with risk management. Risks to timelines, risks to availability and operations, risks to security. We have to weigh the costs and expected ROI of preventative measures against the weighted probability and severity of negative outcomes. "If we invest in penetration testing, how much will it reduce the projected likelihood and severity of a security breach?"
Another fundamental but less discussed risk is the cost/benefit of investing in people. People have value in many different dimensions that matter to organizations. Specific skills, experience, knowledge, principles, curiosity, and determination; these are all things that drive varying degrees of the success of an organization. Some of these might be considered intrinsic to a person's nature, such as curiosity and determination, but others, such as skill, experience and knowledge, naturally accumulate with age. So to maximize as many dimensions of excellence as possible, organizations naturally favor senior talent because they have natural advantages. Who wouldn't want to maximize chances of organizational success? I mean, this is dog-eat-dog capitalism.
The risk analysis of an organization will naturally vary with time as an organization's economic context ebbs and flows from peacetime to wartime and back again. But even as we shift footing, an organization's principles should persist, just as I believe people should not abandon their moral center when times get hard. It is in wartime, in times of challenge, that our mettle is tested and shows. I hear so much about "culture" in tech companies but so little about the guiding principles and group conscience that sustain us as we endure as an organization and as an institution.
So what is this post actually about? It's about inclusivity. The tech industry owes a great debt to many underserved communities. Even for those of us that don't belong to a specific minority group, we have all belonged to a group that is undervalued and underserved: the junior role.
Each of us started somewhere that required someone in a position of power to reach down a hand and take a risk on us. Maybe we had good grades or some extra help from an acquaintance that could vouch for us, but at the end of the day, we all started inexperienced, untested and unproven. Maybe it was a megacorp, or a scrappy startup or one of the countless middle size companies that make up the bulk of this industry, but every person in tech remembers, that open door, that first handshake, literal or figurative, that sealed our first position on our career path. The good faith opportunity, the chance to gain experience, confidence and prove to ourselves and others, our value. At every step in our careers, whether it was our first job or our most recent promotion was a handshake, a vote of confidence in our potential and value.
This is a subject I feel so strongly about, because I wasn't ready when I got my first handshake, but they took a chance on me and supported me until I was. I owe that handshake so much, from the cozy vacations I have taken, to the roof over my head. I have a bit of a lonely disposition and I don't usually see much of myself in others. But I absolutely see myself in those that are waiting for that open door and that first handshake.
Whenever we can, we should try to see the potential and the value in people, even when they don't have a 10-page CV to show for it. Junior roles bring new blood and ideas into an organization. They bring perspective and questions that we forgot to ask. They bring mistakes and opportunities for resilience. They bring connection with younger generations that will make up our clients one day; sooner than we think. They can bring fire, ambition and a "why not" attitude that reminds us of who we were and might be again.
But they always bring risk. Risk of slipping a deadline, risk of bugs and mistakes, risk of failure. It cuts to our core principles to take a human risk, especially when we are under pressure to deliver. Maybe next quarter when things are not as busy, maybe once we ship this feature, maybe once we stabilize this system. It's a fallacy that is so easy to tell ourselves. But the reality of capitalism is that tomorrow isn't promised to us and there is never going to be a perfect time to bring on a junior that will take time and investment to provide that all important ROI.
Our choices with that risk, I believe, serve as a reflection of our values both as organizations and as people, and we should take them under serious consideration. There is always going to be the question of how to make as much money as fast as possible. But an essential question that we must ask ourselves is "have we reached far enough to give someone that handshake?". And that handshake, after I've earned my last dollar and my last line of code has stopped executing, is how I would want to be remembered.
Hey, if you like this article be sure to check out the follow up "Getting started with Dafny: Your first proof"
I have a little joke that I tell to each new employee that I meet: "Hey, I work on <Component X>, I'm sorry in advance for all the bugs you'll have to deal with 🙂". I tell it to new QA, support folks and even salespeople. It always gets a chuckle and is a good ice breaker. (Did I mention that we are hiring?)
But the self-deprecating humor is a bit of a defense mechanism, because bugs in my code mortify me. I've seen all sorts over the years: little bugs, big bugs, sneaky bugs, heisenbugs, "I thought we fixed this already!?" bugs and "HOW DID THIS MAKE IT PAST QA 😱" bugs. I've seen bugs that hid for years and then exploded. I've seen bugs with 7-figure price tags...multiple times.
They all hurt. So in my code I take pains to follow principles such as "parse, don't validate" (as written by the brilliant Alexis King), the use of immutable data structures to simplify multithreaded programming and strong type systems to catch inconsistencies. I lean on unit tests to catch bugs in "well written code" and Property Based Testing to validate tricky assumptions.
But even with all of these tactics, there will always be gaps where bugs live in the darkness. But what if there was a way to make your code airtight, so that it contained no corner cases? What if you could prove that every possible configuration combination in your software did not cause invalid behavior? Well there is! In this article I want to introduce you to the gold standard for software verification: Formal Methods.
Let's ask Wikipedia:
In computer science, formal methods are mathematically rigorous techniques for the specification, development, analysis, and verification of software and hardware systems
Simple enough. It all starts with building a specification that defines "correct" behavior for your system. We make a LOT of assumptions on a day-to-day, minute-to-minute basis about how our code will work. Although the wibbly-wobbly-timey-typey objects in JavaScript or Python are quick and convenient, a specification forces us to nail those behaviors down and decide exactly how they should work.
Once your software has a specification that defines correct behavior, then you need to check your implementation to ensure it follows the spec. This is where we typically reach for suites of unit tests. But as we all know, suites of unit tests can never catch every weird combination of conditions that causes incorrect behaviors to arise.
That is where proof assistants (F*, Dafny, Coq, Lean, Liquid Haskell) and model checkers (Alloy, TLA+) arrive on the Formal Methods station wagon. These tools take our specification along with details of our implementation and rigorously validate the implementation against it. And when I say rigorously, I mean they deliver mathematical certainty of correct behavior as defined by your specification.
What if you need to write code that was speeding down the highway? Or that was zapping tumors in people's brains? Or code that controlled an irreplaceable satellite millions of miles away? Or code that controls hundreds of millions of dollars (this happens a lot)? Are you going to trust your normal day-to-day programming tools with this weight of responsibility? I, for one, would not.
If your code absolutely, positively MUST work the first time and every time, then you need Formal Methods. It will save you money, time and the heartache of having to explain failures to your clients or far worse, their loved ones.
Put simply, because it is a pain in the ass 😅.
A lot of proof assistant languages such as Lean and F* rapidly become a soup of esoteric symbols and punctuation marks that we mere mortals cannot fathom. That's probably half the reason most people turn 360 degrees and moonwalk out the door when the discussion turns to formal methods: they seem impenetrable.
For those that make it past the crazed syntax and esoteric concepts, there is the real problem. Proving programs correct is hard. Damn hard. When we are confronted with a machine that does not accept hand-waving answers to hard questions, we either leave in frustration or shrink back in defeat. But never fear, there is an easier path, and that path is Dafny!
Dafny, unlike its relatives F* and Lean, comes from the syntactic lineage of C/C++ and, more recently, C# and Java. If you can write any of those, you can write Dafny. There are no weird symbols or abbreviations, and we can build programs in both functional and object-oriented styles.
So let's take a look at the syntax:
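Here is a small method in that style (the names here are illustrative, not taken verbatim from the project):

```dafny
datatype Status = Disconnected | Connected

class Connection {
  var status: Status
  var sent: nat

  method Send(msg: string)
    requires status == Connected   // caller must have an open connection
    modifies this
    ensures sent == old(sent) + 1  // exactly one more message has been sent
    ensures status == Connected    // ...and the connection is still open
  {
    sent := sent + 1;
  }
}
```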
Not that bad, eh? The "ensures" thing looks a little weird, but I bet you can guess what it does without me saying anything. The code comes from a little project modeling a toy connection protocol between a server and a client. The whole thing is less than 100 lines.
The second big win of Dafny is that it has a great toolchain! You can get started in 60 seconds in VSCode just by downloading the extension and creating a new file. There is no barrier to entry!
Dafny is a mature and complex language under continuous development and improvement (the reference document is 448 pages long!). As you can imagine, we are not going to be able to cover everything, so consider this a gentle introduction. A future post will provide examples and show how it really works.
"Most proof languages can help you write a formal proof, but then you need to write separate code that exactly implements the proved algorithm."
First and foremost, Dafny helps you write correct software by giving you advanced ways of telling the compiler and verifier what your code is supposed to do. This is also known as design by contract. The contracts primarily take the form of preconditions and postconditions.
In addition to the simpler syntax and more approachable toolchain, Dafny has a super power. Most proof languages can help you write a formal proof, but then you need to write separate code that exactly implements the proved algorithm. This gap leaves room for human error and can undermine the whole point of going through the pain of a formal proof.
Dafny however can compile your verified code down into a number of mortal languages including C#, Java, JavaScript, Go and Python. This can give you great confidence that your code is going to work exactly as intended.
Each time we define a function or a method in Dafny we provide markup that defines the boundaries of expected behavior including inputs, outputs and reads and writes to other variables. These behavior constraints come in the form of loop invariants, assertions, preconditions and postconditions. We declare this metadata and Dafny's verifier will let us know if we have covered every possible corner case.
These contracts are also important to us humans, because they clearly state the developer's intentions to future maintainers. Many times we change or refactor some code and accidentally alter a hidden behavior that was important to the system. Making these behaviors explicit prevents this type of mistake.
A precondition is a rule that a method or function requires in order to work correctly. We do this in C# and other languages quite commonly by checking method arguments and throwing exceptions when they are null or otherwise invalid.
Postconditions are somewhat rare in most day-to-day programming, but they are important in Dafny. A postcondition is a promise that when a function completes, the program will be in a specific state. It could be a guarantee that a data structure is valid, or that a bank account is not negative, or anything else that is important to us.
Loop invariants are trickier. Loops are particularly difficult to prove correct. They can overwrite variables, transform arrays and perform other complex operations that can confuse humans and, unfortunately, even the verifier. It can also be very hard to prove that a loop will ever stop looping. To help things along, it is sometimes necessary to attach some rules to a loop that the system can prove are always true. These rules are known as loop invariants.
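To make those three concrete, here is a small, self-contained example (illustrative, not from the connection-protocol project) that sums an array using a precondition, a postcondition and loop invariants:

```dafny
// Specification function: the sum of the first n elements of a.
function SumTo(a: array<int>, n: nat): int
  requires n <= a.Length   // precondition: n must be in range
  reads a
{
  if n == 0 then 0 else SumTo(a, n - 1) + a[n - 1]
}

method Sum(a: array<int>) returns (total: int)
  ensures total == SumTo(a, a.Length)   // postcondition: total is the full sum
{
  total := 0;
  var i := 0;
  while i < a.Length
    invariant 0 <= i <= a.Length        // loop invariants: provably true on entry
    invariant total == SumTo(a, i)      // and after every iteration
  {
    total := total + a[i];
    i := i + 1;
  }
}
```

The verifier uses the invariants to bridge the gap between the loop body and the postcondition; delete either `invariant` line and the proof fails.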
Pretty much anything. The language provides you incredible flexibility in how you structure your code. It could be in a functional programming style, or an object oriented style. If you can write it in a mainstream language like C#, you can write it in Dafny. Dafny has been used to model everything from encryption algorithms to complex distributed systems to blockchain smart contracts. With enough effort, you can describe almost any type of software; the sky (and your time/budget) is the limit!
There are three big difficulties in working with Dafny (and most other formal verifiers).
First, you are going to have to build everything from scratch 😅. Unlike in C# and JavaScript, where package managers offer millions of pre-built classes and functions, Dafny only gives you very basic building blocks to work with. You will need to build up your application's behavior brick by brick, which can be very time-consuming. If it's any consolation, you have total control over the behavior of the resulting program.
Next, you will quickly find out that Dafny's verifier is really dumb. Don't get me wrong, it is operationally perfect at proving things false, but we have to remember that verifying Turing-complete languages is undecidable in general, so proving arbitrary programs correct is impossible. It won't take you very long before you write a piece of code that you know for a fact is correct, but Dafny can't make the mental jump to verify it.
This is because the underlying verification engine (called Boogie, which is built on the Z3 SMT solver) uses conservative heuristics to prove your program correct. Even some simple algorithms can be very, very computationally expensive to verify, so the verifier doesn't search every possible path. It uses various tricks and tactics to do a best-effort search. As a result, you will commonly have to add "extra" assertions or invariants that seem obvious to you, but help Dafny figure out that your program is correct.
There is some good news, however: Dafny will generate counterexamples any time it fails to prove something fully. You can inspect these examples to see where the logic broke down and then fix the code or the specification.
I should note that Dafny will never generate false positives, only false negatives. That is to say, if Dafny says your program is correct, you can rest assured that it is correct (unless your design is bad...). But if Dafny says that it cannot prove your program, it may just mean that you need to add extra assertions to help Dafny out. In fact, that is how we commonly debug programs in Dafny: when presented with an error, we add assertions to smaller and smaller chunks of the program until we either find the gap/mistake, or Dafny agrees that we are telling the truth.
Lastly as I mentioned before, it is really really hard to write formal code. All those little assumptions have to be written down in order for the computer to help us validate them. It is normal for the contracts defining a function to take up more lines than the function itself! And when we do write them down, we may find that they interact in complex ways that keep sending us back to the drawing board in our designs.
While this exercise is critical to producing correct and verifiable software, it comes at the cost of slowing development down to a fraction of our normal pace. Writing just one hundred executing lines a day is exceptional for most of this work and it can be a lot rougher than that when dealing with tricky algorithms. It does get easier with time, but it will always be much much slower than just writing normal code.
There is also always the risk that our software specification itself is wrong or incorrectly defined. Dafny will catch contradictions in our specifications, but if you define something incorrectly Dafny can't do much to help you.
It depends 🤷. It's going to take an investment of time and effort, and you will need to weigh that against the risks and costs of failures. The folks at AWS invested heavily in TLA+ and were able to uncover crazy complex bugs, including one with a 35-step reproduction path. It also enabled them to implement performance optimizations with confidence, which at the scale of AWS can save millions of dollars.
Microsoft has also invested heavily in formal methods. In fact, Dafny, TLA+, Boogie, Z3, Lean and F* all come from, or are stewarded by Microsoft Research. Microsoft also has used formal methods to prove complex systems, such as their IronFleet project which validates the behavior of distributed systems. But at what cost? IronFleet took 3.7 man-years to implement and it was only 5000 lines of code.
At the end of the day, it's all about risk assessment: Probability of Failure x Cost of Failure = Risk. You'll need to do that math for yourself. But knowing how to use these techniques will give you tools that let you deliver super-human levels of reliability.
In my next post, I'll demonstrate a practical example of Dafny to model the behavior of my hypothetical scheduling application I talked about in my previous post. Until then!
So, last time we covered the control plane of this platform and some of the nifty things that Microsoft Orleans provides that make it a great choice for this kind of platform. Today, we will cover the data plane, which includes routing and execution. But first, a quick refresher.
A basic serverless platform suitable for internal/private deployments. The goal is to provide an easy way to bolt services together, much like AWS Lambda or Azure Functions. These services can be great time savers because they make it very quick and easy to get some code running in the cloud without managing VMs or dealing with Kubernetes, whose deployment complexity can dominate the effort when you just need a few lines of code running.
Because we can 😄! It is a great exercise to reinvent the wheel a bit sometimes to learn more about how the wheel actually works. And sometimes (more often than you think), your simpler wheel can outperform the big SaaS wheels that are out there. In this case, I'm trying to demonstrate that a lot of what Azure and AWS sell you is actually not very complex, and for certain use cases you can save a lot of money by building your own miniature version of it. Now back to our regularly scheduled programming 😄.
So, as we saw before, the control plane is a set of administrative APIs that "control" the behavior of the system. Specifically, they control how HTTP requests are handled. The data plane handles and routes the HTTP requests coming into the system. Now, asp.net already has a lot of fancy routing tech built in for us to use. But we are going to throw all that away and leverage Orleans some more to handle our routing. We do this because asp.net routing is mostly intended to be static after application startup, and we explicitly do not want to restart our application just to add new routes and handlers.
So the front door of our system is very simple:
As a side note, I think I'm going to switch to using Carbon.now.sh for my code formatting from now on. The formatting here on LinkedIn is just terrible 😅.
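In sketch form, the front door is a single catch-all minimal API mapping (the exact identifiers in the repo may differ):

```csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<DataPlaneRouterService>();

var app = builder.Build();

// One catch-all route: every GET under /app is handed to the router service.
app.MapGet("/app/{**path}", async (string path, DataPlaneRouterService router) =>
    await router.RouteRequestAsync(path));

app.Run();
```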
This is using the very nice minimal APIs from asp.net and, as we can see, it's a single mapping for all GET requests to anything below /app in the HTTP path. The real work is being done by this mysterious DataPlaneRouterService that is being injected into our handler. So let's take a look at that:
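A sketch of what that service does (the method, validation rule and grain interface names here are illustrative):

```csharp
public class DataPlaneRouterService(IClusterClient clusterClient)
{
    public async Task<string> RouteRequestAsync(string path)
    {
        // The first path segment is the "organization name" that keys the route.
        var orgName = path.Split('/')[0];
        if (string.IsNullOrWhiteSpace(orgName) || !orgName.All(char.IsLetterOrDigit))
            throw new BadHttpRequestException("Invalid organization name");

        // Normalize the key, summon the grain with that identity and let it run.
        var grain = clusterClient.GetGrain<IRouteHandlerGrain>(orgName.ToLowerInvariant());
        return await grain.Execute(path);
    }
}
```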
Again, not much really going on in here, but let's walk through it. We first do some checking to ensure the "organization name", which serves as the path prefix, is clean. From there we normalize it to all lowercase, look up the Orleans Grain that has that identifier, and call its Execute method.
I'm going to step back for a moment and remind everyone how Grains work in Orleans. As we stated before, Orleans is Microsoft's implementation of the Distributed Virtual Actor pattern. In this case, we are going to focus on the "Virtual" part.
When we talk about Virtual Memory, we are talking about a magical system that lets each process running on a computer believe that it has access to an essentially unlimited amount of memory. The operating system does a lot of bait-and-switch magic behind the scenes to preserve this illusion. If your program hasn't touched a particular chunk of memory for a little while, the operating system will save it to the page file on disk and use the space in RAM for something more important. The next time you need that chunk, the operating system loads it back from the page file and makes it available on the fly. It's a miracle that computers can even work at all, with all of this nonsense going on. But they do and programming is much simpler because of it.
In Orleans, Grains are "Virtual" in the same way. When you need to talk to a specific Grain, you just call a method on the Grain as if it was there the whole time. The Orleans runtime will magically bring it into existence just-in-time somewhere in your cluster without your code knowing about it. If your Grain previously had state data stored in it, that data is loaded from storage at that time. When you haven't called any methods on that exact Grain instance for a little while, Orleans will put the Grain and its data back into cold storage to be used at some point in the future.
The net effect is that in the above code we can summon a grain for any possible path that has been sent to the HTTP API and allow the Grain's internal logic to handle it. The Grain is coded to look in its internal state, see if a handler has been registered, and then decide how to proceed. This is fundamentally how we determine how to handle each possible route in our application, even if they don't "exist".
By the way, if you're wondering how many Grains Orleans can hold active in memory at once, the answer is millions and the number of Grains that can live in cold storage is infinite (assuming you can pay for the storage).
The data plane for each RouteHandlerGrain consists of one method: Execute. That method looks like this:
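Here it is in sketch form (the exception type and helper names are illustrative):

```csharp
public async Task<string> Execute(string path)
{
    // No persisted state means no handler was ever registered for this route: 404.
    if (state.State?.Code is null)
        throw new RouteNotFoundException(path); // translated to HTTP 404 upstream

    // The compiled script can't round-trip through grain storage, so we compile
    // it on first use after activation and cache it in a plain C# property.
    Script ??= ParseScriptOrThrow(state.State.Code);

    return await InvokeScript(Script, path);
}
```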
We first check to see if this Grain instance has any State loaded already. If it doesn't, we know that no one has set up a handler for this Grain and therefore we treat it as a 404.
As we discussed above, once a Grain has been loaded/activated, it exists as a normal in-memory C# object. This includes the normal ability to use fields and properties to store data. Leveraging this ability, we check if we have already cached a compiled version of the handler script stored in a nullable property. If there is a cached version, then we invoke the script, if we have not, then we load the script on demand, compile and cache it, then invoke it.
We do it this way because compiled/optimized script formats cannot always be serialized to and from cold storage like the other Grain state data, so we recompile the script on demand after the Grain has been activated. The next time a request comes in for this route, we can use the volatile cached copy to keep things running fast.
The textual bodies of the scripts themselves are stored in another Grain that works much like a stored object in Git. This is done to support a history mechanism so that we can have tracking of what changes have been made to the handler script over time and rollback to them if needed.
One of my company's executives has a saying that goes something like: "It's hard to make things simple". The first version of this code (which you can find here) was kinda ugly, but it got the job done. As I matured the development of this system, I spent a fair amount of effort cleaning up the design so that the code was cleaner to read and would be easier to extend. As a result, the Evaluate method now looks like this:
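Roughly like this, anyway (the names of the context type and metering helpers are illustrative):

```csharp
public async Task<string> Evaluate(string path)
{
    // Context object handed to the script; an extension point for richer data later.
    var context = new ScriptContext(path);

    // Meter execution time so we can bill for consumed resources.
    var stopwatch = Stopwatch.StartNew();
    var output = await InvokeScript(Script!, context);
    stopwatch.Stop();

    // Output size also feeds the consumption-based billing, tracked per Organization.
    await organizationGrain.RecordConsumption(stopwatch.Elapsed, output.Length);

    return output;
}
```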
Again it isn't too complicated to read. We create a context object that encapsulates the request path and a future extension point for data. This context data will be injected into the script's runtime space. We then do some bookkeeping to measure how many resources the script will take to execute and then fire the script. After the script completes we measure how large the output is so we can track it for our consumption based billing, and then update the consumption data for the Organization. That's it.
So this does skip over the most interesting part: how do we actually execute the code?! Well, we use a very handy .NET library called Jint that provides a Javascript interpreter. There are pros and cons to using an interpreter vs a compiler, but Jint is well engineered and executes small scripts very quickly, which is what we want. We inject the arguments and a mini utility library API into the script's runtime and then invoke it. The plumbing to do this is organized in a couple of classes, but it boils down to this class that does the heavy lifting.
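For a sense of what actually runs inside Jint, here is a trivial handler script of the kind the platform might execute (the shape of the injected context object is illustrative, not the platform's actual API):

```javascript
// A hypothetical handler script. In the real platform the host injects the
// context object; here we stub one so the script reads standalone.
const context = { path: "/app/orders/today" };

// Whatever the handler returns becomes the HTTP response body.
function handle(ctx) {
  return JSON.stringify({ route: ctx.path, message: "hello from the data plane" });
}

const responseBody = handle(context);
console.log(responseBody);
```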
Of note is the fact that our scripts don't have any of the normal Javascript runtime APIs that you would get from a browser like Chrome or a server like Node or Deno. We have to create each of them manually. Currently, the only thing I implemented was a very basic Fetch API so that our script can call other upstream services if it needs to. You can see that here:
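In outline, the wiring looks something like this (a sketch against Jint's `Engine.SetValue` API; the class and property names are my own invention):

```csharp
public class FetchApi(HttpClient httpClient)
{
    // Total time spent blocked on upstream calls, so we can bill it differently.
    public TimeSpan UpstreamWaitTime { get; private set; }

    public void Attach(Engine engine)
    {
        // Expose a minimal, blocking fetch(url) that returns the body as a string.
        engine.SetValue("fetch", (Func<string, string>)(url =>
        {
            var sw = Stopwatch.StartNew();
            var body = httpClient.GetStringAsync(url).GetAwaiter().GetResult();
            sw.Stop();
            UpstreamWaitTime += sw.Elapsed;
            return body;
        }));
    }
}
```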
As needed, other runtime components can be implemented to make life easier for the script itself, but this is enough already to do useful things. I really want to attach a JSON based key/value store API as a next step.
You'll note there is some code in here to measure the amount of time that we spend waiting on the HTTP GET call. This is to support a future enhancement where we do not bill the client for time spent waiting on some other service to execute. This is similar to how Cloudflare does its billing, and I consider it to be more fair. Theoretically, while the script is blocked on an upstream service, we can do other work with the thread, so it really wouldn't cost us much. But this isn't fully implemented yet.
"...32,500 requests for second is pretty good for a single machine..."
Pretty darn good 😄! Despite the Orleans runtime doing a ton of magical work in the background, there isn't much fundamental overhead to the system. We can see that in a local test on my workstation, which looked like this:
====================================
Thread Count: 8
Requests Per Thread: 10000
Total: 80000
Errors: 0
Subjective Time: 19.166
Real Time: 2.462
Subjective Average: 4,174.01
Mean Latency: 0.240
Real Average: 32,497.69
This shows that we were able to send 80,000 requests through the data plane to a trivial handler script in about 2.5 seconds. This works out to only 240 microseconds per call, including all the asp.net HTTP controller and Orleans dispatching overhead.
I think 32,500 requests per second is pretty good for a single-machine test run, and I should mention this number can go much higher when the load is spread across multiple handlers that don't have to contend for the Grain's single-threaded execution. There are also other optimizations we could make to allow better scaling.
As I mentioned before, the normalized cost of compute in Azure Functions is as high as 6 times the cost of bare VMs or K8s containers. Those services deliver ease of deployment, hands-free scale-out and shiny web editors. But with a little ingenuity, we don't have to pay the convenience tax to get good functionality. This service could be hooked up to a CI pipeline for script deployments the same as Lambda/Functions and deliver that same power to your organization.
But what about scaling? Well, the final cherry on this cake, is that we require ZERO changes to this code to scale out from 1 machine up to 100 machines. Just add more replicas to the environment and the Orleans cluster will expand to them and load balance across them.
I'm pretty confident that this tiny serverless platform could displace a large fraction of the work that we currently do with Azure Functions, and it can save a fair amount of money because we don't have to rely on the complex and expensive cloud primitives that AWS or Azure have built for us. You can run this on a bare VM or in containers and use nothing else besides S3 or Blob/Table storage, which is about as cheap as the cloud gets.
I'll part with a quote that a very wise client told me years ago:
"You pay for what you don't know"
Ok, ok. It took me 2 days. But I was doing other things with my life 🤷. Anyway, on to the real story.
So a few days ago, I was writing some internal documentation for my job at Ziosk (did I mention we are hiring?). Specifically, I was writing a primer for newcomers to Microsoft Orleans. To start with, Orleans is Microsoft's implementation of the Distributed Virtual Actor model, which makes people's eyes glaze over, so I try to build intuition in other ways. I have fallen into a pattern of introducing Orleans as a "nanoservice" platform. You see, microservices are small independent APIs that handle some specific functionality. But the more microservices we add, the higher the complexity and the greater the woes. Orleans skips a lot of that pain and lets us break out functionality as if we were programming normal C# classes and interfaces. This enables us to define smaller and more fine-grained services if our designs call for it. Hence, "nanoservices".
While I was explaining this in my document, I mused a bit that since nano is smaller than micro, you could build microservices out of smaller nanoservices. Then it clicked! Maybe I could actually do that! And what could be a more hip way to deploy a microservice than to use a serverless platform and pay 6 times the rate 😄! So I cracked open VSCode and started writing.
Well, a serverless platform! So what is that? Let's ask Wikipedia:
Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers.
It further states:
Serverless computing does not hold resources in volatile memory; computing is rather done in short bursts with the results persisted to storage. When an app is not in use, there are no computing resources allocated to the app. Pricing is based on the actual amount of resources consumed by an application.
Ok, so it's a platform that abstracts the developer completely away from the machine. It also does computation in short bursts (request/response processing) and costs the client nothing while it is not executing.
From this we can distill a couple of basic requirements: run user-submitted code on demand without exposing any servers, execute in short request/response bursts, and cost the client nothing while idle.
We are going to use one of my favorite tech stacks: C# + Orleans + gRPC. Actually, F# is my favorite language, but that would be harder for you folks to follow along with, so back to C#, my 3rd favorite language 😄. The serverless code will be in Javascript because it is the most popular language on earth and it is well supported. As for Orleans, I think it will fit the workload very well and make prototyping easier. gRPC is a no-brainer for me: it gives me compile-time safety for my client and server code, so it's my default choice.
We are going to build this app on the fly and not do a lot of up-front design. This was thought of and built on a whim, so I didn't want to try to make something perfect. This was a personal hackathon. If you stop by my repo, you can see it commit by commit. Let's jump in!
Our system will consist of a control plane and a data plane. The control plane is how routes and other administrative tasks are managed. The data plane is where requests flow in the front door and are ultimately served by the serverless code. The control plane will need a proper interface that we can talk to programmatically, so I reach for my favorite tool for APIs, which is gRPC. gRPC lets me define a service in a special definition language and then generate both server and client code.
The control plane basically looks like this:
service ControlPlane {
rpc QueryUsage (QueryConsumptionRequest)
returns (QueryConsumptionResponse) {}
rpc RegisterHandler (RegisterHandlerRequest)
returns (RegisterHandlerResponse) {}
}
So there are two methods, QueryUsage and RegisterHandler. This is the bare minimum for us to compute billing and upload code. These calls will be handled by an asp.net gRPC controller that forwards the details onward to Orleans. We are hosting the gRPC controllers in the same process as the Orleans silo because it is faster and simpler. The controllers are not very complex; here's the code for `RegisterHandler` with the error handling removed for brevity:
var normalizedHandlerName = helpers.GetNormalizedHandlerName(request.HandlerId);
var grain = clusterClient.GetGrain<IJavascriptGrain>(normalizedHandlerName);
await grain.Import(request.HandlerJsBody);
return new RegisterHandlerResponse { Success = true };
That's about it. We convert the handler id to lowercase, look up the Orleans grain by that id, call Import on the grain and return a result. This is missing auth, but hang with me.
You may be curious about the Import method, so here it is:
private Script? Script { get; set; }
public async Task Import(string code)
{
var script = ParseScriptOrThrow(code);
Script = script;
state.State ??= new JavascriptGrainState();
state.State.Code = code;
await state.WriteStateAsync();//save to storage
}
The method parses the code to ensure the syntax is valid, because why let someone push broken code 😄. After that, it stores the parsed script in a volatile field on the class (which is just fine) and then updates the persistent state with the raw code string and saves it. (We'll see why later... 😄)
Actually, let's see why now 😄.
You see, a Grain in Orleans is like an instance of an object. In fact, it literally is an object instance in .NET. It has methods and fields and a constructor. We can store things in the fields of this object, like the Script property we saw above, and they will disappear when the object is collected. However, we can mark some of those fields and properties as persistent, and then Orleans takes over control of the data.
A Grain in Orleans is never created and it is never destroyed. That is why we call it the Virtual Actor pattern. You call up a grain by providing a unique key, much like a primary key in a database. If the grain has state associated with it, Orleans transparently loads that data into memory when it routes a request to that Grain. After that, any method calls to the Grain can use that state as normal. If the Grain instance in memory hasn't had a method call sent to it for a while, Orleans will deactivate the grain and save its state back to persistent storage (such as S3 or a database).
Does this sound a little familiar?
Yep, most serverless platforms work the same way. If your Azure Function or AWS Lambda isn't used for a while, it goes to sleep, and the next time it is called it has to warm up. This is called the cold start problem, and it can be a big annoyance for some use cases (but my prototype doesn't suffer much from it 😄). So now you see why Orleans is a good fit. We store the raw script in the database and have Orleans load it on demand when we need it. Once loaded, we parse the script once, cache the result in the Grain's Script property, and then reuse the cached Script to execute requests again and again.
We, of course, could have done all of this ourselves with normal asp.net. But then we would need to build the on-demand loading, caching, deactivation and cluster-wide plumbing ourselves.
But we don't have to, because Orleans exists and it does all of this for us and more. ✨
This blog post is getting a little long, so let's take a break here. Next time, we will see how the data plane is implemented and how to measure consumption for billing purposes. Until then!
I just finished my 2nd reading of "There Is No Antimemetics Division" by qntm. It is a short but wild novel that centers around the concept of creatures that the human mind cannot perceive, and a group of intrepid protagonists who must use deduction and raw intelligence to notice the gaps in their reality where they have encountered these creatures. The book explores a very thought-provoking concept: how do you combat an adversary that you cannot see or even fully comprehend?
This bears a passing resemblance to the topic of today's post: how do you find bugs when your software is so complex that you cannot tell if its output is correct? We see lots of documentation and blog posts talking about how to write tests for trivial functions and toy programs. But the process of writing good tests for software with complex behaviors or outputs is often left as an exercise for the reader. In this post I'm going to cover an advanced approach to help solve that problem.
Let's pretend for a moment that you have just completed a particularly challenging and intricate event scheduling system. It is a critical, high-performance component that was hard to write and full of tricky business requirements. The component reads a bunch of schedule entries (some for specific dates and times, others that repeat, some for only certain objects or groups of objects, some with superseding priority) and turns them all into a "compiled" `ISchedule` object that can be queried.
public interface ISchedule
{
    public ScheduleEvent? GetCurrentEvent(DateTime pointInTime, Entity entity);
}
You draw out a small example schedule to test. It seems ok, but after some thought, you are worried about your implementation accidentally applying an event every day, or every third Wednesday. So, you write many more invariants and conditions using small test schedules with a couple of events each.
However, the schedules in production will involve thousands of overlapping events for hundreds of entities. It seems intractable to test these large and complex schedules authoritatively. After a schedule gets to a certain size, it becomes difficult for humans to reason about what a correct answer to a query even is. We can't really write a mock to validate our answers, because the mock would need to be almost as complex as the SUT. How could we improve our confidence in the behavior of the SUT on high-complexity inputs?
I puzzled over this concept on and off for about 18 months before I stumbled on a research paper that finally gave me the vocabulary I needed to dig into the literature. The problem we are describing is called the Test Oracle Problem. A slightly more formal definition is given in the paper "The Oracle Problem in Software Testing: A Survey":
"Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behavior is called the 'test oracle problem'"
It turns out that several disciplines of software development suffer badly from this problem. Machine learning systems, compilers and search engines create outputs that are so complex that it is extremely difficult, if not impossible, to determine if they are fully correct. However, a technique called Metamorphic Testing can be used to reason about outputs that would otherwise be too large for humans to evaluate. It requires us to generalize our approach and move up to a higher level of reasoning.
The basic idea goes like this: you take some input value and alter it in a controlled way, creating a new, related input. You then rack your brain and come up with a "hypothesis" that your change should cause some predictable, observable effect on the output. Next, you randomly generate thousands of inputs, alter them, and run the input pairs through the SUT. Afterwards you inspect the differences between the outputs to see if your hypothesis holds. Concretely, the hypothesis is a rule that states, "given any two inputs that are related in a specific way, the outputs should be related in another specific way." This hypothesis is formally called a "Metamorphic Relation".
A basic example might look like this: You start with a list. You then copy it and add a new value to the copy. You then declare that whenever you add a new item to a list, the length of the new list should always be greater than that of the original. This might not seem profound, but it has a key advantage over normal invariants and postconditions: you do not need to know anything about the original list. The starting list might have been empty, or it might have had 100,000 items already. You only need to know about the change itself; in this case, adding a new item.
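That relation is easy to turn into a randomized test. Here is a minimal sketch in Python (the loop count and input ranges are arbitrary choices, not from the post):

```python
import random

# Metamorphic relation: appending an item to any list yields a list
# exactly one element longer than the original. We never need to know
# anything about the original list's contents or size.
for _ in range(1000):
    # Generate a random original input of arbitrary size and content.
    original = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]

    # Alter it in a controlled way to produce the related follow-up input.
    altered = original + [random.randint(-100, 100)]

    # Check the relation between the two outputs (here, the lengths).
    assert len(altered) == len(original) + 1
    assert len(altered) > len(original)
```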
If you think back to our example of the compiled schedule, we no longer need to know anything about the schedule itself, the items in it, or its validity. We could randomly generate an extremely large schedule involving thousands of rules and then use this technique to hunt for bugs despite the overwhelming complexity of the SUT output. So how might we attack the schedule problem in practical terms? Let's take another look at the definition of the SUT output.
public interface ISchedule
{
    public ScheduleEvent? GetCurrentEvent(DateTime pointInTime, Entity entity);
}
We have a function called GetCurrentEvent. It accepts two inputs and produces a nullable output. This seems very tough to validate in general, and contrasting two versions of it seems nearly impossible. But it turns out that we can transform this function into a concrete data structure that is easier to manipulate. Buckle up; we are going to go a little deep on the math for a moment.
We first need to recognize that any pure function can be thought of as a Set of input/output pairs. This might seem overly abstract, but trust me: Sets are good because they are quite easy to work with. So now we need to create a Set that is a decent representation of our function. We can do this simply by "sampling" the function. That is to say, we take a bunch of input values, run them through the function to get outputs, then take the input/output pairs and insert them into the Set. There are several ways we can sample the function: randomly, at regular intervals, or by some other heuristic.
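As a quick sketch of the idea (in Python, with a made-up toy function standing in for GetCurrentEvent), sampling turns a pure function into a Set we can compare with ordinary equality:

```python
# A toy pure function standing in for GetCurrentEvent (an assumption
# for illustration, not the schedule compiler's real API).
def f(x: int) -> int:
    return x * x - 3 * x

# Sample the function at regular intervals and store the input/output
# pairs in a Set. This Set is our concrete representation of f.
sample_points = range(-10, 11)
f_as_set = {(x, f(x)) for x in sample_points}

# A second implementation sampled the same way can now be compared
# with a plain equality check on the two Sets.
def g(x: int) -> int:
    return x * (x - 3)  # algebraically identical to f

g_as_set = {(x, g(x)) for x in sample_points}
assert f_as_set == g_as_set
```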
It should be noted, however, that sampling too loosely will cause the sample map to be a poor approximation of the underlying function and may miss key detail. And of course, sampling too much will be slow and inefficient. For my testing, I chose to take a sample for each schedule entity every 15 minutes from 2022-01-01 through 2022-12-31. That works out to about 35,000 data points for each entity in the schedule. That might seem like a lot, but it only costs a few KB of memory, and we should take advantage of having amazingly fast computers. Anyway, this produces a large mapping with "good" resolution into the underlying schedule.
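The sampling grid itself is easy to construct. Assuming the dates above, 15-minute steps give 96 samples per day, or 35,040 per entity for the year:

```python
from datetime import datetime, timedelta

# Build the sample timestamps: every 15 minutes across 2022.
start = datetime(2022, 1, 1)
end = datetime(2023, 1, 1)
step = timedelta(minutes=15)

timestamps = []
t = start
while t < end:
    timestamps.append(t)
    t += step

# 365 days * 96 samples per day = 35,040 data points per entity.
print(len(timestamps))  # 35040
```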
Now that we have a convenient (and powerful) way of contrasting outputs (by comparing the Set representations of sample maps), we can begin to speculate about what a useful hypothesis might be. How about this: "If I add a new schedule rule that does not match any valid Entity in the schedule, the schedule sample map should not change." We then use our tooling to generate thousands of random schedule rules (or groups of rules) that fit this description and test it! This, again, might sound simple, but it provides a way for us to test some properties of our SUT. Another rule might be "If I add a schedule rule that matches a valid entity, the schedule map should change." This might sound good at first, but if we think carefully, the statement isn't true. If we happen to add a schedule rule that exactly matches a rule already in the schedule, then the schedule would not change. Or more importantly, it should not change! We can turn this into a Metamorphic Relation like, "If I add a schedule rule that is already in the schedule, the schedule map should not change."
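Both relations can be sketched against a toy stand-in for the compiler. Everything here is an assumption for illustration: the rule shape, the entity names, and `compile_and_sample` are made up, not the real schedule API.

```python
import random

# Toy stand-in for the schedule compiler (hypothetical, for
# illustration only): a "rule" is (entity, slot, event), and compiling
# yields the sample map as a Set of ((entity, slot), event) pairs.
# Rules naming unknown entities have no effect.
def compile_and_sample(rules, valid_entities):
    schedule = {}
    for entity, slot, event in rules:
        if entity in valid_entities:
            schedule[(entity, slot)] = event
    return frozenset(schedule.items())

valid_entities = {"pump-1", "pump-2", "valve-7"}
rng = random.Random(42)
entities = sorted(valid_entities)
slots = rng.sample(range(1000), 50)  # distinct slots keep rules independent
base_rules = [(rng.choice(entities), slot, f"evt{slot}") for slot in slots]
baseline = compile_and_sample(base_rules, valid_entities)

for i in range(1000):
    # Relation 1: a rule matching no valid entity must not change the map.
    ghost = (f"ghost-{i}", rng.randrange(1000), "never-applies")
    assert compile_and_sample(base_rules + [ghost], valid_entities) == baseline

    # Relation 2: re-adding a rule already in the schedule must not
    # change the map either.
    dup = base_rules[rng.randrange(len(base_rules))]
    assert compile_and_sample(base_rules + [dup], valid_entities) == baseline
```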
But we can go deeper than this. Because we have translated our function into a convenient Set representation, we can use off-the-shelf set operations to transform our outputs. What about this relation: "If I add a schedule rule that affects entity X, then the output should only differ by samples that relate to X." We can check this using set subtraction, which leaves us with a Set containing only the differences between the two sets.
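A small sketch of that localization check, using Python's symmetric difference operator to capture disagreements in both directions (the sample maps below are toy data, not real schedule output):

```python
# Toy sample maps keyed by (entity, slot): "after" was produced from
# "before" by altering a rule that affects only "pump-1".
before = {("pump-1", 10, "idle"), ("pump-1", 11, "idle"), ("pump-2", 10, "run")}
after = {("pump-1", 10, "flush"), ("pump-1", 11, "idle"), ("pump-2", 10, "run")}

# Symmetric difference leaves exactly the samples on which the two
# maps disagree.
diff = before ^ after

# Relation: after altering only entity "pump-1", every differing
# sample must belong to "pump-1".
assert diff, "the change should be visible in the sample map"
assert all(entity == "pump-1" for entity, _slot, _event in diff)
```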
I bet you can think of a few more relations that exist between outputs. Each one helps us cover more of the possible input space, and they can do so while operating on massively complex inputs that we can't evaluate by hand. The key takeaway is that we can combine this with other, more traditional types of testing to cover more of the universe of possible defects. It isn't a silver bullet, but it does help us shine light on dark corners that might be hiding bugs.
I'll leave you with a favorite quote of mine about testing.
"…But unit tests are not enough. Type systems are not enough. Contracts are not enough, formal specs are not enough, code review isn't enough, nothing is enough. We have to use everything we have to even hope of writing correct code, because there's only one way a program is right and infinite ways a program can be wrong, and we can't assume that any tool we use will prevent more than a narrow slice of all those wrong ways."
- Hillel Wayne
Ever needed to process a set of data that is too large for a single computer? Like in that whole cloud thing everyone keeps talking about? This paper is a one-stop shop for learning about external sorting algorithms that do just that.
Ah yes, the mighty B-Tree, 50 years old and still the fundamental underpinning of most of the world's data. I had some fun this winter experimenting with copy-on-write trees to build a non-blocking transaction system and this paper was a great resource on how to remove bottlenecks from the design.
This paper defies gravity by side-stepping one of the most fundamental laws of filesystems: guaranteed write ordering. It demonstrates a technique for crash-proof data integrity even if write operations are interrupted or arrive out of order.
A stunning paper that shattered the performance ceiling on parsing unstructured text. It introduces and combines several techniques for parsing complex, delimited and escaped text, along with novel methods for parsing floating-point numbers. And it does so without any pre-processing and with zero compromise on security. That sound you hear is your solid-state disks shaking in their boots.
A paper so packed with ideas that it would make Ralph Merkle himself proud. It draws a line in the sand about the design, performance, and applications of hash trees while providing instructional insight into everything from cryptographic algorithm design and data validation all the way to SIMD CPU instructions and parallel data processing. It reminds me of back in high school when I learned how BitTorrent sliced and shared data securely and knew at once that I wanted to study computer science. This one is really special.