So what is Model-Based Testing anyway?

Howdy! This is another chapter in my series about software testing. You don't have to have read any of the previous installments to understand this article, but "An Introduction to Testing the Impossible" covers some similar concepts and tools.

Today we are going to cover another advanced testing technique called Model-Based Testing that can help us find exotic bugs very quickly without a lot of code. It can also go places that other testing strategies cannot, such as testing systems for bugs that only occur after a magic sequence of operations.

If you like this one, check out some of the other pages on my blog for similar content.

  1. An Introduction to Testing the Impossible
  2. It is just a formality
  3. Getting started with Dafny: Your first formal proof

😐 So, what is Model-Based Testing anyway?

Model-based testing (hereafter MBT) is a testing discipline that involves building a model, also known as a mock, of a system and then performing side-by-side tests of the system and the mock, looking for differences. The big takeaway here is that MBT is most effective at testing conformance to an idealized model of a system.

To add some more color, we can contrast this with property-based testing. With PBT, we are generally performing the same operation on a system many times with randomized inputs each time, checking to ensure that a specific behavior, also known as a “property”, is upheld at all times. Model-based testing is also randomized, but instead of testing for adherence to a single hardcoded property, we are testing many or all observable behaviors of a system at the same time. 

Also of note with MBT is its focus on the behavior of systems across multiple operations or API calls. You provide the MBT runtime with a list of possible operations on a system’s API. The runtime then randomly generates sequences of these operations and executes them on the system-under-test (hereafter SUT) and the model side by side. In contrast, PBT typically resets the state of the SUT on each test iteration.

After each sequence of operations has executed, the model’s state and the SUT’s state are compared to ensure they are in agreement. If they are, the test iteration succeeds; if they are not, it fails. This “agreement” test can evaluate many properties of the system at a time, including internal state, which allows for higher-density testing.

This multi-property testing per iteration, combined with the ability to search for bugs that only occur after many operations in a sequence, makes MBT one of the most efficient forms of test automation. Hopefully you find this as exciting as I do 😃!
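To make that loop concrete, here is a minimal conceptual sketch of what an MBT runtime does. All of the names here (`CreateSut`, `CreateModel`, `GenerateRandomOperations`, `StatesAgree`) are illustrative placeholders, not any particular library's API:

```csharp
// Conceptual sketch of the MBT core loop, not a real library.
for (var iteration = 0; iteration < 100; iteration++)
{
    var sut = CreateSut();      // the real system under test
    var model = CreateModel();  // the simple mock/model

    // Generate a fresh random sequence of operations for this iteration
    // and apply each one to BOTH systems.
    foreach (var op in GenerateRandomOperations())
    {
        op.ApplyTo(sut);
        op.ApplyTo(model);
    }

    // The "agreement" check: after the whole sequence, do the two
    // systems still look semantically equivalent?
    if (!StatesAgree(sut, model))
        ReportFailure(sut, model);
}
```

Real MBT libraries add shrinking, reproducible seeds, and per-operation checks on top of this skeleton, but the side-by-side structure is the heart of it.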

🤔 When can I use model-based testing?

So, we’ve established that it is a powerful technique, but is it always appropriate? Well, it depends on how much work you want to put in. Here are a few guidelines:

  • ✅ Consider MBT if your SUT has a complex implementation, but can be mocked easily
  • ✅ Consider MBT if your SUT maintains internal state
  • ✅ Consider MBT if you have concerns about complex interactions of your SUT's API
  • 🤔 You might want to avoid MBT if your SUT has poorly defined API semantics
  • 🤔 You might want to avoid MBT if your SUT has very slow performance
  • 🤔 You might want to avoid MBT if your SUT's behavior changes a lot

Let's break these down. MBT works really well if you have a complicated system, but a simple API. For example, consider this interface:

public interface IKeyValueStore<TKey, TValue>
{
    TValue Get(TKey key);
    void Delete(TKey key);
    void Post(TKey key, TValue value);
    void Put(TKey key, TValue value);
}

This interface is pretty simple and has obvious semantics, but we can imagine that under the covers it could have a very complex implementation, using advanced data structures, value serializers and I/O operations. Meanwhile, we can mock it with minimal effort using just a Dictionary or Map type. This is a great candidate for MBT because not only is the API a good fit, but the mock is trivial to set up.
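For instance, a mock for the interface above could be little more than a wrapper around a `Dictionary`. This is a sketch; the exact semantics (does `Get` throw on a missing key? does `Post` reject duplicates?) would need to match whatever contract your real store promises:

```csharp
// A minimal model of IKeyValueStore<TKey, TValue> backed by a Dictionary.
// The semantics chosen here are assumptions for illustration.
public sealed class MockKeyValueStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
    where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _data = new();

    public TValue Get(TKey key) => _data[key];              // throws on a missing key
    public void Delete(TKey key) => _data.Remove(key);      // no-op if absent
    public void Post(TKey key, TValue value) => _data.Add(key, value); // create-only: throws on duplicates
    public void Put(TKey key, TValue value) => _data[key] = value;     // upsert
}
```

A handful of lines, yet it fully specifies the observable behavior we want the real store to conform to.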

Next, let’s look at another example:

public record Person(int Id, string Name, string Email, ICalendar<Person> Calendar);
public record Location(string Name, ICalendar<Location> Calendar);
public record CalendarEvent(
    string Name,
    DateTimeOffset Start,
    DateTimeOffset End,
    string Description,
    Location Location,
    Person Organizer,
    ImmutableSortedSet<Person> Attendees
);

public record TimeSlot(DateTimeOffset Start, DateTimeOffset End);
public interface ICalendar<T>
{
    public IEnumerable<CalendarEvent> GetEvents(
        DateTimeOffset start, DateTimeOffset end
    );

    public IEnumerable<TimeSlot> FindAvailableTimeSlots(
        DateTimeOffset start, DateTimeOffset end, TimeSpan duration
    );
    public void AddEvent(CalendarEvent calendarEvent);
    public void RemoveEvent(CalendarEvent calendarEvent);
}

public interface IOrganization
{
    public IEnumerable<Person> GetPeople();
    public IEnumerable<Location> GetLocations();

    public IEnumerable<TimeSlot> FindAvailableTimeSlots(
        Location location,
        IEnumerable<Person> attendees,
        DateTimeOffset searchStart,
        TimeSpan duration
    );

    public CalendarEvent BookEvent(
        string name,
        TimeSlot timeSlot,
        string description,
        Location location,
        IEnumerable<Person> attendees
    );

    public void CancelEvent(CalendarEvent calendarEvent);
}

In this example, the API gets a little more nuanced. The challenge here is that in order to create a mock/model of the system, we are going to have to implement some real code to match the semantics. In fact, the mock might get complicated enough that we could accidentally write a bug into the mock itself. At first, a potentially buggy mock might seem useless and a waste of time, but it can actually still help!

Just like in my article about metamorphic testing, we can still get some juice for the squeeze as long as we approach testing as an iterative, exploratory process. Now, if by chance the mock and the SUT share an identical bug that we don’t have an explicit validation for, then we can’t catch it. But if the two systems simply disagree because one of them has a bug, we can catch it and go investigate.

Let's look at another example:

public interface ICompiler
{
    public byte[] Compile(string code);
}

Now in this example, MBT might not be as effective, for several reasons.

First, the core behavior of the system, compiling valid code, isn’t really going to be exercised as a primary function of an MBT test. This is much better suited to property-based testing, which validates the outputs of many individual inputs rather than sequences of multiple inputs.

Second, we can reasonably expect that calling this API multiple times in a row will probably not yield many bugs. The SUT probably doesn’t persist any data between calls, so there isn’t much use in validating the state of the SUT after each operation, because there probably isn’t any.

Third, if we are validating the output or the internal state of the SUT, then the mock will have to be of similar complexity to the SUT. This could be a huge undertaking because compilers are hellishly complex beasts. If we cut corners on the mock and make it really simple, then we might not gain insight into complex behaviors of the SUT. But if we invest large amounts of time into a high-fidelity mock, then we would still need to make many assertions between the mock and the SUT for the exercise to be worthwhile overall.

This is a case that is probably better suited to property-based or metamorphic testing.

The last three rules are easier guidelines to follow:

  • 🤔 You might want to avoid MBT if your SUT has poorly defined API semantics
    • Since you will have to write a mock implementation of your SUT, if you are unsure of what behaviors will be relevant to your mock, it is really hard to know what to do. In these situations you might want to start with a very simple mock and then add complexity to it iteratively. But MBT might be too time consuming or confusing if you don't know where you are going.
  • 🤔 You might want to avoid MBT if your SUT has very slow performance
    • Since MBT relies on performing lots of test iterations to look for bugs, it might take prohibitively long amounts of time or resources to find any worthwhile bugs. But if you have the time, it can still be useful.
  • 🤔 You might want to avoid MBT if your SUT's behavior changes a lot
    • MBT requires changing your mock implementation in concert with your SUT. Depending on the complexity of your mock, this might become a big maintenance burden. You'll have to decide how much time you are willing to put into maintaining your tests.


🤨 Are random sequences of operations actually meaningful?

In short, yes, but with caveats. You might have a lingering question about how random sequences of operations can yield meaningful results. In the first example, what if the MBT runtime picks a sequence of three `Get` operations? They will all obviously fail, because the sequence never put any valid records into the store before querying it. Or perhaps a sequence did put a random record into the store, but it had a key like 237236127, and the odds of a random `Get` operation on a random key actually finding this record are basically zero. These are obvious gaps that randomized testing will struggle with, but we can employ some tactics to help the system along.

The most important thing to realize is that while it might be obvious to define an MBT operation for each of the methods in an API, we don’t have to maintain a 1-to-1 relationship. We could define a composite operation that performs a Put for a key and then performs a Get for the same key. In this way we can make sure we are exercising this specific use case. Our MBT library may also allow us to define dependent sequences of operations, where the choice of the first operation determines the valid choices for the next.
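As a sketch, a composite put-then-get operation for the key-value store from earlier might look like this. This is illustrative code that mirrors the `Operation(name:, actual:, model:)` shape used later in this article; the generator and store types are assumptions:

```csharp
// A composite operation: Put a random key/value, then immediately Get it back.
// This guarantees the Get path is exercised against a key that actually exists.
var putThenGetOp =
    Gen.Select(Gen.Int, Gen.String)
        .Operation<IKeyValueStore<int, string>, IKeyValueStore<int, string>>(
            name: args => $"PutThenGet({args.Item1})",
            actual: (store, args) =>
            {
                store.Put(args.Item1, args.Item2);
                _ = store.Get(args.Item1); // must not throw: we just wrote this key
            },
            model: (store, args) =>
            {
                store.Put(args.Item1, args.Item2);
                _ = store.Get(args.Item1);
            }
        );
```

The runtime can still interleave this composite operation with plain single-method operations, so we get both the guaranteed hit path and the chaotic random coverage.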

Although composite and constrained operations are critical for ensuring efficient coverage of valid scenarios, even unconstrained random operations can hunt for bugs that you might not have been expecting. The only drawback is that you might have to trudge through some noisy data to separate real bugs from invalid scenarios.

📚 Case Study: Sequence Writer

Enough abstract explanations, let's get into a real example of how this saved me a lot of time and grief when validating what should have been a simple project.

.NET has a type called a `ReadOnlySequence<T>`, which is like a `List<T>`, but it might be made up of multiple linked buffers under the covers. This is useful for lots of reasons that are outside the scope of this document, but you can imagine some of them. Now, .NET defines the `ReadOnlySequence<T>`, but confusingly it doesn’t define a convenient way to actually create one 🤦‍♂️. So I defined an `ISequenceWriter<T>` that allows us to create sequences!
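To be precise, .NET does let you construct a single-segment `ReadOnlySequence<T>` directly from an array, but building a multi-segment one requires wiring up your own `ReadOnlySequenceSegment<T>` subclass, which is exactly the fiddly part a writer type should hide:

```csharp
using System.Buffers;

// Single-segment: trivial, a public constructor exists.
var single = new ReadOnlySequence<byte>(new byte[] { 1, 2, 3 });

// Multi-segment: you must link the segments together yourself.
var first = new Segment<byte>(new byte[] { 1, 2 });
var last = first.Append(new byte[] { 3, 4 });
var multi = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

Console.WriteLine(multi.Length); // 4 items, spread across two linked buffers

// Minimal segment implementation: maintains the Next link and the
// RunningIndex (the segment's offset within the whole sequence).
sealed class Segment<T> : ReadOnlySequenceSegment<T>
{
    public Segment(ReadOnlyMemory<T> memory) => Memory = memory;

    public Segment<T> Append(ReadOnlyMemory<T> memory)
    {
        var next = new Segment<T>(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```

Getting the `RunningIndex` bookkeeping right is precisely the kind of detail that is easy to get subtly wrong, which is what makes the writer worth testing hard.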

This seems pretty simple, but leave it to me to mess it up in subtle ways. I’m really bad at things like boundary conditions and overlapping buffers and stuff, and there is plenty of that in this type. Also, I'm doing some fancy footwork with buffer rentals and trying to maximize buffer utilization. But the interface is pretty simple:

public interface IBufferWriter<T>
{
    void Advance(int count);
    Memory<T> GetMemory(int sizeHint = 0);
    Span<T> GetSpan(int sizeHint = 0);
}

public interface ISequenceWriter<T> : IBufferWriter<T>
{
    public ReadOnlySequence<T> ToSequence();
}

You rent some buffers, fill them with data, advance the writer’s position, and repeat until you eventually call `ToSequence()` to yield the final result. This is an example of where the API of the SUT doesn’t make sense to implement as multiple operations. With an `IBufferWriter<T>`, you must first call either `GetSpan()` or `GetMemory()` and then call `Advance()` after you have written some data into the provided buffer. If we implemented the test as three separate operations (one for each method), then the MBT runtime would generate lots of invalid usages that would just throw exceptions. So instead, I’ve implemented a single operation that invokes the methods in the correct order. This might seem less useful, but it increases the signal-to-noise ratio, because any exception we get is unexpected rather than just invalid API usage.
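Typical usage of the writer looks roughly like this (a hypothetical sketch against the interface above; `SequenceWriter<int>` is the implementation under test in this article):

```csharp
ISequenceWriter<int> writer = new SequenceWriter<int>();

// Rent a buffer, write into it, then commit with Advance. Repeat as needed.
var span = writer.GetSpan(sizeHint: 4); // may return a buffer larger than requested
for (var i = 0; i < 4; i++)
    span[i] = i;
writer.Advance(4); // commit exactly the 4 items we actually wrote

// Finally, yield the (possibly multi-segment) result.
ReadOnlySequence<int> sequence = writer.ToSequence();
```

Note the call-order contract baked into this flow: `Advance` is only meaningful after a `GetSpan`/`GetMemory`, which is why the test models the whole rent-write-advance cycle as one operation.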

Moving on, let's look at the design of the mock:

private sealed class MockSequenceWriter<T> : ISequenceWriter<T>
{
    private readonly ArrayBufferWriter<T> _bufferWriter = new();
    public Span<T> GetSpan(int sizeHint = 0) => _bufferWriter.GetSpan(sizeHint);
    public Memory<T> GetMemory(int sizeHint = 0) => _bufferWriter.GetMemory(sizeHint);
    public void Advance(int count) => _bufferWriter.Advance(count);
    public ReadOnlySequence<T> ToSequence()
    {
        var data = _bufferWriter.WrittenSpan.ToArray();

        return new ReadOnlySequence<T>(data);
    }
}

This is super barebones, but it does the job. I kinda lied earlier when I said there is no implementation of a sequence writer in .NET. There is `ArrayBufferWriter<T>`, but it is super simple and no better than using a `List<T>`. However, I can use it under the covers to simplify my mock. This is what you are looking for when writing a mock model in MBT: minimum code, maximum fidelity.

Next, we are going to need to get familiar with the testing framework. I am going to use the same library that I used for the metamorphic testing article: CsCheck. Let's refresh a little on some core concepts of CsCheck.

Central to everything in CsCheck is the concept of `Gen<T>`. Gen stands for generator, and it is a kind of factory for instances of T. It does some fancy stuff to make the testing process more ergonomic overall, but don’t worry about that for now. There are a number of built-in primitive Gen instances for convenience, but we can transform or combine them to make more complicated structures. Here is a quick-start guide:

[Fact]
public void Examples()
{
  var constant = Gen.Const(1);
  var randomChoice = Gen.OneOfConst(1, 2, 3, 4, 5);

  var anyInt = Gen.Int;
  var restrictedRangeInt = Gen.Int[0, 100];
  var arrayOfInt = Gen.Int.Array[1, 100];//length 1-100

  var simpleOnlyOddInts = Gen.Int.Where(i => i % 2 == 1);//`where` is wasteful
  var betterOnlyOddInts = Gen.Int.Select(i => 1 + (i * 2));//`select` is efficient
  var daysOfWeek = Gen.Enum<DayOfWeek>();

  var stringGen = Gen.String;
  var hexCharsOnlyString = Gen.String["1234567890abcdef"];
  var buildHexString = Gen.Char["1234567890abcdef"].Array[1, 10].Select(chars => new string(chars));

  Gen<(int, char)> tupleGen = Gen.Select(Gen.Int, Gen.Char);
  //linq syntax is very convenient for composite gens
  Gen<(int, char)> linqSyntaxTupleGen =
      from i in Gen.Int
      from c in Gen.Char
      select (i, c);

  Gen<Color> complexType =
      from a in Gen.Byte
      from r in Gen.Byte
      from g in Gen.Byte
      from b in Gen.Byte
      select Color.FromArgb(a, r, g, b);

  //property based testing
  anyInt.Sample(
      iter: 100, //optional: iteration count
      threads: -1, //optional: thread count

      predicate: i =>
      {
        //are all squares divisible by their factor? (the answer is 'no' btw)
        var square = i * i;
        return square % i == 0;//return false if assertion fails (or throw exception)
      });
}


OK, now that we have gotten back up to speed on PBT, let’s look at the extensions we need to make this model-based. First, let’s look at the signature of the model sampling method:

public static void SampleModelBased<Actual, Model>(
    this Gen<(Actual, Model)> initial,
    GenOperation<Actual, Model>[] operations,
    ...);

So basically, we need to provide a Gen that will build the initial states of our mock and our SUT. We are not going to do much with this in our example project, but you can imagine how it can be useful to also be able to randomize the initial state of our system before executing operations on it.

Then we need one or more "GenOperations" for the CsCheck runtime to arrange into sequences for us. Not too bad. I should note that there are both an `Actual` and a `Model` type parameter on this method, so you can use completely different types for the SUT and the mock without the need for a unifying interface. Neat! But we are actually going to consolidate some of the implementations, because our two types do in fact share an interface.

I thought about walking you through the process that I went through of refining the implementation down to a minimal form, but I think I am going to skip ahead some and just show the finished product with some comments added in for clarity.

//this object holds all the arguments needed to perform an operation
private sealed record RentWriteAdvanceOpArgs(
    BufferType BufferType, 
    int RentByteCount, 
    int WriteByteCount
);
public enum BufferType { Span, Memory }

[Fact]
public void ModelConformanceTest()
{
    //this is the gen that creates the initial states of our SUT and mock.
    //for our test, we are always going to create the same initial objects
    //so we use Const
    var initial = Gen.Const(() => (
        new SequenceWriter<int>(),
        new MockSequenceWriter<int>()
    ));

    //next we are going to build the object that represents the arguments of the operation
    //we will talk about what these args do later
    var rentAndWriteGen =
        from bufferType in Gen.Enum<BufferType>()
        from rentCount in Gen.Int[0, 5000]
        from writeCount in Gen.Int[0, rentCount]
        select new RentWriteAdvanceOpArgs(bufferType, rentCount, writeCount);

    //this is how we define the GenOperation type. this is the encapsulation of the logic 
    //of the operation. It has 3 parts:
    // - A function to label the instance of the operation (for debugging)
    // - A function to perform the operation on the SUT/actual
    // - A function to perform the operation on the mock/model
    var rentWriteAdvanceOp =
        rentAndWriteGen.Operation<ISequenceWriter<int>, ISequenceWriter<int>>(
            name: args => $"[{args.BufferType}, {args.RentByteCount}, {args.WriteByteCount}]",
            actual: RentWriteAdvancedAction,
            model: RentWriteAdvancedAction
        );

    //we wrap the operation(s) in an array
    var operations = new[] {
            rentWriteAdvanceOp
        };

    //finally we put it all together 
    Check.SampleModelBased(
        initial, operations, TestEquality, printActual: Print, printModel: Print,
        threads: -1, iter: 100, seed: null //configurations
    );
}

To recap what we are seeing here, we first declare the generator for the initial state of the SUT and model. This is because the MBT runtime has support for starting the test iteration from arbitrary points and then running arbitrary sequences of operations on it.

The next detail is that we have defined an object called `RentWriteAdvanceOpArgs`, which encapsulates the args needed to perform a `RentWriteAdvance` operation. In this way CsCheck separates data from logic.

Next up we define the actual operation for this test. A GenOperation object includes two operation functions, one for the actual/SUT and one for the model/mock. These hooks are functions that accept the test target object and an args object that parameterizes the operation. Typically, you would pass a separate lambda for each prong of the test (SUT and mock), but I have used a shared function to save code, because both my SUT and mock present the same interface.

It is important to note that even though there is just one defined operation, the MBT runtime will still chain together randomly generated sequences of randomized operations, so we are still exercising the API effectively.

The actual test function looks like this:

//opArgs is a randomly generated set of arguments for the operation
private static void RentWriteAdvancedAction(ISequenceWriter<int> sequenceWriter, RentWriteAdvanceOpArgs opArgs)
{
    //pick a buffer type to rent from the sequence writer
    //we do this so we are exercising both APIs
    var buffer = opArgs.BufferType switch {
        BufferType.Memory => sequenceWriter.GetMemory(opArgs.RentByteCount).Span,
        BufferType.Span => sequenceWriter.GetSpan(opArgs.RentByteCount),
        _ => throw new NotSupportedException()
    };

    //assert that the returned buffer is at least as large as the requested size
    //this is a requirement of the API
    Guard.HasSizeGreaterThanOrEqualTo(buffer, opArgs.RentByteCount);

    //write unique values to the buffer so that we can verify the 
    //correctness of the data later
    for (var i = 0; i < opArgs.WriteByteCount; i++)
        buffer[i] = i;

    //advance the sequence writer, leaving it ready for the next operation
    sequenceWriter.Advance(opArgs.WriteByteCount);
}

Finally, after the setup, we call `SampleModelBased` in order to start the test.

 Check.SampleModelBased(
     initial, operations, TestEquality, printActual: Print, printModel: Print,
     threads: -1, iter: 100, seed: null //configurations
 );

...

private static bool TestEquality<T>(ISequenceWriter<T> actual, ISequenceWriter<T> model)
{
    var actualData = actual.ToSequence().ToArray();
    var modelData = model.ToSequence().ToArray();

    return actualData.SequenceEqual(modelData);
}

private static string Print<T>(ISequenceWriter<T> writer)
{
    var data = writer.ToSequence();
    var builder = new StringBuilder();
    _ = builder.Append('[');
    foreach (var segment in data)
    {
        foreach (var item in segment.Span)
        {
            _ = builder.Append(item).Append(',');
        }
    }
    _ = builder.Append(']');

    return builder.ToString();
}

I am passing several optional arguments to the function for convenience of debugging, but the last required one is the `TestEquality` method, which is used by CsCheck to validate that the SUT and the mock are semantically equivalent. In my case, the only thing I care about is “is the data exactly the same?”, so the method is pretty simple.

If our test fails, CsCheck will report back to us with a message that looks like this:

CsCheck.CsCheckException : Set seed: "0nQm8eH9ps6c" or -e CsCheck_Seed=0nQm8eH9ps6c to reproduce (3 shrinks, 77 skipped, 100 total).

    Operations: [{Span Rent=3448 Write=3119}, {Memory Rent=1162 Write=639}, {Span Rent=1131 Write=801}]
Initial Actual: []
Initial  Model: []
     Exception: System.ArgumentException: Parameter "buffer" (System.Span<int>) must have a size of at least 1162, had a size of <977>. (Parameter 'buffer')
   at ___.Core.Tests.SequenceWriterTests.RentWriteAdvancedAction(ISequenceWriter`1 sequenceWriter, RentWriteAdvanceOpArgs operation) in D:\Repos\___\___\src\tests\Core.Tests\SequenceWriterTests.cs:line 52
   at CsCheck.GenOperation.<>c__DisplayClass3_1`3.<Create>b__1(Actual a)
   at CsCheck.Check.<>c__DisplayClass62_0`2.<SampleModelBased>b__1(ModelBasedData`2 d)

Here we can see that it found a problem with a particular sequence of three operations. There are two things to note about this.

First, CsCheck actually found a more complex failure than this, but it "shrunk" the failing case down three times until it found this minimal failure case. This is a really important feature, because sometimes CsCheck will find a failure that took 80 operations to trigger. Trying to debug that would be really difficult, so it spends some time working backwards to find a minimal repro for the bug. In this case, it ended up skipping 77 of 100 more complex cases while searching for a minimal repro. This test pass only took ~60ms to execute, so we can tell CsCheck to spend a little more time trying to shrink the problem by adjusting either the `iter` or `time` arguments to `SampleModelBased`. If I increase the time to 1 second, then the failure report looks like this:

CsCheck.CsCheckException : Set seed: "0yhbcum0j1Cf" or -e CsCheck_Seed=0yhbcum0j1Cf to reproduce (22 shrinks, 10,099,674 skipped, 10,260,565 total).
 
     Operations: [{Span Rent=99 Write=93}, {Span Rent=44 Write=12}]
 Initial Actual: []
 Initial  Model: []
      Exception: System.ArgumentException: Parameter "buffer" (System.Span<int>) must have a size of at least 44, had a size of <35>. (Parameter 'buffer')
    at ___.Core.Tests.SequenceWriterTests.RentWriteAdvancedAction(ISequenceWriter`1 sequenceWriter, RentWriteAdvanceOpArgs operation) in D:\Repos\___\___\src\tests\Core.Tests\SequenceWriterTests.cs:line 52
    at CsCheck.Check.<>c__DisplayClass62_0`2.<SampleModelBased>b__1(ModelBasedData`2 d)

We can see here that CsCheck has now had time to shrink the case down to two small operations that should be much easier for us to debug. You can also see that in just 1 second, CsCheck performed over 10 million iterations, discarding the vast majority while searching for simpler reproduction steps. This leads me to the second important detail.

Since CsCheck uses randomized testing, if we run this same test 10 more times, we will get 10 different failures, or sometimes even no failures if the bug is hard to trigger and the iteration/time budget is too low. But once we have found a failure, we can take the "seed" value (here `0yhbcum0j1Cf`) and pass it to the `SampleModelBased` function to make CsCheck replay this specific failure deterministically so we can study it in the debugger.
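Replaying a failure is just a matter of passing the reported seed back in. This sketch reuses the variables and parameter names from the `SampleModelBased` call shown earlier; the seed string is whatever your failing run printed:

```csharp
// Deterministically replay the exact failing sequence from the report above.
Check.SampleModelBased(
    initial, operations, TestEquality, printActual: Print, printModel: Print,
    threads: 1,          // single-threaded makes stepping through in a debugger easier
    seed: "0yhbcum0j1Cf" // the seed reported in the failure message
);
```

Alternatively, as the failure message itself suggests, you can set the `CsCheck_Seed` environment variable instead of editing the test code.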

So how well did this actually work?

It worked shockingly well for this problem. This particular example of a sequence writer was almost tailor-made for MBT, since there are all kinds of weird edge cases around the buffer renting logic.

I'm going to be a bit vulnerable here and mention the bugs I found with this test:

  • [logic] Segment RunningIndex was being erroneously updated when linking to a new segment
  • [typo] Writes were being appended to `_head` segment instead of `_tail`
  • [typo] Inverted argument guard clause logic

That second bug was particularly nasty, because everything would work fine until a certain point. For example, if you asked GetSpan for a buffer of 600 bytes and then wrote 600, then asked for another buffer of 300 bytes and wrote 300, everything would work fine, because it just so happens that the actual buffer that got allocated was 1024 bytes, so it would hold everything you wrote in a single segment. It wasn't until you tried to rent a bigger chunk after renting a smaller chunk that things would start to break. And this bug's behavior might even change depending on the `MemoryPool` that the `SequenceWriter` was constructed with. But my test caught it.

There are a hundred questions that might lead to incorrect behavior. If we were writing this test by hand, how soon would we get tired of coding these edge cases before we gave up and assumed it was working? CsCheck will produce sequences of up to 127 operations to find bugs. That is a lot more due diligence than most of us are willing to put into our manual test suites.

  • What if we rent a buffer of size zero? 
  • What if we rent a non-zero buffer, but then write zero bytes into it? 
  • What if we do that 100 times? 
  • What if we rent a buffer smaller than the minimum allocation size of the MemoryPool?
  • What if we try to rent a buffer exactly the size of the currently allocated segment? 
  • What about if the rental was slightly larger? Or slightly smaller? 
  • What if we rent a giant buffer but then do multiple smaller writes into that segment? 

For the price of about 100 lines of code, we get a pretty extensive degree of correctness testing. Based on this test alone, I ended up making 40 lines of changes to my 250 line implementation. Some of those lines were simple fixes, but most were refactors in order to make the code more obvious and less error prone. Once my test passed cleanly with the default 100 iterations, I ran it with 100,000 and it also passed. With this single test, I feel pretty damn confident that my implementation is doing what it is supposed to for any kind of input that is thrown at it.

Lastly, it is also important to note that during this process I made several edits to the test itself. You see, each time the test failed, I had to ask, “is the test broken, or is the SUT?”. This feedback loop is essential during the testing phase, and it becomes even more useful the earlier you apply it in the development lifecycle. The testing library can ask you questions about your design that you hadn’t even thought of, and this can dramatically improve the robustness of your design and implementation.

I hope this has been a useful article. If so, let me know so that I know to write more like it 😃!