Read Committed Snapshot isolation and the Write Skew anomaly

adrogin
Nov 26, 2023
6 min read

Much has been said recently about the benefits of the Azure SQL for Business Central performance, and in particular a few articles focused on Read Committed Snapshot Isolation (RCSI) and the performance gain it brings about. RCSI is an option that must be explicitly enabled in on-premise SQL instances, but it is always switched on in Azure SQL, so this is the reality for all Business Central SaaS environments. If you are not famuliar with snapshot isolation, you can read Stefano Demiliani's post where he explains what this is all about or refer to Microsoft Learn documentation.

In this post, I will not write much about the principles of the row versioning and the snapshot isolation as such, but rather want to pay attention to a certain coding pattern that works quite successfully with the traditional implementation of the Read Committed isolation via row locks, but can cause errors once the database is switched to the Read Committed Snapshot mode.

Some time ago, I wrote about transaction isolation levels and the way these are defined through various data anomalies each of the isolation levels can tolerate. In another post, I also touched upon the Read Committed Snapshot isolation, but did not say anything about the anomaly associated specifically with this isolation level, and this is the gap I'm going to fill with this post. The following text and a few pictures will demonstrate the Write Skew anomaly in action.

Demo scenario

To illustrate the explanation, I prepared a demo example which can be found in my GitHub repository.

The sample scenario is a made-up and a very much simplified shopping cart on top of Business Central functionality. In this shopping cart, we allow users to select an item and reserve it for some time, until the payment is submitted. If we allow reservation of items in the cart, we must ensure the consistency of this reservation and prevent booking of something that is not on the inventory. Since this functionality is based on custom entities instead of Business Central orders or quotes, we cannot simply use the reservation functionality and have to do some custom availability calculations on top of the BC availability management. This is definitely not something I would recommend to do in real life scenarios, but it serves as a good illustration of the write skew inconsistency.

The table that will demonstrate the effect is my hypothetical shopping cart contains the customer number, selected items, and quantity of each item.

table 50711 "Shopping Cart"
{
    DataClassification = CustomerContent;

    fields
    {
        field(1; "Customer No."; Code[20])
        {
            Caption = 'Customer No.';
            TableRelation = Customer."No.";
        }
        field(2; Position; Integer)
        {
            Caption = 'Position';
        }
        field(3; "Item No."; Code[20])
        {
            Caption = 'Item No.';
            TableRelation = Item."No.";
        }
        field(4; Quantity; Decimal)
        {
            Caption = 'Quantity';
        }
    }

    keys
    {
        key(PK; "Customer No.", Position)
        {
            Clustered = true;
        }
        key(Item; "Item No.")
        {
        }
    }
}

The snippet below shows the availability calculation that includes additional quantities currently allocated in all customers' shopping carts. To make sure that customers are not allowed to place an order that cannot be fulfilled, this procedure invokes the BaseApp function CalcAvailableInventory and adds the items currently selected by all active customers. Bold lines zoom in on the statement where the discussed anomaly occurs.

local procedure IsItemAvailable(
    ItemNo: Code[20]; OldQuantity: Decimal; NewQuantity: Decimal): Boolean
var
    Item: Record Item;
    AvailableToPromise: Codeunit "Available to Promise";
    AvailableQty: Decimal;
    TotalBookedQty: Decimal;
    InsufficientQtyErr:
        Label 'Ordered quantity is %1, but only %2 available.',
        Comment = '%1: Ordered quantity, %2: Available quantity';
begin
    Item.Get(ItemNo);
    AvailableQty := AvailableToPromise.CalcAvailableInventory(Item);
    TotalBookedQty := CalcTotalBookedItemQty(ItemNo);
    if NewQuantity - OldQuantity > AvailableQty - TotalBookedQty
    then begin
        LogEvent(
            StrSubstNo(
                InsufficientQtyErr, NewQuantity,
                AvailableQty - TotalBookedQty + OldQuantity));
        exit(false);
    end;

    exit(true);
end;

Zoom in closer once again. The bold line in the following function runs the SQL query which calculates the total quantity which is currently placed in all shopping carts of active customers.

I want to avoid dirty reads, so I set the isolation level to Read Committed.

local procedure CalcTotalBookedItemQty(ItemNo: Code[20]): Decimal
var
    ShoppingCart: Record "Shopping Cart";
begin
    ShoppingCart.ReadIsolation := IsolationLevel::ReadCommitted;
    ShoppingCart.SetRange("Item No.", ItemNo);
    ShoppingCart.CalcSums(Quantity);
    exit(ShoppingCart.Quantity);
end;

To see the code in the context, have a look at the objects in the repository, and now let's run the first test.

Running the sample without snapshot isolation

The test that I run starts two background sessions which simulate the activity of two customers selecting the same item and then updating the cart quantity. I picked the loudspeakers LS-150 as the demo item, of which there are eight pieces in the Cronus inventory (I don't care about specific locations here, just count the total inventory).

Each of the two test sessions follows the same steps:

Put 3 pcs of the item LS-150 in the shopping cart.
Commit and wait for 5 seconds.
Change the quantity in the cart from 3 to 5 and wait without commiting to let the other session to catch up.
Commit.

The second customer session starts one second later and runs the same scenario, so the last update from 3 pcs to five brings the total booked quantity to 10, which exceeds the stock - so this update is expected to fail.

At first, I deploy my code on a container with the default SQL Server settings, meaning that the Read Committed Snapshot isolation is turned off.

Events logged by both sessions match the expectations - all 8 items are booked in two shopping carts, and the last attempt to increase the quantity was not successful.

Read Committed Snapshot Isolation

Now I'm switching the Cronus database to the Read Committed Snapshot mode. To enable the snapshot in the container, I open the database properties and change two properties in the Miscellaneous section: Allow Snapshot Isolation and Is Read Committed Snapshot On.

When I click OK, SQL Server gives me a warning that all current database connections wil be reset, and I'm ready to go.

Run the test again and look at the result.

Wait! What just happened?! Did we just overbook the item despite all the availability checks which worked perfectly in the previous test? Yes, this is exctly what happened, and this situation illistrates the write skew anomaly, an inconcistency typical to isolation achieved through row versioning mechanisms.

Some process details

Now I will try my best to explain the processes under the hood that allow the inconsistency to happen, supplementing the text with some diagrams.

The first diagram shows the low-level database events in the case when the snapshot isolation is disabled. I skipped the first availability check to simplify the picture. Two horizontal lines in the diagram correspond to two shopping cart records, and numbered circles are events which are explained below.

Here are some details of the interactions between the two sessions.

First customer puts the item in the cart. 3 pcs are reserved, and 5 more are available.
COMMIT: The first session commits the transaction.
Now Customer 2 comes into play and also puts his 3 items in the shopping cart. 6 are booked in total.
COMMIT of the second transaction.
Now Customer 1 wants to update the cart changing the quantity from 3 to 5, and this update triggers the availability check. This check reads both cart entries with Read Committed isolation placing transient shared locks on the rows and immediately releasing them.
Verification confirmed the item availability - now the new quantity is set and the record is modified. The update places an exlusive lock on the row, which will be held until the end of the transaction.
Customer 2 wants to change the quantity too and sends the request to update the cart entry. Once again the availability check reads both carts records with Read Committed. But this time, this select is not successful because the ReadCommitted isolation forces it to acquire a shared lock on all selected rows, but row 1 is currently locked exclusively by the Customer 1 session. Exclusive lock (X) is not compatible with the shared lock (S), so the transaction started by Customer 2 has to wait for the lock to be released.
COMMIT. Transaction started by the Customer 1 is successfully completed and all the locks it was holding are released. New quantity for the cart entry 1 is saved.
Now the second transaction can finally access the record it needs to verify the item availability. The transaction reads the updated quantity, and the total booked quantity now sums to 8 - all of the inventory. At this point the availability check fails and the Customer 2 cannot place any more items in the shopping cart.

Now let's look at the same process with the snapshot isolation enabled.

Steps 1 through 4 in the second diagram are exactly the same as before. Both processes create new records independently from each other. Step 5 looks the same in the diagram, but there is an internal difference in the way this query is executed. SQL Server does not place the shared locks on the selected records, as it did when the snapshot was disabled. But at this moment it does not make any difference for our shopping scenario, so I keep the details of the first five steps out.

Step 6 is where the significant deviations begin.

What we just observed is the Write Skew anomaly which is possible when two transactions modify two separate records, but the outcome of each transaction, when it is committed, could potentially impact the other transaction. The cause of the anomaly is the fact that both of the transactions don't place any locks on the rows they read or modify and only see the version of the row which existed before the transaction started.

All in all, the example I demonstrated here is not the best way to build a reservation system. Relying on the Business Central reservation functionality is a far better idea. But how can we be sure that this does not happen with order reservations in BC? And how can we avoid anomalies when we find a similar patterns in custom code? In the next episode, I will answer these questions. Stay tuned.

Read Committed Snapshot isolation and the Write Skew anomaly

Demo scenario

Running the sample without snapshot isolation

Read Committed Snapshot Isolation

Some process details

Recent Posts

SIGN UP AND STAY UPDATED!