On the efficiency of indexes on a skewed dataset
- adrogin
- 2m
- 6 min read
Recently, one particular piece of code in Business Central's codeunit 1501 Workflow Management once again revived the old discussion about the benefits of calling IsEmpty before actually retrieving the records with FindSet. Yes, that inexhaustible if not IsEmpty then FindSet. I already wrote about this a few times here and here. But this time the combination of IsEmpty with FindSet appeared to be a good pretext to talk about index efficiency: when a seemingly good index does not achieve the desired performance result and is instead completely ignored by the query optimizer. How can we tell that an index is not a match for a query and maybe find a better one?
So let's have a look at the code in question, which sits in the procedure FindWorkflowStepInstanceWithOptionalWorkflowStart of the codeunit Workflow Management:
WorkflowStepInstanceLoop.Reset();
WorkflowStepInstanceLoop.ReadIsolation :=
    WorkflowStepInstanceLoop.ReadIsolation::ReadUncommitted;
WorkflowStepInstanceLoop.SetLoadFields(
    "Sequence No.", "Previous Workflow Step ID", Argument);
WorkflowStepInstanceLoop.SetRange(
    Type, WorkflowStepInstanceLoop.Type::"Event");
WorkflowStepInstanceLoop.SetRange(
    Status, WorkflowStepInstanceLoop.Status::Active);
WorkflowStepInstanceLoop.SetRange("Function Name", FunctionName);
WorkflowStepInstanceLoop.SetFilter(
    "Previous Workflow Step ID", '<>%1', 0);
WorkflowStepInstanceLoop.SetCurrentKey("Sequence No.");
if WorkflowStepInstanceLoop.FindSet() then
The code is very simple, nothing out of the ordinary is happening here: a set of filters applied to the Workflow Step Instance table, SetLoadFields, row ordering, and FindSet.
The reason this code suddenly appeared in the spotlight is the drastic difference in performance between FindSet and IsEmpty with the same filters set on the table, even when no records satisfy the filtering criteria and FindSet returns an empty recordset. With a large number of records in the Workflow Step Instance table, the FindSet in the last line performs very poorly, taking up to ten seconds, even if it finds no records. IsEmpty with the same set of filters, on the other hand, returns almost instantaneously, completing in no more than 2-3 milliseconds. Therefore, a conclusion is made that:
A query produced by the FindSet function has to scan the table and sort the whole dataset, whilst IsEmpty generates a query that can benefit from an index seek, and
It would be a good idea to add a condition if not IsEmpty then before the FindSet.
if not WorkflowStepInstanceLoop.IsEmpty() then
    WorkflowStepInstanceLoop.FindSet();
Is this indeed a good use case for the if not IsEmpty + FindSet combo? And why is FindSet so much slower in this case, even if no records are returned?
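For reference, the two AL calls translate into two quite different SQL statements. The sketch below is a simplification, not the literal generated SQL - the real statements use fully qualified table names with company and extension suffixes, parameterized option values, and the complete SetLoadFields column list:

```sql
-- IsEmpty: an existence check - TOP 1, no columns, no ordering
SELECT TOP (1) NULL
FROM [CRONUS$Workflow Step Instance]
WHERE [Type] = @type AND [Status] = @status
  AND [Function Name] = @functionName
  AND [Previous Workflow Step ID] <> 0;

-- FindSet: retrieves the rows, ordered by the current key,
-- with the FAST 50 hint asking for the first rows quickly
SELECT [Sequence No.], [Previous Workflow Step ID], [Argument]
FROM [CRONUS$Workflow Step Instance]
WHERE [Type] = @type AND [Status] = @status
  AND [Function Name] = @functionName
  AND [Previous Workflow Step ID] <> 0
ORDER BY [Sequence No.]
OPTION (FAST 50);
```

The ORDER BY clause and the FAST 50 hint in the second statement are exactly what gives the optimizer a reason to prefer an index that is already sorted by Sequence No.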
The second question is the key that can lead to the first answer, so let's start from here.
Why is FindSet much slower than IsEmpty?
To run the tests described below, I inserted 1 000 000 records into the Workflow Step Instance table, and all the following examples are based on this dataset.
The first suggestion was that the index scan combined with sorting is the cause of the poor performance, whereas the SQL query from IsEmpty can use an index seek and does not have to sort any records. But to counter this statement, I would add a side note that an index scan is not an evil operator that must be avoided at all costs. A plan with a scan wasn't necessarily a bad one even twenty years ago, and now, with data stored on SSDs, there is even less ground for blaming all performance problems on the index scan.
In fact, there is an index in the Workflow Step Instance table which is supposed to support the query that the AL code above produces. The index Key6 includes all the fields on which filters are applied:
key(Key6; Type, Status, "Function Name", "Previous Workflow Step ID")
At first glance the key looks good, as it includes all the filtered fields, but if we look at the actual execution plan, we can see that it does indeed contain an index scan. Despite this, the slowest part of the query is in fact a key lookup, not the scan itself.

And as we can see from the details of the index scan operation, it's not even Key6 being scanned. The query optimizer chooses Key2 for this query, ignoring the existence of Key6, which is presumably tailored specifically for this exact select.

Key2 is defined simply as a key with a single field: Sequence No.
key(Key2; "Sequence No.")
We remember, of course, that BC actually adds all primary key fields to every index, so in fact, this index includes Sequence No., ID, Workflow Code, Workflow Step ID. But for the following argument the additional primary key fields are not so important - we can focus on the Sequence No.
The details of the key lookup operation show us that it takes so much time because the lookup is executed exactly 1 000 000 times - once per index row.

And evidently, none of the one million executions of the lookup returned any data. Still, SQL Server must read every row from the clustered index to validate the filter conditions, one row at a time.
On the other hand, IsEmpty can indeed use index seek, and it can be as fast as just 1 millisecond.

And this index seek actually reads Key6 - the index which was designed for this very query.

What is also remarkable about the FindSet query is that it runs much faster if the Key2 index is disabled. Without Key2 at hand, the DB engine reads the clustered index instead - and this is still faster than a million key lookups.
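To reproduce this experiment, the index can be disabled directly on the SQL side. The names below are illustrative - the actual table name carries the company prefix (and a GUID suffix for tables owned by extensions):

```sql
-- Disable Key2 so the optimizer can no longer pick it
ALTER INDEX [$Key2] ON [CRONUS$Workflow Step Instance] DISABLE;

-- ... rerun the query and compare the plans ...

-- A disabled index must be rebuilt to be re-enabled
ALTER INDEX [$Key2] ON [CRONUS$Workflow Step Instance] REBUILD;
```

Needless to say, this is a diagnostic trick for a sandbox, not something to run against a production database.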

Now let's delve deeper into the reasons why Key6 might be ignored by the query optimizer, with the seemingly worst possible execution plan being selected instead. Two aspects of the query should be considered here: data distribution and ordering. First of all, let's try to understand the data being queried.
Filters are set on the following fields:
Type: This field has only two distinct values, which isn't very effective for indexing.
Status: Five distinct values in the option list, and the overwhelming majority of rows are probably Active, so this can effectively be just one value.
Function Name: Maybe a few dozen unique function names. In a setup with a few workflows assigned to a large number of documents, this could also be just a handful of values in the Workflow Step Instance table.
Previous Workflow Step ID: A "not equal" filtering condition on this field can't efficiently use any index.
By and large, the data distribution in the table fields, together with the kinds of filters applied, gives a picture of a heavily skewed dataset. The query is likely to return either a very large number of rows, or none at all.
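This kind of skew is easy to verify with a grouping query in SSMS (table name illustrative, as above):

```sql
SELECT [Type], [Status], [Function Name], COUNT(*) AS [Rows]
FROM [CRONUS$Workflow Step Instance]
GROUP BY [Type], [Status], [Function Name]
ORDER BY [Rows] DESC;
```

A handful of value combinations covering almost the entire table is the signature of a dataset where equality filters alone cannot narrow the search down.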
In fact, if I run the same SQL query in SSMS, but add an index hint WITH (INDEX([$Key6])) to force Key6 on the query, it actually completes much faster: around 2 milliseconds if no records are returned (comparable with IsEmpty), and approximately 2 seconds for 50 000 records returned. The catch is that with 400 000 records returned, the query execution time grows to 11 seconds. Since none of the fields in the search predicates is particularly selective, SQL Server builds the execution plan for the worst-case scenario and prefers Key2, which does not require an expensive sort operation: Key2 is stored ordered by the Sequence No. field, and can be more efficient in the worst case.
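The hinted query looks roughly like this (simplified, with an illustrative table name; BC prefixes index names with $):

```sql
SELECT [Sequence No.], [Previous Workflow Step ID], [Argument]
FROM [CRONUS$Workflow Step Instance] WITH (INDEX([$Key6]))
WHERE [Type] = @type AND [Status] = @status
  AND [Function Name] = @functionName
  AND [Previous Workflow Step ID] <> 0
ORDER BY [Sequence No.]
OPTION (FAST 50);
```

The hint only overrides index selection for this one statement, which makes it a convenient way to compare the optimizer's choice against the plan the index author had in mind.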
Besides, another drawback of Key6 in this scenario is that it does not allow the database engine to satisfy the FAST(50) option, which Business Central always adds to its select queries. Once again, it is the blocking sort operator that stands in the way: the query can't return any data until all rows have been retrieved and sorted. Key2 is already stored in the required order, sorted by the Sequence No., so the plan based on Key2 does not require sorting and can quickly return the requested top 50 rows as soon as the values satisfying the filters are found, while continuing to read the rest of the index.
Is this a good use case for IsEmpty + FindSet?
Now that I have answered the first question - why FindSet is so much slower than IsEmpty despite the select query not returning any records - I can reflect on the second one: does this difference between the two execution plans suggest that testing the filters with IsEmpty before running FindSet to retrieve the data is beneficial after all?
As I showed above, the root cause of the poor performance of FindSet in this example is an inefficient index on skewed data that forces the SQL query optimizer to execute a key lookup on each index entry. And probably a better index could be built that would aid in validating search predicates without engaging the key lookup operation in a loop. This index should satisfy two requirements:
Contain columns that are likely to eliminate rows not matching the search predicates based on the index data, without involving key lookups;
Its values must be ordered by the Sequence No. column for the DB engine to be able to adhere to the FAST 50 option and return the first 50 rows as fast as possible, before completing the whole query.
An index like (Function Name, Status, Sequence No.) satisfies both criteria above and does a much better job of facilitating the query in this situation. It can be efficiently used to exclude a large portion of the table rows early during the index seek. And if no records satisfy the filters, it is likely because the function name value does not exist among the active workflows.
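In AL terms, the suggested index is simply one more key on the Workflow Step Instance table. The key name below is arbitrary, and BC will append the primary key fields, so the stored index is effectively Function Name, Status, Sequence No., ID, Workflow Code, Workflow Step ID:

```al
key(Key7; "Function Name", Status, "Sequence No.")
```

Note the column order: the two most selective equality filters come first, and Sequence No. last, so the matching rows come out of the seek already sorted the way the query's ORDER BY wants them.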
With the index suggested above, FindSet can run just as fast as IsEmpty when no records match the filters.

So a more suitable index can solve a performance bottleneck even in a bad situation like this. If any records matching the query predicates are found, this index is still a better fit for the task.

