Deep dive: Linked entities in FetchXML

From Resco's Wiki
Jump to navigation Jump to search

The first section explains traditional FetchXML link types, i.e., inner and outer joins. It introduces the concept of reverse lookups (as those representing 1:N relations), emphasizing their crucial role in fetch performance. Additionally, it explores how the system's security model affects fetch performance and addresses specifics of the synchronization sync filters.

The second section explores the applicability of In/Exists links:

  • These links can replace inner links when the linked entity’s attributes are neither referenced outside the <link-entity> nor included in the fetch results.
  • They are effective when the performance of an SQL sub-query surpasses that of a join.
  • This is particularly relevant for highly selective fetches (those examining large datasets to yield a small number of results) and their inner links based on reverse lookups.

The key takeaway is that while In/Exists links may not generally improve fetch performance (and can sometimes degrade it), there are specific contexts — particularly in complex Field Service queries — where they can offer significant performance improvements and help avoid potential server timeouts.

A typical example is msdyn_workorder<-bookableresourcebooking link in the MyBookings-centric Field Service client data model.

Linked entities in FetchXML

Links allow you to join related entities (tables) in a single query. For example, you might want to get information for an account which includes contacts related to that account. (We'll discuss this example later.)

Formally, links combine two entities - BaseEntity with LinkedEntity, whereby the latter is described by XML syntax <link-entity name="LinkedEntity" from="fieldFrom" to="fieldTo" alias="..." link-type="...">

In the example

  • LinkedEntity.fieldFrom should match BaseEntity.fieldTo.
    This typically means a lookup BaseEntity.fieldTo pointing to LinkedEntity primary key (fieldFrom), but there are more possibilities. (See below.)
  • Alias defines an (optional) new name for the linked entity.
  • Link-type specifies various ways of joins - traditionally inner (default) and outer.

Besides this, the linked entity may contain attributes and conditions just like the base entity.

Link examples

This fetch outputs a list of contacts (represented by their full name):

<fetch>
	<entity name="contact">
		<attribute name="fullname"/>
	</entity>
</fetch>

If we want to add information about the parent account, we can make use of the lookup contact.parentcustomerid -> account.

For example, the following fetch returns two columns from two different entities: (contact.fullname, account.name).

<fetch>
	<entity name="contact">
		<attribute name="fullname"/>
		<link-entity name='account' from='accountid' to='parentcustomerid' link-type='inner'>
			<attribute name="name"/>
		</link-entity>
	</entity>
</fetch>

The link is inner, which means that standalone contacts (those without a parent account) are not returned. If we also want those contacts, we can use the outer link:

<fetch>
	<entity name="contact">
		<attribute name="fullname"/>
		<link-entity name='account' from='accountid' to='parentcustomerid' link-type='outer'>
			<attribute name="name"/>
		</link-entity>
	</entity>
</fetch>

This time, we get:

  • previous results (contact.fullname, account.name)
  • new results (contact.fullname, null) representing standalone (disconnected, unassociated) contacts.

The above cases are lookup-based, i.e., they use existing relation contact->account. (Relational theory uses the term N:1 relation contact:account as 1 account has N contacts.)

However, we may pose a reverse problem: Get accounts + their contacts. We have to use the same relation as above, but in reverse order: account:1 <- contact:N. We refer to this situation as a reverse lookup. We'll subsequently discuss the two possible fetches - inner and outer.

<fetch>
	<entity name="account">
		<attribute name="name"/>
		<link-entity name='contact' from='parentcustomerid' to='accountid' link-type='inner'>
			<attribute name="fullname"/>
		</link-entity>
	</entity>
</fetch>

The results look like this:

	(account1.name, contact11.fullname)
	(account1.name, contact12.fullname)
	(account1.name, contact13.fullname)
	...
	(account2.name, contact21.fullname)
	(account2.name, contact22.fullname)
	...

We see that one account is returned N times - once per each account's contact. Note that accounts not having any contact at all are not returned. If we want those as well, we have to use the outer link:

<fetch>
	<entity name="account">
		<attribute name="name"/>
		<link-entity name='contact' from='parentcustomerid' to='accountid' link-type='outer'>
			<attribute name="name"/>
		</link-entity>
	</entity>
</fetch>

Then we also get "orphaned" accounts (accounts with no contacts):

	(account3.name, null)

Let's show another example demonstrating a bit unorthodox use of outer link ("accounts without contacts"):

<fetch>
   <entity name='account'>
	  <attribute name='name' />
	  <link-entity name='contact' from='parentcustomerid' to='accountid' link-type='outer' alias='C'/>
	  <filter type='and'>
		 <condition entityname='C' attribute='parentcustomerid' operator='null'/>
	  </filter>
   </entity>
</fetch>

Links over non-lookup fields

Tables are typically linked over lookup fields (i.e., using existing relations), but this is just a convention. Links over non-lookup fields (fields not taking part in any defined relationship) are possible as well, provided some "natural" conditions are met:

  • Joined fields should have the same type. (Otherwise, type conversion may have a performance impact.)
  • These column types cannot be used in from/to attributes: File, Image, MultiSelect Field, PartyList. (Reason: specific column implementation.)
  • Some columns have poor performance: Long texts, calculated fields, and logical (external) columns.

Here is a bit artificial example:

<fetch>
	<entity name='account'>
		<attribute name='name' />
		<attribute name='websiteurl' />
		<link-entity name='contact' from='fullname' to='name' link-type='inner'>
			<attribute name='jobtitle' />
		</link-entity>
	</entity>
</fetch>

One remark about performance: Unlike lookups (which are indexed on Dataverse), evaluating these criteria requires a table scan and might cause performance issues—at least on large tables.

Fetch results

In the case of simple fetches (not containing links), the results are base entity rows matching fetch conditions. Each row is represented by the values of requested attributes, e.g.:

<fetch>
	<entity name='account'>
		<attribute name='name' />
		<attribute name='websiteurl' />
		<filter type="and">
			<condition attribute="statuscode" operator="ne" value="2"/>
		</filter>
	</entity>
</fetch>

The ID column (primary key) of the base entity is always returned - even if not requested.

If the fetch contains links, the fetch returns a cartesian product of

  • Base entity rows (those matching the fetch)
  • Linked entity rows (those matching base entity row)

Again, the linked entity columns (attributes) returned are those specified in the fetch. Notice that the linked entity's ID is not added automatically.

Reverse links

If the linked entity is connected over base entity lookup (i.e., LinkEntity.From points to the Primary Key), we have a 1:1 link, and the linked entity can be considered an extension of the base entity. Results always differ in this case (because of included base entity ID).

If, however, LinkEntity.From points to a field that is not a Primary Key, we have a 1:N link, and the results may have duplicates—depending on which attributes of the linked entity are requested. (Although they ALWAYS differ on the server due to the ID of the linked row.)

A typical example is a parent entity linking child records, such as

<fetch>
	<attribute name="name/>"
		<entity name="queue">
		<link-entity name="queueitem" from="queueid" to="queueid"/>
	</entity>
</fetch>

The above fetch would return each queue record multiple times, once for each of its associated queueitem child records. To prevent returning equal results, you may use the fetch with the DISTINCT keyword: <fetch DISTINCT='true'>.... Fetching distinct records has the following impacts:

  • Decreased web transfer
  • Decreased client work
  • Increased (sometimes substantially) server work

Sync Filter specifics

Sync Filter represents the conditions that need to be matched for given server records to be downloaded to the client. It is expressed as a fetch. If the sync filter is undefined, then the default filter is used: "Get all records".

Sync Filter does not have attributes. Resco Mobile CRM automatically supplies all attributes of the base entity as defined in the client customization. The ID attribute assures that two base entity records always differ.

Sync Filter may have linked entities. Their attributes are never returned. Hence, links serve only for match testing (filtering).

If the sync filter contains a reverse link, the SyncEngine automatically generates a DISTINCT fetch to avoid potential duplicates. This can negatively affect performance. (Reverse link is a link based on 1:N relation, i.e., on a reverse lookup.)

Performance considerations

The execution of a simple query (not containing links) depends on whether a suitable index is available.

  • In the worst case, the database has to do a table scan (testing filter match row by row). The number of operations is directly proportional to the number of records.
  • In a case where a good index is available, the number of operations for N records is proportional to log2(N).

Example: Imagine the database table PhoneDirectory with one million entries and the index idx_PhoneDirectory_Name.

CREATE TABLE PhoneDirectory (
	Address TEXT,
	Name TEXT,
	Phone TEXT,
)

CREATE INDEX idx_PhoneDirectory_Name ON PhoneDirectory(Name);

SELECT * WHERE (Name='John Smith')			// 20 operations (ops)
SELECT * WHERE (Phone='0905 918065')		// 500K ops => 25000 x slower

According to ChatGPT, "Lookup columns in Dataverse are automatically indexed." Resco Mobile CRM respects this rule in the device's local database.

  • Lookups are fast, they require log2(N) ops.
  • Joins over lookups are "fast" as well; they require the same number of operations.

Imagine a simple query: Give me table rows such that (something).

  • Example: Get active accounts (As a fullsync query, this is a table scan; it returns nearly all visited records. The query is fast but returns many results.)
  • Example: Get accounts whose versionnumber is larger than x (As an IncSync query, it is highly selective. If a versionnumber index exists, the query will be fast - index seek.)

The performance of this query (Q1) is between log2(N) and N.

  • log2(N) if there exists an index suitable for the evaluation of the condition "(something)", and
  • N ops if no such index is found.

We are simplifying here. What actually matters is the number of visited (tested) rows per returned result. The estimates above apply to the worst case of highly selective queries.

The next example includes a linked query1:

Give me table1 rows such that (something1) (Q1)
and table1_row.lookup points to a table2 row such that (something2)
For example: Get salesorderdetails such that manualdiscountamount>0 AND parent_salesorder.totalamount > 1000

The performance estimate for the first part of the query (Q1) is the same as before: between log2(N1) and N ops. However, with each record satisfying (Q1), we have to do additional steps:

  • Find row2 from the Table2 - this requires log2(N2) operations.
  • Execute (something2) test.

Given these numbers, we could estimate overall performance to be between log2(N1)*log2(N2) and N1*log2(N2).

Note: The sequence of steps above is called a query plan. Many query plans might lead to the correct results. The SQL server query planner is responsible for selecting the most efficient one.

Cost of reverse lookups

Next, imagine a linked query2 that uses a reverse lookup.

Give me Table1 rows such that (something1) (Q1)
such that there is a table2 row fulfilling (something2)
which has a lookup table2_row.lookup -> table1_row
For example: Get salesorders with totalamount > 1000 which have child_salesorderdetail with manualdiscountamount>0

It is difficult to estimate query2 performance. This is caused by a tree-like space of possible solutions:

table1_row
	<--- table2_row1
	<--- table2_row2
	<--- table2_row3
	...

Let's only say that the performance of these queries is lower (sometimes much lower). The details are too difficult to present here.

Here is a real-life example coming from field service: Get "MyAccounts".

Get accounts
	used as billing account by some workOrder
		which has a booking
			assigned to me

Here is the link sequence (notice the two reverse lookups): account <- msdyn_workorder <- bookableresourcebooking -> bookableresource (userid == MyUserid)

Traversed records:

account
	<--- workOrder1
		<--- booking1 -> bookableresource1: Test(userid)
		<--- booking2 -> bookableresource2: Test(userid)
		<--- booking3 -> bookableresource3: Test(userid)
		...
	<--- workOrder2
		<--- booking1 -> bookableresource1: Test(userid)
		...

Hidden costs of the security model

The security model further complicates things. If you send the query get records that fulfill something, the server actually executes a more complex query: Get records that fulfill something AND (you_have_read_access_to_the_record OR the_record_was_shared_to_you).

read_access_test

  • Problem: Do you or one of your teams have a role with read access to a given record?
  • In theory, this means traversing a lot of system tables (team, teammembership, role, systemuser, principal, roleprivileges, systemuserroles).
  • In practice, Dataverse maintains a security cache allowing for faster checks.
  • At the time of this writing, this cache was represented by a single table called systemuserbusinessunitentitymap.
  • The problem is if this table is huge. Some customers report up to 50M records. This consumes 3G+ of server RAM, ~10G with indexes. The cost of search (log2(50M) = 27) is a lesser risk.
  • In some situations the access test may be simplified (for example, owningbusinessunitid==?) or even disappear completely (for entities with general read access).

sharing_test

  • Testing whether a given record resides in the POA table (the table containing all shares) with at least read permissions for the user issuing the query.
  • Possible problems: Large POA table (10M+ records) or many entries for a given entity.
  • As before, sharing_test brings some performance penalty, but the highest risk might be increased memory pressure due to this table's caching. For entities with general read access, the sharing test is skipped altogether.

Security tests are executed on all tables used by the fetch. This example demonstrates the reason: Imagine I have access to the Employee table but can't read the Salary table. If links were not tested for security, I could issue queries such as "Get employees with salary = 2000."

Which security model is good? A model that simplifies your fetches as much as possible (while maintaining the level of security you need). Extreme examples:

  • You define granular business units and let mobile users download all the data they see. (Security replaces sync filters.)
  • Entities with general read access: Security tests are skipped completely. (Sync filters are responsible for security, too.)

What are some of the performance risks related to security?

  • Too complex role-based security
  • Heavy use of sharing
  • Number of links in the query

How does the risk manifest?

  • Security typically adds some (usually minor) penalty to the query execution.
  • In the worst case, security checks require a lot of memory, which is missing for query execution. Increased disk i/o may then lead to server timeouts.

How to mitigate the risks?

  • POA maintenance: Don't let it get too large.
  • Simplify security: Give overall read access to child entities (resco_question!) or insensitive entities.
  • Simplify fetches: Reduce query links.

Major performance risks

As a rule of thumb, a bad query is one that uses the hard drive a lot. Here are some of the reasons:

  • Size of data being traversed: Total size of all tables involved in the query, including security cache and POA.
  • Query complexity: conditions (especially if they force table scan), links (especially reverse), sorting
  • Highly selective queries that require table scan (no suitable index exists, many rows visited to produce one result)
  • Complex security
  • Heavy sharing

New link types (In, Exists)

Historically, FetchXML supported only inner/outer links. We discussed them above, hence just a brief summary:

  • They represent SQL joins.
  • The result is a cartesian product of joined rows.
  • Fetch specifies which attributes of the joined rows are to be returned. The server always returns the ID of the base record, even if not requested.
  • Attributes of the joined entity may or may not be requested.
  • In the latter case, joins are used for filtering only, and fetch might return duplicates (for joins using reverse lookups).
  • The only difference between inner/outer links: Outer joins also return base entity rows without any joined row. (Joined row attributes are represented as null.)

Performance point of view:

  • Joins over standard lookups do not represent any risk. Their cost is log2(N), and they enable linear traversing of data.
  • Joins over reverse lookup represent 1:N relation and introduce branching, i.e., data needs to be traversed as a tree. Such joins are costly and may produce duplicates.

In the next part, we'll be talking about In/Exists link types. Microsoft introduced even more new link types, but:

  • They don’t appear to be very useful.
  • Some of them (such as the matchfirstrowusingcrossapply link type) are unsupported by SQLite (the database engine used in the Resco Mobile CRM).

EXISTS/IN link types

These two link types were introduced to Resco Mobile CRM in the release 18.0.

  • The links are variants of the INNER link type using different conditions (Exists/In).
  • They are implemented via SQL subqueries.
  • The query does not return duplicates. Therefore, sync does not need to add DISTINCT.
  • The attributes of the linked entity cannot be used outside of <link-entity> node, i.e., in filters or sorting. They cannot be returned by the parent query.

In the existing projects, which sync filter links could be potentially replaced by In/Exists?

  • Inner links whose attributes (of the linked entity) are not used outside of those links.
  • Inner links over lookups: While In/Exists could formally work (if the first point is fulfilled), we would gain nothing. (Probably a slower query.) Only inner links representing 1:N conditions make sense (reverse lookups, LinkedEntity.From is not Primary Key).

Performance gains:

  • Exists/In might return less data than Inner/Outer (because they skip duplicates)
  • They might do less work in the server database because the loop testing multiple matching rows is interrupted at the first match.

The meaning of In/Exists link types is defined by their SQL translation.

EXISTS Returns TRUE if the subquery returns any results. An example follows:

// Fetch using Exists link type
// Get contacts which act as a primary contact for some account
<fetch>
   <entity name='contact'>
	  <attribute name='fullname' />
	  <link-entity name='account' from='primarycontactid' to='contactid' link-type='exists'>
		 <filter type='and'>
			<condition attribute='statecode' operator='eq' value='1' />
		 </filter>
	  </link-entity>
   </entity>
</fetch>

// SQL translation (2nd row represents the subquery)
SELECT fullname FROM contact AS C WHERE EXISTS (
	SELECT 1 FROM account AS A WHERE A.statecode=1 AND C.contactid=A.primarycontactid
	)

IN Returns TRUE if the value of the tested expression ("to" value) is equal to any value returned by the subquery. An example follows:

// Fetch using In link type
// Same meaning as above fetch, although a different SQL translation is produced.
<fetch>
   <entity name='contact'>
	  <attribute name='fullname' />
	  <link-entity name='account' from='primarycontactid' to='contactid' link-type='in'>
		 <filter type='and'>
			<condition attribute='statecode' operator='eq' value='1' />
		 </filter>
	  </link-entity>
   </entity>
</fetch>

// SQL translation
SELECT fullname FROM contact AS C WHERE C.contactid IN (
	SELECT A.primarycontactid FROM Account as A WHERE A.statecode=1
)

Although we have not discussed it so far, inner/outer links correspond to SQL joins. (They are translated as SQL joins, too.) A natural question arises - which one of the two techniques is better?

SQL dilemma: Join vs subquery

Subqueries (also called inner queries or nested queries) involve a SELECT statement embedded in another SELECT statement, typically in its WHERE clause.

Joins are used to combine rows from two or more tables based on a related column.

According to SQL Server documentation:

"Many T-SQL statements with subqueries can be alternatively formulated as joins. There's usually no performance difference between a statement using a subquery and a semantically equivalent version that doesn't. However, in some cases where existence must be checked, a join yields better performance."

In/Exists usage - Summary

A link can be replaced by In/Exists if:

  • It is an inner link.
  • LinkEntity has no attributes and no sort order.
  • Linked entity columns are not used outside of LinkEntity. (One way to force this is to use a null alias.)

Effects of In/Exists links:

  • Fetch->SQL translation uses a subquery instead of a join.
  • Sync fetches don't use DISTINCT keyword. (There is no need to do so.)

Recommendation:

  • Use In/Exists for 1:N links only, i.e., for links over a reverse lookup. Intuitively, this might be the only case where it's less work for the SQL server. (Because the traversal of multiple branches is stopped when the first result is found.)
  • In general, there are not too many sync filter candidates for using IN links, but all of them belong to fetches that might have performance problems.
  • Note, however, that the final decision on which of the two competing technologies (join or subquery) is selected remains at the SQL server, more precisely at its query planner.

Real-life example

This was a real-life example that proved the usefulness of IN links. The customer showed that the following fetch (used in IncSync with Incremental Sync Filter) results in server timeout if the booking link is INNER, but executes very fast for link-type IN.

// Slightly simplified FetchXml syntax
<fetch distinct="true">
	<entity name="msdyn_workorderincident">		// 5.7M recs, 17 incidents/order
		<condition attribute="statuscode" operator="in" values="Active, Complete" />
		<condition attribute="versionnumber" operator="ge" value="331647639" />		// since the last sync

		<link-entity name="msdyn_workorder" to="msdyn_workorder" from="msdyn_workorderid">		// 330K recs, 1:1, 457 WOs/user
			<condition attribute="statecode" operator="eq" value="0" />
			
			<link-entity name="bookableresourcebooking" from="msdyn_workorder"  to="msdyn_workorderid" link-type="IN">		// 440K recs, 1:N
				<condition attribute="starttime" operator="on-or-after" value="@@TODAY-30" />
				
				<link-entity name="bookableresource" from="bookableresourceid" to="resource">	// 722 recs
					<condition attribute="userid" operator="eq" value="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" />
				</link-entity>
			</link-entity>
		</link-entity>
	</entity>
</fetch>

If you don't feel like parsing the example, here's a human-readable description. We only get records that have changed since the last sync (version number >= 331647639) and that match the sync filter. And the sync filter is, "get (active/complete) incidents related to active work orders where I have recent booking (= MyBookings)".

The fetch returns 7430 results. Here is the sequence of links involved together with entity cardinalities:

msdyn_workorderincident(5.7M recs) -> msdyn_workorder(330K) <- bookableresourcebooking(440K) -> bookableresource(722)

Reverse lookup (2nd link) introduces branching into data traversal. (Which requires DISTINCT fetches if inner links are used.)

The point to note is that the MyBookings subquery reduces the number of records from 440K to a few hundreds. (One technician has at most 100s of relevant bookings.) So introducing IN link-type (which is based on MyBookings subquery) changes the query this way:

msdyn_workorderincident(5.7M) -> msdyn_workorder(330K) WHERE msdyn_workorderid is referred by MyBookings(O(100))

This greatly reduces the amount of data traversed during query execution and increases thus query performance.

Generalization for field service

The fetch above comes from a field service app built around a concept that seems to be particularly suited for applying IN links. Let's start by explaining this (MyBookings-centric) concept. The main idea is that the client (mobile user) needs his jobs (bookableresourcebooking's) + their related records (work order, accounts, ...).

Here is a typical example:

MyBookings: My active bookings (bookableresourcebooking) scheduled within some period

	- MyWorkOrders: workOrders (msdyn_workorder) related to MyBookings
		-- MyServiceTasks: serviceTasks (msdyn_workorderservicetask) belonging to MyWorkOrders
		-- MyIncidents: relevant incidents (msdyn_workorderincident) related to MyWorkOrders
			--- MyAssets: assets (msdyn_customerasset) where MyIncidents were reported

		-- MyAccounts: accounts related to MyWorkOrders
		...

Different customizations may use different entities, but the principle stays the same. In a typical business, all tables involved will be large (or even huge), but the number of bookings per user is usually in the range of hundreds, and there's a similar number of related work orders, accounts, incidents, etc. The database schema projection is small and dynamic: bookings change every day.

Sync fetches are very complex. ("Get assets related to incidents related to work orders related to bookings which are assigned to me during next 2 weeks" - 4+ links.) But, all of them use the MyBookings link that has a small result set and uses a reverse link. So the bookings link really looks like a situation when link-type IN should excel.

(Work order link is another candidate for IN link-type, but in this case, we would not get rid of the reverse lookup.)

These considerations might be valid for many Field Service fetches - all those based on MyBookings. (MyVisits, MyAppointments...)

Again, the final verdict is on whether the SQL server query optimizer discovers the optimal query plan. (Our usage of the IN link type is nothing more than a recommendation.)

SQLite tests

(This is for techies who want to understand the impact on the database level.)

Here is a test comparing the performance of join vs subselect. You can subsequently see

  • SQL command
  • SQLite query plan
  • performance
SELECT fullname FROM contact inner JOIN account AS A ON (contact.pe_accountid = A.accountid AND (A.statuscode = ?) )
0 | 0 | 1 | SCAN TABLE account AS A
0 | 1 | 0 | SEARCH TABLE contact USING INDEX idx_contact_pe_accountid (pe_accountid=?)
INNER LINK: 451947 recs in 21945ms (2nd trial 18264ms)
SELECT fullname FROM contact WHERE pe_accountid IN (SELECT A.accountid FROM account AS A WHERE A.statuscode=1)
0 | 0 | 0 | SEARCH TABLE contact USING INDEX idx_contact_pe_accountid (pe_accountid=?)
0 | 0 | 0 | EXECUTE LIST SUBQUERY 1
1 | 0 | 0 | SCAN TABLE account AS A
IN LINK: 451947 recs in 19431ms (2nd trial 19900ms)

As you see, in this case, both queries showed about the same performance, although they used different query plans. We had the same experience with several other queries.

Another interesting observation: As we increased the database cache size from 80M to 1.6G, the performance changed dramatically:

  • INNER LINK: 451947 recs in 2708ms
  • IN LINK: 451947 recs in 3920ms

For yet another query, INNER performance was

a) the same as IN performance for small DB cache,
b) much worse for medium database cache,
c) again, the same for large database cache. The performance in this case was much higher than in case a), of course.

This suggests that

  • Disk i/o plays a dominant role in the performance.
  • Queries containing INNER LINK need more memory and are thus more sensitive towards memory stress.

We just saw that despite being equivalent, INNER/IN queries lead to different query plans, which might perform better or worse depending on the database configuration.