APEX, Development, SOQL

[ART-005] SOQL Advanced


In the previous article, we learned some basics for handling SOQL correctly within Apex code. Now we will focus on an important topic that is often unknown or misunderstood. Here’s an overview of advanced queries for handling large data volumes.


Query Optimizer

The Salesforce database doesn’t require any special tuning as long as the number of records stays below 1 million. Things start to get tricky after the first million, and the more records you have, the more you have to think about the design you put in place to access the data. The selectivity thresholds described below must be mastered to overcome query issues.

Selectivity

Every query that you write against the Salesforce database has to be selective, meaning that it must be optimized to respect the selectivity thresholds; otherwise, you could run into query exceptions or timeouts. But don’t worry, you can still query a full table (100 million records, for example) with the Bulk API. The following query is considered selective and will return all records from the Account table.

Select Id from Account;

On the other hand, when using a filter criterion, the threshold drops to 33% of the records with a standard index and 10% with a custom index, which means that on a table of 1 million records it’s not possible to retrieve more than 333,333 records in the first case and 100,000 records in the second.

Select Id from Account where CreatedDate < today;
Select Id from Account where CustomCreatedDate__c < today;

Even if those two queries look the same and should return the same number of records, there are some subtleties:

  • CreatedDate and CustomCreatedDate__c are filled with the same values
  • CreatedDate is a standard field that is indexed by default
  • CustomCreatedDate__c is a custom field that is not indexed

The second query will hit the selectivity limit much sooner than the first one, and you must ask Salesforce Support to index the custom field for you. The indexing process is not that easy: the field has to be deployed in production, and the volume of data in the table must be large enough that the query consistently times out. You cannot obtain an index if you do not meet those criteria, so plan the deployment of new queries in production carefully to leave time to ask for indexing, and monitor your data growth against queries that are already deployed and need attention. Unfortunately, it’s often too late when you run into this kind of issue; Salesforce batches that suddenly stop running are a common problem that everyone has encountered at least once.

Sometimes we think that adding more filter criteria will help retrieve the precious records, but it’s not that easy:

Select Id from Account where CreatedDate < today and CustomIndexedCriteria__c = 3;

Even if we suppose the query will return only one record and therefore work, that’s not always true, depending on the number of records in the Account table. Salesforce applies selectivity to the first criterion, then applies selectivity to the second criterion, and only then intersects the two result sets. If either criterion alone exceeds its threshold, the query will fail or the result will not be found. If you want your query to succeed under extreme conditions, you need to ask Salesforce Support for a composite index. The more the number of records grows, the more difficult it becomes to target one record per criterion.

“OR” and “LIKE” should be avoided in your queries, as they are not eligible for selectivity.
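When an OR filter cannot be avoided functionally, a common workaround is to split it into several selective queries and merge the results in Apex. A minimal sketch, assuming Industry and Rating are indexed, purely illustrative filter fields:

```apex
// Instead of a non-selective OR filter such as:
//   SELECT Id FROM Account WHERE Industry = 'Banking' OR Rating = 'Hot'
// run two selective queries and merge the results in Apex.
Set<Id> accountIds = new Set<Id>();
for (Account acc : [SELECT Id FROM Account WHERE Industry = 'Banking']) {
    accountIds.add(acc.Id);
}
for (Account acc : [SELECT Id FROM Account WHERE Rating = 'Hot']) {
    accountIds.add(acc.Id); // the Set deduplicates records matching both filters
}
```

Each individual query can then rely on its own index, at the cost of one extra query per OR branch.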

Well, don’t be upset, there are still some patterns to achieve, more or less, what you are looking for:

  • Select only the Id field to improve overall performance (Ex: QueryLocator inside batch)
  • Filter on standard indexed field
  • Filter on custom indexed field
  • Filter on Ids (lookup, master-details)
  • Filter on ExternalId field
  • Ask Salesforce Support for Custom or Composite index
  • Avoid filtering on NULL values
  • Query the full table without criteria
  • Use indexed formula
  • Implement Divisions
  • Ask Salesforce Support for optimization (tiny table…)
  • Modulo pattern
  • PK Chunking pattern
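As an illustration of the first items in this list, here is a hypothetical batch whose QueryLocator selects only the Id field and filters on a standard indexed field (the class name and the retention period are assumptions):

```apex
// Sketch of a batch combining two patterns from the list above:
// select only the Id field and filter on a standard indexed field.
global class AccountCleanupBatch implements Database.Batchable<SObject> {
    global Database.QueryLocator start(Database.BatchableContext bc) {
        // CreatedDate is indexed by default, and returning only Id
        // keeps the QueryLocator as light as possible.
        return Database.getQueryLocator(
            'SELECT Id FROM Account WHERE CreatedDate < LAST_N_DAYS:365');
    }
    global void execute(Database.BatchableContext bc, List<Account> scope) {
        // Re-query by Id here if more fields are needed;
        // filtering on Ids is always selective.
    }
    global void finish(Database.BatchableContext bc) {}
}
```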

Overview of the Modulo Pattern

The idea is to split the data into many partitions, each addressable through an indexed field, so that every query stays under the selectivity threshold. A prerequisite is to know exactly the number of records and to estimate the modulo needed to keep each partition below, for example, 100,000 records. Every partition should hold less than 10% of the table to avoid query timeouts. A batch can then be scheduled for each partition to deal with a small amount of data.

The drawback of this approach is that the database has to be updated once for the initial load, and a maintenance run has to be performed every time a partition goes over the threshold. A trigger implementation is also needed to assign a modulo to new records.

Not the best option, but it can be a lifesaver in some situations.
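A minimal sketch of the pattern, assuming a hypothetical custom field Modulo__c (for which a custom index would have to be requested from Salesforce Support):

```apex
// Hypothetical trigger assigning a partition number to new records.
trigger AccountModulo on Account (before insert) {
    // partitionCount is sized so that each partition stays under
    // the selectivity threshold (e.g. 1,000,000 records / 10 = 100,000).
    Integer partitionCount = 10;
    for (Account acc : Trigger.new) {
        acc.Modulo__c = Math.mod(Math.abs(Crypto.getRandomInteger()), partitionCount);
    }
}
```

Each scheduled batch then targets a single partition, for example SELECT Id FROM Account WHERE Modulo__c = 3, which stays selective as long as the partitions are correctly sized.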

Overview of PK Chunking

How can you retrieve more than 10 million records while consuming fewer API resources? Salesforce has introduced an option in the Bulk API called “Sforce-Enable-PKChunking”. This option splits a query into chunks of 100,000 records by default (with a maximum of 250,000). Suppose we want to run this query:

Select Name from Account;

Salesforce will automatically add WHERE clauses based on Id ranges:

Batch1  : Select Name from Account where Id >= Id1 and Id < Id250001
Batch2  : Select Name from Account where Id >= Id250001 and Id < Id500001
…
Batch40 : Select Name from Account where Id >= Id9750001 and Id < Id10000001

40 batches will be consumed, and each batch will handle up to 250,000 records. The following setting has to be put in the request header:

Sforce-Enable-PKChunking: chunkSize=250000; startRow=00130000000xEftMGH

startRow lets you define the starting Id; otherwise Salesforce will start with the first Id of the table.
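In practice this header is sent by an external client (Data Loader, an ETL tool…) when creating the Bulk API job. Purely as a sketch of where the header goes, here is a hypothetical Apex callout creating such a job; the API version and XML job definition are assumptions:

```apex
// Hypothetical callout creating a Bulk API (v1) query job with
// PK chunking enabled. Normally issued by an external client.
HttpRequest req = new HttpRequest();
req.setEndpoint(URL.getOrgDomainUrl().toExternalForm()
    + '/services/async/58.0/job');
req.setMethod('POST');
req.setHeader('X-SFDC-Session', UserInfo.getSessionId());
req.setHeader('Content-Type', 'application/xml');
// The header from the article: chunk size and optional starting Id.
req.setHeader('Sforce-Enable-PKChunking', 'chunkSize=250000');
req.setBody('<?xml version="1.0" encoding="UTF-8"?>'
    + '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
    + '<operation>query</operation><object>Account</object>'
    + '<contentType>CSV</contentType></jobInfo>');
HttpResponse res = new Http().send(req);
```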

Conclusion

Most of the time, this kind of issue only happens in batches or bulk operations, where a huge amount of data is processed. Other processes generally deal with a smaller number of records filtered by Id (e.g. triggers, user interactions…). So keep those constraints in mind when designing a new batch or preparing bulk operations, and don’t forget to think about selectivity for every query you write.

Hope you enjoy reading this article, see you soon for the next one ...
