Saturday, March 14, 2020

Einstein Analytics: Dataflow Performance Best Practice

Performance is critical for an Einstein Analytics dataflow: an optimized dataflow may take only 10 minutes to run, while the same dataflow with a poor design may take 1 hour (including sync setup). Without well-architected dataflows, it will be hard to maintain and sustain Einstein Analytics as a whole as the company evolves.

Here are a few items based on my personal findings and experience. If you have additional input or a different perspective, feel free to reach out to me.


1. Combine all computeExpression nodes whenever possible

image-1



image-2

The calcURI node in image-1 contains one computed field returning Numeric, and the calcURI2 node likewise contains one other computed field returning Numeric; together, calcURI + calcURI2 took 3:41 sec.

In image-2, we combined both computed fields into the calcURI node, and it took only 2:00 sec.
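
As a minimal sketch, the combined node might look like this in the dataflow JSON (the source, node, and field names here are illustrative, not the actual ones from the screenshots):

```json
{
  "calcURI": {
    "action": "computeExpression",
    "parameters": {
      "source": "augment_Opportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Amount_K",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount / 1000"
        },
        {
          "name": "Commission",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount * 0.1"
        }
      ]
    }
  }
}
```

Each computeExpression node is a pass over the stream, so packing both computed fields into one node means the rows are processed once instead of twice, which matches the saving above.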


2. Compute as early as possible, and augment as late as possible

The rationale behind this is that a compute node processes fewer fields when it runs before an augment (since an augment always adds fields to the stream). The exception is when the computation needs a field produced by the augment node.
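
To illustrate, here is a hedged sketch where the computed field is derived while the stream still carries only the raw Opportunity fields, and the Account fields are attached afterwards (all node and field names are made up; Digest_Opportunity and Digest_Account are assumed to be defined elsewhere in the same dataflow):

```json
{
  "Compute_Amount_K": {
    "action": "computeExpression",
    "parameters": {
      "source": "Digest_Opportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Amount_K",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount / 1000"
        }
      ]
    }
  },
  "Augment_Account": {
    "action": "augment",
    "parameters": {
      "left": "Compute_Amount_K",
      "left_key": [ "AccountId" ],
      "right": "Digest_Account",
      "right_key": [ "Id" ],
      "right_select": [ "Name", "Industry" ],
      "relationship": "Account"
    }
  }
}
```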


3. Remove all unnecessary fields 

Remove unnecessary fields with a slice node, or simply do not include them in right_select when augmenting from the right source. The more fields each node handles, the more power and time the system needs to process the stream, so slice out fields that are not needed in any dashboard or lens.

A register node usually takes much longer when it writes lots of fields, so always clean up the stream before registering it to a dataset.

image-3

Notice that calcURI3 in image-1 and image-2 took around 2:08 sec. In image-3, we added a slice node before calcURI3 to remove unnecessary fields; this reduces the number of fields processed by calcURI3, so it took only 1:55 sec.
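
A minimal sliceDataset sketch with made-up source and field names; "drop" mode removes the listed fields, while "select" mode keeps only the listed ones:

```json
{
  "Slice_Unused": {
    "action": "sliceDataset",
    "parameters": {
      "source": "Augment_Account",
      "mode": "drop",
      "fields": [
        { "name": "Description" },
        { "name": "Account.BillingStreet" }
      ]
    }
  }
}
```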


4. Combine all sfdcDigest nodes of the same object into one node, if sync is not enabled

For some reason, your org may not have sync enabled. This does not mean you "must" enable it straight away, and please DO NOT enable it without a complete analysis, as it may cause data filtering issues.

You should combine all sfdcDigest nodes of the same object into one node. Imagine you have 10 million Opportunity rows and every sfdcDigest node takes 10 minutes (as an example): if the dataflow designer adds 3 sfdcDigest nodes on Opportunity, the data retrieval alone will need 30 minutes.
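
So instead of three Opportunity digests, keep a single sfdcDigest whose field list is the union of what the separate nodes extracted, and point every downstream node at it. A sketch, with an assumed field list:

```json
{
  "Digest_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "AccountId" },
        { "name": "StageName" },
        { "name": "Amount" },
        { "name": "CloseDate" }
      ]
    }
  }
}
```

With this, Opportunity is extracted once (about 10 minutes in the example above) instead of three times.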


5. Do not perform a null check in a filter node
A filter node with a null check takes a lot of time when the dataflow runs. So instead of having something like 'Check.Id' is null in a SAQL filter, create a computeExpression node with a Yes/No computed field, then filter with CheckIdIsNull:EQ:Yes.
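
A sketch of the pattern (node names are made up; 'Check.Id' is single-quoted because SAQL requires quotes around field names containing a dot):

```json
{
  "Compute_CheckIdIsNull": {
    "action": "computeExpression",
    "parameters": {
      "source": "Augment_Check",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "CheckIdIsNull",
          "type": "Text",
          "saqlExpression": "case when 'Check.Id' is null then \"Yes\" else \"No\" end"
        }
      ]
    }
  },
  "Filter_CheckIdIsNull": {
    "action": "filter",
    "parameters": {
      "source": "Compute_CheckIdIsNull",
      "filter": "CheckIdIsNull:EQ:Yes"
    }
  }
}
```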


6. Remove unused Register node
Many times, we add register nodes across a dataflow for testing and debugging, but once deployed to Production, make sure those testing register nodes are removed. Register nodes take up quite a bit of the dataflow run time, depending on the number of fields and rows.


7. Remove all nodes that are not related to a Register node
These nodes consume processing time without contributing to any registered dataset; they are simply useless.


8. Use Source Field in computeRelative node, not SAQL, whenever possible
Check out this blog.
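
For reference, a minimal computeRelative sketch that uses a source field with an offset instead of a SAQL expression (object and field names are assumptions for illustration):

```json
{
  "Compute_Prev_Amount": {
    "action": "computeRelative",
    "parameters": {
      "source": "Digest_Opportunity",
      "partitionBy": [ "AccountId" ],
      "orderBy": [
        { "name": "CloseDate", "direction": "asc" }
      ],
      "computedFields": [
        {
          "name": "PreviousAmount",
          "expression": {
            "sourceField": "Amount",
            "offset": "previous()",
            "default": "0"
          }
        }
      ]
    }
  }
}
```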


