F.A.Q.

ADD - Deleting Data from Workspaces

  • 19 January 2021
  • 0 replies
  • 241 views

Userlevel 1

Introduction

The x__deleted field is an optional field for any and all source data tables that feed Gooddata’s Automated Data Distribution (ADD). However, even if the field can be used on any source table, its utility depends on whether the correlated datasets are being refreshed as a full batch load or incrementally.

 

Incremental vs. Full Load

Full Load: When ADD is set to refresh a dataset by full load, the dataset will be completely rewritten. If the associated source table has the field x__deleted, then only records where x__deleted = FALSE will be distributed to the platform.

 

Recommendation: If the x__deleted field is excluded, ADD full load of a dataset will still operate normally as if all records are defaulted to x__deleted = FALSE. So, the x__deleted field is often unnecessary for full load datasets, and it’s best to exclude this field from such datasets.

 

Incremental Load: For datasets that are loaded incrementally, the x__deleted field becomes the primary way to delete data records from the platform. During incremental loads, the data is processed as follows:

  1. Data after a certain point in time (based on LSLTS, Last Successfully Loaded Time Stamp) where x__deleted = FALSE is loaded to the datasets using the UPSERT method.

  2. Data after a certain point in time (based on LSLTS) where x__deleted = TRUE is matched to existing data on the platform, and deletes said record(s) from the datasets.

 

Note: See here for a visual representation of using the x__deleted field
 

Delete Side Tables

Delete tables are an alternative to x__deleted for deleting incremental load data. Datasets using the x__deleted field require a defined connection point, fact table grain, or set of attributes to match on to delete specific records. Delete tables, however, can only delete data based on some set of attribute values and delete all records with those matching attribute values (similar to a DELETE statement in SQL).

 

x__deleted vs Delete Side Tables

There are two factors to keep in mind when choosing between the x__deleted and delete side tables for dataset deletions: number of fields in the fact table grain and data volume size.

 

Generally, if the number of fields in the fact table grain is high (greater than five) we recommend using the x__deleted as the delete side table requires an additional side table join on multiple conditions that can show weaker performance than the delete flag.

 

When the volume of data is low, though there won’t be a very notable difference in performance, the x__deleted flag is going to show better delete performance than delete side tables. This is in part due to a need to query an additional table when a delete side table is used.

 

However, the story is a little more nuanced when working with large data volumes. Such datasets are typically partitioned, segmented over multiple nodes, and optimized using proper projections. For these types of tables, in cases where the records to delete will mostly be within 1 to a few partitions and nodes, using the x__deleted field will usually be more performant. But, when deleting data over a number of different partitions and nodes (ie. deleting based on some subset of attribute value(s) not used to define the partitions and segmentation across nodes) the delete side tables typically show better performance. 

 

Still, there are cases where these guidelines may not hold true, or other aspects of your data pipeline creates design constraints. So, we suggest trialing both methods when needed and discover the best blend for your project’s needs.

 

To learn about other features of ADD, see also:


0 replies

Be the first to reply!

Reply