ADD - Incremental Data Loading

  • 26 January 2021

Introduction

Automated Data Distribution (ADD) allows you to load data into each dataset in a workspace in either full or incremental mode. In full mode, every ADD run replaces all data in the target dataset with all applicable data from your source table or file, which can mean transferring unnecessarily large amounts of data. In incremental mode, each ADD run transfers only new and updated rows, which makes for faster updates.

 

How does it work?

If your data source is a database: To turn on incremental load into a dataset, add an x__timestamp field to the corresponding source data table. ADD automatically recognizes this field and switches to incremental loads for every dataset loaded from this table.

The absence of the x__timestamp field in a source table indicates that the mapped datasets will be loaded in full. 
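As a rough illustration of the source-side change, the sketch below adds an x__timestamp column to an existing table and populates it. It uses an in-memory SQLite database purely as a stand-in for your real source database. The table name `orders`, its columns, the TEXT column type, and the UTC string format are all assumptions made for this example, not ADD requirements.

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in source database (assumption: your real source is a warehouse, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 10.0)")

# Add the x__timestamp field that the article says ADD recognizes.
conn.execute("ALTER TABLE orders ADD COLUMN x__timestamp TEXT")


def utc_now() -> str:
    """Return the current UTC time in one fixed format (hypothetical helper)."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Stamp existing rows; new and updated rows would be stamped the same way.
conn.execute("UPDATE orders SET x__timestamp = ?", (utc_now(),))
conn.execute(
    "INSERT INTO orders (order_id, amount, x__timestamp) VALUES (?, ?, ?)",
    (2, 25.5, utc_now()),
)

columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
unstamped = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE x__timestamp IS NULL"
).fetchone()[0]
```

Generating every value through one helper is one way to keep the time zone and format identical across loads.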

 

Every time ADD runs in incremental mode, it saves the maximum loaded x__timestamp value as the LSLTS (Last Successfully Loaded Timestamp) for each LDM dataset in GoodData. The LSLTS identifies the last timestamp for a specific dataset in a specific workspace. On the next run, only the data records with x__timestamp > LSLTS for that dataset are used to incrementally update the workspace.

When the ADD load runs for the first time, the data is always loaded in full, even if x__timestamp is set on the source table.

[Figure: how the LSLTS for a dataset changes with subsequent ADD runs]
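The LSLTS bookkeeping described above can be sketched in a few lines of Python. This is only a simplified model of the behavior the article describes, not ADD's actual implementation. The function `run_add`, the row layout, and the use of `None` for "no LSLTS stored yet" are assumptions made for illustration.

```python
from datetime import datetime


def run_add(source_rows, lslts):
    """Simulate one ADD run and return (rows_to_load, new_lslts).

    lslts is None on the very first run, which is treated as a full load.
    """
    if lslts is None:
        rows = list(source_rows)  # first run: load everything
    else:
        # Later runs: only records with x__timestamp > LSLTS.
        rows = [r for r in source_rows if r["x__timestamp"] > lslts]
    if rows:
        # Save the maximum loaded x__timestamp as the new LSLTS.
        lslts = max(r["x__timestamp"] for r in rows)
    return rows, lslts


source = [
    {"id": 1, "x__timestamp": datetime(2021, 1, 1)},
    {"id": 2, "x__timestamp": datetime(2021, 1, 2)},
]

first_rows, lslts = run_add(source, None)    # full load
source.append({"id": 3, "x__timestamp": datetime(2021, 1, 3)})
second_rows, lslts = run_add(source, lslts)  # only the new row
third_rows, lslts = run_add(source, lslts)   # nothing new, LSLTS unchanged
```

In this model a run that finds no newer rows leaves the LSLTS untouched, so the next run starts from the same point.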

Here is a detailed example of how this works. 

 

If your data source is a CSV file on S3: If you’re using ADD to refresh data in a workspace from CSV files stored in cloud object storage, you don’t use the x__timestamp column. Instead, the timestamp is taken from the filenames, which have to follow a specific naming convention. The rest of the mechanism, including the LSLTS, works the same way. To learn more about the required naming conventions and see examples, refer to our documentation.

 

When loading data incrementally, you often also need a way to delete obsolete data. You can learn more about how to do that in this article: ADD - Deleting Data from Workspaces.
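Deletes interact with the incremental filter, which is why the recommendations below ask you to update x__timestamp whenever you set x__deleted. The sketch reuses the x__timestamp > LSLTS rule from this article to show what happens otherwise. The helper `changed_since` and the sample rows are hypothetical, and the sketch models only which rows an incremental run would see, not how ADD applies the deletion.

```python
from datetime import datetime


def changed_since(rows, lslts):
    """Return the rows an incremental run would pick up (x__timestamp > LSLTS)."""
    return [r for r in rows if r["x__timestamp"] > lslts]


lslts = datetime(2021, 1, 10)  # assumed value stored by the previous run

rows = [
    # Flagged for deletion, but x__timestamp was left at its old value.
    {"id": 1, "x__deleted": True, "x__timestamp": datetime(2021, 1, 5)},
    # Flagged for deletion and x__timestamp updated together with the flag.
    {"id": 2, "x__deleted": True, "x__timestamp": datetime(2021, 1, 11)},
]

picked_ids = [r["id"] for r in changed_since(rows, lslts)]
missed_ids = [r["id"] for r in rows if r["id"] not in picked_ids]
```

Here the row with id 1 is never seen by the incremental run, so its deletion flag has no effect until its x__timestamp moves past the LSLTS.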

 

Recommendations:

  • As with any use of timestamps, make sure the values used to populate x__timestamp are consistent (same timestamp source, time zone, and format for all data loads).

  • If you are using x__deleted to delete old records, don’t forget to update x__timestamp on those records as well, so that the next incremental load picks up the change.

  • If you would like to force a full load for a dataset with x__timestamp, there are several options:

 

To learn about other features of ADD, see also:

