# gooddata-platform
p
--- Also, secondary question: If I truncate my tables every day, refill them with an x__timestamp = NOW() and I only use one incremental load per day in GD, would that be equivalent to a daily full load? 🤔
✅ 1
m
This would not be equivalent to a full load, because in a full load all the data currently in the dataset is replaced with what is newly loaded. If you truncate the whole table and insert rows with x__timestamp=NOW(), the load will perform a MERGE (insert + update) based on the keys in each dataset. If some keys were in your tables before and are now removed, they will still remain in the workspace. Also, MERGING a lot of data into a lot of data (as would happen if you re-uploaded everything in incremental mode) can be quite slow. Increments work best when the increment is small relative to what is already loaded.
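A minimal sketch of the difference (not GoodData code, just simulating the semantics on a keyed dataset with Python dicts): a full load replaces the dataset, while an incremental MERGE inserts and updates by key but never deletes.

```python
def full_load(dataset: dict, incoming: dict) -> dict:
    """Full load: the workspace dataset is replaced entirely."""
    return dict(incoming)

def incremental_load(dataset: dict, incoming: dict) -> dict:
    """Incremental load: MERGE (insert + update) by key; no deletes."""
    merged = dict(dataset)
    merged.update(incoming)  # existing keys updated, new keys inserted
    return merged            # keys absent from `incoming` survive

workspace = {"a": 1, "b": 2, "c": 3}
source    = {"a": 1, "b": 20}  # "c" was deleted upstream

print(full_load(workspace, source))         # {'a': 1, 'b': 20}
print(incremental_load(workspace, source))  # {'a': 1, 'b': 20, 'c': 3}
```

Note how the deleted key `"c"` lingers after the incremental load even though the source table was truncated and fully re-inserted.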
p
OK so due to incremental load not executing delete on records which are no longer in source data, incremental load differs from full load. So... that means that in the case where a client extraction failed, I cannot provide them with the previous day's data until I run a subsequent incremental load job because there is a risk that some entries have been deleted between the two runs and that those will stay in the workspace after the incremental (refresh) load job. 🤔 Is this a correct interpretation? If so, would that mean that the only way to provide some fallback (stale) data until the next load job is to do a full load job once again after the data has been extracted?
(an alternate way would be to track deletion in our ELT and have a filter pretty much everywhere downstream, but that's a can of worms I don't want to open)
--- So my initial question is even more important then: Can I trigger full load via an API call rather than via a schedule?
Especially if we end up with an issue internally and for instance our ELT throttles, I want to avoid having a full load done on potentially inconsistent data
@Michal Hauzírek Is that what I am looking for? https://help.gooddata.com/doc/enterprise/en/expand-your-gooddata-platform/api-reference#operation/executeSchedule If so, in the description it says:
Depending on whether you are executing an Automated Data Distribution (ADD) schedule
Does that mean I need to have a schedule (recurring) load job to be able to call this endpoint? Or can I just have a connection/data source configured and from there I can call the API Endpoint whenever I want to tell GD to schedule a load?
m
Yes, this is the API you can use to trigger a load into a GoodData workspace. And yes, it also supports forcing a full load based on the parameters you send.
A “schedule” in GoodData is basically a (data loading) process associated with some parameters. It does not need to be recurring (you can have a schedule which is set to run “manually”). Schedules also hold the history of their executions and, for some time, the logs from their runs. And yes, a schedule is the recommended way to interact with loads, both manually in the Data Integration Console and via the API. If you have already loaded data into your workspace, you probably already have such a schedule.
Each (data loading) schedule needs to exist in some workspace. A connection/data source is one level above: it is really just a definition of a connection string and credentials. It does not say which workspace, which datasets, etc. to load the data into.
There are two ways to work with schedules: 1. a schedule can exist within the workspace into which you are loading the data (that is the “current workspace” option), 2. or, if you are using Lifecycle Management to handle many workspaces of the same structure (with different client_ids), there can be one schedule in one workspace which handles loads to all the workspaces within the Lifecycle Management segment. In such a setup we usually recommend having one special “service” workspace for this purpose, which does not have any client_id or even any data model but serves just as an envelope for the data loading (and other) processes.
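A hedged sketch of calling that endpoint from code. The URL path follows the executeSchedule operation linked above; the parameter name `GDC_DATALOAD_SINGLE_RUN_LOAD_MODE` is an assumption here for forcing a one-off full load on an ADD schedule — verify the exact parameter against the ADD documentation before relying on it.

```python
import json

def build_execution_request(force_full_load: bool = False) -> dict:
    """Build the JSON body for a schedule execution request.

    The GDC_DATALOAD_SINGLE_RUN_LOAD_MODE parameter name is assumed,
    not confirmed -- check the GoodData ADD docs for the exact key.
    """
    params = {}
    if force_full_load:
        params["GDC_DATALOAD_SINGLE_RUN_LOAD_MODE"] = "FULL"  # assumed name
    return {"execution": {"params": params}}

# POST this body (with your session/auth headers) to:
#   /gdc/projects/{project_id}/schedules/{schedule_id}/executions
body = json.dumps(build_execution_request(force_full_load=True))
print(body)
```

This way the recurring schedule can stay on incremental mode, and your orchestrator only forces a full load on demand (e.g. after a failed extraction has been repaired).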
✅ 1
p
Lovely. That's a quality answer @Michal Hauzírek, thanks a lot for your thoroughness. I'd suggest your answer be recycled and added to the docs of the API ref 🙂
🙏 1