Denis Baltor
07/12/2023, 12:00 PM
Francisco Antunes
07/12/2023, 12:13 PM
Denis Baltor
07/12/2023, 2:45 PM
Francisco Antunes
07/12/2023, 2:49 PM
Denis Baltor
07/12/2023, 3:02 PM
"However, be aware that depending on the process and the current state of the data in the target workspace, you may be inserting duplicate data in the workspace." Can you please elaborate more on that?
Michal Hauzírek
07/12/2023, 3:38 PM
Denis Baltor
07/12/2023, 4:57 PM
Michal Hauzírek
07/12/2023, 8:03 PM
POST /gdc/projects/{projectId}/schedules/{scheduleId}/executions
{
"execution": {
"params": {}
}
}
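As a minimal Python sketch of calling this endpoint (the host name, the bearer-token auth header, and the helper names are assumptions for illustration — the GoodData Platform normally authenticates via its own login flow):

```python
import json
import urllib.request

def build_execution_request(host, project_id, schedule_id):
    """Build the URL and JSON payload for a manual ADDv2 schedule run."""
    url = f"{host}/gdc/projects/{project_id}/schedules/{schedule_id}/executions"
    payload = {"execution": {"params": {}}}
    return url, json.dumps(payload)

def trigger_schedule(host, project_id, schedule_id, token):
    """POST the execution request and return the parsed JSON response.

    Note: bearer-token auth here is an assumption; substitute whatever
    authentication your GoodData Platform setup uses.
    """
    url, body = build_execution_request(host, project_id, schedule_id)
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The empty `params` object matches the body shown above; schedule-level parameters defined in the UI still apply.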
And let me just briefly describe the overall architecture of the ADDv2:
If you want to load a single workspace (or a few):
• you (once) define a datasource (basically credentials to your data storage); this lives above the workspaces, and you set it to load one workspace
• in the workspace you (once) deploy the Automated Data Distribution process (with the datasource)
• you (once) define a “schedule” for this process (it can be a “manual” schedule, meaning it is technically not scheduled to run automatically) where you define which datasets to load
• you then (repeatedly) execute the schedule and it will read the data and load it into your workspace
If you want to load a whole segment of workspaces, you typically organize them into a “segment” and then deploy only one ADDv2 process to a special workspace and instruct it to load the whole segment.
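To make the segment idea concrete: in ADD the source table conventionally carries an `x__client_id` column telling the process which workspace (client) each row belongs to, so a single table can feed a whole segment. The column name follows the documented ADD convention, but treat this helper itself as an illustrative sketch, not part of the product:

```python
from collections import defaultdict

def partition_by_client(rows):
    """Group source rows per client (workspace) using the ADD
    x__client_id column convention, so one table feeds a whole segment."""
    per_client = defaultdict(list)
    for row in rows:
        per_client[row["x__client_id"]].append(row)
    return dict(per_client)
```

Conceptually, this is what the segment-level ADDv2 process does for you: you keep one table, and each workspace receives only its own rows.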
Note that all the articles and APIs above are for the “GoodData Platform” - the original platform fully hosted by GoodData. They are not relevant for “GoodData Cloud” or “GoodData.CN”.
Denis Baltor
07/13/2023, 11:44 AM
1. I could set queryParallelism to 2, although it's still in beta (so it might not be a good idea for the time being). Is my understanding correct?
2. For multiple segments, can I trigger multiple ADDv2 processes in parallel, one per segment?
3. We can't predict when the ADDv2 process will start, given that the trigger command is placed in the execution queue. Is there any SLA, or can I at least get a ballpark estimate of how long that might take (1 min, 5 min, 10 min, 30 min), based on your knowledge of how GoodData Platform is performing these days?
4. What might happen if I trigger the data load for one segment while another load for the same segment is already taking place?
5. What would cause the problems described here: "However, be aware that depending on the process and the current state of the data in the target workspace, you may be inserting duplicate data in the workspace."?
Thank you in advance.
Michal Hauzírek
07/13/2023, 1:27 PM
1. I am not sure whether queryParallelism has any effect while loading from a file (CSV or Parquet). It works for sure when loading from an SQL database, but for files we need to check internally with the engineers.
2. Yes, that should work, but the total number of ADDv2 processes running in parallel per environment is limited and quite low (it used to be 2, I think, but it might depend on the datacenter). That is also why it is not a good idea to load many workspaces by having one ADDv2 process in each workspace.
3. That depends, but if the queue is not caused by other processes running within the same environment (i.e. you running many other ADDv2 instances - see 2 above), it usually starts very quickly, I would say within a minute. Of course, if you were to execute (put into the queue) dozens of ADDv2 processes within your environment, each loading tons of data, they could wait for a long time.
4. It is not possible to run the schedule again while it is already running. If you try that via the API, you will receive an error saying it is already running. If it were a time-scheduled run, it would be skipped and would run next time (provided it is not still running then).
5. To be honest, I am not sure what they were trying to say with this or how it relates to manually executing the load 🙂 As far as I know, if you have primary keys properly defined in your datasets, you will never load duplicates. The only case where you could load duplicates (apart from having a wrongly defined primary key) would be having no primary key and using incremental load.
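Since re-triggering a running schedule fails with an "already running" error (point 4 above), a caller that wants back-to-back runs can simply retry with a backoff. A sketch of that retry loop — the exact HTTP status code for "already running" is an assumption here, and `trigger` stands in for any wrapper around the executions endpoint:

```python
import time

ALREADY_RUNNING = 409  # assumption: the API rejects a duplicate run with this status

def run_schedule_when_free(trigger, max_wait_s=600, poll_s=15):
    """Keep trying to start a schedule until the API accepts the run.

    `trigger` is any callable that attempts the execution POST and
    returns the HTTP status code (a hypothetical wrapper, not a real
    GoodData client function). Returns True once a run is accepted,
    False if the previous run is still going after max_wait_s.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        status = trigger()
        if status < 300:
            return True  # run accepted / queued
        if status != ALREADY_RUNNING:
            raise RuntimeError(f"unexpected status {status}")
        time.sleep(poll_s)  # previous run still in progress; back off
    return False
```

For time-scheduled runs no such loop is needed - as noted above, an overlapping run is simply skipped.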
In general for the best performance of data loads in GoodData, from my experience here are the most important things to consider:
• the basic rule is - the less data you are loading, the faster it is 🙂
◦ if speed of load is important, do incremental loads instead of full loads
◦ if possible, only load data for those datasets and workspaces that actually need it
▪︎ I’ve seen customers trying to load static dimensions with never-changing data every hour to hundreds of workspaces
◦ if your data is very detailed and granular (and not needed at that level of detail for your dashboards), consider pre-aggregating the data before loading it (it is also cheaper).
Denis Baltor
07/13/2023, 2:50 PM