# gd-beginners
d
Hi all, hope all is well. I can read here that GD can load Parquet files, but the Loading Data via REST API page mentions only CSV files. Can we trigger Parquet file loading via the REST API?
f
Hello Denis, I’m afraid that the REST API is only able to load CSV files. The Parquet loading capability is covered via the Automated Data Distribution process, which can load data directly from your source, or via a more extensive ETL pipeline.
👀 1
d
Hi @Francisco Antunes, that's helpful. Regardless of whether the load is direct or via a pipeline, it has to be scheduled, doesn't it? Is the minimum interval 15 min?
f
It can be configured to be run manually, so it is not strictly necessary to have it running on a schedule. But yes, the minimum interval is 15 minutes.
d
Got you. About loading data manually, we can read here the following:
However, be aware that depending on the process and the current state of the data in the target workspace, you may be inserting duplicate data in the workspace.
Can you please elaborate more on that?
m
Hi Denis, just to add to what Francisco mentioned, you can still invoke the "scheduled" process via API. So you can create a process and a schedule (either with the UI or via API) and then trigger it to load data from a Parquet file. You can check the GoodData platform API reference and look for "execute schedule". (I am on my phone right now, will post the exact API call once I get to a laptop.)
d
Thanks @Michal Hauzírek for chiming in. Yes, please share the API when you can, as Francisco and I were under the impression that the API would only allow us to load CSV files.
m
OK, so let me clarify:
• The article Loading Data via REST API describes a very low-level approach where you upload data directly to the workspace and need to specify all the parameters (mapping, incremental load…) with each load. This one only supports CSV, can only load data physically stored in the temporary GoodData storage (accessible via WebDAV), and cannot be scheduled on the GoodData end.
• The article GoodData-S3 Integration Details describes the specifics of loading with a higher-level tool called Automated Data Distribution (ADDv2). This tool provides additional services, such as the possibility to schedule the runs on the GoodData end, and it can read data from different storages (including AWS S3). And this one DOES support Parquet files.
• While ADDv2 is primarily handled from the UI, it can also be fully operated with the REST API. Deployment, configuration, execution and monitoring can all be done with the REST API. Probably the most interesting for you would be how to trigger an existing schedule via the API. For this you can use this API call: https://help.gooddata.com/doc/free/en/expand-your-gooddata-platform/api-reference/#operation/executeSchedule There are quite a lot of options (since schedules can be of different types, not just ADDv2), but to simply execute the schedule manually as defined, all you need to do is call
POST /gdc/projects/{projectId}/schedules/{scheduleId}/executions

{
  "execution": {
    "params": {}
  }
}
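For illustration, the same call can be made from a script. Below is a minimal sketch using Python's requests library; the domain, project ID and schedule ID are placeholders, and authentication is left out (the session is assumed to already carry a valid GoodData Platform login).

import requests

# Placeholders -- substitute your own domain, project ID and schedule ID.
# `session` is assumed to already be authenticated against the platform.
DOMAIN = "https://secure.gooddata.com"
PROJECT_ID = "<projectId>"
SCHEDULE_ID = "<scheduleId>"

session = requests.Session()

resp = session.post(
    f"{DOMAIN}/gdc/projects/{PROJECT_ID}/schedules/{SCHEDULE_ID}/executions",
    json={"execution": {"params": {}}},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # details of the queued execution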
And let me just briefly describe the overall architecture of ADDv2. If you want to load a single workspace (or a few):
• you (once) define a data source (basically credentials to your data storage); this lives above the workspaces, and you set it to load one workspace
• in the workspace you (once) deploy the Automated Data Distribution process (with the data source)
• you (once) define a “schedule” for this process (it can be a “manual” schedule, meaning it is technically not scheduled to run automatically) where you define which datasets to load
• you then (repeatedly) execute the schedule and it will read the data and load it into your workspace
If you want to load a whole segment of workspaces, you typically organize them into a “segment” and then deploy only one ADDv2 process to a special workspace and instruct it to load the whole segment. Note that all the articles and APIs above are for “GoodData Platform” - the original platform fully hosted by GoodData. They are not relevant for “GoodData Cloud” or “GoodData.CN”.
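To make the one-time versus repeated steps above easier to picture, here is a small illustrative sketch in Python. These classes are hypothetical stand-ins for the pieces described above (data source, ADDv2 process, schedule), not GoodData API objects or SDK calls; only the last step corresponds to the executeSchedule call shown earlier.

from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the ADDv2 pieces -- not GoodData API objects.

@dataclass
class DataSource:                 # defined once, above the workspaces
    alias: str
    location: str                 # e.g. an S3 prefix holding Parquet files

@dataclass
class AddProcess:                 # deployed once per workspace (or segment)
    workspace_id: str
    datasource: DataSource

@dataclass
class Schedule:                   # defined once; cron=None means a "manual" schedule
    process: AddProcess
    datasets: list
    cron: Optional[str] = None

def execute(schedule: Schedule) -> None:
    # the repeated step: corresponds to POST .../schedules/{scheduleId}/executions
    print(f"Loading {schedule.datasets} into {schedule.process.workspace_id} "
          f"from {schedule.process.datasource.location}")

ds = DataSource(alias="my-s3", location="s3://my-bucket/exports/")
proc = AddProcess(workspace_id="my-workspace", datasource=ds)
manual = Schedule(process=proc, datasets=["orders", "customers"])
execute(manual)                   # run on demand, as often as needed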
🙌 1
d
Hi @Michal Hauzírek. That was extremely helpful. Much appreciated. I think you understood my use case pretty well. If I may, I'd like to understand better how I might try to improve the data load parallelism.
1. If I got you and the docs right, GoodData will deploy only one ADDv2 process per segment. It will load the data to all workspaces belonging to that segment. The best we can do here to speed things up is set queryParallelism to 2, although it's still in beta (so it might not be a good idea for the time being). Is my understanding correct?
2. For multiple segments, can I trigger multiple ADDv2 processes in parallel, one per segment?
3. We can't predict when the ADDv2 process will start, given that the trigger command is placed in the execution queue. Is there any SLA, or at least can I have a ballpark estimate of how long that might take (1 min, 5 min, 10 min, 30 min), based on your knowledge of how the GoodData Platform is performing these days?
4. What might happen if I trigger the data load for one segment whilst this same segment has got another load taking place?
5. What would cause the problems described here: "However, be aware that depending on the process and the current state of the data in the target workspace, you may be inserting duplicate data in the workspace."?
Thank you in advance.
m
Let me try to answer your questions:
1. Yes, you got it right about the deployment. For the speedup - to be honest, I am not sure whether queryParallelism has any effect while loading from a file (CSV or Parquet). It definitely works when loading from an SQL database, but for files we would need to check internally with the engineers.
2. Yes, that should work, but the total number of running ADDv2 processes per environment is limited and quite low (it used to be 2, I think, but it might depend on the datacenter). That is also the reason why it is not a good idea to load many workspaces by having one ADDv2 process in each workspace.
3. That depends, but if the queue is not caused by other processes running within the same environment (= you running many other ADDv2 instances - see 2. above), it usually starts very quickly, I would say within a minute. Of course, if you were to execute (put into the queue) dozens of ADDv2 processes within your environment and each were loading tons of data, they could wait for a long time.
4. It is not possible to run a schedule again while it is already running. If you try that via the API, you will receive an error saying that it is already running. If it is a time-scheduled run, it will be skipped and will run next time (if it is not still running then).
5. To be honest, I am not sure what they were trying to say with this and how it relates to manually executing the load 🙂 As far as I know, if you have primary keys properly defined in your datasets, you will never load duplicates. The only case where you could load duplicates (apart from having a wrongly defined primary key) would be to have no primary key and use incremental load.
In general, for the best performance of data loads in GoodData, from my experience these are the most important things to consider:
• the basic rule is - the less data you are loading, the faster it is 🙂
◦ if the speed of the load is important, do incremental loads instead of full loads
◦ if possible, only load data for those datasets and workspaces that actually need it
▪︎ I’ve seen customers trying to load static dimensions with never-changing data every hour to hundreds of workspaces
◦ if your data is very detailed and granular (and not needed in that detail for your dashboards), consider pre-aggregating the data before loading it (it is also cheaper).
d
Thank you @Michal Hauzírek for sharing this. Much appreciated! 🙏
🙏 1