Hi @Thomas Karbe, if you are having trouble with the performance of data loads, here are some tips:
Switch large datasets to incremental load
• This is step #1 in load performance tuning, with potentially the biggest effect.
• In general, the time to load data to the GD Platform depends on the data volume being loaded. With large datasets, the best way to speed up the load is to switch from full loads to incremental loads, where only the rows that have been added or modified are loaded (and potentially also rows to be deleted).
• With ADDv2 (which I assume you are using to load data from Redshift), incremental load can be enabled quite easily by adding an x__timestamp column (or a similarly mapped column) to your table, as described in the documentation. This should be a timestamp of when each particular row in your table was created or updated (see the sketch after this list).
• GoodData will then remember, for each workspace and incremental dataset, the value of the last loaded timestamp, and on the next run it will automatically query only newer data.
• This feature can be enabled per dataset, so you can still load some datasets in full mode while loading the large ones incrementally.
• For incremental loads, please make sure your dataset has a primary key (one or more columns) defined. GoodData will then use this key to update any existing rows when the same key value is loaded again.
• Unless a large portion of your data changes with every load, switching to incremental load usually improves load performance significantly.
• The first load will be a full one. After that (once the last loaded timestamp is stored), loads will continue incrementally. You can still force a full load if you ever need to.
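To make this concrete, here is a minimal sketch of what the Redshift side might look like. The table and column names are hypothetical (only the x__timestamp convention comes from the documentation), and the final query just illustrates the kind of filter GoodData applies using the stored timestamp:
```
-- Hypothetical Redshift source table prepared for ADDv2 incremental load
CREATE TABLE orders (
    order_id     VARCHAR(64) NOT NULL,  -- maps to the dataset's primary key
    status       VARCHAR(32),
    amount       DECIMAL(12, 2),
    x__timestamp TIMESTAMP NOT NULL     -- when this row was created or last updated
);

-- Conceptually, an incremental run only picks up rows newer than the
-- last loaded timestamp GoodData stored for this workspace/dataset:
SELECT order_id, status, amount, x__timestamp
FROM orders
WHERE x__timestamp > '2024-01-01 00:00:00';  -- placeholder for the stored value
```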
Non-unique primary keys
• You mentioned above that you might have non-unique primary keys in your table.
• This by itself should not affect performance that much; GoodData automatically deduplicates input data by the key.
• You can disable this deduplication feature as described here. Disabling it makes the load slightly (but not significantly) faster. However, if any duplicate primary key values appear during the load, the load will fail (which is IMO quite useful, since non-unique keys usually mean something might be wrong with the source data). If you want to check your keys on the Redshift side first, see the sketch below.
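A query along these lines will show whether the key is actually unique in the source before you decide to disable deduplication (table and column names are hypothetical):
```
-- Hypothetical check: list primary-key values that occur more than once
SELECT order_id, COUNT(*) AS row_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC
LIMIT 100;
```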
Other load performance tips
• Load performance depends on the data volume being loaded: the number of rows and the number of columns (especially attributes and references). Datasets with many attributes/references will load more slowly than those with fewer.
• The GD Platform generates and maintains a unique token/hash for every single attribute value loaded.
◦ This can slow down uploads if there are attributes with very high cardinality (many millions of distinct values).
◦ If those are not needed for reporting or references (e.g. the unique ID of a detailed transaction, or long free-text fields), it is best not to include them in the data model at all (a quick cardinality check is sketched at the end of this message).
• Back-end engine type
◦ For very large workspaces, switching to a high-performance engine might help not only with reporting performance but also with loads. Typically, though, data load performance alone would not be the driver for the switch (usually calculation performance and data volume are).
◦ Normally, GD workspaces use a standard engine.
◦ This works perfectly well for most cases, but for very large workspaces (high tens of millions of rows) performance might degrade.
◦ For such workspaces, it may be possible to switch to a high-performance engine.
▪︎ It is up to GD support engineers to evaluate whether the switch is possible and a reasonable option.
▪︎ There might be contractual implications (only some tiers might be eligible, and the high-performance engine costs more).
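Regarding the high-cardinality attributes mentioned above: you can check cardinality directly in Redshift before deciding what to keep in the model. A small sketch, with hypothetical table/column names:
```
-- Approximate distinct counts are cheaper than exact ones on large tables
SELECT
    APPROXIMATE COUNT(DISTINCT transaction_id) AS transaction_id_cardinality,
    APPROXIMATE COUNT(DISTINCT free_text_note) AS free_text_note_cardinality
FROM orders;
```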