# gooddata-platform
t
Is there some way to tune the performance of a load, and to identify potential bottlenecks?
j
You may want to partition your tables to help with performance, but there are a few other hints you can find in our docs here
Hi Thomas, this might be a bit too broad, and maybe you don’t load your data this way. Can you please confirm the data source you are using and the method for loading? If there is anything more specific, I can reach out in a DM to make sure we are covering all bases of overall workspace performance.
t
It’s a Redshift table
I can query the table rather fast, so it feels like it’s slow on the GoodData side
Like a few minutes vs. 1 hour
j
DM sent
t
I think I have found the issue. Could a primary key that’s not actually unique be the problem?
that wouldn’t matter much when you simply load a table into Redshift, since it doesn’t care about keys at all, but the GD data model uses this to link up dimensions and facts
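As a quick way to confirm that suspicion on the Redshift side, duplicates in the key column can be listed directly; this is a minimal sketch with hypothetical table and column names (my_schema.my_fact_table, order_id):

```sql
-- Hypothetical check: list key values that occur more than once.
SELECT order_id, COUNT(*) AS row_count
FROM my_schema.my_fact_table
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC
LIMIT 100;
```

Any rows returned mean the column is not usable as a unique primary key (connection point) as-is.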
j
Hi Thomas, that is correct. Each dimension that is connected to a fact table in the LDM must have its connection point (primary key). The GoodData platform stores connection point values in special in-memory structures for faster data normalization during data loading and report computation. These structures scale up to single-digit hundreds of millions of values. The connection point also plays an important role in incremental loading, as records with the same connection point values are updated (instead of appended) during incremental data loads. You can find some more information on this here: https://help.gooddata.com/doc/enterprise/en/data-integration/data-modeling-in-gooddat[…]a-model-components-in-gooddata/facts-in-logical-data-models/
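To illustrate the “updated instead of appended” behaviour in plain SQL terms: it is conceptually an upsert keyed on the connection point. The sketch below is only an analogy using a generic Redshift staging-table pattern with hypothetical names; it is not how the GoodData platform implements this internally.

```sql
-- Analogy only: upsert semantics keyed on the connection point (order_id here).
-- 1) Drop existing rows whose key re-appears in the newly loaded batch...
DELETE FROM my_schema.orders
USING my_schema.orders_staging s
WHERE my_schema.orders.order_id = s.order_id;

-- 2) ...then insert the batch, so re-loaded keys are updated rather than duplicated.
INSERT INTO my_schema.orders
SELECT * FROM my_schema.orders_staging;
```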
m
Hi @Thomas Karbe, if you are having trouble with the performance of data loads, here are some tips:

**Switch large datasets to incremental load**
• This is step #1 in load performance tuning, with potentially the biggest effect.
• In general, the time to load data to the GD Platform depends on the data volume being loaded, so with large datasets the best way to speed up the load is to switch from full loads to incremental loads, where only the data that has been modified or added (and potentially also data to be deleted) is loaded.
• With ADD v2 (which I assume you are using to load data from Redshift), incremental load can be enabled quite easily by adding an x__timestamp (or similarly mapped) column to your table, as described in the documentation. This should be the timestamp when each particular row in your table was created or updated (see the sketch after this message).
• GoodData will then automatically remember, for each workspace and incremental dataset, the value of the last loaded timestamp and will query only newer data on the next load.
• This feature can be enabled per dataset, so you can still load some datasets in full mode while loading the large ones incrementally.
• In case of incremental load, please make sure your dataset has a primary key (one or multiple columns) defined. GoodData will then use this key to update any existing rows if the same key value is re-loaded.
• Unless a large portion of your data changes with every load, the switch to incremental load usually helps significantly with load performance.
• The first load will be full. After that (once the last loaded timestamp is stored), it will continue incrementally. You can still force a full load if you ever need to.

**Non-unique primary keys**
• You mentioned above that you might have non-unique primary keys in your table.
• This by itself should not affect performance that much; GoodData actually deduplicates input data by the key automatically.
• You can disable this deduplication feature as described here. Disabling it will make the load slightly (but not significantly) faster, and any duplicity on the primary key during the load will then cause the load to fail (which is IMO quite useful, since non-unique keys usually mean something might be wrong with the source data).

**Other load performance tips**
• The performance of the load depends on the data volume being loaded: the number of rows and the number of columns (especially attributes and references), so datasets with a lot of attributes/references will load slower than those with fewer.
• The GD Platform generates and maintains a unique token/hash for every single attribute value loaded.
 ◦ This can slow down uploads if there are attributes with very high cardinality (many millions of distinct values).
 ◦ If those are not needed for reporting or references (e.g. a unique ID of a detailed transaction, long free-text fields), it is best not to include them in the data model at all.
• Back-end engine type
 ◦ For very large workspaces, a switch to a high-performance engine might help not only with reporting performance but also with loads. But typically data load performance alone would NOT be the driver for the switch (usually the performance of calculations and the data volume is).
 ◦ Normally GD workspaces use a standard engine.
 ◦ This works perfectly well for most cases, but for very large workspaces (high tens of millions of rows) performance might slow down.
 ◦ For such workspaces it might be possible to switch them to a high-performance engine.
  ▪︎ It is up to GD support engineers to evaluate whether the switch is a possible and reasonable option.
  ▪︎ There might be contractual implications (only some tiers might be eligible for this, and the high-performance engine costs more).
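To make the x__timestamp tip above more concrete, here is a minimal sketch, assuming a hypothetical Redshift table my_schema.orders that already tracks row changes in an updated_at column (both names are made up for illustration). If no such change-tracking column exists, it has to be added and maintained first.

```sql
-- Option A (hypothetical names): add the x__timestamp column ADD v2 can map
-- and backfill it from the existing change-tracking column.
ALTER TABLE my_schema.orders ADD COLUMN x__timestamp TIMESTAMP;
UPDATE my_schema.orders SET x__timestamp = updated_at;

-- Option B (hypothetical names): leave the table untouched and expose a view
-- for GoodData that aliases the existing column to x__timestamp.
CREATE VIEW my_schema.orders_for_gd AS
SELECT o.*, o.updated_at AS x__timestamp
FROM my_schema.orders o;
```

Whichever variant is used, the timestamp has to keep being updated on every insert/update so that incremental loads pick up exactly the changed rows.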
t
Hi @Michal Hauzírek. We indeed have one dataset with around 50 million rows, but that alone doesn’t seem to be the problem. I just realised that some keys were mistakenly stored as text even though the source tables have them all as bigints. I noticed that when manually adding to the data model, text is very often assumed as the data type by default, and I’ll clean that up. Also, incremental load might very well be a good option for us, since most rows in the 50-million-row dataset are pretty historic and won’t change.
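For the text-vs-bigint mix-up, one way to double-check what Redshift actually stores before cleaning up the LDM mapping is to read the column types from the catalog (hypothetical schema/table names):

```sql
-- Hypothetical check: list column names and data types of the source table
-- so the LDM data types can be aligned with them.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'my_schema'
  AND table_name   = 'orders'
ORDER BY ordinal_position;
```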