# gooddata-platform
m
Hi! Looks like help is needed for db load optimisations. I receive multiple tables in Postgres, which I UNION ALL, then complete the data set with additional columns into a materialised view, and finally join with workspace info for LCM using x__client_id. But it's a full load every time into the GD platform. Apart from incremental loads, is there anything I can do to prepare the data better? Like adding indices to the last table that gets loaded, or partitioning the materialised view? Any suggestion that could speed up the load dramatically would be a great help!
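For reference, a minimal sketch of the setup described above; all table names, column names, and the join key are hypothetical placeholders:

```sql
-- Hypothetical sketch: union the incoming tables, enrich them, and
-- materialise the result joined with the LCM workspace mapping.
CREATE MATERIALIZED VIEW dataset AS
SELECT s.a,
       s.b,
       s.c,
       w.x__client_id           -- LCM client mapping column
FROM (
    SELECT a, b, c FROM src_table_1
    UNION ALL
    SELECT a, b, c FROM src_table_2
) AS s
JOIN workspace_info AS w
  ON w.join_key = s.a;          -- the real join key is an assumption
```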
b
Hi Michael, when you run the dataload in GoodData, your database is queried through the API and the result is downloaded. The query that is run is quite simple. Let's assume you have a dataset with three columns A, B, and C, and three clients; the query would be:
```sql
SELECT A AS A, B AS B, C AS C
FROM dataset
WHERE x__client_id = 'CLIENT1'
   OR x__client_id = 'CLIENT2'
   OR x__client_id = 'CLIENT3';
```
so it all comes down to optimizing the performance of such simple queries in your database. Another factor can be network throughput, i.e. the speed of downloading the data. What can be optimized on the GoodData side is how many clients get downloaded by a single query (4 is the default for full loads).
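Since the generated query filters only on x__client_id, one common way to prepare the data (a sketch, assuming the materialised view is called dataset as above) is an index on that column:

```sql
-- Hypothetical: index the LCM filter column so the per-client
-- predicates (x__client_id = '...') can use an index scan.
CREATE INDEX idx_dataset_client_id ON dataset (x__client_id);

-- Postgres supports indexes on materialised views; they are
-- maintained across REFRESH MATERIALIZED VIEW.
ANALYZE dataset;  -- refresh planner statistics after large loads
```

Whether the planner actually uses the index depends on client cardinality; with only a few clients covering most of the rows, a sequential scan may still win.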
m
Hi @Boris, thanks for the info. So the query takes the possible values from the segment in LCM, clear, thanks. Let me optimise the performance of that query. How do I increase the default of 4? I'd like to change it to 10, please.
b
Hi Michael - this is explained in the best practices section, i.e. setting the parameter GDC_DOWNLOAD_PROJECTS_CHUNK_SIZE in the dataload schedule.
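For example, to raise the chunk size from the default 4 to the requested 10, the schedule parameter would be (parameter name from the message above; exactly where it is entered in the schedule UI/API is not shown here):

```
GDC_DOWNLOAD_PROJECTS_CHUNK_SIZE = 10
```

A larger chunk means fewer, bigger queries per dataload run, since each query then covers more clients in its WHERE clause.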
m
Hi @Boris, I had a look at the queries; they seem fine to me. Is there a difference if the workspace is empty? What post-processing is happening? The query finished a long time ago, but the job takes a long time to complete.
b
The post-processing should be quite straightforward - it basically uploads a CSV with the query results to the workspace (this could take longer depending on the amount of data loaded). How much data are you loading and how long does it take?
m
Hi @Boris, it's loading from Postgres; dev loads 20M records in 1 hour, then prod loads the LCM segments in 2 hours.
b
Hi Michael, this will need a more thorough review; let me write you a PM.