We re working on optimizing our performance with gooddata an GoodData #gooddata-cloud

We're working on optimizing our performance with g...

Ryan Peters

10/03/2024, 2:52 PM

We're working on optimizing our performance with gooddata, and we've come across some really weird behaviour in the underlying queries that are generated. We see repeated queries that stack the following in their inner joins:

Copy code

INNER JOIN (SELECT * FROM <TABLE_SCHEMA> WHERE `wdf__tenant_id` = "<TENANT_NAME>")

When there's a clear filter for date after it.

Copy code

`t`.`case_opened` >= TO_TIMESTAMP("2024-09-25", "yyyy-MM-dd") 
AND `t`.`case_opened` < (TO_TIMESTAMP("2024-09-25", "yyyy-MM-dd") + (INTERVAL '1 day'))

So I'm curious why this might be the case? Just even inserting the date filter into the select * from portion of this generated query reduces runtime by nearly 20%.

Joseph Heun

10/03/2024, 3:15 PM

Hi Ryan, performance issues can be related to a multitude of things, but it does make sense if you filter for a date range that it limits the data and reduces runtime. You can find some general tips on data base performance HERE

Ryan Peters

10/03/2024, 3:17 PM

We've gone through all of those, and have reduced the load on our side by a factor of 99.9%. Running the data on properly CTE'ed data sets runs in < 10s. So we're seeing 5x runtimes in GoodData compared to a proper optimized query. So trying to understand what could be done for these sort of select * scenarios that are causing massive lag on our use of the tool.

Ryan Peters

10/03/2024, 4:55 PM

@Joseph Heun Is there the possibility that this formulation of a query like this can be fixed via LDM, or is this just a core feature of how the underlying query structure from GD to our datalake runs?

Joseph Heun

10/03/2024, 5:05 PM

Absolutely. How you structure your LDM can certainly affect performance in what you are querying. Workspaces are essentially built from the LDM up and how each dataset connects can certainly affect the performance.

Francisco Antunes

10/07/2024, 2:37 PM

Hi @Ryan Peters, after running some tests from our end, I found out what is causing these INNER JOIN queries you noticed. These queries are related to the Workspace Data Filter feature, which is used to limit which sections of the data are available for each tenant based on values on a column of your choice on your Data Source (in your case,

wdf__tenant_id

). I tested on my end, and the type of

SELECT * WHERE <column_name> IN <value>

is expected when using this feature - which makes sense, since it’s intended as a security/privacy feature and needs to apply to the entirety of the column, lest a tenant have access to data they shouldn’t. So these queries are indeed a core part of the WDF feature, and cannot be avoided while that is in place.

Ryan Peters

10/07/2024, 2:44 PM

@Francisco Antunes I think the point I'd want to bring up though, is on a join from that

select *

is that there's a date filter which I would assume comes from the dashboard date config already. Wouldn't it make sense to have that auto-populate within that

select *

range? Because we're seeing examples where this is already grabbing 10M rows of data before reducing, and drastically slows things down. (We have clients that within year easily clear 50M rows of data in our schema)

Francisco Antunes

10/07/2024, 3:01 PM

I assume the Date filter is added via an attribute filter on your end - either on the Dashboard or on the Visualization. As such, it’s not something that interacts directly with the WDF part of the query - which must always be applied beforehand. Which isn’t to say that its performance cannot be improved, but that would be in the realm of Product Feedback for our devs - I’ll make sure to share your experience with how the feature affects queries to data sources with very large columns!

Doug Creighton

10/07/2024, 7:30 PM

@Francisco Antunes is there a way we can reduce the size of the join. Currently that filterless join is on our table with millions data per month - which is slow If we changed our model could that join happen in a different way. For instance, If we created a DIM table for our Tenants. Instead of 1 million rows it could just be under 100? Whats the best practice here?

Doug Creighton

10/09/2024, 3:03 PM

@Francisco Antunes any insight to the above?

Francisco Antunes

10/09/2024, 3:18 PM

Hi Doug, in the context of using WDF, using a DIM table might not do the trick - the WDF does not propagate itself across the LDM, it only applies to the dataset(s) where it’s defined. That said, I haven’t tested that specific scenario, so it might be worth a try - I’m just not entirely sure how the filter would propagate in this case. Another option for filtering data is User Data Filters, which can be set for Users or User Groups, and do propagate through the LDM. I’d suggest checking it out as well and seeing if the filtering you need to do would be more efficient if done through UDFs.

Doug Creighton

10/11/2024, 8:04 PM

We need the security of the WDF Filter. However, I don't really understand your comment... mostly the "not propagate itself across the LDM" Currently the WDF filter is set up on our main aggregation table (the only aggregation table (lets say there is 1 Billion rows in it). I assume its doing this INNER JOIN (SELECT * FROM Main aggregation table WHERE

wdf__tenant_id

= "TENANT_NAME") Because we have set it up the WDF filter on that table If we instead had a table with all our clients that is a 1:N join with our aggregation table (we just have the 1) Dim_client Client 1 WDF_filter Client 2 WDF_FILTER _______ Aggregation table ID: DFSFSD Client_it = {WDF_FILTER} Sales: Purchases: ..... items: and then we join the client_id with the WDF_filter Then we did a 1:N join between these tables. I would have thought that INNER JOIN Would change from our aggregation table to the DIM table but with the join it would auto filter the aggregation table. This would make that inner join on 100 rows vs 1 Billion rows.

Francisco Antunes

10/15/2024, 9:11 AM

Hi Doug, I ran a few tests on my end to illustrate the point about WDF not propagating. I started by creating a WDF and applying it to the

Product Category

attribute on the

Product

dataset, which has a 1:N connection with the

Order lines

dataset (screenshot #1). However, when I create a simple visualization with the sum of

Order Line ID

, the WDF doesn’t work (note that I even have the

Category

attribute as the only option on the filter; It’s set to “All”, though, so even though only 1 option is available, it still doesn’t apply the WDF to the viz, only to the filter, if that makes sense). That’s shown in the second Screenshot. Then, I actually add the

Category

attribute to the visualization itself, which then leads to the INNER JOIN, thus resulting in the WDF applying. I could have used any of the attributes from the

Product

dataset, in fact, with similar results. That’s what I meant by the WDF not propagating - it only applies to the original dataset, so any other datasets down the line will not be affected by it without actually enforcing the join by slicing or filtering the visualization by an attribute from the dataset where WDF is set up (in your example, the DIM table). So I’m afraid that your idea with the DIM table wouldn’t quite work out as-is. You’d need to find a way to reference this table during report computation to ensure the WDF applies - be it by adding an attribute from it to the report, or even having it set as a locked/hidden filter on the Dashboard the end-users actually visit. I hope this helps explain the feature in more detail!

Doug Creighton

10/17/2024, 4:41 PM

Thanks for the input.

12 Views

Open in Slack

Previous Next