I m getting some unexpected behaviour from the COUNT functio GoodData #gd-beginners

I’m getting some unexpected behaviour from the COU...

Jamie Lawrie

04/18/2024, 4:26 AM

I’m getting some unexpected behaviour from the COUNT function. I was told that it always runs a UNIQUE count but that isn’t happening for a visual I’m testing. See thread

Jamie Lawrie

04/18/2024, 4:26 AM

metric is a simple unique count of a user id

Copy code

SELECT count({label/CURRENT_SERVICE_EDUCATORS_GD.USER_ID})

Jamie Lawrie

04/18/2024, 4:28 AM

When I run the query in Snowflake using the table GoodData sees I get a unique count of 289 for a set of services I am QC’ing

Copy code

SELECT COUNT(DISTINCT USER_ID) as UNIQUE_USERS
FROM current_service_educators_gd

Jamie Lawrie

04/18/2024, 4:28 AM

image.png

Jamie Lawrie

04/18/2024, 4:28 AM

In GoodData it does not take a unique count

Jamie Lawrie

04/18/2024, 4:29 AM

353 is the total rows but there are some duplicate user_ids hence the correct number is 289

Jamie Lawrie

04/18/2024, 4:31 AM

This appears to be the query Snowflake received from GoodData

Copy code

SELECT "m_d57f747f246d7b3ce3ed53709c95bfd9" AS "m_1"
FROM (SELECT COUNT("CURRENT_SERVICE_EDUCATORS_GD"."USER_ID") AS "m_d57f747f246d7b3ce3ed53709c95bfd9", TRUE AS "def_m_d57f747f246d7b3ce3ed53709c95bfd9"
FROM "GOODDATA"."CURRENT_SERVICE_EDUCATORS_GD"
INNER JOIN "GOODDATA"."ALL_SERVICES" ON "CURRENT_SERVICE_EDUCATORS_GD"."SERVICE_ID" = "ALL_SERVICES"."SERVICE_ID"
WHERE "ALL_SERVICES"."UNIQUE_SERVICE_NAME" IN ({LIST OF SERVICES REDACTED})
GROUP BY CAST(NULL AS INTEGER)) AS "t2"
WHERE "t2"."def_m_d57f747f246d7b3ce3ed53709c95bfd9"

Jamie Lawrie

04/18/2024, 4:32 AM

why is this failing?

Jamie Lawrie

04/18/2024, 4:37 AM

why is this running

Copy code

SELECT COUNT("CURRENT_SERVICE_EDUCATORS_GD"."USER_ID")

in particular - surely if count always behaves as unique count it should always be a distinct count?

Jamie Lawrie

04/18/2024, 4:37 AM

I can see other examples of counts where it does use distinct count

Jamie Lawrie

04/18/2024, 4:38 AM

how is it possible for a COUNT to not be treated as distinct?

Jamie Lawrie

04/18/2024, 4:39 AM

@William Whelan FYI

Jamie Lawrie

04/18/2024, 4:47 AM

I ran the same thing without a service filter to simplify things and it is still doing the same thing

Jamie Lawrie

04/18/2024, 4:49 AM

Here is the simplified query from Snowflake - no filter

Copy code

SELECT "m_9247d7c0652127f5e523386f63d9f023" AS "m_1"
FROM (SELECT COUNT("USER_ID") AS "m_9247d7c0652127f5e523386f63d9f023", TRUE AS "def_m_9247d7c0652127f5e523386f63d9f023"
FROM "GOODDATA"."CURRENT_SERVICE_EDUCATORS_GD"
GROUP BY CAST(NULL AS INTEGER)) AS "t1"
WHERE "def_m_9247d7c0652127f5e523386f63d9f023"

It seems very clearly to be failing to apply a distinct count

Marcel van Vliet

04/18/2024, 4:54 AM

iirc (I'm just starting out) you can specify the table identity that you want to run the distinct over.

Marcel van Vliet

04/18/2024, 4:55 AM

Thats what the second parameter in the count is for. Possibly the count could be different depending where in the multi dimensional cube you were looking from.

Jamie Lawrie

04/18/2024, 4:55 AM

thanks Marcel that may be the issue - this table is linked to another table downstream

Jamie Lawrie

04/18/2024, 5:14 AM

I’ve gone with

Copy code

SELECT COUNT({label/CURRENT_SERVICE_EDUCATORS_GD.USER_ID}) USING {dataset/CURRENT_SERVICE_EDUCATORS_GD}

with the expectation that this will tell it to count the users on the actual educators table

Jamie Lawrie

04/18/2024, 5:16 AM

I’ve cleared the cache and I’m still getting the wrong count. Can someone explain how my metric should be structured to force it to do a distinct count on current_service_educators_gd ?

Jamie Lawrie

04/18/2024, 5:24 AM

For reference my current_service_educators_gd dataset is linked in the data model to another set by user_id

Jamie Lawrie

04/18/2024, 5:25 AM

I am going to try and force it to act the way I want to by creating a copy of user ID that can act not as a key but I absolutely shouldn’t have to

Jamie Lawrie

04/18/2024, 6:05 AM

the workaround above fixed the problem - GoodData always seems to have a problem with doing counting on primary keys in the model and copying the key to use as a counter worked. But there MUST be a way around this in the metrics - what should I have done?

Michal Hauzírek

04/18/2024, 6:39 AM

Hi Jamie, do I understand it correctly, that you have a table where user_id is NOT unique and in GoodData you have made this column a primary key of the dataset and running COUNT on top of it? If yes, then I think this might be the cause. I believe the system’s query optimizer assumes anything marked as primary key is unique in the data source and therefore it uses COUNT(x) instead of in general more expensive COUNT(DISTINCT x) in the generated SQL. I do not know your exact use case and data, but if the column does not have unique values, I would not recommend marking it as primary key in GoodData as it will not only affect how distinct counts are calculated, but might also affect joins to other datasets where the data can be multiplicated by the join (same as in SQL). A dataset in GoodData does not have to have a primary key unless you want to reference to it from another dataset.

Jamie Lawrie

04/18/2024, 8:15 PM

right I see thanks @Michal Hauzírek - correct the table is unique on combination of user and service id and is linked to another table by user id only because we want the service level filters to carry through. It sounds like we should always have a derived unique key in GoodData? Still, it would be good if the documentation covered the fact that count is not always unique in every context - it is clearly possible for COUNT to behave differently

Michal Hauzírek

04/19/2024, 10:33 PM

I see, GoodData Cloud now allows you to set a compound primary key consisting of several attributes which are unique in combination (i.e. user_id + service_id). And you can even use this compound key consisting of multiple columns as a reference/join to another dataset. But it is not possible to reference something else than a primary key (i.e. only client_id in your case if you had a compound key client_id + service_id), it always needs to be the whole primary key. I agree that the documentation should clearly describe these limitations and potential consequences including the impact of COUNT. Let me raise this to the documentation team.

2 Views

Open in Slack

Previous Next