I’m getting some unexpected behaviour from the COU...
# gd-beginners
j
I’m getting some unexpected behaviour from the COUNT function. I was told that it always runs a UNIQUE count but that isn’t happening for a visual I’m testing. See thread
metric is a simple unique count of a user id
Copy code
SELECT count({label/CURRENT_SERVICE_EDUCATORS_GD.USER_ID})
When I run the query in Snowflake using the table GoodData sees I get a unique count of 289 for a set of services I am QC’ing
Copy code
SELECT COUNT(DISTINCT USER_ID) as UNIQUE_USERS
FROM current_service_educators_gd
image.png
In GoodData it does not take a unique count
353 is the total rows but there are some duplicate user_ids hence the correct number is 289
This appears to be the query Snowflake received from GoodData
Copy code
SELECT "m_d57f747f246d7b3ce3ed53709c95bfd9" AS "m_1"
FROM (SELECT COUNT("CURRENT_SERVICE_EDUCATORS_GD"."USER_ID") AS "m_d57f747f246d7b3ce3ed53709c95bfd9", TRUE AS "def_m_d57f747f246d7b3ce3ed53709c95bfd9"
FROM "GOODDATA"."CURRENT_SERVICE_EDUCATORS_GD"
INNER JOIN "GOODDATA"."ALL_SERVICES" ON "CURRENT_SERVICE_EDUCATORS_GD"."SERVICE_ID" = "ALL_SERVICES"."SERVICE_ID"
WHERE "ALL_SERVICES"."UNIQUE_SERVICE_NAME" IN ({LIST OF SERVICES REDACTED})
GROUP BY CAST(NULL AS INTEGER)) AS "t2"
WHERE "t2"."def_m_d57f747f246d7b3ce3ed53709c95bfd9"
why is this failing?
why is this running
Copy code
SELECT COUNT("CURRENT_SERVICE_EDUCATORS_GD"."USER_ID")
in particular - surely if count always behaves as unique count it should always be a distinct count?
I can see other examples of counts where it does use distinct count
how is it possible for a COUNT to not be treated as distinct?
@William Whelan FYI
I ran the same thing without a service filter to simplify things and it is still doing the same thing
Here is the simplified query from Snowflake - no filter
Copy code
SELECT "m_9247d7c0652127f5e523386f63d9f023" AS "m_1"
FROM (SELECT COUNT("USER_ID") AS "m_9247d7c0652127f5e523386f63d9f023", TRUE AS "def_m_9247d7c0652127f5e523386f63d9f023"
FROM "GOODDATA"."CURRENT_SERVICE_EDUCATORS_GD"
GROUP BY CAST(NULL AS INTEGER)) AS "t1"
WHERE "def_m_9247d7c0652127f5e523386f63d9f023"
It seems very clearly to be failing to apply a distinct count
m
iirc (I'm just starting out) you can specify the table identity that you want to run the distinct over.
Thats what the second parameter in the count is for. Possibly the count could be different depending where in the multi dimensional cube you were looking from.
j
thanks Marcel that may be the issue - this table is linked to another table downstream
I’ve gone with
Copy code
SELECT COUNT({label/CURRENT_SERVICE_EDUCATORS_GD.USER_ID}) USING {dataset/CURRENT_SERVICE_EDUCATORS_GD}
with the expectation that this will tell it to count the users on the actual educators table
I’ve cleared the cache and I’m still getting the wrong count. Can someone explain how my metric should be structured to force it to do a distinct count on current_service_educators_gd ?
For reference my current_service_educators_gd dataset is linked in the data model to another set by user_id
I am going to try and force it to act the way I want to by creating a copy of user ID that can act not as a key but I absolutely shouldn’t have to
the workaround above fixed the problem - GoodData always seems to have a problem with doing counting on primary keys in the model and copying the key to use as a counter worked. But there MUST be a way around this in the metrics - what should I have done?
m
Hi Jamie, do I understand it correctly, that you have a table where user_id is NOT unique and in GoodData you have made this column a primary key of the dataset and running COUNT on top of it? If yes, then I think this might be the cause. I believe the system’s query optimizer assumes anything marked as primary key is unique in the data source and therefore it uses COUNT(x) instead of in general more expensive COUNT(DISTINCT x) in the generated SQL. I do not know your exact use case and data, but if the column does not have unique values, I would not recommend marking it as primary key in GoodData as it will not only affect how distinct counts are calculated, but might also affect joins to other datasets where the data can be multiplicated by the join (same as in SQL). A dataset in GoodData does not have to have a primary key unless you want to reference to it from another dataset.
j
right I see thanks @Michal Hauzírek - correct the table is unique on combination of user and service id and is linked to another table by user id only because we want the service level filters to carry through. It sounds like we should always have a derived unique key in GoodData? Still, it would be good if the documentation covered the fact that count is not always unique in every context - it is clearly possible for COUNT to behave differently
m
I see, GoodData Cloud now allows you to set a compound primary key consisting of several attributes which are unique in combination (i.e. user_id + service_id). And you can even use this compound key consisting of multiple columns as a reference/join to another dataset. But it is not possible to reference something else than a primary key (i.e. only client_id in your case if you had a compound key client_id + service_id), it always needs to be the whole primary key. I agree that the documentation should clearly describe these limitations and potential consequences including the impact of COUNT. Let me raise this to the documentation team.