# gooddata-cn
s
How should we model multi-valued attributes, e.g. values stored as an array in Postgres?
j
Hi @Sheldon Nathan, you can transform it into an M:N relationship in the logical data model.
j
M:N is definitely a way to go in this case, but if I understand @Sheldon Nathan's situation correctly, he has these attribute values stored in PostgreSQL ARRAY data types, which are not supported by GD.CN/Cloud, right? So he would have to implement an ELT process to extract these arrays, right?
j
Exactly. The ARRAY data type is not supported when mapping data in the DB to the GD logical model, so it is necessary to transform the data by an ETL process, or by a database view (defined by an SQL query), into data mart structures which do not contain complex data types such as arrays. There are also other alternatives for building the data mart structures in some simple cases. For an array with a fixed small size you can split it into separate columns (e.g. `attribute_1`, `attribute_2`, etc.). A multi-value attribute which contains for example values like `{'tag1'}` or `{'tag1','tag2'}` can be transformed by ETL into attributes `Is tag1` = Y/N and `Is tag2` = Y/N. Modeling the data mart using an M:N relation can cope with any size of array. You can select the approach depending on the use case. Each of the mentioned modeling approaches will offer a slightly different user experience for the users who will be creating insights in GoodData.
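The three transformation options above can be sketched as a minimal ETL step in Python. All names here are illustrative, not part of any GoodData API; the sketch only shows the shape of the data each approach produces.

```python
# Hypothetical ETL helpers flattening a PostgreSQL ARRAY column three ways
# before loading it into a data mart without complex data types.

def split_fixed_size(tags, size=2):
    """Fixed small array -> separate columns attribute_1, attribute_2, ..."""
    return {f"attribute_{i + 1}": (tags[i] if i < len(tags) else None)
            for i in range(size)}

def to_flag_columns(tags, known_tags=("tag1", "tag2")):
    """Small known value set -> one Y/N attribute per possible value."""
    return {f"is_{t}": ("Y" if t in tags else "N") for t in known_tags}

def to_bridge_rows(row_id, tags):
    """Any array size -> bridge rows backing an M:N relation in the LDM."""
    return [(row_id, t) for t in tags]
```

For example, `to_flag_columns(["tag1"])` yields `{"is_tag1": "Y", "is_tag2": "N"}`, while `to_bridge_rows(7, ["tag1", "tag2"])` yields the two bridge rows `(7, "tag1")` and `(7, "tag2")`.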
j
@Jakub Sterba AFAIK there is an internal Jira in our backlog focused on complex data types. Could you extend it with this use case (if it is not already there)?
s
The ask/need here is twofold:
1. Allow displaying multiple values in charts, e.g. multi-choice responses on a form. What did a person eat when their sugar levels exceeded a certain threshold? User ate fish, beans and soup.
2. Allow analysis on array-type attributes: show me sugar levels when the user's meal included fish and eggs.
@Jan Soubusta what is the requirement/design/outcome of the ticket to support array/complex data types?
j
I just extended it, now it looks like this:
- JSON data type
  - Create an attribute/fact from a JSON element (key is the attribute name, value is its value)
  - Document the performance impact and recommendations (e.g. some databases can index JSON fields, …)
  - The first step could be easier - what if I specified a column name as “person.name”? Would it generate `SELECT "person.name"` SQL? If yes, it could work out of the box.
- Array data type
  - Fact representing COUNT of elements
  - Attribute
    - Search use case - e.g. `"COMPLETED" in Status`, `Status like "Compl%"`
    - Slice by the attribute (ARRAY), display the values concatenated by e.g. a space or comma
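To make the proposals concrete, here is a small Python sketch of the semantics being proposed (JSON element lookup, element count as a fact, containment search). This is only an illustration of the intended behavior, not existing GD.CN functionality.

```python
import json

def json_attribute(json_doc: str, key: str):
    """Attribute value = JSON element addressed by a dotted key, e.g. 'person.name'."""
    value = json.loads(json_doc)
    for part in key.split("."):
        value = value[part]
    return value

def array_count_fact(arr):
    """Fact representing the COUNT of elements in an ARRAY column."""
    return len(arr)

def array_contains(arr, needle):
    """Search use case, e.g. "COMPLETED" in Status."""
    return needle in arr
```

Here `json_attribute('{"person": {"name": "Ada"}}', "person.name")` would return `"Ada"`.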
Btw there is one more solution coming soon - so-called SQL datasets. You will be able to create a new type of dataset in our LDM - an SQL dataset, defined by any SQL query. In this case, you could create a STRING attribute by applying the `array_to_string()` function. Would that be an acceptable temporary (or even permanent) solution for you?
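The effect of PostgreSQL's `array_to_string()` can be simulated in Python: however long the array, the SQL dataset exposes a single STRING attribute, so the LDM only ever sees one column. The row layout below is illustrative.

```python
# Simulating PostgreSQL's array_to_string(meal_items, ', '):
# a variable-length array collapses into one STRING attribute.

def array_to_string(items, delimiter=", "):
    return delimiter.join(items)

row = {"user_id": 1, "meal_items": ["fish", "beans", "soup"]}
meal_attribute = array_to_string(row["meal_items"])
# meal_attribute == "fish, beans, soup"
```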
s
On the JSON support, I believe databases that support JSON have slightly different SQL syntax for referring to the attribute and nested fields, e.g. Postgres uses `->`, while Oracle requires a function wrapper - will this be customizable based on the DB, or will it be baked in?
SQL datasets sound interesting ... when are they targeted for release?
j
Good point! If we support complex data types as proposed above, we will generate the right dialect based on the type of your data source, just like we already do, e.g. in the case of date arithmetic (really tricky). In the case of SQL datasets, you have to use the corresponding dialect, and once you decide to migrate to a different DB, you will have to update the SQL datasets accordingly.
SQL datasets should be released in March, maybe under a feature flag first, but very soon for all.
Watch the release notes 😉
s
And will this be cloud only or on CN too?
j
No limitation
s
Also, with SQL datasets, how will variable-length arrays be handled in the LDM? We do not know the number of elements (LDM attributes) at compile time.
Consider an SQL dataset for 'Meal items' which could have one or more of {fish, eggs, beans, beef, rice, salad}.
j
I assume it will be just a single attribute if you transform the array into a string using the `array_to_string()` function.
j
Exactly. But it comes with obvious limitations. When you concatenate the values, you cannot apply an exact match when creating e.g. filters, like `attribute in ('fish', 'eggs')`; instead you can apply only LIKE. So it is definitely just a partial solution.
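The limitation is easy to demonstrate in Python with a made-up row: an IN-style exact match compares against the whole concatenated string and never hits, while a LIKE-style substring test can even over-match.

```python
# Why exact-match filters stop working on a concatenated attribute.
concatenated = "catfish, beans"

# IN-style exact match fails: the attribute value is the whole string,
# never one of the individual meals.
assert concatenated not in ("fish", "eggs")

# LIKE '%fish%' matches, even though the meal was catfish, not fish.
assert "fish" in concatenated
```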
j
Another option is to use the `unnest()` function to transform the data into multiple rows and use it with an M:N relation in the LDM. The attribute "Meal" will then let you select data which contains at least one of the selected meals in the filter dropdown.
With a more complex metric you can probably select data which contains all of the selected meals.
You can probably combine the approaches: add to your model one concatenated attribute for displaying multi-value data in a single table cell, plus an M:N relation to a Meal dimension dataset for the filtering use cases.
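The `unnest()` approach can be sketched as follows (sample rows are made up): each array row explodes into one bridge row per element, and the "at least one selected meal" filter becomes a membership test over the bridge.

```python
# Simulating unnest(): explode each array row into (fact_id, meal) bridge
# rows, which back the M:N relation between the fact dataset and a Meal dataset.
fact_rows = [
    (1, ["fish", "eggs"]),
    (2, ["beans"]),
]

bridge = [(fid, meal) for fid, meals in fact_rows for meal in meals]
# bridge == [(1, "fish"), (1, "eggs"), (2, "beans")]

# "Contains at least one of the selected meals" via the bridge:
selection = {"fish", "beans"}
matching_fact_ids = {fid for fid, meal in bridge if meal in selection}
# matching_fact_ids == {1, 2}
```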
s
@Jakub Sterba I suppose that could be one approach
j
You can also create a dimension with all combinations and use an M:N relation between these combinations and the Meal dimension. It may reduce the size of the bridge dataset and help query performance.
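A rough sketch of that combination-dimension idea, with made-up rows: fact rows point at one row per distinct meal combination, and the bridge connects combinations (not individual facts) to meals, so it scales with the number of unique combinations.

```python
# Sketch: combination dimension + bridge to the Meal dimension.
fact_rows = [
    (1, ("eggs", "fish")),
    (2, ("eggs", "fish")),   # same combination as fact 1
    (3, ("beans",)),
]

# Combination dimension: one surrogate id per distinct combination.
combos = {meals: cid
          for cid, meals in enumerate(sorted({m for _, m in fact_rows}), start=1)}

# Facts reference the combination id instead of carrying the array.
facts = [(fid, combos[meals]) for fid, meals in fact_rows]

# Bridge between the combination dimension and the Meal dimension:
# 3 rows here, versus 5 rows for a per-fact bridge.
bridge = [(cid, meal) for meals, cid in combos.items() for meal in meals]
```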
s
It would be great if the interpretation of the usage could be handled internally by GD for array types; that would reduce the learning curve for insight creators 🙂
@Jakub Sterba could you point me to any documentation around creating the dimension?
j
By "dimension" I meant a table/dataset containing all the unique meal names.
You can learn about the dimensional approach for example in this course: https://university.gooddata.com/designing-data-models