# gooddata-cn
b
Hi everyone, the data that is uploaded to GoodData, where is it stored, in Redis or in PostgreSQL? I'm talking about the 250MB in the free plan and the 1GB in the "Growth" plan.
m
Hi Benjamin, that depends on whether we are talking about GoodData.CN (the self-hosted Docker/Kubernetes version) or the GoodData Platform (hosted by GoodData on their servers).
• In the GoodData Platform, the data are physically loaded into the workspaces and stored in the Postgres back-end. This GoodData-hosted Postgres back-end is queried every time a calculation is made on the front-end. This internal Postgres is not exposed or otherwise accessible.
• In GoodData.CN, the source data are not copied from the original database and remain in it. Every time a calculation is made from the front-end, the query is executed against the original database (not GoodData), and some aggregated results are temporarily stored in the Redis cache.
b
If I have BigQuery connected to my GoodData.CN, you are telling me that this data (my main "datasource" tables) is not copied from the original database, but every time I need data from it, GoodData has to go directly to BQ to look for it? (for this example I use BigQuery)
m
Yes, if you have GoodData.CN connected to your BQ, then GoodData uses your BQ to execute the queries needed to calculate the insights and metrics. It connects to it automatically. And it temporarily stores some caches of the calculated results in Redis to speed up calculations in case the same query is needed again after it has already been calculated. Not physically copying data from the source is one of the main differences between the hosted GoodData Platform and GoodData.CN.
b
I understand.
.....
What would be the criteria to know which metrics/results are cached (in Redis) and which are executed directly in BQ? (for this example)
To be precise, I am not asking about the results or calculated metrics, but about the main data extracted from the data source. In the case of the GoodData Platform, through a cronjob I can search my "datasource" for the data available in my dataset in BigQuery; this process (cronjob) extracts the available tables in that dataset.
m
I believe it is something like "the same query was executed" and, at the same time, the upload notification API https://www.gooddata.com/developers/cloud-native/doc/1.7/administration/add-data-sources/notification/ was not called between the cache creation and the subsequent execution. But I am not an expert on this. Maybe @Jan Soubusta will know better.
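For reference, a minimal sketch of calling that upload notification endpoint with Python's `requests` library. The endpoint path, host, data source id, and token here are assumptions based on the linked 1.7 docs; check them against your deployment and version:

```python
import requests

GOODDATA_HOST = "http://localhost:3000"   # assumption: your GoodData.CN endpoint
DATA_SOURCE_ID = "my-bigquery-ds"         # assumption: id chosen at data source registration
API_TOKEN = "<your-api-token>"            # assumption: a valid bearer token

# Tell GoodData.CN that the underlying tables changed, so caches for this
# data source are invalidated and subsequent reports are recalculated.
# Path follows the 1.7 docs linked above; verify for your version.
response = requests.post(
    f"{GOODDATA_HOST}/api/actions/dataSources/{DATA_SOURCE_ID}/uploadNotification",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
```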
b
""the same query executed"" <<---I imagined something like this definition, which basically is the definition of a "cache"
thanks for your time @Michal Hauzírek..
I can deduce from this that each calculated metric should be stored in redis, but the original data that I use to build these metrics? for example if my data source is a table of 10 records, and after a day there are 20 records (table created a unique data set in bigquuery), so I need to be able to update this data in my GD.CN... When I update my datasource again, where are these 20 records stored? .. in my redis ?? ..(I'm not talking about calculated metrics)
j
Actually it is a little bit more complicated 😉 A report definition consists of:
1. metrics, dimensionality (attributes), filters
2. sort order, pivoting definition (if required), and paging
In GD.CN we cache the so-called raw result (not reflecting point 2) and the final result (including point 2). If you request a different setup of point 2, no SQL is executed against your data source; only the raw result is transformed into the final result in a different way.
Finally, we provide an optional 3rd level of caching. It can be turned on during data source registration (attributes enableCaching and cachePath). When you turn it on and issue a report, all levels of aggregation (pre-aggregations) are materialized into your data source, into a schema configured in cachePath. If you issue a report which is similar to a previously executed one (same pre-aggregations), these pre-aggregations are reused.
All levels of caches can be invalidated by calling the uploadNotification API mentioned by @Michal Hauzírek. Best practice is to call it as part of your ETL process that changes the tables mapped into GD.CN.
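As a hedged sketch of what enabling that optional 3rd caching level could look like via the entities API: the JSON:API shape follows the GD.CN data source docs, but the host, ids, JDBC URL, token, and cache schema name below are all placeholders/assumptions, not values from this thread:

```python
import requests

GOODDATA_HOST = "http://localhost:3000"   # assumption: your GoodData.CN endpoint
API_TOKEN = "<your-api-token>"            # assumption: a valid bearer token

# Register (or update) a BigQuery data source with the optional 3rd caching
# level turned on: pre-aggregations get materialized into the schema named
# in cachePath, inside the data source itself.
payload = {
    "data": {
        "id": "my-bigquery-ds",              # hypothetical data source id
        "type": "dataSource",
        "attributes": {
            "name": "My BigQuery",
            "type": "BIGQUERY",
            "url": "jdbc:bigquery://...",    # assumption: your JDBC URL
            "schema": "my_dataset",          # dataset mapped into GD.CN
            "enableCaching": True,
            "cachePath": ["gd_cache"],       # schema for materialized pre-aggregations
        },
    }
}
response = requests.put(
    f"{GOODDATA_HOST}/api/entities/dataSources/my-bigquery-ds",
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/vnd.gooddata.api+json",
    },
    json=payload,
)
response.raise_for_status()
```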
b
Thanks Jan, very good and complete answers.
But I have another question.
In the example above, if I have my table of 10 records in BigQuery and it grows to 20 records, in my ETL process I will notify GoodData.CN via the API that I have 10 new records. Does GoodData then reload all 20 records, or just the 10 missing ones? And after the data is loaded correctly, are the metrics recalculated?
j
We load only report results, which are aggregations; we do not load the raw data. Let's say you have 1M rows in a table. You issue a report aggregating facts in this table by country_id (cardinality is, let's say, 100). We download 100 rows, not 1M rows. When you call the uploadNotification API, all caches are invalidated and everything must be recalculated completely. We do not support any kind of incremental refresh of these caches yet. Is it clear now?
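To make Jan's example concrete, the query pushed down to BigQuery would conceptually look like the string below. This is purely illustrative: GoodData.CN generates the SQL internally, and the table and column names here are hypothetical:

```python
# Illustrative only: the point is that just the aggregated result
# (~100 rows, one per country_id) leaves BigQuery, never the 1M raw rows.
PUSHED_DOWN_SQL = """
SELECT country_id,
       SUM(amount) AS amount_sum   -- hypothetical fact column
FROM my_dataset.orders             -- hypothetical source table
GROUP BY country_id
"""
```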
👍 1
b
It's very clear, thanks a lot!