# gooddata-cn
z
Hi All, I would like to ask for some support/help with GoodData.CN performance. We’re experiencing really slow UI rendering and would like to find the bottlenecks in order to fine-tune it in production. The current dashboard load is ~20-25 s from scratch, and all the data is in the GoodData cache, so no queries run on the DB side when refreshing; we should be able to reduce the duration to <5 s. Could you please check my settings? I’m adding them to this thread. Thanks!
The servers are in AWS (us-west-2). I tested it from Europe on gigabit internet, so I know some latency is expected because of that, but I asked a team member to test it in the US and they saw only minimally faster performance.
image.png
image.png
here’s the loading video:
gooddata_loading.mov
prod settings:
```
INSTANCE_COUNT: 6
INSTANCE_TYPE: t3.large

--set replicaCount=${REPLICA_COUNT}
REPLICA_COUNT: 1

PULSAR_VERSION: 2.9.2
GOODDATA_VERSION: 2.0.1
```
pulsar-values.yaml
gooddata-values.yaml
Any suggestion/recommendation would be appreciated!
j
We are discussing it internally, I will answer ASAP
z
Hi @Jan Soubusta thank you very much!
j
Btw, have you integrated any kind of monitoring with the GoodData.CN deployment? Something like Prometheus + Grafana? Are you able to monitor metrics like CPU/RAM usage per POD?
z
we’ve integrated CloudWatch, but let me ask our devops guys
just coming here from the email exchange with Martin, FYI @Martin Svadlenka @Ondrej Macek
j
We are missing this info in our public docs and are going to fix it soon. We strongly recommend setting up appropriate monitoring infrastructure. We plan to document some guides, and we are also considering open-sourcing e.g. Grafana dashboards as an example of how to do it. Without monitoring at least CPU/RAM per POD, we are quite blind as to which resource limits should be changed. For now, I can only guess based on our past experience:
• In the video, I can see that everything is slow, not only report executions but also metadata APIs (e.g. /attributes).
• Update the memory limits of the following services: metadata-api, calcique, sql-executor and result-cache. Start with 1G and increase further if it is still too slow (see the sketch below this message).
• Update the CPU limits of the same services. Start with 1000m (millicores) and increase if it is still too slow.
• Check how the external Postgres and Redis are deployed and limited, especially how much memory Redis can consume. Based on how much data you analyze and how big the report results can be, size the Redis memory limit accordingly so LRU does not evict cache records too often.
Please send more information about how big the datasets you analyze are and how complex your model is (how many tables/datasets, how many columns in the widest table and on average).
Finally, have you considered trying our cloud offering? https://www.gooddata.com/trial/ We recommend going this way if you do not have enough experience with Kubernetes and the related infra needed for running GoodData.CN on-premise.
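A minimal sketch of how those limits could be raised via Helm, assuming the gooddata-cn chart exposes per-service resources blocks under keys like the ones below (the release name, namespace and value paths are assumptions; check them against your gooddata-values.yaml and the chart documentation before applying):
```bash
# Sketch only: the metadataApi/calcique/sqlExecutor/resultCache value paths are assumed,
# not taken from the chart docs -- verify the real keys before running this.
helm upgrade gooddata-cn gooddata/gooddata-cn -n gooddata \
  --reuse-values \
  --set metadataApi.resources.limits.memory=1Gi \
  --set metadataApi.resources.limits.cpu=1000m \
  --set calcique.resources.limits.memory=1Gi \
  --set calcique.resources.limits.cpu=1000m \
  --set sqlExecutor.resources.limits.memory=1Gi \
  --set sqlExecutor.resources.limits.cpu=1000m \
  --set resultCache.resources.limits.memory=1Gi \
  --set resultCache.resources.limits.cpu=1000m
```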
z
• In the video, I can see that everything is slow, not only report executions, but also metadata APIs (e.g. /attributes).
exactly! Thanks, we will try the mem/CPU tuning options, and once I get access to our CloudWatch I can tell you more about our infra.
r
Just a brief note - you're using t3-class instances. These instance types are so-called "burstable": they have only a 30% CPU baseline, and additional CPU power is provided using so-called CPU credits. You should also check the CPU credit status in CloudWatch to make sure the credits are not depleted. If they are, CPU performance drops to the baseline (30%) until the credits are "recharged".
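If it helps, a rough CLI sketch for checking the credit balance of one node over the last few hours (instance ID and region are placeholders; GNU date syntax assumed):
```bash
# A balance close to zero means the instance is throttled to its baseline.
aws cloudwatch get-metric-statistics \
  --region us-west-2 \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum
```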
z
yep, great, thanks for the heads-up!
anyway, do you have a recommended instance type for PROD that is, let’s say, okay for the GoodData application?
r
for the same instance sizes, it's better to use m6i.large (+16% price) or m6a.large (+4% price), which do not suffer a performance drop under sustained load. But I recommend checking CloudWatch for the CPU credit status first - if the CPU credit is not depleted, changing instance types will not help, and you should follow the resource tuning guidance described above. Monitoring your EC2 instance utilization and collecting container metrics in k8s is crucial for any reasonable decision.
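As a starting point before a full Prometheus/Grafana setup, a quick sketch of the kind of container metrics worth watching (assumes metrics-server is installed in the cluster; the namespace is a placeholder):
```bash
# Node-level utilization
kubectl top nodes
# Per-container CPU/RAM in the GoodData.CN namespace, busiest first
kubectl top pods -n gooddata --containers --sort-by=cpu
```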
z
We did a bunch of modifications today, but the page load time is still the same, no improvement noticed… that would mean the bottleneck is somewhere else…
for example we had this request:
image.png
it’s 2 seconds, but I assume it should be around 50 ms, right? 🙂
based on our metrics, there is no high load in the cluster/Redis/Postgres, and the actual usage is far below the limits 😞 any ideas from you guys? Maybe the proxy/LB side could be a bottleneck somewhere?
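One way to take the browser and its connection limit out of the picture is to time a single API call from the command line; a rough sketch, where the hostname, endpoint and token are placeholders for your deployment:
```bash
# Prints a timing breakdown (DNS, TCP, TLS, time-to-first-byte, total) for one request.
curl -s -o /dev/null \
  -H "Authorization: Bearer $GDC_API_TOKEN" \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  "https://analytics.example.com/api/v1/entities/workspaces"
```
If the total is close to what the browser shows, the slowness is on the server/LB side; if it is much lower, the requests are mostly queueing in the browser.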
j
Well, it depends. We need to know more about your use case. Everything recorded in the video relates to a single dashboard, right? If yes, I need to know what the dashboard looks like: how many insights there are and how many dashboard filters there are.
Also, related to the last screenshot: what is the cardinality of the filter values? The collectLabelElements API collects distinct values for filters. If you create a filter for a label (-> database column) with high cardinality, the DISTINCT SQL query can run for a long time.
Anyway, if your monitoring is right and there is no resource starvation, then it is weird that everything is so slow. One thing is the network throughput between the platform and your browser. The second thing is how many concurrent processes are executed against the server - if there are not enough CPUs available, processes (threads) can wait for each other. Recently we saw similar slowness with another customer; we finally fixed it by adding a lot of CPUs to the calcique PODs. But that should not help in the case of API calls like /attributes. Btw, are these calls faster now than before?
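If it is easy for you to query the source database, a hypothetical check of the cardinality behind a slow filter (connection string, table and column names are placeholders, and psql is just one possible client):
```bash
# A high distinct count here makes the collectLabelElements DISTINCT query expensive.
psql "$DATA_SOURCE_URL" -c \
  "SELECT COUNT(DISTINCT customer_name) AS label_cardinality FROM orders;"
```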
z
Everything recorded on the video relates to a single dashboard, right?
yes
then it is weird that everything is so slow.
yes, exactly. We did a bunch of fine-tuning last week and will check it this week on PROD from the US, to minimize latency. If we cannot reach any significant improvement, would it be possible to have a quick call with screen sharing to get some feedback from you guys? Thanks
r
Hi Zoltán, I just checked your video again in more detail. My finding is that your dashboard probably contains a lot of reports (indicated by many parallel execute calls). The point is that Chrome, Firefox, Safari and many other browsers limit the number of concurrent HTTP/1.1 connections to six per domain (IE 11 supports 13 connections). Requests that spend a lot of time waiting for a free connection slot show a long grey bar in the waterfall statistics ("Stalled"). Note this limitation applies to HTTP/1.1; if your load balancer supported HTTP/2, this would not happen.
Optionally, split the large dashboard into multiple parts so that none of them contains too many reports.
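A quick way to confirm which protocol your load balancer actually negotiates (the hostname is a placeholder):
```bash
# "HTTP/2 200" means HTTP/2 is available; "HTTP/1.1 200" means the 6-connection limit applies.
curl -sI --http2 https://analytics.example.com/ | head -n 1
```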
z
Hi @Robert Moucha, yep, you’re right, I was thinking about this too - Chrome limits concurrent downloads to 6, but I didn’t know it applies only to HTTP/1.1… I think we created our LB from the GoodData doc, but maybe I’m wrong… let us check our LB instance
anyway we have 18 charts on that dashboard
r
Yes, the documentation contains an example setup that creates a classic ELB, which supports only HTTP/1.x. For HTTP/2, you need a Network Load Balancer (NLB).
z
oh I see… let us check this in our infra… anyway, maybe it would be worth mentioning in the doc that this limitation exists with that config, just as a note 🙂
r
are you using ACM for delivering TLS certs or are you using cert-manager?
Let's Encrypt, according to your values file
We're using a slightly non-standard setup, with an ACM certificate loaded on the NLB listener and all traffic passed to the ingress-nginx controller, so I can't offer you our real-life config. Usually, you need to set up a plain L4 (TCP) NLB and pass traffic unmodified to ingress-nginx, which maintains all virtual hosts and has SSL certificates provided by cert-manager. There is a lot of existing documentation on this topic. What is most important is to have the following annotations on the ingress-nginx service:
```yaml
controller:
  service:
    annotations:
      # deploy NLB instead of ELB
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      # support TLS 1.3, disable TLS 1.1 and lower
      service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: 'ELBSecurityPolicy-TLS13-1-2-2021-06'
    targetPorts:
      http: http
      # this is the default in the chart, but differs from GD docs - https is NOT terminated on the LB
      https: https
    # Preserve client IP address
    externalTrafficPolicy: Local
```
Alternatively, you can deploy the NLB on your own (using CloudFormation or Terraform) and use the other controller service annotations (service.beta.kubernetes.io/aws-load-balancer-*) to make ingress-nginx work with this external NLB.
j
Anyway, if all reports on the dashboard have already been executed and the results are therefore cached in GD.CN, everything should load in a few seconds even with 20 insights and the browser limit of 6 concurrent downloads (roughly 20 requests over 6 connections is only 3-4 waves, so even at a few hundred milliseconds per cached execution that adds up to 1-2 seconds).
We can arrange a call, DM me.
z
Thank you @Robert Moucha, we’ll check it probably tomorrow
Anyway, if all reports on the dashboard have already been executed and the results are therefore cached in GD.CN, everything should load in a few seconds even with 20 insights and the browser limit of 6 concurrent downloads.
yep, everything is cached, and when I refresh the page now it’s around 12-15 s to see every insight, plus there is some background stuff, so the whole page load is around 20 s… hopefully the NLB can help us out here
We can arrange a call, DM me.
Thank you @Jan Soubusta, we’ll look into the NLB stuff and I may reach out to you if needed
a
Hello all. We changed the instance type to m6a.xlarge, configured the NLB to forward all traffic to the Nginx ingress controller with cert-manager certificates from Let's Encrypt, and increased the requested CPU and memory for the GoodData pods. We checked that none of the pods are starved of their requested resources at the time of the request. But the performance has not changed at all. What other tuning steps can be applied?
j
We need monitoring on your side, or even tracing. In our SaaS deployment we use Prometheus/Grafana (monitoring, alerting) and Jaeger (tracing). We need to find out which part of the request processing is slow and which microservice is responsible.
But let's iterate. Please find the slowest report (AFM) execution and its traceId, then collect all log records from all PODs for this traceId and we can try to analyze it. Meanwhile, please set up monitoring infrastructure on your side, or even tracing.
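A rough sketch of collecting those log records with kubectl (namespace, time window and the traceId are placeholders):
```bash
NAMESPACE=gooddata-cn
TRACE_ID=0af7651916cd43dd8448eb211c80319c   # replace with the traceId of the slow execution

# Grep the traceId out of every pod's logs and prefix each line with the pod name.
for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
  kubectl logs -n "$NAMESPACE" "$pod" --all-containers --since=2h \
    | grep "$TRACE_ID" \
    | sed "s|^|$pod: |"
done > "trace-$TRACE_ID.log"
```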
z
Hi Jan, thanks! Sure, we should have a full monitoring system in place; currently we’re only working with CloudWatch to see the resource load. I don’t think we can narrow it down to one service exactly - to me it looks like everything is a bit slower than expected when we check the network tab in the browser, so maybe there is a bottleneck somewhere between the services and the DNS. So let’s organize a quick screen-sharing meeting to first check the network tab together; hopefully you can spot what the problem might be. A reference site running GoodData.CN would also be useful, so we could compare its network tab.
j
DM me
t
@Jan Soubusta is there any news regarding open-sourcing the Grafana dashboards?
j
Unfortunately, no.
p