# gooddata-cn
z
Hi All, I would like to ask for some support/help with GoodData.CN performance. We’re experiencing really slow UI rendering and would like to find the bottlenecks in order to fine-tune it in production. The current dashboard load is ~20-25 s from scratch, and all the data is in the GoodData cache, so no queries run on the DB side when refreshing; we should be able to reduce the duration to <5 s. Could you please check my settings? I’m adding them to this thread. Thanks!
The servers are in AWS (us-west-2). I tested it from Europe on gigabit internet, so I know some latency is expected because of that, but I asked a team member to test it in the US and they saw only minimally faster performance.
image.png
image.png
here’s the loading video:
gooddata_loading.mov
prod settings:
```
INSTANCE_COUNT: 6
INSTANCE_TYPE: t3.large

--set replicaCount=${REPLICA_COUNT}
REPLICA_COUNT: 1

PULSAR_VERSION: 2.9.2
GOODDATA_VERSION: 2.0.1
```
pulsar-values.yaml
gooddata-values.yaml
Any suggestion/recommendation would be appreciated!
j
We are discussing it internally, I will answer ASAP
z
Hi @Jan Soubusta thank you very much!
j
Btw, have you integrated any kind of monitoring with the GoodData.CN deployment? Something like Prometheus + Grafana? Are you able to monitor metrics like CPU/RAM usage per POD?
z
we’ve integrated CloudWatch, but let me ask our devops guys
just coming here from the email exchange with Martin, FYI @Martin Svadlenka @Ondrej Macek
j
We are missing this info in our public docs and are going to fix it soon. We strongly recommend setting up appropriate monitoring infrastructure. We plan to document some guides, and we are also considering open-sourcing e.g. Grafana dashboards as an example of how to do it. Without monitoring at least CPU/RAM per POD, we are quite blind as to which resource limits should be changed. For now, I can only guess based on our past experience:
• In the video, I can see that everything is slow, not only report executions but also metadata APIs (e.g. /attributes).
• Update the memory limits of the following services: metadata-api, calcique, sql-executor and result-cache. Start with 1G and increase further if it is still too slow (see the sketch below this message).
• Update the CPU limits of the same services. Start with 1000m (millicores) and increase if it is still too slow.
• Check how the external Postgres and Redis are deployed and limited, especially how much memory Redis can consume. Based on how much data you analyze and how big the report results can be, size the Redis memory limit accordingly so LRU does not evict cache records too often.
Please send more information about how big the datasets you analyze are and how complex your model is (how many tables/datasets, how many columns in the widest table and on average).
Finally, have you considered trying our cloud offering? https://www.gooddata.com/trial/ We recommend going this way if you do not have enough experience with Kubernetes and the related infra needed for running GoodData.CN on-premise.
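A minimal sketch of how those limits could be raised via Helm, assuming the gooddata-cn chart exposes per-service resources blocks under keys like the ones below (the release name, namespace and value paths are assumptions; check them against your gooddata-values.yaml and the chart documentation before applying):
```bash
# Sketch only: the metadataApi/calcique/sqlExecutor/resultCache value paths are assumed,
# not taken from the chart docs -- verify the real keys before running this.
helm upgrade gooddata-cn gooddata/gooddata-cn -n gooddata \
  --reuse-values \
  --set metadataApi.resources.limits.memory=1Gi \
  --set metadataApi.resources.limits.cpu=1000m \
  --set calcique.resources.limits.memory=1Gi \
  --set calcique.resources.limits.cpu=1000m \
  --set sqlExecutor.resources.limits.memory=1Gi \
  --set sqlExecutor.resources.limits.cpu=1000m \
  --set resultCache.resources.limits.memory=1Gi \
  --set resultCache.resources.limits.cpu=1000m
```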
z
• In the video, I can see that everything is slow, not only report executions, but also metadata APIs (e.g. /attributes).
exactly! Thanks, we will try the mem/CPU tuning options, and once I get access to our CloudWatch I can tell you more about our infra.
r
Just a brief note - you're using t3-class instances. These instance types are so-called "burstable": they have only a 30% CPU baseline, and additional CPU power is provided using so-called CPU credits. You should also check the CPU credit status in CloudWatch to make sure the credits are not depleted. If they are, CPU performance drops to the baseline (30%) until the credits are "recharged".
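If it helps, a rough CLI sketch for checking the credit balance of one node over the last few hours (instance ID and region are placeholders; GNU date syntax assumed):
```bash
# A balance close to zero means the instance is throttled to its baseline.
aws cloudwatch get-metric-statistics \
  --region us-west-2 \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum
```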
z
yep, great, thanks for the heads-up!
anyway, do you have a recommended instance type for PROD that is, let’s say, okay for the GoodData application?
r
for the same instance sizes, it's better to use m6i.large (+16% price) or m6a.large (+4% price), which do not suffer a performance drop under sustained load. But I recommend checking CloudWatch for the CPU credit status first - if the CPU credit is not depleted, changing instance types will not help, and you should follow the resource tuning guidance described above. Monitoring your EC2 instance utilization and collecting container metrics in k8s is crucial for any reasonable decision.
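As a starting point before a full Prometheus/Grafana setup, a quick sketch of the kind of container metrics worth watching (assumes metrics-server is installed in the cluster; the namespace is a placeholder):
```bash
# Node-level utilization
kubectl top nodes
# Per-container CPU/RAM in the GoodData.CN namespace, busiest first
kubectl top pods -n gooddata --containers --sort-by=cpu
```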
z
We did a bunch of modifications today, but the page load time is still the same, no improvement noticed… that would mean the bottleneck is somewhere else…
for example we had this request:
image.png
it’s 2 seconds, but I assume it should be around 50 ms, right? 🙂
based on our metrics, there is no high load in the cluster/Redis/Postgres, and the actual usage is far below the limits 😞 any ideas from you guys? Maybe the proxy/LB side could be a bottleneck somewhere?
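One way to take the browser and its connection limit out of the picture is to time a single API call from the command line; a rough sketch, where the hostname, endpoint and token are placeholders for your deployment:
```bash
# Prints a timing breakdown (DNS, TCP, TLS, time-to-first-byte, total) for one request.
curl -s -o /dev/null \
  -H "Authorization: Bearer $GDC_API_TOKEN" \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  "https://analytics.example.com/api/v1/entities/workspaces"
```
If the total is close to what the browser shows, the slowness is on the server/LB side; if it is much lower, the requests are mostly queueing in the browser.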
j
Well, it depends. We need to know more about your use case. Everything recorded in the video relates to a single dashboard, right? If yes, I need to know what the dashboard looks like: how many insights there are and how many dashboard filters there are.
Also, related to the last screenshot: what is the cardinality of the filter values? The collectLabelElements API collects distinct values for filters. If you create a filter for a label (-> database column) with high cardinality, the DISTINCT SQL query can run for a long time.
Anyway, if your monitoring is right and there is no resource starvation, then it is weird that everything is so slow. One thing is the network throughput between the platform and your browser. The second thing is how many concurrent processes are executed against the server - if there are not enough CPUs available, processes (threads) can wait for each other. Recently we saw similar slowness with another customer; we finally fixed it by adding a lot of CPUs to the calcique PODs. But that should not help in the case of API calls like /attributes. Btw, are these calls faster now than before?
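If it is easy for you to query the source database, a hypothetical check of the cardinality behind a slow filter (connection string, table and column names are placeholders, and psql is just one possible client):
```bash
# A high distinct count here makes the collectLabelElements DISTINCT query expensive.
psql "$DATA_SOURCE_URL" -c \
  "SELECT COUNT(DISTINCT customer_name) AS label_cardinality FROM orders;"
```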
z
Everything recorded on the video relates to a single dashboard, right?
yes
then it is weird that everything is so slow.
yes, exactly. We did a bunch of fine-tuning last week and will check it this week on PROD from the US, to minimize latency. If we cannot reach any significant improvement, would it be possible to have a quick call with screen sharing to get some feedback from you guys? Thanks
r
Hi Zoltán, I just checked your video again in more detail. My finding is that your dashboard probably contains a lot of reports (indicated by many parallel execute calls). The point is that Chrome, Firefox, Safari and many other browsers limit the number of concurrent HTTP/1.1 connections to six per domain (IE 11 supports 13 connections). Requests that spend a lot of time waiting for a free connection slot show a long grey bar in the waterfall statistics ("Stalled"). Note this limitation applies to HTTP/1.1; if your load balancer supported HTTP/2, this would not happen.
Optionally, split the large dashboard into multiple parts so that none of them contains too many reports.
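A quick way to confirm which protocol your load balancer actually negotiates (the hostname is a placeholder):
```bash
# "HTTP/2 200" means HTTP/2 is available; "HTTP/1.1 200" means the 6-connection limit applies.
curl -sI --http2 https://analytics.example.com/ | head -n 1
```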
z
Hi @Robert Moucha, yep, you’re right, I was thinking about this too - Chrome limits concurrent downloads to 6, but I didn’t know it applies only to HTTP/1.1… I think we created our LB from the GoodData doc, but maybe I’m wrong… let us check our LB instance
anyway we have 18 charts on that dashboard
r
Yes, the documentation contains an example setup that creates a classic ELB, which supports only HTTP/1.x. For HTTP/2, you need a Network Load Balancer (NLB).
z
oh I see… let us check this in our infra… anyway, maybe it would be worth mentioning in the doc that this limitation exists with that config, just as a note 🙂
r
are you using ACM for delivering TLS certs or are you using cert-manager?
Let's Encrypt, according to your values file
We're using a slightly non-standard setup, with an ACM certificate loaded on the NLB listener and all traffic passed to the ingress-nginx controller, so I can't offer you our real-life config. Usually, you need to set up a plain L4 (TCP) NLB and pass traffic unmodified to ingress-nginx, which maintains all virtual hosts and has SSL certificates provided by cert-manager. There is a lot of existing documentation on this topic. What is most important is to have the following annotations on the ingress-nginx service:
```yaml
controller:
  service:
    annotations:
      # deploy NLB instead of ELB
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      # support TLS 1.3, disable TLS 1.1 and lower
      service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: 'ELBSecurityPolicy-TLS13-1-2-2021-06'
    targetPorts:
      http: http
      # this is the default in the chart, but differs from GD docs - https is NOT terminated on the LB
      https: https
    # Preserve client IP address
    externalTrafficPolicy: Local
```
Alternatively, you can deploy the NLB on your own (using CloudFormation or Terraform) and use the other controller service annotations (service.beta.kubernetes.io/aws-load-balancer-*) to make ingress-nginx work with this external NLB.
j
Anyway, if all reports on the dashboard have already been executed and the results are therefore cached in GD.CN, everything should load in a few seconds even with 20 insights and the browser limit of 6 concurrent downloads (roughly 20 requests over 6 connections is only 3-4 waves, so even at a few hundred milliseconds per cached execution that adds up to 1-2 seconds).
We can arrange a call, DM me.
z
Thank you @Robert Moucha, we’ll check it probably tomorrow
Anyway, if all reports on the dashboard have already been executed and the results are therefore cached in GD.CN, everything should load in a few seconds even with 20 insights and the browser limit of 6 concurrent downloads.
yep, everything is cached, and when I refresh the page now it’s around 12-15 s to see every insight, plus there is some background stuff, so the whole page load is around 20 s… hopefully the NLB can help us out here
We can arrange a call, DM me.
Thank you @Jan Soubusta, we’ll look into the NLB stuff and I may reach out to you if needed
a
Hello all. We changed the instance type to m6a.xlarge, configured the NLB to forward all traffic to the Nginx ingress controller with cert-manager certificates from Let's Encrypt, and increased the requested CPU and memory for the GoodData pods. We checked that none of the pods are starved of their requested resources at the time of the request. But the performance has not changed at all. What other tuning steps can be applied?
j
We need monitoring on your side, or even tracing. In our SaaS deployment we use Prometheus/Grafana (monitoring, alerting) and Jaeger (tracing). We need to find out which part of the request processing is slow and which microservice is responsible.
But let's iterate. Please find the slowest report (AFM) execution and its traceId, then collect all log records from all PODs for this traceId and we can try to analyze it. Meanwhile, please set up monitoring infrastructure on your side, or even tracing.
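A rough sketch of collecting those log records with kubectl (namespace, time window and the traceId are placeholders):
```bash
NAMESPACE=gooddata-cn
TRACE_ID=0af7651916cd43dd8448eb211c80319c   # replace with the traceId of the slow execution

# Grep the traceId out of every pod's logs and prefix each line with the pod name.
for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
  kubectl logs -n "$NAMESPACE" "$pod" --all-containers --since=2h \
    | grep "$TRACE_ID" \
    | sed "s|^|$pod: |"
done > "trace-$TRACE_ID.log"
```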
z
Hi Jan, thanks! Sure, we should have a full monitoring system in place; currently we’re only working with CloudWatch to see the resource load. I don’t think we can narrow it down to one service exactly - to me it looks like everything is a bit slower than expected when we check the network tab in the browser, so maybe there is a bottleneck somewhere between the services and the DNS. So let’s organize a quick screen-sharing meeting to first check the network tab together; hopefully you can spot what the problem might be. A reference site running GoodData.CN would also be useful, so we could compare its network tab.
j
DM me
t
@Jan Soubusta is there any news regarding open-sourcing the Grafana dashboards?
j
Unfortunately, no.
p