# gooddata-cn
v
Topic: Scaling

We are planning to run a load test on our GD.CN installation. Before I start that, I wanted to collect some info on how GD.CN handles scaling, but did not manage to find anything.
• Is it supposed to autoscale? If so, do we have to increase the nodes in the K8s setup, or can it provision new servers for itself?
• Or should we do manual/static scaling via some Helm parameters?
Thanks and HNY!
m
Hi Vajk, at the moment only manual scaling is possible, based on customer settings (triggers, etc.).
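For illustration, manual scaling here usually means either bumping replicas directly with kubectl (quick, but reverted by the next helm upgrade) or changing the chart values. A minimal sketch, with an illustrative deployment name:

```sh
# List the actual deployment names first.
kubectl get deployments -n gooddata-cn

# Scale one GoodData.CN deployment by hand; "gooddata-cn-metadata-api" is illustrative.
# Note: a later "helm upgrade" using the chart's replicaCount values will revert this.
kubectl scale deployment gooddata-cn-metadata-api -n gooddata-cn --replicas=3
```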
v
Thanks!
• Is it on the roadmap to improve on this? Is there maybe any ETA?
• It would be great to have a mapping between usage patterns and microservices, i.e. which microservices we are expected to scale when end users are viewing dashboards, and which ones when power users are working on dashboards.
• Or, from another angle, to know which key services are heavily influenced by the different load types.
• I haven't yet added a monitoring tool to our installation. I see the documentation mentioning Prometheus endpoints. Would Prometheus provide me with load and response-time statistics, or would you recommend another tool for working on scaling this?
I see in the Helm chart's source that each microservice's `replicaCount` can be set individually in `values.yaml`, or via the global `replicaCount` parameter. Thanks
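A minimal sketch of both options at upgrade time, assuming the chart repo alias is `gooddata` and that the per-service key is called `metadataApi` (check the chart's own values.yaml for the exact key names):

```sh
# Global default replica count plus a per-service override (key names are assumptions).
helm upgrade gooddata-cn gooddata/gooddata-cn -n gooddata-cn --reuse-values \
  --set replicaCount=2 \
  --set metadataApi.replicaCount=3
```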
r
Automatic scaling of a k8s cluster may be tricky - basically you need to handle two levels of scaling:
1. Scale deployment replica counts based on load (HorizontalPodAutoscaler might help, but I'm not sure how mature it is - I don't have experience with it).
2. Scale the cluster itself (bring worker nodes up/down depending on the workload) - that's a task for the cluster autoscaler, and it works fine in our environment. Keep in mind there is a necessary delay between the moment the high load starts and the moment the autoscaler provisions a new k8s node to accommodate it.
I strongly suggest configuring at least basic monitoring to see how the pods behave under heavy load. k8s itself provides a lot of valuable metrics on the pod level (CPU, RAM, ...). GoodData.CN and Pulsar expose additional application-level metrics, mostly JVM and HTTP metrics.
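A minimal HPA sketch for one deployment, assuming metrics-server is installed and the pods have CPU requests set; the deployment name and thresholds are illustrative:

```sh
# Create a CPU-based HorizontalPodAutoscaler for one service.
kubectl autoscale deployment gooddata-cn-afm-exec-api -n gooddata-cn \
  --cpu-percent=70 --min=2 --max=6

# Watch current utilization against the target while the load test runs.
kubectl get hpa -n gooddata-cn
```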
One more thing - for large-scale deployments we recommend setting pod resource limits (see values.yaml). They are intentionally set low in the helm chart so that it does not consume too many resources when not needed, but the default values might be too restrictive in some cases.
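A hedged example of raising one service's limits above the chart defaults; the `metadataApi.resources.*` keys are assumptions, so check the chart's values.yaml for the actual structure:

```sh
# Override one service's resource limits at upgrade time (key names are assumptions).
helm upgrade gooddata-cn gooddata/gooddata-cn -n gooddata-cn --reuse-values \
  --set metadataApi.resources.limits.cpu=2000m \
  --set metadataApi.resources.limits.memory=2Gi
```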
j
We experimented with HorizontalPodAutoscaler while implementing a k8s operator for the Vertica database. It works as expected and should be officially production-ready. It is possible to configure it based on standard k8s metrics (CPU, memory, ...) or with custom metrics (we haven't tested those). Definitely consider what happens when pods are (re)started - our microservices are based on the JVM and consume a lot of CPU during startup. AFAIK there is a way to postpone the application of HorizontalPodAutoscaler rules until the pod is ready (horizontal-pod-autoscaler-initial-readiness-delay).
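For the startup-spike concern: the readiness delay mentioned above is a kube-controller-manager flag (`--horizontal-pod-autoscaler-initial-readiness-delay`), not a field on the HPA object. At the HPA level, a scale-up stabilization window is another knob that damps reactions to short spikes; a sketch with illustrative names and thresholds:

```sh
# Apply an autoscaling/v2 HPA with a scale-up stabilization window (names are illustrative).
kubectl apply -n gooddata-cn -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: metadata-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gooddata-cn-metadata-api   # illustrative deployment name
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # require sustained load, so brief JVM startup CPU spikes don't trigger further scale-ups
EOF
```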
m
@Vajk Hermecz It is also worth mentioning that we do not support gRPC load balancing internally yet, so in certain cases scaling alone won't boost performance as much as you might expect. I do not know how much time you want to spend on the load tests, so I want to prevent any bad surprises.
v
I've implemented monitoring via Prometheus+Grafana. Do you maybe have some best practices on what to monitor, or even a drop-in Grafana dashboard? Thank you!
Related: as I tried to learn about the metrics exposed by the services, I found there are some that do not have prometheus.io/port annotations (namely: gooddata-cn-analytical-designer, gooddata-cn-apidocs, gooddata-cn-aqe, gooddata-cn-dashboards, gooddata-cn-home-ui, gooddata-cn-ldm-modeler, gooddata-cn-measure-editor, gooddata-cn-organization-controller, gooddata-cn-tools). Is this expected? I find it a bit surprising. The code I used to gather the exposed ports:
# Code to extract prometheus metrics ports. (You might want to change the namespace)
for podname in $(kubectl get pods -n gooddata-cn -o jsonpath='{.items[*].metadata.name}')
do
	echo "$podname = $(kubectl get pod "$podname" -n gooddata-cn -o jsonpath='{.metadata.annotations.prometheus\.io/port}')"
done
r
Yes, it is expected. These services do not expose any Prometheus metrics:
• _gooddata-cn-analytical-designer, gooddata-cn-apidocs, gooddata-cn-dashboards, gooddata-cn-home-ui, gooddata-cn-ldm-modeler, gooddata-cn-measure-editor_: These are simple nginx-based containers hosting static file content (html/js/css/images), and they don't contain any nginx module providing metrics.
• gooddata-cn-aqe is being deprecated in future releases in favor of a new service called Calcique (which will provide Prometheus metrics).
• gooddata-cn-organization-controller is a Kubernetes operator built on top of the Kopf framework. There is no direct support for Prometheus integration, but it should be possible to expose some metrics (we need to figure out what is worth exposing). I'm filing an internal ticket for future improvements.
• gooddata-cn-tools is a sort of "management pod" that doesn't run any service (its docker command is `sleep 365d`). It contains a simple undocumented tool `tiger-tools` for metadata manipulation. It's not involved in data processing at all.
As far as the Grafana dashboard is concerned, we use one internally, but it is currently being redesigned to support multiple deployments. We can make this dashboard publicly available later, if necessary.
v
@Robert Moucha just realized I never responded to this. That would be jolly good 🙂 and thanks for the detailed rundown of the pods.