Pete Lorenz
07/13/2023, 10:29 PM
analyticalDesigner:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
dashboards:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
homeUi:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
ldmModeler:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
measureEditor:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
webComponents:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
Before we try raising the resource requests / limits, we're wondering what debugging steps we can try. We're also wondering if resource or capacity issues in the cluster itself might cause this behavior. (Attaching the results of kubectl describe for one of the affected pods.)
Robert Moucha
07/14/2023, 8:15 AM
Check the actual memory usage of the pods with kubectl top pod -n gooddata-cn. (If you have access to some cluster-wide monitoring tool, you can see the same more conveniently - ask your infrastructure team for access.) The pods you've mentioned are really lightweight and should not consume more than 30MiB, so if you have limits set to 50MiB, the OOMs you're observing are not caused by exceeding their limits. The most probable cause of this issue is that the cluster nodes are overcommitted (the sum of the memory limits of all running pods is higher than the physical memory on the node). Memory overcommit is usually a bad thing and will cause random OOM kills by the Linux kernel when a node runs out of free memory. Burstable pods with small memory requests are more susceptible to OOM kills because their oom_score_adj is higher.
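As a side note, you can see which QoS class each pod ended up in - a quick check, assuming the same gooddata-cn namespace as above:

  # list the QoS class of each pod; Burstable pods get a higher oom_score_adj
  # than Guaranteed pods, so the kernel targets them first under memory pressure
  kubectl get pods -n gooddata-cn \
    -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'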
There's a possible workaround - set resources.requests.memory to the same value as resources.limits.memory. This forces the k8s scheduler to place these pods only on nodes with sufficient capacity. But if you do so and the cluster can't find a node that satisfies the pod's requests, the pod remains in Pending state until the cluster is upscaled or some other workload stops and frees up some memory.
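A minimal sketch of that workaround, assuming the same Helm values layout as in your snippet above (only analyticalDesigner shown; the other UI services would get the same change):

  analyticalDesigner:
    resources:
      requests:
        memory: 50Mi        # matches limits.memory, per the workaround above
      limits:
        cpu: 100m
        memory: 50Mi
        ephemeral-storage: 25Mi
  # ...same requests block for dashboards, homeUi, ldmModeler, measureEditor, webComponents

If you also set requests.cpu equal to limits.cpu, the pods get Guaranteed QoS, which lowers their oom_score_adj and makes kernel OOM kills even less likely.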
To see if a node is overcommitted, check the kubectl describe node ...
output and look for the section that looks like:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3375m (42%)    17400m (219%)
  memory             10846Mi (73%)  22676Mi (154%)
  ephemeral-storage  300Mi (0%)     300Mi (0%)
(Note the memory limits are overcommitted by 54% in this example.)
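If you want to scan all nodes at once, something like this should work (the -A 7 context window is a guess and may need tuning for your kubectl version's output):

  # dump the "Allocated resources" section for every node in the cluster
  kubectl describe nodes | grep -A 7 'Allocated resources:'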
Pete Lorenz
07/14/2023, 2:40 PM
Pete Lorenz
07/14/2023, 2:51 PM
> kubectl top pod -n gooddata-cn
W0714 14:45:32.801916 63 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                                  CPU(cores)   MEMORY(bytes)
gooddata-cn-afm-exec-api-7448779d99-2fqwh             9m           404Mi
gooddata-cn-analytical-designer-d99c8d595-9xjs2       1m           12Mi
gooddata-cn-api-gateway-59f675d5db-872vn              6m           318Mi
gooddata-cn-auth-service-5597659dd-rnjsl              5m           461Mi
gooddata-cn-calcique-5b9f68b6f9-ctcjx                 12m          402Mi
gooddata-cn-export-controller-c79c7d8d4-4vxcn         16m          442Mi
gooddata-cn-home-ui-789b74fd4b-ftjbx                  1m           12Mi
gooddata-cn-ldm-modeler-5fdc8c9bc4-8w5ql              1m           12Mi
gooddata-cn-metadata-api-64d8f6485d-n4kl4             12m          631Mi
gooddata-cn-organization-controller-dc6fc894f-pnrvv   1m           39Mi
gooddata-cn-result-cache-84fb96d996-bsgll             17m          373Mi
gooddata-cn-scan-model-5bb5cc5998-h57rj               3m           437Mi
gooddata-cn-sql-executor-9c86c49df-zsq5r              17m          446Mi
gooddata-cn-tabular-exporter-795cb57d66-ftzqx         47m          311Mi
gooddata-cn-tools-856b5cb768-sj256                    0m           0Mi
gooddata-cn-web-components-7467c679b7-fn7ls           14m          10Mi
Will try tweaking the resourcing a bit, then reach out to our infra team ...
07/15/2023, 4:07 PM