# gooddata-cn
p
We're observing that our pods for several GoodData.CN (version 2.3.2) services are being terminated with reason OOMKilled. The following services are mainly affected (though we've observed others in this state):
• gooddata-cn-ldm-modeler
• gooddata-cn-analytical-designer
• gooddata-cn-dashboards
We're overriding the following resource values in our chart:
analyticalDesigner:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
dashboards:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
homeUi:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
ldmModeler:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
measureEditor:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
webComponents:
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
Before we try raising the resource requests/limits, we're wondering what debugging steps we can try, and whether resource or capacity issues in the cluster itself might cause this behavior. (Attaching the results of kubectl describe for one of the affected pods.)
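For context, this is roughly how we've been confirming the termination reason straight from the API - the pod name below is just a placeholder for one of the affected pods:
# placeholder pod name; substitute one of the affected pods
> kubectl get pod <affected-pod> -n gooddata-cn \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints "OOMKilled" when the container was killed for exceeding its memory limit or by the kernel OOM killer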
r
Hi Pete, I'm sorry to hear that. Assuming you have metrics-server installed on your cluster, you may check the actual cpu/mem consumption of your pods by running kubectl top pod -n gooddata-cn. (If you have access to some cluster-wide monitoring tool, you can see the same more conveniently - ask your infrastructure team for access.) The pods you've mentioned are really lightweight and should not consume more than 30MiB, so if you have limits set to 50MiB, the OOMs you're observing are not caused by exceeding those limits.
The most probable cause of this issue is that the cluster nodes are overcommitted (the sum of the memory limits of all running pods is higher than the physical memory on that node). Memory overcommit is usually a bad thing and will cause random OOM kills by the Linux kernel when the node runs out of free memory. Burstable pods with small memory requests are more susceptible to OOM kills because their oom_score_adj is higher.
There's a possible workaround - set resources.requests.memory to the same value as resources.limits.memory; it will force the k8s scheduler to place these pods only on nodes that have sufficient capacity (a values sketch follows the describe output below). But if you do so and your cluster can't find a node that satisfies the pod's requests, the pod remains in Pending state until the cluster is upscaled or some other workload stops and frees up memory.
To see if a node is overcommitted, check the kubectl describe node ... output and look for the section that looks like:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         3375m (42%)    17400m (219%)
  memory                      10846Mi (73%)  22676Mi (154%)
  ephemeral-storage           300Mi (0%)     300Mi (0%)
(Note the memory limits are overcommitted by 54% in this example)
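For illustration, the requests workaround mentioned above would look roughly like this in your values override (shown for analyticalDesigner only, and assuming the chart accepts requests next to limits under the same resources key; repeat the pattern for the other services):
analyticalDesigner:
  resources:
    requests:
      memory: 50Mi        # same value as limits.memory, so the scheduler reserves it up front
    limits:
      cpu: 100m
      memory: 50Mi
      ephemeral-storage: 25Mi
If you also set requests.cpu equal to limits.cpu, the pods end up in the Guaranteed QoS class, which gives them an even lower oom_score_adj.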
p
Thank you Robert, will investigate ...
Here's what I observe:
> kubectl top pod -n gooddata-cn
W0714 14:45:32.801916      63 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                                  CPU(cores)   MEMORY(bytes)   
gooddata-cn-afm-exec-api-7448779d99-2fqwh             9m           404Mi           
gooddata-cn-analytical-designer-d99c8d595-9xjs2       1m           12Mi            
gooddata-cn-api-gateway-59f675d5db-872vn              6m           318Mi           
gooddata-cn-auth-service-5597659dd-rnjsl              5m           461Mi           
gooddata-cn-calcique-5b9f68b6f9-ctcjx                 12m          402Mi           
gooddata-cn-export-controller-c79c7d8d4-4vxcn         16m          442Mi           
gooddata-cn-home-ui-789b74fd4b-ftjbx                  1m           12Mi            
gooddata-cn-ldm-modeler-5fdc8c9bc4-8w5ql              1m           12Mi            
gooddata-cn-metadata-api-64d8f6485d-n4kl4             12m          631Mi           
gooddata-cn-organization-controller-dc6fc894f-pnrvv   1m           39Mi            
gooddata-cn-result-cache-84fb96d996-bsgll             17m          373Mi           
gooddata-cn-scan-model-5bb5cc5998-h57rj               3m           437Mi           
gooddata-cn-sql-executor-9c86c49df-zsq5r              17m          446Mi           
gooddata-cn-tabular-exporter-795cb57d66-ftzqx         47m          311Mi           
gooddata-cn-tools-856b5cb768-sj256                    0m           0Mi             
gooddata-cn-web-components-7467c679b7-fn7ls           14m          10Mi
Will try tweaking the resourcing a bit, then reach out to our infra team ...
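For the node-level check, I'm planning to run something like this (just grepping the allocation summary out of the describe output, nothing fancier):
> kubectl top nodes
> kubectl describe nodes | grep -A 8 'Allocated resources'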
r
Well, as expected, the UI-related pods (analytical-designer, ldm-modeler) are far below the memory limit of 50MiB. I can't see the dashboards pod, but I expect similar memory usage. There's not much you can do with resources at this moment without knowing what's going on in your cluster. The workload may suffer from a "noisy neighbour" issue - another deployment may suddenly increase its memory usage, causing the cluster to evict your pods to free up space. And as I wrote before, small burstable pods become OOM-kill victims more often than large pods (because they have a higher oom_score_adj). Check the nodes' total resources for overcommit. Contact your infra team and ask them for help; they have more options to solve it - add an extra node, rebalance the workload, squeeze memory on other workloads, ...
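If you want to double-check which pods are Burstable (and therefore carry a higher oom_score_adj than Guaranteed pods), something like this should be enough - the pod name in the second command is a placeholder, and the exec only works if the image ships cat:
> kubectl get pods -n gooddata-cn -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
# effective score of the container's main process (placeholder pod name)
> kubectl exec -n gooddata-cn <affected-pod> -- cat /proc/1/oom_score_adj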