# gooddata-cn
s
Hi team, currently we are on GD CN v3.19.0 and we are observing quiver-cache in a CrashLoopBackOff state. Also, the GD CN UI is very slow and sometimes not responding. Attached are the quiver-cache pod logs. Please advise what we can do to fix this and also improve the performance? Thank you!
j
Hello Sathish, you can find some tips on this here: https://www.gooddata.com/docs/cloud-native/3.20/connect-data/performance/
s
@Joseph Heun Can you help me understand why quiver-cache is in a CrashLoopBackOff state?
j
The `CrashLoopBackOff` state in Kubernetes indicates that a container is repeatedly crashing after starting. Here are several reasons why `quiver-cache` might be in this state:
1. Application Errors: The application itself (in this case, `quiver-cache`) may be encountering errors during startup. Check the logs of the pod to identify any uncaught exceptions or configuration issues.
2. Configuration Issues: Misconfiguration of environment variables, configuration files, or command-line arguments can lead to the application failing to start. Validate that all required configurations are correct.
3. Dependency Failures: If `quiver-cache` relies on other services or databases, ensure these dependencies are available and operating correctly. Timeout or connection issues could cause the application to crash.
4. Resource Limits: The container's resource limits (CPU and memory) might be too low, causing the application to be terminated for exceeding those limits. Review and adjust the resource requests and limits in your deployment.
5. Health Checks: If the pod is configured with liveness or readiness probes and they are misconfigured, Kubernetes may kill the container repeatedly, thinking it is unhealthy. Check the probe configurations to ensure they are appropriate.
6. Missing Files or Directories: If the application expects certain files or directories to be present and they are missing, it can fail during initialization. Verify that all necessary files are available.
7. Docker Image Issues: The container image for `quiver-cache` may have issues, such as incomplete builds, missing dependencies, or corrupt files. Ensure the image is built and pushed correctly.
8. Permissions Issues: Insufficient permissions for accessing resources, either within the container or to external systems, may cause crashes. Validate that the application has the necessary permissions.

Next Steps for Troubleshooting (see the command sketch below):
1. Check Logs: Use `kubectl logs <pod-name>` to view the logs for the crashing pod. Look for any error messages that can help identify the issue.
2. Describe the Pod: Use `kubectl describe pod <pod-name>` to get detailed information about the pod's state, including events that might indicate why it's failing.
3. Review Resource Usage: Check whether pods are being terminated due to resource limits. Use `kubectl top pod <pod-name>` to monitor resource usage.
4. Modify Probes: Temporarily modify or disable liveness and readiness probes to see if that resolves the issue while you debug.
5. Run Locally: If possible, run the `quiver-cache` application locally with the same configurations to replicate the issue outside of Kubernetes.

By systematically examining these areas, you should be able to identify the reason for the `CrashLoopBackOff` state and implement the necessary fixes. If you need further assistance, feel free to ask!
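For reference, here is a minimal sketch of the first few steps; the namespace and pod name are placeholders (I'm assuming a dedicated GoodData CN namespace here), so adjust them to your deployment:

```sh
# Find the crashing pod (namespace name is an assumption; adjust to yours)
kubectl get pods -n gooddata-cn | grep quiver-cache

# Logs from the current and the previous (crashed) container instance
kubectl logs -n gooddata-cn <quiver-cache-pod-name>
kubectl logs -n gooddata-cn <quiver-cache-pod-name> --previous

# Events, probe failures, exit codes (e.g. OOMKilled), restart counts
kubectl describe pod -n gooddata-cn <quiver-cache-pod-name>

# Current CPU/memory usage vs. the configured limits (requires metrics-server)
kubectl top pod -n gooddata-cn <quiver-cache-pod-name>
```

The `--previous` flag is often the most useful one here, since it shows the logs of the container instance that actually crashed rather than the one that just restarted.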
s
@Joseph Heun We tried increasing the resources for metadata-api, calcique, sql-executor, and result-cache. The quiver-cache crashing issue is resolved, but performance is still very slow; mainly /attributes takes too much time. Attached a screenshot for reference.
j
How many rows are you trying to create in your visualizations? Does it work better if you filter things down?
Also, the way datasets connect in the LDM could contribute to this depending on how complex the LDM is and how complex the metrics are.
s
Approx. 300K rows. It is the same even when I add a filter.
Screenshot of one of the visualizations.
j
Hi Sathish, can you confirm that the performance is the same if you are trying to load 300k rows vs. a small amount, let's say ten? Are all of your attribute lists this long?
s
Hi Joseph, yes, performance is the same for both 300k and even 5-10 rows. Attributes differ depending on the chart; some have only 2 attributes. I have just shared one of them, which has many attributes.
m
Hello Sathish, based on the screenshot from the browser showing the API calls, it appears that the API requests to the metadata service, which eventually reach the metadata-api pods and the metadata database, are taking a long time to process. I recommend taking one of these requests, locating the traceId in the header, and checking the logs to identify where the most time is being consumed. When exactly did you start to observe the slowdown? Are you able to connect it to some action on your side, such as an update of CN or the infrastructure, or a change to anything in the environment such as the model, the number of workspaces, etc.?
s
Hi Martin, We upgraded GD CN from v2.5.1 to v3.19.0, and then we also moved one of the workspaces from a GD freemium account to CN. Since then, we have been facing the performance issue.
j
Hi Sathish, as Martin pointed out, based on the screenshot the API calls related to the metadata service are taking a long time. It is a good idea to get the traceId of the slow calls; it is located in the response headers of the call in the browser's network console under `x-gdc-trace-id`. Then check the backend logs for that traceId, mainly (but not only) in the metadata-api pods (see the sketch below). Also, do you experience similar slowness across all workspaces, or is it limited to one particular workspace? What is your current sizing of the metadata-api pods?
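A minimal sketch of that log search, assuming the metadata-api pods carry an `app.kubernetes.io/name=metadata-api` label and live in a `gooddata-cn` namespace (both are assumptions; verify with `kubectl get pods --show-labels`):

```sh
# Trace ID copied from the x-gdc-trace-id response header of a slow request
TRACE_ID=<paste-trace-id-here>

# Label selector and namespace are assumptions; verify against your install
kubectl logs -n gooddata-cn -l app.kubernetes.io/name=metadata-api --tail=-1 \
  | grep "$TRACE_ID"
```

The matching log lines should indicate which backend step consumed most of the time for that request.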
s
Hi @Jan Kos, We will check with the trace id. We are experiencing the same with all the workspaces.
metadata-api - 2 active pods
Limits: cpu: 1250m, memory: 1300Mi
Requests: cpu: 100m, memory: 800Mi
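For reference, this is roughly how we read the current sizing and usage; the deployment name and namespace are our assumptions, so adjust them to match the actual install:

```sh
# Configured requests/limits of the metadata-api container
# (deployment name and namespace are assumptions; adjust to your install)
kubectl get deployment metadata-api -n gooddata-cn \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Live CPU/memory usage of the metadata-api pods (requires metrics-server)
kubectl top pod -n gooddata-cn | grep metadata-api
```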