# gooddata-cn
Pete:
We'd like to upgrade our GoodData.CN deployment on AWS from version 2.3.2 to 2.4.0. Our current 2.3.2 deployment is successfully connected to ElastiCache (Redis) and RDS (Postgres). However, when we deploy 2.4.0, the pods running the 2.4.0 images have trouble connecting to Redis or Postgres. We're seeing the following warning in the calcique logs:
{"ts":"2023-08-16 23:26:36.638","level":"WARN","logger":"org.springframework.boot.actuate.redis.RedisReactiveHealthIndicator","thread":"boundedElastic-1","traceId":"0a7e69f9297f9c9a","spanId":"0a7e69f9297f9c9a","msg":"Redis health check failed","exc":"org.springframework.data.redis.RedisConnectionFailureException: Unable to connect to Redis; nested exception is org.springframework.data.redis.connection.PoolException: Could not get a resource from the pool; nested exception is io.lettuce.core.RedisException: Cannot obtain initial Redis Cluster topology\n\tat org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1689)\n\tat
In addition, the check-postgres-db initContainer is stuck in both the metadata-api and sql-executor deployments (it appears to hang with no messages in the logs). We're using Redis 6.2.6 and Postgres 14.5. Please let us know what we can do to upgrade to GoodData.CN 2.4.0 while keeping our existing Redis and Postgres services on AWS.
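A minimal sketch of how the stuck initContainer can be inspected with plain kubectl, assuming the release runs in a `gooddata-cn` namespace (the namespace and pod names are assumptions and may differ in your environment):

```bash
# List pods that are stuck in Init state (namespace is an assumption)
kubectl get pods -n gooddata-cn

# Logs of the check-postgres-db initContainer in a metadata-api pod
# (container name taken from the report above; replace the pod name
#  with one from the listing)
kubectl logs -n gooddata-cn <metadata-api-pod> -c check-postgres-db

# Pod events often show why an init container is hanging or failing
kubectl describe pod -n gooddata-cn <metadata-api-pod>
```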
@Jan Kos @Moises Morales any guidance here? Should we just stay on 2.3.2 for now?
We're making progress on the DB connectivity, will post update here. It's an issue mounting the secret.
We've solved the DB issue, our AWS secrets provider was not in the new chart. Still looking at the Redis issue.
The Redis issue seems to have resolved itself, we're no longer seeing the above error in new pods.
Actually, we're still seeing the Redis health check failure and error message "Cannot obtain initial Redis Cluster topology" in the calcique pods
Readiness probes are failing for calcique
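One way to see why the readiness probes are failing, sketched with generic kubectl commands (the namespace and the calcique label selector are assumptions):

```bash
# Show probe failures and related events for the calcique pods
kubectl describe pods -n gooddata-cn -l app.kubernetes.io/component=calcique

# Recent events, newest last, to correlate probe failures with restarts
kubectl get events -n gooddata-cn --sort-by=.lastTimestamp | tail -n 20
```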
We're also seeing a Pulsar exception in the calcique logs:
{
  "ts": "2023-08-17 16:53:43.025",
  "level": "WARN",
  "logger": "org.apache.pulsar.client.admin.internal.BaseResource",
  "thread": "AsyncHttpClient-12-1",
  "msg": "[<http://pulsar-broker.pulsar:8080/admin/v2/persistent/gooddata-cn/gooddata-cn/sql.select>] Failed to perform http put request: javax.ws.rs.ClientErrorException: HTTP 409 Conflict"
}
Robert Moucha:
Hi, the HTTP 409 error is harmless. The app is trying to create a Pulsar topic that already exists, so HTTP 409 is returned. Topic creation is an idempotent operation, so you can safely ignore this one. I will check the other issues. The root cause seems to be a misconfiguration of the Redis connection:
redis://gooddata-cn-redis.8pswjo.0001.usw2.cache.amazonaws.com?timeout=20s]: ERR This instance has cluster support disabled
Did you change any Helm values related to Redis, most notably `service.redis.clusterMode`? It should be set to `false` (the default) if your Redis does not have cluster mode enabled. The calcique pod (and most probably other pods) has the environment variable `SPRING_REDIS_CLUSTER_NODES`, which is set only when `service.redis.clusterMode=true`. ElastiCache for Redis running in cluster mode is supported, but you need to explicitly turn cluster mode on in your AWS ElastiCache instance.
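For example, both ends of this can be checked from a shell; a sketch assuming the release is called `gooddata-cn`, runs in a `gooddata-cn` namespace, and that the tools/calcique deployment names and the availability of `redis-cli` in the tools image are all assumptions:

```bash
# What Redis-related values is the release actually running with?
helm get values gooddata-cn -n gooddata-cn | grep -A5 redis

# From inside the cluster: a Redis without cluster mode reports cluster_enabled:0
# (tools deployment name and redis-cli availability are assumptions)
kubectl exec -n gooddata-cn deploy/gooddata-cn-tools -- \
  redis-cli -h gooddata-cn-redis.8pswjo.0001.usw2.cache.amazonaws.com -p 6379 cluster info

# Is the cluster-only env variable present in the calcique pod?
kubectl exec -n gooddata-cn deploy/gooddata-cn-calcique -- sh -c 'env | grep SPRING_REDIS'
```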
Regarding PostgreSQL: please make sure the PostgreSQL server is accessible (at the network layer) from your k8s cluster. There is a pod called "tools" that contains the `psql` client. You can connect to that pod's Bash shell using `kubectl exec -it` and check whether the database host is accessible with the credentials you use:
PGPASSWORD=yourpostgrespassword psql -U postgres -h your-db.host.name
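A slightly fuller sketch of that check, assuming the tools pod is named `gooddata-cn-tools`, lives in a `gooddata-cn` namespace, and has `pg_isready` available (all assumptions; adjust names to your deployment):

```bash
# Reachability check from the tools pod
kubectl exec -n gooddata-cn gooddata-cn-tools -- pg_isready -h your-db.host.name -p 5432

# Authenticated connection test with the same credentials the deployments use
kubectl exec -n gooddata-cn gooddata-cn-tools -- \
  sh -c "PGPASSWORD=yourpostgrespassword psql -U postgres -h your-db.host.name -c 'SELECT 1;'"
```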
Pete:
Thanks for the helpful analysis @Robert Moucha. As you pointed out, the issue was that our Redis "cluster" was actually a single node, due to some confusing terminology in the ElastiCache Terraform resources (`aws_elasticache_cluster` with default settings creates a single-node Redis instance). We switched to a true multi-node setup (i.e. `aws_elasticache_replication_group`), and now our calcique service has stabilized in both of our k8s clusters. However, in one of our clusters we're seeing a CrashLoopBackOff in one of the 3 pods in our gooddata-cn-etcd StatefulSet (2 of 3 pods are running fine). The error message is:
{
  "level": "warn",
  "ts": "2023-08-23T16:01:20.885694Z",
  "logger": "etcd-client",
  "caller": "v3@v3.5.9/retry_interceptor.go:62",
  "msg": "retrying of unary invoker failed",
  "target": "<etcd-endpoints://0xc0001a8000/gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379>",
  "attempt": 0,
  "error": "rpc error: code = NotFound desc = etcdserver: member not found"
}
In the other cluster (with an identical configuration) the gooddata-cn-etcd StatefulSet is Ready with all 3 pods running. Where should we look to debug this further?
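The crashing member's previous logs and pod events are a reasonable first place to look; a sketch with generic kubectl commands (the failing ordinal, namespace, and label selector are assumptions):

```bash
# Which of the three etcd members is crash-looping?
kubectl get pods -n gooddata-cn -l app.kubernetes.io/name=etcd

# Logs from the previous (crashed) container of the failing member
kubectl logs -n gooddata-cn gooddata-cn-etcd-2 --previous

# Events usually show whether the restarts are probe- or process-driven
kubectl describe pod -n gooddata-cn gooddata-cn-etcd-2
```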
Robert Moucha:
The etcd subchart accidentally leaked into our gooddata-cn Helm chart sooner than we expected. It will be required later for a new "quiver" component for advanced caching (not installed yet). You can safely get rid of etcd by setting `useInternalQuiverEtcd: false` in your custom Helm values file. Sorry about that. Are there any other outstanding issues with your deployment? You mentioned something about an inaccessible database; did you manage to fix it?
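A sketch of applying that value, assuming the release is named `gooddata-cn`, runs in a `gooddata-cn` namespace, and was installed from a chart reference like `gooddata/gooddata-cn` with a custom values file (all of these names are assumptions about the setup):

```bash
# Disable the bundled etcd subchart; keep the rest of your custom values
helm upgrade gooddata-cn gooddata/gooddata-cn \
  --namespace gooddata-cn \
  --version 2.4.0 \
  -f customized-values-gooddata-cn.yaml \
  --set useInternalQuiverEtcd=false
```

Equivalently, `useInternalQuiverEtcd: false` can simply be added to the values file itself, as suggested above.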
Pete:
Thanks, @Robert Moucha. We will remove etcd for now using the setting you mentioned. The issue with RDS connectivity was our mistake: we forgot to include our secret-provider object in the new 2.4.0 chart, so the secrets were not being mounted in the deployments. Everything looks good so far in 2 of our 3 AWS environments (stage and prod): we've onboarded some new organizations and created insights in the UI. We're still having trouble deploying Pulsar in our dev environment on AWS, but that's a separate issue; we plan to debug it soon and may have some questions when we get to it. Thanks for all of your helpful support! 😀
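If the secrets are delivered through the AWS Secrets Store CSI driver, which the "secret-provider object" above suggests but is an assumption, a quick way to confirm the object exists and the volumes are mounted is:

```bash
# The SecretProviderClass must exist in the same namespace as the workloads
kubectl get secretproviderclass -n gooddata-cn

# Confirm the secret volume is actually mounted in an affected pod
# (pod name is a placeholder; pick one from `kubectl get pods`)
kubectl describe pod -n gooddata-cn <metadata-api-pod> | grep -A5 -i mounts
```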
Robert Moucha:
Glad to hear that, Pete! Let me know if you can't resolve the pulsar issue on your own.
Pete:
Thanks, Robert. Will do 😀