# gooddata-cn
Pete:
We'd like to upgrade our GoodData.CN deployment on AWS from version 2.3.2 to 2.4.0. Our current 2.3.2 deployment is successfully connected to ElastiCache (Redis) and RDS (Postgres). However, when we deploy 2.4.0, the pods running the 2.4.0 images have trouble connecting to Redis or Postgres. We're seeing the following warning in the calcique logs:
{"ts":"2023-08-16 23:26:36.638","level":"WARN","logger":"org.springframework.boot.actuate.redis.RedisReactiveHealthIndicator","thread":"boundedElastic-1","traceId":"0a7e69f9297f9c9a","spanId":"0a7e69f9297f9c9a","msg":"Redis health check failed","exc":"org.springframework.data.redis.RedisConnectionFailureException: Unable to connect to Redis; nested exception is org.springframework.data.redis.connection.PoolException: Could not get a resource from the pool; nested exception is io.lettuce.core.RedisException: Cannot obtain initial Redis Cluster topology\n\tat org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1689)\n\tat
In addition, the check-postgres-db initContainer is stuck in both the metadata-api and sql-executor deployments (it appears to hang with no messages in the logs). We're using Redis 6.2.6 and Postgres 14.5. Please let us know what we can do to upgrade to GoodData.CN 2.4.0 while keeping our existing Redis and Postgres services on AWS.
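A minimal sketch of how the stuck initContainer can be inspected with plain kubectl, assuming the release runs in a `gooddata-cn` namespace (the namespace and pod names are assumptions and may differ in your environment):

```bash
# List pods that are stuck in Init state (namespace is an assumption)
kubectl get pods -n gooddata-cn

# Logs of the check-postgres-db initContainer in a metadata-api pod
# (container name taken from the report above; replace the pod name
#  with one from the listing)
kubectl logs -n gooddata-cn <metadata-api-pod> -c check-postgres-db

# Pod events often show why an init container is hanging or failing
kubectl describe pod -n gooddata-cn <metadata-api-pod>
```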
@Jan Kos @Moises Morales any guidance here? Should we just stay on 2.3.2 for now?
We're making progress on the DB connectivity, will post update here. It's an issue mounting the secret.
We've solved the DB issue, our AWS secrets provider was not in the new chart. Still looking at the Redis issue.
The Redis issue seems to have resolved itself, we're no longer seeing the above error in new pods.
Actually, we're still seeing the Redis health check failure and error message "Cannot obtain initial Redis Cluster topology" in the calcique pods
Readiness probes are failing for calcique
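One way to see why the readiness probes are failing, sketched with generic kubectl commands (the namespace and the calcique label selector are assumptions):

```bash
# Show probe failures and related events for the calcique pods
kubectl describe pods -n gooddata-cn -l app.kubernetes.io/component=calcique

# Recent events, newest last, to correlate probe failures with restarts
kubectl get events -n gooddata-cn --sort-by=.lastTimestamp | tail -n 20
```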
We're also seeing a Pulsar exception in the calcique logs:
{
  "ts": "2023-08-17 16:53:43.025",
  "level": "WARN",
  "logger": "org.apache.pulsar.client.admin.internal.BaseResource",
  "thread": "AsyncHttpClient-12-1",
  "msg": "[<http://pulsar-broker.pulsar:8080/admin/v2/persistent/gooddata-cn/gooddata-cn/sql.select>] Failed to perform http put request: javax.ws.rs.ClientErrorException: HTTP 409 Conflict"
}
Robert Moucha:
Hi, the HTTP 409 error is harmless. The app is trying to create a Pulsar topic that already exists, so HTTP 409 is returned. Topic creation is an idempotent operation, so you can safely ignore this one. I will check the other issues. The root cause seems to be a misconfiguration of the Redis connection:
redis://gooddata-cn-redis.8pswjo.0001.usw2.cache.amazonaws.com?timeout=20s]: ERR This instance has cluster support disabled
Did you change any Helm values related to Redis, most notably `service.redis.clusterMode`? It should be set to `false` (the default) if your Redis does not have cluster mode enabled. The calcique pod (and most probably other pods) has the environment variable `SPRING_REDIS_CLUSTER_NODES`, which is set only when `service.redis.clusterMode=true`. ElastiCache for Redis running in cluster mode is supported, but you need to explicitly turn cluster mode on in your AWS ElastiCache instance.
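For example, both ends of this can be checked from a shell; a sketch assuming the release is called `gooddata-cn`, runs in a `gooddata-cn` namespace, and that the tools/calcique deployment names and the availability of `redis-cli` in the tools image are all assumptions:

```bash
# What Redis-related values is the release actually running with?
helm get values gooddata-cn -n gooddata-cn | grep -A5 redis

# From inside the cluster: a Redis without cluster mode reports cluster_enabled:0
# (tools deployment name and redis-cli availability are assumptions)
kubectl exec -n gooddata-cn deploy/gooddata-cn-tools -- \
  redis-cli -h gooddata-cn-redis.8pswjo.0001.usw2.cache.amazonaws.com -p 6379 cluster info

# Is the cluster-only env variable present in the calcique pod?
kubectl exec -n gooddata-cn deploy/gooddata-cn-calcique -- sh -c 'env | grep SPRING_REDIS'
```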
Regarding PostgreSQL: please make sure the PostgreSQL server is accessible (at the network layer) from your k8s cluster. There is a pod called "tools" that contains the `psql` client. You can connect to that pod's Bash shell using `kubectl exec -it` and check whether the database host is accessible with the credentials you use:
PGPASSWORD=yourpostgrespassword psql -U postgres -h your-db.host.name
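A slightly fuller sketch of that check, assuming the tools pod is named `gooddata-cn-tools`, lives in a `gooddata-cn` namespace, and has `pg_isready` available (all assumptions; adjust names to your deployment):

```bash
# Reachability check from the tools pod
kubectl exec -n gooddata-cn gooddata-cn-tools -- pg_isready -h your-db.host.name -p 5432

# Authenticated connection test with the same credentials the deployments use
kubectl exec -n gooddata-cn gooddata-cn-tools -- \
  sh -c "PGPASSWORD=yourpostgrespassword psql -U postgres -h your-db.host.name -c 'SELECT 1;'"
```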
Pete:
Thanks for the helpful analysis @Robert Moucha. As you pointed out, the issue was that our Redis "cluster" was actually a single node, due to some confusing terminology in the ElastiCache Terraform resources (`aws_elasticache_cluster` with default settings creates a single-node Redis instance). We switched to a true multi-node setup (i.e. `aws_elasticache_replication_group`), and now our calcique service has stabilized in both of our k8s clusters. However, in one of our clusters we're seeing a CrashLoopBackOff in one of the 3 pods in our gooddata-cn-etcd StatefulSet (2 of 3 pods are running fine). The error message is:
{
  "level": "warn",
  "ts": "2023-08-23T16:01:20.885694Z",
  "logger": "etcd-client",
  "caller": "v3@v3.5.9/retry_interceptor.go:62",
  "msg": "retrying of unary invoker failed",
  "target": "<etcd-endpoints://0xc0001a8000/gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379>",
  "attempt": 0,
  "error": "rpc error: code = NotFound desc = etcdserver: member not found"
}
In the other cluster (with an identical configuration) the gooddata-cn-etcd StatefulSet is Ready with all 3 pods running. Where should we look to debug this further?
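The crashing member's previous logs and pod events are a reasonable first place to look; a sketch with generic kubectl commands (the failing ordinal, namespace, and label selector are assumptions):

```bash
# Which of the three etcd members is crash-looping?
kubectl get pods -n gooddata-cn -l app.kubernetes.io/name=etcd

# Logs from the previous (crashed) container of the failing member
kubectl logs -n gooddata-cn gooddata-cn-etcd-2 --previous

# Events usually show whether the restarts are probe- or process-driven
kubectl describe pod -n gooddata-cn gooddata-cn-etcd-2
```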
Robert Moucha:
The etcd subchart accidentally leaked into our gooddata-cn Helm chart sooner than we expected. It will be required later for a new "quiver" component for advanced caching (not installed yet). You can safely get rid of etcd by setting `useInternalQuiverEtcd: false` in your custom Helm values file. Sorry about that. Are there any other outstanding issues with your deployment? You mentioned something about an inaccessible database; did you manage to fix it?
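A sketch of applying that value, assuming the release is named `gooddata-cn`, runs in a `gooddata-cn` namespace, and was installed from a chart reference like `gooddata/gooddata-cn` with a custom values file (all of these names are assumptions about the setup):

```bash
# Disable the bundled etcd subchart; keep the rest of your custom values
helm upgrade gooddata-cn gooddata/gooddata-cn \
  --namespace gooddata-cn \
  --version 2.4.0 \
  -f customized-values-gooddata-cn.yaml \
  --set useInternalQuiverEtcd=false
```

Equivalently, `useInternalQuiverEtcd: false` can simply be added to the values file itself, as suggested above.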
Pete:
Thanks, @Robert Moucha. We will remove etcd for now using the setting you mentioned. The issue with RDS connectivity was our mistake: we forgot to include our secret-provider object in the new 2.4.0 chart, so the secrets were not being mounted in the deployments. Everything looks good so far in 2 of our 3 AWS environments (stage and prod): we've onboarded some new organizations and created insights in the UI. We're still having trouble deploying Pulsar in our dev environment on AWS, but that's a separate issue; we plan to debug it soon and may have some questions when we get to it. Thanks for all of your helpful support! 😀
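If the secrets are delivered through the AWS Secrets Store CSI driver, which the "secret-provider object" above suggests but is an assumption, a quick way to confirm the object exists and the volumes are mounted is:

```bash
# The SecretProviderClass must exist in the same namespace as the workloads
kubectl get secretproviderclass -n gooddata-cn

# Confirm the secret volume is actually mounted in an affected pod
# (pod name is a placeholder; pick one from `kubectl get pods`)
kubectl describe pod -n gooddata-cn <metadata-api-pod> | grep -A5 -i mounts
```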
Robert Moucha:
Glad to hear that, Pete! Let me know if you can't resolve the pulsar issue on your own.
Pete:
Thanks, Robert. Will do 😀