# gooddata-cn
b
Hi all. I'm trying to upgrade GD-CN from `3.4.2` to `3.5.5` and the `quiver-cache` pods' health checks do not pass:
Liveness probe failed: Get "http://100.64.80.45:8877/live": dial tcp 100.64.80.45:8877: connect: connection refused
Readiness probe failed: Get "http://100.64.80.45:8877/ready": dial tcp 100.64.80.45:8877: connect: connection refused
All other upgrades went smoothly up until 3.5.5. I did see the thread above about using Mi instead of Gi for quiver resources and I have made that change.
i
Hi Brian, If you are referring to this thread, the issue with memory miscalculation should be fixed already. Please let me check internally what might be causing your troubles here.
b
Ah ok thanks for validating the resource bug. Any insight you can provide is appreciated!
Good morning @Ivana Gasparekova, I just wanted to check back and see if you were able to find anything that could be causing my issue. Thank you!
m
Hi Brian, sorry for the delay in getting back to you - our team is still checking this for you, and once we have more details on this, we will reach back out to you right away. Thanks for your ongoing patience here!
b
Thank you so much Michael. 🙏
j
Hi Brian, what are your quiver settings in the values.yaml file? Can you please share them? Also, if you check the logs of the quiver pods, are there any errors during pod startup, apart from the failing liveness and readiness probes?
b
Here is the values file for my dev environment:
gooddata-cn:
  quiver:
    replicaCount:
      cache: 2
      xtab: 2
    image:
      name: quiver
    resources:
      cache:
        limits:
          cpu: 300m
          memory: 768Mi
          ephemeral-storage: 1.3Gi # 1Gi storage.cache.diskSize + 256Mi storage.serverWorkDirSize
        requests:
          cpu: 100m
          memory: 256Mi
          ephemeral-storage: 1.3Gi
      xtab:
        limits:
          cpu: 500m
          memory: 512Mi
          ephemeral-storage: 300Mi
        requests:
          cpu: 200m
          memory: 256Mi
          ephemeral-storage: 300Mi
    podDisruptionBudget:
      maxUnavailable: ''
      minAvailable: 1
    storage:
      serverWorkDir: "/quiver/server/data"
      serverWorkDirSize: 256Mi # present in cache & xtab deployments
      cache:
        # emptyDir size
        diskSize: 1Gi #cache
        # Real maximum size of caches (lower than diskSize because of flight schemas size overhead)
        diskCacheSize: 900Mi
        # Path where to store non-durable caches
        diskCachePath: "/quiver/cache/data"
    # -- Type of storage where to store durable caches ("" or "S3" or "FS"). Any change of this value requires ETCD wipe to refresh configuration!
    durableStorageType: "S3"
    # -- S3 durable storage configuration
    s3DurableStorage:
      s3Bucket: 'dev-gooddata-cn-cache-us-east-1'
      s3BucketPrefix: ''
      s3Region: 'us-east-1'
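The sizing comments in the values above can be checked with a little arithmetic. The sketch below is only an illustration: `to_bytes` is a hypothetical helper (not part of the chart or of Kubernetes tooling) that handles just the binary suffixes used in this file, and the quantities are copied from the YAML.

```python
# Binary (power-of-two) suffixes used by the quantities in this values file.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '256Mi' or '1.3Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a plain byte count


disk_size = to_bytes("1Gi")            # storage.cache.diskSize
cache_size = to_bytes("900Mi")         # storage.cache.diskCacheSize
work_dir = to_bytes("256Mi")           # storage.serverWorkDirSize
ephemeral_limit = to_bytes("1.3Gi")    # resources.cache.limits.ephemeral-storage

# The cache pod needs room for the cache emptyDir plus the server work dir.
needed = disk_size + work_dir

print("needed bytes:", needed)
print("fits in ephemeral-storage limit:", needed <= ephemeral_limit)
print("diskCacheSize below diskSize:", cache_size < disk_size)
```

Under these assumptions the two storage sizes add up to 1280Mi, which is below the 1.3Gi ephemeral-storage limit.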
As for errors in the logs, I don't see any. Here are the logs for `quiver-cache`:
{"module":"quiver_shard","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:14.353895Z","action":"loading_module"}
{"module":"quiver_policy","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:37.845827Z","action":"loading_module"}
{"catalog_db_dir":null,"maintenance_config":{"vacuum":{"retry_period":0.1,"retry_limit":10,"upsert_threshold":5000,"delete_threshold":500,"max_time_threshold":1800},"maintenance_period":1.0,"s
{"memory_limit":805306368,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.344658Z","action":"server_health_monitor_started"}
{"platform":"Linux-5.15.148-x86_64-with-glibc2.36","python_version":"3.11.8","arrow_version":"15.0.0","quiver_version":"0.133.0","server_heartbeat_interval":0.1,"server_trim_interval":5,"serve
{"name":"flight-window-svc","affinity":"FLIGHT_PATH","submit_on_get":true,"submit_on_info":true,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.348983Z","action":"regi
{"name":"policy_report","affinity":null,"submit_on_get":false,"submit_on_info":true,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.349161Z","action":"registering_serv
{"keyspace_prefix":"gooddata-cn|","node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","etcd_config":{"etcd_keyspace_prefix":"gooddata-cn","etcd_registration_ttl":30,"etcd_registration_hear
{"type":"main","node":{"order":0,"host":"gooddata-cn-etcd-0.gooddata-cn-etcd-headless","port":2379,"secure":false},"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.3
{"node":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","lease_id":8274113667580754860,"lease_id_hex":"0x72d38e8ae35afbac","registered_rev":7202,"logger":"quiver.etcd.heartbeat","level":"info","ti
{"cluster_id":"6b02ea2c920c49a7b190f3d30193d010","logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.379796Z","action":"etcd_conn_established"}
{"type":"initial_load","node":{"order":0,"host":"gooddata-cn-etcd-0.gooddata-cn-etcd-headless","port":2379,"secure":false},"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13
{"node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","node_type":"shard","cluster_id":"6b02ea2c920c49a7b190f3d30193d010","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:
{"host":"0.0.0.0","port":8877,"logger":"quiver.health_check_http_server","level":"info","timestamp":"2024-04-02T13:48:38.433063Z","action":"health_server_started"}
{"listen_url":"grpc://0.0.0.0:16001","client_url":"grpc://100.64.77.216:16001","peer_url":"grpc://100.64.77.216:16001","tls":false,"logger":"quiver.flight.server","level":"info","timestamp":"2
{"module":"CacheShardModule","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.455751Z","action":"module_start"}
{"config":{"storage_config":{"storage_durable_init_threads":4,"durable_check_interval":60.0,"durable_check_deadline":5.0,"durable_reinit_deadline":5.0,"durable_critical_threshold":2,"durable_s
{"check_interval":60.0,"check_deadline":5.0,"reinit_deadline":5.0,"critical_threshold":2,"logger":"quiver.durable_storage.monitor","level":"info","timestamp":"2024-04-02T13:48:38.458634Z","act
{"config":{"storage_durable_init_threads":4,"durable_check_interval":60.0,"durable_check_deadline":5.0,"durable_reinit_deadline":5.0,"durable_critical_threshold":2,"durable_s3_writes_in_progre
{"module":"PolicyModule","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.459158Z","action":"module_start"}
{"flight_count_limit":50000,"reporting_engines":["PointInTimeReporting","PrometheusMetricReporting"],"logger":"quiver.policy","level":"info","timestamp":"2024-04-02T13:48:38.459811Z","action":
{"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.459950Z","action":"dispatching_existing_flight_events"}
{"name":"gooddata-cn-quiver-xtab-65d48598cc-sdsql","peer_url":"grpc://100.64.86.230:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.460143Z","action":"handler_u
{"serializable":true,"page_size":5000,"load_revision":7202,"logger":"quiver.etcd.admin","level":"info","timestamp":"2024-04-02T13:48:38.460385Z","action":"load_recycle_bin"}
{"name":"gooddata-cn-quiver-xtab-65d48598cc-wzlmc","peer_url":"grpc://100.64.78.100:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.460579Z","action":"handler_u
{"cached_amount":null,"total_amount":null,"policy_id":"default","policy_type":"HierarchicalLimitPolicyConfig","logger":"quiver.policy.limit","level":"info","timestamp":"2024-04-02T13:48:38.462
{"storage_id":"s3_gdc_quiver","storage_type":"s3","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.463219Z","action":"storage_added"}
{"duration":0.0030522089800797403,"events":0,"logger":"quiver.etcd.admin","level":"info","timestamp":"2024-04-02T13:48:38.463823Z","action":"recycle_bin_loaded"}
{"storage_id":"s3_gdc_quiver","storage_state":"available","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.464009Z","action":"storage_state_changed"}
{"serializable":true,"page_size":5000,"load_revision":7202,"logger":"quiver.etcd.flights","level":"info","timestamp":"2024-04-02T13:48:38.464259Z","action":"load_existing_flights"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.stor
{"storage_id":"s3_gdc_quiver","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.465008Z","action":"storage_initialized"}
{"storclas_id":"connector_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563410Z","action":"storclas_version_added"}
{"storclas_id":"connector_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563616Z","action":"storclas_version_activated"}
{"storclas_id":"connector_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563846Z","action":"storclas_state_changed"}
{"duration":0.09894363599596545,"logger":"quiver.etcd.flights","level":"info","timestamp":"2024-04-02T13:48:38.564152Z","action":"existing_flights_processed"}
{"storclas_id":"files_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.564481Z","action":"storclas_version_added"}
{"storclas_id":"files_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.564912Z","action":"storclas_version_activated"}
{"storclas_id":"files_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565054Z","action":"storclas_state_changed"}
{"storclas_id":"raw_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565317Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565457Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache_non_durable","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565600Z","action":"storclas_state_changed"}
{"storclas_id":"raw_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565841Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565980Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache_tmp","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.566228Z","action":"storclas_state_changed"}
{"start_rev":7203,"logger":"quiver.etcd.watcher","level":"info","timestamp":"2024-04-02T13:48:38.566811Z","action":"watcher_start_watch"}
{"logger":"quiver.etcd.watcher","level":"info","timestamp":"2024-04-02T13:48:38.568757Z","action":"etcd_watcher_started"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.storage.durable_s3","level":"info","times
{"storclas_id":"raw_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.236729Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.236950Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237096Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237399Z","action":"storclas_version_added"}
{"storclas_id":"result_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237716Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache_non_durable","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237827Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238048Z","action":"storclas_version_added"}
{"storclas_id":"result_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238177Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache_tmp","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238302Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238540Z","action":"storclas_version_added"}
{"storclas_id":"result_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238687Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238816Z","action":"storclas_state_changed"}
{"logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238948Z","action":"storage_service_initialized"}
{"node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","state":"up","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:39.240996Z","action":"set_node_state"}
{"name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","peer_url":"grpc://100.64.77.216:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:39.247645Z","action":"shard_up"}
{"logger":"quiver.shard.replication","level":"info","timestamp":"2024-04-02T13:48:44.246426Z","action":"replication_started"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.storage.durable_s3","level":"info","times
{"name":"gooddata-cn-quiver-cache-6655b46df8-d6fjd","peer_url":"grpc://100.64.84.222:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:49:53.225438Z","action":"shard_up"}
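Since the log lines are JSON, a few fields can be pulled out mechanically rather than read by eye. The sketch below is a rough illustration only: it uses three complete lines copied from the log above, and the variable names are my own.

```python
import json

# Three complete lines copied verbatim from the quiver-cache log above.
LOG_LINES = [
    '{"memory_limit":805306368,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.344658Z","action":"server_health_monitor_started"}',
    '{"host":"0.0.0.0","port":8877,"logger":"quiver.health_check_http_server","level":"info","timestamp":"2024-04-02T13:48:38.433063Z","action":"health_server_started"}',
    '{"node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","state":"up","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:39.240996Z","action":"set_node_state"}',
]

# Index the records by their "action" field.
events = {}
for line in LOG_LINES:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue  # lines truncated in a paste would not parse; skip them
    events[record["action"]] = record

memory_limit_mi = events["server_health_monitor_started"]["memory_limit"] / 1024**2
health_port = events["health_server_started"]["port"]
node_state = events["set_node_state"]["state"]

print("memory limit (Mi):", memory_limit_mi)
print("health server port:", health_port)
print("node state:", node_state)
```

For these lines the memory limit works out to the 768Mi set in the values file, the health server reports port 8877 (the port the probes target), and the node state is `up`.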
j
Thanks Brian, I’ve reached out to our devs to help to pinpoint the issue here.
Hi Brian, so far we couldn't identify anything in the values.yaml or in the provided logs that would be causing the issue. The limits seem to be OK and the quiver logs describe a successful start. Do those `Liveness probe failed` errors happen all the time, causing CrashLoopBackOff, or were they just temporary during the upgrade? Are both types of quiver pods failing (quiver-xtab and quiver-cache), or just one of them?
Was quiver already in use in `3.4.2` and working fine? Or are you deploying quiver during the upgrade to `3.5.5`?
b
Checking
1. The `Liveness probe failed` errors happen during startup and prevent the pod from staying up, resulting in CrashLoopBackOff.
2. Quiver was working fine on `3.4.2`. It does not work on `3.5.5`.
j
And are both types of quiver pods failing (quiver-xtab and quiver-cache), or just one of them?
b
Just `quiver-cache`
j
Hi, we are still unsure what is happening. Can you please clarify the following for us? Here is the feedback from our devs:
the suspicious things in all this:
• health checking infrastructure is same across quiver nodes. meaning the health check server in cache and xtab are the same. same impl of /ready and /live
• the connection refused is problematic from this perspective. if the pod had trouble then:
◦ on critical & unrecoverable trouble, the pod would not start
◦ on critical but possibly recoverable (retriable) problems, the pod would start but then health checks will report problems via 500 error codes on /live and /ready
I wonder, the connection refused problems for health checks - is this stuff happening all the time for quiver-cache nodes? meaning every single probe run fails like this and then the pod gets restarted?
(trying to root out whether these connection refused errors are just ‘noise’ happening towards the end of the pod life as it is being killed/restarted.. so that we don’t chase ghosts)
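To make the distinction in the feedback above concrete (connection refused versus a 500 from /live or /ready), here is a minimal sketch of a manual check. It is an illustration only: `classify_probe` is a hypothetical helper, not part of quiver or Kubernetes, and it only approximates what a kubelet HTTP probe sees.

```python
import urllib.error
import urllib.request

# Ignore proxy settings from the environment so the request goes straight to the pod.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def classify_probe(url: str, timeout: float = 2.0) -> str:
    """Return a coarse label for one HTTP GET against a health endpoint."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return f"ok (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The health server answered, but with an error status such as 500.
        return f"unhealthy (HTTP {exc.code})"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            # Nothing is accepting connections on that address and port.
            return "connection refused"
        return f"unreachable ({exc.reason})"
    except OSError as exc:
        # For example, a timeout while waiting for the response.
        return f"unreachable ({exc})"
```

Run from somewhere that can reach the pod network, `classify_probe("http://100.64.80.45:8877/live")` (the address from the probe errors earlier in this thread) would show whether the health server is not listening at all or is listening and reporting a problem.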
Would it also be possible for you to replicate the issue, collect diagnostics, and send us the bundle as described here? If so, feel free to send it to support@gooddata.com
b
I will investigate this and get back to you. Thank you!
Email sent to support. As a sanity check, I compared the compiled helm chart for 3.4.2 against 3.5.5 and the files are exactly the same aside from the chart versions.
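A comparison like this can be repeated in a way that ignores the version-only lines. The sketch below is an assumption-laden illustration, not the method used above: `meaningful_diff` is a hypothetical helper, and the default markers are common Helm label keys that may not match what this chart actually emits.

```python
import difflib


def meaningful_diff(old: str, new: str,
                    ignore=("helm.sh/chart:", "app.kubernetes.io/version:")) -> list[str]:
    """Unified diff of two rendered manifests, skipping lines that contain an ignored marker."""
    def keep(text: str) -> list[str]:
        return [ln for ln in text.splitlines() if not any(m in ln for m in ignore)]

    return list(difflib.unified_diff(keep(old), keep(new), "3.4.2", "3.5.5", lineterm=""))


# Tiny made-up manifests for illustration only; real input would be the rendered chart output.
old_manifest = "metadata:\n  labels:\n    helm.sh/chart: gooddata-cn-3.4.2\nspec:\n  replicas: 2\n"
new_manifest = "metadata:\n  labels:\n    helm.sh/chart: gooddata-cn-3.5.5\nspec:\n  replicas: 2\n"

for line in meaningful_diff(old_manifest, new_manifest):
    print(line)
```

An empty result means the two renderings differ only in the ignored lines. Any remaining `+` or `-` lines would be worth a closer look.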