Brian M
03/29/2024, 2:05 PM
Upgrading from 3.4.2 to 3.5.5, and the quiver-cache pods' health checks do not pass:
Liveness probe failed: Get "http://100.64.80.45:8877/live": dial tcp 100.64.80.45:8877: connect: connection refused
Readiness probe failed: Get "http://100.64.80.45:8877/ready": dial tcp 100.64.80.45:8877: connect: connection refused
All other upgrades have gone smoothly up until 3.5.5. I did see the thread above about using Mi instead of Gi for quiver resources and I have made that change.
Ivana Gasparekova
03/29/2024, 3:55 PM
Brian M
03/29/2024, 5:44 PM
Brian M
04/01/2024, 2:49 PM
Michael Ullock
04/01/2024, 4:35 PM
Brian M
04/01/2024, 5:00 PM
Jan Kos
04/02/2024, 1:39 PM
Brian M
04/02/2024, 1:56 PM
gooddata-cn:
  quiver:
    replicaCount:
      cache: 2
      xtab: 2
    image:
      name: quiver
    resources:
      cache:
        limits:
          cpu: 300m
          memory: 768Mi
          ephemeral-storage: 1.3Gi # 1Gi storage.cache.diskSize + 256Mi storage.serverWorkDirSize
        requests:
          cpu: 100m
          memory: 256Mi
          ephemeral-storage: 1.3Gi
      xtab:
        limits:
          cpu: 500m
          memory: 512Mi
          ephemeral-storage: 300Mi
        requests:
          cpu: 200m
          memory: 256Mi
          ephemeral-storage: 300Mi
    podDisruptionBudget:
      maxUnavailable: ''
      minAvailable: 1
    storage:
      serverWorkDir: "/quiver/server/data"
      serverWorkDirSize: 256Mi # present in cache & xtab deployments
      cache:
        # emptyDir size
        diskSize: 1Gi #cache
        # Real maximum size of caches (lower than diskSize because of flight schemas size overhead)
        diskCacheSize: 900Mi
        # Path where to store non-durable caches
        diskCachePath: "/quiver/cache/data"
    # -- Type of storage where to store durable caches ("" or "S3" or "FS"). Any change of this value requires ETCD wipe to refresh configuration!
    durableStorageType: "S3"
    # -- S3 durable storage configuration
    s3DurableStorage:
      s3Bucket: 'dev-gooddata-cn-cache-us-east-1'
      s3BucketPrefix: ''
      s3Region: 'us-east-1'
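The inline comment on the ephemeral-storage limit describes a small sum (cache disk size plus server work dir size). A minimal Python sketch of that arithmetic follows, using only the values quoted above. The helper `to_bytes` is hypothetical and handles just the byte-size suffixes; it does not cover CPU quantities such as `300m`.

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes byte quantities.
# (Hypothetical helper; the milli suffix used for CPU is intentionally not handled.)
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

def to_bytes(quantity: str) -> Decimal:
    """Convert a quantity such as '768Mi' or '1.3Gi' to a number of bytes."""
    # Try two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(quantity)  # plain number of bytes

# Values copied from the quiver-cache settings above.
disk_size = to_bytes("1Gi")          # storage.cache.diskSize
work_dir_size = to_bytes("256Mi")    # storage.serverWorkDirSize
ephemeral_limit = to_bytes("1.3Gi")  # resources.cache.limits.ephemeral-storage
memory_limit = to_bytes("768Mi")     # resources.cache.limits.memory

needed = disk_size + work_dir_size
print(int(memory_limit))                          # 805306368
print(needed <= ephemeral_limit)                  # True
print(float((ephemeral_limit - needed) / 2**20))  # headroom in Mi: 51.2
```

With these numbers the two emptyDir sizes add up to 1.25Gi, which leaves roughly 51Mi of headroom under the 1.3Gi limit.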
Brian M
04/02/2024, 1:57 PM
quiver-cache
{"module":"quiver_shard","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:14.353895Z","action":"loading_module"}
{"module":"quiver_policy","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:37.845827Z","action":"loading_module"}
{"catalog_db_dir":null,"maintenance_config":{"vacuum":{"retry_period":0.1,"retry_limit":10,"upsert_threshold":5000,"delete_threshold":500,"max_time_threshold":1800},"maintenance_period":1.0,"s
{"memory_limit":805306368,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.344658Z","action":"server_health_monitor_started"}
{"platform":"Linux-5.15.148-x86_64-with-glibc2.36","python_version":"3.11.8","arrow_version":"15.0.0","quiver_version":"0.133.0","server_heartbeat_interval":0.1,"server_trim_interval":5,"serve
{"name":"flight-window-svc","affinity":"FLIGHT_PATH","submit_on_get":true,"submit_on_info":true,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.348983Z","action":"regi
{"name":"policy_report","affinity":null,"submit_on_get":false,"submit_on_info":true,"logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.349161Z","action":"registering_serv
{"keyspace_prefix":"gooddata-cn|","node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","etcd_config":{"etcd_keyspace_prefix":"gooddata-cn","etcd_registration_ttl":30,"etcd_registration_hear
{"type":"main","node":{"order":0,"host":"gooddata-cn-etcd-0.gooddata-cn-etcd-headless","port":2379,"secure":false},"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.3
{"node":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","lease_id":8274113667580754860,"lease_id_hex":"0x72d38e8ae35afbac","registered_rev":7202,"logger":"quiver.etcd.heartbeat","level":"info","ti
{"cluster_id":"6b02ea2c920c49a7b190f3d30193d010","logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.379796Z","action":"etcd_conn_established"}
{"type":"initial_load","node":{"order":0,"host":"gooddata-cn-etcd-0.gooddata-cn-etcd-headless","port":2379,"secure":false},"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13
{"node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","node_type":"shard","cluster_id":"6b02ea2c920c49a7b190f3d30193d010","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:
{"host":"0.0.0.0","port":8877,"logger":"quiver.health_check_http_server","level":"info","timestamp":"2024-04-02T13:48:38.433063Z","action":"health_server_started"}
{"listen_url":"grpc://0.0.0.0:16001","client_url":"grpc://100.64.77.216:16001","peer_url":"grpc://100.64.77.216:16001","tls":false,"logger":"quiver.flight.server","level":"info","timestamp":"2
{"module":"CacheShardModule","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.455751Z","action":"module_start"}
{"config":{"storage_config":{"storage_durable_init_threads":4,"durable_check_interval":60.0,"durable_check_deadline":5.0,"durable_reinit_deadline":5.0,"durable_critical_threshold":2,"durable_s
{"check_interval":60.0,"check_deadline":5.0,"reinit_deadline":5.0,"critical_threshold":2,"logger":"quiver.durable_storage.monitor","level":"info","timestamp":"2024-04-02T13:48:38.458634Z","act
{"config":{"storage_durable_init_threads":4,"durable_check_interval":60.0,"durable_check_deadline":5.0,"durable_reinit_deadline":5.0,"durable_critical_threshold":2,"durable_s3_writes_in_progre
{"module":"PolicyModule","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.459158Z","action":"module_start"}
{"flight_count_limit":50000,"reporting_engines":["PointInTimeReporting","PrometheusMetricReporting"],"logger":"quiver.policy","level":"info","timestamp":"2024-04-02T13:48:38.459811Z","action":
{"logger":"quiver.etcd.conn","level":"info","timestamp":"2024-04-02T13:48:38.459950Z","action":"dispatching_existing_flight_events"}
{"name":"gooddata-cn-quiver-xtab-65d48598cc-sdsql","peer_url":"grpc://100.64.86.230:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.460143Z","action":"handler_u
{"serializable":true,"page_size":5000,"load_revision":7202,"logger":"quiver.etcd.admin","level":"info","timestamp":"2024-04-02T13:48:38.460385Z","action":"load_recycle_bin"}
{"name":"gooddata-cn-quiver-xtab-65d48598cc-wzlmc","peer_url":"grpc://100.64.78.100:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:38.460579Z","action":"handler_u
{"cached_amount":null,"total_amount":null,"policy_id":"default","policy_type":"HierarchicalLimitPolicyConfig","logger":"quiver.policy.limit","level":"info","timestamp":"2024-04-02T13:48:38.462
{"storage_id":"s3_gdc_quiver","storage_type":"s3","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.463219Z","action":"storage_added"}
{"duration":0.0030522089800797403,"events":0,"logger":"quiver.etcd.admin","level":"info","timestamp":"2024-04-02T13:48:38.463823Z","action":"recycle_bin_loaded"}
{"storage_id":"s3_gdc_quiver","storage_state":"available","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.464009Z","action":"storage_state_changed"}
{"serializable":true,"page_size":5000,"load_revision":7202,"logger":"quiver.etcd.flights","level":"info","timestamp":"2024-04-02T13:48:38.464259Z","action":"load_existing_flights"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.stor
{"storage_id":"s3_gdc_quiver","logger":"quiver.durable_storage.service","level":"info","timestamp":"2024-04-02T13:48:38.465008Z","action":"storage_initialized"}
{"storclas_id":"connector_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563410Z","action":"storclas_version_added"}
{"storclas_id":"connector_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563616Z","action":"storclas_version_activated"}
{"storclas_id":"connector_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.563846Z","action":"storclas_state_changed"}
{"duration":0.09894363599596545,"logger":"quiver.etcd.flights","level":"info","timestamp":"2024-04-02T13:48:38.564152Z","action":"existing_flights_processed"}
{"storclas_id":"files_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.564481Z","action":"storclas_version_added"}
{"storclas_id":"files_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.564912Z","action":"storclas_version_activated"}
{"storclas_id":"files_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565054Z","action":"storclas_state_changed"}
{"storclas_id":"raw_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565317Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565457Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache_non_durable","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565600Z","action":"storclas_state_changed"}
{"storclas_id":"raw_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565841Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.565980Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache_tmp","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:38.566228Z","action":"storclas_state_changed"}
{"start_rev":7203,"logger":"quiver.etcd.watcher","level":"info","timestamp":"2024-04-02T13:48:38.566811Z","action":"watcher_start_watch"}
{"logger":"quiver.etcd.watcher","level":"info","timestamp":"2024-04-02T13:48:38.568757Z","action":"etcd_watcher_started"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.storage.durable_s3","level":"info","times
{"storclas_id":"raw_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.236729Z","action":"storclas_version_added"}
{"storclas_id":"raw_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.236950Z","action":"storclas_version_activated"}
{"storclas_id":"raw_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237096Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237399Z","action":"storclas_version_added"}
{"storclas_id":"result_cache_non_durable","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237716Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache_non_durable","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.237827Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238048Z","action":"storclas_version_added"}
{"storclas_id":"result_cache_tmp","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238177Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache_tmp","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238302Z","action":"storclas_state_changed"}
{"storclas_id":"result_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238540Z","action":"storclas_version_added"}
{"storclas_id":"result_cache","storclas_ver":"v1","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238687Z","action":"storclas_version_activated"}
{"storclas_id":"result_cache","state":"enabled","logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238816Z","action":"storclas_state_changed"}
{"logger":"quiver.storage","level":"info","timestamp":"2024-04-02T13:48:39.238948Z","action":"storage_service_initialized"}
{"node_name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","state":"up","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:39.240996Z","action":"set_node_state"}
{"name":"gooddata-cn-quiver-cache-6655b46df8-bqlcq","peer_url":"grpc://100.64.77.216:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:39.247645Z","action":"shard_up"}
{"logger":"quiver.shard.replication","level":"info","timestamp":"2024-04-02T13:48:44.246426Z","action":"replication_started"}
{"storage_id":"s3_gdc_quiver","bucket":"dev-gooddata-cn-cache-us-east-1","prefix":null,"resolved_prefix":null,"region":"us-east-1","endpoint_override":"","scheme":"https","logger":"quiver.storage.durable_s3","level":"info","times
{"name":"gooddata-cn-quiver-cache-6655b46df8-d6fjd","peer_url":"grpc://100.64.84.222:16001","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:49:53.225438Z","action":"shard_up"}
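One thing that can be read off a log like this is how long the process ran before the health check server began listening on port 8877. The Python sketch below measures that from two complete lines copied from the log above. The function name `seconds_until` is made up for illustration, and the first captured line is only assumed to be close to process start. Whether this delay relates to the probe failures depends on the probe settings, which are not shown in this thread.

```python
import json
from datetime import datetime

# Two complete lines copied verbatim from the quiver-cache log above.
LOG_LINES = [
    '{"module":"quiver_shard","logger":"quiver.server","level":"info","timestamp":"2024-04-02T13:48:14.353895Z","action":"loading_module"}',
    '{"host":"0.0.0.0","port":8877,"logger":"quiver.health_check_http_server","level":"info","timestamp":"2024-04-02T13:48:38.433063Z","action":"health_server_started"}',
]

def parse_ts(value: str) -> datetime:
    # Replace the trailing Z so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def seconds_until(lines, action):
    """Seconds from the first parseable log line to the first line with `action`."""
    first = None
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that were truncated when the log was copied
        if not isinstance(event, dict) or "timestamp" not in event:
            continue
        ts = parse_ts(event["timestamp"])
        if first is None:
            first = ts
        if event.get("action") == action:
            return (ts - first).total_seconds()
    return None  # the action never appeared

print(seconds_until(LOG_LINES, "health_server_started"))  # 24.079168
```

In this capture the health server starts about 24 seconds after the first logged line; a request to /live or /ready sent before that point would have nothing listening to answer it.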
Jan Kos
04/02/2024, 4:11 PM
Jan Kos
04/03/2024, 8:34 AM
Do the Liveness probe failed errors happen all the time, causing CrashLoopBackOff, or were they just temporary during the upgrade?
Are both types of quiver pods failing (quiver-xtab and quiver-cache), or just one of them?
Jan Kos
04/03/2024, 8:36 AM
Was quiver already deployed on 3.4.2 and was it ok? Or are you deploying quiver during the upgrade to 3.5.5?
Brian M
04/03/2024, 3:42 PM
Brian M
04/03/2024, 3:45 PM
1. Liveness probe failed errors happen during startup and prevent the pod from staying up, resulting in CrashLoopBackOff.
2. Quiver was working fine on 3.4.2. It does not work on 3.5.5.
Jan Kos
04/03/2024, 4:01 PM
Are both types of quiver pods failing (quiver-xtab and quiver-cache), or just one of them?
Brian M
04/03/2024, 4:32 PM
quiver-cache
Jan Kos
04/05/2024, 9:54 AM
The suspicious things in all this:
• The health checking infrastructure is the same across quiver nodes, meaning the health check server in cache and xtab is the same, with the same implementation of /ready and /live.
• The connection refused is problematic from this perspective. If the pod had trouble, then:
  ◦ on critical and unrecoverable trouble, the pod would not start
  ◦ on critical but possibly recoverable (retriable) problems, the pod would start, but the health checks would then report problems via 500 error codes on /live and /ready
I wonder about the connection refused problems for the health checks: is this happening all the time for quiver-cache nodes? Meaning every single probe run fails like this and then the pod gets restarted?
(Trying to root out whether these connection refused errors are just 'noise' happening towards the end of the pod's life as it is being killed/restarted, so that we don't chase ghosts.)
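The distinction drawn above (a refused connection versus an HTTP 500 from a running health server) can be checked by hand. Below is a minimal Python sketch using only the standard library. The function name `classify_probe` and its three labels are made up for illustration, and the usage at the bottom assumes the pod's port 8877 has been made reachable on localhost (for example via `kubectl port-forward`).

```python
import urllib.error
import urllib.request

# Opener with no proxies, so the request goes straight to the given address.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def classify_probe(url: str, timeout: float = 2.0) -> str:
    """Return 'ok', 'unhealthy' (server answered with an error), or 'unreachable'."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return "ok" if 200 <= resp.status < 400 else "unhealthy"
    except urllib.error.HTTPError:
        # The health server is up and answered, but with an error code such as 500.
        return "unhealthy"
    except (urllib.error.URLError, OSError):
        # Nothing accepted the connection (e.g. connection refused) or it timed out.
        return "unreachable"

if __name__ == "__main__":
    # Assumption: port 8877 of a quiver-cache pod is forwarded to localhost.
    for path in ("/live", "/ready"):
        print(path, classify_probe(f"http://127.0.0.1:8877{path}"))
```

An `unreachable` result corresponds to the connection refused case in the probe messages, while `unhealthy` corresponds to the 500 responses described for recoverable problems.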
Jan Kos
04/05/2024, 9:57 AM
Brian M
04/08/2024, 12:47 PM
Brian M
04/08/2024, 3:44 PM