# gooddata-cn
p
Hello, we're on GoodData 3.21.0. We're calling afmExecApi through the Python SDK and seeing an error in the logs that we haven't encountered before, which results in 500 errors in the SDK call:
```
errorType=com.gooddata.tiger.grpc.error.GrpcPropagatedServerException, message=UNAVAILABLE: Keepalive failed. The connection is likely gone,<no detail>
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:230)
	Suppressed: The stacktrace has been enhanced by Reactor, refer to additional information below: 
Error has been observed at the following site(s):
	*__checkpoint ⇢ AuthenticationWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ OAuth2LoginAuthenticationWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ OAuth2AuthorizationRequestRedirectWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ OAuth2AuthorizationRequestRedirectWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ CorsWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ OrganizationWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]
	*__checkpoint ⇢ com.gooddata.tiger.httplogging.LogbookWritingFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ HTTP POST "/api/v1/actions/workspaces/d8eb767b09eb2ab637e89663ec2d8d4a/execution/afm/execute" [ExceptionHandlingWebHandler]
Original Stack Trace:
		at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:230)
		at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertToTransferableException(ErrorPropagation.kt:216)
		at com.gooddata.tiger.grpc.error.ErrorPropagationKt.clientCatching(ErrorPropagation.kt:65)
		at com.gooddata.tiger.grpc.error.ErrorPropagationKt$clientCatching$1.invokeSuspend(ErrorPropagation.kt)
		at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
		at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:28)
		at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:99)
		at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
		at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:102)
		at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)
		at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)
		at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
		at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:811)
		at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:715)
		at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:702)
```
We're wondering what this means and what we can do to debug it further. What does "Keepalive failed. The connection is likely gone" mean within the service itself?
j
Hi Pete, thanks for sharing this. A few questions:
• Could you confirm whether this happens consistently or only sporadically?
• Do you notice it only during specific types of calls (e.g., large reports, heavy datasets)?
• If you retry the call, does it typically succeed?
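If a plain retry does help, a small client-side backoff is usually enough to ride out drops like this. Purely as a sketch (not something built into the SDK), this is what I mean, calling the raw execute endpoint from your trace; the host, token, payload shape, and retry policy are illustrative:

```python
import time
import requests

HOST = "https://your-gooddata-host"                # illustrative
API_TOKEN = "<api-token>"                          # illustrative
WORKSPACE_ID = "d8eb767b09eb2ab637e89663ec2d8d4a"  # workspace id from the stack trace
EXECUTE_URL = f"{HOST}/api/v1/actions/workspaces/{WORKSPACE_ID}/execution/afm/execute"


def execute_afm(exec_definition: dict, attempts: int = 3) -> dict:
    """POST an AFM execution, retrying transient 5xx responses with backoff."""
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    }
    for attempt in range(1, attempts + 1):
        response = requests.post(
            EXECUTE_URL, json=exec_definition, headers=headers, timeout=60
        )
        # Below 500 means success or a caller error -- retrying won't help either way.
        if response.status_code < 500:
            response.raise_for_status()
            return response.json()
        # 5xx: give up (and raise) on the last attempt, otherwise back off and retry.
        if attempt == attempts:
            response.raise_for_status()
        time.sleep(2 ** attempt)
```

If the retried call succeeds every time, that would point to a transient connection drop inside the platform rather than anything wrong with the execution definition itself.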
p
Hi Julius, thanks for getting back to us. It seems this is sporadic. We noticed it yesterday while debugging a separate issue and wanted to check whether it was causing problems in one of our applications, but it seems it was not. It would still be useful to know what causes these messages. Pete
j
Hi Pete, I'll try to dig something up on this. Is this the whole stack trace from the logs, and is the stack trace always the same? Based on this log sample, it most likely means that the gRPC keepalive ping mechanism (gRPC is used internally inside the CN application for communication between microservices) failed or didn't get a response in time while OAuth authentication was being processed; the auth checkpoints in the trace show where the connection was dropped. Could you check whether auth-service was healthy at the time, or whether there was higher network activity?
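Just for background on the message itself: keepalive here is the standard gRPC HTTP/2 ping. Each side of a connection periodically sends a ping, and if the ping isn't acknowledged within the configured timeout the connection is declared dead, so any in-flight RPC fails with UNAVAILABLE, which then surfaces as the 500 on your REST call. The snippet below only illustrates the knobs involved, using the Python grpc package and a made-up target address; CN's internal services run on the Java/Kotlin gRPC stack with their own settings:

```python
import grpc

# Standard gRPC keepalive channel options; the option names and semantics are
# part of the public gRPC API. The target address below is made up for illustration.
KEEPALIVE_OPTIONS = [
    # Send an HTTP/2 PING roughly every 30 s to verify the connection is alive.
    ("grpc.keepalive_time_ms", 30_000),
    # If a PING is not acknowledged within 10 s, the connection is treated as
    # gone and pending calls fail with UNAVAILABLE ("Keepalive failed.
    # The connection is likely gone").
    ("grpc.keepalive_timeout_ms", 10_000),
    # Allow keepalive pings even when no RPC is currently in flight.
    ("grpc.keepalive_permit_without_calls", 1),
]

channel = grpc.insecure_channel("localhost:50051", options=KEEPALIVE_OPTIONS)
```

So the message usually means one side stopped answering pings, typically because the peer pod restarted, was under heavy load, or the network between the services briefly dropped, rather than anything specific to the AFM request.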