Solved

GoodData created 21k threads!

  • 9 July 2021
  • 9 replies
  • 142 views

  • Known Participant
  • 22 replies

Hi there!

We have an issue with number of threads GoodData.CN is creating.

Please help to fix this :-)

icon

Best answer by Jan Chochol 9 July 2021, 13:11

View original

9 replies

Hi @NickK,

Could you please show little bit more of command line of problematic Java process?

We are running multiple Java services and all their command line starts with same, so it is not possible to identify problematic process.

We need part after “-Xlog:gc*:file=/tmp/aio/” (there should be internal service name - one of: "afm-exec-api", "auth-service", "metadata-api", "result-cache", "sql-executor", "scan-model").

Also if you can include sample of most offending thread names (in “htop” → F2 → “Display options” and select “Tree view” and “Show custom thread names”.

Here it is:
 

$ ps aux |grep java
nfsnobo+ 9637 6.0 6.0 34731572 1969836 ? Ssl июл08 58:05 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/afm-exec-api-gc.log:time,level,tags -Dlogging.level.root=WARN -Djava.security.egd=file:/dev/./urandom -cp /opt/afm-exec-api org.springframework.boot.loader.JarLauncher
root 9643 5.2 2.3 6875976 762920 ? Ssl июл08 49:40 /opt/java/openjdk/bin/java -cp /opt/apache-pulsar-2.6.1/conf:::/opt/apache-pulsar-2.6.1/lib/*: -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Dpulsar.allocator.exit_on_oom=true -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -Xms128m -Xmx128m -XX:MaxDirectMemorySize=256m -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dpulsar.log.appender=RoutingAppender -Dpulsar.log.dir=/logs -Dpulsar.log.level=WARN -Dpulsar.routing.appender.default=Console -Dpulsar.functions.process.container.log.dir=/logs -Dpulsar.functions.java.instance.jar=/opt/apache-pulsar-2.6.1/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/opt/apache-pulsar-2.6.1/instances/python-instance/python_instance_main.py -Dpulsar.functions.extra.dependencies.dir=/opt/apache-pulsar-2.6.1/instances/deps -Dpulsar.functions.instance.classpath=/opt/apache-pulsar-2.6.1/conf:::/opt/apache-pulsar-2.6.1/lib/*: -javaagent:/opt/apache-pulsar-2.6.1/lib/org.aspectj-aspectjweaver-1.9.2.jar -Dzookeeper.4lw.commands.whitelist=* -Dzookeeper.snapshot.trust.empty=true -Dpulsar.log.file=pulsar-standalone.log org.apache.pulsar.PulsarStandaloneStarter --config /opt/apache-pulsar-2.6.1/conf/standalone.conf -nfw -nss
nfsnobo+ 9645 0.1 2.3 10255176 757284 ? Ssl июл08 1:51 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/scan-model-gc.log:time,level,tags -Dlogging.level.root=WARN -Djava.security.egd=file:/dev/./urandom -cp /opt/scan-model org.springframework.boot.loader.JarLauncher
nfsnobo+ 9646 2.2 2.9 10282424 968252 ? Ssl июл08 21:10 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/sql-executor-gc.log:time,level,tags -Dlogging.level.root=WARN -Djava.security.egd=file:/dev/./urandom -Ddata.source.extra.drivers.directory=/opt/extra-drivers -cp /opt/sql-executor org.springframework.boot.loader.JarLauncher
nfsnobo+ 9648 3.0 2.7 10326344 902536 ? Ssl июл08 29:27 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/result-cache-gc.log:time,level,tags -Dlogging.level.root=WARN -Djava.security.egd=file:/dev/./urandom -cp /opt/result-cache org.springframework.boot.loader.JarLauncher
nfsnobo+ 9650 1.4 4.3 10376688 1415448 ? Ssl июл08 14:06 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/metadata-api-gc.log:time,level,tags -Dlogging.level.root=WARN -Dlogging.level.org.hibernate.SQL=WARN -Dlogging.level.Exposed=WARN -Djava.security.egd=file:/dev/./urandom org.springframework.boot.loader.JarLauncher
nfsnobo+ 9666 0.4 3.1 10400512 1009548 ? Ssl июл08 4:38 java -Xmx4g -Xms128m -XX:+UseStringDeduplication -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -Xlog:gc*:file=/tmp/aio/auth-service-gc.log:time,level,tags -Dlogging.level.root=WARN -Djava.security.egd=file:/dev/./urandom -cp /opt/auth-service org.springframework.boot.loader.JarLauncher

 

 

Is it possible to set up a call with screenshare to fix this?

Unfortunately we do not yet know root cause, so we do not know how to fix it.

Anyway, your logs points to problematic service (PID 9637).

We would appreciate, if you could collect Java thread dump (just run “kill -QUIT 9637” and thread dump should be visible in docker log, but be aware that it can be long) - it will help investigation.

If you want just to fix the service, you can just kill it (“kill 9637”) - services inside GoodData.CN CE are supervised by s6 supervisor, so killed service will be immediately restarted (this service is stateless, so it is safe to restart it).

Here it is.

Ok, I’ve killed the process and number of threads became around 1k total.

But it doesn’t look to solve the issue: I can see that the new process still adds 1-2 threads per second, so it will eventually get to the same point:
 

 

Thanks, you thread dump helped identifying the issue - https://github.com/spring-projects/spring-framework/issues/26263 .

We know how to fix it (update Spring Boot version), but there is nothing to do, until we deliver new version.

 

I can only recommend to kill offending process when it reaches too many threads.

Ok, thanks.

Userlevel 2

For the record, the “graceful” way how to restart supervised service is:

s6-svc -r /run/s6/services/SERVICE-NAME

where SERVICE_NAME is one of supervised services:

  • afm-exec-api
  • aqe
  • auth-service
  • cache-garbage-collector
  • dex
  • metadata-api
  • nginx
  • postgres
  • pulsar
  • redis
  • result-cache
  • scan-model
  • sql-executor

But in the most cases, simple restart of the whole gooddata-cn-ce container is much easier and recommended solution.

Hi @NickK,

Threads problem will be fixed in next version, which is currently schedule to be release at the beginning of August.

Reply