# gooddata-cn
Dongfeng Lu
Hi. I have set up CloudWatch for my GDCN cluster and sent all logs to CloudWatch. With so many microservices, a single web request may pass through many of the services. What are the best practices for using CloudWatch (anomalies, insights) with GDCN to verify or debug the applications?

For instance, when I tried to load the LDM for one of the workspaces with "/modeler/#/f97c8a4......", it seemed slow, taking about 40 seconds to load, and I'd like to know where the bottleneck is. I can use the time interval (since I know when it happened, and I am the only one on the cluster at this point) to get back 672 records, and when I filtered them by the workspace ID "f97c8a4", I got only 22 records, from "metadata-api" and "ingress-nginx-controller". From these logs, I don't seem to get much useful information. Each log contains tons of information about the pod configuration, maybe useful for other things, but not for my purpose. If I concentrate on "msg" from "metadata-api", I see 2 "Workspace meta configuration created", 2 "Retrieve logical model.", and 9 "HTTP response". The logs for "ingress-nginx-controller" show a lot of HTTP calls generated from the original call, but I am not sure about the format and which value shows the response time. Of course, I am also concerned about the more than 600 logs we filtered out with the workspace ID, and what role they play in this request.

So how should we use the logs? How do we tie together all the logs related to one request? I used the workspace ID here for testing, but in production there could be many simultaneous requests related to the same workspace ID, so how do we tell them apart? More importantly, can we get more useful messages? Do we need to go to the "DEBUG" level of logging? Any help and insight is appreciated.
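For context, the kind of query described above can also be scripted. Here is a minimal sketch, assuming boto3 credentials are configured and a hypothetical log group name such as "/gdcn/cluster-logs" (substitute whichever group your GDCN pods actually ship their logs to):

```python
# Minimal sketch: run a CloudWatch Logs Insights query over a known time window
# and keep only events that mention the workspace ID. The log group name is a
# placeholder -- use whichever group your GDCN pods ship their logs to.
import time

import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, @logStream, @message
| filter @message like /f97c8a4/
| sort @timestamp asc
| limit 200
"""

started = logs.start_query(
    logGroupName="/gdcn/cluster-logs",      # hypothetical log group name
    startTime=int(time.time()) - 15 * 60,   # last 15 minutes
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the matching events.
while True:
    result = logs.get_query_results(queryId=started["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```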
Jan
Hi @Dongfeng Lu. Generally, when you want to track a particular request and how it went through the application, we use traces. When a request is made (through the UI or directly via the API), a traceId is generated for that request. You can then search through the logs to see how the request went through each microservice and limit the results by that ID. Try checking the browser dev tools / network tab for a request; in the response headers you can see the traceId generated and assigned to the request (see screenshot).
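A minimal sketch of the search Jan describes, assuming a hypothetical log group name and a placeholder traceId copied from the browser's response headers (the GDCN log group name will differ per installation):

```python
# Minimal sketch: follow one request across the GDCN microservices by searching
# a log group for its traceId. Log group name and traceId are placeholders.
import time

import boto3

logs = boto3.client("logs")

trace_id = "0123456789abcdef"                # traceId copied from response headers
now_ms = int(time.time() * 1000)

response = logs.filter_log_events(
    logGroupName="/gdcn/cluster-logs",       # hypothetical log group name
    filterPattern=f'"{trace_id}"',           # events containing the traceId
    startTime=now_ms - 15 * 60 * 1000,       # last 15 minutes (milliseconds)
    endTime=now_ms,
)

for event in response["events"]:
    print(event["timestamp"], event["logStreamName"], event["message"][:200])
```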
Dongfeng Lu
Hi Jan. Thanks for the response, and for that picture showing how to tie the trace ID from the browser to the log. However, I still feel that I did not get enough information from the log. For instance, in the attached screenshot related to one particular traceId (281c18b5218a9904) for the request "/api/v1/layout/workspaces/f97c8a4d2c0344d284c5eb6ca386af2a/logicalModel?includeParents=true", I only see 4 records, with the following "msg" values:
```
Server intercept call
Server intercept call
Retrieve logical model.
HTTP response
```
The gap between the timestamps of the last two msgs is about 15 seconds, implying the "retrieving" part takes the longest for this request. But where do I go from here? Does it mean we need to check the cache or the database? Also, when loading the LDM once, I saw both "logicalModel?includeParents=true" and "logicalModel?includeParents=false", each taking 15 or 16 seconds, so the whole page takes more than 30 seconds to load. Is it correct to load it twice? Thanks.
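To make such gaps explicit, one option (a rough sketch, not GDCN tooling) is to sort the events collected for a single traceId and print the delay between consecutive messages, so the slowest step stands out:

```python
# Rough sketch: given the events collected for one traceId (e.g. from the
# filter_log_events call earlier in the thread), print the delay between
# consecutive log messages so the slowest step stands out.
from datetime import datetime, timezone


def print_step_delays(events):
    """events: list of dicts with 'timestamp' (epoch milliseconds) and 'message'."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    previous_ts = None
    for event in ordered:
        ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        delay = "" if previous_ts is None else f"+{(event['timestamp'] - previous_ts) / 1000:.1f}s"
        print(f"{ts.isoformat()} {delay:>8} {event['message'][:120]}")
        previous_ts = event["timestamp"]
```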
Jan
The logical data model has to be retrieved from the metadata database. Are you using your own RDS, or did you use the built-in Postgres from the Helm chart installation? The metadata-api pods also play a big part in this request; what is your sizing of the metadata-api pods? How big is the LDM? Could you post the LDM (retrievable via the LDM layout API)? I will double-check internally why there are two calls to get the LDM.
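For reference, a minimal sketch of pulling the declarative LDM through the layout API mentioned above; the hostname and API token are placeholders, and the workspace ID is the one from the earlier request:

```python
# Minimal sketch: download the workspace LDM via the declarative layout API so
# it can be shared. Hostname and API token are placeholders; the workspace ID
# is the one from the request earlier in the thread.
import requests

GDCN_HOST = "https://gooddata.example.com"   # hypothetical hostname
WORKSPACE_ID = "f97c8a4d2c0344d284c5eb6ca386af2a"
API_TOKEN = "..."                            # a GoodData.CN API token

response = requests.get(
    f"{GDCN_HOST}/api/v1/layout/workspaces/{WORKSPACE_ID}/logicalModel",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=60,
)
response.raise_for_status()

# Save the declarative LDM to a file that can be attached to the thread.
with open("ldm.json", "w") as f:
    f.write(response.text)
```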
Dongfeng Lu
Hi Jan. We are using external RDS and ElastiCache, following the steps in https://www.gooddata.com/docs/cloud-native/3.7/deploy-and-install/cloud-native/environment/aws. For the Metadata API, we specified:

```yaml
metadataApi:
  encryptor:
    enabled: false
  resources:
    limits:
      cpu: 1250m
      memory: 1300Mi
    requests:
      cpu: 1250m
      memory: 1300Mi
```

I am sending you two LDM files. One is for "GDCN-LDM-Dev Master converted from platform.json", which is the workspace described above with the long loading time. We don't have a data source connected to this model yet; it was created by converting our Platform LDM. The other is a reference using the GoodData demo data, for which the data was imported into Redshift and the LDM was created by connecting to the data source. In other words, it should be purely created by GDCN. For both LDMs, I observed both "logicalModel?includeParents=true" and "logicalModel?includeParents=false".
Jan
Hi, I’m sorry for the delay getting back to you. Retrieving the LDM can be an expensive operation and take some time when the LDM is bigger. However, I reviewed this internally, and there is no need for the UI client to call both "logicalModel?includeParents=true" and "logicalModel?includeParents=false" every time it renders the LDM. I created an internal ticket for the optimization; however, there is no specific timeline for the fix yet.
Dongfeng Lu
Good to hear. Thank you.