# gooddata-platform
a
Hi GD support team, hope everyone is doing well 😄 I'm currently trying to use FlexConnect as a data source. However, I noticed that if it returns ~200k rows (with, say, about 10-15 columns), it gives this error
Copy code
Reached limit of max size of data returned from data source. The limit is 20971520 bytes
This can occur even if I select just 1 column to be placed under Rows on GD (unless I should change my code to return only that column). Then again, even if I did, this can still arise if the number of rows is large. Is there any way to resolve this? I didn't run into this issue when I was using CSV files as the data source. Does GD perhaps support pagination or returning data in batches? (I've actually tried the latter, by converting the PyArrow table into batches before returning, but I still run into this issue.) Any assistance will be much appreciated! 🙏
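For reference, the batching attempt was roughly like the sketch below (simplified, with illustrative names; splitting into batches doesn't reduce the total number of bytes returned, so the limit is still reached):
Copy code
import pyarrow as pa

def build_result_table() -> pa.Table:
    # Placeholder for the real query against the backing store (e.g. MongoDB).
    return pa.table({"id": list(range(200_000)), "value": [1.0] * 200_000})

def execute():
    table = build_result_table()
    # Splitting the table into record batches changes how the data is streamed,
    # but the total payload size stays the same, so the 20MB limit is still hit.
    batches = table.to_batches(max_chunksize=10_000)
    return pa.RecordBatchReader.from_batches(table.schema, batches)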
👀 1
j
Hi, the limit on the size of a dataset returned by FlexConnect in GD Cloud is indeed 20MB, while for CSV datasets the current limit is 200MB (1TB for all files in total). We may consider increasing the limit to 200MB, as for CSV files. Would that be enough for your use case? One way to mitigate the problem is to analyze the columns in the execution context of the FlexConnect table function and return only the columns needed by the SQL query executed on top of the returned data. You can return only a single column even if the table function is declared with 100 columns but the visualization displays just one. This can help reduce the size in the case of a wide dataset with many columns.
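As a rough sketch of that idea (the exact shape of the execution context may differ, so treat the names below as illustrative rather than the exact FlexConnect API):
Copy code
import pyarrow as pa

def full_table() -> pa.Table:
    # Placeholder for the full, wide result coming from the backing store.
    return pa.table({
        "region": ["a1", "a2", "a1", "a2"],
        "segment": ["b1", "b1", "b2", "b2"],
        "amount": [10, 20, 30, 40],
    })

def call(execution_context: dict) -> pa.Table:
    table = full_table()
    # Illustrative: read the column names requested by the visualization from
    # the execution context and return only those instead of the whole table.
    requested = [c for c in execution_context.get("columns", []) if c in table.column_names]
    return table.select(requested) if requested else table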
a
Hello Jakub, thank you for your reply. Yes, is it possible to also increase the limit on the returned payload? I see, so there's an additional SQL layer applied to the returned data. I suspected that might be the case, because I noticed that I was only requesting 1 column via GD, yet I could return the entire table and it still worked. OK, I will also add this optimization piece.
j
The limit can currently be increased by a configuration change only in GoodData CN, not on the shared clusters of GD Cloud. Would 200MB satisfy your needs?
Btw, in some cases you may also consider pre-aggregating the data if some columns are not returned by the query. E.g. if column X contains additive quantitative information and the data looks like:
Copy code
A    B    X
a1   b1   10
a2   b1   20
a1   b2   30
a2   b2   40
and the visualization requests only columns A and X, you may return
Copy code
A    X
a1   10
a2   20
a1   30
a2   40
and the SQL query on top will aggregate from the detail data, but you can also return
Copy code
A    X
a1   40
a2   60
the SQL query on top will still perform the aggregation, but the amount of data transferred for the aggregation will be lower
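A rough sketch of that pre-aggregation with PyArrow (illustrative only; this is safe only for additive measures like X, and the column names are just examples):
Copy code
import pyarrow as pa

detail = pa.table({
    "A": ["a1", "a2", "a1", "a2"],
    "B": ["b1", "b1", "b2", "b2"],
    "X": [10, 20, 30, 40],
})

# If the visualization requests only A and X, sum X per A before returning,
# so less data has to be transferred; the SQL layer on top aggregates anyway.
pre_aggregated = detail.group_by("A").aggregate([("X", "sum")])
# Resulting columns are A and X_sum: a1 -> 40, a2 -> 60, matching the example above.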
a
Yes, I think so, let's try with 200 MB for now. Understood, and thanks for the example above; I will try to add these optimizations in the code.
Also, on a second note, is it possible to increase the timeout duration? I've noticed that before my MongoDB query finishes and returns the results, the GD platform shows an error, but if given more time we would be able to return the actual data.
j
What is the use case? We have a 180-second timeout, which should be enough for interactive applications. The limit is there to prevent overloading the underlying databases with bad queries that do not provide data in time.
a
Hmm, it's at 180s? I'm currently seeing the error show up within 25-30s. When I query a smaller table, the error doesn't appear. But yes, I do understand the intention. I'll investigate further and provide evidence on this.
Btw Jakub, will the 20MB be lifted to 200MB? If so, roughly when will this change be made?
m
Hi Alson, at this time we have no ETA on when this change will be made. However, I will update our internal ticket with our Engineers now to see if I can get more details. Thanks for bearing with us in the meantime.
a
Thanks Michael, I hope this change wouldn't be too big and wouldn't trouble you guys too much. But I think it's kind of important: I have some data that I'm unable to show right now, even though I've just applied the optimizations suggested by Jakub. Is there a way for me to keep track of this ticket or change?
m
Hi Alson, I hope it's OK for me to step in here. I am afraid that the ticket is internal and cannot be viewed on your side. However, as my colleague said, we will keep you posted. Thank you for your understanding.
a
Noted Moises, thanks for the assistance, looking forward to further updates 😄
Hi @Moises Morales do you happen to know if there's any update on this ticket pls?
m
Hi Alson, thank you for checking in on this feature. I checked our internal ticket, and while it's currently on our radar, it hasn't been prioritised yet as our team is focused on other high-impact initiatives at the moment. If this functionality is important to you, we'd recommend reaching out to your Account Owner, who can help push this update for you.
a
Thanks for your reply and the details, Michael. Sure, understood, I will reach out to her; this is quite important for us, as we may have columns with high cardinality, resulting in data larger than 20MB.
👍 1