# gooddata-platform
hello, I’m currently struggling with a load where the source table became too big over time, so the load fails with this:
"message":"Feature flag etl.lastRecordDeduplication must be disabled for upload file size larger than %s"
Is there a way to make this work temporarily, before we can work on reducing the dataset size?
Hello Thomas, in general, the error is accurate in its description: the total size of your upload is too large to go through with this feature flag enabled. From there, you have two options:
1. Make sure your upload is under 60 GB of data
2. Disable the feature flag. This can be done at /gdc/projects/workspace_id/config via the gray pages or an API call.
The entry looks like this:
{
  "settingItem": {
    "key": "etl.lastRecordDeduplication",
    "value": "true",
    "source": "catalog",
    "links": {
      "self": "/gdc/projects/workspace_id/config/etl.lastRecordDeduplication"
    }
  }
}
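For the API route, here is a minimal Python sketch that only builds the request that would set the flag to false, without sending it. The PUT method, the body shape (mirroring the entry above), the host name `example.invalid`, and the helper name `build_flag_request` are all assumptions that this thread does not confirm, so check them against the GoodData API reference before using anything like this. Authentication is left out entirely.

```python
import json
import urllib.request


def build_flag_request(host, workspace_id, key, value):
    """Build (but do not send) a request that sets a workspace setting.

    Assumption: the config resource accepts a PUT whose body is a
    settingItem shaped like the entry shown in the thread.
    """
    url = f"https://{host}/gdc/projects/{workspace_id}/config/{key}"
    body = {"settingItem": {"key": key, "value": value}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="PUT",
    )


# Hypothetical host and workspace id, for illustration only.
req = build_flag_request(
    "example.invalid", "workspace_id", "etl.lastRecordDeduplication", "false"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` once a valid session is attached; that part depends on how you authenticate and is not shown.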
looks like the limit is 32GiB
so we’ll go for the deduplication feature then
switching this off will still replace old data with newer data, but we don’t get a deduplication check WITHIN the new data set anymore, correct?
so as long as I make sure that everything in there is already unique, I’m fine?
Good question. Regarding the consequences:
* true (current status): if the input data contain duplicates on the key (connection point or fact table grain), they are deduplicated (the last row is kept) before being loaded.
* false: if the input data contain duplicates on the key (connection point or fact table grain), the data load will fail with an error message.
This can also be found in the following documentation: https://help.gooddata.com/doc/growth/en/workspace-and-user-administration/administrat[…]ce-objects/configure-various-features-via-platform-settings/
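To make the two settings concrete, here is a small local simulation in Python. It is not platform code: the function name `prepare_batch`, the row format, and the error type are hypothetical and only mimic the behaviour described above (keep the last row per key when the flag is true, fail on a duplicate key when it is false).

```python
def prepare_batch(rows, key_columns, last_record_deduplication):
    """Mimic how a batch with duplicate keys is treated under each flag value."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in latest and not last_record_deduplication:
            # Flag disabled: a duplicate key inside the batch fails the load.
            raise ValueError(f"duplicate key in batch: {key}")
        # Flag enabled: a later row replaces an earlier row with the same key.
        latest[key] = row
    return list(latest.values())


batch = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 1, "amount": 15},  # same key as the first row
]

print(prepare_batch(batch, ["id"], last_record_deduplication=True))

try:
    prepare_batch(batch, ["id"], last_record_deduplication=False)
except ValueError as error:
    print("load would fail:", error)
```

Running the same check with the flag set to false against your own extract before uploading is one way to confirm the batch is already unique on its key.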
ok, this talks specifically about input data, making me hope that the old data still gets discarded and replaced
Yes Thomas. No matter how the feature flag is set, a data load will still:
• remove all old data if it is a full load
• update the existing data based on the defined primary key in case of an incremental load
The only difference is really in what happens if there is a duplicate on the primary key inside the batch of data being uploaded.
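The two load modes above can be sketched as a small Python simulation. This is only an illustration of the described semantics, not how the platform implements them: `apply_load`, the `mode` values, and the row format are hypothetical, and the batch is assumed to be unique on its key already.

```python
def apply_load(existing, batch, key, mode):
    """Return the table contents after a load; assumes batch keys are unique."""
    if mode == "full":
        # Full load: all old data is removed and replaced by the batch.
        return list(batch)
    if mode == "incremental":
        # Incremental load: rows are updated or added by primary key.
        merged = {row[key]: row for row in existing}
        merged.update({row[key]: row for row in batch})
        return list(merged.values())
    raise ValueError(f"unknown load mode: {mode}")


existing = [{"id": 1, "v": "old"}, {"id": 2, "v": "old"}]
batch = [{"id": 2, "v": "new"}, {"id": 3, "v": "new"}]

print(apply_load(existing, batch, "id", "full"))
print(apply_load(existing, batch, "id", "incremental"))
```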
thanks for clarifying that!