# gooddata-platform
d
Hello. I am trying to load one set of GoodData sample data into my workspace. I have a new workspace and created a data source from AWS Redshift at https://analytics.tst2.livevox.com/admin/connect/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v/datasource/667db1163c1e5549d812f942. From there, I created the data model at https://analytics.tst2.livevox.com/modeler/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v. Everything looks good, but there is no data. In the upper-left corner of the data model page there is a "Load" button, which leads to https://analytics.tst2.livevox.com/admin/load/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v. I simply created a new schedule, selected the data source, chose to load all mapped datasets, selected "Manually", and saved it. I clicked "Run now", and the run failed. From its log at https://analytics.tst2.livevox.com/gdc/projects/m8ify7ik9gkfsgn14798hdn2ernba06v/dataload/processes/c2ba7b6c-631c-43cf-a9ab-d8b94e74d2b2/executions/667dd6bc78f6c407f8a1f5f0/log#first_error, it looks like all the data was downloaded correctly:

```
2024-06-27T23:16:53.741+0200 [INFO]: Data for project="m8ify7ik9gkfsgn14798hdn2ernba06v" was downloaded. Datasets: [{dataset.campaign_channels, upsert, rows=330, size=17743 bytes}, {dataset.campaigns, upsert, rows=154, size=4084 bytes}, {dataset.customers, upsert, rows=2000, size=115912 bytes}, {dataset.order_lines, upsert, rows=11978, size=1056172 bytes}, {dataset.products, upsert, rows=36, size=781 bytes}]
```

The number of rows for each dataset matches the tables in the Redshift database. However, the load failed with the following error message:

```
2024-06-27T23:16:58.753+0200 [WARN]: Project "m8ify7ik9gkfsgn14798hdn2ernba06v" was not integrated. Reason:
Upload to dataset dataset.products failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "3" in column "id".'.
Upload to dataset dataset.campaign_channels failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "151" in column "id".'.
Upload to dataset dataset.customers failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "915" in column "id".'.
Upload to dataset dataset.order_lines failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "4590" in column "id".'.
Upload to dataset dataset.campaigns failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "172" in column "id".'.
```

I am puzzled by this message, as I don't see a column "id" in any of the datasets or dataset mappings. Could you please help figure it out? Joe will invite you to our workspace. Thanks in advance.
m
Hi @Dongfeng Lu, the issue has to do with duplicated records in your source data. There are two options: you can verify whether the data for "column_id" is really unique, or you can force the load past the duplicated values by setting the platform setting `etl.lastRecordDeduplication` to true at the workspace level:

https://{your.domain.com}/gdc/projects/{workspace_id}/config/

• Set as true - if the input data contain duplicities on a key (connection point or fact table grain), they are silently deduplicated (last row wins) before being loaded.
• Set as false - if the input data contain duplicities on a key (connection point or fact table grain), the data load fails with an error message, as in this case.

More details about the platform settings can be found in our documentation here: Disable deduplication
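The uniqueness check suggested above is a plain SQL aggregate on the key column. Here is a minimal sketch against an in-memory SQLite table standing in for the Redshift source (table and column names mirror the thread; the rows are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the Redshift source table. Note that Redshift
# declares but does not enforce PRIMARY KEY, so duplicates can slip in.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (product_id INT, product_name TEXT)")
con.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Compass"), (2, "Lamp"), (3, "Tent"), (3, "Tent v2")],  # key 3 is duplicated
)

# List key values that occur more than once -- these are the rows that
# would trigger SQLSTATE_UNIQUE_VIOLATION during the load.
dupes = con.execute(
    """
    SELECT product_id, COUNT(*) AS n
    FROM products
    GROUP BY product_id
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dupes)  # -> [(3, 2)]
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query can be run directly in Redshift against each source table, one per mapped dataset key.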
d
Hi Moises, thanks for the response. However, I am not sure you fully got my question. You mentioned "you can verify if the data for "column_id" is really unique", but my problem is that there is no "column_id" in the dataset. For that matter, the error message "Duplicated value "151" in column "id"." says there is a field named "id", not "column_id", but I don't see a field called just "id" either. I am not sure if you can see the log, so I am copying it here. You can see in the "Output Stage -> LDM mapping" that there are "product_id", "customer_id", etc., but no standalone "id". Where does it come from? What database field is mapped to it?

```
2024-06-27T23:16:44.360+0200 [INFO]: Data distribution worker started
2024-06-27T23:16:44.360+0200 [INFO]: Request id: "data_load_TPpwozezQ4_u3k495HbE4t9xIMjuj8KI4zyuuP4iEAST8phRsu2Bh"
2024-06-27T23:16:44.360+0200 [INFO]: Data source: "667db1163c1e5549d812f942"
2024-06-27T23:16:44.360+0200 [INFO]: Additional parameters: {GDC_DE_SYNCHRONIZE_ALL=true, PROCESS_ID=c2ba7b6c-631c-43cf-a9ab-d8b94e74d2b2}
2024-06-27T23:16:44.360+0200 [INFO]: Synchronization mode: all mapped datasets (default)
2024-06-27T23:16:48.703+0200 [INFO]: Synchronized datasets: [dataset.products, dataset.campaign_channels, dataset.customers, dataset.order_lines, dataset.campaigns]
2024-06-27T23:16:48.703+0200 [INFO]: ====================== Mapping Validation ======================
Mapping Validation is OK
====================== End of Mapping Validation ======================
2024-06-27T23:16:48.703+0200 [INFO]: ====================== Output Stage -> LDM mapping ======================
Output Stage table/view: campaign_channels (full, shared) -> Dataset: dataset.campaign_channels
  category[VARCHAR(128)] -> label.campaign_channels.category{LABEL}[VARCHAR(128)]
  campaign_channel_id[VARCHAR(128)] -> label.campaign_channels.campaign_channel_id{LABEL}[VARCHAR(128)]
  campaign_id[INT4] -> dataset.campaigns{REFERENCE}[INT]
  spend[NUMERIC(15,2)] -> fact.campaign_channels.spend{FACT}[DECIMAL(15,2)]
  type[VARCHAR(128)] -> label.campaign_channels.type{LABEL}[VARCHAR(128)]
  budget[NUMERIC(15,2)] -> fact.campaign_channels.budget{FACT}[DECIMAL(15,2)]
Output Stage table/view: campaigns (full, shared) -> Dataset: dataset.campaigns
  campaign_id[INT4] -> label.campaigns.campaign_id{LABEL}[INT]
  campaign_name[VARCHAR(128)] -> label.campaigns.campaign_name{LABEL}[VARCHAR(128)]
Output Stage table/view: customers (full, shared) -> Dataset: dataset.customers
  geo__state__location[VARCHAR(64)] -> label.customers.geo_state_location{LABEL}[VARCHAR(64)]
  customer_name[VARCHAR(128)] -> label.customers.customer_name{LABEL}[VARCHAR(128)]
  customer_id[INT4] -> label.customers.customer_id{LABEL}[INT]
  state[VARCHAR(64)] -> label.customers.state{LABEL}[VARCHAR(64)]
  region[VARCHAR(64)] -> label.customers.region{LABEL}[VARCHAR(64)]
Output Stage table/view: order_lines (full, shared) -> Dataset: dataset.order_lines
  campaign_id[INT4] -> dataset.campaigns{REFERENCE}[INT]
  product_id[INT4] -> dataset.products{REFERENCE}[INT]
  wdf__region[VARCHAR(128)] -> label.order_lines.wdf_region{LABEL}[VARCHAR(128)]
  wdf__state[VARCHAR(64)] -> label.order_lines.wdf_state{LABEL}[VARCHAR(64)]
  order_line_id[VARCHAR(128)] -> label.order_lines.order_line_id{LABEL}[VARCHAR(128)]
  customer_id[INT4] -> dataset.customers{REFERENCE}[INT]
  order_id[VARCHAR(128)] -> label.order_lines.order_id{LABEL}[VARCHAR(128)]
  order_status[VARCHAR(128)] -> label.order_lines.order_status{LABEL}[VARCHAR(128)]
  date[DATE] -> date{DATE}[DATE]
  quantity[NUMERIC(15,2)] -> fact.order_lines.quantity{FACT}[DECIMAL(15,2)]
  price[NUMERIC(15,2)] -> fact.order_lines.price{FACT}[DECIMAL(15,2)]
Output Stage table/view: products (full, shared) -> Dataset: dataset.products
  product_name[VARCHAR(128)] -> label.products.product_name{LABEL}[VARCHAR(128)]
  product_id[INT4] -> label.products.product_id{LABEL}[INT]
  category[VARCHAR(128)] -> label.products.category{LABEL}[VARCHAR(128)]
====================== End of Output Stage -> LDM mapping ======================
2024-06-27T23:16:48.713+0200 [INFO]: ====================== Data distribution scope ======================
2024-06-27T23:16:48.713+0200 [INFO]: Project="m8ify7ik9gkfsgn14798hdn2ernba06v"; datasets=[{dataset.products, full}, {dataset.campaign_channels, full}, {dataset.customers, full}, {dataset.order_lines, full}, {dataset.campaigns, full}]
2024-06-27T23:16:48.713+0200 [INFO]: ====================== End of Data distribution scope ======================
2024-06-27T23:16:48.713+0200 [INFO]: ====================== Downloading and integrating data ======================
2024-06-27T23:16:53.741+0200 [INFO]: Data for project="m8ify7ik9gkfsgn14798hdn2ernba06v" was downloaded. Datasets: [{dataset.campaign_channels, upsert, rows=330, size=17743 bytes}, {dataset.campaigns, upsert, rows=154, size=4084 bytes}, {dataset.customers, upsert, rows=2000, size=115912 bytes}, {dataset.order_lines, upsert, rows=11978, size=1056172 bytes}, {dataset.products, upsert, rows=36, size=781 bytes}]
2024-06-27T23:16:58.753+0200 [WARN]: Project "m8ify7ik9gkfsgn14798hdn2ernba06v" was not integrated. Reason:
Upload to dataset dataset.products failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "3" in column "id".'.
Upload to dataset dataset.campaign_channels failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "151" in column "id".'.
Upload to dataset dataset.customers failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "915" in column "id".'.
Upload to dataset dataset.order_lines failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "4590" in column "id".'.
Upload to dataset dataset.campaigns failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "172" in column "id".'.
2024-06-27T23:17:08.760+0200 [INFO]: ====================== End of downloading and integrating data ======================
2024-06-27T23:17:08.760+0200 [ERROR]: Data distribution worker failed. Reason: All projects failed to load.
```

Another question: I did not explicitly set the output stage, so is there a way I can download what the process downloaded (the data reported in the "Data for project ... was downloaded" log line above) so I can examine it? Thanks.
Hi Moises, my bad. After some digging, I learned that AWS Redshift does not enforce primary key constraints, which we did not expect. We do have duplicates. After removing the duplicates, the load succeeded. Thanks!
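For anyone hitting the same issue: a "last row wins" dedup, which is the behaviour `etl.lastRecordDeduplication` applies silently on the platform side, can also be done in the source with a `ROW_NUMBER()` window function. A minimal sketch using an in-memory SQLite table (Redshift supports the same pattern; the `loaded_at` ordering column and sample rows are illustrative assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (product_id INT, product_name TEXT, loaded_at INT)")
con.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(3, "Tent", 1), (3, "Tent v2", 2), (5, "Lamp", 1)],  # key 3 is duplicated
)

# Keep only the newest row per key ("last row wins"): number the rows
# within each product_id partition, newest first, and keep rank 1.
rows = con.execute(
    """
    SELECT product_id, product_name FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY product_id ORDER BY loaded_at DESC
               ) AS rn
        FROM products
    ) WHERE rn = 1
    ORDER BY product_id
    """
).fetchall()
print(rows)  # -> [(3, 'Tent v2'), (5, 'Lamp')]
```

In Redshift the same `SELECT` can back a cleaned view that the GoodData output stage mapping points at, so the declared-but-unenforced primary key never reaches the load with duplicates.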
🙌 1