# gooddata-platform
d
Hello. I am trying to load one set of GoodData sample data into my workspace. I have a new workspace and created a data source from AWS Redshift at https://analytics.tst2.livevox.com/admin/connect/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v/datasource/667db1163c1e5549d812f942. From there, I created the data model at https://analytics.tst2.livevox.com/modeler/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v. Everything looks good, but there is no data. In the upper-left corner of the data model page there is a "Load" button, which leads to https://analytics.tst2.livevox.com/admin/load/#/workspaces/m8ify7ik9gkfsgn14798hdn2ernba06v. I simply created a new schedule, selected the data source, chose to load all mapped datasets, selected "Manually", and saved it. I clicked "Run now", and the run failed. From its log at https://analytics.tst2.livevox.com/gdc/projects/m8ify7ik9gkfsgn14798hdn2ernba06v/dataload/processes/c2ba7b6c-631c-43cf-a9ab-d8b94e74d2b2/executions/667dd6bc78f6c407f8a1f5f0/log#first_error, it looks like all the data was downloaded correctly:

```
2024-06-27T23:16:53.741+0200 [INFO]: Data for project="m8ify7ik9gkfsgn14798hdn2ernba06v" was downloaded. Datasets: [{dataset.campaign_channels, upsert, rows=330, size=17743 bytes}, {dataset.campaigns, upsert, rows=154, size=4084 bytes}, {dataset.customers, upsert, rows=2000, size=115912 bytes}, {dataset.order_lines, upsert, rows=11978, size=1056172 bytes}, {dataset.products, upsert, rows=36, size=781 bytes}]
```

The number of rows for each dataset matches the tables in the Redshift database. However, the load failed with the following error message:

```
2024-06-27T23:16:58.753+0200 [WARN]: Project "m8ify7ik9gkfsgn14798hdn2ernba06v" was not integrated. Reason:
Upload to dataset dataset.products failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "3" in column "id".'.
Upload to dataset dataset.campaign_channels failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "151" in column "id".'.
Upload to dataset dataset.customers failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "915" in column "id".'.
Upload to dataset dataset.order_lines failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "4590" in column "id".'.
Upload to dataset dataset.campaigns failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "172" in column "id".'.
```

I am puzzled by this message, as I don't see a column "id" in any of the datasets or dataset mappings. Could you please help figure it out? Joe will invite you to our workspace. Thanks in advance.
m
Hi @Dongfeng Lu, the issue has to do with duplicated records in your source data. There are two options: you can verify whether the data for "column_id" is really unique, or you can force the load past the duplicated values by setting the platform setting `etl.lastRecordDeduplication` to true at the workspace level:

https://{your.domain.com}/gdc/projects/{workspace_id}/config/

• Set as true - if the input data contain duplicities on a key (connection point or fact table grain), they are silently deduplicated (last row wins) before being loaded.
• Set as false - if the input data contain duplicities on a key (connection point or fact table grain), the data load fails with an error message, as in this case.

More details about the platform settings can be found in our documentation here: Disable deduplication
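The uniqueness check suggested above is a plain SQL aggregate on the key column. Here is a minimal sketch against an in-memory SQLite table standing in for the Redshift source (table and column names mirror the thread; the rows are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the Redshift source table. Note that Redshift
# declares but does not enforce PRIMARY KEY, so duplicates can slip in.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (product_id INT, product_name TEXT)")
con.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Compass"), (2, "Lamp"), (3, "Tent"), (3, "Tent v2")],  # key 3 is duplicated
)

# List key values that occur more than once -- these are the rows that
# would trigger SQLSTATE_UNIQUE_VIOLATION during the load.
dupes = con.execute(
    """
    SELECT product_id, COUNT(*) AS n
    FROM products
    GROUP BY product_id
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dupes)  # -> [(3, 2)]
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query can be run directly in Redshift against each source table, one per mapped dataset key.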
d
Hi Moises, thanks for the response. However, I am not sure you fully got my question. You mentioned "you can verify if the data for "column_id" is really unique", but my problem is that there is no "column_id" in the dataset. For that matter, the error message "Duplicated value "151" in column "id"." says there is a field named "id", not "column_id", but I don't see a field called just "id" either. I am not sure if you can see the log, so I am copying it here. You can see in the "Output Stage -> LDM mapping" that there are "product_id", "customer_id", etc., but no standalone "id". Where does it come from? What database field is mapped to it?

```
2024-06-27T23:16:44.360+0200 [INFO]: Data distribution worker started
2024-06-27T23:16:44.360+0200 [INFO]: Request id: "data_load_TPpwozezQ4_u3k495HbE4t9xIMjuj8KI4zyuuP4iEAST8phRsu2Bh"
2024-06-27T23:16:44.360+0200 [INFO]: Data source: "667db1163c1e5549d812f942"
2024-06-27T23:16:44.360+0200 [INFO]: Additional parameters: {GDC_DE_SYNCHRONIZE_ALL=true, PROCESS_ID=c2ba7b6c-631c-43cf-a9ab-d8b94e74d2b2}
2024-06-27T23:16:44.360+0200 [INFO]: Synchronization mode: all mapped datasets (default)
2024-06-27T23:16:48.703+0200 [INFO]: Synchronized datasets: [dataset.products, dataset.campaign_channels, dataset.customers, dataset.order_lines, dataset.campaigns]
2024-06-27T23:16:48.703+0200 [INFO]: ====================== Mapping Validation ======================
Mapping Validation is OK
====================== End of Mapping Validation ======================
2024-06-27T23:16:48.703+0200 [INFO]: ====================== Output Stage -> LDM mapping ======================
Output Stage table/view: campaign_channels (full, shared) -> Dataset: dataset.campaign_channels
  category[VARCHAR(128)] -> label.campaign_channels.category{LABEL}[VARCHAR(128)]
  campaign_channel_id[VARCHAR(128)] -> label.campaign_channels.campaign_channel_id{LABEL}[VARCHAR(128)]
  campaign_id[INT4] -> dataset.campaigns{REFERENCE}[INT]
  spend[NUMERIC(15,2)] -> fact.campaign_channels.spend{FACT}[DECIMAL(15,2)]
  type[VARCHAR(128)] -> label.campaign_channels.type{LABEL}[VARCHAR(128)]
  budget[NUMERIC(15,2)] -> fact.campaign_channels.budget{FACT}[DECIMAL(15,2)]
Output Stage table/view: campaigns (full, shared) -> Dataset: dataset.campaigns
  campaign_id[INT4] -> label.campaigns.campaign_id{LABEL}[INT]
  campaign_name[VARCHAR(128)] -> label.campaigns.campaign_name{LABEL}[VARCHAR(128)]
Output Stage table/view: customers (full, shared) -> Dataset: dataset.customers
  geo__state__location[VARCHAR(64)] -> label.customers.geo_state_location{LABEL}[VARCHAR(64)]
  customer_name[VARCHAR(128)] -> label.customers.customer_name{LABEL}[VARCHAR(128)]
  customer_id[INT4] -> label.customers.customer_id{LABEL}[INT]
  state[VARCHAR(64)] -> label.customers.state{LABEL}[VARCHAR(64)]
  region[VARCHAR(64)] -> label.customers.region{LABEL}[VARCHAR(64)]
Output Stage table/view: order_lines (full, shared) -> Dataset: dataset.order_lines
  campaign_id[INT4] -> dataset.campaigns{REFERENCE}[INT]
  product_id[INT4] -> dataset.products{REFERENCE}[INT]
  wdf__region[VARCHAR(128)] -> label.order_lines.wdf_region{LABEL}[VARCHAR(128)]
  wdf__state[VARCHAR(64)] -> label.order_lines.wdf_state{LABEL}[VARCHAR(64)]
  order_line_id[VARCHAR(128)] -> label.order_lines.order_line_id{LABEL}[VARCHAR(128)]
  customer_id[INT4] -> dataset.customers{REFERENCE}[INT]
  order_id[VARCHAR(128)] -> label.order_lines.order_id{LABEL}[VARCHAR(128)]
  order_status[VARCHAR(128)] -> label.order_lines.order_status{LABEL}[VARCHAR(128)]
  date[DATE] -> date{DATE}[DATE]
  quantity[NUMERIC(15,2)] -> fact.order_lines.quantity{FACT}[DECIMAL(15,2)]
  price[NUMERIC(15,2)] -> fact.order_lines.price{FACT}[DECIMAL(15,2)]
Output Stage table/view: products (full, shared) -> Dataset: dataset.products
  product_name[VARCHAR(128)] -> label.products.product_name{LABEL}[VARCHAR(128)]
  product_id[INT4] -> label.products.product_id{LABEL}[INT]
  category[VARCHAR(128)] -> label.products.category{LABEL}[VARCHAR(128)]
====================== End of Output Stage -> LDM mapping ======================
2024-06-27T23:16:48.713+0200 [INFO]: ====================== Data distribution scope ======================
2024-06-27T23:16:48.713+0200 [INFO]: Project="m8ify7ik9gkfsgn14798hdn2ernba06v"; datasets=[{dataset.products, full}, {dataset.campaign_channels, full}, {dataset.customers, full}, {dataset.order_lines, full}, {dataset.campaigns, full}]
2024-06-27T23:16:48.713+0200 [INFO]: ====================== End of Data distribution scope ======================
2024-06-27T23:16:48.713+0200 [INFO]: ====================== Downloading and integrating data ======================
2024-06-27T23:16:53.741+0200 [INFO]: Data for project="m8ify7ik9gkfsgn14798hdn2ernba06v" was downloaded. Datasets: [{dataset.campaign_channels, upsert, rows=330, size=17743 bytes}, {dataset.campaigns, upsert, rows=154, size=4084 bytes}, {dataset.customers, upsert, rows=2000, size=115912 bytes}, {dataset.order_lines, upsert, rows=11978, size=1056172 bytes}, {dataset.products, upsert, rows=36, size=781 bytes}]
2024-06-27T23:16:58.753+0200 [WARN]: Project "m8ify7ik9gkfsgn14798hdn2ernba06v" was not integrated. Reason:
Upload to dataset dataset.products failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "3" in column "id".'.
Upload to dataset dataset.campaign_channels failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "151" in column "id".'.
Upload to dataset dataset.customers failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "915" in column "id".'.
Upload to dataset dataset.order_lines failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "4590" in column "id".'.
Upload to dataset dataset.campaigns failed with 'An error occured during loading of csv files: SQLSTATE_UNIQUE_VIOLATION. Details: Duplicated value "172" in column "id".'.
2024-06-27T23:17:08.760+0200 [INFO]: ====================== End of downloading and integrating data ======================
2024-06-27T23:17:08.760+0200 [ERROR]: Data distribution worker failed. Reason: All projects failed to load.
```

Another question: I did not explicitly set the output stage, so is there a way I can download what the process downloaded (the data reported in the "Data for project ... was downloaded" log line above) so I can examine it? Thanks.
Hi Moises, my bad. After some digging, I learned that AWS Redshift does not enforce primary key constraints, which we did not expect. We do have duplicates. After removing the duplicates, the load succeeded. Thanks!
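For anyone hitting the same issue: a "last row wins" dedup, which is the behaviour `etl.lastRecordDeduplication` applies silently on the platform side, can also be done in the source with a `ROW_NUMBER()` window function. A minimal sketch using an in-memory SQLite table (Redshift supports the same pattern; the `loaded_at` ordering column and sample rows are illustrative assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (product_id INT, product_name TEXT, loaded_at INT)")
con.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(3, "Tent", 1), (3, "Tent v2", 2), (5, "Lamp", 1)],  # key 3 is duplicated
)

# Keep only the newest row per key ("last row wins"): number the rows
# within each product_id partition, newest first, and keep rank 1.
rows = con.execute(
    """
    SELECT product_id, product_name FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY product_id ORDER BY loaded_at DESC
               ) AS rn
        FROM products
    ) WHERE rn = 1
    ORDER BY product_id
    """
).fetchall()
print(rows)  # -> [(3, 'Tent v2'), (5, 'Lamp')]
```

In Redshift the same `SELECT` can back a cleaned view that the GoodData output stage mapping points at, so the declared-but-unenforced primary key never reaches the load with duplicates.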
🙌 1