Loading Apache Spark CSV data from S3 with ADD

  • 17 November 2020
  • 2 replies
  • 415 views

I’m just getting started with GoodData and have successfully loaded CSV files from an S3 bucket with ADD, after a bit of fiddling to get my filenames matching my model name, etc. (This article was very helpful, thanks!) I can see the data populating in my model and can view it in a dashboard - so far, so good...

However, the data I really want to load is written by Apache Spark (on AWS EMR) as a CSV file in “append” mode, so the structure of the file in S3 looks like this:

[aws-bucket]
   |- [analytics-data]
         |- hourly-data-dump.csv
               |- part-00000-00231726-b34a-454f-b534-bd9e492ab361-c000.csv
               |- part-00000-00a86ed4-a0b1-4c45-909c-49753c6ce877-c000.csv
               |- part-00000-00c9ebc7-550e-4632-b909-1aa88cd86b8e-c000.csv
               …

There are potentially hundreds of the part-00000-xxx files written by Spark. I control the name of the “folder” hourly-data-dump.csv but not the files within it.

This must be a pretty common use case - any tips on how to get this data to load into GoodData using ADD as a regular update?

Thanks,

Simon


2 replies


Hi Simon,

Thank you for your question.

S3 integration is quite a new feature and it’s being constantly improved. While this use case is not currently supported, we are planning to expand the feature to allow more flexible rules for file structure and naming. The first improvements should be coming in Q1/2021.

Can you please describe your use case in more detail?

I assume that hourly-data-dump.csv is a (sub)folder containing the CSV files you would like to load.

 

We appreciate your feedback and will pass it on to our product team; it will help them shape the final form of the feature.

 

All the best,

Boris

Hi Boris, 

Thanks for the quick response. The files I want to read are written by Apache Spark using the Scala API described here:

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html

Typically Spark splits a CSV file into multiple parts and saves them in a folder with the name passed to the save() call. The number of files written depends on the number of partitions in which Spark holds the data; in the case of an “append” write, a new CSV file is written for each append.
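For reference, the write looks roughly like this - a minimal sketch, not the real job; the session setup, source path, and column handling are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Placeholder session and source; the real job reads the latest hour of data from elsewhere.
val spark = SparkSession.builder()
  .appName("hourly-data-dump")
  .getOrCreate()

val hourlyData = spark.read
  .option("header", "true")
  .csv("s3://aws-bucket/raw/latest-hour/")   // hypothetical source path

// Each run appends new part-00000-<uuid>-c000.csv files inside the
// "hourly-data-dump.csv" folder; the part file names cannot be set here.
hourlyData.write
  .mode(SaveMode.Append)
  .option("header", "true")
  .csv("s3://aws-bucket/analytics-data/hourly-data-dump.csv")
```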

Reading the Spark docs, I can see that this is an underlying Hadoop naming convention and Spark does not allow it to be changed at the point of writing. There are some workarounds available such as this one. I’ll proceed along these lines for now.
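One common workaround (I can’t say it’s exactly what the linked article does) is to coalesce the output into a single partition and then rename the lone part file to a stable name that ADD can match. A rough sketch, reusing the `spark` and `hourlyData` placeholders from above - paths and names are illustrative, and on S3 the rename is really a copy, so it suits modest file sizes:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val outputDir = "s3://aws-bucket/analytics-data/tmp-hourly-dump"            // staging folder
val finalFile = "s3://aws-bucket/analytics-data/hourly-data-dump.csv"       // fixed name for ADD

// Collapse to one partition so Spark writes a single part file.
hourlyData.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .csv(outputDir)

// Find the lone part-*.csv and rename it to the fixed name.
val fs = FileSystem.get(new java.net.URI(outputDir), spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(outputDir + "/part-*.csv")).head.getPath
fs.rename(partFile, new Path(finalFile))
```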

Simon
