I’m just getting started with GoodData and have successfully loaded CSV files from an S3 bucket with ADD, after a bit of fiddling to get my filenames matching my model names etc. (This article was very helpful, thanks!). I can see the data populating my model and can view it in a dashboard - so far, so good...
However, the data I really want to load is written by Apache Spark (on AWS EMR) as CSV in “append” mode, so the structure of the output in S3 looks like this:
[aws-bucket]
|- [analytics-data]
   |- hourly-data-dump.csv
      |- part-00000-00231726-b34a-454f-b534-bd9e492ab361-c000.csv
      |- part-00000-00a86ed4-a0b1-4c45-909c-49753c6ce877-c000.csv
      |- part-00000-00c9ebc7-550e-4632-b909-1aa88cd86b8e-c000.csv
      …
There are potentially hundreds of these part-00000-xxx files written by Spark. I control the name of the “folder” hourly-data-dump.csv but not the files within it.
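For what it’s worth, I know I could work around this by concatenating the part files into a single CSV myself before ADD picks it up. A rough sketch of what I mean (folder and file names here are just illustrative, and it assumes all part files share the same column order with no header row, which is Spark’s default for CSV):

```python
import csv
import glob
import os

def merge_part_files(folder, out_path):
    """Concatenate Spark part-*.csv files in `folder` into one CSV.

    Assumes every part file has the same column order and no header
    row (Spark's default when header=False), so rows can simply be
    appended in filename order.
    """
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Sort so the merge order is deterministic across runs.
        for part in sorted(glob.glob(os.path.join(folder, "part-*.csv"))):
            with open(part, newline="") as f:
                for row in csv.reader(f):
                    writer.writerow(row)
```

…but that feels like an extra moving part in the pipeline, so I’m hoping ADD has a built-in way to treat a folder of part files as one source.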
This must be a pretty common use case - any tips on how to get this data to load into GoodData using ADD as a regular update?
Thanks,
Simon