Spark Streaming on an S3 Directory
So I have thousands of events being streamed through Amazon Kinesis into SQS and then dumped into an S3 directory. Roughly every 10 minutes, a new text file is created to dump the data from Kinesis into S3. I would like to set up Spark Streaming so that it streams the new files being dumped into S3. Right now I have:
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._

    val conf = new SparkConf().setAppName("S3EventStream")
    val ssc = new StreamingContext(conf, Seconds(30)) // 30s batch interval

    val currentFileStream = ssc.textFileStream("s3://bucket/directory/event_name=accepted/")
    currentFileStream.print()

    ssc.start()
    ssc.awaitTermination()
However, Spark Streaming is not picking up the new files being dumped into S3. I suspect it has something to do with Spark's requirements for files in a monitored directory:
- The files must have the same data format.
- The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
- Once moved, the files must not be changed. So if the files are being continuously appended to, the new data will not be read.
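For what it's worth, the "atomic move" requirement can be illustrated with plain JDK file operations: write the complete file somewhere outside the monitored directory, then move it in as a single step, so the watcher never observes a half-written file. This is only a local-filesystem sketch (the directory names and payload here are made up), not the actual S3 pipeline:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicDrop {
  // Write the full payload into a staging directory first, then move the
  // finished file into the watched directory in one atomic step. A process
  // monitoring `watched` never sees a partially written file.
  def dropAtomically(data: String, staging: Path, watched: Path, name: String): Path = {
    val tmp = staging.resolve(name)
    Files.write(tmp, data.getBytes("UTF-8"))              // 1. write outside the watched dir
    val dest = watched.resolve(name)
    Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE) // 2. atomic move makes it visible
    dest
  }

  def main(args: Array[String]): Unit = {
    val staging = Files.createTempDirectory("staging")
    val watched = Files.createTempDirectory("watched")
    val dest = dropAtomically("sample event", staging, watched, "events-0001.txt")
    println(Files.readAllLines(dest).get(0)) // prints: sample event
  }
}
```

Note that S3 has no true rename, so (if I understand correctly) the equivalent pattern there would be to upload the finished object under a prefix Spark is not watching and then copy it into the monitored prefix once it is complete, so the object appears in the watched location all at once.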
Why is Spark Streaming not picking up the new files? Is it because AWS is creating the files in place in the directory rather than moving them in? How can I make sure Spark picks up the files being dumped into S3?