This connector provides a Sink that writes partitioned files to filesystems
supported by the Flink FileSystem abstraction. Because the input of a streaming
job is potentially infinite, the streaming file sink writes data into buckets. The
bucketing behaviour is configurable but a useful default is time-based
bucketing where we start writing a new bucket every hour and thus get
individual files that each contain a part of the infinite output stream.
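To make hourly bucketing concrete, the following sketch derives a bucket name from a record timestamp using only the Java standard library. The class name HourlyBucketSketch, the method bucketId, the UTC time zone and the yyyy-MM-dd--HH pattern are assumptions chosen for illustration; this is not the sink's own implementation.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a timestamp to the name of its hourly bucket.
final class HourlyBucketSketch {

    // Assumed pattern and zone; one distinct value per hour.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    // Records whose timestamps fall in the same hour get the same bucket name.
    static String bucketId(long epochMillis) {
        return HOURLY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(bucketId(0L));          // 1970-01-01--00
        System.out.println(bucketId(3_599_999L));  // 1970-01-01--00 (still the first hour)
        System.out.println(bucketId(3_600_000L));  // 1970-01-01--01 (a new bucket starts)
    }
}
```

Every record is routed by its bucket name, so a new directory appears each hour while earlier buckets stop receiving data.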
Within a bucket, the output is further split into smaller part files according to a rolling policy, which prevents individual part files from growing too large. The policy is also configurable; the default rolls a part file based on its size and on a timeout, i.e. when no new data has been written to the part file for a given period.
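The size-and-timeout decision can be sketched as a small predicate. This is a simplified model using only the Java standard library, not the sink's own rolling policy; the class RollingPolicySketch, its fields and the method shouldRoll are hypothetical names, and the thresholds are arbitrary.

```java
// Hypothetical model of a "roll on size or inactivity" rule.
final class RollingPolicySketch {

    private final long maxPartSizeBytes;        // roll once a part file reaches this size
    private final long inactivityTimeoutMillis; // roll once nothing was written for this long

    RollingPolicySketch(long maxPartSizeBytes, long inactivityTimeoutMillis) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.inactivityTimeoutMillis = inactivityTimeoutMillis;
    }

    // Returns true if the current part file should be closed and a new one started.
    boolean shouldRoll(long partSizeBytes, long lastWriteMillis, long nowMillis) {
        boolean tooLarge = partSizeBytes >= maxPartSizeBytes;
        boolean inactive = nowMillis - lastWriteMillis >= inactivityTimeoutMillis;
        return tooLarge || inactive;
    }

    public static void main(String[] args) {
        RollingPolicySketch policy = new RollingPolicySketch(1024, 60_000);
        System.out.println(policy.shouldRoll(2048, 0, 1));    // true: size limit reached
        System.out.println(policy.shouldRoll(10, 0, 60_000)); // true: no writes for 60 s
        System.out.println(policy.shouldRoll(10, 0, 59_999)); // false: small and recently written
    }
}
```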
StreamingFileSink supports both row-wise encoding formats and
bulk-encoding formats, such as Apache Parquet.
The only required settings are the base path where the data should be written
and an Encoder that is used for serializing records to the
OutputStream for each file.
Basic usage thus looks like this:
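A minimal sketch of such a sink, assuming the Flink streaming dependencies are on the classpath. The class name BasicFileSinkExample, the output path /tmp/flink-output, the sample elements and the job name are placeholders, and SimpleStringEncoder is used here as one possible row-wise Encoder.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class BasicFileSinkExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; in a real job this would be an unbounded source.
        DataStream<String> input = env.fromElements("a", "b", "c");

        // The two required settings: a base path and a row-wise Encoder.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(
                        new Path("/tmp/flink-output"),           // assumed base path
                        new SimpleStringEncoder<String>("UTF-8")) // writes each record as a line
                .build();

        input.addSink(sink);
        env.execute("basic-file-sink");
    }
}
```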
This will create a streaming sink that creates hourly buckets and uses a default rolling policy. The default bucket assigner is DateTimeBucketAssigner and the default rolling policy is DefaultRollingPolicy. You can specify a custom BucketAssigner and RollingPolicy on the sink builder. Please check out the JavaDoc for StreamingFileSink for more configuration options and more documentation about the workings and interactions of bucket assigners and rolling policies.
In the above example we used an
Encoder that can encode or serialize each
record individually. The streaming file sink also supports bulk-encoded output
formats such as Apache Parquet. To use these, instead of
StreamingFileSink.forRowFormat() you would use
StreamingFileSink.forBulkFormat() and specify a BulkWriter.Factory.
ParquetAvroWriters has static methods for creating a
BulkWriter.Factory for various types.
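As a sketch of the bulk-encoded variant, assuming the flink-parquet module and Avro are on the classpath: the example builds a Parquet sink with ParquetAvroWriters.forReflectRecord. The Event class, the output path and the job name are hypothetical, and the exact builder types can differ between Flink versions.

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetFileSinkExample {

    // Hypothetical record type; Avro reflection derives the schema from its fields.
    public static class Event {
        public String name;
        public long count;

        public Event() {}

        public Event(String name, long count) {
            this.name = name;
            this.count = count;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; in a real job this would be an unbounded source.
        DataStream<Event> input =
                env.fromElements(new Event("a", 1L), new Event("b", 2L));

        // forBulkFormat takes a BulkWriter.Factory instead of an Encoder.
        StreamingFileSink<Event> sink = StreamingFileSink
                .forBulkFormat(
                        new Path("/tmp/flink-parquet-output"), // assumed base path
                        ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

        input.addSink(sink);
        env.execute("parquet-file-sink");
    }
}
```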