Dynamic tables are the core concept of Flink’s Table & SQL API for processing both bounded and unbounded
data in a unified fashion.
Because dynamic tables are only a logical concept, Flink does not own the data itself. Instead, the content
of a dynamic table is stored in external systems (such as databases, key-value stores, message queues) or files.
Dynamic sources and dynamic sinks can be used to read and write data from and to an external system. In
the documentation, sources and sinks are often summarized under the term connector.
Flink provides pre-defined connectors for Kafka, Hive, and different file systems. See the connector section
for more information about built-in table sources and sinks.
This page focuses on how to develop a custom, user-defined connector.
Attention New table source and table sink interfaces have been
introduced in Flink 1.11 as part of FLIP-95.
Also the factory interfaces have been reworked. If necessary, take a look at the old table sources and sinks page.
Those interfaces are still supported for backwards compatibility.
In many cases, implementers don’t need to create a new connector from scratch but would like to slightly
modify existing connectors or hook into the existing stack. In other cases, implementers would like to
create specialized connectors.
This section helps for both kinds of use cases. It explains the general architecture of table connectors
from pure declaration in the API to runtime code that will be executed on the cluster.
The filled arrows show how objects are transformed to other objects from one stage to the next stage during
the translation process.
Both Table API and SQL are declarative APIs. This includes the declaration of tables. Thus, executing
a CREATE TABLE statement results in updated metadata in the target catalog.
For most catalog implementations, physical data in the external system is not modified for such an
operation. Connector-specific dependencies don’t have to be present in the classpath yet. The options declared
in the WITH clause are neither validated nor otherwise interpreted.
The metadata for dynamic tables (created via DDL or provided by the catalog) is represented as instances
of CatalogTable. A table name will be resolved into a CatalogTable internally when necessary.
When it comes to planning and optimization of the table program, a CatalogTable needs to be resolved
into a DynamicTableSource (for reading in a SELECT query) and DynamicTableSink (for writing in
an INSERT INTO statement).
DynamicTableSourceFactory and DynamicTableSinkFactory provide connector-specific logic for translating
the metadata of a CatalogTable into instances of DynamicTableSource and DynamicTableSink. In most
of the cases, a factory’s purpose is to validate options (such as 'port' = '5022' in the example),
configure encoding/decoding formats (if required), and create a parameterized instance of the table
By default, instances of DynamicTableSourceFactory and DynamicTableSinkFactory are discovered using
Java’s Service Provider Interfaces (SPI). The
connector option (such as 'connector' = 'custom' in the example) must correspond to a valid factory
Although it might not be apparent in the class naming, DynamicTableSource and DynamicTableSink
can also be seen as stateful factories that eventually produce concrete runtime implementation for reading/writing
the actual data.
The planner uses the source and sink instances to perform connector-specific bidirectional communication
until an optimal logical plan could be found. Depending on the optionally declared ability interfaces (e.g.
SupportsProjectionPushDown or SupportsOverwrite), the planner might apply changes to an instance and
thus mutate the produced runtime implementation.
Once the logical planning is complete, the planner will obtain the runtime implementation from the table
connector. Runtime logic is implemented in Flink’s core connector interfaces such as InputFormat or SourceFunction.
Those interfaces are grouped by another level of abstraction as subclasses of ScanRuntimeProvider,
LookupRuntimeProvider, and SinkRuntimeProvider.
For example, both OutputFormatProvider (providing org.apache.flink.api.common.io.OutputFormat) and SinkFunctionProvider (providing org.apache.flink.streaming.api.functions.sink.SinkFunction) are concrete instances of SinkRuntimeProvider
that the planner can handle.
The framework will check for a single matching factory that is uniquely identified by factory identifier
and requested base class (e.g. DynamicTableSourceFactory).
The factory discovery process can be bypassed by the catalog implementation if necessary. For this, a
catalog needs to return an instance that implements the requested base class in org.apache.flink.table.catalog.Catalog#getFactory.
Dynamic Table Source
By definition, a dynamic table can change over time.
When reading a dynamic table, the content can either be considered as:
A changelog (finite or infinite) for which all changes are consumed continuously until the changelog
is exhausted. This is represented by the ScanTableSource interface.
A continuously changing or very large external table whose content is usually never read entirely
but queried for individual values when necessary. This is represented by the LookupTableSource
A class can implement both of these interfaces at the same time. The planner decides about their usage depending
on the specified query.
Scan Table Source
A ScanTableSource scans all rows from an external storage system during runtime.
The scanned rows don’t have to contain only insertions but can also contain updates and deletions. Thus,
the table source can be used to read a (finite or infinite) changelog. The returned changelog mode indicates
the set of changes that the planner can expect during runtime.
For regular batch scenarios, the source can emit a bounded stream of insert-only rows.
For regular streaming scenarios, the source can emit an unbounded stream of insert-only rows.
For change data capture (CDC) scenarios, the source can emit bounded or unbounded streams with insert,
update, and delete rows.
A table source can implement further ability interfaces such as SupportsProjectionPushDown that might
mutate an instance during planning. All abilities can be found in the org.apache.flink.table.connector.source.abilities
package and are listed in the source abilities table.
The runtime implementation of a ScanTableSource must produce internal data structures. Thus, records
must be emitted as org.apache.flink.table.data.RowData. The framework provides runtime converters such
that a source can still work on common data structures and perform a conversion at the end.
Lookup Table Source
A LookupTableSource looks up rows of an external storage system by one or more keys during runtime.
Compared to ScanTableSource, the source does not have to read the entire table and can lazily fetch individual
values from a (possibly continuously changing) external table when necessary.
Compared to ScanTableSource, a LookupTableSource does only support emitting insert-only changes currently.
Further abilities are not supported. See the documentation of org.apache.flink.table.connector.source.LookupTableSource
for more information.
The runtime implementation of a LookupTableSource is a TableFunction or AsyncTableFunction. The function
will be called with values for the given lookup keys during runtime.
Enables to pass available partitions to the planner and push down partitions into a DynamicTableSource.
During the runtime, the source will only read data from the passed partition list for efficiency.
Enables to push down a (possibly nested) projection into a DynamicTableSource. For efficiency,
a source can push a projection further down in order to be close to the actual data generation. If the source
also implements SupportsReadingMetadata, the source will also read the required metadata only.
Enables to read metadata columns from a DynamicTableSource. The source
is responsible to add the required metadata at the end of the produced rows. This includes
potentially forwarding metadata column from contained formats.
Enables to push down a watermark strategy into a DynamicTableSource. The watermark
strategy is a builder/factory for timestamp extraction and watermark generation. During the runtime, the
watermark generator is located inside the source and is able to generate per-partition watermarks.
Attention The interfaces above are currently only available for
ScanTableSource, not for LookupTableSource.
Dynamic Table Sink
By definition, a dynamic table can change over time.
When writing a dynamic table, the content can always be considered as a changelog (finite or infinite)
for which all changes are written out continuously until the changelog is exhausted. The returned changelog mode
indicates the set of changes that the sink accepts during runtime.
For regular batch scenarios, the sink can solely accept insert-only rows and write out bounded streams.
For regular streaming scenarios, the sink can solely accept insert-only rows and can write out unbounded streams.
For change data capture (CDC) scenarios, the sink can write out bounded or unbounded streams with insert,
update, and delete rows.
A table sink can implement further ability interfaces such as SupportsOverwrite that might mutate an
instance during planning. All abilities can be found in the org.apache.flink.table.connector.sink.abilities
package and are listed in the sink abilities table.
The runtime implementation of a DynamicTableSink must consume internal data structures. Thus, records
must be accepted as org.apache.flink.table.data.RowData. The framework provides runtime converters such
that a sink can still work on common data structures and perform a conversion at the beginning.
Enables to overwrite existing data in a DynamicTableSink. By default, if
this interface is not implemented, existing tables or partitions cannot be overwritten using
e.g. the SQL INSERT OVERWRITE clause.
Enables to write metadata columns into a DynamicTableSource. A table sink is
responsible for accepting requested metadata columns at the end of consumed rows and persist
them. This includes potentially forwarding metadata columns to contained formats.
Encoding / Decoding Formats
Some table connectors accept different formats that encode and decode keys and/or values.
Formats work similar to the pattern DynamicTableSourceFactory -> DynamicTableSource -> ScanRuntimeProvider,
where the factory is responsible for translating options and the source is responsible for creating runtime logic.
Because formats might be located in different modules, they are discovered using Java’s Service Provider
Interface similar to table factories. In order to discover a format factory,
the dynamic table factory searches for a factory that corresponds to a factory identifier and connector-specific
For example, the Kafka table source requires a DeserializationSchema as runtime interface for a decoding
format. Therefore, the Kafka table source factory uses the value of the value.format option to discover
The following format factories are currently supported:
The format factory translates the options into an EncodingFormat or a DecodingFormat. Those interfaces are
another kind of factory that produce specialized format runtime logic for the given data type.
For example, for a Kafka table source factory, the DeserializationFormatFactory would return an EncodingFormat<DeserializationSchema>
that can be passed into the Kafka table source.
This section sketches how to implement a scan table source with a decoding format that supports changelog
semantics. The example illustrates how all of the mentioned components play together. It can serve as
a reference implementation.
In particular, it shows how to
create factories that parse and validate options,
implement table connectors,
implement and discover custom formats,
and use provided utilities such as data structure converters and the FactoryUtil.
The table source uses a simple single-threaded SourceFunction to open a socket that listens for incoming
bytes. The raw bytes are decoded into rows by a pluggable format. The format expects a changelog flag
as the first column.
We will use most of the interfaces mentioned above to enable the following DDL:
Because the format supports changelog semantics, we are able to ingest updates during runtime and create
an updating view that can continuously evaluate changing data:
Use the following command to ingest data in a terminal:
This section illustrates how to translate metadata coming from the catalog to concrete connector instances.
Both factories have been added to the META-INF/services directory.
The SocketDynamicTableFactory translates the catalog table to a table source. Because the table source
requires a decoding format, we are discovering the format using the provided FactoryUtil for convenience.
The ChangelogCsvFormatFactory translates format-specific options to a format. The FactoryUtil in SocketDynamicTableFactory
takes care of adapting the option keys accordingly and handles the prefixing like changelog-csv.column-delimiter.
Because this factory implements DeserializationFormatFactory, it could also be used for other connectors
that support deserialization formats such as the Kafka connector.
Table Source and Decoding Format
This section illustrates how to translate from instances of the planning layer to runtime instances that
are shipped to the cluster.
The SocketDynamicTableSource is used during planning. In our example, we don’t implement any of the
available ability interfaces. Therefore, the main logic can be found in getScanRuntimeProvider(...)
where we instantiate the required SourceFunction and its DeserializationSchema for runtime. Both
instances are parameterized to return internal data structures (i.e. RowData).
The ChangelogCsvFormat is a decoding format that uses a DeserializationSchema during runtime. It
supports emitting INSERT and DELETE changes.
For completeness, this section illustrates the runtime logic for both SourceFunction and DeserializationSchema.
The ChangelogCsvDeserializer contains a simple parsing logic for converting bytes into Row of Integer
and String with a row kind. The final conversion step converts those into internal data structures.
The SocketSourceFunction opens a socket and consumes bytes. It splits records by the given byte
delimiter (\n by default) and delegates the decoding to a pluggable DeserializationSchema. The
source function can only work with a parallelism of 1.