Python API Tutorial

This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.

This walkthrough will quickly get you started building a pure Python Flink project.

Please refer to the Python Table API installation guide on how to set up the Python execution environments.

Setting up a Python Project
Writing a Flink Python Table API Program
Executing a Flink Python Table API Program

Setting up a Python Project

You can begin by creating a Python project and installing the PyFlink package following the installation guide.

Writing a Flink Python Table API Program

Table API applications begin by declaring a table environment: either a BatchTableEnvironment for batch applications or a StreamTableEnvironment for streaming applications. The table environment is the main entry point for interacting with the Flink runtime and can be used to set execution parameters such as the restart strategy and the default parallelism. The table config allows setting Table API specific configurations.

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

With the table environment created, you can declare source and sink tables.

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

You can also use the TableEnvironment.sql_update() method to register a source/sink table defined in DDL:

my_source_ddl = """
    create table mySource (
        word VARCHAR
    ) with (
        'connector.type' = 'filesystem',
        'format.type' = 'csv',
        'connector.path' = '/tmp/input'
    )
"""

my_sink_ddl = """
    create table mySink (
        word VARCHAR,
        `count` BIGINT
    ) with (
        'connector.type' = 'filesystem',
        'format.type' = 'csv',
        'connector.path' = '/tmp/output'
    )
"""

t_env.sql_update(my_source_ddl)
t_env.sql_update(my_sink_ddl)

This registers a table named mySource and a table named mySink in the table environment. The table mySource has only one column, word, and consumes strings read from the file /tmp/input. The table mySink has two columns, word and count, and writes data to the file /tmp/output, with \t as the field delimiter.
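When several tables share the same connector settings, the DDL strings can be generated rather than written by hand. The following is a minimal sketch, not part of PyFlink: build_filesystem_ddl is a hypothetical helper, it reuses only the property keys shown in the DDL above, and it assumes that wrapping every column name in backticks (as the tutorial does for count) is acceptable.

```python
def build_filesystem_ddl(table_name, columns, path):
    """Build a CREATE TABLE statement for a CSV file on the filesystem.

    Hypothetical helper, not a PyFlink API. `columns` is a list of
    (name, sql_type) pairs; names are wrapped in backticks so that
    reserved words such as `count` can be used as column names.
    """
    column_lines = ",\n".join(
        "        `{}` {}".format(name, sql_type) for name, sql_type in columns
    )
    return (
        "create table {name} (\n"
        "{cols}\n"
        "    ) with (\n"
        "        'connector.type' = 'filesystem',\n"
        "        'format.type' = 'csv',\n"
        "        'connector.path' = '{path}'\n"
        "    )"
    ).format(name=table_name, cols=column_lines, path=path)


my_sink_ddl = build_filesystem_ddl(
    "mySink", [("word", "VARCHAR"), ("count", "BIGINT")], "/tmp/output"
)
print(my_sink_ddl)
```

The resulting string could then be passed to t_env.sql_update() in the same way as the handwritten DDL.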

You can now create a job which reads input from table mySource, performs some transformations, and writes the results to table mySink.

t_env.from_path('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')
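To make the result of this query concrete, the same grouping and counting can be sketched in plain Python. This is only an illustration of what the query computes, not of how Flink executes it; count_words is a hypothetical name, and the sample rows match the input data used later in this tutorial.

```python
from collections import Counter


def count_words(words):
    """Return {word: occurrences}, mirroring group_by('word') with count(1)."""
    return dict(Counter(words))


rows = ["flink", "pyflink", "flink"]
print(count_words(rows))  # {'flink': 2, 'pyflink': 1}
```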

Finally, you must execute the actual Flink Python Table API job. All operations, such as creating sources, transformations, and sinks, are lazy. The job runs only when t_env.execute(job_name) is called.

t_env.execute("tutorial_job")
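The idea of lazy execution can be illustrated with a toy pipeline that only records its steps until execute() is called. This is a conceptual sketch in plain Python: LazyPipeline and upper_all are hypothetical names, and nothing here reflects how Flink builds or runs a job internally.

```python
class LazyPipeline:
    """Toy stand-in for deferred execution; not part of PyFlink."""

    def __init__(self, source):
        self._source = source
        self._steps = []

    def transform(self, fn):
        # Only record the step; nothing is computed yet.
        self._steps.append(fn)
        return self

    def execute(self):
        # Run every recorded step, in order, over the source data.
        data = list(self._source)
        for fn in self._steps:
            data = fn(data)
        return data


calls = []


def upper_all(words):
    # Record that the step actually ran, then transform the data.
    calls.append("upper_all")
    return [w.upper() for w in words]


pipeline = LazyPipeline(["flink", "pyflink"]).transform(upper_all)
calls_before_execute = list(calls)  # still empty: declaring a step runs nothing
result = pipeline.execute()         # the recorded step runs here
print(calls_before_execute, result)
```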

The complete code so far:

from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

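# Create a batch execution environment and run with a single parallel task.
exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# Wrap it in a table environment, the entry point for the Table API.
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)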

t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .create_temporary_table('mySource')

t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')

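# Count the occurrences of each word and write the result to the sink table.
t_env.from_path('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

# Nothing runs until execute() is called.
t_env.execute("tutorial_job")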

Executing a Flink Python Table API Program

First, prepare the input data in the /tmp/input file. For example, the following command creates it:

$ echo -e "flink\npyflink\nflink" > /tmp/input
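The same preparation can be done from Python, which is convenient when the job is rerun often. The sketch below writes one word per line and, because the job needs the output path to be free, removes a stale result file. prepare_files is a hypothetical helper, it assumes the output is a single file rather than a directory, and the demonstration uses a temporary directory instead of /tmp/input and /tmp/output.

```python
import os
import tempfile


def prepare_files(input_path, output_path, words):
    """Write one word per line to input_path and remove a stale output_path."""
    with open(input_path, "w") as f:
        for word in words:
            f.write(word + "\n")
    # The job expects the output file not to exist yet.
    if os.path.exists(output_path):
        os.remove(output_path)


# Demonstrated in a temporary directory; the tutorial itself uses
# /tmp/input and /tmp/output.
work_dir = tempfile.mkdtemp()
input_path = os.path.join(work_dir, "input")
output_path = os.path.join(work_dir, "output")
prepare_files(input_path, output_path, ["flink", "pyflink", "flink"])
```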

Next, run the example from the command line, assuming the complete code above is saved as WordCount.py (note: if the result file /tmp/output already exists, remove it before running the example):

$ python WordCount.py

The command builds and runs the Python Table API program in a local mini cluster. You can also submit the Python Table API program to a remote cluster; refer to Job Submission Examples for more details.

Finally, you can view the execution result on the command line:

$ cat /tmp/output
flink	2
pyflink	1
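If the result needs to be checked programmatically, the tab-separated output can be read back with the standard csv module. This is a minimal sketch: parse_counts is a hypothetical helper, and the sample text is the output shown above rather than the contents of a real /tmp/output file.

```python
import csv
import io


def parse_counts(text):
    """Parse tab-separated 'word<TAB>count' lines into a dict."""
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        word, count = row
        counts[word] = int(count)
    return counts


sample = "flink\t2\npyflink\t1\n"
print(parse_counts(sample))  # {'flink': 2, 'pyflink': 1}
```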

This should get you started with writing your own Flink Python Table API programs. To learn more about the Python Table API, refer to the Flink Python Table API Docs.