This page answers some common questions from PyFlink users.
You can download a convenience script that prepares a zipped Python virtual environment usable on macOS and most Linux distributions. You can specify a PyFlink version to generate the virtual environment required for that version; if you do not, the most recent version is installed.
After setting up a Python virtual environment as described in the previous section, activate the environment before executing the PyFlink job.
For details on the usage of set_python_executable, you can refer to the relevant documentation.
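As a minimal sketch of how a zipped virtual environment might be attached to a job, the helper below calls add_python_archive and set_python_executable on an existing TableEnvironment. The helper name use_packaged_venv, the archive name venv.zip and the inner venv/bin/python layout are assumptions for illustration, not part of the original page.

```python
import os


def use_packaged_venv(table_env, archive="venv.zip"):
    """Ship a zipped virtual environment with the job and run Python workers from it.

    `table_env` is expected to be a pyflink.table.TableEnvironment.
    """
    # Upload the archive. Without an explicit target directory it is extracted
    # into a directory named after the archive file itself (e.g. "venv.zip").
    table_env.add_python_archive(archive)

    # Point the Python workers at the interpreter inside the extracted archive.
    # Assumption: the zip contains a top-level "venv" directory.
    python_exec = os.path.basename(archive) + "/venv/bin/python"
    table_env.get_config().set_python_executable(python_exec)
    return python_exec
```

A call such as use_packaged_venv(table_env, "venv.zip") would be made once, before any Python UDF is registered or executed.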
A PyFlink job may depend on jar files, e.g. connectors, Java UDFs, etc. You can specify these dependencies with the Python Table API or directly through command-line arguments when submitting the job.
For details about the APIs for adding Java dependencies, you can refer to the relevant documentation.
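One way to declare jar dependencies from Python is through the pipeline.jars configuration option. The sketch below assumes a Flink version in which TableConfig.set is available (1.15 or later); the helper name add_jar_dependencies and the jar URLs in the usage note are hypothetical.

```python
def add_jar_dependencies(table_env, jar_urls):
    """Register jar files (connectors, Java UDFs, ...) with a PyFlink job.

    `table_env` is expected to be a pyflink.table.TableEnvironment and
    `jar_urls` a list of URLs such as "file:///path/to/connector.jar".
    """
    # "pipeline.jars" takes a semicolon-separated list of jar URLs; the jars
    # are uploaded to the cluster together with the job.
    joined = ";".join(jar_urls)
    table_env.get_config().set("pipeline.jars", joined)
    return joined
```

For instance, add_jar_dependencies(table_env, ["file:///my/jar/path/connector.jar", "file:///my/jar/path/udf.jar"]) would register two local jars. On older Flink versions the same option can be set via get_config().get_configuration().set_string instead.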
You can use the command-line argument pyfs or the TableEnvironment API to add Python file dependencies, which can be Python files, Python packages or local directories.
For example, if you have a directory named myDir which has the following hierarchy:
myDir
├──utils
    ├──__init__.py
    ├──my_util.py
You can add the Python files of directory myDir by passing the directory path as the dependency; the modules inside it then become importable relative to myDir.
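The myDir case can be sketched with TableEnvironment.add_python_file, which adds the given file, package or directory to the PYTHONPATH of the Python workers. The helper names below are illustrative only; the import inside my_udf_body works only once myDir has been registered and the code runs in a Python worker.

```python
def add_python_dir(table_env, directory="myDir"):
    """Ship a local directory with the job so its modules can be imported in UDFs.

    `table_env` is expected to be a pyflink.table.TableEnvironment.
    """
    # The directory is uploaded with the job and added to the PYTHONPATH
    # of the Python UDF workers.
    table_env.add_python_file(directory)


def my_udf_body():
    """Body of a hypothetical UDF that uses a module shipped in myDir."""
    # Modules are imported by their path relative to myDir, so
    # myDir/utils/my_util.py is imported as utils.my_util.
    from utils import my_util
    return my_util
```

The import is placed inside the function so that it is resolved on the worker, where myDir is on the path, rather than on the client when the script is first loaded.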
When executing jobs in a mini cluster (e.g. when executing jobs in an IDE) and using asynchronous APIs in those jobs (e.g. TableEnvironment.execute_sql, StatementSet.execute, etc. in the Python Table API; StreamExecutionEnvironment.execute_async in the Python DataStream API), remember to explicitly wait for the job execution to finish. Otherwise you may not find the execution results, because the program exits before the job execution finishes.
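The waiting step can be sketched as two small helpers: one for the TableResult returned by execute_sql or StatementSet.execute, and one for the JobClient returned by execute_async. The helper names are illustrative; the calls they make (TableResult.wait, JobClient.get_job_execution_result and result on the returned future) are the PyFlink ones.

```python
def wait_for_table_result(table_result):
    """Block until a job started via execute_sql() or StatementSet.execute() finishes.

    `table_result` is expected to be a pyflink.table.TableResult.
    """
    table_result.wait()


def wait_for_job_client(job_client):
    """Block until a job started via StreamExecutionEnvironment.execute_async() finishes.

    `job_client` is expected to be a pyflink.common.JobClient.
    """
    # get_job_execution_result() returns a future; result() blocks on it.
    return job_client.get_job_execution_result().result()
```

In a script this amounts to wait_for_table_result(table_env.execute_sql(...)) for the Table API, or wait_for_job_client(env.execute_async('My DataStream Job')) for the DataStream API, where the job name is a placeholder.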
Note: There is no need to wait for the job execution to finish when executing jobs in a remote cluster, so remember to remove the waiting code when submitting jobs to a remote cluster.