Production Readiness Checklist

The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production. While the Flink community has attempted to provide sensible defaults for each configuration, it is important to review this list and ensure the options chosen are sufficient for your needs.

Set An Explicit Max Parallelism

The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale. There is currently no way to change the maximum parallelism of an operator after a job has started without discarding that operators state. The reason maximum parallelism exists, versus allowing stateful operators to be infinitely scalable, is that it has some impact on your application’s performance and state size. Flink has to maintain specific metadata for its ability to rescale state which grows linearly with max parallelism. In general, you should choose max parallelism that is high enough to fit your future needs in scalability, while keeping it low enough to maintain reasonable performance.

Note: Maximum parallelism must fulfill the following conditions: 0 < parallelism <= max parallelism <= 2^15

You can explicitly set maximum parallelism by using setMaxParallelism(int maxparallelism). If no max parallelism is set Flink will decide using a function of the operators parallelism when the job is first started:

  • 128 : for all parallelism <= 128.
  • MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15) : for all parallelism > 128.

Set UUIDs For All Operators

As mentioned in the documentation for savepoints, users should set uids for each operator in their DataStream. Uids are necessary for Flink’s mapping of operator states to operators which, in turn, is essential for savepoints. By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties. While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) results in new UUIDs. To establish a stable mapping, we need stable operator uids provided by the user through setUid(String uid).

Choose The Right State Backend

See the description of state backends for choosing the right one for your use case.

Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster. It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.

Configuring High Availability, in conjunction with Apache Zookeeper, allows for a swift recovery and is highly recommended for production setups.

Back to top