The result of DataSet.aggregate. This can be used to chain more aggregations to the one aggregate operator.
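Chained aggregations are evaluated together in a single pass over the data. The idea can be sketched with plain Scala collections; this is an illustration of the semantics only, not the Flink implementation, and the tuple type and the name sumAndMin are made up:

```scala
// Sketch: computing a SUM over field 1 and a MIN over field 2 of a
// 3-tuple data set in one pass, as a chained aggregation would.
object ChainedAggregateSketch {
  def sumAndMin(input: Seq[(String, Int, Double)]): (Int, Double) =
    input.foldLeft((0, Double.MaxValue)) { case ((sum, min), (_, i, d)) =>
      (sum + i, math.min(min, d))
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 1, 3.0), ("b", 2, 1.5), ("c", 3, 2.0))
    println(sumAndMin(data)) // (6,1.5)
  }
}
```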
The type of the DataSet, i.e., the type of the elements of the DataSet.
A specific DataSet that results from a coGroup operation. The result of a default coGroup is a tuple containing two arrays of values from the two sides of the coGroup. The result of the coGroup can be changed by specifying a custom coGroup function using the apply method or by providing a RichCoGroupFunction.
Example:
val left = ...
val right = ...
val coGroupResult = left.coGroup(right).where(0, 2).equalTo(0, 1) {
  (left, right) => new MyCoGroupResult(left.min, right.max)
}
Or, using key selector functions with tuple data types:
val left = ...
val right = ...
val coGroupResult = left.coGroup(right).where(_._1).equalTo(_._1) {
  (left, right) => new MyCoGroupResult(left.max, right.min)
}
Type of the left input of the coGroup.
Type of the right input of the coGroup.
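The default coGroup semantics described above can be sketched with plain Scala collections: for every key that occurs on either side, the pair of value groups is emitted, with an empty group for a side that has no matching elements. This is illustrative only (the names here are made up, not Flink API):

```scala
// Sketch of default coGroup semantics on in-memory sequences.
object CoGroupSketch {
  def coGroup[L, R, K](left: Seq[L], right: Seq[R])(lKey: L => K)(rKey: R => K): Map[K, (Seq[L], Seq[R])] = {
    val l = left.groupBy(lKey)
    val r = right.groupBy(rKey)
    // Every key seen on either side produces one (left group, right group) pair.
    (l.keySet ++ r.keySet).map { k =>
      k -> (l.getOrElse(k, Seq.empty), r.getOrElse(k, Seq.empty))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val left  = Seq((1, "a"), (1, "b"), (2, "c"))
    val right = Seq((1, "x"), (3, "y"))
    println(coGroup(left, right)(_._1)(_._1))
  }
}
```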
A specific DataSet that results from a cross operation. The result of a default cross is a tuple containing the two values from the two sides of the Cartesian product. The result of the cross can be changed by specifying a custom cross function using the apply method or by providing a RichCrossFunction.
Example:
val left = ...
val right = ...
val crossResult = left.cross(right) {
  (left, right) => new MyCrossResult(left, right)
}
Type of the left input of the cross.
Type of the right input of the cross.
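The Cartesian-product semantics of a default cross can be sketched with a plain Scala for-comprehension (illustrative only, not the Flink implementation):

```scala
// Sketch: cross pairs every left element with every right element.
object CrossSketch {
  def cross[L, R](left: Seq[L], right: Seq[R]): Seq[(L, R)] =
    for (l <- left; r <- right) yield (l, r)

  def main(args: Array[String]): Unit = {
    println(cross(Seq(1, 2), Seq("a", "b")))
    // List((1,a), (1,b), (2,a), (2,b))
  }
}
```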
The DataSet, the basic abstraction of Flink. This represents a collection of elements of a specific type T. The operations in this class can be used to create new DataSets and to combine two DataSets. The methods of ExecutionEnvironment can be used to create a DataSet from an external source, such as files in HDFS. The write* methods can be used to write the elements to storage.
All operations accept either a lambda function or an operation-specific function object for specifying the operation. For example, using a lambda:
val input: DataSet[String] = ...
val mapped = input flatMap { _.split(" ") }
And using a FlatMapFunction:

val input: DataSet[String] = ...
val mapped = input flatMap {
  new FlatMapFunction[String, String] {
    def flatMap(in: String, out: Collector[String]): Unit = {
      in.split(" ") foreach { out.collect(_) }
    }
  }
}
A rich function can be used when more control is required, for example for accessing the RuntimeContext. The rich function for flatMap is RichFlatMapFunction; all other functions are named similarly. All functions are available in the package org.apache.flink.api.common.functions.
The elements are partitioned depending on the parallelism of the ExecutionEnvironment or of one specific DataSet.
Most of the operations have an implicit TypeInformation parameter. This is supplied by an implicit conversion in the org.apache.flink.api.scala package. For this to work, createTypeInformation needs to be imported. This is normally achieved with a wildcard import:

import org.apache.flink.api.scala._
The type of the DataSet, i.e., the type of the elements of the DataSet.
The ExecutionEnvironment is the context in which a program is executed. A local environment will cause execution in the current JVM, while a remote environment will cause execution on a remote cluster installation.
The environment provides methods to control the job execution (such as setting the parallelism) and to interact with the outside world (data access).
To get an execution environment use the methods on the companion object:
Use ExecutionEnvironment#getExecutionEnvironment to get the correct environment depending on where the program is executed. If it is run inside an IDE a local environment will be created. If the program is submitted to a cluster a remote execution environment will be created.
A DataSet to which a grouping key was added. Operations work on groups of elements with the same key (aggregate, reduce, and reduceGroup).

A secondary sort order can be added with sortGroup, but this is only used when using one of the group-at-a-time operations, i.e. reduceGroup.
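The interplay between grouping and a secondary sort order can be sketched with plain Scala collections: elements are grouped by key and each group is sorted before the group function sees it. This is illustrative only; reduceGroupSorted is a made-up name, not Flink API:

```scala
// Sketch: a group-at-a-time operation with a secondary sort.
object SortedGroupSketch {
  def reduceGroupSorted[T, K, S: Ordering, O](input: Seq[T])(key: T => K)(sortKey: T => S)(f: Seq[T] => O): Seq[O] =
    input.groupBy(key).values.map(g => f(g.sortBy(sortKey))).toSeq

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 3), ("a", 1), ("b", 2))
    // Concatenate each group's values in ascending order of the second field.
    val out = reduceGroupSorted(data)(_._1)(_._2)(g => g.map(_._2).mkString(","))
    println(out.sorted)
  }
}
```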
A specific DataSet that results from a join operation. The result of a default join is a tuple containing the two values from the two sides of the join. The result of the join can be changed by specifying a custom join function using the apply method or by providing a RichFlatJoinFunction.
Example:
val left = ...
val right = ...
val joinResult = left.join(right).where(0, 2).equalTo(0, 1) {
  (left, right) => new MyJoinResult(left, right)
}
Or, using key selector functions with tuple data types:
val left = ...
val right = ...
val joinResult = left.join(right).where(_._1).equalTo(_._1) {
  (left, right) => new MyJoinResult(left, right)
}
Type of the left input of the join.
Type of the right input of the join.
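The default inner join semantics can be sketched with plain Scala collections: every pair of elements whose keys are equal is emitted as a tuple. Illustrative only, not the Flink implementation:

```scala
// Sketch of a default inner join on in-memory sequences.
object JoinSketch {
  def join[L, R, K](left: Seq[L], right: Seq[R])(lKey: L => K)(rKey: R => K): Seq[(L, R)] =
    for {
      l <- left
      r <- right
      if lKey(l) == rKey(r)  // keep only pairs with equal keys
    } yield (l, r)

  def main(args: Array[String]): Unit = {
    val left  = Seq((1, "a"), (2, "b"))
    val right = Seq((1, "x"), (1, "y"), (3, "z"))
    println(join(left, right)(_._1)(_._1))
    // List(((1,a),(1,x)), ((1,a),(1,y)))
  }
}
```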
The result of DataSet.sortPartition. This can be used to append additional sort fields to the one sort-partition operator.
The type of the DataSet, i.e., the type of the elements of the DataSet.
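Appending an additional sort field is equivalent to sorting each partition by the tuple of both keys, primary first. A plain-Scala sketch of that semantics (illustrative only; sortPartition here is a made-up standalone function, not the Flink operator):

```scala
// Sketch: sorting a partition by a primary and an appended secondary key.
object SortPartitionSketch {
  def sortPartition[T, K1: Ordering, K2: Ordering](partition: Seq[T])(k1: T => K1, k2: T => K2): Seq[T] =
    partition.sortBy(t => (k1(t), k2(t)))  // lexicographic: k1 first, then k2

  def main(args: Array[String]): Unit = {
    val part = Seq((2, "b"), (1, "z"), (1, "a"))
    println(sortPartition(part)(_._1, _._2))
    // List((1,a), (1,z), (2,b))
  }
}
```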
SelectByMaxFunction to work with Scala tuples
SelectByMinFunction to work with Scala tuples
An unfinished coGroup operation that results from DataSet.coGroup. The keys for the left and right side must be specified using first where and then equalTo. For example:

val left = ...
val right = ...
val coGroupResult = left.coGroup(right).where(...).equalTo(...)
The type of the left input of the coGroup.
The type of the right input of the coGroup.
An unfinished inner join operation that results from calling DataSet.join().
The keys for the left and right side must be specified using first where and then equalTo. For example:

val left = ...
val right = ...
val joinResult = left.join(right).where(...).equalTo(...)
The type of the left input of the join.
The type of the right input of the join.
An unfinished outer join operation that results from calling, e.g. DataSet.fullOuterJoin().
The keys for the left and right side must be specified using first where and then equalTo. Note that a join function must always be specified explicitly when constructing an outer join operator. For example:

val left = ...
val right = ...
val joinResult = left.fullOuterJoin(right).where(...).equalTo(...) {
  (first, second) => ...
}
The type of the left input of the join.
The type of the right input of the join.
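The reason the join function is mandatory can be seen in a plain-Scala sketch of full outer join semantics: either side of an emitted pair may be missing, so the result cannot simply be a tuple of both values. This is illustrative only, not the Flink implementation:

```scala
// Sketch: a full outer join, where unmatched keys produce None on one side.
object OuterJoinSketch {
  def fullOuterJoin[L, R, K, O](left: Seq[L], right: Seq[R])(lKey: L => K)(rKey: R => K)
                               (f: (Option[L], Option[R]) => O): Seq[O] = {
    val l = left.groupBy(lKey)
    val r = right.groupBy(rKey)
    (l.keySet ++ r.keySet).toSeq.flatMap { k =>
      val ls = l.getOrElse(k, Seq.empty).map(Option(_))
      val rs = r.getOrElse(k, Seq.empty).map(Option(_))
      // A side with no match for this key contributes a single None.
      for {
        lo <- if (ls.isEmpty) Seq(None) else ls
        ro <- if (rs.isEmpty) Seq(None) else rs
      } yield f(lo, ro)
    }
  }

  def main(args: Array[String]): Unit = {
    val left  = Seq((1, "a"), (2, "b"))
    val right = Seq((1, "x"), (3, "y"))
    val out = fullOuterJoin(left, right)(_._1)(_._1) {
      (l, r) => (l.map(_._2).getOrElse("-"), r.map(_._2).getOrElse("-"))
    }
    println(out.sortBy(_._1))
  }
}
```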
acceptPartialFunctions extends the original DataSet with methods with unique names that delegate to core higher-order functions (e.g. map) so that we can work around the fact that overloaded methods taking functions as parameters can't accept partial functions as well. This makes it possible to directly apply pattern matching to decompose inputs such as tuples, case classes and collections.

The following is a small example that showcases how this extension works on a Flink data set:
object Main {
  import org.apache.flink.api.scala.extensions._

  case class Point(x: Double, y: Double)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val ds = env.fromElements(Point(1, 2), Point(3, 4), Point(5, 6))
    ds.filterWith {
      case Point(x, _) => x > 1
    }.reduceWith {
      case (Point(x1, y1), Point(x2, y2)) => Point(x1 + x2, y1 + y2)
    }.mapWith {
      case Point(x, y) => (x, y)
    }.flatMapWith {
      case (x, y) => Seq('x' -> x, 'y' -> y)
    }.groupingBy {
      case (id, value) => id
    }
  }
}
The extension consists of several implicit conversions over all the data set representations that could gain from this feature. To use this set of extension methods the user has to explicitly opt in by importing org.apache.flink.api.scala.extensions.acceptPartialFunctions.
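The underlying trick can be sketched in plain Scala: an implicit class adds a uniquely named method that merely delegates to the ordinary higher-order function, so a pattern-matching anonymous function can be passed without running into overload resolution. SeqOps and mapWith here are illustrative names defined on plain Seq, not the Flink extension itself:

```scala
// Sketch: a uniquely named delegate method that accepts pattern-matching
// anonymous functions (partial-function syntax) without overload clashes.
object MapWithSketch {
  implicit class SeqOps[A](self: Seq[A]) {
    def mapWith[B](fun: A => B): Seq[B] = self.map(fun)
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((1, 2), (3, 4))
    // Decompose each tuple directly with a `case` pattern.
    val xs = points.mapWith { case (x, _) => x }
    println(xs) // List(1, 3)
  }
}
```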
For more information and usage examples please consult the Apache Flink official documentation.
The Flink Scala API. org.apache.flink.api.scala.ExecutionEnvironment is the starting-point of any Flink program. It can be used to read from local files, HDFS, or other sources. org.apache.flink.api.scala.DataSet is the main abstraction of data in Flink. It provides operations that create new DataSets via transformations. org.apache.flink.api.scala.GroupedDataSet provides operations on grouped data that results from org.apache.flink.api.scala.DataSet.groupBy().
Use org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment to obtain an execution environment. This will either create a local environment or a remote environment, depending on the context where your program is executing.