class CrossDataSet[L, R] extends DataSet[(L, R)]
A specific DataSet that results from a cross
operation. The result of a default cross is a
tuple containing the two values from the two sides of the cartesian product. The result of the
cross can be changed by specifying a custom cross function using the apply
method or by
providing a RichCrossFunction.
Example:
val left = ... val right = ... val crossResult = left.cross(right) { (left, right) => new MyCrossResult(left, right) }
- L
Type of the left input of the cross.
- R
Type of the right input of the cross.
- Annotations
- @deprecated @Public()
- Deprecated
(Since version 1.18.0)
- See also
- Alphabetic
- By Inheritance
- CrossDataSet
- DataSet
- AnyRef
- Any
- Hide All
- Show All
- Public
- All
Instance Constructors
-
new
CrossDataSet(defaultCross: CrossOperator[L, R, (L, R)], leftInput: DataSet[L], rightInput: DataSet[R])
- Deprecated
All Flink Scala APIs are deprecated and will be removed in a future Flink major version. You can still build your application in Scala, but you should move to the Java version of either the DataStream and/or Table API.
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
aggregate(agg: Aggregations, field: String): AggregateDataSet[(L, R)]
Creates a new DataSet by aggregating the specified field using the given aggregation function.
-
def
aggregate(agg: Aggregations, field: Int): AggregateDataSet[(L, R)]
Creates a new DataSet by aggregating the specified tuple field using the given aggregation function.
-
def
apply[O](crosser: CrossFunction[L, R, O])(implicit arg0: TypeInformation[O], arg1: ClassTag[O]): DataSet[O]
Creates a new DataSet by passing each pair of values to the given function.
Creates a new DataSet by passing each pair of values to the given function. The function can output zero or more elements using the Collector which will form the result.
A RichCrossFunction can be used to access the broadcast variables and the org.apache.flink.api.common.functions.RuntimeContext.
-
def
apply[O](fun: (L, R) ⇒ O)(implicit arg0: TypeInformation[O], arg1: ClassTag[O]): DataSet[O]
Creates a new DataSet where the result for each pair of elements is the result of the given function.
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clone(): AnyRef
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
-
def
coGroup[O](other: DataSet[O])(implicit arg0: ClassTag[O]): UnfinishedCoGroupOperation[(L, R), O]
For each key in
this
DataSet and theother
DataSet, create a tuple containing a list of elements for that key from both DataSets.For each key in
this
DataSet and theother
DataSet, create a tuple containing a list of elements for that key from both DataSets. To specify the join keys thewhere
andisEqualTo
methods must be used. For example:val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val coGrouped = left.coGroup(right).where(0).isEqualTo(1)
A custom coGroup function can be used if more control over the result is required. This can either be given as a lambda or a custom CoGroupFunction. For example:
val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val coGrouped = left.coGroup(right).where(0).isEqualTo(1) { (l, r) => // l and r are of type Iterator (l.min, r.max) }
A coGroup function with a Collector can be used to implement a filter directly in the coGroup or to output more than one values. This type of coGroup function does not return a value, instead values are emitted using the collector
val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val coGrouped = left.coGroup(right).where(0).isEqualTo(1) { (l, r, out: Collector[(String, Int)]) => out.collect((l.min, r.max)) out.collect(l.max, r.min)) }
- Definition Classes
- DataSet
-
def
collect(): Seq[(L, R)]
Convenience method to get the elements of a DataSet as a List As DataSet can contain a lot of data, this method should be used with caution.
Convenience method to get the elements of a DataSet as a List As DataSet can contain a lot of data, this method should be used with caution.
- returns
A Seq containing the elements of the DataSet
- Definition Classes
- DataSet
- Annotations
- @throws( classOf[Exception] )
- See also
org.apache.flink.api.java.Utils.CollectHelper
-
def
combineGroup[R](fun: (Iterator[(L, R)], Collector[R]) ⇒ Unit)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Applies a GroupCombineFunction on a grouped DataSet.
Applies a GroupCombineFunction on a grouped DataSet. A GroupCombineFunction is similar to a GroupReduceFunction but does not perform a full data exchange. Instead, the GroupCombineFunction calls the combine method once per partition for combining a group of results. This operator is suitable for combining values into an intermediate format before doing a proper groupReduce where the data is shuffled across the node for further reduction. The GroupReduce operator can also be supplied with a combiner by implementing the RichGroupReduce function. The combine method of the RichGroupReduce function demands input and output type to be the same. The GroupCombineFunction, on the other side, can have an arbitrary output type.
- Definition Classes
- DataSet
-
def
combineGroup[R](combiner: GroupCombineFunction[(L, R), R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Applies a GroupCombineFunction on a grouped DataSet.
Applies a GroupCombineFunction on a grouped DataSet. A GroupCombineFunction is similar to a GroupReduceFunction but does not perform a full data exchange. Instead, the GroupCombineFunction calls the combine method once per partition for combining a group of results. This operator is suitable for combining values into an intermediate format before doing a proper groupReduce where the data is shuffled across the node for further reduction. The GroupReduce operator can also be supplied with a combiner by implementing the RichGroupReduce function. The combine method of the RichGroupReduce function demands input and output type to be the same. The GroupCombineFunction, on the other side, can have an arbitrary output type.
- Definition Classes
- DataSet
-
def
count(): Long
Convenience method to get the count (number of elements) of a DataSet
Convenience method to get the count (number of elements) of a DataSet
- returns
A long integer that represents the number of elements in the set
- Definition Classes
- DataSet
- Annotations
- @throws( classOf[Exception] )
- See also
org.apache.flink.api.java.Utils.CountHelper
-
def
cross[O](other: DataSet[O]): CrossDataSet[(L, R), O]
Creates a new DataSet by forming the cartesian product of
this
DataSet and theother
DataSet.Creates a new DataSet by forming the cartesian product of
this
DataSet and theother
DataSet.The default cross result is a DataSet with 2-Tuples of the combined values. A custom cross function can be used if more control over the result is required. This can either be given as a lambda or a custom CrossFunction. For example:
val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val product = left.cross(right) { (l, r) => (l._2, r._3) } }
- Definition Classes
- DataSet
-
def
crossWithHuge[O](other: DataSet[O]): CrossDataSet[(L, R), O]
Special cross operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the cartesian product.
-
def
crossWithTiny[O](other: DataSet[O]): CrossDataSet[(L, R), O]
Special cross operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the cartesian product.
-
def
distinct(firstField: String, otherFields: String*): DataSet[(L, R)]
Returns a distinct set of this DataSet using expression keys.
Returns a distinct set of this DataSet using expression keys.
The field position keys specify the fields of Tuples or Pojos on which the decision is made if two elements are distinct or not.
The field expression keys specify the fields of a org.apache.flink.api.common.typeutils.CompositeType (e.g., Tuple or Pojo type) on which the decision is made if two elements are distinct or not. In case of a org.apache.flink.api.common.typeinfo.AtomicType, only the wildcard expression ("_") is valid.
- firstField
First field position on which the distinction of the DataSet is decided
- otherFields
Zero or more field positions on which the distinction of the DataSet is decided
- Definition Classes
- DataSet
-
def
distinct(fields: Int*): DataSet[(L, R)]
Returns a distinct set of a tuple DataSet using field position keys.
Returns a distinct set of a tuple DataSet using field position keys.
The field position keys specify the fields of Tuples on which the decision is made if two Tuples are distinct or not.
Note: Field position keys can only be specified for Tuple DataSets.
- fields
One or more field positions on which the distinction of the DataSet is decided.
- Definition Classes
- DataSet
-
def
distinct(): DataSet[(L, R)]
Returns a distinct set of this DataSet.
Returns a distinct set of this DataSet.
If the input is a composite type (Tuple or Pojo type), distinct is performed on all fields and each field must be a key type.
- Definition Classes
- DataSet
-
def
distinct[K](fun: ((L, R)) ⇒ K)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Creates a new DataSet containing the distinct elements of this DataSet.
Creates a new DataSet containing the distinct elements of this DataSet. The decision whether two elements are distinct or not is made using the return value of the given function.
- fun
The function which extracts the key values from the DataSet on which the distinction of the DataSet is decided.
- Definition Classes
- DataSet
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
filter(fun: ((L, R)) ⇒ Boolean): DataSet[(L, R)]
Creates a new DataSet that contains only the elements satisfying the given filter predicate.
Creates a new DataSet that contains only the elements satisfying the given filter predicate.
- Definition Classes
- DataSet
-
def
filter(filter: FilterFunction[(L, R)]): DataSet[(L, R)]
Creates a new DataSet that contains only the elements satisfying the given filter predicate.
Creates a new DataSet that contains only the elements satisfying the given filter predicate.
- Definition Classes
- DataSet
-
def
finalize(): Unit
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
first(n: Int): DataSet[(L, R)]
Creates a new DataSet containing the first
n
elements of this DataSet.Creates a new DataSet containing the first
n
elements of this DataSet.- Definition Classes
- DataSet
-
def
flatMap[R](fun: ((L, R)) ⇒ TraversableOnce[R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to every element and flattening the results.
Creates a new DataSet by applying the given function to every element and flattening the results.
- Definition Classes
- DataSet
-
def
flatMap[R](fun: ((L, R), Collector[R]) ⇒ Unit)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to every element and flattening the results.
Creates a new DataSet by applying the given function to every element and flattening the results.
- Definition Classes
- DataSet
-
def
flatMap[R](flatMapper: FlatMapFunction[(L, R), R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to every element and flattening the results.
Creates a new DataSet by applying the given function to every element and flattening the results.
- Definition Classes
- DataSet
-
def
fullOuterJoin[O](other: DataSet[O], strategy: JoinHint): UnfinishedOuterJoinOperation[(L, R), O]
Special fullOuterJoin operation for explicitly telling the system what join strategy to use.
Special fullOuterJoin operation for explicitly telling the system what join strategy to use. If null is given as the join strategy, then the optimizer will pick the strategy.
- Definition Classes
- DataSet
-
def
fullOuterJoin[O](other: DataSet[O]): UnfinishedOuterJoinOperation[(L, R), O]
Creates a new DataSet by performing a full outer join of
this
DataSet with theother
DataSet, by combining two elements of two DataSets on key equality.Creates a new DataSet by performing a full outer join of
this
DataSet with theother
DataSet, by combining two elements of two DataSets on key equality. Elements of both DataSets that do not have a matching element on the opposing side are joined withnull
and emitted to the resulting DataSet.To specify the join keys the
where
andequalTo
methods must be used. For example:val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val joined = left.fullOuterJoin(right).where(0).equalTo(1)
When using an outer join you are required to specify a join function. For example:
val joined = left.fullOuterJoin(right).where(0).equalTo(1) { (left, right) => val a = if (left == null) null else left._1 val b = if (right == null) null else right._3 (a, b) }
- Definition Classes
- DataSet
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
getExecutionEnvironment: ExecutionEnvironment
Returns the execution environment associated with the current DataSet.
Returns the execution environment associated with the current DataSet.
- returns
associated execution environment
- Definition Classes
- DataSet
-
def
getParallelism: Int
Returns the parallelism of this operation.
Returns the parallelism of this operation.
- Definition Classes
- DataSet
-
def
getType(): TypeInformation[(L, R)]
Returns the TypeInformation for the elements of this DataSet.
Returns the TypeInformation for the elements of this DataSet.
- Definition Classes
- DataSet
-
def
groupBy(firstField: String, otherFields: String*): GroupedDataSet[(L, R)]
Creates a GroupedDataSet which provides operations on groups of elements.
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the given fields.
This will not create a new DataSet, it will just attach the field names which will be used for grouping when executing a grouped operation.
- Definition Classes
- DataSet
-
def
groupBy(fields: Int*): GroupedDataSet[(L, R)]
Creates a GroupedDataSet which provides operations on groups of elements.
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the given tuple fields.
This will not create a new DataSet, it will just attach the tuple field positions which will be used for grouping when executing a grouped operation.
This only works on Tuple DataSets.
- Definition Classes
- DataSet
-
def
groupBy[K](fun: ((L, R)) ⇒ K)(implicit arg0: TypeInformation[K]): GroupedDataSet[(L, R)]
Creates a GroupedDataSet which provides operations on groups of elements.
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the value returned by the given function.
This will not create a new DataSet, it will just attach the key function which will be used for grouping when executing a grouped operation.
- Definition Classes
- DataSet
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
iterate(maxIterations: Int)(stepFunction: (DataSet[(L, R)]) ⇒ DataSet[(L, R)]): DataSet[(L, R)]
Creates a new DataSet by performing bulk iterations using the given step function.
Creates a new DataSet by performing bulk iterations using the given step function. The iterations terminate when
maxIterations
iterations have been performed.For example:
val input: DataSet[(String, Int)] = ... val iterated = input.iterate(5) { previous => val next = previous.map { x => (x._1, x._2 + 1) } next }
This example will simply increase the second field of the tuple by 5.
- Definition Classes
- DataSet
-
def
iterateDelta[R](workset: DataSet[R], maxIterations: Int, keyFields: Array[String], solutionSetUnManaged: Boolean)(stepFunction: (DataSet[(L, R)], DataSet[R]) ⇒ (DataSet[(L, R)], DataSet[R]))(implicit arg0: ClassTag[R]): DataSet[(L, R)]
Creates a new DataSet by performing delta (or workset) iterations using the given step function.
Creates a new DataSet by performing delta (or workset) iterations using the given step function. At the beginning
this
DataSet is the solution set andworkset
is the Workset. The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.Note: The syntax of delta iterations are very likely going to change soon.
- Definition Classes
- DataSet
-
def
iterateDelta[R](workset: DataSet[R], maxIterations: Int, keyFields: Array[String])(stepFunction: (DataSet[(L, R)], DataSet[R]) ⇒ (DataSet[(L, R)], DataSet[R]))(implicit arg0: ClassTag[R]): DataSet[(L, R)]
Creates a new DataSet by performing delta (or workset) iterations using the given step function.
Creates a new DataSet by performing delta (or workset) iterations using the given step function. At the beginning
this
DataSet is the solution set andworkset
is the Workset. The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.Note: The syntax of delta iterations are very likely going to change soon.
- Definition Classes
- DataSet
-
def
iterateDelta[R](workset: DataSet[R], maxIterations: Int, keyFields: Array[Int], solutionSetUnManaged: Boolean)(stepFunction: (DataSet[(L, R)], DataSet[R]) ⇒ (DataSet[(L, R)], DataSet[R]))(implicit arg0: ClassTag[R]): DataSet[(L, R)]
Creates a new DataSet by performing delta (or workset) iterations using the given step function.
Creates a new DataSet by performing delta (or workset) iterations using the given step function. At the beginning
this
DataSet is the solution set andworkset
is the Workset. The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.Note: The syntax of delta iterations are very likely going to change soon.
- Definition Classes
- DataSet
-
def
iterateDelta[R](workset: DataSet[R], maxIterations: Int, keyFields: Array[Int])(stepFunction: (DataSet[(L, R)], DataSet[R]) ⇒ (DataSet[(L, R)], DataSet[R]))(implicit arg0: ClassTag[R]): DataSet[(L, R)]
Creates a new DataSet by performing delta (or workset) iterations using the given step function.
Creates a new DataSet by performing delta (or workset) iterations using the given step function. At the beginning
this
DataSet is the solution set andworkset
is the Workset. The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.Note: The syntax of delta iterations are very likely going to change soon.
- Definition Classes
- DataSet
-
def
iterateWithTermination(maxIterations: Int)(stepFunction: (DataSet[(L, R)]) ⇒ (DataSet[(L, R)], DataSet[_])): DataSet[(L, R)]
Creates a new DataSet by performing bulk iterations using the given step function.
Creates a new DataSet by performing bulk iterations using the given step function. The first DataSet the step function returns is the input for the next iteration, the second DataSet is the termination criterion. The iterations terminate when either the termination criterion DataSet contains no elements or when
maxIterations
iterations have been performed.For example:
val input: DataSet[(String, Int)] = ... val iterated = input.iterateWithTermination(5) { previous => val next = previous.map { x => (x._1, x._2 + 1) } val term = next.filter { _._2 < 3 } (next, term) }
This example will simply increase the second field of the Tuples until they are no longer smaller than 3.
- Definition Classes
- DataSet
-
def
join[O](other: DataSet[O], strategy: JoinHint): UnfinishedJoinOperation[(L, R), O]
Special join operation for explicitly telling the system what join strategy to use.
-
def
join[O](other: DataSet[O]): UnfinishedJoinOperation[(L, R), O]
Creates a new DataSet by joining
this
DataSet with theother
DataSet.Creates a new DataSet by joining
this
DataSet with theother
DataSet. To specify the join keys thewhere
andequalTo
methods must be used. For example:val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val joined = left.join(right).where(0).equalTo(1)
The default join result is a DataSet with 2-Tuples of the joined values. In the above example that would be
((String, Int, Int), (Int, String, Int))
. A custom join function can be used if more control over the result is required. This can either be given as a lambda or a custom JoinFunction. For example:val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val joined = left.join(right).where(0).equalTo(1) { (l, r) => (l._1, r._2) }
A join function with a Collector can be used to implement a filter directly in the join or to output more than one values. This type of join function does not return a value, instead values are emitted using the collector:
val left: DataSet[(String, Int, Int)] = ... val right: DataSet[(Int, String, Int)] = ... val joined = left.join(right).where(0).equalTo(1) { (l, r, out: Collector[(String, Int)]) => if (l._2 > 4) { out.collect((l._1, r._3)) out.collect((l._1, r._1)) } else { None } }
- Definition Classes
- DataSet
-
def
joinWithHuge[O](other: DataSet[O]): UnfinishedJoinOperation[(L, R), O]
Special join operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the join.
-
def
joinWithTiny[O](other: DataSet[O]): UnfinishedJoinOperation[(L, R), O]
Special join operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the join.
-
def
leftOuterJoin[O](other: DataSet[O], strategy: JoinHint): UnfinishedOuterJoinOperation[(L, R), O]
An outer join on the left side.
An outer join on the left side.
Elements of the left side (i.e.
this
) that do not have a matching element on the other side are joined withnull
and emitted to the resulting DataSet.- other
The other DataSet with which this DataSet is joined.
- strategy
The strategy that should be used execute the join. If { @code null} is given, then the optimizer will pick the join strategy.
- returns
An UnfinishedJoinOperation to continue with the definition of the join transformation
- Definition Classes
- DataSet
- See also
#fullOuterJoin
-
def
leftOuterJoin[O](other: DataSet[O]): UnfinishedOuterJoinOperation[(L, R), O]
An outer join on the left side.
An outer join on the left side.
Elements of the left side (i.e.
this
) that do not have a matching element on the other side are joined withnull
and emitted to the resulting DataSet.- other
The other DataSet with which this DataSet is joined.
- returns
An UnfinishedJoinOperation to continue with the definition of the join transformation
- Definition Classes
- DataSet
- See also
#fullOuterJoin
-
def
map[R](fun: ((L, R)) ⇒ R)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to every element of this DataSet.
Creates a new DataSet by applying the given function to every element of this DataSet.
- Definition Classes
- DataSet
-
def
map[R](mapper: MapFunction[(L, R), R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to every element of this DataSet.
Creates a new DataSet by applying the given function to every element of this DataSet.
- Definition Classes
- DataSet
-
def
mapPartition[R](fun: (Iterator[(L, R)]) ⇒ TraversableOnce[R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
This function is intended for operations that cannot transform individual elements and requires no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
- Definition Classes
- DataSet
-
def
mapPartition[R](fun: (Iterator[(L, R)], Collector[R]) ⇒ Unit)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
This function is intended for operations that cannot transform individual elements and requires no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
- Definition Classes
- DataSet
-
def
mapPartition[R](partitionMapper: MapPartitionFunction[(L, R), R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
Creates a new DataSet by applying the given function to each parallel partition of the DataSet.
This function is intended for operations that cannot transform individual elements and requires no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
- Definition Classes
- DataSet
-
def
max(field: String): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
MAX
-
def
max(field: Int): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
MAX
-
def
maxBy(fields: Int*): DataSet[(L, R)]
Selects an element with maximum value.
Selects an element with maximum value.
The maximum is computed over the specified fields in lexicographical order.
Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
maxBy(0)[1, 0] maxBy(1)[0, 1]
Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
maxBy(0, 1)[0, 1]
If multiple values with maximum value at the specified fields exist, a random one will be picked Internally, this operation is implemented as a ReduceFunction.
- Definition Classes
- DataSet
-
def
min(field: String): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
MIN
-
def
min(field: Int): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
MIN
-
def
minBy(fields: Int*): DataSet[(L, R)]
Selects an element with minimum value.
Selects an element with minimum value.
The minimum is computed over the specified fields in lexicographical order.
Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
minBy(0)[0, 1] minBy(1)[1, 0]
Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
minBy(0, 1)[0, 0]
If multiple values with minimum value at the specified fields exist, a random one will be picked. Internally, this operation is implemented as a ReduceFunction
- Definition Classes
- DataSet
-
def
minResources: ResourceSpec
Returns the minimum resources of this operation.
Returns the minimum resources of this operation.
- Definition Classes
- DataSet
- Annotations
- @PublicEvolving()
-
def
name(name: String): DataSet[(L, R)]
Sets the name of the DataSet.
Sets the name of the DataSet. This will appear in logs and graphical representations of the execution graph.
- Definition Classes
- DataSet
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
output(outputFormat: OutputFormat[(L, R)]): DataSink[(L, R)]
Emits
this
DataSet using a custom org.apache.flink.api.common.io.OutputFormat.Emits
this
DataSet using a custom org.apache.flink.api.common.io.OutputFormat.- Definition Classes
- DataSet
-
def
partitionByHash[K](fun: ((L, R)) ⇒ K)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Partitions a DataSet using the specified key selector function.
Partitions a DataSet using the specified key selector function.
Important:This operation shuffles the whole DataSet over the network and can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionByHash(firstField: String, otherFields: String*): DataSet[(L, R)]
Hash-partitions a DataSet on the specified fields.
Hash-partitions a DataSet on the specified fields.
important: This operation shuffles the whole DataSet over the network and can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionByHash(fields: Int*): DataSet[(L, R)]
Hash-partitions a DataSet on the specified tuple field positions.
Hash-partitions a DataSet on the specified tuple field positions.
important: This operation shuffles the whole DataSet over the network and can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionByRange[K](fun: ((L, R)) ⇒ K)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Range-partitions a DataSet using the specified key selector function.
Range-partitions a DataSet using the specified key selector function.
important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionByRange(firstField: String, otherFields: String*): DataSet[(L, R)]
Range-partitions a DataSet on the specified fields.
Range-partitions a DataSet on the specified fields.
important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionByRange(fields: Int*): DataSet[(L, R)]
Range-partitions a DataSet on the specified tuple field positions.
Range-partitions a DataSet on the specified tuple field positions.
important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take significant amount of time.
- Definition Classes
- DataSet
-
def
partitionCustom[K](partitioner: Partitioner[K], fun: ((L, R)) ⇒ K)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Partitions a DataSet on the key returned by the selector, using a custom partitioner.
Partitions a DataSet on the key returned by the selector, using a custom partitioner. This method takes the key selector to get the key to partition on, and a partitioner that accepts the key type.
Note: This method works only on single field keys, i.e. the selector cannot return tuples of fields.
- Definition Classes
- DataSet
-
def
partitionCustom[K](partitioner: Partitioner[K], field: String)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Partitions a POJO DataSet on the specified key fields using a custom partitioner.
Partitions a POJO DataSet on the specified key fields using a custom partitioner. This method takes the key expression to partition on, and a partitioner that accepts the key type.
Note: This method works only on single field keys.
- Definition Classes
- DataSet
-
def
partitionCustom[K](partitioner: Partitioner[K], field: Int)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Partitions a tuple DataSet on the specified key fields using a custom partitioner.
Partitions a tuple DataSet on the specified key fields using a custom partitioner. This method takes the key position to partition on, and a partitioner that accepts the key type.
Note: This method works only on single field keys.
- Definition Classes
- DataSet
-
def
preferredResources: ResourceSpec
Returns the preferred resources of this operation.
Returns the preferred resources of this operation.
- Definition Classes
- DataSet
- Annotations
- @PublicEvolving()
-
def
print(): Unit
Prints the elements in a DataSet to the standard output stream System.out of the JVM that calls the print() method.
Prints the elements in a DataSet to the standard output stream System.out of the JVM that calls the print() method. For programs that are executed in a cluster, this method needs to gather the contents of the DataSet back to the client, to print it there.
The string written for each element is defined by the AnyRef.toString method.
This method immediately triggers the program execution, similar to the collect() and count() methods.
- Definition Classes
- DataSet
-
def
printOnTaskManager(prefix: String): DataSink[(L, R)]
Writes a DataSet to the standard output streams (stdout) of the TaskManagers that execute the program (or more specifically, the data sink operators).
Writes a DataSet to the standard output streams (stdout) of the TaskManagers that execute the program (or more specifically, the data sink operators). On a typical cluster setup, the data will appear in the TaskManagers' .out files.
To print the data to the console or stdout stream of the client process instead, use the print() method.
For each element of the DataSet the result of AnyRef.toString() is written.
- prefix
The string to prefix each line of the output with. This helps identifying outputs from different printing sinks.
- returns
The DataSink operator that writes the DataSet.
- Definition Classes
- DataSet
-
def
printToErr(): Unit
Prints the elements in a DataSet to the standard error stream System.err of the JVM that calls the print() method.
Prints the elements in a DataSet to the standard error stream System.err of the JVM that calls the print() method. For programs that are executed in a cluster, this method needs to gather the contents of the DataSet back to the client, to print it there.
The string written for each element is defined by the AnyRef.toString method.
This method immediately triggers the program execution, similar to the collect() and count() methods.
- Definition Classes
- DataSet
-
def
rebalance(): DataSet[(L, R)]
Enforces a re-balancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task.
Enforces a re-balancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task. This can help to improve performance in case of heavy data skew and compute intensive operations.
Important: This operation shuffles the whole DataSet over the network and can take significant amount of time.
- returns
The rebalanced DataSet.
- Definition Classes
- DataSet
-
def
reduce(fun: ((L, R), (L, R)) ⇒ (L, R)): DataSet[(L, R)]
Creates a new DataSet by merging the elements of this DataSet using an associative reduce function.
-
def
reduce(reducer: ReduceFunction[(L, R)]): DataSet[(L, R)]
Creates a new DataSet by merging the elements of this DataSet using an associative reduce function.
-
def
reduceGroup[R](fun: (Iterator[(L, R)]) ⇒ R)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by passing all elements in this DataSet to the group reduce function.
-
def
reduceGroup[R](fun: (Iterator[(L, R)], Collector[R]) ⇒ Unit)(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by passing all elements in this DataSet to the group reduce function.
-
def
reduceGroup[R](reducer: GroupReduceFunction[(L, R), R])(implicit arg0: TypeInformation[R], arg1: ClassTag[R]): DataSet[R]
Creates a new DataSet by passing all elements in this DataSet to the group reduce function.
-
def
registerAggregator(name: String, aggregator: Aggregator[_]): DataSet[(L, R)]
Registers an org.apache.flink.api.common.aggregators.Aggregator for the iteration.
Registers an org.apache.flink.api.common.aggregators.Aggregator for the iteration. Aggregators can be used to maintain simple statistics during the iteration, such as number of elements processed. The aggregators compute global aggregates: After each iteration step, the values are globally aggregated to produce one aggregate that represents statistics across all parallel instances. The value of an aggregator can be accessed in the next iteration.
Aggregators can be accessed inside a function via org.apache.flink.api.common.functions.AbstractRichFunction#getIterationRuntimeContext.
- name
The name under which the aggregator is registered.
- aggregator
The aggregator class.
- Definition Classes
- DataSet
- Annotations
- @PublicEvolving()
-
def
rightOuterJoin[O](other: DataSet[O], strategy: JoinHint): UnfinishedOuterJoinOperation[(L, R), O]
An outer join on the right side.
An outer join on the right side.
Elements of the right side (i.e.
other
) that do not have a matching element onthis
side are joined withnull
and emitted to the resulting DataSet.- other
The other DataSet with which this DataSet is joined.
- strategy
The strategy that should be used execute the join. If { @code null} is given, then the optimizer will pick the join strategy.
- returns
An UnfinishedJoinOperation to continue with the definition of the join transformation
- Definition Classes
- DataSet
- See also
#fullOuterJoin
-
def
rightOuterJoin[O](other: DataSet[O]): UnfinishedOuterJoinOperation[(L, R), O]
An outer join on the right side.
An outer join on the right side.
Elements of the right side (i.e.
other
) that do not have a matching element onthis
side are joined withnull
and emitted to the resulting DataSet.- other
The other DataSet with which this DataSet is joined.
- returns
An UnfinishedJoinOperation to continue with the definition of the join transformation
- Definition Classes
- DataSet
- See also
#fullOuterJoin
-
def
setParallelism(parallelism: Int): DataSet[(L, R)]
Sets the parallelism of this operation.
Sets the parallelism of this operation. This must be greater than 1.
- Definition Classes
- DataSet
-
def
sortPartition[K](fun: ((L, R)) ⇒ K, order: Order)(implicit arg0: TypeInformation[K]): DataSet[(L, R)]
Locally sorts the partitions of the DataSet on the extracted key in the specified order.
Locally sorts the partitions of the DataSet on the extracted key in the specified order. The DataSet can be sorted on multiple values by returning a tuple from the KeySelector.
Note that no additional sort keys can be appended to a KeySelector sort keys. To sort the partitions by multiple values using KeySelector, the KeySelector must return a tuple consisting of the values.
- Definition Classes
- DataSet
-
def
sortPartition(field: String, order: Order): DataSet[(L, R)]
Locally sorts the partitions of the DataSet on the specified field in the specified order.
Locally sorts the partitions of the DataSet on the specified field in the specified order. The DataSet can be sorted on multiple fields by chaining sortPartition() calls.
- Definition Classes
- DataSet
-
def
sortPartition(field: Int, order: Order): DataSet[(L, R)]
Locally sorts the partitions of the DataSet on the specified field in the specified order.
Locally sorts the partitions of the DataSet on the specified field in the specified order. The DataSet can be sorted on multiple fields by chaining sortPartition() calls.
- Definition Classes
- DataSet
-
def
sum(field: String): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
SUM
-
def
sum(field: Int): AggregateDataSet[(L, R)]
Syntactic sugar for aggregate with
SUM
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
def
union(other: DataSet[(L, R)]): DataSet[(L, R)]
Creates a new DataSet containing the elements from both
this
DataSet and theother
DataSet.Creates a new DataSet containing the elements from both
this
DataSet and theother
DataSet.- Definition Classes
- DataSet
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
-
def
withBroadcastSet(data: DataSet[_], name: String): DataSet[(L, R)]
Adds a certain data set as a broadcast set to this operator.
Adds a certain data set as a broadcast set to this operator. Broadcast data sets are available at all parallel instances of this operator. A broadcast data set is registered under a certain name, and can be retrieved under that name from the operators runtime context via
org.apache.flink.api.common.functions.RuntimeContext.getBroadCastVariable(String)
The runtime context itself is available in all UDFs via
org.apache.flink.api.common.functions.AbstractRichFunction#getRuntimeContext()
- data
The data set to be broadcast.
- name
The name under which the broadcast data set retrieved.
- returns
The operator itself, to allow chaining function calls.
- Definition Classes
- DataSet
-
def
withForwardedFields(forwardedFields: String*): DataSet[(L, R)]
Adds semantic information about forwarded fields of the user-defined function.
Adds semantic information about forwarded fields of the user-defined function. The forwarded fields information declares fields which are never modified by the function and which are forwarded to the same position in the output or copied unchanged to another position in the output.
Fields that are forwarded to the same position are specified just by their position. The specified position must be valid for the input and output data type and have the same type. For example
withForwardedFields("_3")
declares that the third field of an input tuple is copied to the third field of an output tuple.Fields which are copied to another position in the output unchanged are declared by specifying the source field reference in the input and the target field reference in the output.
withForwardedFields("_1->_3")
denotes that the first field of the input tuple is copied to the third field of the output tuple unchanged. When using a wildcard ("*") ensure that the number of declared fields and their types in input and output type match.Multiple forwarded fields can be annotated in one (
withForwardedFields("_2; _3->_1; _4")
) or separate Strings (withForwardedFields("_2", "_3->_1", "_4")
). Please refer to the JavaDoc oforg.apache.flink.api.common.functions.Function
or Flink's documentation for details on field references such as nested fields and wildcard.It is not possible to override existing semantic information about forwarded fields which was for example added by a
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
class annotation.NOTE: Adding semantic information for functions is optional! If used correctly, semantic information can help the Flink optimizer to generate more efficient execution plans. However, incorrect semantic information can cause the optimizer to generate incorrect execution plans which compute wrong results! So be careful when adding semantic information.
- forwardedFields
A list of field forward expressions.
- returns
This operator with annotated forwarded field information.
- Definition Classes
- DataSet
- See also
org.apache.flink.api.java.functions.FunctionAnnotation
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
-
def
withForwardedFieldsFirst(forwardedFields: String*): DataSet[(L, R)]
Adds semantic information about forwarded fields of the first input of the user-defined function.
Adds semantic information about forwarded fields of the first input of the user-defined function. The forwarded fields information declares fields which are never modified by the function and which are forwarded to the same position in the output or copied unchanged to another position in the output.
Fields that are forwarded to the same position are specified just by their position. The specified position must be valid for the input and output data type and have the same type. For example
withForwardedFieldsFirst("_3")
declares that the third field of an input tuple from the first input is copied to the third field of an output tuple.Fields which are copied from the first input to another position in the output unchanged are declared by specifying the source field reference in the first input and the target field reference in the output.
withForwardedFieldsFirst("_1->_3")
denotes that the first field of the first input tuple is copied to the third field of the output tuple unchanged. When using a wildcard ("*") ensure that the number of declared fields and their types in the first input and output type match.Multiple forwarded fields can be annotated in one (
withForwardedFieldsFirst("_2; _3->_0; _4")
) or separate Strings (withForwardedFieldsFirst("_2", "_3->_0", "_4")
). Please refer to the JavaDoc oforg.apache.flink.api.common.functions.Function
or Flink's documentation for details on field references such as nested fields and wildcard.It is not possible to override existing semantic information about forwarded fields of the first input which was for example added by a
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFieldsFirst
class annotation.NOTE: Adding semantic information for functions is optional! If used correctly, semantic information can help the Flink optimizer to generate more efficient execution plans. However, incorrect semantic information can cause the optimizer to generate incorrect execution plans which compute wrong results! So be careful when adding semantic information.
- forwardedFields
A list of forwarded field expressions for the first input of the function.
- returns
This operator with annotated forwarded field information.
- Definition Classes
- DataSet
- See also
org.apache.flink.api.java.functions.FunctionAnnotation
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFieldsFirst
-
def
withForwardedFieldsSecond(forwardedFields: String*): DataSet[(L, R)]
Adds semantic information about forwarded fields of the second input of the user-defined function.
Adds semantic information about forwarded fields of the second input of the user-defined function. The forwarded fields information declares fields which are never modified by the function and which are forwarded to the same position in the output or copied unchanged to another position in the output.
Fields that are forwarded to the same position are specified just by their position. The specified position must be valid for the input and output data type and have the same type. For example
withForwardedFieldsFirst("_3")
declares that the third field of an input tuple from the second input is copied to the third field of an output tuple.Fields which are copied from the second input to another position in the output unchanged are declared by specifying the source field reference in the second input and the target field reference in the output.
withForwardedFieldsFirst("_1->_3")
denotes that the first field of the second input tuple is copied to the third field of the output tuple unchanged. When using a wildcard ("*") ensure that the number of declared fields and their types in the second input and output type match.Multiple forwarded fields can be annotated in one (
withForwardedFieldsFirst("_2; _3->_0; _4")
) or separate Strings (withForwardedFieldsFirst("_2", "_3->_0", "_4")
). Please refer to the JavaDoc oforg.apache.flink.api.common.functions.Function
or Flink's documentation for details on field references such as nested fields and wildcard.It is not possible to override existing semantic information about forwarded fields of the second input which was for example added by a
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFieldsFirst
class annotation.NOTE: Adding semantic information for functions is optional! If used correctly, semantic information can help the Flink optimizer to generate more efficient execution plans. However, incorrect semantic information can cause the optimizer to generate incorrect execution plans which compute wrong results! So be careful when adding semantic information.
- forwardedFields
A list of forwarded field expressions for the second input of the function.
- returns
This operator with annotated forwarded field information.
- Definition Classes
- DataSet
- See also
org.apache.flink.api.java.functions.FunctionAnnotation
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFieldsFirst
-
def
withParameters(parameters: Configuration): DataSet[(L, R)]
- Definition Classes
- DataSet
-
def
write(outputFormat: FileOutputFormat[(L, R)], filePath: String, writeMode: WriteMode = null): DataSink[(L, R)]
Writes
this
DataSet to the specified location using a custom org.apache.flink.api.common.io.FileOutputFormat.Writes
this
DataSet to the specified location using a custom org.apache.flink.api.common.io.FileOutputFormat.- Definition Classes
- DataSet
-
def
writeAsCsv(filePath: String, rowDelimiter: String = ..., fieldDelimiter: String = ..., writeMode: WriteMode = null): DataSink[(L, R)]
Writes
this
DataSet to the specified location as CSV file(s).Writes
this
DataSet to the specified location as CSV file(s).This only works on Tuple DataSets. For individual tuple fields AnyRef.toString is used.
- Definition Classes
- DataSet
- See also
org.apache.flink.api.java.DataSet#writeAsText(String)
-
def
writeAsText(filePath: String, writeMode: WriteMode = null): DataSink[(L, R)]
Writes
this
DataSet to the specified location.Writes
this
DataSet to the specified location. This uses AnyRef.toString on each element.- Definition Classes
- DataSet
- See also
org.apache.flink.api.java.DataSet#writeAsText(String)
Deprecated Value Members
-
def
print(sinkIdentifier: String): DataSink[(L, R)]
* Writes a DataSet to the standard output stream (stdout) with a sink identifier prefixed.
* Writes a DataSet to the standard output stream (stdout) with a sink identifier prefixed. This uses AnyRef.toString on each element.
- sinkIdentifier
The string to prefix the output with.
- Definition Classes
- DataSet
- Annotations
- @deprecated @PublicEvolving()
- Deprecated
-
def
printToErr(sinkIdentifier: String): DataSink[(L, R)]
Writes a DataSet to the standard error stream (stderr) with a sink identifier prefixed.
Writes a DataSet to the standard error stream (stderr) with a sink identifier prefixed. This uses AnyRef.toString on each element.
- sinkIdentifier
The string to prefix the output with.
- Definition Classes
- DataSet
- Annotations
- @deprecated @PublicEvolving()
- Deprecated