public class DataSet<T> extends Object

A DataSet represents a collection of elements of the same type T. The operations in this class can be used to create new DataSets and to combine two DataSets. The methods of ExecutionEnvironment can be used to create a DataSet from an external source, such as files in HDFS. The write* methods can be used to write the elements to storage.
All operations accept either a lambda function or an operation-specific function object for specifying the operation. For example, using a lambda:

```scala
val input: DataSet[String] = ...
val mapped = input flatMap { _.split(" ") }
```

And using a FlatMapFunction:

```scala
val input: DataSet[String] = ...
val mapped = input flatMap { new FlatMapFunction[String, String] {
  def flatMap(in: String, out: Collector[String]): Unit = {
    in.split(" ") foreach { out.collect(_) }
  }
} }
```
A rich function can be used when more control is required, for example for accessing the RuntimeContext. The rich function for flatMap is RichFlatMapFunction; all other functions are named similarly. All functions are available in the package org.apache.flink.api.common.functions.
The elements are partitioned depending on the parallelism of the ExecutionEnvironment or of one specific DataSet.
Most of the operations have an implicit TypeInformation parameter. This is supplied by an implicit conversion in the org.apache.flink.api.scala package. For this to work, createTypeInformation needs to be imported. This is normally achieved with a wildcard import:

```scala
import org.apache.flink.api.scala._
```
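As an illustration, a minimal word-count-style program using this import might look as follows (a sketch; the input path is hypothetical):

```scala
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // readTextFile creates a DataSet[String] from an external source
    val input: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
    val counts = input
      .flatMap { _.toLowerCase.split("\\W+") }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
    // print() triggers program execution and prints on the client
    counts.print()
  }
}
```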
| Constructor and Description |
|---|
| `DataSet(DataSet<T> set, scala.reflect.ClassTag<T> evidence$1)` |
| Modifier and Type | Method and Description |
|---|---|
| `AggregateDataSet<T>` | `aggregate(Aggregations agg, int field)` Creates a new DataSet by aggregating the specified tuple field using the given aggregation function. |
| `AggregateDataSet<T>` | `aggregate(Aggregations agg, String field)` Creates a new DataSet by aggregating the specified field using the given aggregation function. |
| `<F> F` | `clean(F f, boolean checkSerializable)` Cleans a closure to make it ready to be serialized and sent to tasks (removes unreferenced variables in $outer's, updates REPL variables). If checkSerializable is set, clean will also proactively check whether f is serializable and throw an InvalidProgramException if it is not. |
| `<O> UnfinishedCoGroupOperation<T,O>` | `coGroup(DataSet<O> other, scala.reflect.ClassTag<O> evidence$30)` For each key in this DataSet and the other DataSet, creates a tuple containing a list of elements for that key from both DataSets. |
| `scala.collection.Seq<T>` | `collect()` Convenience method to get the elements of a DataSet as a List. As a DataSet can contain a lot of data, this method should be used with caution. |
| `<R> DataSet<R>` | `combineGroup(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$26, scala.reflect.ClassTag<R> evidence$27)` Applies a GroupCombineFunction on a grouped DataSet. |
| `<R> DataSet<R>` | `combineGroup(GroupCombineFunction<T,R> combiner, TypeInformation<R> evidence$24, scala.reflect.ClassTag<R> evidence$25)` Applies a GroupCombineFunction on a grouped DataSet. |
| `long` | `count()` Convenience method to get the count (number of elements) of a DataSet. |
| `<O> CrossDataSet<T,O>` | `cross(DataSet<O> other)` Creates a new DataSet by forming the Cartesian product of this DataSet and the other DataSet. |
| `<O> CrossDataSet<T,O>` | `crossWithHuge(DataSet<O> other)` Special cross operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the Cartesian product. |
| `<O> CrossDataSet<T,O>` | `crossWithTiny(DataSet<O> other)` Special cross operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the Cartesian product. |
| `DataSet<T>` | `distinct()` Returns a distinct set of this DataSet. |
| `<K> DataSet<T>` | `distinct(scala.Function1<T,K> fun, TypeInformation<K> evidence$28)` Creates a new DataSet containing the distinct elements of this DataSet. |
| `DataSet<T>` | `distinct(scala.collection.Seq<Object> fields)` Returns a distinct set of a tuple DataSet using field position keys. |
| `DataSet<T>` | `distinct(String firstField, scala.collection.Seq<String> otherFields)` Returns a distinct set of this DataSet using expression keys. |
| `DataSet<T>` | `filter(FilterFunction<T> filter)` Creates a new DataSet that contains only the elements satisfying the given filter predicate. |
| `DataSet<T>` | `filter(scala.Function1<T,Object> fun)` Creates a new DataSet that contains only the elements satisfying the given filter predicate. |
| `DataSet<T>` | `first(int n)` Creates a new DataSet containing the first n elements of this DataSet. |
| `<R> DataSet<R>` | `flatMap(FlatMapFunction<T,R> flatMapper, TypeInformation<R> evidence$12, scala.reflect.ClassTag<R> evidence$13)` Creates a new DataSet by applying the given function to every element and flattening the results. |
| `<R> DataSet<R>` | `flatMap(scala.Function1<T,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$16, scala.reflect.ClassTag<R> evidence$17)` Creates a new DataSet by applying the given function to every element and flattening the results. |
| `<R> DataSet<R>` | `flatMap(scala.Function2<T,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$14, scala.reflect.ClassTag<R> evidence$15)` Creates a new DataSet by applying the given function to every element and flattening the results. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `fullOuterJoin(DataSet<O> other)` Creates a new DataSet by performing a full outer join of this DataSet with the other DataSet, by combining two elements of two DataSets on key equality. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `fullOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)` Special fullOuterJoin operation for explicitly telling the system what join strategy to use. |
| `ExecutionEnvironment` | `getExecutionEnvironment()` Returns the execution environment associated with the current DataSet. |
| `int` | `getParallelism()` Returns the parallelism of this operation. |
| `TypeInformation<T>` | `getType()` Returns the TypeInformation for the elements of this DataSet. |
| `<K> GroupedDataSet<T>` | `groupBy(scala.Function1<T,K> fun, TypeInformation<K> evidence$29)` Creates a GroupedDataSet which provides operations on groups of elements. |
| `GroupedDataSet<T>` | `groupBy(scala.collection.Seq<Object> fields)` Creates a GroupedDataSet which provides operations on groups of elements. |
| `GroupedDataSet<T>` | `groupBy(String firstField, scala.collection.Seq<String> otherFields)` Creates a GroupedDataSet which provides operations on groups of elements. |
| `DataSet<T>` | `iterate(int maxIterations, scala.Function1<DataSet<T>,DataSet<T>> stepFunction)` Creates a new DataSet by performing bulk iterations using the given step function. |
| `<R> DataSet<T>` | `iterateDelta(DataSet<R> workset, int maxIterations, int[] keyFields, boolean solutionSetUnManaged, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$32)` Creates a new DataSet by performing delta (or workset) iterations using the given step function. |
| `<R> DataSet<T>` | `iterateDelta(DataSet<R> workset, int maxIterations, int[] keyFields, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$31)` Creates a new DataSet by performing delta (or workset) iterations using the given step function. |
| `<R> DataSet<T>` | `iterateDelta(DataSet<R> workset, int maxIterations, String[] keyFields, boolean solutionSetUnManaged, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$34)` Creates a new DataSet by performing delta (or workset) iterations using the given step function. |
| `<R> DataSet<T>` | `iterateDelta(DataSet<R> workset, int maxIterations, String[] keyFields, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$33)` Creates a new DataSet by performing delta (or workset) iterations using the given step function. |
| `DataSet<T>` | `iterateWithTermination(int maxIterations, scala.Function1<DataSet<T>,scala.Tuple2<DataSet<T>,DataSet<?>>> stepFunction)` Creates a new DataSet by performing bulk iterations using the given step function. |
| `DataSet<T>` | `javaSet()` Returns the underlying Java DataSet. |
| `<O> UnfinishedJoinOperation<T,O>` | `join(DataSet<O> other)` Creates a new DataSet by joining this DataSet with the other DataSet. |
| `<O> UnfinishedJoinOperation<T,O>` | `join(DataSet<O> other, JoinOperatorBase.JoinHint strategy)` Special join operation for explicitly telling the system what join strategy to use. |
| `<O> UnfinishedJoinOperation<T,O>` | `joinWithHuge(DataSet<O> other)` Special join operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the join. |
| `<O> UnfinishedJoinOperation<T,O>` | `joinWithTiny(DataSet<O> other)` Special join operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the join. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `leftOuterJoin(DataSet<O> other)` An outer join on the left side. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `leftOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)` An outer join on the left side. |
| `<R> DataSet<R>` | `map(scala.Function1<T,R> fun, TypeInformation<R> evidence$4, scala.reflect.ClassTag<R> evidence$5)` Creates a new DataSet by applying the given function to every element of this DataSet. |
| `<R> DataSet<R>` | `map(MapFunction<T,R> mapper, TypeInformation<R> evidence$2, scala.reflect.ClassTag<R> evidence$3)` Creates a new DataSet by applying the given function to every element of this DataSet. |
| `<R> DataSet<R>` | `mapPartition(scala.Function1<scala.collection.Iterator<T>,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$10, scala.reflect.ClassTag<R> evidence$11)` Creates a new DataSet by applying the given function to each parallel partition of the DataSet. |
| `<R> DataSet<R>` | `mapPartition(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$8, scala.reflect.ClassTag<R> evidence$9)` Creates a new DataSet by applying the given function to each parallel partition of the DataSet. |
| `<R> DataSet<R>` | `mapPartition(MapPartitionFunction<T,R> partitionMapper, TypeInformation<R> evidence$6, scala.reflect.ClassTag<R> evidence$7)` Creates a new DataSet by applying the given function to each parallel partition of the DataSet. |
| `AggregateDataSet<T>` | `max(int field)` Syntactic sugar for aggregate with MAX. |
| `AggregateDataSet<T>` | `max(String field)` Syntactic sugar for aggregate with MAX. |
| `DataSet<T>` | `maxBy(scala.collection.Seq<Object> fields)` Selects an element with maximum value. |
| `AggregateDataSet<T>` | `min(int field)` Syntactic sugar for aggregate with MIN. |
| `AggregateDataSet<T>` | `min(String field)` Syntactic sugar for aggregate with MIN. |
| `DataSet<T>` | `minBy(scala.collection.Seq<Object> fields)` Selects an element with minimum value. |
| `DataSet<T>` | `name(String name)` Sets the name of the DataSet. |
| `DataSink<T>` | `output(OutputFormat<T> outputFormat)` Emits this DataSet using a custom OutputFormat. |
| `<K> DataSet<T>` | `partitionByHash(scala.Function1<T,K> fun, TypeInformation<K> evidence$35)` Partitions a DataSet using the specified key selector function. |
| `DataSet<T>` | `partitionByHash(scala.collection.Seq<Object> fields)` Hash-partitions a DataSet on the specified tuple field positions. |
| `DataSet<T>` | `partitionByHash(String firstField, scala.collection.Seq<String> otherFields)` Hash-partitions a DataSet on the specified fields. |
| `<K> DataSet<T>` | `partitionByRange(scala.Function1<T,K> fun, TypeInformation<K> evidence$36)` Range-partitions a DataSet using the specified key selector function. |
| `DataSet<T>` | `partitionByRange(scala.collection.Seq<Object> fields)` Range-partitions a DataSet on the specified tuple field positions. |
| `DataSet<T>` | `partitionByRange(String firstField, scala.collection.Seq<String> otherFields)` Range-partitions a DataSet on the specified fields. |
| `<K> DataSet<T>` | `partitionCustom(Partitioner<K> partitioner, scala.Function1<T,K> fun, TypeInformation<K> evidence$39)` Partitions a DataSet on the key returned by the selector, using a custom partitioner. |
| `<K> DataSet<T>` | `partitionCustom(Partitioner<K> partitioner, int field, TypeInformation<K> evidence$37)` Partitions a tuple DataSet on the specified key fields using a custom partitioner. |
| `<K> DataSet<T>` | `partitionCustom(Partitioner<K> partitioner, String field, TypeInformation<K> evidence$38)` Partitions a POJO DataSet on the specified key fields using a custom partitioner. |
| `void` | `print()` Prints the elements in a DataSet to the standard output stream System.out of the JVM that calls the print() method. |
| `DataSink<T>` | `print(String sinkIdentifier)` Deprecated. Use printOnTaskManager(String) instead. |
| `DataSink<T>` | `printOnTaskManager(String prefix)` Writes a DataSet to the standard output streams (stdout) of the TaskManagers that execute the program (or more specifically, the data sink operators). |
| `void` | `printToErr()` Prints the elements in a DataSet to the standard error stream System.err of the JVM that calls the printToErr() method. |
| `DataSink<T>` | `printToErr(String sinkIdentifier)` Deprecated. Use printOnTaskManager(String) instead. |
| `DataSet<T>` | `rebalance()` Enforces a re-balancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task. |
| `DataSet<T>` | `reduce(scala.Function2<T,T,T> fun)` Creates a new DataSet by merging the elements of this DataSet using an associative reduce function. |
| `DataSet<T>` | `reduce(ReduceFunction<T> reducer)` Creates a new DataSet by merging the elements of this DataSet using an associative reduce function. |
| `<R> DataSet<R>` | `reduceGroup(scala.Function1<scala.collection.Iterator<T>,R> fun, TypeInformation<R> evidence$22, scala.reflect.ClassTag<R> evidence$23)` Creates a new DataSet by passing all elements in this DataSet to the group reduce function. |
| `<R> DataSet<R>` | `reduceGroup(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$20, scala.reflect.ClassTag<R> evidence$21)` Creates a new DataSet by passing all elements in this DataSet to the group reduce function. |
| `<R> DataSet<R>` | `reduceGroup(GroupReduceFunction<T,R> reducer, TypeInformation<R> evidence$18, scala.reflect.ClassTag<R> evidence$19)` Creates a new DataSet by passing all elements in this DataSet to the group reduce function. |
| `DataSet<T>` | `registerAggregator(String name, Aggregator<?> aggregator)` Registers an Aggregator for the iteration. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `rightOuterJoin(DataSet<O> other)` An outer join on the right side. |
| `<O> UnfinishedOuterJoinOperation<T,O>` | `rightOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)` An outer join on the right side. |
| `DataSet<T>` | `setParallelism(int parallelism)` Sets the parallelism of this operation. |
| `<K> DataSet<T>` | `sortPartition(scala.Function1<T,K> fun, Order order, TypeInformation<K> evidence$40)` Locally sorts the partitions of the DataSet on the extracted key in the specified order. |
| `DataSet<T>` | `sortPartition(int field, Order order)` Locally sorts the partitions of the DataSet on the specified field in the specified order. |
| `DataSet<T>` | `sortPartition(String field, Order order)` Locally sorts the partitions of the DataSet on the specified field in the specified order. |
| `AggregateDataSet<T>` | `sum(int field)` Syntactic sugar for aggregate with SUM. |
| `AggregateDataSet<T>` | `sum(String field)` Syntactic sugar for aggregate with SUM. |
| `DataSet<T>` | `union(DataSet<T> other)` Creates a new DataSet containing the elements from both this DataSet and the other DataSet. |
| `DataSet<T>` | `withBroadcastSet(DataSet<?> data, String name)` Adds a certain data set as a broadcast set to this operator. |
| `DataSet<T>` | `withForwardedFields(scala.collection.Seq<String> forwardedFields)` |
| `DataSet<T>` | `withForwardedFieldsFirst(scala.collection.Seq<String> forwardedFields)` |
| `DataSet<T>` | `withForwardedFieldsSecond(scala.collection.Seq<String> forwardedFields)` |
| `DataSet<T>` | `withParameters(Configuration parameters)` |
| `DataSink<T>` | `write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode)` Writes this DataSet to the specified location using a custom FileOutputFormat. |
| `DataSink<T>` | `writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode)` Writes this DataSet to the specified location as CSV file(s). |
| `DataSink<T>` | `writeAsText(String filePath, FileSystem.WriteMode writeMode)` Writes this DataSet to the specified location. |
public TypeInformation<T> getType()
public ExecutionEnvironment getExecutionEnvironment()
public <F> F clean(F f, boolean checkSerializable)
Parameters:
f - the closure to clean
checkSerializable - whether or not to immediately check f for serializability

Throws:
InvalidProgramException - if checkSerializable is set but f is not serializable

public DataSet<T> name(String name)
public DataSet<T> setParallelism(int parallelism)
public int getParallelism()
public DataSet<T> registerAggregator(String name, Aggregator<?> aggregator)
Registers an Aggregator for the iteration. Aggregators can be used to maintain simple statistics during the iteration, such as the number of elements processed. The aggregators compute global aggregates: after each iteration step, the values are globally aggregated to produce one aggregate that represents statistics across all parallel instances. The value of an aggregator can be accessed in the next iteration.

Aggregators can be accessed inside a function via AbstractRichFunction.getIterationRuntimeContext().

Parameters:
name - The name under which the aggregator is registered.
aggregator - The aggregator class.

public DataSet<T> withBroadcastSet(DataSet<?> data, String name)

Adds a certain data set as a broadcast set to this operator, so that it can be retrieved via
org.apache.flink.api.common.functions.RuntimeContext.getBroadcastVariable(String).
The runtime context itself is available in all UDFs via
org.apache.flink.api.common.functions.AbstractRichFunction#getRuntimeContext().

Parameters:
data - The data set to be broadcast.
name - The name under which the broadcast data set is retrieved.
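As an illustration, a sketch of registering and reading a broadcast set (the set name "factors" and the surrounding logic are hypothetical):

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import scala.collection.JavaConverters._

val env = ExecutionEnvironment.getExecutionEnvironment
val factors: DataSet[Int] = env.fromElements(2, 3, 5)
val data: DataSet[Int] = env.fromElements(1, 2, 3, 4)

val scaled = data.map(new RichMapFunction[Int, Int] {
  private var fs: Seq[Int] = _

  override def open(parameters: org.apache.flink.configuration.Configuration): Unit = {
    // The broadcast set is retrieved under the name it was registered with.
    fs = getRuntimeContext.getBroadcastVariable[Int]("factors").asScala.toSeq
  }

  override def map(value: Int): Int = value * fs.sum
}).withBroadcastSet(factors, "factors")
```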
public DataSet<T> withForwardedFields(scala.collection.Seq<String> forwardedFields)
public DataSet<T> withForwardedFieldsFirst(scala.collection.Seq<String> forwardedFields)
public DataSet<T> withForwardedFieldsSecond(scala.collection.Seq<String> forwardedFields)
public DataSet<T> withParameters(Configuration parameters)
public <R> DataSet<R> map(MapFunction<T,R> mapper, TypeInformation<R> evidence$2, scala.reflect.ClassTag<R> evidence$3)
public <R> DataSet<R> map(scala.Function1<T,R> fun, TypeInformation<R> evidence$4, scala.reflect.ClassTag<R> evidence$5)
public <R> DataSet<R> mapPartition(MapPartitionFunction<T,R> partitionMapper, TypeInformation<R> evidence$6, scala.reflect.ClassTag<R> evidence$7)
This function is intended for operations that cannot transform individual elements and that require no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
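For example, a minimal sketch that counts the elements in each parallel partition (names are illustrative; Collector is org.apache.flink.util.Collector):

```scala
val input: DataSet[String] = ...
// Emits one count per parallel partition of the DataSet.
val partitionSizes: DataSet[Long] = input.mapPartition {
  (elements: Iterator[String], out: Collector[Long]) =>
    out.collect(elements.size.toLong)
}
```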
public <R> DataSet<R> mapPartition(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$8, scala.reflect.ClassTag<R> evidence$9)
This function is intended for operations that cannot transform individual elements and that require no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
public <R> DataSet<R> mapPartition(scala.Function1<scala.collection.Iterator<T>,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$10, scala.reflect.ClassTag<R> evidence$11)
This function is intended for operations that cannot transform individual elements and that require no grouping of elements. To transform individual elements, the use of map and flatMap is preferable.
public <R> DataSet<R> flatMap(FlatMapFunction<T,R> flatMapper, TypeInformation<R> evidence$12, scala.reflect.ClassTag<R> evidence$13)
public <R> DataSet<R> flatMap(scala.Function2<T,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$14, scala.reflect.ClassTag<R> evidence$15)
public <R> DataSet<R> flatMap(scala.Function1<T,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$16, scala.reflect.ClassTag<R> evidence$17)
public DataSet<T> filter(FilterFunction<T> filter)
public DataSet<T> filter(scala.Function1<T,Object> fun)
public AggregateDataSet<T> aggregate(Aggregations agg, int field)
Creates a new DataSet by aggregating the specified tuple field using the given aggregation function. Since this is not a keyed DataSet, the aggregation will be performed on the whole collection of elements.

This only works on Tuple DataSets.
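For example, a short sketch (the field index is illustrative):

```scala
import org.apache.flink.api.java.aggregation.Aggregations

val input: DataSet[(String, Int)] = ...
// Sums field 1 over the whole (non-grouped) DataSet.
val total = input.aggregate(Aggregations.SUM, 1)
```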
public AggregateDataSet<T> aggregate(Aggregations agg, String field)
Creates a new DataSet by aggregating the specified field using the given aggregation function. Since this is not a keyed DataSet, the aggregation will be performed on the whole collection of elements.

This only works on CaseClass DataSets.
public AggregateDataSet<T> sum(int field)
Syntactic sugar for aggregate with SUM.

public AggregateDataSet<T> max(int field)
Syntactic sugar for aggregate with MAX.

public AggregateDataSet<T> min(int field)
Syntactic sugar for aggregate with MIN.

public AggregateDataSet<T> sum(String field)
Syntactic sugar for aggregate with SUM.

public AggregateDataSet<T> max(String field)
Syntactic sugar for aggregate with MAX.

public AggregateDataSet<T> min(String field)
Syntactic sugar for aggregate with MIN.
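For instance, the aggregate call shown above can be written more concisely (illustrative):

```scala
val input: DataSet[(String, Int)] = ...
val total = input.sum(1)   // same as input.aggregate(Aggregations.SUM, 1)
```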
public long count()
Convenience method to get the count (number of elements) of a DataSet.
See Also: Utils.CountHelper

public scala.collection.Seq<T> collect()
Convenience method to get the elements of a DataSet as a List. As a DataSet can contain a lot of data, this method should be used with caution.
See Also: Utils.CollectHelper
public DataSet<T> reduce(ReduceFunction<T> reducer)
Creates a new DataSet by merging the elements of this DataSet using an associative reduce function.

public DataSet<T> reduce(scala.Function2<T,T,T> fun)

Creates a new DataSet by merging the elements of this DataSet using an associative reduce function.
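For example (illustrative):

```scala
val input: DataSet[Int] = ...
// Merges all elements pairwise with an associative function.
val sum: DataSet[Int] = input.reduce { (a, b) => a + b }
```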
public <R> DataSet<R> reduceGroup(GroupReduceFunction<T,R> reducer, TypeInformation<R> evidence$18, scala.reflect.ClassTag<R> evidence$19)

Creates a new DataSet by passing all elements in this DataSet to the group reduce function.

public <R> DataSet<R> reduceGroup(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$20, scala.reflect.ClassTag<R> evidence$21)

Creates a new DataSet by passing all elements in this DataSet to the group reduce function.
public <R> DataSet<R> reduceGroup(scala.Function1<scala.collection.Iterator<T>,R> fun, TypeInformation<R> evidence$22, scala.reflect.ClassTag<R> evidence$23)
Creates a new DataSet by passing all elements in this DataSet to the group reduce function.

public <R> DataSet<R> combineGroup(GroupCombineFunction<T,R> combiner, TypeInformation<R> evidence$24, scala.reflect.ClassTag<R> evidence$25)
Applies a GroupCombineFunction on a grouped DataSet. A GroupCombineFunction is similar to a GroupReduceFunction but does not perform a full data exchange. Instead, the GroupCombineFunction calls the combine method once per partition for combining a group of results. This operator is suitable for combining values into an intermediate format before doing a proper groupReduce where the data is shuffled across the nodes for further reduction. The GroupReduce operator can also be supplied with a combiner by implementing the RichGroupReduce function. The combine method of the RichGroupReduce function demands input and output type to be the same. The GroupCombineFunction, on the other hand, can have an arbitrary output type.

public <R> DataSet<R> combineGroup(scala.Function2<scala.collection.Iterator<T>,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$26, scala.reflect.ClassTag<R> evidence$27)
Applies a GroupCombineFunction on a grouped DataSet. A GroupCombineFunction is similar to a GroupReduceFunction but does not perform a full data exchange. Instead, the GroupCombineFunction calls the combine method once per partition for combining a group of results. This operator is suitable for combining values into an intermediate format before doing a proper groupReduce where the data is shuffled across the nodes for further reduction. The GroupReduce operator can also be supplied with a combiner by implementing the RichGroupReduce function. The combine method of the RichGroupReduce function demands input and output type to be the same. The GroupCombineFunction, on the other hand, can have an arbitrary output type.
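A minimal sketch of the lambda variant (illustrative):

```scala
val input: DataSet[Int] = ...
// Combines values per partition into partial sums before further reduction.
val partialSums: DataSet[Int] = input.combineGroup {
  (values: Iterator[Int], out: Collector[Int]) =>
    out.collect(values.sum)
}
```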
public DataSet<T> minBy(scala.collection.Seq<Object> fields)

Selects an element with minimum value. The minimum is computed over the specified fields in lexicographical order.

Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
minBy(0): [0, 1]
minBy(1): [1, 0]

Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
minBy(0, 1): [0, 0]

If multiple values with minimum value at the specified fields exist, a random one will be picked.

Internally, this operation is implemented as a ReduceFunction.
public DataSet<T> maxBy(scala.collection.Seq<Object> fields)
Selects an element with maximum value. The maximum is computed over the specified fields in lexicographical order.

Example 1: Given a data set with elements [0, 1], [1, 0], the results will be:
maxBy(0): [1, 0]
maxBy(1): [0, 1]

Example 2: Given a data set with elements [0, 0], [0, 1], the results will be:
maxBy(0, 1): [0, 1]

If multiple values with maximum value at the specified fields exist, a random one will be picked.

Internally, this operation is implemented as a ReduceFunction.
public DataSet<T> first(int n)

Creates a new DataSet containing the first n elements of this DataSet.

public <K> DataSet<T> distinct(scala.Function1<T,K> fun, TypeInformation<K> evidence$28)

Creates a new DataSet containing the distinct elements of this DataSet.

Parameters:
fun - The function which extracts the key values from the DataSet on which the distinction of the DataSet is decided.

public DataSet<T> distinct()

Returns a distinct set of this DataSet.
If the input is a composite type (Tuple or Pojo type), distinct is performed on all fields and each field must be a key type.
public DataSet<T> distinct(scala.collection.Seq<Object> fields)
Returns a distinct set of a tuple DataSet using field position keys. The field position keys specify the fields of Tuples on which the decision is made if two Tuples are distinct or not.

Note: Field position keys can only be specified for Tuple DataSets.

Parameters:
fields - One or more field positions on which the distinction of the DataSet is decided.

public DataSet<T> distinct(String firstField, scala.collection.Seq<String> otherFields)

Returns a distinct set of this DataSet using expression keys. The field expression keys specify the fields of a CompositeType (e.g., Tuple or Pojo type) on which the decision is made if two elements are distinct or not. In case of an AtomicType, only the wildcard expression ("_") is valid.

Parameters:
firstField - First field position on which the distinction of the DataSet is decided.
otherFields - Zero or more field positions on which the distinction of the DataSet is decided.

public <K> GroupedDataSet<T> groupBy(scala.Function1<T,K> fun, TypeInformation<K> evidence$29)
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the value returned by the given function.

This will not create a new DataSet; it will just attach the key function, which will be used for grouping when executing a grouped operation.
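For example (illustrative):

```scala
val words: DataSet[(String, Int)] = ...
// Groups by the first tuple field via a key selector, then sums the counts.
val counts = words.groupBy { _._1 }.reduce { (a, b) => (a._1, a._2 + b._2) }
```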
public GroupedDataSet<T> groupBy(scala.collection.Seq<Object> fields)
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the given tuple fields.

This will not create a new DataSet; it will just attach the tuple field positions, which will be used for grouping when executing a grouped operation.

This only works on Tuple DataSets.
public GroupedDataSet<T> groupBy(String firstField, scala.collection.Seq<String> otherFields)
Creates a GroupedDataSet which provides operations on groups of elements. Elements are grouped based on the given fields.

This will not create a new DataSet; it will just attach the field names, which will be used for grouping when executing a grouped operation.
public <O> UnfinishedJoinOperation<T,O> join(DataSet<O> other)
Creates a new DataSet by joining this DataSet with the other DataSet. To specify the join keys, the where and equalTo methods must be used. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val joined = left.join(right).where(0).equalTo(1)
```

The default join result is a DataSet with 2-Tuples of the joined values. In the above example that would be ((String, Int, Int), (Int, String, Int)). A custom join function can be used if more control over the result is required. This can either be given as a lambda or a custom JoinFunction. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val joined = left.join(right).where(0).equalTo(1) { (l, r) =>
  (l._1, r._2)
}
```

A join function with a Collector can be used to implement a filter directly in the join or to output more than one value. This type of join function does not return a value; instead, values are emitted using the collector:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val joined = left.join(right).where(0).equalTo(1) {
  (l, r, out: Collector[(String, Int)]) =>
    if (l._2 > 4) {
      out.collect((l._1, r._3))
      out.collect((l._1, r._1))
    }
}
```
public <O> UnfinishedJoinOperation<T,O> join(DataSet<O> other, JoinOperatorBase.JoinHint strategy)
Special join operation for explicitly telling the system what join strategy to use. If null is given as the join strategy, then the optimizer will pick the strategy.

public <O> UnfinishedJoinOperation<T,O> joinWithTiny(DataSet<O> other)

Special join operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the join.

public <O> UnfinishedJoinOperation<T,O> joinWithHuge(DataSet<O> other)

Special join operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the join.

public <O> UnfinishedOuterJoinOperation<T,O> fullOuterJoin(DataSet<O> other)
Creates a new DataSet by performing a full outer join of this DataSet with the other DataSet, by combining two elements of two DataSets on key equality. Elements of both DataSets that do not have a matching element on the opposing side are joined with null and emitted to the resulting DataSet.

To specify the join keys, the where and equalTo methods must be used. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val joined = left.fullOuterJoin(right).where(0).equalTo(1)
```

When using an outer join you are required to specify a join function. For example:

```scala
val joined = left.fullOuterJoin(right).where(0).equalTo(1) {
  (left, right) =>
    val a = if (left == null) null else left._1
    val b = if (right == null) null else right._3
    (a, b)
}
```
public <O> UnfinishedOuterJoinOperation<T,O> fullOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)
Special fullOuterJoin operation for explicitly telling the system what join strategy to use. If null is given as the join strategy, then the optimizer will pick the strategy.

public <O> UnfinishedOuterJoinOperation<T,O> leftOuterJoin(DataSet<O> other)

An outer join on the left side. Elements of the left side (i.e. this) that do not have a matching element on the other side are joined with null and emitted to the resulting DataSet.

Parameters:
other - The other DataSet with which this DataSet is joined.
See Also: fullOuterJoin(org.apache.flink.api.scala.DataSet<O>)

public <O> UnfinishedOuterJoinOperation<T,O> leftOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)

An outer join on the left side. Elements of the left side (i.e. this) that do not have a matching element on the other side are joined with null and emitted to the resulting DataSet.

Parameters:
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given, then the optimizer will pick the join strategy.
See Also: fullOuterJoin(org.apache.flink.api.scala.DataSet<O>)
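A sketch of a left outer join with a join function (illustrative, reusing the left/right DataSets from the examples above):

```scala
val joined = left.leftOuterJoin(right).where(0).equalTo(1) {
  (l, r) =>
    // r may be null when no matching right element exists
    val rv = if (r == null) -1 else r._3
    (l._1, rv)
}
```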
public <O> UnfinishedOuterJoinOperation<T,O> rightOuterJoin(DataSet<O> other)
An outer join on the right side. Elements of the right side (i.e. other) that do not have a matching element on this side are joined with null and emitted to the resulting DataSet.

Parameters:
other - The other DataSet with which this DataSet is joined.
See Also: fullOuterJoin(org.apache.flink.api.scala.DataSet<O>)
public <O> UnfinishedOuterJoinOperation<T,O> rightOuterJoin(DataSet<O> other, JoinOperatorBase.JoinHint strategy)
An outer join on the right side. Elements of the right side (i.e. other) that do not have a matching element on this side are joined with null and emitted to the resulting DataSet.

Parameters:
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given, then the optimizer will pick the join strategy.
See Also: fullOuterJoin(org.apache.flink.api.scala.DataSet<O>)
public <O> UnfinishedCoGroupOperation<T,O> coGroup(DataSet<O> other, scala.reflect.ClassTag<O> evidence$30)
For each key in this DataSet and the other DataSet, creates a tuple containing a list of elements for that key from both DataSets. To specify the join keys, the where and isEqualTo methods must be used. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val coGrouped = left.coGroup(right).where(0).isEqualTo(1)
```

A custom coGroup function can be used if more control over the result is required. This can either be given as a lambda or a custom CoGroupFunction. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val coGrouped = left.coGroup(right).where(0).isEqualTo(1) { (l, r) =>
  // l and r are of type Iterator
  (l.min, r.max)
}
```

A coGroup function with a Collector can be used to implement a filter directly in the coGroup or to output more than one value. This type of coGroup function does not return a value; instead, values are emitted using the collector:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val coGrouped = left.coGroup(right).where(0).isEqualTo(1) {
  (l, r, out: Collector[(String, Int)]) =>
    out.collect((l.min, r.max))
    out.collect((l.max, r.min))
}
```
public <O> CrossDataSet<T,O> cross(DataSet<O> other)
Creates a new DataSet by forming the Cartesian product of this DataSet and the other DataSet.

The default cross result is a DataSet with 2-Tuples of the combined values. A custom cross function can be used if more control over the result is required. This can either be given as a lambda or a custom CrossFunction. For example:

```scala
val left: DataSet[(String, Int, Int)] = ...
val right: DataSet[(Int, String, Int)] = ...
val product = left.cross(right) { (l, r) => (l._2, r._3) }
```
public <O> CrossDataSet<T,O> crossWithTiny(DataSet<O> other)
Special cross operation for explicitly telling the system that the right side is assumed to be a lot smaller than the left side of the Cartesian product.

public <O> CrossDataSet<T,O> crossWithHuge(DataSet<O> other)

Special cross operation for explicitly telling the system that the left side is assumed to be a lot smaller than the right side of the Cartesian product.

public DataSet<T> iterate(int maxIterations, scala.Function1<DataSet<T>,DataSet<T>> stepFunction)

Creates a new DataSet by performing bulk iterations using the given step function. The iterations terminate once maxIterations iterations have been performed.

For example:

```scala
val input: DataSet[(String, Int)] = ...
val iterated = input.iterate(5) { previous =>
  val next = previous.map { x => (x._1, x._2 + 1) }
  next
}
```

This example will simply increase the second field of the tuple by 5.
public DataSet<T> iterateWithTermination(int maxIterations, scala.Function1<DataSet<T>,scala.Tuple2<DataSet<T>,DataSet<?>>> stepFunction)
Creates a new DataSet by performing bulk iterations using the given step function. The iterations terminate when the termination criterion DataSet is empty, or once maxIterations iterations have been performed.

For example:

```scala
val input: DataSet[(String, Int)] = ...
val iterated = input.iterateWithTermination(5) { previous =>
  val next = previous.map { x => (x._1, x._2 + 1) }
  val term = next.filter { _._2 < 3 }
  (next, term)
}
```

This example will simply increase the second field of the tuples until they are no longer smaller than 3.
public <R> DataSet<T> iterateDelta(DataSet<R> workset, int maxIterations, int[] keyFields, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$31)
Creates a new DataSet by performing delta (or workset) iterations using the given step function. This DataSet is the solution set and workset is the workset.

The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.

Note: The syntax of delta iterations is very likely going to change soon.
public <R> DataSet<T> iterateDelta(DataSet<R> workset, int maxIterations, int[] keyFields, boolean solutionSetUnManaged, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$32)
Creates a new DataSet by performing delta (or workset) iterations using the given step function. This DataSet is the solution set and workset is the workset.

The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.

Note: The syntax of delta iterations is very likely going to change soon.
public <R> DataSet<T> iterateDelta(DataSet<R> workset, int maxIterations, String[] keyFields, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$33)
Creates a new DataSet by performing delta (or workset) iterations using the given step function. This DataSet is the solution set and workset is the workset.

The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.

Note: The syntax of delta iterations is very likely going to change soon.
public <R> DataSet<T> iterateDelta(DataSet<R> workset, int maxIterations, String[] keyFields, boolean solutionSetUnManaged, scala.Function2<DataSet<T>,DataSet<R>,scala.Tuple2<DataSet<T>,DataSet<R>>> stepFunction, scala.reflect.ClassTag<R> evidence$34)
Creates a new DataSet by performing delta (or workset) iterations using the given step function. This DataSet is the solution set and workset is the workset.

The iteration step function gets the current solution set and workset and must output the delta for the solution set and the workset for the next iteration.

Note: The syntax of delta iterations is very likely going to change soon.
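A hedged sketch of the delta-iteration shape (the data, iteration count, and key field are illustrative):

```scala
val initialSolution: DataSet[(Long, Double)] = ...
val initialWorkset: DataSet[(Long, Double)] = ...

// Key field 0 identifies solution-set elements; each step returns
// (solution set delta, workset for the next iteration).
val result = initialSolution.iterateDelta(initialWorkset, 100, Array(0)) {
  (solution, workset) =>
    val delta = workset.map { x => (x._1, x._2 * 0.5) }
    val nextWorkset = delta.filter { _._2 > 0.01 }
    (delta, nextWorkset)
}
```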
public DataSet<T> union(DataSet<T> other)
Creates a new DataSet containing the elements from both this DataSet and the other DataSet.

public DataSet<T> partitionByHash(scala.collection.Seq<Object> fields)

Hash-partitions a DataSet on the specified tuple field positions.

Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.

public DataSet<T> partitionByHash(String firstField, scala.collection.Seq<String> otherFields)

Hash-partitions a DataSet on the specified fields.

Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.

public <K> DataSet<T> partitionByHash(scala.Function1<T,K> fun, TypeInformation<K> evidence$35)

Partitions a DataSet using the specified key selector function.

Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
public DataSet<T> partitionByRange(scala.collection.Seq<Object> fields)

Range-partitions a DataSet on the specified tuple field positions.

Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.

public DataSet<T> partitionByRange(String firstField, scala.collection.Seq<String> otherFields)

Range-partitions a DataSet on the specified fields.

Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.

public <K> DataSet<T> partitionByRange(scala.Function1<T,K> fun, TypeInformation<K> evidence$36)

Range-partitions a DataSet using the specified key selector function.

Important: This operation requires an extra pass over the DataSet to compute the range boundaries and shuffles the whole DataSet over the network. This can take a significant amount of time.
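For example (the field index is illustrative):

```scala
val data: DataSet[(Int, String)] = ...
// Hash-partition on field 0, or range-partition on the same field.
val hashed = data.partitionByHash(0)
val ranged = data.partitionByRange(0)
```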
public <K> DataSet<T> partitionCustom(Partitioner<K> partitioner, int field, TypeInformation<K> evidence$37)
Partitions a tuple DataSet on the specified key fields using a custom partitioner.

Note: This method works only on single field keys.

public <K> DataSet<T> partitionCustom(Partitioner<K> partitioner, String field, TypeInformation<K> evidence$38)

Partitions a POJO DataSet on the specified key fields using a custom partitioner.

Note: This method works only on single field keys.

public <K> DataSet<T> partitionCustom(Partitioner<K> partitioner, scala.Function1<T,K> fun, TypeInformation<K> evidence$39)

Partitions a DataSet on the key returned by the selector, using a custom partitioner.

Note: This method works only on single field keys, i.e. the selector cannot return tuples of fields.
public DataSet<T> rebalance()
Enforces a re-balancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task.

Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
public DataSet<T> sortPartition(int field, Order order)
public DataSet<T> sortPartition(String field, Order order)
public <K> DataSet<T> sortPartition(scala.Function1<T,K> fun, Order order, TypeInformation<K> evidence$40)
Locally sorts the partitions of the DataSet on the extracted key in the specified order.

Note that no additional sort keys can be appended to KeySelector sort keys. To sort the partitions by multiple values using a KeySelector, the KeySelector must return a tuple consisting of those values.
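For example, sorting each partition by a composite key (field indices illustrative):

```scala
import org.apache.flink.api.common.operators.Order

val data: DataSet[(Int, String)] = ...
// The KeySelector returns a tuple, so partitions sort by field 1, then field 0.
val sorted = data.sortPartition(x => (x._2, x._1), Order.ASCENDING)
```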
public DataSink<T> writeAsText(String filePath, FileSystem.WriteMode writeMode)
Writes this DataSet to the specified location. This uses AnyRef.toString on each element.

See Also: DataSet.writeAsText(String)
public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode)
Writes this DataSet to the specified location as CSV file(s). This only works on Tuple DataSets. For individual tuple fields AnyRef.toString is used.

See Also: DataSet.writeAsText(String)
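For example (the output path and write mode are illustrative):

```scala
import org.apache.flink.core.fs.FileSystem.WriteMode

val data: DataSet[(String, Int)] = ...
data.writeAsCsv("hdfs:///path/to/output", "\n", ",", WriteMode.OVERWRITE)
```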
public DataSink<T> write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode)
Writes this DataSet to the specified location using a custom FileOutputFormat.

public DataSink<T> output(OutputFormat<T> outputFormat)

Emits this DataSet using a custom OutputFormat.

public void print()
Prints the elements in a DataSet to the standard output stream System.out of the JVM that calls the print() method. For programs that are executed in a cluster, this method needs to gather the contents of the DataSet back to the client, to print it there.

The string written for each element is defined by the AnyRef.toString method.

This method immediately triggers the program execution, similar to the collect() and count() methods.
public void printToErr()
Prints the elements in a DataSet to the standard error stream System.err of the JVM that calls the printToErr() method. For programs that are executed in a cluster, this method needs to gather the contents of the DataSet back to the client, to print it there.

The string written for each element is defined by the AnyRef.toString method.

This method immediately triggers the program execution, similar to the collect() and count() methods.
public DataSink<T> printOnTaskManager(String prefix)
Writes a DataSet to the standard output streams (stdout) of the TaskManagers that execute the program (or more specifically, the data sink operators). To print the data to the console or stdout stream of the client process instead, use the print() method.

For each element of the DataSet the result of AnyRef.toString() is written.

Parameters:
prefix - The string to prefix each line of the output with. This helps identify outputs from different printing sinks.

public DataSink<T> print(String sinkIdentifier)
Deprecated. Use printOnTaskManager(String) instead.

Prints the elements of the DataSet to the standard output stream, using AnyRef.toString on each element.

Parameters:
sinkIdentifier - The string to prefix the output with.

public DataSink<T> printToErr(String sinkIdentifier)

Deprecated. Use printOnTaskManager(String) instead.

Prints the elements of the DataSet to the standard error stream, using AnyRef.toString on each element.

Parameters:
sinkIdentifier - The string to prefix the output with.

Copyright © 2014–2017 The Apache Software Foundation. All rights reserved.