public class DataStream<T> extends Object
| Constructor and Description |
|---|
| `DataStream(DataStream<T> stream)` |
| Modifier and Type | Method and Description |
|---|---|
| `DataStreamSink<T>` | `addSink(scala.Function1<T,scala.runtime.BoxedUnit> fun)`<br>Adds the given sink to this DataStream. |
| `DataStreamSink<T>` | `addSink(SinkFunction<T> sinkFunction)`<br>Adds the given sink to this DataStream. |
| `DataStream<T>` | `assignAscendingTimestamps(scala.Function1<T,Object> extractor)`<br>Assigns timestamps to the elements in the data stream and periodically creates watermarks to signal event time progress. |
| `DataStream<T>` | `assignTimestamps(TimestampExtractor<T> extractor)`<br>Extracts a timestamp from an element and assigns it as the internal timestamp of that element. |
| `DataStream<T>` | `assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks<T> assigner)`<br>Assigns timestamps to the elements in the data stream and periodically creates watermarks to signal event time progress. |
| `DataStream<T>` | `assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks<T> assigner)`<br>Assigns timestamps to the elements in the data stream and periodically creates watermarks to signal event time progress. |
| `DataStream<T>` | `broadcast()`<br>Sets the partitioning of the DataStream so that the output tuples are broadcast to every parallel instance of the next component. |
| `<F> F` | `clean(F f)`<br>Returns a "closure-cleaned" version of the given function. |
| `<T2> CoGroupedStreams<T,T2>` | `coGroup(DataStream<T2> otherStream)`<br>Creates a co-group operation. |
| `<T2> ConnectedStreams<T,T2>` | `connect(DataStream<T2> dataStream)`<br>Creates a new ConnectedStreams by connecting DataStream outputs of different type with each other. |
| `AllWindowedStream<T,GlobalWindow>` | `countWindowAll(long size)`<br>Windows this DataStream into tumbling count windows. |
| `AllWindowedStream<T,GlobalWindow>` | `countWindowAll(long size, long slide)`<br>Windows this DataStream into sliding count windows. |
| `TypeInformation<T>` | `dataType()`<br>Returns the TypeInformation for the elements of this DataStream. |
| `DataStream<T>` | `disableChaining()`<br>Turns off chaining for this operator so thread co-location will not be used as an optimization. |
| `ExecutionConfig` | `executionConfig()`<br>Returns the execution config. |
| `StreamExecutionEnvironment` | `executionEnvironment()`<br>Returns the StreamExecutionEnvironment associated with this data stream. |
| `DataStream<T>` | `filter(FilterFunction<T> filter)`<br>Creates a new DataStream that contains only the elements satisfying the given filter predicate. |
| `DataStream<T>` | `filter(scala.Function1<T,Object> fun)`<br>Creates a new DataStream that contains only the elements satisfying the given filter predicate. |
| `<R> DataStream<R>` | `flatMap(FlatMapFunction<T,R> flatMapper, TypeInformation<R> evidence$8)`<br>Creates a new DataStream by applying the given function to every element and flattening the results. |
| `<R> DataStream<R>` | `flatMap(scala.Function1<T,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$10)`<br>Creates a new DataStream by applying the given function to every element and flattening the results. |
| `<R> DataStream<R>` | `flatMap(scala.Function2<T,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$9)`<br>Creates a new DataStream by applying the given function to every element and flattening the results. |
| `DataStream<T>` | `forward()`<br>Sets the partitioning of the DataStream so that the output tuples are forwarded to the local subtask of the next component (whenever possible). |
| `ExecutionConfig` | `getExecutionConfig()`<br>Deprecated. Use executionConfig instead. |
| `StreamExecutionEnvironment` | `getExecutionEnvironment()`<br>Deprecated. Use executionEnvironment instead. |
| `int` | `getId()`<br>Returns the ID of the DataStream. |
| `String` | `getName()`<br>Deprecated. Use name instead. |
| `int` | `getParallelism()`<br>Deprecated. Use parallelism instead. |
| `TypeInformation<T>` | `getType()`<br>Deprecated. Use dataType instead. |
| `DataStream<T>` | `global()`<br>Sets the partitioning of the DataStream so that the output values all go to the first instance of the next processing operator. |
| `<R,F> DataStream<R>` | `iterate(scala.Function1<ConnectedStreams<T,F>,scala.Tuple2<DataStream<F>,DataStream<R>>> stepFunction, long maxWaitTimeMillis, TypeInformation<F> evidence$5)`<br>Initiates an iterative part of the program that creates a loop by feeding back data streams. |
| `<R> DataStream<R>` | `iterate(scala.Function1<DataStream<T>,scala.Tuple2<DataStream<T>,DataStream<R>>> stepFunction, long maxWaitTimeMillis, boolean keepPartitioning)`<br>Initiates an iterative part of the program that creates a loop by feeding back data streams. |
| `DataStream<T>` | `javaStream()`<br>Gets the underlying Java DataStream object. |
| `<T2> JoinedStreams<T,T2>` | `join(DataStream<T2> otherStream)`<br>Creates a join operation. |
| `<K> KeyedStream<T,K>` | `keyBy(scala.Function1<T,K> fun, TypeInformation<K> evidence$1)`<br>Groups the elements of a DataStream by the given K key to be used with grouped operators like grouped reduce or grouped aggregations. |
| `KeyedStream<T,Tuple>` | `keyBy(scala.collection.Seq<Object> fields)`<br>Groups the elements of a DataStream by the given key positions (for tuple/array types) to be used with grouped operators like grouped reduce or grouped aggregations. |
| `KeyedStream<T,Tuple>` | `keyBy(String firstField, scala.collection.Seq<String> otherFields)`<br>Groups the elements of a DataStream by the given field expressions to be used with grouped operators like grouped reduce or grouped aggregations. |
| `<R> DataStream<R>` | `map(scala.Function1<T,R> fun, TypeInformation<R> evidence$6)`<br>Creates a new DataStream by applying the given function to every element of this DataStream. |
| `<R> DataStream<R>` | `map(MapFunction<T,R> mapper, TypeInformation<R> evidence$7)`<br>Creates a new DataStream by applying the given function to every element of this DataStream. |
| `String` | `name()`<br>Gets the name of the current data stream. |
| `DataStream<T>` | `name(String name)`<br>Sets the name of the current data stream. |
| `int` | `parallelism()`<br>Returns the parallelism of this operation. |
| `<K> DataStream<T>` | `partitionCustom(Partitioner<K> partitioner, scala.Function1<T,K> fun, TypeInformation<K> evidence$4)`<br>Partitions a DataStream on the key returned by the selector, using a custom partitioner. |
| `<K> DataStream<T>` | `partitionCustom(Partitioner<K> partitioner, int field, TypeInformation<K> evidence$2)`<br>Partitions a tuple DataStream on the specified key fields using a custom partitioner. |
| `<K> DataStream<T>` | `partitionCustom(Partitioner<K> partitioner, String field, TypeInformation<K> evidence$3)`<br>Partitions a POJO DataStream on the specified key fields using a custom partitioner. |
| `DataStreamSink<T>` | `print()`<br>Writes a DataStream to the standard output stream (stdout). |
| `DataStreamSink<T>` | `printToErr()`<br>Writes a DataStream to the standard error stream (stderr). |
| `DataStream<T>` | `rebalance()`<br>Sets the partitioning of the DataStream so that the output tuples are distributed evenly to the next component. |
| `DataStream<T>` | `rescale()`<br>Sets the partitioning of the DataStream so that the output tuples are distributed evenly to a subset of instances of the downstream operation. |
| `DataStream<T>` | `setBufferTimeout(long timeoutMillis)`<br>Sets the maximum time frequency (ms) for the flushing of the output buffer. |
| `DataStream<T>` | `setParallelism(int parallelism)`<br>Sets the parallelism of this operation. |
| `DataStream<T>` | `shuffle()`<br>Sets the partitioning of the DataStream so that the output tuples are shuffled to the next component. |
| `DataStream<T>` | `slotSharingGroup(String slotSharingGroup)`<br>Sets the slot sharing group of this operation. |
| `SplitStream<T>` | `split(scala.Function1<T,scala.collection.TraversableOnce<String>> fun)`<br>Creates a new SplitStream that contains only the elements satisfying the given output selector predicate. |
| `SplitStream<T>` | `split(OutputSelector<T> selector)`<br>Operator used for directing tuples to specific named outputs using an OutputSelector. |
| `DataStream<T>` | `startNewChain()`<br>Starts a new task chain beginning at this operator. |
| `AllWindowedStream<T,TimeWindow>` | `timeWindowAll(Time size)`<br>Windows this DataStream into tumbling time windows. |
| `AllWindowedStream<T,TimeWindow>` | `timeWindowAll(Time size, Time slide)`<br>Windows this DataStream into sliding time windows. |
| `<R> DataStream<R>` | `transform(String operatorName, OneInputStreamOperator<T,R> operator, TypeInformation<R> evidence$11)`<br>Transforms the DataStream by using a custom OneInputStreamOperator. |
| `DataStream<T>` | `uid(String uid)`<br>Sets an ID for this operator. |
| `DataStream<T>` | `union(scala.collection.Seq<DataStream<T>> dataStreams)`<br>Creates a new DataStream by merging DataStream outputs of the same type with each other. |
| `<W extends Window> AllWindowedStream<T,W>` | `windowAll(WindowAssigner<? super T,W> assigner)`<br>Windows this data stream to an AllWindowedStream, which evaluates windows over a key grouped stream. |
| `DataStreamSink<T>` | `writeAsCsv(String path)`<br>Writes the DataStream in CSV format to the file specified by the path parameter. |
| `DataStreamSink<T>` | `writeAsCsv(String path, FileSystem.WriteMode writeMode)`<br>Writes the DataStream in CSV format to the file specified by the path parameter. |
| `DataStreamSink<T>` | `writeAsCsv(String path, FileSystem.WriteMode writeMode, String rowDelimiter, String fieldDelimiter)`<br>Writes the DataStream in CSV format to the file specified by the path parameter. |
| `DataStreamSink<T>` | `writeAsText(String path)`<br>Writes a DataStream to the file specified by path in text format. |
| `DataStreamSink<T>` | `writeAsText(String path, FileSystem.WriteMode writeMode)`<br>Writes a DataStream to the file specified by path in text format. |
| `DataStreamSink<T>` | `writeToSocket(String hostname, Integer port, SerializationSchema<T> schema)`<br>Writes the DataStream to a socket as a byte array. |
| `DataStreamSink<T>` | `writeUsingOutputFormat(OutputFormat<T> format)`<br>Writes a DataStream using the given OutputFormat. |
public DataStream(DataStream<T> stream)
public StreamExecutionEnvironment getExecutionEnvironment()
Deprecated. Use executionEnvironment instead.
Returns the StreamExecutionEnvironment associated with the current DataStream.

public TypeInformation<T> getType()
Deprecated. Use dataType instead.

public int getParallelism()
Deprecated. Use parallelism instead.

public ExecutionConfig getExecutionConfig()
Deprecated. Use executionConfig instead.

public int getId()
Returns the ID of the DataStream.
public DataStream<T> javaStream()
Gets the underlying Java DataStream object.

public TypeInformation<T> dataType()
Returns the TypeInformation for the elements of this DataStream.

public ExecutionConfig executionConfig()
Returns the execution config.

public StreamExecutionEnvironment executionEnvironment()
Returns the StreamExecutionEnvironment associated with this data stream.

public int parallelism()
Returns the parallelism of this operation.

public DataStream<T> setParallelism(int parallelism)
Sets the parallelism of this operation.

public String name()
Gets the name of the current data stream.

public String getName()
Deprecated. Use name instead.

public DataStream<T> name(String name)
Sets the name of the current data stream.
public DataStream<T> uid(String uid)
Sets an ID for this operator. The specified ID is used to assign the same operator ID across job submissions (for example when starting a job from a savepoint).
Important: this ID needs to be unique per transformation and job. Otherwise, job submission will fail.
Parameters:
uid - The unique user-specified ID of this transformation.

public DataStream<T> disableChaining()
Turns off chaining for this operator so thread co-location will not be used as an optimization. Chaining can be turned off for the whole job via StreamExecutionEnvironment.disableOperatorChaining(); however, this is not advised for performance considerations.
public DataStream<T> startNewChain()
Starts a new task chain beginning at this operator.

public DataStream<T> slotSharingGroup(String slotSharingGroup)
Sets the slot sharing group of this operation. Operations inherit the slot sharing group of input operations if all input operations are in the same slot sharing group and no slot sharing group was explicitly specified.
Initially an operation is in the default slot sharing group. An operation can be put into the default group explicitly by setting the slot sharing group to "default".
Parameters:
slotSharingGroup - The slot sharing group name.

public DataStream<T> setBufferTimeout(long timeoutMillis)
Sets the maximum time frequency (ms) for the flushing of the output buffer.
Parameters:
timeoutMillis - The maximum time between two output flushes.

public DataStream<T> union(scala.collection.Seq<DataStream<T>> dataStreams)
Creates a new DataStream by merging DataStream outputs of the same type with each other.
public <T2> ConnectedStreams<T,T2> connect(DataStream<T2> dataStream)
Creates a new ConnectedStreams by connecting DataStream outputs of different type with each other.

public KeyedStream<T,Tuple> keyBy(scala.collection.Seq<Object> fields)
Groups the elements of a DataStream by the given key positions (for tuple/array types) to be used with grouped operators like grouped reduce or grouped aggregations.

public KeyedStream<T,Tuple> keyBy(String firstField, scala.collection.Seq<String> otherFields)
Groups the elements of a DataStream by the given field expressions to be used with grouped operators like grouped reduce or grouped aggregations.

public <K> KeyedStream<T,K> keyBy(scala.Function1<T,K> fun, TypeInformation<K> evidence$1)
Groups the elements of a DataStream by the given K key to be used with grouped operators like grouped reduce or grouped aggregations.

public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field, TypeInformation<K> evidence$2)
Partitions a tuple DataStream on the specified key fields using a custom partitioner. Note: This method works only on single field keys.

public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, String field, TypeInformation<K> evidence$3)
Partitions a POJO DataStream on the specified key fields using a custom partitioner. Note: This method works only on single field keys.

public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, scala.Function1<T,K> fun, TypeInformation<K> evidence$4)
Partitions a DataStream on the key returned by the selector, using a custom partitioner. Note: This method works only on single field keys, i.e. the selector cannot return tuples of fields.
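The partitionCustom variants above all take a Partitioner<K> that maps a key to a downstream channel index. As an illustration only (SimplePartitioner and ModuloPartitioner below are hypothetical names, not Flink API), such a partitioner boils down to a pure function from key and parallelism to a partition index:

```scala
// Hypothetical stand-in for a partitioner: given a key and the number of
// parallel downstream channels, pick one channel index.
trait SimplePartitioner[K] {
  def partition(key: K, numPartitions: Int): Int
}

// A modulo-based partitioner; negative keys are wrapped into a valid range.
object ModuloPartitioner extends SimplePartitioner[Int] {
  def partition(key: Int, numPartitions: Int): Int =
    ((key % numPartitions) + numPartitions) % numPartitions
}
```

A custom partitioner must always return an index in [0, numPartitions); the wrap-around above is one way to guarantee that for integer keys.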
public DataStream<T> broadcast()
Sets the partitioning of the DataStream so that the output tuples are broadcast to every parallel instance of the next component.

public DataStream<T> global()
Sets the partitioning of the DataStream so that the output values all go to the first instance of the next processing operator.

public DataStream<T> shuffle()
Sets the partitioning of the DataStream so that the output tuples are shuffled to the next component.

public DataStream<T> forward()
Sets the partitioning of the DataStream so that the output tuples are forwarded to the local subtask of the next component (whenever possible).

public DataStream<T> rebalance()
Sets the partitioning of the DataStream so that the output tuples are distributed evenly to the next component.

public DataStream<T> rescale()
Sets the partitioning of the DataStream so that the output tuples are distributed evenly to a subset of instances of the downstream operation.
The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and the downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation distributes elements to two downstream operations while the other upstream operation distributes to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4, then two upstream operations distribute to one downstream operation while the other two upstream operations distribute to the other downstream operation.
In cases where the different parallelisms are not multiples of each other, one or several downstream operations will have a differing number of inputs from upstream operations.
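The routing described above can be sketched as a small function. This is a simplified model of the contiguous-block distribution, not Flink's actual internals, and rescaleTargets is a hypothetical name; it assumes the parallelisms divide evenly:

```scala
// Which downstream subtasks does upstream subtask `upIndex` feed under rescale()?
// Simplified model: each upstream subtask serves a contiguous block downstream.
def rescaleTargets(upParallelism: Int, downParallelism: Int, upIndex: Int): Seq[Int] =
  if (downParallelism >= upParallelism)
    // each upstream subtask feeds its own contiguous group of downstream subtasks
    (upIndex * downParallelism / upParallelism) until ((upIndex + 1) * downParallelism / upParallelism)
  else
    // several upstream subtasks share a single downstream subtask
    Seq(upIndex * downParallelism / upParallelism)
```

With parallelism 2 upstream and 4 downstream this reproduces the example in the text: subtask 0 feeds subtasks 0 and 1, and subtask 1 feeds subtasks 2 and 3.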
public <R> DataStream<R> iterate(scala.Function1<DataStream<T>,scala.Tuple2<DataStream<T>,DataStream<R>>> stepFunction, long maxWaitTimeMillis, boolean keepPartitioning)
Initiates an iterative part of the program that creates a loop by feeding back data streams. The iteration is defined by a step function that maps the input stream to a feedback stream and an output stream:
stepFunction: initialStream => (feedback, output)
A common pattern is to use output splitting to create the feedback and output DataStream; please refer to the split(...) method of the DataStream.
By default a DataStream with an iteration will never terminate, but the user can use the maxWaitTime parameter to set a maximum waiting time for the iteration head. If no data is received within the set time, the stream terminates.
By default the feedback partitioning is set to match the input; to override this, set the keepPartitioning flag to true.

public <R,F> DataStream<R> iterate(scala.Function1<ConnectedStreams<T,F>,scala.Tuple2<DataStream<F>,DataStream<R>>> stepFunction, long maxWaitTimeMillis, TypeInformation<F> evidence$5)
Initiates an iterative part of the program that creates a loop by feeding back data streams. The input stream of the iterate operator and the feedback stream will be treated as a ConnectedStreams where the input is connected with the feedback stream. This allows the user to distinguish standard input from feedback inputs.
stepFunction: initialStream => (feedback, output)
The user must set the maximum waiting time for the iteration head. If no data is received within the set time, the stream terminates. If this parameter is set to 0, the iteration sources will run indefinitely, so the job must be killed to stop it.
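The stepFunction contract (initialStream => (feedback, output)) can be mimicked on plain Scala collections. This batch-style loop, with the hypothetical helper iterateUntilDone, is only an analogy for the streaming iteration; an empty feedback plays the role of the timeout:

```scala
// Repeatedly apply the step function: feedback is fed into the next round,
// output is collected; the loop ends when no feedback remains.
def iterateUntilDone[T, R](initial: Seq[T])(step: Seq[T] => (Seq[T], Seq[R])): Seq[R] = {
  var pending = initial
  val collected = Seq.newBuilder[R]
  while (pending.nonEmpty) {
    val (feedback, output) = step(pending)
    collected ++= output
    pending = feedback
  }
  collected.result()
}

// Example step: decrement each value; positives are fed back, zeros are emitted.
val emitted = iterateUntilDone(Seq(2, 1)) { nums =>
  val next = nums.map(_ - 1)
  (next.filter(_ > 0), next.filter(_ == 0))
}
```

Here 2 circulates twice and 1 circulates once before each reaches zero and is emitted, so two zeros come out.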
public <R> DataStream<R> map(scala.Function1<T,R> fun, TypeInformation<R> evidence$6)
public <R> DataStream<R> map(MapFunction<T,R> mapper, TypeInformation<R> evidence$7)
public <R> DataStream<R> flatMap(FlatMapFunction<T,R> flatMapper, TypeInformation<R> evidence$8)
public <R> DataStream<R> flatMap(scala.Function2<T,Collector<R>,scala.runtime.BoxedUnit> fun, TypeInformation<R> evidence$9)
public <R> DataStream<R> flatMap(scala.Function1<T,scala.collection.TraversableOnce<R>> fun, TypeInformation<R> evidence$10)
public DataStream<T> filter(FilterFunction<T> filter)
public DataStream<T> filter(scala.Function1<T,Object> fun)
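The per-element semantics of the map/flatMap/filter variants above mirror Scala's collection operations. This collections-only sketch shows what "flattening the results" means; it models the semantics, not Flink's distributed execution:

```scala
// flatMap with fun: T => TraversableOnce[R] concatenates each element's results.
val lines = Seq("to be", "or not")
val tokens = lines.flatMap(line => line.split(" "))

// filter keeps only the elements satisfying the predicate, as in filter(fun).
val longTokens = tokens.filter(_.length > 2)
```

Each input line yields several tokens, which are concatenated into one flat sequence before the filter is applied.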
public AllWindowedStream<T,TimeWindow> timeWindowAll(Time size)
Windows this DataStream into tumbling time windows. This is a shortcut for either .window(TumblingEventTimeWindows.of(size)) or .window(TumblingProcessingTimeWindows.of(size)) depending on the time characteristic set using StreamExecutionEnvironment.setStreamTimeCharacteristic.
Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance. (Only for special cases, such as aligned time windows, is it possible to perform this operation in parallel.)
Parameters:
size - The size of the window.

public AllWindowedStream<T,TimeWindow> timeWindowAll(Time size, Time slide)
Windows this DataStream into sliding time windows. This is a shortcut for either .window(SlidingEventTimeWindows.of(size, slide)) or .window(SlidingProcessingTimeWindows.of(size, slide)) depending on the time characteristic set using StreamExecutionEnvironment.setStreamTimeCharacteristic.
Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance. (Only for special cases, such as aligned time windows, is it possible to perform this operation in parallel.)
Parameters:
size - The size of the window.
slide - The slide interval of the window.

public AllWindowedStream<T,GlobalWindow> countWindowAll(long size, long slide)
Windows this DataStream into sliding count windows.
Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance. (Only for special cases, such as aligned time windows, is it possible to perform this operation in parallel.)
Parameters:
size - The size of the windows in number of elements.
slide - The slide interval in number of elements.

public AllWindowedStream<T,GlobalWindow> countWindowAll(long size)
Windows this DataStream into tumbling count windows.
Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance. (Only for special cases, such as aligned time windows, is it possible to perform this operation in parallel.)
Parameters:
size - The size of the windows in number of elements.

public <W extends Window> AllWindowedStream<T,W> windowAll(WindowAssigner<? super T,W> assigner)
Windows this data stream to an AllWindowedStream, which evaluates windows over a key grouped stream. Elements are put into windows by a WindowAssigner. The grouping of elements is done both by key and by window.
A Trigger can be defined to specify when windows are evaluated. However, each WindowAssigner has a default Trigger that is used if no Trigger is specified.
Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance. (Only for special cases, such as aligned time windows, is it possible to perform this operation in parallel.)
Parameters:
assigner - The WindowAssigner that assigns elements to windows.

public DataStream<T> assignTimestamps(TimestampExtractor<T> extractor)
Extracts a timestamp from an element and assigns it as the internal timestamp of that element.
If you know that the timestamps are strictly increasing you can use an AscendingTimestampExtractor. Otherwise, you should provide a TimestampExtractor that also implements TimestampExtractor#getCurrentWatermark to keep track of watermarks.
See also: Watermark
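The ascending-timestamp case can be sketched with a toy class (AscendingModel is a hypothetical name, not the Flink extractor): it remembers the largest timestamp seen, and since timestamps only grow, one less than that maximum is always a safe watermark:

```scala
// Toy model of an ascending-timestamp extractor: the watermark trails the
// largest extracted timestamp by one, which is safe for ascending input.
class AscendingModel {
  private var maxTs = Long.MinValue
  def extractTimestamp(ts: Long): Long = { if (ts > maxTs) maxTs = ts; ts }
  def currentWatermark: Long =
    if (maxTs == Long.MinValue) Long.MinValue else maxTs - 1
}
```

The watermark promises that no element with a smaller timestamp will arrive, which holds by assumption for strictly increasing timestamps.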
public DataStream<T> assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks<T> assigner)
Assigns timestamps to the elements in the data stream and periodically creates watermarks to signal event time progress.
This method creates watermarks periodically (for example every second), based on the watermarks indicated by the given watermark generator. Even when no new elements in the stream arrive, the given watermark generator will be periodically checked for new watermarks. The interval in which watermarks are generated is defined in ExecutionConfig.setAutoWatermarkInterval(long).
Use this method for the common cases where some characteristic over all elements should generate the watermarks, or where watermarks are simply trailing behind the wall clock time by a certain amount.
For the second case, when the watermarks are required to lag behind the maximum timestamp seen so far in the elements of the stream by a fixed amount of time, and this amount is known in advance, use org.apache.flink.streaming.api.functions.TimestampExtractorWithFixedAllowedLateness.
For cases where watermarks should be created in an irregular fashion, for example based on certain markers that some elements carry, use the AssignerWithPunctuatedWatermarks.
See also: AssignerWithPeriodicWatermarks, AssignerWithPunctuatedWatermarks, assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks)
public DataStream<T> assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks<T> assigner)
This method creates watermarks based purely on stream elements. For each element that is handled via AssignerWithPunctuatedWatermarks#extractTimestamp(Object, long), the AssignerWithPunctuatedWatermarks#checkAndGetNextWatermark() method is called, and a new watermark is emitted if the returned watermark value is larger than the previous watermark.
This method is useful when the data stream embeds watermark elements, or certain elements carry a marker that can be used to determine the current event time watermark. This operation gives the programmer full control over the watermark generation. Users should be aware that too aggressive watermark generation (i.e., generating hundreds of watermarks every second) can cost some performance.
For cases where watermarks should be created in a regular fashion, for example every x milliseconds, use the AssignerWithPeriodicWatermarks.
See also: AssignerWithPunctuatedWatermarks, AssignerWithPeriodicWatermarks, assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks)
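The emission rule above (a punctuated watermark is only forwarded when it advances past the last one) is small enough to sketch directly; PunctuatedModel is a hypothetical name, not the Flink interface:

```scala
// Emit a candidate watermark only if it is strictly larger than the last
// emitted watermark, mirroring the punctuated-watermark rule.
class PunctuatedModel {
  private var last = Long.MinValue
  def checkAndEmit(candidate: Long): Option[Long] =
    if (candidate > last) { last = candidate; Some(candidate) } else None
}
```

Out-of-order or duplicate candidates are silently dropped, which is what keeps event time monotonic.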
public DataStream<T> assignAscendingTimestamps(scala.Function1<T,Object> extractor)
Assigns timestamps to the elements in the data stream and periodically creates watermarks to signal event time progress.
This method is a shortcut for data streams where the element timestamps are known to be monotonically ascending within each parallel stream. In that case, the system can generate watermarks automatically and perfectly by tracking the ascending timestamps.
For cases where the timestamps are not monotonically increasing, use the more general methods assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks) and assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks).
public SplitStream<T> split(OutputSelector<T> selector)
Operator used for directing tuples to specific named outputs using an OutputSelector. Calling this method on an operator creates a new SplitStream.

public SplitStream<T> split(scala.Function1<T,scala.collection.TraversableOnce<String>> fun)
Creates a new SplitStream that contains only the elements satisfying the given output selector predicate.

public <T2> CoGroupedStreams<T,T2> coGroup(DataStream<T2> otherStream)
Creates a co-group operation. See CoGroupedStreams for an example of how the keys and window can be specified.

public <T2> JoinedStreams<T,T2> join(DataStream<T2> otherStream)
Creates a join operation. See JoinedStreams for an example of how the keys and window can be specified.

public DataStreamSink<T> print()
Writes a DataStream to the standard output stream (stdout). For each element of the DataStream the result of AnyRef.toString() is written.

public DataStreamSink<T> printToErr()
Writes a DataStream to the standard error stream (stderr). For each element of the DataStream the result of AnyRef.toString() is written.
public DataStreamSink<T> writeAsText(String path)
Writes a DataStream to the file specified by path in text format.
Parameters:
path - The path pointing to the location the text file is written to.

public DataStreamSink<T> writeAsText(String path, FileSystem.WriteMode writeMode)
Writes a DataStream to the file specified by path in text format.
Parameters:
path - The path pointing to the location the text file is written to.
writeMode - Controls the behavior for existing files. Options are NO_OVERWRITE and OVERWRITE.

public DataStreamSink<T> writeAsCsv(String path)
Writes the DataStream in CSV format to the file specified by the path parameter.
Parameters:
path - Path to the location of the CSV file.

public DataStreamSink<T> writeAsCsv(String path, FileSystem.WriteMode writeMode)
Writes the DataStream in CSV format to the file specified by the path parameter.
Parameters:
path - Path to the location of the CSV file.
writeMode - Controls whether an existing file is overwritten or not.

public DataStreamSink<T> writeAsCsv(String path, FileSystem.WriteMode writeMode, String rowDelimiter, String fieldDelimiter)
Writes the DataStream in CSV format to the file specified by the path parameter.
Parameters:
path - Path to the location of the CSV file.
writeMode - Controls whether an existing file is overwritten or not.
rowDelimiter - Delimiter for consecutive rows.
fieldDelimiter - Delimiter for consecutive fields.

public DataStreamSink<T> writeUsingOutputFormat(OutputFormat<T> format)
Writes a DataStream using the given OutputFormat.

public DataStreamSink<T> writeToSocket(String hostname, Integer port, SerializationSchema<T> schema)
Writes the DataStream to a socket as a byte array. The format of the output is specified by a SerializationSchema.

public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction)
Adds the given sink to this DataStream.
public DataStreamSink<T> addSink(scala.Function1<T,scala.runtime.BoxedUnit> fun)
Adds the given sink to this DataStream.

public <F> F clean(F f)
Returns a "closure-cleaned" version of the given function. Cleans only if closure cleaning is not disabled in the ExecutionConfig.

public <R> DataStream<R> transform(String operatorName, OneInputStreamOperator<T,R> operator, TypeInformation<R> evidence$11)
Transforms the DataStream by using a custom OneInputStreamOperator.
Parameters:
operatorName - name of the operator, for logging purposes
operator - the object containing the transformation logic

Copyright © 2014–2017 The Apache Software Foundation. All rights reserved.