Raw

Raw Format #

Format: Serialization Schema Format: Deserialization Schema

The Raw format allows to read and write raw (byte based) values as a single column.

Note: this format encodes null values as null of byte[] type. This may have limitation when used in upsert-kafka, because upsert-kafka treats null values as a tombstone message (DELETE on the key). Therefore, we recommend avoiding using upsert-kafka connector and the raw format as a value.format if the field can have a null value.

The Raw connector is built-in into the Blink planner, no additional dependencies are required.

Example #

For example, you may have following raw log data in Kafka and want to read and analyse such data using Flink SQL.

47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"

The following creates a table where it reads from (and can writes to) the underlying Kafka topic as an anonymous string value in UTF-8 encoding by using raw format:

CREATE TABLE nginx_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw'
)

Then you can read out the raw data as a pure string, and split it into multiple fields using an user-defined-function for further analysing, e.g. my_split in the example.

SELECT t.hostname, t.datetime, t.url, t.browser, ...
FROM(
  SELECT my_split(log) as t FROM nginx_log
);

In contrast, you can also write a single column of STRING type into this Kafka topic as an anonymous string value in UTF-8 encoding.

Format Options #

Option Required Default Type Description
format
required (none) String Specify what format to use, here should be 'raw'.
raw.charset
optional UTF-8 String Specify the charset to encode the text string.
raw.endianness
optional big-endian String Specify the endianness to encode the bytes of numeric value. Valid values are 'big-endian' and 'little-endian'. See more details of endianness.

Data Type Mapping #

The table below details the SQL types the format supports, including details of the serializer and deserializer class for encoding and decoding.

Flink SQL type Value
CHAR / VARCHAR / STRING A UTF-8 (by default) encoded text string.
The encoding charset can be configured by 'raw.charset'.
BINARY / VARBINARY / BYTES The sequence of bytes itself.
BOOLEAN A single byte to indicate boolean value, 0 means false, 1 means true.
TINYINT A single byte of the singed number value.
SMALLINT Two bytes with big-endian (by default) encoding.
The endianness can be configured by 'raw.endianness'.
INT Four bytes with big-endian (by default) encoding.
The endianness can be configured by 'raw.endianness'.
BIGINT Eight bytes with big-endian (by default) encoding.
The endianness can be configured by 'raw.endianness'.
FLOAT Four bytes with IEEE 754 format and big-endian (by default) encoding.
The endianness can be configured by 'raw.endianness'.
DOUBLE Eight bytes with IEEE 754 format and big-endian (by default) encoding.
The endianness can be configured by 'raw.endianness'.
RAW The sequence of bytes serialized by the underlying TypeSerializer of the RAW type.