Raw Format

Format: Serialization Schema
Format: Deserialization Schema

The Raw format allows reading and writing raw (byte-based) values as a single column.

Note: this format encodes null values as null of byte[] type. This can be a limitation when used with upsert-kafka, because upsert-kafka treats null values as tombstone messages (DELETE on the key). Therefore, we recommend not using the raw format as the value.format of the upsert-kafka connector if the field can have a null value.
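The ambiguity described in the note can be illustrated with a small Python sketch. It is not Flink code: raw_serialize_string and is_tombstone are hypothetical stand-ins that only mimic the behaviour stated above (a null field becomes a null payload, and an upsert-style reader treats a null payload as a DELETE).

```python
from typing import Optional


def raw_serialize_string(value: Optional[str], charset: str = "utf-8") -> Optional[bytes]:
    """Hypothetical stand-in for serializing a single STRING column."""
    # A null field is emitted as a null payload, not as an empty byte array.
    return None if value is None else value.encode(charset)


def is_tombstone(payload: Optional[bytes]) -> bool:
    """Hypothetical stand-in for how an upsert-style reader interprets a value."""
    return payload is None


# A legitimate null value and a delete marker are indistinguishable downstream.
null_value_looks_deleted = is_tombstone(raw_serialize_string(None))

# An empty string is still a real (zero-length) payload in this sketch.
empty_value_looks_deleted = is_tombstone(raw_serialize_string(""))
```

In this sketch only the null case collapses into a tombstone, which is why the recommendation applies specifically to fields that can be null.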

Dependencies

To use the Raw format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

Maven dependency    SQL Client JAR
flink-raw           Built-in

Example

For example, you may have the following raw log data in Kafka and want to read and analyse it using Flink SQL.

47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"

The following statement creates a table that reads from (and can write to) the underlying Kafka topic, treating each record as an anonymous string value in UTF-8 encoding by using the raw format:

CREATE TABLE nginx_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw'
)

You can then read the raw data as a plain string and split it into multiple fields with a user-defined function for further analysis, e.g. my_split in the example below.

SELECT t.hostname, t.datetime, t.url, t.browser, ...
FROM(
  SELECT my_split(log) as t FROM nginx_log
);
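The query above leaves my_split undefined. As a rough illustration of the parsing such a function could wrap, the Python sketch below splits a log line like the sample into named fields with a regular expression. The pattern, the function name split_log_line, and every field name other than hostname, datetime, url and browser are assumptions made for illustration, as is the choice to map url to the request path. Registering the logic as a Flink user-defined function is not shown.

```python
import re
from typing import Dict, Optional

# Assumed layout: an nginx access-log line shaped like the sample above.
LOG_PATTERN = re.compile(
    r'(?P<hostname>\S+) \S+ \S+ '                           # client address, two unused fields
    r'\[(?P<datetime>[^\]]+)\] '                            # timestamp inside [...]
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '  # request line
    r'(?P<status>\d{3}) (?P<size>\d+) '                     # status code and response size
    r'"(?P<referrer>[^"]*)" '                               # referrer
    r'"(?P<browser>[^"]*)"'                                 # user agent
)


def split_log_line(log: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(log)
    return match.groupdict() if match else None


sample = (
    '47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" '
    '200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 '
    'Safari/537.36" "2.75"'
)
fields = split_log_line(sample)
```

Returning None for lines that do not match keeps malformed records from raising inside the query; whether to drop or route such records elsewhere is a separate design decision.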

Conversely, you can also write a single column of STRING type into this Kafka topic as an anonymous string value in UTF-8 encoding.

Format Options

Option          Required  Default     Type    Description
format          required  (none)      String  Specifies the format to use; here it should be 'raw'.
raw.charset     optional  UTF-8       String  Specifies the charset used to encode the text string.
raw.endianness  optional  big-endian  String  Specifies the endianness used to encode the bytes of a numeric value. Valid values are 'big-endian' and 'little-endian'.
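To make the effect of the two optional settings concrete, the Python sketch below shows how the same value produces different bytes under different charsets and byte orders, and what happens when a reader uses the wrong setting. It is a standalone illustration, not Flink code: encode_text and encode_int are hypothetical helpers, and ISO-8859-1 is only an assumed example of a non-default charset.

```python
def encode_text(text: str, charset: str = "utf-8") -> bytes:
    """Encode a string with the given charset (Python codec names assumed)."""
    return text.encode(charset)


def encode_int(value: int, endianness: str = "big-endian") -> bytes:
    """Encode a 4-byte signed integer with the given byte order."""
    if endianness not in ("big-endian", "little-endian"):
        raise ValueError(f"unsupported endianness: {endianness}")
    byteorder = "big" if endianness == "big-endian" else "little"
    return value.to_bytes(4, byteorder, signed=True)


# Same text, different charsets: the payload bytes differ.
utf8_bytes = encode_text("café")
latin1_bytes = encode_text("café", "iso-8859-1")

# Reading UTF-8 bytes with the wrong charset garbles the text without an error.
garbled = utf8_bytes.decode("iso-8859-1")

# Same integer, different byte orders.
big = encode_int(1)
little = encode_int(1, "little-endian")

# Reading big-endian bytes as little-endian yields a different number.
misread = int.from_bytes(big, "little", signed=True)
```

Neither mismatch raises an error in this sketch, so the writer and the reader of a topic need to agree on both settings.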

Data Type Mapping

The table below lists the SQL types that the format supports and describes how values of each type are encoded and decoded.

Flink SQL type              Value
CHAR / VARCHAR / STRING     A UTF-8 (by default) encoded text string. The encoding charset can be configured by 'raw.charset'.
BINARY / VARBINARY / BYTES  The sequence of bytes itself.
BOOLEAN                     A single byte to indicate the boolean value: 0 means false, 1 means true.
TINYINT                     A single byte of the signed number value.
SMALLINT                    Two bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'.
INT                         Four bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'.
BIGINT                      Eight bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'.
FLOAT                       Four bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'.
DOUBLE                      Eight bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'.
RAW                         The sequence of bytes serialized by the underlying TypeSerializer of the RAW type.
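The fixed-width rows of the mapping can be reproduced outside Flink, which is handy when producing or inspecting test messages by hand. The following Python sketch uses the standard struct module to mirror the layouts described above. The FORMATS table and the encode and decode helpers are hypothetical names, and the sketch covers only the boolean and numeric types, not strings, binary values or RAW.

```python
import struct

# struct format characters assumed to match the fixed-width types above.
FORMATS = {
    "BOOLEAN": "?",   # 1 byte, 0 or 1
    "TINYINT": "b",   # 1 byte, signed
    "SMALLINT": "h",  # 2 bytes, signed
    "INT": "i",       # 4 bytes, signed
    "BIGINT": "q",    # 8 bytes, signed
    "FLOAT": "f",     # 4 bytes, IEEE 754
    "DOUBLE": "d",    # 8 bytes, IEEE 754
}


def _prefix(endianness: str) -> str:
    """Map the option value to a struct byte-order prefix."""
    if endianness == "big-endian":
        return ">"
    if endianness == "little-endian":
        return "<"
    raise ValueError(f"unsupported endianness: {endianness}")


def encode(sql_type: str, value, endianness: str = "big-endian") -> bytes:
    """Pack one value into the byte layout described for its SQL type."""
    return struct.pack(_prefix(endianness) + FORMATS[sql_type], value)


def decode(sql_type: str, payload: bytes, endianness: str = "big-endian"):
    """Unpack one value; raises struct.error if the payload length is wrong."""
    return struct.unpack(_prefix(endianness) + FORMATS[sql_type], payload)[0]


int_default = encode("INT", 1)
int_little = encode("INT", 1, "little-endian")
smallint_default = encode("SMALLINT", 258)
tinyint_negative = encode("TINYINT", -1)
double_one = encode("DOUBLE", 1.0)
```

Because struct checks the payload length, decoding a message whose size does not match the declared type fails loudly in this sketch instead of returning a wrong value.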