DataGen SQL Connector

Scan Source: Bounded Scan Source: UnBounded

The DataGen connector allows for creating tables based on in-memory data generation. This is useful when developing queries locally without access to external systems such as Kafka. Tables can include Computed Column syntax which allows for flexible record generation.

The DataGen connector is built-in, no additional dependencies are required.

Usage

By default, a DataGen table will create an unbounded number of rows with a random value for each column. For variable sized types, char/varchar/string/array/map/multiset, the length can be specified. Additionally, a total number of rows can be specified, resulting in a bounded table.

There also exists a sequence generator, where users specify a sequence of start and end values. If any column in a table is a sequence type, the table will be bounded and end with the first sequence completes.

Time types are always the local machines current system time.

CREATE TABLE Orders (
    order_number BIGINT,
    price        DECIMAL(32,2),
    buyer        ROW<first_name STRING, last_name STRING>
    order_time   TIMESTAMP(3)
) WITH (
  'connector' = 'datagen'
)

Often, the data generator connector is used in conjuction with the LIKE clause to mock out physical tables.

CREATE TABLE Orders (
    order_number BIGINT,
    price        DECIMAL(32,2),
    buyer        ROW<first_name STRING, last_name STRING>
    order_time   TIMESTAMP(3)
) WITH (...)

-- create a bounded mock table
CREATE TEMPORARY TABLE GenOrders 
WITH ( 
    'connector' = 'datagen',
    'number-of-rows' = '10'
)
LIKE Orders (EXCLUDING ALL)

Types

Type Supported Generators Notes
BOOLEAN random
CHAR random / sequence
VARCHAR random / sequence
STRING random / sequence
DECIMAL random / sequence
TINYINT random / sequence
SMALLINT random / sequence
INT random / sequence
BIGINT random / sequence
FLOAT random / sequence
DOUBLE random / sequence
DATE random Always resolves to the current date of the local machine.
TIME random Always resolves to the current time of the local machine.
TIMESTAMP random Always resolves to the current timestamp of the local machine.
TIMESTAMP WITH LOCAL TIMEZONE random Always resolves to the current timestamp of the local machine.
INTERVAL YEAR TO MONTH random
INTERVAL DAY TO MONTH random
ROW random Generates a row with random subfields.
ARRAY random Generates an array with random entries.
MAP random Generates a map with random entries.
MULTISET random Generates a multiset with random entries.

Connector Options

Option Required Default Type Description
connector
required (none) String Specify what connector to use, here should be 'datagen'.
rows-per-second
optional 10000 Long Rows per second to control the emit rate.
number-of-rows
optional (none) Long The total number of rows to emit. By default, the table is unbounded.
fields.#.kind
optional random String Generator of this '#' field. Can be 'sequence' or 'random'.
fields.#.min
optional (Minimum value of type) (Type of field) Minimum value of random generator, work for numeric types.
fields.#.max
optional (Maximum value of type) (Type of field) Maximum value of random generator, work for numeric types.
fields.#.length
optional 100 Integer Size or length of the collection for generating char/varchar/string/array/map/multiset types.
fields.#.start
optional (none) (Type of field) Start value of sequence generator.
fields.#.end
optional (none) (Type of field) End value of sequence generator.