Version: 0.15

Pipeline Configuration

Pipeline is a mechanism in GreptimeDB for parsing and transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (application/json or application/x-ndjson, the latter is preferred) and plain text (text/plain) formats as input for log data.

These configurations are provided in YAML format, allowing the Pipeline to process data during the log writing process according to the defined rules and store the processed data in the database for subsequent structured queries.

Input Formats

In general, a pipeline receives a set of key-value pairs in the form of a map as its input. The following three formats can all be used as pipeline input:

JSON Format

When the request's Content-Type is application/json, the data format should be standard JSON.

[
  {"message": "hello world", "time": "2024-07-12T16:18:53.048"},
  {"message": "hello greptime", "time": "2024-07-12T16:18:53.048"}
]

NDJSON Format

When the request's Content-Type is application/x-ndjson, each line is treated as a separate standard JSON object.

{"message": "hello world", "time": "2024-07-12T16:18:53.048"}
{"message": "hello greptime", "time": "2024-07-12T16:18:53.048"}

Plain Text Format

When the request's Content-Type is text/plain, each line is treated as a separate object. The entire line is treated as a string and stored in a map object containing only the message field.

hello world 2024-07-12T16:18:53.048
hello greptime 2024-07-12T16:18:53.048

The above plain text data will be converted to the following equivalent form:

{"message": "hello world 2024-07-12T16:18:53.048"}
{"message": "hello greptime 2024-07-12T16:18:53.048"}

In other words, when the input is in plain text format, you need to use message to refer to the content of each line when writing Processor and Transform configurations.

Overall structure

Pipeline consists of four parts: Processors, Dispatcher, Transform, and Table suffix. Processors pre-processes input log data. Dispatcher forwards pipeline execution context onto different subsequent pipeline. Transform decides the final datatype and table structure in the database. Table suffix allows storing the data into different tables.

Version is used to state the pipeline configuration format. Although it's optional, it's high recommended to start with version 2. See here for more details.
Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
Dispatcher(optional) is used for forwarding the context into another pipeline, so that the same batch of input data can be divided and processed by different pipeline based on certain fields.
Transform(optional) is used for converting data formats, such as converting string types to numeric types, and specifying indexes.
Table suffix(optional) is used for storing data into different table for later convenience.

Here is an example of a simple configuration that includes Processors and Transform:

version: 2
processors:
  - urlencoding:
      fields:
        - string_field_a
        - string_field_b
      method: decode
      ignore_missing: true
dispatcher:
  field: type
  rules:
    - value: http
      table_suffix: http
      pipeline: http
    - value: db
      table_suffix: db
transform:
  - fields:
      - string_field_a
      - string_field_b
    type: string
  # The written data must include the timestamp field
  - fields:
      - reqTimeSec, req_time_sec
    # epoch is a special field type and must specify precision
    type: epoch, ms
    index: timestamp
table_suffix: _${string_field_a}

Starting from v0.15, the GreptimeDB introduce a version 2 format. The main change is the transform process. Refer to the following documentation for detailed changes.

Processor

The Processor is used for preprocessing log data, and its configuration is located under the processors section in the YAML file. The Pipeline processes data by applying multiple Processors in sequential order, where each Processor depends on the result of the previous Processor. A Processor consists of a name and multiple configurations, and different types of Processors have different fields in their configuration.

We currently provide the following built-in Processors:

date: parses formatted time string fields, such as 2024-07-12T16:18:53.048.
epoch: parses numeric timestamp fields, such as 1720772378893.
dissect: splits log data fields.
gsub: replaces log data fields.
join: merges array-type fields in logs.
letter: converts log data fields to letters.
regex: performs regular expression matching on log data fields.
urlencoding: performs URL encoding/decoding on log data fields.
csv: parses CSV data fields in logs.
json_path: extracts fields from JSON data.
json_parse: parse a field into JSON object.
simple_extract: extracts fields from JSON data using simple key.
digest: extracts the template from a log message by removing variable content.
select: retain(include) or exclude fields from the pipeline context.
vrl: runs the vrl script against the pipeline context.

Input and output of Processors

Most processors accept a field or fields parameter, which is a list of fields processed sequentially, as input data. Processors produce one or more output based on the input. For processors that produce a single output, the result will overwrite the original input key in the context.

In the following example, after the letter processor, an upper-case version of the original string is saved back to the message field.

processors:
  - letter:
      fields:
        - message
      method: upper

We can save the output data to another field, leaving the original field unchanged. In the following example, we still use the message field as the input of the letter processor, but saving the output to a new field named upper_message.

processors:
  - letter:
      fields:
        - key: message
          rename_to: upper_message
      method: upper

This renaming syntax also has a shortcut form: simply separate the two field names with a ,. Here is an example of this shortcut:

processors:
  - letter:
      fields:
        - message, upper_message
      method: upper

There are mainly two reasons why renaming is needed:

Leaving the original field unchanged, so it can be reused in more than one processors, or be saved into the database as well.
Unifying naming. For example rename camel-case names to snake-case names for consistency.

`date`

The date processor is used to parse time fields. Here's an example configuration:

processors:
  - date:
      fields:
        - time
      formats:
        - '%Y-%m-%d %H:%M:%S%.3f'
      ignore_missing: true
      timezone: 'Asia/Shanghai'

In the above example, the configuration of the date processor includes the following fields:

fields: A list of time field names to be parsed.
formats: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful. You can find reference here for formatting syntax.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.
timezone: Time zone. Use the time zone identifiers from the tz_database to specify the time zone. Defaults to UTC.

`epoch`

The epoch processor is used to parse timestamp fields. Here's an example configuration:

processors:
  - epoch:
      fields:
        - reqTimeSec
      resolution: millisecond
      ignore_missing: true

In the above example, the configuration of the epoch processor includes the following fields:

fields: A list of timestamp field names to be parsed.
resolution: Timestamp precision, supports s, sec, second, ms, millisecond, milli, us, microsecond, micro, ns, nanosecond, nano. Defaults to ms.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`dissect`

The dissect processor is used to split log data fields. Here's an example configuration:

processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{key1} %{key2}'
      ignore_missing: true
      append_separator: '-'

In the above example, the configuration of the dissect processor includes the following fields:

fields: A list of field names to be split, does not support field renaming.
patterns: The dissect pattern for splitting.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.
append_separator: Specifies the separator for concatenating multiple fields with same field name together. Defaults to an empty string. See + modifier below.

Dissect pattern

Similar to Logstash's dissect pattern, the dissect pattern consists of %{key}, where %{key} is a field name. For example:

"%{key1} %{key2} %{+key3} %{+key4/2} %{key5->} %{?key6}"

Dissect modifiers

The dissect pattern supports the following modifiers:

Modifier	Description	Example
`+`	Concatenates two or more fields together	`%{+key} %{+key}`
`+` and `/n`	Concatenates two or more fields in the specified order	`%{+key/2} %{+key/1}`
`->`	Ignores any repeating characters on the right side	`%{key1->} %{key2->}`
`?`	Ignores matching values	`%{?key}`

`dissect` examples

For example, given the following log data:

"key1 key2 key3 key4 key5 key6"

Using the following Dissect pattern:

"%{key1} %{key2} %{+key3} %{+key3/2} %{key5->} %{?key6}"

The result will be:

{
  "key1": "key1",
  "key2": "key2",
  "key3": "key3 key4",
  "key5": "key5"
}

`gsub`

The gsub processor is used to replace values in log data fields. Here's an example configuration:

processors:
  - gsub:
      fields:
        - message
      pattern: 'old'
      replacement: 'new'
      ignore_missing: true

In the above example, the configuration of the gsub processor includes the following fields:

fields: A list of field names to be replaced.
pattern: The string to be replaced. Supports regular expressions.
replacement: The string to replace with.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`join`

The join processor is used to merge Array-type fields in log data. Here's an example configuration:

processors:
  - join:
      fields:
        - message
      separator: ','
      ignore_missing: true

In the above example, the configuration of the join processor includes the following fields:

fields: A list of field names to be merged. Note that all the fields mentioned are already of array type. Each field will have its own array merged separately. The content of multiple fields will not be combined.
separator: The separator for the merged result.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`join` example

For example, given the following log data:

{
  "message": ["a", "b", "c"]
}

Using the following configuration:

processors:
  - join:
      fields:
        - message
      separator: ','

The result will be:

{
  "message": "a,b,c"
}

`letter`

The letter processor is used to convert the case of characters in log data fields. Here's an example configuration:

processors:
  - letter:
      fields:
        - message
      method: upper
      ignore_missing: true

In the above example, the configuration of the letter processor includes the following fields:

fields: A list of field names to be transformed.
method: The transformation method, supports upper, lower, capital. Defaults to lower. Note the capital only changes the first letter to uppercase.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`regex`

The regex processor is used to perform regular expression matching on log data fields. Here's an example configuration:

processors:
  - regex:
      fields:
        - message
      patterns:
        - ':(?<id>[0-9])'
      ignore_missing: true

In the above example, the configuration of the regex processor includes the following fields:

fields: A list of field names to be matched. If you rename the field, the renamed fields will be combined with the capture groups in patterns to generate the result name.
pattern: The regular expression pattern to match. Named capture groups are required to extract corresponding data from the respective field.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

Rules for named capture groups in regex

The regex processor supports the syntax (?<group-name>...) to define named capture groups. The data will be processed into the following format:

{
  "<field-name>_<group-name>": "<value>"
}

For example, if the field name specified in the regex processor is message, and the corresponding content is "[ERROR] error message", you can set the pattern as \[(?<level>[A-Z]+)\] (?<content>.+), and the data will be processed as:

{
  "message_level": "ERROR",
  "message_content": "error message"
}

`urlencoding`

The urlencoding processor is used to perform URL encoding on log data fields. Here's an example configuration:

processors:
  - urlencoding:
      fields:
        - string_field_a
        - string_field_b
      method: decode
      ignore_missing: true

In the above example, the configuration of the urlencoding processor includes the following fields:

fields: A list of field names to be encoded.
method: The encoding method, supports encode, decode. Defaults to encode.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`csv`

The csv processor is used to parse CSV-type fields in log data that do not have a header. Here's an example configuration:

processors:
  - csv:
      fields:
        - message
      separator: ','
      quote: '"'
      trim: true
      ignore_missing: true

In the above example, the configuration of the csv processor includes the following fields:

fields: A list of field names to be parsed.
separator: The separator.
quote: The quotation mark.
trim: Whether to trim whitespace. Defaults to false.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`json_path` (experimental)

Note: The json_path processor is currently in the experimental stage and may be subject to change.

The json_path processor is used to extract fields from JSON data. Here's an example configuration:

processors:
  - json_path:
      fields:
        - complex_object
      json_path: "$.shop.orders[?(@.active)].id"
      ignore_missing: true
      result_index: 1

In the above example, the configuration of the json_path processor includes the following fields:

fields: A list of field names to be extracted.
json_path: The JSON path to extract.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.
result_index: Specifies the index of the value in the extracted array to be used as the result value. By default, all values are included. The extracted value of the processor is an array containing all the values of the path. If an index is specified, the corresponding value in the extracted array will be used as the final result.

JSON path syntax

The JSON path syntax is based on the jsonpath-rust library.

At this stage we only recommend using some simple field extraction operations to facilitate the extraction of nested fields to the top level.

`json_path` example

For example, given the following log data:

{
  "product_object": {
    "hello": "world"
  },
  "product_array": [
    "hello",
    "world"
  ],
  "complex_object": {
    "shop": {
      "orders": [
        {
          "id": 1,
          "active": true
        },
        {
          "id": 2
        },
        {
          "id": 3
        },
        {
          "id": 4,
          "active": true
        }
      ]
    }
  }
}

Using the following configuration:

processors:
  - json_path:
      fields:
        - product_object, object_target
      json_path: "$.hello"
      result_index: 0
  - json_path:
      fields:
        - product_array, array_target
      json_path: "$.[1]"
      result_index: 0
  - json_path:
      fields:
        - complex_object, complex_target_1
      json_path: "$.shop.orders[?(@.active)].id"
  - json_path:
      fields:
        - complex_target_1, complex_target_2
      json_path: "$.[1]"
      result_index: 0
  - json_path:
      fields:
        - complex_object, complex_target_3
      json_path: "$.shop.orders[?(@.active)].id"
      result_index: 1
transform:
  - fields:
      - object_target
      - array_target
    type: string
  - fields:
      - complex_target_3
      - complex_target_2
    type: uint32
  - fields:
      - complex_target_1
    type: json

The result will be:

{
  "object_target": "world",
  "array_target": "world",
  "complex_target_3": 4,
  "complex_target_2": 4,
  "complex_target_1": [1, 4]
}

`json_parse`

json_parse, as its name suggests, parses a string field into a JSON object. Here's an example configuration:

processors:
  - json_parse:
      fields:
        - complex_object, target_field
      ignore_missing: true

In the above example, the configuration of the json_parse processor includes the following fields:

fields: For each field, the first key represents the entry string in the context, the second key represents the output key name. If the second key is not present, it uses the first key as the output key name too, which means in-place replacement of the original value. The example above means parsing complex_object and put the output into target_field.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`json_parse` example

For example, given the following log data:

{
  "product_object": "{\"hello\":\"world\"}",
}

Using the following configuration:

processors:
  - json_parse:
      fields:
        - product_object

The result will be:

{
  "product_object": {
    "hello": "world"
  }
}

`simple_extract`

While json_path processor is capable of extracting fields from JSON objects using complex expressions, it's relatively slow and costly. The simple_extract processor offers a simple way to extract fields by using just key names. Here's an example configuration:

processors:
  - simple_extract:
      fields:
        - complex_object, target_field
      key: "shop.name"
      ignore_missing: true

In the above example, the configuration of the simple_extract processor includes the following fields:

fields: For each field, the first key represents the entry JSON object in the context, the second key represents the output key name. The example above means extracting from complex_object and put the output into target_field.
key: The extracting key, using x.x format, each . represents a new nested layer.
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`simple_extract` example

For example, given the following log data:

{
  "product_object": {
    "hello": "world"
  },
  "product_array": [
    "hello",
    "world"
  ],
  "complex_object": {
    "shop": {
      "name": "some_shop_name"
    }
  }
}

Using the following configuration:

processors:
  - simple_extract:
      fields:
        - complex_object, shop_name
      key: "shop.name"
transform:
  - fields:
      - shop_name
    type: string

The result will be:

{
  "shop_name": "some_shop_name"
}

`digest`

The digest processor is used to extract the template from a log message by removing variable content, such as numbers, UUIDs, IP addresses, quoted strings, and bracketed text. The digested result can be useful for log template grouping and analysis. Here's an example configuration:

processors:
  - digest:
      fields:
        - message
      presets:
        - numbers
        - uuid
        - ip
        - quoted
        - bracketed
      regex:
        - "foobar"
      ignore_missing: true

In the above example, the configuration of the digest processor includes the following fields:

fields: A list of field names to be digested. The digested result will be stored in a new field with suffix _digest.
presets: A list of preset patterns to remove. The following patterns are supported:
- numbers: Matches numeric sequences
- uuid: Matches UUID strings like 123e4567-e89b-12d3-a456-426614174000
- ip: Matches IPv4/IPv6 addresses with optional port numbers
- quoted: Matches text within single/double quotes (including various Unicode quotes)
- bracketed: Matches text within various types of brackets (including various Unicode brackets)
regex: A list of custom regex patterns to remove
ignore_missing: Ignores the case when the field is missing. Defaults to false. If the field is missing and this configuration is set to false, an exception will be thrown.

`digest` example

For example, given the following log data:

{
  "message": "User 'john.doe' from [192.168.1.1] accessed resource 12345 with UUID 123e4567-e89b-12d3-a456-426614174000"
}

Using the following configuration:

processors:
  - digest:
      fields:
        - message
      presets:
        - numbers
        - uuid
        - ip
        - quoted
        - bracketed

The result will be:

{
  "message": "User 'john.doe' from [192.168.1.1] accessed resource 12345 with UUID 123e4567-e89b-12d3-a456-426614174000",
  "message_digest": "User  from  accessed resource  with UUID "
}

The digested template can be used to group similar log messages together or analyze log patterns, even when the variable content differs. For example, all these log messages would generate the same template:

User 'alice' from [10.0.0.1] accessed resource 54321 with UUID 987fbc97-4bed-5078-9141-2791ba07c9f3
User 'bob' from [2001:0db8::1] accessed resource 98765 with UUID 550e8400-e29b-41d4-a716-446655440000

`select`

The select processor is used to retain or exclude fields from the pipeline execution context.

Starting from v0.15 release, we are introducing auto-transform for simplicity. The auto-transform mode will try to preserve all fields from the pipeline execution context. The select processor can be used here to select fields to include or exclude, which will reflects the final table schema if auto-transform is used.

The configuration options for select is simple:

type(optional)
- include(default): only keeps the selected fields
- exclude: removes the selected fields from the current context
fields: fields selected from the pipeline execution context

Here is an example:

processors:
  - dissect:
      fields:
        - message
      patterns:
        - "%{+ts} %{+ts} %{http_status_code} %{content}"
  - date:
      fields:
        - ts
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
  - select:
      fields:
        - http_status_code
        - ts

With dissect and date processor, there are four fields in the context: ts, http_status_code, content and the original message. Without the select processor, all four fields are preserved. The select processor here selects http_status_code and ts fields to include (by default), which effectively removes content and message in the pipeline execution context, resulting the http_status_code and ts to be preserved into the database.

The above example can also be done in the following select processor's configuration:

  - select:
      type: exclude
      fields:
        - content
        - message

`vrl`

Experimental Feature

This experimental feature may contain unexpected behavior, have its functionality change in the future.

The vrl processor run the vrl programming script against the pipeline context. It's more powerful than simple processors, for it allows you to write actual programming codes to manipulate the context; however, it also costs more resource to execute. Refer to the official website for more language introduction and usage.

The vrl processor only takes one configuration, that is the source field. Here's an example:

processors:
  - date:
      field: time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
  - vrl:
      source: |
        .from_source = "channel_2"
        cond, err = .id1 > .id2
        if (cond) {
            .from_source = "channel_1"
        }
        del(.id1)
        del(.id2)
        .

The configuration uses a | to start a multi-line content in the YAML file. The whole script is then written below.

Some notes regarding the vrl processor:

The script have to end with a newline of ., indicating returning the whole context as the returning value of the script.
The returning value of the vrl script should not contain any regex-type variables. They can be used in the script, but have to be deled before returning.
Due to type conversion between pipeline's value type and vrl's, the value type that comes out of the vrl script will be the ones with max capacity, meaning i64, f64, and Timestamp::nanoseconds.

Transform

Transform is used to convert data formats and specify indexes upon columns. It is located under the transform section in the YAML file.

Starting from v0.15, GreptimeDB is introducing version 2 format and auto-transform to largely simplify the configuration. See below for details.

A Transform consists of one or more configurations, and each configuration contains the following fields:

fields: A list of field names to be transformed.
type: The target transformation type in the database.
index(optional): The index type.
tag(optional): Specify the field to be a tag field.
on_failure(optional): Handling method for transformation failures.
default(optional): Default value.

Transform in version 2

Originally, you have to manually specify all the fields in the transform section for them to be persisted in the database. If a field is not specify in the transform, it will be discards. With the number of field growing, this can make the configuration both tedious and error-prone.

Starting from v0.15, GreptimeDB introduces a new transform mode which make it easier to write pipeline configuration. You only set necessary fields in the transform section, specifying particular datatype and index for them; the rest of the fields from the pipeline context are set automatically by the pipeline engine. With the select processor, you can decide what field is wanted and what isn't in the final table.

However, this is a breaking change to the existing pipeline configuration files. If you has already used pipeline with dissect or regex processors, after upgrading the database, the original message string, which is still in the pipeline context, gets immediately inserted into the database and there's no way to stop this behavior.

Therefore, GreptimeDB introduces the concept of version to decide which transform mode you want to use, just like the version in a Docker Compose file. Here is an example:

version: 2
processors:
  - date:
    field: input_str
    formats:
      - "%Y-%m-%dT%H:%M:%S%.3fZ"

transform:
  - field: input_str, ts
    type: time, ms
    index: timestamp

Simply add a version: 2 line at the top of the config file, and the pipeline engine will run the transform in combined mode:

Process all written transform rules sequentially.
Write all fields of the pipeline context to the final table.

Note:

The transform section must contains one timestamp index field.
The transform process in the version 2 will consume the original field in the pipeline context, so you can't transform the same field twice.

Auto-transform

The transform configuration in version 2 format is already a large simplification over the original transform. However, there are times when you might want to combine the power of processors with the ease of using greptime_identity, writing no transform code, letting the pipeline engine auto infer and persist the data.

Now it is possible in custom pipelines. If no transform section is specified, the pipeline engine will attempt to infer the datatype of the fields from the pipeline context and preserve them into the database, much like what the identity_pipeline does.

To create a table in GreptimeDB, a timestamp index column must be specified. In this case, the pipeline engine will try to find a field of type timestamp in the context and set it as the time index column. A timestamp field is produced by a date or epoch processor, so at least one date or epoch processor must be defined in the processor's section. Additionally, only one timestamp field is allowed, multiple timestamp fields would lead to an error due to ambiguity.

For example, the following pipeline configuration is now valid.

version: 2
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - %{username} [%{timestamp}] "%{http_method} %{request_line} %{protocol}" %{status_code} %{response_size}'
      ignore_missing: true
  - date:
      fields:
        - timestamp
      formats:
        - "%d/%b/%Y:%H:%M:%S %z"

And it result to the following table schema:

mysql> desc auto_trans;
+---------------+---------------------+------+------+---------+---------------+
| Column        | Type                | Key  | Null | Default | Semantic Type |
+---------------+---------------------+------+------+---------+---------------+
| timestamp     | TimestampNanosecond | PRI  | NO   |         | TIMESTAMP     |
| host          | String              |      | YES  |         | FIELD         |
| http_method   | String              |      | YES  |         | FIELD         |
| ip_address    | String              |      | YES  |         | FIELD         |
| message       | String              |      | YES  |         | FIELD         |
| protocol      | String              |      | YES  |         | FIELD         |
| request_line  | String              |      | YES  |         | FIELD         |
| response_size | String              |      | YES  |         | FIELD         |
| service       | String              |      | YES  |         | FIELD         |
| source_type   | String              |      | YES  |         | FIELD         |
| status_code   | String              |      | YES  |         | FIELD         |
| username      | String              |      | YES  |         | FIELD         |
+---------------+---------------------+------+------+---------+---------------+
12 rows in set (0.03 sec)

We can see that all fields are preserved, including the original message field. Note that all fields are saved as String type, and the timestamp field is automatically set as the time index.

The `fields` field

Each field name is a string. fields in transform can also use renaming, refer to the docs here. The final name of the field will be used as the column name in the database table.

The `type` field

GreptimeDB currently provides the following built-in transformation types:

int8, int16, int32, int64: Integer types.
uint8, uint16, uint32, uint64: Unsigned integer types.
float32, float64: Floating-point types.
string: String type.
time: Time type, which will be converted to GreptimeDB timestamp(9) type.
epoch: Timestamp type, which will be converted to GreptimeDB timestamp(n) type. The value of n depends on the precision of the epoch. When the precision is s, n is 0; when the precision is ms, n is 3; when the precision is us, n is 6; when the precision is ns, n is 9.

If a field obtains an illegal value during the transformation process, the Pipeline will throw an exception. For example, when converting a string abc to an integer, an exception will be thrown because the string is not a valid integer.

The `index` field

The Pipeline will write the processed data to the automatically created table in GreptimeDB. To improve query efficiency, GreptimeDB creates indexes for certain columns in the table. The index field is used to specify which fields need to be indexed. For information about GreptimeDB index types, please refer to the Data Index documentation.

GreptimeDB supports the following four types of index for fields:

timestamp: Specifies a column as a timestamp index column.
inverted: Specifies a column to use the inverted index type.
fulltext: Specifies a column to use the fulltext index type. The column must be of string type.
skipping: Specifies a column to use the skipping index type. The column must be of string type.

When index field is not provided, GreptimeDB doesn't create index on the column.

In GreptimeDB, a table must include exact one column of type timestamp as the time index column. Therefore, a Pipeline can have only one time index column.

The Timestamp column

Specify which field is the timestamp index column using index: timestamp. Refer to the Transform Example below for syntax.

The Inverted Index

Specify which field uses the inverted index. Refer to the Transform Example below for syntax.

The Fulltext Index

Specify which field will be used for full-text search using index: fulltext. This index greatly improves the performance of log search. Refer to the Transform Example below for syntax.

The Skipping Index

Specify which field uses the skipping index. This index speeds up the query on high cardinality fields but consumes far less storage for building index files. Refer to the Transform Example below for syntax.

The `tag` field

The Pipeline can also specify a tag field manually. For information about tag field, please refer to the Data Model documentation. Refer to the Transform Example below for syntax.

The `on_failure` field

The on_failure field is used to specify the handling method when a transformation fails. It supports the following methods:

ignore: Ignore the failed field and do not write it to the database.
default: Write the default value. The default value is specified by the default field.

The `default` field

The default field is used to specify the default value when a transformation fails.

Transform Example

For example, with the following log data:

{
  "num_field_a": "3",
  "string_field_a": "john",
  "string_field_b": "It was snowing when he was born.",
  "time_field_a": 1625760000
}

Using the following configuration:

transform:
  - fields:
      - string_field_a, name
    type: string
    index: skipping
    tag: true
  - fields:
      - num_field_a, age
    type: int32
    index: inverted
  - fields:
      - string_field_b, description
    type: string
    index: fulltext
  - fields:
      - time_field_a, born_time
    type: epoch, s
    index: timestamp

The result will be:

{
  "name": "john",
  "age": 3,
  "description": "It was snowing when he was born.",
  "born_time": 2021-07-08 16:00:00
}

Dispatcher

The pipeline dispatcher routes requests to other pipelines based on configured field values. It is useful when multiple log types share a single source and need to be stored in separate tables with different structures.

A sample configuration is like:

dispatcher:
  field: type
  rules:
    - value: http
      table_suffix: http
      pipeline: http
    - value: db
      table_suffix: db

The dispatcher runs after processors and before transformations. It routes data to the next pipeline, where the defined processors will execute.

You can specify a field and rules for routing. If the field value matches a rule's value, the data is routed to the specified pipeline. If no pipeline is defined, the current data structure is used as the table structure. In the example configuration above, if the type of input data is http, we will call http as next pipeline. And if type is db, we use current data structure to store the data.

The target table name is determined by table_suffix, appended to the current table and an underscore _. For example, if the table is applogs and it matches the http rule, data is stored in applogs_http.

If no rules match, data is transformed by the current pipeline's transformations.

Table suffix

Experimental Feature

This experimental feature may contain unexpected behavior, have its functionality change in the future.

There are cases where you want to split and insert log data into different target table based on some certain values of input data. For example, if you want to divide and store the log data based on the application where the log is produced, thus adding an app name suffix to the target table.

A sample configuration is like:

table_suffix: _${app_name}

The syntax is simple: use ${} to include the variable in the pipeline execution context. The variable can be directly from the input data or a product of former process. After the table suffix is formatted, the whole string will be added to the input table name.

Note:

The variable must be an integer number or a string type of data.
If any error occurs in runtime(e.g: the variable is missing or not a valid type), the input table name will be used.

Here is an example of how it works. The input data is like following:

[
  {"type": "db"},
  {"type": "http"},
  {"t": "test"}
]

The input table name is persist_app, and the pipeline config is like

table_suffix: _${type}

These three lines of input log will be inserted into three tables:

persist_app_db
persist_app_http
persist_app, for it doesn't have a type field, thus the default table name will be used.

Pipeline Configuration

Input Formats​

JSON Format​

NDJSON Format​

Plain Text Format​

Overall structure​

Processor​

Input and output of Processors​

date​

epoch​

dissect​

Dissect pattern​

Dissect modifiers​

dissect examples​

gsub​

join​

join example​

letter​

regex​

Rules for named capture groups in regex​

urlencoding​

csv​

json_path (experimental)​

JSON path syntax​

json_path example​

json_parse​

json_parse example​

simple_extract​

simple_extract example​

digest​

digest example​

select​

vrl​

Transform​

Transform in version 2​

Auto-transform​

The fields field​

The type field​

The index field​

The Timestamp column​

The Inverted Index​

The Fulltext Index​

The Skipping Index​

The tag field​

The on_failure field​

The default field​

Transform Example​

Dispatcher​

Table suffix​

Input Formats

JSON Format

NDJSON Format

Plain Text Format

Overall structure

Processor

Input and output of Processors

`date`

`epoch`

`dissect`

Dissect pattern

Dissect modifiers

`dissect` examples

`gsub`

`join`

`join` example

`letter`

`regex`

Rules for named capture groups in regex

`urlencoding`

`csv`

`json_path` (experimental)

JSON path syntax

`json_path` example

`json_parse`

`json_parse` example

`simple_extract`

`simple_extract` example

`digest`

`digest` example

`select`

`vrl`

Transform

Transform in version 2

Auto-transform

The `fields` field

The `type` field

The `index` field

The Timestamp column

The Inverted Index

The Fulltext Index

The Skipping Index

The `tag` field

The `on_failure` field

The `default` field

Transform Example

Dispatcher

Table suffix