CLP: Efficient and Scalable Search on Compressed Text Logs

Introduction

This paper introduces a "tool capable of losslessly compressing unstructured text logs while enabling fast searches directly on the compressed data."

Two alternatives:

  • log-search tools, e.g., Elasticsearch and Splunk Enterprise
    • Defect: the index is large -> high storage and memory costs -> operators typically keep indexed logs for only a few weeks.
  • log archival and compression tools, e.g., Gzip
    • Defect: search is slow because the data must be decompressed first.
    • Gzip encodes repetitions as length-distance pairs.

Background:

  • Tech companies generate large volumes of log data
  • Some industries are required to maintain logs for years

Method:

  • Combining a domain-specific compression and search algorithm with a lightweight general-purpose compression algorithm.
  • The former is fast but achieves a lower compression ratio; the latter is the opposite.
  • Process (a minimal sketch of steps 1-3 follows this list):
    1. The domain-specific compressor extracts duplicated patterns (log types) into a dictionary.
    2. Repetitive variable values are deduplicated into a second dictionary.
    3. Each log message is then encoded as a series of integers referencing dictionary entries.
    4. A general-purpose compressor further compresses the encoded data.
    5. To search, CLP encodes the query the same way and matches it against the compressed data.
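
A minimal sketch of steps 1-3 in Python (the tokenizer and the digit-based variable heuristic are illustrative assumptions, not CLP's actual rules):

```python
import re

logtype_dict: dict[str, int] = {}   # log type -> ID
var_dict: dict[str, int] = {}       # variable value -> ID

def intern(table: dict, key: str) -> int:
    """Assign the next integer ID to an unseen key."""
    return table.setdefault(key, len(table))

def encode(message: str) -> tuple[int, list[int]]:
    """Replace variable tokens with a placeholder to form the log type,
    interning each variable value in its dictionary."""
    var_ids: list[int] = []
    def to_placeholder(m: re.Match) -> str:
        var_ids.append(intern(var_dict, m.group(0)))
        return '\x11'   # placeholder byte for a dictionary variable
    # Toy heuristic: any token containing a digit is a variable.
    logtype = re.sub(r'\S*\d\S*', to_placeholder, message)
    return intern(logtype_dict, logtype), var_ids

logtype_id, var_ids = encode("Task task_12 assigned to container_15")
# One log type ("Task \x11 assigned to \x11") now covers every message
# with this pattern; only the variable IDs differ per message.
```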

Main limitation: CLP's algorithm is designed primarily for text logs.

Design Overview

Three objectives:

  • Lossless compression, so users can safely delete their original files.
  • Allow users to search their logs for any value.
  • Performant and scalable

Compression

Two steps:

  • Deduplicates highly repetitive parts of each log message and encodes them in a special format
  • Applies a lightweight compressor to the encoded data, further reducing its size.

Each message is split into three pieces:

  • log type
  • variable values
    • dictionary variables: repetitive (e.g., identifiers like a username)
    • non-dictionary variables: non-repetitive (e.g., a job's completion time)
  • timestamp (if the message contains one).

Compressing Messages

Storage format:

  • Dictionary variables:
    • Two-level dictionary
  • Non-dictionary variable:
    • Directly in the encoded log message
    • Supports encoding floating point numbers and integers as non-dictionary variables
    • Uses a fixed-width 64-bit encoding instead of a variable-width encoding
  • The remaining portion is treated as part of the log type
    • Uses the byte ‘\x11’ in the log type as a placeholder for a dictionary variable value

An encoded message consists of:

  • a 64-bit timestamp
  • a 32-bit log type ID
  • a sequence of 64-bit variable IDs and encoded variable values

If ‘\x11’ appears in the raw text, CLP escapes it before insertion into the log type.
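
A minimal sketch of the fixed-width layout (byte order and exact packing are my assumptions; the field widths follow the notes above):

```python
import struct

def pack_message(timestamp: int, logtype_id: int, var_slots: list[int]) -> bytes:
    """One encoded message: a 64-bit timestamp, a 32-bit log type ID,
    then a 64-bit slot per variable (a dictionary ID or a directly
    encoded integer/float)."""
    return struct.pack('<qI%dq' % len(var_slots),
                       timestamp, logtype_id, *var_slots)
```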

Decompressing Messages

Convert the IDs back to their original values.
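
A sketch of the inverse, under the same layout assumptions (non-dictionary values, which would be re-formatted from their 64-bit encodings, are omitted):

```python
import struct

def unpack_message(buf: bytes, logtype_dict: dict[int, str],
                   var_dict: dict[int, str]) -> tuple[int, str]:
    """Look up the log type by ID, then fill each '\x11' placeholder
    with its variable's original value."""
    timestamp, logtype_id = struct.unpack_from('<qI', buf)
    n_vars = (len(buf) - 12) // 8
    slots = struct.unpack_from('<%dq' % n_vars, buf, 12)
    pieces = logtype_dict[logtype_id].split('\x11')
    text = pieces[0]
    for piece, var_id in zip(pieces[1:], slots):
        text += var_dict[var_id] + piece
    return timestamp, text
```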

On-disk Format

Process:

  • Encoded messages are initially buffered in memory
  • then compressed with Zstandard
  • and written to disk as a segment
  • Segments belong to an archive; all segments in an archive share its log type and variable dictionaries

Encoded messages are stored in a column-oriented manner.

  • To improve the compression ratio
  • To improve search performance
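
A sketch of the column-oriented buffering, assuming the third-party zstandard package (the segment layout details are assumptions):

```python
import struct
import zstandard  # third-party python-zstandard package

def write_segment(messages: list[tuple[int, int]]) -> bytes:
    """Transpose buffered (timestamp, logtype_id) pairs into columns so
    similar values sit next to each other, then compress; the variable
    columns are omitted for brevity."""
    timestamps = b''.join(struct.pack('<q', ts) for ts, _ in messages)
    logtype_ids = b''.join(struct.pack('<I', lt) for _, lt in messages)
    # "Default" mode: Zstandard at level 3.
    return zstandard.ZstdCompressor(level=3).compress(timestamps + logtype_ids)
```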

CLP is I/O-bound when reading segments.

Dictionaries index at the granularity of segments (each dictionary entry records which segments contain it).

For each archive, CLP stores the corresponding metadata of the files and directories (like paths).

Timestamps can be reconstructed into their original format or rendered in other formats.

Three offsets are stored as metadata.

The lightweight compressor supports three modes:

  • “Default”, which uses Zstandard at level 3
  • “Archive”, which uses 7z-lzma at level 1
  • “Ultra”, which uses 7z-lzma at level 9

Search

Given a query, CLP first encodes it the same way it encodes messages, then searches the dictionaries for the log type and the dictionary variables.

CLP supports search phrases that can contain two types of wildcard characters:

  • ‘*’: zero or more chars
  • ‘?’: any single char
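
Python's fnmatch implements the same ‘*’ and ‘?’ semantics, which gives a quick feel for the behavior:

```python
from fnmatch import fnmatchcase

fnmatchcase("container_1463", "container_*")  # True: '*' spans any run of characters
fnmatchcase("job_7", "job_?")                 # True: '?' is exactly one character
fnmatchcase("job_72", "job_?")                # False: one character too many
```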

With wildcards, a token becomes ambiguous: it could be part of the log type, a dictionary variable, or a non-dictionary variable.

Handling Ambiguous Tokens

  1. Insert a ‘*’ wildcard at the beginning and end of the query.
  2. Tokenize the query.
  3. Compare each token against all the variable schemas (if none match, the token is log type text).
  4. Generate sub-queries, one per combination of token interpretations (see the sketch after this list).
  5. Each sub-query is processed in three steps:
    1. Search the log type dictionary for matching log types.
    2. If found, search the variable dictionary for the dictionary variables.
    3. If both match, intersect the segments indexed by the matching log types and variables; for each such segment, decompress it and search for encoded messages matching the encoded query.
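
A sketch of step 4, the sub-query generation (the schemas mapping and its predicate form are my assumptions):

```python
from itertools import product

def generate_subqueries(tokens, schemas):
    """Each wildcard token may be interpreted as log type text or as any
    variable schema it could match; one sub-query per combination.
    'schemas' maps a schema name to a could-match predicate."""
    per_token = []
    for tok in tokens:
        interpretations = ['logtype']
        interpretations += [name for name, could_match in schemas.items()
                            if could_match(tok)]
        per_token.append([(tok, kind) for kind in interpretations])
    return list(product(*per_token))

# e.g. "container_15*" could be log type text or a dictionary variable:
generate_subqueries(["container_15*"],
                    {"identifier": lambda t: any(c.isdigit() for c in t)})
# -> [(('container_15*', 'logtype'),), (('container_15*', 'identifier'),)]
```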

Optimizing CLP Queries

log type dictionary search time < variable dictionary search time < segment scan time

The best practice is to include enough information in the search phrase to help CLP narrow down the log type, the dictionary variable values, or both.

CLP does not build any additional index on its dictionary entries (left as future work) because the dictionaries are small.

Handling Special Cases

Users have two options for changing variable schemas after log data has already been compressed.

  • Apply the new schemas only to newly ingested logs
  • Decompress and recompress the old data

CLP can warn the user if the schemas they provided are not optimal.

CLP supports the deletion of log messages.

  • The segment index in the dictionaries will also be updated.

Distributed Architecture

A simple controller and data node design.

  • Controller: manages metadata and handles node failures.
    • three metadata tables (used only to speed up search):
      • log files, stores the metadata of each raw log file (its file system path, the number of log messages, etc.) as well as the archive that contains this log file.
      • archives, stores the metadata of each archive including which data node stores this archive.
      • empty directories, stores the paths of empty directories compressed in each archive.
  • Data nodes: perform the data-intensive work of compression and search.

Each archive is independent of other archives and immutable once written.

To avoid coordination between search threads, each archive is only queried by a single thread for a given query.

Thus, CLP parallelizes compression and search at the granularity of individual archives.
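
A sketch of that parallelization (search_archive is a hypothetical stand-in for CLP's per-archive search):

```python
from concurrent.futures import ThreadPoolExecutor

def search_archive(archive: list[str], query: str) -> list[str]:
    """Hypothetical per-archive search; archives are independent and
    immutable, so no shared state is touched."""
    return [msg for msg in archive if query in msg]

def parallel_search(archives: list[list[str]], query: str, workers: int = 8) -> list[str]:
    """One task per archive, so search threads never need to coordinate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_archive = pool.map(lambda a: search_archive(a, query), archives)
    return [msg for matches in per_archive for msg in matches]
```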

CLP supports file system integration using FUSE, so it can imitate tools like find.

Wildcards and Schemas

Two fundamental challenges:

  • Determining how to tokenize a string containing wildcards, given that a wildcard could either match a delimiter or non-delimiter character
  • Determining if a token containing wildcards (a wildcard token) could match a given variable schema.

Wildcard String Tokenization

A ‘*’ wildcard can be interpreted as:

  • matching non-delimiters only
  • matching delimiters only
  • matching non-delimiters and delimiters

If a ‘*’ is interpreted to have a different type than either of the characters surrounding it, the tokenization should split the string at the ‘*’, leaving a ‘*’ attached to each surrounding token.
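
A sketch that enumerates the possible tokenizations (the delimiter set and the rewrite trick are my assumptions):

```python
import re
from itertools import product

def wildcard_tokenizations(query: str, delims: str = " :,") -> set:
    """Each '*' either matches non-delimiters only (it stays inside one
    token) or may also match delimiters (the string splits there, with
    a '*' kept on both sides)."""
    stars = [i for i, c in enumerate(query) if c == '*']
    split_re = re.compile('[' + re.escape(delims) + ']+')
    results = set()
    for choice in product([False, True], repeat=len(stars)):
        s = query
        # Rewrite each star chosen to span a delimiter as '* *' so the
        # plain delimiter split separates the two halves.
        for pos, split_here in sorted(zip(stars, choice), reverse=True):
            if split_here:
                s = s[:pos] + '* *' + s[pos + 1:]
        results.add(tuple(t for t in split_re.split(s) if t))
    return results

wildcard_tokenizations("task*123")
# -> {('task*123',), ('task*', '*123')}
```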

Comparing Expressions

CLP uses a custom engine to compute whether the sets of strings matched by two wildcard-containing tokens have a non-empty intersection.
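
One way to decide non-empty intersection is a small dynamic program over the two patterns; a minimal sketch (my reconstruction, not CLP's actual engine):

```python
from functools import lru_cache

def patterns_intersect(p1: str, p2: str) -> bool:
    """True if some string is matched by both wildcard patterns, where
    '*' matches zero or more characters and '?' exactly one."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> bool:
        if i == len(p1) and j == len(p2):
            return True
        # A '*' may match the empty string.
        if i < len(p1) and p1[i] == '*' and go(i + 1, j):
            return True
        if j < len(p2) and p2[j] == '*' and go(i, j + 1):
            return True
        if i < len(p1) and j < len(p2):
            c1, c2 = p1[i], p2[j]
            # Both patterns consume the same next character of the
            # common string; a '*' stays put so it can keep matching.
            if c1 == c2 or c1 in '*?' or c2 in '*?':
                ni = i if c1 == '*' else i + 1
                nj = j if c2 == '*' else j + 1
                if (ni, nj) != (i, j) and go(ni, nj):
                    return True
        return False

    return go(0, 0)

patterns_intersect("container_1*", "*_1?")  # True, e.g. "container_15"
patterns_intersect("*.log", "*.txt")        # False: endings conflict
```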

Schema Design

CLP provides a few default schemas.

CLP treats most non-alphanumeric characters as delimiters except for a few like underscores and periods.
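
For illustration, default schemas might look like the following (these regexes are assumptions for the sketch, not CLP's shipped schemas):

```python
import re

# '_' and '.' are not delimiters, so tokens like "container_15" and
# "0.335" survive tokenization intact and can match a schema whole.
VARIABLE_SCHEMAS = {
    "int":        re.compile(r'-?\d+\Z'),
    "float":      re.compile(r'-?\d+\.\d+\Z'),
    "identifier": re.compile(r'\S*\d\S*\Z'),  # any token containing a digit
}
```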

Compressed Persistent Caching

The bottleneck is typically I/O.

A caching mechanism is introduced to avoid redundant segment reading.

Two possible solutions to avoid reading unnecessary data:

  • Sort messages by log type within each segment; this causes two problems:
    • inserting new messages requires recompressing the segment
    • the original message order must be recorded
  • Store each log type in its own segment; this causes complications:
    • compression and search must repeatedly open and close many segments
    • so CLP avoids this strategy for primary storage and uses it only in the cache

CLP’s policy: cache recently read, infrequent log types by storing each log type in its own segment.

Eviction: a cached log type may be evicted if it:

  • has not been recently queried
  • contains more messages than the new log type to be cached
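
A sketch of that eviction test (the cache bookkeeping and the idle threshold are my assumptions):

```python
import time

def eviction_candidates(cache: dict, new_msg_count: int,
                        max_idle_s: float = 3600.0) -> list:
    """'cache' maps logtype_id -> (last_query_time, message_count).
    Evictable entries have not been queried recently and hold more
    messages than the log type we want to cache."""
    now = time.time()
    return [lt for lt, (last_query, count) in cache.items()
            if now - last_query > max_idle_s and count > new_msg_count]
```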

The format of each log type segment in the cache is similar to a regular segment's, with these differences:

  • no log type column
  • each message additionally includes a log file path identifier, a timestamp format identifier and an optional message number.
  • the log type segment is named in a way that it can be easily referenced using the corresponding log type ID

CLP processes a query in two parts:

  • one against the log type cache
  • one against the non-cached segments

Data Scrubbing and Obfuscation

Sensitive values (e.g., usernames) can be replaced directly in the dictionaries, obfuscating every message that references them.
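
Because each dictionary variable value is stored exactly once, replacement only has to touch the dictionary, never the segments; a minimal sketch (names are hypothetical):

```python
def obfuscate(var_dict: dict, sensitive_value: str,
              replacement: str = "<redacted>") -> None:
    """Rewrite one dictionary entry; every compressed message that
    references its ID is obfuscated on decompression."""
    for var_id, value in var_dict.items():
        if value == sensitive_value:
            var_dict[var_id] = replacement
```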

Evaluation

Evaluation focuses on:

  • CLP’s compression ratio and speed
  • CLP’s search performance
  • CLP’s scalability and resource efficiency.

TBA

Questions

What if variables contain a timestamp?

Will "assigned it to container" match "assigned to container"?

Will spaces in the query be matched strictly?

Can pipelines be merged, like search "AA" | grep "CC"?

No

Is it allowed to search while writing (strong consistency), given that some parts are still mutable? Or is remaining inconsistent acceptable?

One log line is treated as one message, so is there no cross-line search support? (Maybe find does not support this either.)

How are the schemas generated? What happens if a new message does not fit any schema? I do not really understand the Schema Design section.

Schemas are given by users, or guessed by heuristics, but the results do not work very well.
