Custom Partitioning Key
In most cases you do not need a partition key, and in most other cases you do not need a partition key more granular than by months.
You should never use too granular of partitioning. Don't partition your data by client identifiers or names. Instead, make a client identifier or name the first column in the ORDER BY expression.
Partitioning is available for the MergeTree family tables, including replicated tables and materialized views.
A partition is a logical combination of records in a table by a specified criterion. You can set a partition by an arbitrary criterion, such as by month, by day, or by event type. Each partition is stored separately to simplify manipulations of this data. When accessing the data, ClickHouse uses the smallest subset of partitions possible. Partitions improve performance for queries containing a partitioning key because ClickHouse will filter for that partition before selecting the parts and granules within the partition.
The partition is specified in the PARTITION BY expr
clause when creating a table. The partition key can be any expression from the table columns. For example, to specify partitioning by month, use the expression toYYYYMM(date_column)
:
CREATE TABLE visits
(
VisitDate Date,
Hour UInt8,
ClientID UUID
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(VisitDate)
ORDER BY Hour;
The partition key can also be a tuple of expressions (similar to the primary key). For example:
ENGINE = ReplicatedCollapsingMergeTree('/clickhouse/tables/name', 'replica1', Sign)
PARTITION BY (toMonday(StartDate), EventType)
ORDER BY (CounterID, StartDate, intHash32(UserID));
In this example, we set partitioning by the event types that occurred during the current week.
By default, the floating-point partition key is not supported. To use it enable the setting allow_floating_point_partition_key.
When inserting new data to a table, this data is stored as a separate part (chunk) sorted by the primary key. In 10-15 minutes after inserting, the parts of the same partition are merged into the entire part.
A merge only works for data parts that have the same value for the partitioning expression. This means you shouldn't make overly granular partitions (more than about a thousand partitions). Otherwise, the SELECT
query performs poorly because of an unreasonably large number of files in the file system and open file descriptors.
Use the system.parts table to view the table parts and partitions. For example, let's assume that we have a visits
table with partitioning by month. Let's perform the SELECT
query for the system.parts
table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'visits'
┌─partition─┬─name──────────────┬─active─┐
│ 201901 │ 201901_1_3_1 │ 0 │