ClickHouse——常规函数




2021-04-06

blog_main_img

ClickHouse 的函数非常多。真正写分析 SQL 的时候,最常撞上的不是高深算法,而是这些普通函数:类型转换、字符串清洗、JSON 提取、数组处理、条件分支、NULL 兜底。

官方把这类函数放在 regular functions 体系里。它们和聚合函数不一样:普通函数通常对每一行输入做计算,然后返回一个值;聚合函数会把多行折成一个结果。

regular function

SELECT
    lowerUTF8(event_name) AS event_name_norm,
    toUInt64OrZero(user_id_text) AS user_id,
    if(amount > 100, 'large', 'normal') AS amount_level
FROM raw_events;

这里每一行都会执行这些表达式,并得到新的列。

ClickHouse regular function pipeline

和聚合函数对比一下:

SELECT
    event_name,
    count() AS total
FROM raw_events
GROUP BY event_name;

count() 是聚合函数,它把一组行折成一个值。lowerUTF8()toUInt64OrZero()if() 这类就是普通函数,它们负责逐行加工。

强类型:别指望它偷偷帮你转

ClickHouse 的类型系统很硬。字符串就是字符串,整数就是整数。你想比较、计算、入库,最好先转成明确类型。

SELECT
    user_id_text,
    toUInt64OrZero(user_id_text) AS user_id
FROM raw_events;

几类常见转换:

SELECT
    toInt64OrNull('-42') AS a,
    toUInt64OrZero('bad') AS b,
    toFloat64OrNull('19.9') AS c,
    toString(12345) AS d;

OrNull 适合保留坏数据状态,后面再统一处理。

OrZero 适合明确知道坏值可以归零的场景。它很方便,但也容易把脏数据悄悄吃掉。

如果你在做数据清洗,我更推荐先用 OrNull,再统计坏值比例:

SELECT
    count() AS total,
    countIf(toUInt64OrNull(user_id_text) IS NULL) AS bad_user_id
FROM raw_events;

条件函数:ifmultiIf

简单二选一用 if

SELECT
    amount,
    if(amount >= 100, 'high', 'low') AS amount_level
FROM orders;

多个分支用 multiIf

SELECT
    amount,
    multiIf(
        amount >= 1000, 'vip',
        amount >= 100, 'normal',
        amount > 0, 'small',
        'empty'
    ) AS amount_level
FROM orders;

multiIf 很适合做规则分层、状态映射、质量标记。

如果规则越来越长,不要把 SQL 写成一屏套娃。可以考虑把映射表做成维表,用 JOIN 或字典来管理。

NULL 处理:先把坑填平

分析查询里,NULL 经常来自类型转换失败、外部表字段缺失、JSON 字段不存在。

常用函数:

SELECT
    ifNull(city, 'unknown') AS city_norm,
    coalesce(city, province, 'unknown') AS region,
    nullIf(trim(name), '') AS name_or_null
FROM users;

含义很直接:

  • ifNull(x, y):如果 x 是 NULL,就用 y
  • coalesce(a, b, c):返回第一个非 NULL 值
  • nullIf(a, b):如果 a = b,返回 NULL

把空字符串转 NULL,再做兜底,是清洗文本字段的常见套路:

SELECT
    ifNull(nullIf(trim(channel), ''), 'unknown') AS channel_norm
FROM raw_events;

字符串清洗:URL、日志、埋点都靠它

字符串函数是分析 SQL 里的高频工具。

ClickHouse 字符串 JSON 函数

大小写归一:

SELECT
    lowerUTF8(event_name) AS event_name_norm
FROM raw_events;

判断是否包含:

SELECT
    url,
    positionCaseInsensitive(url, '/checkout') > 0 AS is_checkout
FROM page_views;

替换脏字符:

SELECT
    replaceRegexpAll(phone_text, '[^0-9]', '') AS phone_digits
FROM user_inputs;

拆分字符串:

SELECT
    splitByChar('/', path) AS path_parts
FROM page_views;

取数组里的某段:

SELECT
    path,
    arrayElement(splitByChar('/', path), 2) AS first_part
FROM page_views;

注意:ClickHouse 数组下标从 1 开始。这个点很容易被写 Python 的习惯带偏。

URL 函数:别自己硬拆链接

URL 字段如果直接用字符串切,SQL 会很脆。ClickHouse 有专门的 URL 处理函数。

SELECT
    url,
    domain(url) AS host,
    path(url) AS url_path,
    queryString(url) AS query_text
FROM page_views;

如果要取参数:

SELECT
    url,
    extractURLParameter(url, 'utm_source') AS utm_source
FROM page_views;

日志分析、广告归因、落地页统计里,这类函数很实用。

JSON 函数:先取字段,再参与计算

很多埋点会把扩展属性放在 JSON 字符串里:

{"device":"ios","scene":"home","score":92}

ClickHouse 可以直接提取:

SELECT
    JSONExtractString(props, 'device') AS device,
    JSONExtractString(props, 'scene') AS scene,
    JSONExtractInt(props, 'score') AS score
FROM raw_events;

如果字段可能不存在,可以加兜底:

SELECT
    ifNull(nullIf(JSONExtractString(props, 'scene'), ''), 'unknown') AS scene
FROM raw_events;

有些 JSON 结构很稳定,而且只需要提取简单字段,可以了解 simpleJSONExtract* 系列。它更偏轻量,但能力边界也更明确。

如果某个 JSON 字段在查询里天天用,就别每次都现场解析了。更好的做法是写入时拆成独立列,或者用物化列承接。

CREATE TABLE raw_events
(
    event_name LowCardinality(String),
    props String,
    device LowCardinality(String) MATERIALIZED JSONExtractString(props, 'device')
)
ENGINE = MergeTree
ORDER BY event_name;

这样查询 device 时就不用反复从 JSON 字符串里挖。

数组函数:标签、路径、明细都能处理

数组函数非常适合处理标签、路径片段、商品列表、特征集合。

ClickHouse 数组和 NULL 函数

把数组每个元素改成小写:

SELECT
    arrayMap(x -> lowerUTF8(x), tags) AS tags_norm
FROM user_profiles;

过滤空标签:

SELECT
    arrayFilter(x -> length(trim(x)) > 0, tags) AS clean_tags
FROM user_profiles;

判断是否有某类标签:

SELECT
    user_id,
    arrayExists(x -> x = 'vip', tags) AS is_vip
FROM user_profiles;

把数组展开成多行:

SELECT
    user_id,
    arrayJoin(tags) AS tag
FROM user_profiles;

arrayJoin 会改变行数。写复杂 SQL 时要特别留意它的位置,否则结果可能被放大。

高阶函数:x -> expr 很香

ClickHouse 的数组函数经常配 lambda 使用:

SELECT
    arrayMap(x -> x * 10, [1, 2, 3]) AS multiplied,
    arrayFilter(x -> x > 2, [1, 2, 3, 4]) AS filtered,
    arrayCount(x -> x >= 60, [45, 80, 90]) AS passed_count;

也可以处理多个数组:

SELECT
    arrayMap((name, score) -> concat(name, ':', toString(score)),
             ['a', 'b', 'c'],
             [90, 75, 88]) AS pairs;

复杂一点的标签评分:

SELECT
    user_id,
    arraySum(arrayMap(x -> multiIf(
        x = 'vip', 10,
        x = 'active', 5,
        x = 'new', 2,
        0
    ), tags)) AS tag_score
FROM user_profiles;

这个写法适合轻量规则。如果规则需要运营配置,还是把规则表独立出来。

字典和 tuple:让表达式更紧凑

tuple 可以把多个值临时包起来:

SELECT
    tuple(user_id, event_name, amount) AS event_tuple
FROM raw_events;

取 tuple 元素:

SELECT
    tupleElement(tuple(user_id, event_name), 2) AS event_name
FROM raw_events;

map 适合处理键值对:

SELECT
    map('device', 'ios', 'scene', 'home') AS kv,
    mapContains(kv, 'device') AS has_device,
    kv['scene'] AS scene;

如果字段结构稳定,普通列通常比 map 更适合高频分析。map 更像灵活扩展区,不要把所有字段都塞进去。

函数可以组合,但别写成迷宫

ClickHouse 函数很适合链式清洗:

SELECT
    lowerUTF8(
        ifNull(
            nullIf(trim(JSONExtractString(props, 'device')), ''),
            'unknown'
        )
    ) AS device_norm
FROM raw_events;

这段 SQL 能跑,但可读性一般。可以用 WITH 把中间步骤拆开:

WITH
    JSONExtractString(props, 'device') AS device_raw,
    nullIf(trim(device_raw), '') AS device_or_null
SELECT
    lowerUTF8(ifNull(device_or_null, 'unknown')) AS device_norm
FROM raw_events;

分析 SQL 不只是写给机器,也写给下一位维护的人。函数嵌套超过三层,就值得拆一拆。

用函数做数据质量检查

函数很适合写质量看板。

SELECT
    count() AS total,
    countIf(toUInt64OrNull(user_id_text) IS NULL) AS bad_user_id,
    countIf(length(trim(event_name)) = 0) AS empty_event,
    countIf(not isValidJSON(props)) AS bad_json
FROM raw_events;

也可以按来源分组:

SELECT
    source,
    count() AS total,
    countIf(toFloat64OrNull(amount_text) IS NULL) AS bad_amount,
    round(bad_amount / total, 4) AS bad_ratio
FROM raw_orders
GROUP BY source
ORDER BY bad_ratio DESC;

数据质量检查别只放在离线脚本里。ClickHouse 查得快,直接拿它做抽样检查和质量统计也很顺手。

Python 执行函数查询

Python 可以用 clickhouse-connect 调 ClickHouse。

pip install clickhouse-connect

创建客户端:

import clickhouse_connect


client = clickhouse_connect.get_client(
    host="127.0.0.1",
    port=8123,
    username="default",
    password="",
)

建一张小表:

client.command("""
CREATE TABLE IF NOT EXISTS raw_events_func_demo
(
    user_id_text String,
    event_name String,
    amount_text String,
    url String,
    tags Array(String),
    props String
)
ENGINE = Memory
""")

插入样例:

rows = [
    [
        "10001",
        "PAY",
        "199.9",
        "https://example.com/checkout?utm_source=ad",
        ["vip", "active"],
        '{"device":"ios","scene":"home","score":92}',
    ],
    [
        "bad",
        "Refund",
        "",
        "https://example.com/refund",
        ["new", ""],
        '{"device":"android","scene":"order","score":80}',
    ],
]

client.insert(
    "raw_events_func_demo",
    rows,
    column_names=[
        "user_id_text",
        "event_name",
        "amount_text",
        "url",
        "tags",
        "props",
    ],
)

查询时使用函数:

result = client.query("""
SELECT
    toUInt64OrNull(user_id_text) AS user_id,
    lowerUTF8(event_name) AS event_name_norm,
    toFloat64OrNull(amount_text) AS amount,
    domain(url) AS host,
    extractURLParameter(url, 'utm_source') AS utm_source,
    arrayFilter(x -> length(trim(x)) > 0, tags) AS clean_tags,
    JSONExtractString(props, 'device') AS device,
    JSONExtractInt(props, 'score') AS score
FROM raw_events_func_demo
ORDER BY event_name_norm
""")

for row in result.result_rows:
    print(row)

这个例子把类型转换、字符串、URL、数组、JSON 函数都串起来了。真实项目里可以把这段 SQL 放进视图、物化视图或数据清洗任务。

用 Python 传参数,别拼 SQL

参数化查询更稳:

sql = """
SELECT
    event_name,
    count() AS total
FROM raw_events_func_demo
WHERE lowerUTF8(event_name) = {event_name:String}
GROUP BY event_name
"""

result = client.query(
    sql,
    parameters={"event_name": "pay"},
)

print(result.result_rows)

不要把用户输入直接拼进 SQL 字符串。哪怕只是内部工具,也尽量保持好习惯。

什么时候把函数结果落成列

如果函数只是临时分析,用查询表达式就行。

如果某个表达式到处都用,而且计算不轻,就考虑落成列:

CREATE TABLE events_clean
(
    event_name String,
    props String,
    device LowCardinality(String) MATERIALIZED JSONExtractString(props, 'device'),
    event_name_norm LowCardinality(String) MATERIALIZED lowerUTF8(event_name)
)
ENGINE = MergeTree
ORDER BY (event_name_norm, device);

或者用物化视图做清洗落表:

CREATE MATERIALIZED VIEW mv_events_clean
TO events_clean
AS
SELECT
    event_name,
    props,
    JSONExtractString(props, 'device') AS device,
    lowerUTF8(event_name) AS event_name_norm
FROM raw_events;

现场算适合灵活探索;落列适合高频查询。别一开始就把所有东西都物化,也别让高频查询一直重复做重计算。