2021-04-06
ClickHouse 的函数非常多。真正写分析 SQL 的时候,最常撞上的不是高深算法,而是这些普通函数:类型转换、字符串清洗、JSON 提取、数组处理、条件分支、NULL 兜底。
官方把这类函数放在 regular functions 体系里。它们和聚合函数不一样:普通函数通常对每一行输入做计算,然后返回一个值;聚合函数会把多行折成一个结果。
SELECT
lowerUTF8(event_name) AS event_name_norm,
toUInt64OrZero(user_id_text) AS user_id,
if(amount > 100, 'large', 'normal') AS amount_level
FROM raw_events;
这里每一行都会执行这些表达式,并得到新的列。
和聚合函数对比一下:
SELECT
event_name,
count() AS total
FROM raw_events
GROUP BY event_name;
count() 是聚合函数,它把一组行折成一个值。lowerUTF8()、toUInt64OrZero()、if() 这类就是普通函数,它们负责逐行加工。
ClickHouse 的类型系统很硬。字符串就是字符串,整数就是整数。你想比较、计算、入库,最好先转成明确类型。
SELECT
user_id_text,
toUInt64OrZero(user_id_text) AS user_id
FROM raw_events;
几类常见转换:
SELECT
toInt64OrNull('-42') AS a,
toUInt64OrZero('bad') AS b,
toFloat64OrNull('19.9') AS c,
toString(12345) AS d;
OrNull 适合保留坏数据状态,后面再统一处理。
OrZero 适合明确知道坏值可以归零的场景。它很方便,但也容易把脏数据悄悄吃掉。
如果你在做数据清洗,我更推荐先用 OrNull,再统计坏值比例:
SELECT
count() AS total,
countIf(toUInt64OrNull(user_id_text) IS NULL) AS bad_user_id
FROM raw_events;
if 和 multiIf简单二选一用 if:
SELECT
amount,
if(amount >= 100, 'high', 'low') AS amount_level
FROM orders;
多个分支用 multiIf:
SELECT
amount,
multiIf(
amount >= 1000, 'vip',
amount >= 100, 'normal',
amount > 0, 'small',
'empty'
) AS amount_level
FROM orders;
multiIf 很适合做规则分层、状态映射、质量标记。
如果规则越来越长,不要把 SQL 写成一屏套娃。可以考虑把映射表做成维表,用 JOIN 或字典来管理。
分析查询里,NULL 经常来自类型转换失败、外部表字段缺失、JSON 字段不存在。
常用函数:
SELECT
ifNull(city, 'unknown') AS city_norm,
coalesce(city, province, 'unknown') AS region,
nullIf(trim(name), '') AS name_or_null
FROM users;
含义很直接:
ifNull(x, y):如果 x 是 NULL,就用 ycoalesce(a, b, c):返回第一个非 NULL 值nullIf(a, b):如果 a = b,返回 NULL把空字符串转 NULL,再做兜底,是清洗文本字段的常见套路:
SELECT
ifNull(nullIf(trim(channel), ''), 'unknown') AS channel_norm
FROM raw_events;
字符串函数是分析 SQL 里的高频工具。
大小写归一:
SELECT
lowerUTF8(event_name) AS event_name_norm
FROM raw_events;
判断是否包含:
SELECT
url,
positionCaseInsensitive(url, '/checkout') > 0 AS is_checkout
FROM page_views;
替换脏字符:
SELECT
replaceRegexpAll(phone_text, '[^0-9]', '') AS phone_digits
FROM user_inputs;
拆分字符串:
SELECT
splitByChar('/', path) AS path_parts
FROM page_views;
取数组里的某段:
SELECT
path,
arrayElement(splitByChar('/', path), 2) AS first_part
FROM page_views;
注意:ClickHouse 数组下标从 1 开始。这个点很容易被写 Python 的习惯带偏。
URL 字段如果直接用字符串切,SQL 会很脆。ClickHouse 有专门的 URL 处理函数。
SELECT
url,
domain(url) AS host,
path(url) AS url_path,
queryString(url) AS query_text
FROM page_views;
如果要取参数:
SELECT
url,
extractURLParameter(url, 'utm_source') AS utm_source
FROM page_views;
日志分析、广告归因、落地页统计里,这类函数很实用。
很多埋点会把扩展属性放在 JSON 字符串里:
{"device":"ios","scene":"home","score":92}
ClickHouse 可以直接提取:
SELECT
JSONExtractString(props, 'device') AS device,
JSONExtractString(props, 'scene') AS scene,
JSONExtractInt(props, 'score') AS score
FROM raw_events;
如果字段可能不存在,可以加兜底:
SELECT
ifNull(nullIf(JSONExtractString(props, 'scene'), ''), 'unknown') AS scene
FROM raw_events;
有些 JSON 结构很稳定,而且只需要提取简单字段,可以了解 simpleJSONExtract* 系列。它更偏轻量,但能力边界也更明确。
如果某个 JSON 字段在查询里天天用,就别每次都现场解析了。更好的做法是写入时拆成独立列,或者用物化列承接。
CREATE TABLE raw_events
(
event_name LowCardinality(String),
props String,
device LowCardinality(String) MATERIALIZED JSONExtractString(props, 'device')
)
ENGINE = MergeTree
ORDER BY event_name;
这样查询 device 时就不用反复从 JSON 字符串里挖。
数组函数非常适合处理标签、路径片段、商品列表、特征集合。
把数组每个元素改成小写:
SELECT
arrayMap(x -> lowerUTF8(x), tags) AS tags_norm
FROM user_profiles;
过滤空标签:
SELECT
arrayFilter(x -> length(trim(x)) > 0, tags) AS clean_tags
FROM user_profiles;
判断是否有某类标签:
SELECT
user_id,
arrayExists(x -> x = 'vip', tags) AS is_vip
FROM user_profiles;
把数组展开成多行:
SELECT
user_id,
arrayJoin(tags) AS tag
FROM user_profiles;
arrayJoin 会改变行数。写复杂 SQL 时要特别留意它的位置,否则结果可能被放大。
x -> expr 很香ClickHouse 的数组函数经常配 lambda 使用:
SELECT
arrayMap(x -> x * 10, [1, 2, 3]) AS multiplied,
arrayFilter(x -> x > 2, [1, 2, 3, 4]) AS filtered,
arrayCount(x -> x >= 60, [45, 80, 90]) AS passed_count;
也可以处理多个数组:
SELECT
arrayMap((name, score) -> concat(name, ':', toString(score)),
['a', 'b', 'c'],
[90, 75, 88]) AS pairs;
复杂一点的标签评分:
SELECT
user_id,
arraySum(arrayMap(x -> multiIf(
x = 'vip', 10,
x = 'active', 5,
x = 'new', 2,
0
), tags)) AS tag_score
FROM user_profiles;
这个写法适合轻量规则。如果规则需要运营配置,还是把规则表独立出来。
tuple 可以把多个值临时包起来:
SELECT
tuple(user_id, event_name, amount) AS event_tuple
FROM raw_events;
取 tuple 元素:
SELECT
tupleElement(tuple(user_id, event_name), 2) AS event_name
FROM raw_events;
map 适合处理键值对:
SELECT
map('device', 'ios', 'scene', 'home') AS kv,
mapContains(kv, 'device') AS has_device,
kv['scene'] AS scene;
如果字段结构稳定,普通列通常比 map 更适合高频分析。map 更像灵活扩展区,不要把所有字段都塞进去。
ClickHouse 函数很适合链式清洗:
SELECT
lowerUTF8(
ifNull(
nullIf(trim(JSONExtractString(props, 'device')), ''),
'unknown'
)
) AS device_norm
FROM raw_events;
这段 SQL 能跑,但可读性一般。可以用 WITH 把中间步骤拆开:
WITH
JSONExtractString(props, 'device') AS device_raw,
nullIf(trim(device_raw), '') AS device_or_null
SELECT
lowerUTF8(ifNull(device_or_null, 'unknown')) AS device_norm
FROM raw_events;
分析 SQL 不只是写给机器,也写给下一位维护的人。函数嵌套超过三层,就值得拆一拆。
函数很适合写质量看板。
SELECT
count() AS total,
countIf(toUInt64OrNull(user_id_text) IS NULL) AS bad_user_id,
countIf(length(trim(event_name)) = 0) AS empty_event,
countIf(not isValidJSON(props)) AS bad_json
FROM raw_events;
也可以按来源分组:
SELECT
source,
count() AS total,
countIf(toFloat64OrNull(amount_text) IS NULL) AS bad_amount,
round(bad_amount / total, 4) AS bad_ratio
FROM raw_orders
GROUP BY source
ORDER BY bad_ratio DESC;
数据质量检查别只放在离线脚本里。ClickHouse 查得快,直接拿它做抽样检查和质量统计也很顺手。
Python 可以用 clickhouse-connect 调 ClickHouse。
pip install clickhouse-connect
创建客户端:
import clickhouse_connect
client = clickhouse_connect.get_client(
host="127.0.0.1",
port=8123,
username="default",
password="",
)
建一张小表:
client.command("""
CREATE TABLE IF NOT EXISTS raw_events_func_demo
(
user_id_text String,
event_name String,
amount_text String,
url String,
tags Array(String),
props String
)
ENGINE = Memory
""")
插入样例:
rows = [
[
"10001",
"PAY",
"199.9",
"https://example.com/checkout?utm_source=ad",
["vip", "active"],
'{"device":"ios","scene":"home","score":92}',
],
[
"bad",
"Refund",
"",
"https://example.com/refund",
["new", ""],
'{"device":"android","scene":"order","score":80}',
],
]
client.insert(
"raw_events_func_demo",
rows,
column_names=[
"user_id_text",
"event_name",
"amount_text",
"url",
"tags",
"props",
],
)
查询时使用函数:
result = client.query("""
SELECT
toUInt64OrNull(user_id_text) AS user_id,
lowerUTF8(event_name) AS event_name_norm,
toFloat64OrNull(amount_text) AS amount,
domain(url) AS host,
extractURLParameter(url, 'utm_source') AS utm_source,
arrayFilter(x -> length(trim(x)) > 0, tags) AS clean_tags,
JSONExtractString(props, 'device') AS device,
JSONExtractInt(props, 'score') AS score
FROM raw_events_func_demo
ORDER BY event_name_norm
""")
for row in result.result_rows:
print(row)
这个例子把类型转换、字符串、URL、数组、JSON 函数都串起来了。真实项目里可以把这段 SQL 放进视图、物化视图或数据清洗任务。
参数化查询更稳:
sql = """
SELECT
event_name,
count() AS total
FROM raw_events_func_demo
WHERE lowerUTF8(event_name) = {event_name:String}
GROUP BY event_name
"""
result = client.query(
sql,
parameters={"event_name": "pay"},
)
print(result.result_rows)
不要把用户输入直接拼进 SQL 字符串。哪怕只是内部工具,也尽量保持好习惯。
如果函数只是临时分析,用查询表达式就行。
如果某个表达式到处都用,而且计算不轻,就考虑落成列:
CREATE TABLE events_clean
(
event_name String,
props String,
device LowCardinality(String) MATERIALIZED JSONExtractString(props, 'device'),
event_name_norm LowCardinality(String) MATERIALIZED lowerUTF8(event_name)
)
ENGINE = MergeTree
ORDER BY (event_name_norm, device);
或者用物化视图做清洗落表:
CREATE MATERIALIZED VIEW mv_events_clean
TO events_clean
AS
SELECT
event_name,
props,
JSONExtractString(props, 'device') AS device,
lowerUTF8(event_name) AS event_name_norm
FROM raw_events;
现场算适合灵活探索;落列适合高频查询。别一开始就把所有东西都物化,也别让高频查询一直重复做重计算。