
在关系型数据库中,当我们需要从多个通过一对多或多对多关系连接的表中汇总数据时,常常会遇到一个被称为“聚合陷阱”的问题。例如,一个销售记录(sale)可以有多个销售明细(sale_lines),也可以有多个现金交易记录(cash_transactions)。如果直接将这些表连接起来,再对某个字段进行求和,sql server(或其他数据库)会先生成一个包含所有连接行的大结果集,然后在这个结果集上进行聚合。这导致与主表(sale)相关的重复行被多次计算,从而得出错误的汇总结果。
考虑以下数据库结构:
其中,sale与sale_lines是一对多关系,sale与cash_transactions也是一对多关系。cash_transactions中的received_currency_id和converted_currency_id可能与sale的currency_items_sold_in不同。
示例数据模型 (DDL):
CREATE TABLE currency (
iso_number CHARACTER VARYING(3) PRIMARY KEY,
iso_code CHARACTER VARYING(3)
);
INSERT INTO currency(iso_number, iso_code) VALUES ('208','DKK'), ('752','SEK'), ('572','NOK');
CREATE TABLE product (
id SERIAL PRIMARY KEY,
name CHARACTER VARYING(12),
current_price INTEGER
);
INSERT INTO product(id,name,current_price) VALUES (1,'icecream',200), (2,'sunglasses',300);
CREATE TABLE sale (
id SERIAL PRIMARY KEY,
time_of_sale TIMESTAMP,
currency_items_sold_in CHARACTER VARYING(3)
);
INSERT INTO sale(id, time_of_sale, currency_items_sold_in)
VALUES
(1, CURRENT_TIMESTAMP, '208'), -- 销售1以DKK计价
(2, CURRENT_TIMESTAMP, '752') -- 销售2以SEK计价
;
CREATE TABLE sale_lines (
id SERIAL PRIMARY KEY,
sale_id INTEGER,
product_id INTEGER,
price_paid INTEGER,
quantity FLOAT
);
INSERT INTO sale_lines(id, sale_id, product_id, price_paid, quantity)
VALUES
(1, 1, 1, 200, 1.0), -- 销售1明细1
(2, 1, 2, 300, 1.0), -- 销售1明细2
(3, 2, 1, 100, 1.0), -- 销售2明细1
(4, 2, 1, 100, 1.0) -- 销售2明细2
;
CREATE TABLE cash_transactions (
id SERIAL PRIMARY KEY,
sale_id INTEGER,
received_currency_id CHARACTER VARYING(3),
converted_currency_id CHARACTER VARYING(3),
received_amount INTEGER,
converted_amount INTEGER
);
INSERT INTO cash_transactions(id, sale_id, received_currency_id, converted_currency_id, received_amount, converted_amount)
VALUES
(1, 1, '208', '208', 200, 200), -- 销售1交易1: DKK -> DKK
(2, 1, '752', '208', 400, 300), -- 销售1交易2: SEK -> DKK (收到SEK 400,转换为DKK 300)
(3, 2, '572', '208', 150, 100), -- 销售2交易1: NOK -> DKK (收到NOK 150,转换为DKK 100)
(4, 2, '208', '208', 100, 100) -- 销售2交易2: DKK -> DKK
;问题表现:直接连接与聚合
如果我们尝试直接连接所有相关表并按sale的币种分组求和,sale_lines.price_paid和cash_transactions的金额都会因行重复而计算错误。
-- 错误示例:直接连接并聚合 SELECT s.currency_items_sold_in, SUM(sl.price_paid) as "price_paid", SUM(ct.received_amount) as "total_received_amount", SUM(ct.converted_amount) as "total_converted_amount" FROM sale s LEFT JOIN sale_lines sl ON sl.sale_id = s.id LEFT JOIN cash_transactions ct ON ct.sale_id = s.id GROUP BY s.currency_items_sold_in;
上述查询将产生不正确的结果,因为sale_lines和cash_transactions的行数可能不同,导致sale的每一行被重复多次,进而导致SUM操作重复计算。
为了解决行重复导致的聚合问题,一种常见的方法是在连接之前,先在子查询中对每个一对多关系进行聚合。这种方法被称为“Monkey-Poop”方法,即“先拉出来(聚合),再拼回去(连接)”。
-- 子查询预聚合方法 (部分解决,但仍有币种问题)
SELECT
s.currency_items_sold_in,
SUM(sale_line_aggregates.price_paid) as "total_price_paid",
SUM(cash_transaction_aggregates.converted_amount) as "total_converted_amount",
SUM(cash_transaction_aggregates.received_amount) as "total_received_amount"
FROM sale s
LEFT JOIN (
SELECT
sale_id,
SUM(price_paid) AS price_paid
FROM sale_lines
GROUP BY sale_id
) AS sale_line_aggregates ON sale_line_aggregates.sale_id = s.id
LEFT JOIN (
SELECT
sale_id,
SUM(converted_amount) as converted_amount,
SUM(received_amount) as received_amount
FROM cash_transactions
GROUP BY sale_id
) AS cash_transaction_aggregates ON cash_transaction_aggregates.sale_id = s.id
GROUP BY s.currency_items_sold_in;这个查询解决了sale_lines和cash_transactions各自内部的重复计算问题。然而,它仍然存在一个关键的限制:它尝试将所有金额都聚合到sale的currency_items_sold_in这个维度下。对于cash_transactions,received_amount和converted_amount可能涉及多种不同的币种(received_currency_id和converted_currency_id),直接按sale的币种分组会导致不同币种的金额被混合在一起求和,失去实际业务意义。例如,它会将DKK、SEK、NOK等不同币种的收到金额简单相加,这在财务上是无意义的。
为了彻底解决上述问题,我们需要更精细地处理币种维度。解决方案的核心思想是:
以下是使用CTE实现的完整解决方案:
WITH CTE_SALE AS (
-- 定义一个CTE来获取所有销售的主ID和销售币种
SELECT
id AS sale_id,
currency_items_sold_in AS iso_number
FROM sale
)
SELECT
curr.iso_code AS currency,
COALESCE(line.price_paid, 0) AS total_price_paid,
COALESCE(received.amount, 0) AS total_received_amount,
COALESCE(converted.amount, 0) AS total_converted_amount
FROM currency AS curr
LEFT JOIN (
-- 聚合销售明细金额,按销售主币种分组
SELECT
s.iso_number,
SUM(sl.price_paid) AS price_paid
FROM sale_lines sl
JOIN CTE_SALE s ON s.sale_id = sl.sale_id
GROUP BY s.iso_number
) AS line
ON line.iso_number = curr.iso_number
LEFT JOIN (
-- 聚合收到的金额,按收到的币种分组
SELECT
tr.received_currency_id AS iso_number,
SUM(tr.received_amount) AS amount
FROM cash_transactions tr
JOIN CTE_SALE s ON s.sale_id = tr.sale_id
GROUP BY tr.received_currency_id
) AS received
ON received.iso_number = curr.iso_number
LEFT JOIN (
-- 聚合转换后的金额,按转换后的币种分组
SELECT
tr.converted_currency_id AS iso_number,
SUM(tr.converted_amount) AS amount
FROM cash_transactions AS tr
JOIN CTE_SALE s ON s.sale_id = tr.sale_id
GROUP BY tr.converted_currency_id
) AS converted
ON converted.iso_number = curr.iso_number;结果示例:
currency | total_price_paid | total_received_amount | total_converted_amount :------- | ---------------: | --------------------: | ---------------------: DKK | 500 | 300 | 700 SEK | 200 | 400 | 0 NOK | 0 | 150 | 0
代码解析:
通过这种方法,我们成功地将不同币种的金额分离开来,并分别进行准确的聚合,最终在一个统一的视图中展示了按币种划分的销售明细总额、收到总额和转换总额。
处理包含复杂多对多关系和多币种数据的SQL聚合是数据库查询中的一个高级挑战。通过本文介绍的利用CTE和币种维度进行预聚合的方法,我们能够有效地避免因行重复导致的聚合错误,并确保在多维度业务场景下获得准确、有意义的统计数据。这种方法不仅解决了技术难题,也提升了查询的可读性和可维护性,是专业数据库开发和数据分析中不可或缺的技能。
每个人都需要一台速度更快、更稳定的 PC。随着时间的推移,垃圾文件、旧注册表数据和不必要的后台进程会占用资源并降低性能。幸运的是,许多工具可以让 Windows 保持平稳运行。
Copyright 2014-2025 https://www.php.cn/ All Rights Reserved | php.cn | 湘ICP备2023035733号