SQL Aggregation functions, CUBE, RollUP & Grouping Sets. ~ <a href="http://learnprogrammingdata.blogspot.com/">IT-Works</a>

ROLLUP() explained in SQL

The SQL standard defines special functions that can be used in the GROUP BY clause: the grouping functions. These functions can be used to generate several groupings in a single clause. This can best be explained in SQL. Let's take ROLLUP() for instance:

ROLLUP() grouping function provides N+1 groupings, when N is the number of arguments to the ROLLUP() function. Each grouping has an additional group field from the ROLLUP() argument field list. The results of the second query might look something like this:

-- ROLLUP() with one argument
SELECT AUTHOR_ID, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID)

-- ROLLUP() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID, PUBLISHED_IN)

+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
| 1 | 1945 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 1 | 1948 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 1 | NULL | 2 | <- GROUP BY (AUTHOR_ID)
| 2 | 1988 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 2 | 1990 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 2 | NULL | 2 | <- GROUP BY (AUTHOR_ID)
| NULL | NULL | 4 | <- GROUP BY ()
+-----------+--------------+----------+

CUBE() explained in SQL

CUBE() is different from ROLLUP() in the way that it doesn't just create N+1 groupings, it creates all 2^N possible combinations between all group fields in the CUBE() function argument list. Let's re-consider our second query from before:

-- CUBE() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY CUBE(AUTHOR_ID, PUBLISHED_IN)

The results would then hold:
+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
| NULL | NULL | 2 | <- GROUP BY ()
| NULL | 1945 | 1 | <- GROUP BY (PUBLISHED_IN)
| NULL | 1948 | 1 | <- GROUP BY (PUBLISHED_IN)
| NULL | 1988 | 1 | <- GROUP BY (PUBLISHED_IN)
| NULL | 1990 | 1 | <- GROUP BY (PUBLISHED_IN)
| 1 | NULL | 2 | <- GROUP BY (AUTHOR_ID)
| 1 | 1945 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 1 | 1948 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 2 | NULL | 2 | <- GROUP BY (AUTHOR_ID)
| 2 | 1988 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
| 2 | 1990 | 1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
+-----------+--------------+----------+

GROUPING SETS()

GROUPING SETS() are the generalised way to create multiple groupings. From our previous examples
ROLLUP(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), ())
CUBE(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), (PUBLISHED_IN), ())

More explanation on the aggregation functions.

Aggregate functions can be nested:

SELECT
AVG(
MAX(SAL)
)
FROM
EMP
GROUP BY
DEPTNO;

AVG(MAX(SAL))
-------------
3616.66667

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 4 (25)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 7 | 4 (25)| 00:00:01 |
| 2 | SORT GROUP BY | | 1 | 7 | 4 (25)| 00:00:01 |
| 3 | TABLE ACCESS FULL| EMP | 14 | 98 | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------

The average of the maximum salary of each department is returned.

A query containing a nested aggregation requires one group by clause and returns one row.

The CUBE, ROLLUP and GROUPING SETS functions are used in the GROUP BY clause to generate totals and subtotals.

SELECT
DEPTNO,
SUM(SAL)
FROM
EMP
GROUP BY
ROLLUP(DEPTNO);

DEPTNO SUM(SAL)
---------- ----------
10 8750
20 10875
30 9400
29025

-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 21 | 4 (25)| 00:00:01 |
| 1 | SORT GROUP BY ROLLUP| | 3 | 21 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS FULL | EMP | 14 | 98 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------

The sum of salaries is aggregated per department, and then an additional row containing the total of all employees' salaries is returned.

The ROLLUP has generated four rows:

One for each group

One for the department

One for the total of all groups

ROLLUP can have more than one dimension, generating grand totals and subtotals:

SELECT
DEPTNO,
JOB,
SUM(SAL)
FROM
EMP
GROUP BY
ROLLUP(DEPTNO,JOB);

DEPTNO JOB SUM(SAL)
-------- --------- ----------
10 CLERK 1300
10 MANAGER 2450
10 PRESIDENT 5000
10 8750
20 CLERK 1900
20 ANALYST 6000
20 MANAGER 2975
20 10875
30 CLERK 950
30 MANAGER 2850
30 SALESMAN 5600
30 9400
29025

The sum generates a grand total, a subtotal per department and a sub-subtotal per department and job.

It also possible to mix ROLLUP groups with normal groups in the GROUP BY clause:

SELECT
DEPTNO,
JOB,
SUM(SAL)
FROM
EMP
GROUP BY
DEPTNO,ROLLUP(JOB);

DEPTNO JOB SUM(SAL)
-------- --------- ----------
10 CLERK 1300
10 MANAGER 2450
10 PRESIDENT 5000
10 8750
20 CLERK 1900
20 ANALYST 6000
20 MANAGER 2975
20 10875
30 CLERK 950
30 MANAGER 2850
30 SALESMAN 5600
30 9400

A subtotal per department and a sub-subtotal per department and job are returned. As DEPTNO is a standard GROUP BY expression, there is no grand total.

SELECT
MAX(ENAME)
KEEP
(
DENSE_RANK
FIRST
ORDER BY
SAL DESC
) ENAME,
DEPTNO,
JOB,
MAX(SAL) SAL
FROM
EMP
GROUP BY
CUBE(DEPTNO,JOB);

ENAME DEPTNO JOB SAL
---------- ---------- --------- ----------
KING 5000
MILLER CLERK 1300
SCOTT ANALYST 3000
JONES MANAGER 2975
ALLEN SALESMAN 1600
KING PRESIDENT 5000
KING 10 5000
MILLER 10 CLERK 1300
CLARK 10 MANAGER 2450
KING 10 PRESIDENT 5000
SCOTT 20 3000
ADAMS 20 CLERK 1100
SCOTT 20 ANALYST 3000
JONES 20 MANAGER 2975
BLAKE 30 2850
JAMES 30 CLERK 950
BLAKE 30 MANAGER 2850
ALLEN 30 SALESMAN 1600

-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 11 | 231 | 4 (25)| 00:00:01 |
| 1 | SORT GROUP BY | | 11 | 231 | 4 (25)| 00:00:01 |
| 2 | GENERATE CUBE | | 11 | 231 | 4 (25)| 00:00:01 |
| 3 | SORT GROUP BY | | 11 | 231 | 4 (25)| 00:00:01 |
| 4 | TABLE ACCESS FULL| EMP | 14 | 294 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------

CUBE generates nine rows for the various job/departments, three rows for the subtotal per department, three rows for the subtotal per job and one row for the grand total. For each job and department, the maximum salary is returned, and for this salary, the name of the corresponding employee is returned. In case of duplicates, the MAX function is applied to the name.

With this result set, a superaggregation is performed for each dimension, the department, the job, and the overall. In addition to the rows that are aggregated by ROLLUP, CUBE also produces a row for each job.

GROUPING SETS simplifies the management of the subtotals:

SELECT
DEPTNO,
JOB,
MIN(SAL)
FROM
EMP
GROUP BY
GROUPING SETS
(
(JOB),
(DEPTNO)
);

DEPTNO JOB MIN(SAL)
---------- --------- ----------
ANALYST 3000
CLERK 800
MANAGER 2450
PRESIDENT 5000
SALESMAN 1250
10 1300
20 800
30 950

GROUPING SET is a subset of CUBE. The minimum salary in each job and the minimum salary in each department are returned.

Compare with:

SELECT
CASE
WHEN GROUPING_ID(JOB, DEPTNO)=1
THEN 'Count per job'
WHEN GROUPING_ID(JOB, DEPTNO)=2
THEN 'Count per department'
END " ",
CASE
WHEN GROUPING(JOB)=0
THEN JOB
ELSE '========='
END JOB,
CASE
WHEN GROUPING(DEPTNO)=0
THEN TO_CHAR(DEPTNO,'99999')
ELSE '======'
END DEPTNO,
COUNT(*)
FROM
EMP
GROUP BY
CUBE(JOB, DEPTNO)
HAVING
GROUPING_ID (JOB, DEPTNO) in (0,1,2);

JOB DEPTNO COUNT(*)
-------------------- --------- ------ ----------
CLERK 10 1
CLERK 20 2
CLERK 30 1
ANALYST 20 2
MANAGER 10 1
MANAGER 20 1
MANAGER 30 1
SALESMAN 30 4
PRESIDENT 10 1
Count per job CLERK ====== 4
Count per job ANALYST ====== 2
Count per job MANAGER ====== 3
Count per job SALESMAN ====== 4
Count per job PRESIDENT ====== 1
Count per department ========= 10 3
Count per department ========= 20 5
Count per department ========= 30 6

CUBE generates counts in every possible dimension including overall counts. The HAVING clause evaluates the GROUPING_ID function. When no superaggregation occurs, it gets the value of 0; when the first argument group is a subtotal, it gets 1 (20); when the second is a subtotal, it gets 2 (21). If it is a grand total for both the first and the second arguments, it gets 3 (20+21). Only the 0, 1 and 2 grouping ids are returned. GROUPING returns 1 when the column is summarized and otherwise, 0.

There is one more function related to grouping that is called GROUP_ID which is useful when there are duplicate subtotals:

SELECT
CASE
WHEN GROUPING(ENAME)=0
THEN NULL
WHEN GROUP_ID()=0
THEN 'SUM'
WHEN GROUP_ID()=1
THEN 'AVG'
END TYPE,
ENAME,
DEPTNO,
CASE
WHEN GROUPING(ENAME)=0
THEN NULL
WHEN GROUP_ID()=0
THEN SUM(SAL)
WHEN GROUP_ID()=1
THEN AVG(SAL)
END VALUE
FROM
EMP
GROUP BY
GROUPING SETS
(
(ENAME,DEPTNO),
(DEPTNO),
(DEPTNO),
(),
()
)
DEPTNO,
ENAME,
GROUP_ID();

TYP ENAME DEPTNO VALUE
--- ---------- ---------- ----------
CLARK 10
KING 10
MILLER 10

SUM 10 8750
AVG 10 2916.66667

ADAMS 20
FORD 20
JONES 20
SCOTT 20
SMITH 20

SUM 20 10875
AVG 20 2175
ALLEN 30
BLAKE 30
JAMES 30
MARTIN 30
TURNER 30
WARD 30

SUM 30 9400
AVG 30 1566.66667

SUM 29025
AVG 2073.21429

IT-Works

Saturday, August 5, 2017

SQL Aggregation functions, CUBE, RollUP & Grouping Sets.

0 comments:

Post a Comment

Popular Posts

Blogger templates

Blogroll

Categories

Blog Archive

About

Search This Blog

Labels

Blogger templates

Pages List

About Me