Saturday, August 5, 2017

SQL Aggregation functions, CUBE, RollUP & Grouping Sets.



ROLLUP() explained in SQL

The SQL standard defines special functions that can be used in the GROUP BY clause: the grouping functions. These functions can be used to generate several groupings in a single clause. This can best be explained in SQL. Let's take ROLLUP() for instance:


ROLLUP() grouping function provides N+1 groupings, when N is the number of arguments to the ROLLUP() function. Each grouping has an additional group field from the ROLLUP() argument field list. The results of the second query might look something like this:

-- ROLLUP() with one argument
SELECT AUTHOR_ID, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID)


-- ROLLUP() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID, PUBLISHED_IN)


+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
|         1 |         1945 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         1948 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         2 |         1988 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         1990 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|      NULL |         NULL |        4 | <- GROUP BY ()
+-----------+--------------+----------+



CUBE() explained in SQL

CUBE() is different from ROLLUP() in the way that it doesn't just create N+1 groupings, it creates all 2^N possible combinations between all group fields in the CUBE() function argument list. Let's re-consider our second query from before:


-- CUBE() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY CUBE(AUTHOR_ID, PUBLISHED_IN)


The results would then hold:
+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
|      NULL |         NULL |        2 | <- GROUP BY ()
|      NULL |         1945 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1948 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1988 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1990 |        1 | <- GROUP BY (PUBLISHED_IN)
|         1 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         1 |         1945 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         1948 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         2 |         1988 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         1990 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
+-----------+--------------+----------+



GROUPING SETS()

GROUPING SETS() are the generalised way to create multiple groupings. From our previous examples
ROLLUP(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), ())
CUBE(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), (PUBLISHED_IN), ())



More explanation on the aggregation functions.

Aggregate functions can be nested:

SELECT
   AVG(
      MAX(SAL)
   )
FROM
   EMP
GROUP BY
   DEPTNO;

AVG(MAX(SAL))
-------------
   3616.66667

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     7 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     7 |     4  (25)| 00:00:01 |
|   2 |   SORT GROUP BY     |      |     1 |     7 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    98 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------

The average of the maximum salary of each department is returned.

A query containing a nested aggregation requires one group by clause and returns one row.

The CUBE, ROLLUP and GROUPING SETS functions are used in the GROUP BY clause to generate totals and subtotals.

SELECT
   DEPTNO,
   SUM(SAL)
FROM
   EMP
GROUP BY
   ROLLUP(DEPTNO);

    DEPTNO   SUM(SAL)
---------- ----------
        10       8750
        20      10875
        30       9400
                29025

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     3 |    21 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY ROLLUP|      |     3 |    21 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | EMP  |    14 |    98 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

The sum of salaries is aggregated per department, and then an additional row containing the total of all employees' salaries is returned.

The ROLLUP has generated four rows:

One for each group

One for the department

One for the total of all groups

ROLLUP can have more than one dimension, generating grand totals and subtotals:

SELECT
   DEPTNO,
   JOB,
   SUM(SAL)
FROM
   EMP
GROUP BY
   ROLLUP(DEPTNO,JOB);

  DEPTNO JOB         SUM(SAL)
-------- --------- ----------
      10 CLERK           1300
        10 MANAGER         2450
      10 PRESIDENT       5000
      10                 8750
      20 CLERK           1900
      20 ANALYST         6000
      20 MANAGER         2975
      20                10875
      30 CLERK            950
      30 MANAGER         2850
      30 SALESMAN        5600
      30                 9400
                        29025

The sum generates a grand total, a subtotal per department and a sub-subtotal per department and job.

It also possible to mix ROLLUP groups with normal groups in the GROUP BY clause:

SELECT
   DEPTNO,
   JOB,
   SUM(SAL)
FROM
   EMP
GROUP BY
   DEPTNO,ROLLUP(JOB);

  DEPTNO JOB         SUM(SAL)
-------- --------- ----------
      10 CLERK           1300
    10 MANAGER         2450
    10 PRESIDENT       5000
    10                 8750
    20 CLERK           1900
    20 ANALYST         6000
    20 MANAGER         2975
    20                10875
    30 CLERK            950
    30 MANAGER         2850
    30 SALESMAN        5600
    30                 9400

A subtotal per department and a sub-subtotal per department and job are returned. As DEPTNO is a standard GROUP BY expression, there is no grand total.

SELECT
   MAX(ENAME)
   KEEP
   (
      DENSE_RANK
         FIRST
      ORDER BY
         SAL DESC
   ) ENAME,
   DEPTNO,
   JOB,
   MAX(SAL) SAL
FROM
   EMP
GROUP BY
   CUBE(DEPTNO,JOB);

ENAME          DEPTNO JOB              SAL
---------- ---------- --------- ----------
KING                                  5000
MILLER                CLERK           1300
SCOTT                 ANALYST         3000
JONES                 MANAGER         2975
ALLEN                 SALESMAN        1600
KING                  PRESIDENT       5000
KING               10                 5000
MILLER             10 CLERK           1300
CLARK              10 MANAGER         2450
KING               10 PRESIDENT       5000
SCOTT              20                 3000
ADAMS              20 CLERK           1100
SCOTT              20 ANALYST         3000
JONES              20 MANAGER         2975
BLAKE              30                 2850
JAMES              30 CLERK            950
BLAKE              30 MANAGER         2850
ALLEN              30 SALESMAN        1600

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    11 |   231 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY       |      |    11 |   231 |     4  (25)| 00:00:01 |
|   2 |   GENERATE CUBE      |      |    11 |   231 |     4  (25)| 00:00:01 |
|   3 |    SORT GROUP BY     |      |    11 |   231 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   294 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

CUBE generates nine rows for the various job/departments, three rows for the subtotal per department, three rows for the subtotal per job and one row for the grand total. For each job and department, the maximum salary is returned, and for this salary, the name of the corresponding employee is returned.  In case of duplicates, the MAX function is applied to the name.

With this result set, a superaggregation is performed for each dimension, the department, the job, and the overall. In addition to the rows that are aggregated by ROLLUP, CUBE also produces a row for each job.

GROUPING SETS simplifies the management of the subtotals:

SELECT
   DEPTNO,
   JOB,
   MIN(SAL)
FROM
   EMP
GROUP BY
   GROUPING SETS
   (
      (JOB),
      (DEPTNO)
   );


    DEPTNO JOB         MIN(SAL)
---------- --------- ----------
           ANALYST         3000
           CLERK            800
           MANAGER         2450
           PRESIDENT       5000
              SALESMAN        1250
        10                 1300
        20                  800
        30                  950

GROUPING SET is a subset of CUBE. The minimum salary in each job and the minimum salary in each department are returned.

Compare with:

SELECT
   CASE
      WHEN GROUPING_ID(JOB, DEPTNO)=1
      THEN 'Count per job'
      WHEN GROUPING_ID(JOB, DEPTNO)=2
      THEN 'Count per department'
   END " ",
   CASE
      WHEN GROUPING(JOB)=0
      THEN JOB
      ELSE '========='
   END JOB,
   CASE
      WHEN GROUPING(DEPTNO)=0
      THEN TO_CHAR(DEPTNO,'99999')
      ELSE '======'
   END DEPTNO,
   COUNT(*)
FROM
   EMP
GROUP BY
   CUBE(JOB, DEPTNO)
HAVING
   GROUPING_ID (JOB, DEPTNO) in (0,1,2);

                     JOB       DEPTNO   COUNT(*)
-------------------- --------- ------ ----------
                     CLERK         10          1
                     CLERK         20          2
                     CLERK         30          1
                     ANALYST       20          2
                     MANAGER       10          1
                     MANAGER       20          1
                     MANAGER       30          1
                     SALESMAN      30          4
                     PRESIDENT     10          1
Count per job        CLERK     ======          4
Count per job        ANALYST   ======          2
Count per job        MANAGER   ======          3
Count per job        SALESMAN  ======          4
Count per job        PRESIDENT ======          1
Count per department =========     10          3
Count per department =========     20          5
Count per department =========     30          6

CUBE generates counts in every possible dimension including overall counts. The HAVING clause evaluates the GROUPING_ID function. When no superaggregation occurs, it gets the value of 0; when the first argument group is a subtotal, it gets 1 (20); when the second is a subtotal, it gets 2 (21). If it is a grand total for both the first and the second arguments, it gets 3 (20+21). Only the 0, 1 and 2 grouping ids are returned. GROUPING returns 1 when the column is summarized and otherwise, 0.

There is one more function related to grouping that is called GROUP_ID which is useful when there are duplicate subtotals:

SELECT
   CASE
      WHEN GROUPING(ENAME)=0
      THEN NULL
      WHEN GROUP_ID()=0
      THEN 'SUM'
      WHEN GROUP_ID()=1
      THEN 'AVG'
   END TYPE,
   ENAME,
   DEPTNO,
   CASE
      WHEN GROUPING(ENAME)=0
      THEN NULL
      WHEN GROUP_ID()=0
      THEN SUM(SAL)
      WHEN GROUP_ID()=1
      THEN AVG(SAL)
   END VALUE
FROM
   EMP
GROUP BY
   GROUPING SETS
   (
      (ENAME,DEPTNO),
      (DEPTNO),
      (DEPTNO),
      (),
      ()
   )
   DEPTNO,
   ENAME,
   GROUP_ID();

TYP ENAME          DEPTNO      VALUE
--- ---------- ---------- ----------
     CLARK              10
    KING               10
    MILLER             10

SUM                    10       8750
AVG                    10 2916.66667

    ADAMS              20
    FORD               20
    JONES              20
    SCOTT              20
    SMITH              20

SUM                    20      10875
AVG                    20       2175
      ALLEN              30
    BLAKE              30
    JAMES              30
    MARTIN             30
    TURNER             30
    WARD               30

SUM                    30       9400
AVG                    30 1566.66667

SUM                            29025
AVG                       2073.21429

0 comments:

Post a Comment