Monday, July 3, 2017

Hive RANK function

RANK() function in SQL is used to allocate a sequential incremented number (rank) to each of the row in question, based on a particular column. It acts like ROW_NUMBER function with only difference that if two rows have same value, they will be given same rank.

ROW_NUMBER(): attributes a unique value to each row
RANK(): attributes the same row number to the same value, leaving "holes"
DENSE_RANK(): attributes the same row number to the same value, leaving no "holes"

Let's do this by creating a table in HIVE.

CREATE TABLE test_test AS
SELECT 'a' v FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'b' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'd' FROM dual UNION ALL
SELECT 'e' FROM dual;

select v, row_number() over (order by v) row_number, rank() over (order by v) rank, dense_rank() over (order by v) dense_rank from test_test order by v;

After running this query, that took 2 mapreduce jobs and here is following output.

a 3 1 1
a 2 1 1
a 1 1 1
b 4 4 2
c 6 5 3
c 5 5 3
d 7 7 4
e 8 8 5

rank() : It is used to rank a record within a group of rows.
dense_rank() : The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks.

Query -
select
ENAME,SAL,
RANK() over (order by SAL) RANK
from EMP;

Output -
+--------+------+------+
| ENAME | SAL | RANK |
+--------+------+------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 6 |
+--------+------+------+

Query -
select
ENAME,SAL,dense_rank() over (order by SAL) DEN_RANK
from
EMP;

Output -
+--------+------+-----------+
| ENAME | SAL | DEN_RANK |
+--------+------+-----------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 5 |
+--------+------+-----------+

Let's understand with this simple query on the sample EMP table in SCOTT schema:

SELECT empno, deptno, sal, 
rank() over(partition BY deptno order by sal) rn
FROM emp;

     EMPNO     DEPTNO        SAL         RN
---------- ---------- ---------- ----------
      7934         10       1300          1
      7782         10       2450          2
      7839         10       5000          3
      7369         20        800          1
      7876         20       1100          2
      7566         20       2975          3
      7788         20       3000          4
      7902         20       3000          4
      7900         30        950          1
      7654         30       1250          2
      7521         30       1250          2
      7844         30       1500          4
      7499         30       1600          5
      7698         30       2850          6

14 rows selected.

RANK function is a built in analytic function which is used to rank a record within a group of rows.
The PARTITION BY clause in the window is what groups the rows, and the ORDER BY clause tells how to rank, i.e. which row in each group will hold the 1st rank and then assign next rank to the next rows in that order.
  • GROUP
  • SORT
  • Assign rank
So, in above example, the rows are grouped by department, and ordered by salary. In each group, the rank is assigned starting from the lowest salary(ascending order). When there is a tie, the rank is not incremented, however, the immediate next row with change in value will not have consecutive sequence. And that's what happened here:

7654         30       1250          2
7521         30       1250          2
7844         30       1500          4

The rank is not consecutive because there is a tie between two rows with salary 1250. To keep the sequence consecutive, you need to use DENSE_RANK.
Another set of example.

Windowing and Analytics functions.

1. Sales Rep data
Here is a CSV file with Sales Rep data:
$ more reps.csv 
1,William,2
2,Nadia,1
3,Daniel,2
4,Jana,1
Create a Hive table for the Sales Rep data:
create table SalesRep (
  RepID INT,
  RepName STRING,
  Territory INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

... and load the CSV into the Hive Sales Rep table:
LOAD DATA 
 LOCAL INPATH '/home/hadoop/MyDemo/reps.csv'
 INTO TABLE SalesRep;

2. Purchase Order data
Here is a CSV file with PO data:
$ more purchases.csv 
4,1,100
2,2,200
2,3,600
3,4,80
4,5,120
1,6,170
3,7,140
Create a Hive table for the PO's:
create table purchases (
SalesRepId INT,
PurchaseOrderId INT,
Amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
... and load CSV into the Hive PO table:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
INTO TABLE purchases;

3. Hive JOIN
So this is the underlining data that is being worked with:
SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;
PO IDRep
Amount
Territory
1Jana1001
2Nadia2001
3Nadia6001
4Daniel802
5Jana1201
6William1702
7Daniel1402

4. Hive Rank by Volume only
SELECT 
  s.RepName, s.Territory, V.volume, 
rank() over (ORDER BY V.volume DESC) as rank
FROM 
  SalesRep s
  JOIN 
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;
Rep
TerritoryAmount
Rank
Nadia18001
Daniel22202
Jana12202
William21704
The ranking over the entire data set - Daniel is tied for second among all Reps.

5. Hive Rank within Territory, by Volume
SELECT 
  s.RepName, s.Territory, V.volume, 
  rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rank
FROM 
  SalesRep s
  JOIN 
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;
Rep
TerritoryAmount
Rank
Nadia18001
Jana12202
Daniel22201
William21702
The ranking is within the territory - Daniel is the best is his territory.

Difference between Hive AND Impala
Impala is an excellent choice

0 comments:

Post a Comment