Hive RANK function ~ <a href="http://learnprogrammingdata.blogspot.com/">IT-Works</a>

RANK() function in SQL is used to allocate a sequential incremented number (rank) to each of the row in question, based on a particular column. It acts like ROW_NUMBER function with only difference that if two rows have same value, they will be given same rank.

ROW_NUMBER(): attributes a unique value to each row
RANK(): attributes the same row number to the same value, leaving "holes"
DENSE_RANK(): attributes the same row number to the same value, leaving no "holes"

Let's do this by creating a table in HIVE.

CREATE TABLE test_test AS
SELECT 'a' v FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'b' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'd' FROM dual UNION ALL
SELECT 'e' FROM dual;

select v, row_number() over (order by v) row_number, rank() over (order by v) rank, dense_rank() over (order by v) dense_rank from test_test order by v;

After running this query, that took 2 mapreduce jobs and here is following output.

a 3 1 1
a 2 1 1
a 1 1 1
b 4 4 2
c 6 5 3
c 5 5 3
d 7 7 4
e 8 8 5

rank() : It is used to rank a record within a group of rows.
dense_rank() : The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks.

Query -
select
ENAME,SAL,
RANK() over (order by SAL) RANK
from EMP;

Output -
+--------+------+------+
| ENAME | SAL | RANK |
+--------+------+------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 6 |
+--------+------+------+

Query -
select
ENAME,SAL,dense_rank() over (order by SAL) DEN_RANK
from
EMP;

Output -
+--------+------+-----------+
| ENAME | SAL | DEN_RANK |
+--------+------+-----------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 5 |
+--------+------+-----------+

Let's understand with this simple query on the sample EMP table in SCOTT schema:

SELECT empno, deptno, sal, 
rank() over(partition BY deptno order by sal) rn
FROM emp;

     EMPNO     DEPTNO        SAL         RN
---------- ---------- ---------- ----------
      7934         10       1300          1
      7782         10       2450          2
      7839         10       5000          3
      7369         20        800          1
      7876         20       1100          2
      7566         20       2975          3
      7788         20       3000          4
      7902         20       3000          4
      7900         30        950          1
      7654         30       1250          2
      7521         30       1250          2
      7844         30       1500          4
      7499         30       1600          5
      7698         30       2850          6

14 rows selected.

RANK function is a built in analytic function which is used to rank a record within a group of rows.
The PARTITION BY clause in the window is what groups the rows, and the ORDER BY clause tells how to rank, i.e. which row in each group will hold the 1st rank and then assign next rank to the next rows in that order.

GROUP
SORT
Assign rank

So, in above example, the rows are grouped by department, and ordered by salary. In each group, the rank is assigned starting from the lowest salary(ascending order). When there is a tie, the rank is not incremented, however, the immediate next row with change in value will not have consecutive sequence. And that's what happened here:

7654         30       1250          2
7521         30       1250          2
7844         30       1500          4

The rank is not consecutive because there is a tie between two rows with salary 1250. To keep the sequence consecutive, you need to use DENSE_RANK.
Another set of example.

Windowing and Analytics functions.

1. Sales Rep data
Here is a CSV file with Sales Rep data:
$ more reps.csv 1,William,22,Nadia,13,Daniel,24,Jana,1
Create a Hive table for the Sales Rep data:
create table SalesRep ( RepID INT, RepName STRING, Territory INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
... and load the CSV into the Hive Sales Rep table:
LOAD DATA LOCAL INPATH '/home/hadoop/MyDemo/reps.csv' INTO TABLE SalesRep;

2. Purchase Order data
Here is a CSV file with PO data:
$ more purchases.csv 4,1,1002,2,2002,3,6003,4,804,5,1201,6,1703,7,140
Create a Hive table for the PO's:
create table purchases (
SalesRepId INT,
PurchaseOrderId INT,
Amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
... and load CSV into the Hive PO table:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
INTO TABLE purchases;

3. Hive JOIN
So this is the underlining data that is being worked with:
SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;

PO ID	Rep	Amount	Territory
1	Jana	100	1
2	Nadia	200	1
3	Nadia	600	1
4	Daniel	80	2
5	Jana	120	1
6	William	170	2
7	Daniel	140	2

4. Hive Rank by Volume only
SELECT s.RepName, s.Territory, V.volume, rank() over (ORDER BY V.volume DESC) as rankFROM SalesRep s JOIN ( SELECT SalesRepId, SUM(amount) as Volume FROM purchases GROUP BY SalesRepId) V WHERE V.SalesRepId = s.RepID ORDER BY V.volume DESC;

Rep	Territory	Amount	Rank
Nadia	1	800	1
Daniel	2	220	2
Jana	1	220	2
William	2	170	4

The ranking over the entire data set - Daniel is tied for second among all Reps.

5. Hive Rank within Territory, by Volume
SELECT s.RepName, s.Territory, V.volume, rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rankFROM SalesRep s JOIN ( SELECT SalesRepId, SUM(amount) as Volume FROM purchases GROUP BY SalesRepId) V WHERE V.SalesRepId = s.RepID ORDER BY V.volume DESC;

Rep	Territory	Amount	Rank
Nadia	1	800	1
Jana	1	220	2
Daniel	2	220	1
William	2	170	2

The ranking is within the territory - Daniel is the best is his territory.

Difference between Hive AND Impala
Impala is an excellent choice

IT-Works

Monday, July 3, 2017

Hive RANK function

0 comments:

Post a Comment

Popular Posts

Blogger templates

Blogroll

Categories

Blog Archive

About

Search This Blog

Labels

Blogger templates

Pages List

About Me