ROW_NUMBER(): attributes a unique value to each row
RANK(): attributes the same row number to the same value, leaving "holes"
DENSE_RANK(): attributes the same row number to the same value, leaving no "holes"
Let's do this by creating a table in HIVE.
CREATE TABLE test_test AS
SELECT 'a' v FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'a' FROM dual UNION ALL
SELECT 'b' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'c' FROM dual UNION ALL
SELECT 'd' FROM dual UNION ALL
SELECT 'e' FROM dual;
select v, row_number() over (order by v) row_number, rank() over (order by v) rank, dense_rank() over (order by v) dense_rank from test_test order by v;
After running this query, that took 2 mapreduce jobs and here is following output.
a 3 1 1
a 2 1 1
a 1 1 1
b 4 4 2
c 6 5 3
c 5 5 3
d 7 7 4
e 8 8 5
rank() : It is used to rank a record within a group of rows.
dense_rank() : The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks.
Query -
select
ENAME,SAL,
RANK() over (order by SAL) RANK
from EMP;
Output -
+--------+------+------+
| ENAME | SAL | RANK |
+--------+------+------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 6 |
+--------+------+------+
Query -
select
ENAME,SAL,dense_rank() over (order by SAL) DEN_RANK
from
EMP;
Output -
+--------+------+-----------+
| ENAME | SAL | DEN_RANK |
+--------+------+-----------+
| SMITH | 800 | 1 |
| JAMES | 950 | 2 |
| ADAMS | 1100 | 3 |
| MARTIN | 1250 | 4 |
| WARD | 1250 | 4 |
| TURNER | 1500 | 5 |
+--------+------+-----------+
Let's understand with this simple query on the sample EMP table in SCOTT schema:
SELECT empno, deptno, sal,
rank() over(partition BY deptno order by sal) rn
FROM emp;
EMPNO DEPTNO SAL RN
---------- ---------- ---------- ----------
7934 10 1300 1
7782 10 2450 2
7839 10 5000 3
7369 20 800 1
7876 20 1100 2
7566 20 2975 3
7788 20 3000 4
7902 20 3000 4
7900 30 950 1
7654 30 1250 2
7521 30 1250 2
7844 30 1500 4
7499 30 1600 5
7698 30 2850 6
14 rows selected.
RANK function is a built in analytic function which is used to rank a record within a group of rows.The PARTITION BY clause in the window is what groups the rows, and the ORDER BY clause tells how to rank, i.e. which row in each group will hold the 1st rank and then assign next rank to the next rows in that order.
- GROUP
- SORT
- Assign rank
7654 30 1250 2
7521 30 1250 2
7844 30 1500 4
The rank is not consecutive because there is a tie between two rows with salary 1250
. To keep the sequence consecutive, you need to use DENSE_RANK.Another set of example.
Windowing and Analytics functions.
1. Sales Rep data
Here is a CSV file with Sales Rep data:
$ more reps.csv
1,William,2
2,Nadia,1
3,Daniel,2
4,Jana,1
Create a Hive table for the Sales Rep data:
create table SalesRep (
RepID INT,
RepName STRING,
Territory INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
... and load the CSV into the Hive Sales Rep table:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/reps.csv'
INTO TABLE SalesRep;
2. Purchase Order data
Here is a CSV file with PO data:
$ more purchases.csv
4,1,100
2,2,200
2,3,600
3,4,80
4,5,120
1,6,170
3,7,140
Create a Hive table for the PO's:
create table purchases (
SalesRepId INT,
PurchaseOrderId INT,
Amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
... and load CSV into the Hive PO table:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
INTO TABLE purchases;
3. Hive JOIN
So this is the underlining data that is being worked with:
SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;
PO ID | Rep | Amount | Territory |
1 | Jana | 100 | 1 |
2 | Nadia | 200 | 1 |
3 | Nadia | 600 | 1 |
4 | Daniel | 80 | 2 |
5 | Jana | 120 | 1 |
6 | William | 170 | 2 |
7 | Daniel | 140 | 2 |
4. Hive Rank by Volume only
SELECT
s.RepName, s.Territory, V.volume,
rank() over (ORDER BY V.volume DESC) as rank
FROM
SalesRep s
JOIN
( SELECT
SalesRepId, SUM(amount) as Volume
FROM purchases
GROUP BY SalesRepId) V
WHERE V.SalesRepId = s.RepID
ORDER BY V.volume DESC;
Rep | Territory | Amount | Rank |
Nadia | 1 | 800 | 1 |
Daniel | 2 | 220 | 2 |
Jana | 1 | 220 | 2 |
William | 2 | 170 | 4 |
5. Hive Rank within Territory, by Volume
SELECT
s.RepName, s.Territory, V.volume,
rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rank
FROM
SalesRep s
JOIN
( SELECT
SalesRepId, SUM(amount) as Volume
FROM purchases
GROUP BY SalesRepId) V
WHERE V.SalesRepId = s.RepID
ORDER BY V.volume DESC;
Rep | Territory | Amount | Rank |
Nadia | 1 | 800 | 1 |
Jana | 1 | 220 | 2 |
Daniel | 2 | 220 | 1 |
William | 2 | 170 | 2 |
Difference between Hive AND Impala
Impala is an excellent choice
0 comments:
Post a Comment