Tuesday, August 8, 2017

Design interview questions

https://softwareengineering.stackexchange.com/questions/336282/design-elevator-system-algorithm-and-classes

https://www.quora.com/What-is-the-best-answer-to-Design-an-elevator-system

https://www.youtube.com/watch?v=fITuhLSwbt8

https://github.com/joeblau/sample-elevator-control-system

http://massivetechinterview.blogspot.com/2015/07/thought-works-object-oriented-design.html

http://blog.gssxgss.me/category/programming/java/

http://blog.gssxgss.me/

http://blog.gssxgss.me/binary-search-tree-learning-note-1/

http://thought-works.blogspot.com/2012/11/object-oriented-design-for-elevator-in.html

https://www.careercup.com/question?id=1808669

https://stackoverflow.com/questions/493276/modelling-an-elevator-using-object-oriented-analysis-and-design

https://tunaozkasap.wordpress.com/page/2/

http://practice.geeksforgeeks.org/problems/design-elevator



MOTHER of All Design questions 

http://www.shuati123.com/blog/categories/design/



For Spark-related stuff:


https://weishungchung.com/












Sunday, August 6, 2017

Difference between sort & orderBy in Spark

orderBy is just an alias for the sort function and should give the same result.

The below is from the Spark documentation:

/**
   * Returns a new Dataset sorted by the given expressions.
   * This is an alias of the `sort` function.
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)
Both will order across partitions. To get an understanding of how Spark does a sort, take a look at the explanation in the link below:

http://stackoverflow.com/questions/32887595/how-does-spark-achieve-sort-order

If you would like to sort within a partition, you can use repartitionAndSortWithinPartitions.

https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/OrderedRDDFunctions.html#repartitionAndSortWithinPartitions(org.apache.spark.Partitioner)
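
Here is a minimal sketch of that with the Java pair-RDD API (the local context and sample pairs below are made up for illustration): each output partition ends up sorted by key during the shuffle, without a full global sort.

import java.util.Arrays;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortWithinPartitionsDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "sort-within-partitions");

        JavaPairRDD<Integer, String> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(4, "d"), new Tuple2<>(1, "a"),
                new Tuple2<>(3, "c"), new Tuple2<>(2, "b")));

        // Shuffle into 2 partitions and sort each partition by key in a single pass.
        JavaPairRDD<Integer, String> sorted =
                pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2));

        // Prints the partitions, e.g. [[(2,b), (4,d)], [(1,a), (3,c)]]
        System.out.println(sorted.glom().collect());
        sc.stop();
    }
}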


More about Spark orderBy
Sorting in Spark is a multiphase process which requires shuffling:

input RDD is sampled and this sample is used to compute boundaries for each output partition (sample followed by collect)
input RDD is partitioned using rangePartitioner with boundaries computed in the first step (partitionBy)
each partition from the second step is sorted locally (mapPartitions)
When the data is collected, all that is left is to follow the order defined by the partitioner.

The above steps are clearly reflected in the debug string:

scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize


https://stackoverflow.com/questions/40603202/difference-between-sort-and-orderby-spark


Saturday, August 5, 2017

Apache Spark ML Sparse Vector vs Dense Vector

Sparse Vector vs Dense Vector


import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});


A vector is sparse when most of its values are zero, and dense when most of its values are non-zero.

If you have to create the sparse equivalent of the dense vector (0.0, 3.0, 0.0, 4.0), use the following syntax:

Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
The explanation below is useful for getting a good understanding of the overall concept.

What is a Sparse Vector

A vector is a one-dimensional array of elements, so in a programming language a vector is typically implemented as a one-dimensional array. A vector is said to be sparse when many of its elements are zero. When we write programs, it is not a good idea from a storage perspective to store all these zero values in the array.

So the best way to represent a sparse vector is to store only the positions and values of the non-zero entries.

Example: 3 1.2 2800 6.3 6000 10.0 50000 5.7

This denotes that at position:

3 the value is 1.2
2800 the value is 6.3
6000 the value is 10.0
50000 the value is 5.7

When you use a sparse vector in a programming language you also need to specify its size (the length of the full vector). In the above example, four (position, value) pairs are stored.

Representation of Sparse Vector in Spark

The Vectors class of org.apache.spark.mllib.linalg has multiple factory methods to create dense and sparse Vectors.

The simplest way of creating a sparse vector is by using:

sparse(int size, int[] indices, double[] values)

This method creates a sparse vector where the first argument is the size, the second is the indices at which values exist, and the last is the values at those indices.

The rest of the elements of this vector are zero.

Example:

Let’s say we have to create the following vector {0.0, 2.0, 0.0, 0.0, 4.0}. By using the sparse vector API of Spark this can be created as stated below:

Vector sparseVector = Vectors.sparse(5, new int[] {1, 4}, new double[] {2.0, 4.0});

The same vector can also be created using the dense vector API:

Vector denseVector = Vectors.dense(0.0, 2.0, 0.0, 0.0, 4.0);

Apache Spark K-Means clustering

K.1: K-Means clustering

This is how you run K-Means clustering in Spark, starting from a DataFrame.

These are the libs that you need to import:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import scala.sys.process._

//Pull data from hive or any other DB ( for this example we assume that the data is stored in hive )

val query = "select query, page_impression from table_search"

// Now execute the query
val q1 = hqlContext.sql(query).select("query", "page_impression")

val rowRDD = q1.rdd.map
(r =>
(
r.getString(0),
r.getLong(1),
Vectors.dense(r.getLong(1))
)
)

rowRDD.cache()

val upper_cnt =  8000
val lower_cnt = 100
//select the training set

val trainSet = rowRDD.filter(r => r._2 < upper_cnt && r._2 >= lower_cnt)
val numClusters = 3
val numIterations = 20
val feature = trainSet.map(r => r._3)

val kMeansModel = KMeans.train(feature, numClusters, numIterations)

// Here we are segregating all the queries into 3 separate clusters

val predDF = trainSet.map(r => kMeansModel.predict(r._3) match {
  case 1 => (r._1, r._2, 0)
  case 2 => (r._1, r._2, 1)
  case 0 => (r._1, r._2, 2)
  case _ => (r._1, r._2, -1)
}).toDF("queries", "page_impression", "tier")


// Now save the data to HDFS (requires: import org.apache.spark.sql.SaveMode)
predDF.coalesce(10).write.mode(SaveMode.Overwrite).parquet(final_part_dir)




K.2: Logistic regression example:

// Assumed imports for this snippet (not shown in the original post):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._

// Simple wrapper for the feature vector, assumed for the test set below
case class Feature(features: org.apache.spark.mllib.linalg.Vector)

val data = Seq(
  (1.0, 0.52, 0.34), (0.0, 0.6, 0.43), (0.0, 1.9, 0.54),
  (1.0, 0.11, 0.11), (1.0, 0.222, 0.33), (1.0, 0.333, 0.99),
  (0.0, 0.314, 0.86), (0.0, 0.9888, 0.34), (1.0, 0.264, 0.55))

val df = sc.parallelize(data).toDF("label", "x1", "x2").cache()

val train_set = df.rdd.map(row => LabeledPoint(row.getAs[Double](0), Vectors.dense (row.getAs[Double](1), row.getAs[Double](2)))).toDF

train_set.show()

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3)

val lrModel = lr.fit(train_set)

//Prepare the test data
val test = df.select("x1", "x2").rdd.map(row => Feature(Vectors.dense(row.getAs[Double](0), row.getAs[Double](1)))).toDF("features")

val predictionsDF = lrModel.transform(test)
predictionsDF.show()


Run the Hive Queries on Spark

Case statements
This is how case statements can be written in DataFrames:
Hive Query:
select key, case when value is null then 0 else value end as new_value from users limit 10;
In data frames:
val sqlQuery = "select key, value from users"
val data = hqlContext.sql(sqlQuery)
// when() requires: import org.apache.spark.sql.functions._
val caseStatement = data.withColumn("new_value", when(data("value").isNull, 0).otherwise(data("value")))
The withColumn function creates one more column in addition to key and value, so if we call printSchema() on the caseStatement DataFrame it will return:
caseStatement.printSchema()
----key: int
----value: int
----new_value:int





Source:
http://sparksofdata.blogspot.com/


SQL Aggregation functions: CUBE, ROLLUP & GROUPING SETS



ROLLUP() explained in SQL

The SQL standard defines special functions that can be used in the GROUP BY clause: the grouping functions. These functions can be used to generate several groupings in a single clause. This can best be explained in SQL. Let's take ROLLUP() for instance:


The ROLLUP() grouping function provides N+1 groupings, where N is the number of arguments to the ROLLUP() function. Each grouping adds one more field from the ROLLUP() argument field list.

-- ROLLUP() with one argument
SELECT AUTHOR_ID, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID)


-- ROLLUP() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY ROLLUP(AUTHOR_ID, PUBLISHED_IN)


The results of the second (two-argument) query might look something like this:

+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
|         1 |         1945 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         1948 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         2 |         1988 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         1990 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|      NULL |         NULL |        4 | <- GROUP BY ()
+-----------+--------------+----------+



CUBE() explained in SQL

CUBE() is different from ROLLUP() in the way that it doesn't just create N+1 groupings, it creates all 2^N possible combinations between all group fields in the CUBE() function argument list. Let's re-consider our second query from before:


-- CUBE() with two arguments
SELECT AUTHOR_ID, PUBLISHED_IN, COUNT(*)
FROM BOOK
GROUP BY CUBE(AUTHOR_ID, PUBLISHED_IN)


The results would then hold:
+-----------+--------------+----------+
| AUTHOR_ID | PUBLISHED_IN | COUNT(*) |
+-----------+--------------+----------+
|      NULL |         NULL |        4 | <- GROUP BY ()
|      NULL |         1945 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1948 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1988 |        1 | <- GROUP BY (PUBLISHED_IN)
|      NULL |         1990 |        1 | <- GROUP BY (PUBLISHED_IN)
|         1 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         1 |         1945 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         1 |         1948 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         NULL |        2 | <- GROUP BY (AUTHOR_ID)
|         2 |         1988 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
|         2 |         1990 |        1 | <- GROUP BY (AUTHOR_ID, PUBLISHED_IN)
+-----------+--------------+----------+



GROUPING SETS()

GROUPING SETS() is the generalised way to create multiple groupings. From our previous examples:
ROLLUP(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), ())
CUBE(AUTHOR_ID, PUBLISHED_IN) corresponds to GROUPING SETS((AUTHOR_ID, PUBLISHED_IN), (AUTHOR_ID), (PUBLISHED_IN), ())



More explanation on the aggregation functions.

Aggregate functions can be nested:

SELECT
   AVG(
      MAX(SAL)
   )
FROM
   EMP
GROUP BY
   DEPTNO;

AVG(MAX(SAL))
-------------
   3616.66667

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |     7 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |     7 |     4  (25)| 00:00:01 |
|   2 |   SORT GROUP BY     |      |     1 |     7 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMP  |    14 |    98 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------

The average of the maximum salary of each department is returned.

A query containing a nested aggregation requires one group by clause and returns one row.

The CUBE, ROLLUP and GROUPING SETS functions are used in the GROUP BY clause to generate totals and subtotals.

SELECT
   DEPTNO,
   SUM(SAL)
FROM
   EMP
GROUP BY
   ROLLUP(DEPTNO);

    DEPTNO   SUM(SAL)
---------- ----------
        10       8750
        20      10875
        30       9400
                29025

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     3 |    21 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY ROLLUP|      |     3 |    21 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | EMP  |    14 |    98 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

The sum of salaries is aggregated per department, and then an additional row containing the total of all employees' salaries is returned.

The ROLLUP has generated four rows:

One for each department (three groups)

One for the total of all groups

ROLLUP can have more than one dimension, generating grand totals and subtotals:

SELECT
   DEPTNO,
   JOB,
   SUM(SAL)
FROM
   EMP
GROUP BY
   ROLLUP(DEPTNO,JOB);

  DEPTNO JOB         SUM(SAL)
-------- --------- ----------
      10 CLERK           1300
      10 MANAGER         2450
      10 PRESIDENT       5000
      10                 8750
      20 CLERK           1900
      20 ANALYST         6000
      20 MANAGER         2975
      20                10875
      30 CLERK            950
      30 MANAGER         2850
      30 SALESMAN        5600
      30                 9400
                        29025

The sum generates a grand total, a subtotal per department and a sub-subtotal per department and job.

It is also possible to mix ROLLUP groups with normal groups in the GROUP BY clause:

SELECT
   DEPTNO,
   JOB,
   SUM(SAL)
FROM
   EMP
GROUP BY
   DEPTNO,ROLLUP(JOB);

  DEPTNO JOB         SUM(SAL)
-------- --------- ----------
      10 CLERK           1300
      10 MANAGER         2450
      10 PRESIDENT       5000
      10                 8750
      20 CLERK           1900
      20 ANALYST         6000
      20 MANAGER         2975
      20                10875
      30 CLERK            950
      30 MANAGER         2850
      30 SALESMAN        5600
      30                 9400

A subtotal per department and a sub-subtotal per department and job are returned. As DEPTNO is a standard GROUP BY expression, there is no grand total.

SELECT
   MAX(ENAME)
   KEEP
   (
      DENSE_RANK
         FIRST
      ORDER BY
         SAL DESC
   ) ENAME,
   DEPTNO,
   JOB,
   MAX(SAL) SAL
FROM
   EMP
GROUP BY
   CUBE(DEPTNO,JOB);

ENAME          DEPTNO JOB              SAL
---------- ---------- --------- ----------
KING                                  5000
MILLER                CLERK           1300
SCOTT                 ANALYST         3000
JONES                 MANAGER         2975
ALLEN                 SALESMAN        1600
KING                  PRESIDENT       5000
KING               10                 5000
MILLER             10 CLERK           1300
CLARK              10 MANAGER         2450
KING               10 PRESIDENT       5000
SCOTT              20                 3000
ADAMS              20 CLERK           1100
SCOTT              20 ANALYST         3000
JONES              20 MANAGER         2975
BLAKE              30                 2850
JAMES              30 CLERK            950
BLAKE              30 MANAGER         2850
ALLEN              30 SALESMAN        1600

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    11 |   231 |     4  (25)| 00:00:01 |
|   1 |  SORT GROUP BY       |      |    11 |   231 |     4  (25)| 00:00:01 |
|   2 |   GENERATE CUBE      |      |    11 |   231 |     4  (25)| 00:00:01 |
|   3 |    SORT GROUP BY     |      |    11 |   231 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   294 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

CUBE generates nine rows for the various job/department combinations, three rows for the subtotals per department, five rows for the subtotals per job and one row for the grand total. For each job and department, the maximum salary is returned, and for this salary, the name of the corresponding employee is returned. In case of ties, the MAX function is applied to the name.

With this result set, a superaggregation is performed for each dimension, the department, the job, and the overall. In addition to the rows that are aggregated by ROLLUP, CUBE also produces a row for each job.

GROUPING SETS simplifies the management of the subtotals:

SELECT
   DEPTNO,
   JOB,
   MIN(SAL)
FROM
   EMP
GROUP BY
   GROUPING SETS
   (
      (JOB),
      (DEPTNO)
   );


    DEPTNO JOB         MIN(SAL)
---------- --------- ----------
           ANALYST         3000
           CLERK            800
           MANAGER         2450
           PRESIDENT       5000
           SALESMAN        1250
        10                 1300
        20                  800
        30                  950

This GROUPING SETS result is a subset of the CUBE result: the minimum salary in each job and the minimum salary in each department are returned.

Compare with:

SELECT
   CASE
      WHEN GROUPING_ID(JOB, DEPTNO)=1
      THEN 'Count per job'
      WHEN GROUPING_ID(JOB, DEPTNO)=2
      THEN 'Count per department'
   END " ",
   CASE
      WHEN GROUPING(JOB)=0
      THEN JOB
      ELSE '========='
   END JOB,
   CASE
      WHEN GROUPING(DEPTNO)=0
      THEN TO_CHAR(DEPTNO,'99999')
      ELSE '======'
   END DEPTNO,
   COUNT(*)
FROM
   EMP
GROUP BY
   CUBE(JOB, DEPTNO)
HAVING
   GROUPING_ID (JOB, DEPTNO) in (0,1,2);

                     JOB       DEPTNO   COUNT(*)
-------------------- --------- ------ ----------
                     CLERK         10          1
                     CLERK         20          2
                     CLERK         30          1
                     ANALYST       20          2
                     MANAGER       10          1
                     MANAGER       20          1
                     MANAGER       30          1
                     SALESMAN      30          4
                     PRESIDENT     10          1
Count per job        CLERK     ======          4
Count per job        ANALYST   ======          2
Count per job        MANAGER   ======          3
Count per job        SALESMAN  ======          4
Count per job        PRESIDENT ======          1
Count per department =========     10          3
Count per department =========     20          5
Count per department =========     30          6

CUBE generates counts in every possible dimension, including overall counts. The HAVING clause evaluates the GROUPING_ID function. When no superaggregation occurs, it gets the value 0; for a subtotal per the first argument it gets 1 (2^0); for a subtotal per the second argument it gets 2 (2^1); and for the grand total over both arguments it gets 3 (2^0 + 2^1). Only the 0, 1 and 2 grouping ids are returned. GROUPING returns 1 when the column is summarized and 0 otherwise.

There is one more function related to grouping that is called GROUP_ID which is useful when there are duplicate subtotals:

SELECT
   CASE
      WHEN GROUPING(ENAME)=0
      THEN NULL
      WHEN GROUP_ID()=0
      THEN 'SUM'
      WHEN GROUP_ID()=1
      THEN 'AVG'
   END TYPE,
   ENAME,
   DEPTNO,
   CASE
      WHEN GROUPING(ENAME)=0
      THEN NULL
      WHEN GROUP_ID()=0
      THEN SUM(SAL)
      WHEN GROUP_ID()=1
      THEN AVG(SAL)
   END VALUE
FROM
   EMP
GROUP BY
   GROUPING SETS
   (
      (ENAME,DEPTNO),
      (DEPTNO),
      (DEPTNO),
      (),
      ()
   )
ORDER BY
   DEPTNO,
   ENAME,
   GROUP_ID();

TYP ENAME          DEPTNO      VALUE
--- ---------- ---------- ----------
    CLARK              10
    KING               10
    MILLER             10

SUM                    10       8750
AVG                    10 2916.66667

    ADAMS              20
    FORD               20
    JONES              20
    SCOTT              20
    SMITH              20

SUM                    20      10875
AVG                    20       2175
    ALLEN              30
    BLAKE              30
    JAMES              30
    MARTIN             30
    TURNER             30
    WARD               30

SUM                    30       9400
AVG                    30 1566.66667

SUM                            29025
AVG                       2073.21429

Saturday, July 29, 2017

HackerRank Equals

Christy is interning at HackerRank. One day she has to distribute some chocolates to her colleagues. She is biased towards her friends and may have distributed the chocolates unequally. One of the program managers gets to know this and orders Christy to make sure everyone gets an equal number of chocolates.
But to make things difficult for the intern, she is ordered to equalize the number of chocolates for every colleague in the following manner:
For every operation, she can choose one of her colleagues and can do one of the three things.
  1. She can give one chocolate to every colleague other than chosen one.
  2. She can give two chocolates to every colleague other than chosen one.
  3. She can give five chocolates to every colleague other than chosen one.
Calculate the minimum number of such operations needed to ensure that every colleague has the same number of chocolates.
Input Format
The first line contains an integer T denoting the number of test cases. T test cases follow.
Each test case has 2 lines. The first line of each test case contains an integer N denoting the number of colleagues. The second line contains N space-separated integers denoting the current number of chocolates each colleague has.
Constraints


Number of initial chocolates each colleague has < 
Output Format
T lines, each containing the minimum number of operations needed to make sure all colleagues have the same number of chocolates.
Sample Input
1
4
2 2 3 7
Sample Output
2
Explanation
1st operation: Christy increases all elements by 1 except the 3rd one
2 2 3 7 -> 3 3 3 8
2nd operation: Christy increases all elements by 5 except the last one
3 3 3 8 -> 8 8 8 8


       

import java.util.Scanner;

/**
 * Created by vdokku on 7/29/2017.
 */
public class Equal {

    static boolean DBG = false;

    // Min number in the Array.
    static int min;

    //Find the minimised action counts
    public static int MinRound(int[] counts) {

        int[][] results = new int[counts.length][3];
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < 3; j++) {
                int delta = counts[i] - min + j; // calculating the DELTA is important.
                results[i][j] = 0;
                while (true) {
                    // Greedy approach
                    if (delta >= 5) {
                        delta -= 5;
                        results[i][j]++;
                    } else if (delta >= 2) {
                        delta -= 2;
                        results[i][j]++;
                    } else if (delta >= 1) {
                        delta -= 1;
                        results[i][j]++;
                    } else {
                        break;
                    }
                }
            }
        }
        int finalResult = -1;
        // Compare results from different baseline cases (keep min, take 1, 2 ).
        for (int i = 0; i < 3; i++) {
            int subFinal = 0;
            for (int j = 0; j < counts.length; j++) {
                subFinal += results[j][i];
                if (DBG) System.out.format("results[%d][%d] = %d \n", j, i, results[j][i]);
            }
            if (DBG) System.out.println(subFinal);
            if (finalResult < 0 || finalResult > subFinal) {
                finalResult = subFinal;
            }
        }

        return finalResult;
    }

    public static void main(String[] args) {

        int casesCount = 0;

        Scanner s = new Scanner(System.in);

        if (s.hasNextInt()) {
            casesCount = s.nextInt();
        }
        s.nextLine(); // throw away the newline.

        int[] outputArray = new int[casesCount];

        for (int i = 0; i < casesCount; i++) {
            int count = 0;
            if (s.hasNextInt()) {
                count = s.nextInt();
            }
            s.nextLine();

            int[] numbers = new int[count];
            min = -1;
            for (int j = 0; j < count; j++) {
                if (s.hasNextInt()) {
                    numbers[j] = s.nextInt();
                    // get min value from input array
                    if (numbers[j] < min || min < 0) {
                        min = numbers[j];
                    }
                } else {
                    System.out.println("You didn't provide enough numbers");
                    break;
                }
            }
            //SortCounts(numbers);
            outputArray[i] = MinRound(numbers);

        }

        for (int i = 0; i < casesCount; i++) {
            System.out.println(outputArray[i]);
        }
    }
}

       
 

Thursday, July 27, 2017

LEARN complex HIVE Queries

When working with complex Hive queries that involve joins and different analytical functions, it helps to study a full example.
Here is a sample query.

       

SELECT
mag.co_magasin,
dem.id_produit                                  as id_produit_orig,
pnvente.dt_debut_commercial                     as dt_debut_commercial,
COALESCE(pnvente.id_produit,dem.id_produit)     as id_produit,
min(
CASE WHEN dem.co_validation IS NULL THEN 0 ELSE 1 END
)                                               as flg_demarque_valide,
sum(CASE WHEN dem.co_validation IS NULL THEN 0 ELSE cast(dem.mt_revient_ope AS INT) END)
           as me_dem_con_prx_cs,
0                                               as me_dem_inc_prx_cs,
0                                               as me_dem_prov_stk_cs,
sum(CASE WHEN dem.co_validation IS NULL THEN 0 ELSE cast(dem.qt_demarque AS INT) END)
           as qt_dem_con,
0                                               as qt_dem_inc,
0                                               as qt_dem_prov_stk,
RANK() OVER (PARTITION BY mag.co_magasin, dem.id_produit ORDER BY pnvente.dt_debut_commercial DESC, COALESCE(pnvente.id_produit,dem.id_produit) DESC) as rang
from default.calendrier cal
INNER JOIN default.demarque_mag_jour dem
ON  CASE WHEN dem.co_societe = 1 THEN 1 ELSE 2 END = '${hiveconf:in_co_societe}'
AND dem.dt_jour    = cal.dt_jour
LEFT OUTER JOIN default.produit_norm pn
ON  pn.co_societe = dem.co_societe
AND pn.id_produit = dem.id_produit
LEFT OUTER JOIN default.produit_norm pnvente
ON  pnvente.co_societe = pn.co_societe
AND pnvente.co_produit_rfu = pn.co_produit_lip
AND pnvente.co_type_motif='05'
INNER JOIN default.kpi_magasin mag
ON  mag.co_societe = '${hiveconf:in_co_societe}'
AND mag.id_magasin = dem.id_magasin
WHERE cal.dt_jour = '${hiveconf:in_dt_jour}'
AND NOT (dem.co_validation IS NULL AND cal.dt_jour > from_unixtime(unix_timestamp()-3*60*60*24, 'ddmmyyyy'))
-- JYP 4.4
AND dem.co_operation_magasin IN ('13','14','32')
GROUP BY
mag.co_magasin,
dem.id_produit,
pnvente.dt_debut_commercial,
COALESCE(pnvente.id_produit,dem.id_produit)

       
 










Tuesday, July 18, 2017

com.google.common.collect.Lists provides different utility methods.

From the Guava source (com.google.common.collect.Lists):


public static <T> List<T> reverse(List<T> list) {
    if (list instanceof ImmutableList) {
      return ((ImmutableList<T>) list).reverse();
    } else if (list instanceof ReverseList) {
      return ((ReverseList<T>) list).getForwardList();
    } else if (list instanceof RandomAccess) {
      return new RandomAccessReverseList<T>(list);
    } else {
      return new ReverseList<T>(list);
    }
  }


 @Override public List<T> subList(int fromIndex, int toIndex) {
      checkPositionIndexes(fromIndex, toIndex, size());
      return reverse(forwardList.subList(
          reversePosition(toIndex), reversePosition(fromIndex)));
    }

public static <T> List<List<T>> partition(List<T> list, int size) {
    checkNotNull(list);
    checkArgument(size > 0);
    return (list instanceof RandomAccess)
        ? new RandomAccessPartition<T>(list, size)
        : new Partition<T>(list, size);
  }
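
A short usage sketch of the two helpers shown above, reverse and partition (the sample list is made up):

import com.google.common.collect.Lists;
import java.util.Arrays;
import java.util.List;

public class GuavaListsDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7);

        // Lists.reverse returns a reversed view; the original list is not copied.
        List<Integer> reversed = Lists.reverse(numbers);
        System.out.println(reversed);   // [7, 6, 5, 4, 3, 2, 1]

        // Lists.partition splits the list into consecutive sublists of the given size;
        // the final sublist may be smaller.
        List<List<Integer>> chunks = Lists.partition(numbers, 3);
        System.out.println(chunks);     // [[1, 2, 3], [4, 5, 6], [7]]
    }
}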




Why variables used inside a Java 8 forEach lambda should be final

The Java Memory Model has a very important property: it guarantees that local variables and method parameters are never writable by another thread. This adds a lot of safety to multi-threaded programming. However, when you create a lambda (or an anonymous class), nobody knows how it will be used. It can be passed to another thread for execution (for example, if you use parallelStream().forEach(...)). Were it possible to modify the local variable, that important property would be violated, and that is not something the Java language designers were willing to sacrifice.

Usually when you are using lambdas, you are trying to program in a functional way. In functional programming, mutable variables are considered bad practice: it is better to assign every variable only once, so trying to modify a local variable is a code smell. Use the various stream reduction methods instead of forEach to produce good functional code.

https://stackoverflow.com/questions/16635398/java-8-iterable-foreach-vs-foreach-loop
https://stackoverflow.com/questions/31801313/why-variables-inside-foreach-loop-of-java8-should-be-final
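
A minimal sketch of the rule in practice (the list and variable names are made up): mutating a captured local from a forEach lambda does not compile, while a stream reduction expresses the same computation functionally.

import java.util.Arrays;
import java.util.List;

public class EffectivelyFinalDemo {
    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(10, 20, 30);

        // Does not compile: a lambda may only capture final or effectively final locals.
        // int running = 0;
        // prices.forEach(p -> running += p);  // error: variable used in lambda should be final or effectively final

        // Functional alternative: reduce the stream instead of mutating a captured variable.
        int total = prices.stream().mapToInt(Integer::intValue).sum();
        System.out.println(total);  // 60
    }
}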



Difference between == and === in Scala/Spark


The "==" is using the equals methods which checks if the two references point to the same object. The definition of "===" depends on the context/object. For Spark , "===" is using the equalTo method.

== returns a boolean
=== returns a column (which contains the result of the comparisons of the elements of two columns)

Quick sample: a SchemaRDD is an RDD[org.apache.spark.sql.Row]; you extract fields from a Row using rec.getString(columnNumber).


.filter(rec => (rec.getString(5) == "" || rec.getString(5) == null)).map(rec => (rec.getString(1))).

Merging the part files into a single file: you can use .repartition(1) or .coalesce(1) to write only one part file, or merge the files on HDFS with getmerge:

hadoop fs -getmerge "HDFS_LOCATION_PART_FILES_LIST"  "NFS_LOCATION_SINGLE_PART_FILE"

Tuesday, July 11, 2017

Hadoop Certification Plan

Hive AVRO
Hive Parquet

Hive Window Operation.
Hive GroupBy

Removing the header of any text/CSV file can be done in two ways.

stocksData.take(12).foreach(println)


Stick with the First approach.

1)  mapPartitionsWithIndex
 Ex: stocksData.mapPartitionsWithIndex{(index, itr) => if (index ==0) itr.drop(1) else itr}.take(12).foreach(println)



2)  mapPartitions
 Ex: stocksData.mapPartitions(itr => itr.drop(1)).take(12).foreach(println)



Simple commands that need to be remembered.
  1. import sqlContext.implicits._
  2. spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
  3. import com.databricks.spark.avro._
  4. import org.apache.spark.sql.functions._

Different ways of reading a text file from HDFS, assuming the input source files are comma separated.

       

// Assumes a case class such as: case class orders(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)
val ordersDF = sc.textFile("/PATH").map(_.split(',')).map(p => orders(p(0).toInt, p(1), p(2).toInt, p(3))).toDF()
val rdd1 = sc.textFile("/Path").map(rec => { val rec1 = rec.split(','); orders(rec1(0).toInt, rec1(1), rec1(2).toInt, rec1(3)) }).toDF()

       
 

Things you need to know when reading Parquet & Avro files.

Once you get the DataFrame out of the Parquet or Avro file, you can access the values using rec(INDEX).

Sample example:

       

val ordersDF = sqlContext.read.parquet("PATH")
ordersDF.map(rec => rec.getInt(0)+rec.getLong(1)+rec.getInt(2)+rec.getString(3)).take(12).foreach(println)

ordersDF.map(rec => rec(0)+"\t"+rec(1)+"\t"+rec(2)+"\t"+rec(3)).take(12).foreach(println)

import com.databricks.spark.avro._
val ordersDF = sqlContext.read.avro("CCA_Prep/avro/orders")
ordersDF.map(rec => rec(0)+"\t"+rec(1))

       
 


Things to remember when importing data using Sqoop.

1) Sqoop import MYSQL data to the TEXT file. 
       

sqoop import --connect "" --username "" --password "" --table order_items --as-textfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''



       
 

2) Sqoop import MYSQL data to the AVRO Data file.
       

sqoop import --connect "" --username "" --password "" --table order_items --as-avrodatafile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''


       
 

3) Sqoop import MYSQL data to the PARQUET Data file
       

sqoop import --connect "" --username "" --password "" --table order_items --as-parquetfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

       
 




Wednesday, July 5, 2017

Java 8 Functional interfaces

Definition:

The @FunctionalInterface annotation is useful for compile-time checking of your code.
Your @FunctionalInterface (or any other interface used as a functional interface) cannot have more than one abstract method, not counting:
static methods,
default methods, and
abstract methods that override public methods of Object.

Insight 1:

Conceptually,
a functional interface has exactly one abstract method.
Since default methods have an implementation, they are not abstract.
Since default methods are not abstract, you're free to add as many default methods to your functional interface as you like.

Insight 2:

If an interface declares an abstract method overriding one of the public methods of java.lang.Object, that also does not count toward the interface’s abstract method count since any implementation of the interface will have an implementation from java.lang.Object or elsewhere.

Insight 3:

For example, the interface below is a valid functional interface even though it declares three abstract methods. Why? Because two of them, toString() and equals(), have signatures matching public methods of the Object class, so they do not count.
@FunctionalInterface
 interface MathOperation {
      int operation(int a, int b);
      @Override
      public String toString();  //Overridden from Object class
      @Override
      public boolean equals(Object obj); //Overridden from Object class
 }
While the intended use of functional interfaces is for lambda expressions, method references and constructor references, they can still be used, like any interface, with anonymous classes, implemented by classes, or created by factory methods.
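
A quick sketch of the first two usages (a minimal functional interface with the single abstract method from MathOperation above; the class name is made up):

@FunctionalInterface
interface MathOperation {
    int operation(int a, int b);
}

public class MathOperationDemo {
    public static void main(String[] args) {
        MathOperation add = (a, b) -> a + b;  // lambda expression
        MathOperation sum = Integer::sum;     // method reference with a matching signature
        System.out.println(add.operation(2, 3));  // 5
        System.out.println(sum.operation(4, 5));  // 9
    }
}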

Java 8 & Lambda Expressions


In JDK 8 a lambda expression is a way of defining an anonymous function. A simple example is as follows.
Consumer<String> consumer = (x) -> System.out.print(x);
Calling consumer.accept(" Hey There !") applies the string parameter to the body of the anonymous function.
Result:  Hey There !
We can think of it as a function defined in two parts:
  1. The right part of the expression is the body of the function
  2. The signature of the function is defined in a Functional Interface, in this case the Consumer<T> interface (see the section on Functional Interfaces above).
One important detail is that the lambda expression is of the same type of the functional interface associated with it.
Lambda Expression vs Anonymous classes.
Need to update ....
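
A small sketch of one practical difference (the class and field names below are made up): inside an anonymous class, this refers to the anonymous object itself, while inside a lambda it refers to the enclosing instance.

public class LambdaVsAnonymous {

    private final String name = "enclosing instance";

    void run() {
        // Anonymous class: a separate class body; 'this' means the anonymous object,
        // so the outer field needs a qualified reference.
        Runnable anonymous = new Runnable() {
            @Override
            public void run() {
                System.out.println("anonymous, outer name = " + LambdaVsAnonymous.this.name);
            }
        };

        // Lambda: no separate class body; 'this' means the enclosing instance,
        // so the field can be used directly.
        Runnable lambda = () -> System.out.println("lambda, outer name = " + name);

        anonymous.run();
        lambda.run();
    }

    public static void main(String[] args) {
        new LambdaVsAnonymous().run();
    }
}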

Java 8 Streams


  • A Stream is a pipeline of functions that can be evaluated
  • Streams can transform data
  • A Stream is not a data structure
  • Streams cannot mutate data
Most importantly, a stream isn’t a data structure. You can often create a stream from collections to apply a number of functions on a data structure, but a stream itself is not a data structure. That’s so important, I mentioned it twice! A stream can be composed of multiple functions that create a pipeline that data that flows through. This data cannot be mutated. That is to say the original data structure doesn’t change. However the data can be transformed and later stored in another data structure or perhaps consumed by another operation.
We stated that a stream is a pipeline of functions, or operations. These operations can be classed either as intermediate operations or terminal operations. The difference between the two is in the output which the operation creates. If an operation outputs another stream, to which you could apply a further operation, we call it an intermediate operation. However, if the operation outputs a concrete type or produces a side effect, it is a terminal operation. A subsequent stream operation cannot follow a terminal operation, obviously, as a stream is not returned by the terminal operation!

Intermediate operations

An intermediate operation is always lazily executed; that is to say, it is not run until a terminal operation is reached. We'll look in more depth at a few of the most popular intermediate operations used in a stream.
  • filter – the filter operation returns a stream of elements that satisfy the predicate passed in as a parameter to the operation. The elements themselves before and after the filter will have the same type, however the number of elements will likely change.
  • map – the map operation returns a stream of elements after they have been processed by the function passed in as a parameter. The elements before and after the mapping may have a different type, but there will be the same total number of elements.
  • distinct – the distinct operation is a special case of the filter operation. Distinct returns a stream of elements such that each element is unique in the stream, based on the equals method of the elements.
Here’s a table that summarises this, including a couple of other common intermediate operations.

Function     Preserves count   Preserves type   Preserves order
map          yes               no               yes
filter       no                yes              yes
distinct     no                yes              yes
sorted       yes               yes              no
peek         yes               yes              yes

Terminal operations

A terminal operation is always eagerly executed. This operation will kick off the execution of all previous lazy operations present in the stream. Terminal operations either return concrete types or produce a side effect. For instance, a reduce operation which calls Integer::sum would produce an Optional, which is a concrete type. Alternatively, the forEach operation does not return a concrete type, but you are able to add a side effect such as printing out each element. The collect terminal operation is a special type of reduce which takes all the elements from the stream and can produce a Set, Map or List. Here's a tabulated summary.

Function   Output             When to use
reduce     concrete type      to cumulate elements
collect    list, map or set   to group elements
forEach    side effect        to perform a side effect on elements


All Terminal Operations.
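
A small end-to-end sketch (the word list is made up) showing lazy intermediate operations followed by a terminal collect:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamPipelineDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "hive", "spark", "sqoop", "hadoop");

        // filter, map and distinct are lazy intermediate operations;
        // nothing runs until the terminal collect() is invoked.
        List<String> result = words.stream()
                .filter(w -> w.length() > 4)    // drop short words
                .map(String::toUpperCase)       // transform each element
                .distinct()                     // remove duplicates
                .collect(Collectors.toList());  // terminal operation

        System.out.println(result);  // [SPARK, SQOOP, HADOOP]
    }
}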

Difference between collect_set & collect_list

hive (venkat_db)> describe purchases;
OK
salesrepid string
purchaseorderid int
amount int
hive (venkat_db)> select * from purchases;
OK
Daniel 8 160
Daniel 8 160
Jana 9 100
Nadia 10 660
Nadia 10 600
Jana 1 100
Nadia 2 200
Nadia 3 600
Daniel 4 80
Jana 5 120
William 6 170
Daniel 7 140
select salesrepid, collect_set(amount) from purchases group by salesrepid;
Daniel [160,80,140]
Jana [100,120]
Nadia [660,600,200]
William [170]
hive (venkat_db)> select salesrepid, collect_list(amount) from purchases group by salesrepid;
Daniel [160,160,80,140]
Jana [100,100,120]
Nadia [660,600,200,600]
William [170]

Tuesday, July 4, 2017

Java Transient Modifier


A transient modifier applied to a field tells Java that this attribute should be excluded when the object is being serialized. When the object is deserialized, the field is initialized with its default value (typically null for a reference type, or zero/false for a primitive type).


Where should we apply a transient modifier? When will it be useful?

Think of a User object. This User contains, among its other properties, a login, an email, and a password. When the data is serialized and transmitted over the network, there are obvious security reasons why we would not want to send the password field together with the rest of the object. In this case, marking the field as transient solves the problem.




In summary: Use transient when an object contains sensitive data that you do not want to transmit, or when it contains data that you can derive from other elements. Static fields are implicitly transient.
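
A minimal sketch of the User example above (the field values are made up): the password field is marked transient, so it comes back as null after a serialize/deserialize round trip.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {

    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String login;
        String email;
        transient String password;  // excluded from serialization

        User(String login, String email, String password) {
            this.login = login;
            this.email = email;
            this.password = password;
        }
    }

    public static void main(String[] args) throws Exception {
        User user = new User("alice", "alice@example.com", "s3cret");

        // Serialize the object to a byte array, then read it back.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(user);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            User copy = (User) in.readObject();
            System.out.println(copy.login);     // alice
            System.out.println(copy.password);  // null - the transient field was not written
        }
    }
}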