Tuesday, July 11, 2017

Hadoop Certification Plan

Hive AVRO
Hive Parquet

Hive Window Operations
Hive GroupBy

The header of a text/CSV file can be removed in two ways. First, inspect the data to confirm the header row is present:

stocksData.take(12).foreach(println)

Stick with the first approach below; the reason is noted under the second.

1)  mapPartitionsWithIndex
 Ex: stocksData.mapPartitionsWithIndex { (index, itr) => if (index == 0) itr.drop(1) else itr }.take(12).foreach(println)



2)  mapPartitions
 Ex: stocksData.mapPartitions(itr => itr.drop(1)).take(12).foreach(println)
 Note: this drops the first record of every partition, not just the header, so it is only safe when the RDD has a single partition. The sketch below shows the difference.
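To see why the first approach is the safe one, here is a minimal spark-shell sketch; the sample data and the two-partition split are made up for illustration:

// Sample RDD with a header row, forced into two partitions.
val sample = sc.parallelize(Seq(
  "symbol,date,close",
  "AAPL,2017-07-10,145.06",
  "GOOG,2017-07-10,928.80",
  "MSFT,2017-07-10,69.98"
), 2)

// Safe: drops the first record of partition 0 only (the header).
sample.mapPartitionsWithIndex { (index, itr) =>
  if (index == 0) itr.drop(1) else itr
}.collect.foreach(println)

// Unsafe here: drops the first record of every partition,
// so one data row is lost along with the header.
sample.mapPartitions(itr => itr.drop(1)).collect.foreach(println)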



Simple commands that need to be remembered:
  1. import sqlContext.implicits._
  2. spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
  3. import com.databricks.spark.avro._
  4. import org.apache.spark.sql.functions._

Different ways of reading a text file from HDFS, assuming the input source files are comma-separated.

// .toDF() requires import sqlContext.implicits._ (command 1 in the list above).
case class Orders(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)

val ordersDF = sc.textFile("/PATH").map(_.split(',')).map(p => Orders(p(0).toInt, p(1), p(2).toInt, p(3))).toDF()
val ordersDF2 = sc.textFile("/PATH").map(rec => { val rec1 = rec.split(','); Orders(rec1(0).toInt, rec1(1), rec1(2).toInt, rec1(3)) }).toDF()
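A third option, if you would rather not define a case class, is to build the schema explicitly and pass it to createDataFrame; a minimal sketch, with hypothetical column names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// Column names are assumptions; match them to your data.
val schema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("order_date", StringType),
  StructField("order_customer_id", IntegerType),
  StructField("order_status", StringType)
))

val rowRDD = sc.textFile("/PATH").map(_.split(',')).map(p => Row(p(0).toInt, p(1), p(2).toInt, p(3)))
val ordersFromSchema = sqlContext.createDataFrame(rowRDD, schema)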

       
 

Things you need to know when reading Parquet and Avro files.

Once you get a DataFrame out of a Parquet or Avro file, you can access field values positionally with rec(INDEX) or with typed getters such as rec.getInt(INDEX).

Example:
val ordersDF = sqlContext.read.parquet("PATH")

// Typed getters: the three numeric values are summed, then the string column is concatenated.
ordersDF.map(rec => rec.getInt(0) + rec.getLong(1) + rec.getInt(2) + rec.getString(3)).take(12).foreach(println)

// Positional access: rec(i) returns Any, so + acts as string concatenation here.
ordersDF.map(rec => rec(0) + "\t" + rec(1) + "\t" + rec(2) + "\t" + rec(3)).take(12).foreach(println)
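Command 4 from the list above (import org.apache.spark.sql.functions._) pays off once the DataFrame is loaded; a short sketch, assuming the Parquet file carries the usual retail_db column names (order_id, order_status):

import org.apache.spark.sql.functions._

// Column names are assumptions; adjust them to your schema.
ordersDF.select(col("order_id"), upper(col("order_status")).as("status")).show(5)
ordersDF.groupBy(col("order_status")).agg(count(col("order_id")).as("cnt")).show()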

import com.databricks.spark.avro._
val ordersDF = sqlContext.read.avro("CCA_Prep/avro/orders")
ordersDF.map(rec => rec(0) + "\t" + rec(1)).take(12).foreach(println)
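Writing goes through the matching writer APIs; a short sketch, with placeholder output paths:

// Parquet support is built in; the Avro writer comes from the com.databricks.spark.avro import above.
ordersDF.write.parquet("CCA_Prep/parquet/orders_out")
ordersDF.write.avro("CCA_Prep/avro/orders_out")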

       
 


Things to remember when importing data using Sqoop.

1) Sqoop import: MySQL data to a text file.

sqoop import --connect "" --username "" --password "" --table order_items --as-textfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

2) Sqoop import: MySQL data to an Avro data file.

sqoop import --connect "" --username "" --password "" --table order_items --as-avrodatafile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

3) Sqoop import: MySQL data to a Parquet data file.

sqoop import --connect "" --username "" --password "" --table order_items --as-parquetfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

Note: --fields-terminated-by and --lines-terminated-by only shape text output; Avro and Parquet are binary formats, so those two options have no effect in 2) and 3).
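To confirm an import worked, read the target directory back in spark-shell; a quick check, using hypothetical paths in place of the empty --target-dir placeholders above:

// Paths are assumptions; substitute whatever --target-dir you actually used.
sqlContext.read.parquet("/user/cloudera/order_items_parquet").show(5)

import com.databricks.spark.avro._
sqlContext.read.avro("/user/cloudera/order_items_avro").show(5)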

       
 



