Tuesday, July 11, 2017

Hadoop Certification Plan

Hive AVRO
Hive Parquet

Hive Window Operations
Hive GroupBy

The header of a text/CSV file can be removed in two ways. First, inspect the data to confirm the header row is present:

stocksData.take(12).foreach(println)

Stick with the first approach below; the reason is noted under the second.

1)  mapPartitionsWithIndex
 Ex: stocksData.mapPartitionsWithIndex { (index, itr) => if (index == 0) itr.drop(1) else itr }.take(12).foreach(println)



2)  mapPartitions
 Ex: stocksData.mapPartitions(itr => itr.drop(1)).take(12).foreach(println)
 Note: this drops the first record of every partition, not just the header, so it is only safe when the RDD has a single partition. The sketch below shows the difference.
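To see why the first approach is the safe one, here is a minimal spark-shell sketch; the sample data and the two-partition split are made up for illustration:

// Sample RDD with a header row, forced into two partitions.
val sample = sc.parallelize(Seq(
  "symbol,date,close",
  "AAPL,2017-07-10,145.06",
  "GOOG,2017-07-10,928.80",
  "MSFT,2017-07-10,69.98"
), 2)

// Safe: drops the first record of partition 0 only (the header).
sample.mapPartitionsWithIndex { (index, itr) =>
  if (index == 0) itr.drop(1) else itr
}.collect.foreach(println)

// Unsafe here: drops the first record of every partition,
// so one data row is lost along with the header.
sample.mapPartitions(itr => itr.drop(1)).collect.foreach(println)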



Simple commands that need to be remembered:
  1. import sqlContext.implicits._
  2. spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
  3. import com.databricks.spark.avro._
  4. import org.apache.spark.sql.functions._

Different ways of reading a text file from HDFS, assuming the input source files are comma-separated.

// .toDF() requires import sqlContext.implicits._ (command 1 in the list above).
case class Orders(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)

val ordersDF = sc.textFile("/PATH").map(_.split(',')).map(p => Orders(p(0).toInt, p(1), p(2).toInt, p(3))).toDF()
val ordersDF2 = sc.textFile("/PATH").map(rec => { val rec1 = rec.split(','); Orders(rec1(0).toInt, rec1(1), rec1(2).toInt, rec1(3)) }).toDF()
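A third option, if you would rather not define a case class, is to build the schema explicitly and pass it to createDataFrame; a minimal sketch, with hypothetical column names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// Column names are assumptions; match them to your data.
val schema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("order_date", StringType),
  StructField("order_customer_id", IntegerType),
  StructField("order_status", StringType)
))

val rowRDD = sc.textFile("/PATH").map(_.split(',')).map(p => Row(p(0).toInt, p(1), p(2).toInt, p(3)))
val ordersFromSchema = sqlContext.createDataFrame(rowRDD, schema)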

       
 

Things you need to know when reading Parquet and Avro files.

Once you get a DataFrame out of a Parquet or Avro file, you can access field values positionally with rec(INDEX) or with typed getters such as rec.getInt(INDEX).

Example:
val ordersDF = sqlContext.read.parquet("PATH")

// Typed getters: the three numeric values are summed, then the string column is concatenated.
ordersDF.map(rec => rec.getInt(0) + rec.getLong(1) + rec.getInt(2) + rec.getString(3)).take(12).foreach(println)

// Positional access: rec(i) returns Any, so + acts as string concatenation here.
ordersDF.map(rec => rec(0) + "\t" + rec(1) + "\t" + rec(2) + "\t" + rec(3)).take(12).foreach(println)
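Command 4 from the list above (import org.apache.spark.sql.functions._) pays off once the DataFrame is loaded; a short sketch, assuming the Parquet file carries the usual retail_db column names (order_id, order_status):

import org.apache.spark.sql.functions._

// Column names are assumptions; adjust them to your schema.
ordersDF.select(col("order_id"), upper(col("order_status")).as("status")).show(5)
ordersDF.groupBy(col("order_status")).agg(count(col("order_id")).as("cnt")).show()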

import com.databricks.spark.avro._
val ordersDF = sqlContext.read.avro("CCA_Prep/avro/orders")
ordersDF.map(rec => rec(0) + "\t" + rec(1)).take(12).foreach(println)
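Writing goes through the matching writer APIs; a short sketch, with placeholder output paths:

// Parquet support is built in; the Avro writer comes from the com.databricks.spark.avro import above.
ordersDF.write.parquet("CCA_Prep/parquet/orders_out")
ordersDF.write.avro("CCA_Prep/avro/orders_out")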

       
 


Things to remember when importing data using Sqoop.

1) Sqoop import: MySQL data to a text file.

sqoop import --connect "" --username "" --password "" --table order_items --as-textfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

2) Sqoop import: MySQL data to an Avro data file.

sqoop import --connect "" --username "" --password "" --table order_items --as-avrodatafile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

3) Sqoop import: MySQL data to a Parquet data file.

sqoop import --connect "" --username "" --password "" --table order_items --as-parquetfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''

Note: --fields-terminated-by and --lines-terminated-by only shape text output; Avro and Parquet are binary formats, so those two options have no effect in 2) and 3).
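To confirm an import worked, read the target directory back in spark-shell; a quick check, using hypothetical paths in place of the empty --target-dir placeholders above:

// Paths are assumptions; substitute whatever --target-dir you actually used.
sqlContext.read.parquet("/user/cloudera/order_items_parquet").show(5)

import com.databricks.spark.avro._
sqlContext.read.avro("/user/cloudera/order_items_avro").show(5)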

       
 



