Removing the header of a TEXT/CSV file can be done in two ways. First, preview the data (header included):
stocksData.take(12).foreach(println)
1) mapPartitionsWithIndex: drops the first line of partition 0 only, i.e. just the header. Stick with this approach.
Ex: stocksData.mapPartitionsWithIndex{(index, itr) => if (index == 0) itr.drop(1) else itr}.take(12).foreach(println)
2) mapPartitions: drops the first line of every partition, so it removes real records whenever the file spans more than one partition.
Ex: stocksData.mapPartitions(itr => itr.drop(1)).take(12).foreach(println)
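The difference between the two approaches can be sketched with plain Scala collections, simulating an RDD as a list of partitions (the partition contents below are invented for illustration):

```scala
// Simulate an RDD with two partitions; only partition 0 holds the header line.
val partitions = List(
  List("symbol,date,price", "AAPL,2016-01-04,105.35"),   // partition 0 (has header)
  List("GOOG,2016-01-04,741.84", "MSFT,2016-01-04,54.80") // partition 1 (data only)
)

// Approach 1 (mapPartitionsWithIndex): drop the first line of partition 0 only.
val withIndex = partitions.zipWithIndex.flatMap {
  case (itr, 0) => itr.drop(1)
  case (itr, _) => itr
}

// Approach 2 (mapPartitions): drops the first line of EVERY partition,
// silently losing a real record from partition 1.
val withoutIndex = partitions.flatMap(_.drop(1))

println(withIndex.size)    // 3 data records survive
println(withoutIndex.size) // only 2 records survive
```

This is why the first approach is the safe one: it only touches the partition that actually contains the header.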
Simple commands that need to be remembered:
- import sqlContext.implicits._
- spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
- import com.databricks.spark.avro._
- import org.apache.spark.sql.functions._
Different ways of reading a text file from HDFS, assuming the input source files are comma separated:
// Both examples assume a case class orders(...) with (Int, String, Int, String) fields is already defined.
val ordersDF = sc.textFile("/PATH").map(_.split(',')).map(p => orders(p(0).toInt, p(1), p(2).toInt, p(3))).toDF()
val rdd1 = sc.textFile("/Path").map(rec => { val rec1 = rec.split(','); orders(rec1(0).toInt, rec1(1), rec1(2).toInt, rec1(3)) }).toDF()
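The split-and-construct step can be tried without Spark. A minimal sketch, assuming a four-field orders case class (the field names and sample record are invented, since the original does not define the class):

```scala
// Assumed shape of the orders case class used in the examples above.
case class Orders(orderId: Int, orderDate: String, customerId: Int, status: String)

// Same logic as the map() above, applied to a single plain string.
def parseOrder(rec: String): Orders = {
  val f = rec.split(',')
  Orders(f(0).toInt, f(1), f(2).toInt, f(3))
}

val o = parseOrder("1,2013-07-25 00:00:00.0,11599,CLOSED")
println(o.orderId) // 1
println(o.status)  // CLOSED
```

In the Spark version, toDF() then turns the RDD of case-class instances into a DataFrame (which is why import sqlContext.implicits._ is needed).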
Things you need to know when reading Parquet and Avro files:
Once you get the DataFrame out of Parquet or Avro, you can access column values either positionally with rec(INDEX) (which returns Any) or with typed getters such as rec.getInt(INDEX).
Example:
val ordersDF = sqlContext.read.parquet("PATH")
ordersDF.map(rec => rec.getInt(0)+rec.getLong(1)+rec.getInt(2)+rec.getString(3)).take(12).foreach(println)
ordersDF.map(rec => rec(0)+"\t"+rec(1)+"\t"+rec(2)+"\t"+rec(3)).take(12).foreach(println)
import com.databricks.spark.avro._
val ordersDF = sqlContext.read.avro("CCA_Prep/avro/orders")
ordersDF.map(rec => rec(0)+"\t"+rec(1)).take(12).foreach(println)
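Because rec(INDEX) returns Any, these examples rely on string concatenation rather than typed access. The pattern can be sketched with a plain Seq[Any] standing in for a Row (the values are invented for illustration):

```scala
// A Row read from Parquet/Avro behaves like a positional container of Any values.
val rec: Seq[Any] = Seq(1, 2L, 3, "CLOSED")

// Untyped access: interpolate/concatenate the Any values into a line.
val line = s"${rec(0)}\t${rec(1)}\t${rec(2)}\t${rec(3)}"
println(line) // 1	2	3	CLOSED

// Typed access (like rec.getInt / rec.getString on a real Row) avoids Any:
val id = rec(0).asInstanceOf[Int]
println(id) // 1
```

Typed getters fail fast with a ClassCastException if the column type is not what you expect, which is usually preferable to silent string concatenation.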
Things to remember when importing data using Sqoop:
1) Sqoop import MYSQL data to the TEXT file.
sqoop import --connect "" --username "" --password "" --table order_items --as-textfile --target-dir "" --fields-terminated-by '' --lines-terminated-by ''
2) Sqoop import MYSQL data to the AVRO Data file. (Note: the delimiter options apply only to text imports and are ignored for Avro/Parquet.)
sqoop import --connect "" --username "" --password "" --table order_items --as-avrodatafile --target-dir ""
3) Sqoop import MYSQL data to the PARQUET Data file
sqoop import --connect "" --username "" --password "" --table order_items --as-parquetfile --target-dir ""
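As a worked example, a text-file import with every option filled in might look like the command below. The host, database, credentials, and paths are placeholders of my own, not values from the original notes:

```
sqoop import \
  --connect "jdbc:mysql://localhost:3306/retail_db" \
  --username "retail_user" \
  --password "secret" \
  --table order_items \
  --as-textfile \
  --target-dir "/user/cloudera/order_items_text" \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n'
```

Swapping --as-textfile for --as-avrodatafile or --as-parquetfile (and dropping the delimiter flags) gives the Avro and Parquet variants.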