Spark

Spark Datasets: Advantages and Limitations

Datasets are available to Spark Scala/Java users and offer more type safety than DataFrames. Python and R infer types at runtime, so those APIs cannot support Datasets. This post demonstrates how to create Datasets and describes the advantages of this data structure.

by bigdata4info, February 22, 2021 (updated March 5, 2021)
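The point about runtime typing in the excerpt above can be illustrated with a small sketch. This is plain Python, not Spark code; the `Person` class and both helper functions are hypothetical names made up for the illustration. It shows that a wrongly typed field is accepted when the record is built and only fails when the value is used, which is the kind of error a typed Dataset in Scala or Java would reject at compile time.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical record type; the annotations are hints, not enforced checks."""
    name: str
    age: int


def age_next_year(person: Person) -> int:
    # Fails only when this line actually runs with a non-numeric age.
    return person.age + 1


def try_age_next_year(person: Person) -> str:
    """Return the result as text, or the name of the runtime error raised."""
    try:
        return str(age_next_year(person))
    except TypeError as exc:
        return type(exc).__name__


good = Person(name="Ann", age=30)
bad = Person(name="Bob", age="thirty")  # accepted: no check at construction time

print(try_age_next_year(good))
print(try_age_next_year(bad))
```

The second call reports a `TypeError` instead of a number, and nothing flagged the bad record before the code ran.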

Different file formats of Spark/Hadoop

1. Text File: Create a text file by adding the storage option ‘STORED AS TEXTFILE’ at the end of a Hive CREATE TABLE command. Ex: CREATE TABLE textfile_table (column_specs) STORED AS TEXTFILE; 2. Sequence File: Sequence files are Hadoop flat files that store values as binary key-value pairs. Sequence files are in binary format, and these files…

by bigdata4info, February 27, 2021 (updated March 5, 2021)

Spark Resource Allocation

Note: This post is for my own learning on the topic and as a warm-up before interviews. Everything in the post was gathered from various books and websites; it is for my understanding only and not for any business use. Resource allocation is an important aspect during the execution of any Spark…

by bigdata4info, June 25, 2018 (updated March 5, 2021)

Spark Interview Questions

All important questions. 1. Define partitions. A. Partitioning is the process of deriving logical units of data to speed up processing. Everything in Spark is a partitioned RDD. 2. What do you understand by transformations in Spark? A. Transformations are functions applied to an RDD that result in another RDD. It does…

by bigdata4info, July 1, 2018 (updated March 5, 2021)
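The two answers in the Spark Interview Questions excerpt above (data split into logical partitions, and transformations that produce a new collection from an existing one) can be sketched in plain Python. This is only an illustration under simplifying assumptions: it does not use Spark or reproduce its RDD API, and `split_into_partitions` and `map_partitions` are hypothetical helper names.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_into_partitions(data: List[T], num_partitions: int) -> List[List[T]]:
    """Split data into num_partitions contiguous chunks of near-equal size."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    size, extra = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partitions(partitions: List[List[T]], fn: Callable[[T], U]) -> List[List[U]]:
    """Apply fn to every element, returning new partitions; the input is not modified."""
    return [[fn(x) for x in part] for part in partitions]


numbers = list(range(1, 11))
parts = split_into_partitions(numbers, 3)
squared = map_partitions(parts, lambda x: x * x)

print(parts)
print(squared)
```

Each inner list stands in for one partition that could be processed independently, and the mapping step returns a new structure without changing `parts`.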

Spark Notes

Note: This post is for my own learning on the topic and as a warm-up before interviews. Everything in the post was gathered from various books and websites; it is for my understanding only and not for any business use. Important Points: An RDD can be created in 2 ways: Parallelize a collection…