Apache Spark

What are the main features of Apache Spark?
What is a Resilient Distribution Dataset in Apache Spark?
What is a Transformation in Apache Spark?
What are security options in Apache Spark?
How will you monitor Apache Spark?
What are the main libraries of Apache Spark?
What are the main functions of Spark Core in Apache Spark?
How will you do memory tuning in Spark?
What are the two ways to create RDD in Spark?
What are the main operations that can be done on a RDD in Apache Spark?
What are the common Transformations in Apache Spark?
What are the common Actions in Apache Spark?
What is a Shuffle operation in Spark?
What are the operations that can cause a shuffle in Spark?
What is purpose of Spark SQL?
What is a DataFrame in Spark SQL?
What is a Parquet file in Spark?
What is the difference between Apache Spark and Apache Hadoop MapReduce?
What are the main languages supported by Apache Spark?
What are the file systems supported by Spark?
What is a Spark Driver?
What is an RDD Lineage?
What are the two main types of Vector in Spark?
What are the different deployment modes of Apache Spark?
What is lazy evaluation in Apache Spark?
What are the core components of a distributed application in Apache Spark?
What is the difference in cache() and persist() methods in Apache Spark?
How will you remove data from cache in Apache Spark?
What is the use of SparkContext in Apache Spark?
Do we need HDFS for running Spark application?
What is Spark Streaming?
How does Spark Streaming work internally?
What is a Pipeline in Apache Spark?
How does Pipeline work in Apache Spark?
What is the difference between Transformer and Estimator in Apache Spark?
What are the different types of Cluster Managers in Apache Spark?
How will you minimize data transfer while working with Apache Spark?
What is the main use of MLib in Apache Spark?
What is the Checkpointing in Apache Spark?
What is an Accumulator in Apache Spark?
What is a Broadcast variable in Apache Spark?
What is Structured Streaming in Apache Spark?
How will you pass functions to Apache Spark?
What is a Property Graph?
What is Neighborhood Aggregation in Spark?
What are different Persistence levels in Apache Spark?
How will you select the storage level in Apache Spark?
What are the options in Spark to create a Graph?
What are the basic Graph operators in Spark?
What is the partitioning approach used in GraphX of Apache Spark?
What is RDD?
Name the different types of RDD
What are the methods of creating RDDs in Spark?
What is a Sparse Vector?
What are the languages supported by Apache Spark and which is the most popular one, What is JDBC and why it is popular?
What is Yarn?
Do you need to install Spark on all nodes of Yarn cluster? Why?
Is it possible to run Apache Spark on Apache Mesos?
Define Partitions in Apache Spark
What is a DStream?
What is a Catalyst framework?
What are Actions in Spark?
What is a Parquet file?
What is GraphX?
What file systems does Spark support?
What are the different types of transformations on DStreams? Explain.
What is the difference between persist () and cache ()?
What do you understand by SchemaRDD?
What is Apache Spark?
Explain key features of Spark.
Define RDD?
What does a Spark Engine do?
Define Partitions?
What do you understand by Transformations in Spark?
Define Actions.
Define functions of SparkCore?
What is RDD Lineage?
What is Spark Driver?
What is Hive on Spark?
Define Spark Streaming.
What is Spark SQL?
List the functions of Spark SQL?
What are benefits of Spark over MapReduce?
What is Spark Executor?
What do you understand by worker node?
Illustrate some demerits of using Spark.
What is the advantage of a Parquet file?
What are different o/p methods to get result?
What are two ways to attain a schema from data?
Why should you define your own schema?
Why is JSON a common format in big data pipelines?
By default, how are corrupt records dealt with using spark.read.json()?
Explain the key features of Apache Spark.
What are benefits of Spark over MapReduce?
What is YARN?
Do you need to install Spark on all nodes of YARN cluster?
Is there any benefit of learning MapReduce if Spark is better than MapReduce?
Explain the concept of Resilient Distributed Dataset (RDD).
How do we create RDDs in Spark?
What is Executor Memory in a Spark application?
Define Partitions in Apache Spark.
What operations does RDD support?
What do you understand by Transformations in Spark?
Define Actions in Spark.
Define functions of SparkCore.
Memory management and fault recovery
What do you understand by Pair RDD?
How is Streaming implemented in Spark? Explain with examples.
Is there an API for implementing graphs in Spark?
What is PageRank in GraphX?
How is machine learning implemented in Spark?
Is there a module to implement SQL in Spark? How does it work?
What are receivers in Apache Spark Streaming?
What do you understand by Shuffling in Spark?
How is Apache Spark different from MapReduce?
Explain the working of Spark with the help of its architecture.
What is the working of DAG in Spark?
Under what scenarios do you use Client and Cluster modes for deployment?
What is Spark Streaming and how is it implemented in Spark?
What can you say about Spark Datasets?
Define Spark DataFrames.
Define Executor Memory in Spark
What are the functions of SparkCore?
What do you understand by worker node?
What are some demerits of using Spark in applications?
How can the data transfers be minimized while working with Spark?
What is SchemaRDD in Spark RDD?
What module is used for implementing SQL in Apache Spark?
What are the steps to calculate the executor memory?
Why do we need broadcast variables in Spark?
Can Apache Spark be used along with Hadoop? If yes, then how?
What are Sparse Vectors? How are they different from dense vectors?
How are automatic clean-ups triggered in Spark for handling the accumulated metadata?
How is Caching relevant in Spark Streaming?
Define Piping in Spark.
What API is used for Graph Implementation in Spark?
How can you achieve machine learning in Spark?
What are the limitations of Spark?
Compare Hadoop and Spark.
What is lazy evaluation in Spark?
What are the benefits of lazy evaluation?
What do you mean by Persistence?
Explain the run time architecture of Spark?
What is the difference between DSM and RDD?
How can data transfer be minimized when working with Apache Spark?
How does Apache Spark handles accumulated Metadata?
What are the common faults of the developer while using Apache Spark?
Which among the two is preferable for the project- Hadoop MapReduce or Apache Spark?
List the popular use cases of Apache Spark.
What is Spark.executor.memory in a Spark Application?
What is DataFrames?
What are the advantages of DataFrame?
What is DataSet?
What are the advantages of DataSets?
Explain Catalyst framework.
What is DStream?
Explain different transformation on DStream.
What is written ahead log or journaling?
Explain first operation in Apache Spark RDD.
Describe join operation. How is outer join supported?
Describe coalesce operation. When can you coalesce to a larger number of partitions? Explain.
Describe Partition and Partitioner in Apache Spark.
How can you manually partition the RDD?
Explain API create Or Replace TempView.
What are the various advantages of DataFrame over RDD in Apache Spark?
What is a DataSet and what are its advantages over DataFrame and RDD?
On what all basis can you differentiate RDD and DataFrame and DataSet?
Explain the level of parallelism in Spark Streaming.
Discuss writeahead logging in Apache Spark Streaming.
What do you mean by Speculative execution in Apache Spark?
How do you parse data in XML? Which kind of class do you use with java to pass data?
Explain Machine Learning library in Spark.
List various commonly used Machine Learning Algorithm.
Explain the Parquet File format in Apache Spark. When is it the best to choose this?
What is Lineage Graph?
How can you Trigger Automatic Cleanups in Spark to Handle Accumulated Metadata?
What are the benefits of using Spark With Apache Mesos?
What is the Significance of Sliding Window Operation?
When running Spark Applications is it necessary to install Spark on all Nodes of Yarn Cluster?
What is Catalyst Framework?
Which Spark Library allows reliable File Sharing at Memory Speed across different cluster frameworks?
Why is Blinkdb used?
How can you compare Hadoop and Spark in terms of ease of use?
What are the common mistakes developers make when running Spark Applications?
What are the various Data Sources available in Sparksql?
What are the Key Features of Apache Spark that you like?
What do you understand by Pair Rdd?
Explain about different Types of Transformations on Dstreams?
Explain about popular use cases of Apache Spark?
Is Apache Spark a good fit for reinforcement Learning?
What is Spark Core?
How can you remove the elements with a Key present in any other Rdd?
What is the difference between Persist and Cache?
How Spark handles Monitoring and Logging in Standalone Mode?
Does Apache Spark provide check pointing?
How can you launch Spark Jobs inside Hadoop Mapreduce?
How can you achieve High Availability in Apache Spark?
Hadoop uses Replication to achieve Fault Tolerance and how is this achieved in Apache Spark?
Explain about Core Components of a distributed Spark Application?
What do you understand by Lazy Evaluation?
Define a Worker Node?
What do you understand by Schemardd?
What are the disadvantages of using Apache Spark over Hadoop Mapreduce?
Is it necessary to install Spark on all Nodes of Yarn Cluster while running Apache Spark on Yarn?
What do you understand by Executor Memory in Spark Application?
What does the Spark Engine do?
What makes Apache Spark good at Low latency Workloads like Graph Processing and Machine Learning?
What is Dstream in Apache Spark?
What do you understand by YARN?
Is it necessary to install Spark on all nodes of the YARN cluster?
What are the different data sources available in SparkSQL?
Which are some important internal daemons used in Apache Spark?
What is the method to create a Data frame in Apache Spark?
What do you understand by accumulators in Apache Spark?
What is the default level of parallelism in Apache Spark?
Which companies are using Spark streaming services?
Is it possible to use Spark to access and analyze data stored in Cassandra databases?
Can we run Apache Spark on Apache Mesos?
What do you understand by Spark SQL?
How can you connect Spark to Apache Mesos?
What is the best way to minimize data transfers when working with Spark?
What do you understand by lazy evaluation in Apache Spark?
What do you understand by Spark Driver?
What is the Parquet file in Apache Spark?
What is the way to store the data in Apache Spark?
How is it possible to implement machine learning in Apache Spark?
What are some disadvantages or demerits of using Apache Spark?
What is the use of File system API in Apache Spark?
What are the tasks of a Spark Engine?
What is the use of Apache SparkContext?
Is it possible to do real-time processing with SparkSQL?
What is the use of Akka in Apache Spark?
What do you understand by Spark map() Transformation?
What is the advantage of using the Parquet file?
What is the difference between persist() and cache() functions in Apache Spark?
Which Spark libraries allow reliable file sharing at memory speed across different cluster frameworks?
What is shuffling in Apache Spark? When does it occur?
What is the lineage in Spark?
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
Is it possible to launch Spark jobs inside Hadoop MapReduce?
What is the use of BlinkDB in Spark?

What are the main features of Apache Spark?

Main features of Apache Spark are as follows:

Performance: The key feature of Apache Spark is its Performance. With Apache Spark we can run programs up to 100 times faster than Hadoop MapReduce in memory. On disk we can run it 10 times faster than Hadoop.
Ease of Use: Spark supports Java, Python, R, Scala etc. languages. So it makes it much easier to develop applications for Apache Spark.
Integrated Solution: In Spark we can create an integrated solution that combines the power of SQL, Streaming and data analytics. R+ un Everywhere: Apache Spark can run on many platforms. It can run on Hadoop, Mesos, in Cloud or standalone. It can also connect to many data sources like HDFS, Cassandra, HBase, S3 etc.
Stream Processing: Apache Spark also supports real time stream processing. With real time streaming we can provide real time analytics solutions. This is very useful for real-time data.

What is a Resilient Distribution Dataset in Apache Spark?

Resilient Distribution Dataset (RDD) is an abstraction of data in Apache Spark. It is a distributed and resilient collection of records spread over many partitions. RDD hides the data partitioning and distribution behind the scenes. Main features of RDD are as follows:

Distributed: Data in a RDD is distributed across multiple nodes.
Resilient: RDD is a fault- tolerant dataset. In case of node failure, Spark can re- compute data.
Dataset: It is a collection of data similar to collections in Scala.
Immutable: Data in RDD cannot be modified after creation. But we can transform it using a Transformation.

Files

spark.md

Latest commit

History

spark.md

File metadata and controls

Main title

Interview questions

Apache Spark

What are the main features of Apache Spark?

What is a Resilient Distribution Dataset in Apache Spark?

What is a Transformation in Apache Spark?

What are security options in Apache Spark?

How will you monitor Apache Spark?

What are the main libraries of Apache Spark?

What are the main functions of Spark Core in Apache Spark?

How will you do memory tuning in Spark?

What are the two ways to create RDD in Spark?

What are the main operations that can be done on a RDD in Apache Spark?

What are the common Transformations in Apache Spark?

What are the common Actions in Apache Spark?

What is a Shuffle operation in Spark?

What are the operations that can cause a shuffle in Spark?

What is purpose of Spark SQL?

What is a DataFrame in Spark SQL?

What is a Parquet file in Spark?

What is the difference between Apache Spark and Apache Hadoop MapReduce?

What are the main languages supported by Apache Spark?

What are the file systems supported by Spark?

What is a Spark Driver?

What is an RDD Lineage?

What are the two main types of Vector in Spark?

What are the different deployment modes of Apache Spark?

What is lazy evaluation in Apache Spark?

What are the core components of a distributed application in Apache Spark?

What is the difference in cache() and persist() methods in Apache Spark?

How will you remove data from cache in Apache Spark?

What is the use of SparkContext in Apache Spark?

Do we need HDFS for running Spark application?

What is Spark Streaming?

How does Spark Streaming work internally?

What is a Pipeline in Apache Spark?

How does Pipeline work in Apache Spark?

What is the difference between Transformer and Estimator in Apache Spark?

What are the different types of Cluster Managers in Apache Spark?

How will you minimize data transfer while working with Apache Spark?

What is the main use of MLib in Apache Spark?

What is the Checkpointing in Apache Spark?

What is an Accumulator in Apache Spark?

What is a Broadcast variable in Apache Spark?

What is Structured Streaming in Apache Spark?

How will you pass functions to Apache Spark?

What is a Property Graph?

What is Neighborhood Aggregation in Spark?

What are different Persistence levels in Apache Spark?

How will you select the storage level in Apache Spark?

What are the options in Spark to create a Graph?

What are the basic Graph operators in Spark?

What is the partitioning approach used in GraphX of Apache Spark?

What is RDD?

Name the different types of RDD

What are the methods of creating RDDs in Spark?

What is a Sparse Vector?

What are the languages supported by Apache Spark and which is the most popular one, What is JDBC and why it is popular?

What is Yarn?

Do you need to install Spark on all nodes of Yarn cluster? Why?

Is it possible to run Apache Spark on Apache Mesos?

Define Partitions in Apache Spark

What is a DStream?

What is a Catalyst framework?

What are Actions in Spark?

What is a Parquet file?

What is GraphX?

What file systems does Spark support?

What are the different types of transformations on DStreams? Explain.

What is the difference between persist () and cache ()?

What do you understand by SchemaRDD?

What is Apache Spark?

Explain key features of Spark.

Define RDD?