Apache Spark Interview Questions and Answers


Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


1)How does Spark relate to Apache Hadoop?

Answer)Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.


2)Who is using Spark in production?


Answer)As of 2016, surveys show that more than 1000 organizations are using Spark in production. Some of them are listed on the Powered By page and at the Spark Summit.

3)How large a cluster can Spark scale to?


Answer)Many organizations run Spark on clusters of thousands of nodes. The largest cluster we know of has about 8000 nodes. In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data.

4)Does my data need to fit in memory to use Spark?


Answer)No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.


5)How can I run Spark on a cluster?


Answer)You can use either the standalone deploy mode, which only needs Java to be installed on each node, or the Mesos and YARN cluster managers. If you'd like to run on Amazon EC2, AMPLab provides EC2 scripts to automatically launch a cluster. Note that you can also run Spark locally (possibly on multiple cores) without any special setup by just passing local[N] as the master URL, where N is the number of parallel threads you want.
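For example, a minimal sketch (the application name and thread count are illustrative) of starting Spark locally with a local[N] master URL, using the Spark 2.x SparkSession API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalExample")   // hypothetical application name
  .master("local[4]")        // run locally with 4 parallel threads
  .getOrCreate()
val sc = spark.sparkContext  // underlying SparkContext, if needed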

6)Do I need Hadoop to run Spark?


Answer)No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

7)Does Spark require modified versions of Scala or Python?


Answer)No. Spark requires no changes to Scala or compiler plugins. The Python API uses the standard CPython implementation, and can call into existing C libraries for Python such as NumPy.


8)We understand Spark Streaming uses micro-batching. Does this increase latency?


Answer)While Spark does use a micro-batch execution model, this does not have much impact on applications, because the batches can be as short as 0.5 seconds. In most applications of streaming big data, the analytics is done over a larger window (say 10 minutes), or the latency to get data in is higher (e.g. sensors collect readings every 10 seconds). Spark's model enables exactly-once semantics and consistency, meaning the system gives correct results despite slow nodes or failures.

9)Why is Spark good at low-latency iterative workloads, e.g. graphs and machine learning?


Answer)Machine learning algorithms, for instance logistic regression, require many iterations before producing an optimal model, and the same holds for graph algorithms that traverse all the nodes and edges. Any algorithm that needs many iterations before producing a result gains performance when the intermediate partial results are stored in memory or on very fast solid-state drives, which is exactly what Spark's in-memory computation provides.


10)What kinds of data processing does Spark support?


Answer)Spark supports three kinds of data processing: batch, interactive (Spark shell), and stream processing, all with a unified API and data structures.


11)What are the ways to configure Spark properties, ordered from the least important to the most important?


Answer)There are the following ways to set up properties for Spark and user programs, in the order of importance from the least important to the most important:
1. conf/spark-defaults.conf - the default properties file
2. --conf - the command-line option used by spark-shell and spark-submit
3. SparkConf - properties set programmatically in the application
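As an illustrative sketch (the property values and application name are hypothetical), settings made programmatically through SparkConf take precedence over --conf and spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ConfigExample")          // hypothetical application name
  .setMaster("local[2]")
  .set("spark.executor.memory", "2g")   // overrides --conf and spark-defaults.conf
val sc = new SparkContext(conf)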

12)What is the Default level of parallelism in Spark?


Answer)The default level of parallelism is the number of partitions Spark uses when the user does not specify one explicitly.


13)Is it possible to have multiple SparkContexts in a single JVM?


Answer)Yes, if spark.driver.allowMultipleContexts is set to true (default: false). When it is true, Spark logs a warning instead of throwing an exception when a new SparkContext is created while another one is already active in the same JVM.
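A minimal sketch of setting this flag, assuming the Spark 1.x/2.x SparkContext API; the flag was mainly intended for tests, and newer Spark releases may no longer honour it:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MultiContextExample")                   // hypothetical application name
  .setMaster("local[2]")
  .set("spark.driver.allowMultipleContexts", "true")   // log a warning instead of throwing
val sc = new SparkContext(conf)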


14)What is the advantage of broadcasting values across Spark Cluster?


Answer)Spark transfers the broadcast value to each executor only once, and tasks running on that executor can then share it without repeated network transfers every time it is requested.

15)How do you disable INFO messages when running a Spark application?


Answer)Modify the log4j.properties file under the $SPARK_HOME/conf directory and change the logging level from INFO to ERROR.
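For instance, a sketch of the relevant line in $SPARK_HOME/conf/log4j.properties (copy the file from log4j.properties.template first if it does not exist; newer Spark versions that use Log4j 2 configure this differently):

# Set everything to be logged to the console at ERROR level instead of INFO
log4j.rootCategory=ERROR, console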

16)How do you evaluate and tune a Spark application? For example, given access to a cluster of 12 nodes, where each node has 2 Intel(R) Xeon(R) CPU E5-2650 2.00GHz processors and each processor has 8 cores, what criteria help in tuning the application and observing its performance?


Answer)For tuning your application you need to know a few things:
1. Monitor your application to see whether the cluster is under-utilized and how many resources the application you have created actually uses. Monitoring can be done using various tools, e.g. Ganglia; from Ganglia you can find CPU, memory and network usage.
2. Based on the observations about CPU and memory usage you can get a better idea of what kind of tuning your application needs. From the Spark point of view, in spark-defaults.conf you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and even which garbage collection algorithm to use. Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC

17)How do you define RDD?


Answer)A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
Resilient: fault-tolerant, able to recompute missing or damaged partitions on node failure with the help of the RDD lineage graph.
Distributed: data is spread across the nodes of a cluster.
Dataset: a collection of partitioned data.


18)What does lazy evaluated RDD mean?


Answer)Lazy evaluated means the data inside an RDD is not computed or transformed until an action is executed that triggers the execution.

19)How would you control the number of partitions of a RDD?


Answer)You can control the number of partitions of an RDD using the repartition or coalesce operations.
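A short sketch (the input path is hypothetical) showing both operations:

val rdd = sc.textFile("/data/input")     // hypothetical input path
rdd.getNumPartitions                     // current number of partitions
val more  = rdd.repartition(100)         // full shuffle; can increase or decrease partitions
val fewer = rdd.coalesce(10)             // avoids a shuffle by default; only decreases partitions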


20)Data is spread across all the nodes of the cluster; how does Spark try to process this data?


Answer)By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks, which helps optimize transformation operations.

21)What is coalesce transformation?


Answer)The coalesce transformation is used to change the number of partitions. It can trigger RDD shuffling depending on the second boolean shuffle input parameter (defaults to false).

22)What is the difference between the cache() and persist() methods of an RDD?


Answer)RDDs can be cached (using the RDD's cache() operation) or persisted (using the RDD's persist(newLevel: StorageLevel) operation). The cache() operation is a synonym for persist() that uses the default storage level MEMORY_ONLY.
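A minimal sketch (the paths are hypothetical) contrasting the two calls:

import org.apache.spark.storage.StorageLevel

val a = sc.textFile("/data/a")             // hypothetical path
a.cache()                                  // shorthand for persist(StorageLevel.MEMORY_ONLY)

val b = sc.textFile("/data/b")
b.persist(StorageLevel.MEMORY_AND_DISK)    // explicit storage level; spills to disk if needed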

23)What is Shuffling?


Answer)Shuffling is the process of repartitioning (redistributing) data across partitions, which may require moving it across JVMs or even across the network when it is redistributed among executors. Avoid shuffling at all cost: think about ways to leverage existing partitions, and leverage partial aggregation to reduce data transfer.


24)What is the difference between groupByKey and reduceByKey?

Answer)Avoid groupByKey and use reduceByKey or combineByKey instead. groupByKey shuffles all the data, which is slow. reduceByKey shuffles only the results of the sub-aggregations in each partition of the data.
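A word-count sketch (the input path is hypothetical) that illustrates the difference:

val pairs = sc.textFile("/data/words")     // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// groupByKey ships every (word, 1) pair across the network before summing
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first and shuffles only the partial sums
val fast = pairs.reduceByKey(_ + _)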


25)What is checkpointing?


Answer)Checkpointing is the process of truncating an RDD's lineage graph and saving its data to a reliable distributed (e.g. HDFS) or local file system. RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.
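A minimal sketch (the checkpoint directory and input path are hypothetical):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")          // hypothetical reliable directory
val rdd = sc.textFile("/data/input").map(_.toUpperCase)
rdd.checkpoint()      // marks the RDD for checkpointing
rdd.count()           // the action materializes the checkpoint and truncates the lineage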

26)Define Spark architecture?


Answer)Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called the master, which manages workers in which executors run. The driver and the executors run in their own Java processes.


27)What are the workers?


Answer)Workers or slaves are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized/marshalled tasks that it runs in a thread pool.


28)Please explain how workers work when a new job is submitted to them.


Answer)When the SparkContext is created, each worker starts one executor. The executor is a separate Java process (a new JVM) that loads the application jar. The executors then connect back to your driver program, and the driver sends them commands such as foreach, filter, map, etc. As soon as the driver quits, the executors shut down.


29) Please define executors in detail?


Answer)Executors are distributed agents responsible for executing tasks. Executors provide in-memory storage for RDDs that are cached in Spark applications. When executors are started they register themselves with the driver and communicate directly to execute tasks.


30)What is the DAGScheduler and how does it work?


Answer)DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution.

31)What is a stage, with regard to Spark job execution?


Answer)A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.

32)What is speculative execution of tasks?


Answer)Speculative tasks, or stragglers, are tasks that run slower than most of the other tasks in a job. Speculative execution of tasks is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in a stage than the median of all successfully completed tasks in the task set. Such slow tasks are re-launched on another worker. Spark does not stop the slow task, but runs a new copy of it in parallel.


33)Which cluster managers can be used with Spark?


Answer)Apache Mesos, Hadoop YARN, Spark standalone


34)What is Data locality / placement?


Answer)Spark relies on data locality, or data placement, i.e. proximity to the data source, which makes Spark jobs sensitive to where the data is located. It is therefore important to have Spark running on a Hadoop YARN cluster if the data comes from HDFS. With HDFS, the Spark driver contacts the NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits), and then schedules the work to the Spark workers. Spark's compute nodes / workers should be running on the storage nodes.


35)What is a Broadcast Variable?


Answer)Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
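A small sketch (the lookup table is hypothetical) of using a broadcast variable:

val lookup = Map("a" -> 1, "b" -> 2)            // small read-only table (hypothetical)
val bc = sc.broadcast(lookup)                   // shipped to each executor once

val enriched = sc.parallelize(Seq("a", "b", "c"))
  .map(k => (k, bc.value.getOrElse(k, 0)))      // tasks read the executor-local copy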


36)How can you define Spark Accumulators?

Answer)These are similar to counters in the Hadoop MapReduce framework; they give information such as the status of task completion or how much data has been processed.
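A minimal sketch using the Spark 2.x longAccumulator API (the input path is hypothetical):

val badRecords = sc.longAccumulator("badRecords")                  // named counter, visible in the web UI

sc.textFile("/data/input")                                         // hypothetical path
  .foreach(line => if (line.trim.isEmpty) badRecords.add(1))       // tasks add in parallel

println(badRecords.value)                                          // read the total on the driver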


37)What is Apache Spark Streaming?


Answer)Spark Streaming helps to process live stream data. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.


38)Explain transformations and actions in the context of RDDs.


Answer)Transformations are functions executed on demand to produce a new RDD. All transformations are eventually followed by actions. Some examples of transformations include map, filter and reduceByKey. Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
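A tiny sketch that shows transformations building a lazy plan and an action triggering it:

val nums = sc.parallelize(1 to 10)
val squares = nums.map(n => n * n)          // transformation: nothing runs yet
val evens = squares.filter(_ % 2 == 0)      // transformation: still lazy
val result = evens.collect()                // action: executes the plan and returns the results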


39)Can you use Spark to access and analyse data stored in Cassandra databases?

Answer)Yes, it is possible if you use Spark Cassandra Connector.


40)Is it possible to run Apache Spark on Apache Mesos?


Answer)Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

41)How can you minimize data transfers when working with Spark?


Answer)Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
1. Using broadcast variables - a broadcast variable enhances the efficiency of joins between small and large RDDs.
2. Using accumulators - accumulators help update the values of variables in parallel while executing.
3. The most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.


42)What is the significance of Sliding Window operation?


Answer)The Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that particular window are combined and operated upon to produce new RDDs of the windowed DStream. This lets you compute results over the last N seconds of data rather than over a single batch.
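A sketch of a windowed word count, assuming a socket source on localhost:9999 and illustrative batch, window and slide durations:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))           // 10-second batches
val lines = ssc.socketTextStream("localhost", 9999)       // hypothetical source
val windowedCounts = lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20))
  // counts over the last 60 seconds, recomputed every 20 seconds
windowedCounts.print()
ssc.start()
ssc.awaitTermination()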


43)What is a DStream?

Answer)A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations: transformations, which produce a new DStream, and output operations, which write data to an external system.


44)Which one will you choose for a project –Hadoop MapReduce or Apache Spark?


Answer)The answer to this question depends on the given project scenario. It is known that Spark makes use of memory instead of network and disk I/O; however, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop MapReduce or Spark varies with the requirements of the project and the budget of the organization.


45)What are the various levels of persistence in Apache Spark?

Answer)Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store RDDs on disk or in memory, or as a combination of both, with different replication levels. The various storage/persistence levels in Spark are:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP


46)Explain the difference between Spark SQL and Hive?

Answer)Spark SQL is faster than Hive.
Any Hive query can easily be executed in Spark SQL, but vice-versa is not true.
Spark SQL is a library whereas Hive is a framework.
It is not mandatory to create a metastore in Spark SQL, but it is mandatory to create a Hive metastore.
Spark SQL automatically infers the schema, whereas in Hive the schema needs to be explicitly declared.


47)What does a Spark Engine do?


Answer)Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.


48)What are benefits of Spark over MapReduce?

Answer)Due to the availability of in-memory processing, Spark performs the processing around 10-100x faster than Hadoop MapReduce, which makes use of persistent storage for all of its data processing tasks. Unlike Hadoop, Spark provides built-in libraries to perform multiple kinds of workloads from the same core: batch processing, streaming, machine learning, and interactive SQL queries, whereas Hadoop only supports batch processing. Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage. Spark is capable of performing computations multiple times on the same dataset; this is called iterative computation, and there is no iterative computing implemented by Hadoop.


49)What is Spark Driver?


Answer)The Spark driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs of data. In simple terms, a driver in Spark creates the SparkContext, connected to a given Spark master. The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.

50)What is a DataFrame?


Answer)A DataFrame is a collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but more optimized. Just like RDDs, DataFrames are evaluated lazily, which allows Spark to optimize the execution by applying techniques such as bytecode generation and predicate push-down.
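A minimal sketch (the file path and column names are hypothetical), using the Spark 2.x SparkSession API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameExample").getOrCreate()
val people = spark.read.json("/data/people.json")      // hypothetical path; schema is inferred
people.printSchema()
people.filter(people("age") > 21)                      // Catalyst optimizes this before execution
  .select("name")
  .show()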

51)What are the advantages of DataFrame?


Answer)It makes processing of large data sets even easier. A DataFrame also allows developers to impose a structure onto a distributed collection of data, which enables a higher-level abstraction.
DataFrames are both space and performance efficient.
They can deal with various data formats, for example Avro, CSV, etc., and with storage systems like HDFS, Hive tables, MySQL, etc.
The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.
DataFrames provide Hive compatibility; as a result, unmodified Hive queries can be run on an existing Hive warehouse.
Catalyst tree transformation uses DataFrames in four phases: a) analyzing the logical plan to resolve references, b) logical plan optimization, c) physical planning, and d) code generation to compile part of the query to Java bytecode.
A DataFrame can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.


52)What is a write-ahead log (journaling)?

Answer)The write-ahead log is a technique that provides durability in a database system. Every operation that is applied to the data is first written to the write-ahead log, and these logs are durable in nature. Thus, when a failure occurs, the data can easily be recovered from the logs. When the write-ahead log is enabled, Spark stores the received data in a fault-tolerant file system.


53)How does the Spark Streaming API work?


Answer)The programmer sets a specific batch interval in the configuration; whatever data arrives within that interval is separated out as a batch. The input stream (DStream) goes into Spark Streaming, which breaks it up into small chunks called batches and feeds them to the Spark engine for processing. The Spark Streaming API passes these batches to the core engine, which generates the final results in the form of streaming batches; the output is also in the form of batches. This allows streaming data and batch data to be processed in the same way.


54)If there is certain data that we want to use again and again in different transformations, what will improve the performance?

Answer)The RDD can be persisted or cached. There are various ways in which it can be persisted: in memory, on disk, etc. So, if there is a dataset that needs a good amount of computing to arrive at, you should consider caching it. You can cache it to disk if preparing it again is far costlier than just reading it from disk, or if it is very large and would not fit in RAM. You can cache it to memory if it fits into memory.


55)What happens to RDD when one of the nodes on which it is distributed goes down?


Answer)Since Spark knows how to prepare a certain dataset, because it is aware of the various transformations and actions that have led to it, it will be able to apply the same transformations and actions to rebuild the lost partitions from the node that has gone down.

56)How do I skip a header from CSV files in Spark?


Answer)Spark 2.x: spark.read.format("csv").option("header", "true").load("filePath")


57)Have you ever encountered a Spark java.lang.OutOfMemoryError? How do you fix this issue?


Answer)I have a few suggestions:
1. If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g: spark.executor.memory=6g. Make sure you're using as much memory as possible by checking the UI (it will say how much memory you're using).
2. Try using more partitions; you should have 2 - 4 per CPU. IME increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need way more than 4 per CPU; I've had to use 8000 partitions in some cases!
3. Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don't use cache() or persist in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME reducing the memory fraction often makes OOMs go away. UPDATE: from Spark 1.6 we apparently no longer need to play with these values; Spark determines them automatically.
4. Similar to the above, but for the shuffle memory fraction. If your job doesn't need much shuffle memory then set it to a lower value (this might cause your shuffles to spill to disk, which can have a catastrophic impact on speed). Sometimes when it's a shuffle operation that's OOMing you need to do the opposite, i.e. set it to something large, like 0.8, or make sure you allow your shuffles to spill to disk (it's the default since 1.0.0).
5. Watch out for memory leaks; these are often caused by accidentally closing over objects you don't need in your lambdas. The way to diagnose this is to look for "task serialized as XXX bytes" in the logs; if XXX is larger than a few k or more than an MB, you may have a memory leak.
6. Related to the above: use broadcast variables if you really do need large objects.
7. If you are caching large RDDs and can sacrifice some access time, consider serialising the RDD, or even caching it on disk (which sometimes isn't that bad if you are using SSDs).


58)Does SparkSQL support subquery?


Answer)Spark 2.0+: Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:
select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)
select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
Unfortunately, as for now (Spark 2.0) it is impossible to express the same logic using the DataFrame DSL.
Spark < 2.0: Spark supports subqueries in the FROM clause (same as Hive <= 0.12):
SELECT col FROM (SELECT * FROM t1 WHERE bar) t2


59)How to read multiple text files into a single RDD?


Answer)sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

60)What is the difference between map and flatMap, and what is a good use case for each?


Answer)Generally we use the word count example in Hadoop. I will take the same use case, apply map and flatMap, and we will see the difference in how each processes the data.
Below is the sample data file:
hadoop is fast
hive is sql on hdfs
spark is superfast
spark is awesome
The above file will be parsed using map and flatMap.
Using map:
wc = data.map(lambda line: line.split(" "))
wc.collect()
[[u'hadoop', u'is', u'fast'], [u'hive', u'is', u'sql', u'on', u'hdfs'], [u'spark', u'is', u'superfast'], [u'spark', u'is', u'awesome']]
The input has 4 lines and the output size is 4 as well, i.e. N elements ==> N elements, where each element is the list of words from one line.
Using flatMap:
fm = data.flatMap(lambda line: line.split(" "))
fm.collect()
[u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs', u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']
flatMap flattens the per-line lists into one collection of words, so the 4 input lines produce 14 output elements.

61)What are the various data sources available in SparkSQL?


Answer)CSV file, Parquet file, JSON Datasets, Hive Table.


62)What are the key features of Apache Spark that you like?


Answer)Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc. It has built-in APIs in multiple languages like Java, Scala, Python and R. It has good performance gains, as it helps run an application in a Hadoop cluster ten times faster on disk and 100 times faster in memory.

63)Name some sources from where Spark streaming component can process realtime data.


Answer)Apache Flume, Apache Kafka, Amazon Kinesis

64)Name some companies that are already using Spark Streaming.


Answer)Uber, Netflix, Pinterest.


65)What do you understand by receivers in Spark Streaming?


Answer)Receivers are special entities in Spark Streaming that consume data from various data sources and move them to Apache Spark. Receivers are usually created by streaming contexts as long running tasks on various executors and scheduled to operate in a round robin manner with each receiver taking a single core.


66)What is GraphX?


Answer)Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.

67)What does MLlib do?

Answer)MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.


68)What is PageRank?


Answer)PageRank is a distinctive feature and algorithm in GraphX that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v's importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.

69)Do you need to install Spark on all nodes of Yarn cluster while running Spark on Yarn?


Answer)No because Spark runs on top of Yarn.

70)List the advantage of Parquet file in Apache Spark.


Answer)Parquet is a columnar format supported by many data processing systems. The benefits of having columnar storage are:
1)Columnar storage limits IO operations.
2)Columnar storage can fetch only the specific columns that you need to access.
3)Columnar storage consumes less space.
4)Columnar storage gives better-summarized data and follows type-specific encoding.
