Apache Apex Interview Questions and Answers

Apache Apex is a Hadoop YARN-native big data processing platform that enables both real-time stream processing and batch processing of your big data.


1) What are the differences between Apache Spark and Apache Apex?


Answer) Apache Spark is fundamentally a batch processing engine. If you consider Spark Streaming (which uses Spark underneath), it is micro-batch processing. In contrast, Apache Apex is a true stream processing engine, in the sense that an incoming record does NOT have to wait for the next record before being processed. Each record is processed and sent to the next level of processing as soon as it arrives.


2) How is Apache Apex different from Apache Storm?


Answer) There are fundamental differences in architecture that make each platform very different in terms of latency, scaling, and state management.

At the very basic level,

Apache Storm uses record acknowledgement to guarantee message delivery.

Apache Apex uses checkpointing to guarantee message delivery.


3) How to calculate network latency between operators in Apache Apex?


Answer) Assuming your tuples are strings and that the clocks on your cluster nodes are synchronized, you can append a timestamp to each tuple in the sending operator. Then, in the receiving operator, you can strip out the timestamp and compare it to the current time. You can, of course, suitably adapt this approach for other types. If averaged over a suitably large number of tuples, it should give you a good approximation of the network latency.
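This timestamping approach can be sketched in plain Java (this is not the Apex operator API; the method names here are illustrative, with the sender-side and receiver-side logic shown as two helper functions):

```java
import java.util.ArrayList;
import java.util.List;

public class LatencySketch {
    // Sender side: append a send timestamp to the string tuple.
    static String stamp(String payload) {
        return payload + "|" + System.currentTimeMillis();
    }

    // Receiver side: strip out the timestamp and compare it to the current time.
    static long latencyMillis(String stamped) {
        int sep = stamped.lastIndexOf('|');
        long sentAt = Long.parseLong(stamped.substring(sep + 1));
        return System.currentTimeMillis() - sentAt;
    }

    public static void main(String[] args) {
        // Average over a large number of tuples to approximate network latency.
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            samples.add(latencyMillis(stamp("tuple-" + i)));
        }
        double avg = samples.stream().mapToLong(Long::longValue).average().orElse(0);
        System.out.println("avg latency ms: " + avg);
    }
}
```

Note that this only works if the clocks on the sending and receiving nodes are synchronized (e.g. via NTP); otherwise the clock skew is folded into the measurement.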


4) Can an Input Operator be used in the middle of a DAG in Apache Apex?


Answer) This is an interesting use case. You should be able to extend an input operator (say, JdbcInputOperator, since you want to read from a database) and add an input port to it. This input port receives tuples from another operator in your DAG and updates the "where" clause of the JdbcInputOperator so that it reads data based on that.
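The pattern can be sketched in plain Java; the class and method names below are hypothetical stand-ins for the Apex API (in Apex itself the added port would be a DefaultInputPort on your operator subclass):

```java
// Hypothetical stand-in for a JDBC-style input operator whose WHERE
// clause is driven by tuples arriving on an added input port.
public class ParameterizedJdbcReader {
    private volatile String whereClause = "1=1"; // default: read everything

    // Simulates the added input port: each incoming control tuple
    // from the upstream operator updates the filter.
    public void onControlTuple(String newWhereClause) {
        this.whereClause = newWhereClause;
    }

    // Called on each emit cycle to build the query the operator would run.
    public String buildQuery() {
        return "SELECT * FROM events WHERE " + whereClause;
    }

    public static void main(String[] args) {
        ParameterizedJdbcReader reader = new ParameterizedJdbcReader();
        System.out.println(reader.buildQuery());
        reader.onControlTuple("event_time > '2016-06-10'");
        System.out.println(reader.buildQuery());
    }
}
```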


5) What is the operator lifecycle in Apache Apex?


Answer) A given operator has the life cycle shown below, which spans the execution period of the operator instance. In case of operator failure, the life cycle starts over from the last known checkpoint. A checkpoint of operator state is taken periodically, once every few windows, and becomes the last known checkpoint in case of failure.


→ Constructor is called
→ State is applied from the last known checkpoint
→ setup()
→ loop over {
      → beginWindow()
      → loop over {
            → process()
      }
      → endWindow()
}
→ teardown()
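The call order above can be illustrated with a minimal, self-contained simulation (this is not the actual Apex Operator interface; the driver loop below just mimics the engine's call sequence for a fixed number of windows and tuples):

```java
public class LifecycleDemo {
    // Minimal stand-in for an operator with the lifecycle callbacks.
    static class Operator {
        final StringBuilder trace;
        Operator(StringBuilder trace) { this.trace = trace; }
        void setup() { trace.append("setup "); }
        void beginWindow(long id) { trace.append("begin").append(id).append(' '); }
        void process(String tuple) { trace.append("proc "); }
        void endWindow() { trace.append("end "); }
        void teardown() { trace.append("teardown"); }
    }

    // Mimics the engine: setup once, then beginWindow / process... /
    // endWindow per window, and finally teardown on shutdown.
    static String run(int windows, int tuplesPerWindow) {
        StringBuilder trace = new StringBuilder();
        Operator op = new Operator(trace);
        op.setup();
        for (long w = 0; w < windows; w++) {
            op.beginWindow(w);
            for (int t = 0; t < tuplesPerWindow; t++) op.process("tuple");
            op.endWindow();
        }
        op.teardown();
        return trace.toString();
    }

    public static void main(String[] args) {
        // → setup begin0 proc end begin1 proc end teardown
        System.out.println(run(2, 1));
    }
}
```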


6) How to restart an Apache Apex application?


Answer) Apache Apex provides a command line interface, the "apex" script (previously called "dtcli"), to interact with applications. Once an application is shut down or killed, you can restart it using the following command:

launch pi-demo-3.4.0-incubating-SNAPSHOT.apa -originalAppId application_1465560538823_0074 -Ddt.attr.APPLICATION_NAME="Relaunched PiDemo" -exactMatch "PiDemo"

where,

-originalAppId is the ID of the original app. This ensures that the operators continue from where the original app left off.

-Ddt.attr.APPLICATION_NAME gives the new name for the relaunched app.

-exactMatch is used to specify the exact app name.

Note that -Ddt.attr.APPLICATION_NAME and -exactMatch are optional.


7) Does Apache Apex rely on HDFS or does it have its own file system?


Answer) Apache Apex uses checkpointing of operator state for fault tolerance, and writes these checkpoints to HDFS for recovery. However, the store for checkpointing is configurable; Apex also has an implementation that checkpoints to Apache Geode. Apex additionally uses HDFS to upload artifacts such as the application package containing the application jar, its dependencies, and the configurations needed to launch the application.


8) How to pass arguments to the Application.java class in Apache Apex?


Answer) You can pass arguments as a Configuration. This configuration is passed as an argument to the populateDAG() method in Application.java.

Configuration here is org.apache.hadoop.conf.Configuration. You can specify it as XML. For the XML syntax, please refer to https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html.

There are different ways in which properties can be specified:

~/.dt/dt-site.xml: By default, the apex cli looks for this file (~ is your home directory). Use this file for properties that are common to all the applications in your environment.

-conf option on apex cli: The launch command on the apex cli provides a -conf option to specify properties. You need to specify the path of the configuration XML. Use this file for properties that are specific to a particular application or to this particular launch of the application.

-Dproperty-name=value: The launch command on the apex cli also provides a -D option to specify individual properties. You can specify multiple properties, e.g. -Dproperty-name1=value1 -Dproperty-name2=value2.
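A configuration XML passed via -conf follows the standard Hadoop Configuration format; a minimal sketch might look like this (the application and operator names below are hypothetical, but Apex application properties do use the dt.* prefix):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical example: sets a property on the "console" operator
       of an application named "MyFirstApplication". -->
  <property>
    <name>dt.application.MyFirstApplication.operator.console.prop.stringFormat</name>
    <value>hello world: %s</value>
  </property>
</configuration>
```

Inside populateDAG(Configuration conf), individual values can also be read with conf.get("property-name") and used to parameterize the DAG.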


9) How does Apache Apex handle back pressure?


Answer) The buffer server is a pub-sub mechanism within the Apex platform that is used to stream data between operators. The buffer server always lives in the same container as the upstream operator (one buffer server per container, irrespective of the number of operators in the container), and the output of the upstream operator is written to the buffer server. The downstream operator subscribes to the upstream operator's buffer server when a stream is connected.


So if an operator fails, the upstream operator's buffer server retains the required data until a common checkpoint is reached. If the upstream operator fails, its own upstream operator's buffer server has the data, and so on. Finally, if the input operator fails, which has no upstream buffer server, the input operator itself is responsible for replaying the data. Depending on the external system, the input operator either relies on the external system for replays or maintains the data itself until a common checkpoint is reached.


If for some reason the buffer server fails, the container hosting it fails, so all the operators in that container and their downstream operators are redeployed from the last known checkpoint.
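The retention-until-checkpoint idea can be sketched with a toy in-memory buffer (this is not Apex's actual buffer server, which also spools data beyond memory; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Toy pub-sub buffer: the upstream operator appends tuples; a downstream
// subscriber reads from an offset, so a restarted operator can replay from
// its last checkpointed offset. Data is purged only up to the common checkpoint.
public class ToyBufferServer {
    private final List<String> log = new ArrayList<>();
    private int purgedUpTo = 0; // absolute offset of the first retained tuple

    public void publish(String tuple) { log.add(tuple); }

    // A subscriber replays everything from its checkpointed offset onward.
    public List<String> readFrom(int offset) {
        return new ArrayList<>(log.subList(offset - purgedUpTo, log.size()));
    }

    // Once all subscribers have checkpointed past `offset`, older tuples are
    // no longer needed for recovery and can be dropped.
    public void purgeUpTo(int offset) {
        log.subList(0, offset - purgedUpTo).clear();
        purgedUpTo = offset;
    }

    public static void main(String[] args) {
        ToyBufferServer buf = new ToyBufferServer();
        buf.publish("t0"); buf.publish("t1"); buf.publish("t2");
        System.out.println(buf.readFrom(1)); // downstream recovers from offset 1
        buf.purgeUpTo(2);                    // common checkpoint reached at offset 2
        System.out.println(buf.readFrom(2));
    }
}
```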

