Apache Cassandra Interview Questions and Answers

Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

1)Explain what is Cassandra?

Answer)Cassandra is an open-source data storage system developed at Facebook for inbox search and designed to store and manage large amounts of data across commodity servers. It can serve as both:

A real-time data store for online applications

A read-intensive database for business intelligence systems

2)What do you understand by Commit log in Cassandra?

Answer)Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

3)In which language is Cassandra written?

Answer)Cassandra is written in Java. It was originally designed at Facebook, supports flexible schemas, and is highly scalable for big data.

4)List the benefits of using Cassandra.

Answer)Apache Cassandra delivers near real-time performance, which simplifies the work of developers, administrators, data analysts and software engineers.

Instead of a master-slave architecture, Cassandra is built on a peer-to-peer architecture, so there is no single point of failure.

It is flexible: nodes can be added to any Cassandra cluster in any datacenter, and any client can send its request to any node.

Cassandra scales up and down as requirements change. It sustains high throughput for read and write operations and does not need to be restarted while scaling.

Cassandra replicates data across nodes, so if one node fails users can retrieve the data from another replica. Users choose the number of replicas to keep.

It performs well on massive datasets, which makes it a popular NoSQL choice among organizations.

It uses a column-oriented structure, which speeds up and simplifies slicing. Data access and retrieval are also more efficient with a column-based data model.

Cassandra supports a schema-free/schema-optional data model, so you do not have to declare up front every column your application needs.

5)What was the design goal of Cassandra?

Answer)The main design goal of Cassandra was to handle big data workloads across multiple nodes without a single point of failure.

6)What are the main components of the Cassandra data model?

Answer)The main components of the Cassandra data model are:

Cluster

Keyspace

Column

Column Family

7)What are the other components of Cassandra?

Answer)Some other components of Cassandra are:

Node

Data Center

Commit log

Mem-table

SSTable

Bloom Filter

8)Explain the concept of Tunable Consistency in Cassandra?

Answer)Tunable consistency is a characteristic that makes Cassandra a favored database among developers, analysts and big data architects. Consistency refers to how up-to-date and synchronized a row of data is across all of its replicas. Cassandra’s tunable consistency lets users select the consistency level best suited to their use case. It supports two kinds of consistency: eventual consistency and strong consistency.

Eventual consistency guarantees that, when no new updates are made to a given data item, all accesses eventually return the last updated value. A system that has reached this state is said to have achieved replica convergence.

Strong consistency is achieved when the following condition holds:

R + W > N, where

N – Number of replicas

W – Number of nodes that need to agree for a successful write

R – Number of nodes that need to agree for a successful read
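The condition above can be checked with a few lines of Python. This is a minimal sketch, not part of any Cassandra API: the function names is_strongly_consistent and quorum are hypothetical, and quorum assumes the usual majority rule of floor(N / 2) + 1.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True when read and write replica sets must overlap (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return r + w > n


def quorum(n: int) -> int:
    """Smallest majority of n replicas (assumed rule: floor(n / 2) + 1)."""
    return n // 2 + 1


if __name__ == "__main__":
    n = 3
    # Quorum writes combined with quorum reads overlap on at least one replica.
    print(is_strongly_consistent(n, quorum(n), quorum(n)))
    # Writing to one replica and reading from one replica may miss each other.
    print(is_strongly_consistent(n, 1, 1))
```

With N = 3, quorum reads and writes (2 + 2 > 3) satisfy the condition, while single-replica reads and writes (1 + 1 > 3 is false) give only eventual consistency.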

9)How does Cassandra write?

Answer)Cassandra performs a write in two steps: it first appends the write to a commit log on disk and then applies it to an in-memory structure known as the memtable. Once both steps succeed, the write is acknowledged. When the memtable fills up, its contents are flushed to disk as an SSTable (sorted string table). Because writes are sequential appends, Cassandra offers fast write performance.
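The write path described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: ToyNode and memtable_limit are hypothetical names, the commit log is a plain list instead of on-disk segments, and the flush threshold counts keys rather than bytes.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory writes, keyed by row key
        self.sstables = []     # each flush produces one immutable, sorted table
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: append to the commit log
        self.memtable[key] = value            # step 2: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable tuple of (key, value) pairs.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs to be replayed


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("b", 1)
    node.write("a", 2)  # reaches the limit and triggers a flush
    print(node.sstables)
```

The point of the ordering is that a crash after step 1 can be recovered by replaying the commit log, while the sorted flush is what makes the on-disk table cheap to search later.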

10)Why can’t I set listen_address to listen on 0.0.0.0 (all my addresses)?

Answer)Cassandra is a gossip-based distributed system and listen_address is the address a node tells other nodes to reach it at. Telling other nodes “contact me on any of my addresses” is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.

If you don’t want to manually specify an IP to listen_address for each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it’s up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc).

One exception to this process is JMX, which by default binds to 0.0.0.0.

11)What ports does Cassandra use?

Answer)By default, Cassandra uses 7000 for cluster communication (7001 if SSL is enabled), 9042 for native protocol clients, and 7199 for JMX. The internode communication and native protocol ports are configurable in the Cassandra Configuration File. The JMX port is configurable in cassandra-env.sh (through JVM options). All ports are TCP.

12)Define Mem-table in Cassandra.

Answer) A mem-table is a memory-resident data structure. After a write is recorded in the commit log, the data is written to the mem-table. The mem-table is an in-memory, write-back cache holding content in key and column format. Data in the mem-table is sorted by key, and each column family has its own mem-table, from which column data is retrieved by key. It stores writes until it is full and is then flushed to disk.

13)What is SSTable?

Answer)SSTable, or ‘Sorted String Table’, is an important data file in Cassandra. Memtables are regularly flushed to disk as SSTables, which exist for each Cassandra table. SSTables are immutable: once written, data items can be neither added nor removed. For each SSTable, Cassandra also creates separate structures such as a partition index, a partition summary and a bloom filter.

14)What is a bloom filter?

Answer)A bloom filter is an off-heap data structure used to check whether an SSTable may contain the requested data before performing any disk I/O.
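To make the idea concrete, here is a toy bloom filter in Python. It is a sketch of the general technique, not Cassandra's implementation: the class name ToyBloomFilter, the bit-array size and the SHA-256 based hashing are all assumptions chosen for readability.

```python
import hashlib


class ToyBloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    bf = ToyBloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"))
```

A negative answer lets a read skip the SSTable entirely, which is where the disk I/O saving comes from. A positive answer still requires checking the table, because false positives are possible.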

15)Explain the difference between a node, a cluster and a data centre in Cassandra.

Answer)A node is a single machine running Cassandra.

A cluster is a collection of nodes that hold similar types of data grouped together.

Data centres are useful when serving customers in different geographical areas. The nodes of a cluster are grouped into different data centres.

16)Define composite type in Cassandra?

Answer)In Cassandra, a composite type lets you define a key or a column name as a concatenation of values of different types. Composite types can be used in two places:

Row Key

Column Name

17)What is keyspace in Cassandra?

Answer)In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.

18)Define the management tools in Cassandra.

Answer)DataStax OpsCenter: a web-based management and monitoring solution for Cassandra clusters and DataStax. It is free to download and includes an additional Edition of OpsCenter.

SPM primarily monitors Cassandra metrics along with various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, ZooKeeper and other big data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, creating real-time graphs with zooming, anomaly detection and heartbeat alerting.

19)Explain CAP Theorem.

Answer)The CAP theorem plays a major role in deciding how to scale a distributed system as additional resources are needed. The Consistency, Availability and Partition tolerance (CAP) theorem states that a distributed system like Cassandra can provide only two of these three characteristics at once.

One of them has to be sacrificed. Consistency guarantees that the client reads the most recent write, Availability guarantees a reasonable response within minimal time, and Partition tolerance means the system continues to operate when network partitions occur. The two options available are AP and CP.

20)How to write a query in Cassandra?

Answer)Queries are written in CQL (Cassandra Query Language). cqlsh is the shell used for interacting with the database.

21)Which operating systems does Cassandra support?

Answer)Windows and Linux

22)Talk about the concept of tunable consistency in Cassandra?

Answer)Tunable Consistency is a characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big data Architects. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistencies – Eventual Consistency and Strong Consistency.

23)What are the three components of Cassandra write?

Answer)The three components are:

Commit log write

Memtable write

SSTable write

Cassandra first writes data to the commit log, then to the in-memory memtable, and finally to an SSTable on disk.

24)What is the syntax to create keyspace in Cassandra?

Answer)CREATE KEYSPACE identifier WITH properties
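As an illustration of what the properties part usually looks like, the sketch below builds a complete statement as a string. The helper create_keyspace_cql is hypothetical, the keyspace name and replication factor are example values, and SimpleStrategy is assumed here because it is the simplest replication class.

```python
def create_keyspace_cql(name: str, replication_factor: int = 3) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy (illustrative only)."""
    if not name.isidentifier():
        raise ValueError(f"not a simple identifier: {name!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # The resulting text can be pasted into cqlsh.
    print(create_keyspace_cql("demo", 3))
```

For multi-datacenter clusters the replication map would normally name a different class and per-datacenter replica counts, so treat the map shown here as one example rather than a recommendation.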

25)What happens to existing data in my cluster when I add new nodes?

Answer)When a new node joins a cluster, it automatically contacts the other nodes in the cluster and copies the right data to itself.

26)Why does disk usage stay the same when I delete data from Cassandra?

Answer)Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can’t actually be removed when you perform a delete. Instead, a marker (also called a tombstone) is written to indicate the value’s new status. Never fear though: on the first compaction that occurs between the data and the tombstone, the data will be expunged completely and the corresponding disk space recovered.
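The interaction between tombstones and compaction can be sketched as follows. This is a deliberately simplified model: TOMBSTONE and compact are hypothetical names, SSTables are plain dictionaries ordered oldest to newest, and real compaction also takes write timestamps and a grace period into account before purging.

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker


def compact(sstables):
    """Merge tables oldest-to-newest and drop keys whose newest entry is a tombstone."""
    merged = {}
    for table in sstables:  # later tables override earlier ones
        merged.update(table)
    return {key: value for key, value in merged.items() if value is not TOMBSTONE}


if __name__ == "__main__":
    older = {"a": 1, "b": 2}   # original data
    newer = {"a": TOMBSTONE}   # the delete of "a" lands in a later table
    # Before compaction both tables occupy disk; afterwards only "b" remains.
    print(compact([older, newer]))
```

Until the two tables are merged, the old value of "a" and its tombstone both sit on disk, which is why a delete does not free space immediately.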

27)Explain zero consistency.

Answer)In zero consistency the write operations will be handled in the background, asynchronously. It is the fastest way to write data.

28)What do you understand by Kundera?

Answer)Kundera is an object-relational mapping (ORM) implementation for Cassandra which is written using Java annotations.

29)Why does nodetool ring only show one entry, even though my nodes logged that they see each other joining the ring?

Answer)This happens when you have the same token assigned to each node. Don’t do that. Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes. The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.

30)Can I change the replication factor (of a keyspace) on a live cluster?

Answer)Yes, but it will require running a full repair (or cleanup) to change the replica count of existing data:

Alter the replication factor for desired keyspace (using cqlsh for instance).

If you’re reducing the replication factor, run nodetool cleanup on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.

If you’re increasing the replication factor, run nodetool repair -full to ensure data is replicated according to the new configuration. Repair runs on a per-replica set basis. This is an intensive process that may result in adverse cluster performance. It’s highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it. Note that you will need to run a full repair (-full) to make sure that already repaired sstables are not skipped.

31)Can I Store (large) BLOBs in Cassandra?

Answer)Cassandra isn’t optimized for large file or BLOB storage, and a single blob value is always read and sent to the client in its entirety. As such, storing small blobs (less than single-digit MB) should not be a problem, but it is advised to manually split large blobs into smaller chunks.

Please note in particular that, by default, any value greater than 16MB will be rejected by Cassandra due to the max_mutation_size_in_kb setting in the Cassandra Configuration File (which defaults to half of commitlog_segment_size_in_mb, which itself defaults to 32MB).
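Splitting a large blob on the client side can be as simple as the sketch below. The helper names split_blob and join_blob and the 1 MB default chunk size are assumptions for illustration. How the chunks are keyed and stored (for example one row per chunk index) is left to the application.

```python
def split_blob(data: bytes, chunk_size: int = 1024 * 1024):
    """Split data into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_blob(chunks) -> bytes:
    """Reassemble chunks that were produced, in order, by split_blob."""
    return b"".join(chunks)


if __name__ == "__main__":
    payload = bytes(range(256)) * 10          # 2560 bytes of sample data
    chunks = split_blob(payload, chunk_size=1000)
    print([len(chunk) for chunk in chunks])   # sizes of each chunk
    print(join_blob(chunks) == payload)       # round trip check
```

Keeping each chunk well under the mutation size limit also means a reader can fetch only the chunks it needs instead of the whole value.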

32)Nodetool says “Connection refused to host: 127.0.1.1” for any remote host. How do I fix it?

Answer)Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.

If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails, try setting the -Djava.rmi.server.hostname=<public name> JVM option near the bottom of cassandra-env.sh to an interface that you can reach from the remote machine.

33)Will batching my operations speed up my bulk load?

Answer)No. Using batches to load data will generally just add spikes of latency. Use asynchronous INSERTs instead, or use true Bulk Loading.

An exception is batching updates to a single partition, which can be a Good Thing (as long as the size of a single batch stays reasonable). But never ever blindly batch everything.
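One way to respect the single-partition exception is to group rows by partition key on the client before deciding what to batch. The sketch below only does the grouping. The function name group_by_partition and the dictionary row format are hypothetical, and sending each group to the cluster is left out.

```python
from collections import defaultdict


def group_by_partition(rows, partition_key):
    """Group row dictionaries by the value of their partition key column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        {"user": "alice", "event": 1},
        {"user": "bob", "event": 2},
        {"user": "alice", "event": 3},
    ]
    # Each group touches exactly one partition, so it is a candidate for one batch.
    for key, group in group_by_partition(rows, "user").items():
        print(key, len(group))
```

Each group should still be capped at a reasonable size, and rows that do not share a partition are better sent as individual asynchronous inserts.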

34)Why does top report that Cassandra is using a lot more memory than the Java heap max?

Answer)Cassandra uses Memory Mapped Files (mmap) internally. That is, we use the operating system’s virtual memory system to map a number of on-disk files into the Cassandra process’ address space. This will “use” virtual memory; i.e. address space, and will be reported by tools like top accordingly, but on 64 bit systems virtual address space is effectively unlimited so you should not worry about that.

What matters from the perspective of “memory use” in the sense as it is normally meant, is the amount of data allocated on brk() or mmap’d /dev/zero, which represent real memory used. The key issue is that for a mmap’d file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.

The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory - in the case of the memory being resident, you just read it - you don’t even take a page fault (so no overhead in entering the kernel and doing a semi-context switch).

35)What are seeds?

Answer)Seeds are used during startup to discover the cluster.

If you configure your nodes to refer to some node as a seed, nodes in your ring tend to send gossip messages to seeds more often than to non-seeds. In other words, seeds act as hubs of the gossip network. With seeds, each node can detect status changes of other nodes quickly.

Seeds are also contacted by new nodes on bootstrap to learn about the other nodes in the ring. When you add a new node to the ring, you need to specify at least one live seed to contact. Once a node joins the ring, it learns about the other nodes, so it doesn’t need a seed on subsequent boots.

You can make a node a seed at any time. There is nothing special about seed nodes: if you list a node in the seed list, it is a seed.

Seeds do not auto-bootstrap (i.e. if a node has itself in its seed list it will not automatically transfer data to itself). If you want a node to do that, bootstrap it first and then add it to seeds later. If you have no data (new install) you do not have to worry about bootstrap at all.

Recommended usage of seeds:

pick two (or more) nodes per data center as seed nodes.

sync the seed list to all your nodes

36)Does single seed mean single point of failure?

Answer)The ring can operate or boot without a seed; however, you will not be able to add new nodes to the cluster. It is recommended to configure multiple seeds in production systems.

37)Why do I see messages dropped in the logs?

Answer)This is a symptom of load shedding: Cassandra defending itself against more requests than it can handle.

Internode messages that are received by a node but are not processed within their proper timeout (see read_request_timeout, write_request_timeout, in the Cassandra Configuration File) are dropped rather than processed, since the coordinator node will no longer be waiting for a response.

For writes, this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by read repair, hints or a manual repair. The write operation may also have timed out as a result.

For reads, this means a read request may not have completed.

Load shedding is part of the Cassandra architecture. If this is a persistent issue, it is generally a sign of an overloaded node or cluster.

38)Cassandra dies with java.lang.OutOfMemoryError: Map failed

Answer)If Cassandra is dying specifically with the “Map failed” message, it means the OS is denying Java the ability to lock more memory. On Linux, this typically means memlock is limited. Check /proc/<pid of cassandra>/limits to verify this and raise it (e.g. via ulimit in bash). You may also need to increase vm.max_map_count. Note that the Debian package handles this for you automatically.

39)What happens if two updates are made with the same timestamp?

Answer)Updates must be commutative, since they may arrive in different orders on different replicas. As long as Cassandra has a deterministic way to pick the winner (in a timestamp tie), the one selected is as valid as any other, and the specifics should be treated as an implementation detail. That said, in the case of a timestamp tie, Cassandra follows two rules: first, deletes take precedence over inserts/updates. Second, if there are two updates, the one with the lexically larger value is selected.
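The two tie-breaking rules can be written down as a small function. This is a sketch of the rules as stated above, not Cassandra's internal code: resolve is a hypothetical name, each update is modelled as a (timestamp, value) pair, a value of None stands for a delete, and values are compared as bytes.

```python
def resolve(a, b):
    """Pick the winning update; each argument is (timestamp, value or None)."""
    ts_a, val_a = a
    ts_b, val_b = b
    if ts_a != ts_b:
        return a if ts_a > ts_b else b   # the later timestamp wins outright
    if val_a is None or val_b is None:
        return (ts_a, None)              # rule 1: a delete beats an insert/update
    return a if val_a >= val_b else b    # rule 2: the lexically larger value wins


if __name__ == "__main__":
    print(resolve((5, b"apple"), (5, b"banana")))  # tie between two updates
    print(resolve((5, b"apple"), (5, None)))       # tie between update and delete
```

Because the outcome depends only on the two updates and not on the order in which they arrive, every replica that sees both converges on the same winner.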

40)Why does bootstrapping a new node fail with a “Stream failed” error?

Answer)Two main possibilities:

the GC may be creating long pauses disrupting the streaming process

compactions happening in the background hold streaming long enough that the TCP connection fails

In the first case, regular GC tuning advice applies. In the second case, you need to set TCP keepalive to a lower value (the default is very high on Linux). Try running the following:

$ sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5

To make those settings permanent, add them to your /etc/sysctl.conf file.

41)What is the concept of SuperColumn in Cassandra?

Answer)Cassandra SuperColumn is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action.

42)When do you have to avoid secondary indexes?

Answer)Avoid secondary indexes on columns with a high number of unique values (high cardinality), because a query then does a lot of work to return very few results.

43)What do the shell commands CAPTURE and CONSISTENCY do?

Answer)cqlsh provides various shell commands. CAPTURE captures the output of a command and appends it to a file, while CONSISTENCY displays the current consistency level or sets a new one.

44)What is mandatory while creating a table in Cassandra?

Answer)A primary key is mandatory when creating a table. It is made up of one or more columns of the table.

45)What are Cassandra CQL collections?

Answer)CQL collections let you store multiple values in a single column. Cassandra provides the following collection types:

List: used when the order of the elements must be maintained and a value may be stored multiple times (holds repeating elements)

Set: used for a group of elements that is stored and returned in sorted order (holds unique elements)

Map: a data type used to store key-value pairs of elements

46)Explain how Cassandra deletes data.

Answer)SSTables are immutable, so a row cannot be removed from an SSTable in place. When a row needs to be deleted, Cassandra writes a special marker called a tombstone for it. When the data is read, anything covered by a tombstone is treated as deleted.

47)Does Cassandra support ACID transactions?

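Answer)Unlike relational databases, Cassandra does not support full ACID transactions with rollback or locking mechanisms. It does offer atomic, isolated and durable writes at the partition level, together with tunable consistency.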

48)How does Cassandra write changed data into the commit log?

Answer)Cassandra appends the changed data to the commit log, which then acts as a crash-recovery log for that data. A write operation is not considered successful until the changed data has been appended to the commit log.

49)What is the use of ResultSet execute(Statement statement) method?

Answer)This method is used to execute a query. It requires a statement object.

50)What is Thrift?

Answer)Thrift is the name of the Remote Procedure Call (RPC) client used to communicate with the Cassandra server.

51)What is the use of “void close()” method?

Answer)This method is used to close the current session instance.

52)What are the main features of SPM in Cassandra?

Answer)The main features of SPM are

Correlation of events and metrics

Distributed transaction tracing

Creating real-time graphs with zooming

Anomaly detection and heartbeat alerting

53)When can you use ALTER KEYSPACE?

Answer)ALTER KEYSPACE can be used to change properties of a keyspace such as the number of replicas and durable_writes.

54)What is Hector in Cassandra?

Answer)Hector was one of the early Cassandra clients. It is an open-source project written in Java and released under the MIT license.

55)What do you understand by Snitches?

Answer)A snitch determines which data centers and racks nodes belong to. It informs Cassandra about the network topology so that requests are routed efficiently, and it allows Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places replicas based on the information the snitch provides. All nodes must return to the same rack and data center. Cassandra does its best not to place more than one replica on the same rack.
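The "no more than one replica per rack if possible" behaviour can be sketched with a toy placement function. This is a simplified illustration, not Cassandra's replication strategy code: place_replicas is a hypothetical name, the ring is a list of (node, rack) pairs in token order for a single data center, and the rack labels play the role of the information a snitch would supply.

```python
def place_replicas(ring, start, rf):
    """Walk the ring from index start and pick rf nodes, preferring distinct racks."""
    if rf > len(ring):
        raise ValueError("replication factor exceeds the number of nodes")
    ordered = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen, seen_racks = [], set()
    for node, rack in ordered:      # first pass: at most one replica per rack
        if len(chosen) == rf:
            break
        if rack not in seen_racks:
            chosen.append(node)
            seen_racks.add(rack)
    for node, _rack in ordered:     # second pass: reuse racks only if needed
        if len(chosen) == rf:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen


if __name__ == "__main__":
    ring = [("n1", "r1"), ("n2", "r1"), ("n3", "r2"), ("n4", "r2")]
    print(place_replicas(ring, start=0, rf=2))  # skips n2, which shares rack r1
```

With two racks and a replication factor of 3, the second pass has to reuse a rack, which mirrors the "does its best" wording above.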
