Apache Cassandra Tutorial

What is Apache Cassandra?

Cassandra is a distributed database management system designed to handle high volumes of structured data across commodity servers.

Cassandra handles huge amounts of data with its distributed architecture. Data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure.

In the image below, the circles are Cassandra nodes and the lines between the circles show the distributed architecture, while the client sends data to one of the nodes.

[Image: Cassandra nodes in a distributed cluster, with a client sending data to a node]

Why should you go for Apache Cassandra Training?

Apache Cassandra is one of the most widely used NoSQL databases. It offers fault tolerance, scalability, flexible data storage, and efficient writes, which make it a good fit for many purposes. Apache Cassandra is the right choice of database if you need scalability and high availability without compromising performance for mission-critical applications.
To take advantage of these opportunities, you need structured training with a curriculum that reflects current industry requirements and best practices.
Besides a strong theoretical understanding, you also need to work on real-life Cassandra projects. Cassandra is open source and is used by companies such as Spotify, eBay, Comcast, Adobe, NASA, Netflix, and Twitter, which has led to an increase in jobs in the Cassandra domain.

Cassandra History and Why it is popular

Cassandra is an Apache project. It is an open-source, distributed, and decentralized storage system (database) used to manage very large amounts of structured data spread across many machines and locations. It provides high availability with no single point of failure.

  • Cassandra was first developed at Facebook for inbox search.
  • Facebook open sourced it in July 2008.
  • Apache incubator accepted Cassandra in March 2009.
  • Cassandra has been a top-level Apache project since February 2010.
  • At the time of writing, the latest stable version of Apache Cassandra is 3.11.7, and a beta of version 4.0 is available.

Important Points of Cassandra

  • Cassandra is a wide-column store, often described as a column-oriented database.
  • Cassandra is scalable and fault-tolerant, and offers tunable consistency.
  • Cassandra’s distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
  • Cassandra was created at Facebook. It is very different from relational database management systems.
  • Cassandra follows a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
  • Cassandra is used by some of the biggest companies, including Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix.

NoSQL Cassandra Database

NoSQL databases are also called “Not Only SQL” or “non-relational” databases. They store and retrieve data using models other than the tabular relations used in relational databases.

NoSQL databases include MongoDB, HBase, and Cassandra.

NoSQL databases have the following properties:

  • Design Simplicity
  • Horizontal Scaling
  • High Availability

Data structures used in NoSQL databases such as Cassandra are more specialized than those used in relational databases, which can make some operations faster than their relational equivalents.

NoSQL databases are increasingly used in Big Data and real-time web applications. The name “Not Only SQL” reflects that they may support SQL-like query languages.

Differences between NoSQL and Relational database

NoSQL Database | Relational Database
Supports a very simple query language. | Supports a powerful query language.
Has no fixed schema. | Has a fixed schema.
Is typically eventually consistent. | Follows ACID properties (Atomicity, Consistency, Isolation, and Durability).
Supports only simple transactions. | Supports transactions, including complex transactions with joins.
Is used to handle data arriving at high velocity. | Is used to handle data arriving at low velocity.
Data arrives from many locations. | Data arrives from one or a few locations.
Can manage structured, unstructured, and semi-structured data. | Manages only structured data.
Has no single point of failure. | Has a single point of failure, mitigated by failover.
Can handle big data, that is, data in very high volume. | Is used to handle moderate volumes of data.
Has a decentralized structure. | Has a centralized structure.
Gives both read and write scalability. | Gives read scalability only.
Is deployed in a horizontal fashion. | Is deployed in a vertical fashion.

Apache Cassandra Features

Cassandra provides the following features:

  • Massively Scalable Architecture: Cassandra has a masterless design in which all nodes are at the same level, which provides operational simplicity and easy scale-out.
  • Masterless Architecture: Data can be written to and read from any node.
  • Linear Scale Performance: As more nodes are added, the throughput of Cassandra increases.
  • No Single Point of Failure: Cassandra replicates data on different nodes, which ensures there is no single point of failure.
  • Fault Detection and Recovery: Failed nodes can easily be restored and recovered.
  • Flexible and Dynamic Data Model: Cassandra supports flexible data types with fast writes and reads.
  • Data Protection: Data is protected by the commit log design and by built-in mechanisms such as backup and restore.
  • Tunable Data Consistency: Consistency can be tuned per operation, from eventual consistency up to strong consistency across the distributed architecture.
  • Multi Data Center Replication: Cassandra can replicate data across multiple data centers.
  • Data Compression: Cassandra can compress data by up to 80% with little performance overhead.
  • Cassandra Query Language: Cassandra provides a query language that is similar to SQL. This makes it easier for developers to move from relational databases to Cassandra.
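
The tunable consistency feature listed above comes down to simple arithmetic: a read is guaranteed to see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). The sketch below illustrates that rule; the function names are made up for this example and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """A read overlaps the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(is_strongly_consistent(q, q, rf))   # True: quorum reads and quorum writes
print(is_strongly_consistent(1, 1, rf))   # False: one replica each way
```

With a replication factor of 3, reading and writing at quorum touches two replicas each, so every read overlaps at least one replica that holds the latest write.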

Cassandra Use Cases/Application

Cassandra is a non-relational database that can be used for different types of applications. Here are some use cases where Cassandra should be preferred.

  • Messaging

Cassandra is a great database for companies that provide mobile phone and messaging services. These companies handle huge amounts of data, which suits Cassandra's scalable design.

  • Internet of Things applications

Cassandra is a great database for the applications where data is coming at very high speed from different devices or sensors.

  • Product Catalogs and retail apps

Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.

  • Social Media Analytics and recommendation engine

Cassandra is a great database for online companies and social media providers that run analytics and generate recommendations for their customers.

Cassandra Architecture

Cassandra was designed to handle big data workloads across multiple nodes without a single point of failure. It has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

  • In Cassandra, each node is independent and at the same time interconnected to other nodes. All the nodes in a cluster play the same role.
  • Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
  • In the case of failure of one node, Read/Write requests can be served from other nodes in the network.
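
The peer-to-peer layout described above can be sketched as a toy token ring. This is a simplified illustration, not Cassandra's implementation: the class name `TokenRing`, the node names, and the use of MD5 as the hash are assumptions made for the example (Cassandra's partitioners and virtual nodes are more involved).

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hashing ring: one token per node, replicas on successors."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _token(value: str) -> int:
        # MD5 is only a stand-in hash for this sketch.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas_for(self, partition_key: str):
        """Return the nodes that would store the given partition key."""
        start = bisect.bisect_left(self.tokens, self._token(partition_key))
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas_for("user:42"))  # three distinct node names
```

Because every node can compute the same mapping, any node can act as the entry point for a request and forward it to the replicas that own the key.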

Data Replication in Cassandra

In Cassandra, one or more nodes in a cluster act as replicas for a given piece of data. If some of the replicas respond with an out-of-date value, Cassandra returns the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.

See the following image to understand the schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure no single point of failure.

[Image: Data replication among the nodes of a Cassandra cluster]
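
The replication behaviour described in this section, returning the most recent value and then repairing stale replicas, can be sketched as follows. This is a minimal illustration under stated assumptions: replicas are plain dictionaries, `Cell` and `read_with_repair` are hypothetical names, and the repair happens inline rather than in the background.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # write time; a larger number means a more recent write


def read_with_repair(replicas: dict, key: str):
    """Return the newest value for key and overwrite stale or missing copies."""
    cells = [store[key] for store in replicas.values() if key in store]
    if not cells:
        return None
    newest = max(cells, key=lambda cell: cell.timestamp)
    for store in replicas.values():
        current = store.get(key)
        if current is None or current.timestamp < newest.timestamp:
            store[key] = newest  # repair the stale replica
    return newest.value


replicas = {
    "node-a": {"k": Cell("new", 2)},
    "node-b": {"k": Cell("old", 1)},
    "node-c": {},
}
print(read_with_repair(replicas, "k"))  # new
print(replicas["node-b"]["k"].value)    # new
```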

Components of Cassandra

The main components of Cassandra are:

  • Node: A Cassandra node is the place where data is stored.
  • Data center: A data center is a collection of related nodes.
  • Cluster: A cluster contains one or more data centers.
  • Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is written to the commit log.
  • Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
  • SSTable: An SSTable is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
  • Bloom filter: A Bloom filter is a fast, probabilistic data structure for testing whether an element is a member of a set. It can report false positives but never false negatives, and it works like a special kind of cache in front of the SSTables. Bloom filters are consulted on every read to avoid checking SSTables that cannot contain the requested data.
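
To make the Bloom filter entry above more concrete, here is a minimal sketch of the data structure. The sizes, the SHA-256 based hashing, and the class name are choices made for this example, not Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive several bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


sstable_filter = BloomFilter()
sstable_filter.add("row-1")
print(sstable_filter.might_contain("row-1"))  # True
```

A negative answer lets a read skip an SSTable entirely, which is why the filter is checked before touching the disk file.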

Cassandra Query Language

Cassandra Query Language (CQL) is used to access Cassandra through its nodes. CQL treats the database (keyspace) as a container of tables. Programmers work with CQL through cqlsh, an interactive prompt, or through application language drivers.
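
As a rough illustration of what CQL looks like when sent through a driver, the sketch below runs a few statements against a session-like object. The keyspace `demo`, the table `users`, and the `RecordingSession` stand-in are all hypothetical; with the DataStax Python driver installed and a node running locally, a real session could be obtained with `Cluster(["127.0.0.1"]).connect()` from `cassandra.cluster` and passed in instead.

```python
STATEMENTS = [
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text)",
    "INSERT INTO demo.users (user_id, name) VALUES (1, 'alice')",
    "SELECT name FROM demo.users WHERE user_id = 1",
]


def run_statements(session, statements=STATEMENTS):
    """Execute each CQL statement in order and return the last result."""
    result = None
    for statement in statements:
        result = session.execute(statement)
    return result


class RecordingSession:
    """Stand-in for a driver session; it only records what it was asked to run."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)
        return []


session = RecordingSession()
run_statements(session)
print(len(session.executed))  # 4
```

The same four statements can also be typed directly at the cqlsh prompt, each terminated with a semicolon.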

The client can approach any node for its read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.

Write Operations

Every write to a node is first recorded in that node's commit log. The data is then written to the mem-table. When the mem-table is full, its data is flushed to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.

[Image: Cassandra write operation]
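
The write path described in this section can be sketched on a single node as follows. This is a toy model under stated assumptions: `ToyNode` and its methods are hypothetical names, the flush threshold counts rows instead of bytes, and everything lives in memory rather than on disk.

```python
class ToyNode:
    """Toy single-node write path: commit log, then mem-table, then SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # in-memory writes, keyed by row key
        self.sstables = []     # flushed tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write first
        self.memtable[key] = value             # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush at the threshold

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

    def compact(self):
        merged = {}
        for table in self.sstables:   # later tables overwrite earlier ones
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
print(node.sstables)  # [{'a': 3, 'b': 2, 'c': 4}]
```

After compaction the older value of `a` is gone, which mirrors how consolidating SSTables discards data that has been superseded.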

Read Operations

In read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the SSTables that may contain the required data.

Coordinators send three types of read requests to replicas:

  • Direct request
  • Digest request
  • Read repair request

The coordinator sends a direct request to one of the replicas. It then sends digest requests to as many replicas as the consistency level specifies and checks whether the returned data is up to date.

After that, the coordinator sends digest requests to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.

[Image: Cassandra read operation]
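
The direct and digest requests described in this section can be sketched as follows. This is a simplified illustration: replicas are plain dictionaries that all hold the key, `coordinate_read` is a hypothetical name, and MD5 is just a convenient hash for the example.

```python
import hashlib


def digest(value: str) -> str:
    """Hash of a value, sent instead of the full data to save bandwidth."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def coordinate_read(replicas: list, key: str, consistency_level: int):
    """Return (value, digests_match) for a read touching consistency_level replicas."""
    contacted = replicas[:consistency_level]
    value = contacted[0][key]                                    # direct request
    digests = [digest(replica[key]) for replica in contacted[1:]]  # digest requests
    digests_match = all(d == digest(value) for d in digests)
    return value, digests_match


replicas = [{"k": "v2"}, {"k": "v2"}, {"k": "v1"}]
print(coordinate_read(replicas, "k", 2))  # ('v2', True)
print(coordinate_read(replicas, "k", 3))  # ('v2', False)
```

A mismatch such as the second call is the signal that at least one replica is out of date and that a read repair is needed.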

Apache Cassandra Certification Training

Edureka’s Apache Cassandra Certification Training is designed by professionals as per the industry requirements and demands. This Cassandra Certification Training helps you to master the concepts of Apache Cassandra including Cassandra Architecture, its features, Cassandra Data Model, and its Administration.

Throughout the Cassandra course, you will learn to install, configure, and monitor Cassandra, along with its integration with other Apache frameworks like Hadoop, Spark, and Kafka.

Why should you take Apache Cassandra training?

  • 10,000 open jobs require Cassandra developer and administration skills (LinkedIn).
  • The average salary of a software engineer with Apache Cassandra skills is $120,500 per year (Payscale.com salary data).
  • Edureka’s Apache Cassandra Professional Certificate holders work at thousands of companies, including VMware, Cisco, Dell, and Honeywell.
  • Cassandra is widely used in industry by companies such as Microsoft, Netflix, Walmart, Intel, Intuit, and PayPal.
