
Introduction to Hadoop

What is Hadoop?

Hadoop is an open source framework from Apache that is used to store, process, and analyze very large volumes of data. Hadoop is written in Java and is not an OLAP (online analytical processing) system.

It is used for batch/offline processing, and it is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, a cluster can be scaled out simply by adding nodes to it.

Hadoop Architecture

The Hadoop architecture combines a storage layer, HDFS (Hadoop Distributed File System), with a MapReduce processing engine. The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master node and multiple slave nodes. The master node includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas a slave node includes a DataNode and a Task Tracker.

[Figure: Hadoop architecture]

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the distributed file system for Hadoop. It has a master/slave architecture: a single NameNode performs the role of master, and multiple DataNodes perform the role of slaves.

Both the NameNode and the DataNodes can run on commodity machines. HDFS is developed in Java, so any machine that supports Java can run the NameNode and DataNode software.


NameNode

  • It is the single master server in the HDFS cluster.
  • As it is a single node, it can become a single point of failure.
  • It manages the file system namespace by executing operations such as opening, renaming, and closing files.
  • It simplifies the architecture of the system.


DataNode

  • The HDFS cluster contains multiple DataNodes.
  • Each DataNode contains multiple data blocks.
  • These data blocks are used to store data.
  • It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
  • It performs block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker

  • The role of the Job Tracker is to accept MapReduce jobs from the client and process the data with the help of the NameNode.
  • In response, the NameNode provides metadata to the Job Tracker.

Task Tracker

  • It works as a slave node for the Job Tracker.
  • It receives a task and code from the Job Tracker and applies that code to the file. This process is also called a Mapper.

MapReduce Layer

The MapReduce layer comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a Task Tracker fails or times out. In such a case, that part of the job is rescheduled.
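
The rescheduling behaviour described above can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions, not Hadoop's actual scheduling code: the function `run_job`, the task and tracker names, and the round-robin assignment are all hypothetical.

```python
from collections import deque


def run_job(tasks, trackers, failing):
    """Assign tasks to trackers and reschedule tasks whose tracker fails.

    tasks    -- list of task ids (hypothetical names)
    trackers -- list of tracker names
    failing  -- set of trackers that fail or time out when given a task
    Returns a dict mapping each task id to the tracker that completed it.
    """
    pending = deque(tasks)
    healthy = list(trackers)
    completed = {}
    turn = 0
    while pending:
        if not healthy:
            raise RuntimeError("no healthy Task Trackers left")
        task = pending.popleft()
        tracker = healthy[turn % len(healthy)]  # simple round-robin choice
        turn += 1
        if tracker in failing:
            # The tracker failed or timed out: stop using it and
            # put the task back in the queue so it is rescheduled.
            healthy.remove(tracker)
            pending.append(task)
            continue
        completed[task] = tracker
    return completed


result = run_job(["map-0", "map-1", "map-2"], ["tt-a", "tt-b"], failing={"tt-b"})
print(result)
```

In this run, `map-1` is first handed to the failing tracker `tt-b`, goes back into the queue, and is eventually completed by `tt-a`.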

Advantages of Hadoop

  1. Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. The tools that process the data often run on the same servers, which reduces processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
  2. Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
  3. Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is cost effective compared to a traditional relational database management system.
  4. Resilient to failure: HDFS replicates data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable.
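
The replication idea in point 4 can be sketched as follows. This is a simplified illustration, not HDFS's real placement policy: the round-robin placement, the functions `place_replicas` and `read_block`, and the node and block names are assumptions made for the example.

```python
REPLICATION_FACTOR = 3  # the usual value mentioned above; configurable


def place_replicas(block_ids, nodes, factor=REPLICATION_FACTOR):
    """Assign every block to `factor` distinct nodes (simple round-robin)."""
    if factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(factor)]
    return placement


def read_block(block, placement, down):
    """Return the first live node holding the block, or None if all are down."""
    for node in placement[block]:
        if node not in down:
            return node
    return None


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
print(placement["blk_0"])                            # ['dn1', 'dn2', 'dn3']
print(read_block("blk_0", placement, down={"dn1"}))  # dn2
```

With three copies of each block, the read still succeeds when one or two of the nodes holding it are down. It fails only when every replica is unavailable.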

Hadoop Modules

  • HDFS: Hadoop Distributed File System. Google published its GFS (Google File System) paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
  • YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
  • MapReduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
  • Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
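
To make the key-value flow of MapReduce concrete, here is a minimal single-process word count. It is a sketch of the programming model only, written in Python rather than against the Hadoop Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster, the map calls run in parallel on different nodes and the grouping step moves data across the network. Here everything runs in one process to keep the example short.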

What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications.

It is cost effective because it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.

Where to use HDFS

  • Very Large Files: Files should be hundreds of megabytes, gigabytes, or more in size.
  • Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
  • Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

  • Low Latency Data Access: Applications that need very fast access to the first record should not use HDFS, because it prioritizes throughput over the whole data set rather than the time to fetch the first record.
  • Lots of Small Files: The name node holds the metadata of files in memory, so a large number of small files consumes a lot of the name node's memory, which is not feasible.
  • Multiple Writes: It should not be used when data has to be written multiple times.
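
The small-files problem can be made tangible with a back-of-the-envelope estimate. The figure of roughly 150 bytes of name node memory per file or block object is a commonly quoted rule of thumb, not something stated in this article, so treat it as an assumption. The function `namenode_bytes` is likewise a hypothetical helper.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE,
                   bytes_per_object=BYTES_PER_OBJECT):
    """Estimate name node memory for `num_files` files of `file_size` bytes."""
    # Ceiling division; every file needs at least one block entry.
    blocks_per_file = max(1, -(-file_size // block_size))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * bytes_per_object


# The same 1 TiB of data, stored as 1 MiB files versus 1 GiB files.
small = namenode_bytes(num_files=1024 * 1024, file_size=1 * MB)
large = namenode_bytes(num_files=1024, file_size=1024 * MB)
print(small, large)  # 314572800 1382400
```

Under these assumptions the same amount of data needs about 300 MB of name node memory as small files, but only about 1.3 MB as large files.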

HDFS Concepts

  1. Blocks: A block is the minimum amount of data that can be read or written. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in a conventional file system, a file in HDFS that is smaller than the block size does not occupy a full block's worth of storage; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
  2. Name Node: HDFS works in a master-worker pattern in which the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations such as opening, closing, and renaming are executed by it.
  3. Data Node: Data nodes store and retrieve blocks when they are told to by a client or the name node. They report back to the name node periodically with the list of blocks they are storing. A data node, running on commodity hardware, also does the work of block creation, deletion, and replication as instructed by the name node.
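
The block arithmetic from point 1 can be checked with a few lines of code. This is a small sketch, and `split_into_blocks` is a hypothetical helper, not part of any Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block only takes as much space as the data left over.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


print([size // MB for size in split_into_blocks(300 * MB)])  # [128, 128, 44]
print([size // MB for size in split_into_blocks(5 * MB)])    # [5]
```

A 300 MB file is stored as two full 128 MB blocks plus one 44 MB block, and a 5 MB file occupies a single 5 MB block rather than a full 128 MB.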

[Figure: HDFS DataNode and NameNode]

[Figure: HDFS Read]

[Figure: HDFS Write]

Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the data nodes. To overcome this, the concept of a secondary name node arises.

Secondary Name Node: It is a separate physical machine that acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
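
One way to picture a checkpoint is as a merge of the last saved snapshot of the namespace with the changes logged since then. The sketch below is a toy model under that assumption: the names `fsimage` and `edit_log`, the edit tuples, and the functions `apply_edit` and `checkpoint` are illustrative and do not appear in the text above.

```python
def apply_edit(namespace, edit):
    """Apply one logged namespace change to an in-memory snapshot."""
    op = edit[0]
    if op == "create":
        namespace[edit[1]] = []
    elif op == "add_block":
        namespace[edit[1]].append(edit[2])
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(edit[1])
    elif op == "delete":
        namespace.pop(edit[1], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the logged edits into a copy of the snapshot.

    Returns the new snapshot and an empty edit log.
    """
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/logs/b.txt': ['blk_2'], '/logs/old.txt': ['blk_1']}
```

After the merge, a restart only has to load the fresh snapshot instead of replaying a long history of changes, which is how checkpoints help reduce downtime.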
