What is Big Data?

Big data is defined as collections of datasets whose volume, velocity or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools. In recent years, there has been exponential growth in both structured and unstructured data generated by information technology, industrial, healthcare, Internet of Things, and other systems.

Big Data Analytics

Big data analytics deals with the collection, storage, processing and analysis of this massive-scale data. Specialized tools and frameworks are required for big data analysis when:

  1. the volume of data involved is so large that it is difficult to store, process and analyze the data on a single machine,
  2. the velocity of data is very high and the data needs to be analyzed in real time,
  3. there is a variety of data involved, which can be structured, unstructured or semi-structured, and is collected from multiple data sources,
  4. various types of analytics, such as descriptive, diagnostic, predictive and prescriptive analytics, need to be performed to extract value from the data.

Big Data tools and frameworks have distributed and parallel processing architectures and can leverage the storage and computational resources of a large cluster of machines.
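
The idea of splitting work across machines and then combining the partial results can be sketched on a single computer. The example below is only an illustration, assuming a toy word-count task: threads stand in for cluster nodes, and the function names (count_words, split_into_partitions, parallel_word_count) are hypothetical, not part of any particular framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: count the words within one partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def split_into_partitions(lines, num_partitions):
    """Distribute lines round-robin, as if assigning them to worker machines."""
    return [lines[i::num_partitions] for i in range(num_partitions)]


def parallel_word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # Threads stand in for cluster nodes in this toy example.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Reduce step: merge the per-partition results into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "big data needs big clusters",
    "data arrives fast",
    "clusters process data in parallel",
]
word_counts = parallel_word_count(lines)
print(word_counts["data"], word_counts["big"])
```

Real frameworks add what this sketch leaves out, such as moving data between machines, recovering from node failures, and scheduling work across the cluster.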


Big data analytics involves several steps, including data cleansing, data munging (or wrangling), data processing and visualization. The big data analytics life cycle starts with the collection of data from multiple data sources.

Specialized tools and frameworks are required to ingest the data from different sources into the big data analytics backend. The data is stored in specialized storage solutions (such as distributed filesystems and non-relational databases) which are designed to scale.

Specialized frameworks are chosen based on the analysis requirements (batch or real-time) and the type of analysis to be performed (descriptive, diagnostic, predictive, or prescriptive). Big data analytics is enabled by several technologies, such as cloud computing, distributed and parallel processing frameworks, non-relational databases, and in-memory computing.
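
Two of the analysis types named above can be contrasted on a tiny example. The sketch below is an assumption-laden illustration, not a recommended method: the daily figures are made up, the function names are hypothetical, and a straight-line fit is used only because it is the simplest possible predictive model. Descriptive analytics summarizes what happened, while predictive analytics extrapolates to what may happen next.

```python
def describe(values):
    """Descriptive analytics: summarize what has already happened."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


def fit_line(values):
    """Least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Predictive analytics: extrapolate the fitted trend forward."""
    slope, intercept = fit_line(values)
    next_x = len(values) - 1 + steps_ahead
    return slope * next_x + intercept


# Made-up daily order counts (in thousands) for four consecutive days.
daily_orders = [10, 12, 14, 16]
summary = describe(daily_orders)
next_day = forecast(daily_orders)
print(summary["mean"], next_day)
```

Diagnostic and prescriptive analytics are not shown; they would build on results like these to explain why a trend occurred and to recommend an action.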

Examples of big data

Some examples of big data are listed as follows:

  • Data generated by social networks including text, images, audio and video data
  • Click-stream data generated by web applications, such as e-commerce sites, and used to analyze user behavior
  • Machine sensor data collected from sensors embedded in industrial and energy systems for monitoring their health and detecting failures
  • Healthcare data collected in electronic health record (EHR) systems
  • Logs generated by web applications
  • Stock market data
  • Transactional data generated by banking and financial applications

Characteristics of Big Data

The underlying characteristics of big data include:

1. Volume
Big data is data whose volume is so large that it does not fit on a single machine; therefore, specialized tools and frameworks are required to store, process, and analyze it.

For example, social media applications process billions of messages every day, industrial and energy systems can generate terabytes of sensor data every day, and cab aggregation applications can process millions of transactions in a day.

The volume of data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially, driven by the falling costs of data storage and processing architectures and by the need to extract valuable insights from the data to improve business processes, efficiency, and service to consumers.

Although there is no fixed threshold for the volume of data to be considered big data, the term is typically used for massive-scale data that is difficult to store, manage, and process using traditional databases and data processing architectures.
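
A back-of-the-envelope calculation shows why such volumes outgrow a single machine. The figures below are assumptions chosen only for illustration (two billion messages a day, about 1 KB per message, a 4 TB disk); they are not measurements from any real system.

```python
TB = 10**12  # one terabyte in bytes (decimal definition)


def daily_volume_bytes(events_per_day, avg_event_bytes):
    """Raw bytes generated per day, ignoring replication and indexes."""
    return events_per_day * avg_event_bytes


def days_to_fill(capacity_bytes, events_per_day, avg_event_bytes):
    """How many days of data fit into a given storage capacity."""
    return capacity_bytes / daily_volume_bytes(events_per_day, avg_event_bytes)


# Assumed workload: 2 billion messages per day at roughly 1 KB each.
messages_per_day = 2_000_000_000
avg_message_bytes = 1_000

daily = daily_volume_bytes(messages_per_day, avg_message_bytes)
# Assumed hardware: a single machine with a 4 TB disk.
days = days_to_fill(4 * TB, messages_per_day, avg_message_bytes)
print(daily / TB, "TB per day; disk full after", days, "days")
```

Under these assumptions the workload produces 2 TB of raw data per day and fills the disk in two days, before accounting for replication, indexes, or the time needed to scan the data.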

2. Velocity
Velocity of data refers to how fast the data is generated. Data from certain sources, such as social media or sensors, can arrive at very high velocities.

Velocity is another important characteristic of big data and the primary reason for the exponential growth of data. High velocity causes the accumulated volume of data to become very large in a short span of time. Some applications, such as trading or online fraud detection, have strict deadlines for data analysis, so the data needs to be analyzed in real time.

Specialized tools are required to ingest such high velocity data into the big data infrastructure and analyze the data in real-time.
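
Real-time analysis usually means processing each event as it arrives rather than storing everything first. A minimal sketch of that idea is a sliding-window counter that flags bursts of events, as a fraud detector might. The class and function names, the 10-second window, and the threshold are all hypothetical, and the code assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the most recent `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count inside the current window."""
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


def detect_bursts(event_times, window_seconds, threshold):
    """Return the timestamps at which the windowed count exceeds threshold."""
    counter = SlidingWindowCounter(window_seconds)
    alerts = []
    for t in event_times:
        if counter.add(t) > threshold:
            alerts.append(t)
    return alerts


# Made-up event times in seconds; three events arrive close together near t=9.
event_times = [0.0, 5.0, 9.0, 9.5, 9.8, 30.0]
alerts = detect_bursts(event_times, window_seconds=10, threshold=3)
print(alerts)
```

Each event is handled in a single pass and only the events inside the window are kept in memory, which is what makes this style of processing suitable for data that never stops arriving.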

3. Variety
Variety refers to the forms of the data. Big data comes in different forms such as structured, unstructured or semi-structured, including text data, image, audio, video and sensor data. Big data systems need to be flexible enough to handle such variety of data.
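
One common way to cope with variety is to convert each form of data into a shared record layout before analysis. The sketch below assumes three made-up inputs about the same equipment (a CSV row, a JSON document, and a free-text note) and a hypothetical common layout with format, device, temperature and note fields.

```python
import csv
import io
import json


def parse_structured(csv_text):
    """Structured data: fixed columns, one record per row."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "format": "structured",
            "device": row["device"],
            "temperature": float(row["temperature"]),
            "note": None,
        })
    return records


def parse_semi_structured(json_text):
    """Semi-structured data: nested fields that may or may not be present."""
    doc = json.loads(json_text)
    temperature = doc.get("readings", {}).get("temperature")
    return [{
        "format": "semi-structured",
        "device": doc.get("device"),
        "temperature": temperature,
        "note": None,
    }]


def parse_unstructured(text):
    """Unstructured data: keep the raw text for later processing."""
    return [{
        "format": "unstructured",
        "device": None,
        "temperature": None,
        "note": text.strip(),
    }]


records = (
    parse_structured("device,temperature\npump-1,71.5\n")
    + parse_semi_structured('{"device": "pump-2", "readings": {"temperature": 68.0}}')
    + parse_unstructured("Operator note: pump-3 is running hot\n")
)
for record in records:
    print(record)
```

The unstructured note is stored as-is; extracting the device name or a temperature from it would need text analysis, which is outside the scope of this sketch.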

4. Veracity
Veracity refers to how accurate the data is. To extract value from the data, the data needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate.

Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out.
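
A simple form of cleansing is to reject values that are missing or physically implausible. The sketch below assumes temperature readings in degrees Celsius and a hypothetical valid range of -40 to 125; the function name clean_readings and the sample values are made up for illustration.

```python
import math


def clean_readings(readings, low, high):
    """Split readings into plausible values and rejected ones."""
    cleaned, rejected = [], []
    for value in readings:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and not math.isnan(value) and low <= value <= high:
            cleaned.append(value)
        else:
            rejected.append(value)
    return cleaned, rejected


# Made-up sensor data: a missing value, a -999 placeholder, a NaN, and a spike.
raw = [21.4, None, 22.0, -999.0, float("nan"), 21.8, 540.0]
cleaned, rejected = clean_readings(raw, low=-40.0, high=125.0)
print(cleaned)
print(len(rejected), "readings rejected")
```

Keeping the rejected values rather than discarding them silently makes it possible to audit how much of the data was faulty, which is itself a useful measure of veracity.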

5. Value
Value of data refers to the usefulness of data for the intended purpose. The end goal of any big data analytics system is to extract value from the data.

The value of the data is also related to the veracity, or accuracy, of the data. For some applications, value also depends on how fast the data can be processed.
