The Hadoop Ecosystem:
Hadoop Distributed File System (HDFS)
HDFS is the default big data storage layer for Apache Hadoop. It is the "secret sauce" of the Hadoop components: users can dump huge datasets into HDFS and the data sits there until they want to leverage it for analysis. HDFS creates several replicas of each data block and distributes them across different nodes in the cluster for reliable and quick data access. HDFS comprises three important components: the NameNode, the DataNode and the Secondary NameNode. It operates on a master-slave architecture model in which the NameNode acts as the master node, keeping track of the storage cluster, while the DataNodes act as slave nodes running on the various machines within the Hadoop cluster.
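A minimal sketch of how an application might write to HDFS through the Java FileSystem API, assuming the cluster configuration files are on the classpath; the path and replication factor below are arbitrary examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration
        Path file = new Path("/user/demo/sample.txt");

        // Write a small file; HDFS splits it into blocks behind the scenes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode to keep three replicas of each block of this file
        fs.setReplication(file, (short) 3);
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}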
MapReduce
MapReduce is a Java-based processing framework, based on a programming model introduced by Google, in which the actual data held in HDFS gets processed efficiently. MapReduce breaks a big data processing job down into smaller tasks and is responsible for analysing large datasets in parallel before reducing them to find the results. In the Hadoop ecosystem, Hadoop MapReduce is a framework based on the YARN architecture. The YARN-based Hadoop architecture supports parallel processing of huge datasets, and MapReduce provides the framework for easily writing applications that run on thousands of nodes while handling faults and failures.
The basic principle of operation behind MapReduce is that the "Map" job sends a query for processing to various nodes in a Hadoop cluster and the "Reduce" job collects all the results and combines them into a single output. A Map task takes the input data, splits it into independent chunks and produces output that becomes the input for the Reduce task; the Reduce task then combines the mapped data tuples into a smaller set of tuples. Both the input and output of these tasks are stored in a file system. MapReduce takes care of scheduling jobs, monitoring jobs and re-executing failed tasks.
The MapReduce framework forms the compute layer while the HDFS file system forms the storage layer. Typically, in the Hadoop ecosystem architecture, the data nodes and compute nodes are the same machines, so that processing can be scheduled close to the data.
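The canonical word-count job below is a minimal sketch of how the Map and Reduce tasks described above fit together: the mapper emits a (word, 1) pair for every word it sees and the reducer sums those counts per word. Input and output HDFS directories are assumed to be passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}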
YARN
YARN forms an integral part of Hadoop 2.0. It is a great enabler of dynamic resource utilization in the Hadoop framework, as users can run various Hadoop applications without having to worry about increasing workloads.
Pig
Pig is a convenient tool developed by Yahoo for analysing huge data sets efficiently and easily. It provides a high-level data flow language, Pig Latin, that is optimized, extensible and easy to use. The most outstanding feature of Pig programs is that their structure lends itself to considerable parallelization, which makes it easy to handle large data sets.
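As a rough sketch, a Pig Latin data flow can also be driven from Java through the PigServer API; the input file access_log.tsv and its field layout below are invented for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A tiny Pig Latin data flow over a hypothetical tab-separated file
        pig.registerQuery("logs = LOAD 'access_log.tsv' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Materialise the result of the 'totals' alias into an output directory
        pig.store("totals", "user_totals");
    }
}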
Hive
Hive, developed by Facebook, is a data warehouse built on top of Hadoop that provides a simple SQL-like language known as HiveQL for querying, data summarization and analysis. Hive makes querying faster through indexing.
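One common way to run HiveQL from Java is through the HiveServer2 JDBC driver, as in the sketch below; the connection URL, credentials and the sales table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, user and table name are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // A simple HiveQL summarization query
            ResultSet rs = stmt.executeQuery("SELECT category, COUNT(*) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}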
Sqoop
The Sqoop component is used for importing data from external sources into related Hadoop components such as HDFS, HBase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads and copies data quickly, enabling efficient data imports and analysis.
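The sketch below shows roughly what such an import looks like when Sqoop is invoked from Java rather than from the command line; the JDBC URL, credentials, table and target directory are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line;
        // the connection details and paths are made-up examples
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/shop",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "4"        // number of parallel map tasks doing the copy
        };
        int exitCode = Sqoop.runTool(importArgs, new Configuration());
        System.exit(exitCode);
    }
}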
Flume
The Flume component is used to gather and aggregate large amounts of data. Apache Flume collects data from its origin and sends it on to its resting location (HDFS). Flume accomplishes this by defining data flows that consist of three primary components: sources, channels and sinks. The processes that run the data flow in Flume are known as agents, and the pieces of data that flow through Flume are known as events.
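As a hedged sketch, a client application can hand events to a running Flume agent through the RPC client API; the host, port and event body below are invented, and the agent is assumed to expose an Avro source.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connects to a hypothetical Flume agent listening on an Avro source at port 41414
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Each record sent through Flume is wrapped in an Event
            Event event = EventBuilder.withBody("user clicked checkout", StandardCharsets.UTF_8);
            client.append(event);   // the agent's channel and sink move it on towards HDFS
        } finally {
            client.close();
        }
    }
}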
HBase
HBase is a column-oriented database that uses HDFS for the underlying storage of data. HBase supports random reads as well as batch computations using MapReduce. With the HBase NoSQL database, an enterprise can create large tables with millions of rows and columns on commodity hardware. HBase is the best fit when there is a requirement for random read or write access to big datasets.
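A minimal sketch of that random read/write pattern with the HBase Java client, assuming a table named users with a column family info already exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id, one column in family "info"
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}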
Oozie
Oozie is a workflow scheduler in which workflows are expressed as Directed Acyclic Graphs. Oozie runs in a Java servlet container (Tomcat) and makes use of a database to store all the running workflow instances, their states and variables, along with the workflow definitions, in order to manage Hadoop jobs (MapReduce, Sqoop, Pig and Hive). Workflows in Oozie are executed based on data and time dependencies.
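A rough sketch of submitting a workflow from Java with the OozieClient API; the server URL, workflow path and properties are placeholders that depend on how the cluster is set up.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server and the HDFS path of the workflow are placeholders
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then check its current state
        String jobId = oozie.run(conf);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " is " + job.getStatus());
    }
}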
ZooKeeper
ZooKeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a Hadoop cluster. ZooKeeper is responsible for synchronization services, distributed configuration management and providing a naming registry for distributed systems.
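A minimal sketch of using ZooKeeper as a distributed configuration store from Java; the ensemble address and znode path are made-up examples.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Wait until the session with the ensemble is actually established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration under a well-known znode
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=500".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can now read the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}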
Mahout
Mahout is an important Hadoop component for machine learning; it provides implementations of various machine learning algorithms. This Hadoop component helps with recommendation (using user behaviour to provide suggestions), clustering (grouping items into their respective groups), classification (assigning items to categories) and frequent itemset mining (determining which items tend to appear together).
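As an illustrative sketch of the recommendation use case, Mahout's Taste API can build a user-based recommender from a ratings file; the file name, neighbourhood size and user id below are arbitrary.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of userID,itemID,preference (file name is a placeholder)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Users are "similar" when their ratings correlate; use the 10 nearest neighbours
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Suggest three items for user 42 based on what similar users liked
        List<RecommendedItem> recommendations = recommender.recommend(42, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}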
Kafka
Kafka is a distributed publish-subscribe messaging system developed by LinkedIn that is fast, durable and scalable. Just like other publish-subscribe messaging systems, feeds of messages are maintained in topics.
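A minimal sketch of publishing a message to a topic with the Kafka Java producer; the broker address and topic name are placeholders.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaPublishExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");   // broker address is a placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one message to the "page-views" topic; subscribers read the same feed
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user42", "viewed /checkout"));
        }
    }
}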