Hadoop

Apache Hadoop

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop YARN
  • Hadoop MapReduce

HDFS

Hadoop Distributed File System (based on the Google File System, GFS)

  • Serves as the distributed file system for most tools in the Hadoop ecosystem
  • Scalability for large data sets
  • Reliability to cope with hardware failures

HDFS is good for

  • Large files read with high-throughput, streaming access

Not good for

  • Lots of small files (each file's metadata occupies NameNode memory)
  • Low-latency access (reads are disk-bound)
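As a quick illustration of how a client talks to HDFS, here is a minimal sketch that reads a file through the Hadoop FileSystem Java API; the NameNode address (hdfs://namenode:9000) and the file path are placeholders for this example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; host and port are placeholders
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```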

HDFS Architecture

Master/Slave design

  • Master node
    • A single NameNode that manages the filesystem metadata
  • Slave nodes
    • Multiple DataNodes that store the actual data
  • Other
    • A Secondary NameNode that periodically checkpoints the NameNode's metadata (merging the edit log into the fsimage); despite the name, it is not a standby/backup NameNode
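To make the master/slave split concrete, the sketch below asks the NameNode, via the client API, for the list of registered DataNodes. It assumes the classpath carries a Hadoop configuration pointing at a running HDFS cluster, so the FileSystem handle can be cast to DistributedFileSystem.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        // The client talks to the NameNode, which tracks every DataNode
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName()
                        + " capacity=" + dn.getCapacity());
            }
        }
    }
}
```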

NameNode

  • Keeps the filesystem metadata: file names, the directory tree, and the locations of each file's blocks

DataNode

  • Provides storage for blocks of data and serves read/write requests from clients

HDFS Blocks

Replication of Blocks for fault tolerance

HDFS files are divided into blocks

  • The block is the basic unit of read/write
  • The default block size is 128 MB (in Hadoop 2.x and later)
  • The large block size is what makes HDFS good for large files and poor for many small files

HDFS blocks are replicated multiple times

  • Each block is stored at multiple locations, including on different racks (the default replication factor is 3)
  • This makes HDFS storage fault-tolerant and reads faster, since a client can read each block from the nearest replica
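Block size and replication factor can also be set per file when it is created. The sketch below uses the FileSystem.create overload that takes both; the path and payload are placeholders, and the values shown simply restate the cluster defaults (dfs.blocksize, dfs.replication).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSettings {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            short replication = 3;                // copies of each block
            long blockSize = 128L * 1024 * 1024;  // 128 MB, the 2.x+ default
            // Per-file override of the cluster-wide defaults
            try (FSDataOutputStream out = fs.create(
                    new Path("/data/output.bin"), true, 4096,
                    replication, blockSize)) {
                out.writeUTF("example payload");
            }
        }
    }
}
```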

MapReduce

Data processing model designed to process large amounts of data in a distributed/parallel computing environment

When a large data set comes in, it is divided into input splits (by default, one per HDFS block); a Map task processes each split, and Reduce tasks then aggregate the Map output.

Simple programming paradigm for the Hadoop ecosystem

Traditional parallel programming requires expertise in different parallel programming paradigms; MapReduce asks the programmer to write only a map function and a reduce function.

The Map Task and the Reduce Task use a Key-Value structure for both input and output. Map is the operation that turns the input data into (Key, Value) pairs

  • Map returns its output as a list of (Key, Value) pairs.
  • Reduce groups the Map output by key, merges the values that share a key, and extracts the desired result.
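The canonical WordCount job (adapted from the Apache Hadoop MapReduce tutorial) shows the whole pattern: the Mapper emits (word, 1) pairs, the framework groups them by key during the shuffle, and the Reducer sums the counts. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in the input line, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for the same key arrive together; sum them
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it runs with hadoop jar wordcount.jar WordCount <input> <output>, where <input> and <output> are HDFS paths.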