
Hadoop: The Definitive Guide

Author: Tom White
Subtitle: 4th Edition
Publisher: O'Reilly Media
Publication date: 2015-04
ISBN: 9781491901632
Category: Computers

Description

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN

Explore MapReduce in depth, including steps for developing applications with it

Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN

Learn two data formats: Avro for data serialization and Parquet for nested data

Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)

Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop

Learn the HBase distributed database and the ZooKeeper distributed configuration service


About the Author

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net and IBM's developerWorks, and has spoken at several conferences, including at ApacheCon 2008 on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.


Table of Contents

Hadoop Fundamentals

Chapter 1 Meet Hadoop

Data!

Data Storage and Analysis

Querying All Your Data

Beyond Batch

Comparison with Other Systems

A Brief History of Apache Hadoop

What’s in This Book?

Chapter 2 MapReduce

A Weather Dataset

Analyzing the Data with Unix Tools

Analyzing the Data with Hadoop

Scaling Out

Hadoop Streaming

Chapter 3 The Hadoop Distributed Filesystem

The Design of HDFS

HDFS Concepts

The Command-Line Interface

Hadoop Filesystems

The Java Interface

Data Flow

Parallel Copying with distcp

Chapter 4 YARN

Anatomy of a YARN Application Run

YARN Compared to MapReduce 1

Scheduling in YARN

Further Reading

Chapter 5 Hadoop I/O

Data Integrity

Compression

Serialization

File-Based Data Structures

MapReduce

Chapter 1 Developing a MapReduce Application

The Configuration API

Setting Up the Development Environment

Writing a Unit Test with MRUnit

Running Locally on Test Data

Running on a Cluster

Tuning a Job

MapReduce Workflows

Chapter 2 How MapReduce Works

Anatomy of a MapReduce Job Run

Failures

Shuffle and Sort

Task Execution

Chapter 3 MapReduce Types and Formats

MapReduce Types

Input Formats

Output Formats

Chapter 4 MapReduce Features

Counters

Sorting

Joins

Side Data Distribution

MapReduce Library Classes

Hadoop Operations

Chapter 1 Setting Up a Hadoop Cluster

Cluster Specification

Cluster Setup and Installation

Hadoop Configuration

Security

Benchmarking a Hadoop Cluster

Chapter 2 Administering Hadoop

HDFS

Monitoring

Maintenance

Related Projects

Chapter 1 Avro

Avro Data Types and Schemas

In-Memory Serialization and Deserialization

Avro Datafiles

Interoperability

Schema Resolution

Sort Order

Avro MapReduce

Sorting Using Avro MapReduce

Avro in Other Languages

Chapter 2 Parquet

Data Model

Parquet File Format

Parquet Configuration

Writing and Reading Parquet Files

Parquet MapReduce

Chapter 3 Flume

Installing Flume

An Example

Transactions and Reliability

The HDFS Sink

Fan Out

Distribution: Agent Tiers

Sink Groups

Integrating Flume with Applications

Component Catalog

Further Reading

Chapter 4 Sqoop

Getting Sqoop

Sqoop Connectors

A Sample Import

Generated Code

Imports: A Deeper Look

Working with Imported Data

Importing Large Objects

Performing an Export

Exports: A Deeper Look

Further Reading

Chapter 5 Pig

Installing and Running Pig

An Example

Comparison with Databases

Pig Latin

User-Defined Functions

Data Processing Operators

Pig in Practice

Further Reading

Chapter 6 Hive

Installing Hive

An Example

Running Hive

Comparison with Traditional Databases

HiveQL

Tables

Querying Data

User-Defined Functions

Further Reading

Chapter 7 Crunch

An Example

The Core Crunch API

Pipeline Execution

Crunch Libraries

Further Reading

Chapter 8 Spark

Installing Spark

An Example

Resilient Distributed Datasets

Shared Variables

Anatomy of a Spark Job Run

Executors and Cluster Managers

Further Reading

Chapter 9 HBase

HBasics

Concepts

Installation

Clients

Building an Online Query Application

HBase Versus RDBMS

Praxis

Further Reading

Chapter 10 ZooKeeper

Installing and Running ZooKeeper

An Example

The ZooKeeper Service

Building Applications with ZooKeeper

ZooKeeper in Production

Further Reading

Case Studies

Chapter 1 Composable Data at Cerner

From CPUs to Semantic Integration

Enter Apache Crunch

Building a Complete Picture

Integrating Healthcare Data

Composability over Frameworks

Moving Forward

Chapter 2 Biological Data Science: Saving Lives with Software

The Structure of DNA

The Genetic Code: Turning DNA Letters into Proteins

Thinking of DNA as Source Code

The Human Genome Project and Reference Genomes

Sequencing and Aligning DNA

ADAM, A Scalable Genome Analysis Platform

From Personalized Ads to Personalized Medicine

Join In

Chapter 3 Cascading

Fields, Tuples, and Pipes

Operations

Taps, Schemes, and Flows

Cascading in Practice

Flexibility

Hadoop and Cascading at ShareThis

Summary

Appendix Installing Apache Hadoop

Prerequisites

Installation

Configuration

Appendix Cloudera’s Distribution Including Apache Hadoop

Appendix Preparing the NCDC Weather Data

Appendix The Old and New Java MapReduce APIs


Excerpt

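In many ways, MapReduce can be seen as a complement to a relational database management system (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries and updates: once the dataset has been indexed, the database can deliver low-latency retrieval and fast updates of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is better suited to datasets that are continually updated.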
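
The contrast in this excerpt can be illustrated with a small sketch. It is not code from the book and it does not use Hadoop or a real database: the sample records, the function names, and the in-memory dictionary standing in for a database index are all assumptions made for illustration. The batch function reads every record and groups by key before reducing, while the index-based functions touch a single record.

```python
from collections import defaultdict

# Hypothetical sample data: (record_id, year, temperature)
RECORDS = [
    (1, 1949, 111),
    (2, 1949, 78),
    (3, 1950, 0),
    (4, 1950, 22),
    (5, 1950, -11),
]


def map_phase(record):
    """Emit a (year, temperature) pair for one input record."""
    _, year, temperature = record
    yield year, temperature


def reduce_phase(year, temperatures):
    """Collapse all temperatures seen for one year into their maximum."""
    return year, max(temperatures)


def batch_max_temperature(records):
    """Batch style: scan the whole dataset once, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))


def build_index(records):
    """RDBMS style: a dict keyed on record_id stands in for an index."""
    return {record[0]: record for record in records}


def point_query(index, record_id):
    """Fetch a single record through the index, without a full scan."""
    return index.get(record_id)


def point_update(index, record_id, temperature):
    """Update one record in place through the index."""
    _, year, _ = index[record_id]
    index[record_id] = (record_id, year, temperature)


if __name__ == "__main__":
    print(batch_max_temperature(RECORDS))
    index = build_index(RECORDS)
    print(point_query(index, 3))
    point_update(index, 3, 30)
    print(batch_max_temperature(index.values()))
```

In this sketch the batch job costs roughly the same no matter which answer is wanted, because it always reads everything, whereas the indexed lookup and update stay cheap only as long as they touch a small part of the data.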
