Big Data Analytics is intended for use as a textbook for third- and fourth-year students of B.E., B.Tech., B.Sc., BCA, MCA, and M.Tech. courses in IT, Software, and Computer Science Engineering. The book is written to help students entering the software industry gain a broad understanding of Big Data and the nuances of handling it to extract useful information. Spread across 21 chapters, it explains the concept of Big Data and walks the reader through popular frameworks such as Hadoop, MongoDB, Pig, and Hive that are used for processing Big Data. The book will also benefit professionals at all levels who seek to transition into the software field.
S Chandramouli is Associate Director, Cognizant Technology Solutions. Asha A George is PPM and Strategy Consultant, Verbat Technologies. CR Rene Robin is Professor (CSE) and Dean (Innovation), Sri Sairam Engineering College, Chennai. D Doreen Hephzibah Miriam is Founder and Director, Computational Intelligence Research Foundation. J Jasmine Christina Magdalene is Assistant Professor, PG Department of Computer Applications, Bishop Heber College, Tiruchirappalli.
Preface Acknowledgements About the Author
Chapter 1 Introduction to Data Analytics
1.1 Introduction
1.2 What Is Data?
1.2.1 Data Relationships
1.2.2 Data Models
1.3 Types of Data
1.4 Nature of Data
1.5 Data Visualization
1.6 Data Analysis Methods
1.6.1 Correlation
1.6.2 Regression
1.6.3 Forecasting
1.6.4 Clustering
1.6.5 Classification
1.7 Web Data
1.7.1 Evolution of Analytic Scalability
1.7.2 Reporting vs. Analysis
Summary | Multiple Choice Questions | Short-answer Questions | Essay-type Questions
Chapter 2 Data Analytics Life-cycle
2.1 Introduction
2.2 Business Drivers for Analytics
2.2.1 Increasing Profitability and Growth
2.2.2 Strengthening Customer Experience and Intimacy
2.2.3 Driving Digital Transformation and Innovation
2.2.4 Managing Regulatory and Compliance Risks
2.2.5 Increasing Operational Efficiency
2.3 Typical Analytical Architecture
2.3.1 Data Analytical Architecture
2.3.2 Challenges of Conventional Systems
2.4 Analytic Processes and Tools
2.4.1 Types of Analytics
2.4.2 Modern Data Analytic Tools
2.5 Data Analytic Life-cycle
2.5.1 Need of Data Analytic Life-cycle
2.5.2 Phases of Data Analytic Life-cycle
2.6 Key Roles for Successful Analytic Projects
2.7 Modern-day Intelligence
2.7.1 Business Intelligence vs. Data Science
2.7.2 Intelligent Data Analysis
Chapter 3 Fundamentals of Big Data
3.1 Introduction to Big Data
3.2 Big Data Concepts and Terminology
3.2.1 Big Data Processing Activities
3.2.2 Common Terminologies
3.3 Fundamentals of Big Data Types
3.4 Big Data Analytics
3.4.1 Text Analytics
3.4.2 Audio Analytics
3.4.3 Video Content Analytics
3.4.4 Social Media Analytics
3.4.5 Predictive Analytics
3.5 Distributed File System in Big Data
3.6 Big Data Characteristics
3.6.1 The 5 V’s of Big Data
3.6.2 Challenges of Processing Big Data
3.7 Drivers for Big Data
Chapter 4 Big Data Analytics Technology
4.1 Introduction to Big Data Analytics
4.2 Big Data Analysis Framework
4.3 Approaches for Big Data Analysis
4.4 Understanding Text Analytics and Big Data
4.4.1 Text Mining Process
4.4.2 Applications of Text Analytics
4.5 Predictive Analysis of Big Data
4.5.1 Predictive Analytics Models
4.5.2 Predictive Analytics Algorithms
4.6 Procedural vs. Functional Programming Models for Big Data
4.7 Big Data Integration Process
4.8 Big Data Technology Landscape
4.8.1 Big Data Architecture
4.8.2 Big Data Storage
4.9 Big Data Key Roles
Chapter 5 Fundamentals of Hadoop
5.1 Introduction
5.2 Problems with Traditional Large-scale Systems
5.3 Five V’s of Big Data
5.4 What Is Hadoop?
5.5 History of Hadoop
5.6 Why Hadoop?
5.7 Different Flavors of Hadoop
5.8 Different Modes of Hadoop
5.8.1 Standalone Mode
5.8.2 Pseudo-distributed Mode (Single-node Cluster)
5.8.3 Fully Distributed Mode
5.9 Core Components of Hadoop
5.10 Hadoop Ecosystem
5.11 Data Ingestion Layer
5.12 ETL and ELT
5.13 Ingestion Tools in Hadoop Ecosystem
5.14 Data Storage Layer
5.14.1 Data Storage Tools
5.15 Processing Layer
5.16 Analysis Layer
5.17 Management and Coordination
5.18 Anatomy of a Hadoop Cluster: HDFS Architecture
5.19 Data Locality in Hadoop
5.20 Configuration Files in Hadoop
5.21 Limitations of Hadoop
5.22 Distributed Cache in Apache Hadoop
Chapter 6 Hadoop Distributed File System
6.1 Introduction
6.2 Virtualization
6.3 Downloading VMware
6.4 Installing VMware
6.5 VirtualBox
6.5.1 VirtualBox Installation Steps
6.6 HDP Sandbox Download and Installation
6.7 Ambari Administration
6.8 HDFS Command Line Interface
6.8.1 JPS Command
6.8.2 List of Files
6.8.3 File Management
6.8.4 Upload and Download Files
6.8.5 Ownership and Validation
Chapter 7 MapReduce
7.11 Hadoop Reducer
7.12 Hadoop Key-Value Pair
7.13 Input Format in MapReduce
7.14 InputSplit in MapReduce
7.15 Hadoop Record Reader
7.16 MapReduce Partitioner
7.16.1 MapReduce Combiner
7.17 Shuffling and Sorting in MapReduce
7.17.1 Hadoop Output Format
7.18 Input Split vs. HDFS Block in MapReduce
7.19 MapOnly Job in MapReduce
7.20 Hadoop Speculative Execution
7.21 Hadoop Counters
7.22 Hadoop Optimization
7.23 MapReduce Performance Tuning: Best Practices
7.23.1 System Level Best Practices
7.23.2 Application Level Best Practices
7.24 YARN
Chapter 8 Hadoop Ingestion
8.1 Introduction
8.2 Data Ingestion Types
8.2.1 Real-time Data Ingestion (RTDI)
8.2.2 Batch-based Data Ingestion (BBDI)
8.2.3 Lambda Architecture Data Ingestion (LADI)
8.3 Benefits of Data Ingestion
8.3.1 Data Ingestion Tools Selection
8.4 Introduction to Sqoop
8.5 Features of Sqoop
8.6 Basic SQL Commands and Connecting from Cloudera
8.7 Basic Sqoop Commands from Cloudera Command Prompt
8.8 Sqoop Importing
8.9 Sqoop Incremental Import
8.10 Sqoop Export
8.11 Advantages of Sqoop
8.12 Disadvantages of Sqoop
10.8 HBase Coprocessor
10.9 Setting HBase Environment
10.10 Creating HBase Tables
10.11 Listing all Tables
10.12 Adding Data to a Table
10.13 Getting a Row of Data
10.14 Scanning a Table
10.15 Counting the Number of Rows in a Table
10.16 Altering a Table
10.17 Deleting a Table Row, Column
10.18 Disabling and Enabling a Table
10.19 Truncating and Dropping a Table
10.20 Determining if Table Exists
10.21 Creating a Hive External Table Stored by HBase
10.21.1 Defining an External Table over HBase Tables
10.21.2 Mapping Specific HBase Columns and Column Families
10.21.3 Working Hive with HBase (Integration)
10.22 Advanced Indexing in HBase
10.23 HIndex
10.23.1 Writing Data with Index
10.23.2 Reading Data with Index
10.23.3 HIndex Features
10.24 HBase Admin API
10.25 HBase Client API
10.25.1 Put Method
10.25.2 Get Method
10.26 Using HBase in Hadoop Applications
10.27 HBase Advanced Usage
10.27.1 Filters
10.27.2 The Filter Hierarchy
10.27.3 Comparison Operators
10.27.4 Comparators
10.27.5 Comparison Filters
10.28 Dedicated Filters
10.29 Decorating Filters
Chapter 11 Hadoop Streaming
11.1 Introduction
11.2 Real-time Analytics
11.2.1 Choosing the Proper Tool for Real-time Analytics
11.2.2 Apache Spark Streaming
11.2.3 Apache Samza
11.2.4 What Would a Perfect Solution Entail?
11.2.5 Challenges to Be Solved
11.3 Thread Pooling
11.4 Stream Computing
11.5 The Future of Data Streaming
11.6 Stream Computing’s Advantages in the Big Data World
11.7 How Streaming Works
11.8 Real-time Streams vs. Batch Processing
11.9 Hadoop Streaming
11.9.1 Hadoop Streaming Characteristics
11.9.2 Specifying Other Plugins for Jobs
Chapter 12 Pig Latin
12.1 Introduction
12.2 Basic Features of Apache Pig
Chapter 13 Fundamentals of Spark
13.6 Design Principles of Apache Spark
13.7 Advantages of Spark
13.8 Disadvantages of Apache Spark
13.9 Installation of Apache Spark on Windows
13.10 Apache Spark Physical Architecture
13.11 Apache Spark Layered Architecture
13.11.1 Resilient Distributed Dataset
13.11.2 Directed Acyclic Graph (DAG)
13.12 Ways to Create RDD in Spark
13.13 Paired RDD
13.14 Features of Spark RDD
13.15 Persistence and Caching Mechanisms in Apache Spark
13.16 Operations of Apache Spark RDD
13.16.1 Transformations
13.16.2 Actions
13.17 Limitations of Apache Spark RDD and Ways to Overcome It
13.18 Directed Acyclic Graph (DAG)
13.19 DAG in Apache Spark
13.19.1 Need for DAG in Apache Spark
13.19.2 Working Principle of DAG in Spark
13.20 Applications of Apache Spark
13.20.1 Streaming Data
13.21 Spark in Real-world
13.22 Use Cases of Spark
13.23 Spark vs. Hadoop
13.24 Sample Program
Chapter 14 Introduction to NoSQL Database Concepts
14.1 Introduction
14.2 Relational Databases
14.3 NoSQL Definition
14.4 Types of NoSQL Databases
14.4.1 Column Family Databases
14.4.2 Key-Value Pair Database
14.4.3 Document Store
14.4.4 Graph Database
14.5 Examples of NoSQL Databases
14.6 Advantages of NoSQL Databases
14.7 NoSQL Usage
14.8 SQL vs. NoSQL
14.9 NewSQL
14.10 ACID
14.10.1 Atomicity
14.10.2 Consistency
14.10.3 Isolation
14.10.4 Durability
14.11 BASE
14.12 Two-phase Commit
14.12.1 Commit–request Phase
14.13 Schema
14.13.1 Sharding and Shared-nothing Architecture
14.13.2 Partitioning Horizontal and Vertical Data
14.13.3 Four Basic Strategies for Shard Structure
14.14 Brewer’s CAP Theorem
14.15 Cassandra – Definition and Features
14.15.1 Definition
14.15.2 Features
14.15.3 Key Structures in Cassandra
14.15.4 Cassandra Advantages and Use Cases
14.16 MongoDB
14.16.1 Architecture of MongoDB
14.16.2 MongoDB Advantages and Use Cases
14.17 HBase
14.17.1 HBase Architecture
14.18 Comparing Cassandra, MongoDB, and HBase
Chapter 15 Cassandra Data Model
15.1 Introduction
15.2 Use Cases of Cassandra
15.3 Cassandra Installation in Windows Environment
15.3.1 Installing Python 2.7.x Edition
15.3.2 Installing Apache Cassandra
15.4 Cassandra Basic CQL
15.5 How to Create, Alter, Drop and Use Keyspace in Cassandra
15.5.1 Create Keyspace
15.5.2 Simple Strategy
15.5.3 Network Topology Strategy
15.6 Column Families
15.6.1 Types of Columns
15.7 Cassandra Table
15.7.1 Inserting and Displaying Data from the Table
15.7.2 Updating the Table Data
15.8 Data Types in Cassandra
15.8.1 Collection Data Type in Cassandra
15.9 Cassandra BATCH
15.10 Difference Between Cassandra and RDBMS
15.11 Denormalization
15.12 Design Patterns
15.12.1 Coexistence Patterns
15.13 RDBMS Migration Patterns
15.14 CAP Patterns
15.15 Temporal Patterns
Chapter 16 Cassandra Architecture
16.1 Introduction
16.1.1 Cassandra Architecture
16.1.2 Features of Cassandra
16.2 Cassandra’s Peer-to-Peer Approach
16.3 Gossip and Failure Detection
16.4 SSTables and Commit Log
16.4.1 Partition and Token
16.4.2 Compression Offset Map
16.4.3 Cassandra Commit Log
16.5 Cassandra Memtable
16.5.1 Memtable Allocation Types
16.5.2 Slab Allocator
16.5.3 Memtable Flush
16.5.4 Row Cache
16.5.5 Cassandra Memtable Metrics
16.6 Hashing to the Rescue
16.7 Compaction in Cassandra
16.8 Tombstones in Cassandra
16.9 Hinted Handoff
16.10 Anti-entropy and Read Repair
16.10.1 Anti-entropy
16.10.2 Read Repair
16.11 Bloom Filters in Cassandra
16.11.1 Bloom Filter
16.11.2 Changing Bloom Filter
16.12 Load Balancing in Cassandra
16.13 Cassandra Read Process
16.13.1 Example of Cassandra Read Process
16.14 Cassandra Write Process
16.15 Staged Event-Driven Architecture (SEDA)
16.16 Cassandra Migration
16.16.1 Migration Approaches
16.16.2 Partition Key Cache
16.16.3 Partition Summary
16.16.4 Partition Index
16.16.5 Cache Migration Pattern
16.16.6 Estimating a Migration
16.17 Streaming
16.17.1 Streaming Based on Netty
16.17.2 Zero-copy Streaming
16.17.3 Parallelizing of Streaming of Keyspaces
Chapter 17 MongoDB
17.1 Introduction
17.2 History of MongoDB
17.3 MongoDB Environment Setup
17.3.1 Install MongoDB on Windows
17.3.2 Starting the MongoDB Server
17.4 MongoDB Schema Design
17.5 Key Features of MongoDB
17.6 RDBMS vs. MongoDB
17.7 MongoDB Query Language (MQL)
17.8 MongoDB Database, Collection and Documents
17.9 MongoDB Server
17.10 MongoDB Client Through the JavaScript Shell
17.11 CRUD Operation in MongoDB
17.11.1 Creating Database in MongoDB (C of CRUD)
17.11.2 Creating Collection in MongoDB
17.11.3 Listing Down the Databases Available in MongoDB
17.11.4 Inserting Records into Collection (Table)
17.11.5 Showcasing the Current Database Used
17.11.6 Showcasing the Tables (Collections) in the Current Database
17.11.7 Reading Collections in MongoDB (R of CRUD)
17.11.8 Updating Documents in MongoDB (U of CRUD)
17.11.9 Delete Operation in MongoDB (D of CRUD)
17.11.10 Dropping (Deleting) a Particular Database
17.12 Pretty() Method
17.13 AND in MongoDB
17.14 OR in MongoDB
17.15 Using AND and OR Together
17.16 NOR in MongoDB
17.17 NOT in MongoDB
17.18 Creating and Querying Through Indexes
17.18.1 The createIndex() Method
17.18.2 MongoDB’s dropIndex() Method
17.18.3 The dropIndexes() Method
17.18.4 The getIndexes() Method
17.19 Mongo Compass
17.19.1 MongoDB Connection
17.19.2 Creating Database in Compass
17.19.3 Adding Documents in Compass
17.19.4 MongoDB View
17.19.5 Filters in Compass
17.19.6 Sorting in Compass
17.19.7 Limit Option in Compass
17.19.8 Skip Option in Compass
17.19.9 Project Option in Compass
17.19.10 Dropping a Database in Compass
17.19.11 Dropping a Collection in Compass
17.19.12 Importing Documents in Compass
17.19.13 Aggregations Option in Compass
17.19.14 Schema Option in Compass
17.19.15 Update MongoDB Compass with the Latest Version
Chapter 18 Big Data Visualizations
18.1 Introduction
18.2 History of Data Visualization
18.3 Big Data Visualization
18.4 Importance of Big Data Visualization
18.5 How Does Data Visualization Work?
18.6 Types of Data Visualization
18.7 Challenges of Big Data Visualization
18.8 Introduction to Tableau
18.8.1 Features of Tableau
18.8.2 Tableau Product Suite
18.8.3 Installation of Tableau
18.8.4 Tableau for Big Data Visualization
18.9 Python for Data Visualization
18.9.1 Installation of Python
18.9.2 Visualization of Data Using Python
18.9.3 Matplotlib
Chapter 19 Business Implementation of Big Data
19.1 Introduction
19.2 Big Data in Business
19.2.1 Big Data in Marketing
19.2.2 Big Data in Banking Sector
19.2.3 Big Data in Healthcare Sector
19.2.4 Big Data in Education Sector
19.3 Security in Big Data
19.3.1 User Access Control
19.4 Big Data on Cloud
19.5 Best Practices in Big Data Implementation
19.6 Latest Trends in Big Data
19.6.1 Big Data Analytics Will Incorporate Artificial Intelligence
19.6.2 The Use of Blockchain for Data Security Will Increase
19.6.3 The Internet of Things (IoT) Will Drive Streaming Analytics Adoption
19.6.4 The Rise of DataOps
19.6.5 Data-as-a-Service (DaaS)
19.6.6 Data Mesh
19.6.7 Synthetic Data
19.6.8 Empowerment of Self-service Analytics
19.6.9 Data Democratization
Chapter 20 Limitations of Hadoop and Solutions to Overcome Them
20.1 Introduction
20.2 Problem with Small Files
20.3 Vulnerability
20.4 Long Processing Time
20.5 Not Easy to Use
20.6 Supports Only Batch Processing
20.7 No Delta Iteration
20.8 Security Issues
Chapter 21 Big Data Case Studies
21.1 Applications of Big Data in the Retail Industry
21.1.1 Customer Segmentation
21.1.2 Inventory Management
21.1.3 Price Optimization
21.1.4 Fraud Detection
21.1.5 Supply Chain Optimization
21.1.6 Predictive Analytics
21.2 Applications of Big Data in the Logistics Industry
21.2.1 Route Optimization
21.2.2 Supply Chain Visibility
21.2.3 Risk Management
21.2.4 Fleet Management
21.2.5 Warehouse Optimization
21.2.6 Pricing Optimization
21.2.7 Quality Control
21.2.8 Environmental Sustainability
21.3 Applications of Big Data in the Manufacturing Industry
21.3.1 Predictive Maintenance
21.3.2 Quality Control
21.3.3 Supply Chain Optimization
21.3.4 Production Optimization
21.3.5 Energy Efficiency
21.3.6 Product Development
21.3.7 Risk Management
21.3.8 Warranty Analytics
21.3.9 Customer Analytics
21.4 Applications of Big Data in the Travel Industry
21.4.1 Customer Service
21.4.2 Predictive Maintenance
21.4.3 Weather Forecasting
21.4.4 Customer Sentiment Analysis
21.4.5 Destination Management
21.4.6 Operational Efficiency
21.4.7 Revenue Management
Summary
Appendix A: Model Questions
Appendix B: Capstone Projects
Appendix C: Model Syllabi
Index