Hadoop: An Introduction for HR and Executives

Here is a high-level overview of Hadoop, Big Data, and their relationship to your business. Let's start by answering the most basic question.

Is Hadoop Right For You?

What is Big Data? Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially ones relating to human behavior and interactions.

Big Data has the potential to transform the way you run your organization. When used properly, it will create new insights and more effective ways of doing business, such as:

  1. How you design and deliver your products to the market
  2. How your customers find and interact with you
  3. Procedures you can put to work to boost the bottom line
  4. Your competitive strengths and weaknesses

What’s even more compelling is that if you have the right technology infrastructure in place, many of these insights can be delivered in real-time. Furthermore, this newfound knowledge isn’t just hypothetical: you can apply what you learn to improve daily operations.

Common Characteristics of Companies That Can Benefit From Analyzing Big Data

  • Larger amounts of information
  • More types of data
  • Data that’s generated by more sources
  • Data that’s retained for longer periods
  • Data that's utilized by more types of applications

Implications of Mishandling Big Data

Failing to keep pace with immense data volumes, a mushrooming number of information sources and categories, longer data retention periods, and an expanding suite of data-hungry applications has derailed many Big Data plans, resulting in:

  1. Delayed or faulty insights
  2. An inability to detect and manage risk
  3. Diminished revenue
  4. Increased cost
  5. Opportunity costs from missed applications and operational uses of data

Fortunately, new tools and technologies are arriving to help make sense of Big Data; distributed processing methodologies, with Hadoop as the prime example, represent fresh thinking about the problem. Hadoop is the king of the hill right now, but many other technologies are entering the space.

Hadoop

Simply stated, Hadoop is a comprehensive software platform that executes distributed data processing techniques. It’s implemented in several distinct, specialized modules:

  • Storage, principally via the Hadoop File System (HDFS), although other, more robust alternatives are available as well
  • Resource management and scheduling for computational tasks
  • A distributed processing programming model based on MapReduce (see the sketch after this list)
  • Common utilities and software libraries required by the entire platform
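
To make the MapReduce model concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, which lets the map and reduce steps be written as ordinary scripts that read standard input and write standard output. The file names and data paths are illustrative assumptions, not details from this article.

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word seen on stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts per word. Hadoop sorts mapper output
    # by key, so identical words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

A cluster would typically run the pair through the Hadoop Streaming jar, along the lines of hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out, where the jar location and data paths are placeholders.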

Examples of Hadoop Applications

Enterprise Data Hub

  • Ultra-fast data ingestion
  • Multi-structured data staging
  • Extract/transform/load and data warehousing offload
  • Mainframe offload
  • Investigative analytics
  • Simple query and reporting

Market Optimization and Targeting

  • Cross-channel behavioral analysis
  • Social media analysis
  • Click-stream analysis
  • Recommendation engines and targeting
  • Advertising impression and conversion analysis

Risk Detection and Prevention

  • Network security monitoring
  • Security information and event management (SIEM)
  • Fraudulent behavioral analysis
  • Bot detection and prevention

Operations Intelligence

  • Supply chain and logistics
  • System log analysis
  • Assembly line quality assurance
  • Preventative maintenance
  • Smart meter analysis

Specific Examples of Hadoop Implementation By Industry

Financial Services: This industry offers some very interesting optimization prospects because of the huge amounts of data it generates, its tight processing windows, strict regulatory and reporting requirements, and the ever-present potential for fraudulent or risky behavior. Hadoop applies distributed processing methodologies that excel at the pattern matching needed to detect fraud and other nefarious activities. It can incorporate hundreds or even thousands of indicators to improve credit score accuracy while also flagging potentially risky situations before they can proceed.

Publishing: Analyze user interactions with mobile reading devices to deliver precise search results as well as more meaningful recommendations. Since these data-driven suggestions are accurate, fine-tuned, and timely, users are more likely to make additional purchases and be satisfied with what they've bought.

Healthcare: It's well known that the job of designing new pharmaceutical products is both costly and very risky. Employing Hadoop for massive data storage and then applying analytics to process and correlate raw financial, patient, and drug data speeds up drug development, improves patient care, and ultimately reduces total healthcare costs across the system.

Retail: Load and then process massive amounts of information, such as website searches, shopping cart interactions, tailored promotion responses, and inventory management, to gain a better understanding of customer buying trends. Rapidly analyzing all of these data points from separate systems makes it possible for the retailer to tailor its prices and promotions based on actual intelligence, rather than hunches.

Advertising: Online advertising systems produce massive amounts of information in the blink of an eye. For example, Google AdWords handles nearly 40,000 ad auctions per second. Even the slightest improvement in advertisement pricing yields tremendous profitability advancements. But these optimizations are only possible if they're conducted in real time, using Hadoop to analyze conversion rates and the cost to serve ads, and then applying this knowledge to drive incremental revenue.

How To Implement Hadoop

With all of these moving parts, there are now several distinct options for organizations seeking to deploy Hadoop and its related technologies. These generally fall into one of three implementation models:

1) Open source Hadoop and support. This pairs bare-bones open source with paid professional support and services. Hortonworks is a good example of this model.

2) Open source Hadoop and management utilities. This goes a step further by joining open source Hadoop with IT-friendly tools and utilities that make things easier for mainline IT organizations. Cloudera is an instance of this model.

3) Open source Hadoop, management utilities, and innovative added value at all layers, including Hadoop's foundation. Some vendors are enhancing Hadoop's capabilities with enterprise-grade features while still remaining faithful to the core open source components. MapR is the best-known adherent to this approach.

Selecting your Hadoop infrastructure is a vital IT decision that will affect the entire organization for years to come, in ways that you can't foresee today. This is particularly true since we're only at the dawn of Big Data in the enterprise. Hadoop is no longer an esoteric, lab-oriented technology; it's becoming mainstream, it's continually evolving, and it must be integrated into your enterprise. Selecting a Hadoop implementation requires the same level of attention and devotion as your organization expends when choosing other critical core technologies, such as application servers, storage, and databases. You can expect your Hadoop environment to be subject to the same requirements as the rest of your IT asset portfolio, including:

  • Service Level Agreements (SLAs)
  • Data protection
  • Security
  • Integration with other applications

PaperBoat Media is a team of highly skilled recruiters who understand the ins and outs of hiring Big Data professionals. Whether for a contract, contract-to-hire, or permanent role, we have the depth, connections, and know-how to help you find the perfect technical and cultural fit for your Big Data and other technology needs.

Hadoop Terms

As with any new technology, Hadoop comes with hundreds of new buzzwords. Here is a glossary of some Hadoop and Big Data terms you may hear thrown around.

Before you start reviewing these definitions, remember the relationships among Big Data, distributed processing methodologies, and Hadoop:

Big Data. This is the reality that most enterprises face regarding coping with lots of new information, arriving in many different forms, and with the potential to provide insights that can transform the business.

Distributed processing methodologies. These procedures leverage the power of multiple computers to divide and conquer even the biggest data collections by breaking large tasks into smaller ones, assigning the work to individual computers, and finally reassembling the results to answer important questions. MapReduce is a prominent example of a distributed processing methodology, with many other offshoots also enjoying success, including streaming, real-time analysis, and machine learning.
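
To make "divide and conquer" tangible, here is a toy sketch in plain Python (not Hadoop itself): it splits a word-counting task across four worker processes and then merges the partial answers. The data and names are invented for illustration.

    from collections import Counter
    from multiprocessing import Pool

    def count_words(chunk):
        # "Map" step: count the words in one slice of the data.
        counts = Counter()
        for line in chunk:
            counts.update(line.split())
        return counts

    if __name__ == "__main__":
        lines = ["big data big insights", "data drives decisions"] * 100000
        chunks = [lines[i::4] for i in range(4)]      # divide the work four ways
        with Pool(processes=4) as pool:
            partials = pool.map(count_words, chunks)  # run the pieces in parallel
        total = sum(partials, Counter())              # reassemble the results
        print(total.most_common(3))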

Hadoop. A comprehensive technology offering that employs distributed processing methodologies to make the most of Big Data. Hadoop is at the center of a thriving ecosystem of open source solutions and value-added products.

Apache Software Foundation. A non-profit corporation that manages numerous collaborative, consensus-based open source projects, including the core technologies that underlie and interact with MapReduce and Hadoop.

Avro. Serialization and remote procedure call (RPC) capabilities for interacting with Hadoop, using the JSON data format. Offers a straightforward approach for portraying complex data structures within a Hadoop MapReduce job. (Apache Software Foundation project)
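
A small sketch of writing and reading an Avro data file with the avro package from PyPI. The schema and file name are invented, and the schema-parsing helper has been spelled parse or Parse in different package versions, so treat this strictly as a sketch.

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Define a tiny record schema in JSON (invented for illustration).
    schema = avro.schema.parse(
        '{"type": "record", "name": "User",'
        ' "fields": [{"name": "name", "type": "string"},'
        '            {"name": "visits", "type": "int"}]}'
    )

    with open("users.avro", "wb") as out:
        writer = DataFileWriter(out, DatumWriter(), schema)
        writer.append({"name": "Ana", "visits": 3})
        writer.close()

    with open("users.avro", "rb") as src:
        reader = DataFileReader(src, DatumReader())
        for record in reader:
            print(record)
        reader.close()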

Big Data. This is the reality that most enterprises face regarding coping with lots of new data, arriving in many different forms, and with the potential to provide insights that can transform the business.

Bigtable. High-performance data storage technology developed at Google but never distributed outside the company. Served as an inspiration for Apache HBase.

Cascading. Abstraction layer meant to exploit the power of Hadoop while simplifying the job of designing and building data processing operations. Developers don't need to program directly against the low-level MapReduce API; they can work at a higher level in familiar languages such as Java.

Cluster. Large-scale Hadoop environment commonly deployed on a collection of inexpensive, commodity servers. Clusters achieve high degrees of scalability merely by adding extra servers when needed, and frequently employ replication to increase resistance to failure.

Data Processing: batch. Analyzing or summarizing very large quantities of information with little to no user interaction while the task is running. Results are then presented to the user upon completion of the processing.

Data Processing: interactive. Live user-driven interactions with data (through query tools or enterprise applications) that produce instantaneous results.

Data Processing: real-time. Machine-driven interactions with data – often continuous. The results of this type of processing commonly serve as input to subsequent real-time operations.

DataNode. Responsible for storing data in the Hadoop File System. Data is typically replicated across multiple DataNodes to provide redundancy.

Drill. Open source framework targeted at exploiting the power of parallel processing to facilitate high-speed, real-time interactions – including ad-hoc analysis – with large data sets. (Apache Software Foundation project)

Extensible Markup Language (XML). A very popular way of representing unstructured and semi-structured information. Text-based and human-readable, XML now underpins hundreds of different document formats.
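
For a feel of what "semi-structured" means in practice, here is a minimal parse of a made-up XML document using Python's standard library:

    import xml.etree.ElementTree as ET

    doc = """
    <orders>
      <order id="1001"><customer>Ana</customer><total>42.50</total></order>
      <order id="1002"><customer>Raj</customer><total>17.00</total></order>
    </orders>
    """

    root = ET.fromstring(doc)
    for order in root.findall("order"):
        # Attributes and nested elements carry the structure.
        print(order.get("id"), order.find("customer").text, order.find("total").text)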

Flume. Scalable technology originally developed at Cloudera, commonly used to capture log information and write it into the Hadoop File System. (Apache Software Foundation project)

GitHub. Internet-based hosting service for managing the software development and delivery process, including version control.

Hadoop. A specific approach for implementing the MapReduce architecture, including a foundational platform and a related ecosystem. (Apache Software Foundation project)

Hadoop File System (HDFS). File system designed for portability, scalability, and large-scale distribution. Written in Java, HDFS employs replication to help increase reliability of its storage. However, HDFS is not POSIX-compliant. (Apache Software Foundation project)
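
One way to touch HDFS from Python is the third-party hdfs package, which speaks the WebHDFS protocol. The host, port, user, and paths below are placeholders, so treat this as a sketch rather than a recipe.

    from hdfs import InsecureClient

    # Connect to the NameNode's WebHDFS endpoint (placeholder address).
    client = InsecureClient("http://namenode.example.com:9870", user="analyst")

    client.write("/data/notes.txt", data=b"hello hdfs", overwrite=True)
    with client.read("/data/notes.txt") as reader:
        print(reader.read())
    print(client.list("/data"))  # directory listing, like "ls"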

HBase. A distributed, non-relational database that runs on top of the Hadoop File System. (Apache Software Foundation project)
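
A put/get round trip against HBase, sketched with the third-party happybase library (which talks to HBase's Thrift gateway). The host, table, and column names are placeholders.

    import happybase

    connection = happybase.Connection("hbase.example.com")  # placeholder host
    table = connection.table("user_events")

    # HBase rows are keyed bytes; columns live inside column families.
    table.put(b"user42", {b"activity:last_page": b"/checkout"})
    row = table.row(b"user42")
    print(row[b"activity:last_page"])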

Hive. Data warehousing infrastructure constructed on top of Hadoop. Offers query, analysis, and data summarization capabilities. (Apache Software Foundation project)
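
Hive queries look like SQL; one way to run them from Python is the third-party PyHive package. The host, credentials, and clickstream table here are placeholders.

    from pyhive import hive

    conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    # A typical summarization query over a (hypothetical) clickstream table.
    cursor.execute(
        "SELECT page, COUNT(*) AS hits "
        "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10"
    )
    for page, hits in cursor.fetchall():
        print(page, hits)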

Impala. A query engine that works with Hadoop and offers SQL language searches on data stored in the Hadoop File System and HBase database.

JavaScript Object Notation (JSON). An open data format standard. Language-independent and human-readable, it is often used as a leaner alternative to XML.
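
Round-tripping a record through JSON takes only Python's standard library; the record itself is invented:

    import json

    record = {"user": "ana", "pages": ["/home", "/pricing"], "purchase": True}
    text = json.dumps(record)        # serialize to a JSON string
    print(text)
    restored = json.loads(text)      # parse it back into a dict
    print(restored["pages"][1])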

Machine Learning. An array of techniques that evaluate large quantities of information and derive automated insights. After a sufficient number of processing cycles, the underlying algorithms become more accurate and deliver better results – all without human intervention.
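
A minimal supervised-learning sketch using scikit-learn. The tiny dataset is invented; a real Big Data model would train on vastly more examples.

    from sklearn.linear_model import LogisticRegression

    # Features: [monthly_visits, support_tickets]; label: 1 = customer churned.
    X = [[2, 5], [1, 7], [30, 0], [25, 1], [3, 6], [28, 0]]
    y = [1, 1, 0, 0, 1, 0]

    model = LogisticRegression().fit(X, y)
    print(model.predict([[4, 4], [27, 1]]))  # predicted churn labels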

Mahout. A collection of algorithms for classification, collaborative filtering, and clustering that deliver machine learning capabilities. Commonly implemented on top of Hadoop. (Apache Software Foundation project)

MapReduce. Distributed, parallel processing techniques for quickly deriving insight into often-massive amounts of information.

Maven. A tool that standardizes and streamlines the process of building software, including managing dependencies among external libraries, components, and packages. (Apache Software Foundation project)

Mirroring. A technique for safeguarding information by copying it across multiple disks. The disk drive, operating system, or other specialized software can provide mirroring.

NameNode. Maintains directory details of all files in the Hadoop File System. Clients interact with the NameNode whenever they seek to locate or interact with a given file. The NameNode responds to these inquiries by returning a list of the DataNode servers where the file in question resides.

Network file system (NFS). A file system protocol that makes it possible for both end users and processes on one computer to transparently access and interact with data stored on a remote computer.

NoSQL. Refers to an array of independent technologies that are meant to go beyond standard SQL to provide new access methods, generally to work with unstructured or semi-structured data.

Oozie. A workflow engine that specializes in scheduling and managing Hadoop jobs. (Apache Software Foundation project)

Open Database Connectivity (ODBC). A database-neutral application programming interface (API) and related middleware that make it easy to write software that works with an expansive assortment of databases.

Open Source. Increasingly popular, collaborative approach for developing software. As opposed to proprietary software, customers have full visibility into all source code, including the right to modify logic if necessary.

Pig. Technology that simplifies the job of creating MapReduce applications running on Hadoop platforms. Uses a language known as ‘Pig Latin’. (Apache Software Foundation project)

POSIX File System. In the context of file systems, POSIX – which stands for Portable Operating System Interface – facilitates both random and sequential access to data. Most modern file systems are POSIX-compliant; however, the Hadoop File System is not.

Scribe. Open source scalable technology developed at Facebook, commonly used to capture log information and write it into the Hadoop File System.

Semi-structured Data. Information that’s neither as rigidly defined as structured data (such as found in relational databases), nor as freeform as unstructured data (such as what’s contained in video or audio files). XML files are a great example of semi-structured data.

Snapshot. A read-only image of a disk volume that’s taken at a particular point in time. This permits accurate rollback in situations when errors may have occurred after the snapshot was created.

Spark. General-purpose cluster computing system, intended to simplify the job of writing massively parallel processing jobs in higher-level languages such as Java, Scala, and Python. Also includes Shark, which is Apache Hive running on the Spark platform. (Apache Software Foundation project)
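
The same word count sketched earlier under Hadoop, expressed in Spark's higher-level Python API (PySpark); the HDFS input path is a placeholder.

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (
        sc.textFile("hdfs:///data/articles")            # placeholder path
          .flatMap(lambda line: line.split())           # split lines into words
          .map(lambda word: (word, 1))                  # pair each word with 1
          .reduceByKey(lambda a, b: a + b)              # sum counts per word
    )
    for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print(word, n)
    sc.stop()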

Structured Query Language (SQL). Highly popular interactive query and data manipulation language, used extensively to work with information stored in relational database management systems (RDBMS).
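
A taste of the SQL described here, run from Python's standard library against an in-memory SQLite database (the sales table is invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("widget", 9.5), ("widget", 4.0), ("gadget", 20.0)],
    )
    # Summarize revenue per product with a GROUP BY.
    for row in conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product"
    ):
        print(row)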

Sqoop. Tool that eases the job of moving data in bulk between Hadoop and structured information repositories such as relational databases. (Apache Software Foundation project)
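
Sqoop is driven from the command line; here is a sketch of kicking off a bulk import from Python, where the JDBC URL, table, and target directory are placeholders:

    import subprocess

    # Import the "orders" table from a relational database into HDFS,
    # splitting the work across four parallel map tasks.
    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost.example.com/sales",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--num-mappers", "4",
        ],
        check=True,
    )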

Structured Data. Information that can be expressed in predictable, well-defined formats – often in the rows and columns used by relational database management systems.

Tez. Applies and reshapes the techniques behind MapReduce to go beyond batch processing and make real-time, interactive queries achievable on mammoth data volumes. (Apache Software Foundation project)

Unstructured Data. Information that can’t be easily described or categorized using rigid, pre-defined structures. An increasingly common way of representing data, with widely divergent examples including XML, images, audio, movie clips, and so on.

YARN (Yet Another Resource Negotiator). New, streamlined techniques for organizing and scheduling MapReduce jobs in a Hadoop environment. (Apache Software Foundation project)
