Authors: Nicholas Bessmer
© 2013 Nicholas Bessmer
All Rights Reserved
NOTE: This is a technical subject. You may need to do some research on your own to learn some basic Linux commands. This is an introductory deep dive for business generalists who want to learn more about Hadoop. For a gentler introduction, see the companion volume Big Data for Small Business.
The companion volume Big Data for Small Business discusses how businesses can gain a competitive advantage by using Big Data techniques to filter out noise and determine trends in very large, unstructured data sets. Big Data is a toolbox to perform analysis on large sets of unstructured data (Twitter, chats, web logs) that change in real time.
But … there are two forks in the road in terms of how businesses can use Big Data:
To work with operational data that changes in real time.
To analyze trends in massive volumes of structured and unstructured data that are set aside, by using Hadoop.
A business may benefit from both flavors of Big Data tools. We want to avoid getting too immersed in buzzwords and stay focused on how to realize the greatest Big Data benefits for the least cost.
We are all familiar with ATMs, where each check that is deposited is considered a transaction: a discrete set of steps with a beginning and an end. Transactional systems in the database world have ways to make sure changes are saved properly, including discarding partial information. Cassandra is a NoSQL database that is not transactional in this sense; rather, it is much more fluid and suited for operational data.
Imagine an airplane with thousands of measurements occurring in real time. Everything from speed, height, thrust, and navigation to the health of the airplane systems needs to be checked almost instantaneously. It is wasted effort to spend a lot of time making sure each data point is saved somewhere. Rather, the operational data of the plane needs to be fed to the command center (the pilots) in real time with as little overhead as possible. Cassandra is really good at:
Fault-tolerant peer-to-peer architecture.
Performance that can be easily tuned.
Session Storage (imagine sites like Netflix with millions of people streaming videos)
User Data Storage
Scalable, low-latency storage for mobile apps
Critical data storage
As discussed in the companion volume Big Data for Small Business, Hadoop is really good at the following:
Reporting on large amounts of unstructured data
Ability to sort and perform simple calculations on large amounts of unstructured and structured data:
Counting words – this is the standard Map Reduce example
High-volume analysis – gathering and analyzing large scale ad network data
Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
Social graphs – Determining relationships between individuals
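The word-counting item above can be sketched in a few lines. This is a minimal, single-machine illustration of the map, shuffle, and reduce steps in plain Python, not Hadoop's own API; the function names and the sample sentences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical sample input, standing in for a large collection of documents.
sample_lines = ["big data for small business", "big data tools"]
counts = word_count(sample_lines)
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many machines, each working on its own slice of the data; the logic of each step stays this simple.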
For the purpose of this guide, we will work through setting up a Hadoop Big Data Analytics example and run a simple Pig Latin example script from the Pig tutorial. This will perform some analysis on a search query log from the Excite search engine.
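To give a feel for this kind of analysis before any setup, here is a rough Python sketch that counts how often each search phrase appears in each hour of the day. It is not the Pig tutorial script itself. It assumes each log record is tab-separated as user ID, timestamp (YYMMDDHHMMSS), and query; the records below are made up for illustration.

```python
from collections import Counter

# Hypothetical sample records (assumed layout: user id, timestamp, query).
sample_log = [
    "U1\t970916105432\tweather boston",
    "U2\t970916105901\tWeather Boston",
    "U3\t970916113015\tcheap flights",
    "U1\t970916113500\t",  # empty query, dropped during cleaning
]


def parse(line):
    """Split one record and pull the two-digit hour out of the timestamp."""
    user, timestamp, query = line.split("\t")
    return user, timestamp[6:8], query.strip().lower()


def queries_per_hour(lines):
    """Count occurrences of each (hour, query) pair, skipping empty queries."""
    counts = Counter()
    for line in lines:
        _user, hour, query = parse(line)
        if query:
            counts[(hour, query)] += 1
    return counts


hourly = queries_per_hour(sample_log)
for (hour, query), n in sorted(hourly.items()):
    print(hour, query, n)
```

A Pig Latin script expresses the same load, clean, group, and count steps as a handful of statements, and Hadoop spreads the work across the cluster.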
For future reference, you can find huge data sets to test Big Data with at the following site:
Here are examples that can be useful to businesses:
US and foreign census data
Federal Reserve data
Here are examples that are useful to scientists:
Daily global weather measurements
We may want to use census data from our local metropolitan area to identify trends such as disposable income or demographics like where elderly or young people reside. This type of marketing savvy requires not only computer power but also the tools that Hadoop provides. Think of Hadoop as a toolbox that helps people manage huge volumes of unstructured and structured data.
In Amazon’s example, these sample big data sets are accessible by signing up for their EC2 service. This is a metered service that allows businesses and institutions to run applications and services in the cloud.
Amazon acts as a central utility, like the electric company, from which customers rent services: in this case, computing power and data storage. EC2 is Amazon’s Elastic Compute Cloud, and what follows are the steps to set up an EC2 account through Amazon.
Here is the sign-up screen to “rent” Amazon Web Services to run your application and database in the cloud. This is a metered service whose cost fluctuates based on your demand. It will not break the bank.
Better yet, let’s sign up for the micro version.
As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:
750 hours of EC2 running Linux/Unix Micro instance usage
750 hours of EC2 running Microsoft Windows Server Micro instance usage
750 hours of Elastic Load Balancing plus 15 GB data processing
30 GB of Amazon EBS Standard volume storage plus 2 million IOs and 1 GB snapshot storage
15 GB of bandwidth out aggregated across all AWS services
1 GB of Regional Data Transfer
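To put the 750 hours in perspective, a quick calculation shows that they cover one micro instance running around the clock for a whole month, but not two. The sketch below is plain arithmetic on the figure from the list above; the function names are hypothetical, and it does not model how AWS actually meters or bills usage.

```python
FREE_HOURS_PER_MONTH = 750  # free-tier allowance from the list above


def instance_hours(instances, days_in_month):
    """Total hours used if every instance runs 24 hours a day all month."""
    return instances * 24 * days_in_month


def within_free_tier(instances, days_in_month=31):
    """True if always-on usage stays within the monthly free allowance."""
    return instance_hours(instances, days_in_month) <= FREE_HOURS_PER_MONTH


print(instance_hours(1, 31))  # 744
print(within_free_tier(1))    # True
print(within_free_tier(2))    # False
```

So a single always-on micro instance fits within the allowance even in a 31-day month, while running a second one at the same time would go over it.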
Not bad for a test drive of this service. You need to provide your credit card number and do a phone validation (to make sure you are a real person). Remember – we want the micro instance to start off with. You will receive a confirmation email and you should select MANAGE YOUR ACCOUNT. Sign in with the credentials that you created and: