Big Data on a Shoestring

The companion volume Big Data for Small Business discusses how businesses can gain a competitive advantage by using Big Data techniques to filter out noise and determine trends in very large, unstructured data sets. Big Data is a toolbox to perform analysis on large (petabytes) sets of unstructured (Twitter, chats, web logs) data that change in near real-time.

But there is a fork in the road in terms of how businesses can use Big Data:

» To process Operational Data that changes in real-time.

» To analyze trends in massive volumes of structured and unstructured data that are set aside by using batch jobs.

A business may benefit from both flavors of Big Data tools. We want to avoid getting too immersed in buzzwords and stay focused on how to realize the greatest Big Data benefits for the least cost.

NOTE: This is a technical subject. You may need to do some research on your own to learn some basic Linux commands. This is an introductory deep dive for business generalists who want to learn more about Hadoop. For a gentler introduction, see the companion volume Big Data for Small Business.

1 – Cassandra

“In Greek mythology, Cassandra (Greek Κασσάνδρα, also Κασάνδρα) was the daughter of King Priam and Queen Hecuba of Troy. Her beauty caused Apollo to grant her the gift of prophecy.”

-Wikipedia

We are all familiar with ATMs, where each check that is deposited is considered a transaction: a discrete set of steps with a beginning and an end. Transactional systems in the database world have ways to make sure changes are saved properly, including discarding partial information. Cassandra is a distributed database that is not transactional; rather, it is much more fluid and suited for operational data.
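To make "discarding partial information" concrete, here is a minimal sketch of a transactional check deposit. It uses SQLite from the Python standard library rather than any banking system or Cassandra, and the table names, the deposit_check function, and the simulated failure are all hypothetical, chosen only for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one account holding 100 (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE deposits (account_id INTEGER, amount INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.commit()
    return conn


def deposit_check(conn, account_id, amount, fail_midway=False):
    """Record a check deposit as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO deposits (account_id, amount) VALUES (?, ?)",
                (account_id, amount),
            )
            if fail_midway:
                # Simulate a crash after step one but before step two.
                raise RuntimeError("simulated failure between the two steps")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except RuntimeError:
        return False
```

If the deposit fails halfway, the half-written deposit record is rolled back along with everything else in the transaction, so the account never shows a deposit without the matching balance change. That bookkeeping is the overhead a non-transactional store avoids.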

Imagine an airplane with thousands of measurements occurring in real-time. Everything from speed, height, thrust, and navigation to the health of the airplane systems needs to be checked almost instantaneously. It is wasted effort to spend a lot of time making sure each data-point is saved somewhere. Rather, the operational data of the plane needs to be fed to the command center (the pilots) in real-time with as little overhead as possible. Cassandra's strengths and typical uses include:



» Fault-tolerant peer-to-peer architecture

» Performance that can be easily tuned

» Session storage (imagine sites like Netflix with millions of people streaming videos)

» User data storage

» Scalable, low-latency storage for mobile apps

» Critical data storage

2 – Hadoop

As discussed in the companion volume Big Data for Small Business, Hadoop is really good at the following:



» Reporting on large amounts of unstructured data

» Ability to sort and perform simple calculations on large amounts of unstructured and structured data:

o Counting words – this is the standard MapReduce example

o High-volume analysis – gathering and analyzing large-scale ad network data

o Recommendation engines – analyzing browsing and purchasing patterns to recommend a product

o Social graphs – determining relationships between individuals
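The word-counting example mentioned in the list above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps running on one machine; it does not use Hadoop itself, and the function names are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: add up all the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

For example, word_count(["big data", "Big deal"]) counts "big" twice and the other two words once each. On a real cluster the map and reduce steps would run in parallel on many machines, each handling a slice of the data.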

3 – Our Big Data Analytics Example Using a Pig Latin Sample Script

For the purpose of this guide, we will work through setting up a Hadoop Big Data Analytics example and run a simple Pig Latin example script from the Pig tutorial. The script performs some analysis on a query log from the Excite search engine.
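To give a feel for the kind of question such an analysis answers, here is a small Python sketch that finds the most frequent search phrases for each hour of the day. It is not a translation of the tutorial script. It assumes each log line holds three tab-separated fields (a user ID, a timestamp in YYMMDDHHMMSS form, and the query text); that layout and the function names are assumptions made for illustration.

```python
from collections import Counter


def parse_line(line):
    """Split one log line into (user, hour, query); return None for malformed lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or len(parts[1]) != 12 or not parts[2].strip():
        return None
    user, timestamp, query = parts
    hour = timestamp[6:8]  # assumed layout YYMMDDHHMMSS, so characters 6-7 are HH
    return user, hour, query.strip().lower()


def top_queries_by_hour(lines, n=1):
    """Return {hour: [(query, count), ...]} with up to n most frequent queries per hour."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            _, hour, query = record
            counts[(hour, query)] += 1
    by_hour = {}
    for (hour, query), count in counts.most_common():
        bucket = by_hour.setdefault(hour, [])
        if len(bucket) < n:
            bucket.append((query, count))
    return by_hour
```

On a laptop this works for a small sample file. The point of Pig and Hadoop is to run the same kind of grouping and counting across files far too large for one machine.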

For future reference, you can find huge data sets to test Big Data with at the following site:

http://aws.amazon.com/publicdatasets/

Some examples that can be useful to businesses:

» US and foreign census data

» Labor statistics

» Federal Reserve data

» Federal contracts

Here are examples that are useful to scientists:

» Daily global weather measurements

» Genome databases

We may want to use census data from our local metropolitan area to identify trends such as disposable income or demographics like where elderly or young people reside. This type of marketing savvy requires not only computer power but also the framework that Hadoop provides. Think of Hadoop as a toolbox for managing huge volumes of unstructured and structured data.

In Amazon’s example, these sample big data sets are accessible by signing up for their EC2 service. This is a metered service that allows businesses and institutions to run applications and services in the cloud. Amazon acts as a central utility, like the electric company, from which customers rent services: in this case, computing power and data storage. EC2 is Amazon’s Elastic Compute Cloud, and what follows are the steps to set up an EC2 account through Amazon.

Here is the sign-up screen to “rent” Amazon Web Services to run your application and database in the cloud. This is a metered service whose cost fluctuates with your demand. It will not break the bank.

Better yet, let’s sign up for the micro version.

Free Tier*

As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:

750 hours of EC2 running Linux/Unix Micro instance usage

750 hours of EC2 running Microsoft Windows Server Micro instance usage

750 hours of Elastic Load Balancing plus 15 GB data processing

30 GB of Amazon EBS Standard volume storage plus 2 million IOs and 1 GB snapshot storage

15 GB of bandwidth out aggregated across all AWS services

1 GB of Regional Data Transfer
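A quick calculation shows what the 750-hour allowance means in practice. The sketch below assumes that free hours are counted per running instance and added up across instances; treat that as an assumption to verify against Amazon's current terms. The function names are made up for illustration.

```python
FREE_HOURS_PER_MONTH = 750  # allowance quoted in the Free Tier list above


def hours_in_month(days):
    """Hours one instance accumulates if it runs nonstop for the given number of days."""
    return days * 24


def free_tier_covers(instances, days=31):
    """True if running this many micro instances nonstop stays within the free hours.

    Assumes hours are summed across instances (an assumption, not a quoted rule).
    """
    return instances * hours_in_month(days) <= FREE_HOURS_PER_MONTH
```

The longest month has 31 x 24 = 744 hours, so one micro instance can run around the clock within the allowance, while two instances running nonstop (1,488 hours) would not fit under that assumption.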

Not bad for test driving this service. You need to provide your credit card number and complete a phone validation (to make sure you are a real person). Remember: we want the micro service to start off with. You will receive a confirmation email, and you should select MANAGE YOUR ACCOUNT. Sign in with the credentials that you created and:
