Taken from: http://architects.dzone.com/articles/hadoop-azure-introduction
Hadoop on Azure: An Introduction
11.22.2012
I am in complete awe at how this technology is resonating with today's developers. If I invite developers to an evening event, Big Data is always a sellout. This particular post is about getting everyone up to speed on what Hadoop is at a high level. Big data describes technology for managing voluminous amounts of unstructured and semi-structured data. Because of its size and semi-structured nature, that data is a poor fit for analysis in a relational database. Big data is generally measured in petabytes and exabytes.
- However, it is not just about the total size of the data (volume)
- It is also about the velocity (how rapidly the data is arriving)
- It is also about the structure: what is it, and does it have variations?
Science | Scientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
Sensors | Data sets grow in size in part because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
Social networks | I am thinking of Facebook, LinkedIn, Yahoo, Google.
Social influencers | Blog comments, Yelp reviews, Twitter, Facebook likes, Apple's App Store, Amazon, ZDNet, etc.
Log files | Computer and mobile device log files, website tracking information, application logs, and sensor data. There are also sensors in vehicles, video games, cable boxes and, soon, household appliances.
Public data stores | Microsoft Azure Marketplace/DataMarket, The World Bank, SEC/EDGAR, Wikipedia, IMDb.
Data warehouse appliances | Teradata, IBM Netezza, EMC Greenplum; these hold internal, transactional data that is already prepared for analysis.
Network and in-stream monitoring technologies | TCP/IP packets, email, etc.
Legacy documents | Archives of statements, insurance forms, medical records, and customer correspondence.
Two problems to solve
Storage Problem | How do I store a petabyte of data reliably? After all, a petabyte is over 333 3 TB drives.
Money Problem | A petabyte costs a lot. Even 70 TB will run you over $100,000 (an eBay listing for a Dell/EMC CLARiiON CX3-40 FAST 4G 15K SAN offers just 70 TB for $112,000).
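To put the storage math in concrete terms, here is a quick back-of-the-envelope sketch (the decimal units and the 3 TB commodity drive size are my assumptions, matching the figure above):

```java
public class StorageMath {
    public static void main(String[] args) {
        // Assuming decimal units: 1 PB = 1,000 TB
        long petabyteInTb = 1_000;
        long driveSizeTb = 3; // a commodity 3 TB drive

        // Round up, since you cannot buy a fraction of a drive
        long drivesNeeded = (petabyteInTb + driveSizeTb - 1) / driveSizeTb;
        System.out.println(drivesNeeded + " drives"); // 334 drives
    }
}
```

And that is before any replication: Hadoop's default of three copies would triple the drive count again.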
The Google File System | It is about a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients | http://research.google.com/archive/gfs.html |
MapReduce: Simplified Data Processing on Large Clusters | MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key | http://research.google.com/archive/mapreduce.html |
- Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in Java.
- I met its creator, Doug Cutting, who was working at Yahoo at the time. Hadoop is named after his son's toy elephant. I was hosting a booth at the time, and I remember Doug was curious about finding some cool stuff to bring home from the booth to give to his son. Another great idea, Doug!
- One of the goals of Hadoop is to run applications on large clusters of commodity hardware. The cluster is composed of a single master and multiple worker nodes.
- Hadoop leverages the map/reduce programming model, which is optimized for processing large data sets.
- MapReduce is typically used for distributed computing on clusters of computers. A cluster has many “nodes,” where each node is a computer in the cluster.
- The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various slave or worker nodes in the cluster, and process the data in parallel. Hadoop leverages a distributed file system to store the data on the various nodes.
The core is map() and reduce(). Understanding these methods is the key to mastering Hadoop.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException
        {
            // Emit (word, 1) for every token in the input line
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException
        {
            // Sum the counts for each word
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
The “map” in MapReduce
- There is a master node and many slave nodes.
- The master node takes the input, divides it into smaller sub-problems, and distributes the work to worker or slave nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
- The worker/slave nodes each process their smaller problem and pass the answer back to the master node.
- Because each mapping operation is independent of the others, all maps can be performed in parallel.
- The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.
- Reducers can also perform the reduction phase in parallel. That is how the system can process petabytes of data in a matter of hours.
The map() function will generate a list of key/value pairs based on the data.
The shuffle() phase will group those pairs by key for the reduce() phase.
The reduce() phase will take the grouped key/value pairs and hand them to you to do something with.
- The Hello World sample for Hadoop is a word count example.
- Let's assume our quote is this:
- It is time for all good men to come to the aid of their country.
map() function (see the "to" part) finds "to" twice | (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (to, 1) (men, 1) (to, 1) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1) |
shuffle() function (see the "to" part) creates (to, 1, 1) | (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (to, 1, 1) (men, 1) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1) |
reduce() function (see the "to" part) creates (to, 2) | (It, 1) (is, 1) (time, 1) (for, 1) (all, 1) (good, 1) (men, 1) (to, 2) (come, 1) (the, 1) (aid, 1) (of, 1) (their, 1) (country, 1) |
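The three phases in the table above can be sketched in plain Java, outside of Hadoop, just to show the data flow. This is a toy illustration of the technique, not Hadoop API code; the class and variable names are my own:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniWordCount {
    public static void main(String[] args) {
        String quote = "It is time for all good men to come to the aid of their country";

        // map(): emit a (word, 1) pair for every token
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String token : quote.split(" ")) {
            mapped.add(Map.entry(token, 1));
        }

        // shuffle(): group the 1s by key, producing e.g. ("to", [1, 1])
        Map<String, List<Integer>> shuffled = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }

        // reduce(): sum each key's list, producing e.g. ("to", 2)
        Map<String, Integer> reduced = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            reduced.put(entry.getKey(), sum);
        }

        System.out.println("to -> " + reduced.get("to")); // to -> 2
    }
}
```

In real Hadoop the three phases run on different machines, and the shuffle moves data across the network; here they are just three loops, but the key/value flow is the same.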
High-level Architecture
- There are two main layers to both the master node and the slave nodes – the MapReduce layer and the Distributed File System Layer. The master node is responsible for mapping the data to slave or worker nodes.
Hadoop Common | The common utilities that support the other Hadoop modules. |
Hadoop Distributed File System (HDFS) | A distributed file system that provides high-throughput access to application data. |
Hadoop YARN | A framework for job scheduling and cluster resource management. |
Hadoop MapReduce | A YARN-based system for parallel processing of large data sets. |
Apache Pig | A platform for analyzing large data sets. It includes a high-level language for expressing data analysis programs. A key point of Pig programs is that they support substantial parallelization. Pig consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language called Pig Latin.
Hive | Hive is a data warehouse system for Hadoop. It provides a SQL-like language and helps with data summarization and ad-hoc queries. I am not sure yet whether this is required with Hadoop on Azure; if not, there are just a few command-line tasks to do. But from what I saw, it looks like there is an ODBC Hive Setup module. Wen-Ming Ye has this video: http://channel9.msdn.com/Events/windowsazure/learn/Hadoop-on-Windows-Azure
- We will create a job named “Pi Example.” This very simple sample will calculate pi using a cluster of computers.
- This is not necessarily the best example of big data; it is more of a compute problem.
- The final command line will look like this (16 map tasks, 10,000,000 samples each):
- hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000
- More details on this sample coming soon.