Author: Veeraj Thaploo
Published: March 11, 2013
Tag List : Amazon Web Services, Apache Hadoop, AWS, Big Data, Cloudera, Data warehouse, Data Warehousing, Elastic MapReduce, EMR, Hadoop, Hadoop Distributions, Hadoop Distributions Comparison, HBase, MapReduce
Enterprises with Big Data and Hadoop implementation requirements often need answers to some basic questions before starting the implementation.
1. What do we mean by Hadoop and Hadoop eco-system?
2. Will Hadoop solve my problems?
3. Which Hadoop distribution will fit our requirements best?
In this blog post I have tried to answer the above questions. If you don't know what Big Data is, how do you decide whether your large data aggregation qualifies as Big Data? The blog post Big Data Explained by Varoon Rajani might help you understand that.
What do we mean by Hadoop and Hadoop eco-system?
Apache Hadoop is an open source framework that allows for distributed processing of large data sets across computing clusters, and is the most widely used technology for Big Data processing.
The Hadoop framework has evolved into a set of tools and technologies to efficiently process, store and analyze huge amounts of varied data in a linearly scalable and reliable fashion.
Apache Hadoop has two major projects:
- MapReduce: A programming model and framework for parallel processing of large data sets across a cluster
- Hadoop Distributed File System (HDFS): A distributed file system providing high-throughput access to large data sets
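To make the MapReduce model concrete, here is a minimal word-count sketch in Python. This is illustrative only, not the Hadoop Java API: in a real deployment (for example via Hadoop Streaming) the mapper and reducer would be separate programs reading from stdin, and the framework would handle the shuffle/sort between them.

```python
# Minimal word-count sketch of the MapReduce model (illustrative only).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle/sort: group intermediate pairs by key, as Hadoop does
    # between the map and reduce stages.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(k, (c for _, c in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(run_job(["big data big deal", "data wins"]))
# -> {'big': 2, 'data': 2, 'deal': 1, 'wins': 1}
```

The value of the model is that each phase is trivially parallel: mappers run independently per input split, and reducers run independently per key.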
The Hadoop ecosystem is rapidly evolving, with a large number of community contributors. The following diagram gives an overview of the Hadoop ecosystem.
Some of the ecosystem components are explained below:
- Hive: A data warehouse infrastructure with SQL-like querying capabilities over Hadoop data sets
- Pig: A high level data flow language and execution framework for parallel computation
- ZooKeeper: A high performance coordination service for distributed applications
- Mahout: A scalable machine learning and data mining library
- HBase: A scalable, distributed database that supports structured data storage for large tables
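To show the kind of SQL-like query Hive supports, here is a small aggregation run with Python's sqlite3 module as a local stand-in. The table and column names are made up for the example; Hive itself would run an equivalent GROUP BY query over files in HDFS, compiling it down to MapReduce jobs.

```python
# Hive-style SQL aggregation, illustrated with sqlite3 as a local stand-in.
# Table/column names here are hypothetical; Hive would execute an
# equivalent HiveQL query over HDFS data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")])

# The same GROUP BY aggregation would also be valid HiveQL.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # -> [('/cart', 1), ('/home', 2)]
```

The point is not the database engine but the interface: analysts can keep writing familiar SQL while the execution happens over distributed Hadoop data.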
Will Hadoop solve my problem?
It is important to understand that Hadoop is not a complete replacement for traditional enterprise Data Warehousing and Business Intelligence tools, but a complementary approach that solves some of their challenges.
Hadoop is best suited for processing very large volumes of structured and unstructured data. Some of the use cases for different industries are:
- Social Media Engagement and Clickstream Analysis (Web Industry): A clickstream is a record of the parts of the screen a user clicks on while browsing the web or using another software application. Clickstream analysis is useful for web activity analysis, customer behaviour analysis, software testing, market research, and even analyzing employee productivity.
- Content Optimization and Engagement (Media Industry): Content needs to be optimized for rendering on different devices that support different content formats. Media companies require large amounts of content to be processed into different formats, and content engagement models need to be mapped for feedback and enhancements.
- Network Analytics and Mediation (Telecommunication Industry): Telecommunication companies generate a large amount of data in the form of usage transaction data, network performance data, cell-site information, device-level data and other forms of back-office data. Real-time analytics plays a critical role in reducing OPEX and enhancing the user experience.
- Targeting and Product Recommendation (Retail Industry): Retail and e-commerce companies model data from different sources to target customers and provide product recommendations based on the end user's profile and usage patterns.
- Risk Analysis, Fraud Monitoring and Capital Market Analysis (BFSI Industry): Banking and finance sectors have large sets of structured and unstructured data generated by different sources, such as trading patterns in capital markets and consumer behaviour around banking services. Financial institutions use Big Data for risk analysis, fraud monitoring and tracking, capital market analysis, converged data management and more.
The list is long and specific to individual requirements, and the good news is that enterprises can profit from both structured and unstructured data.
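As a taste of the clickstream use case above, here is a tiny sessionization sketch: grouping a user's click events into sessions separated by gaps of inactivity. The field layout and the 30-minute session gap are assumptions for illustration; a real pipeline would run this kind of logic at scale on Hadoop.

```python
# Tiny clickstream-analysis sketch: sessionize click events per user.
# The (user, timestamp, page) layout and 30-minute gap are assumptions.
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds of inactivity that starts a new session

def sessionize(events):
    # events: iterable of (user_id, timestamp_seconds, page)
    by_user = defaultdict(list)
    for user, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, clicks in by_user.items():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, page) in zip(clicks, clicks[1:]):
            if ts - prev_ts > SESSION_GAP:
                sessions.append((user, current))  # close the session
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

events = [("u1", 0, "/home"), ("u1", 60, "/product"),
          ("u1", 60 + 3600, "/home"), ("u2", 10, "/home")]
print(sessionize(events))
# -> [('u1', ['/home', '/product']), ('u1', ['/home']), ('u2', ['/home'])]
```

Analyses like pages-per-session or drop-off points then become simple aggregations over the sessionized output.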
Which Hadoop distribution will fit our requirements best?
There are many open-source and paid Hadoop distributions available apart from Apache's own open-source distribution. Each Hadoop deployment implements some or all of the tools listed above, depending on project requirements.
Comparison of three major Hadoop Distributions
We compared the distributions on the following parameters:
- Technical details – Base Hadoop version, file system support, job scheduling support etc.
- Ease of Deployment – Availability of toolkits to manage deployment
- Ease of Maintenance – Cluster management and orchestration tools
- Cost – The cost of implementing a particular Hadoop distribution, its billing model and licenses
Based on this analysis, mapping your enterprise requirements against the above matrix makes it easier to decide which Hadoop distribution best suits your use case. Please note that the matrix compares only three major Hadoop distributions; there are many others in the market.
BlazeClan helps enterprises profit from large data sets by implementing the right Hadoop distribution.