Hadoop Gives way to Real Time Big Data Stream Processing – The 3 Key Attributes – Collect | Process | Analyze
- By Veeraj Thaploo
To understand the complete Big Data stream processing ecosystem, it helps to first understand what a data stream is, how it is generated, why it needs to be analyzed, and what makes it complex to analyze. With that understanding in place, we will dive into the different components of the Big Data stream processing ecosystem.
The Stock Market? The Retail Chain? Social Media? What do they all have in Common?
Yes, you guessed it: it’s big data. But hold on, it’s not JUST big data. It’s big data that is generated at an intense velocity and that requires real-time analysis for effective decision making. To do this we need to work with a data stream. A data stream is simply continuously generated data. It can come from application logs or events, or be collected from a large pool of devices that continuously generate events, such as ATMs, PoS devices, or our mobile devices while we play a multiplayer game with a million other logged-in users.
To make this clearer, here are five examples where data streams are generated in large volumes. I will refer back to these examples throughout this blog post:
1. Retail: Let us consider a large retail chain with thousands of stores across the globe. With multiple Point of Sale (PoS) machines in every store, thousands of transactions take place per second, and these transactions in turn generate millions of events. All of these events can be processed in real time, for example to adjust prices on the fly and promote products that are not selling well. The same analytics can be used in a number of other ways to promote products and track consumer behavior.
2. Data Center: Let us consider a large data center network deployment with hundreds of servers, switches, routers and other devices. The event logs from all these devices form a real-time stream of data. This data can be used to prevent failures in the data center and to automate triggers so that the complete data center is fault tolerant.
3. Banking: In this industry, thousands of customers use credit cards and millions of events are generated. If a fraudster is using or misusing a credit card, real-time alerts are needed. These continuously generated transactional events form a data stream. To detect fraud in real time, the data stream needs to be analyzed in real time.
4. Social Media: Let us now take the example of a social media website with millions of users navigating through its pages. To deliver relevant content, the site needs to understand how users navigate by logging and tracing their clicks. These clicks create a stream of data, which can be analyzed in real time to make real-time changes on the website and drive user engagement.
5. Stock Market: The data generated here is a stream in which a lot of events happen in real time. Stock prices vary continuously. These are large continuous data streams that need real-time analysis to support better trading decisions.
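To make the banking example a little more concrete, here is a minimal sketch of real-time analysis over a stream. It is only an illustration under my own assumptions: the function name `flag_bursts`, the 60-second window, the threshold of 3 and the sample transactions are all hypothetical, and a real fraud system would use far richer rules. The idea it shows is a sliding window: each event is examined as it arrives, and only the recent history per card is kept.

```python
from collections import defaultdict, deque

def flag_bursts(events, window_seconds=60, max_events=3):
    """Yield (timestamp, card_id) whenever a card has more than max_events
    transactions inside the trailing window_seconds."""
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card_id in events:
        window = recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the trailing window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield timestamp, card_id

# Hypothetical transactions: (seconds since start, card id), ordered by time.
transactions = [
    (0, "card-A"), (5, "card-B"), (10, "card-A"), (20, "card-A"),
    (30, "card-A"), (200, "card-A"),
]
alerts = list(flag_bursts(transactions))
print(alerts)  # [(30, 'card-A')]
```

Because the function is a generator, it never needs the whole stream in memory, which is the property that matters when the events never stop arriving.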
The 3 Stages | Collect | Process | Analyze | & the Tools Used in Each :
Big Data is about the Volume, Variety and Velocity of data. For a large Volume of unstructured (Variety) data, Hadoop is a good fit for processing and analysis. Real-time big data stream processing adds one more level of complexity: the data is generated at high Velocity and needs to be analyzed in real time, which batch-oriented Hadoop jobs are not designed for.
For real-time big data stream processing, the following three key attributes are required:
a) System to collect the big data generated in real time
b) System to handle the massive parallel processing of this data
c) Event correlation and Event Processing Engine for generating analytics
All of the above components need to be fault tolerant, scalable and distributed, with low latency at each stage. Different software is available for each component of the Big Data stream processing ecosystem.
Below is a simple data flow / processing work flow with key attributes for each process step.
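The same collect, process and analyze flow can be sketched in a few lines of Python. This is a toy, single-process illustration using the retail example, not a distributed system: the function names, the `store,product,quantity` line format and the threshold are assumptions I made up for the sketch.

```python
def collect(raw_lines):
    """Stage 1: turn raw 'store,product,quantity' lines into event dicts."""
    for line in raw_lines:
        store, product, quantity = line.strip().split(",")
        yield {"store": store, "product": product, "quantity": int(quantity)}

def process(events):
    """Stage 2: keep a running total of units sold per product."""
    totals = {}
    for event in events:
        product = event["product"]
        totals[product] = totals.get(product, 0) + event["quantity"]
        yield product, totals[product]

def analyze(running_totals, threshold=5):
    """Stage 3: report each product once, the first time its total reaches the threshold."""
    reported = set()
    for product, total in running_totals:
        if total >= threshold and product not in reported:
            reported.add(product)
            yield product

# Hypothetical PoS events arriving one line at a time.
raw_lines = ["s1,milk,2", "s2,bread,1", "s1,milk,3", "s3,milk,4", "s2,bread,1"]
hot_products = list(analyze(process(collect(raw_lines)), threshold=5))
print(hot_products)  # ['milk']
```

Each stage pulls one event at a time from the stage before it, so an event flows through all three stages before the next one is read. In a real deployment each stage would be a separate distributed system, as described in the stages that follow.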
Stage 1 – Collection using Data Pipes
Real-time data streams need to be listened to continuously. The listeners are scalable, flexible, fault-tolerant data pipes. Some robust data pipe solutions are:
• Kafka: LinkedIn developed Kafka for real-time message processing. Kafka is an open-source message broker written in Scala. It started in the Apache Incubator and is now a top-level Apache project. Kafka was primarily intended for tracking the various activity events generated on LinkedIn’s websites, such as pageviews, keywords typed in a search query, ads presented, etc. Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds.
• Apache Flume: Apache Flume is a distributed, fault-tolerant and available service for efficiently collecting, aggregating, and moving large amounts of log data from different applications to a central store. Flume has a simple modular architecture and an extensible data model that allows for online analytic applications.
• Suro: Netflix created its own data pipe, Suro. Suro is a distributed data pipeline that provides services for moving, aggregating, routing and storing data. Its design is focused on easy configuration and operation for multiple data sources.
• Facebook Scribe: Scribe is a server for aggregating log data streamed in real-time from a large number of servers. It is designed to be scalable, extensible without client-side modifications, and robust to failure of the network or any specific machine. Scribe was developed at Facebook using Apache Thrift and released in 2008 as open source.
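To give a feel for what a data pipe does, here is a minimal in-memory sketch. It is loosely modeled on the idea of an append-only log per topic that consumers read by offset, which is how Kafka organizes messages, but it is not the API of Kafka or of any other tool listed above. The class `DataPipe` and its methods are hypothetical names chosen for illustration.

```python
class DataPipe:
    """Toy in-memory pipe: one append-only log per topic, read by offset."""

    def __init__(self):
        self._logs = {}  # topic name -> list of messages in arrival order

    def publish(self, topic, message):
        """Append a message to the topic and return the offset it was stored at."""
        log = self._logs.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset, max_messages=10):
        """Return (messages starting at offset, offset to use for the next read)."""
        log = self._logs.get(topic, [])
        batch = log[offset:offset + max_messages]
        return batch, offset + len(batch)

pipe = DataPipe()
for page in ["/home", "/search", "/cart"]:
    pipe.publish("pageviews", page)

# A consumer tracks its own offset, so it can resume where it left off.
first_batch, next_offset = pipe.read("pageviews", 0, max_messages=2)
second_batch, next_offset = pipe.read("pageviews", next_offset)
print(first_batch, second_batch, next_offset)
```

Because the pipe never deletes messages on read, several consumers can read the same topic independently, each with its own offset. The real tools add what this toy leaves out: persistence, replication and distribution across machines.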
Stage 2 – In-Stream Data Processing
The next step in the data workflow is in-stream data processing. Some of the in-stream data processing solutions are:
• Storm: Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and his team at BackType, the project was open sourced after BackType was acquired by Twitter. It uses custom-built “spouts” and “bolts” to define information sources and manipulations, allowing distributed processing of streaming data.
• AWS Kinesis: Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from different sources.
• S4: S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
• Samza: Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.
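The spout and bolt idea mentioned under Storm can be illustrated with plain Python generators. To be clear, this is not Storm's API and it runs in a single process. It only mimics the shape of a topology: a source of tuples (the spout) feeding a chain of transformations (the bolts). All function names and the sample sentences are hypothetical.

```python
from collections import Counter

def sentence_spout(sentences):
    """Source of the stream: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence

def split_bolt(stream):
    """Transformation: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word

def count_bolt(stream):
    """Transformation: emits (word, running count) after every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

sentences = ["big data stream", "data stream processing"]
updates = list(count_bolt(split_bolt(sentence_spout(sentences))))
print(updates)
```

The output is a running count that updates as each word arrives rather than a single total at the end, which is the main difference from a batch word count. A real framework would run many copies of each bolt in parallel across machines and handle failures for you.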
Stage 3 – Event Correlation/Processing to Generate Analytics
The last step in big data stream processing is storing the processed data in a destination system, where it can serve as input to analytics applications or be processed further. Typical destinations are:
• Input to Applications: Let us revisit example 1 (Retail) and example 3 (Banking). In example 1, price re-adjustment triggers can be sent to a pricing engine / application to update the price. Similarly, in example 3, credit card fraud detection can be linked to an application that triggers blocking of the credit card.
• NoSQL Database: A NoSQL database such as MongoDB, or a data warehouse such as Amazon Redshift (which is SQL-based rather than NoSQL), can act as a store for event processing results. Let us go back to example 4 (Social Media), where the website clickstream processing results can be stored in the database to improve the user experience. The stored results can also be used for business intelligence and analytics.
• Hadoop System: Sometimes streams need to be processed further at a later stage, with no real-time requirement. In that case the stream processing results can be fed to Hadoop systems for further data processing.
• RDBMS: Stream processing results are typically put in relational database systems for integration with BI tools and custom reports.
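As a small illustration of the RDBMS destination, here is a sketch that writes stream processing results into a relational table and then queries them the way a report would. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for a production RDBMS. The table name `page_clicks`, the function `store_results` and the click counts are hypothetical.

```python
import sqlite3

def store_results(connection, results):
    """Upsert (page, clicks) rows produced by the stream processing stage."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS page_clicks (page TEXT PRIMARY KEY, clicks INTEGER)"
    )
    # INSERT OR REPLACE overwrites the row when the page already exists.
    connection.executemany(
        "INSERT OR REPLACE INTO page_clicks (page, clicks) VALUES (?, ?)",
        results,
    )
    connection.commit()

connection = sqlite3.connect(":memory:")
store_results(connection, [("/home", 120), ("/cart", 45)])
store_results(connection, [("/cart", 50)])  # a later update from the stream

top_pages = connection.execute(
    "SELECT page, clicks FROM page_clicks ORDER BY clicks DESC"
).fetchall()
print(top_pages)  # [('/home', 120), ('/cart', 50)]
```

Keeping one row per page and overwriting it on each update means the table always holds the latest aggregate, which is usually what a BI dashboard wants to read.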
In the next couple of blog posts I will take you through a comparison of the data pipe and in-stream processing solutions. Watch this space, and subscribe to our blog for more interesting cloud bytes!