Tag Archives: Real Time Analytics

IoT, Big Data and Machine Learning Push Cloud Computing To The Edge

“The End Of Cloud Computing” – that’s the dramatic title of a talk given by Peter Levine at the a16z Summit last month. Levine, a partner at the Andreessen Horowitz (a16z) VC fund, exercised his investor foresight and tried to imagine the world beyond cloud computing. The result was an insightful and fluent talk, arguing that centralized cloud computing as we know it is about to be superseded by a distributed cloud inherent in a multitude of edge devices. Levine highlights the rising forces driving this change:

The Internet of Things (IoT). Though the notion of IoT has been around for a few decades, it seems it is really taking hold now, and our world will soon be inhabited by a multitude of smart cars, smart homes and smart everything, each with embedded compute, storage and networking. Levine gives a great example of a computer card found in today’s luxury cars, containing around 100 CPUs. Having several such cards in a car makes it a mini data center on wheels; having thousands of such cars on the roads makes a massive distributed data center.

[Image: compute card from a modern smart car]

Big Data Analytics. The growing number of connected devices and sensors around us, constantly collecting real-world input, generates massive amounts of data of different types, from temperature and pressure to images and videos. This unstructured and highly variable data stream needs to be processed and analyzed in real time, so that the little brains of the smart devices can extract insights and make decisions. Just imagine your smart car approaching a stop sign: it needs to process the image input, recognize the sign and make the decision to stop, all within a second or less. Would you send that over to a remote cloud for the answer?

Machine Learning. While traditional computer algorithms are well suited to well-defined problem spaces, real-world data is complex, diverse and unstructured. Levine believes that endpoints will need to execute Machine Learning algorithms to decipher this data effectively and to derive intelligent insights and decisions for the countless permutations of situations that can occur in the real world.

So should Amazon, Microsoft and Google start worrying? Not really. The central cloud services will still be there, but with a different focus. Levine sees the central cloud’s role as curating data from the edge, performing central non-real-time learning that can then be pushed back to the edge, and providing long-term storage and archiving of the data. In this new incarnation, the entire world becomes the domain of IT.

You can watch the recording of Levine’s full talk here.

Follow Horovits on Twitter!


Filed under Big Data, Cloud, IoT

Oracle Boosts Its Big Data Offering

Oracle, the proverbial SQL icon, knows it cannot ignore the big data trend, nor does it attempt to. On the contrary: Oracle has been promoting big data offerings, both in-house and via acquisitions such as the recent Datalogix acquisition. Yesterday Oracle announced an important milestone with the release of a suite of big data solutions.

In modern organizations data comes from multiple and diverse sources, both structured and unstructured, and across various technologies (e.g. Hadoop, relational databases). Advanced analytics requirements, such as real-time counting analytics alongside historical trend analytics, further necessitate combining the different technologies, resulting in highly complex solutions. Oracle identifies this complexity and offers native integration of its big data solutions with Oracle’s SQL relational databases, with one uniform façade for analytics.

While Hadoop’s advantages are well known, it is still non-trivial for analysts and data scientists to extract analytics and gain insights from it. Oracle Big Data Discovery comes to address this, providing a “visual face of Hadoop”.

Oracle also extends its GoldenGate replication platform with the release of Oracle GoldenGate for Big Data, which can replicate unstructured data to the Hadoop stack (Hive, HBase, Flume). Another aspect of the uniform façade is Oracle SQL: with Oracle Big Data SQL 1.1, queries can now transparently access data in Hadoop, NoSQL and Oracle Database.

Oracle’s strategy is to leverage its brand and existing SQL installations within enterprises, offering enterprise-grade versions of the popular open-source tools with native integration into the Oracle SQL databases those enterprises already run. It remains to be seen how this catches on with enterprises against the hype of the popular big data vendors and open-source projects.

Follow Dotan on Twitter!


Filed under Big Data

Analytics for Big Data – Venturing with the Twitter Use Case

Performing analytics on Big Data is a hot topic these days. Many organizations have realized that they can gain valuable insight from the data that flows through their systems, both in real time and by researching historical data. Imagine what Google, Facebook or Twitter can learn from the data that flows through their systems. And indeed the boom of real-time analytics is here: Google Analytics, Facebook Social Analytics, Twitter paid tweet analytics and many others, including start-ups that specialize in real-time analytics.

But the challenge of analyzing such huge volumes of data is enormous. Mining Terabytes and Petabytes of data is no simple task, and one that traditional databases cannot meet, which drove the invention of new big data and NoSQL technologies such as Hadoop, Cassandra and MongoDB. Analyzing the data in real time is yet another challenging task when dealing with very high throughputs. For example, according to Twitter’s statistics, 177 million tweets were sent on March 11, 2011, an average of roughly 2,000 tweets per second, with peaks far higher. Now, analyzing this stream, that’s a challenge!

Standing up to the challenge

When discussing the challenge of real-time analytics on Big Data, I oftentimes use the Twitter use case as an example to illustrate the challenges. People find it easy to relate to this use case and to appreciate the volumes and challenges of such a system. A few weeks ago, when planning a workshop on real-time analytics for Big Data, I was challenged to take up this use case and design a proof of concept that meets the challenge of analyzing real tweets from real-time feeds. Well, challenge is my name, and solution architecture is my game…

Note that this is by no means a complete solution, but more of a thought exercise, and I’d like to share these thoughts with you, as well as the code itself, which is shared on GitHub (see the link at the bottom of the post). I hope this will be only the starting point of a joint community effort to turn it into a complete reference example. So let’s have a look at what I sketched up.

What kind of analytics?

First of all, we need to see what kinds of analytics are of interest in such a use case. Looking at Twitter analytics, we see various types that I group into 3 categories:

Counting: real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

Correlation: near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

Research: more in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

When approaching the architecture of a solution that covers all of the above types of analytics, we need to recognize the different nature of real-time vs. historical analysis, leverage appropriate technologies to meet each challenge on its own ground, and then combine those technologies into a single harmonious solution. I’ll get back to that point…

In my sample application I wanted to listen on Twitter’s public timeline, perform some sample real-time analytics of word counting, prepare the data for research analytics, and combine it all into a single solution that handles all aspects.

[Diagram: real-time analytics for Big Data, high-level architecture]

Feeding in a hefty stream of tweets

I chose to listen on the Twitter public timeline (so I can share the application with you without having to give you my user/password).

For integration with Twitter I chose Spring Social, which offers a built-in Twitter connector integrating with Twitter’s REST API. I wanted to integrate with Twitter’s Streaming API, but unfortunately it appears that Spring Social does not currently support that API, so I settled for repetitive calls to the regular API.

A feeder took in the tweets, converted them ETL-style to a canonical Document-oriented representation, whose semi-structured nature makes it ideal for the evolving structure of tweets, and wrote them to an in-memory data grid (IMDG) on the server side.

The design needs to accommodate a very high throughput of around 10k tweets/sec, with latency of milliseconds. To that end I chose to implement the feeder as an independent processing unit in GigaSpaces XAP, so that I can cope with the write-scalability requirement simply by adding more parallel feeders to handle the stream. Since the feeder is a stateless service, scaling out by adding instances is relatively easy. Trying to do the same with stateful services will prove to be much more challenging, as we’re about to find out…
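
To make the feeder’s write path concrete, here is a minimal sketch assuming GigaSpaces XAP’s document API; the “Tweet” type name, the field names and the state property are my own illustrative assumptions (and registering the type descriptor is assumed to happen elsewhere), not code from the actual project:

```java
import org.openspaces.core.GigaSpace;
import com.gigaspaces.document.SpaceDocument;

public class TweetFeeder {

    private final GigaSpace gigaSpace; // injected by Spring / the XAP processing unit container

    public TweetFeeder(GigaSpace gigaSpace) {
        this.gigaSpace = gigaSpace;
    }

    /** Converts a raw tweet (ETL-style) into a semi-structured document and writes it to the grid. */
    public void feed(long tweetId, String user, String text, long createdAt) {
        SpaceDocument doc = new SpaceDocument("Tweet")   // "Tweet" is a hypothetical document type
                .setProperty("id", tweetId)
                .setProperty("user", user)
                .setProperty("text", text)
                .setProperty("createdAt", createdAt)
                .setProperty("state", "RAW");            // downstream processors advance this state
        gigaSpace.write(doc);
    }
}
```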

Let’s pick the brains of my accumulated tweets

On the server side, I wanted to store the tweets and prepare them for batch-mode historical research. For the same reason of semi-structured data, I chose a Document-oriented database to store the tweets as well. In this case, I chose the open-source Apache Cassandra, which has become a prominent NoSQL database, in use by Twitter itself as well as by many other companies. I used the Hector client API to interact with Cassandra.

To avoid tight coupling of my application code with Cassandra, I followed the Inversion of Control (IoC) principle and created an abstraction layer for the persistent store, where Cassandra is just one implementation, and provided another implementation, for testing purposes, that persists to the local file system. Leveraging Spring Framework’s wiring capabilities (see below), switching between implementations becomes a configuration choice, with no code changes.
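
Such an abstraction can be as small as a single interface with pluggable implementations. The sketch below is illustrative only (the interface and class names are mine, not the repository’s), with the Cassandra/Hector implementation left as a stub:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Persistent-store abstraction: the application only ever talks to this interface. */
public interface TweetPersister {
    void persist(String tweetId, String tweetJson);
}

/** Test implementation: appends tweets to a local file, one JSON document per line. */
class FileTweetPersister implements TweetPersister {
    private final Path file = Paths.get("tweets.log");

    @Override
    public void persist(String tweetId, String tweetJson) {
        try {
            Files.write(file, (tweetJson + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException("Failed persisting tweet " + tweetId, e);
        }
    }
}

/** Production implementation would wrap Hector and write to a Cassandra column family. */
class CassandraTweetPersister implements TweetPersister {
    @Override
    public void persist(String tweetId, String tweetJson) {
        // Hector-based insert into the tweets column family goes here.
        throw new UnsupportedOperationException("sketch only");
    }
}
```

Which implementation is active is then decided purely by the Spring wiring described next.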

Easy configuration

For easy configuration and wiring I used the Spring Framework, leveraging its wiring capabilities as well as property injection and parameter configuration. Using these features I made the application highly configurable, exposing the Twitter connection details, buffer sizes, thread pool sizes, etc. This means that the application can be tuned to any load, depending on the hardware, network and JVM specifications.
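
As an illustration of that configuration style, here is a hedged sketch using Java-based Spring configuration; the property names, defaults and the feeder.properties file are made up for the example (the actual project may well use XML wiring instead):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;

@Configuration
@PropertySource("classpath:feeder.properties")  // hypothetical properties file
public class FeederConfig {

    @Value("${twitter.poll.intervalMs:5000}")   // how often to poll the regular Twitter API
    private long pollIntervalMs;

    @Value("${feeder.batchSize:100}")           // tweets written to the grid per batch
    private int batchSize;

    @Value("${feeder.threadPoolSize:4}")        // parallelism of a single feeder instance
    private int threadPoolSize;

    // Required so that ${...} placeholders in @Value annotations are resolved.
    @Bean
    public static PropertySourcesPlaceholderConfigurer properties() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
```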

What can I learn from the tweet stream on the fly?

In addition to persisting the data, I also wanted to perform some sample on-the-fly real-time analytics on it. For this experiment I chose word counting (or, more precisely, token counting, as a token can also be an expression or a combination of symbols).

At first glance you may think this is a simple task to implement, but when facing the volumes and throughput of Twitter you quickly realize that you need a highly scalable and distributed architecture, one that uses appropriate technology and takes into account data sharding and consistency, data-model de-normalization, processing reliability, message locality, throughput balancing to avoid backlog build-up, etc.

Processing workflow

I chose to meet these challenges by employing Event-Driven Architecture (EDA) and designed a workflow that runs through the different stages of the processing (parsing, filtering, persisting, local counting, global aggregation, etc.), where each stage of the workflow is a processor. To meet the above challenges of throughput, backlog build-up, distribution, etc., I designed the processors with the following characteristics:

  1. Each processor has a thread pool (of a configurable size) to enable concurrent processing.
  2. Each processor thread can process events in batches (of a configurable size) to balance between input and output streams and avoid backlog build-up.
  3. Processors are co-located with the (sharded) data, so that most of the data processing is performed locally, within the same JVM, avoiding distributed transactions, network, and serialization overhead.

The overall workflow looks as follows:

For the implementation of the workflow and the processors I chose the XAP Polling Container, which runs inside the in-memory data grid, co-located with the data, and enables easy implementation of the above characteristics.

The events that drive the workflow are simple POJOs: I listen on them, and their state changes trigger the events. This is a very useful characteristic of the XAP platform, which saved me the need to generate message objects and place them in a message broker.
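
Below is a minimal sketch in the spirit of XAP’s annotation-driven polling containers; the TweetEvent POJO, its states and the tokenization logic are hypothetical placeholders rather than the project’s actual code:

```java
import com.gigaspaces.annotation.pojo.SpaceId;
import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;

/** Hypothetical space POJO carrying a tweet through the workflow stages. */
class TweetEvent {
    private String id;
    private String text;
    private String[] tokens;
    private String state;

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String[] getTokens() { return tokens; }
    public void setTokens(String[] tokens) { this.tokens = tokens; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
}

@EventDriven
@Polling  // concurrency and batching are tuned via the annotation attributes / Spring config
public class TokenizerProcessor {

    /** Template matching the events this stage consumes: raw, unprocessed tweets. */
    @EventTemplate
    public TweetEvent unprocessedTweets() {
        TweetEvent template = new TweetEvent();
        template.setState("RAW");
        return template;
    }

    /** Invoked for each matching event taken from the space, co-located with its data partition. */
    @SpaceDataEvent
    public TweetEvent tokenize(TweetEvent event) {
        event.setTokens(event.getText().toLowerCase().split("\\s+")); // naive tokenization
        event.setState("TOKENIZED");  // the state change makes the next stage's template match
        return event;                 // the returned event is written back to the space
    }
}
```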

Atomic counters

For the implementation of atomic counters I used XAP’s Map API, which allows using the in-memory data grid as a transactional key-value store, where the key is the token and the value is the count, and where each entry can be locked individually to achieve atomic updates, very similarly to a ConcurrentHashMap.
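
As a single-JVM analogue of that pattern (in the actual design each counter entry lives in the data grid and is locked per key, but the semantics are close to ConcurrentHashMap’s per-entry atomicity):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Single-JVM analogue of the grid-backed token counters: one atomic counter per token. */
public class TokenCounter {

    private final ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();

    /** Atomically increments the counter for the given token (read-modify-write per key). */
    public void increment(String token) {
        counts.merge(token, 1L, Long::sum);
    }

    /** Returns the current count for the token, or 0 if it has never been seen. */
    public long count(String token) {
        return counts.getOrDefault(token, 0L);
    }
}
```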

Making it all play together in harmony

So we have a deployment that incorporates a feeder, a processor and a Cassandra persistent store, each such service having multiple instances and needing to scale in/out dynamically based on demand. Designing my solution for the real deal, I’m about to face tens to hundreds of instances of each service. Manual deployment or scripting will not be manageable, not to mention the automatic scaling of each service, monitoring and troubleshooting. How do I manage all of that automatically, as a single cohesive solution?

For that I used GigaSpaces Cloudify, which allows me to integrate any stack of services by writing Groovy-based recipes that declaratively describe the lifecycle of the application and its services.

I can then deploy and manage the end-to-end application using the CLI and the Web Console.

Conclusion

This was a thought exercise on real-time analytics for big data. I used the Twitter use case because I wanted to aim high on the big data challenge, and, well, you can’t get much bigger than that.

The end-to-end solution included a clustered Cassandra NoSQL database for the elaborate batch analytics of the historical data, the GigaSpaces XAP platform for a distributed In-Memory Data Grid with co-located real-time processing, Spring Social for feeding in tweets from Twitter, the Spring Framework for configuration and wiring, and GigaSpaces Cloudify for deployment, management and monitoring. I used an event-driven architecture with semi-structured Documents, POJOs, atomic counters, and write-behind eviction.

This is just the beginning. My design hardly utilized the capabilities of the chosen technology stack, and it barely scratched the surface of the analytics you can perform on Twitter. Imagine, for example, what it would take to calculate not just real-time word counts but also the reach of tweets, as done by the tweetreach service.

This project is just the starting point. I would like to share it with you and invite you to stand up to the challenge with me, and together make it into a complete reference solution for a real-time analytics architecture for big data.

The project is found on GitHub under https://github.com/dotanh/rt-analytics.

You’re welcome to contribute!

Follow Dotan on Twitter!


Filed under Big Data, Real Time Analytics, Solution Architecture

Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

Recently I held a webinar on architecting scalable, near-real-time risk analysis solutions, based on the experience gathered with our Financial Services customers. In the webinar I also had the honor of hosting Mr. Larry Mitchel, a leading expert in the Financial Services industry, who provided background on the Risk Management domain. Following the general interest in the webinar, I decided to dedicate a post to the subject.

What goes on in the Risk Management domain?

The Finance world continually undergoes changes, driven for the most part by the lessons learned from the 2008 financial crash, in an attempt to prevent such catastrophes from recurring. Regulations such as Dodd-Frank, EMIR and Basel III have further formalized this shift, imposing tighter control and supervision. We see financial institutions addressing these conformance goals by assigning dedicated projects with dedicated budgets (which means more work for solution architects, lucky me). One aspect of this conformance is reducing risk by shortening settlement cycles to near-real-time, as seen in initiatives such as Straight-Through Processing.

Traditional architectures, new challenges

Conforming to the new regulations mandates an entirely different approach to risk analysis. The old systems, which relied on overnight batch risk calculations and predefined queries, can no longer suffice; a more real-time approach to risk calculation, with on-the-fly queries, is needed.

From a solution architecture point of view, risk analysis is a compute-intensive and data-intensive process. Looking at our customers’ systems, we see ever-increasing volumes (number of calculated positions and assets, number of re-calculations, data affinity, etc.), and on the other hand an ever-increasing demand to reduce response time, whether to conform with the regulations or for a competitive edge. That makes it a classic Big Data analytics problem.

From a technology point of view, risk analysis solutions traditionally relied on dedicated compute grid products for the calculations and on relational databases as the data store. That was fine for overnight batch processing, but with the introduction of the new real-time demands, the databases tend to become bottlenecks under load, due to disk and network constraints.

Risk Analysis solution architecture revisited

Our experience with such solutions shows that the effective architecture to meet these challenges is a multi-tiered Big Data architecture, in which intraday data is cached in-memory for low-latency response, while historical data is kept in a database for more extensive data mining and reporting. Simple caching solutions cannot provide the required scalability for the intraday data under such write-intensive flows (streaming market data, calculation results, and so on), and it is therefore the In-Memory Data Grid that has become the standard technology in modern solutions for storing intraday data. Intelligent data grids such as GigaSpaces XAP also provide on-the-fly SQL querying capabilities, which overcome the limitation of predefined queries in traditional architectures. As for historical data, we see a clear shift from relational databases to NoSQL databases, which perform much better for mining these volumes of semi-structured data.
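
As a hedged sketch of that tiering, the facade below routes same-day queries to the grid and older dates to the historical store; the interface, class names and routing rule are illustrative assumptions, not a specific product API:

```java
import java.time.LocalDate;
import java.util.List;

/** Minimal position representation, just enough for the sketch. */
class Position {
    final String desk;
    final String asset;
    final double exposure;
    final LocalDate valueDate;

    Position(String desk, String asset, double exposure, LocalDate valueDate) {
        this.desk = desk;
        this.asset = asset;
        this.exposure = exposure;
        this.valueDate = valueDate;
    }
}

/** The facade clients query; they never care which tier actually serves the data. */
interface RiskDataStore {
    List<Position> positions(String desk, LocalDate valueDate);
}

/** Routes intraday (same-day) queries to the in-memory data grid, older dates to the historical store. */
public class TieredRiskDataStore implements RiskDataStore {

    private final RiskDataStore intradayGrid;     // backed by the IMDG
    private final RiskDataStore historicalStore;  // backed by the NoSQL/relational store

    public TieredRiskDataStore(RiskDataStore intradayGrid, RiskDataStore historicalStore) {
        this.intradayGrid = intradayGrid;
        this.historicalStore = historicalStore;
    }

    @Override
    public List<Position> positions(String desk, LocalDate valueDate) {
        return valueDate.isEqual(LocalDate.now())
                ? intradayGrid.positions(desk, valueDate)
                : historicalStore.positions(desk, valueDate);
    }
}
```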

A piece of the architecture that is often overlooked in initial architecture discussions is the system orchestration. Surprisingly, many of the customers I visit tend to think of risk analysis solutions as the mere sum of a Compute Grid product (for computation scalability) and a Data Grid product (for data scalability). But they neglect to consider the orchestration logic that handles the intersection between the data grid and the compute grid: avoiding duplicate calculations, handling cancellation of calculations, monitoring the state of ongoing calculations, feeding ticks and updates to the client UI, and more. All this amounts to a significant orchestration layer that is traditionally developed in-house.
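
To give a flavor of what that layer does, here is a tiny, generic sketch of one of its concerns, deduplicating and cancelling in-flight calculations; a plain ExecutorService stands in for the compute grid client, and all names are mine:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Placeholder for a risk calculation result. */
class RiskResult { }

/** Deduplicates submissions of the same calculation and supports cancelling in-flight ones. */
public class CalculationOrchestrator {

    private final ConcurrentMap<String, Future<RiskResult>> inFlight = new ConcurrentHashMap<>();
    private final ExecutorService computeGrid;  // stand-in for the real compute grid client

    public CalculationOrchestrator(ExecutorService computeGrid) {
        this.computeGrid = computeGrid;
    }

    /** Submits a calculation once per key; concurrent callers for the same key share the Future. */
    public Future<RiskResult> submit(String calcKey, Callable<RiskResult> calculation) {
        // A fuller implementation would also evict completed entries from the map.
        return inFlight.computeIfAbsent(calcKey, key -> computeGrid.submit(calculation));
    }

    /** Cancels an in-flight calculation, if it is still tracked. */
    public void cancel(String calcKey) {
        Future<RiskResult> future = inFlight.remove(calcKey);
        if (future != null) {
            future.cancel(true);
        }
    }
}
```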

A much more effective architecture is to embed the orchestration logic together with the data grid within one platform, thereby abstracting the complexities from the clients and removing the need for the clients to interact with anything but the unified platform. GigaSpaces XAP offers the co-location of processing and messaging together with the data, which makes implementing such architectures quite easy. This also enables pre- and post-processing of the data, such as data formatting prior to processing and result aggregation after calculations, requirements often seen in such solutions.

Event-Driven Architecture is highly useful for streaming calculation results to the awaiting clients as they arrive, and for streaming ticks and other updates to the UI. Using GigaSpaces XAP, the implementation of such an architecture is made simple by leveraging the asynchronous API and the messaging layer, which can treat each data mutation as an event.

Addressing the real-time analytics challenge across the end-to-end Big Data architecture, covering both the intraday data (which resides in-memory within the data grid) and the historical data (which resides in a relational/NoSQL database), requires a holistic view of the multi-tier architecture. Intraday data changes at an extremely high rate with frequent event feeds, whereas historical data can be written in a more relaxed manner, using a write-behind (write-back) caching approach, and queries can be consolidated across the data stores, making them appear as one unified source for query purposes. Such consolidation is traditionally achieved by combining the various products, but GigaSpaces offers a Real-Time Analytics solution, enabling you to focus on your business logic and leave the rest to the platform.
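
To illustrate the write-behind idea in isolation, here is a generic sketch (not GigaSpaces’ mirror/persistency API): grid-side writers return immediately while a background task flushes batches to the historical store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Generic write-behind buffer: writers return immediately; a background task flushes batches. */
public class WriteBehindPersister<T> implements AutoCloseable {

    private final BlockingQueue<T> pending = new LinkedBlockingQueue<>();
    private final Consumer<List<T>> batchWriter;  // e.g. a bulk insert into the historical store
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public WriteBehindPersister(Consumer<List<T>> batchWriter, long flushPeriodMillis) {
        this.batchWriter = batchWriter;
        flusher.scheduleAtFixedRate(this::flush, flushPeriodMillis, flushPeriodMillis, TimeUnit.MILLISECONDS);
    }

    /** Called on the hot path (e.g. on a data grid update); only enqueues, never blocks on I/O. */
    public void write(T entry) {
        pending.offer(entry);
    }

    private void flush() {
        List<T> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (!batch.isEmpty()) {
            batchWriter.accept(batch);
        }
    }

    @Override
    public void close() {
        flusher.shutdown();
        flush();  // drain whatever is left
    }
}
```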

Future directions

There’s more to discuss in such architectures, such as multi-site deployments over WAN, support for cloud bursting, and more, all of which should be considered when approaching such solutions. I will not get into these concerns in this post, but you can see coverage of future directions in my webinar.

To get more information on the domain and its challenges, and to hear more about the suggested architecture for Big Data risk analysis solutions, I’d recommend watching the full webinar.

 

Follow Dotan on Twitter!


Filed under Big Data, Financial Services, Market Analytics, Real Time Analytics, Risk Management