Tag Archives: NoSQL

Oracle Boosts Its Big Data Offering

Oracle, the proverbial SQL icon, knows it cannot ignore the trend of big data, nor does it attempt to. On the contrary. Oracle has been promoting big data offerings, both in-house and via acquisitions such as the recent purchase of Datalogix. Yesterday Oracle announced an important milestone with the release of a suite of big data solutions.

In modern organizations data comes from multiple and diverse sources, both structured and unstructured, and across various technologies (e.g. Hadoop, relational databases). Advanced analytics requirements, such as real-time counting alongside historical trend analysis, further necessitate mixing these technologies, resulting in highly complex solutions. Oracle recognizes this complexity and offers native integration of its big data solutions with its SQL relational databases, behind one uniform façade for analytics.

While Hadoop’s advantages are known, it is still non-trivial for analysts and data scientists to extract analytics and gain insights. Oracle Big Data Discovery addresses this, providing a “visual face of Hadoop”.

Oracle also extends its GoldenGate replication platform with the release of Oracle GoldenGate for Big Data, which can replicate unstructured data to the Hadoop stack (Hive, HBase, Flume). Another aspect of the uniform façade is Oracle SQL: with Oracle Big Data SQL 1.1, queries can now transparently access data in Hadoop, NoSQL and Oracle Database.

Oracle’s strategy is to leverage its brand and existing SQL installations within enterprises, offering them enterprise-grade versions of the popular open-source tools with native integration into the Oracle SQL databases already deployed in those enterprises. It remains to be seen how this catches on with enterprises against the hype of the popular big data vendors and open-source projects.

Follow Dotan on Twitter!



Filed under Big Data

Facebook’s Big Data Analytics Boosts Search Capabilities

A few days ago Facebook announced its new search capabilities: Google-like search over your own history, the feature that was the crown jewel of Google+, Google’s attempt to fight off Facebook. Want to find that funny thing you posted when you took the ice bucket challenge a few months ago? It’s now easier than ever. And it’s now supported on your phone as well.

[Image: Facebook search for the ice bucket challenge]

You may think this is a simple (yet highly useful) feature. But when you come to think of it, this is quite a challenge, considering the 1.3 billion active users generating millions of events per second. The likes of Facebook, Google and Twitter cannot settle for the traditional processing capabilities, and need to develop innovative ways for stream processing at high volume.

A challenge just as big is encountered with queries: Facebook’s big data stores process tens of petabytes and hundreds of thousands of queries per day. Serving such volumes while keeping most response times under 1 second is hardly the type of challenge traditional databases were built for.

These challenges called for an innovative approach. For example, Facebook’s Data Infrastructure Team was the one to develop and open-source Hive, the popular Hadoop-based software framework for Big Data queries. Facebook also took an innovative approach in building its data centers, both in the design of the servers and in its next-gen networking, designed to meet the high and constantly increasing traffic volumes within their data centers.


Facebook is taking its data challenge very seriously, investing in internal research as well as in collaboration with academia and the open-source community. In a data faculty summit hosted by Facebook a few months ago, Facebook shared its top open data problems. It raised many interesting challenges in managing Small Data, Big Data and related hardware. With the announced release of Facebook Search for mobile, I was reminded in particular of the challenges raised at that summit on adapting Facebook’s systems to the mobile realm, where the network is flaky, much of the content is pre-fetched instead of pulled on demand, and privacy checks need to happen much earlier in the process. The recent release may indicate new solutions to these challenges; I’m looking forward to hearing some insights from the technical team.

Facebook, Twitter and the like face the Big Data challenges early on. As I said before:

These volumes challenge the traditional paradigms and trigger innovative approaches. I would keep a close eye on Facebook as a case study for the challenges we’d all face very soon.

 

Follow Dotan on Twitter!


Filed under Big Data, Real Time Analytics, Solution Architecture

Analytics for Big Data – Venturing with the Twitter Use Case

Performing analytics on Big Data is a hot topic these days. Many organizations realized that they can gain valuable insight from the data that flows through their systems, both in real time and in researching historical data. Imagine what Google, Facebook or Twitter can learn from the data that flows through their systems. And indeed the boom of real time analytics is here: Google Analytics, Facebook Social Analytics, Twitter paid tweet analytics and many others, including start-ups that specialize in real time analytics.

But the challenge of analyzing such huge volumes of data is enormous. Mining Terabytes and Petabytes of data is no simple task, and one that traditional databases cannot meet, which drove the invention of new NoSQL and big data technologies such as Hadoop, Cassandra and MongoDB. Analyzing data in real time is yet another challenging task when dealing with very high throughputs. For example, according to Twitter’s statistics, the number of tweets sent on March 11, 2011 was 177 million! Now, analyzing this stream, that’s a challenge!

Standing up to the challenge

When discussing the challenge of real time analytics on Big Data, I oftentimes use the Twitter use case as an example to illustrate the challenges. People find it easy to relate to this use case and appreciate the volumes and challenges of such a system. A few weeks ago, when planning a workshop on real time analytics for Big Data, I was challenged to take up this use case and design a proof of concept that stands up to the challenge of analyzing real tweets from real-time feeds. Well, challenge is my name, and solution architecture is my game…

Note that it is by no means a complete solution, but more of a thought exercise, and I’d like to share these thoughts with you, as well as the code itself, which is shared on GitHub (see the link at the bottom of the post). I hope this will only be the starting point of a joint community effort to make it into a complete reference example. So let’s have a look at what I sketched up.

What kind of analytics?

First of all, we need to see what kinds of analytics are of interest in such a use case. Looking at Twitter analytics, we see various types of analytics that I group into three categories:

Counting: real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

Correlation: near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

Research: more in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

When approaching the architecture of a solution that covers all of the above types of analytics, we need to realize the different nature of real-time vs. historical analysis, leverage appropriate technologies to meet each challenge on its own ground, and then combine the technologies into a single harmonious solution. I’ll get back to that point…

In my sample application I wanted to listen on the public timeline of Twitter, perform some sample real-time word-counting analytics, prepare the data for research analytics, and combine it all into a single solution that handles all aspects.

[Diagram: RT analytics for Big Data]

Feeding in a hefty stream of tweets

I chose to listen on the Twitter public timeline (so I can share the application with you without having to give you my user/password).

For integration with Twitter I chose to use Spring Social, which offers a built-in Twitter connector integrating with Twitter’s REST API. I wanted to integrate with Twitter’s Streaming API, but unfortunately it appears that currently Spring Social does not support this API, so I settled for repetitive calls to the regular API.
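To make the feeding step concrete, here is a minimal sketch of such a polling feeder using Spring Social’s Twitter support. The search term, the credentials and the TweetWriter sink are illustrative assumptions rather than the project’s actual names, and depending on the Spring Social version, application credentials may be required even for read-only calls.

```java
import java.util.List;

import org.springframework.social.twitter.api.SearchResults;
import org.springframework.social.twitter.api.Tweet;
import org.springframework.social.twitter.api.Twitter;
import org.springframework.social.twitter.api.impl.TwitterTemplate;

/**
 * Minimal feeder loop: poll Twitter's REST search API via Spring Social
 * and hand each tweet to a (hypothetical) downstream writer.
 */
public class TwitterFeeder {

    private final Twitter twitter;

    public TwitterFeeder(String consumerKey, String consumerSecret) {
        // Application (app-only) authentication is enough for read-only search calls.
        this.twitter = new TwitterTemplate(consumerKey, consumerSecret);
    }

    public void pollOnce(TweetWriter writer) {
        // Repetitive REST calls stand in for the Streaming API mentioned in the post.
        SearchResults results = twitter.searchOperations().search("#bigdata");
        List<Tweet> tweets = results.getTweets();
        for (Tweet tweet : tweets) {
            writer.write(tweet); // TweetWriter is a hypothetical abstraction over the IMDG
        }
    }

    /** Hypothetical sink interface; the real project writes to the in-memory data grid. */
    public interface TweetWriter {
        void write(Tweet tweet);
    }
}
```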

A feeder took in the tweets and converted them, ETL style, to a canonical Document-oriented representation, whose semi-structured nature makes it ideal for the evolving structure of tweets, and wrote them to an in-memory data grid (IMDG) on the server side.

The design needs to accommodate a very high throughput of around 10k tweets/sec, with latency of milliseconds. To that end I chose to implement the feeder as an independent processing unit in GigaSpaces XAP, so that I can cope with the write scalability requirement by simply adding more parallel feeders to handle the stream. Since the feeder is a stateless service, scaling out by adding instances is relatively easy. Trying to do the same with stateful services proves to be much more challenging, as we’re about to find out…
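As a rough illustration of the feeder’s ETL step, the following sketch converts a tweet into a semi-structured SpaceDocument and writes it to the grid using the GigaSpaces XAP document API. The type name and property names are illustrative; the real project’s schema may differ.

```java
import org.openspaces.core.GigaSpace;

import com.gigaspaces.document.SpaceDocument;
import com.gigaspaces.metadata.SpaceTypeDescriptorBuilder;

/**
 * ETL step of the feeder: convert a raw tweet into a semi-structured
 * SpaceDocument and write it to the in-memory data grid.
 */
public class TweetDocumentWriter {

    private final GigaSpace gigaSpace;

    public TweetDocumentWriter(GigaSpace gigaSpace) {
        this.gigaSpace = gigaSpace;
        // The document type must be registered once before documents are written.
        gigaSpace.getTypeManager().registerTypeDescriptor(
                new SpaceTypeDescriptorBuilder("Tweet")
                        .idProperty("id")
                        .routingProperty("id")
                        .create());
    }

    public void write(long id, String user, String text, long createdAt) {
        SpaceDocument doc = new SpaceDocument("Tweet");
        doc.setProperty("id", id);
        doc.setProperty("fromUser", user);
        doc.setProperty("text", text);
        doc.setProperty("createdAt", createdAt);
        doc.setProperty("processed", Boolean.FALSE); // picked up later by the processors
        gigaSpace.write(doc);
    }
}
```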

Let’s pick the brains of my accumulated tweets

On the server side, I wanted to store the tweets and prepare them for batch-mode historical research. For the same reason of semi-structured data, I also chose a Document-oriented database to store the tweets. In this case, I chose the open-source Apache Cassandra, which has become a prominent NoSQL database that is in use by Twitter itself, as well as by many other companies. The API I used to interact with Cassandra is Hector.

To avoid tight coupling of my application code with Cassandra, I followed the Inversion of Control (IoC) principle and created an abstraction layer for the persistent store, where Cassandra is just one implementation, and provided another implementation, for testing purposes, that persists to the local file system. Leveraging the Spring Framework’s wiring capabilities (see below), switching between implementations becomes a configuration choice, with no code changes.
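A minimal sketch of such an abstraction might look like the following; the interface and record names are hypothetical, with the Cassandra (via Hector) and file-system implementations plugged in through Spring wiring.

```java
import java.util.List;

/**
 * Persistence abstraction: the workflow depends only on this interface,
 * and the concrete store (Cassandra via Hector, or the local file system
 * used in tests) is chosen by Spring wiring, not by application code.
 */
public interface TweetPersister {

    /** Persist a batch of tweets for later batch/historical analytics. */
    void persist(List<TweetRecord> tweets);

    /** Minimal record handed to the store; field names are illustrative. */
    class TweetRecord {
        public final long id;
        public final String text;

        public TweetRecord(long id, String text) {
            this.id = id;
            this.text = text;
        }
    }
}
```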

Easy configuration

For easy configuration and wiring I used the Spring Framework, leveraging its wiring capabilities as well as property injection and parameter configuration. Using these features I made the application highly configurable, exposing the Twitter connection details, buffer sizes, thread pool sizes, etc. This means that the application can be tuned to any load, depending on the hardware, network and JVM specifications.
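For illustration, a configuration along these lines could externalize the tunables to a properties file; the property keys and the settings holder below are made up for the example, and the actual project may use XML wiring instead.

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.core.env.Environment;

/**
 * Configuration sketch: buffer and thread-pool sizes are externalized
 * to a properties file and injected by Spring, so the application can
 * be tuned to the hardware, network and JVM without code changes.
 */
@Configuration
@PropertySource("classpath:feeder.properties")
public class FeederConfig {

    @Autowired
    private Environment env;

    @Bean
    public FeederSettings feederSettings() {
        // Defaults apply when the property is absent from feeder.properties.
        return new FeederSettings(
                env.getProperty("feeder.batchSize", Integer.class, 100),
                env.getProperty("feeder.threadPoolSize", Integer.class, 4));
    }

    /** Simple holder for the tunable parameters. */
    public static class FeederSettings {
        public final int batchSize;
        public final int threadPoolSize;

        public FeederSettings(int batchSize, int threadPoolSize) {
            this.batchSize = batchSize;
            this.threadPoolSize = threadPoolSize;
        }
    }
}
```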

What can I learn from the tweet stream on the fly?

In addition to persisting the data, I also wanted to perform some sample on-the-fly real-time analytics on the data. For this experiment I chose to perform word counting (or more exactly token counting, as a token can also be an expression or a combination of symbols).

At first glance you may think it’s a simple task to implement, but when facing the volumes and throughput of Twitter you’ll quickly realize that we need a highly scalable and distributed architecture, one that uses appropriate technology and takes into account data sharding and consistency, data model de-normalization, processing reliability, message locality, throughput balancing to avoid backlog build-up, and so on.

Processing workflow

I chose to meet these challenges by employing Event-Driven Architecture (EDA) and designed a workflow that runs through the different stages of the processing (parsing, filtering, persisting, local counting, global aggregation, etc.), where each stage of the workflow is a processor. To meet the above challenges of throughput, backlog build-up, distribution etc., I designed the processors with the following characteristics:

  1. Each processor has a thread pool (of a configurable size) to enable concurrent processing.
  2. Each processor thread can process events in batches (of a configurable size) to balance between input and output streams and avoid backlog build-up.
  3. Processors are co-located with the (sharded) data, so that most of the data processing is performed locally, within the same JVM, avoiding distributed transactions, network, and serialization overhead.

The overall workflow looks as follows:

For the implementation of the workflow and the processors I chose the XAP Polling Container, which runs inside the in-memory data grid, co-located with the data, and enables easy implementation of the above characteristics.
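Here is a sketch of one such processing stage, assuming the XAP polling-container annotations and the illustrative tweet document from above; the processor name, tokenization logic and property names are simplified for the example rather than taken from the actual project.

```java
import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;

import com.gigaspaces.document.SpaceDocument;

/**
 * One stage of the workflow as an event-driven processor: a polling
 * container co-located with the data grid takes unprocessed tweet
 * documents, tokenizes them, and marks them processed.
 */
@EventDriven
@Polling(concurrentConsumers = 4) // thread pool size; configurable in the real project
public class TokenizerProcessor {

    @EventTemplate
    public SpaceDocument unprocessedTweet() {
        // Template matching: only documents whose 'processed' flag is false trigger this stage.
        SpaceDocument template = new SpaceDocument("Tweet");
        template.setProperty("processed", Boolean.FALSE);
        return template;
    }

    @SpaceDataEvent
    public SpaceDocument process(SpaceDocument tweet) {
        String text = tweet.getProperty("text");
        String[] tokens = text.split("\\s+"); // naive tokenization for illustration
        tweet.setProperty("tokens", tokens);
        tweet.setProperty("processed", Boolean.TRUE);
        return tweet; // the returned document is written back to the space for the next stage
    }
}
```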

The events that drive the workflow are simple POJOs; listening on changes in their state is what triggers each processing stage. This is a very useful characteristic of the XAP platform, which saved me from having to generate message objects and place them in a message broker.

Atomic counters

For the implementation of atomic counters I used XAP’s MAP API, which allows using the in-memory data grid as a transactional key-value store, where the key is the token and the value is the count, and each such entry can be locked individually to achieve atomic updates, much like ConcurrentHashMap.
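The update logic is easiest to illustrate with the local ConcurrentHashMap analogue the text refers to; in the actual deployment the map is the XAP key/value store with per-entry locking, but the read-modify-write pattern per token is the same.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Local analogue of the counter update: each token maps to a count
 * that is updated atomically. In the grid, the same pattern applies,
 * with each entry locked individually during the update.
 */
public class TokenCounter {

    private final Map<String, Long> counters = new ConcurrentHashMap<String, Long>();

    public void increment(String token, long delta) {
        // Atomic read-modify-write per key, like locking a single entry in the grid.
        counters.merge(token, delta, Long::sum);
    }

    public long count(String token) {
        return counters.getOrDefault(token, 0L);
    }
}
```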

Making it all play together in harmony

So we have a deployment that incorporates a feeder, a processor and a Cassandra persistent store, each such service having multiple instances and needing to scale in/out dynamically based on demand. Designing my solution for the real deal, I’m about to face tens to hundreds of instances of each service. Manual deployment or scripting will not be manageable, not to mention automatic scaling of each service, monitoring and troubleshooting. How do I manage all of that automatically as a single cohesive solution?

For that I used GigaSpaces Cloudify, which allows me to integrate any stack of services by writing Groovy-based recipes that declaratively describe the lifecycle of the application and its services.

I can then deploy and manage the end-to-end application using the CLI and the Web Console.

Conclusion

This was a thought exercise on real-time analytics for big data. I used the Twitter use case because I wanted to aim high on the big data challenge and, well, you can’t get much bigger than that.

The end-to-end solution included a clustered Cassandra NoSQL database for the elaborate batch analytics of the historical data, the GigaSpaces XAP platform for a distributed In-Memory Data Grid with co-located real-time processing, Spring Social for feeding in tweets from Twitter, the Spring Framework for configuration and wiring, and GigaSpaces Cloudify for deployment, management and monitoring. I used event-driven architecture with semi-structured Documents, POJOs and atomic counters, and with write-behind eviction.

This is just the beginning. My design hardly utilized the capabilities of the chosen technology stack, and it barely scratched the surface of the analytics you can perform on Twitter. Imagine for example what it would take to calculate not just real-time word counts but also the reach of tweets, as done on the tweetreach service.

This project is just the starting point. I would like to share it with you and invite you to take up the challenge with me, and together make it into a complete reference solution for real-time analytics architecture for big data.

The project can be found on GitHub at https://github.com/dotanh/rt-analytics.

You’re welcome to contribute!

Follow Dotan on Twitter!


Filed under Big Data, Real Time Analytics, Solution Architecture

Building Cloud Applications the Easy Way Using Elastic Application Platforms

Patterns, Guidelines and Best Practices Revisited

In my previous post I analyzed Amazon’s recent AWS outage and the patterns and best practices that enabled some of the businesses hosted on Amazon’s affected availability zones to survive the outage.

The patterns and best practices I presented are essential to guarantee robust and scalable architectures in general, and on the cloud in particular. Those who dismissed my latest post as exaggerating an isolated incident got affirmation of my point last week, when Amazon found itself apologizing once again after its Cloud Drive service was overwhelmed by unpredictable peak demand for Lady Gaga’s newly released album (99 cents, who wouldn’t buy it?!) and rendered unresponsive. This failure to scale up/out to accommodate fluctuating demand raises the scalability concern in the public cloud, in addition to the resilience concern raised by the AWS outage.

Surprisingly, as obvious as the patterns I listed may seem, they are definitely not common practice, judging by the number of applications that went down when AWS did, and by how many other applications have similar issues on public cloud providers.

Why are such fundamental principles not prevalent in today’s architectures on the cloud?

One of the reasons these patterns are not prevalent in today’s cloud applications is that designing such architectures requires an architect experienced and confident in the areas of distributed and scalable systems. The typical public cloud APIs also require developers to perform complex coding and utilize various non-standard APIs that are usually not common knowledge. Similar difficulties are found in testing, operating, monitoring and maintaining such systems. This makes it quite difficult to implement the above patterns to ensure the application’s resilience and scalability, and diverts valuable development time and resources from the application’s business logic, which is its core value.

How can we make the introduction of these patterns and best practices smoother and simpler? Can we get these patterns as a service to our application? We are all used to traditional application servers that provide our enterprise applications with underlying services such as database connection pooling, transaction management and security, and free us from worrying about these concerns so that we can focus on designing our business logic. Similarly, Elastic Application Platforms (EAPs) allow your application to easily employ the patterns and best practices I enumerated in my previous post for high availability and elasticity, without having to become an expert in the field, allowing you to focus on your business logic.

So what is an Elastic Application Platform? Forrester defines an elastic application platform as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

Last month Forrester published a review under the title “Cloud Computing Brings Demand For Elastic Application Platforms”. The review is the result of comprehensive research and spans 17 pages (a blog post introducing it can be found on the Forrester blog). It analyzes the difficulties companies encounter in implementing their applications on top of cloud infrastructure, and recognizes elastic application platforms as the emerging solution for a smooth path into the cloud. It then maps the potential providers of such solutions. For its research Forrester interviewed 17 vendor and user companies. Out of all the reviewed vendors, Forrester identified only three that are “offering comprehensive EAPs today”: Microsoft, SalesForce.com and GigaSpaces.

As Forrester did an amazing job in their research reviewing and comparing today’s EAP solutions, I’ll avoid repeating that. Instead, I’d like to review the GigaSpaces EAP solution in light of the patterns discussed in my previous post, and see how building your solution on top of GigaSpaces enables you to introduce these patterns easily, without having to divert your focus from your business logic.

Patterns, Guidelines and Best Practices Revisited

Design for failure

Well, that’s GigaSpaces’ bread and butter. Whereas thinking about failure diverts you from your core business, in our case it is our core business. The GigaSpaces platform provides underlying services to enable high availability and elasticity, so that you don’t have to take care of them yourself. Now that we’ve established that, let’s see how it’s done.

Stateless and autonomous services

The GigaSpaces architecture segregates your application into Processing Units. A Processing Unit (PU) is an autonomous unit of your application. It can be a pure business-logic (stateless) unit, hold data in-memory, provide a web application, or mix these and other functions together. You can define the required Service Level Agreement (SLA) for your Processing Unit, and the GigaSpaces platform will make sure to enforce it. When your Processing Unit SLA requires high availability – the platform will deploy a (hot) backup instance (or multiple backups) of the Processing Unit, to which the PU will fail over if the primary instance fails. When your application needs to scale out – the platform will add another instance of the Processing Unit (possibly on a newly-provisioned virtual machine booted automatically by the platform). When your application needs to distribute data and/or data processing – the platform will shard the data evenly across several instances of the Processing Unit, so that each instance handles a subset of the data independently of the other instances.

Redundant hot copies spread across zones

You can divide your deployment environment into virtual zones. These zones can represent different data centers, different cloud infrastructure vendors, or any physical or logical division you see fit. Then you can tell the platform (as part of the SLA) not to place a primary and its backup instances of the Processing Unit in the same zone, thus making sure the data stored within the Processing Unit is backed up in two different zones. This provides your application resilience across two data centers, two cloud vendors or two regions, depending on your required resilience, all with a uniform development API. Want a higher level of resilience? Just define more zones and more backups for each PU.

Spread across several public cloud vendors and/or private cloud

GigaSpaces abstracts the details of the underlying infrastructure from your application. GigaSpaces’ Multi-Cloud Adaptor technology provides built-in integration with several major cloud providers, including the JClouds open-source abstraction layer, thus supporting any cloud vendor that conforms to the JClouds standard. So all you need to do is plug your desired cloud providers into the platform, and your application logic remains agnostic to the cloud infrastructure details. Plugging in two vendors to ensure resilience then becomes just a matter of configuration. The integration with JClouds is an open-source project under OpenSpaces.org, so feel free to review it and even pitch in to extend and enhance the integration with cloud vendors.

Automation and Monitoring

GigaSpaces offers a powerful set of tools that allow you to automate your system. First, it offers the Elastic Processing Unit, which can automatically monitor CPU and memory utilization and employ corrective actions based on your defined SLA. GigaSpaces also offers a rich Administration and Monitoring API that enables administration and monitoring of all the GigaSpaces services and components, as well as the layers running beneath the platform, such as the transport layer, machine and operating system. GigaSpaces also offers a web-based dashboard and a management center client. Another powerful tool for monitoring and automation is administrative alerts, which can be configured and then viewed through GigaSpaces or external tools (e.g. via SNMP traps).
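As a small illustration, a monitoring client using the Administration API could look roughly like this; the lookup group and processing unit names are placeholders, and the sketch only scratches the surface of what the API exposes.

```java
import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.gsc.GridServiceContainers;
import org.openspaces.admin.pu.ProcessingUnit;

/**
 * Monitoring sketch using the GigaSpaces Administration API: connect to
 * a lookup group and inspect the containers and a deployed processing unit.
 */
public class ClusterMonitor {

    public static void main(String[] args) {
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();

        // Wait until the processing unit is discovered, then report its planned instance count.
        ProcessingUnit pu = admin.getProcessingUnits().waitFor("tweet-processor");
        System.out.println("Planned instances: " + pu.getNumberOfInstances());

        GridServiceContainers containers = admin.getGridServiceContainers();
        System.out.println("Running containers: " + containers.getSize());

        admin.close();
    }
}
```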

Avoiding ACID services and leveraging on NoSQL solutions

GigaSpaces does not rule out SQL for querying your data. We believe that true NoSQL stands for “Not Only SQL”, and that SQL as a language is good for certain uses, whereas other uses require other query semantics. GigaSpaces supports some of the SQL language through its SQLQuery API or through standard JDBC. However, GigaSpaces also provides a rich set of alternative standards and protocols for accessing your data, such as a Map API for key/value access, a Document API for dynamic schemas, object-oriented access (through the proprietary Space API or standard JPA), and the Memcached protocol.
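A brief sketch of the SQLQuery style of access over documents in the grid; the document type and property names are made up for the example and not tied to any particular application.

```java
import org.openspaces.core.GigaSpace;

import com.gigaspaces.document.SpaceDocument;
import com.j_spaces.core.client.SQLQuery;

/**
 * Query sketch: SQL-like querying over semi-structured documents in the
 * grid, one of the several access APIs mentioned above.
 */
public class TweetQueries {

    public SpaceDocument[] tweetsContaining(GigaSpace gigaSpace, String word) {
        // Query a hypothetical "Tweet" document type by a substring of its text property.
        SQLQuery<SpaceDocument> query = new SQLQuery<SpaceDocument>("Tweet", "text like ?");
        query.setParameter(1, "%" + word + "%");
        return gigaSpace.readMultiple(query, 100); // return up to 100 matches
    }
}
```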

Another challenge of traditional relational databases is scaling data storage in read-write environments. Distributed relational databases were enough to deal with read-mostly environments, but Web 2.0 brought social concepts into the web, with customers feeding data into the websites. Several NoSQL solutions try to address distributed data storage and querying. GigaSpaces provides this via its support for clustered topologies of the in-memory data grid (the “space”) and for distributing queries and execution using patterns such as Map/Reduce and event-driven design.

Load Balancing

The elastic nature of the GigaSpaces platform allows it to automatically detect the CPU and memory capacity of the deployment environment and optimize the load dynamically based on your defined SLA, instead of employing an arbitrary division of the data into fixed zones. This dynamic nature also allows your system to adjust to the failure of an entire zone (such as what happened with Amazon’s availability zones), so that your system doesn’t go down even in such extreme cases and maintains optimal balance under the new conditions.

Furthermore, the GigaSpaces platform supports content-based routing, which allows for smart load balancing based on your business model and logic. Content-based routing allows your application to route related data to the same host and then execute entire business flows within the same JVM, thus avoiding network hops and complex distributed transaction locking that hinder your application’s scalability.
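A sketch of content-based routing on a POJO: annotating a property with @SpaceRouting makes all entries sharing the same value for that property land in the same partition, so processing for that key can run in one JVM without network hops. The class and its fields are a made-up example.

```java
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

/**
 * Routing sketch: entries with the same token value are routed to the
 * same partition, keeping related data and its processing co-located.
 */
@SpaceClass
public class TokenCount {

    private String token;
    private Long count;

    public TokenCount() {
    }

    @SpaceId
    @SpaceRouting
    public String getToken() {
        return token;
    }

    public void setToken(String token) {
        this.token = token;
    }

    public Long getCount() {
        return count;
    }

    public void setCount(Long count) {
        this.count = count;
    }
}
```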

Conclusion

Most significant advancements do not happen in slow gradual steps but rather in leaps. These leaps happen when the predominant conception crashes in the face of the new reality, leaving behind chaos and uncertainty, and out of the chaos emerges the next stage in the maturity of the system.

This is the case with the maturity process of the young cloud realm as well: the AWS outage was a major reality check that opened the eyes of the IT world. Systems crashed with AWS because they counted on their cloud infrastructure provider to handle their application’s high availability and elasticity using its generic logic. This concept proved to be wrong. Now the IT world is in confusion, and much discussion is taking place on whether the faith in the cloud was mistaken, with titles like “EC2 Failure Feeds Worries About Cloud Services”.

The next step in the cloud’s maturity was the realization that cloud infrastructure is just infrastructure, and that you need to implement your application correctly, using patterns and best practices such as the ones I raised in my previous post, to leverage the cloud infrastructure and gain high availability and elasticity.

The next step in the evolution is to start leveraging dedicated application platforms that handle these concerns for you and abstract the cloud away from your application, so that you can simply define your application’s SLA for high availability and elasticity and leave it up to the platform to manipulate the cloud infrastructure to enforce it, while you concentrate on writing your application’s business logic. As Forrester said:

… A new generation of application platforms for elastic applications is arriving to help remove this barrier to realizing cloud’s benefits. Elastic application platforms will reduce the skill required to design, deliver, and manage elastic applications, making automatic scaling of cloud available to all shops …

 

Follow Dotan on Twitter!


Filed under Cloud, PaaS