
The Mysterious Creator Of Bitcoin and Blockchain Comes Out Of The Shadows

The birth of the virtual currency Bitcoin was accompanied by great mystery, as its creator chose to remain in the shadows, known only by his (or her?) pseudonym Satoshi Nakamoto. Rumors over the years raised various potential “suspects”, but nothing was proven and no one confessed. Until now.

Today Dr. Craig Wright, a 45-year-old computer scientist from Australia, announced that he is Satoshi Nakamoto. In his post, Wright says:

If I sign Craig Wright, it is not the same as if I sign Craig Wright, Satoshi.

Wright then goes on to thank those who supported the project, which started with a monumental paper by the mysterious Satoshi Nakamoto, followed by the release of an even more monumental implementation of blockchain, the revolutionary open distributed ledger technology underlying Bitcoin. In fact, the impact of blockchain currently seems to surpass that of Bitcoin itself, with blockchain-based innovation boiling up in both startups and financial institutions. The interest is so great (and the skill set so rare) that IBM and Microsoft launched Blockchain-as-a-Service offerings on their clouds to help companies innovate with blockchain.


Wright dedicated most of his post to convincing the community of the authenticity of his claim. He provides signed evidence, supposedly produced with a private key associated with Satoshi Nakamoto (the key for block 9), and elaborates on the process of verifying cryptographic keys, effectively inviting readers to verify the evidence themselves.

Wright knew such an announcement would not pass without a storm, so he approached three high-profile media outlets in advance, The Economist, the BBC and GQ Magazine, to exclusively present his claim and evidence so they could accompany his post with their own coverage. The outlets jumped on the scoop and dug deep into his claims. You can read their full reviews here:

The Economist: Craig Steven Wright claims to be Satoshi Nakamoto. Is he?

The BBC: Australian Craig Wright claims to be Bitcoin creator

GQ Magazine: Dr Craig Wright Outs Himself As Bitcoin Creator Satoshi Nakamoto

Is Dr. Wright the real Satoshi Nakamoto? The community will be debating that in the weeks to come, together with further evidence released by Wright. More importantly, Wright hints at new work he’s done in this field with “an exceptional group”. It may very well be that the current announcement just sets the stage for bigger announcements or releases yet to come.

Follow Horovits on Twitter!


Filed under Blockchain, Uncategorized

IBM, Microsoft Offer Blockchain In Their Cloud Services

Recently blockchain fans got major news, with two giants, IBM and Microsoft, announcing support for Blockchain-as-a-Service (BaaS) in their cloud services. Are we going to see some cloud-based blockchain developments soon? Sounds like it.

Blockchain emerged from the Bitcoin cryptocurrency hype as the innovative distributed ledger technology behind Bitcoin. But while cryptocurrencies are well past Gartner’s peak of inflated expectations, blockchain is gaining growing interest from startups and enterprises alike. The interest in blockchain isn’t limited to cryptocurrencies: it extends into other financial use cases, and even transcends the FinTech realm into non-financial use cases such as electronic voting, smart contracts and ownership verification for art and diamonds.


The interest in blockchain drove the creation of different “flavors” of the distributed ledger notion, beyond the initial one used for Bitcoin. One interesting initiative recently launched is the Hyperledger project, a community-backed open-source standard for distributed ledgers. It was launched in December 2015 under the Linux Foundation by big financial-services names such as J.P. Morgan, Wells Fargo, London Stock Exchange Group and Deutsche Börse, as well as equally big IT players such as IBM, Intel, Cisco and VMware. As part of joining Hyperledger, IBM open-sourced a significant chunk of the blockchain code it had been working on.


IBM launched its Blockchain-as-a-Service into production in February. To encourage adoption of its new cloud service, IBM is also opening garages for blockchain app design and implementation in London, New York, Singapore and Tokyo.

Microsoft was first to move in on blockchain. Last November Microsoft launched a Blockchain-as-a-Service on its Azure cloud based on Ethereum, in partnership with ConsenSys. But while IBM bet on the Hyperledger project, Microsoft took a different approach and spread its bet across multiple projects and partnerships. During the last month Microsoft added Augur, Lisk, BitShares and Syscoin to its blockchain partnerships, and this month it also added Storj.

I estimate IBM and Microsoft will not remain alone in this game. Other vendors will join in to offer platforms and cloud services that accelerate the development of blockchain-based applications. This will be a serious enabler for innovation around this fascinating technology, whether for young startups bootstrapping on a low budget, or for financial institutions (and other enterprises) lacking in-house skills in this cutting-edge technology.

Follow Horovits on Twitter!



Filed under Blockchain, Cloud, Uncategorized

Live Video Streaming At Facebook Scale

Operating at Facebook scale is far from trivial. With 1.49 billion monthly active users (growing 13 percent yearly), every 60 seconds on Facebook 510 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded. Therein lies the challenge: serving the masses efficiently and reliably, without any outages.

For serving offline content, whether text (updates, comments, etc.), photos or videos, Facebook developed a sophisticated architecture that includes state-of-the-art data center technology and a search engine to traverse and fetch content quickly and efficiently.

But now comes a new type of challenge: a few months ago Facebook rolled out a new live-streaming service called Live for Facebook Mentions, which allows celebrities to broadcast live video to their followers. This service is quite similar to Periscope (acquired by Twitter at the beginning of this year) and the popular Meerkat app, which offer live video streaming to everyone, not just celebrities. In fact, Facebook announced this month that it is piloting a new service that will offer live streaming to the wider public as well.


While offline photos and videos are uploaded in full and then distributed and made accessible to followers and friends, serving live video streams is much more challenging to implement at scale. To make things even harder, the viral nature of social media (and of celebrity content in particular) often creates spikes in which thousands of followers demand the same popular content at the same time, a phenomenon the Facebook team calls the “thundering herd” problem.

An interesting post by Facebook engineering shares these challenges and the approaches the team took. Facebook’s system uses a Content Delivery Network (CDN) architecture with two layers of caching, the edge cache sitting closest to the users and serving 98 percent of the content. This design aims to reduce the load on the backend server processing the incoming live feed from the broadcaster. Another useful optimization for further reducing backend load is request coalescing: when many followers (in the case of celebrities it could reach millions!) ask for content that is missing from the cache (a cache miss), only one request proceeds to the backend to fetch the content on behalf of all, avoiding a flood.
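The request-coalescing idea can be sketched in a few lines of Python. This is a minimal illustrative single-flight cache, not Facebook’s actual implementation: on a cache miss, the first requester for a key fetches from the backend, while concurrent requesters for the same key simply wait for that result.

```python
import threading

class CoalescingCache:
    """Toy edge cache: on a cache miss, only one request per key
    goes to the backend; concurrent requesters wait for its result."""

    def __init__(self, fetch_from_backend):
        self._fetch = fetch_from_backend
        self._cache = {}
        self._inflight = {}          # key -> threading.Event of the in-progress fetch
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:            # cache hit
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:                 # first miss: we become the leader
                    event = threading.Event()
                    self._inflight[key] = event
                    leader = True
                else:
                    leader = False                # someone else is already fetching
            if leader:
                value = self._fetch(key)          # the single backend call
                with self._lock:
                    self._cache[key] = value
                    del self._inflight[key]
                event.set()                       # wake up the waiters
                return value
            event.wait()                          # wait, then loop to read the cache
```

In Facebook’s case the same idea is applied per video segment at the edge-cache layer, so a spike of concurrent viewers translates into a single backend fetch per missing segment.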


It’s interesting to note that the celebrity service and the newer public service involve different considerations and trade-offs between throughput and latency, which brought Facebook’s engineering team to adapt the architecture to the new service:

Where building Live for Facebook Mentions was an exercise in making sure the system didn’t get overloaded, building Live for people was an exercise in reducing latency.

The content itself is broken down into tiny segments of multiplexed audio and video for more efficient distribution and lower latency. The new Live service (for the wider public) even called for changing the underlying streaming protocol to achieve even better latency, reducing the lag between broadcaster and viewer by 5x.

This is a fascinating exercise in scalable architecture for live streaming, which is said to scale effectively to millions of broadcasters. Such open discussions can pave the way for smaller companies in social media, the Internet of Things (IoT) and the ever-more-connected world. You can read the full post here.

Follow Horovits on Twitter!


Filed under Solution Architecture, Uncategorized

How IBM is using big data to fix Beijing’s pollution crisis

A fascinating way to leverage big data to help the world


Of China’s major cities, Beijing’s pollution problem is probably the worst, causing thousands of premature deaths every year. Its residents are fed up. The growing outrage has forced leaders to declare a “war on pollution,” including the goal of slashing Beijing’s PM2.5, the concentration of the particles that pose the greatest risk to human health, by 25% by 2017. The Beijing municipal government will earmark nearly 1 trillion yuan ($160 billion) to meet that target.


Why, then, are the city’s own government officials skeptical about hitting that 2017 goal? Perhaps because Beijing’s pollution woes are unusually complicated. The city is flanked on three sides by smog-trapping mountain ranges. There are numerous sources of foul air, and a multitude of subtle ways the chemicals interact with each other, which make it hard to identify what problems need fixing.

IBM thinks it can change that outlook. On Monday, the company will unveil a 10-year initiative launched in partnership with the Beijing Municipal Government…



Filed under Uncategorized

Facebook outage reported now worldwide

Facebook is down. Trying to access the site yields a page with a laconic message that “something went wrong”.

facebook outage error message

According to outage-tracking statistics, the outage started at 3:55 a.m. EDT.

facebook outage statistics

Reports are flooding the net. The outage seems to be worldwide.

facebook outage twitter responses

So far no explanation from Facebook.

Stay tuned.




Filed under Uncategorized

AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as past ones. Many recommend basing your architecture on multiple AWS Availability Zones (AZs) to spread the risk. But during this outage we saw even multi-Availability-Zone applications severely affected. Even Amazon stated during the outage that

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is underlying infrastructure that escalates the traffic from the affected AZ to the other AZs in a way that overwhelms the system. In the case of this outage it was the AWS API platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn’t really matter to us as users exactly which piece of infrastructure failed in this specific outage. 18 months ago, during the first major outage, it was another infrastructure component, the Elastic Block Store (EBS) volumes, that cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations (and don’t forget, Amazon only commits to a 99.995% SLA) causing such major outages. And there will always be some common infrastructure that, under such an extreme and rare situation, will carry the ripple effect of the outage to other Availability Zones in the region.

Of course, you can mitigate risk by spreading your system across several AWS regions (e.g. between US-EAST and US-WEST), as they are much more loosely coupled, but as I stated in my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon itself states: “it requires effort on the part of application builders to take advantage of this isolation”.

The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, providing the best isolation level. The advantages in terms of resilience are clear. But how can that be implemented, given that the vendors differ so much in their characteristics and APIs?

There are two approaches to deploying across multiple cloud vendors while remaining cloud-vendor-agnostic:

  1. Open standards and APIs for the cloud, supported by multiple cloud vendors. That way you write your application against a common standard and get immediate support from all conforming cloud vendors. Examples of such emerging standards are OpenStack and JClouds. However, the cloud is still a young domain with many competing standards and APIs, and it is yet to be determined which one will become the de-facto standard of the industry and where to “place our bet”.
  2. Open PaaS platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it to the platform to communicate with the underlying cloud vendors (whether public or private clouds, or even a hybrid). Examples of such platforms are CloudFoundry and Cloudify. I dedicated one of my posts to exploring how to build your application using such platforms.
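The common idea behind both approaches is an abstraction layer between the application and the vendor-specific APIs. A minimal Python sketch (all class and method names here are hypothetical, for illustration only, far simpler than what JClouds or a PaaS actually provides): the application codes against an abstract driver interface, and each cloud vendor is wrapped in an adapter, so switching vendors, or spreading across several, means swapping adapters rather than rewriting the application.

```python
from abc import ABC, abstractmethod

class CloudDriver(ABC):
    """Minimal provider-agnostic interface (hypothetical)."""

    @abstractmethod
    def provision(self, name: str) -> str:
        """Spin up a node, return its id."""

    @abstractmethod
    def terminate(self, node_id: str) -> None:
        """Tear a node down."""

class FakeEC2Driver(CloudDriver):
    """Illustrative stand-in for an EC2 adapter; a real one would
    call the vendor's API behind the same interface."""

    def __init__(self):
        self.nodes = {}

    def provision(self, name):
        node_id = f"i-{len(self.nodes):08d}"
        self.nodes[node_id] = name
        return node_id

    def terminate(self, node_id):
        del self.nodes[node_id]

def deploy_app(driver: CloudDriver, instances: int):
    """Application code talks only to the abstract driver, so moving
    to another vendor means passing in a different adapter."""
    return [driver.provision(f"web-{i}") for i in range(instances)]
```

A multi-cloud deployment would simply call `deploy_app` once per adapter, one for each vendor the system spans.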


System architects need to face the reality of the service level agreements provided by Amazon and other cloud vendors, and their limitations, and start designing for resilience: spreading across isolated environments, deploying DR sites, and taking similar redundancy measures to keep their services up and running and their data safe. Only that way can we guarantee that we will not be the next to fall off the 99.995% SLA.

This post was originally posted here.


Filed under cloud deployment, Disaster-Recovery, IaaS, PaaS, Solution Architecture, Uncategorized

AWS Outage – Thoughts on Disaster Recovery Policies

A couple of days ago it happened again. On June 14, around 9 pm PDT, Amazon AWS suffered a power outage in its Northern Virginia data center, affecting EC2, RDS, Elastic Beanstalk and other services in the US-EAST region. The AWS status page reported:

Some Cache Clusters in a single AZ in the US-EAST-1 region are currently unavailable. We are also experiencing increased error rates and latencies for the ElastiCache APIs in the US-EAST-1 Region. We are investigating the issue.

This outage affected major sites such as Quora, Foursquare, Pinterest, Heroku and Dropbox. I followed the outage reports, the tweets and the blog posts, and it all sounded all too familiar. A year ago AWS faced a mega-outage lasting over 3 days, when another data center (in Virginia, no less!) went down and took major sites down with it (Quora, Foursquare… ring a bell?).

Back during last year’s outage I analyzed the reports of the sites that managed to survive it, and compiled a list of field-proven guidelines and best practices to apply in your architecture to make it resilient when deployed on AWS and other IaaS providers. I find these guidelines and best practices highly useful in my architectures. I then followed up with another blog post suggesting designated software platforms for applying some of them.

In this blog post I’d like to address one specific guideline in greater depth: architecting for Disaster Recovery.

Disaster Recovery – Characteristics and Challenges

PC Magazine defines Disaster Recovery (DR) as:

A plan for duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.

DR planning has been common practice since the days of the mainframe. An interesting question is why this practice is not as widespread in cloud-based architectures. In his recent post “Lessons from the Heroku/Amazon Outage”, Nati Shalom, GigaSpaces CTO, analyzes this behavior and suggests two possible causes:

  • We give up responsibility when we move to the cloud – When we move our operation to the cloud we often assume that we’re outsourcing our data center operation completely, including our Disaster Recovery procedures. The truth is that when we move to the cloud we’re only outsourcing the infrastructure, not our operation, and the responsibility for using this infrastructure remains ours.
  • Complexity – The current DR processes and tools were designed for a pre-cloud world and don’t work well in a dynamic environment such as the cloud. Many of the tools provided by the cloud vendor (Amazon in this specific case) are still fairly complex to use.

I addressed the first cause, the perception that cloud is a silver bullet that lets people give up responsibility for resilience aspects, in my previous post. The second cause, the lack of tools, is usually addressed by DevOps tools such as Chef, Puppet, CFEngine and Cloudify, which capture the setup and can bootstrap the application stack on different environments. In my example I used Cloudify to provide consistent installation between the EC2 and RackSpace clouds.

Making sure your architecture incorporates a Disaster Recovery plan is essential to ensure business continuity and avoid cases such as those seen during Amazon’s outages. Online services require a Hot Backup Site architecture, so the service can stay up even during the outage:

A hot site is a duplicate of the original site of the organization, with full computer systems as well as near-complete backups of user data. Real time synchronization between the two sites may be used to completely mirror the data environment of the original site using wide area network links and specialized software.

DR sites can follow an Active/Standby architecture (as in traditional DRPs), where the DR site starts serving only upon an outage event, or an Active/Active architecture (the more modern approach). In his discussion of assuming responsibility, Nati states that a DR architecture should assume responsibility for the following aspects:

  • Workload migration – specifically, the ability to clone our application environment in a consistent way across sites, in an on-demand fashion.
  • Data synchronization – the ability to maintain a real-time copy of the data between the two sites.
  • Network connectivity – the ability to enable the flow of network traffic between the two sites.

I’d like to experiment with an example DR architecture that addresses these aspects, as well as Nati’s second challenge, complexity. In this part I will use a simple web app as an example and show how we can easily create two sites on demand. I will even go as far as setting up this environment on two separate clouds, to show how we can ensure an even higher degree of redundancy by running our application across two different cloud providers.

A step-by step example: Disaster Recovery from AWS to RackSpace

Let’s roll up our sleeves and start experimenting hands-on with the DR architecture. As the reference application let’s take Spring’s PetClinic sample application and run it on an Apache Tomcat web container. The application persists its data locally to a MySQL relational database. In my experiment I used the Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do. We tried the same example with a combination of HP Cloud Services and a flavor of a private cloud.

Data synchronization over WAN

How do we replicate data between the MySQL database instances over the WAN? In this experiment we’ll use the following pattern:

  1. Monitor data-mutating SQL statements on the source site. Turn on the MySQL query log, and write a listener (“Feeder”) to intercept data-mutating SQL statements and write them to the GigaSpaces In-Memory Data Grid.
  2. Replicate data-mutating SQL statements over the WAN. I used GigaSpaces WAN replication to replicate the SQL statements between the data grids of the primary and secondary sites in a real-time, transactional manner.
  3. Execute data-mutating SQL statements on the target site. Write a listener (“Processor”) to intercept incoming SQL statements on the data grid and execute them against the local MySQL DB.

To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.
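The Feeder/Processor pattern can be sketched in a few lines of Python. This is a toy illustration only: a plain in-memory queue stands in for the WAN-replicated GigaSpaces data grid, and all names are hypothetical.

```python
import queue

class ReplicationChannel:
    """Stand-in for the WAN-replicated data grid: a queue of SQL statements."""

    def __init__(self):
        self._q = queue.Queue()

    def publish(self, stmt):
        self._q.put(stmt)

    def drain(self):
        stmts = []
        while not self._q.empty():
            stmts.append(self._q.get())
        return stmts

# Statements that mutate data and therefore must be replicated.
MUTATING_PREFIXES = ("INSERT", "UPDATE", "DELETE")

def feeder(query_log_lines, channel):
    """Feeder: intercept data-mutating statements from the MySQL query log
    and publish them to the replication channel."""
    for line in query_log_lines:
        if line.lstrip().upper().startswith(MUTATING_PREFIXES):
            channel.publish(line)

def processor(channel, execute):
    """Processor: replay replicated statements against the local DB,
    where `execute` would wrap a real DB connection's execute call."""
    for stmt in channel.drain():
        execute(stmt)
```

Deploying both a Feeder and a Processor on each site, as described above, gives the bi-directional replication.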

Workload migration

I would like to address the complexity challenge and show how to automate setting up a site on demand. This is also useful for Active/Standby architectures, where the DR site is activated only upon an outage.

In order to set up a site for service, we need to perform the following flow:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when the outage (or peak load) is over – perform the reverse flow to tear down the secondary site

We would like to automate this bootstrap process to support on-demand capabilities in the cloud, as we know them from traditional DR solutions. I used the GigaSpaces Cloudify open-source product as the automation tool for setting up and taking down the secondary site, utilizing its out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing in case of VM or process failure, and can later help in scaling the application (for clustered applications).
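The five-step flow above can be expressed as a short orchestration sketch in Python. The cloud, node and load-balancer objects here are hypothetical fakes standing in for what a tool such as Cloudify actually automates via its recipes and cloud connectors.

```python
class FakeNode:
    """Stand-in for a provisioned VM."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.installed = []

    def install(self, pkg):
        self.installed.append(pkg)

    def deploy(self, artifact):
        self.installed.append(artifact)

class FakeCloud:
    """Stand-in for an IaaS provider's compute API."""
    def __init__(self):
        self.count = 0
        self.live = set()

    def spin_up_vm(self):
        self.count += 1
        node = FakeNode(f"vm-{self.count}")
        self.live.add(node.node_id)
        return node

    def destroy_vm(self, node):
        self.live.discard(node.node_id)

class FakeLoadBalancer:
    """Stand-in for a load-balancer service."""
    def __init__(self):
        self.members = set()

    def register(self, node):
        self.members.add(node.node_id)

    def deregister(self, node):
        self.members.discard(node.node_id)

def bring_up_secondary_site(cloud, lb, war="petclinic.war"):
    node = cloud.spin_up_vm()      # 1. spin up a compute node (VM)
    node.install("tomcat")         # 2. install the Tomcat web container
    node.deploy(war)               # 3. deploy the application
    lb.register(node)              # 4. add the new node to the load balancer
    return node

def tear_down_secondary_site(cloud, lb, node):
    lb.deregister(node)            # 5. reverse flow when the site is no longer needed
    cloud.destroy_vm(node)
```

In the real setup, each of these steps is captured once in a Cloudify recipe, so the same flow can be replayed on demand against EC2, RackSpace or any other supported cloud.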

Network Connectivity

The network connectivity between the primary and secondary sites can be addressed in several ways, ranging from load balancing between the sites, through setting up a VPN between them, up to using designated products such as Cisco’s Connected Cloud Solution.

In this example I went for a simple LB solution using RackSpace’s Load Balancer Service to balance between the web instances, and automated the LB configuration using Cloudify to make the changes as seamless as possible.

Implementation Details

The application is actually a re-use of an application I wrote recently to experiment with cloud bursting architectures, seeing that cloud bursting follows the same architectural guidelines as DR (Active/Standby DR, to be exact). The result of the experiment is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.


Cloud-hosted applications should take care of the system’s non-functional requirements, including resilience and scalability, just as on-premise applications do. Systems that neglect to incorporate these considerations in their architecture, relying solely on the underlying cloud infrastructure, end up severely affected by cloud outages such as the one experienced a few days ago in AWS. In my previous post I listed some guidelines, an important one being Disaster Recovery, which I explored here, suggesting possible architectural approaches and an example implementation. I hope this discussion raises awareness in the cloud community and helps mature cloud-based architectures, so that in the next outage we will not see as many systems go down.

Follow Dotan on Twitter!


Filed under Cloud, DevOps, Disaster-Recovery, IaaS, Solution Architecture, Uncategorized