Can a Configuration Update Take Down The Entire Microsoft Azure Cloud?

Yesterday at 0:51 AM (UTC), Azure, Microsoft’s public cloud service, suffered a massive global outage lasting around 11 hours. The outage affected 12 of Azure’s 17 regions, taking down the entire US, Europe and Asia, together with customers’ applications and services, and causing havoc among users.

After a day of emergency fixes and investigation, Microsoft published a formal initial report on the issue in its blog. The root cause is reported to be:

A bug in the Blob Front-Ends which was exposed by the configuration change made as a part of the performance improvement update, which resulted in the Blob Front-Ends going into an infinite loop.

Though the issue has not yet been fully investigated, the initial report indicates that the testing scheme the Azure team employs (nicknamed “flighting”) failed to detect the bug, thus allowing the configuration change to be rolled out to production. In addition, the roll-out itself was applied to the regions concurrently, instead of following the common practice of staged roll-outs across regions:

update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches.
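
To illustrate the difference, here is a minimal sketch (in Python, and not Azure’s actual tooling) of applying a change in incremental batches with a health gate between batches; the region names, soak time and check functions are hypothetical placeholders.

```python
# A minimal sketch of a staged, batched rollout with a health gate.
import time

REGIONS = ["us-east", "us-west", "north-europe", "west-europe", "east-asia"]  # illustrative names

def apply_config(region: str) -> None:
    """Placeholder for pushing the new configuration to one region."""
    print(f"applying config to {region}")

def healthy(region: str) -> bool:
    """Placeholder for the post-deployment health check of a region."""
    return True

def staged_rollout(regions, batch_size=1):
    """Apply the change batch by batch, halting on the first unhealthy batch."""
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        for region in batch:
            apply_config(region)
        time.sleep(5)  # soak time before checking health (placeholder value)
        if not all(healthy(r) for r in batch):
            print(f"halting rollout: batch {batch} is unhealthy")
            return False
    return True

if __name__ == "__main__":
    staged_rollout(REGIONS)
```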

Microsoft is not the first cloud vendor to encounter major outages. Amazon’s long-standing AWS cloud has suffered at least one major outage a year (let’s see how this year ends; so far it’s looking good for them). In 2011 AWS suffered an outage in the US East region which lasted 3 days. It is interesting to note that that outage was also triggered by a configuration change (in their case, one made to upgrade network capacity). Following that outage I provided recommendations and best practices for customers on how to keep their cloud-hosted systems resilient.

No cloud vendor is immune to such outages. Even the vendors’ standard built-in geo-redundancy mechanisms, such as multi-availability-zone and multi-region strategies, cannot save customers from major outages, as we witnessed here. We, as customers placing our mission-critical systems in the cloud, need to guarantee the resilience of our systems regardless of the vulnerabilities of the underlying cloud provider. To achieve an adequate level of resilience we need to employ a multi-cloud strategy, deploying our application across several vendors to reduce the risk. I covered the multi-cloud strategy in greater detail in my blog in 2012, following yet another AWS outage.
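
As a trivial illustration of the idea, here is a minimal sketch of client-side failover across two deployments of the same service hosted with different vendors; the URLs are hypothetical placeholders, and a real system would more likely fail over at the DNS or load-balancer level.

```python
# A minimal sketch of falling back to a second cloud vendor when the primary fails.
import requests

ENDPOINTS = [
    "https://app.primary-cloud.example.com/api/status",    # hypothetical deployment on vendor A
    "https://app.secondary-cloud.example.com/api/status",  # hypothetical deployment on vendor B
]

def fetch_status():
    """Try each deployment in order and return the first healthy answer."""
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, timeout=2)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # this vendor is unreachable; try the next one
    raise RuntimeError("all cloud endpoints are down")

if __name__ == "__main__":
    print(fetch_status())
```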

There will always be bugs. The cloud vendors need to improve their processes and procedures to flush out such critical bugs early on, in the testing phase, and to avoid cascading problems between systems and geographies in production. And customers need to remember that the cloud is not a silver bullet, prepare for disaster, and design their systems accordingly.


Filed under Cloud, IaaS

Amazon, Google Public Clouds Drive Networking to Next Gen

As more enterprises and telcos move their infrastructure to private clouds, they raise the need for advanced networking to match their modern, dynamic and virtualized architectures. This trend is fueled by the recent influx of telcos looking for a carrier-grade private cloud solution to virtualize their IT. These needs from the community took center stage at the OpenStack Summit a couple of weeks ago.

But while the OpenStack community is only now getting to address the next-gen networking needs of the private cloud, the major public cloud providers, the likes of Amazon and Google, have long been facing these challenges.

Amazon’s cloud networking strategy

At last week’s AWS re:Invent annual event in Las Vegas, Amazon shared some of its networking strategy for managing its global IT deployment, with 11 regions and 28 Availability Zones (AZs) across 5 continents. You can read the full technical details in this great article, but the interesting point I find beyond the details is that Amazon realized that traditional networking backbones and paradigms cannot meet the challenges it is facing, and therefore set out to explore next-gen networking for its organization. One such example was cutting the cost of high-end networking equipment. Instead:

it buys routing equipment from original design manufacturers… that it hooks up to a custom network-protocol software that’s supposedly more efficient than commodity gear

Another interesting example was achieving network virtualization by utilizing single-root I/O virtualization (SR-IOV), supporting multiple virtual functions on the same infrastructure while maintaining good network performance.

Amazon didn’t come out with its internal networking strategy for no reason. Amazon’s strategy has been to externalize its networking capabilities as cloud services for its end customers. Five years ago it introduced VPC (Virtual Private Cloud), logically isolated AWS clusters which can be connected to the customer’s data center over VPN. At last year’s AWS re:Invent Amazon announced “Enhanced Networking” for the AWS cloud, providing SR-IOV support on its new high-end instances. Then, in March this year, it announced support for VPC peering within a region, to enable private connectivity between VPCs.
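
For a flavor of what VPC peering looks like from the customer’s side, here is a minimal sketch using the boto3 EC2 client; the VPC IDs, route-table ID and CIDR block are hypothetical placeholders.

```python
# A minimal sketch of requesting, accepting and routing over a VPC peering connection.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request peering between two VPCs in the same region.
resp = ec2.create_vpc_peering_connection(
    VpcId="vpc-11111111",      # requester VPC (placeholder)
    PeerVpcId="vpc-22222222",  # accepter VPC (placeholder)
)
peering_id = resp["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The owner of the peer VPC accepts the request.
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Each side adds a route to the other VPC's CIDR block over the peering link.
ec2.create_route(
    RouteTableId="rtb-33333333",         # placeholder route table
    DestinationCidrBlock="10.1.0.0/16",  # placeholder CIDR of the peer VPC
    VpcPeeringConnectionId=peering_id,
)
```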

Google’s take on cloud networking

While the Stackers had their conference and announcements in Paris a couple of weeks ago, Google ran its own Cloud Platform Live event in San Francisco, where it announced its Google Cloud Interconnect. Google has been investing in its networking for over a decade, and is now starting to externalize some of it as network cloud services, much in response to Amazon’s aforementioned networking services.

Google’s first important announcement was made in March at the Open Networking Summit with the launch of Andromeda, Google’s network virtualization stack, which has now received a new release with increased performance. With its Cloud Interconnect, Google also responded to Amazon with its own capabilities around VPN connectivity (to be GA in Q1 2015) and Direct Peering. It is interesting to note that Google specifically targets telcos, namely access network operators and ISPs, offering to meet the demanding carrier-grade challenge of the telecommunications industry with its global infrastructure and services.

Public clouds heading for network virtualization

Amazon and Google own massive infrastructure and cater to massive and diverse workloads. As such they face the networking challenges and limitations ahead of the market, and lead with innovation around next-gen networking and virtualization. I expect we shall see more work around SDN and network virtualization to meet these challenges, with the private clouds following, and perhaps also taking the lead with telco-driven carrier-grade requirements and workloads.


Filed under Cloud, SDN

Architecting at Scale with Hybrid and Multi Cloud – Wix Case Study

Wix is a great example of a scalable modern hybrid-cloud and multi-cloud architecture for web-based applications. Wix provides an easy way to set up your own site without being an expert web developer. They serve 54+ million websites (growing by 1 million each month), handling 700M requests per day and serving some 800+ TB of data.

In a recent session Wix shared some of their experience architecting their system, showing the process they went through from a single-server monolithic application to a distributed, cloud-based, service-oriented architecture. It is a great practical lesson, containing important principles for architecting at massive scale, such as:

  • No DB transactions. XA transactions will kill your scalability.
  • Denormalized data model. Make the data accessible locally to avoid remote calls or distributed transactions.
  • Immutable data. This pattern has become a common approach for scalable architectures, and is nowadays built into scalable programming languages such as Scala.
  • Hybrid-cloud and multi-cloud infrastructure. Your data center can crash. Amazon’s cloud can also experience major outages. If you want to offer strict levels of resilience and assurance while keeping a very strict SLA (under 100 ms in Wix’s case), you need to use a hybrid model of data centers and multiple cloud vendors (AWS and GCE in Wix’s case).
  • No caching for long-tail services (yes, NO caching). Instead, a multi-layered architecture with a public service, an archive service, CDN, static grid, etc., cascading requests as needed (see the sketch after this list).
  • A good balance between server-side logic and client-side rendering to maintain the SLA on the critical path.
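
As a toy illustration of the cascading approach (not Wix’s actual code), here is a minimal Python sketch that tries each serving layer in order instead of consulting a cache; the layer functions are hypothetical placeholders.

```python
# A minimal sketch of cascading a request through layered services instead of a cache.
from typing import Callable, List, Optional

def from_static_grid(site_id: str) -> Optional[str]:
    return None  # placeholder: miss in the static grid

def from_public_service(site_id: str) -> Optional[str]:
    return None  # placeholder: miss in the public service

def from_archive(site_id: str) -> Optional[str]:
    return f"<html>archived copy of {site_id}</html>"  # placeholder: hit in the archive

# Layers ordered from fastest to slowest.
LAYERS: List[Callable[[str], Optional[str]]] = [
    from_static_grid,
    from_public_service,
    from_archive,
]

def serve(site_id: str) -> str:
    """Return the page from the first layer that has it."""
    for layer in LAYERS:
        page = layer(site_id)
        if page is not None:
            return page
    raise LookupError(f"site {site_id} not found in any layer")

if __name__ == "__main__":
    print(serve("example-site"))
```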

It’s also fascinating to see the learning process the Wix team underwent, and their pragmatic approach of analyzing the real bottlenecks and critical paths and focusing on optimizing them. I highly recommend reading what Wix shared with the community (and thanks, Wix, for sharing with us!):

http://highscalability.com/blog/2014/11/10/nifty-architecture-tricks-from-wix-building-a-publishing-pla.html


Filed under Cloud, Solution Architecture

Virtual networking picks up at OpenStack

Virtual networking was a key theme at this week’s OpenStack Summit in Paris. We saw keynotes addressing it, panels with leading Telco experts on it, and dedicated sessions on emerging open standards such as OpenNFV.

Telcos inherently possess more challenging environments and networking needs, with elaborate inter-connectivity and service chaining, which the Neutron project has not yet adequately addressed. We also see open standards emerging in the industry around SDN and NFV, most notably OpenDaylight, which the OpenStack Foundation still hasn’t decided how to address in a collaborative and complementary fashion. It becomes even trickier in light of competing open standards such as ON.Lab’s Open Network Operating System (ONOS), which was announced just this week.

This lack of standardization in SDN & NFV for OpenStack presents an opportunity for different vendors to offer an open-source solution in an attempt to take the lead in that area, similarly to the way Ceph took the lead and ultimately became the de-facto standard for OpenStack block storage. At this week’s summit we saw two announcements tackling this SDN-for-OpenStack gap: both Akanda and Midokura announced open-source products compatible with OpenStack.

Midokura decided to open-source its core asset MidoNet, which provides a Layer-2 overlay aiming to replace the default OVS plugin in OpenStack. Midokura is targeting the OpenStack community, making its source code available as part of Ubuntu’s OpenStack Interoperability Lab (OIL). OpenStack is also clearly targeted in their announcement:

MidoNet is a highly distributed, de-centralized, multi-layer software-defined virtual network solution and the industry’s first truly open vendor-agnostic network virtualization solution available today for the OpenStack Community.

Akanda, on the other hand, was an open-source project from the beginning. Akanda focuses on Layer-3 virtual routing on top of VMware NSX’s Layer-2 overlay, with support for OpenDaylight and OpenStack. In fact, Akanda is a sort of spin-out of DreamHost, the company that spun out Inktank and brought about Ceph (acquired by Red Hat in April). Will they be able to achieve the same success with Akanda in networking as they did with Ceph in storage?

Telco operators such as AT&T, Huawei and Vodafone are pushing the OpenStack Foundation and community to address the needs of the telecommunications domain and industry. The OpenStack framework has reached enough maturity in its core projects and ecosystem to be able to address the more complex networking challenges and requirements. Backed by the network operators and network equipment providers (NEPs), and with the right collaboration with other open-source projects in the SDN and NFV domains, I expect it to be on the right path to offer a leading virtualization platform for telco and enterprise alike.


Filed under NFV, OpenStack, SDN

OpenStack is getting a hug from VMware and Eucalyptus

What’s OpenStack position in the market?

OpenStack traditionally had competition from both well-established closed-source vendors and other open-source initiatives. Time has passed, and the cloud world has matured. So how is OpenStack doing now?

Want to know how a certain company is positioned in the market? Check what its competitors are saying (and doing) about it. Business analysis 101. So let’s examine a couple of competitors from the closed-source and the open-source fronts, including some very recent announcements.

The closed-source enterprise front: VMware

One of OpenStack’s fiercest rivals in the enterprise virtualization domain is VMware. VMware established its reputation in server virtualization and gained a foothold in all major enterprises, which gives it clear leverage when offering a private cloud for the data center.

Nonetheless, VMware could not afford to ignore the OpenStack wave and has been keeping a presence in the foundation, including active code contributions to the main OpenStack projects and a community page around their integration.
Then it decided to hug OpenStack even closer.
A couple of days ago, during the VMworld 2014 conference, VMware announced its own OpenStack distribution, dubbed “VMware Integrated OpenStack”. VMware says it is

a solution that will enable IT organizations to quickly and cost-effectively provide developers with open, cloud-style APIs to access VMware infrastructure.

VMware even launched a new blog dedicated to OpenStack, where it appeals to developers based on its reputation with developer tools and frameworks, as well as its enterprise experience, promising to be the agile, enterprise-grade way to develop on OpenStack.
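
Since such a distribution exposes the standard OpenStack APIs, a standard client should work against it unchanged. Here is a minimal sketch with the openstacksdk library, where “vmware-vio” is a hypothetical entry in a local clouds.yaml rather than an actual product endpoint.

```python
# A minimal sketch of talking to any OpenStack-API-compatible cloud via openstacksdk.
import openstack

# "vmware-vio" is a hypothetical named cloud configured in clouds.yaml.
conn = openstack.connect(cloud="vmware-vio")

# List compute instances through the standard OpenStack Compute API,
# regardless of which hypervisor backs the deployment.
for server in conn.compute.servers():
    print(server.name, server.status)
```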


The open-source community front: Eucalyptus

VMware is not the only player to recognize OpenStack’s leading position. Eucalyptus is another open-source initiative that competed with OpenStack in its early days for the hearts of the OSS community. One of its strategic moves was to partner with Amazon to provide an AWS-compatible API, enabling hybrid cloud deployments.
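
This compatibility means standard AWS tooling can be pointed at such a cloud simply by overriding the endpoint. Here is a minimal sketch with boto3, where the endpoint URL, credentials and region are hypothetical placeholders.

```python
# A minimal sketch of pointing a standard AWS client at an AWS-API-compatible private cloud.
import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://compute.private-cloud.example.com:8773/",  # placeholder endpoint
    aws_access_key_id="AKIA_PLACEHOLDER",
    aws_secret_access_key="SECRET_PLACEHOLDER",
    region_name="us-east-1",  # placeholder region
)

# The same DescribeInstances call works against AWS or the compatible cloud.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])
```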

A couple of weeks ago Eucalyptus CTO Marten Mickos, the guy who compared OpenStack to the Soviet Union, surprised everyone by stating in his blog that he wants nothing short of becoming an OpenStack contributor. Yes, you heard me right: Eucalyptus wants to help the enemy. In his post he explains the rationale:

I want OpenStack to succeed. When that happens, Eucalyptus can also succeed. OpenStack is (in my humble opinion) the name of a phenomenon of enormous proportions. Eucalyptus is the name of a tightly focused piece of software that serves a unique use case. I am intent on finding and pursuing a mutual benefit. 

It seems Eucalyptus is betting on the complexity of OpenStack and trying to position itself as a less broad but simpler solution. If you have ever tried installing and configuring OpenStack in your own environment, you know that this approach can make a lot of sense. The system integrators sure monetize on that. It will be interesting to see the reactions to Marten’s message at the OpenStack Silicon Valley event next month.

If you can’t beat them, join them

Things look good for OpenStack. Prominent closed-source and open-source competitors are coming to the realization that OpenStack is becoming the de-facto standard for private clouds, and are now embracing it and trying to position themselves as complementary. The game is not over yet. There are still vendors giving a fight, both closed-source enterprise shops such as Microsoft Azure and open-source projects, primarily CloudStack (some would argue also OpenNebula). Things also differ across regions. But in my view the announcements of the past weeks are good evidence in favor of OpenStack.
Who’s next in line to hug OpenStack?


Filed under Cloud, IaaS, OpenStack

Samsung shifts gear on IoT and the Smart Home and acquires SmartThings for $200M

Samsung has a clear interest in the Internet of Things and the Smart Home. With its vast range of consumer devices, it is only natural for Samsung to connect them all together and let you control them via your Galaxy smartphone, tablet or even smart watch (Gear).


At CES 2014 (the Consumer Electronics Show) earlier this year Samsung shared its vision of one service to rule them all, and introduced its new Smart Home service:

in a move that could change the home forever, Samsung announced a new Smart Home service that puts people in control of their devices and home appliances with one application that connects them all.

But how can you presume to control everything if you can’t speak a common language? This is where Samsung started exploring emerging standards. First Samsung teamed up with Intel, Dell and others to form the Open Interconnect Consortium (OIC); then, when Google came along with its own Thread Group initiative, Samsung jumped on that wagon as well. As I stated in my last post, it is still unclear how the different initiatives will relate to one another, so Samsung has hedged its bets on the open-standards front.

While the standards bodies battle for domination, Samsung isn’t waiting, and is making a parallel move on the platform front. In this $200M move, Samsung announced it is acquiring SmartThings, a US-based start-up developing a smartphone app which enables users to monitor and control their domestic affairs even when they are away from home. It is also an open platform, which encourages the developer community and device makers to create new applications and expand the range of uses and smart devices.


Though the developer community is somewhat concerned about the impact of the acquisition on the platform’s openness, SmartThings founder and CEO Alex Hawkinson assures in his blog that “SmartThings will remain SmartThings”. Judging from Samsung’s moves with open standards and open platforms (such as its data and sensor platforms for health monitoring), it seems that Samsung is embracing openness as its main path to market penetration, which is a positive indication for the future of SmartThings.

Samsung seems to be betting heavily on the Internet of Things and the Smart Home. David Eun, head of Samsung’s Open Innovation Center, said that “Connected devices have long been strategically important to Samsung” and that more investments, acquisitions and partnerships around the Internet of Things were planned. At this rate, we won’t have to wait long for their next move.


Filed under Internet of Things, IoT, Smart Home

Will the Internet of Things talk Googlish?

Things definitely change fast in the landscape of the Internet of Things. In my last blog post, less than 2 weeks ago, I discussed standardization efforts in IoT and covered the announcement of a new consortium called the Open Interconnect Consortium (OIC), led by Samsung, Intel, Dell and others.

And just a week later a new heavy gun entered the field: Google announced, through its recently acquired company Nest, a new industry group called Thread, together with Samsung, ARM Holdings and others, to define a communications standard for the smart home. The new standard is said to solve reliability, security, power and compatibility issues for connecting products around the home.


This announcement joins Microsoft’s announcement from the beginning of this month about joining the AllSeen Alliance as its 51st member, which was followed by last week’s announcement of 7 more new members, making the AllSeen Alliance 58 members strong to date (in my last blog post earlier this month they were only 51; just think about it…).

Google’s new consortium joins other industry consortia. How do these different initiatives relate to one another? This question becomes even more interesting when noting that Samsung is a member of both OIC and Thread Group (see footnote), and that Apple’s list of HomeKit partners includes Broadcom (another member of OIC) and Haier (member of AllSeen Alliance).

It may be that in these early stages organizations are reluctant to bet on a single horse and prefer to spread the risk across different consortia. It may also be that some of these initiatives are not really competing but rather complementary. Reading through the statement of the new Thread Group, it seems they are targeting a new networking protocol (to supersede WiFi, Bluetooth and the like) that is more energy-efficient and scalable for IoT, which may be complementary to the mandate declared by OIC, which seems to deal with higher layers. But as statements are very high-level and tend to change, we will have to wait patiently and see how it plays out.

——————————————————

* Update: in a subsequent post I explored Samsung’s play in IoT in greater detail. Read more here.


Filed under Internet of Things, IoT, Smart Home