
5 Key Observability Trends for 2022

The importance of observability has been well established, with organizations relying on metrics, logs, and traces to help detect, diagnose, and isolate problems in their environments. But, like most things in IT, observability is continuing to evolve rapidly — both in terms of how people define it and how they are working to improve observability in practice.

I’ve argued in the past that observability is, at its core, a data analytics problem. The formal definition of observability tends to center on the external outputs of IT systems. I use a slightly broader definition of observability: “The capability to allow a human to ask and answer questions about the system”. I like this definition because it suggests that observability should be incorporated as part of the system design (rather than being bolted on as an afterthought) and because it underscores the need for engineers and system administrators to bring an analytics mindset to the challenge of enabling observability.

In the coming year, look for organizations to deepen and diversify their telemetry data usage, while consolidating their tooling, in an attempt to level up their observability. Technologies such as eBPF and OpenTelemetry will lower the barrier to entry on instrumentation, and matured data analytics practices will enable IT and DevOps teams to identify and respond to issues more quickly and effectively.

Broader adoption of distributed tracing

Many IT and business leaders still don’t realize just how much potential distributed tracing holds, and this represents a huge missed opportunity in the quest to optimize observability. However, the next year is likely to see a significant uptick in adoption. As more organizations migrate their workloads to cloud-native and microservices architectures, distributed tracing will become more prevalent as a means to pinpoint where failures occur and what causes poor performance. Our recent DevOps Pulse survey shows a 38 percent year-over-year increase in organizations’ use of tracing, and 64 percent of respondents who are not yet using tracing said they planned to implement it within the next two years.

Distributed tracing can open up a whole new world of observability into numerous processes beyond IT monitoring, in areas as diverse as developer experience, business, and FinOps. Distributed tracing relies on instrumenting applications with the mechanics of propagating context while executing requests. That same context propagation mechanism can easily serve many other purposes, such as tracking resource attribution or capacity planning information per product line or per customer account.

Data privacy compliance is another extremely useful application of distributed tracing. In light of emerging compliance regulations such as GDPR and CCPA, data privacy is a huge priority, and this challenge is exacerbated by the fact that low-level storage is often unaware of user context. By propagating user IDs from downstream tiers to data storage tiers, distributed tracing can help organizations to better enforce their data privacy policies.
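To make the mechanism concrete, here is a minimal sketch of request-scoped context propagation using Python's standard-library `contextvars`. The function and field names are invented for illustration; a real system would rely on a tracing SDK's context or baggage API rather than hand-rolling this:

```python
import contextvars

# Context that rides along with every call made while handling a request.
request_context = contextvars.ContextVar("request_context", default={})

def handle_request(user_id: str, account: str):
    # Set once at the edge of the system...
    request_context.set({"user_id": user_id, "account": account})
    return storage_tier_write("some-record")

def storage_tier_write(record: str):
    # ...and read many tiers downstream, e.g. to attribute storage usage
    # per customer account or to enforce data privacy policies per user.
    ctx = request_context.get()
    return {
        "record": record,
        "attributed_to": ctx.get("account"),
        "user": ctx.get("user_id"),
    }
```

The storage tier never receives the user ID as an explicit argument; it arrives implicitly via the propagated context, which is exactly what makes tracing context so reusable for attribution and compliance.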

Movement beyond the ‘three pillars’ of observability

Discussions about observability often begin and end with what have come to be called the “three pillars of observability.” These are metrics, logs, and traces. Metrics help to detect problems and let DevOps or site reliability engineers understand what has happened. Logs, then, help to diagnose issues, providing the “why” behind the “what.” Finally, traces help engineers to pinpoint and isolate issues by indicating where they happened within distributed requests and elaborate microservice graphs.

These three pillars continue to be critically important. But it’s important not to be confined by the “three pillars” paradigm and to choose the right telemetry data for your needs. In the coming year, I expect we’ll be seeing more organizations embrace additional types of observability signals, including events and continuous profiling.

It is also important to remember that the “three pillars,” or any other telemetry signal for that matter, are just raw data. As I wrote above, I firmly believe that observability is a data analytics problem, and as such, it is about proactively extracting insights out of that raw data, much as BI analysts do. In December, I interviewed Frederic Branczyk, the founder of Polar Signals and a passionate advocate for observability. He shared the gap he sees in companies today:

We pretend in our observability bubble that everybody has well-instrumented applications with distributed tracing and structured logging. But the reality is, when I look at a typical startup, they may not even be monitoring at all. They’re waiting for their customers to tell them something is wrong before they start investigating.

More momentum behind eBPF

Extended Berkeley Packet Filter, or eBPF, is a technology that allows programs to run in the operating system’s kernel space without changing the kernel source code or loading additional modules. Observability practice today is largely based on manual instrumentation, which requires adding code at relevant points to generate telemetry data. This often presents a significant barrier, and can even prevent some organizations from implementing observability at all. Auto-instrumentation agents do exist, but they tend to be tailored to specific programming languages and frameworks. eBPF, by contrast, allows organizations to embrace no-code instrumentation across their entire software stack, right from the OS kernel level, providing easier observability into their Kubernetes environments and offering additional benefits around networking and security.

Because eBPF works across different types of traffic, it helps organizations meet their goal of unified observability. For instance, DevOps engineers might use eBPF to capture full request bodies for tracing, whether database queries, HTTP requests, or gRPC streams. They can also use eBPF to collect resource utilization metrics, such as CPU usage or bytes sent, allowing the organization to calculate relevant statistics and profile their data to understand the resource consumption of various functions. Additionally, eBPF can handle encrypted traffic.

Netflix recently published a blog about how the company is using eBPF to capture network insights. According to the company, the use of eBPF has been highly efficient, consuming less than one percent of CPU and memory in any instance.

Unification of siloed tools

As observability matures, organizations will increasingly look to holistic observability platforms, favoring these integrated solutions over the more siloed tools that they have used in the past. Compared to stand-alone observability tools, these more holistic platforms can better position developers, DevOps, and SREs to address querying, visualization, and correlation across all of their different telemetry signal types and sources.

We saw this unification trend in the past year, with major vendors such as Grafana Labs, Datadog, AppDynamics, and my company Logz.io coming out of their respective specialty domains in log analytics, infrastructure monitoring, APM, or others, and expanding into a more comprehensive observability offering. We’ll see this trend accelerating in 2022, adapting to the changing observability needs and changing the competitive landscape.

Continued adoption of open source tools and standards

The open source community created Kubernetes (and, essentially, the entire concept of “cloud native”). This same community is now delivering open source tools and standards to monitor these environments. New open standards like OpenMetrics and OpenTelemetry will mature, becoming de facto industry standards in the process. In fact, OpenMetrics may be adopted this coming year as a formal standard by IETF, the premier internet standards organization. The rise of open source tools not only provides companies with additional options for enabling observability, but also prevents the vendor lock-in that has historically plagued some corners of the IT industry.

At the moment, the open source landscape for observability is quite dynamic, with a number of important projects emerging in just the past couple of years. It can sometimes be difficult for DevOps and system administrators to keep these solutions straight (especially because many have adopted the naming convention of “OpenSomething”), but they are beginning to converge. Each day in 2022, we will move closer to something resembling open source standardization — and closer to the ideal of unified observability.


This article was originally published at InfoWorld.com, reprinted with permission from © IDG Communications, Inc.


Filed under DevOps

Open Source for Better Observability

The monitoring challenge in cloud native systems

Monitoring cloud native systems is hard. You’ve got highly distributed apps spanning tens or hundreds of nodes, services, and instances. You’ve got additional layers and dimensions: not just bare metal and OS, but node, pod, namespace, deployment version, the Kubernetes control plane, and more.

To make things more interesting, any typical system these days uses many third-party frameworks, whether open source or cloud services. We didn’t write them, but we need to monitor them nonetheless.

The monitoring challenge comes up often in my discussions with users and customers, as well as in industry surveys. Nearly half of respondents to our 2020 DevOps Pulse survey (44 percent) say that “Monitoring/troubleshooting” is where they find the most difficulties when running Kubernetes in production. 

Observability as a data analytics problem

The way to address the monitoring challenge is with observability. But what is observability in IT systems anyway? Simply put (and formal definitions aside), observability is the capability to ask and answer questions based on telemetry data. The reason I like this definition is that it makes clear that observability is essentially a data analytics problem. We bring telemetry signals of different types and from different sources together into one conceptual data lake, and then ask and answer questions to understand our system.

Observability is typically built on three pillars: Metrics, Logs, and Traces. Let’s see how they tell us the “what”, “why” and “where”, and enable us to answer questions about our system:

Simply put, Metrics help us detect the issues and tell us what happened: Is the service down? Was the endpoint slow to respond? Metrics are essentially numerical data, which is efficient to collect, process, store, aggregate, and manipulate. On the other hand, this numerical data doesn’t contain much context to accompany it. Once the system emits metrics, the backend collects them, aggregates them, stores them in a time series database, and exposes a designated query language for time series data.
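The efficiency point is easy to see in a toy example: because metric samples are just numbers, they collapse into compact aggregates with trivial arithmetic. This is plain Python with invented sample data, not a real time series database:

```python
import math
from statistics import mean

# One numerical sample per scrape interval: (timestamp, latency in ms).
samples = [(1, 120), (2, 95), (3, 410), (4, 88), (5, 101)]

values = sorted(v for _, v in samples)

# Aggregates a metrics backend would serve via its query language:
avg_latency = mean(values)
# Nearest-rank 95th percentile over the sorted values.
p95_latency = values[math.ceil(0.95 * len(values)) - 1]
```

A real TSDB adds labels, downsampling, and retention policies on top, but the underlying operations are numerical aggregations like these.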

Next, Logs help us diagnose the issues and tell us why they happened. Logs are perfect for that end, as the developer who writes the application code outputs all the relevant context for that code into logs. Being textual and verbose, however, logs take up more storage space, and require parsing and full-text indexing to support ad-hoc queries by any field in the logs.
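As a sketch of what "query by any field" means in practice, here is a tiny structured-log search in plain Python. The log lines and field names are invented; a real backend parses and indexes at scale rather than scanning linearly:

```python
import json

# Structured (JSON) log lines carrying the developer's context.
log_lines = [
    json.dumps({"level": "error", "service": "checkout",
                "msg": "payment declined", "order_id": "o-1"}),
    json.dumps({"level": "info", "service": "checkout",
                "msg": "order placed", "order_id": "o-2"}),
]

def search(lines, **fields):
    # Ad-hoc query by arbitrary fields, the way a log backend does
    # after parsing each line into a structured record.
    out = []
    for line in lines:
        record = json.loads(line)
        if all(record.get(k) == v for k, v in fields.items()):
            out.append(record)
    return out
```

Emitting logs as structured records in the first place is what makes this kind of field-level querying cheap; free-text logs force the backend to recover that structure with parsing rules.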

Finally, Traces help us isolate issues and tell us where they happened. As a request comes into the system, it flows through a chain of interacting microservices, which we can trace with Distributed Tracing. Each call in the chain creates and emits a span for that service and operation (think of it as a structured log), which includes context such as start time, duration, and parent span. This context is propagated through the call chain. A tracing backend then collects the emitted spans and reconstructs the trace according to causality. The backend then visualizes the trace, typically with the familiar timeline view (Gantt chart), for further trace analysis.
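The collect-and-reconstruct step can be sketched in a few lines of plain Python. The span records and service names below are invented for illustration; a real backend handles out-of-order arrival, sampling, and multi-root traces:

```python
from collections import defaultdict

# Spans as they might arrive at the backend: structured records with
# an id, a parent id (None for the root), and timing context.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "start": 0,  "duration": 30},
    {"span_id": "b", "parent_id": "a",  "service": "cart",     "start": 2,  "duration": 10},
    {"span_id": "c", "parent_id": "a",  "service": "payment",  "start": 13, "duration": 12},
]

def build_trace(spans):
    # Reconstruct the trace tree by causality: each span hangs off
    # the span named in its parent_id.
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children[span["parent_id"]].append(span)
    return root, children
```

The timeline (Gantt) view is then just this tree rendered with each span's start offset and duration.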

Role of Open Source: Success and Challenges

Now let’s move to the role of open source in observability, and which open source projects lead the domain.

Open source is the new norm 

Open source is the new norm, with 60 percent of organizations using open source monitoring tools, according to 451 Research. According to the Cloud Native Computing Foundation (CNCF), the most commonly adopted observability tools are open source, as shown in its End User Technology Radar. And Gartner predicts that, by 2025, 70 percent of new cloud-native application monitoring will use open source instrumentation rather than vendor-specific agents, for improved interoperability.

Tool sprawl is a serious challenge

But the wealth of available observability tools creates a consolidation issue. Half of companies are using five or more tools, while a third are using ten or more, according to the CNCF. Tool sprawl is a challenge not just for operating and managing the tools, but for observability itself: observability is at its core a data analytics problem, and every additional tool creates another data silo.

Relicensing is changing OSS landscape

Another new challenge we’re seeing is OSS project relicensing. In the past year alone, we’ve witnessed several leading OSS projects move to a more restrictive license, whether a copyleft license (such as GNU AGPL) or even a non-open-source license (non-OSI-compliant, such as SSPL). Typically the relicensing is done by a vendor that controls the project, not by a foundation. It can mean that the source code is still available, but your usage or modification of it is restricted, or you may even need to open-source your own code in some cases.

This pushes some users to look for alternatives. These users include other OSS projects that cannot consume such licenses, as well as commercial companies such as Google, which bans use of AGPL and several other licenses. Google Open Source says of AGPL that “the risks heavily outweigh the benefits”.

The leading open source tools for logs, metrics and traces

The open source landscape for observability is quite dynamic. Many of the OSS projects emerged as recently as the past couple of years. Amusingly, many are called OpenSomething, which adds quite a bit of confusion to the mix. Let’s go through the open source projects according to the signal types:

Open Source Software for Metrics

  • Prometheus, a CNCF graduate project, is a monitoring system with a dimensional data model, flexible PromQL query language, efficient time series database, and modern alerting approach with AlertManager;
  • OpenMetrics, another CNCF project, offers a format for exposing metrics, which has become a de-facto standard across the industry; and 
  • Grafana, a project by Grafana Labs, offering a powerful analytics and visualization tool that’s exceptionally popular in combination with Prometheus.
    Relicensing update: In April 2021, the Grafana project was relicensed from Apache 2.0 to AGPLv3 by Grafana Labs.
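The exposition format these projects share is plain text scraped over HTTP. As a hand-rolled sketch of that format (real exporters use a client library such as `prometheus_client` rather than building strings, and the metric below is invented):

```python
def render_metric(name, help_text, mtype, samples):
    # Emit the HELP/TYPE metadata lines, then one sample line per
    # label set, in the Prometheus/OpenMetrics text style.
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

exposition = render_metric(
    "http_requests_total", "Total HTTP requests.", "counter",
    [({"method": "get", "code": "200"}, 1027)],
)
```

Prometheus scrapes exactly this kind of text endpoint, which is why a single exposition standard lets so many tools interoperate.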

Open Source Software for Logs

  • ELK Stack, led by Elastic B.V., has been the leading open source choice for a good few years. It comprises the Elasticsearch distributed text data store, the Logstash data collection and processing engine, and the Kibana visualization tool;
    Relicensing update: In February 2021, the Elasticsearch and Kibana projects were relicensed from Apache 2.0 to a non-OSS dual license (SSPL and the Elastic License) by Elastic B.V.
  • OpenSearch, a fork of the Elasticsearch and Kibana OSS projects aimed at keeping these popular projects open source. The project is led by AWS, which also contributed Open Distro for Elasticsearch, a set of open source plugins for Elasticsearch; and
  • Loki, led by Grafana Labs, is a log aggregation system specialized for interoperability with Prometheus. Loki doesn’t perform full-text indexing; instead, it indexes only labels, in the same fashion as Prometheus.
    Relicensing update: In April 2021, the Loki project was relicensed from Apache 2.0 to AGPLv3 by Grafana Labs.

Open Source Software for Traces

  • Jaeger, a distributed tracing system released as open source by Uber Technologies, and now a CNCF graduated project;
  • Zipkin, a veteran Java-based distributed tracing system for collecting and looking up data from distributed systems; and
  • SkyWalking, an open source APM system with monitoring, tracing, and diagnostic capabilities for distributed systems in cloud-native architectures.

Unified telemetry collection with OpenTelemetry

Having a variety of tools to choose from also brings up a challenge in telemetry data collection. Organizations find themselves juggling multiple libraries for logging, metrics, and traces, with each vendor having its own APIs, SDKs, agents, and collectors.

OpenTelemetry is a novel project under the CNCF that offers a unified set of vendor-agnostic APIs, SDKs and tools for generating and collecting telemetry data, and then exporting it to a variety of analysis tools. The beauty of OpenTelemetry is that it offers an observability framework that works across metrics, traces and logs. You get one API and SDK per programming language for extracting all of your application’s observability data, together with a standard collector, a transmission protocol (OTLP) and more.
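One concrete piece of that standardization is context propagation over HTTP: OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header. The sketch below builds and parses such a header by hand to show its shape; real code would use the OTel propagator API rather than string handling:

```python
import re
import secrets

def make_traceparent():
    # traceparent = version "-" trace-id "-" parent-id "-" trace-flags
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag 01

TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)

def parse_traceparent(header):
    # A receiving service parses the header to continue the same trace.
    m = TRACEPARENT_RE.fullmatch(header)
    if not m:
        raise ValueError("malformed traceparent header")
    return {
        "version": m.group(1),
        "trace_id": m.group(2),
        "span_id": m.group(3),
        "flags": m.group(4),
    }
```

Because every OTel-compatible service reads and writes this one header format, a trace can cross services written in different languages and instrumented with different vendors' SDKs.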

OpenTelemetry (or OTel, as it’s commonly nicknamed) was created under the CNCF out of the merger of the OpenCensus and OpenTracing projects, and was officially accepted into CNCF incubation in August 2021. More importantly, the project is widely adopted by all the major vendors, monitoring tools, cloud providers, and many others. As such, it’s well positioned to become the go-to platform for generating and collecting observability data.

Open source standards such as OpenTelemetry and OpenMetrics are converging the industry, preventing vendor lock-in and bringing us a step closer to unified observability. I expect we’ll see these projects become de-facto standards, along with additional efforts toward unified observability that address data storage, querying, correlation, and other aspects.

The future looks bright for open source based observability. Join us in the community effort and together we can make it happen.


This article originally ran on Container Journal on September 28, 2021


Filed under DevOps, Open Source