“Watch, monitor, and observe!” on the Oracle Cloud
Monitoring, or “Observability” as it is called in today’s cloud terminology, is important in complex cloud environments: it lets you discover and interpret information about how your systems and applications are doing.
Site Reliability Engineers and Observability Engineers are among the new roles companies are asking for these days to manage cloud environments and the applications running in them. Traditionally, monitoring was done with a collection of separate tools, each with its own strong and weak points, and it wasn’t always easy to combine all the log, trace and monitoring data from the variety of systems in a company’s IT landscape into something useful.
In fact, having some or all parts of your IT infrastructure in the cloud gives you a single pane of glass to manage all your resources, whether they are pure infrastructure, platform or apps.
Observability vs Monitoring
Observability goes beyond monitoring (even of very complicated infrastructures) and is instead about building visibility across all kinds of different components in your infrastructure. It creates insights on several levels, business or technical. Collecting data about your systems, interpreting it, and deciding what to do with it helps companies grow and learn about possible failures and upcoming bottlenecks in their applications.
The following picture shows an analysis by Forrester of the rate of application errors discovered by customers instead of by IT departments.
This situation occurs frequently. What we would rather do is tell customers that their applications are still running well, because we intervened at possible bottlenecks before they turned into errors.
© Oracle — Study by Forrester on behalf of Oracle — 2019
So what about observability? Here’s one statement:
“Observability is about getting answers to questions that we didn’t know we’d have to ask. When I think about observability, I’m thinking from top down and bottom up. It’s the actionable insights you collect from your entire system, not just one piece, that tells you the health of your environment”
— Brent Miller, Senior Director of Cloud Operations, Quantum Metri
With observability, a lot more information comes to the surface that might previously have gone unattended or unused. The information not only becomes more visible, it is also interpreted and analyzed so that action can be taken, with the proper measures for the situation at hand. An example could be an autoscaling action for a system that is running out of resources.
In essence: monitoring tells you how a system is behaving; observability is about what you do with the collected data in terms of analysis and actions.
Combining the two in every layer of your systems and way of working makes for a strong and powerful approach. It also means embracing a culture, a mindset that values the ability to understand the total behavior of a system. Observability goes beyond looking at the metrics and alerts for every single component of an application or system and looks at the totality of it.
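The autoscaling example can be sketched in a few lines of Java. This is only an illustration of going from collected data to an action, not an OCI API: the class name, the 80% and 30% thresholds, and the instance limits are all assumptions I picked for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn monitored CPU samples into a scaling decision.
// Thresholds and limits are illustrative assumptions, not OCI defaults.
class AutoscaleSketch {
    static final double SCALE_OUT_ABOVE = 80.0; // average CPU % that triggers scale-out
    static final double SCALE_IN_BELOW = 30.0;  // average CPU % that triggers scale-in
    static final int MIN_INSTANCES = 1;
    static final int MAX_INSTANCES = 10;

    // Monitoring side: reduce the raw samples to a single number.
    static double average(List<Double> cpuSamples) {
        double sum = 0.0;
        for (double sample : cpuSamples) {
            sum += sample;
        }
        return cpuSamples.isEmpty() ? 0.0 : sum / cpuSamples.size();
    }

    // Acting on the data: decide how many instances we want to run.
    static int desiredInstances(int current, List<Double> cpuSamples) {
        if (cpuSamples.isEmpty()) {
            return current; // no data, so no action
        }
        double avg = average(cpuSamples);
        if (avg > SCALE_OUT_ABOVE) {
            return Math.min(current + 1, MAX_INSTANCES);
        }
        if (avg < SCALE_IN_BELOW) {
            return Math.max(current - 1, MIN_INSTANCES);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Double> busy = Arrays.asList(85.0, 92.5, 88.0);
        System.out.println("desired instances: " + desiredInstances(2, busy));
    }
}
```

In a real setup the samples would come from your metrics service and the decision would be handed to the platform's autoscaling configuration rather than computed by hand.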
Technology
There is a wide range of technologies in the “Monitoring & Observability” area, but I will try to focus on two things: a little bit about open source, and what a company like Oracle has to offer as a solution for observability in the cloud.
Now you might think: Oracle? The $$$ company whose primary focus is databases and enterprise software? The company for which cloud was a no-go in the early days? Well, that has changed dramatically over the past few years. According to the ranking of top cloud providers, Oracle is now at position 5 (https://accelerationeconomy.com/cloud-wars-top-10/).
Besides that, Oracle is also a contributor to the open source community with many projects, and a contributor to and platinum member of CNCF and the Linux Foundation, developing cloud native applications and solutions.
Different methods
In cloud native, but also in DevOps and traditional operations, tracking the state of applications is critical, and observability, telemetry, and application performance management (APM) all make that possible. However, each supports IT teams with a different focus.
Telemetry is the method used to collect all the necessary data (logs, events, metrics, alerts) transactionally across different systems, especially in dynamic environments such as cloud native environments. OpenTelemetry is an incubating open source project focusing on cloud native telemetry, with a set of tools and SDKs compatible with several frameworks and languages.
Besides this, there are also vendor-specific tools (which can be open source too, or commercial).
APM, Application Performance Monitoring, is a method that was already practiced in traditional operations. There are no strict boundaries between APM and observability, and there is surely some overlap. The difference lies in the focus and the depth of insight a particular team needs to manage its applications.
Both use telemetry to collect data across systems. APM focuses on a more high-level way of tracking system health and end-to-end monitoring of an application’s transactions, while observability goes deeper into the technical details for root cause analysis.
Performance monitoring of an application is extremely important. Observability drills down deeper into performance monitoring by providing the “why” behind possible performance bottlenecks.
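To make the difference between the two views a bit more concrete, here is a small sketch in plain Java. It takes one transaction broken down into timed steps, and shows the high-level number (total duration) next to the drill-down (which step caused it). The step names and durations are made-up example values, and the class is hypothetical, not part of any APM product.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one transaction broken down into timed steps.
// Step names and durations are made-up example values.
class DrillDownSketch {
    // High-level (APM-style) view: total duration of the transaction.
    static long totalMillis(Map<String, Long> stepMillis) {
        long total = 0;
        for (long ms : stepMillis.values()) {
            total += ms;
        }
        return total;
    }

    // Drill-down view: which step contributed most to the total.
    static String slowestStep(Map<String, Long> stepMillis) {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : stepMillis.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest; // null when there are no steps
    }

    static Map<String, Long> exampleTransaction() {
        Map<String, Long> steps = new LinkedHashMap<>();
        steps.put("http-request", 12L);
        steps.put("auth-check", 8L);
        steps.put("db-query", 430L);
        steps.put("render-response", 15L);
        return steps;
    }

    public static void main(String[] args) {
        Map<String, Long> steps = exampleTransaction();
        System.out.println("total ms: " + totalMillis(steps));
        System.out.println("slowest step: " + slowestStep(steps));
    }
}
```

The total tells you that the transaction is slow; the per-step breakdown tells you where to look.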
APM in the Oracle Cloud Infrastructure (OCI APM)
You can set everything up yourself using the tools you prefer as a team, such as the well-known Prometheus, Grafana, OpenSearch and more, but in the Oracle Cloud there’s also an end-to-end service available with which you can quickly and rather easily set up what you need for your existing and new cloud native applications and systems.
It’s all about “instrumenting and injecting”
To set up OCI APM for your cloud native application, you have to tell your application or system where its collected data will be stored, analyzed and interpreted. Whether it’s a microservice, a containerized application or platform, or Kubernetes doesn’t matter.
Below is a typical setup of a traditional application server, containerized and running on the managed Kubernetes service in OCI.
Instrumenting here happens by adding the specific Java agent to the WebLogic domain, so that all the monitoring data is collected in the OCI APM domain. The agent binaries are on persistent storage, so all pods can access them at runtime and they can easily be upgraded to newer versions. The proper version of the APM agent can be downloaded from OCI.
Injecting is done by applying a properties file for the agent that contains the following:
- com.oracle.apm.agent.rum.enable.injection=true
- com.oracle.apm.agent.public.data.key=<Public data key of your APM domain>. These keys (public and private) need to be set up during the setup of your APM domain instance
- com.oracle.apm.agent.rum.web.application=name of the application
- com.oracle.apm.agent.rum.service.name=Just a name
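If you prefer to generate that file instead of editing it by hand, the four properties above can be written with the standard java.util.Properties class. This is just a sketch under my own assumptions: the class and method names are hypothetical, the application and service names are example values, and the data key stays a placeholder you have to fill in yourself.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper that builds the agent properties listed above.
// Only the property keys come from the article; everything else is an example.
class AgentPropertiesSketch {
    static Properties rumProperties(String publicDataKey, String webApplication, String serviceName) {
        if (publicDataKey == null || publicDataKey.trim().isEmpty()) {
            throw new IllegalArgumentException("public data key of the APM domain is required");
        }
        Properties props = new Properties();
        props.setProperty("com.oracle.apm.agent.rum.enable.injection", "true");
        props.setProperty("com.oracle.apm.agent.public.data.key", publicDataKey);
        props.setProperty("com.oracle.apm.agent.rum.web.application", webApplication);
        props.setProperty("com.oracle.apm.agent.rum.service.name", serviceName);
        return props;
    }

    // Render in .properties format; write this string to the agent's properties file.
    static String render(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, "APM agent settings (sketch)");
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected when writing to memory
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = rumProperties("<public data key of your APM domain>", "my-web-app", "my-service");
        System.out.print(render(props));
    }
}
```

Where the file has to live depends on how the agent was provisioned in your domain, so check that against your own setup.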
This finally sets up the dashboard for your applications, and the collected data will start coming in after a while. From here you can go on to interpret all this data for your team’s purposes: setting alerts, triggers and rules on events, and determining what to do in a given circumstance.
Microservices
When running microservices, observability is even more important: amongst the multitude of services and service versions that are running, you need to know what’s happening and when to act, or to set up automated actions after some event. This can be done with all the collected data around logs, traces and metrics. In some cases probes can be used too. You can see them as an endpoint on which a Kubernetes cluster or a server can report an “Alive” event, or check whether a component is ready to receive traffic.
For the OCI APM domain there is in fact no difference in what data comes in. Only the setup is slightly different; the concepts of instrumenting and injecting remain the same.
Here, we just inject our APM data upload endpoint and data key into the application’s YAML file:
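As an illustration of such probes, here is a minimal sketch using the HTTP server that ships with the JDK. The paths /health/live and /health/ready, the port, and the class itself are my own assumptions for the example; frameworks such as Helidon normally provide health endpoints for you.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of liveness and readiness endpoints.
// Paths and port are assumptions made for this example.
class ProbeSketch {
    static final AtomicBoolean ready = new AtomicBoolean(false);

    // Liveness: the process is up. Readiness: it can accept traffic.
    static int statusFor(String path, boolean isReady) {
        if ("/health/live".equals(path)) {
            return 200;
        }
        if ("/health/ready".equals(path)) {
            return isReady ? 200 : 503;
        }
        return 404;
    }

    // Starts the probe endpoints; call this from your application's startup code.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            int status = statusFor(exchange.getRequestURI().getPath(), ready.get());
            byte[] body = (status == 200 ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) {
        // Shows the decision logic only; start(8080) would expose it over HTTP.
        System.out.println("live:  " + statusFor("/health/live", false));
        System.out.println("ready: " + statusFor("/health/ready", false));
        System.out.println("ready after init: " + statusFor("/health/ready", true));
    }
}
```

In an application you would call start(8080) during startup and flip ready to true once initialization has finished, so Kubernetes only routes traffic to the pod from that moment on.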
tracing:
  name: "Microservice APP APM Tracer"
  service: "mcs-http2"
  data-upload-endpoint: <data upload endpoint of the OCI APM domain>
  private-data-key: <private data key of the OCI APM domain>
  collect-metrics: true
  collect-resources: true
  properties:
    - key: com.oracle.apm.agent.log.level
      value: INFO
You can adjust the log parameters to your own requirements.
Specifically for my Helidon microservice app, I need to add the import
import io.helidon.tracing.TracerBuilder;
and, in the startServer() method, the call
.tracer(TracerBuilder.create(config.get("tracing")).build())
needs to be added before rebuilding the application. I won’t go into too much detail, because it can be different for your situation. What matters is that you instrument your application well, whether for traces or spans.
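If the terms trace and span are new to you: a span is one timed unit of work, and spans nest inside each other to form a trace. The sketch below shows that idea in plain Java. It is not the Helidon or OCI APM API, just a hypothetical, single-threaded toy recorder I wrote to illustrate the concept.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical toy recorder illustrating nested spans; not a real tracing API.
// Not thread-safe: a real tracer keeps the active span per thread or request.
class SpanSketch {
    static final class Span {
        final String name;
        final String parent; // null for the root span
        final long nanos;

        Span(String name, String parent, long nanos) {
            this.name = name;
            this.parent = parent;
            this.nanos = nanos;
        }
    }

    private final List<Span> finished = new ArrayList<>();
    private final Deque<String> active = new ArrayDeque<>();

    // Runs the work inside a span and records its duration and parent.
    <T> T inSpan(String name, Supplier<T> work) {
        String parent = active.peek(); // null when no span is active
        active.push(name);
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            active.pop();
            finished.add(new Span(name, parent, elapsed));
        }
    }

    List<Span> finishedSpans() {
        return finished;
    }

    public static void main(String[] args) {
        SpanSketch tracer = new SpanSketch();
        String result = tracer.inSpan("handle-request",
                () -> tracer.inSpan("load-greeting", () -> "Hello"));
        System.out.println(result);
        for (Span span : tracer.finishedSpans()) {
            System.out.println(span.name + " parent=" + span.parent + " nanos=" + span.nanos);
        }
    }
}
```

Inner spans finish first, so the recorder lists load-greeting before handle-request. A real tracer additionally ships the finished spans to a backend, which is what the data upload endpoint in the configuration is for.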
Conclusions
There is a lot more to tell and to know about monitoring and observability. For years this wasn’t such a “hot” topic, and it usually came up somewhere near the end of a project. But since the rise of cloud and the bringing together of Dev and Ops, the need for visibility into the health of an application has become more and more important, in a world where billions of users and devices are connected and outages should be avoided at all times.
Final words, I hope 2023 brings you a lot of health, happiness and loads of interesting tech stuff!