Monitoring is very important, especially in DevOps. We monitor not only the system with all its characteristics, but also team behavior, to support improvements.
Areas of monitoring are:
- Team telemetry
Understand the maturity of the team and how the team can improve. One could measure and report many different aspects (KPIs) per team.
- Live site telemetry
Measure how the system runs and how the platform behaves, and manage the logs. The different platforms offer capabilities to capture these metrics. Some platforms also offer the needed reporting capabilities and notification mechanisms. When using a multi-cloud or hybrid-cloud implementation, it is recommended to assess whether the tools in use can aggregate the data and report it as a single view.
- Cognitive monitoring
Automate and improve IT operations by applying machine learning to the log data. A next step is optimizing the operation of systems and their ability to scale. Adopting artificial intelligence for IT operations can cover two aspects. The first is analyzing the telemetry data to learn the normal behavior of the system and to be notified of anomalies. The second is proactive interaction between the AI model and the system, based on predictions of its behavior.
- Security monitoring
Security monitoring involves collecting and analyzing information to detect suspicious behavior or unauthorized system changes on your network, defining which types of behavior should trigger alerts, and acting on alerts as needed. Often, commercial state-of-the-art tools to measure, report and notify are used in DevOps.
- User telemetry
In DevOps, user sentiment and behavior are the most informative indicators of success. Measuring user interaction is very important and often forgotten.
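The anomaly notification described under cognitive monitoring can be illustrated with a very small sketch. It is not machine learning: it uses a plain z-score against a baseline as a statistical stand-in, and the function name `find_anomalies`, the threshold of 3 standard deviations, and the latency values are assumptions made for illustration only.

```python
from statistics import mean, stdev


def find_anomalies(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard
    deviations away from the mean of the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation counts as an anomaly.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]


# Hypothetical request latencies in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99, 103, 97, 100]
incoming_latency_ms = [101, 250, 99]

anomalies = find_anomalies(baseline_latency_ms, incoming_latency_ms)
print(anomalies)
```

A real implementation would typically learn the baseline continuously and account for seasonality, which this sketch deliberately leaves out.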
In modern IT systems, knowing what is going on in your system is crucial. Is the IT system still behaving as expected? IT systems consist of multiple components that need to interface and cooperate to work correctly and deliver the pursued business value.
To create this insight, logging, tracing, and metrics should be in place throughout the IT system, including its infrastructure components.
The term observability is a container concept for various subjects around the state of an IT system. Logging, tracing, and metrics are the cornerstones of observability.
The information they provide is combined to create a coherent view of the state of the IT system. The data is used for reactive and proactive maintenance, auditability, controllability, and debugging purposes.
In the section about logging & tracing we already refer to the importance of logging functional system indicators. To be able to use logging for observability purposes, one additional requirement needs to be met: The logging should contain contextual information.
Every log entry should refer to a request or an autonomous work package, for example through a customer, contract, or order ID. Log entries without this contextual information only clutter the log store and add no value. Although contextual data is critical for effective observability, remember that GDPR rules must be adhered to, so not all desired references may be allowed.
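As a minimal sketch of contextual logging, the example below attaches an order ID to every log entry using Python's standard `logging.LoggerAdapter`. The logger name `orders`, the helper `logger_for_order`, and the ID `ORD-1001` are hypothetical; an in-memory stream is used only to keep the example self-contained.

```python
import io
import logging

# In-memory log target; a real system would ship entries to a log store.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s order_id=%(order_id)s %(message)s")
)

base_logger = logging.getLogger("orders")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False  # keep the example output out of the root logger


def logger_for_order(order_id):
    """Return a logger that stamps every entry with the given order ID."""
    return logging.LoggerAdapter(base_logger, {"order_id": order_id})


log = logger_for_order("ORD-1001")
log.info("payment accepted")

output = stream.getvalue()
print(output, end="")
```

The sketch uses a technical order ID rather than personal data, which is one way to keep the context useful while limiting what ends up in the log store.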
The log entries should be collected and stored in chronological order and have the correct severity level.
Keep in mind that observability serves multiple purposes and thus needs different severities to serve those purposes. E.g., debug level log entries are helpful for anomaly analysis but are not required to get insight into the overall health of a service or component.
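One way to serve several purposes from the same log entries is to route them by severity. The sketch below, using only Python's standard `logging` module, sends everything to a debugging target but only warnings and above to a health target. The logger name `inventory` and the two in-memory streams are assumptions for illustration.

```python
import io
import logging

health_stream = io.StringIO()  # stands in for a health dashboard feed
debug_stream = io.StringIO()   # stands in for a detailed analysis store

health_handler = logging.StreamHandler(health_stream)
health_handler.setLevel(logging.WARNING)  # overall health: warnings and above

debug_handler = logging.StreamHandler(debug_stream)
debug_handler.setLevel(logging.DEBUG)     # anomaly analysis: everything

logger = logging.getLogger("inventory")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(health_handler)
logger.addHandler(debug_handler)

logger.debug("cache miss for item 42")
logger.warning("stock level below minimum")

health_lines = health_stream.getvalue().splitlines()
debug_lines = debug_stream.getvalue().splitlines()
print(health_lines)
print(debug_lines)
```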
Metrics and telemetry
A metric is a specific log entry that gives information about a predefined activity or a technical process. There are metrics for functional and non-functional items, which serve different purposes and goals.
Common metrics collected by telemetry follow the RED method [Yocum 2021]:
- Request rate
- Error rate
- Duration of requests
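The three RED metrics above can be derived from raw request records. The following is a sketch under stated assumptions: the `RequestRecord` type, the function `red_metrics`, the choice of the median as the duration summary, and the definition of error rate as the fraction of failed requests are all illustrative choices, not something the RED method prescribes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    duration_ms: float  # how long the request took
    failed: bool        # whether the request ended in an error


def red_metrics(records, window_seconds):
    """Summarize request rate, error rate, and duration for one time window."""
    total = len(records)
    errors = sum(1 for r in records if r.failed)
    return {
        "request_rate_per_s": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "median_duration_ms": median(r.duration_ms for r in records) if total else 0.0,
    }


# Hypothetical requests observed during a 2-second window.
records = [
    RequestRecord(100.0, False),
    RequestRecord(200.0, False),
    RequestRecord(300.0, True),
    RequestRecord(400.0, False),
]
metrics = red_metrics(records, window_seconds=2)
print(metrics)
```

In practice, percentiles such as the 95th or 99th are often reported alongside or instead of the median, because slow outliers matter most to users.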
Tracing is the practice of tracking (following) a request or autonomous work package throughout the IT system.
Following a request through multiple components of the IT system requires a trace ID to which all information can be bound.
Tracing is used for different goals:
- Debugging and problem tracking
- Collecting performance metrics of individual components
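Binding information to a trace ID can be sketched within a single process using Python's standard `contextvars` module. The component names, the `record_span` helper, and the in-memory `collected_spans` list are hypothetical; across separate services the trace ID would additionally have to travel with the request, for example in a message header.

```python
import contextvars
import uuid

# Holds the trace ID of the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
collected_spans = []  # stands in for a trace store


def record_span(component, message):
    """Store one trace entry bound to the current trace ID."""
    collected_spans.append(
        {"trace_id": trace_id_var.get(), "component": component, "message": message}
    )


def check_stock(order_id):
    record_span("inventory", f"checked stock for {order_id}")


def charge_customer(order_id):
    record_span("billing", f"charged customer for {order_id}")


def handle_request(order_id):
    """Assign a fresh trace ID and follow the request through all components."""
    token = trace_id_var.set(uuid.uuid4().hex)
    try:
        record_span("gateway", f"received {order_id}")
        check_stock(order_id)
        charge_customer(order_id)
        return trace_id_var.get()
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request


trace_id = handle_request("ORD-1001")
for span in collected_spans:
    print(span)
```

Because every entry carries the same trace ID, the entries of one request can later be grouped for debugging or for per-component timing.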
Warning: Generating high volumes of trace data (especially useless trace data) can have severe negative effects on the IT system, such as performance issues.
Combining all the information from metrics, logs, and traces into a single dashboard delivers valuable information to the complete cross-functional DevOps team.
Different views can be configured to fit the needs of a team member.
Alerting systems notify the appropriate team members if thresholds are exceeded or too many anomalies are detected (see Reporting & Alerting).
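The threshold check behind such an alert can be sketched as follows. The function `evaluate_alerts`, the metric names, and the threshold values are assumptions for illustration; a real alerting system would also handle routing, deduplication, and escalation.

```python
def evaluate_alerts(metrics, thresholds):
    """Return one alert message for each metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


# Hypothetical thresholds and current measurements.
thresholds = {"error_rate": 0.05, "median_duration_ms": 500}
current_metrics = {"error_rate": 0.25, "median_duration_ms": 250.0}

alerts = evaluate_alerts(current_metrics, thresholds)
for alert in alerts:
    print(alert)
```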
Why should we care about observability? First of all, we release to many environments, so observability can mean better support in development, testing, and production alike. It also strengthens the team’s ability to understand the production situation: real users under real load always uncover interesting new behaviors, and we should all be listening for them.
[Yocum 2021] The RED method: A new strategy for monitoring microservices, Tim Yocum Euteneuer, 4 November 2021.