Anomaly management | DevOps

Anomaly management in DevOps

When there is a difference between the expected behavior and the actual behavior of a system, we call this an anomaly. An anomaly has to be investigated, and once its cause is found, it can be fixed. If the anomaly was caused by a fault in the IT system, then a change to the IT system is needed to fix the problem (and data in the database may need to be corrected as well). If the anomaly was caused by an error during testing, then the test has to be changed.

After fixing an anomaly, the test involved is retested to confirm that the fix has indeed solved the anomaly. At the same time, a regression test is often done to confirm that the fix did not introduce any new problems.

anomaly
An anomaly is a difference between the expected behavior and the actual outcome of a test. This is registered so that the cause can be analyzed and resolved.

A problem is a cause, or potential cause, of one or more anomalies or incidents.

Many words are used as synonyms for anomaly (e.g. defect, incident, issue, bug, problem, finding). We have chosen to use the IEEE1044 term anomaly because it is the most unprejudiced and unbiased. [IEEE1044]

Anomalies can be the result of dynamic testing but also the outcome of reviews and other static testing activities.

Anomaly handling in a light-weight process

Any anomalies discovered are communicated during the sprint, and the measures to be taken are discussed and implemented. Anomalies that can be solved directly do not have to be registered. In such a case, the tester and the developer collaborate to solve the anomaly quickly and test the solution immediately. This keeps documentation to a minimum and the focus on progress.

To safeguard against the administration of anomalies becoming a main activity, the following guidelines are provided. An anomaly is recorded in the anomaly administration when:

  • the anomaly cannot be solved directly
  • the team decides – in consultation with the product owner – to fix the anomaly in another sprint
  • an anomaly discovered during the sprint review cannot be rectified in the sprint review and can therefore not be solved before the sprint ends.
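The three guidelines above can be read as a simple decision rule. The following is a minimal sketch of that rule, not a prescribed implementation; the class `AnomalyContext`, its field names and the function `should_register` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnomalyContext:
    """Hypothetical inputs to the registration decision (illustrative names)."""
    solvable_directly: bool          # tester and developer can fix and retest right away
    deferred_to_other_sprint: bool   # team and product owner agreed to fix it in another sprint
    found_in_sprint_review: bool     # the anomaly was discovered during the sprint review
    fixed_in_sprint_review: bool     # it was rectified during that same review


def should_register(ctx: AnomalyContext) -> bool:
    """Return True when one of the three guidelines calls for recording the anomaly."""
    if not ctx.solvable_directly:
        return True
    if ctx.deferred_to_other_sprint:
        return True
    if ctx.found_in_sprint_review and not ctx.fixed_in_sprint_review:
        return True
    return False


# An anomaly fixed on the spot during the sprint needs no registration.
quick_fix = AnomalyContext(True, False, False, False)
# An anomaly found in the sprint review that cannot be rectified there is recorded.
review_finding = AnomalyContext(True, False, True, False)
```

Under these assumptions, `should_register(quick_fix)` is `False` and `should_register(review_finding)` is `True`.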

Of course, the team may deviate from such a protocol if required. If the team believes that more than the minimum ought to be registered, that is fine. This may be applicable in cases such as when metrics have to be built up. If the team wants to register less, please be aware that there are three reasons for registering. First, it is needed for investigation and fixing. Second, it is needed as a basis for retesting. And third, information from registered anomalies can (and should!) be used for process improvement in order to not just fix a problem but to also prevent a problem from returning in the future. This third use of anomalies can be applied during a retrospective meeting or a specific process improvement initiative.

In high-performance IT delivery, the team is responsible for the registration, monitoring, fixing and retesting of anomalies. If an anomaly goes beyond the scope of the team, it can be assigned to the scrum master or product owner, or be discussed in a scrum-of-scrums. Some organizations choose to organize dispute or arbitration meetings to discuss anomalies that would otherwise be passed back and forth between teams.

Please note that when retesting shows that the anomaly has not been fixed, the people involved should start another cycle of fixing and retesting.

Please also note that anomalies may result from implicit testing. This is when a team member is not performing a focused test for a specific aspect but still notices something different than expected. Such anomalies should also be handled according to the anomaly management process.

Tools to support anomaly management

Typically, many different people, teams, departments and even organizations can be involved in the identification, analysis and follow-up of anomalies. Therefore, anomaly management needs to be supported by automated tools. These tools support the administration of the anomalies, but also the communication and exchange of the relevant information. Please refer to "Tooling" for examples of these tools.

Terminology related to anomalies

When many anomalies are encountered, they need to be classified to support deciding on the order in which to investigate and fix them. Two terms are important: "severity" and "priority". Some people want to treat these as one, but they are different terms with different meanings and purposes. Severity indicates the impact of the anomaly on the business process. This can be decided in an objective manner. A severity is therefore factual information that does not change over time. Priority indicates the order in which anomalies should be treated, and priorities may change. In practice, high-severity anomalies will often get a high priority as well. But sometimes there is a high-severity anomaly that won't be fixed during the sprint, because it is not blocking the deliverables of the current sprint or because there is an acceptable workaround in place. In this situation the priority is not high.
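The difference between the two terms can be sketched as a small data model. This is only an illustration under stated assumptions: the three-level scales, the `Anomaly` class and the `handling_order` function are hypothetical, and using severity as a tie-breaker between equal priorities is an assumption, not something the text prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Impact on the business process (illustrative three-level scale)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Priority(IntEnum):
    """Order of treatment (illustrative three-level scale)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Anomaly:
    title: str
    severity: Severity  # factual; determined during analysis and then left alone
    priority: Priority  # a planning decision; the team may change it at any time


def handling_order(anomalies):
    """Sort by priority, highest first; severity only breaks ties (an assumption)."""
    return sorted(anomalies, key=lambda a: (a.priority, a.severity), reverse=True)


# High severity, but an acceptable workaround exists, so the priority is not high.
totals = Anomaly("invoice totals wrong, workaround in place", Severity.HIGH, Priority.LOW)
# Low severity, but it blocks a deliverable of the current sprint.
label = Anomaly("wrong label blocks sprint deliverable", Severity.LOW, Priority.HIGH)

ordered = handling_order([totals, label])
```

Here `ordered` starts with `label`, because the order of treatment follows priority rather than severity. If the workaround later stops being acceptable, only `totals.priority` is raised; its severity stays the same.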

Terms that are often used in relation to anomalies are terms that indicate there is something wrong. In our glossary we have described error, fault and failure as the main terms. If an anomaly is investigated and the conclusion is that there is something wrong in the IT system, we use the terms incident and problem based on ITIL [ITIL 2019].

[Figure: error, fault, failure]

An error is a human mistake that may, but does not necessarily, lead to faults or failures.

A fault is the manifestation of an error residing in the code, a document or a system. This may cause a failure. A fault may be detected by static testing.
A failure is a deviation of the system from its expected delivery or service. It is the result or manifestation of one or more faults. A failure may be detected by dynamic testing.
An incident is an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a configuration item.
A problem is a cause, or potential cause, of one or more incidents.
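The relation between error, fault and failure can be illustrated with a deliberately faulty function. This is a contrived sketch; the function `faulty_average` and its inputs are hypothetical and serve only to show the terms.

```python
def faulty_average(values):
    """Intended to return the mean of `values`.

    Error: the programmer assumed there would always be exactly two values.
    Fault: that mistake now resides in the code as the hard-coded divisor 2.
    A reviewer could spot this fault by reading the code (static testing).
    """
    return sum(values) / 2


# With two values the fault is present but does not surface: no failure occurs.
two_values_result = faulty_average([2, 4])

# With three values the actual outcome deviates from the expected mean of 2.0.
# This deviation is a failure, and running the code (dynamic testing) reveals it.
three_values_result = faulty_average([1, 2, 3])
failure_observed = three_values_result != 2.0
```

The same fault therefore leads to a failure for some inputs and not for others, which is why a fault can remain in a system for a long time without being noticed during dynamic testing.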

The term defect is often used, but with many different meanings. Some people use defect as a synonym for failure, others for fault, and quite a few people use it as a synonym for anomaly. Because of all these different meanings, there is a high chance of confusion. Therefore, we advise not to use the words defect and defect management.

The term incident relates to unexpected behavior in the operational environment, whereas anomalies relate to testing. Incident management is therefore similar to, but not the same as, anomaly management.