Unreproducible Bugs - A Bane! - The Official 360logica Blog

In this post, we share development and testing guidelines that reduce the likelihood of unreproducible bugs occurring in the first place. In short: avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues.

This post groups many bug types together, but they are related in how we test for them and in how disproportionately hard they are to reproduce and debug. The root cause and its effect can be separated by milliseconds or by hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered under unusual traffic patterns or resource configurations. Timing issues may only be noticed once many components are integrated and their performance characteristics and failure, retry, and timeout delays combine into a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal in rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time.
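To make the race condition case concrete, here is a minimal Python sketch of a lost update. The Counter class and the pause hook are hypothetical names invented for this illustration. The barrier forces the unlucky timing (both threads read before either writes), which in a real system might only happen under rare load patterns.

```python
import threading


class Counter:
    """Hypothetical shared counter whose increment is not synchronized."""

    def __init__(self):
        self.value = 0

    def unsafe_increment(self, pause=None):
        current = self.value       # step 1: read
        if pause is not None:
            pause()                # test hook: hold here between read and write
        self.value = current + 1   # step 2: write back a possibly stale value


def run_forced_race():
    counter = Counter()
    # Both threads must reach the barrier (after reading) before either writes.
    barrier = threading.Barrier(2, timeout=5)
    threads = [
        threading.Thread(target=counter.unsafe_increment, args=(barrier.wait,))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Calling run_forced_race() performs two increments but returns 1, because both threads read 0 before either one writes. Without the barrier, the same code would almost always return 2, which is exactly why such bugs are hard to reproduce.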

Guidelines for testing:

Stress test regularly: Run stress tests against your system routinely, ideally every day, so that heavy traffic does not surprise you with unexpected failures.
Test timeouts: Create tests that mock or fake dependencies so that your timeout code actually runs. If a timeout handler misbehaves, the resulting bug may only occur under certain system conditions.
Test with debug and optimized builds: A well-behaved debug build may work fine while the optimized build of the same system fails in unpredictable ways.
Test under constrained resources: Try reducing the number of data centers, machines, processes, or threads, or the available disk space or memory. Also try simulating reduced network bandwidth.
Test for longevity: Some bugs take a long time to reveal themselves. For example, persistent data may become corrupt over time.
Use dynamic analysis tools: Run memory debuggers and sanitizers such as ASan (AddressSanitizer), TSan (ThreadSanitizer), and MSan (MemorySanitizer) regularly. They can help identify many categories of unreproducible memory and threading issues.
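As an illustration of the timeout guideline, the following Python sketch uses a fake dependency that is deliberately slower than the caller's timeout, so the fallback path runs on every test run instead of only during a production slowdown. SlowFakeBackend, fetch_with_timeout, and the "cached" fallback value are all hypothetical names chosen for this example, not part of any real library.

```python
import concurrent.futures
import time


class SlowFakeBackend:
    """Hypothetical fake dependency that sleeps before answering."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds

    def fetch(self):
        time.sleep(self.delay_seconds)
        return "fresh"


def fetch_with_timeout(backend, timeout_seconds, fallback="cached"):
    """Return the backend's answer, or the fallback if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(backend.fetch)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # This is the code path that rarely runs unless a test forces it.
        return fallback
    finally:
        # Do not block on the slow call; let the worker finish in the background.
        pool.shutdown(wait=False)
```

A test can then call fetch_with_timeout(SlowFakeBackend(0.2), 0.01) and check that the fallback is returned, and call it again with a fast fake to check the normal path. Real timeout handlers usually do more (retries, cleanup, logging), and those are the parts worth asserting on.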

Guidelines for development:

Simplify your synchronization logic. If it is too hard to understand, concurrency problems in it will be difficult to reproduce and debug.
Always obtain locks in the same order. This is a tried-and-tested way to avoid deadlocks, yet code breaks the rule periodically. Define an order for obtaining multiple locks and never change it.
Don't optimize by creating many fine-grained locks unless you have verified that they are needed. Extra locks increase concurrency complexity.
Avoid shared memory unless it is truly necessary. Shared memory access is easy to get wrong, and the resulting bugs are difficult to reproduce.
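The lock-ordering guideline can be sketched in Python as follows. The helper acquire_in_order and the Account class are hypothetical, and sorting locks by id() is just one assumed way to get a single global order; an explicit, documented rank per lock is often clearer in real code.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire all given locks in one global order (here: by object id)."""
    unique = {id(lock): lock for lock in locks}   # drop duplicate locks
    ordered = sorted(unique.values(), key=id)
    acquired = []
    try:
        for lock in ordered:
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):           # release in reverse order
            lock.release()


class Account:
    """Hypothetical account guarded by its own lock."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # The caller names the locks in argument order, but they are always
    # acquired in the same global order, so opposing transfers cannot deadlock.
    with acquire_in_order(src.lock, dst.lock):
        src.balance -= amount
        dst.balance += amount


def run_opposing_transfers(rounds=1000):
    a, b = Account(100), Account(100)
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

If transfer instead acquired src.lock and then dst.lock directly, the two threads in run_opposing_transfers could each hold one lock while waiting for the other, and the deadlock would appear only on some runs.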

Unreproducible bugs are a bane. Often, when a tester finds and reports such a bug, the feedback is that it is not a bug because it cannot be reproduced. But of course the bug is still there, silently waiting to strike at the worst moment. Bugs of this kind tend to be very expensive, both in investigation time and in how long they survive in the product. To limit the damage to the product, we should take preventive measures.
