Developing distributed systems is difficult for the following reasons:
- Concurrency bugs are difficult to reproduce
- Writing test cases for integration testing is difficult
When debugging distributed systems, we usually try to reproduce the bug by running the system again and again while changing system parameters or inserting print statements. However, the execution results of a distributed system may vary from run to run depending on timing factors outside our control, such as process (thread) scheduling, network speed, and machine performance.
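The effect of timing on execution results can be seen even in a single process. The following is a minimal sketch, not taken from any real system: `run_once` and `worker` are hypothetical names, and the random sleep is only an assumed stand-in for scheduling jitter and network latency.

```python
import random
import threading
import time


def run_once(num_workers=4):
    """Start num_workers threads and return the order in which they finished."""
    arrival_order = []
    lock = threading.Lock()

    def worker(worker_id):
        # Assumption: a small random delay stands in for scheduling
        # jitter, network latency, and differences in machine speed.
        time.sleep(random.uniform(0, 0.01))
        with lock:
            arrival_order.append(worker_id)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrival_order


if __name__ == "__main__":
    # The same program, run five times with identical inputs.
    for run in range(5):
        print(f"run {run}: {run_once()}")
```

Every run returns the same set of worker IDs, but the order usually differs between runs. A bug that only appears for one particular order may therefore not show up again when the program is rerun.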
To improve system stability, it is important to test the system under various conditions. Unit testing has become popular in recent years, and various tools and frameworks such as xUnit are available. In addition to unit testing, integration (or system-level) testing is also important for distributed systems because the system consists of multiple units of processing (i.e., servers, processes, and threads). However, integration testing of distributed systems is very difficult because system behavior is non-deterministic due to these uncontrollable timing factors. How do you define the expected results of system behavior in an integration test case?