16 February 2026

How to Handle 1700000000000000000000000000000000 Test Cases and Tests That Actually Matter

There are real problems with how tests are used, so let’s look at the aspects of testing that usually get omitted. To get a sense of what I’m aiming at, check how huge, successful systems try to survive, see: Oracle Database 12.2 – 25 million lines of C code.
TDD
I don’t like TDD in any high-complexity project, because TDD hardens interfaces. In most TDD-driven projects it’s almost impossible to see good structural changes: by the time real problems are discovered, the interface, structure, and logic are already cemented by tests. As long as the interface doesn’t make the solution literally impossible, bad design tends to stay. TDD itself is not "wrong", but forcing TDD into places where it will never work is not a solution. In practice this happens in:
  • domains where requirements and models are still changing fast,
  • core logic with huge combinatorial spaces and many edge cases,
  • code that must be heavily refactored as we learn more,
  • code whose tests, by design, tend to skip internal logic paths.
In those areas, TDD tends to freeze early, incomplete designs behind a wall of brittle tests instead of helping us discover better designs.

Proper decomposition tends to invalidate a lot of early TDD tests. When you finally extract the right abstractions and boundaries, many method-level tests:
  • no longer match the real responsibilities
  • have to be rewritten or deleted
  • were only protecting a design we now know was wrong
It’s not unusual that a single good refactor throws away 50% of earlier TDD tests – and this can repeat a few times as the system progresses toward a better structure.
MOCKS
Externally, if you are not able to provide comprehensive tests and mocks for interfaces and services, you are far from quality.
Documentation is not proof that you did your part. Every service you write should be accompanied by a project that can be built into a full-fledged mock of the actual service you did or will build – and most likely a client mock as well. This is one of the few cases where mocks are genuinely appreciated and useful.
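As a minimal sketch of what such a mock project can look like – the service name (PricingServiceMock), endpoint (/price), port, and payload are all made up for this example – a deployable mock can be a single class using only the JDK’s built-in HTTP server:

import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Standalone mock of a hypothetical "pricing" service, shipped next to the real one.
public class PricingServiceMock {

   public static void main(String[] args) throws IOException {
      HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
      server.createContext("/price", exchange -> {
         // canned, deterministic answer so clients can integrate before the real service exists
         byte[] body = "{\"code\":\"A-100\",\"price\":\"42.00\"}".getBytes(StandardCharsets.UTF_8);
         exchange.getResponseHeaders().add("Content-Type", "application/json");
         exchange.sendResponseHeaders(200, body.length);
         try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
         }
      });
      server.setExecutor(null); // default executor is enough for a mock
      server.start();
   }
}

The important part is not the tooling – it is that the mock accompanies the real service and is built from its project, so every client has something concrete and versioned to test against.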
GENERATOR
The best property of an easy-to-test workspace is making it at least partially stateless. Sometimes the domain is so big that making it stateless requires rebuilding outside conditions – this is the actual reason why most bugs require a minimal test case. If we can make the whole system, or part of it, stateless, then we should be able to write code that can dump its state into a test. To do this we will need something like:
import java.math.BigDecimal;

public interface ToSourceCodeInterface {

   // helpers that render base types as source code; one overload per supported type
   public static class Static {

      public static String toCode(BigDecimal element) {
         return element == null
               ? "null"
               : "new BigDecimal(\"" + element.toPlainString() + "\")";
      }
      // ...one toCode(...) overload for each base type
   }

   // every stateful object renders itself as compilable source code
   String toSourceCode();
}
With more or less work, we can then dump almost any state or exception at runtime into a working test case that can be moved into the test code and kept there indefinitely. This hardens stability over time and greatly decreases the number of assumptions baked into tests.
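As an illustration of where this leads – the Order class, its fields, and the emitted test skeleton are hypothetical – a stateless domain object can implement the interface and turn a live (possibly failing) state into a ready-to-paste test:

import java.math.BigDecimal;

// hypothetical domain class used only to illustrate the idea
public class Order implements ToSourceCodeInterface {

   private final String code;
   private final BigDecimal amount;

   public Order(String code, BigDecimal amount) {
      this.code = code;
      this.amount = amount;
   }

   @Override
   public String toSourceCode() {
      // emit a constructor call that recreates the current state verbatim
      return "new Order(\"" + code + "\", " + ToSourceCodeInterface.Static.toCode(amount) + ")";
   }

   // dump the live state into a JUnit test skeleton, e.g. from an exception handler
   public String toTestCase() {
      return "@Test\n"
           + "void reproducesDumpedState() {\n"
           + "   Order order = " + toSourceCode() + ";\n"
           + "   // TODO: add the assertion that failed in production\n"
           + "}\n";
   }
}

A global exception handler can then log order.toTestCase() next to the stack trace, giving you a reproduction instead of an assumption.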
"ABSOLUTE" TESTS
Input variability
To handle code quality you need not only a sufficient number of tests, but also sufficient variability of input data. For simplicity, let’s consider a small but realistic part of the problem and how to test it.
DB-like engine that we will use as the base to be tested:
  • we have up to 12k distinct codes
  • codes are in 600 groups
  • at most 10 codes matter in any single input
  • we do not care about order
  • each dataset will be tested against 2k static rule sets where each set can have up to 10 rules (let’s take 5 as average for simplicity)
Have you ever considered how much work it would take to write an ABSOLUTE test – one that covers all possible cases? We would need proper data variability, and that leads to something on the order of:
  • 12k^10 / 10! ≈ 1.7×10^34 test cases – and this is only one aspect of integration testing.
  • Assume we can handle 1 billion cases per second (wildly optimistic for real integration tests).
  • The age of the universe is about 4.3×10^17 seconds.
  • We would still need roughly 4×10^7 (about 40 million) times the age of the universe to execute all the tests.
For this reason, "absolute" tests don’t exist for any non-trivial system.
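For the curious, a quick sanity check of these figures in code (the throughput and universe age are hard-coded from the assumptions above):

import java.math.BigInteger;

// Rough sanity check of the numbers above: unordered picks of 10 out of ~12k codes.
public class AbsoluteTestCount {

   public static void main(String[] args) {
      BigInteger cases = BigInteger.valueOf(12_000).pow(10)
            .divide(factorial(10));                                // ≈ 1.7×10^34 combinations
      BigInteger perSecond = BigInteger.valueOf(1_000_000_000);    // assumed throughput
      BigInteger universeAge = new BigInteger("430000000000000000"); // ≈ 4.3×10^17 s

      System.out.println(cases);                                        // ≈ 1.7×10^34
      System.out.println(cases.divide(perSecond).divide(universeAge));  // ≈ 4×10^7 universe ages
   }

   private static BigInteger factorial(int n) {
      BigInteger result = BigInteger.ONE;
      for (int i = 2; i <= n; i++) {
         result = result.multiply(BigInteger.valueOf(i));
      }
      return result;
   }
}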

For the same reason, most tests that should exist are totally skipped and replaced with:
  • unit tests together with mocks, hopes, and dreams
  • incremental integration testing that covers some logic paths
  • risk-based / domain-based test design:
    • most-used code groups
    • known tricky combinations
    • historically buggy areas
  • coverage and observability – which give at least minimal quality, as long as we don’t mock away everything important.
So let’s focus on what could be done:
  • We split the groups against each other until each sub-group has no intersection with any other. Let’s say this gives us around 1k sub-groups (real data is almost never random). The possible critical paths then become roughly 1k^10 / 10! ≈ 2.8×10^23 (for now we ignore duplicates inside sub-groups).
    Even at a billion cases per second, that is still roughly 9 million years of execution time – not ideal.
So we make another correction to the approach:
  • For each rule set we create all positive paths and one negative path for each rule.
  • To build the positive paths, we check each rule against all 1k sub-groups. Let’s say this gives us on average 3 matching sub-groups per rule, and from that we derive roughly 3×15 ≈ 45 positive tests per rule set. Add ≈5 negative tests → about 50 tests per rule set, and 50×2 = 100 if we also want to introduce randomness inside each sub-group.
  • This way we end up with 2k × 100 = 2×10^5 test cases at most.
  • Definitely much shorter than the age of the universe.
To reintroduce variability where needed, we add cases for:
  • duplicates inside sub-groups
  • more mixing of negative and positive conditions
If this is done with moderation, it can at most double the number of tests.
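A minimal sketch of such a generator – RuleSet, Rule, and SubGroup are hypothetical stand-ins for the real engine types:

import java.util.ArrayList;
import java.util.List;

// Sketch of the per-rule-set generator described above.
public class TestCaseGenerator {

   public interface Rule {
      boolean matches(List<String> codes);
      List<String> nonMatchingSample(List<SubGroup> subGroups);
   }

   public interface RuleSet {
      String id();
      List<Rule> rules();
   }

   public interface SubGroup {
      List<String> sampleCodes();
   }

   public static class GeneratedCase {
      public final String ruleSetId;
      public final List<String> codes;
      public final boolean expectedMatch;

      public GeneratedCase(String ruleSetId, List<String> codes, boolean expectedMatch) {
         this.ruleSetId = ruleSetId;
         this.codes = codes;
         this.expectedMatch = expectedMatch;
      }
   }

   // One positive case per (rule, matching sub-group) plus one negative case per rule.
   public List<GeneratedCase> generate(RuleSet ruleSet, List<SubGroup> subGroups) {
      List<GeneratedCase> cases = new ArrayList<>();
      for (Rule rule : ruleSet.rules()) {
         for (SubGroup group : subGroups) {
            if (rule.matches(group.sampleCodes())) {
               cases.add(new GeneratedCase(ruleSet.id(), group.sampleCodes(), true));
            }
         }
         cases.add(new GeneratedCase(ruleSet.id(), rule.nonMatchingSample(subGroups), false));
      }
      return cases;
   }
}

Each generated case should be saved together with the engine’s current answer – that stored pair is what the next section builds on.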
Results
When saved together with expected results, these tests allow you to rewrite the engine or introduce changes with very high accuracy and quality.
  • Any optimization or structural change will generate an incompatibility report by itself.
  • With this "over the top" quality, we can select, for example, a random 10% of tests to keep the build process manageable (< 10 minutes), and run the full suite nightly – a minimal sampling sketch follows below.
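The sampling itself needs nothing fancy – for example (the 10% ratio and per-build seed are assumptions):

import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Picks roughly 10% of the generated cases for the regular build;
// the nightly job runs the full list. Seeded per build so a rerun repeats the same sample.
public class SuiteSampler {

   public static <T> List<T> sample(List<T> allCases, long buildSeed) {
      Random random = new Random(buildSeed);
      return allCases.stream()
            .filter(testCase -> random.nextInt(10) == 0) // keep about 1 in 10
            .collect(Collectors.toList());
   }
}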
So while truly absolute tests don’t exist, in some domains we can get close enough to have very strong guarantees.
/DECOMPOSITION/
It might seem unrelated – sadly, this is one of the most important aspects that gets omitted, while being directly correlated with the quality and number of tests.
There is a strong push for "decomposition" that is actually harmful and makes code quality drop considerably. It comes from confusing decomposition with mechanical code splitting, and from ignoring quality concerns.

Consider one method doing one complex operation. Let’s say creating it took 2 days and testing it took 1 day:
public class Report4Batman {
   ReportResult run() {
      // collect data
      // process
      // compose report
      return null; // placeholder for the actual result
   }
}
The first instinct – and a bad one – is to split it into smaller methods:
public class Report4Batman {
   // shared mutable fields used by all three steps...
   private void collectData() {}
   private void process() {}
   private ReportResult composeReport() { return null; }

   ReportResult run() {
      collectData();
      process();
      return composeReport();
   }
}
Static analysis tools will be happy, but code quality will be exactly the same or worse, and we will have more work. This leads to:
  • methods that look independent but are deeply entangled through shared state and hidden assumptions,
  • more chances to expose internal variables without enforcing valid state,
  • pressure to make these methods public because "they are nicely encapsulated now,"
  • method names growing into small monsters just to express actual usage, e.g. collectDataFromSecondSystemOfBruceFromEndOfMonth().
An alternative is to harden method interfaces properly:
  • Scope of each method must be reconsidered and checked against other parts of the code, often with different requirements.
  • We need to respecify requirements for each method – documentation size can grow multiple times.
  • Each future modification must be checked against each related contract.
  • Each method needs additional tests, often more complex than the single end-to-end test.
All this can multiply implementation time while giving ZERO additional business value in most cases.
Another alternative is a more structural split (names shortened for readability):
class Report {
   static class ReadFile4Report {
      ReadFile4Report(Input input) { /* read the raw data */ }
   }

   static class StreamToXls4Report {
      StreamToXls4Report(ReadFile4Report src) { /* convert to a spreadsheet stream */ }
   }

   static class NormalizeData4Report {
      NormalizeData4Report(StreamToXls4Report src) { /* normalize the data */ }
      ReportResult toResult() { return null; }
   }

   ReportResult run(Input input) {
      return new NormalizeData4Report(
         new StreamToXls4Report(
            new ReadFile4Report(input)))
         .toResult();
   }
}
  • We must properly cascade all necessary data – a lot of work.
  • There is pressure to extract these classes elsewhere, making analysis extremely hard.
  • This structure can reflect our logic well, but any future move toward generalization will be harder.
Proper decomposition – the "utils" step:
  • Every reasonable function that can be generalized must be extracted into a utility:
    • Each utility needs research of all possible scopes.
    • Each must be tested against all relevant scopes.
    • Each utility becomes a liability and can take more effort to design and test than the original project.
    • This is exactly why frameworks are made public – free external testing makes them economically viable.
    • For example, from one simple date format we could end up supporting multiple ambiguous, non-deterministic formats.
Proper decomposition – the "framework" step:
  • For someone inexperienced with building frameworks, it may be impossible.
  • It is considered an anti-pattern in many teams for good reasons:
    • It will either be very limited or very expensive.
    • Every later change requires coordination with all framework clients.
    • Functionality can be delayed by weeks or months.
All those more or less unnecessary steps lead to blind spots in tests.
  • Tests that are heavily dependent on context.
  • "Nothing-testing" – tests that exercise mocks more than real logic.
  • Coming from an optimistic testing path instead of a pessimistic one. Roughly:
    • Let m be the number of distinct steps (methods) that can be combined or ordered in different ways. In the worst case, testing all permutations would give up to m! combinations.
    • Let i be the number of independent boolean conditions (if statements). In the worst case they introduce up to 2^i combinations.
    • Let b be the average number of branching points introduced by tools/frameworks, and t the number of tools. This can add up to t^b more combinations.
    So a very rough estimate of the number of distinct execution paths is m! × 2^i × t^b. This isn’t a precise formula, but it illustrates how quickly the test space explodes even for a modest number of methods, conditions, and framework branches – adding "just one more method" multiplies the m! term by m+1, so it at least doubles the test space. If we aggressively compress the test paths for this model (mainly in core logic), we can end up with something like 5–10% coding time and 90–95% test and quality-assurance time – and in return get amazing stability. Sadly, what we mostly see instead is m + i + t + b shallow tests that behave like a plaster on a severed leg.
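    To make that concrete with made-up but modest numbers: for m = 6 steps, i = 10 conditions, t = 4 tools, and b = 3 branching points, the estimate is 6! × 2^10 × 4^3 = 720 × 1,024 × 64 ≈ 4.7×10^7 distinct paths, while m + i + t + b gives just 23 shallow tests.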
In short, "decomposition" is overused as a slogan for:
  • splitting code, often without proper structure,
  • introducing invalid interfaces (for example, methods) into code,
  • introducing unnecessary complexity into code,
  • sloppy work that lacks proper quality assurance in both design and tests.
