Unit tests, integration tests, black box tests, end to end tests, user tests, test driven development, demo days. There are lots of kinds of tests. And they all can provide value. But only if you run the right tests at the right time. And as with so many things, it comes back to context, scope, and scale.
You want to have enough inputs to test that it works, that the different combinations of flags/features/datasets all work together the way you expect. But not just that the correct cases are handled correctly. You need to test that you detect and provide useful error information if the inputs don't make sense, you can't handle them, or something goes wrong during processing. That's the context part.
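Here's a small sketch of what testing the unhappy path can look like. The `parse_lat_lon` function and its `error_message` helper are hypothetical, made up purely for illustration. The point is that the bad inputs get as much attention as the good ones, and that each error says what was wrong.

```python
def parse_lat_lon(text: str) -> tuple[float, float]:
    """Hypothetical parser: turn 'lat,lon' (degrees) into (lat, lon)."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'lat,lon', got {text!r}")
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        raise ValueError(f"non-numeric coordinate in {text!r}") from None
    # Range checks double as a sanity check on the input's ordering.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} outside [-90, 90] in {text!r}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} outside [-180, 180] in {text!r}")
    return lat, lon


def error_message(text: str) -> str:
    """Return the error message for a bad input, or '' if it parsed."""
    try:
        parse_lat_lon(text)
    except ValueError as exc:
        return str(exc)
    return ""


def test_good_input():
    assert parse_lat_lon("47.6,-122.3") == (47.6, -122.3)


def test_bad_inputs_give_useful_errors():
    assert "expected" in error_message("47.6")
    assert "non-numeric" in error_message("north,west")
    assert "latitude" in error_message("122.3,47.6")
```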
For scope, you want to run just enough code to test the system under test. Scope spans a wide range: from an individual algorithm, to a class/package/executable, to a service/distributed service/ecosystem. And your tests and framework need to reflect that.
If you're testing an algorithm then write the algorithm and enough code around it to test that it works per the above. Mock out everything but the algorithm. Provide the data in the expected format. Know what the answer is supposed to be. Remember what you're testing (the algorithm). These kinds of tests are generally called unit tests.
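As a minimal sketch, assume the algorithm is a weighted average of pixel values. `weighted_blend` is a hypothetical stand-in, not the real blender. The data is tiny, the answer is worked out by hand, and nothing but the algorithm runs. The `test_*` functions use plain asserts, so a runner like pytest can pick them up.

```python
def weighted_blend(values, weights):
    """Hypothetical algorithm under test: weighted average of pixel values."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def test_known_answer():
    # Worked by hand: (100*1 + 200*3) / (1 + 3) = 700 / 4 = 175
    assert weighted_blend([100, 200], [1, 3]) == 175.0


def test_equal_weights_is_plain_mean():
    assert weighted_blend([10, 20, 30], [1, 1, 1]) == 20.0


def test_mismatched_lengths_are_rejected():
    try:
        weighted_blend([1, 2], [1])
    except ValueError:
        return
    raise AssertionError("expected ValueError for mismatched lengths")
```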
Unit tests can also have a slightly bigger scope. If you're testing the external interface of a class/library/exe then you need to provide enough environment around it to run, but you need to control the environment. This isn't the time to run against the live DB in production. You don't want to upset the production system, and it's hard to make it respond consistently to a test. You want to provide enough constraints so that you're sure what you're testing and that when there's a failure you know where to look.
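One way to control the environment is to hand the code a throwaway database that the test builds itself. This sketch uses Python's built-in `sqlite3` with an in-memory database. The `tiles` table and the `count_tiles` function are assumptions invented for the example.

```python
import sqlite3


def count_tiles(conn: sqlite3.Connection, min_zoom: int) -> int:
    """Hypothetical code under test: count tiles at or above a zoom level."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tiles WHERE zoom >= ?", (min_zoom,)
    ).fetchone()
    return row[0]


def make_test_db() -> sqlite3.Connection:
    """Build a small, fully known database that lives only in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tiles (id INTEGER PRIMARY KEY, zoom INTEGER)")
    conn.executemany("INSERT INTO tiles (zoom) VALUES (?)", [(10,), (12,), (15,)])
    conn.commit()
    return conn


def test_count_tiles():
    conn = make_test_db()
    try:
        # The test owns every row, so the expected answers are certain.
        assert count_tiles(conn, 12) == 2
        assert count_tiles(conn, 16) == 0
    finally:
        conn.close()
```

Because the test creates every row, a failure points at `count_tiles` and not at whatever happened to be in a shared database that day.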
The next step in scope is the integration test. This is where you're making sure that two things that you know work "correctly" (however that's defined) by themselves work well together. In the Bing GlobalOrtho project we spent a lot of time using WGS84 coordinates. We threw around a lot of latitudes and longitudes. We did this in the image stitcher and the Poisson color blender. And all of the unit tests worked. Perfect. Let's hook these things together. And it worked. Mostly. But the further east/west we went the weirder it got. Then all of a sudden things started crashing. Turns out some things took in (latitude, longitude) and others took (longitude, latitude). It was only during the integration that we found the problem. And of course, you need a more complex system to do integration testing, but it's still not the full thing.
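The sketch below shows one way an integration test can catch an argument-order mismatch like that one. Everything in it is hypothetical: `stitcher_tile_center` and `blender_accepts` are stand-ins for the two components, not the real GlobalOrtho code. The trick is to pick a test point whose longitude has a magnitude above 90, so a swapped pair can't pass for a valid one.

```python
from typing import NamedTuple


class LatLon(NamedTuple):
    """Named fields make the ordering explicit at every call site."""
    lat: float
    lon: float


def stitcher_tile_center() -> LatLon:
    """Stand-in for one component's output (hypothetical)."""
    return LatLon(lat=47.6, lon=-122.3)


def blender_accepts(point: LatLon) -> bool:
    """Stand-in for the other component's input check (hypothetical)."""
    return -90.0 <= point.lat <= 90.0 and -180.0 <= point.lon <= 180.0


def test_stitcher_output_feeds_blender():
    point = stitcher_tile_center()
    assert blender_accepts(point)
    # |lon| > 90 here, so a swapped pair is out of range and gets rejected.
    swapped = LatLon(lat=point.lon, lon=point.lat)
    assert not blender_accepts(swapped)
```

A test point like (10, 20) would pass in either order, which is the same reason a swap can hide until the data moves far enough east or west.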
Then there are end-2-end tests. *That's* where you run the whole thing, in something not entirely unlike the production environment. With known inputs. Expecting known outputs. Really good for making sure nothing has broken, but not good at all for telling you what went wrong. In GlobalOrtho, when the color of the output images changed by more than a certain amount, we first had to figure out why. And that usually took longer than the actual fix. But again, without that kind of testing we never would have known.
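A "changed by more than a certain amount" check can be sketched as a comparison against a stored known-good output. This is an illustration only: the mean-absolute-difference metric, the flat list of pixel values, and the tolerance of 2.0 are all assumptions, not what GlobalOrtho actually used.

```python
def mean_abs_diff(actual, expected):
    """Average per-pixel absolute difference between two equal-length runs."""
    if len(actual) != len(expected):
        raise ValueError("outputs have different sizes")
    if not actual:
        raise ValueError("nothing to compare")
    return sum(abs(a - e) for a, e in zip(actual, expected)) / len(actual)


def check_against_golden(actual, golden, tolerance=2.0):
    """Return (passed, diff); tolerance is an assumed, tunable threshold."""
    diff = mean_abs_diff(actual, golden)
    return diff <= tolerance, diff


def test_pipeline_output_matches_golden():
    golden = [100, 150, 200]   # stored output from a known-good run
    actual = [101, 149, 200]   # what the pipeline produced this time
    passed, diff = check_against_golden(actual, golden)
    assert passed, f"output drifted from golden by {diff:.2f}"
```

Note what the failure message can and can't say: it reports how far the output drifted, not which stage caused it.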
So what kind of testing is there after end-2-end? You've run out of scope, but now you get to scale. There are a few kinds of scale. Maybe your system can handle blending 50 images of roughly the same place, but what if you have 1000? Or 10,000? Or your system behaves correctly at 100 Queries/Sec (QPS), but sometimes you get 10,000 QPS or more? What happens when your dataset grows by 10x? 1000x? More? What about parallelism? Breaking things into 10 pieces might cut your run time by almost 90%, but at 100 pieces it fails or takes longer.
Then there's the kind of scale that describes the test space. Your system does the right thing in a few cases, but there's a combinatorial explosion of possible cases and there are millions of tests to run. How do you scale to that?
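One common answer, sketched here as an assumption and not as the only approach, is to sample the space with a seeded random generator so that any failing run can be reproduced. The `sample_cases` helper and the option names are hypothetical.

```python
import math
import random


def sample_cases(options: dict, k: int, seed: int = 0) -> list:
    """Pick up to k distinct combinations without building the full product.

    Option values must be hashable. The same seed gives the same cases.
    """
    rng = random.Random(seed)
    names = list(options)
    total = math.prod(len(options[name]) for name in names)
    k = min(k, total)  # can't ask for more distinct cases than exist
    seen = set()
    cases = []
    while len(cases) < k:
        combo = tuple(rng.choice(options[name]) for name in names)
        if combo not in seen:
            seen.add(combo)
            cases.append(dict(zip(names, combo)))
    return cases


# Hypothetical feature flags: 3 * 2 * 4 = 24 combinations in total.
OPTIONS = {
    "format": ["png", "jpeg", "tiff"],
    "blend": [True, False],
    "zoom": [10, 12, 15, 18],
}

for case in sample_cases(OPTIONS, k=5, seed=42):
    print(case)
```

Rejection sampling like this is cheap when `k` is small compared to the total, which is the situation with millions of cases. Logging the seed with each run is what makes a failure repeatable.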
Then there's black box testing. Go outside your system. Act like a user. Using an entirely different mechanism, test what you're doing with no knowledge of the system other than the external APIs. Even here there are two kinds of tests. Those that make sure things work right, and those that make sure things don't break. Because those are two very different things. And remember, as Bill Gates saw 20+ years ago, even with all the testing, sometimes things go worng