Remember when Slack went down a couple of months ago? That probably got your attention. And appropriately, it got the attention of everyone at Slack too. It turns out that the big outage was a delayed consequence of an earlier outage, and it was exacerbated by "what you know that just ain't so".
There are a bunch of good lessons in here, but two of the biggest stand out. First, monitoring is important, but you need to make sure that it really is monitoring what you think it is. Second, small bugs in one place can manifest themselves in completely different areas weeks or months later.