Posted May 14, 2011 13:01 UTC (Sat) by vachi (subscriber, #67512)
Parent article: Scale Fail (part 1)
Don't feel so safe just because your admins have a monitoring process... It is the thing you did not monitor that will come back to bite you.
(disclaimer: I'm an app dev. I always have an axe to grind with admins :-)
A few months back, our app abruptly slowed down, and sometimes seemed to hang from the users' point of view. The app team asked the admins whether there were any strange signs on the app server or DB server boxes. The admins defiantly declared that the servers were all green. CPU had gone over the threshold a few times, but that was normal on a peak day. There was a lot of free memory. It must be the lousy app that was the problem.
After a lot of frustration and investigation, the app team found out that one of the resources inside the DB was configured way too low, and on that fateful day (4 years after setup) the resource was used up and things blew apart. The admins said they did not even know how to monitor that resource's consumption...
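For what it's worth, the check that was missing does not need to be fancy. Below is a minimal sketch, assuming the DB exposes a "currently used" figure and a configured limit for each internal resource. The resource names, the numbers, and the 80% threshold are all made up for illustration; how you actually fetch the readings depends entirely on your database.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    """One sample of a DB-internal resource (hypothetical shape)."""
    name: str
    used: int
    limit: int


def near_limit(readings, threshold=0.8):
    """Return (name, ratio) for each resource at or above `threshold` of its limit."""
    alerts = []
    for r in readings:
        if r.limit <= 0:
            # Skip resources with no meaningful configured limit.
            continue
        ratio = r.used / r.limit
        if ratio >= threshold:
            alerts.append((r.name, ratio))
    return alerts


if __name__ == "__main__":
    # Hypothetical readings; in practice these would come from the DB itself.
    sample = [
        ResourceReading("connections", used=40, limit=200),
        ResourceReading("some_internal_pool", used=95, limit=100),
    ]
    for name, ratio in near_limit(sample):
        print(f"WARNING: {name} at {ratio:.0%} of configured limit")
```

The point is only that "used versus configured limit" gets watched alongside CPU and memory, so a limit set years ago shows up as a warning before it shows up as an outage.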