First post in a while, so I’ll try to make it a decent one.
I’ve been the cause of a fair few issues at work lately. They’ve been related to systems not being set up properly, unexpected bugs, and misunderstandings or poor design of our network.
One originated on Sunday morning, after I migrated users between subscriber routers. I forgot to include the OSPF summary address, so in moving the users I inadvertently injected an additional 3,000 routes into the OSPF process.
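To show why a missing summary matters, here is a minimal sketch using Python’s standard `ipaddress` module. The 10.0.0.0 addressing and the assumption that each subscriber shows up as a /32 host route are hypothetical and purely illustrative; they are not our real address plan.

```python
import ipaddress

# Hypothetical pool: 3,000 consecutive subscriber host routes (/32 each).
base = ipaddress.ip_address("10.0.0.0")
host_routes = [ipaddress.ip_network(f"{base + i}/32") for i in range(3000)]

# Without a summary, every one of these is advertised individually.
# collapse_addresses() merges them into the smallest set of exact prefixes.
exact_prefixes = list(ipaddress.collapse_addresses(host_routes))

# A single covering summary (assumed here to be a /20) hides all of them.
summary = ipaddress.ip_network("10.0.0.0/20")
all_covered = all(route.subnet_of(summary) for route in host_routes)

print(f"{len(host_routes)} host routes -> {len(exact_prefixes)} exact prefixes")
print(f"all covered by {summary}: {all_covered}")
```

With a summary in place, the rest of the network carries one prefix instead of thousands, which is exactly what I left out during the migration.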
Due to our network design (or rather the lack of OSPF area manipulation, since most things exist in area 0), the routes were propagated to most routers in the network. Now, 3,000 routes isn’t a lot, but to an IGP it can be; it equates to about 5 MB of memory usage. For routers such as the NPE-200, which carry a maximum of 128 MB, that can be a lot, particularly if the router only has 5 MB of memory available at the time.
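As a rough back-of-the-envelope check on those numbers, the sketch below works out the implied cost per route and how it compares to the headroom. The figures are the approximate ones quoted above, not measurements, and the variable names are just for illustration.

```python
MB = 1024 * 1024

extra_routes = 3000        # routes leaked into OSPF
extra_memory = 5 * MB      # approximate memory they consumed
router_memory = 128 * MB   # maximum memory on an NPE-200
free_memory = 5 * MB       # example headroom before the change

# Implied cost per route (prefix plus associated OSPF/routing-table state).
bytes_per_route = extra_memory / extra_routes

# Share of total memory, and whether the headroom is used up.
share_of_total = extra_memory / router_memory
headroom_exhausted = extra_memory >= free_memory

print(f"~{bytes_per_route:.0f} bytes per route")
print(f"{share_of_total:.1%} of total memory")
print(f"headroom exhausted: {headroom_exhausted}")
```

A few percent of total memory sounds harmless, but it is the whole margin on a router that was already running close to full.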
The issue wasn’t apparent immediately, otherwise I would have picked it up during my change. A few hours after the change, during normal operation, the router was unable to process LDP tags due to a lack of memory. Immediately following this, it lost its BGP connectivity with our core, again due to insufficient memory.
The engineer who assessed the issue saw the lack of memory but didn’t diagnose the cause; they reloaded the router and assumed normal operation because it was stable. Later in the day, the router ran out of memory again and the issues recurred. I identified the issue as occurring globally across our network and proceeded to add the summary routes; the available memory on the affected router increased immediately.
It’s really annoying that, even though I made the change that caused the issue, no other engineer in our team picked it up. It’s even more annoying that while resolving the issue, I caused a loop for a small number of users, which meant an increased call load for the call centre. Luckily it was after hours, and the impact wasn’t large.
I’m too tired to be thinking on my feet – engineers need proper sleeping patterns! Speaking of which, I won’t have any normality for the coming week – Tuesday through Thursday – 11pm migrations.