Routers in this day and age scale from small single-processor, single-chip solutions all the way up to multi-chassis matrix systems such as the Juniper TX Matrix, formed from their T4000 and T1600 devices.
Software tends to be broken, sometimes catastrophically, other times mildly, and the engineering fraternity generally accepts this. Salesmen, on the other hand, seem to believe that Cisco devices don’t run software at all, since apparently you never need to update them. (A note on this: most companies do have a maintenance procedure; some, however, do not.)
Along with upgrading devices as part of my job, every now and then I find the time to upgrade lab equipment to the latest OS versions. It needs to be done so that realistic outcomes can be obtained. Some would argue there is no point in having a generic upgrade policy as the real world doesn’t work like that, but I tend to think there is something to be gained from newer code and later features in the lab. The real world tends to be driven by required features and the stability of the code that carries them: unless there is a desperate requirement for the latest and greatest feature, the bleeding-edge release is normally avoided. Some braver companies also readily test beta software, with an attitude of “How will it ever be stable if no one tests it?”
I have picked up many oddities over the years with various vendors: routers not forwarding packets after an upgrade, immediate crashes, core dumps and power issues. I happen to like Juniper’s way of dealing with upgrades; Junos has a validation feature which checks the running configuration against the new package to ensure compatibility. Most other vendors have nothing comparable, and much pain is caused as a result. I believe these oddities can be gotten around by the following list:
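For illustration, here is roughly what that validation looks like on a Junos box. This is a sketch: the package filename is a placeholder, and exact options vary by platform and release.

```
user@router> request system software validate /var/tmp/junos-install-example.tgz
user@router> request system software add /var/tmp/junos-install-example.tgz validate
```

The first form checks the candidate package against the running configuration without installing anything; the second performs the same check as part of the install and aborts if validation fails.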
1) Read guidance and release notes. This is where the first alarm bells should ring.
2) Lab test the required features. This removes a multitude of unanticipated situations and, in some cases, provides insurance for step 1 above (or covers you if you haven’t performed step 1).
3) Ensure plenty of disk space and RAM. The Juniper J series is infamous (at least with me) for not forwarding packets after an upgrade pushes RAM usage through the roof. A few pre-flight check commands are sketched after this list.
4) Make use of ‘out of band’ connectivity. Whether this is a serial cable from a terminal server or something else entirely, ensure you have at least some means of jumping onto the console port.
5) Rely on redundancy. If your devices are clustered you may get away with it, or you could run a first hop redundancy protocol such as VRRP or HSRP: fail traffic over to the alternate device, then upgrade the idle one (see the VRRP sketch after this list).
6) Check MD5 hashes. Ever wondered why MD5 hashes are published alongside downloads on vendor sites? They help prevent dodgy image downloads (unless the hash on the site has also been tampered with, which is a deeper subject) and they also help stop engineers from uploading duff images. Check those MD5 hashes and avoid a host of issues (an on-box example follows the list).
7) Disk health. Some vendors run a disk check routine. Failure rates for compact flash and SSD devices can be high in warmer environments; failure is common after an air-con outage, for instance. Some companies prefer to erase, format, upload and re-validate images on flash-based devices to ensure integrity (sketched after this list).
8) Backup images. To play it super safe, Cisco, Juniper and other vendors allow multiple images to be stored on the same media. If one image fails, the older image can be booted (see the fallback example after this list).
9) Pray. No explanation necessary. I advise kneeling on a jacket, as datacentre floor cooling vents can be sharp on the knees.
10) If step 9 fails, be TACtful. Use your maintenance contract (provided you were able to get one). TAC departments are there to help. Make use of them!
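A few command-level sketches for the steps above. For step 3, these are standard Junos operational-mode commands for checking storage and memory before copying an image on; the output fields vary by platform:

```
user@router> show system storage                      # free space on the filesystems
user@router> show chassis routing-engine              # RE memory and CPU utilisation
user@router> request system storage cleanup dry-run   # list deletable files without deleting them
```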
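For step 5, a minimal VRRP sketch in Junos set-command form. The interface, addresses and group number are hypothetical; the idea is to raise the priority on the device you are not upgrading so it holds mastership while its peer reboots:

```
set interfaces ge-0/0/0 unit 0 family inet address 192.0.2.2/24 vrrp-group 10 virtual-address 192.0.2.1
set interfaces ge-0/0/0 unit 0 family inet address 192.0.2.2/24 vrrp-group 10 priority 200
set interfaces ge-0/0/0 unit 0 family inet address 192.0.2.2/24 vrrp-group 10 accept-data
```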
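For step 6, checking a hash on-box is a one-liner in Junos; compare the output against the hash published on the vendor’s download page (filename again a placeholder):

```
user@router> file checksum md5 /var/tmp/junos-install-example.tgz
```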
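For step 7, the erase, format, upload and validate cycle looks something like this on an IOS box. The server address and image name are hypothetical, and obviously do not format flash on a device whose only good image lives there:

```
router# format flash:
router# copy tftp://192.0.2.10/example-image.bin flash:
router# verify /md5 flash:example-image.bin
```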
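For step 8, both camps make fallback reasonably painless. On Junos, a snapshot copies the running system to alternate media and a rollback reverts to the previously installed package; on IOS, multiple boot statements define the boot order (image names hypothetical):

```
user@router> request system snapshot
user@router> request system software rollback

router(config)# boot system flash:new-image.bin
router(config)# boot system flash:known-good-image.bin
```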