I have talked to many people about the need for backups, and a few years ago I did a small presentation to some NZIPP’ers on the tools that help. You would expect that I use a few of these techniques myself, and I do. Recently we returned from a lengthy holiday, and turning on all the computers turned out to be a sad few minutes and the ruin of my first week home.
For a fast recap: we all know that we have to back up. And the paranoid among us know that you need to back up more than once. These rules were tested for me this last week, and it is such a good (or bad) example that I thought I should share it with you. When I powered up my array of servers, three of them had hardware faults that resulted in lost data.
Everybody remember RAID? (Redundant Array of Inexpensive Disks.) This is our first line of defence. The disks are grouped so that the equivalent of one (RAID5) or two (RAID6) drives holds parity data, meaning that many drives can be removed or fail without losing anything. If one does fail you can replace it, the system rebuilds the parity, and you are back in business.
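To make the parity idea concrete, here is a minimal sketch in Python (with tiny byte strings standing in for whole disks, purely for illustration) of how single-parity recovery works: the parity block is the XOR of the data blocks, so any one missing block can be rebuilt from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together; RAID5 parity works this way."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

# Three "disks" worth of data plus one parity block (RAID5-style).
disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(disks)

# Simulate losing disk 1: XOR the surviving disks with the parity
# block and the lost data reappears.
survivors = [disks[0], disks[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == disks[1]  # the failed disk's data is recovered
```

RAID6 adds a second, independently computed parity block, which is why it can survive two simultaneous drive failures, but not the three I was hit with.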
So one of my backup systems (a NAS) uses RAID6, because I had so many drives failing in pairs that I no longer trusted RAID5. But this week, THREE drives failed when I turned this unit on. So it became a dead box and all data on it was lost. I installed some new, larger drives and recovered it from the second copy of this data I keep on another NAS.
A second backup unit had a single drive failure, so being RAID5 I just swapped the drive out and it rebuilt itself. Sadly, some corruption meant that it was not possible to access the system through the standard shares. I tried to find a way to solve that problem and keep the data, but failed. So even though I did not lose any data, the corruption meant I still had to rebuild the machine and copy the data back from the second- and third-tier backups.
The third machine was my main file storage: a Windows server that uses RAID5 and mirroring. This lovely machine had run non-stop for six years without a hitch. UNTIL last Saturday when I turned it back on. I always say computers fail when you turn them ON, so it is best to leave them on to minimise the opportunity for failure. Luckily my healthy paranoia and the lack of space on this six-year-old machine (Phase One files are large!) meant that I had planned a replacement server, and it was commissioned just a week before we went away. The only service that had not been migrated was the domain controller function (which controls logons). Of course I have two machines serving that role, so everything still worked, and I was able to build a new one quickly, ignoring the old server altogether. I don’t know what is wrong with that machine, but I was expecting it to fail sometime and will not bother to fix it.
I have four copies of pretty much everything here. The servers are backed up to a NAS, and that NAS is backed up to another. So when the first NAS was rebuilt I just copied the data back from the third-level backup and we were back in action. As a last resort there is, of course, the off-site backup. I am a little lazy with this one and don’t make them as often as I should. But as someone else commented recently, if the studio burns down we won’t be so worried about getting everything back.
From my experience, if you don’t want to lose data (think of it as your income stream), three things are important:
1. Back up everything, including the backups.
2. Plan to replace equipment before the end of its expected life.
3. It’s not a question of if you lose data; it’s when, and what you can do about it.