Broken SW-RAID / HDD and failed rebuild

  • Hi there,

    I need your help with a customer server ... This is the first time I have seen such a problem and I don't know how to solve it ...

    It was a software RAID1 with two HDDs. One of them failed while syncing, and SMART told me it was high time to replace the faulty HDD ... It took me two days to get to the server and replace
    the HDD, but now the new HDD can't be synced because the first HDD has a lot of failed sectors ... I tried a lot, but the sync can't complete as long as it keeps hitting failed sectors.

    Now I have a clean HDD and a faulty HDD with many errors on a production system. What's the best way to get the files and the system onto the new HDD?

    My idea: create new partitions on the clean HDD (sdb1 /boot, sdb2 /, sdb3 swap) and then copy all files, but how? With dd I got errors due to the failed sectors, and with cp I can't see how far along it is ...
    I think the best way would be to copy everything to the new HDD with rsync, then stop every service, run rsync again to pick up changed files, and reboot from the new HDD.
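
    A minimal sketch of that two-pass rsync plan, assuming the new root partition is already formatted and mounted. The mount point /mnt/newroot, the DRY_RUN switch and the helper names run and copy_pass are assumptions for illustration, not something from this thread. By default the script only prints the commands it would run. Because of -x, a separate /boot partition would need its own pass.

```shell
#!/bin/sh
# Two-pass copy sketch. SRC and DST are hypothetical; adjust before use.
SRC="${SRC:-/}"
DST="${DST:-/mnt/newroot}"

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One rsync pass; extra options (e.g. --delete) are passed through.
# -a archive, -H hard links, -A ACLs, -X xattrs, -x stay on one filesystem.
# --info=progress2 shows overall progress and needs rsync 3.1 or newer.
copy_pass() {
    run rsync -aHAXx --numeric-ids --info=progress2 "$@" "$SRC" "$DST/"
}

copy_pass            # pass 1: bulk copy while services are still running
copy_pass --delete   # pass 2: after stopping services, sync changes and removals
```

    The second pass only transfers what changed since the first one, so the window in which services are stopped should be much shorter than a full copy. Files that sit on unreadable sectors will still produce read errors and need to come from a backup.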

    Is this a good way, or are there any better ideas?

    This is how it currently looks:

    I'm totally confused about what I have to do to get the right partitions onto the new HDD. I need three partitions: /boot, /, and linux-swap.

    The problems are getting ugly ... I can't access one reseller anymore:

  • How about a fresh install and then restoring the backups from imscp?

  • This is an option I'm playing with, but there are some problems ...
    I think it's too big a job: 127 customers, 7 resellers ... I had hoped there was a simple way to create the needed partitions, copy all files, reboot, and be done ...

    Backup, reinstall, restore takes a lot of time ... there are so many websites, and downtime is only possible at night for a few hours ... I don't know what the best way is ...
    This needs a lot of time because the backups have to be created right before the reinstall ... And what about all the mail accounts? We back up /var/mail/*, but how do we restore them after a reinstall?

  • Hmm, a total reinstall of the server took me about 10 minutes, restoring the backups somewhat longer. How big is the backup you need to restore? Do you have a spare server for initial testing?

  • The problem is, I can't restore backups that I took a week ago ... I have to make new backups, copy them to another server, reinstall the server (yes, 10 min), and restore all backups.

    Creating backups of 127 accounts (around 120GB) takes a while ... plus copying them to another server, ouch ... And how do you back up the customers' mail and restore it?
    As far as I can see, a simple copy of /var/mail/* and a restore after the accounts are recreated isn't possible, or is it?

  • on the old server: tar cvzpf mail.tgz /var/mail/
    on the new server: tar xvzpf mail.tgz --same-owner -C /

    Test it first :-) (it's been a while since I did this, and I can't quickly test it myself.)
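
    One way to test it first without touching the real /var/mail is to rehearse the round trip on throwaway directories. This is only a sketch: the temp directory layout and the sample mailbox "alice" are made up for illustration, and it checks file contents only, not ownership (which would need root and matching user IDs on both servers).

```shell
#!/bin/sh
set -eu

# Scratch area standing in for the old and the new server.
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT

mkdir -p "$work/old/var/mail" "$work/new"
printf 'From: test\n\nhello\n' > "$work/old/var/mail/alice"   # hypothetical mailbox
chmod 660 "$work/old/var/mail/alice"

# "Old server": archive var/mail relative to the fake root.
tar -czpf "$work/mail.tgz" -C "$work/old" var/mail

# "New server": unpack into the other fake root, keeping permissions.
tar -xzpf "$work/mail.tgz" -C "$work/new"

# Compare both trees; diff -r exits non-zero on any difference.
if diff -r "$work/old/var/mail" "$work/new/var/mail" >/dev/null; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round-trip: $roundtrip"
```

    On the real servers the mailbox owners also have to exist on the target with the IDs the archive expects, which is why the restore only makes sense after the accounts have been recreated.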

  • I will test it too. The last time I did a migration I used imap sync, but that was easy because I am the only customer, with 3 domains, and I know all the passwords. However, I want to set up a mirror server, just in case, so I can test these kinds of scenarios.

  • P.S. If your problem is bandwidth, I can give you some temp storage on my server @ hetzner. It will speed things up dramatically for you if you are bound to slow German DSL, like me :-)

  • To be honest: your partition layout is a load of crap!

    1) 1GB /boot? Do you plan to keep 50 kernels around? And in general: what is the purpose of that extra partition?
    2) 2.7TB for / sounds massive but is in fact problematic. Why? You're running RAID1, so let's assume your first HDD breaks. Within your RAID, every partition that belongs to hdd1 has to be marked as faulty. Then you have to shut down, the HDD needs to be replaced, and after that you have to resync the f*cking 2.7TB in one go in rescue mode. That takes a while and the services are down, right? Why not use 20GB? That takes about 3 min to resync, and after the reboot the services are running because the RAID is still active. The resync of the other partitions can be done hot!
    3) Why no /var? Your logs can grow until the HDD is full!
    4) Why no /var/www/virtual? Separating web spaces from the rest is a good idea!
    5) Why no /tmp? Crap in /tmp can grow until the HDD is full!
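
    To put rough numbers on point 2: the "20GB in about 3 min" figure implies a resync rate of roughly 111 MB/s. The sketch below simply assumes that same rate for the 2.7TB partition; real rebuild speed varies, and a source disk with bad sectors will likely be slower. The function name resync_minutes is made up for illustration.

```shell
#!/bin/sh
# Rate implied by "20GB in about 3 min": 20000 MB / 180 s = 111 MB/s (integer).
rate_mb_s=$(( 20 * 1000 / (3 * 60) ))

# Estimated resync time in whole minutes for a partition of $1 GB,
# assuming the constant rate above (an assumption, not a measurement).
resync_minutes() {
    echo $(( $1 * 1000 / rate_mb_s / 60 ))
}

echo "20GB root:    $(resync_minutes 20) min"
echo "2700GB root:  $(resync_minutes 2700) min"
```

    Under that assumption the 2.7TB root needs about 405 minutes, close to seven hours in rescue mode, against about 3 minutes for a 20GB root.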

    In my opinion this is a good point to rebuild your system from backup/scratch!

    Question: is this a Hetzner server? Just asking because last year three crappy 3TB Seagate HDDs broke in my webserver within 4 weeks.