Removing a failing drive from an LVM volume group... and recovering partial data from an incomplete LV (missing PV)

Question

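I have been fighting this problem for a while now.

I have a logical volume that spans three disks: 1.5TB, 2TB and 3TB. The 1.5TB drive is failing, with lots of I/O errors and dead sectors. I started pvmove to move the existing extents from the failing drive to the 3TB drive, which has plenty of free space. It moved 99% of the extents, but the last percent seems to be unreadable: the read fails and pvmove aborts.

This is the current state: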

    root@server:~# pvdisplay
      /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301819904: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301901824: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 4096: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      Couldn't find device with uuid hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK.
      --- Physical volume ---
      PV Name               /dev/sda # old, working drive
      VG Name               lvm_group1
      PV Size               1.82 TiB / not usable 1.09 MiB
      Allocatable           yes (but full)
      PE Size               4.00 MiB
      Total PE              476932
      Free PE               0
      Allocated PE          476932
      PV UUID               FEoDYU-Lhjf-FdI1-Ei5p-koue-PIma-TGvs9A

      --- Physical volume ---
      PV Name               /dev/sdd1 # old failing drive
      VG Name               lvm_group1
      PV Size               1.36 TiB / not usable 2.40 MiB
      Allocatable           NO
      PE Size               4.00 MiB
      Total PE              357699
      Free PE               357600
      Allocated PE          99
      PV UUID               hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK

      --- Physical volume ---
      PV Name               /dev/sdf # new drive
      VG Name               lvm_group1
      PV Size               2.73 TiB / not usable 4.46 MiB
      Allocatable           yes
      PE Size               4.00 MiB
      Total PE              715396
      Free PE               357746
      Allocated PE          357650
      PV UUID               qs4BVK-PAPv-I1DG-x5wJ-dRNq-vhBE-wQeJL6

Here is what pvmove says:

    root@server:~# pvmove /dev/sdd1:335950-336500 /dev/sdf --verbose
      Finding volume group "lvm_group1"
      Archiving volume group "lvm_group1" metadata (seqno 93).
      Creating logical volume pvmove0
      Moving 50 extents of logical volume lvm_group1/cryptex
      Found volume group "lvm_group1"
      activation/volume_list configuration setting not defined: Checking only Host tags for lvm_group1/cryptex
      Updating volume group metadata
      Found volume group "lvm_group1"
      Found volume group "lvm_group1"
      Creating lvm_group1-pvmove0
      Loading lvm_group1-pvmove0 table (253:2)
      Loading lvm_group1-cryptex table (253:0)
      Suspending lvm_group1-cryptex (253:0) with device flush
      Suspending lvm_group1-pvmove0 (253:2) with device flush
      Found volume group "lvm_group1"
      activation/volume_list configuration setting not defined: Checking only Host tags for lvm_group1/pvmove0
      Resuming lvm_group1-pvmove0 (253:2)
      Found volume group "lvm_group1"
      Loading lvm_group1-pvmove0 table (253:2)
      Suppressed lvm_group1-pvmove0 identical table reload.
      Resuming lvm_group1-cryptex (253:0)
      Creating volume group backup "/etc/lvm/backup/lvm_group1" (seqno 94).
      Checking progress before waiting every 15 seconds
      /dev/sdd1: Moved: 4.0%
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      No physical volume label read from /dev/sdd1
      Physical volume /dev/sdd1 not found
      ABORTING: Can't reread PV /dev/sdd1
      ABORTING: Can't reread VG for /dev/sdd1

Only 99 extents are still left on the failing drive. I am fine with losing that data. I just want to pull this drive and throw it away without losing the data on the other drives.
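To put a number on what is at stake, the pvdisplay figures above can simply be multiplied out. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the values are copied from the output for /dev/sdd1 and /dev/sdf.

```shell
#!/bin/sh
# Values copied from the pvdisplay output above (variable names are illustrative).
pe_size_mib=4          # "PE Size 4.00 MiB"
failing_alloc_pe=99    # "Allocated PE 99" on /dev/sdd1
target_free_pe=357746  # "Free PE 357746" on /dev/sdf

# Upper bound on the data still stranded on the failing PV.
stranded_mib=$((pe_size_mib * failing_alloc_pe))
echo "Data still on /dev/sdd1: at most ${stranded_mib} MiB"

# Space is not the problem: the new drive has far more free extents than needed.
if [ "$target_free_pe" -ge "$failing_alloc_pe" ]; then
    echo "/dev/sdf has room for the remaining extents"
fi
```

So the worst case is a little under 400 MiB of the LV being unreadable, and the blocker is the read errors rather than free space on the new drive.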

So I tried pvremove:

    root@server:~# pvremove /dev/sdd1
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      No physical volume label read from /dev/sdd1
      Physical Volume /dev/sdd1 not found

And then vgreduce:

    root@server:~# vgreduce lvm_group1 --removemissing
      /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301819904: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 1500301901824: Input/output error
      /dev/sdd: read failed after 0 of 4096 at 4096: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300771328: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 1500300853248: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdd1: read failed after 0 of 4096 at 4096: Input/output error
      Couldn't find device with uuid hFhfbQ-4cuW-CSlE-qhfO-GNl8-Jvt7-4nZTWK.
      WARNING: Partial LV cryptex needs to be repaired or removed.
      WARNING: Partial LV pvmove0 needs to be repaired or removed.
      There are still partial LVs in VG lvm_group1.
      To remove them unconditionally use: vgreduce --removemissing --force.
      Proceeding to remove empty missing PVs.

pvdisplay still shows the failing drive...

Any ideas?

Sniku · Accepted Answer

In the end I solved this by manually editing /etc/lvm/backup/lvm_group1.

In case anyone else runs into this problem, here are the steps:

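1. Physically removed the dead drive from the server.
2. Ran vgreduce lvm_group1 --removemissing --force.
3. Removed the dead drive from the config.
4. Added another stripe on the "good" drive in place of the extents that could not be read from the failing drive.
5. Ran vgcfgrestore -f edited_config_file.cfg lvm_group1.
6. Rebooted.
Voilà! The drive is visible and can be mounted.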
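The steps above can be sketched as a print-only checklist. This is a rough outline under stated assumptions, not the exact commands the author ran: the script only echoes the commands, the cp step (keeping a copy of the metadata backup before vgreduce touches the VG) is my own precaution and is not mentioned in the answer, and the PV name and extent offset in the comment are hypothetical placeholders.

```shell
#!/bin/sh
# Print-only sketch of the steps above; nothing is executed.
VG="lvm_group1"
BACKUP="/etc/lvm/backup/$VG"
EDITED="edited_config_file.cfg"   # file name used in the answer

plan() { echo "+ $*"; }

recover_vg() {
    # Assumption (not in the answer): copy the metadata backup first and
    # edit the copy, since later LVM commands may rewrite the backup file.
    plan cp "$BACKUP" "$EDITED"

    # Step 2, run after the dead drive has been physically removed.
    plan vgreduce "$VG" --removemissing --force

    # Steps 3-4 are a manual edit of $EDITED: delete the dead PV's entry and
    # point the unreadable segments at free extents on a healthy PV, e.g.
    #   stripes = [ "pv2", 357650 ]   # hypothetical PV name and offset
    # Those extents will not contain the lost data.

    # Step 5: write the edited metadata back to the VG.
    plan vgcfgrestore -f "$EDITED" "$VG"

    # Step 6 in the answer is a reboot; it is left out of this sketch.
}

recover_vg
```

Printing instead of executing is deliberate: the manual edit between vgreduce and vgcfgrestore is the part that matters, and it cannot be scripted safely without knowing the real segment layout.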

It took me 4 days of learning the ins and outs of LVM to solve this...

So far it looks good. No errors. Happy camper.

Joachim Wagner · Answer

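If you can stop LVM temporarily (and close the underlying LUKS containers, if you use them), an alternative solution is to copy as much as possible of the PV (or the underlying LUKS container) to a good disk with GNU ddrescue, and then remove the old disk before restarting LVM.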

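I like Sniku's LVM solution, but ddrescue may be able to recover more data than pvmove.

(The reason to stop LVM is that LVM supports multipath and would balance write operations between a pair of PVs with identical UUIDs as soon as it discovers them. In addition, you want LVM and LUKS stopped so that all recently written data is visible on the underlying devices. Rebooting the system and not entering the LUKS passwords is the easiest way to make sure of that.)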
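As a rough illustration of the ddrescue route, here is a print-only sketch. The target device /dev/sdX1 and the mapfile name rescue.map are hypothetical placeholders, the two-pass pattern is a common way to use GNU ddrescue rather than something stated in this answer, and the LV has to be unmounted (and any LUKS mapping closed) before the VG can be deactivated.

```shell
#!/bin/sh
# Print-only sketch of the ddrescue route; nothing is executed.
VG="lvm_group1"
SRC="/dev/sdd1"        # failing PV from the question
DST="/dev/sdX1"        # hypothetical target, at least as large as SRC
MAP="rescue.map"       # hypothetical ddrescue mapfile name

plan() { echo "+ $*"; }

ddrescue_plan() {
    # Deactivate the VG so LVM never sees both copies of the PV in use.
    plan vgchange -an "$VG"
    # First pass: copy everything that reads easily, skip the scraping phase.
    plan ddrescue -f -n "$SRC" "$DST" "$MAP"
    # Second pass: retry the bad areas up to 3 times using direct disc access.
    plan ddrescue -f -d -r3 "$SRC" "$DST" "$MAP"
    # Remove the old disk physically before this point, then reactivate.
    plan vgchange -ay "$VG"
}

ddrescue_plan
```

The mapfile lets ddrescue resume and retry without re-reading areas it has already copied, which matters on a drive that keeps dropping off the bus.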