Nous avons dû effectuer un redimensionnement de volume en direct d'un datastore NFS que VMWare utilise sur notre Netapp. Toutes nos machines virtuelles Windows se sont bien comportées après le redimensionnement. Cependant, certaines de nos VM Linux ont eu des problèmes.
Certaines VM Linux ne répondent plus. Après avoir redémarré ces VM, je n'ai rien trouvé dans les journaux indiquant un problème.
J'ai cependant trouvé ce genre de messages de journal sur certaines des VM :
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task jbd2/dm-0-8:382 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: jbd2/dm-0-8 D 0000000000000000 0 382 2 0x00000000
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce9c20 0000000000000046 ffff880037ce9be0 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce9b90 ffffffff81012b59 ffff880037ce9bd0 ffffffff8109b809
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce1af8 ffff880037ce9fd8 000000000000f4e8 ffff880037ce1af8
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task master:1674 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: master D 0000000000000000 0 1674 1 0x00000080
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003d669958 0000000000000086 ffff88003d669918 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: 0000000000000000 ffff880002216028 ffff880002215fc0 ffff88003fac2b78
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003fac30f8 ffff88003d669fd8 000000000000f4e8 ffff88003fac30f8
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task pickup:6197 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: pickup D 0000000000000000 0 6197 1674 0x00000080
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003da95968 0000000000000086 ffff88003da95928 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003da95938 ffff8800022128a0 ffff88003da95908 ffffffff81127ed0
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003d90da78 ffff88003da95fd8 000000000000f4e8 ffff88003d90da78
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880037bfd280)
May 29 14:56:02 rhel6-server-1314 kernel: sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 03 14 e8 d0 00 00 18 00
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880037bfd280)
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Beginning Domain Validation
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Domain Validation skipping write tests
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Ending Domain Validation
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
Mes questions :
- Quelqu'un sait-il quelle en est la cause ?
- Sinon, où devons-nous chercher des indices ?
- Enfin, quelqu'un sait-il comment atténuer ce problème la prochaine fois que nous devrons redimensionner un volume ?