• Need help troubleshooting system lock-up while running a KVM


    About 6 weeks ago I created a Windows 10 KVM guest with IOMMU passthrough for gaming, and it worked flawlessly until a few days ago. I have changed nothing, either hardware or software, but now the host system locks up completely after just a few minutes of gaming in the VM. My first thought was that one of the relevant packages had started causing the issue after an update, so I rolled my system back to a snapshot created (via Timeshift) immediately after I got my gaming VM up and running, but the problem persists.

    I have tried removing each piece of hardware from the VM one by one, including the graphics card, until no physical hardware was attached to the VM except the SATA controller, which it needs to boot (it uses a bare-metal install of Windows 10), and the problem still persists. I also made sure that each PCI device attached to the VM was using message signaled interrupts, and this didn’t help either. I can verify the stability of the host using stress, stressapptest and Unigine Heaven, all passing with flying colors. I’m really running out of ideas as to what could be the cause of my troubles; any direction or insight would be greatly appreciated.

    OS: Antergos
    kernel: linux-vfio 4.16.13-1 with acs override patch
    mobo: ASRock X370 Taichi
    cpu: AMD Ryzen 7 1700x
    host gpu: Radeon RX 480
    guest gpu: GTX 1070
    guest OS: Windows 10

  • Any guest system depends on other services and files, and in a rolling-release distro these break often. Check the pamac (updater and software manager) log for what was updated before it broke.
    Also, does journalctl -b0 say something after the VM crash?

  • @edwin-foss Thanks for the reply. There have been a lot of updates since the VM was created, but like I said, rolling back to a snapshot of a known good state using Timeshift made no difference. journalctl -b0 doesn’t help because the entire host locks up, so I’m looking at journal entries after a reboot. Here are the first few lines from journalctl immediately after the last crash:

    Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session[860]: GLib-GIO-CRITICAL: t+1075.66432s: g_dbus_connection_call_sync_internal: assertion ‘G_IS_DBUS_CONNECTION (connection)’ failed
    Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session[860]: WARNING: t+1075.66437s: Requesting system restart…
    Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session[860]: WARNING: t+1075.66440s: Attempting to restart using systemd…
    Jul 01 13:15:35 Matt-Linux-Desktop systemd-logind[694]: System is rebooting.
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Stopped target Bluetooth.
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Starting Generate shutdown-ramfs…
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Unmounting /mnt/Win10-KVM…
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Stopping User Manager for UID 1000…
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Removed slice system-getty.slice.
    Jul 01 13:15:35 Matt-Linux-Desktop systemd[1]: Stopping Network Manager Script Dispatcher Service…
    Jul 01 13:15:35 Matt-Linux-Desktop nm-dispatcher[27788]: Caught signal 15, shutting down…

    And then it goes on like that for another 230 lines over the next 3 seconds before I reboot.
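
Since the excerpt above begins at the shutdown itself, the entries just before it may be more telling. A minimal sketch, assuming the previous boot's journal was persisted to disk; `before_reboot` is a hypothetical helper name, not part of systemd:

```shell
# Hypothetical helper: read journal text on stdin and print up to N lines
# (default 50) that precede the first "System is rebooting" entry,
# followed by that entry itself.
before_reboot() {
    grep -m 1 -B "${1:-50}" 'System is rebooting'
}
```

For example, `journalctl -b -1 --no-pager | before_reboot 100` would show the 100 lines leading up to the reboot request in the previous boot's journal.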

  • @physics_gaming

    Hmmmm… After reading your posts again, did you try to generate an image from your Windows disk and run the VM from it? I know it is a shot in the dark, but after removing all passthrough devices the disk could be out of sync, with the partition drivers failing to read and write and hard-locking the system.

    Also, IOMMU groups must be passed to guest machines in their entirety; passing a single device from a group can cause instability. To get your IOMMU group information, I found a script on the InstallGentoo wiki (https://wiki.installgentoo.com/index.php/PCI_passthrough) that collects what is needed:
    for iommu_group in $(find /sys/kernel/iommu_groups/ -maxdepth 1 -mindepth 1 -type d); do
        echo "IOMMU group $(basename "$iommu_group")"
        for device in $(ls -1 "$iommu_group"/devices/); do
            echo -n $'\t'
            lspci -nns "$device"
        done
    done
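
To go with the group listing above, it can help to see which kernel driver each device in a group is currently bound to; while the VM is running, devices handed to the guest would normally show `vfio-pci`. A rough sketch, assuming the standard sysfs layout; `list_group_drivers` and its optional root-directory argument are hypothetical, added only for illustration and testing:

```shell
# Hypothetical helper: print "group<TAB>device<TAB>driver" for every device
# in every IOMMU group. The root directory can be overridden (first argument);
# it defaults to the real sysfs location.
list_group_drivers() {
    root="${1:-/sys/kernel/iommu_groups}"
    for group in "$root"/*; do
        # Skip anything that is not a group directory (including an unmatched glob).
        [ -d "$group/devices" ] || continue
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            # The "driver" entry is a symlink to the bound driver, if any.
            if [ -L "$dev/driver" ]; then
                drv=$(basename "$(readlink "$dev/driver")")
            else
                drv="none"
            fi
            printf '%s\t%s\t%s\n' "$(basename "$group")" "$(basename "$dev")" "$drv"
        done
    done
}
```

Calling `list_group_drivers` with no arguments reads the host's real groups; any device in a passed-through group that reports a driver other than `vfio-pci` while the guest is up would be worth a closer look.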

    I also found some information about NVIDIA in guests: if the driver detects that it is running inside a VM, it will disable itself. See here: https://davidyat.es/2016/09/08/gpu-passthrough/

  • Sorry for the delayed response, but I have tested a bunch of different possible solutions. I created a disk image like you suggested, kept it on a HDD plugged into a separate SATA controller, ran the VM from that, and had the same issue. I also tried a separate install of Windows 10 on a qcow2 image on another SSD, and it still locks up after a few minutes of running a game.

    My IOMMU groups are valid and all passed-through hardware is isolated. I have tried with and without the ACS override patch and on different kernels (4.16, 4.16-vfio, 4.17, and 4.18rc), all resulting in the same issue. I always make sure to use the NVIDIA workarounds as described in the Arch Wiki: https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#.22Error_43:_Driver_failed_to_load.22_on_Nvidia_GPUs_passed_to_Windows_VMs (I can’t play games without them because the GeForce drivers won’t even start :p )

    After all that, I decided to try a couple of different hosts. First I tried Ubuntu as a host, using the exact same disk image for the VM with all the edits to the domain’s XML file, and it runs flawlessly: I played for over 10 hours on really demanding games without issue. I also tried my setup on Arch and it has the exact same issue as Antergos, so I believe it is something either Arch-specific or just the version of qemu in the Arch repos. Next I will try different versions of qemu available in the AUR to see what happens.

  • @physics_gaming Okay, after all that experimentation with different builds of qemu and even trying the same setup with different hosts such as Ubuntu, Arch, Fedora, I finally found a way to run the VM without it crashing: I changed the CPU model to AMD EPYC instead of host-passthrough. The performance is not as good this way, but it makes things way more stable, with no crashes so far. Since I’m using looking-glass (https://looking-glass.hostfission.com/quickstart) I thought I had to use host-passthrough so that the guest would have access to the instruction set that allows looking-glass to work, but apparently EPYC is close enough to Ryzen in that regard.
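
For anyone wanting to check which CPU configuration a domain is actually using, here is a small sketch. It assumes the VM is managed through libvirt and that the XML looks like the output of `virsh dumpxml`; `cpu_mode` and `cpu_model` are hypothetical helper names, and the domain name `win10` in the usage note is a placeholder:

```shell
# Hypothetical helpers: both read a libvirt domain XML on stdin.

# Print the value of the mode attribute on the <cpu> element,
# e.g. "host-passthrough" or "custom".
cpu_mode() {
    sed -n "s/.*<cpu mode=['\"]\([a-z-]*\)['\"].*/\1/p"
}

# Print the first named model with text content, e.g. "EPYC".
# Self-closing elements such as <model type='virtio'/> do not match.
cpu_model() {
    sed -n "s/.*<model[^>]*>\([^<]*\)<\/model>.*/\1/p" | head -n 1
}
```

With a domain called `win10` (placeholder), `virsh dumpxml win10 | cpu_mode` would be expected to print `host-passthrough` before the change described above and `custom` after selecting a named model.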
