@physics_gaming Okay, after all that experimentation with different builds of qemu, and even trying the same setup on different hosts such as Ubuntu, Arch, and Fedora, I finally found a way to run the VM without it crashing: I changed the CPU model to AMD EPYC instead of host-passthrough. Performance is not as good this way, but it is far more stable, with no crashes so far. Since I’m using looking-glass (https://looking-glass.hostfission.com/quickstart), I thought I had to use host-passthrough so that the guest would have access to the instruction set that allows looking-glass to work, but apparently EPYC is close enough to Ryzen in that regard.
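For anyone who wants to try the same change, here is a rough sketch of the edit in Python. It rewrites the <cpu> element of a libvirt domain XML string so that it uses a named model instead of host-passthrough. The function name set_cpu_model and the sample XML are made up for illustration and are not taken from my actual domain file; the same edit can be made by hand with virsh edit.

```python
import xml.etree.ElementTree as ET


def set_cpu_model(domain_xml: str, model: str = "EPYC") -> str:
    """Return the domain XML with <cpu> switched to a named CPU model."""
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None:
        # No <cpu> element yet: create one under <domain>.
        cpu = ET.SubElement(root, "cpu")
    # Drop old attributes such as mode='host-passthrough'.
    cpu.attrib.clear()
    cpu.set("mode", "custom")
    cpu.set("match", "exact")
    # Remove any previous <model> but keep children such as <topology>.
    for old_model in cpu.findall("model"):
        cpu.remove(old_model)
    model_el = ET.Element("model", {"fallback": "allow"})
    model_el.text = model
    cpu.insert(0, model_el)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed domain definition.
    sample = """<domain type='kvm'>
  <name>win10</name>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='2'/>
  </cpu>
</domain>"""
    print(set_cpu_model(sample))
```

The sketch only touches the <cpu> element and leaves the rest of the domain definition alone, so any topology or feature entries already there are kept.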
Sorry for the delayed response, but I have tested a bunch of different possible solutions. I created a disk image like you suggested, kept it on an HDD plugged into a separate SATA controller, and ran the VM from that, and had the same issue. I also tried a separate install of Windows 10 on a qcow2 image on another SSD, and it still locks up after a few minutes of running a game.
My IOMMU groups are valid and all passed-through hardware is isolated. I have tried with and without the ACS override patch and on different kernels (4.16, 4.16-vfio, 4.17, and 4.18rc), all resulting in the same issue. I always make sure to use the NVIDIA workarounds as described in the Arch Wiki: https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#.22Error_43:_Driver_failed_to_load.22_on_Nvidia_GPUs_passed_to_Windows_VMs (I can’t play games without them because the GeForce drivers won’t even start :p )
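In case it helps someone check the same thing, this is a minimal sketch of how the groups can be listed from sysfs in Python. It assumes the usual /sys/kernel/iommu_groups/<group>/devices/<pci-address> layout; the helper names iommu_groups and shared_groups are my own, and the PCI addresses passed to shared_groups would have to be replaced with your own.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    for group_dir in Path(root).iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                dev.name for dev in devices_dir.iterdir()
            )
    return groups


def shared_groups(groups, passthrough):
    """Return groups holding a passed-through device together with other devices."""
    wanted = set(passthrough)
    return {
        group: devices
        for group, devices in groups.items()
        if wanted & set(devices) and not set(devices) <= wanted
    }


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(group, " ".join(devices))
```

An empty result from shared_groups means every group that contains a passed-through device contains nothing else. A group that also lists a PCIe bridge will show up here even though that is often harmless, so treat the output as a hint rather than a verdict.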
After all that, I decided to try a couple of different hosts. First I tried Ubuntu as a host, using the exact same disk image for the VM with all the edits to the domain XML file, and it runs flawlessly: I played really demanding games for over 10 hours without issue. I also tried my setup on Arch and it has the exact same issue as Antergos, so I believe it is something either Arch-specific or just the version of qemu in the Arch repos. Next I will try different versions of qemu available in the AUR to see what happens.
@edwin-foss Thanks for the reply. There have been a lot of updates since the VM was created, but like I said, rolling back to a snapshot of a known good state using Timeshift made no difference. journalctl -b0 doesn’t help because the entire host locks up, so I’m looking at journal entries after a reboot. Here are the first few lines from journalctl immediately after the last crash:
Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session: GLib-GIO-CRITICAL: t+1075.66432s: g_dbus_connection_call_sync_internal: assertion ‘G_IS_DBUS_CONNECTION (connection)’ failed
Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session: WARNING: t+1075.66437s: Requesting system restart…
Jul 01 13:15:35 Matt-Linux-Desktop cinnamon-session: WARNING: t+1075.66440s: Attempting to restart using systemd…
Jul 01 13:15:35 Matt-Linux-Desktop systemd-logind: System is rebooting.
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Stopped target Bluetooth.
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Starting Generate shutdown-ramfs…
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Unmounting /mnt/Win10-KVM…
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Stopping User Manager for UID 1000…
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Removed slice system-getty.slice.
Jul 01 13:15:35 Matt-Linux-Desktop systemd: Stopping Network Manager Script Dispatcher Service…
Jul 01 13:15:35 Matt-Linux-Desktop nm-dispatcher: Caught signal 15, shutting down…
And then it goes on like that for another 230 lines over the next 3 seconds before I reboot.
About 6 weeks ago I created a Windows 10 KVM with IOMMU passthrough for gaming and it worked flawlessly until just a few days ago. I have changed nothing, either hardware or software, but now the host system locks up completely after just a few minutes of gaming in the VM. My first thought was that one of the relevant packages had started to cause the issue after an update, so I rolled my system back to a snapshot that was created (via Timeshift) immediately after I got my gaming VM up and running, but the problem still persists. I have tried removing each piece of hardware from the VM one by one, including the graphics card, until there was no physical hardware attached to the VM save for the SATA controller, which is required for it to boot (it uses a bare-metal install of Windows 10), and still the problem persists. I also made sure that each PCI device attached to the VM was using message-signaled interrupts, and this didn’t help either. I can verify the stability of the host using stress, stressapptest and Unigine Heaven, all passing with flying colors. I’m really running out of ideas as to what could be the cause of my troubles; any direction or insight would be greatly appreciated.
kernel: linux-vfio 4.16.13-1 with acs override patch
mobo: ASRock X370 Taichi
cpu: AMD Ryzen 7 1700x
host gpu: Radeon RX 480
guest gpu: GTX 1070
guest OS: Windows 10
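For anyone wanting to repeat the message-signaled interrupt check mentioned above from the host side, here is a rough Python sketch. It assumes you have saved the output of lspci -vv (run as root, while the guest is running) and that the capability lines look like "MSI: Enable+" or "MSI-X: Enable+". The function name msi_status is made up for this example.

```python
import re


def msi_status(lspci_text):
    """Map each device slot in `lspci -vv` output to whether MSI or MSI-X is enabled."""
    status = {}
    slot = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device, e.g. "01:00.0 VGA compatible controller: ..."
            slot = line.split()[0]
            status[slot] = False
        elif slot and re.search(r"\bMSI(-X)?: Enable\+", line):
            status[slot] = True
    return status


if __name__ == "__main__":
    import subprocess

    # Needs lspci (pciutils) on the host; capability details usually require root.
    output = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=True
    ).stdout
    for slot, enabled in msi_status(output).items():
        print(slot, "MSI enabled" if enabled else "MSI not enabled")
```

A device that reports "MSI not enabled" here while the guest is running is most likely still on line-based interrupts, but this only reflects what lspci shows on the host at that moment.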