• Hard freeze due to PCIe bus error


    Hello Antergos community

    I am having a problem that started a few days ago. journalctl reveals PCIe bus errors over a span of approximately 4 hours and 20 minutes (with a few gaps of maybe a minute in the error messages during that time):

    júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
    júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
    júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
    júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
    júl 29 18:53:54 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
    júl 29 18:53:54 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
    júl 29 18:53:54 OVG-ACER kernel: pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
    júl 29 18:53:54 OVG-ACER kernel: pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
    júl 29 18:54:01 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
    júl 29 18:54:01 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
    júl 29 18:54:01 OVG-ACER kernel: pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
    júl 29 18:54:01 OVG-ACER kernel: pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
    júl 29 18:54:03 OVG-ACER kernel: pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
    -- Reboot --
    

    I have found a few different “solutions” for PCIe bus errors, but none that match my exact numbers. These solutions recommend adding a parameter to the grub configuration, usually on the GRUB_CMDLINE_LINUX_DEFAULT line.
    However, a while ago I had to change the grub configuration because of a CPU#3 soft lockup. Since I upgraded grub after that change, the file no longer looks the way it did before, and I can’t find the GRUB_CMDLINE_LINUX_DEFAULT line.

    Any suggestions are very welcome. I want so much to carry on using Antergos.

    My system:

    $ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              8
    On-line CPU(s) list: 0-7
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    NUMA node(s):        1
    Vendor ID:           AuthenticAMD
    CPU family:          23
    Model:               17
    Model name:          AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
    Stepping:            0
    CPU MHz:             1561.964
    CPU max MHz:         2000,0000
    CPU min MHz:         1600,0000
    BogoMIPS:            3994.83
    Virtualization:      AMD-V
    L1d cache:           32K
    L1i cache:           64K
    L2 cache:            512K
    L3 cache:            4096K
    NUMA node0 CPU(s):   0-7
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
    
  • Couldn’t edit my original post. The computer freezes randomly, and I am only able to use the power button to reset. Nothing else works.

  • @olividir

    You could try adding idle=nomwait to your kernel parameters.

    The mwait cpu instruction can hang a thread. This is documented in the AMD errata for certain Ryzen chips.

    If you want to try this solution, reboot the system, press e at the grub menu, and add the parameter at the end of the line that starts with linux.
    ex: linux /boot/vmlinuz-linux root=UUID=0b49f454-f823-4323-884c-c3ea434259e5 rw quiet splash idle=nomwait
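
To confirm that a parameter added this way actually reached the kernel, the running command line can be read back from /proc/cmdline after booting. A minimal sketch, assuming a POSIX shell; the helper name has_kernel_param is made up for illustration:

```shell
# Succeed if PARAM appears as a whole word on the kernel command line.
# The optional second argument is a command line to inspect; when it is
# omitted, the running kernel's command line is read from /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_kernel_param idle=nomwait; then
    echo "idle=nomwait is active"
else
    echo "idle=nomwait is NOT active"
fi
```

If this reports that the parameter is not active, it most likely was not placed on the linux line.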

  • I did try idle=nomwait but the computer still froze. I might have put the command in the wrong place though (I don’t know enough about programming), because I just made a new line at the bottom (after pressing e at startup).

    Anyway, my boot is now bricked after I tried running some LTS OSes from live USB, which doesn’t make any sense because I never attempted an install from those USBs.

    One thing I have noticed: the BIOS lists a “Windows” boot entry that doesn’t go away when I do a clean install. So when I installed Antergos over the whole disk, the BIOS saw it as “boot 2” even though there was no Windows partition.

  • I just did a clean install, and like before, there are no PCIe bus errors after clean install. However, they usually start to occur after a while. I don’t remember the exact number in days.

  • @loup001 said in Hard freeze due to PCIe bus error:

    ex: linux /boot/vmlinuz-linux root=UUID=0b49f454-f823-4323-884c-c3ea434259e5 rw quiet splash idle=nomwait

    My command line for linux /boot/vmlinuz-linux root=UUID doesn’t have splash in it (it says “quiet resume=UUID=”, followed by numbers and letters after the = sign).

    Should I just put splash idle=nomwait after quiet??? Or does the placement of the command not matter?

  • @olividir
    I think the placement doesn’t matter much, but I’d put the option in file /etc/default/grub, into value of variable GRUB_CMDLINE_LINUX_DEFAULT (inside the existing double quotes, after the word ‘quiet’).

    And then you need to run command:

    sudo grub-mkconfig -o /boot/grub/grub.cfg
    

    and reboot.
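
Before editing the file by hand, the change can be previewed. The sketch below is only an illustration, assuming /etc/default/grub contains a GRUB_CMDLINE_LINUX_DEFAULT="..." entry on a single line; add_grub_param is a hypothetical helper, and it prints the modified text without touching the file:

```shell
# Print FILE with PARAM appended inside the quotes of
# GRUB_CMDLINE_LINUX_DEFAULT. The file itself is not modified.
# PARAM must not contain the characters | or &, which sed treats specially here.
add_grub_param() {
    file=$1
    param=$2
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Preview the resulting line on the real file, if it is readable
if [ -r /etc/default/grub ]; then
    add_grub_param /etc/default/grub idle=nomwait \
        | grep '^GRUB_CMDLINE_LINUX_DEFAULT=' \
        || echo "no GRUB_CMDLINE_LINUX_DEFAULT line found"
fi
```

If the previewed line looks right, make the same change in the file with an editor and then run the grub-mkconfig command above.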

  • @olividir

    The idea was to test whether the solution works for you.

    “If you want to try solution, reboot system, type e at grub menu, add command at the end of the command line”

    But since you reinstalled and report no problems: if it ain’t broken, then …

    Manuel’s solution is the proper way to make the change permanent.

    If the problem occurs again, you can try the parameter rcu_nocbs=0-15 (it seems to affect other Ryzen models, but might still be worth a try).

    I also suggest checking whether a BIOS update is available.

  • @manuel I have already put in a command there to stop “CPU#3 soft lockup” and upgraded grub. Now the grub file looks very different, but I will give it a try.

    @loup001 A BIOS update is available, but only for Windows 10 machines (the update is done while running Windows, not by booting into the BIOS). I am already seeing PCIe errors in my log after the second boot, so I might put rcu_nocbs=0-15 in grub.

  • @manuel said in Hard freeze due to PCIe bus error:

    sudo grub-mkconfig -o /boot/grub/grub.cfg

    I did put the command in the right place, but when I upgrade grub, the terminal replies with ismod: ERROR: could not load module part_gpt: No such file or directory

    Same applies if I try the location of /etc/default/grub even if I just used nano to get there (I used copy/paste for the grub upgrade).

  • @olividir
    Strange, your system has some severe errors. You could try reinstalling grub and linux (or linux-lts) and see if it helps anything:

    sudo pacman -S linux # or/and linux-lts if you use it
    sudo pacman -S grub
    
  • @olividir
    By the way, does your machine work properly with any other operating system?

  • @manuel said in Hard freeze due to PCIe bus error:

    By the way, does your machine work properly with any other operating system?

    It worked great on the first boot to Windows (which is an operating system I don’t want to use). Antergos has been the most stable of the Linux OSes I have tried.

    I don’t know why, but the computer boots just fine (still) even if it “can’t” find grub.

    Isn’t kernel 4.17 stable? My CPU won’t boot on anything older than 4.17 (except maybe Fedora 28).

    I did try reinstalling grub and linux, put back my splash idle=nomwait parameters and tried to upgrade grub… and got the same error message again.

    The computer is running (I am using it now), so maybe it is just a bug somewhere? It hasn’t frozen yet since reinstall of Antergos.

  • @olividir
    Kernel 4.17 is marked as stable by kernel.org, and 4.14 is the LTS kernel. They are the default kernels in Antergos. Usually I use the LTS kernel, but 4.17 works here too.

    Have you tried that LTS kernel? Install command is

    sudo pacman -S linux-lts
    

    So you did reinstall and it is working so far OK?

  • I am not sure if I can use kernel 4.14 LTS, does it support AMDGPU DC Raven Ridge? I have had (more) problems with OSes running kernel 4.16 and 4.15. Usually these kernels won’t even boot from live USB’s on this laptop.

    Computer has been running fine since last fresh install, which was last Monday, but checking journalctl -f I do see the same occasional PCIe errors.
    I might reinstall again next time I am at home and add splash idle=nomwait at the same time as I add processor.max_cstate=1. The latter is a fix for CPU soft lockups / PCIe bus errors on Ryzen laptops.

    I was using the processor.max_cstate=1 fix when I had these freezes (see more info here https://forum.level1techs.com/t/ryzen-vega-laptop-pcie-bus-error/124661/60).
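
To compare boots before and after adding a parameter, the corrected-error lines can be counted rather than watched live. A rough sketch, assuming the log lines look like the ones in the first post; count_pcie_errors is a hypothetical helper that reads log text on standard input:

```shell
# Count corrected PCIe bus error lines in the log text given on stdin.
count_pcie_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function from reporting failure.
    grep -c 'PCIe Bus Error: severity=Corrected' || true
}

# Demo on two log lines copied from the first post (only the second matches)
count_pcie_errors <<'EOF'
júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7: AER: Corrected error received: id=0008
júl 29 18:53:53 OVG-ACER kernel: pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
EOF
```

On the real system the input would come from the journal, for example journalctl -k -b -1 | count_pcie_errors for the previous boot (-k limits the output to kernel messages, -b -1 selects the previous boot).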

  • Have you already checked for dirt, dust, or a bad connection inside the system?

  • @olividir
    Found this long thread on Manjaro pages: https://forum.manjaro.org/t/any-support-for-the-ryzen-2200g-and-2400g-for-linux/38079
    Seems that Ryzen has had lots of compatibility issues with Linux so far. And 4.17 kernel seems currently the best for Ryzen.

    By the way, have you checked if there are any BIOS/firmware updates for your machine?

  • @joekamprad
    I have not opened up the machine yet, it is about a month old, and not in a very dusty environment. I can however give it a try during the weekend. Like I said, looking through journalctl -f it looks much nicer now, after reinstalling Antergos, than when I had those crashes.

    @manuel There is a BIOS update I know of, but in order to install it, I would need to install Windows (I don’t know if doing so through a VM would work, as I only have 6 GB of RAM). The update comes as an .exe file.

  • 0_1533247872271_Skjámynd frá 2018-08-02 22-10-48.png

    This is what the BIOS package looks like and what README.txt says. I am very afraid to try a BIOS update at the moment because I really can’t afford another laptop for a while.
