BSoD with bugcheck 0x133 DPC_WATCHDOG_VIOLATION #663
Just as an update, we're working on some re-engineering of the Npcap kernel driver to reduce the amount of work done while certain locks are held and to reduce the number of functions that contend for those locks. We plan to include this in the next Npcap release and expect it to resolve the issue. I also wanted to mention a possibly-related report we received by email.
Another update: it is starting to look like this issue may be related to the Palo Alto Networks GlobalProtect product. One of our largest customers wrote that GlobalProtect "version 6.0.5 (and possibly 6.0.3) is the item pushing it over the edge for us," while another said they are seeing the problems with GlobalProtect 6.1.1. It's also worth noting that the GlobalProtect 6.0 Known Issues page says:
This means that when the Windows Driver Verifier is enabled and monitoring the PAN virtual adapter driver, it detects violations by the 6.0.5 driver serious enough that it crashes the system with this DETECTED_VIOLATION error. When run without the Driver Verifier, these violations would still be happening, just not detected, and they could be causing problems for Npcap and other drivers on the system. I'm not yet sure whether Palo Alto Networks believes this particular problem is fixed in their 6.1.1 release, but we've been hearing about the Npcap crash from 6.1.1 users. For anyone still experiencing this DPC_WATCHDOG_VIOLATION crash, please let us know if you're running Palo Alto Networks GlobalProtect VPN and, if so, what version. Thanks!
Just as a further update, Palo Alto Networks has apparently released the GlobalProtect 6.0.5-c35 hotfix (revision date April 12, 2023), which is supposed to resolve the GPC-17099, GPC-17081, and GPC-17239 flaws that might be triggering this Npcap DPC_WATCHDOG_VIOLATION issue. If anyone experiencing the Npcap issue applies this GlobalProtect hotfix, please let us know how it goes. In parallel, we are still working to re-engineer the Npcap kernel driver to reduce the amount of work done while certain locks are held and to reduce the number of functions that contend for those locks. That may resolve the issue even when it is triggered by other drivers.
To improve performance by reducing lock contention and hopefully address issue #663, this change avoids holding a filter module's OpenInstancesLock while processing packet data. Instead, a separate lock is used to manage the list of currently active (OpenRunning) instances, so that listing adapters or opening new instances does not contend with packet processing. Another change in this commit is that each packet is matched against each BPF filter before any packet data is copied, so the copy can be done once for the maximum required snaplen rather than in stages when later filters match larger portions of the packet than earlier ones. Furthermore, the BpfProgramsLock is released prior to dispatching the packet data to each instance's queue, reducing the time the lock is held (at DISPATCH_LEVEL).
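To make the commit description concrete, here is a minimal user-mode sketch of the restructured dispatch path, assuming the pattern described above. pthread read-write locks stand in for the driver's NDIS locking primitives, and every name in it (open_instance_t, bpf_match, enqueue_copy, running_list_lock, bpf_programs_lock) is illustrative rather than an actual Npcap symbol:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INSTANCES 64

/* Illustrative stand-in for a per-handle capture instance; not an actual
 * Npcap structure. bpf_match returns the snaplen the program wants, or 0. */
typedef struct open_instance {
    size_t (*bpf_match)(const unsigned char *pkt, size_t len);
    /* per-instance packet queue omitted */
} open_instance_t;

static open_instance_t *instances[MAX_INSTANCES];
static size_t instance_count;

/* Separate locks, as in the commit message: one for the list of running
 * instances, one for the BPF programs themselves. */
static pthread_rwlock_t running_list_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t bpf_programs_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for handing the copied bytes to one instance's ring buffer. */
static void enqueue_copy(open_instance_t *inst,
                         const unsigned char *data, size_t len)
{
    (void)inst; (void)data; (void)len; /* queue logic omitted */
}

void on_packet(const unsigned char *pkt, size_t pkt_len)
{
    size_t snaps[MAX_INSTANCES];
    size_t max_snap = 0;

    /* Keep the running-instance list stable; packet processing no longer
     * touches the lock used for opening handles or listing adapters. */
    pthread_rwlock_rdlock(&running_list_lock);

    /* Pass 1: run every filter under the programs lock, recording each
     * match so the copy can be sized once for the largest snaplen. */
    pthread_rwlock_rdlock(&bpf_programs_lock);
    for (size_t i = 0; i < instance_count; i++) {
        size_t snap = instances[i]->bpf_match(pkt, pkt_len);
        if (snap > pkt_len)
            snap = pkt_len;
        snaps[i] = snap;
        if (snap > max_snap)
            max_snap = snap;
    }
    /* Release the programs lock before any copying or dispatching. */
    pthread_rwlock_unlock(&bpf_programs_lock);

    if (max_snap > 0) {
        /* One copy for the maximum required snaplen, not one per filter. */
        unsigned char *copy = malloc(max_snap);
        if (copy != NULL) {
            memcpy(copy, pkt, max_snap);
            /* Pass 2: dispatch to each matching instance's queue. */
            for (size_t i = 0; i < instance_count; i++) {
                if (snaps[i] > 0)
                    enqueue_copy(instances[i], copy, snaps[i]);
            }
            free(copy);
        }
    }

    pthread_rwlock_unlock(&running_list_lock);
}
```

The key point of the design is that the only work done while bpf_programs_lock is held is filter evaluation; the copy and the per-queue dispatch both happen after it is released.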
Greetings fyodor! I have experienced a similar issue in my environment while trying to run GlobalProtect 6.1.1 and a new Zscaler implementation (4.1.0.102) side by side. (Disabling Zscaler prevents the issue; GP 6.1.0 and earlier seem to work OK.)
Are there any updates since your last one a few weeks ago? I am trying to determine a safe version to deploy before we are required to roll out Zscaler to 500+ users in short order (hopefully without causing mass chaos). Thanks!
As additional context, I wanted to note this Reddit thread covering the issue. The author of that post says that upgrading to Palo Alto Networks GlobalProtect 6.1.1 resolved the issue for them, but I've also heard from people (including the commenter above) that 6.1.1 was still problematic. Meanwhile, we've made some code improvements that we're hopeful will help. While the timing is not guaranteed, I hope we will be able to release Npcap 1.76 with these improvements next week (by June 30).
Hi fyodor, we started seeing issues after upgrading GlobalProtect from 6.0.4 to 6.1.1. We skipped 6.0.5 due to GPC-17099 (it was not listed under 6.1 some weeks back). For testing I have updated Npcap to 1.75 on some devices, and at least for now the BSODs have stopped.
As another data point, we just heard from another customer who had been experiencing the issue that "After upgrading PANGP to latest 6.1 and 6.2 the issues are gone." I also wanted to note that our Npcap version 1.76 includes some code reorganization to reduce lock contention and consolidate data copy operations. This improves performance and also reduces the chance of a crash even when "triggered" by problems in other drivers.
I'm closing this since we don't have any more active cases. All of the customers we've heard from say that upgrading to PANGP 6.2 fixed the problem. |
Reported in Npcap 1.71. "The system cumulatively spent an extended period of time at IRQL DISPATCH_LEVEL or above."
Analyzing these dumps, it appears that there is significant resource contention within the Npcap driver when a new capture handle is opened while other handles are processing packets. Packets are being sent and received on several processors at once, with each processor holding read access to a shared locked resource. Opening a new handle requires write access to that resource, so the opening thread waits at DPC level for all read accesses to finish while each new read access queues behind the pending write. The write operation itself is very fast (two pointer writes), but because it forces this wait, the system stays at DPC level too long, causing bugcheck 0x133 DPC_WATCHDOG_VIOLATION.
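For readers unfamiliar with this failure mode, here is a minimal sketch of the contended pattern the analysis describes, assuming the reader/writer behavior above. pthread rwlocks again stand in for the driver's locking primitives, and all names (open_instances_lock, open_instance_t, process_for_each_instance) are purely illustrative, not actual Npcap symbols:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-in for a per-handle capture instance. */
typedef struct open_instance {
    struct open_instance *next;
    /* per-handle capture state omitted */
} open_instance_t;

static open_instance_t *instance_list;
static pthread_rwlock_t open_instances_lock = PTHREAD_RWLOCK_INITIALIZER;

static void process_for_each_instance(const unsigned char *pkt, size_t len)
{
    (void)pkt; (void)len; /* filter evaluation and data copy omitted */
}

/* Runs concurrently on several processors; in the real driver this is at
 * DISPATCH_LEVEL, so none of the work here can be preempted. */
void packet_path(const unsigned char *pkt, size_t len)
{
    pthread_rwlock_rdlock(&open_instances_lock);
    /* In the contended design, filter matching and the packet-data copy
     * all happen here, so the read lock is held for the full duration. */
    process_for_each_instance(pkt, len);
    pthread_rwlock_unlock(&open_instances_lock);
}

/* Opening a capture handle: the write itself is only two pointer updates,
 * but taking the write lock means waiting for every in-flight reader while
 * new readers queue behind the writer. That accumulated wait at raised IRQL
 * is what trips the 0x133 DPC watchdog on a busy system. */
void open_new_handle(open_instance_t *inst)
{
    pthread_rwlock_wrlock(&open_instances_lock);
    inst->next = instance_list;  /* pointer write #1 */
    instance_list = inst;        /* pointer write #2 */
    pthread_rwlock_unlock(&open_instances_lock);
}
```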
Related: #535