BSoD with bugcheck 0x133 DPC_WATCHDOG_VIOLATION #663

Closed
dmiller-nmap opened this issue Mar 20, 2023 · 8 comments

Comments

@dmiller-nmap
Contributor

Reported in Npcap 1.71. "The system cumulatively spent an extended period of time at IRQL DISPATCH_LEVEL or above."

Analyzing these dumps, it appears that there is significant resource contention within the Npcap driver when opening a new capture handle while another handle is processing packets. There are packets being sent and received on several different processors at once, each of which has acquired read access to a locked resource. Opening a new handle requires write access to the resource, which means that the thread waits at DPC level for all the read accesses to stop, while each new read access is queued for when the write access is finished. The write operation itself is very fast (2 pointer writes), but because it forces this wait, it appears that the system stays at DPC level for too long, causing the bugcheck 0x133 DPC_WATCHDOG_VIOLATION.
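To make the contention pattern described above concrete, here is a minimal toy model in Python. It is not Npcap or NDIS code: the function name `writer_stall` and all timing numbers are hypothetical, and it only assumes the behavior stated in the analysis (the writer waits for every current reader, and new readers queue behind the writer).

```python
from typing import Iterable, Tuple


def writer_stall(reader_remaining_us: Iterable[float],
                 write_cost_us: float) -> Tuple[float, float]:
    """Toy model of a writer-preferring reader/writer lock.

    reader_remaining_us: how much longer each current reader will hold
        the lock (hypothetical numbers, one per processor).
    write_cost_us: how long the write itself takes once acquired.

    Returns (time the writer spins before acquiring,
             time at which queued readers can resume).
    """
    # The writer cannot proceed until the slowest current reader releases.
    spin_us = max(reader_remaining_us, default=0.0)
    # Readers that arrived during the spin are held until the write completes.
    readers_resume_us = spin_us + write_cost_us
    return spin_us, readers_resume_us


# Three readers still busy; the write itself costs about 1 microsecond.
spin, resume = writer_stall([120.0, 800.0, 40.0], write_cost_us=1.0)
print(spin, resume)
```

Even with a near-zero write cost, the stall in this model is set by the slowest reader, which is why shortening the work done under the read lock matters more than speeding up the write.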

Related: #535

@fyodor
Member

fyodor commented May 8, 2023

Just as an update, we're working on some re-engineering of the Npcap kernel driver to reduce the amount of work done while certain locks are held and to reduce the number of functions that contend for those locks. We plan to include this in the next Npcap release and expect it to resolve the issue. I also wanted to mention this possibly-related report we received by email:

We are utilizing NPCAP v1.72 on Windows Server 2019. We have a pretty sophisticated, highly optimized, timing-sensitive architecture that consists of dozens of Windows Services all running in Realtime priority, with several of these services receiving and sending packets over two separate network interfaces. To help us detect, diagnose, and correct performance-related issues, we’ve developed an internal profiling tool. Using the tool, we’ve observed that occasionally under very heavy traffic conditions NPCAP appears to lock out all our services for about 1.5ms, which is a fairly significant amount of time for our application platform.

@fyodor
Member

fyodor commented May 12, 2023

Another update: it is starting to look like this issue may be related to the Palo Alto Networks GlobalProtect product. One of our largest customers wrote that GlobalProtect "version 6.0.5 (and possibly 6.0.3) is the item pushing it over the edge for us." while another said they are seeing the problems with GlobalProtect 6.1.1. It's also worth noting that the GlobalProtect 6.0 Known Issue page says:

GPC-17099 - When the GlobalProtect app for Windows is upgraded to GlobalProtect app version 6.0.5, devices with Driver Verifier enabled and configured to monitor the PAN virtual adapter driver (pangpd.sys) display the DRIVER_VERIFIER_DETECTED_VIOLATION Blue Screen error.

This means that when the Windows Driver Verifier is enabled and monitoring the PAN virtual adapter driver, it detects serious enough violations by the 6.0.5 driver that it crashes the system with this DETECTED_VIOLATION error. When run without the Driver Verifier, these violations would still be happening but not detected. And they could be causing problems for Npcap and other drivers on the system.

I'm not yet sure whether Palo Alto Networks believes this particular problem is fixed in their 6.1.1 release, but we've been hearing about the Npcap crash from 6.1.1 users.

For anyone still experiencing this DPC_WATCHDOG_VIOLATION crash, please let us know if you're running Palo Alto Networks GlobalProtect VPN and, if so, what version. Thanks!

@fyodor
Member

fyodor commented May 18, 2023

Just as a further update, Palo Alto Networks has apparently released the GlobalProtect 6.0.5-c35 hotfix (revision date April 12, 2023) which is supposed to resolve the GPC-17099, GPC-17081, and GPC-17239 flaws which might be triggering this Npcap DPC_WATCHDOG_VIOLATION issue. If anyone experiencing the Npcap issue applies this GlobalProtect hotfix, please let us know how it goes.

In parallel, we are still working to re-engineer the Npcap kernel driver to reduce the amount of work done while certain locks are held and to reduce the number of functions that contend for those locks. That may resolve the issue even when it is triggered by other drivers.

dmiller-nmap added a commit that referenced this issue Jun 11, 2023
To improve performance by reducing lock contention and hopefully address
issue #663, this change avoids holding a filter module's
OpenInstancesLock while processing packet data. Instead, a separate lock
is used to manage the list of currently active (OpenRunning) instances,
so that listing adapters or opening new instances does not use the same
locks as processing packets.

Another change in this commit is that each packet is matched against
each BPF filter before any packet data is copied, so that the copy
operation can be done once for the maximum required snaplen instead of
in stages if later filters match larger portions of the packet than
earlier ones. Furthermore, the BpfProgramsLock is released prior to
dispatching the packet data to each instance's queue, reducing the
amount of time holding the lock (at DISPATCH_LEVEL).
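
The "filter first, copy once" idea in this commit message can be sketched in Python as follows. This is an illustration only, not the driver's code: `Filter`, `dispatch`, and the sample filters are hypothetical names, and the sketch assumes the classic BPF convention that a filter returns the number of bytes to keep, with 0 meaning no match.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a compiled BPF program: returns the snaplen
# (bytes to keep) for a matching packet, or 0 when the packet is rejected.
Filter = Callable[[bytes], int]


def dispatch(packet: bytes, filters: List[Filter]
             ) -> Tuple[Optional[bytes], List[Optional[memoryview]]]:
    # Pass 1: evaluate every filter before copying any packet data.
    snaplens = [min(f(packet), len(packet)) for f in filters]
    longest = max(snaplens, default=0)
    if longest == 0:
        return None, [None] * len(filters)  # nobody wants this packet

    # Pass 2: a single copy sized for the largest snaplen requested.
    shared = bytes(packet[:longest])
    view = memoryview(shared)

    # Each instance gets a slice of the shared copy, not a copy of its own.
    # In the driver, this hand-off is the step that would run after the
    # filter lock has been released.
    return shared, [view[:n] if n else None for n in snaplens]


packet = bytes(range(100))
filters = [lambda p: 64, lambda p: 0, lambda p: 1000]
shared, slices = dispatch(packet, filters)
print(len(shared), [None if s is None else len(s) for s in slices])
```

Copying in stages would instead copy 64 bytes for the first filter and then extend the copy when a later filter asks for more; matching first lets the copy size be decided once.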
@BlameFirewall

Greetings fyodor!

I have experienced a similar issue in my environment while trying to run GlobalProtect (6.1.1) and a new Zscaler implementation (4.1.0.102) side by side. (Disabling Zscaler prevents the issue; GP 6.1.0 and earlier seem to work OK.)


*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
	component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).
Arg4: fffff8004c8fb320, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
	additional information regarding this single DPC timeout

Debugging Details:
------------------

*** WARNING: Unable to verify timestamp for npcap.sys
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Either you specified an unqualified symbol, or your debugger   ***
***    doesn't have full symbol information.  Unqualified symbol      ***
***    resolution is turned off by default. Please either specify a   ***
***    fully qualified symbol module!symbolname, or enable resolution ***
***    of unqualified symbols by typing ".symopt- 100". Note that     ***
***    enabling unqualified symbol resolution with network symbol     ***
***    server shares in the symbol path may cause the debugger to     ***
***    appear to hang for long periods of time when an incorrect      ***
***    symbol name is typed or the network symbol server is down.     ***
***                                                                   ***
***    For some commands to work properly, your symbol path           ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: TickPeriods                                   ***
***                                                                   ***
*************************************************************************

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 5608

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 35533

    Key  : Analysis.Init.CPU.mSec
    Value: 2515

    Key  : Analysis.Init.Elapsed.mSec
    Value: 117480

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 97

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19041.1


FILE_IN_CAB:  061423-36828-01.dmp

BUGCHECK_CODE:  133

BUGCHECK_P1: 0

BUGCHECK_P2: 501

BUGCHECK_P3: 500

BUGCHECK_P4: fffff8004c8fb320

DPC_TIMEOUT_TYPE:  SINGLE_DPC_TIMEOUT_EXCEEDED

TRAP_FRAME:  fffff28b3d247650 -- (.trap 0xfffff28b3d247650)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000001 rbx=0000000000000000 rcx=ffff9b8c88657a60
rdx=fffff28b3d247938 rsi=0000000000000000 rdi=0000000000000000
rip=fffff8004bf405bc rsp=fffff28b3d2477e0 rbp=0000000000000000
 r8=0000000000000001  r9=0000000000000001 r10=fffff8004be1e680
r11=ffff9b8c87797da1 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
nt!KxWaitForSpinLockAndAcquire+0x2c:
fffff800`4bf405bc 488b07          mov     rax,qword ptr [rdi] ds:00000000`00000000=????????????????
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXNTFS: 1 (!blackboxntfs)


BLACKBOXPNP: 1 (!blackboxpnp)


BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  PanGPS.exe

STACK_TEXT:  
ffffb381`773d9e18 fffff800`4c043790     : 00000000`00000133 00000000`00000000 00000000`00000501 00000000`00000500 : nt!KeBugCheckEx
ffffb381`773d9e20 fffff800`4be813c3     : 000001eb`a1b3a67d ffffb381`773c0180 00000000`00000000 ffffb381`773c0180 : nt!KeAccumulateTicks+0x1bfb70
ffffb381`773d9e80 fffff800`4be80eaa     : ffff9b8c`40d240e0 fffff28b`3d2476d0 fffff800`51411500 00000004`00008101 : nt!KeClockInterruptNotify+0x453
ffffb381`773d9f30 fffff800`4bf3e965     : ffff9b8c`40d240e0 00000000`00000000 00000000`00000000 ffffa9db`0ee6d5f7 : nt!HalpTimerClockIpiRoutine+0x1a
ffffb381`773d9f60 fffff800`4bffdc3a     : fffff28b`3d2476d0 ffff9b8c`40d240e0 ffff9b8c`6a5d0a50 00000000`00000000 : nt!KiCallInterruptServiceRoutine+0xa5
ffffb381`773d9fb0 fffff800`4bffe407     : 00000000`00000000 00000000`00000000 00000000`00000004 00000000`00000000 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
fffff28b`3d247650 fffff800`4bf405bc     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLockNoEtw+0x37
fffff28b`3d2477e0 fffff800`4c114dfc     : ffffb381`773c0180 ffff9b8c`88657a00 ffff9b8c`8752db30 fffff28b`3d247970 : nt!KxWaitForSpinLockAndAcquire+0x2c
fffff28b`3d247810 fffff800`4be1e6e3     : ffff9b8c`88657a60 ffff9b8c`8a0418c0 ffff9b8c`88657a50 ffff9b8c`8752ca20 : nt!KiAcquireSpinLockInstrumented+0xb0
fffff28b`3d247860 fffff800`522559ae     : 00000000`00000001 00000000`00000000 ffff9b8c`8752ca20 fffff800`5281131c : nt!KeAcquireSpinLockAtDpcLevel+0x63
fffff28b`3d247890 fffff800`675659b2     : ffff9b8c`6a5d0a50 00000000`00000000 ffff9b8c`6a5d0a50 00000000`00000000 : ndis!NdisAcquireRWLockRead+0x9e
fffff28b`3d2478c0 ffff9b8c`6a5d0a50     : 00000000`00000000 ffff9b8c`6a5d0a50 00000000`00000000 fffff28b`3d247918 : npcap+0x59b2
fffff28b`3d2478c8 00000000`00000000     : ffff9b8c`6a5d0a50 00000000`00000000 fffff28b`3d247918 ffff9b8c`8752fa20 : 0xffff9b8c`6a5d0a50


SYMBOL_NAME:  npcap+59b2

MODULE_NAME: npcap

IMAGE_NAME:  npcap.sys

STACK_COMMAND:  .cxr; .ecxr ; kb

BUCKET_ID_FUNC_OFFSET:  59b2

FAILURE_BUCKET_ID:  0x133_DPC_npcap!unknown_function

OS_VERSION:  10.0.19041.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {d39a2ab6-3a99-ccd3-b018-2efbf6925d30}

Followup:     MachineOwner
---------
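
As a reading aid for the bugcheck arguments in the dump above, the following Python sketch decodes Arg2 and Arg3 from hex. The function name `dpc_overrun` is hypothetical, and the tick length is an assumption: the dump could not resolve `TickPeriods`, so the 15.625 ms default is only a commonly seen Windows clock interval, not a value taken from this crash.

```python
def dpc_overrun(arg2_hex: str, arg3_hex: str, tick_ms: float = 15.625) -> dict:
    """Decode DPC_WATCHDOG_VIOLATION Arg2/Arg3 (both given in ticks).

    tick_ms is an ASSUMED clock interval; pass the real value if known.
    """
    count = int(arg2_hex, 16)      # Arg2: DPC time count
    allotment = int(arg3_hex, 16)  # Arg3: DPC time allotment
    return {
        "count_ticks": count,
        "allotment_ticks": allotment,
        "over_by_ticks": count - allotment,
        "allotment_seconds": allotment * tick_ms / 1000.0,
    }


info = dpc_overrun("0000000000000501", "0000000000000500")
print(info["count_ticks"], info["allotment_ticks"], info["over_by_ticks"])
# prints: 1281 1280 1
```

In other words, the watchdog fired on the first tick past the allotment, which is consistent with a processor spinning on the lock for the whole window rather than a brief overrun.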

Are there any updates on this since your last update a few weeks ago? I am trying to determine a safe version to deploy before we are required to roll out Zscaler to 500+ users in short order (hopefully without causing mass chaos).

Thanks!

@fyodor
Member

fyodor commented Jun 24, 2023

As additional context, I wanted to note this Reddit thread covering the issue. The author of that post says that upgrading to Palo Alto Networks GlobalProtect 6.1.1 resolved the issue for them, but I've also heard from people (including the comment before mine) that 6.1.1 was still problematic. Meanwhile, we've made some code improvements that we're hopeful will help. While the timing is not guaranteed, I hope we will be able to release Npcap 1.76 with these improvements next week (by June 30).

@ahtivi

ahtivi commented Jul 6, 2023

Hi fyodor,

We started seeing issues after upgrading GlobalProtect from 6.0.4 to 6.1.1. We skipped 6.0.5 due to GPC-17099 (it was not listed under 6.1 some weeks back). For testing, I have updated Npcap to 1.75 on some devices, and at least for now the BSODs have stopped.
EDIT: Some devices still got BSODs with 1.75. The issue seems to be resolved with GlobalProtect 6.2.

@fyodor
Member

fyodor commented Aug 28, 2023

As another data point, we just heard from another customer who had been experiencing the issue that "After upgrading PANGP to latest 6.1 and 6.2 the issues are gone." I also wanted to note that our Npcap Version 1.76 includes some code reorganization to reduce lock contention and consolidate data copy operations. This improves performance and also reduces the chances of a crash even when "triggered" by problems in other drivers.

@fyodor
Member

fyodor commented Aug 28, 2023

I'm closing this since we don't have any more active cases. All of the customers we've heard from say that upgrading to PANGP 6.2 fixed the problem.

@fyodor fyodor closed this as completed Aug 28, 2023