sound tests: Test hot-unload #13

Open
christos wants to merge 1 commit from christos/hotunload into main AGit
Member

Sponsored by: The FreeBSD Foundation
MFC after: 1 week

Sponsored by: The FreeBSD Foundation MFC after: 1 week
sound tests: Test hot-unload
Some checks failed
Checklist / commit (pull_request_target) Has been cancelled
Cross-build Kernel / amd64 ubuntu-22.04 (clang-15) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 ubuntu-22.04 (clang-15) (pull_request) Has been cancelled
Cross-build Kernel / amd64 ubuntu-24.04 (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 ubuntu-24.04 (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / amd64 macos-latest (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 macos-latest (clang-18) (pull_request) Has been cancelled
Style Checker / Style Checker (pull_request) Has been cancelled
6e1ef7e9ac
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Author
Member

This is problematic when running with -v parallelism because it might unload snd_dummy when the other tests are using it. For instance:

# kyua -v parallelism=8 test
pcm_read_write:pcm_write  ->  passed  [0.017s]
pcm_read_write:pcm_read  ->  passed  [0.022s]
pcm_read_write:pcm_format_bits  ->  passed  [0.025s]
polling:poll_poll  ->  passed  [0.124s]
polling:poll_kqueue  ->  passed  [0.456s]
sndstat:sndstat_nv  ->  passed  [0.503s]
sndstat:sndstat_udev  ->  passed  [0.519s]


Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address	= 0x8
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf8d6c
stack pointer	        = 0x28:0xfffffe00c3da6930
frame pointer	        = 0x28:0xfffffe00c3da6ae0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 6246 (polling)
rdi: fffff80004509cc0 rsi: 0000000000000009 rdx: ffffffff83142323
rcx: fffff801cdfa4780  r8: 00000000000004d6  r9: deadc0dedeadc0de
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00c3da6ae0
r10: ffffffff83142323 r11: 0000000000010000 r12: 00000000000004d6
r13: 0000000000000009 r14: ffffffff83142323 r15: fffff80004509cc0
trap number		= 12
panic: page fault
cpuid = 4
time = 1776281449
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c3da6660
vpanic() at vpanic+0x13f/frame 0xfffffe00c3da6790
panic() at panic+0x43/frame 0xfffffe00c3da67f0
trap_pfault() at trap_pfault+0x422/frame 0xfffffe00c3da6860
calltrap() at calltrap+0x8/frame 0xfffffe00c3da6860
--- trap 0xc, rip = 0xffffffff80bf8d6c, rsp = 0xfffffe00c3da6930, rbp = 0xfffffe00c3da6ae0 ---
witness_checkorder() at witness_checkorder+0x7c/frame 0xfffffe00c3da6ae0
__mtx_lock_flags() at __mtx_lock_flags+0x91/frame 0xfffffe00c3da6b30
dsp_ioctl() at dsp_ioctl+0x1845/frame 0xfffffe00c3da6bb0
devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe00c3da6c00
VOP_IOCTL_APV() at VOP_IOCTL_APV+0x54/frame 0xfffffe00c3da6c30
vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe00c3da6ca0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00c3da6cc0
kern_ioctl() at kern_ioctl+0x267/frame 0xfffffe00c3da6d30
sys_ioctl() at sys_ioctl+0x143/frame 0xfffffe00c3da6e00
amd64_syscall() at amd64_syscall+0x16c/frame 0xfffffe00c3da6f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c3da6f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x214937d7d0ea, rsp = 0x214935dffa68, rbp = 0x214935dffaa0 ---
KDB: enter: panic
[ thread pid 6246 tid 100858 ]
Stopped at      kdb_enter+0x33: movq    $0,0x161c472(%rip)
db> 

That being said, I'm wondering what the solution is. We already have checks inside sound(4)'s devfs methods to see if the device has been detached (see DSP_REGISTERED()), but because these happen at the beginning of the function, it is not guaranteed that the device won't be detached at some point when a devfs method releases the lock for a while. I'm trying to figure out an elegant solution for this, because adding a DSP_REGISTERED() check after every single lock release seems way too tedious. Any thoughts?

This is problematic when running with `-v parallelism` because it might unload `snd_dummy` when the other tests are using it. For instance: ``` # kyua -v parallelism=8 test pcm_read_write:pcm_write -> passed [0.017s] pcm_read_write:pcm_read -> passed [0.022s] pcm_read_write:pcm_format_bits -> passed [0.025s] polling:poll_poll -> passed [0.124s] polling:poll_kqueue -> passed [0.456s] sndstat:sndstat_nv -> passed [0.503s] sndstat:sndstat_udev -> passed [0.519s] Fatal trap 12: page fault while in kernel mode cpuid = 4; apic id = 04 fault virtual address = 0x8 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80bf8d6c stack pointer = 0x28:0xfffffe00c3da6930 frame pointer = 0x28:0xfffffe00c3da6ae0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 6246 (polling) rdi: fffff80004509cc0 rsi: 0000000000000009 rdx: ffffffff83142323 rcx: fffff801cdfa4780 r8: 00000000000004d6 r9: deadc0dedeadc0de rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00c3da6ae0 r10: ffffffff83142323 r11: 0000000000010000 r12: 00000000000004d6 r13: 0000000000000009 r14: ffffffff83142323 r15: fffff80004509cc0 trap number = 12 panic: page fault cpuid = 4 time = 1776281449 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c3da6660 vpanic() at vpanic+0x13f/frame 0xfffffe00c3da6790 panic() at panic+0x43/frame 0xfffffe00c3da67f0 trap_pfault() at trap_pfault+0x422/frame 0xfffffe00c3da6860 calltrap() at calltrap+0x8/frame 0xfffffe00c3da6860 --- trap 0xc, rip = 0xffffffff80bf8d6c, rsp = 0xfffffe00c3da6930, rbp = 0xfffffe00c3da6ae0 --- witness_checkorder() at witness_checkorder+0x7c/frame 0xfffffe00c3da6ae0 __mtx_lock_flags() at __mtx_lock_flags+0x91/frame 0xfffffe00c3da6b30 dsp_ioctl() at dsp_ioctl+0x1845/frame 0xfffffe00c3da6bb0 devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe00c3da6c00 VOP_IOCTL_APV() at VOP_IOCTL_APV+0x54/frame 0xfffffe00c3da6c30 vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe00c3da6ca0 devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00c3da6cc0 kern_ioctl() at kern_ioctl+0x267/frame 0xfffffe00c3da6d30 sys_ioctl() at sys_ioctl+0x143/frame 0xfffffe00c3da6e00 amd64_syscall() at amd64_syscall+0x16c/frame 0xfffffe00c3da6f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c3da6f30 --- syscall (54, FreeBSD ELF64, ioctl), rip = 0x214937d7d0ea, rsp = 0x214935dffa68, rbp = 0x214935dffaa0 --- KDB: enter: panic [ thread pid 6246 tid 100858 ] Stopped at kdb_enter+0x33: movq $0,0x161c472(%rip) db> ``` That being said, I'm wondering what the solution is. We already have checks inside `sound(4)`'s devfs methods to see if the device has been detached (see `DSP_REGISTERED()`), but because these happen at the beginning of the function, it is not guaranteed that the device won't be detached at some point when a devfs method releases the lock for a while. I'm trying to figure out an elegant solution for this, because adding a `DSP_REGISTERED()` check after every single lock release seems way too tedious. Any thoughts?
Member

If you properly use destroy_dev() and devfs cdevpriv, the priv data should be intact until all threads leave the cdevsw methods.

If you properly use destroy_dev() and devfs cdevpriv, the priv data should be intact until all threads leave the cdevsw methods.
Author
Member

@kib wrote in #13 (comment):

If you properly use destroy_dev() and devfs cdevpriv, the priv data should be intact until all threads leave the cdevsw methods.

Can you elaborate on what properly means?

@kib wrote in https://ron-dev.freebsd.org/FreeBSD/src/pulls/13#issuecomment-90: > If you properly use destroy_dev() and devfs cdevpriv, the priv data should be intact until all threads leave the cdevsw methods. Can you elaborate on what properly means?
Member

At very least, if outside of the cdevsw methods, where it is handled automatically, the code must not touch device data unless dev_refthread() or like returned success.

In any case, the case in the comment should be debugged, it might be that this is just a race in driver code and not the KPI mis-use.

At very least, if outside of the cdevsw methods, where it is handled automatically, the code must not touch device data unless dev_refthread() or like returned success. In any case, the case in the comment should be debugged, it might be that this is just a race in driver code and not the KPI mis-use.
Some checks failed
Checklist / commit (pull_request_target) Has been cancelled
Cross-build Kernel / amd64 ubuntu-22.04 (clang-15) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 ubuntu-22.04 (clang-15) (pull_request) Has been cancelled
Cross-build Kernel / amd64 ubuntu-24.04 (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 ubuntu-24.04 (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / amd64 macos-latest (clang-18) (pull_request) Has been cancelled
Cross-build Kernel / aarch64 macos-latest (clang-18) (pull_request) Has been cancelled
Style Checker / Style Checker (pull_request) Has been cancelled
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin +refs/pull/13/head:christos/hotunload
git switch christos/hotunload
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
FreeBSD/src!13
No description provided.