This is a deep-ish dive into the RISC-V privileged specification and the Linux kernel's syscall implementation.
Privileged specification tour
To keep it short, there are 3 privilege levels, each with a 2-bit encoding:
- U (user) : 0
- S (supervisor) : 1
- Reserved : 2
- M (Machine) : 3
The spec describes them as follows:
At any time, a RISC-V hardware thread (hart) is running at some privilege level encoded as a mode in one or more CSRs (control and status registers).
And
All hardware implementations must provide M-mode, as this is the only mode that has unfettered access to the whole machine. The simplest RISC-V implementations may provide only M-mode, though this will provide no protection against incorrect or malicious application code
CSRs
Chapter 2 describes two classes of instructions:
The SYSTEM major opcode is used to encode all privileged instructions in the RISC-V ISA. These can be divided into two main classes: those that atomically read-modify-write control and status registers (CSRs), which are defined in the Zicsr extension, and all other privileged instructions.
The spec defines a 12-bit encoding for CSR addresses, csr[11:0], and the top 4 bits (csr[11:8]) encode access permission and privilege:
- [11:10] read/write (00, 01, 10) or read-only (11)
- [9:8] lowest privilege level that can access the register
Table 2.1 describes the CSR address ranges. For example, this row (columns: csr[11:10], csr[9:8], csr[7:4], hex range, use):
11 11 0XXX 0xF00-0xF7F Standard read-only
Table 2.5 describes each of these CSRs. For example (columns: number, privilege, name, description; MRO means machine-level, read-only):
0xF11 MRO mvendorid Vendor ID
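To make the address encoding concrete, here is a minimal sketch in C that decodes bits [11:10] and [9:8] of a CSR address. The names priv_level, csr_is_read_only and csr_min_priv are made up for this post; they are not taken from the spec or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Privilege encodings from the spec: U = 0, S = 1, 2 is reserved, M = 3. */
enum priv_level { PRIV_U = 0, PRIV_S = 1, PRIV_RESERVED = 2, PRIV_M = 3 };

/* csr[11:10] == 0b11 marks a read-only CSR; the other values are read/write. */
static bool csr_is_read_only(uint16_t csr)
{
    assert(csr <= 0xFFF); /* CSR addresses are 12 bits wide */
    return ((csr >> 10) & 0x3) == 0x3;
}

/* csr[9:8] is the lowest privilege level that may access the CSR. */
static enum priv_level csr_min_priv(uint16_t csr)
{
    assert(csr <= 0xFFF);
    return (enum priv_level)((csr >> 8) & 0x3);
}
```

For mvendorid (0xF11), csr_is_read_only returns true and csr_min_priv returns PRIV_M, which matches the MRO entry in the table.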
Note: the CSR instructions themselves are defined in the unprivileged spec, chapter 9, “Zicsr”, Control and Status Register (CSR) Instructions:
RISC-V defines a separate address space of 4096 Control and Status registers associated with each hart. This chapter defines the full set of CSR instructions that operate on these CSRs.
Machine-level ISA
Section 3.1 defines the CSRs (and their required fields) for M-mode. For example, mvendorid:
The mvendorid CSR is a 32-bit read-only register providing the JEDEC manufacturer ID of the provider of the core. This register must be readable in any implementation, but a value of 0 can be returned to indicate the field is not implemented or that this is a non-commercial implementation.
Then section 3.3 lists the machine-level instructions:
- ECALL
- MRET
- WFI
I will just quote the ECALL description for reference:
The ECALL instruction is used to make a request to the supporting execution environment. When executed in U-mode, S-mode, or M-mode, it generates an environment-call-from-U-mode exception, environment-call-from-S-mode exception, or environment-call-from-M-mode exception, respectively, and performs no other operation.
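The three exceptions have distinct cause codes (8, 9 and 11 in the spec's exception code table; 10 is reserved), so the trap handler can tell where the call came from. A small sketch, assuming the privilege encodings listed earlier (U = 0, S = 1, M = 3); the function name ecall_cause_for is hypothetical:

```c
#include <assert.h>

/* Exception codes reported in mcause/scause for an ecall trap. */
enum ecall_cause {
    CAUSE_ECALL_FROM_U = 8,
    CAUSE_ECALL_FROM_S = 9,
    CAUSE_ECALL_FROM_M = 11,
};

/* Map the privilege level that executed ecall to its exception code. */
static int ecall_cause_for(int priv)
{
    assert(priv == 0 || priv == 1 || priv == 3); /* 2 is reserved */
    return CAUSE_ECALL_FROM_U + priv;
}
```

A Linux syscall from user space is the first case: an ecall executed in U-mode, reported with exception code 8.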
Deep Dive into Linux kernel
Now that we have had a quick tour through the specs, let’s see it in action!
Kernel CSR definitions
include/asm/csr.h defines the CSRs. I didn’t find mvendorid, but I thought mstatus is important enough to highlight.
#define CSR_MSTATUS 0x300
For example, arch/riscv/kernel/entry.S uses one of these CSR definitions as follows:
csrw CSR_SCRATCH, x0
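Note that the code names CSR_SCRATCH rather than mscratch or sscratch. As far as I can tell, csr.h aliases such mode-neutral names to the M-mode or S-mode register depending on CONFIG_RISCV_M_MODE. The following is a simplified sketch of that idea, not a verbatim copy of the header; the CSR numbers come from the privileged spec, and scratch_csr_number is a helper I made up for illustration.

```c
#include <assert.h>

/* CSR numbers from the privileged spec. */
#define CSR_STVEC     0x105
#define CSR_SSCRATCH  0x140
#define CSR_MTVEC     0x305
#define CSR_MSCRATCH  0x340

/* Pick the M-mode or S-mode register behind the mode-neutral names. */
#ifdef CONFIG_RISCV_M_MODE
#define CSR_TVEC     CSR_MTVEC
#define CSR_SCRATCH  CSR_MSCRATCH
#else
#define CSR_TVEC     CSR_STVEC
#define CSR_SCRATCH  CSR_SSCRATCH
#endif

/* Hypothetical helper: the CSR number that "csrw CSR_SCRATCH, x0" targets. */
static int scratch_csr_number(void)
{
    /* Both candidates share the low byte; only the privilege bits differ. */
    assert((CSR_MSCRATCH & 0xFF) == (CSR_SSCRATCH & 0xFF));
    return CSR_SCRATCH;
}
```

Built without CONFIG_RISCV_M_MODE, scratch_csr_number() returns 0x140, i.e. sscratch.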
Exception handler and mret
I will start with handle_exception, which ends by executing mret, and work my way backward.
arch/riscv/kernel/entry.S defines handle_exception, which handles syscalls (among other exceptions).
ENTRY(handle_exception)
Naturally, it has to return with mret (or sret, when the kernel is not built for M-mode).
#ifdef CONFIG_RISCV_M_MODE
mret
#else
sret
#endif
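To see why mstatus matters here, this is a rough C model of what mret does according to the privileged spec: it restores the privilege level saved in mstatus.MPP, restores the interrupt enable from MPIE, and jumps to mepc. It is a simplified sketch (it ignores MPRV and assumes U-mode is implemented), and the struct and macro names are mine, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* mstatus fields used by mret (bit positions from the privileged spec). */
#define MSTATUS_MIE        (1ULL << 3)
#define MSTATUS_MPIE       (1ULL << 7)
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP        (3ULL << MSTATUS_MPP_SHIFT)

/* Hypothetical, minimal model of the hart state that mret touches. */
struct hart_model {
    uint64_t mstatus;
    uint64_t mepc;
    uint64_t pc;
    unsigned priv; /* 0 = U, 1 = S, 3 = M */
};

static void model_mret(struct hart_model *h)
{
    assert(h->priv == 3); /* mret is only legal in M-mode */
    unsigned mpp = (unsigned)((h->mstatus & MSTATUS_MPP) >> MSTATUS_MPP_SHIFT);

    /* MIE takes the value saved in MPIE when the trap was taken. */
    if (h->mstatus & MSTATUS_MPIE)
        h->mstatus |= MSTATUS_MIE;
    else
        h->mstatus &= ~MSTATUS_MIE;

    h->mstatus |= MSTATUS_MPIE;  /* MPIE is set to 1 */
    h->mstatus &= ~MSTATUS_MPP;  /* MPP is reset to U (0) */
    h->priv = mpp;               /* drop to the saved privilege level */
    h->pc = h->mepc;             /* resume at the saved pc */
}
```

sret works the same way on the S-mode copies of these fields (SPP, SPIE, SIE and sepc).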
And handle_exception is installed as the trap handler in _start, defined in arch/riscv/kernel/head.S, through setup_trap_vector. As far as I remember, _start is called from the boot code (to revisit later).
call setup_trap_vector
tail smp_callin
#endif /* CONFIG_SMP */
.align 2
setup_trap_vector:
/* Set trap vector to exception handler */
la a0, handle_exception
csrw CSR_TVEC, a0
/*
* Set sup0 scratch register to 0, indicating to exception vector that
* we are presently executing in kernel.
*/
csrw CSR_SCRATCH, zero
ret
And from the spec, for the M-mode trap vector register (mtvec):
The mtvec register is an MXLEN-bit WARL read/write register that holds trap vector configuration, consisting of a vector base address (BASE) and a vector mode (MODE)
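In the spec, MODE lives in the two low bits and BASE in the remaining upper bits, so BASE is always 4-byte aligned. MODE 0 (Direct) sends every trap to BASE; MODE 1 (Vectored) sends asynchronous interrupts to BASE + 4 * cause. A minimal sketch of that decoding, assuming a 64-bit register; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

enum tvec_mode { TVEC_DIRECT = 0, TVEC_VECTORED = 1 };

/* MODE is the low two bits of the register. */
static unsigned tvec_mode(uint64_t tvec)
{
    return (unsigned)(tvec & 0x3);
}

/* BASE is everything above MODE, so it is always 4-byte aligned. */
static uint64_t tvec_base(uint64_t tvec)
{
    return tvec & ~(uint64_t)0x3;
}

/* Address the hart jumps to when a trap with the given cause is taken. */
static uint64_t trap_target(uint64_t tvec, int is_interrupt, unsigned cause)
{
    assert(tvec_mode(tvec) <= TVEC_VECTORED); /* MODE >= 2 is reserved */
    if (tvec_mode(tvec) == TVEC_VECTORED && is_interrupt)
        return tvec_base(tvec) + 4 * (uint64_t)cause;
    return tvec_base(tvec);
}
```

Since setup_trap_vector writes a 4-byte-aligned address, the low two bits are zero, which selects Direct mode: every trap, including the ecall exception, lands on handle_exception.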
Calling ecall
At this point, we can see the kernel side of the exception. Now we need to see userland calling ecall.
Initially I tried to find a way to write C code that generates an ecall, but it didn’t work. I think glibc does the syscall (will have to circle back later).
I grepped through the kernel and found a syscall being made with ecall after setting the syscall number in a7.
ENTRY(__vdso_getcpu)
.cfi_startproc
/* For now, just do the syscall. */
li a7, __NR_getcpu
ecall
ret
.cfi_endproc
ENDPROC(__vdso_getcpu)
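The same pattern can be written from C with inline assembly. This is a sketch I would expect to work with GCC or Clang on a RISC-V Linux target, following the convention visible above (syscall number in a7) plus the Linux convention that the result comes back in a0. The name raw_syscall0 is made up, and on other architectures it falls back to libc's syscall() wrapper so the file still builds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a syscall that takes no arguments and return its raw result. */
static long raw_syscall0(long nr)
{
    assert(nr >= 0);
#if defined(__riscv)
    /* Pin the variables to the registers the kernel reads and writes. */
    register long a7 __asm__("a7") = nr;
    register long a0 __asm__("a0");
    __asm__ volatile("ecall" : "=r"(a0) : "r"(a7) : "memory");
    return a0;
#else
    /* Not RISC-V: let libc issue the syscall instead of ecall. */
    return syscall(nr);
#endif
}
```

For example, raw_syscall0(SYS_getpid) should return the same value as getpid().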