This is a deep-ish dive into the RISC-V privileged specification and the Linux kernel's syscall implementation.

Privileged specification tour

To keep it short, there are three privilege levels, each with a 2-bit encoding:

  • U (user) : 0
  • S (supervisor) : 1
  • Reserved : 2
  • M (Machine) : 3

The spec describes them as follows:

At any time, a RISC-V hardware thread (hart) is running at some privilege level encoded as a mode in one or more CSRs (control and status registers).

And

All hardware implementations must provide M-mode, as this is the only mode that has unfettered access to the whole machine. The simplest RISC-V implementations may provide only M-mode, though this will provide no protection against incorrect or malicious application code

CSRs

Chapter 2 describes two classes of privileged instructions:

The SYSTEM major opcode is used to encode all privileged instructions in the RISC-V ISA. These can be divided into two main classes: those that atomically read-modify-write control and status registers (CSRs), which are defined in the Zicsr extension, and all other privileged instructions.

The spec defines a 12-bit CSR address, csr[11:0], and the top 4 bits encode access permission and privilege:

  • [11:10] read/write (00, 01, or 10) or read-only (11)
  • [9:8] the lowest privilege level that can access the register
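
The two bit fields above can be pulled out of a CSR address with a couple of shifts. The following is a minimal sketch in C; the function names are made up for illustration and do not come from the spec or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: csr[11:10] == 0b11 marks a read-only CSR,
 * any other value means read/write. */
bool csr_is_read_only(uint16_t csr)
{
    return ((csr >> 10) & 0x3u) == 0x3u;
}

/* Hypothetical helper: csr[9:8] is the lowest privilege level that
 * may access the CSR (0 = U, 1 = S, 3 = M). */
unsigned csr_min_privilege(uint16_t csr)
{
    return (csr >> 8) & 0x3u;
}
```

For mvendorid (0xF11) both fields are 0b11, so it is read-only and only accessible from M-mode, which matches the MRO privilege listed for it in the CSR table.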

Table 2.1 describes the CSR address ranges, for example:

csr[11:10]  csr[9:8]  csr[7:4]  Hex range    Use
11          11        0XXX      0xF00-0xF7F  Standard read-only

Table 2.5 describes each of these CSRs, for example:

Number  Privilege  Name       Description
0xF11   MRO        mvendorid  Vendor ID

Note: the CSR instructions themselves are defined in the unprivileged spec, chapter 9, “Zicsr”, Control and Status Register (CSR) Instructions:

RISC-V defines a separate address space of 4096 Control and Status registers associated with each hart. This chapter defines the full set of CSR instructions that operate on these CSRs.
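
To make the 12-bit CSR field concrete, here is a rough sketch of how a Zicsr instruction word is assembled: the CSR address sits in bits [31:20], rs1 in [19:15], funct3 in [14:12], rd in [11:7], and the SYSTEM opcode in [6:0]. The function and macro names are my own, chosen for illustration.

```c
#include <stdint.h>

#define OPCODE_SYSTEM 0x73u /* SYSTEM major opcode */
#define FUNCT3_CSRRW  0x1u  /* atomic read/write CSR */
#define FUNCT3_CSRRS  0x2u  /* atomic read and set bits in CSR */

/* Hypothetical helper: build a 32-bit register-form CSR instruction. */
uint32_t encode_csr_insn(uint32_t csr, uint32_t rs1, uint32_t funct3,
                         uint32_t rd)
{
    return ((csr & 0xFFFu) << 20) |   /* 12-bit CSR address */
           ((rs1 & 0x1Fu) << 15) |    /* source register */
           ((funct3 & 0x7u) << 12) |  /* which CSR operation */
           ((rd & 0x1Fu) << 7) |      /* destination register */
           OPCODE_SYSTEM;
}
```

The pseudoinstruction csrr a0, mvendorid expands to csrrs a0, mvendorid, x0, so encode_csr_insn(0xF11, 0, FUNCT3_CSRRS, 10) should give 0xF1102573 (a0 is register x10).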

Machine-level ISA

Section 3.1 defines the machine-level CSRs and their required fields, for example:

The mvendorid CSR is a 32-bit read-only register providing the JEDEC manufacturer ID of the provider of the core. This register must be readable in any implementation, but a value of 0 can be returned to indicate the field is not implemented or that this is a non-commercial implementation.

Then section 3.3 lists the machine-level privileged instructions:

  • ECALL
  • MRET
  • WFI

I will just quote the ECALL description for reference:

The ECALL instruction is used to make a request to the supporting execution environment. When executed in U-mode, S-mode, or M-mode, it generates an environment-call-from-U-mode exception, environment-call-from-S-mode exception, or environment-call-from-M-mode exception, respectively, and performs no other operation.
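
Each of those three exceptions has its own cause code in the spec's exception table: 8 for an environment call from U-mode, 9 from S-mode, and 11 from M-mode. Since the privilege encodings are 0, 1, and 3, the cause happens to be 8 plus the privilege level. A small sketch, with names I made up for illustration:

```c
/* Privilege level encodings from the privileged spec. */
enum priv_level { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical helper: exception cause code raised by ECALL when it
 * is executed at the given privilege level. */
unsigned ecall_cause(enum priv_level priv)
{
    return 8u + (unsigned)priv;
}
```

This is the value the trap handler finds in the cause CSR when a user program makes a syscall, which is how it can tell a syscall apart from other exceptions.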

Deep Dive into Linux kernel

Now that we have had a quick tour through the specs, let’s see it in action!

Kernel CSR definitions

include/asm/csr.h defines the CSRs. I didn’t find mvendorid, but I thought mstatus was important enough to highlight.

#define CSR_MSTATUS             0x300

For example, arch/riscv/kernel/entry.S uses these definitions as follows:

csrw CSR_SCRATCH, x0

Exception handler and mret

I will start with handle_exception, which ends with mret, and work my way backward.

arch/riscv/kernel/entry.S defines handle_exception, which handles syscalls (among other exceptions).

ENTRY(handle_exception)

Naturally, it has to return with mret (or sret, when the kernel runs in S-mode).

#ifdef CONFIG_RISCV_M_MODE
         mret
#else
         sret
#endif

And handle_exception is installed as the trap vector by the startup code around _start, defined in arch/riscv/kernel/head.S. As far as I remember, _start is called from the boot code (to revisit later).

        call setup_trap_vector
        tail smp_callin
#endif /* CONFIG_SMP */

.align 2
setup_trap_vector:
        /* Set trap vector to exception handler */
        la a0, handle_exception
        csrw CSR_TVEC, a0

        /*
         * Set sup0 scratch register to 0, indicating to exception vector that
         * we are presently executing in kernel.
         */
        csrw CSR_SCRATCH, zero
        ret

And from the spec (CSR_TVEC maps to mtvec when the kernel is built with CONFIG_RISCV_M_MODE, and to stvec otherwise):

The mtvec register is an MXLEN-bit WARL read/write register that holds trap vector configuration, consisting of a vector base address (BASE) and a vector mode (MODE)
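
The BASE and MODE split can be sketched as follows: MODE lives in the low two bits (0 = Direct, 1 = Vectored) and the remaining bits hold the 4-byte-aligned base address. In Vectored mode, interrupts jump to BASE + 4 * cause, while exceptions such as ecall always go to BASE. The helper names are hypothetical and the sketch assumes a 64-bit register.

```c
#include <stdint.h>

#define TVEC_MODE_MASK     0x3u
#define TVEC_MODE_DIRECT   0u
#define TVEC_MODE_VECTORED 1u

/* Low two bits of the trap vector register select the mode. */
unsigned tvec_mode(uint64_t tvec)
{
    return (unsigned)(tvec & TVEC_MODE_MASK);
}

/* Base address with the mode bits masked off. */
uint64_t tvec_base(uint64_t tvec)
{
    return tvec & ~(uint64_t)TVEC_MODE_MASK;
}

/* Address the hart jumps to when a trap with this cause is taken. */
uint64_t trap_target(uint64_t tvec, uint64_t cause, int is_interrupt)
{
    if (tvec_mode(tvec) == TVEC_MODE_VECTORED && is_interrupt)
        return tvec_base(tvec) + 4 * cause;
    return tvec_base(tvec);
}
```

Assuming handle_exception is 4-byte aligned, writing its address to CSR_TVEC leaves the mode bits at 0, so every trap, including the syscall ecall, lands directly on handle_exception.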

Calling ecall

At this point, we have seen the kernel side of the exception. Now we need to see userland calling ecall. Initially I tried to write C code that generates an ecall for a syscall, but it didn’t work. I think glibc issues the syscall (I will have to circle back to this later).

I grepped through the kernel and found a syscall made with ecall after setting the syscall number in a7:

ENTRY(__vdso_getcpu)
        .cfi_startproc
        /* For now, just do the syscall. */
        li a7, __NR_getcpu
        ecall
        ret
        .cfi_endproc
ENDPROC(__vdso_getcpu)
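
For the userland side, the same call can be sketched in C. This is only a sketch under a few assumptions: the function name raw_getcpu is mine, the RISC-V branch uses GCC/Clang inline assembly to mirror the vDSO code above (syscall number in a7, arguments in a0 to a2), and on any other architecture it falls back to the generic syscall() wrapper from glibc so the example still builds.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel which CPU and NUMA node we are on.
 * RISC-V path returns 0 or a negative errno value straight from a0;
 * the fallback path returns 0, or -1 with errno set by glibc. */
long raw_getcpu(unsigned *cpu, unsigned *node)
{
#if defined(__riscv)
    register long a0 __asm__("a0") = (long)cpu;   /* arg 0, also result */
    register long a1 __asm__("a1") = (long)node;  /* arg 1 */
    register long a2 __asm__("a2") = 0;           /* arg 2, unused */
    register long a7 __asm__("a7") = SYS_getcpu;  /* syscall number */

    __asm__ volatile("ecall"
                     : "+r"(a0)
                     : "r"(a1), "r"(a2), "r"(a7)
                     : "memory");
    return a0;
#else
    return syscall(SYS_getcpu, cpu, node, NULL);
#endif
}
```

Calling raw_getcpu(&cpu, &node) from a RISC-V user program executes ecall in U-mode, which raises the environment-call-from-U-mode exception and ends up in handle_exception.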