The Kernel Interface
A system call (syscall) is the programmatic interface through which a user-space process requests a service from the operating system kernel. Since user-mode code (ring 3) cannot directly access hardware or kernel data structures, system calls provide controlled, well-defined entry points into ring 0.
How a System Call Works
On x86-64 Linux, the sequence is:
- The user program places the syscall number in the rax register and arguments in rdi, rsi, rdx, r10, r8, r9 (up to 6 arguments).
- The program executes the syscall instruction. This is a trap: the CPU switches from ring 3 to ring 0, saves the user-mode instruction pointer and flags, and jumps to the kernel's syscall entry point (configured via the LSTAR MSR).
- The kernel reads rax to determine the syscall number and looks up the handler in the syscall table (an array of function pointers indexed by syscall number).
- The handler executes the requested operation (e.g., reading a file, allocating memory).
- The return value is placed in rax, and the kernel executes sysret to return to user mode.
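The lookup-and-return part of this sequence can be modeled in a few lines. The following is only a toy sketch in Python, not kernel code: the handler names (sys_read, sys_write), the dictionary of registers, and the dispatch function are all hypothetical stand-ins chosen for illustration. It does mirror one real behavior, namely that an out-of-range syscall number yields -ENOSYS.

```python
import errno

# Toy handlers standing in for kernel functions (hypothetical, for illustration).
def sys_read(fd, buf, count):
    return 0          # pretend we hit end-of-file

def sys_write(fd, buf, count):
    return count      # pretend every byte was written

# The "syscall table": handlers indexed by syscall number.
# Index 0 is read and index 1 is write, matching the x86-64 numbering.
SYSCALL_TABLE = [sys_read, sys_write]

def dispatch(regs):
    """Look up the handler named by rax, call it, and store the result in rax."""
    nr = regs["rax"]
    if not 0 <= nr < len(SYSCALL_TABLE):
        regs["rax"] = -errno.ENOSYS   # unknown syscall number
        return regs
    handler = SYSCALL_TABLE[nr]
    # The first three arguments travel in rdi, rsi, rdx.
    regs["rax"] = handler(regs["rdi"], regs["rsi"], regs["rdx"])
    return regs

if __name__ == "__main__":
    regs = {"rax": 1, "rdi": 1, "rsi": b"hi\n", "rdx": 3}
    print(dispatch(regs)["rax"])
```

The model leaves out everything that makes the real path expensive (the privilege switch, stack switch, and register save/restore), so it should be read as a picture of the table lookup only.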
Trap Instructions by Architecture
| Architecture | Instruction | Notes |
|---|---|---|
| x86 (32-bit) | int 0x80 | Software interrupt, older and slower |
| x86-64 | syscall | Dedicated fast-path instruction |
| ARM (AArch64) | svc #0 | Supervisor Call |
| RISC-V | ecall | Environment Call |
The syscall instruction on x86-64 is faster than int 0x80 because it bypasses the interrupt descriptor table lookup and the checks that go with it. Linux also provides a vDSO (virtual dynamic shared object), a small shared library that the kernel maps into every process together with read-only kernel data, allowing some "syscalls" like gettimeofday() to execute entirely in user mode without any mode switch.
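One way to see that the vDSO really is mapped into a process is to look for it in the process's memory map. The sketch below assumes a Linux system where /proc/self/maps is readable and where the mapping is labeled [vdso]; the function name find_vdso_mappings is made up for this example. On other systems it simply returns an empty list.

```python
def find_vdso_mappings(maps_path="/proc/self/maps"):
    """Return the lines of the process memory map that mention the vDSO."""
    try:
        with open(maps_path) as f:
            return [line.rstrip("\n") for line in f if "[vdso]" in line]
    except OSError:
        return []   # not Linux, or /proc is unavailable

if __name__ == "__main__":
    for line in find_vdso_mappings():
        print(line)
```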
System Call Overhead
Each system call involves:
- Mode switch: saving user-mode registers, switching to kernel stack, changing CPL from 3 to 0.
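- Register save/restore: the kernel saves the user-mode register state on entry and restores it on exit, so that apart from rax (the return value), rcx, and r11, the program sees its registers unchanged.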
- TLB considerations: on systems with Kernel Page Table Isolation (KPTI, the Meltdown mitigation), every kernel entry and exit switches page tables; on CPUs without PCID support this flushes TLB entries, adding significant overhead.
- Return path: restoring registers and switching back to ring 3.
A single syscall typically costs 100-1000 nanoseconds depending on the operation and whether KPTI is enabled. This is why high-performance applications batch operations (e.g., readv/writev for scatter-gather I/O) or use io_uring to submit multiple I/O operations with a single syscall.
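The batching idea can be sketched with Python's os.write and os.writev, which are thin wrappers over the write and writev syscalls. The helper names (write_one_by_one, write_batched, demo) and the chunk count are assumptions made for this example, and short writes are ignored for brevity. Both paths produce the same file contents, but the first issues one syscall per chunk and the second issues one in total.

```python
import os
import tempfile

def write_one_by_one(fd, chunks):
    """One write syscall per chunk; returns the number of syscalls issued."""
    calls = 0
    for chunk in chunks:
        os.write(fd, chunk)   # short writes ignored for brevity
        calls += 1
    return calls

def write_batched(fd, chunks):
    """A single writev syscall covering all chunks (scatter-gather)."""
    os.writev(fd, chunks)     # short writes ignored for brevity
    return 1

def demo(n=64):
    """Write the same chunks both ways and return (syscall count, bytes written)."""
    chunks = [b"line %d\n" % i for i in range(n)]
    results = {}
    for name, writer in (("write", write_one_by_one), ("writev", write_batched)):
        with tempfile.TemporaryFile() as f:
            calls = writer(f.fileno(), chunks)
            f.seek(0)
            results[name] = (calls, f.read())
    return results

if __name__ == "__main__":
    for name, (calls, data) in demo().items():
        print(name, "syscalls:", calls, "bytes:", len(data))
```

The chunk count is kept well below the system's limit on vectors per writev call (commonly 1024 on Linux), which a production version would need to respect.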
Common System Calls
| Syscall | Number (x86-64) | Purpose |
|---|---|---|
| read | 0 | Read bytes from a file descriptor |
| write | 1 | Write bytes to a file descriptor |
| open | 2 | Open a file, return a file descriptor |
| close | 3 | Close a file descriptor |
| mmap | 9 | Map files or anonymous memory into address space |
| fork | 57 | Create a child process |
| execve | 59 | Replace process image with a new program |
| exit | 60 | Terminate the calling process |
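To make the numbers in the table concrete, the sketch below invokes write by number through libc's generic syscall() function, loaded with ctypes. It assumes a 64-bit process on x86-64 Linux with a standard libc; the function name raw_write is invented for the example, and it refuses to run elsewhere because the same number means a different syscall on other architectures.

```python
import ctypes
import os
import platform

SYS_WRITE = 1   # x86-64 Linux number for write; differs on other architectures

def raw_write(fd, data):
    """Invoke write by syscall number through libc's syscall() function."""
    if (platform.system() != "Linux" or platform.machine() != "x86_64"
            or ctypes.sizeof(ctypes.c_void_p) != 8):
        raise NotImplementedError("number 1 means write only on 64-bit x86-64 Linux")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    n = libc.syscall(ctypes.c_long(SYS_WRITE), ctypes.c_long(fd),
                     ctypes.c_char_p(data), ctypes.c_size_t(len(data)))
    if n < 0:   # libc returns -1 and sets errno on failure
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return n

if __name__ == "__main__":
    raw_write(1, b"hello from syscall number 1\n")
```

In everyday code os.write(fd, data) does the same job portably; going through the number directly is shown here only to connect the table to an actual call.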
The C library (glibc) wraps these raw syscalls in higher-level functions like printf() (which buffers output and eventually calls write), malloc() (which calls brk or mmap when it needs more memory from the kernel), and fopen() (which calls open or openat).
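The effect of library-level buffering on syscall count can be illustrated by analogy in Python, whose io.BufferedWriter plays a role similar to stdio's buffer. In this sketch the CountingRaw class is a hypothetical sink that stands in for the write syscall and counts how often it is reached; the 4096-byte buffer size is an arbitrary choice for the example.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw sink that counts write() calls, standing in for write(2)."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)

def buffered_demo(n=1000):
    """Issue n small writes through a buffer; return (raw write count, data)."""
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=4096)
    for _ in range(n):
        buffered.write(b"0123456789")   # small writes, like repeated printf calls
    buffered.flush()                    # push out whatever is still buffered
    return raw.calls, bytes(raw.data)

if __name__ == "__main__":
    calls, data = buffered_demo()
    print("library-level writes: 1000, raw writes:", calls, "bytes:", len(data))
```

A thousand ten-byte writes reach the raw layer only a handful of times, which is the same reason a loop of printf() calls does not produce one write syscall per call.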
Tracing System Calls with strace
On Linux, strace intercepts and logs every system call a program makes. Running strace cat /etc/hostname produces output like the following (abridged; process startup calls such as execve and mmap are omitted):
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
read(3, "myserver\n", 131072) = 9
write(1, "myserver\n", 9) = 9
close(3) = 0
exit_group(0) = ?
This shows that even the simple cat command makes multiple syscalls: openat to open the file (returns fd 3), read to read its contents, write to output to stdout (fd 1), and close to release the file descriptor.
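The same sequence can be reproduced with Python's os module, whose functions map closely onto the underlying syscalls. This is a minimal sketch of what cat does, not its actual source; the function name cat, its parameters, and the 131072-byte buffer (borrowed from the trace) are choices made for this example. Whether os.open ends up as open or openat depends on the C library.

```python
import os

def cat(path, out_fd=1, bufsize=131072):
    """Copy a file to out_fd using the open/read/write/close sequence."""
    fd = os.open(path, os.O_RDONLY)        # open (glibc typically issues openat)
    try:
        while True:
            chunk = os.read(fd, bufsize)   # read(fd, ..., 131072)
            if not chunk:                  # a zero-length read means end-of-file
                break
            while chunk:                   # write may accept only part of the data
                written = os.write(out_fd, chunk)
                chunk = chunk[written:]
    finally:
        os.close(fd)                       # close(fd)

if __name__ == "__main__":
    cat("/etc/hostname")
```

Note that the loop issues one more read than the abridged trace shows: the final call returns zero bytes, which is how the program learns it has reached end-of-file.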
Performance insight: a web server handling 100,000 requests per second might make 300,000+ syscalls per second (read, write, epoll_wait for each request). Reducing syscall count via buffering, io_uring, or sendfile directly impacts throughput.
Security insight: seccomp (Secure Computing) is a Linux mechanism that restricts which syscalls a process can make. Docker containers and Chrome's sandbox use seccomp to limit attack surface: even if an attacker achieves code execution, they can only invoke the syscalls the filter allows, so a policy can deny calls such as execve or mount.