System Calls

The Kernel Interface

A system call (syscall) is the programmatic interface through which a user-space process requests a service from the operating system kernel. Since user-mode code (ring 3) cannot directly access hardware or kernel data structures, system calls provide controlled, well-defined entry points into ring 0.

How a System Call Works

On x86-64 Linux, the sequence is:

  1. The user program places the syscall number in the rax register and arguments in rdi, rsi, rdx, r10, r8, r9 (up to 6 arguments).
  2. The program executes the syscall instruction. This is a trap -- the CPU switches from ring 3 to ring 0, saves the user-mode instruction pointer in rcx and the flags in r11 (which is why the fourth argument travels in r10 rather than rcx), and jumps to the kernel's syscall entry point (configured via the LSTAR MSR).
  3. The kernel reads rax to determine the syscall number and looks up the handler in the syscall table (an array of function pointers indexed by syscall number).
  4. The handler executes the requested operation (e.g., reading a file, allocating memory).
  5. The return value is placed in rax, and the kernel executes sysret to return to user mode.
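The dispatch in steps 1-5 can be sketched as a toy model in Python. This is only an illustration of "an array of function pointers indexed by syscall number": the handler bodies, the register dictionary, and the name do_syscall are hypothetical and are not kernel code.

```python
# Toy model of the five steps above; every name here is illustrative.

ENOSYS = 38  # Linux errno value for "function not implemented"

def sys_read(fd, buf, count):
    # Pretend the file is already at end-of-file.
    return 0

def sys_write(fd, buf, count):
    # Pretend every requested byte was written.
    return count

# Step 3: the syscall table is an array of handlers indexed by number.
SYSCALL_TABLE = [sys_read, sys_write]

def do_syscall(regs):
    """Dispatch on regs['rax'] and place the result back in rax."""
    nr = regs["rax"]
    if not 0 <= nr < len(SYSCALL_TABLE):
        regs["rax"] = -ENOSYS          # unknown number: negative errno
        return regs
    handler = SYSCALL_TABLE[nr]        # step 3: table lookup
    # Step 4: run the handler with the argument registers.
    regs["rax"] = handler(regs["rdi"], regs["rsi"], regs["rdx"])
    return regs                        # step 5: result is in rax

# Step 1: user code loads the number and the arguments.
regs = {"rax": 1, "rdi": 1, "rsi": b"hello\n", "rdx": 6}
# Step 2 would be the syscall instruction; here it is a plain call.
result = do_syscall(regs)["rax"]
```

The real kernel performs the same lookup, but only after the hardware mode switch and after saving the user register state.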

Trap Instructions by Architecture

Architecture     Instruction   Notes
x86 (32-bit)     int 0x80      Software interrupt, older and slower
x86-64           syscall       Dedicated fast-path instruction
ARM (AArch64)    svc #0        Supervisor Call
RISC-V           ecall         Environment Call

The syscall instruction on x86-64 is faster than int 0x80 because it avoids the overhead of the full interrupt descriptor table lookup. Linux also provides a vDSO (virtual dynamic shared object) that maps certain read-only kernel data into user space, allowing some "syscalls" like gettimeofday() to execute entirely in user mode without any mode switch.

System Call Overhead

Each system call involves:

  • Mode switch: saving user-mode registers, switching to kernel stack, changing CPL from 3 to 0.
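  • Register save/restore: the kernel saves the user-space register state on entry and restores it on exit (on x86-64 only rax, rcx, and r11 are changed by the call).
  • TLB considerations: on systems with Kernel Page Table Isolation (KPTI, the Meltdown mitigation), the kernel switches page tables on every entry and exit, which can flush TLB entries (CPUs with PCID support avoid most of these flushes) and adds measurable overhead.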
  • Return path: restoring registers and switching back to ring 3.

A single syscall typically costs 100-1000 nanoseconds depending on the operation and whether KPTI is enabled. This is why high-performance applications batch operations (e.g., readv/writev for scatter-gather I/O) or use io_uring to submit multiple I/O operations with a single syscall.
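As a rough illustration of batching, the sketch below sends the same three buffers once with a single writev call and once with one write call per buffer. It assumes a Unix-like system, where Python exposes os.writev; the helper names and the sample buffers are made up for the example, and a pipe stands in for a real socket or file.

```python
import os

def write_batched(fd, chunks):
    """Write all chunks with one writev call; returns bytes written."""
    return os.writev(fd, chunks)

def write_one_by_one(fd, chunks):
    """Write each chunk with its own write call; returns (bytes, calls)."""
    total = 0
    for chunk in chunks:
        total += os.write(fd, chunk)
    return total, len(chunks)

# A pipe keeps the example self-contained.
r, w = os.pipe()
chunks = [b"GET ", b"/index.html ", b"HTTP/1.1\r\n"]

n_batched = write_batched(w, chunks)            # one kernel entry
n_single, calls = write_one_by_one(w, chunks)   # three kernel entries
os.close(w)

data = os.read(r, 4096)                         # both copies of the request
os.close(r)
```

Both paths deliver the same bytes; the difference is how many times the process crosses into the kernel to do so.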

Common System Calls

Syscall   Number (x86-64)   Purpose
read      0                 Read bytes from a file descriptor
write     1                 Write bytes to a file descriptor
open      2                 Open a file, return a file descriptor
close     3                 Close a file descriptor
mmap      9                 Map files or anonymous memory into address space
fork      57                Create a child process
execve    59                Replace process image with a new program
exit      60                Terminate the calling process

The C library (glibc) wraps these raw syscalls in higher-level functions like printf() (which buffers and calls write), malloc() (which calls mmap or brk), and fopen() (which calls open).
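The effect of library-level buffering can be imitated in Python, whose io.BufferedWriter plays a role similar to the stdio buffer behind printf(). In this sketch, CountingRaw is a hypothetical stand-in for the file descriptor: it performs no I/O and only counts how often the buffered layer calls down into it, which is where a real stream would issue write.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts write() calls instead of doing I/O."""

    def __init__(self):
        super().__init__()
        self.calls = 0     # how many times the buffered layer called write()
        self.nbytes = 0    # how many bytes arrived in total

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)      # report that everything was accepted

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

for _ in range(1000):
    buffered.write(b"0123456789")   # 1000 small writes of 10 bytes each

buffered.flush()                    # push out whatever is still buffered
```

After the loop, raw.nbytes is 10000 while raw.calls is far below 1000, because the small writes are coalesced into a few large ones before reaching the lower layer.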

Tracing System Calls with strace

Real-World Example

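On Linux, strace intercepts and logs every system call a program makes. Running strace cat /etc/hostname reveals the following (abridged to the file-related calls; the full trace also includes process startup calls such as execve and mmap):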

openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
read(3, "myserver\n", 131072) = 9
write(1, "myserver\n", 9) = 9
close(3) = 0
exit_group(0) = ?

This shows that even the simple cat command makes multiple syscalls: openat to open the file (returns fd 3), read to read its contents, write to output to stdout (fd 1), and close to release the file descriptor.
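The same open, read, write, close sequence can be reproduced with Python's os module, whose functions are thin wrappers over the corresponding calls. The function below is a minimal sketch and not how cat is actually implemented; the name cat, its parameters, and the 131072-byte buffer (taken from the trace above) are choices made for this example.

```python
import os

def cat(path, out_fd=1, bufsize=131072):
    """Copy a file to out_fd; returns the number of bytes written."""
    fd = os.open(path, os.O_RDONLY)           # open (openat in the trace)
    total = 0
    try:
        while True:
            chunk = os.read(fd, bufsize)      # read
            if not chunk:                     # empty result means end of file
                break
            view = memoryview(chunk)
            while view:                       # write may accept only part
                written = os.write(out_fd, view)
                view = view[written:]
                total += written
    finally:
        os.close(fd)                          # close
    return total
```

Calling cat("/etc/hostname") under strace should show a similar sequence, plus one extra read that returns 0 to signal end of file.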

Performance insight: a web server handling 100,000 requests per second might make 300,000+ syscalls per second (read, write, epoll_wait for each request). Reducing syscall count via buffering, io_uring, or sendfile directly impacts throughput.
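To put the figures above in proportion, the short calculation below estimates how much of one CPU core the syscall overhead alone would consume. It assumes exactly 3 syscalls per request and the 100-1000 ns per-call range quoted earlier; the function name is made up for the example, and real costs vary with the workload.

```python
NS_PER_SECOND = 1_000_000_000

def syscall_cpu_share(requests_per_s, syscalls_per_request, cost_ns):
    """Fraction of one core spent on syscall overhead each second."""
    total_ns = requests_per_s * syscalls_per_request * cost_ns
    return total_ns / NS_PER_SECOND

low = syscall_cpu_share(100_000, 3, 100)     # cheap end: 30 ms per second
high = syscall_cpu_share(100_000, 3, 1000)   # expensive end: 300 ms per second
```

Under these assumptions the overhead lands between roughly 3% and 30% of a core, which is why halving the syscall count is worth the effort at this request rate.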

Security insight: seccomp (Secure Computing) is a Linux mechanism that restricts which syscalls a process can make. Docker containers and Chrome's sandbox use seccomp to limit attack surface -- even if an attacker achieves code execution, any syscall the filter denies (for example execve or mount, depending on the policy) fails or terminates the process.

System Call Flow (x86-64)

User Space (Ring 3)

  1. Set registers: rax = 1 (write), rdi = 1 (stdout), rsi = buf, rdx = len
  2. Execute syscall: trap to ring 0

Mode switch (ring 3 -> ring 0)

Kernel Space (Ring 0)

  3. Save state: RIP, RFLAGS, RSP to kernel stack
  4. Syscall table: sys_call_table[rax] -> sys_write()
  5. Run handler: write bytes to file descriptor
  6. Return: rax = result, sysret
  7. Resume: rax = bytes written; continue execution in user mode

Syscall table (partial): [0] sys_read, [1] sys_write, [2] sys_open, [3] sys_close

Overhead per syscall: ~100-1000 ns (register save/restore, stack switch, TLB flush (KPTI), privilege level change)