What is SQL?

Prerequisites(1)

Builds on to(6)

Join Algorithms (NLJ, Hash, Merge)Aggregation & GROUP BY Subquery & CTE Column Statistics & Histograms Materialized View ACID Properties

SQL (Structured Query Language) is the standard declarative language for relational databases. You specify what data you want, not how to retrieve it — the query optimizer figures out the execution plan.

SQL sub-languages

DDL (Data Definition Language) — defines schema:

CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(255) UNIQUE)
ALTER TABLE users ADD COLUMN age INT DEFAULT 0
DROP TABLE users
CREATE INDEX idx_email ON users (email)

DML (Data Manipulation Language) — reads and modifies data:

SELECT — query data
INSERT INTO users (name, email) VALUES ('Alice', 'a@b.com')
UPDATE users SET email = 'new@b.com' WHERE id = 1
DELETE FROM users WHERE id = 1
MERGE / INSERT ... ON CONFLICT (upsert) — insert or update depending on existence

DCL (Data Control Language) — manages permissions:

GRANT SELECT, INSERT ON users TO analyst_role
REVOKE DELETE ON users FROM intern_role

TCL (Transaction Control Language) — manages transactions:

BEGIN — start a transaction
COMMIT — make changes permanent
ROLLBACK — undo all changes since BEGIN
SAVEPOINT sp1 / ROLLBACK TO sp1 — partial rollback

SELECT anatomy

The logical order of a SELECT statement differs from the written order:

Written order:

SELECT department, AVG(salary) AS avg_sal
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 80000
ORDER BY avg_sal DESC
LIMIT 10;

Logical execution order:

FROM employees — identify the source table
WHERE hire_date > '2020-01-01' — filter rows
GROUP BY department — form groups
HAVING AVG(salary) > 80000 — filter groups
SELECT department, AVG(salary) — compute output columns
ORDER BY avg_sal DESC — sort results
LIMIT 10 — truncate output

Understanding this order explains why you cannot use a column alias defined in SELECT within the WHERE clause — WHERE executes before SELECT.

NULL semantics (three-valued logic)

SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN. Any comparison involving NULL yields UNKNOWN:

NULL = NULL → UNKNOWN (not TRUE!)
NULL > 5 → UNKNOWN
NULL AND TRUE → UNKNOWN
NULL OR TRUE → TRUE (short-circuit)
WHERE x = NULL matches nothing — use WHERE x IS NULL

COALESCE(a, b, c) returns the first non-NULL argument. It is the standard way to provide default values: COALESCE(nickname, first_name, 'Anonymous').

Declarative nature

SQL is declarative: you describe the result set, not the algorithm. The same query might be executed as:

A sequential scan with a filter
An index lookup followed by a heap fetch
A merge join of two sorted inputs
A hash join with parallel workers

The query optimizer chooses the best plan based on table statistics, index availability, and cost estimates. This is why the same SQL query can perform very differently depending on the database engine, indexes, and data distribution.

**SQL (Structured Query Language)** is the standard declarative language for relational databases. You specify **what** data you want, not **how** to retrieve it — the query optimizer figures out the execution plan.

### SQL sub-languages

**DDL (Data Definition Language)** — defines schema:
- `CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(255) UNIQUE)`
- `ALTER TABLE users ADD COLUMN age INT DEFAULT 0`
- `DROP TABLE users`
- `CREATE INDEX idx_email ON users (email)`

**DML (Data Manipulation Language)** — reads and modifies data:
- `SELECT` — query data
- `INSERT INTO users (name, email) VALUES ('Alice', 'a@b.com')`
- `UPDATE users SET email = 'new@b.com' WHERE id = 1`
- `DELETE FROM users WHERE id = 1`
- `MERGE` / `INSERT ... ON CONFLICT` (upsert) — insert or update depending on existence

**DCL (Data Control Language)** — manages permissions:
- `GRANT SELECT, INSERT ON users TO analyst_role`
- `REVOKE DELETE ON users FROM intern_role`

**TCL (Transaction Control Language)** — manages transactions:
- `BEGIN` — start a transaction
- `COMMIT` — make changes permanent
- `ROLLBACK` — undo all changes since BEGIN
- `SAVEPOINT sp1` / `ROLLBACK TO sp1` — partial rollback

### SELECT anatomy

The logical order of a SELECT statement differs from the written order:

**Written order:**
```sql
SELECT department, AVG(salary) AS avg_sal
FROM employees
WHERE hire_date   '2020-01-01'
GROUP BY department
HAVING AVG(salary)   80000
ORDER BY avg_sal DESC
LIMIT 10;
```

**Logical execution order:**
1. `FROM employees` — identify the source table
2. `WHERE hire_date   '2020-01-01'` — filter rows
3. `GROUP BY department` — form groups
4. `HAVING AVG(salary)   80000` — filter groups
5. `SELECT department, AVG(salary)` — compute output columns
6. `ORDER BY avg_sal DESC` — sort results
7. `LIMIT 10` — truncate output

Understanding this order explains why you **cannot** use a column alias defined in SELECT within the WHERE clause — WHERE executes before SELECT.

### NULL semantics (three-valued logic)

SQL uses **three-valued logic**: TRUE, FALSE, and **UNKNOWN**. Any comparison involving NULL yields UNKNOWN:

- `NULL = NULL` → UNKNOWN (not TRUE!)
- `NULL   5` → UNKNOWN
- `NULL AND TRUE` → UNKNOWN
- `NULL OR TRUE` → TRUE (short-circuit)
- `WHERE x = NULL` matches **nothing** — use `WHERE x IS NULL`

**COALESCE(a, b, c)** returns the first non-NULL argument. It is the standard way to provide default values: `COALESCE(nickname, first_name, 'Anonymous')`.

### Declarative nature

SQL is **declarative**: you describe the result set, not the algorithm. The same query might be executed as:
- A sequential scan with a filter
- An index lookup followed by a heap fetch
- A merge join of two sorted inputs
- A hash join with parallel workers

The **query optimizer** chooses the best plan based on table statistics, index availability, and cost estimates. This is why the same SQL query can perform very differently depending on the database engine, indexes, and data distribution.

Real-Life: The Power of Declarative Queries

Real-World Example

Consider a query to find the top 5 customers by total spending in the last 30 days:

SELECT c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.created_at > CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.id, c.name
ORDER BY total DESC
LIMIT 5;

What the optimizer might decide:

Use a range index scan on orders.created_at to find recent orders (instead of scanning all orders).
For each matching order, use a nested loop join with an index lookup on customers.id.
Or, if most orders are recent, do a hash join: build a hash table on customers, scan orders sequentially.

The developer wrote 7 lines of SQL. The optimizer generated a plan that would take 200+ lines of imperative code. And it automatically adapts: if you add an index on created_at, the plan changes without modifying the query.

NULL gotcha in practice:

-- This finds nothing even if some users have no email!
SELECT * FROM users WHERE email != 'test@example.com';
-- NULL != 'test@example.com' is UNKNOWN, filtered out
-- Fix:
SELECT * FROM users WHERE email != 'test@example.com' OR email IS NULL;
-- Or: WHERE COALESCE(email, '') != 'test@example.com'

Savepoints in practice:

BEGIN;
INSERT INTO orders (...) VALUES (...);
SAVEPOINT before_payment;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Oops, insufficient funds:
ROLLBACK TO before_payment;
-- Order insert is still active, try a different payment method
COMMIT;

SELECT Logical Execution Order

Step 1 of 3