SQL
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.
SQL (Structured Query Language) is the standard language for interacting with relational databases. It is used to store, retrieve, update, and delete data, and underlies nearly every application that persists structured information, from mobile apps to enterprise systems.
Remembering[edit]
Key terms:
- Database: an organized collection of structured data stored electronically.
- Relational database: a database that organizes data into tables with defined relationships between them.
- Table: a grid of rows and columns, similar to a spreadsheet, representing one type of entity (e.g., users, orders).
- Row (record): a single entry in a table.
- Column (field): a named attribute shared by all rows in a table (e.g., "email", "created_at").
- Primary key: a column (or combination of columns) that uniquely identifies each row.
- Foreign key: a column that references the primary key of another table, linking them.
- Query: a request for data, written in SQL.
- Schema: the blueprint describing a database's tables, columns, and relationships.
- Index: a data structure that speeds up lookups on a column at the cost of extra storage.
- NULL: the absence of a value; distinct from zero or the empty string.
Core SQL statement categories:
- SELECT: retrieve data
- INSERT: add new rows
- UPDATE: modify existing rows
- DELETE: remove rows
- CREATE / DROP / ALTER: define or change the schema (DDL, Data Definition Language)
Understanding[edit]
A relational database stores data in normalized form to reduce duplication. Instead of repeating a customer's address on every order, the address lives once in a Customers table and orders reference it via a foreign key. This separation is the core idea behind relational design.
SQL queries are declarative: you describe what data you want, not how to retrieve it. The database engine (PostgreSQL, MySQL, SQLite, SQL Server, etc.) figures out the execution plan. This matters because the same logical query can be executed many different ways, and the engine picks the one it estimates to be cheapest.
How a SELECT query executes (logical order, not syntactic order):
- FROM / JOIN: identify and combine the source tables
- WHERE: filter rows before grouping
- GROUP BY: collapse rows into groups
- HAVING: filter groups
- SELECT: compute the output columns
- ORDER BY: sort the result
- LIMIT / OFFSET: slice the result set
Understanding this order explains common confusions: in standard SQL you cannot reference a SELECT alias in a WHERE clause (WHERE runs before SELECT), and aggregate results can only be filtered in HAVING, not in WHERE, because WHERE runs before grouping.
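The split between row-level and group-level filtering can be seen directly. A minimal sketch using Python's standard-library sqlite3 module, with an invented orders table: WHERE removes individual rows before GROUP BY runs, and HAVING then removes whole groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50), (1, 200), (2, 500), (2, 20), (3, 5)])

rows = conn.execute("""
    SELECT user_id, SUM(total) AS spent
    FROM orders
    WHERE total > 10          -- row-level filter (runs before GROUP BY)
    GROUP BY user_id
    HAVING SUM(total) > 100   -- group-level filter (runs after GROUP BY)
    ORDER BY user_id
""").fetchall()
print(rows)  # user 3's only order (5) is filtered out before grouping
```

Note that the HAVING clause repeats SUM(total) rather than using the alias spent; standard SQL computes SELECT aliases after grouping, although some engines accept the alias as an extension.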
Applying[edit]
Common patterns:
- Basic retrieval
SELECT name, email FROM users WHERE active = 1 ORDER BY created_at DESC LIMIT 10;
- Joining tables
SELECT orders.id, users.name, orders.total
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.status = 'shipped';
- Aggregation
SELECT product_id, COUNT(*) AS num_orders, SUM(total) AS revenue
FROM orders
GROUP BY product_id
HAVING SUM(total) > 1000
ORDER BY revenue DESC;
- Subquery
SELECT name FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 500);
- Upsert (insert or update; the ON CONFLICT syntax shown here is PostgreSQL and SQLite)
INSERT INTO settings (user_id, key, value)
VALUES (42, 'theme', 'dark')
ON CONFLICT (user_id, key) DO UPDATE SET value = EXCLUDED.value;
- Common Table Expression (CTE)
WITH recent_orders AS (
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT user_id, COUNT(*) FROM recent_orders GROUP BY user_id;
Practical habits: always filter early (move conditions as close to the source as possible), use EXPLAIN to inspect the query plan before deploying, and never run UPDATE or DELETE without a WHERE clause in production.
Analyzing[edit]
Joins are the most misunderstood feature:
- INNER JOIN: returns only rows with a match in both tables. Rows with no match are silently dropped.
- LEFT JOIN: returns all rows from the left table; unmatched right-side columns are NULL.
- RIGHT JOIN: the mirror of LEFT JOIN (rarely used; usually rewritten as a LEFT JOIN with the tables swapped).
- FULL OUTER JOIN: returns all rows from both tables, with NULLs where there is no match.
- CROSS JOIN: every row from the left combined with every row from the right (Cartesian product); rarely intentional.
NULL behavior is a frequent source of bugs. NULL is not equal to anything, including itself: NULL = NULL evaluates to NULL (unknown), not TRUE. Filtering with WHERE col != 'x' silently excludes NULL rows. Use IS NULL / IS NOT NULL explicitly.
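All three NULL pitfalls can be reproduced in a few lines. A minimal sketch using Python's standard-library sqlite3 module with an invented single-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [('x',), ('y',), (None,)])

# NULL = NULL evaluates to unknown, so no rows match: count is 0, not 3.
eq = conn.execute("SELECT COUNT(*) FROM t WHERE NULL = NULL").fetchone()[0]

# != 'x' silently drops the NULL row: only 'y' survives.
neq = conn.execute("SELECT col FROM t WHERE col != 'x'").fetchall()

# Only the explicit IS NULL predicate sees the NULL row.
nulls = conn.execute("SELECT COUNT(*) FROM t WHERE col IS NULL").fetchone()[0]
print(eq, neq, nulls)
```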
Index trade-offs:
- An index on a column makes reads faster (typically an O(log n) lookup vs. an O(n) scan) but slows writes, because the index must be updated on every INSERT/UPDATE/DELETE.
- Indexes are most effective on high-cardinality columns (many distinct values) used frequently in WHERE or JOIN conditions.
- A composite index on (a, b) helps queries filtering on a alone or on a and b together, but not on b alone.
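The composite-index rule can be checked against a real query planner. A sketch using SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module; the table and index names are invented, and the exact plan text is SQLite-specific and may vary by version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_a_b ON t (a, b)")

def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filtering on the leading column a can use the composite index...
p1 = plan("SELECT * FROM t WHERE a = 1")
# ...as can filtering on a and b together...
p2 = plan("SELECT * FROM t WHERE a = 1 AND b = 2")
# ...but filtering on b alone cannot: the planner falls back to a scan.
p3 = plan("SELECT * FROM t WHERE b = 2")
print(p1, p2, p3, sep="\n")
```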
N+1 query problem: fetching a list of N records and then issuing one query per record to get related data results in N+1 round trips. Fix it with a JOIN or a single IN (...) query instead.
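The round-trip arithmetic is easy to make concrete. A minimal sketch using Python's standard-library sqlite3 module with invented users and orders tables: the naive loop issues 1 + N queries, while a single LEFT JOIN fetches the same data in one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5);
""")

# N+1 pattern: one query for the users, then one more query per user.
queries = 0
users = conn.execute("SELECT id FROM users").fetchall(); queries += 1
for (uid,) in users:
    conn.execute("SELECT total FROM orders WHERE user_id = ?", (uid,)).fetchall()
    queries += 1
print(queries)  # 1 + N = 4 round trips for 3 users

# Fix: a single JOIN (or IN (...)) query fetches everything at once.
rows = conn.execute("""
    SELECT users.id, orders.total
    FROM users LEFT JOIN orders ON orders.user_id = users.id
""").fetchall()
```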
Evaluating[edit]
Signs of a well-designed SQL schema and query set:
- Tables are normalized to at least 3NF (no repeating groups, no partial dependencies, no transitive dependencies) unless denormalization is a deliberate performance trade-off with documented justification.
- Queries use indexes effectively; EXPLAIN / EXPLAIN ANALYZE shows index scans rather than sequential scans on large tables.
- Transactions wrap related mutations so partial failures leave the database in a consistent state (ACID guarantees).
- Migrations are versioned, reviewed, and applied non-destructively (add columns before removing old ones; never rename a column in one atomic step in a live system).
- Sensitive data (passwords, PII) is never stored in plaintext.
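The transaction criterion above can be demonstrated with a failed transfer. A minimal sketch using Python's standard-library sqlite3 module and an invented accounts table: the second UPDATE violates a CHECK constraint, and the rollback undoes the first UPDATE too, so no money is created out of thin air.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance REAL NOT NULL CHECK (balance >= 0))""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

# Try to move 200 from account 2, which only holds 50. The credit to
# account 1 succeeds, then the debit fails the CHECK constraint.
try:
    with conn:  # sqlite3 context manager: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # whole transaction rolled back; the first UPDATE is undone

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # both balances unchanged
```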
Expert-level judgment calls:
- When to denormalize: OLAP (analytics) workloads often benefit from denormalized designs, such as wide flat tables or star/snowflake schemas, to avoid expensive runtime joins over billions of rows.
- When to use a window function over GROUP BY: window functions (RANK(), LAG(), a running SUM() OVER (PARTITION BY ...)) let you compute aggregates without collapsing rows, which is critical for ranking, time-series analysis, and cohort comparisons.
- When NOT to use a relational database: high write throughput with flexible schema (use a document store), graph traversal queries (use a graph database), time-series data at scale (use a purpose-built TSDB).
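The window-function point above is easiest to see side by side with grouping. A sketch using Python's sqlite3 module with an invented sales table; it assumes a bundled SQLite of version 3.25 or later, which is when SQLite gained window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", "Ada", 300), ("east", "Grace", 500),
    ("west", "Edsger", 200), ("west", "Barbara", 400),
])

# RANK() OVER (PARTITION BY ...) ranks reps within each region while
# keeping every input row; GROUP BY region would collapse them to 2 rows.
rows = conn.execute("""
    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
print(rows)
```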
Creating[edit]
Designing a database for a new system involves several architectural decisions:
- Schema design
- Start from the domain entities and their relationships. Identify cardinality (one-to-one, one-to-many, many-to-many). Many-to-many relationships require a junction table. Decide which attributes belong to which entity and resist the temptation to put everything in one table.
- Choosing data types carefully
- Use the narrowest type that fits the data: INT rather than BIGINT if IDs will stay under about 2 billion, TIMESTAMP WITH TIME ZONE rather than VARCHAR for dates and times. Incorrect types cause silent truncation, incorrect sort orders, and missed index opportunities.
- Migration strategy
- In production systems, schema changes must be backwards-compatible during the deployment window. Techniques: expand-contract (add the new column, backfill, migrate reads, then drop the old column across separate deployments), online index builds (CREATE INDEX CONCURRENTLY in PostgreSQL), and feature flags to gate new columns until safe.
- Query layer design
- Decide whether application code uses raw SQL, a query builder, or an ORM. Raw SQL gives maximum control and transparency; ORMs reduce boilerplate but can generate inefficient queries invisibly. Audit ORM-generated SQL in production before assuming correctness.
- Observability
- Log slow queries (pg_stat_statements in PostgreSQL, slow query log in MySQL). Set query timeouts to prevent runaway queries from degrading the whole system. Track table bloat and vacuum health in write-heavy databases.
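The junction-table approach to many-to-many relationships mentioned under schema design can be sketched concretely. A minimal example using Python's standard-library sqlite3 module; the students/courses/enrollments names and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: one row per (student, course) pair; the composite
    -- primary key prevents duplicate enrollments.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course_id  INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO courses  VALUES (10, 'Databases'), (11, 'Compilers');
    INSERT INTO enrollments VALUES (1, 10), (1, 11), (2, 10);
""")

# Resolving the many-to-many relationship takes two joins through the
# junction table.
rows = conn.execute("""
    SELECT students.name, courses.title
    FROM enrollments
    JOIN students ON students.id = enrollments.student_id
    JOIN courses  ON courses.id  = enrollments.course_id
    ORDER BY students.name, courses.title
""").fetchall()
print(rows)
```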