A SELECT statement retrieves zero or more rows from one or more database tables or database views. In most applications, SELECT is the most commonly used data query language (DQL) command. As SQL is a declarative programming language, SELECT queries specify a result set, but do not specify how to calculate it. The database translates the query into a "query plan" which may vary between executions, database versions and database software. This functionality is called the "query optimizer" as it is responsible for finding the best possible execution plan for the query, within applicable constraints.
The SELECT statement has many optional clauses:
WHERE specifies which rows to retrieve.
GROUP BY groups rows sharing a property so that an aggregate function can be applied to each group.
HAVING selects among the groups defined by the GROUP BY clause.
ORDER BY specifies an order in which to return the rows.
AS provides an alias which can be used to temporarily rename tables or columns.
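As a rough illustration of how these clauses combine, the sketch below runs one query through Python's built-in sqlite3 module. The orders table, its columns and its data are hypothetical and exist only for this example; other database systems may differ in details such as whether ORDER BY may reference a column alias.

```python
import sqlite3

# In-memory database with a small, made-up "orders" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 10), ('ann', 30), ('bob', 5), ('bob', 50), ('cy', 70), ('cy', 2);
""")

clause_rows = conn.execute("""
    SELECT o.customer AS name, SUM(o.amount) AS total  -- AS: column aliases
    FROM orders AS o                                    -- AS: table alias
    WHERE o.amount >= 5                                 -- keep rows with amount >= 5
    GROUP BY o.customer                                 -- one group per customer
    HAVING SUM(o.amount) > 50                           -- keep groups whose sum exceeds 50
    ORDER BY total DESC                                 -- largest total first
""").fetchall()

print(clause_rows)
```

Here WHERE removes the single row with amount 2 before grouping, and HAVING then removes the group whose sum is 40.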
Given a table T, the query
SELECT * FROM T will result in all the elements of all the rows of the table being shown.
With the same table, the query
SELECT C1 FROM T will result in the elements from the column C1 of all the rows of the table being shown. This is similar to a projection in Relational algebra, except that in the general case, the result may contain duplicate rows. This is also known as a Vertical Partition in some database terms, restricting query output to view only specified fields or columns.
With the same table, the query
SELECT * FROM T WHERE C1 = 1 will result in all the elements of all the rows where the value of column C1 is '1' being shown — in Relational algebra terms, a selection will be performed, because of the WHERE clause. This is also known as a Horizontal Partition, restricting rows output by a query according to specified conditions.
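The projection and selection queries above can be tried with a minimal sketch using Python's sqlite3 module. The second column C2 and the three sample rows are assumptions made for the example; DISTINCT is added only to contrast with the duplicate-preserving behaviour of a plain SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (C1 INTEGER, C2 TEXT);
    INSERT INTO T VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Projection: one column, duplicates are kept.
projected = conn.execute("SELECT C1 FROM T").fetchall()

# DISTINCT removes the duplicates, as a relational-algebra projection would.
distinct = conn.execute("SELECT DISTINCT C1 FROM T").fetchall()

# Selection: all columns, only rows that satisfy the WHERE condition.
selected = conn.execute("SELECT * FROM T WHERE C1 = 1").fetchall()

print(len(projected), len(distinct), len(selected))
```

Row order is not guaranteed without ORDER BY, so the results are best compared as sorted lists or sets.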
With more than one table, the result set will be every combination of rows. So if two tables are T1 and T2,
SELECT * FROM T1, T2 will result in every combination of T1 rows with T2 rows. E.g., if T1 has 3 rows and T2 has 5 rows, then 15 rows will result.
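The row count of such a Cartesian product can be checked with a short sketch in Python's sqlite3 module. The single-column layouts of T1 and T2 are assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (a INTEGER)")
conn.execute("CREATE TABLE T2 (b INTEGER)")
conn.executemany("INSERT INTO T1 VALUES (?)", [(i,) for i in range(3)])  # 3 rows
conn.executemany("INSERT INTO T2 VALUES (?)", [(i,) for i in range(5)])  # 5 rows

# Listing two tables without a join condition pairs every T1 row with every T2 row.
pairs = conn.execute("SELECT * FROM T1, T2").fetchall()

print(len(pairs))
```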
The SELECT clause specifies a list of properties (columns) by name, or the wildcard character (“*”) to mean “all properties”.
Limiting result rows
Often it is convenient to indicate a maximum number of rows that are returned. This can be used for testing or to prevent consuming excessive resources if the query returns more information than expected. The approach often varies per vendor.
In ISO SQL:2003, result sets may be limited by using:
- cursors, or
- a SQL window function added to the SELECT statement.
ISO SQL:2008 introduced the FETCH FIRST clause.
According to the PostgreSQL 9 documentation, an SQL window function performs a calculation across a set of table rows that are somehow related to the current row, in a way similar to aggregate functions. The name recalls signal processing window functions. A window function call always contains an OVER clause.
ROW_NUMBER() window function
ROW_NUMBER() OVER may be used to number the returned rows and filter on that number, e.g. to return no more than ten rows:
SELECT * FROM ( SELECT ROW_NUMBER() OVER (ORDER BY sort_key ASC) AS row_number, columns FROM tablename ) AS foo WHERE row_number <= 10
ROW_NUMBER can be non-deterministic: if sort_key is not unique, each time you run the query it is possible to get different row numbers assigned to any rows where sort_key is the same. When sort_key is unique, each row will always get a unique row number.
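A minimal sketch of this technique, run through Python's sqlite3 module, is shown here. It assumes an SQLite build with window function support (3.25 or later); the payload column, the 25 sample rows and the alias rn are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (sort_key INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(k, f"row{k}") for k in range(1, 26)],  # 25 rows with unique sort keys
)

# Number the rows by sort_key in a subquery, then keep only the first ten.
top_rows = conn.execute("""
    SELECT * FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY sort_key ASC) AS rn, sort_key, payload
        FROM tablename
    ) AS foo
    WHERE rn <= 10
    ORDER BY rn
""").fetchall()

print(len(top_rows))
```

Because sort_key is unique in this data, the numbering is deterministic; with duplicate keys the rows that share a key could swap numbers between runs.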
RANK() window function
The RANK() OVER window function acts like ROW_NUMBER, but may return more or fewer than n rows in case of tie conditions, e.g. to return the top-10 youngest persons:
SELECT * FROM ( SELECT RANK() OVER (ORDER BY age ASC) AS ranking, person_id, person_name, age FROM person ) AS foo WHERE ranking <= 10
The above code could return more than ten rows, e.g. if there are two people of the same age, it could return eleven rows.
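The tie behaviour can be reproduced with a small sketch in Python's sqlite3 module, again assuming SQLite 3.25 or later. The twelve sample ages are invented so that two people share the tenth rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, person_name TEXT, age INTEGER)")

# Twelve people; the 10th and 11th youngest are both 29.
ages = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 29, 30]
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(i, f"p{i}", age) for i, age in enumerate(ages, start=1)],
)

youngest = conn.execute("""
    SELECT * FROM (
        SELECT RANK() OVER (ORDER BY age ASC) AS ranking, person_id, person_name, age
        FROM person
    ) AS foo
    WHERE ranking <= 10
""").fetchall()

print(len(youngest))
```

Both 29-year-olds receive rank 10 and the 30-year-old receives rank 12, so the filter keeps eleven rows.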
FETCH FIRST clause
Since ISO SQL:2008, result limits can be specified using the FETCH FIRST clause, as in the following example.
SELECT * FROM T FETCH FIRST 10 ROWS ONLY
This clause currently is supported by CA DATACOM/DB 11, IBM DB2, SAP SQL Anywhere, PostgreSQL, EffiProz, H2, HSQLDB version 2.0, Oracle 12c and Mimer SQL.
Microsoft SQL Server 2012 and higher supports FETCH FIRST, but it is considered part of the ORDER BY clause. The ORDER BY, OFFSET, and FETCH FIRST clauses are all required for this usage.
SELECT * FROM T ORDER BY acolumn DESC OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY
Some DBMSs offer non-standard syntax either instead of or in addition to SQL standard syntax. Below, variants of the simple limit query for different DBMSs are listed:
|SET ROWCOUNT 10 SELECT * FROM T|MS SQL Server (this also works on Microsoft SQL Server 6.5, where SELECT TOP 10 * FROM T does not)|
|SELECT * FROM T LIMIT 10 OFFSET 20|Netezza, MySQL, SAP SQL Anywhere, PostgreSQL (also supports the standard, since version 8.4), SQLite, HSQLDB, H2, Vertica, Polyhedra, Couchbase Server|
|SELECT SKIP 20 FIRST 10 * FROM T ORDER BY c, d|Informix (row numbers are filtered after ORDER BY is evaluated; the SKIP clause was introduced in a v10.00.xC4 fixpack)|
|SELECT TOP 10 * FROM T|MS SQL Server, SAP ASE, MS Access, SAP IQ, Teradata|
|SELECT TOP 10 START AT 20 * FROM T|SAP SQL Anywhere (also supports the standard, since version 9.0.1)|
|SELECT * FROM T ROWS 20 TO 30|Firebird (since version 2.1)|
|SELECT * FROM T WHERE ID_T > 10 FETCH FIRST 10 ROWS ONLY; SELECT * FROM T WHERE ID_T > 20 FETCH FIRST 10 ROWS ONLY|DB2 (new rows are filtered after comparing with the key column of table T)|
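The key-column approach in the DB2 entry can be compared with offset-based paging in a short sketch using Python's sqlite3 module. SQLite does not accept FETCH FIRST, so LIMIT stands in for it here; the val column and the 35 sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID_T INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [(i, f"v{i}") for i in range(1, 36)])

# Offset paging: skip the first 10 rows, then take 10.
page2_offset = conn.execute(
    "SELECT ID_T FROM T ORDER BY ID_T LIMIT 10 OFFSET 10"
).fetchall()

# Key-column paging: filter on the last key seen on the previous page, then take 10.
last_seen = 10
page2_keyset = conn.execute(
    "SELECT ID_T FROM T WHERE ID_T > ? ORDER BY ID_T LIMIT 10", (last_seen,)
).fetchall()

print(page2_offset == page2_keyset)
```

The two queries agree in this data only because the keys are the gap-free sequence 1 to 35; with gaps in the key sequence, filtering on a fixed key value and skipping a fixed number of rows select different rows.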
sum(population) OVER( PARTITION BY city )
calculates the sum of the populations of all rows having the same city value as the current row.
Partitions are specified using the OVER clause, which modifies the aggregate. Syntax:
<OVER_CLAUSE> ::= OVER ( [ PARTITION BY <expr>, ... ] [ ORDER BY <expression> ] )
The OVER clause can partition and order the result set. Ordering is used for order-relative functions such as row_number.
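The partitioned sum described above can be sketched with Python's sqlite3 module, assuming SQLite 3.25 or later. The districts table, its district column and the sample figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (city TEXT, district TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO districts VALUES (?, ?, ?)",
    [("Oslo", "north", 100), ("Oslo", "south", 150), ("Bergen", "centre", 80)],
)

# Each row keeps its own population and also carries the total for its city.
city_totals = conn.execute("""
    SELECT city, district, population,
           SUM(population) OVER (PARTITION BY city) AS city_total
    FROM districts
    ORDER BY city, district
""").fetchall()

for row in city_totals:
    print(row)
```

Unlike GROUP BY, the window form does not collapse the rows: all three input rows appear in the output.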
Query evaluation ANSI
The processing of a SELECT statement according to ANSI SQL would be the following, using this query as an example:
select g.* from users u inner join groups g on g.Userid = u.Userid where u.LastName = 'Smith' and u.FirstName = 'John'
- the FROM clause is evaluated: a cross join (Cartesian product) is produced for the first two tables in the FROM clause, resulting in a virtual table, VTable1
- the ON clause is evaluated for VTable1; only records which meet the join condition g.Userid = u.Userid are inserted into VTable2
- if an outer join is specified, records which were dropped from VTable2 are added into VTable3; for instance, if the above query were:
select u.* from users u left join groups g on g.Userid = u.Userid where u.LastName = 'Smith' and u.FirstName = 'John'
all users who did not belong to any groups would be added back into VTable3
- the WHERE clause is evaluated; in this case only group information for user John Smith would be added to VTable4
- the GROUP BY clause is evaluated; if the above query were:
select g.GroupName, count(*) as NumberOfMembers from users u inner join groups g on g.Userid = u.Userid group by GroupName
VTable5 would consist of members returned from VTable4 arranged by the grouping, in this case the GroupName
- the HAVING clause is evaluated; groups for which the HAVING condition is true are inserted into VTable6. For example:
select g.GroupName, count(*) as NumberOfMembers from users u inner join groups g on g.Userid = u.Userid group by GroupName having count(*) > 5
- the SELECT list is evaluated and returned as VTable7
- the DISTINCT clause is evaluated; duplicate rows are removed and the result is returned as VTable8
- the ORDER BY clause is evaluated, ordering the rows and returning VCursor9. This is a cursor and not a table because ANSI defines a cursor as an ordered set of rows (not relational).
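The effect of the ON, outer-join and WHERE steps can be observed in a small sketch using Python's sqlite3 module. The table layouts and the two sample users are assumptions; the table name groups is double-quoted because GROUPS is a keyword in recent SQLite versions. The sketch shows only the results that the logical order implies, not how any engine actually executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (Userid INTEGER, FirstName TEXT, LastName TEXT);
    CREATE TABLE "groups" (Userid INTEGER, GroupName TEXT);
    INSERT INTO users VALUES (1, 'John', 'Smith'), (2, 'Jane', 'Doe');
    INSERT INTO "groups" VALUES (2, 'admins');  -- John Smith belongs to no group
""")

where_john = "WHERE u.LastName = 'Smith' AND u.FirstName = 'John'"

# Inner join: John has no matching group row, so the ON step drops him.
inner_rows = conn.execute(f"""
    SELECT g.* FROM users u INNER JOIN "groups" g ON g.Userid = u.Userid {where_john}
""").fetchall()

# Left join: the dropped user row is added back before WHERE is applied.
left_rows = conn.execute(f"""
    SELECT u.* FROM users u LEFT JOIN "groups" g ON g.Userid = u.Userid {where_john}
""").fetchall()

print(inner_rows, left_rows)
```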
Window function support by RDBMS vendors
The implementation of window function features by vendors of relational databases and SQL engines varies widely. Most databases support at least some flavour of window functions (MySQL gained them only in version 8.0). However, a closer look shows that most vendors implement only a subset of the standard. Take the powerful RANGE clause as an example: only Oracle, DB2, Spark/Hive, and Google BigQuery fully implement this feature. More recently, vendors have added new extensions to the standard, e.g. array aggregation functions. These are particularly useful when running SQL against a distributed file system (Hadoop, Spark, Google BigQuery), where data co-locality guarantees are weaker than on a distributed relational database (MPP). Rather than evenly distributing the data across all nodes, SQL engines running queries against a distributed filesystem can achieve data co-locality guarantees by nesting data, thus avoiding potentially expensive joins involving heavy shuffling across the network. User-defined aggregate functions that can be used in window functions are another extremely powerful feature.
Generating data in T-SQL
A method to generate data based on UNION ALL:
select 1 a, 1 b union all select 1, 2 union all select 1, 3 union all select 2, 1 union all select 5, 1
SQL Server 2008 supports the "row constructor" specified in the SQL3 ("SQL:1999") standard:
select * from (values (1, 1), (1, 2), (1, 3), (2, 1), (5, 1)) as x(a, b)
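Both ways of generating rows can be compared with a sketch in Python's sqlite3 module. This uses SQLite as a stand-in for T-SQL, which is an assumption of the example: SQLite does not accept a column list on a derived-table alias such as x(a, b), so the VALUES form is written as a common table expression instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows built one SELECT at a time and glued together with UNION ALL.
union_rows = conn.execute("""
    select 1 a, 1 b
    union all select 1, 2
    union all select 1, 3
    union all select 2, 1
    union all select 5, 1
""").fetchall()

# The same rows from a VALUES row constructor, named via a CTE column list.
values_rows = conn.execute("""
    with x(a, b) as (values (1, 1), (1, 2), (1, 3), (2, 1), (5, 1))
    select * from x
""").fetchall()

print(sorted(union_rows) == sorted(values_rows))
```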
- Microsoft. "Transact-SQL Syntax Conventions".
- MySQL. "SQL SELECT Syntax".
- PostgreSQL 9.1.24 Documentation - Chapter 3. Advanced Features
- Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan, Lubor Kollar, and Dejan Sarka
- Horizontal & Vertical Partitioning, Microsoft SQL Server 2000 Books Online.