I once found all admin pages of a web app were taking 100ms longer than they should due to a count query on a huge table. I couldn’t remove the feature, but as admins only needed a rough idea of how many rows there were, I could substitute an estimate 🔮
In most relational databases count(*) is surprisingly slow because of MVCC consistency. PostgreSQL allows concurrent transactions (which could each be modifying rows), so there’s no single canonical state of the data and each transaction sees its own version. A count must therefore scan the whole table (or a suitable index), literally counting every row visible to the current transaction. Counting a subset (with WHERE) performs better because Postgres can scan fewer rows, but performance still hurts for big tables.
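You can see this in the query plan; on a table without a suitable index the count is just an aggregate over a full sequential scan (large_table here is a stand-in for your own table):

EXPLAIN ANALYZE SELECT count(*) FROM large_table;
-- expect an Aggregate over a Seq Scan (possibly parallelised), or an Index Only Scan
-- if a suitable index exists and the visibility map is reasonably up to date;
-- note EXPLAIN ANALYZE actually runs the query, so this takes as long as the count itself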
Side-note about count(1) vs count(*). You might guess count(1) would be fastest, because count(*) appears to be inefficiently saying “fetch every column of every row and count the result”. But the opposite happens: per the SQL standard, “If COUNT(*) is specified, then the result is the cardinality of T”, so no columns need to be loaded, as the * here means cardinality rather than “all columns”. Some databases optimise count(1) in the same way but PostgreSQL does not; it must check every row to confirm that the argument 1 is not NULL. The query plans look identical but the extra NULL check is slower, and benchmarks bear this out. Follow the spec and use count(*)!
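A quick, unscientific way to see the difference for yourself on one of your own big tables is psql’s \timing (large_table is a placeholder):

\timing on
SELECT count(*) FROM large_table;
SELECT count(1) FROM large_table;
-- identical plans, but count(1) is usually slightly slower due to the per-row NULL check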
COUNT(*) is slow because it’s “correct”, but if an estimate is good enough (exploratory queries, vanity metrics, etc.) we can use PostgreSQL’s planner stats for a near-instant answer:
You can query PostgreSQL’s catalogue table pg_class for stats about a whole table. A use case would be admin panels / database clients which show whole-table row counts.
SELECT reltuples::bigint AS count_estimate
FROM pg_class
WHERE oid = to_regclass('public.large_table');
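Worth knowing: reltuples is only as fresh as the last VACUUM/ANALYZE, and on recent PostgreSQL versions (14 onwards, if memory serves) it is -1 for a table that has never been vacuumed or analysed, so you may want to guard against that sentinel:

SELECT NULLIF(reltuples::bigint, -1) AS count_estimate
FROM pg_class
WHERE oid = to_regclass('public.large_table');
-- NULLIF turns the "never analysed" -1 into NULL rather than a bogus count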
Or for more accuracy you can mimic what PostgreSQL’s query planner does: take the rows-per-page density from the last vacuum/analyse (reltuples / relpages) and multiply it by the table’s current number of pages (its size on disk divided by the block size), which compensates for the table having grown or shrunk since the stats were gathered:
SELECT
  (reltuples / relpages) * (
    pg_relation_size('public.large_table') /
    (current_setting('block_size')::integer)
  ) AS count_estimate
FROM pg_class
WHERE oid = to_regclass('public.large_table');
But at this point you might as well use the planner’s estimate directly…
This function works for estimating the count of any query (including WHERE clauses) by parsing EXPLAIN output. It's still just an estimate, but it's fast and good enough for big approximate counts.
I based it off a well-circulated function with these improvements:
- bigint instead of integer, to prevent an exception in case a table really is huge or the stats come back way too high (it happens).

CREATE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
  rec record;
BEGIN
  -- Ask the planner (without executing the query) for its plan as JSON
  EXECUTE 'EXPLAIN (FORMAT json) ' || query INTO rec;
  -- The top-level plan node's estimated row count is our estimated count
  RETURN rec."QUERY PLAN"->0->'Plan'->'Plan Rows';
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
Use it with a query that doesn’t itself contain a count, as the function derives the count for you (otherwise you’ll get the estimate for the count query, which is just 1 aggregate row).
SELECT count_estimate('SELECT * FROM socks WHERE plain = false');
The result can be exactly correct! But I’ve often seen it be up to 10x off, and when estimating the count of a very complex (hundreds of lines of SQL) view it was 10,000x off! So be aware: the more complex the query, the more wrong the estimate is likely to be.
That said, there are a couple of things you can do to improve accuracy:
PostgreSQL’s planner stats are refreshed by VACUUM and ANALYZE (and therefore by the autovacuum daemon). If vacuum has just run and the data hasn’t changed since, count_estimate should be spot on. So you can increase the frequency at which vacuum runs (but be careful you don’t overwhelm the system).
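As a minimal sketch of that tuning (the table name and thresholds are placeholders to adapt to your workload), you can make autovacuum visit a single busy table more often via its storage parameters:

-- visit large_table after roughly 1% of its rows change
-- (the defaults are 20% for vacuum and 10% for analyse)
ALTER TABLE large_table SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_scale_factor = 0.01
);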
Since v10 you can give PostgreSQL additional statistics (CREATE STATISTICS) to improve query plan estimates, generally where one column has some relationship with another. It requires some research to get right and still won’t give perfect results, but it’s worth a look to get the most out of estimates.
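As an illustrative sketch (table and column names invented here), declaring that two columns are functionally dependent looks something like this:

-- tell the planner that city is largely determined by postcode, so filtering
-- on both isn't treated as two independent conditions
CREATE STATISTICS addresses_postcode_city (dependencies)
  ON postcode, city FROM addresses;

-- extended statistics are only populated on the next ANALYZE
ANALYZE addresses;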