DNS Analytics
from Zero to 2M
Complete step-by-step guide to building a DNS query analytics server on Ubuntu 24.04. BIND9 captures every LAN query. Python streams them into PostgreSQL. Daily, weekly, and monthly rollups give you the traffic intelligence to design caching and peering for an ISP at scale.
How Everything Connects
Teaching Lens
Think in layers, not commands. This system is a pipeline where each layer has one job: resolve DNS, capture events, store durable records, then aggregate for decisions.
- Resolver layer answers DNS and emits raw events.
- Ingest layer transforms text logs into structured rows.
- Analytics layer converts high-volume raw data into low-cost summaries.
- If output looks wrong, debug from left to right through the pipeline.
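The four bullets above can be modeled as a toy pipeline. Every name in this sketch is invented for illustration; each stage has exactly one job, which is what makes left-to-right debugging work:

```python
# Toy model of the three pipeline layers (all names are illustrative).

def resolver_emit():                      # layer 1: raw log lines
    return ['07-Oct-2025 14:23:01.512 client 10.0.0.5#53001 '
            '(a.example): query: a.example IN A +']

def ingest(lines):                        # layer 2: text -> structured rows
    rows = []
    for line in lines:
        rest = line.split(' query: ')[1]          # "a.example IN A +"
        domain = rest.split(' IN ')[0]
        qtype = rest.split(' IN ')[1].split()[0]
        rows.append({'domain': domain, 'qtype': qtype})
    return rows

def aggregate(rows):                      # layer 3: rows -> cheap summaries
    counts = {}
    for r in rows:
        counts[r['domain']] = counts.get(r['domain'], 0) + 1
    return counts

print(aggregate(ingest(resolver_emit())))  # {'a.example': 1}
```

If the final counts look wrong, test each stage in isolation, in this order.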
Installing & Configuring BIND9
BIND9 acts as a recursive caching resolver for your entire LAN. Every query is logged to the file that feeds our analytics pipeline.
Teaching Lens
The goal is not just to "install DNS". You are building a trustworthy data source. If resolver behavior is unstable, every downstream metric is noisy.
- ACL controls who can recurse through your resolver.
- Forwarders control upstream behavior and latency.
- Logging format and timestamp quality decide parser reliability.
- Validation order: syntax, service state, query test, log line test.
1.1 Install BIND9
Update package index
sudo apt-get update
Why this command Refreshes apt metadata so you install from the newest package index, not stale repository references.
What to verify No 404 repository errors and exit code 0 before proceeding to installation.
Install BIND9 and tools
sudo apt-get install -y bind9 bind9utils bind9-doc dnsutils
Why this command Installs the resolver daemon plus diagnostic tools so you can both serve DNS and test DNS from the same host.
What to verify `named`, `dig`, and `named-checkconf` are available on PATH after install.
Verify and enable on boot
sudo systemctl status bind9 && sudo systemctl enable bind9
1.2 Configure named.conf.options
sudo cp /etc/bind/named.conf.options /etc/bind/named.conf.options.bak
// DNS Analytics Server -- 192.168.234.128
acl "trusted" {
    127.0.0.1;
    192.168.0.0/16;
    10.0.0.0/8;
    172.16.0.0/12;
};

options {
    directory "/var/cache/bind";

    listen-on { any; };
    listen-on-v6 { any; };

    allow-query { trusted; };
    allow-recursion { trusted; };

    forwarders {
        8.8.8.8;
        8.8.4.4;
        1.1.1.1;
        9.9.9.9;
    };
    forward only;

    dnssec-validation auto;
    allow-transfer { none; };
    notify no;
    recursion yes;

    max-cache-size 256m;
    min-cache-ttl 60;
    max-cache-ttl 86400;

    rate-limit {
        responses-per-second 50;
        window 5;
    };

    pid-file "/run/named/named.pid";
};
Why this config Defines trust boundaries (ACL), recursion policy, upstream forwarding, and cache behavior that shape both security and performance.
What to verify Only trusted networks can recurse and external hosts receive refusal instead of open resolver behavior.
1.3 Configure Query Logging
sudo mkdir -p /var/log/named
sudo chown bind:bind /var/log/named && sudo chmod 755 /var/log/named
BIND9 runs as the bind user. If the log directory is owned by root, logging fails silently: no queries.log and zero error messages.
logging {
    channel queries_log {
        file "/var/log/named/queries.log" versions 7 size 500m;
        severity dynamic;
        print-time yes;    // CRITICAL: timestamp on every line
        print-severity no;
        print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log" versions 4 size 50m;
        severity info;
        print-time yes;
        print-severity yes;
        print-category yes;
    };

    category queries { queries_log; };
    category query-errors { queries_log; };

    category default { named_log; };
    category general { named_log; };
    category config { named_log; };
    category network { named_log; };
    category security { named_log; };

    category lame-servers { null; };
    category dnssec { null; };
    category resolver { null; };
    category cname { null; };
    category xfer-in { null; };
    category xfer-out { null; };
    category notify { null; };
    category client { null; };
    category unmatched { null; };
    category dispatch { null; };
    category edns-disabled { null; };
    category rpz { null; };
    category rate-limit { null; };
};
1.4 Validate, Restart, Test
sudo named-checkconf    # silence = success
sudo systemctl restart bind9
dig @192.168.234.128 google.com A
dig @192.168.234.128 youtube.com AAAA
sudo tail -20 /var/log/named/queries.log
Why this sequence It validates syntax, applies config, verifies DNS function, then verifies telemetry emission in one deterministic flow.
What to verify `dig` answers with status NOERROR and each query appears in `queries.log` with timestamp.
BIND9 installed and logging every query to /var/log/named/queries.log.
PostgreSQL — Installation & Schema
Teaching Lens
You are modeling events over time. Keep raw data append-heavy, and keep reporting data pre-aggregated so analytics queries stay fast under growth.
- Raw table keeps full fidelity for reprocessing.
- Daily, weekly, monthly tables reduce compute cost.
- Indexes are aligned with your real query patterns.
- Categories table turns technical domains into business language.
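A quick way to see why pre-aggregation pays off: rolling raw events up collapses one row per query into one row per (day, domain). The counts below are made up for illustration:

```python
from collections import Counter
from datetime import date

# Hypothetical raw events: one (day, domain) pair per query,
# as the raw dns_queries table would store them.
raw = [(date(2025, 10, 6), d) for d in
       ['google.com'] * 500 + ['youtube.com'] * 300 + ['netflix.com'] * 200]

# One row per (day, domain) instead of one per query -- the daily-stats idea.
daily = Counter(raw)

print(len(raw), '->', len(daily))                  # 1000 -> 3
print(daily[(date(2025, 10, 6), 'google.com')])    # 500
```

A thousand raw rows become three summary rows; report queries then scan the summaries, not the raw table.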
2.1 Install
sudo apt-get install -y postgresql postgresql-contrib
sudo systemctl enable postgresql
sudo -u postgres psql -c "SELECT version();"
Why this command Installs core PostgreSQL plus common extensions package, then confirms the database engine is healthy before schema work.
What to verify Version query returns one row and service is enabled for reboot persistence.
2.2 Create Database and User
CREATE USER dns_user WITH PASSWORD 'ChangeThisPassword!'
    NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN;

CREATE DATABASE dns_analytics
    OWNER dns_user
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8'
    TEMPLATE template0;

GRANT ALL PRIVILEGES ON DATABASE dns_analytics TO dns_user;
\q
2.3 Schema
psql -U dns_user -d dns_analytics -h localhost

-- Raw query log, partitioned by day
CREATE TABLE dns_queries (
    id             BIGSERIAL,
    queried_at     TIMESTAMPTZ NOT NULL,
    client_ip      INET NOT NULL,
    client_port    INTEGER,
    domain         TEXT NOT NULL,
    apex_domain    TEXT NOT NULL,
    qtype          VARCHAR(10) NOT NULL,
    flags          TEXT,
    response_rcode TEXT,
    cached         BOOLEAN DEFAULT FALSE,
    inserted_at    TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (queried_at);

CREATE TABLE dns_queries_today PARTITION OF dns_queries
    FOR VALUES FROM (CURRENT_DATE) TO (CURRENT_DATE + INTERVAL '1 day');

-- Per-domain count per day
CREATE TABLE dns_daily_stats (
    stat_date      DATE NOT NULL,
    domain         TEXT NOT NULL,
    apex_domain    TEXT NOT NULL,
    qtype          VARCHAR(10) NOT NULL,
    hit_count      BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    nxdomain_count INTEGER NOT NULL DEFAULT 0,
    servfail_count INTEGER NOT NULL DEFAULT 0,
    first_seen     TIMESTAMPTZ,
    last_seen      TIMESTAMPTZ,
    PRIMARY KEY (stat_date, domain, qtype)
);

CREATE TABLE dns_weekly_stats (
    week_start     DATE NOT NULL,
    week_end       DATE NOT NULL,
    apex_domain    TEXT NOT NULL,
    total_queries  BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    peak_day       DATE,
    peak_day_count BIGINT,
    category       TEXT,
    PRIMARY KEY (week_start, apex_domain)
);

CREATE TABLE dns_monthly_stats (
    year_month     CHAR(7) NOT NULL,
    apex_domain    TEXT NOT NULL,
    total_queries  BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    avg_daily      NUMERIC(12,2),
    rank_in_month  INTEGER,
    category       TEXT,
    PRIMARY KEY (year_month, apex_domain)
);

CREATE TABLE dns_hourly_heatmap (
    stat_date   DATE NOT NULL,
    hour_of_day SMALLINT NOT NULL CHECK (hour_of_day BETWEEN 0 AND 23),
    apex_domain TEXT NOT NULL,
    query_count BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (stat_date, hour_of_day, apex_domain)
);

CREATE TABLE domain_categories (
    apex_domain    TEXT PRIMARY KEY,
    category       TEXT NOT NULL,
    provider       TEXT,
    cache_priority SMALLINT DEFAULT 5,
    tagged_at      TIMESTAMPTZ DEFAULT now()
);

INSERT INTO domain_categories VALUES
    ('google.com','search','Google',3,now()),      ('googleapis.com','cdn','Google',2,now()),
    ('youtube.com','streaming','Google',1,now()),  ('googlevideo.com','streaming','Google',1,now()),
    ('netflix.com','streaming','Netflix',1,now()), ('nflxvideo.net','streaming','Netflix',1,now()),
    ('nflximg.net','cdn','Netflix',2,now()),       ('akamai.net','cdn','Akamai',1,now()),
    ('akamaiedge.net','cdn','Akamai',1,now()),     ('cloudflare.com','cdn','Cloudflare',2,now()),
    ('facebook.com','social','Meta',4,now()),      ('instagram.com','social','Meta',4,now()),
    ('whatsapp.net','messaging','Meta',3,now()),   ('whatsapp.com','messaging','Meta',3,now()),
    ('twitter.com','social','X',4,now()),          ('twimg.com','cdn','X',2,now()),
    ('tiktok.com','social','TikTok',4,now()),      ('microsoft.com','work','Microsoft',5,now()),
    ('office.com','work','Microsoft',5,now()),     ('windowsupdate.com','updates','Microsoft',8,now()),
    ('ubuntu.com','updates','Canonical',8,now()),  ('amazonaws.com','cloud','AWS',5,now()),
    ('fastly.net','cdn','Fastly',2,now()),         ('cloudfront.net','cdn','AWS',2,now()),
    ('doubleclick.net','ads','Google',9,now()),    ('googlesyndication.com','ads','Google',9,now());
COMMIT;
Why this schema Separates raw immutable events from summary tables so ingestion remains simple while analytics stay fast.
What to verify Tables exist, primary keys are present, and seed categories insert without constraint errors.
2.4 Indexes
CREATE INDEX idx_q_time   ON dns_queries (queried_at DESC);
CREATE INDEX idx_q_apex   ON dns_queries (apex_domain, queried_at DESC);
CREATE INDEX idx_q_client ON dns_queries (client_ip, queried_at DESC);

CREATE INDEX idx_d_apex   ON dns_daily_stats (apex_domain, stat_date DESC);
CREATE INDEX idx_d_hits   ON dns_daily_stats (stat_date, hit_count DESC);
CREATE INDEX idx_w_domain ON dns_weekly_stats (apex_domain, week_start DESC);
CREATE INDEX idx_m_domain ON dns_monthly_stats (apex_domain, year_month DESC);

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_d_trgm ON dns_daily_stats USING gin (apex_domain gin_trgm_ops);
COMMIT;
Five-table schema built, indexed, 26 domain categories seeded.
The Log Ingestor — Python Daemon
Teaching Lens
The ingestor is an ETL loop: read, parse, normalize, write. Batch writes improve throughput and reduce database lock pressure.
- Regex parser must tolerate noisy log variants.
- Batch size and flush interval balance latency vs throughput.
- Systemd gives restart guarantees after crashes or reboots.
- If inserts stop, check file permissions, regex match rate, then DB auth.
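The batch-size versus flush-interval tradeoff reduces to a single predicate. This sketch mirrors the flush condition the ingestor script in 3.2 uses, with the same thresholds; the size cap bounds memory, the time cap bounds latency during quiet periods:

```python
BATCH_SIZE = 100      # same thresholds as ingestor.py
FLUSH_INTERVAL = 5    # seconds

def should_flush(batch_len, now, last_flush):
    """Flush when the batch is full, or when anything has been
    buffered longer than FLUSH_INTERVAL seconds."""
    return batch_len >= BATCH_SIZE or (batch_len > 0 and now - last_flush >= FLUSH_INTERVAL)

print(should_flush(100, 10.0, 9.0))   # True  (batch full)
print(should_flush(3, 16.0, 10.0))    # True  (interval elapsed)
print(should_flush(0, 20.0, 10.0))    # False (nothing buffered)
```

Raising BATCH_SIZE trades insert latency for fewer transactions; lowering FLUSH_INTERVAL does the reverse.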
3.1 Python Setup
sudo apt-get install -y python3 python3-pip python3-venv
sudo mkdir -p /opt/dns-ingestor && sudo chown $USER:$USER /opt/dns-ingestor
cd /opt/dns-ingestor && python3 -m venv venv
source venv/bin/activate && pip install psycopg2-binary
Why this command Creates an isolated runtime so ingestor dependencies are deterministic and decoupled from system Python.
What to verify `venv/bin/python` exists and `pip show psycopg2-binary` resolves inside the virtual environment.
3.2 The Script
nano /opt/dns-ingestor/ingestor.py

#!/usr/bin/env python3
import re, time, logging, os, signal, sys
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values

DB = {'dbname': 'dns_analytics', 'user': 'dns_user',
      'password': 'ChangeThisPassword!', 'host': '127.0.0.1', 'port': 5432}
LOG_FILE = '/var/log/named/queries.log'
BATCH_SIZE = 100
FLUSH_INTERVAL = 5

QUERY_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
    r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)

def apex(d):
    # Naive apex extraction: last two labels. Fine for google.com,
    # wrong for multi-label TLDs such as co.uk.
    p = d.rstrip('.').split('.')
    return '.'.join(p[-2:]) if len(p) >= 2 else d

def parse_ts(s):
    try:
        return datetime.strptime(s, '%d-%b-%Y %H:%M:%S')
    except ValueError:
        return datetime.utcnow()

def parse(line):
    line = line.strip()
    if not line:
        return None
    m = QUERY_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    dom = d.get('domain', '').lower().rstrip('.')
    if not dom:
        return None
    flags = d.get('flags', '').strip()
    return {'queried_at': parse_ts(d['ts']), 'client_ip': d.get('ip', '0.0.0.0'),
            'client_port': int(d.get('port', 0)), 'domain': dom, 'apex_domain': apex(dom),
            'qtype': d.get('qtype', 'A').upper(), 'flags': flags,
            'response_rcode': 'NOERROR', 'cached': 'CL' in flags}

def connect():
    while True:
        try:
            c = psycopg2.connect(**DB)
            c.autocommit = False
            logging.info('DB connected')
            return c
        except psycopg2.OperationalError as e:
            logging.error(f'DB: {e}')
            time.sleep(10)

def flush(conn, batch):
    if not batch:
        return 0
    sql = ('INSERT INTO dns_queries '
           '(queried_at,client_ip,client_port,domain,apex_domain,'
           'qtype,flags,response_rcode,cached) VALUES %s ON CONFLICT DO NOTHING')
    rows = [(r['queried_at'], r['client_ip'], r['client_port'], r['domain'],
             r['apex_domain'], r['qtype'], r['flags'], r['response_rcode'], r['cached'])
            for r in batch]
    try:
        with conn.cursor() as cur:
            execute_values(cur, sql, rows, page_size=500)
        conn.commit()
        return len(batch)
    except psycopg2.Error as e:
        logging.error(f'Insert: {e}')
        conn.rollback()
        return 0

def run(conn):
    batch, last_flush, total, inode, f = [], time.time(), 0, 0, None
    logging.info(f'Watching {LOG_FILE}')
    while True:
        try:
            cur_inode = os.stat(LOG_FILE).st_ino
        except FileNotFoundError:
            time.sleep(5)
            continue
        if cur_inode != inode:  # first open, or the log rotated to a new inode
            try:
                f = open(LOG_FILE, 'r')
                f.seek(0, 2)    # start at EOF; ingest only new lines
                inode = cur_inode
            except IOError:
                time.sleep(5)
                continue
        while True:
            line = f.readline()
            if not line:
                break
            rec = parse(line)
            if rec:
                batch.append(rec)
        now = time.time()
        if len(batch) >= BATCH_SIZE or (batch and now - last_flush >= FLUSH_INTERVAL):
            n = flush(conn, batch)
            total += n
            logging.info(f'Flushed {n} | total {total:,}')
            batch, last_flush = [], now
        time.sleep(0.5)

if __name__ == '__main__':
    os.makedirs('/var/log/dns-ingestor', exist_ok=True)
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s [%(levelname)s] %(message)s',
                        handlers=[logging.StreamHandler(sys.stdout),
                                  logging.FileHandler('/var/log/dns-ingestor/ingestor.log')])
    def bye(sig, frame):
        sys.exit(0)
    signal.signal(signal.SIGTERM, bye)
    signal.signal(signal.SIGINT, bye)
    run(connect())
Why this code Implements resilient tail-and-batch ingestion with reconnect logic, minimizing data loss during transient DB failures.
What to verify Logs show repeated "Flushed N" messages and database row counts increase during DNS traffic.
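Before pointing the ingestor at live traffic, the regex can be checked offline. The sample line below is an assumed BIND 9.18-style entry, not one captured from a real server; note also that some BIND versions emit parenthesized EDNS flag groups such as +E(0)K, which the flags character class does not cover, so verify against a line from your own queries.log:

```python
import re
from datetime import datetime

# Same pattern as ingestor.py.
QUERY_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
    r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)

# Assumed sample line in the format 1.3's print-time setting produces.
line = ('07-Oct-2025 14:23:01.512 client @0x7f8 192.168.1.23#51423 '
        '(youtube.com): query: youtube.com IN A +E (192.168.234.128)')

m = QUERY_RE.match(line)
assert m, 'regex did not match: check your BIND version and log format'
print(m.group('domain'), m.group('qtype'), m.group('ip'))  # youtube.com A 192.168.1.23
print(datetime.strptime(m.group('ts'), '%d-%b-%Y %H:%M:%S'))
```

If this assertion fails against a real line, the parser match rate is the first thing to fix; nothing downstream can work.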
3.3 Systemd Service
# /etc/systemd/system/dns-ingestor.service
[Unit]
Description=DNS Query Log Ingestor
After=network.target postgresql.service bind9.service
Requires=postgresql.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/dns-ingestor
ExecStart=/opt/dns-ingestor/venv/bin/python3 /opt/dns-ingestor/ingestor.py
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now dns-ingestor
sudo journalctl -u dns-ingestor -f
Why this command Registers your unit file, starts it immediately, and tails logs so first-run failures are visible in real time.
What to verify Service state is `active (running)` and no crash-loop pattern appears in journal output.
3.4 End-to-End Test
for d in google.com youtube.com netflix.com instagram.com microsoft.com; do
dig @192.168.234.128 $d A +short
done
sleep 6
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT domain,qtype,queried_at FROM dns_queries ORDER BY queried_at DESC LIMIT 10;"

BIND9 → Python → PostgreSQL. Every LAN DNS query is now captured and stored.
Rollup Jobs — Daily, Weekly & Monthly
All scripts use ON CONFLICT DO UPDATE — idempotent, safe to rerun at any time.
Teaching Lens
Rollups are your performance strategy. Instead of scanning billions of raw rows for every report, you pre-compute stable summaries on a schedule.
- Daily table is operational and near-term.
- Weekly table is planning and trend-oriented.
- Monthly table is executive reporting and capacity forecasting.
- Idempotent SQL means recovery is rerun, not manual repair.
sudo mkdir -p /opt/dns-ingestor/sql
4.1 Daily Rollup
-- Run at 01:00 every morning
INSERT INTO dns_daily_stats
(stat_date,domain,apex_domain,qtype,hit_count,unique_clients,
nxdomain_count,servfail_count,first_seen,last_seen)
SELECT
DATE(queried_at AT TIME ZONE 'Africa/Kampala') AS stat_date,
domain, apex_domain, qtype,
COUNT(*), COUNT(DISTINCT client_ip),
COUNT(*) FILTER (WHERE response_rcode='NXDOMAIN'),
COUNT(*) FILTER (WHERE response_rcode='SERVFAIL'),
MIN(queried_at), MAX(queried_at)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
AND queried_at < CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,domain,apex_domain,qtype
ON CONFLICT (stat_date,domain,qtype) DO UPDATE SET
hit_count=EXCLUDED.hit_count, unique_clients=EXCLUDED.unique_clients,
nxdomain_count=EXCLUDED.nxdomain_count, servfail_count=EXCLUDED.servfail_count,
first_seen=EXCLUDED.first_seen, last_seen=EXCLUDED.last_seen;
INSERT INTO dns_hourly_heatmap (stat_date,hour_of_day,apex_domain,query_count)
SELECT DATE(queried_at AT TIME ZONE 'Africa/Kampala'),
EXTRACT(HOUR FROM queried_at AT TIME ZONE 'Africa/Kampala')::SMALLINT,
apex_domain, COUNT(*)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
AND queried_at < CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,2,3
ON CONFLICT (stat_date,hour_of_day,apex_domain) DO UPDATE SET query_count=EXCLUDED.query_count;
COMMIT;
Why this SQL Compresses raw query volume into daily business-ready metrics and heatmap dimensions with rerun-safe conflict handling.
What to verify Running twice should not duplicate counts; rows are updated deterministically for the same date window.
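ON CONFLICT ... DO UPDATE behaves like a keyed overwrite: the conflict target acts as a dictionary key, and a rerun replaces the row rather than adding to it. A pure-Python model of that semantics, with made-up data:

```python
# Model of ON CONFLICT (stat_date, domain, qtype) DO UPDATE:
# rerunning the same rollup overwrites the row, never double-counts it.
def upsert(table, key, row):
    table[key] = row          # replace, not accumulate

stats = {}
rollup = [(('2025-10-06', 'google.com', 'A'), {'hit_count': 500})]

for key, row in rollup:       # first run
    upsert(stats, key, row)
for key, row in rollup:       # second run: same input, same result
    upsert(stats, key, row)

print(len(stats))                                              # 1
print(stats[('2025-10-06', 'google.com', 'A')]['hit_count'])   # 500
```

This is why recovery from a missed or interrupted rollup is simply "run it again".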
4.2 Weekly Rollup
-- Run Monday 02:00
INSERT INTO dns_weekly_stats
(week_start,week_end,apex_domain,total_queries,unique_clients,peak_day,peak_day_count,category)
WITH totals AS (
SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start,
(DATE_TRUNC('week',stat_date)+INTERVAL '6 days')::DATE AS week_end,
apex_domain, SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients
FROM dns_daily_stats
WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
AND stat_date < DATE_TRUNC('week',CURRENT_DATE)::DATE
GROUP BY 1,2,3
), peaks AS (
SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start, apex_domain,
stat_date AS peak_day, SUM(hit_count) AS day_total,
RANK() OVER (PARTITION BY DATE_TRUNC('week',stat_date),apex_domain
ORDER BY SUM(hit_count) DESC) AS rnk
FROM dns_daily_stats
WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
AND stat_date < DATE_TRUNC('week',CURRENT_DATE)::DATE
GROUP BY 1,2,3
)
SELECT t.week_start,t.week_end,t.apex_domain,t.total_queries,t.unique_clients,
p.peak_day,p.day_total,dc.category
FROM totals t
LEFT JOIN peaks p ON p.week_start=t.week_start AND p.apex_domain=t.apex_domain AND p.rnk=1
LEFT JOIN domain_categories dc ON dc.apex_domain=t.apex_domain
ON CONFLICT (week_start,apex_domain) DO UPDATE SET
total_queries=EXCLUDED.total_queries,unique_clients=EXCLUDED.unique_clients,
peak_day=EXCLUDED.peak_day,peak_day_count=EXCLUDED.peak_day_count;
COMMIT;
4.3 Monthly Rollup
-- Run 1st of month 03:00
INSERT INTO dns_monthly_stats (year_month,apex_domain,total_queries,unique_clients,avg_daily,rank_in_month,category)
WITH m AS (
SELECT TO_CHAR(stat_date,'YYYY-MM') AS year_month, apex_domain,
SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients,
ROUND(AVG(hit_count),2) AS avg_daily
FROM dns_daily_stats
WHERE TO_CHAR(stat_date,'YYYY-MM')=TO_CHAR(CURRENT_DATE-INTERVAL '1 month','YYYY-MM')
GROUP BY 1,2
)
SELECT m.year_month,m.apex_domain,m.total_queries,m.unique_clients,m.avg_daily,
RANK() OVER (PARTITION BY m.year_month ORDER BY m.total_queries DESC), dc.category
FROM m LEFT JOIN domain_categories dc ON dc.apex_domain=m.apex_domain
ON CONFLICT (year_month,apex_domain) DO UPDATE SET
total_queries=EXCLUDED.total_queries,avg_daily=EXCLUDED.avg_daily,rank_in_month=EXCLUDED.rank_in_month;
COMMIT;
4.4 Cron Schedule
# DNS Analytics Rollups
0 1 * * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_daily.sql   >> /var/log/dns-ingestor/daily.log 2>&1
0 2 * * 1  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_weekly.sql  >> /var/log/dns-ingestor/weekly.log 2>&1
0 3 1 * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_monthly.sql >> /var/log/dns-ingestor/monthly.log 2>&1

# Create tomorrow's partition at 23:30
30 23 * * * psql -U dns_user -d dns_analytics -h localhost -c "CREATE TABLE IF NOT EXISTS dns_queries_$(date -d tomorrow +\%Y\%m\%d) PARTITION OF dns_queries FOR VALUES FROM (CURRENT_DATE+INTERVAL '1 day') TO (CURRENT_DATE+INTERVAL '2 days');" >> /var/log/dns-ingestor/partition.log 2>&1
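The partition job's date arithmetic is easy to sanity-check offline. This sketch reproduces the cron job's naming scheme and bounds in Python for a fixed example date (the date itself is arbitrary):

```python
from datetime import date, timedelta

# Reproduce the 23:30 cron job's partition name and bounds
# for an example "today".
today = date(2025, 10, 6)
start = today + timedelta(days=1)
end = today + timedelta(days=2)

name = f"dns_queries_{start:%Y%m%d}"
ddl = (f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF dns_queries "
       f"FOR VALUES FROM ('{start}') TO ('{end}');")

print(name)   # dns_queries_20251007
print(ddl)
```

Each partition covers exactly one day, with the upper bound exclusive, so consecutive partitions never overlap.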
SQL Analytics — Reading Your Traffic
Teaching Lens
Every query answers a business question. Treat SQL here as decision logic for peering, caching, abuse detection, and customer experience.
- Top domains identify concentration risk.
- Category mix shows where bandwidth is actually going.
- Hourly heatmaps guide cache warm-up windows.
- NXDOMAIN rates surface misconfigurations or bot noise.
5.1 Top 50 Domains (7-Day)
SELECT ds.apex_domain, dc.category, dc.provider, dc.cache_priority,
       SUM(ds.hit_count) AS total_queries,
       ROUND(SUM(ds.hit_count)*100.0 / SUM(SUM(ds.hit_count)) OVER (), 2) AS pct,
       ROUND(AVG(ds.hit_count), 0) AS avg_per_day
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain = ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY ds.apex_domain, dc.category, dc.provider, dc.cache_priority
ORDER BY total_queries DESC
LIMIT 50;
Why this query Produces concentration analysis: which domains dominate resolver workload and therefore drive caching and peering priorities.
What to verify Percentages roughly sum to 100 and top domains align with observed subscriber behavior.
5.2 By Category
SELECT COALESCE(dc.category, 'uncategorised') AS category,
       COUNT(DISTINCT ds.apex_domain) AS domains,
       SUM(ds.hit_count) AS total_queries,
       ROUND(SUM(ds.hit_count)*100.0 / SUM(SUM(ds.hit_count)) OVER (), 2) AS pct
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain = ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY total_queries DESC;
5.3 24-Hour Heatmap
SELECT LPAD(h.hour_of_day::TEXT, 2, '0') || ':00' AS hour,
       SUM(h.query_count) AS queries,
       REPEAT('█', (SUM(h.query_count) / NULLIF(MAX(SUM(h.query_count)) OVER () / 40, 0))::INT) AS bar
FROM dns_hourly_heatmap h
WHERE h.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY h.hour_of_day
ORDER BY h.hour_of_day;
5.4 NXDOMAIN Analysis
SELECT apex_domain,
       SUM(hit_count) AS total,
       SUM(nxdomain_count) AS failures,
       ROUND(SUM(nxdomain_count)*100.0 / NULLIF(SUM(hit_count), 0), 1) AS fail_pct
FROM dns_daily_stats
WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
  AND nxdomain_count > 0
GROUP BY apex_domain
HAVING SUM(nxdomain_count) > 10
ORDER BY fail_pct DESC
LIMIT 30;
Scaling to 2 Million Subscribers
2M subs = 600M queries/day = 7,000 qps sustained. Text log files break at this scale. The data model stays identical; the transport layer changes.
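The headline figures can be rechecked with simple arithmetic. The per-subscriber query rate below is the assumption implied by the 600M/day figure, and the 3x evening peak factor is an assumption as well, not a measurement:

```python
# Back-of-envelope check of the headline capacity numbers.
subs = 2_000_000
queries_per_sub_per_day = 300          # assumed average per subscriber

per_day = subs * queries_per_sub_per_day
qps_avg = per_day / 86_400             # seconds in a day

print(f"{per_day:,} queries/day")      # 600,000,000 queries/day
print(f"{qps_avg:,.0f} qps average")   # 6,944 qps average
print(f"{qps_avg * 3:,.0f} qps at an assumed 3x evening peak")
```

Sizing for the average is not enough; the resolver fleet and Kafka cluster must absorb the peak.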
Teaching Lens
At scale, architecture changes where data moves, not what data means. Keep schema semantics stable so reports remain comparable from lab to production.
- Switch text logs to binary event streams (dnstap).
- Switch single-process ingest to distributed stream workers.
- Switch plain PostgreSQL patterns to time-series optimizations.
- Keep privacy controls explicit before volume grows.
6.1 Production Architecture
| Layer | VM (this guide) | Production (2M subs) |
|---|---|---|
| DNS daemon | BIND9 text log | Unbound x8 + dnstap |
| Transport | File on disk | Kafka 3-broker cluster |
| Ingestor | 1 Python process | Faust workers x8 |
| Database | PostgreSQL 16 | TimescaleDB on NVMe RAID |
| Volume | ~50K/day | 600M/day (7K qps) |
6.2 TimescaleDB
sudo apt-get install -y timescaledb-2-postgresql-16
sudo timescaledb-tune --quiet --yes && sudo systemctl restart postgresql
CREATE EXTENSION IF NOT EXISTS timescaledb;

SELECT create_hypertable('dns_queries', 'queried_at',
    chunk_time_interval => INTERVAL '1 day', if_not_exists => TRUE);

ALTER TABLE dns_queries SET (
    timescaledb.compress,
    timescaledb.compress_orderby = 'queried_at DESC',
    timescaledb.compress_segmentby = 'apex_domain'
);
SELECT add_compression_policy('dns_queries', INTERVAL '7 days');

CREATE MATERIALIZED VIEW dns_daily_live WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', queried_at) AS bucket,
       apex_domain, qtype,
       COUNT(*) AS hit_count,
       COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries
GROUP BY bucket, apex_domain, qtype;

SELECT add_continuous_aggregate_policy('dns_daily_live',
    start_offset => INTERVAL '3 days',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
6.3 dnstap
dnstap:
dnstap-enable: yes
dnstap-socket-path: "/var/run/unbound/dnstap.sock"
dnstap-log-client-query-messages: yes
    dnstap-log-client-response-messages: yes

Store only the /24 subnet, not the full /32 client IP. Aggregated domain counts contain no PII.
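The /24 truncation can be done at ingest time with the standard library. The helper name here is hypothetical, and for IPv6 you would truncate to a /48 or /64 prefix instead:

```python
import ipaddress

def anonymize(ip: str) -> str:
    """Sketch of the suggested privacy step: keep the /24 network,
    drop the host portion of the client address (IPv4 only)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

print(anonymize('192.168.1.23'))   # 192.168.1.0/24
```

Applied in the ingestor before the INSERT, this keeps per-subnet analytics while never persisting an individual subscriber's address.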
Firewall (UFW)
Teaching Lens
Security here is segmentation, not complexity. DNS stays reachable only from trusted private ranges; database stays local unless you explicitly tunnel it.
sudo ufw enable && sudo ufw allow ssh
sudo ufw allow from 192.168.0.0/16 to any port 53 proto udp
sudo ufw allow from 192.168.0.0/16 to any port 53 proto tcp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto udp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto tcp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto udp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto tcp
sudo ufw allow from 127.0.0.1 to any port 5432 proto tcp
sudo ufw default deny incoming && sudo ufw default allow outgoing
Troubleshooting
Teaching Lens
Troubleshooting should follow the data path. Start at resolver output, then parser health, then database writes, then rollups, then reports.
| Symptom | Check | Command |
|---|---|---|
| BIND9 won't start | Config syntax | sudo named-checkconf |
| No queries.log | Log dir ownership | ls -la /var/log/named/ |
| dig returns REFUSED | Client not in ACL | sudo journalctl -u bind9 -n 20 |
| Ingestor not inserting | Wrong DB password | sudo journalctl -u dns-ingestor -n 40 |
| Partition error | Partition absent | Create partition manually |
| psql auth fail | pg_hba.conf peer | Add md5 line below |
| Empty daily_stats | Rollup not run yet | Run rollup_daily.sql manually |
sudo nano /etc/postgresql/16/main/pg_hba.conf
# Add ABOVE existing local lines:
# host    dns_analytics    dns_user    127.0.0.1/32    md5
sudo systemctl reload postgresql
BIND9 capturing. Python ingesting. PostgreSQL storing. Cron aggregating. Run the Chapter 5 queries after 7 days and you have everything to design caching, CDN peering, and BGP strategy.