DNS Analytics
from Zero to 2M
Complete step-by-step guide to a DNS query analytics server on Ubuntu 24.04. BIND9 captures every LAN query. Python streams them into PostgreSQL. Daily, weekly, and monthly rollups give you the traffic intelligence to design caching and peering for an ISP at scale.
How Everything Connects
Teaching Lens
This lesson teaches pipeline thinking instead of command memorization: resolve DNS, capture events, store durable records, then aggregate for decisions.
- You verify resolver behavior before trusting downstream analytics.
- You verify ingest transforms raw logs into structured rows.
- You verify rollups convert raw volume into low-cost summaries.
- You verify faults by debugging left to right across the pipeline (see the sweep sketch below).
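The left-to-right sweep can be scripted. This is a minimal sketch only, assuming the paths and names this guide sets up later (queries.log, the dns-ingestor unit, the dns_analytics database):
#!/usr/bin/env bash
# Pipeline sweep: check each stage left to right and report the first failure.
dig @127.0.0.1 example.com A +short                  || echo "FAIL: resolver"
sudo tail -1 /var/log/named/queries.log              || echo "FAIL: query logging"
systemctl is-active --quiet dns-ingestor             || echo "FAIL: ingestor"
psql -U dns_user -d dns_analytics -h localhost \
     -c "SELECT count(*) FROM dns_queries;"          || echo "FAIL: database"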
Installing & Configuring BIND9
BIND9 acts as a recursive caching resolver for your entire LAN. Every query is logged to the file that feeds our analytics pipeline.
Teaching Lens
This lesson teaches that resolver stability is a data-quality requirement, not just a DNS setup task.
- You verify ACL policy to prevent open-recursion abuse.
- You verify forwarder behavior and upstream response reliability.
- You verify log timestamp quality for parser correctness.
- You verify in order: syntax, service state, query path, and log emission.
1.1 Install BIND9
Update package index
sudo apt-get update
This lesson teaches package metadata refresh as a prerequisite for reliable installation results.
You verify there are no repository errors and the command exits successfully before continuing.
Install BIND9 and tools
sudo apt-get install -y bind9 bind9utils bind9-doc dnsutils
This lesson teaches installing both the resolver service and its diagnostic toolchain in the same step.
You verify named, dig, and named-checkconf are available after installation.
Verify and enable on boot
sudo systemctl status bind9 && sudo systemctl enable bind9
1.2 Configure named.conf.options
sudo cp /etc/bind/named.conf.options /etc/bind/named.conf.options.bak
Before you paste this block, understand its job: this is the resolver policy engine for your network. The acl "trusted" section defines who is allowed to use your resolver recursively, which prevents open-resolver abuse. The forwarders list defines where unresolved queries are sent upstream. Cache and rate-limit settings protect performance during spikes. Read this as "who may ask, where answers come from, and how safely the resolver behaves under load."
// DNS Analytics Server -- 192.168.234.128
acl "trusted" {
    127.0.0.1;
    192.168.0.0/16;
    10.0.0.0/8;
    172.16.0.0/12;
};

options {
    directory "/var/cache/bind";
    listen-on { any; };
    listen-on-v6 { any; };
    allow-query { trusted; };
    allow-recursion { trusted; };
    forwarders { 8.8.8.8; 8.8.4.4; 1.1.1.1; 9.9.9.9; };
    forward only;
    dnssec-validation auto;
    allow-transfer { none; };
    notify no;
    recursion yes;
    max-cache-size 256m;
    min-cache-ttl 60;
    max-cache-ttl 86400;
    rate-limit { responses-per-second 50; window 5; };
    pid-file "/run/named/named.pid";
};
This lesson teaches resolver policy design using ACL, recursion limits, forwarding strategy, and cache controls.
You verify only trusted networks can recurse and external clients are refused.
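A minimal acceptance test for the ACL, sketched here; it assumes a client inside 192.168.0.0/16 for the positive case and some host outside the trusted ranges for the negative one:
# From a trusted LAN client: expect status NOERROR plus an answer section
dig @192.168.234.128 example.com A | grep -E 'status:|ANSWER:'
# From a host outside the trusted ranges: expect status REFUSED
dig @192.168.234.128 example.com A | grep 'status:'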
1.3 Configure Query Logging
This command block prepares the filesystem for reliable telemetry capture. BIND writes logs as the bind service account, so directory ownership is not optional. If ownership is wrong, DNS might still resolve while analytics silently fail because no query lines are written. Treat this as data-pipeline readiness, not just Linux housekeeping.
sudo mkdir -p /var/log/named
sudo chown bind:bind /var/log/named && sudo chmod 755 /var/log/named
BIND9 runs as the bind user. A root-owned log directory fails silently: resolution keeps working, but no query line is ever written and no error is raised.
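A quick ownership audit catches this before it costs you data; a sketch using the path configured above:
stat -c '%U:%G %a %n' /var/log/named     # expect bind:bind 755
sudo -u bind touch /var/log/named/.writetest && \
  sudo rm /var/log/named/.writetest && echo "bind can write"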
This logging block inside /etc/bind/named.conf.local decides which DNS events become analyzable records. The key line is print-time yes, because your parser and time-based rollups depend on accurate timestamps. The category mapping intentionally sends query events to queries.log and suppresses noisy internals with null sinks. You are defining signal versus noise for every lesson that follows.
logging {
    channel queries_log {
        file "/var/log/named/queries.log" versions 7 size 500m;
        severity dynamic;
        print-time yes;        // CRITICAL: timestamp on every line
        print-severity no;
        print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log" versions 4 size 50m;
        severity info;
        print-time yes;
        print-severity yes;
        print-category yes;
    };
    category queries       { queries_log; };
    category query-errors  { queries_log; };
    category default       { named_log; };
    category general       { named_log; };
    category config        { named_log; };
    category network       { named_log; };
    category security      { named_log; };
    category lame-servers  { null; };
    category dnssec        { null; };
    category resolver      { null; };
    category cname         { null; };
    category xfer-in       { null; };
    category xfer-out      { null; };
    category notify        { null; };
    category client        { null; };
    category unmatched     { null; };
    category dispatch      { null; };
    category edns-disabled { null; };
    category rpz           { null; };
    category rate-limit    { null; };
};
1.4 Validate, Restart, Test
This sequence is ordered as a troubleshooting decision tree. First, named-checkconf validates syntax before risking a restart. Second, restarting applies your new policy and logging rules. Third, dig tests real query flow. Last, tail verifies telemetry emission. If any step fails, you know exactly where the break occurred: config parsing, service runtime, DNS path, or logging path.
sudo named-checkconf   # silence = success
sudo systemctl restart bind9
dig @192.168.234.128 google.com A
dig @192.168.234.128 youtube.com AAAA
sudo tail -20 /var/log/named/queries.log
This lesson teaches deterministic validation flow from syntax check to live telemetry confirmation.
You verify dig returns NOERROR and query lines appear in queries.log with timestamps.
BIND9 installed and logging every query to /var/log/named/queries.log.
PostgreSQL — Installation & Schema
Teaching Lens
This lesson teaches event modeling for scale: append-heavy raw storage plus pre-aggregated reporting tables.
- You verify raw tables preserve full fidelity for reprocessing.
- You verify daily, weekly, and monthly rollups reduce query cost.
- You verify indexes match real analytics access patterns.
- You verify category mapping translates domains into business meaning.
2.1 Install
This installation block creates the database runtime used by every later lesson. The package command installs the server and the common-extensions bundle; enabling PostgreSQL at boot ensures persistence after reboots; and the version query proves the server is reachable before any schema work begins. Treat this as a platform readiness gate, not just package setup.
sudo apt-get install -y postgresql postgresql-contrib
sudo systemctl enable postgresql
sudo -u postgres psql -c "SELECT version();"
This lesson teaches database runtime setup before schema creation and ingestion onboarding.
You verify version output is returned and PostgreSQL is enabled at boot.
2.2 Create Database and User
This SQL block establishes least-privilege access. You create a dedicated login role for the ingestor, then bind the analytics database ownership to that role so application writes stay scoped to the project. Avoid using the postgres superuser for daily ingestion because privilege boundaries are part of operational safety and incident containment.
CREATE USER dns_user WITH PASSWORD 'ChangeThisPassword!'
    NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN;
CREATE DATABASE dns_analytics OWNER dns_user
    ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8'
    TEMPLATE template0;
GRANT ALL PRIVILEGES ON DATABASE dns_analytics TO dns_user;
\q
2.3 Schema
This lesson teaches connection and ownership validation before schema creation. You verify that the application role can access the target database directly before any table definitions are applied.
psql -U dns_user -d dns_analytics -h localhost
This lesson teaches the full analytics data model from raw events to reporting summaries. You verify that partitioned ingestion, rollup tables, and category metadata work together as one coherent decision-support schema.
-- Raw query log, partitioned by day
CREATE TABLE dns_queries (
    id             BIGSERIAL,
    queried_at     TIMESTAMPTZ NOT NULL,
    client_ip      INET NOT NULL,
    client_port    INTEGER,
    domain         TEXT NOT NULL,
    apex_domain    TEXT NOT NULL,
    qtype          VARCHAR(10) NOT NULL,
    flags          TEXT,
    response_rcode TEXT,
    cached         BOOLEAN DEFAULT FALSE,
    inserted_at    TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (queried_at);

CREATE TABLE dns_queries_today PARTITION OF dns_queries
    FOR VALUES FROM (CURRENT_DATE) TO (CURRENT_DATE + INTERVAL '1 day');

-- Per-domain count per day
CREATE TABLE dns_daily_stats (
    stat_date      DATE NOT NULL,
    domain         TEXT NOT NULL,
    apex_domain    TEXT NOT NULL,
    qtype          VARCHAR(10) NOT NULL,
    hit_count      BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    nxdomain_count INTEGER NOT NULL DEFAULT 0,
    servfail_count INTEGER NOT NULL DEFAULT 0,
    first_seen     TIMESTAMPTZ,
    last_seen      TIMESTAMPTZ,
    PRIMARY KEY (stat_date, domain, qtype)
);

CREATE TABLE dns_weekly_stats (
    week_start     DATE NOT NULL,
    week_end       DATE NOT NULL,
    apex_domain    TEXT NOT NULL,
    total_queries  BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    peak_day       DATE,
    peak_day_count BIGINT,
    category       TEXT,
    PRIMARY KEY (week_start, apex_domain)
);

CREATE TABLE dns_monthly_stats (
    year_month     CHAR(7) NOT NULL,
    apex_domain    TEXT NOT NULL,
    total_queries  BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    avg_daily      NUMERIC(12,2),
    rank_in_month  INTEGER,
    category       TEXT,
    PRIMARY KEY (year_month, apex_domain)
);

CREATE TABLE dns_hourly_heatmap (
    stat_date      DATE NOT NULL,
    hour_of_day    SMALLINT NOT NULL CHECK (hour_of_day BETWEEN 0 AND 23),
    apex_domain    TEXT NOT NULL,
    query_count    BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (stat_date, hour_of_day, apex_domain)
);

CREATE TABLE domain_categories (
    apex_domain    TEXT PRIMARY KEY,
    category       TEXT NOT NULL,
    provider       TEXT,
    cache_priority SMALLINT DEFAULT 5,
    tagged_at      TIMESTAMPTZ DEFAULT now()
);

INSERT INTO domain_categories VALUES
('google.com','search','Google',3,now()),('googleapis.com','cdn','Google',2,now()),
('youtube.com','streaming','Google',1,now()),('googlevideo.com','streaming','Google',1,now()),
('netflix.com','streaming','Netflix',1,now()),('nflxvideo.net','streaming','Netflix',1,now()),
('nflximg.net','cdn','Netflix',2,now()),('akamai.net','cdn','Akamai',1,now()),
('akamaiedge.net','cdn','Akamai',1,now()),('cloudflare.com','cdn','Cloudflare',2,now()),
('facebook.com','social','Meta',4,now()),('instagram.com','social','Meta',4,now()),
('whatsapp.net','messaging','Meta',3,now()),('whatsapp.com','messaging','Meta',3,now()),
('twitter.com','social','X',4,now()),('twimg.com','cdn','X',2,now()),
('tiktok.com','social','TikTok',4,now()),('microsoft.com','work','Microsoft',5,now()),
('office.com','work','Microsoft',5,now()),('windowsupdate.com','updates','Microsoft',8,now()),
('ubuntu.com','updates','Canonical',8,now()),('amazonaws.com','cloud','AWS',5,now()),
('fastly.net','cdn','Fastly',2,now()),('cloudfront.net','cdn','AWS',2,now()),
('doubleclick.net','ads','Google',9,now()),('googlesyndication.com','ads','Google',9,now());
COMMIT;
This lesson teaches separation of raw event capture and summary analytics for scale and clarity.
You verify tables, constraints, and seed category inserts all succeed.
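One way to confirm the DDL landed as intended, sketched with psql one-liners (object names from the schema above):
# Expect the raw table, its partition, four summary tables, and domain_categories
psql -U dns_user -d dns_analytics -h localhost -c "\dt"
# Expect 26 seeded category rows
psql -U dns_user -d dns_analytics -h localhost -c "SELECT count(*) FROM domain_categories;"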
2.4 Indexes
This indexing block encodes your expected access paths. Time-descending indexes support recent-activity views, apex-domain indexes support ranking and filtering, and trigram search enables flexible domain text matching at scale. Indexes are not optional decoration; they are what keeps reporting latency predictable as data volume grows.
CREATE INDEX idx_q_time   ON dns_queries (queried_at DESC);
CREATE INDEX idx_q_apex   ON dns_queries (apex_domain, queried_at DESC);
CREATE INDEX idx_q_client ON dns_queries (client_ip, queried_at DESC);
CREATE INDEX idx_d_apex   ON dns_daily_stats (apex_domain, stat_date DESC);
CREATE INDEX idx_d_hits   ON dns_daily_stats (stat_date, hit_count DESC);
CREATE INDEX idx_w_domain ON dns_weekly_stats (apex_domain, week_start DESC);
CREATE INDEX idx_m_domain ON dns_monthly_stats (apex_domain, year_month DESC);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_d_trgm   ON dns_daily_stats USING gin (apex_domain gin_trgm_ops);
COMMIT;
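To confirm the planner can actually use these indexes for the intended access path, an EXPLAIN spot-check helps; a sketch (plans on near-empty tables may still choose sequential scans):
psql -U dns_user -d dns_analytics -h localhost -c \
  "EXPLAIN SELECT * FROM dns_daily_stats
    WHERE apex_domain = 'youtube.com'
    ORDER BY stat_date DESC LIMIT 7;"
# expect an Index Scan using idx_d_apex once rows exist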
Five-table schema built, indexed, 26 domain categories seeded.
The Log Ingestor — Python Daemon
Teaching Lens
This lesson teaches ingestion as a resilient ETL loop: read, parse, normalize, and write.
- You verify parser tolerance against noisy log variants.
- You verify batch settings balance latency and throughput.
- You verify systemd restart behavior after failures and reboots.
- You verify failures in order: file access, parse rate, then database auth.
3.1 Python Setup
This setup block creates a reproducible runtime for ingestion code. System packages install Python tooling, the dedicated directory isolates operational files, the virtual environment freezes dependency scope, and psycopg2-binary provides PostgreSQL connectivity. Reproducible environments reduce drift and make debugging repeatable across hosts.
sudo apt-get install -y python3 python3-pip python3-venv
sudo mkdir -p /opt/dns-ingestor && sudo chown $USER:$USER /opt/dns-ingestor
cd /opt/dns-ingestor && python3 -m venv venv
source venv/bin/activate && pip install psycopg2-binary
This lesson teaches dependency isolation through a dedicated Python virtual environment.
You verify the virtual environment works and psycopg2-binary is installed inside it.
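A one-line sanity check that the venv interpreter sees the driver:
/opt/dns-ingestor/venv/bin/python3 -c "import psycopg2; print(psycopg2.__version__)"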
3.2 The Script
This edit command opens the ingestor source file location used by systemd later. Keeping the script in a stable operational path means your service unit, logs, and maintenance procedures all point to one canonical implementation instead of ad-hoc copies.
nano /opt/dns-ingestor/ingestor.py
This Python block implements a resilient ETL loop. It parses resolver logs into structured fields, reconnects on transient database failures, batches inserts for throughput efficiency, and continuously tails rotating log files. Read it as a production loop: input normalization, controlled writes, observability, and graceful restart behavior.
#!/usr/bin/env python3
import re, time, logging, os, signal, sys
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values
DB = {'dbname':'dns_analytics','user':'dns_user',
'password':'ChangeThisPassword!','host':'127.0.0.1','port':5432}
LOG_FILE = '/var/log/named/queries.log'
BATCH_SIZE = 100
FLUSH_INTERVAL = 5
QUERY_RE = re.compile(
r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)
def apex(d):
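    # NOTE: naive two-label apex; multi-label public suffixes (e.g. bbc.co.uk -> co.uk) collapse incorrectly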
p = d.rstrip('.').split('.')
return '.'.join(p[-2:]) if len(p)>=2 else d
def parse_ts(s):
try: return datetime.strptime(s, '%d-%b-%Y %H:%M:%S')
    except ValueError: return datetime.utcnow()
def parse(line):
line = line.strip()
if not line: return None
m = QUERY_RE.match(line)
if not m: return None
d = m.groupdict()
dom = d.get('domain','').lower().rstrip('.')
if not dom: return None
flags = d.get('flags','').strip()
return {'queried_at':parse_ts(d['ts']),'client_ip':d.get('ip','0.0.0.0'),
'client_port':int(d.get('port',0)),'domain':dom,'apex_domain':apex(dom),
'qtype':d.get('qtype','A').upper(),'flags':flags,
'response_rcode':'NOERROR','cached':'CL' in flags}
def connect():
while True:
try:
c=psycopg2.connect(**DB); c.autocommit=False
logging.info('DB connected'); return c
except psycopg2.OperationalError as e:
logging.error(f'DB: {e}'); time.sleep(10)
def flush(conn, batch):
if not batch: return 0
sql = ('INSERT INTO dns_queries '
'(queried_at,client_ip,client_port,domain,apex_domain,'
'qtype,flags,response_rcode,cached) VALUES %s ON CONFLICT DO NOTHING')
rows = [(r['queried_at'],r['client_ip'],r['client_port'],r['domain'],
r['apex_domain'],r['qtype'],r['flags'],r['response_rcode'],r['cached'])
for r in batch]
try:
with conn.cursor() as cur: execute_values(cur,sql,rows,page_size=500)
conn.commit(); return len(batch)
except psycopg2.Error as e:
logging.error(f'Insert: {e}'); conn.rollback(); return 0
def run(conn):
batch,last_flush,total,inode,f = [],time.time(),0,0,None
logging.info(f'Watching {LOG_FILE}')
while True:
try: cur_inode=os.stat(LOG_FILE).st_ino
except FileNotFoundError: time.sleep(5); continue
        if cur_inode != inode:
            try:
                nf = open(LOG_FILE,'r')
                if f: f.close()      # rotated: read the new file from the start
                else: nf.seek(0,2)   # first attach: tail from the end
                f,inode = nf,cur_inode
            except IOError: time.sleep(5); continue
while True:
line = f.readline()
if not line: break
rec = parse(line)
if rec: batch.append(rec)
now = time.time()
if len(batch)>=BATCH_SIZE or (batch and now-last_flush>=FLUSH_INTERVAL):
n=flush(conn,batch); total+=n
logging.info(f'Flushed {n} | total {total:,}')
batch,last_flush=[],now
time.sleep(0.5)
if __name__=='__main__':
os.makedirs('/var/log/dns-ingestor',exist_ok=True)
logging.basicConfig(level=logging.INFO,
format='%(asctime)s [%(levelname)s] %(message)s',
handlers=[logging.StreamHandler(sys.stdout),
logging.FileHandler('/var/log/dns-ingestor/ingestor.log')])
def bye(s,f): sys.exit(0)
signal.signal(signal.SIGTERM,bye); signal.signal(signal.SIGINT,bye)
run(connect())
This lesson teaches resilient tail-and-batch ingestion with reconnect behavior for transient failures.
You verify repeated flush logs and increasing database row counts during query activity.
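A hedged growth check: run this twice, a minute apart, while clients are generating queries; the count must increase between runs.
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT count(*) AS total_rows, max(queried_at) AS newest FROM dns_queries;"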
3.3 Systemd Service
This unit file turns the script into a managed service; save it as /etc/systemd/system/dns-ingestor.service. Dependency ordering waits for network and database availability, restart policy recovers from failures, and fixed working/executable paths make startup deterministic. Systemd is the reliability contract that keeps ingestion alive without manual supervision.
[Unit]
Description=DNS Query Log Ingestor -- code::core
After=network.target postgresql.service bind9.service
Requires=postgresql.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/dns-ingestor
ExecStart=/opt/dns-ingestor/venv/bin/python3 /opt/dns-ingestor/ingestor.py
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
These control commands register the new unit, start it immediately, and open a live log stream for first-run validation. Running them in order ensures the service definition is reloaded before activation, which prevents stale unit metadata from hiding configuration mistakes.
sudo systemctl daemon-reload
sudo systemctl enable --now dns-ingestor
sudo journalctl -u dns-ingestor -f
This lesson teaches service activation workflow from unit reload to live log observation.
You verify dns-ingestor is active and not entering a crash loop.
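Three hedged liveness probes; the NRestarts property assumes systemd v235 or later (Ubuntu 24.04 qualifies):
systemctl is-active dns-ingestor                 # expect: active
systemctl show dns-ingestor -p NRestarts         # expect: NRestarts=0, stable across checks
sudo journalctl -u dns-ingestor --since "-5 min" | grep -ic traceback   # expect: 0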
3.4 End-to-End Test
This test block deliberately generates DNS traffic, waits for the batch flush window, then inspects the newest stored rows. It validates that resolver output, parser logic, and database writes are functioning as one pipeline. When this check passes, your telemetry path is proven, not assumed.
for d in google.com youtube.com netflix.com instagram.com microsoft.com; do
dig @192.168.234.128 $d A +short
done
sleep 6
psql -U dns_user -d dns_analytics -h localhost \
-c "SELECT domain,qtype,queried_at FROM dns_queries ORDER BY queried_at DESC LIMIT 10;"BIND9 → Python → PostgreSQL. Every LAN DNS query is now captured and stored.
Rollup Jobs — Daily, Weekly & Monthly
All scripts use ON CONFLICT DO UPDATE — idempotent, safe to rerun at any time.
Teaching Lens
This lesson teaches rollups as a performance strategy: pre-compute stable summaries instead of repeatedly scanning raw volume.
- You verify daily rollups support operational reporting.
- You verify weekly rollups support trend and planning analysis.
- You verify monthly rollups support forecasting and executive views.
- You verify idempotent SQL enables safe rerun-based recovery.
This command creates a dedicated directory for rollup SQL artifacts. Keeping scheduled SQL in a single controlled location improves auditability and makes cron references explicit and maintainable.
sudo mkdir -p /opt/dns-ingestor/sql
4.1 Daily Rollup
This daily rollup computes yesterday's operational summary and hourly distribution from raw events. The time-zone conversion anchors reporting to local business time, and conflict updates make reruns safe by replacing prior values for the same key window. This is your primary daily telemetry compression stage.
-- Run at 01:00 every morning
INSERT INTO dns_daily_stats
(stat_date,domain,apex_domain,qtype,hit_count,unique_clients,
nxdomain_count,servfail_count,first_seen,last_seen)
SELECT
DATE(queried_at AT TIME ZONE 'Africa/Kampala') AS stat_date,
domain, apex_domain, qtype,
COUNT(*), COUNT(DISTINCT client_ip),
COUNT(*) FILTER (WHERE response_rcode='NXDOMAIN'),
COUNT(*) FILTER (WHERE response_rcode='SERVFAIL'),
MIN(queried_at), MAX(queried_at)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
AND queried_at < CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,domain,apex_domain,qtype
ON CONFLICT (stat_date,domain,qtype) DO UPDATE SET
hit_count=EXCLUDED.hit_count, unique_clients=EXCLUDED.unique_clients,
nxdomain_count=EXCLUDED.nxdomain_count, servfail_count=EXCLUDED.servfail_count,
first_seen=EXCLUDED.first_seen, last_seen=EXCLUDED.last_seen;
INSERT INTO dns_hourly_heatmap (stat_date,hour_of_day,apex_domain,query_count)
SELECT DATE(queried_at AT TIME ZONE 'Africa/Kampala'),
EXTRACT(HOUR FROM queried_at AT TIME ZONE 'Africa/Kampala')::SMALLINT,
apex_domain, COUNT(*)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
AND queried_at < CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,2,3
ON CONFLICT (stat_date,hour_of_day,apex_domain) DO UPDATE SET query_count=EXCLUDED.query_count;
COMMIT;
This lesson teaches idempotent daily rollups for stable metrics and hourly demand summaries.
You verify rerunning the job updates the same window without duplicate counts.
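A sketch of the rerun proof: snapshot the window total, rerun the job, snapshot again; the two numbers must match. The file path assumes you saved the SQL above as rollup_daily.sql.
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT sum(hit_count) FROM dns_daily_stats WHERE stat_date = CURRENT_DATE - 1;"
psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_daily.sql
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT sum(hit_count) FROM dns_daily_stats WHERE stat_date = CURRENT_DATE - 1;"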
4.2 Weekly Rollup
This weekly rollup aggregates daily records into planning-level metrics. It computes total demand, peak day, and optional category context per apex domain, which is useful for capacity and peering decisions. The CTE structure separates total calculations from peak detection so each step remains understandable and testable.
-- Run Monday 02:00
INSERT INTO dns_weekly_stats
(week_start,week_end,apex_domain,total_queries,unique_clients,peak_day,peak_day_count,category)
WITH totals AS (
SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start,
(DATE_TRUNC('week',stat_date)+INTERVAL '6 days')::DATE AS week_end,
apex_domain, SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients
FROM dns_daily_stats
WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
AND stat_date < DATE_TRUNC('week',CURRENT_DATE)::DATE
GROUP BY 1,2,3
), peaks AS (
SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start, apex_domain,
stat_date AS peak_day, SUM(hit_count) AS day_total,
RANK() OVER (PARTITION BY DATE_TRUNC('week',stat_date),apex_domain
ORDER BY SUM(hit_count) DESC) AS rnk
FROM dns_daily_stats
WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
AND stat_date < DATE_TRUNC('week',CURRENT_DATE)::DATE
GROUP BY 1,2,3
)
SELECT t.week_start,t.week_end,t.apex_domain,t.total_queries,t.unique_clients,
p.peak_day,p.day_total,dc.category
FROM totals t
LEFT JOIN peaks p ON p.week_start=t.week_start AND p.apex_domain=t.apex_domain AND p.rnk=1
LEFT JOIN domain_categories dc ON dc.apex_domain=t.apex_domain
ON CONFLICT (week_start,apex_domain) DO UPDATE SET
total_queries=EXCLUDED.total_queries,unique_clients=EXCLUDED.unique_clients,
peak_day=EXCLUDED.peak_day,peak_day_count=EXCLUDED.peak_day_count;
COMMIT;
4.3 Monthly Rollup
This monthly rollup produces executive-scale trends with totals, average daily volume, and rank ordering. Because it reads from daily summaries instead of raw logs, it stays efficient while preserving consistent semantics. Ranking within month helps highlight dominant domains and shifting traffic priorities over time.
-- Run 1st of month 03:00
INSERT INTO dns_monthly_stats (year_month,apex_domain,total_queries,unique_clients,avg_daily,rank_in_month,category)
WITH m AS (
SELECT TO_CHAR(stat_date,'YYYY-MM') AS year_month, apex_domain,
SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients,
ROUND(AVG(hit_count),2) AS avg_daily
FROM dns_daily_stats
WHERE TO_CHAR(stat_date,'YYYY-MM')=TO_CHAR(CURRENT_DATE-INTERVAL '1 month','YYYY-MM')
GROUP BY 1,2
)
SELECT m.year_month,m.apex_domain,m.total_queries,m.unique_clients,m.avg_daily,
RANK() OVER (PARTITION BY m.year_month ORDER BY m.total_queries DESC), dc.category
FROM m LEFT JOIN domain_categories dc ON dc.apex_domain=m.apex_domain
ON CONFLICT (year_month,apex_domain) DO UPDATE SET
total_queries=EXCLUDED.total_queries,avg_daily=EXCLUDED.avg_daily,rank_in_month=EXCLUDED.rank_in_month;
COMMIT;
4.4 Cron Schedule
This cron block operationalizes the rollup lifecycle. It schedules daily, weekly, and monthly jobs with log capture for post-run diagnostics, and pre-creates tomorrow's partition to avoid insert failures at day boundaries. Scheduling converts analytics from manual scripts into dependable operations.
# DNS Analytics Rollups
0 1 * * * psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_daily.sql >> /var/log/dns-ingestor/daily.log 2>&1
0 2 * * 1 psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_weekly.sql >> /var/log/dns-ingestor/weekly.log 2>&1
0 3 1 * * psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_monthly.sql >> /var/log/dns-ingestor/monthly.log 2>&1
# Create tomorrow's partition at 23:30
30 23 * * * psql -U dns_user -d dns_analytics -h localhost -c "CREATE TABLE IF NOT EXISTS dns_queries_$(date -d tomorrow +\%Y\%m\%d) PARTITION OF dns_queries FOR VALUES FROM (CURRENT_DATE+INTERVAL '1 day') TO (CURRENT_DATE+INTERVAL '2 days');" >> /var/log/dns-ingestor/partition.log 2>&1
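Before trusting the 23:30 job, you can list the partitions and their bounds straight from the catalog; a sketch:
psql -U dns_user -d dns_analytics -h localhost -c \
  "SELECT c.relname, pg_get_expr(c.relpartbound, c.oid) AS bounds
     FROM pg_inherits i
     JOIN pg_class c ON c.oid = i.inhrelid
    WHERE i.inhparent = 'dns_queries'::regclass;"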
SQL Analytics — Reading Your Traffic
Teaching Lens
This lesson teaches SQL as decision logic for peering, caching, abuse detection, and customer experience.
- You verify top domains to identify concentration risk.
- You verify category mix to understand bandwidth destinations.
- You verify hourly heatmaps to plan cache warm-up windows.
- You verify NXDOMAIN rates to detect misconfigurations or bot noise.
5.1 Top 50 Domains (7-Day)
This query measures concentration: which apex domains consume most resolver capacity over the last week. Percentage and average-per-day columns turn raw counts into comparative signals you can use for cache policy, upstream negotiation, and anomaly triage.
SELECT ds.apex_domain, dc.category, dc.provider, dc.cache_priority,
       SUM(ds.hit_count) AS total_queries,
       ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct,
       ROUND(AVG(ds.hit_count),0) AS avg_per_day
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain = ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY ds.apex_domain, dc.category, dc.provider, dc.cache_priority
ORDER BY total_queries DESC
LIMIT 50;
This lesson teaches concentration analysis to identify domains that dominate resolver workload.
You verify percentages are coherent and leaders match expected user behavior.
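A coherence sketch: computed over all domains rather than just the top 50, the pct column should sum to roughly 100.
psql -U dns_user -d dns_analytics -h localhost -c \
  "SELECT ROUND(SUM(pct),1) AS pct_total FROM (
     SELECT SUM(hit_count)*100.0/SUM(SUM(hit_count)) OVER () AS pct
       FROM dns_daily_stats
      WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
      GROUP BY apex_domain) t;"
# expect ~100.0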
5.2 By Category
This category query collapses domain-level noise into service classes such as streaming, social, and work. It helps learners see how traffic mix reflects subscriber behavior and where optimization effort should be focused at a service-class level rather than individual hostnames.
SELECT COALESCE(dc.category,'uncategorised') AS category,
       COUNT(DISTINCT ds.apex_domain) AS domains,
       SUM(ds.hit_count) AS total_queries,
       ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain = ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY total_queries DESC;
5.3 24-Hour Heatmap
This heatmap query reveals diurnal demand patterns by hour. The generated bar column is a quick visual proxy for load intensity, useful for identifying cache warm-up windows, maintenance windows, and periods where resolver stress is highest.
SELECT LPAD(h.hour_of_day::TEXT,2,'0')||':00' AS hour,
       SUM(h.query_count) AS queries,
       REPEAT('█',(SUM(h.query_count)/NULLIF(MAX(SUM(h.query_count)) OVER()/40,0))::INT) AS bar
FROM dns_hourly_heatmap h
WHERE h.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY h.hour_of_day
ORDER BY h.hour_of_day;
5.4 NXDOMAIN Analysis
This failure-focused query identifies domains with significant NXDOMAIN rates, a common signal for typos, stale configs, blocked telemetry, or bot behavior. Ranking by failure percentage helps prioritize remediation where user impact or wasted resolver effort is highest.
SELECT apex_domain,
       SUM(hit_count) AS total,
       SUM(nxdomain_count) AS failures,
       ROUND(SUM(nxdomain_count)*100.0/NULLIF(SUM(hit_count),0),1) AS fail_pct
FROM dns_daily_stats
WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
  AND nxdomain_count > 0
GROUP BY apex_domain
HAVING SUM(nxdomain_count) > 10
ORDER BY fail_pct DESC
LIMIT 30;
Scaling to 2 Million Subscribers
2M subs = 600M queries/day = 7,000 qps sustained. Text log files break at this scale. The data model stays identical; the transport layer changes.
Teaching Lens
This lesson teaches scale migration without changing metric semantics, so reports stay comparable from lab to production.
- You verify transport migration from text logs to binary event streams.
- You verify ingest migration from single process to distributed workers.
- You verify database migration to time-series optimization patterns.
- You verify privacy controls before volume growth increases risk.
6.1 Production Architecture
| Layer | VM (this guide) | Production (2M subs) |
|---|---|---|
| DNS daemon | BIND9 text log | Unbound x8 + dnstap |
| Transport | File on disk | Kafka 3-broker cluster |
| Ingestor | 1 Python process | Faust workers x8 |
| Database | PostgreSQL 16 | TimescaleDB on NVMe RAID |
| Volume | ~50K/day | 600M/day (7K qps) |
6.2 TimescaleDB
This command block installs time-series database capabilities for high-volume retention and query efficiency. Tuning adjusts PostgreSQL settings for Timescale workloads, and restarting applies those changes. This is the foundation for scaling from lab-sized daily volumes to sustained production ingestion.
sudo apt-get install -y timescaledb-2-postgresql-16
sudo timescaledb-tune --quiet --yes && sudo systemctl restart postgresql
This SQL block enables hypertable behavior, compression, and continuous aggregates while keeping your logical schema intact. It changes storage and refresh mechanics, not metric meaning, so dashboards and lessons remain comparable before and after scale transition.
CREATE EXTENSION IF NOT EXISTS timescaledb;

SELECT create_hypertable('dns_queries','queried_at',
    chunk_time_interval => INTERVAL '1 day', if_not_exists => TRUE);

ALTER TABLE dns_queries SET (timescaledb.compress,
    timescaledb.compress_orderby = 'queried_at DESC',
    timescaledb.compress_segmentby = 'apex_domain');

SELECT add_compression_policy('dns_queries', INTERVAL '7 days');

CREATE MATERIALIZED VIEW dns_daily_live WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', queried_at) AS bucket,
       apex_domain, qtype,
       COUNT(*) AS hit_count,
       COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries
GROUP BY bucket, apex_domain, qtype;

SELECT add_continuous_aggregate_policy('dns_daily_live',
    start_offset => INTERVAL '3 days', end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
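Two hedged verification queries against the TimescaleDB informational views available in the 2.x series:
# Chunks created for the hypertable
psql -U dns_user -d dns_analytics -h localhost -c \
  "SELECT hypertable_name, count(*) AS chunks
     FROM timescaledb_information.chunks GROUP BY 1;"
# Background jobs: compression policy and continuous-aggregate refresh
psql -U dns_user -d dns_analytics -h localhost -c \
  "SELECT job_id, application_name FROM timescaledb_information.jobs;"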
6.3 dnstap
This configuration switches resolver event export from text logs to structured binary messages. At large scale, dnstap reduces parsing overhead and improves event fidelity for downstream stream processors. Enabling both query and response messages preserves context needed for richer analytics and troubleshooting.
dnstap:
dnstap-enable: yes
dnstap-socket-path: "/var/run/unbound/dnstap.sock"
dnstap-log-client-query-messages: yes
dnstap-log-client-response-messages: yes
Store only the /24 subnet, not the full /32 client IP. Aggregated domain counts contain no PII.
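A sketch of the /24 rule inside PostgreSQL using the built-in inet functions; in production you would anonymize at ingest time rather than rewriting a large table in place:
# Truncate stored IPv4 client addresses to their /24 network (irreversible)
psql -U dns_user -d dns_analytics -h localhost -c \
  "UPDATE dns_queries
      SET client_ip = network(set_masklen(client_ip, 24))::inet
    WHERE family(client_ip) = 4;"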
MikroTik Home Network Lesson (10.10.10.0/24)
This lesson adapts the platform to a MikroTik home lab where the router IP is 10.10.10.10 and local DNS namespace is codeandcore.home.
Teaching Lens
This lesson teaches hybrid DNS operation: authoritative local naming plus recursive analytics in one pipeline.
- You verify MikroTik distributes DNS and search domain via DHCP.
- You verify BIND serves codeandcore.home authoritatively.
- You verify zone records are mirrored into PostgreSQL for governance.
- You verify local queries are analyzed alongside internet traffic.
7.1 Configure MikroTik DHCP + DNS Domain
This RouterOS block is the client-distribution layer of your naming system. DHCP options push DNS server and search domain to every device, so clients learn to ask 10.10.10.10 and append codeandcore.home automatically. Without this, your local zone can exist correctly on BIND but appear "broken" to users because clients never query it correctly.
/ip dhcp-server network set [find address="10.10.10.0/24"] \
gateway=10.10.10.10 dns-server=10.10.10.10 domain=codeandcore.home
/ip dns set allow-remote-requests=yes servers=10.10.10.10
/ip dns static add name=router.codeandcore.home address=10.10.10.10 ttl=1d
This lesson teaches client-side DNS distribution through DHCP domain and resolver settings.
You verify renewed clients receive DNS 10.10.10.10 and suffix codeandcore.home.
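From a Linux client on the LAN after a lease renew, a hedged spot-check (assumes systemd-resolved; other resolver stacks differ):
resolvectl status | grep -E 'DNS Servers|DNS Domain'   # expect 10.10.10.10 and codeandcore.home
ping -c1 nas    # short name should expand to nas.codeandcore.home -> 10.10.10.20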
7.2 Create Authoritative Zone: codeandcore.home
This block in /etc/bind/named.conf.local is the zone declaration, which tells BIND "I am authoritative for this namespace." type master means this server is the source of truth, and the file path points to the zone database containing records. allow-update { none; } disables dynamic updates so records only change through deliberate admin edits.
zone "codeandcore.home" {
type master;
file "/etc/bind/db.codeandcore.home";
allow-update { none; };
};
This zone file block is the authoritative DNS database itself. The SOA record sets lifecycle rules for secondaries and cache behavior, while NS and A records map stable names to device IPs in your 10.10.10.0/24 lab. Think of this as your local naming contract: every hostname your users depend on should be explicitly represented here.
$TTL 86400
@ IN SOA ns1.codeandcore.home. admin.codeandcore.home. (
2026051201 ; serial
3600 ; refresh
1800 ; retry
1209600 ; expire
86400 ) ; minimum
@ IN NS ns1.codeandcore.home.
ns1 IN A 10.10.10.10
router IN A 10.10.10.10
dns IN A 10.10.10.10
nas IN A 10.10.10.20
cam1 IN A 10.10.10.30
printer IN A 10.10.10.40
This lesson teaches authoritative local namespace design with deterministic host mappings.
You verify local records resolve with NOERROR responses from 10.10.10.10.
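A loop over the zone's hosts makes the NOERROR check systematic; a sketch:
for h in ns1 router dns nas cam1 printer; do
  printf '%-8s ' "$h"
  dig @10.10.10.10 +short "$h.codeandcore.home" A
done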
7.3 Store Zone Records in PostgreSQL
This SQL block mirrors DNS inventory into the analytics database so operations can audit and report on intended state versus observed query behavior. The unique constraint prevents duplicate logical records, and indexes support fast lookups by zone and record type. This turns static DNS files into queryable governance data for lessons and reviews.
CREATE TABLE IF NOT EXISTS dns_zone_records (
id BIGSERIAL PRIMARY KEY,
zone_name TEXT NOT NULL,
fqdn TEXT NOT NULL,
record_type VARCHAR(10) NOT NULL,
record_value TEXT NOT NULL,
ttl INTEGER NOT NULL DEFAULT 86400,
source TEXT NOT NULL DEFAULT 'bind',
active BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (zone_name, fqdn, record_type, record_value)
);
CREATE INDEX IF NOT EXISTS idx_zone_fqdn ON dns_zone_records (zone_name, fqdn);
CREATE INDEX IF NOT EXISTS idx_zone_type ON dns_zone_records (zone_name, record_type);
INSERT INTO dns_zone_records (zone_name, fqdn, record_type, record_value, ttl) VALUES
('codeandcore.home','ns1.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','router.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','dns.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','nas.codeandcore.home','A','10.10.10.20',86400),
('codeandcore.home','cam1.codeandcore.home','A','10.10.10.30',86400),
('codeandcore.home','printer.codeandcore.home','A','10.10.10.40',86400)
ON CONFLICT DO NOTHING;
COMMIT;
This lesson teaches treating DNS inventory as queryable governance data.
You verify seeded zone records are returned from dns_zone_records for codeandcore.home.
7.4 Analytics for Local Zone + Full Traffic
These queries teach two different analytical questions. The first measures demand for local names inside codeandcore.home, which helps validate adoption of your local namespace. The second compares local traffic against internet-bound lookups, giving you an immediate ratio of internal service usage versus external dependency.
-- Local zone traffic (last 7 days)
SELECT domain, qtype, COUNT(*) AS hits,
       COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries
WHERE domain LIKE '%.codeandcore.home'
  AND queried_at >= NOW() - INTERVAL '7 days'
GROUP BY domain, qtype
ORDER BY hits DESC;

-- Compare local-zone vs internet queries
SELECT CASE WHEN domain LIKE '%.codeandcore.home'
            THEN 'local_zone' ELSE 'internet' END AS traffic_class,
       COUNT(*) AS total_queries,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct
FROM dns_queries
WHERE queried_at >= NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY total_queries DESC;
This lesson teaches local-versus-internet query mix analysis for the home lab zone.
You verify local hosts such as nas.codeandcore.home appear in recent query outputs.
MikroTik clients use 10.10.10.10, local zone codeandcore.home resolves authoritatively, and zone records plus analytics are stored in PostgreSQL.
Firewall (UFW)
Teaching Lens
This lesson teaches security-by-segmentation: expose only what each service requires.
- You verify DNS is reachable only from trusted private ranges.
- You verify PostgreSQL remains local unless intentionally tunneled.
- You verify default-deny inbound posture is preserved.
- You verify firewall policy matches resolver and analytics design.
This firewall rule set enforces minimum-exposure networking. It allows DNS only from private address spaces used by your clients, keeps PostgreSQL loopback-only, and denies unsolicited inbound traffic by default. The goal is a resolver that serves your network without becoming an internet-facing attack surface.
sudo ufw enable && sudo ufw allow ssh
sudo ufw allow from 192.168.0.0/16 to any port 53 proto udp
sudo ufw allow from 192.168.0.0/16 to any port 53 proto tcp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto udp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto tcp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto udp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto tcp
sudo ufw allow from 127.0.0.1 to any port 5432 proto tcp
sudo ufw default deny incoming && sudo ufw default allow outgoing
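After applying the rules, one status read confirms the posture:
sudo ufw status verbose   # expect: deny (incoming), 53 limited to private ranges, 5432 loopback-only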
Troubleshooting
Teaching Lens
This lesson teaches path-based troubleshooting from resolver output to final reports.
- You verify resolver output before checking ingestion.
- You verify parser and service health before database assumptions.
- You verify database writes before rollup and report checks.
- You verify each layer in order to reduce diagnosis time.
| Symptom | Check | Command |
|---|---|---|
| BIND9 won't start | Config syntax | sudo named-checkconf |
| No queries.log | Log dir ownership | ls -la /var/log/named/ |
| dig returns REFUSED | Client not in ACL | sudo journalctl -u bind9 -n 20 |
| Ingestor not inserting | Wrong DB password | sudo journalctl -u dns-ingestor -n 40 |
| Partition error | Partition absent | Create partition manually |
| psql auth fail | pg_hba.conf peer | Add md5 line below |
| Empty daily_stats | Rollup not run yet | Run rollup_daily.sql manually |
This repair snippet addresses a common local-auth mismatch where PostgreSQL expects peer authentication instead of password-based login for your application role. Adding the loopback md5 rule and reloading PostgreSQL aligns server auth with your ingestor connection method without a full database restart.
sudo nano /etc/postgresql/16/main/pg_hba.conf
# Add ABOVE existing local lines:
# host dns_analytics dns_user 127.0.0.1/32 md5
sudo systemctl reload postgresql
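The proof that auth is aligned is a password login over loopback; a sketch:
psql -U dns_user -h 127.0.0.1 -d dns_analytics -c "SELECT 1;"   # prompts for the password, then returns 1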
BIND9 capturing. Python ingesting. PostgreSQL storing. Cron aggregating. Run the Chapter 5 queries after 7 days and you have everything to design caching, CDN peering, and BGP strategy.