code::core · Technical Book

DNS Analytics
from Zero to 2M

Complete step-by-step guide to building a DNS query analytics server on Ubuntu 24.04. BIND9 captures every LAN query, Python streams them into PostgreSQL, and daily, weekly, and monthly rollups give you the traffic intelligence to design caching and peering for an ISP at scale.

Server IP: 192.168.234.128
OS: Ubuntu 24.04 LTS
DNS: BIND9
Database: PostgreSQL 16
Networks: All RFC1918
Overview

How Everything Connects

Teaching Lens

Think in layers, not commands. This system is a pipeline where each layer has one job: resolve DNS, capture events, store durable records, then aggregate for decisions.

  • Resolver layer answers DNS and emits raw events.
  • Ingest layer transforms text logs into structured rows.
  • Analytics layer converts high-volume raw data into low-cost summaries.
  • If output looks wrong, debug from left to right through the pipeline.
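The left-to-right debugging rule can be sketched as a tiny probe runner. The layer names and lambda probes below are hypothetical stand-ins for real checks (a dig query, log-file growth, a row count):

```python
# Hypothetical layer probes, checked in pipeline order (left to right).
def first_failing_layer(probes):
    """Return the name of the first failing layer, or None if all pass."""
    for layer, probe in probes:
        if not probe():
            return layer
    return None

# Example state: resolver answers, but the ingest layer has stalled.
probes = [
    ("resolver", lambda: True),   # e.g. dig returns NOERROR
    ("ingest",   lambda: False),  # e.g. queries.log stopped growing
    ("storage",  lambda: True),   # e.g. dns_queries row count rising
]
```

Whatever the real probes are, the point stands: stop at the first failing layer and fix it before looking further right.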
Pipeline at a glance:

  • Query Resolution: LAN Client (any machine) → BIND9 (192.168.234.128:53) → Upstream (8.8.8.8 / 1.1.1.1)
  • Logging & Ingest: queries.log (/var/log/named/) → Python Daemon (dns-ingestor.service) → PostgreSQL (dns_analytics DB)
  • Analytics Path: dns_queries (raw table) → Cron Rollups (daily/weekly/monthly) → Stats Tables (analytics ready)

  • BIND9 (Resolve & Log): Recursive resolver. Every query written to a structured log file.
  • Python (Parse & Insert): Tails the log in real time. Bulk-inserts into PostgreSQL.
  • PostgreSQL (Store & Aggregate): Raw queries partitioned by day. Nightly cron builds summaries.
  • Outcome (Traffic Intelligence): Top domains, CDN ratios, peak hours for network design.
Chapter 01

Installing & Configuring BIND9

BIND9 acts as a recursive caching resolver for your entire LAN. Every query is logged to the file that feeds our analytics pipeline.

Teaching Lens

The goal is not just to "install DNS". You are building a trustworthy data source. If resolver behavior is unstable, every downstream metric is noisy.

  • ACL controls who can recurse through your resolver.
  • Forwarders control upstream behavior and latency.
  • Logging format and timestamp quality decide parser reliability.
  • Validation order: syntax, service state, query test, log line test.

1.1 Install BIND9

1

Update package index

bash
sudo apt-get update

Why this command Refreshes apt metadata so installs resolve against current package lists rather than stale repository references.

What to verify No 404 repository errors and exit code 0 before proceeding to installation.

2

Install BIND9 and tools

bash
sudo apt-get install -y bind9 bind9utils bind9-doc dnsutils

Why this command Installs resolver daemon plus diagnostic tools so you can both serve DNS and test DNS from the same host.

What to verify `named`, `dig`, and `named-checkconf` are available on PATH after install.

3

Verify and enable on boot

bash
sudo systemctl status bind9 && sudo systemctl enable bind9
Expected output: Active: active (running)

1.2 Configure named.conf.options

Back up first

sudo cp /etc/bind/named.conf.options /etc/bind/named.conf.options.bak

named.conf
/etc/bind/named.conf.options — delete all, paste this
// DNS Analytics Server -- 192.168.234.128
acl "trusted" {
    127.0.0.1;
    192.168.0.0/16;
    10.0.0.0/8;
    172.16.0.0/12;
};
options {
    directory "/var/cache/bind";
    listen-on { any; }; listen-on-v6 { any; };
    allow-query { trusted; }; allow-recursion { trusted; };
    forwarders { 8.8.8.8; 8.8.4.4; 1.1.1.1; 9.9.9.9; };
    forward only;
    dnssec-validation auto;
    allow-transfer { none; }; notify no; recursion yes;
    max-cache-size 256m; min-cache-ttl 60; max-cache-ttl 86400;
    rate-limit { responses-per-second 50; window 5; };
    pid-file "/run/named/named.pid";
};

Why this config Defines trust boundaries (ACL), recursion policy, upstream forwarding, and cache behavior that shape both security and performance.

What to verify Only trusted networks can recurse and external hosts receive refusal instead of open resolver behavior.
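As a quick sanity model of the ACL logic, this sketch mirrors the same trusted ranges with Python's ipaddress module (the `is_trusted` helper is our name, not anything BIND provides):

```python
import ipaddress

# Mirrors the "trusted" ACL in named.conf.options.
TRUSTED = [ipaddress.ip_network(n) for n in
           ("127.0.0.1/32", "192.168.0.0/16", "10.0.0.0/8", "172.16.0.0/12")]

def is_trusted(ip: str) -> bool:
    """True if the client IP falls inside any trusted range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED)
```

Note that 172.16.0.0/12 spans 172.16.x.x through 172.31.x.x, a common source of off-by-one surprises when testing ACL membership.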

1.3 Configure Query Logging

bash
sudo mkdir -p /var/log/named
sudo chown bind:bind /var/log/named && sudo chmod 755 /var/log/named
Why chown bind:bind

BIND9 runs as the bind user. If the log directory is owned by root, named cannot write queries.log and fails silently, producing no error message.

named.conf
/etc/bind/named.conf.local — append this block
logging {
    channel queries_log {
        file "/var/log/named/queries.log"
            versions 7 size 500m;
        severity dynamic;
        print-time yes; // CRITICAL: timestamp on every line
        print-severity no; print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log"
            versions 4 size 50m;
        severity info; print-time yes; print-severity yes; print-category yes;
    };
    category queries       { queries_log; };
    category query-errors  { queries_log; };
    category default       { named_log; };
    category general       { named_log; };
    category config        { named_log; };
    category network       { named_log; };
    category security      { named_log; };
    category lame-servers  { null; };
    category dnssec        { null; };
    category resolver      { null; };
    category cname         { null; };
    category xfer-in       { null; };
    category xfer-out      { null; };
    category notify        { null; };
    category client        { null; };
    category unmatched     { null; };
    category dispatch      { null; };
    category edns-disabled { null; };
    category rpz           { null; };
    category rate-limit    { null; };
};

1.4 Validate, Restart, Test

bash
sudo named-checkconf  # silence = success
sudo systemctl restart bind9
dig @192.168.234.128 google.com A
dig @192.168.234.128 youtube.com AAAA
sudo tail -20 /var/log/named/queries.log

Why this sequence It validates syntax, applies config, verifies DNS function, then verifies telemetry emission in one deterministic flow.

What to verify `dig` answers with status NOERROR and each query appears in `queries.log` with timestamp.

Expected log format: 11-Jan-2026 08:15:33.421 client @0xabcd 192.168.234.1#45231 (google.com): query: google.com IN A + (192.168.234.128)
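This line format is exactly what Chapter 3's parser depends on. A minimal sketch of matching it, using a simplified cut of the ingestor regex against the sample line above:

```python
import re

# Simplified version of the Chapter 3 ingestor regex (IPv4-only here).
LOG_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'   # timestamp (print-time yes)
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d.]+)#(?P<port>\d+)'  # client addr#port
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
)

line = ('11-Jan-2026 08:15:33.421 client @0xabcd 192.168.234.1#45231 '
        '(google.com): query: google.com IN A + (192.168.234.128)')
m = LOG_RE.match(line)
```

If print-time were off, the timestamp group would fail and every line would be dropped, which is why the config comment calls it CRITICAL.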
Chapter 1 Complete

BIND9 installed and logging every query to /var/log/named/queries.log.

Chapter 02

PostgreSQL — Installation & Schema

Teaching Lens

You are modeling events over time. Keep raw data append-heavy, and keep reporting data pre-aggregated so analytics queries stay fast under growth.

  • Raw table keeps full fidelity for reprocessing.
  • Daily, weekly, monthly tables reduce compute cost.
  • Indexes are aligned with your real query patterns.
  • Categories table turns technical domains into business language.

2.1 Install

bash
sudo apt-get install -y postgresql postgresql-contrib
sudo systemctl enable postgresql
sudo -u postgres psql -c "SELECT version();"

Why this command Installs core PostgreSQL plus common extensions package, then confirms the database engine is healthy before schema work.

What to verify Version query returns one row and service is enabled for reboot persistence.

2.2 Create Database and User

bash
sudo -u postgres psql
sql
CREATE USER dns_user WITH PASSWORD 'ChangeThisPassword!'
    NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN;
CREATE DATABASE dns_analytics OWNER dns_user
    ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;
GRANT ALL PRIVILEGES ON DATABASE dns_analytics TO dns_user;
\q

2.3 Schema

bash
psql -U dns_user -d dns_analytics -h localhost
sql
full schema
-- Raw query log, partitioned by day
CREATE TABLE dns_queries (
    id BIGSERIAL, queried_at TIMESTAMPTZ NOT NULL,
    client_ip INET NOT NULL, client_port INTEGER,
    domain TEXT NOT NULL, apex_domain TEXT NOT NULL,
    qtype VARCHAR(10) NOT NULL, flags TEXT,
    response_rcode TEXT, cached BOOLEAN DEFAULT FALSE,
    inserted_at TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (queried_at);

-- Bounds are evaluated once, at creation time: this partition covers only
-- the creation day. The cron job in 4.4 creates each following day's partition.
CREATE TABLE dns_queries_today PARTITION OF dns_queries
    FOR VALUES FROM (CURRENT_DATE) TO (CURRENT_DATE + INTERVAL '1 day');

-- Per-domain count per day
CREATE TABLE dns_daily_stats (
    stat_date DATE NOT NULL, domain TEXT NOT NULL,
    apex_domain TEXT NOT NULL, qtype VARCHAR(10) NOT NULL,
    hit_count BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    nxdomain_count INTEGER NOT NULL DEFAULT 0,
    servfail_count INTEGER NOT NULL DEFAULT 0,
    first_seen TIMESTAMPTZ, last_seen TIMESTAMPTZ,
    PRIMARY KEY (stat_date, domain, qtype)
);

CREATE TABLE dns_weekly_stats (
    week_start DATE NOT NULL, week_end DATE NOT NULL,
    apex_domain TEXT NOT NULL, total_queries BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    peak_day DATE, peak_day_count BIGINT, category TEXT,
    PRIMARY KEY (week_start, apex_domain)
);

CREATE TABLE dns_monthly_stats (
    year_month CHAR(7) NOT NULL, apex_domain TEXT NOT NULL,
    total_queries BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    avg_daily NUMERIC(12,2), rank_in_month INTEGER, category TEXT,
    PRIMARY KEY (year_month, apex_domain)
);

CREATE TABLE dns_hourly_heatmap (
    stat_date DATE NOT NULL,
    hour_of_day SMALLINT NOT NULL CHECK (hour_of_day BETWEEN 0 AND 23),
    apex_domain TEXT NOT NULL, query_count BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (stat_date, hour_of_day, apex_domain)
);

CREATE TABLE domain_categories (
    apex_domain TEXT PRIMARY KEY, category TEXT NOT NULL,
    provider TEXT, cache_priority SMALLINT DEFAULT 5,
    tagged_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO domain_categories VALUES
('google.com','search','Google',3,now()),('googleapis.com','cdn','Google',2,now()),
('youtube.com','streaming','Google',1,now()),('googlevideo.com','streaming','Google',1,now()),
('netflix.com','streaming','Netflix',1,now()),('nflxvideo.net','streaming','Netflix',1,now()),
('nflximg.net','cdn','Netflix',2,now()),('akamai.net','cdn','Akamai',1,now()),
('akamaiedge.net','cdn','Akamai',1,now()),('cloudflare.com','cdn','Cloudflare',2,now()),
('facebook.com','social','Meta',4,now()),('instagram.com','social','Meta',4,now()),
('whatsapp.net','messaging','Meta',3,now()),('whatsapp.com','messaging','Meta',3,now()),
('twitter.com','social','X',4,now()),('twimg.com','cdn','X',2,now()),
('tiktok.com','social','TikTok',4,now()),('microsoft.com','work','Microsoft',5,now()),
('office.com','work','Microsoft',5,now()),('windowsupdate.com','updates','Microsoft',8,now()),
('ubuntu.com','updates','Canonical',8,now()),('amazonaws.com','cloud','AWS',5,now()),
('fastly.net','cdn','Fastly',2,now()),('cloudfront.net','cdn','AWS',2,now()),
('doubleclick.net','ads','Google',9,now()),('googlesyndication.com','ads','Google',9,now());
COMMIT;

Why this schema Separates raw immutable events from summary tables so ingestion remains simple while analytics stay fast.

What to verify Tables exist, primary keys are present, and seed categories insert without constraint errors.
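Because each partition covers one fixed day, partitions must be created ahead of arrival (the cron job in 4.4 handles this). A small helper that generates the same DDL can be sketched, assuming day-granular partitions; `partition_ddl` is our illustrative name:

```python
from datetime import date, timedelta

def partition_ddl(day: date) -> str:
    """DDL for one daily partition of dns_queries, matching the 4.4 cron job's shape."""
    nxt = day + timedelta(days=1)
    return (f"CREATE TABLE IF NOT EXISTS dns_queries_{day:%Y%m%d} "
            f"PARTITION OF dns_queries "
            f"FOR VALUES FROM ('{day}') TO ('{nxt}');")
```

IF NOT EXISTS keeps the statement rerun-safe, the same idempotency principle the rollup scripts use.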

2.4 Indexes

sql
CREATE INDEX idx_q_time   ON dns_queries (queried_at DESC);
CREATE INDEX idx_q_apex   ON dns_queries (apex_domain, queried_at DESC);
CREATE INDEX idx_q_client ON dns_queries (client_ip, queried_at DESC);
CREATE INDEX idx_d_apex   ON dns_daily_stats (apex_domain, stat_date DESC);
CREATE INDEX idx_d_hits   ON dns_daily_stats (stat_date, hit_count DESC);
CREATE INDEX idx_w_domain ON dns_weekly_stats (apex_domain, week_start DESC);
CREATE INDEX idx_m_domain ON dns_monthly_stats (apex_domain, year_month DESC);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_d_trgm ON dns_daily_stats USING gin (apex_domain gin_trgm_ops);
COMMIT;
Chapter 2 Complete

Five-table schema built, indexed, 26 domain categories seeded.

Chapter 03

The Log Ingestor — Python Daemon

Teaching Lens

The ingestor is an ETL loop: read, parse, normalize, write. Batch writes improve throughput and reduce database lock pressure.

  • Regex parser must tolerate noisy log variants.
  • Batch size and flush interval balance latency vs throughput.
  • Systemd gives restart guarantees after crashes or reboots.
  • If inserts stop, check file permissions, regex match rate, then DB auth.

3.1 Python Setup

bash
sudo apt-get install -y python3 python3-pip python3-venv
sudo mkdir -p /opt/dns-ingestor && sudo chown $USER:$USER /opt/dns-ingestor
cd /opt/dns-ingestor && python3 -m venv venv
source venv/bin/activate && pip install psycopg2-binary

Why this command Creates an isolated runtime so ingestor dependencies are deterministic and decoupled from system Python.

What to verify `venv/bin/python` exists and `pip show psycopg2-binary` resolves inside the virtual environment.

3.2 The Script

bash
nano /opt/dns-ingestor/ingestor.py
python
/opt/dns-ingestor/ingestor.py
#!/usr/bin/env python3
import re, time, logging, os, signal, sys
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values

DB = {'dbname':'dns_analytics','user':'dns_user',
      'password':'ChangeThisPassword!','host':'127.0.0.1','port':5432}
LOG_FILE = '/var/log/named/queries.log'
BATCH_SIZE = 100
FLUSH_INTERVAL = 5

QUERY_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
    r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)

def apex(d):
    # Naive apex: last two labels (mislabels multi-part TLDs such as co.uk)
    p = d.rstrip('.').split('.')
    return '.'.join(p[-2:]) if len(p)>=2 else d

def parse_ts(s):
    try: return datetime.strptime(s, '%d-%b-%Y %H:%M:%S')
    except ValueError: return datetime.utcnow()  # fall back rather than drop the event

def parse(line):
    line = line.strip()
    if not line: return None
    m = QUERY_RE.match(line)
    if not m: return None
    d = m.groupdict()
    dom = d.get('domain','').lower().rstrip('.')
    if not dom: return None
    flags = d.get('flags','').strip()
    return {'queried_at':parse_ts(d['ts']),'client_ip':d.get('ip','0.0.0.0'),
            'client_port':int(d.get('port',0)),'domain':dom,'apex_domain':apex(dom),
            'qtype':d.get('qtype','A').upper(),'flags':flags,
            'response_rcode':'NOERROR','cached':'CL' in flags}

def connect():
    while True:
        try:
            c=psycopg2.connect(**DB); c.autocommit=False
            logging.info('DB connected'); return c
        except psycopg2.OperationalError as e:
            logging.error(f'DB: {e}'); time.sleep(10)

def flush(conn, batch):
    if not batch: return 0
    sql = ('INSERT INTO dns_queries '
           '(queried_at,client_ip,client_port,domain,apex_domain,'
           'qtype,flags,response_rcode,cached) VALUES %s ON CONFLICT DO NOTHING')
    rows = [(r['queried_at'],r['client_ip'],r['client_port'],r['domain'],
             r['apex_domain'],r['qtype'],r['flags'],r['response_rcode'],r['cached'])
            for r in batch]
    try:
        with conn.cursor() as cur: execute_values(cur,sql,rows,page_size=500)
        conn.commit(); return len(batch)
    except psycopg2.Error as e:
        logging.error(f'Insert: {e}'); conn.rollback(); return 0

def run(conn):
    batch,last_flush,total,inode,f = [],time.time(),0,0,None
    logging.info(f'Watching {LOG_FILE}')
    while True:
        try: cur_inode=os.stat(LOG_FILE).st_ino
        except FileNotFoundError: time.sleep(5); continue
        if cur_inode != inode:
            try: f=open(LOG_FILE,'r'); f.seek(0,2); inode=cur_inode
            except IOError: time.sleep(5); continue
        while True:
            line = f.readline()
            if not line: break
            rec = parse(line)
            if rec: batch.append(rec)
        now = time.time()
        if len(batch)>=BATCH_SIZE or (batch and now-last_flush>=FLUSH_INTERVAL):
            n=flush(conn,batch); total+=n
            logging.info(f'Flushed {n} | total {total:,}')
            batch,last_flush=[],now
        time.sleep(0.5)

if __name__=='__main__':
    os.makedirs('/var/log/dns-ingestor',exist_ok=True)
    logging.basicConfig(level=logging.INFO,
        format='%(asctime)s [%(levelname)s] %(message)s',
        handlers=[logging.StreamHandler(sys.stdout),
                  logging.FileHandler('/var/log/dns-ingestor/ingestor.log')])
    def bye(s,f): sys.exit(0)
    signal.signal(signal.SIGTERM,bye); signal.signal(signal.SIGINT,bye)
    run(connect())

Why this code Implements resilient tail-and-batch ingestion with reconnect logic, minimizing data loss during transient DB failures.

What to verify Logs show repeated "Flushed N" messages and database row counts increase during DNS traffic.
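One known limitation: the script's apex() keeps only the last two labels, so bbc.co.uk collapses to co.uk. A sketch of a fix using a tiny hand-rolled suffix set; `MULTI_PART_TLDS` is illustrative and deliberately incomplete, and production code should use the Public Suffix List instead:

```python
# Assumption: a tiny hand-rolled suffix set; real deployments should
# consume the Public Suffix List rather than maintain this by hand.
MULTI_PART_TLDS = {"co.uk", "com.au", "co.za"}

def apex(domain: str) -> str:
    """Apex extraction that keeps three labels for known multi-part TLDs."""
    labels = domain.rstrip(".").lower().split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_PART_TLDS:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:]) if len(labels) >= 2 else domain
```

Because apex_domain feeds every rollup and category join, fixing this early avoids re-aggregating history later.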

3.3 Systemd Service

ini
/etc/systemd/system/dns-ingestor.service
[Unit]
Description=DNS Query Log Ingestor -- code::core
After=network.target postgresql.service bind9.service
Requires=postgresql.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/dns-ingestor
ExecStart=/opt/dns-ingestor/venv/bin/python3 /opt/dns-ingestor/ingestor.py
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
bash
sudo systemctl daemon-reload
sudo systemctl enable --now dns-ingestor
sudo journalctl -u dns-ingestor -f

Why this command Registers your unit file, starts it immediately, and tails logs so first-run failures are visible in real time.

What to verify Service state is `active (running)` and no crash-loop pattern appears in journal output.

3.4 End-to-End Test

bash
for d in google.com youtube.com netflix.com instagram.com microsoft.com; do
    dig @192.168.234.128 $d A +short
done
sleep 6
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT domain,qtype,queried_at FROM dns_queries ORDER BY queried_at DESC LIMIT 10;"
Expected: rows like google.com | A | 2026-01-11 08:20:01+00 (10 rows)
Pipeline is LIVE

BIND9 → Python → PostgreSQL. Every LAN DNS query is now captured and stored.

Chapter 04

Rollup Jobs — Daily, Weekly & Monthly

All scripts use ON CONFLICT DO UPDATE — idempotent, safe to rerun at any time.

Teaching Lens

Rollups are your performance strategy. Instead of scanning billions of raw rows for every report, you pre-compute stable summaries on a schedule.

  • Daily table is operational and near-term.
  • Weekly table is planning and trend-oriented.
  • Monthly table is executive reporting and capacity forecasting.
  • Idempotent SQL means recovery is rerun, not manual repair.
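Idempotency here means reruns overwrite rather than accumulate. In miniature, ON CONFLICT DO UPDATE behaves like a keyed overwrite; the `rollup` function below is an analogy for the rerun-safety property, not the actual SQL:

```python
def rollup(stats: dict, rows):
    """Upsert daily counts keyed like dns_daily_stats' primary key
    (stat_date, domain, qtype). Rerunning overwrites, never adds."""
    for stat_date, domain, qtype, hits in rows:
        stats[(stat_date, domain, qtype)] = hits   # overwrite, not +=
    return stats
```

Had the SQL used plain INSERT (or the sketch used `+=`), every rerun would double the counts; the overwrite semantics are what make "recovery is rerun" safe.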
bash
sudo mkdir -p /opt/dns-ingestor/sql

4.1 Daily Rollup

sql
/opt/dns-ingestor/sql/rollup_daily.sql
-- Run at 01:00 every morning
INSERT INTO dns_daily_stats
    (stat_date,domain,apex_domain,qtype,hit_count,unique_clients,
     nxdomain_count,servfail_count,first_seen,last_seen)
SELECT
    DATE(queried_at AT TIME ZONE 'Africa/Kampala') AS stat_date,
    domain, apex_domain, qtype,
    COUNT(*), COUNT(DISTINCT client_ip),
    COUNT(*) FILTER (WHERE response_rcode='NXDOMAIN'),
    COUNT(*) FILTER (WHERE response_rcode='SERVFAIL'),
    MIN(queried_at), MAX(queried_at)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
  AND queried_at  <  CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,domain,apex_domain,qtype
ON CONFLICT (stat_date,domain,qtype) DO UPDATE SET
    hit_count=EXCLUDED.hit_count, unique_clients=EXCLUDED.unique_clients,
    nxdomain_count=EXCLUDED.nxdomain_count, servfail_count=EXCLUDED.servfail_count,
    first_seen=EXCLUDED.first_seen, last_seen=EXCLUDED.last_seen;

INSERT INTO dns_hourly_heatmap (stat_date,hour_of_day,apex_domain,query_count)
SELECT DATE(queried_at AT TIME ZONE 'Africa/Kampala'),
    EXTRACT(HOUR FROM queried_at AT TIME ZONE 'Africa/Kampala')::SMALLINT,
    apex_domain, COUNT(*)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
  AND queried_at  <  CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,2,3
ON CONFLICT (stat_date,hour_of_day,apex_domain) DO UPDATE SET query_count=EXCLUDED.query_count;
COMMIT;

Why this SQL Compresses raw query volume into daily business-ready metrics and heatmap dimensions with rerun-safe conflict handling.

What to verify Running twice should not duplicate counts; rows are updated deterministically for the same date window.
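The rollup's WHERE clause defines its window in Kampala local time. A sketch of computing that [start, end) window; since Africa/Kampala is UTC+3 year-round with no DST, a fixed offset stands in safely here, and `EAT` / `yesterday_window` are our names:

```python
from datetime import datetime, timedelta, time, timezone

# Africa/Kampala is UTC+3 year-round (no DST), so a fixed offset is a safe stand-in.
EAT = timezone(timedelta(hours=3), "EAT")

def yesterday_window(now_utc: datetime):
    """[start, end) of the previous Kampala-local day,
    matching the daily rollup's WHERE clause."""
    local = now_utc.astimezone(EAT)
    end = datetime.combine(local.date(), time.min, tzinfo=EAT)
    start = end - timedelta(days=1)
    return start, end
```

Run at 01:00 local (22:00 UTC the previous day), the window still covers the correct local day, which is exactly why the SQL converts with AT TIME ZONE rather than comparing raw UTC timestamps.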

4.2 Weekly Rollup

sql
/opt/dns-ingestor/sql/rollup_weekly.sql
-- Run Monday 02:00
INSERT INTO dns_weekly_stats
    (week_start,week_end,apex_domain,total_queries,unique_clients,peak_day,peak_day_count,category)
WITH totals AS (
    SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start,
           (DATE_TRUNC('week',stat_date)+INTERVAL '6 days')::DATE AS week_end,
           apex_domain, SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients
    FROM dns_daily_stats
    WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
      AND stat_date  <  DATE_TRUNC('week',CURRENT_DATE)::DATE
    GROUP BY 1,2,3
), peaks AS (
    SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start, apex_domain,
           stat_date AS peak_day, SUM(hit_count) AS day_total,
           RANK() OVER (PARTITION BY DATE_TRUNC('week',stat_date),apex_domain
                        ORDER BY SUM(hit_count) DESC) AS rnk
    FROM dns_daily_stats
    WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
      AND stat_date  <  DATE_TRUNC('week',CURRENT_DATE)::DATE
    GROUP BY 1,2,3
)
SELECT t.week_start,t.week_end,t.apex_domain,t.total_queries,t.unique_clients,
       p.peak_day,p.day_total,dc.category
FROM totals t
LEFT JOIN peaks p ON p.week_start=t.week_start AND p.apex_domain=t.apex_domain AND p.rnk=1
LEFT JOIN domain_categories dc ON dc.apex_domain=t.apex_domain
ON CONFLICT (week_start,apex_domain) DO UPDATE SET
    total_queries=EXCLUDED.total_queries,unique_clients=EXCLUDED.unique_clients,
    peak_day=EXCLUDED.peak_day,peak_day_count=EXCLUDED.peak_day_count;
COMMIT;
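The peaks CTE keeps, per week and apex domain, the single day with the highest count (rnk = 1). The same selection in miniature; `weekly_peaks` is a hypothetical helper, not part of the pipeline:

```python
def weekly_peaks(rows):
    """rows: (week_start, apex_domain, stat_date, hit_count) tuples.
    Returns {(week_start, apex): (peak_day, count)} -- what the
    peaks CTE keeps at rnk = 1."""
    best = {}
    for week_start, apex_domain, day, hits in rows:
        key = (week_start, apex_domain)
        if key not in best or hits > best[key][1]:
            best[key] = (day, hits)
    return best
```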

4.3 Monthly Rollup

sql
/opt/dns-ingestor/sql/rollup_monthly.sql
-- Run 1st of month 03:00
INSERT INTO dns_monthly_stats (year_month,apex_domain,total_queries,unique_clients,avg_daily,rank_in_month,category)
WITH m AS (
    SELECT TO_CHAR(stat_date,'YYYY-MM') AS year_month, apex_domain,
           SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients,
           ROUND(AVG(hit_count),2) AS avg_daily
    FROM dns_daily_stats
    WHERE TO_CHAR(stat_date,'YYYY-MM')=TO_CHAR(CURRENT_DATE-INTERVAL '1 month','YYYY-MM')
    GROUP BY 1,2
)
SELECT m.year_month,m.apex_domain,m.total_queries,m.unique_clients,m.avg_daily,
       RANK() OVER (PARTITION BY m.year_month ORDER BY m.total_queries DESC), dc.category
FROM m LEFT JOIN domain_categories dc ON dc.apex_domain=m.apex_domain
ON CONFLICT (year_month,apex_domain) DO UPDATE SET
    total_queries=EXCLUDED.total_queries,avg_daily=EXCLUDED.avg_daily,rank_in_month=EXCLUDED.rank_in_month;
COMMIT;

4.4 Cron Schedule

cron
sudo crontab -e
# DNS Analytics Rollups
0 1 * * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_daily.sql >> /var/log/dns-ingestor/daily.log 2>&1
0 2 * * 1  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_weekly.sql >> /var/log/dns-ingestor/weekly.log 2>&1
0 3 1 * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_monthly.sql >> /var/log/dns-ingestor/monthly.log 2>&1
# Create tomorrow's partition at 23:30
30 23 * * * psql -U dns_user -d dns_analytics -h localhost -c "CREATE TABLE IF NOT EXISTS dns_queries_$(date -d tomorrow +\%Y\%m\%d) PARTITION OF dns_queries FOR VALUES FROM (CURRENT_DATE+INTERVAL '1 day') TO (CURRENT_DATE+INTERVAL '2 days');" >> /var/log/dns-ingestor/partition.log 2>&1
Chapter 05

SQL Analytics — Reading Your Traffic

Teaching Lens

Every query answers a business question. Treat SQL here as decision logic for peering, caching, abuse detection, and customer experience.

  • Top domains identify concentration risk.
  • Category mix shows where bandwidth is actually going.
  • Hourly heatmaps guide cache warm-up windows.
  • NXDOMAIN rates surface misconfigurations or bot noise.

5.1 Top 50 Domains (7-Day)

sql
SELECT ds.apex_domain, dc.category, dc.provider, dc.cache_priority,
    SUM(ds.hit_count) AS total_queries,
    ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct,
    ROUND(AVG(ds.hit_count),0) AS avg_per_day
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain=ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY ds.apex_domain,dc.category,dc.provider,dc.cache_priority
ORDER BY total_queries DESC LIMIT 50;

Why this query Produces concentration analysis: which domains dominate resolver workload and therefore drive caching and peering priorities.

What to verify Percentages roughly sum to 100 and top domains align with observed subscriber behavior.
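The pct column divides each domain's sum by a window over the grand total: SUM(hit_count) * 100.0 / SUM(SUM(hit_count)) OVER (). The same share computation in miniature (`with_pct` is illustrative):

```python
def with_pct(counts):
    """Share of total per domain, rounded to 2 decimals like the SQL pct column."""
    total = sum(counts.values())
    return {d: round(c * 100.0 / total, 2) for d, c in counts.items()}
```

Per-row rounding means the shares sum to roughly, not exactly, 100 — which is why the verification note above says "roughly".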

5.2 By Category

sql
SELECT COALESCE(dc.category,'uncategorised') AS category,
    COUNT(DISTINCT ds.apex_domain) AS domains, SUM(ds.hit_count) AS total_queries,
    ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain=ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1 ORDER BY total_queries DESC;

5.3 24-Hour Heatmap

sql
SELECT LPAD(h.hour_of_day::TEXT,2,'0')||':00' AS hour,
    SUM(h.query_count) AS queries,
    REPEAT('█',(SUM(h.query_count)/NULLIF(MAX(SUM(h.query_count)) OVER()/40,0))::INT) AS bar
FROM dns_hourly_heatmap h
WHERE h.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY h.hour_of_day ORDER BY h.hour_of_day;

5.4 NXDOMAIN Analysis

sql
SELECT apex_domain, SUM(hit_count) AS total, SUM(nxdomain_count) AS failures,
    ROUND(SUM(nxdomain_count)*100.0/NULLIF(SUM(hit_count),0),1) AS fail_pct
FROM dns_daily_stats
WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days' AND nxdomain_count > 0
GROUP BY apex_domain HAVING SUM(nxdomain_count)>10
ORDER BY fail_pct DESC LIMIT 30;
Chapter 06

Scaling to 2 Million Subscribers

2M subscribers means roughly 600M queries/day, about 7,000 qps sustained. Text log files break at this scale. The data model stays identical; only the transport layer changes.

Teaching Lens

At scale, architecture changes where data moves, not what data means. Keep schema semantics stable so reports remain comparable from lab to production.

  • Switch text logs to binary event streams (dnstap).
  • Switch single-process ingest to distributed stream workers.
  • Switch plain PostgreSQL patterns to time-series optimizations.
  • Keep privacy controls explicit before volume grows.

6.1 Production Architecture

  • Resolver Layer: Anycast VIP (resolver.isp.net) → 8x Unbound (DNS cluster) → dnstap (binary protobuf)
  • Ingest Layer: Kafka (dns-queries topic) → Faust Workers (x8 processors) → TimescaleDB (hypertables)
Layer       | VM (this guide)  | Production (2M subs)
DNS daemon  | BIND9 text log   | Unbound x8 + dnstap
Transport   | File on disk     | Kafka 3-broker cluster
Ingestor    | 1 Python process | Faust workers x8
Database    | PostgreSQL 16    | TimescaleDB on NVMe RAID
Volume      | ~50K/day         | 600M/day (7K qps)
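The headline volume follows from per-subscriber query rates. The 300 queries/sub/day figure and the 3x peak-to-average factor below are planning assumptions chosen to reproduce the stated 600M/day, not measured values:

```python
SUBSCRIBERS = 2_000_000
QUERIES_PER_SUB_PER_DAY = 300      # assumption behind the 600M/day figure

daily_queries = SUBSCRIBERS * QUERIES_PER_SUB_PER_DAY   # 600,000,000
avg_qps = daily_queries / 86_400                        # ~6,944 sustained
peak_qps = avg_qps * 3                                  # 3x peak factor (assumption)
```

Capacity should be sized for peak_qps, not avg_qps; evening streaming hours concentrate load well above the daily mean.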

6.2 TimescaleDB

bash
sudo apt-get install -y timescaledb-2-postgresql-16
sudo timescaledb-tune --quiet --yes && sudo systemctl restart postgresql
sql
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('dns_queries','queried_at',
    chunk_time_interval=>INTERVAL '1 day',if_not_exists=>TRUE);
ALTER TABLE dns_queries SET (timescaledb.compress,
    timescaledb.compress_orderby='queried_at DESC',
    timescaledb.compress_segmentby='apex_domain');
SELECT add_compression_policy('dns_queries',INTERVAL '7 days');
CREATE MATERIALIZED VIEW dns_daily_live WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day',queried_at) AS bucket, apex_domain, qtype,
       COUNT(*) AS hit_count, COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries GROUP BY bucket,apex_domain,qtype;
SELECT add_continuous_aggregate_policy('dns_daily_live',
    start_offset=>INTERVAL '3 days',end_offset=>INTERVAL '1 hour',
    schedule_interval=>INTERVAL '1 hour');

6.3 dnstap

ini
unbound.conf additions
dnstap:
    dnstap-enable: yes
    dnstap-socket-path: "/var/run/unbound/dnstap.sock"
    dnstap-log-client-query-messages: yes
    dnstap-log-client-response-messages: yes
Privacy at ISP scale

Store only the /24 subnet, not the full /32 client IP. Aggregated domain counts contain no PII.
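A sketch of that masking at ingest time; the /48 choice for IPv6 is our assumption (the note above specifies only /24 for IPv4), and `anonymize` is an illustrative name:

```python
import ipaddress

def anonymize(ip: str) -> str:
    """Truncate a client address to its /24 (IPv4) or /48 (IPv6, our assumption)
    network before storage, so no full client address is retained."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False).network_address)
```

Applying this in the ingestor's parse step keeps unique_clients meaningful at subnet granularity while dropping per-host identifiability.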

Appendix A

Firewall (UFW)

Teaching Lens

Security here is segmentation, not complexity. DNS stays reachable only from trusted private ranges; database stays local unless you explicitly tunnel it.

bash
sudo ufw enable && sudo ufw allow ssh
sudo ufw allow from 192.168.0.0/16 to any port 53 proto udp
sudo ufw allow from 192.168.0.0/16 to any port 53 proto tcp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto udp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto tcp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto udp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto tcp
sudo ufw allow from 127.0.0.1 to any port 5432 proto tcp
sudo ufw default deny incoming && sudo ufw default allow outgoing
Appendix B

Troubleshooting

Teaching Lens

Troubleshooting should follow the data path. Start at resolver output, then parser health, then database writes, then rollups, then reports.

Symptom                | Check              | Command
BIND9 won't start      | Config syntax      | sudo named-checkconf
No queries.log         | Log dir ownership  | ls -la /var/log/named/
dig returns REFUSED    | Client not in ACL  | sudo journalctl -u bind9 -n 20
Ingestor not inserting | Wrong DB password  | sudo journalctl -u dns-ingestor -n 40
Partition error        | Partition absent   | Create partition manually
psql auth fail         | pg_hba.conf peer   | Add md5 line below
Empty daily_stats      | Rollup not run yet | Run rollup_daily.sql manually
bash
fix pg_hba.conf if psql auth fails
sudo nano /etc/postgresql/16/main/pg_hba.conf
# Add ABOVE existing local lines:
# host  dns_analytics  dns_user  127.0.0.1/32  md5
sudo systemctl reload postgresql
Done

BIND9 capturing. Python ingesting. PostgreSQL storing. Cron aggregating. Run the Chapter 5 queries after 7 days and you have everything to design caching, CDN peering, and BGP strategy.