code::core · Technical Book

DNS Analytics
from Zero to 2M

Complete step-by-step guide to building a DNS query analytics server on Ubuntu 24.04. BIND9 captures every LAN query. Python streams the queries into PostgreSQL. Daily, weekly, and monthly rollups give you the traffic intelligence to design caching and peering for an ISP at scale.

Server IP: 192.168.234.128
OS: Ubuntu 24.04 LTS
DNS: BIND9
Database: PostgreSQL 16
Networks: All RFC1918
Overview

How Everything Connects

Teaching Lens

This lesson teaches pipeline thinking instead of command memorization: resolve DNS, capture events, store durable records, then aggregate for decisions.

  • You verify resolver behavior before trusting downstream analytics.
  • You verify ingest transforms raw logs into structured rows.
  • You verify rollups convert raw volume into low-cost summaries.
  • You verify faults by debugging left to right across the pipeline.
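The pipeline framing above (resolve, capture, store, aggregate) can be condensed into a toy sketch. Everything below is a hypothetical miniature; the real stages are BIND9's query log, the ingestor daemon, and the SQL rollups:

```python
from collections import Counter

def parse(line):
    """Turn one raw 'ip domain qtype' log line into a structured record."""
    ip, domain, qtype = line.split()
    return {"ip": ip, "domain": domain, "qtype": qtype}

def aggregate(records):
    """Roll structured records up into per-domain counts (the rollup stage)."""
    return Counter(r["domain"] for r in records)

# Three fake captured events standing in for queries.log lines.
raw = ["10.0.0.5 google.com A", "10.0.0.6 google.com A", "10.0.0.5 netflix.com AAAA"]
rollup = aggregate(parse(l) for l in raw)
print(rollup["google.com"])   # 2
```

Debugging left to right means checking each stage's output in isolation, exactly as you will with dig, the ingestor log, and psql.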
Query Resolution: LAN Client (any machine) → BIND9 (192.168.234.128:53) → Upstream (8.8.8.8 / 1.1.1.1)
Logging & Ingest: queries.log (/var/log/named/) → Python Daemon (dns-ingestor.service) → PostgreSQL (dns_analytics DB)
Analytics Path: dns_queries (raw table) → Cron Rollups (daily/weekly/monthly) → Stats Tables (analytics ready)
  • BIND9 (Resolve & Log): recursive resolver; every query written to a structured log file.
  • Python (Parse & Insert): tails the log in real time; bulk-inserts into PostgreSQL.
  • PostgreSQL (Store & Aggregate): raw queries partitioned by day; nightly cron builds summaries.
  • Outcome (Traffic Intelligence): top domains, CDN ratios, peak hours for network design.
Chapter 01

Installing & Configuring BIND9

BIND9 acts as a recursive caching resolver for your entire LAN. Every query is logged to the file that feeds our analytics pipeline.

Teaching Lens

This lesson teaches that resolver stability is a data-quality requirement, not just a DNS setup task.

  • You verify ACL policy to prevent open-recursion abuse.
  • You verify forwarder behavior and upstream response reliability.
  • You verify log timestamp quality for parser correctness.
  • You verify in order: syntax, service state, query path, and log emission.

1.1 Install BIND9

1

Update package index

bash
sudo apt-get update

This lesson teaches package metadata refresh as a prerequisite for reliable installation results.

You verify there are no repository errors and the command exits successfully before continuing.

2

Install BIND9 and tools

bash
sudo apt-get install -y bind9 bind9utils bind9-doc dnsutils

This lesson teaches installing both the resolver service and its diagnostic toolchain in the same step.

You verify named, dig, and named-checkconf are available after installation.

3

Verify and enable on boot

bash
sudo systemctl status bind9 && sudo systemctl enable bind9
Expected: Active: active (running)

1.2 Configure named.conf.options

Back up first

sudo cp /etc/bind/named.conf.options /etc/bind/named.conf.options.bak

Before you paste this block, understand its job: this is the resolver policy engine for your network. The acl "trusted" section defines who is allowed to use your resolver recursively, which prevents open-resolver abuse. The forwarders list defines where unresolved queries are sent upstream. Cache and rate-limit settings protect performance during spikes. Read this as "who may ask, where answers come from, and how safely the resolver behaves under load."

named.conf
/etc/bind/named.conf.options — delete all, paste this
// DNS Analytics Server -- 192.168.234.128
acl "trusted" {
    127.0.0.1;
    192.168.0.0/16;
    10.0.0.0/8;
    172.16.0.0/12;
};
options {
    directory "/var/cache/bind";
    listen-on { any; }; listen-on-v6 { any; };
    allow-query { trusted; }; allow-recursion { trusted; };
    forwarders { 8.8.8.8; 8.8.4.4; 1.1.1.1; 9.9.9.9; };
    forward only;
    dnssec-validation auto;
    allow-transfer { none; }; notify no; recursion yes;
    max-cache-size 256m; min-cache-ttl 60; max-cache-ttl 86400;
    rate-limit { responses-per-second 50; window 5; };
    pid-file "/run/named/named.pid";
};

This lesson teaches resolver policy design using ACL, recursion limits, forwarding strategy, and cache controls.

You verify only trusted networks can recurse and external clients are refused.
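The membership test behind the acl "trusted" block can be modeled with Python's ipaddress module. This is a sketch of the decision BIND makes per client, not BIND's implementation:

```python
import ipaddress

# The same loopback + RFC1918 ranges as the acl "trusted" block.
TRUSTED = [ipaddress.ip_network(n) for n in
           ("127.0.0.1/32", "192.168.0.0/16", "10.0.0.0/8", "172.16.0.0/12")]

def is_trusted(client_ip: str) -> bool:
    """True if the client falls inside any trusted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in TRUSTED)

print(is_trusted("192.168.234.1"))  # True  -> recursion allowed
print(is_trusted("203.0.113.9"))    # False -> query refused
```

Testing one inside address and one outside address is exactly the verification you should run with dig from a LAN host and, if possible, an external one.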

1.3 Configure Query Logging

This command block prepares the filesystem for reliable telemetry capture. BIND writes logs as the bind service account, so directory ownership is not optional. If ownership is wrong, DNS might still resolve while analytics silently fail because no query lines are written. Treat this as data-pipeline readiness, not just Linux housekeeping.

bash
sudo mkdir -p /var/log/named
sudo chown bind:bind /var/log/named && sudo chmod 755 /var/log/named
Why chown bind:bind

BIND9 runs as the bind user. Root-owned log dir = silent failure with zero error message.
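Because the failure mode is silent, ownership is worth checking programmatically. A small stdlib sketch (Unix-only, demonstrated here on a temporary directory we own; on the server you would point dir_owner at /var/log/named and expect 'bind'):

```python
import os, pwd, tempfile

def dir_owner(path: str) -> str:
    """Name of the user that owns `path` -- for /var/log/named it must be 'bind'."""
    return pwd.getpwuid(os.stat(path).st_uid).pw_name

# Demo on a temp dir owned by the current user.
with tempfile.TemporaryDirectory() as d:
    ok = dir_owner(d) == pwd.getpwuid(os.getuid()).pw_name
print(ok)  # True
```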

This logging block inside /etc/bind/named.conf.local decides which DNS events become analyzable records. The key line is print-time yes, because your parser and time-based rollups depend on accurate timestamps. The category mapping intentionally sends query events to queries.log and suppresses noisy internals with null sinks. You are defining signal versus noise for every lesson that follows.

named.conf
/etc/bind/named.conf.local — append this block
logging {
    channel queries_log {
        file "/var/log/named/queries.log"
            versions 7 size 500m;
        severity dynamic;
        print-time yes; // CRITICAL: timestamp on every line
        print-severity no; print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log"
            versions 4 size 50m;
        severity info; print-time yes; print-severity yes; print-category yes;
    };
    category queries       { queries_log; };
    category query-errors  { queries_log; };
    category default       { named_log; };
    category general       { named_log; };
    category config        { named_log; };
    category network       { named_log; };
    category security      { named_log; };
    category lame-servers  { null; };
    category dnssec        { null; };
    category resolver      { null; };
    category cname         { null; };
    category xfer-in       { null; };
    category xfer-out      { null; };
    category notify        { null; };
    category client        { null; };
    category unmatched     { null; };
    category dispatch      { null; };
    category edns-disabled { null; };
    category rpz           { null; };
    category rate-limit    { null; };
};

1.4 Validate, Restart, Test

This sequence is ordered as a troubleshooting decision tree. First, named-checkconf validates syntax before risking a restart. Second, restarting applies your new policy and logging rules. Third, dig tests real query flow. Last, tail verifies telemetry emission. If any step fails, you know exactly where the break occurred: config parsing, service runtime, DNS path, or logging path.

bash
sudo named-checkconf  # silence = success
sudo systemctl restart bind9
dig @192.168.234.128 google.com A
dig @192.168.234.128 youtube.com AAAA
sudo tail -20 /var/log/named/queries.log

This lesson teaches deterministic validation flow from syntax check to live telemetry confirmation.

You verify dig returns NOERROR and query lines appear in queries.log with timestamps.

Expected log format:
11-Jan-2026 08:15:33.421 client @0xabcd 192.168.234.1#45231 (google.com): query: google.com IN A + (192.168.234.128)
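That expected line is exactly what Chapter 3's parser must match. You can sanity-check it now with a simplified version of the regex (this pattern is an assumption for illustration; the full Chapter 3 pattern also handles IPv6 clients and flag suffixes):

```python
import re

# The sample line from the expected-output box above.
LINE = ("11-Jan-2026 08:15:33.421 client @0xabcd 192.168.234.1#45231 "
        "(google.com): query: google.com IN A + (192.168.234.128)")

# Simplified parser pattern: timestamp, optional @0x... handle, IPv4 client,
# port, view/domain parens, then "query: <domain> IN <qtype>".
PAT = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d.]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)')

m = PAT.match(LINE)
print(m.group('ip'), m.group('domain'), m.group('qtype'))
# 192.168.234.1 google.com A
```

If your real log lines fail this kind of match, fix the logging channel before touching the ingestor.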
Chapter 1 Complete

BIND9 installed and logging every query to /var/log/named/queries.log.

Chapter 02

PostgreSQL — Installation & Schema

Teaching Lens

This lesson teaches event modeling for scale: append-heavy raw storage plus pre-aggregated reporting tables.

  • You verify raw tables preserve full fidelity for reprocessing.
  • You verify daily, weekly, and monthly rollups reduce query cost.
  • You verify indexes match real analytics access patterns.
  • You verify category mapping translates domains into business meaning.

2.1 Install

This installation block creates the database runtime used by every later lesson. The package command installs the server and common extensions bundle, enabling PostgreSQL to start at boot ensures persistence after reboots, and the version query proves the server is reachable before any schema work begins. Treat this as a platform readiness gate, not just package setup.

bash
sudo apt-get install -y postgresql postgresql-contrib
sudo systemctl enable postgresql
sudo -u postgres psql -c "SELECT version();"

This lesson teaches database runtime setup before schema creation and ingestion onboarding.

You verify version output is returned and PostgreSQL is enabled at boot.

2.2 Create Database and User

This SQL block establishes least-privilege access. You create a dedicated login role for the ingestor, then bind the analytics database ownership to that role so application writes stay scoped to the project. Avoid using the postgres superuser for daily ingestion because privilege boundaries are part of operational safety and incident containment.

bash
sudo -u postgres psql

sql
CREATE USER dns_user WITH PASSWORD 'ChangeThisPassword!'
    NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN;
CREATE DATABASE dns_analytics OWNER dns_user
    ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;
GRANT ALL PRIVILEGES ON DATABASE dns_analytics TO dns_user;
\q

2.3 Schema

This lesson teaches connection and ownership validation before schema creation. You verify that the application role can access the target database directly before any table definitions are applied.

bash
psql -U dns_user -d dns_analytics -h localhost

This lesson teaches the full analytics data model from raw events to reporting summaries. You verify that partitioned ingestion, rollup tables, and category metadata work together as one coherent decision-support schema.

sql
full schema
-- Raw query log, partitioned by day
CREATE TABLE dns_queries (
    id BIGSERIAL, queried_at TIMESTAMPTZ NOT NULL,
    client_ip INET NOT NULL, client_port INTEGER,
    domain TEXT NOT NULL, apex_domain TEXT NOT NULL,
    qtype VARCHAR(10) NOT NULL, flags TEXT,
    response_rcode TEXT, cached BOOLEAN DEFAULT FALSE,
    inserted_at TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (queried_at);

CREATE TABLE dns_queries_today PARTITION OF dns_queries
    FOR VALUES FROM (CURRENT_DATE) TO (CURRENT_DATE + INTERVAL '1 day');

-- Per-domain count per day
CREATE TABLE dns_daily_stats (
    stat_date DATE NOT NULL, domain TEXT NOT NULL,
    apex_domain TEXT NOT NULL, qtype VARCHAR(10) NOT NULL,
    hit_count BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    nxdomain_count INTEGER NOT NULL DEFAULT 0,
    servfail_count INTEGER NOT NULL DEFAULT 0,
    first_seen TIMESTAMPTZ, last_seen TIMESTAMPTZ,
    PRIMARY KEY (stat_date, domain, qtype)
);

CREATE TABLE dns_weekly_stats (
    week_start DATE NOT NULL, week_end DATE NOT NULL,
    apex_domain TEXT NOT NULL, total_queries BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    peak_day DATE, peak_day_count BIGINT, category TEXT,
    PRIMARY KEY (week_start, apex_domain)
);

CREATE TABLE dns_monthly_stats (
    year_month CHAR(7) NOT NULL, apex_domain TEXT NOT NULL,
    total_queries BIGINT NOT NULL DEFAULT 0,
    unique_clients INTEGER NOT NULL DEFAULT 0,
    avg_daily NUMERIC(12,2), rank_in_month INTEGER, category TEXT,
    PRIMARY KEY (year_month, apex_domain)
);

CREATE TABLE dns_hourly_heatmap (
    stat_date DATE NOT NULL,
    hour_of_day SMALLINT NOT NULL CHECK (hour_of_day BETWEEN 0 AND 23),
    apex_domain TEXT NOT NULL, query_count BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (stat_date, hour_of_day, apex_domain)
);

CREATE TABLE domain_categories (
    apex_domain TEXT PRIMARY KEY, category TEXT NOT NULL,
    provider TEXT, cache_priority SMALLINT DEFAULT 5,
    tagged_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO domain_categories VALUES
('google.com','search','Google',3,now()),('googleapis.com','cdn','Google',2,now()),
('youtube.com','streaming','Google',1,now()),('googlevideo.com','streaming','Google',1,now()),
('netflix.com','streaming','Netflix',1,now()),('nflxvideo.net','streaming','Netflix',1,now()),
('nflximg.net','cdn','Netflix',2,now()),('akamai.net','cdn','Akamai',1,now()),
('akamaiedge.net','cdn','Akamai',1,now()),('cloudflare.com','cdn','Cloudflare',2,now()),
('facebook.com','social','Meta',4,now()),('instagram.com','social','Meta',4,now()),
('whatsapp.net','messaging','Meta',3,now()),('whatsapp.com','messaging','Meta',3,now()),
('twitter.com','social','X',4,now()),('twimg.com','cdn','X',2,now()),
('tiktok.com','social','TikTok',4,now()),('microsoft.com','work','Microsoft',5,now()),
('office.com','work','Microsoft',5,now()),('windowsupdate.com','updates','Microsoft',8,now()),
('ubuntu.com','updates','Canonical',8,now()),('amazonaws.com','cloud','AWS',5,now()),
('fastly.net','cdn','Fastly',2,now()),('cloudfront.net','cdn','AWS',2,now()),
('doubleclick.net','ads','Google',9,now()),('googlesyndication.com','ads','Google',9,now());
COMMIT;

This lesson teaches separation of raw event capture and summary analytics for scale and clarity.

You verify tables, constraints, and seed category inserts all succeed.

2.4 Indexes

This indexing block encodes your expected access paths. Time-descending indexes support recent-activity views, apex-domain indexes support ranking and filtering, and trigram search enables flexible domain text matching at scale. Indexes are not optional decoration; they are what keeps reporting latency predictable as data volume grows.

sql
CREATE INDEX idx_q_time   ON dns_queries (queried_at DESC);
CREATE INDEX idx_q_apex   ON dns_queries (apex_domain, queried_at DESC);
CREATE INDEX idx_q_client ON dns_queries (client_ip, queried_at DESC);
CREATE INDEX idx_d_apex   ON dns_daily_stats (apex_domain, stat_date DESC);
CREATE INDEX idx_d_hits   ON dns_daily_stats (stat_date, hit_count DESC);
CREATE INDEX idx_w_domain ON dns_weekly_stats (apex_domain, week_start DESC);
CREATE INDEX idx_m_domain ON dns_monthly_stats (apex_domain, year_month DESC);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_d_trgm ON dns_daily_stats USING gin (apex_domain gin_trgm_ops);
COMMIT;
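The gin_trgm_ops index works by comparing three-character substrings. A hypothetical Python sketch of the idea (pg_trgm's real normalization pads and lowercases slightly differently, but the shape is the same):

```python
def trigrams(s: str) -> set:
    """Three-character substrings, padded roughly the way pg_trgm does."""
    s = '  ' + s.lower() + ' '
    return {s[i:i+3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets -- the shape of pg_trgm's % operator."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity('googlevideo.com', 'googleapis.com') >
      similarity('googlevideo.com', 'netflix.com'))   # True
```

This is why a search for 'google' ranks googlevideo.com and googleapis.com together without any LIKE wildcards.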
Chapter 2 Complete

Five-table schema built, indexed, 26 domain categories seeded.

Chapter 03

The Log Ingestor — Python Daemon

Teaching Lens

This lesson teaches ingestion as a resilient ETL loop: read, parse, normalize, and write.

  • You verify parser tolerance against noisy log variants.
  • You verify batch settings balance latency and throughput.
  • You verify systemd restart behavior after failures and reboots.
  • You verify failures in order: file access, parse rate, then database auth.

3.1 Python Setup

This setup block creates a reproducible runtime for ingestion code. System packages install Python tooling, the dedicated directory isolates operational files, the virtual environment freezes dependency scope, and psycopg2-binary provides PostgreSQL connectivity. Reproducible environments reduce drift and make debugging repeatable across hosts.

bash
sudo apt-get install -y python3 python3-pip python3-venv
sudo mkdir -p /opt/dns-ingestor && sudo chown $USER:$USER /opt/dns-ingestor
cd /opt/dns-ingestor && python3 -m venv venv
source venv/bin/activate && pip install psycopg2-binary

This lesson teaches dependency isolation through a dedicated Python virtual environment.

You verify the virtual environment works and psycopg2-binary is installed inside it.

3.2 The Script

This edit command opens the ingestor source file location used by systemd later. Keeping the script in a stable operational path means your service unit, logs, and maintenance procedures all point to one canonical implementation instead of ad-hoc copies.

bash
nano /opt/dns-ingestor/ingestor.py

This Python block implements a resilient ETL loop. It parses resolver logs into structured fields, reconnects on transient database failures, batches inserts for throughput efficiency, and continuously tails rotating log files. Read it as a production loop: input normalization, controlled writes, observability, and graceful restart behavior.

python
/opt/dns-ingestor/ingestor.py
#!/usr/bin/env python3
import re, time, logging, os, signal, sys
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values

DB = {'dbname':'dns_analytics','user':'dns_user',
      'password':'ChangeThisPassword!','host':'127.0.0.1','port':5432}
LOG_FILE = '/var/log/named/queries.log'
BATCH_SIZE = 100
FLUSH_INTERVAL = 5

QUERY_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
    r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)

def apex(d):
    # Naive apex: keep the last two labels. Fine for google.com / nflxvideo.net;
    # multi-label suffixes such as .co.uk would need a public-suffix list.
    p = d.rstrip('.').split('.')
    return '.'.join(p[-2:]) if len(p)>=2 else d

def parse_ts(s):
    try: return datetime.strptime(s, '%d-%b-%Y %H:%M:%S')
    except ValueError: return datetime.utcnow()  # unparseable stamp: fall back to now

def parse(line):
    line = line.strip()
    if not line: return None
    m = QUERY_RE.match(line)
    if not m: return None
    d = m.groupdict()
    dom = d.get('domain','').lower().rstrip('.')
    if not dom: return None
    flags = d.get('flags','').strip()
    return {'queried_at':parse_ts(d['ts']),'client_ip':d.get('ip','0.0.0.0'),
            'client_port':int(d.get('port',0)),'domain':dom,'apex_domain':apex(dom),
            'qtype':d.get('qtype','A').upper(),'flags':flags,
            'response_rcode':'NOERROR','cached':'CL' in flags}

def connect():
    while True:
        try:
            c=psycopg2.connect(**DB); c.autocommit=False
            logging.info('DB connected'); return c
        except psycopg2.OperationalError as e:
            logging.error(f'DB: {e}'); time.sleep(10)

def flush(conn, batch):
    if not batch: return 0
    sql = ('INSERT INTO dns_queries '
           '(queried_at,client_ip,client_port,domain,apex_domain,'
           'qtype,flags,response_rcode,cached) VALUES %s ON CONFLICT DO NOTHING')
    rows = [(r['queried_at'],r['client_ip'],r['client_port'],r['domain'],
             r['apex_domain'],r['qtype'],r['flags'],r['response_rcode'],r['cached'])
            for r in batch]
    try:
        with conn.cursor() as cur: execute_values(cur,sql,rows,page_size=500)
        conn.commit(); return len(batch)
    except psycopg2.Error as e:
        logging.error(f'Insert: {e}'); conn.rollback(); return 0

def run(conn):
    batch,last_flush,total,inode,f = [],time.time(),0,0,None
    logging.info(f'Watching {LOG_FILE}')
    while True:
        try: cur_inode=os.stat(LOG_FILE).st_ino
        except FileNotFoundError: time.sleep(5); continue
        if cur_inode != inode:
            try:
                nf = open(LOG_FILE,'r')
                # Seek to end only on first open; after rotation, read the
                # fresh file from the start so no lines are skipped.
                if inode == 0: nf.seek(0,2)
                f, inode = nf, cur_inode
            except IOError: time.sleep(5); continue
        while True:
            line = f.readline()
            if not line: break
            rec = parse(line)
            if rec: batch.append(rec)
        now = time.time()
        if len(batch)>=BATCH_SIZE or (batch and now-last_flush>=FLUSH_INTERVAL):
            n=flush(conn,batch); total+=n
            logging.info(f'Flushed {n} | total {total:,}')
            batch,last_flush=[],now
        time.sleep(0.5)

if __name__=='__main__':
    os.makedirs('/var/log/dns-ingestor',exist_ok=True)
    logging.basicConfig(level=logging.INFO,
        format='%(asctime)s [%(levelname)s] %(message)s',
        handlers=[logging.StreamHandler(sys.stdout),
                  logging.FileHandler('/var/log/dns-ingestor/ingestor.log')])
    def bye(s,f): sys.exit(0)
    signal.signal(signal.SIGTERM,bye); signal.signal(signal.SIGINT,bye)
    run(connect())

This lesson teaches resilient tail-and-batch ingestion with reconnect behavior for transient failures.

You verify repeated flush logs and increasing database row counts during query activity.
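The flush trigger (batch full OR non-empty batch aged out) is the heart of the latency/throughput trade-off, and it can be isolated as a pure function using the guide's thresholds:

```python
BATCH_SIZE, FLUSH_INTERVAL = 100, 5   # same thresholds as the ingestor

def should_flush(batch_len: int, seconds_since_flush: float) -> bool:
    """Flush when the batch is full OR when a non-empty batch has aged out."""
    if batch_len >= BATCH_SIZE:
        return True
    return batch_len > 0 and seconds_since_flush >= FLUSH_INTERVAL

print(should_flush(100, 0.1))  # True  -- size trigger (busy hours)
print(should_flush(3, 6.0))    # True  -- age trigger (quiet hours)
print(should_flush(0, 60.0))   # False -- nothing to write
```

Raising BATCH_SIZE improves throughput at the cost of insert latency; lowering FLUSH_INTERVAL does the reverse.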

3.3 Systemd Service

This unit file turns the script into a managed service. Dependency ordering waits for network and database availability, restart policy recovers from failures, and fixed working/executable paths make startup deterministic. Systemd is the reliability contract that keeps ingestion alive without manual supervision.

ini
/etc/systemd/system/dns-ingestor.service
[Unit]
Description=DNS Query Log Ingestor -- code::core
After=network.target postgresql.service bind9.service
Requires=postgresql.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/dns-ingestor
ExecStart=/opt/dns-ingestor/venv/bin/python3 /opt/dns-ingestor/ingestor.py
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

These control commands register the new unit, start it immediately, and open a live log stream for first-run validation. Running them in order ensures the service definition is reloaded before activation, which prevents stale unit metadata from hiding configuration mistakes.

bash
sudo systemctl daemon-reload
sudo systemctl enable --now dns-ingestor
sudo journalctl -u dns-ingestor -f

This lesson teaches service activation workflow from unit reload to live log observation.

You verify dns-ingestor is active and not entering a crash loop.

3.4 End-to-End Test

This test block deliberately generates DNS traffic, waits for the batch flush window, then inspects the newest stored rows. It validates that resolver output, parser logic, and database writes are functioning as one pipeline. When this check passes, your telemetry path is proven, not assumed.

bash
for d in google.com youtube.com netflix.com instagram.com microsoft.com; do
    dig @192.168.234.128 $d A +short
done
sleep 6
psql -U dns_user -d dns_analytics -h localhost \
  -c "SELECT domain,qtype,queried_at FROM dns_queries ORDER BY queried_at DESC LIMIT 10;"
Expected: google.com | A | 2026-01-11 08:20:01+00 (10 rows)
Pipeline is LIVE

BIND9 → Python → PostgreSQL. Every LAN DNS query is now captured and stored.

Chapter 04

Rollup Jobs — Daily, Weekly & Monthly

All scripts use ON CONFLICT DO UPDATE — idempotent, safe to rerun at any time.

Teaching Lens

This lesson teaches rollups as a performance strategy: pre-compute stable summaries instead of repeatedly scanning raw volume.

  • You verify daily rollups support operational reporting.
  • You verify weekly rollups support trend and planning analysis.
  • You verify monthly rollups support forecasting and executive views.
  • You verify idempotent SQL enables safe rerun-based recovery.
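The idempotency claim is easy to demonstrate in miniature. SQLite shares the ON CONFLICT DO UPDATE syntax (since 3.24), so this sketch stands in for the Postgres rollups; the table and values are made up:

```python
import sqlite3

# Miniature of rollup idempotency: rerunning the same upsert for the same
# key window replaces values instead of duplicating rows.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE daily (stat_date TEXT, domain TEXT, hits INTEGER, '
           'PRIMARY KEY (stat_date, domain))')

upsert = ('INSERT INTO daily VALUES (?, ?, ?) '
          'ON CONFLICT (stat_date, domain) DO UPDATE SET hits = excluded.hits')
for _ in range(3):                                   # "rerun the job" 3 times
    db.execute(upsert, ('2026-01-11', 'google.com', 4821))

rows, hits = db.execute('SELECT COUNT(*), MAX(hits) FROM daily').fetchone()
print(rows, hits)   # 1 4821 -- one row, stable value, no duplicates
```

This is why a failed cron run is recoverable by simply running the same SQL file again.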

This command creates a dedicated directory for rollup SQL artifacts. Keeping scheduled SQL in a single controlled location improves auditability and makes cron references explicit and maintainable.

bash
sudo mkdir -p /opt/dns-ingestor/sql

4.1 Daily Rollup

This daily rollup computes yesterday's operational summary and hourly distribution from raw events. The time-zone conversion anchors reporting to local business time, and conflict updates make reruns safe by replacing prior values for the same key window. This is your primary daily telemetry compression stage.

sql
/opt/dns-ingestor/sql/rollup_daily.sql
-- Run at 01:00 every morning
INSERT INTO dns_daily_stats
    (stat_date,domain,apex_domain,qtype,hit_count,unique_clients,
     nxdomain_count,servfail_count,first_seen,last_seen)
SELECT
    DATE(queried_at AT TIME ZONE 'Africa/Kampala') AS stat_date,
    domain, apex_domain, qtype,
    COUNT(*), COUNT(DISTINCT client_ip),
    COUNT(*) FILTER (WHERE response_rcode='NXDOMAIN'),
    COUNT(*) FILTER (WHERE response_rcode='SERVFAIL'),
    MIN(queried_at), MAX(queried_at)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
  AND queried_at  <  CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,domain,apex_domain,qtype
ON CONFLICT (stat_date,domain,qtype) DO UPDATE SET
    hit_count=EXCLUDED.hit_count, unique_clients=EXCLUDED.unique_clients,
    nxdomain_count=EXCLUDED.nxdomain_count, servfail_count=EXCLUDED.servfail_count,
    first_seen=EXCLUDED.first_seen, last_seen=EXCLUDED.last_seen;

INSERT INTO dns_hourly_heatmap (stat_date,hour_of_day,apex_domain,query_count)
SELECT DATE(queried_at AT TIME ZONE 'Africa/Kampala'),
    EXTRACT(HOUR FROM queried_at AT TIME ZONE 'Africa/Kampala')::SMALLINT,
    apex_domain, COUNT(*)
FROM dns_queries
WHERE queried_at >= (CURRENT_DATE-INTERVAL '1 day') AT TIME ZONE 'Africa/Kampala'
  AND queried_at  <  CURRENT_DATE AT TIME ZONE 'Africa/Kampala'
GROUP BY 1,2,3
ON CONFLICT (stat_date,hour_of_day,apex_domain) DO UPDATE SET query_count=EXCLUDED.query_count;
COMMIT;

This lesson teaches idempotent daily rollups for stable metrics and hourly demand summaries.

You verify rerunning the job updates the same window without duplicate counts.

4.2 Weekly Rollup

This weekly rollup aggregates daily records into planning-level metrics. It computes total demand, peak day, and optional category context per apex domain, which is useful for capacity and peering decisions. The CTE structure separates total calculations from peak detection so each step remains understandable and testable.

sql
/opt/dns-ingestor/sql/rollup_weekly.sql
-- Run Monday 02:00
INSERT INTO dns_weekly_stats
    (week_start,week_end,apex_domain,total_queries,unique_clients,peak_day,peak_day_count,category)
WITH totals AS (
    SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start,
           (DATE_TRUNC('week',stat_date)+INTERVAL '6 days')::DATE AS week_end,
           apex_domain, SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients
    FROM dns_daily_stats
    WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
      AND stat_date  <  DATE_TRUNC('week',CURRENT_DATE)::DATE
    GROUP BY 1,2,3
), peaks AS (
    SELECT DATE_TRUNC('week',stat_date)::DATE AS week_start, apex_domain,
           stat_date AS peak_day, SUM(hit_count) AS day_total,
           RANK() OVER (PARTITION BY DATE_TRUNC('week',stat_date),apex_domain
                        ORDER BY SUM(hit_count) DESC) AS rnk
    FROM dns_daily_stats
    WHERE stat_date >= DATE_TRUNC('week',CURRENT_DATE-INTERVAL '7 days')::DATE
      AND stat_date  <  DATE_TRUNC('week',CURRENT_DATE)::DATE
    GROUP BY 1,2,3
)
SELECT t.week_start,t.week_end,t.apex_domain,t.total_queries,t.unique_clients,
       p.peak_day,p.day_total,dc.category
FROM totals t
LEFT JOIN peaks p ON p.week_start=t.week_start AND p.apex_domain=t.apex_domain AND p.rnk=1
LEFT JOIN domain_categories dc ON dc.apex_domain=t.apex_domain
ON CONFLICT (week_start,apex_domain) DO UPDATE SET
    total_queries=EXCLUDED.total_queries,unique_clients=EXCLUDED.unique_clients,
    peak_day=EXCLUDED.peak_day,peak_day_count=EXCLUDED.peak_day_count;
COMMIT;

4.3 Monthly Rollup

This monthly rollup produces executive-scale trends with totals, average daily volume, and rank ordering. Because it reads from daily summaries instead of raw logs, it stays efficient while preserving consistent semantics. Ranking within month helps highlight dominant domains and shifting traffic priorities over time.

sql
/opt/dns-ingestor/sql/rollup_monthly.sql
-- Run 1st of month 03:00
INSERT INTO dns_monthly_stats (year_month,apex_domain,total_queries,unique_clients,avg_daily,rank_in_month,category)
WITH m AS (
    SELECT TO_CHAR(stat_date,'YYYY-MM') AS year_month, apex_domain,
           SUM(hit_count) AS total_queries, MAX(unique_clients) AS unique_clients,
           ROUND(AVG(hit_count),2) AS avg_daily
    FROM dns_daily_stats
    WHERE TO_CHAR(stat_date,'YYYY-MM')=TO_CHAR(CURRENT_DATE-INTERVAL '1 month','YYYY-MM')
    GROUP BY 1,2
)
SELECT m.year_month,m.apex_domain,m.total_queries,m.unique_clients,m.avg_daily,
       RANK() OVER (PARTITION BY m.year_month ORDER BY m.total_queries DESC), dc.category
FROM m LEFT JOIN domain_categories dc ON dc.apex_domain=m.apex_domain
ON CONFLICT (year_month,apex_domain) DO UPDATE SET
    total_queries=EXCLUDED.total_queries,avg_daily=EXCLUDED.avg_daily,rank_in_month=EXCLUDED.rank_in_month;
COMMIT;

4.4 Cron Schedule

This cron block operationalizes the rollup lifecycle. It schedules daily, weekly, and monthly jobs with log capture for post-run diagnostics, and pre-creates tomorrow's partition to avoid insert failures at day boundaries. Scheduling converts analytics from manual scripts into dependable operations.

cron
sudo crontab -e
# DNS Analytics Rollups
0 1 * * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_daily.sql >> /var/log/dns-ingestor/daily.log 2>&1
0 2 * * 1  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_weekly.sql >> /var/log/dns-ingestor/weekly.log 2>&1
0 3 1 * *  psql -U dns_user -d dns_analytics -h localhost -f /opt/dns-ingestor/sql/rollup_monthly.sql >> /var/log/dns-ingestor/monthly.log 2>&1
# Create tomorrow's partition at 23:30
30 23 * * * psql -U dns_user -d dns_analytics -h localhost -c "CREATE TABLE IF NOT EXISTS dns_queries_$(date -d tomorrow +\%Y\%m\%d) PARTITION OF dns_queries FOR VALUES FROM (CURRENT_DATE+INTERVAL '1 day') TO (CURRENT_DATE+INTERVAL '2 days');" >> /var/log/dns-ingestor/partition.log 2>&1
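The 23:30 job's naming scheme (dns_queries_YYYYMMDD for tomorrow's date) can be sketched in Python; partition_ddl is a hypothetical helper, and it uses literal date bounds for clarity where the cron line uses CURRENT_DATE arithmetic:

```python
from datetime import date, timedelta

def partition_ddl(today: date) -> str:
    """Build a statement shaped like the one the 23:30 cron job issues."""
    tomorrow = today + timedelta(days=1)
    day_after = today + timedelta(days=2)
    return (f"CREATE TABLE IF NOT EXISTS dns_queries_{tomorrow:%Y%m%d} "
            f"PARTITION OF dns_queries FOR VALUES "
            f"FROM ('{tomorrow}') TO ('{day_after}')")

ddl = partition_ddl(date(2026, 1, 11))
print(ddl.split(' PARTITION')[0])
# CREATE TABLE IF NOT EXISTS dns_queries_20260112
```

If this job ever fails to run, inserts at the next day boundary fail with "no partition of relation dns_queries found", which is why it gets its own log file.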
Chapter 05

SQL Analytics — Reading Your Traffic

Teaching Lens

This lesson teaches SQL as decision logic for peering, caching, abuse detection, and customer experience.

  • You verify top domains to identify concentration risk.
  • You verify category mix to understand bandwidth destinations.
  • You verify hourly heatmaps to plan cache warm-up windows.
  • You verify NXDOMAIN rates to detect misconfigurations or bot noise.

5.1 Top 50 Domains (7-Day)

This query measures concentration: which apex domains consume most resolver capacity over the last week. Percentage and average-per-day columns turn raw counts into comparative signals you can use for cache policy, upstream negotiation, and anomaly triage.

sql
SELECT ds.apex_domain, dc.category, dc.provider, dc.cache_priority,
    SUM(ds.hit_count) AS total_queries,
    ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct,
    ROUND(AVG(ds.hit_count),0) AS avg_per_day
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain=ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY ds.apex_domain,dc.category,dc.provider,dc.cache_priority
ORDER BY total_queries DESC LIMIT 50;

This lesson teaches concentration analysis to identify domains that dominate resolver workload.

You verify percentages are coherent and leaders match expected user behavior.
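The pct column's window-function trick is just "each domain's share of the grand total". In plain Python, with made-up counts:

```python
# Each domain's share of the grand total -- what SUM(...) * 100.0 /
# SUM(SUM(...)) OVER () computes in the SQL above.
counts = {'googlevideo.com': 48_000, 'netflix.com': 30_000, 'ubuntu.com': 2_000}
total = sum(counts.values())

pct = {d: round(c * 100.0 / total, 2) for d, c in counts.items()}
print(pct['googlevideo.com'])          # 60.0
print(round(sum(pct.values()), 2))     # 100.0 -- shares are coherent
```

"Percentages sum to roughly 100" is the coherence check to run on the real query's output.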

5.2 By Category

This category query collapses domain-level noise into service classes such as streaming, social, and work. It helps learners see how traffic mix reflects subscriber behavior and where optimization effort should be focused at a service-class level rather than individual hostnames.

sql
SELECT COALESCE(dc.category,'uncategorised') AS category,
    COUNT(DISTINCT ds.apex_domain) AS domains, SUM(ds.hit_count) AS total_queries,
    ROUND(SUM(ds.hit_count)*100.0/SUM(SUM(ds.hit_count)) OVER(),2) AS pct
FROM dns_daily_stats ds
LEFT JOIN domain_categories dc ON dc.apex_domain=ds.apex_domain
WHERE ds.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1 ORDER BY total_queries DESC;

5.3 24-Hour Heatmap

This heatmap query reveals diurnal demand patterns by hour. The generated bar column is a quick visual proxy for load intensity, useful for identifying cache warm-up windows, maintenance windows, and periods where resolver stress is highest.

sql
SELECT LPAD(h.hour_of_day::TEXT,2,'0')||':00' AS hour,
    SUM(h.query_count) AS queries,
    REPEAT('█',(SUM(h.query_count)/NULLIF(MAX(SUM(h.query_count)) OVER()/40,0))::INT) AS bar
FROM dns_hourly_heatmap h
WHERE h.stat_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY h.hour_of_day ORDER BY h.hour_of_day;
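The bar column's scaling can be expressed more directly in Python; this sketch also sidesteps the SQL's integer-division wrinkle, where a small grand total can zero the divisor. The hour counts are made up:

```python
def bar(count: int, max_count: int, width: int = 40) -> str:
    """Scale a count to a block-character bar, like REPEAT('█', n) in the SQL."""
    if max_count == 0:
        return ''
    return '█' * round(count * width / max_count)

counts = {20: 120_000, 12: 60_000, 4: 6_000}   # hour -> queries (illustrative)
for hour, c in sorted(counts.items()):
    print(f'{hour:02d}:00 {bar(c, max(counts.values()))}')
```

The busiest hour always renders at full width, so the chart is relative, not absolute.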

5.4 NXDOMAIN Analysis

This failure-focused query identifies domains with significant NXDOMAIN rates, a common signal for typos, stale configs, blocked telemetry, or bot behavior. Ranking by failure percentage helps prioritize remediation where user impact or wasted resolver effort is highest.

sql
SELECT apex_domain, SUM(hit_count) AS total, SUM(nxdomain_count) AS failures,
    ROUND(SUM(nxdomain_count)*100.0/NULLIF(SUM(hit_count),0),1) AS fail_pct
FROM dns_daily_stats
WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days' AND nxdomain_count > 0
GROUP BY apex_domain HAVING SUM(nxdomain_count)>10
ORDER BY fail_pct DESC LIMIT 30;
Chapter 06

Scaling to 2 Million Subscribers

2M subs ≈ 600M queries/day ≈ 7,000 qps sustained. Text log files break at this scale. The data model stays identical; the transport layer changes.
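The headline numbers are back-of-envelope arithmetic; the 300 queries per subscriber per day and the peak factor below are assumptions, not measurements. A quick sketch:

```python
SUBSCRIBERS = 2_000_000
QUERIES_PER_SUB_PER_DAY = 300   # assumed average; measure yours with Chapter 5 queries
SECONDS_PER_DAY = 86_400

daily = SUBSCRIBERS * QUERIES_PER_SUB_PER_DAY   # 600,000,000 queries/day
sustained_qps = daily / SECONDS_PER_DAY         # ~6,944 qps sustained
peak_qps = sustained_qps * 2.5                  # evening peak factor (assumption)

print(f"{daily:,} queries/day ≈ {sustained_qps:,.0f} qps sustained, "
      f"plan for ≈ {peak_qps:,.0f} qps at peak")
```

Capacity planning targets the peak, not the average, which is why the production architecture below is sized well above 7K qps.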

Teaching Lens

This lesson teaches scale migration without changing metric semantics, so reports stay comparable from lab to production.

  • You verify transport migration from text logs to binary event streams.
  • You verify ingest migration from single process to distributed workers.
  • You verify database migration to time-series optimization patterns.
  • You verify privacy controls before volume growth increases risk.

6.1 Production Architecture

Resolver Layer:  Anycast VIP (resolver.isp.net) → 8x Unbound (DNS cluster) → dnstap (binary protobuf)
Ingest Layer:    Kafka (dns-queries topic) → Faust Workers (x8 processors) → TimescaleDB (hypertables)
Layer       | VM (this guide)   | Production (2M subs)
DNS daemon  | BIND9 text log    | Unbound x8 + dnstap
Transport   | File on disk      | Kafka 3-broker cluster
Ingestor    | 1 Python process  | Faust workers x8
Database    | PostgreSQL 16     | TimescaleDB on NVMe RAID
Volume      | ~50K/day          | 600M/day (7K qps)
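Under the hood each stream worker is doing grouped counting over its partition of the event stream. A broker-free sketch of that aggregation loop — in production the events would arrive as decoded dnstap messages from the Kafka dns-queries topic via a Faust agent, and the field names here are assumptions:

```python
from collections import Counter

def aggregate(events):
    """Stand-in for one stream worker: count hits per (apex_domain, qtype).

    In production this loop would be a Faust agent consuming the Kafka
    'dns-queries' topic; here an in-memory list plays the partition.
    """
    counts = Counter()
    for ev in events:
        counts[(ev["apex_domain"], ev["qtype"])] += 1
    return counts

events = [
    {"apex_domain": "netflix.com", "qtype": "A"},
    {"apex_domain": "netflix.com", "qtype": "AAAA"},
    {"apex_domain": "netflix.com", "qtype": "A"},
]
print(aggregate(events)[("netflix.com", "A")])  # 2
```

Because Kafka partitions by key, each worker sees a disjoint slice of domains, so these per-worker counters can be flushed to TimescaleDB without cross-worker coordination.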

6.2 TimescaleDB

This command block installs time-series database capabilities for high-volume retention and query efficiency. Tuning adjusts PostgreSQL settings for Timescale workloads, and restarting applies those changes. This is the foundation for scaling from lab-sized daily volumes to sustained production ingestion.

bash
# assumes the TimescaleDB apt repository is already configured on this host
sudo apt-get install -y timescaledb-2-postgresql-16
sudo timescaledb-tune --quiet --yes && sudo systemctl restart postgresql

This SQL block enables hypertable behavior, compression, and continuous aggregates while keeping your logical schema intact. It changes storage and refresh mechanics, not metric meaning, so dashboards and lessons remain comparable before and after scale transition.

sql
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('dns_queries','queried_at',
    chunk_time_interval=>INTERVAL '1 day',if_not_exists=>TRUE);
ALTER TABLE dns_queries SET (timescaledb.compress,
    timescaledb.compress_orderby='queried_at DESC',
    timescaledb.compress_segmentby='apex_domain');
SELECT add_compression_policy('dns_queries',INTERVAL '7 days');
-- Note: some TimescaleDB versions reject COUNT(DISTINCT ...) inside continuous
-- aggregates; if creation fails, drop unique_clients or use an approximate
-- distinct count from the timescaledb_toolkit extension.
CREATE MATERIALIZED VIEW dns_daily_live WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day',queried_at) AS bucket, apex_domain, qtype,
       COUNT(*) AS hit_count, COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries GROUP BY bucket,apex_domain,qtype;
SELECT add_continuous_aggregate_policy('dns_daily_live',
    start_offset=>INTERVAL '3 days',end_offset=>INTERVAL '1 hour',
    schedule_interval=>INTERVAL '1 hour');

6.3 dnstap

This configuration switches resolver event export from text logs to structured binary messages. At large scale, dnstap reduces parsing overhead and improves event fidelity for downstream stream processors. Enabling both query and response messages preserves context needed for richer analytics and troubleshooting.

ini
unbound.conf additions
dnstap:
    dnstap-enable: yes
    dnstap-socket-path: "/var/run/unbound/dnstap.sock"
    dnstap-log-client-query-messages: yes
    dnstap-log-client-response-messages: yes
Privacy at ISP scale

Store only the /24 subnet, not the full /32 client IP. Aggregated domain counts contain no PII.
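One way to apply that rule at ingest time, before the row ever reaches the database — a minimal sketch using the standard-library ipaddress module (the function name is ours):

```python
import ipaddress

def anonymize_client(ip: str) -> str:
    """Reduce an IPv4 client address to its /24 subnet before storage.

    Dropping the host octet keeps per-neighbourhood analytics useful
    while avoiding full /32 PII retention.
    """
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net)

print(anonymize_client("192.168.234.57"))  # 192.168.234.0/24
```

Applying this in the ingestor (rather than a later scrub job) means the raw table never contains full client addresses in the first place.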

Chapter 07

MikroTik Home Network Lesson (10.10.10.0/24)

This lesson adapts the platform to a MikroTik home lab where the router IP is 10.10.10.10 and local DNS namespace is codeandcore.home.

Teaching Lens

This lesson teaches hybrid DNS operation: authoritative local naming plus recursive analytics in one pipeline.

  • You verify MikroTik distributes DNS and search domain via DHCP.
  • You verify BIND serves codeandcore.home authoritatively.
  • You verify zone records are mirrored into PostgreSQL for governance.
  • You verify local queries are analyzed alongside internet traffic.

7.1 Configure MikroTik DHCP + DNS Domain

This RouterOS block is the client-distribution layer of your naming system. DHCP options push DNS server and search domain to every device, so clients learn to ask 10.10.10.10 and append codeandcore.home automatically. Without this, your local zone can exist correctly on BIND but appear "broken" to users because clients never query it correctly.

routeros
MikroTik terminal
/ip dhcp-server network set [find address="10.10.10.0/24"] \
    gateway=10.10.10.10 dns-server=10.10.10.10 domain=codeandcore.home

/ip dns set allow-remote-requests=yes servers=10.10.10.10

/ip dns static add name=router.codeandcore.home address=10.10.10.10 ttl=1d

This lesson teaches client-side DNS distribution through DHCP domain and resolver settings.

You verify renewed clients receive DNS 10.10.10.10 and suffix codeandcore.home.

7.2 Create Authoritative Zone: codeandcore.home

This block in /etc/bind/named.conf.local is the zone declaration, which tells BIND "I am authoritative for this namespace." type master (newer BIND releases also accept the synonym type primary) marks this server as the source of truth, and the file path points to the zone database containing the records. allow-update { none; } disables dynamic updates, so records change only through deliberate admin edits.

named.conf
/etc/bind/named.conf.local
zone "codeandcore.home" {
    type master;
    file "/etc/bind/db.codeandcore.home";
    allow-update { none; };
};

This zone file block is the authoritative DNS database itself. The SOA record sets lifecycle rules for secondaries and cache behavior, while NS and A records map stable names to device IPs in your 10.10.10.0/24 lab. Think of this as your local naming contract: every hostname your users depend on should be explicitly represented here.

dns-zone
/etc/bind/db.codeandcore.home
$TTL 86400
@   IN  SOA ns1.codeandcore.home. admin.codeandcore.home. (
        2026051201 ; serial
        3600       ; refresh
        1800       ; retry
        1209600    ; expire
        86400 )    ; minimum

@       IN  NS  ns1.codeandcore.home.
ns1     IN  A   10.10.10.10
router  IN  A   10.10.10.10
dns     IN  A   10.10.10.10
nas     IN  A   10.10.10.20
cam1    IN  A   10.10.10.30
printer IN  A   10.10.10.40

This lesson teaches authoritative local namespace design with deterministic host mappings.

You verify local records resolve with NOERROR responses from 10.10.10.10.

7.3 Store Zone Records in PostgreSQL

This SQL block mirrors DNS inventory into the analytics database so operations can audit and report on intended state versus observed query behavior. The unique constraint prevents duplicate logical records, and indexes support fast lookups by zone and record type. This turns static DNS files into queryable governance data for lessons and reviews.

sql
dns_analytics schema additions
CREATE TABLE IF NOT EXISTS dns_zone_records (
    id BIGSERIAL PRIMARY KEY,
    zone_name TEXT NOT NULL,
    fqdn TEXT NOT NULL,
    record_type VARCHAR(10) NOT NULL,
    record_value TEXT NOT NULL,
    ttl INTEGER NOT NULL DEFAULT 86400,
    source TEXT NOT NULL DEFAULT 'bind',
    active BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (zone_name, fqdn, record_type, record_value)
);

CREATE INDEX IF NOT EXISTS idx_zone_fqdn ON dns_zone_records (zone_name, fqdn);
CREATE INDEX IF NOT EXISTS idx_zone_type ON dns_zone_records (zone_name, record_type);

INSERT INTO dns_zone_records (zone_name, fqdn, record_type, record_value, ttl) VALUES
('codeandcore.home','ns1.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','router.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','dns.codeandcore.home','A','10.10.10.10',86400),
('codeandcore.home','nas.codeandcore.home','A','10.10.10.20',86400),
('codeandcore.home','cam1.codeandcore.home','A','10.10.10.30',86400),
('codeandcore.home','printer.codeandcore.home','A','10.10.10.40',86400)
ON CONFLICT DO NOTHING;

This lesson teaches treating DNS inventory as queryable governance data.

You verify seeded zone records are returned from dns_zone_records for codeandcore.home.
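Keeping the table in step with the zone file can be scripted instead of re-typed. A minimal sketch of a parser for the simple A records used above — it deliberately ignores the SOA/NS plumbing and assumes the one-record-per-line layout of db.codeandcore.home:

```python
import re

# Matches lines like: "nas     IN  A   10.10.10.20"
A_RECORD = re.compile(r"^(\S+)\s+IN\s+A\s+(\d+\.\d+\.\d+\.\d+)\s*$", re.MULTILINE)

def zone_a_records(zone_text: str, origin: str):
    """Yield (fqdn, ip) pairs for simple A records, expanding relative names."""
    for name, ip in A_RECORD.findall(zone_text):
        fqdn = origin if name == "@" else f"{name}.{origin}"
        yield fqdn, ip

zone = """\
ns1     IN  A   10.10.10.10
nas     IN  A   10.10.10.20
"""
print(list(zone_a_records(zone, "codeandcore.home")))
```

Feeding the pairs into the INSERT ... ON CONFLICT DO NOTHING statement above makes the sync idempotent: rerunning the script after a zone edit only adds the new records.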

7.4 Analytics for Local Zone + Full Traffic

These queries teach two different analytical questions. The first measures demand for local names inside codeandcore.home, which helps validate adoption of your local namespace. The second compares local traffic against internet-bound lookups, giving you an immediate ratio of internal service usage versus external dependency.

sql
local zone insights
-- Local zone traffic (last 7 days)
SELECT domain, qtype, COUNT(*) AS hits, COUNT(DISTINCT client_ip) AS unique_clients
FROM dns_queries
WHERE domain LIKE '%.codeandcore.home'
  AND queried_at >= NOW() - INTERVAL '7 days'
GROUP BY domain, qtype
ORDER BY hits DESC;

-- Compare local-zone vs internet queries
SELECT
  CASE WHEN domain LIKE '%.codeandcore.home' THEN 'local_zone' ELSE 'internet' END AS traffic_class,
  COUNT(*) AS total_queries,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct
FROM dns_queries
WHERE queried_at >= NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY total_queries DESC;

This lesson teaches local-versus-internet query mix analysis for the home lab zone.

You verify local hosts such as nas.codeandcore.home appear in recent query outputs.

Chapter 7 Complete

MikroTik clients use 10.10.10.10, local zone codeandcore.home resolves authoritatively, and zone records plus analytics are stored in PostgreSQL.

Appendix A

Firewall (UFW)

Teaching Lens

This lesson teaches security-by-segmentation: expose only what each service requires.

  • You verify DNS is reachable only from trusted private ranges.
  • You verify PostgreSQL remains local unless intentionally tunneled.
  • You verify default-deny inbound posture is preserved.
  • You verify firewall policy matches resolver and analytics design.

This firewall rule set enforces minimum-exposure networking. It allows DNS only from private address spaces used by your clients, keeps PostgreSQL loopback-only, and denies unsolicited inbound traffic by default. The goal is a resolver that serves your network without becoming an internet-facing attack surface.

bash
sudo ufw default deny incoming && sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow from 192.168.0.0/16 to any port 53 proto udp
sudo ufw allow from 192.168.0.0/16 to any port 53 proto tcp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto udp
sudo ufw allow from 10.0.0.0/8 to any port 53 proto tcp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto udp
sudo ufw allow from 172.16.0.0/12 to any port 53 proto tcp
sudo ufw allow from 127.0.0.1 to any port 5432 proto tcp
sudo ufw enable
Appendix B

Troubleshooting

Teaching Lens

This lesson teaches path-based troubleshooting from resolver output to final reports.

  • You verify resolver output before checking ingestion.
  • You verify parser and service health before database assumptions.
  • You verify database writes before rollup and report checks.
  • You verify each layer in order to reduce diagnosis time.
Symptom                | Check              | Command
BIND9 won't start      | Config syntax      | sudo named-checkconf
No queries.log         | Log dir ownership  | ls -la /var/log/named/
dig returns REFUSED    | Client not in ACL  | sudo journalctl -u bind9 -n 20
Ingestor not inserting | Wrong DB password  | sudo journalctl -u dns-ingestor -n 40
Partition error        | Partition absent   | Create partition manually
psql auth fail         | pg_hba.conf peer   | Add md5 line below
Empty daily_stats      | Rollup not run yet | Run rollup_daily.sql manually

This repair snippet addresses a common local-auth mismatch where PostgreSQL expects peer authentication instead of password-based login for your application role. Adding the loopback md5 rule and reloading PostgreSQL aligns server auth with your ingestor connection method without a full database restart.

bash
fix pg_hba.conf if psql auth fails
sudo nano /etc/postgresql/16/main/pg_hba.conf
# Add ABOVE existing local lines:
# host  dns_analytics  dns_user  127.0.0.1/32  md5
sudo systemctl reload postgresql
Done

BIND9 capturing. Python ingesting. PostgreSQL storing. Cron aggregating. Run the Chapter 5 queries after 7 days and you have everything to design caching, CDN peering, and BGP strategy.