smaug/PLAN.md
Claude 7e9221a914 feat: add PLAN.md and insider copytrade POC implementation
- PLAN.md: full implementation plan from issue
- config.py: configurable thresholds, API keys via .env
- ingestion/: EDGAR RSS poller + Form 4 XML parser
- db/: SQLite schema + interface (WAL mode)
- signals/: filter engine (buy/10b5-1/value/role) + cluster detector
- alerts/: Slack webhook alert with score gating
- broker/: Alpaca paper/live trade execution
- backtest/: historical signal backtesting with yfinance
- main.py: CLI entrypoint (run | fetch-once | backtest)
2026-05-04 16:15:22 +00:00


Insider Copytrade System -- Implementation Plan

Description

A personal system that monitors SEC EDGAR Form 4 filings in near real time (polling every 10 minutes), filters for high-quality insider buying signals, alerts via Slack, and optionally executes trades automatically through Alpaca's paper or live trading API.

The system is fully self-hosted, uses only free/public data sources, and requires no third-party data subscriptions.


Background

Company insiders (executives, directors, >10% shareholders) must file SEC Form 4 within 2 business days of any trade. This is public data via SEC EDGAR. The signal value of insider buying is academically documented -- executives buying their own stock with personal capital is a meaningful vote of confidence, particularly when:

  • Multiple insiders buy simultaneously (cluster signal)
  • The trade is unplanned (not a 10b5-1 scheduled plan)
  • The company is small/mid-cap (less institutional arbitrage)

The edge over copying politicians' trades: Form 4 has a 2-business-day disclosure lag versus up to 45 days for congressional disclosures, and the signal is company-specific rather than sector-level.

Key risk: This signal is publicly known and tracked. The edge is in filtering quality and execution speed, not data exclusivity. Large-cap Form 4 signals are arbitraged quickly. Focus on small/mid-cap, clustered, unplanned buys.


System Outline

SEC EDGAR RSS Feed (poll every 10 min)
        |
   [Ingestion Layer]
        |
   Parse Form 4 XML
        |
   [Filter Engine]
    - Buy only (flag = A)
    - Exclude 10b5-1 plans
    - Min transaction size
    - Role weighting
    - Cluster detection
        |
   SQLite Database
        |
   ┌────────────┬──────────────┐
   |            |              |
[Backtester] [Slack Alert]  [Alpaca API]
             (manual)      (paper/live)

Actionables

Phase 1 -- Data Ingestion

Goal: Reliably pull and parse Form 4 filings as they appear.

Tasks:

  1. Set up project structure
insider-copytrade/
  ingestion/
    edgar_poller.py      # polls EDGAR RSS
    form4_parser.py      # parses XML -> structured dict
  db/
    schema.sql
    db.py                # SQLite interface
  signals/
    filter_engine.py     # applies signal filters
    cluster_detector.py
  alerts/
    slack_alert.py
  broker/
    alpaca_client.py
  backtest/
    backtest.py
  config.py
  main.py
  2. Poll EDGAR RSS for Form 4 filings every 10 minutes:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=4&dateb=&owner=include&count=40&search_text=

Append &output=atom to this URL to get the Atom feed instead of the HTML listing.

SEC also provides a structured latest filings feed:

https://efts.sec.gov/LATEST/search-index?q=&forms=4
  3. For each new filing, fetch and parse the XML document. Key fields to extract:

    • issuerTradingSymbol (ticker)
    • rptOwnerName, officerTitle (insider name + role)
    • transactionDate
    • transactionAcquiredDisposedCode (A = acquired, D = disposed)
    • transactionCode (P = open-market purchase, S = open-market sale; A/D alone does not distinguish a purchase from a grant or award)
    • transactionShares, transactionPricePerShare
    • total value (not a Form 4 field; compute as transactionShares * transactionPricePerShare)
    • footnotes (check for "10b5-1" mention)
    • sharesOwnedFollowingTransaction
  4. Store raw filing XML + parsed fields. Track accessionNumber as dedup key.
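
The field extraction described above could look roughly like the following sketch. It uses only the standard library's ElementTree (the plan lists lxml, which exposes a similar API). The sample document is trimmed and made up for illustration, and the helper names `_text` and `parse_form4` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up Form 4 document used only to exercise the parser.
SAMPLE_FORM4 = """<ownershipDocument>
  <issuer>
    <issuerCik>0000000001</issuerCik>
    <issuerTradingSymbol>ACME</issuerTradingSymbol>
  </issuer>
  <reportingOwner>
    <reportingOwnerId><rptOwnerName>Smith John</rptOwnerName></reportingOwnerId>
    <reportingOwnerRelationship><officerTitle>CEO</officerTitle></reportingOwnerRelationship>
  </reportingOwner>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2025-05-01</value></transactionDate>
      <transactionCoding><transactionCode>P</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>10000</value></transactionShares>
        <transactionPricePerShare><value>14.50</value></transactionPricePerShare>
        <transactionAcquiredDisposedCode><value>A</value></transactionAcquiredDisposedCode>
      </transactionAmounts>
      <postTransactionAmounts>
        <sharesOwnedFollowingTransaction><value>60000</value></sharesOwnedFollowingTransaction>
      </postTransactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
  <footnotes/>
</ownershipDocument>"""


def _text(node, path):
    """Return the stripped text at `path`, or None if the element is missing."""
    found = node.find(path)
    return found.text.strip() if found is not None and found.text else None


def parse_form4(xml_text):
    """Return one dict per non-derivative transaction in the filing."""
    root = ET.fromstring(xml_text)
    # Concatenate all footnote text so the 10b5-1 check sees every footnote.
    footnotes = " ".join("".join(f.itertext()) for f in root.iter("footnote"))
    rows = []
    for tx in root.iter("nonDerivativeTransaction"):
        shares = float(_text(tx, ".//transactionShares/value") or 0)
        price = float(_text(tx, ".//transactionPricePerShare/value") or 0)
        rows.append({
            "ticker": _text(root, ".//issuerTradingSymbol"),
            "cik": _text(root, ".//issuerCik"),
            "insider_name": _text(root, ".//rptOwnerName"),
            "role": _text(root, ".//officerTitle"),
            "transaction_date": _text(tx, ".//transactionDate/value"),
            "transaction_code": _text(tx, ".//transactionCode"),
            "flag": _text(tx, ".//transactionAcquiredDisposedCode/value"),
            "shares": shares,
            "price": price,
            "total_value": shares * price,  # computed, not read from the XML
            "is_10b51": int("10b5-1" in footnotes.lower()),
            "post_tx_shares": float(
                _text(tx, ".//sharesOwnedFollowingTransaction/value") or 0
            ),
        })
    return rows
```

Calling `parse_form4(SAMPLE_FORM4)` yields a single row whose keys line up with the columns of the filings table. A filing with several reporting owners would need extra handling that this sketch omits.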

SQLite schema:

CREATE TABLE filings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    accession_number TEXT UNIQUE,
    ticker TEXT,
    cik TEXT,
    insider_name TEXT,
    role TEXT,
    transaction_date TEXT,
    filed_date TEXT,
    shares REAL,
    price REAL,
    total_value REAL,
    flag TEXT,           -- A or D
    is_10b51 INTEGER,    -- 0 or 1
    post_tx_shares REAL,
    created_at TEXT
);

CREATE TABLE signals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT,
    trigger_date TEXT,
    cluster_size INTEGER,
    total_cluster_value REAL,
    score REAL,
    alerted INTEGER DEFAULT 0,
    executed INTEGER DEFAULT 0,
    created_at TEXT
);
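
Because accession_number is declared UNIQUE, deduplication on re-poll can lean on the database itself. A minimal sketch, assuming a reduced copy of the filings table and a hypothetical helper named `insert_filing`:

```python
import sqlite3

# Reduced version of the filings table, enough to demonstrate deduplication.
SCHEMA = """
CREATE TABLE IF NOT EXISTS filings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    accession_number TEXT UNIQUE,
    ticker TEXT,
    total_value REAL,
    flag TEXT
);
"""


def insert_filing(conn, filing):
    """Insert a parsed filing; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO filings (accession_number, ticker, total_value, flag) "
        "VALUES (:accession_number, :ticker, :total_value, :flag)",
        filing,
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the row to be ignored.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    row = {"accession_number": "0000000001-25-000001", "ticker": "ACME",
           "total_value": 145000.0, "flag": "A"}
    print(insert_filing(conn, row))  # first sighting
    print(insert_filing(conn, row))  # same accession number on a later poll
```

The poller can use the return value to decide whether a filing should continue on to the filter engine.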

Phase 2 -- Filter Engine

Goal: Reduce noise to actionable signals only.

Filters to apply (in order):

| Filter | Logic |
|---|---|
| Buy only | flag == 'A' (acquired); an open-market purchase additionally has transactionCode P |
| Exclude 10b5-1 | Scan footnotes for "10b5-1", "Rule 10b5", "adopted a plan" |
| Min transaction value | total_value >= 50000 (configurable) |
| Exclude derivative transactions | Options exercises are a weaker signal than open-market purchases |
| Role weighting | CEO/CFO/President = high; Director = medium; 10% owner = context-dependent |
| Cluster detection | 2+ insiders buying same ticker within 30 days = elevated signal |
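
As a rough sketch of how the filters and cluster detection above might be coded: the dict keys `is_derivative` and `footnotes` and all function names are assumptions made for illustration, and the cluster counts distinct insiders rather than filings.

```python
from datetime import date, timedelta

# Phrases from the filter list; matched case-insensitively.
PLAN_PHRASES = ("10b5-1", "rule 10b5", "adopted a plan")


def mentions_10b51(footnote_text):
    """True if the footnotes suggest the trade was made under a scheduled plan."""
    lowered = footnote_text.lower()
    return any(phrase in lowered for phrase in PLAN_PHRASES)


def passes_filters(filing, min_value=50_000):
    """Apply the per-filing filters in the order listed above."""
    return (
        filing["flag"] == "A"
        and not mentions_10b51(filing.get("footnotes", ""))
        and filing["total_value"] >= min_value
        and not filing["is_derivative"]
    )


def cluster_size(filings, ticker, as_of, window_days=30):
    """Count distinct insiders who bought `ticker` in the window ending at `as_of`."""
    start = as_of - timedelta(days=window_days)
    insiders = {
        f["insider_name"]
        for f in filings
        if f["ticker"] == ticker and start <= f["transaction_date"] <= as_of
    }
    return len(insiders)


if __name__ == "__main__":
    buys = [
        {"ticker": "ACME", "insider_name": "A", "transaction_date": date(2025, 5, 1)},
        {"ticker": "ACME", "insider_name": "B", "transaction_date": date(2025, 4, 20)},
        {"ticker": "ACME", "insider_name": "A", "transaction_date": date(2025, 4, 25)},
    ]
    print(cluster_size(buys, "ACME", date(2025, 5, 1)))
```

Counting distinct names means one insider buying twice does not register as a cluster on its own, which seems closer to the intent of "2+ insiders".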

Scoring formula (simple v1):

score = base_role_weight * log(total_value) * cluster_multiplier
# cluster_multiplier = 1.0 + (0.5 * (cluster_size - 1))

Expose all thresholds in config.py for easy tuning during backtesting.
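
A small sketch of the v1 scoring formula follows. The plan does not specify numeric role weights, so the `ROLE_WEIGHTS` values here are placeholders chosen only for illustration.

```python
import math

# Placeholder weights: the plan only ranks roles as high / medium / context-dependent.
ROLE_WEIGHTS = {
    "ceo": 1.0,
    "cfo": 1.0,
    "president": 1.0,
    "director": 0.6,
    "10% owner": 0.4,
}
DEFAULT_ROLE_WEIGHT = 0.5  # assumed fallback for titles not listed above


def score(role, total_value, cluster_size):
    """score = base_role_weight * log(total_value) * cluster_multiplier"""
    base_role_weight = ROLE_WEIGHTS.get(role.lower(), DEFAULT_ROLE_WEIGHT)
    cluster_multiplier = 1.0 + 0.5 * (cluster_size - 1)
    return base_role_weight * math.log(total_value) * cluster_multiplier


if __name__ == "__main__":
    print(round(score("CEO", 145_000, 3), 2))
```

With a weight of 1.0, a lone $50,000 purchase already scores ln(50000) ≈ 10.8. The role weights and the alert threshold therefore need to be calibrated together during backtesting.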


Phase 3 -- SQLite Storage

SQLite is sufficient for this workload (low write volume, single process). Use WAL mode for concurrent reads during backtesting:

import sqlite3

conn = sqlite3.connect('insider.db')
conn.execute('PRAGMA journal_mode=WAL')

Keep raw filing XML in a /data/filings/ directory keyed by accession number. Parse once on ingest; re-parsing should not be needed in normal operation, though the stored XML makes it possible if the parser changes.


Phase 4 -- Slack Alerts

Goal: Get notified immediately when a signal fires, with enough context to decide manually.

  1. Create a Slack app, get a webhook URL (takes 5 minutes)
  2. Alert format:
INSIDER BUY SIGNAL
Ticker:   $ACME
Insider:  John Smith (CEO)
Date:     2025-05-01
Shares:   10,000 @ $14.50 = $145,000
Cluster:  3 insiders in last 14 days
Score:    8.4
10b5-1:   No
EDGAR:    https://www.sec.gov/cgi-bin/browse-edgar?...
  3. Alert only on signals above configurable score threshold
  4. Mark alerted = 1 in DB after sending to avoid duplicates on re-poll
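import requests

def send_slack_alert(webhook_url, signal):
    # format_signal() renders the alert text in the format above; it is defined elsewhere.
    response = requests.post(webhook_url, json={"text": format_signal(signal)}, timeout=10)
    response.raise_for_status()  # surface failed deliveries instead of dropping them silently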
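
The plan does not define `format_signal`. One possible shape, assuming the signal is a dict with the hypothetical keys used below, reproduces the alert layout from step 2:

```python
def format_signal(signal):
    """Render a signal dict as the multi-line Slack alert text."""
    total = signal["shares"] * signal["price"]
    lines = [
        "INSIDER BUY SIGNAL",
        f"Ticker:   ${signal['ticker']}",
        f"Insider:  {signal['insider_name']} ({signal['role']})",
        f"Date:     {signal['transaction_date']}",
        f"Shares:   {signal['shares']:,.0f} @ ${signal['price']:,.2f} = ${total:,.0f}",
        f"Cluster:  {signal['cluster_size']} insiders in last {signal['window_days']} days",
        f"Score:    {signal['score']:.1f}",
        f"10b5-1:   {'Yes' if signal['is_10b51'] else 'No'}",
        f"EDGAR:    {signal['edgar_url']}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "ticker": "ACME", "insider_name": "John Smith", "role": "CEO",
        "transaction_date": "2025-05-01", "shares": 10000, "price": 14.5,
        "cluster_size": 3, "window_days": 14, "score": 8.4, "is_10b51": 0,
        "edgar_url": "https://www.sec.gov/cgi-bin/browse-edgar?...",
    }
    print(format_signal(example))
```

Keeping the formatting separate from the HTTP call makes the alert text easy to test without touching Slack.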

Phase 5 -- Backtesting

Goal: Validate filter parameters on historical data before going live.

Data:

  • Historical Form 4 filings: download bulk XML from https://www.sec.gov/dera/data/form-4-data
  • Price data: yfinance (free, sufficient for backtesting)

Backtest logic:

# For each signal in historical data:
# - Entry: next market open after filed_date
# - Exit: N days later (configurable: 30/60/90/180)
# - Calculate return vs SPY over same period
# - Aggregate by role, cluster_size, market_cap bucket
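
The entry/exit rule above can be sketched without any third-party library. This version assumes opening prices are supplied as a dict keyed by trading date, treats N as calendar days, and exits at the first trading day on or after the target date. All of these are choices made for the example, not decisions stated in the plan.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def trade_return(opens, filed_date, holding_days):
    """Return from the first open after filed_date to the open holding_days later.

    opens: dict mapping trading date -> opening price.
    Returns None when the price history does not cover the trade.
    """
    days = sorted(opens)
    i = bisect_right(days, filed_date)  # first trading day strictly after filing
    if i == len(days):
        return None
    entry_day = days[i]
    j = bisect_left(days, entry_day + timedelta(days=holding_days))
    if j == len(days):
        return None
    return opens[days[j]] / opens[entry_day] - 1.0


def excess_return(stock_opens, spy_opens, filed_date, holding_days):
    """Stock return minus SPY return over the same entry/exit rule."""
    stock = trade_return(stock_opens, filed_date, holding_days)
    bench = trade_return(spy_opens, filed_date, holding_days)
    if stock is None or bench is None:
        return None
    return stock - bench


if __name__ == "__main__":
    acme = {date(2025, 1, 2): 10.0, date(2025, 2, 3): 12.0}
    spy = {date(2025, 1, 2): 100.0, date(2025, 2, 3): 105.0}
    print(excess_return(acme, spy, date(2025, 1, 1), 30))
```

Entering at the open after filed_date, rather than on transaction_date, avoids look-ahead bias, since the trade is not public until the filing appears.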

Use vectorbt for performance:

import vectorbt as vbt
# Build entry/exit signal matrices aligned to price data
# Run portfolio simulation with configurable position sizing

Output metrics:

  • Annualized return vs SPY benchmark
  • Win rate
  • Avg return by holding period
  • Avg return by role / cluster size
  • Max drawdown
  • Sharpe ratio

Critical: Test on post-2022 data specifically. Pre-2022 results are likely inflated -- the signal became widely tracked after Autopilot/media coverage.

Parameter grid to test:

MIN_VALUE = [25_000, 50_000, 100_000]
HOLDING_DAYS = [30, 60, 90, 180]
CLUSTER_WINDOW = [14, 30]
MIN_CLUSTER_SIZE = [1, 2, 3]
ROLES = ['all', 'c-suite-only']
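
One way to enumerate the grid is `itertools.product`. The `GRID` dict and `iter_param_sets` helper below are illustrative names, not part of the plan.

```python
from itertools import product

GRID = {
    "MIN_VALUE": [25_000, 50_000, 100_000],
    "HOLDING_DAYS": [30, 60, 90, 180],
    "CLUSTER_WINDOW": [14, 30],
    "MIN_CLUSTER_SIZE": [1, 2, 3],
    "ROLES": ["all", "c-suite-only"],
}


def iter_param_sets(grid):
    """Yield one dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    combos = list(iter_param_sets(GRID))
    print(len(combos))  # 3 * 4 * 2 * 3 * 2 = 144
```

With 144 combinations tested on the same history, the best-performing set is likely to be partly a product of chance, so a held-out period is worth reserving for the final check.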

Phase 6 -- Alpaca Integration

Goal: Optionally auto-execute signals. Start with paper trading.

Paper trading base URL: https://paper-api.alpaca.markets
Live trading base URL: https://api.alpaca.markets

Swap via config flag -- never hardcode.

from alpaca_trade_api import REST

import config

api = REST(
    key_id=config.ALPACA_KEY,
    secret_key=config.ALPACA_SECRET,
    base_url=config.ALPACA_BASE_URL  # paper or live
)

def execute_signal(ticker, portfolio_value, signal_score):
    # Fixed fractional sizing: 2% of portfolio per signal
    price = api.get_latest_trade(ticker).price
    allocation = portfolio_value * 0.02
    qty = int(allocation / price)
    if qty < 1:
        return
    api.submit_order(
        symbol=ticker,
        qty=qty,
        side='buy',
        type='market',
        time_in_force='day'
    )

Position sizing: start at 2% per signal, max 10% in any single ticker. Add a max open positions limit (e.g. 20) to cap exposure.

Exit logic (v1): time-based only (close after N days). Add trailing stop later.
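
The sizing limits described above (2% per signal, 10% per ticker, 20 open positions) are not enforced by `execute_signal` as written. A pure function like the following sketch could compute the quantity before the order is submitted. The function name and its parameters are assumptions made for this example.

```python
def order_quantity(price, portfolio_value, current_ticker_value, open_positions,
                   size_pct=0.02, max_ticker_pct=0.10, max_positions=20):
    """Return the whole number of shares to buy, or 0 if a limit blocks the trade.

    current_ticker_value: market value already held in this ticker.
    open_positions: number of distinct tickers currently held.
    """
    if price <= 0:
        return 0
    opening_new_position = current_ticker_value == 0
    if opening_new_position and open_positions >= max_positions:
        return 0
    # Remaining room under the per-ticker cap.
    room = portfolio_value * max_ticker_pct - current_ticker_value
    allocation = min(portfolio_value * size_pct, room)
    if allocation <= 0:
        return 0
    return int(allocation // price)


if __name__ == "__main__":
    print(order_quantity(14.50, 100_000, 0, 0))
```

Keeping this logic free of API calls lets the backtester and the live broker client share the same sizing rules.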


Build Order

| Step | Deliverable | Est. Time |
|---|---|---|
| 1 | EDGAR poller + Form 4 XML parser + SQLite storage | 1 day |
| 2 | Filter engine + cluster detector | 0.5 day |
| 3 | Slack alert | 1 hour |
| 4 | Historical data download + backtest | 1-2 days |
| 5 | Alpaca paper trading integration | 0.5 day |
| 6 | Run paper trading 4-8 weeks, monitor | -- |
| 7 | Switch to live with small capital | -- |

Do not proceed to Step 7 without meaningful paper trading history.


Dependencies

requests
lxml
sqlite3 (stdlib)
yfinance
vectorbt
alpaca-trade-api
python-dotenv

All free. No paid APIs required.


Config Template

# config.py
EDGAR_POLL_INTERVAL = 600        # seconds
MIN_TRANSACTION_VALUE = 50_000
MIN_CLUSTER_SIZE = 1             # raise to 2 for higher quality
CLUSTER_WINDOW_DAYS = 30
HOLDING_PERIOD_DAYS = 90
POSITION_SIZE_PCT = 0.02         # 2% per signal
MAX_POSITIONS = 20
SCORE_ALERT_THRESHOLD = 5.0

SLACK_WEBHOOK_URL = ""
ALPACA_KEY = ""
ALPACA_SECRET = ""
ALPACA_BASE_URL = "https://paper-api.alpaca.markets"  # switch for live
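
Since python-dotenv is listed as a dependency, the secrets in the template could be read from the environment instead of being written into config.py. A minimal sketch, in which the `ALPACA_LIVE` variable and the `load_settings` helper are hypothetical names:

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv, listed under Dependencies
    load_dotenv()  # reads a .env file from the working directory, if one exists
except ImportError:
    pass  # fall back to variables already present in the environment

PAPER_URL = "https://paper-api.alpaca.markets"
LIVE_URL = "https://api.alpaca.markets"


def load_settings(env=None):
    """Build the settings dict from environment variables, with safe defaults."""
    env = os.environ if env is None else env
    live = env.get("ALPACA_LIVE", "0") == "1"  # hypothetical flag; paper by default
    return {
        "SLACK_WEBHOOK_URL": env.get("SLACK_WEBHOOK_URL", ""),
        "ALPACA_KEY": env.get("ALPACA_KEY", ""),
        "ALPACA_SECRET": env.get("ALPACA_SECRET", ""),
        "ALPACA_BASE_URL": LIVE_URL if live else PAPER_URL,
        "MIN_TRANSACTION_VALUE": int(env.get("MIN_TRANSACTION_VALUE", "50000")),
    }


if __name__ == "__main__":
    settings = load_settings()
    print(settings["ALPACA_BASE_URL"])
```

Defaulting to the paper URL means a missing or mistyped flag cannot route orders to the live account by accident.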