smaug/PLAN.md
Claude 7e9221a914 feat: add PLAN.md and insider copytrade POC implementation
- PLAN.md: full implementation plan from issue
- config.py: configurable thresholds, API keys via .env
- ingestion/: EDGAR RSS poller + Form 4 XML parser
- db/: SQLite schema + interface (WAL mode)
- signals/: filter engine (buy/10b5-1/value/role) + cluster detector
- alerts/: Slack webhook alert with score gating
- broker/: Alpaca paper/live trade execution
- backtest/: historical signal backtesting with yfinance
- main.py: CLI entrypoint (run | fetch-once | backtest)
2026-05-04 16:15:22 +00:00


# Insider Copytrade System -- Implementation Plan
## Description
A personal system that monitors SEC EDGAR Form 4 filings in near real time (polling every 10 minutes), filters for high-quality insider buying signals, alerts via Slack, and optionally executes trades automatically through Alpaca's paper or live trading API.
The system is fully self-hosted, uses only free/public data sources, and requires no third-party data subscriptions.
---
## Background
Company insiders (executives, directors, >10% shareholders) must file SEC Form 4 within 2 business days of any trade. This is public data via SEC EDGAR. The signal value of insider *buying* is academically documented -- executives buying their own stock with personal capital is a meaningful vote of confidence, particularly when:
- Multiple insiders buy simultaneously (cluster signal)
- The trade is unplanned (not a 10b5-1 scheduled plan)
- The company is small/mid-cap (less institutional arbitrage)
The edge over copying politicians' trades: Form 4 has a 2-business-day disclosure lag versus up to 45 days, and the signal is company-specific rather than sector-level.
**Key risk:** This signal is publicly known and tracked. The edge is in filtering quality and execution speed, not data exclusivity. Large-cap Form 4 signals are arbitraged quickly. Focus on small/mid-cap, clustered, unplanned buys.
---
## System Outline
```
SEC EDGAR RSS Feed (poll every 10 min)
                  |
          [Ingestion Layer]
                  |
          Parse Form 4 XML
                  |
           [Filter Engine]
           - Buy only (flag = A)
           - Exclude 10b5-1 plans
           - Min transaction size
           - Role weighting
           - Cluster detection
                  |
           SQLite Database
                  |
     ┌────────────┬──────────────┐
     |            |              |
[Backtester] [Slack Alert] [Alpaca API]
               (manual)    (paper/live)
```
---
## Actionables
### Phase 1 -- Data Ingestion
**Goal:** Reliably pull and parse Form 4 filings as they appear.
**Tasks:**
1. Set up project structure
```
insider-copytrade/
  ingestion/
    edgar_poller.py       # polls EDGAR RSS
    form4_parser.py       # parses XML -> structured dict
  db/
    schema.sql
    db.py                 # SQLite interface
  signals/
    filter_engine.py      # applies signal filters
    cluster_detector.py
  alerts/
    slack_alert.py
  broker/
    alpaca_client.py
  backtest/
    backtest.py
  config.py
  main.py
```
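2. Poll EDGAR RSS for Form 4 filings every 10 minutes. Send a descriptive `User-Agent` header with contact details on every request; SEC rejects anonymous automated traffic: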
```
https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=4&dateb=&owner=include&count=40&search_text=&output=atom
```
SEC also provides a structured latest filings feed:
```
https://efts.sec.gov/LATEST/search-index?q=&forms=4
```
3. For each new filing, fetch and parse the XML document. Key fields to extract:
- `issuerTradingSymbol` (ticker)
- `rptOwnerName`, `officerTitle` (insider name + role)
- `transactionDate`
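- `transactionAcquiredDisposedCode` (A = acquired, D = disposed). Acquisitions include grants and option exercises, so also read `transactionCode` (P = open-market or private purchase, S = sale) to isolate real buys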
- `transactionShares`, `transactionPricePerShare`
- `transactionTotalValue` (compute if not present)
- `footnotes` (check for "10b5-1" mention)
- `sharesOwnedFollowingTransaction`
4. Store raw filing XML + parsed fields. Track `accessionNumber` as dedup key.
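The field extraction in step 3 could look roughly like the sketch below. It uses only the standard library; the element paths reflect the usual layout of the EDGAR ownership XML and should be checked against real filings. The function name `parse_form4`, the output keys, and the document-wide footnote scan are assumptions for illustration (real footnotes are attached to individual transactions by ID).

```python
import xml.etree.ElementTree as ET

# Phrases that suggest a pre-scheduled trading plan (from the filter list in this plan)
PLAN_PHRASES = ("10b5-1", "rule 10b5", "adopted a plan")


def _text(node, path):
    """Return stripped text at `path` below `node`, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_form4(xml_text):
    """Parse one Form 4 document into a list of dicts, one per non-derivative transaction."""
    root = ET.fromstring(xml_text)

    # Simplification: scan all footnotes in the document rather than per transaction
    footnotes = " ".join("".join(fn.itertext()) for fn in root.iter("footnote")).lower()
    is_10b51 = int(any(phrase in footnotes for phrase in PLAN_PHRASES))

    rows = []
    for tx in root.iter("nonDerivativeTransaction"):
        shares = float(_text(tx, "transactionAmounts/transactionShares/value") or 0)
        price = float(_text(tx, "transactionAmounts/transactionPricePerShare/value") or 0)
        post = _text(tx, "postTransactionAmounts/sharesOwnedFollowingTransaction/value")
        rows.append({
            "ticker": _text(root, "issuer/issuerTradingSymbol"),
            "insider_name": _text(root, "reportingOwner/reportingOwnerId/rptOwnerName"),
            "role": _text(root, "reportingOwner/reportingOwnerRelationship/officerTitle"),
            "transaction_date": _text(tx, "transactionDate/value"),
            "transaction_code": _text(tx, "transactionCoding/transactionCode"),
            "flag": _text(tx, "transactionAmounts/transactionAcquiredDisposedCode/value"),
            "shares": shares,
            "price": price,
            "total_value": shares * price,  # computed, since it is not reported directly here
            "is_10b51": is_10b51,
            "post_tx_shares": float(post or 0),
        })
    return rows
```

Derivative transactions are deliberately ignored here, which matches the later filter that excludes option exercises.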
**SQLite schema:**
```sql
CREATE TABLE filings (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    accession_number TEXT UNIQUE,
    ticker           TEXT,
    cik              TEXT,
    insider_name     TEXT,
    role             TEXT,
    transaction_date TEXT,
    filed_date       TEXT,
    shares           REAL,
    price            REAL,
    total_value      REAL,
    flag             TEXT,     -- A or D
    is_10b51         INTEGER,  -- 0 or 1
    post_tx_shares   REAL,
    created_at       TEXT
);

CREATE TABLE signals (
    id                  INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker              TEXT,
    trigger_date        TEXT,
    cluster_size        INTEGER,
    total_cluster_value REAL,
    score               REAL,
    alerted             INTEGER DEFAULT 0,
    executed            INTEGER DEFAULT 0,
    created_at          TEXT
);
```
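Using `accession_number` as the dedup key can lean on the `UNIQUE` constraint directly. The sketch below is one way to do that with `INSERT OR IGNORE`; the helper name `insert_filing` and the exact column subset are assumptions, not the repository's actual interface.

```python
import sqlite3
from datetime import datetime, timezone


def insert_filing(conn: sqlite3.Connection, filing: dict) -> bool:
    """Insert one parsed filing. Returns True if it was new, False if already stored."""
    params = {**filing, "created_at": datetime.now(timezone.utc).isoformat()}
    cur = conn.execute(
        """
        INSERT OR IGNORE INTO filings
            (accession_number, ticker, insider_name, role, transaction_date,
             shares, price, total_value, flag, is_10b51, created_at)
        VALUES
            (:accession_number, :ticker, :insider_name, :role, :transaction_date,
             :shares, :price, :total_value, :flag, :is_10b51, :created_at)
        """,
        params,
    )
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the UNIQUE constraint ignored the row
    return cur.rowcount == 1
```

One caveat with this schema: a single Form 4 can report several transactions under one accession number, so with `accession_number UNIQUE` only the first row per filing survives. Aggregating transactions per filing before insert, or using a composite key, would avoid that.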
---
### Phase 2 -- Filter Engine
**Goal:** Reduce noise to actionable signals only.
**Filters to apply (in order):**
| Filter | Logic |
|---|---|
| Buy only | `flag == 'A'`; `A` only means "acquired", so also require `transactionCode == 'P'` to drop grants and awards |
| Exclude 10b5-1 | Scan footnotes for "10b5-1", "Rule 10b5", "adopted a plan" |
| Min transaction value | `total_value >= 50000` (configurable) |
| Exclude derivative transactions | Skip `derivativeTable` rows; option exercises are a weaker signal than open-market purchases |
| Role weighting | CEO/CFO/President = high; Director = medium; 10% owner = context-dependent |
| Cluster detection | 2+ insiders buying same ticker within 30 days = elevated signal |
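As a rough illustration of how the row-level filters and cluster detection might fit together, the sketch below works on a list of parsed filing dicts. The dict keys mirror the `filings` columns; the function names and the choice to count distinct insider names inside a trailing window are assumptions.

```python
from datetime import date, timedelta

MIN_TRANSACTION_VALUE = 50_000  # configurable threshold from this plan
CLUSTER_WINDOW_DAYS = 30        # configurable window from this plan


def passes_filters(filing):
    """Row-level filters: acquisition, not a 10b5-1 plan, and large enough."""
    return (
        filing["flag"] == "A"
        and not filing["is_10b51"]
        and filing["total_value"] >= MIN_TRANSACTION_VALUE
    )


def cluster_size(filings, ticker, as_of, window_days=CLUSTER_WINDOW_DAYS):
    """Count distinct insiders with qualifying buys of `ticker` in the window ending at `as_of`."""
    start = as_of - timedelta(days=window_days)
    insiders = {
        f["insider_name"]
        for f in filings
        if f["ticker"] == ticker
        and passes_filters(f)
        and start <= date.fromisoformat(f["transaction_date"]) <= as_of
    }
    return len(insiders)
```

Counting distinct names rather than rows keeps one insider who buys in several tranches from looking like a cluster.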
**Scoring formula (simple v1):**
```python
score = base_role_weight * log(total_value) * cluster_multiplier
# cluster_multiplier = 1.0 + (0.5 * (cluster_size - 1))
```
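A runnable version of the formula might look like the following. The numeric role weights and the keyword matching are purely illustrative assumptions (the plan only ranks roles as high, medium, or context-dependent), and `log` is taken to be the natural logarithm.

```python
from math import log

# Hypothetical weights: the plan ranks the tiers but does not assign numbers
ROLE_WEIGHTS = {"high": 1.0, "medium": 0.6, "context": 0.4}
HIGH_KEYWORDS = ("ceo", "cfo", "chief executive", "chief financial", "president")


def role_tier(role):
    """Map a free-text officer title to a tier. Crude substring matching, for illustration only."""
    text = (role or "").lower()
    if "vice president" not in text and any(k in text for k in HIGH_KEYWORDS):
        return "high"
    if "director" in text:
        return "medium"
    return "context"


def signal_score(role, total_value, cluster_size):
    """score = base_role_weight * log(total_value) * cluster_multiplier"""
    cluster_multiplier = 1.0 + 0.5 * (cluster_size - 1)
    return ROLE_WEIGHTS[role_tier(role)] * log(total_value) * cluster_multiplier
```

Because the logarithm compresses transaction size, the cluster multiplier and role weight dominate the ranking, so the alert threshold needs to be tuned together with whatever weights are chosen.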
Expose all thresholds in `config.py` for easy tuning during backtesting.
---
### Phase 3 -- SQLite Storage
SQLite is sufficient for this workload (low write volume, single process). Use WAL mode for concurrent reads during backtesting:
```python
conn = sqlite3.connect('insider.db')
conn.execute('PRAGMA journal_mode=WAL')
```
Keep raw filing XML in a `/data/filings/` directory keyed by accession number. Parse once on ingest; re-parsing should never be needed.
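A small sketch of both conventions, assuming hypothetical helper names `open_db` and `raw_xml_path`. The `PRAGMA` returns the journal mode actually in effect, so it is worth checking rather than assuming the switch succeeded.

```python
import sqlite3
from pathlib import Path


def open_db(path="insider.db"):
    """Open the database and confirm WAL mode is active."""
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory databases, for example, cannot use WAL and report "memory"
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    return conn


def raw_xml_path(accession_number, root="/data/filings"):
    """Location of the raw XML for one filing, keyed by accession number."""
    return Path(root) / f"{accession_number}.xml"
```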
---
### Phase 4 -- Slack Alerts
**Goal:** Get notified immediately when a signal fires, with enough context to decide manually.
1. Create a Slack app, get a webhook URL (takes 5 minutes)
2. Alert format:
```
INSIDER BUY SIGNAL
Ticker: $ACME
Insider: John Smith (CEO)
Date: 2025-05-01
Shares: 10,000 @ $14.50 = $145,000
Cluster: 3 insiders in last 14 days
Score: 8.4
10b5-1: No
EDGAR: https://www.sec.gov/cgi-bin/browse-edgar?...
```
3. Alert only on signals above configurable score threshold
4. Mark `alerted = 1` in DB after sending to avoid duplicates on re-poll
```python
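import requests

def send_slack_alert(webhook_url, signal):
    # format_signal() (defined elsewhere) renders the alert text shown above
    resp = requests.post(webhook_url, json={"text": format_signal(signal)}, timeout=10)
    resp.raise_for_status()  # fail loudly so `alerted` is only set after a successful send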
```
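`format_signal` is referenced above but not defined in this plan. A minimal sketch that reproduces the alert format could look like this; the dict keys (including `cluster_window_days` and `edgar_url`) are assumed names for illustration.

```python
def format_signal(signal):
    """Render one signal dict as the multi-line Slack alert text."""
    lines = [
        "INSIDER BUY SIGNAL",
        f"Ticker: ${signal['ticker']}",
        f"Insider: {signal['insider_name']} ({signal['role']})",
        f"Date: {signal['transaction_date']}",
        f"Shares: {signal['shares']:,.0f} @ ${signal['price']:,.2f} = ${signal['total_value']:,.0f}",
        f"Cluster: {signal['cluster_size']} insiders in last {signal['cluster_window_days']} days",
        f"Score: {signal['score']:.1f}",
        f"10b5-1: {'Yes' if signal['is_10b51'] else 'No'}",
        f"EDGAR: {signal['edgar_url']}",
    ]
    return "\n".join(lines)
```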
---
### Phase 5 -- Backtesting
**Goal:** Validate filter parameters on historical data before going live.
**Data:**
- Historical Form 4 filings: download bulk XML from `https://www.sec.gov/dera/data/form-4-data`
- Price data: `yfinance` (free, sufficient for backtesting)
**Backtest logic:**
```python
# For each signal in historical data:
# - Entry: next market open after filed_date
# - Exit: N days later (configurable: 30/60/90/180)
# - Calculate return vs SPY over same period
# - Aggregate by role, cluster_size, market_cap bucket
```
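The entry and exit rules above can be sketched without any third-party library, given daily open prices as sorted `(date, open)` pairs. The function names are hypothetical, and "N days later" is interpreted here as calendar days, rolling forward to the next trading day; that interpretation is an assumption.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def holding_return(opens, filed_date, holding_days):
    """Return over the holding period, or None if prices do not cover it.

    opens: list of (date, open_price) tuples sorted by date, trading days only.
    """
    days = [d for d, _ in opens]
    entry = bisect_right(days, filed_date)  # first trading day strictly after filed_date
    if entry >= len(days):
        return None
    target = days[entry] + timedelta(days=holding_days)
    exit_ = bisect_left(days, target)       # first trading day on or after the target date
    if exit_ >= len(days):
        return None
    return opens[exit_][1] / opens[entry][1] - 1.0


def excess_return(stock_opens, spy_opens, filed_date, holding_days):
    """Stock return minus SPY return over the same entry/exit rule."""
    stock = holding_return(stock_opens, filed_date, holding_days)
    bench = holding_return(spy_opens, filed_date, holding_days)
    if stock is None or bench is None:
        return None
    return stock - bench
```

Entering strictly after `filed_date` avoids look-ahead: a filing published after the close cannot be traded at that day's open.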
**Use `vectorbt` for performance:**
```python
import vectorbt as vbt
# Build entry/exit signal matrices aligned to price data
# Run portfolio simulation with configurable position sizing
```
**Output metrics:**
- Annualized return vs SPY benchmark
- Win rate
- Avg return by holding period
- Avg return by role / cluster size
- Max drawdown
- Sharpe ratio
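Several of these metrics are short enough to compute by hand, which is useful as a cross-check on whatever a backtesting library reports. The sketch below uses common textbook definitions; the function names, the 252-period annualization, and the zero risk-free default are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def win_rate(returns):
    """Fraction of trades with a strictly positive return."""
    return sum(r > 0 for r in returns) / len(returns)


def annualized_return(total_return, days_held):
    """Convert a total return over `days_held` calendar days to a yearly rate."""
    return (1.0 + total_return) ** (365.0 / days_held) - 1.0


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a negative fraction (0.0 if none)."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def sharpe_ratio(period_returns, periods_per_year=252, risk_free_annual=0.0):
    """Annualized Sharpe ratio from per-period returns; None if undefined."""
    if len(period_returns) < 2:
        return None
    excess = [r - risk_free_annual / periods_per_year for r in period_returns]
    spread = stdev(excess)
    if spread == 0:
        return None
    return mean(excess) / spread * sqrt(periods_per_year)
```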
**Critical:** Test on post-2022 data specifically. Pre-2022 results are likely inflated -- the signal became widely tracked after Autopilot/media coverage.
**Parameter grid to test:**
```python
MIN_VALUE = [25_000, 50_000, 100_000]
HOLDING_DAYS = [30, 60, 90, 180]
CLUSTER_WINDOW = [14, 30]
MIN_CLUSTER_SIZE = [1, 2, 3]
ROLES = ['all', 'c-suite-only']
```
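Expanding the grid into individual runs is a one-liner with `itertools.product`. The sketch below wraps the lists above in a dict (the dict layout and `iter_param_sets` name are assumptions) and yields one parameter set per combination.

```python
from itertools import product

PARAM_GRID = {
    "MIN_VALUE": [25_000, 50_000, 100_000],
    "HOLDING_DAYS": [30, 60, 90, 180],
    "CLUSTER_WINDOW": [14, 30],
    "MIN_CLUSTER_SIZE": [1, 2, 3],
    "ROLES": ["all", "c-suite-only"],
}


def iter_param_sets(grid):
    """Yield one dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    for params in iter_param_sets(PARAM_GRID):
        print(params)  # replace with a call into the backtester
```

This grid has 3 × 4 × 2 × 3 × 2 = 144 combinations. With that many runs on one dataset, the best-looking parameter set is likely to be partly overfit, so holding out a validation period is worth considering.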
---
### Phase 6 -- Alpaca Integration
**Goal:** Optionally auto-execute signals. Start with paper trading.
**Paper trading base URL:** `https://paper-api.alpaca.markets`
**Live trading base URL:** `https://api.alpaca.markets`
Swap via config flag -- never hardcode.
```python
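from alpaca_trade_api import REST

import config

api = REST(
    key_id=config.ALPACA_KEY,
    secret_key=config.ALPACA_SECRET,
    base_url=config.ALPACA_BASE_URL,  # paper or live
)

def execute_signal(ticker, portfolio_value, signal_score):
    """Submit a market buy sized at a fixed fraction of the portfolio."""
    # Fixed fractional sizing: 2% of portfolio per signal (config.POSITION_SIZE_PCT).
    # signal_score is accepted for later score-weighted sizing but unused in v1.
    price = api.get_latest_trade(ticker).price
    allocation = portfolio_value * config.POSITION_SIZE_PCT
    qty = int(allocation / price)
    if qty < 1:
        return None  # allocation too small to buy a whole share
    return api.submit_order(
        symbol=ticker,
        qty=qty,
        side='buy',
        type='market',
        time_in_force='day',
    )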
```
Position sizing: start at 2% per signal, max 10% in any single ticker. Add a max open positions limit (e.g. 20) to cap exposure.
Exit logic (v1): time-based only (close after N days). Add trailing stop later.
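The sizing caps and the time-based exit can be kept as pure functions, separate from the broker client, so they are easy to test. This is a sketch under assumed names (`order_quantity`, `is_due_for_exit`); it treats the 10% limit as a cap on total market value held in one ticker and counts holding days in calendar days.

```python
from datetime import date, timedelta


def order_quantity(price, portfolio_value, current_ticker_value, open_positions,
                   size_pct=0.02, max_ticker_pct=0.10, max_positions=20):
    """Whole shares to buy for one signal after applying all exposure caps."""
    if price <= 0:
        return 0
    # A new ticker is refused once the position limit is reached;
    # adding to an existing position does not open a new slot.
    if current_ticker_value == 0 and open_positions >= max_positions:
        return 0
    allocation = portfolio_value * size_pct
    room = portfolio_value * max_ticker_pct - current_ticker_value
    allocation = min(allocation, max(room, 0.0))
    return int(allocation // price)


def is_due_for_exit(entry_date, today, holding_days=90):
    """Time-based exit: True once the position has been held for `holding_days`."""
    return today >= entry_date + timedelta(days=holding_days)
```

For a $100,000 portfolio and a $14.50 share price, the 2% allocation is $2,000, which buys 137 whole shares.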
---
## Build Order
| Step | Deliverable | Est. Time |
|---|---|---|
| 1 | EDGAR poller + Form 4 XML parser + SQLite storage | 1 day |
| 2 | Filter engine + cluster detector | 0.5 day |
| 3 | Slack alert | 1 hour |
| 4 | Historical data download + backtest | 1-2 days |
| 5 | Alpaca paper trading integration | 0.5 day |
| 6 | Run paper trading 4-8 weeks, monitor | -- |
| 7 | Switch to live with small capital | -- |
Do not proceed to Step 7 without meaningful paper trading history.
---
## Dependencies
```
requests
lxml
sqlite3 (stdlib)
yfinance
vectorbt
alpaca-trade-api
python-dotenv
```
All free. No paid APIs required.
---
## Config Template
```python
# config.py
EDGAR_POLL_INTERVAL = 600 # seconds
MIN_TRANSACTION_VALUE = 50_000
MIN_CLUSTER_SIZE = 1 # raise to 2 for higher quality
CLUSTER_WINDOW_DAYS = 30
HOLDING_PERIOD_DAYS = 90
POSITION_SIZE_PCT = 0.02 # 2% per signal
MAX_POSITIONS = 20
SCORE_ALERT_THRESHOLD = 5.0
SLACK_WEBHOOK_URL = ""
ALPACA_KEY = ""
ALPACA_SECRET = ""
ALPACA_BASE_URL = "https://paper-api.alpaca.markets" # switch for live
```
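Rather than committing secrets in `config.py`, the empty strings above can be filled from environment variables, with `python-dotenv` loading a local `.env` file when it is installed. This is a sketch: the `load_settings` helper and the `ALPACA_LIVE` switch are hypothetical names, not part of the template.

```python
import os

try:
    # python-dotenv is in the dependency list; load_dotenv() reads a local .env file
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already present in the environment

PAPER_URL = "https://paper-api.alpaca.markets"
LIVE_URL = "https://api.alpaca.markets"


def load_settings(env=None):
    """Build the secret and endpoint settings from environment variables."""
    env = os.environ if env is None else env
    live = env.get("ALPACA_LIVE", "0") == "1"  # hypothetical flag; defaults to paper
    return {
        "SLACK_WEBHOOK_URL": env.get("SLACK_WEBHOOK_URL", ""),
        "ALPACA_KEY": env.get("ALPACA_KEY", ""),
        "ALPACA_SECRET": env.get("ALPACA_SECRET", ""),
        "ALPACA_BASE_URL": LIVE_URL if live else PAPER_URL,
    }
```

Defaulting to the paper endpoint means a missing or mistyped flag can never route orders to the live account.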