
<h1 align="center">
  <img src='./icon.png' width="250px">
  <br>
  <b>Smaug</b>
</h1>


Monitors SEC EDGAR Form 4 filings in near real-time, detects insider buy clusters, sends Slack alerts, and optionally executes trades via Alpaca.  
The idea is copied from [insidercopytrading.com](https://insidercopytrading.com/). Available at [insidercopytradingcopy.com](https://www.youtube.com/watch?v=dQw4w9WgXcQ).

## Architecture

```
EDGAR (Form 4 feed)
      │
      ▼
ingestion/edgar_poller.py    ← polls every 10 min, dedupes by accession
ingestion/sec_bulk_ingest.py ← bulk historical ingest via quarterly form.idx archives
      │
      ▼
ingestion/form4_parser.py    ← parses XML, detects 10b5-1 plans, extracts tx_code
      │
      ▼
db/models.py + db/db.py      ← SQLAlchemy ORM: filings, signals, price_cache tables
      │
      ▼
signals/filter_engine.py     ← buy-only, open-market (P) only, exclude 10b5-1,
signals/cluster_detector.py    min $50k, role-weighted scoring, as-of-date aware
      │
      ├──► alerts/slack_alert.py   ← POST to Slack webhook when score ≥ threshold
      └──► broker/alpaca_client.py ← paper/live order: 2% position size, 10% per-ticker cap
                                        positions auto-closed after holding period expires

backtest/backtest.py         ← per-signal return / alpha vs SPY analysis
backtest/simulate.py         ← realistic portfolio simulation with transaction costs
```

## Setup

```bash
cp .env.example .env
# edit .env with your credentials
pip install -r requirements.txt
```

### Environment variables (`.env`)

| Variable | Required | Default | Description |
|---|---|---|---|
| `SLACK_WEBHOOK_URL` | optional | — | Incoming webhook URL for alerts |
| `ALPACA_KEY` | optional | — | Alpaca API key |
| `ALPACA_SECRET` | optional | — | Alpaca API secret |
| `ALPACA_BASE_URL` | optional | `https://paper-api.alpaca.markets` | Use paper or live endpoint |
| `DB_PATH` | optional | `insider.db` | SQLite database file path |
| `DATA_DIR` | optional | `data/filings` | Directory for cached raw XML filings |
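
A minimal sketch of how these variables might be read with their documented defaults. It uses only the standard library (the project lists `python-dotenv`, which would normally load `.env` into the environment first). The `Settings` dataclass and `load_settings` helper are hypothetical names for illustration, not the contents of `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables in the table above."""
    slack_webhook_url: Optional[str]
    alpaca_key: Optional[str]
    alpaca_secret: Optional[str]
    alpaca_base_url: str
    db_path: str
    data_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable, falling back to the documented default if unset."""
    return Settings(
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL"),  # no default: alerts are optional
        alpaca_key=env.get("ALPACA_KEY"),                # no default: trading is optional
        alpaca_secret=env.get("ALPACA_SECRET"),
        alpaca_base_url=env.get("ALPACA_BASE_URL", "https://paper-api.alpaca.markets"),
        db_path=env.get("DB_PATH", "insider.db"),
        data_dir=env.get("DATA_DIR", "data/filings"),
    )
```

With nothing set, the sketch falls back to the paper-trading endpoint, which is the safer default when experimenting.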

## Usage

```bash
# Initialize DB and start continuous polling (every 10 minutes)
python main.py run

# Bulk-ingest historical Form 4 filings from SEC EDGAR quarterly archives
python main.py backfill --years 2023 2024        # full year range
python main.py backfill --year 2024 --quarter 1  # single quarter

# Per-signal backtest: win rate, alpha vs SPY
python main.py backtest

# Portfolio simulation with configurable strategy and cost params
python main.py simulate [options]
```

### Simulate options

```
Strategy:
  --holding-days N      Calendar days to hold each position (default: 7)
  --buy-delay N         Days after signal trigger to enter (default: 1)
  --position-size F     Fraction of available cash per trade (default: 0.10)
  --min-score F         Minimum signal score filter (default: 0.0)
  --min-cluster N       Minimum cluster size filter (default: 1)
  --capital F           Initial capital in USD (default: 100000)

Transaction costs:
  --spread F            One-way bid-ask half-spread paid at entry and exit (default: 0.003)
  --slippage F          Entry slippage / market impact (default: 0.002)
  --commission F        Per-trade commission as fraction of notional (default: 0.001)

Round-trip cost = spread×2 + slippage + commission×2
```
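
The round-trip formula above can be checked by hand. The sketch below restates it as a small Python function and evaluates it for the default flag values; the function name `round_trip_cost` is an illustrative assumption, not necessarily what `backtest/simulate.py` uses.

```python
def round_trip_cost(spread: float, slippage: float, commission: float) -> float:
    """Round-trip cost as a fraction of notional.

    Mirrors the formula above: the half-spread and the commission are paid
    at both entry and exit, slippage only at entry.
    """
    return 2 * spread + slippage + 2 * commission


# Defaults from the option list: spread 0.003, slippage 0.002, commission 0.001
default_cost = round_trip_cost(spread=0.003, slippage=0.002, commission=0.001)
print(f"default round-trip cost: {default_cost:.2%}")  # 1.00%
```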

## Key configuration (`config.py`)

| Parameter | Default | Description |
|---|---|---|
| `EDGAR_POLL_INTERVAL` | 600 s | Polling cadence |
| `MIN_TRANSACTION_VALUE` | $50,000 | Ignore buys below this |
| `MIN_CLUSTER_SIZE` | 1 | Minimum unique insiders before a signal fires |
| `CLUSTER_WINDOW_DAYS` | 30 | Rolling window for cluster counting |
| `HOLDING_PERIOD_DAYS` | 90 | Days held per position (backtest + auto-close trigger) |
| `POSITION_SIZE_PCT` | 2% | Fraction of portfolio per trade |
| `MAX_POSITIONS` | 20 | Hard position limit |
| `SCORE_ALERT_THRESHOLD` | 5.0 | Minimum score to trigger Slack alert |
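
To illustrate how `CLUSTER_WINDOW_DAYS` and `MIN_CLUSTER_SIZE` interact, here is a hedged sketch of as-of-date-aware cluster counting. The record layout (`Buy` tuple) and the `cluster_size` function are hypothetical simplifications; the real logic in `signals/cluster_detector.py` reads from the DB and may differ in detail.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple

CLUSTER_WINDOW_DAYS = 30  # default from the table above

# (ticker, insider_name, transaction_date): a hypothetical, simplified record
Buy = Tuple[str, str, date]


def cluster_size(buys: Iterable[Buy], ticker: str, as_of: date,
                 window_days: int = CLUSTER_WINDOW_DAYS) -> int:
    """Count distinct insiders who bought `ticker` in the window ending at `as_of`.

    Buys dated after `as_of` are ignored, so a backtest never sees the future.
    """
    start = as_of - timedelta(days=window_days)
    insiders = {
        name
        for (tkr, name, tx_date) in buys
        if tkr == ticker and start <= tx_date <= as_of
    }
    return len(insiders)
```

A signal would then fire only when `cluster_size(...) >= MIN_CLUSTER_SIZE`. Counting distinct names means repeated buys by the same insider do not inflate the cluster.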

## Scoring

```
score = role_weight × log(total_value) × (1 + 0.5 × (cluster_size − 1))
```

Role weights: CEO 3.0 · CFO/President 2.5 · COO 2.0 · Director 1.5 · VP 1.2 · 10% owner 1.0
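
The formula and weights above can be sketched in Python as follows. Two points are assumptions, since the text does not state them: the logarithm is taken as the natural log, and a single role is passed in (how the role is chosen when a cluster has several insiders is not specified here). The names `ROLE_WEIGHTS` and `signal_score` are illustrative.

```python
import math

# Weights copied from the list above; "CFO/President" is split into two keys.
ROLE_WEIGHTS = {
    "CEO": 3.0,
    "CFO": 2.5,
    "President": 2.5,
    "COO": 2.0,
    "Director": 1.5,
    "VP": 1.2,
    "10% owner": 1.0,
}


def signal_score(role: str, total_value: float, cluster_size: int) -> float:
    """score = role_weight × log(total_value) × (1 + 0.5 × (cluster_size − 1)).

    ASSUMPTION: natural logarithm; the source does not state the base.
    """
    cluster_multiplier = 1 + 0.5 * (cluster_size - 1)
    return ROLE_WEIGHTS[role] * math.log(total_value) * cluster_multiplier
```

Under this reading, each additional insider in the cluster adds 50% of the single-insider score, so a three-insider cluster scores twice as high as a lone buyer with the same role and total value.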

## Backtesting

The backtest loads signals from the DB and fetches OHLC data via `yfinance`. Prices are cached in the `price_cache` table, so completed date ranges are served entirely from the DB on repeat runs. The entry price is the closing price on the first trading day on or after the signal date; the exit price is the closing price on the last trading day on or before the exit date.
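
The entry and exit rules can be sketched with a binary search over the available trading days. This is an illustration under assumptions: prices are represented as a plain `dict` of date to close, and the helper names `entry_close` and `exit_close` are hypothetical rather than taken from `backtest/backtest.py`.

```python
from bisect import bisect_left, bisect_right
from datetime import date
from typing import Dict, Optional


def entry_close(closes: Dict[date, float], signal_date: date) -> Optional[float]:
    """Close on the first trading day on or after `signal_date`, if any."""
    days = sorted(closes)
    i = bisect_left(days, signal_date)   # first index with day >= signal_date
    return closes[days[i]] if i < len(days) else None


def exit_close(closes: Dict[date, float], exit_date: date) -> Optional[float]:
    """Close on the last trading day on or before `exit_date`, if any."""
    days = sorted(closes)
    i = bisect_right(days, exit_date)    # first index with day > exit_date
    return closes[days[i - 1]] if i > 0 else None
```

A signal dated on a weekend therefore enters at the following Monday's close, while an exit date that falls on a weekend uses the preceding Friday's close.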

## Results (2023–2024 backtest, 302k filings ingested)

> **⚠ Read the caveats below before drawing conclusions.**

### Per-signal statistics (pre-cost)

Across 16,279 signals generated from 302k Form 4 filings (2023–2024):

| Hold | Avg return | Avg alpha vs SPY | Sharpe | Win rate |
|------|-----------|-----------------|--------|----------|
| 3 d  | +0.61%    | +0.52%          | ~0.80  | ~53%     |
| 7 d  | +1.19%    | +0.68%          | ~1.05  | ~54%     |
| 14 d | +1.41%    | +0.55%          | ~0.90  | ~54%     |
| 30 d | +1.89%    | +0.41%          | ~0.70  | ~54%     |
| 90 d | +5.8%     | +1.0%           | ~0.55  | ~57%     |

Alpha is strongest and most consistent at 3–14 day holds. Beyond 30 days, market beta dominates. Signal quality is broadly robust across `min_score` and `min_cluster` filter values.

### Portfolio simulation (1-day lag, 7-day hold, 10% of cash per signal)

Pre-cost simulation on the same period:

| Metric | Value |
|--------|-------|
| Initial capital | $100,000 |
| Final value | $782,097 |
| Total return | +682% |
| Annualized return | +177% |
| SPY annualized | +25.9% |
| Max drawdown | 12.8% |
| Sharpe | 4.67 |
| Trades executed | 13,766 |

After realistic transaction costs (~1% round-trip), expected annualized return drops to roughly **20–60%** depending on assumed spread and slippage. Run the simulator to check your specific assumptions:

```bash
# Conservative (liquid mid-caps, ~1% round-trip)
python main.py simulate --spread 0.003 --slippage 0.002 --commission 0.001

# Realistic small-cap (~2.1% round-trip by the formula above: 0.007×2 + 0.005 + 0.001×2)
python main.py simulate --spread 0.007 --slippage 0.005 --commission 0.001
```

### Reality check: with costs this strategy underperforms SPY

Actual simulation results on the full dataset (2020–2025, 16,556 signals) with a realistic 1.5% round-trip cost:

| Config | Ann. return | SPY | Excess | Sharpe |
|--------|-------------|-----|--------|--------|
| 7d hold, 0d delay, 1.5% cost | +5.8% | +16.1% | -10.2% | 0.45 |
| 7d hold, 1d delay, 1.5% cost | -2.5% | +16.2% | -18.7% | -1.55 |
| 3d hold, 1d delay, 1.5% cost | -21.1% | +16.2% | -37.3% | -6.45 |
| 3d hold, 1d delay, 0.67% cost | +8.9% | +16.2% | -7.3% | 0.17 |

**The strategy underperforms SPY under any realistic execution assumption.** Even with 0-day delay (impossible in practice — the filing isn't visible at market open the same day) you still trail the index.

The signal exists — insiders outperform at ~0.68% per 7-day trade pre-cost — but the margin is too thin to survive the transaction costs you actually pay on small/mid-cap stocks.

### Why sites like insidercopytrading.com show outperformance

Services that claim strong returns from following insider filings typically:
- Use close-on-filing-date entry (impossible: filings arrive after hours or mid-day, you execute next open at best)
- Omit bid-ask spread and slippage from their simulations
- Cherry-pick a bull market period or high-score signal subset
- Show gross returns without benchmarking against SPY

None of that is necessarily fraudulent — it's just not what you'd actually earn. Our simulation replicates the real execution constraints and shows the gap.

### Caveats

1. **Transaction costs are everything.** Average alpha per 7-day trade is ~0.68%. A round-trip on small/mid caps costs 0.6–1.5% (spread + slippage + commission). At the high end this strategy is negative after costs. The 177% pre-cost figure is not achievable in practice.

2. **2023–2024 was an exceptional bull market.** SPY returned +25.9% annualized. The long-only bias in insider buys captured broad market momentum. Expected performance in flat or down markets is lower and untested.

3. **Survivorship bias.** Tickers that were delisted, halted, or acquired may be underrepresented in the price cache. This slightly flatters results by dropping the worst outcomes.

4. **Price moves before entry are not modeled.** When multiple insiders at the same company buy on the same day, the stock may already have moved before you can execute. The 1-day delay helps but doesn't fully resolve this.

5. **Concentrated portfolio.** At 10% of cash per signal with 7-day holds, you run ~7–10 simultaneous positions on average. Individual position variance is high.

6. **Long-only.** Excess return over SPY is not directly capturable without shorting SPY, which has its own carry cost.
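
The arithmetic behind the first caveat can be made explicit. The sketch below subtracts a round-trip cost from the average 7-day alpha reported in the per-signal statistics; it is a simple per-trade subtraction that ignores compounding and position sizing, and the names are illustrative.

```python
AVG_ALPHA_7D = 0.0068  # average pre-cost alpha per 7-day trade (per-signal table)


def net_alpha(round_trip_cost: float, gross_alpha: float = AVG_ALPHA_7D) -> float:
    """Per-trade alpha left after paying the round-trip cost."""
    return gross_alpha - round_trip_cost


# Low, middle, and high ends of the cost range quoted in the caveats
for cost in (0.006, 0.010, 0.015):
    print(f"cost {cost:.1%}: net alpha {net_alpha(cost):+.2%}")
# cost 0.6%: net alpha +0.08%
# cost 1.0%: net alpha -0.32%
# cost 1.5%: net alpha -0.82%
```

Break-even sits at a round-trip cost equal to the gross alpha, about 0.68%, which is near the bottom of the quoted cost range.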

## Position lifecycle

Positions are tracked in the `signals` table. When a trade is executed, `executed_at` is recorded. On each poll cycle the poller checks for positions where `executed_at` is older than `HOLDING_PERIOD_DAYS` and calls Alpaca to close them, marking `closed=1` in the DB.
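
As a rough illustration of the expiry check, the sketch below selects open positions whose `executed_at` is older than the holding period. Rows are modeled as plain dicts with `executed_at` and `closed` keys, and `positions_to_close` is a hypothetical helper; the actual code queries the `signals` table through SQLAlchemy and then calls Alpaca to close each position.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List

HOLDING_PERIOD_DAYS = 90  # default from config.py


def positions_to_close(signals: Iterable[Dict[str, Any]], now: datetime,
                       holding_days: int = HOLDING_PERIOD_DAYS) -> List[Dict[str, Any]]:
    """Return executed, still-open rows held for longer than `holding_days`."""
    cutoff = now - timedelta(days=holding_days)
    return [
        row for row in signals
        if row.get("executed_at") is not None   # a trade was actually placed
        and not row.get("closed")               # not already closed
        and row["executed_at"] < cutoff         # older than the holding period
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "executed_at": datetime(2024, 1, 2, tzinfo=timezone.utc), "closed": 0},
    {"id": 2, "executed_at": datetime(2024, 5, 20, tzinfo=timezone.utc), "closed": 0},
    {"id": 3, "executed_at": None, "closed": 0},
]
expired_ids = [row["id"] for row in positions_to_close(rows, now)]  # [1]
```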

## Modules

| Path | Purpose |
|---|---|
| `config.py` | All thresholds and env-var loading |
| `ingestion/edgar_poller.py` | EDGAR Atom feed polling and deduplication |
| `ingestion/sec_bulk_ingest.py` | Bulk historical ingest via quarterly form.idx archives |
| `ingestion/form4_parser.py` | Form 4 XML → structured dict; 10b5-1 detection; tx_code extraction |
| `db/models.py` | SQLAlchemy ORM models (`Filing`, `Signal`, `PriceCache`) |
| `db/db.py` | DB access layer — dedup-safe inserts, chunked IN queries, price cache |
| `signals/filter_engine.py` | Filing → signal pipeline (open-market-only, as-of-date aware) |
| `signals/cluster_detector.py` | Cluster detection from DB (as-of-date aware) |
| `alerts/slack_alert.py` | Slack webhook alert |
| `broker/alpaca_client.py` | Alpaca order execution + position exit |
| `backtest/backtest.py` | Per-signal historical backtest runner |
| `backtest/simulate.py` | Portfolio simulator with configurable costs |
| `main.py` | CLI entry point (`run` / `backfill` / `backtest` / `simulate`) |

## Requirements

- Python 3.11+
- See `requirements.txt`: `requests`, `lxml`, `cssselect`, `yfinance`, `python-dotenv`, `alpaca-trade-api`, `sqlalchemy`