feat: cap-tier filtering, Alpaca cost model, README cleanup

- simulate.py: --cap-tier large|mid|small|micro; yfinance market cap fetch with DB cache (ticker_meta table); argv fix for main.py dispatch
- plot.py: equity curves now show cap tiers with Alpaca costs (zero commission); HP sweep uses Alpaca cost decomposition; SPY line clamped to last strategy date
- db/models.py: TickerMeta table
- db/db.py: get_cached_market_caps, upsert_market_caps
- README: add --cap-tier to simulate docs; backfill note (~3 days for 2 years at SEC 10 req/s limit); remove duplicate setup block; remove em-dashes in prose; results table tilde estimates to be updated once cap-tier sims complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
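
Note on the Alpaca cost decomposition named in the commit message: the sketch below restates the 50/50 split used by the HP sweep and checks it against the README round-trip formula (spread x 2 + slippage + commission x 2). The function names `alpaca_costs` and `roundtrip` are hypothetical and chosen for illustration; the diff itself uses a local helper called `_costs`.

```python
def alpaca_costs(rt: float) -> dict[str, float]:
    """Split a round-trip cost into one-way spread and slippage, with zero commission."""
    # Half of the round trip goes to spread (paid twice, so one-way = rt * 0.25),
    # the other half to slippage (paid once at entry).
    return {"spread": rt * 0.25, "slippage": rt * 0.5, "commission": 0.0}


def roundtrip(spread: float, slippage: float, commission: float) -> float:
    """Recombine one-way costs using the README formula."""
    return 2 * spread + slippage + 2 * commission


costs = alpaca_costs(0.005)   # target: 0.5% round-trip
total = roundtrip(**costs)    # recombines to the target, up to float rounding
```

Because spread is charged at both entry and exit, a scenario whose one-way spread equals its slippage has a round trip of three times that value, not two.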
2026-05-26 18:10:09 +02:00 · 2026-05-26 18:10:09 +02:00 · d0e98b9cb7
commit d0e98b9cb7
parent 56ec0b4a81
6 changed files with 127 additions and 381 deletions
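
Review note on the tier boundaries: the sketch below shows how a single market cap maps to a tier, assuming the bounds in `CAP_TIERS` are an inclusive minimum and an exclusive maximum, as the filter in `simulate()` implies. The helper `tier_of` is hypothetical and not part of the commit, and returning `None` for an unknown market cap is an assumption of this sketch rather than the committed behaviour.

```python
from typing import Optional

# Bounds copied from CAP_TIERS in the diff: (inclusive minimum, exclusive maximum)
CAP_TIERS = {
    "large": (10_000_000_000, None),
    "mid":   (2_000_000_000, 10_000_000_000),
    "small": (300_000_000,   2_000_000_000),
    "micro": (0,             300_000_000),
}


def tier_of(market_cap: Optional[float]) -> Optional[str]:
    """Return the tier name for a market cap, or None when the cap is unknown."""
    if market_cap is None:
        return None  # assumption: unknown caps belong to no tier
    for name, (cap_min, cap_max) in CAP_TIERS.items():
        if market_cap >= cap_min and (cap_max is None or market_cap < cap_max):
            return name
    return None  # negative values match no tier
```

A cap of exactly $10B lands in `large` and exactly $2B in `mid`, since each lower bound is inclusive.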
--- a/PLAN.md
+++ b/PLAN.md
@@ -1,340 +0,0 @@
-# Insider Copytrade System -- Implementation Plan
-
-## Description
-
-A personal system that monitors SEC EDGAR Form 4 filings in real-time, filters for high-quality insider buying signals, alerts via Slack, and optionally executes trades automatically through Alpaca's paper or live trading API.
-
-The system is fully self-hosted, uses only free/public data sources, and requires no third-party data subscriptions.
-
---
-
-## Background
-
-Company insiders (executives, directors, >10% shareholders) must file SEC Form 4 within 2 business days of any trade. This is public data via SEC EDGAR. The signal value of insider *buying* is academically documented -- executives buying their own stock with personal capital is a meaningful vote of confidence, particularly when:
-
- Multiple insiders buy simultaneously (cluster signal)
- The trade is unplanned (not a 10b5-1 scheduled plan)
- The company is small/mid-cap (less institutional arbitrage)
-
-The edge vs. political trade copying: 2-day disclosure lag vs. 45 days, and the signal is company-specific rather than sector-level.
-
-**Key risk:** This signal is publicly known and tracked. The edge is in filtering quality and execution speed, not data exclusivity. Large-cap Form 4 signals are arbitraged quickly. Focus on small/mid-cap, clustered, unplanned buys.
-
---
-
-## System Outline
-
-```
-SEC EDGAR RSS Feed (poll every 10 min)
-        |
-   [Ingestion Layer]
-        |
-   Parse Form 4 XML
-        |
-   [Filter Engine]
-    - Buy only (flag = A)
-    - Exclude 10b5-1 plans
-    - Min transaction size
-    - Role weighting
-    - Cluster detection
-        |
-   SQLite Database
-        |
-   ┌────────────┬──────────────┐
-   |            |              |
-[Backtester] [Slack Alert]  [Alpaca API]
-             (manual)      (paper/live)
-```
-
---
-
-## Actionables
-
-### Phase 1 -- Data Ingestion
-
-**Goal:** Reliably pull and parse Form 4 filings as they appear.
-
-**Tasks:**
-
-1. Set up project structure
-```
-insider-copytrade/
-  ingestion/
-    edgar_poller.py      # polls EDGAR RSS
-    form4_parser.py      # parses XML -> structured dict
-  db/
-    schema.sql
-    db.py                # SQLite interface
-  signals/
-    filter_engine.py     # applies signal filters
-    cluster_detector.py
-  alerts/
-    slack_alert.py
-  broker/
-    alpaca_client.py
-  backtest/
-    backtest.py
-  config.py
-  main.py
-```
-
-2. Poll EDGAR RSS for Form 4 filings every 10 minutes:
-```
-https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=4&dateb=&owner=include&count=40&search_text=&action=getcurrent
-```
-SEC also provides a structured latest filings feed:
-```
-https://efts.sec.gov/LATEST/search-index?q=&forms=4
-```
-
-3. For each new filing, fetch and parse the XML document. Key fields to extract:
-   - `issuerTradingSymbol` (ticker)
-   - `rptOwnerName`, `officerTitle` (insider name + role)
-   - `transactionDate`
-   - `transactionAcquiredDisposedCode` (A = buy, D = sell)
-   - `transactionShares`, `transactionPricePerShare`
-   - `transactionTotalValue` (compute if not present)
-   - `footnotes` (check for "10b5-1" mention)
-   - `sharesOwnedFollowingTransaction`
-
-4. Store raw filing XML + parsed fields. Track `accessionNumber` as dedup key.
-
-**SQLite schema:**
-```sql
-CREATE TABLE filings (
-    id INTEGER PRIMARY KEY AUTOINCREMENT,
-    accession_number TEXT UNIQUE,
-    ticker TEXT,
-    cik TEXT,
-    insider_name TEXT,
-    role TEXT,
-    transaction_date TEXT,
-    filed_date TEXT,
-    shares REAL,
-    price REAL,
-    total_value REAL,
-    flag TEXT,           -- A or D
-    is_10b51 INTEGER,    -- 0 or 1
-    post_tx_shares REAL,
-    created_at TEXT
-);
-
-CREATE TABLE signals (
-    id INTEGER PRIMARY KEY AUTOINCREMENT,
-    ticker TEXT,
-    trigger_date TEXT,
-    cluster_size INTEGER,
-    total_cluster_value REAL,
-    score REAL,
-    alerted INTEGER DEFAULT 0,
-    executed INTEGER DEFAULT 0,
-    created_at TEXT
-);
-```
-
---
-
-### Phase 2 -- Filter Engine
-
-**Goal:** Reduce noise to actionable signals only.
-
-**Filters to apply (in order):**
-
-| Filter | Logic |
-|---|---|
-| Buy only | `flag == 'A'` |
-| Exclude 10b5-1 | Scan footnotes for "10b5-1", "Rule 10b5", "adopted a plan" |
-| Min transaction value | `total_value >= 50000` (configurable) |
-| Exclude derivative transactions | Options exercises are weaker signal than open market purchases |
-| Role weighting | CEO/CFO/President = high; Director = medium; 10% owner = context-dependent |
-| Cluster detection | 2+ insiders buying same ticker within 30 days = elevated signal |
-
-**Scoring formula (simple v1):**
-```python
-score = base_role_weight * log(total_value) * cluster_multiplier
-# cluster_multiplier = 1.0 + (0.5 * (cluster_size - 1))
-```
-
-Expose all thresholds in `config.py` for easy tuning during backtesting.
-
---
-
-### Phase 3 -- SQLite Storage
-
-SQLite is sufficient for this workload (low write volume, single process). Use WAL mode for concurrent reads during backtesting:
-
-```python
-conn = sqlite3.connect('insider.db')
-conn.execute('PRAGMA journal_mode=WAL')
-```
-
-Keep raw filing XML in a `/data/filings/` directory keyed by accession number. Parse on ingest, re-parse never needed.
-
---
-
-### Phase 4 -- Slack Alerts
-
-**Goal:** Get notified immediately when a signal fires, with enough context to decide manually.
-
-1. Create a Slack app, get a webhook URL (takes 5 minutes)
-2. Alert format:
-
-```
-INSIDER BUY SIGNAL
-Ticker:   $ACME
-Insider:  John Smith (CEO)
-Date:     2025-05-01
-Shares:   10,000 @ $14.50 = $145,000
-Cluster:  3 insiders in last 14 days
-Score:    8.4
-10b5-1:   No
-EDGAR:    https://www.sec.gov/cgi-bin/browse-edgar?...
-```
-
-3. Alert only on signals above configurable score threshold
-4. Mark `alerted = 1` in DB after sending to avoid duplicates on re-poll
-
-```python
-import requests
-
-def send_slack_alert(webhook_url, signal):
-    requests.post(webhook_url, json={"text": format_signal(signal)})
-```
-
---
-
-### Phase 5 -- Backtesting
-
-**Goal:** Validate filter parameters on historical data before going live.
-
-**Data:**
- Historical Form 4 filings: download bulk XML from `https://www.sec.gov/dera/data/form-4-data`
- Price data: `yfinance` (free, sufficient for backtesting)
-
-**Backtest logic:**
-```python
-# For each signal in historical data:
-# - Entry: next market open after filed_date
-# - Exit: N days later (configurable: 30/60/90/180)
-# - Calculate return vs SPY over same period
-# - Aggregate by role, cluster_size, market_cap bucket
-```
-
-**Use `vectorbt` for performance:**
-```python
-import vectorbt as vbt
-# Build entry/exit signal matrices aligned to price data
-# Run portfolio simulation with configurable position sizing
-```
-
-**Output metrics:**
- Annualized return vs SPY benchmark
- Win rate
- Avg return by holding period
- Avg return by role / cluster size
- Max drawdown
- Sharpe ratio
-
-**Critical:** Test on post-2022 data specifically. Pre-2022 results are likely inflated -- the signal became widely tracked after Autopilot/media coverage.
-
-**Parameter grid to test:**
-```python
-MIN_VALUE = [25_000, 50_000, 100_000]
-HOLDING_DAYS = [30, 60, 90, 180]
-CLUSTER_WINDOW = [14, 30]
-MIN_CLUSTER_SIZE = [1, 2, 3]
-ROLES = ['all', 'c-suite-only']
-```
-
---
-
-### Phase 6 -- Alpaca Integration
-
-**Goal:** Optionally auto-execute signals. Start with paper trading.
-
-**Paper trading base URL:** `https://paper-api.alpaca.markets`
-**Live trading base URL:** `https://api.alpaca.markets`
-
-Swap via config flag -- never hardcode.
-
-```python
-from alpaca_trade_api import REST
-
-api = REST(
-    key_id=config.ALPACA_KEY,
-    secret_key=config.ALPACA_SECRET,
-    base_url=config.ALPACA_BASE_URL  # paper or live
-)
-
-def execute_signal(ticker, portfolio_value, signal_score):
-    # Fixed fractional sizing: 2% of portfolio per signal
-    price = api.get_latest_trade(ticker).price
-    allocation = portfolio_value * 0.02
-    qty = int(allocation / price)
-    if qty < 1:
-        return
-    api.submit_order(
-        symbol=ticker,
-        qty=qty,
-        side='buy',
-        type='market',
-        time_in_force='day'
-    )
-```
-
-Position sizing: start at 2% per signal, max 10% in any single ticker. Add a max open positions limit (e.g. 20) to cap exposure.
-
-Exit logic (v1): time-based only (close after N days). Add trailing stop later.
-
---
-
-## Build Order
-
-| Step | Deliverable | Est. Time |
-|---|---|---|
-| 1 | EDGAR poller + Form 4 XML parser + SQLite storage | 1 day |
-| 2 | Filter engine + cluster detector | 0.5 day |
-| 3 | Slack alert | 1 hour |
-| 4 | Historical data download + backtest | 1-2 days |
-| 5 | Alpaca paper trading integration | 0.5 day |
-| 6 | Run paper trading 4-8 weeks, monitor | -- |
-| 7 | Switch to live with small capital | -- |
-
-Do not proceed to Step 7 without meaningful paper trading history.
-
---
-
-## Dependencies
-
-```
-requests
-lxml
-sqlite3 (stdlib)
-yfinance
-vectorbt
-alpaca-trade-api
-python-dotenv
-```
-
-All free. No paid APIs required.
-
---
-
-## Config Template
-
-```python
-# config.py
-EDGAR_POLL_INTERVAL = 600        # seconds
-MIN_TRANSACTION_VALUE = 50_000
-MIN_CLUSTER_SIZE = 1             # raise to 2 for higher quality
-CLUSTER_WINDOW_DAYS = 30
-HOLDING_PERIOD_DAYS = 90
-POSITION_SIZE_PCT = 0.02         # 2% per signal
-MAX_POSITIONS = 20
-SCORE_ALERT_THRESHOLD = 5.0
-
-SLACK_WEBHOOK_URL = ""
-ALPACA_KEY = ""
-ALPACA_SECRET = ""
-ALPACA_BASE_URL = "https://paper-api.alpaca.markets"  # switch for live
-```
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ cp .env.example .env  # fill in credentials
 # Live polling (every 10 min)
 python main.py run

-# Bulk-ingest historical filings
+# Bulk-ingest historical filings (2 years took ~3 days at SEC's 10 req/s rate limit)
 python main.py backfill --years 2023 2024
 python main.py backfill --year 2024 --quarter 1

@@ -62,14 +62,15 @@ python main.py plot

 ```
 Strategy:
-  --holding-days N      Days to hold each position (default: 7)
-  --buy-delay N         Days after signal to enter (default: 1)
-  --position-size F     Fraction of available cash per trade (default: 0.10)
-  --min-score F         Minimum signal score (default: 0.0)
-  --min-cluster N       Minimum cluster size (default: 1)
-  --capital F           Initial capital (default: 100000)
+  --holding-days N                  Days to hold each position (default: 7)
+  --buy-delay N                     Days after signal to enter (default: 1)
+  --position-size F                 Fraction of available cash per trade (default: 0.10)
+  --min-score F                     Minimum signal score (default: 0.0)
+  --min-cluster N                   Minimum cluster size (default: 1)
+  --cap-tier large|mid|small|micro  Filter by market cap tier (default: all)
+  --capital F                       Initial capital (default: 100000)

-Transaction costs:
+Transaction costs (Alpaca has zero commission, set --commission 0):
  --spread F            One-way bid-ask half-spread at entry and exit (default: 0.003)
  --slippage F          Entry slippage / market impact (default: 0.002)
  --commission F        Per-trade commission as fraction of notional (default: 0.001)
@@ -77,12 +78,10 @@ Transaction costs:

 Round-trip = spread x 2 + slippage + commission x 2.

-## Setup
+Cap tiers: large >$10B, mid $2-10B, small $300M-2B, micro <$300M.
+Market caps are fetched from yfinance on first use and cached in the DB.

-```bash
-cp .env.example .env
-pip install -r requirements.txt
-```
+## Setup

 | Variable | Default | Description |
 |---|---|---|
@@ -147,7 +146,7 @@ Alpaca charges $0 commission on US equities. Real costs are spread + slippage on

 SPY annualised over the same period: ~+16%.

-Break-even is roughly 0.3-0.5% round-trip. On Alpaca that means large-cap stocks only -- but most insider buying happens in small and mid-cap names, so filtering aggressively kills signal count.
+Break-even is roughly 0.3-0.5% round-trip. On Alpaca that means large-cap stocks only. Most insider buying happens in small and mid-cap names, so filtering aggressively kills signal count.

 ### Is insidercopytrading.com a scam?

@@ -174,14 +173,14 @@ Alpaca integration exists in the codebase (`broker/alpaca_client.py`) but is not
 | `ingestion/edgar_poller.py` | EDGAR Atom feed polling |
 | `ingestion/sec_bulk_ingest.py` | Bulk historical ingest via form.idx |
 | `ingestion/form4_parser.py` | Form 4 XML parser; 10b5-1 detection |
-| `db/models.py` | SQLAlchemy ORM models |
+| `db/models.py` | SQLAlchemy ORM models (Filing, Signal, PriceCache, TickerMeta) |
 | `db/db.py` | DB access layer |
 | `signals/filter_engine.py` | Filing to signal pipeline |
 | `signals/cluster_detector.py` | Cluster detection |
 | `alerts/slack_alert.py` | Slack webhook |
 | `broker/alpaca_client.py` | Alpaca order execution |
 | `backtest/backtest.py` | Per-signal backtest |
-| `backtest/simulate.py` | Portfolio simulator |
+| `backtest/simulate.py` | Portfolio simulator with cap-tier filtering |
 | `backtest/plot.py` | Plot generator |
 | `main.py` | CLI: `run / backfill / backtest / simulate / plot` |

--- a/backtest/plot.py
+++ b/backtest/plot.py
@@ -41,15 +41,11 @@ def plot_hp_heatmap(prices: dict, out_dir: str = PLOTS_DIR) -> str:
    hold_days   = [3, 5, 7, 10, 14, 21, 30]
    rt_pcts     = [0.3, 0.5, 0.7, 1.0, 1.2, 1.5, 2.0]

-    # decompose round-trip into (spread, slippage, commission) that sum correctly:
-    # roundtrip = 2*spread + slippage + 2*commission
-    # allocate  40% spread, 40% slippage, 20% commission  (all relative to RT)
-    # => spread = RT*0.4/2 = RT*0.2  (one-way)
-    # => slippage = RT*0.4
-    # => commission = RT*0.2/2 = RT*0.1  (one-way)
-    # verify: 2*0.2 + 0.4 + 2*0.1 = 0.4+0.4+0.2 = 1.0 * RT  ✓
+    # Alpaca: zero commission. Decompose RT into spread + slippage only (50/50).
+    # roundtrip = 2*spread + slippage  =>  spread = RT*0.25, slippage = RT*0.5
+    # verify: 2*0.25 + 0.5 = 1.0 * RT ✓
    def _costs(rt):
-        return dict(spread=rt * 0.2, slippage=rt * 0.4, commission=rt * 0.1)
+        return dict(spread=rt * 0.25, slippage=rt * 0.5, commission=0)

    rows_excess = []
    rows_ann    = []
@@ -116,7 +112,7 @@ def plot_hp_heatmap(prices: dict, out_dir: str = PLOTS_DIR) -> str:
                ax.text(j, i, txt, ha="center", va="center", fontsize=7.5, color=color)

    fig.suptitle(
-        "HP sweep: 1-day entry delay, 10% position size, buy filter only",
+        "HP sweep: Alpaca (zero commission), 1-day entry delay, 10% position size, all cap tiers",
        fontsize=12,
    )
    plt.tight_layout()
@@ -135,22 +131,25 @@ def plot_equity_curves(prices: dict, out_dir: str = PLOTS_DIR) -> str:
    """
    matplotlib, plt, mdates, np = _get_matplotlib()

+    # Alpaca zero-commission costs by cap tier (spread + slippage only)
    scenarios = [
-        {"label": "0% RT cost (theoretical)",    "spread": 0,      "slippage": 0,      "commission": 0},
-        {"label": "0.67% RT (best case)",         "spread": 0.0014, "slippage": 0.0027, "commission": 0.0007},
-        {"label": "1.0% RT (mid)",                "spread": 0.002,  "slippage": 0.004,  "commission": 0.001},
-        {"label": "1.5% RT (realistic small-cap)","spread": 0.003,  "slippage": 0.006,  "commission": 0.0015},
+        {"label": "Large cap  (~0.2% RT)", "cap_tier": "large", "spread": 0.0005,  "slippage": 0.001},
+        {"label": "Mid cap    (~0.5% RT)", "cap_tier": "mid",   "spread": 0.00125, "slippage": 0.0025},
+        {"label": "Small cap  (~0.8% RT)", "cap_tier": "small", "spread": 0.002,   "slippage": 0.004},
+        {"label": "All tickers (0% RT)",   "cap_tier": None,    "spread": 0,       "slippage": 0},
    ]

    fig, ax = plt.subplots(figsize=(13, 7))

-    colors  = ["#2ecc71", "#3498db", "#e67e22", "#e74c3c"]
-    sim_start = sim_end = None
+    colors  = ["#2ecc71", "#3498db", "#e67e22", "#aaaaaa"]
+    sim_start = None
+    last_curve_date = None

    for sc, color in zip(scenarios, colors):
        s = Strategy(
            holding_days=7, buy_delay=1,
-            spread=sc["spread"], slippage=sc["slippage"], commission=sc["commission"],
+            spread=sc["spread"], slippage=sc["slippage"], commission=0,
+            cap_tier=sc["cap_tier"],
        )
        r = simulate(s, prices=prices)
        curve = r.get("equity_curve", [])
@@ -158,7 +157,7 @@ def plot_equity_curves(prices: dict, out_dir: str = PLOTS_DIR) -> str:
            continue

        sim_start = sim_start or r["period"]["start"]
-        sim_end   = r["period"]["end"]
+        last_curve_date = max(last_curve_date or curve[-1][0], curve[-1][0])  # latest equity-curve date across scenarios

        dates  = [datetime.strptime(d, "%Y-%m-%d") for d, _ in curve]
        values = [v for _, v in curve]
@@ -166,10 +165,10 @@ def plot_equity_curves(prices: dict, out_dir: str = PLOTS_DIR) -> str:
        ax.plot(dates, [v / base * 100 for v in values],
                label=sc["label"], color=color, linewidth=1.8)

-    # SPY buy-and-hold overlay
+    # SPY buy-and-hold overlay, clamped to the last data point of the strategy curves
    spy_px = prices.get("SPY", {})
-    if spy_px and sim_start and sim_end:
-        spy_dates = sorted(d for d in spy_px if sim_start <= d <= sim_end)
+    if spy_px and sim_start and last_curve_date:
+        spy_dates = sorted(d for d in spy_px if sim_start <= d <= last_curve_date)
        if spy_dates:
            base = spy_px[spy_dates[0]]
            ax.plot(
@@ -182,7 +181,7 @@ def plot_equity_curves(prices: dict, out_dir: str = PLOTS_DIR) -> str:
    ax.set_xlabel("Date", fontsize=11)
    ax.set_ylabel("Portfolio value (indexed to 100)", fontsize=11)
    ax.set_title(
-        "Insider Copytrade: equity curves vs SPY  (7d hold, 1d delay, 10% position size)",
+        "Insider Copytrade: equity curves by cap tier, Alpaca costs  (7d hold, 1d delay, 10% position size)",
        fontsize=12,
    )
    ax.legend(fontsize=10)
--- a/backtest/simulate.py
+++ b/backtest/simulate.py
@@ -32,7 +32,39 @@ from datetime import datetime, timedelta
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

 import config
-from db.db import get_signals_for_backtest
+from db.db import get_signals_for_backtest, get_cached_market_caps, upsert_market_caps
+
+CAP_TIERS = {
+    "large": (10_000_000_000, None),
+    "mid":   (2_000_000_000, 10_000_000_000),
+    "small": (300_000_000,   2_000_000_000),
+    "micro": (0,             300_000_000),
+}
+
+
+def _fetch_market_caps(tickers: list[str]) -> dict[str, float]:
+    """Return market cap for each ticker, using DB cache then yfinance for misses."""
+    import yfinance as yf
+
+    cached = get_cached_market_caps(tickers)
+    missing = [t for t in tickers if t not in cached]
+
+    if missing:
+        logger.info(f"Fetching market caps for {len(missing)} tickers via yfinance...")
+        fetched = {}
+        for ticker in missing:
+            try:
+                info = yf.Ticker(ticker).fast_info
+                cap = getattr(info, "market_cap", None)
+                if cap:
+                    fetched[ticker] = float(cap)
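+            except Exception as exc:  # network error or unknown symbol: skip it, retry on the next run
+                logger.warning(f"Market cap fetch failed for {ticker}: {exc}")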
+        if fetched:
+            upsert_market_caps(fetched)
+        cached.update(fetched)
+
+    return cached

 logger = logging.getLogger(__name__)

@@ -92,6 +124,7 @@ class Strategy:
        spread: float = 0.003,
        slippage: float = 0.002,
        commission: float = 0.001,
+        cap_tier: str = None,
    ):
        self.holding_days = holding_days
        self.buy_delay = buy_delay
@@ -102,6 +135,7 @@ class Strategy:
        self.spread = spread
        self.slippage = slippage
        self.commission = commission
+        self.cap_tier = cap_tier  # "large" | "mid" | "small" | "micro" | None

    # cost applied at entry: half-spread + slippage + commission
    @property
@@ -137,6 +171,22 @@ def simulate(strategy: Strategy, prices: dict = None) -> dict:
    if not signals:
        return {"error": "No signals after filtering"}

+    if strategy.cap_tier:
+        tier = CAP_TIERS.get(strategy.cap_tier)
+        if tier is None:
+            raise ValueError(f"Unknown cap_tier {strategy.cap_tier!r}. Use: {list(CAP_TIERS)}")
+        cap_min, cap_max = tier
+        tickers = list({s["ticker"] for s in signals})
+        market_caps = _fetch_market_caps(tickers)
+        signals = [
+            s for s in signals
+            if (cap := market_caps.get(s["ticker"])) is not None  # unknown cap is excluded, not counted as micro
+            and cap >= cap_min and (cap_max is None or cap < cap_max)
+        ]
+        logger.info(f"Cap tier '{strategy.cap_tier}': {len(signals)} signals after filtering")
+        if not signals:
+            return {"error": f"No signals after cap_tier={strategy.cap_tier} filter"}
+
    if prices is None:
        prices = _load_all_prices()

@@ -291,6 +341,7 @@ def simulate(strategy: Strategy, prices: dict = None) -> dict:
            "min_score": strategy.min_score,
            "min_cluster": strategy.min_cluster,
            "roundtrip_cost_pct": round(strategy.roundtrip_cost * 100, 3),
+            "cap_tier": strategy.cap_tier or "all",
        },
        "period": {
            "start": equity_curve[0][0] if equity_curve else "n/a",
@@ -338,7 +389,7 @@ def _print_results(r: dict):
    print(f"{'=' * w}")
    print(f"  Strategy")
    print(f"    Hold: {s['holding_days']}d  |  Delay: {s['buy_delay']}d  |  Size: {s['position_size']*100:.0f}% of cash")
-    print(f"    Score ≥ {s['min_score']}  |  Cluster ≥ {s['min_cluster']}")
+    print(f"    Score ≥ {s['min_score']}  |  Cluster ≥ {s['min_cluster']}  |  Cap: {s['cap_tier']}")
    print(f"    Round-trip cost: {s['roundtrip_cost_pct']:.2f}%")
    print(f"  Period: {period['start']} → {period['end']}  ({period['years']}y)")
    print(f"{'─' * w}")
@@ -373,6 +424,8 @@ def main():
                        help="Fraction of available cash per trade (0.10 = 10%%)")
    parser.add_argument("--min-score", type=float, default=0.0)
    parser.add_argument("--min-cluster", type=int, default=1)
+    parser.add_argument("--cap-tier", choices=["large", "mid", "small", "micro"],
+                        default=None, help="Filter by market cap tier")
    parser.add_argument("--capital", type=float, default=100_000.0)
    # Costs
    parser.add_argument("--spread", type=float, default=0.003,
@@ -382,7 +435,11 @@ def main():
    parser.add_argument("--commission", type=float, default=0.001,
                        help="Per-trade commission as fraction of notional")

-    args = parser.parse_args()
+    # When invoked via `python main.py simulate ...`, argv[1] is 'simulate' -- skip it
+    raw = sys.argv[1:]
+    if raw and raw[0] == "simulate":
+        raw = raw[1:]
+    args = parser.parse_args(raw)

    from db.db import init_db
    init_db()
@@ -397,6 +454,7 @@ def main():
        spread=args.spread,
        slippage=args.slippage,
        commission=args.commission,
+        cap_tier=args.cap_tier,
    )

    result = simulate(strategy)
--- a/db/db.py
+++ b/db/db.py
@@ -6,7 +6,7 @@ from sqlalchemy.exc import IntegrityError
 from sqlalchemy.orm import Session

 import config
-from db.models import Base, Filing, PriceCache, Signal
+from db.models import Base, Filing, PriceCache, Signal, TickerMeta


 def _engine():
@@ -219,6 +219,28 @@ def get_signals_for_backtest(min_score: float, min_cluster_size: int) -> list[di
        return [_signal_to_dict(r) for r in rows]


+def get_cached_market_caps(tickers: list[str]) -> dict[str, float]:
+    if not tickers:
+        return {}
+    with _session() as session:
+        rows = session.scalars(
+            select(TickerMeta).where(TickerMeta.ticker.in_(tickers))
+        ).all()
+    return {r.ticker: r.market_cap for r in rows if r.market_cap is not None}
+
+
+def upsert_market_caps(caps: dict[str, float]) -> None:
+    with _session() as session:
+        for ticker, cap in caps.items():
+            existing = session.get(TickerMeta, ticker)
+            if existing:
+                existing.market_cap = cap
+                existing.fetched_at = datetime.utcnow()
+            else:
+                session.add(TickerMeta(ticker=ticker, market_cap=cap))
+        session.commit()
+
+
 def get_cached_prices(ticker: str, start_date: str, end_date: str) -> dict[str, float]:
    with _session() as session:
        rows = session.scalars(
--- a/db/models.py
+++ b/db/models.py
@@ -66,6 +66,14 @@ class Signal(Base):
    )


+class TickerMeta(Base):
+    __tablename__ = "ticker_meta"
+
+    ticker = Column(String, primary_key=True)
+    market_cap = Column(Float, nullable=True)
+    fetched_at = Column(DateTime, default=datetime.utcnow)
+
+
 class PriceCache(Base):
    __tablename__ = "price_cache"