feat: Insider Copytrade POC + PLAN.md #2

Merged
dodox merged 4 commits from claude/issue-1-insider-copytrade-poc into master 2026-05-04 19:38:21 +02:00
Collaborator

Closes #1

What's included

PLAN.md — Full implementation plan from the issue, committed as a doc in the repo.

POC implementation — All 6 phases stubbed out and functional:

| Module | Description |
|---|---|
| `config.py` | All thresholds + API keys via `.env` |
| `ingestion/edgar_poller.py` | Polls EDGAR Atom feed every 10 min, dedupes by accession number |
| `ingestion/form4_parser.py` | Parses Form 4 XML → structured dict; detects 10b5-1 in footnotes |
| `db/schema.sql` + `db/db.py` | SQLite WAL-mode schema; `filings` + `signals` tables |
| `signals/filter_engine.py` | Buy-only, 10b5-1 exclusion, min value, role weighting, scoring |
| `signals/cluster_detector.py` | Counts unique insider buys per ticker within window |
| `alerts/slack_alert.py` | Formats and POSTs to Slack webhook; marks alerted in DB |
| `broker/alpaca_client.py` | Paper/live order execution; 2% position sizing; 10% per-ticker cap |
| `backtest/backtest.py` | Loads DB signals, fetches prices via yfinance, computes alpha vs SPY |
| `main.py` | CLI: `python main.py [run \| fetch-once \| backtest]` |
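
To make the cluster-detector row concrete, here is a minimal sketch of counting unique insider buyers per ticker within a window. It is an illustration only: the dict keys (`ticker`, `insider`, `date`), the function name, the 14-day default, and the sample data are all assumptions, and the real `signals/cluster_detector.py` reads from the DB and may work differently.

```python
from datetime import date, timedelta


def count_cluster_buyers(buys, ticker, as_of, window_days=14):
    """Count distinct insiders who bought `ticker` in the window ending at `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    insiders = {
        buy["insider"]
        for buy in buys
        if buy["ticker"] == ticker and cutoff <= buy["date"] <= as_of
    }
    return len(insiders)


# Invented sample data (hypothetical tickers and names).
buys = [
    {"ticker": "ACME", "insider": "Jane Doe", "date": date(2026, 4, 28)},
    {"ticker": "ACME", "insider": "Jane Doe", "date": date(2026, 4, 30)},  # same insider, counted once
    {"ticker": "ACME", "insider": "John Roe", "date": date(2026, 5, 1)},
    {"ticker": "ACME", "insider": "Old Buyer", "date": date(2026, 1, 5)},  # outside the window
    {"ticker": "OTHR", "insider": "Jane Doe", "date": date(2026, 5, 1)},
]

cluster_size = count_cluster_buyers(buys, "ACME", as_of=date(2026, 5, 4))
print(cluster_size)
```

Counting distinct insiders rather than transactions keeps one person splitting a purchase across several days from looking like a cluster.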

Quick start

cp .env.example .env
# fill in SLACK_WEBHOOK_URL, ALPACA_KEY, ALPACA_SECRET
pip install -r requirements.txt
python main.py fetch-once   # one-shot ingest
python main.py run          # continuous polling loop
python main.py backtest     # backtest signals in DB
claude added 1 commit 2026-05-04 18:15:44 +02:00
- PLAN.md: full implementation plan from issue
- config.py: configurable thresholds, API keys via .env
- ingestion/: EDGAR RSS poller + Form 4 XML parser
- db/: SQLite schema + interface (WAL mode)
- signals/: filter engine (buy/10b5-1/value/role) + cluster detector
- alerts/: Slack webhook alert with score gating
- broker/: Alpaca paper/live trade execution
- backtest/: historical signal backtesting with yfinance
- main.py: CLI entrypoint (run | fetch-once | backtest)
Owner

Do we cache old data for backtesting?
Is this ready for testing?
Look it all over again. Sanity check.

@claude

And add a README

Author
Collaborator

Claude finished @dodox's task —— View job


Tasks

  • Gather context
  • Read all code files for sanity check
  • Answer: Does the backtester cache old data?
  • Answer: Is this ready for testing?
  • Write README.md
  • Commit and push

Do we cache old data for backtesting?

Partially. Raw XML filings are cached to disk at data/filings/<accession>.xml (see edgar_poller.py:66-71). However, historical price data is not cached — every python main.py backtest run makes fresh yfinance API calls for each signal. For a large DB this will be slow and could hit rate limits. A future improvement would be to persist price data to the DB or a local file.


Is this ready for testing?

For basic smoke testing: yes. python main.py fetch-once should work end-to-end with no credentials — it will ingest filings, store them, and attempt to generate signals. Slack/Alpaca paths are skipped gracefully when keys are absent.

Known issues / gaps found in the sanity check:

| Issue | Location | Severity |
|---|---|---|
| `_get_xml_url_from_index` is defined but never called; `_resolve_xml_url` is used instead, which constructs the URL differently and parses HTML (not Atom). The Atom-based function is dead code. | `edgar_poller.py:49-63` | Low (dead code, not a bug) |
| `cmd_fetch_once` calls `_on_new_filing(filing)` but ignores its return value via `signal = ...`; the variable assignment is missing. Not a crash, but signals won't be processed in fetch-once mode. | `main.py:64` | Medium |
| `_resolve_xml_url` parses the index page with `etree.fromstring` but the index is HTML, not XML. This will frequently raise `XMLSyntaxError` and silently return `None`, meaning many filings may be skipped. | `edgar_poller.py:109-123` | High |
| Backtester uses `stock_data["Close"].iloc[0]` as entry price regardless of whether that date is actually at or after the signal date, so it could use a date before the insider trade. | `backtest/backtest.py:69` | Medium |
| `insert_signal` tries to insert `signal["filing"]` and `signal["cluster_buys"]` keys but the SQL only has 5 columns. Those extra keys in the dict are harmless (named params ignore extras), but `total_cluster_value` from a cluster with `None` values sums to 0, which is fine. | `db/db.py:48-64` | Low |
| `MIN_CLUSTER_SIZE = 1` means every single insider buy that passes the value filter generates a signal. This is intentional for POC but will be very noisy. | `config.py:8` | Info |
| No position exit logic in `run` mode: `close_position_after_days` exists but is never called from the main loop. | `broker/alpaca_client.py:85` | Medium |

README has been added and pushed to the branch.


gitea-actions added 1 commit 2026-05-04 18:24:32 +02:00
Co-authored-by: dodox <dodox@users.noreply.local>
Owner

@claude address these issues.
No ugly code. How should this be implemented?
Refactor again if necessary.
Want a pretty codebase, no smells.

Also let's rebrand to 'Smaug'

Author
Collaborator

Claude finished @dodox's task —— View job


Tasks

  • Gather context & read all files
  • Fix: dead code + HTML-as-XML in edgar_poller.py (use lxml.html)
  • Fix: clean up cmd_fetch_once in main.py
  • Fix: backtester entry price alignment to signal date
  • Fix: add position exit scheduling to alpaca_client.py + main.py
  • Refactor: overall code quality pass
  • Rebrand to "Smaug"
  • Update README
  • Commit & push

What was done

ingestion/edgar_poller.py

  • Removed dead function _get_xml_url_from_index entirely
  • Fixed _resolve_xml_url: replaced etree.fromstring (which explodes on HTML) with lxml.html.fromstring + CSS selector table.tableFile a[href] — this is how EDGAR index pages are actually structured
  • Removed stale index_url pass-through (no longer needed since we derive the URL from the accession number directly)
  • Renamed EDGAR_FULL_INDEX → EDGAR_ATOM_URL to be explicit
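
The HTML-vs-XML point above can be reproduced in a few lines. The fix in this PR uses `lxml.html.fromstring` with the selector `table.tableFile a[href]`; the sketch below deliberately swaps in standard-library stand-ins (`xml.etree.ElementTree` for the strict parser, `html.parser.HTMLParser` for the lenient one) so it runs without extra dependencies. The sample markup, the class name `TableFileLinks`, and the URL path are invented for illustration and are not taken from EDGAR or from the PR.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample: an unclosed <br> and a bare attribute are legal HTML but not XML.
INDEX_HTML = """
<html><body>
<a href="/ignored.html">outside the table</a><br>
<table class="tableFile" summary="Document Format Files">
<tr><td nowrap><a href="/Archives/edgar/data/1/form4.xml">form4.xml</a></td></tr>
</table>
</body></html>
"""


class TableFileLinks(HTMLParser):
    """Collect hrefs of <a> tags nested inside <table class="tableFile">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while inside a tableFile table (nested tables included)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            if self.depth or "tableFile" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


# A strict XML parser rejects the page outright.
try:
    ET.fromstring(INDEX_HTML)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# A lenient HTML parser walks it without complaint.
parser = TableFileLinks()
parser.feed(INDEX_HTML)
xml_links = [href for href in parser.hrefs if href.endswith(".xml")]
print(xml_ok, xml_links)
```

Swallowing the parse error and returning `None`, as the original code reportedly did, would hide exactly this failure, which is why the issue was rated High.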

main.py

  • Renamed _on_new_filing → _process_filing (clearer intent)
  • cmd_fetch_once now calls _process_filing(filing) directly with no dead assignment
  • cmd_run calls _close_expired_positions() on every poll cycle via the on_new_filing closure

backtest/backtest.py

  • Entry price: _first_close_on_or_after(data, signal_date) — guaranteed to be at or after the insider trade
  • Exit price: _first_close_before(data, exit_date) — last available close before the hold expires
  • Filters (min_score, min_cluster_size) pushed into the SQL query instead of Python
  • Removed stale sqlite3 import at top-level (moved inside the loader function)
  • spy_returns renamed spy_cache; import math moved out of the loop
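
As a rough illustration of the entry/exit alignment described above, the sketch below picks the first close on or after a signal date and the last close strictly before an exit date. It is not the code from `backtest/backtest.py`: the function names, the plain `{date: close}` dict (the real code appears to work on yfinance data), and the prices are all assumptions.

```python
from bisect import bisect_left
from datetime import date


def close_on_or_after(closes, target):
    """Return (date, close) for the first trading day on or after `target`, or None."""
    days = sorted(closes)
    i = bisect_left(days, target)
    return (days[i], closes[days[i]]) if i < len(days) else None


def last_close_before(closes, target):
    """Return (date, close) for the last trading day strictly before `target`, or None."""
    days = sorted(closes)
    i = bisect_left(days, target)
    return (days[i - 1], closes[days[i - 1]]) if i > 0 else None


# Invented prices; weekends are simply absent, as in real daily data.
closes = {
    date(2026, 5, 1): 100.0,  # Friday
    date(2026, 5, 4): 102.0,  # Monday
    date(2026, 5, 5): 101.0,
    date(2026, 5, 8): 110.0,
}

# A signal dated Saturday rolls forward to Monday's close, never back to Friday's.
entry = close_on_or_after(closes, date(2026, 5, 2))
exit_ = last_close_before(closes, date(2026, 5, 8))
trade_return = exit_[1] / entry[1] - 1
print(entry, exit_, round(trade_return, 4))
```

Returning `None` when no qualifying day exists lets the caller skip the signal instead of silently pricing it off the wrong date.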

broker/alpaca_client.py

  • Added close_position(ticker, signal_id) — closes the position and marks closed=1 in DB
  • Added close_expired_positions() — queries signals for executed+unclosed rows, checks executed_at against HOLDING_PERIOD_DAYS, closes anything past the cutoff
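
The cutoff check behind `close_expired_positions()` might look roughly like the sketch below. It only covers the selection step, with signals as plain dicts; the function name `expired_signals`, the dict keys, the ISO timestamp format, and the 30-day value are assumptions for illustration, and the actual broker call and DB update are left out.

```python
from datetime import datetime, timedelta, timezone

HOLDING_PERIOD_DAYS = 30  # illustrative value, not taken from the PR


def expired_signals(signals, now, holding_days=HOLDING_PERIOD_DAYS):
    """Return executed, unclosed signals whose holding period has elapsed."""
    cutoff = now - timedelta(days=holding_days)
    return [
        s
        for s in signals
        if s["executed_at"] is not None
        and not s["closed"]
        and datetime.fromisoformat(s["executed_at"]) <= cutoff
    ]


# Invented rows; executed_at is assumed to be an ISO-8601 UTC timestamp string.
signals = [
    {"id": 1, "ticker": "ACME", "executed_at": "2026-04-01T14:30:00+00:00", "closed": False},
    {"id": 2, "ticker": "OTHR", "executed_at": "2026-04-20T14:30:00+00:00", "closed": False},
    {"id": 3, "ticker": "DONE", "executed_at": "2026-03-01T14:30:00+00:00", "closed": True},
    {"id": 4, "ticker": "WAIT", "executed_at": None, "closed": False},
]

now = datetime(2026, 5, 4, tzinfo=timezone.utc)
due = expired_signals(signals, now)
print([s["id"] for s in due])
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.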

db/schema.sql

  • Added executed_at TEXT and closed INTEGER DEFAULT 0 columns to signals
  • Added idx_signals_executed index

db/db.py

  • mark_signal_executed now writes the UTC timestamp to executed_at
  • Added mark_signal_closed(signal_id)
  • Added get_executed_unclosed_signals() for the exit scheduler

README.md

  • Rebranded Cleopatra → Smaug
  • Documented position lifecycle section
  • Updated architecture diagram to mention auto-close


gitea-actions added 1 commit 2026-05-04 18:32:03 +02:00
Co-authored-by: dodox <dodox@users.noreply.local>
Owner

Filings should be 'cached' in db. (Can also cache xml file, but honestly what for?)
Like we dl all filings once, and then we just have to check date and time of newest in db to fetch new and again have all.
yfinance data should also be cached.

Also I don't like naked SQL. Use sqlalchemy to define the model...

@claude

Author
Collaborator

Claude finished @dodox's task —— View job


Tasks

  • Gather context
  • Read all current files
  • Replace raw SQL + db.py with SQLAlchemy models
  • Add price cache table/model + yfinance caching
  • Update edgar_poller.py to use latest DB filing date for incremental fetching
  • Update backtest.py to use/store yfinance data via DB
  • Update requirements.txt (add sqlalchemy)
  • Commit & push

What was done

db/models.py (new)
SQLAlchemy ORM models — no more raw SQL:

  • Filing — same columns as before, plus idx_filings_filed_date index
  • Signal — booleans use Column(Boolean) instead of integers
  • PriceCache — new table: (ticker, date) unique, stores daily close prices

db/db.py

  • Fully rewritten using SQLAlchemy Session + ORM queries; zero raw SQL strings
  • Engine is a module-level singleton; WAL mode + foreign keys set on first connect
  • New functions: get_latest_filed_date(), get_signals_for_backtest(), get_cached_prices(), upsert_prices()
  • get_recent_buys_for_ticker now uses Python-computed cutoff date (no SQLite date('now') strings)

db/schema.sql — deleted (SQLAlchemy handles CREATE TABLE via Base.metadata.create_all)

ingestion/edgar_poller.py

  • fetch_and_store_new_filings calls get_latest_filed_date() once at the start
  • Any EDGAR entry with filed_date < latest_in_db is skipped before even touching accession_exists — incremental polling is now date-gated first
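
The date gate described above can be sketched as a small generator. This is an assumption-laden illustration, not the poller itself: `select_new_entries`, the entry dict keys, and the accession strings are made up, and a set stands in for the `accession_exists` DB lookup.

```python
from datetime import date


def select_new_entries(entries, latest_in_db, known_accessions):
    """Yield feed entries that still need to be stored."""
    for entry in entries:
        # Cheap date gate first: anything strictly older than the newest stored filing is skipped.
        if latest_in_db is not None and entry["filed_date"] < latest_in_db:
            continue
        # Same-day entries can still be duplicates, so fall back to the accession check.
        if entry["accession"] in known_accessions:
            continue
        yield entry


# Invented feed entries with placeholder accession numbers.
entries = [
    {"accession": "0000000001-26-000001", "filed_date": date(2026, 5, 3)},
    {"accession": "0000000001-26-000002", "filed_date": date(2026, 5, 4)},
    {"accession": "0000000001-26-000003", "filed_date": date(2026, 5, 4)},
    {"accession": "0000000001-26-000004", "filed_date": date(2026, 5, 5)},
]
known = {"0000000001-26-000002"}

fresh = list(select_new_entries(entries, date(2026, 5, 4), known))
print([e["accession"] for e in fresh])
```

The comparison is strict (`<`), so filings dated the same day as the newest stored one still go through the accession check; a gate using `<=` would drop later same-day filings.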

backtest/backtest.py

  • _fetch_prices(ticker, start, end): checks DB cache first; if the date range is fully in the past and we already have data, returns the cached dict without any network call
  • Otherwise fetches from yfinance, persists only new dates via upsert_prices(), and returns the merged result
  • run_backtest uses get_signals_for_backtest() instead of its own sqlite3 connection
  • spy_cache keyed by entry_date_str (simpler — holding_days is constant per run)
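
The cache-first flow described for `_fetch_prices` can be sketched without a database or network. In this illustration a dict stands in for the `PriceCache` table and an injected `download` callable stands in for yfinance; the function signature, the `today` parameter, and the sample prices are assumptions rather than the code in this PR.

```python
from datetime import date


def fetch_prices(ticker, start, end, cache, download, today):
    """Return {date: close} for [start, end], calling `download` only when needed."""
    cached = {d: p for d, p in cache.get(ticker, {}).items() if start <= d <= end}
    if end < today and cached:
        return cached  # completed range with data on hand: no network call
    fresh = download(ticker, start, end)
    new_rows = {d: p for d, p in fresh.items() if d not in cached}
    cache.setdefault(ticker, {}).update(new_rows)  # persist only the dates we lacked
    return {**cached, **new_rows}


calls = []


def fake_download(ticker, start, end):
    """Hypothetical stand-in for the yfinance fetch; records each call."""
    calls.append((ticker, start, end))
    return {date(2026, 4, 1): 10.0, date(2026, 4, 2): 11.0}


cache = {}
today = date(2026, 5, 4)
first = fetch_prices("ACME", date(2026, 4, 1), date(2026, 4, 2), cache, fake_download, today)
second = fetch_prices("ACME", date(2026, 4, 1), date(2026, 4, 2), cache, fake_download, today)
print(len(calls), first == second)
```

One caveat of the rule as stated: any cached row inside a past range counts as a hit, so a partially cached range would not be refilled. Whether the real implementation guards against that is not stated in this thread.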

requirements.txt — added sqlalchemy>=2.0.0


gitea-actions added 1 commit 2026-05-04 19:21:27 +02:00
- Replace db/schema.sql + raw sqlite3 with SQLAlchemy ORM (db/models.py)
  - Filing, Signal, PriceCache models with proper indexes
  - db/db.py uses SQLAlchemy sessions throughout; no raw SQL strings
- Add PriceCache table: stores daily close prices per ticker
  - backtest._fetch_prices checks DB first; skips yfinance for completed ranges
  - New data persisted via upsert_prices()
  - get_cached_prices() / upsert_prices() added to db.py
- EDGAR poller incremental fetch: get_latest_filed_date() returns newest
  filed_date in DB; fetch_and_store_new_filings skips entries older than
  that cutoff before even checking accession_exists
- Add get_signals_for_backtest() to db.py; backtest no longer opens its
  own sqlite3 connection
- requirements.txt: add sqlalchemy>=2.0.0

Co-authored-by: dodox <dodox@users.noreply.local>
dodox merged commit cc4343d805 into master 2026-05-04 19:38:21 +02:00
dodox deleted branch claude/issue-1-insider-copytrade-poc 2026-05-04 19:38:21 +02:00
Reference: dodox/smaug#2