# Analysis Request: Data needed for system optimization

**Status:** Completed
**Date:** 2025-12-20
**Priority:** Medium

---

## 1. User Description & Query

**Goal:** Check and analyze what kind of historical data is needed for further optimization of the system.

**Context:** Understanding how the system behaved in the past (both well and badly) will help improve it going forward. Assume that any kind of publicly available data can be used.

**Desired Behavior:** A list of data sources that will help improve the system now and in the future.

### Specific Questions

1. Price data — what kind?
2. CSV or a database?
3. Source of data (Hyperliquid, Uniswap)?
4. Other sources of data? Please propose.

---

## 2. Agent Summary

* **Objective:** Define a comprehensive data strategy to support backtesting, parameter optimization, and performance analysis for the Uniswap CLP + Hyperliquid Hedger system.
* **Key Constraints:**
    * **High Frequency:** Hedging logic runs on ~1s ticks. 1-minute candles are insufficient for simulating slippage and "whipsaw" events.
    * **Dual Venue:** Must correlate Uniswap V3 (Spot/Liquidity) events with Hyperliquid (Perp/Hedge) actions.
    * **Storage:** High-frequency data grows rapidly; format matters.

## 3. Main Analysis

### 3.1 Data Types Required

To fully reconstruct and optimize the strategy, you need three distinct layers of data:

#### A. Market Data (The "Environment")

1. **Tick-Level Trades (Hyperliquid):**
    * *Why:* To simulate realistic slippage, fill probability, and exact trigger timing for the hedger.
    * *Fields:* `timestamp_ms`, `price`, `size`, `side`, `liquidation (bool)`.
2. **Order Book Snapshots (Hyperliquid):**
    * *Why:* To calculate the "effective impact price" for large hedges. The mid-price might be $3000, but selling $50k might execute at $2998.
    * *Frequency:* Every 1-5 seconds.
3. **Uniswap V3 Pool Events (Arbitrum):**
    * *Why:* To track the exact "Health" of the CLP. Knowing when the price crosses a tick boundary is critical for "In Range" status.
    * *Events:* `Swap` (price changes), `Mint`, `Burn`.

#### B. System State Data (The "Bot's Brain")

* *Why:* To understand *why* the bot made a decision. A trade might look bad in hindsight but was correct given the data available at that millisecond.
* *Fields:* `timestamp`, `current_hedge_delta`, `target_hedge_delta`, `rebalance_threshold_used`, `volatility_metric`, `pnl_unrealized`, `pnl_realized`.

#### C. External "Alpha" Data (Optimization Signals)

* **Funding Rates (Historical):** To optimize long/short bias.
* **Gas Prices (Arbitrum):** To optimize mint/burn timing (don't rebalance the CLP if gas > expected fees).
* **Implied Volatility (Deribit Options):** Compare realized vs. implied volatility to adjust `DYNAMIC_THRESHOLD_MULTIPLIER`.

### 3.2 Technical Options / Trade-offs

| Option | Pros | Cons | Complexity |
| :--- | :--- | :--- | :--- |
| **A. CSV Files (Flat)** | Simple, human-readable, portable. Good for daily logs. | Slow to query large datasets. Hard to merge multiple streams (e.g., matching a Uniswap swap to an HL trade). | Low |
| **B. SQLite (Local DB)** | Single file, supports SQL queries, better performance than CSV. | Concurrency limits (one writer). Not great for massive tick data (TB scale). | Low-Medium |
| **C. Time-Series DB (InfluxDB / QuestDB)** | Optimized for high-frequency timestamps. Native downsampling. | Requires running a server/container. Overkill for simple analysis? | High |
| **D. Parquet / HDF5** | Extremely fast read/write for Python (Pandas). High compression. | Not human-readable. Best for "Cold" storage (backtesting). | Medium |
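To make the trade-off concrete, here is a minimal sketch (assuming `pandas` with `pyarrow` installed; the file names and sample values are purely illustrative) writing the same tick-level trade records as a daily CSV (option A) and as Parquet (option D):

```python
# Minimal comparison of option A (CSV) and option D (Parquet) using pandas.
# Assumes `pandas` and `pyarrow` are installed; file names and values are illustrative.
import pandas as pd

# Example tick-level trade records matching the fields listed in section 3.1.A
trades = pd.DataFrame(
    {
        "timestamp_ms": [1734700800000, 1734700800410, 1734700801120],
        "price": [3000.5, 3000.1, 2999.8],
        "size": [0.8, 2.5, 1.1],
        "side": ["buy", "sell", "sell"],
        "liquidation": [False, False, True],
    }
)

# Option A: flat daily CSV -- human-readable, but slow to query at scale
trades.to_csv("trades_2025-12-20.csv", index=False)

# Option D: Parquet -- columnar, compressed, fast to reload for backtesting
trades.to_parquet("trades_2025-12-20.parquet", index=False)

# Reloading later (e.g., in the backtest engine)
df = pd.read_parquet("trades_2025-12-20.parquet")
print(df.dtypes)
```

For small daily logs either format is workable; the difference mainly shows up once files reach millions of rows.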
### 3.3 Proposed Solution Design

#### Architecture: "Hot" Logging + "Cold" Archival

1. **Live Logging (Hot):** Continue using `JSON` status files and `Log` files for immediate state.
2. **Data Collector Script:** A separate process (or async thread) that dumps high-frequency data into **daily CSVs** or **Parquet** files.
3. **Backtest Engine:** A Python script that loads these Parquet files to simulate "What if the threshold was 0.08 instead of 0.05?".

#### Data Sources

* **Hyperliquid:** The public API (Info) provides L2 snapshots and recent trade history.
* **Uniswap:** The Graph (Subgraphs) or RPC `eth_getLogs`.
* **Dune Analytics:** Great for exporting historical Uniswap V3 data (fees, volumes) to CSV for free/cheap.

### 3.4 KPI & Performance Metrics

To truly evaluate success, we need more than just PnL; we need to compare against benchmarks.

1. **NAV vs. Benchmark (HODL):**
    * *Metric:* `(Current Wallet Value + Position Value) - (Net Inflows)` vs. `(Initial ETH * Current Price)`.
    * *Goal:* Did we beat simply holding ETH?
    * *Frequency:* Hourly.
2. **Hedging Efficiency (Delta Neutrality):**
    * *Metric:* `Net Delta Exposure = (Uniswap Delta + Hyperliquid Delta)`.
    * *Goal:* Should stay close to 0. A high standard deviation here means the bot is "loose" or slow.
    * *Frequency:* Per tick (or aggregated per minute).
3. **Cost of Hedge (The "Insurance Premium"):**
    * *Metric:* `(Hedge Fees Paid + Funding Paid + Hedge Slippage) / Total Portfolio Value`.
    * *Goal:* Keep this below the APR earned from Uniswap fees.
    * *Frequency:* Daily.
4. **Fee Coverage Ratio:**
    * *Metric:* `Uniswap Fees Earned / Cost of Hedge`.
    * *Goal:* Must be > 1.0. If < 1.0, the strategy is burning money to stay neutral.
    * *Frequency:* Daily.
5. **Impermanent Loss (IL) Realized:**
    * *Metric:* Value lost from selling ETH low / buying high during CLP rebalances vs. fees earned.
    * *Frequency:* Per rebalance.

## 4. Risk Assessment

* **Risk: Data Gaps.** If the bot goes offline, you miss market data.
    * *Mitigation:* Use public historical APIs (such as Hyperliquid's archive or Dune) to fill gaps, rather than relying solely on local recording.
* **Risk: Storage Bloat.** Storing every millisecond tick can fill a hard drive in weeks.
    * *Mitigation:* Aggregate. Store "1-second OHLC" + "tick volume" instead of every raw trade, unless debugging specific slippage events.

## 5. Conclusion

**Recommendation:**

1. **Immediate:** Start logging **internal system state** (thresholds, volatility metrics) to a structured CSV (`hedge_metrics.csv`). You can't get this from public APIs later.
2. **External Data:** Don't build a complex scraper yet. Rely on downloading public data (Dune/Hyperliquid) when you are ready to backtest.
3. **Format:** Use **Parquet** (via Pandas) for storing price data. It is typically an order of magnitude faster to read and far smaller on disk than CSV.

## 6. Implementation Plan

- [ ] **Step 1:** Create `tools/data_collector.py` to fetch and save public trade history (HL) daily.
- [ ] **Step 2:** Modify `clp_hedger.py` to append "Decision Metrics" (Vol, Threshold, Delta) to a `metrics.csv` every loop (see the sketch after this plan).
- [ ] **Step 3:** Use a notebook (Colab/Jupyter) to load `metrics.csv` and visualize "Threshold vs. Price Deviation".
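As a starting point for Step 2, here is a minimal sketch of the metrics logging (standard library only; the helper name `append_metrics_row` and the sample values are hypothetical, while the columns follow the System State fields listed in section 3.1.B):

```python
# Sketch for Step 2: append one row of decision metrics per hedger loop.
# Standard library only; helper name, call site, and values are hypothetical.
import csv
import os
import time

METRICS_FILE = "metrics.csv"
FIELDS = [
    "timestamp",
    "current_hedge_delta",
    "target_hedge_delta",
    "rebalance_threshold_used",
    "volatility_metric",
    "pnl_unrealized",
    "pnl_realized",
]


def append_metrics_row(metrics: dict) -> None:
    """Append a single metrics row, writing the header only when the file is new."""
    new_file = not os.path.exists(METRICS_FILE)
    with open(METRICS_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(metrics)


# Example call at the end of each hedger loop iteration:
append_metrics_row(
    {
        "timestamp": time.time(),
        "current_hedge_delta": -0.42,
        "target_hedge_delta": -0.45,
        "rebalance_threshold_used": 0.05,
        "volatility_metric": 0.0123,
        "pnl_unrealized": -12.3,
        "pnl_realized": 4.7,
    }
)
```

The resulting file loads directly in the Step 3 notebook with `pd.read_csv("metrics.csv")` for the "Threshold vs. Price Deviation" plot.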