# Analysis Request: Data needed for system optimization

**Status:** Completed
**Date:** 2025-12-20
**Priority:** Medium

---

## 1. User Description & Query

**Goal:** Check and analyze what kind of historical data is needed for further optimization of the system.

**Context:** Understanding how the system behaved in the past (both well and badly) will help improve it going forward. Assume that any kind of publicly available data can be used.

**Desired Behavior:** A list of data sources that will help improve the system now and in the future.

### Specific Questions

1. Price data — what kind?
2. CSV or a database?
3. Source of data (Hyperliquid, Uniswap)?
4. Other sources of data? Please propose.

---

## 2. Agent Summary

* **Objective:** Define a comprehensive data strategy to support backtesting, parameter optimization, and performance analysis for the Uniswap CLP + Hyperliquid Hedger system.
* **Key Constraints:**
    * **High Frequency:** Hedging logic runs on ~1s ticks. 1-minute candles are insufficient for simulating slippage and "whipsaw" events.
    * **Dual Venue:** Must correlate Uniswap V3 (Spot/Liquidity) events with Hyperliquid (Perp/Hedge) actions.
    * **Storage:** High-frequency data grows rapidly; format matters.

## 3. Main Analysis

### 3.1 Data Types Required

To fully reconstruct and optimize the strategy, you need three distinct layers of data:

#### A. Market Data (The "Environment")

1. **Tick-Level Trades (Hyperliquid):**
    * *Why:* To simulate realistic slippage, fill probability, and exact trigger timing for the hedger.
    * *Fields:* `timestamp_ms`, `price`, `size`, `side`, `liquidation (bool)`.
2. **Order Book Snapshots (Hyperliquid):**
    * *Why:* To calculate the "effective impact price" for large hedges. The mid-price might be $3000, but selling $50k might execute at $2998.
    * *Frequency:* Every 1-5 seconds.
3. **Uniswap V3 Pool Events (Arbitrum):**
    * *Why:* To track the exact "Health" of the CLP. Knowing when the price crosses a tick boundary is critical for "In Range" status.
    * *Events:* `Swap` (price changes), `Mint`, `Burn`.

#### B. System State Data (The "Bot's Brain")

* *Why:* To understand *why* the bot made a decision. A trade might look bad in hindsight but was correct given the data available at that millisecond.
* *Fields:* `timestamp`, `current_hedge_delta`, `target_hedge_delta`, `rebalance_threshold_used`, `volatility_metric`, `pnl_unrealized`, `pnl_realized`.

#### C. External "Alpha" Data (Optimization Signals)

* **Funding Rates (Historical):** To optimize long/short bias.
* **Gas Prices (Arbitrum):** To optimize mint/burn timing (don't rebalance the CLP if gas > expected fees).
* **Implied Volatility (Deribit Options):** Compare realized vs. implied volatility to adjust `DYNAMIC_THRESHOLD_MULTIPLIER`.

### 3.2 Technical Options / Trade-offs

| Option | Pros | Cons | Complexity |
| :--- | :--- | :--- | :--- |
| **A. CSV Files (Flat)** | Simple, human-readable, portable. Good for daily logs. | Slow to query large datasets. Hard to merge multiple streams (e.g., matching a Uniswap swap to an HL trade). | Low |
| **B. SQLite (Local DB)** | Single file, supports SQL queries, better performance than CSV. | Concurrency limits (one writer). Not great for massive tick data (TB scale). | Low-Medium |
| **C. Time-Series DB (InfluxDB / QuestDB)** | Optimized for high-frequency timestamps. Native downsampling. | Requires running a server/container. Overkill for simple analysis? | High |
| **D. Parquet / HDF5** | Extremely fast read/write for Python (Pandas). High compression. | Not human-readable. Best for "Cold" storage (backtesting). | Medium |
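To make the trade-off concrete, here is a minimal sketch (assuming `pandas` with `pyarrow` installed; the file names and sample values are purely illustrative) writing the same tick-level trade records as a daily CSV (option A) and as Parquet (option D):

```python
# Minimal comparison of option A (CSV) and option D (Parquet) using pandas.
# Assumes `pandas` and `pyarrow` are installed; file names and values are illustrative.
import pandas as pd

# Example tick-level trade records matching the fields listed in section 3.1.A
trades = pd.DataFrame(
    {
        "timestamp_ms": [1734700800000, 1734700800410, 1734700801120],
        "price": [3000.5, 3000.1, 2999.8],
        "size": [0.8, 2.5, 1.1],
        "side": ["buy", "sell", "sell"],
        "liquidation": [False, False, True],
    }
)

# Option A: flat daily CSV -- human-readable, but slow to query at scale
trades.to_csv("trades_2025-12-20.csv", index=False)

# Option D: Parquet -- columnar, compressed, fast to reload for backtesting
trades.to_parquet("trades_2025-12-20.parquet", index=False)

# Reloading later (e.g., in the backtest engine)
df = pd.read_parquet("trades_2025-12-20.parquet")
print(df.dtypes)
```

For small daily logs either format is workable; the difference mainly shows up once files reach millions of rows.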
### 3.3 Proposed Solution Design

#### Architecture: "Hot" Logging + "Cold" Archival

1. **Live Logging (Hot):** Continue using `JSON` status files and `Log` files for immediate state.
2. **Data Collector Script:** A separate process (or async thread) that dumps high-frequency data into **daily CSVs** or **Parquet** files.
3. **Backtest Engine:** A Python script that loads these Parquet files to simulate "What if the threshold was 0.08 instead of 0.05?".

#### Data Sources

* **Hyperliquid:** The public API (Info) provides L2 snapshots and recent trade history.
* **Uniswap:** The Graph (Subgraphs) or RPC `eth_getLogs`.
* **Dune Analytics:** Great for exporting historical Uniswap V3 data (fees, volumes) to CSV for free/cheap.

### 3.4 KPI & Performance Metrics

To truly evaluate success, we need more than just PnL; we need to compare against benchmarks.

1. **NAV vs. Benchmark (HODL):**
    * *Metric:* `(Current Wallet Value + Position Value) - (Net Inflows)` vs. `(Initial ETH * Current Price)`.
    * *Goal:* Did we beat simply holding ETH?
    * *Frequency:* Hourly.
2. **Hedging Efficiency (Delta Neutrality):**
    * *Metric:* `Net Delta Exposure = (Uniswap Delta + Hyperliquid Delta)`.
    * *Goal:* Should stay close to 0. A high standard deviation here means the bot is "loose" or slow.
    * *Frequency:* Per tick (or aggregated per minute).
3. **Cost of Hedge (The "Insurance Premium"):**
    * *Metric:* `(Hedge Fees Paid + Funding Paid + Hedge Slippage) / Total Portfolio Value`.
    * *Goal:* Keep this below the APR earned from Uniswap fees.
    * *Frequency:* Daily.
4. **Fee Coverage Ratio:**
    * *Metric:* `Uniswap Fees Earned / Cost of Hedge`.
    * *Goal:* Must be > 1.0. If < 1.0, the strategy is burning money to stay neutral.
    * *Frequency:* Daily.
5. **Impermanent Loss (IL) Realized:**
    * *Metric:* Value lost from selling ETH low / buying high during CLP rebalances vs. fees earned.
    * *Frequency:* Per rebalance.

## 4. Risk Assessment

* **Risk: Data Gaps.** If the bot goes offline, you miss market data.
    * *Mitigation:* Use public historical APIs (such as Hyperliquid's archive or Dune) to fill gaps, rather than relying solely on local recording.
* **Risk: Storage Bloat.** Storing every millisecond tick can fill a hard drive in weeks.
    * *Mitigation:* Aggregate. Store "1-second OHLC" + "tick volume" instead of every raw trade, unless debugging specific slippage events.

## 5. Conclusion

**Recommendation:**

1. **Immediate:** Start logging **internal system state** (thresholds, volatility metrics) to a structured CSV (`hedge_metrics.csv`). You can't get this from public APIs later.
2. **External Data:** Don't build a complex scraper yet. Rely on downloading public data (Dune/Hyperliquid) when you are ready to backtest.
3. **Format:** Use **Parquet** (via Pandas) for storing price data. It is typically an order of magnitude faster to read and far smaller on disk than CSV.

## 6. Implementation Plan

- [ ] **Step 1:** Create `tools/data_collector.py` to fetch and save public trade history (HL) daily.
- [ ] **Step 2:** Modify `clp_hedger.py` to append "Decision Metrics" (Vol, Threshold, Delta) to a `metrics.csv` every loop (see the sketch after this plan).
- [ ] **Step 3:** Use a notebook (Colab/Jupyter) to load `metrics.csv` and visualize "Threshold vs. Price Deviation".
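As a starting point for Step 2, here is a minimal sketch of the metrics logging (standard library only; the helper name `append_metrics_row` and the sample values are hypothetical, while the columns follow the System State fields listed in section 3.1.B):

```python
# Sketch for Step 2: append one row of decision metrics per hedger loop.
# Standard library only; helper name, call site, and values are hypothetical.
import csv
import os
import time

METRICS_FILE = "metrics.csv"
FIELDS = [
    "timestamp",
    "current_hedge_delta",
    "target_hedge_delta",
    "rebalance_threshold_used",
    "volatility_metric",
    "pnl_unrealized",
    "pnl_realized",
]


def append_metrics_row(metrics: dict) -> None:
    """Append a single metrics row, writing the header only when the file is new."""
    new_file = not os.path.exists(METRICS_FILE)
    with open(METRICS_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(metrics)


# Example call at the end of each hedger loop iteration:
append_metrics_row(
    {
        "timestamp": time.time(),
        "current_hedge_delta": -0.42,
        "target_hedge_delta": -0.45,
        "rebalance_threshold_used": 0.05,
        "volatility_metric": 0.0123,
        "pnl_unrealized": -12.3,
        "pnl_realized": 4.7,
    }
)
```

The resulting file loads directly in the Step 3 notebook with `pd.read_csv("metrics.csv")` for the "Threshold vs. Price Deviation" plot.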