Operations · Updated 2026-03-18

Latency Accounting

Learn how to measure RetainDB latency separately from LLM latency for accurate performance monitoring.

Applies to: API v1, SDK v3+

Accurate latency measurement is critical for monitoring and optimization. This guide explains how to separate RetainDB latency from your LLM latency.


Why Separate Latencies?

Combining RetainDB and LLM latency hides the true performance of each component:

| Component | Latency | Notes |
| --- | --- | --- |
| RetainDB Search | 50-250ms | Can be optimized |
| LLM Generation | 500-5000ms | Depends on model & tokens |

By separating them, you can:

  • Identify bottlenecks accurately
  • Set appropriate SLAs
  • Optimize each component independently

Latency Breakdown

Every RetainDB response includes detailed latency metrics:

json
{
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 45,
    "vector_ms": 27,
    "lexical_ms": 5,
    "merge_ms": 3,
    "total_ms": 84
  }
}
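
If the SDK surfaces this breakdown on the response object (an assumption; the exact accessor may differ by SDK version), you can log it per request:

typescript
// Sketch: log the server-side breakdown for one search.
// Assumes `latency_breakdown` is exposed on the search response,
// matching the JSON shape above.
const res = await client.memory.search({ user_id: userId, query });
const lb = res.latency_breakdown;

console.log(
  `embed=${lb.embed_ms}ms vector=${lb.vector_ms}ms ` +
  `lexical=${lb.lexical_ms}ms total=${lb.total_ms}ms`
);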

Breakdown Fields

| Field | Description | Typical Range |
| --- | --- | --- |
| cache_ms | Cache lookup time | 0-10ms |
| embed_ms | Embedding generation | 20-100ms |
| vector_ms | Vector search | 10-50ms |
| lexical_ms | Keyword search | 2-20ms |
| merge_ms | Result merging | 1-10ms |
| total_ms | Total RetainDB time | 50-250ms |
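
To spot which stage is the bottleneck, compare each field against the upper end of its typical range. A minimal sketch (the thresholds are just the table's upper bounds, not hard limits):

typescript
// Upper bounds taken from the "Typical Range" column above.
const TYPICAL_MAX_MS: Record<string, number> = {
  cache_ms: 10,
  embed_ms: 100,
  vector_ms: 50,
  lexical_ms: 20,
  merge_ms: 10,
};

// Returns the stages that exceeded their typical range for one request.
function slowStages(breakdown: Record<string, number>): string[] {
  return Object.entries(TYPICAL_MAX_MS)
    .filter(([field, max]) => (breakdown[field] ?? 0) > max)
    .map(([field]) => field);
}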

Measuring in Your Application

Basic Separation

typescript
// Start timer for the RetainDB search
const retainDbStart = Date.now();

// RetainDB search
const context = await client.memory.search({
  user_id: userId,
  query: message,
  top_k: 5,
});

const retainDbLatency = Date.now() - retainDbStart;

// Time the LLM call separately
const llmStart = Date.now();

const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: `Context: ${context.results.map(r => r.memory.content).join("\n")}` },
    { role: "user", content: message },
  ],
});

const llmLatency = Date.now() - llmStart;

console.log(`RetainDB: ${retainDbLatency}ms, LLM: ${llmLatency}ms`);
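
Note that the client-side timer includes network round-trip time, while total_ms in the response covers only server-side work. Assuming the SDK exposes the breakdown (see above), the difference approximates network overhead:

typescript
// Client-measured latency = server-side total_ms + network round trip.
// Assumes `latency_breakdown` is available on the search response.
const networkMs = retainDbLatency - context.latency_breakdown.total_ms;
console.log(`Network overhead: ~${networkMs}ms`);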

Full Instrumentation

typescript
class LatencyTracker {
  private retainDbLatencies: number[] = [];
  private llmLatencies: number[] = [];

  // Inject the RetainDB and OpenAI clients the tracker wraps
  // (RetainDBClient is the assumed SDK client type).
  constructor(private client: RetainDBClient, private openai: OpenAI) {}

  async searchWithTracking(userId: string, query: string) {
    const start = Date.now();
    const result = await this.client.memory.search({
      user_id: userId,
      query,
    });
    this.retainDbLatencies.push(Date.now() - start);
    return result;
  }

  async generateWithTracking(messages: any[]) {
    const start = Date.now();
    const result = await this.openai.chat.completions.create({
      model: "gpt-4",
      messages,
    });
    this.llmLatencies.push(Date.now() - start);
    return result;
  }

  // Sort a copy so percentile reads never mutate the recorded samples.
  private percentile(samples: number[], p: number): number {
    if (samples.length === 0) return 0;
    const sorted = [...samples].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * p))];
  }

  getMetrics() {
    return {
      retainDb: {
        p50: this.percentile(this.retainDbLatencies, 0.5),
        p95: this.percentile(this.retainDbLatencies, 0.95),
        p99: this.percentile(this.retainDbLatencies, 0.99),
      },
      llm: {
        p50: this.percentile(this.llmLatencies, 0.5),
        p95: this.percentile(this.llmLatencies, 0.95),
        p99: this.percentile(this.llmLatencies, 0.99),
      },
    };
  }
}

Dashboard Best Practices

Correct Reporting

code
Good:
- RetainDB Search: p95 = 85ms
- LLM Generation: p95 = 1200ms  
- Total E2E: p95 = 1350ms

Wrong Reporting

code
Bad:
- Total: p95 = 1350ms (hides RetainDB contribution)
- "Memory adds 100ms overhead" (inaccurate)

Latency by Profile

Fast Profile

Optimized for speed:

json
{
  "profile": "fast",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 35,
    "vector_ms": 18,
    "total_ms": 65
  }
}

Target: < 100ms p95

Balanced Profile

Speed/quality tradeoff:

json
{
  "profile": "balanced",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 80,
    "vector_ms": 45,
    "total_ms": 140
  }
}

Target: < 250ms p95

Quality Profile

Maximum accuracy:

json
{
  "profile": "quality",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 150,
    "vector_ms": 120,
    "total_ms": 290
  }
}

Target: < 500ms p95
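
If you pick a profile programmatically, one approach is to map a p95 latency budget to the richest profile whose target still fits. The helper below is illustrative, not part of the SDK; the thresholds are the targets listed above:

typescript
type Profile = "fast" | "balanced" | "quality";

// Pick the highest-quality profile whose p95 target fits the budget.
function profileForBudget(budgetMs: number): Profile {
  if (budgetMs >= 500) return "quality"; // target: < 500ms p95
  if (budgetMs >= 250) return "balanced"; // target: < 250ms p95
  return "fast"; // target: < 100ms p95
}

const results = await client.memory.search({
  user_id: userId,
  query,
  profile: profileForBudget(150), // 150ms budget -> "fast"
});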


Optimization Tips

1. Use Fast Profile

typescript
// For real-time applications
await client.memory.search({
  user_id: userId,
  query,
  profile: "fast", // < 100ms target
});

2. Enable Caching

typescript
// First request - cache miss
const results = await client.memory.search({ user_id: userId, query });

// Subsequent identical request - cache hit
const cached = await client.memory.search({ user_id: userId, query });
// cache_ms: 2, total_ms: 10

3. Batch When Possible

typescript
// Instead of individual adds (one round trip per item)
for (const item of items) {
  await client.memory.add({ ... }); // Slow
}

// Use bulk add
await client.memory.addBulk({
  user_id: userId,
  memories: items,
});

SLA Recommendations

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| RetainDB p95 | < 250ms | > 300ms |
| RetainDB p99 | < 500ms | > 750ms |
| Cache hit rate | > 70% | < 50% |
| Error rate | < 0.1% | > 1% |
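
These thresholds translate directly into alert rules. A minimal in-process sketch (the metrics shape follows the LatencyTracker above; notifyOncall is a hypothetical hook for your paging system):

typescript
// Thresholds from the "Alert Threshold" column above.
function checkSlas(metrics: { retainDb: { p95: number; p99: number } }): string[] {
  const alerts: string[] = [];
  if (metrics.retainDb.p95 > 300) {
    alerts.push(`RetainDB p95 ${metrics.retainDb.p95}ms exceeds 300ms`);
  }
  if (metrics.retainDb.p99 > 750) {
    alerts.push(`RetainDB p99 ${metrics.retainDb.p99}ms exceeds 750ms`);
  }
  return alerts;
}

// Run periodically, e.g. once a minute:
// const alerts = checkSlas(tracker.getMetrics());
// if (alerts.length > 0) notifyOncall(alerts); // hypothetical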
