# Latency Accounting
Learn how to measure RetainDB latency separately from LLM latency for accurate performance monitoring.
Applies to: API v1, SDK v3+
Accurate latency measurement is critical for monitoring and optimization. This guide explains how to separate RetainDB's contribution from your LLM's.
## Why Separate Latencies?
Combining RetainDB and LLM latency hides the true performance of each component:
| Component | Latency | Notes |
|---|---|---|
| RetainDB Search | 50-250ms | Can be optimized |
| LLM Generation | 500-5000ms | Depends on model & tokens |
By separating them, you can:
- Identify bottlenecks accurately
- Set appropriate SLAs
- Optimize each component independently
## Latency Breakdown
Every RetainDB response includes detailed latency metrics:
```json
{
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 45,
    "vector_ms": 27,
    "lexical_ms": 5,
    "merge_ms": 3,
    "total_ms": 84
  }
}
```

### Breakdown Fields
| Field | Description | Typical Range |
|---|---|---|
| `cache_ms` | Cache lookup time | 0-10ms |
| `embed_ms` | Embedding generation | 20-100ms |
| `vector_ms` | Vector search | 10-50ms |
| `lexical_ms` | Keyword search | 2-20ms |
| `merge_ms` | Result merging | 1-10ms |
| `total_ms` | Total RetainDB time | 50-250ms |
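One way to use these fields is to flag whichever phase dominates a slow request. A minimal sketch, assuming the breakdown shape above and a hypothetical 40% threshold chosen for illustration:

```typescript
// Shape matching the breakdown fields documented above.
interface LatencyBreakdown {
  cache_ms: number;
  embed_ms: number;
  vector_ms: number;
  lexical_ms: number;
  merge_ms: number;
  total_ms: number;
}

// Return every phase that accounts for at least `threshold` of the total.
function dominantPhases(b: LatencyBreakdown, threshold = 0.4): string[] {
  const phases: [string, number][] = [
    ["cache", b.cache_ms],
    ["embed", b.embed_ms],
    ["vector", b.vector_ms],
    ["lexical", b.lexical_ms],
    ["merge", b.merge_ms],
  ];
  return phases
    .filter(([, ms]) => ms / b.total_ms >= threshold)
    .map(([name]) => name);
}

// With the example breakdown above, embed_ms (45 of 84, ≈54%) dominates:
console.log(
  dominantPhases({ cache_ms: 4, embed_ms: 45, vector_ms: 27, lexical_ms: 5, merge_ms: 3, total_ms: 84 }),
);
```

Feeding this from each response's `latency_breakdown` tells you whether to tune embedding, vector search, or caching first.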
## Measuring in Your Application

### Basic Separation
```typescript
// Start timer
const retainDBStart = Date.now();

// RetainDB search
const context = await client.memory.search({
  user_id: userId,
  query: message,
  top_k: 5,
});
const retainDBLatency = Date.now() - retainDBStart;

// LLM generation
const llmStart = Date.now();
const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: `Context: ${context.results.map(r => r.memory.content).join("\n")}` },
    { role: "user", content: message },
  ],
});
const llmLatency = Date.now() - llmStart;

console.log(`RetainDB: ${retainDBLatency}ms, LLM: ${llmLatency}ms`);
```

### Full Instrumentation
```typescript
class LatencyTracker {
  private retainDBLatencies: number[] = [];
  private llmLatencies: number[] = [];

  // RetainDB and OpenAI client instances
  constructor(private client: any, private openai: any) {}

  async searchWithTracking(userId: string, query: string) {
    const start = Date.now();
    const result = await this.client.memory.search({
      user_id: userId,
      query,
    });
    this.retainDBLatencies.push(Date.now() - start);
    return result;
  }

  async generateWithTracking(messages: any[]) {
    const start = Date.now();
    const result = await this.openai.chat.completions.create({
      model: "gpt-4",
      messages,
    });
    this.llmLatencies.push(Date.now() - start);
    return result;
  }

  getMetrics() {
    // Sort a copy once per percentile; sorting in place would mutate the samples.
    const percentile = (arr: number[], p: number) => {
      const sorted = [...arr].sort((a, b) => a - b);
      return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * p))];
    };
    const summarize = (arr: number[]) => ({
      p50: percentile(arr, 0.5),
      p95: percentile(arr, 0.95),
      p99: percentile(arr, 0.99),
    });
    return {
      retainDB: summarize(this.retainDBLatencies),
      llm: summarize(this.llmLatencies),
    };
  }
}
```

## Dashboard Best Practices
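Whatever tracker you use, report the two latencies under separate metric names so dashboards can plot and alert on each independently. A minimal sketch, assuming a generic `emit(name, value)` interface (the metric names here are hypothetical, not RetainDB conventions):

```typescript
// Emit RetainDB and LLM latency as separate metric series.
function emitLatencyMetrics(
  emit: (name: string, ms: number) => void,
  retainDBMs: number,
  llmMs: number,
): void {
  emit("retaindb.search.latency_ms", retainDBMs);
  emit("llm.generation.latency_ms", llmMs);
  // The total is derived; never record it as the only metric.
  emit("e2e.latency_ms", retainDBMs + llmMs);
}
```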
### Correct Reporting

Good:
- RetainDB Search: p95 = 85ms
- LLM Generation: p95 = 1200ms
- Total E2E: p95 = 1350ms

### Wrong Reporting
Bad:
- Total: p95 = 1350ms (hides RetainDB contribution)
- "Memory adds 100ms overhead" (inaccurate)

## Latency by Profile
### Fast Profile

Optimized for speed:
```json
{
  "profile": "fast",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 35,
    "vector_ms": 18,
    "total_ms": 65
  }
}
```

Target: < 100ms p95
### Balanced Profile

Speed/quality tradeoff:
```json
{
  "profile": "balanced",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 80,
    "vector_ms": 45,
    "total_ms": 140
  }
}
```

Target: < 250ms p95
### Quality Profile

Maximum accuracy:
```json
{
  "profile": "quality",
  "latency_breakdown": {
    "cache_ms": 4,
    "embed_ms": 150,
    "vector_ms": 120,
    "total_ms": 290
  }
}
```

Target: < 500ms p95
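The three targets make profile selection mechanical: pick the highest-quality profile whose p95 target fits your latency budget. A hedged sketch (the helper is hypothetical; the profile names and targets come from this page):

```typescript
type Profile = "fast" | "balanced" | "quality";

// Pick the highest-quality profile whose documented p95 target fits the budget.
function profileForBudget(p95BudgetMs: number): Profile {
  if (p95BudgetMs >= 500) return "quality";  // < 500ms p95 target
  if (p95BudgetMs >= 250) return "balanced"; // < 250ms p95 target
  return "fast";                             // < 100ms p95 target
}
```

For example, a 300ms retrieval budget rules out `quality` but comfortably fits `balanced`.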
## Optimization Tips

### 1. Use Fast Profile
```typescript
// For real-time applications
await client.memory.search({
  user_id: userId,
  query,
  profile: "fast", // < 100ms target
});
```

### 2. Enable Caching
```typescript
// First request - cache miss
const results = await client.memory.search({ query });

// Subsequent requests - cache hit
const cached = await client.memory.search({ query });
// cache_ms: 2, total_ms: 10
```

### 3. Batch When Possible
```typescript
// Instead of individual adds
for (const item of items) {
  await client.memory.add({ ... }); // Slow
}

// Use bulk add
await client.memory.addBulk({
  user_id: userId,
  memories: items,
});
```

## SLA Recommendations
| Metric | Target | Alert Threshold |
|---|---|---|
| RetainDB p95 | < 250ms | > 300ms |
| RetainDB p99 | < 500ms | > 750ms |
| Cache hit rate | > 70% | < 50% |
| Error rate | < 0.1% | > 1% |
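The alert thresholds above can be checked mechanically. A minimal sketch, assuming you already collect these four values (the `SlaMetrics` shape and function are hypothetical; the thresholds come from the table):

```typescript
interface SlaMetrics {
  retainDBp95Ms: number;
  retainDBp99Ms: number;
  cacheHitRate: number; // 0..1
  errorRate: number;    // 0..1
}

// Return an alert message for each metric past its threshold.
function slaAlerts(m: SlaMetrics): string[] {
  const alerts: string[] = [];
  if (m.retainDBp95Ms > 300) alerts.push(`RetainDB p95 ${m.retainDBp95Ms}ms > 300ms`);
  if (m.retainDBp99Ms > 750) alerts.push(`RetainDB p99 ${m.retainDBp99Ms}ms > 750ms`);
  if (m.cacheHitRate < 0.5) alerts.push(`Cache hit rate ${(m.cacheHitRate * 100).toFixed(0)}% < 50%`);
  if (m.errorRate > 0.01) alerts.push(`Error rate ${(m.errorRate * 100).toFixed(2)}% > 1%`);
  return alerts;
}
```

Wire the output into your pager or dashboard of choice; the point is that RetainDB gets its own thresholds, separate from LLM latency.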
## Next steps
- Memory Search API — Search details
- SDK Quickstart — SDK usage
- Error Troubleshooting — Error handling