11 KiB
Race Condition Debugging Session
This example demonstrates debugging an intermittent issue using the four-stage framework.
The Bug
User reports: "Cache occasionally returns stale data. Works most of the time, but sometimes shows old values after updates."
Stage 1: Collect Evidence
Task state: "Collecting evidence" (in_progress)
Read Error Description
No error messages — behavior is wrong but silent:
- Expected: Updated value from cache
- Actual: Old value returned (intermittently)
- Frequency: ~5% of requests after update
Reproduce Consistently
Initial attempts fail — bug is intermittent:
- Sometimes works correctly
- Sometimes returns stale data
- No obvious pattern
Run test 100 times to find pattern:
for i in {1..100}; do
npm test -- --grep "cache update" > /dev/null || echo "Failed: $i"
done
# Results:
# Failed: 3
# Failed: 17
# Failed: 28
# Failed: 41
# Failed: 59
# Failed: 72
# Failed: 88
# Failed: 94
Failure rate: ~8% (8 failures out of 100 runs)
Check Recent Changes
git log --since="1 week ago" --oneline src/cache/
# f8d3c21 Optimize cache reads with async/await
# 9a2b741 Add cache prewarming on startup
# c4e5d67 Implement cache TTL refresh
Recent changes to cache implementation — possible cause.
Gather Evidence with Timestamps
Add detailed timing logs:
export async function updateCache(key: string, value: any): Promise<void> {
console.log(`[${Date.now()}] updateCache START: ${key}`);
await cache.set(key, value);
console.log(`[${Date.now()}] updateCache cache.set COMPLETE: ${key}`);
await invalidateRelated(key);
console.log(`[${Date.now()}] updateCache invalidateRelated COMPLETE: ${key}`);
}
export async function getCache(key: string): Promise<any> {
console.log(`[${Date.now()}] getCache START: ${key}`);
const value = await cache.get(key);
console.log(`[${Date.now()}] getCache COMPLETE: ${key}`, value);
return value;
}
Timeline Analysis
Captured logs from a failure:
[1702500123450] updateCache START: user:123
[1702500123455] updateCache cache.set COMPLETE: user:123
[1702500123456] getCache START: user:123 <-- Read started
[1702500123458] getCache COMPLETE: user:123 [OLD] <-- Returned old value
[1702500123460] updateCache invalidateRelated COMPLETE: user:123
Key finding: getCache started (1456) AFTER cache.set completed (1455) but BEFORE invalidateRelated completed (1460). Returned stale data.
Transition: Evidence gathered showing timing issue. Mark "Collect Evidence" complete, add "Isolate Variables" as in_progress.
Stage 2: Isolate Variables
Task state: "Isolating variables" (in_progress)
Find Working Examples
Check how other cache operations handle this:
// Working example - authentication cache
export async function updateAuthCache(userId: string, token: string): Promise<void> {
const key = `auth:${userId}`;
// Atomic operation - no race window
await cache.set(key, token, { ttl: 3600 });
}
Difference: No separate invalidation step, single atomic operation.
Read Recent Optimization Commit
The optimization commit (f8d3c21):
Before (synchronous, blocking):
export function updateCache(key: string, value: any): void {
cache.set(key, value); // Synchronous
invalidateRelated(key); // Synchronous
// Both complete before function returns
}
After (async, non-blocking):
export async function updateCache(key: string, value: any): Promise<void> {
await cache.set(key, value); // Async - completes
await invalidateRelated(key); // Async - still pending
// Function returns here, but invalidation still running
}
Identify Differences
Working code:
- Single atomic operation
- No race window
- Consistent state
Broken code:
- Two-step process
- Race window between set and invalidate
- Inconsistent state possible
Understand the Flow
updateCache('user:123', newData)startscache.setcompletes — new data in cache- Race window starts
getCache('user:123')called from different request- Reads from cache — gets new data
- BUT cache has stale related entries
- Related entries override with old data
- Race window ends
invalidateRelatedcompletes
Root cause hypothesis forming: The async optimization introduced a race window between setting the value and invalidating related entries.
Transition: Pattern identified. Mark "Isolate Variables" complete, add "Formulate Hypotheses" as in_progress.
Stage 3: Formulate Hypotheses & Test
Task state: "Formulating hypotheses" (in_progress)
Form Hypothesis
Hypothesis: "Cache returns stale data because the async optimization (f8d3c21) introduced a race window. When updateCache sets a value but hasn't yet invalidated related entries, concurrent getCache calls can read the new value while related entries are still stale, causing those stale entries to be returned instead."
Evidence:
- Timeline shows
getCachecalled during race window - Worked before async optimization
- Fails ~8% of time (when timing hits race window)
- Working code uses atomic operations
Transition: Hypothesis formed. Mark "Formulate Hypotheses" complete, add "Test Hypothesis" as in_progress.
Task state: "Testing hypothesis" (in_progress)
Design Minimal Test
Add artificial delay to widen race window and make bug consistent:
export async function updateCache(key: string, value: any): Promise<void> {
await cache.set(key, value);
// TESTING: Widen race window
await new Promise(resolve => setTimeout(resolve, 100));
await invalidateRelated(key);
}
Execute Test
Run test 100 times with widened race window:
for i in {1..100}; do
npm test -- --grep "cache update" > /dev/null || echo "Failed: $i"
done
# Results: 67 failures (67%)
Failure rate increased dramatically with wider race window. Confirms timing-based hypothesis.
Test Solution
Make operations atomic by ensuring no reads during update:
export async function updateCache(key: string, value: any): Promise<void> {
// Acquire lock to prevent concurrent reads
const lock = await cache.lock(key);
try {
await cache.set(key, value);
await invalidateRelated(key);
} finally {
await lock.release();
}
}
Run 100 times:
# Results: 0 failures (0%)
Result: Hypothesis confirmed. Lock prevents race condition.
Transition: Solution verified. Mark "Test Hypothesis" complete, add "Verify Fix" as in_progress.
Stage 4: Verify Fix
Task state: "Verifying fix" (in_progress)
Create Failing Test
describe('updateCache race condition', () => {
it('prevents stale data during concurrent update and read', async () => {
// Setup initial data
await cache.set('user:123', 'old-value');
await cache.set('related:123', 'old-related');
// Simulate race: update and read concurrently
const [updateResult, readResult] = await Promise.all([
updateCache('user:123', 'new-value'),
getCache('user:123'),
]);
// Read should either see old (before update) or new (after update)
// but never a mix of new + stale related
const finalValue = await getCache('user:123');
expect(finalValue).not.toBe('old-related'); // Should not see stale
expect(['old-value', 'new-value']).toContain(finalValue);
});
});
Test fails without lock, passes with lock.
Implement Fix
Lock-based solution has performance cost. Consider alternative:
export async function updateCache(key: string, value: any): Promise<void> {
// Generate update ID to track consistency
const updateId = crypto.randomUUID();
// Set all values atomically with transaction
await cache.multi()
.set(key, value)
.set(`${key}:updateId`, updateId)
.invalidate(getRelatedKeys(key))
.exec();
}
This uses cache transaction (multi/exec) for atomicity without explicit locking.
Verify Fix Works
Run comprehensive tests:
# Run 1000 times to ensure no failures
for i in {1..1000}; do
npm test -- --grep "cache update" > /dev/null || echo "Failed: $i"
done
# Results: 0 failures
All tests pass:
- Race condition test passes
- Existing tests pass
- Load test shows no performance degradation
- No deadlocks or timeouts
Document Fix
/**
* Updates cache value atomically with related entries.
*
* Bug fix (2024-12-12): Prevented race condition during updates.
* Root cause: Async optimization (f8d3c21) introduced race window
* between setting value and invalidating related entries. Concurrent
* reads during this window could return inconsistent data.
*
* Prevention: Use cache transactions (multi/exec) to ensure atomicity.
* All cache updates and invalidations complete together or not at all.
*/
export async function updateCache(key: string, value: any): Promise<void> {
const updateId = crypto.randomUUID();
await cache.multi()
.set(key, value)
.set(`${key}:updateId`, updateId)
.invalidate(getRelatedKeys(key))
.exec();
}
Additional Defense
Add monitoring to detect inconsistency:
export async function getCache(key: string): Promise<any> {
const [value, updateId] = await Promise.all([
cache.get(key),
cache.get(`${key}:updateId`),
]);
// Verify related entries have matching updateId
const relatedKeys = getRelatedKeys(key);
for (const relatedKey of relatedKeys) {
const relatedUpdateId = await cache.get(`${relatedKey}:updateId`);
if (relatedUpdateId !== updateId) {
// Log inconsistency for monitoring
console.warn(`Cache inconsistency detected: ${key}`);
// Trigger cache refresh
await refreshCache(key);
}
}
return value;
}
Transition: Fix verified and monitoring added. Mark "Verify Fix" complete.
Task state: All tasks completed.
Summary
What broke: Cache occasionally returned stale data after updates
Root cause: Async optimization introduced race window between setting value and invalidating related entries. Concurrent reads during this window could read new value but get stale related data.
The fix:
- Use cache transactions (multi/exec) for atomic updates
- All updates and invalidations complete atomically
- Added monitoring to detect inconsistencies
Prevention:
- Use atomic operations for multi-step cache updates
- Test concurrent operations explicitly
- Add timing logs to expose race conditions
- Monitor for cache inconsistency in production
Lessons
- Intermittent bugs require many test runs to find pattern
- Timing logs revealed race window
- Widening race window made bug consistent for testing
- Async optimization can introduce race conditions
- Atomic operations eliminate race windows
- Transaction support in cache library enables atomicity without locks