Backend Weekly

How Would You Design Memory Systems for Production-Ready AI Agents?

Solomon Eseme — Sat, 30 May 2026 09:37:53 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today's issue is brought to you by Masteringbackend → An all-in-one platform that helps backend engineers become highly-paid backend and AI engineers by leveraging a practical-based learning approach.

Here's another issue of Backend Weekly — your favorite newsletter on mastering AI backend engineering through real-world systems and interview design questions.

Before we dive in:

If you’ve asked your AI agent the same question twice, in two different sessions, you’ve probably noticed that it doesn’t remember the first answer, or even that it already solved this exact problem for you yesterday.

That’s because LLMs are stateless. Every request starts from zero. They lack memory: the kind that persists across sessions, accumulates knowledge over time, and makes an agent feel like it knows you.

Memory is not built in. It’s built by backend engineers like yourself.

If you want to be the engineer who builds these systems, not just the one who uses them, join us on Monday's AMA session to learn more about the “Build 10 AI Products in 30 Days” Bootcamp.

This is the AI Backend Interview Series on Backend Weekly, which airs every Saturday now.

In this series, I will guide you through answering common AI backend engineering interview questions, covering topics such as AI backend system design, vector databases, memory systems, microservices, API design, and databases.

Let's get started with episode 4 (Episode 3 Here):

The Interview Scenario

You’re in an AI backend interview.

They ask:

“How would you design a memory system that allows AI agents to remember context across sessions — including user preferences, past interactions, and learned behaviors — at production scale?”

Here’s how to approach it:

When your AI Agent makes an LLM call, it can feel like a blank slate.

The model receives a prompt, generates a response, and immediately forgets everything. On the next call, it has no context or knowledge of the user or previous conversations unless you send it again.

This is fine for a playground demo. It is not fine for production AI agents: real agents that assist customers, manage workflows, or collaborate with engineering teams across days, weeks, and months.

Memory is what turns a stateless text generator into an agent that actually knows things.

And here’s the part that most people miss:

Memory is not an AI problem. It is a backend engineering problem.

It’s storage. It’s retrieval. It’s lifecycle management. It’s scoping, access control, and consistency guarantees. That’s exactly what backend engineers have always done.

Let’s start from first principles.

Why LLMs Don’t Have Memory

Start here with your interviewer. LLMs process a context window, which is a fixed-size token buffer that contains everything the model can “see” during a single request. As of 2026, context windows range from 128K to 1M+ tokens.

A larger context window does not solve the memory problem. It just delays it. Think of the context window as RAM and memory as disk: more RAM lets you hold more at once, but nothing survives once the process exits.

Why?

Let’s explore the reasons:

Context windows are ephemeral. The content disappears after the response is generated. The next API call starts empty with nothing persisted.
Token cost scales linearly. Stuffing an entire conversation history into every request is expensive. A full-context approach on a 200K-token window can cost 10–50× more than a memory-augmented approach that injects only the relevant facts.
Recall degrades with length. Research consistently shows that LLMs perform worse at retrieving specific facts from extremely long prompts. More context doesn’t mean better understanding. It often means more noise.
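To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The 400 tokens per turn, the 50-turn conversation, and the 200-token memory block are illustrative assumptions, not measurements. Because each full-context request grows linearly with the history, the total across a conversation grows much faster than that.

```typescript
// Rough comparison of input tokens sent over one conversation.
// All numbers below are illustrative assumptions, not benchmarks.

/** Full-context: every request resends the entire history so far. */
function fullContextInputTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // Request N carries all N turns seen so far.
    total += turn * tokensPerTurn;
  }
  return total;
}

/** Memory-augmented: each request sends the new turn plus a fixed memory block. */
function memoryAugmentedInputTokens(
  turns: number,
  tokensPerTurn: number,
  memoryBudget: number,
): number {
  return turns * (tokensPerTurn + memoryBudget);
}

const full = fullContextInputTokens(50, 400); // 510000
const augmented = memoryAugmentedInputTokens(50, 400, 200); // 30000
console.log({ full, augmented, ratio: full / augmented }); // ratio: 17
```

Under these assumptions the full-context approach sends 17× more input tokens, which sits inside the 10–50× range quoted above.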

The solution is not a bigger context window. The solution is a system that selectively stores, consolidates, and retrieves the right information at the right time, and injects only what’s relevant into a manageable prompt.

That system is called memory.

Types of Memory

As of 2026, the AI engineering community has largely converged on three memory types: semantic (facts), episodic (past interactions), and procedural (learned behaviors).

These mirror human cognitive architecture, and that’s not a coincidence. If you think about it for a second, they solve the same fundamental problem:

What to remember, at what granularity, and for how long.

Discuss each with your interviewer.

Most production systems implement semantic and episodic memory as a minimum.

In procedural memory, agents rewrite their own instructions based on experience. It is more advanced and is primarily seen in frameworks like Letta (formerly MemGPT) and LangMem.

Tell your interviewer this:

The three memory types are not an either/or choice. A production memory system uses all three, scored and blended at retrieval time.

The Memory Architecture

Every memory system, regardless of framework, should follow the same four-stage pipeline: extract, consolidate, store, and retrieve.

This is the architecture you’ll design as a backend engineer.

Let me break each stage down for you.

[Stage 1 — Extract]: Turning Conversations into Facts

Raw conversation messages are not memory. They’re noise. The extraction stage uses an LLM call to distill a conversation into atomic, structured facts.

Here’s what extraction looks like in code:

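import { MemoryClient } from "mem0ai";

// Hosted Mem0 client, authenticated with an API key.
// (The self-hosted `Memory` class from "mem0ai/oss" is configured differently.)
const client = new MemoryClient({ apiKey: process.env.MEM0_API_KEY });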

// After a conversation, add the messages to memory
await client.add(
  [
    { role: "user",      content: "I work at Acme Corp in the billing team. We use Stripe." },
    { role: "assistant", content: "Got it! I'll keep that in mind for future billing discussions." },
    { role: "user",      content: "Can you always use TypeScript for code examples?" },
    { role: "assistant", content: "Absolutely — TypeScript it is from now on." },
  ],
  {
    user_id: "user_abc123",   // scoped to this user
    agent_id: "coding-agent", // scoped to this agent
  }
);

Behind the scenes, the extraction pipeline converts those four messages into discrete facts:

// Extracted memories (stored as separate records):
// 1. "User works at Acme Corp in the billing team"  - semantic
// 2. "User's company uses Stripe for payments"       - semantic
// 3. "User prefers TypeScript for code examples"     - procedural

Each fact is embedded as a vector, tagged with metadata (user ID, session ID, timestamp, memory type), and stored. The raw conversation is not stored, only the distilled facts.

[Stage 2 — Consolidate]: Deduplication and Conflict Resolution

This is where most memory systems fail. Without consolidation, memory accumulates contradictions and duplicates over time.

The user says “I work at Acme Corp” in session 1, then “I just joined Stripe” in session 30. Both facts exist in the vector store.

Which one is true?

Think about it.

The consolidation stage compares each extracted fact against existing memories and applies one of four operations:

ADD: The fact is new and no similar memory exists. Insert it.
UPDATE: A similar memory exists, but the details have changed. Replace it. For example, “Works at Acme Corp” becomes “Works at Stripe.”
DELETE: The new fact explicitly contradicts an old one. Remove the old one.
NOOP: The fact already exists in memory. Skip it.

This is implemented as a tool-calling pattern: the LLM examines the candidate fact alongside similar existing memories and decides which operation to apply.
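Here is a minimal in-memory sketch of the write side of consolidation. It assumes the LLM's decision has already been parsed into one of the four operations; the type names, the Map-based store, and the ID scheme are hypothetical and only there to show the shape of the logic.

```typescript
// Applying a consolidation decision to an in-memory store.
// In a real system `op` would come from an LLM tool call; here it is passed in directly.

type MemoryRecord = { id: string; text: string; updatedAt: number };

type ConsolidationOp =
  | { kind: "ADD"; text: string }
  | { kind: "UPDATE"; id: string; text: string }
  | { kind: "DELETE"; id: string }
  | { kind: "NOOP" };

let nextId = 1; // hypothetical ID scheme for this sketch

function applyOperation(
  store: Map<string, MemoryRecord>,
  op: ConsolidationOp,
  now: number = Date.now(),
): void {
  switch (op.kind) {
    case "ADD": {
      const id = `mem_${nextId++}`;
      store.set(id, { id, text: op.text, updatedAt: now });
      break;
    }
    case "UPDATE": {
      const existing = store.get(op.id);
      if (!existing) throw new Error(`Cannot update unknown memory ${op.id}`);
      store.set(op.id, { ...existing, text: op.text, updatedAt: now });
      break;
    }
    case "DELETE":
      store.delete(op.id);
      break;
    case "NOOP":
      break; // fact already known, nothing to write
  }
}

const store = new Map<string, MemoryRecord>([
  ["m1", { id: "m1", text: "Works at Acme Corp", updatedAt: 0 }],
]);
applyOperation(store, { kind: "UPDATE", id: "m1", text: "Works at Stripe" });
console.log(store.get("m1")?.text); // "Works at Stripe"
```

Keeping the decision (LLM) separate from the write (plain code) makes the write path easy to test and audit.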

[Stage 3 — Store]: The Dual-Store Architecture

Production memory systems use a dual-store architecture: a vector store for semantic search and an entity index for structured relationships.

Let me explain it with an illustration:

// Vector store: fast semantic retrieval
// "Find memories similar to 'billing API architecture'"
// → Returns: "User works on billing team", "Uses Stripe", "Prefers microservices"

// Entity index: structured relationship traversal
// "What do I know about entity 'Acme Corp'?"
// → Returns: "User works there", "Uses Stripe", "Billing team", "500 employees"

Why both?

This is important. Explain it in detail to your interviewer.

Vectors provide semantic flexibility: you can find related memories even when the wording is completely different. Entity indexing provides relational integrity: you can traverse relationships between entities without semantic drift.

For the vector store, discuss whichever technology you already know well:

pgvector
Qdrant
Chroma
Pinecone
Redis

For the entity index, Mem0 now uses a built-in entity collection rather than requiring an external graph database. Entities are identified at extraction time and stored in a parallel collection, then matched at retrieval time.
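To show what "a parallel collection matched at retrieval time" can look like, here is a small sketch of an entity index. It is not Mem0's implementation. The class name, the lowercase normalization, and the in-memory Map are assumptions for illustration.

```typescript
// An entity index kept alongside a vector store.
// Maps a normalized entity name to the IDs of memories that mention it.

class EntityIndex {
  private byEntity = new Map<string, Set<string>>();

  private static normalize(entity: string): string {
    return entity.trim().toLowerCase();
  }

  /** Called at extraction time with the entities found in a new memory. */
  link(memoryId: string, entities: string[]): void {
    for (const entity of entities) {
      const key = EntityIndex.normalize(entity);
      const ids = this.byEntity.get(key) ?? new Set<string>();
      ids.add(memoryId);
      this.byEntity.set(key, ids);
    }
  }

  /** Called at retrieval time with the entities found in the query. */
  lookup(entities: string[]): string[] {
    const found = new Set<string>();
    for (const entity of entities) {
      const ids = this.byEntity.get(EntityIndex.normalize(entity));
      if (ids) ids.forEach((id) => found.add(id));
    }
    return Array.from(found);
  }
}

const index = new EntityIndex();
index.link("m1", ["Acme Corp", "billing team"]);
index.link("m2", ["Acme Corp", "Stripe"]);
console.log(index.lookup(["acme corp"])); // ["m1", "m2"]
```

The lookup is an exact match on the entity name, which is why it complements the vector store instead of replacing it.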

[Stage 4 — Retrieve]: Multi-Signal Scoring

Retrieval is where memory becomes useful. When a new user message arrives, the system must decide:

Which memories are relevant to this specific request?

Modern retrieval runs three scoring passes in parallel:

async function retrieveMemories(query: string, userId: string) {
  // vectorStore, keywordIndex, entityIndex, embed, extractEntities, fuseScores,
  // and selectWithinTokenBudget are application-level helpers, not library calls.
  const filter = { user_id: userId };
  const entities = extractEntities(query); // e.g. ["Stripe", "billing API"]

  // Run the three scoring passes in parallel
  const [semanticResults, keywordResults, entityResults] = await Promise.all([
    // 1. Semantic similarity: embed query, find nearest vectors
    embed(query).then((embedding) =>
      vectorStore.search({ embedding, filter, topK: 20 })
    ),
    // 2. Keyword matching: exact term overlap for precision
    keywordIndex.search({ query, filter, topK: 20 }),
    // 3. Entity matching: match query entities against the entity index
    entityIndex.search({ entities, filter }),
  ]);

  // 4. Fuse scores: weighted relevance, adjusted for recency
  const fused = fuseScores(semanticResults, keywordResults, entityResults, {
    semanticWeight: 0.6,
    keywordWeight: 0.25,
    entityWeight: 0.15,
    recencyDecay: 0.95, // older memories score slightly lower
  });

  // 5. Return the top-ranked memories that fit a 200-token budget
  return selectWithinTokenBudget(fused, { maxTokens: 200 });
}

The fused results are injected into the LLM prompt as a system-level context block. The model sees them as pre-existing knowledge, not as search results. This is what makes the agent feel like it "remembers."
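The retrieval function above leans on two helpers, fuseScores and selectWithinTokenBudget, that are not defined in this issue. Here is one possible shape for them. Treat it as a sketch: the record fields, the weighted-sum fusion, the per-day decay, and the 4-characters-per-token estimate are all assumptions.

```typescript
// One possible implementation of the fusion and budgeting helpers.

type Scored = { id: string; text: string; score: number; createdAt: number };

type FusionWeights = {
  semanticWeight: number;
  keywordWeight: number;
  entityWeight: number;
  recencyDecay: number; // multiplier applied once per day of age
};

const DAY_MS = 24 * 60 * 60 * 1000;

function fuseScores(
  semantic: Scored[],
  keyword: Scored[],
  entity: Scored[],
  w: FusionWeights,
  now: number = Date.now(),
): Scored[] {
  const fused = new Map<string, Scored>();
  // Add each pass's weighted score to the running total for that memory.
  const accumulate = (results: Scored[], weight: number): void => {
    results.forEach((r) => {
      const prev = fused.get(r.id);
      fused.set(r.id, { ...r, score: (prev ? prev.score : 0) + weight * r.score });
    });
  };
  accumulate(semantic, w.semanticWeight);
  accumulate(keyword, w.keywordWeight);
  accumulate(entity, w.entityWeight);

  return Array.from(fused.values())
    .map((m) => {
      const ageDays = Math.max(0, (now - m.createdAt) / DAY_MS);
      return { ...m, score: m.score * Math.pow(w.recencyDecay, ageDays) };
    })
    .sort((a, b) => b.score - a.score);
}

function selectWithinTokenBudget(
  ranked: Scored[],
  opts: { maxTokens: number },
): Scored[] {
  const selected: Scored[] = [];
  let used = 0;
  for (const m of ranked) {
    const cost = Math.ceil(m.text.length / 4); // crude token estimate
    if (used + cost > opts.maxTokens) continue; // skip facts that do not fit
    selected.push(m);
    used += cost;
  }
  return selected;
}
```

A memory found by more than one pass accumulates score from each, so agreement between signals pushes it up the ranking. In production you would replace the character-based estimate with your model's real tokenizer.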

Next, let’s talk about “Who Remembers What”, so that you can discuss it with your interviewer.

Memory Scoping: Who Remembers What

In production, you don’t have one global memory. You have scoped memories that determine who can see what.

Discuss this with your interviewer because it’s the access control layer of your memory system.

User-scoped: Memories specific to one user. For example, “This user prefers dark mode.” Only retrieved when that user is active. This is the most common scope.
Session-scoped: Memories that expire with the session. Short-term working memory. For example, “In this session, we’re refactoring the auth module.” Cleared on session end.
Agent-scoped: Knowledge the agent has learned across all users. “This codebase uses Prisma ORM.” Retrieved regardless of which user is active.
Organization-scoped: Shared memories across a team. “Acme Corp’s coding standards require 80% test coverage.” Retrieved for any user in that org.

Each memory record is tagged with its scope IDs at write time, and filtered by those IDs at read time.

This is not conceptually different from row-level security in PostgreSQL or tenant isolation in a multi-tenant API. It’s the same access control pattern for a new data type.
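Here is a minimal sketch of that tag-on-write, filter-on-read pattern. The scope shapes and field names (userId, sessionId, agentId, orgId) are assumptions for illustration. In a real system you would push this filter into the store query instead of filtering in application code.

```typescript
// Scope tagging at write time, scope filtering at read time.

type Scope =
  | { kind: "user"; userId: string }
  | { kind: "session"; sessionId: string }
  | { kind: "agent"; agentId: string }
  | { kind: "org"; orgId: string };

type ScopedMemory = { text: string; scope: Scope };

type RequestContext = {
  userId: string;
  sessionId: string;
  agentId: string;
  orgId: string;
};

/** A memory is visible only if its scope ID matches the current request. */
function isVisible(memory: ScopedMemory, ctx: RequestContext): boolean {
  const s = memory.scope;
  switch (s.kind) {
    case "user":
      return s.userId === ctx.userId;
    case "session":
      return s.sessionId === ctx.sessionId;
    case "agent":
      return s.agentId === ctx.agentId;
    case "org":
      return s.orgId === ctx.orgId;
  }
}

const memories: ScopedMemory[] = [
  { text: "Prefers dark mode", scope: { kind: "user", userId: "u1" } },
  { text: "Prefers light mode", scope: { kind: "user", userId: "u2" } },
  { text: "Refactoring the auth module", scope: { kind: "session", sessionId: "s9" } },
  { text: "Codebase uses Prisma ORM", scope: { kind: "agent", agentId: "coding-agent" } },
  { text: "Requires 80% test coverage", scope: { kind: "org", orgId: "acme" } },
];

const ctx: RequestContext = {
  userId: "u1",
  sessionId: "s1",
  agentId: "coding-agent",
  orgId: "acme",
};

const visible = memories.filter((m) => isVisible(m, ctx)).map((m) => m.text);
console.log(visible);
// ["Prefers dark mode", "Codebase uses Prisma ORM", "Requires 80% test coverage"]
```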

Let’s build a simple flow:

A Memory-Augmented Agent

Here’s the complete flow starting from user message to memory-augmented response:

import { MemoryClient } from "mem0ai";
import { OpenAI } from "openai";

const memory = new MemoryClient({ apiKey: process.env.MEM0_API_KEY });
const openai = new OpenAI();

async function chat(userId: string, userMessage: string) {
  // 1. Retrieve relevant memories BEFORE calling the LLM
  const memories = await memory.search(userMessage, { user_id: userId });
  const memoryContext = memories
    .map((m: { memory: string }) => `- ${m.memory}`)
    .join("\n");

  // 2. Inject memories into the system prompt
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant. Here's what you remember about this user:\n${memoryContext || "No memories yet."}`,
      },
      { role: "user", content: userMessage },
    ],
  });

  const assistantMessage = response.choices[0].message.content;

  // 3. Store new memories from this interaction (async — don't block response)
  memory.add(
    [
      { role: "user", content: userMessage },
      { role: "assistant", content: assistantMessage! },
    ],
    { user_id: userId }
  ).catch((err: Error) => console.error("Memory write failed:", err));

  return assistantMessage;
}

Three things to notice.

First, memory retrieval happens before the LLM call. Retrieved facts are injected into the system prompt, so the model already "knows" them when it generates a response.
Second, memory storage happens after the response, asynchronously, so you don't block the user's response to write memories.
Third, the .add() call handles extraction, consolidation, and storage internally. You send only raw messages; the framework handles the rest.

Scaling and Production Concerns

Here’s where you impress your interviewer.

The happy path of memory is easy. Production-grade memory has edge cases that most engineers don’t think about until they hit them.

Token Budget Management

The biggest production concern is how much memory to inject.

If it’s too little, the agent forgets critical context. If it’s too much, injected memories eat tokens that should go to the user’s actual request.

A common starting budget is about 200 tokens for injected memories, which is enough for 5–8 atomic facts. Your retrieval scoring must rank ruthlessly within that budget.

Memory Staleness and Decay

From a business perspective, things change in a production system. Users switch jobs, change preferences, and abandon projects. Customers churn and move to other platforms.

Your memory system needs a recency decay strategy: for example, scoring newer memories higher than older ones, or deleting memories after 30 days of user inactivity.

Some systems implement explicit expiration: session-scoped memories expire automatically; semantic memories older than N days are deprioritized; episodic memories are pruned to the most recent K interactions.
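Two of those expiration rules can be sketched in a few lines. The record shape, the function names, and the thresholds are hypothetical. The point is that this is ordinary TTL-style logic.

```typescript
// Lifecycle rules: flag stale memories and prune old episodic ones.

type StoredMemory = {
  id: string;
  type: "semantic" | "episodic";
  createdAt: number; // epoch milliseconds
};

const DAY_MS = 24 * 60 * 60 * 1000;

/** True when a memory is older than `maxAgeDays` and should be deprioritized. */
function isStale(memory: StoredMemory, now: number, maxAgeDays: number): boolean {
  return now - memory.createdAt > maxAgeDays * DAY_MS;
}

/** Keeps the K most recent episodic memories and every non-episodic memory. */
function pruneEpisodic(memories: StoredMemory[], keep: number): StoredMemory[] {
  const newestIds = memories
    .filter((m) => m.type === "episodic")
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, keep)
    .map((m) => m.id);
  return memories.filter(
    (m) => m.type !== "episodic" || newestIds.indexOf(m.id) !== -1,
  );
}

const now = 100 * DAY_MS;
const memories: StoredMemory[] = [
  { id: "s1", type: "semantic", createdAt: now - 90 * DAY_MS },
  { id: "e1", type: "episodic", createdAt: now - 3 * DAY_MS },
  { id: "e2", type: "episodic", createdAt: now - 2 * DAY_MS },
  { id: "e3", type: "episodic", createdAt: now - 1 * DAY_MS },
];

console.log(pruneEpisodic(memories, 2).map((m) => m.id)); // ["s1", "e2", "e3"]
console.log(isStale(memories[0], now, 30)); // true
```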

Privacy and Data Governance

Memory creates a persistent record of user interactions. In regulated industries like healthcare, finance, and legal, this triggers compliance requirements. Your memory system needs:

Delete APIs: Users must be able to delete their memories (GDPR right to erasure).
Scope isolation: One user’s memories must never leak into another user’s retrieval results.
Audit trails: Every memory write, update, and delete must be logged.
Encryption at rest: Memory records contain user data and should be encrypted.
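As a sketch of how the delete API and the audit trail fit together, here is an in-memory version. The function and type names are hypothetical. A real erasure would also have to remove the vectors, entity index entries, and any cached copies.

```typescript
// Erase every memory owned by a user and log each deletion.

type UserMemory = { id: string; userId: string; text: string };

type AuditEntry = {
  action: "delete";
  memoryId: string;
  userId: string;
  at: string; // ISO 8601 timestamp
};

/** Returns the memories that remain and appends one audit entry per deletion. */
function eraseUser(
  memories: UserMemory[],
  userId: string,
  audit: AuditEntry[],
  now: Date = new Date(),
): UserMemory[] {
  const remaining: UserMemory[] = [];
  for (const m of memories) {
    if (m.userId === userId) {
      // Log the ID only: the audit trail should not retain the erased content.
      audit.push({ action: "delete", memoryId: m.id, userId, at: now.toISOString() });
    } else {
      remaining.push(m);
    }
  }
  return remaining;
}

const audit: AuditEntry[] = [];
const remaining = eraseUser(
  [
    { id: "m1", userId: "u1", text: "Works at Acme Corp" },
    { id: "m2", userId: "u2", text: "Prefers TypeScript" },
  ],
  "u1",
  audit,
);
console.log(remaining.map((m) => m.id)); // ["m2"]
console.log(audit.length); // 1
```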

Observability

You can’t manage what you can’t see.

These metrics are worth tracking. Ask your interviewer which ones matter most for their use case:

Memory retrieval latency (p50, p95): Should be under 100ms. If it’s higher, your vector index needs tuning, or your token budget filtering is too complex.
Memory hit rate: What percentage of queries return at least one relevant memory? A consistently low hit rate means your extraction pipeline isn’t capturing useful facts.
Token efficiency: How many tokens do you inject vs how many are used for the user’s actual request? Track the ratio. Full-context approaches consume 26,000+ tokens per conversation. Memory-augmented approaches average ~7,000 tokens per retrieval, which is a 73% reduction.
Consolidation conflict rate: How often does a new fact UPDATE or DELETE an existing memory? A high conflict rate may indicate noisy extraction or a domain where facts change rapidly.
Stale memory retrievals: How often do retrieved memories turn out to be outdated? Track user corrections (“Actually, I no longer work at Acme”) as a signal.
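The first two metrics are simple to compute from raw samples. Here is a sketch using the nearest-rank percentile method. The sample latencies are made up, and the hit rate counts any non-empty result as a hit, which is only a proxy for relevance.

```typescript
// Computing p50/p95 latency and hit rate from raw samples.

/** Nearest-rank percentile over latency samples in milliseconds. */
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

/** Share of queries that returned at least one memory. */
function hitRate(resultCounts: number[]): number {
  if (resultCounts.length === 0) return 0;
  const hits = resultCounts.filter((n) => n > 0).length;
  return hits / resultCounts.length;
}

const latencies = [12, 18, 25, 31, 40, 44, 52, 61, 75, 140];
console.log(percentile(latencies, 50)); // 40
console.log(percentile(latencies, 95)); // 140
console.log(hitRate([3, 0, 5, 1])); // 0.75
```

Here p50 looks healthy while p95 breaches the 100ms target, which is exactly why you track both.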

The Framework Landscape in 2026

Discuss these with your interviewer to show awareness of the production landscape.

You don’t need to build memory from scratch. The ecosystem has matured.

Mem0: The most widely adopted standalone memory layer. 57K+ GitHub stars. Dual-store architecture (vector + entity index). Supports 20+ vector backends. Works with any LLM stack via REST API. YC-backed, $24M Series A. This is the default recommendation for most teams.
Zep: Production-grade memory with hybrid vector + graph storage. Strong session management. Best for long-running agent sessions where temporal ordering matters deeply.
LangMem (LangChain): Built into the LangChain/LangGraph ecosystem. Supports all three memory types, including procedural. Best when you’re already committed to LangGraph for agent orchestration.
Letta (MemGPT): Tiered memory with self-editing capabilities. The agent can explicitly choose to write, update, or delete its own memories. Most advanced for autonomous agents with long horizons.

Final Answer

“I’d design the memory system as a four-stage pipeline: extract, consolidate, store, and retrieve. I’d store memories in a dual-store architecture that combines vector search for semantic retrieval with an entity index for relational traversal, and retrieve using multi-signal scoring that fuses semantic similarity, keyword matching, and entity matching. Memory is scoped by user, session, agent, and organization to enforce isolation. For production hardening, I’d add recency decay scoring, GDPR-compliant delete APIs, scope-level isolation, and observability on retrieval latency, hit rate, and token efficiency.”

Designing a memory system for AI agents sounds like a problem for ML engineers and AI researchers. But as you dig in, you realize it’s built on the same foundations you’ve been working with for years:

Write-ahead pipelines: Extract, consolidate, store. The same pattern as event sourcing.
Dual-store retrieval: Vector + structured index. The same architecture as search systems that combine full-text with filters.
Scope-based access control: User, session, org. The same isolation model as multi-tenant APIs.
Token budget management: Deciding what to include and what to leave out. The same constraint as API response pagination.
Lifecycle management: Expiration, decay, pruning. The same TTL logic as cache eviction.

Memory for AI agents is not exotic new technology.

It is backend infrastructure for storage, retrieval, access control, and lifecycle, just applied to a new data type.

The things an agent has learned.

So the next time an interviewer asks, “How would you give an AI agent memory?” don’t just say “I’d use a vector database.”

Walk them through the extraction pipeline. Explain how consolidation resolves conflicting facts. Show them the multi-signal scoring function. Talk about what happens when a user exercises their right to be forgotten, and every memory scoped to their ID must be deleted within 30 days.

That’s the answer that shows you’ve built real systems, and not just plugged in a library.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this series and if you want me to continue breaking down interview questions like this.

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

1. The MB Platform: Join 4000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

3. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying. Lastly, you can hire backend engineers anywhere in the world.

LAST WORD 👋

How am I doing?

I love hearing from readers, and I’m always looking for feedback. How am I doing with The Backend Weekly? Is there anything you’d like to see more or less of? Which aspects of the newsletter do you enjoy the most?

Hit reply and say hello — I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

Understanding Vector Databases for Backend Engineers

Solomon Eseme — Thu, 21 May 2026 09:30:40 GMT

Here's another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

Every AI-powered feature you’ve seen in the last two years — semantic search, chatbots that answer from your docs, recommendation engines that actually understand context — has the same thing under the hood: vector embeddings stored in a database designed to search them.

If you’re a backend engineer building anything that touches AI, understanding vector databases is no longer optional. It’s the data layer you’ll be asked about in your next interview, and the one your team will ask you to architect next quarter.

That’s why we built the “Golang30 AI Bootcamp” to help you learn how to build 10 production-ready AI projects in Golang. 30 days. Real systems. Real production patterns.

This is the AI Backend Engineer Series on Backend Weekly.

In this series, I will guide you through understanding and learning AI backend engineering. Let’s get started with episode 3 of this series. Episode 2 here

You’re in an AI Engineer interview.

They ask:

“What is a vector database, how does it work under the hood, and when would you use one instead of a traditional database in a backend AI system?”

Here’s how to approach it:

Every database you’ve ever used answers the same kind of question: “Give me the row where id = 42.” Or “Give me all rows where status = 'active' and created_at > '2025-01-01'.”

These are exact-match queries. The database compares values. Either a row matches, or it doesn’t. This works brilliantly for structured data, and it’s the foundation of every production system you’ve built.

But what if the question isn’t exact?

“Find me documents that are about contract disputes, even if the word ‘contract’ never appears.” Or “Find me products similar to this one.” Or “Find the 10 most relevant knowledge base articles for this customer question.”

Traditional databases can’t answer these questions. They compare bytes; they don’t understand meaning.

Vector databases can.

Therefore, as a backend engineer building AI-powered features in 2026, you need to understand how.

What Is a Vector Database?

Start here with your interviewer:

A vector database is a database system purpose-built for storing, indexing, and searching high-dimensional vectors, also called embeddings.

An embedding is a numerical representation of a piece of data (a sentence, an image, a product, or a user profile) produced by a machine learning model. The model converts the meaning of the data into a list of numbers, typically 768 to 1536 dimensions.

The key property is that things that are semantically similar end up close together in vector space.

The sentences “My order hasn’t arrived” and “Where is my package?” produce embeddings that are close to each other, even though they share zero words.

A vector database does one thing exceptionally well. It is designed to efficiently search through embeddings.

Given a query vector, it finds the K most similar vectors in the database. This process is called similarity search, and it powers semantic search, RAG pipelines, recommendation systems, anomaly detection, and multi-modal search across text, images, and audio.

Discuss this with your interviewer:

The core difference between a traditional database and a vector database is the type of question it answers. A traditional database answers "find the exact match." A vector database answers "find the closest meaning."

How Vector Search Works Under the Hood

When a query arrives at a vector database, this is what happens:

The critical step is the index lookup.

Without an index, finding the most similar vectors requires comparing the query against every single vector in the database. That’s a brute-force scan, O(n) per query, and it becomes prohibitively slow as the collection grows beyond tens of thousands of vectors.

This is where Approximate Nearest Neighbor (ANN) algorithms come in. They trade a tiny amount of accuracy for massive speed improvements, finding results that are “close enough” to the true nearest neighbors in a fraction of the time.
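To see what an ANN index is saving you from, here is the exact brute-force search it approximates. The three-dimensional vectors and document IDs are toy values for illustration. Real embeddings have hundreds or thousands of dimensions.

```typescript
// Exact nearest-neighbor search: one similarity computation per stored vector.

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

type Doc = { id: string; embedding: number[] };

/** Compares the query to every document, then keeps the k best matches. */
function bruteForceTopK(query: number[], docs: Doc[], k: number) {
  return docs
    .map((d) => ({ id: d.id, similarity: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}

const docs: Doc[] = [
  { id: "refund-policy", embedding: [0.9, 0.1, 0.0] },
  { id: "shipping-delay", embedding: [0.1, 0.9, 0.2] },
  { id: "api-auth", embedding: [0.0, 0.2, 0.9] },
];

console.log(bruteForceTopK([0.2, 0.8, 0.1], docs, 2).map((r) => r.id));
// ["shipping-delay", "refund-policy"]
```

Every query touches every document, so latency grows with the size of the collection. An ANN index avoids most of those comparisons.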

The Indexing Algorithms You Need to Know

Three indexing algorithms come up in interviews. You don’t need to implement them from scratch, but you need to understand what they do, when to use them, and their trade-offs.

HNSW (Hierarchical Navigable Small World)

The most widely used ANN index in production today. HNSW builds a multi-layered graph where each node is a vector and edges connect similar vectors. The top layers have long-range connections (for fast global navigation), and the bottom layers have short-range connections (for precise local search).

Think of it like a skip list, but in high-dimensional space. You start at the top layer, make big jumps to get close to the target region, then descend layer by layer to find the nearest neighbors.

Speed: Excellent query performance. Sub-millisecond at millions of vectors.
Recall: Very high, typically 95–99%+ with proper tuning.
Trade-off: High memory usage. The entire graph lives in RAM. Slower build times than IVF.
Best for: Most production workloads under 50M vectors where query speed matters most.

IVFFlat (Inverted File Index)

IVF works by clustering your vectors into groups using k-means, then only searching the clusters closest to the query vector. Instead of scanning every vector, it scans only the relevant clusters, dramatically reducing the search space.

Speed: Good, but slower than HNSW for most workloads.
Recall: Depends heavily on the number of clusters scanned. More clusters = higher recall = slower search.
Trade-off: Requires a training step — you need existing data before building the index. Not great for tables that start empty.
Best for: Large datasets where memory is constrained. Often combined with Product Quantization (PQ) for compression.

Flat (Brute Force)

This algorithm needs no index at all. It compares the query against every single vector with 100% recall, but O(n) scan time.

The flat (Brute force) strategy is best for small datasets (under 10K vectors), benchmarking recall of other indexes, or when perfect accuracy is required.

Tell your interviewer this:

HNSW is the default choice for most production systems. Use IVF when you have billions of vectors, and memory is the constraint. Use flat only when the dataset is small enough that brute-force is fast.

Distance Metrics: How “Similar” Is Defined

The database needs a function to measure how close two vectors are. Three distance metrics dominate:

Cosine Similarity: Measures the angle between two vectors. Ignores magnitude, focuses on direction. This is the default for most text embedding models, such as those from OpenAI, Cohere, and Sentence Transformers, many of which return normalized vectors. Use this unless you have a specific reason not to.
L2 (Euclidean) Distance: Measures the straight-line distance between two points. Considers both direction and magnitude. Better when magnitude carries meaning, like user activity intensity.
Dot Product (Inner Product): A fast alternative to cosine when vectors are already normalized. Often used by recommendation systems.

When building the index, you must specify which distance metric to use, and your queries must use the same one. Mixing them is a silent correctness bug that’s hard to catch.
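A small worked example shows how the three metrics relate. The vectors are toy values. On unit-length vectors, dot product equals cosine similarity and squared L2 distance equals 2 − 2 × cosine, so all three rank neighbors the same way. Once magnitudes differ, they disagree.

```typescript
// Comparing cosine similarity, L2 distance, and dot product.

function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

function l2Distance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

function normalize(vec: number[]): number[] {
  const magnitude = Math.sqrt(dot(vec, vec));
  return vec.map((x) => x / magnitude);
}

// Same direction, different magnitude: cosine ignores the difference.
console.log(cosineSimilarity([3, 4], [6, 8])); // 1
console.log(l2Distance([3, 4], [6, 8])); // 5
console.log(dot([3, 4], [6, 8])); // 50

// Unit-length vectors: dot product matches cosine similarity.
const unitA = normalize([3, 4]); // ≈ [0.6, 0.8]
const unitB = normalize([4, 3]); // ≈ [0.8, 0.6]
console.log(dot(unitA, unitB)); // ≈ 0.96
console.log(l2Distance(unitA, unitB) ** 2); // ≈ 0.08, i.e. 2 - 2 * 0.96
```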

Vector Search with pgvector

Here’s where it gets practical.

For most backend teams, the right starting point is not a dedicated vector database. It’s pgvector, the PostgreSQL extension that adds vector columns, distance operators, and ANN indexing to the database you’re already running.

Here’s why:

Your documents and embeddings live in the same table, in the same transaction, with no sync pipeline or extra credentials. Additionally, your team has no new service to monitor. For workloads under 5 million vectors, pgvector’s performance is more than adequate.

Setting Up pgvector

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE documents (
  id       SERIAL PRIMARY KEY,
  title    TEXT NOT NULL,
  content  TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding VECTOR(1536)  -- text-embedding-3-small outputs 1536 dims
);

-- Create an HNSW index for cosine similarity
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- m = max connections per node (higher = more accurate, more memory)
-- ef_construction = build-time search depth (higher = better index, slower build)

Inserting Embeddings

import { OpenAI } from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function insertDocument(title: string, content: string) {
  // 1. Generate the embedding from the content
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: content,
  });

  const embedding = response.data[0].embedding; // float[] of 1536 dims

  // 2. Store both the content AND the embedding in the same row
  await pool.query(
    `INSERT INTO documents (title, content, embedding)
     VALUES ($1, $2, $3)`,
    [title, content, JSON.stringify(embedding)]
  );
}
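Because the column is declared as VECTOR(1536), an embedding with the wrong number of dimensions fails at insert time. A small guard can catch that earlier with a clearer error. toVectorLiteral is a hypothetical helper, not part of pgvector or the pg driver. It produces the same bracketed text form that JSON.stringify yields for a plain number array.

```typescript
// Hypothetical helper: validate an embedding and format it as pgvector's
// text representation, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[], expectedDims: number): string {
  if (embedding.length !== expectedDims) {
    throw new Error(`Expected ${expectedDims} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("Embedding contains NaN or Infinity");
  }
  return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.1, 0.2, 0.3], 3)); // "[0.1,0.2,0.3]"
```

In the insert above you could pass toVectorLiteral(embedding, 1536) in place of JSON.stringify(embedding).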

Querying: Semantic Search

async function semanticSearch(query: string, limit: number = 5) {
  // 1. Embed the query with the SAME model used for documents
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  const queryEmbedding = response.data[0].embedding;

  // 2. Find the closest vectors using cosine distance (<=>) 
  const result = await pool.query(
    `SELECT id, title, content,
            1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  );

  return result.rows;
  // Returns: [{ id, title, content, similarity: 0.92 }, ...]
}

Notice the <=> operator.

That’s pgvector’s cosine distance operator. Because the query orders by the same operator the index was built for (vector_cosine_ops) and has a LIMIT, PostgreSQL’s query planner can use the HNSW index automatically. You get ANN search through standard SQL, in the same transaction as your regular queries.

Hybrid Search: Combining Vector and Metadata Filters

In production, pure vector search is rarely enough. You almost always need to combine semantic similarity with traditional filters: by user, by date, by category, by tenant. Discuss this with your interviewer.

// Find support articles similar to a question, but only for a specific product
const result = await pool.query(
  `SELECT id, title, content,
          1 - (embedding <=> $1::vector) AS similarity
   FROM documents
   WHERE metadata->>'product' = $2
     AND metadata->>'status' = 'published'
   ORDER BY embedding <=> $1::vector
   LIMIT 10`,
  [JSON.stringify(queryEmbedding), "billing-api"]
);

This is one of pgvector's biggest advantages: the WHERE clause and the vector search happen in the same query, in the same transaction, against the same table. No sync pipeline between a metadata store and a vector store.
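If the set of filters varies per request, one option is to pass them as a single JSONB parameter and use PostgreSQL's containment operator (@>) instead of hard-coding each metadata key. The sketch below only builds the query object. buildHybridQuery is a hypothetical helper, and it assumes the documents table defined earlier.

```typescript
// Hypothetical helper: build a parameterized hybrid-search query.
// Filters are matched with JSONB containment, so no key names are
// interpolated into the SQL string.
function buildHybridQuery(
  queryEmbedding: number[],
  filters: Record<string, string>,
  limit: number,
): { text: string; values: (string | number)[] } {
  return {
    text: `SELECT id, title, content,
                  1 - (embedding <=> $1::vector) AS similarity
           FROM documents
           WHERE metadata @> $2::jsonb
           ORDER BY embedding <=> $1::vector
           LIMIT $3`,
    values: [JSON.stringify(queryEmbedding), JSON.stringify(filters), limit],
  };
}

const hybrid = buildHybridQuery(
  [0.1, 0.2, 0.3],
  { product: "billing-api", status: "published" },
  10,
);
console.log(hybrid.values[1]); // {"product":"billing-api","status":"published"}
```

The resulting object has the { text, values } shape that pool.query accepts, so you can pass it straight through.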

pgvector vs Dedicated Vector Databases

This is where many candidates stumble. They either always say “use Pinecone” or always say “use Postgres.” The right answer depends on the workload. Walk your interviewer through the decision:

Rule of thumb:

Start with pgvector. It handles more than most teams realize. Move to a dedicated vector database when you hit a scaling ceiling that Postgres can't solve: hundreds of millions of vectors, auto-scaling requirements, or multi-tenant isolation at the vector level.

Real-World Use Cases for Backend Engineers

Don’t just list use cases. Explain them in terms your interviewer will recognize as production systems:

RAG (Retrieval-Augmented Generation): The most common use case. You embed your knowledge base, store the vectors, and when a user asks a question, you embed the question, find the K most relevant documents via vector search, and feed those documents to the LLM as context. The LLM answers from your data, not from its training set. This is how every “chat with your docs” feature works.
Semantic search: Traditional search requires keyword matching. With vector search, a query for “how to cancel my subscription” will surface articles about “membership termination” and “account deactivation”, because the embeddings are close together in meaning.
Recommendation engines: Embed your products, your users, and their interactions. Find products whose embeddings are closest to what a user has engaged with. Shopify uses a hybrid vector + keyword search for product discovery at scale.
Anomaly detection: Embed your normal system behavior. When a new data point is far from all known embeddings, flag it as anomalous. Used in fraud detection and security monitoring.
Deduplication: Finding near-duplicate content, support tickets, product listings, and user-generated posts that wouldn’t be caught by exact-match comparison.
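The RAG flow described above ends with feeding retrieved documents to the LLM as context. Here is a minimal sketch of that last step, assuming rows shaped like the output of semanticSearch. The `buildRagPrompt` name, the prompt wording, and the 0.5 similarity cutoff are illustrative assumptions, not recommended values.

```typescript
interface RetrievedDoc {
  id: number;
  title: string;
  content: string;
  similarity: number;
}

// Turns vector-search results into a grounded prompt for the LLM.
function buildRagPrompt(
  question: string,
  docs: RetrievedDoc[],
  minSimilarity = 0.5 // assumed cutoff; tune against your own data
): string {
  // Drop weak matches and put the strongest context first
  const relevant = docs
    .filter((d) => d.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity);

  const context = relevant
    .map((d, i) => `[${i + 1}] ${d.title}\n${d.content}`)
    .join("\n\n");

  return [
    "Answer the question using only the context below.",
    "If the context is insufficient, say so.",
    "",
    "Context:",
    context || "(no relevant documents found)",
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The returned string would then be sent to whichever chat model you use. Numbering the documents makes it easier to ask the model to cite its sources.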

Observability and Production Concerns

You can’t manage what you can’t see. Discuss these with your interviewer:

Metrics to Track

Query latency (p50, p95, p99): Vector search should complete in 10–50ms. If it’s above 200ms, your index parameters need tuning, or your dataset has outgrown a single instance.
Recall accuracy: Periodically run ground-truth comparisons, flat (brute-force) search vs your ANN index, to verify recall hasn’t degraded. Target 95%+ for most applications.
Embedding generation latency: The API call to your embedding model (OpenAI, Cohere) is often the bottleneck, not the vector search itself. Track this separately.
Index build time: HNSW indexes can take minutes to hours on large datasets. Track rebuild durations and plan around them.
Memory usage: HNSW indexes perform best when they fit in RAM. Monitor memory consumption as your vector count grows. A 1536-dimension vector at 32-bit precision is ~6KB (1536 × 4 bytes), so 1 million vectors need ~6GB for the raw vectors alone, before graph and index overhead.
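Two of these metrics reduce to simple arithmetic. The sketch below shows one way to compute recall against a brute-force baseline and to estimate raw vector storage; the function names are hypothetical, and the memory figure deliberately ignores HNSW graph overhead.

```typescript
// Recall: fraction of the exact (brute-force) top-K that the ANN index also returned.
function recallAtK(exactIds: number[], annIds: number[]): number {
  if (exactIds.length === 0) return 1;
  const annSet = new Set(annIds);
  const hits = exactIds.filter((id) => annSet.has(id)).length;
  return hits / exactIds.length;
}

// Raw storage for the vectors alone (no graph links, no page overhead).
function rawVectorBytes(
  dimensions: number,
  vectorCount: number,
  bytesPerFloat = 4 // 32-bit floats
): number {
  return dimensions * bytesPerFloat * vectorCount;
}

// Example: the exact search returned ids 1-4, the ANN index missed id 4.
const recall = recallAtK([1, 2, 3, 4], [1, 2, 3, 9]); // 0.75
const bytesPerVector = rawVectorBytes(1536, 1); // 6144 bytes, roughly 6KB
```

Running the recall check on a sample of real queries on a schedule is usually enough to catch degradation after index rebuilds or parameter changes.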

Common Production Pitfalls

Embedding model mismatch: If you embed your documents with text-embedding-3-small but query with text-embedding-ada-002, the results will be garbage. Always use the same model for both insertion and query. Store the model name as metadata.
Stale embeddings: When your source content changes, the embedding doesn’t automatically update. Build a re-embedding pipeline triggered by content updates.
Missing distance metric alignment: If you build your HNSW index with vector_cosine_ops but query with the L2 distance operator (<->), the index won’t be used. PostgreSQL will fall back to a sequential scan. Always match the index operator class to the operator in your query.
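The first two pitfalls can be guarded against with a small check at write time. This is a sketch under the assumption that you store the embedding model name and a hash of the source content next to each vector; the field and function names are hypothetical.

```typescript
import { createHash } from "crypto";

// Metadata assumed to be stored alongside each embedding row
interface StoredEmbeddingMeta {
  embeddingModel: string;
  contentHash: string;
}

function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// True when the stored vector is missing, was produced by a different
// model, or no longer matches the current source content.
function needsReembedding(
  currentContent: string,
  stored: StoredEmbeddingMeta | undefined,
  activeModel: string
): boolean {
  if (!stored) return true;
  if (stored.embeddingModel !== activeModel) return true;
  return stored.contentHash !== hashContent(currentContent);
}
```

A re-embedding worker can call this on every content update and skip the embedding API call when nothing has changed.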

Final Answer

“A vector database is purpose-built for storing and searching high-dimensional embeddings by semantic similarity, rather than exact match. Under the hood, it uses ANN algorithms, primarily HNSW, to find the K nearest neighbors in sub-linear time. For most backend teams, I’d start with pgvector: it adds vector columns, cosine distance operators, and HNSW indexing directly to PostgreSQL, so embeddings and metadata live in the same table, same transaction, same query. This avoids the sync pipeline and operational overhead of a separate vector store. I’d move to a dedicated solution like Pinecone or Weaviate only when the workload exceeds what a single Postgres instance can handle: hundreds of millions of vectors or high concurrent search throughput. The key production concerns are embedding model consistency, index memory budgeting, and recall monitoring.”

Understanding vector databases sounds like an AI-specific topic. Something for ML engineers, not backend engineers. But as you dig in, you realize the core challenges are deeply familiar:

Choosing the right index type for a workload is the same trade-off you’ve made with B-trees vs hash indexes vs GIN
Memory budgeting is the same capacity planning you do for any in-memory data structure
Consistency guarantees between data and its derived representations are the same challenge as materialized views or search indexes
Hybrid querying is combining new query patterns with existing relational data in the same transaction
Observability (latency, recall, and throughput) is the same metrics discipline you apply to any production system

Vector databases are not some exotic new technology that requires you to unlearn everything you know. They are an extension of the same storage engineering principles you’ve been applying for years, applied to a new data type and a new class of query.

So the next time an interviewer asks, “What is a vector database and when would you use one?” don’t just say “it’s a database for AI embeddings.”

Walk them through how HNSW builds its multi-layered graph. Explain why cosine similarity is the default for text embeddings. Show them the pgvector query that combines a WHERE clause with a vector search in a single SQL statement. Tell them exactly when you’d outgrow Postgres and why.

That’s the answer that shows you understand the engineering underneath and not just the buzzword on top.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

LAST WORD 👋

How am I doing?

Hit reply and say hello — I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

Understanding MCP for Backend Engineers

Solomon Eseme — Sat, 25 Apr 2026 15:49:43 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today's issue is brought to you by Masteringbackend → An all-in-one platform that helps backend engineers become highly paid backend and AI engineers by leveraging a practical learning approach.

Here's another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

Give Your AI Agent Eyes on the Web

Here’s something nobody talks about when they recommend MCP servers:

3 MCP servers consumed 143,000 of 200,000 tokens before an agent read its first message. That’s 72% of your context window gone — on tool schemas the agent never even touched.

There’s a simpler architecture.

Bright Data CLI gives coding agents like Claude Code, Cursor, and Copilot direct access to real-time web data — straight from the terminal. No server setup. No schema bloat. No OAuth flow. Just one command:

brightdata scrape https://any-website.com → structured JSON

Scrape any URL with automatic CAPTCHA bypass. Search Google, Bing, and Yandex. Extract structured data from 40+ platforms — Amazon, LinkedIn, Instagram, TikTok, YouTube, Reddit, and more.

One install. Works with 46+ AI agents. 10–32x cheaper than MCP for the same tasks.

It’s open source. Go check it out — and drop a star while you’re there. ⭐

→ Star on GitHub

This is the AI Backend Engineer Series on Backend Weekly. In this series, I will guide you through understanding and learning AI backend engineering.

Let's get started with episode 2 of this series. Episode 1 here

What is MCP?

For most of software engineering history, APIs had one type of consumer: code written by humans, for humans to trigger. A mobile app, a browser, a cron job — all driven by human intent at some level.

That assumption is broken.

AI agents are now consumers of your backend. They discover capabilities, make decisions, call your services, and act on the results — autonomously, at scale, and without a human writing the integration code on each side.

The Model Context Protocol (MCP) is the open standard that makes this work reliably. Anthropic released it in November 2024. By March 2025, OpenAI had adopted it. Google DeepMind followed. In December 2025, the protocol moved to the Linux Foundation, under the Agentic AI Foundation co-founded by Anthropic, OpenAI, and Block. When three competing AI labs agree on a standard, that standard tends to stick.

Every backend engineer building AI-integrated systems needs to understand MCP.

Let’s break it down.

Understand the Problem MCP Solves

Before MCP, connecting an AI model to an external system meant custom integration code for every single pair of (model, tool).

If you want Claude to query your PostgreSQL database, you must write an integration.

If you want ChatGPT to do the same, you write a different integration. Add three more AI models, and you add three more integrations per tool.

This is what Anthropic called the N×M integration problem.

With N AI models and M external tools, you end up writing N×M custom connectors — each with its own authentication handling, schema definitions, error formats, and maintenance burden.

MCP solves it by collapsing N×M to N+M. You write one MCP server per tool, and any MCP-compatible AI host can use it without modification.
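The difference is easy to see with numbers. A quick sketch of the arithmetic, using made-up counts of 5 models and 20 tools:

```typescript
// Without a shared protocol: one custom connector per (model, tool) pair.
function pairwiseConnectors(models: number, tools: number): number {
  return models * tools;
}

// With MCP: one client implementation per host plus one server per tool.
function mcpImplementations(models: number, tools: number): number {
  return models + tools;
}

const withoutMcp = pairwiseConnectors(5, 20); // 100 connectors
const withMcp = mcpImplementations(5, 20); // 25 implementations
```

Adding a sixth model costs 20 new connectors in the first world and one new client in the second.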

The analogy that sticks: MCP is to AI agents what the USB standard is to hardware peripherals. Before USB, every device needed a proprietary port. After USB, any device works with any host. MCP does the same for AI models and backend services.

It's also worth understanding what MCP is not.

MCP is not an agent framework. It does not decide when to call a tool or why. It does not replace orchestration layers like LangChain or LangGraph. MCP is a standardized integration layer — the protocol that sits between an AI model and the external world, defining how capabilities are discovered and invoked.

The Architecture: Hosts, Clients, and Servers

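MCP defines three roles: the host, which contains the LLM and manages the session; the client, which handles the stateful protocol connection; and the server, which exposes capabilities. Understanding them clearly is the key to the entire protocol. Discuss each one with your interviewer.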

The key insight is the 1:1 relationship between client and server. A single host can contain many clients, each connected to a different MCP server. Your agent can simultaneously be connected to a GitHub server, a PostgreSQL server, your internal API server, and a Slack server — and the host coordinates all of them.

The Three Primitives: What Your Server Can Expose

An MCP server exposes its capabilities through exactly three primitives: tools for actions, resources for read-only data, and prompts for reusable templates. Everything in the protocol is built around these three concepts. Get them right, and the rest falls into place.

Most backend engineers building MCP servers will spend 90% of their time on Tools. That's where the action is. Resources are useful for giving the model structured context. Prompts are useful when you want to guide how the model interacts with your domain.

Building Your First MCP Server

Let’s build a real one. We’ll build an MCP server that exposes a user service — something every backend engineer has shipped before. The server will expose the ability to look up a user and create a new one.

First, install the official TypeScript SDK and Zod for schema validation:

npm install @modelcontextprotocol/sdk zod
npm install -D typescript @types/node tsx

Setting Up the Server

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// McpServer is the high-level API — handles capability negotiation,
// request routing, and input validation automatically.
const server = new McpServer({
  name: "user-service",
  version: "1.0.0",
});

Registering a Tool

Here’s the most important primitive. A tool has a name, a description the LLM reads to decide when to use it, a typed input schema using Zod, and a handler that does the actual work:

server.registerTool(
  "get_user",
  {
    description: "Look up a user by their ID. Returns name, email, and account status.",
    inputSchema: {
      userId: z.string().describe("The unique identifier of the user"),
    },
  },
  async ({ userId }) => {
    try {
      // Your real database call goes here
      const user = await db.users.findUnique({ where: { id: userId } });

      if (!user) {
        return {
          content: [{ type: "text", text: `No user found with ID: ${userId}` }],
          isError: true,
        };
      }

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            id: user.id,
            name: user.name,
            email: user.email,
            status: user.status,
          }, null, 2),
        }],
      };
    } catch (err) {
      return {
        content: [{ type: "text", text: "Failed to retrieve user. Please try again." }],
        isError: true,
      };
    }
  }
);

Notice two things. First, the description matters enormously — the LLM reads it to decide whether to call this tool at all. Write it like you're documenting an API for a very smart but very literal colleague. Second, always handle errors explicitly and return isError: true rather than throwing. MCP tools should degrade gracefully.

Registering a Tool with Annotations

For tools that have side effects, use annotations to signal behavior to the host. This is how you tell the host — and the LLM — what kind of action this tool performs:

server.registerTool(
  "create_user",
  {
    description: "Create a new user account. Sends a verification email on success.",
    inputSchema: {
      name:  z.string().describe("Full name of the user"),
      email: z.string().email().describe("Email address — must be unique"),
      role:  z.enum(["admin", "member", "viewer"]).default("member"),
    },
    annotations: {
      readOnlyHint:  false,  // this tool DOES modify state
      destructiveHint: false, // it creates, not deletes
      idempotentHint:  false, // calling twice creates two users
      openWorldHint:   true,  // reaches outside — sends an email
    },
  },
  async ({ name, email, role }) => {
    const existing = await db.users.findUnique({ where: { email } });
    if (existing) {
      return {
        content: [{ type: "text", text: `User with email ${email} already exists.` }],
        isError: true,
      };
    }

    const user = await db.users.create({ data: { name, email, role } });
    await emailQueue.push({ type: "VERIFICATION", userId: user.id });

    return {
      content: [{ type: "text", text: `User created: ${user.id}` }],
    };
  }
);

Registering a Resource

Resources expose structured data that the model can read as context. They use URI patterns and are identified like file system paths:

import { ResourceTemplate } from "@modelcontextprotocol/sdk/server/mcp.js";

server.registerResource(
  "user-profile",
  new ResourceTemplate("user://{userId}/profile", {
    list: async () => ({
      resources: (await db.users.findMany({ take: 50 })).map(u => ({
        uri:  `user://${u.id}/profile`,
        name: u.name,
      })),
    }),
  }),
  {
    title:       "User Profile",
    description: "Full profile data for a user, including preferences and permissions.",
    mimeType:    "application/json",
  },
  async (uri, { userId }) => ({
    contents: [{
      uri:  uri.href,
      text: JSON.stringify(await db.users.findUnique({ where: { id: userId } })),
    }],
  })
);

Connecting the Transport and Starting the Server

const transport = new StdioServerTransport();
await server.connect(transport);

console.error("User service MCP server running on stdio");
// Note: use console.error for logs — stdout is reserved for MCP protocol messages

How an MCP Request Actually Flows

Understanding the request lifecycle is what separates engineers who can use MCP from engineers who can debug it in production. Walk your interviewer through this:

All of this happens over JSON-RPC 2.0 — a simple, well-understood remote procedure call standard using JSON. If you’ve worked with any RPC system before, the wire format will feel familiar.
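To make the wire format concrete, here is a sketch of the JSON-RPC 2.0 messages a client would send for a typical session against the user-service server above: initialize, list tools, then call one. The client name, user ID, and protocol version string are illustrative; check the spec revision your SDK targets.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

let nextId = 1;

// Builds a JSON-RPC 2.0 request with an auto-incrementing id
function rpcRequest(method: string, params?: Record<string, unknown>): JsonRpcRequest {
  const id = nextId++;
  return params === undefined
    ? { jsonrpc: "2.0", id, method }
    : { jsonrpc: "2.0", id, method, params };
}

// 1. Handshake: the client announces its protocol version and capabilities
const initializeMsg = rpcRequest("initialize", {
  protocolVersion: "2025-06-18",
  capabilities: {},
  clientInfo: { name: "example-host", version: "0.1.0" },
});
// (The client then sends a "notifications/initialized" notification, which has no id.)

// 2. Discovery: ask the server which tools it exposes
const listToolsMsg = rpcRequest("tools/list");

// 3. Invocation: call a tool by name with arguments matching its input schema
const callToolMsg = rpcRequest("tools/call", {
  name: "get_user",
  arguments: { userId: "u_123" },
});

// 4. The server replies with a result carrying the same id as the request
const callToolReply = {
  jsonrpc: "2.0" as const,
  id: callToolMsg.id,
  result: {
    content: [{ type: "text", text: "No user found with ID: u_123" }],
    isError: true,
  },
};
```

Matching `id` values between request and response is what lets a client keep several calls in flight over one connection.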

Transport Mechanisms: stdio vs Streamable HTTP

MCP supports two transport mechanisms. Which one to use depends entirely on where your server will run. Discuss this with your interviewer.

stdio (Standard Input/Output)

The host spawns your MCP server as a subprocess. All communication happens over stdin and stdout. Microsecond latency. No network overhead. No authentication complexity. But it only works locally — the host must be on the same machine as the server.

This is what Claude Desktop uses by default. It’s perfect for developer tools, local agents, and anything that runs on a single machine.

Streamable HTTP

The production-ready transport for remote servers. A single HTTP endpoint handles both directions — POST for client-to-server, with optional SSE streaming for server-to-client pushes. Works with any HTTP infrastructure, supports horizontal scaling, and deploys to Kubernetes or serverless without modification.

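import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import express from "express";

const app = express();
app.use(express.json());

app.post("/mcp", async (req, res) => {
  // Stateless mode: no session IDs, so any replica can serve any request
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined,
  });

  // Fresh server instance per request: stateless, horizontally scalable
  const requestServer = new McpServer({ name: "user-service", version: "1.0.0" });
  registerAllTools(requestServer); // your tool/resource registrations

  // Release the transport and server once the HTTP response is finished
  res.on("close", () => {
    transport.close();
    requestServer.close();
  });

  await requestServer.connect(transport);
  // express.json() has already consumed the body, so pass the parsed body explicitly
  await transport.handleRequest(req, res, req.body);
});

app.listen(3000, () => console.log("MCP server listening on :3000/mcp"));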

Rule of thumb: Use stdio for local developer tools and CLI agents. Use Streamable HTTP for anything that needs to run remotely, scale horizontally, or be consumed by multiple hosts simultaneously.

Security Considerations

MCP is powerful precisely because it gives AI models the ability to execute real actions against real systems. That power demands real security thinking. Share this with your interviewer — it’s what distinguishes candidates who’ve thought deeply about this from candidates who haven’t.

Authentication for remote servers: The MCP spec classifies remote servers as OAuth Resource Servers, and its authorization flow builds on OAuth 2.1. Use it with proper scopes. Never expose an unauthenticated remote MCP server to the internet; security scans of public MCP servers in 2025 found thousands with no authentication at all.
Principle of least privilege: Your MCP server’s database credentials should only have access to what that server’s tools actually need. A read-only tool should connect with a read-only credential. Don’t give your entire database access to an MCP server just because it’s convenient.
Input validation: Always use Zod schemas on every input. MCP tool inputs come from an LLM, and LLMs can produce unexpected values. Treat every tool call like an untrusted API request.
Prompt injection awareness: A compromised MCP server could attempt to manipulate the LLM’s behavior through its tool output. Sanitize and structure your tool responses. Never echo user input directly back in tool results.
Audit logging: Every tool call — who called it, with what arguments, what was returned — should be logged. MCP servers are a privileged execution path. You need to be able to reconstruct exactly what an agent did when something goes wrong.
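Audit logging is easiest to enforce when it wraps every handler rather than living inside each one. The sketch below shows one possible wrapper; `withAudit`, the `AuditEntry` shape, and the `sink` callback are hypothetical and not part of the MCP SDK.

```typescript
type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};
type ToolHandler<A> = (args: A) => Promise<ToolResult>;

interface AuditEntry {
  tool: string;
  caller: string;
  args: unknown;
  resultPreview: string;
  isError: boolean;
  durationMs: number;
  at: string;
}

// Wraps a tool handler so every call is recorded: who, what, and the outcome.
function withAudit<A>(
  tool: string,
  caller: string,
  handler: ToolHandler<A>,
  sink: (entry: AuditEntry) => void
): ToolHandler<A> {
  return async (args: A) => {
    const started = Date.now();
    let result: ToolResult;
    try {
      result = await handler(args);
    } catch {
      // Degrade gracefully instead of throwing, and still log the failure
      result = {
        content: [{ type: "text", text: "Tool execution failed." }],
        isError: true,
      };
    }
    sink({
      tool,
      caller,
      args,
      resultPreview: result.content.map((c) => c.text).join(" ").slice(0, 200),
      isError: result.isError === true,
      durationMs: Date.now() - started,
      at: new Date(started).toISOString(),
    });
    return result;
  };
}
```

In a real server the sink would write to an append-only log, and sensitive arguments would be redacted before they are recorded.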

When to Use MCP vs Other Approaches

Not every backend service needs an MCP server. Be honest with your interviewer about when MCP is the right choice and when it isn’t.

Use MCP when:

You want your backend service to be discoverable by multiple AI hosts without writing custom integrations for each
You are building an AI agent that needs to interact with internal tools — databases, APIs, file systems — in a structured, auditable way
You want to expose capabilities to AI models that were originally designed for human-driven API clients
You are building developer tooling that AI-powered IDEs like Cursor, Windsurf, or VS Code should be able to use

Don’t use MCP when:

Your only consumer is a human-driven frontend — REST or GraphQL is simpler and more appropriate
You need real-time bidirectional streaming — MCP is request-response, not WebSocket
You’re doing a quick one-off AI integration, and the overhead of building a full MCP server isn’t justified — direct function calling against your AI SDK may be faster

Final Answer

“MCP is an open standard that solves the N×M AI integration problem by defining a single protocol any AI host can use to discover and invoke capabilities on any MCP server. The architecture has three roles: the host, which contains the LLM and manages the session; the client, which handles the stateful protocol connection; and the server, which exposes capabilities through three primitives — tools for actions, resources for read-only data, and prompts for reusable templates. To expose a backend service, I’d build an MCP server using the official TypeScript SDK, register tools with Zod-validated schemas and clear descriptions, and deploy over Streamable HTTP for remote access. I’d secure it with OAuth, apply least-privilege database credentials, validate all inputs, and audit-log every tool call. The result is a backend service that any MCP-compatible AI agent can discover and use without custom integration code.”

Understanding MCP sounds like keeping up with the latest AI hype. But as you dig in, you realize it’s really a systems engineering problem that every backend engineer should recognize:

Capability discovery — how does a consumer learn what a service can do?
Schema-first contracts — typed inputs and outputs, validated at the boundary
Stateful session management over a transport layer
Least-privilege security in a programmatic execution context
Audit logging for an automated, non-human caller

These are not new concepts. They are the same engineering principles you apply to every production API you’ve ever built. MCP just applies them to a new class of consumer: the AI agent.

The engineers who will be most valuable in the next few years aren’t the ones who know the most AI frameworks. They are the ones who know how to build reliable, secure, observable backend services — and understand how to expose those services to AI systems correctly.

So the next time someone asks you, “What is MCP?” don’t just say, “It’s how AI models talk to tools.”

Explain the N×M problem it solves. Walk them through the three primitives. Explain why the description of a tool is as important as the implementation. Talk about what happens when an agent calls your server with an unexpected input value.

That’s the answer that shows you understand what’s actually being built — not just the name of the protocol.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Stay awesome,
Solomon (solomoneseme.com)

AI Workshop: How to Land AI Engineering Jobs

Solomon Eseme — Sat, 21 Mar 2026 13:00:21 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today’s issue is brought to you by Masteringbackend → An all-in-one platform that helps backend engineers become highly-paid backend and AI engineers by leveraging a practical-based learning approach.


Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

If you want your LLM or AI Agent to seamlessly search, navigate, and extract real-time data from any website without blockers or CAPTCHAs, try BrightData MCP.

I’m hosting a free workshop TODAY.

“How to Land AI Engineering Jobs”

A free workshop for backend engineers who want to break into AI roles.

Guest: Victor Eduoh, Founder of Laand

📅 Date: Today, Saturday, March 21st
⏰ Time: 4:00 PM UTC
⏱️ Duration: 60-90 minutes
💰 Price: Free

JOIN THE WORKSHOP →

Why this workshop?

The AI job market is confusing right now.

Everyone says “learn AI”.

But which skills actually matter? How do you position yourself when everyone claims AI experience? What do hiring managers actually look for?

I don’t have all the answers. So I brought in someone who does.

Victor runs Laand.me, a platform that helps people land jobs. He sees hundreds of applications. He knows what works and what doesn’t from the hiring side.

What we’re covering:

Which AI skills are actually in demand (not what’s hyped)
How to position your backend experience for AI roles
What makes applications stand out vs. get rejected
Common mistakes that kill your chances
Live Q&A

Who should attend:

Backend engineers looking at AI roles
Anyone confused about the AI job market
Engineers updating their resumes for 2026
People who want to know what hiring managers see

Can’t make it live?

JOIN THE WORKSHOP →

See you at 4 PM UTC.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

1. The MB Platform: Join 4000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

4. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying. Lastly, you can hire backend engineers anywhere in the world.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

How would you design a globally distributed configuration propagation service?

Solomon Eseme — Sat, 21 Mar 2026 10:06:11 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today's issue is brought to you by Masteringbackend → An all-in-one platform that helps backend engineers become highly-paid backend and AI engineers by leveraging a practical-based learning approach.

Here's another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

If you want your LLM or AI Agent to seamlessly search, navigate, and extract real-time data from any website without blockers or CAPTCHAs, try BrightData MCP.

This is the MB Interview Series on Backend Weekly, which airs every Saturday.

In this series, I will guide you through answering common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let's get started with episode 6 (Episode 5 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“Design a globally distributed configuration propagation service that pushes config updates to tens of thousands of servers within seconds, with versioning, rollback, and strong delivery guarantees.”

Here’s how to approach it:

We are building the next Interview Prep Playground targeting backend engineers.
Join our MB Interview waitlist: https://tally.so/r/w46glb

Every production system has configuration: feature flags, database connection strings, rate limit thresholds, encryption keys, and service discovery endpoints. Managing these at scale is one of the most underestimated problems in backend engineering.

Pushing a config update to ten servers is trivial. Pushing it to ten thousand servers spread across multiple regions, within two seconds, while guaranteeing every server gets the exact right version, with the ability to roll back in an instant?

That’s a whole different class of problem.

Strong backend engineers do not jump into the design without first understanding the exact requirements.

You should clarify the requirements with your interviewer:

Clarify the Requirements

Before drawing a single diagram, ask your interviewer what “correct” looks like for this system. For a config propagation service, the core requirements are:

Sub-second to low-second propagation: Config changes must reach all connected agents worldwide within seconds, not minutes.
Immutable versioning: Every config change produces a new, immutable version. Old versions are never mutated.
Atomic regional rollout: Config versions can be scoped to a region and deployed atomically; either all agents in a region get version N, or none do.
Instantaneous rollback: Rolling back must be as fast as rolling forward. It’s not a revert. It’s a pointer swap.
Integrity guarantees: Agents must verify the checksum and cryptographic signature of every config they receive. Tampered configs are rejected.
Durability and auditability: Every version, every rollout, every agent acknowledgement must be logged and immutable.
Conflict-free: Two operators cannot create conflicting versions simultaneously. The system must enforce strict ordering.

Discuss each of these with your interviewer. They may tell you that eventually-consistent propagation is acceptable, or that rollback can be asynchronous. These trade-offs will shape the entire architecture.

Next, let’s explore the architecture after clarifying with your Interviewer.

High-Level Architecture

A production-grade config propagation service is built from four major layers. Each layer has a clear boundary and a single responsibility, and mixing them is how systems fail.

Let’s break each layer down:

Core Components

Control Plane API

The control plane is the gatekeeper of the entire system. It is intentionally thin, stateless, and strict. An admin submits a config change here, and the control plane validates it, assigns an immutable version number, signs it, and writes it to the version store.

Nothing else. It does not notify agents directly.

Here’s a simplified version of the creation endpoint:

import { Request, Response } from "express";
import { createHash, sign } from "crypto";
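import { versionStore } from "./store";

// PRIVATE_KEY and generateVersionId() are assumed to be defined elsewhere,
// e.g. a key loaded from a secrets manager and a monotonic ID allocator.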

export async function publishConfig(req: Request, res: Response) {
  const { configPayload, region, submittedBy } = req.body;

  // 1. Validate the payload schema
  if (!configPayload || !region) {
    return res.status(400).json({ error: "Invalid config request" });
  }

  // 2. Compute a deterministic checksum
  const checksum = createHash("sha256")
    .update(JSON.stringify(configPayload))
    .digest("hex");

  // 3. Sign with the control plane private key
  const signature = sign("sha256", Buffer.from(checksum), PRIVATE_KEY);

  // 4. Write an immutable version record
  const version = await versionStore.create({
    versionId: generateVersionId(),
    configPayload,
    checksum,
    signature: signature.toString("base64"),
    region,
    status: "PENDING",
    createdAt: Date.now(),
    createdBy: submittedBy,
  });

  return res.status(201).json({ versionId: version.versionId });
}
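
One caveat on step 2: JSON.stringify keeps whatever key order the caller sent, so two payloads with the same content but a different key order hash differently. Here is a minimal sketch of one way around that, assuming the payload is plain JSON. The names canonicalize and checksumOf are hypothetical helpers, not part of the endpoint above:

```typescript
import { createHash } from "node:crypto";

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level, so key order
// in the submitted payload cannot change the resulting hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
  return `{${entries.join(",")}}`;
}

// SHA-256 over the canonical form (hypothetical helper).
function checksumOf(payload: Json): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}
```

If you go this route, the agent has to hash the same canonical form when it verifies, otherwise every checksum comparison will fail.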

Version Store

The version store is the single source of truth for the entire system. Every config version ever published lives here, forever, immutably. You never update a version record. You only append new ones or change the activeVersionPointer for a region.

Use a strongly consistent store here. The right options to discuss with your interviewer:

etcd: Excellent for low-latency, strongly consistent KV with watch semantics. Native to Kubernetes environments.
Google Spanner: Best for globally distributed, externally consistent SQL. Ideal when you need a cross-region quorum without managing Raft yourself.
ZooKeeper: Battle-tested in large-scale systems (Kafka, HBase). Strong consistency with leader election built in.

The config blobs themselves (the actual payloads) are stored in immutable object storage (S3, GCS). The version store holds only metadata and a pointer to the blob. This keeps the strongly-consistent store small and fast.
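
To make the append-only rule and the pointer semantics concrete, here is a rough in-memory sketch. It is not how etcd, Spanner, or ZooKeeper expose this. The class, the blobKey field, and the compareAndSetActive method are all hypothetical names, and a real store would run the compare-and-set as a transaction:

```typescript
interface VersionRecord {
  readonly versionId: string;
  readonly region: string;
  readonly blobKey: string; // pointer to the payload in object storage
  readonly checksum: string;
}

class InMemoryVersionStore {
  private readonly versions = new Map<string, VersionRecord>();
  private readonly activePointers = new Map<string, string>();

  // Append-only: an existing versionId can never be overwritten.
  append(record: VersionRecord): void {
    if (this.versions.has(record.versionId)) {
      throw new Error(`Version ${record.versionId} already exists`);
    }
    this.versions.set(record.versionId, Object.freeze({ ...record }));
  }

  getActive(region: string): string | undefined {
    return this.activePointers.get(region);
  }

  // Compare-and-set: the write succeeds only if the caller saw the
  // current pointer, so two concurrent operators cannot both win.
  compareAndSetActive(region: string, expected: string | undefined, next: string): boolean {
    const target = this.versions.get(next);
    if (!target || target.region !== region) {
      throw new Error("Unknown version for this region");
    }
    if (this.activePointers.get(region) !== expected) return false;
    this.activePointers.set(region, next);
    return true;
  }
}
```

Moving the pointer from "v2" back to "v1" uses the same call as moving it forward, and an operator working from a stale view of the pointer simply gets false back and has to re-read.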

Regional Coordinators

Regional coordinators are the intelligence layer of the propagation path. Each coordinator owns a geographic region and is responsible for:

Subscribing to version store change events (etcd watch, Spanner change streams)
Tracking which version each agent in its region has acknowledged
Publishing rollout metadata to the push clusters
Quarantining unhealthy agents that repeatedly fail to ack
Reporting rollout health back to the control plane

async function onNewVersion(version: VersionRecord) {
  const agents = await agentRegistry.getActiveAgents(version.region);

  for (const agent of agents) {
    // Write rollout task for each agent
    await rolloutStore.upsert({
      agentId: agent.id,
      targetVersionId: version.versionId,
      status: "PENDING",
      attempts: 0,
    });
  }

  // Signal push clusters to notify agents in this region
  await pushBus.publish(version.region, {
    type: "VERSION_AVAILABLE",
    versionId: version.versionId,
  });
}

Push Clusters (Fan-out Layer)

Push clusters are stateless frontend nodes that maintain persistent WebSocket (or long-poll) connections with agents. They do not store any state — they are pure notification relays.

When a coordinator publishes a VERSION_AVAILABLE event, every push node subscribed to that region’s channel broadcasts it to all connected agents. The push node’s job is only to deliver the signal. The actual config fetch happens agent-side.

This separation (signal vs. payload) is what makes the fan-out scalable. You don’t push megabytes of config data through ten thousand WebSocket connections. You push a tiny notification that says, “version X is ready, go fetch it.”

Edge Agents

Agents run on every server in your fleet. They are the final consumer of the config system, and they enforce the strongest guarantees. Each agent:

Maintains a persistent WebSocket connection to the nearest push cluster in its region
Receives the VERSION_AVAILABLE notification
Fetches the config blob from object storage
Verifies the checksum AND the cryptographic signature against the control plane’s public key
Applies the config atomically (swap in-memory reference, persist to local disk)
Sends an acknowledgement back to the coordinator
Retries the entire fetch-verify-apply cycle on any failure

async function onVersionAvailable(versionId: string) {
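  // Compare against the currently active version, not "ever stored":
  // a rollback targets a version this agent has already seen.
  const currentVersionId = await localStore.getCurrentVersionId();
  if (currentVersionId === versionId) return; // idempotency guard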

  let attempts = 0;

  while (attempts < MAX_RETRIES) {
    try {
      // 1. Fetch the config blob
      const blob = await objectStorage.get(versionId);

      // 2. Verify checksum
      const computedHash = sha256(blob.payload);
      if (computedHash !== blob.checksum) throw new Error("Checksum mismatch");

      // 3. Verify signature
      const valid = verify(blob.checksum, blob.signature, CONTROL_PLANE_PUBLIC_KEY);
      if (!valid) throw new Error("Signature verification failed");

      // 4. Atomic apply + persist for restart resilience
      currentConfig = blob.payload;
      await localStore.save(versionId, blob);

      // 5. Acknowledge to coordinator
      await coordinator.ack({ agentId: AGENT_ID, versionId, status: "APPLIED" });
      return;

    } catch (err) {
      attempts++;
      const delay = 2 ** attempts * 500; // exponential backoff
      logger.warn(`Retry ${attempts} for version ${versionId} in ${delay}ms`);
      await sleep(delay);
    }
  }

  // Report persistent failure to coordinator
  await coordinator.ack({ agentId: AGENT_ID, versionId, status: "FAILED" });
}

The Primary Config Propagation Flow

Every config change passes through this exact lifecycle. Walk your interviewer through each step. This is the story of how a config update travels from an operator’s keyboard to ten thousand servers:

Notice something critical: the config payload never flows through the push cluster. Steps 4 and 5 are deliberately separated.

The push cluster sends only a tiny signal. The agent fetches the actual blob independently from object storage. This is what keeps the fan-out layer fast and unbounded in its connection count.

Versioning and Rollback

Versioning is the backbone of the entire system.

Discuss this carefully with your interviewer. How you model versions determines how rollback, audits, and conflict resolution all work.

The core principle: a version is never mutated.

If a config change is wrong, you do not patch the version. You publish a new version that supersedes it, or you update the activeVersionPointer for a region to point back to a prior version.

That last part is rollback. It is not an undo operation. It is a pointer swap:

async function rollback(region: string, targetVersionId: string) {
  // Validate that target version exists and was previously COMMITTED
  const target = await versionStore.getVersion(targetVersionId);
  if (!target || target.region !== region || target.status !== "COMMITTED") {
    throw new Error("Invalid rollback target");
  }

  // The rollback IS a new publish event — same propagation path
  await versionStore.setActivePointer(region, targetVersionId);

  // This triggers coordinators and push clusters automatically
  // — the rollback is as fast as a forward deployment
  logger.info(`Rollback initiated: region=${region} target=${targetVersionId}`);
}

Because rollback reuses the same propagation path, it has the same latency guarantees as a forward deployment.

There is no special rollback codepath. This is intentional because special codepaths under incident pressure are how rollbacks fail.

Reliability and Delivery Guarantees

This is where you impress your interviewer. Most candidates design the happy path well. What distinguishes senior engineers is how they design for failure.

Discuss these guarantees explicitly:

At-Least-Once Notification, Exactly-Once Application

The push cluster may deliver VERSION_AVAILABLE more than once — that’s acceptable. The agent protects against duplicate applications with an idempotency check at the start of onVersionAvailable(). If the agent already has version N applied locally, it discards the notification.

Commit = Checksum + Signature Verification

A version is not “committed” until the agent reports a successful signature verification. The coordinator tracks this per agent. An agent that reports FAILED after MAX_RETRIES is quarantined — flagged for human review — and the version is not considered committed for that agent’s server.
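
For reference, this is roughly what the sign-then-verify round trip looks like with Node’s built-in crypto module. Treat it as a sketch under assumptions: I’m using an Ed25519 key pair generated on the spot for the demo, and signPayload and isTrusted are hypothetical helper names. The publish endpoint earlier signs with "sha256" and its own PRIVATE_KEY, so your key type may differ:

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

// Demo-only key pair. In the design above, the private key stays in the
// control plane and agents only ever hold the public key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function sha256Hex(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Control plane side: hash the serialized payload, then sign the hash.
function signPayload(serialized: string): { checksum: string; signature: string } {
  const checksum = sha256Hex(serialized);
  // Ed25519 takes no separate digest algorithm, so the first argument is null.
  const signature = sign(null, Buffer.from(checksum), privateKey).toString("base64");
  return { checksum, signature };
}

// Agent side: both checks must pass before the config is applied.
function isTrusted(serialized: string, checksum: string, signature: string): boolean {
  if (sha256Hex(serialized) !== checksum) return false;
  return verify(null, Buffer.from(checksum), publicKey, Buffer.from(signature, "base64"));
}
```

The checksum alone only catches corruption. An attacker who can rewrite the blob can also rewrite the checksum next to it, which is why the signature check is the one that matters for tampering.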

Agents Offline for Extended Periods

An agent that reconnects after a long outage will not receive the WebSocket notification it missed. This is fine. On reconnect, the agent compares its locally persisted version ID against the coordinator’s current active version for its region. If they differ, it fetches and applies the active version. This is the pull-on-reconnect pattern:

async function onAgentReconnect(agentId: string) {
  const agentVersion = await localStore.getCurrentVersionId();
  const activeVersion = await coordinator.getActiveVersion(REGION);

  if (agentVersion !== activeVersion) {
    logger.info(`Version drift detected. Reconciling: ${agentVersion} → ${activeVersion}`);
    await onVersionAvailable(activeVersion);
  }
}

Tell your interviewer this: The system must never require a running push connection to converge. Push-on-change is an optimization for speed. Pull-on-reconnect is the correctness guarantee. Always design both paths.

Scaling Strategy

Scaling a config propagation service is not about scaling one big thing — it’s about independently scaling the right layers. Share this with your interviewer:

Control Plane API: Stateless. Scale horizontally with standard load balancing. Write throughput is low (config changes are rare compared to reads), so this is rarely the bottleneck.
Version Store: Globally replicated via multi-region quorum. Use etcd clusters per region with cross-region replication, or Google Spanner for a managed globally-consistent store.
Regional Coordinators: Sharded by region. Each coordinator owns its region exclusively. No cross-coordinator coordination is needed. Add coordinators as regions expand.
Push Clusters: Scaled by connection fan-out. Each push node handles tens of thousands of WebSocket connections. Add push nodes to increase connection capacity. They are purely stateless — any push node can serve any agent in the region.
Agents: Connect to the nearest push cluster using geo-DNS or latency-based routing. Each agent stores its applied version locally, so restarts do not require a full re-fetch.

Backpressure via Staged Rollouts

At tens of thousands of agents, a simultaneous rollout causes a thundering herd against object storage. Use staged rollouts:

const ROLLOUT_STAGES = [
  { percentage: 1,   waitMs: 30_000 },  // 1% canary — watch for 30s
  { percentage: 10,  waitMs: 60_000 },  // 10% early — watch for 60s
  { percentage: 50,  waitMs: 120_000 }, // 50% broad — watch for 2m
  { percentage: 100, waitMs: 0 },       // full fleet
];

async function stagedRollout(versionId: string, region: string) {
  const agents = await agentRegistry.getActiveAgents(region);

  for (const stage of ROLLOUT_STAGES) {
    const cohort = sample(agents, stage.percentage / 100);
    await pushBus.publishToCohort(cohort, { type: "VERSION_AVAILABLE", versionId });
    await waitForAcks(cohort, versionId); // block until stage is healthy
    if (stage.waitMs > 0) await sleep(stage.waitMs);
  }
}

Staged rollouts give you progressive blast radius control. If the 1% canary cohort starts failing signature verification, you halt the rollout before it reaches the full fleet.
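
One detail worth raising: if each stage draws a fresh random sample, the 10% cohort is not guaranteed to contain the 1% canaries. A possible alternative, sketched here with hypothetical helper names, is to hash each agent ID into a stable bucket so that every stage is a superset of the previous one:

```typescript
// FNV-1a: a small, stable, non-cryptographic 32-bit string hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map each agent to a stable bucket in [0, 100). Salting with the versionId
// rotates which agents act as canaries from one rollout to the next.
function bucketOf(agentId: string, versionId: string): number {
  return fnv1a(`${versionId}:${agentId}`) % 100;
}

// An agent joins the cohort once the stage percentage passes its bucket,
// so the 1% cohort is always a subset of the 10% cohort, and so on.
function cohortFor(agentIds: string[], versionId: string, percentage: number): string[] {
  return agentIds.filter((id) => bucketOf(id, versionId) < percentage);
}
```

Because the assignment is deterministic, a coordinator that restarts mid-rollout recomputes exactly the same cohorts instead of picking new canaries.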

Observability and Monitoring

You can’t manage what you can’t see.

Discuss with your interviewer what to measure and what to alert on. Here’s what matters for this system specifically:

Metrics to Track

Propagation latency (p50, p95, p99): From version creation to last agent ack per region. This is your SLO metric. It should be under 5 seconds at p99.
Agent ack rate: What percentage of the fleet has acknowledged the current active version? Below 99% is a yellow alert. Below 95% is red.
Ack skew: How spread out are acknowledgement timestamps? High skew indicates a push cluster bottleneck or a connectivity issue in part of the fleet.
Signature verification failures: Any non-zero count here is an immediate incident. Could indicate a key rotation issue or a compromised config pipeline.
Rollout stall rate: How often do staged rollouts stall at a given percentage? A pattern here indicates fragile canary cohort selection.
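
If the interviewer asks how you would actually compute the latency percentiles, a nearest-rank calculation over the ack timestamps is enough for a whiteboard answer. This is a sketch with hypothetical function names. In production you would more likely rely on Prometheus histograms than compute this by hand:

```typescript
// Nearest-rank percentile over a batch of per-agent propagation latencies.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("No samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Latency per agent = ack timestamp minus version creation timestamp.
function propagationLatencies(createdAtMs: number, ackTimestampsMs: number[]): number[] {
  return ackTimestampsMs.map((ackedAtMs) => ackedAtMs - createdAtMs);
}
```

Feeding the result of propagationLatencies into percentile with p set to 50, 95, and 99 gives the three numbers for the dashboard, per region.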

Tools

Prometheus + Grafana: Scrape coordinator and push cluster metrics. Dashboard per region showing propagation latency and ack curves.
OpenTelemetry: Distributed traces from control plane → coordinator → push cluster → agent. This is the only way to pinpoint where latency is introduced in the propagation path.
Structured logging: Every version creation, rollout event, agent ack, and signature result must produce a structured log entry with versionId, agentId, region, and timestamp. These logs are your audit trail.

Alerts

Region ack rate below 95% for more than 60 seconds after a publish
Any signature verification failure
Coordinator heartbeat missing for more than 30 seconds
Version drift between regions exceeding N minutes

Edge Cases and Trade-offs

Great candidates discuss edge cases unprompted. Bring these up with your interviewer:

Coordinator overloaded during mass reconnect: If a regional outage causes thousands of agents to reconnect simultaneously, the coordinator sees a thundering herd of ack-check requests. Fix: stagger reconnection with randomized jitter, and rate-limit ack-check calls at the coordinator.
Split-brain version pointers: Two coordinators disagree on the active version for a region after a network partition. Fix: The version store uses a strong quorum for all pointer writes. No coordinator can update the active pointer without a majority consensus.
Cost trade-off — persistent connections vs periodic pull: Ten thousand persistent WebSocket connections cost real money in infrastructure. If ultra-low latency is not required, periodic pull (agents polling every 5 seconds) dramatically simplifies the system. Discuss this explicitly — it’s a valid design choice for some use cases.
Blast radius during rollout: A bad config that reaches 100% of the fleet before failure is detected can take down your entire service. Progressive staged rollouts — with automatic halt on ack failure — are non-negotiable for production systems.
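
The randomized jitter from the mass-reconnect case can be as small as this. It is a sketch of the common "full jitter" approach, with a hypothetical function name and default values I picked for illustration:

```typescript
// "Full jitter" backoff: wait a random time between 0 and an exponentially
// growing cap, so reconnecting agents spread out instead of arriving together.
function reconnectDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 30_000,
  random: () => number = Math.random,
): number {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * cap);
}
```

The random parameter exists only so the function can be tested deterministically. Agents would call reconnectDelayMs(attempt) and sleep for the result before reconnecting.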

Final Answer

“I’d design this system using a global control plane with immutable versioning, regional coordinators for scoped rollout, and fan-out push clusters for low-latency propagation. Config payloads are stored as immutable blobs in object storage and signed by the control plane.”

Designing a globally distributed config propagation service sounds like a glorified key-value store with a webhook. But as you dig in, you realize it’s a deep exercise in:

Immutable data modeling and version pointer semantics
Fan-out at scale without thundering herd
Cryptographic integrity in a distributed trust model
Exactly-once application with at-least-once delivery
Progressive deployment as a first-class system primitive
Reconciliation for agents that were offline during a rollout

These are the kinds of systems that separate engineers who can build features from engineers who can build infrastructure.

So the next time an interviewer asks, “How would you design a globally distributed configuration propagation service?” don’t just say, “I’ll use etcd and WebSockets.”

Walk them through the versioning model. Walk them through the signal vs. payload separation. Walk them through what rollback actually means at this scale. Walk them through what happens to an agent that was offline for six hours.

That’s the answer that shows you’ve built real systems — not just read about them.

I hope you learned something today. Spread the love: share this newsletter with at least two of your friends.

Also, let me know if you enjoy this series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

How to Debug AI Backend Systems

Solomon Eseme — Sat, 14 Mar 2026 10:36:29 GMT

Hello 👋

Before we dive in:

Your AI agents will be stupid, with hallucinated results.

Unless you give them the right data.

Nimble makes AI Agents smarter by giving them access to retrieve real-time web data in a structured, tabular format.

With Nimble, you can:

- Turn the web into data tables, not just markdown with long text.
- Live Web Access, Not Stale Indexes
- Access Any Website (Even JS Heavy Ones)

Imagine you’re searching for something like:

“Which stores have a PS5 in stock within 25 miles right now—include price, pickup time, and store address from each retailer’s site?”

This is a time-oriented question, and your AI Agents won’t give the right result unless you use Nimble:

Here’s the documentation to start with: https://docs.nimbleway.com

I want to tell you about the worst three days of my engineering career.

It started with a Slack message from our support team:

“Users are complaining that the AI is giving wrong answers.”

That was it.

There were no error codes, stack traces, or even reproducible steps. Just... the AI was wrong.

I opened our logs and checked: the request was received at 10:47 AM, the response was sent at 10:49 AM, and the status was 200.

Everything looked perfect. The system had done exactly what it was supposed to do, and its job was very simple: accept a question, process it, and return an answer.

But the answer was completely wrong, and I had absolutely no idea why.

I spent the next three days in debugging hell. I tried reproducing the issue locally, but of course, the AI gave different answers each time. I added more logging, but I was logging the wrong things. I stared at the database queries that returned the right data. I reviewed prompts that looked fine. I checked the API responses that seemed correct. Every individual piece worked, but the whole system didn’t.

On day three, I finally found it. The problem was in our chunking strategy.

We were splitting documents in the middle of sentences, so when users asked certain questions, the retrieved context was grammatically incomplete and semantically meaningless. The AI was doing its best with garbage input, producing garbage output.

That was how I spent three days on a chunking bug. Not because the bug was hard to fix, but because I couldn’t see what was happening inside the system.

I was debugging blind.

That experience fundamentally changed how I think about AI observability. Traditional debugging doesn’t work for AI systems. The tools and practices we’ve developed over decades of software engineering (stack traces, error logs, breakpoints) are not enough on their own. AI systems fail differently, and they need to be observed differently.

In this article, we will dig into observability for AI systems. We will explore the fundamental problem, why your current logging strategy will fail you, how to build a proper observability stack for AI systems, and finally, the debugging workflow that actually works.

The Fundamental Problem

Here’s what makes AI debugging so uniquely frustrating:

Traditional applications have the decency to crash when something goes wrong. They throw exceptions. They return error codes. They give you a stack trace pointing to the exact line where things went sideways.

You might not immediately know how to fix the problem, but at least you know where it is.

AI systems are different. They don’t do this. They keep running. They keep returning responses. They keep returning status 200. But the responses are incorrect, and nothing in your standard monitoring indicates this is happening.

Think about what happens in a RAG pipeline when something goes wrong. Maybe your embedding model is poorly suited for your domain, so queries about technical concepts get matched with vaguely related but ultimately unhelpful documents.

The system will never throw an error, and your vector database will keep returning the wrong results each time.

The LLM receives this irrelevant context and, as LLMs do, generates a plausible-sounding response based on what it was given. The response is confident. The response is articulate. The response is wrong.

From the outside, everything looks fine. Your request latency is normal. Your error rate is zero. Your uptime is 100%. Meanwhile, users are getting incorrect information and losing trust in your product, and you have no idea it’s happening until someone complains.

This is the core challenge of AI observability: you’re not looking for crashes, you’re looking for degradation. You’re not hunting for errors, you’re hunting for quality problems. And quality problems are sneaky. They don’t announce themselves. They hide in the space between “working” and “working well.”

Why Your Current Logging Strategy Is Failing You

When I first started building AI systems, I logged what I always logged: inputs and outputs. Request came in with this question, response went out with this answer. Basic request/response logging, same as any API.

This is almost useless for AI debugging.

The problem is that AI systems aren’t simple request/response pipelines. They’re multi-step workflows where each step transforms data in ways that affect downstream steps.

A RAG query might involve embedding the question, searching a vector database, reranking the results, constructing a prompt, calling the LLM, and post-processing the response.

That’s six distinct operations, each with its own potential failure modes, and if you’re only logging the first input and final output, you’re blind to everything in between.

When something goes wrong, you need to know:

Was the embedding generated correctly?
What chunks did the vector search return, and what were their similarity scores?
Did reranking change the order?
What context ended up in the prompt?
How many tokens were used?
What did the LLM actually see when it generated its response?

Without this information, debugging becomes guesswork. You’re trying to figure out which of six potential failure points caused the problem, but you can only see the beginning and end. It’s like trying to debug a function when you can only see the input parameters and return value, not any of the intermediate computations.

I learned this lesson the hard way during those three days of debugging. My logs showed the question and the answer. They didn’t show that the retrieved chunks had unusually low similarity scores, which would have immediately pointed me toward a retrieval problem.

They didn’t show that the context contained incomplete sentences, which would have pointed me toward chunking.

I had to add this logging manually, re-deploy, wait for similar queries, and then analyze the results—three days of detective work that would have taken three minutes with proper observability.

Building an Observability Stack for AI Systems

After that experience, I completely redesigned how I approach AI observability. I now think of it in four layers, each building on the one below:

Structured logging

The foundation is structured logging with AI-specific context. This isn’t just “log more stuff.” It’s logging the right stuff, in the right format, at the right points in your pipeline. Every operation that touches AI, such as embedding, retrieval, reranking, prompt construction, and generation, needs its own log entry with all the relevant context.

For an embedding operation, you need to capture which model you used, how many tokens were in the input, how long it took, and, ideally, an identifier that lets you correlate it with other operations in the same request.

For retrieval, you need the query, the number of results, the similarity scores for the retrieved results, and the time taken. For generation, you need the model, input tokens, output tokens, cost, and latency.

The key insight is that each log entry should be self-contained enough that, if something goes wrong, you can look at that entry and understand what happened at that step. You shouldn’t need to cross-reference five different log lines to piece together the story.

Here’s what a properly structured log entry looks like for a retrieval operation:

{
  "trace_id": "abc-123-def-456",
  "timestamp": "2026-02-24T14:30:22.847Z",
  "operation": "vector_retrieval",
  "service": "rag-service",
  "query_text": "What is the refund policy for digital products?",
  "embedding_model": "text-embedding-3-small",
  "vector_db": "pinecone",
  "index": "knowledge-base-prod",
  "top_k_requested": 10,
  "results_returned": 10,
  "similarity_scores": [0.91, 0.87, 0.85, 0.82, 0.79, 0.76, 0.71, 0.68, 0.65, 0.61],
  "top_result_preview": "Digital products are eligible for refund within 14 days of purchase...",
  "latency_ms": 147,
  "metadata": {
    "user_id": "user-789",
    "session_id": "session-012"
  }
}

With this level of detail, when a user reports a wrong answer, I can query my logs for that trace ID and immediately see what happened. If the similarity scores are all below 0.7, I know retrieval failed to find relevant content.

If the top result preview doesn’t match the query topic, I know there’s a mismatch somewhere in my embeddings or indexing.

Now I’m not guessing anymore. I’m diagnosing.

Distributed Tracing

The second layer is distributed tracing, which connects all these individual log entries into a coherent timeline. A trace shows you the entire journey of a request through your system:

“It started here, went through these operations in this order, took this long at each step, and ended here.”

Traces are invaluable because they show you not just what happened, but the sequence and timing.

I use OpenTelemetry for this, though there are other options. The key is that every operation creates a “span” with a start time, end time, and relevant attributes, and all spans in the same request share a trace ID.

When I’m debugging, I can pull up a trace and see a visual timeline of exactly how the request was processed.
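
To show the idea without pulling in a tracing SDK, here is a deliberately tiny stand-in. It is not the OpenTelemetry API. RequestTrace, step, and answerQuestion are hypothetical names, and the retrieval and generation bodies are stubs:

```typescript
interface TimedSpan {
  traceId: string;
  name: string;
  startMs: number;
  endMs: number;
  attributes: Record<string, unknown>;
}

class RequestTrace {
  readonly traceId: string;
  readonly spans: TimedSpan[] = [];

  constructor(traceId: string) {
    this.traceId = traceId;
  }

  // Run one pipeline step and record its timing, even if the step throws.
  async step<T>(
    name: string,
    fn: () => Promise<T>,
    attributes: Record<string, unknown> = {},
  ): Promise<T> {
    const startMs = Date.now();
    try {
      return await fn();
    } finally {
      this.spans.push({ traceId: this.traceId, name, startMs, endMs: Date.now(), attributes });
    }
  }
}

// Stubbed pipeline: every step shares the same trace ID.
async function answerQuestion(trace: RequestTrace, question: string): Promise<string> {
  const chunks = await trace.step(
    "vector_retrieval",
    async () => ["chunk-a", "chunk-b"],
    { query_length: question.length },
  );
  return trace.step("generation", async () => `answer built from ${chunks.length} chunks`);
}
```

After a request, trace.spans holds one entry per step in completion order, which is the raw material for the timeline view. A real tracing library adds parent/child relationships and exports the spans for you.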

Metrics

The third layer is metrics: aggregate measurements over time that indicate system health. While logs and traces help you debug individual requests, metrics help you understand patterns.

They answer questions such as:

What’s my average retrieval similarity score this hour?
How has latency changed over the past week?
What percentage of queries are hitting my cache?

Metrics turn individual observations into trends.

For AI systems, you need metrics that traditional APM tools don’t provide out of the box. I track things like:

Average retrieval quality scores
Confidence score distributions
Token usage by model
Cost per request
Cache hit rates
Hallucination detection rates

These metrics tell me when something is degrading before users start complaining.

Alerting

The fourth layer is alerting: automated notifications that fire when metrics cross certain thresholds.

If my average retrieval similarity score drops below 0.7 for ten minutes, I want to know immediately. If my hourly AI spend exceeds my budget, I want an alert. If my error rate spikes, I want to be paged.
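
The “for ten minutes” part matters, because a single bad reading should not page anyone. A rough sketch of that rule, assuming samples arrive sorted oldest first (the Sample shape and function name are hypothetical, and most alerting tools give you this as a built-in “for” duration):

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// True only if the metric has stayed below the threshold continuously
// for at least windowMs. Samples are assumed sorted by timestamp.
function sustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  let breachStartMs: number | null = null;
  for (const sample of samples) {
    if (sample.value < threshold) {
      if (breachStartMs === null) breachStartMs = sample.timestampMs;
    } else {
      breachStartMs = null; // a healthy reading resets the clock
    }
  }
  return breachStartMs !== null && nowMs - breachStartMs >= windowMs;
}
```

With one sample per minute, sustainedBreach(samples, 0.7, 10 * 60_000, Date.now()) expresses the similarity-score rule above.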

The goal of this four-layer stack is simple:

I never want to be surprised by an AI problem again. I want to catch degradation before users notice, and when users do report issues, I want to diagnose the root cause in minutes, not days.

The Debugging Workflow That Actually Works

Let me walk you through how I debug AI issues now, using the observability stack I just described.

A support ticket comes in:

“User asked about shipping times and got information about return policies instead.”

This is a classic symptom. The AI answered confidently, but answered the wrong question.

Step 1:

First, I find the trace ID. Every response from my AI system includes a trace ID in the response metadata so that support can include it in tickets. If they didn’t include it, I can search my logs by user ID and timestamp to find the relevant trace.

Step 2:

Once I have the trace ID, I pull up the full trace. I’m looking at a timeline that shows me:

Query received
Embedding generated
Vector search executed
Reranking applied
Prompt constructed
LLM called
Response returned

Each step shows its duration and key attributes.

Step 3:

I start from the beginning. The query was “How long does shipping take?”

That’s correct. It matches what the user asked. The embedding was generated in 45ms using text-embedding-3-small.

Then I look at vector retrieval:

10 results returned
Top score 0.73
Top result preview “Our return policy allows...”

The top score is 0.73, which is on the lower end. And the top result preview is about returns, not shipping.

That’s the problem right there.

The retrieval step failed to find relevant shipping content and instead returned the most similar content it could find, which happened to be about returns.

Step 4:

Now I know where to look. Why didn’t retrieval find shipping content?

I can easily think through a few possibilities:

Maybe there’s no content about shipping in the knowledge base.
Maybe the content exists, but wasn’t embedded correctly.
Maybe there’s a mismatch between how the content was chunked and how the query was embedded.

I search my knowledge base index for “shipping” to make sure the content exists.

I then compare the embedding for that content against the query embedding. The two should be similar. If they’re not, I might have an embedding model mismatch or a chunking problem.
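
“Similar” here usually means cosine similarity, which is also what most vector databases report as the score. A minimal version for checking two embeddings by hand (the function name is mine, and it assumes both vectors come from the same embedding model):

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated,
// -1 means opposite. Both vectors must have the same dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If the shipping chunk scores noticeably lower against the query than a clearly off-topic chunk does, the problem sits in how the content was chunked or embedded, not in the LLM.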

In this case, I discovered that the shipping information was embedded in a larger document on order processing, and the chunk that contains the shipping details also includes content on order tracking, payment processing, and other topics.

The embedding for that chunk reflects all those topics, diluting the “shipping” signal.

In just four steps, I have identified the root cause: a chunking strategy that buries relevant content in too much surrounding context.

I can quickly introduce a fix by re-chunking the order processing document into more focused sections, one specifically about shipping times.
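
What that re-chunking looks like depends entirely on your documents. As one illustration, assuming the source is Markdown-like with "## " section headings (an assumption on my part, along with the function name), you can split on headings so each chunk covers a single topic:

```typescript
interface Chunk {
  heading: string;
  text: string;
}

// Split a document into one chunk per "## " section. Text before the
// first heading becomes a chunk with an empty heading.
function chunkBySection(document: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk = { heading: "", text: "" };
  for (const line of document.split("\n")) {
    if (line.startsWith("## ")) {
      chunks.push(current);
      current = { heading: line.slice(3).trim(), text: "" };
    } else {
      current.text += line + "\n";
    }
  }
  chunks.push(current);
  // Drop empty sections and tidy up whitespace.
  return chunks
    .map((c) => ({ heading: c.heading, text: c.text.trim() }))
    .filter((c) => c.text.length > 0);
}
```

Each chunk then gets its own embedding, so a question about shipping is matched against a vector that represents shipping and nothing else. Very long sections would still need a size limit on top of this.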

With this approach, the total debugging time is about 15 minutes. Now compare that to three days of blind guessing.

Practical Implementation

I want to share some specific implementation patterns that have worked well for me, because the concepts are only useful if you can actually build them.

For logging, I created a simple class that wraps all my AI operations and automatically logs with the right structure.

Every time I make an embedding, retrieval, or LLM call, it goes through this wrapper, which captures timing, token counts, scores, and other relevant metadata. The wrapper also propagates trace IDs, so all operations in a request are connected.

class AILogger {
  constructor(serviceName) {
    this.serviceName = serviceName;
  }
  createTrace(requestId) {
    return new AITrace(requestId, this.serviceName);
  }
}
class AITrace {
  constructor(requestId, serviceName) {
    this.traceId = requestId || crypto.randomUUID();
    this.serviceName = serviceName;
    this.spans = [];
    this.startTime = Date.now();
  }
  span(operation, data = {}) {
    const span = {
      traceId: this.traceId,
      operation,
      timestamp: new Date().toISOString(),
      elapsed_ms: Date.now() - this.startTime,
      service: this.serviceName,
      ...data
    };
  
    this.spans.push(span);
    
    // Send to your logging backend
    console.log(JSON.stringify(span));
    
    return span;
  }
  // Specific logging methods for common AI operations
  logEmbedding(model, inputTokens, latencyMs) {
    return this.span('embedding', {
      step: 'embedding',
      model,
      input_tokens: inputTokens,
      latency_ms: latencyMs
    });
  }

  logRetrieval(query, results, latencyMs) {
    return this.span('retrieval', {
      step: 'retrieval',
      query_length: query.length,
      chunks_retrieved: results.length,
      top_score: results[0]?.score,
      bottom_score: results[results.length - 1]?.score,
      scores: results.map(r => r.score),
      latency_ms: latencyMs
    });
  }

  logReranking(beforeCount, afterCount, latencyMs) {
    return this.span('reranking', {
      step: 'reranking',
      chunks_before: beforeCount,
      chunks_after: afterCount,
      latency_ms: latencyMs
    });
  }

  logGeneration(model, inputTokens, outputTokens, latencyMs) {
    return this.span('generation', {
      step: 'generation',
      model,
      input_tokens: inputTokens,
      output_tokens: outputTokens,
      latency_ms: latencyMs,
      cost_usd: this.estimateCost(model, inputTokens, outputTokens)
    });
  }

  logPrompt(systemPrompt, userPrompt, context) {
    return this.span('prompt_construction', {
      step: 'prompt',
      system_prompt_tokens: this.countTokens(systemPrompt),
      user_prompt_tokens: this.countTokens(userPrompt),
      context_tokens: this.countTokens(context),
      // Store first 500 chars for debugging (not full prompt for privacy)
      context_preview: context.substring(0, 500)
    });
  }

  logResponse(response, confidence) {
    return this.span('response', {
      step: 'response',
      response_tokens: this.countTokens(response),
      confidence_score: confidence,
      response_preview: response.substring(0, 200)
    });
  }

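  estimateCost(model, inputTokens, outputTokens) {
    // USD per 1,000 tokens. These are example rates that go stale quickly,
    // so load current pricing from config in a real system.
    const pricing = {
      'gpt-4': { input: 0.03, output: 0.06 },
      'gpt-4-turbo': { input: 0.01, output: 0.03 },
      'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
      'claude-3-sonnet': { input: 0.003, output: 0.015 }
    };
    // Unknown models fall back to gpt-3.5-turbo rates, so treat their cost as a rough guess
    const p = pricing[model] || pricing['gpt-3.5-turbo'];
    return ((inputTokens / 1000) * p.input) + ((outputTokens / 1000) * p.output);
  }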

  countTokens(text) {
    // Rough estimation: ~4 chars per token
    return Math.ceil((text?.length || 0) / 4);
  }

  complete() {
    return this.span('trace_complete', {
      total_duration_ms: Date.now() - this.startTime,
      span_count: this.spans.length
    });
  }
}

The critical insight is to log at decision points, not just at entry and exit.

When my system decides to use a particular embedding model, I log that decision and why.
When reranking changes the order of results, I log the before and after rankings.
When I decide to fall back to a different model because the primary one is slow, I log that fallback decision.

These decision points are often where bugs hide, and having visibility into the decisions makes debugging much faster.

For metrics, I use Prometheus with Grafana, but the specific tools matter less than the metrics you choose to track.

My dashboard shows me six key AI metrics at a glance:

Average retrieval similarity score (quality)
P95 generation latency (performance)
Cache hit rate (efficiency)
Hourly token usage by model (cost)
Error rate by operation (reliability)
Requests per minute (volume)

These six numbers give me a quick health check of my AI system. If any of them look unusual, I dig deeper.

For alerting, I’ve learned to alert on trends, not just thresholds. It’s not very useful to alert when the retrieval score drops below 0.7, because scores fluctuate query-by-query, and I’d get a lot of noise.

Instead, I alert when the average retrieval score over a 15-minute window falls below 0.75, indicating genuine degradation rather than a few unlucky queries.
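Here is a rough sketch of that windowed check in application code. The class and function names are hypothetical, and it assumes scores are recorded in time order. If you are on Prometheus, you would more likely express the same idea in an alert rule with avg_over_time over a 15m range.

```typescript
// Keeps the scores seen in the last `windowMs` and reports their average.
class RollingAverage {
  private samples: { at: number; value: number }[] = [];

  constructor(private windowMs: number) {}

  add(value: number, at: number = Date.now()): void {
    this.samples.push({ at, value });
    this.evict(at);
  }

  average(now: number = Date.now()): number | null {
    this.evict(now);
    if (this.samples.length === 0) return null;
    const sum = this.samples.reduce((acc, s) => acc + s.value, 0);
    return sum / this.samples.length;
  }

  // Drop samples that have fallen out of the window (assumes time-ordered input).
  private evict(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.samples.length > 0 && this.samples[0].at < cutoff) {
      this.samples.shift();
    }
  }
}

// Alert on the window average, not on any single query's score.
function shouldAlert(avg: number | null, threshold = 0.75): boolean {
  return avg !== null && avg < threshold;
}

const FIFTEEN_MINUTES = 15 * 60 * 1000;
const retrievalScores = new RollingAverage(FIFTEEN_MINUTES);
```

One unlucky query at 0.5 among several at 0.9 leaves the average above the threshold, while a sustained run of weak scores pulls it under and fires the alert.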

Similarly, I alert on the rate of change. For example, if my costs are increasing 50% faster than my request volume, something is probably wrong even if the absolute numbers are still within budget.
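A minimal sketch of that rate-of-change check follows. The PeriodTotals shape and the function names are assumptions for illustration, and I am reading “50% faster” as cost growth exceeding 1.5 times request growth between two periods.

```typescript
interface PeriodTotals {
  requests: number;
  costUsd: number;
}

// Fractional growth from previous to current, e.g. 0.2 means +20%.
function growth(previous: number, current: number): number {
  if (previous <= 0) return 0; // no baseline yet, so report no growth
  return (current - previous) / previous;
}

// True when cost grew more than `factor` times as fast as request volume.
function costOutpacingTraffic(
  prev: PeriodTotals,
  curr: PeriodTotals,
  factor = 1.5
): boolean {
  const costGrowth = growth(prev.costUsd, curr.costUsd);
  const requestGrowth = growth(prev.requests, curr.requests);
  if (costGrowth <= 0) return false; // costs flat or falling: nothing to flag
  return costGrowth > requestGrowth * factor;
}

const lastHour: PeriodTotals = { requests: 1000, costUsd: 10 };
const thisHour: PeriodTotals = { requests: 1100, costUsd: 12 };
console.log(costOutpacingTraffic(lastHour, thisHour));
```

In this example requests grew 10% while cost grew 20%, so the check flags it even though $12 may be well within budget.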

The Tools

People often ask me which observability tools to use. Honestly, the specific tools matter less than the practice of properly instrumenting your code. That said, here are the tools I’ve had good experiences with:

OpenTelemetry with Jaeger or Datadog works well for general-purpose tracing and metrics. These are battle-tested tools with good ecosystem support.
Langfuse is excellent if you want an open-source option for LLM-specific observability, and LangSmith works well if you’re using LangChain.
Helicone and Portkey are good choices if you want a proxy-based approach that adds observability without changing your code.

The most important thing is to start with something. Imperfect observability is infinitely better than no observability. You can always improve your tooling later, but if you’re flying blind, you’re accumulating debugging debt with every deployment.

I want to be honest:

Building proper AI observability takes time: instrumenting your code, setting up dashboards, and configuring alerts.

It’s not a trivial amount of work. You might be tempted to skip it, especially when you’re trying to ship features quickly.

Don’t skip it.

The time you invest in observability pays off exponentially when something goes wrong. And in AI systems, something will go wrong.

Models behave unexpectedly. Embeddings drift over time. Retrieval quality degrades as your knowledge base grows. These aren’t edge cases; they’re inevitable parts of operating AI in production.

When these problems happen, you have a choice: spend days debugging blind, or spend minutes with good observability.

The three days I spent debugging that chunking issue? That was time I could have spent building features. That was the time my team spent frustrated and unproductive. That was the time users spent getting wrong answers and losing trust.

Good observability isn’t just a technical practice. It’s a business investment. It’s the difference between AI systems you can confidently operate and AI systems that feel like ticking time bombs.

I hope you learned something today. Spread the love. Share this newsletter with at least two of your friends today.

If you have questions about the bootcamp, reply to this email. I read everything.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

3. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

Announcement: AI Backend Engineer Bootcamp

Solomon Eseme — Tue, 10 Mar 2026 09:34:34 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today’s issue is brought to you by Masteringbackend → An all-in-one platform that helps backend engineers become highly paid backend and AI engineers through a practical, hands-on learning approach.

I’ve been working on something for months. Now it’s ready.

The AI Backend Engineer Bootcamp starts April 1st.

Here’s what I realized:

Backend engineers are struggling with AI, not because they’re not smart enough, but because the resources are broken.

Most AI courses teach you to call an API and print “Hello World.” That’s not engineering.

This bootcamp is different.

You’ll build a complete AI-powered backend system from scratch:

Weeks 1-3: Backend foundations (auth, databases, APIs, caching, security)
Weeks 4-5: AI infrastructure (vector databases, RAG, agents, cost controls)
Week 6: Live defense — present your system and defend your architecture

By the end, you’ll have production-ready code, not tutorial projects.

Details:

Starts April 1st, 2026
6 weeks, 10-15 hours/week
30 spots only

Enroll Now for 38% discount→


FREE WORKSHOP: AI Beyond the Chatbots: Building Reliable Workflows

Solomon Eseme — Thu, 05 Mar 2026 09:09:22 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

I’m hosting another free workshop, and this time I’m bringing reinforcements.

“AI Beyond the Chatbots: Building Reliable Workflows”

📅 Thursday, March 5th

⏰ 4:00 PM UTC

💰 Free

JOIN: https://luma.com/wfe8hgoh

Here’s the problem this workshop solves:

Most AI tutorials teach you to build chatbots. Call an API, get a response, done.

But production AI isn’t chatbots. It’s workflows:

Multi-step processes
Data flowing between systems
Decisions that need to be reliable
Failures that need to be debugged

And workflows break in ways chatbots don’t.

Jide will show you the reliability patterns that make AI workflows actually work — not just in demos, but in production.

What you’ll learn:

Why the industry is moving from chatbots to workflows
The reliability system every AI workflow needs
Live demo of patterns in action

Last workshop had 200+ attendees. This one is capped at 500.

https://luma.com/wfe8hgoh

How would you design a real-time collaborative document editing backend?

Solomon Eseme — Sat, 21 Feb 2026 08:43:42 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Welcome to another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world system design and interview questions.

Before we dive in:

Check the new GitHub for non-developers:

It's called Remix. It allows users to chat with your app, propose changes in plain English, and you decide what to merge.

How it works:

Install SDK → one command setup, works with existing React Native apps

Try it here:

If you have an iPhone, try out Remix on TestFlight

This is the MB Interview Series on Backend Weekly, which airs every Saturday.

In this series, I will guide you through answering common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 7 (Episode 6 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a real-time collaborative document editing backend?”

Here’s how to approach it:

To practice this question in real-time, we are building the next Interview Prep Playground.

Check it out here: https://masteringbackend.com/interviews

Now, let’s start by clarifying what real-time collaboration actually means.

Understand the problem

Solving any system design question becomes simple when you clearly understand the real problem.

Real-time collaborative editing is not about text.

It is about keeping one shared document consistent while many users are editing it at the same time.

Think about tools like Google Docs or Notion. When two people type at the same time:

Nobody’s text disappears.
The document stays consistent.
Updates appear almost instantly.
Offline users can reconnect safely.

That is exactly the goal you need to lead with when faced with this question in your next interview.

When designing this type of system, whether for a job or in an interview, your objectives should be to:

Keep the document consistent across all users.
Handle concurrent edits safely.
Deliver updates in near real-time (under 200ms).
Prevent data loss during crashes.
Scale to thousands of active documents.

Now that we understand the problem and the objectives to focus on, the real question becomes:

How do you synchronize distributed edits safely and efficiently?

The High-Level Architecture

Let’s look at the high-level architecture for this. Designing a distributed collaborative backend system like this requires a lot of components and engineering prowess.

However, we are going to stick with just the high-level architecture to help us with a clear view.

This is the core of our architecture:

From the diagram above, you can easily spot the components and how they interact together for a smooth collaborative document editing.

Below is the interaction that happens:

Client: Sends edit operations over a persistent WebSocket connection.
Collaboration Server: Receives operations and routes them to the correct document instance.
Document State Manager: Holds in-memory state of active documents and version numbers.
Conflict Resolution Engine: Resolves concurrent edits using OT or CRDT.
Database: Stores document snapshots and an append-only operation log.

Core Concepts

In a collaborative editor, clients do not send the full document every time someone types; rather, they send operations.

This is an important point to share with your interviewer.

Here’s a code example in TypeScript. When a user types “H” at position 10, we send:

interface EditOperation {
  type: "insert" | "delete";
  position: number;
  value?: string;
  length?: number;
  baseVersion: number;
  clientId: string;
}

const edit: EditOperation = {
  type: "insert",
  position: 10,
  value: "H",
  baseVersion: 42,
  clientId: "user-123"
};

If you look closely, you will notice the baseVersion property. This tells the server the document version the user was editing. This is critical for conflict resolution.

You should also discuss the importance of implementing baseVersion for conflict resolution with your interviewer.
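To show why baseVersion matters, here is a small sketch of how a server might use it to find the operations a client had not yet seen. DocumentHistory and its methods are hypothetical names, and the sketch assumes versions start at 0 and increase by one per applied operation.

```typescript
interface Op {
  type: "insert" | "delete";
  position: number;
  value?: string;
  length?: number;
  baseVersion: number;
  clientId: string;
}

class DocumentHistory {
  // applied[i] is the operation that moved the document from version i to i + 1.
  private applied: Op[] = [];

  get version(): number {
    return this.applied.length;
  }

  // Operations the sender had not seen when it produced `incoming`.
  // These are the ones the incoming edit must be reconciled against.
  concurrentWith(incoming: Op): Op[] {
    if (incoming.baseVersion < 0 || incoming.baseVersion > this.version) {
      throw new Error(`Invalid baseVersion ${incoming.baseVersion}`);
    }
    return this.applied.slice(incoming.baseVersion);
  }

  // Record an operation as applied and return the new document version.
  commit(op: Op): number {
    this.applied.push(op);
    return this.version;
  }
}
```

If the result is empty, the client was up to date and the edit can be applied as-is. Otherwise, each returned operation is a concurrent edit that the conflict resolution step has to account for.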

Let’s talk about that further:

Conflict Resolution Strategy

The hardest part of building a collaborative editing backend is handling concurrent edits. I have firsthand experience with this while building our MB Project playground, where users can build a real backend project in real time using our code editor.

Your concern at this stage is:

If two users edit at the same time, we must merge their changes safely.

I’m going to share two common strategies you can implement to solve this problem, and you can also discuss this with your interviewer to pick the best strategy based on your use case.

Operational Transformation (OT): This strategy is used by Google Docs. The server transforms incoming operations against already applied operations to maintain consistency.
CRDT (Conflict-Free Replicated Data Types): This strategy underpins libraries like Yjs and Automerge, and it inspired Figma’s multiplayer design. The data structure itself guarantees eventual consistency without central transformation.

For interviews, OT is simpler to explain in centralized systems.

Operational Transformation (OT)

Operational Transformation (OT) was one of the earliest serious attempts to solve this problem.

The idea is deceptively elegant because instead of rejecting concurrent operations, you transform them against each other so that they can be applied in any order and still converge to the same final state.

If two inserts target the same position, you define a deterministic tie-breaking rule using client ID and shift one operation’s index accordingly. If an insert and delete overlap, you adjust ranges so both operations preserve their original user intent.

Here is a simplified transformation example:

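// Transforms insert `b` so it can be applied after concurrent insert `a`.
function transformInsertInsert(a, b) {
  // `a` landed after `b`'s position: `b` is unaffected
  if (a.position > b.position) return b;

  // `a` landed before `b`'s position: shift `b` right by the inserted length
  if (a.position < b.position) {
    return { ...b, position: b.position + a.value.length };
  }

  // Same position: deterministic ordering by clientId
  if (a.clientId < b.clientId) {
    return { ...b, position: b.position + a.value.length };
  }

  return b;
}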

The code snippet is simple: it shifts the position of the incoming insert whenever a concurrent insert has already landed before it, so that both users’ intent is preserved.

Note that Operational Transformation (OT) may look simple on paper, but the implementation in a real production-ready environment is not simple.
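As one example of why it gets harder, here is a sketch of the insert/delete overlap case mentioned earlier: transforming a delete against a concurrent insert that has already been applied. The types and function names are assumptions for illustration, and a real OT engine covers many more pairings than this one.

```typescript
interface InsertOp { position: number; value: string; }
interface DeleteOp { position: number; length: number; }

// Rewrites `del` so it can be applied after concurrent insert `ins`.
// Returns one or two deletes, to be applied in the order given.
function transformDeleteAgainstInsert(del: DeleteOp, ins: InsertOp): DeleteOp[] {
  const delEnd = del.position + del.length;

  // Insert landed at or before the deleted range: shift the whole delete right.
  if (ins.position <= del.position) {
    return [{ position: del.position + ins.value.length, length: del.length }];
  }
  // Insert landed after the deleted range: nothing changes.
  if (ins.position >= delEnd) {
    return [del];
  }
  // Insert landed inside the deleted range: delete around it so the
  // other user's new text survives.
  const tail: DeleteOp = {
    position: ins.position + ins.value.length,
    length: delEnd - ins.position,
  };
  const head: DeleteOp = {
    position: del.position,
    length: ins.position - del.position,
  };
  return [tail, head]; // tail first, so head's position is still valid
}

function applyInsert(text: string, op: InsertOp): string {
  return text.slice(0, op.position) + op.value + text.slice(op.position);
}

function applyDelete(text: string, op: DeleteOp): string {
  return text.slice(0, op.position) + text.slice(op.position + op.length);
}
```

With the document "abcdef", one user deleting "bcde" while another inserts "XY" at position 3 ends up as "aXYf": the deletion still happens, and the inserted text is kept.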

CRDT (Conflict-Free Replicated Data Types)

Conflict-Free Replicated Data Types (CRDTs) approach the problem from the opposite direction.

Instead of rewriting operations dynamically, CRDTs design the data structure itself so that concurrent modifications can be merged deterministically without transformation.

Instead of saying “insert at position 5,” a CRDT says “insert after element X.”

A simplified CRDT insertion might look like this:

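interface CharNode {
  id: string;        // unique logical timestamp
  value: string;
  next?: CharNode;
}

// Sentinel head node so every insert has a predecessor
const head: CharNode = { id: "root", value: "" };

function findNode(id: string): CharNode | undefined {
  for (let n: CharNode | undefined = head; n; n = n.next) {
    if (n.id === id) return n;
  }
  return undefined;
}

function insertAfter(targetId: string, newNode: CharNode): void {
  const target = findNode(targetId);
  if (!target) throw new Error(`Unknown node: ${targetId}`);
  newNode.next = target.next;
  target.next = newNode;
}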

When two clients insert after the same node concurrently, their unique IDs determine a deterministic order, which eliminates the need for transformations like those in OT.
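Here is a rough sketch of how that ID-based ordering can work, loosely in the style of sequence CRDTs such as RGA. The ID format (a counter plus a client ID as tie-breaker) and the Replica class are assumptions for illustration, and the sketch assumes a parent node always arrives before its children.

```typescript
// Hypothetical ID: a Lamport-style counter plus client ID as a tie-breaker.
interface NodeId { counter: number; clientId: string; }
interface RgaNode { id: NodeId; value: string; next?: RgaNode; }

function compareIds(a: NodeId, b: NodeId): number {
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.clientId < b.clientId ? -1 : a.clientId > b.clientId ? 1 : 0;
}

class Replica {
  private head: RgaNode = { id: { counter: 0, clientId: "" }, value: "" };

  insertAfter(parent: NodeId, id: NodeId, value: string): void {
    let cursor = this.find(parent);
    // Skip nodes with larger IDs so every replica picks the same slot,
    // no matter which insert it happened to receive first.
    while (cursor.next && compareIds(cursor.next.id, id) > 0) {
      cursor = cursor.next;
    }
    cursor.next = { id, value, next: cursor.next };
  }

  text(): string {
    let out = "";
    for (let n = this.head.next; n; n = n.next) out += n.value;
    return out;
  }

  private find(id: NodeId): RgaNode {
    for (let n: RgaNode | undefined = this.head; n; n = n.next) {
      if (compareIds(n.id, id) === 0) return n;
    }
    throw new Error("Unknown parent node");
  }
}

// Two clients insert after the same (root) node at the same time.
const root: NodeId = { counter: 0, clientId: "" };
const aliceId: NodeId = { counter: 1, clientId: "alice" };
const bobId: NodeId = { counter: 1, clientId: "bob" };

const replica1 = new Replica();
replica1.insertAfter(root, aliceId, "A"); // sees Alice first
replica1.insertAfter(root, bobId, "B");

const replica2 = new Replica();
replica2.insertAfter(root, bobId, "B");   // sees Bob first
replica2.insertAfter(root, aliceId, "A");

console.log(replica1.text(), replica2.text());
```

Both replicas end up with the same text even though they received the inserts in opposite orders, and neither had to rewrite the other’s operation.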

The Backend Architecture That Actually Works

Regardless of OT or CRDT, the surrounding backend architecture tends to converge to a similar shape.

Clients establish a persistent connection, usually over WebSockets. The server maintains an in-memory representation of active documents, because disk-backed synchronization is far too slow for real-time interaction.

Every operation is:

Validated.
Applied (with transformation or merge).
Appended to a durable log.
Broadcast to connected collaborators.

The durable log matters more than most teams initially realize. Without it, you cannot recover the document state after a crash, nor can you allow late joiners to reconstruct the document.

A common approach is event sourcing.

Below is a simple server implementation:

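interface EditOperation {
  id: string;
  documentId: string;
  revision: number;
  payload: any;
  timestamp: number;
}

// Assumes `wss` is a WebSocketServer from the `ws` package, and that
// documentManager, conflictEngine, operationLog, and broadcastToRoom
// are defined elsewhere in your codebase.
wss.on("connection", (socket) => {
  // `ws` delivers every incoming frame through the "message" event
  socket.on("message", async (raw) => {
    try {
      const operation: EditOperation = JSON.parse(raw.toString());
      const doc = await documentManager.get(operation.documentId);

      const transformed = conflictEngine.transform(doc, operation);

      doc.apply(transformed); // update in-memory state
      await operationLog.save(transformed); // persist

      broadcastToRoom(operation.documentId, transformed);
    } catch (error) {
      console.error("Edit failed", error);
    }
  });
});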

This simple implementation demonstrates:

In-memory document state
Conflict resolution
Durable storage
Real-time broadcast

With the core implementation out of the way, let’s look at some additions that are more challenging but make our collaborative backend system efficient.

Offline Editing and Reconciliation

Imagine you’re working on a document and your internet connection dropped 5 minutes ago.

What happens to the content you typed within that timeframe?

That’s where offline editing and reconciliation come in handy.

The frontend of your application should store operations locally. When the client reconnects:

The client sends its unsynced operations.
The server transforms them against the latest document state, using version numbers as the reference point.
The server applies and rebroadcasts them.

This implementation will prevent data loss. More importantly, without version tracking, reconciliation becomes impossible.

Scaling the System

Building a collaborative editing system that works is one thing. Scaling it is not trivial.

Here’s how to approach it:

Shard documents by documentId using consistent hashing.
Route all edits for a document to the same server node.
Keep collaboration servers stateless except for active document memory.
Use Redis to share ephemeral state if needed.
Snapshot documents every 100 operations to reduce replay time.
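The first two points above can be sketched with a small consistent hash ring. This is an illustration under assumptions: HashRing and its methods are hypothetical names, the FNV-1a style hash stands in for whatever hash function your stack provides, and the number of virtual nodes is arbitrary.

```typescript
// 32-bit FNV-1a style hash over UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], private replicas = 100) {
    for (const node of nodes) this.addNode(node);
  }

  // Each server gets many virtual points so load spreads evenly.
  addNode(node: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // First virtual point clockwise from the document's hash.
  nodeFor(documentId: string): string {
    if (this.ring.length === 0) throw new Error("No nodes in ring");
    const h = fnv1a(documentId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node; // wrap around the ring
  }
}

const ring = new HashRing(["collab-1", "collab-2", "collab-3"]);
console.log(ring.nodeFor("doc-42"));
```

Every edit for the same documentId lands on the same node, and when a node is removed only the documents it owned move elsewhere, which is the property that makes this preferable to a plain modulo over the node count.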

When a node crashes:

Reload snapshot.
Replay the operation log.
Resume service.

This is similar to event sourcing.
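A minimal sketch of that recovery path is below. The Snapshot and LoggedOp shapes are assumptions for illustration (plain-text documents, one version number per logged operation). A real system would load both from the database.

```typescript
interface LoggedOp {
  version: number; // document version this operation produced
  type: "insert" | "delete";
  position: number;
  value?: string;
  length?: number;
}

interface Snapshot {
  version: number;
  text: string;
}

function applyOp(text: string, op: LoggedOp): string {
  if (op.type === "insert") {
    return text.slice(0, op.position) + (op.value ?? "") + text.slice(op.position);
  }
  return text.slice(0, op.position) + text.slice(op.position + (op.length ?? 0));
}

// Rebuild document state from the latest snapshot plus the operations logged after it.
function recover(snapshot: Snapshot, log: LoggedOp[]): Snapshot {
  let { version, text } = snapshot;
  const pending = log
    .filter((op) => op.version > snapshot.version) // older ops are already in the snapshot
    .sort((a, b) => a.version - b.version);

  for (const op of pending) {
    if (op.version !== version + 1) {
      throw new Error(`Gap in log: expected ${version + 1}, got ${op.version}`);
    }
    text = applyOp(text, op);
    version = op.version;
  }
  return { version, text };
}

const snapshot: Snapshot = { version: 100, text: "Hello" };
const log: LoggedOp[] = [
  { version: 99, type: "insert", position: 0, value: "H" },   // already in the snapshot
  { version: 101, type: "insert", position: 5, value: " world" },
  { version: 102, type: "delete", position: 0, length: 1 },
];
const recovered = recover(snapshot, log);
console.log(recovered);
```

Snapshotting every 100 operations bounds this loop to roughly 100 replays per document, which is what keeps recovery time predictable.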

Most importantly, make sure to discuss your fault tolerance, reliability, monitoring, and observability strategies with your interviewer.

Final Answer

I would design the system using WebSockets for real-time communication, a document state manager that keeps active documents in memory, and an Operational Transformation engine to handle concurrent edits. I would store all operations in an append-only log with periodic snapshots for durability. To scale, I would shard documents using consistent hashing and route all edits for a document to the same node. I would add monitoring around latency, transformation time, and connection stability to ensure reliability. This ensures consistency, scalability, and fault tolerance in a real-time collaborative editing system.

Final Thoughts

Designing a collaborative editor might sound simple, but it is a deep dive into concurrency control, distributed systems, ordering guarantees, and durability.

Every keystroke becomes a distributed event. Every version number becomes a consistency boundary.

So next time an interviewer asks, “How would you design a real-time collaborative document editing backend?”

Don’t just say “I’ll use WebSockets.”

Walk them through how you’d handle concurrency, scaling, durability, and failure.

That’s how you show senior-level thinking.

Join our AI Engineering Bootcamp

The answer to the question “Will AI Kill Backend Engineers?” boils down to this: backend engineers who embrace AI will replace those who do not.

If you’re ready to LEARN AI, EMBRACE AI. We are launching a 6-week AI Bootcamp on “Become a Production-Ready AI Backend Engineer.”

We are backend engineers, and building production-ready systems has been our core skill. Learn exactly how to do the same as an AI Backend Engineer in the AI-first world.

Join here: Become a Production-Ready AI Backend Engineer

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.


I'm teaching my AI backend framework live (free)

Solomon Eseme — Wed, 11 Feb 2026 09:31:29 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

On Friday, I’m hosting a free live workshop:

“The 6 Layers Every AI Backend Needs”

It’s the framework I developed after multiple production failures — runaway agents, hallucinations, cost explosions.

I’m teaching it for free.

Why this matters for backend engineers:

Most AI education teaches you to call APIs.

That’s maybe 10% of what you need to build AI systems that actually work.

The other 90%? It’s backend engineering:

Error handling when the API fails
Cost management when usage spikes
Observability when outputs are wrong
Guardrails when agents misbehave

That’s what this workshop covers.

What you’ll learn:

→ Why AI tutorials fail in production (specific patterns)
→ The 6-layer architecture for AI backends
→ Live demo: Building a RAG endpoint with proper infrastructure
→ Q&A: Your questions answered

The details:

Date: Friday, 13th February, 2026
Time: 4:00 PM UTC
Duration: 90 minutes
Cost: Free

300+ engineers already registered. If you’re building with AI (or planning to), this is worth 90 minutes.

I’m also launching an AI Backend Engineer Bootcamp in March. The workshop teaches the framework; the bootcamp makes you build it.

Details at masteringai.dev if you’re curious. But the workshop is valuable on its own, no bootcamp required.

I hope you learned something today. Spread the love. Share this newsletter with at least two of your friends today.

If you have questions about the bootcamp, reply to this email. I read everything.


Here's exactly what you'll build in 6 weeks

Solomon Eseme — Sat, 07 Feb 2026 08:00:21 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

Your AI Agents will be stupid and return hallucinated results unless you give them the right data.

Nimble makes AI Agents smarter by giving them access to retrieve real-time web data in a structured, tabular format.

With Nimble (@nimble_data), you can:

- Turn the web into data tables, not just markdown with long text.
- Live Web Access, Not Stale Indexes
- Access Any Website (Even JS Heavy Ones)

Imagine you're searching for something like:

"Which stores have a PS5 in stock within 25 miles right now—include price, pickup time, and store address from each retailer's site?"

This is a time-sensitive question, and your AI Agents won't give the right result unless you use Nimble:


Here's the documentation to start with: https://docs.nimbleway.com

Why I Built This Bootcamp

Let me be direct.

Most backend engineers are learning AI incorrectly.

They’re taking prompt engineering courses. Building ChatGPT wrapper side projects. Adding “AI/ML” to their LinkedIn headlines. Watching tutorials on LangChain and vector databases.

And they still can’t build AI systems in production.

I know because I made every one of these mistakes.

18 months ago, I shipped my first production AI feature. Within two weeks, everything broke.

A runaway agent racked up $400 in API costs overnight
A hallucination gave a user incorrect information
Memory leaks crashed our service
Vector search returned garbage the moment we scaled

Every tutorial I watched was useless. They showed toy demos that fell apart the moment real users touched them.

So I threw out everything I thought I knew.

I stopped thinking like an “AI developer” and started thinking like a backend engineer who happens to work with AI.

That shift changed everything.

The Core Insight

Here’s what most AI education gets wrong:

They treat AI as a separate discipline.

It’s not.

AI is backend infrastructure.

RAG pipelines are just retrieval systems. Agents are just workflow orchestrators with LLM calls. Embeddings are mathematical representations that you query like any other data layer. Human-in-the-loop is just approval workflows with AI triggers.

Once you see AI through this lens, everything clicks.

The skills that make you a good backend engineer, such as system design, data modeling, error handling, observability, and cost management, are the same skills that make you a good AI backend engineer.

You just need to know how to apply them.

That’s what this bootcamp teaches.

The 6-Week Curriculum

Here’s exactly what you’ll build each week:

Week 1 — Activation System

You ship your first working backend service.

What you’ll build:

Authentication system (JWT and session handling)
Database schema with proper migrations
Input validation layer
Environment configuration (dev/staging/prod)
Basic API structure with error handling

What you’ll prove:

You can ship a deployable backend service
You understand production-grade foundations
You can explain your database design decisions

Week 2 — Business Logic Layer

You add real business functionality to your system.

What you’ll build:

Role-based access control (RBAC)
Background job processing
Third-party API integrations
Payment or workflow logic
Multi-entity business rules

What you’ll prove:

You can build systems that companies actually use
You understand authorization vs. authentication
You can model real business domains

This is where most “AI developers” fall apart. They can copy tutorials, but they can’t build business systems.

Week 3 — Production Hardening

You make your system production-ready.

What you’ll build:

Caching strategies (Redis, query caching)
Security baselines (input sanitization, rate limiting)
Centralized logging
Monitoring and alerting setup
Performance profiling

What you’ll prove:

You can ship systems that survive real traffic
You understand security at an infrastructure level
You can debug production issues

After Week 3, you’ll have a backend system better than 90% of what junior-to-mid engineers ship. And we haven’t even touched AI yet.

Week 4 — AI Infrastructure Layer

Now we add AI. The right way.

What you’ll build:

Vector database integration (Pinecone/Weaviate/Qdrant)
Embedding generation and storage
RAG pipeline architecture
AI agents with proper guardrails
Cost tracking and budget controls

What you’ll prove:

You understand embeddings as math and not magic
You can build RAG systems that actually work at scale
You can control agent behavior and costs

Here’s a simple example of what cost control looks like:

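// Assumes `redis` is an ioredis-style client, and that `agent` and
// `logUsage` are defined elsewhere in your codebase.
async function executeAgent(task: AgentTask) {
  // Remaining token budget for this user; a missing key means no limit is set
  const budget = await redis.get(`budget:${task.userId}`);

  if (budget !== null && parseInt(budget, 10) <= 0) {
    throw new Error("Budget exceeded for this billing period");
  }

  const result = await agent.run(task);

  // Track token usage
  await redis.decrby(`budget:${task.userId}`, result.tokensUsed);
  await logUsage(task.userId, result.tokensUsed, result.cost);

  return result;
}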

This is the infrastructure the tutorials don’t show you.
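One thing worth noticing in the example above: the budget is checked before the run and decremented after it, so two tasks that start at the same moment can both pass the check. Here is a sketch of one way to close that gap by reserving tokens up front and settling afterwards. BudgetLedger and runWithBudget are hypothetical names, the store is in-memory for illustration, and unlike the example above, a user with no budget set is treated as having zero. With Redis, you would need an atomic operation for the reserve step (DECRBY returns the new value, which you can check).

```typescript
class BudgetLedger {
  private remaining = new Map<string, number>();

  setBudget(userId: string, tokens: number): void {
    this.remaining.set(userId, tokens);
  }

  balance(userId: string): number {
    return this.remaining.get(userId) ?? 0;
  }

  // Check and deduct in one synchronous step, so two concurrent tasks
  // cannot both pass the check against the same remaining budget.
  reserve(userId: string, tokens: number): boolean {
    const left = this.balance(userId);
    if (tokens > left) return false;
    this.remaining.set(userId, left - tokens);
    return true;
  }

  // Give back whatever part of the reservation was not used
  // (or charge the overshoot if the run used more than reserved).
  settle(userId: string, reserved: number, used: number): void {
    this.remaining.set(userId, this.balance(userId) + (reserved - used));
  }
}

async function runWithBudget<T extends { tokensUsed: number }>(
  ledger: BudgetLedger,
  userId: string,
  maxTokens: number,
  run: () => Promise<T>
): Promise<T> {
  if (!ledger.reserve(userId, maxTokens)) {
    throw new Error("Budget exceeded for this billing period");
  }
  let result: T;
  try {
    result = await run();
  } catch (error) {
    ledger.settle(userId, maxTokens, 0); // run failed: refund the full reservation
    throw error;
  }
  ledger.settle(userId, maxTokens, result.tokensUsed);
  return result;
}
```

The trade-off is that a reservation temporarily locks up more budget than the task may use, so the size of maxTokens is a tuning decision rather than a given.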

Week 5 — AI Systems Integration

You build the systems that make AI safe and maintainable.

What you’ll build:

Human-in-the-loop workflows
Conversation memory and context management
AI-specific observability
Hallucination detection and handling
Feedback loops for improvement

What you’ll prove:

You can build AI systems that humans can trust
You understand when AI should NOT make decisions
You can debug AI failures systematically

Most AI systems fail because engineers treat AI as autonomous. It’s not.

Week 6 — Defense

This is where you prove everything.

What you’ll do:

Present your complete system to your cohort
Walk through your architecture decisions
Explain your trade-offs
Debug a scenario I throw at you
Answer questions about scaling, security, and failure modes

What you’ll prove:

You understand what you built — not just that you built it
You can explain technical decisions to senior engineers
You can think on your feet when things break

The defense is 60 minutes. I’ll ask you questions like:

“Why did you choose this vector database?”
“What happens if your RAG pipeline returns irrelevant results?”
“How do you handle it when your agent exceeds budget?”
“Walk me through how you’d debug a hallucination in production.”

You either know your system, or you don’t. That defense is your proof. Not a PDF certificate.

Why the Defense Matters

Most AI education optimizes for engagement. This bootcamp optimizes for transformation.

You ship code every week. You get feedback on real systems. You defend your architecture to prove you understand it.

You can’t fake your way through a defense.

You either understand your system, or you don’t. There’s no middle ground. That’s why the certificate from this bootcamp actually means something.

Because it’s backed by proof.

What Happens Next

Join the waitlist at masteringai.dev
This Friday: Free live workshop where I teach the 6-layer framework for AI backend systems, the same framework the bootcamp is built on
After the workshop: Early bird enrollment opens (waitlist members get first access and priority spots)

50 spots. That’s it.

I’m keeping the first cohort small because everyone gets code reviews, everyone gets feedback, and everyone defends their system.

That doesn’t scale to hundreds of students. The waitlist is the only way in. masteringai.dev

Final Thoughts

We are in the AI-first world already.

Every single platform and system you maintain will either be AI-enhanced or completely revamped based on AI solutions.

As backend engineers, our role isn’t just to manage the persistence layer but, most importantly, to architect the intelligence layer itself.

Many of us are still focused on optimizing the foundation while the entire skyscraper (the application) is being redesigned around us.

The engineers who can build AI infrastructure and not just use AI tools will be the ones who command premium salaries and interesting projects.

This bootcamp doesn’t promise you a job. It promises you capability.

What you do with it is up to you.

Upskill now.

→ masteringai.dev

I hope you learned something today. Spread the love. Share this newsletter with at least two of your friends today.

If you have questions about the bootcamp, reply to this email. I read everything.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

3. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

The Hidden Gap: Backend Engineers vs AI Engineers

Solomon Eseme — Wed, 04 Feb 2026 14:11:25 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

I need to tell you about something I’ve been working on quietly for the past few months. I wasn’t sure when I’d share it, but this feels like the right time because many of you have been asking me questions that led me to build this in the first place.

A Conversation That Kept Repeating

A few months ago, I had a call with a backend engineer who had been building APIs and services for about three years. He’s a solid engineer who knows his way around databases, caching, authentication, and all the core backend stuff.

He asked me a simple question:

“Solomon, how do I actually add AI to my backend systems?”

I started explaining some approaches, and then he stopped me and said something that stuck with me:

“I’ve watched tutorials. I can call the OpenAI API. I can get responses. But when I try to put this into a real system with users, errors, costs, and things that break at 2 am, I have no idea what I’m doing.”

That conversation didn’t happen once. It happened again the following week with a different engineer. And then again. And again.

I started paying closer attention to the questions coming into my inbox, the DMs on Twitter, and the conversations in our community. And I realized there was a pattern that I had somehow missed.

Backend engineers are not struggling to understand what AI is. They’re struggling to build production systems that use AI reliably.

The Gap I Kept Seeing

Let me explain what I mean by that, because it took me a while to fully understand it myself.

When most backend engineers approach AI, they go through the same journey. They sign up for OpenAI, get an API key, write a few prompts, and see the magic happen. They send in text prompts, and intelligent text responses come out. It feels like the future.

Then they try to put it into a real application. And suddenly, questions start piling up that no tutorial seems to answer.

What happens when the API is slow or down entirely, and your users are waiting? How do you know how much each user is costing you in API calls? What do you do when the model returns something completely wrong, and that wrong answer is about to go to a customer?

How do you change your prompts without breaking things that were working yesterday? How do you even know if something is broken when the outputs are non-deterministic?

These are not AI questions. These are backend engineering questions. But nobody seems to be teaching them together.

I went looking for resources that approached AI from a backend engineer’s perspective and not from a data scientist’s or ML researcher’s perspective.

I couldn’t find what I was looking for. Most courses assume you want to fine-tune models, understand transformer architecture, or build ML pipelines. That’s valuable knowledge, but it’s not what backend engineers need to ship AI features in production.

What we need is much more practical.

We need to know how to build the infrastructure around AI that makes it reliable, observable, cost-controlled, and maintainable. We need to treat AI like we treat any other external dependency in our systems with proper error handling, fallbacks, monitoring, and controls.
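To make “treat AI like any other external dependency” concrete, here is a minimal sketch of a model call wrapped in a deadline with a fallback. It assumes nothing about a specific vendor SDK: `ModelCall`, `withTimeout`, and `callWithFallback` are hypothetical names, and the fallback could be a cheaper model, a cached answer, or a canned response.

```typescript
// Hypothetical shape of a model call; not any specific vendor SDK.
type ModelCall = (prompt: string) => Promise<string>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Try the primary model under a deadline; on timeout or error, use the fallback.
// The result records which path answered so it can be logged and monitored.
async function callWithFallback(
  prompt: string,
  primary: ModelCall,
  fallback: ModelCall,
  timeoutMs: number,
): Promise<{ text: string; source: "primary" | "fallback" }> {
  try {
    const text = await withTimeout(primary(prompt), timeoutMs);
    return { text, source: "primary" };
  } catch {
    const text = await fallback(prompt);
    return { text, source: "fallback" };
  }
}
```

Note that the timeout only stops waiting; it does not cancel the underlying request. A real client would also pass an abort signal to the call if the SDK supports one.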

What I Started Building

So I started building something to fill this gap.

I didn’t announce it because I wanted to make sure it was actually useful before I talked about it. I’ve been refining it, testing ideas, talking to engineers, and putting together what I believe is the right approach for backend engineers who want to add AI capabilities to their skill set.

It’s not a course where you watch videos and collect a certificate at the end. I’ve seen too many engineers go through programs like that and come out unable to build anything real.

They understand concepts but can’t ship systems.

Instead, what I’ve built is a structured program where you actually build things. Every week, you ship code. Real code that does real things.

By the end, you have a complete AI-powered backend system that you built yourself, and you have to present it and explain your decisions, the same way you would in a senior engineering interview or a design review at work.

The focus is on what I call AI infrastructure.

Things like building RAG pipelines that actually work at scale, with proper chunking strategies, caching, and citation tracking. Building AI agents that have guardrails so they don’t run forever and cost you thousands of dollars.
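As a rough illustration of agent guardrails, the sketch below caps an agent loop by step count and by accumulated cost. The names (`AgentStep`, `runWithGuardrails`, `Limits`) are hypothetical, and cost is tracked in integer cents as an assumption to avoid floating-point drift.

```typescript
// Hypothetical result of one agent step (one model call or tool call).
type StepResult = { done: boolean; output: string; costCents: number };
type AgentStep = (input: string) => Promise<StepResult>;

type Limits = { maxSteps: number; maxCostCents: number };

type RunOutcome =
  | { status: "completed"; output: string; steps: number; costCents: number }
  | { status: "stopped"; reason: "max_steps" | "max_cost"; steps: number; costCents: number };

// Run the agent until it finishes or hits a limit, whichever comes first.
async function runWithGuardrails(
  task: string,
  step: AgentStep,
  limits: Limits,
): Promise<RunOutcome> {
  let input = task;
  let costCents = 0;
  let steps = 0;

  while (steps < limits.maxSteps) {
    const result = await step(input);
    steps += 1;
    costCents += result.costCents;

    // A finished step wins even if it also crossed the cost limit.
    if (result.done) {
      return { status: "completed", output: result.output, steps, costCents };
    }
    if (costCents >= limits.maxCostCents) {
      return { status: "stopped", reason: "max_cost", steps, costCents };
    }
    input = result.output; // feed the intermediate output into the next step
  }

  return { status: "stopped", reason: "max_steps", steps, costCents };
}
```

Returning a typed `stopped` outcome instead of throwing makes the limit a normal, observable result that callers have to handle.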

Implementing human-in-the-loop systems where uncertain outputs get routed to a person for review instead of going straight to your users.
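A human-in-the-loop gate can be as small as a routing function. The sketch below is one possible shape, not a prescribed design: `ModelOutput`, `routeOutput`, and the `0.8` default threshold are assumptions, and the `confidence` score is assumed to come from your own evaluation step, since a model API does not hand you one by default.

```typescript
// Hypothetical shape: `confidence` in [0, 1] is assumed to be produced by your
// own scoring step (an evaluator, validators, retrieval scores, and so on).
type ModelOutput = { text: string; confidence: number };

type Route =
  | { action: "send_to_user"; text: string }
  | { action: "queue_for_review"; text: string; reason: string };

// Decide whether an output can go straight to the user or needs a person.
function routeOutput(output: ModelOutput, threshold: number = 0.8): Route {
  if (!Number.isFinite(output.confidence)) {
    return { action: "queue_for_review", text: output.text, reason: "missing confidence score" };
  }
  if (output.text.trim() === "") {
    return { action: "queue_for_review", text: output.text, reason: "empty output" };
  }
  if (output.confidence < threshold) {
    return {
      action: "queue_for_review",
      text: output.text,
      reason: `confidence ${output.confidence} below threshold ${threshold}`,
    };
  }
  return { action: "send_to_user", text: output.text };
}
```

The important design choice is the default: anything the gate cannot score goes to review rather than to the user.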

Setting up a proper Model Control Plane where you can version your prompts, set up fallbacks when one model fails, and track costs across your entire system.

These are the patterns that separate a demo from a production system. These are the things I wish someone had taught me when I first started integrating AI into backend services.

Why I’m Telling You Now

I’m sharing this now because enrollment is opening soon, and I wanted you to hear about it first.

You’ve been reading Backend Weekly, you’ve been part of this community, and many of you have sent me the exact questions that led me to build this. It felt wrong to announce it publicly without telling you first.

I also want to be honest about who this is for, because it’s not for everyone.

This is for backend engineers who already know how to build things. You should be comfortable with APIs, databases, authentication, and the fundamentals. You don’t need to be a senior engineer, but you should have shipped real backend code before.

If you’re still learning the basics of backend development, this isn’t the right time. Focus on the fundamentals first, and come back to AI later.

But if you’ve been building backends and you’ve been wondering how to add AI to your skill set in a way that actually makes you more valuable and not just someone who can write prompts, but someone who can architect AI systems, then this might be exactly what you’ve been looking for.

What Happens Next

I’ll be sharing full details later this week, including the curriculum, the schedule, and how to join.

If you want to make sure you don’t miss it, just reply to this email with “INTERESTED” and I’ll add you to the early notification list.

I’ll also answer any questions you have directly. Just hit reply and ask.

I’ve been building this quietly for months, and I’m genuinely excited to finally share it with you.

More soon.

I hope you found this useful. If you did, share this newsletter with a friend who's been asking how to get into AI as a backend engineer. There's a good chance they're facing the same gap.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome, Solomon (solomoneseme.com)

How would you design a globally consistent payment ledger?

Solomon Eseme — Sat, 03 Jan 2026 15:26:05 GMT

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Welcome to another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world system design and interview questions.

Before we dive in:

Enrollment for our upcoming AI Bootcamp has started, and the seats are filling up very fast with those who are ready to embrace the AI-first world.

We are in the AI-first world already, where every single platform and system you maintain will either be AI-enhanced or completely revamped based on AI solutions.

As backend engineers, our role isn’t just to manage the persistence layer but, most importantly, to architect the intelligence layer itself.

Many of us are still focused on optimizing the foundation while the entire skyscraper (the application) is being redesigned around us.

Upskill now.

Join the “Becoming a Production-Ready AI Engineer.”

If you have any questions about the bootcamp, please send them to hi [at] masteringbackend.com

This is the MB Interview Series on Backend Weekly, which airs every Saturday.

In this series, I will guide you through answering common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 5 (Episode 4 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a globally consistent payment ledger?”

Here’s how to approach it:

Lastly, we are building the next Interview Prep Playground targeting backend engineers.

Join our MB Interview waitlist: https://tally.so/r/w46glb

Building a FinTech is difficult. In fact, it takes the utmost attention to detail to get it right.

Even when you do:

It will still undergo thorough rounds of testing and auditing to ensure that money doesn’t slip away.

Why?

Building a FinTech is not about the API, CRUD, databases, and storing balances. The true foundation of a successful FinTech is trust and ease of payment.

It’s about correctness under pressure.

The payment ledger is at the heart of every FinTech product because it is the single source of truth for every transaction that happens within the system.

So how do you approach building a globally consistent payment ledger that’s distributed, concurrent, serves millions, and still retains its attribute as the only source of truth?

This is a real challenge. Let’s dive in:

As you already know, a strong backend engineer does not jump into coding without clarifying the requirements. Therefore, go through a few rounds of requirement clarification with your interviewer.

Clarify the Requirements

Before we jump into architecture diagrams and start implementing, we must clarify what must never break.

Discuss this with your interviewer.

For a payment ledger, the core requirements are:

Strong correctness for balances (no money created or lost)
Exactly-once payment application
Global availability across regions
Auditable, append-only history
High write throughput under load

This clarification will give you a clear direction on exactly what to do and what to avoid.

What Is a Payment Ledger?

A payment ledger is not a balance table. It is an immutable, append-only record of financial events.

Balances are derived data. The ledger is the source of truth.

Once you internalize this, the rest of the design becomes much clearer.

A payment ledger stores the record of all the transactions that have ever occurred on your system.

It doesn’t matter if it’s a failed, successful, pending, or reversed transaction; everything is stored exactly once in the ledger.

Now that the requirements are out of the way, let’s define the architecture that will solve this problem for us.

High-Level Architecture

A production-grade payment ledger is typically composed of these core components:

Payment API Layer
Ledger Write Service
Immutable Transaction Log
Balance Service

Each component has a very specific responsibility, and mixing them is how systems fail.

This is a good time to talk through the responsibilities of each component with your interviewer, and how the components will interact with each other.

Let’s break them down:

Payment API Layer

The gatekeeper of correctness in a payment ledger system. The Payment API Layer is the front door to your payment system.

At a high level, the Payment API Layer does five critical things:

Accepts payment intent from clients
Validates requests (auth, schema, limits, currency rules)
Assigns an idempotency key
Forwards the request to the Ledger Write Service
Returns a commit result (or a safe retry response)

It is intentionally thin, stateless, and strict.

Here’s a simple implementation of the payment API layer:

import { Request, Response } from "express";
import { v4 as uuid } from "uuid";
import { redis } from "./redis";
import { publishToLedger } from "./ledger";

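export async function createPayment(req: Request, res: Response) {
  const { fromAccount, toAccount, amount, currency } = req.body;
  // req.get() returns a single string (or undefined), unlike req.headers[...]
  const idempotencyKey = req.get("idempotency-key") || uuid();

  // 1. Validate input. The typeof check matters: `undefined <= 0` is false,
  //    so a missing amount would otherwise slip through.
  if (
    !fromAccount ||
    !toAccount ||
    !currency ||
    typeof amount !== "number" ||
    !Number.isFinite(amount) ||
    amount <= 0
  ) {
    return res.status(400).json({ error: "Invalid payment request" });
  }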

  // 2. Check idempotency
  const existing = await redis.get(`idem:${idempotencyKey}`);
  if (existing) {
    // Return the previous result safely
    return res.status(200).json(JSON.parse(existing));
  }

  // 3. Build payment intent
  const paymentIntent = {
    intentId: uuid(),
    fromAccount,
    toAccount,
    amount,
    currency,
    idempotencyKey,
    createdAt: Date.now(),
  };

  // 4. Send to ledger writer
  const result = await publishToLedger(paymentIntent);

  // 5. Store result for idempotency replay
  await redis.set(
    `idem:${idempotencyKey}`,
    JSON.stringify(result),
    { EX: 60 * 60 } // 1 hour
  );

  return res.status(201).json(result);
}

Ledger Write Service

The single authority that decides what becomes financial truth. If the Payment API Layer captures intent, the Ledger Write Service decides truth.

This service is the only component allowed to write to the ledger.

Below are the core responsibilities of the ledger write service:

Enforce Exactly-Once Writes: It must guarantee that the same transaction is never applied twice, even under retries. This is enforced using idempotency keys, transaction uniqueness constraints, and atomic writes.
Append to an Immutable Ledger: The ledger is append-only, ordered, never mutated, and never deleted. If something is wrong, you append a compensating entry; you don’t fix history.
Guarantee Ordering (Per Account): Payments touching the same account must be processed in order.
Emit Deterministic Events: Every ledger entry must be fully deterministic, replayable, and self-contained. This enables recovery, audits, and balance rebuilding.

Below is a simple structure of a Ledger:

{
  "ledgerEntryId": "ledg_01HXYZ",
  "transactionId": "txn_123",
  "idempotencyKey": "idem_abc",
  "accountId": "acct_789",
  "amount": -5000,
  "currency": "USD",
  "type": "DEBIT",
  "createdAt": 1736359200
}
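To illustrate “append a compensating entry; you don’t fix history”, here is a small sketch that reverses a debit by appending its mirror image. The `reversesEntryId` field and the `buildCompensatingEntry` helper are hypothetical additions for this example, and an in-memory array stands in for the ledger.

```typescript
type EntryType = "DEBIT" | "CREDIT";

// Mirrors the entry structure above, plus a hypothetical `reversesEntryId`
// field that links a correction to the entry it cancels.
type LedgerEntry = {
  ledgerEntryId: string;
  transactionId: string;
  accountId: string;
  amount: number; // negative for debits, positive for credits
  currency: string;
  type: EntryType;
  reversesEntryId?: string;
};

// Build the entry that cancels `original`. The original is never modified.
function buildCompensatingEntry(original: LedgerEntry, newEntryId: string): LedgerEntry {
  return {
    ledgerEntryId: newEntryId,
    transactionId: original.transactionId,
    accountId: original.accountId,
    amount: -original.amount,
    currency: original.currency,
    type: original.type === "DEBIT" ? "CREDIT" : "DEBIT",
    reversesEntryId: original.ledgerEntryId,
  };
}

// In-memory stand-in for the append-only ledger.
const ledger: LedgerEntry[] = [];

const wrongDebit: LedgerEntry = {
  ledgerEntryId: "ledg_01HXYZ",
  transactionId: "txn_123",
  accountId: "acct_789",
  amount: -5000,
  currency: "USD",
  type: "DEBIT",
};

ledger.push(wrongDebit);
ledger.push(buildCompensatingEntry(wrongDebit, "ledg_01HXZA"));

// Both entries stay in history; together they net out to zero.
const netEffect = ledger.reduce((sum, entry) => sum + entry.amount, 0);
```

An auditor can still see the mistake and its correction, which is the point of not editing the original row.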

Here’s a sample code snippet implementing a ledger system:

import { db } from "./db";

export async function appendLedgerEntry(entry: LedgerEntry) {
  return db.transaction(async (tx) => {
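    // 1. Enforce idempotency: if this key was already written, return that entry.
    //    Two concurrent retries can both miss this lookup, so this assumes a
    //    UNIQUE constraint on idempotencyKey as the final guard.
    const existing = await tx.ledger.findUnique({
      where: { idempotencyKey: entry.idempotencyKey }
    });

    if (existing) {
      return existing; // duplicate request: return the original entry, write nothing
    }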

    // 2. Append-only write
    const written = await tx.ledger.create({
      data: {
        transactionId: entry.transactionId,
        idempotencyKey: entry.idempotencyKey,
        accountId: entry.accountId,
        amount: entry.amount,
        currency: entry.currency,
        type: entry.type,
      }
    });

    return written;
  });
}

Immutable Transaction Log

The permanent record of financial truth. The Immutable Transaction Log is the heart of a payment ledger system.

It is the place where every financial event is written once and never changed.

If balances are wrong, services crash, or caches are wiped, the immutable log is the one record you can rebuild everything from.

Why Is an Immutable Log Required?

Money systems have non-negotiable constraints:

You must explain every cent
You must prove how a balance was formed
You must recover from bugs and outages
You must support audits years later

Balance Service

If you want to show users their balances, computing them directly from the ledger every time would mean:

Scanning thousands (or millions) of entries
High latency
Expensive reads

So we derive balances instead:

Precompute balances per account
Store them in optimized tables or caches
Rebuild them anytime from the ledger if needed

This gives you speed without sacrificing the correctness of the ledger service.

Here’s a sample code snippet:

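import { db } from "./db";

// Trimmed view of a ledger entry: only what the balance projection needs.
type LedgerEntry = {
  accountId: string;
  amount: number; // positive (credit) or negative (debit)
  offset: number; // position in the ledger log; used to skip entries already applied
};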

async function applyLedgerEntry(entry: LedgerEntry) {
  await db.transaction(async (tx) => {
    const balance = await tx.balance.findUnique({
      where: { accountId: entry.accountId },
    });

    if (balance && balance.lastOffset >= entry.offset) {
      // Already applied → idempotent
      return;
    }

    const newBalance = (balance?.balance ?? 0) + entry.amount;

    await tx.balance.upsert({
      where: { accountId: entry.accountId },
      update: {
        balance: newBalance,
        lastOffset: entry.offset,
        updatedAt: new Date(),
      },
      create: {
        accountId: entry.accountId,
        balance: newBalance,
        lastOffset: entry.offset,
        updatedAt: new Date(),
      },
    });
  });
}

Now that the architecture is out of the way, let’s explore the primary flow of a globally consistent ledger system:

Primary Payment Flow

Here’s the primary flow you should describe in an interview:

Client submits a payment intent
API validates the request and assigns an idempotency key
Ledger service appends the transaction to the immutable log
Balance service updates derived account state
Client receives a commit result

Notice something important:

The balance update is not the primary operation. The ledger append is.

Everything we have discussed so far covers the architecture and shows your engineering prowess. Now is the time to impress your interviewer by walking them through how you intend to scale the system and keep it reliable.

Reliability & Correctness Guarantees

To impress your interviewer, here’s what to tell them when you want to discuss reliability and correctness.

Exactly-once semantics: Achieved via idempotency keys. Duplicate requests result in the same committed transaction.
Atomic ledger append: A transaction is either fully recorded or not recorded at all. No partial writes.
Deterministic replay: If balances are corrupted, you can replay the ledger from genesis. This is critical for recovery and audits.
Read-after-write consistency (per account): Once a transaction commits, reads for that account must reflect it.
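Deterministic replay is easier to discuss with a concrete shape in mind. The sketch below rebuilds every balance from the log, assuming each entry carries a unique, increasing `offset` (a hypothetical global sequence number) so that ordering and duplicate delivery can be handled without any other state.

```typescript
// Hypothetical log record; `offset` is assumed unique across the whole log.
type LoggedEntry = { offset: number; accountId: string; amount: number };

// Rebuild all balances from genesis. Same log in, same balances out,
// regardless of the order or number of times entries were delivered.
function replayBalances(log: readonly LoggedEntry[]): Map<string, number> {
  const ordered = log.slice().sort((a, b) => a.offset - b.offset);
  const balances = new Map<string, number>();
  let lastApplied = Number.NEGATIVE_INFINITY;

  for (const entry of ordered) {
    if (entry.offset <= lastApplied) continue; // same offset seen twice: apply once
    balances.set(entry.accountId, (balances.get(entry.accountId) ?? 0) + entry.amount);
    lastApplied = entry.offset;
  }

  return balances;
}

// Out-of-order delivery with one duplicate (offset 2 appears twice).
const sampleLog: LoggedEntry[] = [
  { offset: 2, accountId: "acct_a", amount: -200 },
  { offset: 1, accountId: "acct_a", amount: 1000 },
  { offset: 3, accountId: "acct_b", amount: 200 },
  { offset: 2, accountId: "acct_a", amount: -200 },
];

const rebuilt = replayBalances(sampleLog);
```

Because the function has no side effects and no dependence on wall-clock time, running it twice over the same log is a cheap way to test the “deterministic” claim.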

To successfully build a reliable ledger system, these points are non-negotiable in a Fintech product.

Scaling the Ledger

Scaling payments is not about adding more databases. It’s about isolating contention. Discuss this with your interviewer and share your opinion on scaling the system.

Here are some scaling strategies you can adopt:

Horizontal scaling at the API and ledger writers.
Account-based sharding of the ledger.
Stateless API tier.
Stateful storage isolated behind shard boundaries.

Each account (or group of accounts) maps to a shard, ensuring:

No cross-account locking.
Predictable write performance.
Clear ownership of the state.
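One simple way to map accounts to shards is a stable hash of the account ID. The sketch below is an assumption-laden example rather than a recommendation: `shardFor` is a hypothetical helper, and the hash is an FNV-1a style function over UTF-16 code units, which is adequate for ASCII account IDs.

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
// Stable across processes and restarts, which is what routing needs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Map an account to one of `shardCount` shards. The same account always
// lands on the same shard, so all of its writes are ordered in one place.
function shardFor(accountId: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount <= 0) {
    throw new Error("shardCount must be a positive integer");
  }
  return fnv1a(accountId) % shardCount;
}
```

Plain modulo hashing moves most accounts when `shardCount` changes, so systems that expect to reshard usually put a lookup table or consistent hashing in front of it. Note also that a transfer between accounts on different shards is no longer a single local write, which is worth raising with your interviewer.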

Observability and Monitoring

A payment system that can’t be observed is a liability.

You must monitor:

Ledger append latency
Ledger write error rates
Balance divergence checks
Shard lag and replay backlog
Structured logs with transaction IDs

Alerts should fire before balances drift, not after customers complain.
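A balance divergence check can be sketched as a comparison between the stored balances and the balances derived from the ledger. In this hypothetical example, `findBalanceDrift` takes both as maps; how the derived map is produced (for instance, by replaying the log) is assumed to happen elsewhere.

```typescript
type Drift = { accountId: string; stored: number; derived: number; delta: number };

// Compare stored balances with balances derived from the ledger.
// Accounts missing on either side are treated as having a balance of 0.
function findBalanceDrift(
  stored: ReadonlyMap<string, number>,
  derived: ReadonlyMap<string, number>,
): Drift[] {
  const accountIds = new Set<string>([...stored.keys(), ...derived.keys()]);
  const drift: Drift[] = [];

  for (const accountId of accountIds) {
    const storedBalance = stored.get(accountId) ?? 0;
    const derivedBalance = derived.get(accountId) ?? 0;
    if (storedBalance !== derivedBalance) {
      drift.push({
        accountId,
        stored: storedBalance,
        derived: derivedBalance,
        delta: storedBalance - derivedBalance,
      });
    }
  }

  // Sort for stable output in logs and alerts.
  return drift.sort((a, b) => (a.accountId < b.accountId ? -1 : a.accountId > b.accountId ? 1 : 0));
}
```

Any non-empty result is worth an alert, since the ledger is the source of truth and a mismatch means the derived state is wrong.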

This is also a good time to discuss with your interviewer what metrics are important for them, while you also share your insights on what to monitor.

Final Answer

“I’d design the system around an append-only ledger with account-level sharding. All payment writes go through an immutable transaction log, while balances are derived separately. Exactly-once semantics are enforced via idempotency keys, and recovery is handled through deterministic replay. The system scales horizontally through shard isolation and remains observable through ledger-level metrics and consistency checks. This allows it to support global payment volume without sacrificing correctness or auditability.”

Final Thoughts

Designing a globally consistent payment ledger isn’t about frameworks.

It’s about discipline.

Discipline in separating the source of truth from derived data
Discipline in enforcing invariants
Discipline in trading convenience for correctness

If you understand this system, you understand real backend engineering.

And the next time an interviewer asks this question, you won’t hesitate — you’ll walk them through it calmly and confidently.

Join our AI Engineering Bootcamp

The answer to the question “Will AI Kill Backend Engineers?” boils down to this: engineers who embrace AI will replace backend engineers who do not.

If you’re ready to LEARN AI, EMBRACE AI. We are launching a 6-week AI Bootcamp on “Become a Production-Ready AI Engineer.”

We are backend engineers, and building production-ready systems has been our core skill. Learn exactly how to do the same as an AI Engineer in the AI-first world.

Join here: Become a Production-Ready AI Engineer

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

Will AI kill Backend Engineers?

Solomon Eseme — Sat, 20 Dec 2025 17:24:16 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

If this newsletter was shared with you, consider subscribing here:

Subscribe now

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

Enrollment for our upcoming AI Bootcamp has started, and the seats are filling up very fast with those who are ready to embrace the AI-first world.

We are heading toward an AI-first world, where every single platform and system you maintain will either be AI-enhanced or completely revamped based on AI solutions.

As backend engineers, our role isn’t just to manage the persistence layer but, most importantly, to architect the intelligence layer itself.

Many of us are still focused on optimizing the foundation while the entire skyscraper (the application) is being redesigned around us.

Upskill now.

Click here to join “Becoming a Production-Ready AI Engineer.”

If you have any questions about the bootcamp, please send them to hi [at] masteringbackend.com

This is the AI for Developers Series on Backend Weekly, which airs every Monday.

In this series, I will guide you through building with AI and becoming a production-ready backend engineer.

Let’s get started with episode 1

My question is very simple and yet very difficult to answer.

“Will AI kill Backend Engineers?”

The shortest answer to this question is: YES

Why?

When you think about it from every point of view, AI is here to stay, and the only direction it is moving in is continuous improvement.

The world will never go back to a time before AI. The improvements will simply keep coming.

So with these new improvements come new methods and patterns for backend engineers.

It will get to a stage where AI will be capable of doing a lot more than it’s currently capable of.

Therefore, a backend engineer who ignores the following points that I will share in this episode will be replaced.

Embrace Fundamentals
Have a plan
Learn AI
Embrace AI

Embrace Fundamentals

AI has disrupted many industries. In fact, every industry.

However, in the software industry, we have seen two ways in which AI has affected how we work. Let me break them down.

Building with AI
Building for AI

Building with AI

This first approach is for engineers and businesses who are using AI builder tools to plan, build, test, and scale their software applications.

For example, using Cursor and all these coding assistants to build and develop applications.

This category also includes non-technical builders who rely on prompts alone and let the AI build their tools completely.

Not long ago, none of this existed. However, this new way is here to stay, and it will continue to evolve.

It may get to a point where AI can execute the entire SDLC without any additional input from humans, except to pass in more prompts.

We all can agree that it’s a possible future.

Now, let’s talk about the next category.

Building for AI

This category opens up a new pattern, a new methodology, and ways engineers can utilise their existing skills and push beyond boundaries to stay relevant.

In this category, you build tools that allow AI to function properly based on business logic.

For example, if company A is into e-commerce, it can decide to integrate AI into its processes. So they will call an AI Engineer to overhaul the engineering part of the integrations.

That’s where you come in:

You are called to use the fundamentals that you have as a backend software engineer to build for AI. Build tools that AI will use to understand their business, perform their business better, connect their business to other AI businesses, etc.

Above all:

Whichever category you belong to, you need the fundamentals of backend engineering, at least for now, while AI still relies heavily on human input.

So that’s why I focus on the fundamentals. These are some of the questions I ask myself:

Which technology is the best for this?
How do I break down this task so a junior dev can understand it?
What is the best way to build this?
Where do I need AI help?
What database is the best for this kind of product?
What is this principle of software engineering, and why does it matter?
etc.

Questions like these help you make informed decisions. When you can articulate and document your processes for the AI, you will see it do wonders you can’t imagine.

But first, it has to come from you.

Have a plan

AI is coming for everybody. Including YOU.

Don’t say:

“AI can’t replace or kill backend engineers, because ……. (add your reason).”

Things change. That’s the only constant thing.

Instead, have a plan.

To create a better plan for yourself, ask yourself these questions and be honest with your answers to yourself.

What do I do now that AI is here?
How do I remain relevant?
What are others in my industry doing now?
What is the new way forward?
What do I need to learn to level up?
etc

How many of us have seen firsthand how an industry gets completely wiped out by time and change?

Let me take you down memory lane.

Remember COBOL?
Remember QBasic?
Remember Visual Basic (or VB.Net)?

Even in backend engineering, you can easily see lots of changes that have happened, from how we used to build PHP applications back in the day to the Laravel days.

In frontend engineering, too, we went from building with `index.html`, `style.css`, and `script.js` to Angular, React, Vue, and many others.

This new AI-first world is the same kind of change that has always been happening, so you need to level up and move along with it so you don’t become obsolete.

Imagine if there’s someone who still uses procedural PHP and still builds backend systems that way.

It will still work, but come on, man.

So now that you have a plan and see the relevance of joining the AI race, you need to LEARN AI.

Learn AI

There’s no sugar-coating anything. You need to invest money, time, energy, and all you’ve got to level up.

Let me tell you why it’s worth it.

Think about the money, time, energy, and everything else you invested to learn backend engineering up to this point.

Be honest with yourself. Has it paid off or not?

Most of us know that it has paid off. Even if your career in backend engineering has not started yet, deep down you know you’re in a career of great importance, and all you need is one opportunity.

You need to do the same in this new AI-first world. You need to invest all you’ve got to climb the ladder again, as others are doing it.

So put the plan you have into action, start a new course next year, join a bootcamp, invest your time into this new paradigm shift, and wait for the opportunity to present itself.

If you don’t know where to start, here’s a good roadmap you can follow:

Click here to access the roadmap.

Also, Mastering Backend is hosting a webinar next week to answer all your questions and honestly point you in the right direction to become a production-ready AI Engineer.

Let me add this:

The problem with most backend engineers is not learning because, as backend engineers, we learn for a living.

The main problem is embracing it.

Embrace AI

For some engineers, embracing AI is like allowing someone else to win the “React is better than Vue” war.

It’s that hard.

However, to stand out, you need to let in the new normal. If it takes you 2 hours to finish a particular task, with a well-trained colleague, it may take you half the time.

AI is your new colleague now; this time, you have power over it. You can describe exactly what you want this new colleague to do, and they will do it.

You can instruct it on what not to do and how to do things, and that’s exactly what you will get.

Recently, I was building a fintech product, and I used AI to completely understand financial terms and how to relate or calculate them in backend engineering.

I’m a software engineer and not a banker, so how did I build a complete banking application with all the calculations?

My AI colleague not only has experience in building software, but also has experience in finance. We combined our superpowers and delivered.

You have probably experienced something like this at some point.

Therefore, you need to let go of the old way and embrace this new way of building systems.

Summary

All of this boils down to one thing:

The answer to the question “Will AI Kill Backend Engineers?” is that engineers who embrace AI will replace backend engineers who do not.

If you’re ready to LEARN AI and EMBRACE AI, we are launching a 6-week AI Bootcamp on “Become a Production-Ready AI Engineer.”

We are backend engineers, and building production-ready systems has been our core skill. Learn exactly how to do the same as an AI Engineer in the AI-first world.

Join here: Become a Production-Ready AI Engineer

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 3 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

How would you design a distributed job queue system

Solomon Eseme — Sat, 06 Dec 2025 15:15:17 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

If this newsletter was shared with you, consider subscribing here:

Subscribe now

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

You’re in a backend interview.

They ask:

“How would you design a modern backend system in this AI era?”

Here’s how to approach it:

Shortest answer:

I will use Blackbox.ai

It generates production-ready code fast.
It runs in parallel, where different models work on my tasks, and the best one is picked.
It reads my entire codebase and understands it, makes informed decisions, and gets the work done.
It implements tasks on a production level with very detailed planning, high precision execution, and a thorough testing phase.

Check it out here: Blackbox AI.

This is the MB Interview Series on Backend Weekly every Saturday.

In this series, I will guide you through answering common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 4 (Episode 3 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a distributed job queue system that supports retries, prioritization, and horizontal scaling?”

Here’s how to approach it:

Before we dive in, we are building the next Interview Prep Playground targeting backend engineers.

Join our MB Interview waitlist: https://tally.so/r/w46glb

What does it really mean to build a distributed job queueing system? Almost every backend engineer reading this has built a queueing system at some point.

But building a distributed queueing system that works reliably across all your distributed servers will definitely be worth the challenge.

The first thing to do when you hit this interview question is to clarify the requirements with your interviewer.

Let’s do just that:

Clarify Requirements

First, let’s start with what a queueing system is:

A queueing system is an integral part of a distributed system that helps in managing the flow of tasks between different parts of the application.

Here’s a quick illustration:

User A triggers an action (e.g., uploads a file).
Service B receives the request and creates a background job.
The job is pushed into the Queue.
A Worker pulls the next job from the queue.
The Worker processes the job and writes the result to the Database.

Simple, right?

However, for our interview question, below are some of the requirements. Your distributed queueing system must:

Process background jobs reliably
Retry failed jobs with exponential backoff
Support job prioritization (high/medium/low)
Scale horizontally across workers
Be idempotent
Offer strong observability (retries, failures, latency)

Now that the requirements are out of the way, let’s define the architecture that will solve this problem for us.

Core Components of the System

Ideally, a production-ready distributed queue is built from these different parts that come together.

Producers
Queue Service
Workers/Consumers
Storage

Producers

Producers are APIs or services that submit jobs to the system. These jobs include metadata such as priority, delay, retries, payload, etc.

import { randomUUID } from "node:crypto";
import { createClient } from "redis"; // could be Kafka or RabbitMQ

const redis = createClient();
await redis.connect(); // node-redis v4+ needs an explicit connect

export async function enqueueJob(job: unknown, priority: "high" | "medium" | "low") {
  const jobData = {
    id: randomUUID(),
    payload: job,
    retries: 0,
    createdAt: Date.now()
  };

  // Each priority level has its own Redis list
  await redis.lPush(`queue:${priority}`, JSON.stringify(jobData));

  console.log("Job added:", jobData.id);
}

// Usage

await enqueueJob({ email: "user@example.com" }, "high");

Queue Service

A queue service can be handled by:

Redis (Lists, Sorted Sets, Streams)
RabbitMQ (priority queues, routing keys)
Kafka (partitioned, scalable pipelines)

It stores enqueued jobs and ensures ordering and durability.

Workers/Consumers

Workers pull jobs, execute business logic, ACK success or NACK failure, trigger retries, and emit metrics and logs. Workers must be stateless for horizontal scaling.

Below is pseudocode showing how most queue workers are implemented: just an infinite loop waiting for jobs to process.

You should replace this code with RabbitMQ or Kafka implementations if you’re using any of them.

import { createClient } from "redis";

const redis = createClient();
await redis.connect(); // node-redis v4+ needs an explicit connect

// Provided by your application; declared here so the snippet type-checks
declare function sendEmail(payload: unknown): Promise<void>;

async function processQueue(priority: string) {
  while (true) {
    const job = await redis.rPop(`queue:${priority}`);
    if (!job) {
      // Queue is empty: wait briefly before polling again
      await new Promise(r => setTimeout(r, 500));
      continue;
    }

    const data = JSON.parse(job);

    try {
      console.log("Processing:", data.id);

      // Your business logic here
      await sendEmail(data.payload);

      console.log("SUCCESS:", data.id);
    } catch (err) {
      data.retries += 1;

      if (data.retries > 3) {
        await redis.lPush("queue:deadletter", JSON.stringify(data));
        console.log("Moved to DLQ:", data.id);
      } else {
        const delay = 2 ** data.retries * 1000;
        console.log(`Retrying ${data.id} in ${delay}ms`);

        // Note: the job lives only in this worker's memory until the timer fires
        setTimeout(() => {
          redis.lPush(`queue:${priority}`, JSON.stringify(data)).catch(console.error);
        }, delay);
      }
    }
  }
}

processQueue("high");

Storage

You can spin up a Postgres, Redis, or Elasticsearch instance to store the results from each job that is performed. You should store the following information for metrics and debugging purposes:

Job history
Retry attempts
Failures
Dead-letter queue entries

That’s all you need to build a distributed queueing system. If you get this right, then your queueing system is already 90% done.

Let me walk you through a simple job lifecycle, which is critical to how you design your queueing system, and answer the interview question properly:

The Job Lifecycle

Every Job that your queue system processes will pass through this lifecycle, so it’s important to understand everything that goes into it and where to plug in other services.

A producer submits the job
Queue enqueues the job with metadata
A worker/consumer pulls the job
A worker processes the job
Worker ACKs (Acknowledge) if successful
Worker requeues with backoff if it fails
Permanently failed jobs go to a Dead Letter Queue (DLQ)

This simple lifecycle is the backbone of every large-scale queueing system.

Retries and Backoff in Queues

Retries are the backbone of every queueing system. It’s an important feature to discuss with your interviewer, and you should take great care in implementing a good retry and backoff strategy so that it doesn’t hammer the system.

Always use an exponential backoff:

2s → 4s → 8s → 16s → …
Store retry metadata: count, next-run timestamp
Log every retry for observability
After N failures → send job to DLQ

DLQs are essential for debugging failures without blocking the queue.

From the code snippet above:

  const delay = 2 ** data.retries * 1000;
  console.log(`Retrying ${data.id} in ${delay}ms`);

  setTimeout(() => {
    redis.lPush(`queue:${priority}`, JSON.stringify(data));
  }, delay);

Then, after N failures:

  await redis.lPush("queue:deadletter", JSON.stringify(data));
  console.log("Moved to DLQ:", data.id);
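The retry rules listed in this section (exponential backoff, retry count, next-run timestamp, DLQ after N failures) can be sketched as one pure function. This is a minimal sketch, not a definitive implementation: `decideRetry`, `RetryableJob`, and the constants are hypothetical names, and the delay cap is an added assumption that the original snippet does not have.

```typescript
// Hypothetical job shape for illustration; field names are assumptions.
interface RetryableJob {
  id: string;
  retries: number;
}

type RetryDecision =
  | { action: "requeue"; retries: number; nextRunAt: number }
  | { action: "dead-letter"; retries: number };

const MAX_RETRIES = 3;       // the "N" after which a job goes to the DLQ
const BASE_DELAY_MS = 1000;  // with doubling: 2s, 4s, 8s, ...
const MAX_DELAY_MS = 60_000; // assumed cap so delays never grow without bound

export function decideRetry(job: RetryableJob, now: number): RetryDecision {
  const retries = job.retries + 1;
  if (retries > MAX_RETRIES) {
    return { action: "dead-letter", retries };
  }
  const delay = Math.min(2 ** retries * BASE_DELAY_MS, MAX_DELAY_MS);
  // nextRunAt is the "next-run timestamp" retry metadata
  return { action: "requeue", retries, nextRunAt: now + delay };
}
```

Keeping the decision separate from Redis calls makes it easy to unit test, and the returned `nextRunAt` can be used as the score if you schedule retries through a sorted set.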

Reliability & Idempotency

Lastly, idempotency matters whenever you implement retries and backoff, because a retried job may run more than once. Making jobs idempotent keeps the system reliable:

Always store processed job IDs in Redis:

await redis.set(`processed:${job.id}`, "1", { EX: 3600 });

Check before processing:

const exists = await redis.exists(`processed:${job.id}`);
if (exists) return; // skip duplicate
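One caveat with checking and then setting in two steps: two workers can both pass the check before either one writes the key. A common way to close that window is a single set-if-absent operation (in Redis, `SET` with the `NX` and `EX` options). The sketch below uses an in-memory stand-in so it runs without Redis; `ProcessedStore` and `claimJob` are hypothetical names.

```typescript
// In-memory stand-in for Redis `SET key value NX EX ttl`. Assumption: in
// production this is one Redis command, so it is atomic across workers.
class ProcessedStore {
  private expiry = new Map<string, number>();

  // Returns true only for the first caller within the TTL window.
  setIfAbsent(key: string, ttlMs: number, now: number): boolean {
    const expiresAt = this.expiry.get(key);
    if (expiresAt !== undefined && expiresAt > now) return false;
    this.expiry.set(key, now + ttlMs);
    return true;
  }
}

// A worker processes the job only if it wins the claim.
export function claimJob(store: ProcessedStore, jobId: string, now: number): boolean {
  return store.setIfAbsent(`processed:${jobId}`, 3600 * 1000, now);
}
```

With node-redis, the equivalent call would be `redis.set(key, "1", { NX: true, EX: 3600 })`, which resolves to `null` when the key already exists. Note the trade-off: claiming before processing means a worker that crashes mid-job leaves the job marked as processed until the TTL expires.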

Job Prioritization

Job prioritization is another important feature of a queueing system that you should discuss with your interviewer. It determines the next job to be processed.

If you have 3 jobs to be executed in a queue, you can use the default “First-in, First-Out” principle of a queue.

You can also decide to use Priority Queues by assigning an order of priority to each job, and each job will be processed based on its priority in the queue.

Here’s an example:

queue:high
queue:medium
queue:low

In JavaScript, create a simple array that lists the queues in priority order:

const PRIORITY_ORDER = ["high", "medium", "low"];

Then start a worker on a queue by its priority:

processQueue(PRIORITY_ORDER[0]);

Also note that RabbitMQ supports priority queues natively, while with Redis you can use a sorted set with a score such as priority + timestamp.
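Starting a worker on `PRIORITY_ORDER[0]` only drains the high queue. One way to let a single worker honor all three queues is to check them in order on every poll. Here is a minimal sketch using in-memory arrays in place of the Redis lists; `takeNextJob` and `Queues` are hypothetical names.

```typescript
type Priority = "high" | "medium" | "low";
const PRIORITY_ORDER: Priority[] = ["high", "medium", "low"];

// In-memory stand-in for the three Redis lists (queue:high, queue:medium, queue:low).
type Queues = Record<Priority, string[]>;

// Returns the next job from the highest-priority non-empty queue, or null.
export function takeNextJob(queues: Queues): { priority: Priority; job: string } | null {
  for (const priority of PRIORITY_ORDER) {
    const job = queues[priority].shift(); // FIFO within a priority level
    if (job !== undefined) return { priority, job };
  }
  return null;
}
```

Strict ordering like this can starve the low queue when high-priority jobs keep arriving, which is why weighted consumption is worth mentioning to your interviewer. In Redis, `BRPOP` accepts several keys and checks them in the order given, which gives the same behavior in one command.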

Delayed & Scheduled Jobs

Sometimes, your queue needs to delay some jobs intentionally for some tasks, such as:

Sending emails after X minutes
Scheduling tasks
Backoff logic

In cases like this, use:

Redis Sorted Sets (score = future timestamp)
A scheduler worker pulls when score <= now
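The sorted-set approach above can be sketched with an in-memory structure that keeps jobs ordered by their run time. This is an illustration under assumptions: `DelayedQueue` and `ScheduledJob` are hypothetical names, and in a real deployment the array would be a Redis sorted set (for example `ZADD` to schedule, then `ZRANGEBYSCORE` and `ZREM` to pull due jobs).

```typescript
interface ScheduledJob {
  id: string;
  runAt: number; // epoch ms; plays the role of the sorted-set score
}

// In-memory stand-in for a Redis sorted set.
export class DelayedQueue {
  private jobs: ScheduledJob[] = [];

  schedule(id: string, runAt: number): void {
    this.jobs.push({ id, runAt });
    this.jobs.sort((a, b) => a.runAt - b.runAt); // keep ordered by score
  }

  // Removes and returns every job whose score <= now, earliest first.
  popDue(now: number): ScheduledJob[] {
    const firstNotDue = this.jobs.findIndex((j) => j.runAt > now);
    const count = firstNotDue === -1 ? this.jobs.length : firstNotDue;
    return this.jobs.splice(0, count);
  }
}
```

A scheduler worker would call `popDue(Date.now())` on an interval and push each returned job onto the regular queue for its priority.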

Horizontal Scalability

When implementing your workers, always make sure that:

Workers are stateless
All state lives in Redis
Any worker can pick any job
You can auto-scale using CPU/memory/queue depth

This helps you to scale workers by:

Adding more worker nodes
Using Redis Streams or Kafka for partitioning
Using consumer groups for shared consumption
Using distributed locks to prevent double-processing

With this in place, throughput scales roughly linearly as you add workers.

This is how companies like Netflix, Shopify, and Uber scale job systems.

Observability & Monitoring

You can’t manage what you can’t see.

So discuss this with your interviewer to understand which metrics are important to measure or track.

Below are some metrics to track and the tools to use:

Track:

Queue sizes
Processing latency
Retry counts
Worker failures
DLQ counts
Overall throughput

Use tools such as:

Prometheus
OpenTelemetry
Grafana dashboards

Logs must include the job ID, worker ID, payload, duration, and retry count.
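As a rough illustration of that log shape, here is a small sketch that emits one JSON line per job. `JobLogEntry` and `formatJobLog` are hypothetical names, and the `status` and `timestamp` fields are added assumptions on top of the fields listed above.

```typescript
interface JobLogEntry {
  jobId: string;
  workerId: string;
  payload: unknown;
  durationMs: number;
  retryCount: number;
  status: "success" | "failure"; // assumed extra field
}

// Serializes one log line as JSON so log tooling can index each field.
export function formatJobLog(entry: JobLogEntry, at: Date): string {
  return JSON.stringify({ timestamp: at.toISOString(), ...entry });
}
```

If payloads can contain personal data, consider logging only an identifier or a redacted copy instead of the full payload.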

Final Answer

“I’d build a distributed job queue using Redis Streams or Kafka. Producers submit jobs with metadata like priority, delay, and retries. Workers consume jobs using consumer groups, remain stateless, and retry via exponential backoff. Failed jobs go to a DLQ. The system scales horizontally by adding workers, supports prioritization via multiple queues or weighted consumption, and exposes observability metrics for retries, queue size, and latency.”

Designing a distributed job queue sounds simple — but at scale, it becomes a deep dive into:

concurrency
ordering
durability guarantees
backoff scheduling
partitioning
idempotency
observability
fault tolerance

These are the kinds of systems that separate junior from senior backend engineers.

Master this, and you’ll stand out in every interview.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

Stay awesome,
Solomon (solomoneseme.com)

How would you design an authentication system for a large-scale web application?

Solomon Eseme — Sun, 23 Nov 2025 12:56:29 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

If you are still stuck in tutorial hell

We’ve all been there—jumping from one Python video to the next, but never building anything real. No portfolio. No confidence. No interviews.

That’s why I created this.

The “Land Your Dream Python Job” Challenge
A 90-day, 3-phase roadmap that helps you:

✅ Build 30 real-world backend projects in 30 days
✅ Master DSA for technical interviews
✅ Get job-ready with resumes, mock interviews & daily job alerts
✅ And finally... land that backend job

This is NOT another course. It’s a challenge. And it works. It’s beginner-friendly.

Over 2,000 Python developers have taken this path—many are now working at top companies.

Only 120 slots left at $54 (then goes up to $100)

Join the challenge & change your future
👉 python30.masteringbackend.com

Let’s get you unstuck.

This is the MB Interview Series on Backend Weekly every Saturday.

In this series, I will guide you through answering common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 4 (Episode 3 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design an authentication system for a large-scale web application?”

Here’s how you should approach it:

The shortest answer to this question will be:

Don’t do it.

Use an existing solution.

However, for learning purposes, we will break down the problem of designing a robust authentication system to help us understand how it works.

Understanding the problem

Building an authentication flow is simple.

However, designing a secure, scalable, and resilient authentication system for a large-scale web application that millions of users rely on every day is a whole new level, and it comes with different challenges.

To understand this properly, let’s detail some of the questions you should ask to give you clarity on how to design the system.

How do users sign up?
How are credentials stored?
How do you manage sessions at scale?
How do mobile apps authenticate differently from browsers?
How do you prevent abuse, attacks, replay, token theft, and session hijacking?
How do you scale auth to millions of requests per day?

To answer these questions, you must break down the system into these 4 pillars of authentication.

Identity (who the user is)
Authentication (verify who they say they are)
Session management (keep them logged in)
Security (protect credentials, tokens, systems)

Once you have identified “who the user is”, you can decide the logic to verify them and keep them logged in while protecting their credentials or tokens.

This is the basic idea of authentication:

Now, how do you manage this on a large scale?

Let’s look at this architecture for a simple, large-scale authentication system.

The High-Level Architecture

A real-world large-scale authentication system is made of:

Authentication Service
Identity Store
Token System
Session Store (optional)
API Gateway
MFA + Device Management

Let’s explore each of these layers to give you a clear picture.

Authentication Service

The authentication service is where your logic happens. It is a backend service written in any server-side language. It handles the business logic of authenticating a user. This service must be stateless, horizontally scalable, and secured behind a gateway.

It handles the following, depending on the use case of the business:

Login
Signup
Token issuance
MFA
Refresh tokens
Password resets
Device tracking

Here’s a simple code snippet for logging in a user:

import { z } from "zod";
// Handlers, prisma, loginSchema, comparePassword, and createToken come from the
// surrounding project and are not shown here.

export const handler: Handlers["Login"] = async (req, { emit, logger }) => {
  try {
    const validatedData = loginSchema.parse(req.body);

    // Find user
    const user = await prisma.user.findFirst({
      where: { email: { equals: validatedData.email } },
    });
    // Same generic error for both failures, so attackers can't probe for valid emails
    if (!user) throw new Error("authentication failed");

    const passwordMatch = await comparePassword({
      password: validatedData.password,
      hashed: user.password,
    });
    if (!passwordMatch) throw new Error("authentication failed");

    const token = createToken({ userId: user.id });

    // Never log, emit, or return the password hash
    const { password: _password, ...safeUser } = user;

    if (logger) {
      logger.info("User logged in", {
        userId: user.id,
        name: user.name,
        createdAt: user.createdAt,
      });
    }

    if (emit) {
      await (emit as any)({
        topic: "user.loggedin",
        data: safeUser,
      });
    }

    return {
      status: 200,
      body: {
        message: "User logged in successfully",
        success: true,
        data: {
          user: safeUser,
          token,
        },
      },
    };
  } catch (error) {
    if (error instanceof z.ZodError) {
      // Return validation errors with proper HTTP status
      return {
        status: 400,
        body: {
          message: "Validation error",
          errors: error.issues,
        },
      };
    }

    // Wrong email or password is a client error, not a server error
    if (error instanceof Error && error.message === "authentication failed") {
      return {
        status: 401,
        body: { message: "Invalid email or password" },
      };
    }

    // Handle unexpected errors
    if (logger) {
      logger.error("User login failed", {
        error: error instanceof Error ? error.message : "Unknown error",
      });
    }

    // Don't leak internal error details to the client
    return {
      status: 500,
      body: { message: "Internal server error" },
    };
  }
};

Identity Store

An identity store is very important as it represents your database or any storage system you use to persist user data or logged-in information.

This is where your user data lives:

Email
Hashed passwords: Must be hashed using any of these: Argon2, bcrypt, or scrypt
MFA secrets
Device fingerprints
OAuth identities
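To make the hashing requirement concrete, here is a minimal sketch using scrypt from Node's built-in `crypto` module, so it runs without extra packages. `hashPassword`, `verifyPassword`, and the `salt:hash` storage format are illustrative assumptions, and the default scrypt cost parameters are used; a production system would tune them or use a maintained Argon2 or bcrypt library.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

const KEY_LENGTH = 64;

// Returns "salt:hash" (both hex) so the salt is stored next to the hash.
export function hashPassword(password: string): string {
  const salt = randomBytes(16); // unique per user, defeats precomputed tables
  const hash = scryptSync(password, salt, KEY_LENGTH);
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

export function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  if (!saltHex || !hashHex) return false;
  const expected = Buffer.from(hashHex, "hex");
  if (expected.length !== KEY_LENGTH) return false; // malformed record
  const actual = scryptSync(password, Buffer.from(saltHex, "hex"), KEY_LENGTH);
  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(actual, expected);
}
```

Only the output of `hashPassword` is written to the identity store; the plain password is never persisted.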

From our code snippet below, you can see that we connect to a database using Prisma ORM to retrieve and compare user data when trying to log in.

// PRISMA ORM
    
const user = await prisma.user.findFirst({
      where: { email: { equals: validatedData.email } }, 
});

You can use any database or data store available to you or agreed upon by the interviewer.

However, as a recommendation, here are some data stores you can discuss with your interviewer to choose the one that fits best:

PostgreSQL (strong consistency)
DynamoDB (regional scaling)
LDAP (enterprise)
Firebase Auth (managed)

Token System

Your token system can be a simple function that generates a login token for the user on the go, or a separate system that creates and manages user tokens, handling refresh tokens, expiry, and so on.

There are common services or libraries for generating and managing user tokens:

JWT (stateless, used by most modern APIs)
Opaque tokens + DB/session store
OAuth2 access + refresh tokens

The idea of a token system is to have a seamless functionality that manages everything related to tokens for users.

Here’s a simple snippet that uses JWT to generate user tokens:

import jwt from "jsonwebtoken";
// `environment` and the cookie `Option` type come from the surrounding project.

export function setJWTOAuth(user: any): any {
  const token = jwt.sign({ userId: user?.id }, environment.auth.tokenSecret, {
    expiresIn: "30 days",
  });
  const thirtyDays = 30 * 24 * 60 * 60 * 1000;

  const isProd = !["LOCAL", "DEVELOP"].includes(environment.context);

  // Cookie settings: httpOnly keeps the token out of reach of page scripts
  const options: Option = {
    httpOnly: true,
    expires: new Date(Date.now() + thirtyDays),
    sameSite: isProd ? "none" : false,
    secure: isProd,
  };

  if (isProd) options.domain = ".example.com";

  return {
    options,
    token,
  };
}

Here, we generate a token using JWT and return it together with the cookie options, so the caller can store it in the user’s cookie and also return it to the user as response data.

Session Store

The session store is optional since your authentication flow should be stateless, meaning your backend should hold no state of a logged-in user; that’s the essence of using a token-based system, especially JWT.

However, in some specific cases, your backend might actually need to save state, or you might agree with your interviewer to design your authentication system with session state in mind.

Here are some use cases where a session store might be needed. If your app needs:

Server invalidation (Logging out a user from the server, etc)
Instantly log out everywhere
Revocation lists

API Gateway

For a large-scale and scalable authentication system, an API gateway is important, and it stands as a mediator between your services while performing many important tasks, such as:

Token validation
Rate limiting
IP allow/deny
Bot filtering
Throttling

This is where authentication and authorization happen before sending the user to the right service.

MFA + Device Management

In some applications, Multi-Factor Authentication is optional, so you need to discuss with your interviewer if this feature is needed.

However, as a general recommendation, when building a large-scale enterprise application, MFA is not optional and needs to be implemented.

Device management, on the other hand, is about storing the device information of your user for different purposes.

A very good use case could be whether you want to allow multiple devices to be connected to your application or not.

Next, let’s look at this visual representation of an authentication flow:

How Authentication Flows

The flow is simple:

Signup Flow
Login Flow
Refresh Token Flow
Logout Flow

Signup Flow

When a user sends a signup request to the backend with valid information, your backend service follows these steps:

Steps:

Validate inputs
Hash password with Argon2
Create user record
Send verification email (Queue)
Return success

Login Flow

When a user sends a login request to the backend with the valid information, your backend service follows these steps:

Steps:

Fetch user
Compare hash
Generate access token (short-lived)
Generate refresh token (long-lived)
Store refresh token in Redis/DB (optional)
Return tokens and user metadata

Refresh Token Flow

When a user's access token expires, a refresh token request is sent to the backend with valid information. Your backend service follows these steps:

Authentication service checks:

Is this refresh token valid?
Is it revoked?
Is the device recognized?

If valid:

Issue a new access token
Issue a new refresh token (rotation)
Revoke the old one
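The validity and revocation checks plus the rotation step can be sketched as follows. This is an illustration under assumptions: an in-memory `Map` stands in for Redis or a database table, the names `issueRefreshToken` and `rotateRefreshToken` are hypothetical, and the device check and access token issuance are left out to keep it short.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface RefreshRecord {
  userId: string;
  expiresAt: number; // epoch ms
  revoked: boolean;
}

// Keyed by the token's SHA-256 hash so a leaked store does not expose usable tokens.
const refreshStore = new Map<string, RefreshRecord>();
const REFRESH_TTL_MS = 30 * 24 * 60 * 60 * 1000; // assumed 30-day lifetime

const hashToken = (token: string): string =>
  createHash("sha256").update(token).digest("hex");

export function issueRefreshToken(userId: string, now: number): string {
  const token = randomBytes(32).toString("base64url");
  refreshStore.set(hashToken(token), { userId, expiresAt: now + REFRESH_TTL_MS, revoked: false });
  return token;
}

// Validates the presented token, revokes it, and returns a replacement (rotation).
export function rotateRefreshToken(
  presented: string,
  now: number
): { userId: string; refreshToken: string } | null {
  const record = refreshStore.get(hashToken(presented));
  if (!record || record.revoked || record.expiresAt <= now) return null;
  record.revoked = true; // the old token can never be used again
  return { userId: record.userId, refreshToken: issueRefreshToken(record.userId, now) };
}
```

Because every refresh token works only once, a second use of the same token is a useful signal that it may have been stolen.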

Logout Flow

When a user sends a logout request to the backend with the valid information, your backend service follows these steps:

If using server-side sessions:

Delete token from Redis

If using JWT:

A stateless JWT stays valid until it expires, so you can’t simply delete it on the server. Instead:

Add the token to a revocation list (with a short TTL)
Or rely on short-lived tokens
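A revocation list only needs to remember a token until that token would have expired anyway. Here is a minimal in-memory sketch, assuming each JWT carries a unique `jti` claim; `RevocationList` is a hypothetical name, and in production the map would typically be a Redis key per token with a matching TTL.

```typescript
// In-memory stand-in for Redis keys with a TTL (assumption: one key per token ID).
export class RevocationList {
  private revokedUntil = new Map<string, number>();

  // tokenId is the JWT `jti` claim; tokenExpSec is its `exp` claim (seconds).
  revoke(tokenId: string, tokenExpSec: number): void {
    this.revokedUntil.set(tokenId, tokenExpSec);
  }

  isRevoked(tokenId: string, nowSec: number): boolean {
    const until = this.revokedUntil.get(tokenId);
    if (until === undefined) return false;
    if (until <= nowSec) {
      // The token has expired on its own, so the entry is no longer needed.
      this.revokedUntil.delete(tokenId);
      return false;
    }
    return true;
  }
}
```

The gateway would call `isRevoked` after verifying the signature, which is why short-lived access tokens keep this list small.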

Next, let’s explore the different authentication methods and understand which one to use for different cases.

What Authentication Method Should You Use?

There are so many authentication methods and strategies to choose from. However, here are the common strategies:

JWT (JSON Web Tokens):

The JWT strategy is the most common due to its stateless nature. You can verify a user’s token without querying the database, making it great for distributed, multi-region services.

Here’s a simple code snippet to generate a JWT token:

  const token = jwt.sign({ userId: user?.id }, TOKENSECRET, {
    expiresIn: "1 day",
  });
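To show why no database query is needed, here is a hedged sketch of HS256 signing and verification built only on Node's `crypto` module. `signToken` and `verifyToken` are hypothetical helpers written for illustration; in a real service you would keep using a maintained library such as `jsonwebtoken`, as the snippet above does.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const encode = (value: unknown): string =>
  Buffer.from(JSON.stringify(value)).toString("base64url");

const hmac = (data: string, secret: string): Buffer =>
  createHmac("sha256", secret).update(data).digest();

// Builds base64url(header).base64url(payload).base64url(signature)
export function signToken(payload: Record<string, unknown>, secret: string): string {
  const unsigned = `${encode({ alg: "HS256", typ: "JWT" })}.${encode(payload)}`;
  return `${unsigned}.${hmac(unsigned, secret).toString("base64url")}`;
}

// Returns the payload if the signature and expiry check out, otherwise null.
export function verifyToken(
  token: string,
  secret: string,
  nowSec: number
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts as [string, string, string];
  const expected = hmac(`${header}.${body}`, secret);
  const actual = Buffer.from(signature, "base64url");
  // Recompute the signature locally: only the shared secret is needed
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return null;
  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  if (typeof payload.exp === "number" && payload.exp <= nowSec) return null;
  return payload;
}
```

Any gateway or service that holds the secret can run `verifyToken` on its own, which is what makes this approach fit distributed, multi-region setups.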

Opaque Tokens and DB Store:

This is a simple, secure, and server-controlled authentication strategy.

Opaque tokens are random, meaningless strings issued by the server to represent a user’s session. Unlike JWTs, they contain zero user data.

Their only purpose is to act as a key to look up session information stored on the backend.

Here’s a simple code snippet to generate a strong opaque token:

import { randomBytes } from "node:crypto";

interface SessionData {
  userId: string;
  expiresAt: number; // epoch ms
}

// Stand-in for Redis or a DB table; the token is only a key into this store
const sessions = new Map<string, SessionData>();

export const createToken = (data: { userId: string }): string => {
  // 32 random bytes: unguessable, and it carries no user data or secrets
  const token = randomBytes(32).toString("base64url");
  sessions.set(token, { userId: data.userId, expiresAt: Date.now() + 3600 * 1000 });
  return token;
};

Session Cookies (Web):

The session cookie is a battle-tested and classic authentication model for web applications. If you’re building a web-based application, I will always recommend session cookies.

Session cookies rely on a simple idea:

The server creates a session and stores the data. The browser holds only a tiny “session ID” cookie to reference that session.

This is the oldest and still one of the most secure methods for traditional web authentication.

One good thing about this method is that you can generate the token using any of the above methods, and you can decide to store it in Redis or your database, or make it stateless.

Here’s a simple code snippet combining JWT tokens with session cookies:

export function setJWTOAuth(user: any): any {
  const token = jwt.sign({ userId: user?.id }, TOKENSECRET, {
    expiresIn: "30 days",
  });

  const thirtyDays = 30 * 24 * 60 * 60 * 1000;

  const isProd = !["LOCAL", "DEVELOP"].includes(environment.context);

  const options: Option = {
    httpOnly: true,
    expires: new Date(Date.now() + thirtyDays),
    sameSite: isProd ? "none" : false,
    secure: isProd,
  };

  if (isProd) options.domain = ".example.com";

  return {
    options,
    token,
  };
}

OAuth2 / OpenID Connect

OAuth2 and OpenID Connect are mostly implemented to allow integrations with third-party systems, for example, allowing users to log in or register using Google, GitHub, Twitter (X), etc. You can implement this properly by following the documentation of the specific provider and learning how the OAuth flow works.

Next, let’s examine how to scale an authentication service:

Scaling Authentication

To scale your system to millions of users, you must think beyond tokens. You must think of the following points and implement them properly:

Distributed Token Validation: Use a distributed-enabled token strategy like JWT, which allows you to validate the token locally within a gateway without hitting the database.
Horizontal Scaling: Design a stateless authentication service to scale using Kubernetes or containers.
Rate Limiting & Abuse Protection: Add some layers of protection, such as rate-limiting, to prevent attacks such as credential stuffing, brute force, token replay, and bot attacks.
Monitoring & Metrics: Add proper monitoring and track metrics such as Login success rates, Login failures, Suspicious IPs, Token refresh volume, Top failing passwords, MFA usage, and Bot patterns, and set alerts on the following: Sudden login failure spike, Refresh token abuse, key-signing failures, and JWT verification failures
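As one example of the rate-limiting point above, here is a minimal fixed-window limiter for login attempts. It is a sketch under assumptions: `LoginRateLimiter` is a hypothetical name, state is kept in process memory, and a multi-instance deployment would keep the counters in a shared store such as Redis instead.

```typescript
// Fixed-window counter per key (for example, IP address plus email).
export class LoginRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the attempt is allowed, false if the key is over the limit.
  allow(key: string, now: number): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 }); // start a new window
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

For example, `new LoginRateLimiter(5, 60_000)` would allow five login attempts per key per minute, which slows down brute force and credential stuffing without affecting normal users much.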

Final Answer

Here’s exactly how to put your answer forward to the interviewer:

“I’d design auth with a secure user store, token-based sessions (JWT or opaque tokens), refresh token flow, MFA, and centralized logging. At scale, I’d add Redis for session lookups and rate limiting for login endpoints.”

Designing a scalable authentication system shows your engineering prowess. It’s the beginning of understanding how complex and scalable backend systems work.

Every choice you make affects the user right where they want to access your main product.

Therefore, this is where real engineering starts.

So the next time an interviewer asks, “How would you design an authentication system for a large-scale web application?” don’t just say, “I’ll use an existing solution.”

Walk them through your thinking.

Show them how you’d keep things secure, fast, and scalable even when millions of requests hit your system. That’s the part that shows you truly understand how real-world backend systems behave.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

Whenever you’re ready

There are 4 ways I can help you become a great backend engineer:

1. The MB Platform: Join 1000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

Stay awesome,
Solomon (solomoneseme.com)

How would you design a Rate-limiting system for APIs at scale?

Solomon Eseme — Sat, 15 Nov 2025 17:04:35 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

If this newsletter was shared with you, consider subscribing here:

Subscribe now

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

If you are still stuck in tutorial hell

We’ve all been there—jumping from one Python video to the next, but never building anything real. No portfolio. No confidence. No interviews.

That’s why I created this.

The “Land Your Dream Python Job” Challenge
A 90-day, 3-phase roadmap that helps you:

✅ Build 30 real-world backend projects in 30 days
✅ Master DSA for technical interviews
✅ Get job-ready with resumes, mock interviews & daily job alerts
✅ And finally... land that backend job

This is NOT another course. It’s a challenge. And it works. It’s beginner-friendly.

Over 2,000 Python developers have taken this path—many are now working at top companies.

Only 120 slots left at $54 (then goes up to $100)

Join the challenge & change your future
👉 python30.masteringbackend.com

Let’s get you unstuck.

This is the MB Interview Series on Backend Weekly every Saturday.

In this series, I will walk you through how to answer common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 3 (Episode 2 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a rate-limiting system for APIs at scale?”

Here’s how you should approach it:

Before we dive in, we are building the next Interview Prep Playground targeting backend engineers.

Join our MB Interview waitlist: https://tally.so/r/w46glb

Make sure you have a solid understanding of the problem first before attempting any solution.

Understand the problem

Solving any problem gets much simpler once you truly understand it.

Rate limiting is simple:

If you want to protect your backend systems from overload, ensure fair use across tenants, and provide predictable SLA behavior, then you need Rate-Limiting.

Rate-limiting is the process of limiting the number of requests a client (user, API key, IP, or tenant) can make to your API within a time window.

The window might be 1 minute or 2 seconds, for example.

It prevents abuse, protects downstream services (DBs, 3rd-party APIs), and lets you enforce product tiers.

Below are typical goals for a production-ready rate-limiter:

Protect system capacity and reduce overload.
Provide predictable latency and QoS (Quality of Service).
Enforce per-user, per-API-key, per-IP, per-route, and per-plan limits.
Support bursts while enforcing long-term fairness.
Work at millions of requests/sec across regions.

The question in your mind probably is:

Then, how do you design a rate-limiting system for APIs at scale when many services are running in a microservice architecture?

Let’s look at this architecture for a simple distributed rate-limiting system.

The High-Level Architecture

At a high level, the core pieces are:

App Server: Your backend services (Express, Go, Java, etc.). Requests land here after passing through the edge/gateway. Services might be regional.
API Gateway / Edge: Primary place to enforce low-latency rate checks (Envoy, Kong, AWS API Gateway, Fastly). This is the first defensive layer.
Distributed Store: Fast store for counters/tokens (Redis cluster, in-memory local caches, or an external quota service).
Central Quota Service (optional): For billing/long-term quotas and complex policy enforcement.
Monitoring/Policy Store: Where limits & plans live (DB/Config service, fetched by gateway).

Edge consults the distributed store (or local cache + sync) to decide allow/deny.

Rate-Limiting Algorithms

Below are the core strategies used across the industry. You can pick one and discuss it extensively with your interviewer.

Fixed Window

In a fixed-window algorithm, the system resets the request counter at fixed time intervals (such as every minute), making it simple to implement but occasionally allowing request bursts to slip through right as a window resets.

if requests[current_window] < limit: requests[current_window] += 1; allow()
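
Here’s a minimal in-memory sketch of the fixed-window idea in TypeScript. The class name FixedWindowLimiter and the injectable nowMs parameter are assumptions made for illustration; in production the counter would live in a shared store rather than in process memory.

```typescript
// In-memory fixed-window limiter (illustrative; not shared across processes).
class FixedWindowLimiter {
  private readonly counts = new Map<string, { window: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // nowMs is injectable so the behaviour can be checked without waiting.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const window = Math.floor(nowMs / this.windowMs);
    let entry = this.counts.get(key);
    if (!entry || entry.window !== window) {
      // A new window started: reset the counter.
      entry = { window, count: 0 };
      this.counts.set(key, entry);
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false;
  }
}
```

With a limit of 2 per second, two requests at 990 ms and two more at 1000 ms are all allowed, which is exactly the boundary burst described above.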

Sliding Window Log

With sliding-window log rate limiting, the system stores a timestamp for every request and calculates allowed requests by examining only the ones within the current time window, giving it excellent accuracy but making it expensive and memory-heavy at large scale.

log = log.filter(last_60s); if log.size < limit: log.add(now); allow()
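
A minimal sketch of the sliding-window log, assuming an in-memory Map of timestamps per key. The factory name createSlidingLogLimiter is hypothetical, and the per-request array is what makes this approach memory-heavy at scale.

```typescript
// Sliding-window log: keep one timestamp per accepted request.
function createSlidingLogLimiter(limit: number, windowMs: number) {
  const logs = new Map<string, number[]>();

  return function allow(key: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - windowMs;
    // Drop timestamps that have fallen out of the window.
    const recent = (logs.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < limit;
    if (allowed) recent.push(nowMs);
    logs.set(key, recent);
    return allowed;
  };
}
```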

Sliding Window Counter

The sliding-window counter combines the simplicity of the fixed window with some of the accuracy of the log approach by blending counts from the current window with a proportion of the previous one, making it smoother during bursts and efficient enough for large-scale systems.

count = curr_count + (prev_count * overlap_ratio); if count < limit: curr_count += 1; allow()
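
The weighted estimate can be sketched like this. It is an in-memory illustration; the name createSlidingCounterLimiter and the state shape are assumptions, and overlap_ratio is computed as the share of the previous window still covered by the sliding window.

```typescript
// Sliding-window counter: two counters per key instead of a full log.
function createSlidingCounterLimiter(limit: number, windowMs: number) {
  const state = new Map<string, { window: number; curr: number; prev: number }>();

  return function allow(key: string, nowMs: number = Date.now()): boolean {
    const window = Math.floor(nowMs / windowMs);
    let s = state.get(key);
    if (!s) {
      s = { window, curr: 0, prev: 0 };
      state.set(key, s);
    } else if (s.window !== window) {
      // Roll over. The old count only matters if the windows are adjacent.
      s.prev = window - s.window === 1 ? s.curr : 0;
      s.curr = 0;
      s.window = window;
    }
    // Share of the previous window that the sliding window still overlaps.
    const overlapRatio = 1 - (nowMs % windowMs) / windowMs;
    const estimated = s.curr + s.prev * overlapRatio;
    if (estimated < limit) {
      s.curr += 1;
      return true;
    }
    return false;
  };
}
```

Unlike the fixed window, a client that used its full limit in the previous window is still throttled at the start of the next one.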

Token Bucket

In a token-bucket algorithm, each request consumes a token from the bucket while tokens continuously refill at a fixed rate, allowing the system to enforce steady traffic limits while still permitting short, controlled bursts.

tokens = min(max_tokens, tokens + refill_rate * dt); if tokens >= 1: tokens -= 1; allow()
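
Here’s a minimal single-process sketch of the token bucket. The class name TokenBucket and the injectable clock are assumptions for illustration; a gateway could use something like this as a local bucket, which is separate from the Redis-backed version shown later in this issue.

```typescript
// In-process token bucket: refill lazily on each call instead of using a timer.
class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerMs: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerMs = refillPerSec / 1000;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.last = nowMs;
  }

  tryConsume(cost: number = 1, nowMs: number = Date.now()): boolean {
    // Add the tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, nowMs - this.last);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.last = nowMs;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
```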

Leaky Bucket

The leaky-bucket algorithm forces requests to leave the system at a constant, fixed rate, regardless of how fast they arrive, making it useful for smoothing or shaping uneven traffic into a predictable, stable flow.

if queue.size < bucket_size: queue.push(req); process_at_fixed_rate()
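
A minimal sketch of the leaky bucket as a bounded queue. The names LeakyBucket, offer, and drain are hypothetical, and a real system would call drain from a timer or worker loop rather than by hand.

```typescript
// Leaky bucket: bounded queue in, constant rate out.
class LeakyBucket<T> {
  private readonly queue: T[] = [];
  private lastLeak: number;
  private readonly bucketSize: number;
  private readonly leakIntervalMs: number;

  constructor(bucketSize: number, leaksPerSec: number, nowMs: number = Date.now()) {
    this.bucketSize = bucketSize;
    this.leakIntervalMs = 1000 / leaksPerSec;
    this.lastLeak = nowMs;
  }

  // Returns false when the bucket is full and the request must be rejected.
  offer(req: T): boolean {
    if (this.queue.length >= this.bucketSize) return false;
    this.queue.push(req);
    return true;
  }

  // Releases the requests that are due at nowMs, one per leak interval.
  drain(nowMs: number = Date.now()): T[] {
    const due = Math.floor((nowMs - this.lastLeak) / this.leakIntervalMs);
    if (due <= 0) return [];
    // Advance the clock even if the queue was short, so idle time earns no burst credit.
    this.lastLeak += due * this.leakIntervalMs;
    return this.queue.splice(0, due);
  }
}
```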

I recommend the Sliding Window Counter algorithm for simplicity and fairness, or the token bucket as a practical choice for gateways that need burst tolerance and predictable RPS.

Designing a Distributed Rate-Limiting System

When designing a distributed rate-limiting system, you need to know where to enforce limits, what limits to enforce, and where not to enforce them.

Edge/API Gateway: This is the primary point to place rate limiting. It is closer to the client and adds the lowest latency.
Sidecar / Service mesh (intra-cluster): For fine-grained control per service.
Application layer (fallback): Add extra business-rule limits or extra checks.
Central quota/checker: Synchronous for billing limits OR asynchronous reconciliation for monthly quotas.
Client SDK (soft): Client-side backoff to reduce noise.

Note: Enforce simple decisions at the gateway (fast) with a Redis (or local) token bucket for atomic checks. Use a central quota service only where strong consistency/billing is required.

Distributed Design Considerations

Now, let’s look at some key considerations when building a distributed rate-limiting architecture.

Single Redis vs Clustered Redis: At scale, use a Redis cluster or a sharded setup so that keys, and therefore load, are spread across nodes.
Consistent hashing: When sharding gateways or Redis nodes, consistent hashing reduces reshuffling when nodes change.
Local token buckets: Gateways can use a local, in-process bucket and periodically sync to Redis (an optimistic approach that trades some accuracy for lower latency).
Multi-region: Prefer region-local enforcement (low latency) and accept eventual consistency for cross-region quotas. For strict global quotas, route checks through a global coordinator (higher latency).
Atomicity: Use Redis Lua scripts (atomic) to avoid race conditions under high concurrency.
Hot keys: Detect and apply special handling (rate limit more aggressively, throttle, route to dedicated nodes).

Handling System Failures

How should you deal with failures in your system?

You also need to share this with your interviewer and choose a strategy depending on the business.

Redis/Store unavailable: What happens if Redis or your store is not available? Below are some of the strategies you can discuss with your interviewer.
- Fail-closed (deny all): This strategy is safe but causes an outage.
- Fail-open (allow all): This strategy keeps availability but risks abuse.
- Pragmatic option (fail-graceful): This strategy uses a local permissive leaky bucket with reduced accuracy and produces alerts.
Clock drift & time accuracy: Keep time consistent across servers. Use server timestamps and never trust client clocks.
Network partitions: Enforce locally and reconcile later for non-billing use cases.
Cache stampede: In case many clients miss and hit the DB/central store simultaneously, use request coalescing, locks, or “allow one repopulator”.
Thundering herd on config changes/TTL expiry: Add jitter/randomized TTLs and stagger refresh.

API & UX for clients

Always return clear headers and status codes:

Successful responses:
- X-RateLimit-Limit:
- X-RateLimit-Remaining:
- X-RateLimit-Reset:
When throttled:
- HTTP 429 Too Many Requests
- Retry-After:
- Response body with a friendly message and suggested backoff

These help clients implement exponential backoff and graceful retry.
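
To make this concrete, here’s a minimal sketch that turns a rate-limit decision into a status code and headers. The RateDecision shape and the function name buildRateLimitResponse are assumptions for illustration, and so is the choice to express X-RateLimit-Reset as Unix seconds (APIs differ on this).

```typescript
interface RateDecision {
  allowed: boolean;
  limit: number;         // max requests in the window
  remaining: number;     // requests left in the window
  resetEpochSec: number; // when the window resets (assumed: Unix seconds)
}

interface LimitedResponse {
  status: number;
  headers: Record<string, string>;
  body?: { message: string };
}

function buildRateLimitResponse(d: RateDecision, nowEpochSec: number): LimitedResponse {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(d.limit),
    "X-RateLimit-Remaining": String(Math.max(0, d.remaining)),
    "X-RateLimit-Reset": String(d.resetEpochSec),
  };
  if (d.allowed) return { status: 200, headers };

  // Retry-After is given in seconds; never tell the client to retry in 0 seconds.
  headers["Retry-After"] = String(Math.max(1, d.resetEpochSec - nowEpochSec));
  return {
    status: 429,
    headers,
    body: { message: "Too many requests. Please retry after the indicated delay." },
  };
}
```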

Metrics & monitoring

Track these metrics & set alerts early. You can also discuss this with your interviewer to see what works for the system you’re designing.

Requests allowed / requests denied (429)
Rate check latency (p95, p99)
Redis errors/timeouts/connection pool usage
Token refill rate/capacity usage
Top offenders (per API key / IP)
Success/failure ratios of local vs remote checks

Use Prometheus and Grafana, Datadog, or similar. Make sure to alert on sudden spikes in 429s, Redis errors, and increased check latency.

Dynamic config & policy management

Store throttling policies (per plan, per API-key) in a config store (DynamoDB, PostgreSQL, Consul).

Push policies to gateways via:

Hot reload API
Polling + ETag
Push via SSE / WebSocket for instant changes

// A simple policy

{
  "id": "plan-pro",
  "type": "token-bucket",
  "rate": 100,           // tokens per second
  "burst": 200,          // max burst tokens
  "scope": ["global", "per-api-key"]
}

Implementation Example

Here’s a simple rate-limiting implementation using the Token Bucket algorithm with Redis + Lua (TypeScript).

Let’s define our Lua script for atomic token bucket implementation:

// Lua script: atomic token bucket

const lua = `
local key = KEYS[1]
local now = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per ms
local capacity = tonumber(ARGV[3])
local tokens_needed = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "last")
local tokens = tonumber(data[1]) or capacity
local last = tonumber(data[2]) or now

-- refill
local elapsed = math.max(0, now - last)
local add = elapsed * refill_rate
tokens = math.min(capacity, tokens + add)
last = now

local allowed = 0
if tokens >= tokens_needed then
  tokens = tokens - tokens_needed
  allowed = 1
end

redis.call("HMSET", key, "tokens", tokens, "last", last)
redis.call(”PEXPIRE”, key, 3600000) -- 1 hour TTL to reduce key churn
return { allowed, tokens }
`;

TypeScript + ioredis example (token bucket via Redis Lua):

import Redis from 'ioredis';
const redis = new Redis({ host: 'redis-host', port: 6379 });

// Load the script once and reuse its SHA on every call.
const sha = (await redis.script('LOAD', lua)) as string;

async function tryConsume(key: string, tokens = 1, ratePerSec = 100, capacity = 200) {
  const now = Date.now();
  const refillRatePerMs = ratePerSec / 1000;
  const res = (await redis.evalsha(sha, 1, key, now, refillRatePerMs, capacity, tokens)) as [number, number];
  // The script returns { allowed, tokens }; Redis truncates Lua numbers to integers in replies.
  return { allowed: res[0] === 1, tokensLeft: Number(res[1]) };
}

Notes: ratePerSec and capacity map to plan tiers. Use namespaced keys: rl:api-key::. Keep the Lua script loaded (sha) for performance.

Fault Tolerance and reliability

Always have a DB/service fallback and circuit breakers around rate store calls:

try {
  const allowed = await checkRate(...)
  if (!allowed) return 429
} catch (err) {
  logger.warn('Rate store fail, using permissive local bucket')
  // apply local soft quota
}

Use replicas for Redis reads; perform writes/updates on master/shards. Implement health checks and automatic failover for the store. Use exponential backoff and retry for store ops, but avoid blocking the request path too long.
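
The circuit breaker mentioned above can be sketched as a small state machine. This is a minimal illustration, not a production library: the class name CircuitBreaker, the threshold, and the cooldown are all assumptions. The caller asks canAttempt() before hitting the rate store and goes straight to the local fallback when the breaker is open.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Should the caller try the remote store, or skip straight to the fallback?
  canAttempt(nowMs: number = Date.now()): boolean {
    if (this.state === "open" && nowMs - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let probe calls through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number = Date.now()): void {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = nowMs;
    }
  }
}
```

While the breaker is open, requests skip the store entirely, which keeps a failing Redis from adding latency to every request.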

Final Answer

Here’s exactly how to put your answer forward to the interviewer:

“I’d design the rate-limiter using a token-bucket algorithm enforced at the API gateway for low latency, backed by a clustered Redis store with Lua scripts for atomicity. Use per-API-key and per-route policies stored in a config service with hot reload. Add randomized TTLs and request coalescing to avoid stampedes, local token buckets for latency optimization, and monitoring for hit/deny ratios. For multi-region, deploy regional clusters with eventual consistency for non-billing quotas and a central coordinator for strict global quotas. This balances performance, burst handling, and operational reliability.”

Designing a rate-limiting system might sound like a small piece of the puzzle. However, as you dig in, you realize it’s really about understanding real-world traffic, distributed systems, and how to keep your APIs healthy under pressure.

Every little choice you make, the algorithm, the data store, how you handle bursts, how you sync counters, all of it affects how stable and reliable your system will be when things get busy.

So the next time an interviewer asks, “How would you design a rate-limiter at scale?” don’t just say, “I’ll use Redis.”

Walk them through your thinking.

Show them how you’d keep things fair, fast, and scalable even when millions of requests hit your system. That’s the part that shows you truly understand how real-world backend systems behave.

How would you design a Distributed Cache for a High-Traffic System?

Solomon Eseme — Sat, 01 Nov 2025 15:30:42 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

If this newsletter was shared with you, consider subscribing here:

Subscribe now

Here’s another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

This is the MB Interview Series on Backend Weekly every Saturday.

In this series, I will walk you through how to answer common backend engineering interview questions, covering topics such as system design, microservices, API design, and databases.

Let’s get started with episode 2 (Episode 1 Here):

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a distributed cache for a high-traffic system?”

Here’s how to approach it:

Now, let’s start by clarifying why caching exists in the first place.

Understand the problem

Solving any problem gets much simpler once you truly understand it.

If you want to boost the performance of your backend systems, you need caching.

Caching is the process of storing frequently accessed data temporarily in high-speed storage. This storage is called a cache.

Caching speeds up the retrieval of frequently accessed data because that data no longer has to be read from the database every time.

The goals are to:

Reduce DB load.
Improve latency for frequently accessed data.
Handle millions of requests with low overhead.

How do you design a distributed cache for a high-traffic system where multiple services run concurrently in a microservice architecture?

The High-Level Architecture

Let’s look at the high-level architecture for this:

This is the core of our architecture:

App Server: These are your backend services, such as an Express or Go server. All requests are handled here. These servers can be distributed across regions and built with any server-side language.
Cache Layer: The cache layer is any of the variants we discuss below. It can be a Local, Centralized, or Hybrid Cache.
Redis Cluster: This assumes we pick Redis as our cache service; we can add as many cache nodes as we need to meet our speed and performance requirements.
Database: The centralized database for our application.

Here are the 3 variants you can choose to build with:

Local Cache (In-memory): Fastest but inconsistent across nodes. Example: using node-cache or lru-cache.
Centralized Cache (e.g., Redis, Memcached): Shared across instances, consistent view. Needs replication and scaling.
Hybrid Cache: Local cache for ultra-fast reads. Centralized cache for synchronization.

Caching Strategies

Next, let’s explore some of the caching strategies. Here are three classic strategies:

Cache-Aside (Lazy Loading): The app checks the cache first; if it misses, it fetches the data from the database and writes it to the cache for subsequent reads. It’s the simplest approach and only caches data that is actually requested, at the cost of a slower first read (cold start) for each key.
Write-Through: This strategy writes to the cache and the DB together. Data is always fresh, but write latency increases.
Write-Behind: This strategy writes to the cache first and then writes to the DB asynchronously. Writes are faster, but there is a risk of data loss if the cache fails before the data is persisted.

You can choose the Cache-Aside strategy because it’s simple, scalable, and the most common in production-ready backend systems.
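
Here’s a minimal sketch of the cache-aside read path. It is kept synchronous and uses a plain Map as the cache so the flow is easy to follow; in a real service both the cache and the database calls would be async. The names createCacheAsideReader and loadFromDb are hypothetical.

```typescript
// Cache-aside: the application, not the cache, is responsible for loading data.
function createCacheAsideReader<V>(
  cache: Map<string, V>,
  loadFromDb: (key: string) => V,
) {
  return function read(key: string): V {
    const cached = cache.get(key);
    if (cached !== undefined) return cached; // cache hit

    // Cache miss: read from the source of truth, then populate the cache.
    const fresh = loadFromDb(key);
    cache.set(key, fresh);
    return fresh;
  };
}
```

The first read of a key pays the database cost; every later read is served from the cache until the entry is evicted or invalidated.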

Eviction and Consistency Policy

The speed and performance benefits are significant, but cache memory is limited, so at some point you must evict data. That’s where choosing the right eviction policy comes in:

Here are some of the common eviction policies:

LRU (Least Recently Used): evicts the keys that were accessed least recently.
LFU (Least Frequently Used): evicts the keys that are accessed least often.
TTL (Time-to-Live): automatically removes expired data.
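
To show how LRU works under the hood, here’s a minimal sketch built on the insertion order of a JavaScript Map. The class name LruCache is an assumption for illustration; libraries such as lru-cache add TTLs, size accounting, and more.

```typescript
// LRU cache: a Map iterates in insertion order, so the first key is the oldest.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used key (first in iteration order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```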

For consistency, think about what happens when your database changes: until the cache is updated or invalidated, it can temporarily serve stale data.

You can use these strategies to solve that problem:

Use cache invalidation on writes.
Add short TTLs.
Use event-driven updates (e.g., publish DB changes to cache via Kafka).

While building a distributed cache, you might run into some challenges, and it’s good to share them with your interviewer.

Some Distributed Challenges:

Let’s explore some of the challenges and how to solve them:

1. Cache Stampede

This challenge happens when many requests hit a missing key simultaneously and all of them go to the database at once.

Fixes:

Use request coalescing. Design your system in a way that only one request repopulates the cache.
Use the “lock & populate” mechanism with Redis locks.
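
Request coalescing within a single process can be sketched by sharing one in-flight promise per key. The function name coalesce and the module-level inFlight map are assumptions for illustration. This only deduplicates work inside one instance; across instances you still need something like the Redis lock mentioned above.

```typescript
// One in-flight load per key: concurrent callers share the same promise.
const inFlight = new Map<string, Promise<unknown>>();

function coalesce<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const pending = load();
  inFlight.set(key, pending);
  // Forget the entry once the load settles, whether it succeeded or failed.
  const cleanup = () => {
    inFlight.delete(key);
  };
  pending.then(cleanup, cleanup);
  return pending;
}
```

If 1,000 requests miss the same key at once, the loader (and therefore the database) is called once and all callers await the same result.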

2. Thundering Herd Problem

Next, let’s look at the thundering herd problem. Here, many keys expire at once, which leads to a sudden DB load spike.

To fix this, add randomized TTLs or soft TTLs to prevent all keys from expiring simultaneously.

await redis.set(cacheKey, data, { EX: 3600 + Math.floor(Math.random() * 300) });

Next, let’s elucidate how to scale a distributed cache.

Scaling the Distributed Cache

Scaling a distributed cache is hard.

Let’s make it simple to help you in your interviews.

When your cache grows large, single-node Redis won’t cut it.

Here are some of the techniques used to scale it properly:

Sharding: Split keys across multiple servers.
Consistent Hashing: Avoid massive reshuffles when nodes are added/removed.
Replication: Use replicas for fault tolerance.
Multi-Region Caches: Deploy regional cache clusters close to users.
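
Here’s a minimal sketch of consistent hashing with virtual nodes. The class name HashRing, the FNV-1a hash, and the default of 100 virtual nodes per server are assumptions chosen to keep the example dependency-free; lookups use a linear scan for brevity where a real implementation would binary-search the sorted ring.

```typescript
// 32-bit FNV-1a over UTF-16 code units: small, dependency-free, good enough for a demo.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; node: string }[] = [];
  private readonly replicas: number;

  constructor(nodes: string[], replicas: number = 100) {
    this.replicas = replicas;
    nodes.forEach((n) => this.addNode(n));
  }

  // Each server is placed on the ring many times (virtual nodes) to even out load.
  addNode(node: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter((e) => e.node !== node);
  }

  // A key belongs to the first ring point at or after its hash, wrapping around.
  getNode(key: string): string | undefined {
    const h = fnv1a(key);
    const entry = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return entry ? entry.node : undefined;
  }
}
```

When a node is removed, only the keys that lived on that node move; every other key keeps its owner, which is the property that avoids massive reshuffles.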

While scaling your distributed cache systems, here are some additions that will come in handy.

Fault Tolerance and Reliability

We need to consider what happens if our cache server fails. As with other production-ready systems, it should fail gracefully without interrupting the flow of our backend system.

Here are some best practices to build a fault-tolerant and reliable system:

Always have a DB fallback.
Set timeouts and circuit breakers around cache calls.
Monitor cache hit ratios and latency.

try {
  const result = await redis.get(key);
  if (result !== null) return result; // cache hit
} catch (error) {
  logger.warn("Cache unavailable, falling back to DB");
}
// Cache miss or cache failure: read from the database.
return await db.query(key);

This simple example falls back to the database on a cache miss or when the cache server is unavailable.

Additionally, we need to observe and monitor our cache servers like our main servers.

Observability and Metrics

A distributed system is not production-ready without observability. Here are some metrics you can track as you build your distributed cache system.

Cache hit ratio
Eviction count
Latency (p95, p99)
Connection pool usage
Replication lag

You can use tools such as Prometheus + Grafana or Datadog. This gives visibility into when and how your cache starts misbehaving.

Here are a few ideas to show advanced thinking:

Hot Key Detection: Track frequently requested keys and prefetch them.
Near Cache Pattern: Keep a local in-memory cache synced with the distributed cache.
Compression: Compress large cached values to save memory.
Sliding Expiration: Extend the TTL while the key is actively accessed.

Final Answer

Here’s exactly how to put your answer forward to the interviewer:

“I’d design a distributed caching system using Redis with cache-aside pattern, LRU eviction, sharding via consistent hashing, and replication for fault tolerance. I’d add randomized TTLs and request coalescing to avoid cache stampedes, and expose metrics for observability. This design balances speed, consistency, and scalability for high-traffic systems.”

Designing a distributed caching system might sound simple — but it’s a deep dive into scalability, data consistency, distributed systems, and real-world resilience.

Every decision — from TTLs to hashing — impacts performance and reliability.

So next time an interviewer asks, “How would you design a distributed cache?”
Don’t just say “I’ll use Redis.”

Walk them through how you’d make it reliable, scalable, and production-grade.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 4 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

3. MB Video-Based Courses: Join 1000+ backend engineers who learn from our meticulously crafted courses designed to empower you with the knowledge and skills you need to excel in backend development.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon (solomoneseme.com)

How would you design a distributed job scheduling system like Cron at scale

Solomon Eseme — Sat, 25 Oct 2025 15:00:59 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Welcome to another issue of Backend Weekly — your favorite newsletter on mastering backend engineering through real-world systems and interview design questions.

Before we dive in:

We recently moved Backend Weekly from Beehiiv to Substack!
You don’t have to do anything — same great weekly content, just a new home.

Welcome to the interview series on Backend Weekly.

In this series, I will walk you through how to answer common backend engineering interview questions, covering concepts such as system design, microservices, and databases.

Let’s get started with episode 1:

The Interview Scenario

You’re in a backend interview.

They ask:

“How would you design a distributed job scheduling system like Cron but at scale?”

Here’s how to approach it:

Understand the Problem

The first step towards solving any problem is understanding the problem inside and out.

Let’s break down the problem:

A cron service works great when you’re on a single machine. However, what if you’re running millions of jobs, recurring or one-time, across hundreds of servers, regions, and tenants?

Below are some of the problems you will have to worry about:

Jobs running exactly once at the correct time.
No duplication or missed executions
Scalability and handling spikes in scheduled jobs.
Fault tolerance: What happens when a node dies mid-execution?

When you take these problems (and more) into consideration, you’re no longer designing a single simple Cron service.

The High-Level Architecture

Now that we understand the challenges of building a distributed Cron system, let’s visualize the architecture.

Here’s how each part fits and what it does:

Client/API Gateway: This is where the jobs are created, updated, and deleted via API or UI.
Job Service: The job service stores the job definitions and metadata such as interval, next run time, owner, payload, etc.
Scheduler Service: The scheduler service continuously checks which jobs are due and dispatches them for execution.
Distributed Queue (Kafka, RabbitMQ, SQS): This buffers and distributes jobs to available workers.
Worker Nodes: The worker nodes execute the actual jobs, either by running a script, making HTTP calls, or performing database operations, and then report the results.
Database: The database is used to persist job definitions, execution history, and audit logs.

Scheduling Logic

Now let’s talk about how jobs are actually triggered.

All jobs are stored in the database with metadata like:

job_id, interval, next_run_at, owner, status

Storing the metadata for each job is the function of the Job Service we mentioned before.

The scheduler periodically polls the database for jobs due to run and does the following:

It pushes each due job to the distributed queue.
Workers consume from the queue and execute.
Results and timestamps are written back to the database.
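
One polling pass of the scheduler can be sketched like this. It is an in-memory illustration: the arrays stand in for the jobs table and the distributed queue, the function name schedulerTick is hypothetical, and interval_ms assumes the interval is stored in milliseconds (the metadata above only names the field interval).

```typescript
interface Job {
  job_id: string;
  interval_ms: number; // assumed unit for the "interval" field
  next_run_at: number; // epoch milliseconds
  status: "scheduled" | "paused";
}

// One scheduler pass: enqueue every due job and advance its next run time.
function schedulerTick(jobs: Job[], queue: string[], nowMs: number): void {
  jobs.forEach((job) => {
    if (job.status !== "scheduled" || job.next_run_at > nowMs) return;
    queue.push(job.job_id);
    // Advance from the scheduled time (not from now) and skip any missed slots,
    // so a late scheduler does not enqueue the same job several times in a row.
    const missed = Math.floor((nowMs - job.next_run_at) / job.interval_ms) + 1;
    job.next_run_at += missed * job.interval_ms;
  });
}
```

Whether missed slots should be skipped or replayed is a product decision; this sketch skips them.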

Let’s visualize this process:

The scheduling logic looks simple, right? But things can get tricky when it becomes distributed.

Now that the scheduling logic is out of the way, let’s look at some challenges we might face when dealing with millions of users.

The Challenges

Let’s explore some of the challenges and how to solve them:

1. Avoiding Duplicate Runs

Let’s consider what happens when you have multiple scheduler nodes running.

Each node might poll the same database table and try to execute the same job simultaneously. That will cause duplicate executions, which is a big NO for a scheduling service.

Here are a few ways to prevent duplicates:

Distributed Locks (Redis lock): Before executing, each scheduler node tries to acquire a Redis lock for the job, and only the node that acquired the lock proceeds.
Database Row Locks: Use transactional row-level locking when updating `next_run_at` or `status`.
Consistent Hashing: Assign each job to a specific scheduler node based on a hash of its ID to avoid overlap entirely.
Timestamps: Track `last_executed_at` to prevent re-runs.
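
The lock-based approach can be sketched with an in-memory stand-in for Redis. The LockStore class below is hypothetical; its acquire method mirrors the semantics of the Redis command SET key value NX PX ttl (set only if the key is absent, with an expiry), and release only succeeds for the owner.

```typescript
// In-memory stand-in for a Redis lock, to show the acquire/release rules.
class LockStore {
  private readonly locks = new Map<string, { owner: string; expiresAt: number }>();

  // Succeeds only if nobody holds an unexpired lock on the key.
  acquire(key: string, owner: string, ttlMs: number, nowMs: number = Date.now()): boolean {
    const held = this.locks.get(key);
    if (held && held.expiresAt > nowMs) return false;
    this.locks.set(key, { owner, expiresAt: nowMs + ttlMs });
    return true;
  }

  // Only the owner may release, so a slow node cannot delete another node's lock.
  release(key: string, owner: string): boolean {
    const held = this.locks.get(key);
    if (!held || held.owner !== owner) return false;
    this.locks.delete(key);
    return true;
  }
}
```

Including the scheduled run time in the lock key (for example, one key per job per run) means two schedulers racing for the same run collide, while the next run gets a fresh key. The TTL ensures a crashed scheduler cannot hold the lock forever.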

Avoiding duplicates is one of the core challenges of a scheduling service. Now that we have solved that, let’s see how we can scale the system to handle millions of jobs.

2. Scaling the System

When you’re handling millions of jobs, a single scheduler won’t cut it.

Therefore, you can explore the following scalability options:

Partitioning jobs: You can partition jobs by user, tenant, or time window.
Sharding queues: Where each scheduler writes to its own queue partition.
Leader election: You can use Zookeeper or etcd to coordinate active schedulers.
Auto-scaling workers: You can scale each worker independently.

Now your system can elastically scale based on load. However, the system can still break, and that’s inevitable. So, how do we tackle this so the scheduling service does not crash?

3. Fault Tolerance

It is common knowledge that distributed systems sometimes fail, and that’s inevitable.

Therefore, the interviewer will expect that your system is designed for fault tolerance and to gracefully recover from failures.

Let’s look at how to achieve fault tolerance in our system:

Worker Node Crashes: When designing your worker node, make sure it uses ack/nack so that if a worker dies mid-task, the job is re-queued for another worker.
Scheduler Fails or Restarts: Design your schedulers to be stateless, so you can restart each scheduler freely.
Persistence: Every job’s lifecycle (created, scheduled, executed, failed, retried) is stored in the DB for audit and replay.

Designing your system to gracefully recover from failures is the best thing to do when building for millions. However, the next best thing is to observe when something is wrong and what caused it.

4. Observability & Management

While designing this system and answering your interviewer’s question, you need to understand that your scheduler service is not only about execution but also visibility.

You will need monitoring and observability in place so you can keep learning how the service behaves and where to improve it or fix bugs.

You will need to set the following up:

APIs and dashboards to check job status, next runs, and logs.
Metrics:
- Job latency
- Success/failure rates
- Queue depth
- Worker throughput
Retries with exponential backoff for transient failures.
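
The retry delay can be sketched as a small pure function. The name backoffDelayMs, the defaults, and the use of random jitter are assumptions for illustration; jitter is added here so that many workers retrying at once do not all wake up at the same moment.

```typescript
// Exponential backoff with jitter: the ceiling doubles per attempt up to maxMs,
// and the actual delay is picked uniformly below that ceiling.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 1000,
  maxMs: number = 60000,
  random: () => number = Math.random, // injectable for tests
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```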

This helps SREs and developers trust the system. Solving all these challenges will put your scheduling service in a position to handle millions of users. However, we can still optimize further.

Optimization Tricks

Now that the core of the system is out, you can start making it smarter.

Let’s explore some optimization tricks that you can add or discuss with your interviewer to give you an upper hand in the interview.

Priority Queues: Urgent jobs get scheduled first.
Redis Cache: Cache active or hot jobs to reduce DB polling.
Batching: Combine small periodic jobs into one batch job to save resources.
Adaptive Polling: Dynamically adjust the scheduler’s polling interval based on system load.
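
As one possible heuristic for adaptive polling, the scheduler can poll faster while it keeps finding due jobs and back off while it finds none. The function name nextPollIntervalMs and the bounds are assumptions for illustration; a real system might also factor in queue depth or database load.

```typescript
// Halve the interval when the last poll found work, double it when it found none,
// and keep the result within [minMs, maxMs].
function nextPollIntervalMs(
  currentMs: number,
  dueJobsFound: number,
  minMs: number = 100,
  maxMs: number = 5000,
): number {
  const next = dueJobsFound > 0 ? currentMs / 2 : currentMs * 2;
  return Math.min(maxMs, Math.max(minMs, next));
}
```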

Final Answer

Here’s exactly how to put your answer forward to the interviewer:

“I will design a distributed, fault-tolerant job scheduling system with a central job database, horizontally scaled schedulers using Redis locks for deduplication, a distributed queue for task dispatching, and stateless workers for execution — observable, elastic, and cloud-ready.”

Closing Thoughts

Designing something as simple as “cron at scale” teaches you nearly every backend concept that matters:

Distributed locking
Leader election
Message queues
Fault tolerance
Observability
Scalability

It’s a masterclass in system design thinking.

So the next time an interviewer asks you this question, don’t just say “I’ll use a queue.”

Walk them through why and how you’d make it reliable, scalable, and production-grade.

I hope you learned something today: Spread the love. Share this newsletter with at least two of your friends today.

Also, let me know if you enjoy this new series and if you want me to continue breaking down interview questions like this.

Remember to start learning backend engineering from our courses:

Get a 50% discount on any of these courses. Reach out to me (Reply to this mail)

Backend Engineering Resources

Whenever you’re ready

There are 4 ways I can help you become a great backend engineer:

2. The MB Academy: The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon

API and API Design: API Performance

Solomon Eseme — Thu, 16 Oct 2025 06:19:42 GMT

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Public Announcement:

Welcome back, Great Backend Engineers. I recently moved my newsletter from Beehiiv to Substack.

Therefore, this is a public notice that you’re now receiving this email from Substack. Everything else remains the same. Same great weekly content on backend engineering is now coming to the Substack platform.

Previously, I completed the series on the top 10 security risks from the OWASP top 10 list, and in this issue, I will explore API Performance as part of APIs and API Design.

API Performance is crucial when talking about APIs and API design. It refers to the efficiency and speed at which a developed API can execute tasks and return responses.

What is API Performance?

API Performance refers to the speed at which an API responds to a request and also the efficiency of the responses that are returned.

API performance directly impacts the responsiveness of an application, determining how quickly data can be exchanged, processed, and presented to the end-user.

When your API is slow, it directly affects the experience of the end user.

Therefore, improving the performance of your API resolves the problems related to the user experience and enhances the overall performance of the application that the API is integrated with.

Next, let’s explore 7 strategies to increase the performance of your backend APIs:

Top 7 Strategies to 10x Your API Performance

Below are 7 effective strategies to boost your API performance:

Caching
Pagination
Avoid N+1 Queries
Compression
Connection Pooling
Serialization
Use Asynchronous Logging

1. Use Caching

Caching is one of the most powerful techniques to improve API performance. By storing frequently accessed data in memory, you reduce the need to repeatedly fetch data from databases or external services.

A diagram showing caching strategies

Caching with an in-memory store like Redis or Memcached lets your API serve many responses without running expensive database queries.

Consider implementing multiple caching layers:

Client-side caching for static assets
Server-side caching for database queries
HTTP caching headers to leverage browser and CDN caches.

Cache invalidation is equally important—ensure you have a strategy to refresh cached data when it becomes stale.

Most modern applications use a combination of time-based expiration (TTL) and event-based invalidation to maintain data consistency while maximizing performance gains.
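
Here is a small sketch of how TTL expiration and event-based invalidation can work together. TTLCache is a hypothetical in-process class I wrote for illustration; with Redis or Memcached you would use the store's own expiry and delete commands instead.

```python
import time


class TTLCache:
    """Tiny in-process cache with TTL expiry and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the slow source
        value = loader(key)  # miss or expired: load and remember
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Event-based invalidation: call this when the underlying data changes."""
        self._store.pop(key, None)
```

In an API handler, you would call get_or_load on reads and invalidate inside the code path that updates the record, so readers never wait for the TTL to see a change you already know about.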

2. Pagination

Large datasets can significantly slow down your API responses.

Pagination breaks down large result sets into smaller, manageable chunks that are easier to transmit and process.

By limiting the number of records returned per request, you reduce bandwidth consumption, decrease memory usage on both client and server, and dramatically improve response times.

Implement cursor-based pagination for better performance with large datasets, as it handles additions and deletions more gracefully than offset-based pagination.

Always set sensible default page sizes and allow clients to configure limits within acceptable boundaries. This approach also improves user experience by delivering data progressively rather than making users wait for massive responses.
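
To illustrate cursor-based pagination, here is a minimal sketch using Python's built-in sqlite3 module. The items table, the fetch_page function, and the MAX_PAGE_SIZE value are assumptions made up for this example.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # assumed upper bound for client-supplied limits


def fetch_page(conn, after_id=0, limit=20):
    """Return (rows, next_cursor). The cursor is the last id seen, not an offset."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # keep limits within bounds
    # Fetch one extra row so we know whether another page exists
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = rows[-1][0] if has_more else None
    return rows, next_cursor
```

Because the query filters on id > cursor rather than skipping a number of rows, deleting or inserting earlier records between requests does not cause the client to miss or repeat items.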

3. Avoid N+1 Queries

The N+1 query problem occurs when your application executes one query to fetch parent records, then executes N additional queries to fetch related data for each parent record.

This can turn a simple operation into dozens or hundreds of unnecessary database roundtrips, severely degrading performance.

Use eager loading techniques like JOIN operations, database-level population, or GraphQL batching to fetch all required data in fewer queries.

ORMs like Sequelize, Hibernate, and TypeORM provide built-in eager-loading options that help you avoid N+1 queries.

Regularly audit your API logs and database query patterns to identify and eliminate these bottlenecks before they impact production performance.
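
Here is a small sketch that shows the difference using Python's built-in sqlite3 module. The authors and posts tables and both function names are made up for illustration; each function also returns how many queries it ran.

```python
import sqlite3


def setup_db():
    """Build a throwaway in-memory database with two authors and three posts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Linus');
        INSERT INTO posts VALUES (1, 1, 'Intro'), (2, 1, 'Part 2'), (3, 2, 'Kernel');
    """)
    return conn


def posts_n_plus_one(conn):
    """1 query for authors + 1 query per author = N+1 round trips."""
    queries = 0
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def posts_with_join(conn):
    """Same data in a single round trip using a JOIN."""
    rows = conn.execute(
        "SELECT a.name, p.title FROM authors a "
        "LEFT JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts still appear
            result[name].append(title)
    return result, 1
```

With 2 authors the first version runs 3 queries and the second runs 1. With 500 authors, that gap becomes 501 versus 1.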

4. Compression

Compressing API responses using gzip or Brotli can reduce payload sizes by up to 70%, dramatically decreasing bandwidth consumption and network transmission times.

Modern HTTP servers and client libraries support compression transparently, making it one of the easiest wins for performance optimization.

Enable compression at the server level for all text-based responses including JSON, XML, and HTML. Set appropriate compression levels based on your infrastructure—higher compression ratios save bandwidth but consume more CPU resources.

Monitor the trade-off between compression overhead and transmission savings, as the benefits are most pronounced for larger payloads while small responses might not justify the computational cost.
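
As a sketch of that trade-off, the function below only compresses when the client accepts gzip and the body is large enough to be worth it. The encode_response name and the MIN_COMPRESS_BYTES threshold are assumptions, and the Accept-Encoding check is deliberately simplistic (it ignores quality values). In practice your web server or framework usually handles this for you.

```python
import gzip
import json

MIN_COMPRESS_BYTES = 1024  # assumed threshold; tune it from your own measurements


def encode_response(payload, accept_encoding=""):
    """Return (body_bytes, headers). Gzip only when allowed and worthwhile."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if "gzip" in accept_encoding.lower() and len(body) >= MIN_COMPRESS_BYTES:
        # Level 6 is a middle ground between CPU cost and compression ratio
        body = gzip.compress(body, compresslevel=6)
        headers["Content-Encoding"] = "gzip"
        headers["Vary"] = "Accept-Encoding"
    return body, headers
```

Small responses pass through untouched, so you do not pay the CPU cost where the bandwidth savings would be negligible.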

5. Connection Pooling

Every database connection involves overhead—establishing connections, authentication, and resource allocation. Creating new connections for each request is wasteful and creates unnecessary latency.

Connection pooling maintains a pool of pre-established database connections that are reused across requests, significantly reducing connection overhead.

Configure appropriate pool sizes based on your application’s concurrency requirements and database capacity. Too small a pool creates bottlenecks, while too large a pool wastes resources.

Most modern backend frameworks include connection pooling out of the box, but fine-tuning parameters like minimum pool size, maximum pool size, and connection timeout ensures optimal performance for your specific workload.
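
To show the idea behind a pool, here is a minimal sketch built on Python's queue.Queue. The ConnectionPool class is hypothetical and uses in-memory SQLite connections as a stand-in for a real database. For production, prefer the pool that ships with your database driver or framework.

```python
import contextlib
import queue
import sqlite3


def _default_factory():
    # Stand-in for opening a real database connection
    return sqlite3.connect(":memory:", check_same_thread=False)


class ConnectionPool:
    """Fixed-size pool: connections are opened once and reused across requests."""

    def __init__(self, size, timeout=5.0, factory=_default_factory):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # pay the connection cost up front

    @contextlib.contextmanager
    def connection(self):
        # Blocks until a connection is free; raises queue.Empty after the timeout
        conn = self._idle.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back
```

A request handler would then wrap its queries in "with pool.connection() as conn:". The size and timeout arguments correspond to the maximum pool size and connection timeout mentioned above.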

6. Serialization

The process of converting application objects into transmissible formats (JSON, XML, Protocol Buffers) consumes CPU resources. Optimizing serialization reduces this overhead and speeds up response times.

Choose efficient serialization formats and consider using lightweight alternatives like Protocol Buffers or MessagePack for performance-critical APIs.

Implement selective field serialization to exclude unnecessary data from responses, reducing payload sizes and serialization time.

Use streaming serialization for large responses to avoid holding entire datasets in memory. Benchmark different serialization strategies in your environment to identify which approach balances performance, bandwidth, and compatibility for your use case.
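
Here is a minimal sketch of selective field serialization combined with streaming, using only the standard json module. Both function names are my own for this example, and a real framework would likely offer its own streaming response type.

```python
import json


def serialize_fields(record, fields):
    """Selective serialization: emit only the fields the client asked for."""
    selected = {key: record[key] for key in fields if key in record}
    # Compact separators drop the optional whitespace from the output
    return json.dumps(selected, separators=(",", ":"))


def stream_json_array(records, fields):
    """Yield a JSON array piece by piece instead of building it all in memory."""
    yield "["
    for index, record in enumerate(records):
        if index:
            yield ","
        yield serialize_fields(record, fields)
    yield "]"
```

Because stream_json_array is a generator, it can consume a database cursor row by row and hand each chunk to the HTTP response as it is produced.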

7. Use Asynchronous Logging

Synchronous logging writes logs directly to disk before the application continues, creating blocking I/O operations that slow down request processing.

Asynchronous logging buffers log entries in memory and writes them to disk separately, allowing your application to continue handling requests without waiting for disk I/O.

Implement async logging using message queues or dedicated logging libraries designed for high-throughput scenarios.

This approach not only improves API response times but also prevents a single slow I/O operation from impacting your entire application. Ensure you have appropriate buffer sizes and overflow handling to prevent data loss during high-load periods.
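
Python's standard library already ships the pieces for this: logging.handlers.QueueHandler and QueueListener. The sketch below wires them together. The build_async_logger helper and the max_buffer default are my own assumptions.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, target_handler, max_buffer=10_000):
    """Return (logger, listener). Log calls only enqueue; a background
    thread owned by the listener does the slow I/O."""
    # Bounded buffer: when it is full, QueueHandler drops the record and
    # reports the error through the handler's handleError hook
    log_queue = queue.Queue(maxsize=max_buffer)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener
```

You would pass something like logging.FileHandler("api.log") as target_handler and call listener.stop() on shutdown so that buffered records are flushed before the process exits.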

Did you learn any new things from this newsletter this week? Please reply to this email and let me know. Feedback like this encourages me to keep going.

LAST WORD 👋

How am I doing?

Hit reply and say hello - I’d love to hear from you!

Stay awesome,
Solomon