TL;DR
Building production AI web applications requires a three-layer architecture that separates intelligence (LLM orchestration and prompt management), resilience (rate limiting, cost tracking, error handling), and experience (streaming UI and user feedback). The most common failures in AI products are operational: uncontrolled API costs, no rate limiting, unstructured LLM output that breaks the UI, and no fallback when the AI service goes down. Key patterns that prevent these failures include model routing (using cheaper models like Claude Haiku for simple tasks and reserving expensive models for complex reasoning), structured output via function calling or Zod schemas, semantic caching to reduce redundant API calls by 30-50%, and streaming responses using the Vercel AI SDK for perceived performance. AI features cost $0.01-$20+ per user per month depending on usage intensity, making cost tracking per user essential for sustainable unit economics. The difference between an AI demo and a production AI product is entirely in these operational layers.
Every founder I talk to in 2026 wants AI in their product. And fair enough - the technology is genuinely transformative. But there's a massive gap between a ChatGPT wrapper demo and a production AI feature that handles thousands of users without bleeding money or breaking down.
I've built AI-powered features across multiple production applications - content generation tools, document analysis systems, intelligent search interfaces, and AI-assisted workflows. The technical challenges that surface at scale are completely different from what you encounter in a tutorial.
This is what I've learned about building AI features that actually survive contact with real users.

The Three Layers of AI Integration
Every AI feature I build follows a three-layer architecture. This isn't theoretical - it's the pattern that has survived production workloads without requiring a rewrite.
Layer 1: The Intelligence Layer (LLM Orchestration)
This is where you interact with AI models - OpenAI, Anthropic's Claude, or open-source models. The critical decisions here aren't about which model to use (they all work). They're about:
Prompt management. Your prompts are code. They should live in version control, be testable, and be parameterized. I keep them in a dedicated /prompts directory with TypeScript types for their inputs and outputs.
// lib/prompts/analyze-document.ts
export interface DocumentAnalysisInput {
  content: string;
  documentType: "contract" | "report" | "proposal";
  extractionGoals: string[];
}

export interface DocumentAnalysisOutput {
  summary: string;
  keyFindings: { finding: string; confidence: number }[];
  risks: { description: string; severity: "low" | "medium" | "high" }[];
}

export function buildAnalysisPrompt(input: DocumentAnalysisInput): string {
  return `You are a document analysis expert. Analyze the following ${input.documentType}.
Extract the following information:
${input.extractionGoals.map((g, i) => `${i + 1}. ${g}`).join("\n")}
Document content:
---
${input.content}
---
Respond in JSON format matching this structure:
{
"summary": "...",
"keyFindings": [{ "finding": "...", "confidence": 0.0-1.0 }],
"risks": [{ "description": "...", "severity": "low|medium|high" }]
}`;
}

This pattern gives you versioning, testing, and reusability. When the model changes its behavior after an update (and it will), you know exactly which prompt to adjust.
Model routing. Not every AI task needs GPT-4o or Claude Opus. I route based on complexity:
- Simple classification/extraction → Smaller, faster models (Claude Haiku, GPT-4o-mini)
- Complex reasoning/generation → Larger models (Claude Sonnet, GPT-4o)
- High-stakes analysis → Best available model with multiple passes
This single decision can cut your AI costs by 60-70% without degrading the user experience.
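A minimal sketch of what such a router could look like. The task categories, the `routeModel` name, and the `BEST_AVAILABLE_MODEL` placeholder are assumptions made for illustration; the two concrete model IDs are the ones used elsewhere in this article.

```typescript
// Hypothetical task categories; adapt them to the kinds of work your product does.
type TaskKind =
  | "classification"
  | "extraction"
  | "reasoning"
  | "generation"
  | "high-stakes";

interface RouteDecision {
  model: string;
  passes: number; // how many independent runs to compare before trusting the result
}

// Placeholder: substitute the strongest model you have access to.
const BEST_AVAILABLE_MODEL = "your-best-model-id";

const ROUTES: Record<TaskKind, RouteDecision> = {
  classification: { model: "claude-haiku-4-5", passes: 1 },
  extraction: { model: "claude-haiku-4-5", passes: 1 },
  reasoning: { model: "claude-sonnet-4-6", passes: 1 },
  generation: { model: "claude-sonnet-4-6", passes: 1 },
  "high-stakes": { model: BEST_AVAILABLE_MODEL, passes: 2 },
};

// Look up the model and pass count for a given kind of task.
function routeModel(task: TaskKind): RouteDecision {
  return ROUTES[task];
}
```

Keeping the table in one place means a pricing change or a new model release is a one-line edit rather than a search through every call site.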

Layer 2: The Resilience Layer (Error Handling & Cost Control)
This is where most AI products fail. The demo works beautifully. Then a real user submits a 50-page PDF, or sends 200 requests in an hour, or inputs text in Bengali mixed with English, and everything falls apart.
Rate limiting and queuing. Every AI endpoint needs rate limiting. Not just for cost control, but because LLM APIs have their own rate limits, and you need to queue gracefully instead of throwing errors.
// lib/ai/rate-limiter.ts
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, "1 m"), // 20 requests per minute
  analytics: true,
});

export async function checkAIRateLimit(userId: string) {
  const { success, remaining, reset } = await ratelimit.limit(userId);
  if (!success) {
    return {
      allowed: false,
      retryAfter: Math.ceil((reset - Date.now()) / 1000),
    };
  }
  return { allowed: true, remaining };
}

Cost tracking per user. You need to know how much each user costs you in AI spend. I log every API call with token counts and calculate costs in real time.
// lib/ai/cost-tracker.ts
// Prices are in USD per token (listed per million tokens, then divided).
// Provider pricing changes, so check these against the current price lists.
const MODEL_COSTS = {
  "claude-sonnet-4-6": { input: 3.0 / 1_000_000, output: 15.0 / 1_000_000 },
  "claude-haiku-4-5": { input: 0.8 / 1_000_000, output: 4.0 / 1_000_000 },
  "gpt-4o-mini": { input: 0.15 / 1_000_000, output: 0.6 / 1_000_000 },
} as const;

export function calculateCost(
  model: keyof typeof MODEL_COSTS,
  inputTokens: number,
  outputTokens: number,
): number {
  const costs = MODEL_COSTS[model];
  return inputTokens * costs.input + outputTokens * costs.output;
}

When a user on a $29/month plan has consumed $15 in AI costs, you need to know that. This isn't premature optimization - it's the difference between a sustainable SaaS business and one that loses money on every active user.
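One way to surface that number is a small per-user ledger that accumulates spend and flags users who cross a budget. This is a sketch under assumptions: the `UsageLedger` class, the in-memory `Map`, and the 20% budget share are hypothetical, and a real system would persist spend in a database and reset it each billing period.

```typescript
interface BudgetStatus {
  spentUsd: number;
  budgetUsd: number;
  overBudget: boolean;
}

// Hypothetical in-memory ledger, for illustration only.
class UsageLedger {
  private spendByUser = new Map<string, number>();
  private planPriceUsd: number;
  private budgetShare: number;

  // budgetShare: fraction of the plan price you accept spending on AI (assumed 0.2).
  constructor(planPriceUsd: number, budgetShare: number = 0.2) {
    this.planPriceUsd = planPriceUsd;
    this.budgetShare = budgetShare;
  }

  // Add the cost of one API call and report where the user stands.
  record(userId: string, costUsd: number): BudgetStatus {
    const spentUsd = (this.spendByUser.get(userId) ?? 0) + costUsd;
    this.spendByUser.set(userId, spentUsd);
    const budgetUsd = this.planPriceUsd * this.budgetShare;
    return { spentUsd, budgetUsd, overBudget: spentUsd > budgetUsd };
  }
}
```

The `costUsd` argument would come from something like `calculateCost` above. What to do when `overBudget` flips to true (throttle, downgrade to a cheaper model, or prompt an upgrade) is a product decision rather than a technical one.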
Graceful degradation. LLM APIs go down. They return errors. They time out. Your product cannot break when this happens. I implement fallback chains: primary model → fallback model → cached response → graceful error message. The user should never see a raw API error.
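The fallback chain can be sketched as a loop over steps that each get a time budget. The names `withFallback`, `withTimeout`, and `Step` are hypothetical, and the steps themselves (primary model call, fallback model call, cache lookup) are left to the caller to supply.

```typescript
interface Step<T> {
  name: string;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  value: T;
  servedBy: string;
}

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Try each step in order; if all fail, return the last-resort value
// (for example, a friendly error message) instead of throwing.
async function withFallback<T>(
  steps: Step<T>[],
  lastResort: T,
  timeoutMs: number = 10_000,
): Promise<FallbackResult<T>> {
  for (const step of steps) {
    try {
      const value = await withTimeout(step.run(), timeoutMs);
      return { value, servedBy: step.name };
    } catch {
      // In real code, log the failure here before moving to the next step.
    }
  }
  return { value: lastResort, servedBy: "static-message" };
}
```

Returning `servedBy` alongside the value makes it easy to track how often each tier is actually serving users, which is a useful early signal of provider trouble.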
Layer 3: The Experience Layer (Streaming UI)
This is what the user actually sees, and it's where the perception of quality lives.
Streaming responses are non-negotiable for any AI feature that generates text. Waiting 8 seconds for a response with a spinner feels broken. Watching text appear word-by-word feels fast and engaging, even if the total time is the same.
// app/api/generate/route.ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

export async function POST(req: Request) {
  // useChat posts the conversation as `messages`, so read that rather than a single prompt
  const { messages } = await req.json();
  const result = streamText({
    model: anthropic("claude-sonnet-4-6"),
    system: "You are a helpful writing assistant...",
    messages,
  });
  return result.toDataStreamResponse();
}

// components/ai-writer.tsx
'use client'
import { useChat } from 'ai/react'

export function AIWriter() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: '/api/generate' })
  return (
    <div className="space-y-4">
      {messages.map((m) => (
        <div key={m.id} className={m.role === 'user' ? 'text-right' : ''}>
          {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="What would you like to write?"
          disabled={isLoading}
        />
      </form>
    </div>
  )
}

The Vercel AI SDK handles streaming, parsing, and state management. The entire streaming UI is about 20 lines of code. But the perceived quality difference is enormous.
Patterns I've Learned the Hard Way
Structured Output Is Your Best Friend
The biggest source of bugs in AI applications is unstructured LLM output. You ask for JSON, you get JSON with markdown backticks. You ask for a list, you get a paragraph. You ask for a number, you get "approximately 42."
Use structured output modes (function calling, tool use, JSON mode) whenever possible. When you need free-form text, validate the output before passing it to your UI.
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

const ProductDescription = z.object({
  headline: z.string().max(60),
  description: z.string().max(300),
  bulletPoints: z.array(z.string()).length(4),
  tone: z.enum(["professional", "casual", "playful"]),
});

// productName is assumed to be supplied by the surrounding code
const result = await generateObject({
  model: anthropic("claude-sonnet-4-6"),
  schema: ProductDescription,
  prompt: `Write a product description for: ${productName}`,
});
// result.object is fully typed and validated
No parsing. No regex. No "hope the model follows instructions." The schema enforces the structure.
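For the cases where a structured output mode is not available and the model returns raw text, a defensive parser is the fallback. The sketch below handles the "JSON wrapped in markdown backticks" failure described above and then checks the shape by hand. The names `stripCodeFence`, `isRisk`, and `parseRisks` are hypothetical, and the `Risk` shape mirrors the risks field from the document analysis prompt earlier.

```typescript
interface Risk {
  description: string;
  severity: "low" | "medium" | "high";
}

// If the model wrapped its answer in a markdown code fence, return only the inner text.
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(/^`{3}[a-zA-Z]*\s*([\s\S]*?)\s*`{3}$/);
  const inner = match ? match[1] : undefined;
  return (inner ?? trimmed).trim();
}

// Runtime check that an unknown value really has the Risk shape.
function isRisk(value: unknown): value is Risk {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.description === "string" &&
    (v.severity === "low" || v.severity === "medium" || v.severity === "high")
  );
}

// Returns the parsed risks, or null if the text is not valid JSON of the expected shape.
function parseRisks(raw: string): Risk[] | null {
  try {
    const parsed: unknown = JSON.parse(stripCodeFence(raw));
    return Array.isArray(parsed) && parsed.every(isRisk) ? parsed : null;
  } catch {
    return null; // not JSON at all, e.g. "approximately 42"
  }
}
```

A `null` result is the signal to retry, fall back, or show a graceful error, rather than letting a malformed object reach the UI.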
Context Windows Are Not Infinite
Founders frequently ask me to build features like "analyze this entire codebase" or "summarize all our customer feedback." The assumption is that you can dump everything into the AI and get a perfect answer.
Reality: context windows have limits, and even within those limits, model performance degrades with very long inputs. My approach:
- Chunk intelligently. Don't split at arbitrary character counts. Split at semantic boundaries - paragraphs, sections, logical units.
- Use embeddings for retrieval. Store your data as vector embeddings. When a user asks a question, retrieve only the relevant chunks and send those to the model. This is RAG (Retrieval-Augmented Generation), and it's essential for any AI product working with large datasets.
- Summarize progressively. For very large documents, summarize sections first, then summarize the summaries. This map-reduce pattern handles documents of any size.
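The first two points can be sketched without any particular vector database. The code below packs paragraphs into chunks and ranks already-embedded chunks by cosine similarity. The function names are hypothetical, and the embeddings are assumed to come from whichever embedding model you use; producing them is outside this sketch.

```typescript
// Split text at blank lines and pack whole paragraphs into chunks of at most maxChars.
// A single paragraph longer than maxChars is kept intact as its own oversized chunk.
function chunkByParagraph(text: string, maxChars: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Cosine similarity of two vectors; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  const denom = normA * normB;
  return denom === 0 ? 0 : dot / denom;
}

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Return the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}
```

A linear scan like `topK` is fine for a few thousand chunks. Beyond that, a vector index does the same ranking more efficiently.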
Cache Aggressively
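If 50 users ask the same question about your product, you don't need 50 API calls. I cache responses: normalize and hash the input, check the cache, and return the cached response if it exists. Strictly speaking, hashing gives you exact-match caching. Semantic caching goes a step further and matches on embedding similarity, so differently worded versions of the same question also hit the cache.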
For dynamic content, I use a TTL (time-to-live) based on how frequently the underlying data changes. Product descriptions might cache for 24 hours. Real-time analytics summaries might cache for 5 minutes.
This alone can reduce your AI costs by 30-50% for products with overlapping user queries.
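A minimal sketch of the exact-match variant with a TTL. The `ResponseCache` class is hypothetical and keeps entries in memory; a production setup would more likely use Redis or a similar shared store. It keys on the normalized prompt directly, and hashing that string is an optional extra step to keep keys short.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical in-memory cache, for illustration only.
class ResponseCache {
  private entries = new Map<string, CacheEntry>();
  private now: () => number;

  // `now` is injectable so tests can control time; it defaults to Date.now.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Normalize so trivially different inputs ("  Hello " vs "hello") share a key.
  private key(model: string, prompt: string): string {
    return `${model}:${prompt.trim().toLowerCase().replace(/\s+/g, " ")}`;
  }

  // Return the cached response, or undefined if missing or expired.
  get(model: string, prompt: string): string | undefined {
    const k = this.key(model, prompt);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(k);
      return undefined;
    }
    return entry.value;
  }

  // Store a response for ttlMs milliseconds.
  set(model: string, prompt: string, value: string, ttlMs: number): void {
    this.entries.set(this.key(model, prompt), {
      value,
      expiresAt: this.now() + ttlMs,
    });
  }
}
```

With the TTLs mentioned above, a product description would be stored with `24 * 60 * 60 * 1000` and an analytics summary with `5 * 60 * 1000`.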
The Cost Reality Check
Let me be direct about something founders need to hear: AI features are expensive to operate. Not expensive like server hosting. Expensive like a utility bill that scales directly with usage.
A typical SaaS product with AI features might cost:
- Light usage (classification, short generation): $0.01 - $0.05 per user per month
- Medium usage (document analysis, chat): $0.50 - $2.00 per user per month
- Heavy usage (continuous generation, large context): $5.00 - $20.00+ per user per month

If you're charging $29/month and your AI costs are $15/user, your unit economics don't work. This is why model routing, caching, and usage limits aren't nice-to-haves - they're survival requirements.
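To make the arithmetic concrete, here is a back-of-the-envelope estimator. The usage profile (100 requests a month at 2,000 input and 500 output tokens each) is an assumption chosen for illustration, and the $3 / $15 per million token prices are the Sonnet figures from the cost table earlier.

```typescript
interface UsageProfile {
  requestsPerMonth: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
}

interface Pricing {
  inputUsdPerMTok: number; // USD per million input tokens
  outputUsdPerMTok: number; // USD per million output tokens
}

// Estimated AI spend for one user over one month, in USD.
function monthlyCostPerUser(usage: UsageProfile, pricing: Pricing): number {
  const inputCost = (usage.inputTokensPerRequest / 1_000_000) * pricing.inputUsdPerMTok;
  const outputCost = (usage.outputTokensPerRequest / 1_000_000) * pricing.outputUsdPerMTok;
  return usage.requestsPerMonth * (inputCost + outputCost);
}

// Fraction of the plan price consumed by AI spend.
function aiCostShare(costUsd: number, planPriceUsd: number): number {
  return costUsd / planPriceUsd;
}

const estimatedCost = monthlyCostPerUser(
  { requestsPerMonth: 100, inputTokensPerRequest: 2000, outputTokensPerRequest: 500 },
  { inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
);
console.log(estimatedCost.toFixed(2)); // "1.35", inside the medium-usage band
console.log((aiCostShare(15, 29) * 100).toFixed(1)); // "51.7" percent of a $29 plan
```

Running the same function with a cheaper model's prices for the requests that can tolerate it shows quickly how much routing is worth for a given usage profile.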
Building AI Features That Ship
The most important thing I can tell a founder about AI features: start with the user problem, not the technology. The question isn't "how do we add AI?" It's "what manual, tedious, or impossible task can we automate for our users?"
Once you have that answer, the architecture follows naturally:
- What data does the AI need? → Determines your retrieval strategy
- What format should the output be? → Determines structured vs. free-form
- How fast does it need to be? → Determines streaming, caching, model choice
- How much can you spend per request? → Determines model routing
- What happens when the AI is wrong? → Determines your fallback and review strategy
Every AI feature I build starts with these five questions. The code comes after.
Building an AI-powered product and need architecture guidance? I help startup founders design and build AI features that scale. Let's talk.

WRITTEN BY
Suhag Al Amin
Senior full-stack engineer specializing in SaaS MVPs and AI-powered web apps. 6+ years shipping production products for startup founders.
Common questions.
- How much does it cost to add AI features to a SaaS product?
- AI API costs vary significantly by usage pattern. Light usage like classification and short text generation costs $0.01-0.05 per user per month. Medium usage like document analysis and conversational AI costs $0.50-2.00 per user per month. Heavy usage like continuous generation and large context windows can cost $5-20+ per user per month. These costs can be managed through model routing, caching, and usage limits.
- Which AI model should I use for my product — OpenAI or Claude?
- Both work well for most applications. The more important decision is model routing — using the right size model for each task. Simple extraction and classification tasks should use smaller, cheaper models (GPT-4o-mini, Claude Haiku) while complex reasoning uses larger models (GPT-4o, Claude Sonnet). This approach can reduce AI costs by 60-70% without degrading user experience.
- How do I handle AI hallucinations in production?
- Use structured output (function calling, JSON mode, Zod schema validation) to constrain the model's response format. Implement confidence scoring for critical outputs. Add human-in-the-loop review for high-stakes decisions. Use RAG (Retrieval-Augmented Generation) to ground responses in your actual data rather than the model's training data. Never display raw AI output without validation.
- Can I build AI features with Next.js?
- Yes, Next.js is one of the best frameworks for AI web applications. The Vercel AI SDK provides streaming support, the App Router's server-side capabilities handle API orchestration securely, and Server Actions simplify the client-server boundary for AI interactions. The combination of Next.js + Vercel AI SDK + a model provider (OpenAI, Anthropic) is the fastest path to shipping AI features.
- How do I make AI responses feel fast to users?
- Streaming is the primary solution. Instead of waiting for the complete AI response and displaying it all at once, stream tokens to the UI as they are generated. The Vercel AI SDK handles this with approximately 20 lines of code. Users perceive streaming responses as significantly faster than buffered responses, even when the total generation time is identical.


