Getting Started with LLMOps: Turning LLMs into Real Productivity
Two years ago, my company decided to integrate AI capabilities into our product line. A few backend engineers spent two months wiring up the GPT-4 API and getting conversations running. And then? It fell apart on day one, not because the model was bad, but because nobody knew how to modify prompts, control outputs, read logs, or switch models.
This is probably the typical situation most teams encounter when they first try to "put AI into production": the model is powerful, but the engineering capability to support it is lacking. This is exactly what LLMOps solves.
1. First, Understand What an LLM Is
Before talking about LLMOps, you need to understand what we're actually managing. What exactly is a large language model, and how does it work? Understanding these basics will help you avoid many detours in real applications.
1.1 Token: LLM's "Word Count"
When dealing with LLMs, the first thing to understand is the token. Think of tokens as the "word count" of the LLM world: the model doesn't process text character by character, it processes it token by token.
Chinese example:
"你好世界" → about 4-6 tokens (depending on implementation)
"hello world" → about 2 tokens
English text splits roughly by word; Chinese splits by character or short phrase
Tokens matter because billing is based on token usage. Inputting "Who are you?" costs money, and outputting "I am an AI assistant" also costs money. Drop in a several-hundred-page PDF, and the token cost can be far higher than you'd expect.
Take OpenAI's GPT-4 as an example:
Input: ~$0.01 - $0.03 per 1K tokens
Output: ~$0.03 - $0.06 per 1K tokens
1K tokens ≈ 750 English words, or 400-500 Chinese characters
A medium-length Chinese paragraph (500 characters) consumes about 300-500 tokens
So in real projects, controlling token consumption is key to cost optimization. Common approaches include:
- Summarizing input documents, keeping only key information
- Setting max_tokens to limit output length
- Using caching to avoid recomputing identical content
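To make the rates above concrete, here is a back-of-envelope cost estimator. The mid-range prices are taken from the figures listed earlier and are illustrative placeholders; check your provider's current price sheet before budgeting, since these change often.

```python
# Back-of-envelope request cost, using mid-range rates from the
# figures above ($0.02 / 1K input tokens, $0.045 / 1K output tokens).
# These prices are illustrative placeholders, not current quotes.

def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 0.02, out_rate: float = 0.045) -> float:
    """Estimated USD cost of a single LLM request."""
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# A 500-character Chinese paragraph in (~400 tokens), a short answer out.
per_request = request_cost(input_tokens=400, output_tokens=150)
daily = per_request * 10_000  # at 10,000 requests/day
```

At roughly $0.015 per request, 10,000 daily requests already runs close to $150 a day, which is why the optimizations above pay for themselves quickly.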
1.2 From Text to Numbers: Embedding
What an LLM actually does is predict the next token. It takes a bunch of text, converts it into numbers (vectors), then calculates a probability distribution based on those numbers and selects the most likely next token.
This conversion process is called Embedding — embedding text into a high-dimensional vector space. Semantically similar words are also close in distance in vector space.
"Cat" and "Dog" → close (both pets, both have four legs)
"Cat" and "Car" → far apart (completely different semantics)
But sometimes there are "surprises":
"Code farmer" (Chinese internet slang for programmer) and "996" → might be close
Embedding is the foundation of RAG (Retrieval-Augmented Generation). Retrieval is essentially finding the "closest" content in vector space.
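"Close in distance" usually means cosine similarity: the cosine of the angle between two vectors. A toy illustration in plain Python, with made-up 3-dimensional vectors standing in for real embeddings (which typically have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); near 1.0 = same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Hypothetical toy vectors -- real embeddings come from an embedding model.
cat = [0.9, 0.8, 0.1]
dog = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

cosine(cat, dog)   # high: semantically close
cosine(cat, car)   # low: semantically far
```

Retrieval in RAG is exactly this comparison, run against every stored chunk (in practice via an approximate nearest-neighbor index rather than a brute-force loop).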
Vector databases are tools specifically designed to store these embeddings. Common vector databases include:
- Pinecone: Cloud service, hassle-free but paid
- Milvus: Open source, can be deployed locally
- Chroma: Lightweight, great for personal projects
- Weaviate: Feature-rich, supports hybrid retrieval
1.3 The Model's "Thinking": Reasoning and Agents
A base LLM's only capability is text continuation: give it a passage and it continues it. But when you ask it "Help me book a flight to Shanghai tomorrow," it can only respond "Okay, let me help you book that"; it cannot actually book the ticket for you.
This is where Agent comes in. An Agent adds to the LLM:
- Tool calling capability: Can execute code, query databases, call APIs
- Task planning capability: Can break complex tasks into subtasks
- Memory capability: Short-term (context window), long-term (vector database)
Roughly how an Agent works:
Traditional LLM:
User → Model → Text Response
Agent:
User → Model (Thinking: What does the user want? What should I do?)
↓
Planning (Break task into steps)
↓
Tool Calling (Search, query database, execute code...)
↓
Observation (What did the tool return?)
↓
Re-planning (Adjust next step based on results)
↓
... Loop until complete
↓
Final Response
Agents matter because they solve the LLM's "ability to take action" problem. LLMs know a lot, but they can't directly manipulate the external world. Agents let LLMs truly "get moving" through tool calling.
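The loop above can be sketched in a few dozen lines. `scripted_llm` below is a stand-in that fakes the model's decisions so the control flow is runnable; a real agent would send `history` to an actual LLM and parse its reply, and the `tool: ... | arg: ...` format is an invented convention for this sketch.

```python
# Minimal agent loop: decide -> call tool -> observe -> repeat -> answer.

def scripted_llm(history):
    # Fake "reasoning": first request a tool, then finish once an
    # observation is available. A real agent calls a model API here.
    if not any(m.startswith("observation:") for m in history):
        return "tool: search | arg: flights to Shanghai tomorrow"
    return "final: Found 3 flights; the 09:00 departure is cheapest."

def run_agent(user_request, tools, llm, max_steps=5):
    history = [f"user: {user_request}"]
    for _ in range(max_steps):                  # loop until complete
        decision = llm(history)
        if decision.startswith("final:"):       # model declares it's done
            return decision[len("final:"):].strip()
        # Otherwise: parse the tool call, run it, record the observation.
        tool_part, arg_part = decision.split("|")
        name = tool_part.split(":", 1)[1].strip()
        arg = arg_part.split(":", 1)[1].strip()
        history.append(f"observation: {tools[name](arg)}")
    return "Gave up after too many steps."      # fallback guard

tools = {"search": lambda q: f"3 flights matched '{q}'"}
result = run_agent("Book me a flight to Shanghai tomorrow", tools, scripted_llm)
```

The `max_steps` guard matters in practice: without it, a confused model can loop on tool calls indefinitely, burning tokens the whole time.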
2. Why Software Development Needs LLMOps
2.1 What Happens Without LLMOps
Without LLMOps, LLM application development probably looks like this:
1. Backend engineer studies API documentation
2. Writes a bunch of prompts embedded in code
3. Calls GPT-4 / Claude for functionality
4. Goes live and discovers unstable outputs
5. Manually modifies prompts, goes live again
6. Logs? None. Monitoring? None.
7. Want to switch models? Refactor everything
8. Users grow, don't know how to scale
9. Model provider raises prices, helpless
This isn't building a product; it's gambling: gambling that model outputs stay stable, that prompts never need changes, and that users won't ask tricky questions.
I've seen too many teams where engineers are on edge every day after the first AI feature goes live:
- Don't know what questions users asked
- Don't know what the model responded with
- Don't know why it suddenly went off-topic
- Can only guess when accidents happen
2.2 What LLMOps Solves
LLMOps (Large Language Model Operations) is a lifecycle management platform for LLM-based applications, covering development, deployment, monitoring, maintenance, and more.
Its core value is that the platform absorbs the complexity, leaving users with the simple parts:
| Step | Without LLMOps | With LLMOps | Time Saved |
|---|---|---|---|
| Frontend App Development | Integrate and wrap LLM capabilities, significant dev time | Use LLMOps backend directly, API/WebApp-based dev | -80% |
| Prompt Engineering | Debug via API or Playground only | Visual prompt editor, WYSIWYG | -25% |
| Data Prep & Embedding | Write code for long text processing, Embedding | Upload text/files on platform | -80% |
| App Logs & Analytics | Write code to view logs, access database | Platform provides real-time logs & analytics | -70% |
| AI Plugin Development | Write code to create/integrate AI plugins | Platform provides visual tools for quick integration | -50% |
| AI Workflow Development | Write code for each workflow step | Visual workflow orchestration | -80% |
| Model Switching | Change code, refactor API calls | One-click model switch on platform | -90% |
| Performance Monitoring | Hand-rolled monitoring, rarely production-grade | Platform provides built-in metrics monitoring | -60% |
In other words, LLMOps encapsulates all the engineering grunt work so developers can focus on business logic.
2.3 Core Features of an LLMOps Platform
A complete LLMOps platform typically includes these core features:
Application Management
- Create, edit, delete AI applications
- App configuration (model selection, parameter tuning)
- Multi-app management and comparison
Prompt Engineering
- Visual prompt editor
- Prompt version control
- Prompt template marketplace
- A/B testing capability
Knowledge Base Management
- Document upload and parsing
- Multiple chunking strategies
- Embedding configuration
- Knowledge base versioning
Data & Analytics
- Conversation log recording
- Token consumption statistics
- Answer quality evaluation
- User feedback collection
Model Integration
- Multi-model support (OpenAI, Claude, local models...)
- Model performance comparison
- Cost analysis
3. Core Concepts Overview
3.1 Prompt Engineering
A Prompt is how we communicate with LLMs, but it's not that simple. Good prompts and bad prompts can produce vastly different output quality.
Bad Prompt:
"Translate"
Good Prompt:
"You are a professional Chinese-English translator. Translate the following Chinese paragraph into English, maintaining accuracy of professional terminology, with a formal but not stiff tone. For proper nouns, add the original text in parentheses after the translation."
Input: [User Text]
Good prompts need:
- Role definition: What role should the model play ("You are a senior architect")
- Task description: What to do ("Review this code for me")
- Output format: What format of result ("Output in a table with issue location, severity, fix suggestion")
- Constraints: Any limitations ("Do not exceed 500 words")
Several practical Prompt Engineering tips:
1. Few-shot prompting: Show the model examples
Don't just say "Classify this text"
Instead:
"Text: This phone's battery life is terrible, needs charging three times a day. Classification: Negative
Text: The screen display is amazing, great for watching movies. Classification: Positive
Text: The packaging box is quite elegant. Classification:"
After seeing examples, the model better understands what you want
2. Chain of Thought: Let the model think step by step
Don't ask: "What's wrong with this code?"
Instead ask:
"Please analyze this code following these steps:
1. First understand the code's functionality
2. Identify potential security risks
3. Evaluate performance impact
4. Propose improvements
Please answer step by step"
3. Structured Output: Make output more controllable
Don't ask: "Summarize this article"
Instead ask:
"Please summarize the article in JSON format, including:
{
"title": "Article Title",
"summary": "Summary no more than 100 characters",
"keywords": ["Keyword 1", "Keyword 2", "Keyword 3"],
"sentiment": "Positive/Negative/Neutral"
}"
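The payoff of structured output is that downstream code can parse it. A small defensive parser helps: models sometimes wrap JSON in Markdown code fences, so strip those before calling `json.loads` (the field names below match the prompt above):

```python
import json

def parse_model_json(raw: str) -> dict:
    text = raw.strip()
    if text.startswith("```"):
        # Models often wrap JSON in a ```json ... ``` fence; unwrap it.
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)

reply = '```json\n{"title": "Demo", "summary": "...", ' \
        '"keywords": ["a"], "sentiment": "Positive"}\n```'
data = parse_model_json(reply)
data["sentiment"]   # "Positive"
```

In production you would also validate the parsed dict (required keys, allowed sentiment values) and retry the model call on a parse failure.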
3.2 RAG: Let the Model "Understand" Your Documents
RAG (Retrieval-Augmented Generation) is currently the most popular LLM application architecture.
Why do we need RAG? Because an LLM's knowledge is static — it doesn't know about your company's internal affairs or the latest product documentation content. You can't make the model "remember" all the knowledge in the world, but you can let it "look up references" when answering questions.
The RAG approach is simple:
1. Prepare your documents (PDF, web pages, database...)
2. Split documents into chunks
3. Embed each chunk, store in vector database
4. When user asks a question, embed the question too
5. Find the "most relevant" content in the vector database
6. Send relevant content + user question to LLM together
7. LLM generates answer based on this context
A classic RAG flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Document │───▶│ Document │───▶│ Embedding │
│ Store │ │ Splitting │ └──────┬──────┘
└─────────────┘ └─────────────┘ │
▼
┌──────────────────────────┐
│ Vector Database │
│ (Stores document vectors) │
└──────────────────────────┘
▲
┌─────────────┐ ┌─────────────┐ │
│ User Query │───▶│ Query Embed │───────────┘
└─────────────┘ └─────────────┘ │
▼
┌──────────────────────────┐
│ Retrieve Most Relevant │
└──────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Final Answer│◀───│ LLM Generate│◀───│ Context │
└─────────────┘ └─────────────┘ └─────────────┘
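The flow above can be sketched end to end. Word-overlap scoring stands in for embedding similarity here, and a plain list for the vector database; a real pipeline would use an embedding model plus one of the vector stores mentioned earlier.

```python
# Toy RAG: word overlap stands in for vector similarity. Illustration only.

def score(query: str, chunk: str) -> int:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, k=2):
    # Steps 4-5: "embed" the question and find the closest chunks.
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    # Step 6: send relevant content + the user question together.
    context = "\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [  # step 2: documents already split into chunks
    "Refund policy: refunds are processed within 7 days.",
    "Orders ship within 24 hours of purchase.",
    "Passwords can be reset from the account settings page.",
]
top = retrieve("how can a password be reset", docs, k=1)
# top[0] is the password chunk; build_prompt() wraps it for the LLM (step 7)
```

Every design choice in this sketch (similarity function, chunk size, `k`, prompt template) is a tuning knob in a real system.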
Common RAG application scenarios:
- Enterprise knowledge base Q&A: Employees ask about company policies, processes, bot finds answers from internal documents
- Customer service assistance: Users ask about products, find relevant info from product manuals
- Technical documentation Q&A: Developers ask about API usage, retrieve from technical docs
3.3 Fine-tuning and LoRA
Prompt Engineering and RAG are both about using models, but sometimes you need to train the model — this is Fine-tuning.
Ways to use models:
- Prompt Engineering: Adjust input, don't touch the model
- RAG: Give the model "cheat sheets", don't touch the model
Ways to train models:
- Fine-tuning: Train the model with specific data, let it "learn" new knowledge
- LoRA: Low-Rank Adaptation, more efficient fine-tuning method
When to use fine-tuning instead of RAG?
| Scenario | Recommended Approach |
|---|---|
| Need model to learn a speaking style | Fine-tuning |
| Need model to "remember" large amounts of knowledge | RAG |
| Need model to perform specific task formats | Fine-tuning |
| Knowledge updates frequently | RAG |
| Training data is easy to obtain | Fine-tuning |
| Need real-time knowledge | RAG |
The core idea of LoRA (Low-Rank Adaptation) is: don't modify the model's main backbone parameters, but additionally train a small set of "adapter" parameters. This greatly reduces the cost and time of fine-tuning.
To use an analogy: the model is a building's framework, LoRA is hanging some hooks on the framework — hang clothes on the hooks and you can change outfits. No need to rebuild the entire building, just change the hooks.
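The savings are easy to quantify. For a single d×d weight matrix, full fine-tuning updates all d² entries, while LoRA trains only the two skinny matrices B (d×r) and A (r×d) of its low-rank update W + BA; the numbers below assume a typical hidden size and rank.

```python
# Parameter count for one weight matrix: full fine-tuning vs LoRA.
d = 4096   # hidden dimension (typical for a mid-size model)
r = 8      # LoRA rank, much smaller than d

full_ft = d * d     # every entry of W is trainable
lora = 2 * d * r    # only B (d x r) and A (r x d) are trained

full_ft, lora, full_ft // lora   # 16777216, 65536, 256
```

A 256x reduction per matrix, multiplied across every layer the adapter touches, is why LoRA fine-tuning fits on a single consumer GPU while full fine-tuning often does not.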
3.4 AI Agent Interaction Patterns
Traditional software interaction pattern: Human → Software → Data. Each software has different interfaces and operation methods, users need to learn.
AI Agent era: Human → Agent → Data. The interaction interface is unified, users just need to talk to the Agent.
Traditional mode:
User → Excel (handle spreadsheets)
User → Photoshop (handle images)
User → Email client (handle emails)
User → Data analysis tool (make charts)
User → PPT tool (make presentations)
... Need to learn each software
Agent mode:
User → Agent → "Analyze this month's sales data, make charts, and generate a PPT"
User → Agent → "Remove the background from this photo, then adjust the color"
User → Agent → "Send an email to client A explaining the order delay, tone should be sincere"
The Agent receives natural language, breaks down tasks, calls appropriate tools, and returns results. The underlying complexity is hidden; what users perceive is just an intelligent assistant that understands human language and helps them get things done.
4. Build Your First AI Application with Dify
After all that theory, let's get practical. Dify is one of the most popular open-source LLMOps platforms today, with the philosophy of "making AI application development as simple as building with blocks."
Dify's characteristics:
- Open source, can be deployed locally
- Supports multiple models (OpenAI, Claude, local open-source models...)
- Visual workflow orchestration
- Complete logging and analytics
4.1 Dify's Core Modules
Dify breaks AI applications into several core modules:
Application Module:
├── Prompt
│ └── Supports variables, templates, conditionals
├── Memory
│ ├── Short-term (conversation context window)
│ └── Long-term (vector database)
├── Tools
│ ├── Built-in tools (Google search, Wikipedia, calculator...)
│ └── Custom tools (call your own API)
├── Knowledge Base
│ └── Supports PDF, text, web pages, Notion...
├── Opening / Suggested Questions
│ └── Improves user interaction experience
└── Content Moderation
└── Built-in sensitive word filtering
Workflow Module:
├── Start Node
├── End Node
├── LLM Node (call model)
├── Knowledge Retrieval Node
├── Code Executor Node (online Python/JS execution)
├── Conditional Branch Node
├── HTTP Request Node
├── Template Transformation Node
├── Variable Aggregation Node
└── ...
4.2 A Simple Chatbot
Steps to make a chatbot with Dify:
Step 1: Create Application
1. Select "Chat Assistant" type
2. Set app name: "Tech Support Assistant"
3. Set opening: "Hello, I'm tech support. What technical questions can I help you with?"
4. Set suggested questions: "How to reset password?", "What to do if API call errors out?"...
Step 2: Configure Prompt
You are a professional technical support engineer familiar with all aspects of our company's products.
When users ask questions:
1. First understand the user's question; if unclear, ask for more details
2. If the question involves product features, prioritize retrieving relevant info from the knowledge base
3. If you can't find the answer, honestly tell the user and suggest contacting human support
4. Keep answers concise, professional, and easy to understand; avoid excessive technical jargon
5. If the question involves code, provide complete runnable code examples
Remember: You represent the company's image; every answer should demonstrate professionalism.
Step 3: Connect Knowledge Base
1. Create knowledge base: "Tech Support Knowledge Base"
2. Upload documents:
- Product user manual (PDF)
- FAQ (Markdown)
- API documentation (web scrape)
3. Configure chunking strategy:
- Recommend 300-500 characters per chunk
- Preserve paragraph context
4. Click "Embed," system processes automatically
Step 4: Publish
1. Click "Publish"
2. Choose publishing method:
- WebApp: Generate a webpage, can share directly with users
- API: Generate API address and key, let your system call it
3. Set access permissions (public/require login/whitelist)
That's it — just these steps, and you have a customer service bot backed by a product knowledge base. No code writing needed, no need to understand AI internals.
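Once published as an API, your own backend can call the bot over HTTP. The sketch below only constructs the request for Dify's chat-messages endpoint; the field names follow Dify's API documentation at the time of writing, so verify them against your deployed version, and the key and host are placeholders.

```python
import json

API_KEY = "app-XXXX"                    # placeholder: key Dify generates on publish
BASE_URL = "https://your-dify-host/v1"  # placeholder host

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    "query": "What to do if API call errors out?",
    "inputs": {},                 # values for prompt variables, if any
    "response_mode": "blocking",  # "streaming" returns SSE chunks instead
    "user": "employee-42",        # stable ID used for per-user logs/limits
}
body = json.dumps(payload)
# POST {BASE_URL}/chat-messages with `headers` and `body` to get a reply.
```

The `user` field is worth setting to a real, stable identifier: it is what makes the platform's per-user logs and rate limits meaningful.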
4.3 A More Advanced Workflow
Suppose you want to build an "Article Analysis Assistant": user drops in an article URL, the assistant automatically fetches content, summarizes key points, extracts keywords, and generates image suggestions.
Workflow Design:
Start
│
▼
HTTP Request (fetch URL content)
│
▼
LLM (extract article body, remove ads and irrelevant content)
│
▼
┌──┬──┬──┐
▼ ▼ ▼ ▼
│ │ │ └─▶ LLM (generate image suggestions)
│ │ │ Output: suggested image style and description
│ │ │
│ │ └────▶ LLM (keyword extraction)
│ │ Output: 3-5 keywords
│ │
│ └────────▶ LLM (generate summary)
│ Output: summary within 100 characters
│
└────────────▶ Variable Aggregation (assemble final result)
│
▼
┌─────────────────────────────────────┐
│ Final Output: │
│ - Summary │
│ - Keywords │
│ - Image suggestions │
└─────────────────────────────────────┘
│
▼
End
Drag and drop to implement in Dify — no code required. Complex workflows are broken into simple nodes, each doing one thing, then combined into complete functionality.
4.4 Real Case: Customer Service Bot Results
I previously built an internal customer service bot for my team using Dify with pretty good results:
Scenario: Answering employee questions about company IT systems
Knowledge base content:
- IT help center documents (50+ articles)
- Conference room booking guide
- Printer usage guide
- VPN connection guide
- ...
Pre-launch concerns:
❌ Problem: Employee questions might not be in the knowledge base
✅ Solution: When "answer not found," guide to human support and log the question
❌ Problem: Answers might be inaccurate, misleading employees
✅ Solution: Enable "content moderation," sensitive answers need human confirmation
❌ Problem: Employees ask about real-time info (like today's server status)
✅ Solution: Connect to company Status Page API, Agent can query in real-time
Post-launch results:
- 70% of questions resolved self-service
- Average response time reduced from 30 minutes (human) to 1 minute
- Employee satisfaction improved
- IT colleagues have more time for complex issues
5. Common AI Application Architectures
Depending on the scenario, LLM application architectures vary. There's no best architecture, only the most suitable one.
5.1 Simple Conversation Type
Best for: Customer service chat, Q&A bots, casual conversation
User Input → [Prompt + Conversation History] → LLM → Response
Characteristics:
- Simplest architecture, good for beginners
- No external knowledge base involved
- Depends on model's own capabilities
- Fast response
Limitations:
- Model doesn't know latest information
- Answers may be inaccurate (hallucination problem)
- Can't access user's private data
5.2 RAG Type
Best for: Document Q&A, knowledge base queries, enterprise intranet assistants
User Question
│
▼
Embed User Question
│
▼
Vector Retrieval ←── Document Split + Embedding
│ (preprocessed vector database)
▼
Get Top-K Relevant Documents
│
▼
Concatenate: User Question + Relevant Docs → LLM → Response
Characteristics:
- Can leverage external knowledge
- Answers traceable to sources
- Supports real-time knowledge base updates
- Relatively simple to deploy
Limitations:
- Depends on retrieval quality
- Context window length constraints
- Document chunking strategy affects results
5.3 Agent Type
Best for: Complex tasks, automated workflows, multi-step operations
User Request
│
▼
Agent (LLM + Tools + Planning)
│
├──▶ Plan next step
├──▶ Select and call tools
│ - Web search
│ - Database query
│ - API calls
│ - Code execution
├──▶ Observe tool return results
│
└──▶ Decide whether to continue or end
│
▼
Loop until complete
│
▼
Final Response
Characteristics:
- Can execute multi-step tasks
- Can access external systems
- Supports complex reasoning
- Can handle open-ended tasks
Limitations:
- Higher cost (multiple calls)
- Uncertain execution time
- Need to design fallback strategies
5.4 Multi-Agent Collaboration Type
Best for: Complex systems, cascading tasks, requiring multi-domain expert collaboration
User Request
│
▼
┌─────────────────┐
│ Orchestrator │
│ Agent (Understand│
│ task, delegate) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Data │ │Analysis │ │ Report │
│ Agent │ │Agent │ │Agent │
│(get data)│ │(analyze) │ │(generate)│
└──────────┘ └──────────┘ └──────────┘
│ │ │
└──────────────┴──────────────┘
│
▼
┌─────────────────┐
│ Final Output │
└─────────────────┘
Characteristics:
- Clear division of labor, each Agent specializes
- Can execute independent tasks in parallel
- Easy to scale and maintain
- Suitable for complex business processes
6. Practical Pitfall Guide
6.1 Prompt-Related Pitfalls
Pitfall 1: Prompt written but doesn't work
Sometimes prompts work great in testing but are unstable in production because LLM outputs have randomness. Solutions:
① Add output format constraints
Not: "Summarize this passage"
But: "Summarize in JSON format, including summary and keywords fields"
② Specify a few output examples (Few-shot)
Show model 2-3 examples first, then have it process your question
③ Generate multiple results and pick the best
Set temperature lower, or generate multiple and vote
④ Explicitly require model to "think out loud"
"Before answering, list your reasoning steps"
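Item ③ (generate several and vote) is only a few lines in practice. The `sampled` list below stands in for repeated calls to the model at a nonzero temperature:

```python
from collections import Counter

def majority_vote(candidates):
    # Keep the answer the model produced most often across samples.
    return Counter(candidates).most_common(1)[0][0]

# Stand-in for three sampled model outputs at temperature > 0.
sampled = ["Positive", "Positive", "Negative"]
majority_vote(sampled)   # "Positive"
```

This only works for short, comparable outputs (labels, yes/no, extracted values); free-form answers need fuzzier comparison before voting.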
Pitfall 2: Context too long
Too many embedded documents, context fills up, model "can't read it all." Solutions:
① Document chunking should be reasonable
- Smaller isn't always better; preserve semantic integrity
- Recommend 300-500 characters per chunk
- Overlap between chunks to avoid context breaks
② Use Re-ranking to filter
First retrieve 20 candidates via vector search
Then use a more precise model to filter to most relevant 3-5
③ Limit context length
Set max_tokens to prevent overly long output
Summarize input to control length
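Item ① (chunking with overlap) can be sketched as a sliding window; the 400-character size and 50-character overlap below match the ranges suggested above.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50):
    # Slide a window of `size` chars, stepping size-overlap each time,
    # so neighboring chunks share `overlap` chars of context.
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)
# 3 chunks; chunks[0][-50:] == chunks[1][:50] (the shared overlap)
```

Real splitters additionally try to cut on paragraph or sentence boundaries rather than mid-word, which is exactly the "preserve semantic integrity" point above.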
6.2 Knowledge Base Pitfalls
Pitfall 3: Can't retrieve relevant content
Embedding model and query semantics may not match. Solutions:
① Try different embedding models
Recommended options for Chinese-language scenarios:
- text-embedding-3-small (OpenAI)
- BGE (BAAI open source, good Chinese performance)
- M3E (MokaAI open source)
② Hybrid retrieval
Combine vector retrieval + keyword retrieval (BM25)
Take union or intersection of both results
③ Optimize document structure
- Make each chunk semantically self-contained
- Avoid one chunk covering multiple topics
- Add clear headings and summaries
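One common way to take the "union" of the two result lists in item ② is reciprocal rank fusion (RRF): each document earns a score of 1/(k + rank) from every list it appears in, and the sums decide the merged order. The constant k = 60 is the conventional default.

```python
def rrf(rankings, k=60):
    # Reciprocal rank fusion: sum 1/(k + rank) over all result lists.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # embedding search order
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # BM25 order
merged = rrf([vector_hits, keyword_hits])
# doc_b ranks first: appearing high in both lists beats topping just one
```

RRF needs no score calibration between the two retrievers, which is why it is a popular default for hybrid retrieval.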
Pitfall 4: Knowledge base updates are troublesome
After documents update, need to re-embed; costs add up. Solutions:
① Incremental updates
Only reprocess changed documents
Use version control to track which documents updated
② Separate dynamic and static
Static knowledge (product manuals) → Embed
Dynamic info (inventory counts) → Real-time database queries
③ Design knowledge base with update frequency in mind
High-frequency content managed separately
Don't mix with low-frequency content
6.3 Cost Pitfalls
Pitfall 5: Token costs skyrocket
No rate limiting, no caching, no context length control. Solutions:
① Load test before launch
Estimate average token consumption per request
Project total consumption for daily active users
② Implement request caching
Same question (hash match) → return cached result
Cache hit rate can reach 30-50%
③ Set token limits
Per-request max_tokens limit
Per-user daily token limit
Auto-degrade or prompt when exceeded
④ Monitoring and alerts
Set cost threshold alerts
Auto circuit-break when over budget
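Item ② (request caching) in miniature: hash the normalized question and reuse the stored answer. A real deployment would back this with Redis and a TTL instead of an in-process dict; `fake_llm` stands in for the paid API call.

```python
import hashlib

_cache = {}

def cached_ask(prompt: str, llm) -> str:
    # Normalize, hash, and only call the model on a cache miss.
    key = hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = llm(prompt)   # the only call that costs tokens
    return _cache[key]

paid_calls = []
def fake_llm(p):
    paid_calls.append(p)
    return "Reset it from account settings."

cached_ask("How do I reset my password?", fake_llm)
cached_ask("  how do i reset my password?", fake_llm)  # hit: no new call
# len(paid_calls) == 1
```

Exact-hash matching only catches literal repeats; the 30-50% hit rates mentioned above usually require semantic caching, i.e. matching on embedding similarity rather than hashes.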
6.4 Reliability Pitfalls
Pitfall 6: Model provider goes down
Relying on a single model is too risky. Solutions:
① Design fallback strategies
Primary: GPT-4
Backup: Claude 3 / Gemini Pro
Fallback trigger: primary model timeout / error / circuit break
② Use platform to manage multiple models
Dify supports configuring multiple models
One-click switch, no code changes
③ Monitor model health status
Follow API status pages
Set up automatic alerts
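Item ① (the fallback chain) in sketch form: try each provider in order and return the first success. `call_gpt4` and `call_claude` are hypothetical stand-ins for real client calls; the primary is simulated as down.

```python
def ask_with_fallback(prompt, providers):
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)       # first success wins
        except Exception as err:            # timeout, rate limit, outage...
            last_err = err                  # remember it, try the next
    raise RuntimeError("all providers failed") from last_err

def call_gpt4(prompt):                      # simulate the primary being down
    raise TimeoutError("primary timed out")

def call_claude(prompt):                    # backup answers normally
    return "ok from backup"

provider, reply = ask_with_fallback(
    "hello", [("gpt-4", call_gpt4), ("claude-3", call_claude)]
)
# provider == "claude-3"
```

In production you would wrap each call in its own timeout and log which provider actually answered, since prompt behavior can differ subtly between models.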
7. Next Stop: Making AI a Real Colleague
LLMOps made me realize something: AI isn't here to take jobs, but to take over tasks that are highly repetitive and low in creativity.
Try using AI now:
- Help me search for problematic code in the codebase (used to grep for半天)
- Auto-generate unit tests (those boring edge cases)
- Help me think through architecture for new features (chatting leads to inspiration)
- Help me write code review comments (things I tend to overlook)
Current AI applications are still pretty primitive — most products are just "calling APIs." But as Agent capabilities grow and LLMOps platforms mature, the real changes are still ahead.
If you haven't started experimenting with AI applications yet, my advice is:
- Start playing: Find a platform (Dify, Coze, LangFlow) and build a few demos to feel what AI can really do
- Find your scenario: Think of something you do daily that's repetitive and see if AI can help
- Go deep on one thing: Whether prompt engineering, RAG, or Agent development — pick one and go deep
- Focus on engineering: How to deploy, monitor, iterate — this is what separates "toy" from "product"
Appendix: Quick Reference Glossary
| Term | Full Name | One-Line Explanation |
|---|---|---|
| LLM | Large Language Model | Can understand and generate natural language |
| Token | Token | Basic unit of text processed by models; English≈word, Chinese≈character |
| Embedding | Embedding | Converting text to vectors; semantically similar words have similar vectors |
| Prompt | Prompt | Input instructions to the model; core human-LLM interaction method |
| RAG | Retrieval-Augmented Generation | Combines knowledge base with model for domain-specific understanding |
| Agent | Agent | Intelligent system that autonomously plans, calls tools, executes complex tasks |
| Fine-tuning | Fine-tuning | Training model with specific data to adapt to new tasks |
| LoRA | Low-Rank Adaptation | Low-rank adaptation; efficient model fine-tuning method |
| AIaaS | AI as a Service | AI capabilities as public services |
| LLMOps | LLMOps | Large language model operations; AI application engineering platforms and practices |
| Vector DB | Vector Database | Database storing embeddings for retrieval |
| Re-ranking | Re-ranking | Using more precise methods to filter retrieval results |
| Temperature | Temperature | Parameter controlling model output randomness; lower = more deterministic |
| Few-shot | Few-shot | Show examples to model to teach style/format |
| CoT | Chain of Thought | Having model think step by step before answering |
| Hallucination | Hallucination | Model confidently stating false information |
| Context Window | Context Window | Maximum tokens a model can process in one go |