LLM API Integration¶
Summary¶
This chapter covers the technical aspects of integrating large language models into applications through APIs. Students will learn API fundamentals, authentication methods, and how to configure parameters like temperature and max tokens. The chapter also addresses practical concerns like rate limiting, cost optimization, and token counting for production deployments.
Concepts Covered¶
This chapter covers the following 17 concepts from the learning graph:
- API Fundamentals
- REST API
- SDK
- OpenAI API
- Anthropic API
- API Endpoints
- API Authentication
- API Keys
- Temperature Parameter
- Top-P Parameter
- Max Tokens Parameter
- Stop Sequences
- Streaming Responses
- Rate Limiting
- Cost Optimization
- API Pricing
- Token Counting
Prerequisites¶
This chapter builds on concepts from:
Learning Objectives¶
After completing this chapter, students will be able to:
- Use the OpenAI and Anthropic APIs to implement text generation
- Apply API parameters appropriately to control output characteristics
- Manage rate limiting and optimize costs for API usage
- Count tokens and estimate costs for AI applications
- Design API integration architectures for enterprise applications
Introduction¶
While consumer interfaces like ChatGPT demonstrate generative AI capabilities, building production applications requires direct API integration. Application Programming Interfaces (APIs) provide programmatic access to LLM capabilities, enabling developers to embed AI into custom applications, automate workflows, and create novel user experiences.
This chapter provides the technical foundation for working with LLM APIs. We explore the mechanics of API communication, authentication practices, parameter configuration, and operational concerns including rate limiting and cost management. Whether you are building a customer service chatbot or a document analysis pipeline, mastering these concepts is essential for successful AI application development.
API Fundamentals¶
What Is an API?¶
An Application Programming Interface (API) is a contract between software systems that defines how they communicate. APIs specify request formats, response structures, and the operations available. For LLMs, APIs allow applications to send prompts and receive generated text programmatically.
Key API concepts:
| Concept | Description |
|---|---|
| Endpoint | A specific URL where API requests are sent |
| Request | Data sent to the API (method, headers, body) |
| Response | Data returned from the API (status, headers, body) |
| Authentication | Verification of caller identity and permissions |
| Rate limiting | Constraints on request frequency |
REST APIs¶
REST (Representational State Transfer) is the dominant architectural style for web APIs. LLM providers use REST APIs with HTTP methods to expose their models.
Common HTTP methods:
| Method | Purpose | LLM API Usage |
|---|---|---|
| POST | Create/submit data | Submit prompts for completion |
| GET | Retrieve data | List models, check status |
| DELETE | Remove data | Delete fine-tuned models |
A typical REST API request:
POST /v1/chat/completions HTTP/1.1
Host: api.openai.com
Authorization: Bearer sk-your-api-key
Content-Type: application/json
{
    "model": "gpt-4",
    "messages": [
        {"role": "user", "content": "Explain REST APIs briefly."}
    ]
}
Software Development Kits (SDKs)¶
SDKs are client libraries that simplify API interaction. Rather than manually constructing HTTP requests, developers use language-specific objects and methods.
SDK benefits:
- Abstraction: Hide HTTP complexity behind clean interfaces
- Type safety: Catch errors at compile time (in typed languages)
- Convenience: Built-in serialization, error handling, retries
- Maintenance: SDK updates as API evolves
SDK example (Python with OpenAI):
from openai import OpenAI
client = OpenAI() # Uses OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Explain REST APIs briefly."}
    ]
)
print(response.choices[0].message.content)
Major LLM APIs¶
OpenAI API¶
The OpenAI API provides access to GPT models, DALL-E image generation, Whisper transcription, and embedding models.
Key endpoints:
| Endpoint | Purpose |
|---|---|
| /v1/chat/completions | Conversational text generation |
| /v1/completions | Legacy text completion (deprecated for most models) |
| /v1/embeddings | Generate vector embeddings |
| /v1/images/generations | Create images with DALL-E |
| /v1/audio/transcriptions | Transcribe audio with Whisper |
| /v1/fine-tuning/jobs | Manage fine-tuning |
Chat completions request structure:
{
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}
Anthropic API¶
The Anthropic API provides access to Claude models with a focus on safety and extended context.
Key endpoint: /v1/messages
Anthropic request structure:
{
    "model": "claude-3-sonnet-20240229",
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "messages": [
        {"role": "user", "content": "What is machine learning?"}
    ]
}
Key differences from OpenAI:
| Aspect | OpenAI | Anthropic |
|---|---|---|
| System prompt | In messages array | Separate system field |
| Model naming | gpt-4, gpt-4-turbo | claude-3-sonnet-20240229 |
| Default context | Varies by model | 200K standard |
| Header auth | Authorization: Bearer | x-api-key |
Authentication and Security¶
API Keys¶
API keys are secret tokens that authenticate API requests. They identify the calling application and associate usage with a billing account.
API key best practices:
- Never expose in client-side code: Keys in JavaScript, mobile apps, or repositories can be stolen
- Use environment variables: Store keys outside code; reference via process.env or similar
- Rotate periodically: Generate new keys and deprecate old ones
- Restrict permissions: Use project-specific keys with minimal permissions
- Monitor usage: Set up alerts for unexpected consumption patterns
Environment variable usage:
import os
from openai import OpenAI
# Key loaded from OPENAI_API_KEY environment variable
client = OpenAI()
# Or explicitly:
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
Security Warning
Exposed API keys can result in significant financial liability and data exposure. Immediately rotate any key that may have been compromised. Use secret scanning tools to prevent accidental commits.
Authentication Headers¶
LLM APIs pass credentials in HTTP request headers, but the header names differ by provider: OpenAI uses the standard Authorization header with a Bearer token, while Anthropic uses a custom x-api-key header alongside a required anthropic-version header.
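Illustrative request headers (the key values shown are placeholders):
OpenAI:
Authorization: Bearer sk-your-api-key
Content-Type: application/json
Anthropic:
x-api-key: your-api-key
anthropic-version: 2023-06-01
Content-Type: application/json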
Generation Parameters¶
Temperature¶
The temperature parameter controls the randomness of model outputs. It scales the probability distribution over tokens before sampling.
| Temperature | Effect | Use Cases |
|---|---|---|
| 0.0 | Deterministic; highest probability token always selected | Factual Q&A, code generation, consistency-critical tasks |
| 0.3-0.5 | Low variation; mostly predictable with occasional diversity | Professional writing, summarization |
| 0.7-0.9 | Moderate creativity; balanced exploration | Creative writing, brainstorming |
| 1.0-1.5 | High creativity; unexpected, diverse outputs | Poetry, idea generation, experimental content |
The mathematical effect: temperature divides the logits (pre-softmax scores) before computing probabilities. Lower temperature sharpens the distribution (concentrating probability on top tokens); higher temperature flattens it (more uniform sampling).
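A small numerical sketch of this effect in plain Python (the logit values are hypothetical):
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Lower temperature concentrates probability on the top token;
# higher temperature flattens the distribution toward uniform.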
Top-P (Nucleus Sampling)¶
Top-P (nucleus sampling) offers an alternative to temperature for controlling diversity. Instead of scaling probabilities, top-p dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold.
| Top-P | Effect |
|---|---|
| 0.1 | Very focused; only top ~10% probability mass considered |
| 0.5 | Moderate; top 50% probability mass |
| 0.9 | Broad; most tokens considered except extreme tail |
| 1.0 | All tokens considered (equivalent to no nucleus sampling) |
Temperature vs. Top-P
OpenAI recommends adjusting one or the other, not both simultaneously. Temperature is generally more intuitive; top-p provides finer control for specific applications.
Max Tokens¶
The max tokens parameter limits the length of generated output. It specifies the maximum number of tokens the model will generate before stopping.
Considerations:
- Output may be shorter if the model generates a stop token naturally
- Setting too low truncates responses mid-thought
- Setting too high increases cost and latency unnecessarily
- Context window limits apply to input + output combined
Estimation guidelines:
| Content Type | Approximate Tokens |
|---|---|
| Short answer | 50-100 |
| Paragraph | 100-250 |
|  | 150-400 |
| Page of text | 500-700 |
| Long document | 1000+ |
Stop Sequences¶
Stop sequences are strings that, when generated, cause the model to stop producing output. They enable structured generation and prevent runaway responses.
Example use cases:
# Stop at end of first sentence
stop=["."]
# Stop at markdown headers or code blocks
stop=["##", "```"]
# Stop at JSON object close
stop=["}"]
Stop sequences are useful for:
- Extracting single items from potential lists
- Preventing model from adding unwanted commentary
- Enforcing output structure
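A minimal sketch combining these generation parameters in a single OpenAI request (the specific values are illustrative, not recommendations; top_p is left at its default per the note above):
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List one benefit of REST APIs."}],
    temperature=0.3,   # low randomness for a factual answer
    max_tokens=100,    # cap output length (and cost)
    stop=["\n\n"],     # stop at the first blank line
)

print(response.choices[0].message.content)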
Streaming Responses¶
Why Stream?¶
Streaming responses return tokens as they're generated rather than waiting for complete output. This dramatically improves perceived latency for users.
Without streaming, the client waits for the entire completion to finish before displaying anything. With streaming, tokens appear in the interface as soon as they are generated. For a 500-token response, streaming delivers first content in ~100ms versus ~3000ms for non-streaming.
Implementing Streaming¶
OpenAI streaming example:
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about APIs."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Server-Sent Events (SSE) format:
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"APIs"}}]}
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" connect"}}]}
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" us"}}]}
data: [DONE]
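Under the hood, the SDK is parsing this SSE stream. The following is a rough sketch of consuming it manually with the requests library (the request fields mirror the earlier examples; error handling is omitted):
import json
import os
import requests

headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a haiku about APIs."}],
    "stream": True,
}

with requests.post("https://api.openai.com/v1/chat/completions",
                   headers=headers, json=payload, stream=True, timeout=60) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"]
            print(delta.get("content", ""), end="", flush=True)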
Rate Limiting¶
Understanding Rate Limits¶
Rate limiting restricts how frequently and how much you can use an API within time windows. LLM APIs enforce multiple limits:
| Limit Type | Description |
|---|---|
| Requests per minute (RPM) | Maximum API calls per minute |
| Tokens per minute (TPM) | Maximum tokens processed per minute |
| Tokens per day (TPD) | Maximum tokens processed per day |
| Concurrent requests | Maximum simultaneous requests |
Rate limits vary by:
- Subscription tier (free, pay-as-you-go, enterprise)
- Account history and usage patterns
- Specific model (GPT-4 often has lower limits than GPT-3.5)
Handling Rate Limits¶
When limits are exceeded, APIs return HTTP 429 (Too Many Requests) errors.
Mitigation strategies:
Exponential backoff with jitter:
import time
import random
from openai import RateLimitError

def call_with_retry(func, max_retries=5):
    """Retry a callable on rate-limit errors using exponential backoff with jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            # Wait 1, 2, 4, 8, ... seconds plus random jitter to avoid synchronized retries
            wait = (2 ** attempt) + random.random()
            time.sleep(wait)
    raise Exception("Max retries exceeded")
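Usage sketch, wrapping the earlier completion call and assuming the client object from the SDK examples above (the lambda defers execution so each retry re-issues the request):
response = call_with_retry(
    lambda: client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Explain rate limiting briefly."}],
    )
)
print(response.choices[0].message.content)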
Request batching: Combine multiple small requests into fewer larger ones
Request queuing: Buffer requests and process at sustainable rate
Load distribution: Spread requests across multiple API keys or accounts (where permitted)
Token Counting and Cost Optimization¶
Counting Tokens¶
Understanding token counts is essential for cost estimation and context management.
OpenAI's tiktoken library:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
text = "How many tokens is this sentence?"
tokens = encoding.encode(text)
print(f"Token count: {len(tokens)}") # Output: 7
Token counting considerations:
- Different models use different tokenizers
- Special tokens (system instructions, formatting) add overhead
- Non-English text often requires more tokens per word
- Code typically requires more tokens than prose
API Pricing¶
LLM APIs charge per token, typically quoted per million tokens:
| Model | Input Price | Output Price |
|---|---|---|
| GPT-4 | $30/1M tokens | $60/1M tokens |
| GPT-4 Turbo | $10/1M tokens | $30/1M tokens |
| GPT-4o | $5/1M tokens | $15/1M tokens |
| GPT-3.5 Turbo | $0.50/1M tokens | $1.50/1M tokens |
| Claude 3 Opus | $15/1M tokens | $75/1M tokens |
| Claude 3 Sonnet | $3/1M tokens | $15/1M tokens |
| Claude 3 Haiku | $0.25/1M tokens | $1.25/1M tokens |
Prices as of early 2024; check current pricing
Cost calculation: cost per request = (input tokens ÷ 1,000,000) × input price + (output tokens ÷ 1,000,000) × output price, using the per-million-token prices above.
Cost Optimization Strategies¶
| Strategy | Implementation | Potential Savings |
|---|---|---|
| Model selection | Use smaller models for simple tasks | 50-90% |
| Prompt optimization | Shorter prompts, fewer examples | 10-30% |
| Caching | Cache responses for repeated queries | 30-80% |
| Batching | Process multiple items per API call | 10-20% |
| Output limits | Set appropriate max_tokens | 10-40% |
| Context management | Summarize rather than include full history | 20-50% |
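Of these, response caching is often the quickest win. Below is a minimal sketch of an in-memory cache keyed by a hash of the request; the dictionary cache, helper name, and lack of expiry are illustrative assumptions, and a production system would typically use a shared store such as Redis:
import hashlib
import json
from openai import OpenAI

client = OpenAI()
_cache = {}  # maps request fingerprint -> response text

def cached_completion(model, messages, **params):
    """Return a cached response for identical requests; call the API otherwise."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages, **params},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model=model, messages=messages, **params
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]

# The second identical call is served from the cache at zero API cost
print(cached_completion("gpt-4", [{"role": "user", "content": "Define REST."}]))
print(cached_completion("gpt-4", [{"role": "user", "content": "Define REST."}]))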
Diagram: Cost Optimization Decision Tree¶
The following decision tree guides LLM API cost optimization decisions based on task complexity and usage volume.
flowchart TD
START["🎯 Start: What is the<br/>task complexity?"]
START -->|Simple| SIMPLE["📉 Simple Task Path"]
START -->|Complex| COMPLEX["📈 Complex Task Path"]
subgraph SimpleOpt["Simple Task Optimizations"]
S1["Use smallest capable model<br/>GPT-3.5, Haiku, Llama 8B"]
S2["Keep prompts minimal<br/>Remove unnecessary context"]
S3["Set low max_tokens<br/>Match expected output length"]
S1 --> S2 --> S3
end
subgraph ComplexOpt["Complex Task Optimizations"]
C1["Tiered approach:<br/>Small model first, escalate if needed"]
C2["Cache complex analysis<br/>Reuse for similar inputs"]
C3["Batch related requests<br/>Reduce per-request overhead"]
C1 --> C2 --> C3
end
SIMPLE --> SimpleOpt
COMPLEX --> ComplexOpt
VOL{"High Volume?<br/>>10K requests/day"}
SimpleOpt --> VOL
ComplexOpt --> VOL
subgraph HighVol["High Volume Optimizations"]
H1["Response caching<br/>Semantic deduplication"]
H2["Fine-tuning<br/>Reduce prompt tokens"]
H3["Self-hosted open-source<br/>Eliminate per-token costs"]
end
VOL -->|Yes| HighVol
VOL -->|No| DONE["✅ Apply selected<br/>optimizations"]
HighVol --> DONE
style START fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
style SimpleOpt fill:#E8F5E9,stroke:#388E3C
style ComplexOpt fill:#FFF3E0,stroke:#F57C00
style HighVol fill:#FCE4EC,stroke:#C2185B
Cost Optimization Quick Reference:
| Optimization | Savings Potential | Implementation Effort | Best For |
|---|---|---|---|
| Smaller model | 50-90% | Low | Simple tasks currently using large models |
| Prompt reduction | 20-40% | Low | Verbose system prompts |
| Output limits | 10-40% | Low | Tasks generating more tokens than needed |
| Response caching | 30-70% | Medium | Repeated similar queries |
| Tiered models | 40-60% | Medium | Mix of simple and complex tasks |
| Fine-tuning | 50-80% | High | High-volume, specialized tasks |
| Self-hosted | 70-95% | High | Very high volume, privacy requirements |
Monthly Cost Estimation Formula:
Monthly Cost = (Requests/day × 30) × (Avg Input Tokens × Input Price per Token + Avg Output Tokens × Output Price per Token)
Example Calculation
- 10,000 requests/day × 30 = 300,000 requests/month
- 500 input tokens @ $0.003/1K = $0.0015/request
- 200 output tokens @ $0.006/1K = $0.0012/request
- Monthly cost: 300,000 × ($0.0015 + $0.0012) = $810/month
Switching to a model with 10× lower prices for 80% of requests brings this to roughly $227/month (about a 72% reduction), as the sketch below shows.
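A quick sketch that reproduces this arithmetic; prices are the per-1K-token figures from the example, and the function name is illustrative:
def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                 input_price_per_1k, output_price_per_1k):
    """Estimate monthly API spend in dollars."""
    per_request = (avg_input_tokens / 1000) * input_price_per_1k \
        + (avg_output_tokens / 1000) * output_price_per_1k
    return requests_per_day * 30 * per_request

base = monthly_cost(10_000, 500, 200, 0.003, 0.006)
print(base)  # 810.0

# Route 80% of requests to a model with 10x lower prices, keep 20% on the original
mixed = monthly_cost(2_000, 500, 200, 0.003, 0.006) \
    + monthly_cost(8_000, 500, 200, 0.0003, 0.0006)
print(mixed)  # 162.0 + 64.8 = 226.8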
Production Architecture¶
Integration Patterns¶
Synchronous request-response:
- Client waits for the API response
- Simplest pattern
- Suitable for interactive applications with short responses

Asynchronous processing:
- Submit a request, then poll for the result
- Suitable for long-running tasks
- Enables better resource utilization

Queue-based architecture:
- Requests are queued; workers process them at a controlled rate
- Smooths traffic spikes
- Enables priority management
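A minimal sketch of the queue-based pattern using Python's standard library; the single worker thread, throttle rate, and callback interface are illustrative assumptions:
import queue
import threading
import time
from openai import OpenAI

client = OpenAI()
requests_queue = queue.Queue()

def worker(max_per_second=1):
    """Process queued prompts at a controlled, sustainable rate."""
    while True:
        prompt, callback = requests_queue.get()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        callback(response.choices[0].message.content)
        requests_queue.task_done()
        time.sleep(1 / max_per_second)  # throttle request rate

threading.Thread(target=worker, daemon=True).start()

# Enqueue work; the worker drains it at the controlled rate
requests_queue.put(("Summarize REST APIs in one sentence.", print))
requests_queue.put(("Summarize rate limiting in one sentence.", print))
requests_queue.join()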
Error Handling¶
LLM APIs can fail for various reasons:
| Error Type | HTTP Status | Handling Strategy |
|---|---|---|
| Rate limit | 429 | Exponential backoff, queue requests |
| Server error | 500, 503 | Retry with backoff |
| Invalid request | 400 | Log, fix prompt/parameters |
| Authentication | 401, 403 | Check key validity, permissions |
| Context exceeded | 400 | Truncate input, use larger context model |
| Content filter | 400 | Review content, adjust approach |
Robust error handling:
from openai import OpenAI, RateLimitError, APIError
client = OpenAI()
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
except RateLimitError:
# Handle rate limiting with backoff
pass
except APIError as e:
# Handle other API errors
pass
Key Takeaways¶
- REST APIs provide the interface for programmatic LLM access; SDKs simplify integration with language-specific libraries
- API keys authenticate requests; protect them carefully and never expose in client-side code
- Temperature controls output randomness (0 = deterministic, 1+ = creative); top-p offers alternative diversity control
- Max tokens limits output length and affects cost; set appropriately for each use case
- Streaming delivers tokens progressively, dramatically improving perceived latency for interactive applications
- Rate limits constrain usage by requests, tokens, and time; implement exponential backoff and queuing
- Token counting is essential for cost estimation and context management; use provider tokenization libraries
- Cost optimization strategies include model selection, prompt optimization, caching, and batching
Review Questions¶
Why should API keys never be included in client-side code or version control?
Client-side code (JavaScript in browsers, mobile apps) is accessible to end users who can extract embedded keys. Version control systems retain history, so even deleted keys remain accessible in repository history. Exposed keys enable unauthorized usage billed to your account, potential data access if keys have broad permissions, and no way to trace who made specific requests. Best practices: use environment variables, backend proxies, and rotate keys periodically.
How do temperature and top-p parameters affect model output differently?
Temperature scales the probability distribution by dividing logits before softmax. Low temperature (0-0.3) makes the distribution sharper, concentrating probability on top tokens; high temperature (>1.0) flattens it, making unlikely tokens more probable. Top-p (nucleus sampling) dynamically selects the smallest token set exceeding the probability threshold, then samples uniformly within that set. Temperature affects how probabilities are distributed; top-p affects which tokens are even considered. For most applications, adjust one or the other, not both.
What strategies would you recommend to reduce LLM API costs by 50% without significantly impacting quality?
A 50% cost reduction strategy: (1) Model tiering—use smaller models (GPT-3.5, Haiku) for simple tasks, reserving larger models for complex queries, (2) Response caching—cache identical or similar queries (can reduce costs 30-80% depending on query repetition), (3) Prompt optimization—remove redundant instructions, use concise examples (10-30% savings), (4) Output limits—set appropriate max_tokens rather than defaults (prevents overly long responses), (5) Batch processing—combine related requests where possible. Combination of these approaches can achieve 50%+ reduction while maintaining quality for priority use cases.