LLM API Integration¶
Summary¶
This chapter covers the technical aspects of integrating large language models into applications through APIs. Students will learn API fundamentals, authentication methods, and how to configure parameters like temperature and max tokens. The chapter also addresses practical concerns like rate limiting, cost optimization, and token counting for production deployments.
Concepts Covered¶
This chapter covers the following 17 concepts from the learning graph:
- API Fundamentals
- REST API
- SDK
- OpenAI API
- Anthropic API
- API Endpoints
- API Authentication
- API Keys
- Temperature Parameter
- Top-P Parameter
- Max Tokens Parameter
- Stop Sequences
- Streaming Responses
- Rate Limiting
- Cost Optimization
- API Pricing
- Token Counting
Prerequisites¶
This chapter builds on concepts from:
Learning Objectives¶
After completing this chapter, students will be able to:
- Use the OpenAI and Anthropic APIs to implement text generation
- Apply API parameters appropriately to control output characteristics
- Manage rate limiting and optimize costs for API usage
- Count tokens and estimate costs for AI applications
- Design API integration architectures for enterprise applications
Introduction¶
While consumer interfaces like ChatGPT demonstrate generative AI capabilities, building production applications requires direct API integration. Application Programming Interfaces (APIs) provide programmatic access to LLM capabilities, enabling developers to embed AI into custom applications, automate workflows, and create novel user experiences.
This chapter provides the technical foundation for working with LLM APIs. We explore the mechanics of API communication, authentication practices, parameter configuration, and operational concerns including rate limiting and cost management. Whether you are building a customer service chatbot or a document analysis pipeline, mastering these concepts is essential for successful AI application development.
API Fundamentals¶
What Is an API?¶
An Application Programming Interface (API) is a contract between software systems that defines how they communicate. APIs specify request formats, response structures, and the operations available. For LLMs, APIs allow applications to send prompts and receive generated text programmatically.
Key API concepts:
| Concept | Description |
|---|---|
| Endpoint | A specific URL where API requests are sent |
| Request | Data sent to the API (method, headers, body) |
| Response | Data returned from the API (status, headers, body) |
| Authentication | Verification of caller identity and permissions |
| Rate limiting | Constraints on request frequency |
REST APIs¶
REST (Representational State Transfer) is the dominant architectural style for web APIs. LLM providers use REST APIs with HTTP methods to expose their models.
Common HTTP methods:
| Method | Purpose | LLM API Usage |
|---|---|---|
| POST | Create/submit data | Submit prompts for completion |
| GET | Retrieve data | List models, check status |
| DELETE | Remove data | Delete fine-tuned models |
A typical REST API request:
POST /v1/chat/completions HTTP/1.1
Host: api.openai.com
Authorization: Bearer sk-your-api-key
Content-Type: application/json
{
    "model": "gpt-4",
    "messages": [
        {"role": "user", "content": "Explain REST APIs briefly."}
    ]
}
Software Development Kits (SDKs)¶
SDKs are client libraries that simplify API interaction. Rather than manually constructing HTTP requests, developers use language-specific objects and methods.
SDK benefits:
- Abstraction: Hide HTTP complexity behind clean interfaces
- Type safety: Catch errors at compile time (in typed languages)
- Convenience: Built-in serialization, error handling, retries
- Maintenance: SDK updates as API evolves
SDK example (Python with OpenAI):
from openai import OpenAI
client = OpenAI() # Uses OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Explain REST APIs briefly."}
    ]
)
print(response.choices[0].message.content)
Major LLM APIs¶
OpenAI API¶
The OpenAI API provides access to GPT models, DALL-E image generation, Whisper transcription, and embedding models.
Key endpoints:
| Endpoint | Purpose |
|---|---|
| /v1/chat/completions | Conversational text generation |
| /v1/completions | Legacy text completion (deprecated for most models) |
| /v1/embeddings | Generate vector embeddings |
| /v1/images/generations | Create images with DALL-E |
| /v1/audio/transcriptions | Transcribe audio with Whisper |
| /v1/fine-tuning/jobs | Manage fine-tuning |
Chat completions request structure:
{
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}
Anthropic API¶
The Anthropic API provides access to Claude models with a focus on safety and extended context.
Key endpoint: /v1/messages
Anthropic request structure:
{
    "model": "claude-3-sonnet-20240229",
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "messages": [
        {"role": "user", "content": "What is machine learning?"}
    ]
}
Key differences from OpenAI:
| Aspect | OpenAI | Anthropic |
|---|---|---|
| System prompt | In messages array | Separate system field |
| Model naming | gpt-4, gpt-4-turbo | claude-3-sonnet-20240229 |
| Default context | Varies by model | 200K standard |
| Header auth | Authorization: Bearer | x-api-key |
Authentication and Security¶
API Keys¶
API keys are secret tokens that authenticate API requests. They identify the calling application and associate usage with a billing account.
API key best practices:
- Never expose in client-side code: Keys in JavaScript, mobile apps, or repositories can be stolen
- Use environment variables: Store keys outside code; reference via process.env or similar
- Rotate periodically: Generate new keys and deprecate old ones
- Restrict permissions: Use project-specific keys with minimal permissions
- Monitor usage: Set up alerts for unexpected consumption patterns
Environment variable usage:
import os
from openai import OpenAI
# Key loaded from OPENAI_API_KEY environment variable
client = OpenAI()
# Or explicitly:
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
Security Warning
Exposed API keys can result in significant financial liability and data exposure. Immediately rotate any key that may have been compromised. Use secret scanning tools to prevent accidental commits.
Authentication Headers¶
LLM APIs pass credentials in HTTP request headers, but the header names differ by provider: OpenAI uses the standard Authorization header with a Bearer token, while Anthropic uses a custom x-api-key header alongside a required anthropic-version header.
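Illustrative request headers (the key values shown are placeholders):
OpenAI:
Authorization: Bearer sk-your-api-key
Content-Type: application/json
Anthropic:
x-api-key: your-api-key
anthropic-version: 2023-06-01
Content-Type: application/json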
Generation Parameters¶
Temperature¶
The temperature parameter controls the randomness of model outputs. It scales the probability distribution over tokens before sampling.
| Temperature | Effect | Use Cases |
|---|---|---|
| 0.0 | Deterministic; highest probability token always selected | Factual Q&A, code generation, consistency-critical tasks |
| 0.3-0.5 | Low variation; mostly predictable with occasional diversity | Professional writing, summarization |
| 0.7-0.9 | Moderate creativity; balanced exploration | Creative writing, brainstorming |
| 1.0-1.5 | High creativity; unexpected, diverse outputs | Poetry, idea generation, experimental content |
The mathematical effect: temperature divides the logits (pre-softmax scores) before computing probabilities. Lower temperature sharpens the distribution (concentrating probability on top tokens); higher temperature flattens it (more uniform sampling).
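A small numerical sketch of this effect in plain Python (the logit values are hypothetical):
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Lower temperature concentrates probability on the top token;
# higher temperature flattens the distribution toward uniform.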
Top-P (Nucleus Sampling)¶
Top-P (nucleus sampling) offers an alternative to temperature for controlling diversity. Instead of scaling probabilities, top-p dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold.
| Top-P | Effect |
|---|---|
| 0.1 | Very focused; only top ~10% probability mass considered |
| 0.5 | Moderate; top 50% probability mass |
| 0.9 | Broad; most tokens considered except extreme tail |
| 1.0 | All tokens considered (equivalent to no nucleus sampling) |
Temperature vs. Top-P
OpenAI recommends adjusting one or the other, not both simultaneously. Temperature is generally more intuitive; top-p provides finer control for specific applications.
Max Tokens¶
The max tokens parameter limits the length of generated output. It specifies the maximum number of tokens the model will generate before stopping.
Considerations:
- Output may be shorter if the model generates a stop token naturally
- Setting too low truncates responses mid-thought
- Setting too high increases cost and latency unnecessarily
- Context window limits apply to input + output combined
Estimation guidelines:
| Content Type | Approximate Tokens |
|---|---|
| Short answer | 50-100 |
| Paragraph | 100-250 |
|  | 150-400 |
| Page of text | 500-700 |
| Long document | 1000+ |
Stop Sequences¶
Stop sequences are strings that, when generated, cause the model to stop producing output. They enable structured generation and prevent runaway responses.
Example use cases:
# Stop at end of first sentence
stop=["."]
# Stop at markdown headers or code blocks
stop=["##", "```"]
# Stop at JSON object close
stop=["}"]
Stop sequences are useful for:
- Extracting single items from potential lists
- Preventing model from adding unwanted commentary
- Enforcing output structure
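A minimal sketch combining these generation parameters in a single OpenAI request (the specific values are illustrative, not recommendations; top_p is left at its default per the note above):
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List one benefit of REST APIs."}],
    temperature=0.3,   # low randomness for a factual answer
    max_tokens=100,    # cap output length (and cost)
    stop=["\n\n"],     # stop at the first blank line
)

print(response.choices[0].message.content)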
Streaming Responses¶
Why Stream?¶
Streaming responses return tokens as they're generated rather than waiting for complete output. This dramatically improves perceived latency for users.
Without streaming, the client waits for the entire completion to finish before displaying anything. With streaming, tokens appear in the interface as soon as they are generated. For a 500-token response, streaming delivers first content in ~100ms versus ~3000ms for non-streaming.
Implementing Streaming¶
OpenAI streaming example:
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about APIs."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Server-Sent Events (SSE) format:
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"APIs"}}]}
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" connect"}}]}
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" us"}}]}
data: [DONE]
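Under the hood, the SDK is parsing this SSE stream. The following is a rough sketch of consuming it manually with the requests library (the request fields mirror the earlier examples; error handling is omitted):
import json
import os
import requests

headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a haiku about APIs."}],
    "stream": True,
}

with requests.post("https://api.openai.com/v1/chat/completions",
                   headers=headers, json=payload, stream=True, timeout=60) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"]
            print(delta.get("content", ""), end="", flush=True)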
Rate Limiting¶
Understanding Rate Limits¶
Rate limiting restricts how frequently and how much you can use an API within time windows. LLM APIs enforce multiple limits:
| Limit Type | Description |
|---|---|
| Requests per minute (RPM) | Maximum API calls per minute |
| Tokens per minute (TPM) | Maximum tokens processed per minute |
| Tokens per day (TPD) | Maximum tokens processed per day |
| Concurrent requests | Maximum simultaneous requests |
Rate limits vary by:
- Subscription tier (free, pay-as-you-go, enterprise)
- Account history and usage patterns
- Specific model (GPT-4 often has lower limits than GPT-3.5)
Handling Rate Limits¶
When limits are exceeded, APIs return HTTP 429 (Too Many Requests) errors.
Mitigation strategies:
Exponential backoff with jitter:
import time
import random
from openai import RateLimitError

def call_with_retry(func, max_retries=5):
    """Retry a callable on rate-limit errors using exponential backoff with jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            # Wait 1, 2, 4, 8, ... seconds plus random jitter to avoid synchronized retries
            wait = (2 ** attempt) + random.random()
            time.sleep(wait)
    raise Exception("Max retries exceeded")
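Usage sketch, wrapping the earlier completion call and assuming the client object from the SDK examples above (the lambda defers execution so each retry re-issues the request):
response = call_with_retry(
    lambda: client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Explain rate limiting briefly."}],
    )
)
print(response.choices[0].message.content)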
Request batching: Combine multiple small requests into fewer larger ones
Request queuing: Buffer requests and process at sustainable rate
Load distribution: Spread requests across multiple API keys or accounts (where permitted)
Token Counting and Cost Optimization¶
Counting Tokens¶
Understanding token counts is essential for cost estimation and context management.
OpenAI's tiktoken library:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
text = "How many tokens is this sentence?"
tokens = encoding.encode(text)
print(f"Token count: {len(tokens)}") # Output: 7
Token counting considerations:
- Different models use different tokenizers
- Special tokens (system instructions, formatting) add overhead
- Non-English text often requires more tokens per word
- Code typically requires more tokens than prose
API Pricing¶
LLM APIs charge per token, typically quoted per million tokens:
| Model | Input Price | Output Price |
|---|---|---|
| GPT-4 | $30/1M tokens | $60/1M tokens |
| GPT-4 Turbo | $10/1M tokens | $30/1M tokens |
| GPT-4o | $5/1M tokens | $15/1M tokens |
| GPT-3.5 Turbo | $0.50/1M tokens | $1.50/1M tokens |
| Claude 3 Opus | $15/1M tokens | $75/1M tokens |
| Claude 3 Sonnet | $3/1M tokens | $15/1M tokens |
| Claude 3 Haiku | $0.25/1M tokens | $1.25/1M tokens |
Prices as of early 2024; check current pricing
Cost calculation: cost per request = (input tokens ÷ 1,000,000) × input price + (output tokens ÷ 1,000,000) × output price, using the per-million-token prices above.
Cost Optimization Strategies¶
| Strategy | Implementation | Potential Savings |
|---|---|---|
| Model selection | Use smaller models for simple tasks | 50-90% |
| Prompt optimization | Shorter prompts, fewer examples | 10-30% |
| Caching | Cache responses for repeated queries | 30-80% |
| Batching | Process multiple items per API call | 10-20% |
| Output limits | Set appropriate max_tokens | 10-40% |
| Context management | Summarize rather than include full history | 20-50% |
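Of these, response caching is often the quickest win. Below is a minimal sketch of an in-memory cache keyed by a hash of the request; the dictionary cache, helper name, and lack of expiry are illustrative assumptions, and a production system would typically use a shared store such as Redis:
import hashlib
import json
from openai import OpenAI

client = OpenAI()
_cache = {}  # maps request fingerprint -> response text

def cached_completion(model, messages, **params):
    """Return a cached response for identical requests; call the API otherwise."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages, **params},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model=model, messages=messages, **params
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]

# The second identical call is served from the cache at zero API cost
print(cached_completion("gpt-4", [{"role": "user", "content": "Define REST."}]))
print(cached_completion("gpt-4", [{"role": "user", "content": "Define REST."}]))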
Diagram: Cost Optimization Decision Tree¶
The following decision tree guides LLM API cost optimization decisions based on task complexity and usage volume.
flowchart TD
START["🎯 Start: What is the<br/>task complexity?"]
START -->|Simple| SIMPLE["📉 Simple Task Path"]
START -->|Complex| COMPLEX["📈 Complex Task Path"]
subgraph SimpleOpt["Simple Task Optimizations"]
S1["Use smallest capable model<br/>GPT-3.5, Haiku, Llama 8B"]
S2["Keep prompts minimal<br/>Remove unnecessary context"]
S3["Set low max_tokens<br/>Match expected output length"]
S1 --> S2 --> S3
end
subgraph ComplexOpt["Complex Task Optimizations"]
C1["Tiered approach:<br/>Small model first, escalate if needed"]
C2["Cache complex analysis<br/>Reuse for similar inputs"]
C3["Batch related requests<br/>Reduce per-request overhead"]
C1 --> C2 --> C3
end
SIMPLE --> SimpleOpt
COMPLEX --> ComplexOpt
VOL{"High Volume?<br/>>10K requests/day"}
SimpleOpt --> VOL
ComplexOpt --> VOL
subgraph HighVol["High Volume Optimizations"]
H1["Response caching<br/>Semantic deduplication"]
H2["Fine-tuning<br/>Reduce prompt tokens"]
H3["Self-hosted open-source<br/>Eliminate per-token costs"]
end
VOL -->|Yes| HighVol
VOL -->|No| DONE["✅ Apply selected<br/>optimizations"]
HighVol --> DONE
style START fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
style SimpleOpt fill:#E8F5E9,stroke:#388E3C
style ComplexOpt fill:#FFF3E0,stroke:#F57C00
style HighVol fill:#FCE4EC,stroke:#C2185B
Cost Optimization Quick Reference:
| Optimization | Savings Potential | Implementation Effort | Best For |
|---|---|---|---|
| Smaller model | 50-90% | Low | Simple tasks currently using large models |
| Prompt reduction | 20-40% | Low | Verbose system prompts |
| Output limits | 10-40% | Low | Tasks generating more tokens than needed |
| Response caching | 30-70% | Medium | Repeated similar queries |
| Tiered models | 40-60% | Medium | Mix of simple and complex tasks |
| Fine-tuning | 50-80% | High | High-volume, specialized tasks |
| Self-hosted | 70-95% | High | Very high volume, privacy requirements |
Monthly Cost Estimation Formula:
Monthly Cost = (Requests/day × 30) × (Avg Input Tokens × Input Price per Token + Avg Output Tokens × Output Price per Token)
Example Calculation
- 10,000 requests/day × 30 = 300,000 requests/month
- 500 input tokens @ $0.003/1K = $0.0015/request
- 200 output tokens @ $0.006/1K = $0.0012/request
- Monthly cost: 300,000 × ($0.0015 + $0.0012) = $810/month
Switching to a model with 10× lower prices for 80% of requests brings this to roughly $227/month (about a 72% reduction), as the sketch below shows.
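A quick sketch that reproduces this arithmetic; prices are the per-1K-token figures from the example, and the function name is illustrative:
def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                 input_price_per_1k, output_price_per_1k):
    """Estimate monthly API spend in dollars."""
    per_request = (avg_input_tokens / 1000) * input_price_per_1k \
        + (avg_output_tokens / 1000) * output_price_per_1k
    return requests_per_day * 30 * per_request

base = monthly_cost(10_000, 500, 200, 0.003, 0.006)
print(base)  # 810.0

# Route 80% of requests to a model with 10x lower prices, keep 20% on the original
mixed = monthly_cost(2_000, 500, 200, 0.003, 0.006) \
    + monthly_cost(8_000, 500, 200, 0.0003, 0.0006)
print(mixed)  # 162.0 + 64.8 = 226.8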
Production Architecture¶
Integration Patterns¶
Synchronous request-response:
- Client waits for the API response
- Simplest pattern
- Suitable for interactive applications with short responses

Asynchronous processing:
- Submit a request, then poll for the result
- Suitable for long-running tasks
- Enables better resource utilization

Queue-based architecture:
- Requests are queued; workers process them at a controlled rate
- Smooths traffic spikes
- Enables priority management
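A minimal sketch of the queue-based pattern using Python's standard library; the single worker thread, throttle rate, and callback interface are illustrative assumptions:
import queue
import threading
import time
from openai import OpenAI

client = OpenAI()
requests_queue = queue.Queue()

def worker(max_per_second=1):
    """Process queued prompts at a controlled, sustainable rate."""
    while True:
        prompt, callback = requests_queue.get()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        callback(response.choices[0].message.content)
        requests_queue.task_done()
        time.sleep(1 / max_per_second)  # throttle request rate

threading.Thread(target=worker, daemon=True).start()

# Enqueue work; the worker drains it at the controlled rate
requests_queue.put(("Summarize REST APIs in one sentence.", print))
requests_queue.put(("Summarize rate limiting in one sentence.", print))
requests_queue.join()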
Error Handling¶
LLM APIs can fail for various reasons:
| Error Type | HTTP Status | Handling Strategy |
|---|---|---|
| Rate limit | 429 | Exponential backoff, queue requests |
| Server error | 500, 503 | Retry with backoff |
| Invalid request | 400 | Log, fix prompt/parameters |
| Authentication | 401, 403 | Check key validity, permissions |
| Context exceeded | 400 | Truncate input, use larger context model |
| Content filter | 400 | Review content, adjust approach |
Robust error handling:
from openai import OpenAI, RateLimitError, APIError
client = OpenAI()
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
except RateLimitError:
# Handle rate limiting with backoff
pass
except APIError as e:
# Handle other API errors
pass
Key Takeaways¶
- REST APIs provide the interface for programmatic LLM access; SDKs simplify integration with language-specific libraries
- API keys authenticate requests; protect them carefully and never expose in client-side code
- Temperature controls output randomness (0 = deterministic, 1+ = creative); top-p offers alternative diversity control
- Max tokens limits output length and affects cost; set appropriately for each use case
- Streaming delivers tokens progressively, dramatically improving perceived latency for interactive applications
- Rate limits constrain usage by requests, tokens, and time; implement exponential backoff and queuing
- Token counting is essential for cost estimation and context management; use provider tokenization libraries
- Cost optimization strategies include model selection, prompt optimization, caching, and batching
Review Questions¶
Why should API keys never be included in client-side code or version control?
Client-side code (JavaScript in browsers, mobile apps) is accessible to end users who can extract embedded keys. Version control systems retain history, so even deleted keys remain accessible in repository history. Exposed keys enable unauthorized usage billed to your account, potential data access if keys have broad permissions, and no way to trace who made specific requests. Best practices: use environment variables, backend proxies, and rotate keys periodically.
How do temperature and top-p parameters affect model output differently?
Temperature scales the probability distribution by dividing logits before softmax. Low temperature (0-0.3) makes the distribution sharper, concentrating probability on top tokens; high temperature (>1.0) flattens it, making unlikely tokens more probable. Top-p (nucleus sampling) dynamically selects the smallest token set exceeding the probability threshold, then samples uniformly within that set. Temperature affects how probabilities are distributed; top-p affects which tokens are even considered. For most applications, adjust one or the other, not both.
What strategies would you recommend to reduce LLM API costs by 50% without significantly impacting quality?
A 50% cost reduction strategy: (1) Model tiering—use smaller models (GPT-3.5, Haiku) for simple tasks, reserving larger models for complex queries, (2) Response caching—cache identical or similar queries (can reduce costs 30-80% depending on query repetition), (3) Prompt optimization—remove redundant instructions, use concise examples (10-30% savings), (4) Output limits—set appropriate max_tokens rather than defaults (prevents overly long responses), (5) Batch processing—combine related requests where possible. Combination of these approaches can achieve 50%+ reduction while maintaining quality for priority use cases.