Extended thinking is the mode where Claude “thinks out loud” before giving a response. Instead of jumping straight to an answer, Claude works through the problem internally and you can see that process. For complex tasks, the quality difference is significant.

How it works internally

When you enable extended thinking, Claude generates a thinking block before the text block of the response. This block contains the internal reasoning: hypotheses, checks, self-corrections. It’s not decorative — it’s the real process that improves the final response.

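Thinking tokens are billed at the same rate as regular output tokens, and they count toward the max_tokens limit of the request. Context accounting is where they differ: according to Anthropic's documentation, thinking blocks from earlier turns are not carried forward into the context window of later turns.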
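As a rough illustration of what that means for cost, here is a minimal sketch of the arithmetic. The function name, the token counts, and the price of 10 USD per million output tokens are all placeholders chosen for the example, not real pricing; substitute the current rate for your model.

```python
def estimate_output_cost(output_tokens, usd_per_million_tokens):
    """Estimate the cost of a response's output tokens (thinking included)."""
    return output_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical response: 9,000 thinking tokens plus 1,000 answer tokens,
# priced at a placeholder rate of 10 USD per million output tokens.
thinking_tokens = 9_000
answer_tokens = 1_000
cost = estimate_output_cost(thinking_tokens + answer_tokens, usd_per_million_tokens=10.0)
print(f"Estimated output cost: ${cost:.2f}")
```

In a real call the combined figure is what the API reports as output tokens in the response's usage data, so the thinking share dominates the bill whenever the reasoning is long and the answer is short.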

Enabling extended thinking in the API

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{
        "role": "user",
        "content": "Design a microservices architecture for an e-commerce platform with 50k concurrent users."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking)
    elif block.type == "text":
        print("RESPONSE:", block.text)

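The budget_tokens parameter sets the maximum number of tokens Claude may spend on thinking. It must be smaller than max_tokens, because thinking and the final answer share that output limit. A larger budget allows more reasoning, which tends to help on hard problems, at the price of higher cost and latency.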
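Because the thinking budget and the final answer share the same output allowance, it can help to validate the two numbers before sending a request. The sketch below is a hypothetical helper, not part of the anthropic library. It assumes a minimum budget of 1,024 tokens and that budget_tokens must be strictly less than max_tokens; check both against the current API documentation.

```python
MIN_BUDGET_TOKENS = 1024  # assumed minimum; verify against the current API docs

def thinking_config(max_tokens, budget_tokens):
    """Build the thinking parameter, rejecting budgets the API would likely refuse."""
    if budget_tokens < MIN_BUDGET_TOKENS:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET_TOKENS}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}

# Same numbers as the request above: 10,000 for thinking, 6,000 left for the answer.
config = thinking_config(max_tokens=16000, budget_tokens=10000)
print(config)
```

The returned dictionary can be passed as the thinking argument of client.messages.create.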

When to use extended thinking

Worth enabling for:

  • Complex software architecture problems
  • Code analysis with multiple dependencies
  • Mathematical or logical reasoning
  • Decisions with many trade-offs
  • Tasks where the first attempt is usually wrong

Not needed for:

  • Creative text generation
  • Translations
  • Structured data extraction
  • Questions with direct answers

Task                        Suggested budget (tokens)
Simple code analysis        2,000
Architecture design         8,000
Hard math problems          10,000
Complex systems analysis    15,000+
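
One way to apply these suggestions is a small lookup that turns a task label into request arguments. This is a sketch only: the task keys and the thinking_kwargs helper are names invented for the example, and 15,000 is used as the starting point for the open-ended "15,000+" entry.

```python
# Suggested budgets from the table, keyed by hypothetical task labels.
SUGGESTED_BUDGETS = {
    "simple_code_analysis": 2_000,
    "architecture_design": 8_000,
    "hard_math": 10_000,
    "complex_systems_analysis": 15_000,  # starting point for "15,000+"
}

def thinking_kwargs(task):
    """Return extra request arguments: a thinking config for known tasks, else nothing."""
    budget = SUGGESTED_BUDGETS.get(task)
    if budget is None:
        return {}  # e.g. translations or extraction: leave thinking off
    return {"thinking": {"type": "enabled", "budget_tokens": budget}}

print(thinking_kwargs("architecture_design"))
print(thinking_kwargs("translation"))
```

The result can be splatted into a request, for example client.messages.create(model=..., max_tokens=..., messages=..., **thinking_kwargs("hard_math")), so that tasks outside the table run without thinking.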

The model stops when it reaches a satisfactory answer, not when it exhausts the budget. If you see thinking cutting off before concluding, increase the budget.
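
A retry policy for truncated responses could look like the sketch below. It treats a stop_reason of "max_tokens" as the signal that output was cut off, doubles the thinking budget, and keeps the room reserved for the answer unchanged. The next_attempt name and the 64,000-token ceiling are assumptions made for the example, not API limits.

```python
def next_attempt(stop_reason, budget_tokens, max_tokens, ceiling=64_000):
    """Return (budget_tokens, max_tokens) for a retry, or None if no retry applies."""
    if stop_reason != "max_tokens":
        return None  # the response finished on its own
    answer_room = max_tokens - budget_tokens
    # Double the budget, but keep budget + answer room under the assumed ceiling.
    new_budget = min(budget_tokens * 2, ceiling - answer_room)
    if new_budget <= budget_tokens:
        return None  # already at the ceiling; retrying would not help
    return new_budget, new_budget + answer_room

print(next_attempt("end_turn", 8_000, 16_000))
print(next_attempt("max_tokens", 8_000, 16_000))
```

In practice the first argument would come from response.stop_reason after a call, and the returned pair would feed the next request.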

Streaming with extended thinking

For production, use streaming to avoid blocking while Claude thinks:

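with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "..."}]
) as stream:
    for event in stream:
        # Content arrives as deltas; each one belongs to a thinking or a text block
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                # Fragment of the reasoning
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                # Fragment of the final answer
                print(event.delta.text, end="", flush=True)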
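
When the reasoning and the answer need to be stored separately rather than handled as one stream, the deltas can be accumulated into two buffers. The sketch below runs offline: collect_stream and make_delta are hypothetical helpers, and the SimpleNamespace objects stand in for SDK stream events, assuming they expose type, delta.type, delta.thinking, and delta.text as in the streaming loop above.

```python
from types import SimpleNamespace

def collect_stream(events):
    """Accumulate streamed deltas into (reasoning, answer) strings."""
    thinking_parts, text_parts = [], []
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # skip message_start, content_block_stop, and similar events
        delta = event.delta
        if delta.type == "thinking_delta":
            thinking_parts.append(delta.thinking)
        elif delta.type == "text_delta":
            text_parts.append(delta.text)
    return "".join(thinking_parts), "".join(text_parts)

def make_delta(kind, **fields):
    """Build a stand-in for a content_block_delta event (test helper)."""
    return SimpleNamespace(type="content_block_delta",
                           delta=SimpleNamespace(type=kind, **fields))

fake_events = [
    SimpleNamespace(type="message_start"),
    make_delta("thinking_delta", thinking="Compare "),
    make_delta("thinking_delta", thinking="both options."),
    make_delta("text_delta", text="Use option A."),
]
reasoning, answer = collect_stream(fake_events)
print("REASONING:", reasoning)
print("RESPONSE:", answer)
```

Inside a real with client.messages.stream(...) block, the stream object itself would be passed to collect_stream in place of fake_events.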

Extended thinking is one of Claude’s most underused capabilities. For problems where the first response isn’t enough, it’s the change that has the most impact on quality.