Extended thinking in Claude: when and how to use deep reasoning

Extended thinking is the mode where Claude “thinks out loud” before giving a response. Instead of jumping straight to an answer, Claude works through the problem step by step, and that reasoning is returned alongside the answer. For complex tasks, this often produces noticeably better results.

How it works internally

When you enable extended thinking, Claude generates a thinking block before the text block of the response. This block contains the internal reasoning: hypotheses, checks, self-corrections. It’s not decorative: it is the reasoning that improves the final response.

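Thinking tokens are billed at the same rate as regular output tokens and count toward the request’s max_tokens, so budget_tokens must be smaller than max_tokens. They are treated differently for context limits: per Anthropic’s documentation, thinking blocks from previous turns are not counted toward the context window.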
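Because thinking is billed at the output rate, a rough cost estimate simply adds the two token counts. This is a minimal sketch: the function name is hypothetical and the price in the example is a placeholder, not Anthropic’s actual pricing.

```python
def estimate_output_cost(thinking_tokens: int, text_tokens: int,
                         usd_per_million_output: float) -> float:
    """Estimate output cost when thinking is billed at the output-token rate."""
    total_output_tokens = thinking_tokens + text_tokens
    return total_output_tokens * usd_per_million_output / 1_000_000


# Placeholder price for illustration only; check the current pricing page.
cost = estimate_output_cost(thinking_tokens=8_000, text_tokens=2_000,
                            usd_per_million_output=15.0)
print(f"${cost:.2f}")
```

In a real call, the billed output count is reported on response.usage.output_tokens, so an estimate like this can be compared against it.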

Enabling extended thinking in the API

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{
        "role": "user",
        "content": "Design a microservices architecture for an e-commerce platform with 50k concurrent users."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking)
    elif block.type == "text":
        print("RESPONSE:", block.text)

The budget_tokens parameter sets the maximum number of tokens Claude can spend on thinking. A larger budget allows more reasoning, which tends to help on hard problems, at the cost of higher latency and spend.
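It can help to validate the budget before sending a request. The sketch below assumes two constraints from the API documentation as I understand it: a minimum budget of 1,024 tokens, and a budget strictly smaller than max_tokens. The helper name thinking_config is hypothetical.

```python
MIN_BUDGET_TOKENS = 1024  # assumed minimum; verify against the current API docs


def thinking_config(budget_tokens: int, max_tokens: int) -> dict:
    """Build the `thinking` argument, rejecting budgets the API would likely refuse."""
    if budget_tokens < MIN_BUDGET_TOKENS:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET_TOKENS}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}


# Same values as the request above: 10,000 thinking tokens out of 16,000 total.
config = thinking_config(budget_tokens=10_000, max_tokens=16_000)
print(config)
```

The returned dict can be passed directly as the thinking argument of client.messages.create.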

When to use extended thinking

Worth enabling for:

Complex software architecture problems
Code analysis with multiple dependencies
Mathematical or logical reasoning
Decisions with many trade-offs
Tasks where the first attempt is usually wrong

Not needed for:

Creative text generation
Translations
Structured data extraction
Questions with direct answers

Recommended budget tokens by use case

Task	Suggested budget (tokens)
Simple code analysis	2,000
Architecture design	8,000
Hard math problems	10,000
Complex systems analysis	15,000+
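One way to apply the table above in code is a small lookup that turns a task label into request arguments. This is a sketch under assumptions: the task keys, the request_kwargs helper, and the 4,000-token headroom for the visible answer are all hypothetical choices, not recommendations from Anthropic.

```python
# Budgets copied from the table above; the task keys are hypothetical labels.
SUGGESTED_BUDGETS = {
    "simple_code_analysis": 2_000,
    "architecture_design": 8_000,
    "hard_math": 10_000,
    "complex_systems_analysis": 15_000,
}

ANSWER_HEADROOM = 4_000  # assumed room left for the visible answer


def request_kwargs(task: str) -> dict:
    """Return max_tokens (and thinking, if the task is listed) for a request."""
    budget = SUGGESTED_BUDGETS.get(task)
    if budget is None:
        # Unlisted tasks (translation, extraction, ...) run without thinking.
        return {"max_tokens": ANSWER_HEADROOM}
    return {
        "max_tokens": budget + ANSWER_HEADROOM,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }


print(request_kwargs("architecture_design"))
```

The result can be unpacked into a call, for example client.messages.create(model=..., messages=..., **request_kwargs("hard_math")).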

The model stops when it reaches a satisfactory answer, not when it exhausts the budget. If you see thinking cutting off before concluding, increase the budget.
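A truncated response usually shows up as a stop_reason of "max_tokens", since thinking and answer share the same output limit. The sketch below picks a larger budget for a retry in that case. The next_budget helper, the doubling strategy, and the 32,000-token cap are assumptions for illustration.

```python
from typing import Optional


def next_budget(stop_reason: str, budget_tokens: int,
                cap: int = 32_000) -> Optional[int]:
    """Return a larger budget to retry with, or None if no retry applies."""
    if stop_reason != "max_tokens":
        return None  # the response finished normally
    if budget_tokens >= cap:
        return None  # already at the assumed ceiling; give up
    return min(budget_tokens * 2, cap)


print(next_budget("max_tokens", 8_000))
print(next_budget("end_turn", 8_000))
```

When retrying, max_tokens has to grow along with the budget so that the answer still has room after the thinking.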

Streaming with extended thinking

For production, use streaming to avoid blocking while Claude thinks:

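with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "..."}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            # The delta type says whether this chunk is reasoning or answer text;
            # event.index identifies the content block it belongs to.
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)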
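One way to handle the deltas is to collect reasoning and answer text into separate buffers. The sketch below assumes the event shape used by the SDK's stream (content_block_delta events whose delta is a thinking_delta or a text_delta). The split_stream helper is hypothetical, and the fake events are stand-ins so the example runs without an API call.

```python
from types import SimpleNamespace


def split_stream(events):
    """Collect thinking and answer text from content_block_delta events."""
    thinking_parts, text_parts = [], []
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # ignore start/stop and other bookkeeping events
        delta = event.delta
        if delta.type == "thinking_delta":
            thinking_parts.append(delta.thinking)
        elif delta.type == "text_delta":
            text_parts.append(delta.text)
    return "".join(thinking_parts), "".join(text_parts)


# Stand-in events mimicking the assumed shape of real stream events.
fake_events = [
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="thinking_delta", thinking="Check load. ")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="Use a queue.")),
    SimpleNamespace(type="message_stop"),
]
thinking, answer = split_stream(fake_events)
print("REASONING:", thinking)
print("RESPONSE:", answer)
```

With a live stream, the same function can be called as split_stream(stream) inside the with block, at the cost of losing incremental display.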

Extended thinking is one of Claude’s most underused capabilities. For problems where the first response isn’t enough, it is often the single change with the most impact on quality.