Amazon Bedrock Inference

Overview

Amazon Bedrock Runtime provides APIs for invoking foundation models, including Claude (Opus, Sonnet, Haiku), Amazon Nova, Amazon Titan, and third-party models (Cohere, AI21, Meta). It supports synchronous, streaming, and asynchronous inference.

Purpose: Production-grade model inference with unified API across all Bedrock models

Pattern: Task-based (independent operations for different inference modes)

Key Capabilities:

  1. Model Invocation - Direct model calls with native or Converse API
  2. Streaming - Real-time token streaming for low latency
  3. Async Invocation - Long-running tasks up to 24 hours
  4. Token Counting - Cost estimation before inference
  5. Guardrails - Runtime content filtering and safety
  6. Inference Profiles - Cross-region routing and cost optimization

Quality Targets:

  • Latency: < 1s first token for streaming
  • Throughput: Up to 4,000 tokens/sec
  • Availability: 99.9% SLA with cross-region profiles

When to Use

Use bedrock-inference when:

  • Invoking Claude, Nova, Titan, or other Bedrock models
  • Building conversational AI applications
  • Implementing streaming responses for better UX
  • Running long-running async inference tasks
  • Applying runtime guardrails for content safety
  • Optimizing costs with inference profiles
  • Counting tokens before model invocation
  • Implementing multi-turn conversations

When NOT to Use:

  • Building complex agents (use bedrock-agentcore)
  • Knowledge base RAG (use bedrock-knowledge-bases)
  • Model customization (use bedrock-fine-tuning)

Prerequisites

Required

  • AWS account with Bedrock access
  • Model access enabled in AWS Console
  • IAM permissions for Bedrock Runtime

Recommended

  • boto3 >= 1.34.0 (for latest Converse API)
  • Understanding of model-specific input formats
  • CloudWatch for monitoring

Installation

pip install boto3 botocore

Enable Model Access

# Check available models
aws bedrock list-foundation-models --region us-east-1

# Request model access via Console:
# AWS Console → Bedrock → Model access → Manage model access

Model IDs and Inference Profiles

Claude Models (Anthropic)

| Model | Model ID | Inference Profile ID | Region | Context Window |
|---|---|---|---|---|
| Claude Opus 4.5 | anthropic.claude-opus-4-5-20251101-v1:0 | global.anthropic.claude-opus-4-5-20251101-v1:0 | Global | 200K |
| Claude Sonnet 4.5 | anthropic.claude-sonnet-4-5-20250929-v1:0 | us.anthropic.claude-sonnet-4-5-20250929-v1:0 | US | 200K |
| Claude Haiku 4.5 | anthropic.claude-haiku-4-5-20251001-v1:0 | us.anthropic.claude-haiku-4-5-20251001-v1:0 | US | 200K |
| Claude Sonnet 3.5 v2 | anthropic.claude-3-5-sonnet-20241022-v2:0 | us.anthropic.claude-3-5-sonnet-20241022-v2:0 | US | 200K |
| Claude Haiku 3.5 | anthropic.claude-3-5-haiku-20241022-v1:0 | us.anthropic.claude-3-5-haiku-20241022-v1:0 | US | 200K |

Amazon Nova Models

| Model | Model ID | Inference Profile ID | Region | Context Window |
|---|---|---|---|---|
| Nova Pro | amazon.nova-pro-v1:0 | us.amazon.nova-pro-v1:0 | US | 300K |
| Nova Lite | amazon.nova-lite-v1:0 | us.amazon.nova-lite-v1:0 | US | 300K |
| Nova Micro | amazon.nova-micro-v1:0 | us.amazon.nova-micro-v1:0 | US | 128K |

Amazon Titan Models

| Model | Model ID | Region | Context Window |
|---|---|---|---|
| Titan Text Premier | amazon.titan-text-premier-v1:0 | All | 32K |
| Titan Text Express | amazon.titan-text-express-v1 | All | 8K |

Inference Profile Prefixes

  • us. - US-only routing (lower latency for US traffic)
  • global. - Global cross-region routing (highest availability)
  • apac. - Asia-Pacific routing (lower latency for APAC traffic)
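To target a profile, prefix the base model ID with the routing scope. A tiny helper as a sketch; confirm the profile actually exists in your account (for example, via the ListInferenceProfiles API) before using it:

def to_profile_id(model_id: str, prefix: str = 'us') -> str:
    """Build an inference profile ID from a base model ID"""
    return f"{prefix}.{model_id}"

# Usage
profile = to_profile_id('anthropic.claude-sonnet-4-5-20250929-v1:0')
# → 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'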

Quick Reference

Client Initialization

import boto3
from typing import Optional

def get_bedrock_client(region_name: str = 'us-east-1',
                        profile_name: Optional[str] = None):
    """Initialize Bedrock Runtime client"""
    session = boto3.Session(
        region_name=region_name,
        profile_name=profile_name
    )
    return session.client('bedrock-runtime')

# Usage
bedrock = get_bedrock_client(region_name='us-west-2')

Operations

1. Invoke Model (Native API)

Direct model invocation using model-specific request format.

Basic Invocation:

import json

def invoke_claude(prompt: str, model_id: str = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'):
    """Invoke Claude with native API"""
    bedrock = get_bedrock_client()

    # Claude-specific request format
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps(request_body)
    )

    # Parse response
    response_body = json.loads(response['body'].read())
    return response_body['content'][0]['text']

# Usage
result = invoke_claude("Explain quantum computing in simple terms")
print(result)

With System Prompts:

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "system": "You are a helpful AI assistant specialized in technical documentation.",
    "messages": [
        {
            "role": "user",
            "content": "Write API documentation for a REST endpoint"
        }
    ]
}

With Tool Use:

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "What's the weather in San Francisco?"
        }
    ],
    "tools": [
        {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    ]
}
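When Claude decides to call the tool, the native response carries stop_reason "tool_use" and a content block of type "tool_use". You execute the tool yourself and reply with a "tool_result" block. A minimal sketch of that round trip, assuming bedrock and model_id are in scope and get_weather is a hypothetical local function:

response = bedrock.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body)
)
response_body = json.loads(response['body'].read())

if response_body.get('stop_reason') == 'tool_use':
    # Find the tool_use block among the content blocks
    tool_block = next(b for b in response_body['content'] if b['type'] == 'tool_use')
    weather = get_weather(**tool_block['input'])  # hypothetical tool implementation

    # Append the assistant turn and the tool result, then invoke again
    request_body['messages'].append(
        {'role': 'assistant', 'content': response_body['content']}
    )
    request_body['messages'].append({
        'role': 'user',
        'content': [{
            'type': 'tool_result',
            'tool_use_id': tool_block['id'],
            'content': json.dumps(weather)
        }]
    })
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(request_body))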

2. Converse API (Unified Interface)

Model-agnostic API that works across all Bedrock models with consistent interface.

Basic Conversation:

def converse_with_model(
    messages: list,
    model_id: str = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0',
    system_prompts: Optional[list] = None,
    max_tokens: int = 2048
):
    """Converse API for unified model interaction"""
    bedrock = get_bedrock_client()

    inference_config = {
        'maxTokens': max_tokens,
        'temperature': 0.7,
        'topP': 0.9
    }

    request_params = {
        'modelId': model_id,
        'messages': messages,
        'inferenceConfig': inference_config
    }

    if system_prompts:
        request_params['system'] = system_prompts

    response = bedrock.converse(**request_params)

    return response

# Usage
messages = [
    {
        'role': 'user',
        'content': [
            {'text': 'What are the benefits of microservices architecture?'}
        ]
    }
]

system_prompts = [
    {'text': 'You are a software architecture expert.'}
]

response = converse_with_model(messages, system_prompts=system_prompts)
assistant_message = response['output']['message']
print(assistant_message['content'][0]['text'])

Multi-turn Conversation:

def multi_turn_conversation():
    """Multi-turn conversation with context"""
    bedrock = get_bedrock_client()

    messages = []
    model_id = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'

    # Turn 1
    messages.append({
        'role': 'user',
        'content': [{'text': 'My name is Alice and I work in healthcare.'}]
    })

    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 1024}
    )

    # Add assistant response to history
    messages.append(response['output']['message'])

    # Turn 2 (model remembers context)
    messages.append({
        'role': 'user',
        'content': [{'text': 'What are some AI applications in my field?'}]
    })

    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 1024}
    )

    return response['output']['message']['content'][0]['text']

With Tool Use (Converse API):

def converse_with_tools():
    """Converse API with tool use"""
    bedrock = get_bedrock_client()

    tools = [
        {
            'toolSpec': {
                'name': 'get_stock_price',
                'description': 'Get current stock price for a symbol',
                'inputSchema': {
                    'json': {
                        'type': 'object',
                        'properties': {
                            'symbol': {
                                'type': 'string',
                                'description': 'Stock ticker symbol'
                            }
                        },
                        'required': ['symbol']
                    }
                }
            }
        }
    ]

    messages = [
        {
            'role': 'user',
            'content': [{'text': "What's the price of AAPL stock?"}]
        }
    ]

    response = bedrock.converse(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        messages=messages,
        toolConfig={'tools': tools},
        inferenceConfig={'maxTokens': 2048}
    )

    # Check whether the model requested a tool call
    if response['stopReason'] == 'tool_use':
        # The toolUse block is not necessarily the first content block,
        # so scan all blocks
        for block in response['output']['message']['content']:
            if 'toolUse' in block:
                tool_use = block['toolUse']
                print(f"Tool requested: {tool_use['name']}")
                print(f"Tool input: {tool_use['input']}")

        # Execute the tool, append its result to messages, and call
        # converse again (see the sketch below)

    return response
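Completing the loop: execute the requested tool, append the assistant message plus a toolResult block to the history, and call converse again. A sketch, where get_stock_price is a hypothetical local function:

def answer_tool_request(bedrock, model_id: str, messages: list, response: dict, tools: list):
    """Run the requested tool and send its result back to the model"""
    # Keep the assistant's tool-use turn in the conversation history
    messages.append(response['output']['message'])

    tool_use = next(
        block['toolUse']
        for block in response['output']['message']['content']
        if 'toolUse' in block
    )
    price = get_stock_price(**tool_use['input'])  # hypothetical tool implementation

    # Tool results go back as a user message with a toolResult block,
    # keyed by the toolUseId from the request
    messages.append({
        'role': 'user',
        'content': [{
            'toolResult': {
                'toolUseId': tool_use['toolUseId'],
                'content': [{'json': {'price': price}}]
            }
        }]
    })

    return bedrock.converse(
        modelId=model_id,
        messages=messages,
        toolConfig={'tools': tools},
        inferenceConfig={'maxTokens': 2048}
    )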

3. Stream Response (Real-time Tokens)

Stream tokens as they're generated for lower perceived latency.

Streaming with Native API:

def stream_claude_response(prompt: str):
    """Stream response tokens in real-time"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    response = bedrock.invoke_model_with_response_stream(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        body=json.dumps(request_body)
    )

    # Process event stream
    stream = response['body']
    full_text = ""

    for event in stream:
        chunk = event.get('chunk')
        if chunk:
            chunk_obj = json.loads(chunk['bytes'].decode())

            if chunk_obj['type'] == 'content_block_delta':
                delta = chunk_obj['delta']
                if delta['type'] == 'text_delta':
                    text = delta['text']
                    print(text, end='', flush=True)
                    full_text += text

            elif chunk_obj['type'] == 'message_stop':
                print()  # New line at end

    return full_text

# Usage
response = stream_claude_response("Write a short story about a robot")

Streaming with Converse API:

def stream_converse(messages: list, model_id: str):
    """Stream response using Converse API"""
    bedrock = get_bedrock_client()

    response = bedrock.converse_stream(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 2048}
    )

    stream = response['stream']
    full_text = ""

    for event in stream:
        if 'contentBlockDelta' in event:
            delta = event['contentBlockDelta']['delta']
            if 'text' in delta:
                text = delta['text']
                print(text, end='', flush=True)
                full_text += text

        elif 'messageStop' in event:
            print()
            break

    return full_text

# Usage
messages = [{'role': 'user', 'content': [{'text': 'Explain neural networks'}]}]
stream_converse(messages, 'us.anthropic.claude-sonnet-4-5-20250929-v1:0')

Streaming with Error Handling:

def safe_streaming(prompt: str):
    """Streaming with comprehensive error handling"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}]
    }

    try:
        response = bedrock.invoke_model_with_response_stream(
            modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
            body=json.dumps(request_body)
        )

        full_text = ""
        for event in response['body']:
            chunk = event.get('chunk')
            if chunk:
                chunk_obj = json.loads(chunk['bytes'].decode())

                if chunk_obj['type'] == 'content_block_delta':
                    text = chunk_obj['delta'].get('text', '')
                    print(text, end='', flush=True)
                    full_text += text

                elif chunk_obj['type'] == 'error':
                    print(f"\nStreaming error: {chunk_obj['error']}")
                    break

        return full_text

    except Exception as e:
        print(f"Stream failed: {e}")
        raise

4. Count Tokens

Estimate token usage and costs before invoking models.

Counting Input Tokens:

The CountTokens API (available in recent boto3 releases) accepts the same message and system shapes as Converse and returns the input token count without invoking the model. A minimal sketch:

def count_input_tokens(messages: list, model_id: str):
    """Count input tokens for cost estimation (no model invocation)"""
    bedrock = get_bedrock_client()

    # Optional system prompts are counted as part of the request input
    system_prompts = [
        {'text': 'You are a helpful assistant.'}
    ]

    response = bedrock.count_tokens(
        modelId=model_id,
        input={
            'converse': {
                'messages': messages,
                'system': system_prompts
            }
        }
    )

    input_tokens = response['inputTokens']
    print(f"Input tokens: {input_tokens}")

    return input_tokens

# Usage
messages = [
    {'role': 'user', 'content': [{'text': 'This is a test message'}]}
]
tokens = count_input_tokens(messages, 'us.anthropic.claude-sonnet-4-5-20250929-v1:0')

Cost Estimation:

def estimate_cost(messages: list, model_id: str, estimated_output_tokens: int = 1000):
    """Estimate inference cost before invocation"""
    bedrock = get_bedrock_client()

    # Count input tokens with the CountTokens API
    token_response = bedrock.count_tokens(
        modelId=model_id,
        input={'converse': {'messages': messages}}
    )

    input_tokens = token_response['inputTokens']

    # Example per-token pricing; verify current rates on the AWS Bedrock
    # pricing page (prices vary by region and change over time)
    pricing = {
        'us.anthropic.claude-opus-4-5-20251101-v1:0': {
            'input': 15.00 / 1_000_000,   # $15 per 1M input tokens
            'output': 75.00 / 1_000_000   # $75 per 1M output tokens
        },
        'us.anthropic.claude-sonnet-4-5-20250929-v1:0': {
            'input': 3.00 / 1_000_000,
            'output': 15.00 / 1_000_000
        },
        'us.anthropic.claude-haiku-4-5-20251001-v1:0': {
            'input': 0.80 / 1_000_000,
            'output': 4.00 / 1_000_000
        }
    }

    if model_id in pricing:
        input_cost = input_tokens * pricing[model_id]['input']
        output_cost = estimated_output_tokens * pricing[model_id]['output']
        total_cost = input_cost + output_cost

        print(f"Input tokens: {input_tokens:,} (${input_cost:.6f})")
        print(f"Estimated output: {estimated_output_tokens:,} (${output_cost:.6f})")
        print(f"Estimated total: ${total_cost:.6f}")

        return {
            'input_tokens': input_tokens,
            'estimated_output_tokens': estimated_output_tokens,
            'input_cost': input_cost,
            'output_cost': output_cost,
            'total_cost': total_cost
        }
    else:
        print("Pricing not available for this model")
        return None
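Usage, reusing the messages shape from the token-counting example:

# Usage
messages = [
    {'role': 'user', 'content': [{'text': 'Summarize the benefits of streaming inference.'}]}
]
estimate = estimate_cost(messages, 'us.anthropic.claude-sonnet-4-5-20250929-v1:0')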

5. Async Invoke (Long-Running Tasks)

For inference tasks that take longer than 60 seconds (up to 24 hours).

Start Async Invocation:

def async_invoke_model(prompt: str, s3_output_uri: str):
    """Start async model invocation for long tasks"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    # StartAsyncInvoke takes the model input as a JSON document (a dict,
    # not a serialized string). Async invocation is only supported for
    # certain models; check the Bedrock docs for yours.
    response = bedrock.start_async_invoke(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        modelInput=request_body,
        outputDataConfig={
            's3OutputDataConfig': {
                's3Uri': s3_output_uri
            }
        }
    )

    invocation_arn = response['invocationArn']
    print(f"Async invocation started: {invocation_arn}")

    return invocation_arn

# Usage
s3_output = 's3://my-bucket/bedrock-outputs/result.json'
arn = async_invoke_model("Write a 10,000 word technical guide", s3_output)

Check Async Status:

def check_async_status(invocation_arn: str):
    """Check status of async invocation"""
    bedrock = get_bedrock_client()

    response = bedrock.get_async_invoke(
        invocationArn=invocation_arn
    )

    status = response['status']
    print(f"Status: {status}")

    if status == 'Completed':
        output_uri = response['outputDataConfig']['s3OutputDataConfig']['s3Uri']
        print(f"Output available at: {output_uri}")

        # Download and parse the result from S3
        # (see the download sketch below)

    elif status == 'Failed':
        print(f"Failure reason: {response.get('failureMessage', 'Unknown')}")

    return response

# Usage
status = check_async_status(arn)
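To retrieve a finished result, download it from the reported S3 location with a standard S3 client. A minimal sketch; note that the exact object key layout under the output prefix varies by model, so you may need to list the prefix first:

def download_async_result(output_uri: str):
    """Fetch and parse an async invocation result from S3"""
    s3 = boto3.client('s3')

    # Split 's3://bucket/key' into bucket and key
    bucket, _, key = output_uri.removeprefix('s3://').partition('/')

    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj['Body'].read())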

List Async Invocations:

def list_async_invocations(status_filter: Optional[str] = None):
    """List all async invocations"""
    bedrock = get_bedrock_client()

    params = {}
    if status_filter:
        params['statusEquals'] = status_filter  # 'InProgress', 'Completed', 'Failed'

    response = bedrock.list_async_invokes(**params)

    for invocation in response.get('asyncInvokeSummaries', []):
        print(f"ARN: {invocation['invocationArn']}")
        print(f"Status: {invocation['status']}")
        print(f"Submit time: {invocation['submitTime']}")
        print("---")

    return response

6. Apply Guardrail (Runtime Safety)

Apply content filtering and safety policies at runtime.

Invoke with Guardrail:

def invoke_with_guardrail(
    prompt: str,
    guardrail_id: str,
    guardrail_version: str = 'DRAFT'
):
    """Invoke model with runtime guardrail"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    response = bedrock.invoke_model(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        body=json.dumps(request_body),
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version
    )

    # Parse the body; when a guardrail intervenes, Bedrock adds an
    # 'amazon-bedrock-guardrailAction' field to the response body
    response_body = json.loads(response['body'].read())

    if response_body.get('amazon-bedrock-guardrailAction') == 'INTERVENED':
        print("Content blocked by guardrail")
        return None

    return response_body['content'][0]['text']

# Usage
result = invoke_with_guardrail(
    "Tell me about quantum computing",
    guardrail_id='abc123xyz',
    guardrail_version='1'
)

Converse with Guardrail:

def converse_with_guardrail(messages: list, guardrail_config: dict):
    """Converse API with guardrail configuration"""
    bedrock = get_bedrock_client()

    response = bedrock.converse(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        messages=messages,
        inferenceConfig={'maxTokens': 2048},
        guardrailConfig=guardrail_config
    )

    # Converse reports intervention via stopReason; the trace (when
    # enabled) carries the detailed assessments
    if response['stopReason'] == 'guardrail_intervened':
        print("Guardrail blocked content")
        print(response.get('trace', {}).get('guardrail', {}))

    return response

# Usage
guardrail_config = {
    'guardrailIdentifier': 'abc123xyz',
    'guardrailVersion': '1',
    'trace': 'enabled'
}

messages = [{'role': 'user', 'content': [{'text': 'Test message'}]}]
converse_with_guardrail(messages, guardrail_config)

Error Handling Patterns

Comprehensive Error Handling

from botocore.exceptions import ClientError, BotoCoreError
import time

def robust_invoke(prompt: str, max_retries: int = 3):
    """Invoke model with retry logic and error handling"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}]
    }

    for attempt in range(max_retries):
        try:
            response = bedrock.invoke_model(
                modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
                body=json.dumps(request_body)
            )

            response_body = json.loads(response['body'].read())
            return response_body['content'][0]['text']

        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'ThrottlingException':
                wait_time = (2 ** attempt) + 1  # Exponential backoff
                print(f"Throttled. Waiting {wait_time}s before retry {attempt + 1}/{max_retries}")
                time.sleep(wait_time)
                continue

            elif error_code == 'ModelTimeoutException':
                print("Model timeout - request took too long")
                if attempt < max_retries - 1:
                    time.sleep(2)
                    continue
                raise

            elif error_code == 'ModelErrorException':
                print("Model error - check input format")
                raise

            elif error_code == 'ValidationException':
                print("Invalid parameters")
                raise

            elif error_code == 'AccessDeniedException':
                print("Access denied - check IAM permissions and model access")
                raise

            elif error_code == 'ResourceNotFoundException':
                print("Model not found - check model ID")
                raise

            else:
                print(f"Unexpected error: {error_code}")
                raise

        except BotoCoreError as e:
            print(f"Connection error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
                continue
            raise

    raise Exception(f"Failed after {max_retries} attempts")

Specific Error Scenarios

def handle_model_errors():
    """Common error scenarios and solutions"""
    bedrock = get_bedrock_client()

    try:
        # Attempt invocation
        response = bedrock.invoke_model(
            modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": "test"}]
            })
        )

    except ClientError as e:
        error_code = e.response['Error']['Code']

        if error_code == 'ModelNotReadyException':
            # Model is still loading
            print("Model not ready, wait 30 seconds and retry")

        elif error_code == 'ServiceQuotaExceededException':
            # Hit service quota
            print("Exceeded quota - request increase or use different region")

        elif error_code == 'ModelStreamErrorException':
            # Error during streaming
            print("Stream interrupted - restart stream")

Best Practices

1. Cost Optimization

def cost_optimized_inference(prompt: str, tier: str = 'balanced'):
    """Choose a model based on task complexity and cost"""

    # Simple tasks → Haiku (cheapest)
    # Moderate tasks → Sonnet (balanced)
    # Complex tasks → Opus (most capable)

    if tier == 'simple':
        model_id = 'us.anthropic.claude-haiku-4-5-20251001-v1:0'
        print("Using Haiku for cost efficiency")
    elif tier == 'complex':
        model_id = 'global.anthropic.claude-opus-4-5-20251101-v1:0'
        print("Using Opus for maximum accuracy")
    else:
        model_id = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'
        print("Using Sonnet for balanced performance")

    return invoke_claude(prompt, model_id)

2. Use Inference Profiles

def use_inference_profiles():
    """Leverage inference profiles for availability and latency"""

    # Cross-region (global.) profiles route requests across regions for
    # higher availability and burst capacity; regional (us./apac.)
    # profiles keep traffic within one geography for lower latency

    profiles = {
        'global_opus': 'global.anthropic.claude-opus-4-5-20251101-v1:0',
        'us_sonnet': 'us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        'us_haiku': 'us.anthropic.claude-haiku-4-5-20251001-v1:0'
    }

    # Use a global profile for high availability
    # Use a regional profile for lower latency

    return profiles

3. Implement Caching

from functools import lru_cache
import hashlib

# Note: lru_cache is in-memory and per-process; use an external cache
# (e.g., Redis) for multi-instance deployments
@lru_cache(maxsize=100)
def cached_inference(prompt: str, model_id: str):
    """Cache responses for identical prompts"""
    return invoke_claude(prompt, model_id)

def cache_key(prompt: str) -> str:
    """Generate a stable key for an external cache"""
    return hashlib.sha256(prompt.encode()).hexdigest()

4. Monitor Token Usage

def track_token_usage(messages: list, model_id: str):
    """Track and log token usage"""
    bedrock = get_bedrock_client()

    # Count before invocation (CountTokens API)
    token_count = bedrock.count_tokens(
        modelId=model_id,
        input={'converse': {'messages': messages}}
    )

    input_estimate = token_count['inputTokens']
    print(f"Estimated input tokens: {input_estimate}")

    # Invoke
    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 2048}
    )

    # Actual usage reported by the service
    usage = response['usage']

    # Log to CloudWatch or a database
    print(f"Input: {usage['inputTokens']}, Output: {usage['outputTokens']}, Total: {usage['totalTokens']}")

    return response

5. Use Streaming for Better UX

def stream_for_user_experience(prompt: str):
    """Always use streaming for interactive applications"""

    # Streaming reduces perceived latency
    # Users see tokens immediately instead of waiting

    return stream_claude_response(prompt)

6. Async for Long Tasks

def use_async_for_batch(prompts: list, s3_bucket: str):
    """Use async invocation for batch processing"""

    invocation_arns = []

    for idx, prompt in enumerate(prompts):
        s3_uri = f's3://{s3_bucket}/outputs/result-{idx}.json'
        arn = async_invoke_model(prompt, s3_uri)
        invocation_arns.append(arn)

    return invocation_arns
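A simple way to wait on the batch is to poll get_async_invoke until every job leaves 'InProgress'. A sketch:

import time

def wait_for_batch(invocation_arns: list, poll_seconds: int = 30):
    """Poll async invocations until all complete or fail"""
    bedrock = get_bedrock_client()
    pending = set(invocation_arns)
    results = {}

    while pending:
        for arn in list(pending):
            status = bedrock.get_async_invoke(invocationArn=arn)['status']
            if status != 'InProgress':  # 'Completed' or 'Failed'
                results[arn] = status
                pending.discard(arn)
        if pending:
            time.sleep(poll_seconds)

    return results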

IAM Permissions

Minimum Runtime Permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.nova-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-*",
        "arn:aws:bedrock:*:*:inference-profile/*"
      ]
    }
  ]
}

Note: the Converse and ConverseStream APIs are authorized through bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream; there are no separate Converse IAM actions. Invoking through an inference profile requires access to both the inference-profile ARN and the underlying foundation-model ARNs.

With Async Invocation

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:InvokeModelAsync",
        "bedrock:GetAsyncInvoke",
        "bedrock:ListAsyncInvokes"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bedrock-bucket/*"
    }
  ]
}

Progressive Disclosure

Quick Start (This File)

  • Client initialization
  • Model IDs and inference profiles
  • Basic invocation (native and Converse API)
  • Streaming responses
  • Token counting
  • Async invocation
  • Guardrail application
  • Error handling patterns
  • Best practices

Detailed References


Related Skills

  • bedrock-agentcore: Build production AI agents with managed infrastructure
  • bedrock-guardrails: Configure content filters and safety policies
  • bedrock-knowledge-bases: RAG with vector stores and retrieval
  • bedrock-prompts: Manage and version prompts
  • anthropic-expert: Claude API patterns and best practices
  • claude-cost-optimization: Cost tracking and optimization for Claude
  • boto3-eks: For containerized Bedrock applications
