CrewAI vs Strands SDK: My 90-Day Production Comparison
Last year I got to do something rare: evaluate two competing agentic AI frameworks in a real production environment, on real financial data, with real consequences if something broke.
The context: a multi-agent system to automate legacy code conversion and reduce engineering onboarding from weeks to hours. I ran CrewAI and Strands SDK in parallel for 90 days on the same codebase. This is what I found.
Spoiler: they're not competing. They're complementary. But you need to know when to use which.
The Setup
The use case: a system of agents that could analyze a legacy COBOL/Java codebase, understand its data contracts, generate equivalent Python/PySpark, validate the output, and document the transformation.
This required:
- Long-running tasks (analysis could take 30+ minutes)
- Tool use (file system, SQL queries, AWS API calls)
- Agent memory across a multi-step workflow
- Graceful degradation when an agent got confused
CrewAI: Strengths
CrewAI's mental model is intuitive if you think in teams. You define agents with roles, goals, and backstories — then define tasks and let the crew coordinate.
```python
from crewai import Agent, Task, Crew

analyst = Agent(
    role='Legacy Code Analyst',
    goal='Understand COBOL data contracts and business logic',
    backstory='You are an expert in legacy financial systems...',
    tools=[file_reader, schema_extractor],
    verbose=True,
)

converter = Agent(
    role='Python Migration Engineer',
    goal='Convert legacy code to idiomatic PySpark',
    backstory='You write clean, tested, documented Python...',
    tools=[code_generator, test_runner],
)

analyze_task = Task(
    description='Analyze the COBOL program and extract all data contracts',
    agent=analyst,
    expected_output='JSON schema of all inputs/outputs',
)

convert_task = Task(
    description='Convert the analyzed program to PySpark',
    agent=converter,
    context=[analyze_task],
    expected_output='Tested PySpark module with docstrings',
)

crew = Crew(agents=[analyst, converter], tasks=[analyze_task, convert_task])
result = crew.kickoff()
```
What CrewAI does well:
- Role-based agent definition is natural and readable
- Task context passing works reliably — agents genuinely use outputs from prior tasks
- The process model (sequential vs hierarchical) is flexible enough for most workflows
- Strong community; lots of examples for data engineering use cases
Where it struggled:
- Long-running tasks (30+ min) would occasionally lose context in the middle
- Tool error handling required a lot of custom wrapping — uncaught tool exceptions could derail the entire crew
- Memory persistence across sessions required significant custom infrastructure
- The verbose output, while helpful for debugging, was noisy in production logs
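The tool error handling problem was the most painful of these. The fix we landed on was a thin decorator that every tool passes through before registration, so an exception becomes a string the LLM can reason about instead of a crash that derails the crew. A minimal sketch of the pattern (the `schema_extractor` below is an illustrative stand-in, not our production tool):

```python
import functools
import logging

def safe_tool(fn):
    """Wrap a tool so uncaught exceptions are logged and returned as an
    error message the agent can read, instead of crashing the crew."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logging.warning("tool %s failed: %s", fn.__name__, exc)
            # The agent sees this string and can retry or route around it.
            return f"TOOL_ERROR({fn.__name__}): {exc}"
    return wrapper

@safe_tool
def schema_extractor(path: str) -> dict:
    # Illustrative failure: the real tool parses COBOL copybooks.
    raise FileNotFoundError(path)
```

Returning the error as text rather than raising is the key design choice: the model gets a chance to self-correct, and the crew keeps moving.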
Strands SDK: Strengths
Strands SDK (Amazon's framework) takes a different philosophical position: agents are defined by their tools, not their roles. The tool definitions are first-class citizens.
```python
from strands import Agent, tool

@tool
def extract_schema(file_path: str) -> dict:
    """Extract data contract schema from a legacy file."""
    # implementation
    return schema

@tool
def generate_pyspark(schema: dict, business_logic: str) -> str:
    """Generate validated PySpark code from schema and logic description."""
    # implementation
    return pyspark_code

agent = Agent(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    tools=[extract_schema, generate_pyspark],
    system_prompt="You are a data migration specialist...",
)

result = agent("Convert the COBOL program at /path/to/file.cbl to PySpark")
```
What Strands does well:
- Tool definition via Python decorators is clean and type-safe
- AWS Bedrock integration is seamless — critical for our AWS-native stack
- Agents are stateless by default, which is actually a feature for our use case (idempotent runs)
- Error handling at the tool level is much cleaner — exceptions stay contained
- Streaming output works out of the box
Where it struggled:
- Multi-agent coordination requires more manual orchestration — no built-in "crew" concept
- Less community content (it's newer)
- The lack of built-in agent memory meant we had to build our own persistence layer
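Our persistence layer ended up being modest: since Strands agents are stateless, all we needed was a keyed store that each workflow step writes to and the next step reads from. A minimal sketch of that idea, assuming a JSON-on-disk store (the class name and layout are ours, not part of Strands):

```python
import json
from pathlib import Path

class AgentMemory:
    """Minimal persistence layer for stateless agents: each step's output
    is saved under a run id, so a later invocation can be given the full
    prior context as part of its prompt."""

    def __init__(self, store_dir: str = ".agent_memory"):
        self.dir = Path(store_dir)
        self.dir.mkdir(exist_ok=True)

    def _path(self, run_id: str) -> Path:
        return self.dir / f"{run_id}.json"

    def save(self, run_id: str, step: str, payload: dict) -> None:
        path = self._path(run_id)
        state = json.loads(path.read_text()) if path.exists() else {}
        state[step] = payload
        path.write_text(json.dumps(state, indent=2))

    def load(self, run_id: str) -> dict:
        path = self._path(run_id)
        return json.loads(path.read_text()) if path.exists() else {}
```

Statelessness plus an explicit store gave us the idempotency win without giving up multi-step context: re-running a failed step just reloads the run's state and continues.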
The Decision Framework
After 90 days, here's how I think about it:
| Use Case | Winner |
|----------|--------|
| Multi-agent workflows with clear roles | CrewAI |
| AWS-native, tool-heavy single agents | Strands |
| Long-running stateful analysis | CrewAI (with custom memory) |
| Idempotent, repeatable tasks | Strands |
| Rapid prototyping | CrewAI |
| Production reliability on AWS | Strands |
What We Actually Shipped
The answer: both. The outer orchestration layer uses CrewAI (a Crew of 4 agents: Analyzer, Converter, Validator, Documenter). Each individual agent, when it needs to call AWS services or execute tools reliably, uses a Strands agent internally.
Think of it as: CrewAI for the workflow, Strands for the tool execution.
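The wiring between the two layers is less magic than it sounds: a Strands agent is just a callable that takes a prompt and returns text, so it can be wrapped as an ordinary function and handed to CrewAI as a tool. A sketch of the adapter, with the Strands call abstracted behind a callable so the pattern is visible (all names here are illustrative, not our production code):

```python
from typing import Callable

def make_strands_tool(strands_agent: Callable[[str], str], name: str):
    """Adapt a Strands agent (a callable: prompt in, text out) into a
    plain function that CrewAI can register as a tool."""
    def tool_fn(prompt: str) -> str:
        # CrewAI sees a synchronous tool; the inner Strands agent owns
        # retries, Bedrock calls, and tool-level error containment.
        return strands_agent(prompt)
    tool_fn.__name__ = name
    return tool_fn

# In production the callable is a strands.Agent instance; a stub stands
# in here so the wiring is clear.
converter_tool = make_strands_tool(lambda p: f"converted:{p}", "pyspark_converter")
```

The payoff of this split is that each framework only does the job it's good at: CrewAI never touches AWS directly, and Strands never has to coordinate agents.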
The result: onboarding time dropped dramatically. New engineers describe what they need in plain English. The crew handles the rest.
What's Next
This architecture directly informed AgentFlow — my open-source framework for agentic ETL pipelines. The same CrewAI + Strands composition pattern, applied to data pipeline orchestration instead of code conversion.
If you're building multi-agent systems for data engineering, I'd start with CrewAI for the workflow design and layer in Strands (or LangChain tool definitions) for the heavy lifting. Don't pick one — understand what each does best.
Questions? Disagreements? I'm building AgentFlow in public and posting weekly on LinkedIn. Hit me there.