CrewAI vs Strands SDK: My 90-Day Production Comparison
Last year I got to do something rare: evaluate two competing agentic AI frameworks in a real production environment, on real financial data, with real consequences if something broke.
The context: a multi-agent system to automate legacy code conversion and reduce engineering onboarding from weeks to hours. I ran CrewAI and Strands SDK in parallel for 90 days on the same codebase. This is what I found.
Spoiler: they're not competing. They're complementary. But you need to know when to use which.
The Setup
The use case: a system of agents that could analyze a legacy COBOL/Java codebase, understand its data contracts, generate equivalent Python/PySpark, validate the output, and document the transformation.
This required:
- Long-running tasks (analysis could take 30+ minutes)
- Tool use (file system, SQL queries, AWS API calls)
- Agent memory across a multi-step workflow
- Graceful degradation when an agent got confused
CrewAI: Strengths
CrewAI's mental model is intuitive if you think in teams. You define agents with roles, goals, and backstories — then define tasks and let the crew coordinate.
```python
from crewai import Agent, Task, Crew

analyst = Agent(
    role='Legacy Code Analyst',
    goal='Understand COBOL data contracts and business logic',
    backstory='You are an expert in legacy financial systems...',
    tools=[file_reader, schema_extractor],
    verbose=True,
)

converter = Agent(
    role='Python Migration Engineer',
    goal='Convert legacy code to idiomatic PySpark',
    backstory='You write clean, tested, documented Python...',
    tools=[code_generator, test_runner],
)

analyze_task = Task(
    description='Analyze the COBOL program and extract all data contracts',
    agent=analyst,
    expected_output='JSON schema of all inputs/outputs',
)

convert_task = Task(
    description='Convert the analyzed program to PySpark',
    agent=converter,
    context=[analyze_task],
    expected_output='Tested PySpark module with docstrings',
)

crew = Crew(agents=[analyst, converter], tasks=[analyze_task, convert_task])
result = crew.kickoff()
```
What CrewAI does well:
- Role-based agent definition is natural and readable
- Task context passing works reliably — agents genuinely use outputs from prior tasks
- The process model (sequential vs hierarchical) is flexible enough for most workflows
- Strong community; lots of examples for data engineering use cases
Where it struggled:
- Long-running tasks (30+ min) would occasionally lose context in the middle
- Tool error handling required a lot of custom wrapping — uncaught tool exceptions could derail the entire crew
- Memory persistence across sessions required significant custom infrastructure
- The verbose output, while helpful for debugging, was noisy in production logs
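The tool error handling problem was the most painful of these. The fix we landed on was a thin decorator that every tool passes through before registration, so an exception becomes a string the LLM can reason about instead of a crash that derails the crew. A minimal sketch of the pattern (the `schema_extractor` below is an illustrative stand-in, not our production tool):

```python
import functools
import logging

def safe_tool(fn):
    """Wrap a tool so uncaught exceptions are logged and returned as an
    error message the agent can read, instead of crashing the crew."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logging.warning("tool %s failed: %s", fn.__name__, exc)
            # The agent sees this string and can retry or route around it.
            return f"TOOL_ERROR({fn.__name__}): {exc}"
    return wrapper

@safe_tool
def schema_extractor(path: str) -> dict:
    # Illustrative failure: the real tool parses COBOL copybooks.
    raise FileNotFoundError(path)
```

Returning the error as text rather than raising is the key design choice: the model gets a chance to self-correct, and the crew keeps moving.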
Strands SDK: Strengths
Strands SDK (Amazon's framework) takes a different philosophical position: agents are defined by their tools, not their roles. The tool definitions are first-class citizens.
```python
from strands import Agent, tool

@tool
def extract_schema(file_path: str) -> dict:
    """Extract data contract schema from a legacy file."""
    # implementation
    return schema

@tool
def generate_pyspark(schema: dict, business_logic: str) -> str:
    """Generate validated PySpark code from schema and logic description."""
    # implementation
    return pyspark_code

agent = Agent(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    tools=[extract_schema, generate_pyspark],
    system_prompt="You are a data migration specialist...",
)

result = agent("Convert the COBOL program at /path/to/file.cbl to PySpark")
```
What Strands does well:
- Tool definition via Python decorators is clean and type-safe
- AWS Bedrock integration is seamless — critical for our AWS-native stack
- Agents are stateless by default, which is actually a feature for our use case (idempotent runs)
- Error handling at the tool level is much cleaner — exceptions stay contained
- Streaming output works out of the box
Where it struggled:
- Multi-agent coordination requires more manual orchestration — no built-in "crew" concept
- Less community content (it's newer)
- The lack of built-in agent memory meant we had to build our own persistence layer
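Our persistence layer ended up being modest: since Strands agents are stateless, all we needed was a keyed store that each workflow step writes to and the next step reads from. A minimal sketch of that idea, assuming a JSON-on-disk store (the class name and layout are ours, not part of Strands):

```python
import json
from pathlib import Path

class AgentMemory:
    """Minimal persistence layer for stateless agents: each step's output
    is saved under a run id, so a later invocation can be given the full
    prior context as part of its prompt."""

    def __init__(self, store_dir: str = ".agent_memory"):
        self.dir = Path(store_dir)
        self.dir.mkdir(exist_ok=True)

    def _path(self, run_id: str) -> Path:
        return self.dir / f"{run_id}.json"

    def save(self, run_id: str, step: str, payload: dict) -> None:
        path = self._path(run_id)
        state = json.loads(path.read_text()) if path.exists() else {}
        state[step] = payload
        path.write_text(json.dumps(state, indent=2))

    def load(self, run_id: str) -> dict:
        path = self._path(run_id)
        return json.loads(path.read_text()) if path.exists() else {}
```

Statelessness plus an explicit store gave us the idempotency win without giving up multi-step context: re-running a failed step just reloads the run's state and continues.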
The Decision Framework
After 90 days, here's how I think about it:
| Use Case | Winner |
|----------|--------|
| Multi-agent workflows with clear roles | CrewAI |
| AWS-native, tool-heavy single agents | Strands |
| Long-running stateful analysis | CrewAI (with custom memory) |
| Idempotent, repeatable tasks | Strands |
| Rapid prototyping | CrewAI |
| Production reliability on AWS | Strands |
What We Actually Shipped
The answer: both. The outer orchestration layer uses CrewAI (a Crew of 4 agents: Analyzer, Converter, Validator, Documenter). Each individual agent, when it needs to call AWS services or execute tools reliably, uses a Strands agent internally.
Think of it as: CrewAI for the workflow, Strands for the tool execution.
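The wiring between the two layers is less magic than it sounds: a Strands agent is just a callable that takes a prompt and returns text, so it can be wrapped as an ordinary function and handed to CrewAI as a tool. A sketch of the adapter, with the Strands call abstracted behind a callable so the pattern is visible (all names here are illustrative, not our production code):

```python
from typing import Callable

def make_strands_tool(strands_agent: Callable[[str], str], name: str):
    """Adapt a Strands agent (a callable: prompt in, text out) into a
    plain function that CrewAI can register as a tool."""
    def tool_fn(prompt: str) -> str:
        # CrewAI sees a synchronous tool; the inner Strands agent owns
        # retries, Bedrock calls, and tool-level error containment.
        return strands_agent(prompt)
    tool_fn.__name__ = name
    return tool_fn

# In production the callable is a strands.Agent instance; a stub stands
# in here so the wiring is clear.
converter_tool = make_strands_tool(lambda p: f"converted:{p}", "pyspark_converter")
```

The payoff of this split is that each framework only does the job it's good at: CrewAI never touches AWS directly, and Strands never has to coordinate agents.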
The result: onboarding time dropped dramatically. New engineers describe what they need in plain English. The crew handles the rest.
What's Next
This architecture directly informed AgentFlow — my open-source framework for agentic ETL pipelines. The same CrewAI + Strands composition pattern, applied to data pipeline orchestration instead of code conversion.
If you're building multi-agent systems for data engineering, I'd start with CrewAI for the workflow design and layer in Strands (or LangChain tool definitions) for the heavy lifting. Don't pick one — understand what each does best.
Questions? Disagreements? I'm building AgentFlow in public and posting weekly on LinkedIn. Hit me there.