Let’s be honest: the two-phase commit (2PC) protocol has been the venerable staple of distributed transactions for decades. It’s reliable, it’s predictable, and—just like my grandpa’s fax machine—it blocks everything until it’s good and ready. In small clusters under perfect network conditions, 2PC works swimmingly. But in today’s world of global microservices, shaky networks, and 100-millisecond SLOs, those blocking global locks become a performance and availability nightmare.
Enter the new school of thought: eventual consistency, sagas, and Try-Confirm-Cancel (TCC). Instead of locking all your resources in limbo until every participant agrees, we let them work at their own pace, handle failures with compensations, and move on. No more “please wait, the coordinator is making tea.” Patterns like Saga and TCC trade strict atomicity for higher availability and resilience—crucial in highly distributed environments.
The Saga Pattern: Embracing Eventual Consistency
At its core, a saga is simply a long-lived transaction split into a sequence of local actions linked by compensating steps. Think of it like booking a multi-leg flight: you reserve your seat on each flight segment in turn, and if one leg falls through, you cancel the earlier reservations. No single global lock; just a chain of local commits and, if needed, local rollbacks.
Key properties of the Saga pattern:
• Sequence of local transactions, each with its own database.
• Compensating transaction for each step to undo side effects.
• No blocking of resources across microservices.
• Failure handling via compensation rather than global rollback.
Once you accept that some data might be “in flight” for a short time, you unlock enormous scalability gains. The trade-off is complex choreography or centralised coordination, plus the robust monitoring needed to spot and correct failed sagas.
Choreographed vs Orchestrated Sagas
Sagas come in two primary flavours—each with a distinct trade-off between coupling, observability, and complexity:
Choreographed Sagas
• Each service emits domain events when it completes its local transaction (e.g., OrderCreated, PaymentProcessed).
• Other services listen and react, coordinating the flow implicitly.
• Pros: High autonomy, loose coupling—each service owns its logic.
• Cons: Harder to trace end-to-end, potential event storm, error handling sprinkled across services.
Orchestrated Sagas
• A central orchestrator (or coordinator) drives the saga via explicit commands (e.g., “ChargePayment”, “ShipOrder”).
• The orchestrator tracks state, handles retries, and invokes compensations on failures.
• Pros: Clear end-to-end view, simpler error and retry logic.
• Cons: Orchestrator becomes a critical component (potential single point of failure), more coupling.
For example, in an e-commerce flow, a choreographed saga would have the Order service emit OrderCreated, the Payment service listen for that event and issue PaymentProcessed, and so on. An orchestrated approach would have a Saga Orchestrator that says:
1. Reserve inventory
2. Process payment
3. Confirm shipment
If step 2 fails, the orchestrator issues CancelInventory and, if any charge went through, a refund.
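The choreographed variant can be sketched with a toy in-memory event bus. The event names (OrderCreated, PaymentProcessed) come from the text above; the bus itself and the handlers are illustrative stand-ins for a real broker such as Kafka or RabbitMQ:

```python
from collections import defaultdict

# Minimal in-memory event bus standing in for a real message broker.
subscribers = defaultdict(list)
log = []  # records the event flow so we can observe the choreography

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    log.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)

# Each service listens for an upstream event and emits its own on success.
# There is no central coordinator: the flow emerges from the subscriptions.
subscribe("OrderCreated", lambda e: publish("PaymentProcessed", e))
subscribe("PaymentProcessed", lambda e: publish("OrderShipped", e))

publish("OrderCreated", {"order_id": 42})
print(log)  # ['OrderCreated', 'PaymentProcessed', 'OrderShipped']
```

Notice that nobody owns the end-to-end flow, which is exactly why tracing and error handling get harder as the chain grows.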
TCC: Try-Confirm-Cancel for Deterministic Rollbacks
While sagas rely on compensating transactions (that may not always perfectly undo side effects), Try-Confirm-Cancel (TCC) aims for a stricter two-phase workflow—minus the locking drawbacks of 2PC. Here’s how it works:
Try (Reserve) Phase
• Each service “reserves” the required resources (e.g., puts a hold on inventory) in an idempotent, provisional state.
Confirm (Commit) Phase
• When all participants have successfully reserved, confirm the provisional state and make the change permanent.
Cancel (Rollback) Phase
• If any reservation fails or the coordinator times out, cancel all provisional reservations.
TCC adds stronger consistency guarantees than sagas—you know that once Confirm is invoked, every participant has locked in the final state. But it requires long-lived provisional states, careful idempotency design (so repeats of Try or Confirm don’t corrupt data), and more complex error-handling logic (what if the Confirm call gets lost?).
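A single TCC participant might look like the sketch below. The class, its method names, and the in-memory state are all illustrative, not a specific library; the point is that every phase is idempotent, so a retried Try, Confirm, or Cancel never double-applies:

```python
class InventoryTCC:
    """Illustrative TCC participant for inventory holds (not a real API)."""

    def __init__(self, stock):
        self.stock = stock
        self.holds = {}        # order_id -> qty: the provisional state
        self.confirmed = set()

    def try_reserve(self, order_id, qty):
        # Idempotent Try: repeating it for the same order is a no-op.
        if order_id in self.holds or order_id in self.confirmed:
            return True
        if self.stock < qty:
            return False
        self.stock -= qty           # move stock into a provisional hold
        self.holds[order_id] = qty
        return True

    def confirm(self, order_id):
        # Idempotent Confirm: a lost-and-retried call must not double-commit.
        if order_id in self.holds:
            del self.holds[order_id]    # the hold becomes permanent
            self.confirmed.add(order_id)

    def cancel(self, order_id):
        # Idempotent Cancel: cancelling an unknown or already-cancelled
        # hold is safe (the "empty cancel" case when Try never arrived).
        qty = self.holds.pop(order_id, 0)
        self.stock += qty               # release the provisional hold
```

The awkward part TCC adds in practice is the lifetime of `holds`: provisional state must survive restarts and eventually time out, which is where the error-handling complexity mentioned above comes from.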
Beyond Saga and TCC: Hybrid Consensus and CRDT-Based Approaches
As microservices scale into the thousands, purely saga-or-TCC models can strain under complex compensation logic or resource-locking overhead. A growing body of research and practice shows that hybrid approaches using consensus algorithms (Raft/Paxos) or Conflict-Free Replicated Data Types (CRDTs) can offer tunable atomicity, fault tolerance, and partition tolerance:
• Consensus-Backed Sharded Transaction Managers
– Shard your data and employ local Raft groups to coordinate atomic writes within each shard.
– Use a two-step prepare/commit inside a single Raft cluster, minimising cross-shard locks.
– Inter-shard communication leverages asynchronous markers or lightweight locks.
• CRDT-Driven Workflows
– Model data replicas as CRDTs—structures that merge automatically and deterministically.
– For workflows, represent transactional state as CRDT sets or maps.
– On conflict, built-in merge logic resolves inconsistencies, eliminating the need for compensations in some domains.
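The merge behaviour described above is easiest to see in the simplest CRDT, a grow-only counter (G-Counter), sketched here as plain dicts mapping replica id to that replica's local count:

```python
def merge(a, b):
    """Merge two G-Counter states: element-wise max per replica.
    The merge is commutative, associative, and idempotent, so replicas
    converge to the same state no matter the order of exchanges."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

def value(counter):
    # The counter's value is the sum of all per-replica counts.
    return sum(counter.values())

# Two replicas increment independently while partitioned...
replica_a = {"a": 3, "b": 1}
replica_b = {"a": 2, "b": 4, "c": 1}

# ...and converge deterministically once they exchange state.
merged = merge(replica_a, replica_b)
print(value(merged))  # 8
```

Because the merge resolves conflicts by construction, there is nothing to compensate; that is the property the CRDT-driven workflow bullet above relies on.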
These advanced patterns aren’t trivial to implement from scratch—but they can bridge the gap between strict ACID and elastic BASE semantics when you need fine-grained control over consistency vs. availability trade-offs.
Putting Patterns into Practice: A Python-Orchestrated Saga Example
Below is a simplified orchestrator for a three-step order saga in Python. We’ll use HTTP calls to services, track state in memory, and provide compensations on failure.
import requests
import time

SERVICES = {
    "inventory": "http://localhost:8001",
    "payment": "http://localhost:8002",
    "shipping": "http://localhost:8003",
}

class SagaOrchestrator:
    def __init__(self, order_id):
        self.order_id = order_id
        # Each step is paired with the compensation that undoes it.
        self.steps = [
            ("reserve_inventory", self.compensate_inventory),
            ("process_payment", self.compensate_payment),
            ("arrange_shipping", self.compensate_shipping),
        ]
        self.completed = []

    def run(self):
        step = None
        try:
            for step, _ in self.steps:
                self._call_service(step)
                self.completed.append(step)
            print(f"Saga {self.order_id} completed successfully.")
        except Exception as e:
            print(f"Step {step} failed: {e}, rolling back.")
            self.rollback()

    def _call_service(self, action):
        # "reserve_inventory" routes to the inventory service, and so on.
        service = action.split("_")[-1]
        url = f"{SERVICES[service]}/{action}"
        resp = requests.post(url, json={"order_id": self.order_id}, timeout=2)
        if resp.status_code != 200:
            raise RuntimeError(f"{action} error: {resp.text}")
        time.sleep(0.1)  # simulate network latency

    def rollback(self):
        # Roll back in reverse order, skipping steps that never completed.
        for action, compensate in reversed(self.steps):
            if action in self.completed:
                try:
                    compensate()
                except Exception as e:
                    print(f"Compensation for {action} failed: {e}")

    def compensate_inventory(self):
        requests.post(f"{SERVICES['inventory']}/cancel_reservation",
                      json={"order_id": self.order_id})

    def compensate_payment(self):
        requests.post(f"{SERVICES['payment']}/refund",
                      json={"order_id": self.order_id})

    def compensate_shipping(self):
        requests.post(f"{SERVICES['shipping']}/cancel_shipment",
                      json={"order_id": self.order_id})

if __name__ == "__main__":
    saga = SagaOrchestrator(order_id=42)
    saga.run()
This simple orchestrator:
• Invokes each service in sequence
• Tracks completed steps
• Rolls back with compensations if any step fails
In production, you’d persist saga state, add retries with back-off, integrate distributed tracing, and ensure idempotency of all endpoints. But this toy example captures the essence of an orchestrated saga.
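One of those production concerns, retries with back-off, fits in a small helper. The function name, defaults, and blanket exception handling here are illustrative; a real version would retry only transient errors:

```python
import time

def call_with_backoff(fn, retries=3, base_delay=0.5):
    """Retry a flaky call with exponential back-off.
    The target endpoint must be idempotent, because the request
    may be delivered more than once."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the saga
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

In the orchestrator above, you would wrap each `self._call_service(step)` invocation with it, e.g. `call_with_backoff(lambda: self._call_service(step))`.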
Tooling the Transactional Journey
Thankfully, you don’t have to reinvent the wheel. A robust ecosystem of platforms and libraries now tackles the heavy lifting:
• Temporal (SDKs for Go, Java, TypeScript, Python, and .NET) – stateful workflow engine with durable execution, built-in retries, and observability.
• Dapr – provides a workflow building block for sagas, outbox patterns, and sidecar-based messaging.
• AWS Step Functions – serverless orchestrator with visual workflows and native integrations.
• Netflix Conductor – microservice orchestrator with pluggable persistence and metrics.
• Akka Persistence (Scala/Java) – actor-based state machines ideal for TCC and CRDT workflows.
• NoSQL Datastores (e.g., Cosmos DB, DynamoDB) – with transactional outbox/inbox patterns and change-data-capture (CDC) for event sourcing.
These toolkits differ in how much you can customise, integrate, and observe, so align your choice with your organisation’s ecosystem, skill set, and operational model.
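The transactional outbox pattern mentioned above is worth sketching, since it underpins reliable event publication in most of these stacks. Here sqlite3 stands in for the service’s own database; the table names and relay function are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id):
    # The business write and the event write share ONE local transaction,
    # so an event is never lost, and never published for a rolled-back order.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        conn.execute("INSERT INTO outbox (event, payload) VALUES (?, ?)",
                     ("OrderCreated", json.dumps({"order_id": order_id})))

def relay_outbox():
    # A separate poller (or CDC pipeline) pushes pending rows to the broker
    # and marks them published: at-least-once delivery, so consumers must
    # be idempotent.
    rows = conn.execute(
        "SELECT id, event, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, event, payload in rows:
        # broker.publish(event, payload) would go here
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return rows

create_order(42)
print(relay_outbox())  # [(1, 'OrderCreated', '{"order_id": 42}')]
```

This is the glue between a service’s local transaction and the saga events the rest of this article relies on.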
Bringing It All Together
We’ve come a long way since two-phase commit. Today’s patterns—sagas, TCC, hybrid consensus/CRDT approaches—let us balance consistency, availability, and performance in a way that 2PC simply can’t. By understanding the trade-offs:
• Saga: Nonblocking, compensation-driven workflows for high availability.
• TCC: Deterministic reserve/confirm/cancel for stronger consistency at a cost.
• Hybrid: Consensus or CRDTs for tunable atomicity and partition tolerance at scale.
—and by leveraging mature frameworks like Temporal, Dapr, and AWS Step Functions—we can build robust, resilient distributed transactions without resorting to heavyweight global locks.
Keep pushing the envelope, measure your SLAs, instrument your failures, and never stop tuning the consistency dial to match your domain requirements. The two-phase commit may not be dead, but it’s definitely on life support.
Stay curious, stay consistent—see you in tomorrow’s edition of The Backend Developers.
Keep shipping code that scales, and don’t forget to hit ‘reply’ with your thoughts!
— Your fellow code enthusiast,
The Backend Developers Team