
AWS SQS

The gym has a ticket window where members drop request slips: "fix the treadmill belt," "restock towels on rack 3," "the sauna bulb is out." Maintenance comes by every 20 minutes, takes the next slip off the pile, and works on it. The member who dropped the slip walks back to the squat rack without waiting. Maintenance never stares at the window wondering when the next request will arrive. Neither side blocks the other. SQS is that ticket window between two programs. One program drops a message. The other program picks it up when it is ready. This lesson hangs the window and sends ten tickets through it.

SQS stands for Simple Queue Service. It is one of the three original AWS services launched in 2006, alongside S3 and EC2. The idea is older than AWS by decades. IBM's MQSeries shipped in 1993 for mainframe shops that needed to pass messages between programs running in different buildings. Sun published the JMS specification in 1998, and Java application servers adopted it widely. ActiveMQ in 2004 and RabbitMQ in 2007 made open-source queues normal on Linux. What SQS added was "I don't want to run a queue server." You hand AWS a queue name, they hold the messages, store them redundantly across multiple Availability Zones, and charge roughly $0.40 per million requests for a standard queue. There are no servers to patch, no disks to grow, no restarts on Tuesday night.

Producers drop messages at the ticket window; consumers pick them up on their own schedule.

SQS has two queue types. A standard queue gives you nearly unlimited throughput but delivers messages at least once and only in best-effort order. A consumer can occasionally see the same message twice, whether because it was too slow to delete the message or because SQS itself delivered a duplicate copy, so consumer code must be idempotent (running the same message twice produces the same result as running it once). A FIFO queue keeps strict ordering inside a message group and provides exactly-once processing, but without high-throughput mode it is limited to 300 API calls per second per action (3,000 messages per second when sending in batches of 10). Standard is the default. Most systems use it.
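Idempotency can be sketched in a few lines. This is an illustration only, not part of the lesson's code: `apply_hand`, `balances`, and `seen_ids` are hypothetical names, and the in-memory set stands in for the durable store a real consumer would need.

```python
# Hypothetical sketch: make a handler idempotent by remembering message IDs.
# An in-memory set only works inside one process; it stands in for a durable store.

seen_ids = set()   # message IDs that have already been applied
balances = {}      # user_id -> running chip total


def apply_hand(message_id, hand):
    """Apply a hand's chip_delta once. Return True if applied, False if duplicate."""
    if message_id in seen_ids:
        return False
    balances[hand["user_id"]] = balances.get(hand["user_id"], 0) + hand["chip_delta"]
    seen_ids.add(message_id)
    return True


hand = {"user_id": "aarit", "chip_delta": 120}
first = apply_hand("msg-1", hand)    # first delivery is applied
second = apply_hand("msg-1", hand)   # duplicate delivery is ignored
print(first, second, balances)       # prints: True False {'aarit': 120}
```

The second call changes nothing, which is the property a standard-queue consumer has to guarantee.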

The lifecycle of a message is the other thing to know. A producer calls send_message and the message lands in the queue. A consumer calls receive_message and SQS hands over a batch of up to 10 messages, each with a receipt handle — a one-time token that identifies this delivery of this message. At that moment, SQS starts a visibility timeout clock. For the next N seconds (30 seconds by default, up to 12 hours) the message is hidden from every other consumer. If the consumer finishes the work and calls delete_message with the receipt handle, the message is gone for good. If the consumer crashes, hangs, or simply takes longer than the visibility timeout, the message reappears in the queue and another consumer gets a turn. This is how SQS makes "at least once" true: a message keeps coming back until somebody explicitly deletes it.
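The receive, hide, and delete cycle can be modeled with a toy in-memory queue. This is a sketch for intuition only and says nothing about how SQS is built: `ToyQueue`, its fake clock `now`, and its handle format are all invented for the example.

```python
# Toy in-memory model of the receive / hide / delete cycle.
# Not how SQS is implemented; the class and its fake clock are hypothetical.
import itertools


class ToyQueue:
    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.now = 0                     # fake clock, in seconds
        self.messages = {}               # message_id -> body, visible_at, handle
        self._ids = itertools.count(1)

    def send_message(self, body):
        message_id = next(self._ids)
        self.messages[message_id] = {"body": body, "visible_at": 0, "handle": None}
        return message_id

    def receive_message(self):
        """Return (handle, body) for the first visible message, or None."""
        for message_id, msg in self.messages.items():
            if msg["visible_at"] <= self.now:
                # Hide the message and issue a fresh handle for this delivery.
                msg["visible_at"] = self.now + self.visibility_timeout
                msg["handle"] = f"{message_id}-{self.now}"
                return msg["handle"], msg["body"]
        return None

    def delete_message(self, handle):
        """Remove the message whose latest delivery matches this handle."""
        for message_id, msg in list(self.messages.items()):
            if msg["handle"] == handle:
                del self.messages[message_id]
                return True
        return False


q = ToyQueue(visibility_timeout=30)
q.send_message("AsKs")

handle_1, body = q.receive_message()   # delivered; the timeout clock starts
hidden = q.receive_message()           # None: still inside the timeout
q.now = 31                             # consumer "crashed"; the timeout expired
handle_2, body = q.receive_message()   # same message, new receipt handle
q.delete_message(handle_2)             # acknowledged, gone for good
print(hidden, handle_1 != handle_2, len(q.messages))   # prints: None True 0
```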

A dead-letter queue (DLQ) catches the messages nobody can process. When you create the main queue, you attach a redrive policy: "if a message has been received more than N times, move it to this other queue." That other queue is the DLQ. It is where poison messages go to be inspected by a human instead of looping forever. Every well-run SQS system has a DLQ, an alarm on its depth, and a dashboard showing the failed messages. Without one, a single bad message can be redelivered again and again and chew through your consumers' capacity until it expires.
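The redrive rule itself is small enough to model directly. The sketch below is a hypothetical simplification: `deliver` and the dict-based message are invented for illustration, and the only behavior taken from SQS is that a message already received `maxReceiveCount` times goes to the DLQ on the next attempt.

```python
# Hypothetical sketch of the redrive rule: once a message has been received
# max_receive_count times, the next delivery attempt moves it to the DLQ.

def deliver(message, max_receive_count, dlq):
    """Attempt one delivery. Return the message, or None if it went to the DLQ."""
    if message["receive_count"] >= max_receive_count:
        dlq.append(message)
        return None
    message["receive_count"] += 1
    return message


dlq = []
poison = {"body": "chip_delta=POISON", "receive_count": 0}

deliveries = 0
while deliver(poison, max_receive_count=3, dlq=dlq) is not None:
    deliveries += 1            # the consumer fails and never deletes the message

print(deliveries, len(dlq))    # prints: 3 1
```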

Create both queues. The main queue is PokerHandsQueue, matching the hand shape from the DynamoDB lesson. The DLQ is PokerHandsDLQ. The Lambda lesson after this one consumes the same queue.

import boto3
import json
import time

region = "us-east-1"
queue_name = "PokerHandsQueue"
dlq_name = "PokerHandsDLQ"

sqs = boto3.client("sqs", region_name=region)

# Create the DLQ first: the main queue's redrive policy needs its ARN.
dlq_resp = sqs.create_queue(QueueName=dlq_name)
dlq_url = dlq_resp["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url,
    AttributeNames=["QueueArn"],
)["Attributes"]["QueueArn"]

# Queue attribute values are strings; RedrivePolicy is a JSON document in a string.
main_resp = sqs.create_queue(
    QueueName=queue_name,
    Attributes={
        "VisibilityTimeout": "30",
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "3",
        }),
    },
)
main_url = main_resp["QueueUrl"]

print(f"DLQ url:  {dlq_url}")
print(f"Main url: {main_url}")

The redrive policy says a message that is received and left unacknowledged 3 times gets moved to the DLQ. Visibility timeout is 30 seconds — plenty for a poker hand write. Run once to create, then cache the URLs for the rest of the session.
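If the cached URLs are lost, they can be looked up again by queue name rather than recreated. The sketch below relies on the SQS client's `get_queue_url` call; `lookup_urls` and the `FakeSqs` stand-in (which lets the example run without AWS credentials) are hypothetical helpers, and the example URL is a placeholder.

```python
# Sketch: resolve queue URLs from queue names.
# get_queue_url is an SQS client call; lookup_urls and FakeSqs are hypothetical.

def lookup_urls(sqs_client, names):
    """Return {queue_name: queue_url} for each name."""
    return {
        name: sqs_client.get_queue_url(QueueName=name)["QueueUrl"]
        for name in names
    }


class FakeSqs:
    """Stand-in client so the sketch runs without AWS credentials."""

    def get_queue_url(self, QueueName):
        return {"QueueUrl": f"https://sqs.example.invalid/000000000000/{QueueName}"}


urls = lookup_urls(FakeSqs(), ["PokerHandsQueue", "PokerHandsDLQ"])
print(urls["PokerHandsQueue"])
```

With a real session, pass the boto3 SQS client in place of `FakeSqs()`.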

Produce 10 messages. Mix in one poison message with an invalid chip_delta (a string instead of a number) so the consumer will fail to process it.

hands = [
    {"user_id": "aarit",  "hand_timestamp": 1_713_200_000, "hole_cards": "AsKs", "chip_delta": 120,  "result": "won"},
    {"user_id": "aarit",  "hand_timestamp": 1_713_200_300, "hole_cards": "7h2d", "chip_delta": -40,  "result": "folded"},
    {"user_id": "aarit",  "hand_timestamp": 1_713_200_600, "hole_cards": "QdQh", "chip_delta": 75,   "result": "won"},
    {"user_id": "aditya", "hand_timestamp": 1_713_200_150, "hole_cards": "JsTs", "chip_delta": -20,  "result": "lost"},
    {"user_id": "aditya", "hand_timestamp": 1_713_200_450, "hole_cards": "AhAs", "chip_delta": 240,  "result": "won"},
    {"user_id": "aarit",  "hand_timestamp": 1_713_200_900, "hole_cards": "5c6c", "chip_delta": -15,  "result": "folded"},
    {"user_id": "aditya", "hand_timestamp": 1_713_201_000, "hole_cards": "KhQh", "chip_delta": 60,   "result": "won"},
    {"user_id": "aarit",  "hand_timestamp": 1_713_201_200, "hole_cards": "9d9s", "chip_delta": 35,   "result": "won"},
    {"user_id": "aditya", "hand_timestamp": 1_713_201_400, "hole_cards": "3h4h", "chip_delta": -10,  "result": "folded"},
    {"user_id": "aarit",  "hand_timestamp": 1_713_201_600, "hole_cards": "TsJs", "chip_delta": "POISON", "result": "won"},
]
 
for hand in hands:
    resp = sqs.send_message(
        QueueUrl=main_url,
        MessageBody=json.dumps(hand),
    )
    print(f"sent {hand['hole_cards']:<6} message_id={resp['MessageId'][:8]}...")

Ten calls, ten message IDs. SQS assigns the ID — it is globally unique and independent of the body. You never reuse it.
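Ten messages can also travel in one request, because `send_message_batch` accepts up to 10 entries per call. The sketch below only builds the entries; `batch_entries` is a hypothetical helper, and the commented-out call assumes a real client and queue URL.

```python
# Sketch: group message bodies into send_message_batch entries (max 10 per call).
# batch_entries is a hypothetical helper, not part of boto3.
import json


def batch_entries(items, batch_size=10):
    """Yield lists of batch entries, at most batch_size per list."""
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        yield [
            # Id only has to be unique within a single batch request.
            {"Id": str(start + offset), "MessageBody": json.dumps(item)}
            for offset, item in enumerate(chunk)
        ]


sample = [{"hole_cards": f"hand{i}", "chip_delta": i} for i in range(12)]
batches = list(batch_entries(sample))
print([len(b) for b in batches])   # prints: [10, 2]

# With a real client (not run here):
# for entries in batch_entries(sample):
#     resp = sqs.send_message_batch(QueueUrl=main_url, Entries=entries)
#     failed = resp.get("Failed", [])   # entries SQS rejected, if any
```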

sent AsKs   message_id=3f9c1a07...
sent 7h2d   message_id=a1b8e3d4...
sent QdQh   message_id=7c2f91e5...
...
sent TsJs   message_id=9e4d6c12...

Consume the queue. Accept a hand only if chip_delta is a real integer. The poison message fails that check, so do not delete it — let the visibility timeout expire and let SQS hand it back.

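processed = 0
empty_polls = 0
max_empty_polls = 8  # 8 empty 5-second long polls outlast the 30-second visibility timeout
while processed < 20:
    resp = sqs.receive_message(
        QueueUrl=main_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
        AttributeNames=["ApproximateReceiveCount"],
    )
    messages = resp.get("Messages", [])
    if not messages:
        # An empty poll may only mean the poison message is still hidden,
        # so keep polling until the queue has stayed empty long enough.
        empty_polls += 1
        if empty_polls >= max_empty_polls:
            print("queue empty, stopping")
            break
        continue
    empty_polls = 0
    for msg in messages:
        body = json.loads(msg["Body"])
        delivery = msg["Attributes"]["ApproximateReceiveCount"]
        handle_preview = msg["ReceiptHandle"][:16]
        try:
            delta = body["chip_delta"]
            if not isinstance(delta, int):
                raise ValueError(f"chip_delta must be int, got {type(delta).__name__}")
            print(f"  ok  {body['hole_cards']:<6} delivery={delivery} receipt={handle_preview}...")
            # Deleting is the acknowledgement; without it the message comes back.
            sqs.delete_message(QueueUrl=main_url, ReceiptHandle=msg["ReceiptHandle"])
        except (ValueError, KeyError) as exc:
            # No delete here: the message reappears after the visibility timeout.
            print(f"  FAIL {body.get('hole_cards','?'):<6} delivery={delivery} reason={exc}")
        processed += 1
    time.sleep(2)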

MaxNumberOfMessages=10 asks for a batch of up to 10; SQS may return fewer. WaitTimeSeconds=5 enables long polling: if the queue is empty, SQS holds the request open for up to 5 seconds before answering, which adds no extra charge and avoids a tight loop of empty responses. ApproximateReceiveCount is the counter SQS increments every time it hands out this message. Once that count has reached the redrive threshold (3 in our policy), SQS moves the message to the DLQ instead of delivering it again.

The poison ticket cycles through the main queue three times, then SQS moves it to the DLQ.

Run the consumer. The first pass shows the 9 valid hands getting processed and the poison message failing. Wait 30 seconds (one visibility timeout) and the poison message comes back with a delivery count of 2. Wait again for count 3. On the fourth delivery attempt, SQS moves it to the DLQ.

  ok  AsKs   delivery=1 receipt=AQEBn7HGRzTx...
  ok  7h2d   delivery=1 receipt=AQEBTxdP9kLq...
  ok  QdQh   delivery=1 receipt=AQEBmNp2oE8v...
  ok  JsTs   delivery=1 receipt=AQEB1fJz04qa...
  ok  AhAs   delivery=1 receipt=AQEBP7a2eWxR...
  ok  5c6c   delivery=1 receipt=AQEBu3L8cN2m...
  ok  KhQh   delivery=1 receipt=AQEBk5D1vH7p...
  ok  9d9s   delivery=1 receipt=AQEBy0W9tR4s...
  ok  3h4h   delivery=1 receipt=AQEBq8M6fJ1x...
  FAIL TsJs   delivery=1 reason=chip_delta must be int, got str
  FAIL TsJs   delivery=2 reason=chip_delta must be int, got str
  FAIL TsJs   delivery=3 reason=chip_delta must be int, got str
queue empty, stopping

The receipt handle is a long opaque string (SQS allows up to 1,024 characters) that is tied to one specific delivery of a message, not to the message itself. Only a 16-character preview is printed so the output stays readable. The delivery count climbs from 1 to 3 across three deliveries of the TsJs message; when a fourth delivery would be due, SQS routes it to the DLQ instead.

Confirm the poison landed in the DLQ.

dlq_resp = sqs.receive_message(
    QueueUrl=dlq_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,  # long poll so a short-poll sample cannot miss the message
)
dlq_messages = dlq_resp.get("Messages", [])
print(f"\nDLQ depth: {len(dlq_messages)}")
for msg in dlq_messages:
    body = json.loads(msg["Body"])
    print(f"  dead: {body['hole_cards']}  user={body['user_id']}  delta={body['chip_delta']!r}")

DLQ depth: 1
  dead: TsJs  user=aarit  delta='POISON'

One dead message, isolated from the live queue, waiting for a human to look at it. In production, a CloudWatch alarm on DLQ depth that fires when the count rises above zero is how the on-call engineer gets paged.
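Such an alarm could be defined roughly as follows. This is a sketch under assumptions: the alarm name, the 300-second period, and the helper functions are illustrative choices, and the alarm as written notifies nobody until an `AlarmActions` entry (for example an SNS topic ARN) is added. `create_dlq_alarm` needs AWS credentials and is not called here.

```python
# Sketch: a CloudWatch alarm that fires when the DLQ holds any visible message.
# Helper names, alarm name, and period are assumptions for illustration.

def dlq_alarm_params(queue_name, alarm_name="PokerHandsDLQ-not-empty"):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_dlq_alarm(queue_name, region="us-east-1"):
    """Create the alarm. Needs AWS credentials; not called in this sketch."""
    import boto3  # imported here so the builder above runs without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**dlq_alarm_params(queue_name))


params = dlq_alarm_params("PokerHandsDLQ")
print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])
```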

Clean up both queues with sqs.delete_queue(QueueUrl=...) when you finish.
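The cleanup step might look like the sketch below. `delete_queues` and `FakeSqs` are hypothetical helpers (the fake lets the example run without AWS), and the URLs are placeholders. Deleting a queue also discards any messages still in it, and SQS requires a wait of at least 60 seconds before a queue with the same name can be created again.

```python
# Sketch: delete both queues through an SQS client's delete_queue call.
# delete_queues and FakeSqs are hypothetical; the URLs are placeholders.

def delete_queues(sqs_client, queue_urls):
    """Delete each queue and return the URLs that were deleted."""
    deleted = []
    for url in queue_urls:
        sqs_client.delete_queue(QueueUrl=url)
        deleted.append(url)
    return deleted


class FakeSqs:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def delete_queue(self, QueueUrl):
        self.calls.append(QueueUrl)
        return {}


fake = FakeSqs()
deleted = delete_queues(
    fake,
    ["https://sqs.example.invalid/main", "https://sqs.example.invalid/dlq"],
)
print(len(deleted))   # prints: 2
```

In the lesson's session the call would be `delete_queues(sqs, [main_url, dlq_url])`.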

A queue decouples producers from consumers. It does not consume anything on its own. Your laptop can run the consumer for an afternoon, but a real service needs something that runs 24/7 and scales up when the queue grows. The next lesson deploys a Python Lambda that reads from this same PokerHandsQueue on its own.