Defusing Refund Rage: A Prompt System for Boutique Stores That Won't Sound Like a Robot
A field-tested prompt stack that turns furious refund emails into calm, on-brand replies in under a minute — without the dead-eyed corporate apology voice.
Refund emails sit on a different shelf than the rest of your inbox. They arrive in caps lock at 11:47 PM, they assume bad faith, and they expect you to apologise for the postal service, the fit model, and the weather. Handling them well is one of the highest-leverage things a small store can automate. Handling them badly is one of the fastest ways to earn a one-star review and a screenshot on Reddit.
I've been running prompt-driven customer service for boutique apparel and home goods brands for two years. What follows is the exact system I hand to founders — the structure, the prompt, the guardrails, and the bits that newer practitioners almost always get wrong.
Why most 'polite email' prompts fail
Asking ChatGPT to 'write a polite refund reply' produces something that reads like a hostage statement from a Fortune 500 PR team. You know the voice: 'We sincerely apologise for any inconvenience this may have caused. Your satisfaction is our top priority.' Nobody talks like that. Nobody believes it. And in a boutique context, where customers chose you specifically because you weren't Amazon, it actively breaks trust.
The fix is structural, not stylistic. You need to **separate diagnosis from response**, give the model a small catalog of voice samples to mirror, and explicitly ban the phrases that make replies sound auto-generated. Skip any of those three and you'll get bland output no matter how clever your tone instructions are.
The two-stage architecture
Stage one is a classifier. It reads the inbound email and returns a small JSON object: anger level, refund eligibility, missing information, and whether escalation is needed. Stage two takes that JSON plus the original email and drafts a reply. Two prompts, two model calls, total cost roughly a third of a cent per email on a mid-tier model.
Why bother splitting it? Because the moment you ask one prompt to both judge and respond, the model hedges. It softens the diagnosis to justify a softer reply, or hardens the reply to match a harsh diagnosis. Decoupling them gives you two sharper outputs and — more importantly — a clean handoff point where a human can review before anything goes out.
What goes in each stage
| Stage | Input | Output | Best for / Nuance |
|---|---|---|---|
| 1. Classifier | Raw customer email + order metadata | JSON: anger (1-5), refund_status, missing_info[], escalate (bool) | Run this even on calm emails — the anger score lets you sort the queue by who actually needs a human first. |
| 2. Drafter | Classifier JSON + original email + brand voice samples | A ready-to-send reply in your store's voice | Best for: anger 1-3. Always route anger 4-5 through a human before sending — the cost of one bad auto-reply outweighs a week of saved time. |
| 3. (Optional) Tone audit | Drafted reply | Pass / flag with reason | Nuance: this is overkill below ~50 tickets a week. Above that, it catches the one reply in 30 that quietly slips into corporate voice. |
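The routing rules in that table are simple enough to pin down in a few lines. Here's a minimal sketch, assuming the classifier output has already been parsed; the `Ticket` class and the queue names (`human_only`, `human_review`, `draft_and_glance`) are made up for illustration and aren't part of any help-desk tool.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    anger: int      # 1-5, from the classifier
    escalate: bool  # from the classifier


def route(ticket: Ticket) -> str:
    """Return the (hypothetical) queue a ticket belongs in."""
    if ticket.escalate:
        return "human_only"        # no auto-draft at all
    if ticket.anger >= 4:
        return "human_review"      # a person approves before anything is sent
    return "draft_and_glance"      # anger 1-3: draft, quick eyeball, send


def sort_queue(tickets: list[Ticket]) -> list[Ticket]:
    """Escalations first, then angriest first."""
    return sorted(tickets, key=lambda t: (not t.escalate, -t.anger))


queue = sort_queue([
    Ticket("A", anger=2, escalate=False),
    Ticket("B", anger=5, escalate=False),
    Ticket("C", anger=3, escalate=True),
])
for t in queue:
    print(t.ticket_id, route(t))
```

Sorting by anger is what makes the classifier worth running on calm emails too: the person working the inbox sees the tickets that need a human at the top.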
The production prompt
This is the stage-two drafter — the one that actually writes the reply. The classifier prompt is shorter and I'll show it after. Drop this into ChatGPT, Claude, or your help-desk integration. The variables in curly braces are what your help-desk macro or Zapier step fills in.
ROLE
You are the customer service voice of {{BRAND_NAME}}, a small independent {{PRODUCT_CATEGORY}} brand. You are NOT a corporate support agent. You are a real person who cares about the product and the customer, and you've been on the other end of a bad delivery yourself.
INPUTS
- Original customer email:
"""
{{CUSTOMER_EMAIL}}
"""
- Classifier output:
{{CLASSIFIER_JSON}}
- Order facts (only state these if relevant):
Order #: {{ORDER_NUMBER}}
Item: {{ITEM_NAME}}
Ordered: {{ORDER_DATE}}
Shipped: {{SHIP_DATE}}
Return window status: {{RETURN_STATUS}}
- Voice samples (mirror the rhythm, not the wording):
"""
{{VOICE_SAMPLE_1}}
{{VOICE_SAMPLE_2}}
"""
TASK
Write a reply email of 90-160 words.
STRUCTURE (in this order, no headings)
1. Acknowledge the specific frustration in their own words (not paraphrased into corporate speak).
2. Take responsibility for the part that's actually our fault. Do not take responsibility for things outside our control (carrier delays, weather) — name them honestly instead.
3. State the concrete next action and the timeline. Be specific: 'refund hits your card in 3-5 business days' not 'we will process this shortly'.
4. Offer ONE small gesture if anger >= 3 (free shipping on next order, 15% code, hand-written note). Never offer cash or extra refunds without human approval.
5. Sign off with a real first name from this list: {{SIGNER_NAMES}}.
HARD RULES — DO NOT VIOLATE
- Banned phrases (do not use, do not paraphrase): 'we sincerely apologise', 'we apologise for any inconvenience', 'your satisfaction is our top priority', 'rest assured', 'kindly', 'as per our policy', 'unfortunately', 'we understand your frustration'.
- Banned structures: no bullet points, no numbered lists, no headers, no emojis.
- Sentence length: mix short (4-8 words) and medium (12-20 words). No sentence over 28 words.
- Contractions ON. 'We're', 'you'll', 'didn't'. Always.
- If the classifier flagged escalate=true OR refund_status='ineligible', DO NOT draft a final reply. Instead output: 'ESCALATE — reason: <one sentence>' and stop.
OUTPUT
Reply email body only. No subject line, no signature block beyond the first name.
The classifier (stage one)
This one's smaller and almost never needs tweaking once it's working. The JSON shape is what matters — keep it strict so your downstream automation doesn't break.
You are a triage classifier for a small ecommerce store's support inbox.
Read the customer email below and return ONLY a JSON object, no prose.
Email:
"""
{{CUSTOMER_EMAIL}}
"""
Order metadata:
{{ORDER_METADATA}}
Return this exact shape:
{
"anger": <integer 1-5, where 1 is neutral and 5 is hostile>,
"refund_status": "eligible" | "ineligible" | "needs_review",
"missing_info": [<list of fields the customer didn't provide that you'd need to act, e.g. 'order_number', 'photo_of_damage'>],
"escalate": <true if any of: legal threat, chargeback mentioned, accessibility complaint, allegation of discrimination, third refund request from same customer>,
"summary": "<one sentence, max 20 words, in neutral voice>"
}
Tweaking the prompt for your brand
Three knobs do 90% of the work. The rest is fiddling.
1. Voice samples beat voice descriptions
Don't write 'warm, friendly, slightly sarcastic tone'. Models interpret those words wildly differently. Instead, paste two or three actual replies your founder has sent on a good day — typos preserved, em-dashes intact. The model will pattern-match the rhythm. If your founder doesn't write customer emails, mine the brand's Instagram captions. Same voice, shorter form.
2. Ban phrases, don't request tone
Every banned phrase in the prompt above is there because I caught it in production output. The list grows over time — keep a running note and add to the prompt monthly. 'Unfortunately' is the sneakiest one. It feels neutral but it primes every sentence after it to be a deflection.
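A banned list in the prompt is a request, not a guarantee, so it can help to check drafts in code before a human ever sees them. What follows is a rough sketch of such a check, not the tone-audit prompt from stage three: it reuses the banned phrases and the length rules from the drafter prompt, and the function name `audit_reply` is my own. It only catches exact wording, not paraphrases.

```python
import re

BANNED_PHRASES = [
    "we sincerely apologise",
    "we apologise for any inconvenience",
    "your satisfaction is our top priority",
    "rest assured",
    "kindly",
    "as per our policy",
    "unfortunately",
    "we understand your frustration",
]

MAX_SENTENCE_WORDS = 28
WORD_RANGE = (90, 160)


def audit_reply(reply: str) -> list[str]:
    """Return a list of rule violations; an empty list means the draft passes."""
    problems = []
    lowered = reply.lower()

    for phrase in BANNED_PHRASES:
        # Word boundaries so 'kindly' does not match inside 'unkindly'.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            problems.append(f"banned phrase: {phrase}")

    word_count = len(reply.split())
    if not WORD_RANGE[0] <= word_count <= WORD_RANGE[1]:
        problems.append(
            f"word count {word_count} outside {WORD_RANGE[0]}-{WORD_RANGE[1]}"
        )

    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", reply.strip()):
        if len(sentence.split()) > MAX_SENTENCE_WORDS:
            problems.append(f"sentence over {MAX_SENTENCE_WORDS} words: {sentence[:40]}...")

    # A bullet or numbered item at the start of any line.
    if re.search(r"^\s*(?:[-*•]|\d+[.)])\s+", reply, flags=re.MULTILINE):
        problems.append("contains a bullet or numbered list")

    return problems


draft = "Unfortunately the parcel is late. Rest assured we'll fix it."
for problem in audit_reply(draft):
    print(problem)
```

When a new offender turns up in production, add it to `BANNED_PHRASES` at the same time as you add it to the prompt, so the two lists don't drift apart.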
3. The escalation guard is non-negotiable
The single line that says 'if escalate=true, stop and flag' is the difference between a system you can sleep at night with and one that will, eventually, send an auto-reply to a customer who mentioned a lawyer. Test this branch first. Send a fake email containing the word 'attorney' and confirm the system flags it. Do this every time you change the classifier.
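The prompt-level guard can be backed up in code, so the drafter never gets called on a ticket it shouldn't touch. This is a sketch under my own assumptions: `parse_classifier` and `must_escalate` are hypothetical helper names, and treating unparseable classifier output as an escalation is a design choice of this sketch rather than something the prompts require. It doesn't replace the live test with a fake 'attorney' email, which is the only way to exercise the model itself.

```python
import json

ALLOWED_REFUND_STATUS = {"eligible", "ineligible", "needs_review"}


def parse_classifier(raw: str) -> dict:
    """Parse the stage-one output strictly; raise ValueError on any surprise."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError, raised on prose
    if not isinstance(data, dict):
        raise ValueError("classifier output is not a JSON object")
    anger = data.get("anger")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not (isinstance(anger, int) and not isinstance(anger, bool) and 1 <= anger <= 5):
        raise ValueError("anger must be an integer from 1 to 5")
    if data.get("refund_status") not in ALLOWED_REFUND_STATUS:
        raise ValueError("unexpected refund_status")
    if not isinstance(data.get("missing_info"), list):
        raise ValueError("missing_info must be a list")
    if not isinstance(data.get("escalate"), bool):
        raise ValueError("escalate must be true or false")
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    return data


def must_escalate(raw: str) -> bool:
    """True when the drafter must not run. Unparseable output also escalates."""
    try:
        data = parse_classifier(raw)
    except ValueError:
        return True  # fail closed: a broken classifier reply goes to a human
    return data["escalate"] or data["refund_status"] == "ineligible"


# A canned classifier reply, standing in for the fake 'attorney' email test.
canned = (
    '{"anger": 4, "refund_status": "eligible", "missing_info": [], '
    '"escalate": true, "summary": "Customer mentions an attorney."}'
)
print(must_escalate(canned))
```

Failing closed costs you a few extra manual replies when the classifier misbehaves, which is a cheaper mistake than the opposite one.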
What the output looks like in practice
A real example. Customer ordered a linen dress, it arrived with a snag, she emailed at midnight calling the QC team 'a joke'. Classifier returned anger=4, refund_status=eligible, escalate=false. Here's what the drafter produced — published unchanged, signed by the founder:
Notice what's not in there. No 'we sincerely apologise'. No 'your satisfaction is our top priority'. No bullet list of next steps. It reads like Mia wrote it on her phone, because the prompt was tuned to make it read like Mia wrote it on her phone.
Where this fits in your stack
If you're running Shopify with Gorgias, Re:amaze, or Front, both prompts can be wired in as macros or Rules that call the OpenAI/Anthropic API on inbound emails. If you're running a one-person operation out of Gmail, a single Zapier zap with two GPT-4-class steps gets you 80% of the value. The classifier output becomes a label, the drafter output becomes a draft reply you eyeball before sending. That eyeball pass takes about 8 seconds per email and catches the rare weird one.
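Whichever tool does the wiring, something has to fill the double-curly-brace variables before the prompt reaches the model, and a half-filled prompt is an easy way to get a strange reply. If you're scripting that step yourself, a minimal sketch might look like this; `fill_template` is a hypothetical helper, and the brand name and order number are invented sample values.

```python
import re

# Matches placeholders such as {{BRAND_NAME}} or {{VOICE_SAMPLE_1}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} placeholder; raise KeyError if a value is missing."""
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template) if name not in values}
    )
    if missing:
        raise KeyError(f"no value for: {', '.join(missing)}")
    # A function replacement avoids backslash surprises in customer text,
    # and substituted text is not scanned again for placeholders.
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


excerpt = (
    "You are the customer service voice of {{BRAND_NAME}}.\n"
    "Order #: {{ORDER_NUMBER}}"
)
print(fill_template(excerpt, {"BRAND_NAME": "Example Linen Co", "ORDER_NUMBER": "1001"}))
```

Raising on a missing value means a broken integration shows up as an error in your automation log instead of as a draft addressed from '{{BRAND_NAME}}'.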
For more on building reusable prompt assets like this, see our earlier walkthrough on the client onboarding email prompt — same structural ideas, gentler use case.
Frequently asked questions
- Yes, if you let the system send unsupervised. No, if you treat it as a drafter. The version I run for clients never auto-sends an anger-4 or anger-5 reply, never auto-sends on an escalation flag, and never auto-sends another refund to a customer who's already had one. That covers roughly the 6% of emails that can actually hurt you. The other 94% ('where is my order', 'wrong size', 'changed my mind') are fine to draft and send with a human glance. The real failure mode isn't the system getting one wrong; it's the founder getting bored and turning off the human-in-the-loop after a quiet month.
Written by
Dani
AI Workflow Explorer
Dani writes SoloPrompt AI — a working notebook of copy-paste prompts, low-code automations, and field-tested workflows for solo operators. Equal parts skeptic and tinkerer, Dani road-tests every prompt against real micro-business problems before it ships.