
Prompt engineering has evolved from simple “give me a summary” requests into a discipline that can squeeze extra intelligence, consistency, and safety out of large language models (LLMs). Below you will find three advanced families of techniques—chain-of-thought reasoning, few-shot scaffolding, and systematic prompt optimization—with concrete patterns, code snippets, and trade-offs you can apply tomorrow in production systems.
The core idea of chain-of-thought (CoT) prompting is to elicit a trace of intermediate reasoning before the final answer. This mimics how humans solve multi-step problems and has been shown to improve accuracy on arithmetic, logic, and scientific reasoning tasks.
In its zero-shot form, no examples are required; you simply append an instruction that forces the model to think aloud.
import openai  # legacy (<1.0) openai client; newer versions expose a different interface

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that explains your reasoning before answering."},
        {"role": "user", "content": "A train leaves Chicago heading west at 60 mph. Two hours later a second train leaves Chicago heading east at 45 mph. When will they be 500 miles apart?"}
    ]
)
print(response["choices"][0]["message"]["content"])
A well-crafted system message or user prompt can trigger CoT even without examples:
Please solve the following problem by showing each step and then give the final answer in bold.
When you supply hand-crafted demonstrations, the model tends to follow the same reasoning pattern.
Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day using four eggs. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the market?
A: Janet starts with 16 eggs.
- She eats 3, so 16 - 3 = 13 remain.
- She uses 4 for muffins, so 13 - 4 = 9 remain.
- She sells 9 eggs at $2 each ⇒ 9 × $2 = $18 every day at the market.
Q: A train leaves Chicago heading west at 60 mph. Two hours later a second train leaves Chicago heading east at 45 mph. When will they be 500 miles apart?
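One straightforward way to wire this up, assuming the same legacy openai client shown earlier, is to pass the worked example as a user/assistant turn pair so the model completes the new question in the same style:
import openai

# The worked Janet example from above, reused as a demonstration turn
# (question text abbreviated here; use the full wording in practice).
demo_q = "Q: Janet's ducks lay 16 eggs per day. ... How much in dollars does she make every day at the market?"
demo_a = ("A: Janet starts with 16 eggs.\n"
          "- She eats 3, so 16 - 3 = 13 remain.\n"
          "- She uses 4 for muffins, so 13 - 4 = 9 remain.\n"
          "- She sells 9 eggs at $2 each => 9 x $2 = $18 every day at the market.")
new_q = ("Q: A train leaves Chicago heading west at 60 mph. Two hours later a second train "
         "leaves Chicago heading east at 45 mph. When will they be 500 miles apart?")

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Solve each problem step by step, mirroring the example."},
        {"role": "user", "content": demo_q},
        {"role": "assistant", "content": demo_a},
        {"role": "user", "content": new_q},
    ],
)
print(response["choices"][0]["message"]["content"])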
Instead of writing examples by hand, you cluster problems, generate rationales with a small model, and keep only the most diverse ones. This reduces prompt-engineering labor while maintaining coverage of reasoning styles.
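A minimal sketch of that pipeline, assuming sentence-transformers embeddings, scikit-learn k-means, and a hypothetical generate_rationale() helper that calls a small model:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

def select_diverse_demos(questions: list[str], k: int = 8) -> list[dict]:
    # Embed every candidate question and partition the pool into k clusters.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(questions)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

    demos = []
    for cluster in range(k):
        # Pick the question closest to each cluster centroid as its representative.
        idx = np.where(labels == cluster)[0]
        centroid = embeddings[idx].mean(axis=0)
        rep = idx[np.argmin(np.linalg.norm(embeddings[idx] - centroid, axis=1))]
        # generate_rationale() stands in for a call to a small model that writes
        # a step-by-step answer for the representative question.
        demos.append({"question": questions[rep], "rationale": generate_rationale(questions[rep])})
    return demos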
You can also ask for explicit tags (e.g., <reasoning>...</reasoning>) to isolate the trace; the final <answer> tag tells the model when to stop.
Few-shot prompting becomes more reliable when each example is not just a question-answer pair but a miniature “workflow” that teaches the LLM how to behave.
Assign a persona or role that the model must inhabit for the duration of the conversation.
You are Dr. Lee, a board-certified cardiologist reviewing patient echocardiogram reports.
Your task is to grade diastolic dysfunction on a 0-3 scale and write a one-sentence summary.
Report 1: ...
Grade: 1
Summary: Mild diastolic dysfunction with preserved EF.
Report 2: ...
Grade: 3
Summary: Severe restrictive pattern with elevated LVEDP.
Add explicit formatting rules so the model’s output is parseable later.
- Output format: JSON with keys: {"grade": int, "summary": str, "actionable": bool}
- grade must be 0, 1, 2, or 3
- actionable is true only if the summary contains the word "follow-up"
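With rules like these in place, downstream code can reject malformed replies before they reach the rest of the pipeline; a small validation sketch (only the field names and constraints come from the rules above, everything else is assumed):
import json

REQUIRED_KEYS = {"grade": int, "summary": str, "actionable": bool}

def parse_grading_reply(raw: str) -> dict:
    # Reject anything that is not valid JSON with the promised keys and types.
    data = json.loads(raw)
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    if data["grade"] not in (0, 1, 2, 3):
        raise ValueError("grade must be 0, 1, 2, or 3")
    # Enforce the rule that actionable replies must mention 'follow-up' in the summary.
    if data["actionable"] and "follow-up" not in data["summary"].lower():
        raise ValueError("actionable replies must contain the word 'follow-up'")
    return data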
Pairs of correct vs. incorrect traces can steer the model away from common mistakes.
Good: "Ejection fraction 55 % → Grade 1 diastolic dysfunction → Summary: Normal diastolic function."
Bad: "Ejection fraction 55 % → Grade 4 diastolic dysfunction → Summary: Severe systolic impairment."
Instead of hard-coding examples in the prompt, retrieve the most semantically similar demonstrations at runtime using a vector store.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

candidates = [...]  # pre-loaded examples, each dict carrying a precomputed "emb" vector
query_emb = model.encode(user_question)  # user_question is the incoming query string

# Rank stored examples by cosine similarity to the query and keep the three closest.
scores = cosine_similarity([query_emb], [ex["emb"] for ex in candidates])[0]
top_k = sorted(zip(candidates, scores), key=lambda x: -x[1])[:3]
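The retrieved demonstrations can then be spliced into the prompt at request time; a short sketch, assuming each stored example keeps its question and worked answer under "q" and "a" keys:
# Assemble the final few-shot prompt from the three nearest examples.
demo_block = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex, _ in top_k)
prompt = f"{demo_block}\n\nQ: {user_question}\nA:"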
Prompt engineering is no longer a game of guessing; it is an optimization loop that can be automated with LLMs themselves.
Treat the prompt string as a parameterizable function.
def build_prompt(task: str, style: str = "concise", max_tokens: int = 512) -> str:
    base = f"""Act as an expert {task}.
Style: {style}.
Keep the answer under {max_tokens} tokens.
Be factual and cite sources when possible."""
    return base
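Because the prompt is now an ordinary function, spinning up variants to test is a one-liner; for example (parameter values are illustrative):
concise = build_prompt("cardiologist", style="concise")
detailed = build_prompt("cardiologist", style="detailed", max_tokens=1024)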
Use the same LLM to iteratively improve a prompt.
def refine_prompt(prompt: str, eval_set: list[tuple]) -> str:
    # Ask the LLM for five rewrites of the current prompt, score each one
    # against a held-out evaluation set, and keep the best performer.
    candidates = [llm.generate_refinement(prompt, i) for i in range(5)]
    scores = [evaluate(cand, eval_set) for cand in candidates]
    return candidates[scores.index(max(scores))]
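The evaluate helper is whatever scoring function fits your task; one assumed sketch that measures exact-match accuracy over a labelled evaluation set (llm.complete is a placeholder client call):
def evaluate(prompt: str, eval_set: list[tuple]) -> float:
    # eval_set holds (input, expected_output) pairs; the score is exact-match accuracy.
    correct = 0
    for example_input, expected in eval_set:
        output = llm.complete(prompt + "\n\n" + example_input)  # placeholder client call
        correct += int(output.strip() == expected.strip())
    return correct / len(eval_set)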
Even with automation, human annotators can judge nuanced qualities such as tone, safety, or brand voice. A lightweight HITL dashboard surfaces the top 10 prompts and lets reviewers up-vote or down-vote outputs.
Wrap your prompt in a lightweight A/B framework so you can roll out new variants to a small percentage of traffic and compare conversion or error rates.
from abtesting import Experiment  # stand-in for whatever experimentation framework you use

exp = Experiment(
    name="diag_grade_v6",
    variants=["baseline", "cot_v1", "cot_v2"],
    metric=lambda logs: logs["accuracy"],
    traffic_split=0.05
)
selected_variant = exp.serve()
Store every prompt variant in Git, add a semantic commit message (“feat: add contrastive examples for grade 3”), and tag each release. If a new variant causes a regression, roll back in seconds.
Remember that prompt engineering is not a one-time setup but a continuous loop. As models evolve, so must your prompts; treat them as living artifacts that grow with your product and user expectations.