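This is a translation I made last year of OpenAI's official guide, Prompt engineering; it is worth rereading several times. While writing this post I noticed that OpenAI had renamed the guide: it used to be titled GPT best practices and is now called Prompt engineering.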

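GPT Best Practices

Six strategies for getting better results

1 Write clear instructions

GPT can't read your mind. If outputs are too long, ask for brief replies; if outputs are too simple, ask for expert-level writing; if you dislike the format, demonstrate the format you'd like to see. The more explicit your instructions, the more likely GPT is to deliver what you want.

  • Include full details in your query to get more accurate answers.
  • Ask the model to adopt a persona.
  • Use delimiters to clearly separate distinct parts of the input.
  • Specify the steps required to complete the task.
  • Provide examples.
  • Specify the desired length of the output.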

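2 Provide reference text

GPT can confidently invent fake answers, especially when asked about esoteric topics or for citations and URLs. Just as a sheet of notes can help a student do better on an exam, giving GPT reference text can help it answer with fewer fabrications.

  • Instruct the model to answer using a reference text
  • Instruct the model to answer with citations from a reference text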

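3 Split complex tasks into simpler subtasks

Just as it is standard practice in software engineering to decompose a complex system into modular components, the same applies to tasks submitted to GPT. Complex tasks tend to have higher error rates than simpler ones. Moreover, a complex task can often be redefined as a workflow of simpler tasks, in which the outputs of earlier tasks are used to construct the inputs of later ones.

  • Use intent classification to identify the most relevant instructions for a user query
  • For dialogue applications that require very long conversations, summarize or filter previous dialogue
  • Summarize long documents piecewise and construct a full summary recursively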

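4 Give GPT time to think

If asked to multiply 17 by 28, you might not know the answer instantly, but you could work it out given time. Similarly, GPT makes more reasoning errors when asked to answer right away. Asking it to reason step by step before answering helps it arrive at correct answers more reliably.

  • Instruct the model to work out its own solution before rushing to a conclusion.
  • Use inner monologue or a sequence of queries to hide the model's reasoning process.
  • Ask the model if it missed anything in its previous answers.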

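5 Use external tools

Compensate for GPT's weaknesses by feeding it the outputs of other tools. For example, a text retrieval system can tell GPT about relevant documents, and a code execution engine can help GPT do math and run code. If a task can be done more reliably or efficiently by a tool than by GPT, offload it to get the best of both.

  • Use embeddings-based search to implement efficient knowledge retrieval
  • Use code execution to perform accurate calculations or call external APIs
  • Give the model access to specific functions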

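6 Test changes systematically

Improving performance is easier when you can measure it. In some cases a modification to a prompt performs better on a few isolated examples but worse on a more representative set. To be sure that a change is a net improvement, you therefore need a comprehensive test suite (also known as an "eval").

  • Evaluate model outputs with reference to gold-standard answers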

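Tactics

Each of the strategies above can be put into practice with specific tactics. These tactics are meant to provide ideas to try; they are by no means exhaustive. Feel free to try creative ideas not covered here.

1 Write clear instructions

1.1 Include full details in your query to get more accurate answers.

To get a highly relevant response, provide every important detail and piece of context in your request. Otherwise you leave it to the model to guess what you mean.

1.2 Ask the model to adopt a persona.

The system message can be used to specify the persona the model adopts in its replies. For example: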

SYSTEM:
When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment in every paragraph.

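1.3 Use delimiters to clearly separate distinct parts of the input.

Delimiters such as triple quotation marks, XML tags, and section titles help demarcate sections of text that should be treated differently.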

USER: 
Summarize the text delimited by triple quotes with a haiku.

"""insert text here"""

SYSTEM:
You will be provided with a pair of articles (delimited with XML tags) about the same topic. First summarize the arguments of each article. Then indicate which of them makes a better argument and explain why.

USER:
<article> insert first article here </article>

<article> insert second article here </article>

SYSTEM:
You will be provided with a thesis abstract and a suggested title for it. The thesis title should give the reader a good idea of the topic of the thesis but should also be eye-catching. If the title does not meet these criteria, suggest 5 alternatives.

USER:
Abstract: insert abstract here

Title: insert title here
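
As a rough illustration, the two-article prompt above could be assembled in code along these lines. This is only a sketch: the function name is made up for this post, and the role/content dictionaries simply mirror the SYSTEM/USER layout used in the examples. Escaping the article text is an extra precaution, not part of the original guide, so that stray angle brackets inside an article cannot be mistaken for a delimiter.

```python
from xml.sax.saxutils import escape


def build_comparison_messages(first_article: str, second_article: str) -> list[dict]:
    """Wrap two articles in <article> tags so the model can tell them apart."""
    system = (
        "You will be provided with a pair of articles (delimited with XML tags) "
        "about the same topic. First summarize the arguments of each article. "
        "Then indicate which of them makes a better argument and explain why."
    )
    # escape() replaces &, < and > so article text cannot close a tag early.
    user = "\n\n".join(
        f"<article> {escape(text)} </article>"
        for text in (first_article, second_article)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```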

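1.4 Specify the steps required to complete the task.

Some tasks are best specified as a sequence of steps. Writing the steps out explicitly makes it easier for the model to follow them.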

SYSTEM:
Use the following step-by-step instructions to respond to user inputs.

Step 1 - The user will provide you with text in triple quotes. Summarize this text in one sentence with a prefix that says "Summary: ".

Step 2 - Translate the summary from Step 1 into Spanish, with a prefix that says "Translation: ".

USER:
"""insert text here"""

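1.5 Provide examples

Providing general instructions is usually more efficient than demonstrating every variation of a task by example. Sometimes, though, providing examples is easier, for instance when you want the model to copy a particular style of responding to user queries that is hard to describe explicitly. This is known as "few-shot" prompting.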

SYSTEM: Answer in a consistent style.

USER: Teach me about patience.

ASSISTANT: The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread.

USER: Teach me about the ocean.

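1.6 Specify the desired length of the output.

You can ask the model to produce outputs of a given target length, specified as a count of words, sentences, paragraphs, and so on. Note, however, that asking for a specific number of words does not work with high precision; the model generates a specific number of paragraphs or bullet points more reliably.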

USER:
Summarize the text delimited by triple quotes in about 50 words.

"""insert text here"""

USER:
Summarize the text delimited by triple quotes in 2 paragraphs.

"""insert text here"""

USER:
Summarize the text delimited by triple quotes in 3 bullet points.

"""insert text here"""

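2 Provide reference text

2.1 Instruct the model to answer using a reference text

If we can provide the model with trusted information that is relevant to the current query, we can instruct the model to use that information to compose its answer.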

SYSTEM:
Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write "I could not find an answer."

USER:
<insert articles, each delimited by triple quotes>

Question: <insert question here>

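Because GPT has a limited context window, applying this tactic requires some way to dynamically look up information relevant to the question being asked. Embeddings can be used to implement efficient knowledge retrieval. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval" for more details on how to implement this.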

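2.2 Instruct the model to answer with citations from a reference text

If the input has been supplemented with relevant knowledge, it is straightforward to ask the model to add citations to its answers by referencing passages from the provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.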

SYSTEM:
You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).

USER:
"""<insert document here>"""

Question: <insert question here>
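
The string-matching check mentioned above might look something like the sketch below. It assumes the model follows the {"citation": …} format with a double-quoted JSON string as the value, which the prompt requests but cannot guarantee; the function name is invented for this post. Exact substring matching is deliberately strict, so a citation that differs from the document only in whitespace would be reported as unsupported.

```python
import json
import re

# Matches objects such as {"citation": "some quoted passage"}.
CITATION_PATTERN = re.compile(r'\{"citation":\s*"(?:[^"\\]|\\.)*"\}')


def find_unsupported_citations(answer: str, document: str) -> list[str]:
    """Return every cited passage that does not appear verbatim in the document."""
    unsupported = []
    for match in CITATION_PATTERN.finditer(answer):
        # json.loads handles escaped quotes inside the cited passage.
        passage = json.loads(match.group(0))["citation"]
        if passage not in document:
            unsupported.append(passage)
    return unsupported
```

An empty list means every citation was found in the document; anything else is a candidate fabrication worth flagging.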

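3 Split complex tasks into simpler subtasks

3.1 Use intent classification to identify the most relevant instructions for a user query

For tasks that need many independent sets of instructions to handle different cases, it can be beneficial to first classify the type of query and then use that classification to determine which instructions are needed. This is done by defining fixed categories and hardcoding the instructions relevant to each category. The process can also be applied recursively to decompose a task into a sequence of stages. The advantage is that each query contains only the instructions needed for the next stage of the task, which can result in lower error rates than using a single query to perform the whole task. It can also lower costs, since larger prompts cost more to run (see the pricing information).

Take a customer service application as an example; queries could usefully be classified as follows: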

SYSTEM:

You will be provided with customer service queries. Classify each query into a primary category and a secondary category. Provide your output in json format with the keys: primary and secondary.

Primary categories: Billing, Technical Support, Account Management, or General Inquiry.

Billing secondary categories:
- Unsubscribe or upgrade
- Add a payment method
- Explanation for charge
- Dispute a charge

Technical Support secondary categories:
- Troubleshooting
- Device compatibility
- Software updates

Account Management secondary categories:
- Password reset
- Update personal information
- Close account
- Account security

General Inquiry secondary categories:
- Product information
- Pricing
- Feedback
- Speak to a human

USER:

I need to get my internet working again.

The output is:

{
  "primary": "Technical Support",
  "secondary": "Troubleshooting"
}

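Based on the classification of the customer query, a set of more specific instructions can be provided to the GPT model to handle the next steps. For example, suppose the customer requires help with "troubleshooting".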

SYSTEM:
You will be provided with customer service inquiries that require troubleshooting for technical support. Help the user by:

- Ask them to check that all cables to/from the router are connected. Note that it is common for cables to come loose over time.
- If all cables are connected and the issue persists, ask them which router model they are using
- Now you will advise them how to restart their device: 
-- If the model number is MTD-327J, advise them to push the red button and hold it for 5 seconds, then wait 5 minutes before testing the connection.
-- If the model number is MTD-327S, advise them to unplug and replug it, then wait 5 minutes before testing the connection.
- If the customer's issue persists after restarting the device and waiting 5 minutes, connect them to IT support by outputting {"IT support requested"}.
- If the user starts asking questions that are unrelated to this topic then confirm if they would like to end the current chat about troubleshooting and classify their request according to the following scheme:

Classify their query into a primary category and a secondary category. Provide your output in json format with the keys: primary and secondary.

Primary categories: Billing, Technical Support, Account Management, or General Inquiry.

Billing secondary categories:
- Unsubscribe or upgrade
- Add a payment method
- Explanation for charge
- Dispute a charge

Technical Support secondary categories:
- Troubleshooting
- Device compatibility
- Software updates

Account Management secondary categories:
- Password reset
- Update personal information
- Close account
- Account security

General Inquiry secondary categories:
- Product information
- Pricing
- Feedback
- Speak to a human

USER:
I need to get my internet working again.

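Notice that the model has been instructed to emit special strings to indicate when the state of the conversation changes. This lets us turn the system into a state machine in which the state determines which instructions are injected. By keeping track of state, of which instructions are relevant in each state, and optionally of which state transitions are allowed, we can put guardrails around the user experience that would be hard to achieve with a less structured approach.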
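
To make the state-machine idea a little more concrete, here is a minimal sketch. Everything in it is illustrative: the prompt names are placeholders for the full instruction sets, the state names are invented, and only the {"IT support requested"} signal and the primary/secondary keys come from the prompts above.

```python
import json

# Placeholder instruction sets; a real system would store the full prompts here.
INSTRUCTIONS_BY_CATEGORY = {
    ("Technical Support", "Troubleshooting"): "TROUBLESHOOTING_PROMPT",
    ("Billing", "Dispute a charge"): "DISPUTE_PROMPT",
}
FALLBACK_INSTRUCTIONS = "GENERAL_PROMPT"
HANDOFF_SIGNAL = '{"IT support requested"}'


def select_instructions(classifier_output: str) -> str:
    """Map the classifier's JSON output to the instructions for the next stage."""
    try:
        parsed = json.loads(classifier_output)
        key = (parsed["primary"], parsed["secondary"])
        return INSTRUCTIONS_BY_CATEGORY.get(key, FALLBACK_INSTRUCTIONS)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed or unexpected output falls back to the general prompt.
        return FALLBACK_INSTRUCTIONS


def next_state(current_state: str, model_reply: str) -> str:
    """Move to the hand-off state when the reply contains the special signal."""
    if current_state == "troubleshooting" and HANDOFF_SIGNAL in model_reply:
        return "it_support"
    return current_state
```

Keeping the category table and the transition rule in ordinary code, rather than in the prompt, is what makes the allowed paths through the conversation easy to audit.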

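3.2 For dialogue applications that require very long conversations, summarize or filter previous dialogue

Because GPT has a fixed context length, a dialogue between a user and an assistant in which the entire conversation is included in the context window cannot continue indefinitely.

There are several workarounds for this problem:

One is to summarize previous turns in the conversation. Once the size of the input reaches a predetermined threshold, this can trigger a query that summarizes part of the conversation, and the summary of the earlier conversation can be included as part of the system message. Alternatively, earlier conversation can be summarized asynchronously in the background throughout the conversation.

Another solution is to dynamically select the previous parts of the conversation that are most relevant to the current query. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval".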
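
The threshold-triggered summary could be sketched as follows. Two assumptions are baked in: length is measured in characters as a crude stand-in for tokens, and `summarize` is any callable you supply (in practice it would wrap a model query). Both the function name and the parameters are hypothetical.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[str], str],
    max_chars: int = 2000,
    keep_recent: int = 2,
) -> list[dict]:
    """Replace older turns with one summary message once the history is too long."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in older)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(transcript),
    }
    # The most recent turns are kept verbatim so the model sees exact wording.
    return [summary] + recent
```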

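3.3 Summarize long documents piecewise and construct a full summary recursively

Because GPT has a fixed context length, it cannot summarize, in a single query, a text longer than the context length minus the length of the generated summary. To summarize a very long document such as a book, a sequence of queries can summarize each section. The section summaries can then be concatenated and summarized, producing summaries of summaries. This process can proceed recursively until the entire document is summarized. If information from earlier sections is needed to make sense of later ones, a running summary of the preceding text can be included when summarizing later content. OpenAI has previously studied the effectiveness of this procedure for summarizing books using variants of GPT-3.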
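
The recursion can be sketched in a few lines. This is a simplified illustration, not the procedure OpenAI studied: it splits on a fixed character count rather than on sections, it does not carry a running summary forward, and `summarize` is a caller-supplied function standing in for a model query.

```python
from typing import Callable


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut the text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def summarize_recursively(
    text: str, summarize: Callable[[str], str], chunk_size: int
) -> str:
    """Summarize each chunk, then summarize the joined summaries until one fits."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return summarize(text)
    partial = [summarize(chunk) for chunk in split_into_chunks(text, chunk_size)]
    combined = "\n".join(partial)
    if len(combined) >= len(text):
        # Guard against a summarizer that fails to shorten its input.
        raise ValueError("summaries are not shrinking the text")
    return summarize_recursively(combined, summarize, chunk_size)
```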

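4 Give GPT time to think

4.1 Instruct the model to work out its own solution before rushing to a conclusion.

Sometimes we get better results when we explicitly instruct the model to reason from first principles before coming to a conclusion. Suppose we want a model to evaluate a student's solution to a math problem. The most obvious approach is simply to ask the model whether the student's solution is correct.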

SYSTEM:
Determine if the student's solution is correct or not.

USER:
Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

ASSISTANT:
The student's solution is correct.

But the student's solution is actually not correct! We can get the model to notice this by prompting it to generate its own solution first.

SYSTEM:
First work out your own solution to the problem. Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.

USER:
Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

ASSISTANT:
Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 10x
Total cost: 100x + 250x + 100,000 + 10x = 360x + 100,000

The student's solution is incorrect. They made an error in the maintenance cost calculation, using 100x instead of 10x. The correct total cost for the first year of operations is 360x + 100,000.
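
For readers who want to check the arithmetic themselves, the two formulas from the example can be compared directly. The function names are just labels for this post.

```python
def student_total(square_feet: int) -> int:
    """The student's formula, which charges $100 per square foot for maintenance."""
    return 100 * square_feet + 250 * square_feet + 100_000 + 100 * square_feet


def correct_total(square_feet: int) -> int:
    """The formula from the problem statement: maintenance is $10 per square foot."""
    return 100 * square_feet + 250 * square_feet + 100_000 + 10 * square_feet


# The two answers differ by the $90 per square foot the student overcharged.
difference_per_square_foot = student_total(1) - correct_total(1)
```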

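4.2 Use inner monologue or a sequence of queries to hide the model's reasoning process.

The previous tactic shows that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning the model uses to arrive at a final answer would be inappropriate to share with the user. In a tutoring application, for example, we may want to encourage students to work out their own answers, but the model's reasoning about the student's solution could reveal the correct answer to the student.

Inner monologue is a tactic that can be used to mitigate this. The idea is to instruct the model to put the parts of the output that are meant to be hidden from the user into a structured format that is easy to parse. Before the output is presented to the user, it is parsed and only part of it is made visible.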

SYSTEM:
Follow these steps to answer the user queries.

Step 1 - First work out your own solution to the problem. Don't rely on the student's solution since it may be incorrect. Enclose all your work for this step within triple quotes (""").

Step 2 - Compare your solution to the student's solution and evaluate if the student's solution is correct or not. Enclose all your work for this step within triple quotes (""").

Step 3 - If the student made a mistake, determine what hint you could give the student without giving away the answer. Enclose all your work for this step within triple quotes (""").

Step 4 - If the student made a mistake, provide the hint from the previous step to the student (outside of triple quotes). Instead of writing "Step 4 - ..." write "Hint:".

USER:
Problem Statement: A box contains 4 red balls and 6 blue balls. Two balls are drawn from the box without replacement. What is the probability that one of the drawn balls is red and the other one is blue?

Student Solution: (4/10)*(6/9) = 24/90 = 4/15

ASSISTANT:
"""
Step 1 - My own solution:

First, determine the total number of ways to draw two balls from the box without replacement. There are 10 balls in total (4 red and 6 blue).

The number of ways to choose 2 balls from 10 is calculated using the combination formula C(n, k) = n! / [k!(n - k)!], where "n" is the total number of items, and "k" is the number of items to choose.

For our case, n = 10 and k = 2, so:
C(10, 2) = 10! / [2!(10 - 2)!] = 10! / (2!8!) = (10 × 9) / (2 × 1) = 45

There are two possible favorable outcomes: 
- Drawing a red ball first and then a blue ball.
- Drawing a blue ball first and then a red ball.

The probability of drawing a red ball and then a blue ball is:
(4/10) * (6/9), because there are 4 red balls out of 10 total balls, and after drawing one red ball, there are 6 blue balls out of the remaining 9.

(4/10) * (6/9) = (4 * 6) / (10 * 9) = 24 / 90

The probability of drawing a blue ball and then a red ball is:
(6/10) * (4/9), because there are 6 blue balls out of 10 total balls, and after drawing one blue ball, there are 4 red balls out of the remaining 9.

(6/10) * (4/9) = (6 * 4) / (10 * 9) = 24 / 90

The total probability of one ball being red and the other being blue is the sum of the two individual probabilities:

24/90 + 24/90 = 48/90 = 8/15 after simplification.

Step 2 - Compare the solution:

The student's solution calculates only the probability of one scenario (a red ball first then a blue ball) and does not take into account the second scenario (a blue ball first then a red ball).

The student's calculation:
(4/10) * (6/9) = 24/90 = 4/15

This calculation is correct for one scenario but does not account for both scenarios.

Step 3 - Hint for the student:

Consider the possibility that either a red ball can be drawn first followed by a blue ball or a blue ball can be drawn first followed by a red ball. You need to find the probability for both scenarios and then add them together to get the total probability for drawing one red and one blue ball.
"""

Hint: Remember that there are two scenarios where you can have one red and one blue ball. You have correctly calculated the probability for one of the scenarios. Now, try to calculate the probability for the second scenario (drawing a blue ball first and then a red ball) and add it to the probability of the first scenario to get the total probability.
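
The parsing step for an output like the one above could be as small as the sketch below. It assumes the hidden reasoning is always wrapped in matching pairs of triple quotes, as the prompt requests; the function name is invented. Refusing to return anything when the quotes are unbalanced is a design choice of this sketch, made so that a half-finished reasoning block is never shown to the student.

```python
import re

# Non-greedy match so each pair of triple quotes is removed separately.
TRIPLE_QUOTED = re.compile(r'"""(.*?)"""', re.DOTALL)


def visible_text(model_output: str) -> str:
    """Drop every triple-quoted section and return what is left for the user."""
    if model_output.count('"""') % 2:
        raise ValueError("unbalanced triple quotes; refusing to show partial reasoning")
    return TRIPLE_QUOTED.sub("", model_output).strip()
```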

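4.3 Ask the model if it missed anything in its previous answers.

Suppose we are using a model to list excerpts from a large source document that are relevant to a particular question. After listing each excerpt, the model needs to decide whether to write another or to stop. If the source document is large, it is common for the model to stop too early and fail to list all relevant excerpts. In that case, better performance can often be obtained by prompting the model with follow-up queries to find any excerpts it missed on previous passes.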

SYSTEM:
You will be provided with a document delimited by triple quotes. Your task is to select excerpts which pertain to the following question: "What significant paradigm shifts have occurred in the history of artificial intelligence."

Ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context. Provide output in JSON format as follows:

[{"excerpt": "..."},
...
{"excerpt": "..."}]

USER:
"""<insert document here>"""

ASSISTANT:
[{"excerpt": "the model writes an excerpt here"},
...
{"excerpt": "the model writes another excerpt here"}]

USER:
Are there more relevant excerpts? Take care not to repeat excerpts. Also ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context.

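5 Use external tools

5.1 Use embeddings-based search to implement efficient knowledge retrieval

A model can leverage external sources of information if they are provided as part of its input, which helps it generate more informed and up-to-date responses. For example, if a user asks a question about a specific movie, it may be useful to add high-quality information about the movie (such as its actors and director) to the model's input. Embeddings can be used to implement efficient knowledge retrieval, so that relevant information can be added to the model input dynamically at run time.

A text embedding is a vector that can measure the relatedness between text strings. Similar or relevant strings will be closer together in embedding space than unrelated strings. This fact, together with the existence of fast vector search algorithms, means that embeddings can be used to implement efficient knowledge retrieval. In particular, a text corpus can be split into chunks, and each chunk can be embedded and stored. A given query can then be embedded, and a vector search performed to find the chunks of text most related to the query (that is, closest to it in embedding space).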
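
A toy version of the search step is sketched below. It assumes cosine similarity as the measure of closeness, which is a common choice but not one the guide prescribes, and it uses a brute-force scan where a real system would use a vector index. The embeddings are expected to come from an embedding model; here they are just lists of floats supplied by the caller.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2
) -> list[str]:
    """Return the texts of the k chunks whose embeddings are closest to the query."""
    ranked = sorted(
        chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]
```

The returned chunks would then be pasted into the model input as reference text.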

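Example implementations can be found in the OpenAI Cookbook. See the tactic "Instruct the model to answer using a reference text" for an example of how to use knowledge retrieval to minimize the likelihood that the model makes up incorrect facts.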

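5.2 Use code execution to perform accurate calculations or call external APIs

GPT models cannot be relied upon to perform arithmetic or long calculations accurately on their own. Where this is needed, the model can be instructed to write and run code instead of doing the calculation itself. In particular, the model can be instructed to put code that is meant to be run into a designated format such as triple backticks. After an output is produced, the code can be extracted and run. Finally, if necessary, the output from the code execution engine (for example, a Python interpreter) can be provided as input to the model for the next query.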

SYSTEM:
You can write and execute Python code by enclosing it in triple backticks, e.g. ```code goes here```. Use this to perform calculations.

USER:
Find all real-valued roots of the following polynomial: 3*x**5 - 5*x**4 - 3*x**3 - 7*x - 10.

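Another good use case for code execution is calling external APIs. If a model is instructed in the proper use of an API, it can write code that makes use of it. A model can be taught how to use an API by providing it with documentation and/or code samples that show how to use the API.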

SYSTEM:
You can write and execute Python code by enclosing it in triple backticks. Also note that you have access to the following module to help users send messages to their friends:

```python
import message
message.write(to="Bob", message="Hey, want to meetup after work?")
```

USER:
Let Alice know I'll meet her at the coffee shop at 10.

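WARNING: Executing code produced by a model is not inherently safe, and precautions should be taken in any application that seeks to do this. In particular, a sandboxed code execution environment is needed to limit the harm that untrusted code could cause.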
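
The extraction step, and only the extraction step, might be sketched like this. The function deliberately does not run anything: in line with the warning above, the extracted strings are meant to be handed to a sandboxed executor, which is outside the scope of this sketch. The optional "python" language tag after the opening fence is an assumption about how the model formats its reply.

```python
import re

# The fence is built at runtime so this example can itself sit inside a fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python[ \t]*\n)?(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(model_output: str) -> list[str]:
    """Return the contents of every triple-backtick block in the model output."""
    return [block.strip() for block in CODE_BLOCK.findall(model_output)]
```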

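5.3 Give the model access to specific functions

The Chat Completions API allows a list of function descriptions to be passed in a request. This enables the model to generate function arguments according to the provided schemas. The generated function arguments are returned by the API in JSON format and can be used to execute function calls. The output of a function call can then be fed back to the model in a subsequent request to close the loop. This is the recommended way of using GPT models to call external functions. To learn more, see the function calling section of OpenAI's introductory GPT guide and the additional function calling examples in the OpenAI Cookbook.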
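
The local half of that loop, taking the function name and JSON arguments the API returned and actually running the function, could be sketched as below. No API call is made here; `get_order_status` is a made-up function, and the registry and error format are choices of this sketch rather than anything the API requires.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical local function the model is allowed to call."""
    return {"order_id": order_id, "status": "shipped"}


# Only functions listed here can be invoked, whatever name the model returns.
AVAILABLE_FUNCTIONS = {"get_order_status": get_order_status}


def run_function_call(name: str, arguments_json: str) -> str:
    """Execute a model-requested call and return a JSON string to send back."""
    if name not in AVAILABLE_FUNCTIONS:
        return json.dumps({"error": f"unknown function: {name}"})
    try:
        arguments = json.loads(arguments_json)
        result = AVAILABLE_FUNCTIONS[name](**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Bad JSON or wrong argument names are reported instead of raised.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

Returning errors as JSON rather than raising lets the follow-up request tell the model what went wrong, so it has a chance to correct its arguments.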

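6 Test changes systematically

Sometimes it is hard to tell whether a change, such as a new instruction or a new design, makes your system better or worse. Looking at a few examples may hint at which is better, but with small sample sizes it is hard to distinguish a true improvement from random luck. The change may help performance on some inputs and hurt it on others.

Evaluation procedures (or "evals") are useful for optimizing system designs. Good evals are: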

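  • Representative of real-world usage (or at least diverse)
  • Large enough in number of test cases to give greater statistical power (see the table below for guidelines)
  • Easy to automate or repeat

Difference to detect    Sample size needed for 95% confidence
30%                     ~10
10%                     ~100
3%                      ~1,000
1%                      ~10,000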
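
The figures in the table are close to the rule of thumb n ≈ 1/d², where d is the difference to detect expressed as a fraction. That is an observation about these numbers rather than a formula stated in the guide, so treat the helper below as a rough estimate for values between the rows, not as a substitute for a proper power calculation.

```python
def approx_sample_size(difference: float) -> int:
    """Rough sample size for detecting a difference given as a fraction, e.g. 0.1."""
    if not 0 < difference <= 1:
        raise ValueError("difference must be a fraction in (0, 1]")
    # Rounded rather than exact; the table itself only gives orders of magnitude.
    return round(1 / difference ** 2)
```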

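Outputs can be evaluated by computers, by humans, or by a mix of the two. Computers can automate evals with objective criteria (for example, questions with a single correct answer) as well as some subjective or fuzzy criteria, in which case model outputs are evaluated by other model queries. OpenAI Evals is an open-source software framework that provides tools for creating automated evals.

Model-based evals can be useful when there is a range of possible outputs that would be considered equally high in quality (for example, for questions with long answers). The boundary between what can realistically be evaluated with a model-based eval and what requires a human is fuzzy, and it keeps shifting as models become more capable. We encourage experimentation to find out how well model-based evals work for your use case.

6.1 Evaluate model outputs with reference to gold-standard answers

Suppose it is known that the correct answer to a question should refer to a specific set of known facts. We can then use a model query to count how many of the required facts are included in the answer.

For example, using the following system message: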

SYSTEM:
You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:

- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:

1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.

ASSISTANT:
Of course, I'd be happy to help with that. Please provide the text that contains the answer you're referring to.

USER:
"""Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission."""

ASSISTANT:

1 - Restate the point: Neil Armstrong was the first person to walk on the moon.
2 - Provide a citation from the answer: "Neil Armstrong is famous for being the first human to set foot on the Moon."
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point: Yes, the citation directly states that Neil Armstrong was the first human to set foot on the moon, which is equivalent to being the first person to walk on it.
4 - Write "yes" or "no": yes

1 - Restate the point: The date Neil Armstrong first walked on the moon was July 21, 1969.
2 - Provide a citation from the answer: "This historic event took place on July 21, 1969, during the Apollo 11 mission."
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point: Yes, the citation directly states the date on which the historic event of Neil Armstrong setting foot on the Moon occurred.
4 - Write "yes" or "no": yes

{"count": 2}

Here is an example input where only one point is satisfied:

SYSTEM:
You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:

- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:

1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.

ASSISTANT:
Of course, I'd be happy to help with that. Please provide the text that contains the answer you're referring to.

USER:
"""Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon."""

ASSISTANT:
1 - Restate the point:
   Neil Armstrong was the first person to walk on the moon.

2 - Provide a citation from the answer which is closest to this point:
   "Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon."

3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind:
   Yes, the citation clearly states that Neil Armstrong became the first person to walk on the moon when he stepped off the lunar module, which directly conveys the point in question.

4 - Write "yes" if the answer to 3 was yes, otherwise write "no":
   Yes

The second point:

1 - Restate the point:
   The date Neil Armstrong first walked on the moon was July 21, 1969.

2 - Provide a citation from the answer which is closest to this point:
   There is no direct citation that provides the date when Neil Armstrong walked on the moon within the provided answer.

3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind:
   No, because the date of when Neil Armstrong walked on the moon is not mentioned in the provided text, and therefore cannot be inferred.

4 - Write "yes" if the answer to 3 was yes, otherwise write "no":
   No

Finally, provide a count of how many "yes" answers there are:

{"count": 1}

Here is an example input where none are satisfied:

SYSTEM:
<insert system message above>

USER:
"""In the summer of '69, a voyage grand,
Apollo 11, bold as legend's hand.
Armstrong took a step, history unfurled,
"One small step," he said, for a new world."""
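
To turn evaluations like these into a number, the final {"count": …} object can be pulled out of the grader's reply. The sketch below assumes the grader ends with that object, as the system message asks, and takes the last match in case the reply quotes the format earlier on; the function name and the idea of reporting a fraction are choices made for this post.

```python
import re

COUNT_PATTERN = re.compile(r'\{"count":\s*(\d+)\}')


def score_from_evaluation(evaluation: str, total_points: int) -> float:
    """Return the fraction of required points the grader marked as present."""
    matches = COUNT_PATTERN.findall(evaluation)
    if not matches:
        raise ValueError("no count object found in the evaluation output")
    count = int(matches[-1])  # the prompt asks for the count at the very end
    return count / total_points
```

Averaging this score over a whole test set gives a single figure that can be compared before and after a prompt change.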

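There are many possible variants of this type of model-based eval. Consider the following variation, which tracks the kind of overlap between the candidate answer and the gold-standard answer, and also tracks whether the candidate answer contradicts any part of the gold-standard answer.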

SYSTEM:
Use the following steps to respond to user inputs. Fully restate each step before proceeding. i.e. "Step 1: Reason...".

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

Here is an example input with a substandard answer that nonetheless does not contradict the expert answer:

SYSTEM:
<insert system message above>

USER:
Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """Didn't he walk on the moon or something?"""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

Here is an example input with an answer that directly contradicts the expert answer:

SYSTEM:
<insert system message above>

USER:
Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """On the 21st of July 1969, Neil Armstrong became the second person to walk on the moon, following after Buzz Aldrin."""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

Here is an example input with a correct answer that also provides a bit more detail than is necessary:

SYSTEM:
<insert system message above>

USER:
Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """At approximately 02:56 UTC on July 21st 1969, Neil Armstrong became the first human to set foot on the lunar surface, marking a monumental achievement in human history."""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""
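
The JSON object requested in Step 3 can be read back in code and turned into a pass/fail decision. The parsing below assumes the object is the last thing in the reply. The acceptance rule (pass on "equal" or "superset" with no contradiction) is only one plausible policy, invented for this sketch; the guide does not say which overlap types should count as passing.

```python
import json


def parse_verdict(model_output: str) -> dict:
    """Decode the last JSON object in the grader's reply."""
    start = model_output.rfind("{")
    if start == -1:
        raise ValueError("no JSON object found in the grader output")
    # raw_decode tolerates trailing text after the object.
    verdict, _ = json.JSONDecoder().raw_decode(model_output[start:])
    return verdict


def is_acceptable(verdict: dict) -> bool:
    """Assumed policy: covers the expert answer and contradicts none of it."""
    covers_expert = verdict["type_of_overlap"] in {"equal", "superset"}
    return covers_expert and not verdict["contradiction"]
```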