LLM Knowledge
Wednesday, March 12, 2025
14:49

When an LLM outputs a token, it is effectively sampling from a categorical (multinomial) distribution; the distribution's probabilities are the model's token probability distribution, so the higher a token's probability, the more likely it is to be sampled.
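A minimal sketch of this sampling step, assuming plain NumPy rather than any particular inference framework: logits are turned into a probability distribution with softmax (optionally scaled by a temperature), and the next token id is drawn from that distribution.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Turn raw logits into probabilities and draw one token id."""
    scaled = logits / temperature                  # temperature < 1 sharpens, > 1 flattens
    scaled = scaled - scaled.max()                 # numerical stability before exp
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax -> token probability distribution
    return int(np.random.choice(len(probs), p=probs))  # higher-probability tokens are drawn more often

# Toy example: a 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
print(sample_next_token(logits, temperature=0.8))
```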
★ Why does chain-of-thought prompting work?
This "note taking" or "thinking" strategy typically works well with auto-regressive models, where the generated text is passed back into the model at each generation step. This means the working "notes" are part of the input when the final answer is generated.
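The loop below is only an illustrative sketch of that mechanism; call_llm is a placeholder stub, not a real API. The point is that each generated "note" is appended to the context, so the final answer is conditioned on it.

```python
def call_llm(context: str) -> str:
    # Placeholder for a real model call; here it ends the loop immediately.
    return "Final answer: 42\n"

def answer_with_notes(question: str, max_steps: int = 5) -> str:
    context = question + "\nLet's think step by step.\n"
    for _ in range(max_steps):
        note = call_llm(context)   # model sees everything generated so far
        context += note            # working "notes" become part of the input
        if "Final answer:" in note:
            break
    return context

print(answer_with_notes("What is 6 * 7?"))
```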
★ On evaluating LLMs:
Assessing the quality of a prompt or a model
When using LLMs in real-world cases, it's important to understand how well they are performing. The open-ended generation capabilities of LLMs can make many cases difficult to measure. In this notebook you will walk through some simple techniques for evaluating LLM outputs and understanding their performance.
To assess the quality of a prompt or an LLM's output, you need to define an evaluator; this evaluator is usually itself an LLM.
Define an evaluator
For a task like this, you may wish to evaluate a number of aspects, like how well the model followed the prompt ("instruction following"), whether it included relevant data in the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality".
You can instruct an LLM to perform these tasks in a similar manner to how you would instruct a human rater: with a clear definition and assessment rubric.
In this step, you define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.
Note: For more pre-written evaluation prompts covering groundedness, safety, coherence and more, check out this comprehensive list of model-based evaluation prompts from the Google Cloud docs. https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates
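A rough sketch of such an evaluation agent, with an illustrative rubric loosely modelled on the "summarisation" style prompts mentioned above; call_llm is a placeholder for whichever model client you use, and the 1-5 scale is an assumption rather than a fixed standard.

```python
SUMMARISATION_RUBRIC = """
You are a rater evaluating summaries.
Criteria:
- Instruction following: does the summary follow the prompt?
- Groundedness: does it only use information from the source document?
- Fluency: is it easy to read?
Rate the summary on a 1-5 scale and answer with a single integer.

## Source document
{document}

## Summary
{summary}
"""

def call_llm(prompt: str) -> str:
    return "4"  # placeholder; replace with a real model call

def evaluate_summary(document: str, summary: str) -> int:
    prompt = SUMMARISATION_RUBRIC.format(document=document, summary=summary)
    return int(call_llm(prompt).strip())

print(evaluate_summary("Full article text...", "A short summary."))
```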
Of course, using an LLM as an evaluator also comes with some challenges.
LLM limitations
LLMs are known to have problems on certain tasks, and these challenges still persist when using LLMs as evaluators. For example, LLMs can struggle to count the number of characters in a word (this is a numerical problem, not a language problem), so an LLM evaluator will not be able to accurately evaluate this type of task. There are solutions available in some cases, such as connecting tools to handle problems unsuitable to a language model, but it's important that you understand possible limitations and include human evaluators to calibrate your evaluation system and determine a baseline.
One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customising evaluation prompts, or building your own systems, keep this in mind and ensure that you are not relying on "internal knowledge" from the model, or behaviour that might be better provided from a tool.
Improving confidence
One way to improve the confidence of your evaluations is to include a diverse set of evaluators. That is, use the same prompts and outputs, but execute them on different models, like Gemini Flash and Pro, or even across different providers, like Gemini, Claude, ChatGPT and local models like Gemma or Qwen. This follows the same idea used earlier, where repeating trials to gather multiple "opinions" helps to reduce error, except by using different models the "opinions" will be more diverse. Use the results from multiple LLMs rather than relying on a single one.
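A sketch of that idea, with placeholder model names and a placeholder call_model function: the same rubric prompt is sent to several evaluators and their scores are aggregated (the median is used here to be robust to one outlier "opinion").

```python
from statistics import median

EVALUATORS = ["gemini-flash", "gemini-pro", "claude", "local-gemma"]  # illustrative names

def call_model(model_name: str, prompt: str) -> str:
    return "4"  # placeholder; route to the real provider for model_name

def ensemble_score(eval_prompt: str) -> float:
    scores = [int(call_model(name, eval_prompt)) for name in EVALUATORS]
    return median(scores)   # robust to a single outlier evaluator

print(ensemble_score("...same rubric prompt and candidate output as above..."))
```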
★关于system prompt的疑问,有什么用,llm训练时如何用system prompt,预测时如何用system prompt?
The original purpose of the system prompt was to set topic boundaries for a conversation and to guard against adversarial use (e.g. turning a customer-service bot into a virtual girlfriend). Some LLMs were not trained with a system-prompt format at all, so using one at inference time is discouraged; DeepSeek R1 is one example. A system prompt only makes sense if the model was trained to work with it, e.g. giving it more weight and always keeping it in context no matter how long the conversation gets.
How do LLMs see the system prompt? It depends on the model and on the prompt format it was trained with.
The system prompt is not used during pre-training; it is introduced in the instruction or chat fine-tuning stage.
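An illustrative example of how this looks in practice: during chat fine-tuning (and at inference), the system message is serialized into the prompt text by a chat template. The tags below follow a ChatML-like convention; every model family has its own template, so treat this only as a sketch.

```python
def build_prompt(messages: list[dict]) -> str:
    """Serialize chat messages (system, user, assistant) into one prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")   # generation starts here
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a customer-support bot. Stay on topic."},
    {"role": "user", "content": "Can you be my virtual girlfriend?"},
]
print(build_prompt(messages))
```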