Microsoft's Orca 2 LLM Outperforms Models That Are 10x Larger

Dec 12, 2023 2 min read
by Anthony Alford
Microsoft Research released its Orca 2 LLM, a fine-tuned version of Llama 2 that performs as well as or better than models that contain 10x the number of parameters. Orca 2 uses a synthetic training dataset and a new technique called Prompt Erasure to achieve this performance.
Orca 2 models are trained using a teacher-student scheme, where a larger, more powerful LLM acts as a teacher for a smaller student LLM, with the goal of raising the student's performance to that of a much larger model. Microsoft's training technique teaches the smaller model multiple reasoning techniques, as well as how to choose the most effective technique for a given task. To do this, the teacher is given sophisticated prompts designed to trigger a certain reasoning behavior. However, in a scheme called Prompt Erasure, the student is given only the task requirements and the desired response, not the teacher's prompt. When evaluated on benchmarks, a 13B parameter Orca 2 model outperformed a baseline 13B parameter Llama 2 by 47.54%. The 7B parameter Orca 2 was "better or comparable" to a 70B parameter Llama 2 on reasoning tasks.
Although LLMs like ChatGPT can often perform well on a wide range of tasks with few-shot prompting, hosting these models is challenging due to their memory and compute requirements. Smaller models can also perform well when fine-tuned, and many researchers have investigated training them on synthetic datasets generated by larger LLMs. InfoQ recently covered Google's Distilling Step-by-Step method, which prompts a teacher LLM to automatically generate a small fine-tuning dataset containing an input, an output label, and a "rationale" for why that label was chosen. InfoQ also covered Stability AI's Stable Beluga model, which was trained using Microsoft's original Orca 1 scheme, Explanation Tuning, in which the teacher LLM is prompted to "generate detailed answers."
Like Orca 1, the Orca 2 training dataset is generated by a teacher LLM that is given a detailed prompt. However, the new approach, which Microsoft dubs Cautious Reasoning, pairs training tasks with prompts that direct the teacher to use a specific problem-solving strategy, such as "step-by-step" or "explain your answer." Then, during training of the student, the teacher's prompt is erased, which pushes the student to learn to pick the correct strategy on its own.
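A minimal sketch of this data-generation idea in Python follows. Note that the STRATEGIES table and the teacher.generate helper are hypothetical stand-ins for illustration, not APIs from the Orca 2 paper:

```python
# Illustrative sketch of Prompt Erasure / Cautious Reasoning data generation.
# `teacher` stands in for a powerful LLM such as GPT-4; its `generate` method
# is a hypothetical helper, not an API from the paper.

STRATEGIES = {
    "math_word_problem": "Solve this step by step, showing your work.",
    "reading_comprehension": "Answer directly and concisely.",
}

def make_student_example(task_type: str, question: str, teacher) -> dict:
    # The dataset curator pairs each task with a strategy prompt for the teacher...
    strategy_prompt = STRATEGIES[task_type]
    answer = teacher.generate(system=strategy_prompt, user=question)
    # ...but the strategy prompt is erased from the student's training pair,
    # so the student must learn which reasoning style fits each kind of task.
    return {"input": question, "target": answer}
```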
To evaluate the methodology, Microsoft compared Orca 2 model performance against several baseline models, including Llama 2, ChatGPT (GPT-3.5), and GPT-4. The benchmark tasks included reasoning, language understanding, text completion, and summarization. On the reasoning benchmarks, the 13B parameter Orca 2 model outperformed all baselines except ChatGPT and GPT-4. The researchers also found that giving Orca 2 a "cautious" system prompt ("You are a cautious assistant. You carefully follow instructions.") yielded a small performance boost compared to an empty system prompt.
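For illustration, that cautious system message slots into the ChatML-style prompt template documented on the Orca 2 model card (a minimal sketch; the example question is invented):

```python
# Illustrative only: Orca 2 uses a ChatML-style template with
# <|im_start|>/<|im_end|> markers, per the model card on Hugging Face.
system_message = "You are a cautious assistant. You carefully follow instructions."
user_message = "A farmer has 17 sheep and all but 9 run away. How many are left?"

prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{user_message}<|im_end|>\n"
    "<|im_start|>assistant"
)
```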
Several users posted about Orca 2 on X. One noted that "[Y]ou do not need to prompt it with tricks like 'explain step by step.' It just knows." AI researcher Rudi Ranck wrote:
Many brilliant ideas are so simple…Like "Prompt Erasure" in Orca 2: Instead of presenting the entire prompt, only the task and the answer are shown to the model (it filters the full prompt used to generate those answers). It helps the model to strategize at a higher level. Such a nice paper. I highly recommend reading it all the way through.
The 7B and 13B parameter Orca 2 models are available on Hugging Face.
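For readers who want to experiment, the snippet below is a minimal sketch of loading and querying one of the checkpoints with the Hugging Face transformers library. It assumes the published model IDs (microsoft/Orca-2-7b and microsoft/Orca-2-13b) and a GPU with enough memory for fp16 weights:

```python
# Minimal sketch of running Orca 2 via Hugging Face transformers.
# Assumes the published checkpoints and the `accelerate` package
# (for device_map="auto"); not an official usage guide.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-7b"  # or "microsoft/Orca-2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Same ChatML-style template as above, with the "cautious" system message.
prompt = (
    "<|im_start|>system\nYou are a cautious assistant. "
    "You carefully follow instructions.<|im_end|>\n"
    "<|im_start|>user\nWhat is 17 * 24? Explain your answer.<|im_end|>\n"
    "<|im_start|>assistant"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```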
