Law of the Weakest Link: Cross Capabilities of Large Language Models

Llama Team, AI @ Meta; University of Illinois Urbana-Champaign
The development and evaluation of Large Language Models (LLMs) have primarily focused on individual capabilities. Typically, developers construct specialized datasets tailored to distinct abilities and then train models on a blend of these data sources. However, this approach overlooks the intersection of multiple capabilities across different types of expertise that real-world tasks often require, which we term cross capabilities.

In this project, we systematically explore cross capabilities in LLMs, step by step.

What Are Cross Capabilities?

  • Examples:
    • Consider a user prompt asking, “Which direction has the total rainfall in Tokyo, Japan been trending over the past 10 years?” Such a task requires the integration of tool use (web browsing) with analytical reasoning.
    • When a developer provides HTML and JavaScript code and asks, “Give me a basic understanding of what this web app does,” the model must combine long-context comprehension with coding expertise.
  • Definition:
    • We define these scenarios as cross capabilities—the intersection of multiple distinct capabilities across different types of expertise necessary to address complex, real-world tasks.

Taxonomy for Individual and Cross Capabilities

We start by identifying seven core individual capabilities of LLMs and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy.

  • Individual Capabilities:
    • English
    • Reasoning
    • Coding
    • Image Recognition
    • Tool Use
    • Long Context
    • Spanish
  • Cross Capabilities:
    • Coding & Reasoning
    • Image Recognition & Reasoning
    • Tool Use & Coding
    • Tool Use & Reasoning
    • Long Context & Coding
    • Spanish & Reasoning
    • Spanish & Image Recognition
  • Taxonomy:
    • As illustrated in the Figure, these taxonomies follow a hierarchical design: the root node represents either an individual or cross capability, with the next two layers (Level-1 and Level-2 categories) breaking these down into increasingly specific tasks.
    • This framework clearly distinguishes between tasks that rely on an individual capability and those that demand the integration of multiple abilities, allowing for a comprehensive evaluation of LLMs across various scenarios.
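
To make the hierarchical design concrete, the sketch below encodes a small slice of such a taxonomy as a nested Python structure; the Level-1 and Level-2 category names are hypothetical placeholders rather than the actual categories used in CrossEval.

```python
# Minimal sketch of the hierarchical taxonomy: the root is an individual or
# cross capability, Level-1 nodes group related task types, and Level-2 nodes
# name increasingly specific tasks.
# NOTE: the category names below are illustrative placeholders, not the
# actual categories from the CrossEval taxonomy.
taxonomy = {
    "Coding & Reasoning": {                       # root: a cross capability
        "Code Analysis": [                        # Level-1 (hypothetical)
            "Explain algorithmic complexity",     # Level-2 (hypothetical)
            "Trace program execution",
        ],
        "Constrained Code Generation": [
            "Implement a specified algorithm",
            "Optimize code for a stated goal",
        ],
    },
}

def level2_categories(root: str) -> list[str]:
    """Flatten all Level-2 categories under a given root capability."""
    return [leaf for leaves in taxonomy[root].values() for leaf in leaves]

print(level2_categories("Coding & Reasoning"))
```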

CrossEval: Benchmarking LLM Cross Capabilities

To benchmark the cross capabilities of LLMs, we introduce the CrossEval benchmark, which includes:
  • Prompts: 1,400 expert-annotated prompts, 100 per capability
  • Categories: each prompt is assigned a Level-1 and a Level-2 category from the corresponding taxonomy
  • Difficulty Levels: 10% easy, 30% medium, and 60% hard prompts within each capability
  • Reference Examples:
    • Responses: 3 model responses per prompt
    • Expert reviews: 2 human ratings with explanations per model response
    • A total of 4,200 model responses and 8,400 expert reviews
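
For illustration, a single CrossEval entry could be represented roughly as follows; the field names here are our own assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of one CrossEval entry with hypothetical field names
# (the released dataset's schema may differ).
@dataclass
class ExpertReview:
    rating: float        # score assigned by a human expert
    explanation: str     # written justification for the rating

@dataclass
class ReferenceResponse:
    model_response: str
    reviews: list[ExpertReview] = field(default_factory=list)  # 2 per response

@dataclass
class CrossEvalExample:
    capability: str      # e.g., "Coding & Reasoning"
    level1: str          # Level-1 category from the taxonomy
    level2: str          # Level-2 category from the taxonomy
    difficulty: str      # "easy" (10%), "medium" (30%), or "hard" (60%)
    prompt: str
    references: list[ReferenceResponse] = field(default_factory=list)  # 3 per prompt

# 1,400 prompts x 3 responses = 4,200 responses; x 2 reviews each = 8,400 reviews.
```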

Building LLM-based Evaluators

CrossEval is the largest meta-evaluation benchmark for measuring the correlation between LLM-based scoring and human judgments. With each prompt including 3 reference model responses and 6 human ratings, we can explore how to develop the most effective in-domain LLM evaluator for this benchmark.
  • Prompting LLMs for Evaluation:
    • Multi-reference-based prompting: When using LLM-as-a-Judge, up to two reference responses, along with their ratings and explanations, are provided as context. For instance, when evaluating the first response, the LLM can be given the other two responses together with their four ratings.
    • Point-deduction-based prompting: The LLM-as-a-Judge paradigm tends to favor longer, more structured responses, which inflates evaluation scores. To address this, instead of assigning scores directly, LLMs are instructed to summarize the issues in both the reference examples and the response under evaluation, specifying the points deducted for each (see the sketch after this list).
  • Correlations between LLM ratings and human judgments:
    • Each LLM shows particular strengths in evaluating different capabilities.
    • With our reference examples and prompting methods, LLM evaluators achieve a Pearson correlation of nearly 0.7 with expert annotators' judgments on CrossEval.
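
The sketch below shows one way an evaluator prompt combining both ideas could be assembled: up to two reference responses with their ratings and explanations, plus a point-deduction instruction. The template wording and data layout are our assumptions, not the exact prompt used for CrossEval.

```python
# Sketch of an LLM-as-a-Judge prompt that (a) provides up to two reference
# responses with their human ratings and explanations and (b) asks for point
# deductions instead of a direct score. The wording is an assumption, not the
# exact CrossEval evaluation prompt.
def build_judge_prompt(user_prompt: str, references: list[dict], candidate: str) -> str:
    parts = [
        "You are grading a model response on a 1-5 scale.",
        f"User prompt:\n{user_prompt}\n",
    ]
    for i, ref in enumerate(references[:2], start=1):       # up to two references
        parts.append(f"Reference response {i}:\n{ref['response']}")
        for review in ref["reviews"]:                       # two expert ratings each
            parts.append(f"- Expert rating {review['rating']}: {review['explanation']}")
        parts.append("")
    parts.append(f"Response to evaluate:\n{candidate}\n")
    parts.append(
        "Summarize the issues in the references and in the response under "
        "evaluation, specify the points deducted for each issue, then give "
        "the final score."
    )
    return "\n".join(parts)
```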

  • Ablation study on the number of reference examples:
    • As shown in the Figure, a clear trend emerges: as the number of reference examples increases, all three correlation metrics improve significantly.
    • Notably, when evaluating new model responses in our benchmark, we provide all three reference examples, which could potentially lead to even higher correlations.
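
Meta-evaluation then reduces to correlating evaluator scores with expert ratings over the same responses. A minimal sketch with SciPy follows; since the three correlation metrics are not named here, Pearson, Spearman, and Kendall's tau are shown as a common assumption, and the numbers are made up.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Meta-evaluation sketch: correlate LLM-evaluator scores with human ratings
# for the same set of responses. Scores below are made-up examples.
llm_scores   = [4.0, 3.5, 2.0, 5.0, 1.5, 3.0]
human_scores = [4.5, 3.0, 2.5, 5.0, 2.0, 3.5]

print("Pearson: ", pearsonr(llm_scores, human_scores)[0])
print("Spearman:", spearmanr(llm_scores, human_scores)[0])
print("Kendall: ", kendalltau(llm_scores, human_scores)[0])
```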

Law of the Weakest Link in LLMs

The full results of 17 LLMs from 5 model families are provided in the Table. Our experiments reveal several key findings:
  • CrossEval effectively differentiates advanced models:
    • The CrossEval benchmark successfully distinguishes between state-of-the-art LLMs.
    • For instance, the four Claude model variants achieve progressively higher scores in the Reasoning capability: 56.81, 62.88, 66.22, and 71.54.
  • LLMs exhibit a “Law of the Weakest Link” effect in cross capabilities:
    • In cross-capability evaluations, we label one of the involved individual capabilities as stronger and the other as weaker when the absolute score difference between them exceeds 3 points (see the classification sketch after this list).
    • Of the 58 cross-capability scenarios where this difference is present, 38 cases show performance lower than both individual capabilities (red background), and 20 show performance between the two but closer to the weaker capability (blue background).
    • Notably, no cross-capability score ever approaches or exceeds the stronger individual capability.
  • Tool Use is currently the most challenging capability for LLMs:
    • Our prompt set includes tasks involving web browsing and code interpretation, and Llama 3.1 is the only model family that currently supports both.
    • However, its Tool Use scores are significantly lower than its scores on other capabilities, indicating a critical area for improvement.
  • LLMs underperform in cross-capability tasks:
    • Despite our efforts to maintain a consistent difficulty level across both individual and cross-capability tasks, LLMs generally perform worse on tasks requiring multiple capabilities.
    • Across all models, the average score for individual capabilities is 65.72, compared to 58.67 for cross capabilities, revealing a significant performance gap.
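
The classification used above can be written down directly. The sketch below implements the 3-point threshold and the red/blue labeling; the function and label names are ours.

```python
def classify_cross_capability(score_a: float, score_b: float, cross: float):
    """Label a cross-capability score relative to its two individual capabilities.

    Returns None when the two individual scores differ by 3 points or less,
    since no stronger/weaker distinction is made in that case. The label
    names are ours, not the paper's.
    """
    if abs(score_a - score_b) <= 3:
        return None
    weaker, stronger = sorted((score_a, score_b))
    if cross < weaker:
        return "below_both"            # red background in the results table
    if cross < (weaker + stronger) / 2:
        return "closer_to_weaker"      # blue background in the results table
    return "closer_to_stronger"        # not observed in the reported results

print(classify_cross_capability(70.0, 60.0, 58.0))  # below_both
print(classify_cross_capability(70.0, 60.0, 63.0))  # closer_to_weaker
```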

  • “Law of the Weakest Link” effect is evaluator-agnostic:
    • The “Law of the Weakest Link” holds regardless of the evaluator used. With GPT-4o as the evaluator, the density of cross-capability scores peaks slightly below the weaker capability, while with Claude 3.5 Sonnet it peaks slightly above it. In both cases, however, performance clusters closely around the weaker capability.
    • This effect suggests that deficiencies in any individual capability can substantially limit performance on every cross-capability task involving that capability.
    • The CrossEval benchmark provides a foundation for identifying LLM weaknesses, but further research is needed to more comprehensively diagnose and address these deficiencies without compromising other capabilities.

Case Study on Individual-Capability Alterations

Beyond evaluating the relationship between individual and cross capabilities of LLMs on CrossEval, we explore a crucial follow-up question: when we adjust the performance of specific capabilities, how does this affect cross-capability performance? To explore this, we propose a prompting method designed to modulate specific capabilities of LLMs and then present case studies on two LLMs to illustrate the effects of these alterations.
  • Principle-based System Prompting:
    • To reliably explore the impact of altering individual capabilities, we aim to enhance a specific capability without significantly affecting others.
    • Our solution is a principle-based method that iteratively refines the system prompt to enhance a specific capability. It builds on the CrossEval dataset to selectively boost individual capabilities (a simplified sketch follows this list).
  • Investigating the impact of individual-capability alterations:
    • Principle-based system prompting is particularly effective in enhancing weaker capabilities.
    • “Law of the Weakest Link” effect persists after individual-capability alterations.
      • Altering the weaker capability in a cross-capability scenario has a significant effect on overall performance, while changes to the stronger capability result in only minor adjustments.
      • In 10 out of the 18 cross-capability scores examined across the two models, we observe one individual capability improving while the other declines. Notably, in 90% of these cases, changes in cross-capability performance closely follow the trends of the weaker capability.
      • Our case study therefore confirms that, even after shifts in individual capabilities, cross-capability performance continues to conform to the “Law of the Weakest Link” effect.
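
As a rough illustration of the principle-based refinement loop described above, the sketch below iterates between scoring a target capability on CrossEval prompts and folding distilled principles back into the system prompt. Every helper here is a hypothetical stand-in; this is not the paper's actual implementation.

```python
# Highly simplified, hypothetical sketch of principle-based system prompting:
# iteratively refine the system prompt with principles distilled from the
# weakest responses on one target capability. All helpers are stand-ins.

def evaluate_on_crosseval(system_prompt: str, prompts: list[str]) -> list[tuple[str, float]]:
    """Stand-in: return (response, score) pairs for the target capability's prompts."""
    raise NotImplementedError("plug in your model plus an LLM evaluator here")

def extract_principles(weak_responses: list[tuple[str, float]]) -> list[str]:
    """Stand-in: distill guidance (e.g., 'verify each intermediate step') from weak responses."""
    raise NotImplementedError("plug in an LLM call that proposes principles")

def refine_system_prompt(prompts: list[str], n_rounds: int = 3,
                         system_prompt: str = "You are a helpful assistant.") -> str:
    for _ in range(n_rounds):
        scored = evaluate_on_crosseval(system_prompt, prompts)
        worst = sorted(scored, key=lambda pair: pair[1])[:10]   # lowest-scoring responses
        principles = extract_principles(worst)
        system_prompt += "\nFollow these principles:\n" + "\n".join(f"- {p}" for p in principles)
    return system_prompt
```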

BibTeX

@article{zhong2024law,
      title={Law of the Weakest Link: Cross Capabilities of Large Language Models},
      author={Zhong, Ming and Zhang, Aston and Wang, Xuewei and Hou, Rui and Xiong, Wenhan and Zhu, Chenguang and Chen, Zhengxing and Tan, Liang and Bi, Chloe and Lewis, Mike and Popuri, Sravya and Narang, Sharan and Kambadur, Melanie and Mahajan, Dhruv and Edunov, Sergey and Han, Jiawei and van der Maaten, Laurens},
      journal={arXiv preprint arXiv:2409.19951},
      year={2024}
}