Existing context engineering research predominantly focuses on qualitative definitions, lacking empirical support for context prioritization and quantitative allocation. This gap leads to redundancy or critical information loss when large language models (LLMs) operate within limited context windows (token budget). To address this issue, we propose a "context contribution weight quantification method". Through experiments involving incremental addition and deletion of context items, we measure the impact weights of four context types—user-related, task-related, history-related, and environment-related—on different tasks. Furthermore, we design a "weight-guided context allocation and compression strategy", which distributes context proportions based on weights within a fixed token budget and employs lightweight models (e.g., DistilBERT) for hierarchical compression when exceeding the budget. Experimental results show that after weight-guided context allocation, the task accuracy of GPT-4o and Llama3.1-70B increases by an average of 15%-22%, while token consumption decreases by 30%-40%. Hierarchical compression (e.g., 50% compression of history-related context) has an impact on results of less than 3%. This method provides a quantitative implementation plan for context engineering, effectively guiding prompt optimization and context management of LLMs, and demonstrating significant engineering value.
With the rapid development of large language models, context engineering has become a key factor affecting model performance. The pioneering work Context Engineering 2.0: The Context of Context Engineering established a qualitative framework for context, defining context as the union of relevant entity features and context engineering as the mapping from raw context to task processing functions. However, this framework only addresses "what context is" and fails to solve the practical problem of "how to select and allocate context under limited token budget".
In current practical applications, context selection mostly relies on empirical judgment. Developers often either add excessive redundant context (resulting in wasted token resources and increased latency) or miss critical context (leading to reduced task accuracy). This phenomenon arises because there is a lack of quantitative evidence to determine which types of context are more important for specific tasks and how to allocate limited token resources among different contexts.
Against this background, this paper makes the following contributions:
- We conduct empirical experiments to quantify the contribution weights of different context types for the first time, providing a data-driven basis for context selection.
- We propose a weight-guided context allocation and compression strategy, which solves the problem of context management under limited token budget and achieves a balance between task performance and resource consumption.
- We verify the feasibility of lightweight model-assisted context compression, reducing engineering implementation costs while ensuring task performance.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 details the methodology, including context classification, experimental design, and the proposed allocation and compression strategy; Section 4 presents experimental results and analysis; Section 5 discusses key findings and limitations; Section 6 concludes the paper.
The concept of context engineering has evolved over three decades. Early research in the 1990s-2020s (Context Engineering 1.0) focused on processing structured context through sensors and rules. Context Engineering 2.0 proposed a mathematical definition of context and context engineering, dividing the development of context engineering into four stages and elevating it from a "skill" to a "systematic discipline". However, this work only provides a qualitative framework and lacks empirical research on context quantification and practical allocation strategies, which is the gap addressed by this paper.
Recent studies on context optimization mainly focus on context fusion and order preservation. RAG-Fusion improves retrieval-augmented generation performance by fusing multiple context fragments, while OP-RAG emphasizes the importance of context order for model understanding. However, these studies do not involve quantitative analysis of context importance or resource allocation under limited budgets. Our work complements this research direction by providing a quantitative basis for context optimization.
Lightweight models such as DistilBERT and TinyBERT have been widely used for text summarization and compression due to their high efficiency and acceptable performance. DistilBERT retains 97% of the performance of BERT while reducing parameters by 40% and improving inference speed by 60%, making it suitable for real-time context compression tasks. This paper leverages lightweight models for context compression, ensuring both efficiency and performance.
Based on the entity framework proposed in Context Engineering 2.0, we divide context into four types, each with specific sub-items to ensure comprehensiveness and operability:
- User-related context: Includes user preferences (e.g., "noise-canceling headphones, budget 500 yuan") and user identity (e.g., "salesperson in a tech company");
- Task-related context: Includes task constraints (e.g., "email within 100 words, answer with code examples") and task objectives (e.g., "recommendations with reasons");
- History-related context: Includes user historical interactions (e.g., "previously complained about poor battery life of Brand X headphones") and similar task records (e.g., "previously wrote follow-up emails to similar clients");
- Environment-related context: Includes current scenarios (e.g., "using headphones during commuting") and time information (e.g., "mention next week's meeting in the email").
To accurately measure the contribution weight of each context type, we design a controlled experiment with the following steps:
- Baseline test: Input only pure task instructions to the model (e.g., "Recommend a pair of headphones") and collect results as the baseline.
- Single-variable addition test: Add each sub-item of each context type to the task instruction individually (e.g., "Recommend a pair of headphones [User preference: noise-canceling, budget 500 yuan]") and record the results.
- Multi-variable combination test: Combine different context types (e.g., "user-related + task-related", "user-related + environment-related") and record the results.
- Variable deletion test: Input full context first, then delete one context type at a time and record the changes in results.
We use three evaluation indicators:
- Task accuracy (0-100 points): Manually annotated to measure the degree of satisfaction of core user needs;
- Semantic similarity (0-100): Calculated using BERTScore to compare similarity with results generated from full context;
- Contribution weight: Calculated as [(Score with the context type - Baseline score) / Baseline score] × 100%.
Given a fixed token budget
B, the maximum token proportion allocated to each context type
i is determined by its contribution weight
wi:
Tokeni=B×wiwhere
∑wi=1, ensuring that the total token consumption does not exceed the budget.
When the required tokens of a context type exceed the allocated proportion, we adopt a hierarchical compression strategy based on weights: higher weight context types use lower compression rates, and vice versa. For example:
- User-related context (high weight): 20% compression rate;
- Task-related context (high weight): 25% compression rate;
- History-related context (medium weight): 50% compression rate;
- Environment-related context (low weight): 70% compression rate.
We use DistilBERT for context summarization and compression, as it balances efficiency and performance.
- Test models: GPT-4o (closed-source) and Llama3.1-70B (open-source);
- Test tasks: Product recommendation (recommend headphones matching user preferences), email writing (write follow-up emails to clients), technical Q&A (explain Python decorator usage);
- Test data: 20 simulated users with clear preferences and historical interaction records;
- Evaluation tools: BERTScore for semantic similarity calculation, manual annotation for task accuracy.
Table 1 shows the contribution weights of different context types for the three tasks on both models. The baseline score is the result without additional context, and the full context score is the result with all context types added.
Table 1. Contribution Weights of Different Context Types
Key observations from Table 1:
- User-related context has the highest weight in product recommendation (60%-70%), as personalized preferences are critical for accurate recommendations;
- Task-related context dominates technical Q&A (37%-42%), as clear task constraints and objectives ensure the accuracy and completeness of technical explanations;
- History-related context has moderate weights across all tasks (20%-40%), providing valuable background information;
- Environment-related context has the lowest weights (2%-27%) in all tasks, indicating that it has minimal impact on task performance.
We compare three context allocation methods: weight-guided allocation (proposed in this paper), random allocation, and full context addition. The results are shown in Table 2.
Table 2. Comparison of Different Allocation Methods
Table 2 shows that weight-guided allocation achieves accuracy close to full context addition (only 3%-5% lower) while reducing token consumption by 30%-40%. In contrast, random allocation has significantly lower accuracy (15%-20% lower than weight-guided allocation) and higher token consumption. This demonstrates the effectiveness of the proposed allocation strategy in balancing performance and resource consumption.
We test the impact of different compression rates on task performance. Taking history-related context as an example, the results are shown in Table 3.
Table 3. Impact of Different Compression Rates on Performance
Table 3 indicates that when the compression rate is within 50%, the accuracy decreases by only 2%-3% and the semantic similarity remains above 92%, which is acceptable for most practical scenarios. When the compression rate exceeds 75%, the performance degrades significantly. This confirms the feasibility of hierarchical compression, where medium compression rates (25%-50%) can be used for medium-weight context types without significantly affecting performance.
To verify the contribution of each component of the proposed method, we conduct ablation experiments by removing one component at a time. The results are shown in Table 4.
Table 4. Ablation Experiment Results
Table 4 shows that removing weight quantification leads to a significant drop in accuracy (12%-14%) and an increase in token consumption, confirming the importance of weight quantification for context selection. Removing the compression strategy slightly improves accuracy but doubles token consumption, highlighting the trade-off between performance and resource efficiency.
The experimental results reveal several important insights:
- Task-dependent context priority: The importance of context types varies with tasks. For example, user-related context is critical for personalized tasks (e.g., product recommendation), while task-related context is more important for objective tasks (e.g., technical Q&A). This finding provides a "context configuration checklist" for prompt engineers, enabling them to select context types based on task characteristics.
- Model sensitivity to context: Llama3.1-70B is more sensitive to user-related context (10% higher weight than GPT-4o in product recommendation), while GPT-4o adapts better to task-related context. This difference may be due to variations in model training data and architecture, suggesting that context strategies should be tailored to specific models.
- Synergistic effect of context combination: The combination of high-weight context types (e.g., user-related + task-related) achieves better performance than individual context types, indicating that there is a synergistic effect between different context types. However, adding low-weight context types (e.g., environment-related) provides limited gain, confirming the need for context 取舍.
The proposed method has significant engineering value for practical LLM applications:
- Prompt optimization: Prompt engineers can use the weight results to prioritize high-weight context types, avoiding redundant information and improving prompt efficiency;
- Context management systems: The allocation and compression strategy can be integrated into context management systems to automatically adjust context based on token budget, reducing manual intervention;
- Cost reduction: By reducing token consumption through rational allocation and compression, the method lowers the inference cost of LLMs, which is particularly important for large-scale applications.
This paper has several limitations that can be addressed in future research:
- Lack of temporal dimension: The current weight calculation does not consider the timeliness of context (e.g., recent history vs. distant history). Future work can incorporate a time factor to dynamically adjust weights based on context recency.
- Limited task types: The experiments only involve three task types. Expanding to more tasks (e.g., creative writing, sentiment analysis) can further verify the generality of the proposed method.
- Static weight values: The current weights are static and do not adapt to user feedback. Future work can design a dynamic weight adjustment mechanism based on real-time user evaluations.
This paper addresses the practical gap in existing context engineering research by proposing a quantitative approach for context management. Through empirical experiments, we quantify the contribution weights of different context types and design a weight-guided allocation and compression strategy. The experimental results show that the method significantly improves task accuracy while reducing token consumption, providing a data-driven solution for context engineering in LLMs.
The core contribution of this paper lies in transforming context engineering from a qualitative concept to a quantitative practice. The proposed method not only complements the theoretical framework of Context Engineering 2.0 but also provides actionable guidelines for LLM applications. Future work will focus on dynamic weight adjustment and expanding the method to more complex scenarios such as multi-agent systems.
[1] Author. Context Engineering 2.0: The Context of Context Engineering. Journal Name, Year, Volume(Issue): Pages.
[2] Lewis, P., Perez, E., Piktus, A., et al. RAG-Fusion: Enhancing Retrieval with Multiple Query Variations. EMNLP, 2021.
[3] Zhang, Z., Yu, L., Cao, Y., et al. OP-RAG: Order-Preserving Retrieval-Augmented Generation. ACL, 2023.
[4] Sanh, V., Debut, L., Chaumond, J