Understanding Emergent Abilities of Language Models from the Loss Perspective
Recent studies have called into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities, and 2) there are doubts about the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities through the lens of pre-training loss, instead of model size or training compute. We demonstrate that models with the same pre-training loss, but different model and data sizes, achieve the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks -- regardless of the continuity of metrics -- when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.
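The threshold behavior described here lends itself to a simple fitting exercise. Below is a minimal sketch, on synthetic data and not the paper's code, of estimating such an emergence threshold: accuracy is modeled as flat at the random-guess baseline above a loss threshold and linearly improving below it, with the threshold found by grid search.

```python
import numpy as np

def piecewise_acc(loss, l0, baseline, slope):
    """Flat at the random-guess baseline for loss >= l0; linear gain below l0."""
    return np.where(loss >= l0, baseline, baseline + slope * (l0 - loss))

# Synthetic (pre-training loss, accuracy) pairs standing in for models of
# varying size and data budget; the true threshold is 2.2, baseline 0.25.
rng = np.random.default_rng(0)
loss = np.linspace(1.5, 3.5, 40)
acc = piecewise_acc(loss, 2.2, 0.25, 0.5) + rng.normal(0, 0.01, size=loss.size)

# Grid-search the threshold; for each candidate l0 the best baseline and
# slope follow from ordinary least squares on the feature max(l0 - loss, 0).
best = None
for l0 in np.linspace(1.6, 3.4, 181):
    x = np.maximum(l0 - loss, 0.0)
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, acc, rcond=None)
    err = float(((A @ coef - acc) ** 2).sum())
    if best is None or err < best[0]:
        best = (err, l0, coef)

_, l0_hat, (baseline_hat, slope_hat) = best
print(f"estimated threshold: {l0_hat:.2f}, random-guess baseline: {baseline_hat:.2f}")
```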
Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration
We identify two crucial limitations in the evaluation of Parallel Context Windows (PCW), a recently proposed parallel integration method that extends the maximum context length of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, the weighted sum ensemble, is missing from the evaluation of in-context few-shot classification. Moreover, on more challenging Chain-of-Thought (CoT) reasoning tasks (e.g., HotpotQA), PCW exhibits unexpected deterioration in the form of question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not deliver sufficient improvement or practicality for handling lengthy documents in real-world applications. More community effort should be devoted to enabling language models' long-context understanding.
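For illustration, here is a minimal sketch of a weighted sum ensemble baseline of the kind referred to above: the query is scored against each demonstration window separately, and the per-window class probabilities are combined by a weighted sum. The `class_logprobs` helper is a hypothetical stand-in for a real language model call, and the uniform weighting is an assumption.

```python
import numpy as np

def class_logprobs(window_prompt: str, query: str, labels: list[str]) -> np.ndarray:
    """Hypothetical stand-in: return the LM's log-probability of each label
    given `window_prompt + query`. Replace with a real model call."""
    rng = np.random.default_rng(abs(hash((window_prompt, query))) % 2**32)
    return np.log(rng.dirichlet(np.ones(len(labels))))

def weighted_sum_ensemble(windows, query, labels, weights=None):
    # One independent forward pass per demonstration window, then a weighted
    # sum of the per-window class probabilities (uniform by default).
    probs = np.stack([np.exp(class_logprobs(w, query, labels)) for w in windows])
    if weights is None:
        weights = np.full(len(windows), 1.0 / len(windows))
    combined = (np.asarray(weights)[:, None] * probs).sum(axis=0)
    return labels[int(np.argmax(combined))]

windows = ["demo set A ...\n", "demo set B ...\n", "demo set C ...\n"]
print(weighted_sum_ensemble(windows, "Review: great movie. Sentiment:", ["positive", "negative"]))
```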
AgentTuning: Enabling Generalized Agent Abilities for LLMs
Open large language models (LLMs) with strong performance on various tasks have significantly advanced the development of the field. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy that combines AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enhances LLMs' agent capabilities without compromising their general abilities. AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open-source AgentInstruct and the AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving as open and powerful alternatives to commercial LLMs for agent tasks.
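A hybrid instruction-tuning mixture of the kind described can be sketched as sampling from the two sources at a fixed ratio. The sketch below is illustrative only; the mixing ratio and the record format are assumptions, not the paper's recipe.

```python
import random

def mix_datasets(agent_data, general_data, agent_ratio, size, seed=0):
    """Sample a training set in which roughly `agent_ratio` of the examples
    come from agent trajectories and the rest from general instructions."""
    rng = random.Random(seed)
    n_agent = int(size * agent_ratio)
    mixed = rng.choices(agent_data, k=n_agent) + \
            rng.choices(general_data, k=size - n_agent)
    rng.shuffle(mixed)
    return mixed

# Hypothetical records; real data would be chat-style instruction examples.
agent_data = [{"source": "AgentInstruct", "id": i} for i in range(100)]
general_data = [{"source": "general", "id": i} for i in range(100)]
train_set = mix_datasets(agent_data, general_data, agent_ratio=0.2, size=10)
print(sum(ex["source"] == "AgentInstruct" for ex in train_set), "agent examples of", len(train_set))
```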
Incidence of and factors associated with early non-response in first-treatment and drug-naïve patients with schizophrenia: a real-world study
Background: Schizophrenia is a severe and persistent mental condition that causes disability. For subsequent clinical care, it is extremely practical to effectively differentiate between patients who respond to therapy quickly and those who do not. This study set out to document the incidence of and risk factors for early non-response.
Methods: The study included 143 individuals with first-treatment and drug-naïve (FTDN) schizophrenia. Patients were classified as early non-responders if their Positive and Negative Syndrome Scale (PANSS) score fell by less than 20% after 2 weeks of treatment, and as early responders otherwise. Differences in demographic and general clinical data between the clinical subgroups were compared, and variables related to early non-response to therapy were examined.
Results: After two weeks, a total of 73 patients were classified as early non-responders, an incidence of 51.05%. The early non-response subgroup had significantly higher PANSS scores, positive symptom subscale (PSS) scores, general psychopathology subscale (GPS) scores, Clinical Global Impression - Severity of Illness (CGI-SI) scores, and fasting blood glucose (FBG) levels than the early-response subgroup. CGI-SI and FBG were risk factors for early non-response.
Conclusion: High rates of early non-response were observed in FTDN schizophrenia patients, and CGI-SI scores and FBG levels are risk variables for predicting early non-response. However, more in-depth studies are needed to confirm the generalizable range of these two parameters.
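The classification rule is straightforward to state in code. A minimal sketch, assuming the percentage reduction is computed directly from the raw PANSS totals:

```python
def is_early_non_responder(panss_baseline: float, panss_week2: float) -> bool:
    """Early non-responder: PANSS total falls by less than 20% after 2 weeks."""
    reduction = (panss_baseline - panss_week2) / panss_baseline
    return reduction < 0.20

print(is_early_non_responder(100, 85))  # 15% reduction -> True (non-responder)
print(is_early_non_responder(100, 75))  # 25% reduction -> False (early responder)
```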
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Although large language models (LLMs) demonstrate impressive performance on many language tasks, most of them can only handle texts a few thousand tokens long, limiting their application to longer sequence inputs such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long-context capabilities through extended context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored to evaluating long context understanding have been lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas, including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. From a comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) The commercial model (GPT-3.5-Turbo-16k) outperforms the open-sourced models, but still struggles on longer contexts. (2) Scaled position embeddings and fine-tuning on longer sequences lead to substantial improvement in long context understanding. (3) Context compression techniques such as retrieval bring improvement for models weak at long contexts, but their performance still lags behind models with strong long-context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.
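As a usage illustration, the unified format allows an evaluation loop like the sketch below. The field names follow the schema documented in the LongBench repository; `generate` is a hypothetical stand-in for a real model call, and loading may require a `datasets` version that still supports dataset scripts.

```python
from datasets import load_dataset  # pip install datasets

def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real LLM."""
    return ""

# LongBench is distributed as a script-based dataset on the Hugging Face Hub.
data = load_dataset("THUDM/LongBench", "hotpotqa", split="test", trust_remote_code=True)

for example in data.select(range(3)):
    # Unified-format fields per the repo's documentation: context, input, answers.
    prompt = f"{example['context']}\n\nQuestion: {example['input']}\nAnswer:"
    print(generate(prompt), "| gold:", example["answers"])
```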
ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback
ChatGLM is a free-to-use AI service powered by the ChatGLM family of large
language models (LLMs). In this paper, we present the ChatGLM-RLHF pipeline --
a reinforcement learning from human feedback (RLHF) system -- designed to
enhance ChatGLM's alignment with human preferences. ChatGLM-RLHF encompasses
three major components: the collection of human preference data, the training
of the reward model, and the optimization of policies. Throughout the process
of integrating ChatGLM-RLHF into production, we encountered and addressed
several unprecedented challenges. We introduce strategies to mitigate reward variance for stabilized large-scale training, implement model parallelism with fused gradient descent, and design regularization constraints to avoid catastrophic forgetting in LLMs. Experiments show that ChatGLM-RLHF brings significant improvements in alignment tasks compared to the supervised fine-tuned (SFT) version of ChatGLM. For instance, it achieves on average 15% more wins against ChatGLM-SFT on Chinese alignment tasks. This work presents our practices of aligning LLMs with human preferences, offering insights into the challenges and solutions of RLHF implementation.
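One standard way to realize such a regularization constraint, shown in the sketch below, is a KL penalty that keeps the policy close to the frozen SFT reference during RL. This is a generic PPO-style shaped reward written under that assumption, not ChatGLM-RLHF's exact formulation.

```python
import torch

def shaped_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """reward: (batch,) scalars from the reward model.
    policy_logprobs / ref_logprobs: (batch, seq) per-token log-probs of the
    sampled responses under the policy and the frozen SFT reference."""
    per_token_kl = policy_logprobs - ref_logprobs  # Monte Carlo KL estimate
    return reward - kl_coef * per_token_kl.sum(dim=-1)

# Toy values: two responses of eight tokens each.
r = torch.tensor([1.2, -0.3])
policy_lp = -torch.rand(2, 8)
ref_lp = -torch.rand(2, 8)
print(shaped_reward(r, policy_lp, ref_lp))
```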
Integrating life cycle assessment and a farmer survey of management practices to study environmental impacts of peach production in Beijing, China
While intensive peach production has expanded rapidly in recent years, few studies have explored the environmental impacts associated with specific regional systems or the optimal management strategies to minimize the associated environmental risks. Here, data from a survey of 290 local farmers were used to conduct a life cycle assessment quantifying the acidification potential (AP), global warming potential (GWP), eutrophication potential (EP), and reactive nitrogen (Nr) losses of peach production in Pinggu District, Beijing. Total annual Nr losses, GWP, AP, and EP values for peach production in Pinggu District were 10.7 kg N t⁻¹, 857 kg CO2-eq t⁻¹, 12.9 kg SO2-eq t⁻¹, and 4.1 kg PO4-eq t⁻¹, respectively. The principal driving factors were fertilizer production, transportation, and application, which together accounted for 94%, 67%, 75%, and 94% of Nr losses, GWP, AP, and EP, respectively. In the high-yield, high nitrogen-use-efficiency (HH) group, Nr losses, GWP, AP, and EP were respectively 33%, 25%, 39%, and 32% lower than the overall averages for the 290 orchards. Further analyses indicate that improved farming practices in the HH group, such as lower fertilizer application rates, a higher proportion of base fertilization, and appropriate fertilization frequency, were the main reasons for these orchards' better peach yields, higher partial factor productivity of nitrogen fertilizer, and reduced environmental impacts. These results highlight the need to optimize nutrient management in peach production in order to simultaneously realize environmental sustainability and high productivity.
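The per-tonne figures above follow from a simple accounting identity: stage-level emissions per hectare divided by yield. A minimal sketch with illustrative factor values (not the study's data):

```python
# kg CO2-eq per kg of input (illustrative values, not the study's inventory)
emission_factors = {
    "N_fertilizer_production": 8.0,
    "N_fertilizer_field_N2O": 4.9,
    "diesel_transport": 3.2,
}
# kg of input per hectare (illustrative)
inputs_per_ha = {
    "N_fertilizer_production": 500,
    "N_fertilizer_field_N2O": 500,
    "diesel_transport": 120,
}
yield_t_per_ha = 20.0  # peach yield, tonnes per hectare (illustrative)

gwp_per_ha = sum(emission_factors[k] * inputs_per_ha[k] for k in emission_factors)
print(f"GWP: {gwp_per_ha / yield_t_per_ha:.0f} kg CO2-eq per tonne of peaches")
```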
CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation
Since the natural language processing (NLP) community started to make large language models (LLMs), such as GPT-4, act as critics to evaluate the quality of generated texts, most efforts have trained a critique generation model at a single scale on specific datasets. We argue that a comprehensive investigation into the key factors of LLM-based evaluation models, such as scaling properties, is lacking, so it remains inconclusive whether these models can potentially replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for acquiring high-quality referenced / reference-free evaluation data. Experimental results show that our model achieves evaluation performance comparable to GPT-4, especially in system-level correlations, and even outperforms GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis showing promising scaling properties of our model in the quality of generated critiques. We also demonstrate that our generated critiques can act as scalable feedback to directly improve the generation quality of LLMs.
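A critique prompt of the general kind described, covering both the referenced and reference-free settings, can be sketched as follows; the prompt wording and the `critic` stand-in are illustrative assumptions, not CritiqueLLM's actual template or model.

```python
def build_critique_prompt(question, answer, reference=None):
    """Referenced evaluation when a gold answer is supplied; reference-free otherwise."""
    prompt = ("Evaluate the following response.\n"
              f"Question: {question}\nResponse: {answer}\n")
    if reference is not None:
        prompt += f"Reference answer: {reference}\n"
    return prompt + "Write a critique, then a rating from 1 to 10 as 'Score: N'."

def critic(prompt: str) -> str:
    """Hypothetical critique model call; replace with a real LLM."""
    return "The response is correct and concise. Score: 9"

print(critic(build_critique_prompt("What is 2 + 2?", "4", reference="4")))
```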
