Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Large language models have demonstrated great potential to assist programmers
in generating code. For such human-AI pair programming scenarios, we
empirically demonstrate that while generated code is most often evaluated in
terms of its functional correctness (i.e., whether generations pass available
unit tests), correctness does not fully capture (e.g., may underestimate) the
productivity gains these models may provide. Through a user study with N = 49
experienced programmers, we show that while correctness captures high-value
generations, programmers still rate code that fails unit tests as valuable if
it reduces the overall effort needed to complete a coding task. Finally, we
propose a hybrid metric that combines functional correctness and syntactic
similarity and show that it achieves a 14% stronger correlation with value and
can therefore better represent real-world gains when evaluating and comparing
models.
Comment: Accepted at ACL 2023 (Findings)
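The abstract does not give the formula for the hybrid metric, so the following is only a minimal sketch of the general idea: blend a pass/fail correctness signal with a syntactic similarity score. The weighted average, the `weight` parameter, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, not the authors' definition.

```python
from difflib import SequenceMatcher


def syntactic_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, generated, reference).ratio()


def hybrid_score(passed_tests: bool, generated: str, reference: str,
                 weight: float = 0.5) -> float:
    """Blend functional correctness with syntactic similarity.

    `weight` is a hypothetical mixing parameter: 1.0 scores on unit tests
    alone, 0.0 scores on similarity to the reference alone.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    correctness = 1.0 if passed_tests else 0.0
    similarity = syntactic_similarity(generated, reference)
    return weight * correctness + (1.0 - weight) * similarity


# A generation that fails its tests but is close to the reference still
# receives partial credit, unlike under a pure pass/fail metric.
reference = "def add(a, b):\n    return a + b\n"
near_miss = "def add(a, b):\n    return a - b\n"
print(hybrid_score(False, near_miss, reference))
```

Under this sketch, code that fails unit tests but would take little effort to repair scores above zero, which is the behaviour the user study motivates.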
Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications
The rapid development in the field of Large Language Models (LLMs) has led to
a surge in applications that facilitate collaboration among multiple agents to
assist humans in their daily tasks. However, a significant gap remains in
assessing whether LLM-powered applications genuinely enhance user experience
and task execution efficiency. This highlights the pressing need for methods to
verify utility of LLM-powered applications, particularly by ensuring alignment
between the application's functionality and end-user needs. We introduce
AgentEval, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the robustness of quantifier's work
Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting
Hand-crafting high quality prompts to optimize the performance of language
models is a complicated and labor-intensive process. Furthermore, when
migrating to newer, smaller, or weaker models (possibly due to latency or cost
gains), prompts need to be updated to re-optimize the task performance. We
propose Concept Distillation (CD), an automatic prompt optimization technique
for enhancing weaker models on complex tasks. CD involves: (1) collecting
mistakes made by weak models with a base prompt (initialization), (2) using a
strong model to generate reasons for these mistakes and create rules/concepts
for weak models (induction), and (3) filtering these rules based on validation
set performance and integrating them into the base prompt
(deduction/verification). We evaluated CD on NL2Code and mathematical reasoning
tasks, observing significant performance boosts for small and weaker language
models. Notably, Mistral-7B's accuracy on Multi-Arith increased by 20%, and
Phi-3-mini-3.8B's accuracy on HumanEval rose by 34%. Compared to other
automated methods, CD offers an effective, cost-efficient strategy for
improving weak models' performance on complex tasks and enables seamless
workload migration across different language models without compromising
performance.
Comment: 13 pages, 8 figures, conferenc
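The three CD stages described above (initialization, induction, deduction/verification) can be sketched as a short loop. This is an illustrative outline only: `weak_solve` and `propose_rule` are hypothetical callables standing in for the weak and strong models, and the greedy "keep a rule only if validation accuracy improves" filter is an assumption, since the abstract does not specify the filtering procedure.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (task input, expected output)


def accuracy(solve: Callable[[str, str], str], prompt: str,
             data: List[Example]) -> float:
    """Fraction of examples the model answers correctly with this prompt."""
    if not data:
        return 0.0
    return sum(solve(prompt, x) == y for x, y in data) / len(data)


def concept_distillation(base_prompt: str,
                         weak_solve: Callable[[str, str], str],
                         propose_rule: Callable[[str, str, str], str],
                         train: List[Example],
                         validation: List[Example]) -> str:
    # (1) Initialization: collect the weak model's mistakes with the base prompt.
    attempts = [(x, y, weak_solve(base_prompt, x)) for x, y in train]
    mistakes = [(x, y, out) for x, y, out in attempts if out != y]

    # (2) Induction: the strong model turns each mistake into a candidate rule.
    candidates = [propose_rule(x, y, out) for x, y, out in mistakes]

    # (3) Deduction/verification: keep a rule only if it raises validation
    # accuracy (greedy filtering, assumed here for simplicity).
    prompt = base_prompt
    best = accuracy(weak_solve, prompt, validation)
    for rule in candidates:
        trial = prompt + "\n" + rule
        score = accuracy(weak_solve, trial, validation)
        if score > best:
            prompt, best = trial, score
    return prompt
```

In practice both callables would wrap calls to language models; they are passed in as plain functions here so the control flow can be tested without any model access.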
Effect of early tranexamic acid administration on mortality, hysterectomy, and other morbidities in women with post-partum haemorrhage (WOMAN): an international, randomised, double-blind, placebo-controlled trial
Background
Post-partum haemorrhage is the leading cause of maternal death worldwide. Early administration of tranexamic acid reduces deaths due to bleeding in trauma patients. We aimed to assess the effects of early administration of tranexamic acid on death, hysterectomy, and other relevant outcomes in women with post-partum haemorrhage.
Methods
In this randomised, double-blind, placebo-controlled trial, we recruited women aged 16 years and older with a clinical diagnosis of post-partum haemorrhage after a vaginal birth or caesarean section from 193 hospitals in 21 countries. We randomly assigned women to receive either 1 g intravenous tranexamic acid or matching placebo in addition to usual care. If bleeding continued after 30 min, or stopped and restarted within 24 h of the first dose, a second dose of 1 g of tranexamic acid or placebo could be given. Patients were assigned by selection of a numbered treatment pack from a box containing eight numbered packs that were identical apart from the pack number. Participants, care givers, and those assessing outcomes were masked to allocation. We originally planned to enrol 15 000 women with a composite primary endpoint of death from all causes or hysterectomy within 42 days of giving birth. However, during the trial it became apparent that the decision to conduct a hysterectomy was often made at the same time as randomisation. Although tranexamic acid could influence the risk of death in these cases, it could not affect the risk of hysterectomy. We therefore increased the sample size from 15 000 to 20 000 women in order to estimate the effect of tranexamic acid on the risk of death from post-partum haemorrhage. All analyses were done on an intention-to-treat basis. This trial is registered with ISRCTN76912190 (Dec 8, 2008); ClinicalTrials.gov, number NCT00872469; and PACTR201007000192283.
Findings
Between March, 2010, and April, 2016, 20 060 women were enrolled and randomly assigned to receive tranexamic acid (n=10 051) or placebo (n=10 009), of whom 10 036 and 9985, respectively, were included in the analysis. Death due to bleeding was significantly reduced in women given tranexamic acid (155 [1·5%] of 10 036 patients vs 191 [1·9%] of 9985 in the placebo group, risk ratio [RR] 0·81, 95% CI 0·65–1·00; p=0·045), especially in women given treatment within 3 h of giving birth (89 [1·2%] in the tranexamic acid group vs 127 [1·7%] in the placebo group, RR 0·69, 95% CI 0·52–0·91; p=0·008). All other causes of death did not differ significantly by group. Hysterectomy was not reduced with tranexamic acid (358 [3·6%] patients in the tranexamic acid group vs 351 [3·5%] in the placebo group, RR 1·02, 95% CI 0·88–1·07; p=0·84). The composite primary endpoint of death from all causes or hysterectomy was not reduced with tranexamic acid (534 [5·3%] deaths or hysterectomies in the tranexamic acid group vs 546 [5·5%] in the placebo group, RR 0·97, 95% CI 0·87–1·09; p=0·65). Adverse events (including thromboembolic events) did not differ significantly in the tranexamic acid versus placebo group.
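As a worked check of the headline result, the risk ratio for death due to bleeding can be recomputed from the counts reported above (155 of 10 036 vs 191 of 9985). The sketch below uses the standard Wald interval on the log scale; the abstract does not state which interval method the trial used, so that choice and the function name `risk_ratio_ci` are assumptions.

```python
import math


def risk_ratio_ci(events_treated: int, n_treated: int,
                  events_control: int, n_control: int,
                  z: float = 1.96):
    """Risk ratio with a Wald confidence interval computed on the log scale."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(1 / events_treated - 1 / n_treated
                          + 1 / events_control - 1 / n_control)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Death due to bleeding: tranexamic acid vs placebo, counts from the Findings.
rr, lower, upper = risk_ratio_ci(155, 10036, 191, 9985)
print(f"RR {rr:.2f}, 95% CI {lower:.2f}-{upper:.2f}")  # RR 0.81, 95% CI 0.65-1.00
```

The recomputed values agree with the reported RR 0·81 (95% CI 0·65–1·00) to two decimal places.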
Interpretation
Tranexamic acid reduces death due to bleeding in women with post-partum haemorrhage with no adverse effects. When used as a treatment for post-partum haemorrhage, tranexamic acid should be given as soon as possible after bleeding onset.
Funding
London School of Hygiene & Tropical Medicine, Pfizer, UK Department of Health, Wellcome Trust, and Bill & Melinda Gates Foundation.
