5 research outputs found
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models.
To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Comment: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-benc
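The abstract above describes scoring model behavior on a large collection of tasks against a human-rater baseline. As a rough illustration of how such per-task scoring works, here is a minimal sketch of exact-match evaluation over a task stored in a JSON structure of input/target examples. The task name, example data, and the stand-in "model" below are all invented for illustration; this is not the benchmark's actual evaluation API.

```python
import json

# Hypothetical toy task in a JSON style of input/target example pairs
# (the task and its examples are invented for illustration).
task = json.loads("""
{
  "name": "toy_arithmetic",
  "examples": [
    {"input": "2 + 3 =", "target": "5"},
    {"input": "7 - 4 =", "target": "3"}
  ]
}
""")

def exact_match_score(model, examples):
    """Fraction of examples where the model's answer exactly matches the target."""
    hits = sum(1 for ex in examples if model(ex["input"]).strip() == ex["target"])
    return hits / len(examples)

# A stand-in "model" that always answers "5": right on one of the two examples.
constant_model = lambda prompt: "5"
print(exact_match_score(constant_model, task["examples"]))  # prints 0.5
```

Exact match is one of the simplest and most brittle metrics of this kind; as the abstract notes, brittle metrics are one reason some tasks appear to show "breakthrough" behavior at a critical scale.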
HIT Poster session 2
P486 The effect of short term aerobic exercise and ACE polymorphism on cardiovascular remodeling in healthy sedentary postmenopausal women
P487 Are there predictors of malignant progression of aortic stenosis severity?
P488 Quantitative and semiquantitative parameters in the classification of aortic insufficiency: a 3D-echocardiography and magnetic resonance imaging study
P489 Vascular indices: surrogate markers for left ventricular dysfunction
P490 Left ventricular systolic strain data does not require indexation to cavity size in mitral valve diseases
P491 Impact of EACVI grant programme on career progression of grant winners
P492 Early predictor of atrial fibrillation recurrence after electrical cardioversion: diastolic parameters come first
P493 Echocardiographic diagnosis of arrhythmias in the fetus
P494 3D echocardiography is a fast-learning and a more reliable method compared with 2D echocardiography for the assessment of left ventricular volumes and ejection fraction in patients with heart failure
P495 Right ventricular mechanics in functional ischemic mitral regurgitation in acute inferior myocardial infarction
P496 Added value of two dimensional strain in assessment of left ventricular systolic function in rheumatic mitral stenosis patients with normal ejection fraction
P497 Left ventricular myocardial deformation in arterial hypertension with different types of glucose metabolism disorders
P498 Epicardial to pericardial adipose tissue ratio: predicting myocardial ischemia in patients referred for exercise stress echocardiography
P499 Echocardiographic evaluation of the patients with ASD after percutaneous closure
P500 Screening for carotid artery stenosis with the use of pocket-size imaging device equipped with linear probe
P501 LAD correlates poorly with LAVI
P502 Predictors associated with the diastolic dysfunction formation in patients with moderate hypertension
P503 Assessment of left atrial function by speckle tracking analysis in transthoracic echocardiography for predicting the presence of left atrial appendage thrombus in patients with atrial fibrillation
P504 Can echocardiography detect subclinical myocardial damage in the layers of myocardial wall? (The first study in a large population with known inflammatory disease)
P505 Epicardial fat thickness and galectin 3 in patients with atrial fibrillation and metabolic syndrome
P506 Left ventricular reverse remodeling in heart failure: a new obesity paradox?
P507 Epicardial adipose tissue and carotid intima media thickness in hemodialysis patients; single center experience
P508 Echocardiographic parameters of mitral valve remodeling associated with poor clinical outcome in high risk patients with functional mitral regurgitation after MitraClip implantation
P509 Prevalence of valve disease in a community population over the age of 60
P510 Discordance between mitral valve area and mean transmitral pressure gradient in mitral stenosis: is mean gradient a marker of severity or a parameter of tolerance in severe mitral stenosis?
P511 Ischemic mitral regurgitation is associated with impaired radial and circumferential myocardial deformation in acute inferoposterior myocardial infarction
P512 The importance of early left atrial functional changes in predicting long term left ventricular remodeling in patients surviving a ST elevation myocardial infarction
P513 Remodeling of myocardial deformation after mitral valve surgery
P514 Global longitudinal peak systolic strain is reduced shortly after heart transplantation
P515 Detailed transthoracic and transesophageal echocardiographic analysis of mitral leaflets in patients undergoing mitral valve repair
