Skip to content

Latest commit



1567 lines (916 loc) · 189 KB

File metadata and controls

1567 lines (916 loc) · 189 KB

2024 (109 papers)

  1. A Computational Framework for Behavioral Assessment of LLM Therapists, Yu Ying Chiu,Ashish Sharma,Inna Wanyin Lin,Tim Althoff, 01-01-2024


    Computation and Language, Human-Computer Interaction


    The emergence of ChatGPT and other large language models (LLMs) has greatly increased interest in utilizing LLMs as therapists to support individuals struggling with mental health challenges. However, due to the lack of systematic studies, our understanding of how LLM therapists behave, i.e., ways in which they respond to clients, is significantly limited. Understanding their behavior across a wide range of clients and situations is crucial to accurately assess their capabilities and limitations in the high-risk setting of mental health, where undesirable behaviors can lead to severe consequences. In this paper, we propose BOLT, a novel computational framework to study the conversational behavior of LLMs when employed as therapists. We develop an in-context learning method to quantitatively measure the behavior of LLMs based on 13 different psychotherapy techniques including reflections, questions, solutions, normalizing, and psychoeducation. Subsequently, we compare the behavior of LLM therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy. Our analysis of GPT and Llama-variants reveals that these LLMs often resemble behaviors more commonly exhibited in low-quality therapy rather than high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions, which is against typical recommendations. At the same time, unlike low-quality therapy, LLMs reflect significantly more upon clients' needs and strengths. Our analysis framework suggests that despite the ability of LLMs to generate anecdotal examples that appear similar to human therapists, LLM therapists are currently not fully consistent with high-quality care, and thus require additional research to ensure quality care.

    Bullet Points

    • The paper proposes BOLT, a computational framework to study the conversational behavior of LLMs when employed as therapists, and develops an in-context learning method to quantitatively measure their behavior based on 13 psychotherapy techniques

    • The study compares LLM behavior against that of high-quality human therapy and explores how their behavior can be modulated to better reflect behaviors observed in low-quality therapy

    • Despite the ability to generate anecdotal examples that appear similar to human therapist, LLM therapy is currently not fully consistent with high quality care and requires additional research to ensure quality care.

  2. Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models, Guangji Bai,Zheng Chai,Chen Ling,Shiyu Wang,Jiaying Lu,Nan Zhang,Tingwei Shi,Ziyang Yu,Mengdan Zhu,Yifei Zhang,Carl Yang,Yue Cheng,Liang Zhao, 01-01-2024


    Machine Learning


    The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus: computational, memory, energy, financial, and network resources and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current sota and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.

    Bullet Points

    • This survey reviews techniques to enhance resource efficiency of LLMs, categorizing them based on their optimization focus and applicability across various stages of an LLM's lifecycle

    • The survey also presents a nuanced categorization of resource efficiency techniques by their specific resource types, providing a foundational reference for researchers and practitioners.

  3. General-purpose foundation models for increased autonomy in robot-assisted surgery, Samuel Schmidgall,Ji Woong Kim,Alan Kuntz,Ahmed Ezzat Ghazi,Axel Krieger, 01-01-2024


    Robotics, Machine Learning, Quantitative Biology


    The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific objectives that solve a single robotic problem such as picking up an object or reaching a target position. However, recent work on high-capacity models in robotics has shown promise toward being trained on large collections of diverse and task-agnostic datasets of video demonstrations. These models have shown impressive levels of generalization to unseen circumstances, especially as the amount of data and the model complexity scale. Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning for a few reasons: (1) there is a lack of existing large-scale open-source data to train models, (2) it is challenging to model the soft-body deformations that these robots work with during surgery because simulation cannot match the physical and visual complexity of biological tissue, and (3) surgical robots risk harming patients when tested in clinical trials and require more extensive safety measures. This perspective article aims to provide a path toward increasing robot autonomy in robot-assisted surgery through the development of a multi-modal, multi-task, vision-language-action model for surgical robots. Ultimately, we argue that surgical robots are uniquely positioned to benefit from general-purpose models and provide three guiding actions toward increased autonomy in robot-assisted surgery.

    Bullet Points

    • The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific objectives

    • However, recent work on high-capacity models in robotics has shown promise towards being trained on large collections of diverse and task-agnostic datasets of video demonstrations

    • Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning due to lack of large-scale open-source data, difficulty in modeling soft-body deformations, and risk of harming patients when tested in clinical trials

    • The article aims to develop a multi-modal, multi-task, vision-language-action model for surgical robots to increase robot autonomy in robot-assisted surgery.

  4. If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents, Ke Yang,Jiateng Liu,John Wu,Chaoqi Yang,Yi R. Fung,Sha Li,Zixuan Huang,Xu Cao,Xingyao Wang,Yiquan Wang,Heng Ji,Chengxiang Zhai, 01-01-2024


    Computation and Language


    The prominent large language models (LLMs) of today differ from past language models not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs' training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their applications to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the ability to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.

    Bullet Points

    • The survey discusses the benefits of integrating code into LLMs' training data, including unlocking their reasoning ability, steering them to produce structured and precise intermediate steps, taking advantage of code compilation and execution environment, and tracing how these capabilities have led to their emergence as intelligent agents in downstream tasks

    • Key challenges and future directions are presented.

  5. The Earth is Flat? Unveiling Factual Errors in Large Language Models, Wenxuan Wang,Juluan Shi,Zhaopeng Tu,Youliang Yuan,Jen-tse Huang,Wenxiang Jiao,Michael R. Lyu, 01-01-2024


    Software Engineering, Artificial Intelligence, Computation and Language


    Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education to mislead users. Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection. To tackle this problem, we introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs. This framework involves three main steps: First, it constructs a factual knowledge graph by retrieving fact triplets from a large-scale knowledge database. Then, leveraging the knowledge graph, FactChecker employs a rule-based approach to generates three types of questions (Yes-No, Multiple-Choice, and WH questions) that involve single-hop and multi-hop relations, along with correct answers. Lastly, it assesses the LLMs' responses for accuracy using tailored matching strategies for each question type. Our extensive tests on six prominent LLMs, including text-davinci-002, text-davinci-003, ChatGPT~(gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, reveal that FactChecker can trigger factual errors in up to 45% of questions in these models. Moreover, we demonstrate that FactChecker's test cases can improve LLMs' factual accuracy through in-context learning and fine-tuning (e.g., llama-2-13b-chat's accuracy increase from 35.3% to 68.5%). We are making all code, data, and results available for future research endeavors.

    Bullet Points

    • FactChecker is an automatic testing framework that aims to uncover factual errors in large language models like ChatGPT

    • It uses a rule-based approach to generate three types of questions that involve single-hop and multi-hop relations, along with correct answers, and assesses the LLMs' responses for accuracy using tailored matching strategies for each question type

    • The framework has been tested on six prominent LLM models, including text-davinci-002, text-dravinci-012, chatGPT-3.5-turbo, gpt-4, Vicuna, and LLaMA-2, and can trigger factual inaccuracies in up to 45% of questions in these models

    • We are making all code, data, and results available for future research endeavors.

  6. A Comprehensive Study of Knowledge Editing for Large Language Models, Ningyu Zhang,Yunzhi Yao,Bozhong Tian,Peng Wang,Shumin Deng,Mengru Wang,Zekun Xi,Shengyu Mao,Jintian Zhang,Yuansheng Ni,Siyuan Cheng,Ziwen Xu,Xin Xu,Jia-Chen Gu,Yong Jiang,Pengjun Xie,Fei Huang,Lei Liang,Zhiqiang Zhang,Xiaowei Zhu,Jun Zhou,Huajun Chen, 02-01-2024


    Computation and Language, Artificial Intelligence, Computer Vision, Human-Computer Interaction, Machine Learning


    Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. However, a primary limitation lies in the significant computational demands during training, arising from their extensive parameterization. This challenge is further intensified by the dynamic nature of the world, necessitating frequent updates to LLMs to correct outdated information or integrate new knowledge, thereby ensuring their continued relevance. Note that many applications demand continual model adjustments post-training to address deficiencies or undesirable behaviors. There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications. To this end, recent years have seen a burgeoning in the techniques of knowledge editing for LLMs, which aim to efficiently modify LLMs' behaviors within specific domains while preserving overall performance across various inputs. In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches. Drawing inspiration from educational and cognitive research theories, we propose a unified categorization criterion that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge. Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches. Additionally, we provide an in-depth analysis of knowledge location, which can give a deeper understanding of the knowledge structures inherent within LLMs. Finally, we discuss several potential applications of knowledge editing, outlining its broad and impactful implications.

    Bullet Points

    • The paper discusses the limitations of Large Language Models (LLMs) in understanding and generating text that closely mirrors human communication

    • The computational demands during training are significant, and frequent updates are necessary to correct outdated information or integrate new knowledge

    • There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications

    • Recent years have seen a burgeoning in the techniques of knowledge editing for LLMs

    • A unified categorization criterion is proposed that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge

    • A new benchmark, KnowEdit, is introduced for a comprehensive empirical evaluation of representative knowledge editing approaches

    • Additionally, an in-depth analysis of knowledge location provides a deeper understanding of the knowledge structures inherent within LLM

    • The paper concludes by discussing several potential applications of Knowledge Editing, outlining its broad and impactful

  7. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models, S.M Towhidul Islam Tonmoy,S M Mehedi Zaman,Vinija Jain,Anku Rani,Vipula Rawte,Aman Chadha,Amitava Das, 02-01-2024


    Computation and Language


    As Large Language Models (LLMs) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate generating content that appears factual but is ungrounded. This issue of hallucination is arguably the biggest hindrance to safely deploying these powerful LLMs into real-world production systems that impact people's lives. The journey toward widespread adoption of LLMs in practical settings heavily relies on addressing and mitigating hallucinations. Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, financial analysis reports, etc. This paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in LLMs. Notable among these are Retrieval Augmented Generation (Lewis et al, 2021), Knowledge Retrieval (Varshney et al,2023), CoNLI (Lei et al, 2023), and CoVe (Dhuliawala et al, 2023). Furthermore, we introduce a detailed taxonomy categorizing these methods based on various parameters, such as dataset utilization, common tasks, feedback mechanisms, and retriever types. This classification helps distinguish the diverse approaches specifically designed to tackle hallucination issues in LLMs. Additionally, we analyze the challenges and limitations inherent in these techniques, providing a solid foundation for future research in addressing hallucinations and related phenomena within the realm of LLMs.

    Bullet Points

    • The paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in LLMs, including Retrieval Augmented Generation (Revshney et al., 2021), Knowledge Revieval (CoNLI), and CoVe

    • The paper categorizes these techniques based on dataset utilization, common tasks, feedback mechanisms, and retriever types, and analyzes the challenges and limitations inherent in these techniques for future research.

  8. LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, Hongye Jin,Xiaotian Han,Jingfeng Yang,Zhimeng Jiang,Zirui Liu,Chia-Yuan Chang,Huiyuan Chen,Xia Hu, 02-01-2024


    Computation and Language, Artificial Intelligence, Machine Learning


    This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability.We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed does not require any training. With only four lines of code modification, the proposed method can effortlessly extend existing LLMs' context window without any fine-tuning. We conduct comprehensive experiments and the results show that the proposed method can effectively extend existing LLMs' context window's length.

    Bullet Points

    • The work proposes extending LLMs' context window by themselves to fully utilize their inherent ability to handle long contexts without fine-tuning

    • The proposed method, Self-Extend, involves building bi-level attention information on the group level and neighbor level using the original model's self-attention, which does not require any training

    • The experiment results show that the proposed method can effectively extend existing LLM's context window's length.

  9. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, Zixiang Chen,Yihe Deng,Huizhuo Yuan,Kaixuan Ji,Quanquan Gu, 02-01-2024


    Machine Learning, Artificial Intelligence, Computation and Language, Machine Learning


    Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.

    Bullet Points

    • The paper proposes a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised well-tuned model

    • The LLM refines its capability by playing against instances of itself, refining its policy by discerning self-generated responses from those obtained from human-annotated data

    • The global optimum to the training objective function of SPIN is achieved only when the LLM policy aligns with the target data distribution

    • The results demonstrate that SPIN can significantly improve LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.

  10. LLaMA Beyond English: An Empirical Study on Language Capability Transfer, Jun Zhao,Zhihao Zhang,Luhui Gao,Qi Zhang,Tao Gui,Xuanjing Huang, 02-01-2024


    Computation and Language, Artificial Intelligence


    In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpus, which limits their performance in other non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmarks consisting instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.

    Bullet Points

    • The paper focuses on how to transfer language generation and following instructions to a non-English language by conducting an empirical investigation based on LLaMA and analyzing the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer

    • Four standardized testing benchmarks are used to assess the model's level of knowledge, and a comprehensive evaluation of its response quality is conducted using LLM-Eval, a benchmark consisting instruction tasks from 17 diverse categories

    • Comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality

    • Experimental outcomes across thirteen low-resource languages also exhibit similar trends

    • The conclusions revealed by the experiments will aid the community in developing non- English LLMs.

  11. Enhancing the medical foundation model with multi-scale and cross-modality feature learning, Weijian Huang,Cheng Li,Hong-Yu Zhou,Jiarun Liu,Hao Yang,Yong Liang,Shanshan Wang, 03-01-2024


    Computer Vision


    The development of multi-modal medical foundation models has attracted significant attention in the field of medicine and healthcare due to their promising prospects in various clinical applications. One area of focus in this research direction is the extractions of features at different scales. While previous studies have explored feature learning at individual scales, investigation on integrating the diverse scales and modalities of information is lacking, which may hinder the potential for mutual reinforcement among these features. This paper aims to bridge this gap by proposing a method that effectively exploits multi-scale and cross-modality information to enhance the performance of medical foundation models. The proposed method simultaneously exploit features at the local, instance, modality and global aspects, facilitating comprehensive representation learning within the models. We evaluate the effectiveness of the proposed method on six open-source datasets across different clinical tasks, demonstrating its ability to enhance the performance of medical foundation models.

    Bullet Points

    • The paper proposes a method that effectively exploits multi-scale and cross-modality information to enhance the performance of medical foundation models

    • The proposed method leverages features at local, instance, modality, and global aspects, facilitating comprehensive representation learning within the models

    • We evaluated its effectiveness on six open-source datasets across different clinical tasks.

  12. Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review, Luoma Ke,(1),,Song Tong,(1),,Peng Cheng,(2),,Kaiping Peng,(1) ((1) Department of Psychology, Tsinghua University, (2) School of Social Science, Tsinghua University), 03-01-2024


    Machine Learning, Artificial Intelligence


    This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. It discusses the impact of LLMs across various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology, highlighting their potential to simulate aspects of human cognition and behavior. The paper delves into the capabilities of these models to emulate human-like text generation, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology. While LLMs are essential in advancing research methodologies in psychology, the paper also cautions about their technical and ethical challenges. There are issues like data privacy, the ethical implications of using LLMs in psychological research, and the need for a deeper understanding of these models' limitations. Researchers should responsibly use LLMs in psychological studies, adhering to ethical standards and considering the potential consequences of deploying these technologies in sensitive areas. Overall, the article provides a comprehensive overview of the current state of LLMs in psychology, exploring potential benefits and challenges. It serves as a call to action for researchers to leverage LLMs' advantages responsibly while addressing associated risks.

    Bullet Points

    • The paper explores the frontiers of large language models (LLMs) in psychology applications, exploring their impact on various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology

    • LLMs can simulate human cognition and behavior, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology

    • However, the paper cautions about their technical and ethical challenges, including data privacy, ethical implications, and the need for a deeper understanding of these models' limitations

    • Researchers should responsibly use these technologies in psychological studies, adhering to ethical standards and considering the potential consequences of deploying them in sensitive areas.

  13. Few-shot Adaptation of Multi-modal Foundation Models: A Survey, Fan Liu,Tianshu Zhang,Wenwen Dai,Wenwen Cai,Xiaocong Zhou,Delong Chen, 03-01-2024


    Computer Vision


    Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models with robust and aligned semantic representations learned from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.

    Bullet Points

    • Three possible solutions for few-shot adaptation methods for multi-modal models are prompt-based, adapter-based and external knowledge-based methods

    • These solutions are based on the limited theoretical support for existing methods, domain gap, model capacity, and sample size, as well as adaptive domain generalization.

  14. Large Language Models Relearn Removed Concepts, Michelle Lo,Shay B. Cohen,Fazl Barez, 03-01-2024


    Artificial Intelligence


    Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model \textit{safety}. Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work strongly demonstrates the resilience and fluidity of concept representations in LLMs post concept removal.

    Bullet Points

    • Neuron pruning can remove undesirable concepts from large language models, but it is unclear whether models have the capacity to reacquire pruned concepts after editing

    • To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining

    • Models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned Concepts to primed neurons with similar semantics

    • This demonstrates polysemantic capacities and can blend old and new concepts in individual neurons

    • While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model textit safety

    • Monitoring concept emergence and developing techniques to mitigate re learning of unsafe concepts will be important directions for more robust model editing

    • Overall, our work highlights the resilience and fluidity of concept representations in LLMs post concept removal.

  15. Correctness Comparison of ChatGPT-4, Bard, Claude-2, and Copilot for Spatial Tasks, Hartwig H. Hochmair,Levente Juhasz,Takoda Kemp, 04-01-2024


    Computers and Society


    Generative AI including large language models (LLMs) have recently gained significant interest in the geo-science community through its versatile task-solving capabilities including coding, spatial computations, generation of sample data, time-series forecasting, toponym recognition, or image classification. So far, the assessment of LLMs for spatial tasks has primarily focused on ChatGPT, arguably the most prominent AI chatbot, whereas other chatbots received less attention. To narrow this research gap, this study evaluates the correctness of responses for a set of 54 spatial tasks assigned to four prominent chatbots, i.e., ChatGPT-4, Bard, Claude-2, and Copilot. Overall, the chatbots performed well on spatial literacy, GIS theory, and interpretation of programming code and given functions, but revealed weaknesses in mapping, code generation, and code translation. ChatGPT-4 outperformed other chatbots across most task categories.

    Bullet Points

    • Generative AI, including LLMs, has gained interest in the geo-science community due to its versatile task-solving capabilities, including coding, spatial computations, generation of sample data, time-series forecasting, toponym recognition, or image classification

    • The study evaluated the correctness of responses for 54 spatial tasks assigned to four prominent chatbots, i.e

    • ChatGPT-4, Bard, Claude-2, and Copilot, and revealed weaknesses in mapping, code generation, and code translation.

  16. LLM Augmented LLMs: Expanding Capabilities through Composition, Rachit Bansal,Bidisha Samanta,Siddharth Dalmia,Nitish Gupta,Shikhar Vashishth,Sriram Ganapathy,Abhishek Bapna,Prateek Jain,Partha Talukdar, 04-01-2024


    Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision


    Foundational models with billions of parameters which have been trained on large corpora of data have demonstrated non-trivial skills in a variety of domains. However, due to their monolithic structure, it is challenging and expensive to augment them or impart new skills. On the other hand, due to their adaptation abilities, several new instances of these models are being trained towards new domains and tasks. In this work, we study the problem of efficient and practical composition of existing foundation models with more specific models to enable newer capabilities. To this end, we propose CALM -- Composition to Augment Language Models -- which introduces cross-attention between models to compose their representations and enable new capabilities. Salient features of CALM are: (i) Scales up LLMs on new tasks by 're-using' existing LLMs along with a few additional parameters and data, (ii) Existing model weights are kept intact, and hence preserves existing capabilities, and (iii) Applies to diverse domains and settings. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning for low-resource languages. Similarly, when PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks -- on-par with fully fine-tuned counterparts.

    Bullet Points

    • CALM introduces cross-attention between existing foundation models to compose their representations and enable new capabilities

    • CALM allows for scaling up LLMs on new tasks by 're-using' existing models along with a few additional parameters and data, and preserves existing capabilities

    • It applies to diverse domains and settings

    • A smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning, while a code-specific model improves 40% over the base model for code generation and explanation tasks.

  17. LLaMA Pro: Progressive LLaMA with Block Expansion, Chengyue Wu,Yukang Gan,Yixiao Ge,Zeyu Lu,Jiahao Wang,Ye Feng,Ping Luo,Ying Shan, 04-01-2024


    Computation and Language


    Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

    Bullet Points

    • The paper proposes a new post-pretraining method for LLMs with an expansion of Transformer blocks

    • The expanded blocks are tuned using only new corpus, improving the model's knowledge without catastrophic forgetting

    • LLaMA Pro-8.3B is a versatile foundation model that excels in general tasks, programming, and mathematics

    • It achieves advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLAMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent

    • The findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

  18. LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model, Yichen Zhu,Minjie Zhu,Ning Liu,Zhicai Ou,Xiaofeng Mou,Jian Tang, 04-01-2024


    Computer Vision, Computation and Language


    }. None

  19. TinyLlama: An Open-Source Small Language Model, Peiyuan Zhang,Guangtao Zeng,Tianduo Wang,Wei Lu, 04-01-2024


    Computation and Language, Artificial Intelligence


    . None

  20. Understanding LLMs: A Comprehensive Overview from Training to Inference, Yiheng Liu,Hao He,Tianle Han,Xu Zhang,Mengyuan Liu,Jiaming Tian,Yutong Zhang,Jiaqi Wang,Xiaohui Gao,Tianyang Zhong,Yi Pan,Shaochen Xu,Zihao Wu,Zhengliang Liu,Xin Zhang,Shu Zhang,Xintao Hu,Tuo Zhang,Ning Qiang,Tianming Liu,Bao Ge, 04-01-2024


    Computation and Language


    The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.

    Bullet Points

    • The introduction of ChatGPT has led to an increase in the utilization of Large Language Models (LLMs) for downstream tasks, with a focus on cost-efficient training and deployment

    • This trend represents the future development trend

    • The paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend, including data preprocessing, training architecture, pre-training tasks, parallel training, relevant content related to model fine-tuning, model compression, parallel computation, memory scheduling, and structural optimization

    • It also explores LLMs' utilization and provides insights into their future development.

  21. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, DeepSeek-AI,:,Xiao Bi,Deli Chen,Guanting Chen,Shanhuang Chen,Damai Dai,Chengqi Deng,Honghui Ding,Kai Dong,Qiushi Du,Zhe Fu,Huazuo Gao,Kaige Gao,Wenjun Gao,Ruiqi Ge,Kang Guan,Daya Guo,Jianzhong Guo,Guangbo Hao,Zhewen Hao,Ying He,Wenjie Hu,Panpan Huang,Erhang Li,Guowei Li,Jiashi Li,Yao Li,Y.K. Li,Wenfeng Liang,Fangyun Lin,A.X. Liu,Bo Liu,Wen Liu,Xiaodong Liu,Xin Liu,Yiyuan Liu,Haoyu Lu,Shanghao Lu,Fuli Luo,Shirong Ma,Xiaotao Nie,Tian Pei,Yishi Piao,Junjie Qiu,Hui Qu,Tongzheng Ren,Zehui Ren,Chong Ruan,Zhangli Sha,Zhihong Shao,Junxiao Song,Xuecheng Su,Jingxiang Sun,Yaofeng Sun,Minghui Tang,Bingxuan Wang,Peiyi Wang,Shiyu Wang,Yaohui Wang,Yongji Wang,Tong Wu,Y. Wu,Xin Xie,Zhenda Xie,Ziwei Xie,Yiliang Xiong,Hanwei Xu,R.X. Xu,Yanhong Xu,Dejian Yang,Yuxiang You,Shuiping Yu,Xingkai Yu,B. Zhang,Haowei Zhang,Lecong Zhang,Liyue Zhang,Mingchuan Zhang,Minghua Zhang,Wentao Zhang,Yichao Zhang,Chenggang Zhao,Yao Zhao,Shangyan Zhou,Shunfeng Zhou,Qihao Zhu,Yuheng Zou, 05-01-2024


    Computation and Language, Artificial Intelligence, Machine Learning


    The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

    Bullet Points

    • The article discusses the study of scaling laws and DeepSeek LLM, a project focused on advancing open-source language models with a long-term perspective

    • We present findings that facilitate scaling of large scale models in two commonly used Open-source configurations, 7B and 67B

    • We introduce deepSeek LCLM and conduct supervised fine-tuning and Direct Preference Optimization on the dataset, resulting in the creation of deepseek chat models

    • The evaluation results demonstrate that DeepSeven LLM 67Be surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning

    • Additionally, open-ended evaluations reveal that Deepseek LML 67BE Chat exhibits superior performance compared to GPT-3.5.

  22. From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models, Na Liu,Liangyu Chen,Xiaoyu Tian,Wei Zou,Kaijiang Chen,Ming Cui, 05-01-2024


    Computation and Language, Artificial Intelligence


    This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. It entails a comprehensive agent construction scenario, including phases like Conversation Selection, Scene Extraction, CoT Completion, and Scene Augmentation, leading to the LLMs Training phase. This approach appears to enhance agent controllability and adaptability in complex, multi-turn dialogues. Our preliminary evaluations in a real estate sales context suggest that RAISE has some advantages over traditional agents, indicating its potential for broader applications. This work contributes to the AI field by providing a robust framework for developing more context-aware and versatile conversational agents.

    Bullet Points

    • The paper introduces RAISE, an advanced architecture that enhances the integration of LLMs like GPT-4 into conversational agents

    • It entails a dual-component memory system that mirrors human short-term and long-term memory to maintain context and continuity in conversations

    • The approach enhances agent controllability and adaptability in complex, multi-turn dialogues

    • Preliminary evaluations suggest it has some advantages over traditional agents, indicating its potential for broader applications

    • This work contributes to the AI field by providing a robust framework for developing more context-aware and versatile chatbots.

  23. Thousands of AI Authors on the Future of AI, Katja Grace,Harlan Stewart,Julia Fabienne Sandkühler,Stephen Thomas,Ben Weinstein-Raun,Jan Brauner, 05-01-2024


    Computers and Society, Artificial Intelligence, Machine Learning


    Most respondents expressed substantial uncertainty about the long-term value of AI progress: While 68.3% thought good outcomes from superhuman AI are more likely than bad, of these net optimists 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, and 59% of net pessimists gave 5% or more to extremely good outcomes. Between 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction. More than half suggested that "substantial" or "extreme" concern is warranted about six different AI-related scenarios, including misinformation, authoritarian control, and inequality. There was disagreement about whether faster or slower AI progress would be better for the future of humanity. However, there was broad agreement that research aimed at minimizing potential risks from AI systems ought to be prioritized more.

    Bullet Points

    • Most respondents expressed uncertainty about the long-term value of AI progress, with 68.3% believing good outcomes from superhuman AI are more likely than bad

    • Net optimists gave at least a 5% chance of extremely bad outcomes such as human extinction, while net pessimists give 5% or more to extremely good outcomes

    • Nearly half of respondents gave a 10% chance to advanced AI leading to outcomes as bad as human survival

    • More than half suggested "substantial" or "extreme" concern is warranted about six different AI-related scenarios, including misinformation, authoritarian control, and inequality

    • There was disagreement about whether faster or slower AI progress would be better for the future of humanity, but there was broad agreement that research aimed at minimizing potential risks from AI systems should be prioritized more.

  24. Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon, Peitian Zhang,Zheng Liu,Shitao Xiao,Ninglu Shao,Qiwei Ye,Zhicheng Dou, 07-01-2024


    Computation and Language, Artificial Intelligence


    The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Although the context window can be extended through fine-tuning, it will result in a considerable cost at both training and inference time, and exert an unfavorable impact to the LLM's original capabilities. In this work, we propose Activation Beacon, which condenses LLM's raw activations into more compact forms such that it can perceive a much longer context with a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM. It fully preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts. Besides, it works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference. Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to such a treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine. The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by $\times100$ times (from 4K to 400K), meanwhile achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository.

    Bullet Points

    • Activation Beacon is a plug-and-play module for LLM that preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts

    • It works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference

    • It can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine

    • The model and code will be available at the BGE repository.

  25. Mixtral of Experts, Albert Q. Jiang,Alexandre Sablayrolles,Antoine Roux,Arthur Mensch,Blanche Savary,Chris Bamford,Devendra Singh Chaplot,Diego de las Casas,Emma Bou Hanna,Florian Bressand,Gianna Lengyel,Guillaume Bour,Guillaume Lample,Lélio Renard Lavaud,Lucile Saulnier,Marie-Anne Lachaux,Pierre Stock,Sandeep Subramanian,Sophia Yang,Szymon Antoniak,Teven Le Scao,Théophile Gervet,Thibaut Lavril,Thomas Wang,Timothée Lacroix,William El Sayed, 08-01-2024


    Machine Learning, Computation and Language


    We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

    Bullet Points

    • Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model with 8 feedforward blocks (experts) at each layer

    • Each token has access to 47B parameters, but only 13B active parameters during inference

    • Mixtral was trained with a context size of 32k tokens and outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks

    • The model is fine-tuned to follow instructions, and both the base and instruct models are released under Apache 2.0 license.

  26. Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding, Zilong Wang,Hao Zhang,Chun-Liang Li,Julian Martin Eisenschlos,Vincent Perot,Zifeng Wang,Lesly Miculicich,Yasuhisa Fujii,Jingbo Shang,Chen-Yu Lee,Tomas Pfister, 09-01-2024


    Computation and Language


    Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.

    Bullet Points

    • Table-based reasoning with large language models (LLMs) is a promising direction to tackle table-based question answering and fact verification tasks

    • It requires extracting underlying semantics from free-form questions and semi-structured tabular data

    • Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage it

    • We propose a framework that uses a table as a proxy for intermediate thoughts, using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain

    • LLMs can dynamically plan the next operation based on the results of the previous ones

    • This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem

    • The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions

    • It achieves new state-

  27. AUTOACT: Automatic Agent Learning from Scratch via Self-Planning, Shuofei Qiao,Ningyu Zhang,Runnan Fang,Yujie Luo,Wangchunshu Zhou,Yuchen Eleanor Jiang,Chengfei Lv,Huajun Chen, 10-01-2024


    Computation and Language, Artificial Intelligence, Human-Computer Interaction, Machine Learning, Multiagent Systems


    . None

  28. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Evan Hubinger,Carson Denison,Jesse Mu,Mike Lambert,Meg Tong,Monte MacDiarmid,Tamera Lanham,Daniel M. Ziegler,Tim Maxwell,Newton Cheng,Adam Jermyn,Amanda Askell,Ansh Radhakrishnan,Cem Anil,David Duvenaud,Deep Ganguli,Fazl Barez,Jack Clark,Kamal Ndousse,Kshitij Sachan,Michael Sellitto,Mrinank Sharma,Nova DasSarma,Roger Grosse,Shauna Kravec,Yuntao Bai,Zachary Witten,Marina Favaro,Jan Brauner,Holden Karnofsky,Paul Christiano,Samuel R. Bowman,Logan Graham,Jared Kaplan,Sören Mindermann,Ryan Greenblatt,Buck Shlegeris,Nicholas Schiefer,Ethan Perez, 10-01-2024


    Cryptography and Security, Artificial Intelligence, Computation and Language, Machine Learning, Software Engineering


    Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

    Bullet Points

    • Yes, if an AI system learns a deceptive behavior, it can be detected and removed using current state-of-the-art safety training techniques

    • Proof-Of-Concept Examples of Deceptive Behavior in Large Language Models (LLMs) are constructed

    • The backdoor behavior can be persistent, so it is not removed by standard training techniques such as supervised fine-tuning, reinforcement learning, and adversarial training

    • The persistence of the behavior remains even when the chain of thought is distilled away

    • Adversarial Training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

  29. The Impact of Reasoning Step Length on Large Language Models, Mingyu Jin,Qinkai Yu,Dong shu,Haiyan Zhao,Wenyue Hua,Yanda Meng,Yongfeng Zhang,Mengnan Du, 10-01-2024


    Computation and Language, Artificial Intelligence


    Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). However, the correlation between the effectiveness of CoT and the length of reasoning steps in prompts remains largely unknown. To shed light on this, we have conducted several empirical experiments to explore the relations. Specifically, we design experiments that expand and compress the rationale reasoning steps within CoT demonstrations, while keeping all other factors constant. We have the following key findings. First, the results indicate that lengthening the reasoning steps in prompts, even without adding new information into the prompt, considerably enhances LLMs' reasoning abilities across multiple datasets. Alternatively, shortening the reasoning steps, even while preserving the key information, significantly diminishes the reasoning abilities of models. This finding highlights the importance of the number of steps in CoT prompts and provides practical guidance to make better use of LLMs' potential in complex problem-solving scenarios. Second, we also investigated the relationship between the performance of CoT and the rationales used in demonstrations. Surprisingly, the result shows that even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference. Third, we observed that the advantages of increasing reasoning steps are task-dependent: simpler tasks require fewer steps, whereas complex tasks gain significantly from longer inference sequences.

    Bullet Points

    • The effectiveness of Chain of Thought (CoT) in improving LLMs' reasoning abilities is unknown, but the correlation between the effectiveness and length of reasoning steps in prompts remains largely unknown

    • We conducted experiments that expand and compress the rationale reasoning steps within CoT demonstrations while keeping all other factors constant

    • The results indicate that lengthening the reasoning steps enhances LLM's reasoning abilities across multiple datasets, while shortening them significantly diminishes the reasoning abilities of models

    • The importance of the number of steps in CoT prompts and the relationship between performance and rationales used in demonstrations highlights the importance of maintaining the requisite length of inference

    • The advantages of increasing reasoning steps are task-dependent, with simpler tasks requiring fewer steps whereas complex tasks gain significantly from longer inference sequences.

  30. TrustLLM: Trustworthiness in Large Language Models, Lichao Sun,Yue Huang,Haoran Wang,Siyuan Wu,Qihui Zhang,Chujie Gao,Yixin Huang,Wenhan Lyu,Yixuan Zhang,Xiner Li,Zhengliang Liu,Yixin Liu,Yijue Wang,Zhikun Zhang,Bhavya Kailkhura,Caiming Xiong,Chaowei Xiao,Chunyuan Li,Eric Xing,Furong Huang,Hao Liu,Heng Ji,Hongyi Wang,Huan Zhang,Huaxiu Yao,Manolis Kellis,Marinka Zitnik,Meng Jiang,Mohit Bansal,James Zou,Jian Pei,Jian Liu,Jianfeng Gao,Jiawei Han,Jieyu Zhao,Jiliang Tang,Jindong Wang,John Mitchell,Kai Shu,Kaidi Xu,Kai-Wei Chang,Lifang He,Lifu Huang,Michael Backes,Neil Zhenqiang Gong,Philip S. Yu,Pin-Yu Chen,Quanquan Gu,Ran Xu,Rex Ying,Shuiwang Ji,Suman Jana,Tianlong Chen,Tianming Liu,Tianyi Zhou,Willian Wang,Xiang Li,Xiangliang Zhang,Xiao Wang,Xing Xie,Xun Chen,Xuyu Wang,Yan Liu,Yanfang Ye,Yinzhi Cao,Yong Chen,Yue Zhao, 10-01-2024


    Computation and Language


    Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

    Bullet Points

    • The paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, which includes principles and benchmarks for different dimensions

    • It also discusses open challenges and future directions, and emphasizes the importance of transparency in the models and technologies involved in their effectiveness.

  31. Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems, Tianyu Cui,Yanling Wang,Chuanpu Fu,Yong Xiao,Sijia Li,Xinhao Deng,Yunpeng Liu,Qinglin Zhang,Ziyi Qiu,Peiyang Li,Zhixing Tan,Junwu Xiong,Xinyu Kong,Zujie Wen,Ke Xu,Qi Li, 11-01-2024


    Computation and Language, Artificial Intelligence


    Large language models (LLMs) have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become the major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed the corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also made lots of efforts on responsible LLMs. Therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM participants embrace a systematic perspective to build their responsible LLM systems.

    Bullet Points

    • The paper proposes a comprehensive taxonomy that systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies

    • It also reviews prevalent benchmarks to facilitate the risk assessment of LLM systems.

  32. Seven Failure Points When Engineering a Retrieval Augmented Generation System, Scott Barnett,Stefanus Kurniawan,Srikanth Thudumu,Zach Brannelly,Mohamed Abdelrazek, 11-01-2024


    Software Engineering


    Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing the documents to a large language model (LLM) such as ChatGPT to extract the right answer using an LLM. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with meta-data. However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems from three case studies from separate domains: research, education, and biomedical. We share the lessons learned and present 7 failure points to consider when designing a RAG system. The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than designed in at the start. We conclude with a list of potential research directions on RAG systems for the software engineering community.

    Bullet Points

    • Software engineers are using Retrieval Augmented Generation (RAG) to add semantic search capabilities to applications

    • RAG systems aim to reduce the problem of hallucinated responses from LLMs, link sources/references to generated responses, and remove the need for annotating documents with meta-data

    • However, the RAG system suffers from limitations inherent to information retrieval systems and from reliance on LLM

    • This paper presents an experience report on the failure points of RAG Systems from three case studies from different domains: research, education, and biomedical

    • We share the lessons learned and present 7 failure points to consider when designing a RAG System

    • The two key takeaways arising from our work are: 1) Validation is only feasible during operation

      1. Robustness of aRAG system evolves rather than designed in at the start.
  33. The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, Matthew Renze,Erhan Guven, 11-01-2024


    Computation and Language, Artificial Intelligence


    In this paper, we introduce Concise Chain-of-Thought (CCoT) prompting. We compared standard CoT and CCoT prompts to see how conciseness impacts response length and correct-answer accuracy. We evaluated this using GPT-3.5 and GPT-4 with a multiple-choice question-and-answer (MCQA) benchmark. CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance. However, on math problems, GPT-3.5 with CCoT incurs a performance penalty of 27.69%. Overall, CCoT leads to an average per-token cost reduction of 22.67%. These results have practical implications for AI systems engineers using LLMs to solve real-world problems with CoT prompt-engineering techniques. In addition, these results provide more general insight for AI researchers studying the emergent behavior of step-by-step reasoning in LLMs.

    Bullet Points

    • The paper presents Concise Chain-of-Thought (CCoT) prompting, which reduces response length and correct-answer accuracy by 48.70% for both GPT-3.5 and GPT-4 with a MCQA benchmark

    • The results have practical implications for AI systems engineers using LLMs to solve real-world problems with CoT prompt-engineering techniques, as well as general insight for AI researchers studying the emergent behavior of step-by-step reasoning.

  34. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, Yi Zeng,Hongpeng Lin,Jingwen Zhang,Diyi Yang,Ruoxi Jia,Weiyan Shi, 12-01-2024


    Computation and Language, Artificial Intelligence


    Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs

    Bullet Points

    • The paper explores how to persuade LLMs to jailbreak them as human-like communicators, and explores the intersection between everyday language interaction and AI safety

    • The paper proposes a persuasive adversarial prompt taxonomy derived from social science research, and uses it to automatically generate interpretable persuasive antagonistic prompts (PAP)

    • The results show that PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused attacks

    • On the defense side, it explores various mechanisms against PAP and advocates for more fundamental mitigation for highly interactive LLM.

  35. Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender, Yuqi Zhang,Liang Ding,Lefei Zhang,Dacheng Tao, 12-01-2024


    Computation and Language



  36. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, Angels Balaguer,Vinamra Benara,Renato Luiz de Freitas Cunha,Roberto de M. Estevão Filho,Todd Hendry,Daniel Holstein,Jennifer Marsman,Nick Mecklenburg,Sara Malvar,Leonardo O. Nunes,Rafael Padilha,Morris Sharp,Bruno Silva,Swati Sharma,Vijay Aski,Ranveer Chandra, 16-01-2024


    Computation and Language, Machine Learning


    There are two common ways in which developers are incorporating proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and Fine-Tuning. RAG augments the prompt with the external data, while fine-Tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. Our pipeline consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. We propose metrics to assess the performance of different stages of the RAG and fine-Tuning pipeline. We conduct an in-depth study on an agricultural dataset. Agriculture as an industry has not seen much penetration of AI, and we study a potentially disruptive application - what if we could provide location-specific insights to a farmer? Our results show the effectiveness of our dataset generation pipeline in capturing geographic-specific knowledge, and the quantitative and qualitative benefits of RAG and fine-tuning. We see an accuracy increase of over 6 p.p. when fine-tuning the model and this is cumulative with RAG, which increases accuracy by 5 p.p. further. In one particular experiment, we also demonstrate that the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.

    Bullet Points

    • The paper proposes a pipeline for fine-tuning and RAG, which involves extracting information from PDFs, generating questions and answers, and leveraging GPT-4 for evaluating the results

    • The pipeline consists of multiple stages, including extracting data from PDF and generating answers

    • The results demonstrate the effectiveness of the dataset generation pipeline in capturing geographic-specific knowledge and the quantitative and qualitative benefits of RAG and Fine-Tuning, with an accuracy increase of over 6 p.p

    • and cumulative with RAG

    • In an in-depth study on an agricultural dataset, the results demonstrate that the finetuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%

    • This suggests that systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of L

  37. MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World, Yining Hong,Zishuo Zheng,Peihao Chen,Yian Wang,Junyan Li,Chuang Gan, 16-01-2024


    Computer Vision, Artificial Intelligence, Computation and Language, Machine Learning, Robotics


    Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. In the inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.

    Bullet Points

    • MultiPLY is a multisensory embodied large language model that incorporates multisensorial interactive data, including visual, audio, tactile, and thermal information, into large language models to establish the correlation among words, actions, and percepts

    • We first collect Multisensory Universe dataset by deploying an LLM-powered agent to engage with the 3D environment

    • To perform instruction tuning with pre-trained LLM on generated data, we encode 3D scenes as abstracted object-centric representations and introduce action tokens and state tokens that represent the multisensorious state observations of the agent at each time step

    • Inference time could be generated by generating instruction tokens, instructing the agent to take the action in the environment and obtain the next multiplesensory state observation

    • The observation is then appended back to the LLM via State tokens to generate subsequent text or action token

    • This model outperforms baselines by

  38. A Survey of Resource-efficient LLM and Multimodal Foundation Models, Mengwei Xu,Wangsong Yin,Dongqi Cai,Rongjie Yi,Daliang Xu,Qipeng Wang,Bingyang Wu,Yihao Zhao,Chen Yang,Shihe Wang,Qiyang Zhang,Zhenyan Lu,Li Zhang,Shangguang Wang,Yuanchun Li,Yunxin Liu,Xin Jin,Xuanzhe Liu, 16-01-2024


    Machine Learning, Artificial Intelligence, Distributed, Parallel, and Cluster Computing


    Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.

    Bullet Points

    • The survey focuses on developing resource-efficient strategies to support the growth of large foundation models in a sustainable and scalable way, examining both algorithmic and systemic aspects

    • It provides an overview of current approaches to tackling resource challenges and potential future breakthroughs in this field.

  39. Knowledge Fusion of Large Language Models, Fanqi Wan,Xinting Huang,Deng Cai,Xiaojun Quan,Wei Bi,Shuming Shi, 19-01-2024


    Computation and Language


    While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{this https URL}.

    Bullet Points

    • Knowledge fusion for LLMs is a cost-effective and compelling approach to combining existing LLM capabilities and transferring them into a single LLM

    • By leveraging generative distributions, we externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM, and we validate our approach using three popular LLM architectures - Llama-2, MPT, and OpenLLaMA - across various benchmarks and tasks

    • Our code, model weights, and data are public at urlthis https URL.

  40. MM-LLMs: Recent Advances in MultiModal Large Language Models, Duzhen Zhang,Yahan Yu,Chenxing Li,Jiahua Dong,Dan Su,Chenhui Chu,Dong Yu, 24-01-2024


    Computation and Language


    In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions of $26$ existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

    Bullet Points

    • The paper provides a comprehensive survey to facilitate further research of MM-LLMs, including general design formulations, introductions of existing models, performance on mainstream benchmarks, key training recipes, and a real-time tracking website

    • The survey encourages further research and contributes to the ongoing advancement of the domain.

  41. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, Hongliang He,Wenlin Yao,Kaixin Ma,Wenhao Yu,Yong Dai,Hongming Zhang,Zhenzhong Lan,Dong Yu, 25-01-2024


    Computation and Language, Artificial Intelligence


    The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.

    Bullet Points

    • We introduce WebVoyager, an LMM-powered web agent that can complete user instructions end-to-end by interacting with real-world websites

    • We propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V

    • We create a benchmark by gathering real-life tasks from 15 widely used websites to evaluate our agents, achieving a 55.7% task success rate

    • Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real world setting.

  42. The Power of Noise: Redefining Retrieval for RAG Systems, Florin Cuconasu,Giovanni Trappolini,Federico Siciliano,Simone Filice,Cesare Campagnano,Yoelle Maarek,Nicola Tonellotto,Fabrizio Silvestri, 26-01-2024


    Information Retrieval, Computation and Language


    Retrieval-Augmented Generation (RAG) systems represent a significant advancement over traditional Large Language Models (LLMs). RAG systems enhance their generation ability by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming the limitations of standard LLMs, which are restricted to their pre-trained knowledge and limited context window. Most research in this area has predominantly concentrated on the generative aspect of LLMs within RAG systems. Our study fills this gap by thoroughly and critically analyzing the influence of IR components on RAG systems. This paper analyzes which characteristics a retriever should possess for an effective RAG's prompt formulation, focusing on the type of documents that should be retrieved. We evaluate various elements, such as the relevance of the documents to the prompt, their position, and the number included in the context. Our findings reveal, among other insights, that including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality. These results underscore the need for developing specialized strategies to integrate retrieval with language generation models, thereby laying the groundwork for future research in this field.

    Bullet Points

    • Retrieval-Augmented Generation (RAG) systems enhance their generation ability by incorporating external data retrieved through an Information Retriation (IR) phase, overcoming limitations of traditional LLMs

    • The paper analyzes the influence of IR components on RAG systems, analyzing which characteristics a retriever should possess for an effective RAG prompt formulation, focusing on the type of documents that should be retrieved

    • Including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality

    • Specified strategies should be developed to integrate retrieval with language generation models, laying the groundwork for future research in this field.

  43. A Comprehensive Survey of Compression Algorithms for Language Models, Seungcheol Park,Jaehyeon Choi,Sojin Lee,U Kang, 27-01-2024


    Computation and Language, Artificial Intelligence, Natural Language Processing, Natural Language Processing


    How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is rapidly growing to benefit from remarkable advances of recent language models without side effects due to the gigantic size of language models, such as increased carbon emissions and expensive maintenance fees. While numerous compression algorithms have shown remarkable progress in compressing language models, it ironically becomes challenging to capture emerging trends and identify the fundamental concepts underlying them due to the excessive number of algorithms. In this paper, we survey and summarize diverse compression algorithms including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trend of diverse compression algorithms but also select representative algorithms and provide in-depth analyses of them. We discuss the value of each category of compression algorithms, and the desired properties of low-cost compression algorithms which have a significant impact due to the emergence of large language models. Finally, we introduce promising future research topics based on our survey results.

    Bullet Points

    • To compress language models without sacrificing accuracy, we can use diverse compression algorithms such as pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design

    • The paper surveys and summarizes diverse algorithms, selects representative algorithms, provides in-depth analyses, discusses the value of each category of compression algorithms, and proposes promising future research topics based on our survey results.

  44. OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, Fuzhao Xue,Zian Zheng,Yao Fu,Jinjie Ni,Zangwei Zheng,Wangchunshu Zhou,Yang You, 29-01-2024


    Computation and Language, Artificial Intelligence, Distributed, Parallel, and Cluster Computing, Machine Learning


    One more important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.

    Bullet Points

    • The study found that routing decisions in OpenMoE models are predominantly based on token IDs, with minimal context relevance

    • The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged

    • This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations

    • The study proposes potential strategies for mitigating these issues and improving off-the-shelf MoE LLM designs.

  45. Corrective Retrieval Augmented Generation, Shi-Qi Yan,Jia-Chen Gu,Yun Zhu,Zhen-Hua Ling, 29-01-2024


    Computation and Language


    Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.

    Bullet Points

    • We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of LLM generation by utilizing a lightweight retrieval evaluator, large-scale web searches, and a decompose algorithm to selectively focus on key information and filter out irrelevant information in retrieved documents

    • CRAG is a plug-and-play approach that can be seamlessly coupled with various RAG-based approaches.

  46. Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?, Xue-Yong Fu,Md Tahmid Rahman Laskar,Elena Khasanova,Cheng Chen,Shashi Bhushan TN, 01-02-2024


    Computation and Language


    Large Language Models (LLMs) have demonstrated impressive capabilities to solve a wide range of tasks without being explicitly fine-tuned on task-specific datasets. However, deploying LLMs in the real world is not trivial, as it requires substantial computing resources. In this paper, we investigate whether smaller, compact LLMs are a good alternative to the comparatively Larger LLMs2 to address significant costs associated with utilizing LLMs in the real world. In this regard, we study the meeting summarization task in a real-world industrial environment and conduct extensive experiments by comparing the performance of fine-tuned compact LLMs (e.g., FLAN-T5, TinyLLaMA, LiteLLaMA) with zero-shot larger LLMs (e.g., LLaMA-2, GPT-3.5, PaLM-2). We observe that most smaller LLMs, even after fine-tuning, fail to outperform larger zero-shot LLMs in meeting summarization datasets. However, a notable exception is FLAN-T5 (780M parameters), which performs on par or even better than many zero-shot Larger LLMs (from 7B to above 70B parameters), while being significantly smaller. This makes compact LLMs like FLAN-T5 a suitable cost-efficient solution for real-world industrial deployment.

    Bullet Points

    • The paper investigates if smaller, compact LLMs are a good alternative to Larger LLM2 to address significant costs associated with deploying them in the real world

    • We study meeting summarization tasks in a real-world industrial environment and compare the performance of fine-tuned compact and zero-shot LargerLLMs

    • FLAN-T5 performs on par or even better than many Zero-Shot Larger MLMs, making it a suitable cost-efficient solution for real-life industrial deployment.

  47. TravelPlanner: A Benchmark for Real-World Planning with Language Agents, Jian Xie,Kai Zhang,Jiangjie Chen,Tinghui Zhu,Renze Lou,Yuandong Tian,Yanghua Xiao,Yu Su, 02-02-2024


    Computation and Language


    Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks-even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.

    Bullet Points

    • TravelPlanner is a new planning benchmark that focuses on travel planning, a common real-world planning scenario

    • It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans

    • However, the current language agents are not yet capable of handling such complex planning tasks, even GPT-4 only achieves a success rate of 0.6%

    • The mere possibility for language agents to tackle such a complex problem is non-trivial progress.

  48. More Agents Is All You Need, Junyou Li,Qin Zhang,Yangbin Yu,Qiang Fu,Deheng Ye, 03-02-2024


    Computation and Language, Artificial Intelligence, Machine Learning


    We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence. Our code is publicly available at: \url{}.

    Bullet Points

    • A sampling-and-voting method is used to measure the performance of large language models (LLMs) with the number of agents instantiated

    • This method is orthogonal to existing complicated methods to enhance LLMs, and the degree of enhancement is correlated to task difficulty

    • We conduct extensive experiments on LLM benchmarks to verify the presence of our finding and study the properties that can facilitate its occurrence

    • Our code is publicly available at

  49. Large Language Model for Table Processing: A Survey, Weizheng Lu,Jiaming Zhang,Jing Zhang,Yueguo Chen, 04-02-2024


    Artificial Intelligence, Computation and Language


    Tables, typically two-dimensional and structured to store large amounts of data, are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables. Automating these table-centric tasks with Large Language Models (LLMs) offers significant public benefits, garnering interest from academia and industry. This survey provides an extensive overview of table tasks, encompassing not only the traditional areas like table question answering (Table QA) and fact verification, but also newly emphasized aspects such as table manipulation and advanced table data analysis. Additionally, it goes beyond the early strategies of pre-training and fine-tuning small language models, to include recent paradigms in LLM usage. The focus here is particularly on instruction-tuning, prompting, and agent-based approaches within the realm of LLMs. Finally, we highlight several challenges, ranging from private deployment and efficient inference to the development of extensive benchmarks for table manipulation and advanced data analysis.

    Bullet Points

    • Tables are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables

    • LLMs are used to automate table-centric tasks, gaining public interest from academia and industry

    • This survey provides an overview of table tasks, including traditional areas like table question answering and fact verification, as well as newly emphasized aspects such as table manipulation and advanced table data analysis

    • The survey also includes recent paradigms in LLM usage, including instruction-tuning, prompting, and agent-based approaches

    • Challenges include private deployment, efficient inference, and the development of benchmarks for table manipulation.

  50. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, Pranab Sahoo,Ayush Kumar Singh,Sriparna Saha,Vinija Jain,Samrat Mondal,Aman Chadha, 05-02-2024


    Artificial Intelligence, Computation and Language, Human-Computer Interaction


    Prompt engineering has emerged as an indispensable technique for extending the capabilities of large language models (LLMs) and vision-language models (VLMs). This approach leverages task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters. Rather than updating the model parameters, prompts allow seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt. Prompts can be natural language instructions that provide context to guide the model or learned vector representations that activate relevant knowledge. This burgeoning field has enabled success across various applications, from question-answering to commonsense reasoning. However, there remains a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques. This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area. For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized. We also delve into the strengths and limitations of each approach and include a taxonomy diagram and table summarizing datasets, models, and critical points of each prompting technique. This systematic analysis enables a better understanding of this rapidly developing field and facilitates future research by illuminating open challenges and opportunities for prompt engineering.

    Bullet Points

    • Prompt engineering is a technique that extends the capabilities of LLMs and vision-language models by using task-specific instructions, such as prompts, to enhance model efficacy without modifying the model parameters

    • It allows seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt

    • This field has enabled success across various applications, from question-answering to commonsense reasoning

    • However, there is still a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques

    • This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area

    • For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized

    • We also explore the strengths and limitations of each approach and include a taxonomy diagram and table summar

  51. AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Yu Du,Fangyun Wei,Hongyang Zhang, 06-02-2024


    Computation and Language


  52. Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning, Yanfang Zhang,Yiliu Sun,Yibing Zhan,Dapeng Tao,Dacheng Tao,Chen Gong, 06-02-2024


    Computation and Language, Artificial Intelligence


    Recently, increasing attention has been focused drawn on to improve the ability of Large Language Models (LLMs) to perform complex reasoning. However, previous methods, such as Chain-of-Thought and Self-Consistency, mainly follow Direct Reasoning (DR) frameworks, so they will meet difficulty in solving numerous real-world tasks which can hardly be solved via DR. Therefore, to strengthen the reasoning power of LLMs, this paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematic proof. Specifically, our methodology comprises two steps. Firstly, we leverage the logical equivalence of contrapositive to augment the data and rules to enhance the comprehensibility of LLMs. Secondly, we design a set of prompt templates to trigger LLMs to conduct IR based on proof by contradiction that is logically equivalent to the original DR process. Our IR method is simple yet effective and can be straightforwardly integrated with existing DR methods to further boost the reasoning abilities of LLMs. The experimental results on popular LLMs, such as GPT-3.5-turbo and Gemini-pro, show that our IR method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43%, when compared with traditional DR methods. Moreover, the methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.

    Bullet Points

    • The paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematic proof

    • The methodology involves leveraging the logical equivalence of counterpositives to enhance the comprehensibility of LLMs, and designing prompt templates to trigger them to conduct IR based on proof by contradiction that is logically equivalent to the original DR process

    • The IR method is simple yet effective, and can be easily integrated with existing DR methods to further boost the reasoning abilities

    • Experimental results show that our method enhances the overall accuracy of factual Reasoning by 27.33% and mathematical proof by 31.43% when compared with traditional DRM methods

    • The methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.

  53. LLM Agents can Autonomously Hack Websites, Richard Fang,Rohan Bindu,Akul Gupta,Qiusi Zhan,Daniel Kang, 06-02-2024


    Cryptography and Security, Artificial Intelligence


    In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.

    Bullet Points

    • LLM agents can autonomously hack websites without human feedback, using frontier models that are highly capable of tool use and leveraging extended context

    • GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild, which raises questions about the widespread deployment of LLMs.

  54. Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning, Yanfang Zhang,Yiliu Sun,Yibing Zhan,Dapeng Tao,Dacheng Tao,Chen Gong, 06-02-2024


    Computation and Language, Artificial Intelligence


    Recently, increasing attention has been focused drawn on to improve the ability of Large Language Models (LLMs) to perform complex reasoning. However, previous methods, such as Chain-of-Thought and Self-Consistency, mainly follow Direct Reasoning (DR) frameworks, so they will meet difficulty in solving numerous real-world tasks which can hardly be solved via DR. Therefore, to strengthen the reasoning power of LLMs, this paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematic proof. Specifically, our methodology comprises two steps. Firstly, we leverage the logical equivalence of contrapositive to augment the data and rules to enhance the comprehensibility of LLMs. Secondly, we design a set of prompt templates to trigger LLMs to conduct IR based on proof by contradiction that is logically equivalent to the original DR process. Our IR method is simple yet effective and can be straightforwardly integrated with existing DR methods to further boost the reasoning abilities of LLMs. The experimental results on popular LLMs, such as GPT-3.5-turbo and Gemini-pro, show that our IR method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43%, when compared with traditional DR methods. Moreover, the methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.

    Bullet Points

    • The paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematic proof

    • The methodology involves leveraging the logical equivalence of counterpositives to enhance the comprehensibility of LLMs, and designing prompt templates to trigger them to conduct IR based on proof by contradiction that is logically equivalent to the original DR process

    • The IR method is simple yet effective, and can be easily integrated with existing DR methods to further boost the reasoning abilities

    • Experimental results show that our method enhances the overall accuracy of factual Reasoning by 27.33% and mathematical proof by 31.43% when compared with traditional DRM methods

    • The methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.

  55. In-Context Principle Learning from Mistakes, Tianjun Zhang,Aman Madaan,Luyu Gao,Steven Zheng,Swaroop Mishra,Yiming Yang,Niket Tandon,Uri Alon, 08-02-2024


    Computation and Language, Artificial Intelligence


    In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these few examples; then we reflect on these mistakes, and learn explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes; finally, we prompt the model to answer unseen test questions using the original few-shot examples and these learned general principles. We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4 turbo and Claude-2.1. For example, LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DROP, and by 3.3% in HotpotQA. Importantly, LEAP does not require any more input or examples than the standard few-shot prompting settings.

    Bullet Points

    • In-context learning (ICL) is a standard method of adapting LLMs to downstream tasks by learning from a few input-output examples

    • However, all ICL-based approaches only learn from correct input_output pairs

    • In this paper, we revisit this paradigm by learning more from the few given inputs

    • We introduce Learning Principles (LEAP) by intentionally inducing the model to make mistakes on these few examples, reflecting on these mistakes, and learning explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes

    • We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH)

    • LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DR

  56. How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis, Federico Bianchi,Patrick John Chia,Mert Yuksekgonul,Jacopo Tagliabue,Dan Jurafsky,James Zou, 08-02-2024


    Artificial Intelligence, Computation and Language, Computer Science and Game Theory


    Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations). Each scenario allows for multiple turns of flexible dialogues between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, \NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLM's theory of mind, irrationality, and reasoning abilities.

    Bullet Points

    • The paper explores how LLM agents can negotiate and develops NegotiationArena, a flexible framework for evaluating and probing their negotiation abilities

    • Three scenarios were implemented to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations)

    • LLMs can significantly boost their negotiation outcomes by employing certain behavioral tactics, such as pretending to be desolate and desperate, and quantifying irrational negotiation behaviors.

  57. Large Language Models: A Survey, Shervin Minaee,Tomas Mikolov,Narjes Nikzad,Meysam Chenaghlu,Richard Socher,Xavier Amatriain,Jianfeng Gao, 09-02-2024


    Computation and Language, Artificial Intelligence


    Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

    Bullet Points

    • LLMs have gained attention due to their strong performance on natural language tasks and their ability to general-purpose language understanding and generation

    • They are developed by training billions of parameters on massive amounts of text data, as predicted by scaling laws citekaplan2020scaling,hoffmann2022training

    • Their research area is evolving rapidly in many different ways

    • The paper reviews some of the most prominent LLM families, including GPT, LLaMA, and PaLM, and discusses their characteristics, contributions, and limitations

    • We also provide an overview of techniques developed to build and augment LLM models, survey popular datasets prepared for training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare their performance on a set of representative benchmarks

    • Finally, the paper concludes by discussing open challenges and future research directions.

  58. OS-Copilot: Towards Generalist Computer Agents with Self-Improvement, Zhiyong Wu,Chengcheng Han,Zichen Ding,Zhenmin Weng,Zhoumianze Liu,Shunyu Yao,Tao Yu,Lingpeng Kong, 12-02-2024


    Artificial Intelligence


    Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks. We also present numerical and quantitative evidence that FRIDAY learns to control and self-improve on Excel and Powerpoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

    Bullet Points

    • OS-Copilot is a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications

    • It is used to create FRIDAY, a self-improving embodied agent for automating general computer tasks, which outperforms previous methods by 35% on GAIA, showcasing strong generalization to unseen applications via accumulated skills from previous tasks

    • The framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

  59. Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model, Ahmet Üstün,Viraat Aryabumi,Zheng-Xin Yong,Wei-Yin Ko,Daniel D'souza,Gbemileke Onilude,Neel Bhandari,Shivalika Singh,Hui-Lee Ooi,Amr Kayid,Freddie Vargus,Phil Blunsom,Shayne Longpre,Niklas Muennighoff,Marzieh Fadaee,Julia Kreutzer,Sara Hooker, 12-02-2024


    Computation and Language


  60. DoRA: Weight-Decomposed Low-Rank Adaptation, Shih-Yang Liu,Chien-Yi Wang,Hongxu Yin,Pavlo Molchanov,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Min-Hung Chen, 14-02-2024


    Computation and Language, Computer Vision


    Among the widely used parameter-efficient finetuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed LowRank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.

    Bullet Points

    • The work proposes Weight-Decomposed LowRank Adaptation (DoRA) to investigate the accuracy gap between parameter-efficient finetuning (PEFT) methods and full fine-tuned (FT)

    • DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tinging, enhancing both learning capacity and training stability while avoiding inference overhead

    • DoRA consistently outperforms LoRA on fine-timing LLaMA, LLlaVA, and VL-BART on downstream tasks, including commonsense reasoning and visual instruction tuning.

  61. How to Train Data-Efficient LLMs, Noveen Sachdeva,Benjamin Coleman,Wang-Cheng Kang,Jianmo Ni,Lichan Hong,Ed H. Chi,James Caverlee,Julian McAuley,Derek Zhiyuan Cheng, 15-02-2024


    Machine Learning, Artificial Intelligence, Computation and Language


    The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.

    Bullet Points

    • The paper studies data-efficient approaches for pre-training LLMs to optimize the Pareto frontier of model quality and training resource/data consumption

    • The paper aims to understand tradeoffs associated with data selection routines based on expensive-to-compute data-quality estimates and maximization of coverage and diversity-based measures in the feature space

    • The first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLM models to directly assess the quality of a training example

    • To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample

    • Models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, while converging up to 70% faster.

  62. Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models, Yijia Shao,Yucheng Jiang,Theodore A. Kanell,Peter Xu,Omar Khattab,Monica S. Lam, 22-02-2024


    Computation and Language, Artificial Intelligence


    For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM's articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.

    Bullet Points

    • To evaluate the pre-writing stage of STORM's articles, we curate FreshWiki, formulate outline assessments, and gather feedback from experienced editors

    • More articles are organized and broad in coverage compared to an outline-driven retrieval-augmented baseline

    • Expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.

  63. INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models, Hanseok Oh,Hyunji Lee,Seonghyeon Ye,Haebin Shin,Hansol Jang,Changwook Jun,Minjoon Seo, 22-02-2024


    Computation and Language


    Despite the critical need to align search targets with users' intention, retrievers often only prioritize query information without delving into the users' intended search context. Enhancing the capability of retrievers to understand intentions and preferences of users, akin to language model instructions, has the potential to yield more aligned search targets. Prior studies restrict the application of instructions in information retrieval to a task description format, neglecting the broader context of diverse and evolving search scenarios. Furthermore, the prevailing benchmarks utilized for evaluation lack explicit tailoring to assess instruction-following ability, thereby hindering progress in this field. In response to these limitations, we propose a novel benchmark,INSTRUCTIR, specifically designed to evaluate instruction-following ability in information retrieval tasks. Our approach focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Through experimental analysis, we observe that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform compared to their non-instruction-tuned counterparts. This underscores potential overfitting issues inherent in constructing retrievers trained on existing instruction-aware retrieval datasets.

    Bullet Points

    • Retrievers prioritize query information without understanding user intentions and preferences, which hinders progress in information retrieval

    • A new benchmark, INSTRUCTIR, focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios

    • This approach can underperform retrievers fine-tuned to follow task-style instructions, highlighting potential overfitting issues in constructing retrievers trained on existing instruction-aware retrieval datasets.

  64. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, Omkar Thawakar,Ashmal Vayani,Salman Khan,Hisham Cholakal,Rao M. Anwer,Michael Felsberg,Tim Baldwin,Eric P. Xing,Fahad Shahbaz Khan, 26-02-2024


    Computation and Language


  65. A Survey on Data Selection for Language Models, Alon Albalak,Yanai Elazar,Sang Michael Xie,Shayne Longpre,Nathan Lambert,Xinyi Wang,Niklas Muennighoff,Bairu Hou,Liangming Pan,Haewon Jeong,Colin Raffel,Shiyu Chang,Tatsunori Hashimoto,William Yang Wang, 26-02-2024


    Computation and Language, Machine Learning


    To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

    Bullet Points

    • The paper presents a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches

    • The aim is to accelerate progress in data selection by establishing an entry point for new and established researchers

    • The review highlights noticeable holes in the literature and proposes promising research avenues.

  66. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, Shuming Ma,Hongyu Wang,Lingxiao Ma,Lei Wang,Wenhui Wang,Shaohan Huang,Li Dong,Ruiping Wang,Jilong Xue,Furu Wei, 27-02-2024


    Computation and Language, Machine Learning


    Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

    Bullet Points

    • BitNet introduces a 1-bit LLM variant called BitNet b1.58, which matches the full-precision Transformer LLM with the same model size and training tokens, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption

    • This variant defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-efficient, enabling new computation paradigm and opening the door for designing specific hardware optimized for 1-bitLLMs.

  67. Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation, Nihal V. Nayak,Yiyang Nan,Avi Trost,Stephen H. Bach, 28-02-2024


    Computation and Language, Machine Learning


  68. ResLoRA: Identity Residual Mapping in Low-Rank Adaption, Shuhua Shi,Shaohan Huang,Minghui Song,Zhoujun Li,Zihan Zhang,Haizhen Huang,Furu Wei,Weiwei Deng,Feng Sun,Qi Zhang, 28-02-2024


    Computation and Language, Artificial Intelligence


  69. Datasets for Large Language Models: A Comprehensive Survey, Yang Liu,Jiahuan Cao,Chongyu Liu,Kai Ding,Lianwen Jin, 28-02-2024


    Computation and Language, Artificial Intelligence


  70. Retrieval-Augmented Generation for AI-Generated Content: A Survey, Penghao Zhao,Hailin Zhang,Qinhan Yu,Zhengren Wang,Yunteng Geng,Fangcheng Fu,Ling Yang,Wentao Zhang,Bin Cui, 29-02-2024


    Computer Vision


  71. PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval, He Zhu,Wenjia Zhang,Nuoxian Huang,Boyang Li,Luyao Niu,Zipei Fan,Tianle Lun,Yicheng Tao,Junyou Su,Zhaoya Gong,Chenyu Fang,Xing Liu, 29-02-2024


    Computation and Language


    In the field of urban planning, general-purpose large language models often struggle to meet the specific needs of planners. Tasks like generating urban planning texts, retrieving related information, and evaluating planning documents pose unique challenges. To enhance the efficiency of urban professionals and overcome these obstacles, we introduce PlanGPT, the first specialized Large Language Model tailored for urban and spatial planning. Developed through collaborative efforts with institutions like the Chinese Academy of Urban Planning, PlanGPT leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities. Empirical tests demonstrate that PlanGPT has achieved advanced performance, delivering responses of superior quality precisely tailored to the intricacies of urban planning.

    Bullet Points

    • PlanGPT is the first specialized Large Language Model tailored for urban and spatial planning, developed through collaboration with institutions like the Chinese Academy of Urban Planning

    • It leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities, and has achieved advanced performance, delivering responses of superior quality precisely tailored to urban planning intricacies.

  72. ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, Xin Men,Mingyu Xu,Qingyu Zhang,Bingning Wang,Hongyu Lin,Yaojie Lu,Xianpei Han,Weipeng Chen, 06-03-2024


    Computation and Language


    As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.

    Bullet Points

    • The study found that many layers of LLMs exhibit high similarity and some play a negligible role in network functionality

    • We define a metric called Block Influence (BI) to gauge the significance of each layer, and propose a simple pruning approach called ShortGPT

    • This method significantly outperforms previous state-of-the-art (SOTA) methods in model pruning and is orthogonal to quantization-like methods, enabling further reduction in parameters and computation

    • The ability to achieve better results through simple layer removal suggests high redundancy in the model architecture.

  73. Apollo: An Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People, Xidong Wang,Nuo Chen,Junyin Chen,Yan Hu,Yidong Wang,Xiangbo Wu,Anningzhe Gao,Xiang Wan,Haizhou Li,Benyou Wang, 06-03-2024


    Computation and Language, Artificial Intelligence


    Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.

    Bullet Points

    • To extend medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion

    • This includes the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark

    • Released Apollo models at various relatively-small sizes achieve the best performance among models of equivalent size

    • These lite models could be used to improve multi-lingual medical capabilities of larger models without fine-tuning

    • The open-source training corpora, code, model weights, and evaluation benchmark will be created.

  74. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, Jiawei Zhao,Zhenyu Zhang,Beidi Chen,Zhangyang Wang,Anima Anandkumar,Yuandong Tian, 06-03-2024


    Machine Learning


    Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

    Bullet Points

    • Gradient Low-Rank Projection (GaLore) is a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA

    • It reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks

    • Our approach reduces optimizer memory up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline

    • We demonstrate the feasibility of pretraining a 7B model on consumer GPUs with 24GB memory without model parallel, checkpointing, or offloading strategies.

  75. Can Large Language Models Reason and Plan?, Subbarao Kambhampati, 07-03-2024


    Artificial Intelligence, Computation and Language, Machine Learning


    While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.

    Bullet Points

    • LLMs lack self-critiquing ability to correct their own erroneous guesses, despite the ability of humans to do so

    • However, there is no basis for this assumption in the case of humans.

  76. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference, Wei-Lin Chiang,Lianmin Zheng,Ying Sheng,Anastasios Nikolas Angelopoulos,Tianle Li,Dacheng Li,Hao Zhang,Banghua Zhu,Michael Jordan,Joseph E. Gonzalez,Ion Stoica, 07-03-2024


    Artificial Intelligence, Computation and Language


  77. RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation, Zihao Wang,Anji Liu,Haowei Lin,Jiaqi Li,Xiaojian Ma,Yitao Liang, 08-03-2024


    Computation and Language, Artificial Intelligence


  78. LLM4Decompile: Decompiling Binary Code with Large Language Models, Hanzhuo Tan,Qi Luo,Jing Li,Yuqun Zhang, 08-03-2024


    Programming Languages, Computation and Language


  79. AutoDev: Automated AI-Driven Development, Michele Tufano,Anisha Agarwal,Jinu Jang,Roshanak Zilouchian Moghaddam,Neel Sundaresan, 13-03-2024


    Software Engineering, Artificial Intelligence


    The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.

    Bullet Points

    • AutoDev is a fully automated AI-driven software development framework designed for autonomous planning and execution of intricate software engineering tasks

    • It enables users to define complex software engineering objectives and has access to files, compiler output, build and testing logs, static analysis tools, and more

    • It establishes a secure development environment by confining all operations within Docker containers, incorporating guardrails to ensure user privacy and file security

    • The framework achieved promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively.

  80. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, Eric Zelikman,Georges Harik,Yijia Shao,Varuna Jayasiri,Nick Haber,Noah D. Goodman, 14-03-2024


    Computation and Language, Artificial Intelligence, Machine Learning


    When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.

    Bullet Points

    • Quiet-STaR is a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions

    • We propose a tokenwise parallel sampling algorithm using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique to address the computational cost of generating continuations and the need to predict beyond individual next tokens

    • These improvements disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions.

  81. TnT-LLM: Text Mining at Scale with Large Language Models, Mengting Wan,Tara Safavi,Sujay Kumar Jauhar,Yujin Kim,Scott Counts,Jennifer Neville,Siddharth Suri,Chirag Shah,Ryen W White,Longqi Yang,Reid Andersen,Georg Buscher,Dhruv Joshi,Nagu Rangan, 18-03-2024


    Computation and Language, Artificial Intelligence, Information Retrieval


    Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.

    Bullet Points

    • The paper proposes TnT-LLM, a two-phase framework that automates the process of end-to-end label generation and assignment with minimal human effort for any given use-case

    • It employs LLMs as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale

    • Extensive experiments using human and automatic evaluation metrics demonstrate that LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines and achieves a favorable balance between accuracy and efficiency for classification at scale, and shares practical experiences and insights on the challenges and opportunities of using LLM for large-scale text mining in real-world applications.

  82. Evolutionary Optimization of Model Merging Recipes, Takuya Akiba,Makoto Shing,Yujin Tang,Qi Sun,David Ha, 19-03-2024


    Neural and Evolutionary Computing


    We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.

    Bullet Points

    • Evolutionary algorithms are used to automate the creation of powerful foundation models by discovering effective combinations of diverse open-source models without requiring additional training data or compute

    • This approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of individual models

    • Our Japanese Math LLM achieved state-of-the-art performance on established benchmarks, surpassing models with significantly more parameters despite not being explicitly trained for such tasks

    • The approach also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.

  83. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, Yaowei Zheng,Richong Zhang,Junhao Zhang,Yanhan Ye,Zheyan Luo,Yongqiang Ma, 20-03-2024


    Computation and Language, Artificial Intelligence


    and already received over 13,000 stars and 1,600 forks.

    Bullet Points

    • Stars and forks received over 13,000 stars and 1,600 points for the team's performance

    • The team has already received a total of 1,600 stars and a fork for their performance

    • It's a testament to their hard work and dedication to the sport.

  84. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity, Soyeong Jeong,Jinheon Baek,Sukmin Cho,Sung Ju Hwang,Jong C. Park, 21-03-2024


    Computation and Language, Artificial Intelligence


  85. AIOS: LLM Agent Operating System, Kai Mei,Zelong Li,Shuyuan Xu,Ruosong Ye,Yingqiang Ge,Yongfeng Zhang, 25-03-2024


    Operating Systems, Artificial Intelligence, Computation and Language


  86. BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text, Elliot Bolton,Abhinav Venigalla,Michihiro Yasunaga,David Hall,Betty Xiong,Tony Lee,Roxana Daneshjou,Jonathan Frankle,Percy Liang,Michael Carbin,Christopher D. Manning, 27-03-2024


    Computation and Language, Artificial Intelligence


  87. Octopus v2: On-device language model for super agent, Wei Chen,Zhiyuan Li, 02-04-2024


    Computation and Language


    Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method enhances latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.

    Bullet Points

    • Language models have shown effectiveness in automating workflows, but they are often associated with privacy and cost concerns

    • On-device models for function calling face issues with latency and accuracy, and a new method empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%

    • Our method enhances latency by 35-fold and reduces the latency to levels suitable for deployment across edge devices in production environments, aligning with performance requirements for real-world applications.

  88. Social Skill Training with Large Language Models, Diyi Yang,Caleb Ziems,William Held,Omar Shaikh,Michael S. Bernstein,John Mitchell, 05-04-2024


    Computation and Language, Human-Computer Interaction


    People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.

    Bullet Points

    • To make social skill training more accessible, accessible, and inviting, interdisciplinary research from communication and psychology identifies social skill barriers to entry specialized fields, and a solution that leverages large language models via a generic framework is presented

    • The AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback

    • This work calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.

  89. Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers, Libo Qin,Qiguang Chen,Yuhang Zhou,Zhi Chen,Yinghui Li,Lizi Liao,Min Li,Wanxiang Che,Philip S. Yu, 07-04-2024


    Computation and Language


    Multilingual Large Language Models are capable of using powerful Large Language Models to handle and respond to queries in multiple languages, which achieves remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there still remains a lack of a comprehensive survey to summarize existing approaches and recent developments in this field. To this end, in this paper, we present a thorough review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. The contributions of this paper can be summarized: (1) First survey: to our knowledge, we take the first step and present a thorough review in MLLMs research field according to multi-lingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.

    Bullet Points

    • The paper presents a comprehensive review and unified perspective to summarize the recent progress and emerging trends in multilingual large language models (MLLMs) literature, including a first survey, new taxonomy, new frontiers, and abundant open-source resources

    • The paper hopes to provide quick access and spur breakthrough research in these models.

  90. ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past, Van Pham,Scott Cunningham, 11-04-2024

  91. Best Practices and Lessons Learned on Synthetic Data for Language Models, Ruibo Liu,Jerry Wei,Fangyu Liu,Chenglei Si,Yanzhe Zhang,Jinmeng Rao,Steven Zheng,Daiyi Peng,Diyi Yang,Denny Zhou,Andrew M. Dai, 11-04-2024


    Computation and Language


    The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.

    Bullet Points

    • The paper discusses synthetic data research, its applications, challenges, and future directions

    • It presents empirical evidence from prior art to demonstrate its effectiveness and emphasizes the importance of ensuring its factuality, fidelity, and unbiasedness

    • The paper emphasizes responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.

  92. From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples, Robert Vacareanu,Vlad-Andrei Negru,Vasile Suciu,Mihai Surdeanu, 11-04-2024


    Computation and Language, Artificial Intelligence


    We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow from the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret.

    Bullet Points

    • Pre-trained large language models can perform linear and non-linear regression when given in-context examples without any additional training or gradient updates

    • They are able to perform regression tasks with a performance rivaling or even outperforming traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting

    • They also investigate how well their performance scales with the number of in-consumer exemplars and empirically show that LLMs are capable of obtaining a sub-Linear Regret.

  93. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, Jinheon Baek,Sujay Kumar Jauhar,Silviu Cucerzan,Sung Ju Hwang, 11-04-2024


    Computation and Language, Artificial Intelligence, Machine Learning


    Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature. Specifically, starting with a core paper as the primary focus to generate ideas, our ResearchAgent is augmented not only with relevant publications through connecting information over an academic graph but also entities retrieved from an entity-centric knowledge store based on their underlying concepts, mined and shared across numerous papers. In addition, mirroring the human approach to iteratively improving ideas with peer discussions, we leverage multiple ReviewingAgents that provide reviews and feedback iteratively. Further, they are instantiated with human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showcasing its effectiveness in generating novel, clear, and valid research ideas based on human and model-based evaluation results.

    Bullet Points

    • To enhance scientific research productivity, we propose a large language model-powered research idea writing agent called ResearchAgent that automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature

    • It is augmented with relevant publications through connecting information over an academic graph, entities retrieved from an entity-centric knowledge store, peer discussions, multiple ReviewingAgents, and human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments

    • The agent is experimentally validated on scientific publications across multiple disciplines, demonstrating its effectiveness in generating novel, clear, and valid research ideas despite its complexity, slow pace, and need for specialized experts.

  94. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Tianbao Xie,Danyang Zhang,Jixuan Chen,Xiaochuan Li,Siheng Zhao,Ruisheng Cao,Toh Jing Hua,Zhoujun Cheng,Dongchan Shin,Fangyu Lei,Yitao Liu,Yiheng Xu,Shuyan Zhou,Silvio Savarese,Caiming Xiong,Victor Zhong,Tao Yu, 11-04-2024


    Artificial Intelligence, Computation and Language

  95. LLM Agents can Autonomously Exploit One-day Vulnerabilities, Richard Fang,Rohan Bindu,Akul Gupta,Daniel Kang, 11-04-2024


    Cryptography and Security, Artificial Intelligence


    In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.

    Bullet Points

    • LLM agents can exploit one-day vulnerabilities in real-world systems, based on a dataset of 15 vulnerabilities categorized as critical severity in the CVE description

    • GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs), but without the description, it can only exploit 7% of the vulnerabilities

    • The findings raise questions about the widespread deployment of highly capable LLM Agents.

  96. Reducing hallucination in structured outputs via Retrieval-Augmented Generation, Patrice Béchard,Orlando Marquez Ayala, 12-04-2024


    Machine Learning, Artificial Intelligence, Computation and Language, Information Retrieval


    A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLM) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.

    Bullet Points

    • We developed a system leveraging Retrieval Augmented Generation (RAG) to improve the quality of structured output representing natural language workflows, which reduces hallucinations and improves generalization of LLM in out-of-domain settings

    • Additionally, using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, making deployments of GLM-based systems less resource-intensive.

  97. Is ChatGPT Transforming Academics' Writing Style?, Mingmeng Geng,Roberto Trotta, 12-04-2024


    Computation and Language, Artificial Intelligence, Digial Libraries, Machine Learning


    Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.

    Bullet Points

    • The textual density of ChatGPT's writing style in arXiv abstracts is assessed using a statistical analysis of word frequency changes

    • The model is calibrated and validated on a mixture of real abstracts and chatGPT-modified abstracts after a careful noise analysis

    • The fraction of chatGPP-revised abstracts estimated to be approximately 35%, and the output of one of the simplest prompts, "revise the following sentences" as a baseline

    • The analysis of both positive and negative aspects of the penetration into academics' writing style is conducted.

  98. Pre-training Small Base LMs with Fewer Tokens, Sunny Sanyal,Sujay Sanghavi,Alexandros G. Dimakis, 12-04-2024


    Computation and Language, Artificial Intelligence, Machine Learning

  99. Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, Xuezhe Ma,Xiaomeng Yang,Wenhan Xiong,Beidi Chen,Lili Yu,Hao Zhang,Jonathan May,Luke Zettlemoyer,Omer Levy,Chunting Zhou, 12-04-2024


    Machine Learning, Computation and Language

  100. A Survey on Retrieval-Augmented Text Generation for Large Language Models, Yizheng Huang,Jimmy Huang, 17-04-2024


    Information Retrieval, Artificial Intelligence, Computation and Language


    Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.

    Bullet Points

    • Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning to address static limitations of LLMs by enabling the dynamic integration of up-to-date external information

    • This method focuses on the text domain and provides a cost-effective solution to the generation of plausible but incorrect responses by LLM

    • The paper organizes the RAG paradigm into four categories: pre-retreeval, retrieval, post-retraceval and generation, offering a detailed perspective from the retrieval viewpoint

    • It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies

    • It introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions

    • The study aims to consolidate existing research, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LMLs.

  101. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin,Sam Ade Jacobs,Ammar Ahmad Awan,Jyoti Aneja,Ahmed Awadallah,Hany Awadalla,Nguyen Bach,Amit Bahree,Arash Bakhtiari,Harkirat Behl,Alon Benhaim,Misha Bilenko,Johan Bjorck,Sébastien Bubeck,Martin Cai,Caio César Teodoro Mendes,Weizhu Chen,Vishrav Chaudhary,Parul Chopra,Allie Del Giorno,Gustavo de Rosa,Matthew Dixon,Ronen Eldan,Dan Iter,Amit Garg,Abhishek Goswami,Suriya Gunasekar,Emman Haider,Junheng Hao,Russell J. Hewett,Jamie Huynh,Mojan Javaheripi,Xin Jin,Piero Kauffmann,Nikos Karampatziakis,Dongwoo Kim,Mahoud Khademi,Lev Kurilenko,James R. Lee,Yin Tat Lee,Yuanzhi Li,Chen Liang,Weishung Liu,Eric Lin,Zeqi Lin,Piyush Madan,Arindam Mitra,Hardik Modi,Anh Nguyen,Brandon Norick,Barun Patra,Daniel Perez-Becker,Thomas Portet,Reid Pryzant,Heyang Qin,Marko Radmilac,Corby Rosset,Sambudha Roy,Olatunji Ruwase,Olli Saarikivi,Amin Saied,Adil Salim,Michael Santacroce,Shital Shah,Ning Shang,Hiteshi Sharma,Xia Song,Masahiro Tanaka,Xin Wang,Rachel Ward,Guanhua Wang,Philipp Witte,Michael Wyatt,Can Xu,Jiahang Xu,Sonali Yadav,Fan Yang,Ziyi Yang,Donghan Yu,Chengruidong Zhang,Cyril Zhang,Jianwen Zhang,Li Lyna Zhang,Yi Zhang,Yue Zhang,Yunan Zhang,Xiren Zhou, 22-04-2024


    Computation and Language, Artificial Intelligence


    We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).

    Bullet Points

    • Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens

    • Its performance compares to models like Mixtral 8x7B and GPT-3.5, despite being small enough to be deployed on a phone

    • The innovation lies in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data

    • The model is also aligned for robustness, safety, and chat format

    • Initial parameter-scaling results are provided with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi3-medium.

  102. Better Synthetic Data by Retrieving and Transforming Existing Datasets, Saumya Gandhi,Ritu Gala,Vijay Viswanathan,Tongshuang Wu,Graham Neubig, 22-04-2024


    Computation and Language

  103. OpenELM: An Efficient Language Model Family with Open Training and Inference Framework, Sachin Mehta,Mohammad Hossein Sekhavat,Qingqing Cao,Maxwell Horton,Yanzi Jin,Chenfan Sun,Iman Mirzadeh,Mahyar Najibi,Dmitry Belenko,Peter Zatloukal,Mohammad Rastegari, 22-04-2024


    Computation and Language, Artificial Intelligence, Machine Learning

  104. Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice, Ranim Khojah,Mazen Mohamad,Philipp Leitner,Francisco Gomes de Oliveira Neto, 23-04-2024


    Software Engineering, Artificial Intelligence, Computation and Language, Human-Computer Interaction, Machine Learning


    Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.

    Bullet Points

    • LLMs are commonly discussed in academia and the general public as support tools for software engineering

    • However, there is little empirical evidence regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry

    • An observational study of 24 professional software engineers who have been using chatGPT over a week in their jobs and qualitatively analysed their dialogues with the chatbot as well as their overall experience

    • The researchers propose a theoretical framework for how the purpose of the interaction, internal factors, and external factors shape the experience in terms of perceived usefulness and trust

    • This framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners and serve as a reference point for future empirical LLM research in this domain.

  105. Autonomous LLM-driven research from data to human-verifiable research papers, Tal Ifargan,Lukas Hafner,Maor Kern,Ori Alcalay,Roy Kishony, 24-04-2024


    Quantitative Biology, Artificial Intelligence


    As AI promises to accelerate scientific discovery, it remains unclear whether fully AI-driven research is possible and whether it can adhere to key scientific values, such as transparency, traceability and verifiability. Mimicking human scientific practices, we built data-to-paper, an automation platform that guides interacting LLM agents through a complete stepwise research process, while programmatically back-tracing information flow and allowing human oversight and interactions. In autopilot mode, provided with annotated data alone, data-to-paper raised hypotheses, designed research plans, wrote and debugged analysis codes, generated and interpreted results, and created complete and information-traceable research papers. Even though research novelty was relatively limited, the process demonstrated autonomous generation of de novo quantitative insights from data. For simple research goals, a fully-autonomous cycle can create manuscripts which recapitulate peer-reviewed publications without major errors in about 80-90%, yet as goal complexity increases, human co-piloting becomes critical for assuring accuracy. Beyond the process itself, created manuscripts too are inherently verifiable, as information-tracing allows to programmatically chain results, methods and data. Our work thereby demonstrates a potential for AI-driven acceleration of scientific discovery while enhancing, rather than jeopardizing, traceability, transparency and verifiability.

    Bullet Points

    • We built data-to-paper, an automation platform that guides interacting LLM agents through a complete stepwise research process while programmatically back-tracing information flow and allowing human oversight and interactions

    • The process demonstrated autonomous generation of de novo quantitative insights from data, and for simple research goals, a fully-autonomous cycle can create manuscripts that recapitulate peer-reviewed publications without major errors in about 80-90%

    • However, as goal complexity increases, human co-piloting becomes critical for assuring accuracy

    • Moreover, created manuscripts are inherently verifiable.

  106. A Primer on the Inner Workings of Transformer-based Language Models, Javier Ferrando,Gabriele Sarti,Arianna Bisazza,Marta R. Costa-jussà, 30-04-2024


    Computation and Language


    The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

    Bullet Points

    • The primer provides a technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture

    • It covers the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions.

  107. RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, Yucheng Hu,Yuxing Lu, 30-04-2024


    Computation and Language, Artificial Intelligence

  108. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, Seungone Kim,Juyoung Suk,Shayne Longpre,Bill Yuchen Lin,Jamin Shin,Sean Welleck,Graham Neubig,Moontae Lee,Kyungjae Lee,Minjoon Seo, 02-05-2024


    Computation and Language

  109. RLHF Workflow: From Reward Modeling to Online RLHF, Hanze Dong,Wei Xiong,Bo Pang,Haoxiang Wang,Han Zhao,Yingbo Zhou,Nan Jiang,Doyen Sahoo,Caiming Xiong,Tong Zhang, 13-05-2024


    Machine Learning, Artificial Intelligence, Computation and Language, Machine Learning
