FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA
Autoregressive convolutional neural networks (CNNs) have been widely
exploited for sequence generation tasks such as audio synthesis, language
modeling, and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolutions, used for sequence generation. While WaveNet produces state-of-the-art audio generation
results, the naive inference implementation is quite slow; it takes a few
minutes to generate just one second of audio on a high-end GPU. In this work,
we develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks, and address the associated design challenges. We
design the Fast-Wavenet inference model in Vivado HLS and perform a wide range
of optimizations including fixed-point implementation, array partitioning and
pipelining. Our model uses a fully parameterized parallel architecture for fast
matrix-vector multiplication that enables per-layer customized latency
fine-tuning for further throughput improvement. Our experiments comparatively
assess the trade-off between throughput and resource utilization for various
optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA, which uses only on-chip memory, achieves 66x faster generation than the CPU implementation and 11x faster generation than the GPU implementation.
Published as a conference paper at ICCAD 2019.
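To make the caching idea concrete, here is a minimal NumPy sketch of Fast-Wavenet-style incremental inference, where each dilated layer keeps a queue of its past inputs so that generating one sample costs one small matrix-vector product per layer rather than a pass over the full receptive field. The channel count, weights, and two-tap structure are illustrative assumptions, not the paper's HLS design.

```python
import numpy as np

# Fast-WaveNet-style incremental inference: each dilated layer caches its
# past inputs in a queue, so one new sample costs O(num_layers) work instead
# of re-running the full receptive field.

class DilatedLayer:
    def __init__(self, dilation, channels, rng):
        self.queue = [np.zeros(channels) for _ in range(dilation)]
        self.w_cur = rng.standard_normal((channels, channels)) * 0.1
        self.w_past = rng.standard_normal((channels, channels)) * 0.1

    def step(self, x):
        past = self.queue.pop(0)  # this layer's input from `dilation` steps ago
        self.queue.append(x)      # cache the current input for future steps
        return np.tanh(self.w_cur @ x + self.w_past @ past)  # 2-tap dilated conv

rng = np.random.default_rng(0)
layers = [DilatedLayer(2 ** i, channels=16, rng=rng) for i in range(4)]  # dilations 1..8

x = np.zeros(16)
for _ in range(32):            # autoregressive generation loop
    h = x
    for layer in layers:
        h = layer.step(h)
    x = h                      # feed the new sample back as the next input
```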
zPROBE: Zero Peek Robustness Checks for Federated Learning
Privacy-preserving federated learning allows multiple users to jointly train
a model under the coordination of a central server. The server learns only the final
aggregation result, thus the users' (private) training data is not leaked from
the individual model updates. However, keeping the individual updates private
allows malicious users to perform Byzantine attacks and degrade the accuracy
without being detected. The best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., the median, to find malicious updates.
However, implementing privacy-preserving rank-based statistics is nontrivial
and not scalable in the secure domain, as it requires sorting all individual
updates. We establish the first private robustness check that uses high-breakdown-point rank-based statistics on aggregated model updates. By exploiting
randomized clustering, we significantly improve the scalability of our defense
without compromising privacy. We leverage our statistical bounds in
zero-knowledge proofs to detect and remove malicious updates without revealing
the private user updates. Our novel framework, zPROBE, enables Byzantine
resilient and secure federated learning. Empirical evaluations demonstrate that
zPROBE provides a low overhead solution to defend against state-of-the-art
Byzantine attacks while preserving privacy.
Published at ICCV 2023.
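As an illustration of the statistical core, the sketch below runs the median-based check on randomly clustered updates in the clear; in zPROBE this check is carried out over secure aggregation with zero-knowledge proofs, and the cluster count and deviation threshold used here are illustrative assumptions.

```python
import numpy as np

# Plaintext sketch of the robustness check: randomly cluster user updates,
# aggregate each cluster, and flag clusters whose aggregate strays too far
# from the median (a high-breakdown-point statistic). No cryptography shown;
# eta and n_clusters are illustrative assumptions.

def robustness_check(updates, n_clusters=8, eta=3.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clusters = np.array_split(rng.permutation(len(updates)), n_clusters)
    means = np.stack([updates[c].mean(axis=0) for c in clusters])  # cluster aggregates
    med = np.median(means, axis=0)                # robust center of the aggregates
    dev = np.linalg.norm(means - med, axis=1)     # per-cluster deviation
    tau = eta * np.median(dev)                    # robust deviation threshold
    return [c for c, d in zip(clusters, dev) if d > tau]  # suspicious clusters

# toy demo: 31 honest users plus one scaled-up (Byzantine) update
honest = np.random.default_rng(1).normal(size=(31, 10))
byzantine = 50.0 * np.ones((1, 10))
flagged = robustness_check(np.vstack([honest, byzantine]))
print("flagged clusters (user indices):", [c.tolist() for c in flagged])
```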
MPCircuits: Optimized Circuit Generation for Secure Multi-Party Computation
Secure Multi-party Computation (MPC) is one of the most influential achievements of modern cryptography: it allows evaluation of an arbitrary function on private inputs from multiple parties without revealing the inputs. A crucial step in utilizing contemporary MPC protocols is to describe the function as a Boolean circuit. While efficient solutions have been proposed for the special case of two-party secure computation, the general case of more than two parties has not been addressed. This paper proposes MPCircuits, the first automated solution to devise optimized Boolean circuit representations for any MPC function using hardware synthesis tools with new customized libraries that are scalable to multiple parties. MPCircuits creates a new end-to-end tool-chain to facilitate practical, scalable MPC realization. To illustrate the practicality of MPCircuits, we design and implement a set of five circuits that represent real-world MPC problems. Our benchmarks inherently have different computational and communication complexities and are good candidates for evaluating MPC protocols. We also formalize the metrics by which a given protocol can be analyzed. We provide extensive experimental evaluations for these benchmarks, two of which are the first reported solutions in multi-party settings. As our experimental results indicate, MPCircuits reduces the computation time of MPC protocols by up to 4.2x.
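To make the circuit view concrete, the toy sketch below expresses a simple multi-party function (the OR of one private bit per party) as a netlist of AND/XOR/NOT gates and scores it by AND-gate count, the cost metric most MPC protocols pay for. This is a generic illustration, not the MPCircuits synthesis flow or its customized libraries.

```python
from dataclasses import dataclass

# Toy Boolean-circuit description of a multi-party function. Protocols such
# as GMW evaluate XOR gates for free and pay per AND gate, so the AND count
# is the natural cost metric for a circuit.

@dataclass
class Gate:
    op: str        # "AND", "XOR", or "NOT"
    a: int         # input wire id
    b: int = -1    # second input wire id (unused for NOT)

def or_circuit(n_parties):
    """Circuit computing the OR of one private input bit per party, built
    from AND/NOT gates via De Morgan: OR(a, b) = NOT(AND(NOT a, NOT b))."""
    gates, wire = [], n_parties            # wires 0..n-1 hold the party inputs

    def emit(op, a, b=-1):
        nonlocal wire
        gates.append(Gate(op, a, b))
        wire += 1
        return wire - 1                    # id of this gate's output wire

    acc = 0                                # start from party 0's input wire
    for i in range(1, n_parties):
        na, nb = emit("NOT", acc), emit("NOT", i)
        acc = emit("NOT", emit("AND", na, nb))   # acc := acc OR input_i
    return gates

circ = or_circuit(4)
and_gates = sum(g.op == "AND" for g in circ)
print(f"{len(circ)} gates total, {and_gates} AND gates (the costly ones)")
```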
Lanturn: Measuring Economic Security of Smart Contracts Through Adaptive Learning
We introduce Lanturn: a general-purpose adaptive learning-based framework for measuring the cryptoeconomic security of composed decentralized-finance (DeFi) smart contracts. Lanturn discovers strategies, comprising concrete transactions, for extracting economic value from smart contracts that interact with a particular transaction environment. We formulate strategy discovery as a black-box optimization problem and leverage a novel adaptive learning-based algorithm to address it.
Lanturn features three key properties. First, it needs no contract-specific heuristics or reasoning, due to our black-box formulation of cryptoeconomic security. Second, it utilizes a simulation framework that operates natively on blockchain state and smart contract machine code, such that transactions returned by Lanturn’s learning-based optimization engine can be executed on-chain without modification. Finally, Lanturn is scalable: it can explore strategies comprising a large number of transactions, which can be reordered or interleaved with newly inserted transactions.
We evaluate Lanturn on historical data from the largest and most active DeFi applications: Sushiswap, UniswapV2, UniswapV3, and AaveV2. Our results show that Lanturn not only rediscovers existing, well-known strategies for extracting value from smart contracts, but also discovers new, previously undocumented strategies. Lanturn also consistently discovers higher value than evidenced in the wild, surpassing a natural baseline computed from the value extracted by bots and other strategic agents.
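The sketch below illustrates the black-box formulation: a candidate strategy is a real vector (here, trade sizes for a fixed transaction template), the objective is the profit reported by a simulator, and an adaptive sampler concentrates the search around high-value candidates. The simulate_profit function is a hypothetical stand-in for Lanturn's native blockchain simulator, and the cross-entropy-style loop is a generic adaptive learner, not the paper's exact algorithm.

```python
import numpy as np

# Black-box strategy search: sample candidate strategies, score them with a
# simulator, and refocus the sampling distribution on the elites. Both the
# profit surface and the loop below are illustrative assumptions.

def simulate_profit(trade_sizes):
    x = np.asarray(trade_sizes)        # hypothetical bumpy profit surface
    return -np.sum((x - 3.0) ** 2) + 2.0 * np.sin(4.0 * x).sum()

def search_strategy(dim=4, iters=40, pop=64, elite=8, rng=None):
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(dim), 5.0 * np.ones(dim)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, dim))        # sample strategies
        profits = np.array([simulate_profit(c) for c in cand])
        elites = cand[np.argsort(profits)[-elite:]]          # keep the best
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3  # refocus sampler
    return mu, simulate_profit(mu)

strategy, value = search_strategy()
print("best trade sizes:", np.round(strategy, 2), "profit:", round(value, 3))
```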
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
Holistic Algorithm and System Co-Optimization for Trustworthy and Platform-Aware Deep Learning
Simultaneous growth in the volume of available data along with rapid advancements in computing and hardware technology have paved the way for unprecedented breakthroughs in the field of Artificial Intelligence (AI). In particular, a modern class of AI algorithms, dubbed Deep Learning (DL), has shown great promise by achieving or even surpassing human-level capabilities in many tasks. The rise of DL has brought forth a new industrial revolution by taking over the modern landscape of smart applications, e.g., self-driving cars, virtual assistants, drug discovery, and manufacturing. Nevertheless, several challenges remain for the wide-scale adoption of DL in real-life scenarios. First, characterizing confidence and ensuring the robustness of DL-enabled services are imperative, particularly in safety-critical autonomous systems. Second, concerns over the scalability and efficiency of DL hinder its training and deployment on diverse hardware platforms. This dissertation addresses the above-mentioned challenges via a holistic customization of the DL algorithm and system from the standpoint of task-based metrics (e.g., accuracy), physical constraints (e.g., memory and power budget), as well as new design metrics that facilitate DL integration in safety-sensitive tasks. The presented research interlinks theoretical fundamentals, domain-specific architecture design, and automated tools that enable co-optimization of the DL algorithm with the underlying platform while satisfying various constraints. The key contributions of this dissertation are as follows:
1) Devising CuRTAIL, the first end-to-end and automated framework that simultaneously enables efficient and safe execution of DL models in the face of adversarial attacks. CuRTAIL formalizes the goal of thwarting adversarial attacks as an optimization problem and trains parallel defense modules to minimize vulnerability. The proposed framework leverages hardware/algorithm co-design and customized acceleration to enable just-in-time execution in resource-constrained settings.
2) Designing a novel framework, dubbed ACCHASHTAG, which identifies faults occurring during DL inference in real time. I propose to summarize the ground-truth DL model as a unique hash signature, which is used to verify the model’s integrity on the fly (a minimal sketch of this hash check follows the list). Notably, ACCHASHTAG, for the first time, provides guaranteed lower bounds on the detection rate using a formal statistical analysis of hash collisions.
3) Proposing CLEANN, the first end-to-end framework that enables online mitigation of backdoor (a.k.a. Trojan) attacks on DL. CLEANN uses sparse recovery and statistical analysis to identify incoming Trojan samples and remove their effect on the victim model’s prediction (see the sparse-recovery sketch after the list). I design the algorithmic solutions as well as customized hardware-accelerated engines to enable real-time DL model decision verification via CLEANN.
4) Innovating an approach for restructuring inter-layer connections in DL models, leading to faster convergence to a desired accuracy during training. This is achieved by transforming the DL model into a small-world network using principles from graph theory. The resulting DL model, dubbed SWANN, has a highly connected small-world topology with enhanced signal propagation and faster learning (a rewiring sketch follows the list).
5) Developing LTS, the first training-free, hardware-aware neural architecture search for autoregressive Transformers. The proposed method delivers high-performance specialized architectures for inference on target hardware. The core of LTS is an ultra-low-cost proxy that estimates the performance of candidate architectures without any training. Using this proxy, the search can be performed entirely on the target hardware, allowing hardware measurements, e.g., peak memory utilization and latency, to be incorporated within the architecture search loop (a search-loop sketch follows the list).
6) Automating DL model customization for various target hardware by formulating it as a constrained optimization problem. The optimization goal is to compress a large model to satisfy given accuracy and hardware performance constraints. I propose a highly scalable black-box optimizer, dubbed AdaNS, to solve this problem. AdaNS leverages adaptive non-uniform sampling with carefully crafted probability distributions to locate and reconstruct the optimization objective function around its maximizers.
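For item 2, a minimal sketch of the hash-check idea, under the simplifying assumption that the signature covers the k largest-magnitude weights of each layer; ACCHASHTAG's actual signature construction and fault model are more involved.

```python
import hashlib
import numpy as np

# Fingerprint the most sensitive weights of the trusted model once, then
# re-hash at inference time to detect faults on the fly. Layer names, sizes,
# and the top-k selection heuristic are illustrative assumptions.

def signature(weights, k=64):
    sig = {}
    for name, w in weights.items():
        top_k = np.sort(np.abs(w).ravel())[-k:]          # most sensitive weights
        sig[name] = hashlib.sha256(top_k.tobytes()).hexdigest()
    return sig

def verify(weights, sig, k=64):
    return all(signature({n: w}, k)[n] == sig[n] for n, w in weights.items())

rng = np.random.default_rng(0)
model = {"conv1": rng.normal(size=(3, 3, 16)), "fc": rng.normal(size=(16, 10))}
golden = signature(model)            # computed offline on the trusted model

model["fc"][0, 0] += 10.0            # inject a runtime fault
print("model intact?", verify(model, golden))   # -> False
```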
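For item 3, a sketch of the sparse-recovery test: clean inputs are well approximated by a sparse code over a dictionary fitted to clean data, while Trojan triggers leave a large reconstruction residual. The matching-pursuit solver and the 3-sigma threshold are illustrative assumptions, not CLEANN's exact algorithm.

```python
import numpy as np

# Flag Trojan samples by their reconstruction residual under a sparse code:
# clean data lies near the span of a few dictionary atoms, triggers do not.

def sparse_code(x, D, n_nonzero=5):
    """Greedy matching pursuit: approximate x with n_nonzero atoms of D."""
    r, code = x.copy(), np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ r)))   # best-matching atom
        c = D[:, j] @ r
        code[j] += c
        r -= c * D[:, j]
    return code, np.linalg.norm(r)            # sparse code and residual norm

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 128))
D /= np.linalg.norm(D, axis=0)                # unit-norm dictionary atoms

# synthetic clean features: sparse combinations of dictionary atoms
clean = [D @ (rng.normal(size=128) * (rng.random(128) < 0.05)) for _ in range(200)]
residuals = np.array([sparse_code(x, D)[1] for x in clean])
tau = residuals.mean() + 3 * residuals.std()  # threshold calibrated on clean data

trigger = 5.0 * rng.normal(size=32)           # hypothetical Trojan perturbation
_, r = sparse_code(clean[0] + trigger, D)
print("Trojan flagged:", bool(r > tau))
```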
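For item 4, a sketch of the rewiring step: start from regular local layer connectivity and randomly redirect some edges into long-range skips, restricted to forward (DAG) edges so the model stays feed-forward. The neighborhood size and rewiring probability are illustrative assumptions.

```python
import numpy as np

# Watts-Strogatz-style rewiring of inter-layer connections: local edges give
# high clustering, a few random long-range skips shorten the average path,
# which is the small-world property that aids signal propagation.

def small_world_dag(n_layers, k=2, p=0.3, rng=None):
    rng = rng or np.random.default_rng(0)
    adj = np.zeros((n_layers, n_layers), dtype=bool)
    for i in range(n_layers):                      # regular local connectivity:
        for j in range(i + 1, min(i + 1 + k, n_layers)):
            adj[i, j] = True                       # each layer feeds its k successors
    for i in range(n_layers):
        for j in range(i + 2, min(i + 1 + k, n_layers)):
            if rng.random() < p:                   # keep the chain edge (i, i+1),
                adj[i, j] = False                  # rewire the extra local edge
                adj[i, rng.integers(i + 1, n_layers)] = True  # into a random skip
    return adj

adj = small_world_dag(12)
print("edges:", np.argwhere(adj).tolist())
```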
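For item 5, a sketch of the search loop: candidates are scored by a training-free proxy (decoder parameter count as a crude surrogate) and by latency measured directly on the target hardware, then ranked on the proxy/latency Pareto front. The architecture grid and the timing stand-in are hypothetical.

```python
import time
import numpy as np

# Training-free, hardware-aware search: rank candidate Transformer configs by
# a zero-cost quality proxy and a real on-device latency measurement, with no
# training in the loop. All parameters below are hypothetical placeholders.

def param_proxy(cfg):
    """Decoder parameter count: a cheap surrogate for final model quality."""
    d, layers, ffn = cfg["d_model"], cfg["n_layers"], cfg["d_ffn"]
    return layers * (4 * d * d + 2 * d * ffn)   # attention + feed-forward weights

def measure_latency(cfg, reps=3):
    """Stand-in for an on-device measurement: time a matmul of matching width."""
    x = np.random.default_rng(0).normal(size=(cfg["d_model"], cfg["d_model"]))
    t0 = time.perf_counter()
    for _ in range(reps):
        x @ x
    return (time.perf_counter() - t0) / reps

rng = np.random.default_rng(0)
candidates = [{"d_model": int(rng.choice([256, 512, 768])),
               "n_layers": int(rng.integers(2, 12)),
               "d_ffn": int(rng.choice([1024, 2048, 4096]))} for _ in range(20)]

scored = [(param_proxy(c), measure_latency(c), c) for c in candidates]
pareto = [s for s in scored                  # keep configs that no other config
          if not any(o[0] >= s[0] and o[1] < s[1]   # matches on the proxy while
                     for o in scored if o is not s)]  # strictly winning on latency
print(f"{len(pareto)} Pareto-optimal architectures out of {len(candidates)}")
```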
