FPGA Acceleration of Deep Reinforcement Learning Using On-chip Replay Management

DRL System

Abstract

A major bottleneck in parallelizing deep reinforcement learning (DRL) is the high latency of the operations that update the Prioritized Replay Buffer on the CPU. The low arithmetic intensity of these operations leads to severe under-utilization of the SIMT compute power of GPUs. In this work, we propose a high-throughput on-chip accelerator for the Prioritized Replay Buffer and learner that efficiently allocates computation and memory resources to saturate the FPGA's compute power. Our design features hardware pipelining on the FPGA such that the latency of replay operations is completely hidden.
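To illustrate why replay operations have low arithmetic intensity, here is a minimal software sketch of a prioritized replay index. It assumes the commonly used binary sum-tree layout; the abstract does not specify the data structure, so the on-chip design may differ. The class name `SumTree` and its methods are hypothetical, chosen only for this example.

```python
class SumTree:
    """Binary sum tree over `capacity` priorities (illustrative sketch).

    Assumption: capacity is a power of two. nodes[1] is the root and the
    leaves occupy nodes[capacity .. 2*capacity - 1].
    """

    def __init__(self, capacity):
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set the priority of one entry and refresh the sums on its root path."""
        i = index + self.capacity
        self.nodes[i] = priority
        i //= 2
        while i >= 1:
            # One addition per level: mostly memory traffic, very little arithmetic.
            self.nodes[i] = self.nodes[2 * i] + self.nodes[2 * i + 1]
            i //= 2

    def total(self):
        """Sum of all priorities."""
        return self.nodes[1]

    def sample(self, value):
        """Return the entry whose cumulative-priority interval contains `value`.

        `value` is expected to lie in [0, total()).
        """
        i = 1
        while i < self.capacity:
            left = 2 * i
            if value < self.nodes[left]:
                i = left
            else:
                value -= self.nodes[left]
                i = left + 1
        return i - self.capacity
```

In this sketch, each update or sample walks one root-to-leaf path and performs a single addition or comparison per level, so the work is dominated by dependent memory accesses rather than arithmetic. That access pattern is a poor fit for wide SIMT execution.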

Publication
In Proceedings of the 19th ACM International Conference on Computing Frontiers
Yuan Meng
Senior SDE - AI Engine Architecture Team

I co-optimize algorithm and hardware for deploying parallel AI workloads on heterogeneous platforms.