Albert Cohen

Albert is a research scientist at Google. An alumnus of École Normale Supérieure de Lyon and the University of Versailles, he has been a research scientist at Inria, a visiting scholar at the University of Illinois, an invited professor at Philips Research, and a visiting scientist at Facebook Artificial Intelligence Research. Albert Cohen works on parallelizing and optimizing compilers, machine learning compilers, and parallel and synchronous programming languages, with applications to high-performance computing, artificial intelligence, and reactive control.
Authored Publications
Structured Operations: Modular Design of Code Generators for Tensor Compilers
Nicolas Vasilache
Oleksandr Zinenko
Aart Bik
Mahesh Ravishankar
Thomas Raoux
Alexander Belyaev
Matthias Springer
Tobias Gysi
Diego Caballero
Stephan Herhut
Stella Laurenzo
LCPC 2022, Springer (2023)
The performance of machine learning systems heavily relies on code generators tailored to tensor computations. We propose an approach to the design and implementation of such code generators leveraging the natural structure of tensor algebra and illustrating the progressive lowering of domain-specific abstractions in the MLIR infrastructure.
Code Generation for Data-Dependent Stencils
Mohammed Essadki
Bertrand Michel
Bruno Maugars
Oleksandr Zinenko
Nicolas Vasilache
CGO, IEEE (2023)
Numerical simulation often resorts to iterative in-place stencils such as the Gauss-Seidel or Successive Overrelaxation (SOR) methods. Writing high-performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technology for image processing stencils, convolutions and out-of-place iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors.
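To illustrate the distinction the abstract draws between out-of-place and in-place stencils, here is a minimal Python sketch on a 1D averaging stencil. It is not taken from the paper: the function names `jacobi_sweep` and `gauss_seidel_sweep` and the 5-point test grid are hypothetical, chosen only for illustration. The Jacobi sweep reads the previous iterate only, so all its updates are independent; the Gauss-Seidel sweep reads values already updated in the same sweep, which creates a loop-carried dependence.

```python
def jacobi_sweep(u):
    """Out-of-place sweep: every update reads only the previous iterate."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def gauss_seidel_sweep(u):
    """In-place sweep: u[i - 1] has already been updated in this sweep."""
    u = list(u)  # copy only so that the caller's list is not mutated
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


# Hypothetical test grid: fixed boundary values at both ends.
u0 = [1.0, 0.0, 0.0, 0.0, 0.0]
print(jacobi_sweep(u0))        # [1.0, 0.5, 0.0, 0.0, 0.0]
print(gauss_seidel_sweep(u0))  # [1.0, 0.5, 0.25, 0.125, 0.0]
```

After one sweep the boundary value has propagated across the whole grid in the in-place version but only one cell in the out-of-place version. The same dependence that speeds up propagation is what prevents the in-place loop from being parallelized or vectorized directly.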
RL4ReAl: Reinforcement Learning for Register Allocation
S. VenkataKeerthy
Siddharth Jain
Anilava Kundu
Rohit Aggarwal
Ramakrishna Upadrasta
CC 2023, ACM
We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with state-of-the-art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture independent: we show experimental results targeting Intel x86 and ARM AArch64. Our results match or outperform the heavily tuned, production-grade register allocators of LLVM.
This paper considers the correctness of domain-specific compilers for tensor programming languages through the study of Halide, a popular representative. It describes a translation validation algorithm for affine Halide specifications, independently of the scheduling language. The algorithm relies on “prophetic” annotations added by the compiler to the generated array assignments. The annotations provide a refinement mapping [Abadi and Lamport 1988] from assignments in the generated code to the tensor definitions from the specification. Our implementation leverages an affine solver and a general SMT solver, and scales to complete Halide benchmarks.
RL4ReAl: Towards Optimal Register Allocation using Reinforcement Learning
S. VenkataKeerthy
Siddharth Jain
Rohit Aggarwal
Ramakrishna Upadrasta
arXiv (2022)
We propose a novel solution for the Register Allocation problem, leveraging multi-agent hierarchical Reinforcement Learning. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Experimental results match or outperform the LLVM register allocators, targeting Intel x86 and ARM AArch64.
Autotuning Convolutions is Easier Than You Think
Nicolas Tollenaere
Guillaume Iooss
Stéphane Pouget
Hugo Brunie
Christophe Guillon
P. Sadayappan
Fabrice Rastello
ACM TACO (2022)
A wide range of scientific and machine learning applications depend on highly optimized implementations of tensor computations. Exploiting the full capacity of a given processor architecture remains a challenging task, due to the complexity of the microarchitectural features that come into play when seeking near-peak performance. Among the state-of-the-art techniques for loop transformations for performance optimization, AutoScheduler tends to outperform other systems. It often yields higher performance than vendor libraries, but takes a large number of runs to converge, while also involving a complex training environment. In this paper, we define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates. We focus on two-dimensional convolutions on CPUs. Compared to state-of-the-art libraries, our structured search space enables higher performance for typical tensor shapes encountered in convolution stages in deep learning pipelines. Compared to autotuning code generators like AutoScheduler, it prunes the search space while increasing the density of efficient implementations. We analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor. We match or outperform the performance of the state-of-the-art oneDNN library and TVM’s AutoScheduler, while reducing the autotuning effort by at least an order of magnitude.
Weaving Synchronous Reactions Into the Fabric of SSA-Form Compilers
Hugo Pompougnac
Ulysse Beaugnon
Dumitru Potop Butucaru
TACO (2022)
We investigate the programming of reactive systems combining closed-loop control with performance-intensive components such as Machine Learning (ML). Reactive control systems are often safety-critical and associated with real-time execution requirements, a domain of predilection for synchronous programming languages. Extending the high levels of assurance found in reactive control systems to computationally-intensive code remains an open issue. We tackle it by unifying concepts and algorithms from synchronous languages with abstractions commonly found in general-purpose and ML compilers. This unification across embedded and high-performance computing enables a high degree of reuse of compiler abstractions and code. We first recall commonalities between dataflow synchronous languages and the static single assignment (SSA) form of general-purpose/ML compilers. We highlight the key mechanisms of synchronous languages that SSA does not cover: denotational concepts such as synchronizing computations with an external time base, cyclic and reactive I/O, as well as the operational notions of relaxing control flow dominance and the modeling of absent values. We discover that initialization-related static analyses and code generation aspects can be fully decoupled from other aspects of synchronous semantics such as memory management and causality analysis, the latter being covered by existing dominance-based algorithms of SSA-form compilers. We show how the SSA form can be seamlessly extended to enable all SSA-based transformations and optimizations on reactive programs with synchronous concurrency. We derive a compilation flow suitable for both high-performance and reactive aspects of a control application, by embedding the Lustre dataflow synchronous language into the SSA-based MLIR/LLVM compiler infrastructure. This allows the modeling of signal processing and deep neural network inference in the (closed) loop of feedback-directed control systems. With only minor effort leveraging the MLIR infrastructure, the generated code matches or outperforms state-of-the-art synchronous language compilers on computationally-intensive ML applications.
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner
Mehdi Amini
Uday Bondhugula
River Riddle
Tatiana Shpeisman
Nicolas Vasilache
Oleksandr Zinenko
CGO 2021
This work presents the MLIR compiler infrastructure, which is a novel approach to building reusable compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments. The scientific perspective on these challenges is twofold: 1) evaluating MLIR as an infrastructure that enables new research and educational approaches on programming languages, compilers, code generators, execution environments, hardware acceleration and codesign; and 2) discussing MLIR as a research artifact built for extension and evolution, raising its own design, semantics, algorithmic, system, engineering, and multi-disciplinary challenges. The paper presents the rationale for MLIR, its original design principles, structures and semantics, and validates these by surveying some applications of it.
Secure Optimization Through Opaque Observations
Arnaud De Grandmaison
Christophe Guillon
Karine Heydemann
Son Tuan Vu
arXiv (2021)
Secure applications implement protections against side-channel and physical attacks. Such protections embed input/output side-effects preventing optimizing compilers from altering the protection. These side-effects are error-prone and compiler-dependent, and the current practice involves analyzing the generated machine code to make sure security or privacy properties are still enforced. Vu et al. recently demonstrated how to automate the insertion of volatile side-effects in a compiler [30], but these may be too expensive in fine-grained protections such as control-flow integrity. We introduce observations of the program state that are intrinsic to the correct execution of security protections, along with means to specify and preserve observations across the compilation flow. Such observations complement the traditional input/output-preservation contract of compilers. We show how to guarantee their preservation without modifying compilation passes and with as little performance impact as possible. We validate our approach on a range of benchmarks, expressing the secure compilation of these applications in terms of observations to be made at specific program points.
Efficient Convolution Optimisation by Composing Microkernels
Nicolas Tollenaere
Auguste Olivry
Guillaume Iooss
Hugo Brunie
P. Sadayappan
Fabrice Rastello
INRIA (2021)
Optimizing the implementation of tensor computations is essential to exploiting the full capacity of a given processor architecture on a wide range of scientific and machine learning applications. However, the complexity of the microarchitectural features that come into play when approaching the peak performance of the processor makes it very hard. Focusing on 2D convolutions, we observe a common weakness in all tensor compilers and libraries related to efficiently covering the wide variety of problem sizes occurring in real-world applications. We propose TTile, a domain-specific code generator and autotuner for implementing efficient convolutions. Similarly to BLIS, TTile nests multiple levels of tiling above a vectorized tensor contraction microkernel. But unlike traditional approaches, we explore a variety of microkernels and compose them to fit exactly the tensor shapes of a convolution. While this helps achieve consistently high performance on virtually all possible tensor sizes, our method also introduces more degrees of freedom in the optimization space, which makes it challenging for autotuning strategies. To address this, we leverage an analytical model of data movement, and combine it with feedback-directed autotuning. We evaluate TTile as a stand-alone compiler and also as a complement to TVM on recent Intel x86 microarchitectures.