Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing

Tampere, Finland November 2nd-4th, 2011 General Co-Chairs: Jari Nurmi, Tapani Ahonen Tampere University of Technology


The 2011 Conference on Design & Architectures for Signal and Image Processing (DASIP) Tampere, Finland November 2-4, 2011 ISSN 1966-7116 IEEE Catalog Number CFP11DAS-USB ISBN 978-1-4577-0619-6

Editors Dr. Adam Morawiec Jinnie Hinderscheit Orna Ghenassia ECSI Electronic Chips & Systems design Initiative Parc Equation 2, Avenue de Vignate 38610 Gières, France [email protected]

The 2011 Conference on Design & Architectures for Signal and Image Processing (DASIP) Copyright © 2011 Copyright and Reprint Permission Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For other copying, reprint or republication permission, write to IEEE Copyrights Manager, IEEE Operations Center, 445 Hoes Lane, Piscataway, NJ 08854. All rights reserved. Copyright 2011 by IEEE. IEEE Catalog Number CFP11DAS-USB ISBN 978-1-4577-0619-6 ISSN 1966-7116 © ECSI Electronic Chips & Systems design Initiative, 2011 No part of the work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

2011 Table of Contents General Chairs & Program Chairs ......................................................................................................... 7 Keynote & Invited Speakers................................................................................................................... 8 Program Committee ............................................................................................................................... 9 Session 1, Regular Session: System Simulation and Processor Generation .................................. 11 A SystemC TLM Framework for Distributed Simulation of Complex Systems with Unpredictable Communication………………………………………………………………………………………………....12 Julien Peeters, Nicolas Ventroux and Tanguy Sassolas Performance Evaluation of an Automotive Distributed Architecture Based on HPAV Communication Protocol Using a Transaction Level Modeling Approach……………………………………………………………………………....20 Takieddine Majdoub, Sebastien Le Nours, Olivier Pasquier and Fabienne Nouvel MORPHEO: a High-Performance Processor Generator for a FPGA Implementation…………………………………....27 Mathieu Rosière, Nathalie Drach, Jean Lou Desbarbieux and Franck Wajsbürt Design of a Processor Optimized for Syntax Parsing in Video Decoders………………………………………………....35 Nicolas Siret, Jean-François Nezan and Aimad Rhatay

Session 2: Low Power Design & Methodologies……………………………………..…………………….44 Fast and Accurate Hybrid Power Estimation Methodology for Embedded Systems……………………………………...45 Santhosh Kumar Rethinagiri, Rabie Ben Atitallah, Eric Senn, Smail Niar and Jean-Luc Dekeyser Embedded Operating Systems Energy Overhead…………………………………………………………………………....52 Bassem Ouni, Cécile Belleudy, Sebastien Bilavarn and Eric Senn

Session 3: Reconfigurable Systems & Tools for Signal & Image Processing : part 1……………….58 Flexible VLIW Processor Based on FPGA for Real-Time Image Processing……………………………………………....59 Vincent Brost, Charles Meunier, Debyo Saptono and Fan Yang Acceleration of Image Reconstruction in 3D Ultrasound Computer Tomography: An Evaluation of CPU, GPU and FPGA Computing…………………………………...………………………………………………………………….…....67 Matthias Birk, Alexander Guth, Michael Zapf, Matthias Balzer, Nicole Ruiter, Michael Hübner and Jürgen Becker Task Model and Online Operating System API for Hardware Tasks in OLLAF Platform………………………………....75 Samuel Garcia and Bertrand Granado

Session 4: Signal and Image Processing on GPU……………………………………..………………….82 DFG Implementation on Multi GPU Cluster with Computation-Communication Overlap…………………………………83 Sylvain Huet, Vincent Boulos, Vincent Fristot and Luc Salvo An Efficient Parallel Motion Estimation Algorithm and X264 Parallelization in CUDA…………………………………....91 Youngsub Ko, Youngmin Yi and Soonhoi Ha

Session 5: Dynamic Architectures & Adaptive Management for Image & Signal Processing…………………………………………………………………………..………….99 Middleware Approaches for Adaptivity of Kahn Process Networks on Networks-on-Chip……………………………....100 Emanuele Cannella, Onur Derin and Todor Stefanov FPGA Dynamic Reconfiguration Using the RVC Technology: Inverse Quantization Case Study……………………...108 Manel Hentati,Yassine Aoudni, Jean-François Nezan, Mohamed Abid and Olivier Deforges Graphic Rendering Application Profiling on a Shared Memory MPSoC Architecture……………………………….…...115 Matthieu Texier, Raphaël David, Karim Ben Chehida and Olivier Sentieys

Session 6: Regular Session: Signal Processing and Processor Designs…………………………...…122 Speed VLSI Architecture for 2-D Lifting Discrete Wavelet Transform…………………………………...………………...123 Anand Darji, Arun Chandorkar, S Merchant and Rajul Bansal Pilot Studies of Wireless Sensor Networks: Practical Experiences……………………………………...………………...129 Teemu Laukkarinen, Jukka Suhonen, Timo Hämäläinen and Marko Hännikäinen Efficient Maximal Convex Custom Instruction Enumeration for Extensible Processors………………………………....137 Chenglong Xiao and Emmanuel Casseau Efficient FFT Pruning Algorithm for Non-Contiguous OFDM Systems……………………………………...……………..144 Roberto Airoldi, Fabio Garzia and Jari Nurmi Designing Processors Using MAsS, a Modular and Lightweight Instruction-level Exploration Tool…………………...150 Matthieu Texier, Erwan Piriou, Mathieu Thévenin, and Raphaell David

Session 7: Smart Image Sensors……………………………………………………..…………..………….…156 A New Approach to 3D Form Recognition within Video Capsule Endoscopic……………………………………...…….157 Jad Ayoub, Bertrand Granado, Olivier Romain and Yasser Mohanna A System C AMS/TLM Platform for CMOS Video Sensors……………………………………...…………………….…...164 Fabio Cenni, Serge Scotti and Emmanuel Simeu

Session 8: Methods & Tools for Dataflow Programming…………………………..…………..………..…170 Hardware/Software Co-Design of Dataflow Programs for Reconfigurable Hardware and Multi-Core Platforms……..171 Ghislain Roquier, Endri Bezati, Richard Thavot and Marco Mattavelli The Multi-Dataflow Composer Tool: A Runtime Reconfigurable HDL Platform Composer…………………………......178 Francesca Palumbo, Nicola Carta and Luigi Raffo A Unified Hardware/Software Co-Synthesis Solution for Signal Processing Systems………………………..…….…...186 Endri Bezati, Herve Yviquel, Mickael Raulet and Marco Mattavelli Optimization Methodologies for Complex FPGA-Based Signal Processing Systems with CAL………………………..192 Ab Al-Hadi Ab Rahman, Hossam Amer, Anatoly Prihozhy, Christophe Lucarz and Marco Mattavelli

Session 9: Reconfigurable Systems & Tools for Signal & Image Processing: Part 2……………...…200 Development of a Method for Image-Based Motion Estimation of a VTOL-MAV on FPGA……………………………..201 Natalie Frietsch, Lars Braun, Matthias Birk, Michael Hübner, Gert F Trommer and Jürgen Becker Real-Time Moving Object Detection for Video Surveillance System in FPGA…………………………………………....208 Tomasz Kryjak, Mateusz Komorkiewicz and Marek Gorgon An Approach to Self- Learning Multicore Reconfiguration Management Applied on Robotic Vision……………..…....216 Walter Stechele, Jan Hartmann and Erik Maehle Power Consumption Improvement with Residue Code for Fault Tolerance on SRAM FPGA…………………….…....222 Frederic Amiel, Thomas Ea and Vinay Vashishtha

Poster Session: Main Track………………………………………..………………………………………..….228 Embedded Systems Security: An Evaluation Methodology Against Side Channel Attacks…………………….……...229 Youssef Souissi, Jean-Luc Danger, Sylvain Guilley, Shivam Bhasin and Maxime Nassar Interfacing and Scheduling Legacy Code within the Canals Framework………………………………………………....237 Andreas Dahlin, Fareed Jokhio, Jérôme Gorin, Johan Lilius and Mickaël Raulet Range-Free Algorithm for Energy-Efficient Indoor Localization in Wireless Sensor Networks………………………...245 Ville Kaseva, Timo Hämäläinen and Marko Hännikäinen Application Workload Model Generation Methodologies for System-Level Design Exploration……………………….253 Jukka Saastamoinen and Jari Kreku Flexible NoC-Based LDPC Code Decoder Implementation and Bandwidth Reduction Methods…………………......260 Carlo Condo and Guido Masera FERONOC: Flexible and Extensible Router Implementation for Diagonal Mesh Topology…………………………....268 Majdi Elhajji, Brahim Attia, Abdelkrim Zitouni, Samy Meftali, Jean-Luc Dekeyser and Rached Tourki A New Algorithm for Realization of FIR Filters Using Multiple Constant Multiplications………………………………..276 Mohsen Amiri Farahani, Eduardo Castillo Guerra and Bruce Colpitts Analyzing Software Inter-Task Communication Channels on a Clustered Shared Memory Multi Processor System-on-Chip………………………………………………………………………………………………………………....283 Daniela Genius and Nicolas Pouillon Multiplier Free Filter Bank Based Concept for Blocker Detection in LTE Systems……………………………………...291 Thomas Schlechter Practical Monitoring and Analysis Tool for WSN Testing…………………………………………………..……………....298 Markku Hänninen, Jukka Suhonen, Timo D Hämäläinen and Marko Hännikäinen

Poster Session: Reconfigurable Systems & Tools for Signal & Image Processing……….……...….306 High Level Design of Adaptive Distributed Controller for Partial Dynamic Reconfiguration in FPGA………………....307 Sana Cherif, Chiraz Trabelsi, Samy Meftali and Jean-Luc Dekeyser Methodology for Designing Partially Reconfigurable Systems Using Transaction-Level Modeling……………….......315 Francois Duhem, Fabrice Muller and Philippe Lorenzini

Poster Session: Dynamic Architectures & Adaptive Management for Image & Signal Processing……………………………………………………………………………...……..322 A Framework for the Design of Reconfigurable Fault Tolerant Architectures……………………………………………323 Sebastien Pillement, Hung Manh Pham, Olivier Pasquier and Sébastien Le Nours High-Level Modelling and Automatic Generation of Dynamically Reconfigurable Systems…………………………....331 Gilberto Ochoa, El-Bay Bourennane, Hassan Rabah and Ouassila Labbani

Poster Session: Smart Image Sensors……………………………………………………...………………..339 SystemC Modelization for Fast Validation of Imager Architectures…………………………….………………………....340 Yves Blanchard

Poster Session: Signal and Image Processing on GPU…………………………………………………..345 Parallelization of an Ultrasound Reconstruction Algorithm for Non-Destructive Testing on Multicore CPU and GPU…………………………….…………………………………………………………………………...………..356 Antoine Pédron, Lionel Lacassagne, Franck Bimbard and Stéphane Le Berre

Welcome to DASIP 2011

General Co-Chair: Jari Nurmi, Tampere University of Technology, Finland

Jari Nurmi is a professor of Computer Systems at Tampere University of Technology (TUT). He has held various research, education and management positions at TUT and in industry since 1987. He received his PhD degree from TUT in 1994. His current research interests include System-on-Chip integration, on-chip communication, embedded and application-specific processor architectures, and circuit implementations of digital communication, positioning and DSP systems. He is leading a group of about 20 researchers at TUT.

General Co-Chair: Tapani Ahonen, Tampere University of Technology, Finland

Tapani Ahonen is a Senior Research Fellow at Tampere University of Technology (TUT) in Tampere, Finland, where he has held various positions since 2000. Since 2004 he has been co-managing a group of about 30 researchers. He has also been a part-time Lecturer at Carinthia Tech Institute (CTI) in Villach, Austria, since 2007. In 2009-2010 Ahonen was a Visiting Researcher (Chercheur Invité) at Université Libre de Bruxelles (ULB) in Brussels, Belgium. His work is focused on proof-of-concept driven computer systems design with an emphasis on many-core processing environments. Ahonen has an MSc in Electrical Engineering and a PhD in Information Technology from TUT.

Program Co-Chair: Michael Hübner, KIT, Germany

Dr.-Ing. habil. Michael Hübner is a senior research scientist and group leader at the ITIV, Karlsruhe Institute of Technology (KIT). He received his diploma degree in electrical engineering and information technology in 2003 and his PhD degree in 2007 from the University of Karlsruhe (TH). His research interests are in reconfigurable computing and particularly new technologies for adaptive FPGA run-time reconfiguration and on-chip network structures, with applications in automotive systems, including their integration into high-level design and programming environments. Dr. Hübner received his postdoctoral lecture qualification (venia legendi) in the domain of reconfigurable computing from the Karlsruhe Institute of Technology (KIT) in 2011.

Program Co-Chair: Daniel Chillet, IRISA, France

Daniel Chillet is an Associate Professor at the University of Rennes 1 and a member of the Cairn Inria research team. He received the Engineering degree and the M.S. degree in electronics and signal processing engineering from ENSSAT, University of Rennes, in 1992 and 1994, respectively. He received the Ph.D. degree in signal processing and telecommunications from the University of Rennes in 1997, and the Habilitation thesis in 2010. He is leading the electronics and computer science department at the ENSSAT engineering school. His research interests include memory hierarchy, dynamically and partially reconfigurable resources, real-time systems, middleware and high-level languages for system modeling. All these topics are studied in the context of SoC design for embedded systems, including low-power constraints.

Keynote & Invited Speakers

Keynote Speaker: Yves Leduc, TI Chair at the University of Nice, France

Yves received his PhD in electrical engineering from the University of Louvain in 1979. With Texas Instruments France for the last 30 years, he created the mixed-signal development team of TI France in 1994 and was elected TI Fellow in 1998. Yves then led the advanced system technology team, paving the way for the future development of new SoCs. Yves currently holds the TI Chair at the University of Nice and participates in several organizations that promote the creation of start-ups in microelectronics.

Keynote Speaker: Mario Porrmann, Acting Professor of the research group System and Circuit Technology, Heinz Nixdorf Institute, University of Paderborn, Germany

Since January 2010, Mario Porrmann has been Acting Professor of the research group "System and Circuit Technology" at the Heinz Nixdorf Institute, University of Paderborn. He graduated as "Diplom-Ingenieur" in Electrical Engineering at the University of Dortmund, Germany, in 1994. In 2001 he received a PhD in Electrical Engineering from the University of Paderborn, Germany, for his work on performance evaluation of embedded neurocomputers. From 2001 to 2009 he was Assistant Professor in the research group "System and Circuit Technology" at the Heinz Nixdorf Institute, University of Paderborn. His main scientific interests are in on-chip multiprocessor systems, dynamically reconfigurable computing, and resource-efficient architectures for network components. Mario Porrmann has published more than 130 peer-reviewed papers in scientific journals and at international conferences.

Keynote Speaker: Lasse Harju, ST-Ericsson, Finland

Dr. Lasse Harju is a SoC architect at ST-Ericsson. His current responsibilities cover SoC system control and power management topics, ranging from low-level circuit technologies to firmware implementations. Dr. Harju earned his PhD degree in 2006 from Tampere University of Technology in Finland. His academic work focused on programmable digital baseband implementations.

Invited Speaker: Toshihiro Hattori, Renesas Mobile Corporation, Japan

Toshihiro Hattori received the B.S. and M.S. degrees in electrical engineering from Kyoto University, Japan, in 1983 and 1985, respectively. He received the Ph.D. in informatics from Kyoto University, Japan, in 2006. He joined the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan, in 1985, where he was engaged in logic/layout tool development. From 1992 to 1993 he was a Visiting Researcher at the University of California, Berkeley, with a particular interest in CAD. He joined the Semiconductor Development Center in the Semiconductor Integrated Circuits Division of Hitachi Ltd. in 1995. He moved to Renesas Technology Corp. in 2003. He was with SuperH (Japan), Ltd. from 2001 to 2004 to conduct SH processor licensing and development. He moved to Renesas Electronics Corp. in 2010. He is currently working with Renesas Mobile Corp. as VP of SoC design. He is a member of IEEE (SSCS), ACM, IEICE, and IPSJ.

Invited Speaker: Patricia Derler, UC Berkeley, USA

Patricia Derler is a postdoctoral researcher at UC Berkeley. She received her Ph.D. in Computer Science from the University of Salzburg, Austria, and did her undergraduate studies at the University of Hagenberg, Austria. Her research interests are in the design and simulation of cyber-physical systems, deterministic models of computation, and the use of predictability in software, hardware and the environment towards efficient simulations and executions.

Program Committee STEERING COMMITTEE Mohamed Abid Tapani Ahonen Tughrul Arslan Daniel Chillet Ahmet Erdogan Guy Gogniat Bertrand Granado Michael Hübner Jean-Didier Legat Stéphane Mancini Marco Mattavelli Dragomir Milojevic Adam Morawiec Jean-François Nezan Jari Nurmi Michel Paindavoine

Ecole nationale d'ingénieurs de Sfax, Tunisia TUT, Finland University of Edinburgh, UK ENSSAT, University of Rennes, France University of Edinburgh, UK Université de Bretagne Sud, France Ecole Nationale Supérieure de l'Electronique et de ses Applications, France Karlsruhe Institute of Technology (KIT), Germany Université catholique de Louvain, Belgium INPG/Ensimag, TIMA, Grenoble, France Ecole Polytechnique Fédérale de Lausanne, Switzerland Université Libre de Bruxelles, Belgium European Electronic Chips & Systems design Initiative, France INSA, France TUT, Finland Université de Bourgogne, France

TECHNICAL PROGRAM COMMITTEE Mohamed Abid Ecole nationale d'ingénieurs de Sfax, Tunisia Tapani Ahonen Tampere University of Technology (TUT), Finland Marwan Al-Akaidi De Montfort University, Leicester, UK Karim Ali CSEM, Switzerland Ihab Amer EPFL, Switzerland Abbes Amira Brunel University, UK Slaheddine Aridhi Texas Instruments, France Tughrul Arslan University of Edinburgh, UK Iuliana Bacivarov ETH Zurich, Switzerland Juergen Becker Karlsruhe Institute of Technology (KIT), Germany Cécile Belleudy LEAT laboratory - University of Nice Sophia-Antipolis, France Rabie Ben Atitallah INRIA, France Mladen Berekovic Technical University of Braunschweig, Germany Christophe Bobda University of Arkansas, Fayetteville, USA Ahmed Bouridane Queen's University Belfast, UK Jani Boutellier University of Oulu, Finland Claudio Brunelli Nokia Research Center, Finland Giovanni Busonera CRS4, Italy Joan Cabestany UPC - Technical University of Catalunya, Spain Joao Cardoso University of Porto, Portugal Emmanuel Casseau IRISA/INRIA - University of Rennes I, France Daniele Caviglia University of Genova, Italy Stéphane Chevobbe CEA, France Jorge Juan Chico University of Sevilla, Spain Daniel Chillet ENSSAT, University of Rennes, France Christopher Claus Bosch Reutlingen, Germany David Crawford Epson Scotland Design Centre, UK Piet De Moor IMEC, Belgium Jean-Philippe Diguet Lab-STICC CNRS, France Milos Drutarovsky Technical University of Kosice, Slovakia Marc Duranton CEA LIST, France Ahmet T. Erdogan University of Edinburgh, UK Carles Ferrer Universitat Autònoma de Barcelona, Spain Eric Fragnière University of Applied Sciences Western Switzerland James Fung Nvidia, Santa Clara, USA Alberto Garcia University of Bremen, Germany Patrick Garda Université Pierre et Marie Curie, France Fabio Garzia Tampere University of Technology, Finland Guy Gogniat Université de Bretagne Sud, France Diana Göhringer Fraunhofer IOSB, Ettlingen, Germany Bertrand Granado ETIS-ENSEA, France


Arnaud Grasset Stéphane Guyetant Frank Hannig David Hasler Dominique Houzet Gareth Howells Michael Hübner Mohammad Ibrahim Jorn Janneck Nathalie Julien François Kaess Udo Kebschull Johann Laurent Jean-Didier Legat Yannick LeMoullec Shujun Li Yuzhe Liu Felix Lustenberger Alberto Macii Franck Mamelet Stéphane Mancini Philippe Manet Mohammad M. Mansour Marco Mattavelli Klaus D. McDonald-Maier Samy Meftali Paolo Meloni Dragomir Milojevic Benoît Miramond Adam Morawiec Jean-François Nezan Smail Niar Juanjo Noguera Jari Nurmi Tokunbo Ogunfunmi Michel Paindavoine Vassilis Paliouras Francesca Palumbo Danilo Pani Christian Piguet Sébastien Pillement Massimo Poncino Vincenzo Rana Mickael Raulet Frédéric Robert Gilles Sassatelli Simone Secchi Eric Senn Baraniya Shailendra Mohamed M. Shawky Gilles Sicard David Siguenza-Tortosa Leandro Soares Indrusiak Yves Sorel Dimitrios Soudris Jarmo Takala Arnaud Tisserand Yves Vanderperren François Verdier Tanya Vladimirova Nikos Voros Serge Weber Matthieu Wipliez Olivier Zendra Mathieu Thevenin

Thales Research & Technology, France CEA, LIST, France University of Erlangen-Nuremberg, Germany CSEM, Switzerland Institut National Polytechnique de Grenoble, France University of Kent, UK Karlsruhe Institute of Technology (KIT), Germany Faculty of Technology, Leicester, UK Lund Institute of Technology, Sweden University of South Brittany, France CSEM, Switzerland Goethe-Universität Frankfurt, Germany UBS Lab-STICC / CNRS UMR 3192, France Université Catholique de Louvain, Belgium Aalborg University, Denmark University of Konstanz, Germany University of Notre Dame, USA IEEE, USA Politecnico di Torino, Italy France Télécom, France INPG/Ensimag, TIMA, France Université Catholique de Louvain, Belgium American University of Beirut, Lebanon EPFL, Switzerland University of Essex, UK Université de Lille, France University of Cagliari, Italy Université Libre de Bruxelles, Belgium ETIS – UMR 8051 – ENSEA Cergy, France ECSI, France INSA, France INRIA, France Xilinx, Ireland Tampere University of Technology (TUT), Finland Santa Clara University, California, USA Université de Bourgogne, France University of Patras, Greece University of Cagliari, Italy DIEE - University of Cagliari, Italy CSEM, Switzerland Université de Rennes 1 - IRISA - CAIRN, France Politecnico di Torino, Italy EPFL, Switzerland IETR/INSA Rennes, France Université Libre de Bruxelles, Belgium LIRMM, France Pacific Northwest National Laboratory, USA Université de Bretagne-Sud, France NMIMS University, India Heudiasyc, France TIMA Laboratory, Joseph Fourier University of Grenoble, France Universidad Complutense de Madrid (UCM), Spain University of York, UK INRIA, France National Technical University of Athens, Greece Tampere University of Technology, Tampere, Finland IRISA, CNRS - Univ. Rennes, France KU Leuven, Belgium ENSEA, France University of Surrey, UK Technological Educational Institute of Mesolongi, Greece Université Henri Poincaré, France IETR/INSA Rennes, France INRIA, France CEA, France



Tampere, Finland, November 2-4, 2011

Session 1: Regular Session: System Simulation and Processor Generation Chair: Michael Hübner, Karlsruhe Institute of Technology (KIT), Germany

A SystemC TLM Framework for Distributed Simulation of Complex Systems with Unpredictable Communication Julien Peeters, Nicolas Ventroux and Tanguy Sassolas Performance Evaluation of an Automotive Distributed Architecture Based on HPAV Communication Protocol Using a Transaction Level Modeling Approach Takieddine Majdoub, Sebastien Le Nours, Olivier Pasquier and Fabienne Nouvel MORPHEO: a High-Performance Processor Generator for a FPGA Implementation Mathieu Rosière, Nathalie Drach, Jean Lou Desbarbieux and Franck Wajsbürt Design of a Processor Optimized for Syntax Parsing in Video Decoders Nicolas Siret, Jean-François Nezan and Aimad Rhatay


A SystemC TLM Framework for Distributed Simulation of Complex Systems with Unpredictable Communication Julien Peeters, Nicolas Ventroux, Tanguy Sassolas, Lionel Lacassagne CEA, LIST, Embedded Computing Laboratory 91191 Gif-sur-Yvette CEDEX, FRANCE Email: [email protected]

Abstract—Increasingly complex systems need parallelized simulation engines. In the context of SystemC simulation, existing proposals require predicting communication in the simulated system. However, communication is often unpredictable. In order to deal with unpredictable systems, this paper presents a parallelization approach using asynchronous communication without modification of the SystemC simulation engine. The simulated system model is cut up and distributed across separate simulation engines, each part being evaluated in parallel with the others. Functional consistency is preserved thanks to the simulated system's write-exclusive memory access policy, while temporal consistency is guaranteed using explicit synchronization. Experimental results show a speed-up of up to 13x on 16 processors.

I. INTRODUCTION

Hardware complexity is continuously increasing. For instance, Systems-on-Chip (SoCs) now integrate ever more sophisticated architectures, targeting multimedia (H.264, VC-1...) or wireless communication (UMTS, WiMAX...). Simulation of such systems slows down as their complexity increases [1]. SystemC is a C++ class library that supports mixed software/hardware system design. System modeling can be done at different levels of abstraction and accuracy. As a consequence, SystemC has become widely used as a tool to explore the design space of complex systems. For this task, there is no need for bit-level accuracy. Instead, wire-level communication is abstracted away as high-level transactions implemented by a SystemC extension named Transaction-Level Modeling (TLM). A promising approach is to parallelize simulation. Related proposals [1]–[7] preserve simulation temporal causality thanks to conservative synchronization [8], avoiding the management of an expensive checkpoint/rollback strategy required when using the optimistic variant. However, conservative synchronization becomes a bottleneck when communication is unpredictable. Indeed, such synchronization must occur as soon as a communication can be initiated by any module in the simulated system. In the present context, we assume that this might happen at each simulated system clock cycle, called further a cycle. In addition, we will talk, in this paper, about temporal consistency, which allows a temporal error under certain conditions, in contrast to strict causality as defined in previous work. We focus our work on two key points, which lead to two contributions, addressing simulation and design space exploration of complex unpredictable systems:



• We propose (section II) a parallelization approach, drawing its inspiration from the proposals of Mello [2] and Galiano [6]. We partition the system model into clusters and evaluate each of them in a separate simulation engine. Communication between clusters is made asynchronous so as to avoid blocking the simulation. In order to deal with unpredictable systems, we introduce a new synchronization mechanism divided into two parts. One relies on the write-exclusive memory access policy of the simulated system and preserves the functional consistency of the simulation. The other generates explicit synchronization to bound the temporal error introduced by asynchronous communication.
• We implement (section III) our approach as a SystemC TLM framework. Thanks to this framework, parallelizing a simulation does not imply rewriting or adapting the simulated system model. Parallelization is transparent for the system designer and does not require modifying the SystemC simulation engine. This framework guarantees that the functional modeling semantics is preserved when the simulation is parallelized.

So as to validate our approach, we build a customizable validation environment (section IV). We use our approach to run many parallel simulations against many environment configurations. The results (section V) allow us to characterize the simulation speed-up and accuracy. Furthermore, we deduce from the results two simulation modes, which give a simulation speed-up of up to 13x on 16 processors compared to a standard non-distributed simulation. Finally, we compare our approach with related work (section VI) and conclude (section VII) on the results and features provided by our approach.

II. BACKGROUND AND SUGGESTED NEW APPROACH

Among the promising approaches speeding up SystemC simulation, Mello [2] and Galiano [6] propose to cut up the model describing the system to simulate, also called the simulated system model, into clusters. A cluster is a partition of the whole simulated system model; it is purely virtual and does not represent a concrete structure in the simulated system. Thereafter, the clusters are evaluated in parallel with each other, each in a separate simulation engine running on a separate processor. Parallel Discrete Event Simulation (PDES) theory [8] provides a formal representation of synchronization in parallel simulation. It has two variants: conservative and optimistic. The optimistic variant lets the simulation speculate on its future states. A rollback mechanism returns to a valid state in case of an incorrect speculation. This requires recording checkpoints during simulation in order to roll back to the last valid state when necessary. However, the simulation state history becomes difficult to manage as the simulated system complexity increases [5]. By contrast, the conservative variant comprises neither a speculation nor a rollback mechanism. Instead, the simulation is only allowed to progress when synchronization has ensured that no communication annotated with a past time might occur. This way, it guarantees temporal causality at any time during the simulation. For this reason, most related works implement a conservative synchronization mechanism. The conservative variant requires knowing the minimum duration between two communications occurring in the simulated system [5], [6]. This duration is called the lookahead and represents the time during which the parallel simulation can be evaluated without synchronizing. The longer the lookahead, the greater the speed-up. However, in some systems, the communication rate cannot be specified. Consequently, the lookahead must be shortened to the worst-case value: one simulated system clock cycle. This causes the simulation to dramatically slow down or even become unusable. Our approach draws its inspiration from the proposals of Mello [2] and Galiano [6]. In our case, communication between clusters is implemented using the Message Passing Interface (MPI) [9]. Our implementation transforms any inter-cluster communication into asynchronous calls. This lets the simulation progress locally on the initiator side while waiting for the communication to complete. In order to deal with unpredictable systems, we propose a hybrid synchronization approach that is divided into two parts. One part guarantees the functional modeling consistency when a cluster accesses data from another cluster, constraining the simulated system to have a write-exclusive memory access policy. This means that a writer has exclusive access to memory when it is writing. Hence, it must wait for all the readers to have finished reading before writing. This constraint creates implicit synchronization between clusters. Here, writers and readers are SystemC modules in the simulated system model.
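To make the implicit synchronization concrete, the following minimal sketch (plain C++, our own illustration and not the authors' code) shows the write-exclusive policy described above: a writer may only overwrite a shared value once every registered reader has consumed the previous one, and each reader is assumed to read exactly once per written value.

    // Sketch only: write-exclusive access providing implicit synchronization.
    #include <condition_variable>
    #include <mutex>

    class WriteExclusiveSlot {
    public:
        explicit WriteExclusiveSlot(int readers) : total_readers(readers) {}

        void write(int v) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return pending_readers == 0; });  // wait for all readers
            value = v;
            pending_readers = total_readers;                       // publish to every reader
            cv.notify_all();
        }

        int read() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return pending_readers > 0; });   // wait for a fresh value
            int v = value;
            --pending_readers;
            if (pending_readers == 0) cv.notify_all();             // release the writer
            return v;
        }

    private:
        int value = 0;
        int pending_readers = 0;
        const int total_readers;
        std::mutex m;
        std::condition_variable cv;
    };

In the paper's setting the writer and readers are SystemC modules scheduled cooperatively rather than OS threads, but the ordering constraint they obey is the same.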

Fig. 1. Example of (a) an initiator connected to a target and (b) chaining of modules. Arrows indicate the direction from an initiator to a target.

However, even if functional consistency is guaranteed, nothing prevents clusters from diverging in time, as communication is asynchronous, thereby potentially introducing a temporal error. This error varies according to the relative simulation speed deviation between communicating clusters. The other part of the synchronization mechanism therefore sends explicit synchronization at regular intervals so as to bound this error. The explicit synchronization period is specified by the system designer. The designer can also change the period value so as to tune the simulation accuracy. We will now present the implementation details of our parallelization approach as a SystemC TLM framework.

III. FRAMEWORK IMPLEMENTATION DETAILS

In the TLM specification, an initiator initiates a request and sends it to a target. The target processes the request and sends back a response. Such a request-response pair is named a transaction. In this context, initiator and target are both SystemC modules (figure 1a). According to the specification, a communication between an initiator and a target may be either blocking or non-blocking. In the latter case, the response part of a transaction may be omitted. SystemC modules may also be chained. In this case, intermediate modules act as a target for the previous module and as an initiator for the next module in the chain (figure 1b). When parallelizing a SystemC simulation, the simulated system model is cut up into clusters. Therefore, some communication links between SystemC modules are broken as a result of the cutting. These broken links are replaced by virtual links, providing the inter-cluster communication substrate. However, the introduction of virtual links does not modify the initial semantics of the simulated system model. Figures 2a and 2b illustrate this transformation. A virtual link is composed of two end-points named wrappers. A wrapper acts as an interface both to the distributed communication layer (i.e. the MPI middleware) and to the initiator or target involved in the link (figure 2c). As a consequence, a SystemC module connected to a virtual link does not know whether the link is virtual or not. Hence, there is no requirement to modify or adapt initiators and targets for the parallel simulation to work.
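For readers less familiar with TLM, the sketch below shows what the blocking transport interface looks like in standard TLM-2.0, using the tlm_utils convenience sockets. It is a generic illustration of the transport mechanism the framework intercepts, not code from the authors' framework.

    #include <systemc>
    #include <tlm>
    #include <tlm_utils/simple_initiator_socket.h>
    #include <tlm_utils/simple_target_socket.h>

    // A target that serves blocking-transport requests.
    struct Target : sc_core::sc_module {
        tlm_utils::simple_target_socket<Target> socket;
        Target(sc_core::sc_module_name n) : sc_module(n), socket("socket") {
            socket.register_b_transport(this, &Target::b_transport);
        }
        void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
            // ... process the read or write request ...
            trans.set_response_status(tlm::TLM_OK_RESPONSE);
        }
    };

    // An initiator that issues one read transaction through the transport method.
    struct Initiator : sc_core::sc_module {
        tlm_utils::simple_initiator_socket<Initiator> socket;
        SC_HAS_PROCESS(Initiator);
        Initiator(sc_core::sc_module_name n) : sc_module(n), socket("socket") {
            SC_THREAD(run);
        }
        void run() {
            unsigned char data[4] = {0};
            tlm::tlm_generic_payload trans;
            sc_core::sc_time delay = sc_core::SC_ZERO_TIME;
            trans.set_command(tlm::TLM_READ_COMMAND);
            trans.set_address(0x100);
            trans.set_data_ptr(data);
            trans.set_data_length(4);
            socket->b_transport(trans, delay);   // blocking transport call
        }
    };

    int sc_main(int, char*[]) {
        Initiator init("initiator");
        Target    tgt("target");
        init.socket.bind(tgt.socket);
        sc_core::sc_start();
        return 0;
    }

In the parallelized simulation, the proxy and remote wrappers sit on either side of such a b_transport call, so neither the initiator nor the target needs to change.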

Fig. 2. Using our approach, the parallelization of a SystemC simulation begins with the transformation of the non-distributed model (a) of the simulated system into a distributed one (b). At the same time, the interconnect is split across clusters and is thereafter known as the distributed interconnect. Finally, links carrying communication through the interconnect are replaced by virtual links (c). The latter exchange data through the MPI middleware, abstracted and hidden in the distributed interconnect.

Explicit synchronization, introduced in section II, is the central mechanism that prevents the clusters' simulations from diverging in time with one another. This part of the whole synchronization approach implements a handshake between clusters sharing at least one virtual link. The implementation takes place in a SystemC module named the synchronization module, one assigned to each cluster. In the following paragraphs, we detail the wrapping and synchronization mechanisms and their implementations.

A. Wrappers

In a standard non-distributed SystemC TLM simulation, an initiator communicates with its targets through transport methods, where the word method refers to a C++ method. A transport method literally transports a transaction from an initiator to a target, or at least to an intermediate SystemC module acting as a target when modules are chained. When the simulation is distributed, an initiator and a target communicating together may be mapped onto different clusters. So as to hide this mapping from both the initiator and the target points of view, wrappers are introduced in the simulated system model. These wrappers transform the call to the transport method into an asynchronous MPI communication, thereby implementing the abstraction exposed by the virtual link. On the initiator side, the wrapper is named the proxy wrapper; on the target side, the wrapper is named the remote wrapper. The proxy wrapper mimics the behavior of the target as if the initiator were directly connected to it. The remote wrapper is in charge of forwarding the request to the target and the response back to the initiator. In order to handle the

asynchronous nature of MPI communication, both wrappers are implemented as SystemC modules. They contain one process that is activated when a transaction is received. This activation is managed by an event dispatcher, one assigned to each cluster. The event dispatcher polls MPI communication from remote clusters sharing at least one virtual link with the cluster owning the event dispatcher. Figure 3 illustrates communication between an initiator and a target. Wrapper association, used to create a virtual link, is made before the simulation begins. Associations are specified in a configuration file containing the wrappers' unique identifiers. This file is loaded at start-up and parsed to build the virtual links.

B. Synchronization module

Synchronization in a parallel simulation aims to keep the simulation state consistent among all parts of the simulation environment. In our case, synchronization only occurs between clusters sharing at least one communication dependency between an initiator and a target, assuming both modules are on different clusters. Consequently, if two clusters do not share such a dependency, they never synchronize. The synchronization module implements the explicit part of our hybrid synchronization approach as a handshake between clusters. The implementation of this handshake is detailed in figure 4. In contrast to distributed transactional communication, synchronization is done synchronously.

 1: procedure EXPLICIT_SYNCHRONIZATION
 2:   for ct ∈ remote target clusters do
 3:     SEND_SYNC(ct)
 4:   end for
 5:   for ci ∈ remote initiator clusters do
 6:     RECV_SYNC(ci)
 7:     SEND_ACK(ci)
 8:   end for
 9:   for ct ∈ remote target clusters do
10:     RECV_ACK(ct)
11:   end for
12: end procedure

Fig. 4. Listing of the explicit synchronization algorithm.

Fig. 3. When an initiator sends a transaction to a target, the proxy wrapper connected to this initiator receives the transaction (1). The proxy wrapper forwards the transaction to its associated remote wrapper (2) and waits for the response. On the other side, the remote wrapper wakes up when the transaction arrives (3), previously notified by the event dispatcher of its cluster. The remote wrapper transfers the transaction to the target (4), which processes it (5). When the target returns the response (6), the remote wrapper sends it back to the proxy wrapper (7), which is notified by the initiator's cluster event dispatcher (8). Finally, the proxy wrapper forwards the response to the initiator (9).
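As a rough illustration of the flow of Fig. 3, a proxy wrapper's transport method could be sketched as below. This is our own simplified reconstruction, not the authors' code: the message layout, MPI tags and the way the event dispatcher delivers the response are assumptions, and data transfer for reads and writes is elided.

    #include <systemc>
    #include <tlm>
    #include <tlm_utils/simple_target_socket.h>
    #include <mpi.h>
    #include <cstdint>

    // Sketch of a proxy wrapper: it looks like a TLM target to the local initiator
    // and forwards every transaction to its remote wrapper over MPI (steps 1-2 and
    // 7-9 of Fig. 3). All names and the payload encoding are hypothetical.
    struct ProxyWrapper : sc_core::sc_module {
        tlm_utils::simple_target_socket<ProxyWrapper> in_socket;
        sc_core::sc_event response_event;  // notified by the cluster's event dispatcher
        int remote_rank;                   // MPI rank of the cluster hosting the real target
        int link_tag;                      // identifies this virtual link (from the config file)

        ProxyWrapper(sc_core::sc_module_name n, int rank, int tag)
            : sc_module(n), in_socket("in_socket"), remote_rank(rank), link_tag(tag) {
            in_socket.register_b_transport(this, &ProxyWrapper::b_transport);
        }

        // Assumes the initiator calls b_transport from an SC_THREAD, so wait() is legal.
        void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
            // (1)-(2): serialize the request and post it to the remote wrapper.
            struct { uint64_t addr; uint32_t len; uint32_t cmd; } req = {
                trans.get_address(), trans.get_data_length(),
                static_cast<uint32_t>(trans.get_command()) };
            MPI_Request r;
            MPI_Isend(&req, sizeof(req), MPI_BYTE, remote_rank, link_tag,
                      MPI_COMM_WORLD, &r);
            // The calling process now yields; other SystemC processes of this cluster
            // keep running, which is what makes the inter-cluster call asynchronous.
            sc_core::wait(response_event);
            MPI_Wait(&r, MPI_STATUS_IGNORE);
            // (7)-(9): the event dispatcher has received the response (and, for a read,
            // is assumed to have copied the returned data into the payload buffer).
            trans.set_response_status(tlm::TLM_OK_RESPONSE);
        }
    };

The remote wrapper on the other cluster would do the converse: receive the request from MPI, issue a local b_transport to the real target, and send the response back.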

The main issue is then to prevent deadlocks, as SystemC processes are evaluated sequentially. To do so, explicit synchronization proceeds as follows, considering the point of view of a cluster:
1) A synchronization order is sent (send_sync) to all clusters containing one or more targets connected to one or more initiators in the current cluster (lines 2-4);
2) When a synchronization order is received from an initiator's cluster (recv_sync), an acknowledgement is sent back (send_ack) to it (lines 5-8);
3) Finally, the current cluster waits (recv_ack) to receive the acknowledgements for all the synchronization orders it sent in the first phase (lines 9-11).
Explicit synchronization occurs at regular intervals in the simulation. The time elapsed between two explicit synchronizations is called the synchronization period. This period is specified by the simulated system designer and/or the simulation end-user. So as to reduce the cost of synchronization during the simulation, explicit synchronization is implemented as a SystemC method process. This kind of SystemC process (i.e. SC_METHOD) corresponds to a function call, whereas the other kind, named a thread process (i.e. SC_THREAD), generates thread context switches, which are more expensive.
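Expressed with plain MPI point-to-point calls, the handshake of Fig. 4 could look roughly as follows. The tags and function names are our own assumptions; the small messages are assumed to be sent eagerly, so posting all synchronization orders before receiving avoids deadlock.

    #include <mpi.h>
    #include <vector>

    // Sketch of the explicit synchronization handshake of Fig. 4 (not the authors'
    // code). Each cluster is one MPI rank.
    enum { TAG_SYNC = 100, TAG_ACK = 101 };

    void explicit_synchronization(const std::vector<int>& target_clusters,
                                  const std::vector<int>& initiator_clusters) {
        char token = 0;
        // lines 2-4: send a synchronization order to every cluster we initiate towards
        for (int ct : target_clusters)
            MPI_Send(&token, 1, MPI_CHAR, ct, TAG_SYNC, MPI_COMM_WORLD);
        // lines 5-8: acknowledge every order received from an initiator's cluster
        for (int ci : initiator_clusters) {
            MPI_Recv(&token, 1, MPI_CHAR, ci, TAG_SYNC, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_CHAR, ci, TAG_ACK, MPI_COMM_WORLD);
        }
        // lines 9-11: wait for the acknowledgements of our own synchronization orders
        for (int ct : target_clusters)
            MPI_Recv(&token, 1, MPI_CHAR, ct, TAG_ACK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }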

Nonetheless, it is mandatory to guarantee that explicit synchronization effectively bounds the temporal error introduced by asynchronous communication between clusters. Actually, this is already guaranteed by the nature of explicit synchronization and its periodicity. Indeed, explicit synchronization occurs in each cluster with the same period, and the synchronization is processed synchronously between related clusters. Moreover, all clusters start their part of the simulation evaluation at time zero. As a result, when clusters synchronize, their local times converge to a global simulation time. Figure 5 illustrates how explicit synchronization effectively bounds the temporal error. In the given example, at step (2), cluster j is ahead of cluster i. Their local times differ by a certain δ, previously defined as the temporal error. Next, at step (3), cluster i is ahead of cluster j and their local times differ again by a certain, possibly different, δ. Finally, at step (4), the clusters initiate a handshake and block until the synchronization completes. Thereafter, clusters i and j continue the simulation with their synchronized local times. Simulation termination is another issue we address in this paper. When a part of the whole simulation terminates in a cluster, there is, a priori, no reason for other clusters to know that this cluster has finished its part of the simulation. Worse, some clusters may wait for the terminated cluster to synchronize, resulting in a deadlock. To deal with this issue, we propose a cooperative termination method based on the explicit synchronization algorithm (figure 4). When the simulation is expected to end, a call to the sim_stop function is made. This causes the synchronization modules to send stop orders instead of synchronization ones. Then, all clusters are allowed to respond to pending transactions until the last (stop) synchronization is achieved.

IV. VALIDATION ENVIRONMENT

The validation environment is composed of two parts: a dedicated hardware simulation infrastructure and a SystemC TLM validation model.

A. Simulation hardware

The hardware used for validation is composed of four nodes. Each node is a quad-core Xeon W3550 at 3.06 GHz with 24 GB RAM and two 1000BaseT network interfaces. All run a 2.6.9-67 RHEL 4 SMP Linux kernel without support for HyperThreading. The MPI implementation is OpenMPI version 1.4.2. The command line used to launch a distributed simulation depends on the number of clusters involved in the simulation.

Fig. 5. Example of relative time deviation between clusters and effect of explicit synchronization on the local time of clusters. First, all clusters start the simulation at time zero (1). Next, a cluster i's initiator sends a transaction to a target of cluster j through a transport method and receives a response (2). After a little while, a cluster j's initiator sends a transaction to a target of cluster i and receives a response (3). When the synchronization period (T) has elapsed, cluster i and cluster j synchronize synchronously (4).

For instance, when using 4 clusters, the command line looks like:

    mpiexec --mca btl tcp,self --mca mpi_paffinity_alone 1 -hostfile mpi.hosts -n 4 ./top

B. Validation model

The hardware part of the validation model (figure 2a) is composed of processing cores (CPUs), distributed memories (Mem) and a network interconnect. Each memory is directly accessible to one processor as its local memory, using a dedicated port, and to all other processors through the network interconnect. The address space is globally shared. Communication between CPUs goes through the local memories and is assumed to be unpredictable. The application implemented for validation purposes follows a data-flow software model. In this context, each CPU executes two tasks. The first one aims to represent the realistic thread context switches involved in genuine complex system simulations. It implements a fixed-length integer sum. This task is executed at each cycle and does not produce communication outside the CPU. The second task can be a producer or a consumer task, implementing a variable-length integer sum. This task is executed periodically following a user-defined simulated period (given in cycles). In a producer task, the sum is computed and the result is written to a consumer's local memory. In a consumer task, the sum is computed as soon as new data is available in its local memory. This new data is used as the initial value of the computed sum. For the purpose of the validation, producers and consumers are chained two-by-two, each producer providing data to only one consumer. In order to distribute the SystemC model of figure 2a, the model is cut up into clusters as defined in section II. In the present case, CPUs are grouped with their local memories and the network interconnect is distributed among the clusters. The resulting SystemC model is presented in figure 2b.
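A producer task of this validation model could be sketched as a SystemC thread along the following lines; the module, parameter and helper names (in particular bus_write) are hypothetical and only illustrate the periodic sum-and-write behaviour described above.

    #include <systemc>
    #include <cstdint>

    // Sketch of a producer CPU task: every period_cycles cycles it computes a
    // variable-length integer sum and writes the result to the local memory of
    // its paired consumer (possibly on another cluster, i.e. over a virtual link).
    struct ProducerCpu : sc_core::sc_module {
        SC_HAS_PROCESS(ProducerCpu);
        unsigned period_cycles;
        unsigned sum_length;
        sc_core::sc_time cycle;

        ProducerCpu(sc_core::sc_module_name n, unsigned period, unsigned length)
            : sc_module(n), period_cycles(period), sum_length(length),
              cycle(1, sc_core::SC_NS) {
            SC_THREAD(run);
        }

        void bus_write(uint64_t addr, long value) {
            // would issue a TLM write transaction towards the consumer's memory
        }

        void run() {
            while (true) {
                wait(period_cycles * cycle);          // user-defined simulated period
                long sum = 0;
                for (unsigned i = 0; i < sum_length; ++i)
                    sum += i;                         // variable-length integer sum
                bus_write(0x1000 /* consumer's local memory */, sum);
            }
        }
    };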

The transformation from a standard non-distributed SystemC model to a distributed one is completely automated. Although the current implementation of this automated transformation is only valid for the module pattern of figure 2a, the transformation process can easily be extended to the general case. For instance, SystemCXML [10], PinaVM [11] and Scoot [12] are tools that can be used to extract communication dependency information and generate a top-level SystemC module with the appropriate allocation of clusters. The validation model we propose here is customizable. The following parameters can be changed: the number of CPUs, the computational load of the CPUs, the local memory size, the number of clusters and the explicit synchronization period. Nevertheless, in order to keep the testing set size reasonable, some constraints have been put on the model parameter values: the producer and consumer tasks all have the same computational load (i.e. the same sum size) and the explicit synchronization period is identical for all clusters. It is important to notice that the system model used for validation only aims to be theoretical and is not intended to be implemented on silicon. Nevertheless, it aims to express the following system properties: a high degree of parallelism through a great number of tasks and a high degree of interconnection.

V. EXPERIMENTAL RESULTS

For each distributed simulation, the validation model is composed of 64 CPUs and 64 memories. The distribution is made following three configurations, for which the simulated system model is partitioned into 4, 8 and 16 clusters respectively. The number of modules per cluster is chosen so as to balance the load among node resources. In addition, we forced each cluster to be evaluated alone on one node core. In order to characterize the behavior of a distributed SystemC TLM simulation using our approach, we ran the simulations in three modes, corresponding to three producer/consumer task simulated period values:
• using a fixed value of 100 cycles (no random variation);
• following a uniform probability law with a mean of 100 cycles and a variance of 20% (i.e. 20 cycles);
• following a Poisson probability law with a mean of 100 cycles and a variance of 20% (i.e. 20 cycles).
We also ran distributed simulations with a variance equal to 40% of the mean. In those cases, the results were similar to the ones presented below, so we will not discuss them in this paper. Figure 6 shows the results we obtained for each distributed configuration and each producer/consumer task simulated period value. The results show the influence of the real-time duration of the producer/consumer tasks for a given simulated period, which we discuss now.

A. Speed-up characterization

Figures 6a, 6d and 6g show the relationship between Tc, the real-time producer/consumer task duration, and Tes, the explicit synchronization period.

Fig. 6. Speed-up of SystemC TLM distributed simulations compared to a non-distributed one. These results present the speed-up we achieved using our new approach, considering several explicit synchronization periods (1, 10, 100, 1000 and 10 000 cycles) and plotted against the average task duration in seconds. Rows correspond, in order, to the three distributed configurations composed of 4, 8 and 16 clusters. The first column gives the results for a fixed value of the simulated producer/consumer task period. The second and third columns give the results for a random producer/consumer task simulated period, following a uniform probability law and a Poisson probability law respectively.

One can notice that Tc also represents the real-time communication period. When Tes < Tc, more than one synchronization occurs during a period of Tc. As a consequence, the number of thread context switches grows within the SystemC simulation engine, slowing down the distributed simulation. On the other hand, increasing Tes beyond Tc does not provide a significant benefit. Indeed, the simulation throughput is limited by the implicit synchronization, whose period equals Tc. As shown in the second and third columns of figure 6, a random producer/consumer task simulated period gives a better speed-up for shorter values of Tc than a non-random

simulated task period. This is explained by the probability that the producer/consumer task simulated period is less than 100 cycles (i.e. the mean value). Therefore, the number of synchronizations during a period of Tc decreases. In addition, the acceleration given by such a scenario is greater than the slowdown caused when the task simulated period value is greater than 100 cycles. Looking at the validation results, our approach is scalable. Indeed, the maximum speed-up in all cases is very close to the theoretical speed-up.

TABLE I
Temporal error given in percent of the producer/consumer task simulated period (100 cycles). Values given here are the maximum generated error considering all values of Tc.

                    Explicit synchronization period (cycles)
# clusters       1        10       100      1000      10 000
    4          1.00%     4.2%     8.6%    143.90%    1517.70%
    8          0.99%     3.4%    13.1%    231.90%    1388.60%
   16          0.95%     4.2%    15.5%     73.00%    2193.60%

For instance, for a non-random SystemC TLM distributed simulation composed of 16 clusters, the speed-up equals 13.33 compared to a non-distributed simulation; the maximum speed-up for a distributed simulation composed of 16 clusters and following a uniform random law equals 12.54.

B. Error characterization

Table I shows the temporal error, as defined in section II, for different values of the explicit synchronization period. One can see that the temporal error is always bounded by the explicit synchronization period. For instance, given an explicit synchronization period of 10 cycles, the maximum error equals 4.2% on average. As expected, when the explicit synchronization period gets longer, the temporal error increases. In addition, as detailed in section IV, communication in producer and consumer tasks is surrounded by task evaluations. In a SystemC TLM simulation, these tasks are evaluated sequentially like any other concurrent SystemC process in a given simulation engine. Then, while a producer/consumer task is being evaluated, clusters do not synchronize with one another. Therefore, the longer the producer/consumer task evaluation, the greater the temporal error.

C. Exploiting distributed simulation properties

When looking at the boundary behavior, two distinct situations can be observed according to the explicit synchronization period. On one hand, short periods (< 10 cycles) give little throughput but high precision. On the other hand, long periods (> 100 cycles) give higher throughput but little precision. Therefore, we propose two simulation modes: one ought to use a period of around 100 cycles when expecting a high throughput; conversely, one ought to use an explicit synchronization period of 1 cycle when more precision is required.

VI. RELATED WORK

SystemC is a discrete event simulator using delta cycles to simulate concurrent processes in a system. Such a process can be modeled as a function call or a thread depending on its nature. Buchmann [13], Mouchard [14] and Naguib [15] observed that the default SystemC dynamic process scheduling produces more thread context switches than effectively needed. So, they proposed a static scheduling relying on communication dependencies between SystemC processes. The scheduling is obtained thanks to a static analysis of the simulated system model. However, when communication is unpredictable, this approach cannot be used. An alternative is

to parallelize the evaluation of concurrent processes. To do so, two methods are presented in related work. One method requires modifying the SystemC simulation engine. Ezudheen [1] does this by adding OpenMP [16] directives, while Mello [2] uses the QuickThread framework [17]. More radically, Nanjundappa [18] implements a transformation chain from SystemC to CUDA [19]. All these proposals make severe modifications of the SystemC implementation that lead to relevant results. However, such modifications imply expensive maintenance to stay compatible with future versions of SystemC. Our aim is to focus on the synchronization mechanism, which is one of the most critical parts of parallel simulation. The other parallelization method is the one our approach relies on, detailed in section II. Combes [3] shows the synchronization bottleneck generated by the conservative variant of PDES. As a solution, they propose an interesting distributed synchronization mechanism. However, it requires modifying the SystemC simulation engine, which we proscribe. Our synchronization implementation is close to their approach but much simpler. We implement it at the model level instead of inside the SystemC simulation engine. Yi [20] proposes an interesting method, called trace-driven virtual synchronization, that separates event generation from time alignment. However, implementing this in the context of a SystemC simulation requires modifying the simulation engine, which we exclude in this paper.

VII. CONCLUSION

In conclusion, this paper presents a new parallelization approach with a hybrid synchronization mechanism designed to deal with unpredictable systems, making it possible to simulate and explore the design space of such systems. Simulation parallelization is transparent to the simulated system designer thanks to a dedicated SystemC TLM framework, thereby increasing the reuse of previously written SystemC models. This framework does not require modifying the SystemC simulation engine at all. Both features provide an easy-to-use and relevant simulation environment. Experimental results show that the distributed simulation speed-up is conditioned by a threshold. This threshold is inherent to distributed programming methods and relies on the ratio between the simulation processing and the communication cost. The results also highlight the relationship between the synchronization mechanism and the nature of communication between clusters. Two simulation modes can be extracted from the results. The first one provides little throughput but high precision; the second mode provides high throughput but little precision. In this second case, the results underline that when the communication simulated period is fixed, corresponding to the ideal case, the best speed-up is obtained when the explicit synchronization period matches the communication period. In addition, when the communication simulated period is randomized, approaching the real case, the acceleration values are slightly smaller but still satisfying.

with a distributed simulation composed of 16 clusters compared to a non-distributed simulation equals 13.33 and 12.54 for the non-random and random cases respectively. In all cases, the temporal error is bounded by the explicit synchronization period. These results still need to be compared with those of distributed simulations using more realistic SystemC TLM models to be fully validated.

REFERENCES

[1] P. Ezudheen, P. Chandran, J. Chandra, B. P. Simon, and D. Ravi, “Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines,” in Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation, Montreal, Canada, pp. 80–87, 2009.
[2] A. Mello, I. Maia, A. Greiner, and F. Pecheux, “Parallel Simulation of SystemC TLM 2.0 Compliant MPSoC on SMP Workstations,” in Proceedings of Design, Automation and Test in Europe (DATE), Dresden, Germany, pp. 606–609, 2010.
[3] P. Combes, E. Caron, P. Desprez, B. Chopard, and J. Zory, “Relaxing Synchronization in a Parallel SystemC Kernel,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Sydney, Australia, pp. 180–187, 2008.
[4] B. Chopard, P. Combes, and J. Zory, “A Conservative Approach to SystemC Parallelization,” in Proceedings of the International Conference on Computational Science (ICCS), Reading, United Kingdom, pp. 653–660, 2006.
[5] M. Trams, “Conservative Distributed Discrete Event Simulation with SystemC using Explicit Lookahead,” Digital Force White Papers, 2004.
[6] V. Galiano, H. Migallón, D. Pérez-Caparrós, and M. Martínez, “Distributing SystemC Structures in Parallel Simulations,” in Proceedings of the 2009 Spring Simulation Multiconference, San Diego, CA, United States, pp. 1–8, 2009.
[7] D. R. Cox, “RITSim: Distributed SystemC Simulation,” Master thesis, Kate Gleason College of Engineering, 2005.
[8] R. M. Fujimoto, “Parallel Discrete Event Simulation,” in Proceedings of the 21st Conference on Winter Simulation, Washington D.C., United States, pp. 19–28, 1989.

[9] Message Passing Interface Forum, “MPI: a Message Passing Interface Standard,” Stuttgart, Germany, 2009.
[10] D. Berner, J.-P. Talpin, H. D. Patel, D. Mathaikutty, and S. K. Shukla, “SystemCXML: An Extensible SystemC Front-end Using XML,” in Proceedings of the Forum on specification and Design Languages (FDL), Lausanne, Switzerland, pp. 405–409, 2005.
[11] K. Marquet and M. Moy, “PinaVM: a SystemC Front-end Based on an Executable Intermediate Representation,” in Proceedings of the 10th ACM International Conference on Embedded Software (ICES), Scottsdale, AZ, United States, pp. 79–88, 2010.
[12] N. Blanc, D. Kroening, and N. Sharygina, “Scoot: A Tool for the Analysis of SystemC Models,” in Proceedings of the 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Budapest, Hungary, pp. 467–470, 2008.
[13] R. Buchmann and A. Greiner, “A Fully Static Scheduling Approach for Fast Cycle Accurate SystemC Simulation of MPSoCs,” in Proceedings of the International Conference on Microelectronics (ICM), Cairo, Egypt, pp. 105–108, 2007.
[14] G. Mouchard, D. G. Pérez, and O. Temam, “FastSysC: A Fast Simulation Engine,” in Proceedings of Design, Automation and Test in Europe (DATE), Paris, France, 2004.
[15] Y. N. Naguib and R. S. Guindi, “Speeding up SystemC Simulation Through Process Splitting,” in Proceedings of Design, Automation and Test in Europe (DATE), Nice, France, pp. 111–116, 2007.
[16] OpenMP Architecture Review Board, “OpenMP: The OpenMP API specification for parallel programming,” http://www.openmp.org.
[17] QuickThread Programming, LLC, “QuickThread framework,” http://www.quickthreadprogrammin.com.
[18] M. Nanjundappa, H. D. Patel, B. A. Jose, and S. K. Shukla, “SCGPSim: A Fast SystemC Simulator on GPUs,” in Proceedings of the 15th Asia and South Pacific Design Automation Conference (ASP-DAC), Taipei, Taiwan, pp. 149–154, 2010.
[19] NVidia, “CUDA technology,” http://www.nvidia.com.
[20] Y. Yi, D. Kim, and S. Ha, “Fast and Accurate Cosimulation of MPSoC Using Trace-Driven Virtual Synchronization,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 12, pp. 2186–2200, 2007.

PERFORMANCE EVALUATION OF AN AUTOMOTIVE DISTRIBUTED ARCHITECTURE BASED ON HPAV COMMUNICATION PROTOCOL USING A TRANSACTION LEVEL MODELING APPROACH

Takieddine Majdoub*, Sébastien Le Nours*, Olivier Pasquier*, Fabienne Nouvel**
*Univ Nantes, IREENA, EA 1770, Polytech-Nantes, rue C. Pauc, Nantes, F-44000 France
**INSA Rennes, IETR, UMR 6164, 20 avenue des Buttes de Coësmes, 35043 Rennes, France
{takieddine.majdoub, sebastien.le-nours, olivier.pasquier}@univ-nantes.fr, [email protected]

ABSTRACT

Due to the increasing complexity of communication infrastructures in the automotive domain, reliable models are necessary in order to assist designers in the development process of networked embedded systems. In this context, transaction level modeling, supported by languages such as SystemC, is a promising solution to assess the performance of networked architectures with a good compromise between accuracy and simulation speed. This article presents the application of a specific modeling approach for performance evaluation of a networked embedded system inspired by the automotive domain. The considered approach is illustrated by the modeling of a video transmission system made of three electronic controller units and based on a specific power line communication protocol. The created model incorporates a description of the various communication layers, and its simulation allows the timing properties and the inferred memory cost to be evaluated.

Index Terms— performance evaluation, transaction level modeling, distributed architecture

1. INTRODUCTION

In the automotive domain, an increasing number of functionalities are implemented using distributed Electronic Controller Units (ECUs) interconnected through heterogeneous communication networks. The system architecting of such networked embedded systems aims at defining the organization of the ECUs and the associated properties in terms of processing, communication, and memory resources according to the expected functional and non-functional requirements. Typical non-functional requirements under consideration in the automotive domain are timing constraints, power consumption, fault tolerance, and cost. In this context, model-based approaches have recently received wide interest in the automotive domain to face the design complexity of distributed architectures [1]. Systematic approaches for early evaluation of non-functional properties are necessary to assist system designers in the development process; their efficiency depends on simulation speed and a light modeling effort.

In current works, virtual prototypes of architectures are considered to evaluate the achieved performance through simulation and to explore the design space. Virtual prototypes are formed by models of the system applications mapped onto a model of the considered platform. The creation of such prototypes has been facilitated through the emergence of modeling approaches such as Transaction Level Modeling (TLM) [2]. Raising the level of design abstraction above the Register Transfer Level (RTL), TLM offers a good trade-off between modeling accuracy and simulation speed. This approach is currently supported by languages such as SystemC [3] to provide executable specifications of architectures. However, few examples have been addressed in the automotive domain to illustrate the benefits of TLM for performance evaluation of a distributed architecture. In this paper, a modeling approach for networked embedded systems is illustrated through a specific case study inspired by the automotive domain. The created models make it possible to evaluate the timing performance of a networked embedded system by simulation and to estimate the expected hardware and software resources. In the considered modeling approach, the created models combine a structural description of the system architecture and a behavioral description of resource usage. The considered case study is a distributed system supporting a video transmission application for the automotive domain. This system is based on a specific protocol that considers communication between ECUs through power lines. The created model is simulated in a specific framework based on the SystemC language in order to analyze the influence of system parameters on timing performance and to evaluate the resulting memory cost. The remainder of this paper is structured as follows. Section II analyzes related modeling and simulation approaches for the performance evaluation of networked embedded systems. In Section III, the proposed modeling approach is presented and the related notations are defined. In Section IV, we detail the distributed system studied and the related model created. The simulation results obtained are described in Section V. Finally, conclusions are drawn in Section VI.

2. RELATED WORK

The increasing interaction between ECUs has a deep influence on the definition of hardware and software architectures and the related design process. In this context, the AUTOSAR (AUTomotive Open System ARchitecture) consortium defines a software architecture with standardized application programming interfaces (APIs) that make applications independent of the underlying platform and allow arbitrary distribution onto different ECUs [4]. In this context, evaluation of non-functional properties early in the development process becomes mandatory in order to avoid costly design iterations. Performance evaluation of distributed architectures in the automotive domain calls for specific evaluation methods. In [5], various analysis approaches are described for timing estimation in networked architectures. Three distinguishable approaches are identified: classical real-time system theory, timed automata, and simulation. Considering simulation-based approaches, the affinities of the concepts of AUTOSAR and SystemC are discussed in [6]. It is detailed how SystemC provides a promising solution for the simulation of networked embedded systems. SystemC supports specific modeling mechanisms to include the timing behavior of the underlying platform architecture. A specific case study is illustrated through an architecture based on the FlexRay communication protocol. In [7], an illustration of how TLM techniques can be adapted to networked systems is given. Simulation results are obtained for the evaluation of the energy consumption in wireless sensor networks. In the following, the proposed modeling approach considers evaluation of the performance of a distributed architecture through SystemC simulation. Analysis is performed in order to correctly size the expected resources, taking into account the influence of communication protocols. More generally, performance evaluation of embedded systems has been approached in many ways at different levels of abstraction. A good survey of various methods, tools, and environments for early design space exploration is presented in [8]. In the following, as we do not aim at functional verification, we assume that performance evaluation can be carried out without considering a complete description of the system functionalities. This abstraction enables efficient simulation speed of virtual prototypes of architectures. Workload models are then defined to represent the computation and communication loads that applications cause on platform resources when executed. Workload models are mapped onto platform models and the resulting architectures are simulated to obtain performance data. Related works mainly differ according to the way application and platform models are created and combined. A modeling approach similar to ours is presented in [9], where functionalities allocated to resources are described as sequences of processing delays denoted as traces. Each trace represents the sequence of tasks executed on a specific resource to model the processing of a data item and the

transactions with other resources of the architecture. Depending on the allocation decision, each resource contains one or more traces that are related to different types of processing sequences. The obtained model is then simulated with respect to a given set of stimuli by triggering the traces that are required for processing particular data in the respective resources of the architecture. The design framework proposed in [10] aims at evaluating non-functional properties such as power consumption and temperature. In this approach, the description of the temporal behavior of the system is done through a model called a communication dependency graph. This model represents a probabilistic quantification of temporal aspects of computation as well as an abstract representation of the control flow of each component. This description is completed by SystemC models of non-functional properties characterizing the behavior of dynamic power management. Simulation is then performed to obtain an evaluation of the time evolution of power consumption. The approaches presented in [11] and [12] both combine UML2 descriptions of the application with SystemC platform modeling for performance evaluation. In [11], system requirements are captured as a service-oriented model using UML2 sequence diagrams. Workload models are then defined to illustrate the load an application causes on a platform when executed. These workload models do not contain timing information; it is left to the platform model to find out how long it takes to process the workloads. The approach presented in [12] puts a strong emphasis on streaming-data embedded systems. UML2 activity diagrams and class diagrams are used to capture the workload and platform models. Stereotypes of the UML2 MARTE profile [13] are used for the description of non-functional properties and the definition of the allocation. Once the allocation is defined, the SystemC description is generated automatically and simulated to obtain performance data. Our approach mainly differs from the above in the way the system architecture is modeled and the workload models are defined. Besides, existing performance modeling approaches rarely address the performance evaluation of networked embedded systems while considering the influence of communication protocols on the expected resources.

3. PROPOSED MODELING APPROACH

The modeling approach presented in this section aims at creating approximately-timed models in order to evaluate the expected resources of system architectures by simulation. As previously discussed, the model of the system architecture does not require a complete description of the system functionalities. In the considered approach, the architecture model combines a structural description of the architecture and a description of the properties relevant to the considered hardware and software resources. The utilization of resources is described as sequences of processing delays interleaved with exchanged transactions. This approach is illustrated in Figure 1.
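To make the idea of "sequences of processing delays interleaved with exchanged transactions" concrete, a minimal SystemC/TLM-2.0 sketch of one resource could look as follows; the module name, socket type, delay values and payload fields are illustrative assumptions and do not describe the actual framework used here.

    #include <systemc.h>
    #include <tlm.h>
    #include <tlm_utils/simple_initiator_socket.h>

    // One resource modelled as processing delays interleaved with transactions.
    struct EcuResource : sc_core::sc_module {
      tlm_utils::simple_initiator_socket<EcuResource> socket;

      SC_CTOR(EcuResource) : socket("socket") {
        SC_THREAD(run);
      }

      void run() {
        unsigned char frame[64] = {0};
        while (true) {
          // Processing delay of the task mapped onto this resource.
          wait(sc_core::sc_time(250, sc_core::SC_US));

          // Transaction towards the model of the communication resource.
          tlm::tlm_generic_payload trans;
          sc_core::sc_time delay = sc_core::SC_ZERO_TIME;
          trans.set_command(tlm::TLM_WRITE_COMMAND);
          trans.set_address(0x0);
          trans.set_data_ptr(frame);
          trans.set_data_length(sizeof(frame));
          socket->b_transport(trans, delay);
          wait(delay);  // account for the communication latency returned by the target
        }
      }
    };

Alternating timed waits with blocking transactions in this way is what lets the model capture resource utilization and communication cost without describing the functional behavior itself.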

Fig. 1 (model of the system architecture SA): only scattered labels of the figure survive, namely the activities A1 and A2 and the annotation “/ k:=0; s0”.

[Only fragments of the C listing of the ShowNbits function (Fig. 4 of the following paper) survive, ending with: unsigned long d; d = *(tab + pos_char) ... (32 - pos_bit - n); return (d); }]

Fig. 4. Algorithm of the ShowNbits function using a 32-bit big-endian architecture.

Although the ShowNbits function is quite simple, it requires several cycles to build the M-bit word, to shift and to mask (if necessary) the bits. Hence, a specific instruction which processes these operations allows the performance to be improved [4, 7].
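Since the listing of Fig. 4 survives only in fragments above, a generic ShowNbits can be sketched as follows. This is a byte-wise reformulation written for illustration, not the authors' exact code (which loads an aligned 32-bit big-endian word directly); the variable names follow the surviving fragments.

    // Peek the next n bits of a big-endian bitstream without consuming them
    // (valid for n up to 25, so that pos_bit + n never exceeds 32).
    static unsigned long show_n_bits(const unsigned char *tab, long position, int n)
    {
        unsigned long pos_char = position >> 3;   // byte offset of the first useful byte
        unsigned int  pos_bit  = position & 0x7;  // bit offset inside that byte

        // Gather 32 consecutive bits starting at the byte boundary.
        unsigned long d = ((unsigned long)tab[pos_char]     << 24) |
                          ((unsigned long)tab[pos_char + 1] << 16) |
                          ((unsigned long)tab[pos_char + 2] <<  8) |
                           (unsigned long)tab[pos_char + 3];

        // Drop the pos_bit leading bits and keep the n requested ones.
        return (d >> (32 - pos_bit - n)) & ((1UL << n) - 1);
    }

Building the word, then shifting and masking it in software is exactly the per-call overhead that the dedicated showbits instruction described below removes.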

The ShowNbits function can be used either directly in the C program or in an assembler part of the program. We chose to hand-write it in assembler to avoid the creation of useless register initialization operations by GCC. As shown in Fig. 5, the updated ShowNbits function has the following behavior:
1. the required variables and registers are initialized (not presented in the figure),
2. the default value of the fixed variables is loaded into registers (e.g. $7, 8, etc.) (not presented in the figure),
3. the current location in the bitstream is computed (not presented in the figure),
4. the location and the size are concatenated in a single word which is stored in the first input register (the location is 4-bit left shifted) (lines 13 to 24 and 29),
5. the required part of the stream is loaded from the data memory and stored in the second input register (line 25),
6. the showbits dedicated instruction is called (line 36),
7. finally, the result of the showbits instruction is returned (line 40).

    static __inline ulong showNbits (const unsigned char *const RESTRICT tab,
                                     const long position, const int n)
    {
        unsigned long pos_char;
        unsigned long pos_size;

        //
        // The initialization, the bitstream loading, and the
        // location and the size computing are not presented
        // in the figure.
        //

        asm volatile( "and  %0,%1"   : "=r" (pos_size) : "r" (position), "0" (pos_size) );
        asm volatile( "add  %0,%1"   : "=r" (pos_char) : "r" (tab),      "0" (pos_char) );
        asm volatile( "lsli %0,#0x8" : "=r" (pos_size) : "0" (pos_size) );
        asm volatile( "ld   %0,(%1)" : "=r" (pos_char) : "r" (pos_char) );
        asm volatile( "add  %0,%1"   : "=r" (pos_size) : "r" (n),        "0" (pos_size) );

        //
        // done ! pos_size
        //
        // [the remaining lines of the listing - the showbits dedicated instruction
        //  and the return of its result - are not legible in the extracted text]
    }

architecture with a CISC-type instruction mode, but which operates as a RISC-type processor. Precisely, we worked with the 32-bit version of the aRDAC and we used the aRDAC plug-in (GNU toolchain) for the Eclipse IDE, which uses GCC/GDB. The aRDAC is sold by the Lead Tech Design company. The aRDAC processor is composed of eight main components shown in Fig. 6. The data and program memories, the memory controller, the ALU and the various registers are usual components of processors. The IO block manages the inputs and outputs of the aRDAC and the transfer block ensures the transmission of the data between the registers and the data memory.

[Fig. 6 (block diagram of the aRDAC processor): only the block labels Program Memory, Data Memory, IO and UART survive from the figure.]

Figure 6. Pseudo code for the maximal convex subgraph enumeration algorithm. [Only fragments of the listing (lines 19–28) are legible: an invalid node u without private predecessor and private successor is selected; the algorithm then recurses with Divide() on Ga = G − (Succ(G,u) ∪ {u}), extending the guarding set with Pred(G,u), and on Gb = G − (Pred(G,u) ∪ {u}), extending the guarding set with Succ(G,u).]
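Because the listing of Figure 6 is only partially legible, the recursive division it describes can be sketched as follows. This is a rough illustration of the divide-and-conquer structure explained in the text, not the authors' implementation: the graph representation, the closure helpers and the omitted preprocessing, clustering and node-selection heuristics are all assumptions.

    #include <set>
    #include <vector>

    using NodeSet = std::set<int>;

    struct Dfg {
      NodeSet invalid;                    // nodes that cannot belong to a custom instruction
      NodeSet succ_closure(int u) const;  // all descendants of u (assumed helper)
      NodeSet pred_closure(int u) const;  // all ancestors of u (assumed helper)
    };

    static NodeSet set_minus(const NodeSet &a, const NodeSet &b) {
      NodeSet r;
      for (int v : a) if (!b.count(v)) r.insert(v);
      return r;
    }

    static bool intersects(const NodeSet &a, const NodeSet &b) {
      for (int v : a) if (b.count(v)) return true;
      return false;
    }

    // Enumerate MCS candidates contained in the node set g; 'guards' collects the
    // predecessor/successor sets a candidate must intersect to be maximal.
    void divide(const Dfg &dfg, NodeSet g, std::vector<NodeSet> guards,
                std::vector<NodeSet> &out)
    {
      int u = -1;                                   // pick an invalid node still in g
      for (int v : g) if (dfg.invalid.count(v)) { u = v; break; }

      if (u < 0) {                                  // no invalid node left: g is a candidate
        for (const NodeSet &gs : guards)
          if (!intersects(gs, g)) return;           // non-maximal: discard
        out.push_back(g);
        return;
      }

      // Branch a: drop u and its descendants, guard with the ancestors of u.
      NodeSet ga = set_minus(g, dfg.succ_closure(u)); ga.erase(u);
      std::vector<NodeSet> guards_a = guards; guards_a.push_back(dfg.pred_closure(u));
      divide(dfg, ga, guards_a, out);

      // Branch b: drop u and its ancestors, guard with the descendants of u.
      NodeSet gb = set_minus(g, dfg.pred_closure(u)); gb.erase(u);
      std::vector<NodeSet> guards_b = guards; guards_b.push_back(dfg.succ_closure(u));
      divide(dfg, gb, guards_b, out);
    }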

non-maximal MCSs, we use an efficient strategy with a worst-case computation complexity of O(n · |VI′|), where n is the number of MCS candidates and |VI′| is the number of invalid nodes in G′ (lines 12-13, Fig. 6). Due to the selection of the invalid node for the division operation, there is no MCS in Disc(Gcur, u). This implies that every MCS candidate ma in G′a contains at least one node from Pred(Gcur, u) and every MCS candidate mb in G′b contains at least one node from Succ(Gcur, u). Thus, we use two guarding sets to ensure maximality. The first set records the predecessors of u, while the second set records the successors of u (lines 21, 25, Fig. 6). When an MCS candidate is generated at a branch, it is checked for maximality. Only an MCS candidate that intersects all the subgraphs in the current guarding set can be added to the MCS set (lines 12-14, Fig. 6). Based on the division operations, we can get a tighter upper bound on the number of MCSs within a given DFG.
Theorem 3: Given a DFG G, there exist no more than 2^|VI′| MCSs, where VI′ is the set of invalid nodes after preprocessing and clustering.
Proof: As the search tree of the proposed division operations is a binary search tree over the invalid inner nodes, the depth of the binary search tree is |VI′|. Thus, the number of leaves of the binary search tree is bounded by 2^|VI′|. Each

Table I
CHARACTERISTICS OF THE BENCHMARKS

Benchmark   Domain              |V|    |VI|   |V − VI|
Blowfish    Security            635    273    362
EPIC        Security             75     38     37
SHA         Network              36     25     11
GSM         Telecommunication   947    457    490
JPEG        Consumer            349    162    187
MPEG2       Consumer            345    143    202
ADPCM       Telecommunication    17     13      4
DES3        Security            157     63     91

leaf of the binary search tree corresponds to a possible MCS. Therefore, there are no more than 2^|VI′| MCSs.

V. EXPERIMENTAL RESULTS

We have carried out extensive experiments to evaluate the performance of the maximal convex subgraph enumeration algorithm we propose. The experiments were carried out on a PC with a P9400 processor running at 2.4 GHz. In order to evaluate the performance of our algorithm, we obtained the DFGs from the benchmarks in MediaBench [20] and MiBench [21]. These benchmarks were compiled and simulated using the generic compilation platform GECOS [22]. Table I describes the DFGs used in our experiments. In the experiments on enumerating all maximal convex subgraphs, we chose one computation-intensive basic block's DFG from each benchmark. The number of nodes of each DFG is presented in the column |V|. The numbers of invalid nodes and valid nodes are shown in the columns |VI| and |V − VI|. In the experiments, we have compared our maximal convex subgraph enumeration algorithm (denoted as a) with the latest algorithm proposed in [18] (denoted as b). As algorithm b is faster than all previous algorithms, we only compare our algorithm with algorithm b. Note that, for a fair comparison, all the related algorithms are implemented in the same development environment and with the same data structure representing the patterns. Table II shows the performance of the algorithms in enumerating all maximal convex subgraphs. For the different benchmarks, the two algorithms produce the same MCSs. The numbers of invalid nodes and valid nodes in the compacted graph G′ that is generated by the preprocessing step and the clustering step are shown in the columns |VI′| and |V′ − VI′| respectively. In this table, the number of identified MCSs is recorded in the column Number of MCSs. The columns runtime a and runtime b report the run time of the two algorithms. The runtime unit in the following experiments is the millisecond. The column speed-up indicates the speed-up achieved by our algorithm over b. According to the experimental results, our algorithm has a significantly better performance in terms of runtime. Based on the experiments, we can observe that for the benchmarks with less than 100 invalid nodes, our algorithm achieves speedups ranging from 5.5 to 47.1. For the benchmarks

Table II
COMPARISON OF MCS ENUMERATION ALGORITHMS

Benchmark   |VI′|   |V′ − VI′|   Number of MCSs   runtime a   runtime b   speed-up
Blowfish      49        194                 50        0.52        19.6       37.7
EPIC          14         31                900        23.7        1116       47.1
SHA            0          1                  1        0.04        0.65       16.3
GSM            0          1                  1        0.09        10.3        114
JPEG           8         26                256         3.1        45.9       14.8
MPEG2         26         74           16785409      113723           –          –
ADPCM          0          1                  1        1.04        5.71        5.5
DES3          16         62                511        7.03        84.9       12.1

with more than 100 invalid nodes, the speedup achieved by our algorithm is more significant, ranging from 14.8 to 114 times over algorithm b. For the benchmark MPEG2, our algorithm takes 114 seconds to produce more than 16 million MCSs, while algorithm b fails to produce a result. The reduction of the runtime can mainly be attributed to four factors: 1) a considerable number of invalid nodes are removed by the preprocessing step and the invalid node clustering step, so that the binary search tree is reduced dramatically; for example, with the benchmarks Blowfish and GSM, the number of invalid nodes is reduced to 49 and 0 respectively after the preprocessing and the invalid node clustering; 2) the valid node clustering step reduces the size of the DFG, which contributes to a reduction of the time spent in computations such as the calculations in formula (1); as an example, the number of valid nodes is reduced from 187 to 26 for the benchmark JPEG using valid node clustering; 3) the selection of invalid nodes for the division operation also affects the binary search tree: selecting invalid nodes with private predecessor or private successor avoids the generation of redundant MCSs; and 4) an efficient strategy with a worst-case computation complexity of O(n · |VI′|) is applied to filter away non-maximal MCS candidates.

VI. CONCLUSION

In this paper, we have proposed an efficient algorithm for solving the maximal convex subgraph enumeration problem. Our algorithm generates all maximal convex subgraphs in a sandwich manner, which takes advantage of both the bottom-up technique and the top-down technique by combining them. Compared with the latest algorithm, our approach can achieve orders of magnitude speedup while generating the identical set of all maximal convex subgraphs. In future work, we will deal with the custom instruction selection problem.

REFERENCES

[1] T. R. Halfhill. Tensilica's software makes hardware. Microprocess. Rep., Jun. 23, 2003.
[2] –, ARC Cores encourages 'plug-ins'. Microprocess. Rep., Jun. 19, 2000.

[3] P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood, Lx:A technology platform for customizable VLIW embedded processing. in Proc. 27th Annu. Int. Symp. Computer Architecture, Vancouver, BC,Canada, Jun. 2000, pp. 203213. [4] –, Alteras new CPU for FPGAs. Microprocess. Rep., Jun. 28, 2004. [5] N. Clark, A. Hormati, S. Mahlke, and S. Yehia. Scalable subgraph mapping for acyclic computation accelerators. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, Seoul, South Korea, 2006. [6] P. Bonzini and L. Pozzi, Polynomial-time subgraph enumeration for automated instruction set extension, in DATE ’07, pp. 1331-1336, 2007. [7] M. Arnold, H. Corporaal. Designing domain-specific processors, in: Proceedings of CODES ’01 - the Ninth International Symposium on Hardware/Software Codesign, 2001, pp. 61 -66. [8] K. Atasu, L. Pozzi, and P. Ienne. Automatic applicationspecific instruction-set extensions under microarchitectural constraints. in Proc. 40th Des. Autom. Conf., Jun. 2003, pp. 256-261. [9] L. Pozzi, K. Atasu, and P. Ienne. Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 7, pp. 1209-1229, Jul. 2006. [10] P. Yu and T. Mitra. Scalable custom instructions identification for instruction-set extensible processors, in Proc. Int. Conf. Compilers, Architectures, and Synth. Embed. Syst., Sep. 2004, pp. 69-78. [11] X. Chen, D.L. Maskell, and Y. Sun. Fast identification of custom instructions for extensible processors. IEEE Trans. Computer-Aided Design Integr. Circuits Syst. 26 (2007), 359368. [12] L. Pozzi and P. Ienne. Exploiting pipelining to relax registerfile port constraints of instruction-set extensions. In CASES ’05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems,

pages 2-10, NewYork, NY, USA, 2005. ACM. [13] N. Pothineni, A. Kumar, and K. Paul. Application specific datapath extension with distributed I/O functional units. In VLSID ’07: Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference, pages 551-558, Washington, DC, USA, 2007. IEEE Computer Society. [14] Pan Yu, Tulika Mitra. Disjoint Pattern Enumeration for Custom Instructions Identification. In: FPL, 2007. [15] A. K. Verma, P. Brisk, and P. Ienne. Rethinking custom ISE identification: a new processor-agnostic method. In CASES ’07: Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, pages 125-134, New York, NY, USA, 2007. ACM. [16] R. Razdan and M. D. Smith. A high-performance microarchitecture with hardware-programmable functional units. In Proc. 27th Annu. Int. Symp.Microarchit., Nov. 1994, pp. 172-180. [17] K. Atasu, O. Mencer, W. Luk, C. Ozturan, and G. Dundar. Fast custom instruction identification by convex subgraph enumeration. In ASAP ’08: Proceedings of the 2008 International Conference on Application-Specific Systems, Architectures and Processors, pages 1-6, Washington, DC, USA, 2008. IEEE Computer Society. [18] T.Li, Z.Sun, W.Jigang, and X.Lu, Fast enumeration of maximal valid subgraphs for custom-instruction identification, in Proc.CASES, 2009, pp.29-36. [19] J. Cong et al. Application-specific instruction generation for configurable processor architectures. In FPGA, 2004. [20] C. Lee, M. Potkonjak, and W. H. Mangione-smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture, pages 330-335, 1997. [21] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M.Austin, T. Mudge, and R. B. Brown. Mibench: A free,commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, 2001, pp.3-14 [22] GeCoS: Generic compiler suite - http://gecos.gforge.inria.fr/

EFFICIENT FFT PRUNING ALGORITHM FOR NON-CONTIGUOUS OFDM SYSTEMS

Roberto Airoldi, Fabio Garzia and Jari Nurmi
Tampere University of Technology
Department of Computer Systems
P.O. Box 553, FIN-33101, Tampere, Finland
[email protected]

ABSTRACT

This paper presents the study of an efficient trade-off between memory requirements and performance for the implementation of the FFT pruning algorithm. FFT pruning is utilized in NC-OFDM systems to reduce the FFT algorithm complexity in the presence of subcarrier sparseness. State-of-the-art implementations offer good performance with the drawback of high resource utilization, i.e. data memory for the storage of the configuration matrix. In this work we introduce the partial pruning algorithm as an efficient way to implement FFT pruning, obtaining a balanced trade-off between performance and resource allocation. Cycle-accurate simulation results show that even in the presence of low-to-medium input sparseness levels the proposed algorithm can reduce the computation time by at least 20% compared to traditional FFT algorithms and, at the same time, decreases the memory utilization by up to 20% compared to state-of-the-art pruning algorithms.

Index Terms— FFT, FFT pruning, NC-OFDM, Cognitive radios.

1. INTRODUCTION

Over the past two decades wireless communication systems have been continuously evolving to offer users higher bandwidth and new services over the air. Many different communication standards have been developed; therefore today's mobile devices are required to operate over different networks, according to users' requirements and network availability. To enable such mobility between networks, devices need to be as flexible as possible. Hence, the concept of software defined radio (SDR) was introduced. An SDR device is able to re-program its functionality and hence to change its communication parameters, allowing the device to operate on different networks. The increasing demand for high bandwidth and services has led to a point where the available spectrum for standard communication systems is going to be saturated soon. However, a recent study of the American Federal Communication Commission [1] showed that the perceived limited availability

of spectrum is related to an inefficient utilization. Indeed, portions of spectrum are often freely available through time and space but inaccessible to standard communication systems. To overcome this barrier, new radio concepts have been deployed. Smart radio, or cognitive radio (CR) [2], systems are able to sense the environment and adapt their communication parameters according to the run-time condition of the available spectrum, users' requirements, and quality of communication. Dynamic Spectrum Access (DSA) techniques [3] allow a CR terminal to scan the spectrum at runtime and dynamically allocate the communication over the available bandwidth, avoiding interference with other users. Non-Contiguous OFDM (NC-OFDM) [4] and Discontiguous OFDM (D-OFDM) [5] are examples of proposed DSA-enabled communication techniques. NC-OFDM, like the more generic OFDM technique, bases its (de)modulation on FFT kernels. Each subcarrier of the communication system is represented by one of the FFT inputs. In the specific case of NC-OFDM, the system can dynamically prune any arbitrary subset of subcarriers according to communication parameters such as the quality of the channel, the available bandwidth, the Bit Error Rate (BER) and interference with other communication systems. Therefore it is possible to introduce simplifications in some of the building blocks of the NC-OFDM transceiver. In particular, simplifications can be introduced in the (de)modulation block. FFT pruning can be applied to dynamically reduce the computational load of such blocks when a subset of subcarriers is set to the zero value. The FFT pruning algorithm is capable of eliminating redundant operations from the FFT structure when a subset of the input is zero-valued, reducing the final algorithm complexity [6]. Different mathematical solutions have been proposed for FFT pruning; however, only few such algorithms have been implemented on embedded systems. Many of the proposed algorithms are quite inefficient when mapped on embedded systems, either from a performance or from a resource-utilization point of view. A trade-off is needed to balance performance and resource utilization. With state-of-the-art algorithms, it is possible to obtain high performance with the drawback of high memory requirements

or, in the opposite case, lower memory requirements and limited performance. No optimal trade-off has been found yet. In this work we propose a partial pruning algorithm as an optimal trade-off between performance and memory requirements, allowing an efficient implementation of the FFT pruning algorithm for CR systems on DSP cores. The rest of the paper is organized as follows: Section 2 gives a brief overview of FFT pruning theory and analyzes state-of-the-art algorithms; Section 3 introduces the partial pruning algorithm; Section 4 discusses simulation results, considering different levels of partial pruning; finally, Section 5 presents concluding remarks.

2. FFT PRUNING

A Radix-2 N-point FFT [7] can be interpreted as a succession of n = log2(N) butterfly planes where each plane requires N/2 butterfly operations. Each butterfly takes two inputs and produces two outputs. In the case of complex inputs, each butterfly operation requires six real additions and four real multiplications. Hence, for an N-point FFT the algorithm complexity is O(N · log(N)). However, the algorithm complexity can be reduced if, for some reason, some of the inputs are fixed to zero. In such a scenario it is possible to remove dummy operations like multiplications and/or additions by zero. To better highlight the simplification introduced by FFT pruning, Figure 1 presents the structure of a 16-point FFT. The figure also reports an example input data distribution, where a_i represents the i-th input signal to the FFT block. In the example some of the inputs are set to zero, allowing a partial or complete pruning of some butterflies. Analyzing the propagation of zero-valued inputs in the FFT structure, it is possible to notice that the number of FFT planes affected by the pruning is proportional to log2(M), where M is the size of the largest set of contiguous zero-valued inputs. The larger the set of contiguous zero-valued inputs, the deeper the pruning penetrates into the FFT structure. From a mathematical point of view it is possible to re-evaluate the algorithm complexity. Given a subset of L non-zero input values, the reduction in complexity varies according to the distribution of the non-zero inputs. For instance, it was shown in [8] that the total numbers of multiplications and additions for executing a pruned FFT can be reduced to:

    MUL_pruned = 2·N·log2(L) + 2·N − 4·L + (2·N·L) / 2^(log2(L))          (1)

and

    ADD_pruned = 3·N·log2(L) + 3·N − 6·L + (3·N·L) / 2^(log2(L))          (2)

Furthermore, the total number of multiplications and additions can be further reduced if the non-zero inputs follow a

Fig. 1. 16-point FFT structure and example of input pruning.

certain distribution. For example, (1) and (2) can be reduced to:

    MUL_pruned = 2 · N · log2(L)          (3)

and

    ADD_pruned = 3 · N · log2(L)          (4)

when the non-zero inputs are contiguous and L is a power of two [9]. The distribution of the zeros at the input of a butterfly defines four types of butterfly operations. If both inputs are zero, the output will be zero, hence no operation is performed. In the opposite case, no zero-valued inputs, the operation performed is a traditional butterfly. In the other two cases, where one of the two inputs is zero-valued, the output depends on the position of the zero-valued input. Therefore, when dealing with the actual implementation of the pruning algorithms, an efficient solution for the selection of which type of butterfly to perform has to be found. The introduction of if-then-else statements [10] within the execution part of the algorithm results in a substantial increase in control overhead that ultimately dilutes the advantages of having a reduced algorithm complexity. Therefore, techniques based on a configuration matrix have been introduced in [4]. The configuration matrix stores a footprint of the non-zero inputs for each FFT plane, enabling the algorithm to compute only significant butterfly

operations. In fact, the elimination of conditional executions leads to better performance from a computational point of view. On the other hand, the configuration matrix size is N × (log2(N) + 1), which might require high data storage capabilities. In particular, for large N (e.g. 2048), the algorithm data memory requirements might be too demanding for embedded system applications.
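As a rough illustration of both effects: for a 2048-point FFT (log2(2048) = 11 butterfly planes) with L = 256 contiguous non-zero inputs, (3) gives about 2 · 2048 · log2(256) = 32 768 real multiplications instead of the 2 · 2048 · 11 = 45 056 of the full radix-2 FFT, while a full configuration matrix of size 2048 × (11 + 1) holds 24 576 entries — already larger than a 16 KB data scratchpad if each entry occupies at least one byte (the entry width is not specified here and is only assumed for the comparison).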


3. FFT PARTIAL PRUNING ALGORITHM

The algorithm proposed in this work aims to obtain high performance in terms of computation time while containing the amount of data memory used for the storage of the configuration matrix. Rajbanshi's [4] configuration matrix can be reduced in size by introducing further constraints. In fact, the FFT structure is very regular: when considering a particular butterfly of the FFT structure, its inputs are known. Therefore, by introducing a butterfly identifier, which defines the actual butterfly operation to be performed, it is possible to halve the size of the configuration matrix, since each butterfly identifier refers to a fixed pair of inputs. The new configuration matrix can then be downsized to N/2 × (log2(N) + 1). For the given example of Fig. 1 the configuration matrix proposed by Rajbanshi et al. [4] is

     5    6   12   16
     6    6    4    0
     7    7    5    1
    10   10    6    2
    11   11    7    3
    13   12    8    4
     0   13    9    5
     0    0   10    6
     0    0   11    7
     0    0   12    8
     0    0   13    9
     0    0   14   10
     0    0   15   11
     0    0    0   12
     0    0    0   13
     0    0    0   14
     0    0    0   15

The first row of the matrix indicates how many inputs are non-zero for each FFT plane (the columns of the matrix). The following rows index the actual non-zero inputs. Introducing the butterfly identifier, the configuration matrix can be rewritten as:

     3    6    8    8
     3    2    0    0
     5    3    1    1
     6    4    2    2
     0    5    3    3
     0    6    4    4
     0    7    5    5
     0    0    6    6
     0    0    7    7

In this case, the first row reports how many butterfly operations have to be performed for each FFT plane. The following rows contain the actual butterfly identifiers. Therefore, we were able to reduce the configuration matrix size by a factor of 2. Furthermore, we observed that the input pruning might not affect all the FFT planes: its penetration inside the FFT structure is proportional to the logarithm of the number of contiguous zero-valued inputs. In fact, only high input sparseness would introduce simplifications into the final FFT planes. Hence, we decided to implement a partial pruning of the FFT which affects only a certain number of FFT planes, leading to a reduced configuration matrix size and, as a consequence, to lighter memory requirements. Analyzing the reduced configuration matrix, it is possible to notice that for this particular pruning scenario the last two FFT planes (last two columns) do not contain any pruning and hence no useful information. They can be omitted by introducing the partial pruning. Finally, the configuration matrix for the partial pruning can be reformulated as

     3    6
     3    2
     5    3
     6    4
     0    5
     0    6
     0    7
     0    0
     0    0

The choice of how many FFT planes should support pruning depends on the trade-off that we want to achieve between performance and memory storage requirements. A small configuration matrix relaxes the memory requirements but might affect the overall performance in the presence of high input sparseness. On the other hand, a full configuration matrix takes better advantage of the pruning but at the price of higher memory costs.
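A minimal sketch of the resulting schedule is given below; the configuration matrix layout (first row = butterfly count per plane, following rows = butterfly identifiers) follows the description above, while the butterfly routine, the data layout and the column-major storage of the matrix are illustrative assumptions.

    #include <complex>

    // Assumed helper (not shown here): executes one radix-2 butterfly of plane
    // 'plane' identified by 'id' on the data vector x of size N.
    void butterfly(std::complex<float> *x, int N, int plane, int id);

    static int ilog2(int n) { int k = 0; while (n > 1) { n >>= 1; ++k; } return k; }

    // Partial pruning: only the first 'pruned_planes' planes use the
    // configuration matrix, the remaining planes run the plain radix-2 schedule.
    void fft_partial_pruned(std::complex<float> *x, int N,
                            const int *cfg,          // (N/2 + 1) rows per plane, column-major
                            int pruned_planes)
    {
        const int planes = ilog2(N);
        const int rows   = N / 2 + 1;

        for (int p = 0; p < planes; ++p) {
            if (p < pruned_planes) {
                const int *col = cfg + p * rows;     // column p of the configuration matrix
                for (int k = 1; k <= col[0]; ++k)    // col[0] = number of useful butterflies
                    butterfly(x, N, p, col[k]);      // col[k] = butterfly identifier
            } else {
                for (int b = 0; b < N / 2; ++b)      // no pruning information kept
                    butterfly(x, N, p, b);
            }
        }
    }

With pruned_planes equal to the total number of planes this degenerates to the full pruning algorithm, and with pruned_planes = 0 to the plain radix-2 FFT, which is exactly the trade-off explored in the next section.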

Fig. 2. Normalized memory requirements for the tested pruning algorithms.

4. RESULTS ANALYSIS

To evaluate the performance of the proposed algorithm we utilized an open-source RISC processor: the selected processor is the COFFEE RISC core [11], developed at Tampere University of Technology; its HDL model is freely available on the COFFEE project webpage [12]. The simulation platform was composed of a COFFEE RISC core and data and instruction scratchpads (16 KB each). The core is cacheless and interconnected to the memories via a non-blocking switched architecture [13]. To evaluate the algorithms' performance in terms of computation time and memory utilization, different input sparseness levels and different levels of partial pruning were considered. In particular we have analyzed 4 different levels of partial pruning: full pruning (as a reference), 2/3 pruning, 1/2 pruning and 1/3 pruning. Considering a 2048-point FFT, the total number of FFT planes is 12. Hence the numbers of planes supporting pruning for the tested algorithms are respectively 12, 9, 6 and 4. As a comparison baseline we also implemented a radix-2 FFT algorithm without pruning, as well as the algorithm presented in [4]. Figure 2 reports the normalized data memory requirement for each of the partial prunings tested and for the reference algorithms. We normalized the memory requirements with respect to the radix-2 algorithm, which does not require any configuration matrix and hence is the algorithm with the lowest memory requirements. From the figure it is possible to see

that the memory utilization has a linear trend, growing with the number of FFT planes supporting the pruning. From the figure it is also possible to notice that, thanks to the introduction of the butterfly identifiers, the full pruning algorithm already improves the memory requirements when compared to the algorithm proposed in [4]. Figure 3 shows the computation time, for each algorithm tested, as a function of the input sparseness. The computation times were again normalized with respect to the radix-2 algorithm. Therefore it is possible to see how much the partial pruning affects the performance. To consider the worst-case scenario for the partial pruning, we assumed that all the pruned inputs form a contiguous subset. In this case the full pruning scenario would be the most effective, since the pruning penetrates into the FFT structure as deeply as possible. From the performance results it is possible to notice that at low pruning levels the partial pruning profiles tested offer the same performance as the full pruning, while only for very high pruning levels does the partial pruning offer significantly lower performance than the full pruning algorithm. However, when compared to the radix-2 implementation, the partial pruning algorithm still shows good performance. Comparing the performance of the full and partial pruning introduced in this work with the implementation of Rajbanshi et al. [4], it is possible to notice that for lower pruning sparseness both full and partial pruning offer better computation time performance. From the memory utiliza-

Fig. 3. Normalized computation time for the tested pruning algorithms.

tion point of view, the saving introduced by the partial pruning is significant when compared to [4]. However, for relatively high sparseness factors the solution introduced by Rajbanshi et al. offers better performance than the partial pruning configurations tested. In the presence of low-to-medium sparseness levels the partial pruning algorithm, in its tested configurations, offers performance in the same range as the full pruning algorithm, with the gain of a reduced utilization of memory resources (up to 20%). These results justify the utilization of partial pruning as an optimal trade-off between performance and resource allocation. On the other hand, for high levels of sparseness the partial pruning introduces limitations on the algorithm performance: if the pruning penetrates very deep into the FFT structure, the partial pruning might not be able to take full advantage of the simplifications introduced by the pruning. It is also true that in real-case scenarios it is unlikely that the inputs present a wide contiguous subset of zeros; a scattered distribution of the input sparseness is more likely. In such conditions, the partial pruning performance will not be far from the full pruning, since the simplifications introduced by the pruning affect only the initial FFT planes.

5. CONCLUSIONS

We presented the partial pruning algorithm as an efficient implementation of the FFT pruning algorithm on embedded systems, which are generally characterized by a limited amount of memory. The partial pruning allows the pruning of the input only down to a certain depth in the FFT structure, reducing the memory requirements for the storage of the configuration matrix. We investigated the partial pruning performance, in terms of computation time and memory requirements, for different levels of partial pruning. In particular, 1/3, 1/2 and 2/3 partial pruning were considered. The partial pruning was then compared to the full pruning algorithm, the algorithm proposed in [4] and a radix-2 algorithm to evaluate the improvement in overall performance. On the basis of the obtained results it is possible to define which level of pruning best suits a given application domain. For example, in scenarios with a medium-low input sparseness factor, 1/3 and 1/2 partial pruning would still deliver performance similar to the full pruning algorithm but with reduced memory requirements (about 20%), making the algorithm more friendly towards embedded system applications characterized by limited memory resources.

Acknowledgements

The authors would like to thank the GETA doctoral program and the Nokia Foundation for the financial support. The research leading to these results has also been partially funded by the SYSMODEL project (http://www.sysmodel.eu).

6. REFERENCES [1] Federal Communication Commission, “Spectrum policy task force report,” Tech. Rep., ET Docket No. 02135, 2002. [2] III Mitola, J. and Jr. Maguire, G. Q., “Cognitive radio: making software radios more personal,” IEEE Personal Communications, vol. 6, no. 4, pp. 13–18, 1999. [3] Qing Zhao and B. M. Sadler, “A Survey of Dynamic Spectrum Access,” IEEE Signal Processing magazine, vol. 24, no. 3, pp. 79–89, 2007. [4] Rakesh Rajbanshi, Alexander M. Wyglinski, and Gary J. Minden, “An Efficient Implementation of NC-OFDM Transceivers for Cognitive Radios,” in Proc. 1st Int Cognitive Radio Oriented Wireless Networks and Communications Conf, 2006, pp. 1–5. [5] J. D. Poston and W. D. Horne, “Discontiguous OFDM Considerations for Dynamic Spectrum Access in Idle TV Channels,” in Proc. IEEE Int. Symp. New Frontiers Dynamic Spectr. Access Networks, Baltimore, MD, Nov. 2005, vol. 1, pp. 607–610. [6] J. Markel, “FFT pruning,” IEEE Trans. Audio Electroacoust., vol. 19, no. 4, pp. 305–311, 1971. [7] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Mathematics of Computation, vol. 19, pp. 297–301, 1965. [8] H. V. Sorensen and C. S. Burrus, “Efficient computation of the DFT with only a subset of input or output points,” IEEE Trans. Signal Processing, vol. 41, no. 3, pp. 1184– 1200, 1993. [9] D. Skinner, “Pruning the decimation in-time FFT algorithm,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, no. 2, pp. 193–194, 1976. [10] R. G. Alves, P. L. Osorio, and M. N. S. Swamy, “General FFT pruning algorithm,” in Proc. 43rd IEEE Midwest Symp. Circuits and Systems, 2000, vol. 3, pp. 1192– 1195. [11] J. Kylliainen, J. Nurmi, and M. Kuulusa, “Coffee - a core for free,” in Proc. International Symposium on System-on-Chip, 19–21 Nov. 2003, pp. 17–22. [12] COFEE RISC CORE web page: http://coffee.tut.fi/. [13] T. Ahonen and J. Nurmi, “Programmable Switch for Shared Bus Replacement,” in Proc. PhD Research in Microelectronics and Electronics 2006, 2006, pp. 241– 244.

Designing Processors Using MAsS, a Modular and Lightweight Instruction-level Exploration Tool

Matthieu Texier, Erwan Piriou, Mathieu Thevenin (IEEE Member) and Raphaël David
CEA, LIST, Embedded Computing Laboratory – PC94
F91191 Gif sur Yvette – France
[email protected]

Abstract— As application complexity increases, the design of efficient computing architectures able to cope with embedded constraints requires a fine algorithm analysis. This paper proposes an original approach based on the Modular Assembly Simulator (MAsS) tool that allows Design Space Exploration (DSE) for programmable processors. The originality of the method resides in its capacity to generate operator-level simulators allowing quick code analysis on real data sets. This paper also presents two architectures successfully designed using MAsS.

I. INTRODUCTION

The constant growth of application complexity and computational needs implies high-performance computing requirements. A good application mapping over control and processing elements makes it possible to maintain this performance level under embedded constraints such as program execution speed, electrical power consumption and silicon area. Previous studies [1], [2], [3] emphasized Hardware/Software Co-design, but the task partitioning remains a tedious task and most local optimizations are also difficult to quantify without the use of specific tools [4]. Moreover, resources can be specialized in order to make the system more efficient for specific applications, for instance by adding specific operators [5]. This approach is especially relevant for programmable architectures and Application Specific Instruction Set Processor (ASIP) design. This paper describes the MAsS tool, which helps in designing programmable cores through quick performance estimation, easily performed by an iterative refinement of the architectural constraints. This paper is organized as follows: Section II presents related work and similar approaches from academic and industrial research; then our approach and the MAsS tool principles are explained; the third section illustrates the way to use it through specific design cases and gives the obtained results; finally, the paper concludes on the originality of the approach and future work.

II. RELATED WORK

To help the design of processor architectures, several methodologies and approaches have been presented in the literature. DSE is the usual first step to estimate the impact of the application mapping on several embedded architectures and vice versa. A relevant partitioning between hardware and software resources is always a crucial task. First, the High Level Synthesis (HLS) [6], [7] approach proposes to explore the design space to extract efficient implementations from a high-level algorithm description. Several tools like Catapult [8] make it possible to get a sketch of a design from high-level representations such as C or C++. Nevertheless, some basic hardware descriptions are required by HLS tools in order to obtain a relevant mapping of the application on an architecture [9]. Thus, it becomes difficult to accurately characterize the potential gain due to processor customization. Other high-level languages with stricter semantics make it possible to consider data-flow applications through graphs of operators [10] and loop transformations. The polyhedral model is also used to describe different classes of computation and to extract the parallelism of an application at different levels [11]. In both cases, support for a compiler back-end is not considered. The main task of a compiler is to translate the source code of an application described in a high-level description language such as C into a binary format with respect to the Instruction Set Architecture (ISA); the processor ISA definition can be modified in order to optimize the processor for a specific application domain. Some tools like Trimaran [12] propose to design the compiler concurrently with the ISA and also exploit the Instruction Level Parallelism (ILP) of the application. Such tools support several features like custom instructions, clustering, register files, etc. They consider that the global architecture design is partially constrained by a design library, compiler and initial templates which may

limit the design innovation. Finally, in recent years, many software suites have been proposed for ASIP generation and application mapping. They consist in the extraction and the acceleration of critical portions of code by hardware. This is usually done by ISA specialization, and highly detailed inputs may be required [13]. The main inputs are the algorithm and the hardware description, which takes into account resources (register files, memory, etc.) and a precise pipeline description. This implies that the user needs some a priori knowledge about the target architecture. The mentioned approaches such as HLS, retargetable compilers and ASIP tool suites remain efficient for traditional architectures and rapid prototyping. MAsS proposes an easy way to explore a large spectrum of implementations. This approach is quick, modular and lightweight. It makes it possible to explore the design space and estimate the performance of domain-specific programmable processors with real data sets.


III. MASS PRINCIPLES

The aim of the MAsS tool is to provide a flexible way to design a programmable solution in order to support specific kernels of a defined application set. MAsS gives statistics and operator-level profiling about the execution of a program on a real data set and thus helps the architecture DSE. Usually an application is written in a high-level language such as C. A preliminary task consists in profiling the application in order to extract the main computing parts (kernels). The kernels are the main input of our tool (Figure 1). Figure 2 presents an example code of a convolution filter that is used for didactic purposes. In brief, this is a multiplication of a set of pixels P[i][j] with a set of fixed coefficients C[i][j] for a specific sliding window. An initial transformation translates the C kernel code into a GIMPLE-like Intermediate Representation (IR) which can be easily extracted from the initial code (step 1 in Figure 1). Then the code is optimized – loop unrolling, constant propagation, etc. – and described as a set of basic unary or binary operators. This IR is used in order to generate simulators which allow the execution of the code using a real data set (step 2). This is done through the generation of an annotated C code that replaces the original kernel code (step 3). The generation is guided by a configuration file which defines the processor model through parameters like the ISA definition, the way parallelism, the channel assignment of the operations, and the number of functional units and ways.
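Purely as an illustration of what such an operator-level, back-annotated kernel could look like — the macro names and the per-operator cycle costs below are assumptions, not the actual code emitted by MAsS — each basic operator of the IR can be replaced by a counting wrapper:

    static long sim_cycles = 0;            // accumulated cycle estimate
    static long sim_muls = 0, sim_adds = 0;

    #define SIM_MUL(dst, a, b) do { (dst) = (a) * (b); sim_cycles += 2; ++sim_muls; } while (0)
    #define SIM_ADD(dst, a, b) do { (dst) = (a) + (b); sim_cycles += 1; ++sim_adds; } while (0)

    // An IR statement such as  t = P[i][j] * C[i][j]; acc = acc + t;
    // would then be emitted as:
    //   SIM_MUL(t, P[i][j], C[i][j]);
    //   SIM_ADD(acc, acc, t);

Running the instrumented kernel on a real data set then yields the per-operator statistics used to refine the processor model.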

Fig. 1. Illustration of the MAsS flow. [Only the block labels survive from the figure: Original C Code, Extracted Kernel, IR extraction, IR, MAsS, MAsS Config File, Back-annotated C kernel, Replace, Modified C code, Host Compiler, Instruction-level Simulator, Data Set, Simulation results, Results analysis; the numbered arrows 1–6 correspond to the steps referenced in the text.]

    int conv2D (int P[3][3], int C[3][3]) {
        int sum = 33; int i = 0; int j = 0; int acc = 0;
        for (i = 0; i < ...
    [the remainder of the Fig. 2 listing is not legible in the extracted text]
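Since the Fig. 2 listing is cut short above, a hedged completion is given here: a straightforward 3x3 convolution kernel matching the description in the text (multiply the pixel window P by the fixed coefficients C and accumulate); the loop bounds and the absence of any final normalization are assumptions.

    int conv2D_sketch(int P[3][3], int C[3][3])
    {
        int acc = 0;
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                acc += P[i][j] * C[i][j];   // multiply-accumulate over the sliding window
        return acc;
    }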

The classification relies on a soft-margin SVM, trained by solving:

    min over (w, b, ξ)   (1/2) · wᵀw + C · Σ_{i=1..l} ξᵢ                    (1)

    Subject to   yᵢ (wᵀ φ(xᵢ) + b) ≥ 1 − ξᵢ ,   ξᵢ ≥ 0

Where, training data are mapped to a higher dimensional space by the function φ, C is a penalty parameter, and slack variables ξᵢ allow for violations of the constraint. For any testing instance x, the decision function (predictor) is [13]:

    f(x) = sgn( wᵀ φ(x) + b ) ,   with   ⟨w, x⟩ + b = Σ_{i=1..l} yᵢ αᵢ ⟨xᵢ, x⟩ + b

Three kernel functions are considered: linear, 2nd order polynomial and Radial Basis Function (RBF). The classifier is evaluated through its sensitivity and specificity:

    Se = TP / (TP + FN)

- Specificity (Sp) of a test is the probability of having a benign polyp correctly identified (exp. 2). A highly specific test will generate few false positives.

    Sp = TN / (TN + FP)
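As a concrete illustration of the decision function above, a minimal evaluation routine for a kernel SVM could look as follows; the RBF kernel, the gamma parameter and the data layout are assumptions made for the example (the trained parameters would come from the classifier described in the hardware implementation section).

    #include <cmath>
    #include <vector>

    struct SvmModel {
        std::vector<std::vector<double>> sv;   // support vectors x_i
        std::vector<double> alpha_y;           // y_i * alpha_i for each support vector
        double b;                              // bias
        double gamma;                          // RBF parameter
    };

    static double rbf(const std::vector<double> &a, const std::vector<double> &x, double gamma)
    {
        double d2 = 0.0;
        for (std::size_t k = 0; k < a.size(); ++k)
            d2 += (a[k] - x[k]) * (a[k] - x[k]);
        return std::exp(-gamma * d2);
    }

    // f(x) = sgn( sum_i y_i * alpha_i * K(x_i, x) + b )
    int svm_predict(const SvmModel &m, const std::vector<double> &x)
    {
        double s = m.b;
        for (std::size_t i = 0; i < m.sv.size(); ++i)
            s += m.alpha_y[i] * rbf(m.sv[i], x, m.gamma);
        return s >= 0.0 ? +1 : -1;
    }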

[Only fragments of Table II survive. From the surviving labels and the discussion below: sub-set C groups the five statistical descriptors (Average, Variance, Skewness, Kurtosis, SNR); sub-set B adds the Area descriptor (6 descriptors); sub-set A further adds the two parameters related to thickness and heterogeneity (8 descriptors); sub-set D keeps 3 of the statistical descriptors.]

TABLE II: Descriptors sub-sets used to evaluate the predictive accuracy of the classifier.

VII. EXPERIMENTAL RESULTS

The evaluation results of the classifier are presented in Table III. For each combination (data set / kernel function), we calculate the success rate, the sensitivity and the specificity. By analyzing the results of Table III, and as shown in Figure 5, the vectors (B) and (C) provide the best recognition rates (respectively 96.4% and 95.5%, both with the RBF function). It should be noted that by adding the parameter related to the area occupied by the object to the vector (C), the recognition rate increases by about 0.9%. But the addition of the two parameters related to its thickness and heterogeneity decreases the recognition rate by about 5% in (A). We conclude, then, that these two parameters are unnecessary to distinguish between the different objects. Note also that the results obtained from vectors (B) and (C) are almost identical. We find that the combination of each one

Sub-set

Kernel function

% classification

Sp

Se

A

Linear 2nd order polynomial RBF

88.29 87.39 90.09

0.93 0.89 0.93

0.8 0.85 0.85

B

Linear 2nd order polynomial RBF

89.19 92.79 96.40

0.93 0.96 0.97

0.82 0.87 0.95

C

Linear 2nd order polynomial RBF

88.29 91.89 95.50

0.89 0.93 0.94

0.87 0.9 0.97

D

Linear 2nd order polynomial RBF

60.36 63.96 67.57

0.65 0.68 0.67

0.52 0.57 0.70

TABLE III: Overall performance evaluation of classifier on 4 sub-sets and 3 kernel functions.

Fig. 6: ROC curve: the points farthest from the diagonal represent the combination of RBF kernel with the C sub-set

Fig. 5: Classification rate for different combination of descriptor sets and kernel function

with the RBF kernel function gives the best classification rates rather than other kernels. However, the choice of vector (B) that gives the highest classification rate is not automatic. In fact, it is not sufficient to obtain a higher classification rate to conclude the reliability of the model. Optimal combination should be found between sensitivity and specificity. The compromise represented graphically in figure 6 in the form of ROC curves, used to study changes in the specificity and sensitivity of a test for different values of the discrimination threshold. A. Discussion Figure 6 represents the ROC curves for each kernel function applied to the vectors B and C, the ordinate represents the sensitivity and the abscissa represent to the quantity 1-specificity. Couples (1-specificity, sensitivity) of each combination are then placed on the curve. the worst situations are the points closest to the diagonal, obtained with the linear kernel. On the other hand, the more ”efficient” diagnostic test is corresponding to the curve near the upper left corner. This situation is achieved with the RBF kernel for two sets. However, the vector (B) represents the highest rate and specificity values, whereas

the vector (C) represents the highest sensitivity. For some applications, the choice of vector (B) is most advantageous. But in our application we choose the vector (C) as a representative vector, for a reason related to the sensitivity which is the most dominant parameter. That means that some benign polyps will be considered as malignant and, consequently, removed. The removal of these polyps is less critical than in case of wrongly classifying an adenoma as being a hyperplasia. In this case, the patient would run the risk of developing cancer. B. Hardware Implementation To achieve this classifier, we have used the architecture proposed by D. Anguita [22], where he designed an IP generator to realize an SVM-based classifier. This tool is able to generate, according to user needs, a physical description of digital architecture, implementing the Support Vector Machines. Its output is an HDL description suitable to be mapped on a reconfigurable platform such as an FPGA. The high-level view of the SVM module is shown in figure 7. The module takes as inputs: the support vectors, their classes, classifier parameters, and the new instance. It outputs the class of new instances. Table IV summarizes the temporal and physical performance of the whole project. The used target is a Xilinx VirtexII-Pro.This table reflects the efforts to limit the resources. Regarding occupancy rate, all the architectures need an occupancy rate of around 33%. To evaluate the performance of our model, we have compared two cases : At first, we illustrated the theoretical results of our classifier. In a second step, we presented the classification results obtained during the implementation on FPGA. The comparison is presented in figure 8. We notice clearly that the theoretical results (on computer) are better than those

Fig. 7: High level view of the SVM module Algorithm Acquisition Preprocessing Features extraction SVM classifier Communication Total used Total free occupation %

Clb Slices 309 2.527 865 622 170 4493 13696 32.8 %

Latches 337 3.499 228 342 157 4563 29060 15.7 %

LUT 618 4.331 1587 248 277 7261 27392 26.5 %

RAM 4 22 4 3 13 136 24 %

f (Mhz) 116 194 83 194 52

This rate is function of image size and 3D data quantity. knowing that the size of 3D data is a function of recovered laser points number (on average equal to 16.86 KB for each image). Normally, in endoscopy, a precision between 2 and 7 frames per second is sufficient to give a clear idea on the part under consideration (the precision of the famous PillCam capsule is 2fps). In figure 9 we illustrate two different types of data transmission : • in figure 9a we present the continuous transmission type of a conventional videocapsule, where the capsule transmits the images on a rate between 2 and 7 frames per second during its navigation into the patient’s body. • in Figure 9b, only a part of captured images will be transmitted, i.e when the recognition system identifies a polyp, the transmitter begins to send images and 3D data of relevant part. and thus the physician will read only a small amount of data rather than the entire pictures (can achieve tens of thousands of images).

TABLE IV: occupied resources in implementation

implemented on FPGA. The classification rate dropped about 1.8%, with a sensitivity of 95% and a specificity of 92.9%. In fact, several elements and approximations used during implementation can degrade the accuracy of the classifier and the classification error can reach 5% , this error is mainly due to following factors [22] : • fixed to floating point conversion; • quantization error; • and the degree of polynomial used to approximate the kernel function. Despite, the results are satisfactory in comparison to the state of the art of recent work in this domain.

(a)

(b)

Fig. 9: Reduction in consumption: The continuous transmission of images in (a) is more consuming in energy than the smart transmission in (b) corresponding to the relevant zones

C. Power Consumption Estimation To estimate the total consumption of the sensor, we must take into account the data rate of transmitted informations.

The comparison of these two types of transmission standpoint overall sensor consumption is illustrated in figure 10. We can note that under a data rate of 3.2 Mbps, the continuous transmission of images is less energy consuming. But this flow corresponds to an image matrix of 256 × 256

Fig. 8: Classification comparison between theoretical and implemented results

Fig. 10: Estimation of energy consumption versus data rate and transmission type

pixels, with an accuracy of 4 frames per second, that amount does not represent a sufficient resolution in endoscopy. To improve the resolution and image precision, we had to exceed this rate, especially since our sensor does not include until now an image compression functionality to reduce the amount of transmitted information. Beyond 3.2 Mbps, the amount of energy needed for communication increases exponentially, that will accelerate the expiration of the life of the battery, and then reduce the autonomy of the sensor. In this area, the use of our method of smart transmission will increase the autonomy of the sensor. for example at a data rate of 12 Mbps, a conventional videocapsule needs 390 mW to transmit the pattern images, whereas with our method, 212 mW are only needed for processing and transmission. VIII. C ONCLUSION AND F URTHER W ORK In this paper we described a novel approach to capsular endoscopy that overcomes some of its important limits. The basic and essential task was concerning the integration of SVM classifier in such a videocapsule in order to improve its autonomy and diagnostic capability. For this reason we have designed a large scale demonstrator to simulate our multi-stage system for the automatic features extraction and classification of colon polyps using 3D data obtained from an active stereovision system. In vitro experimental results were encouraging and show correct classification rate of intestinal polyps of approximately 93.7%, obtained by combining RBF kernel function to the original feature set for the SVM models. Note that this is an ideal example, because the scene is very simple and the size of the training set is small. Further work is clearly required to investigate different feature areas, like second order statistics features (co-occurrence matrices), and to evaluate system performance for multi-classification. Another advancement can increase system performance is achieved by integrating image compression task in the sensor. Such a functionality could increase system precision and autonomy while reducing power consumption due to reduction of transmitted data volume. R EFERENCES [1] D. K. Rex, et al. ”Quality indicators for colonoscopy”, American Journal of Gastroenterol, 101, 873-885 (2006). [2] T. Stehle, and al. ”Classification of colon polyps in NBI endoscopy using vascularization features”, in Medical Imaging 2009: Computer-Aided Diagnosis, vol. 7260. Orlando, USA. [3] M. Hebert, ”Active and Passive Range Sensing for Robotics”, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA ’00), April, 2000, pp. 102 - 110. [4] J. Ayoub, Olivier, Romain, B. Granado, Y. Mhanna, ”Accuracy Amelioration of an Integrated Realtime 3D Image Sensor”, Conference on Design and Architectures for Signal and Image Processing (DASIP), 2008, Bruxelles. [5] A. Kolar, O. Romain, J. Ayoub, and B. Granado, ”A system for an accurate 3D reconstruction in Video Endoscopy Capsule”, in EURASIP Journal on Embedded Systems, 2009. [6] P. Tchangani, ”Support Vector Machines : A Tool for Pattern Recognition and Classification”, Studies in Informatics and Control Journal, Vol. 14, No. 2, pp. 99-109, Ed. ICI, 1220-1766.

[7] T. Graba, B. Granado, O. Romain, T. Ea, A. Pinna, and P. Garda. , ”Cyclope: an integrated real-time 3d image sensor”, in XIX Conference on design of circuits and integrated systems, 2004 [8] S. Theodoridis, K. Koutroumbas, ”Pattern Recognition”, 2nd edition. academic press 2003 [9] M. Gruber and K. Y. Hsu, ”Moment-Based Image Normalization with High Noise-Tolerance”, IEEE Transactions Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997, pp. 136-139. [10] J. Daniels II, L. K. Ha, T. Ochotta, and C. T. Silva, ”Robust Smooth Feature Extraction from Point Clouds”, Shape Modeling International 2007: pp.123-136 [11] S. Gumhold, X. Wang, and R. McLeod, ”Feature extraction from point clouds”, In Proc. 10th Int. Meshing Roundtable, 2001. [12] C. Gold and P. Sollich, ”Model selection for Support Vector Machine classification”. Neurocomputing, 55:221-249, 2003. [13] C. Cortes and V. Vapnik, ”Support-vector network”, Machine Learning, 20:273-297, 1995. [14] D. Fradkin and I. Muchnik, ”Support Vector Machines for Classification”, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 70, pp. 13-20, 2006. [15] L. T. Jolliffe, Principal Component Analysis”, Springer-Verlag, New. York 1986. [16] A. Moglia, A. Menciassi, M. O. Schurr, P. Dario, Wireless capsule endoscopy: from diagnostic devices to multipurpose robotic systems, in Biomedical Microdevices, vol. 9, no. 2, pp. 235-243, 2007. [17] A. de Leusse. ”Vido-capsule colique : quel avenir ? Cancero digest”, vol. 14, page 8, 2008. [18] P. Katsinelos, J. Kountouras, and al. ”Wireless capsule endoscopy in detecting small-intestinal polyps in familial adenomatous polyposis”, in World Journal of Gastroenterology ISSN 1007-9327, 2009 December ; 15(48): 6075-6079 [19] L. Lai, G. Wong, D. Chow, J. Lau, J. Sung, and W. Leung, ”Long-term follow-up of patients with obscure gastrointestinal bleeding after negative capsule endoscopy”, in Am J Gastroenterol 2006, v(101),pp 1224-1228 . [20] Ji Peng, Wu Chengdong, Zhang Yunzhou and Wang Xiaozhe, ”A power-aware layering optimization scheme for wireless sensor network”, in Proceedings of the 5th International Conference on Wireless communications, networking and mobile computing, 2009, pp. 3282–3285, Piscataway, NJ, USA [21] V. Mazzaferro, R. Doci, S. Andreola, A. Pulvirenti, F. Bozzeti F et al. Liver transplantation for the treatment of small hepatocellular carcinomas in patients with cirrhosis. N Engl J Med, vol. 334, page 693, 1996. [22] D. Anguita, A. Ghio and al. A FPGA Core Generator for embedded classification systems. Journal of Circuits, Systems, and Computers, vol. 20, no. 2, pages 263282, 2011.

A SYSTEMC AMS/TLM PLATFORM FOR CMOS VIDEO SENSORS Fabio Cenni1,2 , Serge Scotti1 , Emmanuel Simeu2 1

STMicroelectronics, Grenoble, France TIMA Laboratory, Grenoble, France

2

ABSTRACT This work presents how an image acquisition system based on a CMOS image sensor (CIS) has been modeled by means of the recently standardized analog and mixed-signal (AMS) extension to the SystemC 1666 IEEE standard. Many optical and electrical effects at a high level of abstraction are described by the model while limiting the model complexity for making the model suitable to top-level simulations and performance analysis. A comparison among SystemC AMS models developed at different levels of abstraction is shown. The SystemC AMS model of the image sensor is supplied with input scenarios that mimic the scene captured by an image sensor inserted in a dark-box for tuning purposes. The model is integrated in a SystemC TLM platform that contains also the image signal processor (ISP) algorithms and a comparator between the detected image and the processed one. The resulting SystemC AMS/TLM platform demonstrates how the tight interaction between the SystemC AMS image sensor model and the SystemC TLM ISP model allows an early development/validation of the embedded software. Simulation results are shown and future related works discussed. Index Terms— Mixed-signal system, image sensor modeling, IP reuse, SystemC AMS, SystemC TLM. 1. INTRODUCTION The devices offered by the market of embedded electronics integrate more and more features concerning different physical domains (mechanical, optical, electrical). The validation of the overall functioning of these systems is crucial as such needs to be performed as soon as possible during the design flow by means of simulations. It is important for silicon manufacturer to develop models (virtual prototypes) of their intellectual properties (IPs). The simulation of such complex systems by means of their models shrinks time-to-market by anticipating some phases of the design flow. An image acquisition system is composed by three main blocks: an image sensor followed by an image signal processor (ISP) and a central processing unit (CPU). This work focuses on the development of a reliable and accurate behavioral model of a CIS with the purpose to simulate its functioning within its surrounding environment, the benefits shall be: high-level ar-

chitecture exploration capabilities through early validation of the overall specifications and early embedded software development/debugging. The validation of the sensing/controlling interoperability between CIS and ISP during the development phase shall be enabled without having to wait for hardware prototypes. This would also allow to validate the ISP algorithms at simulation-level by tuning them and stressing them up to determine their limit working conditions. With respect to a CIS, analog and digital electrical signals are involved together with analog optical values. SystemC AMS [1, 2] is an analog and mixed signal (AMS) extension of the SystemC framework [3], both C++ based , that aims at creating a complete design environment for the modeling and simulation of heterogeneous systems at a high level of abstraction [4]. The SystemC AMS extension was chosen for modeling the CIS heterogeneous system with properties in optics and electronics. Different SystemC AMS models of computation (MoCs) have led to different model accuracies. The model takes into account many analog phenomena that occur in a CIS. The SystemC AMS model of the video sensor is integrated in a SystemC TLM OSCI 2.0 demonstration platform [5] containing a model of an ISP and a comparator unit. The paper is structured as follows: Section 2 will illustrate the SystemC AMS MoCs. Section 3 will introduce the target CMOS video sensor and show the SystemC AMS model history. Section 4 gives an overview of the fastest SystemC AMS TDF. Section 5 details the SystemC TLM platform. Section 6 shows the simulation results and Section 7 gives the conclusions and introduces the future works. 2. OVERVIEW OF SYSTEMC AMS SystemC AMS adds analog-specific MoCs allowing the modeling of analog blocks, the interoperability among them and among other SystemC/SystemC TLM models or Intellectual Properties (IPs). Therefore, analog blocks can be modeled at behavioral level and simulated together with their surrounding digital environment. The digital environment typically consists of the hardware description language (HDL) model of the platform architecture together with its embedded software. The overall system simulation allows the validation of the behavior of a complex heterogeneous system using the single unified programming language C++, contrary to

what a VHDL/VHDL-AMS based modeling would offer. Three MoCs or modeling formalisms are available in SystemC AMS. In this work only the Electrical Linear Network (ELN) and the Timed Data Flow (TDF) MoC are used. The ELN MoC allows to model conservative and continuous-time behaviors. The modeling style consists on the instantiation of linear electrical primitives (resistors, capacitors, inductors, V/I sources) in order to form an electrical linear network that will be solved by the simulator. The added value of the ELN MoC is that a low-level model is obtainable by means of the C++ language. The TDF MoC regards signals as uniformly sampled directed signals. Samples are provided by TDF source modules and consumed by TDF sink modules. A TDF cluster is a chain (or closed loop) of TDF modules interconnected by means of TDF signals. The TDF MoC allows simulating both non-linear static and linear dynamic behaviors [6]. Non-linear static behaviors are modeled by means of user-defined calculations inside TDF modules while linear dynamic behaviors can be implemented using predefined types of calculations from the SystemC AMS library (e.g. Laplace transfer functions or state space equations).

each photodiode has been described as a capacitor charged by a current source and a parallel Ron/Roff switch for its reset. The ELN voltage across each capacitor was converted to a TDF signal and the timing of transfer-gate (TG) signals and the other read-out control signals managed the driving of the Vx column line. Those read-out control signals are provided by the video timing block, the timing of these signals had been accurately described. Depending on the TDF time step the model simulated within different simulation times, for a good accuracy the chosen TDF time step of 0.5µs allowed to simulate a 48 by 48 pixel array in ten minutes. Both the VHDL-AMS and the SystemC AMS ELN-TDF models describe the sensor at a low level of abstraction comparable to the register-transfer architectural level (RTL). The capability of SystemC AMS to cover RTL level descriptions is proved. The SystemC AMS ELN-TDF model allows to gain a speed-up factor of about 35 times compared to the VHDLAMS model but the performances really depends on the time step. Despite that, the desired SystemC AMS model is intended to raise the level of abstraction even further, hence, a TDF-based model has been studied. RSTr TG0

3. CMOS VIDEO SENSOR AND SYSTEMC AMS MODELS

RSTgr

TG1

-0.8 2.3

RSTgb

The CMOS video sensor studied is designed by STMicroelectronics in its IMG140 CMOS technology. The size of each pixel of the matrix is 1.4µm x 1.4µm. The pixel architecture is called 1T75 because the transistors are shared among the neighboring photodiodes leading to a 1.75 equivalent transistors per photodiode (see [7] for more details). The full matrix counts 2 megapixel (1920x1080). The sampling is performed by at most 10 bits analog to digital converters (ADCs). Analog HDLs have been used for modeling image sensors in the literature [8, 9] and in STMicroelectronics mostly VHDLAMS based models aimed at assisting the CIS designers with rapid and comprehensive views of the waveforms of the control signals (reset, read, line selection etc.). Such models are highly demanding in terms of simulation duration due to the accurate description of the control signals timing and the differential equation system to be solved for each photodiode. The desired CIS model is intended to describe the system at a higher abstraction level for developing/validating the feedback control loop (CIS/ISP) while keeping a (relatively) accurate modeling of the analog behavior of the sensor. In the following the SystemC AMS models evolution is regrouped by MoC and sorted by chronological order of development. 3.1. SystemC AMS ELN-TDF model Following the same principle of the VHDL-AMS a SystemC AMS ELN-TDF first model has been developed by reproducing the schematic 1T75 architecture of the pixel. As shown in Fig. 1, for what concerns the conservative domain of energies,

TG2 RSTb

SN SRST READ SRST

TG3 Vx

SystemC AMS ELN

SystemC AMS TDF

Fig. 1. Structure of SystemC AMS ELN-TDF model.

3.2. Enhancing the SystemC AMS TDF model The first TDF model developed contained the entire array of pixel instances. The model describes the discharge of the photodiodes as a linear sampled discharge generated by a multiplication between the current value due to the light and the time. The accuracy of the model relies on the granularity of the TDF time step since the read signal could arrive at any time between two samples of the TDF signal. In this model the whole array of pixels was represented by a two dimension (2D) array of instances of the pixel module. This model presents a simulation time of a 48 by 48 pixel array, with a TDF time step of 0.5 µs for the discharge, of about 2 minutes and 40 seconds (Table 1). In order to reduce the simulation time a new modeling style has been introduced. The 2D array of pixels is no more fully instantiated but only one instance of the pixel sweeps the whole array row-after-row, like it occurred in an old-fashioned analog-TV-like image refresh. The discharge of the photodiodes is still described as a negative constant-slope sampled TDF signal and a trade-off between accuracy and granularity of the TDF time step still subsists.

This new modeling style allowed to reduce considerably the simulation time thanks to the elimination of huge amounts of non-relevant processing. It presents a simulation time of a 640 by 480 (VGA size) pixel array, with 50 TDF time step per discharge, of about 10 seconds (Table 1). Further improvements have led to consider only the final value of the discharge once the integration time has elapsed. The discharge is considered linear as far as the saturation occurs with a discontinuity of the first order derivative. The TDF time step is reduced to one TDF calculation per pixel discharge. Therefore a fictitious pixel time is introduced for representing the SystemC time of processing of one pixel over the SystemC time of processing of the array. The tremendous gain in simulation time allowed to simulate a 1920 by 1080 (2 megapixel) pixel array within about 7 seconds (Table 1). 4. OVERVIEW OF THE SYSTEMC AMS TDF CIS MODEL The AMS model is composed by a TDF cluster, as shown in Fig. 2. Three TDF time steps are mainly present in the model: the frame time (TDF modules fired at the frame rate), the pixel time (TDF modules fired every pixel time) and the row time step for the firing of the bank of ADCs. The frames of the video stream are passed from block to block at the beginning of the chain. The images can be generated in two ways in the model: one emulates the capture of a moving object on a fixed background, in the following called input image builder (IIB), and the other emulates the situation of the CIS inserted in a dark box with controlled input scenarios and light, in the following called dark box (DB). In the case of the IIB, see Fig. 2, the background builder (BGB) builds the background and the object builder (OB) draws an object

Tp:f_t R:1

Input Image Builder (IIB) background builder (BGB): « fade2grey »

Tp:f_t R:1

object builder (OB): «f» or «car»

Tp:f_t R:1

Tm:f_t

Tp:f_t R:1

Tm:f_t

iw=input_width=2*w ih=input_height=2*h cl=color pd=photodiode f_t=frame_time=1/frame_rate r_t=row_time=f_t/h p_t=pixel_time=r_t/w LEGEND: SC_MODULE SCA_TDF_MODULE

sensor_sc_wrapper

Lens

B

Controller Tm:p_t

GB GR R

Tp:f_t R:1

Tp:f_t Tp:f_t R:1 R:1

Tm:f_t

Tp:p_t R:w*h Tm:f_t

tdf

Tp:p_t R:1

1 GR 1

1 R 1

Bayer filter Tp:p_t R:1

1 GB 1

1 B 1

t_i_ps scaling to light range 0-500000

Tp:p_t R:1 de-scaling to a range 0-(2^N_bit-1)

VxB

Tm:f_t

Tm:f_t

w=960 h=540 0.5Mpix with 4pd/pix(2green) ~2Mpd*16bit

VxGB

Tp:f_t R:1

iw=2*960 ih=2*540 2Mpix with 3cl/ pix(1green) ~6Mcl*16bit

VxGR

Tp:f_t R:1

image loader (IL)

VxR

Dark Box (DB) environment (ENV)

upon it (a car for instance). The OB also models the electronic rolling shutter (ERS) effect on a running car captured. In the case of the DB, the environment (ENV) block supplies the image loader (IL) with three TDF control signals. The controls are: what chart is to be loaded among predefined charts (scenarios), what color and intensity the light illuminates the chart with. The DB emulates the functioning of a dark box containing a set of red, green, blue and white light emitting diodes (LEDs) driven by pulse width modulated (PWM) signals. The image issued by the DB (or IIB see Fig. 2) is coded upon a parametric number of bits (typically between 8 and 10) and sent to the lens module. The lens models the distortion of the light’s angle of incidence due to the lens effect that causes the visible artifact on the image called vignetting (fainting of the light intensity on the periphery of the image compared to the center of the image). The image is sent to the Bayer filter (BF) module which, in turn, sends it to the inner part of the model. The Bayer filter models the Bayer patterned color filter array on top of the silicon photodiodes and add randomly located hot pixels defects. The Bayer pattern consists in groups of four color filters sometimes referred to as R-Gr-Gb-B. The TDF time step is reduced to the pixel time step and the whole matrix is swept by an instance of pixel module. The controller drives the pixel with the light signals coming from the Bayer filter. The pixel contains four photodiodes driven by the four light signals and the integration time information. The light signals are updated at the beginning of each photodiode discharge processing. The saturated discharge value Vx at the end of the integration time is output by each pixel. The column voltage Vx is the signal sampled by the bank of ADCs (ADC module). Finally, a decimator is needed to convert the row rate TDF signal into a frame rate SystemC signal.

integration_time_register Tp:p_t Tp:p_t R:w R:w

Decim

Tm:f_t

output@frame_rate

Tp:f_t Tp:r_t R:h R:1

sc_signal 0(2^N_bit-1)

output@row_rate

Tp:r_t R:1

ADC

Tm:r_t

sensor

Fig. 2. Block diagram of the SystemC AMS model.

Tp:p_t Tp:p_t R:w R:w frame buffer 0 frame buffer 1

5. SYSTEMC AMS / SYSTEMC TLM PLATFORM The SystemC AMS model of the video sensor is integrated in a SystemC TLM OSCI 2.0 platform for a demonstration of the concept, that is, a SystemC AMS / TLM joint simulation allow to simulate a complex mixed-signal system. On the one hand, the analog part reacts to changes of the digital control signals driven by the digital part. Notably these can be: changes of the integration time, changes to the autofocus control signal, changes to column amplifier digital and analog gains, etc. Therefore many analog aspects have to be modeled in the SystemC AMS model. On the other hand, the analog part must send the relevant information to the digital part. This can be done in different ways, typically, analog signals are sampled and these samples are packaged in packets and sent to the digital part. In the case of a video sensor platform, the output of the sensor model is packaged frame by frame and sent to the ISP. The analog and digital parts can be modeled at different levels of details. It is understood that an accurate modeling requires a complex model demanding a high simulation time. In order to validate specific functionalities a solution is to reduce the complexity of the model by focusing on the targeted functionality only, such as the autoexposure. This section deals with the structure of the SystemC TLM platform. Three TLM blocks are identified (Fig. 3): the sensor (tlm sensor), the ISP (tlm isp) and the comparator (tlm comparator). Legend SC_MODULEs: TLM AGENT

sensor idle

stream

Streaming

sensor_sc_wrapper dummy sink

tlm_isp

sensitive SCTHREAD

input image sinthesized wait for event

Control feedback

b_transport

SCTHREAD

column and the R-Gr-Gb or B values. The output sc signal is used to trigger a sensitive SC THREAD (black dashed connection). Each time the AMS model delivers a new capture the 2-state-machine of the SC THREAD (cloud on top of Fig. 3) initializes a tlm generic payload containing a pointer to the image matrix and sends the transaction to the tlm isp by means of the TLM OSCI 2.0 blocking interface on the output streaming socket (red connection named “streaming”). Meanwhile, once the DB (or IIB) has built the input image it notifies an sc event (light blue curved connection) that is detected by the SC THREAD that sends the input synthesized image to the comparator. The comparator will wait for the reception of the reconstructed image for performing the comparison. 5.2. TLM ISP The tlm isp is activated by the reception of the image issued by the sensor. The ISP performs two main tasks: correction of the image and feedback the control. With respect to the correction of the image, the first step consists on correcting the Bayer filter inequality of light wavelength adsorption. This correction is done by applying the inverse of the nominal coefficients of the Bayer filter model. The second step is to retrieve a raw RGB image from a Bayer patterned image, this is done by means of an interpolation. With respect to the feedback of the control signals the ISP performs an estimation of the light of the reconstructed image and decides whether the integration time has to be updated or not (autoexposure). The ISP sends the new values to the tlm sensor using a blocking transport through the “control feedback” transaction link of Fig. 3. The tlm sensor receives the new values and updates the integration time sc signals, which are read by the controller. The reconstructed image is then sent to the tlm comparator through the “reconstructed image” transaction link (Fig. 3).

tlm_sensor

5.3. TLM Comparator top

Synthesized input image

tlm_comparator

Reconstructed image

Fig. 3. SystemC TLM OSCI 2.0 platform: top view.

5.1. TLM Sensor The SystemC AMS TDF model (Fig. 2) of the CIS is wrapped by the tlm sensor. The SystemC AMS TDF chain is represented by the light green blocks of Fig. 3; it is contained by the sensor sc wrapper module. The output is an sc signal (black arrowed line in Fig. 3) driven by the TDF output converter port to the discrete-event domain of the decimator module. The output sc signal carries a triple pointer to a 3dimension array, the 3 dimensions allow to select the row, the

The tlm comparator performs a pixel-by-pixel comparison between the synthesized input image and the output image. A study on the comparison equations has been carried out evaluating the best way to make visible a difference between input and output otherwise sometimes imperceptible to the naked eye. Fig. 4 shows a sequence of comparison images built using different equations and colored according to the color legend and scales on top left. Granted that R,G and B values go from 0 to 255, equation 1 penalizes triplets with either R,G or B tending to 0 making the error too strong (notable on the left of each horizontal stripes). Equation 2 attenuates this drawback and penalizes only the black (0,0,0). Since color values are codes and not quantities it has been decided not to divide by the absolute values of colors, even because +1 should be added to denominators of equation 1 and 2. Equation 3 is the Euclidean norm of the delta vector

(∆R, ∆G, ∆B) = |(R, G, B)OU T −(R, G, B)IN |. Equation 3 does not consider absolute values, but only distances, thus no penalization persists. A division by the maximum possible value is then needed for limiting Ei,j between 0 and 1. Integration time evolution

Ei,j

0.5

.5

0.4

1.5

.375

0.3

1.0

.25

0.2

0.5

.125

0.1

0.0

0.0

0.0

+

+

Bi,j

Size

ELN for photo diodes 48x48 only+TDF TDF array of pixels 48x48 instantiated TDF one pixel sweeping 640x480 the array TDF one pixel sweeping 1920x1080 the array ↓ Enhancement to 1920x1080 one “controller”

Sim time for 1 frame 10mn

2mn 40s

10s

7s

2s

Time step TDF time step =0.5µs TDF time step =0.5µs 50 TDF steps per pixel→ waveform 1 TDF step per pixel→ no waveforms 1 TDF step per pixel→ no waveforms

Sim Time Speed-up for 1 pixel ratio 0.26s

1

69ms

x3.7

32.6µs

x7949

3.4µs

x77142

0.96µs

x270000

Bi,j IN

( R i,j + Gi, j + Bi,j ) (Ri,j + Gi,j + Bi,j )IN 2

Ei,j =

Gi, j Gi,j IN

IN

.625

2.0

OUT

2.5

EQ. (1)

.75

EQ. (2)

Ei, j =

R i,j R i,j IN

0.6

10

2

R i,j + Gi,j + Bi,j 2552 + 2552 + 2552

2

EQ. (3)

Ei,j =

EQ(1) EQ(2) EQ(3)

Table 1. Simulation performances. Model

Fig. 4. Comparison equations analyzed.

6. SIMULATION RESULTS 6.1. Simulation performances Table 1 evaluates the performances of the different CIS models developed, without taking into account the tlm isp and tlm comparator simulation time overheads, that, however, are not very influent. Since sizes are different, the simulation times for one frame have been relativized to one pixel for comparison purpose. The simulation time for one pixel is calculated by dividing the simulation time for one frame by the number of pixel. The reference for the speed-up ratio is the SystemC AMS ELN-TDF model. A further simplification consists in encompassing the body of 4 TDF modules (controller, pixel, ADC, decimator) into one TDF module, this leads to 2 seconds of simulation time for the acquisition of a 2 megapixel frame (last line of Table 1). The latter is the fastest SystemC AMS TDF model obtained, it reaches a tremendous speed-up factor of about 5 orders of magnitude compared to the low-level SystemC ELN-TDF model but the accuracy level is necessarily reduced. Its simulation does not allow to trace voltage waveforms anymore for assisting the hardware/RTL designer. However, this high-level model can take into account many aspects that would not be possible to model using VHDL-AMS because of their computational weight. Such aspects are: the Bayer filter, the lens and, obviously, the interaction with a SystemC or SystemC TLM description of the digital hardware of the surrounding platform. These reasons make the model suitable for embedded software development/debugging or ISP algorithm validation.

6.2. SystemC AMS/SystemC TLM platform simulation Fig. 5 shows the results of a simulation of a CIS of size 200 by 200. The DB loads the input chart (top sequence), it illuminates the chart with a white light of changing intensity (evolution in green in figure, initially 10lux). The Bayer patterned image output by the SystemC AMS sensor (second sequence) is initially dark because of low initial ambient light (AL) and integration time (IT). The tlm isp interpolates the images and builds the third sequence of Fig. 5 without correcting nor the faulty pixels nor the vignetting. The tlm isp estimates an under-exposure on the initial frames and sends a request to the tlm sensor for increasing the IT. Since the DB independently augments the AL too, soon (at frame 5) the image reconstructed by the tlm isp will be too bright because of the two effects. Hence the tlm isp imposes a decrease of the IT. The tlm comparator compares the input and the output images and builds the comparison images (bottom sequence). Since both the input and output first images are dark the error is low and a small surface is white colored. The error is the lowest at frames 9 and 10 (wider white colored surface on the comparison images) once the transient over-reaction of the ISP feedback is ended. The AL gradually goes to zero and the IT tends to its maximum but of course, it is not sufficient when light is off. The image is visibly affected by vignetting (caused by the lens effect) and stuck-at faulty pixels. 7. CONCLUSION AND FUTURE WORKS We have demonstrated the suitability of SystemC AMS/TLM for the modeling of a complex mixed-signal video acquisition platform. Different SystemC AMS MoCs were used for modeling the CIS at different abstraction/accuracy levels. The performances of the respective models have been compared. The fastest model of the CIS has been developed by means of the TDF MoC. The interoperability between mixed-signal components described using SystemC AMS and their associated digital environment has been proved by integrating the

0

25

50

62

62

50

38

38

10

20

30

40

50

60

70

80

1

2

3

4

5

6

7

8

26

26

38

38

38

50

62

74

86

98

110

60

50

40

30

20

10

0

10

12

13

14

15

16

17

18

110

Integration time evolution [row time] 90

80

70

20

Ambient light evolution [lux] 9

10

11

19

20

Fig. 5. Simulation results of the SystemC AMS/TLM platform. model of a CIS in a SystemC TLM OSCI 2.0 demonstration platform. It is claimed that the low execution time of the SystemC AMS TDF model allows performing validation and debug of the whole system before the hardware availability. A first integration of the CIS model into a more generic complex STMicroelectronics SystemC TLM virtual platform containing the ISP and a 32 bit CPU has been seamlessly done but validations are still ongoing. The final aim is to validate three control loops, that are: automatic white balance (AWB), autoexposure (AE) and auto-focus (AF).

[5] M. Damm, J. Haase, C. Grimm, “Connecting SystemCAMS models with OSCI TLM 2.0 models using temporal decoupling,” International Forum on Specification and Design Languages (FDL), Stuttgart, Germany, pp.25-30, September 2008.

8. REFERENCES [1] Open SystemC Initiative (OSCI), “Standard SystemCAMS extensions language reference manual.” March 2010.

[7] M. Cohen et al., “Fully Optimized Cu based process with dedicated cavity etch for 1.75um and 1.45um pixel pitch CMOS Image Sensors,” International Electron Devices Meeting (IEDM), San Francisco, CA, pp.1-4, December 2006.

[2] A. Vachoux, C. Grimm, K. Einwich, “SystemC-AMS requirements, design objectives and rationale,” Design, Automation and Test in Europe (DATE) conference, Munich, Germany, pp.388-393, March 2003.

[8] F. Dadouche, A. Pinna, P. Garda, A. AlexandreGauthier, “Modelling of pixel sensors for image systems with VHDL-AMS.” International Journal of Electronics, vol.95, no.3, pp.211-225, 2008.

[3] Open SystemC Initiative (OSCI) “Standard SystemC Language Reference Manual.” April 2005. [4] M. Vasilevski, F. Pecheux, H. Aboushady, L. de Lamarre, “Modeling of wireless sensor network nodes using SystemC-AMS,” Behavioral Modeling and Simulation Workshop (BMAS), pp.11-16, Septemper 2007.

[6] F. Cenni, E. Simeu, S. Mir, “Macro-modeling of analog blocks for SystemC-AMS simulation: A chemical sensor case-study,” 17th IFIP International Conference on Very Large Scale Integration (VLSI-SoC), Florianopolis, Brazil, October 2009.

[9] D. Navarro, D. Ramat, F. Mieyeville, I. O’Connor, F. Gaffiot, L. Carrel, “VHDL & VHDL-AMS modelling and simulation of a CMOS imager IP,” Forum on specification and Design Launguages (FDL), Lausanne, Switzerland, pp.179-183, Semptember 2005.

2011

Tampere, Finland, November 2-4, 2011

Session 8: Methods & Tools for Dataflow Programming -

, INSA, France Jorn Janneck, Lund Institute of Technology, Sweden

Hardware/Software Co-Design of Dataflow Programs for Reconfigurable Hardware and Multi-Core Platforms Ghislain Roquier, Endri Bezati, Richard Thavot and Marco Mattavelli The Multi-Dataflow Composer Tool: A Runtime Reconfigurable HDL Platform Composer Francesca Palumbo, Nicola Carta and Luigi Raffo A Unified Hardware/Software Co-Synthesis Solution for Signal Processing Systems Endri Bezati, Herve Yviquel, Mickael Raulet and Marco Mattavelli Optimization Methodologies for Complex FPGA-Based Signal Processing Systems with CAL Ab Al-Hadi Ab Rahman, Hossam Amer, Anatoly Prihozhy, Christophe Lucarz and Marco Mattavelli

www.ecsi.org/s4d

HARDWARE/SOFTWARE CO-DESIGN OF DATAFLOW PROGRAMS FOR RECONFIGURABLE HARDWARE AND MULTI-CORE PLATFORMS Ghislain Roquier, Endri Bezati, Richard Thavot, Marco Mattavelli Ecole Polytechnique Fédérale de Lausanne, CH-1015, Lausanne ABSTRACT The possibility of specifying both software and hardware components from a unified high-level description of an application is a very attractive design approach. However, despite the efforts spent for implementing such an approach using general purpose programming languages, it has not yet shown to be viable and efficient for complex designs. One of the reasons is that the sequential programming model does not naturally provide explicit and scalable parallelism and composability properties that effectively permits to build portable applications that can be efficiently mapped on different kind of heterogeneous platforms. Conversely dataflow programming is an approach that naturally provides explicit parallel programs with composability properties. This paper presents a methodology for the hardware/software co-design that enables, by direct synthesis of both hardware descriptions (HDL), software components (C/C++) and mutual interfaces, to generate an implementation of the application from an unique dataflow program, running onto heterogeneous architectures composed by reconfigurable hardware and multi-core processors. Experimental results based on the implementation of a JPEG codec onto an heterogeneous platform are also provided to show the capabilities and flexibility of the implementation approach. Index Terms— dataflow programming, hardware/software co-design, reconfigurable hardware, multi-core processor 1. INTRODUCTION Parallelism is becoming more and more a necessary property for implementations running on nowadays computing platforms including multi-core processors and FPGAs units. However, one of the main obstacles that may prevent the efficient usage of heterogeneous concurrent platforms is the fact that the traditional sequential specification formalisms and all existing software and HDL IPs, legacy of several years of the continuous successes of the sequential processor architectures, are not the most appropriate starting point to program such heterogeneous parallel platforms [1]. Moreover, such specifications are no more appropriate as unified specifications when targeting both processors and reconfigurable hardware components. Another problem is that portability

of applications on different platforms become a crucial issue and this is not provided by a sequential specification model. The work presented in this paper focuses on the methodology developing scalable parallel programs that provide portability on heterogeneous platforms. To achieve this objective we are moving away from the traditional programming paradigms and adopt a dataflow programming paradigm. The dataflow actor based programming model describes applications as graph where nodes and edges represent respectively computational components and communication channels. Dataflow programming naturally exposes the intrinsic parallelism of applications, which can be used to distribute computations according to the available parallelism of the computing platforms. This paper proposes a methodology for the hardware/software co-design of high-level dataflow program targeting heterogeneous parallel platforms, composed of multi-core processors and FPGAs components. To this end, we argue that application descriptions should be scalable, portable and modular. Portability means that the application should be able to run unchanged on any architectures from single-core processors to multi-core processors as well as FPGAs or a combination of them. Portability enables fast deployment of applications with no assumption on the architecture. Application should also be scalable, that means that the performance of the application should scale with the available parallelism of the target architecture. The application should expose appropriate explicit parallelism so that efficient mappings can be found to exploit the available parallelism of the architecture. The rest of the paper is organized as follows. Section 3 introduces the main concepts of the dataflow programming approach, the advantages versus the traditional sequential programming model and the essential elements of the CAL language used for the high level design. Section 4 presents an architectural model that is functional to model heterogeneous platforms composed by multi-core processors and FPGAs at high-level of abstraction. The hardware/software synthesis methodology is presented in Section 5 which explains the different stages that enable to automatically synthesize dataflow programs and efficiently map onto platform components. Section 6 describes in more details the co-design tools supporting the designer in the different stages of the development of an application. Section 7 presents a case

study where a motion JPEG codec is implemented onto different platforms made of FPGAs and embedded processors. Section 8 concludes the paper by discussing the advantages of the approach and the remaining challenges as well as some perspectives of future work and further extensions. 2. RELATED WORKS Hardware/Software co-design is nothing new and several of the fundamental ideas of our work go well back into the nineties [2]. In [3], the authors proposed a classification of ESL tools where the methodology we propose in the paper can be classified in the functionality-platform-mapping category. Indeed, our design method intends to address both the modeling of the functionality (the application description) and the platform (the architectural description) at a high-level of abstraction as well as to provide the tools that enable to map the functionality onto the platform instances to generate lower-level components (software component, RTL, etc.). Prior research related to our methodology has been proposed in the literature [4, 5, 6]. They mainly differ in the way to describe the system behavior. For instance, The Artemis Workbench is based on Kahn Process Network (KPN) to model applications or PeaCE from the Seoul National University which extends the Synchronous Dataflow model (SDF) and Finite State Machine (FSM), to name but a few. To the best of our knowledge, we propose a different model based on an extension of the dataflow process network model (DPN) that enable to express a large class of applications, including non-determinism and timing-dependent behaviors.

express actors that belong to the DPN MoC [10]. In addition to strong encapsulation and language constructs supporting the fundamental idioms of dataflow with firing (production and consumption of tokens, state modification, and discrete steps), The language supports implementation of actors that can belong to more restricted MoCs, e.g KPN, SDF, etc. CAL is a domain specific language, so it has the nice property to be easily analyzable. This property enables the development of tools that can detect problems (such as e.g. potential non-determinism and deadlock), and it is also the key for many optimizations and program transformations, such as static scheduling (ordering of actors at compile-time) or vectorization [11, 12]. CAL can be synthesized to software and hardware [13, 14, 15]. Recently a subset of CAL, named RVC-CAL, has been standardized by ISO/IEC MPEG [16] and is used as reference software language for the specification of MPEG video coding technology under the form of a library of components (actors) that are configured (instantiations and connections of actors) to generate video decoders. 3.2. CAL Actors Actors are the basic computational entities of a CAL dataflow program. Actors execute by performing a number of discrete computational steps, also referred to as firings. During each firing, an actor may: • consume tokens from its input ports, • produce tokens on its output ports, • modify its internal state.

3. DATAFLOW PROGRAMMING The dataflow paradigm for parallel computing has a long history from the early 1970s. Important milestones may be found in the works of Dennis [7] and Kahn [8]. A dataflow program is conceptually represented as a directed graph where nodes (named actors in the rest of the document) represent computational units, while edges represent sequences of data. Formal dataflow models have been introduced in literature, from Kahn process network (KPN) to Synchronous dataflow (SDF) just to name a few. They differ by their so-called Models of Computation (MoC), that define the behavior of dataflow programs. There exists a variety of dataflow MoCs which results into different trade-offs between expressiveness and analyzability. Dataflow programs provide a natural way to express concurrency in a program. An immediate benefit of dataflow is that actors can execute simultaneously, thus allowing the program to take advantage of parallel platforms. 3.1. Dataflow with firing In [9], authors presented a formal language for writing actors. The language, named CAL, is designed specifically to

An actor may include an internal state. An important guarantee is that internal states are completely encapsulated and cannot be shared with other actors, i.e. actors communicate with each other exclusively through passing tokens along dataflow channels. This makes dataflow programs more robust and safe, regardless of the interleaving of actors. A firing, once initiated, must terminate irrespective of the environment of the actor, i.e. all conditions necessary for its termination (input tokens availability, current state of the actor, etc.) must be met before the firing is begun. An actor executes when it meets all firing conditions, regardless of the status of all other actors. 4. PLATFORM MODELING Architecture modelling is based on the model proposed in [17]. The architecture is modeled by an undirected graph where each node represents an operator (a processing element like a CPU or an FPGA in the terms of [17]) or a medium of communication (bus, memories, etc.), and edge represents interconnection viewed as a transfer of data from/to operator to/from medium of communication. The architectural

description is serialized into an IP-XACT description, an XML format for the definition and the description of electronic components, an IEEE standard originated from the SPIRIT Consortium. The architecture description is hierarchical and permits to describe architectures with different levels of granularity. For instance, a multi-core processor can be represented as an atomic node or hierarchically exposing lower level details, where cores and memories become in turn atomic nodes. Figure 1 depicts an architecture description on the QorIQ P2020 platform from Freescale. This platform includes two e500 cores, sharing a L2 cache. The platform modeling environment we develop is freely available as a part of the dftools project [18]. e500 core

L2 cache

There are several possible strategies for scheduling and partitioning a dataflow graph [19], from the fully static strategy, where assignment and ordering of actors are known at compile-time to the fully dynamic strategy where those decisions are made only at run-time. In order to be expressive enough to model complex applications, we have chosen the dataflow process network (DPN) computation model [10] and as a consequence actors are scheduled at runtime in the general case. That’s why the static assignment strategy is used, where partitioning is done at compile-time, while scheduling is done at run-time. However, when regions of the dataflow program on one partition belong to MoCs that can be scheduled statically, named statically schedulable regions (SSR), the self-timed strategy [19] is chosen.

e500 core 5.1.1. Trace-based scheduling/partitioning

system bus

Fig. 1. An example of architecture description representing a Freescale QorIQ P2020 platform.

5. CO-DESIGN METHODOLOGY The methodology for the hardware/software co-design is illustrated in Figure 2. It consists essentially in 3 steps: the scheduling and partitioning of the dataflow program according to the architecture, the synthesis of the inter-partition communications as well as their interfaces and finally the co-synthesis in order to generate software components and hardware descriptions. 5.1. Scheduling and partitioning The first step of the methodology consists in scheduling and partitioning the dataflow program according to the architecture. The partitioning consists of assigning a processing element (PE) for each actor of the dataflow program, while scheduling consists in determining the order of execution of actors assigned to a given partition. Indeed, the number of processing elements is often lower than the number of actors, actors assigned on the same processing element must be scheduled. The problem of scheduling and partitioning a dataflow graph onto an architecture with multiple PEs is NP-complete. Therefore, heuristics with polynomial-time complexity are widely used when dealing with large-scale dataflow graphs. The scheduling/partitioning step is illustrated in the second step of the Figure 2 where the color of actors on the dataflow program corresponds to the PE assignment.

In general case, the static assignment strategy is chosen for scheduling/partitioning the CAL program. The consequence is that it is not possible to apply any optimization algorithms/heuristics on the input dataflow graph as is. Instead, the static partitioning results from the joint scheduling and partitioning of a trace of the input CAL program. A trace is a Directed Acyclic Graph (DAG), determined by interpreting (simulating) the CAL program, where nodes represent executed actions within actors and the DAG forms a partial order on the nodes defined by their (data- or state-) dependencies. This task consists in implementing heuristics to output compile-time partitioning.

5.1.2. Static scheduling post-optimization Once the partitioning is determined, actors assigned on a given PE are scheduled dynamically, since all of them are assumed to belong to the DPN MoC. It results in a significant run-time overhead. However, scheduling statically (a subset of) those actors is sometimes possible when they belong to more restricted MoCs, viz. SDF and CSDF MoCs, that can help to reduce the overhead. Several attempts have been proposed to statically schedule CAL program, such as [20]. An attempt to detect and schedule SSRs is also available in a dataflow toolset named ORCC (see Section 6.2). a method to classify of actors (to determine the MoC associated to each actor), which has been implemented in ORCC. Starting from the classification of actors, proposed by Wipliez and al in [12], the static scheduling consists in first detecting SSRs of the partitioned dataflow program. Then static analysis enable to determine a static schedule associated to each SSR. Finally, actors included in an SSR are merged together according to the associated static schedule inside a single (super-) actor. This work is available in ORCC, where we implement both the SSR detector, the SDF/CSDF analyzer and the actor merger.

Application and

Scheduling and

platform modeling

partitioning

Interface synthesis

co-synthesis

C/C++

cpu0 ram

PCIe

HDL

fpga0

cpu1

Fig. 2. Methodology for hardware/software co-design from dataflow programs. 5.2. Interface synthesis Once partitioning and scheduling have been applied to the input dataflow program, transformations still need to be applied on the dataflow program in order to correctly implement the inter-partition communications. The process mainly consists in transforming the initial dataflow program by inserting additional nodes that represent inter-partition communications interconnections using the platform physical interconnections between PEs. Such transformation introduces special nodes in the dataflow program, which will encapsulate at a later stage the serialization/deserialization of tokens and the inclusion of the corresponding interfaces between partitions. The inter-partition communication stage is illustrated in Fig.2 where black nodes are inserted to represent connections between partitions. 5.2.1. Serialization The serialization stage has the objective of scheduling the communications between actors that are allocated on different partitions. The deserialization is obviously the reverse process. FIFOs, that connect actors on two different partitions, need to be serialized in order to be able to share the same physical interconnection. In Fig. 2, the red partition has two incoming FIFOs from the yellow partitions. Those FIFOs need to be serialized on the yellow partition and deserialized on the red one. 5.2.2. Interface Reconfigurable hardware and multi-core processors can invoke various system communication primitives (e.g. PCIe, Ethernet, etc.) to implement the interaction between components which compose the heterogeneous platform. Interfaces communicate with other components via an input/output system and associated protocols. Interfaces are introduced dur-

5.2.2. Interface

Reconfigurable hardware and multi-core processors can rely on various system communication primitives (e.g. PCIe, Ethernet, etc.) to implement the interaction between the components that compose the heterogeneous platform. Interfaces communicate with other components via an input/output system and associated protocols. Interfaces are introduced during the synthesis stage and must be supported by libraries according to the nature of the components present on the platform.

5.3. Co-synthesis

Co-synthesis consists in generating both software components and hardware descriptions for the different partitions. Both can be generated automatically from the partitioned and scheduled dataflow program using appropriate tools [21, 22]. On the one hand, software synthesis generates implementations for the application partitions mapped onto software PEs. C and C++ are generally the target languages, since they can be ported to most processors (DSP, PowerPC, ARM, etc.). However, the described co-design methodology can be applied to any target language, provided that the synthesis tools from CAL implement the corresponding back-end, as described in more detail in the following sections. On the other hand, hardware synthesis has the objective of automatically generating HDL code ready for RTL synthesis from the application partitions assigned to FPGAs. The target language is naturally a traditional HDL such as VHDL or Verilog.

6. TOOLCHAIN

This section describes the tools supporting the implementation of CAL programs. The inputs are a description of the dataflow program serialized as an XDF network, an XML format for the definition of hierarchical dataflow graphs, and an IP-XACT description that specifies the target architecture. The co-design tool is responsible for the partitioning of the dataflow program as well as for its transformation, to embed the platform characteristics of the architectural model, while the ORCC and OpenForge synthesis tools are used to generate the code for the different software and hardware partitions. The outputs of the toolchain are both software components and hardware descriptions, automatically generated.

6.1. Co-design tool

The co-design tool is responsible for both the partitioning of the dataflow program and its transformation. The partitioning may be done automatically, as presented in [23], or manually by annotating the dataflow program, based on design requirements or designer experience. Once each partition is defined, the network is automatically transformed to represent the inter-partition communications at dataflow level. The result is a hierarchical network where each node represents a partition and each edge represents a communication that encapsulates information from the architectural model about the type of connection (Ethernet, PCI Express, etc.). The co-design tool is written in Java as an Eclipse plug-in built on top of the ORCC and OpenForge tools, which are presented in more detail in the next sections.

6.2. ORCC

The Open RVC-CAL Compiler (ORCC) is a compiler infrastructure dedicated to the CAL language [21]. It is a collection of support tools that enable code synthesis from dataflow programs written in CAL. ORCC is composed of two main components: the front-end and the back-ends. The front-end parses the dataflow program and translates it to an Intermediate Representation (IR) in static single assignment (SSA) form. At this step, transformations may be applied depending on the target language and platform. Finally, the back-ends generate code from the IRs of the actors. Several back-ends exist that target various languages (C, C++, LLVM, etc.). In this paper, two back-ends are used for the co-design:

• C++ back-end: The C++ back-end is used for the automatic code generation of software components. The generated code depends on a portable run-time library that makes it possible to instantiate actors and FIFOs, to schedule actors at run-time, as mentioned in Section 5.1, using a round-robin scheduler (a minimal sketch of such a scheduling loop is given after this list), and to support the inter-partition communications by instantiating serializers, deserializers and I/O interfaces.

• XLIM back-end: In order to generate HDL descriptions, an IR called XLIM was previously defined. A former front-end that parses and translates CAL to XLIM is no longer used; instead, an XLIM back-end was developed for this purpose. Both the ORCC and XLIM IRs essentially contain the same information, simply organized in different ways. The XLIM back-end is an IR-to-IR transformation that translates the ORCC IR into XLIM.
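The scheduling loop referred to in the C++ back-end item above can be pictured as follows. This is a minimal sketch of a round-robin scheduler with an illustrative actor interface, not the actual ORCC run-time API.

#include <vector>

// Illustrative actor interface: for each actor, the generated code exposes
// a scheduling function that fires as many actions as its input tokens and
// firing rules allow, and reports whether any work was done.
class Actor {
public:
    virtual ~Actor() = default;
    virtual bool schedule() = 0;   // returns true if at least one action fired
};

// Round-robin scheduling of the actors mapped onto one software partition:
// keep cycling while at least one actor makes progress; when a full pass
// produces no firing, the partition is waiting on its inter-partition I/O.
void runPartition(std::vector<Actor*>& actors) {
    bool progress = true;
    while (progress) {
        progress = false;
        for (Actor* a : actors)
            progress |= a->schedule();
    }
}

In a real run-time, a pass with no progress typically means the partition is blocked on tokens coming from another partition, so the loop would yield or block on the I/O interfaces rather than return.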

6.3. OpenForge

OpenForge is a behavioral synthesis tool translating the XLIM IR into hardware descriptions expressed in a hardware description language (HDL) [22]. More precisely, OpenForge turns the XLIM IR into a web of circuits built from a set of basic operators (arithmetic, logic, flow control, memory accesses and the like). The synthesis stage can also be given directives driving the unrolling of loops or the insertion of registers to improve the maximal clock rate of the generated circuit. The final result is a Verilog file that can be synthesized into implementations on Xilinx FPGAs. It contains the circuit implementing the actor and exposes asynchronous handshake-style interfaces for each of its ports. These can be connected either back-to-back or through FIFO buffers into complete systems. The FIFO buffers can be synchronous or asynchronous, making it easy to support multi-clock-domain dataflow designs.

7. DESIGN CASE STUDY: MOTION JPEG CODEC

In this design case study we have implemented a baseline-profile motion JPEG codec to illustrate hardware/software co-design using the co-design toolchain on two different platforms made of FPGAs and embedded processors. Figure 3 represents the JPEG dataflow program, where the computation blocks are the DCT, quantization (Q), zigzag scan (ZZ), Huffman encoding (H) and syntax writer (W) for the encoding, as well as the bitstream parser (P) and the reverse blocks for the decoding. Concerning the architecture, the host used in both platforms is a general-purpose computer that includes an Intel i7-870 processor with 4 cores at 2.93 GHz. The first platform is an ML509 board with a Virtex-5 FPGA that communicates with the host using PCI Express, while the second platform is based on the Freescale P2020 board, where Ethernet is used to communicate between the host and the specialized PE. It should be noticed that a large number of configurations has been tested successfully, which means that actors are fully interchangeable: a single description of an actor can be mapped to general-purpose processors, embedded processors or FPGAs, and a partition can be swapped from software to hardware and vice versa, leading to functionally equivalent implementations. Those partitioning configurations were defined randomly and are not all intended to be efficient implementations of the JPEG application, but rather to demonstrate that, in principle, any partitioning configuration can be automatically ported onto the architecture. However, for the sake of clarity, experimental results are given for a meaningful partitioning, separating the encoding and the decoding processes. More precisely, the partitioning of the application consists in assigning the whole encoding process to the specialized PE, while the decoding process is done by the host.

Fig. 3. Top-level view of the motion JPEG application (encoding chain: S, DCT, Q, ZZ, H, W; decoding chain: P, H⁻¹, ZZ⁻¹, Q⁻¹, DCT⁻¹, D).

Figure 4(a) illustrates the partitioning on the first platform, where the whole encoding process is assigned to the FPGA, while the bitstream parser and the Huffman decoding are assigned to one core of the host and, finally, dequantization, inverse scan and inverse DCT are assigned to the second core. Figure 4(b) illustrates the partitioning on the second platform: the encoding is distributed between the two cores of the P2020 (orange and green in the figure). The results of the experiment are summarized in Table 1.

Fig. 4. Partitions of the motion JPEG program: (a) first partitioning, (b) second partitioning.

Table 1. Framerate of the motion JPEG codec on the two platforms, with two interfaces, at a 512x512 video resolution.

platform                  Ethernet    PCIe
ML509 with Virtex-5       -           8.5
P2020 with PowerPC        3.8         -

8. CONCLUSIONS AND FUTURE WORK

The abstraction level of dataflow programming provides a unified and portable approach to hardware/software co-design. It enables efficient rapid prototyping of applications onto architectures, as well as validation and testing of the performance of different partitioning and scheduling configurations by means of automatic synthesis tools. The strong encapsulation of actors also eases the refactoring of applications to a lower level of granularity (i.e. splitting actors into networks of actors to increase the parallelism and the number of possible partitions). The paper has highlighted the portability of applications by validating several configurations onto different components of the platform architecture. Dataflow programming raises the level of abstraction of all stages of the design flow, enabling effective design exploration consisting in mapping different partitionings of the application onto the platform components. The automatic synthesis of the software and hardware components, as well as of their interconnections, from the same dataflow program is the feature that makes it possible to test and validate the performance of a large number of design options without any manual low-level SW and HDL code rewriting, as required by traditional approaches. Future extensions of the co-design framework include GPU programming support, by means of back-ends dedicated to the generation of CUDA and OpenCL code, and the support of most FPGA families. Another direction of research is the development of metrics at dataflow level that can drive the choice of a limited number of efficient partitioning configurations.

9. REFERENCES

[1] G. De Micheli, “Hardware synthesis from C/C++ models,” in Design, Automation and Test in Europe Conference and Exhibition 1999. Proceedings, 1999, pp. 382 –383. [2] A. Kalavade and E.A. Lee, “A hardware-software codesign methodology for dsp applications,” Design Test of Computers, IEEE, vol. 10, no. 3, pp. 16 –28, sep 1993. [3] Douglas Densmore, Roberto Passerone, and Alberto Sangiovanni-Vincentelli, “A platform-based taxonomy for esl design,” IEEE Des. Test, vol. 23, pp. 359–374, September 2006. [4] Felice Balarin, Massimiliano Chiodo, Paolo Giusto, Harry Hsieh, Attila Jurecska, Luciano Lavagno, Claudio Passerone, Alberto Sangiovanni-Vincentelli, Ellen Sentovich, Kei Suzuki, and Bassam Tabbara, Hardwaresoftware co-design of embedded systems: the POLIS approach, Kluwer Academic Publishers, Norwell, MA, USA, 1997. [5] Andy D. Pimentel, “The Artemis workbench for system-level performance evaluation of embedded systems,” IJES, vol. 3, no. 3, pp. 181–196, 2008. [6] Soonhoi Ha, Sungchan Kim, Choonseung Lee, Youngmin Yi, Seongnam Kwon, and Young-Pyo Joo, “Peace: A hardware-software codesign environment for multimedia embedded systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 12, pp. 24:1–24:25, May 2008.

[7] Jack B. Dennis, “First version of a data flow procedure language,” in Symposium on Programming, 1974, pp. 362–376. [8] Gilles Kahn, “The Semantics of Simple Language for Parallel Programming,” in IFIP Congress, 1974, pp. 471–475. [9] J. Eker and J. Janneck, “CAL Language Report,” Tech. Rep. ERL Technical Memo UCB/ERL M03/48, University of California at Berkeley, Dec. 2003. [10] Edward A. Lee and Thomas M. Parks, “Dataflow Process Networks,” Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, May 1995. [11] J. Eker and J. W. Janneck, “A structured description of dataflow actors and its applications,” Tech. Rep. UCB/ERL M03/13, EECS Department, University of California, Berkeley, 2003. [12] M. Wipliez and M. Raulet, “Classification and transformation of dynamic dataflow programs,” in Design and Architectures for Signal and Image Processing (DASIP), 2010 Conference on, 2010, pp. 303 –310. [13] Jörn Janneck, Ian Miller, David Parlour, Ghislain Roquier, Matthieu Wipliez, and Mickaël Raulet, “Synthesizing Hardware from Dataflow Programs: An MPEG-4 Simple Profile Decoder Case Study,” Journal of Signal Processing Systems, vol. 63, no. 2, pp. 241– 249, 2009, 10.1007/s11265-009-0397-5. [14] Matthieu Wipliez, Ghislain Roquier, and Jean-François Nezan, “Software Code Generation for the RVC-CAL Language,” Journal of Signal Processing Systems, vol. 63, no. 2, pp. 203–213, 2009, 10.1007/s11265-0090390-z. [15] I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F. Nezan, and O. Deforges, “Reconfigurable video coding on multicore,” Signal Processing Magazine, IEEE, vol. 26, no. 6, pp. 113 –123, november 2009. [16] ISO/IEC 23001-4:2009, “Information technology MPEG systems technologies - Part 4: Codec configuration representation,” 2009. [17] M. Pelcat, J.F. Nezan, J. Piat, J. Croizer, and S. Aridhi, “A System-Level Architecture Model for Rapid Prototyping of Heterogeneous Multicore Embedded Systems,” in Design and Architectures for Signal and Image Processing (DASIP), 2009 Conference on, 2009. [18] “Dataflow Tools,” http://dftools.sourceforge.net/.

[19] E.A. Lee and S. Ha, “Scheduling strategies for multiprocessor real-time DSP,” in Global Telecommunications Conference, GLOBECOM ’89, IEEE, 1989, pp. 1279–1283, vol. 2.

[20] Ruirui Gu, Jorn W. Janneck, Mickael Raulet, and Shuvra S. Bhattacharyya, “Exploiting statically schedulable regions in dataflow programs,” in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’09, Washington, DC, USA, 2009, pp. 565–568, IEEE Computer Society.

[21] “Open RVC-CAL Compiler,” http://orcc.sourceforge.net/.

[22] “OpenForge,” https://openforge.sourceforge.net.

[23] Christophe Lucarz, Ghislain Roquier, and Marco Mattavelli, “High level design space exploration of RVC codec specifications for multi-core heterogeneous platforms,” in Conference on Design and Architectures for Signal and Image Processing, DASIP, 2010.

THE MULTI-DATAFLOW COMPOSER TOOL: A RUNTIME RECONFIGURABLE HDL PLATFORM COMPOSER

Francesca Palumbo, Nicola Carta and Luigi Raffo
DIEE - Dept. of Electrical and Electronic Engineering, University of Cagliari, P.zza D'Armi, 09123 Cagliari, Italy
[email protected]

ABSTRACT

The Dataflow Model of Computation (D-MoC) is particularly suitable to close the gap between hardware architects and software developers. Leveraging on the combination of the D-MoC with a coarse-grained reconfigurable approach to hardware design, we propose a tool, the Multi-Dataflow Composer (MDC) tool, able to improve the time-to-market of modern complex multi-purpose systems by allowing the derivation of runtime reconfigurable HDL platforms starting from the D-MoC models of the targeted set of applications. The MDC tool has proven to provide considerable on-chip area savings: 82% of saving has been reached by combining different applications of the image processing domain, adopting a 90 nm CMOS technology. In the future, the MDC tool will also be extremely useful, with a very small integration effort, for creating multi-standard codec platforms for MPEG RVC applications.

Index Terms— Dataflow model of computation, coarse-grained reconfigurability, automatic code generation, MDCC tool, RVC-CAL, MPEG RVC.

1. INTRODUCTION

In the embedded systems world, the worst-case design approach is in several cases infeasible, mostly because of the over-provisioning of resources that such an approach requires. Moreover, it also implies coping with several uncertainties related to the worst-case resource utilization, which is hard to predict in advance and subject to several external conditions, mainly related to the needs of the targeted use case. As a counterpart, market trends push designers toward the conception of systems integrating an ever increasing number of features, in order to produce efficient multi-purpose devices. Theoretically, that is not a problem. In fact, due to

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 248424, MADNESS Project, and by the Region of Sardinia, Young Researchers Grant, PO Sardegna FSE 2007-2013, L.R. 7/2007 “Promotion of the scientific research and technological innovation in Sardinia”.

technology scaling, as envisioned by Moore's law, the amount of free space available on a die is not, with respect to the past, currently a critical issue. Nevertheless, sooner or later a physical stop will be encountered. Specialization, besides integration, is also required by modern market trends, to achieve the acceleration of specific tasks by means of dedicated hardware blocks, with a particular focus on speeding up the execution of critical computational kernels. Therefore, over the years, complex and highly specialized platforms have caused the onset of a gap between software development and hardware implementation. From the hardware point of view, a promising approach to the solution of all these issues is represented by reconfigurable hardware: using a reconfigurable paradigm allows having a specialized computing platform, capable of changing its own configuration to optimally serve the targeted computation. Unfortunately, even though this paradigm fits the integration and specialization needs, it still does not suffice on its own to close the described gap between hardware and software. In this paper we are going to show how it is possible to couple such a reconfigurable paradigm with the Dataflow Model of Computation (D-MoC) to address the tricky problem of efficiently mapping complex software applications onto highly specialized hardware platforms. We are going to demonstrate that the composability property offered by the D-MoC is extremely suitable to be combined with the reconfigurable approach to create a complete back-end hardware toolchain able to generate highly optimized multi-purpose devices. Such a toolchain will allow developers to be competitive on the market, reducing time-to-market. The proposed Multi-Dataflow Composer (MDC) tool has been conceived to automatically create a reconfigurable platform able to provide an adaptive exploitation of all the available computational resources. Such a platform will be capable of providing runtime reconfiguration of the system in response to variations of external conditions or in order to serve different required functionalities. The adopted baseline architectural template leverages on the consideration that reconfiguring some parts of the system at runtime can improve its overall

potential together with its flexibility, since many more applications can fit on the same physical structure and be optimally served. The rest of this paper is organized as follows. Sect. 2 provides an overview of the reference scenario. Sect. 3 describes the MDC tool, while Sect. 4 is meant to assess the benefits of adopting it. Sect. 5 describes the possible exploitations of the tool in the MPEG RVC domain, prior to the final remarks provided in Sect. 6.

2. REFERENCE ENVIRONMENT

Besides the advantages already stated in Sect. 1, opting for a reconfigurable approach allows balancing the pros and cons of ASICs and software implementations. In fact, reconfigurable systems are able to achieve potentially much higher performance than general-purpose systems, while preserving a higher level of flexibility with respect to ASICs [1]. Systems based on a reconfigurable approach are often called adaptive, in the sense that their logic functionality and interconnect can be customized to suit a specific application by programming them at the hardware level [2]. Dealing with reconfiguration, two major levels of configurability are available:

• Fine-grained - complete or runtime partial reconfiguration of the substrate (e.g. FPGA platforms).

• Coarse-grained - reconfiguration of the interconnections among the involved processing elements.

Fine-grained architectures are able to go beyond the integration capabilities of the target device; therefore, they are more flexible than coarse-grained ones. The counterbalance is that their configuration overhead is huge, limiting their reconfiguration speed and thus their application field. Moreover, opting for a fine-grained approach, besides the shutdown state of operation necessary to change the context, it is also necessary to afford a dedicated storage space to memorize the configuration bit-streams. On the contrary, coarse-grained reconfiguration is still affected by the physical space limitation of the target, but it is able to provide faster switching between different hardware implementations, requiring neither any context change nor the specification and storage of fine-grained FPGA bit-streams. Reconfigurable computing is traditionally seen as a way to optimize the performance of a class of algorithms, enabling ASIC-like performance within a reconfigurable substrate. It has been demonstrated that an algorithm can be partitioned into a sequence of computational kernels and that its execution can be improved by building a reconfigurable platform able to switch among them during the algorithm execution [3]. A similar approach has been explored for DSP applications, where coarse-grained reconfiguration is applied to create multi-mode platforms [4]. In both cases, the most critical aspect turned out to be the toolchain necessary to create the bit-stream for platform configuration purposes.

Here again emerges the gap between software development and hardware design highlighted in Sect. 1, and here lies the most suitable context of application for a tool such as the one we propose in this paper.

3. THE MULTI-DATAFLOW COMPOSER TOOL

In this section we present the Multi-Dataflow Composer tool which, as already mentioned in Sect. 1, is intended to easily create multi-purpose systems by directly mapping different applications onto a reconfigurable hardware substrate. The MDC tool is an automatic platform constructor able to compose and interconnect all the computational elements necessary to execute a given set of kernels in a parametric reconfigurable top module. The baseline architecture the MDC tool leverages on is based on a coarse-grained reconfigurable hardware model providing runtime reconfiguration capabilities to the physical substrate. The automatic creation of the multi-kernel heterogeneous datapath is performed by the MDC tool starting from the high-level dataflow descriptions of the involved kernels. Fig. 1 offers a generic overview of the proposed tool, highlighting:

• the main building blocks composing it, described in Sect. 3.3 and Sect. 3.4;

• the required inputs, which are the dataflow descriptions of the kernels and the complete HDL library of Functional Units (FUs), comprising computational and interconnection elements;

• the provided output, which is the automatically created HDL description of the assembled multi-kernel datapath, compliant with the chosen baseline reconfigurable coarse-grained model (detailed in Sect. 3.2).

Starting from the assumption of modelling the computational kernels with a pre-defined set of FUs, the MDC tool assembles a substrate able to switch at runtime among the integrated kernels by simply reconfiguring the interconnections among the involved HDL FUs. The tool provides a way to define the configuration of the coarse-grained reconfigurable multi-purpose structure by exploiting heterogeneous blocks, the FUs defined in the high-level dataflow model, with homogeneous interfaces, derived from the specific use case under consideration.

3.1. The Background of the MDC Tool

The D-MoC is particularly suitable to create a common development environment for both hardware architects and software developers. In this sense, a synthesizable high-level dataflow framework based on the CAL language [5] has already been developed at the state of the art. Such a framework is capable of specifying, modelling and programming both software [6] and hardware [7, 8] components.


Fig. 1. Multi-Dataflow Composer Tool Overview (inputs: the .nl dataflow descriptions of the kernels and the Verilog/VHDL library of actors and Sbox IPs; the MDCC-compliant front-end produces a C++ multi-dataflow directed graph, from which the Platform Composer back-end assembles the Verilog/VHDL top module of the multi-flow reconfigurable system and its reconfigurability manager).

This framework is based on the RVC-CAL language [9], which is a dataflow-oriented actor language built around the actor entity. An actor is an abstract representation of a computing element that can be transformed either into a software agent or into a physical hardware one. All the actors concur asynchronously to the computation in order to generate output data sequences from the input ones. Both inputs and outputs are provided in the form of tokens; specific token configurations or sequences can fire predetermined actions inside the actors, leading to state variations or to the emission of specific output tokens. The RVC-CAL framework has recently entered the MPEG standard (MPEG ISO/IEC 23001-4 [10]) to provide the specification of video codecs in the form of dataflow programs. These specifications are created by assembling networks of RVC-CAL FUs belonging to a standard Video Tool Library (VTL). The resulting executable programs of the assembled codecs are completely abstract representations, which can be translated into proprietary implementations by replacing the FUs with custom hardware or software implementations. Such a composability property is the main reason that led us to opt for the D-MoC as a starting point to build up our MDC tool. In practice, the inputs of the MDC tool are RVC-CAL compliant networks of actors (in the form of .nl descriptions), so as to exploit the modularity and composability properties offered by the dataflow model in combination with the chosen reconfigurable design paradigm.
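The firing behaviour just described can be pictured with a toy software model. This is plain C++ written for illustration, not CAL syntax and not code produced by any of the tools discussed here.

#include <deque>

// Toy model of the actor behaviour described above: a single-action actor
// that fires when one token is available on its input, updates its state,
// and emits one output token. Real RVC-CAL actors are written in CAL and
// may have several actions selected by token patterns and an FSM.
class AccumulateActor {
public:
    bool fire(std::deque<int>& in, std::deque<int>& out) {
        if (in.empty())
            return false;          // firing rule not satisfied: nothing consumed
        sum_ += in.front();        // state variation triggered by the input token
        in.pop_front();
        out.push_back(sum_);       // output token emission
        return true;
    }
private:
    int sum_ = 0;                  // encapsulated actor state
};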

3.2. The Baseline Reference Platform

The baseline reference architecture that has been chosen leverages, as already said, on a coarse-grained reconfigurable template. The final heterogeneous structure embeds the minimal set of HDL FUs necessary to correctly accomplish the computation of all the involved kernels. Those FUs are connected to each other according to the input dataflows. The configurable HDL top module supports high-speed runtime reconfiguration. Such a reconfiguration is ensured by providing the system with the possibility of changing, at runtime, the topology of the connections among the set of hardware FUs included in it.

Physically, that reconfigurability is carried out, according to the functionality actually required of the datapath, by means of Switching Box (Sbox) modules. The Sbox units, placed at the crossroads between the different dataflows, provide high-speed runtime coarse-grained reconfiguration while preserving, at the same time, the correctness of each single involved kernel. Such an interconnection structure resembles state-of-the-art generalized mesh topologies, where the routing lines run between the processing elements and are connected by means of switch boxes, and each processing element is interfaced to this network by means of a connect box [1]. On the right-hand side of Fig. 1, the output top module produced by the MDC tool, starting from two single dataflow descriptions (shown on the left-hand side of Fig. 1), is provided. Two different models of Sbox unit, an example of which is depicted in Fig. 2, are necessary to assemble a generic multi-kernel datapath. They are required to divide and/or merge the different dataflows and act, respectively, as a programmable de-multiplexer (Sbox 1x2) or multiplexer (Sbox 2x1) unit. These units are programmable in terms of the width of the data to be carried and, in the example provided in Fig. 2, their interfaces are compliant with the RVC-CAL standard. With respect to the interface of the Sbox units, it has to be noticed that it depends on that of the actors to be integrated in the multi-datapath system. In practice, the library of HDL FUs available to the MDC tool will contain as many Sbox unit couples as there are possible actor interfaces.
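The behaviour of the two Sbox flavours can be summarized with the following software model. It is illustrative only: the actual units are HDL blocks that switch the whole handshake bundle of a port, not token queues.

#include <cstdint>
#include <deque>

using Stream = std::deque<int32_t>;

// Behavioural model of the programmable de-multiplexer (Sbox 1x2):
// depending on the select bit, incoming tokens are forwarded to the
// branch of kernel 0 or to the branch of kernel 1.
void sbox_1x2(Stream& in, Stream& out0, Stream& out1, bool sel) {
    while (!in.empty()) {
        (sel ? out1 : out0).push_back(in.front());
        in.pop_front();
    }
}

// Behavioural model of the programmable multiplexer (Sbox 2x1):
// only the selected input branch is allowed to feed the shared output.
void sbox_2x1(Stream& in0, Stream& in1, Stream& out, bool sel) {
    Stream& src = sel ? in1 : in0;
    while (!src.empty()) {
        out.push_back(src.front());
        src.pop_front();
    }
}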

3.3. The Multi-Decoder CAL Composer Compliant Tool

The front-end of the MDC tool has been realized by implementing a tool that follows the approach presented in [11], where the Multi-Decoder CAL Composer (MDCC) tool was presented. The MDCC was originally conceived to provide a unique high-level description of an RVC-CAL multi-decoder system. In this context, we have used the same approach to create the intermediate representation necessary to physically assemble the runtime multi-kernel reconfigurable platform.

Fig. 2. De-multiplexer (Sbox 1x2) and Multiplexer (Sbox 2x1) Sbox Unit Models (the Sbox units switch the whole RVC-CAL handshake bundle of each port: SEND, DATA[N:0], COUNT[15:0], ACK and RDY signals, under a sel input).

As in [11], this MDCC-compliant tool merges multiple dataflow descriptions into a unique dataflow, providing it both in the form of an .nl description and in the form of a C++ directed graph. In practice, to define the high-level D-MoC of the multi-purpose system, this MDCC-compliant tool:

• analyzes the input dataflows, applying the proper transformations to create a C++ directed graph of atomic actors only (meaning no sub-networks are allowed);

• compares the C++ directed graphs of the flattened dataflows to compose the C++ graph of the multi-flow system;

• inserts in the C++ directed graph of the multi-flow system the switching nodes, corresponding in hardware to the Sbox units, in order to merge (when two dataflows share an actor) or separate (when the children of a common actor are not the same) the considered flows (a condensed sketch of this step is given right after this list).
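The sketch below condenses the last step of the list. It is a simplification written for illustration (single-output actors, divergence case only), not the MDCC implementation, and all identifiers are made up.

#include <map>
#include <set>
#include <string>

// A dataflow is modelled here as a successor map: actor -> set of actors
// fed by its output (single-output actors, for brevity).
using Dataflow = std::map<std::string, std::set<std::string>>;

// Condensed sketch of the merge step: the two flows are folded into one
// graph; where an actor appears in both flows but drives different
// successors, a switching node (Sbox 1x2) is inserted so that the proper
// branch can be selected at run time. The symmetric case (two flows
// reaching a shared actor from different predecessors) would insert an
// Sbox 2x1 node and is omitted for brevity.
Dataflow mergeFlows(const Dataflow& k1, const Dataflow& k2) {
    Dataflow merged = k1;
    int nextSbox = 0;
    for (const auto& [actor, succ2] : k2) {
        auto it = merged.find(actor);
        if (it == merged.end()) {
            merged[actor] = succ2;              // actor only used by kernel 2
        } else if (it->second != succ2) {
            // Shared actor with diverging children: route its output
            // through a new Sbox node, one branch per kernel.
            std::string sbox = "Sbox" + std::to_string(++nextSbox);
            std::set<std::string> branches = it->second;
            branches.insert(succ2.begin(), succ2.end());
            merged[sbox] = branches;
            it->second = {sbox};
        }
    }
    return merged;
}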


With respect to the original MDCC tool, this implementation is also in charge of storing all the information necessary to allow the runtime switch among the kernels. This is done by constructing the tables that will fill the Look-Up Tables (LUTs) responsible for the physical programmability of the coarse-grained reference architecture. A step-by-step example of application of this MDCC-compliant tool is provided in Fig. 3.

3.4. The Platform Composer Tool

The Platform Composer (PC) tool represents the back-end of the MDC tool.

Fig. 3. MDCC Compliant Tool: Example of Application (kernels 1 to 4 are merged iteratively; at each iteration a table records, for every newly instantiated Sbox, which select value serves which kernel).

Fig. 4. PC Tool: Example of Application.

This tool translates the C++ directed graph, output of the front-end tool and representative of the multi-kernel dataflow, into an HDL reconfigurable datapath according to the baseline reference architecture. The PC tool is also responsible for programming the reconfigurability manager (see Fig. 1). The latter is the module in charge of handling the multi-kernel programmability of the system; the PC tool has to infer in it the proper LUTs, to allow the correct manipulation of the incoming data. The reconfigurability manager hosts as many LUTs as there are kernels in the multi-purpose reconfigurable system. The reason is that in each iteration of the MDCC-compliant tool a table, specifying the configuration of all the newly instantiated Sbox units, is created. The PC tool takes those tables to create the LUTs that will compose the reconfigurability manager.
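The role of those LUTs can be mimicked in software as follows. This is an illustrative mock-up, not the generated hardware, and the class and method names are invented for the example.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Software mock-up of the reconfigurability manager: given the identifier
// of the kernel to execute, it returns the select values to drive on the
// Sbox units of the datapath.
class ReconfigurabilityManager {
public:
    // Record that, in order to run 'kernel', Sbox number 'sbox' must be
    // driven with select value 'sel'.
    void program(int kernel, int sbox, uint8_t sel) {
        std::vector<uint8_t>& table = tables_[kernel];
        if (table.size() <= static_cast<size_t>(sbox))
            table.resize(sbox + 1, 0);   // Sboxes outside this kernel's path default to 0
        table[sbox] = sel;
    }
    // Select values to apply on all Sbox units when switching to 'kernel'.
    const std::vector<uint8_t>& configure(int kernel) const {
        return tables_.at(kernel);       // throws if the kernel was never programmed
    }
private:
    std::unordered_map<int, std::vector<uint8_t>> tables_;
};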

The outcome of these procedures is depicted in Fig. 4 where, considering the outputs produced by the MDCC-compliant tool in the example of Fig. 3, the PC tool creates the multi-kernel reconfigurable datapath. Notice that, in this example, only LUT 3 is specified for all the involved kernels. The reason is that the tables the PC tool receives as inputs are created dynamically while the MDCC-compliant tool runs; therefore, at each iteration that tool can specify table entries only for the kernels already received as directed graphs, not for those not received yet. As a consequence, the LUTs created by the PC tool, except the one corresponding to the last iteration of the front-end tool, cover only a subset of the kernels.

4. PERFORMANCE EVALUATION

This section is intended to highlight the benefits that the MDC tool can provide. The results discussed hereafter mainly regard the efficiency, in terms of on-chip area occupancy, that can be achieved using the presented coarse-grained runtime reconfigurable architectural template. To demonstrate those benefits, two different applications in the field of image processing have been adopted: Anti-Aliasing and Zoom.

4.1. Spatial Anti-Aliasing

In digital signal processing, Spatial Anti-Aliasing is a technique meant to minimize the distortion artefacts, known as aliasing, that appear when representing a high-resolution image at a lower resolution. Such a distortion is due to the undersampling applied to the initial image with respect to the Nyquist spatial frequency. The aliasing effect appears mostly on oblique lines, on areas with a high gradient value, and at the edges. Moreover, to capture an image, the acquisition system relies on an array of sensors, characterized by a Colour Filter Array (CFA) that assigns a single colour among R, G and B to each spatial location. To derive the full-colour image after the CFA sampling process, the two missing colours of each pixel are estimated through interpolation, which introduces an aliasing effect on the edges. In this case, the aliasing effects can be cancelled by applying a median filter to the chrominance planes of the original image, eliminating the presence of false colours.

4.2. Zoom

The Zoom application simply determines the relative scaling, by a zooming factor S, of the size of the original image. The adopted algorithm interlaces the image pixel matrix with rows and columns of zeros, depending on S. The value of the pixels initially set to zero is then replaced by applying the appropriate interpolation method. The algorithm divides the image into overlapping P x P blocks, centred on the current reference pixel of the i-th iteration, and classifies each block as smooth, edge or texture. The zooming is adaptive because it adapts the interpolation method to the image content.
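To make the filtering step of Sect. 4.1 concrete, the following scalar sketch shows a 3x3 median applied to one chrominance plane. It is a plain software rendition written for illustration, not the dataflow implementation evaluated in this section.

#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// 3x3 median filter applied to one chrominance plane (e.g. Cb or Cr) to
// suppress the false colours introduced by CFA interpolation.
// Border pixels are simply copied.
std::vector<uint8_t> median3x3(const std::vector<uint8_t>& plane, int w, int h) {
    std::vector<uint8_t> out(plane);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            std::array<uint8_t, 9> window;
            int k = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    window[k++] = plane[(y + dy) * w + (x + dx)];
            std::nth_element(window.begin(), window.begin() + 4, window.end());
            out[y * w + x] = window[4];   // median of the 3x3 neighbourhood
        }
    }
    return out;
}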

or texture. The zooming is adaptive because it adapts the interpolation method to the image content. 4.3. Synthesis Results The MDC tool, as already specified in Sect.3, receives as input the dataflow descriptions of the applications to be integrated in the final hardware, in the form of single .nl description of the kernels. With regards to the previously described applications, we succeeded in identifying several kernels for each application, as listed in Tab.1, and in describing them in the required format. Table 1. Summary of the Kernels per Application. Qsort Min Max Corr Abs RGB2YCC YCC2RGB Sbwlabel Median Cubic Cubic Conv Check GeneralBilevel

Zoom X X X X X X

Anti-Aliasing X X

X X X X X

The results that we are going to discuss in this section are related to three different applications of the MDC tool, namely:

• generation of the multi-kernel platform executing the Spatial Anti-Aliasing application [UC1];

• generation of the multi-kernel platform executing the Zoom application [UC2];

• generation of the multi-kernel platform executing both the above-mentioned applications [UC3].

In each case, the output of the MDC tool is the RTL Verilog implementation of a coarse-grained reconfigurable datapath. First, the granularity level reached by the proposed tool has to be highlighted: the MDC tool provides coarse-grained reconfigurability at the actor level, rather than at the kernel level. Therefore, even though UC1 and UC2 seem to have a reduced overlap of kernels, since they only share the Min Max and the Abs kernels (as can be deduced from Tab. 1), the fact that the MDC tool operates at a finer granularity level allows it to save much more area than expected. Different kernels can share several actors, as demonstrated in Fig. 5, where the YCC2RGB kernel (Fig. 5(a)) and the RGB2YCC kernel (Fig. 5(b)) of UC2 are depicted. These two kernels could share all the white-filled actors in the figure, namely the RAM, one adder, three shifter units, the maxColor actor and one buffer.

Fig. 5. Examples of Dataflows: (a) the YCC2RGB kernel, (b) the RGB2YCC kernel.

Table 2. Summary of the Actors per Kernel.
Actor              UC1[No MDC]  UC1[MDC]  UC2[No MDC]  UC2[MDC]  UC3[No MDC]  UC3[MDC]
RAM                     5           1          3           1          8           1
adder                   6           4          7           5         13           6
control                 4           3          2           2          6           4
max min                 2           1          2           1          3           1
subtractor             12           6          4           2         16           6
shifter                 3           2          7           4         10           6
buffer out              2           2          0           0          2           2
dec mm                  1           1          0           0          1           1
cubic                   1           1          0           0          1           1
mul                     3           3         14          14         17          17
abs                     9           6          3           2         11           6
countCorr               0           0          1           1          1           1
buffer par T            0           0          1           1          1           1
buffer C2 12            0           0          1           1          1           1
buffer C1 12            0           0          1           1          1           1
check gb                1           1          0           0          1           1
sorter                  0           0          1           1          1           1
maxColor                0           0          2           1          2           1
buffer                  0           0          2           1          2           2
HsumVsum                1           1          0           0          1           1
Kernel Actor Sum       50          32         51          38         99          61
LUT                     0           6          0           5          0           9
cnv                     0           1          0           1          0           1
Sbox                    0          68          0          47          0         137
MDC Actor Sum          56         107         51          91        108         208

Note that no multipliers are shared, since they are not identical in the above-mentioned kernels and, in the provided FU library, the mul actor is not parametric.

Consider also that the RAM actor is inferred by all the following kernels: Corr, RGB2YCC, YCC2RGB, Sbwlabel, Median, Cubic, Cubic Conv and Check GeneralBilevel; looking at Tab. 1, the first three of them belong to UC2, while the others are adopted in UC1. These examples show how the actor-level granularity allows several FUs to be shared in the final hardware platform, even when they belong to non-shared kernels. Obviously, such a saving cannot be guaranteed in general, being highly dependent on the type of chosen applications. Nevertheless, it can be achieved when creating a multi-purpose platform targeted at a specific domain of interest: in image processing, for instance, different applications use a common set of basic operations, corresponding to the same actors in their dataflow descriptions. Tab. 2 shows the number of actors, per type, adopted by the considered use cases, with (see columns UC1[MDC], UC2[MDC] and UC3[MDC]) or without (see columns UC1[No MDC], UC2[No MDC] and UC3[No MDC]) the MDC tool being used to create the Verilog coarse-grained reconfigurable datapath. The number of instances of each actor for the reconfigurable architecture is always less than or equal to the one obtained without the MDC tool, for all three coarse-grained reconfigurable platforms. The saving the MDC guarantees (see row Kernel Actor Sum in Tab. 2), merely in terms of number of actors, corresponds on average to 33%, neglecting the overhead required by the reconfigurability. If that overhead is considered (see row MDC Actor Sum in Tab. 2), the number of required Sbox units is large enough to apparently reverse such a saving. Fortunately, that is not so in terms of physical on-chip area occupation.

The basic assumption of the proposed approach is to implement runtime reconfigurability in a very simple and low-overhead manner. Therefore the Sbox units, responsible for that feature, have a hardware complexity comparable to that of a simple multiplexer. This means that, even if their count is high enough to considerably increase the number of instances in the coarse-grained reconfigurable datapath (as shown in row MDC Actor Sum of Tab. 2), their area is small enough to still allow a physical on-chip area saving. The area saving has been assessed through ASIC synthesis using SoC Encounter, a dedicated synthesis tool in the commercial release of Cadence, and a 90 nm CMOS technology. The synthesis trials were carried out for every coarse-grained reconfigurable datapath, with and without the MDC tool. Tab. 3 provides an overview of the achieved area results, together with the evaluated area saving percentages. As can be noticed, adopting the MDC tool it is always possible to save a considerable amount of on-chip area, nearly 67% in the worst-case scenario (UC2). The best achieved result corresponds to the 82% area saving of UC3, when the Spatial Anti-Aliasing and the Zoom applications are implemented on the same reconfigurable multi-purpose platform. This desirable result is due to the fact that the overlap of actors from different kernels normally increases with the number of integrated kernels, when dealing with applications belonging to the same reference domain. In this particular case, despite the high overhead in global FU instances, 208 instead of 108 (as shown in row MDC Actor Sum of Tab. 2), the overlap of actors among the kernels of UC1 and those of UC2 led to saving 82% of on-chip area, having in the coarse-grained reconfigurable datapath just 61 inferred computing FUs instead of 99 (as shown in row Kernel Actor Sum of Tab. 2).

Table 3. On-Chip Area Saving.
                        Area [µm²]    Saving %
Qsort                        32243
Min Max                       3946
Corr                        724933
Abs                           2052
RGB2YCC                     722822
YCC2RGB                     712698
Sbwlabel                    747089
Median                      687082
Cubic                       684997
Cubic Conv                  683681
Check GeneralBilevel        691273
UC1 [No MDC]               3480468
UC1 [MDC]                   860816       75.27
UC2 [No MDC]               2198694
UC2 [MDC]                   809705       67.17
UC3 [No MDC]               5673164
UC3 [MDC]                  1018635       82.04

5. EXPLOITATION OF THE PROPOSED MDC TOOL

The MPEG RVC standard is intrinsically based on the possibility of creating a library of FUs, implementing the CAL actors, able to describe all the MPEG standards through their composition. Adopting the MDC tool in this environment will allow the direct composition of multi-standard MPEG decoders, starting from their D-MoC descriptions.


Fig. 6. MDC Tool with MPEG RVC-CAL Extension (a CAL2HDL/ORCC code generator produces the Verilog/VHDL actors and Sbox IPs fed to the MDC flow).

In principle, as depicted in Fig. 6, it will be possible to adapt the MDC tool so that it integrates with the other standard tools of the RVC-CAL framework. This transition will be straightforward since:

• the front-end of the MDC tool has already been conceived to be compliant with the MDCC tool, designed and tested for the RVC environment [11]; therefore, substituting it with the MDCC itself will involve very little effort;

• the library of HDL FUs already provides a couple of RVC-CAL compliant Sbox units (as shown in Fig. 2).

The only real modification required will be to slightly extend the MDCC in order to also support the .xdf format besides the .nl one, since in the MPEG RVC standard any dataflow is provided as an .xdf network description. Given these considerations, to extend our proposed tool to RVC applications it will suffice to integrate the MDC tool with an automatic dataflow-to-hardware code generator [12, 8], which will automatically create the Verilog/VHDL actors composing the library of HDL FUs. Some steps in this direction have already been taken, by starting the integration of the MDC tool with the Orcc-VHDL back-end [13], developed at the IETR lab of INSA (Rennes, France).


6. CONCLUSIONS


In this paper we have presented the Multi-Dataflow Composer tool, which is intended to close the gap between the implementation of complex multi-purpose heterogeneous hardware platforms and their software programming. In fact this tool, leveraging on the Dataflow Model of Computation, is able to


automatically build the coarse-grained reconfigurable HDL platform that corresponds to a specific set of applications, starting from their high-level dataflow descriptions. The benefits this tool brings are two-fold. First of all, it makes it possible to automatically derive a complex hardware platform with very little intervention from the users, who simply need to provide the correct dataflow descriptions of the applications as input. Besides, from the hardware point of view, the proposed multi-kernel implementation allows runtime reconfigurability, without any need for hardware shut-down or suspension for context-switch purposes, along with a concrete on-chip area saving. In particular, in this paper, referring to the image processing domain, it was possible to reach 82% of area saving by combining two different applications, Zoom and Spatial Anti-Aliasing, in a unique coarse-grained datapath. In addition, we want to stress that the presented tool could prove extremely useful in the MPEG RVC domain to create multi-standard codec platforms, with an extremely low integration overhead.

7. REFERENCES

[1] Katherine Compton and Scott Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Comput. Surv., vol. 34, pp. 171–210, June 2002.

[2] Russell Tessier and Wayne Burleson, “Reconfigurable computing for digital signal processing: A survey,” J. VLSI Signal Process. Syst., vol. 28, pp. 7–27, May 2001.

[3] Salvatore M. Carta, Danilo Pani, and Luigi Raffo, “Reconfigurable coprocessor for multimedia application domain,” J. VLSI Signal Process. Syst., vol. 44, pp. 135–152, August 2006.

[4] Vinu Vijay Kumar and John Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP J. Appl. Signal Process., vol. 2006, January.

[5] Johan Eker and Jörn Janneck, “Caltrop—language report,” Technical memorandum, Electronics Research Lab, Department of Electrical Engineering and Computer Sciences, 2002.

[6] G. Roquier, M. Wipliez, M. Raulet, J.W. Janneck, I.D. Miller, and D.B. Parlour, “Automatic software synthesis of dataflow program: An MPEG-4 simple profile decoder case study,” in Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, 2008, pp. 281–286.

[7] Jörn W. Janneck, Ian D. Miller, David B. Parlour, Ghislain Roquier, Matthieu Wipliez, and Mickaël Raulet, “Synthesizing hardware from dataflow programs: An MPEG-4 simple profile decoder case study,” in SiPS, 2008, pp. 287–292.

[8] Jörn W. Janneck, Ian D. Miller, David B. Parlour, Ghislain Roquier, Matthieu Wipliez, and Mickaël Raulet, “Synthesizing Hardware from Dataflow Programs,” Journal of Signal Processing Systems, 2009.

[9] M. Mattavelli, I. Amer, and M. Raulet, “The reconfigurable video coding standard,” Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 159–167, May 2010.

[10] “ISO/IEC 23001-4 (2009). MPEG systems tech.—Part 4: Codec configuration representation.”

[11] Francesca Palumbo, Danilo Pani, Emanuele Manca, Luigi Raffo, Marco Mattavelli, and Ghislain Roquier, “RVC: A multi-decoder CAL composer tool,” in DASIP, 2010, pp. 144–151.

[12] Nicolas Siret, Matthieu Wipliez, Jean-François Nezan, and Aimad Rhatay, “Hardware code generation from dataflow programs,” in DASIP, 2010, pp. 113–120.

[13] “VHDL back-end,” Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

A UNIFIED HARDWARE/SOFTWARE CO-SYNTHESIS SOLUTION FOR SIGNAL PROCESSING SYSTEMS

Endri Bezati 1, Hervé Yviquel 2, Mickaël Raulet 3, Marco Mattavelli 1

1 Ecole Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland, {firstname.lastname}@epfl.ch
2 IRISA/University of Rennes 1, F-22300 Lannion, France, {firstname.lastname}@irisa.fr
3 IETR/INSA of Rennes, F-35708 Rennes, France, {firstname.lastname}@insa-rennes.fr

ABSTRACT

This paper presents a methodology to specify, from a high-level dataflow description, an application for both hardware and software synthesis. Firstly, an introduction to RVC-CAL dataflow programming and the Orcc framework is presented. Furthermore, an analysis of a close-to-gate intermediate representation (XLIM) is provided. As a proof of concept, a JPEG codec was written purely in RVC-CAL to test the co-synthesis tools, and an analysis of the generated hardware and software results is given. Our experience shows that using RVC-CAL can unify the process of creating the same application for software and hardware without modifying a single line of source code for each solution.

Index Terms— Co-Design, Co-Synthesis, Dataflow, FPGA, JPEG, OpenForge, Orcc, RVC-CAL, XLIM

1. INTRODUCTION

Signal processing applications become more and more complicated and their complexity grows continuously at each generation. This is the case, for instance, of video compression standards which, following the demand for higher-quality video transmitted over smaller and smaller bandwidths, achieve the objective at the expense of introducing, at each new standard release, a large increase in codec complexity. Developing implementations of such applications for heterogeneous platforms is always a difficult challenge. Currently, frameworks capable of generating code from the same, ideally high-level, specification for both hardware and software synthesis are not available or present severe limitations. C is one possible high-level language, as C-to-gates tools and their corresponding design flows (ImpulseC [2], Handel-C [10] and Spark [7]) generate VHDL code from C-like specifications.

This work is part of the ACTORS European Project (Adaptivity and Control of Resources in Embedded Systems), funded in part by the European Union's Seventh Framework Programme, grant agreement no. 216586.

Thus, the entire design space is far from being completely explored, because these tools handle only hardware code generation. Moreover, approaches to high-level hardware synthesis fall broadly into two categories:

• those that attempt to adapt software programming languages to the creation of hardware by creating tools that translate software programs into circuit descriptions, such as Catapult C, c2h, PICO Express, ImpulseC;

• those that devise one or more new languages (textual or visual), designed to be more amenable to the generation of efficient hardware, e.g. Handel-C [5], Mitrion-C, Mobius.

The approaches in the first category attempt to leverage software tools and a large community of programmers. However, the goal of translating real-world applications written in a language such as C into efficient hardware implementations has proven elusive, despite considerable efforts in this direction. Although hardware code generation from the CAL dataflow language has been presented in the past [3, 9] with the OpenDF framework, this paper presents an approach for unified hardware and software synthesis starting from the same program (specification). This paper is organized as follows. Firstly, Section 2 gives a brief introduction to CAL dataflow programming and its associated compiler, the Open RVC-CAL Compiler. Then, two sections make the following contributions:

• We present a complete compilation flow from RVC-CAL towards HDL synthesis (Section 3). This flow uses the XLIM Intermediate Representation (IR), an XML format for representing a language-independent model of imperative programs based on a well-known form called three-address code (TAC or 3AC).

• We give a co-design case study, in which a JPEG codec written only in RVC-CAL is partitioned into components and then synthesized to both SW and HW (Section 4).

Finally, Section 6 outlines the current limitations of the approach and discusses the perspectives of future extensions.

2. BACKGROUND

This section presents RVC-CAL, a standardized subset of the original CAL Actor Language; the Open RVC-CAL Compiler, an open-source framework that supports RVC-CAL for generating implementation code; and OpenForge, a synthesis tool developed by Xilinx.

2.1. RVC-CAL Data-flow Programming

The CAL Actor Language is a language based on the actor model of computation for dataflow systems [6]. An actor is a modular component that encapsulates its own state. Actors interact with each other through FIFO channels, see Figure 1. An actor in general may contain state variables, global parameters, actions, procedures, functions and a finite state machine that controls the execution of the actions. CAL enables concurrent development and provides strong encapsulation properties. CAL is used in a variety of applications and has been compiled to hardware and software implementations. The RVC-CAL language is a subset of the CAL language and is normalized by ISO/IEC as a part of the RVC standard. Although it has some restrictions on the data types and features that are used in CAL [4, 6], it is sufficient and efficient for specifying streaming and signal processing systems such as MPEG compression technology.

Fig. 1. The CAL computing model.

Fig. 2. Orcc framework chain.

2.2. Open RVC-CAL Compiler

The Open RVC-CAL Compiler (Orcc)1 is an open-source framework designed to generate implementation code from a network of RVC-CAL actors specified by a network topology description [15]. The front-end has the task of parsing all actors and translating them to an Intermediate Representation (IR). Such an IR is a data structure that is built from the input data to a program (the actors), and from which part or all of the output data of the program is constructed in turn. The next step is to run a language-specific back-end. The idea is to leave flexibility to the back-ends for generating optimized code and to preserve the features of the CAL language without over-specifying scheduling information. The back-ends generate code depending on the target; their purpose is to create target-specific code. Each back-end parses the hierarchical network from a top-level network and its child networks and, optionally, flattens the hierarchical network. Orcc currently offers a variety of back-ends, among which C, C++, LLVM, VHDL and XLIM. To generate a software decoding solution we used the C back-end of Orcc. The generated C code is ANSI-C compatible and portable to different platforms such as Windows, Linux, Mac OS X and others. Orcc also gives the possibility to create native actors and native procedures. A native actor can be written directly in C or VHDL (for ModelSim simulation). The purpose of these native actors is to offer the possibility to use the host input/output (write a file, display an image); as RVC-CAL has no standard library to communicate with the host, native actors permit that.

1 Orcc is available at http://orcc.sf.net

2.3. OpenForge and HDL code generation

Forge was a research tool developed by Xilinx for their C-to-gates implementation. Forge has since fallen into the public domain and was renamed OpenForge. The first tool that permitted CAL-to-HDL generation was the OpenDF framework (a CAL simulator), using OpenForge as a back-end for Verilog HDL generation; in the RVC-CAL literature this tool is often referred to as CAL2HDL [9, 12]. The XLIM code generation of OpenDF does not support the whole RVC-CAL subset and is too slow compared to the XLIM generation of Orcc; thus, creating an XLIM back-end in Orcc was a must. OpenForge takes as input an XLIM file, which is simply the representation of the static single assignment (SSA) form of an actor in an XML format. The synthesis stage then follows, with the analysis of the SSA representation into a web of circuits built from a set of basic operations (arithmetic, logic, flow control, memory accesses, etc.). As an intelligent synthesis tool, OpenForge also supports the unrolling of loops and the insertion of registers to improve the maximal clock rate of the generated circuit (pipelining).

Finally, OpenForge generates a Verilog file that represents the RVC-CAL actor with an asynchronous handshake-style interface for each of its ports. Orcc generates a top-level VHDL file that connects the generated Verilog actors either back-to-back or through FIFO buffers into a complete system. The FIFO buffers can be synchronous or asynchronous (at the user's choice); given this, it is easy to support multi-clock-domain dataflow designs (different clock domains can be specified directly from the Orcc user interface). Orcc can also generate VHDL code for the actors and the network directly [13] but, as this VHDL code generation is not mature enough, it was not used for this implementation. Future work on the VHDL back-end may provide an alternative to OpenForge.

3. XLIM CODE GENERATION

This section first presents the XLIM representation and then the compilation process of an RVC-CAL application towards a hardware target.

3.1. Presentation of the XLIM representation

The XML Language-Independent Model (XLIM) is an intermediate representation (IR) developed by Xilinx [1] to ease code optimization of data-flow programs. Indeed, neither the source code written by developers nor the abstract syntax tree (AST) produced by the parser at the beginning of the compilation process is well suited to code analysis and transformation, because of their lexical structure. XLIM is essentially an XML document containing several elements which describe the behavior of a data-flow actor: the interfaces of the actor (inputs and outputs), the set of state variables, the computational procedure of each action, and finally the action scheduler which manages the execution of the actions according to their firing rules. The XLIM representation is a close-to-gate representation which requires the following properties, so that the dependency relations between the different elements of the program are represented directly and code optimizations for hardware targets become possible:

• Static single assignment (SSA) form requires that each variable used by the program is assigned exactly once. As a consequence, a variable assigned several times in the initial form of the IR is split into different versions of the variable (for example, a variable x which is assigned three times is transformed into three variables x1, x2 and x3), and special statements called φ-functions are inserted at join nodes of the control-flow graph in order to assign a variable according to the execution path (for example, the statement x3 ← φ(x1, x2) expresses that x3 has the value of x1 if the program jumped from the first node and of x2 if it jumped from the second one).

• Three-address code (3AC) representation describes each basic operation executed in the program (such as an addition or a multiplication) by the 4-tuple (Operator, Operand1, Operand2, Result), where Operand1, Operand2 and Result are variables and Operator is a primitive operator (an arithmetic operator, for example). As a result, there are no more complex expressions containing several primitive operations.

3.2. XLIM backend

Fig. 3. The compilation flow of the XLIM backend.

The Open RVC-CAL Compiler (Orcc) includes an XLIM back-end which acts as an RVC-CAL frontend for tools like OpenForge or XLIM2C [14]. The default compilation flow is presented in Fig. 3 and consists of several passes, in particular the following transformations of the Orcc intermediate representation:

• Inlining of RVC-CAL functions and procedures, since XLIM does not support call instructions because of its low representation level.

• SSA transformation establishes the required SSA property by indexing variables and adding φ-functions.

• 3AC transformation splits complex expressions containing several primitive operations into multiple 3AC-compliant instructions.

• Copy propagation is an optimization which removes the direct assignment of one variable to another, such as a := b, by replacing all uses of a by b (illustrated in the sketch below).

• Cast adder: the XLIM representation uses a precise type system (type name and size) which requires explicit cast instructions. These cast instructions are added, thanks to the bit-exact precision of the Orcc IR, by visiting the type of each expression and variable.

• Array flattener transforms multidimensional arrays into one-dimensional arrays and computes the corresponding flat indices.

After these transformations, the actors are printed in an XML format respecting the XLIM properties using a template engine called StringTemplate [11]. This mechanism makes it possible to generate XLIM files quickly (a few seconds at most), increases flexibility (several back-ends were developed in Orcc this way) and reduces maintenance cost (a template is easier to change than a program).
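To make the 3AC and copy-propagation passes above concrete, the following Python sketch shows a toy 3AC representation and a simple copy-propagation pass. It is only an illustration of the idea, not the Orcc implementation; all names and structures are hypothetical.

# Illustrative sketch only: a toy 3AC representation and a copy-propagation
# pass, in the spirit of the transformations described above.

# Each instruction is a 4-tuple (operator, operand1, operand2, result);
# "copy" models a direct assignment result := operand1.
three_ac = [
    ("mul", "b", "c", "t1"),    # t1 := b * c
    ("copy", "t1", None, "u"),  # u  := t1
    ("add", "a", "u", "x1"),    # x1 := a + u   (SSA-style versioned name)
]

def copy_propagation(instrs):
    """Replace uses of copy targets by their sources and drop the copies."""
    replacement = {}
    out = []
    for op, a, b, res in instrs:
        # Rewrite operands through the replacement map first.
        a = replacement.get(a, a)
        b = replacement.get(b, b)
        if op == "copy":
            replacement[res] = a   # every later use of `res` becomes `a`
        else:
            out.append((op, a, b, res))
    return out

print(copy_propagation(three_ac))
# [('mul', 'b', 'c', 't1'), ('add', 'a', 't1', 'x1')]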

Fig. 4. The RVC-CAL Codec: a) JPEG Encoder, b) JPEG Decoder.

4. USE CASE: AN RVC-CAL JPEG CODEC

As a use case, a JPEG codec written purely in RVC-CAL was chosen². The codec is based on the ISO/IEC IS 10918-1 (ITU-T T.81) JPEG standard. The idea is to implement the encoder on an FPGA and the decoder on a host computer. The codec implements the baseline profile and uses static quantization and Huffman tables. This restriction was chosen only for simplicity: with RVC-CAL it is easy to add an actor that takes different quantization and/or Huffman tables as inputs without changing the structure of the JPEG codec model.

4.1. RVC-CAL JPEG Encoder

The RVC-CAL JPEG encoder is modeled as a serial data-flow application. Encoding is performed at macro-block (MB) level. The input of the encoder is given in raster 4:2:0 YUV format (see Fig. 5); this format was chosen to match the output of the camera used as input. The encoder is divided into six actors. The JPEG standard describes the encoding at the level of 8x8 blocks of 64 pixels. In the 4:2:0 format, each MB contains four 8x8 luminance (Y) blocks and two 8x8 chrominance blocks (one for U and one for V).

Fig. 5. The YUV 4:2:0 Macro-Block representation.

The first actor in the encoder performs the raster-to-MB conversion. It takes as inputs the Y, Cb and Cr components together with a signal SOI that indicates the Size Of the Image. The next step is to apply the Forward Discrete Cosine Transform (FDCT) to the YUV pixels and to quantize them. The FDCT and Quantization actors work at the 8x8 block level, so for one MB they process six 8x8 blocks; this exposes potential parallelism in the JPEG algorithm, which is discussed in the future work section. The most important part of the JPEG encoder is the Huffman actor. This actor is specific to the 4:2:0 MB scheme: it processes the luminance blocks first and then the two chrominance blocks. The Huffman actor generates two bit-streams (not shown in Fig. 4), one for the luminance and one for the chrominance. To simplify the JPEG model, a merger actor was created to serialize these two bit-streams. Finally, the Streamer actor adds all the necessary information (start/stop markers, quantization and Huffman tables) so that a proper JPEG bit-stream is generated.

4.2. RVC-CAL JPEG Decoder

The RVC-CAL decoder performs the inverse process of the JPEG encoder, with a quality loss in the reconstructed image due to the lossy nature of JPEG compression. The decoder is divided into four actors. The first step is to decode the JPEG bit-stream: the JPEG Parser retrieves all the necessary information from the bit-stream. Next, the Huffman actor decodes the Huffman coding of the transformed and quantized 4:2:0 YUV blocks. Finally, the Inverse Quantization and Inverse DCT (IDCT) actors form the 4:2:0 MBs of the reconstructed image.

4.3. A Co-Design Example of the JPEG Codec

The Orcc framework offers the possibility to generate source code for both hardware and software. Given the RVC-CAL description of the JPEG codec, the encoder can be implemented on an FPGA board and the decoder can be compiled as an ANSI C program, so that it runs on any platform that has an ANSI C compiler. For our use case, a Virtex-6 FPGA board with a PCI-Express connector and a PC host with an Intel Core i7 were used; the FPGA board and the PC communicate over the PCI-Express bus. The RVC-CAL JPEG encoder is implemented on the Virtex-6 FPGA board. Orcc first generates the XLIM intermediate representation and then calls the OpenForge back-end. The generated code consists of one Verilog source file per actor and a VHDL file that represents the network of actors. The Xilinx PCIe IP core handles the communication between the generated code and the PCI-Express bus.

² The JPEG Codec is available at http://orc-apps.sf.net

Fig. 6. The RVC-CAL JPEG Codec partitioned in Hardware and Software.

A camera connected to the FPGA board via the HD-SDI interface provides the acquisition of the input image. The image is compressed by the RVC-CAL JPEG encoder and then transmitted over the PCI-Express bus. Xilinx provides a basic Linux driver for the PCI-Express port, which permits reading and writing values on the PCI-Express bus. With the help of the native actors of Orcc, a PCI-Express read (source) actor was written so that the generated RVC-CAL C or C++ application can communicate with the PCI-Express bus. The RVC-CAL JPEG decoder application is generated in C so that it can run on the PC host. The PCI-Express source actor feeds the JPEG decoder with a valid JPEG bit-stream; the decoder then decodes the bit-stream and displays the images on the computer screen.

package jpeg.encoder;

import jpeg.encoder.common.Tables.QT;
import jpeg.encoder.common.Tables.zigzag;

actor Quantization () int(size=32) In ==> int(size=32) Out :

  int BlockType := 0; // BlockType = 0,1,2,3 (Luma), BlockType = 4,5 (Chroma)

  Quant: action In:[ val ] repeat 64 ==> Out:[ data ] repeat 64
  var
    List(type: int(size=24), size=64) data
  do
    foreach uint i in 0 .. 63 do
      if (val[zigzag[i]] > 0) then
        data[i] := (val[zigzag[i]] + (QT[BlockType >> 2][i] >> 1)) / QT[BlockType >> 2][i];
      else
        data[i] := (val[zigzag[i]] - (QT[BlockType >> 2][i] >> 1)) / QT[BlockType >> 2][i];
      end
    end
    BlockType := (BlockType + 1) mod 6;
  end
end

Fig. 7. The Quantization (Q) actor in the JPEG encoder written in RVC-CAL.
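The Quant action above divides each zig-zag-ordered coefficient by its quantization step, first adding or subtracting half the step so that the truncating integer division rounds to the nearest value. A small, purely illustrative Python sketch of this rounding scheme (the step values and coefficients below are invented, and Python's floor division is rearranged to mimic the truncation-toward-zero division of the generated C/HDL):

# Illustrative sketch of the rounding scheme used in the Quant action.
QT = [16, 11, 10, 16]          # hypothetical quantization steps
coeffs = [53, -42, 7, -90]     # hypothetical DCT coefficients (zig-zag order)

quantized = []
for c, q in zip(coeffs, QT):
    if c > 0:
        quantized.append((c + (q >> 1)) // q)        # round half up for positives
    else:
        quantized.append(-((-c + (q >> 1)) // q))    # symmetric rounding for negatives
print(quantized)  # [3, -4, 1, -6]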

4.4. Results

The RVC-CAL JPEG codec has 1653 lines of source code: 990 lines for the encoder and 663 lines for the decoder. These numbers are comparable to those of a JPEG codec written entirely in C, although the line count obviously depends on the author and his or her programming style. Writing in RVC-CAL is easy and intuitive; as an example, Fig. 7 presents the source code of the Quantization actor written in RVC-CAL. The RVC-CAL JPEG encoder occupies 22% of the Virtex-6 FPGA (see Fig. 8) and encodes 14 frames per second with a 50 MHz clock (measured with Xilinx ChipScope) for a set of 512x512 input images. The time for encoding 30 images of 512x512, including the transfer over the PCI-Express bus, is approximately 3 seconds. The open-source Xilinx PCI-Express driver is slow compared to the PCI-Express 1x standard: the images are stored in the DDR RAM of the FPGA board, and it takes almost one second to move the encoded images from the DDR to the host over the PCI-Express bus. Note that no optimization of the RVC-CAL code was performed, and the serial architecture of the RVC-CAL JPEG encoder penalizes the throughput. A simple splitting of the YUV components can increase the theoretical performance by a factor of 3, or even 5 if an intelligent splitting of the four Y blocks is applied. On the other side of the PCI-Express bus, the JPEG decoder decodes 135 frames/s for a set of images at a resolution of 512x512. Here again, the potential parallelism of the YUV splitting is not exploited because of the serial architecture of the RVC-CAL JPEG decoder.

Logic utilization    Used     Utilization %
Registers            10033    6
Slice LUTs           13308    16
LUT-FF pairs         4241     22
IOB                  85       14
Block RAM            85       6

Fig. 8. Synthesis information on the Virtex-6 FPGA.

5. CONCLUSION AND FUTURE WORK

This co-synthesis solution is still in progress and much work remains to be done. The work is divided into two fronts: code generation and code optimization of the RVC-CAL applications (the JPEG codec in this paper). Code generation is naturally separated into software and hardware code generation. Different metrics and code analyses are being developed so that the generated code becomes more efficient. In software code generation, the current bottleneck is the network and actor scheduling; efforts are directed towards a dynamic scheduling that takes into account the dynamic execution of actors on a multi-core platform and the load balancing. As for hardware code generation, optimizations can be applied directly in the XLIM code generation to reduce the number of slices of the synthesized design. Due to the nature of the Orcc IR, many intermediate variables are added so that the generated software code is more efficient and easier to handle for C/C++ compilers. However, OpenForge does not recognize that these intermediate variables come from the same variable, so it adds more registers. A test comparing the Orcc and OpenDF XLIM generators showed that actors with heavy computation (the FDCT in our example) required almost twice the number of slices with the Orcc XLIM generation compared to the OpenDF one, but the throughput of the Orcc XLIM version was still 20% higher. A quick fix for this problem is envisaged. Another approach, not yet finalized, is the use of different clock domains for the actors, so that lower power consumption or better performance can be achieved depending on the developer's specification; as mentioned in the OpenForge subsection, actors can be completely asynchronous. As for the RVC-CAL source code, different metrics are being developed to help the developer optimize it. A simple optimization in signal processing is the splitting of components: for the JPEG encoder, a potential splitting of the YUV components yields strong parallelism in the design (Fig. 9). Since coding in RVC-CAL is easy and intuitive, the programmer should keep in mind that the design is data-flow by nature and that potential parallelism arises from the design itself.

Fig. 9. Potential YUV component parallelism for the JPEG encoder.

This paper has presented a solution to generate code of the same application for hardware and software co-synthesis, along with the first implementation of a JPEG codec written purely in RVC-CAL. The Orcc framework goes one step further than the OpenDF framework by offering software synthesis as well as a better XLIM code generation. Even if much work remains to be done to achieve better results at both the hardware and the software level, the current solution is one of the few frameworks, in industry and in academia, that can offer software and hardware code synthesis from the same specification.

6. REFERENCES

[1] XLIM: An XML Language-Independent Model. Technical report, Xilinx DSP Division, 2007.
[2] A. Antola, M. Santambrogio, M. Fracassi, P. Gotti, and C. Sandionigi. A novel hardware/software codesign methodology based on dynamic reconfiguration with Impulse C and CoDeveloper. In Programmable Logic, 2007. SPL'07. 3rd Southern Conference on, pages 221-224. IEEE, 2007.
[3] S. S. Bhattacharyya, G. Brebner, J. W. Janneck, J. Eker, C. von Platen, M. Mattavelli, and M. Raulet. OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems. SIGARCH Comput. Archit. News, 36(5):29-35, 2008.
[4] S. S. Bhattacharyya, J. Eker, J. W. Janneck, C. Lucarz, M. Mattavelli, and M. Raulet. Overview of the MPEG Reconfigurable Video Coding Framework. Springer Journal of Signal Processing Systems, Special Issue on Reconfigurable Video Coding, 2009.
[5] Celoxica. Handel-C Language Reference Manual, 2004.
[6] J. Eker and J. Janneck. CAL Language Report. Technical Report ERL Technical Memo UCB/ERL M03/48, University of California at Berkeley, Dec. 2003.
[7] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau. Spark: a high-level synthesis framework for applying parallelizing compiler transformations. In VLSI Design, 2003. Proceedings. 16th International Conference on, pages 461-466, Jan. 2003.
[8] ISO/IEC FDIS 23001-4. MPEG systems technologies - Part 4: Codec Configuration Representation, 2009.
[9] J. Janneck, I. Miller, D. Parlour, G. Roquier, M. Wipliez, and M. Raulet. Synthesizing hardware from dataflow programs: an MPEG-4 simple profile decoder case study. In Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pages 287-292, Oct. 2008.
[10] E. Khan, M. El-Kharashi, F. Gebali, and M. Abd-El-Barr. Applying the Handel-C design flow in designing an HMAC-hash unit on FPGAs. Computers and Digital Techniques, IEE Proceedings, 153(5):323-334, Sept. 2006.
[11] T. J. Parr. Enforcing strict model-view separation in template engines. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 224-233, New York, NY, USA, 2004. ACM.
[12] N. Siret, I. Sabry, J. Nezan, and M. Raulet. A codesign synthesis from an MPEG-4 decoder dataflow description. In Circuits and Systems (ISCAS), Proceedings of the 2010 IEEE International Symposium on, pages 1995-1998, May-June 2010.
[13] N. Siret, M. Wipliez, J.-F. Nezan, and A. Rhatay. Hardware code generation from dataflow programs. In Design and Architectures for Signal and Image Processing (DASIP), 2010 Conference on, pages 113-120, 2010.
[14] C. von Platen. CAL ARM Compiler. Report of the ACTORS project, Jan. 2010.
[15] M. Wipliez, G. Roquier, and J. Nezan. Software Code Generation for the RVC-CAL Language. Journal of Signal Processing Systems, 2009.

OPTIMIZATION METHODOLOGIES FOR COMPLEX FPGA-BASED SIGNAL PROCESSING SYSTEMS WITH CAL

Ab Al-Hadi Ab Rahman, Hossam Amer, Anatoly Prihozhy, Christophe Lucarz, Marco Mattavelli
SCI-STI-MM, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland

ABSTRACT

Signal processing designs are becoming increasingly complex as more advanced algorithms are demanded. Designers are now seeking high-level tools and methodologies to help manage complexity and increase productivity. Recently, the CAL dataflow language has been specified, together with tools capable of synthesizing dataflow descriptions into RTL code for hardware implementation, and several case studies have shown promising results. However, no work has been done on global network analysis, which could increase the optimization space. In this paper, we introduce methodologies to analyze and optimize CAL programs by determining which actions should be parallelized, pipelined, or refactored for the highest throughput gain, and then provide tools and techniques to achieve this using minimum resources. In a case study on the RVC MPEG-4 SP Intra decoder implemented on a Virtex-5 FPGA, experimental results confirmed our analysis, with a throughput gain of up to 3.5x using relatively minor additional slices compared to the reference design.

Index Terms— Optimization methodologies, FPGA, signal processing, CAL dataflow, MPEG decoder

1. INTRODUCTION

As the time-to-market window continues to shrink, it is more crucial than ever to be as rapid and productive as possible during the design phase. At the same time, algorithms are becoming more advanced, with demands for more features, higher throughput, and lower resource and power requirements. For hardware-based designs, the classical RTL methodology is no longer seen as the appropriate way to specify complex signal processing systems, since RTL designs are known to be difficult and time-consuming to develop. Furthermore, the low level of abstraction of RTL designs also makes it challenging to analyze them for optimization and to extend design features at the program level. Because of this, there is now a growing interest in specifying hardware systems at a high level of abstraction, mostly using imperative languages (C/C++) such as the Spark framework [1] and the GAUT tool of LABSTICC [2]. However, C programs are designed to execute sequentially and it still remains a

difficult problem to generate efficient HDL code from C, especially for DSP applications. Furthermore, C programs are also difficult to analyze for potential parallelism due to the lack of concurrency and of a concept of time [3]. The CAL dataflow language [4] was developed to address these issues by specifying programs using a dataflow abstraction with explicit parallelism based on the concept of actors [5]. The language also allows the synthesis of these programs to efficient parallel hardware. The design environment was initiated and developed by Xilinx and later became the open-source Eclipse IDE plugins OpenDF¹ and OpenForge². These tools only perform analysis and optimizations of a given CAL actor for HDL synthesis; the final result highly depends on the design style and specification. Reference [6] presents coding recommendations for CAL designers in order to achieve best results. However, for complex systems it is not trivial to find where and how in a given dataflow network the recommendations should be applied in order to achieve the highest improvement factor. The main focus of this paper is therefore to provide systematic methodologies, tools, and techniques to automatically analyze the global network and apply optimizations on the critical parts of the network. The present paper contributes to the merging of various tools that have been developed by the authors, and to their application to complex dataflow networks. We prove the feasibility of our methodologies by applying them to the Reconfigurable Video Coding (RVC) MPEG-4 SP Intra decoder [7]. The decoder is specified in CAL and based on the new RVC standard, which proposes a new paradigm for specifying and designing complex signal processing systems. Essentially, the standard enables specifying new codecs by assembling blocks from a Video Tool Library (VTL), which results in higher flexibility, reusability, and modularity. We show by experiments on an FPGA implementation how the throughput of this reference decoder can be improved severalfold using our tools and techniques.

¹ http://opendf.sourceforge.net
² http://sourceforge.net/projects/openforge

2. CAL DATAFLOW MODELING AND HARDWARE SYNTHESIS

CAL programs are specified at a high level of abstraction; complex systems can therefore be designed significantly more rapidly than with an RTL language for hardware implementation. Furthermore, the actor-based dataflow abstraction is also much easier to analyze (at a high level) for optimization, since algorithms in actions are specified in a style that is closer to software than to hardware. Compared to RTL designs, where designers manipulate primitive components such as registers, multiplexers, and controllers, these components are hidden in the dataflow abstraction, which simplifies designs and results in higher flexibility to scale and optimize. In other words, instead of having to deal with low-level components, CAL designs allow more focus on implementing advanced features and algorithms. The mapping of algorithms specified in imperative languages such as C/C++/Java to CAL is performed by translating the body of operations into actions, and then grouping the actions into actors. The sequence of action firings (i.e. executions) of a given actor is specified by the actor scheduler. The constraint is that at any moment in time, only one action of a given actor can be fired. However, in contrast to imperative languages where operations are performed sequentially, actors are designed to run in parallel whenever input tokens (i.e. data) are available. This explicit parallelism in the dataflow abstraction makes it suitable for high-level modeling of hardware systems. CAL designs have been shown to generate efficient hardware, for example in [8] for the hardware implementation of an MPEG-4 SP decoder. The design results in a smaller implementation area and higher throughput compared to the classical RTL methodology. Similarly, the work in [9] presents CAL designs of various image filtering systems with implementations on an FPGA and on a general-purpose computer. Results show up to 3.5x higher throughput for the system using the CAL-to-HDL code generator.

3. OPTIMIZATION METHODOLOGIES FOR CAL DATAFLOW PROGRAMS

Optimizing complex hardware systems generally involves two main steps: maximizing parallel execution and pipelining the critical path. The objective of the former is to reduce the number of clock cycles needed to execute an algorithm by replicating components so that they can run in parallel, while the latter aims at increasing the maximum allowable frequency by partitioning large combinatorial logic. The optimization problem in our case is to find the best parallelization and pipelining strategy so as to achieve maximum system throughput with minimum resources. This is done by analyzing the dataflow network at both the CAL-program level and the generated RTL level using various external tools.

3.1. Methodology for Parallel Action Execution

The first step is to identify, in a given dataflow network, which actors or actions have the potential to increase the overall system throughput if their parallelism is exploited, and which of these give the highest improvement factor. This can be determined by analyzing the paths of the network from source to sink. We define the network critical path as the longest weighted path from a source to a sink node of the system, as shown by the example in figure 1. Optimizing a node that is not on the network critical path will not result in any improvement, since the longest path is not yet resolved. The general idea is therefore to optimize node(s) on the critical path in order to gain higher system performance.
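Complementing the example of figure 1, the following Python sketch computes the longest weighted path of a small DAG, taking the weight of a path as the sum of its node weights. It is only an illustration of the concept, not the analysis tool described later; node ids, weights and edges are invented.

# Toy longest-weighted-path (critical path) computation on a small DAG.
weights = {1: 2, 2: 2, 3: 3, 4: 2, 5: 2, 6: 4, 7: 1}
edges = {1: [2, 3], 2: [4], 3: [5], 4: [6], 5: [6], 6: [7], 7: []}

best = {}          # node -> (best path weight ending at node, predecessor)
for n in sorted(weights):              # ids assumed to be in topological order
    incoming = [best[u][0] for u in weights if n in edges[u]]
    best[n] = (max(incoming, default=0) + weights[n],
               max((u for u in weights if n in edges[u]),
                   key=lambda u: best[u][0], default=None))

sink = max(best, key=lambda n: best[n][0])
path, n = [], sink
while n is not None:
    path.append(n)
    n = best[n][1]
print("critical path:", list(reversed(path)), "weight:", best[sink][0])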


Fig. 1. Example of a weighted path in a dataflow network with the network critical path.

For CAL dataflow programs, the path analysis is performed on the causation trace, which is a directed acyclic graph such that:

• Every node is a firing of an action inside an actor of the program.

• For any two nodes v1 and v2, every edge from v1 to v2 indicates a dependency (through a token, a state variable, or a port) of v2 on v1, implying that v1 has to be executed before v2.

Each node is also assigned a weight, which is the execution time of an action, taken as the number of clock cycles needed to execute that action. The evaluation phase involves constructing a Gantt chart of execution from the weighted causation trace. Figure 2 shows an example of the chart based on the network in figure 1. Actions (or nodes) are listed on the vertical axis, while the horizontal axis represents time in terms of the number of clock cycles. The shaded blocks in the chart show action executions and the arrows are action dependencies. The objective is to minimize the makespan of the chart, i.e. the length of the execution of the network in terms of the number of clock cycles from start to end. All these steps are performed automatically by the CAL profiling tool ProfiCAL [10], developed by one of the authors of the present paper.
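Purely as an illustration of how a makespan can be evaluated from such a weighted causation trace (this is not the ProfiCAL implementation), the following Python sketch schedules firings greedily while respecting both the dependencies and the constraint that only one action of a given actor can fire at a time. All ids, actors and weights are invented.

from collections import defaultdict

# firing id -> (actor, weight in clock cycles); deps: firing -> predecessors.
firings = {1: ("A", 2), 2: ("A", 2), 3: ("B", 3), 4: ("B", 2), 5: ("C", 2)}
deps = {2: [1], 3: [1], 4: [2, 3], 5: [4]}

finish = {}
actor_free = defaultdict(int)   # earliest time each actor is free again
for f in sorted(firings):       # ids assumed to be in a valid topological order
    actor, w = firings[f]
    ready = max((finish[d] for d in deps.get(f, [])), default=0)
    start = max(ready, actor_free[actor])   # one action per actor at a time
    finish[f] = start + w
    actor_free[actor] = finish[f]
print("makespan (clock cycles):", max(finish.values()))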


Fig. 2. Example of Gantt-chart hardware execution for CAL dataflow programs.

ProfiCAL takes as input the CAL dataflow actors and network, an input sample of duration t, and the weight of each action. The weights are obtained automatically from a hardware simulator (after CAL-to-HDL synthesis) using a scripting language that monitors and evaluates the start and the end of each action execution. The profiling tool generates the causation trace and the Gantt chart of execution, and displays the network critical path. The complexity of each action i in the critical path is also evaluated as

\mathrm{comp}_i = w_i \times f_i, \quad \text{for } i \in T, \qquad (1)

where comp_i is the complexity, w_i is the average weight, and f_i is the number of firings of action i over a certain duration T = {0, ..., t}. From this, it can be concluded that improving the action with the highest complexity (the critical action) in the network critical path gives the highest throughput improvement. Further analysis can be performed by iteratively reducing the complexity of the critical action (e.g. by parallel execution) by a given factor and re-evaluating the critical path and critical action, essentially producing a ranking of actions by their potential overall throughput improvement.

Fig. 3. Methodology to find critical actions for potential parallel execution.

Figure 3 shows the flow chart of the methodology. The CAL program is first synthesized to HDL, and an RTL simulation is performed from which the weight of each action is obtained. Using these weights, the CAL program, and an input sample of the program, ProfiCAL ranks the actions with potential throughput improvement. The critical actions are then analyzed to determine whether their amount of parallelism can be increased. The modified actions and actors are updated in the CAL program and the process is repeated until no further parallelization is possible. As a case study on a complex dataflow network, we performed this analysis on the reference RVC MPEG-4 SP Intra decoder. The results are as follows:

1. 10% of action read_write of the actor Inverse Scan
2. 10% of action copy of the actor Inverse AC Prediction
3. 10% of action write_only of the actor Inverse Scan
4. 10% of action read_only of the actor Inverse Scan
5. 10% of action advance of the actor Inverse DC Prediction

This indicates that optimizing the lower-ranked action copy of the actor Inverse AC Prediction would result in a lower throughput increase than optimizing the higher-ranked action read_write of the actor Inverse Scan. The percentage refers to how much optimization should be applied to an action until it is no longer the dominant action to optimize. In other words, Inverse Scan (IS) should be optimized first by 10%, then Inverse AC Prediction (IAP) by 10%, IS by 20%, and Inverse DC Prediction (IDP) by 10%. It is important to note that the profiling tool only reveals which action should be optimized in order to gain an overall improvement; the technique by which the action is improved is left to the discretion of the designer. For our design case, we developed a refactoring technique to increase the parallel execution of the critical actions in the video decoder. These critical actions reside in the texture decoder network. In this video decoder, the texture decoder is divided into three separate parts: one for luminance (Y) and two for chrominance (U and V). Analysis of the Y-branch texture decoding showed that further splitting is possible. In the original implementation, the luminance texture decoder processes the four blocks in series, i.e. one after the other, because of the blocks' dependencies in the prediction components. However, the processing of a particular block does not depend on all previous blocks, but only on one or two previous blocks. The blocks' processing dependencies are shown in figure 4. The processing of the first block-0 executes without any dependencies. The processing of block-1 and block-2 depends only on block-0, so these two blocks can be processed in parallel. Block-3 depends on the availability of block-1 and block-2, while the next block-0 depends only on block-1; therefore, block-3 and the next block-0 can be processed in parallel. In the general case, the processing of block-X depends on the block above and the block to its left, when these dependency blocks exist. The three critical actors IS, IAP, and IDP in the luminance

texture decoder are refactored by replicating the components, and the data tokens are re-distributed so that all components can run in parallel as much as possible. We then explore the impact of splitting these components in various combinations on the performance of the whole decoder, and verify the accuracy of our analysis against the experimental results on hardware.

Fig. 4. Blocks layout and parallel execution potential for luminance texture decoding.

3.2. Methodology for Action Pipelining and Refactoring

The design abstraction of CAL dataflow programs can be loosely seen as a pipelined implementation, in which actors are processing elements connected to adjacent actors by FIFO buffers. The key difference, however, is that actors may contain many actions that are controlled by a local scheduler. In this case, data is not necessarily written to the FIFO buffers at every clock cycle as in true pipelined circuits. A dataflow program only becomes a fully pipelined implementation if every actor in the network contains just a single action. The technique to extract an action from an actor with multiple actions starts by analyzing the state variables required and updated by the action to pipeline. All state variables required by the action to pipeline are sent as data tokens if they are modified by the previous action. Similarly, all state variables updated by the action to pipeline are sent to the next action. If more than one action in the original actor has the same output port, then a multiplexer is required to select the correct output. The actor with a single action can then be partitioned into several smaller actors, as shown in figure 6. In the example, the actor sample is partitioned into two actors, creating a 2-stage pipeline with the output of sample1 directly connected to the input of sample2. The optimization problem reduces to:

• finding the action to pipeline that would increase the overall system throughput;

• partitioning the action in the most effective way, that is, with maximum throughput for an n-stage pipeline using minimum resources.

In order to find the action that should be pipelined, we define the action critical path as the longest combinatorial path of an action in a dataflow network, as reported by an RTL synthesis tool.

Indeed, this path determines the maximum allowable frequency at which the system can run. Therefore, the way to increase the operating frequency is to partition this path into smaller paths separated by registers. It is important to note that pipelining is limited by variable dependencies. In pipelined circuits, data should be read and written at every clock cycle; any variable dependency requires all operations between the dependencies to be in the same stage. For actions with large dependencies between complex operations, pipelining may not reduce the action critical path. Another solution is to still pipeline actions with large dependencies, but to stall the pipeline until the dependencies are resolved. At the CAL-program level, this involves partitioning an action inside the same actor and modifying the scheduler to execute the resulting actions in series. We call this technique action refactoring; an example is given in figure 7. Instead of having a single action a, the actor sample now has two actions a1 and a2. Essentially, the two actions do not execute in parallel, but one after the other, as specified in the scheduler. However, this technique should be used carefully because the number of clock cycles to execute the whole dataflow network may increase due to pipeline stalling. In order to check whether action refactoring increases the number of clock cycles to execute the network, the hardware simulation can be analyzed. Basically, an action that is safe for action refactoring is one that has a finite number of clock cycles of delay from the end of its firing to the start of the firing of the next action. In this case, the pipeline stalls take up the delay before the firing of the next action and do not affect the overall execution. This can be checked automatically using a simple script that monitors and evaluates the start and end of any two consecutive actions in a hardware simulator.

actor sample() int(size=SZ) Input ==> int(size=SZ) Output :
  a: action Input:[ in ] ==> Output:[ out ]
  do
    ...
  end
end

Fig. 5. Actor sample with a single action a.

actor sample1() int(size=SZ) Input ==> int(size=SZ) Out_a1, .. :
  a1: action Input:[ in ] ==> Out_a1:[ out_a1 ], ..
  do
    ...
  end
end

actor sample2() int(size=SZ) In_a1, .. ==> int(size=SZ) Output :
  a2: action In_a1:[ in_a1 ], .. ==> Output:[ out ]
  do
    ...
  end
end

Fig. 6. 2-stage pipeline of actor sample with two actions a1 and a2.

actor sample() int(size=SZ) Input ==> int(size=SZ) Output :
  a1: action Input:[ in ] ==>
  do
    ...
  end

  a2: action ==> Output:[ out ]
  do
    ...
  end

  schedule fsm s_a1:
    s_a1 (a1) --> s_a2;
    s_a2 (a2) --> s_a1;
  end
end

Fig. 7. 2-stage action refactoring of actor sample with two actions a1 and a2.

In both cases, pipelining and action refactoring, the task is to efficiently synthesize a single action into k actions with minimum resources. For this reason, we have developed a pipeline synthesis and optimization technique that splits an action into k parts as equally as possible in terms of the length of the combinatorial path, with minimum usage of pipeline registers. The following describes the pipeline optimization task, its objective function and its constraints. Let N = {1, ..., n} be the set of algorithm operators (from the action) and K = {1, ..., k} be the set of pipeline stages. We describe the distribution of operators onto pipeline stages with the X matrix:

X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{k,1} & \cdots & x_{k,n} \end{pmatrix}

In the matrix, each variable x_{s,i} ∈ {0, 1} takes one of two possible values: if x_{s,i} = 1 then operator i is scheduled to stage s, otherwise it is not scheduled to that stage. However, depending on operator mobility, some operators can be scheduled to more than one stage. In this case, x_{s,i} is not assigned a definitive value of 0 or 1; it is marked as an un-assigned variable which may later be replaced by 0 or 1 in such a way as to obtain a valid X matrix, i.e. one in which each operator is assigned to a single stage. One complete X matrix describes one possible pipeline schedule. The upper bound S^{upper} on the total number of X matrices can be estimated as

S^{upper} = \prod_{j \in N} \mu(j), \qquad (2)

where μ(j) is the number of variables with un-assigned values in column j of the X matrix. The task is to select a single X matrix according to an objective function and constraints. In this work, we consider a resource-optimized pipeline, with the objective of minimizing the total width of the registers inserted between neighboring pipeline stages. Let Ω be the set of possible X matrices. The following objective function minimizes the total pipeline register width over all elements of the set Ω:

\min_{X \in \Omega} \sum_{s=1}^{k} \Big\{ \sum_{j=1}^{m} \big[ \max_{i \in N}(f_{i,j} \times x_{s,i}) - \max_{i \in N}(h_{i,j} \times x_{s,i}) \big] \times width(j) + \sum_{j=1}^{m} \big[ \max\big(\tau_j, \max_{e=s+1,\dots,k,\, i \in N}(f_{i,j} \times x_{e,i})\big) - \max_{e=s,\dots,k,\, i \in N}(h_{i,j} \times x_{e,i}) \big] \times width(j) \Big\} \qquad (3)

where τ_j = 1 if variable j is an output token and τ_j = 0 otherwise, and × is the arithmetic multiplication. There are two parts in equation 3. The first one estimates, for each stage s, the width of the registers inserted between the stage and the previous neighboring stage. The second one estimates, for each stage, the width of the transmission registers. There are three constraints related to our optimization task: operator scheduling, time, and precedence constraints. The operator scheduling constraint describes the requirement that each operator should belong to only one pipeline stage:

\sum_{s=asap(i)}^{alap(i)} x_{s,i} = 1 \qquad (4)

where asap(i) and alap(i) are the earliest and latest pipeline stages to which operator i can be scheduled. The time constraint describes the requirement that the time delay between two operators i and j must be less than the stage-time requirement if the operators are scheduled to the same pipeline stage s:

x_{s,i} \times x_{s,j} \times g_{i,j} \le T_{stage} \qquad (5)

where g_{i,j} is the longest path between operators i and j on the dataflow graph. The operator precedence constraint describes the requirement that if operator i is a predecessor of operator j on the dataflow graph, then i must be scheduled to a stage whose number is not greater than the number of the stage to which operator j is scheduled:

\sum_{s=asap(i)}^{alap(i)} (s \times x_{s,i}) - \sum_{s=asap(j)}^{alap(j)} (s \times x_{s,j}) \le 0 \qquad (6)

Constraints 4, 5, and 6 together define the structure of the optimization space. The pipeline optimization technique has been developed as a Java program which takes as input the CAL actor with a single action and the timing requirement of each pipeline stage. The program first generates the ASAP and ALAP schedules for the action based on the operator-input, operator-output, and operator-precedence relations. From this, operator mobility is determined and the operators are arranged in order of mobility. This is then used in the coloring algorithm [11] that generates all possible (and valid) pipeline schedules based on the operator conflict and non-conflict relations. For each pipeline schedule, the total register width is estimated, and the smallest among all schedules is taken as the optimal solution, which is finally used to generate the pipelined CAL actors. Note that for the action refactoring technique, each generated action is taken and combined into a single actor. Figure 8 shows our methodology for action pipelining and refactoring of a complex dataflow network. Starting from a CAL dataflow program, it is first synthesized to HDL and then to RTL, where the information on the action critical path can be obtained. Using this, the critical action is extracted from the CAL program and analyzed for pipelining (dependency check). If pipelining is possible, the action is sent to pipeline synthesis and optimization. If not, the action is checked to determine whether action refactoring is feasible, i.e. whether it will not increase the number of clock cycles to execute the whole dataflow network. The result of pipelining or action refactoring is updated in the CAL program and the process is repeated until actions no longer dominate the critical path.
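As a rough illustration of the first step of such a scheduler (not the Java implementation described above), the following Python sketch computes ASAP and ALAP stage assignments and the resulting mobility for a tiny operator graph, assuming for simplicity that each operator occupies a whole stage; operator names, the graph and the stage count are invented.

# Toy ASAP/ALAP scheduling and mobility computation for operators of an action.
k = 4                                    # assumed number of pipeline stages
preds = {                                # operator -> list of predecessors
    "mul1": [], "mul2": [],
    "add1": ["mul1", "mul2"],
    "sub1": ["add1"],
}
succs = {op: [] for op in preds}
for op, ps in preds.items():
    for p in ps:
        succs[p].append(op)

def asap(op):
    return 1 if not preds[op] else 1 + max(asap(p) for p in preds[op])

def alap(op):
    return k if not succs[op] else min(alap(s) for s in succs[op]) - 1

for op in preds:
    lo, hi = asap(op), alap(op)
    print(f"{op}: asap={lo}, alap={hi}, mobility={hi - lo}")
# With k = 4 each operator here has mobility 1, e.g. mul1 may go to stage 1 or 2.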


Fig. 8. Methodology for action pipelining and refactoring from CAL dataflow programs.

This methodology is also applied to the RVC MPEG-4 SP Intra decoder, using the best result from the parallel action execution methodology of section 3.1. Table 1 shows the top three action critical paths as reported by the XST synthesis tool for the implementation on a Xilinx Virtex-5 FPGA. As shown, the delay on a path depends on logical (action) and routing (interconnection) delays. Our methodology aims at minimizing only the logical path, since the routing delay depends on the RTL synthesis tool. The next critical path after the IDCT is no longer in an action body, but is dominated by the routing delay and the action scheduler.

Table 1. Top three action critical paths with logical and routing delay.

Actor   Action       Logical (ns)   Routing (ns)   Total (ns)
IDP     read.intra   10.65          13.10          23.75
IQ      ac           6.46           4.76           11.22
IDCT    calc         6.83           3.57           10.40
-       -            2.06           8.14           10.30

Through analysis of the critical actions, the ac action of the Inverse Quantization (IQ) actor and the calc action of the Inverse Discrete Cosine Transform (IDCT) actor do not have any variable dependencies and can therefore be pipelined. On the other hand, action refactoring of these two actions is not feasible, since they are fired repetitively a finite number of times without any delay between the firings; action refactoring would increase the number of clock cycles required to execute the whole decoder network. Analysis of the read.intra action of the Inverse DC Prediction (IDP) actor shows that there is a large dependency inside the action: it starts by reading from a memory location and, at the end of the action, the results are written back into memory. Therefore, pipelining is not possible. However, action refactoring is feasible, since the action is fired only once for every processed block; as a result, increasing the number of clock cycles to execute this action does not increase the number of clock cycles to execute the whole decoder network.

4. RESULTS & ANALYSIS

The methodologies introduced in section 3 have been applied to the reference RVC MPEG-4 SP Intra decoder, synthesized to HDL and implemented on a Xilinx Virtex-5 FPGA. Results are compared among various design points for the action parallel execution, pipelining and refactoring techniques. The Modelsim hardware simulator has been used to obtain the decoder's latency in terms of the number of clock cycles, while the XST synthesis tool has been used for RTL synthesis to obtain the maximum frequency and the required FPGA resources. Experiments have been performed using sample encoded CIF video frames at a resolution of 352x288 pixels. In order to simplify the presentation of results, every design point for the parallel execution technique is assigned a 3-bit binary value, where IDP is designated as bit 2, IS as bit 1, and IAP as bit 0. If a component is split, a value of 1 is assigned; if it is not split, a value of 0 is assigned. For example, a design point of 100 is a design where the IDP is split while the IS and IAP are not split. If all components are not split,

a design point of 000 is assigned, which is our reference design. For the action pipelining and refactoring technique, the following convention is used for the design points: pn_x, where p is the pipeline identifier, n is the pipeline sequence, and x is the design point from the parallel execution technique. For example, p1_000 refers to the first pipeline iteration from design point 000. The graph in figure 9 shows all design points. The cluster of points on the left side of the graph shows the results of the parallel execution technique from the reference design, the middle shows pipelining or refactoring of design point 000, and the right shows pipelining or refactoring of design point 011. The arrow going from design point 000 to p1_000 shows the first step in minimizing the action critical path from the reference design, which is a 2-stage action refactoring of the IDP actor; the subsequent action pipelining and refactoring steps are applied from this point. The optimization starts with the reference design (000), which has a throughput of 108.4 CIF frames/s and uses 17405 slices. The first task is to obtain results when splitting only a single component: IS, IAP, or IDP. Among these three design points, the IS split (010) shows the highest throughput improvement of 28.1%, with 138.9 CIF frames/s, followed by the IAP split (001) with 137.7 CIF frames/s, a 27.0% improvement. The IDP split (100) shows the least improvement, only 5.9%, with 114.8 CIF frames/s. The additional slice usage is relatively minor: 1.8%, 3.2%, and 5.1% for IS, IAP, and IDP respectively. More importantly, the throughput improvements conform exactly to our analysis that the IS split would give the highest improvement, followed by the IAP split, and finally the IDP split. In order to find the best split combination, we extended the experiment to splitting two and three components, as shown on the graph. The best combination in terms of throughput-to-slice ratio is 011, with IS and IAP split, giving a throughput of 163.1 CIF frames/s and 18153 slices. For design point 011, the action critical path is found to be on the read.intra action of the IDP actor, so this action is refactored into two stages. In terms of throughput, this operation results in 295.1 CIF frames/s, that is, an 80.9% improvement over design point 011 and 2.7x compared to design point 000. The next three action critical paths are also found in this actor (on the split action); it is therefore recursively split from three stages up to five stages, reaching a throughput of 342.3 CIF frames/s. The ac action of the IQ actor is found to be the next action critical path for two iterations and is therefore pipelined into two and then three stages, which results in a throughput of 369.2 CIF frames/s. Finally, the IDCT is pipelined into two stages, giving a final throughput of 376.9 CIF frames/s, which is a 3.5x improvement over design point 000. In terms of slices, the additional cost is again relatively minor, up to 13.4% for the IDCT-2S design compared to design point 011, thanks to the resource optimization performed during pipeline and refactoring synthesis. It is interesting to observe the impact of applying our

methodologies on the number of clock cycles needed to execute the network and on the maximum frequency (which, together with the cycle count, determines the throughput). As mentioned in section 3, the parallel execution technique aims at reducing the number of clock cycles to execute the whole network, while the action pipelining and refactoring technique allows the system to run at a higher frequency (as shown in table 2). For the parallel execution technique, the number of clock cycles is reduced by up to 51.9%, while for the action pipelining and refactoring technique, the frequency is increased by up to 2.3x compared to the reference design (000).

Table 2. Number of clock cycles and maximum frequency for all design points.

Design point   Clock cycles / CIF frame   Diff (%)   fmax (MHz)   Diff (%)
000            396000                     -          42.9         -
001            314424                     25.9       43.3         0.1
010            317592                     24.7       44.1         2.8
100            392832                     0.1        45.1         5.1
011            260568                     51.9       42.5         -0.1
101            342936                     15.4       44.1         2.8
110            316800                     25.0       44.2         3.0
111            302544                     30.8       44.2         3.0
p1_000         396000                     0.0        74.8         74.4
p2_000         396000                     0.0        79.8         86.0
p3_000         396000                     0.0        89.2         107.9
p4_000         396000                     0.0        89.6         108.9
p1_011         260568                     51.9       76.9         80.9
p2_011         260568                     51.9       82.9         95.1
p3_011         260568                     51.9       85.1         100.2
p4_011         260568                     51.9       89.2         109.9
p5_011         260568                     51.9       89.6         110.8
p6_011         260568                     51.9       96.2         126.4
p7_011         260568                     51.9       98.2         131.1

The reduction of the number of clock cycles needed to execute the network leads to another interesting observation: it does not only increase the design throughput, but also allows the system to run at a lower frequency for the same throughput requirement, which is key to obtaining a low-power design. For example, the reference design (000) is able to decode 108.4 CIF frames/s at 42.9 MHz with 396,000 clock cycles per frame. The best design point, 011, requires only 260,568 clock cycles per frame and therefore needs only 260,568 × 108.4 ≈ 28.2 MHz to decode at the same frame rate. This is a reduction of 34.3% in the clock frequency. With action pipelining and refactoring, we provide the option to increase the system throughput by allowing the clock frequency to go up to 98.2 MHz, with a maximum throughput of 376.9 CIF frames/s.
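The relation used above is simply required clock frequency = clock cycles per frame × frames per second; a quick, purely illustrative check in Python:

# Verifying the frequency/throughput relation used in the text.
cycles_per_frame = 260_568          # design point 011
target_fps = 108.4                  # throughput of the reference design 000
f_required = cycles_per_frame * target_fps / 1e6
print(f"required clock: {f_required:.1f} MHz")               # ~28.2 MHz
print(f"reduction vs. 42.9 MHz: {1 - 28.2 / 42.9:.1%}")      # ~34.3 % (using the rounded 28.2 MHz figure)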


Fig. 9. FPGA slice vs. throughput for all design points of the action parallel execution, pipelining, and refactoring methodologies.

5. CONCLUSION & FUTURE WORK

In this paper, we present methodologies to optimize complex FPGA-based signal processing systems that are specified in the CAL dataflow language. Two methodologies are presented: the first optimizes parallel action execution by analyzing the network critical path and finding the critical actions that would increase the overall system throughput if their amount of parallelism were increased. The second methodology pipelines or refactors actions based on the action critical path, and automatically synthesizes non-pipelined actions into pipelined ones using minimum resources. The methodologies utilize various external tools at the CAL-program level and at the generated RTL level. We have applied both methodologies to the RVC MPEG-4 SP Intra decoder, with synthesis to HDL for FPGA implementation. The strength of our methodologies is demonstrated by a throughput increase of multiple factors using relatively minor additional slices. In the near future, the presented methodologies will be applied to other complex signal processing applications, including a motion JPEG encoder, a Sobel image filter, and a video converter and sampler. The methodologies will also be enhanced by merging and simplifying some of the tools and techniques.

6. REFERENCES

[1] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, "Spark: a high-level synthesis framework for applying parallelizing compiler transformations," in International Conference on VLSI Design, 2003, pp. 461-466.
[2] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe, "GAUT: an architectural synthesis tool for dedicated signal processors," in European Design Automation Conference - Proceedings, 1993, pp. 14-19.
[3] G. De Micheli, "Hardware synthesis from C/C++ models," in Design, Automation and Test in Europe Conference and Exhibition 1999, 1999, pp. 382-383.

[4] J. Eker and J. Janneck, CAL Language Report: Specification of the CAL Actor Language, University of California, Berkeley, December 2003.
[5] C. Hewitt, "Viewing control structures as patterns of passing messages," Journal of Artificial Intelligence, vol. 8, no. 3, pp. 323-363, June 1977.
[6] D. Parlour, CAL Coding Practices Guide: Hardware Programming in the CAL Actor Language, Xilinx Inc., June 2003.
[7] M. Mattavelli, I. Amer, and M. Raulet, "The reconfigurable video coding standard," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 159-164+167, 2010.
[8] J. W. Janneck, I. D. Miller, D. B. Parlour, G. Roquier, M. Wipliez, and M. Raulet, "Synthesizing hardware from dataflow programs: an MPEG-4 simple profile decoder case study," Journal of Signal Processing Systems, vol. 63, no. 2, 2011.
[9] A. A. H. Ab-Rahman, R. Thavot, M. Mattavelli, and P. Faure, "Hardware and software synthesis of image filters from CAL dataflow specification," in 2010 Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), 2010, pp. 1-4.
[10] C. Lucarz, G. Roquier, and M. Mattavelli, "High level design space exploration of RVC codec specifications for multi-core heterogeneous platforms," in Proceedings of the 2010 Conference on Design and Architectures for Signal and Image Processing (DASIP), October 2010.
[11] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, New Jersey, USA, 3rd edition, 1994.


Tampere, Finland, November 2-4, 2011

Session 9: Reconfigurable Systems & Tools for Signal & Image Processing 2 Co-Chairs: Juanjo Noguera, Xilinx, Ireland, , Fraunhofer IOSB, Ettlingen, Germany

Development of a Method for Image-Based Motion Estimation of a VTOL-MAV on FPGA
Natalie Frietsch, Lars Braun, Matthias Birk, Michael Hübner, Gert F. Trommer and Jürgen Becker

Real-Time Moving Object Detection for Video Surveillance System in FPGA
Tomasz Kryjak, Mateusz Komorkiewicz and Marek Gorgon

An Approach to Self-Learning Multicore Reconfiguration Management Applied on Robotic Vision
Walter Stechele, Jan Hartmann and Erik Maehle

Power Consumption Improvement with Residue Code for Fault Tolerance on SRAM FPGA
Frederic Amiel, Thomas Ea and Vinay Vashishtha


DEVELOPMENT OF A METHOD FOR IMAGE-BASED MOTION ESTIMATION OF A VTOL-MAV ON FPGA

N. Frietsch, I. Pashkovskiy, G. F. Trommer
Karlsruhe Institute of Technology (KIT)
Institute of Systems Optimization (ITE)
Karlsruhe, Germany

L. Braun, M. Birk, M. Hübner, J. Becker
Karlsruhe Institute of Technology (KIT)
Institute for Information Processing Technologies (ITIV)
Karlsruhe, Germany

ABSTRACT

In this paper, the development of a vision-based motion estimation method for a small-scale VTOL-MAV as well as its implementation on an FPGA are investigated. Especially in urban environments, the GPS signal quality is degraded by shadowing and multipath propagation, and augmentation with another sensor is inevitable. The vision system is based on the analysis of the sparse optical flow extracted from images taken by the onboard camera. From the extracted point correspondences, projective transformations are estimated with a robust parameter estimation algorithm. As the underlying image processing routines are computationally expensive but can be processed in parallel, they have been implemented on an FPGA. The different parts of the algorithm as well as the implementation are covered in detail.

Index Terms— micro aerial vehicle, MAV, FPGA, airborne camera, sparse optical flow, navigation

1. INTRODUCTION

Over the last years, the interest in small unmanned vehicles has increased constantly. The possible application areas of UAVs (unmanned aerial vehicles) are widespread. They are especially useful for security and rescue operations, for example in cases of industrial or natural disasters such as earthquakes or fires, and can significantly reduce the risk for human rescue teams. As their navigation systems often depend on GPS information, their application in urban environments places challenging demands on the navigation module due to limited GPS coverage as well as shadowing and multipath effects, as they occur especially in urban canyons. This paper focuses on the augmentation of a small, electrically powered micro aerial vehicle (MAV) with vertical take-off and landing (VTOL) capabilities with a vision-based system. The platform is shown in Figure 1. The maximal diameter including the rotors is 79 cm and the take-off weight without payload is around 800 g. The payload capability is up to 400 g. The current navigation system is based on the Kalman filter integration of sensor data from an inertial measurement unit and a GPS

receiver and is augmented by magnetometer and barometric height sensor data [1]. In urban environments, GPS coverage and signal quality decrease due to shading and multipath effects. Additional image based navigation estimation can help to overcome this problem.

Fig. 1. Four rotor MAV platform with digital camera in flight. In the scenarios described above, MAVs have in the majority of cases image sensors as payload on-board. Vision based navigation systems have been investigated by several research groups with promising results to overcome the described short-coming of GPS. Our approach is the integration of a visual odometer depending on corner-like image features [1]. The underlying image processing routines are computationally expensive but parallelizable and therefore suitable for the implementation on a Field-programmable Gate Array (FPGA). In this paper the implementation of the image based rotation and translation estimation between subsequent images on a FPGA is described. As the camera is mounted perpendicular under the helicopter, the observed scene is rather flat and the relative camera motion from frame to frame is therefore modelled by projective transformations, called homographies. In the first step, point correspondences are extracted from the sparse optical flow between pairs of subsequent images. In contrast to other applications like moving objects tracking, there is no need for a dense optical flow field but for a set

of well known point correspondences being distributed over the image plane. In the next step homographies are estimated with the robust RANSAC (Random Sample Consensus) approach. At the same time outliers like mismatched point pairs or points belonging to moving objects in the world are discarded. Finally the relative rotation and translation is extracted by decomposing the homography matrix based on singular-value decomposition (SVD). This paper is organized as follows: In Section 2, the theoretical background and considerations are described followed by the implementation details on a FPGA in Section 3. In Section 4, the performance of the implemented algorithm is illustrated followed by a conclusion.

2. THEORETICAL CONSIDERATIONS Several methods for sparse optical flow estimation and point correspondences extraction have been described in literature. One approach is the extraction of a set of salient point features along with descriptors from each image. Corresponding points between two images can then be found by comparing the two lists of descriptors. Depending on the characteristic of the descriptor, this approach can be used to find matches between images even if brightness and the point of view of the camera differ significantly. Another approach to get point correspondences is the extraction of a set of features in an image and then following a matching against a region in the subsequent frame. Several implementations of optical flow on FPGAs have been described for different purposes: In [2], a dense optical flow field is extracted by using the census transform. Well known distinct feature extraction algorithms are the Harris and Stephens corner detector [3] as well as a method first published by Lucas and Kanade [4] and further developed by Shi and Tomasi [5] [6]. Along with the tracking, the last approach is also referred to as Kanade-Lucas-Tomasi (KLT) tracker. A modified Harris corner detector is used in [7] [8] on a FPGA. Whereas a modified high performance version of the KLT-tracker has been implemented by Diaz et. al. [9] [10]. If the camera motion is small compared to the sampling rate of the images, reasonable results can be achieved with region based matching methods, compare [11]. In contrast to most described methods, in our case there is no need for a dense optical flow field but for a set of well known point correspondences being distributed over the image plane.

2.1. Corner feature selection The approaches of Harris and Stephens and of Kanade, Lucas and Tomasi base on the windowed second order moment

matrix
$$
\mathbf{M}(x, y) = \sum_{(i,j)\in N} \begin{pmatrix} I_x^2(i,j) & I_x(i,j)\,I_y(i,j) \\ I_x(i,j)\,I_y(i,j) & I_y^2(i,j) \end{pmatrix} = \begin{pmatrix} a(x,y) & b(x,y) \\ b(x,y) & c(x,y) \end{pmatrix} \qquad (1)
$$

that can be calculated for each image point $(x, y)^T$ by using all image points in its neighborhood $N$, with $I_x$ and $I_y$ being the approximated image gradients. Tomasi and Kanade state that if both eigenvalues $\lambda_1$ and $\lambda_2$ of the matrix given in Eq. (1) are large, the local gray value variance indicates a corner or another pattern that can be tracked reliably [5]. Therefore, an image point $(x, y)^T$ is assumed to be a corner feature if the smaller eigenvalue of its corresponding matrix $\mathbf{M}(x, y)$ exceeds a certain threshold $\lambda$:
$$
\min(\lambda_1, \lambda_2) > \lambda. \qquad (2)
$$
Harris and Stephens choose the slightly different corner criterion
$$
R(x, y) = \det(\mathbf{M}(x, y)) - k \cdot \mathrm{trace}^2(\mathbf{M}(x, y)) \qquad (3)
$$

with the parameter k that is often set to 0.04. An image point is chosen as a corner feature if the value of R(x, y) is above a certain positive threshold. The calculation of the eigenvalues requires the extraction of roots and is therefore computationally expensive with regard to the desired hardware realization. A software evaluation was done with images with known ground truth as well as images taken by the onboard camera of the helicopter. The well known Translating and Diverging Tree sequence and Yosemite sequence [12] were used together with the error metrics described in [13]. As there were no significant differences, the Harris and Stephens corner detector was chosen to be implemented.

In order to successfully implement the feature detection on the FPGA, the following modifications and parameter choices have been made: The multiplication with the parameter k in Equation (3) is realised as a bit shift operation by choosing the parameter as k = 1/8, which showed slightly better results than k = 1/16. The best size of the neighborhood N in Equation (1) concerning hardware size and performance for the observed sequences was found to be 5 × 5. For an image size of 640 × 480, 200 to 300 point correspondences should be extracted. A possible approach in software is to extract corner features exceeding a certain quality, sort them and then choose the best features for further processing. As this approach is not suited for the hardware implementation, and in addition to get features that are distributed over the whole image plane, the image is divided into squares of 30 × 30 pixels, compare Figure 2, and within each square the best feature is determined. A global threshold is additionally applied to the locally best features to also ensure a high global corner quality. This threshold is adapted during operation in order to always get a sufficient number of features. In Figure 3 an onboard image of the MAV is shown along with the grid and the extracted features. In areas with low contrast, the locally strongest features do not exceed the global threshold and are not chosen for further processing. As the input image data is provided line-by-line, the R-values defined in Equation (3) can be calculated on the data stream by buffering just a few image rows in shift registers. A small patch around each corner is saved for the following optical flow estimation. Its size is discussed in the next section.

Fig. 2. Segmentation of the image for the corner feature extraction.

Fig. 3. Onboard image with extracted image features: in blue the locally best features and in red the features additionally exceeding a global quality threshold are shown.

2.2. Sparse Optical Flow

The region-based matching algorithms as well as the derivative-based KLT algorithm are based on the assumption that all pixels in a small neighborhood are subject to the same optical flow. The KLT tracker optimizes the error residue
$$
\epsilon(\mathbf{d}) = \sum_{\mathbf{x}\in W} \left( I_k(\mathbf{x}) - I_{k+1}(\mathbf{x} + \mathbf{d}) \right)^2 \qquad (4)
$$
with the optical flow d of all pixels in the window W iteratively, with the help of the spatial image gradients. The region-based matching used in this work (compare [11]) uses the sum of absolute differences (SAD) error function:
$$
\mathrm{SAD}(\mathbf{d}) = \sum_{\mathbf{x}\in W} \left| I_{k+1}(\mathbf{x} + \mathbf{d}) - I_k(\mathbf{x}) \right|. \qquad (5)
$$
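To make the processing steps concrete, the following software-model sketch (plain Python/NumPy, in the spirit of the software evaluation mentioned above, which the authors carried out separately) implements the corner criterion of Eq. (3) with k = 1/8 realised as a shift, the per-block feature selection, and the SAD matching of Eq. (5). Function names, the default search range and the border handling are illustrative assumptions and do not reproduce the FPGA data path.

```python
import numpy as np

def box_sum(img, win):
    """Sum of a win x win neighborhood around every pixel (zero padding at the border)."""
    half = win // 2
    pad = np.pad(img, half)
    out = np.zeros_like(img)
    for dy in range(win):
        for dx in range(win):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def harris_response(gray, win=5):
    """R = det(M) - (trace(M)^2 >> 3), i.e. Eq. (3) with k = 1/8 as a right shift."""
    g = gray.astype(np.int64)
    ix = np.zeros_like(g); iy = np.zeros_like(g)
    ix[:, 1:-1] = g[:, 2:] - g[:, :-2]        # central-difference image gradients
    iy[1:-1, :] = g[2:, :] - g[:-2, :]
    a = box_sum(ix * ix, win)                  # entries of M summed over the window
    b = box_sum(ix * iy, win)
    c = box_sum(iy * iy, win)
    det, tr = a * c - b * b, a + c
    return det - (tr * tr >> 3)

def best_feature_per_block(r, block=30, global_th=0):
    """Locally best R-value per 30 x 30 block, kept only above a global threshold."""
    feats = []
    h, w = r.shape
    for y0 in range(0, h - block + 1, block):
        for x0 in range(0, w - block + 1, block):
            patch = r[y0:y0 + block, x0:x0 + block]
            dy, dx = np.unravel_index(np.argmax(patch), patch.shape)
            if patch[dy, dx] > global_th:
                feats.append((y0 + dy, x0 + dx))
    return feats

def sad_match(prev, curr, pt, search=(40, 40), win=(9, 8)):
    """Exhaustive SAD search (Eq. 5) for the patch around pt in the next frame.
    Positions running off the image are skipped; pt is assumed to lie in the interior."""
    wy, wx = win[0] // 2, win[1] // 2
    y, x = pt
    ref = prev[y - wy:y + win[0] - wy, x - wx:x + win[1] - wx].astype(np.int64)
    best, best_d = None, (0, 0)
    for dy in range(-search[0], search[0] + 1):
        for dx in range(-search[1], search[1] + 1):
            yy, xx = y + dy, x + dx
            cand = curr[yy - wy:yy + win[0] - wy, xx - wx:xx + win[1] - wx].astype(np.int64)
            if cand.shape != ref.shape:
                continue
            sad = np.abs(cand - ref).sum()
            if best is None or sad < best:
                best, best_d = sad, (dy, dx)
    return best_d, best
```

The search range of ±40 pixels is only a stand-in for the asymmetric region G described below, and the threshold on the best SAD value is omitted for brevity.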

In the software evaluation, the sum of squared differences (SSD) has also been tested, as larger differences have more influence on the result. The test showed no significant improvement worth the doubled bit-width on the FPGA due to the additional square operation. The point corresponding to a corner feature chosen in image I_k is extracted by calculating the SAD between a small pixel patch around the corner feature and a certain region G of the subsequent image I_{k+1}. The error function is calculated for each pixel of G, and the point with the smallest SAD is chosen as correspondence if it additionally goes below a certain threshold. In order to enable an optimization of the implementation, the search region G is not chosen symmetrically around the corner feature in the preceding image, but according to the square in which the feature was located. The calculations are again done on the data stream by buffering image rows in shift registers. As the search regions for different features overlap, multiple SAD blocks have to be used in parallel. Therefore the sizes of the window W and the region G have to be chosen as a compromise between detectable flow magnitude and necessary hardware resources. If nine SAD blocks are used, the maximal size of G is 90 × 90 pixels, which is reduced to 80 × 90 pixels to enable the switches between calculations for different features. The optimized size of the window W is 9 × 8 pixels, as this results in a computation speed of eight times the pixel clock. The described approach results in correspondences with whole-number pixel coordinates. To overcome this disadvantage, a subpixel refinement is done using bilinear interpolation. From the eight surrounding pixels a one-eighth grid is generated. In order to compute the subpixel refinement in time, the maximal size of G is further reduced to 80 × 79.

2.3. Homography estimation and decomposition

Under the assumption that the observed scene is planar or the motion of the camera is purely rotational, the geometrical relation between the consecutive frames I_k and I_{k+1} can be described by the homography matrix H_{k+1,k}. The relation is given by
$$
\mathbf{x}_{k+1} \sim \mathbf{H}_{k+1,k} \cdot \mathbf{x}_k, \qquad (6)
$$

with the homogeneous coordinates x_k of the pixels in frame I_k and x_{k+1} of the pixels in frame I_{k+1}, respectively. To estimate the eight degrees of freedom of the homography matrix, at least four known point correspondences are necessary, but to increase robustness and accuracy all correspondences computed by the optical flow algorithm are used. The coordinates of the found point pairs are normalized by using isotropic scaling [14]. Additionally, the robust RANSAC algorithm [15] is used to find a subset of feature pairs resulting in a homography matrix that most points agree with. During this iterative process, points are identified that do not fit the homography model. These outliers are mismatched feature pairs, points in the image where the planarity assumption does not hold, or points belonging to moving objects in the scene. If homography matrices describe the motion of the scene well, the camera motion between the two frames I_k and I_{k+1} can be extracted from H_{k+1,k} as described by several authors like Triggs [16] or Ma et al. [17]: In the first step, the calibrated homography is calculated with the help of the camera calibration matrix K containing the intrinsic camera parameters [14]:
$$
\mathbf{H}_{cal} = \mathbf{K}^{-1} \mathbf{H} \mathbf{K}, \qquad (7)
$$
with H = H_{k+1,k}. The calibrated homography can then be rewritten as
$$
\mathbf{H}_{cal} = \pm\lambda \left( \mathbf{C}_{c_{k+1},c_k} + \frac{1}{d_k}\, \mathbf{t}^{c_{k+1}}_{c_{k+1},c_k}\, \mathbf{n}_{c_k}^{T} \right), \qquad (8)
$$
with λ being the second-largest singular value of H_cal. Furthermore, C_{c_{k+1},c_k} is the direction cosine matrix from the camera coordinate system c_k at time k to the camera coordinate system c_{k+1} at time k+1; d_k is the distance from the center of the camera coordinate system at time k to the observed 2D plane P in the real world. By t^{c_{k+1}}_{c_{k+1},c_k} the translational displacement between the origins of the camera coordinate systems c_k and c_{k+1} is given, and n_{c_k} is the normal vector of the plane P given in coordinates of the camera coordinate system c_k. Further details about the decomposition and the image based height above ground estimation can be found in [1].

3. FPGA IMPLEMENTATION

In this section the implementation on the FPGA is covered in detail.

3.1. System Overview

Figure 4 shows the design of the system including the different implemented modules. The incoming pixel stream is buffered in the pixel input module that provides it for the following calculation modules. As input the system uses a VGA signal with 640 × 480 pixel resolution and a refresh rate of

60 Hz which results in a pixel clock of 25.175 MHz. The color information is transformed to 12 bits gray level. In the feature detection module, which is explained further in Section 3.2.1, the corners are extracted and selected for the flow calculation on the basis of input pixels from pixel input module. The flow calculation itself runs in optical flow module and is discussed in Section 3.2.2. This module can access the pixels from the current image provided by pixel input module as well as stored pixels from the memory module.

Fig. 4. Project overview on target FPGA The memory module is a central storage module for the corner points of the last image as well as the coordinates of the corners found again in the current picture. Furthermore, the pixel environments of the corners as well as the found correspondences are saved. These are needed for the extraction of the correspondences and the subsequent subpixel refinement. This refinement from pixel accuracy to a one-eighthpixel grid is done by the subpixel refinement module, as explained in Section 3.2.3. The two clock signals plotted in Figure 4 pix clk and sys clk are obtained by a CMT (Clock Management Tile) from the incoming VGA clock signal. The pix clk is reduced by 20% compared to the vga clk, which is exactly the duration for reading a single image line of 640 pixels from a full VGA-line (including front porch and back porch). The pixel clock pix clk is set to 0.8 times the frequency of the VGA clock and thus pix clk = 20.14 MHz. The system clock sys clk corresponds to 8 times the pixel frequency and hence sys clk = 161.12 MHz. The completion of the calculations for an image pair is indicated by a ready signal and the point correspondences can then be read from memory. Additionally, the number of accepted corners in the picture can be accessed directly by fd count. By evaluating this number, an adjustment of the threshold is done. Afterwards, the extracted point correspondences are received by the MicroBlaze to compute the motion estimation. This is calculated in software. The software as well as the additional CORDIC (Coordinate Rotation Digital Computer) cores used by the MicroBlaze to accelerate the singular-value decomposition are explained further in

Section 3.3.

3.2. Sparse Optical Flow System

3.2.1. Feature detection

At first, the feature detection module will be discussed, which is responsible for the detection and selection of corner points in the image. Its structure is shown schematically in Figure 5. This module is designed to run the calculations in a pipeline, and the calculation runs at the speed of the current pixel clock. At each pixel clock, four new pixels fd data arrive at the module and the decision is made for a new maximum.

Fig. 6. Block diagram of the optical flow module that calculates the optical flow in the previously extracted feature points.

Fig. 5. Block diagram of the feature detection module to detect and select corner points in the image.

Thereby, fd data are the vertical and horizontal neighboring pixels to the current pixel. In the first block, matrix builder, the windowed second order moment matrix according to Equation (1) is calculated. From the resulting matrix elements the R-value according to Equation (3) is subsequently calculated in calc harris. In feature selector the R-value is analysed using the given threshold and the image coordinates to detect a new block maximum and accept a valid corner point. Furthermore, the output signals for storing the corner point and its pixel environment in the memory module are created by feature selector.

3.2.2. Optical Flow

A block diagram of the module for calculating the optical flow in the extracted corner features is shown in Figure 6. The SAD norm given in Equation (5) for each location within a limited search area of G = 80 × 79 pixels is used as error function to find the point that best matches the extracted feature point in the last image. The neighborhood used to form the error sum has a size of U = 9 × 8 pixels. The corresponding pixel patches around the corner points from the first image (calculated by feature detection module, compare Section 3.2.1) are stored in memory module and can be loaded line by line. For comparison with the corner points from the feature detection module, at each pixel clock a pixel matrix of the same size from the pixel input module is available. The optical flow module consists of four units: control unit, mem unit, corr unit and dec unit, shown in Figure 6.

The last three components are composed of nine sub-units operating in parallel to allow the simultaneous calculation of the optical flow for nine corner points. Additionally, the counter module is included to synchronize the modules, working on the fast system clock, and the incoming pixels. The module mem unit consists of nine independent memories that store the surrounding pixels of the nine corner points which are currently processed. The pixel environments of the nine corner points (op2 vec) are transferred to the module corr unit along with the pixels around the currently considered pixels (op1), which are transferred via mem unit to the corr unit as well. In corr unit the error functions, given by Equation (5), between the currently considered image patches and the nine corner points are evaluated. The resulting error values are transferred to dec unit. In this module the new error values are compared with the previously smallest ones. If a new error minimum is found, (new min vec) is set. The point with the smallest SAD value can be chosen as correspondence to the feature point in the last image.

3.2.3. Subpixel Refinement

In this section, subpixel refinement module, the last module to calculate the sparse optical flow, is discussed. As the SAD calculation described in the last section is done for each pixel, the pixel coordinates of the resulting correspondence points are whole-number values. The pixel-accurate minima are refined here to a one-eighth pixel grid, for which the eight surrounding pixels (according to the four surrounding pixel quadrants) must be considered. A block diagram of subpixel refinement module is shown in Figure 7. Besides the calculating modules value unit, corr unit, dec unit, mem unit and control unit, this module again contains a counter to synchronize the pixel and system clock. The value unit calculates intensity values between two pixels using bilinear interpolation. From an input line of 11 pixels, 10 values between the given pixels are calculated.
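The refinement step can be illustrated with the following sketch, which re-evaluates the SAD of Eq. (5) on a one-eighth-pixel grid around the integer minimum using bilinear interpolation. The exhaustive ±1 pixel search and the helper names are assumptions made for clarity; the hardware instead evaluates the four pixel quadrants with separate correlation units and works on the data stream.

```python
import numpy as np

def bilinear(img, y, x):
    """Intensity at a fractional position (y, x) by bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    fy, fx = y - y0, x - x0
    p = img[y0:y0 + 2, x0:x0 + 2].astype(np.float64)
    return ((1 - fy) * (1 - fx) * p[0, 0] + (1 - fy) * fx * p[0, 1]
            + fy * (1 - fx) * p[1, 0] + fy * fx * p[1, 1])

def refine_eighth(prev, curr, pt, d_int, win=(9, 8)):
    """Refine an integer displacement d_int for feature pt to 1/8-pixel accuracy.
    Assumes the feature and its search neighbourhood lie inside the image."""
    wy, wx = win[0] // 2, win[1] // 2
    y, x = pt
    ref = prev[y - wy:y + win[0] - wy, x - wx:x + win[1] - wx].astype(np.float64)
    best, best_d = None, d_int
    for sy in np.arange(-1, 1.001, 0.125):       # one-eighth-pixel steps
        for sx in np.arange(-1, 1.001, 0.125):
            dy, dx = d_int[0] + sy, d_int[1] + sx
            sad = 0.0
            for r in range(win[0]):
                for c in range(win[1]):
                    sad += abs(bilinear(curr, y + dy + r - wy, x + dx + c - wx)
                               - ref[r, c])
            if best is None or sad < best:
                best, best_d = sad, (dy, dx)
    return best_d
```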


Fig. 7. Block diagram of the subpixel refinement module.

In corr unit, the error sums at a subpixel position for the four surrounding pixel quadrants are calculated simultaneously. Therefore four separate correlation units are used, one for each pixel quadrant. The following dec unit decides if a new minimum for the actual corner point is found. From the point coordinates of the correspondences the module mem unit calculates the flow vectors. The coordinates are stored together with the flow vectors and are read, as shown in Figure 4, by the MicroBlaze. The control unit generates different controlling signals for the processing units as well as addresses for the memory.

3.3. Motion Estimation

In this section, the implementation of the camera motion estimation from the extracted point correspondences is described. The algorithm works on the point correspondences that are calculated as described in Section 3.2. The robust homography estimation with RANSAC starts with the reading of the point correspondences from BRAM and their normalization. As the computationally most demanding part of the motion estimation, in particular of the RANSAC, is the singular value decomposition (SVD), this part has been accelerated by using a hardware module. During development the results were compared to the implementation completely in software.

Fig. 8. Homography estimation with RANSAC: (a) overview, (b) individual RANSAC steps.

The robust estimation of a model by RANSAC is shown as a black box in Figure 8(a). The individual steps are visualized in Figure 8(b). Iteratively, subsets of four point correspondences are chosen randomly and the homography for each of these subsets is estimated using the hardware module for singular value decomposition (SVD). For this task the two-sided Jacobi SVD method described by Forsythe and Henrici [18] is used, which enables a diagonalization of an n × n matrix by solving a series of separate 2 × 2 SVD problems. It allows for the use of multiple processing units and therefore increases the computation speed. The architecture of a square mesh-connected systolic array for SVD computation proposed by Brent, Luk and Van Loan [19] was implemented using control logic and 5 separate 2 × 2 SVD processors. The n × n matrix is diagonalized iteratively.


During each iteration the diagonal 2 × 2 processors in the mesh-connected array diagonalize their matrices and transfer the computed angles to the off-diagonal 2 × 2 processors, which apply two-sided rotations to their matrices. From each subset of feature pairs an 8 × 9 matrix is constructed. In order to run it through the processor, zero rows and columns need to be added, resulting in a 10 × 10 matrix [19]. All values are then converted from floating-point to fixed-point and stored in BRAM; the order of data access for each step and for every 2 × 2 SVD unit is stored in a lookup table. The singular vectors that are necessary for homography estimation are computed along with the diagonalization of the main matrix, utilizing the same processor array and performing one-sided rotations of the matrices. A single 2 × 2 SVD processing unit performs its computations using CORDIC functions for angle solving and two-sided rotations [20] [21]. The configurable CORDIC IP core is provided by Xilinx [22]. The 2 × 2 processor is adaptable to the task required during each processing step: if the matrix is being diagonalized, then all three CORDIC stages are used (angle solving and two rotations), while one or two CORDIC stages can be skipped when only performing rotations. The symmetric transfer error for all other point correspondences and the currently estimated homography is calculated in software. If the best subset of point correspondences is found, all inliers are used to re-estimate the final homography matrix, which is implemented in software. This matrix and the inlier vector are then available in BRAM and can be decomposed into the rotational and translational part of the camera motion.
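The software side of one RANSAC iteration can be summarised as follows; NumPy's SVD stands in for the CORDIC-based systolic array, the isotropic normalisation of the coordinates is omitted, and the inlier threshold is an illustrative assumption rather than the value used on the MicroBlaze.

```python
import numpy as np

def estimate_homography(x_k, x_k1):
    """DLT estimate of a 3x3 homography from >= 4 correspondences (Nx2 arrays).
    The null vector of the stacked system is taken via SVD (the hardware pads the
    8x9 matrix of a minimal subset to 10x10 for the systolic array)."""
    rows = []
    for (x, y), (u, v) in zip(x_k, x_k1):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)

def symmetric_transfer_error(H, x_k, x_k1):
    """Per-correspondence symmetric transfer error for homography H."""
    Hi = np.linalg.inv(H)

    def project(M, pts):
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ M.T
        return ph[:, :2] / ph[:, 2:3]

    fwd = np.sum((project(H, x_k) - x_k1) ** 2, axis=1)    # frame k mapped to k+1
    bwd = np.sum((project(Hi, x_k1) - x_k) ** 2, axis=1)   # frame k+1 mapped back
    return fwd + bwd

def count_inliers(H, x_k, x_k1, thresh=4.0):
    """Number of correspondences with symmetric transfer error below thresh."""
    err = symmetric_transfer_error(H, x_k, x_k1)
    return int(np.sum(err < thresh)), err < thresh
```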

4. RESULTS

The modules described in Chapter 3 are implemented on a Xilinx XUPV5-LX110T evaluation board, and all tests and results were obtained on this board. The implemented system performs the corner detection and the optical flow calculation, including the subpixel refinement, in real time. In the current configuration, one frame is used for the identification and selection of corner points and a second frame for the flow calculation and subpixel refinement. With the used VGA refresh rate of 60 Hz, the optical flow calculation therefore delivers 30 frames per second. As the system is driven by the pixel clock, it adapts its speed to this clock frequency; hence it calculates the optical flow for every second frame independently of the pixel clock and, because of the modular layout, also of the incoming frame size. The implementation of the sparse optical flow estimation has been tested with synthetic images as well as real images taken by the onboard camera of the MAV. Furthermore, a software implementation has been used at the same time to evaluate the influence of the different modifications and simplifications compared to the performance of the Lucas and Kanade algorithm. The evaluation has been done with the well-established benchmark sequences Translating and Diverging Tree and Yosemite [12] using the error metrics described in [13]. In Figure 9, a sample onboard image is shown together with the sparse optical flow extracted by the FPGA implementation.

Fig. 9. Onboard image with sparse optical flow: the estimated correspondences are shown in red together with the optical flow vectors, marked in green from the positions of the corner features in the previous image.

The homography estimation and decomposition is implemented on the Xilinx MicroBlaze softcore processor supported by a hardware module for singular value decomposition. Because of the varying number of point correspondences the runtime is variable. To get a good overview of the runtime, different scenarios were generated. At first, the computation time was evaluated completely in software. The estimation of the homography on the MicroBlaze with 60 RANSAC loop cycles runs at around 7 Hz, and evaluations showed that the number of RANSAC loops can be reduced while still giving good results. With a reduced number of only 22 RANSAC loop cycles and the use of the SVD hardware module, an update rate of around 20 Hz was achieved, while still giving results comparable to the software implementation. In Table 1 the utilized chip area for the complete system is shown.

Table 1. Resource utilization of the image based motion estimation system
  Resource   Utilization   Occupancy on V5-LX110T
  Slices     11113         64 %
  BRAM       85            57 %
  DSP48E     46            71 %

The results of the sparse optical flow computation are very well suited for the subsequent motion estimation by homographies. In a pipelined version, image data at 60 Hz with a gray-value pixel depth of 12 bits can be processed. The homography estimation runs at 20 Hz and should be further accelerated to meet the requirements, for example by increasing the speed of the calculation of the symmetric transfer errors, as this is now the most time-consuming part. Future work is the integration of the image based navigation information into the Kalman filter along with other sensor data and the detailed evaluation of the in-flight performance.

5. CONCLUSION

In this paper, first results of an image based motion estimation system for a VTOL-MAV implemented on an FPGA are presented. The sparse optical flow is realized by choosing corner features in each image and tracking them in the subsequent images. The described system processes image data with 12 bit pixel depth. For the corner detection the algorithm of Harris and Stephens has been chosen and modified for an optimized hardware implementation. Depending on the image content, up to 300 point correspondences are calculated with an accuracy of one-eighth pixel in real time, directly on the data stream. The homography estimation with the robust RANSAC algorithm runs on the Xilinx MicroBlaze softcore processor and is accelerated by a hardware module for the SVD calculation. The first evaluation of the performance with in-flight data has shown promising results. The test system is implemented on a Virtex 5 LX110T; this platform was used for the evaluation and testing phase. To fulfil requirements such as power consumption and weight, a smaller platform will be evaluated. Because of the similarity to the Spartan 6 FPGAs, porting to this family should be feasible.

6. REFERENCES [1] N. Frietsch, C. Kessler, O. Meister, C. Schlaile, J. Seibold, and G. F. Trommer, “Image based augmentation of an autonomous VTOL-MAV,” in SPIE Europe’s International Symposium on Security&Defence, 2009. [2] C. Claus, A. Laika, Lei Jia, and W. Stechele, “High performance FPGA based optical flow calculation using the census transformation,” in IEEE Intelligent Vehicles Symposium, june 2009, pp. 1185 –1190. [3] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of The Fourth Alvey Vision Conference, 1988, pp. 147–151. [4] B. D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” in Proceedings of Imaging Understanding Workshop, 1981, pp. 121–130. [5] C. Tomasi and T. Kanade, “Detection and Tracking of Point Features,” Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, Pittsburgh, 1991. [6] J. Shi and Tomasi C., “Good features to track,” in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on, 1994, pp. 593 –600. [7] A. Benedetti and P. Perona, “Real-time 2-D feature detection on a reconfigurable computer,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, pp. 586 –593. [8] A. Bissacco, S. Ghiasi, M. Sarrafzadeh, J. Meltzer, and S. Soatto, “Fast visual feature selection and tracking in a hybrid reconfigurable architecture,” in Proceedings of the Workshop on Applications of Computer Vision (ACV), 2006. [9] J. Diaz, E. Ros, F. Pelayo, E.M. Ortigosa, and S. Mota, “FPGA-based real-time optical-flow system,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, pp. 274 – 279, 2006. [10] J. Diaz, E. Ros, R. Agis, and J. L. Bernier, “Superpipelined high-performance optical-flow computation architecture,” Computer Vision and Image Understanding, vol. 112, no. 3, pp. 262 – 273, 2008. [11] H. Niitsuma and T. Maruyama, High Speed Computation of the Optical Flow, vol. 3617 of Lecture Notes in Computer Science, pp. 287–295, Springer Berlin / Heidelberg, 2005. [12] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow techniques,” International Journal of Computer Vision, vol. 12, pp. 43–77, 1994.

[13] B. McCane, K. Novins, D. Crannitch, and B. Galvin, “On Benchmarking Optical Flow,” Computer Vision and Image Understanding, vol. 84, no. 1, pp. 126 – 143, 2001. [14] R. Hartley and A. Zisserman, Multiple View Geometry, Second Edition, Cambridge University Press, Cambridge, 2003. [15] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981. [16] B. Triggs, “Autocalibration from Planar Scenes,” in Extended version of paper in Proceedings of the 5th European Conference on Computer Vision ECCV, 1998, vol. I. [17] Y. Ma, S. Soatto, J. Koseck, and S. S. Sastry, An Invitation to 3-D Vision, Springer, New York, 2004. [18] G. E. Forsythe and P. Henrici, “The Cyclic Jacobi Method for Computing the Principal Values of a Complex Matrix,” Transactions of the American Mathematical Society, vol. 94, pp. 1–23, 1960. [19] R. P. Brent, F. T. Luk, and C. Van Loan, “Computation of the Singular value Decomposition Using Meshconnected Processors,” Journal of VLSI and Computer Systems, vol. 1, no. 3, pp. 242–270, 1985. [20] J. R. Cavallaro and F. T. Luk, “CORDIC Arithmetic for an SVD Processor,” Journal of Parallel and Distributed Computing, vol. 5, pp. 271–290, 1988. [21] Weiwei Ma, M. E. Kaye, and R. Luke, D. M. andDoraiswami, “An FPGA-Based Singular Value Decomposition Processor,” in Electrical and Computer Engineering, 2006. CCECE ’06. Canadian Conference on, 2006, pp. 1047 –1050. [22] Xilinx, “LogiCORE IP CORDIC v4.0, DS249,” 2011.

REAL-TIME MOVING OBJECT DETECTION FOR VIDEO SURVEILLANCE SYSTEM IN FPGA*
Tomasz Kryjak, Mateusz Komorkiewicz, Marek Gorgon
[email protected], [email protected], [email protected]
AGH University of Science and Technology, Faculty of Electrical Engineering, Automatics, Computer Science and Electronics, Department of Automatics, al. Mickiewicza 30, 30-059 Krakow, Poland

ABSTRACT

FPGA devices are a well-suited platform for implementing image processing algorithms. In this article, an advanced video system is presented which is able to detect moving objects in video sequences. The detection method uses two algorithms. First, a multimodal background generation method allows reliable scene modelling in the case of rapid changes in lighting conditions and small background movement. Then, a segmentation based on three parameters, lightness, colour and texture, is applied. This approach makes it possible to remove shadows from the processed image. The authors propose some improvements and modifications to existing algorithms in order to make them suitable for reconfigurable platforms. In the final system a single low-cost FPGA device is able to receive data from a high-speed digital camera, perform a Bayer transform and an RGB to CIE Lab colour space conversion, generate a moving object mask and present the results to the operator in real time.

Index Terms— FPGA, moving object detection, real-time video processing, background generation, shadow removal

1. INTRODUCTION

Nowadays an intensive development of advanced video surveillance systems can be observed. Cameras are installed in such places as railway stations, airports, shops, public administration buildings, schools, universities, museums and shopping centres. Analysing the obtained video streams is a big challenge. In most cases, it is carried out by an operator, who makes the decision about classifying the observed situations as dangerous or not. Moreover, the video streams are recorded and stored for later analysis.

The effort of scientists is now focused on developing algorithms which would make it possible to automatically analyse the video sequences from surveillance systems and detect dangerous situations such as abandoned objects (a potential bomb), intrusion into a perimeter protection zone (e.g. museums), theft (disappearance of a protected object), fights, medical emergencies, sudden gatherings of people, etc.

In this paper a system for moving object detection implemented in a Xilinx Spartan 6 FPGA reconfigurable device is presented. It is based on two algorithms: background subtraction and moving object segmentation using three criteria: lightness, colour and texture. Moreover, additional modules had to be implemented in the FPGA device: communication with the camera, conversion of the colour signal from the CMOS sensor (the so-called Bayer transformation), conversion from RGB to the CIE Lab colour space, an efficient external memory controller and a DVI controller to allow displaying the result on a PC monitor. Additionally, some custom hardware was designed to allow transmission between the camera and the FPGA board (a Camera Link to FMC card expansion module). In Chapter 2 the overall concept of the system is described, and in Chapters 3-7 the particular modules are explained. At the end, results and conclusions are presented.

*The work presented in this paper was supported by AGH-UST grant 11.11.120.612.

2. OVERVIEW OF THE SYSTEM

The proposed system consists of a digital camera, a Xilinx development board SP605 with a Spartan 6 FPGA device (XC6SLX45T) and an LCD monitor. In the conception phase it was decided that all computation should be done inside the FPGA and the moving objects mask displayed on the monitor. The idea is presented in Figure 1. The system is based on the following elements:
• SI 1920HD camera with Camera Link interface from Silicon Imaging,
• Camera Link to FMC (FPGA Mezzanine Card) conversion board designed and assembled by one of the authors, which allowed connecting the camera to the development board,
• SP605 evaluation board with Spartan 6 FPGA from Xilinx as the main computing platform on which all image processing operations are carried out,
• LCD monitor for visualising the results.

Fig. 1: Overview of the moving object detection system

A functional diagram of the modules implemented in the FPGA device is presented in Figure 2. Most important are:
• cl2vga - module for reading the video signal received from the camera by the FMC expansion card and formatting it to the VGA standard,
• bayer2rgb - colour restoration module (executing the Bayer transform algorithm),
• rgb2lab - block for changing the colour space from RGB to CIE Lab,
• bg - hardware realisation of a multimodal background generation algorithm,
• seg - module for moving objects segmentation,
• dvi - part which is responsible for transforming the video output to the format accepted by the dvi encoder (thus presenting the result on the LCD monitor),
• regs - registers holding parameters needed for the algorithm,
• uart - block responsible for transmission of data between the registers and a PC (run-time adjustments),
• mem ctrl - RAM controller with FIFO buffers,
• hw ddr ctrl - hardware DDR3 RAM memory controller.

3. PROCESSING DATA FROM CAMERA

3.1. Receiving data from camera (cl2vga)

This module is responsible for receiving the video stream transmitted by the SI 1920HD camera on the Camera Link bus. In the next step, the received signal is converted to a format which can be sent to a PC monitor directly. In order to achieve this, the synchronisation signals, data position and clock speed from the camera have to be adjusted to meet the 640x480, 60 fps VGA standard. The SI 1920HD, due to constraints of the Camera Link bus, is not able to transmit pixels with a single pixel clock lower than 40MHz. Because the VGA standard demands a 25MHz clock, either a frame buffer or a 50MHz clock with 1280x480 resolution (with a mechanism of dropping every second pixel and dividing the clock by 2) should be used. The second solution was chosen because it saves hardware resources of the FPGA (no need for a frame buffer memory).

3.2. Colour conversion to RGB (bayer2rgb)

An image sensor with Bayer filter topology [1] is used in the SI 1920HD camera. In this approach, each sensing cell is covered with a filter which passes only one colour component (red, green or blue), see Figure 3. The problem of colour restoration from Bayer matrices was described in [2]. According to the description in 3.1, the video stream with resolution

1280x480 pixels has to be reduced to 640x480 by dropping every second pixel. If this approach is applied literally to the Bayer matrix (Figure 4 a) and b)), the information about one of the colour components is lost. To prevent this, the pixels have to be dropped in a different manner (Figure 4 c) and d)).

Fig. 3: Bayer filter and four variants of neighbourhood


Fig. 2: Scheme of the system implemented in FPGA

In the described system, the option presented in Figure 4 c) is used. It allows the resolution to be reduced without losing the information about colour. The only side effect is a slight deformation of lines and edges.
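The exact pixel-drop pattern of Figure 4 c) is not reproduced here; purely as a generic illustration of colour restoration from a Bayer mosaic, the sketch below reconstructs RGB from RGGB cells by simple nearest-neighbour replication. The RGGB ordering and the function name are assumptions, and this is not the interpolation used in bayer2rgb.

```python
import numpy as np

def demosaic_rggb_nearest(bayer):
    """Very simple nearest-neighbour demosaic of an RGGB Bayer image.
    bayer: HxW single-channel array with even H and W; returns HxWx3 RGB."""
    h, w = bayer.shape
    rgb = np.zeros((h, w, 3), dtype=bayer.dtype)
    r = bayer[0::2, 0::2]                 # red samples
    g1 = bayer[0::2, 1::2]                # green samples on red rows
    g2 = bayer[1::2, 0::2]                # green samples on blue rows
    b = bayer[1::2, 1::2]                 # blue samples
    cell = np.ones((2, 2), dtype=bayer.dtype)
    # replicate each 2x2 cell's samples over the whole cell
    rgb[:, :, 0] = np.kron(r, cell)
    rgb[:, :, 1] = np.kron(((g1.astype(np.uint32) + g2) // 2).astype(bayer.dtype), cell)
    rgb[:, :, 2] = np.kron(b, cell)
    return rgb
```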

3.3. RGB to CIE Lab conversion

When processing colour images, an important issue is the choice of a colour space. In [3] the authors presented research results on different colour spaces. They pointed out that for segmentation combined with shadow removal the best choices are the CIE Lab or CIE Luv colour spaces. In this implementation it was decided to use the CIE Lab space. In the CIE Lab system, the RGB triplets containing information about the intensity of each colour are replaced by the L, a, b parameters (L - luminance, a, b - chrominance). Conversion between RGB and CIE Lab is a two stage process [4, 5]. In the first step RGB is transformed to CIE XYZ according to the formula:
$$
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} =
\begin{pmatrix} 0.41245 & 0.35758 & 0.18042 \\ 0.21267 & 0.71516 & 0.07217 \\ 0.01933 & 0.11919 & 0.95023 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix} \qquad (1)
$$
The conversion from CIE XYZ to the CIE Lab colour space is described by the formula:
$$
L = 116 \cdot f(Y/Y_n) - 16, \quad
a = 500\,[f(X/X_n) - f(Y/Y_n)], \quad
b = 200\,[f(Y/Y_n) - f(Z/Z_n)] \qquad (2)
$$
where X_n = 0.950456, Y_n = 1, Z_n = 1.088754 are constants responsible for the white point and f(t) is given by the equation:
$$
f(t) = \begin{cases} t^{1/3} & \text{for } t > \left(\tfrac{6}{29}\right)^3 \\[4pt] \tfrac{1}{3}\left(\tfrac{29}{6}\right)^2 t + \tfrac{4}{29} & \text{otherwise} \end{cases} \qquad (3)
$$
In order to implement this conversion on an FPGA device, all multiplications were changed to fixed point multiplications and executed on DSP48 blocks of the Spartan 6. Because implementing the root function (equation 3) is not possible without using a lot of reconfigurable resources and would introduce a lot of latency, all further operations were moved to lookup tables (using the BRAM resources of the FPGA). Because X_n, Y_n, Z_n are constant, four tables were created in which the following values were stored:
$$
xlut(t) = 100\,f(t/X_n), \quad
ylut(t) = 100\,f(t/Y_n), \quad
zlut(t) = 100\,f(t/Z_n), \quad
llut(t) = 116\,f(t/Y_n) - 16 \qquad (4)
$$
In this way, the problem was transformed to a different form:
$$
L(X, Y, Z) = llut(Y), \quad
a(X, Y, Z) = 5 \cdot (xlut(X) - ylut(Y)), \quad
b(X, Y, Z) = 2 \cdot (ylut(Y) - zlut(Z)) \qquad (5)
$$
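A software model of this LUT-based conversion could look as follows (plain Python rather than the authors' Matlab model; the table depth of 4096 entries and the index rounding are illustrative assumptions):

```python
import numpy as np

XN, YN, ZN = 0.950456, 1.0, 1.088754
RGB2XYZ = np.array([[0.41245, 0.35758, 0.18042],
                    [0.21267, 0.71516, 0.07217],
                    [0.01933, 0.11919, 0.95023]])

def f(t):
    """CIE Lab companding function, Eq. (3)."""
    t = np.asarray(t, dtype=np.float64)
    return np.where(t > (6 / 29) ** 3, np.cbrt(t), t * (29 / 6) ** 2 / 3 + 4 / 29)

def build_luts(depth=4096):
    """Pre-compute xlut, ylut, zlut, llut of Eq. (4) for arguments in [0, 1]."""
    t = np.linspace(0.0, 1.0, depth)
    return (100 * f(t / XN), 100 * f(t / YN), 100 * f(t / ZN), 116 * f(t / YN) - 16)

def rgb_to_lab_lut(rgb, luts, depth=4096):
    """Per-pixel conversion using Eqs. (1) and (5); rgb values in [0, 1].
    Indices are clipped, so Z slightly above 1 saturates at the last table entry."""
    xlut, ylut, zlut, llut = luts
    xyz = rgb @ RGB2XYZ.T
    idx = np.clip((xyz * (depth - 1)).round().astype(int), 0, depth - 1)
    L = llut[idx[..., 1]]
    a = 5.0 * (xlut[idx[..., 0]] - ylut[idx[..., 1]])
    b = 2.0 * (ylut[idx[..., 1]] - zlut[idx[..., 2]])
    return np.stack([L, a, b], axis=-1)
```

In hardware the same tables reside in BRAM and the fixed-point multiplications are executed on DSP48 blocks, as described above.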

The block diagram of the RGB to CIE Lab conversion module is presented in Figure 5. The implementation was made using Verilog HDL. The behavioural simulation results are fully compliant with the software model created in Matlab 2009b. The operating frequency reported by the synthesis tool is 252 MHz.

Fig. 5: RGB to CIE Lab conversion module

4. BACKGROUND GENERATION

Background generation is one of the most commonly used techniques for movement detection. The general idea is to find moving objects by subtracting the current video frame from a reference background image. For almost 20 years of research in this area a lot of different algorithms were proposed. A very good review on this methods is presented in [6]. When implementing background generation in FPGA devices is considered, the difference between recursive and not recursive algorithms has to be stated. The non recursive methods such as mean, median from previous N frames or W4 algorithm are highly adaptive and are not dependent to history beyond N frames. Their main disadvantage is that they demand a lot of memory to store the data (e.g. if frame buffer N = 30 frames, then for a RGB colour images at the resolution of 640x480 about 26MB of memory are needed). In recursive techniques the background model is updated only according to current frame. The main advantage of this methods is that they have small memory complexity and the disadvantage is that such systems are prone to noise generated in background (they are conserved for a long time). Some recursive algorithms are: sigma-delta method, single Gauss distribution approach, Kalman filter, Multiple of Gaussian (MOG) distribution method and clustering. In this point it is important to notice, that recursive methods can work either with one background model (sigma-delta, single Gauss distribution) or can use multiple background models (MOG, clustering). Multimodal methods are better suited for scenes with dynamic changes of lighting conditions (e.g. shadows casted by clouds) and resolve the problem of background initialisation when there are moving objects on the scene. Some work related to background generation in FPGA can be found. In [7] an implementation of the MOG algorithm is presented (with some changes made due to FPGA device structure). Three different versions are described: an unimodal grey-scale background model, an unimodal colour

(RGB colour space) background model and a bi-modal greyscale background model. All methods are implemented on a Xilinx Virtex II device with 4 external ZBT RAM memory banks. The reported maximal frequency is about 65 MHz. Authors in [8] implemented the MOG algorithm (in RGB colour space) on a Virtex 2 1000 FPGA. They described a module which performs all MOG stages, however the research was finished at a simulation phase. In the work [9] another background subtraction system was described. It used the Horpraset method (unimodal) in RGB colour space. The design was implemented on two FPGA devices and operates in real-time at resolution 320 x 240 and 30 fps. In [10] a FPGA based road traffic detector was described. It used a unimodal, grey-scale background generation (pixel averaging method). The application operates in real-time at resolution 720 x 576 and 25 fps. In [11] a multimodal moving object detection method using a grey-scale, multimodal sigma-delta approach was presented. The system was implemented on a Xilinx Spartan II FPGA device and operates at 265 MHz, although the communication with external RAM is not considered. 4.1. Algorithm Analysis of previous work, as well as preliminary research, have shown that the most crucial constraint that has to be dealt with when implementing background generation algorithm is the efficient external memory access. This is why during the conception phase the most important factor for choosing a particular algorithm was its memory complexity. Moreover, some assumptions were made: the algorithm should work with colour images and have a multimodal background representation. The use of colour should improve the quality of image analysis and the multimodal approach should allow the background model to be adaptive both to rapid and slow light condition changes. From the two described in literature multimodal methods: MOG and clustering, the second one was chosen because of simpler computations. In the presented implementation some changes to the algorithm described in [12] were introduced. The first one was picking the CIE Lab colour space as better suited for shadow detection and removal (according to [3]). In the CIE Lab lightness L is a number from range 0-100 (7 bits) whereas chrominance components ab range from -127 to 127 (8 bits). Because on the SP605 board the memory controller can have a maximum data width of 128 bit, the number of background models was set to K=4, with a representation for a single model of 9 bits for L (7 for integer part, 2 for fractional part), 2 times 8 bits for a and b and 6 bits for model weight. Summing up, one model consumes 31 bits to represent a pixel of the background, four models use 124 bits. When processing a video stream the following steps are carried out (independently for each pixel): A) calculating the distance between new pixel and each of the

models. Distances are computed separately for luminance and chrominance based on the equations:
$$
d_L = |L_F - L_{Mi}| \qquad (6)
$$
$$
d_C = |Ca_F - Ca_{Mi}| + |Cb_F - Cb_{Mi}| \qquad (7)
$$

where L_F, Ca_F, Cb_F are the pixel values for the current frame and L_{Mi}, Ca_{Mi}, Cb_{Mi} the pixel values from the i-th background model.
B) choosing the model which is closest to the actual pixel and checking if for this pixel d_L and d_C are smaller than the defined thresholds (luminanceTh and colourTh).
C) in case the model fulfils the conditions from B), it is updated using the equation:
$$
M_{act} = \alpha_1 F + (1 - \alpha_1) M \qquad (8)
$$

where M is the background model, M_act the updated background model, F the current frame and α_1 a parameter controlling the background update rate. Moreover, the weight of the model is incremented. Because the weight representation is 6 bits long, its maximum value is 63. In the next step all models are sorted. It can be noticed that some simplification of the sorting algorithm can be made based on the assumption that a model whose weight was incremented can change its position only with the model before it. In most cases this assumption is true and allows the sorting to be simplified both in software and in hardware.
D) in case no model matched the actual pixel, in the original implementation the model with the smallest weight was replaced with the actual pixel and its weight was cleared. In the test phase it turned out that such an approach results in too fast accommodation of moving objects into the background (e.g. a car that stopped for a moment). Therefore it was decided to introduce a modification: the update scheme from equation 8 is used with a parameter α_2 instead of directly replacing the value. Another modification introduced to the algorithm was omitting the moving object detection module proposed in the original implementation. The moving object mask is computed in a different module (described in Chapter 5) and the background generation module provides only information about the background value at each location. A model is considered valid only if its weight is greater than a threshold (weightTh). The last model (the one with the smallest weight) is not considered as a possible candidate for the background, as it acts as a buffer between the current frame and the background model.
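For reference, the per-pixel steps A)-D) can be written out as the following software-model sketch (plain Python; the threshold and α defaults, the function name, and the use of floating point instead of the 31-bit fixed-point model representation are assumptions):

```python
import numpy as np

K = 4  # number of background models per pixel

def update_pixel(models, weights, pix, alpha1=0.75, alpha2=0.125,
                 luminance_th=10.0, colour_th=20.0):
    """One update step for a single pixel.
    models : (K, 3) float array of (L, a, b) per model, sorted by descending weight.
    weights: (K,) array of model weights (0..63).
    pix    : (3,) array-like with the (L, a, b) value of the current frame."""
    pix = np.asarray(pix, dtype=np.float64)
    dL = np.abs(pix[0] - models[:, 0])                                   # Eq. (6)
    dC = np.abs(pix[1] - models[:, 1]) + np.abs(pix[2] - models[:, 2])   # Eq. (7)
    best = int(np.argmin(dL + dC))                                       # step B)
    if dL[best] < luminance_th and dC[best] < colour_th:
        # step C): blend the matching model towards the pixel, increment its weight
        models[best] = alpha1 * pix + (1 - alpha1) * models[best]        # Eq. (8)
        weights[best] = min(weights[best] + 1, 63)
        # one bubble step suffices: a model can only overtake its predecessor
        if best > 0 and weights[best] > weights[best - 1]:
            models[[best - 1, best]] = models[[best, best - 1]]
            weights[[best - 1, best]] = weights[[best, best - 1]]
    else:
        # step D) (modified): blend the weakest model towards the pixel with alpha2
        models[-1] = alpha2 * pix + (1 - alpha2) * models[-1]
    return models, weights
```

The background value reported for a pixel would then be taken from the highest-weight model whose weight exceeds weightTh.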

4.2. Hardware Implementation

The block diagram of the proposed background generation module is shown in Figure 6. It was described in VHDL with some parts automatically generated by the Xilinx IP Core Generator (multipliers, delay lines). Description of the used submodules:
• || || - computing the distance between the current pixel and a background model [A)]
• D - delay (for synchronisation of pipelined operations)
• MINIMUM DISTANCE - choosing the background model with the smallest distance to the current pixel [B)]
• UPDATE SELECT - picking the right background model to be updated
• UPDATE MODEL - implementation of equation 8 and the weight update. For the α parameter, which is from the range [0;1), a fixed point, 10 bit representation was used. Multiplying is realised with the hardware DSP48 blocks present in the Spartan 6 device
• SORT MODELS - sorting of the models and choosing the actual background representation

Fig. 6: Block diagram of background generation module

5. MOVING OBJECT SEGMENTATION

The most commonly used method to detect moving objects is based on thresholding the differential image between the current frame and the background model. This approach was also exploited in this work; however, it was decided to use not only information about lightness and colour, but also about texture. Moreover, the algorithm was constructed in a way that minimises the impact of shadows on the final segmentation result. Shadow removal is based on two assumptions: a shadow does not change the colour but only the lightness (tests conducted in daylight showed that this is not entirely true) and a shadow does not affect the texture of a surface. Shadow removal was implemented based on the mentioned assumptions, utilising the results presented in [3], using the CIE Lab colour space and SILTP [13] as a texture descriptor. The FPGA platform imposed some constraints on the choice of the algorithm, which must work only on local features of the image (a single pixel or a 3x3, 5x5 context). Unfortunately, these constraints have a negative impact on the efficiency of the shadow removal algorithm. Most of the methods described in the literature work on the level of objects (after the phase of connected component labelling). In the presented implementation, shadow removal is not a priority; nevertheless, the described method is able to remove shadows in many cases, which significantly improves the moving object detection results.

5.1. Hardware Implementation

The only work known to the authors (after an INSPEC database search) regarding the implementation of shadow detection in FPGA devices is [14]. It describes a method of detecting shadows using the YCbCr colour space and information about edges. In this article a different method is proposed. It is based on three parameters: lightness (the L component from CIE Lab), colour (the ab components) and the SILTP texture descriptor. The distance between the current frame and the background model is computed according to equations 6 and 7. The definition of the SILTP texture descriptor is presented in [13]. The values of all three parameters (lightness, colour, texture) are normalised using a method similar to the one described in [15]:
$$
d_N = \begin{cases} 1 & \text{if } d > \max(d) \cdot \beta \\ \dfrac{d}{\max(d)} & \text{otherwise} \end{cases} \qquad (9)
$$
where β is a parameter in the range (0;1] (0.75 was used in the experiments) and max(d) is the maximal value of the measure for a specific image (in the hardware implementation the maximum value from the previous frame is used). Based on the normalised values, a combination of the three measures was proposed:
$$
LCT = w_L\, d_{NL} + w_C\, d_{N(ab)} + w_T\, SILTP_N \qquad (10)
$$
where w_L, w_C, w_T are weights (the values used, determined by experiments, are 1, 3, 2), d_{NL} is the normalised difference of lightness, d_{N(ab)} the normalised difference in colour and SILTP_N the normalised SILTP descriptor. In the last step of the algorithm the LCT parameter is thresholded with a fixed threshold (0.9375 in the conducted experiments). Additionally, 5x5 median filtering was chosen for the final image processing. A block schematic of the segmentation module is presented in Figure 7. It was described in the VHDL language with some IP Cores generated in the Xilinx Core Generator (multiplication, delay lines). The design runs at a fixed resolution of 640x480, which determines the delay line length used in the SILTP and MEDIAN 5x5 blocks as well as the final resource utilisation and latency introduced by the module. Static timing analysis reported that the module is able to work with a 390 MHz clock.

Fig. 7: Diagram of the segmentation module

Modules description:


• || || L and || || ab - computing distance between current frame and background according to equations 6 and 7
• SILTP - module for SILTP computation
• D - delay (for synchronization of pipelined operations)
• max dL, d(ab), SILTP - modules for computing maximal values for previous frame
• NORM - module for normalising values into range [0;1]
• INTEGRATION - module for integrating lightness, colour and texture information, and final thresholding operation (object/background decision)
• MEDIAN 5x5 - binary median with 5x5 window
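The decision path of equations 9 and 10 with the fixed threshold can be summarised by the sketch below; the SILTP distance is assumed to be computed elsewhere and all names are illustrative:

```python
import numpy as np

W_L, W_C, W_T = 1.0, 3.0, 2.0      # weights determined by the experiments
BETA = 0.75                         # normalisation parameter of Eq. (9)
LCT_TH = 0.9375                     # final threshold

def normalise(d, d_max):
    """Eq. (9): clip to 1 above beta * max, otherwise scale by the previous-frame maximum."""
    d_max = max(d_max, 1e-9)                     # avoid division by zero
    return np.where(d > d_max * BETA, 1.0, d / d_max)

def movement_mask(dL, dab, siltp_dist, prev_max):
    """dL, dab, siltp_dist: per-pixel distances to the background (same shape).
    prev_max: maxima of the previous frame, e.g. {'L': .., 'ab': .., 't': ..}."""
    lct = (W_L * normalise(dL, prev_max['L'])
           + W_C * normalise(dab, prev_max['ab'])
           + W_T * normalise(siltp_dist, prev_max['t']))       # Eq. (10)
    new_max = {'L': float(dL.max()), 'ab': float(dab.max()), 't': float(siltp_dist.max())}
    return lct > LCT_TH, new_max    # a 5x5 binary median would follow in hardware
```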


6. EXTERNAL MEMORY OPERATIONS

The proposed background generation needs to read and store 128 bits of information (2 x 16 bytes = 32 B) in each clock cycle in which there are valid data on the pixel path. Because the system works with an image resolution of 640x480 pixels at 60 frames per second, this results in a data throughput of about 590 MB/s (640 * 480 * 60 fps * 32 B). The SP605 board is equipped with DDR3 400MHz memory (16-bit data bus width). The theoretical maximal throughput that can be achieved with this type of memory is 400MHz * 2 (DDR) * 2 bytes, which gives 1600MB/s. One has to remember, however, that DDR memory is a dynamic memory and the access time is not constant. This is because of the burst access scheme, in which a specific row and column of the memory have to be opened before accessing it; also, each bank has to be refreshed (by issuing a special refresh command) once every 64 ms (during the refresh time, the bank cannot be accessed). This is why the data buffer in the hardware controller, which is 64 words of 128 bits deep, is not large enough to provide a constant data flow at the level of 590MB/s for the background generation module. To overcome this problem, another two FIFO buffers were added between the hardware memory controller and the rest of the system (1025 words deep). The additional buffers allow the RAM memory to be used even when there are no valid data but only synchronisation in the VGA signal (about 20% of the time). The implemented user memory controller keeps track of both the read and write address and checks the buffer level in the hardware controller. It also moves data between these buffers and the large FIFO buffers which are connected directly to the background generation modules. The block diagram is presented in Figure 8.

Fig. 8: RAM controller block diagram
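The bandwidth figures quoted above can be checked with a short back-of-the-envelope script (not part of the design):

```python
# Required vs. theoretical DDR3 throughput for the background models.
WIDTH, HEIGHT, FPS = 640, 480, 60
BYTES_PER_PIXEL = 2 * 16            # one 128-bit read plus one 128-bit write

required = WIDTH * HEIGHT * FPS * BYTES_PER_PIXEL      # bytes per second
theoretical = 400e6 * 2 * 2                            # 400 MHz, DDR, 16-bit bus

print(f"required   : {required / 1e6:.0f} MB/s")       # ~590 MB/s
print(f"theoretical: {theoretical / 1e6:.0f} MB/s")    # 1600 MB/s
print(f"utilisation: {required / theoretical:.0%}")    # ~37 % before refresh/turnaround
```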

Table 1: Project resource utilisation
  Resource   Used   Available   Percentage
  FF         4737   54576       8 %
  LUT6       5188   27288       19 %
  SLICE      1938   6822        28 %
  DSP48      27     58          58 %
  BRAM       24     116         20 %

7. SYSTEM INTEGRATION

All modules described in chapters 3 - 7 were integrated according to the block diagram presented in Figure 2. The project was synthesised for a Spartan 6 (XC6SLX45T3FGG484) FPGA device using the Xilinx ISE 13.1 Design Suite. Simulations performed in ModelSim 6.5c (behavioural and after place and route) confirmed that the hardware modules are fully compliant with the software models described in Matlab 2009b. The reported maximal operating frequency (after the place and route phase) was 119MHz, which is more than enough for processing the video stream (pixel clock rate of 25 MHz). The power consumption reported by the Xilinx XPower Analyzer for the device (On-Chip) is about 0.9 W. The resource usage is presented in Table 1. It is worth noticing that, based on the data from Table 1, even a small FPGA device from the Spartan 6 series can run quite a complex vision system with a resource utilisation of about 30% of the available resources. The remaining logic can be used for implementing initial image filtration (elimination of camera noise), implementing median filtering between the background generation and segmentation modules, or other image processing operations, except for those which need external memory access.

8. RESULTS AND CONCLUSIONS

8.1. Background generation

The implemented method was tested, by simulation, on multiple sequences, which can be described as demanding (windy day with rapid changes of lightning). Thanks to multiple models, the algorithm is able to reduce the negative impact of small moving objects (branches moving due to wind). During test phase it turned out that choosing the right precision and model representation is crucial for the algorithm

which is updating the background according to equation 8. The picked representation (9 bits for L, 8 bits for a and b) turned out to be appropriate only for high α (background update rate)values (0.75, 0.125). For smaller alphas, the background is not properly updated due to truncation errors. To eliminate this problem a compromise should be made. Either using a higher number of background models with small precision or increasing precision and decreasing number of models. In our project it was decided to use 4 background models with small precision. On SP605 board memory throughput and port width is close to maximum, thus increasing the precision would result in decreasing the number of background models. The described system can be viewed as a further step in the development of moving objects detection on FPGA platforms. In comparison to previous designs (Section 4) it has the following attributes: calculations in colour (CIE Luv colour space), multimodal background model (4 models), shadow removal (using information about lightness, colour and texture) and it is a real-time operating hardware system. The mentioned features allow to recognise the work as innovative. Further research will concentrate on finding the right precision for particular α parameter. Moreover, an attempt would be made to move the system to different hardware platform (with wider and faster memory access capabilities).
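A minimal sketch of the truncation effect described in Section 8.1 follows. Equation 8 is not reproduced in this excerpt, so a standard running-average update B <- (1 - α)B + αI is assumed; the function name and the test values are illustrative, only the 9-bit L / 8-bit a, b word lengths come from the text.

```python
# Illustration of why small alpha values fail with truncated fixed-point storage.
def update_fixed_point(background: int, pixel: int, alpha: float, bits: int) -> int:
    """One assumed background-update step, result truncated to `bits` bits."""
    updated = (1.0 - alpha) * background + alpha * pixel
    return min(int(updated), (1 << bits) - 1)    # truncate towards zero

for alpha in (0.5, 0.125, 0.03125, 0.0078125):
    b_new = update_fixed_point(background=100, pixel=110, alpha=alpha, bits=8)
    print(f"alpha={alpha:<10} background 100 -> {b_new}")
# For small alpha the per-frame correction falls below one LSB and is lost by
# truncation, so the stored background never converges to the scene; hence the
# trade-off between word length and the number of background models.
```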



Fig. 9: Segmentation example, a) current frame, b) background, c) difference in lightness, d) difference in colour, e) SILTP texture descriptor, f) integration of information


Fig. 10: Sample shadow removal. Correct removal (no strong light): a) scene, b) moving object mask. Incorrect removal (strong light, deep shadows): c) scene, d) moving object mask

8.2. Moving object segmentation
The moving object segmentation method proposed in this work integrates three sources of information (lightness, colour and texture) in order to obtain better results and to allow shadow removal. The experiments pointed out situations in which this approach gives better results than using lightness alone. The results also confirm that using a colour background model gives better results (although the memory complexity is 3 times higher). Figure 9 presents such a situation. In Fig. 9c it can be observed that the lightness of the person's hair and trousers is almost the same as that of the background, and it is impossible to propose a good threshold for the whole silhouette. The information about colour (Figure 9d) allows a valid segmentation of the head (hair) and the trousers. Texture (Figure 9e) is, in the described case, only additional information. Integration of all features according to equation 10 allows a proper segmentation of the silhouette (Fig. 9f). Shadow removal performance is heavily limited by the use of only local information (a pixel and its small context) and in many cases it fails; the literature seems to confirm this observation. However, it is possible to point out situations (Figure 10 a) and b)) where the proposed method is able to reduce the impact of shadows. In the situations presented in Figure 10 c) and d), with stronger light, the shadows become deeper and the proposed algorithm is not able to segment the silhouettes properly. It is also worth mentioning that the described approach is less sensitive to the choice of the final binarization threshold than methods using only one feature (e.g. lightness).
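A minimal sketch of this integration step is given below. Equations 6, 7 and 10 and their weights are not reproduced in this excerpt, so a weighted sum of normalised distances followed by a global threshold is assumed; the weights, threshold and test arrays are illustrative, not the paper's values.

```python
import numpy as np

W_L, W_AB, W_TEX = 0.3, 0.4, 0.3      # illustrative weights, not from the paper
THRESHOLD = 0.5                        # illustrative global threshold

def segment(d_lightness, d_colour, d_texture):
    """Fuse normalised per-pixel distances (each in [0, 1]) into a binary mask."""
    score = W_L * d_lightness + W_AB * d_colour + W_TEX * d_texture
    return (score > THRESHOLD).astype(np.uint8)

# Example: colour alone separates the object even where lightness fails.
dL   = np.array([[0.1, 0.2], [0.9, 0.8]])
dab  = np.array([[0.9, 0.8], [0.9, 0.1]])
dtex = np.array([[0.5, 0.6], [0.2, 0.1]])
print(segment(dL, dab, dtex))
```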

8.3. System
The described system for detecting moving objects was integrated and tested in a real-life environment. It works at the targeted resolution (640x480) at 60 frames per second on colour images in the CIE Lab colour space, and it behaves according to expectations. An example segmentation result is presented in Figure 11: the right LCD monitor displays the real image and the left one the moving object mask.

9. SUMMARY
A system for moving object segmentation with shadow removal, implemented in an FPGA device, was described in this article. It consists of a number of different hardware modules described in Verilog and VHDL: camera communication, Bayer transform, RGB to CIE Lab conversion, background generation, segmentation, external RAM controller, serial communication with a PC, and display of the result on an LCD monitor. A multimodal background generation algorithm, combined with moving object segmentation based on three features (lightness, colour and texture), was implemented on the hardware platform.

Fig. 11: Working system (60 fps, 640x480 resolution)

Finally, a real-time system able to process 60 frames per second at a resolution of 640x480 pixels was created. The results show that an FPGA device is well suited for implementing sophisticated image processing algorithms in video surveillance systems.

10. REFERENCES
[1] B. E. Bayer, "Color imaging array," US Patent No. 3971065, 1976.
[2] J.M. Perez, P. Sanchez, and M. Martinez, "Low-cost Bayer to RGB bilinear interpolation with hardware-aware median filter," in Electronics, Circuits, and Systems, ICECS 2009, Dec. 2009, pp. 916-919.
[3] Csaba Benedek and Tamás Szirányi, "Study on color space selection for detecting cast shadows in video surveillance," Int. J. Imaging Syst. Technol., vol. 17, pp. 190-201, October 2007.
[4] "ITU-R recommendation BT.709, basic parameter values for the HDTV standard for the studio and for international programme exchange," ITU, Geneva, CH, 1990.
[5] "ICC.1:2004-10 specification (profile version 4.2.0.0): image technology colour management, architecture, profile format, and data structure," 2004.
[6] S. Y. Elhabian, K. M. El-Sayed, S. H. Ahmed, "Moving Object Detection in Spatial Domain using Background Removal Techniques - State-of-Art," Recent Patents on Computer Science, vol. 1, pp. 32-34, 2008.
[7] K. Appiah and A. Hunter, "A single-chip FPGA implementation of real-time adaptive background model," in Field-Programmable Technology, 2005 IEEE International Conference on, Dec. 2005, pp. 95-102.
[8] Hongtu Jiang, H. Ardo, and V. Owall, "Hardware accelerator design for video segmentation with multi-modal background modelling," in Circuits and Systems, ISCAS 2005, IEEE International Symposium on, May 2005, pp. 1142-1145, vol. 2.
[9] Jozias Oliveira, André Printes, R. C. S. Freire, Elmar Melcher, and Ivan S. S. Silva, "FPGA architecture for static background subtraction in real time," in Proceedings of the 19th Annual Symposium on Integrated Circuits and Systems Design, SBCCI '06, New York, NY, USA, 2006, pp. 26-31, ACM.

[10] M. Gorgon, P. Pawlik, M. Jablonski, and J. Przybylo, "Pixelstreams-based implementation of videodetector," in Field-Programmable Custom Computing Machines, FCCM 2007, 15th Annual IEEE Symposium on, April 2007, pp. 321-322.
[11] M.M. Abutaleb, A. Hamdy, M.E. Abuelwafa, and E.M. Saad, "FPGA-based object-extraction based on multimodal sigma-delta background estimation," in Computer, Control and Communication, IC4 2009, 2nd International Conference on, Feb. 2009, pp. 1-7.
[12] D. Butler, S. Sridharan, and V.M. Bove Jr., "Real-time adaptive background segmentation," in Acoustics, Speech, and Signal Processing, ICASSP '03, IEEE International Conference on, April 2003, vol. 3, pp. III-349-352.
[13] Rui Qin, Shengcai Liao, Zhen Lei, and S.Z. Li, "Moving cast shadow removal based on local descriptors," in Pattern Recognition (ICPR), 2010 20th International Conference on, Aug. 2010, pp. 1377-1380.
[14] M. Musial, D. Dybek, and M. Wojcikowski, "Hardware realization of a shadow detection algorithm in FPGA," in Information Technology (ICIT), 2010 2nd International Conference on, June 2010, pp. 201-204.
[15] Liyuan Li and M.K.H. Leung, "Integrating intensity and texture differences for robust change detection," IEEE Transactions on Image Processing, vol. 11, no. 2, pp. 105-112, Feb. 2002.

An Approach to Self-Learning Multicore Reconfiguration Management Applied on Robotic Vision
Walter Stechele 1), Jan Hartmann 2), Erik Maehle 2)
1) Technische Universität München, 2) Universität zu Lübeck
[email protected]

Abstract
Robotic Vision combined with real-time control imposes challenging requirements on embedded computing nodes in robots, exhibiting strong variations in computational load due to dynamically changing activity profiles. A reconfigurable Multiprocessor System-on-Chip offers a solution by efficiently handling the robot's resources, but reconfiguration management seems challenging. The goal of this paper is to present first ideas on self-learning reconfiguration management for reconfigurable multicore computing nodes with dynamic reconfiguration of soft-core CPUs and HW accelerators, to support dynamically changing activity profiles in Robotic Vision scenarios.

Keywords: Multicore, Reconfiguration, Robotic Vision

1. Introduction
Platform-based design [KK00, ASV04, KK09] is widely applied in embedded computing, exploiting multicore processor systems on chip (MPSoC) due to their enhanced performance/power capabilities. Recently, it has been extended to reconfigurable platforms [PTW10, IRQ09] in order to adapt to dynamically changing workloads. FPGA seems a desirable platform in terms of power/performance/flexibility/cost for medium-volume applications like robotics. Design tools and languages, e.g. UML/MARTE, ImpulseC, CatapultC, EXPRESSION and ArchC, support design-time architectural exploration, but run-time strategies for dynamic resource and power management of reconfigurable platforms are missing. Early work on multicore programming and multicore compilers, including scheduling algorithms for multiprogramming in hard real-time [LL73],

parallelizing programs for multiprocessors [RS89], and static scheduling algorithms for multiprocessors [YKK99], has shown that static scheduling is not sufficient for compute-intensive applications. Iterative compilation for heterogeneous reconfigurable cores [BOB98] shows good performance gains, but the search space is too large for practical applications. Machine learning for multicore resource management has been investigated in [AOB06], with results being applied in the Milepost compiler [GCC08]. However, thread distribution requires offline training for each hardware configuration [OB09, BB08], which seems not applicable to dynamically reconfigurable MPSoC with multiple soft cores and hardware accelerators. Thread distribution for CellBE programmable HW accelerator engines has been investigated in [OBN08], but only in a static configuration; dynamically reconfigurable engines have not been covered so far. Performance prediction for parallel programs was investigated in [EI05], and [SC06] proposes a "Hill Climbing" algorithm for multithread resource allocation, applying online learning, but no investigations on dynamically reconfigurable MPSoC have been included. Some dynamic microarchitectural adaptivity was investigated for single-core CPUs in [VK09] for peak power management without any learning capabilities, in [AB01] for adaptive out-of-order issue queues, in [LB08] for temporal and spatial adaptivity of a single core, and in [MH03] with some learning, i.e. good configurations for a code section are kept for later reuse with the same code section. Current work on dynamic multicore microarchitectural adaptivity includes prediction of the right hardware configuration in time for multiple reconfigurable ARM cores through off-line training [OB10]. First, the application is monitored so that it can be detected when the program enters a new phase

of execution. Profiling the application on a predefined profiling configuration gathers characteristics of the new phase. These are fed as input into a machine learning model which gives a prediction of the best configuration to use. After the processor has been reconfigured, the application continues running until the next phase change is detected. In [BIM08] the management of shared resources in MPSoC has been investigated, i.e. shared caches and off-chip memory, taking into account bandwidth and power budgets. However, a Neural Network has been applied, which requires off-line training and seems not feasible for advanced MPSoC in dynamically changing applications. More recent work is focused on run-time scheduling and reconfiguration of Application-Specific Instruction Set Processors (ASIP) [BSH08], and on power management in multicore processors based on negotiating agents [EFH09]. As a conclusion for dynamic resource and power management, we observe that previous work exploits off-line training for multithread multicore resource allocation. However, online reconfiguration management is still an open issue; applying machine learning to reconfiguration management seems promising. In this paper, a new approach to self-learning multicore reconfiguration management for embedded robotic computing is introduced. In contrast to traditional machine learning techniques applied to robotic applications, where the robot learns its behavior, we will focus on learning to manage the robot's internal computing resources efficiently.

2. Robotic Vision Scenario Computer Vision in general deals with “Understanding of a scene”, including stereo vision, tracking, object and action recognition, focus of attention. Many problems have been solved in the past, e.g. extraction of low level visual features (corners, SIFT, optical flow), recognition and tracking of known objects, background subtraction for static cameras, structure-from-motion computed off-line. Current research in Computer Vision includes object detection, localization and classification/recognition, where “detection” means finding objects in images; “recognition” means identifying objects [KV10]. Robotic Vision deals with “the capability of a robot to visually perceive the environment and interact with it” [KV10], including Simultaneous Localization and Mapping (SLAM). Problems solved

include visual servoing related to known objects (relating the spatial velocity of the camera to the object position), and navigation in 3D environments based on laser scanners. Current research in Robotic Vision includes interaction with people and manipulation of objects, including action recognition, action representation based on human body models, imitation learning, extraction of object shape, and affordance-based object classification [KV10]. As a conclusion for Robotic Vision, we observe that many problems are still not solved for complex scenarios under real-time and power constraints. In a robotic scenario, soft real-time and hard real-time applications will contribute dynamically changing computational load, e.g. robotic vision and robot control. A Multiprocessor System-on-Chip (MPSoC) might contain reconfigurable processor arrays and heterogeneous RISC cores. Efficient resource utilization in MPSoC requires advanced reconfiguration planning. Various strategies for reconfiguration planning might be applied, e.g. central vs. distributed vs. self-organizing approaches. Specific robotic platforms for the application of the reconfiguration scheme presented in this paper are the biologically inspired hexapod robot OSCAR and the autonomous underwater vehicle (AUV) HANSE. The robot OSCAR is built for search and rescue as well as environmental monitoring missions [ElS10, LPC10], while the AUV HANSE is specifically designed for the Student Autonomous Underwater Challenge in Europe (SAUC-E) [Ost10]. These scenarios exhibit a wide array of hard and soft real-time tasks, while the computational power is severely limited by weight (OSCAR) and size (HANSE). Reconfiguration could thus significantly increase the performance of both robots. Fig. 1 shows a simplified task graph for a rescue robot application as intended for OSCAR. A real-time control loop for motion control is depicted on the left-hand side; two vision routines are depicted in the center and on the right-hand side. The video input is analyzed in two ways: (1) by an optical flow algorithm and motion analysis, in order to watch for hazards, and (2) by stereo vision, detection of Regions of Interest (ROI) for object hypothesis generation and verification, and motion planning, in order to interact with surrounding objects. Cooperating robots might contribute video and ROI information over a network connection. Tasks might be mapped onto an array of hardware accelerators (blue) and onto a cluster of RISC CPUs (green). One sample option of resource utilization for an array of hardware accelerators and a RISC cluster is

depicted in Fig. 2. Dependencies between tasks are shown according to the task graph from Fig. 1. Motion control is computed in fixed, reserved time slots on a RISC core; other tasks are scheduled over all available cores on a best-effort basis. Watching for hazards consists of a sequence of optical flow (on the array) and motion analysis (on RISC). An alarm might be triggered as a result of motion analysis, e.g. the detection of falling objects. Searching for objects of interest consists of a sequence of stereo vision and ROI detection (both on the array), followed by hypothesis verification and motion planning (both on RISC).
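The task graph and mapping described above can be encoded compactly; the sketch below is one illustrative way to do it. The data-structure layout and the best-effort dispatch helper are assumptions for illustration, not part of the paper.

```python
# Illustrative encoding of the Fig. 1 task graph and its mapping.
task_graph = {
    "optical_flow":      {"inputs": ["video"],                      "resource": "array"},
    "motion_analysis":   {"inputs": ["optical_flow"],               "resource": "risc"},
    "stereo_vision":     {"inputs": ["video", "network"],           "resource": "array"},
    "roi_detection":     {"inputs": ["stereo_vision"],              "resource": "array"},
    "verify_hypothesis": {"inputs": ["roi_detection"],              "resource": "risc"},
    "motion_planning":   {"inputs": ["verify_hypothesis"],          "resource": "risc"},
    "motion_control":    {"inputs": ["sensors", "motion_analysis"], "resource": "risc_reserved"},
}

def ready_tasks(done):
    """Best-effort dispatch: tasks whose producers have already finished."""
    external = {"video", "sensors", "network"}
    return [t for t, d in task_graph.items()
            if t not in done and all(i in done or i in external for i in d["inputs"])]

print(ready_tasks(done={"optical_flow", "stereo_vision"}))
```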


Fig. 1: Task graph for rescue robot scenario


Fig. 2: Resource utilization for an array of hardware accelerators (above) and a RISC cluster (below)

This simplified scenario shows the complexity of possible configurations and mappings. Efficient resource utilization should minimize the blank spaces in the resource utilization graph of Fig. 2. It is quite challenging to take runtime decisions on new configurations and to trigger the reconfiguration process, taking into account the reconfiguration cost in terms of latency and power consumption. Too many

reconfigurations might keep the system busy and contribute to power consumption without executing application tasks. Modules that are going to be used again soon might not be reconfigured, but clock/power gated instead. A suitable mechanism for dynamic partial reconfiguration of FPGA devices has been demonstrated in [CC10], with reconfiguration times in the range of milliseconds, which seems acceptable for Robotic Vision applications. Real-time robotic applications include hard real-time (HRT) and soft real-time (SRT) tasks. HRT tasks include motion control, while SRT tasks include computer vision algorithms, which use the majority of the computational power. Motion control typically uses inputs from computer vision algorithms, e.g. to identify obstacles on motion trajectories. If computer vision SRT tasks do not meet their deadlines, the robot can stop and continue later; a deadline violation in motion control, on the other hand, could lead to damaging the robot. We plan to approach the real-time problem in the following way: HRT and SRT tasks run in virtualized domains, and self-learning is applied to the SRT domains only, e.g. for computer vision. This is a restriction, but it makes sense nevertheless, because computer vision will use most of the computing power, so SRT massively contributes to the computing requirements. For further exploitation of self-learning multicore reconfiguration management in HRT tasks, [SHE06] has introduced real-time property verification, which might be combined with self-learning.

3. A New Approach to Self-Learning Multicore Reconfiguration Management
Fig. 3 shows an MPSoC with multiple computing nodes interconnected by a Network-on-Chip (NoC). Each computing node consists of a cluster of reconfigurable resources, including multiple CPU cores, hardware accelerators, local memory, and local interconnect. The configuration of each computing node is controlled by a reconfiguration manager, according to the current status and planned activities of the robot. Reconfiguration management will include the distribution of reconfigurable areas between soft-core CPUs and HW Engines of various sizes, and the management of energy budgets and power consumption for SW tasks and HW Engines, including dynamic voltage and frequency scaling, clock gating, power gating, and blanking of unused modules, i.e. writing a blank bitstream configuration.


Fig. 3: Proposed self-learning reconfiguration manager added to a reconfigurable computing node

The behavior of advanced robots is controlled by a three-layer cognitive architecture, with a high layer for reasoning and planning, a mid layer for cognition based on video and audio processing, and a low layer for sensor/motor control [TAD06]. In our new approach, the reconfiguration manager is supported by unsupervised machine learning mechanisms, based on Learning Classifier Systems (LCS) [SW95]. The classical LCS was modified and adapted to a hardware implementation [ZSH08], in order to allow fast evaluation of Learning Classifier Tables (LCT) within just one clock cycle per rule.

[Fig. 4 contents: each LCT rule has a Condition (e.g. CPU utilization, CPU power consumption, CPU temperature, high/mid/low-level cognition state), an Action (configure CPU 1..n, configure HW accelerator 1..n, or keep the current configuration) and a Fitness value giving the probability of rule selection; ternary condition strings such as 10010XX1010 are matched against monitor bits, and the fitness is updated after each application.]

Fig. 4: Rule table for reconfiguration manager

An LCT rule table consists of a set of rules with conditions, actions, and a fitness value for each rule, as shown in Fig. 4. Conditions are derived from

monitors within the three layers of the cognitive control architecture, including monitors for the current status and planned activities, as well as monitors inside the computing nodes, e.g. CPU utilization, CPU temperature, and power consumption. A condition entry in the LCT table may be represented by a specific value, a threshold or a value range. The same conditions may match multiple rules with different actions. In our case, an action might take a specific configuration from the configuration repository and start the reconfiguration process of a computing node, taking into account the reconfiguration cost (e.g. latency and energy, calculated according to [CC10]), or keep the configuration unchanged. Among all the rules matching the current conditions, the rule with the highest fitness value has the highest probability of being selected. After a rule has been applied, its fitness value is updated, based on a reward function representing the usefulness of this rule for the robot. Usefulness might be decided by the reasoning layer of the cognitive architecture, by observation of the current status of the robot, and by observation of load monitors inside the computing nodes. The reward function might include balanced load of CPU cores and hardware accelerators in order to avoid hot spots on the chip, meeting of deadlines, and minimization of idle times in order to exploit

Dynamic Voltage and Frequency Scaling, as well as low overall power consumption. With LCT tables, two mechanisms for machine learning are available. First, learning through fitness update enables the reconfiguration manager to reconfigure computing nodes so as to best match previously known situations. Assume the robot is moving in a well-known environment, where a certain sequence of activities is most likely, e.g. stereo vision followed by object detection. If this assumption holds true for some time, the rule that matches the condition "stereo vision completed" and leads to the action "reconfigure to object detection" will accumulate a high fitness value. Second, rule modification based on genetic operators enables the reconfiguration manager to adapt to previously unknown situations, e.g. changes of the environment. Assume a robot leaves a house and moves in a city; then the previously learned in-room behavior will not fit any more, and new configurations for localizing moving vehicles and traffic noise will be needed instead. Through genetic rule modification, new rules can be generated randomly, their fitness evaluated in the new situation, weak rules discarded and strong rules prioritized through fitness update.
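A minimal software sketch of this rule-selection and fitness-update loop is given below, assuming ternary condition matching (0/1/X as in Fig. 4), fitness-proportional selection and a simple additive reward. It is an illustration only, not the hardware LCT of [ZSH08]; the rule contents, action names and reward values are invented.

```python
import random

rules = [
    {"cond": "1XX0", "action": "reconfigure_to_object_detection", "fitness": 52},
    {"cond": "10X1", "action": "load_stereo_vision_accelerator",  "fitness": 35},
    {"cond": "XXXX", "action": "keep_configuration",              "fitness": 23},
]

def matches(cond: str, monitor_bits: str) -> bool:
    """Ternary match: 'X' is a don't-care position."""
    return all(c in ("X", m) for c, m in zip(cond, monitor_bits))

def select_rule(monitor_bits: str):
    """Roulette-wheel selection among matching rules, weighted by fitness."""
    candidates = [r for r in rules if matches(r["cond"], monitor_bits)]
    total = sum(r["fitness"] for r in candidates)
    pick = random.uniform(0, total)
    for r in candidates:
        pick -= r["fitness"]
        if pick <= 0:
            return r
    return candidates[-1]

def apply_reward(rule, reward: int):
    """Positive reward if the chosen configuration proved useful
    (deadlines met, balanced load, low power); fitness never drops below 1."""
    rule["fitness"] = max(1, rule["fitness"] + reward)

r = select_rule("1010")
apply_reward(r, reward=+5 if r["action"] != "keep_configuration" else -1)
print(r["action"], r["fitness"])
```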

4. Outlook and limitations Although the proposed LCT-based machine learning mechanisms could adapt reconfiguration management to previously unknown situations, there are limits of our proposed approach related to the repository of configurations. In order to keep the design effort limited, in a first step we propose to use a repository of pre-designed configurations, e.g. a set of configuration bitstreams for hardware accelerators and CPU cores. Our proposed machine learning mechanisms are not capable of modifying this repository. Theoretically, machine learning could be applied on the repository as well, but this would require online redesign and verification of hardware accelerators and CPU cores, which might be a topic for future research. Some basic investigations covering online design modification can be found in [NHB09] and [ABD07]. Transfer of knowledge within a swarm of cooperative robots might be investigated further, based on exchange of rules with high fitness values between individual members of the swarm.

References
[AB01] Alper Buyuktosunoglu et al.: "A Circuit Level Implementation of an Adaptive Issue Queue for Power-Aware Microprocessors", GLSVLSI 2001
[ABD07] Athanas, P.; Bowen, J.; Dunham, T.; Patterson, C.; Rice, J.; Shelburne, M.; Suris, J.; Bucciero, M.; Graf, J.: "Wires on Demand: Run-Time Communication Synthesis for Reconfigurable Computing", IEEE Conference on Field Programmable Logic (FPL) 2007
[AOB06] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams: "Using machine learning to focus iterative optimization", Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006
[ASV04] Alberto L. Sangiovanni-Vincentelli et al.: "Benefits and challenges for platform-based design", DAC 2004
[BB08] Bradley J. Barnes et al.: "A Regression-Based Approach to Scalability Prediction", ICS 2008
[BIM08] R. Bitirgen, E. Ipek, J. Martínez: "Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach", IEEE Micro 2008
[BOB98] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou: "Iterative compilation in a non-linear optimisation space", Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998
[BSH08] L. Bauer, M. Shafique, S. Kreutz, J. Henkel: "Run-time System for an Extensible Embedded Processor with Dynamic Instruction Set", DATE 2008
[CC10] C. Claus, R. Ahmed, F. Altenried, W. Stechele: "Towards rapid dynamic partial reconfiguration in video-based driver assistance systems", 6th International Symposium on Applied Reconfigurable Computing, ARC 2010, Bangkok, Thailand, March 17-19, 2010
[CellBE] Filip Blagojevic et al.: "Modeling Multigrain Parallelism on Heterogeneous Multicore Processors: A Case Study of the Cell BE", xxx
[EFH09] T. Ebi, M. Al Faruque, J. Henkel: "TAPE: Thermal-Aware Agent-Based Power Economy for Multi/Many-Core Architectures", ICCAD 2009
[EI05] Engin Ipek et al.: "An Approach to Performance Prediction for Parallel Applications", Euro-Par 2005

[ElS10] El Sayed Auf, Adam-Pharaoun: Eine Organic Computing basierte Steuerung für einen hexapoden Laufroboter unter dem Aspekt reaktiver Zuverlässigkeit und Robustheit. Dissertation, Institut für Technische Informatik, Universität zu Lübeck, 2010
[GCC08] Grigori Fursin et al.: "MILEPOST GCC: machine learning based research compiler", GCC Summit, Ottawa, 2008, www.milepost.eu
[IRQ09] Imran Rafiq Quadri, Samy Meftali, and Jean-Luc Dekeyser: "High level modeling of Dynamic Reconfigurable FPGAs", International Journal of Reconfigurable Computing, Volume 2009, Article ID 408605
[KK00] Kurt Keutzer, Sharad Malik, A. Richard Newton, Jan M. Rabaey, A. Sangiovanni-Vincentelli: "System-Level Design: Orthogonalization of Concerns and Platform-Based Design", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Vol. 19, No. 12, December 2000
[KK09] K. Keutzer: "Mapping applications onto manycore", Design Automation Conference (DAC), 2009
[KV10] Danica Kragic and Markus Vincze: "Vision for Robotics", Foundations and Trends in Robotics, Vol. 1, No. 1, pp. 1-78, 2010
[LB08] Benjamin C. Lee and David Brooks: "Efficiency Trends and Limits from Comprehensive Microarchitectural Adaptivity", ASPLOS 2008
[LL73] C.L. Liu, J. Layland: "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment", Journal of the ACM, January 1973
[LPC10] Laika, A.; Paul, C.; Stechele, W.; El Sayed Auf, A.; Maehle, E.: "FPGA-based Real-time Moving Object Detection for Walking Robots", 8th IEEE International Workshop on Safety, Security and Rescue Robotics, SSRR 2010, Bremen, Germany, 2010
[MH03] Michael C. Huang et al.: "Positional Adaptation of Processors: Application to Energy Reduction", ISCA 2003
[NHB09] M. Niknahad, M. Huebner, J. Becker: "Method for improving performance in online routing of reconfigurable nano architectures", IEEE System-On-Chip Conference (SOCC) 2009
[OB09] Zheng Wang, M. O'Boyle: "Mapping Parallelism to Multi-cores: A Machine Learning Based Approach", PPoPP 2009
[OB10] C. Dubach, T. Jones, E. Bonilla, M. O'Boyle: "A Predictive Model for Dynamic Microarchitectural Adaptivity Control", IEEE Micro 2010
[OBN08] Luciano Oliveira, Ricardo Britto and Urbano Nunes: "On Using Cell Broadband Engine for Object Detection in IST", IROS 2008
[Ost10] Christoph Osterloh et al.: "HANSE - autonomous underwater vehicle for the SAUC-E competition 2010", La Spezia, Italy, 2010, see www.iti.uni-luebeck.de/fileadmin/user_upload/Paper/sauce_2010.pdf
[PTW10] Platzner, Teich, Wehn (Editors): Dynamically Reconfigurable Systems, ISBN 978-90-481-3484-7, Springer, 2010
[RS89] J. Ramanujam, P. Sadayappan: "A Methodology for Parallelizing Programs for Multicomputers and Complex Memory Multiprocessors", Journal ACM, 1989
[SC06] Seungryul Choi, Donald Yeung: "Learning-Based SMT Processor Resource Distribution via Hill-Climbing", ISCA 2006
[SHE06] S. Stein, A. Hamann, and R. Ernst: "Real-time property verification in organic computing systems", Second International Symposium on Leveraging Applications of Formal Methods, Verification and Validation, pages 192-197, November 2006
[SW95] Stewart W. Wilson: "Classifier fitness based on accuracy", Evolutionary Computation 3(2), pp. 149-175, 1995
[TAD06] T. Asfour, K. Regenstein, P. Azad, J. Schröder, N. Vahrenkamp, and R. Dillmann: "ARMAR-III: An Integrated Humanoid Platform for Sensory-Motor Control", IEEE/RAS International Conference on Humanoid Robots (Humanoids), pages 169-175, Genova, Italy, December 2006
[VK09] V. Kontorinis et al.: "Reducing Peak Power with a Table-Driven Adaptive Processor Core", IEEE Micro 2009
[YKK99] Yu-Kwong Kwok, Ishfaq Ahmad: "Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors", ACM Computing Surveys, December 1999
[ZSH08] J. Zeppenfeld, A. Bouajila, W. Stechele, A. Herkersdorf: "Learning Classifier Tables for Autonomic Systems on Chip", Lecture Notes in Informatics, Springer, Gesellschaft für Informatik, GI Jahrestagung, München, Vol. 134, pp. 771-778, September 12, 2008

POWER CONSUMPTION IMPROVEMENT WITH RESIDUE CODE FOR FAULT TOLERANCE ON SRAM FPGA
Amiel Frédéric - Ea Thomas - Vinay Vashishtha
Institut Supérieur d'électronique de Paris, 21 rue d'Assas, 75006 Paris, France
email: [email protected]

ABSTRACT
The reliability of new SRAM FPGA (Field Programmable Gate Array) devices, which are the first components launched for each new generation of transistor, is difficult to estimate. Their increasing use on electronic boards in both terrestrial and space applications necessitates the development of fault-tolerant techniques in the wake of growing soft error rates (SER). In this article, a concurrent error detection and correction scheme using residue codes is proposed and designed. The results express the gain in power consumption and circuit area compared with other solutions for fault detection or fault correction.
Index Terms: SRAM FPGA, modulus calculation, code computation, power minimization, fault tolerance, user logic.

1 INTRODUCTION
SRAM FPGA components contain different types of transistors: triple gate oxide, super-thin gate oxide, and multi-threshold [1]. Slow and reliable cells are used to memorize and set configuration switches, transistors with two different bias voltages [1] are used inside logic blocks, DSP blocks and user registers, and memory cells are used inside LUT (Look-Up Table) blocks. Aging of the component combined with local temperature conditions can cause timing variations and errors in arithmetic units. Moreover, with the decreasing voltages used to reduce power consumption, the increasing cell density, and the increasing use of VLSI devices, the threat from soft errors at ground level is constantly rising [2]. Furthermore, SRAM FPGA components compete with ASICs, and more and more manufacturers tend to use them in embedded applications. The SEU (Single Event Upset) effect on an FPGA [3] can cause

transient pulses on signals, which can be captured in a register (Fig. 1).

Fig. 1: An SET occurs when the ion-induced pulse can propagate through the circuit network

Various SRAM FPGA resources such as LUTs, configuration bits, flip-flops and block RAM (BRAM) are susceptible to SEUs. New SRAM FPGA devices provide a mechanism to detect bit flips inside the configuration RAM: all the cells are periodically read and a checksum is computed [4] (scrubbing), which allows the device to be reconfigured in case of a modification of the configuration RAM cells. A modification in an embedded RAM in the FPGA (M-RAM in Altera devices) can be detected or corrected by a Hamming code. In this paper we focus on transient errors in the user logic. TMR [5] is classically used to realize fault-tolerant systems; in this article we give the gain in terms of area and power consumption for fault detection and fault correction, versus reliability, based on hardware redundancy (CED: Concurrent Error Detection) [6]. We do not consider time redundancy techniques [7][8], which imply deep modifications during the logic design phase; spatial hardware redundancy is easier to add automatically by a tool. In Part II we investigate the code used to control computations inside the FPGA, and the method to detect or correct an error is explained. Part III details the added blocks. Finally, before concluding our work, Part IV presents the results obtained and a comparison with TMR.

2 PRINCIPLES OF FAULT DETECTION AND FAULT CORRECTION

2.1 Code generation
In order to verify the result of an arithmetic operation, we use modulus verification [9]:

If A + B = C then (A mod m) + (B mod m) ≡ C (mod m)
If A - B = C then (A mod m) - (B mod m) ≡ C (mod m)
If A * B = C then (A mod m) * (B mod m) ≡ C (mod m)
If A / B = C then (A mod m) / (B mod m) ≡ C (mod m)

(where "mod m" denotes the modulus operation, the operations on the residues being themselves reduced modulo m). Our main applications are in the signal processing field, where the divide operator is generally not useful. To detect the occurrence of a fault in the result of an arithmetic operation (termed the main operation here), we encode the source operands by computing residue codes for each of them and then perform on these an arithmetic operation corresponding to that of the main operation, to obtain a coded result CR'. The main operation is performed in parallel and the main result is then encoded to generate CR. The properties of residue arithmetic dictate that the codes CR and CR' thus generated be equal. The correctness of the main operation can therefore be tested: an inequality between the two codes indicates that a fault has occurred (Fig. 2).
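A software sketch of this residue check follows, using the modulo 255 (2^8 - 1) code described just below (a = 8, k = 4). The byte-wise summation with end-around carry mirrors the reduction the hardware adders perform; the function names are illustrative.

```python
A_BITS = 8
MOD = (1 << A_BITS) - 1          # 255

def residue(z: int, words: int = 4) -> int:
    """Residue of a (words*8)-bit value modulo 255 via byte summation."""
    s = sum((z >> (A_BITS * i)) & 0xFF for i in range(words))
    while s > MOD:               # end-around carry: fold the carry back in
        s = (s & MOD) + (s >> A_BITS)
    return 0 if s == MOD else s

def checked_multiply(n1: int, n2: int):
    main_result = n1 * n2                           # main operation (32x32 -> 64 bit)
    cr = residue(main_result, words=8)              # code of the main result (CR)
    cr_prime = residue(n1) * residue(n2) % MOD      # code predicted from the operands (CR')
    return main_result, cr == cr_prime              # mismatch => fault detected

print(checked_multiply(0x12345678, 0x9ABCDEF0))
```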

Fig. 2: Principle of operation (N1, N2: operands)

We use a modulus of the form 2^a - 1 so that the residue of a number Z can be computed easily [9]. The residue modulo 2^a - 1 is obtained by summing the k a-bit segments Ki that compose the ka-bit number Z; the division is replaced by an "end-around carry" addition. For example, for a 32-bit number divided into k = 4 groups of a = 8 bits, the modulus operation by (2^8 - 1) is performed by adding four 8-bit numbers: if Z = ABCD (A, B, C, D being groups of 8 bits), then Z mod 255 = (A + B + C + D) mod 255, the reduction being performed by end-around-carry addition rather than by a division.

The detection probability is 100% for a one-bit modification. Furthermore, all error values confined within a - 1 adjacent bits (bursts of length a - 1 or less) will be detected, since their error magnitude is g * 2^j with g in the range 1 <= g <= 2^(a-1) - 1. Some detection probabilities are detailed in Part IV. With this code the integrity of the internal operation is verified; a larger code has better coverage than a smaller one.

2.2 Fault detection method
Fault detection with hardware redundancy may be done in two ways. One method is to duplicate the operation unit and compare the two outputs in order to detect a fault, as shown in Fig. 3. The other approach is to introduce reduced hardware redundancy in the form of an arithmetic code calculation unit; the codes calculated on the operands by this unit are compared with the code of the result of the operation unit for fault detection (see Fig. 2).

Fig. 3: Duplication with Comparison (DWC)

2.3 Fault correction method
For fault correction, TMR is a widely used technique in which the outputs of three identical operation units are constantly monitored and a disparity in the results is interpreted as a fault. In such a case a majority voter selects the correct output (Fig. 4).

Fig. 4: Classic TMR

In our correction scheme, two identical operation units are used in conjunction with an arithmetic code calculation unit (Fig. 5):

Fig 5: Coded version for fault correction

The encoded outputs from the two identical operation units (outputs of the "cod" blocks in Fig. 5) are compared with each other and with the output of the code calculation unit (CodeR in Fig. 5). Once again, a disparity in the result indicates the presence of a fault in one of the operation units, which can be mitigated by returning the output matching that of the code calculation unit. The probability of an error in the code calculation unit is low due to its small size, and can be further lowered by placing it elsewhere in the design area, as radiation particles striking a sensitive node tend to affect surrounding nodes as well. This approach therefore reduces power dissipation as well as design area. However, a trade-off of this small size (and therefore smaller code size) is a lower fault coverage compared to TMR.

3 CODED UNITS
3.1 General description
We developed parameterized VHDL code to implement the code generation for a number of specified length with a specified modulus. We use parallel adders to accelerate the code computation. As an example, for a 32-bit number with a modulus of 255 (a = 8, k = 4) the equivalent schematic is:

Fig 6: Code computation schematic

In this example the adders are limited to 8 bits to guarantee an 8-bit modulus result. In a pipelined scheme, the code computation is done in two time slices. Each slice should complete faster than the main calculation:

Fig 7: Timing for code computation

This constraint is easy to respect, since operator delay usually increases as an n*log(n) function of the operator size; the time for the multiplication is 20 ns with DSP blocks and 25 ns with Logic Elements (LEs) in the case of an 8-bit code for a 32-bit number. The time to carry out the residue code calculation is 7 ns, and the 8-bit multiplication takes 15 ns on a Cyclone II-6 component (an FPGA from ALTERA). We use classical operators saturated to 8 bits for the code computation. It is possible to use LEs to implement the code and hard macros to perform the main operation.

3.2 MAC operator
The MAC (Multiply-ACcumulate) operator is intensively used in signal processing. Unfortunately, it is a sequential operator, and the full operation has to be decomposed into a multiply unit and an addition unit in order to add the code. DSP blocks inside the FPGA provide the MAC operator, and the coded unit has to be implemented in standard logic blocks. We then check the code on the final result, without access to the internal result of the multiplier, but the code unit needs to use the same commands (clock and reset) as the main operator:

Fig 8: MAC operator
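A software illustration of this coded MAC follows: the residue accumulator mirrors the clear/accumulate sequence of the DSP-block MAC and the check is performed once, on the final result, as described above. The class and method names are illustrative.

```python
MOD = 255

class CodedMAC:
    def __init__(self):
        self.clear()

    def clear(self):                      # corresponds to the reset command
        self.acc = 0
        self.code_acc = 0

    def step(self, a: int, b: int):       # one clock: multiply-accumulate
        self.acc += a * b
        self.code_acc = (self.code_acc + (a % MOD) * (b % MOD)) % MOD

    def check(self) -> bool:              # compare codes on the final result only
        return self.acc % MOD == self.code_acc

mac = CodedMAC()
for a, b in [(1000, 3), (70000, 12), (123456, 789)]:
    mac.step(a, b)
print(mac.acc, mac.check())
```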

3.3 Other operators
The code part has to be inserted for each basic operation unit (add/sub/mul/div/logic). This can be done automatically inside the netlist generated by the FPGA vendor tools. The code can also be used to verify register integrity, because all register contents have a code counterpart: a bit transition inside a register will be detected as an error in the following arithmetic operation. For pipelined operation, it is possible to chain the code and to do the verification at the end of the pipe (see Fig. 9).

Fig 9: Pipeline operation
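The chaining idea of Section 3.3 can be illustrated with a two-stage (a + b) * c pipeline: residues are carried alongside each stage and compared once at the end of the pipe. The pipeline itself is an invented example, not a circuit from the paper.

```python
MOD = 255

def pipeline(a: int, b: int, c: int):
    # stage 1: adder, with its residue computed alongside
    s = a + b
    s_code = ((a % MOD) + (b % MOD)) % MOD
    # stage 2: multiplier, fed by stage 1; the residue is chained, not re-encoded
    p = s * c
    p_code = (s_code * (c % MOD)) % MOD
    # single check at the end of the pipe
    return p, (p % MOD) == p_code

print(pipeline(40000, 1234, 567))
```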

4 MEASUREMENT AND RESULT ANALYSIS

We use an ALTERA Cyclone II device (EP2C70F672C6N) for measurements of area, expressed in Logic Elements (LEs). We use the power estimator available in the ALTERA Quartus II tool; we have confirmed the accuracy of this power estimator in other studies [10]. As test circuits, we essentially use a 32x32-bit multiplier with a 64-bit result.

4.1 Used area and power

Fig 10: Area of the multiplier with 8-bit residue code

Fig 11: Dynamic power with 8-bit residue code

Table 1: Area and power with embedded multipliers

Table 2: Area and power with logic blocks only

The required area and power consumption of the code unit are approximately 1/5 of those of a full operator (Fig. 10, Fig. 11, Table 1 and Table 2). Table 1 shows that the main operator can be implemented on hard macros while the code verification is implemented in LE blocks. We consider only dynamic power, because static power is roughly constant for a given FPGA; however, the area gain permits the use of a smaller component, in which case the static power will decrease accordingly.

4.1.1 Fault detection
For fault detection only, a single code unit is needed; Fig. 12 and Fig. 13 show the comparison of dynamic power consumption and area for an implementation with redundant operators (Fig. 3) and a coded version (Fig. 2).

Fig 12: Area for duplicated multiplier versus coded multiplier

Fig 13: Power for duplicated multiplier versus coded multiplier

The coded version occupies about 60% of the area used by a redundant operator together with its comparison logic, and its dynamic power consumption is also about 60% of that of the redundant implementation.

4.1.2 Fault correction
For fault correction, Fig. 14 and Fig. 15 compare the implementation of the 32-bit multiplier, in terms of area and power consumption, with triple modular redundancy (Fig. 4) and with the 8-bit residue code (Fig. 5).

Fig 14: Area triplex multiplier versus coded multiplier

Fig 15: Power triplex multiplier versus coded multiplier

The coded version corrects errors while requiring only about 73% of the area and dynamic power of the TMR implementation.

4.2 Accuracy of error detection - fault injection
In order to test the accuracy of the fault detection, we inserted into the design some "fault injection blocks" (Fig. 16) which can insert a one-bit flip on the main operator or on the code calculation. We apply test vectors and count the number of detected or corrected errors. Several random number generators are used to create the vectors and to choose the bits to flip; from 1 to 3 bit flips can be injected in the same clock period. We use two billion vectors to obtain a good estimate of the detection rate; at a 40 MHz clock frequency the whole test takes 50 seconds. An embedded logic analyzer (SignalTap) is used to read the different counters. Random numbers are also used to generate the input operands. The results are shown in Table 3.

Fig. 16: Multiplier with Fault Injection Block (FIB)

Table 3: Accuracy of code detection (measured)

The results can be predicted. For example, for 2 bit flips and an 8-bit code on 32-bit data, a change from 0 to 1 in one bit can be undetected if a change from 1 to 0 occurs at the same position in another byte (Fig. 17).

Fig. 17: A bit change in byte 1 can be compensated by another bit change at the same position in another byte

In our design the same bit can be flipped twice; the detection probability is then Pdetection(2 bits) = 1 - (5/72) = 0.930 (to be compared with the 0.937 measured in Table 3). With this method we can also estimate the accuracy of fault detection with the TMR method [5]. For 2-bit errors in the result of a 32-bit multiplier: Pdetection(2 bits) = 1 - (3/95) = 0.968.
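The measured rates above can be approximated by a Monte-Carlo sketch such as the one below. The injection model of the hardware testbench differs slightly (flips may also hit the code path, and the same bit may be flipped twice), so the simulated figure only lands near the reported 0.93-0.94, not exactly on it.

```python
import random

MOD = 255

def detection_rate(trials: int = 200_000) -> float:
    hits = 0
    for _ in range(trials):
        n1, n2 = random.getrandbits(32), random.getrandbits(32)
        good = n1 * n2
        bad = good
        for _ in range(2):                      # inject two random bit flips
            bad ^= 1 << random.randrange(64)
        ref = (n1 % MOD) * (n2 % MOD) % MOD     # code predicted from the operands
        if bad == good:                         # the two flips cancelled: no error
            hits += 1
        elif bad % MOD != ref:                  # residue mismatch: fault detected
            hits += 1
    return hits / trials

print(f"2-bit-flip detection rate ~ {detection_rate():.3f}")
```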

5 CONCLUSION AND FUTURE WORK

This work presented the accuracy of transient error detection/correction with residue codes. The method detects faults in combinational circuits with an area and power improvement of 40% compared to a duplication strategy, and corrects faults with a 25% gain compared to classic TMR. The accuracy of the code is 100% for a single bit flip. For multiple bit flips the accuracy is slightly lower than with duplication or triplication, but remains of the same order of magnitude. Compared to time duplication, the residue code is easier to add after the design phase [11].

We are now planning to integrate the code system inside a real application and to stress the component according to a model of aging variation. This method will be used to detect transient faults in sensitive algorithms for embedded applications, with a minimal reaction delay.

6 REFERENCES
[1] wp-01059-stratix-iv-40nm-power, Altera white paper.
[2] Eugene Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
[3] J.J. Wang, "Radiation effects in FPGAs," Actel.
[4] "SEU mitigation in Cyclone IV Devices" (Cyclone IV Device Handbook, Vol. 1), ALTERA.
[5] R. E. Lyons, W. Vanderkulk, "The Use of Triple-Modular Redundancy to Improve Computer Reliability," IBM Journal, April 1962.
[6] Fernanda Lima Kastensmidt, Luigi Carro, Ricardo Reis, "Fault Tolerance Techniques for SRAM-Based FPGAs," Springer.
[7] Fernanda Lima, Luigi Carro, Ricardo Reis, "Designing Fault Tolerant Systems into SRAM-based FPGAs," DAC 2003.
[8] Barry W. Johnson, James H. Aylor, Haytham H. Hana, "Efficient Use of Time and Hardware Redundancy for Concurrent Error Detection in a 32-bit VLSI Adder," IEEE Journal of Solid-State Circuits, Vol. 23, No. 1, Feb. 1988.
[9] Algirdas Avizienis, "Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design," IEEE Transactions on Computers, Vol. C-20, No. 11, November 1971.
[10] Henryk Blasinski, Frédéric Amiel, Thomas Ea, "Impact of different power reduction techniques at architectural level on modern FPGAs," LASCAS 2010.
[11] Keith S. Morgan, Daniel L. McMurtrey, Brian H. Pratt, and Michael J. Wirthlin, "A Comparison of TMR With Alternative Fault-Tolerant Design Techniques for FPGAs," IEEE Transactions on Nuclear Science, Vol. 54, No. 6, December 2007.


Poster Session Main Track

Embedded Systems Security: An Evaluation Methodology Against Side Channel Attacks
Youssef Souissi, Jean-Luc Danger, Sylvain Guilley, Shivam Bhasin and Maxime Nassar
Interfacing and Scheduling Legacy Code within the Canals Framework
Andreas Dahlin, Fareed Jokhio, Jérôme Gorin, Johan Lilius and Mickaël Raulet
Range-Free Algorithm for Energy-Efficient Indoor Localization in Wireless Sensor Networks
Ville Kaseva, Timo Hämäläinen and Marko Hännikäinen
Application Workload Model Generation Methodologies for System-Level Design Exploration
Jukka Saastamoinen and Jari Kreku
Flexible NoC-Based LDPC Code Decoder Implementation and Bandwidth Reduction Methods
Carlo Condo and Guido Masera
FERONOC: Flexible and Extensible Router Implementation for Diagonal Mesh Topology
Majdi Elhajji, Brahim Attia, Abdelkrim Zitouni, Samy Meftali, Jean-Luc Dekeyser and Rached Tourki
A New Algorithm for Realization of FIR Filters Using Multiple Constant Multiplications
Mohsen Amiri Farahani, Eduardo Castillo Guerra and Bruce Colpitts
Analyzing Software Inter-Task Communication Channels on a Clustered Shared Memory Multi Processor System-on-Chip
Daniela Genius and Nicolas Pouillon
Multiplier Free Filter Bank Based Concept for Blocker Detection in LTE Systems
Thomas Schlechter
Practical Monitoring and Analysis Tool for WSN Testing
Markku Hänninen, Jukka Suhonen, Timo D. Hämäläinen and Marko Hännikäinen


Embedded Systems Security: An Evaluation Methodology Against Side Channel Attacks
Youssef Souissi, Jean-Luc Danger, Sylvain Guilley, Shivam Bhasin and Maxime Nassar
TELECOM ParisTech, 75 634 Paris Cedex 13, FRANCE
{ysouissi,danger,guilley,bhasin,nassar}@telecom-paristech.fr

Abstract—One of the most redoubtable classes of attacks on modern embedded systems is Side-Channel Analysis. In this paper, we propose a security evaluation framework which aims at organizing the work of the evaluator to reliably assess the robustness of embedded systems against such attacks. Moreover, we highlight common errors made by evaluators and solutions to avoid them.

Keywords: Embedded systems security, Evaluation criteria, Security certification, Side Channel Attacks (SCA), Evaluation metrics and solutions.

I. INTRODUCTION
Physical security has always been an open question and could pose a difficult long-term problem. Indeed, we are increasingly surrounded by different forms of Information Technology (IT), such as mobile phones and smart cards, that require an adequate level of security to work properly. Any violation of IT security could lead to the loss of sensitive and personal information. This is even more critical in the military and defense market, which has always been ruled by highly reliable devices such as ASICs and FPGAs. Thus, many questions arise about the assurance of data confidentiality and data integrity. Depending on the nature of the information stored and manipulated by electronic devices, embedded system engineers are committed to providing the safest product possible by adjusting their scheduling policies with regard to the security aspect, which can be seen as the level of robustness against external attacks and perturbations. One of the most redoubtable classes of attacks on embedded systems is without doubt Side-Channel Analysis (SCA), which exploits unintentional physical leakage such as timing information, power consumption or radiated magnetic fields. More importantly, these attacks are low cost and easy to mount in practice. Clearly, dealing with this matter has become more important than ever. Besides, if a device which has been evaluated as secure could be attacked by such means, it might create a crisis of trust between vendors (manufacturers, embedded system designers, ...) and customers. In the world of Information Technology, security compliance can be ensured by two approaches. The first one is to officially certify the embedded system product against an internationally approved set of security standards such as the Common Criteria (CC) [1] and the NIST FIPS [2]. The main goal behind standard certification is to obtain a degree which validates the security level implemented by the

product. Obviously, official standard compliance is guided by marketing and business needs; indeed, the certificate is an advantage on a competitive market. The second approach is to assess the security robustness of a product according to a set of tests which are conducted by an evaluation lab ("non-standard certification") and are not necessarily documented in a formal standard: the security evaluation is often carried out through a set of examinations that involves attacks known to the cryptographic community, and often other security practices that are specific to the evaluation lab. These security practices are worthwhile in that they discover real vulnerabilities that map to real threats. In the case of SCA, most evaluation labs are equipped to perform SCA testing on secure devices. Embedded system vendors often have recourse to evaluation labs in order to obtain a baseline assessment of the information leakage of the cryptographic co-processor implemented on their products. Generally, the evaluator is required to conduct his analysis following a certain methodology. Actually, according to the security levels defined by Abraham et al. [3], secure devices can be classified into several levels of security. For all levels, the evaluator should be able to evaluate the cryptographic implementation itself within its external environment, which depends on the factory and the type of circuit (FPGA, ASIC, ...). For instance, some added security measures could be employed to limit access to the cryptographic implementation; thus, the evaluator would not be free to acquire as many power consumption signals (traces) as he wants. Indeed, the main issue is that embedded system vendors often want to keep secret some parts of the design, especially when the evaluation process is carried out by a third party. In the literature on IT security evaluation [4] [5] [6], only few standards [1] and references address the common problem of securing embedded systems against Side Channel Attacks. This paper not only offers a more in-depth discussion of the issues related to the security evaluation of embedded systems such as ASICs and FPGAs, but also provides a generic and straightforward evaluation methodology in order to assess the actual security robustness of these cutting-edge technologies against such malicious attacks, and therefore to enhance user trust. The proposed methodology is principally based on five phases: the characterization of the device under test, the simulation process, the acquisition setup, the preprocessing

of acquired Side-Channel signals, and the analysis phase. These processes are in close relationship with each other. For instance, we will show that the evaluator cannot claim a reliable evaluation of the product if the preprocessing phase is not taken seriously. The rest of the paper is organized as follows: Section II deals with the characterization phase, which aims at exploring the device under test according to the documentation provided by the vendor; thanks to this phase, the evaluator is able to determine the most appropriate Side-Channel Analysis that could be performed on the cryptographic implementation. The characterization phase is related to the simulation phase, which is described in Section III. Section IV shows how the evaluator can efficiently quantify the strength of a deployed analysis while knowing the value of the secret information, which is usually referred to as the secret key. Section V concerns a practical phase, namely the acquisition of real Side-Channel traces. This is taken advantage of in Section VI, which aims at preparing the acquired Side-Channel traces for the analysis. Section VII recalls the proposed phases and shows how they can be organized in a methodological scheme in order to make a reliable evaluation of secure implementations. Eventually, Section VIII is devoted to the conclusion.

II. THE CHARACTERIZATION PHASE
Before proceeding to any Side-Channel analysis, the evaluator is required to find answers to questions related to his knowledge about the device under test. Generally, this concerns the access to the device and the type of implemented countermeasures, which basically depends on the documentation provided by the vendor. This way, the evaluator can determine the most appropriate analysis to perform.

Analysis exist. This section gives an overview of the most important attacks, which we call basic attacks, and explains why at least all of these attacks should be taken into consideration. Besides, the idea behind using fully controlled attacks is to make an easy and reliable comparison between the evaluation results provided by different labs when testing the same product. Basically, the power or electromagnetic consumption of hardware devices depends on bit transitions at a certain time, which makes it possible to learn more about the behaviour of the cryptographic process. The idea of analyzing power consumption signals was presented to the cryptographic community in [7] by P. Kocher, based on an implementation of the symmetric encryption algorithm Data Encryption Standard (DES). Power analysis attacks involve two basic variants. The first one is called Simple Power Analysis (SPA). It is a direct analysis of the patterns of instruction execution, obtained by monitoring variations in the electrical power consumption of a cryptographic algorithm. SPA requires detailed knowledge about the implementation of the cryptographic algorithm that is executed by the device under attack. As a matter of fact, the original idea behind performing SPA against the Rivest-Shamir-Adleman cryptosystem (RSA) is to recover the sequence of multiplication and squaring operations. As shown in Fig. 1, the secret key is easily retrieved by analysing the shape of the acquired Side-Channel trace: according to this figure, the differences between the square and multiply executions are clearly visible.

A. Access to the device Manufacturers often deploy different types of sensors and filters that mainly aim at protecting the cryptographic implementation from improper manipulations that would threaten the encryption or decryption operation. Such sensors and filters monitor the voltage level, frequency, temperature or light. For instance, light sensors define a range within which the light gradient must stay, otherwise the circuit resets. More sophisticated techniques have been proposed, such as using a robust metal enclosure that acts as a Faraday cage. In fact, shielding the circuit with metal layers reduces the electromagnetic (EM) radiation, which makes the EM acquisition and exploitation of Side-Channel traces more difficult. All these examples of external protections (i.e., external to the cryptographic implementation itself) should be taken seriously by the evaluator before performing his analysis. B. Basic SCA According to his knowledge about the external protections of the cryptographic co-processor, the evaluator is now required to perform a set of non-invasive analyses to recover the secret key from the cryptographic implementation. In the literature, many variants of Side-Channel

Figure 1. Simple Power Analysis on RSA (S: squaring, M: multiplication).
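For illustration only (this sketch is ours, not part of the evaluation methodology), the key-bit recovery suggested by Fig. 1 can be written down in a few lines, assuming an unprotected left-to-right square-and-multiply exponentiation and ignoring any implementation-specific handling of the leading bit: a squaring followed by a multiplication reveals an exponent bit equal to 1, a squaring alone a bit equal to 0.

# Illustrative sketch (ours): recovering RSA exponent bits from the S/M
# operation sequence read off an SPA trace, assuming an unprotected
# left-to-right square-and-multiply implementation.
def bits_from_spa_sequence(ops):
    """ops is a string of operations such as 'SMSSM' read from the trace."""
    bits, i = [], 0
    while i < len(ops):
        if ops[i] != 'S':
            raise ValueError("expected a squaring")
        if i + 1 < len(ops) and ops[i + 1] == 'M':
            bits.append(1)      # squaring followed by a multiplication
            i += 2
        else:
            bits.append(0)      # squaring alone
            i += 1
    return bits

print(bits_from_spa_sequence("SMSSMSSMS"))   # -> [1, 0, 1, 0, 1, 0]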

The second variant is the Differential Power Analysis (DPA), which is based on statistical computations. This attack is more powerful than SPA as no detailed knowledge about the cryptographic device is necessary. An alternative to DPA, called Correlation Power Analysis (CPA) and based on linear correlation techniques, was suggested by Brier et al. [8]. CPA offers a more efficient analysis by eliminating the “ghost peak” problem of DPA. More recently, a new powerful variant of Side-Channel Attacks, called Mutual Information Analysis (MIA) [9], has been proposed. These attacks, which are based on mutual information theory, aim at exploiting both linear and non-linear correlations. This makes them more generic and more efficient than first-order attacks like DPA and CPA, as they can even be applied to protected implementations. Another kind of attack, called template analysis, is often considered the most powerful SCA. Indeed, it is shown that such attacks can

easily break cryptographic implementations and countermeasures whose security depends on the assumption that an adversary cannot obtain more than one or a limited number of Side-Channel traces. However, these attacks require that the adversary has access to a clone device on which he can perform his trials and tests. He is first led to profile the clone product by building what we call templates; those templates are then used to recover the secret key from a real cryptographic co-processor. Generally, for all SCAs, the leaked information can be statistically described by a continuous random variable whose probability law, denoted by Plaw, is unknown or uncertain. The main challenge of SCA is to make a sound estimation of Plaw without loss of information. Basically, random variables are measured and analyzed in terms of their statistical and probabilistic features. Obviously, given the high variety of existing attacks, there are many ways to apply statistics in the Side-Channel field. For instance, new calculations based on second-order statistics (the variance) seem to be a good way to quantify the secret information on some protected implementations. As a matter of fact, those calculations have already been exploited to mount an efficient attack called “Variance Power Analysis” (VPA) [10] [11]. However, applying statistics is a task often guided by certain conditions to obtain accurate results. For instance, attacks based on mutual information theory, like MIA, require a reliable estimation of the probability density function of Plaw. Theoretically, a probabilistic statistic such as the entropy describes a random variable better than high-order statistics. Unfortunately, optimal accuracy is hardly achieved, especially when the probability law is unknown or uncertain. In many scientific disciplines, it is shown that the probability density of an unknown law is nearly impossible to estimate properly, especially when the available data is limited. By analogy, in the cryptographic domain, statisticians correspond to evaluators and the available data to power or EM consumption signals. Indeed, the evaluator is often required to perform his analysis under certain constraints, and therefore he might be limited in the number of acquired Side-Channel traces. Before going further, as stated earlier, the evaluation labs (“non-standard certification”) are often required to carry out their own security tests, which are not common to the cryptographic community. Our belief is that such security practices can, on the one hand, discover real threats that could not be discovered by the basic attacks; on the other hand, they can make the difference between evaluation labs and determine the position of one lab with regard to the rest. Moreover, in this context, vendors may have recourse to more than one evaluation lab before getting officially certified by an authority as compliant with a security standard. The benefit of approaching several evaluation labs is a closer approximation of the real robustness of the tested product. For instance, the DPA CONTEST competition [12] is a good example that highlights the existence of several efficient SCAs that had not been made public beforehand.
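As a hedged illustration of the basic attacks discussed above (and of the Hamming-weight power model reused in the simulation phase of Section III), the following sketch simulates leaky traces and runs a toy CPA. It is our own example: the S-box is a stand-in random permutation rather than a real cipher component, the noise level and the leaking sample index are arbitrary assumptions, and this is not the tooling of any evaluation lab.

# Minimal CPA sketch (illustrative only): traces are simulated with a
# Hamming-weight leakage model plus Gaussian noise, and the distinguisher
# is the Pearson correlation between the measured samples and the
# predicted leakage for every key hypothesis.
import numpy as np

SBOX = np.random.RandomState(0).permutation(256)      # stand-in for a real S-box
HW = np.array([bin(x).count("1") for x in range(256)])

def simulate_traces(plaintexts, key, n_samples=50, sigma=1.0, leak_index=25):
    traces = np.random.normal(0.0, sigma, (len(plaintexts), n_samples))
    traces[:, leak_index] += HW[SBOX[plaintexts ^ key]]   # leakage of one intermediate value
    return traces

def cpa(traces, plaintexts):
    best_key, best_corr = None, 0.0
    centered = traces - traces.mean(axis=0)
    for k in range(256):
        h = HW[SBOX[plaintexts ^ k]].astype(float)     # hypothetical leakage
        h -= h.mean()
        # correlation of the hypothesis with every time sample
        corr = np.abs(h @ centered) / (np.linalg.norm(h) * np.linalg.norm(centered, axis=0))
        if corr.max() > best_corr:
            best_key, best_corr = k, corr.max()
    return best_key

rng = np.random.RandomState(1)
pts = rng.randint(0, 256, 2000)
print(cpa(simulate_traces(pts, key=0x3A), pts))        # should print 58 (= 0x3A)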

C. SCA countermeasures There are essentially two options, called countermeasures (CM), to counteract side-channel attacks and protect the secret information: masking and Dual-rail Precharge Logic (DPL). First, the masking countermeasure aims at masking the intermediate values that occur during the cryptographic process. Many masking schemes have been proposed to the cryptographic community for symmetric encryption algorithms (DES, AES, . . . ) [13], [14], [15]. Generally, they differ in terms of hardware design complexity, but they all pursue the same goal of ensuring resistance against attacks like DPA and CPA. However, it has been shown that the masking countermeasure remains susceptible to first-order SCA as long as the glitch problem is not completely resolved [13]. Moreover, masked implementations are not resistant against new variants of SCA like VPA, which are mainly based on variance analysis. It has also been shown that a full-fledged masked DES implementation using a ROM (Masked-ROM) is breakable by VPA, in spite of its high resistance against first-order attacks. Second, the Dual-rail Precharge Logic countermeasure, like WDDL, deals directly with the logic level. In fact, DPL aims at making the activity of the cryptographic process constant, independently of the manipulated data. In the literature, existing DPL designs vary in terms of performance and complexity. In [16], the authors introduce the different DPL styles (BCDL, WDDL, IWDDL . . . ) and compare them. Such CM seem to be the most powerful; however, it has been shown that real-life implementations of such CM can fail by leaking useful information about the secret key. III. THE SIMULATION PHASE The simulation phase is strictly dependent on the documentation provided by the vendor. This phase consists in predicting the behaviour of the cryptographic implementation with a software program, replacing real components with idealized electrical models. Power simulations are crucial in that they reveal the existence of power leakages during the encryption or decryption process, which makes it possible to make statements about the resistance against Side-Channel Attacks. These simulations can be performed at different levels of power accuracy [17]: the analog level, the logic level and the behavioural level. However, this process is very time-consuming, especially when taking into account the whole implementation. Moreover, it requires attention to the smallest details of the cryptographic design, such as transistor or cell netlists, which are often classified as confidential and therefore are not provided by the vendor. If the netlists are not available, it is still possible for the evaluator to simulate the behaviour of the cryptographic co-processor and obtain prior knowledge about its robustness against SCA. Indeed, given the cryptographic algorithm, the evaluator has some clues about the executed instructions, which enables him to simulate the power consumption of these cryptographic instructions. The most commonly used power models are the Hamming Distance and the Hamming weight [8]. The main advantage of such simulations is to

provide the evaluator with reliable information about the robustness of the cryptographic implementation against SCA. Indeed, if the attack fails on simulated measurements, it must fail on real ones. However, during our tests, we have observed that such simulations do not give a fair comparison between SCAs when they are applied to real measurements. The efficiency of SCA is very dependent on the noise variation, which we cannot simulate exactly. Therefore, we believe that such simulations are not sufficient to draw conclusions from a comparison study. IV. THE ANALYSIS AND DECISION PHASE Thanks to this phase, the evaluator is able to make decisions about the robustness of the product under test. Indeed, it can be seen as a toolbox that aims at giving as many details as possible about the real security dynamics deployed in the product. Today's evaluator has access to a broad range of metrics used to assess the performance of SCA carried out on the tested device. These metrics consider all key hypotheses while knowing the correct one. Additionally, he is able to quantify the leaked information thanks to hypothesis testing and information-theoretic approaches. A. Tools for Side-Channel Analysis assessment In the security field, there is a perception that the level of robustness of secure devices can be measured and deduced through attacks performed while the secret key is known. This is true, as such analyses are valuable in that they pinpoint the vulnerabilities that the secure product is designed to resist. In the SCA literature, the first evaluation metric used to validate the robustness of cryptographic implementations is called the stability criterion and consists in measuring the number of acquired Side-Channel traces needed to guess the secret information: a key is considered correctly guessed if the stability criterion is met (i.e., the Side-Channel analysis has to continuously return the correct key as traces are accumulated). Such a metric is useful when the evaluator is not free to acquire as many Side-Channel traces as he wants; in such a case, he has to perform the analysis once and for all on the totality of traces that he may acquire from the secure device. Recently, two independent evaluation metrics [18] have been proposed by F.-X. Standaert to assess the performance of different analyses: the Success Rate and the Guessing Entropy. Both metrics measure the extent to which an adversary is efficient in turning the Side-Channel leakage into a key recovery. On the one hand, the first-order success rate expresses the probability that, given a pool of traces, the attack's best guess is the correct key. On the other hand, the guessing entropy measures the position of the correct key in a list of key hypotheses ranked by a distinguisher. Such metrics are very useful when the number of Side-Channel traces that can be acquired is unlimited. B. Leakage quantification: information theoretic and hypothesis testing approaches 1) Information theoretic approach: This approach is used to measure the amount of useful information which

is leaked from the tested device. Technically speaking, this metric is mainly based on mutual information theory. In the context of SCA, we define K as a variable describing a part of the secret key and k as a realization of this variable. Let TR_q be a random vector containing the Side-Channel traces generated with q queries to the device under test and tr_q be a realization of this random vector. In practice, we have tr_q = [tr_1, . . . , tr_q], where each tr_i is the Side-Channel trace corresponding to one given query. Moreover, let Pr[k|tr_q] be the conditional probability of a key class k given a leakage tr_q. We evaluate the amount of information in the Side-Channel leakages with the conditional entropy H[K|TR_q] = E_k E_{tr_q|k} [ -log_2 Pr[K = k | TR_q = tr_q] ], where E denotes the expectation. Basically, the higher the value of H is, the more robust the implementation is. 2) Hypothesis testing approach: This approach, proposed by S. Mangard in [19], involves Signal-to-Noise Ratio (SNR) calculations and the use of the Fisher transformation. Basically, it aims at estimating the correlation coefficients that occur in first-order SCA (DPA and CPA) without actually performing the attack in practice. In other words, computations made with the correct key are sufficient to apply this approach. More importantly, these estimates of the correlation coefficients, denoted by ρ_estim, are used in a rule of thumb to reliably predict the number of Side-Channel traces needed to perform a successful CPA or DPA attack (i.e., to extract the value of the secret key among all key hypotheses). This rule is given by the following Eqn. (1):

N = 3 + 8 \left( \frac{Z_{1-\alpha}}{\ln\left( \frac{1+\rho_{estim}}{1-\rho_{estim}} \right)} \right)^{2}, \qquad (1)

where N is the predicted number of traces, α indicates the confidence used to express the reliability of the estimate, and Z_{1-α} is the quantile of the normal distribution for the two-sided confidence interval with error 1 − α. V. THE ACQUISITION PHASE SCA traces are typically acquired with a digital oscilloscope. The accuracy of the oscilloscope can affect the measurements greatly: a high-quality oscilloscope can reveal more important Side-Channel signal details than a low-quality one. For this reason, it is necessary to properly set up the scope before acquiring Side-Channel traces. Generally, three parameters should be taken into consideration: the input bandwidth, the sampling rate and the resolution of the scope. Moreover, other equipment is needed to capture the activity of the tested device during the encryption or decryption process: power measurement probes and antennas for measuring the electromagnetic (EM) radiation. The choice between performing a power or an EM acquisition depends on the implementation environment, which includes the access to the device and the surrounding electrical components.
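A quick numeric reading of Eqn. (1) can be scripted as follows. The sketch is ours, and the value α = 0.0001 is only an assumed example confidence parameter, not a figure from the paper.

# Quick numeric check of the rule of thumb in Eqn. (1) (illustrative values only).
from math import log
from statistics import NormalDist

def traces_needed(rho_estim, alpha=0.0001):
    z = NormalDist().inv_cdf(1 - alpha)          # quantile Z_{1-alpha}
    return 3 + 8 * (z / log((1 + rho_estim) / (1 - rho_estim))) ** 2

for rho in (0.4, 0.2, 0.1, 0.05):
    print(f"rho_estim = {rho:4.2f} -> about {traces_needed(rho):8.0f} traces")

As expected, the predicted number of traces grows quickly as the estimated correlation coefficient shrinks, which is why weakly leaking (or protected) implementations require far larger acquisition campaigns.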

Table I
NUMBER OF MEASUREMENTS NEEDED TO PERFORM A SUCCESSFUL DPA AND CPA.

        0.1 GSa/s   0.5 GSa/s   1 GSa/s
DPA     -           -           1408
CPA     -           1637        548

A. A practical example In order to show the effect of the sampling rate parameter, we conducted a practical experiment which first consists in acquiring different sets of unprotected DES Side-Channel traces with the same messages and the same secret key, but with different values of the sampling rate. Second, a basic DPA attack is performed on each set of traces. For this experiment we placed ourselves in a situation in which the number of traces is limited to 2000. Our measurement setup consists of a Xilinx FPGA soldered on a SASEBO platform, a 54855 Infiniium Agilent oscilloscope with a bandwidth of 6 GHz and a maximal sampling rate of 40 GSa/s, amplifiers, antennas and probes of the HZ-15 kit from Rohde & Schwarz. A picture of the EM measurement setup is shown in Figure 2. The board is shown from the backside, because the most leaking

Figure 2. EM measurement setup.

components were not the FPGA itself, but the decoupling capacitors that supply it with power. Those capacitors are surface-mounted components that have a fast response time, and thus radiate useful information about every distinct round of the algorithm. Moreover, they are easily accessible all at once with a large coil-shaped antenna; therefore the EM leakage of the entire FPGA is captured without precise knowledge of the placement information within the FPGA. We ran the experiment for three different values of the sampling rate (1 GSa/s, 0.5 GSa/s and 0.1 GSa/s), which can be tuned on the oscilloscope. Results are reported in Table I. Obviously, the higher the sampling rate, the more efficient the attack. It is noteworthy that at a low sampling rate (0.5 GSa/s), CPA is still successful, whereas DPA fails to recover the correct key. Moreover, for the lowest sampling rate value (0.1 GSa/s), neither DPA nor CPA is successful. VI. THE PREPROCESSING PHASE A. Trace re-synchronization In SCA, the signal alignment process is of great concern since the deployed analyses are very sensitive to the magnitude

of acquired traces. The process of trace alignment is usually referred to as synchronization. Time or frequency re-synchronization is needed for an accurate positioning of the encryption process window, to avoid any displacement between the acquired digital signals. The displacement of the traces considerably affects the result of SCA. In a real-life context, it is almost impossible to obtain perfectly aligned traces, for several reasons. A frequent situation is that the trigger signal, which is precisely synchronized with the cryptographic process, is removed by the designer for security reasons. Indeed, devices like FPGAs and smart cards, which require a high level of security and are designed for sensitive applications, are usually not equipped with an external clock synchronized with the internal clock connected to the cryptographic block. Sometimes, for power and functional testing reasons, secure devices may be equipped with a trigger signal; even in this case, a jitter related to deviations from the true leakage instant of the cryptographic process is often observed. Therefore, in both situations, secret information can be lost due to errors induced by the displacement of acquired Side-Channel traces. In this context, the Phase-Only Correlation (POC) is an interesting technique proposed by Naofumi Homma et al. [20]. This technique has shown its efficiency in the field of computer vision. It employs phase components in the frequency domain, using the Discrete Fourier Transform, and makes it possible to determine displacement errors between signals from the location of the correlation peak. More recently, a new technique, based on the Dynamic Time Warping (DTW) algorithm, has been presented to the cryptographic community by J. van Woudenberg et al. [21]. DTW has the advantage of working with traces of different sizes; however, it needs a parameter to trade off between speed and quality of the curve realignment. Even more recently, Sylvain Guilley et al. have proposed the threshold POC (named T-POC) [22], which is a trade-off between POC and AOC (Amplitude Only Correlation). In [22], the authors show the great efficiency of this technique, especially when cryptographic countermeasures are used. Recall that most countermeasures aim to hide the information contained at the amplitude level and neglect the information contained at the phase level. In order to highlight the effect of the re-synchronization process, we took accurately aligned traces and simulated a desynchronization of various magnitudes by creating displacements between traces. Then we performed basic DPA and CPA with four increasing displacements (disp_i), i = 0, . . . , 3, and observed the difference. We define a displacement disp_i as a random number of time samples shifted left or right, as shown in Fig. 3. Results are depicted in Fig. 4 and Fig. 5 through the “First-order success rate” metric. It is clear that, for both attacks, DPA and CPA, the sensitivity to the desynchronization of traces increases with the value of the displacements. Moreover, note that DPA is much more sensitive than CPA: even for the smallest displacement (disp_0), it proves to be completely inefficient, whereas CPA still manages to recover the secret key for small displacement

values, disp_0 and disp_1. However, in all desynchronization scenarios, DPA loses performance with respect to the reference case DPA_ref (i.e., without displacements). We carried out our experiment on DES [23] power consumption traces that are made freely available online, in the context of the first version of the DPA CONTEST. The DES implementation used for the competition is unprotected and easily breakable by first-order SCA like DPA: only around 300 traces are needed to perform a successful CPA. More details about this implementation can be found in [24].
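A minimal sketch in the spirit of the phase-based matching of [20] is given below. It is our own simplified illustration (a single global shift estimated from the normalized cross-power spectrum), not the algorithm of [20], [21] or [22], and the artificial "leakage peak" used for the check is purely hypothetical.

# Minimal phase-correlation alignment sketch (ours): the normalized
# cross-power spectrum of a trace and a reference yields a peak whose
# position gives the displacement to undo before analysis.
import numpy as np

def estimate_shift(reference, trace):
    """Number of samples by which `trace` lags `reference` (positive = shifted right)."""
    cross = np.conj(np.fft.fft(reference)) * np.fft.fft(trace)
    cross /= np.abs(cross) + 1e-12          # keep only the phase information
    corr = np.real(np.fft.ifft(cross))
    shift = int(np.argmax(corr))
    if shift > len(trace) // 2:             # interpret large shifts as negative (left) shifts
        shift -= len(trace)
    return shift

def realign(reference, trace):
    return np.roll(trace, -estimate_shift(reference, trace))

rng = np.random.RandomState(0)
ref = rng.normal(size=1000); ref[400:420] += 5.0          # artificial leakage peak
shifted = np.roll(ref, 37) + rng.normal(scale=0.1, size=1000)
print(estimate_shift(ref, shifted))                       # 37
print(estimate_shift(ref, realign(ref, shifted)))         # 0 after realignment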

Figure 3. Illustration of DES traces desynchronization.
Figure 4. CPA 1st-order success rate on desynchronized traces.
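For reference, the two metrics of [18] used in Figs. 4, 5, 7 and 8 can be estimated from repeated attack runs as in the following sketch. It is our own illustration; rankings is assumed to hold, for each independent run, the key hypotheses ordered from best to worst by the distinguisher.

# Sketch (ours) of the first-order success rate and guessing entropy metrics.
import numpy as np

def first_order_success_rate(rankings, correct_key):
    return np.mean([1 if r[0] == correct_key else 0 for r in rankings])

def guessing_entropy(rankings, correct_key):
    positions = [list(r).index(correct_key) + 1 for r in rankings]   # 1 = best rank
    return np.mean(positions)

# toy usage with three hypothetical runs over 4 key candidates
runs = [[3, 1, 0, 2], [1, 3, 0, 2], [3, 0, 1, 2]]
print(first_order_success_rate(runs, correct_key=3))   # 2 successes out of 3 -> ~0.67
print(guessing_entropy(runs, correct_key=3))           # average rank of the correct key -> ~1.33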

B. Noise cancellation In all fields based on signal processing theory, an important part of the analysis is devoted to the noise problem. In the Side-Channel field, the analyses performed to guess the value of the secret key are very sensitive to the magnitude of the EM or power consumption signals, and these signals are always affected by some form of noise. In this context, the importance of noise analysis becomes clear. Despite the relevance of this topic for enhancing SCA, only a few publications deal with the issue. Thanh-Ha Le uses the fourth-order cumulant to remove second-order noise [25], Xavier Charvet employs wavelet-based denoising for SCA enhancement [26], and other alternatives use the Kalman filter, which has often been adopted as an algorithmic solution in many scientific fields such as robotics. Another approach is a hardware

solution used by some evaluation labs. This solution consists in using electromagnetic shielding to decrease the contribution of external sources. However, this technique is costly and it is often hard to accommodate the test setup. In order to improve the reliability of SCA, it is necessary to minimize the contribution of external sources, such as nearby circuits, exterior lighting, electrical wiring, etc. Basically, we distinguish two independent types of noise: electrostatic and magnetic. Generally, external noise sources engender combinations of the two noise types, which complicates the noise reduction problem. Electrostatic fields are induced by the presence of voltage, which is the origin of electrostatic noise, whereas magnetic fields are generated either by the flow of electric current or by the presence of permanent magnetism, which is the origin of magnetic noise. In academic settings, the noise is reduced by averaging the signals, since academics are free to acquire as many measurements as they want to perform a successful SCA. In real life, however, the evaluator may be limited in the number of measurements. Furthermore, for some protected cryptographic implementations such as masked algorithms, the evaluator cannot simply average the recorded signals. C. Side-Channel trace windowing In this section, we encourage the evaluator not to neglect the importance of selecting the right window, which we refer to as the window of interest, when analysing Side-Channel traces. The window of interest can be defined as the range of time samples covering all the leakage instants targeted by the Side-Channel analysis. The major advantage of window selection is undoubtedly related to timing considerations. On the one hand, it makes SCA faster, as the size of the processed data is significantly reduced compared to considering the entire cryptographic process. On the other hand, it accelerates the acquisition of Side-Channel traces. The effect of different selected windows, depicted in Fig. 6, when performing a basic Differential Power Attack on DPA Contest DES traces, is assessed with the Success rate and Guessing entropy metrics depicted in Fig. 7 and Fig. 8. Obviously, DPA loses performance as the window size increases.
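The two simple preprocessing steps discussed in this section, averaging repeated acquisitions where this is applicable and restricting the analysis to a window of interest, amount to a few array operations. The sketch below is our own illustration with hypothetical trace sizes and window bounds, not the evaluation tooling.

# Sketch (ours) of two simple preprocessing steps: averaging repeated
# acquisitions (only meaningful for unmasked implementations where the same
# plaintext may be encrypted several times) and restricting the analysis to
# a window of interest around the targeted leakage instants.
import numpy as np

def average_repetitions(traces, repetitions):
    """traces has shape (n_plaintexts * repetitions, n_samples); consecutive
    acquisitions of the same plaintext are averaged together."""
    n_samples = traces.shape[1]
    return traces.reshape(-1, repetitions, n_samples).mean(axis=1)

def select_window(traces, start, stop):
    """Keep only the samples covering the targeted leakage instants."""
    return traces[:, start:stop]

rng = np.random.RandomState(0)
raw = rng.normal(size=(200 * 4, 5000))            # hypothetical acquisition campaign
prepared = select_window(average_repetitions(raw, repetitions=4), 2000, 3000)
print(prepared.shape)                             # (200, 1000): far less data to analyse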

Figure 5. DPA 1st-order success rate on desynchronized traces.

This can be explained by the increasing presence of noise when more information is taken into account than just the

Figure 6. Illustration of windows selection process.
Figure 7. DPA 1st-order success rate on different windows.
Figure 8. DPA guessing entropy on different windows.
Figure 9. Evaluation process scheme (characterization, simulation, acquisition, preprocessing, and analysis and decision phases).

real leakage covered by the smallest window. The real leakage is the one modeled by our algorithm analysis and is represented by the activity of the DES first-round register. Moreover, the windowing process relaxes the memory depth parameter that is set on the oscilloscope, leaving more storage memory available. However, the evaluator is not always free to choose the window he wants and must then deal with the whole cryptographic process. Indeed, the vendor, when having recourse to an evaluation lab, is free to keep some details about the cryptographic implementation secret. Consequently, the evaluator is bound to handle such a situation by considering the entire Side-Channel trace in his analysis. VII. METHODOLOGICAL SCHEME FOR THE EVALUATION This section presents a framework in which the previously proposed phases are organized in a methodological scheme, in order to lighten the task of the evaluator. According to Fig. 9, the


evaluation process starts by exploring the tested device based on the documentation provided by the vendor. This documentation includes, for instance, the hardware specifications of the device, including the cryptographic implementation. These specifications have different levels of accuracy: they may be just a text manual describing the physical configuration of the device or, more accurately, the design details (data sheets, netlists . . . ). Additionally, further specifications may be provided by the vendor, such as those related to the control of the device, status sensors and their physical, logical or electrical features. The task of the evaluator is thus easier when accessing the device. Generally, specifications of the device security policy, which aim at describing the deployed security mechanisms, are required when approaching a certification lab. Based on these specifications, the evaluator is able to characterize the tested device. In the characterization phase, one question arises: which is the most appropriate analysis to perform. At this point, according to the provided documentation, the evaluator should be able to know which level of simulation he is allowed to perform. When transistor and cell netlists are available, the evaluator can precisely determine the different leakage instants occurring during the cryptographic process. It is noteworthy that, in the case of limited knowledge about the device, the evaluator can operate in the same manner as an attacker, who often has some clues about parts of the netlists; the evaluator can then simulate the power consumption of these parts. In such simulations, the most commonly used power model is the Hamming Distance. As illustrated in the evaluation scheme, the simulation phase is then mapped to the analysis and decision phase. The double-headed arrows, denoted by circled one and two, indicate the presence of a mutual relation: the evaluator is led to perform different simulations, assessed by the analysis and decision phase, until all leakage instants are found, thereby determining the most appropriate analysis to perform on real Side-Channel measurements. The next step

in the evaluation process is to proceed to the acquisition phase which is mainly based on the setup of the oscilloscope. This practical phase is mapped to the preprocessing one, which aims at preparing the acquired Side-Channel traces to the analysis and decision phase. According to the double headed arrow, denoted by a circled five, the preprocessing phase receives a feedback from the decision phase. Actually, the evaluator is required to improve the analysis process by controlling the preprocessing phase. Eventually, the evaluator establishes an evaluation report in order to verify whether his analysis meet or not the security requirements claimed by the vendor. VIII. C ONCLUSION We have established a framework in which the work of the evaluator is organized into five different phases, when assessing the robustness of a secure embedded system against Side-Channel Attacks. We have shown that these phases, which have been commented by practical examples, are in close relationship with each other. Therefore, any lack of rigor occurring through one phase could mislead the whole evaluation process. Our future work consists in investigating new security solutions, for the different proposed phases, in order to improve the reliability of the evaluation process. R EFERENCES [1] C. C. consortium, “Application of Attack Potential to Smartcards v2-5,” April 2008, http://www.commoncriteriaportal.org/files/supdocs/ CCDB-2008-04-001.pdf. [2] NIST/ITL/CSD, “Security Requirements for Cryptographic Modules. FIPS PUB 140-2,” December 2002, http://csrc.nist.gov/cryptval/140-2.htm. [3] D. Abraham, D.G and Stevens, “Transaction security system,” in IBM systems journal, 1991, pp. 211– 218, iSSN: 1522-8681, ISBN: 0-76951540-1j INSPEC Accession Number: 7321683. [4] “Common Criteria (ISO/IEC 15408),” http://www.commoncriteriaportal. org/. [5] IT certification for the French government, http://www.ssi.gouv.fr/site article135.html. [6] IT certification for the German government, “https://www.bsi.bund.de/cln 165/ContentBSI/Themen,” /ZertifizierungundAnerkennung/ZertifierungnachCCundITSEC/ AnwendungshinweiseundInterpretationen/AISZertifizierungsschema/ ais schema.html. [7] P. C. Kocher, J. Jaffe, and B. Jun, “Differential Power Analysis,” in Proceedings of CRYPTO’99, ser. LNCS, vol. 1666. Springer-Verlag, 1999, pp. 388–397. ´ Brier, C. Clavier, and F. Olivier, “Correlation Power Analysis with [8] E. a Leakage Model,” in CHES, ser. LNCS, vol. 3156. Springer, August 11–13 2004, pp. 16–29, Cambridge, MA, USA. [9] B. Gierlichs, L. Batina, and P. Tuyls, “Mutual Information Analysis – A Universal Differential Side-Channel Attack,” Cryptology ePrint Archive, Report 2007/198, 2007. [10] I. V. Francois-Xavier Standaert, Benedikt Gierlichs, “Partition vs. Comparison Side-Channel Distinguishers.” [11] Y. Li, K. Sakiyama, L. Batina, D. Nakatsu, and K. Ohta, “Power Variance Analysis Breaks a Masked ASIC Implementation of AES,” in DATE’10. IEEE Computer Society, March 8-12 2010, Dresden, Germany. [12] TELECOM ParisTech SEN research group, “DPA Contest (1st edition),” 2008–2009, http://www.DPAcontest.org/. [13] S. Mangard and K. Schramm, “Pinpointing the Side-Channel Leakage of Masked AES Hardware Implementations,” in CHES, ser. LNCS, vol. 4249. Springer, October 10-13 2006, pp. 76–90, Yokohama, Japan.

[14] I. Koichi, T. Masahiko, and T. Naoya, “Encryption secured against DPA,” June 10 2008, Fujitsu US Patent 7386130, http://www.patentstorm.us/ patents/7386130/fulltext.html. [15] T. Popp and S. Mangard, “Masked Dual-Rail Pre-charge Logic: DPAResistance Without Routing Constraints,” in Proceedings of CHES’05, ser. LNCS, vol. 3659. Springer, August 29 – September 1 2005, pp. 172–186, Edinburgh, Scotland, UK. [16] M. Nassar, S. Bhasin, J.-L. Danger, G. Duc, and S. Guilley, “BCDL: A high performance balanced DPL with global precharge and without early-evaluation,” in DATE’10. IEEE Computer Society, March 8-12 2010, pp. 849–854, Dresden, Germany. [17] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, December 2006, iSBN 0-38730857-1, http://www.dpabook.org/. [18] F.-X. Standaert, T. Malkin, and M. Yung, “A Unified Framework for the Analysis of Side-Channel Key Recovery Attacks,” in EUROCRYPT, ser. LNCS, vol. 5479. Springer, April 26-30 2009, pp. 443–461, Cologne, Germany. [19] S. Mangard, “Hardware Countermeasures against DPA – A Statistical Analysis of Their Effectiveness,” in CT-RSA, ser. Lecture Notes in Computer Science, vol. 2964. Springer, 2004, pp. 222–235, San Francisco, CA, USA. [20] N. Homma, S. Nagashima, Y. Imai, T. Aoki, and A. Satoh, “Highresolution side-channel attack using phase-based waveform matching,” in CHES, 2006, pp. 187–200. [21] J. G. J. van Woudenberg, M. F. Witteman, and B. Bakker, “Improving Differential Power Analysis by Elastic Alignment,” in CT-RSA, 2011, pp. 104–119. [22] S. Guilley, K. Khalfallah, V. Lomne, and J.-L. Danger, “Formal Framework for the Evaluation of Waveform Resynchronization Algorithms,” in WISTP, June 1 2011. [23] NIST/ITL/CSD, “Data Encryption Standard. FIPS PUB 46-3,” Oct 1999, http://csrc.nist.gov/publications/fips/fips46-3/fips46-3.pdf. [24] S. Guilley, P. Hoogvorst, and R. Pacalet, “A Fast Pipelined MultiMode DES Architecture Operating in IP Representation,” Integration, The VLSI Journal, vol. 40, no. 4, pp. 479–489, July 2007, DOI: 10.1016/j.vlsi.2006.06.004. [25] T.-H. Le, J. Cledi`ere, C. Servi`ere, and J.-L. Lacoume, “Noise Reduction in Side Channel Attack using Fourth-order Cumulant,” IEEE Transaction on Information Forensics and Security, vol. 2, no. 4, pp. 710–720, December 2007, DOI: 10.1109/TIFS.2007.910252. [26] H. Pelletier and X. Charvet, “Improving the DPA attack using Wavelet transform,” September 26-29 2005, Honolulu, Hawai, USA; NIST’s Physical Security Testing Workshop.

INTERFACING AND SCHEDULING LEGACY CODE WITHIN THE CANALS FRAMEWORK Andreas Dahlin, Fareed Jokhio, Johan Lilius Turku Centre for Computer Science Department of Information Technologies Åbo Akademi University, Turku, Finland {andalin, fjokhio, jolilius}@abo.fi ABSTRACT The need for understanding how to distribute computations across multiple cores has obviously increased in the multi-core era. Scheduling the functional blocks of an application for concurrent execution requires not only a good understanding of data dependencies, but also a structured way to describe the intended scheduling. In this paper we describe how the Canals language and its scheduling framework can be used for scheduling and executing legacy code. Additionally, a set of guidelines for translating RVC-CAL applications into Canals is presented. The proposed approaches are applied to an existing MPEG-4 Simple Profile decoder for evaluation purposes. The inverse discrete cosine transform (IDCT) is accelerated by means of OpenCL. 1. INTRODUCTION The ever increasing complexity of software requires more computational power. Since increasing the clock frequency of a single core is no longer a way forward [1], especially for embedded systems where cooling and battery life often are critical aspects, going multi-core has become the dominant approach. The transition to multi-core started in the desktop segment and is now also clearly visible in the embedded systems domain. A vast amount of existing software has been written with sequential execution on a single core in mind. Unfortunately, platform-dependent software optimized for sequential execution cannot easily make use of the additional computational power provided by additional cores. Synchronization of software executing simultaneously on several cores requires a more structured approach to scheduling than what is present in most presently available software implementations. Completely rewriting all software would be the best solution, but since that is impossible in practice, there is clearly a need to deal with legacy code. As an example, video coding software is fairly complex because modern video standards, including MPEG-4 [2], allow a bitstream to be encoded in a variety of ways. It is a tedious task to implement a full decoder/encoder and, since a number of video decoder/encoder implementations already exist, it would be

Jérôme Gorin, Mickaël Raulet IETR/Image Group Laboratory INSA Rennes, Rennes, France {jgorin, mraulet}@insa-rennes.fr

beneficial to use those as a starting point and only rewrite selected parts in a language that provides proper support for scheduling applications on multi-core platforms. The parts of most interest are those in which most time is spent, the bottlenecks. Bottlenecks that could be eliminated by performing computations in parallel are of special interest. Such bottlenecks can be found, for instance, by profiling and analyzing how the data flows through the application. Our analysis [3] of a particular RVC-CAL implementation of an MPEG-4 Simple Profile decoder shows that one third of the computation time is spent on overheads associated with scheduling; e.g., the IDCT component spends 73% of its time performing real computations while 27% is consumed by overheads. Clearly there is a need for better scheduling approaches than the simple round-robin scheduler used in this decoder. To be able to experiment with different scheduling strategies it is beneficial to separate scheduling decisions from the computational code. For this purpose the Canals Scheduling Framework is suitable. The benefits of using Canals for legacy applications are improved performance and an increased ability to adapt to changes in the execution environment. The contributions of this paper are: • Presentation of a way to interface and schedule legacy code using the Canals framework. • A code translation approach for critical parts of the system where interfacing legacy code is not sufficient. Translation enables full Canals support for scheduling, mapping and code generation. • OpenCL interfacing/code generation by Canals for hardware acceleration purposes. The paper is structured as follows. Background information is provided in section 2. Section 3 focuses on scheduling and interfacing legacy code from Canals, while section 4 presents guidelines for translating applications written in RVC-CAL to the Canals language. A case study, conducted on an MPEG-4 Simple Profile decoder, which demonstrates how the approaches described in the two previous sections can be applied in practice, is presented in section 5. Finally, in section 6, we conclude the paper.

2. BACKGROUND 2.1. Canals In this section we provide brief background on the Canals language [4]. Canals aims to facilitate code generation for heterogeneous platforms, primarily in the data flow driven application domain. Thus, one of the goals is to be able to explicitly express data flows making analysis and efficient code generation possible. Canals does also provide fine-grained scheduling and execution of a program. A Canals program describes the intended behaviour of the program in a platform independent manner. All elements exist in parallel and are capable of performing computations concurrently from a resource point of view. Only data dependencies restrict the parallelism. The completely concurrent behaviour can be restricted in the compilation process by supplying a mapping and an architecture description to the compiler. Canals is based on the concept of nodes (kernels and networks) and links (channels), which connect nodes together. Computations are performed in the nodes and the links are representing intermediate buffers in which output data from nodes is stored before the data is consumed by the next node. We have included expressive data type descriptions in Canals, since they are essential for understanding the precise behaviour of a data flow driven application. Kernels are the fundamental computational element in Canals. All major computations in Canals are carried out inside kernels. Computations are performed on data available on the incoming data port and the results are written to the outgoing data port. All declared variables are local: they are only accessible from the kernel in which they are declared. Variable values are stored between invocations of a kernel, implying that kernels have a state. Computations (data processing) are performed in the work block using the sequential Canals Kernel Language. Kernels are also the mechanism through which communication between the Canals program and the external environment is handled. Reading input data from a file, a network stream or a pipe as well as writing output data are all examples of this kind of external communication with the environment. The channel is an abstraction of an inter-kernel memory buffer, used for storing data produced and consumed by two connected kernels. A channel must specify the type of data the channel can hold, while other optional channel restrictions, for instance channel capacity, are specified in the channel definition body. Canals has one predefined channel type; generic channel, which is an unbounded FIFO queue that can hold elements of any defined data type. The task of distributing data and selecting appropriate data paths is essential in data flow based approaches. Scatter and gather are the Canals elements responsible of this. A scatter is responsible for distributing data from one input channel to several output channels. The policy for distributing the data and the amount of data distributed to each channel is specified as attributes in

the scatter body. Gathering parallel data flows is possible using the gather element. Definition of a gather is similar to the definition of a scatter. Scatter and gather elements can act as switches between data paths rather than distributing data on all paths. Kernel, scatter, gather and channel are all basic elements in the language. In order to be able to group a number of these elements into a larger functional module we need a container, in Canals denoted network. All defined elements, including networks, can be added to a network. It is also in the network the elements are connected together to build a larger functional unit. Elements in a network are connected together with the connect statement. Canals is a language that does not implement a single model of computation (MoC) or a set of predefined MoCs, but instead the computational model is possible to express in the language itself. Furthermore, each network can execute using its own model of computation, rather than relying on one central MoC for the entire program. In Canals, scheduling is concerned with the task of planning the execution of a Canals network, considering data flow as well as resource use. To be able to reason about scheduling considering both of these, they are handled by separate elements in Canals. The scheduler is responsible for planning the execution of kernels, in such a way that data available as input to the network is routed correctly through the network and eventually becomes available as output from the network. The scheduler can accomplish correct routing by inspecting data, obviously at run-time, arriving to the network and make routing decisions based on the contents of the data. The list of kernels that must be executed in order to actually move data according to the calculated route is denoted a schedule. Triggering of kernels is the task of the dispatcher. The dispatcher should strive to execute the schedule in an optimal order regarding available processing resources. 2.2. CAL The CAL Actor Language (CAL) [5] is a Domain-Specific Language which is especially designed to provide a concise and high-level description of actors. RVC-CAL is a subset of CAL normalized in MPEG RVC [6] as the reference programming language for describing coding tools in MPEG-C [7]. RVC-CAL, compared with the CAL, restricts the data types, and operators that cannot be easily implemented onto the platforms. Figure 1 shows an example of an actor describe in RVC-CAL that computes the absolute value of token received on the input port I to the output port O. An actor contains one or several actions. An action is the only entry point of an actor that may read tokens from input ports, compute data, change state of the actor and write tokens to output ports. The body of an action is executed as an imperative function with local variables and imperative statements. When an actor fires, a single action is selected among others according to the number and the values of tokens available on

actor Abs () int I ==> uint O :
  pos: action I:[u] ==> O:[u] end
  neg: action I:[u] ==> O:[-u]
  guard u < 0
  end
  priority
    neg > pos;
  end
end
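The firing semantics of this actor (fire only when a token is available, select the neg action when its guard u < 0 holds since it has priority over pos, otherwise fire pos) can be mimicked by the following small model. It is our own illustration of the semantics in Python, not CAL and not generated code.

# Illustrative Python model (ours) of the firing semantics of the Abs actor
# in Fig. 1: one token is read from the input FIFO, the 'neg' action is
# selected when its guard holds because it has priority over 'pos', and the
# result is written to the output FIFO.
from collections import deque

def fire_abs(input_fifo, output_fifo):
    if not input_fifo:                # an actor only fires when tokens are available
        return False
    u = input_fifo.popleft()
    if u < 0:                         # guard of 'neg'; priority neg > pos
        output_fifo.append(-u)
    else:                             # 'pos' has no guard
        output_fifo.append(u)
    return True

inp, out = deque([3, -7, 0, -2]), deque()
while fire_abs(inp, out):
    pass
print(list(out))                      # [3, 7, 0, 2]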

application using OpenCL are significant, while parallel sections with inter-dependencies and small data sets gain less.
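As a concrete illustration of the host/device split described above, the following sketch launches a trivial data-parallel kernel through the PyOpenCL bindings. It is our own minimal example, not code generated by Canals; the kernel name absval is hypothetical, and it assumes an OpenCL platform and the pyopencl package are installed.

# Minimal OpenCL host/device sketch (ours, using PyOpenCL): the host creates a
# context and a command queue, copies data to a device buffer, launches a
# data-parallel kernel and reads the result back.
import numpy as np
import pyopencl as cl

SRC = """
__kernel void absval(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = fabs(in[i]);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, SRC).build()

data = np.random.randn(1 << 16).astype(np.float32)
mf = cl.mem_flags
d_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
d_out = cl.Buffer(ctx, mf.WRITE_ONLY, data.nbytes)

prog.absval(queue, data.shape, None, d_in, d_out)   # enqueue the kernel
result = np.empty_like(data)
cl.enqueue_copy(queue, result, d_out)
queue.finish()
print(np.allclose(result, np.abs(data)))            # True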

Fig. 1. CAL Actor for computation of absolute value.

In this section an approach for scheduling legacy code in Canals is presented. As discussed in the introduction, a large quantity of properly working software exists. The software might require partial redesign to benefit from multi-core processors, a design and programming effort that still is reasonable compared to rewriting the entire application. If critical parts and blocks of legacy code can be scheduled, mapped and executed within the same framework, the application can benefit from multi-core with a relatively low effort. Interfacing legacy code in this case means that the main function of the application is generated by the Canals compiler. A native Canals application can, a bit simplified, be seen as a collection of kernels performing computations and exchanging computational results over channels. Wrapping source or binary code into what we denote external kernels, gives the Canals Scheduling Framework access to them as if they were normal kernels. External kernels can share data over external channels or normal Canals channels depending on the interface specification. The use of external channels is an easy starting point when data transfers are of no or little interest, since external channels essentially hide data transfers from Canals. From this follows that scheduling of memory transfers is not possible if external channels are used. An external kernel is defined similarly to a normal kernel. The body of an external kernel can contain the following backend dependant attributes for specifying the interface between the kernel and Canals:

ports and the current state of the actor. The guard conditions specify additional firing conditions, where the action firing depends on the values of input tokens or the current state of the actor. Action selection may be further constrained using a Finite State Machine and priority inequalities to impose a partial order among action tags. A composition of RVC-CAL actors forms an RVC-CAL network. In an RVC-CAL network, actors communicate via unbounded channels by reading and writing tokens from and to FIFOs. At a network level, each actor works concurrently, executing their own sequential operations. RVC-CAL actors are only driven by token availability. An actor can fire simultaneously regardless of the environment, allowing the application to be easily distributed over different processing elements. This feature is particularly useful in the context of multi-core platforms. An important point of the RVC-CAL representation is that an actor is not specified in a specific execution model. RVC-CAL is expressive enough to specify a wide range of programs that follow a variety of dataflow models, trading between expressiveness and analyzability. 2.3. OpenCL The Open Computing Language[8], more commonly known as OpenCL, is a royalty-free specification for general purpose parallel programming tasks in heterogeneous systems. The specification is a platform-independent interface, but implementations are targeted towards a specific platform and vendor. OpenCL is based on the concepts of a platform model, an execution model and a memory model. The platform model, an abstraction of the underlying hardware, consists of a host and one or more devices. In practice this often today means that the CPU is the host and the GPU is a device. The memory model specifies the memory hierarchy of the device. Code executing on a device is called kernels. Kernels are written in a restricted version of C99, and can be either binary precompiled or source compiled by the host at runtime. Runtime (located on host) delegates tasks to the devices for execution. OpenCL supports data parallel and task parallel programming models. Command queues coordinate the execution of OpenCL kernels in a variety ways including in-order and out-of-order execution. All data passed between kernels should be encapsulated as memory objects, thus enforcing the OpenCL programmer to consider the data flow aspects of his program. The benefits from accelerating embarrassingly parallel sections, which operate on large amounts of data, in an

3. INTERFACING LEGACY CODE FROM CANALS

type: External kernel type; can be source or binary.
files: Files necessary for proper access to the kernel.
inithandle: A handle to initialization functionality that should be invoked before the first kernel execution.
workhandle: A handle to the normal execution routine.
enabledhandle: A handle to functionality that can decide if the kernel is eligible for execution.
inputhandle, outputhandle: Specify data access.
getamounthandle, putamounthandle: Provide a handle to the current consumption/production rate of the kernel.
supportsCanalsIO: Specifies whether the kernel uses the Canals channel API (get, put, look).

An example of a definition compatible with the C++ backend:

external kernel bit -> bit get * put * IDCT2d {
  type = binary;
  files = "common_constants.h, idct.o";
  workhandle = "extern(C):void idct2d(); idct2d();";
  enabledhandle = "extern(C): bool idct2d_is_enabled(); idct2d_is_enabled();";
}
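To make the role of these handles concrete, the following toy model (ours, written in Python rather than the C++ emitted by the Canals backend, with a hypothetical pending_blocks queue standing in for the real channel state) shows how a dispatcher could drive such an external kernel through its enabled and work handles.

# Toy model (ours) of an external kernel driven through its handles: the
# kernel is fired only while its 'enabled' handle reports that data is queued.
class ExternalKernel:
    def __init__(self, name, work, enabled):
        self.name, self.work, self.enabled = name, work, enabled

pending_blocks = [list(range(64)), list(range(64))]   # stand-in for queued 8x8 blocks

def idct2d_is_enabled():
    return bool(pending_blocks)

def idct2d():
    block = pending_blocks.pop(0)
    # ... the real routine would run the inverse transform on `block` ...
    print("processed one block of", len(block), "coefficients")

idct = ExternalKernel("IDCT2d", work=idct2d, enabled=idct2d_is_enabled)

def dispatch(schedule):
    """Run a schedule to completion, mimicking the default Canals dispatcher."""
    for kernel in schedule:
        while kernel.enabled():
            kernel.work()

dispatch([idct])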

Now that the Canals compiler is aware of the legacy code block through the external kernel mechanism, but we still need to gather scheduling information on these kernels. Additionally we must select a dispatching strategy. 3.1. Scheduling and Dispatching Scheduling legacy code means that we must extract information about the components and their interconnection, and make that information available to the Canals scheduler. A Canals scheduler contains at least one kernel, in which the schedule computations take place. The scheduling code is written in the Canals kernel language according to the intended scheduling strategy. To be able to make scheduling decisions the scheduler must be able to access information about the network it is scheduling, also during run-time. For this purpose any kernel inside a scheduler can navigate the scheduled network through the Run-Time Network Navigation API. The API implements functionality for retrieving the first element, last element, next element and previous element. Additionally element information queries regarding e.g. consumption/production rates, element type and number of output ports from a scatter element can be made. The same API is used for scheduling native as well as external kernels. The API is supported by one network topology matrix per network. The matrix is a lower-triangular matrix containing all static information about the network we schedule and information on how to retrieve correct information for variable values, such as dynamic production rate or channel size. This compact representation of network topology and element information can be placed in a memory close to the processing unit on which the scheduler is mapped, thus ensuring low communication overhead for most scheduler queries. Multiple inputs/outputs are possible for external kernels. In the Canals network they are however only connected by one ingoing and one outgoing external channel, which represent the input/output port most relevant for scheduling purposes. The purpose of the connection is to provide information needed for building the topology table. The order of elements in the topology table is the same as in the connection order. It is also possible to build the topology matrix based on other information than the information which can be extracted from the Canals network description. For rapid prototyping, the topology table can even be written manually in the target language (C++). In case a Canals network only contains elements written in RVC-CAL, the topology table can be derived from information provided by guards and consumption/production information available in each actor. A Canals scheduler can be generated based on information from guards, scheduling FSM and priorities. Run-time computations in the scheduler can be further reduced by analyzing the topology table and the scheduler at compile-time. In Canals the role of the dispatcher is to execute the schedules produced by the scheduler, through the Hardware Abstraction Layer

(HAL). The dispatcher should strive towards optimal utilization of computational resources by rearranging tasks from one or several schedules. Sorting of tasks must not break data dependencies imposed by the scheduler. There is a default dispatcher that always runs a schedule to completion before starting the dispatching of the next schedule. The same rules apply for dispatching external kernels. 3.2. Interfacing OpenCL Kernels from Canals It is possible to interface OpenCL kernels from Canals through the external kernel construct. The OpenCL framework (OpenCL API + execution model) can act as an interface between Canals and CAL. A CAL actor could, for instance, be compiled into an OpenCL kernel by the CAL compiler, and scheduled by Canals. The above mentioned interfacing and scheduling mechanism for external kernels applies for OpenCL kernels as well. Both the required OpenCL host and device code must be embedded into the function given to Canals as work handle. All initialization required should take place in the function specified as the init handle. However, it should be mentioned that some properties of OpenCL and its run-time system mismatch with this simple interfacing approach, resulting in degraded performance and limited flexibility. The number of OpenCL kernel instances intended for concurrent execution is limited by available input data and hardware limitations (e.g. number of threads on the GPU). This control information must be completely handled by the external code, or alternatively be sent as control flow. 4. TRANSLATING RVC-CAL INTO CANALS The interfacing mechanism described in the previous section enables scheduling of legacy functionality expressed in any supported language. In this section, another solution for scheduling code expressed in other languages is discussed, namely the possibility to translate existing source code into Canals. On the one hand, translation into Canals makes it possible to use all features of our scheduling framework, but on the other hand the required translation effort can often be larger than the effort for interfacing legacy code. The possible degree of automation is highly depending on the semantics of the source language and the software design philosophy applied. Since Canals is a streaming language, software that has been designed with data flow in mind is most suitable for translation into Canals. In this work we have studied how, and to what degree, translation from RVC-CAL (see section 2.2) can be carried out. General guidelines (see table 1) for such a translation are presented together with a concrete example on translation of the two-dimensional inverse cosine transform into Canals. RVC-CAL and Canals are quite similar in some aspects; both languages are based on the concept of computational nodes and links that connect these nodes. Each computational node consumes and produces

4. TRANSLATING RVC-CAL INTO CANALS

The interfacing mechanism described in the previous section enables scheduling of legacy functionality expressed in any supported language. In this section, another solution for scheduling code expressed in other languages is discussed, namely translating existing source code into Canals. On the one hand, translation into Canals makes it possible to use all features of our scheduling framework; on the other hand, the required translation effort can often be larger than the effort for interfacing legacy code. The possible degree of automation depends heavily on the semantics of the source language and the software design philosophy applied. Since Canals is a streaming language, software that has been designed with data flow in mind is most suitable for translation into Canals. In this work we have studied how, and to what degree, translation from RVC-CAL (see section 2.2) can be carried out. General guidelines (see table 1) for such a translation are presented together with a concrete example of translating the two-dimensional inverse discrete cosine transform into Canals. RVC-CAL and Canals are quite similar in some aspects; both languages are based on the concept of computational nodes and links that connect these nodes. Each computational node consumes and produces data. Differences between the languages relevant from a translation viewpoint are: the number of input/output links from each node, data locality and scheduling. In Canals, two nodes (kernels) can only be connected by a single link, while CAL actors can be connected by an arbitrary number of links. An approach for resolving this incompatibility is to group data from several links together into a larger Canals data structure. In RVC-CAL, scheduling can be divided into intra-actor scheduling, which schedules the execution of actions, and inter-actor scheduling, which decides on the execution order among actors. Inter-actor scheduling is not decided explicitly by the programmer but rather by the compiler implementation and/or run-time system. In practice this means that the most intuitive translation of an actor into a Canals kernel is not possible for all valid actors, since Canals schedulers are only associated with networks and not kernels. Therefore a CAL actor should in the general case be translated into a Canals network, and an action into a kernel. In RVC-CAL, all variables are declared in the actor and shared by all actions. In Canals, variables are private to the kernel, implying that two kernels cannot exchange data through a shared variable, which means that the shared variables must be modelled as data flow. The construction in figure 2a shows how a combination of scatter (switch) and gather (select) elements guarantees shared access to data among actions. Scheduler S decides on the action execution order based on information translated from the guards, the finite state machine and the priorities in an actor. Further, S directs data to the action to be executed next by altering the switch position for both the scatter and gather elements by means of control flow. DIN holds the data available on all actor input ports, while DOUT represents the data produced to all output ports. DSHARED contains all variables shared between actions. Actor variables that are used only within one action can easily be identified and translated as kernel variables. All needed data is streamed to a kernel through the data definition D, which is a composition of DIN and DSHARED. K forwards data from the gather element if data is available; otherwise data initialized to a default value is generated. When multiple actors are connected together, forming an RVC-CAL network, two strategies can be applied: either all network levels are flattened into a single RVC-CAL network, or each RVC-CAL network is translated into a Canals network. Flattening an RVC-CAL network is straightforward since the network descriptions are only a syntactic construction without any semantics. Once the RVC-CAL network is translated into Canals it is possible for the programmer to write a scheduler that decides on the execution order for actors. The translation guideline above covers translation of a general actor or actor composition well. However, the translated Canals application introduces a large number of networks and thus run-time overheads. These overheads can be reduced by using optimizations in the Canals compiler. If semi-automated translation is a viable option, knowledge

RVC-CAL                    Canals
Network description        Network
Actor                      Network
Actor variable             Data definition/flow
Action                     Kernel
Procedure, Function        Inlined kernel code
Buffer                     Channel
Guards, FSM, Priorities    Scheduler, Control flow

Table 1. RVC-CAL to Canals element mappings.

Fig. 2. Canals representations of a RVC-CAL actor.

about the internal workings and data flows of an actor can be utilized to get a better translation into Canals. Figure 2b shows an example where an actor has been translated into a Canals kernel and the action selection is controlled by the scheduler of the containing Canals network. Such a translation is recommended for actors that are known to operate in certain modes, where each mode represents a certain action firing sequence (quasi-static schedule). It is even possible to integrate the selection of the mode into the kernel itself (figure 2c), increasing performance but at the same time reducing the possibilities for fine-grained scheduling. Actors for which a static schedule can be calculated can always be translated into a stand-alone kernel without the use of control flow connections.

4.1. Translation of an IDCT-2D Actor

To illustrate how translation is done in practice, a CAL implementation of the inverse discrete cosine transform (IDCT) is translated into Canals. The two-dimensional IDCT, operating on an 8x8 block, is frequently used in video and image processing. The chosen IDCT (see figure 3) is implemented as a single CAL actor with two actions. Additionally, priorities, guards, functions, procedures, constants and variables are present in this design. There are two inputs and a single output: the dominating input is the coded coefficients, the output is the decoded coefficients, and an additional input carries a single SIGNED boolean value for the entire block. The SIGNED flag is a control token that affects the clipping functionality of the IDCT, which in practice means that it decides which action is to be fired when the actor is executed. Both actions consume 64 values from actor port IN and a single value from actor port SIGNED, and produce 64 values to port OUT. In figure 4, the Canals translation based on the previously given guidelines is given. Some kernel code

actor Algo_IDCT2D_ISOIEC_23002_1 ()
    int(size=13) IN, bool SIGNED ==> int(size=9) OUT :

  int A=1024; int B=1138; ... int J=2528;
  List(type:int, size=64) scale = [A, B, ..., E];

  intra: action IN:[ x ] repeat 64, SIGNED:[ s ]
             ==> OUT:[ block1 ] repeat 64
    var List(type:int, size=64) block1, block2
  do
    // multiplier-based scaling
    block1 := [scale[n] * x[n] : for int n in 0..63];
    block1[0] := block1[0] + lshift(1, 12);
    // scaled 1D IDCT for rows and columns
    idct1d(block1, block2);
    ...
    // clipping
    block1 := [clip(block1[n], 0) : for int n in 0..63];
  end

  inter: action IN:[ x ] repeat 64, SIGNED:[ s ]
             ==> OUT:[ block1 ] repeat 64
    guard s
    var List(...) block1, List(...) block2
  do
    ...
    block1 := [clip(block1[n], -255) : for int n in 0..63];
  end

  procedure idct1d ... end

  function clip(int x, int lim) --> int ... end

  priority
    inter > intra;
  end
end

network IN_SIGNED -> OUT Algo_IDCT2D_ISOIEC_23002 {
  constant int32 A=1024; ... constant int32 J=2528;
  constant int32[64] scale = [A, B, C, ..., E];
  set_scheduler S_Algo_IDCT2D_ISOIEC_23002;
  set_dispatcher default;
  add_scatter sc; add_gather ga;
  add_kernel intra; add_kernel inter;
  connect NETWORK_IN -> sc;
  connect sc.outport[1] -> intra -> ga.inport[1];
  connect sc.outport[2] -> inter -> ga.inport[2];
  connect ga -> NETWORK_OUT;
}

scheduler IN_SIGNED->Schedule S_Algo_IDCT2D_ISOIEC {
  set_scheduler default;
  add_kernel k;
  connect SCHEDULER_IN -> k -> SCHEDULER_OUT;
}

kernel IN_SIGNED -> Schedule S_Algo_ComputeSchedule {
  work look 1 put 1 { /* Scheduling code */ }
}

kernel IN_SIGNED -> OUT ActionIntra {
  variable IN_SIGNED input;  variable OUT output;
  variable int13[64] x;
  variable int32[64] block1; variable int32[64] block2;
  variable int32 n;
  work get 1 put 1 {
    input = get();
    x = input.IN;
    for(n=0; n[IN_SIGNED,IN_SIGNED] ActionSwitch
gather [OUT, OUT]->OUT ActionSelect...

datadef IN_SIGNED { int13[64] IN; bool SIGNED; }
datadef OUT { int9[64] OUT; }
datadef int32 (type="integer") { bit[32]; }
datadef int13 (type="integer") { bit[13]; }
datadef int9 (type="integer") { bit[9]; }
datadef bool (type="boolean") { bit[1]; }

5. CASE STUDY - AN MPEG-4 SP DECODER

In this section we demonstrate how the interfacing approach (see section 3) can be used to implement an improved scheduling strategy for an MPEG-4 Simple Profile decoder [9] written in RVC-CAL. Further, hardware acceleration through OpenCL is enabled for a part of the decoder.

5.1. Analyzing and Profiling the Decoder

The idea in this case study is to optimize the performance of the inverse discrete cosine transform (IDCT) component, but before we start the process of interfacing and scheduling the MPEG-4 SP decoder from Canals, we must ensure that we have a thorough understanding of it. Therefore we analyze the RVC-CAL design carefully to get an understanding of the in-

Fig. 4. IDCT actor expressed in Canals.

kernel DCTCodedBlock->Block Algo_IDCT2D_ISOIEC_23002 {
  constant int32[64] scale = [1024, 1138, 1730, ..., 1264];
  variable DCTCodedBlock input;
  variable Block output;
  work get 1 put 1 {
    input = get();
    if (input.SIGNED) {
      // Code for inter action
    } else {
      // Code for intra action
    }
    put(output);
  }
}

datadef DCTCodedBlock { int13[64] IN; bool SIGNED; }
datadef Block { int9[64] OUT; }

Fig. 5. A manual translation of the IDCT actor into Canals.

Fig. 6. Decoder based on actors (as external kernels).

Fig. 7. Decoder based on actors and actions.

teraction between components. In this particular decoder the two-dimensional IDCT is described at a very detailed level with a large number of actors and actions. The next step is to verify that there actually is a large overhead in scheduling the IDCT, overhead that can be eliminated by improved scheduling. Instrumenting and profiling the C code generated from RVC-CAL by CAL2C for four video sequences shows that the default scheduling strategy is unsatisfactory. The decoder consists of 39 CAL actors that are scheduled and executed in a simple round-robin order. The scheduler function of an actor is called from a while loop in the main function. The scheduler function first checks whether any of the actor's actions can be executed, by checking the input and output requirements of each action without violating the rules imposed by the actor FSM. When no action can be executed anymore, the action scheduler returns and the main loop calls the scheduler function of the next actor in the list of actors. We are interested in the scheduling overhead, defined here as the time spent checking whether an actor can execute or not (by checking the action guards). The time spent on guard checking differs between actions: evaluating a guard may involve several condition checks, and the evaluation may fail already at the first condition, so that no further conditions need to be checked, or only at the very last one. The hardware used in this case study was a desktop computer with an Intel Core 2 Duo at 2.66 GHz, 2 GB of RAM and the 32-bit Windows 7 operating system. The number of checks was calculated for the MPEG-4 decoder with the Visual Studio profiler instrumentation tool. Detailed results from the analysis are available in [3].
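The generated round-robin loop itself is not listed in the paper; the following is a hedged C++ reconstruction of its structure, with a guard-check counter added to illustrate how the scheduling overhead can be measured. All names and the dummy actor are invented for the sketch.

// Hedged reconstruction of the CAL2C-style round-robin loop, with a guard-check
// counter added for profiling.  All names and the dummy actor are assumptions.
#include <cstdio>

static long guardChecks = 0;   // how many guard conditions were evaluated
static long guardMisses = 0;   // evaluations that did not lead to a firing

// Stand-in for one generated actor scheduler: it fires actions while their
// guards hold and returns the number of firings in this invocation.
static int dummyActorScheduler(int& tokens) {
    int fired = 0;
    for (;;) {
        ++guardChecks;                       // guard test: "is a token available?"
        if (tokens <= 0) { ++guardMisses; break; }
        --tokens; ++fired;                   // the action body would execute here
    }
    return fired;
}

int main() {
    const int kNumActors = 39;               // as in the profiled decoder
    int tokens[kNumActors];
    for (int i = 0; i < kNumActors; ++i) tokens[i] = 10;

    int progress = 1;
    while (progress) {                       // main loop: simple round-robin order
        progress = 0;
        for (int i = 0; i < kNumActors; ++i)
            if (dummyActorScheduler(tokens[i]) > 0) progress = 1;
    }
    std::printf("guard checks: %ld, misses: %ld\n", guardChecks, guardMisses);
}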

connection order is in this case not of great importance, but it decides the order in which elements are placed in the network topology matrix. Scheduler S creates a schedule containing each kernel once, by iterating over the elements with the getNextElement() function. The system only terminates if an iteration count argument is provided on the command line. Since we decided, based on analysis and profiling, to try to improve the scheduling and performance of the IDCT component, the next step is to expose that part in a more fine-grained way to the Canals scheduler by describing it at action level (see figure 7). In this design all actions from the 12 actors that describe the IDCT component in the original design have been extracted and added as 36 external kernels to a Canals sub-network. It has its own scheduler SIDCT, which decides on the action execution order within the network. The top-level scheduler has been altered so that instead of scheduling the 12 actors, it only checks whether the Canals network is enabled (equivalent to the scheduler of the network being enabled), and if it is enabled adds the network to the schedule. Enabledness for SIDCT is handled through the enabledhandle attribute, which in this case specifies that the return value of a certain C function decides on its enabledness. In this particular implementation the function checks that 64 tokens of input data and an additional token for the SIGNED value are available at the input queue, and that there is space for 64 output tokens on the output queue. The SIDCT scheduling functionality could be derived from the guards of each action, but in this case we have used a better approach. The work presented in [10] shows that a static action schedule (containing 755 action firings) can be calculated for the IDCT component. This static schedule is implemented in SIDCT and put on the dispatch queue when SIDCT is invoked. We now have very fine-grained control over the scheduling of the IDCT component. Scheduling and dispatching each action one-by-one in this manner through the Canals HAL is excellent for evaluation purposes, but because of the dispatching overhead this results in reduced performance compared to the original version. Another optimization step must be conducted. Analysis of the IDCT2dNetwork shows that it is, because of the static schedule, perfectly possible to flatten the network into a single kernel (figure 8). The transformation to a kernel was manual, but this step can be automated. The transformation is also possible for quasi-static schedules. It can be observed from the instrumentation and profiling that

5.1.1. Canals Scheduling Code Generated from CAL Actors

The first step in the case study is to verify that the decoder runs properly when executed through Canals, using the same scheduling as in the original code. This means that for each of the 39 actors a Canals external kernel is defined. In this case we use binary linking against the C object files generated by the CAL2C compiler and provide the scheduler function (which deals with both action scheduling and action execution) as the workhandle. Some of the actors require a certain initialization routine to be run before their first execution; for this purpose the inithandle is used. All defined kernels are added to the top-level network and connected together by external channels, as can be seen in figure 6. The

Fig. 8. Decoder with external kernel for IDCT.

Fig. 9. OpenCL accelerated kernels in Canals.

for our four video sequences there is a reduction in the number of hits and misses after applying the quasi-static schedule, because we now only perform three checks per IDCT invocation. Besides the 25% speedup for the IDCT part as a result of the reduction in overheads, a reduction in overheads for the other actors can be observed. The explanation is that the simple round-robin scheduler in the original design tries to execute actors that theoretically cannot be enabled before the IDCT has produced output. For the decoder as a whole, the number of decoded frames per second improved by 13%.
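As a concrete illustration of the enabled handle of SIDCT described above, the sketch below shows the kind of C/C++ function whose return value decides the enabledness of the IDCT sub-network. The queue type and its fields are assumed; the condition itself follows the text (64 IN tokens and one SIGNED token available, and room for 64 OUT tokens).

// Minimal stand-in for the channel/queue state visible to external code; the
// struct and field names are assumptions, the condition follows Section 5.1.1.
struct TokenQueue {
    unsigned available;    // tokens currently buffered
    unsigned free_slots;   // remaining capacity
};

// Enabled handle: the top-level scheduler calls this to decide whether the
// IDCT sub-network can be put on the schedule.
extern "C" int idct_network_enabled(const TokenQueue* in,
                                    const TokenQueue* signed_in,
                                    const TokenQueue* out) {
    return in->available        >= 64 &&   // one full 8x8 block of coefficients
           signed_in->available >= 1  &&   // plus the SIGNED control token
           out->free_slots      >= 64;     // and room for the decoded block
}

int main() {
    TokenQueue in = {64, 0}, sgn = {1, 0}, out = {0, 64};
    return idct_network_enabled(&in, &sgn, &out) ? 0 : 1;
}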

Canals is suitable for interfacing and scheduling general legacy code with a small effort. Complete control and more efficient code generation are provided by the translation approach for those parts of the application that are of particular interest and where a larger effort is motivated. We have also described how Canals can generate, interface and schedule OpenCL kernels. A case study of an MPEG-4 Simple Profile decoder is used for validation purposes.

7. REFERENCES

5.1.2. Canals Scheduling an OpenCL-Enabled Decoder

The IDCT operation on a block is obviously an operation that would benefit from data-parallel execution. For this purpose the external IDCT kernel in the design can be replaced with an external OpenCL IDCT kernel (figure 9). The OpenCL host and device code is completely wrapped in the functions provided to Canals as init and work handles, including the decision on how many parallel instances of the OpenCL kernel should be executed on the graphics processing unit (GPU). If only a small number of parallel instances are executed at once, the overhead caused by memory transfers from main to device memory will result in low performance. This is a common problem in OpenCL programming today, but future OpenCL-enabled platforms, such as the Mali [11], aim to resolve the issue of memory transfers by using uniform memory for host and device. As mentioned above, the buffer sizes and the number of parallel instances are decided in the external code block, which the Canals scheduler can neither access nor modify. This is a drawback of the interfacing approach. The decoder illustrated in figure 9 uses a native Canals kernel for the IDCT. The scatter and gather elements indicate that there can be from 1 to N parallel instances, where N is restricted by the mapping strategy and compiler backend, as well as by the scheduling strategy. The exact number is decided by the scheduler/dispatcher.

6. CONCLUSION

Existing software cannot directly benefit from the increased computational power available in multi-core systems. In this work we have presented two techniques for dealing with legacy code and adapting it gradually for better multi-core compatibility, by gaining more control over scheduling.

[1] Herb Sutter and James Larus, "Software and the Concurrency Revolution," Queue, vol. 3, pp. 54–62, 2005.

[2] International Organization for Standardization, ISO/IEC 14496-2:1999: Information technology — Coding of audio-visual objects — Part 2: Visual.

[3] F. Jokhio et al., "Analysis of an RVC-CAL MPEG-4 Simple Profile Decoder," Tech. Rep. 1018, TUCS, 2011.

[4] A. Dahlin et al., "The Canals Language and its Compiler," in SCOPES'09, 2009, pp. 43–52, ACM.

[5] J. Eker and J. Janneck, "CAL Language Report," Tech. Rep. ERL Technical Memo UCB/ERL M03/48, University of California at Berkeley, 2003.

[6] M. Mattavelli, I. Amer, and M. Raulet, "The Reconfigurable Video Coding Standard," Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 159–167, 2010.

[7] ISO/IEC CD 23002-4:2008, "MPEG Video Technologies - Part 4: Video Tool Library."

[8] A. Munshi, "OpenCL - Introduction and Overview," January 2011, http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf.

[9] G. Roquier et al., "Automatic Software Synthesis of Dataflow Program: An MPEG-4 Simple Profile Decoder Case Study," in SiPS'08, pp. 281–286.

[10] J. Ersfolk et al., "Scheduling of Dynamic Dataflow Programs with Model Checking," in SiPS'11, 2011.

[11] A. Lokhmotov, "Mobile and Embedded Computing on Mali GPUs Presentation," 2010.

RANGE-FREE ALGORITHM FOR ENERGY-EFFICIENT INDOOR LOCALIZATION IN WIRELESS SENSOR NETWORKS

Ville Kaseva, Timo D. Hämäläinen, Marko Hännikäinen
Tampere University of Technology, Department of Computer Systems, P.O. Box 553, FIN-33101 Tampere, Finland
{ville.a.kaseva, timo.d.hamalainen, marko.hannikainen}@tut.fi

ABSTRACT

Wireless Sensor Networks (WSNs) form an attractive technology for ubiquitous indoor localization. The localized node lifetime is maximized by using energy-efficient radios and minimizing their active time. However, most low-cost and low-power radios do not include the Received Signal Strength Indicator (RSSI) functionality commonly used for RF-based localization. In this paper, we present a range-free localization algorithm for localized nodes with minimized radio communication and radios without RSSI. The low complexity of the algorithm enables implementation in resource-constrained hardware for in-network localization. We evaluated the algorithm using a real WSN implementation. In room-level localization, the area was resolved correctly 96% of the time. The maximum point-based error was 8.70 m. The corresponding values for sub-room-level localization are 100% and 4.20 m. The prototype implementation consumed 1900 B of program memory. The data memory consumption varied from 18 B to 180 B, and the power consumption from 345 𝜇W to 2.48 mW depending on the amount of localization data.

Index Terms— Wireless Sensor Networks, Localization, Energy-efficiency

1. INTRODUCTION

Wireless Sensor Networks (WSNs) consist of densely deployed, independent, and collaborating low-cost sensor nodes which are highly resource-constrained in terms of energy, processing, and data storage capacity [1]. The nodes can sense their environment, process data, and communicate over multiple short-distance wireless hops. The network self-organizes and implements its functionality by co-operative effort. WSN nodes must operate for years with small batteries or by harvesting their energy from the environment. The long lifetime and small size of the devices, easy installation, and maintenance-free operation make WSNs an attractive technology for ubiquitous indoor localization. Due to the very large number of nodes, frequent battery replacements and manual network configuration are incon-

venient or even impossible. Thus, the networks must be self-configuring and self-healing, and the protocols used in WSNs must be highly energy-efficient. The contradicting requirements of long lifetime and small batteries result in a very scarce energy budget. The power consumption of a WSN node can be minimized on two levels: by minimizing the hardware power consumption and by minimizing the hardware active time at the protocol level. Commonly, a radio transceiver is used for wireless communication between WSN nodes. Using the radio transceiver also for localization is a cost-efficient choice due to its inherent existence in the nodes. This minimizes hardware power consumption, size, and cost since no extra components are needed. Typically, RF measurements can be performed by measuring signal strengths from the transmissions of neighbors. However, most low-cost and low-power radio transceivers do not include such a possibility [2]. Instead, variable transmission powers are used to infer the path loss between the localized nodes and the anchor nodes, whose locations are known a priori. A radio transceiver is the most power-consuming component in a WSN node [3]. Thus, the most energy-efficient operation is achieved by minimizing radio communication. Commonly, a WSN is deployed to perform a specific task or a small set of tasks. This makes application-specific optimization using application-dependent node platforms, communication protocols, and in-network processing possible [4]. With localized nodes, minimal power consumption can be achieved by a simple blinker operation in which the nodes broadcast periodic beacon packets that the anchor nodes listen to. The rest of the time is spent in low-power sleep mode. For in-network localization, a short downlink slot following the beacons is also needed so that the anchor nodes can communicate the RF measurements and their locations back to the localized nodes. Fig. 1 illustrates a multi-hop localization network topology with localized blinker nodes. Also, the localization implementation possibilities, centralized or distributed in-network localization, are presented. In the centralized approach, a server calculates the localized node locations. Also, the


Fig. 1: Localization network topology, node functionality, and localization algorithm implementation possibilities for centralized and in-network localization.

anchor node locations are stored on the server. The localized nodes transmit beacons and the anchor nodes route received data to a sink node which in turn communicates the data to the server. User interfaces with a server connection can visualize the data. In distributed in-network localization, the anchor nodes know their own locations. These and the RF measurements are transmitted in the localized node downlink slot. Using this information, the localized node can resolve its own location. In this paper, we present a range-free localization algorithm for nodes using unreliable beacon broadcasts and radio transceivers without Received Signal Strength Indicator (RSSI) support. The proposed algorithm enables reliable localization with highly resource-constrained hardware and simple link-level operation that minimizes localized node power consumption and maximizes lifetime with small batteries. No calibration measurements are required making localization network deployment easy and cost-effective. The low-complexity of the algorithm enables implementation in resource-constrained hardware for in-network localization. The proposed algorithm localizes nodes to the closest anchor node. If multiple anchor nodes are closest, a centroid of these nodes is used. Anchor node proximity is inferred from the reception of beacons transmitted using varying power levels. The minimum transmission power beacon packets heard by the anchor nodes are used to choose the closest anchor node. Thus, the algorithm supports various types of devices with varying radio ranges and transmission power level amounts. Furthermore, the localization accuracy can be chosen according to application needs by differentiating the anchor node density and the transmission ranges of the localized nodes. Moving averages of the RF measurements are used to mitigate the effect of lost beacon packets and transient variation in the measurements. For experimenting the performance of the proposed algorithm we used real hardware prototypes and a multi-hop WSN

communication stack. The localization accuracy experiments consisted of Room and Sub-room localization scenarios including 14 and five anchor nodes, respectively. The scenarios demonstrate the algorithm's reliability and scalability to variable localization accuracies. For flexible experimenting, a centralized implementation was used in the accuracy experiments. The proposed algorithm was also implemented on the presented hardware platform to give results on its resource consumption in in-network localization. The rest of the paper is organized as follows. Section 2 presents related research in RF-based indoor localization using short-range wireless networks. Section 3 introduces the proposed localization algorithm design. The WSN architecture and node hardware used for the experiments are presented in Section 4. Sections 5 and 6 present the experiments, accuracy results, and the resource consumption figures. Finally, Section 7 concludes the paper.

2. RELATED RESEARCH

RF-based localization methods can be categorized into range-based, scene analysis, and proximity-based methods [5]. Range-based methods rely on estimating distances between localized nodes and anchor nodes. Scene analysis consists of an off-line learning phase for creating a reference measurement database and an online localization phase. Proximity-based methods estimate locations from connectivity information. The distance estimation process of range-based localization methods is called ranging. Received Signal Strength (RSS) is a common RF-based ranging technique. Distances estimated using RSS can have large errors due to multipath signals and shadowing caused by obstructions [6, 7]. Also, the correctness of the ranging estimates varies depending on the environment and device variability. Calibration measurements can be used to mitigate this problem, but this requires calibration in every environment the system is used in and for every device used for localization. The scene analysis off-line phase includes recording RSS values to different anchor nodes as a function of the user's location. The recorded RSS values and the known locations of the anchor nodes are used either to construct an RF fingerprint database [8–10] or a probabilistic radio map [11–15]. In the online phase the localized nodes measure RSS values to different anchor nodes. With RF fingerprinting, the unknown location is determined by finding the fingerprint in the database that is closest to the one currently measured. The unknown location is then estimated to be the one paired with the closest reference fingerprint, or the (weighted) centroid of the 𝑘 nearest reference fingerprints. Location estimation using a probabilistic radio map includes finding the point in the map that maximizes the location probability. The applicability and scalability of scene analysis approaches are reduced by the time-consuming collection and maintenance of the RSS sample database. When the en-

vironment changes, the calibration data needs to be collected again. Proximity-based approaches [9, 16–20] estimate locations from connectivity information. In WLANs, mobile devices are typically connected to the Access Point (AP) they are closest to. In the strongest base station method [9], the location of the localized node is estimated to be the same as the location of the AP it is connected to. In [16–20], the unknown location is estimated using connectivity information to several anchor nodes. In [16], nodes are localized to the centroid of the heard anchor nodes. [17] uses the Approximate Point-In-Triangulation (APIT) method to form triangles between different sets of three anchor nodes. The unknown location is then resolved to be in the center of gravity of the intersecting triangles in which the localized node resides. In [18], Weighted Centroid Localization (WCL) is introduced. It extends centroid localization with weights which emphasize anchor nodes closer to the localized node. The authors present a real implementation with weights obtained from the radio's Link Quality Indicator (LQI) using RSSI. [19] proposes adaptive WCL (AWCL), which extends the WCL method with LQI error compensation and emphasizes the differences between the LQIs instead of the nominal values. In [20], simple constraints (i.e., whether an anchor node can or cannot detect a localized node) are used for localization. The algorithm proposed in this paper extends the combination of the strongest base station and centroid methods to radio hardware with no RSSI. Also, unlike the related methods, the proposed algorithm takes unreliable data collection into account.


Fig. 2: Localization algorithm operation principle: Anchor Node 1 can receive both beacons with power levels 0 and 1 but Anchor Node 2 can hear only beacons transmitted with power 1. Thus, Anchor Node 1 is chosen as the closest and its location as the resolved location for the localized node.


Fig. 3: Example of an anchor node information buffer for a localized node. Each anchor node information unit includes the anchor node ID, the anchor node location, and a buffer for RF measurement samples heard from the localized node. Anchor nodes 1 and 2 have RF measurement samples inside the time window [𝑡𝑟𝑒𝑓 − 𝑇𝑎, 𝑡𝑟𝑒𝑓]. Their RF measurement samples with time values 𝑡𝑖,1 and 𝑡𝑖,2 are inside the time window [𝑡𝑟𝑒𝑓 − 𝑇𝑝, 𝑡𝑟𝑒𝑓]. Thus, the data inside the thick-outlined box is used for location calculation. Old data outside the thick-outlined box (dashed boxes) can be removed from the buffers.

3. LOCALIZATION ALGORITHM DESIGN The proposed localization algorithm uses varying transmission power levels to find closest anchor node neighbors. Reliability is increased by averaging the RF measurement samples over time.

3.1. Location Resolution using Varying Transmission Powers and Closest Anchor Nodes

The localized node transmits periodic beacon packets with varying transmission powers, one set of packets per access cycle. The access cycle length defines the location refresh rate for the localized node in question. The anchor node that can receive beacons with the smallest transmission power is chosen as the closest to the localized node. The location of the closest anchor node is used as the location of the localized node. If multiple anchor nodes are closest, their centroid is used as the resolved location. The centroid location coordinate

(𝐿𝑐) can be calculated as

    𝐿𝑐 = (𝐿1 + ... + 𝐿𝑛) / 𝑛,                (1)

where 𝐿𝑖 is the location coordinate for Anchor Node 𝑖 and 𝑛 is the number of closest anchor nodes. An example of the algorithm operation principle is presented in Fig. 2. The amount of used transmission powers in the example is two, indexed 0 and 1. Anchor Node 1 can receive both beacons but Anchor Node 2 can hear only beacons transmitted with power 1. Thus, Anchor Node 1 is chosen as the closest and its location as the resolved location for the localized node. The possible downlink slot is not illustrated in the figure. 3.2. Averaging using Time Windows RF measurement samples can be lost for example due to collisions of neighboring localized nodes’ packets or if the anchor nodes are busy forwarding the location data. Thus, a

new sample is not necessarily received every access cycle for every anchor node that can hear a localized node. This can produce variations and jitter in the location estimate, as the wrong anchor node may be chosen as the closest one due to missing RF measurement samples. Also, RF measurements can vary due to multipath and interference. Thus, filtering is needed to give more reliable location estimates. In the proposed algorithm, filtering is done using a two-dimensional buffer, two time windows, and moving averages. A buffer of anchor nodes that can hear a specific localized node is maintained for each localized node. One information unit in the buffer includes an anchor node ID, the anchor node's location coordinate, and a buffer for RF measurement samples heard by the anchor node. A moving average of the latest RF measurement samples for each anchor node is used to determine the closest anchor nodes. The time of the latest RF measurement sample measured from the localized node is considered as the reference time (𝑡𝑟𝑒𝑓). The notation used for an RF measurement sample time value is 𝑡𝑖,𝑗, where 𝑖 is the anchor node ID and 𝑗 is the RF measurement sample index (the larger the index 𝑗, the older the sample). An anchor node 𝑖 is maintained in the buffer if its newest RF measurement sample with time value 𝑡𝑖,1 is inside the time window [𝑡𝑟𝑒𝑓 − 𝑇𝑎, 𝑡𝑟𝑒𝑓], that is if 𝑡𝑟𝑒𝑓 − 𝑡𝑖,1 ≤ 𝑇𝑎. Another time window, [𝑡𝑟𝑒𝑓 − 𝑇𝑝, 𝑡𝑟𝑒𝑓], is used to determine whether RF measurement samples should be kept in the RF measurement sample buffer of an anchor node. An RF measurement sample 𝑗 received from anchor node 𝑖 at time 𝑡𝑖,𝑗 is kept in the buffer if 𝑡𝑟𝑒𝑓 − 𝑡𝑖,𝑗 ≤ 𝑇𝑝. 𝑇𝑝 should be larger than 𝑇𝑎. This gives robustness over lost samples and sample value variance whilst not compromising reactiveness during movement. Anchor nodes outside radio range can be removed more quickly using 𝑇𝑎, while more RF measurement samples from the anchor nodes inside radio range are used in the average calculation using 𝑇𝑝. An example of the anchor node information buffer is presented in Fig. 3. The reference time is 𝑡1,1. Anchor nodes and RF measurement samples inside the thick-outlined box (𝑖 ∈ 1..2 and 𝑗 ∈ 1..2) are inside the time windows and are used in the location calculation. The rest of the data is old and can be removed from the buffers. Thus, for Anchor Nodes 1 and 2, average minimum transmission powers are calculated from the RF measurement samples with time values 𝑡𝑖,1 and 𝑡𝑖,2, and these average values are used to choose the closest anchor node.
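The following C++ sketch summarizes the selection step of Sections 3.1 and 3.2: for each anchor node, the minimum heard transmission powers of the samples inside the [𝑡𝑟𝑒𝑓 − 𝑇𝑝, 𝑡𝑟𝑒𝑓] window are averaged, the anchors with the smallest average are selected, and their centroid is returned as in Eq. (1). Buffer maintenance with the 𝑇𝑎 window is omitted, and the data structures are illustrative rather than the prototype's actual implementation.

#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Sample   { double time; int minTxPower; };   // minimum power heard in one access cycle
struct Anchor   { double x, y; std::vector<Sample> samples; };
struct Location { double x, y; };

// Average the minimum heard TX power per anchor over the samples inside
// [t_ref - Tp, t_ref], then return the centroid of the anchors with the
// smallest average (Eq. (1)).
Location resolve(const std::vector<Anchor>& anchors, double tRef, double Tp) {
    if (anchors.empty()) return {0.0, 0.0};
    std::vector<double> avg(anchors.size(), std::numeric_limits<double>::infinity());
    for (size_t i = 0; i < anchors.size(); ++i) {
        double sum = 0.0; int n = 0;
        for (const Sample& s : anchors[i].samples)
            if (tRef - s.time <= Tp) { sum += s.minTxPower; ++n; }
        if (n > 0) avg[i] = sum / n;
    }
    double best = *std::min_element(avg.begin(), avg.end());
    Location c = {0.0, 0.0};
    int closest = 0;
    for (size_t i = 0; i < anchors.size(); ++i)
        if (avg[i] == best) { c.x += anchors[i].x; c.y += anchors[i].y; ++closest; }
    c.x /= closest; c.y /= closest;
    return c;
}

int main() {
    std::vector<Anchor> anchors = {
        {0.0, 0.0, {{9.0, 0}, {10.0, 1}}},   // heard already at the lowest power
        {4.0, 3.0, {{10.0, 3}}},             // heard only at the highest power
    };
    Location l = resolve(anchors, 10.0, 24.0);
    std::printf("resolved location: (%.2f, %.2f)\n", l.x, l.y);
}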

4. IMPLEMENTATION FOR EXPERIMENTS

For the experiments, we used hardware prototypes with a low-cost, low-power 2.4 GHz radio and a multi-hop WSN communication stack.


Fig. 4: Node hardware platform circuit board and components.

4.1. Network Architecture

The prototype network consists of sink nodes, router nodes and mobile nodes. The sink nodes act as data endpoints for the WSN and as gateways to other networks, for example the Internet. The router nodes forward data via a wireless multi-hop network to one or multiple sinks. To achieve low delays, the router nodes listen for all of their free time. Thus, they consume more power and should be mains-powered or equipped with large enough batteries. The mobile nodes have low duty cycles which can be configured according to application needs. They broadcast their data periodically to the router nodes. The broadcasts are randomized to avoid collisions. The energy-efficient operation of the mobile nodes allows them to operate with small batteries whilst still achieving a lifetime in the order of years. The sink and router nodes act as anchor nodes, and the mobile nodes are localized in the anchor network area.

4.2. Node Hardware Platform

The node hardware platform is presented in Fig. 4. The platform uses a Microchip PIC18F8722 MicroController Unit (MCU), which integrates an 8-bit processor core with 128 kB of FLASH program memory, 4 kB of RAM data memory, and 1 kB of EEPROM. The used clock speed of the MCU is 8 MHz, resulting in a performance of 2 MIPS. For wireless communication the platform uses a Nordic Semiconductor nRF24L01 radio transceiver operating in the 2.4 GHz ISM frequency band. The radio data rate is 1 Mbps and there are 80 available frequency channels. The transmission power level is selectable from four levels between -18 dBm and 0 dBm in 6 dBm steps with ±4 dBm accuracy. A loop-type antenna is implemented as a trace on the Printed Circuit Board (PCB). The user interface is implemented with a push button and LEDs. For the experiments, the localized node transmission range was reduced using a resistor connected between the loop antenna terminals. This reduces the power radiated through the antenna. With the smallest transmission power the range was reduced to a few meters. With the reduced


Fig. 5: Localized node in enclosure.

Fig. 7: The Sub-room scenario consisted of five workstations each including an anchor node. Localization accuracy was experimented at each of the workstations. The localization area size was 8.5 m2 with diagonal length of 4.20 m.


Fig. 6: The Room scenario included anchor nodes in seven rooms and in the hallway. Localization was experimented in the rooms. The size of the smallest rooms was 14 m2 with diagonal length of 5.30 m, the largest room was 72.45 m2 with diagonal length of 12.90 m, and the average room size was 37.75 m2 with average diagonal length of 8.70 m.

transmission ranges, localization can be achieved at room- to sub-room-level granularities by changing the anchor node density, as demonstrated in the following experiments.

5. LOCALIZATION ACCURACY EXPERIMENTS AND RESULTS

The localization accuracy was evaluated in Room and Sub-room scenarios. In both scenarios the floorplan was divided into localization areas. A person with a localized node worked in each test location for 30 minutes. The localized node (Fig. 5) was attached to the person's chest with a clip. The resolved locations were recorded continuously. From this data, the time the area was resolved correctly was calculated. For each test location, the fraction of time the location of the localized node was resolved to the correct area and the area size are reported. This shows the algorithm's stability over transient changes in the RF measurement samples, which change as the person is working and the localized node's orientation varies accordingly. In the Room scenario (Fig. 6), one anchor node was deployed in every room used in the experiments and each of

Fig. 8: Example of an anchor node deployed at a workstation in the Sub-room scenario.

the rooms constituted a localization area. Anchor nodes were also deployed to the hallway to ensure the anchor node network connectivity. The scenario included a total of seven test locations in different rooms. The size of the smallest room was 14 m2 with diagonal length of 5.30 m, the largest room was 72.45 m2 with diagonal length of 12.90 m, and the average room size was 37.75 m2 with average diagonal length of 8.70 m. The total anchor node amount in the rooms was eight and sum of all room sizes was 264.2 m2 giving a network density of 0.03 anchor nodes per m2 . The Sub-room scenario, depicted in Fig. 7, consisted of a 640 cm by 800 cm open office space with five workstations. An anchor node was installed at every workstation and the space was divided into equivalent sized localization areas. One localization area size was 8.50 m2 with diagonal length of 4.20 m. The localization was experimented at each of the five workstations. Fig. 8 illustrates an example of an anchor node deployed at a workstation. The total anchor node amount in the localization areas was five and sum of all localization area sizes was 42.5 m2 giving a network density of 0.12 anchor nodes per m2 . The localized node was configured to send four beacon packets with varying transmission power levels on average every five seconds (randomized between 4.5 and 5.5 seconds). The localization algorithm time window 𝑇𝑎 was set to 12 seconds which equals two times the worst case localized node access cycle length and some margin for the packet forward-


Fig. 9: Examples of the proposed localization algorithm implementation data memory consumption for variable amount of anchor nodes in the localized node radio range as function of RF measurement samples in the buffers.

ing delay in the network. The time window 𝑇𝑝 was set to 24 seconds (2 times the length of 𝑇𝑎 ). Localization algorithms can return locations as absolute points in the used coordinate system or as areas [12]. For the point-based evaluation, accuracy value indicates localization granularity. In our results, accuracy is given as the average localization area diagonal length since it presents the maximum point-based location error. The precision indicates the percentage a given accuracy is reached. This is given as the percentage the localized node was localized to the correct area. The precision value also gives the area-based accuracy of the proposed algorithm. The accuracy and precision results are presented in Table 1. In the Room scenario, the localized node was localized to the correct room 96% of the time corresponding to the point-based precision and area-based accuracy. The average room diagonal length was 8.70 m giving the point-based accuracy for this scenario. The corresponding values for the Sub-room scenario are point-based precision and area-based accuracy of 100%, and point-based accuracy of 4.20 m. The Sub-room scenario included four times more anchor nodes per 𝑚2 than the Room scenario.


Fig. 10: Examples of the proposed localization algorithm implementation power consumption for variable amount of anchor nodes in the localized node radio range as function of RF measurement samples in the buffers. The results are given for 5 s location refresh rate (access cycle).

chor nodes in the localized node radio range and the amount of RF measurement samples in the buffers. Fig. 9 presents examples of the data memory consumption as a function of the RF measurement sample amount for a variable number of anchor nodes. With two anchor nodes the data memory consumption varies from 18 B to 36 B (0.44% to 0.89% of the available data memory) when the sample amount is varied from one to ten. The corresponding values for five and ten anchor nodes are 45 B to 90 B (1.10% to 2.20%) and 90 B to 180 B (2.20% to 4.40%), respectively. The required amount of processing, and thus the power consumption, also depends on the number of anchor nodes in radio range and the number of RF measurement samples in the buffers. Fig. 10 presents examples of the localization algorithm implementation power consumption as a function of the RF measurement sample amount for a variable number of anchor nodes. The used location refresh rate (access cycle) was five seconds. With two anchor nodes the power consumption varies from 345 𝜇W to 689 𝜇W when the sample amount is varied from one to ten. The corresponding values for five and ten anchor nodes are 576 𝜇W to 1.36 mW and 961 𝜇W to 2.48 mW, respectively.
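The reported figures scale as roughly 8 B of fixed data plus 1 B per buffered RF measurement sample for each anchor node entry; for example, two anchor nodes with ten samples each give 2 × (8 B + 10 × 1 B) = 36 B. A hypothetical packed record consistent with this scaling is sketched below; the actual layout used in the prototype is not described in the paper.

#include <cstdio>
#include <cstdint>

// Hypothetical per-anchor record matching the reported scaling: about 8 B of
// fixed data plus 1 B per buffered RF measurement sample.  The real layout of
// the prototype is not given in the paper.
#pragma pack(push, 1)
struct AnchorEntry {
    uint16_t anchorId;      // 2 B
    int16_t  x, y;          // 4 B location coordinate (e.g. in centimetres)
    uint8_t  sampleCount;   // 1 B
    uint8_t  newestAge;     // 1 B, age of the newest sample (for the Ta window)
    uint8_t  samples[1];    // 1 B per sample: minimum heard TX power level
};
#pragma pack(pop)

int main() {
    const int anchorCounts[] = {2, 5, 10};
    const int sampleCounts[] = {1, 10};
    for (int a : anchorCounts)
        for (int s : sampleCounts)
            std::printf("%2d anchors, %2d samples: %3d B\n", a, s, a * (8 + s));
}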

6. IN-NETWORK LOCALIZATION AND RESOURCE CONSUMPTION

7. CONCLUSIONS

The proposed localization algorithm was implemented on the presented prototype hardware platform to demonstrate the algorithm suitability for in-network localization. The algorithm implementation consumes 1900 B of program memory which is 1.45% of the total program memory available in the hardware platform. The amount of used data memory varies according to an-

In this paper, we presented a range-free localization algorithm for localized nodes using unreliable beacon broadcasts and radio transceivers without RSSI support. The proposed algorithm enables reliable localization with highly resourceconstrained hardware and simple link-level operation that minimizes localized node power consumption and maximizes lifetime. No calibration measurements are required making

Localization accuracy and precision

                     Point-based accuracy         Point-based precision and area-based       Anchor node
                     (Average diagonal length     accuracy (Percentage of time               density
                     of localization area)        localized in correct area)
Room scenario        8.70 m                       96%                                        0.03 anchor nodes/m2
Sub-room scenario    4.20 m                       100%                                       0.12 anchor nodes/m2

Table 1: Localization accuracy and precision results.

localization network deployment easy and cost-effective. The low complexity of the algorithm enables implementation in resource-constrained hardware for in-network localization. The localization algorithm performance was evaluated in an office environment using a real WSN implementation consisting of resource-constrained nodes. In room-level localization, the correct area was resolved 96% of the time. The maximum point-based error was 8.70 m. The corresponding values for sub-room-level localization are 100% and 4.20 m. The prototype implementation consumed 1900 B of program memory. The data memory consumption varied from 18 B to 180 B, and the power consumption from 345 𝜇W to 2.48 mW, depending on the amount of anchor nodes in the localized node radio range and the amount of collected RF measurement samples. The experiments demonstrate the proposed algorithm's reliability and scalability to variable localization accuracies. The localization accuracy can be chosen according to application needs by differentiating the anchor node density and the transmission ranges of the localized nodes. The prototype implementation shows that the algorithm uses a minimal amount of resources. Despite the averaging, the localized node location demonstrated some jitter. Our future work includes reducing this jitter by filtering the resolved location coordinates and by also using previous locations in the localization process.

8. REFERENCES

[1] I.F. Akyildiz, Weilian Su, Y. Sankarasubramaniam, and E. Cayirci, "A survey on sensor networks," Communications Magazine, IEEE, vol. 40, no. 8, pp. 102–114, 2002.

[2] M. Kohvakka, J. Suhonen, M. Hannikainen, and T. D. Hamalainen, "Transmission power based path loss metering for wireless sensor networks," in Personal, Indoor and Mobile Radio Communications, 2006 IEEE 17th International Symposium on, 2006, pp. 1–5.

[3] M. Kohvakka, J. Suhonen, M. Kuorilehto, V. A. Kaseva, M. Hannikainen, and T. D. Hamalainen, "Energy-

efficient neighbor discovery protocol for mobile wireless sensor networks,” Elsevier Ad Hoc Networks, 2008. [4] John A. Stankovic, Tarek F. Abdelzaher, Chenyang Lu, Lui Sha, and Jennifer C. Hou, “Real-time communication and coordination in embedded sensor networks,” Proceedings of the IEEE, vol. 91, no. 7, pp. 1002–1022, July 2003. [5] J. Hightower and G. Borriello, “Location systems for ubiquitous computing,” Computer, vol. 34, no. 8, pp. 57 –66, Aug. 2001. [6] N. Patwari, J. N. Ash, S. Kyperountas, A. O. Hero III, R. L. Moses, and N. S. Correal, “Locating the nodes: cooperative localization in wireless sensor networks,” Signal Processing Magazine, IEEE, vol. 22, no. 4, pp. 54–69, 2005. [7] C. Wang and L. Xiao, “Sensor localization under limited measurement capabilities,” Network, IEEE, vol. 21, no. 3, pp. 16–23, 2007. [8] P. Bahl and V. N. Padmanabhan, “RADAR: an inbuilding RF-based user location and tracking system,” in INFOCOM 2000 Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, 2000, vol. 2, pp. 775–784 vol.2, IS:. [9] A. Smailagic and D. Kogan, “Location sensing and privacy in a context-aware computing environment,” Wireless Communications, IEEE [see also IEEE Personal Communications], vol. 9, no. 5, pp. 10–17, 2002. [10] K. Lorincz and M. Welsh, “MoteTrack: A robust, decentralized approach to RF-based location tracking,” in In Proceedings of the International Workshop on Location- and Context-Awareness (LoCA 2005) at Pervasive 2005, Oberpfaffenhofen, Germany, May 2005. [11] E. Elnahrawy, Xiaoyan Li, and R. P. Martin, “Using area-based presentations and metrics for localization systems in wireless LANs,” in Local Computer Networks, 2004. 29th Annual IEEE International Conference on, 2004, pp. 650–657.

[12] E. Elnahrawy, X. Li, and R. P. Martin, “The limits of localization using signal strength: a comparative study,” in Sensor and Ad Hoc Communications and Networks, 2004. IEEE SECON 2004. 2004 First Annual IEEE Communications Society Conference on, 2004, pp. 406– 414.

[17] Tian He, Chengdu Huang, Brian M. Blum, John A. Stankovic, and Tarek Abdelzaher, “Range-free localization schemes for large scale sensor networks,” in MobiCom ’03: Proceedings of the 9th annual international conference on Mobile computing and networking, New York, NY, USA, 2003, pp. 81–95, ACM Press.

[13] M. A. Youssef, A. Agrawala, and A. Udaya Shankar, “WLAN location determination via clustering and probability distributions,” in Pervasive Computing and Communications, 2003. (PerCom 2003). Proceedings of the First IEEE International Conference on, 2003, pp. 143– 150.

[18] J. Blumenthal, R. Grossmann, F. Golatowski, and D. Timmermann, “Weighted centroid localization in zigbee-based sensor networks,” in Intelligent Signal Processing, 2007. WISP 2007. IEEE International Symposium on, 2007, pp. 1 –6.

[14] C. Alippi, A. Mottarella, and G. Vanini, “A RF mapbased localization algorithm for indoor environments,” in Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, 2005, pp. 652–655 Vol. 1.

[19] R. Behnke and D. Timmermann, “Awcl: Adaptive weighted centroid localization as an efficient improvement of coarse grained localization,” in Positioning, Navigation and Communication, 2008. WPNC 2008. 5th Workshop on, 2008, pp. 243 –250.

[15] T. Roos, P. Myllymäki, H. Tirri, P. Misikangas, and J. Sievänen, “A probabilistic approach to WLAN user location estimation,” International Journal of Wireless Information Networks, vol. 9, no. 3, July 2002. [16] N. Bulusu, J. Heidemann, and D. Estrin, “GPS-less lowcost outdoor localization for very small devices,” Personal Communications, IEEE [see also IEEE Wireless Communications], vol. 7, no. 5, pp. 28–34, 2000.

[20] M. Bouet and G. Pujolle, “A range-free 3-d localization method for rfid tags based on virtual landmarks,” in Personal, Indoor and Mobile Radio Communications, 2008. PIMRC 2008. IEEE 19th International Symposium on, sept. 2008, pp. 1 –5.

Application workload model generation methodologies for system-level design exploration Jukka Saastamoinen

Jari Kreku

Communication Platforms VTT Technical Research Centre of Finland Kaitoväylä 1, FI-90571 Oulu, Finland Email:[email protected]

Communication Platforms VTT Technical Research Centre of Finland Kaitoväylä 1, FI-90571 Oulu, Finland Email:[email protected]

Abstract—As most of the applications of embedded system products are realized in software, the performance estimation of software is crucial for successful system design. A significant part of the functionality of these applications is based on services provided by the underlying software libraries. A commonly used performance evaluation technique today is the system-level performance simulation of the applications and platforms using abstracted workload and execution platform models. The accuracy of the software performance results depends on how closely the application workload model reflects the actual software as a whole. This paper presents a methodology which combines compiler-based user code workload model generation with workload extraction for pre-compiled libraries, while exploiting an overall approach and execution platform model developed previously. The benefit of the proposed methodology compared to the earlier solution is evaluated using a set of benchmarks.

Keywords—workload; design; exploration

I. INTRODUCTION

Digital convergence is leading to increasing integration of different technologies into one multi-function device. Many-core architectures, nowadays common in general-purpose processors, are being introduced into these devices. As a consequence of this development, system complexity will increase by orders of magnitude in the near future. Therefore software and hardware designers will have a growing number of design alternatives, and it will be important to have systematic approaches for design space exploration. Erroneous design decisions must be found as early as possible to avoid costly redesign rounds late in the development process. Performance evaluation by system-level co-simulation at a high abstraction level has been a widely proposed methodology for this purpose. Typically this kind of design flow follows the Y-chart process. After specification of the target system, the application workload and the execution platform are modeled separately. Next, the application model is mapped onto the execution platform model to obtain a system-level model for design space exploration. Repeated design modification steps are taken until the system requirements are met. Then, the iterative process of refining the design towards greater accuracy and level of detail is started to produce the final product.

In modern many-core embedded systems the impact of software on the overall system performance is crucial. Therefore, the accuracy of the application workload model is essential in the whole product development process, as erroneous design decisions can cause redesign rounds and increased time-to-market. The main contribution of this paper is to propose improved automatic workload model generation for system-level design exploration. It is based on ABSINTH [1] but overcomes many of its limitations. It consists of a new version of ABSINTH, called ABSINTH2, which improves the accuracy of the application model, and a new tool called SAKE (abStract externAl library worKload Extractor) that can improve the workload accuracy even further. The approach is evaluated with case studies. The rest of the paper is structured as follows: Chapter II introduces related work, Chapter III describes the ABSINTH2 workload model generation flow, Chapter IV presents case studies and Chapter V draws conclusions.

II. RELATED WORK

Many approaches for high-level application workload modeling in Y-chart based design exploration concepts have been proposed. Metropolis [2] introduced a concept called "metamodel" that can capture both the functional and the architecture description as well as map them together. The functional description consists of processes that act concurrently and communicate with one another. The communication rules and process execution constraints are defined separately. The downside of this methodology is the lack of code equivalence, as the code executed by the model is different from the code executed by the real hardware. SPADE [3] and Sesame [4] are similar to Metropolis in this sense. Our solution differs from all of the above-mentioned approaches in that the same source code used for the application model can also be used in the product software. In practice, the majority of product development projects rely on software reuse. The ability to utilize legacy software, as well as newly developed code, throughout the design process is important in cutting design cycle time and cost. Application workload model generation based on a modified GCC compiler has been presented earlier in [1]. Here we give just an overview of the approach. ABSINTH (ABStract

INstruction exTraction Helper) is a tool that generates workload models from application source code. It has been implemented by extending the GNU Compiler Collection (GCC) with two additional passes. The first pass constructs the control flow between basic blocks in each source code function. The second pass traverses RTL (GCC's low-level intermediate language) to extract read, write and execute primitives for each block. Profiling information gathered during model generation is used for the probabilities of branches, which are modeled statistically. Workload models are generated at a late phase of compilation, after most optimization passes. Real applications are usually compiled with some degree of compiler optimization; therefore ABSINTH allows workload model generation from the production version of the application source. However, a drawback of ABSINTH is its statistical modeling of branches. It is sensitive to the quality of the profiling data obtained from application execution. As explained in Chapter IV.D, it can lead to very inaccurate simulation results for applications with a considerable amount of control, unless the simulation is repeated multiple times. Still, the results are not unambiguous and the designer has to judge which of the simulation results represents the studied use case. Another drawback of ABSINTH is that it does not generate workloads for library functions automatically. As we will show in Chapter IV, depending on the application, this can lead to very inaccurate results.

III. ABSINTH2 FLOW

A. Simulation trace-based workload model

ABSINTH2 has been implemented as two additional compiler passes inside the GNU Compiler Collection (GCC) version 4.5.1 (Figure 1). It can be activated during the compilation of any source code supported by GCC.

Figure 1. ABSINTH2 flow: GCC compilation of the source code (C, C++, ...) through passes including pass_tree_profile, pass_absinth_trace, the optimisation passes such as pass_rtl_move_loop_invariants, pass_absinth_xml, pass_cleanup_cfg and pass_df_finish produces the binary; basic block tracing instrumentation and application execution produce the control trace (.gz), RTL expression tree traversal produces the load primitives (XML), and together these form the workload model.

ABSINTH2 uses a deterministic trace of basic block execution for modeling control. The first ABSINTH2 pass, pass_absinth_trace, instruments the beginning of each basic block in the application with a function call (Code 1) to a tracing library. As a result, the source filename, function name and a custom basic block ID are passed to the tracing library during the execution of the ABSINTH2-modified binary. The custom basic block ID is a unique number, which stays consistent between the two ABSINTH2 passes. The tracing library writes the source and function names as well as the basic block ID to a trace file, which is compressed with gzip on-the-fly to keep the size of the trace minimal. Multi-threaded applications are supported by writing a separate trace file for each thread in the application.

Absinth_trace_bb(char* src, char* fn, int bb)

Code 1. Function prototype for basic block tracing.

The second pass, pass_absinth_xml, traverses RTL (GCC's low-level intermediate language) to extract the load primitives read, write and execute for each basic block. The load primitives correspond to memory loads, memory stores, and data processing or control instructions, respectively. For each RTL instruction one execute primitive is generated in the model (Table I). Furthermore, it is evaluated whether the instruction results in memory being read or written, and a read or write primitive is generated respectively. Finally, pass_absinth_xml merges consecutive read, write or execute primitives into one (i.e. execute, execute, execute becomes execute 3) to reduce the size of the workload model and speed up the simulation. Optionally, the entire basic block can be coalesced into one read, one execute and one write primitive, in that order; the resulting model is further reduced in size, but at the cost of accuracy. All the primitives generated by the pass are written to an XML output file.

TABLE I. RTL EXPRESSION AND CORRESPONDING ACTION PERFORMED BY PASS_ABSINTH_BBS
RTL expression | Action
INSN, JUMP_INSN, CALL_INSN | Generate "execute" primitive
MEM | Generate "read" or "write" primitive depending on whether the particular MEM expression is the first or second operand of a SET expression
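The tracing library called by the instrumentation described above is not listed in the paper; the following is a minimal sketch of what such a library could look like, assuming zlib for the on-the-fly gzip compression and one lazily opened trace file per thread (the function name follows Code 1, everything else is illustrative):

#include <cstdio>
#include <pthread.h>
#include <zlib.h>

// Called at the start of every instrumented basic block: append
// "source-file function-name block-ID" to this thread's gzip trace.
extern "C" void Absinth_trace_bb(char* src, char* fn, int bb) {
    static thread_local gzFile trace = nullptr;   // one trace file per thread
    if (trace == nullptr) {
        char name[64];
        std::snprintf(name, sizeof(name), "absinth_trace_%lu.gz",
                      (unsigned long)pthread_self());
        trace = gzopen(name, "wb");               // compressed on-the-fly
    }
    if (trace != nullptr)
        gzprintf(trace, "%s %s %d\n", src, fn, bb);
}

Keeping this call small matters, since it runs once per executed basic block; anything heavier would distort the traced application.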

A special process workload model has been created for simulating ABSINTH2-generated workload models in ABSOLUT [7], since neither the control trace nor the load primitives are expressed in SystemC. Currently, the trace and XML files to be simulated are designated in the system configuration file, and the process model parses the trace and the primitives during simulation. This incurs a simulation time penalty, but the advantage is that the workload model can be replaced with another without recompilation of the system model.

B. Simulation trace-based workload model with external library function workloads

Modern software systems provide the majority of their services to applications through libraries. As such, a significant part of the application code can be provided in these pre-compiled libraries. The problem with the aforementioned application workload generation methods is that the library functions are not compiled at the same time as the application. Thus workloads for the library functions have to be generated

separately. If the libraries are not modeled, this can cause significant inaccuracy in the generated model, as we will show in Chapter IV. The user can of course separately compile the used libraries with ABSINTH and link their workload to the application workload model, but this requires additional work. Therefore, to improve the accuracy of the application workload models generated with ABSINTH2, library functions should also be taken into account. As the system libraries are pre-compiled and, in the case of a dynamic library, linked to the application at load-time or run-time, the workload extraction has to happen at run-time. Valgrind [5] is an Open Source Dynamic Binary Instrumentation (DBI) framework that can be used for this purpose. In addition to x86, AMD64 and PowerPC, it also supports many ARM Linux platforms, making it usable for embedded system software development. In this study we modified the callgrind_annotate tool (a tool for presenting the output of Callgrind, the call-graph generating cache and branch prediction profiler) of the framework to produce profiling reports that can be post-processed. We also developed a post-processing tool, SAKE, for external library function workload generation. These models can be used in performance simulation with the ABSOLUT approach (Fig. 2).

Figure 2. Workload extraction flow for ABSOLUT performance simulation with external library workload model generation included.

To produce the user code workload model, the source code is compiled and executed as described in Chapter III.A. For external library workload generation, the same source code is compiled, preferably with the same ABSINTH2-patched GNU Compiler Collection (GCC) compiler, this time without the switch (-fabsinth2) that enables user code workload extraction. The resulting binary is executed with Callgrind to create a call-graph with cache simulation (effectively Cachegrind [10]) enabled:

valgrind --tool=callgrind --simulate-cache=yes ./program

Enabling cache simulation is important for the workload generation as it makes Valgrind count not only the number of execute instructions but also the number of memory reads and writes. As a next step, the call-graph is transformed into a more readable format by running the modified callgrind_annotate tool:

callgrind_annotate --auto=yes --inclusive=yes > report.txt

Callgrind_annotate reads the profiling results and prints out the source code annotated with profiling results. More importantly in our case, this also applies to the calls to external library functions. In the "SAKE run" step, the external library function workloads are extracted from the report and converted into XML or C++ format for ABSOLUT simulation:

sake.py --format xml --include_rt_load no --output ext_lib_load.xml report.txt

SAKE is a Python script that reads the profiling report, picks up the calls to external library functions and writes the results to a workload model. The format of the output is XML by default, but the user can also choose C++ for backward compatibility with the ABSINTH flow. The workload model is constructed so that the location of each external library function call in the application code can be restored in the simulation phase. Let us assume the following example: during ABSINTH2 compilation the workload of the user code is printed out. The following excerpt depicts the structure of the generated XML workload model:

1 4 8 . . . puts
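For concreteness, a minimal example program of the kind assumed above (not reproduced in the paper), one whose printf() call GCC lowers to a puts() call of the stdio library when optimization is enabled, could be:

#include <cstdio>

int main() {
    // With optimization enabled, GCC typically replaces this printf call
    // with a call to puts(), which then appears as the external library
    // function in the generated workload model.
    std::printf("hello from the user code\n");
    return 0;
}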

Workloads of the basic blocks of the application are recorded into the XML file according to the code hierarchy. External library function calls are written to dedicated elements. In this example, the printf() function call in the main function translates into a puts() function call of the stdio library during compilation. After recompilation and execution of the source code, the call graph is generated with Callgrind and finally post-processed with SAKE to produce the external library function workload model. The format of the model follows the one used by ABSINTH2, with the difference of segregating external functions within each user code function:

1477 531 8 309 8

Both the user code workload and the external library function workloads are traversed during the ABSOLUT simulation. Workload primitives in each basic block are converted to execute, write or read primitive instructions, which in turn are linked to corresponding instructions on the execution platform. Whenever a call to an external library function is found in the user code workload model, the corresponding workload is read from the external library function workload model.

C. Simulation trace-based workload model with external library function workloads and run-time virtual address resolving workload

In the flow described in Chapter III.B we omitted the workload caused by the dynamic linking process itself by defining the option --include_rt_load no for SAKE. Let us assume that the operating system of the execution platform is based on Linux. Then we can further increase the accuracy of the application workload model by including also the run-time library linking process workload in our model. This option of SAKE exploits the ability of Callgrind to embed the cost of the runtime linking and loading service into the profiling report per function call. We show in Chapter IV what kind of improvement in the application workload model accuracy one can expect from this option.

D. Workload models of multi-threaded applications

The authors of [6] present a technique to model POSIX-threaded applications with the ABSOLUT performance simulation approach [7]. In this study it was extended to support ABSINTH2, where probabilistic branch selection in the control workload model is replaced by a deterministic control trace. The principle of modeling multi-threaded applications is the same as in the single-thread case. The ABSINTH2 compiler generates a common user code workload XML file and a control trace for each thread. Callgrind is used to dump call-graphs on a per-thread basis. The individual thread profiling results are then post-processed with SAKE to produce separate external library function workload XML files for each thread.

IV. CASE STUDIES

A. Reference platform and benchmarks

The configuration of our reference platform is detailed in Table II. Callgrind version 3.6.1 with cache simulation enabled was used for the measurements on the reference platform. We used three different benchmarks to measure the accuracy of the workload models. The Fibonacci number calculation and sparse matrix manipulation benchmarks are available as Open Source on the BenchIT website (www.benchit.org). The multi-threaded matrix multiplication example was written in-house for this purpose.

TABLE II. REFERENCE PLATFORM CONFIGURATION
Processor: Intel Core i7 950
Core arrangement: Four physical cores, four virtual cores
Core frequency: 1.6 GHz
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 8192 KB
Operating System: Ubuntu 10.04 LTS, kernel 2.6.32-29

B. Execution platform model

The execution platform model used in the case studies is shown in Fig. 3.

Figure 3. Intel Core i7-950 ABSOLUT model

It contains eight processor cores divided into four physical and four virtual cores. As the application workload models do not include accurate address information, the cache architecture is simplified: each core contains an independent L1/L2/L3 cache stack, in contrast to the real processor where the L3 cache is shared between all cores. Caches are assumed to be coherent. Access latencies to local caches and to caches on other cores are modelled in each core; latency values are obtained from [8]. The memory controller (IMC in Fig. 3) handles the accesses to external SDRAM memories. Processor performance is taken into account by defining the clock frequency of the cores, and the architecture efficiency of the platform is defined by the average cycles-per-instruction (CPI) value of each core. Simultaneous Multi-Threading (SMT) is modelled by defining the CPI values of the virtual cores to be twice as large [9] as the values of the physical cores. The platform model is created with SystemC, and the communication interfaces within the execution model are based on OCP TL2.

C. Fibonacci

The sequence Fn of Fibonacci numbers is defined by

F_n = F_{n-1} + F_{n-2}    (1)

where F_0 = 0 and F_1 = 1. The test case used in this study is based on the BenchITv6 "fibonacci" test kernel. The kernel contains two versions of the Fibonacci sequence calculation; for this study the non-recursive version was selected for simplicity. It is a single-thread case without dynamic memory management or calls to external libraries. Fig. 4 shows the deviation of the execute instruction, data memory read and write counts of the ABSINTH2 workload model from the reference platform as a function of Fibonacci iterations. The Fibonacci kernel was compiled with the ABSINTH2-patched gcc compiler version 4.5.1; in the case of workstation execution the -fabsinth2 switch was omitted. Instruction as well as data memory read and write counts of the reference platform were obtained from Callgrind profiling results.
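As an illustration of the kernel structure (this is not the BenchIT source, just a minimal non-recursive sketch with three variables initialized before the main loop, as discussed with Fig. 4 below):

// F(0) = 0, F(1) = 1, F(n) = F(n-1) + F(n-2)
unsigned long long fibonacci(unsigned int n) {
    unsigned long long prev = 0, curr = 1, next = 0;  // initializations become memory writes
    for (unsigned int i = 0; i < n; ++i) {            // main loop dominates Ir and Dr
        next = prev + curr;
        prev = curr;
        curr = next;
    }
    return prev;
}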

Figure 4. ABSINTH2 workload model deviation from reference platform for execute, memory read and write instructions

In Fig. 4, the reference platform instruction counts are normalized to one and the corresponding counts from the ABSINTH2 simulation are compared against them. In this example, the data memory read count (Dr) of the ABSINTH2 workload model is very close to the reference. Regarding execute instructions (Ir), the ABSINTH2 workload model is optimistic, representing 60-73% of the load measured from the reference depending on the iteration count. The data memory write (Dw) deviation increases as the iteration count grows. The reason for this increase lies in the code structure of the test kernel, where the Fibonacci number is calculated using three variables that are initialized before the main loop of the calculation. As these initializations turn into memory write operations, the total deviation of Dw instructions is smaller with fewer iterations of the test kernel. In the case of Ir and Dr the dependency of the deviation on the iteration count is weaker, as most of these instructions are executed in the main loop. At start-up of Valgrind, its core grafts itself into the client process and becomes part of the client's process [10]. This process can reserve more general-purpose registers than are available in the host machine, which leads to increased register spilling to memory; this can be the reason for the difference in memory instructions between ABSINTH2 and the reference platform. The translation process, where the client's code is disassembled to an intermediate representation, instrumented with the tool plug-in (here Callgrind) and finally converted back into x86 code, can explain the difference in execute instruction counts. This would, however, require further studies and was not covered in this study.

D. Sparse matrix manipulations

The second test case is based on the BenchITv6 "sparse" test kernel. The kernel consists of sparse matrix conversions to different storage formats and matrix multiplications. It is a single-thread case using dynamic memory allocation: memory for the input matrix as well as for the results is allocated and de-allocated dynamically from the heap. Furthermore, the input

matrix is filled with random numbers generated with the srand()/rand() functions from the C standard general utility library stdlib.h. This means that the majority of the workload is caused by library functions; in fact, test case profiling results on the reference platform revealed that roughly 75% of the total number of instructions fetched originated from library functions. Fig. 5 represents the deviation of the execute (Ir), memory write (Dw) and read (Dr) instruction workload from the reference for the different workload models. Models without library workloads (ABSINTH, ABSINTH2) are clearly optimistic compared to the reference. In the ABSINTH workload model, the effect of the probability-based method of branch selection [1] becomes visible in the case of matrix size 100 by 100. Consider the following code structure in the BenchIT "sparse" kernel:

mydata_t* pmydata;
pmydata = (mydata_t*)malloc( sizeof(mydata_t) );
if ( pmydata == 0 ) {
   fprintf( stderr );
   exit( 127 );
}

At first, space is reserved from the heap for pmydata. The success of the reservation is checked and, if it was not successful, the program exits. From the system design perspective we are of course interested in the load of normal behaviour. In the ABSINTH workload model simulation, it can happen that the unsuccessful memory reservation branch gets selected and the simulation exits prematurely, as was the case for matrix sizes up to 50 by 50. The branch of successful memory reservation got selected in the case of the 100 by 100 matrix size, and therefore the deviation from the reference platform results is smaller. Workload models taking the external library load into account were the most accurate: the deviation of execute instructions was limited to +/- 25%. In the case of memory read and write instructions these models were optimistic as well.

Figure 5. ABSINTH and ABSINTH2 workload model instruction count deviations from the reference platform.

E. Multi-threaded matrix multiplication

The third test case is a matrix multiplication where the calculation of the partial products is divided among multiple threads using the POSIX thread library. The size of the matrix is 40 x 40 and the calculation is divided into eight threads. The input matrices as well as the result matrix are situated in shared memory. At the beginning of the execution, the main thread reserves memory for the worker threads from the heap and creates them. After the worker threads have calculated the partial products and stored them in the result matrix, they are joined to the main thread and finally the program exits. Fig. 6 shows the accuracy of the simulation results compared to the Callgrind profiling results from the reference platform for the number of execute instructions.
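A minimal sketch of a benchmark of this kind is shown below (illustrative only; the in-house code is not reproduced in the paper): the multiplication of the shared matrices is split row-wise across eight POSIX threads, which the main thread creates and then joins.

#include <pthread.h>
#include <vector>

constexpr int N = 40, NTHREADS = 8;
static std::vector<double> A(N * N), B(N * N), C(N * N);  // shared input/result matrices

struct Slice { int first_row, last_row; };

static void* worker(void* arg) {
    const Slice* s = static_cast<const Slice*>(arg);
    for (int i = s->first_row; i < s->last_row; ++i)   // partial products for the rows
        for (int j = 0; j < N; ++j) {                  // assigned to this thread
            double acc = 0.0;
            for (int k = 0; k < N; ++k) acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    return nullptr;
}

int main() {
    pthread_t tid[NTHREADS];
    Slice slice[NTHREADS];
    for (int t = 0; t < NTHREADS; ++t) {               // main thread creates the workers
        slice[t] = { t * N / NTHREADS, (t + 1) * N / NTHREADS };
        pthread_create(&tid[t], nullptr, worker, &slice[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)                 // join the workers, then exit
        pthread_join(tid[t], nullptr);
    return 0;
}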

Figure 6. Accuracy of the number of execute instructions in different workload models compared to the profiling result (reference platform, ABSINTH, ABSINTH2, ABSINTH2 with library load, ABSINTH2 with library load and dynamic linking load).

In ABSINTH workload model simulation, a branch that prematurely exits from the program gets selected and the measurement leads to inaccurate results. In ABSINTH2 workload model simulation, the true execution trace improves the accuracy significantly. If the library load and dynamic linking load are also taken into account, the accuracy of the workload model is only slightly improved compared to the plain ABSINTH2 case: in this test case the application workload consists mostly of user code, and therefore the impact of including library and dynamic linking load is negligible. With the last option the instruction count error is only 7% compared to the reference platform.

V. CONCLUSIONS

We proposed ABSINTH2, a GCC-compiler-based workload model generator for the automatic creation of application workload models. Compared with our previous work, we have improved the control flow modeling of applications: probability-based branch selection has been replaced by tracing the control flow. The improved accuracy of the new generator was demonstrated in experiments with benchmarks. Furthermore, we proposed SAKE, a dynamic binary instrumentation based extension for the ABSINTH2 workload model generator. It is capable of extracting both the workload of external library functions and the load of dynamic linking. With the experiments we showed that it can further increase the accuracy of the generated workload model, but the overall accuracy depends on how much the application uses the services of external libraries. On the other hand, we showed that inclusion of the dynamic linking load does not significantly improve the accuracy of the workload model.

ACKNOWLEDGMENT

This work is supported by the European Commission and Tekes, the Finnish Funding Agency for Technology and Innovation, under the grant agreement ARTEMIS-2009-1-100230 SMECY, and by the European Commission under the grant agreement 215244 MOSART.

REFERENCES

[1] J. Kreku, K. Tiensyrjä and G. Vanmeerbeeck, "Automatic workload generation for system-level exploration based on modified GCC compiler," Proc. Design, Automation and Test in Europe Conference and Exhibition, 2010.
[2] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone and A. Sangiovanni-Vincentelli, "Metropolis: An Integrated Electronic System Design Environment," IEEE Computer, vol. 36, no. 4, pp. 45-52, April 2003.
[3] P. Lieverse, P. van der Wolf, K. Vissers and E. Deprettere, "A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems," Journal of VLSI Signal Processing, vol. 29, pp. 197-201, 2001.
[4] A. D. Pimentel, C. Erbas and S. Polstra, "A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels," IEEE Transactions on Computers, vol. 55, no. 2, pp. 99-112, February 2006.
[5] N. Nethercote and J. Seward, "Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation," Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), June 2007, San Diego, California, USA.
[6] J. Saastamoinen, S. Khan, K. Tiensyrjä and T. Taipale, "Multi-threading support for system-level performance simulation of multi-core architectures," ARCS 2011 Workshop Proceedings, 24th International Conference on Architecture of Computing Systems, pp. 169-177, February 22-23, 2011, Como, Italy.
[7] J. Kreku, M. Hoppari, T. Kestilä, Y. Qu, J.-P. Soininen, P. Andersson and K. Tiensyrjä, "Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems," EURASIP Journal on Embedded Systems, 2008.
[8] D. Molka, D. Hackenberg, R. Schöne and M. S. Müller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," 18th International Conference on Parallel Architectures and Compilation Techniques, 2009.
[9] G. Drysdale and M. Gillespie, "Intel Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload," Intel Software Network Communities, 2009.
[10] N. Nethercote, "Dynamic Binary Analysis and Instrumentation," PhD Dissertation, University of Cambridge, November 2004.


A Flexible NoC-based LDPC code decoder implementation and bandwidth reduction methods
Carlo Condo, Guido Masera, Senior Member IEEE

Abstract—The need for efficient and flexible LDPC (Low-Density Parity-Check) code decoders is rising due to the growing number and variety of standards that adopt this kind of error correcting code in wireless applications. From the implementation point of view, the decoding of LDPC codes implies intensive computation and communication among hardware components. These processing capabilities are usually obtained by allocating a sufficient number of processing elements (PEs) and proper interconnect structures. In this paper, Network on Chip (NoC) concepts are applied to the design of a fully flexible decoder, capable of supporting any LDPC code with no constraints on code structure. It is shown that NoC based decoders also achieve relevant throughput values, comparable to those obtained by several specialized decoders. Moreover, the paper explores the area and power overhead introduced by the NoC approach. In particular, two methods are proposed to reduce the traffic injected in the network during the decoding process, namely early stopping of iterations and message stopping. These methods are usually adopted to increase throughput; in this paper, on the contrary, we leverage iteration and message stopping to cut the area and power overhead of NoC based decoders. It is shown that, by reducing the traffic injected in the NoC and the number of iterations performed by the decoding algorithm, the decoder can be scaled to lower degrees of parallelism with small losses in terms of BER (Bit Error Rate) performance. VLSI synthesis results on a 130 nm technology show up to 50% area and energy reduction while maintaining an almost constant throughput.
Index Terms—VLSI, LDPC Decoder, NoC, Flexibility, Wireless communications

I. INTRODUCTION

LDPC codes were first studied by Gallager in [1] and later rediscovered by MacKay and Neal [2]: the outstanding performance offered by these codes led to intensive research in both theory and implementation. LDPC codes are currently included in several standards such as IEEE 802.11n [3] and IEEE 802.16e [4]; consequently, flexible decoders capable of supporting multiple codes are receiving significant attention from the research community. Flexibility issues must be tackled on different fronts: parametrized processing elements (PEs) and specialized programmable processors are valid solutions at the processing level [5] [6]. On the other side, flexibility must also be provided at the inter-PE interconnect level. Communication structures optimized for single codes or classes of codes like [7] and [8] achieve great efficiency by statically mapping the communication needs onto low-cost structures. This is the approach commonly used with quasi-cyclic LDPC codes [9], where the peculiar structure of the parity check matrix (H) permits the usage of very simple interconnect devices such as barrel shifters.

Though efficient, this approach greatly limits the achievable level of flexibility: fully flexible decoders must be able to work with H matrices that are very different from one another. The intrinsically flexible Network on Chip paradigm has been proposed as a possible structure to interconnect both heterogeneous processors (Inter-IP NoCs [10] [11]) and homogeneous hardware components concurrently executing a single task (Intra-IP NoCs [12]). NoC-based LDPC decoders [13] are composed of a set of P PEs connected by means of an NoC, which can accommodate the specific communication needs of any LDPC code. Stemming from a previously proposed flexible and scalable NoC based decoder [14], in this work improved architectures are described with the purpose of showing the feasibility of the NoC approach. A first decoder implementation is introduced based on a 5 x 5 two-dimensional mesh NoC: synthesis results for this decoder prove that, notwithstanding its large flexibility, it is capable of reaching high throughput; in particular, it is shown that the designed decoder is compliant with the WiMAX standard throughput requirements. However, the flexibility offered by the NoC approach comes at a power and area cost. Therefore, modified versions of the original decoder are proposed to limit both area and power overhead. These new decoding architectures incorporate two algorithms able to reduce the amount of messages that PEs exchange across the NoC in the decoding of LDPC codes. The first algorithm is an already known method to implement early stopping of decoding iterations [15]: by limiting to the minimum the number of iterations sequentially performed by the decoder on a data frame, the global number of messages across the NoC is decreased with respect to the case of a decoder that always runs the same number of iterations. Various methods for early stopping of iterations have been proposed to save power during the decoding [16] [17] [18] [19] [20]. In this paper, iteration stopping is adopted to also reduce the number of PEs and thus the occupied area. The second key modification applied to the NoC based decoder is the introduction of a method to dynamically stop the delivery of single messages across the NoC, when they are not strictly required. In particular, in the decoding of LDPC codes, the generated inter-PE messages carry twofold information: the sign of the message is binary information on the value of a bit in the codeword, while the modulus is associated with the level of reliability of that binary information. Reliabilities of the codeword bits tend to grow from one iteration to the next, but this growth occurs at different rates for different bits. Therefore the proposed method compares each message to be delivered with a threshold and if the threshold is passed the


message is considered reliable enough and it is not sent. The reduction in terms of traffic injected by the PEs into the NoC is exploited to decrease both power and occupied area, leading to improved NoC based decoders. The provided synthesis results show that the joint application of the two mentioned methods achieves relevant advantages with respect to the original NoC based decoder: the occupied area can be reduced by up to 43%, while 40% to 57% of the dissipated energy is saved. The paper is organized as follows: Section II summarizes the adopted decoding algorithm, while Section III describes the NoC approach to LDPC code decoding. Section IV briefly describes the architecture of the single processing element, and Sections V and VI detail two different methods for reducing the NoC traffic, with their implementation issues and advantages. Section VII shows results in terms of achievable throughput, occupied area and power saving; comparisons with state of the art solutions are also given. Conclusions are drawn in Section VIII.

II. LDPC DECODING

An LDPC code is a linear block code characterized by a sparse parity check matrix H. Columns (index j) of H are associated to received bits, while rows (index m) correspond to parity check constraints (PCC). In the layered decoding method [21], which provides approximately a 2x factor in convergence speed w.r.t. the two-phase decoding, PCCs are clustered in layers ([22]) and extrinsic probability values are updated from one layer to the other. According to the notation adopted in [22], L(c) represents the logarithmic likelihood ratio (LLR) of symbol c (L(c) = log(P{c = 0}/P{c = 1})). For each H column j, the bit LLR L(qj) is initialized to the corresponding channel-estimated soft value. Then, for each PCC m in a given layer, the following operations are executed:

L(qmj) = L(qj)^(old) - Rmj^(old)    (1)

Amj = Σ_{n∈N(m), n≠j} Ψ(L(qmn))    (2)

smj = ∏_{n∈N(m), n≠j} Sign(L(qmn))    (3)

Rmj^(new) = -smj Ψ(Amj)    (4)

L(qj)^(new) = L(qmj) + Rmj^(new)    (5)

L(qj)^(old) is the extrinsic information received from the previous layer: it is updated in (5) and eventually passed to the following layer. Rmj^(old) is used in equation (1); the same amount is then updated in (4) to Rmj^(new), used to compute (5) and stored to be used again in the following iteration. N(m) is the set of codeword bits connected to parity constraint m. Finally, Ψ(·) is a non-linear, unbounded function, often substituted by the normalized min-sum approximation [23]:

A1mj = min_{n∈N(m)} |L(qmn)|    (6)

A2mj = min_{n∈N(m), n≠t} |L(qmn)|    (7)

where t is the index related to the first minimum A1mj, while A2mj is the second minimum. Equation (4) is also changed to

Rmj^(new) = -smj · A1mj / α   when |L(qmj)| ≠ A1mj,
Rmj^(new) = -smj · A2mj / α   otherwise,    (8)

where the normalization factor α is used to limit the approximation performance degradation due to the min-sum non-optimality [23].

III. NOC BASED DECODING

This work focuses on complete flexibility of the decoder, and therefore no assumption is made on the structure of the supported LDPC codes. To achieve such a large flexibility, the possible use of NoC based interconnect architectures has already been suggested and partially explored in [24] and [13]; however, a complete evaluation of the potential of the NoC based approach in terms of achievable performance and implementation complexity is not available. Figure 1 shows the adopted NoC topology, a two-dimensional toroidal mesh. Each node includes a PE and a routing element (RE) with five inputs/outputs. A simple input-queuing architecture is implemented with first-in first-out memory queues (FIFO). Every FIFO is connected to the output registers by means of a crossbar switch. Since the number of PCCs in a code is much higher than the number of PEs, several PCCs will be scheduled on each PE. A control unit (CURE) generates commands for the RE components, implementing a given routing strategy. In particular, control bits are necessary at the crossbar to implement a given switching of incoming data, read signals must be applied to the FIFOs, and write signals are required by the output registers. Thus, in general, the CURE must receive destination addresses for received packets and implement a certain routing algorithm. Alternatively, the CURE can be reduced to a simple routing memory (RM), which statically applies pre-calculated control signals to the RE. Given a certain code, the inter-processor communication needs are known a priori, depending on the structure of the H matrix. To reduce as much as possible the implementation overhead due to routing information in the packet header and the routing algorithm, the so-called Zero Overhead NoC (ZONoC) [13] concept can be exploited. For each supported code the best path followed by all messages during a decoding iteration is statically derived: a routing memory in each node stores the necessary controls, avoiding the routing algorithm implementation and reducing the packet size and the depth of the FIFOs. The static information is derived via simulation on a cycle-accurate Python simulation tool. This model receives a description of the NoC and of the parity constraints mapped on each PE and, by simulating the behavior of the NoC as PEs inject messages, produces the routing decisions across a complete iteration, together with the maximum size of the input FIFOs. This information can easily be coded into binary control signals and stored in the routing memory.

IV. ARCHITECTURE OF THE PROCESSING ELEMENT

In this section, the general structure of the PE is summarized. Figure 3 shows a simplified block scheme.
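Before detailing the PE blocks, a scalar software sketch of the layered NMS update in equations (1)-(8) may help fix what each PE computes per parity check. This is illustrative code only, written with the sign convention of (4) and (8) as given in the text; the actual PE works in fixed point and in hardware.

#include <cmath>
#include <vector>

// One layered NMS update for parity check m. 'pos' lists the column indices
// in N(m); Lq holds the extrinsic totals L(qj); R holds the stored Rmj.
void layered_nms_update(std::vector<float>& Lq, std::vector<float>& R,
                        const std::vector<int>& pos, float alpha) {
    const int deg = static_cast<int>(pos.size());
    std::vector<float> Lqm(deg);
    float min1 = HUGE_VALF, min2 = HUGE_VALF;      // A1mj and A2mj of (6),(7)
    int imin = -1, sign_all = 1;
    for (int i = 0; i < deg; ++i) {
        Lqm[i] = Lq[pos[i]] - R[i];                // (1)
        const float mag = std::fabs(Lqm[i]);
        if (mag < min1)      { min2 = min1; min1 = mag; imin = i; }
        else if (mag < min2) { min2 = mag; }
        if (Lqm[i] < 0.0f) sign_all = -sign_all;   // running product of signs
    }
    for (int i = 0; i < deg; ++i) {
        const int   s   = (Lqm[i] < 0.0f) ? -sign_all : sign_all; // (3): exclude own sign
        const float mag = (i == imin) ? min2 : min1;              // (8): second minimum for t
        R[i] = -s * mag / alpha;                   // (4)/(8), sign convention as in the text
        Lq[pos[i]] = Lqm[i] + R[i];                // (5): extrinsic passed to the next layer
    }
}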


Figure 1. NoC torus mesh topology with detailed routing element

Figure 2. Example of memory organization for extrinsic values

The PE executes equations (1) to (8) in a pipelined way, in order to achieve high throughput. The finite precision representation of data and the number of decoding iterations are decided by means of extensive simulations of the considered LDPC codes. The data flow begins when the previous layer's extrinsic values L(qj)^(old) are received by the PE and stored in the L(qj) MEMORY. This is a two-port memory with Npc x Nd locations, where Npc is the maximum number of PCCs mapped onto the PE and Nd is the maximum degree of the PCCs. Similarly to the RM, the Write Address Generator memory (WAG) is also initialized with data obtained via off-line simulation. Its purpose is to generate writing addresses for incoming messages: since at every PE the sequence of arrival of L(qj)^(old) is the same through every iteration, it can be derived statically. The WAG MEMORY contains pointers to the L(qj) MEMORY, where extrinsic values are stored. Fig. 2 shows an example of L(qj) MEMORY organization with Nd = 3. The memory is divided into Npc blocks, each one containing 3 consecutive locations. In the example, the 2nd scheduled PCC receives three L(qj)^(old) values from the previous layer: these extrinsic values are sequentially stored in the 2nd block, starting from offset 3. The CNT/CMP unit is a counter used to compute the read addresses for the L(qj) MEMORY. It counts Nd successive locations from an initial offset that points at the first L(qj) of a parity check. Upon recognizing the last read operation, a new offset and Nd values are loaded. The Rmj MEMORY contains the Rmj amounts and is sized exactly as the L(qj) MEMORY. The subtraction of L(qj)^(old) and Rmj^(old) is used to compute

the L(qmj) values, and the first and second minimum are derived by the MINIMUM EXTRACTION unit. A cumulative XOR keeps track of the overall sign of L(qmj). The output of the COMPARE unit, which implements (8), is multiplied by 1/α according to the NMS algorithm to obtain Rmj^(new). At the end of the flow, L(qmj) is retrieved by means of a short FIFO and added to Rmj^(new), obtaining L(qj)^(new) (5), which is sent to the NoC via an output buffer. The implementation results of a flexible decoder based on the NoC approach are reported in Section VII: a 5 x 5 NoC and the described PE architecture are used to design a highly versatile decoder, which reaches a worst-case throughput of more than 80 Mbps on WiMAX LDPC codes. It also supports any structured or unstructured LDPC code, up to the block size of the largest WiMAX code, including the codes adopted in WiFi. This result proves that NoC based decoders are a feasible solution for multi-standard applications.

V. MESSAGE STOPPING

Two improvements are proposed in this paper to reduce the amount of messages that have to be exchanged among PEs: (i) message stopping, which results in a reduction of the traffic injected into the NoC, and (ii) early stopping of iterations, which cuts the decoding time. These advantages can be exploited to reduce dissipated power and occupied area. A C++/Python fixed point simulation model has been developed for the whole transmission chain, consisting of encoder, AWGN (additive white Gaussian noise) channel and decoder. The model allows statistical study of the extrinsic values exchanged among PEs in the layered decoding of a set of LDPC codes. In particular, Figure 4 shows how extrinsic values change from one iteration to another: as expected, extrinsic values tend to increase with iterations, and the number of messages that carry high reliability values for the corresponding bits also increases along the decoding process. Moreover, divergence from 0 of the extrinsics occurs earlier for higher signal to noise ratios (SNR), while extrinsics tend to float around zero at low SNRs, expressing uncertainty about the bit value.
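The per-iteration statistics behind Figures 4 and 5 can be gathered with something as simple as the following (an assumed sketch, not the authors' simulation model): after every decoding iteration the magnitudes of all extrinsics are accumulated into a coarse histogram.

#include <cmath>
#include <cstddef>
#include <vector>

// Add the |extrinsic| values of one iteration to a histogram with fixed-width bins.
void record_extrinsic_stats(const std::vector<float>& Lq,
                            std::vector<std::size_t>& histogram, float bin_width) {
    for (float v : Lq) {
        std::size_t bin = static_cast<std::size_t>(std::fabs(v) / bin_width);
        if (bin >= histogram.size()) bin = histogram.size() - 1;  // saturate the last bin
        ++histogram[bin];
    }
}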


Figure 3. Simplified block scheme for the processing element

Figure 4. Extrinsic values evolution for WiFi code (1944, 0.75), with SNR=2.0 (upper) and SNR=2.6 (lower)

Figure 5. Extrinsic values distribution over 15 frames, WiFi (1944, 0.75), SNR=2.6

Figure 5 shows the distribution of message values at different iterations. As expected, the number of uncertain extrinsics decreases as the decoding proceeds, meaning that most of the errors introduced by the channel are corrected in the initial iterations; the errors remaining after the initial iterations are associated with low absolute values of the extrinsics L(qj). This

Figure 6. BER curves without and with message stopping

behavior of the extrinsics can be exploited to reduce the traffic injected in the NoC. The basic idea is very simple: once a given extrinsic has reached a high enough value, it is not updated anymore in the following iterations. This implies that extrinsics are saturated to a certain limit and saturated values are not sent through the NoC. We call this method of traffic reduction on the NoC "message stopping" (MS). In order to apply MS, extrinsics must be compared against a threshold during the decoding process, and the value of the threshold has to be decided by simulation. Several thresholds have been tried for different codes, and the value that guarantees at the same time a high number of stopped extrinsics and a small effect on BER performance has been selected. The choice of the threshold, THR, is also affected by the decoding algorithm, the finite precision representation of data and the SNR. Table I (column 2) shows the average percentages of stopped messages (Smsg) for the decoders addressed in Section VII. The given percentages have been computed by considering two NoC based decoders: decoder A does not support MS and simply executes decoding iterations up to a given maximum number Itmax; the decoding of a frame is actually stopped before reaching Itmax only if all PCCs are verified. Decoder B executes the same algorithm as A, but it also implements MS, meaning that extrinsics are compared against THR and when THR is passed they are no more


updated. For both decoders, the global numbers of extrinsics that are updated in the decoding of a data frame are registered and averaged across several frames. Table I shows that Smsg ranges between 19% and 40% for the considered codes. These relevant percentages motivated us to further study the impact of MS on the decoder implementation. Figure 6 gives the BER curves for different cases of MS applied to the (2304, 1152) WiMAX code. Different SNR losses (0.1, 0.2 and 0.3 dB) are obtained at the 10^-5 BER crossing point. The percentages in Table I have been obtained for the 0.3 dB case.

In an NoC based decoder, the time length of a decoding iteration has three components. The first component depends on the number of cycles taken to inject messages into the network: ideally, a PE needs Npc x Nd cycles to generate and send out all messages corresponding to all assigned PCCs. The second component comes from the distance between source and destination PEs in the NoC: the physical delivery of a message corresponds to a number of hops depending on the size of the NoC and on the selected path. The last component derives from the conflicts that occur at NoC nodes when multiple messages have to be routed across the same port of the switch. Intuitively, in a small to medium size NoC decoder the third contribution to the iteration length tends to dominate, as sent messages generate several conflicts and spend several cycles in the FIFOs. In such a case, message stopping is a very efficient way to improve the throughput: if the injected number of messages is reduced by a certain percentage, the corresponding throughput is expected to improve by approximately the same percentage. However, in a larger NoC, the iteration length is dominated by the first and second contributions, thus the actual gain in throughput tends to be smaller than the percentage reduction of injected messages. The cycle-accurate NoC model has been used to simulate decoding iterations with and without message stopping. The throughput gain achieved by means of message stopping in the decoding of several LDPC codes is reported in Table I (columns 3 and 4). For each code, two NoC based decoders have been considered: a large decoder with P1 PEs and a smaller one with P2 PEs. In both cases, the throughput gain Tgain is provided as a percentage increase with respect to the same decoding executed with no message stopping. It can be seen from Table I that larger gains, between 10% and 15%, are obtained for the smaller NoCs.

TABLE I. EFFECT OF BANDWIDTH REDUCTION METHODS FOR DIFFERENT LDPC CODES. MESSAGE STOPPING WITH THRESHOLD THR ON LDPC DECODERS WITH Px PEs AND Smsg STOPPED MESSAGES. Tgain IS THE THROUGHPUT GAIN OVER THE AVERAGE.

Code | THR | Smsg | Msg. stopping: P1 / Tgain | P2 / Tgain | Early stop Tgain
802.16e (2304, 0.5) | 16 | 19% | 25 / 5.2% | 9 / 15.1% | 6.2%
802.16e (2304, 0.83) | 17 | 27% | 25 / 5.5% | 9 / 12.9% | 9.2%
802.16e (1632, 0.5) | 14 | 32% | 25 / 4.2% | 9 / 11.2% | 6.2%
802.16e (1632, 0.83) | 16 | 36% | 25 / 5.6% | 9 / 19.9% | 32.8%
802.16e (576, 0.5) | 10 | 39% | 25 / 7.9% | 9 / 10.6% | 11.3%
802.16e (576, 0.83) | 12 | 40% | 25 / 6.3% | 9 / 11.5% | 43.9%
802.11n (1944, 0.75) | 16 | 21% | 16 / 3.7% | 9 / 6.5% | 6.7%

Figure 7. Architecture of the Check Block (CB)

A. Architecture

Additional architecture components are required to support MS (Fig. 7). Each extrinsic L(qj) to be injected into the

NoC has to be compared against THR. If |L(qj)| > THR the message is considered reliable enough and must not be sent. To implement this behavior, the Check Block (CB) is inserted at the output buffer of each PE. The CB performs the threshold-message comparison: a subtraction generates a sign bit, which is appended to the message and used as a "stopped" flag (F) to inform the destination PE that the current message is received for the last time. To support MS, dynamic routing is also required instead of static routing: as the stopping of messages cannot be predicted, the off-line derivation of routing decisions is not possible. A packet header must be created, containing the Destination Node Identifier (DNI), which is used by the routing algorithm executed at each NoC node to properly deliver incoming messages. The so-called O1Turn routing method [25] is adopted in this work due to its reduced complexity. Finally, an additional field is required in the packet to compensate for the unpredictable arrival order of messages. This field (RO) contains the address for writing the corresponding extrinsic in the L(q) memory. The whole structure of the packet is shown in Fig. 7, where the field PAYLOAD contains the extrinsic value.

VI. EARLY STOPPING OF ITERATIONS

In the decoding of LDPC codes, the average number of iterations (ANI) is known to be much lower than Itmax: for example, the first row in Table II shows that, in the decoding of WiMAX codes with Itmax = 10, the ANI ranges between 2.9 and 6.1, depending on the SNR. These results have been obtained on a 5 x 5 topology by simply stopping the decoding of a frame as soon as a valid codeword is found. The introduction of an early stopping (ES) criterion can be of great benefit to reduce power dissipation, and several ES methods have been proposed in the literature for this purpose. In this work we apply a recently proposed ES method [15] with the aim of reducing the occupied area. A lower ANI can easily be exploited to increase the decoding throughput. However, a 5 x 5 NoC based decoder with no ES and no MS (first row


in Table II) is already compliant with the WiMAX standard in terms of achievable throughput. Therefore we exploit ES to reduce the degree of parallelism of the decoder, P, and thus the size of the NoC. In particular we show that a 3 x 3 NoC decoder with ES guarantees the same throughput offered by the 5 x 5 architecture with no ES, at the cost of a small BER performance penalty.

In [15] an early stopping method is described with reference to WiMAX and WiFi LDPC codes. The proposed method basically detects iterations that are required only to correct parity bits and skips them. To this purpose, incorrect codewords are divided into two types. Type I takes into account errors located either in the information part of the codeword or in the first z positions of the parity part, where z is the expansion factor of H. Type II refers to errors located in the last M-z positions of the codeword, where M is the number of rows in H. At high SNR values (> 1.7 dB), Type II errors are much more frequent than Type I. Denoting by si the result of the i-th parity check equation, the syndrome vector s is defined as s = [s0, s1, ..., sM-2, sM-1]^T. The syndrome accumulation vector (SAV) a = [a0, a1, ..., az-2, az-1]^T is defined so that

ai = Σ_{k=0}^{c-1} s(i+kz)    (9)

where c = M/z. It is shown in [15] that the SAV vector is entirely composed of even numbers for Type II errors: in this case, the decoding process can be stopped with no loss of information. On the contrary, if an odd number is present in the SAV, then the codeword is of Type I and the decoding must continue. Even though presented for the WiMAX and WiFi standards, the ES method can easily be extended to less structured codes. The effects of this ES criterion have been evaluated by means of the same C++/Python simulation model used for the MS method. The performed simulations show that the BER performance is weakly affected by the selected ES method. On the other side, the ANI is greatly reduced, as shown in Figure 8 for the WiMAX (2304, 0.5) code. It can be seen that the curves corresponding to the decoding with and without the ES criterion are almost overlapped at low SNR, meaning that the number of Type II codewords is limited in this region. At high SNRs, the ES method offers a percentage reduction of the ANI close to 20%.

Figure 8. Average iterations curves with and without early stopping
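The decision behind this criterion is easy to express in software; the following is an assumed sketch of the test of (9), not the ESB hardware described next:

#include <vector>

// Return true if decoding may stop: every SAV entry ai of (9) is even,
// i.e. the remaining errors, if any, are of Type II.
bool sav_allows_stop(const std::vector<int>& s,  // syndrome bits s0..s(M-1)
                     int z) {                    // expansion factor of H
    const int M = static_cast<int>(s.size());
    const int c = M / z;
    for (int i = 0; i < z; ++i) {
        int ai = 0;
        for (int k = 0; k < c; ++k) ai += s[i + k * z];   // (9)
        if (ai % 2 != 0) return false;   // odd entry: Type I codeword, keep decoding
    }
    return true;
}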

A. Architecture

Additional hardware resources are necessary to support the described ES method (Figure 9): P Transmission Blocks (TB), one for each PE, and a single Early Stopping Block (ESB). The ES processing can be divided into three steps. 1) In step 1, the si are computed for each PCC i (i = 0, 1, ..., M-1) and delivered from the PEs to the ESB. The computation is locally performed by the TB of each PE: this simply requires XOR-ing the sign bits of the extrinsic values in the PCCs. The delivery requires one dedicated connection from every PE to the ESB. At the receiving side, the si are sequentially stored in the P input memories SMin_h, h = 1, ..., P. 2) In step 2, the si are reordered by means of a shuffling network (SN) to enable the SAV calculation (9). The whole set S = {si | i = 0, 1, ..., M-1} of syndromes stored in the SMin memories is partitioned into P sub-sets, S = ∪_{h=1..P} Sin_h, where Sin_h contains all si evaluated from PCCs mapped to the h-th PE. The shuffling network generates a new partitioning of S, where the si are divided according to the SAV vector: S = ∪_{h=0..z-1} Sout_h, where sub-set Sout_h includes all si evaluated from PCCs belonging to SAV element h. After shuffling, the syndromes are stored in the P output memories SMout_h (h = 1, ..., P). 3) In step 3, the aj elements of SAV a (j = 0, 1, ..., z-1) are computed in parallel by P XOR gates and P SAV blocks (SB). Since usually z > P, each SB computes multiple items of a in sequence. A final OR gate generates the binary output STOP, which is the final decision on stopping. The TBs operate concurrently with each decoding iteration and do not introduce latency. On the contrary, the ESB processing introduces additional cycles of latency: M/P cycles are necessary to move the si syndromes from the SMin to the SMout memories, and the same number of cycles is required to read P si syndromes at a time from the SMout memories and evaluate the final decision (STOP). The whole latency, 2M/P, corresponds to several cycles, depending on the code length and NoC size; however, it can easily be accommodated within a decoding iteration. For example, the 5 x 5 NoC based decoder with no MS and no ES in Table II needs 421 cycles to complete a single iteration when decoding the (2304, 1152) WiMAX code; for this example, M = 1152 and P = 25, thus the additional latency to implement ES is 92 cycles, equal to 22% of the length of one iteration. For the same example code, the overhead due to the ESB latency can be evaluated in Fig. 8, where the third curve shows the effective ANI obtained with the implemented ES method: at 2.2 dB, the ES method should ideally reduce the ANI from 6 to 4.8; the ESB latency causes a delay in the stopping decision and this changes the ANI to 5.2, which is still a relevant advantage with respect to the


Figure 9. Iteration early stopping block scheme

original value. The results in terms of throughput gain can be seen in the last column of Table I for several codes; the given percentages also take into account the ESB latency. The ESB contains four types of memories: 1) the P SMin memories receive the M syndromes, therefore each of them has size M/P x 1; 2) the P SMout memories receive the reordered syndromes (size M/P x 1); 3) the SNM memory contains the controls for the shuffling network: for each syndrome to be moved from SMin_i to SMout_j, the SNM must enable the right path between ports i and j, and as the network has P input and output ports, P · ⌈log2 P⌉ control bits are required; M/P syndromes are received at each input port, thus the total size of the SNM is M · ⌈log2 P⌉ bits; 4) finally, P SWA memories are allocated to store write addresses for the SMout components: for each of them, M/P words are needed and every word contains ⌈log2(M/P)⌉ address bits, plus one additional bit used as a write command for the SMout memories, so the total size of the SWA memories is M · (1 + ⌈log2(M/P)⌉) bits. The content of the SNM and SWA memories depends on the specific scheduling of PCCs on the decoder PEs. The global amount of memory can be expressed as M · (3 + ⌈log2(M/P)⌉ + ⌈log2(P)⌉). For example, the ESB memory required to support WiMAX codes on a 4 x 4 NoC based decoder is obtained with P = 16 and M = 1152 and is equal to 16K bits.

VII. ACHIEVED RESULTS

The first row in Table II refers to a 25-PE NoC sized to support all WiMAX LDPC codes. The decoder is fully flexible and able to support any other LDPC code with size lower than the largest WiMAX code. However, in this paper, we limit the presented results to the case of WiMAX codes; the performance obtained on other LDPC codes is available in [14]. Even if the decoder does not include the MS and ES methods, the number of iterations allowed and the degree of parallelism of the NoC guarantee a throughput of at least 70 Mbits/s for all code lengths and rates in the WiMAX standard. Comparing this

decoder with the implementations reported in the last rows of the table, it can be seen that the worst-case throughput is quite high, while a larger area is required due to the high degree of flexibility provided by the NoC approach. This overhead can be significantly reduced by introducing the ES and MS methods. Row 2 in Table II is related to a 5 x 5 NoC based decoder implementing the ES method as described in Section VI. Comparing this implementation with the decoder in the first row, which has the same size but does not support ES, it can be seen that the worst-case throughput is not dramatically changed, while the occupied area increases by almost 20%, due to the additional hardware components required to support ES. The only advantage provided by ES is related to the average energy dissipated to decode a data frame, Ef, which is reduced by roughly 20%. In row 3, results are given for an 18-PE NoC (6 x 3) supporting both ES and MS. The potential offered by the MS method is exploited to reduce the number of PEs, saving both area (-11%) and energy (-37%). The throughput still reaches 70 Mbps for the WiMAX codes, while a 0.2 dB penalty is paid in terms of BER performance. Two further solutions are explored with the decoders in rows 4 and 5. The 4 x 4 NoC based decoder in row 4 provides a reduction of 24% on area and 19% to 40% on Ef, with minor penalties in terms of throughput (15%) and BER performance (0.1 dB). The 3 x 3 case achieves the lowest occupied area, which is comparable with the best implementations reported in the last rows of the table, and the lowest Ef. Moreover, its throughput is compliant with the WiMAX standard. The BER penalty in this case is 0.3 dB.

VIII. CONCLUSIONS

The design of a fully flexible NoC based LDPC decoder has been presented, together with two complementary methods for reducing the traffic injected into the network. These methods provide relevant area and power savings. The first proposed decoder implementation offers an unparalleled degree of flexibility and a throughput higher than 70 Mbps on WiMAX codes. A penalty in terms of additional area and power is paid for this decoder with respect to state of the art dedicated or partially flexible decoders. The other presented NoC based decoders exploit early stopping of iterations and message stopping to scale the whole NoC to lower degrees of parallelism: the scaled architectures still achieve a high enough worst-case throughput at a much lower area and power cost.


TABLE II. LDPC ARCHITECTURES COMPARISON: CMOS TECHNOLOGY PROCESS (TP), AREA OCCUPATION (A), NORMALIZED AREA OCCUPATION FOR 65 NM TECHNOLOGY (An), CLOCK FREQUENCY (fclk), PRECISION BITS (b), AVERAGE ENERGY PER FRAME DECODING (Ef), MAXIMUM (Itmax) AND AVERAGE (ANI) NUMBER OF ITERATIONS, MINIMUM THROUGHPUT (T) AND SNR TO ACHIEVE BER=10^-5 (SNR)

Decoder | TP [nm] | A [mm2] | An [mm2] | fclk [MHz] | b [bits] | Ef [µJ] | Itmax | ANI | Code length - rate | T [Mb/s] | SNR [dB]
5 x 5 NoC, No MS, No ES | 130 | 4.72 | 1.18 | 300 | 8 | 1.14 | 10 | 2.9 | 576 - 0.5 | 71 | 2.9
 | | | | | | 4.98 | 10 | 4.9 | 1632 - 0.5 | 78 | 2.4
 | | | | | | 8.28 | 10 | 6.1 | 2304 - 0.5 | 82 | 2.2
5 x 5 NoC, No MS, ES | 130 | 5.49 | 1.37 | 300 | 8 | 0.93 | 10 | 2.9 | 576 - 0.5 | 70 | 2.9
 | | | | | | 4.44 | 10 | 4.9 | 1632 - 0.5 | 76 | 2.4
 | | | | | | 7.59 | 10 | 6.1 | 2304 - 0.5 | 81 | 2.2
6 x 3 NoC, MS, ES | 130 | 4.20 | 1.05 | 300 | 8 | 0.56 | 10 | 2.9 | 576 - 0.5 | 70 | 3.1
 | | | | | | 3.26 | 10 | 4.9 | 1632 - 0.5 | 72 | 2.6
 | | | | | | 5.32 | 10 | 6.1 | 2304 - 0.5 | 74 | 2.4
4 x 4 NoC, MS, ES | 130 | 3.61 | 0.90 | 300 | 8 | 0.68 | 10 | 2.9 | 576 - 0.5 | 61 | 3.0
 | | | | | | 3.57 | 10 | 4.9 | 1632 - 0.5 | 67 | 2.5
 | | | | | | 6.69 | 10 | 6.1 | 2304 - 0.5 | 64 | 2.3
3 x 3 NoC, MS, ES | 130 | 2.68 | 0.67 | 300 | 8 | 0.49 | 10 | 2.9 | 576 - 0.5 | 74 | 3.2
 | | | | | | 2.84 | 10 | 4.9 | 1632 - 0.5 | 71 | 2.7
 | | | | | | 4.81 | 10 | 6.1 | 2304 - 0.5 | 72 | 2.5
[6] | 65 | 0.62 | 0.62 | 400 | N/A | N/A | 20 | N/A | WiMAX | 27.7 | N/A
[26] | 180 | 3.39 | 0.442 | 100 | N/A | N/A | 10 | N/A | WiMAX | 68 | N/A
[27] | 65 | 1.337 | 1.337 | 400 | 6 | N/A | 20 | N/A | WiMAX | 48 (min) | N/A
[13] | 130 | 3.7 | 0.93 | 300 | 6 | N/A | 10 | N/A | 2304 - 0.5 | 56 | N/A
[28] | 90 | 0.679 | 0.354 | 400 | 7 | N/A | 12 | 6.64 | 2304 - 0.5 | 66.7 | 2.15
[7] | 90 | 6.22 | 3.24 | 300 | 6 | N/A | 20 | N/A | WiMAX | 212 (max) | 2.2 (min)

REFERENCES

[1] R. Gallager, "Low-density parity-check codes," Information Theory, IRE Transactions on, vol. 8, no. 1, pp. 21-28, 1962.
[2] D. MacKay, "Good error-correcting codes based on very sparse matrices," in Information Theory. 1997. Proceedings., 1997 IEEE International Symposium on, 1997.
[3] J. Lorincz and D. Begusic, "Physical layer analysis of emerging IEEE 802.11n WLAN standard," in Advanced Communication Technology, 2006. ICACT 2006. The 8th International Conference, vol. 1, 2006, 6 pp.
[4] M. Khan and S. Ghauri, "The WiMAX 802.16e physical layer model," in Wireless, Mobile and Multimedia Networks, 2008. IET International Conference on, 2008, pp. 117-120.
[5] G. Masera, F. Quaglio, and F. Vacca, "Implementation of a flexible LDPC decoder," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 54, no. 6, pp. 542-546, 2007.
[6] M. Alles, T. Vogt, and N. Wehn, "FlexiChaP: A reconfigurable ASIP for convolutional, turbo, and LDPC code decoding," in Turbo Codes and Related Topics, 2008 5th International Symposium on, 2008, pp. 84-89.
[7] C.-H. Liu, C.-C. Lin, S.-W. Yen, C.-L. Chen, H.-C. Chang, C.-Y. Lee, Y.-S. Hsu, and S.-J. Jou, "Design of a multimode QC-LDPC decoder based on shift-routing network," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 56, no. 9, pp. 734-738, 2009.
[8] X. Chen, S. Lin, and V. Akella, "QSN: a simple circular-shift network for reconfigurable quasi-cyclic LDPC decoders," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 57, no. 10, pp. 782-786, 2010.
[9] M. Fossorier, "Quasi-cyclic low-density parity-check codes from circulant permutation matrices," Information Theory, IEEE Transactions on, vol. 50, no. 8, pp. 1788-1793, 2004.
[10] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
[11] K. Goossens, J. Dielissen, and A. Radulescu, "AEthereal network on chip: concepts, architectures, and implementations," Design & Test of Computers, IEEE, vol. 22, no. 5, pp. 414-421, 2005.
[12] L. Benini, "Application specific NoC design," in Design, Automation and Test in Europe, 2006. DATE '06. Proceedings, vol. 1, 2006, pp. 1-5.
[13] F. Vacca, G. Masera, H. Moussa, A. Baghdadi, and M. Jezequel, "Flexible architectures for LDPC decoders based on network on chip paradigm," in Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on, 2009, pp. 582-589.
[14] C. Condo and G. Masera, "Omitted for blind review," IEEE Trans. VLSI Syst., submitted for publication, available on arxiv.org.
[15] Z. Chen, X. Zhao, X. Peng, D. Zhou, and S. Goto, "An early stopping criterion for decoding LDPC codes in WiMAX and WiFi standards," in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 473-476.
[16] G. Gilikiotis and V. Paliouras, "A low-power termination criterion for iterative LDPC code decoders," in Signal Processing Systems Design and Implementation, 2005. IEEE Workshop on, 2005.
[17] Y. Sun and J. R. Cavallaro, "High throughput VLSI architecture for soft-output MIMO detection based on a greedy graph algorithm," in Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI'09, New York, USA, Mar. 2009, pp. 445-450.
[18] W. Wang and G. Choi, "Minimum-energy LDPC decoder for real-time mobile application," in Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE '07, 2007.
[19] ——, "Speculative energy scheduling for LDPC decoding," in Quality Electronic Design, 2007. ISQED '07. 8th International Symposium on, 2007.
[20] W. Wang, G. Choi, and K. Gunnam, "Low-power VLSI design of LDPC decoder using DVFS for AWGN channels," in VLSI Design, 2009 22nd International Conference on, 2009.
[21] F. Guilloud, E. Boutillon, J. Tousch, and J.-L. Danger, "Generic description and synthesis of LDPC decoders," Communications, IEEE Transactions on, vol. 55, no. 11, pp. 2084-2091, 2007.
[22] D. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," in Signal Processing Systems, 2004. SIPS 2004. IEEE Workshop on, 2004, pp. 107-112.
[23] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," Communications, IEEE Transactions on, vol. 47, no. 5, pp. 673-680, May 1999.
[24] T. Theocharides, G. Link, N. Vijaykrishnan, and M. Irwin, "Implementing LDPC decoding on network-on-chip," in VLSI Design, 2005. 18th International Conference on, 2005, pp. 134-137.
[25] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, "Near-optimal worst-case throughput routing for two-dimensional mesh networks," in Computer Architecture, 2005. ISCA '05. Proceedings. 32nd International Symposium on, June 2005, pp. 432-443.
[26] T.-C. Kuo and A. Willson, "A flexible decoder IC for WiMAX QC-LDPC codes," in Custom Integrated Circuits Conference, 2008. CICC 2008. IEEE, 2008, pp. 527-530.
[27] T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, N. L'Insalata, F. Rossi, M. Rovini, and L. Fanucci, "Low complexity LDPC code decoders for next generation standards," in Design, Automation Test in Europe Conference Exhibition, 2007. DATE '07, 2007, pp. 1-6.
[28] Y.-L. Wang, Y.-L. Ueng, C.-L. Peng, and C.-J. Yang, "Processing-task arrangement for a low-complexity full-mode WiMAX LDPC codec," Circuits and Systems I: Regular Papers, IEEE Transactions on, 2010.

FERONOC: FLEXIBLE AND EXTENSIBLE ROUTER IMPLEMENTATION FOR DIAGONAL MESH TOPOLOGY
Majdi Elhajji, Brahim Attia, Abdelkrim Zitouni, Rached Tourki

Samy Meftali, Jean-luc Dekeyser

Univ. Monastir Laboratory of Electronics and Micro-Electronics Monastir 5019, Tunisia

Univ. Lille 1 LIFL,CNRS, UMR 8022, INRIA Villeneuve d’Ascq 59650, France

ABSTRACT

Networks on Chip (NoCs) can improve a set of performance criteria in complex SoCs, such as scalability, flexibility and adaptability. However, the performance of a NoC is closely related to its topology. The diameter and the average distance represent an important factor in terms of performance and implementation. The proposed diagonal mesh topology is designed to offer a good tradeoff between hardware cost and theoretical quality of service (QoS). It can contain a large number of nodes without changing the maximum diameter, which is equal to 2. In this paper, we present a new router architecture called FeRoNoC (Flexible, extensible Router NoC) and its Register Transfer Level (RTL) hardware implementation for the diagonal mesh topology. The architecture of our NoC is based on a flexible and extensible router which relies on a packet switching technique and a deterministic routing algorithm. The effectiveness and performance of the proposed topology have been shown using a Virtex5 FPGA implementation. A comparative performance study of the proposed NoC architecture with other topologies is performed.

Index Terms— SoC, NoC, RTL, FeRoNoC

1. INTRODUCTION

Modern applications of specific SoCs in audio and video signal processing require increased computation capabilities. Thus, more and more Intellectual Properties (IPs) are integrated, which makes communication very complex in these systems. Consequently, according to a set of works in the state of the art [1, 2, 3, 4, 5], interconnection architectures based on shared busses show their limits for the communication requirements of future SoCs. In this context, NoCs appear as the best solution to provide communication in the chip. Due to the following characteristics, reliability, scalability of bandwidth and energy efficiency, NoCs are emerging to replace busses. However, many applications, especially video encoders like H.264, have some performance requirements. Thus, a major goal in SoC design is therefore to ensure the performance

and QoS required by the application using the minimum available resources. NoCs seem to be today the most appropriate communication solution for integrating many cores in a system and guaranteeing the QoS required by applications. Indeed, the implementation of high NoC performance has become one of the most important challenges for designers. A NoC is generally composed of three basic components: network interfaces (NIs), routers and links. The router, which is an element of the NoC topology, implements the routing function, the switching technique and the flow control algorithms. The topology of a NoC defines the connectivity or the routing possibilities between nodes, thus having a fundamental impact on the network performance as well as on the switch structure (number of ports and port width). The tradeoff between generality and customization then becomes an important issue when choosing a network topology. Another major factor that has an important impact on the NoC is the performance of the router. The router is characterized by its degree, frequency, power consumption and latency. The degree of a router determines the number of its neighbors. Obviously, designing a router with a higher degree leads to VLSI (Very Large Scale Integration) implementation difficulties that can affect the performance of the SoC in terms of used resources. The motivation of this work has been to address the demand for an optimized communication infrastructure by proposing an optimal topology that provides good performance. In the proposed diagonal mesh topology, an even number of routers are connected by links to the neighboring routers in the clockwise and counter-clockwise directions, plus a central connection. The key characteristics of this topology include a good network diameter, equal to 2, vertex symmetry, deterministic routing, a generic number of routers and a low degree for the peripheral routers, equal to 3. A high router degree reduces the critical path length but increases complexity. This paper is organized as follows: the next section presents some related works. Section three presents the architecture and the routing algorithm of FeRoNoC. In section four, simulation and implementation results for the router on FPGA are presented, and then some comparisons and discussions take

place in section five. Finally, we conclude the work.

2. RELATED WORKS

There are many works which offer new NoC architectures, like STNoC [6] and GeNoC [7]. These NoCs are based on flexible and evolutionary packets. Moreover, they are based on the Spidergon and the Octagon topologies respectively, and they have low silicon implementation costs for the router and the network interface. On the other side, most NoCs use 2D-Mesh topologies [8, 9, 10, 11]. In [8] SoCIN is presented, which is extensible and based on XY routing. Its basic router, called RASoC (Router Architecture for SoC), has a configurable FIFO. ASoC (Adaptive System on Chip) [9] is a scalable architecture with flexible and modular communication between routers. The AEtheral NoC [10] is based on the ATM network and adopts a fixed-size packet technique which is oriented to real-time applications. The disadvantage of this NoC is the fact that reception of the packets is not guaranteed when flits take different paths. HERMES [12] is a 2D-Mesh NoC topology that satisfies the requirement of implementing low-area and low-latency communication for system-on-chip modules. Deterministic routing and Wormhole switching [13] are the dominant approaches in NoC research. Deterministic routing algorithms are usually used because they require a low cost in terms of logic compared to adaptive algorithms. In the Wormhole technique a packet is divided into flits, and the functionality is implemented with a lower buffer requirement [14]. Thus, it is an interesting solution compared to packet-based circuit switching and virtual cut-through. Other switching techniques, like packet switching and virtual cut-through switching, require enough buffers for saving a whole packet at each intermediate router. In [15] the authors have compared different NoC architectures from different points of view, including performance metrics such as latency, area and power consumption. In another work, Bononi and Concer [16] compared ring, mesh and Spidergon topologies. Their paper showed that the Spidergon topology outperforms the mesh and the ring topologies.

3. ROUTER ARCHITECTURE

The proposed diagonal mesh topology generalizes and improves the performance of the well known STOctagon network processor topology by a simple bidirectional ring with a central router. This network is constituted by (N+1) routers, including a central element which is connected with all peripheral routers via links. Each peripheral router provides four input/output ports enabling it to connect the left/right neighbors (routers), the central router and the local IP. The central router contains N ports for the connection of each peripheral router and an additional one permitting the connection with its local IP, as is shown in figure 1:

Fig. 1. The NoC topology.

The communication packet used in our NoC is composed of three basic messages, called flits, which consist of:
• Header: one bit (BOP) indicating the beginning of a packet,
• Body: contains the data to be transmitted, on 32 bits,
• Tail: one bit indicating the end of a packet (EOP).
Both kinds of routers composing the diagonal mesh topology use the packet switching technique and the Wormhole flow control mechanism. In packet switching, packets are transmitted without any need for a connection establishment procedure. It requires the use of a switching mode, which defines how packets move through the switches. The described router has control logic and bidirectional ports. Each port has an input buffer for temporary storage of information, and a local port enables communication between the router and its local IP core. This module contains a set of components that are described below. Figure 2 shows a general block diagram of the router. In the diagonal mesh topology presented in this work, each switch has a different number of ports, depending on whether it is a peripheral or the central router.

3.1. Peripheral Routers

In our NoC, we use a synchronous router with four input/output ports (local, clockwise, counter-clockwise and across). Each port is connected to a bidirectional exchange bus. Each switch has a unique address and the switching technique used is packet switching. The data flows through the network with Wormhole routing. We made this choice to reduce the number of buffers required per node and for the simplicity of the communication mechanism. The diagonal mesh topology uses a credit-based flow control strategy. This latter presents interesting advantages over handshake. In the credit-based protocol, when the receiver is free, the transmitter sends new data at each clock cycle and the receiver indicates its availability by a signal named credit.

Fig. 2. The router architecture.
Fig. 3. The Input module.

The peripheral router is composed of bidirectional ports numbered starting from zero. They connect the router with its local IP and with its neighboring routers. The connection is defined as follows:
1. The first port is connected to the neighbor in the clockwise direction.
2. The second port is connected to the central router.
3. The third port is connected to the neighbor in the counter-clockwise direction.
The internal architecture of the peripheral router is composed of several components, such as the input/output controllers, the routing function and the switch allocator. We use an input component in each input port of our router. This latter is divided into three components named FIFO, input controller and output controller. The main function of the input controller (IC) is to create an interface between the output controller (OC) of the source router and the IC of the current router. The IC is activated when the current router receives the beginning of a packet (BOP). Thus, it allows the reception of packet flits from the output ports of the adjacent routers, their storage in the FIFO buffer and the management of the (credit-based) flow control between adjacent routers. The IC can accept a new packet when the previous one is not entirely switched. This part of the router allows the reception of packets sent by the neighboring OC and writes them in the FIFO. The block diagram of this component is shown in figure 3. This figure describes the signals and the recommended hardware implementation for the router. The physical data bus width is 32 bits. The IC in the figure is composed of the following signals: (1) BOP: control signal indicating the beginning of a received packet; (2) EOP: control signal indicating the end of a packet; (3) REQ: control signal indicating data availability; (4) creditout: control signal indicating that the FIFO is not full; (5) data: the data to be received; (6) datain: the data to be written in the FIFO; (7) write: control signal indicating the writing in the FIFO. Thus, as explained above, the IC aims to establish a connection between entities of the initiator and destination routers.
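For illustration, the three-flit packet format described above can be modelled in software as follows (a minimal Python sketch, not part of the FeRoNoC RTL; the field layout beyond the BOP/EOP bits and the 32-bit body is an assumption made only for this example):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Flit:
    bop: int = 0      # 1 on the header flit (beginning of packet)
    eop: int = 0      # 1 on the tail flit (end of packet)
    payload: int = 0  # 32-bit body data

def make_packet(dest_addr: int, words: List[int]) -> List[Flit]:
    """Header flit carrying the destination address, 32-bit body flits,
    and a tail flit closing the packet."""
    header = Flit(bop=1, payload=dest_addr & 0xFFFFFFFF)
    body = [Flit(payload=w & 0xFFFFFFFF) for w in words]
    tail = Flit(eop=1)
    return [header] + body + [tail]

# Example: a packet addressed to peripheral router 8 with two data words.
packet = make_packet(8, [0xDEADBEEF, 0x12345678])
```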

To complete the description of the input block of the peripheral router, an explanation of the functionality of the output controller (OC) is necessary. The OC is the last block of the peripheral router. Its main roles consist of communicating with the IC, reading data, storing data in the FIFO and sending them after receiving the credit control signal. When the output port indicated by the routing function is allocated, the OC starts sending the packet. This component is composed of control and data signals. The functional description of these signals is similar to that of the IC. The Grant signal indicates that the OC must send the packet to the corresponding output port. The association of an input and an output controller constitutes the input module of the peripheral router. The FeRoNoC input module is described in VHDL and validated by functional simulation. Its behavior can be summarized by the following steps:
1. When BOP=1, the sending of flits begins.
2. EOP=1 when the last flit is received.
3. Req=1: sending.
4. Credit=1: the OC reads the data.
5. Full=1: the IC saves data in the FIFO.

3.2. Routing function and arbitration

A routing function defines the path followed by each message or packet through the NoC. It also describes how data are forwarded from sender to receiver by interpreting the destination address field of the header flit. The choice of a routing algorithm depends on several metrics like power, logic and routing

table, increasing performance and maximizing the traffic utilization of the network. In the FeRoNoC router there are two modules implementing the control logic: routing and arbitration. The routing module is presented in figure 4.

Fig. 4. Routing module.

In our case, routing is deterministic and the communication between ports is not always established. Indeed, the local port can communicate with all ports. The central router has a direct connection with all the routers of the NoC. The routing function extracts the destination address of the packet, calculates the path to follow and generates a req_p signal, where p indicates the number of the destination port, as represented in figure 4. In our design, each input port of the peripheral routers has its specific routing function. The routing algorithm is distributed inside the router as described in the following steps. The routing function of the local port can request the three other output ports: clockwise, counter-clockwise and across. To provide a correct message transfer, the following algorithm describes how to calculate the direction of a packet:
1: number_of_jumps = (@dest - @curr) mod (N + 1)
2: if number_of_jumps = 0 then
3:   local port direction
4: else if number_of_jumps = 1 or 2 then
5:   clockwise direction
6: else if number_of_jumps = 14 or 15 then
7:   counter-clockwise direction
8: else
9:   central router
10: end if
On the other side, the clockwise port can request only two output ports, counter-clockwise and local; thus the algorithm can be described as follows:
1: number_of_jumps = (@dest - @curr) mod (N + 1)
2: if number_of_jumps = 0 then
3:   local port direction
4: else if number_of_jumps = 14 or 15 then
5:   counter-clockwise direction
6: end if
The routing function of the counter-clockwise port can request only two ports (clockwise and local):
1: number_of_jumps = (@dest - @curr) mod (N + 1)
2: if number_of_jumps = 0 then
3:   local port direction
4: else if number_of_jumps = 1 or 2 then
5:   clockwise direction
6: end if
Finally, the routing function of the across port can request only the local port:
1: number_of_jumps = (@dest - @curr) mod (N + 1)
2: if number_of_jumps = 0 then
3:   local port direction
4: end if
The EOP control signal allows the activation of the arbitration. When it is equal to 1, the arbitration is executed. Indeed, the output port is allocated to an input port, and thus the correct transfer is provided. The routing function sends a control signal to the Switch Allocator of the suitable router. The hardware description of the Switch Allocator is presented in Figure 5.
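The local-port routing decision listed above can be cross-checked with a small software sketch (illustrative only, not part of the RTL; it assumes the configuration implied by the jump thresholds 14 and 15, i.e. a modulus N + 1 = 16):

```python
def route_from_local(dest: int, curr: int, modulus: int = 16) -> str:
    """Port selected at the local input of peripheral router `curr` for a
    packet addressed to router `dest` (modulus = N + 1, here assumed 16)."""
    jumps = (dest - curr) % modulus
    if jumps == 0:
        return "local"
    if jumps in (1, 2):
        return "clockwise"
    if jumps in (14, 15):
        return "counter-clockwise"
    return "across"  # forward to the central router

# Example: router 0 sending to router 8 goes through the central router,
# as in the simulation scenario of Section 4.
assert route_from_local(dest=8, curr=0) == "across"
```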

Fig. 5. Switch Allocator.

This module contains four arbitration components that decide the connection between an input port and the output port. Its behavior is described by the following algorithm:
1: if R0 allocates the output port then
2:   priority will be R1 > R2 > R0
3: else if R1 allocates the output port then
4:   priority will be R2 > R0 > R1
5: else if R2 allocates the output port then
6:   priority will be R0 > R1 > R2
7: end if
As is shown in figure 5, the switch allocator has several control signals such as Req, grant and index.

The index signal indicates the number of the local input port, and the grant signal indicates that the request is granted. This information can activate the crossbar element. The crossbar is a physical switch connecting the inputs to the outputs. It can be represented as a module with N inputs and N outputs. The switching in the crossbar can be achieved via multiplexers; the selection bit of each MUX can be generated by the index signal. Generally, the transfer of a packet is performed in several steps. Firstly, the state of the FIFO must be sent; secondly, the input controller starts storing data in the FIFO. Then, the routing function sends a request signal to allocate the output port, and finally the arbitration technique follows. Figure 6 presents a mapped architecture of the routing function and the switch allocator.
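The rotating-priority rule of the switch-allocator listing can be sketched as follows (an illustrative behavioural model only; the initial priority order R0 > R1 > R2 is an assumption, since the listing only defines the order after a grant):

```python
ORDER = ["R0", "R1", "R2"]

class RotatingArbiter:
    def __init__(self):
        self.priority = list(ORDER)  # assumed initial order R0 > R1 > R2

    def arbitrate(self, requests):
        """Grants the highest-priority active requester; after Ri is granted,
        the priority becomes R(i+1) > R(i+2) > Ri, as in the listing above."""
        for candidate in self.priority:
            if candidate in requests:
                i = ORDER.index(candidate)
                self.priority = [ORDER[(i + 1) % 3], ORDER[(i + 2) % 3], candidate]
                return candidate
        return None

arb = RotatingArbiter()
print(arb.arbitrate({"R0", "R2"}))  # R0 wins; priority becomes R1 > R2 > R0
print(arb.arbitrate({"R0", "R2"}))  # R2 wins next; priority becomes R0 > R1 > R2
```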

Fig. 6. Interaction between routing function and Switch allocator.

The interaction between the modules was described in VHDL and validated by functional simulation, as presented in figure 7. The simulation steps are described as follows:
1. The second port sends to the local port.
2. When BOP=1 and REQ=1, the IC module starts the storage of data.
3. At the same clock cycle, the routing function receives the BOP signal and the destination address, then specifies the suitable output port.

3.3. Central router

The main objective of this router is to minimize the traffic and the load imposed on the peripheral routers. Moreover, it allows the routing of data in the network to their destination with a number of required jumps equal to two. The importance of this router appears when the SoC contains a set of IP cores. The hardware architecture of this router is the same as that of the peripheral routers. The only difference appears in the number of input/output ports. Figure 8 shows the block diagram of this router.

Fig. 8. Block diagram of the central router.

Each input port of the central router has its specific routing function. The routing algorithm is distributed inside the router. The routing function (RF) of the local port can request all other output ports of the central router. The RF of the port numbered i can request only output ports from 0 to i - 3 or output ports from i + 3 to N, and also the local port of the central router.

4. SYNTHESIS AND SIMULATION RESULTS

This section presents some simulation and synthesis results. It also explains how to find a compromise between latency and diameter. The FeRoNoC router was validated by functional simulation and by the synthesis results. We started with a VHDL simulation; in figure 9 a packet transmission in the router is illustrated. The local port sends data to the second port after having sent the flits and the request. The simulation shows some of the internal signals defining the interconnection. The target address is peripheral router number 8. This latter is located far from the source, thus the packet crosses the central router. The simulation scenario can be explained as follows:
1. The peripheral router sends the first flit of the packet (the address of the target switch) to the data out signal at its across port and asserts the req and bop signals on this port.
2. The central router detects the req signal asserted on its port number 0 and gets the flit on the data in signal. It takes 2 clock cycles to route this packet. The next flits are routed with a 1-clock-cycle latency.
3. The central router puts the flit on the data out signal and asserts the req and bop signals of its output port number eight. It takes 2 clock cycles to route this packet.
4. Router 8 detects the asserted req signal on its across port. The first flit of the packet is routed to the local port of router 8 and the source-to-target connection is now established.
5. The remaining flits contain the payload of the packet.

Fig. 7. Simulation of the router.

Fig. 9. Simulation of the proposed diagonal Mesh topology.

6. After sending all flits, the connection is closed and the inputs and outputs reserved by this packet can be used by other packets.
However, the main objective of the central router is to improve the latency in the NoC depending on the distance. In our case study, peripheral router 0 sends data to peripheral router 8. The simulation shows that the minimal latency to switch a packet from source to target depends on the diameter of the network. The latency is given by:

Latency = ( ∑_{i=1}^{n} R_i ) + P × clock_cycle    (1)

where n is the number of routers in the communication path, R_i is the time required by the routing algorithm at each switch, P is the reference packet size and clock_cycle is the time required to send a flit. Based on this equation, for our diagonal mesh topology n = 2, as the target and source routers are involved. In this paragraph some synthesis results are presented and a cost analysis of area and power consumption is carried out. The router performance has been evaluated in terms of speed, latency and estimated peak performance. The FeRoNoC router was synthesized on a Xilinx Virtex 2 Pro xc2vp device using Xilinx ISE 9.1. The simulation was performed using the ModelSim 6.5 SE tool. The proposed router has been prototyped on 2 different FPGA technologies: Xilinx Virtex5 xc5vlx50-3ff676 and Xilinx Virtex 2 Pro. Table 1 presents our router synthesis results with Xilinx Virtex5, in which area, operating frequency and power consumption results are shown. Table 2 presents implementation works targeting the same FPGA. The maximum running frequency is about 264 MHz and the power consumption is 33 mW.

Our approach provides good performance in terms of latency, speed and area. The router design is highly modular and adaptable, without the use of explicit handshake signals for communication between the sub-modules of the router. Another metric evaluating our design is the peak performance, which depends on the maximal clock frequency Fmax, the flit size (flit_size) and the time T (in clock cycles) for transmitting a flit:

PM_per_port = (Fmax / T) × flit_size    (2)

The credit-based flow control used by our router requires one clock cycle for transmitting one flit, hence T = 1 and flit_size = 32 bits.

Table 1. Synthesis results
FPGA/Perf.      | Virtex2              | Virtex5
Slice           | 5%                   | 4%
Flip Flop       | 3%                   | 2%
LUT             | 2%                   | 2%
Frequency (MHz) | 218                  | 264
Power (mW)      | 97 (200 MHz)         | 33 (200 MHz)
Peak Perf.      | 6.9 Gbit/s per port  | 8.44 Gbit/s per port
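As a quick numeric check of Equations (1) and (2), the following sketch (illustrative only; the helper names are not from the paper) reproduces the peak performance figures reported in Table 1 from the synthesis frequencies:

```python
def peak_performance_bps(f_max_hz: float, t_cycles: int = 1, flit_size_bits: int = 32) -> float:
    """Equation (2): PM_per_port = (Fmax / T) * flit_size."""
    return (f_max_hz / t_cycles) * flit_size_bits

def latency_cycles(routing_times, packet_size_flits: int) -> int:
    """Equation (1): sum of per-router routing times plus one cycle per flit."""
    return sum(routing_times) + packet_size_flits

# Virtex5 at 264 MHz: (264e6 / 1) * 32 = 8.448 Gbit/s per port (Table 1: 8.44).
print(peak_performance_bps(264e6) / 1e9)
# Virtex2 at 218 MHz: about 6.98 Gbit/s per port (Table 1: 6.9).
print(peak_performance_bps(218e6) / 1e9)
```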

5. COMPARISON AND DISCUSSION

The work on flexible and extensible networks on chip is an emerging topic. There are many proposed topologies in the literature; as cited in the previous section, each topology design offers a different set of tradeoffs in terms of metrics, such as network degree and network extendibility. A 2D mesh [17] topology provides very good theoretical metrics. Nevertheless, due to the increasing complexity of applications, this topology cannot provide good performance. On the other hand, a simple topology like the ring provides a low cost in terms of area but poor performance when the number of cores increases. The diameter and the average distance represent important factors in terms of performance and implementation. The proposed NoC is designed to deliver a good tradeoff between hardware cost and theoretical performance. Due to its higher connectivity, our topology outperforms the ring, the mesh and the ST Spidergon in terms of diameter and average distance. In addition, our diagonal mesh topology can contain a large number of nodes without changing the diameter. Indeed, it can deliver a good latency for multimedia applications like a video encoder (H.264). Thus, it represents a solution for on-chip communication in future SoCs. The major inconvenience of the proposal lies in the required links. However, based on the evolution of semiconductor technology, this problem can be solved. On the other side, many works are presented in the literature to implement NoCs. The authors in [18] describe a 2D mesh Network on Chip implementation based on Virtex4 and Virtex2 FPGAs. They describe an open-source FPGA-based NoC architecture with a high throughput and low latency. In this work, a generic bridge based on packet switching with Wormhole routing was proposed. As is shown in table 2, the data width of this implementation is 36 bits and its maximum frequency is lower than ours. This work provides a low area cost. In [19] the authors present a packet-switched NoC running on a Virtex2. Moreover, they show an implementation of packet-switched and time-multiplexed FPGA overlay networks running at 166 MHz. The aim of this work was to support designers in choosing between time multiplexing and packet switching. They use a 32-bit data width, as is shown in table 2. Our implementation outperforms this work in terms of area, latency and maximum frequency.

Table 2. Comparison with other works
Perf./Works     | [18]    | [19]    | Our
Data width      | 36 bits | 32 bits | 32 bits
Latency         | 3       | 6       | 2
Slice           | 431     | 1464    | 989
Frequency (MHz) | 166     | 166     | 218
Topology        | Mesh    | Mesh    | Diagonal Mesh

6. CONCLUSION

Network on chip is today one of the most suitable technologies to perform communication in complex SoCs. In this work, we present a novel topology named diagonal mesh and the related router called FeRoNoC. It offers low-latency (2 clock cycles) and high-speed (264 MHz) communication for on-chip modules. This paper presents all the details of our NoC, such as the topology, the routing algorithm, the dynamic arbiter and the router module. This architecture offers a variety of SoC communication services owing to its flexibility and adaptability. The physical parameters of the designed router consist of the width and the depth of the FIFO, the number of input/output ports, the valence and the maximal diameter of the NoC. Compared with other NoCs, the advantage of this architecture resides in its capacity to handle a suitable cost/performance compromise in the field of NoCs. This is due to its wide, constant and low diameter, the latency of the router, its frequency and its power consumption. The simulation and implementation of this architecture show its performance and effectiveness. Our next objective is to prove the usefulness of the diagonal mesh topology: a mapped H.264 encoder will be investigated in future work. The dynamic reconfiguration (in terms of number of nodes, which depends on the number of IPs of the application) will also be studied. Adopting this technology, it is possible to obtain application-specific NoCs. Moreover, we are going to design a low-power SoC, based on the diagonal mesh topology and an H.264 encoder, which includes dynamic reconfiguration.

7. REFERENCES

[1] K. Shashi, J. Axel, P. S. Juha, F. Marteli, M. Mikael, Johny, T. Kari, and H. Ahmed, "A network on chip architecture and design methodology," in IEEE Computer Society Annual Symposium. IEEE, 2002, pp. 105-112.
[2] M. Mikael, N. Erland, T. Rikard, K. Shashi, and J. Axel, "The Nostrum backbone communication protocol stack for networks on chip," in VLSI Design. IEEE, 2004, pp. 105-112.
[3] B. Luca and D. M. Giovanni, "Powering networks on chips: energy-efficient and reliable interconnect design for SoCs," in ISSS'01. ACM, 2002, pp. 105-112.
[4] B. Luca and D. M. Giovanni, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, pp. 70-78, January 2002.
[5] G. Pierre and G. Alain, "A generic architecture for on-chip packet-switched interconnections," in Design, Automation and Test in Europe. IEEE, 2000, pp. 250-256.
[6] C. Marcello, D. G. Miltos, L. Riccardo, M. Giuseppe, and Pieramlisi Lorenzo, Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC, 2008.
[7] J. Schmaltz and D. Borrione, "A generic network on chip model," Tech. Rep., TIMA Laboratory, Grenoble, France, 2009.

[8] J Liang, A Laffely, S Srinivasan, and R Tessier, “An architecture and compiler for scalable on-chip communication,” IEEE transaction on very large scale integration systems, vol. 12, pp. 711–726, July 2004. [9] C.A Zeferino and A Susin, “A parametric and scalable network-on-chip,” in 16th Symposium on integrated circuits and system design. IEEE, 2003. [10] K Goossens, J Dielissen, and A Radulescu, “Aethereal network on chip: Concepts, architectures, and implementations,” IEEE Design and Test of Computer, vol. 22, pp. 414–421, September 2005. [11] F Uriel and R Prabhakar, “Exact analysis of hot-potato routing,” in 33rd Annual Symposium on Foundations Of Computer Science. IEEE, 192, pp. 553–562. [12] K Goossens, J Dielissen, and A Radulescu, “Hermes: an infrastructure for low area overhead packet-switching networks on chip,” VLSI journal, Elsevier, vol. 38, pp. 69–93, March 2004. [13] J Dally and C Seitz, “Deadlock-free message routing in multiprocesor interconnection networks,” IEEE transaction. [14] J Dally and B Towles, “Route packets, not wires: Onchip interconnection networks,” in Design Automation Conference (DAC). ACM, 2001, pp. 683–689. [15] P.P Partha, G Cristian, J Michael, I Andre, and S Resve, “Performance evaluation and design trade-offs for network-on-chip interconnect architectures,” IEEE transaction on Computers, vol. 54, pp. 1025–1040, August 2005. [16] L Bononi and N Concer, “Simulation and analysis of network on chip architectures: Ring, spidergon and 2d mesh,” in DATE. IEEE, 2006, pp. 154–159. [17] T.A Bartic, J.-Y Mignolet, V Nollet, T Marescaux, D Verkest, S Vernalde, and R Lauwereins, “Topology adaptive network-on-chip design and implementation,” IEE Comput. Digit, vol. 152, pp. 467–452, July 2005. [18] A Ehliar and D Liu, “An fpga based open source network-on-chip architecture,” in Field Programmable Logic and Applications. IEEE, 2007, pp. 800–803. [19] N Kapre, N Mehta, M deLorimier, R Rubin, H Barnor, M.J Wilson, M Wrighton, and A DeHon, “Packet switched vs. time multiplexed fpga overlay networks,” in IEEE Symposium on Field-programmable Custom Computing Machines. IEEE, 2006, pp. 800–803.

A New Algorithm for Realization of FIR Filters Using Multiple Constant Multiplications Mohsen Amiri Farahani, Student Member, IEEE, Eduardo Castillo-Guerra, Bruce G. Colpitts, Senior Members, IEEE Department of Electrical and Computer engineering, University of New Brunswick, Canada

ABSTRACT This paper presents a new common subexpression elimination (CSE) algorithm to realize FIR filters based on multiple constant multiplications (MCMs). This algorithm shares the maximum number of partial terms amongst minimal signed digit (MSD)-represented coefficients. It modifies the iterated matching (ITM) algorithm to share more partial terms in MCMs, which yields a significant logic and, consequently, chip area savings. The employment of the proposed algorithm results in efficient realizations of FIR filters with a fewer number of adders compared to the conventional CSE algorithms. Experimental results demonstrate a reduction up to 22% in the complexity of FIR filters over some conventional CSE algorithms. The proposed algorithm also addresses challenges encountered in resource-constrained applications, which require banks of high-order filters, such as in real-time distributed optical fiber sensor.


Index Terms— common subexpression elimination (CSE), finite impulse response (FIR) filters, minimal signed digit (MSD), multiple constant multiplications (MCMs) 1.

INTRODUCTION

An FIR filter is a linear time invariant system governed by a linear convolution between the input samples and filter coefficients, as shown in Equation (1). In this equation, ck stands for the filter coefficients, and x[n] and y[n] denote the input and output signals of the filter, respectively.

y[n] = ∑_{k=0}^{N-1} c_k x[n - k]    (1)

Fig. 1 shows the transposed structure of the FIR filter. The complexity of the implementation of FIR filters is mainly dominated by the product operation between filter coefficients and input samples as reflected in this figure. In recent years, there has been an increasing demand for efficient implementations of FIR filters driven by high accuracy and low power consumption requirements as well as the flexibility of their implementations on hardware platforms such as field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). In applications such as distributed optical fiber sensors [1], the use of higher order filters with long bit-width coefficients is necessary as the specification of filters should be exactly met. An efficient implementation of such filters is a major challenge since there are always tradeoffs concerning accuracy, power consumption, and speed of hardware implementations.

Fig. 1. Transposed structure of the FIR filter.

As shown in the transposed structure of the FIR filter, the same input sample is multiplied by a set of coefficients at each time. This kind of operation is known as multiple constant multiplications (MCMs) and provides the opportunity of realizing FIR filters with a less amount of computational resources. The realization of MCMs is closely related to the substitution of constant multiplications by shift and addition operations. This substitution enables a significant hardware saving, as adders and shift-registers are more cost-efficient than multipliers. It also enables the utilization of algorithms detecting redundancies among constants and sharing partial terms. In general, algorithms used to realize MCMs are categorized into two approaches: graph-dependence (GD) and common subexpression elimination (CSE).

The GD approach relies on graph theory to synthesize the coefficients of an FIR filter with a number of primitive arithmetic operations [2]. It models inner products of the input samples by nodes and the shift operations by edges of the graph. Several GD algorithms such as Bull Horrock’s modified (BHM) [3] and reduced adder graph (RAG) [4] have been reported in the literature. Recently, significantly more efficient GD algorithms were introduced in [5], [6]. The authors in [5] used a new graph structure to code the set of filter coefficients based on their conditional probability of occurrence. Voronenko and Püschel in [6] presented an optimal GD algorithm, which extensively explores possible partial products (intermediate vertices) among constants. The CSE approach is a subset of the general adder graph theory. It encounters the MCM problem with respect to the numerical representation of constants. Hartley algorithm is one of the first reported CSE algorithms [7]. It searches through canonical signed digit (CSD)-represented coefficients to find several common subexpressions that can be conveniently reduced to a single instance. In realization of MCMs, some efficient numerical representations, such as CSD or minimal signed digit (MSD) are more demanding as they represent numbers with fewer nonzero bits than the normal binary representation [8]. The iterative matching (ITM) algorithm in [9] is another basic CSE algorithm. It finds the maximum coincidences between pairs of constants based on the most common pattern and then eliminates a number of parallel adders between constants. In [10], some modifications were applied to the ITM algorithm to maximize the sharing of partial terms amongst constants having binary or CSD representations. Similar to GD algorithms, realization of FIR filters using CSE reduces the area and power consumption and increases the throughput of the filters. In [11], the logic depth of filters was reduced by designing a CSE algorithm that uses each common subexpression only one time. In this paper, we present a new CSE algorithm by modifying the ITM algorithm presented in [9] to share a maximum number of adders amongst the constants and realize MCMs with a shorter logic depth. The proposed algorithm searches inside MSD-represented constants to find the best match, and then uses this match to do further optimization steps. The remainder of the paper is organized into five sections. In Section 2, we present the limitations of the ITM algorithm. Section 3 introduces the proposed algorithm in detail. Section 4 evaluates the performance of the proposed algorithm with respect to some conventional algorithms studied in this research. Finally, the conclusion along with the scope of future work and acknowledgments are

presented in the last sections. 2.

LIMITATIONS OF ITM ALGORITHM

In this section, we analyze the ITM algorithm [9] by explaining its main limitation, the inability of sharing a maximum number of partial terms. For simplicity and clarity, a basic example (Example 1) consisting of a set of three constants, H = {c0, c1, c2}, is used. Table I shows this set of constants and their CSD representation. Table I The set of constants of Example 1.

Constant | Decimal | CSD representation
c0       |   583   | 01001001001
c1       |   143   | 00010010001
c2       | -1145   | 10010001001

The ITM algorithm selects partial terms based on the most common patterns that are determined in advance based on the bitwise matches amongst the constants. The ITM algorithm can be briefly analyzed in four steps: 1- The calculation of the current (immediate) adder saving. The immediate saving is the number of adders that can be saved based on the nonzero bitwise matches amongst pairs of constants. For example, there are two nonzero bitwise matches between c0 and c2, which results in an immediate saving of one adder. Table II shows the immediate savings for the set of Example 1. Table II Immediate savings for the set of Example 1.

   | c0 | c1 | c2
c0 | -  | 0  | 1
c1 | 0  | -  | 0
c2 | 1  | 0  | -
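The immediate-saving count of step 1 can be illustrated with a small sketch (Python, illustrative only; the signed CSD digit vectors below are an assumption of this example, chosen so that they decode to 583, 143 and -1145 and reproduce the savings of Table II):

```python
# Signed CSD digits, most significant first (each digit is 1, 0 or -1).
c0 = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, -1]    # 512 + 64 + 8 - 1 = 583
c1 = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, -1]    # 128 + 16 - 1 = 143
c2 = [-1, 0, 0, -1, 0, 0, 0, 1, 0, 0, -1]  # -1024 - 128 + 8 - 1 = -1145

def to_int(digits):
    return sum(d * 2 ** i for i, d in enumerate(reversed(digits)))

assert (to_int(c0), to_int(c1), to_int(c2)) == (583, 143, -1145)

def immediate_saving(a, b):
    """Adders saved by sharing identical nonzero digits of a and b:
    number of positions holding the same nonzero digit, minus one."""
    matches = sum(1 for x, y in zip(a, b) if x != 0 and x == y)
    return max(matches - 1, 0)

print(immediate_saving(c0, c2))  # 1, as in Table II
print(immediate_saving(c0, c1))  # 0
print(immediate_saving(c1, c2))  # 0
```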

2- The calculation of the later (future) saving. The future saving is the number of adders that can be saved at the next iteration of the algorithm. For the set of Example 1, there is no future saving as there is no match amongst constants in the next iteration of the algorithm. 3- The selection of the best match using the immediate and future savings. Based on Table II, the best match is between c0 and c2 and the pattern corresponding to the best match (the matched digit) is “100 1 ”. 4- Updating the set of constants. The algorithm updates the set by adding a constant corresponding

to the matched digit and then removing the common pattern from the selected pair of constants. Table III shows the updated set of Example 1. Table III The final set of constants of Example 1.

c0      | 01001000000
c1      | 00010010001
c2      | 10010000000
ccommon | 00000001001

In general, the ITM algorithm is iterated as long as there is at least one pair of constants with more than one match. This stopping criterion allows the algorithm to save at least one adder per iteration. It is observed from Table III that there are no bitwise matches amongst the constants. This means that the ITM algorithm stops at this step and cannot save any more adders for this set. However, a search into the constants of Table III shows that there are three repetitions of the pattern "1001" or its complement (negation). The elimination of these repetitions and the sharing of the corresponding partial terms can save two more adders. In the next section of the paper, we address this limitation of the ITM algorithm by applying specific modifications aimed at sharing more partial terms in MCMs.

3. PROPOSED ALGORITHM

The proposed algorithm introduces a new definition for matches (common partial terms) amongst constants. Unlike the ITM algorithm, it searches into the shifted versions of constants to find the repetitions of common partial terms. These modifications can result in the maximum sharing of partial terms of the MCMs. Similar to the ITM algorithm, the proposed algorithm considers the effects of the selection of any partial terms on the current and future ability to save adders. The following assumptions are made in the proposed algorithm: •



The algorithm considers only the MSD representation of constants. The other representations of the radix-2 numerical system such as the normal binary and CSD representations can also be used in the algorithm but they are not considered in this paper. The complexity of an adder and a subtracter is similar; therefore, we will henceforth refer to both as adders and the number of required adders as the adder cost.

The proposed algorithm is composed of the following steps:
- Forming the set of constants.
- Finding the immediate adder saving.
- Finding the future adder saving.
- Forming the decision matrixes and selecting the best match.
- Updating the set of constants and searching for the patterns equal to the best match.

For simplicity and clarity, another basic example (Example 2) is employed in this section to introduce the proposed algorithm. Table IV shows the set of constants for Example 2. Table IV The set of constants for Example 2

Constant | Decimal
n1       |  70
n2       | -27
n3       | -93

3.1. Forming the set of constants

The first step of the algorithm involves three levels: 1Eliminating all duplications in the set (it is quite common to find identical constants in digital filters or linear transformations). 2- Removing constants with at most one nonzero bit from the set. 3- Selecting the best MSD representation of constants, as it is possible to have more than one MSD representation for constants. For example, the number 3 has two MSD representations 10 1 and 011. To find the best MSD representation for the set, the number of matches among all constants is calculated for each MSD representation, and then the representation with the highest number of matches is selected as the best. Table V shows the constants of Example 2 resulting the best MSD representation. Table V The set of constants of Example 2

n1 | 01001010
n2 | 00100101
n3 | 10100101

3.2. Finding the immediate adder saving

The second step of the algorithm includes the calculation of immediate saving which is the saving in adders when a particular pair of constants is chosen. One of the main differences between the proposed and ITM algorithms is that shifted versions of constants are also considered in the calculation of immediate and future savings. Taking advantage of the fact that complexity of shift operations is negligible compared to additions, the

algorithm also calculates matches for the shifted versions of constants. For instance, in the set of Example 2, the constant n1 can be shifted one digit to the left or right and then be compared with other constants to find common patterns between them. In addition to the consideration of shifted versions of constants, in this work (unlike the ITM algorithm), the definition of matches is extended to: 1- The bitwise match between two constants (what has been done in the ITM algorithm). 2- The bitwise match between one constant and the negation of the other one (called negated match). For example, there is a negated match between n1 = “ 01001010 ” and n2 = “ 00100101 ” with the common pattern of “10000 1 ”. This extension in the definition of matches in the proposed algorithm enables the sharing of more common partial terms amongst the constants. The immediate saving for the set of Example 2 is calculated and shown in Fig. 2. In this figure, the matrixes T and T show the immediate savings based on the bitwise and negated matches, respectively. The size of both symmetric matrixes is N by N, where N is the number of constants in the set. Every particular element of these matrixes points to the number of adders that can be saved by selecting that pair. For example, the element at row 1, column 3 of the matrix T indicates the number of adders that can be saved by selecting n1 and n3 as the best match. For this pair, the common pattern is “10010 1 ” that its selection can save two adders. ⎡0 0 2⎤ T = ⎢0 0 1⎥ ⎢ ⎥ ⎣⎢ 2 1 0 ⎦⎥

⎡0 1 1 ⎤ T = ⎢1 0 1 ⎥ ⎢ ⎥ ⎢⎣ 1 1 0⎥⎦

Fig. 2. Immediate saving for the set of Example 2.

The immediate saving gives useful information about the possible savings in adders, but in order to have the maximum sharing of partial terms, the effects of any selection on the future saving in adders should be analyzed. 3.3. Finding the future saving

The future saving focuses on the potential of saving in adders after affecting the selection of any particular pair as the best match. For a particular pair, it is calculated based on the sum of three top best savings after applying the effects of that selection to the set. It was found in experiments that the sum of three top best savings obtains a good estimation for the future saving. As an explanation of how the future saving is calculated, Table VI shows the set of constants after affecting the

selection of the pair of n1 and n3. There is only one match that is between n2 and nnew and can save just one adder. This makes the future saving of this pair equal to one. Table VI The set of constants after selecting the pair of n1 and n3

n1   | 00000000
n2   | 00100101
n3   | 10000000
nnew | 00100101

The future saving for the set of Example 2 is calculated and shown in Fig. 3. Every particular element of matrix P points to the possible future saving if the corresponding pair is selected as the best match based on the bitwise matches. In matrix P , every particular element points to the possible future saving if the corresponding pair is selected as the best match based on the negated matches. ⎡0 0 1 ⎤ P = ⎢0 0 2 ⎥ ⎢ ⎥ ⎣⎢1 2 0⎦⎥

⎡ 0 2 2⎤ P = ⎢ 2 0 2⎥ ⎢ ⎥ ⎣⎢ 2 2 0⎦⎥

Fig. 3. Future adder saving for Example 2.

The information behind these two recent matrixes provides a good feedback about the effects of any selections on future savings in adders. This information along with the immediate saving are considered to the selection of the best match in the next step of the algorithm. 3.4. Forming the decision matrixes and selecting the best match

In this step of the algorithm, the decision matrixes are made using the immediate and future savings in adders. These matrixes allow the selection of the best match while both current and future savings are considered. The decision matrixes based on bitwise and negated matches, D and D̄, are formulated in Equation (2):

D = T * P    and    D̄ = T̄ * P̄    (2)

In this equation, the sign “*” denotes the element by element multiplication between the matrixes. Fig. 4 shows the resulting decision matrixes for Example 2. The location of the maximum in the matrixes points to the pair that will be selected as the best match; however, there is more than one candidate for Example 2.

D = [ 0 0 2 ; 0 0 2 ; 2 2 0 ]        D̄ = [ 0 2 2 ; 2 0 2 ; 2 2 0 ]

Fig. 4. Decision matrixes for Example 2.

In this case, the algorithm calculates the sum of the elements in the row and column of the decision matrixes corresponding to the location of the candidates. It then selects the location that results in the smallest value as the best match. Fig. 5 shows the summation procedure for Example 2, where the smallest value is located at row 1, column 2. This means that the best match is the pair of n1 and n2.
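A compact sketch of this selection step is given below (illustrative only; plain Python lists stand in for the matrices of Example 2, and the tie-break by row/column sums is left as described in the text):

```python
T  = [[0, 0, 2], [0, 0, 1], [2, 1, 0]]   # immediate savings, bitwise matches
Tn = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # immediate savings, negated matches
P  = [[0, 0, 1], [0, 0, 2], [1, 2, 0]]   # future savings, bitwise matches
Pn = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]   # future savings, negated matches

def elementwise(a, b):
    """Equation (2): element-by-element product of two matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

D, Dn = elementwise(T, P), elementwise(Tn, Pn)
peak = max(max(row) for row in D + Dn)
candidates = [(i, j) for m in (D, Dn)
              for i, row in enumerate(m) for j, v in enumerate(row)
              if i != j and v == peak]
print(candidates)  # several pairs tie at 2; the row/column-sum rule
                   # described above then selects the pair (n1, n2)
```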

D = [ 0 0 2 ; 0 0 2 ; 2 2 0 ]    D̄ = [ 0 2 2 ; 2 0 2 ; 2 2 0 ]    S = [ 0 , 0+6 , 4+6 ; 0+6 , 0 , 4+6 ; 4+6 , 4+6 , 0 ]

Fig. 5. Summation procedure for Example 2.

3.5. Updating the set and searching for the patterns equal to the best match

In the last step, the algorithm updates the set by adding a constant corresponding to the matched digit and then removing the common pattern from the selected constants. The set of Example 2 is updated by adding a new constant "00100001" corresponding to the matched digit and replacing n1 and n2 with "00001000" and "00000100", respectively. After updating the set, a search is conducted inside the set to find patterns equal to the matched digit or its negation. In Example 2, there are two repetitions of the matched digit ("00100001") or its negation in n3 that can be eliminated. Table VII shows the set of constants after the first iteration.

Table VII. The set of constants for Example 2 after the first iteration.
n1      | 00001000
n2      | 00000100
n3      | 00000000
ncommon | 00100001

In general, the algorithm is iterated while there is at least one pair of constants with more than one digit match. This iteration condition allows the algorithm to find the common patterns in the set while saving at least one adder. Fig. 6 shows the final architecture of Example 2. The total number of adders required for the realization of this set is four and the logic depth of the realization is two.

Fig. 6. Final architecture of Example 2.

4. EXPERIMENTAL RESULTS

In this section, numerical results of the realization of two sets of benchmarked FIR filters are presented to evaluate the proposed algorithm. Comparisons are made between the proposed and conventional algorithms to provide a more detailed performance analysis. 4.1. The first set of FIR filters

The first set of FIR filters is composed of five commonly referenced FIR filters, FIR1 to FIR5. Those filters are used in papers [10], [12] to evaluate the MITM and NR-SCSE algorithms. The adder cost and logic depth of the implementations of these FIR filters are the criteria of comparison between the proposed algorithm and the Hartley [7], Bull Horrocks modified (BHM) [3], NR-SCSE [12], ITM [9] and modified ITM (MITM) [10] algorithms. Table VIII shows the results of implementing the benchmarked filters using these algorithms. In this table, the symbol ∑ stands for the adder cost; LD shows the logic depth; L is the bit-width of the coefficients; and N is the number of taps in the filters. The optimization ratio, γ, is defined as the number of adders per tap (γ = ∑/N). The improvement ratio, Λ, is the ratio between the adder cost in the MSD representation of the filters and the adder cost in a particular algorithm. A higher Λ indicates a better adder saving achieved by that particular algorithm.
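For instance, the metrics for FIR1 realized with the proposed algorithm can be checked as follows (a trivial illustration using the values reported in Table VIII):

```python
def ratios(adder_cost, taps, msd_adder_cost):
    gamma = adder_cost / taps            # optimization ratio: adders per tap
    lam = msd_adder_cost / adder_cost    # improvement ratio vs. the MSD cost
    return gamma, lam

# FIR1: N = 4 taps, MSD cost 15, proposed-algorithm cost 8 (Table VIII).
print(ratios(8, 4, 15))  # (2.0, 1.875) -> reported as gamma = 2, Lambda = 1.87
```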

TABLE VIII Results of the realization of the benchmarked filters.

Filter            | Algorithm | ∑   | LD | γ    | Λ
FIR1 (N=4, L=10)  | MSD       | 15  | 4  | 3.75 | 1
                  | Hartley   | 10  | 3  | 2.5  | 1.5
                  | BHM       | 9   | 7  | 2.25 | 1.67
                  | NR-SCSE   | 9   | 2  | 2.25 | 1.67
                  | ITM       | 12  | 3  | 3    | 1.25
                  | MITM      | 9   | 3  | 2.25 | 1.67
                  | Proposed  | 8   | 2  | 2    | 1.87
FIR2 (N=4, L=12)  | MSD       | 18  | 4  | 4.5  | 1
                  | Hartley   | 13  | 3  | 3.25 | 1.38
                  | BHM       | 11  | 7  | 2.75 | 1.64
                  | NR-SCSE   | 13  | 3  | 3.25 | 1.38
                  | ITM       | 12  | 5  | 3    | 1.5
                  | MITM      | 12  | 5  | 3    | 1.5
                  | Proposed  | 12  | 4  | 3    | 1.5
FIR3 (N=25, L=9)  | MSD       | 23  | 2  | 0.92 | 1
                  | Hartley   | 21  | 3  | 0.84 | 1.19
                  | BHM       | 19  | 6  | 0.76 | 1.31
                  | NR-SCSE   | 18  | 2  | 0.72 | 1.39
                  | ITM       | 22  | 2  | 0.88 | 1.05
                  | MITM      | 18  | 2  | 0.72 | 1.28
                  | Proposed  | 17  | 2  | 0.68 | 1.35
FIR4 (N=59, L=14) | MSD       | 87  | 4  | 1.47 | 1
                  | Hartley   | 70  | 4  | 1.19 | 1.24
                  | BHM       | 59  | 3  | 1    | 1.48
                  | NR-SCSE   | 60  | 2  | 1.01 | 1.45
                  | ITM       | 73  | 3  | 1.22 | 1.21
                  | MITM      | 57  | 3  | 0.97 | 1.53
                  | Proposed  | 57  | 3  | 0.97 | 1.53
FIR5 (N=60, L=14) | MSD       | 114 | 6  | 1.9  | 1
                  | Hartley   | 85  | 4  | 1.42 | 1.35
                  | BHM       | 61  | 8  | 1.02 | 1.86
                  | NR-SCSE   | 58  | 3  | 0.97 | 1.96
                  | ITM       | 66  | 4  | 1.10 | 1.72
                  | MITM      | 57  | 3  | 0.95 | 2.00
                  | Proposed  | 56  | 3  | 0.93 | 2.04

The average adder cost for the realization of benchmarked filters is shown in Fig. 7. The results of the first experiment indicate that the adder cost of the proposed algorithm is smaller than the other algorithms, where it also provides a shorter logic depth. Compared to the ITM algorithm, the proposed algorithm contributes up to a 22 percent reduction in the adder cost required for the realization of the benchmarked filters.

Fig. 7. Average adder costs for the benchmarked filters.

4.2. The second set of FIR filters

The second set of FIR filters is a bank of high-order FIR filters used in distributed optical sensors to filter out data in a comb-teeth approach that scans the spectrum in parallel [1]. The bank of filters is composed of 19 bandpass FIR filters with the specifications shown in Table IX, where Ap and As stand for the amplitude and attenuation (in dB) in the pass-band and stop-band, respectively, and fs1/fs2, fp1/fp2, and fs denote the stop-band, pass-band and sampling frequencies (in MHz) of the filters, respectively. The filters are designed using the filter design and analysis (fda) tool in MATLAB. They are spanned over a 200 MHz frequency band with the center frequencies separated by 10 MHz.

Table IX. Specifications of the filters of the second set.

FIR | fs1   | fp1   | fp2   | fs2   | Ap | As | order | fs
1   | 2.5   | 7.5   | 12.5  | 17.5  | 1  | 40 | 114   | 400
2   | 12.5  | 17.5  | 22.5  | 27.5  | 1  | 40 | 114   | 400
... | ...   | ...   | ...   | ...   | .. | .. | ...   | ...
19  | 182.5 | 187.5 | 192.5 | 197.5 | 1  | 40 | 114   | 400
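The paper designs these filters with MATLAB's fda tool; an equivalent design for the first channel could be sketched as follows (an assumption-laden illustration using scipy rather than the authors' flow; the resulting response only approximates the Table IX specification):

```python
import numpy as np
from scipy.signal import remez, freqz

fs = 400e6    # sampling frequency (Table IX)
order = 114   # filter order, i.e. 115 taps
# Channel 1 band edges in Hz: stop up to 2.5 MHz, pass 7.5-12.5 MHz, stop from 17.5 MHz.
bands = [0, 2.5e6, 7.5e6, 12.5e6, 17.5e6, fs / 2]
taps = remez(order + 1, bands, desired=[0, 1, 0], fs=fs)

w, h = freqz(taps, worN=2048, fs=fs)
print(20 * np.log10(np.max(np.abs(h))))  # roughly 0 dB in the pass-band
```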

The bank of filters is realized using the proposed algorithm for different bit-widths. The results of the realization are shown in Table X, where all notations are the same as in Table VIII.

Table X. Results of the realization of the bank of filters.
L  | MSD     | Hartley | BHM    | NR-SCSE | ITM    | MITM   | Proposed
   | ∑   LD  | ∑   LD  | ∑  LD  | ∑   LD  | ∑  LD  | ∑  LD  | ∑   LD
8  | 41  2   | 24  2   | 22  4  | 24  2   | 25  2  | 23  2  | 23  2
10 | 82  3   | 44  3   | 39  5  | 39  3   | 43  3  | 38  3  | 37  3
12 | 129 3   | 65  4   | 57  7  | 58  3   | 66  4  | 58  4  | 55  4
14 | 175 3   | 80  4   | 68  7  | 70  4   | 83  4  | 69  4  | 64  4

The results of Table X indicate that the average adder cost of the proposed algorithm is smaller than that of the other algorithms. The proposed algorithm contributes up to 63% and 22% reductions in the adder cost of the benchmarked filter bank over the MSD representation and the ITM algorithm, respectively.

5. CONCLUSIONS

A novel CSE algorithm for the realization of FIR filters based on the MSD representation of filter coefficients has been introduced in this paper. The algorithm realizes MCMs by considering shifted versions of constants in the calculation of immediate and future savings. It also anticipates the effects of the selection of any partial terms in the future adder saving. The proposed algorithm realizes FIR filters with smaller adder cost than the compared algorithms. It also offers a better tradeoff between the adder cost and logic depth. The saving of adders in the proposed algorithm enables the implementation of filters with less power consumption. The experimental results indicate that the proposed algorithm contributes up to a 22 percent reduction in the complexity of FIR filters over the ITM algorithm [9]. As future work, we are going to apply the proposed algorithm to other radix-2 representations and also compare it with other prominent algorithms such as those presented in [6], [11]. 6.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their valuable comments.

REFERENCES

[1] P. Chaube et al., "Distributed fiber-optic sensor for dynamic strain measurement," IEEE Sensors Journal, vol. 8, no. 7, July 2008. [2] A. Dempster and M. D. Macleod, "Constant integer multiplication using minimum adders," Proc. Inst. Elec. Eng. Circuits and Systems, vol. 141, no. 5, pp. 407-413, 1994. [3] D. R. Bull and D. H. Horrocks, "Primitive operator digital filter," Proc. Inst. Elec. Eng. Circuits, Devices and Systems, vol. 138, pt. G, pp. 401-412, 1991. [4] A. Dempster and M. D. Macleod, "Use of minimum-adder multiplier blocks in FIR digital filters," IEEE Trans. Circuits Syst. II, vol. 42, pp. 569-577, 1995. [5] C. H. Chang, J. Chen, and A. P. Vinod, "Information theoretic approach to complexity reduction of FIR filter design," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 8, pp. 2310-2321, Sep. 2008. [6] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, May 2007. [7] R. Hartley, "Optimization of canonical signed digit multipliers for filter design," IEEE Int. Symp. Circuits Syst., Singapore, pp. 1992-1995, 1991. [8] M. D. Macleod and A. G. Dempster, "Multiplierless FIR filter design algorithms," IEEE Signal Process. Lett., vol. 12, no. 3, pp. 186-189, 2005. [9] M. Potkonjak et al., "Multiple constant multiplication: Efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Comput. Aid. Des., vol. 15, no. 2, pp. 151-165, 1996. [10] M. A. Farahani, E. C. Guerra, and B. G. Colpitts, "Efficient implementation of FIR filters based on a novel common subexpression elimination algorithm," CCECE'10, 2010. [11] L. Aksoy et al., "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Trans. Comput. Aid. Des. Integr. Circuits Syst., vol. 27, no. 6, pp. 1013-1026, 2008. [12] M. M. Peiro, E. I. Boemo, and L. Wanhammar, "Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 3, pp. 196-203, 2002.

ANALYZING SOFTWARE INTER-TASK COMMUNICATION CHANNELS ON A CLUSTERED SHARED MEMORY MULTI PROCESSOR SYSTEM-ON-CHIP

Daniela Genius, Nicolas Pouillon
Laboratoire LIP6, Université Pierre et Marie Curie

ABSTRACT

The task graph of telecommunication applications often exhibits massive coarse grained parallelism, which can be exploited by an on-chip multiprocessor. In many cases it can be organized into several subsequent stages, each containing dozens or even hundreds of identical tasks. We implement communications between tasks via software channels mapped to on-chip memory, allowing multiple readers and writers to access them in arbitrary order. Our architecture is based on the shared memory paradigm. The interconnection network is hierarchical, so that communication latencies vary with the distance between the cluster where the task is located and the cluster on which the channel is placed. Moreover, packet sizes and arrival rates are subject to strong variations. An analytical approach to dimensioning the channels is thus near impossible. Within a purely simulation based approach, we gain insight into the performance of such software channels.

Index Terms— multicore processing, performance analysis, task farm parallelism

1. INTRODUCTION

We study streaming applications written in the form of a set of coarse grain parallel threads communicating with each other. Two possible approaches exist to extract such coarse grain parallelism from a sequential application: a coarse-grained segmentation of the sequential application into functional tasks that execute sequentially (pipeline parallelism), or a replication of the entire sequential application into many clones, where all the tasks do the same job, but each on different data (task farm parallelism). We define as pipelined task farm applications a class of coarse grain parallel applications organized in subsequent pipeline stages. Each task of a stage can write to every task of the subsequent stage, and each task of the subsequent stage can read from every task of the preceding stage. Fig. 1 shows a task graph with two stages apart from I/O tasks. We focus on telecommunication applications featuring hundreds or even thousands of simultaneous packet flows. For performance reasons, they are often implemented on a Multi Processor System-on-Chip (MPSoC).

Fig. 1. Pipelined task farm graph

Two design goals emerge: a large number of read/write accesses to the same channel must be accommodated, and channels should be accessible by hardware and software tasks alike. Multi Writer Multi Reader (MWMR) channels [1] generalize KPN in such a way that multiple writers and multiple readers access the same channel in non-deterministic order. The so-called packet reordering caused by MWMR channels is not crucial for telecommunication applications; as shown in [2], the TCP protocol can re-establish the packet order. Accessibility by hardware and software tasks alike is made possible by implementing the channels as software buffers located in on-chip memory. The interface of each task remains unchanged, whatever the number of instances of the tasks with which it communicates. This leads to a bipartite graph in which one type of node represents the tasks and a second type of node explicitly represents the communication channels (Fig. 2). Such a graph is called a Task and Communication Graph (TCG), and the number of writers and readers becomes a parameter at the creation of the channel. The performance results of massively parallel telecommunication applications did not yet live up to expectations, in contrast to successful implementations of other streaming applications like [3] on our platform. Two main reasons emerged: the first is related to the scalability of the architecture, the second is inherent in the class of applications.

Due to contention on the on-chip interconnection network, the large number of processors required to deal with this massive parallelism cannot be accommodated on a platform based on a flat interconnect. A hierarchical interconnect thus has to be introduced. In such NUMA (Non Uniform Memory Access) architectures, memory access latencies differ significantly depending on whether a processor accesses a memory bank local to its cluster or situated on another cluster. As our class of applications relies heavily on communication via software channels stored in on-chip memory, the fill state of the channels is extremely sensitive to this disparity of memory access latencies. Telecommunication applications feature three additional types of irregularity. Firstly, packets arrive at irregular intervals. Secondly, due to their varying sizes, packets have to be cut into chunks of equal size. For efficiency reasons, these so-called slots are stored in memory during the entire processing and only a small descriptor transits the communication channels. Addresses are reused when a packet has left; in consequence, task graphs become cyclic – in contrast to typical video streaming applications, whose task graphs are essentially pipelines. Finally, the time a task needs to process a packet depends on the packet's content and is thus unknown at the time the task graph is designed. Channel depths, however, must be fixed at that time. All in all, we are confronted with a particularly difficult problem of dimensioning communication channels in advance to prevent overflow or underflow [1].

Fig. 2. Task graph with explicit representation of the communication channels and feedback

The irregularity present in our class of applications is very difficult to capture in analytical models; we thus employ a purely simulation based approach. It is then crucial to monitor the fill state of the MWMR channels over time, with very little simulation overhead and without distorting performance results. The contribution of this paper is a comprehensive method for analyzing the behavior of software channels stored in memory. By spying on transfers to and from on-chip memory on the interconnection network and analyzing them at a cycle accurate, bit accurate level, we monitor channels over time. We first present related work on communication channels in so far as it is applicable to our class of applications. Implementation details of MWMR channels which are important for the understanding of our monitoring approach are presented in Section 3.

The generic simulation platform and its telecommunication specific extensions are presented in Section 4. The analysis tool is detailed in Section 5, where we briefly show how it can be used to exhibit performance bottlenecks for a given application. Section 6 raises open questions and gives perspectives on future work.

2. RELATED WORK

Inter-task communications in telecommunication applications are necessarily asynchronous and point-to-point, which rules out many inter-task communication models. As already mentioned, the irregularity present in this class of applications is difficult to capture in analytical models. One analytical model specifically proposed for this class is derived in [4], however at the price of neglecting the effects of caching, separate memories, and shared communication resources. Y-chart based approaches like Sesame [5], on the other hand, rely on simulation; the feedback serves to approach an optimal solution by mutually adapting the hardware platform and the application. They focus on the system level. Sesame uses Kahn Process Networks (KPN [6]) for application modeling. Let us restrict ourselves to communication models based on the KPN paradigm, which proposes a semantics of inter-task communication through infinite depth, point-to-point FIFO channels with non-blocking writes and blocking reads. The KPN formalism is very popular because it is deterministic: the result of an application does not depend on the relative speed of its tasks. Infinite channels are impossible to implement, but Parks showed in [7] that FIFO depth can be reduced to a finite value without losing the KPN properties. As a corollary, write operations to the FIFO become blocking. Disydent (Digital System Design Environment [8]) is based upon KPN and uses point-to-point channels. Channels are implemented as software channels and mapped to on-chip memory. Disydent communications have to be described differently depending on whether they take place between purely software, purely hardware or mixed pairs of tasks. The KPN formalism has also been adapted by YAPI [9]. YAPI extends KPN by a channel selection mechanism. Implementations of YAPI are COSY [10] and SPADE [11]. More recently, the work on ESPAM [12] examines mappings of streaming media applications to shared memory MPSoC architectures. A variety of bounded KPN channels are implemented in the form of specific hardware, whereas our focus is on software channels mapped to on-chip memory. The work on pn [13] introduces reordering channels, where data can be written in a different order from that in which it is read. These channels are also bounded and blocking. KPN channels can be implemented as a special case of MWMR communications with one reader and one writer.
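To make the bounded KPN channel semantics concrete, the following minimal C sketch shows a point-to-point FIFO with blocking reads and, because the depth is finite as in Parks' approach, blocking writes as well. It is only an illustration written with POSIX threads, not the implementation used by any of the cited frameworks; the depth of 16 and the integer payload are arbitrary choices, and the mutex and condition variables are assumed to be initialized with the usual PTHREAD_*_INITIALIZER macros.

#include <pthread.h>

#define FIFO_DEPTH 16

typedef struct {
    int buf[FIFO_DEPTH];
    int count, rd, wr;
    pthread_mutex_t m;                 /* PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t not_empty, not_full;/* PTHREAD_COND_INITIALIZER  */
} kpn_fifo;

void kpn_write(kpn_fifo *f, int v)
{
    pthread_mutex_lock(&f->m);
    while (f->count == FIFO_DEPTH)          /* blocking write: channel full */
        pthread_cond_wait(&f->not_full, &f->m);
    f->buf[f->wr] = v;
    f->wr = (f->wr + 1) % FIFO_DEPTH;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->m);
}

int kpn_read(kpn_fifo *f)
{
    pthread_mutex_lock(&f->m);
    while (f->count == 0)                   /* blocking read: channel empty */
        pthread_cond_wait(&f->not_empty, &f->m);
    int v = f->buf[f->rd];
    f->rd = (f->rd + 1) % FIFO_DEPTH;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->m);
    return v;
}

An MWMR channel differs mainly in being mapped to a chosen on-chip memory bank and in allowing several readers and writers on the same buffer, as described in the next section.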

3. SHARED MEMORY IMPLEMENTATION

The access to MWMR channels is protected by a single lock. Tasks wait actively for such a spin lock. This choice was driven by the need for a simple protocol that can be implemented in a hardwired automaton. MWMR channels support either blocking or non-blocking read and write operations; both of them rely on a five step protocol: (1) get the lock (READ operation), (2) test the status of the channel (READ operation), (3) do the actual transfer (READ/WRITE operation), (4) update status and pointer (WRITE operation), (5) free the lock (WRITE operation). Non-blocking operations return after an attempt to do the transfer, with the number of items actually copied as the return value of the function. Blocking operations loop until they can transfer a fixed number of items. The five step protocol is implemented as a set of software functions that can be used either by a task running on a processor or by a wrapper that connects a coprocessor to a software MWMR channel. From the channel point of view, there is thus no difference whether a hardware or a software task performs the transfer. As transactions via the five step protocol are expensive, descriptors are generally transferred in bursts. Figure 3 provides a closer look at the data structure of MWMR channels. It consists of three parts: (1) a ring buffer, (2) a status field containing the read and write pointers, the lock, the status (fill state) and a usage counter, and (3) a descriptor of static properties: depth, width, address of the status and address of the ring buffer. The ring buffer is characterized by its width and depth. The width is the size (in bytes) of a single item that can be stored in the buffer. The depth is the number of items that can be stored. A given MWMR channel contains items of the same size. Read and write operations performed on an MWMR channel must transfer an integer number of items.


Fig. 3. MWMR data structure in memory: (1) ring buffer (2) status (3) descriptor
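As an illustration of the layout in Figure 3 and of the five step protocol, the following C sketch shows a hypothetical channel descriptor and a non-blocking write. The field names, the use of C11 atomics for the spin lock and the modulo pointer arithmetic are assumptions made for readability; they are not the actual SoCLib/DSX data structures or primitives, which in addition handle the cache line invalidation mentioned at the end of this section.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    atomic_flag lock;      /* spin lock; initialize with ATOMIC_FLAG_INIT  */
    uint32_t status;       /* fill state, in items                         */
    uint32_t rptr, wptr;   /* read / write pointers (item indices)         */
    uint32_t width;        /* size of one item, in bytes                   */
    uint32_t depth;        /* number of items the ring buffer can hold     */
    uint8_t *buffer;       /* ring buffer mapped to on-chip memory         */
} mwmr_channel;

/* Non-blocking write following the five step protocol; returns the
 * number of items actually copied. */
uint32_t mwmr_write_nonblock(mwmr_channel *c, const uint8_t *src, uint32_t nitems)
{
    while (atomic_flag_test_and_set(&c->lock))   /* (1) get the lock (spin)     */
        ;
    uint32_t room = c->depth - c->status;        /* (2) test the channel status */
    uint32_t n = nitems < room ? nitems : room;
    for (uint32_t i = 0; i < n; i++) {           /* (3) do the actual transfer  */
        memcpy(c->buffer + c->wptr * c->width, src + i * c->width, c->width);
        c->wptr = (c->wptr + 1) % c->depth;
    }
    c->status += n;                              /* (4) update status & pointer */
    atomic_flag_clear(&c->lock);                 /* (5) free the lock           */
    return n;
}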

This choice of implementation of software channels has the drawback that far more memory traffic is generated. Our particular focus will thus have to lie on the efficiency of the data transfers.

For efficiency reasons, channels are mapped to cacheable memory. We ensure cache coherency in the software implementation of the MWMR primitives, selectively invalidating the cache lines concerned.

4. THE SIMULATION PLATFORM

As mentioned above, our application is irregular in several ways (packet size, arrivals, duration of processing), which made us privilege a purely simulation based approach. We moreover wish to parameterize an architecture for a class of applications rather than fix it for a single application. For this reason, we adopted SoCLib [14], running the Mutek static kernel [15]. SoCLib is a generic open shared memory multiprocessor-on-chip platform. The core of the platform is a library of SystemC simulation models for virtual components (IP cores), with a guaranteed path to silicon. The Design Space Explorer (DSX) handles the description of the application, the simulation model and the mapping of streaming applications on SoCLib platforms. The memory bank to which a channel is mapped can be given explicitly in SoCLib/DSX, which is a significant improvement over SPADE [11], where only tasks can be mapped explicitly.

4.1. Virtual Component Interconnect

Let us consider the shared memory paradigm in more detail. The VCI (Virtual Component Interconnect) standard [16] aims at separating functionality from communications in order to facilitate the reuse of hardware components. There are two types of components: initiators (typically a processor's cache or a coprocessor) and targets (most others, like RAM and terminals). All initiators and targets share the same address space: targets can be identified by the most significant bits of their address, while initiators are identified by an index. A VCI transaction is a request/response pair. A request can contain one or several addresses (burst transaction). The response packet has the same length as the request packet. The packets are transported in an atomic manner. An initiator can emit a request without awaiting the answer to the previous request, but the responses do not necessarily come back in the order in which the requests have been emitted. Figure 5 shows the VCI interface. The first eight signals concern the request packet, the remaining seven the response packet. The initiator is identified by SRCID. CMD stands for command (read, write, linked read), which can be valid (CMDVAL) or not, and is acknowledged (CMDACK) by the target. Packets are of variable length and the end of a packet is marked by EOP. The initiator acknowledges reception (RSPACK). The target can emit an error signal.
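For illustration, the request and response fields named above can be pictured as two plain records. The following C structs are a simplified, hypothetical model intended only to clarify the roles of the signals; the field widths and types are not taken from the VCI/OCB standard [16].

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of one VCI request cell. */
typedef struct {
    uint32_t address;   /* ADDRESS: target byte address              */
    uint8_t  cmd;       /* CMD: read, write or linked read           */
    uint32_t wdata;     /* WDATA: data carried by write commands     */
    uint16_t srcid;     /* SRCID: index of the emitting initiator    */
    uint16_t pktid;     /* PKTID: lets responses be matched to it    */
    bool     eop;       /* EOP: last cell of the (burst) packet      */
} vci_request;

/* Hypothetical model of one VCI response cell. */
typedef struct {
    uint32_t rdata;     /* RDATA: data returned on reads             */
    uint16_t rsrcid;    /* RSRCID: initiator the response goes to    */
    uint16_t rpktid;    /* RPKTID: echo of the request PKTID         */
    bool     reop;      /* REOP: last cell of the response packet    */
    bool     rerror;    /* RERROR: error flag raised by the target   */
} vci_response;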

Fig. 4. Overview of the clustered multiprocessor architecture; asterisks (*) mark the places where VCI MWMR statistics modules are inserted

Fig. 5. VCI interface: request signals CMDVAL, CMDACK, EOP, ADDRESS, CMD, WDATA, SRCID, PKTID; response signals RSPVAL, RSPACK, REOP, RPKTID, RSRCID, RDATA, RERROR

4.2. Generic SoCLib Streaming Platform

A generic MPSoC based on the SoCLib library contains, on a single chip, a variable number of small programmable 32 bit RISC processors (MIPS32), a variable number of embedded memory banks, and other components like terminals and I/O coprocessors. There is no central control by a general purpose processor. Simulations are cycle accurate and bit accurate. The central micro-network is a mesh. The paths between the routers are rather short even for larger numbers of processors; we use an abstraction where all communications on the central interconnect have the same latency. The number of processors per cluster is limited to four; a crossbar, otherwise of prohibitive cost, is thus the interconnect of choice for the local interconnects. A global view of the clustered platform is shown in Figure 4. All clusters are connected to the network-on-chip through two ports: an initiator port that sends requests from other clusters and a target port that sends commands from this cluster to the others. Clusters can be of three types:

1. Input clusters, shown on the upper left hand side, contain one input coprocessor, its MWMR controller and one or several memory banks.

2. Output clusters are symmetric to input clusters, with the exception that they do not have a memory bank: slots have to be requested from the input cluster. It is advisable that the I/O clusters be close neighbors on the on-chip interconnect.

3. Processing clusters contain several processors, their caches, one or several memory banks and possibly a terminal to follow the progress of the application.

4.3. Telecommunication Specific Platform

In contrast to other applications that have successfully been ported to MPSoC, data volume is very large in state-of-the-art telecommunication applications. Packet chunks should be copied between memory banks as seldom as possible. As a case study we take the packet classifier shown in [17]. It is organized in four stages: input tasks, classification tasks, scheduling tasks and output tasks (Figure 6). The bootstrap task generates addresses at start-up, sends them to the input tasks via the address channels, then emits strobe signals for the coprocessors and suspends itself.

Fig. 6. Classification application mapped onto a clustered platform

The input tasks divide the packet up into slots of equal size and store them in memory. A descriptor is produced, containing a pointer to the first slot and some additional information. The descriptor is sent on to an input channel. Such a channel either has as many readers as there are classification tasks, or is destined to a group of such tasks, typically located on the same cluster. Classification tasks are computation intensive as they inspect the packet header; to this end, they have to access the first slot in on-chip memory. The descriptor is sent to one of the priority queues as a result of this inspection. Scheduling tasks multiplex descriptors onto the output channels, observing priorities defined by the application designer. The output tasks reassemble the packet; as soon as a packet leaves the platform, the liberated addresses are sent to the address channels. Input and output tasks are implemented as hardware coprocessors, specific to the class of telecommunication applications. The TCG is cyclic due to the additional feedback channel for addresses between the output and the input task. There are also feedback channels between the classification tasks and the input task for the reuse of addresses of discarded erroneous packets, left out of the figure for readability. We show the application together with the optimal mapping (tasks encircled together are mapped onto the same cluster).

5. LOGGING VCI TRANSFERS

In order to extract statistics on the MWMR channels, we have two options: we can either modify the code of the tasks or monitor the VCI transfers on the interconnect by adding a hardware component. The first approach distorts the execution significantly, whereas the addition of hardware affects simulation time only very slightly; the overhead is essentially bounded by the writing of the log to the simulating machine.

5.1. The VCI MWMR Statistics Module

We propose an extension to the SoCLib library: a module based on the SoCLib VCI logger, which spies on all the fields of a VCI packet for every access and writes the results to a log. The VCI MWMR statistics module restricts itself to the accesses concerning MWMR channels; the analysis of its log enables us to detect the five steps of the protocol described in Section 3 and thus to establish statistics of read and write operations on individual channels, as well as of the time spent waiting for a lock and copying data to memory. For measuring latencies, departures of VCI requests and arrivals of responses have to be recorded. As any task can potentially write to any memory bank, one module has to be connected to each VCI interface between a cache and its local crossbar, marked by asterisks in Figure 4. In order to identify requests and responses that belong together, we use the VCI pktid field. In the SoCLib/DSX architecture description, it is sufficient to add a single line containing a name, a list of channels to monitor, and the name of the log file to be used at the module's creation.

5.2. Experimental Setup

The experiments shown here are not meant to compare performances but to show the detail that can be obtained by using the VCI MWMR statistics module for analysis. The following default parameters were chosen for the architecture: there is one input and one output coprocessor, each on its own cluster, 16 classification tasks on four general purpose clusters, and two scheduling tasks on another. Descriptors are transferred in bursts of 16. The direct mapped, write through caches are dimensioned to their maximum size: 16 KByte for both data and instruction cache. A second memory bank on the input cluster accommodates the slots such that they do not interfere with the channels. We do not use off-chip memory in these experiments. We use small packets of 54 bytes, which constitute the worst case with respect to bandwidth and can be stored entirely on-chip. The input stream contains 24,000 packets. We map one task per processor, as spin locks proved to perform best with this configuration. Under DSX, the bootstrap process has to be completed before starting the input coprocessor [17]. Thus, the address channel leading from the bootstrap task to the coprocessor has to accept the totality of addresses generated for on-chip storage of slots. Throughout the present experiment, we generate 2K on-chip addresses of four bytes each. If the system is in equilibrium, once the address channels have filled up, they are always near full. All other channels contain eight byte descriptors, with a maximum depth of 1K for the input and output channels and 256 for the priority queues. Unless mentioned otherwise, there are 16 classification and 2 scheduling tasks, and we aim at the maximal throughput that can be met without discarding packets.

5.3. Observing Channels in the Course of Time

The first approximately 300,000 simulation cycles are devoted to the booting of the operating system. Figure 7 shows the graphs representing the read and write operations on the address channel. This bootstrap process takes around 1.63 million cycles. The sawtooth form is due to the bursts. The vertical distance between the graphs of write and read operations at a given instant is the fill state of the channel. Next, descriptors are produced by the input coprocessor and written to the input channels. When the imposed throughput is too high, the write accesses outnumber the read accesses and the input channel fills up (Figure 8). Once it is full, the input coprocessor can no longer write descriptors and starts discarding packets. The figure shows that only about half of the packets arrive at their destination. If the system is in equilibrium, the input channels are always near empty (not shown, because the graphs for read and write operations are hardly distinguishable). Around 3.8 million simulation cycles, the first descriptors arrive at one of the scheduling tasks. Figure 9 shows one of the priority queues for an equal distribution of destination IP addresses; the others are alike. The scheduling task sends the packet on to the output channel, which is always near empty and thus not shown. All 24,000 descriptors are accounted for, and 192,000 bytes are transferred. After 25 million SystemC cycles, all packets have left the platform at the output coprocessor.

5.4. Analyzing Performance Bottlenecks

Experiments for the flat interconnect [17] revealed that the input channel is a source of contention: if all classification tasks read from a single channel, contention when accessing its memory bank is too high. However, more input channels provide more packets to the classification tasks simultaneously. Originally, all priority queues had a size of 1 KByte, which proved insufficient and led to performance loss. With the help of histograms derived by our tool, we adapted the size of the priority queues (Table 1 shows mean depths in number of bytes).

Table 1. Dimensioning the priority queue

nb input channels        1     2     4     8
priority queue size   1152  1408  1536  1536

Five variants of the application are examined, from the simplest variant to the full scale application, in order to exhibit the causes of the performance breakdown (Table 2):

• empty, where input and output coprocessor are directly connected, in order to determine the maximal possible throughput,
• passthrough, where one task is inserted which transfers descriptors without accessing slots,
• outline, the full scale application where the classification tasks send the descriptors to priority queues in round robin order without accessing slots,
• regular, the full scale application,
• classif, the full scale application with weighted priorities.

Table 2. Throughput in bits/cycle for different variants of the application

empty        2.86
passthrough  2.12
outline      0.73
regular      0.66
classif      0.63

We observe that performance degrades as soon as (on-chip) memory is accessed by software tasks: the outline variant does not access slot memory, and this type of access only accounts for less than 10% of the throughput loss in the regular variant. Either a task has to wait too long to obtain a lock, or the memory transfer itself is too slow. Weighted priorities (regular to classif) do not significantly impact the throughput. The following experiment confirms these observations. We determine the latency between the first request for a lock and the moment it is actually obtained (Table 3). We do not consider the address channels, which are only accessed at bootstrap and by the hardware I/O coprocessors. Three types of channel accesses remain to be analyzed: accesses to the input and output channels between hardware coprocessors and software tasks, respectively, and accesses to the priority queues, taking place exclusively between software tasks. A channel can be read or written, such that there are six columns in Table 3: written by the input coprocessor (wI), read by classification tasks (rC), written by classification tasks (wC), read by scheduling tasks (rS), written by scheduling tasks (wS), read by the output coprocessor (rO).

Table 3. Mean duration of channel accesses

        wI    rC    wC    rS    wS    rO
total   330   354   894   326   201   131
lock     71    67    45    19    18    65
data    259   287   849   307   183    66

The time spent waiting for a lock is relatively short, while data transfers are relatively long. While hardware coprocessors have reasonable latencies, transfers initiated by software tasks obviously slow down the application. Measurements of the cache performance show an average CPI around 3.3 for classification and 2.9 for scheduling tasks. As cache lines have a maximal size of 16 words, such that eight descriptors fit into a cache line, possible conflict misses will have to be analyzed, as well as interferences between slot and channel accesses. The simulation overhead is around 18%, but it can be reduced by activating monitoring only on the VCI interfaces belonging to memory banks containing channels that are currently of interest. There is practically no alteration of the measured throughput.

6. CONCLUSION AND PERSPECTIVES

The particular form of the task graph of telecommunication applications motivated the implementation of software communication channels, placed in the on-chip memory of a shared memory MPSoC. The present work contributes a method for analyzing accesses to such channels; a component provides a cycle by cycle trace of the corresponding memory accesses during simulation, at small additional cost.

With the help of this detailed analysis, we were able to uncover some performance bottlenecks. The topology of the central interconnect should be taken into account more precisely; a study of a platform based on a two dimensional mesh is under way. For our class of applications, not only do channels interfere with each other, they also interfere with slots when placed on the same memory bank and, more importantly, with either of them during inter-cluster transfers. DSX gives the designer full control of the mapping of software objects. Apart from channels and memory segments, other objects like stacks and the code of the tasks are explicitly mapped to on-chip memory. We are currently extending our mechanisms to capture these kinds of interference. Cache effects have to be analyzed in more detail. During MWMR transfers, among other operations, items are copied from shared to private memory. Measurements revealed that the corresponding write operations are not sufficiently regrouped into bursts, impairing performance. First experiments using an implementation of the cache with an improved write buffer are encouraging.

Fig. 7. Fill state of the address channel (cumulative bytes read and written over time, in cycles)

Fig. 8. Overflow of the input channel

Fig. 9. Fill state of the priority queue

7. REFERENCES

[1] E. Faure, HW/SW communications in telecommunication oriented MPSoC (Communications matérielles-logicielles dans les systèmes sur puce orientés télécommunications), Ph.D. thesis, UPMC, 2007.

[2] J. C. R. Bennett, C. Partridge, and N. Shectman, “Packet reordering is not pathological network behavior,” IEEE/ACM Trans. on Networking, vol. 7, no. 6, pp. 789–798, Dec. 1999.

[3] A. Greiner, F. Pétrot, M. Carrier, M. Benabdenbi, R. Chotin-Avot, and R. Labayrade, “Mapping an obstacles detection, stereo vision-based, software application on a multi-processor system-on-chip,” in IEEE Intelligent Vehicles Symposium, Tokyo, Japan, June 2006, pp. 370–376.

[4] L. Thiele, S. Chakraborty, M. Gries, and S. Künzli, “A framework for evaluating design tradeoffs in packet processing architectures,” in DAC, 2002, pp. 880–885, ACM.

[5] A. D. Pimentel, C. Erbas, and S. Polstra, “A systematic approach to exploring embedded system architectures at multiple abstraction levels,” IEEE Trans. Computers, vol. 55, no. 2, pp. 99–112, 2006.

[6] G. Kahn, “The semantics of a simple language for parallel programming,” in Information Processing ’74, J. L. Rosenfeld, Ed., pp. 471–475. North-Holland, NY, 1974.

[7] T. M. Parks, Bounded scheduling of process networks, Ph.D. thesis, University of California at Berkeley, CA, USA, 1995.

[8] I. Augé, F. Pétrot, F. Donnet, and P. Gomez, “Platform-based design from parallel C specifications,” CAD of Integrated Circuits and Systems, vol. 24, no. 12, pp. 1811–1826, Dec. 2005.

[9] E. A. de Kock, W. J. M. Smits, P. van der Wolf, J.-Y. Brunel, W. M. Kruijtzer, P. Lieverse, K. A. Vissers, and G. Essink, “YAPI: application modeling for signal processing systems,” in 37th Conference on Design Automation, New York, 2000, pp. 402–405, ACM Press.

[10] J.-Y. Brunel, W. M. Kruijtzer, G. Kenter, F. Pétrot, L. Pasquier, E. A. de Kock, and W. J. M. Smits, “COSY communication IPs,” in Proceedings, 37th Conference on Design Automation, NY, June 5–9, 2000, pp. 406–409, ACM/IEEE.

[11] P. Lieverse, T. Stefanov, P. van der Wolf, and E. F. Deprettere, “System level design with SPADE: an MJPEG case study,” in ICCAD, 2001, pp. 31–38.

[12] H. Nikolov, T. Stefanov, and E. F. Deprettere, “Systematic and automated multiprocessor system design, programming, and implementation,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 27, no. 3, pp. 542–555, 2008.

[13] S. Verdoolaege, H. Nikolov, and T. Stefanov, “pn: A tool for improved derivation of process networks,” EURASIP J. Emb. Sys., vol. 2007, 2007.

[14] SoCLib Consortium, “The SoCLib project: An integrated system-on-chip modelling and simulation platform,” Tech. Rep., CNRS, 2003, http://www.soclib.fr.

[15] F. Pétrot, P. Gomez, and D. Hommais, “Lightweight implementation of the POSIX threads API for an on-chip MIPS multiprocessor with VCI interconnect,” in Embedded Software for SoC, A. A. Jerraya, S. Yoo, D. Verkest, and N. Wehn, Eds., chapter 3, pp. 25–38. Kluwer Academic Publisher, Nov. 2003.

[16] VSI Alliance, “Virtual Component Interface Standard (OCB 2 2.0),” Tech. Rep., VSI Alliance, Aug. 2000.

[17] D. Genius, E. Faure, and N. Pouillon, “Mapping a telecommunication application on a multiprocessor system-on-chip,” in Algorithm-Architecture Matching for Signal and Image Processing, G. Gogniat, D. Milojevic, A. Morawiec, and A. Erdogan, Eds., chapter 1, pp. 53–77. Springer LNEE vol. 73, 2011.

Multiplier Free Filter Bank Based Concept for Blocker Detection in LTE Systems

Thomas Schlechter
Institute of Networked and Embedded Systems, Klagenfurt University, 9020 Klagenfurt, Austria
Email: [email protected]

Abstract— Power efficiency is an important issue in mobile communication systems. Especially for mobile user equipment, the energy budget, limited by a battery, has to be treated carefully. Despite this fact, quite an amount of energy is wasted in today's user equipment, as the analog and digital frontends in communication systems are engineered to extract the wanted signal from a spectral environment defined in the corresponding communication standards, with their extremely tough requirements. In a real receiving process those requirements can typically be considered dramatically less critical. Capturing the environmental transmission conditions and adapting the receiver architecture to the actual needs makes it possible to save energy during the receiving process. An efficient architecture able to fulfill this task for a typical Long Term Evolution scenario is introduced in this paper. The development of a suitable filter chain is described and a complexity comparison to Fast Fourier Transform based methods is given.

I. INTRODUCTION

Recently, research on Cognitive Radio (CR) has gained great interest. The concept of CR, e.g. described in [1], allows the user equipment (UE) to scan its relevant environment with respect to the instantaneous spectrum allocation. In the original context of CR this information is used for efficient spectrum usage by different UEs using various radio access technologies. However, this concept can be used beyond that context. For a UE providing Long Term Evolution (LTE) functionality, knowledge about the environmental spectral composition is extremely valuable for the design of the receive path [2, 3]. The main idea is as follows: if the UE detects many interferences to the wanted signal, then both the analog and digital frontend (AFE/DFE) of the receive path have to provide full performance, e.g. highly linear amplifiers, filters of high order, etc. In the remainder of this paper these interferences will be called blockers. Full performance of the AFE and DFE results in high energy consumption of the UE. If, on the other hand, only few blockers are present, which additionally contain little energy, the receive path does not have to run in full performance mode, resulting in power savings. A concept handling this task for the Universal Mobile Telecommunications System test case has been described in [4], while this paper introduces a concept for efficient spectral estimation for a typical LTE scenario. The concept presented here is optimized for the LTE specification, while approaches for a more general implementation are described in [5]. The Matlab simulation environment implementing the given approach is described in [6], while a hardware-software co-simulation approach based on an FPGA board has been given in [7]. Section II describes the initial conditions and the worst case scenario the UE has to cope with, to clarify the motivation for building a spectrum sensing filter chain. Section III introduces and discusses different approaches to DFE based interference detection. Section IV provides simulation results and complexity estimates, using the method described in [8], for specific implemented filter chains.

II. SPECTRAL ENVIRONMENT FOR AN LTE UE

In [9] several blockers are defined for the UE to cope with, which are further described in more detail in [10]. These differ between the different allowed channel bandwidths for each LTE UE of 1.4, 3, 5, 10, 15 and 20MHz. As an example, Fig. 1a shows an overview of the blocker scenarios for the 5MHz case in baseband representation as defined in the standard. It can be seen that the wanted LTE signal (black) around DC is embedded in several blockers (light and dark grey) of different kinds. The peaks in the spectrum refer to continuous wave (CW) blockers or Gaussian Minimum Shift Keying (GMSK) blockers modeled as CW blockers, while the broader blockers represent other LTE users at different channel frequencies.

The power levels assigned to the single blockers refer to worst case scenarios defined in the standard. The given power level of around -90dBm/5MHz for the LTE signal is remarkably below the blocker levels, e.g. around -60dBm/5MHz for the adjacent and alternate channels and around -37dBm for the narrowest CW blocker. High filter performance with steep slopes is needed to retrieve the wanted LTE signal in such an environment. However, in most cases this scenario will not represent the actual spectral allocation around the UE. A more realistic scenario could be the one shown in Fig. 1b. Obviously, in this scenario the detection of the wanted LTE signal is much more relaxed compared to the previous worst case example. The AFE and DFE, however, are typically designed for the worst case scenario. For the second example, as for many other real life situations, both frontends are overengineered. This results in a higher than necessary energy consumption. Therefore, if both the AFE and DFE are reconfigurable and the UE is able to gain knowledge about the surrounding spectral situation, energy consumption can be driven to a minimum. The following Section III will introduce a method able to fulfill this task.

Fig. 1: Power Spectral Density (PSD) for the defined and the relaxed blocker scenario. (a) PSD of the defined blocker environment. (b) Relaxed blocker environment.

III. DFE BASED INTERFERENCE DETECTION

A method is desired which allows the UE to gain information about its spectral environment. This information can be used to reconfigure the AFE and DFE of the receive path as described in Section II. The focus is set on the implementation of an energy and area efficient approach. In any case, the interference detection must consume less energy than the energy saved by the reconfigured AFE and DFE. The given approach is based on a highly efficient dyadic filter bank. Its suitability for an efficient hardware implementation that fits the requirements will also be evaluated. As a complexity reference, a spectrum sensing approach based on the well known Fast Fourier Transform (FFT) will be discussed.

A. FFT based spectral estimator

The FFT is a hardware efficient implementation of the Discrete Fourier Transform (DFT). The set of Fourier coefficients X(l) of a signal of length N is defined by

X(l) = \sum_{m=0}^{N-1} x(m) e^{-j(2\pi/N)ml}, \quad l \in \{0, 1, \ldots, N-1\}.   (1)

A PSD estimate, called a periodogram [11], can be calculated by

\hat{P}(l) = \frac{1}{N} |X(l)|^2.   (2)

Due to its high variance the pure periodogram is a poor PSD estimator. For real applications, the N data samples are typically split into K segments of N/K samples each, over which K periodograms are calculated and averaged to improve the performance (Bartlett method):

\hat{P}_{Bartlett}(l) = \frac{1}{K} \sum_{k=1}^{K} \hat{P}_k(l).   (3)
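As a minimal numerical sketch of (1)–(3), the following C program computes a naive DFT-based periodogram per segment and the Bartlett average over K non-overlapping segments. It is for illustration only: a real implementation would use an FFT, and the choices N = 256, K = 4 and the single-tone test signal are arbitrary assumptions (compile with a C99 compiler and link against the math library).

#include <complex.h>
#include <math.h>
#include <stdio.h>

#define N 256   /* total number of samples (assumption) */
#define K 4     /* number of segments (assumption)      */

/* Periodogram of one segment of length n, Eqs. (1) and (2). */
static void periodogram(const double *x, int n, double *p)
{
    const double pi = acos(-1.0);
    for (int l = 0; l < n; l++) {
        double complex X = 0.0;
        for (int m = 0; m < n; m++)
            X += x[m] * cexp(-I * 2.0 * pi * l * m / n);   /* Eq. (1) */
        double mag = cabs(X);
        p[l] = mag * mag / n;                              /* Eq. (2) */
    }
}

int main(void)
{
    double x[N], pk[N / K], pbartlett[N / K] = { 0.0 };
    const double pi = acos(-1.0);

    for (int m = 0; m < N; m++)              /* test signal: one sinusoid */
        x[m] = cos(2.0 * pi * 0.1 * m);

    for (int k = 0; k < K; k++) {            /* Eq. (3): average the K periodograms */
        periodogram(x + k * (N / K), N / K, pk);
        for (int l = 0; l < N / K; l++)
            pbartlett[l] += pk[l] / K;
    }
    printf("P_Bartlett(0) = %f\n", pbartlett[0]);
    return 0;
}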

Further improvements can be achieved by calculating the periodograms from overlapping data segments. The latter is the so called Welch method [11], illustrated in Fig. 2.


Fig. 2: Illustration of the Welch estimator.

In the following, a spectral estimator based on the Welch method with an overlap of D = 0.5, corresponding to 50% overlap of the single segments, is assumed, which leads to K segments of length

\frac{N}{DK} = \frac{2N}{K}.   (4)

The complexity of each of the 2N/K-point FFTs is given by

\frac{N}{K} \log_2 \frac{2N}{K}   (5)

complex multiplications and

\frac{2N}{K} \log_2 \frac{2N}{K}   (6)

complex additions. Additionally, for calculating each periodogram, 2N/K multiplications are needed to square the Fourier coefficients. As K periodograms are calculated, 2N multiplications have to be added. To sum over all periodograms,

\frac{2N(K-1)}{K}   (7)

additions are needed. Finally, according to (2) and (3), a normalization of each of the 2N/K gained PSD estimates by

\frac{1}{K} \cdot \frac{K}{2N} = \frac{1}{2N}   (8)

has to be performed. If for a specific system N is a power of two, this normalization can be implemented by a simple shift operation. In this case we end up with

\frac{N}{K} \log_2 \frac{2N}{K} + 2N   (9)

complex multiplications and

\frac{2N}{K} \log_2 \frac{2N}{K} + \frac{2N(K-1)}{K}   (10)

complex additions. As complex multiplications mainly determine the chip area of the implementation, often the complex additions are neglected.
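The following small C sketch simply evaluates the operation counts (9) and (10), as reconstructed above, for given N and K. It is only an illustration of the formulas; the values N = 4096 and K = 8 are arbitrary example parameters, not figures taken from the paper.

#include <math.h>
#include <stdio.h>

/* Complex multiplications of the Welch based estimator, Eq. (9). */
static double welch_cmults(double n, double k)
{
    return (n / k) * log2(2.0 * n / k) + 2.0 * n;
}

/* Complex additions of the Welch based estimator, Eq. (10). */
static double welch_cadds(double n, double k)
{
    return (2.0 * n / k) * log2(2.0 * n / k) + 2.0 * n * (k - 1.0) / k;
}

int main(void)
{
    double n = 4096.0, k = 8.0;   /* example parameters (assumption) */
    printf("complex multiplications: %.0f\n", welch_cmults(n, k));
    printf("complex additions:       %.0f\n", welch_cadds(n, k));
    return 0;
}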

As the number of multiplications in the spectrum sensing approach given in this paper is minimized, the additions should also be kept in mind. The frequency resolution of the FFT based approach is linear over the whole frequency range. As a consequence, for a finer resolution in one part of the spectrum, the resolution of the whole spectrum has to be increased when using the Welch method, if information about the whole spectrum is still desired. In the given LTE application not all parts of the spectrum are of the same interest. Frequencies close to the wanted LTE channel are of greater interest than frequencies at higher offsets. Therefore the FFT based approach does not seem to be a good solution, and a more suitable approach will be given in the next section. However, we use the Welch method as a complexity reference.

B. Highly efficient dyadic filter chain

The filter chain approach discussed in the following is based on a dyadic structure. The idea is taken from the Discrete Wavelet Transform, which has mainly been described in [12] and can efficiently be implemented on a hardware platform by filtering and downsampling [13]. The concept is shown in Fig. 3. As can be seen, the filter chain consists only of highpass filters (HPF), lowpass filters (LPF) and downsampling units. A special class of infinite impulse response (IIR) filters is used in the actual system. The resulting filter implementations provide LPF and HPF functionality at the same time at almost no additional cost. This fact is used in the system whenever it is applicable. IIR filters are used as they provide better attenuation at lower complexity compared to finite impulse response filters, as shown in [14]. As a consequence of the dyadic structure, where downsampling by a factor of two after each filter stage is intended, each of the filters needs to have a cutoff frequency of fs/4, with fs the sampling frequency of the signal. This fact raises the possibility of using halfband filters (HBF) with decreased complexity.


Fig. 3: Dyadic structure of a discrete wavelet transformation.
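To make the dyadic principle of Fig. 3 concrete, the following C sketch splits the signal at every stage into a low and a high band, estimates the power of the high band, and feeds the decimated low band to the next stage. For brevity it uses trivial Haar half-band filters instead of the optimized multiplier-free IIR filters of the actual system; the function, the toy input and the averaging of the band power are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

/* Dyadic analysis: at each stage a low-pass/high-pass split followed by
 * downsampling by two; the high band power serves as a coarse estimate
 * for that (logarithmically spaced) frequency band. */
static void dyadic_band_powers(const double *x, int len, int stages)
{
    double *cur = malloc((size_t)len * sizeof *cur);
    for (int i = 0; i < len; i++)
        cur[i] = x[i];

    for (int s = 0; s < stages && len >= 2; s++) {
        int half = len / 2;
        double hp_power = 0.0;
        for (int n = 0; n < half; n++) {
            double lp = 0.5 * (cur[2 * n] + cur[2 * n + 1]);  /* LPF + downsample */
            double hp = 0.5 * (cur[2 * n] - cur[2 * n + 1]);  /* HPF + downsample */
            hp_power += hp * hp;
            cur[n] = lp;             /* low band becomes the next stage's input */
        }
        printf("stage %d: mean high-band power %.4f\n", s + 1, hp_power / half);
        len = half;
    }
    free(cur);
}

int main(void)
{
    double x[64];
    for (int i = 0; i < 64; i++)             /* toy input: alternating signal */
        x[i] = (i % 2 == 0) ? 1.0 : -1.0;
    dyadic_band_powers(x, 64, 4);
    return 0;
}

Because the sampling rate halves after every stage, each further stage adds only a fraction of the work of the previous one, which is the property that makes the dyadic chain attractive compared to the FFT based estimator.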


Fig. 5: Example chain for the 5MHz test case with adjacent channel blocker and wanted signal detection.

The downsampling process also results in a reduced sampling frequency for the next filter stage, which coincides with a lower energy consumption in each stage compared to the previous one. In contrast to the FFT based approach, the frequency resolution is no longer linear but logarithmic. As illustrated in Fig. 4a for optimal rectangular filter shapes, the resolution for higher frequencies is poor. For lower frequencies the resolution can be increased without adding significant complexity to the algorithm. As a result, by adopting this basic idea and building a dyadic structure of LPF, HPF and downsampling blocks, good resolution for power estimates close to DC can be achieved. The resolution towards higher frequencies is still poor, as the dyadic filter chain has logarithmic frequency resolution only. Typically, at least for some parts of the spectrum, linear frequency resolution is desired. This can easily be implemented by adding further HPF/LPF stages after downsampling. As shown in Fig. 4b, a subsampled version (by a factor of two) of a HPF output with frequency range [fs/8; fs/4] will be mirrored to the frequency range [0; fs/8]. This subsampled version of the signal can again be split by a HPF/LPF combination. Keeping in mind that the original frequency range of this part was [fs/8; fs/4], together with the straightforward portion truly situated between [0; fs/8], a linear frequency resolution between 0 and fs/4 has been achieved. This procedure can be repeated as often as necessary. As can be seen, this approach is much more flexible and adaptable to the actual needs of the system. Note that the input signal given by the 3GPP standard [9] and illustrated in Fig. 1a is very specific and the location of the blockers is considered as given. Therefore the resolution needed for the filter chain in each part of the spectrum is known in advance.

Additionally, different downsampling factors might help to shift filter outputs to ideal positions in the spectrum for further conditioning. Also, in regions where no unwanted spectral part can fold into a subsampled HPF output, downsampling by a higher factor can again reduce complexity. Note that in the given example and the 5MHz test case described in Section IV, high resolution is only provided where it is absolutely needed, in order to keep complexity as low as possible. Furthermore, only coefficients of the type

c_i = \sum_{k=-\hat{k}}^{-1} a_k 2^k, \quad a_k \in \{0, 1\},   (11)

with typically \hat{k} \in \{1, 2, \ldots, 8\}, are used. These coefficients can be implemented in hardware with very few pure shift-and-add operations, which further decreases the overall complexity. A method for optimizing the filter design has been given in [15].
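As a small illustration of such a multiplier-free coefficient, the following C function multiplies a sample by c = 2^-1 + 2^-3 = 0.625, which is of the form (11), using two shifts and one addition. The coefficient value and the 32 bit fixed-point format are arbitrary assumptions, and an arithmetic right shift is assumed for negative inputs.

#include <stdint.h>

/* Multiply x by c = 2^-1 + 2^-3 = 0.625 with shift-and-add only
 * (no hardware multiplier needed). */
static int32_t mul_by_0p625(int32_t x)
{
    return (x >> 1) + (x >> 3);   /* x/2 + x/8 */
}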

IV. REAL SYSTEM APPLICATION, SIMULATION RESULTS AND COMPLEXITY ESTIMATES

Based on the idea of the dyadic approach, filter chains were developed and their detection properties simulated using Matlab. A set of filter chains, built according to the rules defined in Section III-B, has been implemented.

TABLE I: Number of real additions for the investigated filter chain approach per input sample (5MHz test case)

additions   @104MHz  @52MHz  @26MHz  @13MHz  @6.5MHz  @3.25MHz
minimum         2       2       2       3       0        0
maximum         2       5      15      11       9       11
average         2     3.3     8.5     7.8     3.7      4.8

Fig. 4: Illustration for achieving higher resolution. (a) Spectrum after each filter stage. (b) Subsampling the high pass output of the 2nd stage.

Fig. 5 shows an example chain for the 5MHz test case with the adjacent channel and wanted signal detection scenario. For all of the chains, three different types of IIR filters have been used. It has to be mentioned that each of the implemented IIR LP filters can also perform HP functionality at the same time at almost no additional cost. Note that none of the filters contains multipliers. Fig. 6 exemplarily shows a possible hardware implementation of one of the used IIR filters. The evaluation of the different chains' performance is done with focus on the detection accuracy for the scenario given in Fig. 1a and on the overall complexity. The boundary conditions for the implementation are the following: for each possible channel bandwidth, as described in Section II, and each single blocker to be detected, one specific chain was implemented. This assumption is legitimate, as the requirements according to the 3GPP standard are fulfilled.

For the 5MHz test case ten different blockers are defined (six LTE type blockers, four CW type blockers). This results in six typical detection scenarios, as for some of the blockers the same detection chain can be used. With the given approach, blockers using the same chain cannot be differentiated. This is not a problem for the overall application, though, as the information of interest is whether any blocker is present in a given frequency range or not. Each filter chain consists of cascaded integrator comb (CIC) stages and optimized IIR filters of order three and five. The complexity of the pure filter stages, evaluated as described in [8], is given in Table I for the 5MHz test case. Note that 104MHz is a typical sampling frequency for mobile devices. For each blocker detection chain within the 5MHz test case and each intermediate sampling frequency, the number of real additions was counted. The minimum, maximum and average numbers resulting from the statistics of the different chains are reported. Note that no multipliers are required in the filter chain itself. Clearly, for each filtered output signal a power estimation based on the time domain signal has to be performed. For M input samples to the power estimator block, M multiplications and M − 1 additions have to be considered. The normalization step by M^{-1} can be implemented as a pure shift-and-add operation at no cost if the number of input samples M is a power of two. As can be seen, a decrease in complexity compared to the FFT based approach could be achieved. Assuming N input samples at the input of the filter chain, as for the discussed FFT based approach, due to continuous downsampling the number of multiplications for one blocker detection scenario lies fairly below the number of multiplications given in the second part of (9) (but is also linearly dependent on N). The additional cost of the FFT based approach can be estimated from the first part of (9), while for the highly efficient filter chain architecture no further multiplications are needed. For different LTE channel bandwidths, similar results could be observed.
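The following C sketch illustrates the power estimator arithmetic described above: M multiplications and M − 1 additions to accumulate the squared samples, followed by a normalization by M realized as a single shift because M is a power of two. The fixed-point types and M = 64 are assumptions made for the example.

#include <stdint.h>

#define M_SAMPLES 64   /* power of two (assumption) */
#define M_SHIFT    6   /* log2(M_SAMPLES)           */

/* Mean power of one block of M samples. */
static int32_t estimate_power(const int16_t *x)
{
    int64_t acc = 0;
    for (int n = 0; n < M_SAMPLES; n++)
        acc += (int32_t)x[n] * x[n];      /* M multiplications, M-1 additions */
    return (int32_t)(acc >> M_SHIFT);     /* normalization by M as a shift    */
}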


Fig. 6: Structure of the Type-2-IIR filter.

Fig. 7: Detection error of the wanted signal and each blocker for all blocker scenarios. (a) Detection error in dB for the wanted signal. (b) Detection error in dB for the blocker. Legend: adjacent, cw, inband, alternate, imboth, imcw, immod, n-imboth, n-imcw, n-imgmsk.

Fig. 7 shows the simulation results achieved by applying the described filter chains. Each of the lines refers to the detection accuracy achieved by applying one distinct filter chain for a specific blocker scenario in the 5MHz test case. The legend entries refer to the chosen blocker scenarios, each of which is described in more detail in [10]. Note that in this paper only the example chain for the adjacent channel blocker has been described, while Fig. 7 gives the results for all possible blockers to provide an overview of the whole detection scenario. Fig. 7a refers to the detection accuracy for the wanted signal in the blocker environment, while Fig. 7b shows the results for the detected blocker. A value of 10dB on the axis of abscissae means that the distinct blocker contains 10dB more power than defined in the standard, while an offset of 0dB refers to the reference test case. The simulation was done for all blockers with power offset ΔP ∈ [−50; 50] dB with respect to the reference scenario, to evaluate the performance under realistic conditions.

The horizontal dashed lines in both graphs show the region where blocker power detection can be considered sufficient. These requirements were defined by the designers of an LTE AFE and are closely related to the actual implementation of the AFE. It can be seen that both the wanted signal detection and the blocker detection fit the needs of the detection chain over a wide range of blocker power variations.

V. CONCLUSIONS

In LTE systems both AFE and DFE are developed for worst case scenarios. In most real world scenarios these are overengineered. An approach allowing an LTE UE to sense its surrounding spectrum in order to reconfigure both AFE and DFE and reduce energy consumption to a minimum has been introduced. Efficient dyadic filter chains have been discussed, and estimates concerning complexity as well as simulation results related to detection accuracy were given.

It has been shown that the presented concept is a promising step towards intelligent DFEs for AFE and DFE reconfiguration at runtime.

ACKNOWLEDGMENT

This work was funded by the COMET K2 “Austrian Center of Competence in Mechatronics (ACCM)”. The COMET Program is funded by the Austrian Federal government, the Federal State Upper Austria, and the Scientific Partners of ACCM.

REFERENCES

[1] J. Mitola, “Cognitive Radio: making software radios more personal,” in IEEE Personal Communications, vol. 6, 1999, pp. 13–18.

[2] A. Mayer, L. Maurer, G. Hueber, T. Dellsperger, T. Christen, T. Burger, and Z. Chen, “RF Front-End Architecture for Cognitive Radios,” in Proceedings of the 18th Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Athens, Greece, Sep. 2007, pp. 1–5.

[3] A. Mayer, L. Maurer, G. Hueber, B. Lindner, C. Wicpalek, and R. Hagelauer, “Novel Digital Front End Based Interference Detection Methods,” in Proceedings of the 10th European Conference on Wireless Technology, Munich, Germany, Oct. 2007, pp. 70–74.

[4] G. Hueber, R. Stuhlberger, and A. Springer, “Concept for an Adaptive Digital Front-End for Multi-Mode Wireless Receivers,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Seattle, WA, May 2008, pp. 89–92.

[5] T. Schlechter, “Output-to-Spectrum Assignment Algorithm for a LTE Cognitive Radio Filter Bank,” in Proceedings of the Joint Conference Third International Workshop on Nonlinear Dynamics and Synchronization and Sixteenth International Symposium on Theoretical Electrical Engineering (INDS & ISTET) 2011, Klagenfurt, Austria, Jul. 2011, pp. 186–190.

[6] T. Schlechter, “Simulation Environment for Blocker Detection in LTE Systems,” in Proceedings of the 7th Conference on PhD Research in Microelectronics & Electronics (PRIME) 2011, Trento, Italy, Jul. 2011, pp. 13–16.

[7] S. Vajje and T. Schlechter, “Hardware-Software Co-Simulation Environment for a Multiplier Free Blocker Detection Approach for LTE Systems,” in Proceedings of the Joint Conference Third International Workshop on Nonlinear Dynamics and Synchronization and Sixteenth International Symposium on Theoretical Electrical Engineering (INDS & ISTET) 2011, Klagenfurt, Austria, Jul. 2011, pp. 356–361.

[8] T. Schlechter, “Estimating Complexity in Multi Rate Systems,” in Proceedings of the 17th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Athens, Greece, Dec. 2010, pp. 728–731.

[9] TS 36.101 Evolved Universal Terrestrial Radio Access (EUTRA); User Equipment (UE) radio transmission and reception, 3rd Generation Partnership Project (3GPP) Std., Rev. 9.3.0, Mar. 2010. [Online]. Available: http://www.3gpp.org/ftp/Specs/archive/36 series/36.101/36101930.zip

[10] T. Schlechter and M. Huemer, “Overview on Blockerdetection in LTE Systems,” in Proceedings of Austrochip 2010, Villach, Austria, Oct. 2010, pp. 99–104.

[11] D. G. Manolakis, V. K. Ingle, and S. M. Kogon, Statistical and Adaptive Signal Processing. 685 Canton Street, Norwood, MA 02062: Artech House, Inc., 2005.

[12] T. Schlechter and M. Huemer, “Advanced Filter Bank Based Approach for Blocker Detection in LTE Systems,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) 2011, Rio De Janeiro, Brazil, May 2011, pp. 2189–2192.

[13] F. C. A. Fernandes, I. W. Selesnick, R. L. Spaendock, and C. S. Burrus, “Complex Wavelet Transform with Allpass Filters,” Signal Processing, vol. 8, pp. 1689–1706, Aug. 2003.

[14] T. Schlechter and M. Huemer, “Complexity-Optimized Filter Design for a Filter Bank Based Blocker Detection Concept for LTE Systems,” in Eurocast 2011 - Computer Aided Systems Theory - Extended Abstracts, A. Quesada-Arencibia, J. C. Rodriguez, R. Moreno Diaz jr., and R. Moreno-Diaz, Eds. Berlin, Heidelberg, New York: Springer Verlag GmbH, Feb. 2011, pp. 182–183.

[15] T. Schlechter, “Optimized Filter Design in LTE Systems Using Nonlinear Optimization,” in Proceedings of the 17th European Wireless Conference (EW) 2011, Vienna, Austria, Apr. 2011, pp. 333–339.

PRACTICAL MONITORING AND ANALYSIS TOOL FOR WSN TESTING Markku Hänninen, Jukka Suhonen, Timo D. Hämäläinen, and Marko Hännikäinen Tampere University of Technology, Department of Computer Systems P.O.Box 553, FI-33101 Tampere, Finland {markku.hanninen, jukka.suhonen, timo.d.hamalainen, marko.hannikainen}@tut.fi ABSTRACT Wireless Sensor Networks (WSN) comprise autonomous embedded nodes that combine sensing, actuation, and distributed computing with small size and low energy. Testing of WSNs is challenging due to distributed functionality, lack of test and debug interfaces, and limited communication, computation and memory resources. This paper presents the design and implementation of a practical network monitoring and analysis tool for identifying the causes of misbehavior. The tool consists of a sniffer node that passively captures WSN traffic, and multiple user interfaces that can run e.g. on a PC or mobile phone. Unlike the related proposals, our tool neither needs setting up an additional monitoring network alongside the actual WSN nor uses any node resources. Practical testing experiences with multihop mesh WSN deployments show that the tool reveals design and implementation defects that are hard to discover with other testing methods such as in-network monitoring systems or debug prints. Index Terms— wireless sensor network testing, passive monitoring, network analysis 1. INTRODUCTION Wireless Sensor Networks (WSN) consist of even thousands of fully autonomous tiny nodes, which sense their environment, control actuators, perform distributed computing, and communicate wirelessly with each other. WSNs have applications for example in home, office, health care, military, and industry. To achieve small physical size and low manufacturing costs, sensor nodes are typically limited in computation, memory, communication, and energy resources [1]. A WSN is composed of complex distributed embedded systems with real-time requirements coming from radio tx/rx timings. Testing of embedded systems is challenging due to complexity, lack of test and debug interfaces, tight coupling of software and hardware, and need for the real operating environment with stimuli to the system [2, 3]. Moreover, the distributed nature of WSNs makes the testing more difficult as a failure in one node may affect the behavior of others as well. Faultless functionality of algorithms and protocols has to be confirmed to ensure high performance and reliability. Simulators [4, 5, 6] are commonly proposed for WSN testing but they concentrate only on network level operation. Emulators and testbeds [7] are used for node level testing, but they do not represent the harsh physical environment, radio channel characteristics, node mobility, and hardware failures in real-world deployments. Several in-network monitoring systems [8, 9, 10, 11] with active instrumentation of sensor nodes have been proposed to enable the delivery of diagnostics information during a deployment. In these systems, monitoring traffic is sent in-band with the actual WSN traffic. This approach has several disadvantages [12]. First, failures in the monitored WSN may prevent the delivery just when it is most

needed. Second, scarce computation, memory, network bandwidth, and energy resources should not be used for instrumentation and self-testing. Third, the monitoring infrastructure is interwoven with the WSN protocol stack, and thus adding or removing instrumentation may change the network behavior, causing probe effects. As opposed to the active instrumentation, our goal is to find out the causes of network failures by passively capturing the network traffic. This approach does not need any modifications to WSN nodes or use their limited resources. As the monitoring mechanism is completely separated from the network, it does not affect the behavior of the WSN. Also, a failure of the WSN cannot break the monitoring mechanism. Capturing the network traffic can be utilized also when deploying nodes. Finding correct locations for nodes is highly important for ensuring good network connectivity and avoiding partitioning and other problems due to weak links or lack of neighbors. By listening traffic and measuring link quality to the already installed nodes, it is possible to decide if the current location is good enough for a new node. WSN development has been relying on field piloting to achieve realistic performance and application feasibility results [13, 14]. The networking reliability and prototype testing are required for successful field pilots of WSN applications. In this paper, we present a practical network monitoring and analysis tool for supporting WSN research and deployments. Purpose of the tool is to facilitate testing, deploying, and resolving the causes of misbehavior by passively capturing and analyzing WSN traffic. The tool consists of a sniffer node and multiple user interfaces that can run e.g. on a PC or mobile phone. A development interface decodes the captured packet stream and analyzes the protocol behavior. A deployment interface helps the installer to find proper locations for nodes in order to form a well-connected multihop mesh network without any gaps. A maintenance interface is focused primarily on debugging the causes of network failures online and in-situ. We have implemented the tool with these three interfaces on a low-energy Tampere University of Technology WSN (TUTWSN), which is a multihop prototype WSN targeted at low data rate monitoring applications where the main objectives are scalability and energy efficiency. Use of the tool is not limited to WSNs but it is applicable also for other wireless mesh networks. We have used the tool in several deployments, and it has helped us to find both weaknesses in protocol design and errors in software implementation. The rest of this paper is organized as follows. The related work is presented in Section 2. Testing issues and challenges in WSNs are discussed in more detail in Section 3. Section 4 describes the requirements for a network analysis tool and presents the system architecture of our proposal. The tool implementation is presented in Section 5. Section 6 describes our testing experiences with the presented tool. Finally, Section 7 concludes the paper.

2. RELATED WORK


Several systems have been proposed for passive monitoring of WSNs. LiveNet [15] uses a set of packet sniffers co-located with the monitored WSN to reconstruct network topology, determine bandwidth usage and routing paths, make connectivity analysis, and identify hot-spot nodes. Sniffers can be installed either temporarily, during initial deployment, or permanently. Temporary sniffers can log packets to flash memory for manual retrieval, while permanent ones require a backchannel such as serial port or ethernet for delivering the packet logs. The system does not offer real-time processing due to high storage and computational requirements but the merging and analysis of packet traces are performed offline. In [16], the authors emphasize the importance of visualization to understand the behaviors of complex WSNs and present a multisniffer data collection network for WSNs. The idea is similar to LiveNet but this proposal shows graphical views of a network and can record network activities and replay them at different speeds. The system is primarily intended for supporting WSN application development, not for testing protocols. Deployment Support Network (DSN) refers to a backbone network that is installed alongside the actual sensor network during the deployment process. DSNs can collect diagnostics information, overhear WSN traffic, or even control the sensor nodes. Sensor Network Inspection Framework (SNIF) presented in [12] uses a DSN [17] that acts as a distributed network sniffer. Each DSN node has two radios: the first for overhearing WSN traffic, and the second for forming a network among the DSN nodes to forward the overheard packets to a sink. SNIF is able to detect problems related to individual nodes (e.g. reboot), wireless links, paths (e.g. routing failures, loops), or global problems (e.g. network partitioning). Unlike LiveNet, SNIF performs online analysis of the packet stream to infer and report problems immediately. Another difference is that SNIF tries to detect the causes of failures, while LiveNet shows the time-varying dynamics of network behavior. The authors in [18] present a DSN for extensive debugging and controlling of WSN nodes. Every WSN target node is attached to a DSN node by wires. Their interactive debugging services include remote reprogramming, instant or timed remote procedure calls (RPC), and time-stamped data/event-logging. Compared to LiveNet and SNIF, this proposal offers a developer a versatile remote access to embedded node software but does not capture WSN traffic. DSNs enable comprehensive monitoring of a WSN during deployment without using scarce WSN resources. However, the challenge is how to form a reliable, robust, and scalable backbone network with proper connectivity and sufficient throughput. Both [12] and [18] use Bluetooth scatternet in which node’s 100 mW power consumption limits the lifetime of the DSN to few weeks with two AA batteries. Yet, the lifetimes of WSNs are expected to be even years, and not all the problems appear during the first few weeks. Furthermore, wireless DSN and WSN nodes may interfere each other. Due to doubled hardware and increased costs DSNs are predominantly suitable for protocol testing and development but not for real-world long-term deployments. As real-time monitoring of all WSN nodes with DSN would be impractical at least in regionally large outdoor deployments, we have chosen a mobile sniffer approach. 
Failure causes are identified in situ on a deployment site without any probe effect or continuous interference with the WSN. In addition to monitoring and analyzing, our tool can be used to determine proper deployment locations for nodes.

3. TESTING ISSUES IN WSNS
WSNs require testing on multiple levels: protocol, algorithm, hardware, embedded node software, distributed system to name a few. The resource-constrained and distributed nature of WSNs set several challenges for testing. In this section, we describe the most challenging issues emerged in our practical WSN testing in deployments. Real-time operation is required in WSN nodes to ensure high performance and energy efficiency. Radio transmissions and receptions must take place just in time to minimize radio duty cycle, idle listening, frame collisions, and missed packet receptions [19]. For example, in beacon synchronized Medium Access Control (MAC) protocols that utilize Time Division Multiple Access (TDMA) scheduling, nodes have to follow the collective TDMA schedule accurately to avoid overlapping active periods with neighbors [20]. Limited node resources are a major challenge in embedded software testing [18]. Test and debug interfaces are very minimal, and too little attention is paid to testability in both hardware and software design. In the worst case, there is only a LED or no interface at all [14]. Debug printing via serial interface can be used for diagnosing node’s operation and tracking software errors but adding extra instrumentation code may affect node’s behavior, especially in time critical parts of software. Co-operation and communication between nodes make testing of distributed WSN algorithms difficult. In a case of failure, debugging only one node is not enough to trace the causes. Even if debug information is received from all the participating nodes, also the communication traffic has to be captured to find out which node is the guilty one. Operation of distributed algorithms may depend on the number of participants or neighbor nodes, which increases the number of conceivable test scenarios. For example, assigning slot reservations for member nodes in intra-cluster communication varies according to the number of member nodes and the amount of traffic coming from each member. Testing in a laboratory is not sufficient to reveal all the design imperfections and implementation defects in a WSN protocol stack [13]. In a real-world deployment, link reliabilities and Received Signal Strength Indicator (RSSI) values fluctuate due to e.g. external interferences, asymmetric links, channel fading, moving obstacles, and inclement weather. Fluctuating link qualities, node mobility, changes in the network topology, and varying amount of traffic may affect MAC, routing, and neighbor selection protocols. Thus, embedded node software must be tested in a real environment to ensure reliable and faultless operation. Occasional unavailability (break in data delivery) of some part of a network is a common problem encountered in our WSN deployments. Autonomous and self-organizing WSNs are able to react to environmental changes, node failures, and broken links with dynamic neighbor discovery and routing protocols. Unfortunately, sometimes the dynamic functionality may fail or the recovery takes longer than expected, which causes unavailability. Even though the deployed nodes seldom have any test or debug interfaces, the causes of unavailability can be resolved by sniffing the network traffic. 4. NETWORK ANALYSIS TOOL In this section, we first describe the general requirements for a WSN analysis tool, and then present the system architecture of our proposal. We explain how the architecture design meets the requirements and allows fulfilling them. Also, the designed user interfaces are introduced.

4.1. Requirements for analysis tool

Sniffer device:

• Captures all packets to enable comprehensive studying of a protocol behavior.
• Captured packets should be time-stamped accurately to allow validating real-time capabilities of WSN nodes and analyzing scheduling related communication problems.

Analysis software:

• Captured packets should be decoded to a human readable format in each protocol layer.
• Estimates link qualities to the nodes within the communication range.
• User interface shows only the desired information, not all the available. • User interface shows the captured packet stream in real time. • Packet stream and analyzed data can be recorded to a file for later inspection. • User interface software is portable to multiple platforms including mobile devices. • Mobile sniffer device is battery-powered and easy to carry. 4.2. System architecture The system architecture of our analysis tool is presented in Figure 1. The architecture consists of a sniffer device and analysis software. The sniffer device timestamps captured packets and forwards them to the analysis software. To achieve high accuracy in time-stamping, it is done on the sniffer device right after receiving a packet. The sniffer device is kept simple and all data processing is left out of it to allow its implementation on resource-constrained WSN node hardware. Using the same hardware both on nodes and on a sniffer saves time and financial resources, and allows the sniffer device to be tiny and battery-powered. As the sniffer does not perform any processing, it can totally concentrate on listening to a channel and thus minimize packet misses. The analysis software comprises a core component and multiple user interfaces. The core is a mediator between the sniffer device and the user interfaces but also implements services that are common to all the user interfaces, such as setting radio address, performing network scan, and listening to a channel with given timeout. The core controls the sniffer device according to user commands and forwards captured packets to the user interfaces. All the processing tasks, such as packet decoding and filtering, are done by the analysis software as it is run on a computer that has more computation and memory resources than the sniffer device. We have designed the architecture to support multiple user interfaces to allow better customization of views and functions according to different use scenarios. The development interface is used when monitoring network traffic in detail and debugging reasons for failures in protocol behavior. The deployment interface helps a user to determine proper locations for nodes and to ensure sufficient link qualities to neighbor nodes during the installation phase. The assistance is needed because relocating a node by few meters may drastically alter wireless link condition due to channel fading and other wireless effects [21]. The maintenance interface is targeted at post deployment monitoring, which is needed because e.g. environmental changes or energy depletion on a sensor node may cause unpredictable network problems, even after a careful deployment. Extensibility has been considered in the architecture design as new user interfaces utilizing the independent core can be implemented easily to the analysis software.
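As a rough illustration of this division of labor, the following C++ sketch shows how the core might mediate between the sniffer link and several user interfaces. It is not the actual tool code: all names (AnalysisCore, SnifferLink, CapturedPacket) and the command set shown are our own assumptions based on the description above.

// Hypothetical sketch of the core component: NOT the authors' code, only an
// illustration of a mediator between a sniffer device and several user
// interfaces, using invented names (SnifferLink, CapturedPacket, ...).
#include <cstdint>
#include <functional>
#include <vector>

struct CapturedPacket {
    uint64_t timestamp_us;       // time stamp attached by the sniffer device
    std::vector<uint8_t> bytes;  // raw frame as received on the radio channel
};

// Abstraction of the serial (RS232/Bluetooth) link to the sniffer device.
struct SnifferLink {
    virtual void setRadioAddress(uint16_t address) = 0;
    virtual void startCapture(uint8_t channel, uint32_t timeout_ms) = 0;  // 0 = no timeout
    virtual void stopCapture() = 0;
    virtual ~SnifferLink() = default;
};

// The core: forwards user commands to the sniffer and fans captured packets
// out to every registered user interface (development, deployment, maintenance).
class AnalysisCore {
public:
    explicit AnalysisCore(SnifferLink& link) : link_(link) {}

    using PacketHandler = std::function<void(const CapturedPacket&)>;
    void registerInterface(PacketHandler handler) { handlers_.push_back(std::move(handler)); }

    void scanNetwork(uint8_t channel, uint32_t timeout_ms) { link_.startCapture(channel, timeout_ms); }
    void monitorChannel(uint8_t channel) { link_.startCapture(channel, 0); }
    void stop() { link_.stopCapture(); }

    // Called by the serial reader whenever the sniffer delivers a packet.
    void onPacketFromSniffer(const CapturedPacket& pkt) {
        for (auto& h : handlers_) h(pkt);   // no decoding here: that is left to the interfaces
    }

private:
    SnifferLink& link_;
    std::vector<PacketHandler> handlers_;
};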

Fig. 1: The system architecture of our network analysis tool comprising a sniffer device and analysis software.

5. TOOL IMPLEMENTATION This section presents the implementation of the analysis tool on TUTWSN. The internal operation of the sniffer device is described in detail and functions of the user interfaces are presented with screen captures.

5.1. TUTWSN prototype platform TUTWSN uses a beacon synchronized multi-channel low duty-cycle MAC protocol [19] and multi-cluster tree topology. The channel is accessed in cycles, which consist of a superframe and idle time. Each node chooses its channel individually and data packets are transmitted on receiver’s superframe. To eliminate collisions, superframes have locally unique schedules such that they do not overlap on the same channel. A separate global network-signaling channel is used for sending network beacons, which are received in a network scan when discovering neighbors [22]. The network channel is selected dynamically to avoid interferences. Each superframe comprises a cluster beacon and contention and contention-free time slots. The contention slots are intended for control traffic such as association and slot reservation requests. The contention-free time slots, in which data frames are sent, are allocated by a cluster head according to the bandwidth needed by associated nodes (members). A member node may also dynamically request more bandwidth by setting a control bit in the header of a transmitted data frame. Bandwidth is decreased if a member does not transmit anything in time slots assigned to it. TUTWSN uses a dynamic cost routing in which data frames are transmitted to a neighbor with a lower cost and a sink has the lowest cost in the network [23]. This kind of routing is selected as it supports deploying large-scale networks. Routing costs are determined from link reliability, RSSI, bandwidth usage, latency, and energy. Figure 2 presents two different TUTWSN node platforms. Both of them use the same protocol stack that is run on a Microchip PIC18LF8722 8-bit microcontroller. The first platform has 433 MHz Nordic Semiconductor nRF905 radio, while the another has 2.4 GHz nRF24L01 from the same manufacturer. The transfer rates of the radios are 50 kbps and 1 Mbps, respectively. The 2.4 GHz TUTWSN is targeted at indoor applications and the 433 MHz version at long range outdoor networks.
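To make the cost-routing idea concrete, the sketch below shows a possible next-hop selection over beacon-advertised costs. The real TUTWSN cost function combines link reliability, RSSI, bandwidth usage, latency, and energy; since its exact form is not given here, the weights and thresholds in the sketch are purely illustrative assumptions.

// Sketch of gradient ("lower cost towards the sink") next-hop selection as
// described for TUTWSN's dynamic cost routing. The weighting of the cost
// components is NOT specified in the paper; the numbers below are invented.
#include <cstdint>
#include <limits>
#include <vector>

struct Neighbor {
    uint16_t id;
    double   advertisedCost;   // cost the neighbor announces in its beacons
    double   linkReliability;  // 0..1, estimated from overheard traffic
    int      rssi_dbm;
};

// Local cost of using a neighbor = its advertised cost plus a link penalty.
static double linkCost(const Neighbor& n) {
    double reliabilityPenalty = (1.0 - n.linkReliability) * 10.0;   // assumed weight
    double rssiPenalty        = (n.rssi_dbm < -85) ? 2.0 : 0.0;     // assumed threshold
    return n.advertisedCost + 1.0 + reliabilityPenalty + rssiPenalty;
}

// A frame is forwarded to the neighbor that yields the lowest total cost;
// the sink advertises the lowest cost in the network, so frames flow towards it.
static const Neighbor* selectNextHop(const std::vector<Neighbor>& neighbors) {
    const Neighbor* best = nullptr;
    double bestCost = std::numeric_limits<double>::infinity();
    for (const auto& n : neighbors) {
        double c = linkCost(n);
        if (c < bestCost) { bestCost = c; best = &n; }
    }
    return best;  // nullptr means no route (e.g. no neighbors heard recently)
}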

Fig. 2: 433 MHz (top) and 2.4 GHz (bottom) TUTWSN node platforms.

Fig. 3: Hardware used for network monitoring in our implementation.

5.2. Sniffer implementation In our implementation, both 2.4 GHz and 433 MHz TUTWSN node platforms have been used as a sniffer device. The operation of the sniffer device is simple as it has only two main states. In an idle state, the sniffer just waits for commands from the analysis software. Then the radio is kept in a sleep mode to conserve energy as the sniffer is mobile and battery-powered. When the sniffer receives a capture command, it switches to a capture state and begins to listen to the given channel. Received packets are time-stamped with 16 µs accuracy and forwarded via Universal Asynchronous Receiver Transmitter (UART) serial connection. If the capture command includes a timeout, the sniffer returns automatically to the idle state after the timeout is passed. Otherwise, it continues capturing until a new command is received. The embedded software of the sniffer device is implemented with C language and compiled with Microchip MCC18. The sniffer node is presented in Figure 3. The node has internal UART to RS232 adapter to enable connection to the analysis software. The connection can be made wireless with RS232 to Bluetooth adapter, which has been done with Nokia N810 internet tablet and N900 mobile phone in our experiments. When analyzing a WSN, the analysis software is typically run on a laptop because of its larger screen compared to mobile phones. 5.3. Development interface The development interface allows very detailed inspection of network traffic with several features. Figure 4 presents an example of network monitoring. The features are explained through the example. First a user commands a network scan in which the interface shows information decoded from captured network beacons. The results contain the nodes around a sniffer, their operating channels, measured RSSIs, battery voltages, and neighbor information. Then the user begins to monitor the cluster of the node ID 1828. In cluster monitoring, the sniffer listens to the cluster and follows cluster’s channel changes automatically. In the example, the first captured packet is a cluster beacon (slot type B), which is intended for all other nodes (broadcast). The interface shows operating channels, cluster load, timing information, and slot allocation i.e. in which

slots a member node has to listen or is allowed to transmit. Route information containing sink identifiers, sequence numbers, and costs can be utilized when comparing routes of neighbor nodes or resolving routing problems such as loops or a lack of next hop. The next three slots are contention-free having a slot type R. First the cluster head sends a data packet as a broadcast to all members, and then as a unicast to the member ID 2473. The user can choose which pieces of information are decoded and shown to avoid too plentiful flow of information. In this example, the interface

> scan scanning |==============================| 3 clusters 0. 1828 @ 77 (2.477 GHz) [++++] [ ] 2.97 V | neighbor(s) @ CH { 11, 29 } 1. 1829 @ 11 (2.411 GHz) [++ ] [ ] 3.37 V | neighbor(s) @ CH { 65 } 2. 2473 @ 35 (2.435 GHz) [+++ ] [ ] 3.21 V | neighbor(s) @ CH { 29, 77 } >0 B00/0 CH -> BROADCAST [+0.0000] CLUB [33.211623] [2011-02-14 17:17:59] flags: 0, txpow: 3, source: 1828 clch: 50, nwch: 80, beacon ord: 4, extend: +0, aloha: 2, time: 0, load: 5 slots: [BHS D01 D05 U01 U05 U01 U05 ---] Route: 114, TC0:{seq=161, cost=9}, TC1:{seq=161, cost=10} R00/0 CH -> BROADCAST [+0.0480] DATA [33.259640] [2011-02-14 17:17:59] seq: 3, ext resv: no, txpower: 3 type: NODEAPI | dest: 1703, source: 114, TC: 1, seq: 8 payload: 52 F9 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 appId: NODE_CONTROLLER, ver: 1, cmd: led on, target: node R01/0 CH -> 2473 [+0.0640] DATA [33.275599] [2011-02-14 17:17:59] seq: 3, ext resv: no, txpower: 3 type: NODEAPI | dest: 1703, source: 114, TC: 1, seq: 8 payload: 52 F9 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 appId: NODE_CONTROLLER, ver: 1, cmd: led on, target: node R01/1 2473 -> CH [+0.0721] CTRL [33.283681] [2011-02-14 17:17:59] type: ACK | seq: 3, txpower: 2, dest: 1828, source: 2473: ext resv: slot 0 SUCCESS

Fig. 4: Monitoring decoded cluster traffic after a network scan. User commands are in italics and slot information (type, number, traffic direction) in bold.

Fig. 5: A view of deployment interface on Nokia N810 internet tablet comprising link qualities to the nodes within the communication range and their route information.

Fig. 6: A view of maintenance interface on Nokia N810 internet tablet presenting the average traffic load of a node, and its neighbor nodes (top table) and routes (bottom table).

shows all the available: time-stamps, MAC headers (packet type, sequence number, need for extra reservation, and transmission power), routing headers (destination and source addresses, traffic class defining Quality of Service (QoS), and sequence number), and application payload. Printing the packets in hexadecimal format and logging all the output to a file are also available if needed for further inspection. With a detailed collective real-time view of network traffic, the developer can resolve communication related problems such as wrong sequences, timing errors, too low transmission power, corrupted packets, incorrect acknowledgements, misbehaving slot allocation, defective notification of a channel change, and missing or incorrect response to a request. In addition to cluster monitoring, the interface allows channel monitoring in which all traffic on a selected channel is captured. This can be used for monitoring simultaneously multiple clusters that operate on the same channel. Other functions are a Frequency Time Division Multiple Access (FTDMA) view of a neighborhood and network channel search. The FTDMA view is presented in a use scenario in Section 6. The network channel search finds out the current network channel by scanning all the available channels. The development interface is implemented as a console program with C++ language to allow portability to multiple platforms and operating systems.
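The following sketch illustrates the kind of layer-by-layer decoding performed by the development interface. The actual TUTWSN frame layout is not published here, so the header offsets, field widths, and names in the sketch are assumptions used only to show the idea.

// Illustrative decoder in the style of the development interface. The TUTWSN
// frame layout is not given in the paper, so the offsets and field widths
// below are assumptions made only to show layer-by-layer decoding.
#include <cstdint>
#include <cstdio>
#include <vector>

struct MacHeader   { uint8_t type; uint8_t seq; uint8_t txPower; bool extraReservation; };
struct RouteHeader { uint16_t dest; uint16_t source; uint8_t trafficClass; uint8_t seq; };

static bool decodeAndPrint(uint64_t timestamp_us, const std::vector<uint8_t>& f) {
    if (f.size() < 10) return false;                       // too short for the assumed headers

    MacHeader mac{ f[0], f[1], static_cast<uint8_t>(f[2] & 0x0F), (f[2] & 0x80) != 0 };
    RouteHeader rt{ static_cast<uint16_t>(f[3] | (f[4] << 8)),
                    static_cast<uint16_t>(f[5] | (f[6] << 8)),
                    f[7], f[8] };

    std::printf("[%llu us] type: %d, seq: %d, txpower: %d, ext resv: %s\n",
                static_cast<unsigned long long>(timestamp_us),
                mac.type, mac.seq, mac.txPower, mac.extraReservation ? "yes" : "no");
    std::printf("  dest: %d, source: %d, TC: %d, seq: %d, payload: %zu bytes\n",
                rt.dest, rt.source, rt.trafficClass, rt.seq, f.size() - 9);
    return true;
}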

The deployment interface is implemented with C++ language. Graphics of the interface are implemented with Qt framework that is available for several mobile devices and computer operating systems.

5.4. Deployment interface Use of the deployment interface requires that the network of already installed nodes has traffic when deploying new nodes. The traffic is needed to measure link qualities and examine route information. In our implementation, the interface utilizes beacon transmissions but the traffic can also be e.g. route requests and confirms. The deployment interface is presented in Figure 5. Its main functionality is showing the link quality indication to the nodes within the communication range and indicating whether a node has a route to a gateway. A suitable deployment location is identified when reliable link can be maintained to a next hop route. A user places a node to the desired place and marks the node as deployed in the deployment interface. The deployment interface stores the node identifier and the current location determined with Global Positioning System (GPS) capable host device. The locations can be exported to an Open Geospatial Consortium (OGC) KML file for presenting locations on a map software or WSN user interface.
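A minimal sketch of the KML export step is shown below. It uses only the standard OGC KML Placemark structure; the surrounding types and function names are illustrative and not taken from the tool's implementation.

// Minimal sketch of exporting deployed-node positions to an OGC KML file, as
// done by the deployment interface. Only the standard <Placemark> structure is
// relied on; the types and function names are our own.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct DeployedNode { uint16_t id; double latitude; double longitude; };

static void exportToKml(const std::vector<DeployedNode>& nodes, const std::string& path) {
    std::ofstream out(path);
    out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        << "<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n<Document>\n";
    for (const auto& n : nodes) {
        out << "  <Placemark>\n"
            << "    <name>Node " << n.id << "</name>\n"
            // KML coordinates are longitude,latitude[,altitude]
            << "    <Point><coordinates>" << n.longitude << "," << n.latitude
            << "</coordinates></Point>\n"
            << "  </Placemark>\n";
    }
    out << "</Document>\n</kml>\n";
}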

5.5. Maintenance interface The maintenance interface comprises link quality assessment for identifying potentially unreliable links, bandwidth usage analysis for detecting bottlenecks, and route analysis for detecting missing routes. After a network scan, the interface lists the known nodes and their signal strengths similar to the deployment interface. Instead of showing detailed packet capture, the interface groups the information extracted from captured packets. It allows examining node specific information by selecting a node from the list. Figure 6 presents a screen capture of the interface showing an analysis on an examined node. The analysis include remaining battery voltage, traffic load as a percentage of the maximum capacity, neighbor information, and detailed route information. The neighbor information shows the neighbors with which the node has communicated recently and RSSIs measured to them. Use of the route information is similar to the presented in the development interface as it is extracted from the same beacon frames. The interface is implemented with C++ and Qt. 6. TESTING EXPERIENCES WITH PASSIVE MONITORING In this section, we describe our testing experiences and lessons learned with cases in which we have used the presented passive monitoring tool successfully. The tool has revealed errors that would have been difficult to discover otherwise. The detected errors have resulted from software implementation bugs, neglected timing issues, or protocol design defects that may not come up until in a real-world deployment. 6.1. Inspecting fulfillment of real-time requirements With the fulfillment of real-time requirements, we refer to an ability of a WSN node to perform radio transmissions and receptions accurately. The accuracy can be affected by varying processing times of algorithms, crystal inaccuracy caused by temperature changes, or

Fig. 7: Time-slotted communication between a cluster head and its member node in TUTWSN.

Fig. 8: A screen capture of FTDMA testing. Time and channel allocation is presented both in a picture and in numeric format.

new code added to a time-critical section in a protocol stack. Inspecting the fulfillment of real-time requirements frequently is necessary when implementing new features to the protocol. Figure 7 illustrates time-slotted communication between a cluster head and its member node in TUTWSN. The slot length should be minimized to maximize the number of slots and therefore throughput. Yet, there must be enough margin for processing between transmissions. Also, extra listening margins in radio receptions should be as minimal as possible to conserve energy but still tolerate crystal inaccuracy, and thus maintain reliable packet reception. We have used our tool to check that packets are transmitted just in time in slot boundaries, and it has revealed delays that have caused transmissions to fail. In such cases we have either optimized the code or increased the slot length to guarantee enough processing time. After that, the tool has also been used to measure fluctuations of packet transmission times around the deterministic slot boundaries, which allowed determining the needed reception margins. We have also tried to inspect the fulfillment of real-time requirements by adding intra-node checks to the program code but this approach contains at least two problems. First, a check should be as close the transmission moment as possible for a realistic and accurate inspection. However, a timer function call itself on a resourceconstrained microcontroller may delay the transmission too much as the real-time requirements are in the magnitude of microseconds. Second, testing of inter-node time synchronization with only intranode checks is difficult. The passive monitoring tool presented in this paper prevents probe effects and offers collective inspection of fulfillment of real-time requirements by calculating the time intervals between packets from time stamps and showing the transmission times relative to a cluster beacon. 6.2. Inspecting FTDMA scheduling Selecting time slot and channel for node’s own active period referred to FTDMA scheduling is an important algorithm in beacon synchronized MAC protocols. Scheduling errors, meaning that nodes do not follow the collective schedule, handicap neighbor selection and hence routing because a node cannot synchronize to neighbors with overlapping schedules. Furthermore, overlapping active periods on the same channel induce packet collisions.
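The sketch below illustrates the timing check described in Section 6.1: captured time stamps are expressed relative to the cluster beacon and compared against the nominal slot boundaries. The slot length and tolerance are parameters, and the computation shown is a simplified assumption, not the tool's actual algorithm.

// Sketch of a slot-timing check over passively captured packets. Transmissions
// are assumed to start at multiples of the slot length after the cluster
// beacon; concrete values for slot length and tolerance are illustrative only.
#include <cstdint>
#include <cstdio>
#include <vector>

struct StampedPacket { uint64_t timestamp_us; bool isClusterBeacon; };

static void checkSlotTiming(const std::vector<StampedPacket>& packets,
                            uint64_t slotLength_us, uint64_t tolerance_us) {
    uint64_t beaconTime = 0;
    bool haveBeacon = false;
    for (const auto& p : packets) {
        if (p.isClusterBeacon) { beaconTime = p.timestamp_us; haveBeacon = true; continue; }
        if (!haveBeacon) continue;                        // wait for a beacon reference

        uint64_t offset    = p.timestamp_us - beaconTime;          // time since beacon
        uint64_t inSlot    = offset % slotLength_us;                // position inside its slot
        uint64_t deviation = (inSlot > slotLength_us / 2)           // distance to nearest boundary
                                 ? slotLength_us - inSlot : inSlot;
        if (deviation > tolerance_us)
            std::printf("late/early TX: %llu us after beacon (off boundary by %llu us)\n",
                        static_cast<unsigned long long>(offset),
                        static_cast<unsigned long long>(deviation));
    }
}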

Our tool detects scheduling errors and allows inspection of the whole schedule of a neighborhood at a glance from a view presented in Figure 8. The tool captures beacons and then calculates time intervals between the active periods of nodes. The active periods are visualized with bars at each channel. The limited capturing range of the sniffer device is not a problem because overlapping schedules have to be avoided only in a local neighborhood within the communication range, not globally. Inspecting FTDMA scheduling with in-network monitoring using node resources would be difficult, incomplete, and not real-time because of node mobility, resource limitations, and the need for combining information received from multiple nodes. Moreover, in a case of scheduling error an inspection made by a WSN itself cannot be as confident as the one by an external independent sniffer. Our proposal allows accurate and extensive monitoring of FTDMA schedules in real time. 6.3. Detecting asymmetric links The reliability of an asymmetric link in one direction is worse than in the other. In an extreme case, only one-way communication is possible. Link asymmetry results from reflection, scattering, and multipathing of radio signals, antenna characteristics, or defective adjustment of transmission power. Consider a situation in which node A can barely receive packets from node B but B cannot receive anything from A. The network analysis tool enables detection of this problem. As the sniffer node is located between the nodes A and B, it can capture packets from both of them. With the sniffer we know that A receives beacons from B as it tries to send data to B. Since B does not acknowledge, it does not receive anything. If A does not have any other neighbors, the sequence is repeated again and again, and thus we know that the reason for routing failure is an asymmetric link. With the analysis tool, we found a transmission power control problem in TUTWSN protocol design. The problem was found when we monitored network traffic on the deployment site since a node had stopped sending data though the battery voltage was sufficient. The transmission power is stored in a packet header before each transmission. The problem was an assumption that the receiver of a packet can send the acknowledgement with the same power level. The assumption had been made to conserve energy. It is correct most of the time, but sometimes more aggressive increase in transmission power is needed due to link asymmetry in a real-

world environment. The asymmetry can occur occasionally because of varying environmental conditions in different seasons. For example, in winter there is a lot of snow but no leaves in trees and in summer vice versa. In this case, in-network monitoring is useless if the node suffering from an asymmetric link does not have a route to send its diagnostics data. Our analysis tool is practical as it is easy to carry and does not require any test interface or modifications to nodes but identifies the problems from normal network traffic. Besides, WSN nodes deployed outdoors seldom have any test interfaces.
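The heuristic described above can be sketched as follows. The bookkeeping structures and the frame-count threshold are our own assumptions; the tool itself infers asymmetry from the decoded packet stream as explained in the text.

// Heuristic sketch of asymmetric-link detection from passively captured
// traffic: if node A keeps transmitting data to node B but the sniffer never
// captures an acknowledgement from B, the A->B link is suspect.
#include <cstdint>
#include <map>
#include <utility>

struct LinkStats { unsigned dataFrames = 0; unsigned acks = 0; };

class AsymmetryDetector {
public:
    void onDataFrame(uint16_t src, uint16_t dst) { stats_[{src, dst}].dataFrames++; }
    // An ACK travelling src->dst acknowledges data sent on the dst->src link.
    void onAckFrame(uint16_t src, uint16_t dst) { stats_[{dst, src}].acks++; }

    // Link looks asymmetric if many data frames were seen without any ACK back.
    bool looksAsymmetric(uint16_t src, uint16_t dst, unsigned minFrames = 20) const {
        auto it = stats_.find({src, dst});
        return it != stats_.end() && it->second.dataFrames >= minFrames && it->second.acks == 0;
    }

private:
    std::map<std::pair<uint16_t, uint16_t>, LinkStats> stats_;  // (sender, receiver) -> counters
};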

6.4. Broadcasting to all nodes in a network

Broadcasting means sending the same message to all nodes in a network. The basic principle is simple: when a node receives a broadcast message, it forwards the message to its neighbors. The problem is how to stop broadcasting once the message has reached all nodes in the network. In tree topology networks every node forwards the message only to its children, and the broadcast ends when the message reaches the leaf nodes at the lowest level of the tree. In mesh networks nodes are inter-connected, and thus some other ending mechanism is needed. Sequence numbering in packets is a general method to check whether a received message has already been received and forwarded. Every node stores the received sequence numbers in memory and drops duplicated messages. The analysis tool detected an implementation defect related to sequence numbering in TUTWSN. The tool reported that traffic loads were higher than expected at all nodes in the network. We began to monitor the network traffic and noticed a broadcast message being forwarded again and again in the network. The problem was that the nodes did not update the sequence numbers correctly, and the message had been traveling around the network for days, wasting energy and bandwidth. The defect was so inconspicuous that it did not affect application layer behavior, as the duplicated messages were dropped there but not in the routing layer. The tool revealed that extra packets in addition to normal traffic were transmitted in the network all the time. The defect would have been hard to discover and resolve otherwise, e.g. with debug prints, because the extra traffic was the only symptom.
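For illustration, a duplicate check of the kind whose faulty implementation caused this loop might look like the following. The field names and the wrap-around handling are assumptions, not the TUTWSN routing-layer code.

// Sketch of broadcast duplicate suppression: each node remembers the newest
// sequence number seen per originator and drops anything not newer.
#include <cstdint>
#include <map>

class BroadcastFilter {
public:
    // Returns true if the message should be forwarded, false if it is a duplicate.
    bool acceptAndUpdate(uint16_t originator, uint8_t seq) {
        auto it = lastSeq_.find(originator);
        if (it == lastSeq_.end()) { lastSeq_[originator] = seq; return true; }
        // 8-bit sequence numbers wrap around, so compare with modular arithmetic.
        uint8_t diff = static_cast<uint8_t>(seq - it->second);
        if (diff == 0 || diff > 128) return false;   // old or already-seen sequence
        it->second = seq;
        return true;
    }

private:
    std::map<uint16_t, uint8_t> lastSeq_;   // originator id -> newest sequence seen
};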

7. CONCLUSIONS

This paper presents the design and implementation of a practical network analysis tool for testing and inspecting WSNs by passive monitoring. Testing challenges in resource-constrained WSNs are identified and discussed to determine the requirements for a network analysis tool. The system architecture of our proposal is designed to support fulfilling these requirements and implementing multiple user interfaces for various purposes of use. The implemented user interfaces allow real-time network analysis and make finding the causes of network failures easier. Also, one interface is implemented to assist in selecting proper locations for nodes in order to form a well-connected multihop mesh network. Practical testing experiences show that our tool addresses the presented testing challenges and reveals design and implementation defects that are hard to discover with other testing methods. Yet, it currently still requires some protocol expertise from the user. In future work, we are going to implement automated failure analysis as well as channel-, link-, and node-specific performance analysis that reports throughput, channel utilization, packet reception rate, and latency per hop in real time.

8. REFERENCES [1] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: a survey,” Computer Networks, vol. 38, no. 4, pp. 393–422, 2002. [2] Wei-Tek Tsai, Lian Yu, Feng Zhu, and R. Paul, “Rapid embedded system testing using verification patterns,” Software, IEEE, vol. 22, no. 4, pp. 68–75, July-Aug. 2005. [3] J. Grenning, “Applying test driven development to embedded software,” Instrumentation Measurement Magazine, IEEE, vol. 10, no. 6, pp. 20–25, Dec. 2007. [4] S. Sundresh, W. Kim, and G. Agha, “Sens: A sensor, environment and network simulator,” in Proceedings of the 37th annual symposium on Simulation (ANSS ’04), Washington, DC, USA, 2004, pp. 221–228, IEEE Computer Society. [5] “Castalia: A simulator for wsns,” [ONLINE]. Available: http://castalia.npc.nicta.com.au, 2011 [Accessed: Feb. 18, 2011]. [6] “Prowler: Probabilistic wireless network simulator,” [ONLINE]. Available: http://www.isis.vanderbilt.edu/projects/ nest/prowler, 2011 [Accessed: Feb. 18, 2011]. [7] M. Imran, A. M. Said, and H. Hasbullah, “A survey of simulators, emulators and testbeds for wireless sensor networks,” in International Symposium in Information Technology (ITSim ’10), June 2010, vol. 2, pp. 897–902. [8] N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, and D. Estrin, “Sympathy for the sensor network debugger,” in Proceedings of the 3rd international conference on Embedded networked sensor systems (SenSys ’05), New York, NY, USA, 2005, pp. 255–267, ACM. [9] G. Tolle and D. Culler, “Design of an application-cooperative management system for wireless sensor networks,” in Proceeedings of the 2nd European Workshop on Wireless Sensor Networks, Jan.-Feb. 2005, pp. 121–132. [10] J. Suhonen, M. Kohvakka, M. Hännikäinen, and T. D. Hämäläinen, “Embedded software architecture for diagnosing network and node failures in wireless sensor networks,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, M. Berekovic, N. Dimopoulos, and S. Wong, Eds., vol. 5114 of Lecture Notes in Computer Science, pp. 258–267. Springer Berlin / Heidelberg, 2008. [11] Z. Chen and K. G. Shin, “Post-deployment performance debugging in wireless sensor networks,” in 30th IEEE Real-Time Systems Symposium (RTSS ’09), Dec. 2009, pp. 313–322. [12] M. Ringwald, K. Römer, and A. Vitaletti, “Passive inspection of sensor networks,” in Distributed Computing in Sensor Systems, J. Aspnes, C. Scheideler, A. Arora, and S. Madden, Eds., vol. 4549 of Lecture Notes in Computer Science, pp. 205–222. Springer Berlin / Heidelberg, 2007. [13] K. Langendoen, A. Baggio, and O. Visser, “Murphy loves potatoes: experiences from a pilot sensor network deployment in precision agriculture,” in 20th International Parallel and Distributed Processing Symposium (IPDPS ’06), Apr. 2006, p. 8 pp. [14] G. Barrenetxea, F. Ingelrest, G. Schaefer, and M. Vetterli, “The hitchhiker’s guide to successful wireless sensor network deployments,” in Proceedings of the 6th international conference on Embedded networked sensor systems (SenSys ’08), New York, NY, USA, 2008, pp. 43–56, ACM.

[15] Bor-rong Chen, G. Peterson, G. Mainland, and M. Welsh, “Livenet: Using passive monitoring to reconstruct sensor network dynamics,” in Distributed Computing in Sensor Systems, S. Nikoletseas, B. Chlebus, D. Johnson, and B. Krishnamachari, Eds., vol. 5067 of Lecture Notes in Computer Science, pp. 79–98. Springer Berlin / Heidelberg, 2008. [16] Yu Yang, Peng Xia, Liang Huang, Quan Zhou, Yongjun Xu, and Xiaowei Li, “Snamp: A multi-sniffer and multi-view visualization platform for wireless sensor networks,” in 1st IEEE Conference on Industrial Electronics and Applications, May 2006, pp. 1–4. [17] J. Beutel, M. Dyer, L. Meier, and L. Thiele, “Scalable topology control for deployment-support networks,” in 4th International Symposium on Information Processing in Sensor Networks (IPSN ’05), Apr. 2005, pp. 359–363. [18] M. Dyer, J. Beutel, T. Kalt, P. Oehen, L. Thiele, K. Martin, and P. Blum, “Deployment support network,” in Wireless Sensor Networks, K. Langendoen and T. Voigt, Eds., vol. 4373 of Lecture Notes in Computer Science, pp. 195–211. Springer Berlin / Heidelberg, 2007. [19] M. Kohvakka, J. Suhonen, T. D. Hämäläinen, and M. Hännikäinen, “Energy-efficient reservation-based medium access

control protocol for wireless sensor networks,” EURASIP Journal on Wireless Communications and Networking, p. 20 pages, 2010. [20] A. Koubaa, A. Cunha, and M. Alves, “A time division beacon scheduling mechanism for IEEE 802.15.4/Zigbee clustertree wireless sensor networks,” Polytechnic Institute of Porto (ISEP-IPP) Technical Report HURRAY-TR-070401, 2007. [21] D. Ganesan, B. Krishnamachari, A. Woo, D. Culler, D. Estrin, and S. Wicker, “Complex behavior at scale: An experimental study of low-power wireless sensor networks,” UCLA CS Technical Report UCLA/CSD-TR 02-0013, 2002. [22] M. Kohvakka, J. Suhonen, M. Kuorilehto, V. Kaseva, M. Hännikäinen, and T. D. Hämäläinen, “Energy-efficient neighbor discovery protocol for mobile wireless sensor networks,” Ad Hoc Networks, vol. 7, pp. 24–41, Jan. 2009. [23] J. Suhonen, M. Kuorilehto, M. Hännikäinen, and T. D. Hämäläinen, “Cost-aware dynamic routing protocol for wireless sensor networks - design and prototype experiments,” in 17th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC ’06), Sept. 2006, pp. 1–5.

2011

Tampere, Finland, November 2-4, 2011

Poster Session Reconfigurable Systems & Tools for Signal & Image Processing

High Level Design of Adaptive Distributed Controller for Partial Dynamic Reconfiguration in FPGA
Sana Cherif, Chiraz Trabelsi, Samy Meftali and Jean-Luc Dekeyser

Methodology for Designing Partially Reconfigurable Systems Using Transaction-Level Modeling
Francois Duhem, Fabrice Muller and Philippe Lorenzini


HIGH LEVEL DESIGN OF ADAPTIVE DISTRIBUTED CONTROLLER FOR PARTIAL DYNAMIC RECONFIGURATION IN FPGA

Sana Cherif, Chiraz Trabelsi, Samy Meftali, Jean-Luc Dekeyser
INRIA Lille Nord Europe - LIFL - USTL - CNRS, 40 avenue Halley, 59650 Villeneuve d'Ascq, FRANCE

ABSTRACT

Controlling dynamic and partial reconfiguration has become one of the key issues in modern embedded systems design. In such systems, the reconfiguration controller can significantly affect system performance. Indeed, the controller has to handle three major tasks efficiently at runtime: observation (monitoring), taking reconfiguration decisions, and notifying these decisions to the rest of the system so that they can be realized. We present in this paper a novel high-level approach that permits modeling, using the MARTE UML profile, modular and flexible distributed controllers for dynamic reconfiguration management. This approach permits component/model reuse and allows systematic code generation. It consequently makes reconfigurable systems design less tedious and reduces time to market.

Index Terms— Partial Dynamic Reconfiguration, distributed control, modularity, adaptivity, high level modeling, UML MARTE

1. INTRODUCTION

In a reconfigurable computing system, both the hardware and software parts can be reconfigured depending upon the designer requirements. Their several advantages, such as flexibility and performance, make reconfigurable devices, mainly FPGAs, well adapted to diverse application domains such as satellite-based systems, telecommunications, transport, etc. As the computational power increases, more functionalities are expected to be integrated into the system. Today's FPGAs are well suited for evolving systems thanks to the Partial Dynamic Reconfiguration (PDR) feature, which enables modification of specific regions of an FPGA on the fly while other regions remain operational. Although PDR is growing fast and gaining popularity in the domain of reconfigurable computing, a robust control mechanism for managing reconfiguration is mandatory. A good control system should have these prerequisites: speed, accuracy and robustness. In our work, the reconfiguration decision is taken in a distributed way, which is better in terms of performance than a centralized approach. In order to decrease the design complexity of our system, we use a model-driven design method-

ology. UML (Unified Modeling Language) is considered as one of the main unified visual modeling languages in Model Driven Engineering (MDE). When UML is utilized in the framework of MDE, it is usually carried out through by its internal metamodeling mechanism termed as a profile. UML MARTE [1](Modeling and Analysis of Real-Time and Embedded Systems) is a recent industry standard UML profile of OMG, dedicated to model-driven development of embedded systems. MARTE extends UML along with added extensions (e.g. performance and scheduling analysis). The MARTE profile extends the possibilities to model the features of software and hardware parts of a real-time embedded system and their relations. MARTE is mainly organized into two packages, the MARTE design model and the MARTE analysis model which share common concepts, grouped in the MARTE foundations package: for expressing non-functional properties (NFPs),timing notions (Time), resource modeling (GRM), components (GCM) and allocation concepts (Alloc). An additional package contains the annexes defined in the MARTE profile along with predefined libraries. The hardware concepts in MARTE are grouped in the Hardware Resource Model (HRM) package from MARTE design package [2]. HRM consists of a functional and a physical views. The first one (HwLogical) classifies hardware resources depending on their functional properties, and the second view (HwPhysical) concentrates on their physical nature. Both derive certain concepts from the HwGeneral root package in which HwResource is a core concept that denotes a generic hardware entity providing at least one HwResourceService, and may require some services from other resources. The functional view of HRM defines hardware resources as either computing, storage, communication, timing or device resources. The physical view represents physical components with details about their shape, size and power consumption among other attributes. The UML MARTE has been introduced to provide the designer with some facilities to create complex system models where each component of this system has different behavior semantics. MARTE model is considered as the most generic and abstract graphical representation compared to other system-level modeling language, such as SystemC. This one is mostly used for Transaction-level modeling (TLM).

This presents a constraint for the designer who should have some knowledge about this lower level of abstraction. On the other hand, MARTE offers the possibility of targeting a variety of lower-level system descriptions, while hiding low-level details from designers. In [3], a methodology to automatically generate SystemC heterogeneous specifications from generic MARTE models have been proposed. This methodology allows to automatically generate, from the high level MARTE description, an executable SystemC model, while hiding from designers low-level details related to the SystemC platforms such as signals, transactions and threads. Indeed, using model transformations, a system modeled using MARTE can be transformed into models targeting different platfroms and purposes such as simulation and synthesis. In this paper, we propose a high level design of an adaptive distributed controller for partial dynamic reconfiguration, using UML MARTE profile. The rest of the paper is as follows. Section 2 gives an overview of some related works while section 3 explains our motivations and our proposed approach. We validate our solution with a case study in section 4 and presents several advantages in section 5. Finally, the last section presents a conclusion and future works. 2. RELATED WORKS There are a few works that have carried out research using the MARTE profile for specification of reconfigurable systems. The MoPCoM project [4] aims to target modeling and code generation of dynamically reconfigurable embedded systems using the MARTE UML profile for SoC Co-Design [5]. However, the targeted applications are extremely simplistic in nature, and do not represent complex application domains normally targeted in the SoC industry. The authors in the OverSoC project [6] also provide a high level modeling methodology for implementing dynamic reconfigurable architectures. They integrate an operating system for providing and handling the reconfiguration mechanism. Robust control mechanisms for managing reconfiguration must be developed in order to take into account SoC designer needs as well as QoS choices. In [7], authors reuse the design methodology based on Automation Objects presented in [8] to address formal approaches to verify the control architecture in a hierarchical way. Therefore, they propose an hierarchical control approach by means of a multi-layered architecture. Authors in [9] present control semantics at the deployment level where each configuration can be viewed as an implementation of the modeled application: comprised of unique collection of Intellectual Properties (IPs). Each IP is related to an application elementary component. The control at deployment level can be extended for complex configurations. A comparison of several design approaches of distributed controllers in industrial automation systems is presented in [10] . Authors discussed essentially two different approaches: decomposition of a centralized controller onto

several communicating distributed controllers and integration of predefined controllers of components to the control of a system. In [11], authors present a Multiple Models Switching and Tuning approach(MMST) for adaptive control to provide greater flexibility in design. They also consider deterministic and stochastic systems and discuss some problems of convergence and stability. 3. PROPOSED APPROACH 3.1. Motivations Partial Dynamic Reconfiguration (PDR) gives the designer the ability to reconfigure a certain portion of the FPGA during run-time without influencing the other parts. This special feature allows the hardware to be adaptable to any potential situation especially when a high degree of flexibility and parallelism are required to achieve real-time constraints. Normally, the reconfiguration is carried out in complex systems by means of a control mechanism (for example a RTOS or middleware), which can depend on QoS choices: such as changes in executing functionalities due to designer requirements, or changes due to resource constraints of targeted hardware/platforms. The controller can be implemented in different manners. It can be external or internal; and can be an hardware module written in an HDL acting as a controller. It also can be integrated in different levels in a SoC Co-Design framework [9]. Control integration at the application level has a local impact and is independent of the architecture or the allocation. The reconfiguration at this level may arise due to QoS criteria as in video processing application where switching from a high resolution mode to a lower one is required. Control integration in an architecture can be mainly local and used for QoS such as modification of the hardware parameters (voltage or frequency) for manipulating power consumption levels. The second type of control can be used to modify the system structure either globally or partially. If either the application or the architecture is changed, the allocation must be adapted accordingly. Nevertheless, dynamic reconfiguration is more architecture driven as compared to being application oriented, and is influenced by target platform low-level details. Hence, we are interested in control integration at the architecture level. Dynamic adaptation depends on the context required by designer and can be determined by different QoS criteria [12]. Self adaptation is also present in some component models. Usually this type of adaptation is executed by an adaptation manager or controller, typically a separate process or component that is responsible for managing the adaptations. The aim of our proposed approach is to model at a high abstraction level, a distributed modular controller which will carry out some of QoS factors such as changes in executing functionalities due to designer requirements, or changes due to resource constraints of targeted hardware/platforms. The changes can also take place due to other environmental crite-

ria such as communication quality, time and area consumed for reconfiguration, and energy consumption.

3.2. Controller modeling

Our proposed controller consists of a distributed, local device that manages each partial reconfigurable region (PRR) in an FPGA. The use of a master controller is nevertheless unavoidable: a master controller (processor) is mandatory for overall system-level control. The detailed block diagram of the controller architecture is shown in Figure 1.

Fig. 1. Detailed block diagram of our proposed modular controller.
Our proposed controller is highly flexible thanks to its adaptivity and modularity. In fact, the Event Observer and the Decision module are adaptable, and each one can contain different functionalities according to the observed data and events, respectively. Figure 2 shows a high-level model of our proposed modular controller using MARTE profile concepts. Each component carries the MARTE HRM HwComponent stereotype to indicate that these modules are physical in nature.

Fig. 2. Modeling of our proposed controller


Our controller, characterized by its modularity and adaptability, is based on the following notions:

• Ctrl2Proc Adapter: adapts the signals transmitted from the controller to the processor via bus macros.
• Event Observer: takes as input the processor's requests (e.g., read, write) or user inputs sent from the hyperTerminal. It processes these data and transforms them into a list of events.
• Decision module: analyzes the events received from the Event Observer and decides which configuration should be run. This choice depends on defined criteria such as power consumption, performance, etc. This module can be adaptive with respect to the constraints on which it is based.
• Notifier: forwards the configuration decision to the ICAP (Internal Configuration Access Port), which loads the chosen configuration.
• Ctrl2PRR Adapter: adapts the signals between our modular controller and its corresponding reconfigurable region.

A behavioral sketch of this control chain is given below.
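The following behavioral C++ sketch only illustrates this data flow; the real controller is a per-PRR hardware module, and all types, events, and decision criteria shown here are invented for illustration.

// Behavioral sketch of the observer -> decision -> notifier chain described
// above. Every name, event, and criterion in this model is hypothetical.
#include <cstdint>
#include <optional>
#include <queue>

enum class Event { UserRequest, HighLoad, LowBattery };
struct Configuration { uint8_t bitstreamId; };   // identifies a partial bitstream

class EventObserver {                            // turns raw inputs into events
public:
    void onProcessorRequest(uint32_t /*request*/) { events_.push(Event::UserRequest); }
    std::optional<Event> next() {
        if (events_.empty()) return std::nullopt;
        Event e = events_.front(); events_.pop(); return e;
    }
private:
    std::queue<Event> events_;
};

class DecisionModule {                           // picks a configuration per event
public:
    Configuration decide(Event e) const {
        switch (e) {
            case Event::LowBattery: return {1};  // e.g. a low-power variant of the accelerator
            case Event::HighLoad:   return {2};  // e.g. a parallel, faster variant
            default:                return {0};  // default configuration
        }
    }
};

struct Notifier {                                // hands the decision to the ICAP driver
    void notify(const Configuration& c) { /* request partial reconfiguration of bitstream c */ (void)c; }
};

void controllerStep(EventObserver& obs, const DecisionModule& dec, Notifier& notif) {
    if (auto e = obs.next()) notif.notify(dec.decide(*e));
}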

As shown in Figure 2, we use a softcore Microblaze processor as example of a master controller which will monitor all the distributed controllers. It has MARTE HRM HwProcessor stereotype and is connected to the controllerToProcessor Adapter via the Processor Local Bus (PLBbus). The Partial reconfigurable region (PRR) typed as the generic HwResource type in order to illustrate that the partially reconfigurable region can be either generic or have a specific functionality. This PRR consists of a reconfigurable Hardware Accelerator (HwAcc) defined as HwPLD, as it is reconfigurable compared to a typical hardware accelerator in a large-scale SoC design which can be seen as a HwASIC (after fabrication). 4. CASE STUDY We present here a case study of a complete FPGA model to illustrate our modeling methodology. 4.1. Application level The H.264 is currently one of the most commonly used formats for the recording, compression, and distribution of high definition video [13]. This standard has already been adopted for a wide range of applications, from low bit-rate mobile video to high-definition TV. Like other previous video coding standards, the H.264 utilizes video coding algorithms based on block-based motion compensation and transform based spatial coding framework. The H.264 video encoder treats the frames as I (Integral frame) or as P (Partial frame)[14]. The I frames are encoded without taking into consideration the previous frame, while the P frames are treated taking

into account the information existing in the previous frame. Figure 3 describes the encoding principle for a P frame. Just like the encoding of an I frame, the algorithm partitions the frame into 16x16-pixel MacroBlocks (MB).


Fig. 3. Simplified block diagram of P frame H.264 video coding

Motion Estimation (ME) compresses the temporal redundancies between consecutive frames [15]. The algorithm compares a current MB (CurrMB) of 16x16 pixels from the current frame to all the candidate blocks in a search window in the previous frame, and the best match is used to estimate the motion. For each CurrMB, the ME module can obtain two possible results. The first possibility is that this MB is similar to a block of the previous frame, called the reference MB (RefMB), possibly with small differences. The second possibility is that they are completely different, i.e., the current MB contains 16x16 new pixels. In the case of similarity, the ME provides a Motion Vector (MVect) for the CurrMB to indicate its origin in the previously decoded frame. The Motion Compensation (MC) module then takes this MVect and the RefMB as inputs and, using temporal prediction, provides a Predicted MB (PredMB), which stores the full information needed to transform the previous frame into the next frame. Nonetheless, the similarity between CurrMB and PredMB is not absolute. This leads to the use of a Subtractor module (Subtract) to compute the differences between them in order to obtain a MB that contains only residues (extra data), called the Residual MB (ResMB). Therefore, after the ME, MC and difference phases, the current MB, which initially contained 16x16 original pixels, contains only a small amount of data; the rest can already be found somewhere in the previously decoded frame. By using the Discrete Cosine Transform (DCT), the values stored in a MacroBlock are transformed into frequency values. The 2D DCT used in video coding can be decomposed into two 1D DCT computations together with a transpose memory in order to reduce the computational complexity. The DCT block implements a row-column Distributed Arithmetic (DA) [13] version of the Chen fast DCT algorithm [16]. The 1-D DCT is first performed row by row on the input data and the results are saved in a transpose memory. Then, the 1-D DCT is performed column by column on the results stored in the transpose memory.
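As an illustration of the row-column decomposition described above, the following plain C++ sketch (ours, floating point for clarity; it does not reproduce the Distributed Arithmetic / Chen hardware of the paper) computes an 8x8 2-D DCT as a 1-D DCT over the rows, a transpose, and a 1-D DCT over the columns:

```cpp
// Illustrative row-column 2-D DCT on an 8x8 block.
#include <cmath>

static const int    N  = 8;
static const double PI = 3.14159265358979323846;

// Naive 1-D DCT-II of one row of length N.
void dct_1d(const double in[N], double out[N]) {
    for (int k = 0; k < N; ++k) {
        double s = 0.0;
        for (int n = 0; n < N; ++n)
            s += in[n] * std::cos(PI * (n + 0.5) * k / N);
        double ck = (k == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
        out[k] = ck * s;
    }
}

// 2-D DCT = 1-D DCT on the rows, transpose, 1-D DCT on the columns, transpose back.
void dct_2d(const double block[N][N], double coeff[N][N]) {
    double rows[N][N], tr[N][N], cols[N][N];
    for (int r = 0; r < N; ++r) dct_1d(block[r], rows[r]);    // row pass
    for (int r = 0; r < N; ++r)                               // transpose memory
        for (int c = 0; c < N; ++c) tr[c][r] = rows[r][c];
    for (int r = 0; r < N; ++r) dct_1d(tr[r], cols[r]);       // column pass
    for (int r = 0; r < N; ++r)                               // back to row-major coefficients
        for (int c = 0; c < N; ++c) coeff[c][r] = cols[r][c];
}
```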

The outputs obtained from the second 1-D DCT are the coefficients of the 2-D DCT, i.e., a Coefficient Matrix (CoeffMx). Using the transformed MB after the DCT function, the next step is to quantize (Quant) these values by removing a specific number of bits from all the 8x8 DCT values. This number can vary from frame to frame but is unique during the quantization of the MBs of the same frame. As a result, we obtain a quantization matrix (QuantMx). After that, an inverse transform (inverse quantization (IQuant) and inverse DCT (IDCT)) is applied to this matrix to generate a reconstructed residual MacroBlock (R.ResMB). This block is sent to an adder module (Add) and finally stored in the SRAM memory. The last step is to code the quantized values using Entropy Coding (EC) techniques. Entropy coding refers to a lossless data compression scheme that replaces data elements with coded representations which, in combination with the previously described predictions, transformations and quantization, can result in a significantly reduced data size [17]. In H.264, two EC designs are used in a context-adaptive (CA) way: CA Variable Length Coding (CAVLC) and CA Binary Arithmetic Coding (CABAC). After processing all the MBs of the frame using the P frame coding, the final result that we obtain is a bitstream.

This application is modeled with UML MARTE concepts using the HLAM (High-Level Application Modeling) and HRM (Hardware Resource Modeling) packages. As shown in Figure 4, we apply the MARTE HLAM RtUnit stereotype to the Motion Compensation, Quantization and Coding tasks to indicate that these components are real-time units, which means that they are allocated to a platform computing component (in our case, a processor), whereas we apply the MARTE HRM HwComponent stereotype to the DCT and ME tasks in order to specify that these tasks are performed on hardware accelerators.

4.2. System architecture

In Figure 5, we present the global structure of our reconfigurable architecture, implemented on the Xilinx Virtex 4 XC4VFX20. Our system can be divided into two main regions: the static region and the dynamically reconfigurable one. The static region mainly consists of a softcore MicroBlaze processor, which is selected as the global reconfiguration controller, and other peripherals necessary for dynamic reconfiguration. The MicroBlaze processor connects directly to the high-speed 64-bit Processor Local Bus (PLB). Several slave peripherals are connected to the PLB bus, such as a SystemACE controller for accessing the partial bitstreams placed in an external on-board Compact Flash (CF) card, and a UART controller module connected to a hyperTerminal from which the user inputs/events are sent to the processor. An ICAP core, which carries out partial reconfiguration using the read-modify-write mechanism, is connected to the PLB bus with a master interface. Finally, four processors generated from the modeled application are connected to the PLB bus, i.e., Quant_pr (quantization), Cod_pr (coding), Sub_pr (subtractor) and MC_pr (Motion Compensation).
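Returning to the quantization step of Sect. 4.1, which removes a fixed number of bits from every coefficient of the 8x8 block, a minimal sketch (ours; the per-frame shift parameter qp is hypothetical) could look like this:

```cpp
// Illustrative quantization by bit removal and its approximate inverse
// (assumes an arithmetic right shift for negative coefficients).
static const int N = 8;

void quant(const int coeff[N][N], int quantized[N][N], int qp) {
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            quantized[r][c] = coeff[r][c] >> qp;   // drop qp low-order bits
}

void iquant(const int quantized[N][N], int rec[N][N], int qp) {
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            rec[r][c] = quantized[r][c] << qp;     // reconstruction with loss of precision
}
```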

Fig. 4. Modeling of the H.264 video coding: (a) prediction P frame modeling; (b) modeling of the transform and inverse transform of a P frame; (c) P frame modeling

Fig. 5. Block diagram of the architecture of our reconfigurable system

In video coding applications, ME and DCT are the most compute-intensive processes. Inspired by the architecture proposed in [18], the dynamically reconfigurable region contains eight hardware accelerators (HwAcc) connected to the PLB bus via bus macros. Each one serves as a partial reconfigurable region (PRR) which can be dynamically reconfigured using partial reconfiguration. To achieve quality scalability of the DCT computation, the hardware accelerators can be reconfigured in two ways: they can be removed or added to achieve zonal coding of the DCT coefficients (from 1x1 to 8x8), or the internal logic of the HwAcc can simply be changed through dynamic partial reconfiguration to reduce the precision of the resulting DCT coefficients. Since ME is a highly parallel application, it should be implemented on a hardware accelerator (HwAcc); we therefore use the ME computation as an example to demonstrate how to take advantage of the unused HwAccs. Figure 6 shows our reconfigurable architecture (Xilinx Virtex 4 XC4VFX20 chip) using MARTE concepts in a merged functional/physical view that expresses all the attributes related to the corresponding physical/logical stereotypes. We use a 4x4 DCT with 4 HwAccs for ME as an example for modeling with the MARTE profile.

5. ADVANTAGES OF OUR SOLUTION

Our proposed solution combines three main aspects: distribution, adaptability and modularity. In this section, we illustrate the advantages of our approach with some examples.

5.1. Spatial distribution:

- Performance: In modern embedded systems, the reconfiguration controller can significantly affect system performance. When the performance of a centralized controller is not sufficient, it is often a good idea to divide the control program into several pieces and run them on different devices. Indeed, using only one global controller to manage several partial reconfigurable regions can lead to deadlocks and, consequently, to performance degradation.

- Complexity: Partial dynamic reconfiguration has become one of the key issues in today's FPGAs, as these systems become more agile to meet the requirements of ever-changing markets. Due to the presence of several partial reconfigurable regions (PRRs) in an FPGA, it is mandatory to manage them efficiently. Implementing a distributed controller to manage each PRR therefore enhances flexible reconfiguration and offers reusability of the controllers. Such a split brings an immediate gain in productivity, and the complexity of global reconfiguration management is reduced significantly. This solution makes the control design less complex and increases performance compared to a centralized solution.

5.2. Modularity and adaptivity:

- Reconfiguration time: Three main functionalities are handled by our modular controller: observation, taking the reconfiguration decision, and notification. The main objective of proposing a modular controller is to ease the reconfiguration of some sub-modules of our distributed controller without needing to change the others.




Fig. 6. Modeling of our reconfigurable architecture

Since an unused HwAcc can be utilized by dynamic partial reconfiguration for other functions, Motion Estimation and DCT can be performed concurrently. In this case, the accelerator's controller does not change entirely: the kind of data observed by the Event Observer module is still a pixel MacroBlock, so only the decision functionality changes and adapts to the ME's constraints. Thus, in this case, we gain in terms of reconfiguration time.

- Energy consumption gain: The Event Observer module can detect the current activity of a processing element (hardware accelerator) and the related power consumption, and sends this information to the decision module. If the total consumption exceeds a certain threshold, the monitor can decide that a reconfiguration is needed and sends the decision to the processor responsible for launching the reconfiguration. A reconfiguration can, for example, load another version of the same algorithm with lower performance; for an image processing application this would mean loading, for example, a filter that is less precise than the previous one. Moreover, in order to process a High Definition frame (the Event Observer module's input is an HD frame), we are dealing with intensive computation and thus an increased degree of parallelism, so we need to increase the number of processing elements, which leads to higher performance. However, when there is not enough energy in the battery to carry on with the same degree of parallelism, decreasing the number of operational processing elements is required. This reconfiguration consists of switching from a high-resolution mode to a lower one (Standard Definition frame).

In this case, such information (battery level) is transmitted from the Event Observer module to the decision module, which decides to turn off its corresponding accelerator so that it no longer consumes clock cycles. Obviously, a controller of a partial reconfigurable region (PRR) has to collaborate and communicate with the other PRR controllers (as any local controller does), and, as in many real situations, the use of a master controller is unavoidable. In our architecture, the MicroBlaze is considered as the master controller. While the control mechanism depends on QoS choices, the MicroBlaze can propagate changes, due to resource constraints of the targeted hardware or to configuration constraints, to the local controllers in order to turn processing elements on or off for each reconfigurable region. In the design model, these changes are modeled using the Component state attribute of the MARTE HRM HwComponent stereotype, as shown in Figure 6. A Component state of a partial dynamic region (PRR) set to Storage instead of Operating means that this PRR is disabled or inactive and does not consume clock cycles. The changes can also depend on other environmental criteria such as communication quality, time and area consumed for reconfiguration, and energy consumption levels. In our proposed solution, reducing the number of accelerators (e.g., the DCT dimension from 8x8 to 1x1) can reduce both the overall power consumption and the total number of clock cycles required by the running module. Thus, our proposed solution can be beneficial in terms of hardware resources and power consumption. The FPGA does not need to be stopped while changing the configuration, which is important for many image/video applications. Finally, an efficient reconfiguration should find a trade-off between performance and power.
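A minimal sketch of this kind of decision logic (ours, not the implemented controller; the thresholds, state names and lower-precision fallback are hypothetical) could be:

```cpp
// Hypothetical decision-module logic: switch to a lower-precision or disabled
// configuration when power or battery constraints are violated.
enum PrrState { OPERATING, STORAGE };   // STORAGE: PRR disabled, no clock cycles consumed

struct Observation {
    double power_mw;       // measured consumption of the accelerator
    double battery_level;  // 0.0 .. 1.0
};

struct Decision {
    bool     reconfigure;
    PrrState next_state;
    int      dct_size;     // zonal coding: 8, 4, 2, 1
};

Decision decide(const Observation& o, double power_threshold_mw, int current_dct_size) {
    Decision d{false, OPERATING, current_dct_size};
    if (o.battery_level < 0.1) {
        // Not enough energy: drop the parallelism by switching this PRR off.
        d = {true, STORAGE, current_dct_size};
    } else if (o.power_mw > power_threshold_mw && current_dct_size > 1) {
        // Load a less precise (smaller) variant of the same algorithm.
        d = {true, OPERATING, current_dct_size / 2};
    }
    return d;
}
```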

6. CONCLUSION AND FUTURE WORKS

Dynamic and partial reconfiguration is supported by more and more recent FPGA platforms. Controlling reconfigurations in such systems becomes one of the most difficult and important issues because of its impact on system performance. We presented in this paper a new high-level approach for modeling and designing dynamic reconfiguration controllers, starting from the UML MARTE profile. We validated our design approach with a case study consisting of P frame (predicted frame) H.264 video encoding. Finally, we illustrated the advantages of our solution with some examples. Our proposed controller is implemented as a distributed reconfigurable hardware component. Its modular and flexible structure makes it reusable and permits implementation code generation using an MDE approach. Thus, these properties help to decrease the design effort and time to market of complex partially and dynamically reconfigurable systems. As future work, we plan to carry out MDE model transformations to enable automatic code generation of our distributed and adaptive controller for a targeted FPGA. The code can then be used as input for commercial tools for final FPGA synthesis.

7. REFERENCES

[1] OMG. Modeling and analysis of real-time and embedded systems (MARTE). http://www.omgmarte.org/.
[2] S. Taha, A. Radermacher, S. Gerard, and J.-L. Dekeyser. An open framework for detailed hardware modeling. In IEEE proceedings SIES'2007, pages 118–125. IEEE, 2007.
[3] E. Piel et al. Gaspard2: from MARTE to SystemC Simulation. In MARTE Workshop at DATE'08, March 2008.
[4] A. Koudri et al. Using MARTE in the MoPCoM SoC/SoPC co-methodology. In MARTE Workshop at DATE'08, 2008.
[5] J. Vidal, F. De Lamotte, and G. Gogniat. A co-design approach for embedded system modeling and code generation with UML and MARTE. In Design, Automation and Test in Europe (DATE'09), pages 226–231, 2009.
[6] S. Pillement and D. Chillet. High-level model of dynamically reconfigurable architectures. In Conference on Design and Architectures for Signal and Image Processing (DASIP'09), pages 1–7, 2009.
[7] D. Missal, M. Hirsch, and H.-M. Hanisch. Hierarchical distributed controllers design and verification. Emerging Technologies and Factory Automation, 2007. ETFA. IEEE Conference on, page 657, September 2007.

[8] S. Karras T. Pfeiffer V. Vyatkin, H.-M. Hanisch and V. Dubinin. Rapid engineering and reconfiguration of automation objects aided by formal modelling and verification. International Journal of Manufacturing Research, 1:382 – 404, 2006. [9] J.-L. Dekeyser, A. Gamati´e, S. Meftali, and I. R. Quadri. Heterogeneous Embedded Systems - Design Theory and Practice, chapter High-level modeling of dynamically reconfigurable heterogeneous systems. Springer, 2010. [10] V.Vyatkin and M.Hirsch and H .-M. Hanisch. Systematic Design and Implementation of Distributed Controllers in Industrial Automation. Emerging Technologies and Factory Automation, 2006. IEEE Conference on, page 633, September 2006. [11] Kumpati S. Narendra and Osvaldo A. Driollet. Adaptive control using multiple models, switching, and tuning. Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000, page 159 – 164, October 2000. [12] E. Bruneton, T. Coupaye, M. Leclercq, V. Qu´ema, and J. Stefani. The fractal component model and its support in java: Experiences with auto-adaptive and reconfigurable systems. Software: Practice and Experience, 36(11-12):1257–1284, 2006. [13] V. Bhaskaran and K. Konstantinides. Image and Video Compression Standards - Algorithms and Architectures. Kluwer Academic Publishers,1997. [14] Yuri V. Ivanov and C. J. Bleakley. Real-time h.264 video encoding in software with fast mode decision and dynamic complexity control. ACM Transactions on Multimedia Computing, Communications and Applications, 6:Article 5, February 2010. [15] W.Burleson, P.Jain, and S.Venkatraman. Dynamically parameterized architectures for power-aware video coding: Motion estimation and dct. Digital and Computational Video, 2001. Proceedings. Second International Workshop on, page 4, Feb 2001. [16] C.H. Smith W.H. Chen and S. Fralick. A fast computational algorithm for the discrete cosine transform. IEEE Trans. Commun., COM-25:1004–1009, September 1977. [17] G. J. Sullivan, P. Topiwala, and A. Luthra. The h.264/avc advanced video coding standard:overview and introduction to the fidelity range extensions. SPIE Conference on Applications of Digital Image Processing XXVII, Special Session on Advances in the New Emerging Standard: H.264/AVC, 5558:454–474, August 2004. [18] J. Huang and M.Parris and J. Lee and R. F. DeMara. Scalable FPGA-based Architecture for DCT Computation

Using Dynamic Partial Reconfiguration. ACM Transactions on Embedded Computing Systems, pages 1–8, December 2008.


Methodology for Designing Partially Reconfigurable Systems Using Transaction-Level Modeling

François Duhem, Fabrice Muller, Philippe Lorenzini
University of Nice-Sophia Antipolis - LEAT/CNRS
e-mail: {Francois.Duhem, Fabrice.Muller, Philippe.Lorenzini}@unice.fr

Abstract—There is a lack of tools for modeling partially reconfigurable FPGAs at a high level of abstraction while giving the designer a sufficient degree of freedom to test scheduling algorithms. In this paper, we present our methodology to fill this gap and take partial reconfiguration into account in high-level modeling with SystemC. Our approach relies on dynamic threads to change the functionality of modules during runtime, and on transaction-level modeling for all the communications. We introduce a reconfiguration manager to develop and validate scheduling algorithms for hardware task management. Moreover, our simulator performs design space exploration in order to find a viable implementation (in terms of reconfigurable zones) for a given application. Our methodology is validated with the modeling of a dynamically reconfigurable video transcoding chain.

I. INTRODUCTION

Nowadays, the constant evolution of embedded systems implies a reduction of the time to market for new products. Designers tend to raise the abstraction level in order to hide the implementation details that may not be relevant during the early stages of development. Moreover, chip capacity is increasing and Systems-on-Chip (SoC) are getting more and more complex, including one or more microprocessors, memory, peripheral interfaces and so on. When the system is implemented on a programmable device, for instance a Field Programmable Gate Array (FPGA), we talk about a System on Programmable Chip (SoPC). SoPCs tend to be an intermediate solution between a full software solution, which is very slow, and an Application Specific Integrated Circuit (ASIC) solution, more powerful but also much more expensive [1]. The latest SoPCs using FPGAs may benefit from Partial Reconfiguration (PR). This feature, included in state-of-the-art Xilinx FPGAs, allows a part of the FPGA to be dynamically reconfigured while the remaining logic keeps running. This technique is of great interest when implementing greedy applications like video transcoding: resources may be mutualised so that a smaller FPGA may be used, while reducing power consumption [2]. However, despite these interesting properties, PR is not well established in the industry yet [3]. One of the main issues is that PR does not come with an easy way to model its behaviour and the associated control at a high level of abstraction. It is thus impossible to validate an approach during the early stages of the development process. A model-based approach allows an architecture to be chosen and validated quickly, avoiding costly changes during more advanced phases of the project.

Besides, it is currently not possible to test and/or validate a scheduling algorithm with the existing works aimed at bringing PR support into modeling tools. In this context, we want to develop a methodology for high-level modeling of dynamically reconfigurable architectures. We present our approach, based on the C++ open-source library SystemC [4]. The main objective is to provide an easy way to develop scheduling strategies for hardware task management in a dynamically reconfigurable system. We also want to introduce Design Space Exploration (DSE) in our methodology in order to provide the developer with a design flow completely integrated in the current Xilinx PR flow. Moreover, the methodology has to be independent from any other library or extension. This approach will be validated using a video transcoding chain application.

The remainder of the paper is structured as follows: in Sect. II, we discuss works related to partial reconfiguration modeling. Section III introduces our approach to model dynamically reconfigurable systems and Sect. IV presents the application used to validate it. Finally, we discuss further improvements of our work in Sect. V.

II. RELATED WORKS

The works carried out to provide PR support in static modeling tools are mainly focused on two major tools, SystemC and the Unified Modeling Language (UML) [5]. UML is a graphical and object-oriented modeling language initially dedicated to software modeling. Nevertheless, it is also widely used for hardware and mixed systems, mostly because of the numerous tools supporting UML and because of its extension mechanism, used to customize UML models. For embedded systems, there is the MARTE (Modeling and Analysis of Real-Time Embedded Systems) profile [6]. MARTE extends UML to real-time embedded systems. Still, MARTE does not natively support PR modeling. The works carried out in [7], [8] present a MARTE-based approach to address this lack using the Gaspard2 framework. The authors modified some MARTE concepts in order to add the missing PR aspects. For instance, they introduced an attribute that describes the nature of an area as either static or dynamically reconfigurable. As far as we know, it is currently impossible to develop and validate scheduling strategies using these works. Another MARTE-based approach is described in [9]. It relies on the MoPCoM project, which defined a co-design methodology based on MDA (Model-Driven Architecture).


Modeling is done on three levels, from a computation-independent model down to a platform-specific model. Each level follows a Y-chart design, separating the application model and the platform model, and merging both into an allocated model. Partial reconfiguration support is introduced at the allocation level, where several application components are mapped to a single physical component. This methodology still requires the user to manually define which components may share a reconfigurable zone, and lacks design space exploration.

SystemC provides tools for the design and verification of hardware, software or mixed systems. It has become a de facto standard in the industry because it is based on a language well known in the developer community [10]. It allows describing a system at different levels of abstraction, from the Register Transfer Level (RTL) up to functional models that may be timed or untimed. Transaction Level Modeling (TLM) [11] is another paradigm where communications are simplified and represented as transactions, thus reducing simulation time. The authors of [12] present their approach to model PR as a SystemC library, called ReChannel. It mainly consists in instantiating the different possible modules for each reconfigurable zone, but activating only one module at a time (Dynamic Circuit Switching). The connection between the pool of reconfigurable modules and the remainder of the system is done through a portal that can handle the basic channels defined in SystemC. ReChannel also defines a control class to manage reconfiguration using dedicated portals. However, the tasks are assigned once at the beginning of the simulation, which means that it is not possible to move a task from one reconfigurable zone to another. The works carried out in ADRIATIC (Advanced Methodology for Designing Reconfigurable SoC and Application Targeted IP-entities in wireless Communications) led to the concept of the Dynamic Reconfigurable Fabric [13]. The fabric contains several contexts and is able to dynamically switch from one to another, reminiscent of the ReChannel approach. The flow takes as input a SystemC description of a static system and transforms it into code implementing a reconfigurable module fabric. Candidate modules for a dynamic implementation are chosen by their common interface. The authors of [14] enhanced these works by adding a configuration scheduler and taking into account the time overhead associated with a reconfiguration. In [15], the authors added PR support to the OSSS (Oldenburg System Synthesis Subset) extension, proposing the automatic synthesis of a reconfigurable system. It uses the C++ polymorphism concept to assign different modules to one reconfigurable zone. This approach also takes into account the reconfiguration and context switch times to provide a time-accurate model. However, tasks are assigned to a reconfigurable zone within a Reconfigurable Object in a static way. We want to free ourselves from this constraint and allow dynamic placement of each task with respect to its resource and timing constraints. The authors of [16] use a concept introduced in SystemC 2.1: dynamic threads.

Fig. 1. Improved PR design flow

In contrast to static threads, dynamic threads may be spawned at runtime, not only during the elaboration phase. A reconfigurable module is composed of two dynamic threads: one for the actual user process and one for its control (creation and destruction). In our opinion, this method is the most flexible, and therefore we base our approach on dynamic threads. Moreover, the authors define the concept of a dynamic port, making the construction of a port possible after the elaboration phase. This opens the way for reconfigurable modules with different interfaces.

III. OUR APPROACH

A. Overview & Design Flow

We base our approach on the separation of concerns between the application and the architecture. It results in a Y-chart based approach that merges our application model and our architecture model into an allocated model that provides the user with a viable implementation of the application, in compliance with a defined scheduling strategy. The modeling methodology is detailed in III-B. Our simulator may be used at two different design stages. First of all, it can be used independently during the very first stages of development in order to study the feasibility and the potential benefits of a partially reconfigurable design. In this case, the simulator inputs are only roughly estimated, but they provide a good overview of the final reconfigurable system. Then, it may also be used during the implementation phase, fully integrated in an existing PR design flow, as described in Fig. 1. Typically, once the netlists are generated during the synthesis phase, the developer needs to manually define the reconfigurable zones available on the FPGA. We propose to replace this phase with our simulator by inferring the relevant information from the files generated during the synthesis.


Fig. 2. Our approach

Fig. 3. Module architecture

For instance, the netlists generated for every task in the application can be used to retrieve information about the resource requirements, feeding the application model. This model also uses a dataflow graph representing the links between tasks, and timing information. This information represents the task deadlines, determined by the application specifications, and the reconfiguration time overhead. This time overhead can be estimated using the work we carried out on an efficient reconfiguration manager coupled with an accurate cost model [17]. The architecture model only needs the targeted device to infer potential reconfigurable zones. The simulator results in a set of RZs fulfilling the requirements, which will be used to feed the last stages of the flow, i.e. hardware implementation and bitstream generation.

B. Modeling scheme

Partial reconfiguration shows two major aspects we want to take care of in our modeling methodology: functionality changing, where PR is used to switch between different implementations of a single module, and resource sharing, where PR is used to share a pool of resources between several modules. Dealing with these aspects in an efficient manner may require some kind of control at the FPGA level. Thus, we introduce a reconfiguration manager, split into two levels: one low-level controller in the FPGA and a higher-level layer that is smarter and controls hardware tasks either from inside or outside the FPGA. Figure 2 describes our Y-chart based approach, separating the application, architecture and allocated models. The application is first described in a static way (application model). In this example, the application is composed of four modules activated sequentially and looping in time over and over again, with two possible implementations for the second module. The application model needs to know the amount of data transferred between modules. In the example, module C receives data blocks of eight bytes while the output consists of sixteen-byte blocks. The modules are also described by their resource requirements (e.g. CLB, BRAM and DSP48 columns). Then, the targeted platform is described (platform model).
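Before turning to the platform model in more detail, the per-task information gathered above could, for illustration only, be captured in a structure like the following (hypothetical names; the simulator's actual data structures are not given in the paper):

```cpp
// Hypothetical application-model entry for one task: resources inferred from the
// synthesis netlist, deadline from the specification, reconfiguration overhead
// estimated as in [17], and dataflow edges to the successor tasks.
#include <vector>

struct TaskModel {
    const char*      name;           // e.g. "inverse_transform"
    int              clb, bram, dsp; // resource requirements
    double           exec_time_us;   // task execution time
    double           deadline_us;    // timing constraint
    double           reconf_time_us; // reconfiguration time overhead
    std::vector<int> successors;     // dataflow graph edges (task indices)
};
```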


Since our methodology only focuses on the reconfigurable part of the system, which should be implemented into an FPGA, the platform model consists of the description of the available Reconfigurable Zones (RZs) in terms of resources. Then, during simulation, the manager maps the modules onto the platform described (allocated model). In this example, the first RZ can host modules A, C and D while the second one can host B1, B2 and D. A possible scheduling scheme is also presented in Fig. 2. It introduces blank bitstreams for each RZ that clear the area but, more importantly, reduce power consumption. Here, the second RZ is kept as inactive as possible since it is the biggest one, therefore favouring low power consumption. The lower-level details of the reconfiguration manager are directly included in the module as relevant timing information, such as the reconfiguration time overhead or the context switch time. The manager present in the model is in fact the higher-level layer in charge of scheduling.

We base our model on three main concepts. First, a base reconfigurable module represents a hardware task and provides interfaces to communicate with other modules, as well as a dedicated interface to communicate with the reconfiguration manager. This manager is in charge of the reconfigurations of every module it is connected to. Finally, all the communications in this model are described using TLM-2.0 to raise the abstraction level and speed up simulation.

C. Reconfigurable module

Our typical module is based on SystemC's dynamic threads, used to switch between tasks and change the functionality of a module during runtime. This method is by far the most flexible and most complete. Figure 3 shows the complete module architecture. Let us describe the components of this module. First, it contains an input and an output interface, used to communicate between modules. In the TLM standard, they are referred to respectively as target and initiator sockets. A more detailed description will be provided in Section III-E. The number of I/O sockets may be adapted to the application needs. Each interface is associated with a pool of functions, Socket control, used to connect the sockets to the module. There is also a manager interface, which controls the partial reconfiguration aspects. It is composed of two sockets for a two-way communication. In order to model the dynamic behaviour of the tasks, the module is also built above two dynamic threads. The first one is User Algorithm and represents the functionality of a dynamically reconfigurable module.


It is spawned during runtime and may be destroyed if no longer needed. The second one is Reconf Control: it communicates with the configuration interface and is responsible for the creation and destruction of User Algorithm. This thread is defined dynamically to ensure flexibility in the management of algorithm threads. Both dynamic threads are defined outside the module class. Therefore, we declare a C++ interface for each thread that provides a set of services to access the contents of the module (for instance, retrieving the data pointer or class attributes like the task execution time).
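The following SystemC sketch (ours, with hypothetical names; it replaces the paper's C++ service interfaces with a plain function pointer for brevity) illustrates how sc_spawn can create and replace the User Algorithm thread at runtime:

```cpp
// Sketch of a reconfigurable module whose user algorithm is a dynamic thread,
// spawned and re-spawned at runtime by a Reconf Control thread.
#define SC_INCLUDE_DYNAMIC_PROCESSES
#include <systemc.h>

SC_MODULE(ReconfModule) {
    sc_event reconfigure_ev;                 // raised through the manager interface
    sc_process_handle user_proc;             // handle of the current User Algorithm
    void (*current_algo)(ReconfModule*);     // functionality currently configured
                                             // (set by the manager before notifying)

    // Reconf Control: destroys the old algorithm thread and spawns the new one.
    void reconf_control() {
        while (true) {
            wait(reconfigure_ev);
            if (user_proc.valid()) user_proc.kill();            // destroy old task
            wait(10, SC_US);                                    // reconfiguration overhead
            user_proc = sc_spawn(sc_bind(current_algo, this));  // new User Algorithm
        }
    }

    SC_CTOR(ReconfModule) : current_algo(nullptr) {
        SC_THREAD(reconf_control);
    }
};
```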

Fig. 4. Interactions between tasks and the reconfiguration manager

Fig. 5. Preemption example to model resources sharing

D. Reconfiguration manager

As mentioned previously, the reconfiguration manager is in charge of the higher-level aspects of reconfiguration, i.e. mostly hardware task scheduling. The manager is aware of the system architecture, described in terms of reconfigurable zones. From an application point of view, the manager has to know which parts are static and which ones are dynamically reconfigured. Using this information, the manager is able to dynamically place the reconfigurable modules and perform time scheduling on them. The placement of a task is decided when the task has to be executed, i.e. when the previous task is done with its execution (or, in the case of the first task of the application, when the first packets are sent). Figure 4 shows the interactions between tasks and the reconfiguration manager leading to their mapping and scheduling on the FPGA. The previous module, done with its execution, has to wait for the next task to be configured in order to send the data and thus terminate the transaction. To proceed, this module notifies the manager of the end of its execution. At this point, the manager, aware of the modules currently configured on the FPGA, schedules the execution of the next task. Several cases may occur here. First, if the task is not present on the FPGA, the manager searches for a compatible RZ, either blank or unused (i.e. the task is configured on the RZ but not running). If such a RZ is found, the manager's reconfiguration engine reconfigures the RZ. Once the process is complete, the manager notifies the previous module that it may send the data to the next module. If the task is present on the FPGA, it may be either running or idle. If idle, the manager just notifies the previous module. If running, the task is put into a waiting queue, since it is not yet possible to handle several implementations of the same task on the FPGA. The manager communicates with each module using two dedicated sockets for the up and down links. The down link, responsible for module configuration, is used to configure the modules at the beginning of the simulation or when a reconfiguration is performed. To sum up, the down link is used to model the functionality-switching aspect of PR. The up link is used by the modules to notify the reconfiguration manager that they are done processing. It also means that the data is ready to be sent to the next module. So, at this point, the manager will try to instantiate the next module on the FPGA.
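The placement decision just described could be sketched as follows (our illustration with hypothetical types; the real manager additionally models reconfiguration and notification delays):

```cpp
// Hypothetical sketch of the manager's placement decision when a task becomes ready.
#include <queue>
#include <vector>

struct Task { int id; };
struct RZ   { int id; int configured_task; bool running; };   // configured_task == -1: blank

enum Action { NOTIFY_PREVIOUS, RECONFIGURE_THEN_NOTIFY, ENQUEUE };

Action schedule(const Task& next, std::vector<RZ>& rzs, std::queue<Task>& waiting,
                int& chosen_rz, bool (*compatible)(const Task&, const RZ&)) {
    // Case 1: the task is already configured on some RZ.
    for (auto& rz : rzs) {
        if (rz.configured_task == next.id) {
            if (rz.running) { waiting.push(next); return ENQUEUE; } // already running: wait
            chosen_rz = rz.id;                                      // idle: reuse it as is
            return NOTIFY_PREVIOUS;
        }
    }
    // Case 2: look for a blank or unused compatible RZ and reconfigure it.
    for (auto& rz : rzs) {
        if (!rz.running && compatible(next, rz)) {
            chosen_rz = rz.id;
            return RECONFIGURE_THEN_NOTIFY;
        }
    }
    // Case 3: no RZ available for now.
    waiting.push(next);
    return ENQUEUE;
}
```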


We might also take care of preemption in our model. To do so, we had to face a potential issue: preemption cannot be done anywhere during the task execution. Indeed, the task may be in an atomic processing phase. At the very least, it is necessary to save the context of the task, as is done in software, in order to start from this point when the task is run again. We chose to approach the problem the other way around: the developer has to explicitly define some switch points in the user algorithm where the task can be interrupted, corresponding to points where a context save is easily done. Another switch point is implicitly defined at the end of the task execution. Fig. 5 shows an example of communication between a task and the manager. When reaching a switch point, the module uses the up link to notify the manager that it is possible to switch tasks in this RZ. Then, the manager decides whether a task switch is necessary and responds to the module with the appropriate command (resume or stop). If the task has to be stopped, the manager sends a start command to the next module.

E. Communication

All the communications in our model (between modules and from the reconfiguration manager to a module) are described using the OSCI TLM-2.0 standard. It is an abstraction of the RTL level where the events that may occur during simulation are replaced by function calls, resulting in a very interesting gain in simulation time. It is also useful during the early phases of a product development, where the implementation details may not be established yet: the developer may perform architectural exploration in order to quickly choose an architecture. Even though TLM-2.0 is primarily suited to memory-mapped bus modeling (especially SoC virtual platforms), it provides several mechanisms to facilitate the modeling of other kinds of systems.
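As an illustration of how such TLM-2.0 transactions can carry application-specific data (as is done later in this section for the video stream), a generic payload extension might look like the following sketch (hypothetical field names, not the project's actual class):

```cpp
// Hypothetical TLM-2.0 extension carrying video-stream information along with
// the generic payload exchanged between reconfigurable modules.
#include <systemc>
#include <tlm>

struct VideoStreamExtension : tlm::tlm_extension<VideoStreamExtension> {
    int frame_width  = 0;
    int frame_height = 0;
    int codec_id     = 0;   // e.g. H.264, MPEG-2, VC-1 (application-defined encoding)

    tlm::tlm_extension_base* clone() const override {
        return new VideoStreamExtension(*this);
    }
    void copy_from(const tlm::tlm_extension_base& other) override {
        const auto& o = static_cast<const VideoStreamExtension&>(other);
        frame_width = o.frame_width; frame_height = o.frame_height; codec_id = o.codec_id;
    }
};

// Attaching the extension to a transaction before sending it:
//   tlm::tlm_generic_payload trans;
//   auto* ext = new VideoStreamExtension;
//   ext->frame_width = 1920; ext->frame_height = 1080;
//   trans.set_extension(ext);          // the target retrieves it with get_extension()
```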


Fig. 6. Pipeline between modules

TLM-2.0 supports two coding styles: loosely-timed and approximately-timed. The first one makes use of the blocking transport interface, which allows only two timing points per transaction, corresponding respectively to the call and the return of the transport function. On the contrary, the approximately-timed coding style is supported by the non-blocking transport interface, which provides multiple phases and timing points. As we want to model the parallel behaviour of the tasks inside the FPGA, we chose the latter in order to leverage the non-blocking transport interface. TLM provides a protocol associated with the interface and the coding style we chose in order to model the communication between an initiator socket and a target socket: the Base protocol [18]. Even though it is possible to define our own protocol, built for our needs, we wanted to keep our model as standard as possible. The Base protocol splits a transaction into four phases. First, the initiator asks the target to start a new transaction (phase BEGIN_REQ). When the target receives this message, it can respond to the initiator during the second phase (END_REQ), either accepting or rejecting the request. At this point, the target can proceed with the request and analyze the transaction. After the processing is done (in our case, the user algorithm corresponding to the reconfigurable module), the target may enter the third phase of the protocol, where it responds to the initiator (BEGIN_RESP). Finally, the initiator socket ends the transaction by sending the appropriate message (END_RESP). Page 26 of [18] provides an illustration of the Base protocol between an initiator and a target socket that sums up our use of the protocol. Each call from the initiator is followed by a response on the return path from the target socket. We always issue a TLM_ACCEPTED command, used for acknowledgment purposes. We prefer to use the backward path (from the target to the initiator) since it provides much more flexibility in the protocol. Figure 6 presents a possible execution scheme for four modules and shows the theoretical pipeline between them when treating four packets. Calls are grouped into requests and responses. We can see that after module two has processed the first packet, it responds to module one after issuing a request towards module three in order to ensure concurrency. It is possible to associate delays with each phase of the transaction. These delays may correspond to memory accesses or to the time spent transferring data from one module to another. In order to enhance our model, we add our own delays in the user algorithm (see Sect. III-C) to model the task computation time, as well as the reconfiguration time if necessary. As the TLM standard recommends using its transaction type in order to keep the design standard and to avoid a new type definition, we decided to use the generic payload, closely related to the Base protocol and well suited to memory-mapped buses. For instance, a generic payload contains fields for the address, the access type, the data pointer, the data length, burst access and so on. However, it also comes with a convenient extension mechanism. The aim of this mechanism is to provide the developer with the capability of adding information to the transactions. In fact, it allows any type of information to be transferred along with the transaction. The extension may contain relevant information about the data stream. In our transcoding chain, an extension is used to pass information related to the video stream.

F. Design Space Exploration

As mentioned previously, our aim is to provide the designer with a viable solution that will meet the application timing constraints once implemented on the FPGA. To ensure this, the model performs design space exploration on the architecture, i.e. on the RZs. It is able to simulate different scenarios (different numbers of RZs) until the simulation succeeds. In order to do so, the simulator has to be fed with a pre-defined pool of RZs. This RZ pool has to be heterogeneous in order to succeed with the placement of every task in the application. It is described in a configuration file that is parsed at the beginning of the simulation. A subset of this pool is chosen to run the simulation. However, there are some requirements for this selection that have to be taken into account. The first and most obvious one is that the set must be able to host the entire application: if even one module does not fit any RZ, then the application cannot be implemented. In addition, we may want to choose RZs that may host several modules in order to enhance the performance of the system and to ease the mapping. These observations lead us to the following: the simulator has to sort the RZs by descending compatibility with the application. Here, the compatibility corresponds to the number of tasks that may be implemented on the given RZ. Even if the RZ pool is sorted, it is still not possible to select only the first RZs for the simulation. In fact, some tasks may not be implementable on any of them, hence not fulfilling the first constraint. In order to satisfy this constraint, we must first select the RZs that are the only ones able to fit a module (i.e. the task is only compatible with one RZ). After this step, we may choose the remaining RZs from the rest of the pool. The simulation cogency lies in the task deadline verification: each time a task finishes its execution, the reconfiguration manager checks whether its timing constraint is met. If not, it stops the current simulation and allows the simulator to increment the number of available RZs. Another very important point to keep in mind is that we currently do not perform any type of verification on the selected RZs: we assume here that the RZs do not overlap. If the RZ pool is not correctly defined, then it will most certainly result in a failure during the design implementation.
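The RZ selection step described above could be sketched as follows (our illustration; the data structures and the compatibility test are hypothetical placeholders for the simulator's internals):

```cpp
// Hypothetical sketch of the RZ subset selection: first keep the RZs that are the
// only possible host of some task, then fill up by descending compatibility.
#include <algorithm>
#include <vector>

struct RZDesc   { int id; int clb, bram, dsp; };
struct TaskDesc { int id; int clb, bram, dsp; };

static bool fits(const TaskDesc& t, const RZDesc& rz) {
    return t.clb <= rz.clb && t.bram <= rz.bram && t.dsp <= rz.dsp;
}

std::vector<RZDesc> select_rzs(std::vector<RZDesc> pool,
                               const std::vector<TaskDesc>& tasks, size_t how_many) {
    // Number of tasks an RZ can host = its compatibility with the application.
    auto compatibility = [&](const RZDesc& rz) {
        return std::count_if(tasks.begin(), tasks.end(),
                             [&](const TaskDesc& t) { return fits(t, rz); });
    };
    std::sort(pool.begin(), pool.end(),               // descending compatibility
              [&](const RZDesc& a, const RZDesc& b) { return compatibility(a) > compatibility(b); });

    std::vector<RZDesc> selected;
    auto already_selected = [&](int id) {
        return std::any_of(selected.begin(), selected.end(),
                           [&](const RZDesc& s) { return s.id == id; });
    };
    for (const auto& t : tasks) {                     // mandatory RZs: sole host of a task
        const RZDesc* only = nullptr; int hosts = 0;
        for (const auto& rz : pool) if (fits(t, rz)) { only = &rz; ++hosts; }
        if (hosts == 1 && !already_selected(only->id)) selected.push_back(*only);
    }
    for (const auto& rz : pool) {                     // complete with the best remaining RZs
        if (selected.size() >= how_many) break;
        if (!already_selected(rz.id)) selected.push_back(rz);
    }
    return selected;
}
```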

for the address, the access type, data pointer, data length, burst access and so on. However, it also comes with a convenient extension mechanism. The aim of this mechanism is to provide the developer with the capability of adding information to the transactions. In fact, it allows any type of information to be transferred along with the transaction. The extension may contain relevant information about the data stream. In our transcoding chain, an extension is used to pass information related to the video stream. F. Design Space Exploration As mentioned previously, our aim is to provide the designer with a viable solution that will meet the application timing constraints once implemented on the FPGA. To ensure this, the model performs design space exploration on the architecture, i.e. the RZs. It is able to simulate different scenarios (different number of RZs) until the simulation succeeds. In order to do so, the simulator has to be fed with a pre-defined pool of RZs. This RZ pool has to be heterogeneous in order to succeed with the placement of every task in the application. It is described in a configuration file that is parsed at the beginning of the simulation. A subset of this pool is chosen to run the simulation. However, there are some requirements for this selection that are to be taken into account. The first and most obvious one is that the set may host the entire application. If even one module does not fit any RZ, then the application cannot be implemented. In addition, we may want to choose RZs that may host several modules in order to enhance performance of the system and to ease the mapping. These observations lead us to the following: the simulation has to sort the RZs by descending compatibility with the application. Here, the compatibility corresponds to the number of tasks that may be implemented on the given RZ. Even if the RZ pool is sorted, it is still not possible to select only the first RZs for the simulation. In fact, some tasks may not be implemented on any RZ, hence not fulfilling the first constraint. In order to satisfy this constraint, we must select the RZs that are the only one to fit a module (i.e. the task is only compatible with one RZ). After this step, we may choose the remainder RZs from the rest of the pool. The simulation cogency lies in the task deadline verification. Each time a task finishes its execution, the reconfiguration manager checks whether its timing constraint is met. If not, it stops the current simulation and allows the simulator to increment the number of available RZs. Another very important point to keep in mind is that we currently do not perform any type of verification on the selected RZs: we assume here that the RZs do not overlap. If the RZ pool is not correctly defined, then it will most certainly result in a failure during the design implementation. IV. A PPLICATION A. The Home Gateway In the framework of the ARDMAHN project [19], we develop an adaptive home gateway that should provide video streams to different terminals, for instance a portable video player, a computer or a television. Of course, these devices do


Fig. 7. Generic transcoding chain

IV. APPLICATION

A. The Home Gateway

In the framework of the ARDMAHN project [19], we develop an adaptive home gateway that provides video streams to different terminals, for instance a portable video player, a computer or a television. Of course, these devices do not have the same video encoding and/or size requirements, so it becomes necessary to adapt the video streams before transferring them to the terminals. Once again, the heterogeneity of the target devices implies numerous use cases for the transcoding chain, and it is impossible to implement all of them in a single SoC. Therefore, the FPGA's partial reconfiguration feature is used to change the functionality of the transcoder (from one use case to another) and to mutualise resources inside the transcoder (for energy and/or resource savings). Consequently, our test application consists of a video transcoding chain handling several video codecs.

B. Transcoding chains

The application focuses on the following use cases: H.264 to H.264, H.264 to MPEG-2, MPEG-2 to H.264 and VC-1 to H.264. Even though the video codecs are different, the chains share a similar architecture, as depicted in Fig. 7. Let us take the example of H.264 and MPEG-2 decoding. In the H.264 standard, the entropy coder is a CABAC (Context-Adaptive Binary Arithmetic Coding) and the transform is a DCT (Discrete Cosine Transform), whereas the MPEG-2 standard uses VLC (Variable Length Coding) and an integer transform. Moreover, in the MPEG-2 decoder, there is no intra prediction or de-blocking filter. Therefore, if there has to be a switch between an MPEG-2 and an H.264 decoder, these modules have to be modified by the reconfiguration manager. To put it another way, some reconfigurable modules are associated with several implementations corresponding to each use case. Note that we simplified the dataflow graph by grouping some tasks together, thus creating one larger task. For instance, in the generic decoder, we grouped the tasks Intra Prediction, Motion Compensation and Image Storage under the task named Compensation.

C. Results

At this time, the ARDMAHN project is halfway through its completion, and we do not have all the information necessary to provide an accurate model. Thus, we prefer to use made-up values which still validate our approach and our model of computation. We will validate our approach with real values later on in order to get actual implementation details about the application. We manually created a pool of 15 reconfigurable zones to feed the simulator. For the moment, we cannot infer a set of reconfigurable zones directly from an FPGA type. We took care to keep the pool heterogeneous to fit the heterogeneous nature of the FPGA architecture. It also ensures that every task in the application will find a compatible zone. The simulation leads us to an implementation on five reconfigurable zones. It also allows us to evaluate the overall performance of the system. For instance, it is possible to get information about the different modules, the reconfiguration manager, the RZ occupation rates and so on in order to help the development of the scheduling algorithm. It is also possible to evaluate the ratio of configuration time to task execution time. If too high, this ratio may hurt power consumption or even create a bottleneck on the memory where the configuration files are stored. Table I shows the time spent in every possible state for each of the five reconfigurable zones in this simulation.

TABLE I
OCCUPATION RATES (IN % OF SIMULATION TIME)

        | Blank | Reconfiguring | Idle | Running
  RZ 0  |  0.1  |      4.1      | 2.2  |  93.6
  RZ 1  |  0.3  |      4.0      | 2.9  |  92.8
  RZ 2  |  0.5  |      4.1      | 2.9  |  92.5
  RZ 3  |  0.8  |      4.0      | 4.0  |  91.2
  RZ 4  |  1.0  |      4.1      | 4.1  |  90.8

These results were obtained with a scheduling strategy using all the available RZs and treating tasks with a classical round-robin algorithm. The scheduling strategy will most certainly evolve as we develop different algorithms. We can see that, with this scheduling strategy, the RZs spend more than 90% of the time in the running state, representing a good utilization of the zones. Moreover, the simulation terminated within seconds on an Intel Core 2 Quad CPU running under Windows 7 x64, which is a reasonable time overhead considering the time saved compared to trying different implementations directly on the FPGA. The simulator also generates a trace file at the end of the simulation, containing debug information for every module in the chain and for the reconfiguration manager. A sample waveform can be seen in Fig. 8, showing the different states taken by the RZs and by the tasks (e.g. running, idle) during the simulation, underlining the scheduling performed by the reconfiguration manager. It also shows the several reconfigurations that occur physically on the FPGA, configured via the Internal Configuration Access Port (ICAP), a hard macro present on the latest Xilinx FPGAs. In our case, the ICAP is managed by our controller, FaRM [17]. For instance, let us consider task T2, Inverse Transform. At time t1, the manager verifies whether the task is present on the FPGA. As it is not present, it requests a reconfiguration from the configuration engine, linked to the ICAP. Therefore, the ICAP switches to the CONF T2 state. Task T2 switches to state Run RZ2, which means that RZ2 has been chosen to host the task (verifiable with the RZ2 state). Between t1 and t2, the manager receives a request for implementing task T1. However, it is not present on the FPGA, so a reconfiguration is needed. However, it is not possible to use FaRM since it

is already reconfiguring RZ2; the reconfiguration engine thus avoids multiple reconfigurations at the same time. Once the first reconfiguration is complete, at time t2, FaRM starts reconfiguring task T1 on RZ4 and RZ2 goes to state Run T2.

Fig. 8. Sample waveform generated during the simulation

V. FUTURE WORKS

For the moment, the designer needs to manually define the pool of reconfigurable zones given to the simulator, which will lead to an implementation using a subset of this pool. We would like to entirely include our simulator in the Xilinx PR design flow. For instance, the simulator would infer the RZ pool from the device used for the first phases of the flow (e.g. netlist generation). It would also infer timing information for every task using the generated netlists. Once a viable solution is found, the simulator would create a UCF file, used to constrain the design given the selected RZs, resulting in a complete flow to design reconfigurable systems without much effort.

VI. CONCLUSION

We propose a Y-chart based methodology to evaluate partially and dynamically reconfigurable systems in SystemC and transaction-level modeling. For that purpose, we use dynamic threads to change the behaviour of a module during runtime. Our methodology also includes design space exploration to find an FPGA implementation of the application and can be fully integrated into traditional PR design flows, between the synthesis and the implementation stages. Our architecture includes a reconfiguration manager used to test and validate scheduling algorithms for the management of hardware tasks. The methodology is tested using a video transcoding chain that needs partial reconfiguration in order to satisfy its numerous use cases. The results led us to implement the transcoding chain on five RZs, chosen from 15 available RZs given their compatibility with the application. The scheduling strategy provides utilization rates for the RZs greater than 90%, meaning that PR is not interfering too much with the application.

ACKNOWLEDGEMENTS

This work was carried out in the framework of the ARDMAHN project [19], sponsored by the ANR, which aims at developing methodologies for home gateways integrating dynamic reconfiguration.

REFERENCES

[1] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Comput. Surv., vol. 34, pp. 171–210, June 2002. [Online]. Available: http://doi.acm.org/10.1145/508352.508353 [2] C. Kao, “Benefits of Partial Reconfiguration,” Xcell Journal, vol. 55, pp. 65–67, 2005. [3] P. Manet, D. Maufroid, L. Tosi, G. Gailliard, O. Mulertt, M. Di Ciano, J.-D. Legat, D. Aulagnier, C. Gamrat, R. Liberati, V. La Barba, P. Cuvelier, B. Rousseau, and P. Gelineau, “An evaluation of dynamic partial reconfiguration for signal and image processing in professional electronics applications,” EURASIP J. Embedded Syst., vol. 2008, pp. 1:1–1:11, January 2008. [Online]. Available: http://dx.doi.org/10.1155/2008/367860 [4] Open SystemC Initiative (OSCI), “SystemC home,” http://www.systemc.org. [5] Object Management Group (OMG), “Unified Modeling Language (UML),” http://www.uml.org/. [6] ——, “MARTE profile,” http://www.omgmarte.org/. [7] I. R. Quadri, S. Meftali, and J.-L. Dekeyser, “MARTE based modeling approach for Partial Dynamic Reconfigurable FPGAs,” in Sixth IEEE Workshop on Embedded Systems for Real-time Multimedia ´ (ESTIMedia 2008), Atlanta Etats-Unis, 10 2008. [Online]. Available: http://hal.inria.fr/inria-00525007/en/ [8] ——, “From MARTE to dynamically reconfigurable FPGAs: Introduction of a control extension in a model based design flow,” INRIA, Research Report RR-6862, 2009. [9] J. Vidal, F. de Lamotte, G. Gogniat, J.-P. Diguet, and P. Soulard, “UML design for dynamically reconfigurable multiprocessor embedded systems,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’10. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2010, pp. 1195–1200. [Online]. Available: http://portal.acm.org/citation.cfm?id=1870926.1871215 [10] J. Gipper, “SystemC: the SoC system-level modeling language,” Embedded Computing Design, may 2007. [11] L. Cai and D. Gajski, “Transaction level modeling: an overview,” in Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, ser. CODES+ISSS ’03. New York, NY, USA: ACM, 2003, pp. 19–24. [Online]. Available: http://doi.acm.org/10.1145/944645.944651 [12] Andreas Raabe, “Describing and Simulating Dynamic Reconfiguration in SystemC Exemplified by a Dedicated 3D Collision Detection Hardware,” Ph.D. dissertation, Bonn University, 2008. ´ “System-Level Modeling of [13] A. Pelkonen, K. Masselos, and M. Cupk, Dynamically Reconfigurable Hardware with SystemC,” Parallel and Distributed Processing Symposium, International, vol. 0, p. 174b, 2003. [14] Y. Qu, K. Tiensyrj¨a, and K. Masselos, “System-Level Modeling of Dynamically Reconfigurable Co-processors,” in Field Programmable Logic and Application, ser. Lecture Notes in Computer Science, J. Becker, M. Platzner, and S. Vernalde, Eds. Springer Berlin / Heidelberg, 2004, vol. 3203, pp. 881–885. [15] A. Schallenberg, W. Nebel, A. Herrholz, P. A. Hartmann, and F. Oppenheimer, “OSSS+R: a framework for application level modelling and synthesis of reconfigurable systems,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’09. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2009, pp. 970–975. [Online]. Available: http://portal.acm.org/citation.cfm?id=1874620.1874857 [16] K. Asano, J. Kitamichi, and K. Kenichi, “Dynamic Module Library for System Level Modeling and Simulation of Dynamically Reconfigurable Systems,” Journal of Computers, vol. 3, pp. 
55–62, feb 2008. [17] F. Duhem, F. Muller, and P. Lorenzini, “FaRM: Fast Reconfiguration Manager for Reducing Reconfiguration Time Overhead on FPGA,” in Reconfigurable Computing: Architectures, Tools and Applications, ser. ARC ’11, 2011, pp. 253–260. [18] Open SystemC Initiative (OSCI), OSCI TLM-2.0 Language Reference Manual, July 2009. [19] ARDMAHN consortium, “ARDMAHN project,” http://ARDMAHN.org/.



Poster Session: Dynamic Architectures & Adaptive Management for Image & Signal Processing

A Framework for the Design of Reconfigurable Fault Tolerant Architectures
Sébastien Pillement, Hung Manh Pham, Olivier Pasquier and Sébastien Le Nours

High-Level Modelling and Automatic Generation of Dynamically Reconfigurable Systems
Gilberto Ochoa, El-Bay Bourennane, Hassan Rabah and Ouassila Labbani


A Framework for the Design of Reconfigurable Fault Tolerant Architectures
Hung Manh Pham, Sébastien Pillement

Olivier Pasquier, Sébastien Le Nours

University of Rennes I / IRISA 6 rue de Kerampont, B.P. 80518 22302 LANNION, FRANCE Email: [email protected]

IREENA – Polytech Nantes, Rue Christian Pauc, B.P 50609 44306 NANTES CEDEX 3, France [email protected]

Abstract—The rapid evolution of reconfigurable electronic product design makes it possible to handle increasingly complex applications. New fields of investigation (e.g. automotive, aerospace, banking) are attractive but require a high level of dependability. This paper proposes a framework for designing reconfigurable architectures that support fault-tolerance mitigation schemes. The proposed framework allows the simulation and validation of mitigation operations, as well as the sizing of architecture resources. The implementation of a fault-tolerant reconfigurable platform validates the proposed model and the effectiveness of the framework, and demonstrates the potential of dynamically reconfigurable architectures for supporting fault tolerance in embedded systems.

I. INTRODUCTION
The criticality of some emerging applications requires the establishment of a dependability policy to ensure the correct functioning of systems and the safety of their users. The solutions widely adopted to address this problem rely on a proliferation of computing resources and on redundant processing. Apart from cost issues, the problem with increasing resources is the hardware utilization rate: the current trend is to add computers whenever a new need emerges, without necessarily reusing already available circuits. Another approach is to implement redundancy in software; traditionally, this management relies on the availability of a real-time operating system, which reduces development phases and ensures the responsiveness of the system.

Moreover, it is now recognized that reconfigurable logic circuits can meet both the performance requirements and the need for application scalability. The main advantage of these solutions lies in the possibility of not multiplying the computing resources: the genericity of the proposed equipment makes it possible to optimize resource usage, particularly during the idle time of the equipment. The CIFAER project [1] aims, among other goals, to demonstrate the value of new dynamically reconfigurable technologies in the automotive domain, and more generally the relevance of reconfiguration for dependability. The first challenge is the study of reconfigurable computer architectures compatible with the cost constraints of the field, as well as with the complexity of the algorithms and their real-time constraints. The second challenge is to evaluate the cost and effectiveness of fault-tolerance mitigation schemes in reconfigurable architectures. In this work we address these two problems through the design of a model that evaluates both points. We present this model and show the effectiveness of the approach by implementing one particular dependability strategy in a Xilinx FPGA. The remainder of the paper is organized as follows. Section II presents the classical fault-tolerance mechanisms used in reconfigurable architectures. Sections III and IV present the considered platform and the related model. Finally, before the conclusions in Section VI, Section V shows implementation results on a multi-FPGA platform.

II. FAULT TOLERANCE IN RECONFIGURABLE ARCHITECTURES

In an electronic system, two kinds of faults can occur: temporary random errors, typically caused by electromagnetic radiation that changes the state of a transistor (an effect called SEU, Single Event Upset), and permanent errors, usually due to aging or variability. The hardware design must consider both types of fault to ensure a given level of dependability. Dynamic reconfiguration offers new strategies, but it also requires new approaches. Traditional approaches for detection and fault tolerance at the component and architectural levels are based on redundancy. Among them are Dual Modular Redundancy (DMR), where the resources are duplicated, and Triple Modular Redundancy (TMR). The difficulty of redundancy comes from the insertion of a majority voter to detect a path failure. Classically, this circuit is hardened in an ASIC designed to withstand SEUs. In an FPGA, no area is protected against SEUs, so the voter itself can fail; the usual answer is to triplicate the voters as well. The synchronized TMR method [2] combines dynamic partial reconfiguration with TMR: a special voter has been developed and, through dynamic partial reconfiguration, a faulty block is reconfigured and re-synchronized. For processor-based systems, the lockstep system is a DMR implementation: two processor cores run the same set of operations in parallel and in a strictly synchronized manner, and their outputs are compared to determine whether an error occurred. Such a system has been implemented using two PowerPCs on a Virtex FPGA [3]. Redundancy comes with a large hardware overhead, and it should be noted that DMR can detect a fault but not correct it. Temporal redundancy, in which a computer performs the same calculation twice and compares the results, is another strategy. This approach only permits error detection and requires a recovery strategy; indeed, if the computer suffers a permanent failure, it will not identify a false calculation, since it makes the error consistently.
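As a minimal illustration of the voting logic discussed above (a software sketch, not the hardened voter circuit of [2]), a bitwise majority vote corrects any single faulty replica, whereas a DMR comparison can only detect a mismatch:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant outputs (integers used as bit vectors).
    A single faulty replica is out-voted on every bit position."""
    return (a & b) | (b & c) | (a & c)

def dmr_mismatch(a: int, b: int) -> bool:
    """DMR comparison: detects a disagreement but cannot tell which replica is faulty."""
    return a != b

# Replica c is corrupted on one bit; the vote still returns the correct value.
assert tmr_vote(0b1010, 0b1010, 0b1110) == 0b1010
assert dmr_mismatch(0b1010, 0b1110)
```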

Strategies for communication are the same as for computing elements, i.e. they are based on physical or temporal redundancy. At the platform level, the main approach is to migrate tasks from a faulty processor to another one, depending on where the failure appears (whether it impacts the communications or the processors) and on the criticality of the tasks. If a processor executing a critical task fails, the task is sent to run on another processor at the expense of a less critical task. The task-migration decision is often centralized on one processor designated as the system master; this centralized approach poses a reliability problem of its own. Modern FPGAs are mainly based on SRAM memory technology, which is highly sensitive to SEUs [4], so protecting the configuration memory is essential. Xilinx FPGAs offer the possibility of reading back the contents of the configuration memory [5]; this technique verifies that the current configuration data are correct. The scrubbing technique [6] periodically reconfigures the FPGA at a frequency higher than the fault occurrence rate; it is the most widely used technique. Another memory-protection technique is the CRC vote [7]: the idea is to compute the CRC of the memory areas corresponding to duplicated modules and then vote on these signatures to detect faults. The system can be restored by copying the fault-free configuration memory of one FPGA into the memory of the faulty FPGA. This technique currently runs on multi-FPGA systems and cannot be applied within a single circuit because of routing resources. Tiling [8] is the only technique that overcomes permanent errors: a configuration can have multiple distributions (several implementations), and in each distribution a blank area is inserted at a different place. When a fault is detected, the distribution whose blank area covers the faulty region is loaded. This approach requires specific resources to manage the reconfiguration process. The major problem with all these approaches is the extra cost in time or resources. Designing such a system therefore requires a framework to evaluate early the overheads due to dependability.

III. DESCRIPTION OF THE CONSIDERED SYSTEM
The platform considered in this work is a multi-processor architecture in which each processor has access to several communication media. Besides the functional organization of the application under consideration, the organization of the hardware architecture can be represented as in Figure 1.

Fig. 1: Typical hardware architecture.

This architecture leads to two failure possibilities:
• the failure of an ECU,
• the total or partial failure of a communication network.
The considered architectures are composed of few processors (at most a few tens). The architecture can therefore be considered small and, without requiring significant storage capacity, each processor can keep its own local image of the complete architecture, each processor being responsible for updating its image. It is further assumed that the communication networks allow message broadcasting. In the following we consider an architecture composed of p processors connected by b networks. The algorithm proposed in [9] and used in the synchronization process is based on connection matrices that are exchanged. As said above, each processor has a complete view of the architecture in which it operates. This image consists of two matrices:
• Proc: a matrix of size p × p in which the element of index (i, j) has the value 1 if processor j is able to transmit a message to processor i, and 0 otherwise.
• Net: a matrix of size b × p in which the element of index (i, j) has the value 1 if processor j has received at least one message across network i, and 0 otherwise.

To construct a global image of the whole architecture, each processor periodically runs an observation cycle (of period CyclePeriod). To work correctly, the observation cycles must start synchronously on each processor. This would require all processors to share a common clock, which is not the case, but different solutions exist to synchronize the clocks with an accuracy of a few µs, or 1 µs at most. Such accuracy is sufficient with respect to the period of the observation cycles (of the order of several ms). The observation cycle is split into two equal parts of length Length_Part, as shown in Figure 2, where Length_Part is less than half of CyclePeriod. In the first part, each processor identifies its connections to the other processors and networks. The second part gathers this information from all processors to construct a complete image of the architecture.

Fig. 2: Organization of an observation cycle (also called "life cycle").

During the first part of the cycle (Fig. 3a), each processor broadcasts on each network a message containing only its identifier. After having broadcast these messages, each processor waits for the messages sent by the other processors in the architecture. The length of this waiting time corresponds to the time remaining before the end of the first part. Whenever a processor receives a message, it is able to identify the network over which the message was transmitted and the processor that sent it.

(a) First part of the cycle.

(b) Second part of the life cycle.

Fig. 3: Description of an observation cycle to detect a failure in the system.

At the end of this first step, each processor can build one element of each matrix: a row of the Proc matrix showing the status of its connections with each processor, without assuming the network over which it is connected, and a column of the Net matrix indicating the connection status of the processor with each network in the architecture. During the second part of the observation cycle (Fig. 3b), each processor broadcasts, on each network to which it is connected, a message containing its identifier and the two matrix elements it built during the first part of the current cycle. The processors then wait for the messages transmitted by the other processors. At the end of this second part, each processor is able to fully rebuild the image of the whole system state. From these matrices, a simple analysis (sketched below) gives the running status of each processor and each network in the architecture: if column j of the Proc matrix contains only zeros, then processor j is not working properly; in the same way, if row i of the Net matrix is entirely set to 0, then network i of the architecture is no longer functional. It is then possible to make local decisions, including message rerouting decisions, to guarantee the correct behavior of the application. However, a rerouting decision may increase part of the response time, which can have serious consequences for hard real-time applications; this type of decision is not considered in the context of this work.
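A minimal Python sketch of this matrix analysis, assuming the two images are available as nested lists (the function name and data layout are illustrative, not the platform's actual code):

```python
def analyze_architecture(proc, net):
    """Return the indices of processors and networks that appear faulty.

    proc[i][j] == 1 if processor j was able to transmit a message to processor i (p x p).
    net[i][j]  == 1 if processor j received at least one message over network i (b x p).
    """
    p = len(proc)   # number of processors
    b = len(net)    # number of networks

    # Processor j is faulty when nobody received anything from it:
    # column j of Proc contains only zeros.
    faulty_processors = [j for j in range(p)
                         if all(proc[i][j] == 0 for i in range(p))]

    # Network i is faulty when no processor received anything over it:
    # row i of Net contains only zeros.
    faulty_networks = [i for i in range(b) if all(v == 0 for v in net[i])]
    return faulty_processors, faulty_networks

# 4 processors and 2 networks: processor 2 is silent and network 1 is dead.
proc = [[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 0]]
net = [[1, 1, 1, 1],
       [0, 0, 0, 0]]
print(analyze_architecture(proc, net))   # ([2], [1])
```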

The principle of the proposed monitoring is based on the periodic transmission of a frame of life. The novelty here lies in the fact that the contents and size of the exchanged data vary over time in order to reduce the communication overhead. For example, application messages are used to update the Proc and Net matrices without sending a specific frame, which limits the influence of the observation system on the communication resources of the architecture. When a fault is detected, a recovery strategy must be applied. In case of an error in a processor, two cases are possible (a sketch of this decision follows): 1) the processor restarts after its reconfiguration; the other processors then send back the last saved correct context (contained in a previous frame of life), i.e. checkpointing and rollback [10]; 2) the faulty processor does not restart (permanent fault); a task-migration decision may then be taken. We assume that the source code of the task is accessible to all processors and that the context of the task was distributed in the previous frames of life, so a fault-free processor can quickly resume execution of the interrupted task. In case of multiple simultaneous failures, our solution can identify the defects in a single iteration, since each processor has an accurate picture of the state of the architecture [9].
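A compact sketch of that recovery decision, with the two actions passed in as callables; all names are illustrative and do not correspond to the platform's actual API:

```python
def recover_processor(restarted_after_reconfig: bool, rollback, migrate) -> None:
    """Apply the recovery policy described above.

    rollback(): resend the last saved correct context (checkpointing and rollback).
    migrate():  resume the task on a fault-free processor, using the task context
                distributed in the previous frames of life.
    """
    if restarted_after_reconfig:
        rollback()   # transient fault: the reconfigured processor resumes from its checkpoint
    else:
        migrate()    # permanent fault: the task moves to a healthy processor

# Placeholder actions for illustration:
recover_processor(True,
                  rollback=lambda: print("resend last saved context"),
                  migrate=lambda: print("migrate task to a fault-free processor"))
```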

IV. PROPOSED MODEL
To validate and evaluate the impact of the different detection and/or correction techniques, we defined a model for simulation purposes. This model is initially intended to show the qualities and drawbacks of the monitoring algorithms. We assume an architecture with p processors and b networks. On each processor, three software parts can be considered to run, as shown in Fig. 4. The Application is simply modeled as an element that loads the resources (communication and computation), which may influence the quality of the detection system (the Observer).

The results presented later concern the modeling of a 4-processor architecture (proc1 to proc4) connected by two networks (Net1 and Net2). The global model is presented in Fig. 5. The UseCase variable simulates processor and/or bus faults; it evolves over time depending on the scenario defined by the user and on the CtrlUseCase built-in function.

Fig. 5: Model for simulation.

Fig. 4: Principle of monitoring.

The Communication Management block represents all the communication functions. It is in charge of forwarding each message to the communication networks or to its recipient(s); for an incoming message addressed to the processor, it directs the message to the observation function or to the application. The Observer block implements the detection system described in the previous section. Besides demonstrating the possibilities of failure detection in different circumstances, this model is customizable: the cycle period and the duration of each part can be modified, and it is possible, among other things, to specify the load imposed by the application on the different networks, the size of the information transmitted by the detection system, and the throughput and latency of each network in the architecture. Desynchronization between processors has not been considered in the model.

Figure 6 shows a simulation result in the case of a failure of Net 2 between times 100 and 150 ms, followed by a failure of processor 2 between times 250 and 300 ms. The upper part of the figure shows the status of processor 2 as seen by the other processors; the lower part indicates the state of Net 2 as seen by the different processors in the architecture. In each plot, the bottom trace shows the operating status of the item driven by UseCase, value 1 representing a running state and value 0 a defect. The state seen by each processor is displayed as a separate trace, with values 0 and 1 offset by the processor identifier. This simulation clearly shows that all processors detect the problems. A finer temporal analysis allows us to determine the maximum fault-detection time, which is of course related to the time allocated to the observation cycle (CyclePeriod). Using the model for the architecture presented in Fig. 7 revealed situations of false detection due to a significant local load of the application requiring a high utilization rate of the network connection.

Fig. 6: Simulation of faults on Net 2 and on processor 2.

V. IMPLEMENTATION
A. System Level
We implemented the fault-tolerance mechanisms at the system level on a multi-FPGA platform (Fig. 7). The system consists of four FPGAs connected by two communication networks, here both Ethernet (in future developments one will be based on a PLC interface, while the other will be an RF connection, as studied in the CIFAER project). The first network, called Main, is built using a router; the second, with a ring topology, increases the fault-tolerance capacity of the entire platform at a reasonable cost. Each FPGA contains a multi-processor system composed of four MicroBlazes. Our system is built by exploiting the dynamic reconfiguration of Xilinx Virtex FPGAs. A typical dynamic system is built around a microprocessor that reads the configuration bitstreams from an external memory interface and controls the reconfiguration interface (ICAP) by sending the partial bitstreams to the different reconfigurable regions (PRRs). In conventional approaches, this processor is integrated statically to control the dynamic resources. In our system, the control processor is itself part of a dynamically reconfigurable cluster, so that all processors can be reconfigured in case a fault appears. Making all processors reconfigurable requires putting in place a distributed reconfiguration management. In conventional systems, only a single centralized

Fig. 7: Multi-FPGA platform. Built around two separate networks to support the CIFAER project approach, the FPGAs incorporate the connection-matrix fault-detection mechanism. Copying the bitstreams optimizes the migration of tasks across the platform.

MicroBlaze can be connected to the configuration memory (classically a CompactFlash) because of the output interface circuit. It is therefore necessary to copy the bitstreams into another memory. Due to the limited amount of BRAM available in the FPGA, we used a DDR2 SDRAM memory available on the board. The interested reader can find more information about the implementation in [11]. Throughout the system, each FPGA is connected to a memory that can be accessed by all processors within the same circuit. This memory is divided into three segments (Fig. 7):

• an area to save the software settings of its processors and its bitstreams;
• an area to save the context and the bitstream of the following FPGA in the ring network;
• an area set aside and used in case a system failure occurs; this segment is used to transfer the bitstreams and the associated contexts between FPGAs.

This memory organization guarantees the existence of at least one copy of each bitstream across the system. As can be seen in Fig. 7, each FPGA's bitstream is in its local memory, and a copy is also available in the local memory of the previous FPGA on the ring network.

These copies are used in case of a system failure and enable a quick change of context. During the synchronization process, there are two failure possibilities for an FPGA: either MicroBlaze 1 (which supports the network interface) of the FPGA is defective, causing a loss of communication, or the circuit itself is defective. To distinguish these two possibilities, the secondary Ethernet link (ring topology) is used. If one MicroBlaze within an FPGA fails, the error is handled through dynamic reconfiguration of the processor or by task migration within the MPSoC. If the entire FPGA fails, all its tasks should migrate to another FPGA. Whatever the new FPGA is, task migration is possible because a copy of the bitstream exists somewhere in the architecture. This migration requires defining task priorities in order to choose which tasks must be maintained. If the fault appears on the main Ethernet switch, then all the primary connections are faulty, leading to a loss of connection between all FPGAs; in this case, all circuits switch to the ring network, and the second network then ensures the proper functioning of the global system. The use of network redundancy, coupled with the new paradigm of dynamic reconfiguration, allows dependable systems to be built.
B. Implementation results
We have implemented our fault-tolerant multiprocessor circuit in a Xilinx Virtex XC5VSX50T. As can be seen in Fig. 8 and Table I, this medium-range circuit easily supports the entire system: our MPSoC uses 67% of the FPGA slices and 65% of the BRAM memories of the component. Note that the differences in bitstream sizes presented in Table I are due to the placement of the PRRs in the FPGA. Particular attention should be paid to this point because it leads to penalties in terms of reconfiguration time. Processor 1 is larger because it contains the interface with the CompactFlash memory. As said earlier, the bitstreams need to be copied to make them accessible to all processors. We measured the reconfiguration time required from the CompactFlash (CF2ICAP in Table I).

Fig. 8: Floorplan of the MPSoC implemented on each FPGA. Each processor is mapped into a PRR and communicates with the rest of the system through bus macros. The static part of the system, which includes the control logic, is very small.

We then measured the time taken to copy the bitstreams into the DDR2 memory (CF2DDR) and the duration of a system reconfiguration from the DDR2 (DDR2ICAP). Note that reconfiguration from the DDR2 is 65% faster than from the CF. The first step of copying the bitstreams is expensive (456 + 357 + 2 × 281 ≈ 1.4 s), but it takes place only once, at system initialization; it is then amortized over every subsequent partial reconfiguration of the system. We were able to validate, using other bitstream sizes, that the reconfiguration time grows linearly; consequently, approaches using redundancy, which increase the bitstream sizes, are very penalizing. The inter-FPGA communications are handled over TCP/IP by using the lwIP stack [12] controlled by a MicroBlaze. The communication stack can operate in two modes, RAW mode and Socket mode: the Socket mode is simpler, whereas the RAW mode is more efficient. In RAW mode, the bandwidths are 120 Mb/s downstream and 104 Mb/s upstream. The size of a bitstream for a MicroBlaze is on average 170 KB (see Table I), which requires (170 KB × 8 bits) / (104 Mbit/s) ≈ 13 ms to transfer from one FPGA to another. The software context of a MicroBlaze is about 6 kbit, which requires only 58 µs to transfer over the network. Depending on the content of the frame of life, the bandwidth required for the synchronization process can be adapted; the time required to send the overall context from one FPGA to the others is about 230 µs.
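For reference, the transfer-time figures quoted above can be reproduced with a few lines of Python (values taken from the text and Table I; the KB-to-bit conversion is assumed to use 1 KB = 1024 bytes):

```python
MBIT = 1e6
upstream_bps = 104 * MBIT            # RAW-mode upstream bandwidth

bitstream_bits = 170 * 1024 * 8      # average MicroBlaze partial bitstream (~170 KB)
context_bits = 6e3                   # software context of a MicroBlaze (~6 kbit)

print(bitstream_bits / upstream_bps * 1e3)   # ~13.4 ms per bitstream transfer
print(context_bits / upstream_bps * 1e6)     # ~57.7 us per context transfer

# One-time copy of the bitstreams at initialization (CF2DDR column of Table I, in ms):
print(456 + 357 + 2 * 281)                   # 1375 ms, i.e. about 1.4 s
```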

                  LUT    FF     Slice   BRAM   Bitstream size (KB)   CF2ICAP (ms)   CF2DDR (ms)   DDR2ICAP (ms)
Static            7866   6616   2019    22     -                     -              -             -
Processor 1       4160   4160   1040    16     253                   529            456           129
Processor 2 & 4   2880   2880   720     16     156                   326            281           80
Processor 3       3840   3840   960     16     198                   414            357           101

TABLE I: Resources consumed by the system and timings for bitstream manipulations. The static part includes the ICAP controller, the multi-port interface to the DDR memory, and the fault generator. Transfer and reconfiguration times are measured according to the source memory and the size of the bitstreams.

In our system, the minimum synchronization interval is 130 ms, a value greater than the longest processor reconfiguration time of 129 ms (see Table I); the principle is not to trigger a synchronization while a processor is being reconfigured. The synchronization routine in a system operating at 100 MHz needs 287 µs for the four processors. This time depends on the content of the frame of life and on the resolution phase presented above. This synchronization time is significantly less than the time between two interruptions and does not have much impact on the timing of the global system.
VI. CONCLUSION
Supporting fault tolerance is a necessity for the future development of dynamically reconfigurable architectures in new applications. This raises the question of designing reconfigurable fault-tolerant systems, as well as the question of guaranteeing the means of communication in a distributed environment. We have proposed a comprehensive framework for modeling and designing reliable, dynamically reconfigurable architectures, and we have presented a multi-processor fault-tolerant platform that shows the efficiency of the proposed framework. We are currently working on hardware fault-injection mechanisms to demonstrate the effectiveness of our techniques and to compare them with the model in order to highlight the limitations of the system. We inject faults by modifying a bit of a configuration frame, an approach that relies on the scrubbing technique; it can be embedded into the system to reduce validation and testing time. The model will be enriched to take into account new dependability techniques.

REFERENCES
[1] CIFAER project, http://www.insa-rennes.fr/ietr-cifaer.
[2] C. Pilotto, J. Azambuja, and F. Kastensmidt, "Synchronizing triple modular redundant designs in dynamic partial reconfiguration applications," in Proceedings of the twenty-first annual symposium on Integrated circuits and system design, 2008, pp. 199–204.
[3] Xilinx, "PPC405 lockstep system on ML310," Xilinx Application Note XAPP564.
[4] P. Bernardi, M. Reorda, L. Sterpone, M. Violante, and I. Torino, "On the Evaluation of SEUs Sensitiveness in SRAM-Based FPGAs," in IEEE International On-Line Testing Symposium, IEEE Computer Society, Washington, DC, USA, 2004, pp. 115–120.
[5] Xilinx Inc., Virtex FPGA Series Configuration and Readback, Xilinx Application Note XAPP138, 2005.
[6] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. LaBel, M. Friendlich, H. Kim, and A. Phan, "Effectiveness of Internal Versus External SEU Scrubbing Mitigation Strategies in a Xilinx FPGA: Design, Test, and Analysis," IEEE Transactions on Nuclear Science, vol. 55, no. 4, part 1, pp. 2259–2266, 2008.
[7] H. Castro, A. Coelho, and R. Silveira, "Fault-tolerance in FPGAs through CRC voting," in Proceedings of the twenty-first annual symposium on Integrated circuits and system design, ACM, New York, NY, USA, 2008, pp. 188–192.
[8] A. Kanamaru, H. Kawai, Y. Yamaguchi, and M. Yasunaga, "Tile-Based Fault Tolerant Approach Using Partial Reconfiguration," in Proc. Int. Workshop on Reconfigurable Computing: Architectures, Tools and Applications, LNCS vol. 5453, 2009, pp. 293–299.
[9] C. Haubelt, D. Koch, and J. Teich, "Basic OS Support for Distributed Reconfigurable Hardware," in Computer Systems: Third and Fourth International Workshops SAMOS, Springer, 2004.
[10] M. Bashiri, S. Miremadi, and M. Fazeli, "A Checkpointing Technique for Rollback Error Recovery in Embedded Systems," in International Conference on Microelectronics, 2006, pp. 174–177.
[11] M. Pham, S. Pillement, and D. Demigny, "A fault-tolerant layer for dynamically reconfigurable multi-processor system-on-chip," in International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, Dec. 2009, pp. 284–289.
[12] LightWeight IP (lwIP), http://savannah.nongnu.org/projects/lwip.

HIGH-LEVEL MODELLING AND AUTOMATIC GENERATION OF DYNAMICALLY RECONFIGURABLE SYSTEMS
Gilberto Ochoa (a), El-Bay Bourennane (a), Hassan Rabah (b), and Ouassila Labbani (a)

(a) LE2I Laboratory, Burgundy University, Dijon Cedex, France ([email protected]), ([email protected]), ([email protected])
(b) Nancy University, LIEN, BP 239, 54506 Vandoeuvre-lès-Nancy, France ([email protected])

Abstract—Dynamic Partial Reconfiguration (DPR) has been introduced in recent years as a method to increase the flexibility of FPGA designs. However, using DPR to build complex systems remains a daunting task. Recently, approaches based on MDE and the UML MARTE standard have emerged which aim to simplify the design of complex SoCs. Moreover, with the recent standardization of the IP-XACT specification, there is an increasing interest in using it within MDE methodologies to ease system integration and to enable design-flow automation. In this paper we propose a MARTE/MDE approach which exploits the capabilities of IP-XACT to model and automatically generate DPR SoC designs. In particular, our goal is to create the structural top-level description of the system and to include DPR support in the IP cores used. The generated IP-XACT descriptions are transformed to obtain the files required as inputs by the EDK flow and are then synthesized to generate the netlists used by the DPR flow. The methodology is demonstrated by integrating two codec cores (CAVLC and VLC) into a MicroBlaze-based DPR SoC.
Index Terms—Dynamic Partial Reconfiguration, UML MARTE, MDE, IP-XACT, ESL Design

1. INTRODUCTION
Run-time reconfiguration (RTR) has been introduced in recent years as a means of virtualizing hardware tasks in FPGA systems [1]. However, it was not until the introduction of Dynamic Partial Reconfiguration (DPR) technologies by Xilinx that these systems became a reality. In DPR systems, parts of the system can be reconfigured at run-time while the other functionalities in the FPGA remain operational [2]. This capability can potentially provide enormous benefits to system designers, such as reduced power consumption and resource utilization, among others. However, despite the efforts of Xilinx and of many industrial and academic endeavours, using DPR in very complex systems remains a daunting task. This is due, in the first place, to the complexity of the design flow [3], which requires an in-depth knowledge of many low-level aspects of the FPGA technology. Secondly, academic efforts to extend the capabilities of the DPR design flow have further increased the complexity of DPR SoC designs.

In this paper, we propose a methodology to ease the conception of DPR SoCs. It is based on a Model Driven Engineering (MDE) [4] approach in tandem with a component-based approach. MDE allows high-level system modelling of both software and hardware; model transformations can then be carried out to generate executable models from the high-level models. Since we aim for a component-based approach, the seamless integration and interoperability of the IPs used is a necessity. The SPIRIT consortium has developed the IP-XACT specification [5], which describes a standard way of documenting IP meta-data for SoC integration. Several industrial case studies have demonstrated that the adoption of IP-XACT facilitates configuration, integration, and verification in multi-vendor SoC design flows [6], [7]. In addition to IP packaging and integration, IP-XACT also provides ways to automate design flows in which different tools are used. The contributions of this paper consist in presenting an MDE approach that uses the UML MARTE profile and that enables moving from high-level models to HDL code generation for the system description. IP-XACT is used as an intermediate model to configure the IPs deployed in the DPR system and to automate system integration and parameterisation. The parameterised system is then used to generate the necessary inputs to the DPR design flow. Our approach simplifies the conception and implementation of FPGA-based SoCs and greatly facilitates the composition of DPR designs. The rest of this paper is organized as follows. Section 2 discusses related work in the areas of hardware resource modelling using UML and the efforts made to integrate IP-XACT into MDE approaches. Section 3 analyzes the Xilinx DPR design flow in the context of MDE methodologies. Section 4 presents the proposed model-driven approach for DPR. Section 5 presents a case study on the integration of two transcoder implementations into a DPR SoC design. Finally, conclusions and future work are given in Section 7.

2. RELATED WORKS
The use of model-based approaches for co-design has been thoroughly discussed in [8]. UML/MDE has been adopted in co-design methodologies over the last years with relative success. The extension mechanisms introduced in UML have stimulated its use in embedded systems modelling. Structural modelling has always been the most prominent application of UML in SoC design, for requirements specification, behavioural and architectural modelling, test-bench generation, and IP integration. Several approaches use UML profiles and extensions to support the modelling of embedded hardware resources. Many of them make use of the UML profile for "Modelling and Analysis of Real-Time and Embedded Systems" (MARTE) [4], a proposed OMG standard profile. Several works explore embedded system modelling using UML [9, 10], but only a few explore dynamic and partial reconfiguration capabilities. In [12], the authors detail a DPR methodology that extends the UML/MARTE profile with specific stereotypes; their approach is developed in the GASPARD2 [11] design environment. Despite the complexity of their approach, in [13] the authors demonstrate how their methodology can be exploited to move from MARTE models to code generation. However, the mechanisms for moving from UML specifications to enriched levels have not been standardized, and every approach manages this issue by defining its own transformation rules and information repository (using custom XMI data-books). Recently, efforts have been made to integrate the concepts of IP-XACT into UML design methodologies. The goal of the IP-XACT standard is to provide a standard XML abstraction of hardware component implementations, whatever the language, so that these files can be seamlessly interchanged between EDA tools to favour IP reuse. The IP-XACT standard has generated enormous interest in the industrial and scientific communities as a means to overcome the complexity of system integration, and several research efforts have been carried out to integrate IP-XACT into MDE flows. Initial attempts at bridging the gap between the MARTE profile and IP-XACT were presented in [14]: the authors created an ad-hoc UML profile for IP-XACT by introducing stereotypes to represent IP-XACT objects, but their approach is only sketched, without implementation results. In [15], the authors investigated the application of UML for modelling IP-XACT-compatible component and system descriptions. Their approach enables a comprehensible visual modelling of IP-XACT systems by mapping several IP-XACT concepts to corresponding UML concepts. They present an application targeting a CoreConnect system, but it is only oriented towards the generation of SystemC Transaction-Level Model code.

A similar approach can be found in [16], which maps the TUT UML profile for embedded system design to an IP-XACT model. The resulting UML-based IP-XACT design flow is also presented; it allows automatic RTL component integration based on the proposed transformation rules. Subsequently, the authors further demonstrated their approach in [17], adding modelling concepts to implement a complex MPSoC. Despite the relatively large number of proposals for modelling SoCs using UML MARTE on the one hand, and a combination of UML and IP-XACT on the other, so far there is no approach that uses a meta-model-based approach for DPR systems. Moreover, all the approaches in the literature dealing with IP-XACT in MDE methodologies rely on system descriptions that are not abstract enough: they describe systems using IP-XACT concepts directly, which is not suitable for high-level methodologies. In this paper we make use of an MDE methodology in which IP-XACT concepts are embedded in the conception flow. The MDE methodology for high-level system modelling is presented in Figure 1. The system specification starts by modelling the application and the architecture separately; they are then associated and deployed at a high level of abstraction. Control information for the DPR description is also obtained from the model, but this aspect is outside the scope of this paper. The deployment phase makes use of a library that abstracts the low-level aspects of the targeted platform. The obtained system specification is then used by the subsequent stages of our MDE approach: the deployed IPs are configured and an IP-XACT system description is created from the obtained meta-model. These steps are detailed in Section 4.

Figure 1. Deployed MDE framework for high-level modeling.

3. DYNAMIC PARTIAL RECONFIGURATION DESIGN FLOW
A brief discussion of the Dynamic Partial Reconfiguration design flow is provided in this section; for a more detailed description, the reader is referred to the Xilinx user guide [3]. Dynamic Partial Reconfiguration starts from the idea that only a small section of the FPGA is modified at run-time. For this, the designer must explicitly define the areas of the FPGA that will be dynamically reconfigured (known as PRRs); a series of modules (known as PRMs) are then assigned to these physical partitions. These modules are subsequently converted into partial bitstreams that can be downloaded at run-time to map the desired functionalities onto the destined partitions. The DPR design flow is based on a bottom-up synthesis approach, as depicted in Figure 2. This methodology requires the netlists for each partition to be generated independently. In parallel, the top module of the design is synthesized, with black boxes for the partitions. This means that the IP blocks are obtained independently (i.e. from a library); these IP blocks may need to be configured (adding features for DPR) and parameterized before being synthesized. Therefore, automating the IP and system parameterization and their transformation into netlists can substantially improve the DPR system design cycle. The netlists for the PRMs are obtained through synthesis using Xilinx ISE, whilst the system netlist is obtained from the EDK description. These are then imported into PlanAhead, which manages the details of building a DPR system.

We believe that the DPR design flow (of which we have only explained a small part) can be fully exploited by integrating it into an MDE approach that uses IP-XACT as a unifying information repository and as a means for design-flow automation through a compliant design environment. The main contribution of this paper consists in describing such a methodology, which starts from a high-level description using UML MARTE (following the approach explained in the previous section) and then generates the inputs for an IP-XACT-based design environment. This design environment is in charge of configuring the necessary IP cores and then of integrating and configuring the top-level description of the system. Moreover, the automation capabilities of IP-XACT can be exploited through the use of so-called generators in order to perform the logic synthesis of the design, which produces the netlists used as inputs by the DPR design flow. The configured IP-XACT descriptions are used to generate several files utilized by the Xilinx Embedded Development Kit (EDK) tool, as depicted in Figure 3; examples of these files are the Microprocessor Hardware Specification (MHS) and the Microprocessor Peripheral Description (MPD). The MHS and MPD files are employed by Xilinx Platform Studio's Platgen tool [20] to generate the SoC platform. This tool generates the top-level HDL description, whilst the HDL files for the reconfigurable modules are gathered from the library. Finally, generators are used to synthesize the top-level design and each of the reconfigurable modules separately. IP-XACT provides a way to automate this process through generators which, in tandem with generator-chain descriptions, can be exploited to automate these kinds of activities. All these tasks are facilitated by Magillem Generator Studio.

Figure 2. Dynamic Partial Reconfiguration Design Flow

Multiple passes through the place-and-route tools are used to generate the necessary bitstreams for all full and partial design implementations; each pass (called a configuration) represents a complete FPGA design. The generated partial bitstreams can then be stored in memory, and configuration control can be performed by a program running on the processor.

Figure 3. Dynamic Partial Reconfiguration Design Flow

4. PROPOSED META-MODEL-DRIVEN DPR DESIGN FLOW
The conception of a dynamically reconfigurable system starts from the definition of a static design that has the same characteristics as a traditional SoC application, and most of the information introduced for DPR is structural. However, since modules are swapped in and out of the system, an IP-reuse methodology has to be developed to enable the seamless integration of IPs that come from different vendors and that might not have the same interfaces for different configurations. Also, many IPs are highly parameterisable, which requires the design flow to be able to configure them for different scenarios. We believe that the combination of the MARTE profile and IP-XACT can improve the applicability of model-driven approaches to this problem. In this paper, we present a methodology for generating a DPR system by creating the top module required by the design flow and by configuring the necessary reconfigurable modules. The obtained files are then fed into the Xilinx DPR design flow to obtain the necessary full and partial bitstreams. The proposed meta-model-driven methodology for DPR is presented in Figure 4.

Figure 4. Deployed methodology in-depth.

The methodology is explained in more detail as follows. A meta-model-driven approach based on IP-XACT starts with the creation of a library of components compliant with the standard, which contains information about the implementation of the IPs stored in the component HDL library. The IP-XACT library is normally populated by several means: IP descriptions can be written manually, the information can be obtained from legacy code and documentation, or it can be captured using a graphical representation based on extensions to UML. In our methodology, new components can be added to the library either by capturing the necessary information using IP-XACT extensions to MARTE or by using Magillem Platform Assembly [18]. The designer of a DPR application starts by composing a system using a MARTE description in a tool such as Papyrus. An extension to the MARTE profile for DPR has been created; details can be found in [19]. As explained in Section 2, the static part of the system and the reconfigurable modules are treated independently. The static part of the system is created by integrating non-DPR components such as the processor, memory controllers, communication IPs, etc.; the reconfigurable partitions are defined as black boxes that contain only information about the interfaces of the modules to be placed on them. The DPR IPs to be mapped onto these reconfigurable areas are then specified and retrieved from the library for configuration and customization. Once the system has been composed and the reconfigurable IPs defined, the MARTE description is converted into an IP-XACT description by the model transformation phase. The transformation rules of our approach are inspired by the work presented in [17]. This is achieved by parsing the original XMI file exported from the MARTE description using a tool such as Python (a sketch of this step is given below). This process generates an IP-XACT design, which is imported along with the IPs' IP-XACT descriptions (gathered from the library) into a compliant design environment. Generators are used to configure the necessary parameters of the system and of the fixed and reconfigurable IPs. This phase is particularly important for the integration of the dynamically reconfigurable components, since specific functionality (in the form of a logic wrapper) has to be added in order to support seamless partial reconfiguration. For this step, we make use of Magillem Platform Assembly. After the IP-XACT design has been generated, it can be used to target different design flows (e.g. VHDL top-level code generation, C library generation, SystemC). We have decided to target the Xilinx Platform Studio flow, since Dynamic Partial Reconfiguration using MicroBlaze or PowerPC is well supported and a complete SoC can be easily created; several PSF files can also be exploited using our methodology.
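As an illustration only, the XMI parsing step mentioned above could look like the following Python sketch; the element and attribute filters, as well as the emitted XML layout, are placeholders rather than the real MARTE or IP-XACT schemas, and the paper's actual transformation rules are those inspired by [17]:

```python
import xml.etree.ElementTree as ET

def collect_instances(xmi_path):
    """Collect (name, type) pairs that look like component instances in an exported XMI model."""
    instances = []
    for elem in ET.parse(xmi_path).iter():
        if elem.tag.endswith("packagedElement") and elem.get("name"):
            # the type attribute is namespaced; match by suffix to stay version-agnostic
            kind = next((v for k, v in elem.attrib.items() if k.endswith("type")), "unknown")
            instances.append((elem.get("name"), kind))
    return instances

def emit_design(instances):
    """Emit a minimal IP-XACT-like design listing the collected component instances."""
    root = ET.Element("design")
    insts = ET.SubElement(root, "componentInstances")
    for name, kind in instances:
        inst = ET.SubElement(insts, "componentInstance")
        ET.SubElement(inst, "instanceName").text = name
        ET.SubElement(inst, "componentRef").text = kind
    return ET.tostring(root, encoding="unicode")
```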

5. CASE STUDY: RECONFIGURABLE ENTROPY CODER INTEGRATION INTO A SOC SYSTEM
In order to demonstrate the feasibility of the proposed model-driven top-down methodology, we implemented a reconfigurable architecture for the entropy coders used in reconfigurable video decoders. A dynamic wrapper is used to encapsulate the IP cores that will be placed in a reconfigurable area, and a static wrapper is defined to allow the control of the IP cores and the communication with the rest of the system architecture.
5.1 Proposed application
Numerous studies have been published on video transcoding and adaptation. In [21], a classification into two categories has been proposed: heterogeneous transcoding concerns the conversion between two different standards, such as MPEG-2 to H.264, while homogeneous transcoding makes adjustments within the same standard. In both cases, the conversion may involve bit rate, resolution or frame rate. The transcoding chain can be divided into three main parts: a bitstream analyzer, pixel-level adaptation and bitstream generation. The transcoding procedure should operate on the compressed bitstream and generate a new bitstream in order to be relevant for real-time applications. The generation of the adapted bitstream requires information from the target decoder and a reconfigurable entropy encoder. The bitstream generator used in this case study has been designed to support VLC for the MPEG-2 standard or CAVLC for the H.264 standard, and is an extension of the work presented in [22]. The data-flow graphs of the CAVLC and VLC IP cores are shown in Figure 5.

Figure 5: CAVLC (a) and VLC (b) IP cores

The VLC and CAVLC implementations have been carried out in VHDL. In parallel, IP-XACT descriptions have been created for the IPs and for the bus interfaces required by the static and dynamic wrappers. We used these IP descriptions to build a minimal system around a MicroBlaze processor in a SoC, following the Xilinx Xflow methodology and tools; it must be noted that our methodology is not restricted to them, they were simply used to facilitate the conception of the system.
5.2 Reconfigurable system model
Figure 6 shows the modeling phase of a reconfigurable system targeted at an embedded architecture to be exploited by Xilinx's EDK environment. This diagram represents a merged functional/physical view of the system, used to express the attributes related to the physical/logical stereotypes. In this structural diagram the designer is interested in describing how the system is connected, not in the low-level aspects of the design. Every hardware component has two type definitions, one functional (the type of module) and the other physical (i.e. areatype, Static or DynamicReconfigurable). In our case study, we only make use of one dynamically reconfigurable region (labeled PRR in the diagram). In addition, we use components such as the PLB_HWICAP (in charge of managing the partial reconfiguration data), the SystemACE controller (to store data and configuration bitstreams), the UART controller, and of course the PLB bus and the MicroBlaze processor. It is important to note that these attributes are used in the proposed design methodology to indicate in which cases custom functionality will be added to support dynamic partial reconfiguration; this functionality comes in the form of a dynamic wrapper (DW) that is inserted between the bus wrapper (the IPIF module, or static wrapper, SW) and the core used in the application. These wrappers are described in more detail in the next sub-section.

Figure 6: Modeling of a DPR architecture in UML MARTE.

Once the system has been modelled, an intermediate XMI file is created, which is then parsed to populate an IP-XACT file. The IP-XACT file contains the interconnections between the modules, and the configurable elements are set from the attributes entered in the model.
5.3 Interface and generic wrapper generation
A generic, static wrapper is used to encapsulate the VLC and CAVLC IP cores, as shown in Figure 7. Any difference between the two IP cores is handled by a dynamic wrapper specific to each IP, allowing its adaptation to the static wrapper. An in-depth analysis of the CAVLC and VLC IP cores shows important differences in terms of processing and some similarities in data manipulation and access. These IPs process data blocks (4x4 for CAVLC and 8x8 for VLC) and output compressed data in byte format. Control signals are also used, particularly to manage the incoming data and the output video bitstream. The similarities between the two IP cores are exploited: a generic static wrapper was designed to encapsulate the two IPs and was included in the library. The static wrapper is composed of an adaptable buffer capable of handling an 8x8 coefficient block or a set of 4x4 coefficient blocks. A controller manages the data transfer between the FIFO, the IP core and the rest of the system; it generates and manages the handshake signals for data communication and the configuration state of the reconfigurable area. The static wrapper communicates with the IP cores through a dynamic wrapper specific to each IP core. The dynamic wrapper includes an interface whose complexity depends on the complexity of the data transfer; the goal is to simplify and generalize the design of the static wrapper.

The complete hardware component, or hardware accelerator (HWA), is modeled using UML MARTE as well, as depicted in Figure 8. The partially reconfigurable module consists of three sub-modules, namely the IP core functionality itself (IP Core), the dynamic wrapper (PRR_wrapper) and the IPIF module (IPIF). The HWA is typed as reconfigurable, as depicted in Figure 7.

Figure 8: Modeling of a hardware accelerator in UML MARTE.

This view shows only a structural representation of the IP, but the interface sizes and the different parameters can be controlled by the user. As in the previous case, this information is used to populate an intermediate XMI file that is parsed and used to generate an IP-XACT description of the component. Abstraction and bus definitions have been created in IP-XACT for the hardware wrapper signals and for the IPIF module; the bus definitions are used to connect the different modules within the HWA, but when modelling in MARTE the user does not have to take these low-level aspects into consideration. Figure 9 shows a section of the bus definition for the dynamic wrapper.

Figure 9. IP-XACT bus definition of the dynamic wrapper.

Figure 7: Simplified view of the IP-core wrapper for partial reconfiguration, showing the static and dynamic wrappers.

The modelled HWA component is converted into an IP-XACT component description that is used to interconnect the component into a larger system (i.e. using Magillem Platform Assembly) and to set the different configurable elements of the IP core.

Also, it is important to note that the component object in IP-XACT contains information regarding the different views of the component (i.e. the language used to describe it, the tools it is targeted to, etc.). The component description also encapsulates, through the FileSets schema, information about the files associated with an IP core (if any), which can be exploited to automate many of the burdensome steps in the DPR design flow. The IP-XACT component descriptions of the system are used to generate the Microprocessor Peripheral Description (MPD) files used by EDK to configure the IP blocks connected to the microprocessor-based system. This file contains information similar to the IP-XACT description and the transformation rules are relatively straightforward; more information about the MHS and MPD can be found in [21]. The component description is also exploited to create the interconnections and sub-module instantiations in the user-logic files used by EDK to implement the functionality of the IP cores. Since the VHDL file is already created from the MPD description (in tandem with the MHS file), the functionalities for the dynamic part of the IP are added in the user-logic file. If any netlist has been used to implement the functionality of the HWA, this information is stored in the IP-XACT description and then used to update the Black Box Definition (BBD) files used by EDK to declare the presence of black boxes with their associated netlists. The Peripheral Access Order (PAO) file, used by EDK to establish the order of synthesis, can also be updated from the FileSets schema in the component description.
5.5 System architecture
The target system architecture for the validation of the reconfigurable entropy coder is shown in Figure 10. This architecture is based on a Xilinx PLB bus and is a schematic representation of the model presented in Figure 6. The components connected to this bus are mainly used to control and manage the reconfigurable area. The bitstreams for CAVLC and VLC are generated and stored in a flash memory managed by the SystemACE. In this system, the MicroBlaze processor is used as a reconfiguration controller: it reads the adequate bitstream from the CompactFlash and sends it to the HWICAP, which configures the PRR. The system architecture modelled in Figure 6 is transformed into an XMI format and then converted into an IP-XACT design. An IP-XACT design description contains information about the components instantiated in the design, bus connections (interconnections in IP-XACT terms), ad-hoc connections (i.e. point-to-point connections) and hierarchical connections.
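To give an idea of what the IP-XACT-to-EDK mapping produces, the sketch below renders one component instance and its configurable elements as an MHS block. The BEGIN/PARAMETER/BUS_INTERFACE/END layout follows the usual Xilinx EDK convention, while the core name, instance data and parameter values are purely hypothetical:

```python
def instance_to_mhs(core, instance, hw_ver, parameters, bus_interfaces):
    """Render one component instance as an MHS block (illustrative sketch only)."""
    lines = [f"BEGIN {core}",
             f" PARAMETER INSTANCE = {instance}",
             f" PARAMETER HW_VER = {hw_ver}"]
    lines += [f" PARAMETER {name} = {value}" for name, value in parameters.items()]
    lines += [f" BUS_INTERFACE {port} = {bus}" for port, bus in bus_interfaces.items()]
    lines.append("END")
    return "\n".join(lines)

# Hypothetical reconfigurable peripheral hanging off the PLB bus:
print(instance_to_mhs("prr_hwa", "prr_hwa_0", "1.00.a",
                      {"C_BASEADDR": "0x7c000000", "C_HIGHADDR": "0x7c00ffff"},
                      {"SPLB": "mb_plb"}))
```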

Figure 10: Generated system architecture.

This information is used to populate the MHS file, especially regarding the different ports: if only PLB ports are used, then no extra information is required, but in many cases ad-hoc signals are used as general I/O to implement specific functionalities, and this information then has to be updated in the system description. The component-instances schema in the IP-XACT description contains information about the configurable elements of the design, which are the equivalent of the parameters in the MHS and MPD files and of the generics in the VHDL code. As depicted in the MDE flow of Figure 3, the generated MHS file (for the complete system) and the MPD files (for each of the IP cores) are used by the EDK tools (specifically by Platgen) to generate the top-level VHDL system. The reconfigurable modules are synthesized independently and instantiated as black boxes in the EDK design. The generated netlists are then imported into PlanAhead for the low-level DPR flow.
5.6 Implementation results
The proposed architecture has been targeted to a Virtex-5 FPGA using ISE 12.1 from Xilinx. The partitioning of the reconfigurable area was performed with PlanAhead, which allows the automatic insertion of slice macros to interface the static and dynamic regions.

The partitioning was performed so that the reconfigurable area holds the necessary resources to implement either the VLC or the CAVLC IP core. The results obtained in terms of resource utilization (LUT, Slice L, Slice M and BRAM) are shown in Table 1.

LUT Slice L Slice M Bram36

H264

Mpeg2

Cavlc

Vlc

1190 206 94 0

2297 422 167 4

Wrapper 90 11 12 0

Overhead Cavlc/ wrapper

Vlc/ wrapper

8.7% 6% 16% 0%

4% 2.8% 7.6% 0%

Table 1: Resources utilization and Wrapper overhead

Adding the static and dynamic wrapper that allows the reconfiguration may cause a significant overhead on the physical resources of the FPGA. We have optimized this wrapper to reduce this overhead. Table 1 shows the overhead introduced by the wrapper compared to the VLC and CAVLC IPs. The average overhead of the wrapper over the two IPs is 7.5%. This overhead is compensated by the fact that only one IP is instantiated on the system at a time.

7. CONCLUSION

In this paper we have presented a design methodology that facilitates the conception and implementation of dynamically partially reconfigurable SoCs. We have concentrated our efforts on the creation of the structural description of the system that is used as an input to the DPR design flow. The presented approach is based on two widely used standards, UML MARTE and IP-XACT, which until recent years had been developed in parallel; a great deal of research has been carried out to unify both standards, given the opportunities offered by the IP-XACT standard for interchanging IP descriptions among EDA tools. However, as demonstrated in this paper, IP-XACT can also be exploited as a means of providing an intermediate system description that can be used to pass from UML MARTE models to HDL code generation. An IP-XACT compliant design environment facilitates the configuration and interconnection of complex systems, and provides mechanisms for EDA tools that can be used to control and automate many of the burdensome tasks associated with SoC design flows. We have shown how IP-XACT can be used to generate the top-level HDL description of the system, along with the necessary reconfigurable IPs gathered from a component library. The automation capabilities of IP-XACT have been exploited to generate the synthesized netlists that are used by the Xilinx PlanAhead tool to create DPR systems. Furthermore, we have demonstrated our methodology through a case study in which an entropy coder can be reconfigured for an adaptive compression system.

8. ACKNOWLEDGMENTS

This work has been supported by the ANR FAMOUS project (ANR-09-SEGI-003). The authors would like to thank Magillem Design Services for their support and for providing us with academic licenses to use their tools for IP-XACT exploitation.

9. REFERENCES

[1] P. Manet, "An Evaluation of Dynamic Partial Reconfiguration for Signal and Image Processing in Professional Electronics Applications," EURASIP Journal on Embedded Systems, 2008.
[2] A. Donlin, "Applications, Design Tools and Low Power Issues in FPGA Reconfiguration," Chapter 22 in Designing Embedded Processors: A Low Power Perspective, Springer, pp. 513-541, 2007.
[3] Xilinx Corporation, Partial Reconfiguration User Guide, Xilinx UG208, 2011.
[4] OMG, "Modeling and Analysis of Real-time and Embedded systems (MARTE), Beta 3," http://www.omgwiki.org/marte-ftf2/doku.php, 2009.
[5] "IEEE Standard for IP-XACT, Standard Structure for Packaging, Integrating, and Reusing IP within Tool Flows," IEEE Std 1685-2009, pp. C1-360, Feb. 2010.
[6] W. Kruijtzer et al., "Industrial IP Integration Flows based on IP-XACT Standards," in Proc. DATE'08, pp. 32-37, March 2008.
[7] C. Lennard, "Industrially Proving the SPIRIT Consortium Specifications for Design Chain Integration," in Proc. DATE'06, pp. 1-6, March 2006.
[8] J.-L. Dekeyser, P. Boulet, P. Marquet, and S. Meftali, "Model driven engineering for SoC co-design," in Proc. 3rd International IEEE-NEWCAS Conference, pp. 21-25, June 2005. doi: 10.1109/NEWCAS.2005.1496724.
[9] J. Vidal, F. De Lamotte, and G. Gogniat, "A co-design approach for embedded system modeling and code generation with UML and MARTE," in Design, Automation and Test in Europe (DATE'09), 2009.
[10] J. Vidal, F. de Lamotte, G. Gogniat, J.-P. Diguet, and P. Soulard, "IP reuse in an MDA MPSoPC co-design approach," in Proc. International Conference on Microelectronics (ICM), pp. 256-259, Dec. 2009.
[11] West team of the LIFL laboratory, "Graphical Array Specification for Parallel and Distributed Computing (Gaspard)," [Online]. Available: http://www2.lifl.fr/west/gaspard/.
[12] I. R. Quadri, A. Muller, S. Meftali, and J.-L. Dekeyser, "MARTE based design flow for Partially Reconfigurable Systems-on-Chips," in Proc. 17th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC'09), 2009.
[13] I. R. Quadri, S. Meftali, and J.-L. Dekeyser, "Designing dynamically reconfigurable SoCs: From UML MARTE models to automatic code generation," in Proc. Conference on Design and Architectures for Signal and Image Processing (DASIP), pp. 68-75, Oct. 2010.
[14] C. André, F. Mallet, A. M. Khan, and R. de Simone, "Modeling SPIRIT IP-XACT with UML MARTE," in Proc. DATE Workshop on Modeling and Analysis of Real-Time and Embedded Systems with the MARTE UML Profile, 2008.
[15] T. Schattkowsky, X. Tao, and W. Mueller, "A UML frontend for IP-XACT-based IP management," in Proc. Design, Automation & Test in Europe (DATE'09), pp. 238-243, April 2009.
[16] T. Arpinen et al., "Model-driven Approach for Automatic SPIRIT IP Integration," in Proc. UML-SoC'08, June 2008.
[17] T. Arpinen, T. Koskinen, E. Salminen, T. D. Hamalainen, and M. Hannikainen, "Evaluating UML2 modeling of IP-XACT objects for automatic MP-SoC integration onto FPGA," in Proc. Design, Automation & Test in Europe (DATE'09), pp. 244-249, April 2009.
[18] Magillem Design Services, "Magillem Platform Assembly User Guide," 2011.
[19] S. Cherif, I. R. Quadri, S.
Meftali, and J.-L. Dekeyser, "Modeling Reconfigurable Systems-on-Chips with UML MARTE Profile: An Exploratory Analysis," in DSD 2010, pp. 706-713, 2010.
[20] Xilinx Corporation, Embedded System Tools Reference Guide, Xilinx UG111, September 2009.
[21] I. Ahmad, X. Wei, Y. Sun, and Y.-Q. Zhang, "Video transcoding: an overview of various techniques and research issues," IEEE Transactions on Multimedia, vol. 7, no. 5, pp. 793-804, Oct. 2005.
[22] M. Guarisco, H. Rabah, S. Weber, and A. Amira, "Dynamically reconfigurable architecture for real time adaptation of H264/AVC-SVC video streams," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 39-44, June 2010.



Poster Session: Smart Image Sensors

SystemC Modelization for Fast Validation of Imager Architectures
Yves Blanchard


SYSTEMC MODELIZATION FOR FAST VALIDATION OF IMAGER ARCHITECTURES

Yves Blanchard
Paris-Est University, ESYCOM, ESIEE Paris, Noisy-le-Grand, France
E-mail: [email protected]

Antoine Dupret, Arnaud Peizerat
CEA LETI – MINATEC, Grenoble, France
E-mail: {antoine.dupret, arnaud.peizerat}@cea.fr

ABSTRACT

The development of smart CMOS imagers is a complex design task in which the verification of an architecture composed of a matrix of pixels intermixed with analog and digital electronics plays an important part. New generations of imagers using 3D integration will allow even more processing to be done in situ. Verification has to be done locally for the pixel and globally for the architecture. The design exploration and validation problem has shifted from the mostly analog domain to the validation of a complex SoC with millions of parallel processors: the pixels. In this paper we present a methodology using the SystemC language for the creation of fast models for the validation and first-level performance evaluation of large CMOS imager architectures.

Index Terms— Active pixel sensors, architecture simulation, SystemC, fast model

1. INTRODUCTION

With the increase in the resolution of Active Pixel Sensors (APS), power consumption and data throughput have become important factors that are often addressed by increasing the local processing near the pixel [3], [5]. However, imager functionalities have always been limited by the ratio of the light-sensitive area to the pixel size. With the advent of 3D stacking technologies, new solutions are emerging where the computing power available in the pixel vicinity is less limited by the fill factor [2]. This local data treatment can be quite sophisticated and, together with global processing, makes design validation a difficult task. Analog and digital processing for up to millions of pixels is done in parallel and may vary between different parts of the sensor at the same time. The usual methodologies and tools can be used for the separate validation of local analog and digital processing. Analog, mixed-signal or Matlab/Simulink simulation can validate part of the global architecture. However, because of the number of pixels, a full structural system evaluation and validation is nearly impossible with conventional means, with the risk of missing control or architectural problems that would be discovered too late.

SystemC is a versatile language that can be used both for high-level modeling and for the creation of synthesizable

descriptions of digital systems (with only a subset of the language). An extension, SystemC-AMS, brings mixed-signal capabilities to the language. SystemC is used for various kinds of models targeting high levels of abstraction with Transaction Level Models (TLM). Validation of complex mixed hardware-software systems can be done on virtual platforms described with SystemC-TLM [7]. Transaction Level Models, however, are quite far from the implementation structure and not adequate for a specific circuit like an imager, which has some processing at the transistor level. Yet if a more classic structural-level model is used, the simulation can become very slow because of the number of processing elements.

For hardware designers, SystemC is a difficult language to use, as it is based on C++ and relies on object-oriented and generic programming. The consequence is that it is often underused by designers who are trained to use what they can find in classical hardware description languages (HDLs) with a Register Transfer Level methodology. However, SystemC offers intermediate levels of description that use powerful features of the language and can be used to create fast simulation models within a structural description. Furthermore, while structural, these models remain flexible, allowing quick changes without the complexity of a full TLM model.

This study was motivated by the design of low-power imager architectures with in-circuit processing, where novel features could not be fully explored and evaluated with traditional tools. The local pixel design was validated with analog simulation, but circuit models with inter-pixel processing were either too high-level to allow any evaluation of the hardware implementation, or so close to the implementation that they required long development times and resulted in very slow simulation with no flexibility.

2. DESIGN REQUIREMENTS

An imager is a mixed-signal system whose core is a matrix of pixels. The pixel itself can be seen as an analog processor with digital control signals. Further processing can also be done in the analog or digital domain. The challenge for a model is to be at the same time accurate, for both the analog and digital processing at the pixel level, and fast for the simulation of the whole structure of the imager matrix, which represents a huge number of parallel processors.

3. STRUCTURAL DESCRIPTION

The goal of a structural model is to be accurate enough to validate the processing elements, the scheduling of the data in the architecture and the main control signals, and to introduce some realistic behavior to evaluate the performance of the implementation. In this section we present the structure that was chosen to create a general imager description that can be extended for the validation of various architectures.

As the fundamental sensing element, the pixel has its own level of description and is available as a module. It is the atomic and lowest element of the hierarchy. Each pixel can be parameterized individually for non-ideal behavior by the simulation wrapper, which has direct access to it through the hierarchy of modules (see Section 4). Pixels are instantiated in a second level of hierarchy that represents blocks of pixels. Such a block, depending on the architecture, can be a column of pixels or an arbitrary group of adjacent pixels that share a common processing [6]. These groups will be called blocks of pixels in the rest of this article. This module implements what is common to some pixels, like conversion, filtering, transforms, etc. Blocks interact strongly with the pixels and, in fact, part of the processing can be shared between blocks and pixels. In SystemC this can be done with communication through ports, but better model flexibility and faster simulation are obtained by using communication interfaces. The third level of hierarchy is the matrix gathering all the pixel blocks in the global architecture. Together with a control block they form the imager.
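As a minimal sketch of this hierarchy (the class names pixel, pixel_block and matrix, and their parameters, are hypothetical and not taken from the actual model), the levels can be instantiated dynamically in constructor loops:

    #include <systemc.h>
    #include <vector>

    // Hypothetical pixel module: the atomic element of the hierarchy.
    SC_MODULE(pixel) {
        SC_CTOR(pixel) {}
    };

    // Hypothetical block of pixels sharing common processing (e.g. a column).
    struct pixel_block : sc_module {
        std::vector<pixel*> pixels;
        pixel_block(sc_module_name name, unsigned n) : sc_module(name) {
            for (unsigned i = 0; i < n; ++i)              // pixels created on the heap
                pixels.push_back(new pixel(sc_gen_unique_name("pix")));
        }
    };

    // Hypothetical matrix gathering all blocks of pixels.
    struct matrix : sc_module {
        std::vector<pixel_block*> blocks;
        matrix(sc_module_name name, unsigned n_blk, unsigned n_pix) : sc_module(name) {
            for (unsigned i = 0; i < n_blk; ++i)          // blocks created on the heap
                blocks.push_back(new pixel_block(sc_gen_unique_name("blk"), n_pix));
        }
    };

Since every level is allocated with new, even a very large matrix lives on the heap rather than on the stack.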

Figure 1. Modules hierarchy: the imager model (full matrix with global controls and blocks of pixels) inside the simulation wrapper, with a configuration interface, communication-for-processing interfaces, and control signals.

Models can be created with many languages, from hardware-specific to generic ones. HDL models can be very accurate, but they are cumbersome to modify because of their structure, where all communications are done through signals and ports. Taking also into account that they are very slow to simulate (when simulation is even possible), they are used mostly for implementation or for the validation of some local functionalities. An imager of 500K pixels, counting two processes per pixel, requires a simulator able to run one million processes in parallel with digital and analog behavior. Classic programming languages (C++, Java, m, …) can be used to create algorithmic models that run very fast but contain little or no information about hardware feasibility and its characteristics. An intermediate level, keeping a structural description with parallelism and some timing but still far enough from the final implementation to be flexible and fast, can be obtained with SystemC, using only the Application Programming Interface (API) features standardized by the IEEE [1]. Compared to other existing works on the subject [4], this paper presents a more generic methodology to obtain a very fast and flexible simulation model while keeping a structural description. Most of the methodology can also be used for other architectures with a highly repetitive structure.


Because of the imager structure, the number of modules is on a huge scale, so it is worth noting that they are all dynamically created, even at the top of the hierarchy, to avoid stack overflow. The top of the hierarchy is a simulation wrapper whose task is the configuration and control of the simulation and of the imager parameters (Fig. 1).

Analog values use the C++ double-precision floating-point type. Even if the accuracy of single-precision floats could be enough for most values, double precision is necessary to minimize rounding errors in intermediate computation results and to introduce non-ideal behaviors. Furthermore, it carries almost no performance penalty (speed or memory usage). Digital values are coded using the integer types int and unsigned int (32 bits), or int64 and uint64 (64 bits). They are large enough for most applications and, being the computer's native word types, are very fast to simulate. Single-bit values use the Boolean type. SystemC-specific types are not used, even though they would bring better accuracy, because they are slower for computation and use more memory.

Communications between modules are made directly through interfaces or signals. When there is global information with one driver and many readers, like the clock or the pixel reset, signals are well suited, as they can be naturally connected to all the modules through ports. However, when there is a strong interaction between modules to do some processing, interfaces are preferred as they offer flexibility and speed of simulation.

4. COMMUNICATION THROUGH INTERFACES

Each level of the hierarchy communicates upward and backward through some control signals, but mostly through

interfaces to implement data processing. In SystemC an interface gives access through specific class methods to the content of a communication channel.
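As an illustration of this mechanism, a possible sketch is shown below; the interface name pixel_if, its methods, and this variant of the pixel module are hypothetical, not the actual interfaces of the model. A custom interface is an abstract class derived from sc_interface, and the pixel module can implement it directly so that no separate channel object is needed:

    #include <systemc.h>

    // Hypothetical pixel interface: the operations other hierarchy levels may call.
    class pixel_if : virtual public sc_interface {
    public:
        virtual double read_value() = 0;           // analog value kept as a native double
        virtual void   set_gain(double g) = 0;     // per-pixel parameter for non-ideal behavior
    };

    // Hypothetical pixel module implementing its interface itself.
    class pixel : public sc_module, public pixel_if {
    public:
        pixel(sc_module_name name) : sc_module(name), value_(0.0), gain_(1.0) {}
        double read_value()       { return value_ * gain_; }   // computed only when read
        void   set_gain(double g) { gain_ = g; }
    private:
        double value_;
        double gain_;
    };

Any module holding a pointer or an export to pixel_if can then call these methods directly, without going through ports and signals.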

Figure 2. Interface and export: two modules, each with processes and internal methods, communicate through each other's interface methods; exports make an interface usable from the parent module or from a child.
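To make the export mechanism of Fig. 2 concrete, here is a possible sketch (with hypothetical names, reusing the pixel_if and pixel classes sketched above): each level re-exports the interface of one of its pixels so that the simulation wrapper can reach it directly, without knowing the intermediate structure.

    // Hypothetical block re-exporting the interface of one of its pixels.
    SC_MODULE(pixel_block) {
        sc_export<pixel_if> pix_exp;       // interface made visible at the block boundary
        pixel* pix;
        SC_CTOR(pixel_block) {
            pix = new pixel("pix0");
            pix_exp.bind(*pix);            // the pixel object itself implements pixel_if
        }
    };

    // Hypothetical matrix re-exporting the block's export one level up.
    SC_MODULE(matrix) {
        sc_export<pixel_if> pix_exp;
        pixel_block* blk;
        SC_CTOR(matrix) {
            blk = new pixel_block("blk0");
            pix_exp.bind(blk->pix_exp);    // export-to-export binding up the hierarchy
        }
    };

    // In the simulation wrapper: mat->pix_exp->set_gain(0.98);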

However, there are only semantic differences between a hierarchical channel and a module, and interfaces can be used for fast communication between different modules. Interfaces can be exported through module boundaries, giving direct access to the children from the top of the hierarchy, or to an upward module from a child. As an example, by exporting a pixel interface (a module may have various interfaces) to the simulation wrapper, pixels can easily be parameterized individually.

With interfaces, each level of the hierarchy can implement its processing in a flexible way, calling functionalities or receiving information from the other levels without having to know where they are (Fig. 2). If a method is added to a module, as long as it is part of one of its interfaces, it will be available to all the other modules connected to this interface without having to modify the structure. Furthermore, if the hierarchy is changed, only the chain of interface exports has to be maintained; for example, access to the pixel interface will not have to be modified in the simulation wrapper. Interfaces are not processes; they are not managed by the simulation kernel, but they can be called by processes or used to control processes through the notification of events. A module can do some of its processing in another module by calling a foreign method through an interface. It can also trigger other modules' actions through the notification of events.

5. SYSTEMC PROCESSES AND EVENTS

As already stated, the number of imager pixels generates a model with a very large number of processes working in parallel. In SystemC there are two kinds of processes: one with its own thread of execution, and one without, the so-called method processes. If the first kind is used, the creation of too many threads in the program by the SystemC

kernel at the first start of simulation crashes the execution. Consequently, modules that are hugely repeated in architectures must only use method processes. The only limitation on the number of method processes is the computer memory. They also have the advantage that calling them is faster than calling thread processes.

Another point is that the simulation speed depends on the number of processes active at each simulation time (called a delta cycle). Emulating analog behavior in SystemC may lead to processes being continuously active to compute new floating-point values at all times. However, these values are only pertinent at some instants or when reaching some threshold. By computing the values only when they are needed, or by evaluating the time when the threshold will be reached without computing continuously, the simulation speed can be greatly increased. Imagers are synchronized through different signals and so are sequential in nature. This property is used in the model to compute values only when they are needed. Pixels perform integration from reset to selection for reading, but instead of having a continuous integrator, the pixel module only computes its value when read through one of its interface methods.

This simple sequential treatment may not be enough if some action depends on a specific value being reached, even if there is no measurement from outside at this time. However, even in this case, a simulation speed increase can be obtained if it is possible to predict mathematically from the start when this value will be reached. If that is the case, then a powerful feature of the language, the creation of specific events, is used to "go" directly to this time without continuously or even periodically computing intermediate values. In SystemC, an event is a data structure used to tell the simulation kernel the time of some action. All processes registered in the simulation kernel as sensitive to a specific event will be activated when this event is notified. Furthermore, the kernel will only run delta cycles where there are events, as they mean there may be something to do. An event may activate a process in the same module (Fig. 3) or in another module through an interface exported to this module. An event is used by the simulation kernel, but it can be created and notified by any module in the application. When a process is active in a module at some time, it can compute the time when the next action will take place, and then notify an event for this time. The process then becomes inactive and, when the notification time arrives, the kernel activates the right action by waking up all the processes registered as sensitive to the event. Between the time when the event was notified and the time when it is processed, no computing power is used.

For example, the study of an imager where integration is stopped when a pixel reaches a threshold has been done with an interface in the matrix allowing all the pixels to update a common event to set it for the shortest integration time. The pixel models are initialized at reset, compute their integration time knowing their illumination, update the common event and go to "sleep". The simulation kernel then goes directly from the reset time to the event time without any intermediate computation in any pixel.
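The following minimal sketch (with hypothetical names; it is distinct from the module skeleton reproduced below) illustrates this "predict and notify" pattern: a method process is made sensitive to a shared event, each pixel notifies that event with its own predicted integration time, and, since SystemC keeps only the earliest pending notification on an event, the kernel jumps straight from reset to that time.

    #include <systemc.h>

    // Hypothetical controller owning the event shared by all pixels.
    SC_MODULE(threshold_ctrl) {
        sc_event first_hit;   // fires at the earliest time any pixel reaches the threshold

        SC_CTOR(threshold_ctrl) {
            SC_METHOD(on_hit);         // method process: no thread stack is allocated
            sensitive << first_hit;
            dont_initialize();         // stay idle until the event is actually notified
        }

        // Called by each pixel at reset with its predicted integration time.
        // An sc_event keeps only the earliest pending notification, so no
        // explicit comparison between pixels is needed here.
        void propose(const sc_time& t) { first_hit.notify(t); }

        void on_hit() {
            // trigger the readout here; no pixel computation happened in between
        }
    };

Each pixel, in its reset handler, would compute its integration time from its illumination and call propose() with it; between reset and the notified time the kernel has nothing to evaluate for the pixels.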

    class dum : public sc_module, public dum_if {
    public:
        sc_in<bool> rst_dum;       // reset input (template argument assumed; lost in extraction)
        sc_export<dum_if> mpp_a;   // export of the module's own interface (type assumed)
        SC_HAS_PROCESS(dum);
        dum(sc_module_name name) : sc_module(name) {
            SC_METHOD(mpv);        // first process
            sensitive