XIV International Conference on Embedded Computer and Systems: Architectures, MOdeling and Simulation SAMOS XIV - 2014 July 14<sup>th</sup> - Samos Island (Greece)



Francesca Palumbo POLCOMING Università degli Studi di Sassari



Endri Bezati, Simone Casale-Brunet, Marco Mattavelli SCI STI MM École Polytechnique Fédérale de Lausanne



#### Automated Design Flow for Coarse-Grained Reconfigurable Platforms: an RVC-CAL Multi-Standard Decoder Use-Case



# Outline

- Introduction
  - Problem Statement
  - Two Step Problem Solving
    - Step 1: Coarse-Grained Reconfigurability
    - Step 2: Dataflow Model of Computation and RVC-CAL
  - Generation of Multi-Dataflow Graphs
- Automated Design Flow
  - Composition: the Multi-Dataflow Composer
  - Optimization: TURNUS Co-Exploration Framework
  - RTL Generation: Xronos High Level Synthesis
  - Tools Integration
- Experimental Results
  - An MPEG-4 SP Decoder Use Case
  - Synthesis Results
  - Performance Results
- Conclusions






# **Problem Statement**

Electronic devices are converging to platforms that are:

- portable: size and battery-life limits have to be taken into consideration.
- multimedia: different applications have to be executed, very often at the same time.
- highly efficient: real-time response is needed in many cases.
- easily evolvable: devices have to keep up with the fast evolution of technology and algorithms.

Hence the need for new design flows to address this complex scenario.

### **Two Step Problem Solving**

#### Step 1: Coarse-Grained Reconfigurability

• Systems are required to be flexible and efficient.

• **Reconfigurable Paradigm** (RP) to hardware design: specialized computing platforms, capable of changing configuration to serve the targeted computations.

• The more specialized the hardware, the harder it is to program.



### **Two Step Problem Solving**

#### Step 2: Dataflow Model of Computation and RVC-CAL (1)

A dataflow program is a directed graph of functional units (actors) exchanging sequences of data (tokens) through dedicated channels.



Actors encapsulate their state and communicate exclusively by sending and receiving tokens. This absence of race conditions, together with the intrinsic **modularity** of dataflow graphs, makes it possible to expose the algorithmic **parallelism** of programs.
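The actor and token model can be illustrated with a minimal Python sketch. It is not RVC-CAL and not the authors' code: the `Actor` class, the `run` scheduler and the two example actors (`scale`, `offset`) are hypothetical names chosen only for illustration.

```python
from collections import deque


class Actor:
    """A functional unit that fires when every input channel holds a token."""

    def __init__(self, name, action, inputs, outputs):
        self.name = name
        self.action = action    # function applied to one token per input
        self.inputs = inputs    # list of input channels (FIFOs)
        self.outputs = outputs  # list of output channels (FIFOs)

    def can_fire(self):
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        # Consume one token per input, produce one token per output.
        tokens = [fifo.popleft() for fifo in self.inputs]
        result = self.action(*tokens)
        for fifo in self.outputs:
            fifo.append(result)


def run(actors):
    """Fire enabled actors until none can fire (every actor here has an input)."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


# Two actors connected by dedicated channels: source -> scale -> offset -> out
source_to_scale = deque([1, 2, 3])
scale_to_offset = deque()
out = deque()

scale = Actor("scale", lambda x: 2 * x, [source_to_scale], [scale_to_offset])
offset = Actor("offset", lambda x: x + 1, [scale_to_offset], [out])

run([scale, offset])
result = list(out)  # [3, 5, 7]
```

Because the actors share nothing but the channels, the order in which enabled actors fire does not change the output sequence, which is what makes the parallelism explicit.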

### **Two Step Problem Solving**

#### Step 2: Dataflow Model of Computation and RVC-CAL (2)

**CAL** [1] = CAL Actor Language, a high-level programming language for describing dataflow actors.



[1] MPEG-B part 4 (2009), [2] MPEG-C part 4 (2010)

#### Generation of Multi-Dataflow Graphs (1)

[Figure: RVC-CAL dataflow specifications (actors A to E) mapped 1:1 onto a HW platform and 2:1 onto a HW CG reconfigurable platform.]

Challenges of modern electronic device design:

- low area and power (portability) $\rightarrow$ resource sharing
- flexibility (multimediality) $\rightarrow$ CG reconfiguration
- performance (high efficiency) $\rightarrow$ dataflow parallelism highlighting
- code reusability (fast evolution) $\rightarrow$ dataflow modularity (VTL)

A perfect match, but what happens if the **design complexity** grows?






Functionalities required by the automated design flow:



- multi-dataflow network composition

Open RVC-CAL Compiler (ORCC) is a framework able to generate, from an RVC-CAL specification, the corresponding source code for different target platforms (hardware, software or mixed).












[Figure: ORCC structure, with a front-end and multiple back-ends (among them C, Java and LLVM).]






























### Composition: the Multi-Dataflow Composer (2)

**RVC-CAL** dataflow specifications










**TURNUS** is a design space exploration framework for heterogeneous parallel systems. It provides high-level modeling and simulation methods and tools for system-level performance estimation and optimization.





It provides the **execution causation traces** for a given dataflow specification in relation to a particular input stimulus. The causation traces are graphs describing the dependencies among the actions executed during the input stimulus processing.
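As a rough illustration of what such a trace contains, the sketch below stores a tiny dependency graph and derives an execution order compatible with it. The firing names (`parse#0`, `idct#0`, and so on) and the dictionary layout are assumptions made for this example; they do not reflect the actual TURNUS trace format.

```python
# Hypothetical trace: each key is one executed action (a "firing") and the
# value lists the firings that had to complete before it could execute.
trace = {
    "parse#0": [],
    "idct#0": ["parse#0"],
    "motion#0": ["parse#0"],
    "add#0": ["idct#0", "motion#0"],
}


def topological_order(trace):
    """Return the firings in an order that respects every dependency (Kahn's algorithm)."""
    pending = {firing: set(deps) for firing, deps in trace.items()}
    ready = sorted(f for f, deps in pending.items() if not deps)
    order = []
    while ready:
        current = ready.pop(0)
        order.append(current)
        for firing, deps in pending.items():
            if current in deps:
                deps.remove(current)
                if not deps:
                    ready.append(firing)
    if len(order) != len(trace):
        raise ValueError("the trace contains a cyclic dependency")
    return order


order = topological_order(trace)
```

In this toy trace `idct#0` and `motion#0` depend only on `parse#0` and not on each other, so they could execute in parallel.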










The post-processing of the causation traces analyses them together with given action weights (the latency of each action on a specific target platform) in order to optimize the size of the FIFO memories in the dataflow network through design space exploration.
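One quantity that a weighted trace makes available is the critical path, that is, the longest chain of dependent firings, which bounds the execution time from below regardless of buffer sizes. The sketch below computes it on a toy trace. It is only an illustration of combining a trace with action weights, not the TURNUS exploration algorithm, and the firing names and cycle counts are invented.

```python
from functools import lru_cache

# Hypothetical trace (firing -> firings it depends on) and hypothetical
# action weights in clock cycles; neither comes from the slides.
deps = {
    "parse#0": [],
    "idct#0": ["parse#0"],
    "motion#0": ["parse#0"],
    "add#0": ["idct#0", "motion#0"],
}
weights = {"parse": 4, "idct": 10, "motion": 6, "add": 2}


def critical_path_length(deps, weights):
    """Length of the longest dependency chain, weighting each firing by its action latency."""

    @lru_cache(maxsize=None)
    def finish_time(firing):
        action = firing.split("#")[0]
        start = max((finish_time(d) for d in deps[firing]), default=0)
        return start + weights[action]

    return max(finish_time(firing) for firing in deps)


length = critical_path_length(deps, weights)  # parse (4) + idct (10) + add (2) = 16
```

A FIFO configuration whose simulated execution time approaches this bound leaves little to gain from larger buffers, which is the kind of trade-off a design space exploration can evaluate.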










**Xronos** is a framework for the generation of RTL descriptions from dataflow or sequential applications. It supports three basic data types (int, uint and bool), flow control statements (branches, loops, parallel blocks) and procedural abstractions (sequences of statements) called *tasks*.














**START:** MDC composes the multi-dataflow network

**STEP 1:** TURNUS generates the execution causation traces

**STEP 2:** Xronos generates the action weights

**STEP 3:** TURNUS generates the optimal FIFO sizes

**STEP 4:** Xronos generates the RVC-compliant reconfigurable hardware platform
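The data dependencies between these steps can be sketched as a small driver. Everything here is hypothetical: the function `run_design_flow`, the tool keys and the stubs stand in for the real tools, whose actual interfaces are not described in the slides. Passing the composed network to the trace step is also an assumption.

```python
def run_design_flow(networks, stimulus, tools):
    """Chain the steps; `tools` maps hypothetical step names to callables."""
    multi_net = tools["mdc_compose"](networks)                # START
    traces = tools["turnus_traces"](multi_net, stimulus)      # STEP 1
    weights = tools["xronos_weights"](multi_net)              # STEP 2
    fifo_sizes = tools["turnus_fifo_sizes"](traces, weights)  # STEP 3
    return tools["xronos_rtl"](multi_net, fifo_sizes)         # STEP 4


# Stubs that only record the call order, so the sketch runs without the real tools.
calls = []


def stub(name):
    def tool(*args):
        calls.append(name)
        return name + "_result"
    return tool


step_names = ["mdc_compose", "turnus_traces", "xronos_weights",
              "turnus_fifo_sizes", "xronos_rtl"]
tools = {name: stub(name) for name in step_names}

platform = run_design_flow(["intra", "full"], "qcif_stream", tools)
```

The point of the sketch is the ordering constraint: STEP 3 needs the outputs of both STEP 1 and STEP 2, and STEP 4 needs the FIFO sizes from STEP 3.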














Composition of the different designs in terms of functionality and role of the involved actors (all columns report numbers of actors).

| design     | ordinary* | SBox | not shared** | shared** | overall |
|------------|-----------|------|--------------|----------|---------|
| intra      | 32        | 0    | 32           | 0        | 32      |
| full       | 38        | 0    | 38           | 0        | 38      |
| parallel   | 70        | 0    | 70           | 0        | 70      |
| reconf     | 45        | 45   | 20           | 25       | 90      |
| opt_reconf | 45        | 45   | 20           | 25       | 90      |

\* computational actors (not SBoxes).
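The actor counts in the table are mutually consistent, as the short check below shows. Reading the reconfigurable design as the two decoders with their 25 common actors instantiated once is an interpretation of the table, not a statement taken from the slides.

```python
# Values copied from the table.
intra, full = 32, 38
shared, sboxes, overall_reconf = 25, 45, 90

parallel = intra + full                   # both decoders side by side
ordinary_reconf = intra + full - shared   # shared actors counted once
not_shared = ordinary_reconf - shared     # actors used by one decoder only

summary = (parallel, ordinary_reconf, not_shared, ordinary_reconf + sboxes)
```

Under this reading, `summary` reproduces the `parallel` total (70) and the `reconf` row (45 ordinary, 20 not shared, 90 overall once the 45 SBoxes are added).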


# Synthesis Results (1)

Results retrieved from the Xilinx Synthesis Technology tool targeting a Xilinx Virtex 5 330 FPGA board.

| resource       | parallel | reconf | Δ%  | opt_reconf | Δ%  |
|----------------|----------|--------|-----|------------|-----|
| Slices         | 22495    | 18447  | -19 | 16383      | -27 |
| Slice Regs     | 27567    | 23034  | -16 | 23674      | -14 |
| Slice LUTs     | 67445    | 52836  | -22 | 48545      | -28 |
| LUT-FF pairs   | 71011    | 55946  | -21 | 52518      | -26 |
| BRAMs          | 148      | 154    | +4  | 112        | -24 |
| DSPs           | 36       | 18     | -50 | 18         | -50 |
| Max freq [MHz] | 25.43    | 25.22  | -1  | 25.38      | 0   |

Δ%: percentage increase/reduction of each reconfigurable design with respect to the parallel design.
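The Δ% columns can be reproduced as a relative difference with the parallel design as baseline. The sketch below does this for three rows. Rounding to the nearest integer is an assumption, since the slides do not state the rounding rule, so a reported value may differ by one point for some rows.

```python
def delta_percent(baseline, value):
    """Relative change of `value` with respect to `baseline`, as a rounded percentage."""
    return round(100 * (value - baseline) / baseline)


# (parallel, reconf, opt_reconf) values copied from the table.
rows = {
    "Slice LUTs": (67445, 52836, 48545),
    "BRAMs": (148, 154, 112),
    "DSPs": (36, 18, 18),
}

deltas = {
    name: (delta_percent(parallel, reconf), delta_percent(parallel, opt_reconf))
    for name, (parallel, reconf, opt_reconf) in rows.items()
}
```

For these rows the computed pairs are (-22, -28), (4, -24) and (-50, -50), matching the table.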






# Synthesis Results (2)

Results retrieved from the XPower Analyzer tool targeting a Xilinx Virtex 5 330 FPGA board for a QCIF video sequence decoding.

| power   | parallel | reconf | Δ%  | opt_reconf | Δ%  |
|---------|----------|--------|-----|------------|-----|
| Clock   | 0.382    | 0.347  | -9  | 0.323      | -15 |
| Logic   | 0.045    | 0.025  | -44 | 0.024      | -47 |
| Signals | 0.050    | 0.031  | -38 | 0.031      | -38 |
| BRAMs   | 0.090    | 0.090  | 0   | 0.059      | -34 |
| TOT     | 0.567    | 0.493  | -13 | 0.437      | -23 |

Δ%: percentage increase/reduction of each reconfigurable design with respect to the parallel design.
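As a quick consistency check, the TOT row is the sum of the four contributions listed above it. The sketch below recomputes it from the table values; the dictionary layout is only for illustration.

```python
# Clock, Logic, Signals and BRAMs contributions copied from the table.
power = {
    "parallel": [0.382, 0.045, 0.050, 0.090],
    "reconf": [0.347, 0.025, 0.031, 0.090],
    "opt_reconf": [0.323, 0.024, 0.031, 0.059],
}

# Round to three decimals, the precision used in the table.
totals = {design: round(sum(parts), 3) for design, parts in power.items()}
```

The recomputed totals are 0.567, 0.493 and 0.437, the same as the reported TOT row.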






### **Performance Results**

Results are given in frames per second (fps) and are obtained as the average of fps for the intra and the full decoder configurations.

| video sequence resolution | parallel | reconf | Δ% | opt_reconf | Δ% |
|---------------------------|----------|--------|----|------------|----|
| QCIF                      | 110      | 105    | -5 | 105        | -5 |
| CIF                       | 28       | 26     | -6 | 26         | -6 |

Δ%: percentage increase/reduction of each reconfigurable design with respect to the parallel design.





- Challenge of modern electronic systems development: coupling **portability**, **flexibility** and **efficiency** with the **minimum design effort**.
- A solution is possible by exploiting the **dataflow** programming paradigm along with **coarse-grained reconfigurability**.
- An **automated design flow** has been assembled by integrating different RVC-CAL tools (MDC, TURNUS, Xronos).
- The proposed design flow has been validated on a real use case involving two configurations of the MPEG-4 Simple Profile decoder.
- Results highlighted the effectiveness of the approach, with savings of more than 25% in **resource utilization** and more than 20% in **power consumption**.
- The generated reconfigurable designs incur a very **small performance penalty** (5-6%) with respect to the original decoder designs.

### Acknowledgements

The research leading to these results has received funding from:







• the Region of Sardinia L.R.7/2007 under grant agreement CRP-18324 [RPCT Project].



• the Region of Sardinia, Young Researchers Grant, POR Sardegna FSE 2007-2013, L.R.7/2007 "Promotion of the scientific research and technological innovation in Sardinia"




Carlo Sau DIEE Università degli Studi di Cagliari carlo.sau@diee.unica.it




