A Clustered VLIW Architecture Based on Queue Register Files
Fernandes, Marcio M
Instruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. IN spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons we understand that the benefits of having parallel FUs, which have motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture.