This paper presents a many-core heterogeneous computational platform that employs a GALS-compatible circuit-switched on-chip network. The platform targets streaming DSP and embedded applications that have a high degree of task-level parallelism among computational kernels. The test chip, fabricated in 65-nm CMOS, consists of 164 simple, small programmable cores, three dedicated-purpose accelerators, and three shared memory modules. All processors are clocked by their own local oscillators, and communication is achieved through a simple yet novel source-synchronous technique that allows each interconnection link to sustain a peak throughput of one data word per cycle.
A complete 802.11a/g WLAN baseband receiver was implemented on this platform. It achieves a real-time throughput of 54 Mbps with all processors running at 594 MHz and 0.95 V, and consumes an average of 174.41 mW, of which 12.18 mW (or 7.0%) is dissipated by its interconnection links. By taking advantage of the GALS architecture, which allows processors to run at their optimal clock frequencies with dual supply voltages set at 0.95 V and 0.75 V, the receiver consumes only 122.86 mW, a 29.6% reduction in power.
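As a quick consistency check on the figures quoted above, the two percentages can be recomputed from the stated power numbers. The sketch below is illustrative only: the variable and function names are hypothetical, and it assumes that both percentages are taken relative to the 174.41 mW single-supply figure.

```python
# Power figures quoted in the abstract, in milliwatts.
P_SINGLE_SUPPLY_MW = 174.41   # all processors at 594 MHz and 0.95 V
P_INTERCONNECT_MW = 12.18     # portion dissipated by interconnection links
P_DUAL_SUPPLY_MW = 122.86     # dual supplies (0.95 V / 0.75 V), per-core clocks


def percent(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Share of total power spent in the interconnect (assumed baseline: 174.41 mW).
interconnect_share = percent(P_INTERCONNECT_MW, P_SINGLE_SUPPLY_MW)

# Relative saving from running cores at their optimal frequency and voltage.
power_reduction = percent(P_SINGLE_SUPPLY_MW - P_DUAL_SUPPLY_MW, P_SINGLE_SUPPLY_MW)

print(f"interconnect share: {interconnect_share:.1f}%")
print(f"power reduction:    {power_reduction:.1f}%")
```

Under that assumption, the recomputed values round to the 7.0% and 29.6% reported in the text.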