A New Distributed DSP Architecture Based on the Intel IXS for Wireless Client and Infrastructure
Ernest Tsui
Communications and Interconnect Technology Lab
Corporate Technology Group

Kumar Ganapathy
Edge Access Division
Intel Communications Group

Contributors: Hooman Honary, Rich Nicholls, Tony Chun, and Lee Snyder

HOT CHIPS 14
Session 7: Digital Signal Processors
20 August 2002

Outline
– Vision for Wireless Networks: ubiquitous
– Anticipated Issues: plethora of “standards”
– Future Wireless Requirements: “soft” with intelligence to increase capacity
– Architectural Objectives: flexible and low power
– How will we go about it? Distributed at the “right granularity”
– Distributed Architectural Summary: based on power, size, and wireless protocols we can derive a “good” (near optimal?) distributed architecture
– Comparison to other Wireless DSP research: flexible but within 2x of Berkeley Wireless Research Center’s Pleiades Arch.
– Summary: infrastructure architecture is near-optimum in granularity and power
– Next Steps: client architecture next

Vision for Wireless Networks
– Ubiquitous Internet Connections for all Mobile Client Devices
  – Handhelds, PDAs, Tablet PCs, and Laptops
  – Always-on
– New Paradigm for Wireless Basestations
  – Proliferation of basestations due to lack of spectrum
  – Agility across Multiple Bands
  – Multi-Network (WLAN, WWAN)

Wireless Protocol Plethora
Anticipated Future Issues
– PAN, WLAN, and WAN
  – PAN: Bluetooth (UWB, Wireless USB2)
  – WLAN (4 protocols): 802.11b/a (11g, HiperLAN II)
  – WAN (9 protocols):
    – 2G: IS-95, GSM
    – 2.5G: GPRS/EGPRS, cdma2000
    – 3G: WCDMA (FDD, TDD, SC), CDMA 1xEV-DV

Wireless Requirements Summary
– Soft Radios at Basestations (deployed initially)
  – Low Power (<1 W) but highly flexible
  – Large no. of channels per core
  – Scalable
– Reconfigurable Client Radios (deployed later)
  – Seamless Client Roaming
    – Two Concurrent Wireless Protocols
    – Selected 802.11a and WCDMA as the most intensive protocols
  – Variable User Environments require “adaptive” resource allocation
  – Adaptive to Broadband AFE distortions
  – Very Low Power (<< 1 W)
    – Digital Baseband is < 10% of total PHY power
  – Reconfigurable to allow Si Re-use
  – Scalable

802.11a Signal Processing Flow Example
Rich Nicholls and Hooman Honary
Figure: receive chain block diagram with per-block operation counts.
Sample/Symbol Rate (IXS) blocks, starting from the Analog Front End:
– Decimation Filter: 2 x 13 real MACs
– Automatic Frequency Correction (input from Timing Synchronization): 1 complex MULT, 1 modulo ADD, 1 bit extract, 2 LUTs
– Fixed IQ Imbalance Correction: 1 complex MULT
– Guard Interval Removal
– S/P, 64 Point FFT, P/S: 3 stages, 16 Radix-4 butterflies per stage, re-ordering
– Channel Correction (FEQ): 48 complex MULTs
– Residual Frequency Correction: 48 complex MULTs, 4 arctan, 8 real ADDs, 1 SUBTRACT, 1 SHIFT
– Adaptive IQ Imbalance Correction
– QAM Demap & Soft Metric Generation
– Deinterleaver
– Further counts on the figure: 204 complex MULTs; LMS update, 48 x 8 real coefficient updates
Bit Rate (FEC) blocks:
– Viterbi ACS + Traceback: 64 1/2 ACS + 64 level traceback
– Viterbi Path Normalizations
– Data stream Descrambler: 6 register LFSR, 2 XORs
– Decryption/Lower MAC
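
The FFT annotation in this flow (3 stages, 16 Radix-4 butterflies per stage) follows directly from the 64-point size. A minimal Python sketch of that count is given here; the function name is illustrative and does not come from the slides.

```python
def radix4_fft_structure(n_points: int) -> tuple[int, int]:
    """Return (stages, butterflies_per_stage) for a radix-4 FFT.

    Each radix-4 butterfly consumes 4 points, so a stage needs
    n_points / 4 butterflies, and the number of stages is log4(n_points).
    """
    if n_points < 4:
        raise ValueError("radix-4 FFT needs at least 4 points")
    stages = 0
    remaining = n_points
    while remaining > 1:
        if remaining % 4 != 0:
            raise ValueError("size must be a power of 4")
        remaining //= 4
        stages += 1
    return stages, n_points // 4


stages, per_stage = radix4_fft_structure(64)
print(f"stages={stages}, butterflies/stage={per_stage}, total={stages * per_stage}")
```

For 64 points this reproduces the slide's 3 stages of 16 butterflies, or 48 radix-4 butterflies per OFDM symbol.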

802.11a Initial Acquisition Flow
Hooman Honary
Figure: acquisition block diagram over the Short Training sequence (8 µs), split into intervals of 1.6 µs, 2.4 µs, and 4 µs.
Rates annotated on the figure: sample rates 40 M, 20 M, 250 K; output data rate (max) 54 M.
Blocks and operation counts annotated on the figure:
– AGC (from Analog Front End 1 and from Analog Front End 2): 2 real MULTs, 4.25 COMPAREs, 3.06 real ADDs, 0.125 SHIFTs, 1.06 LUTs each
– Decimation Filter: 2 x 13 real MACs
– Preamble Detection: 1 complex MULT, 2 complex ADDs, 2 real ADDs, 1 COMPARE
– AGC Lock
– SNR Calculation (two instances): 1 complex MULT, 2 complex ADDs, 1 LUT each
– Diversity Selection
– Control signals: start SNR, start Second AGC (1 COMPARE)
– AFC Coarse Tuning (from Preamble Detector, once per packet): 1 ARCTAN, 1 MULT
– Timing Synchronization: 2 x (8+8) complex MACs, 1 complex MAG, 1 COMPARE
– Automatic Frequency Correction: 1 complex MULT, 1 modulo ADD, 2 LUTs
For new packet communications schemes, significant processing goes on during the very short intervals of the preambles.
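
To give a feel for how short these preamble intervals are, the sketch below converts the three intervals on the slide (1.6, 2.4, and 4 µs) into sample and clock-cycle counts. The 20 MHz sample rate is one of the rates shown on the slide; the 400 MHz clock is borrowed from the example parameters later in the deck and is an assumption here, as are all names in the code.

```python
INTERVALS_US = (1.6, 2.4, 4.0)   # intervals annotated on the slide
SAMPLE_RATE_MHZ = 20             # one of the sample rates on the slide
ASSUMED_CLOCK_MHZ = 400          # assumption: Fclk from the later example slide


def budget(interval_us: float, sample_rate_mhz: float, clock_mhz: float):
    """Return (samples, cycles, cycles_per_sample) available in one interval."""
    samples = round(interval_us * sample_rate_mhz)
    cycles = round(interval_us * clock_mhz)
    return samples, cycles, cycles // samples


for interval in INTERVALS_US:
    samples, cycles, per_sample = budget(interval, SAMPLE_RATE_MHZ, ASSUMED_CLOCK_MHZ)
    print(f"{interval} us: {samples} samples, {cycles} cycles, {per_sample} cycles/sample")
```

Under these assumptions a single processing element has only a few hundred to a couple of thousand cycles per interval, which is one way to read the slide's remark about significant processing in very short intervals.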

WCDMA Signal Processing Flow
Tony Chun and Hooman Honary
Figure: transmit and receive block diagram with per-block rate and Mips annotations.
– Receive blocks: Receive Filter; Rake Fingers (DPDCH 1, DPDCH 2, DPDCH 3, DPCCH); Channel Estimator (CPICH); Combiner; QPSK Symbol Demapping; Channel Aggregation; Deinterleaving/Demultiplexing; Turbo Decoder; Viterbi Decoder; CRC
– Transmit blocks: CRC; FEC Encoding; Interleaving/Multiplexing; QPSK Symbol Mapping; Spreading; Transmit Filter
– Control blocks: SIR Estimator with SIR Reference producing TPC (+1/-1); TPC accumulation (+D/-D, Z^-1) to TX Gain; Early-Late Gate
– Load summary: Sample Rate 318 Mips (1 per antenna); Chip/Symbol Rate 299 Mips; FEC 60 Mips/iteration
– Clock/Voltage Control parameters: Samples/Chip, No. of Fingers, No. of Iterations
  – 400 MHz @ 2 samples/chip; 2 x 400 MHz or 800 MHz @ 4 samples/chip
  – 400 MHz @ 2 samples/chip; 500 MHz @ 4 samples/chip
  – 400 MHz up to 6 iterations; 500 MHz up to 8 iterations
The high data rates in 3G result in multi-code, multi-antenna, and multi-despreader (finger) processing requirements.

Computational Mix for Wireless Protocols
Figure: bar chart of computational load in % (axis 0 to 40) by category (Filtering, Correl., Sync, FEC, Miscl. DSP, MAC) for 802.11a and WCDMA.
• Miscl. DSP ~ 10 separate signal processing threads

How will we go about it?

Flexibility, Power, and Cost Trades (Pick two only)
Figure: trade-off diagram among Flex., Power, and Cost, positioning Dedicated H/W, DSP, and IXS PE.

Present Status of Soft Radios
• Prior Infrastructure Approaches
  – DSP + ASIC
    – Inflexible ASIC and Costly DSP
  – DSP + Closely Coupled Accelerators
    – Increased Power and Costly DSP
  – Reconfigurable
    – Hard to Program
    – Costly
    – High Power
    – Granularity problem has not been completely solved
• Need Evolved Architecture

Architectural Objectives
– Client:
  – 2-3x Power/Size of Dedicated Hardware for the most intensive protocol as a goal
  – Related to no. of protocols possibly in the client device
– Basestation:
  – 5-10x Power/Size of Dedicated Hardware for the most intensive protocol as a goal
  – Related to no. of protocols possibly in the infrastructure device

General Architectural Issues
– Low power requires a highly distributed architecture
  – Lowering the voltage reduces power quadratically
  – Lowering the clock frequency reduces power linearly
  – Large size penalties associated with distributed elements must be avoided
– What is the low power interconnect strategy?
– How do we simplify the distributed processor programming problem?
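
The quadratic and linear dependencies above are those of the standard CMOS dynamic power expression, the same a x Csw x Vdd^2 x Fclk term that appears later on the Clock Rate Bounds slide. A minimal Python sketch follows; every numeric value in it is a hypothetical placeholder chosen only to show the scaling, not a figure from the slides.

```python
def dynamic_power(activity: float, c_switch_f: float, vdd_v: float, f_clk_hz: float) -> float:
    """Dynamic switching power in watts: a * C * Vdd^2 * f."""
    return activity * c_switch_f * vdd_v ** 2 * f_clk_hz


# Hypothetical operating point (assumed values, for illustration only).
base = dynamic_power(activity=0.1, c_switch_f=1e-9, vdd_v=1.2, f_clk_hz=400e6)
half_voltage = dynamic_power(activity=0.1, c_switch_f=1e-9, vdd_v=0.6, f_clk_hz=400e6)
half_clock = dynamic_power(activity=0.1, c_switch_f=1e-9, vdd_v=1.2, f_clk_hz=200e6)

print(f"halving Vdd  -> power x {half_voltage / base:.2f}")
print(f"halving Fclk -> power x {half_clock / base:.2f}")
```

Halving the voltage cuts dynamic power to about a quarter, while halving the clock only cuts it in half. Spreading a fixed workload over more, slower, lower-voltage elements can therefore save power, provided the area and interconnect penalties stay small.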

Architecture Approach
– Investigate Homogeneous Processing Elements (PE)
  – Easy to Scale and to Program for Basestations
  – Heterogeneous better for Client
– Interconnect with Nearest Neighbor Mesh
  – Eliminates High Speed (and power) buses [J. Rabaey, Silicon Architectures for Wireless, Hotchips 2001 Tutorial]
  – PHY connections are 95% nearest neighbor
– Number of Distributed Processing Elements
  – Driven by:
    • Computational Load
    • Size and Power Constraints
    • Feature parameters, e.g., Average Load Capacitance, Vdd, etc.
– Type of Element
  – General Purpose DSP combined with:
  – Acceleration of “Standard Operations” with the right granularity
– S/W programming via High Level Language
  – Explicitly indicates parallelism and connections
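
As an illustration of the nearest-neighbor mesh idea, the Python sketch below enumerates the neighbors of a PE in a 2D grid and counts the point-to-point links. The grid dimensions and function names are hypothetical; the slides do not specify the array shape or how the links are implemented.

```python
def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> list[tuple[int, int]]:
    """Return the north/south/west/east neighbors of PE (row, col) that exist."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def mesh_link_count(rows: int, cols: int) -> int:
    """Number of bidirectional nearest-neighbor links in a rows x cols mesh."""
    return rows * (cols - 1) + cols * (rows - 1)


ROWS, COLS = 4, 4  # assumed array shape, for illustration only
print("corner PE neighbors:", mesh_neighbors(0, 0, ROWS, COLS))
print("interior PE neighbors:", mesh_neighbors(1, 1, ROWS, COLS))
print("total links:", mesh_link_count(ROWS, COLS))
```

Each PE connects to at most four others over short local wires, so the wiring grows with the number of PEs rather than requiring a shared high-speed bus.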

System Architecture
Figure: CMOS AFE blocks and an Embedded Controller attached to an array of PEs.

Does a Good (near optimal) PE Solution Exist?

Macro-architectural Constraints
– First, must meet Power, Size, and Computational Load constraints
  – Computational Load = Rc (ops/sec)
    • Nop = No. of parallel significant operations (multiplies, etc.) in one cycle [R. Brodersen, ISSCC’02]
    • Fclk = Clock frequency
    • Nop x Fclk > Rc
  – Power Constraint = Po (mW)
    • Power (dynamic, leakage (Pleak), short circuit (Psc)) < Po
  – Size Constraint = Ac (mm^2)
    • Nop x Aop < Ac
    • Aop = Average area of a significant computational unit (e.g., multiplier-memory-address-decoder, etc.) (mm^2)
    • Aop ~ Granularity Factor
  – Constraints on Fclk
    • Rc / Nop < Fclk
    • Rc x Aop / Ac < Fclk
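
The two constraints on Fclk follow from combining Nop x Fclk > Rc with Nop x Aop < Ac: the area limits Nop to at most Ac/Aop, so the clock must be at least Rc x Aop/Ac. A minimal Python sketch of that step is shown below. Rc and Aop are taken from the example-parameters slide later in the deck; Ac = 30 mm^2 is an assumed value (the slides do not give Ac), and the function names are illustrative.

```python
def max_parallel_ops(a_c_mm2: float, a_op_mm2: float) -> float:
    """Largest Nop the area budget allows (from Nop x Aop < Ac)."""
    return a_c_mm2 / a_op_mm2


def fclk_lower_bound(rc_ops_per_s: float, a_op_mm2: float, a_c_mm2: float) -> float:
    """Minimum clock in Hz (from Rc x Aop / Ac < Fclk)."""
    return rc_ops_per_s * a_op_mm2 / a_c_mm2


RC = 20e9        # Rc ~ 20 GOPs (from the example-parameters slide)
A_OP = 0.6       # Aop ~ 0.6 mm^2 (from the example-parameters slide)
A_C = 30.0       # assumed total area budget in mm^2; not given in the slides

print(f"max Nop  = {max_parallel_ops(A_C, A_OP):.0f}")
print(f"min Fclk = {fclk_lower_bound(RC, A_OP, A_C) / 1e6:.0f} MHz")
```

With this assumed area the bound lands at the Nop ~ 50 and Fclk ~ 400 MHz quoted later, which suggests the example point sits close to the area-limited edge.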

Clock Rate Bounds
– Fclk is upper bounded by power constraints
  – a x Csw x Vdd^2 x Fclk + Pleak < Po / (b x Ac)
    • where Pleak is the average power leakage density in mW/mm^2
    • Csw is the average switching (load) capacitance per mm^2
    • ‘a’ is the activity factor
    • ‘b’ is the average active area fraction (incl. datapath, cache, cache memory bus, etc. and excl. L2 memory, etc.)
    • ‘b’ varies from ~10% for microprocessors to ~80% for dedicated hardware and is also a function of clock gating strategies
– Fclk is lower bounded by computational and area constraints
  – Rc x Aop / Ac < Fclk < (Po / (b x Ac) - Pleak) / (a x Csw x Vdd^2)
  – Key Issues:
    • Find the Fclk that meets upper and lower bounds
    • Derive the Aop and Nop
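
The first key issue, finding an Fclk between the two bounds, can be sketched by evaluating both sides of the inequality. In the Python sketch below, Rc, Aop, and Po come from the example-parameters slide; Ac, b, Pleak, a, Csw, and Vdd are hypothetical placeholder values, and SI units (W, F, Hz) are used instead of the slide's mW for simplicity.

```python
def fclk_window(rc, a_op, a_c, p_o, b, p_leak, a, c_sw, vdd):
    """Return (lower, upper) bounds on Fclk in Hz.

    lower: Rc x Aop / Ac                              (compute and area)
    upper: (Po / (b x Ac) - Pleak) / (a x Csw x Vdd^2)  (power)
    """
    lower = rc * a_op / a_c
    upper = (p_o / (b * a_c) - p_leak) / (a * c_sw * vdd ** 2)
    return lower, upper


lower, upper = fclk_window(
    rc=20e9,        # ops/s, from the example-parameters slide
    a_op=0.6,       # mm^2, from the example-parameters slide
    a_c=30.0,       # mm^2, assumed
    p_o=0.75,       # W (750 mW), from the example-parameters slide
    b=0.5,          # active area fraction, assumed
    p_leak=0.005,   # W/mm^2, assumed
    a=0.1,          # activity factor, assumed
    c_sw=0.5e-9,    # F/mm^2, assumed
    vdd=1.0,        # V, assumed
)
print(f"Fclk window: {lower / 1e6:.0f} MHz to {upper / 1e6:.0f} MHz")
print("feasible" if lower < upper else "no feasible Fclk")
```

If the window is empty, the design has to change Aop, Ac, or Vdd. Which of these to change is the granularity question the following slides explore.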

General Power, Area, Fclk Trends
Figure: Power (Po) and Area (Ac) versus Fclk (axis marks at ~50 MHz, ~500 MHz, ~5 GHz; vertical scale 1, 10, 100), with curves for fixed Aop and fixed area (Ac) and a marked optimum area. Other labels on the figure: Voltage/Power, Flexibility, Interconnect Power, Granularity, Nop.

Reconfigurable Power Trend Summary
– There is an optimum Fclk for a fixed Aop
  – (Recall that Aop is the fundamental processing size)
  – The optimum meets Size and Computational requirements and minimizes power for the above
  – Higher Fclk increases power and lower Fclk increases area and interconnect power
– Is there a similar optimum as Aop is varied?
  – As Aop decreases, interconnect power increases exponentially
    • Simpler elements must be connected in a more complex manner to retain flexibility
  – As Aop increases, the voltage requirement (and power) increases
    • More complex element requires time-multiplexing
– Thus, is there a globally “good” design?
  – Conjecture:
    • Determine the minimum Aop (for the flexibility desired) and find the optimum Fclk

Example of “Good” Architecture Parameters in the optimum area
– Nop (No. of parallel significant operations), for 90 nm:
  – Nop ~ 50
  – Aop ~ 0.6 mm^2
    – Is this an optimum granularity Aop?
  – Fclk ~ 400 MHz
  – Po ~ 750 mW
  – Rc ~ 20 GOPs
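
These figures can be checked against the constraint Nop x Fclk > Rc from the Macro-architectural Constraints slide. The Python sketch below does that arithmetic; the derived quantities (total compute area and power per GOP) are simple consequences of the slide's numbers and are not stated in the original.

```python
N_OP = 50          # parallel significant operations
A_OP_MM2 = 0.6     # mm^2 per operation unit
F_CLK_HZ = 400e6   # clock frequency
P_O_MW = 750.0     # power budget
R_C_OPS = 20e9     # required computational load

throughput = N_OP * F_CLK_HZ              # ops/s delivered
compute_area = N_OP * A_OP_MM2            # mm^2 used by the operation units
mw_per_gops = P_O_MW / (R_C_OPS / 1e9)    # power cost per GOP

print(f"Nop x Fclk = {throughput / 1e9:.0f} GOPs (required: {R_C_OPS / 1e9:.0f})")
print(f"Nop x Aop  = {compute_area:.0f} mm^2")
print(f"Po / Rc    = {mw_per_gops:.1f} mW per GOP")
```

The delivered throughput equals the required load, so the example point just meets the computational constraint, consistent with the approximate (~) values on the slide.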

Key Computing Element: IXS Core

IXS core
– Efficient vector processing architecture
– Octal-MAC architecture with 8/16/32-bit arithmetic
– Quick loop entry/exit mechanisms
– Loop buffer
– Data alignment unit
– Resource management engine
– Integrated address generation and control-processing pipeline
Figure: IXS core block diagram. Labeled units: four identical datapath groups, each with MULT, M-ADD, D-ADD, S-ADD, S-MULT and accumulators A0 to A3; four ALIGN units; VECTOR REGISTER FILE; DMA (64 bits) and DMA Bridge; X-Bus, Y-Bus, Z-Bus, P-Bus, and R-Bus; DAU-X, DAU-Y, DAU-Z; ALU (32 bits); MULT 16x16; BARREL SHIFT; Control Registers; Register File; LOOP BUFFER; Program Sequencer (Loop, Branch); Instruction decoder & control unit; Program memory banks; Data memory banks; LX, LY, AX, AY, AZ; JTAG Controller; Debug Unit.

Architecture Summary
• IXS Processor Octal MAC units
  – RISC, tightly coupled
  – Acceleration H/W
    – Viterbi/Turbo
    – Correlation, De-spreading, etc.
    – Filter
  – Parameters within the Nop Range (50)
    – 5 PEs x 9 MACs = 45 MACs
    – 32 8-bit adders per PE
• Mesh-Connected to Surrounding Processors (5 PEs total)
• Do we have the optimal Aop?
  – Lower Aop will start to increase interconnect Power
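
The parallelism count above can be tallied in a few lines. The Python sketch below reproduces the 45-MAC total and compares it with the Nop ~ 50 target; the peak MAC rate assumes the ~400 MHz clock from the example-parameters slide applies to every PE, which the slides do not state explicitly.

```python
PES = 5                     # PEs in one mesh-connected group
MACS_PER_PE = 9             # per the slide: 5 PEs x 9 MACs
ADDERS_8BIT_PER_PE = 32     # 8-bit adders per PE
NOP_TARGET = 50             # Nop range from the example-parameters slide
ASSUMED_FCLK_HZ = 400e6     # assumption: same clock for every PE

total_macs = PES * MACS_PER_PE
total_adders = PES * ADDERS_8BIT_PER_PE
peak_mac_rate = total_macs * ASSUMED_FCLK_HZ

print(f"total MACs: {total_macs} (target Nop ~ {NOP_TARGET})")
print(f"total 8-bit adders: {total_adders}")
print(f"peak MAC rate: {peak_mac_rate / 1e9:.0f} GMAC/s")
```

Counting MACs alone gives slightly fewer than 50 parallel operations; the 8-bit adders add further parallel operations for tasks such as correlation and Viterbi metrics.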

How does the IXS PE Compare against Dedicated Hardware?

Power and Area Efficiency of IXS PE vs Dedicated H/W for WLAN Benchmark
Still 5-7x Dedicated H/W
Figure: bar chart (axis 0 to 7) of Power and Area for the IXS-PE Array relative to the WLAN Ave. Estim. Baseband PHY and lower MAC estimates (all scaled to 90 nm).

How do we compare against other Reconfigurable Approaches?

How Does our Architecture Compare? Multi-User Detector Benchmark
Figure: bar chart (axis 0 to 30) of Power and Area for GP-DSP (BWRC), DSP Exten. (BWRC), Intel IXS Homogeneous, Berkeley Pleiades, and Dedicated Hardware. Source: BWRC and Lee Snyder.

Die Photo
Figure: die photo with the DSP Core and RISC Core labeled.

Summary
– Homogeneous Mesh-Connected Array of IXS Processing Elements for Infrastructure
  – Low power/size (5-7x dedicated h/w)
  – Flexibility where it’s needed
  – Scalability
  – For given size/power and feature size constraints a “good” solution can be found
– Key Processing element
  – Minimum Memory
  – “Maximum-Datapath” Units
– Next Steps:
  – “What is the optimal Aop Size?”
  – “What is the right Arch. for the Client?”