Cadence IP Blogs

Transforming the Automotive Experience with Cadence Tensilica DSPs

SriramK — Fri, 06 Mar 2026 04:54:00 GMT

Experience Innovation at Embedded World 2026

The automotive industry has shifted its focus from traditional performance metrics to prioritizing safety and comfort. As vehicles evolve into software-defined environments, the interior cabin is emerging as a sanctuary—offering enhanced safety, superior comfort, and immersive high-fidelity entertainment for all occupants. At this year's embedded world, Cadence is proud to showcase how Tensilica DSPs are at the forefront of this transformation. In collaboration with leading ecosystem partners, Cadence will present a series of engaging demonstrations that highlight the latest in-cabin technology advancements.

Featured Demonstrations

The "quiet bubble" Active Noise Cancellation with Silentium: Road noise can significantly detract from a premium cabin experience. To address this, Cadence has partnered with Silentium to demonstrate their Quiet Bubble™ software running on Tensilica HiFi DSPs. Leveraging low-latency processing, this system actively cancels unwanted road and tire noise in real time, delivering a serene and whisper-quiet interior. Passengers can enjoy clear conversations and a more focused driving experience, free from external distractions.
In-cabin sensing and occupant monitoring on NXP RT700 platform: Tensilica DSPs are instrumental in enhancing the safety of drivers and passengers. This demonstration running on the NXP RT700 platform highlights advanced in-cabin sensing capabilities and workloads processed on our HiFi 1 DSP core, including driver distraction detection, real-time monitoring of vital signs, and child presence detection to ensure no child or pet is inadvertently left behind. The importance of these technologies has grown as the Euro NCAP 2026 protocols now demand higher safety standards. Achieving a 5-star safety rating requires manufacturers to implement not only alert systems but also direct-sensing and active-intervention solutions.
Long-range radar chipset from NXP S32R47: Targeting L2+ to L4 automotive ADAS applications, Cadence Tensilica FloatingPoint DSPs and FFT accelerators provide just the right solution under the hood to process radar-dense point clouds, perform object detection, classify and separate tightly spaced objects, sense debris next to these objects, and detect vulnerable road users (VRUs), enabling safer highway and urban driving. Cadence DSPs also support multimodal sensing and lidar sensor processing.

Whether your interests lie in active acoustics, AI-driven safety features, or the future of zonal vehicle architecture, Cadence experts will be available to provide in-depth explanations and demonstrations of the latest technology.

Where: Hall 4, Booth 219
When: March 10-12, 2026

For more information, visit the embedded world 2026 summary page and contact us to request a meeting.

Accelerating Chiplet Innovation with a New Partner Ecosystem

Mick Posner — Wed, 04 Mar 2026 16:57:00 GMT

The semiconductor industry is currently undergoing a massive shift. As we push the boundaries of performance in physical AI, data centers, and high-performance computing (HPC), traditional monolithic chip design is hitting physical and economic walls. The answer for many engineers and architects is chiplets, a modular approach that enables the mixing and matching of silicon dies to create powerful, highly customized systems.

However, transitioning from a single-die SoC (system on chip) to a multi-die SiP (system in package) brings a surge in engineering complexity. How do you ensure different pieces of silicon from different vendors communicate with each other correctly?

To tackle these challenges head-on, Cadence has announced a major leap forward: a Chiplet Spec-to-Packaged Parts ecosystem. This initiative is designed to streamline the engineering process and accelerate time to market. Through our partnerships, Cadence is paving a lower-risk path for the next generation of chiplet adoption.

The Spec-to-Packaged Parts Vision

The core of this announcement is the Cadence Physical AI chiplet platform. This isn't just a set of tools. It's a comprehensive configurable platform designed to bridge the gap between a chiplet specification and a final, known-good die (KGD) or packaged (multiple dies) part.

Cadence has built spec-driven automation that generates chiplet framework architectures. These frameworks combine Cadence's own IP with third-party partner IP, all wrapped in critical chiplet management services, as well as built-in security and safety features.

The goal is clear: Accelerate the spec-to-parts process while reducing risk!

Developing chiplets often feels like venturing into uncharted territory. By providing a pre-verified platform, Cadence enables design teams to start with a robust foundation rather than building everything from scratch. In addition to significantly reducing customer-specific chiplet development time, this approach optimizes costs, provides the flexibility needed for customization, and enables configurability that modern applications demand.

Figure 1. Cadence Physical AI Chiplet Platform

Critically, the generated chiplet architectures are standards-compliant. They adhere to the Arm Chiplet System Architecture and the future OCP Foundational Chiplet System Architecture, ensuring broad interoperability. The Cadence Chiplet Framework encapsulates these capabilities, which are reusable across chiplets, accelerating chiplet development and ensuring cross-chiplet interoperability through standardization.

Figure 2. Cadence Chiplet Framework

Strategic Partners: Arm and Samsung

Two key collaborations anchor this new ecosystem, signaling the industry-wide support for this initiative.

Arm: Powering Physical AI

Building on a long history of collaboration, Cadence and Arm have forged a new strategic partnership focused on physical AI.

This agreement grants Cadence access to the advanced Arm Zena Compute Subsystem (CSS). This is a game-changer for edge AI processing requirements in automobiles, robotics, and drones. By integrating Arm's technology, the platform empowers safer, smarter, and more efficient systems.

Samsung Foundry: Future Prototype Silicon Proof

One of the biggest hurdles in chiplet adoption is proving real-world functionality. To showcase Cadence's chiplet expertise and Samsung's semiconductor technology, Cadence is partnering with Samsung Foundry to build a real-world silicon prototype of the Physical AI Chiplet Platform. Using Samsung's SF5A process for automotive, the prototype will feature an Arm Zena CSS-based chiplet, a central system chiplet, and an AI chiplet powered by Cadence Neo NPUs.

A Robust IP Partner and Silicon Analytics Ecosystem

An ecosystem is defined by the strength of its community. Beyond Samsung and Arm, Cadence has enlisted a diverse group of initial IP partners and a silicon analytics company to ensure important aspects of chiplet designs for Physical AI are covered.

Arteris: Providing physically-aware network-on-chip (NoC) IP products like Ncore for coherent systems and FlexGen for non-coherent ones to handle high bandwidth, low latency, and power-efficient interconnects in multi-die systems.

eMemory: Contributing enhanced one-time programmable (OTP) memory that complements Cadence's security subsystems, ensuring secure storage and key management.

M31 Technology: Delivering MIPI PHY interface IP, essential for automotive and high-volume consumer applications requiring flexible camera and display integration.

proteanTecs: Embedding a hardware health and performance monitoring system for telemetry and silicon analytics software, per chiplet die and across chiplet types, to enable power-efficient, safe, and reliable performance of next-gen systems.

Silicon Creations: Providing ultra-fast, multiphase PLL clocking solutions optimized for the Cadence Chiplet Framework, UCIe die-to-die IP, and interface IP.

Trilinear Technologies: Delivering advanced DisplayPort IP to drive high-performance video connectivity.

Conclusion

As David Glasco, vice president of the Compute Solutions Group at Cadence, noted in our recent announcement, this ecosystem represents a "significant milestone in chiplet enablement."

In an era of skyrocketing design complexity, achieving the necessary performance and cost efficiency demands collaboration and standardization. By combining extensive internal expertise with a powerful network of partners like Arm and Samsung Foundry and specialized IP and silicon analytics providers, Cadence is building a launchpad for the next generation of physical AI and HPC innovations.

For engineers and architects, this means less time wrestling with integration headaches and more time focusing on differentiation and innovation.

Resources

Read the news release: Cadence Launches Partner Ecosystem to Accelerate Chiplet Time to Market
Watch our webinar: Cadence Chiplet Solutions: Helping You Realize Your Chiplet Ambitions.
Learn more: Cadence Chiplet Solutions

The Memory Imperative for Next-Generation AI Accelerator SoCs

Subash Peddu — Wed, 18 Feb 2026 04:30:00 GMT

The tremendous growth in large language model (LLM) size corresponds with an equally dramatic rise in agentic AI applications, which are being adopted rapidly across both enterprise and consumer markets. To accommodate this demand, hyperscale providers are deploying next-generation data centers at an unprecedented rate and scale. Each of these data centers hosts millions of AI accelerators to run agentic AI workloads. One notable example is Meta’s Hyperion data center, which is projected to consume up to 5GW of compute power in a footprint that would cover a significant portion of Manhattan.

AI accelerators are key components of the AI data center tasked with speeding up the complex mathematical calculations required for AI and machine learning (ML). As the demand for AI continues to skyrocket, it’s crucial that SoCs provide the necessary “brain power” for these AI accelerators to keep pace. In this dynamic environment, SoC architects must carefully balance multiple factors to optimize their next-generation designs. When faced with the task of defining SoCs for tomorrow’s AI accelerators, four key criteria warrant close consideration: performance, power efficiency, SoC layout optimization, and futureproofing.

Performance

Next-generation LLMs, which can reach up to 100-trillion-parameter scale and beyond, demand exceptional computational throughput. While compute subsystems are advancing at an estimated 20X every two years, memory subsystems are improving only 2–3X over three years, creating what the industry now calls the memory wall.
In the face of this challenge, high-bandwidth memory (HBM) remains the preferred solution with its wide I/O architecture and superior performance. The latest HBM standard, HBM4, doubles the number of data lines from 1,024 to 2,048 compared to the previous generation (HBM3). This enables a significant increase in memory bandwidth (2X just due to doubling the data bits), unlocking higher overall AI accelerator performance.

Another key performance metric is memory data rate. The HBM4 standard increases the per-bit data rate to 8Gbps, from 6.4Gbps in HBM3. However, DRAM vendors are quickly moving beyond this speed due to the ever-increasing performance requirements of AI accelerators. To achieve these speeds, SoC developers require a memory subsystem, including a memory controller and physical layer (PHY), that can perform at or beyond the DRAM speed to ensure adequate system margin and high reliability. For example, in April 2025, Cadence announced an HBM4 PHY and controller that perform at 12.8Gbps, or 3.3TB/s of memory bandwidth, per HBM4 DRAM device. At this speed, AI hardware developers will have 4X the memory bandwidth available per DRAM when compared to the previous generation! The industry is not stopping, so expect to see even higher speeds to support AI’s insatiable demand for memory bandwidth.
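
The bandwidth figures above follow from simple arithmetic: data lines multiplied by the per-bit data rate, divided by eight to convert bits to bytes. The following is a minimal sketch of that calculation in Python. The function name is illustrative only, and the result is a theoretical peak that ignores protocol overhead and refresh.

```python
def hbm_peak_bandwidth_gbytes(data_lines: int, gbps_per_bit: float) -> float:
    """Theoretical peak bandwidth of one HBM device in GB/s (decimal units)."""
    return data_lines * gbps_per_bit / 8  # bits -> bytes

# Figures quoted in the text above.
hbm3 = hbm_peak_bandwidth_gbytes(1024, 6.4)        # HBM3 baseline
hbm4_std = hbm_peak_bandwidth_gbytes(2048, 8.0)    # HBM4 at the standard rate
hbm4_fast = hbm_peak_bandwidth_gbytes(2048, 12.8)  # HBM4 at 12.8Gbps

print(f"HBM3 @ 6.4Gbps : {hbm3:.1f} GB/s")
print(f"HBM4 @ 8.0Gbps : {hbm4_std:.1f} GB/s")
print(f"HBM4 @ 12.8Gbps: {hbm4_fast:.1f} GB/s ({hbm4_fast / 1000:.1f} TB/s)")
print(f"Gain over HBM3 : {hbm4_fast / hbm3:.1f}X")
```

Half of the 4X gain comes from the doubled interface width and the other half from the doubled per-bit data rate.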

Power Efficiency

Electricity is a major operational cost for modern data centers. In addition to powering SoCs and racks, data center operators must account for substantial cooling infrastructure and the power required to run it. With future AI/ML workloads measured in terabytes/s, power-efficient data transfers are even more critical for managing cost and improving sustainability.

For memory subsystems, efficient data movement between the SoC and HBM plays a critical role in overall power consumption. Measured in picojoules per bit (pJ/bit), lower energy-per-bit transfer directly enables higher energy efficiency, reduced cooling requirements, and lower total cost of ownership (TCO). By minimizing pJ/bit at the HBM PHY, systems achieve meaningful power savings that support sustainability goals while improving operational economics.
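
To see why pJ/bit matters at these data rates, interface power can be estimated as energy per bit multiplied by bits per second. The sketch below assumes hypothetical energy-per-bit values chosen purely for illustration. They are not measured or published figures for any PHY, and the function name is likewise an assumption.

```python
def interface_power_watts(pj_per_bit: float, bandwidth_tbytes_per_s: float) -> float:
    """Estimate data-transfer power from energy per bit and sustained bandwidth."""
    bits_per_second = bandwidth_tbytes_per_s * 1e12 * 8
    return pj_per_bit * 1e-12 * bits_per_second

BANDWIDTH_TBPS = 3.3  # per-device bandwidth quoted earlier in the article

# Hypothetical energy-per-bit values, for illustration only.
for pj in (0.5, 1.0, 2.0):
    watts = interface_power_watts(pj, BANDWIDTH_TBPS)
    print(f"{pj:.1f} pJ/bit at {BANDWIDTH_TBPS} TB/s -> {watts:.1f} W per device")
```

Because the relationship is linear, any reduction in pJ/bit translates directly into the same percentage reduction in transfer power at a given bandwidth, before cooling savings are counted.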

SoC Layout Optimization

In modern AI-centric SoCs, the core area is the most valuable real estate. Achieving higher AI performance depends on maximizing the silicon allocated to core logic—where the AI compute subsystem resides—while minimizing the footprint of the memory subsystem, which must still deliver massive bandwidth. As a result, both the SoC core area and shoreline must be extremely efficient to achieve peak performance.

Today’s AI SoCs are frequently designed at or near the reticle limit. At this scale, die edges must fit as many HBM4 PHYs as possible to maximize memory bandwidth per SoC while minimizing impact on compute core die area. Figure 1 illustrates an AI SoC layout with the HBM4 physical layers utilizing the area on the east and west sides of the die, with the compute core in the center. The long and narrow PHY layout efficiently uses the die edges with minimal waste between PHYs, while preserving valuable core logic space. In this example, the SoC will have 20TB/s memory bandwidth available with a 12.8Gbps memory subsystem.
Optimizing HBM4 PHY dimensions is more than just a packaging concern. It is a fundamental design decision that directly impacts AI performance, scalability, and silicon efficiency.
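
As a rough cross-check of the 20TB/s figure, the sketch below estimates how many HBM4 interfaces that total implies at 12.8Gbps. The article does not state the number of PHYs in the example layout, so the resulting count is an inference from the quoted numbers, not a stated design parameter.

```python
DATA_LINES = 2048      # HBM4 interface width
GBPS_PER_BIT = 12.8    # memory subsystem data rate from the text
TARGET_TBYTES = 20.0   # total bandwidth quoted for the example SoC

# Theoretical peak per HBM4 device, in TB/s (decimal units).
per_device_tbytes = DATA_LINES * GBPS_PER_BIT / 8 / 1000

# Nearest whole number of devices to the quoted total (an inference).
implied_devices = round(TARGET_TBYTES / per_device_tbytes)
total_tbytes = implied_devices * per_device_tbytes

print(f"Per device : {per_device_tbytes:.2f} TB/s")
print(f"Implied PHY count for ~{TARGET_TBYTES:.0f} TB/s: {implied_devices}")
print(f"Aggregate  : {total_tbytes:.2f} TB/s")
```

Under these assumptions, the quoted total is consistent with six HBM4 interfaces split across the two die edges.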

Figure 1: Example SoC layout optimized for maximum bandwidth

Futureproofing

With SoC design and fabrication cycles spanning 12 to 24 months, products designed today must anticipate the availability of faster DRAM devices at the time of system deployment. This means that SoC designs should incorporate HBM4 PHY and controller technology capable of supporting the highest speed grades today while being able to scale seamlessly as future DRAM generations become available.

At Cadence, we specialize in tackling these detailed design challenges. We are ready to provide the necessary technical information and support to quickly start your next-generation memory subsystem design. To see our industry-leading HBM IP in action, visit us at Booth 300 at the 2026 Chiplet Summit, where we’ll be demonstrating our 3nm HBM3E PHY operating at 14.4Gbps.

Learn more about Cadence’s HBM4 PHY and Controller IP.

Accelerating Chiplet Interoperability

Mick Posner — Mon, 16 Feb 2026 13:00:00 GMT

In the chiplet marketplace, the vision of a library of chiplets that can be mixed and matched requires interoperability between chiplets (sometimes from different sources), meaning standardization is essential. By establishing common chiplet system standards, designers will be able to seamlessly integrate chiplets from different vendors, reducing development time and costs while continuing to drive innovation. Interoperability ensures that diverse components work together reliably, unlocking new possibilities for modular system design and expanding the market for all participants.

While there are several specifications that aim to define chiplet systems, the Arm® Chiplet System Architecture (Arm CSA) is by far the most comprehensive and mature. Arm recently contributed the Arm CSA to the Open Compute Project (OCP) to form the Foundational Chiplet System Architecture (FCSA). FCSA will deliver a vendor and CPU-neutral architecture, common system partition guidelines, and a shared vocabulary and set of standards for system-level and interface definitions between chiplets, complementing existing interconnect standards like the Universal Chiplet Interconnect Express (UCIe™).

Think of OCP FCSA as the expansive baseline standard covering broad chiplet system architecture and Arm CSA as the Arm-specific implementation of OCP FCSA. Although Arm CSA came first, in the future it will follow and build upon the OCP FCSA.

The OCP's FCSA offers several key benefits:

Modularity and Interoperability: It enables modular design at the silicon level, allowing different chiplets to work seamlessly together, regardless of the manufacturer.
Reduced Fragmentation: By providing a neutral, standardized framework, it minimizes industry fragmentation and promotes collaboration across companies.
Open Ecosystem: It encourages innovation by fostering an open chiplet economy, where companies can contribute and benefit from shared advancements.
Scalability: It supports scalable solutions for industries like automotive and computing, where flexible and efficient silicon designs are critical.
Cost Efficiency: It facilitates cost-effective development by reusing standardized components rather than designing custom silicon for every application.
Accelerated Innovation: It speeds up time to market for new technologies by simplifying the integration of diverse chiplets into cohesive systems.

This architecture is a significant step toward a more collaborative and efficient future for chiplet-based designs.

Designed with interoperability in mind, the Cadence Physical AI Chiplet platform and its underlying Cadence Chiplet Framework follow the Arm CSA and are expected to align with the OCP FCSA specification as it matures. The Cadence implementation, in addition, incorporates chiplet system capabilities above and beyond the specification that our engineers deemed essential for physical AI applications and generalized chiplet use cases. Cadence is excited about the evolution of the new Foundational Chiplet System Architecture specification and is committed to driving this standard forward through leadership and future contributions.

Learn more about Cadence Chiplet Solutions.

Cadence Tapes Out 32GT/s UCIe IP Subsystem on Samsung 4nm Technology

MBhatnagar — Wed, 11 Feb 2026 01:30:00 GMT

With the rapidly increasing connectivity demands driven by AI/ML and HPC/datacenter use cases, high-throughput die-to-die connectivity is more essential than ever. Cadence has been at the forefront of die-to-die connectivity solutions since 2018. In keeping with a rapidly broadening portfolio of die-to-die connectivity solutions, Cadence has taped out the IP subsystem for its 32GT/s UCIe solution on Samsung's 4nm (SF4X) process technology. Building on eight years of expertise in die-to-die solutions and the success of multiple UCIe IP subsystem test chips, this next-generation product delivers improved performance and flexibility while maintaining the reliability and precision proven by its predecessors.

The 32GT/s UCIe solution builds upon silicon-proven IP at 32GT/s and 16GT/s speeds. The 16GT/s IP was the subject of the first two UCIe transceiver publications, so its performance is peer-reviewed as well as proven in silicon. Key features include:

High-Speed Data Transfer Ranging from 4Gbps to 32Gbps: This IP supports all UCIe data transfer rates from 4Gbps to 32Gbps, offering flexibility across various customer applications. This broad speed support makes the IP ideal for diverse use cases, providing scalable performance to meet the requirements of both low-power systems and high-throughput systems.
Optimized Design for 32Gbps Speed and Wide Interoperability: The IP is optimized to operate seamlessly at the UCIe specification speed of 32Gbps, ensuring robust interoperability with any UCIe solution. This optimization enables the best performance metrics and broader interoperability.
Universal Interoperability for Various Transmitter and Receiver Configurations: The UCIe transmitter in Cadence's IP supports both half-rate and quad-rate UCIe receiver implementations, ensuring broad compatibility across various configurations. With the ability to generate clocks up to 16GHz, this IP provides robust support for data rates up to 32Gbps, ensuring full interoperability in diverse UCIe applications.
Self-Calibrating Capabilities and Hardware-Based Bring-Up with No Firmware Requirements: A key feature of Cadence's UCIe solutions is their self-calibration functionality and hardware-based bring-up, which eliminates the need for firmware intervention during system initialization. This significantly simplifies the setup process by removing the need for firmware loading.
At-Rate Loopback for Wafer Sort and Validation: The IP features at-rate loopback at 32Gbps, enabling efficient wafer sort—a key feature for D2D solutions—and simplifying packaged part validation. Additionally, the full die-to-die (D2D) loopback mode ensures comprehensive validation across the entire link, including the channel, from one die to its partner and back, offering complete testing coverage for high-reliability systems.
Integrated Internal PLL: Like all previous Cadence UCIe IP, this IP includes an internal Phase-Locked Loop (PLL) that autonomously generates the necessary Lclk and high-speed clocks within the IP. The user only needs to supply a 100MHz reference clock, with the option to provide the Lclk from the SoC. This allows for simplified clock management, streamlined integration, and reduced system complexity.
Robust Performance Under Extreme Operating Conditions: Cadence's UCIe IP solutions feature a "Maintenance Mode" that performs regular background runtime recalibration to ensure uninterrupted operation, even under changing conditions such as supply voltage and temperature drifts (ranging from -40°C to 125°C), covering the full industrial range.
Support for Vendor-Defined Messaging Over Sideband Links: The IP supports vendor-defined messages over sideband links, fully compliant with UCIe specifications. This feature ensures effective communication and control across the die-to-die interconnect, enhancing system integration. It is included in both the 16Gbps and 32Gbps versions of our UCIe IP solutions.
Broad Protocol Support: Cadence's 32G UCIe IP also offers broad protocol support to enable pre-validated, high-performance, low-latency, and low-power subsystems for any application.
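
The relationship between the 16GHz clock and the 32Gbps data rate in the feature list above can be sketched with a few lines of arithmetic. The sketch assumes that "half-rate" means two bits per lane per clock cycle and that "quad-rate" means quarter-rate clocking with four bits per cycle. It also assumes a 16-lane standard-package module for the bandwidth estimate. The function names are illustrative, and the bandwidth is a raw figure that ignores protocol overhead.

```python
def forwarded_clock_ghz(data_rate_gbps: float, bits_per_clock_cycle: int) -> float:
    """Clock frequency needed to carry a per-lane data rate."""
    return data_rate_gbps / bits_per_clock_cycle

def module_raw_bandwidth_gbytes(data_rate_gbps: float, lanes: int) -> float:
    """Raw bandwidth of one module, per direction, in GB/s."""
    return data_rate_gbps * lanes / 8

RATE_GBPS = 32.0

half_rate_clk = forwarded_clock_ghz(RATE_GBPS, 2)     # 2 bits per cycle
quarter_rate_clk = forwarded_clock_ghz(RATE_GBPS, 4)  # 4 bits per cycle
sp_module = module_raw_bandwidth_gbytes(RATE_GBPS, 16)  # assumed x16 module

print(f"Half-rate clock   : {half_rate_clk:.0f} GHz")
print(f"Quarter-rate clock: {quarter_rate_clk:.0f} GHz")
print(f"x16 module at 32Gbps: {sp_module:.0f} GB/s per direction (raw)")
```

Under these assumptions, the 16GHz maximum clock corresponds to the half-rate case at 32Gbps, which is why a transmitter able to generate it can serve either receiver style.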

Cadence's tapeout of the 32GT/s IP subsystem marks a major advancement in die-to-die connectivity. It offers high performance, power efficiency, and integration, supporting a range of advanced packaging options. This IP builds on the reliability and precision of Cadence's previously proven Gen2 UCIe-SP and Gen1 UCIe-SP and UCIe-AP solutions, continuing to support features such as self-calibrating capabilities, hardware-based bring-up, and robust performance under varying conditions. As a contributing member of the UCIe consortium, Cadence is helping to shape the future of the chiplet ecosystem and meet the needs of modern high-performance computing, data centers, and AI/ML applications.

For further information or inquiries, please contact us to explore how our UCIe IP can support your projects. Learn more about Cadence's Universal Chiplet Interconnect Express (UCIe) PHY and Controller.

CES 2026 Recap: Trust Built on a Real, Working eUSB2V2 System Demo

DavidShin — Tue, 10 Feb 2026 04:54:00 GMT

Nothing builds trust like a real working system.

That was the guiding principle behind our CES 2026 showcase in Las Vegas—where we successfully demonstrated what we believe is an industry-first 3nm eUSB2V2 PHY IP alongside eUSB2V2 controller IPs (both host and device) running together in a complete, end-to-end system. The result: a live, real-world eUSB2V2 data path performing at speeds of up to 4.8 Gbps, highlighting the promise of this new USB interface protocol.

Why This Demo Mattered

When a new interface technology emerges, specs and simulations are only part of the story. What customers and partners truly need is confidence that the ecosystem can work, in practice, across chips, controllers, PHYs, platforms, and devices.

At CES 2026, we brought that confidence to life with a demo designed to answer a simple question: Can eUSB2V2 deliver real throughput in real conditions—with real system behavior and interoperability?

Our answer: Yes, live, on the show floor.

https://youtu.be/gkZz-8faQBI

Demo Setup: A Complete Device-to-Host eUSB2V2 System

The demo system was built to represent a realistic, end-to-end data flow—capturing video on the device side and delivering it reliably to the host side for processing and display.

System architecture (high level):

3nm eUSB2V2 PHY test chips on both host and device sides
Host and device controllers implemented on FPGA boards
PC/ATX boards in the loop to support the full end-to-end pipeline
SMA cable connectivity between device PHY and host PHY to ensure robust signal integrity
Monitor display on the host side showing the live video feed

How Data Flowed

A high-resolution camera streamed maximum raw data in real time through the PC/ATX boards into the FPGA-based system. The device PHY communicated to the host PHY over SMA, and on the host side, the FPGA board ran the host controller, connecting to a PC/ATX board and monitor to display the live captured video.

This seamless flow showcased the efficiency and performance of eUSB2V2 in a practical, real-world scenario—not just a lab bench proof.

The Highlight: Two Simultaneous 4K Video Streams Over eUSB2V2

To push the system beyond a single “happy path” workload, we demonstrated two video sources transferring simultaneously from the eUSB2V2 device side to the eUSB2V2 host side.

1. 4K Live Video (UVC) — Uncompressed YUV

One stream was a 4K live video feed captured from a 4K webcam, transported as uncompressed YUV video from the device side to the host side. On the host, the system SoC performed video enhancement and rendered the output to the monitor.

Protocol/class used: UVC (USB Video Class)
Key takeaway: eUSB2V2 can sustain demanding real-time video movement without relying on compression tricks.
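
For a sense of scale, the raw bit rate of an uncompressed 4K stream can be compared against the 4.8 Gbps link rate quoted earlier. The post does not state the frame rate or pixel format used in the demo, so the 30 fps rate and the two YUV formats below are assumptions for illustration. Protocol overhead is also ignored.

```python
LINK_RATE_GBPS = 4.8  # eUSB2V2 peak rate quoted in this post

def raw_video_gbps(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    """Raw bit rate of an uncompressed video stream in Gbps."""
    return width * height * bits_per_pixel * fps / 1e9

# Assumed formats and frame rate (not stated in the post).
streams = {
    "4K YUV 4:2:0, 12 bits/pixel, 30 fps": raw_video_gbps(3840, 2160, 12, 30),
    "4K YUV 4:2:2, 16 bits/pixel, 30 fps": raw_video_gbps(3840, 2160, 16, 30),
}

for name, gbps in streams.items():
    share = gbps / LINK_RATE_GBPS * 100
    print(f"{name}: {gbps:.2f} Gbps ({share:.0f}% of link rate)")
```

Under these assumptions, a single uncompressed 4K stream occupies most of the link, which illustrates why the demo is a meaningful throughput test.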

2. 4K Recorded Video (MSC) — Bulk Transfer from SSD

The second stream demonstrated bulk data transfer: a 4K recorded video stored on an SSD acting as the device. The host system received the content and rendered it on the display.

Protocol/class used: MSC (Mass Storage Class)
Key takeaway: eUSB2V2 supports high-throughput bulk workloads alongside real-time streaming.

A Critical Enabler: UTMI v2.0 Bridging Controller and PHY

A key element of the demo was how we connected the system pieces cleanly and correctly.

Controllers (host + device): running on Cadence FPGA platforms
3nm PHY IP: implemented in test chips
Bridge/interface: the newly published UTMI (USB 2.0 Transceiver Macrocell Interface) v2.0

This enabled a smooth connection between the controller IP in FPGA and the PHY IP in silicon—helping validate the full stack of the solution.

What This Proves: eUSB2V2 As a Complete Solution

This CES 2026 demonstration wasn’t just a speed milestone. It was a system-level proof that eUSB2V2 can be delivered as a complete, interoperable solution:

Industry-first 3nm eUSB2V2 PHY IP (host + device PHY coverage)
Host and device controller IPs
A validated, real-time system that shows how the pieces operate together in the field

And importantly, it provides confidence for adoption—especially because it supports seamless interoperability between:

Host-side solutions already licensed by tier-1 customers, and
A variety of device-side solution providers across the ecosystem

Looking Forward: Built for What's Next – Consumer and Edge AI

Beyond consumer connectivity, this milestone reflects a strategic direction toward the anticipated edge AI era—where bandwidth, latency, power efficiency, and reliability become increasingly critical for distributed intelligence and sensor-rich devices.

This is about showing not only what’s possible today—but what can scale into next-generation applications tomorrow.

Conclusion

eUSB2V2 delivers the flexibility modern systems demand, combining high data rates, configurable link options, reduced EMI, and low power operation. As SoCs move to advanced process nodes while connected devices remain on mature nodes, eUSB2V2 is quickly becoming the go-to USB interface to bridge the gap, extending the proven, ubiquitous USB ecosystem.

Cadence, a long-time leader in USB IP, has expanded its portfolio with complete, end-to-end eUSB2V2 solutions, including host and peripheral controllers, PHYs, drivers, and Verification IP. Learn more at cadence.com.

Scale-Up and Scale-Out IP for Optical Interconnect for Accelerated Computing

HW202512191014 — Fri, 06 Feb 2026 17:00:00 GMT

Optical connectivity is foundational to modern data centers, enabling high-bandwidth, low-latency data movement across switches, routers, servers, and racks. With the rise of AI factories, its importance has increased dramatically. Optical links prov...

Heterogeneous Multicore Using Cadence IP

Nayan Gaywala — Fri, 23 Jan 2026 05:52:00 GMT

Build a heterogeneous multicore system with RISC-V, Xtensa DSPs, and Janus NoC. Offload work to the DSPs. System modeling and FPGA emulation of the heterogeneous multicore system.

From Spec to Silicon: Successful Physical AI System Chiplet Bring-Up

Mick Posner — Thu, 13 Nov 2025 14:00:00 GMT

The semiconductor industry is advancing at an unprecedented pace, driven by the need for higher performance, greater integration, and maximum efficiency. With Moore's Law slowing, innovative approaches like chiplet-based architectures have taken center stage, especially for physical AI designs. We are excited to announce a major milestone: the successful silicon bring-up of the Cadence System Chiplet, a core component of our physical AI chiplet platform.

In this post, we will review the strengths of the physical AI chiplet platform as we detail the multi-phase silicon bring-up journey of our System Chiplet, including technical highlights such as the Cadence UCIe high-speed die-to-die interconnect and LPDDR5X 9600 memory interface validation.

The Power of the Physical AI Chiplet Platform

Traditional SoCs were typically monolithic, with all functions housed on a single silicon die. As demand for specialization and higher performance surged, however, this model revealed significant limitations in manufacturing complexity, yield, and cost.

For applications such as automotive ADAS, robotics, drones, and aerospace and defense, the physical AI chiplet platform answers these challenges with a modular design. By disaggregating a large SoC into separate, specialized chiplets for (1) compute, (2) system management with memory and I/O, (3) AI engines, and (4) optional domain-specific functions, the platform enables cost reduction, customization, and flexible configurations. At the center of this architecture is the Cadence System Chiplet, which orchestrates communication, manages resources, and serves as the backbone of the entire platform.

Figure 1. Cadence's Physical AI Chiplet Platform

The Silicon Bring-Up Journey: Step by Step

Engineering a chiplet-based platform from concept to working silicon is a meticulous, multi-stage process. The successful bring-up of the System Chiplet, a critical component of the physical AI chiplet platform, showcases Cadence's deep technical expertise and demonstrates the maturity of this modular approach. The platform integrates multiple instances of the same System Chiplet die, enabling the system to mimic multiple application use cases.

Figure 2. Package image diagram and actual photo of package of dies

Milestone 1: System Platform Initialization

Achieving initial power-on and successful platform initialization marked the first major milestone. The hardware team coordinated power delivery, clocking, and basic connectivity across all chiplets, verifying that system-level reset and bring-up sequences performed as intended. Debug capabilities are embedded in the always-on power domain, which ensures die access before the UCIe die-to-die interconnect links are initialized. This foundational step enabled further functional validation of interfaces and prepared the platform for subsequent milestones. The single die configuration booted to the command prompt within a day of hardware and software setup. This was achievable as both the hardware and software had been pre-silicon verified in simulation and emulation.

Milestone 2: UCIe Die-to-Die Interface Bring-Up

After the individual die initialization, it was time to move on to the multi-die chiplet-to-chiplet configurations. This pivotal step in the process involved bringing up and validating the UCIe high-speed die-to-die interface, which is essential for reliable chiplet communication. One chiplet was configured as the multi-die initiator, and after completing its own secure boot, it went on to manage the initialization of the secondary chiplet. This is a baseline function of the Cadence Chiplet Framework, which I will share more about later. The engineering team carefully executed power sequencing, link training, and initial handshake routines across chiplets. Through exhaustive testing and measurements, we verified signal integrity, error rates, and lane reliability. Importantly, and with a significant margin, we successfully validated the 32Gb/s UCIe performance across the 25mm link (the maximum link length per UCIe specification) and shorter 7mm links implemented in the package. This successful milestone not only proved interoperability between chiplets at the raw electrical and protocol layers but also validated the robustness of the Cadence UCIe implementation.
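To put the 32Gb/s figure in context, the arithmetic below is a rough sketch of raw link bandwidth. The lane count is an assumption (16 data lanes per direction, the width of a UCIe standard-package module); the post does not state how many lanes or modules the package uses, and the result ignores protocol overhead.

```python
# Raw (pre-overhead) UCIe bandwidth per module and direction.
# ASSUMPTION: 16 data lanes per module, as in a UCIe standard-package module;
# the post does not state the lane or module count used in this package.

LANE_RATE_GBPS = 32      # per-lane data rate validated in the post (Gb/s)
LANES_PER_MODULE = 16    # assumed, see note above


def raw_bandwidth_gbps(lane_rate_gbps: float, lanes: int, modules: int = 1) -> float:
    """Raw unidirectional bandwidth in Gb/s, ignoring protocol overhead."""
    return lane_rate_gbps * lanes * modules


per_module_gbps = raw_bandwidth_gbps(LANE_RATE_GBPS, LANES_PER_MODULE)
per_module_gbytes = per_module_gbps / 8  # bits -> bytes

print(f"{per_module_gbps:.0f} Gb/s = {per_module_gbytes:.0f} GB/s per module, per direction")
```

Under that assumption, one module carries 512 Gb/s (64 GB/s) in each direction before overhead; passing a different `modules` value scales the estimate linearly.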

Figure 3. UCIe-SP32G RX Eye Opening (25mm Link)

Milestone 3: LPDDR5X 9600 Memory Interface Bring-Up

Maximizing AI performance requires high-speed memory access, deeply integrated into the system's core architecture. The bring-up and validation of the LPDDR5X 9600 memory interface represented the next major Cadence System Chiplet bring-up accomplishment. It includes the latest Cadence LPDDR5X IP solution, with the interface brought online and successfully trained for robust operation at 9600 Mb/s. With the memory subsystem operational, extensive stress tests—including demanding read/write patterns and high-bandwidth streaming—confirmed error-free, sustained high performance even with concurrent access across chiplets. Each chiplet in the two-chiplet system enables a unique configuration, so multiple realistic use cases could be validated. Test cases included disabling the memory subsystem in one chiplet and having the other chiplet read and write to memory across the chiplets' UCIe connections. Another test case configured the LPDDR5X interfaces on each chiplet, building a shared memory structure. The Cadence System Chiplet's central management ensured optimal memory utilization, empowering the physical AI chiplet platform to deliver advanced AI throughput and efficiency.
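For a sense of scale, peak memory bandwidth follows directly from the per-pin data rate and the bus width. The sketch below uses the 9600 Mb/s rate from the post; the bus widths are illustrative assumptions, since the post does not give the interface width of the System Chiplet.

```python
# Theoretical peak LPDDR5X bandwidth from per-pin rate and bus width.
# ASSUMPTION: the bus widths listed here are illustrative only.

DATA_RATE_MBPS = 9600  # per-pin data rate from the post (Mb/s)


def peak_bandwidth_gbytes(data_rate_mbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: (Mb/s per pin * pins) / 8 bits per byte / 1000."""
    return data_rate_mbps * bus_width_bits / 8 / 1000


for width in (16, 32, 64):  # hypothetical interface widths
    print(f"x{width}: {peak_bandwidth_gbytes(DATA_RATE_MBPS, width):.1f} GB/s")
```

These are upper bounds of 19.2, 38.4, and 76.8 GB/s for the three assumed widths; sustained bandwidth under real traffic is lower because of refresh, bank conflicts, and read/write turnaround.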

Figure 4. LPDDR5X Write Eye, excellent electrical margin (overclocked at 14.6Gbps)

Milestone 4: Chiplet Framework Validation

Another fundamental milestone centered on testing the Cadence Chiplet Framework itself, which served as a key criterion for the platform's success. This framework underpins the platform's modular architecture, defining standards for integration, discovery, management, secure boot, functional safety, and coordinated function among heterogeneous chiplets. The validation process included orchestrating complex operations across combinations of chiplets, verifying that each functional block could be independently managed, dynamically allocated, and automatically detected by the platform. Inter-chiplet workflows, error reporting, and platform-level configuration were demonstrated to operate seamlessly, confirming both the extensibility and robustness of the modular design. Proven Chiplet Framework integration ensures the platform supports rapid innovation, simple scalability, and reliable interoperability as new chiplets and workloads are introduced.

Figure 5. Cadence Chiplet Framework capabilities

Milestone 5: Functional and Performance Validation

The functional and performance validation phase involved rigorous testing of the platform under various real-world scenarios to ensure it meets expected standards. Comprehensive benchmarks were conducted to measure data throughput, latency, and power efficiency across diverse AI workloads. Cadence utilized several industry-standard benchmarks available through TinyML, including object detection. Because TinyML is a branch of machine learning focused on AI on the "edge," there is no need to rely on power-hungry cloud processing. Stress tests further validated the system's ability to handle peak performance conditions without degradation, stressing individual chiplets as well as multi-chiplet modes. The results confirmed that the platform achieves both high reliability and competitive performance metrics, positioning it as a robust solution for next-generation physical AI applications.

Finally, the focus moved to specific system-level validation of functional areas not covered by the previous application cases. Tests covered a range of scenarios, from standard data transfers to complex AI task execution, each coordinated by the Cadence System Chiplet and distributed over chiplets with identical or differing configurations to mimic additional application use cases. The platform excelled, with no errors in high-volume data exchange, consistent performance under AI workloads, and robust overall system integration. The System Chiplet proved its critical role as the nexus for communication and orchestration within the platform, while the use of multiple chiplets multiplied the AI throughput and provided flexible performance scaling utilizing a multi-die chiplet-based modular design.

What This Success Means for Future Physical AI Platforms

The successful bring-up of the Cadence System Chiplet, as part of a physical AI chiplet platform, marks a new standard for modular, high-performance semiconductor design.

De-Risking Advanced System Integration: Demonstrating a fully operational System Chiplet intertwined with memory and other critical interfaces gives future product teams confidence in adopting this platform for powerful physical AI systems, moving the product from concept to market-ready.
Accelerating Ecosystem Growth: By validating not only open standards but also the System Chiplet approach within a comprehensive platform, we move closer to an ecosystem where designers can reliably combine chiplet designs.
Enabling Powerful and Flexible Architectures: With the System Chiplet serving as the heart of the physical AI chiplet platform, next-generation automotive ADAS, drones, robotics, and aerospace and defense designs can now benefit from the flexibility and scalability once limited to complex and costly monolithic SoCs.

Physical AI platforms are poised to transform multiple industries, with use cases spanning automotive, robotics, drones, and aerospace and defense. This versatility highlights the platform's adaptability to the complex requirements of safety, autonomy, and high-performance computing found in these sectors.

Final Thoughts

Transitioning from design to functioning silicon on a modular platform, especially for physical AI applications, is a complex, rewarding journey. Our successful bring-up highlights the decisive role of the Cadence System Chiplet as an essential component of the physical AI chiplet platform and Cadence's role in jump-starting the realization of a chiplet marketplace. While standardized die-to-die interconnects like UCIe facilitate chiplet interoperability, the real impact lies in the Cadence platform's integrated design and silicon-proven chiplet framework managing a multi-die chiplet system.

We are proud to help shape a new era of scalable, adaptable, and high-performance physical AI systems, which are at the core of tomorrow's most powerful edge AI technology solutions. Cadence is ready to help our customers realize their chiplet ambitions.

Download our eBook, Cadence Chiplet Solutions: Helping You Realize Your Chiplet Ambitions.

Learn more about the Cadence Chiplet solutions.

The Power of Shifting Left: Cadence Accelerating Innovation with Arm

Arif Khan — Sat, 08 Nov 2025 00:00:00 GMT

In semiconductor design, projects are remembered for their extremes—legendary successes and cautionary failures. The difference often hinges on when problems are discovered. A bug found late in development can derail timelines and budgets. This is why "shifting left"—moving testing and validation earlier in the process—is now a critical strategy for innovation.

Why Shifting Left Matters

Shifting left means bringing testing, verification, and validation activities forward in the design cycle. Instead of waiting for physical prototypes, teams use simulation and emulation to catch issues early. This proactive approach reduces costs, accelerates time to market, and minimizes the risk of late-stage surprises. The cost delta is staggering. NASA's research shows that the cost of fixing a bug multiplies tenfold at each stage of development. A bug caught during requirements costs "1X"; during design, "10X"; during build, "100X"; and in production, "1000X." That is real money, real time, and real risk.
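The tenfold rule is easy to express as a small calculation. The sketch below simply encodes the four stages and the 10X factor quoted above; the function name and the unit base cost are illustrative.

```python
# Relative cost of fixing a bug, using the 10X-per-stage rule quoted above.

STAGES = ["requirements", "design", "build", "production"]


def fix_cost(stage: str, base_cost: float = 1.0, factor: float = 10.0) -> float:
    """Cost of a fix at a given stage, relative to catching it at requirements."""
    return base_cost * factor ** STAGES.index(stage)


costs = {stage: fix_cost(stage) for stage in STAGES}
for stage, cost in costs.items():
    print(f"{stage:12s} {cost:6.0f}X")

# Ratio saved by catching a bug one stage earlier (build -> design).
shift_left_gain = fix_cost("build") / fix_cost("design")
```

Moving a single discovery one stage to the left is worth a factor of ten under this model, which is the whole argument for investing in simulation and emulation before silicon.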

Cadence's Approach to IP Integration

At Cadence, we've made shifting left a core part of our IP delivery model. Take PCI Express® (PCIe®) technology: previously, customers received separate controller and PHY components and were left to sort out integration challenges themselves. We changed that by delivering pre-verified subsystems, with the controller and PHY tested together in real environments. We own the integration risk, not our customers. We apply the same standard to emerging technologies such as CXL™ and UCIe™.

Arm Neoverse CSS Ecosystem: Blueprint for Acceleration

Arm® Neoverse® Compute Subsystems (CSS) take this approach further. Going beyond discrete IP, Arm delivers a pre-integrated, pre-verified platform (cores, mesh, and control logic) that is ready to go. The Server Base System Architecture (SBSA) and SystemReady compliance suite mean hardware boots "out of the box." This robust framework eliminates bottlenecks and accelerates system bring-up.

Cadence + Arm: Multi-Platform Validation for Real-World Success

As a key Arm partner, we provide next-generation IP for interfaces such as PCIe, CXL, and DDR memory. Our multi-platform validation pipeline embodies shift-left as follows:

RTL Simulation with Xcelium Logic Simulation: Early sanity checks catch fundamental issues in PCIe transactions and memory operations.
Emulation with Palladium Solution: High-speed, hardware-based emulation runs full Arm SystemReady validation suites, stress-testing systems before silicon exists.
Full Compliance Testing: Rigorous multi-stage testing ensures our IP meets SBSA and SystemReady specifications, giving customers confidence from day one.

What truly sets this collaboration apart is the depth and breadth of our validation strategy. By leveraging both simulation and emulation, we replicate real-world scenarios and workloads, uncovering edge cases that might otherwise go undetected until late in the development cycle. This means our customers receive IP that's not only functionally robust but also proven to perform under demanding conditions.

Our teams work closely with Arm engineers to co-develop test plans, share insights, and rapidly iterate on solutions. This joint effort accelerates the identification and resolution of integration challenges, ensuring that our IP seamlessly fits within the Arm Neoverse CSS ecosystem. We also validate across multiple platforms and configurations, from basic boot sequences to complex memory and connectivity operations, so customers can trust that their systems will work as intended, right out of the box.

Some interfaces, such as PCIe and DDR5, require special attention due to legacy quirks and boot requirements. Cadence integrates and validates these within Arm Neoverse CSS environments, ensuring robust, low-risk solutions for customers. This comprehensive, collaborative approach is the foundation for delivering innovation at speed and scale.

The Future: Collaborative, Accelerated Innovation

The ongoing collaboration between Cadence and Arm is continually evolving. As new Arm Neoverse CSS versions and protocol standards emerge, we co-validate solutions to stay ahead. Test chips featuring Cadence IP and Arm cores validate functionality in silicon, shifting left before customers even start their designs.

Whether through leading-edge IP, full subsystem integration, or cloud-based validation, Cadence is committed to customer success. By embracing shift-left and collaborating with Arm, we're building not just better components, but a faster, more efficient path to innovation.

Rethinking Edge AI Interconnects: Why Multi-Protocol Is the New Standard

Joe C — Wed, 05 Nov 2025 23:00:00 GMT

Modern compute systems have evolved beyond reliance on a single dominant interface. Today, they're increasingly defined by their ability to support multiple high-speed protocols concurrently—including PCIe, Ethernet, and others. This shift toward multi-protocol capability is fundamentally reshaping how we architect intelligent edge AI systems, especially as inferencing workloads grow more distributed, data-intensive, and latency-sensitive.

Autonomous Systems Demand Real-Time Edge AI—and Smarter Interconnects

Autonomous platforms, extending from vehicles to industrial robots, rely on real-time AI inferencing to make split-second, accurate decisions. These systems must rapidly process massive volumes of sensor data, run complex models on AI accelerators, and coordinate with central compute units (CCUs)—all under tight latency and power constraints.

To meet these demands, concurrent multi-protocol support is no longer a luxury—it's a necessity. A multi-protocol PHY that enables PCIe 5.0 and 25G Ethernet to operate simultaneously delivers the high-speed, low-latency connectivity required across the entire edge AI stack.

While newer standards, like PCIe 6.0/7.0 and high-speed Ethernet, are advancing rapidly, they often introduce higher power consumption, cost, and integration complexity—making them better suited for hyperscale data centers than edge environments. In contrast, PCIe 5.0 and 25G Ethernet strike the right balance of bandwidth, efficiency, and ecosystem maturity, making them ideal for real-time, production-ready edge deployments.

This concurrent capability unlocks several key benefits:

Parallel Data Paths for Maximum Throughput: Supporting both protocols concurrently allows sensor data ingestion and compute offloading to happen in parallel, rather than sequentially. This minimizes latency, prevents congestion, and ensures that AI accelerators are continuously fed with high-fidelity inputs.
Simplified System Architecture: Multi-protocol PHYs eliminate the need for separate interface components or complex switching logic. This streamlines board design, reduces BOM cost, and lowers power consumption, which are all critical for compact, thermally constrained edge deployments.
Greater Design Flexibility: Concurrent support enables tailored interconnect strategies. Designers can dedicate PCIe lanes to GPU or NPU accelerators, while Ethernet handles distributed sensor fusion and control traffic without tradeoffs or reconfiguration overhead.

By enabling true concurrency across PCIe and Ethernet, multi-protocol interconnects eliminate bottlenecks and unlock a new level of performance and efficiency. This architecture ensures synchronized, low-latency data flow from sensors to compute to acceleration—empowering autonomous systems to operate with the speed, precision, and resilience required at the edge.
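The benefit of parallel data paths can be seen in an idealized latency model: when two transfers run one after the other their times add, and when they run concurrently on independent links the slower one sets the pace. The timings below are hypothetical placeholders, not measurements, and the model assumes the two links do not contend for any shared resource.

```python
# Idealized per-frame latency: sequential vs. concurrent data paths.
# ASSUMPTION: the millisecond values are hypothetical, for illustration only.


def sequential_latency_ms(ingest_ms: float, offload_ms: float) -> float:
    """One shared interface: sensor ingest and compute offload take turns."""
    return ingest_ms + offload_ms


def concurrent_latency_ms(ingest_ms: float, offload_ms: float) -> float:
    """Two independent links: both transfers overlap, the slower one dominates."""
    return max(ingest_ms, offload_ms)


ingest_ms = 4.0   # hypothetical Ethernet sensor ingest time per frame
offload_ms = 6.0  # hypothetical PCIe transfer time to the accelerator per frame

seq = sequential_latency_ms(ingest_ms, offload_ms)
con = concurrent_latency_ms(ingest_ms, offload_ms)
print(f"sequential: {seq} ms, concurrent: {con} ms, saved: {seq - con} ms")
```

In this model the saving always equals the shorter of the two transfers, so the gain is largest when the two paths carry comparable loads.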

See It in Action

To explore the technology behind this multi-protocol flexibility, check out our demo video.

▶️ 1:40 – 2:30: Multi-Protocol PHY in Action
This segment shows PCIe 5.0 and 25G Ethernet links running concurrently on a single PHY, demonstrating its ability to maintain signal integrity and consistent performance across protocols. This is a foundational capability for edge AI systems, such as autonomous platforms.

This demonstration underscores the interconnect agility required for next-gen edge AI, where multi-protocol integration isn't just beneficial, it's mission-critical.

Learn more about Cadence concurrent multi-protocol solutions.

Running Optimized PyTorch Models on Cadence DSPs with ExecuTorch

pulin — Wed, 22 Oct 2025 15:00:00 GMT

By Vijay Pawar of Cadence and Matthias Cremon of Meta

Introduction

Deploying PyTorch models on embedded devices, especially audio DSPs, presents unique challenges. To address these, Cadence and Meta have collaborated to create a robust, high-performance framework for deploying machine learning models on Cadence's Tensilica HiFi DSP family. By leveraging ExecuTorch and applying both graph-level and operator-level optimizations, the teams have achieved speedups of at least an order of magnitude compared to standard out-of-the-box deployments.

ExecuTorch

ExecuTorch is a solution for training and inference on the edge, designed for portability, productivity, and performance. It supports a wide variety of platforms, from mobile phones to embedded systems and microcontrollers, and enables developers to use familiar PyTorch toolchains for model authoring, conversion, debugging, and deployment. ExecuTorch provides a lightweight runtime and leverages full hardware capabilities, including CPUs, GPUs, NPUs, and DSPs.

Tensilica HiFi DSP Family

The Cadence Tensilica HiFi DSP family for audio, voice, speech, and AI offers low-energy, high-performance, highly optimized DSP solutions that span the entire spectrum of audio and voice algorithms and end equipment. Audio/voice/speech (AVS) processing covers a wide range of performance- and power-consumption requirements. At one end of the spectrum is the ultra-low-power "wake-on-voice" processing used in many of today's smartphones and wearables. At the other end, building state-of-the-art voice-controlled digital assistants requires advanced audio digital signal processing capabilities to efficiently run neural network-based speech recognition. The Tensilica HiFi DSP family includes multiple products ranging from the HiFi 1s DSP at the low end to the highest performing HiFi 5s DSP.

Performance Highlights

Cadence and Meta have collaborated to improve the performance of various neural network (NN) operators on the Tensilica HiFi 4 DSP using the HiFi NN library. Demonstrated using seven open-source models from the ExecuTorch repository, the results show dramatic improvements over standard out-of-the-box deployments:

Model	Output Size	Base FPS @ 500MHz	Optimized FPS @ 500MHz
RNNT Predictor	[1, 10, 256]	146.5	2875.6
RNNT Encoder	[1, 25, 256]	5.9	82
RNNT Joiner	[1, 25, 10, 128]	9.9	261.1
Baby Llama (1 layer)	[1, 512]	0.5	6.5
Resnet-18	[1, 1000]	0.2	7.7
Resnet-50	[1, 1000]	0.1	3.6
MobileNetv2	[1, 1000]	0.7	12.4
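The speedup for each model can be derived directly from the table. The short script below does that using only the published FPS values; because those values are rounded (some to a single significant digit), the ratios are approximate.

```python
# Speedup per model, computed from the FPS table above (values are rounded).

results = {
    # model: (base FPS, optimized FPS), both at 500MHz
    "RNNT Predictor": (146.5, 2875.6),
    "RNNT Encoder": (5.9, 82.0),
    "RNNT Joiner": (9.9, 261.1),
    "Baby Llama (1 layer)": (0.5, 6.5),
    "Resnet-18": (0.2, 7.7),
    "Resnet-50": (0.1, 3.6),
    "MobileNetv2": (0.7, 12.4),
}

speedups = {name: optimized / base for name, (base, optimized) in results.items()}

for name, speedup in speedups.items():
    print(f"{name:22s} {speedup:5.1f}x")

min_speedup = min(speedups.values())
print(f"minimum speedup: {min_speedup:.1f}x")
```

Every model comes out above 10x, which is consistent with the "at least an order of magnitude" claim in the introduction.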

Operator Coverage and Data Types

ExecuTorch now supports a wide range of operators and data types that are optimized for Tensilica HiFi DSPs:

Compute Operators: Fully Connected, Matrix Mul, Convolution 1D/2D, Depthwise Convolution, Dilated Convolution
Non-linear Activations: Sigmoid, Tanh, Softmax, ReLU
Elementwise Operators: Add, Sub, Mul, Div, Quantize, Dequantize
Normalization Operators: Mean, Squared-diff, Reciprocal-square-root, Min, Max
Reorg Operators: Copy, Slice, Transpose, Concatenation
Activation Data Types: asymmetrically quantized signed int8, asymmetrically quantized unsigned int8, float32
Weight Data Types: symmetrically quantized int8, float32
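For readers unfamiliar with the terms, the sketch below shows the generic difference between asymmetric and symmetric int8 quantization. It is a textbook affine scheme written in plain Python, not the quantizer used in the Cadence ExecuTorch backend, and all function names are illustrative.

```python
# Generic int8 quantization: asymmetric (scale + zero point) vs. symmetric (scale only).
# ASSUMPTION: a textbook affine scheme, not the actual Cadence/ExecuTorch quantizer.


def asymmetric_params(x_min: float, x_max: float, q_min: int = -128, q_max: int = 127):
    """Scale and zero point mapping [x_min, x_max] onto the signed int8 range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # avoid a zero scale
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def symmetric_params(max_abs: float, q_max: int = 127):
    """Scale for a range centered on zero; the zero point is always 0."""
    return (max_abs / q_max) or 1.0, 0


def quantize(x: float, scale: float, zero_point: int, q_min: int = -128, q_max: int = 127) -> int:
    """Map a real value to an integer code, clamping to the int8 range."""
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer code back to an approximate real value."""
    return (q - zero_point) * scale


# Activations after a ReLU-like function are non-negative: asymmetric uses all 256 codes.
act_scale, act_zp = asymmetric_params(0.0, 6.0)
# Weights are roughly centered on zero: symmetric needs no zero point.
w_scale, w_zp = symmetric_params(0.5)

print(quantize(0.0, act_scale, act_zp), quantize(6.0, act_scale, act_zp))
print(quantize(-0.5, w_scale, w_zp), quantize(0.5, w_scale, w_zp))
```

Asymmetric quantization suits activations whose range is skewed, while symmetric quantization keeps the weight path cheaper because the zero-point term drops out of the multiply-accumulate.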

Current and Future Work

Opportunities remain to further enhance performance and expand support across different DSPs within the Tensilica HiFi family and beyond. Ongoing and future initiatives include:

Expanding DSP Support: Enabling additional Tensilica HiFi DSPs, such as the HiFi 1s DSP (ideal for always-on, energy-efficient applications, and small NN workloads) and the HiFi 5s DSP (NN-ready, offering approximately a 4X performance boost over the HiFi 4 DSP)
Quantization Improvements: Introducing 16-bit activation support in the quantizer
Latency Optimizations: Investigating fused layers (e.g., LSTM, GRU) for further latency reduction

Conclusion

With seven models now available as open source and running with optimized operators, Cadence and Meta have demonstrated that deploying PyTorch models on DSPs can be both efficient and scalable. Continued collaboration promises even greater performance and broader applicability for embedded machine learning deployments.

Learn more about Cadence Tensilica HiFi DSPs and ExecuTorch.

Powering Scale Up and Scale Out with 224G SerDes for UALink and Ultra Ethernet

Sheryl G — Wed, 08 Oct 2025 00:30:00 GMT

As AI workloads grow in scale and complexity, networks are challenged to keep up. According to McKinsey & Company, global demand for data center capacity is projected to nearly triple by 2030, with AI workloads expected to account for approximately 70% of that increase. GPT-5 reportedly will have 17 trillion parameters, which would represent a 10X increase over GPT-4. In addition, in its "2024 United States Data Center Energy Usage Report," Lawrence Berkeley National Laboratory projects 200/400Gb ports will grow from 6% to 26% of network energy share by 2028. To address these trends, AI infrastructure must be scalable to easily accommodate tomorrow's AI models, and scalable, efficient AI factories and hyperscale data centers begin with innovative IP.

At the recent ECOC 2025 conference in Copenhagen, Cadence showcased its key role in enabling the future of AI infrastructure with live silicon demonstrations of several essential IP technologies for emerging 800G and 1.6T networks. Powered by Cadence's 224G SerDes IP, Cadence's Ultra Accelerator Link (UALink 1.0) scale-up and Ultra Ethernet scale-out networking solutions deliver the performance, flexibility, and interoperability needed for next-generation AI factories and hyperscale data centers.

With the rise of new industry-standard protocols like UALink and Ultra Ethernet, seamless connectivity and interoperability between the various networking system components, including cables, connectors, and optical technologies, is crucial.

Live Demos: Real-World Performance

Real-world demonstrations at events like ECOC validate that high-speed IP, such as the Cadence 224G SerDes, performs as intended over both long and short reach with optimal signal integrity, low latency, and low power consumption. Cadence's live silicon demos of its 224G SerDes PHY IP highlighted its robustness, interoperability, and technical excellence in real-world scenarios featuring ecosystem vendors' cable and connector solutions.

At the OIF booth at ECOC, Cadence showcased CEI-224G-LR interoperability at a data rate of 212.5Gbps with a PRBS31Q data pattern, with total insertion loss exceeding 40dB (bump to bump) at Nyquist frequency. The long-reach 224G PHY demo setup featured a Multilane transmitter to Cadence receiver test chip silicon and cabled backplane channels using a Samtec SiFly HD connector with a 1m cable and a TE near-chip connector with backplane connectors and TE cables. The successful demo achieved a pre-FEC bit-error rate (BER) of 5E-08, highlighting the solution's low raw error rate under demanding conditions.

At the Cadence booth, Cadence demonstrated serial link performance with a 45dB channel insertion loss (bump to bump) at a 212.5Gbps data rate and PRBS31Q data pattern, with a Cadence transmitter to receiver link setup. The achieved pre-FEC BER was 4E-08, further validating the robustness of the 224G PHY link setup. In a full solution with FEC, the system operates nearly error-free.
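The numbers in the two demos can be related with a little arithmetic. The sketch below derives the symbol rate and Nyquist frequency from the 212.5Gbps line rate, assuming PAM4 signaling (2 bits per symbol, which the PRBS31Q pattern implies), and converts each pre-FEC BER into an expected raw error count per second.

```python
# Relating line rate, Nyquist frequency, and pre-FEC BER for the 224G demos.
# ASSUMPTION: PAM4 signaling (2 bits per symbol), as implied by the PRBS31Q pattern.

LINE_RATE_BPS = 212.5e9   # data rate used in both demos
BITS_PER_SYMBOL = 2       # assumed PAM4

symbol_rate_baud = LINE_RATE_BPS / BITS_PER_SYMBOL  # symbols per second
nyquist_hz = symbol_rate_baud / 2                   # where insertion loss is quoted


def raw_errors_per_second(ber: float, line_rate_bps: float = LINE_RATE_BPS) -> float:
    """Expected bit errors per second before forward error correction."""
    return ber * line_rate_bps


print(f"symbol rate: {symbol_rate_baud / 1e9:.2f} GBd, Nyquist: {nyquist_hz / 1e9:.3f} GHz")
for label, ber in (("OIF booth", 5e-8), ("Cadence booth", 4e-8)):
    print(f"{label}: about {raw_errors_per_second(ber):,.0f} raw bit errors per second")
```

Even at a BER of 4E-08, a link this fast sees thousands of raw bit errors every second, which is why the pre-FEC figure is the one quoted and why FEC is what brings the full solution to nearly error-free operation.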

Technical Takeaways

Robustness: Both demos prove Cadence's 224G PHY can maintain high data integrity and low error rates over extremely challenging channels.
Interoperability: The use of industry-standard connectors and cables ensures seamless integration with other ecosystem solutions, which is critical during a rapid AI infrastructure build out.
Scalability: The technology supports full-duplex operation from 1.25Gbps to 225Gbps, enabling future-proof deployments for 1.6T, 800G, 400G, and 200G Ethernet networks.
Design Flexibility: The beachfront-optimized floorplan allows flexible SoC edge placement, and the PHY supports chip-to-module (VSR), chip-to-chip (MR), and copper/backplane (LR) interconnects.
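As a quick illustration of the scalability point, the snippet below shows how many SerDes lanes each Ethernet speed would need. It assumes a nominal 200G of payload per 224G-class lane; the per-lane figure and the helper name are assumptions for this sketch, not a statement about any specific Cadence configuration.

```python
# Lanes needed per Ethernet port speed.
# ASSUMPTION: each 224G-class SerDes lane carries a nominal 200G of Ethernet payload.

NOMINAL_LANE_GBPS = 200  # assumed payload rate per lane

PORT_SPEEDS_GBPS = {"1.6T": 1600, "800G": 800, "400G": 400, "200G": 200}


def lanes_required(port_gbps: int, lane_gbps: int = NOMINAL_LANE_GBPS) -> int:
    """Number of lanes for a port, rounding up to a whole lane."""
    return -(-port_gbps // lane_gbps)  # ceiling division


lane_counts = {name: lanes_required(speed) for name, speed in PORT_SPEEDS_GBPS.items()}
for name, lanes in lane_counts.items():
    print(f"{name}: {lanes} lane(s)")
```

The same PHY therefore scales from a single-lane 200G port to an eight-lane 1.6T port, while lower configured lane rates cover legacy speeds.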

Scaling Up and Scaling Out: UALink and Ultra Ethernet

Cadence's multiprotocol PHYs support both 112G and 224G operation over long and short reach and across process nodes from 7nm down to 2nm+, ensuring future-proof scalability. Optimized and configurable controller options allow tailored solutions for specific application needs, maximizing efficiency and interoperability. Cadence controllers support best-in-class UALink 1.0 and Ultra Ethernet, as well as standard Ethernet, CPRI, and JESD protocols. Cadence's Ultra Ethernet controller supports the most versatile protocol features, ensuring compatibility with evolving IEEE and OIF standards.

Cadence: A Trusted IP Partner for AI/HPC

Cadence's commitment to enabling ecosystem interoperability and leading the future of high-speed connectivity with UALink and Ultra Ethernet innovation was evident at ECOC 2025. These protocols are not only meeting today's connectivity challenges, they are also paving the way for the next generation of AI, high-performance computing (HPC), and hyperscale data center networks.

Modern AI infrastructure demands more than legacy solutions. When scaling AI to meet today's and tomorrow's power, performance, and area demands, you need a trusted partner with a broad portfolio of solutions optimized for the AI/HPC market. Cadence has the expertise and proven IP to help you optimize throughput, balance power and performance, and solve your networking, memory, and chiplet connectivity challenges.

Learn more about how Cadence's 224G SerDes, UALink, and Ultra Ethernet solutions are setting a new benchmark for scaling up and scaling out next-generation AI factories by visiting the 224G-LR SerDes PHY landing page.

Accelerate Automotive System Design with Cadence AI-Driven DSPs

Vinod Khera — Tue, 07 Oct 2025 05:25:00 GMT

The automotive industry is on the brink of a transformative era powered by intelligence, safety, and seamless user experiences. Integrating digital signal processing and artificial intelligence (AI) transforms vehicle intelligence. It enhances user ...(read more)

A Hybrid Subsystem Architecture to Elevate Edge AI

SriramK — Thu, 02 Oct 2025 22:00:00 GMT

The world of artificial intelligence is moving beyond the cloud and into our everyday devices, from smart sensors to robotics and AR/VR headsets. One of the key components that enables this shift is a neural processing unit (NPU), also known as an AI accelerator, which is specialized hardware designed to execute AI models. Optimized for neural network, deep learning, and machine learning tasks, NPUs handle the fundamental, math-intensive operations that power these workloads while CPUs and GPUs handle a wider variety of tasks.

The NPU architecture evolved over time to accommodate the changing AI landscape. This evolution, driven by new and evolving use cases, has led to distinct NPU design philosophies, which can be broadly categorized into three types as shown in Table 1 below.

Table 1: NPU Performance and Application Tiers

Type	Key Architectural Features	Models Supported
Gen 1	Basic matrix multiplication, fixed point processing, limited programmability	Convolutional neural networks (CNNs)
Gen 2	Matrix multiplication with some level of programmability to handle some complex activation functions	CNNs, RNNs, and some transformers
Gen 3	Massive parallelism, optimized for FP8/FP4/INT8, inbuilt programmable core to handle more complex activation functions	Large language models (LLMs), large vision models (LVMs)

A deep learning workload comprises a wide range of operations, including data pre-processing, activation functions, and other data transformations. Despite their specialization, NPUs are not a silver bullet for the entire AI pipeline.

If we focus specifically on the Gen 1 NPUs, these are the embodiments of the AI-at-the-edge philosophy and are highly optimized for one thing: massive matrix multiplication, which forms the core of a CNN-based model. When these NPUs encounter a layer they don't support, they have no choice but to stop, hand the data over to the main host CPU, wait for it to finish, and then retrieve the data. This creates three major architectural issues in an AI subsystem:

CPU Bottleneck: A general-purpose CPU is architecturally inefficient at performing the parallel data processing required for these AI layers. This offloading process becomes the slowest part of the entire AI inference pipeline.
Data Traffic Jam: Constantly moving large tensors between the NPU's memory, the CPU's caches, and system DRAM consumes significant power and time, adding latency and negating the NPU's efficiency benefits.
Increased System Complexity: Software developers must manage this complex, fragmented workflow. The AI model is no longer running on a single accelerator but is partitioned across multiple processors, making performance unpredictable and debugging difficult.

These issues have become even more pronounced with the rise of complex transformer models. These models introduce operations such as the Gaussian error linear unit (GELU), layer normalization, Softmax, and complex element-wise operations that a Gen 1 NPU is forced to offload, creating new system bottlenecks.
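One way to see why transformers hurt more than CNNs is to count how often execution has to leave the NPU. The sketch below walks a layer list and counts each maximal run of unsupported operators as one NPU-to-host round trip. The supported-operator set and both layer lists are hypothetical, chosen only to illustrate the effect.

```python
# Counting NPU -> host CPU -> NPU round trips for a layer sequence.
# ASSUMPTION: the supported-op set and the layer lists are hypothetical examples.

GEN1_SUPPORTED = {"conv2d", "matmul", "relu"}


def count_host_handoffs(layers: list[str], supported: set[str]) -> int:
    """Each maximal run of consecutive unsupported layers costs one round trip."""
    handoffs = 0
    on_host = False
    for op in layers:
        if op in supported:
            on_host = False      # back on the NPU
        elif not on_host:
            handoffs += 1        # a new excursion to the host starts here
            on_host = True
    return handoffs


cnn = ["conv2d", "relu", "conv2d", "relu", "matmul"]
transformer_block = ["layernorm", "matmul", "softmax", "matmul",
                     "layernorm", "matmul", "gelu", "matmul"]

cnn_handoffs = count_host_handoffs(cnn, GEN1_SUPPORTED)
block_handoffs = count_host_handoffs(transformer_block, GEN1_SUPPORTED)
print(f"CNN: {cnn_handoffs} handoffs, transformer block: {block_handoffs} handoffs")
```

In this toy example the CNN never leaves the NPU, while a single transformer block forces four excursions, and a real model stacks many such blocks.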

A Hybrid Architecture: NPU + AI Co-Processor (AICP)

These limitations warrant a new approach to designing the AI subsystem: a hybrid architecture. Pairing the NPU with a companion such as the Cadence Tensilica NeuroEdge 130 AI Co-Processor, which is designed specifically to handle these offload tasks, can create a more powerful and efficient AI subsystem, simplify the design, and accelerate time to market.

The end-to-end inference flow for this kind of hybrid AI subsystem is a multi-step process that strategically leverages the strengths of both the NPU and the NeuroEdge 130 AICP. The steps below walk through the execution of ViT, a vision transformer; the same flow could apply to any model, including LLMs and VLMs:

Step 1: Offloaded Pre-Processing: The NeuroEdge 130 AICP performs the initial, data-intensive, and non-MAC-heavy pre-processing tasks, including dividing the image into patches, converting them into tokens, and applying positional encodings.

Step 2: NPU-Centric Compute: Once the data is prepared, the NeuroEdge 130 AICP, acting as the control processor, transfers the data to the NPU, where the MAC unit computes the math-intensive tasks. This streamlined data flow ensures the NPU's expensive parallel units are kept at near-constant, high-level utilization.

Step 3: Offloaded Post-Processing: After the core computational layers are completed on the NPU, the aggregated output is returned to the NeuroEdge 130 AICP. The AICP then handles the final classification/post-processing tasks, including the pooling and the final Softmax activation, which it is specifically optimized to perform.

Step 4: Output Generation: The final classification probabilities are produced by the NeuroEdge 130 AICP, completing the inference cycle.

Table 2: Layer-by-Layer Execution Mapping

ViT Layer	Computational Characteristics	Optimal Execution Location	Rationale
Input Embedding & Patching	Data slicing, reformatting, non-MAC ops	Offloaded to AICP	Data pre-processing not suited for parallel NPU cores; requires a flexible, programmable processor
Positional Encoding	Vector addition, low compute	Offloaded to AICP	Low-intensity data manipulation; would idle the NPU's parallel units
Self-Attention Mechanism	High MAC operations, large matrix multiplications	Executed on NPU	Core parallel workload; canonical task for the NPU's tensor acceleration unit
Multi-Layer Perceptron (MLP) Blocks	Extremely high MAC ops, accounts for >50% of total MACs	Executed on NPU	The primary computational bottleneck; the reason an NPU is in the system
Final Layers (Pooling, Softmax)	Low MAC ops (pooling), specialized function (Softmax)	Offloaded to AICP	Non-MAC-intensive and specialized mathematical functions are handled more efficiently by a flexible co-processor
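The mapping in Table 2 can be sketched as a simple partitioning pass: tag each layer by whether it is MAC-heavy, send MAC-heavy layers to the NPU and everything else to the AICP, then merge neighbors on the same engine into segments. The class, function names, and single-flag rule are illustrative simplifications, not the NeuroEdge SDK's actual partitioner.

```python
# Partitioning a ViT pipeline between NPU and AICP, following Table 2.
# ASSUMPTION: a single "mac_heavy" flag per layer; names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    mac_heavy: bool


VIT_PIPELINE = [
    Layer("Input Embedding & Patching", mac_heavy=False),
    Layer("Positional Encoding", mac_heavy=False),
    Layer("Self-Attention Mechanism", mac_heavy=True),
    Layer("MLP Blocks", mac_heavy=True),
    Layer("Final Layers (Pooling, Softmax)", mac_heavy=False),
]


def assign_engine(layer: Layer) -> str:
    """MAC-heavy layers run on the NPU; everything else is offloaded to the AICP."""
    return "NPU" if layer.mac_heavy else "AICP"


def plan(pipeline: list[Layer]) -> list[tuple[str, list[str]]]:
    """Group consecutive layers that share an engine into execution segments."""
    segments: list[tuple[str, list[str]]] = []
    for layer in pipeline:
        engine = assign_engine(layer)
        if segments and segments[-1][0] == engine:
            segments[-1][1].append(layer.name)
        else:
            segments.append((engine, [layer.name]))
    return segments


segments = plan(VIT_PIPELINE)
for engine, names in segments:
    print(f"{engine:4s} <- {', '.join(names)}")
```

The plan collapses to three segments (AICP, NPU, AICP), matching Steps 1 through 3 above, with only two handoffs between engines and none involving the host CPU.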

The Benefits of a Hybrid Architecture

Enabling Advanced Features: A more capable AI subsystem allows for the deployment of cutting-edge features, such as on-device generative AI, advanced sensor fusion, and multi-modal models, that would be impossible on a Gen 1 NPU, creating significant product differentiation.
Lower Power and Smaller Area: By using a purpose-built co-processor instead of an inefficient general-purpose CPU, designs can achieve a significant reduction in dynamic power and optimal silicon area, lowering manufacturing costs and extending battery life.
Faster Time to Market: The combination of a mature, extensible hardware architecture and a unified software development kit (SDK) reduces development complexity and risk, allowing teams to bring innovative AI-powered products to market faster.

In the next post, we'll look at how this hybrid architecture applies to systems with Gen 2 and Gen 3 NPUs. In the meantime, learn more about the Cadence NeuroEdge 130 AI Co-Processor.