Channel: EDN

Understanding and combating silent data corruption


The surge in memory-hungry artificial intelligence (AI) and machine learning (ML) applications has ushered in a new wave of accelerated computing demand. As new design parameters ramp up processing needs, more resources are being packed into single units, resulting in complex processes, overburdened systems, and higher chances of anomalies. In addition, the demands of these complex chips present challenges in meeting reliability, availability, and serviceability (RAS) requirements.

One major, yet often overlooked, RAS concern and root cause of computing errors is silent data corruption (SDC). Unlike software-related issues, which typically trigger alerts and fail-safe mechanisms, SDC issues in hardware can go undetected. For instance, a compromised CPU may miscalculate data, leading to corrupt datasets that can take months to resolve and cost organizations significantly more to fix than errors that are caught when they occur.

Figure 1 A compromised CPU may lead to corrupt datasets that can take months to resolve. Source: Synopsys

Meta Research highlights that these errors are systemic across generations of CPUs, stressing the importance of robust detection mechanisms and fault-tolerant hardware and software architectures to mitigate the impact of silent errors in large-scale data centers. Anything above zero errors is an issue given the size, speed, and reach of hyperscalers. Even a single error can result in a significant issue.
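
As an illustration of what a software-level detection mechanism can look like, the sketch below uses redundant execution: the same computation is run on more than one executor and the results are compared. This technique is not taken from the article's sources; it is one common approach, shown here under stated assumptions. The names `run_redundantly`, `healthy_core`, and `faulty_core` are hypothetical, and the "faulty core" is simulated by flipping one bit of a correct result.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class SilentCorruptionSuspected(Exception):
    """Raised when redundant executions of the same work disagree."""


def run_redundantly(executors: Sequence[Callable[[], T]]) -> T:
    """Run the same computation on every executor and compare the results.

    Each executor stands in for one core or machine (hypothetical here).
    Returns the agreed result, or raises if any replica disagrees.
    Results must be hashable so they can be counted.
    """
    results = [execute() for execute in executors]
    counts = Counter(results)
    if len(counts) > 1:
        raise SilentCorruptionSuspected(f"replicas disagree: {dict(counts)}")
    return results[0]


def healthy_core() -> int:
    # Reference computation: sum of squares of 0..999.
    return sum(i * i for i in range(1000))


def faulty_core() -> int:
    # Simulated defect (assumption): one bit of the correct answer is
    # flipped, and no exception or alert is raised by the "hardware".
    return healthy_core() ^ (1 << 3)


if __name__ == "__main__":
    print(run_redundantly([healthy_core, healthy_core]))
    try:
        run_redundantly([healthy_core, faulty_core])
    except SilentCorruptionSuspected as err:
        print("detected:", err)
```

The trade-off is cost: every replica repeats the work, so in practice such checks are usually applied selectively rather than to every computation.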

This article will explore the concept of SDC, why it continues to be a pervasive issue for designers, and what the industry can do to prevent it from impacting future chip designs.

The multifaceted hurdle

Industry leaders are often hesitant to invest in resources to address SDC because they don’t fully understand the problem. This reluctance can lead to higher costs in the long run, as organizations may face significant operational setbacks due to undetected SDC errors. Debugging these issues is costly and not scalable, often resulting in delayed product releases and disrupted production cycles.

To put this into perspective, today’s machine learning algorithms run on tens of thousands of chips, and if even one in 1,000 chips is defective, the resulting data corruption can compromise entire datasets, leading to massive expenditures for repairs. While cost is a large factor, the hesitation to invest in SDC prevention and fixes is not the only challenge. The complexity and scale of the problem also make it difficult for decision makers to take proactive measures.
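
The arithmetic behind the one-in-1,000 figure can be made concrete with a short calculation. This is a minimal sketch that assumes a fleet of exactly 10,000 chips (the article says only "tens of thousands") and that defects occur independently from chip to chip; both are illustrative assumptions, and the function names are hypothetical.

```python
def expected_defective(fleet_size: int, defect_rate: float) -> float:
    """Expected number of defective chips in the fleet."""
    return fleet_size * defect_rate


def prob_at_least_one_defective(fleet_size: int, defect_rate: float) -> float:
    """Probability that the fleet contains at least one defective chip,
    assuming each chip is defective independently of the others."""
    return 1.0 - (1.0 - defect_rate) ** fleet_size


if __name__ == "__main__":
    fleet_size = 10_000      # assumed fleet size, for illustration only
    defect_rate = 1 / 1_000  # the one-in-1,000 rate used in the text

    print("expected defective chips:",
          expected_defective(fleet_size, defect_rate))
    print("P(at least one defective):",
          prob_at_least_one_defective(fleet_size, defect_rate))
```

Under these assumptions, a job spread across the whole fleet should expect about 10 defective chips, and the chance of having none at all is well under 0.01%.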

Figure 2 Defect screening rate is shown using DCDIAG test to assess a processor. Source: Intel

Chips have long production cycles, and addressing SDC can take several years before fixes are reflected in new hardware. Beyond the lengthy product lifecycles, it’s also difficult to measure the scale of SDC errors, presenting a big challenge for chipmakers. Communicating the magnitude and urgency of an issue to decision makers without solid evidence or data is a daunting task.

How to combat silent data corruption

When a customer receives a faulty chip, the chip is typically sent back to the manufacturer for replacement. However, this process merely remedies a symptom of the larger SDC issue. To shift from symptom mitigation to solving the underlying problem, here are some avenues the industry should consider:

  • Research investments: The industry is aware of SDC but lacks a comprehensive understanding of it. We need researchers and engineers to focus on SDC despite how costly the investment will be. This involves generating and sharing extensive data for analysis, identifying anomalies, and diagnosing potential issues like time delays or data leaks. All things considered, enhanced research will help clarify and manage SDC effectively.
  • Incentive models: Establishing stronger incentives with more data for manufacturers to address SDC will help tackle the growing problem. Like the cybersecurity industry, creating industry-wide standards for what constitutes a safe and secure product could help mitigate SDC risks.
  • Sensor implementation: Implementing sensors in chips that alert chip designers to a potential problem is another solution to consider, similar to automotive sensors that alert the owner when tire pressure is low. A faulty chip can go one to two years without being detected, but sensors will be able to detect a problem before it’s too late.
  • AI and ML toolbox: AI algorithms, an option that is still in the early stages, could flag conditions indicative of SDC, though this requires substantial data for training. Effective implementation would necessitate careful curation of datasets and intentional design of AI models to ensure accurate detection.
  • Silicon lifecycle management (SLM) strategy: SLM is a process that allows chip designers to monitor, analyze, and optimize their semiconductor devices throughout their life. Executing this strategy makes it easier for designers to track and gain actionable insights on their devices’ RAS in real time and, ultimately, to detect SDC before it’s too late.
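
To make the monitoring idea in the list above more tangible, here is a minimal sketch of fleet-level screening. It assumes each chip periodically runs self-checks with known answers and reports how many failed; the `ChipTelemetry` record, its field names, and the `flag_suspect_chips` function are all hypothetical and do not describe any particular SLM product. The default threshold of zero reflects the point made earlier that anything above zero errors is an issue.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ChipTelemetry:
    """Hypothetical per-chip record of self-check results."""
    chip_id: str
    checks_run: int
    checks_failed: int


def flag_suspect_chips(
    fleet: Iterable[ChipTelemetry],
    max_failure_rate: float = 0.0,
    min_checks: int = 100,
) -> List[str]:
    """Return IDs of chips whose self-check failure rate exceeds the threshold.

    Chips with fewer than `min_checks` results are skipped, since there is
    not yet enough data to judge them.
    """
    suspects = []
    for chip in fleet:
        if chip.checks_run < min_checks:
            continue
        failure_rate = chip.checks_failed / chip.checks_run
        if failure_rate > max_failure_rate:
            suspects.append(chip.chip_id)
    return suspects


if __name__ == "__main__":
    # Made-up telemetry, for illustration only.
    fleet = [
        ChipTelemetry("chip-a", checks_run=5000, checks_failed=0),
        ChipTelemetry("chip-b", checks_run=5000, checks_failed=3),
        ChipTelemetry("chip-c", checks_run=20, checks_failed=1),
    ]
    print(flag_suspect_chips(fleet))
```

A real deployment would also need to decide what happens to a flagged chip (quarantine, deeper diagnostics, or return to the manufacturer), which this sketch leaves out.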

Partly due to its stealthy nature, SDC has become a growing problem as the scale of computing has increased over time, and the first step to solving a problem is recognizing that a problem exists.

Now is the time for action, and we need stakeholders from all areas—academics, researchers, chip designers, manufacturers, software and hardware engineers, vendors, government and others—to collaborate and take a closer look at underlying processes. Together, we can develop solutions at every step of the chip lifecycle that effectively mitigate the lasting impacts of SDC.

Jyotika Athavale is the director for engineering architecture at Synopsys, leading quality, reliability and safety research, pathfinding, and architectures for data centers and automotive applications.

Randy Fish is the director of product line management for the Silicon Lifecycle Management (SLM) family at Synopsys.


The post Understanding and combating silent data corruption appeared first on EDN.

