Algorithms for Design-Automation - Mastering Nanoelectronic Systems
Logic Diagnosis With Improved Resolution
ALEJANDRO COOKINFOTECH
SUMMER SEMESTER 2007
SUPERVISOR:
STEFAN HOLST
July 17, 2007
Table of contents
1 Introduction
2 Diagnosis in General
2.1 Logic Diagnosis
2.1.1 Fault Tuples
2.2 Volume and Precision Diagnosis
2.3 Cause-Effect and Effect-Cause Diagnosis
2.4 Fault Diagnosis Objectives
2.4.1 Fault Localization
2.4.2 Fault Identification
3 DIAGNOSIX: A diagnosis methodology
3.1 Overview
3.2 Stage 1: Defect Localization
3.2.1 Path-tracing
3.2.2 Per-test diagnosis
3.2.3 Passing pattern validation
3.3 Stage 2: Behavior Identification
3.3.1 Cover Forest analysis
3.3.2 Neighborhood function extraction
3.4 Stage 3: Behavior validation
4 Results
4.1 Applicability
5 Conclusions
References
1 Introduction
The steady advances in semiconductor technology have posed strong challenges to the manufacturing
process of VLSI circuits. The higher level of integration in each process generation has led to an
increasing level of complexity in today's designs, and thus more resources and effort have recently been
devoted to test and diagnosis.
The goal of testing is to determine the presence of defects in a single chip; diagnosis, on the other hand,
focuses on their location and identification. The diagnosis information can later be used to improve the
outcome of the manufacturing process; this is especially important during the first spins of a new design
in a new process technology, when the yield is usually very low, and more detailed defect information
could provide valuable insight into the cause of the problems.
Yield learning is defined as the collection and application of process knowledge to improve yield by
identifying and locating systematic and random manufacturing events [8]. The complexity of today's
designs is hard to model and, therefore, it is very difficult to obtain accurate defect information by
simulation; this makes yield learning extremely important during the first silicon of new designs,
because it characterizes defects at the physical level. With this information in mind, the designers are
able to modify the circuit in order to decrease the number of systematic defects, and hopefully obtain a
steep yield ramp in the next process cycle. Rapid yield learning is key to the success of the electronic
industry, where time-to-volume makes a difference in market share and lost revenues.
Diagnosis can be defined as the process of identifying and locating faults in a chip. It uses different
levels of abstraction to discover potential physical defects. It is a software-based approach and, under
certain assumptions, is essentially independent of the physical complexity of the device under
diagnosis (DUD); it can, therefore, be used to analyze large amounts of data [4]. Fault diagnosis is
often the first step towards identifying and locating defects, and its goal is to enable physical failure
analysis (PFA), that is to say, a closer, more detailed look at the physical cause of the failure.
As the size and the complexity of the chip increase, the diagnosis approaches using simple fault models
are becoming more limited. In this paper, DIAGNOSIX, a new methodology for fault diagnosis, is
introduced. DIAGNOSIX is not heavily constrained by the assumptions of traditional fault models;
instead of checking the occurrence of a static fault model to explain the defect behavior, DIAGNOSIX
extracts a consistent fault model from the given set of test patterns and the physical neighborhood
surrounding suspect lines. Although DIAGNOSIX does rely on some key assumptions regarding
defects, they are considered to be weaker than those used in traditional approaches.
In the next pages, fault diagnosis is presented in general to provide a context for the following
discussion of DIAGNOSIX. Then, a few definitions relevant to the new methodology are explained and
the underlying defect assumptions are explicitly formulated. In the following pages, the complete
methodology, its evaluation and applicability are covered in detail.
2 Diagnosis in general
2.1 Logic Diagnosis
It is possible to distinguish between several strategies in the context of VLSI diagnosis. Logic
diagnosis refers to the diagnosis of random logic and it makes use of the knowledge and
technology available in the logic testing domain. As in testing, if the structure of
the target circuit is regular and known to fail in specific ways, specialized techniques may yield
better results; for example, the diagnosis of on-chip memories calls for different algorithms and
fault models from those used in logic diagnosis. Scan chain diagnosis also has particular needs and
is addressed with a different set of tools.
2.1.1 Fault tuples
The single stuck-at line (SSL) fault model has so far been the most common fault model for
testing random logic in VLSI circuits; however, according to recent studies [2], this abstraction
does not accurately match the real physical defect behavior, and thus makes diagnosis tasks
difficult. Several more accurate fault models have been explored to overcome this limitation,
and their number is likely to keep increasing in coming years due to more complex designs and
the increasing parametric variations in the manufacturing process of future circuit geometries.
In this context, a new fault modeling mechanism has been proposed to represent arbitrary
misbehaviors by making use of fault tuples. Fault tuples can be combined in order to describe
arbitrarily complex faults, and they provide a generalization to express known fault models in a
single notation. This way, it is possible to perform simultaneous analysis of many defect
mechanisms using the same methodology.
A fault tuple is a 3-tuple represented as ⟨l, v, t⟩, where l denotes a given signal line, v is a value,
and t represents a clock condition. The possible sets of values for these parameters are shown
next:

l ∈ {signal lines}    v ∈ {0, 1, D, D̄}    t ∈ {i, iN, ī, īN}
A fault tuple is said to be satisfied if the line l is controlled to the value v within a time frame
represented by t. Additionally, the misbehavior (if any) should be propagated to an observable
output.
The parameter t specifies a clock cycle condition for controlling a line and its meaning
depends on the value to be assigned; for example, if v ∈ {0, 1} ∧ t = i, l must have been
controlled by the ith clock cycle. On the other hand, if v ∈ {D, D̄} ∧ t = i, the discrepancy
(D or D̄) must manifest itself by the ith clock cycle. Likewise, the condition t = iN denotes a
constraint within N clock cycles after the reference ith clock cycle. The rest of the
conditions can be explained in a similar manner.
A product of tuples of the form ⟨l1, v1, t1⟩ · ⟨l2, v2, t2⟩ is satisfied if and only if all tuples in
the expression are satisfied. Several products can be joined together into a macrofault in order to
model arbitrary misbehaviors.
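As an illustration, the satisfaction of fault tuples, products, and macrofaults can be sketched in a few lines of Python. The trace format and line names are hypothetical, and the clock condition is simplified to a single cycle index; this is a sketch of the notation above, not the actual fault-tuple engine of [4].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultTuple:
    """A fault tuple <l, v, t>: line l must hold value v under clock condition t.
    For simplicity, t is reduced here to a single clock-cycle index."""
    line: str
    value: str   # one of "0", "1", "D", "Dbar"
    cycle: int

def tuple_satisfied(ft, trace):
    """True if line ft.line holds ft.value in cycle ft.cycle.
    `trace` maps cycle -> {line: value} (a hypothetical simulation record)."""
    return trace.get(ft.cycle, {}).get(ft.line) == ft.value

def product_satisfied(product, trace):
    """A product of tuples is satisfied iff every tuple in it is satisfied."""
    return all(tuple_satisfied(ft, trace) for ft in product)

def macrofault_satisfied(macrofault, trace):
    """A macrofault joins several products; it is satisfied if any product is."""
    return any(product_satisfied(p, trace) for p in macrofault)

# Example: two products describing an arbitrary misbehavior on lines s9/s12
trace = {1: {"s9": "D", "s12": "1"}, 2: {"s9": "0", "s12": "D"}}
macrofault = [
    [FaultTuple("s9", "D", 1), FaultTuple("s12", "1", 1)],
    [FaultTuple("s12", "D", 2), FaultTuple("s9", "1", 2)],
]
print(macrofault_satisfied(macrofault, trace))  # first product is satisfied
```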
2.2 Volume and Precision Diagnosis
Fault diagnosis may pursue two different purposes in the yield learning process. In volume
diagnosis, a large number of failing chips are analyzed to discover systematic defects. If a given
number of chips shows the same faulty behavior, it is likely that the same physical defect is present
in all of them, and the manufacturing process can be improved to reduce its occurrence.
Volume diagnosis is constrained by the large amounts of data it must handle and the time it takes to
analyze the behavior of every chip.
Precision diagnosis, on the other hand, is performed on a small quantity of failing chips, like the
first silicon or a sample chip from a group of chips with the same systematic fault. Its objective is
to locate faults with sufficiently high resolution, and provide detailed information on the nature of
the physical defect. In precision diagnosis, a great deal of effort is spent to obtain enough accuracy to
guide PFA.
2.3 Cause-Effect and Effect-Cause Diagnosis
There exist two classic approaches to fault diagnosis: cause-effect diagnosis relies on a fault model
to predict the output value of a faulty chip. In this approach, a fault dictionary is constructed by
pre-calculating the set of possible responses to all input patterns for all modeled faults; in the
next step, the output responses from the DUD are compared and matched against the fault
dictionary, so that potential faults can be identified.
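A minimal sketch of this dictionary-based matching, using a hypothetical two-gate circuit (n = a AND b, out = n OR c) and the single stuck-at fault model, could look as follows; the circuit and fault list are illustrative only:

```python
from itertools import product as cartesian

def simulate(inputs, fault=None):
    """Single-stuck-at simulation of a tiny hypothetical circuit:
    n = a AND b, out = n OR c. `fault` is (line, stuck_value) or None."""
    a, b, c = inputs
    def v(line, value):
        return fault[1] if fault and fault[0] == line else value
    a, b, c = v("a", a), v("b", b), v("c", c)
    n = v("n", a & b)
    return v("out", n | c)

# Pre-calculate the dictionary: fault -> response to every input pattern
patterns = list(cartesian([0, 1], repeat=3))
faults = [(l, s) for l in ("a", "b", "c", "n", "out") for s in (0, 1)]
dictionary = {f: tuple(simulate(p, f) for p in patterns) for f in faults}

# Diagnosis step: match the observed DUD responses against the dictionary
observed = tuple(simulate(p, ("n", 1)) for p in patterns)  # DUD: n stuck-at-1
candidates = [f for f, resp in dictionary.items() if resp == observed]
print(candidates)
```

Note that three faults (c/1, n/1 and out/1) produce identical responses here; such equivalence classes are exactly what limits the resolution of dictionary-based diagnosis.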
The main disadvantage of this technique is that it depends on fault models, and these might not
accurately describe the real defect mechanisms in complex CMOS technologies. For this reason,
finding a suitable fault model becomes one of the main challenges in cause-effect analysis.
A second disadvantage of cause-effect algorithms is that they assume a single fault in the DUD,
because the huge number of fault sites in a multiple fault model makes the generation of the fault
dictionary infeasible. Furthermore, even when all individual faults can be identified, multiple faults
are not guaranteed to be detected in the presence of fault masking [10]. The single fault assumption is
of course broken when two or more faults are present in the chip, or when a defect manifests itself
as multiple faults in the chosen fault model.
In effect-cause diagnosis, the chip outputs are observed and deductive reasoning is performed,
starting from primary outputs to primary inputs, in order to identify potential faults consistent with
the output responses. This technique does not make use of a fault dictionary, since the fault
information is processed directly when the defect symptom is encountered during testing.
Moreover, this approach is better suited to analyze the occurrence of multiple faults and their
masking relations, because these problems can be handled within the logic reasoning used to detect and
locate faults.
The main drawback of the algorithms in this approach stems from the computational effort required
to infer potential faults that explain the faulty behavior.
2.4 Fault Diagnosis Objectives
2.4.1 Fault localization
Fault localization techniques usually employ simple fault models and matching algorithms to
identify a faulty signal line. The location of this signal is used as a first approximation to PFA.
Diagnosis methods for fault localization, like that introduced in [10], use a cause-effect
approach and heavily rely on fault simulation to build a fault dictionary.
Fault simulation with diagnosis in mind requires a different approach from that used in testing
to calculate fault coverage. Traditionally, in order to reduce simulation time, a fault is
immediately removed from the simulation set after it has been detected by one input pattern;
however, this measure limits the resolution of the fault dictionary. The reason for this is that
many different faults can produce the same response to the same input pattern, and the
responses to other patterns need to be known to narrow down the number of potential faults.
A diagnosis simulator needs to simulate a fault until it can uniquely identify it based on the
output responses this fault produces; that is to say, a fault can be dropped only when the
simulator finds a difference between the test failure output and the expected failure output that
can be caused by this fault.
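This diagnosis-oriented dropping rule can be sketched as follows; the fault names and per-pattern responses are hypothetical placeholders for precomputed simulation results:

```python
def diagnostic_fault_dropping(predicted, observed):
    """Diagnosis-oriented fault dropping: unlike coverage-oriented simulation,
    a fault is NOT dropped on first detection; it is dropped only when its
    predicted response differs from the observed DUD response on some pattern.
    `predicted` maps fault -> per-pattern responses; `observed` is the DUD's."""
    surviving = set(predicted)
    for i, obs in enumerate(observed):
        surviving = {f for f in surviving if predicted[f][i] == obs}
    return surviving

# Hypothetical per-pattern responses for three candidate faults
predicted = {
    "f1": [1, 0, 1],   # disagrees with the DUD on the second pattern: dropped
    "f2": [1, 1, 0],
    "f3": [1, 1, 0],   # indistinguishable from f2 under these three patterns
}
print(sorted(diagnostic_fault_dropping(predicted, [1, 1, 0])))  # ['f2', 'f3']
```

Faults f2 and f3 both survive because the applied patterns never distinguish them, which is precisely why more patterns must be simulated per fault than in coverage-oriented simulation.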
The size of the fault dictionary becomes a problem, as the responses of all faults to many input
patterns need to be stored; furthermore, most of the failure information is not used at all, since
only a small part of all simulated faults will actually be present in the DUD. A way to overcome
this situation is suggested in [10], where fault simulation is performed after testing and the
diagnosis is guided towards the relevant faults in accordance with the observed behavior. The key
point in this optimization is to select the appropriate faults to include in the simulator, achieving
good fault resolution without so much computational overhead as to make the process too
time-consuming.
The methods outlined in the previous paragraphs still suffer from the disadvantages of the cause-
effect paradigm, and the inability of fault models to describe defect mechanisms precisely has become a great
obstacle for their performance in recent years.
2.4.2 Fault Identification
Techniques for fault identification try to assess the presence of a presumed fault in the DUD.
They attempt to circumvent the limitations of a simple fault model by either exploring more
complex models [1] or using a fault-model-independent approach [5], [6].
The advantage of fault identification over fault localization is that it captures the faulty behavior
in a way that can be mapped back to real physical defects. When using complex fault models,
however, the success of this approach relies again on how accurately they can represent the
defects in the DUD. In addition, complex fault models also need a higher computational effort
to simulate and, therefore, their use might become very time-consuming or even infeasible for
large circuits. The use of multiple fault models is not guaranteed to yield optimum results, since
the chip may fail in unexpected ways, and it is unlikely that all defect mechanisms will be
exactly matched.
Other fault identification methods weaken the requirements on the fault models by directly
extracting the defect behavior from the DUD responses [4]. These techniques use an effect-
cause analysis to obtain a first approximation of the faulty sites and, in a second stage of the
algorithm, try to improve the resolution further. One possible way to improve resolution is to
use the faulty sites and DUD responses to guide automatic test pattern generation (ATPG) [5]. A second
alternative for better resolution is to reason over the behavior of faulty lines by studying the
passing patterns. This approach is used in DIAGNOSIX and will be further explained later in
this paper when the algorithm is detailed.
3 DIAGNOSIX: A diagnosis methodology
3.1 Overview
The goal of DIAGNOSIX is to identify logical faults that accurately capture the defect mechanism
in the DUD. These logical faults are extracted from the logic behavior of the circuit and its layout
information and hence need not be an instance of a traditional fault model. Logical faults consist of
both faulty signal lines and the set of logical conditions that produce the faulty behavior; they can,
thus, model defects that produce repeatable logic-level misbehavior.
The inputs to DIAGNOSIX, as in most diagnosis methods, include the complete logic-level
description of the DUD, the set of passing test patterns and the failing patterns with their
corresponding failing outputs. Additionally, DIAGNOSIX requires some limited layout
information so that the neighborhood of suspect faulty lines can be determined.
DIAGNOSIX was designed for combinational circuits, so it is assumed that the DUD is purely
combinational or implements full scan.
The methodology of DIAGNOSIX relies on a few simplifying assumptions in order to make the
diagnosis problem tractable. The first assumption states that a defect needs to cause one or more
failing outputs for at least one input test pattern; additionally, the faulty components in the circuit
(transistors, gates, interconnects) must be located within a given intra-layer physical distance from
the faulty line. In the case of multiple faults, the same methodology is applicable if all faults are in
this sense localized to a specific region of the DUD. For instance, if a fault is detected in line s3 in
Figure 1, the responsible defects are restricted to those within a radius r of s3. In this case, the lines
s2, s4 and s5 are also considered in the following steps of the algorithm and the lines s1 and s6 are
discarded from further analysis.
The second simplifying assumption requires that the behavior caused by the defects not change over
time; that is to say, defects must be repeatable and their behavior must always be reproducible with
the same set of input patterns. Finally, it is assumed that the scan circuitry is fault-free and cannot,
by itself, produce failing outputs.
In the context of the DIAGNOSIX methodology, the neighborhood of line li is defined as the set of
lines that: are physical neighbors of li, drive li or one of its physical neighbors, or drive a gate which
is also driven by li (side-inputs).
Figure 1. The defects are localized to a region of radius r from the faulty line s3 [4].
Figure 2(a) shows a bridge fault between lines S6 and S9 and Figure 2(b) shows some layout
information for the same circuit. It can be seen that S9 has two physical neighbors, namely S7 and
S10. The neighborhood of S9 is then comprised of S7 and S10, the drivers of S9: S3, S4 and S8, the
drivers of S7: S1 and S2, and the drivers of S10: S4 and S5. Even if S10 were not a physical neighbor
of S9, it would still be a part of the neighborhood of S9, because it drives S14, which is also driven
by S9.
Likewise, the neighborhood state of a line li, for a pattern ti, is the set of fault-free logical values
present in the neighborhood of li, when ti is input to the DUD. The localization assumption implies
that the neighborhood state of a line determines when it becomes faulty [4]; in turn, this
generalization usually captures the behavior of many real defects such as bridges, opens,
transistor stuck-opens, gate-oxide shorts, etc.
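The neighborhood definition above can be made concrete with a short sketch that reproduces the Figure 2 example around S9; the dictionary encoding of the driver and adjacency relations is of course an assumption of this illustration:

```python
def neighborhood(line, drivers, physical, driven_gates):
    """Neighborhood of `line` as defined above: its physical neighbors, the
    drivers of the line and of its physical neighbors, and the side-inputs of
    the gates driven by the line.
    drivers: gate-output line -> its input lines
    physical: line -> physically adjacent lines
    driven_gates: line -> gate-output lines the line feeds"""
    nbh = set(physical.get(line, ()))
    nbh |= set(drivers.get(line, ()))
    for p in physical.get(line, ()):
        nbh |= set(drivers.get(p, ()))
    for g in driven_gates.get(line, ()):
        nbh |= set(drivers.get(g, ()))   # side-inputs of gates driven by `line`
    nbh.discard(line)
    return nbh

# Encoding of the Figure 2 example around S9
drivers = {"s9": {"s3", "s4", "s8"}, "s7": {"s1", "s2"},
           "s10": {"s4", "s5"}, "s14": {"s9", "s10"}}
physical = {"s9": {"s7", "s10"}}
driven_gates = {"s9": {"s14"}}
print(sorted(neighborhood("s9", drivers, physical, driven_gates)))
```

As in the text, the result contains S1 through S5, S7, S8 and S10, and S10 would remain in the set through the side-input rule even without the physical adjacency.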
Figure 2. Neighborhood of a line: (a) gate-level circuit.
DIAGNOSIX is an effect-cause methodology for precision diagnosis and it is capable of
performing both fault localization and identification. Figure 4 shows the procedural flow of the
diagnosis activities; in the first stage, the failure information from the tester is analyzed and the
suspect faulty lines are identified, then, in the second stage, the logical conditions that make the
target lines faulty are determined by studying the test pattern responses, the logic-level description
of the DUD as well as its layout information. The observed fault, the suspect lines and their
neighborhood state in this stage make up an extracted fault model to explain the real behavior
of the chip.
Figure 2. Neighborhood of a line: (b) layout information
This model is again verified in the behavior validation stage, when the derived faults are simulated
and compared against the DUD responses. In order to better represent and simulate faults, fault
tuples are used in this methodology since they can describe arbitrary trigger conditions.
In the last stage of DIAGNOSIX, Focused ATPG, new test patterns are generated to improve
accuracy, confidence and resolution.
The final output of the algorithm is a set of candidate faults. The accuracy of a candidate is the
ratio between the number of patterns that can be explained by this fault and the total number of
patterns applied to the DUD. A fault explains a given pattern if the result of the fault simulation
matches that of the real DUD. Resolution is the inverse of the number of candidates, and, finally,
confidence is the ratio between the number of neighborhood states produced by all test patterns and the
total number of possible states in the circuit.
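These three figures of merit are simple ratios and can be written down directly; the sketch below takes the third metric to be the confidence mentioned alongside accuracy and resolution, and the sample numbers are invented:

```python
def accuracy(explained, applied):
    """Fraction of the applied patterns whose responses the candidate explains."""
    return explained / applied

def resolution(num_candidates):
    """Inverse of the number of candidate faults reported."""
    return 1 / num_candidates

def confidence(states_seen, states_possible):
    """Fraction of the possible neighborhood states exercised by the test set."""
    return states_seen / states_possible

# A candidate explaining all 100 patterns, reported among 2 candidates,
# with 12 of 16 possible neighborhood states observed:
print(accuracy(100, 100), resolution(2), confidence(12, 16))  # 1.0 0.5 0.75
```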
Figure 4. Overview of DIAGNOSIX [4]
3.2 Stage 1: Defect Localization
The main purpose of this stage is to find potential faulty lines, that is, to identify signal lines that
are likely to be influenced by a physical defect. To achieve this localization, three main steps are
performed: path-tracing [9], per-test diagnosis [6] and passing pattern validation (PPV) [3].
3.2.1 Path-tracing
Path-tracing, in its simplest form, statically traces the output responses back to the inputs
through the combinational gates so that an input cone can be constructed for each observable
output. All the signal lines in the input cone are identified as potential faulty lines that could
cause the observed faulty behavior.
It is possible to reduce the number of candidate lines if the results from fault-free simulation
are taken into account when selecting lines from the input cone. This is called dynamic path-
tracing. With this approach, it can be proven that some signal lines cannot be responsible for
the faulty outputs, and therefore, they can be safely removed from further consideration.
According to one of DIAGNOSIX's key assumptions, path-tracing in this step must only be
concerned with combinational logic. Fortunately, this problem is well understood, and the
existing algorithms are guaranteed to find all possible candidate signals for a given observed
failure [6].
Since the ultimate purpose of this methodology is to identify defect behavior, the output of the
path-tracing procedure is augmented with additional information, and the polarity of the error
on a line is also annotated in the process output. This information is obtained anyway in
dynamic path-tracing and can reduce simulation time in subsequent steps of the localization
strategy.
Formally, the output of the path-tracing step is a set Sp, holding entries of the form li/v, where li
is a suspect faulty line and v ∈ {0, 1} is its error polarity.
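A minimal sketch of dynamic path-tracing for AND/OR networks follows; it is a deliberate simplification (inverters, fanout and multi-output handling are omitted, and the circuit is hypothetical), but it shows how fault-free values prune the input cone and how error polarities are annotated:

```python
def dynamic_path_trace(failing_output, gates, good):
    """Minimal dynamic path-tracing sketch. If some gate input holds the
    controlling value in the fault-free simulation, only those inputs can be
    responsible for an error on the gate output; otherwise all inputs are
    traced. Each suspect line is annotated with its error polarity, taken here
    as the complement of its fault-free value (valid for this monotone sketch).
    gates: output line -> (gate type, input lines); good: line -> good value"""
    controlling = {"AND": 0, "OR": 1}
    suspects, stack = set(), [failing_output]
    while stack:
        line = stack.pop()
        suspects.add((line, 1 - good[line]))   # (suspect line, error polarity)
        if line in gates:
            gtype, inputs = gates[line]
            ctrl = [i for i in inputs if good[i] == controlling[gtype]]
            stack.extend(ctrl or inputs)       # prune to controlling inputs
    return suspects

# n = a AND b, out = n OR c; the good value of `out` is 0 but the DUD fails
gates = {"out": ("OR", ["n", "c"]), "n": ("AND", ["a", "b"])}
good = {"a": 1, "b": 0, "c": 0, "n": 0, "out": 0}
print(sorted(dynamic_path_trace("out", gates, good)))
```

Line a is excluded by the dynamic pruning: since b already holds the controlling value 0 of the AND gate, an error on a alone could not explain the failing output.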
3.2.2 Per-test diagnosis
Although the number of lines in the output set of the path-tracing step is much smaller than the
total number of lines in the DUD, it can still be large enough to be an obstacle for precision
fault diagnosis. This is why per-test diagnosis is used to further reduce the number of potential
faulty lines.
Per-test diagnosis is a fault localization technique that attempts to model realistic defect
behavior by making use of simple fault models that can be efficiently simulated in software
tools. This method is also known as single location at a time (SLAT) because it uses only those
patterns for which the defect behavior can be explained by a single fault location.
SLAT makes use of the stuck-at fault model to gather defect information, however, it is not
assumed that this model characterizes defect behavior. When a test pattern is applied to the
DUD, the defect is modeled as a set of stuck-at faults; however, these faults need not be present
consistently in all test patterns. For instance, it may be possible that an input pattern reveals a
stuck-at 0 for a given line, but for a different pattern the same line might behave as a stuck-at
1, or even as a fault-free line.
In the context of combinational logic, one fundamental assumption is needed so that the SLAT
diagnosis becomes feasible: all observed fails for an input pattern can be explained exactly by
at least one stuck-at fault that affects a single pin. This assumption is known as the SLAT
property, and it means that, independently of the fault activation condition, the failure
behavior of each pattern can be modeled with stuck-at faults on a single line. One of the
advantages of the SLAT property is that it allows the use of the already available set of tools
and knowledge base for stuck-at faults to study other kinds of defects more realistically.
The input patterns can be classified into SLAT patterns, that is, failing patterns with the SLAT
property, non-SLAT patterns and passing patterns. In per-test diagnosis only SLAT patterns
are studied, whereas the passing patterns are considered in later stages of the DIAGNOSIX
methodology.
There are two main risks in this methodology. Firstly, it may happen that a defect, in response
to a given input pattern, causes errors on more than one line, but the errors propagate to
observable outputs in such a way that the fails can be explained by a single stuck-at fault. This
may lead to a simplified scenario where a complex defect mechanism is modeled as a simpler
one. Secondly, a very serious risk arises when none of the patterns has the SLAT property and the
diagnosis fails. The first risk is unavoidable in any diagnosis strategy, while the second is
considered low, since there are usually enough patterns with the SLAT property [4].
The SLAT diagnosis within the context of the DIAGNOSIX methodology proceeds as follows:
the set of faults Sp, obtained from the path-tracing step, is used as the initial set of candidate
faults and the input patterns are simulated to identify all the SLAT patterns. Only the faults
that independently explain the output responses for one or more patterns, and justify the SLAT
property, are stored in a table along with their SLAT patterns. As Figure 5 shows, all faults are
simulated for all patterns in a double loop.
The information in the table is represented as a set of temporary stuck lines (TSL), and a TSL
is defined as li/v^TK, where line li is stuck at value v for a given set of test patterns TK,
which belongs to the total set of patterns T (TK ⊆ T).
Each TSL li/v^TK is ranked by counting the number of patterns it explains: a given TSL
li/vi^TK is considered to be better than another TSL lj/vj^TL if TL ⊂ TK ⊆ T. This
classification comes from the intuitive idea that the fault that explains the most patterns is
more likely to be closer to the actual defect location. With this consideration in mind, a cover
forest is constructed to represent the faults and their relationships. Formally, there is an edge
between vertices li/vi^TK and lj/vj^TL if TL ⊂ TK ⊆ T; the higher-ranked faults are
always placed closer to the roots, and the faults that explain patterns for which there are no
other explaining faults are placed at the roots of the forest. Figure 6 shows an example of a
cover forest; in this case T1 ≠ T2, and T5 ⊂ T2 ⊂ T1 ⊆ T.
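The ranking and the subset relation behind the cover forest can be sketched directly with Python sets; the TSL faults and pattern sets below are invented for illustration:

```python
def cover_forest(tsls):
    """Sketch of cover-forest construction. `tsls` maps a TSL fault, written
    (line, value), to the frozenset of SLAT patterns it explains. An edge runs
    from a TSL to a lower-ranked TSL whose explained pattern set is a strict
    subset of its own; roots are TSLs not strictly contained in any other."""
    edges = [(p, c) for p, tp in tsls.items() for c, tc in tsls.items()
             if c != p and tc < tp]                 # '<' is strict subset
    roots = [f for f, tf in tsls.items()
             if not any(tf < tp for p, tp in tsls.items() if p != f)]
    return roots, edges

tsls = {
    ("s9", 1): frozenset({"t1", "t2"}),
    ("s12", 0): frozenset({"t1", "t2"}),   # equivalent: explains the same set
    ("s3", 0): frozenset({"t1"}),          # strict subset: placed below
}
roots, edges = cover_forest(tsls)
print(sorted(roots), sorted(edges))
```

Both equivalent faults end up as roots, anticipating the problem addressed next by passing pattern validation.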
Figure 5. SLAT diagnosis flow [6]
Figure 6. Example cover forest
3.2.3 Passing pattern validation
The cover forest described in the previous section lists the identified potential fault locations;
however, due to fault dominance and equivalence, some of these faults may lead to incorrect
results. In this step of the methodology, the assumptions described previously in this
discussion are exercised to discard misleading TSL faults.
The problem at hand comes from the presence of equivalent faults in the cover forest, that is,
faults that behave identically for the same input patterns. The simplest example of this situation
is a defect on the input of a NOT gate that manifests itself as a stuck-at-0 fault li/0^TK. The
cover forest for this example will include li/0^TK, the correct fault location, but will also
contain the incorrect fault lo/1^TK on the output of the NOT gate.
Passing pattern validation (PPV) relies on the observation that the logical conditions that
activate a faulty line in a failing pattern cannot be present in a passing pattern that also sensitizes
the line. According to the underlying assumptions, the neighborhood state of a TSL li/v^TK,
when the set TK is applied to the DUD, represents the activation condition for the fault;
consequently, there can be no passing pattern that produces the same neighborhood state in the
line and, at the same time, propagates the fault to an observable output. All TSL faults for
which this check is not fulfilled are removed from the cover forest.
For example, Table 1 shows the simulation results for the DUD shown in Figure 2(a). The TSL
fault s9/1^TK explains both failing patterns, that is, TK = {t1, t2}, and is included in the
cover forest. The TSL fault s12/0^TK will also be included, since it is equivalent to
s9/1^TK. If the neighborhood of s12 is assumed to be ⟨s5, s7, s8, s10⟩, it can be seen that,
for the TSL fault s12/0^TK, the neighborhood state of the failing pattern t1,
⟨s5 s7 s8 s10⟩ = ⟨0110⟩, is the same as that of the passing pattern t3, so this fault is removed
from the cover forest. For s9/1^TK, none of the failing patterns produces the same
neighborhood state as the passing pattern, which means that this fault is retained.
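This check reduces to a set intersection per TSL fault, as the following sketch shows; the neighborhood states are taken from the values of Table 1 for the circuit of Figure 2(a) (for s12 over ⟨s5 s7 s8 s10⟩, for s9 over ⟨s1 s2 s3 s5 s7 s8 s10⟩):

```python
def passing_pattern_validation(fail_states, pass_states):
    """Sketch of the PPV check: a TSL fault survives only if none of the
    neighborhood states produced by its failing patterns also occurs for a
    passing pattern sensitizing the line. Both maps take a TSL fault to a set
    of neighborhood states (tuples of fault-free values)."""
    return {tsl: states for tsl, states in fail_states.items()
            if not (states & pass_states.get(tsl, set()))}

fail_states = {
    ("s12", 0): {(0, 1, 1, 0), (0, 1, 0, 0)},                  # t1, t2
    ("s9", 1): {(0, 0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 1, 0, 0)},  # t1, t2
}
pass_states = {
    ("s12", 0): {(0, 1, 1, 0)},              # t3 reproduces t1's state
    ("s9", 1): {(0, 1, 0, 0, 1, 1, 0)},      # t3's state differs for s9
}
print(list(passing_pattern_validation(fail_states, pass_states)))
```

Only the s9 fault survives, matching the outcome of the worked example.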
The passing pattern validation step relies on two very important assumptions: the complete
defect excitation is captured by the neighborhood state, so the defect locations must be
confined to a radius r from the faulty line; and there can be no faults in the
neighborhood of the faulty lines. The latter condition is of course necessary to compare the values
of the neighborhood states of the failing and passing patterns.
3.3 Stage 2: Behavior Identification
Once the potential defect locations are known, the behavior identification step infers the logical
conditions that produce the faulty lines. The goal in this process is to extract a fault model,
from the information gathered so far, that captures the logic behavior of the physical defect.
The inputs to this stage of the methodology are the reduced cover forest and the neighborhood
information for each potential faulty line. Figure 7 shows the flow of activities for behavior
identification and validation.
3.3.1 Cover Forest analysis
As Figure 7 shows, cover forest analysis is the first activity in the behavior identification
procedure. Each TSL fault in the reduced cover forest represents a faulty line that explains a
subset of the SLAT patterns used in the localization phase. The objective at this point is to
find a group of TSL faults that together explain the complete set of SLAT patterns. Once one
such cover is identified, all its failure information is used to create a macrofault. If a consistent
fault model for this macrofault is identified, the Focused ATPG stage is entered, and new
patterns will be generated, based on the candidate model, to further improve diagnosis
accuracy and resolution.
If the number of TSL faults is small, all their combinations can be considered for macrofault
formation; however, if this is not the case, the analysis of every single cover becomes
infeasible. For this reason, some heuristics need to be taken into account to guide the cover
selection. The structure of the cover forest can provide a good selection heuristic by noticing
that the TSL faults that explain more patterns are closer to the roots of the forest, so they
should be given preference over the leaf nodes.
The DIAGNOSIX implementation builds covers comprised only of root nodes.

Test Pattern | Fail Status | Neighborhood state ⟨s1 s2 s3 s5 s7 s8 s10⟩
t1           | failing     | 0 0 0 0 1 1 0
t2           | failing     | 0 0 1 0 1 0 0
t3           | passing     | 0 1 0 0 1 1 0

Table 1. Fault-free simulation values on the neighborhood lines for the circuit in Figure 2(a)
Figure 7. Flowchart for behavior identification and validation
3.3.2 Neighborhood function extraction
This phase of the methodology identifies, for each TSL fault li/v^TK present in the selected
cover, the logical conditions that cause the DUD to behave like the fault li/v for the test set
TK; a consistent fault model is then constructed in order to represent the defect activation
mechanisms.
Neighborhood function extraction (NFE) finds a logical function that includes all the causes of
defect excitation on a single line. To construct such a function, the neighborhood states
resulting from the failing patterns can be considered as minterms of a truth table. Likewise, the
neighborhood states from the passing patterns represent maxterms, and the rest of the states
are included as “don't cares”. After the table is complete, Boolean minimization techniques can
be applied to find a sum-of-products expression known as the neighborhood function. This
procedure is repeated for all TSL faults in the selected cover.
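The truth-table construction can be sketched as below; note that the "minimization" shown is deliberately naive (one product term per minterm), standing in for a real two-level minimizer such as ESPRESSO, and the literal names x0…x6 are an assumption of this illustration:

```python
def neighborhood_function(minterms, maxterms):
    """Sketch of neighborhood function extraction: failing-pattern states are
    minterms, passing-pattern states are maxterms, and all remaining states
    are don't-cares (implicitly, by never being listed). The minimization here
    is naive: each minterm simply becomes one product term."""
    assert not (minterms & maxterms), "inconsistent truth table"
    products = []
    for state in sorted(minterms):
        lits = [f"x{i}" if bit else f"~x{i}" for i, bit in enumerate(state)]
        products.append("·".join(lits))
    return " + ".join(products)

# Minterms and maxterm for s9/1 from Table 1, over <s1 s2 s3 s5 s7 s8 s10>
f = neighborhood_function({(0, 0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 1, 0, 0)},
                          {(0, 1, 0, 0, 1, 1, 0)}) 
print(f)
```

A real implementation would exploit the don't-care set during minimization to obtain a much smaller sum-of-products expression.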
It is worth noting that including a set of “don't cares” when deriving the
neighborhood function is a heuristic; although the procedure addresses both passing and
failing patterns, it does not capture all the possible neighborhood states, that is, the states
produced by patterns not included in the test set. Therefore, in order to improve the diagnosis
confidence, additional patterns must be generated.
3.4 Stage 3: Behavior validation
Until now, the information on each of the TSL faults partially models the DUD defect mechanism;
in the Behavior validation step, the TSL faults in the selected cover with their corresponding
neighborhood functions are joined together to produce a single macrofault that, by itself, models
the complete defect behavior. To ensure consistency in the final model, the resulting macrofault is
simulated using all available patterns (passing, SLAT and non-SLAT) and the obtained results are
compared with the real responses from the DUD. Based on this comparison, some macrofaults are
rejected on the grounds of poor accuracy, while others are considered model candidates.
Within the DIAGNOSIX methodology, only macrofaults with 100% accuracy will be deemed
successful.
In order to merge the information about all TSL faults into a single simulatable macrofault, each
fault and its neighborhood function are represented as fault tuples and their product is
constructed. Each product of tuples contains all the information necessary to explain some part of
the defect behavior. Finally, the macrofault is constructed by joining all the products of tuples together.
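A much-simplified sketch of this construction, assuming a hypothetical tuple representation (the real fault-tuple notation of [2] also carries timing information, which is omitted here):

```python
from dataclasses import dataclass

# Hypothetical, simplified fault-tuple representation.
@dataclass(frozen=True)
class FaultTuple:
    line: str
    value: int
    kind: str  # 'c' = condition (from the neighborhood function), 'e' = effect

# One product of tuples per TSL fault in the cover: all its excitation
# conditions plus the faulty value forced on the line itself.
def product(tsl_line, stuck_value, neighborhood_conditions):
    conds = [FaultTuple(n, v, "c") for n, v in neighborhood_conditions]
    return conds + [FaultTuple(tsl_line, stuck_value, "e")]

# The macrofault joins all products of tuples together.
macrofault = [
    product("a7", 0, [("n1", 1), ("n2", 0)]),  # hypothetical TSL fault a7/0
    product("b3", 1, [("n4", 1)]),             # hypothetical TSL fault b3/1
]
print(len(macrofault), [len(p) for p in macrofault])
```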
4 Results
In [4], the authors of the DIAGNOSIX methodology analyze the diagnosis outcome for two sets of
experiments. In the first experiment, a number of typical faults are injected in the description of a
circuit and, after simulation, the expected signal responses are fed to DIAGNOSIX; additionally, 5 real
failing chips with PFA results are also analyzed. This part of the experiments serves as a way to
validate the diagnosis output, because, in the first case, the defect behavior and its location are
completely known, while in the latter, the already available physical analysis provides the real location
of the chip defects.
Later, in the second experiment, 830 failing chips, provided by an industrial partner, are diagnosed and
analyzed. From the resulting outcome, and based on the possible resolution and accuracy obtained in
the previous controlled experiment, it is possible to assess the value of this methodology in real
industrial production.
For the first experiment set, several traditional and non-traditional fault models were used to evaluate
the performance of the methodology. In particular, the following fault models were each injected
five times into a separate instance of the same design: two-wire biased-voting bridge,
wired-bridge, interconnect open, input pattern, three-line bridge, multiple stuck line (MSL),
and net faults, giving a total of 35 circuit responses. Out of these 35 devices, the 4 injected
with MSL faults have non-SLAT patterns, which account, at most, for 42% of the total number of patterns.
The localization stage of the DIAGNOSIX methodology is able to identify a set of faulty lines that
includes the real defect lines for all but two of the 35 devices. The first localization error occurs in a
two-wire biased-voting bridge: one of the lines in the bridge is never included in the cover
forest. The reason is that a TSL fault on this line is never exercised by the initial test set, and
therefore the methodology fails to consider this line faulty.
The second localization error is made in one of the circuits injected with MSL faults. In this case, the
real faulty lines are included in the cover forest in the per-test diagnosis step but they are later removed
during passing parameter validation. If all the initial assumptions hold, no signal should be
incorrectly removed from the cover forest; for this circuit, however, the neighborhood of the signal also
contains faults, which violates the given assumptions for PPV. It is worth noting that this localization
result, even with the MSL error present, can be regarded as a total success in the traditional sense,
since it identifies at least one of the faulty signal lines.
The analysis of the intermediate outputs for the 35 DUDs suggests that the methodology can greatly
reduce the number of candidate fault lines in each step of the localization procedure. The fault
localization of a typical example of those circuits starts with 3303 faulty lines after path-tracing, and
this number is reduced to 211 and 94 by per-test diagnosis and PPV, respectively. On average,
per-test diagnosis reduced the number of suspect lines by 93% while PPV further reduced this number
by 54%.
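The per-step reductions for the typical circuit quoted above can be checked directly (they land close to the averages reported over all circuits):

```python
# Suspect-line counts for the typical example: after path-tracing,
# after per-test diagnosis, and after PPV.
after_tracing, after_per_test, after_ppv = 3303, 211, 94

reduction_per_test = 1 - after_per_test / after_tracing  # per-test diagnosis step
reduction_ppv = 1 - after_ppv / after_per_test           # PPV step
print(f"{reduction_per_test:.1%}, {reduction_ppv:.1%}")  # 93.6%, 55.5%
```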
The analysis of the five real circuits with PFA results yielded slightly worse numbers for the reduction
of suspect lines: the number of lines was reduced, on average, by 90.3% after PPV. However, for
these 5 cases, PFA showed that all the real defect locations are included in the cover forest of every
DUD. Table 2 shows the detailed number of identified faulty lines after each localization step.
In the behavior validation step, DIAGNOSIX found suitable candidate macrofaults for all but 4 of the
35 target DUDs. As expected, one of these circuits is the one for which a faulty line was dropped from the
cover forest and, therefore, no candidate macrofault can be identified. The other 3 DUDs have also
been injected with MSL faults, but this time the heuristic of choosing macrofaults from the roots of the
cover forest does not achieve 100% accuracy in the validation step. It is hence necessary either to
search the cover forest for another macrofault, or to relax the validation criteria and accept the
available fault candidates.
It is important to notice that for 24 circuits, the number of macrofaults is not reduced after the behavior
validation step, and the large number of possible candidates results in poor diagnosis resolution. The
problem lies in the limited diagnosis capabilities of the initial test set, and could be improved by
generating new patterns specifically produced to identify differences between candidate macrofaults.

          Size of suspect   No. of TSL faults
          set               After per-test   After PPV
Chip #1   1131              51               31
Chip #2   4141              176              89
Chip #3   408               114              67
Chip #4   2217              106              68
Chip #5   397               136              76

Table 2. Localization results for 5 chips
For each of the 5 real circuits with PFA results, DIAGNOSIX identifies at least one candidate
macrofault. Furthermore, the number of lines in the neighborhood of each TSL fault can also be taken
into account in order to reason about the nature of the defect in the chip. More specifically, for three of
the DUDs there is one candidate consisting of a single TSL fault without neighbors; this means that the
fault does not depend on any other signal and, therefore, there is either a front-end-of-line (FEOL)
defect, that is, a defect that causes a faulty gate, or a short to one of the power lines. For the remaining
two DUDs, the candidate macrofaults are composed of multiple TSL faults with as many as 25
neighborhood lines. On the same grounds as in the previous analysis, FEOL defects can be
discarded, and back-end-of-line (BEOL) defects, that is to say, defects that cause interconnect
errors like opens and bridges, are more likely to be present. The results from the physical analysis of
these circuits confirm these observations about the nature of the defects in the chips.
4.1 Applicability
According to the authors of the methodology, DIAGNOSIX can be used to guide PFA by
reducing the number of potential faulty sites to be observed. Even though the defects causing the
faulty lines in a candidate macrofault may span several metal layers, or a large region in the
same layer, the neighbors of the faulty line limit the physical region that has to be inspected to
find physical imperfections.
Although the application of the methodology relies on a few weak assumptions, the accuracy and
confidence of the diagnosis output depend more heavily on the characteristics of the observed
fault; that is to say, the quality of the results in the localization and validation steps is sensitive to
defect behavior. For instance, the methodology may yield sub-optimal results in the presence of
MSL faults because they are not likely to be fully characterized by a single stuck-at fault in per-
test diagnosis, or because another defect is present in the neighborhood of a faulty line and, thus,
the validation step incorrectly removes the line from the cover forest.
The described experimental results show that MSL faults are the most severe limitation of the
DIAGNOSIX methodology; for the real silicon chips, however, this shortcoming did not hinder
the accuracy and confidence of the diagnosis results. Nonetheless, this situation could become a
serious issue in other process technologies or design scenarios.
The results of the analysis of 830 failing chips suggest that DIAGNOSIX is also able to model the
characteristics of the failures in a manufacturing process; in particular, DIAGNOSIX identified
five or fewer candidate macrofaults with 100% accuracy for 71% of the total chips.
DIAGNOSIX found 530 circuits (61%) for which the lines in the macrofaults have no physical
neighbors. As stated earlier, the number of physical neighbors can be considered a means to
quantify the extent of the physical region that has to be inspected in PFA and, consequently,
DIAGNOSIX not only improves diagnostic localization by 70% with respect to previous
approaches [10], but also hints that most of the errors are FEOL defects or shorts to the power
rails.
Statistical analysis on the macrofaults of a large number of failing chips could be used to describe
and model systematic defects in a manufacturing process. Such volume analysis is part of the
future research efforts in the DIAGNOSIX methodology.
5 Conclusions
DIAGNOSIX is a new diagnosis methodology capable of both locating and identifying physical defects
in a chip. This methodology extracts defect information from both failing and passing patterns and is,
therefore, independent of the nature of the defects in the circuit. The resulting error behavior is further
refined by simulating the complete available test set.
In the initial step, the set of possible faulty lines is obtained by path-tracing. These faulty lines are
reduced by per-test diagnosis and further refined by considering the results of the passing patterns and
the information on the physical neighborhood of a faulty line.
After this step, it is possible to find a set of faulty lines that together explain the complete failure
behavior of the chip; this behavior and its activation condition can then be extracted in the form of
Boolean equations.
The extracted behavior is again validated against all available patterns to discard any potentially
inconsistent models.
The experimental results of the analysis of a large number of chips suggest that DIAGNOSIX can be
used as a first approximation of defect location in order to guide PFA, in some cases achieving large
localization improvements in comparison to previous approaches. For a reduced number of chips with
available PFA results, DIAGNOSIX correctly identified FEOL and BEOL defects.
References
[1] R. C. Aitken, “Finding Defects with Fault Models,” in Proc. of the Int. Test Conf., pp. 498–505, 1995.
[2] R. D. Blanton, K. N. Dwarakanath, and R. Desineni, “Defect Modeling Using Fault Tuples,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 11, pp. 2450–2464, Nov. 2006.
[3] R. Desineni and R. D. Blanton, “Diagnosis of Arbitrary Defects Using Neighborhood Function Extraction,” in Proc. of the VLSI Test Symposium, pp. 366–373, May 2005.
[4] R. Desineni, O. Poku, and R. D. Blanton, “A Logic Diagnosis Methodology for Improved Localization and Extraction of Accurate Defect Behavior,” in Proc. of the Int. Test Conf., pp. 1–10, Oct. 2006.
[5] S. Holst and H.-J. Wunderlich, “Adaptive Debug and Diagnosis without Fault Dictionaries,” in Proc. of the 12th IEEE European Test Symposium, pp. 7–12, 2007.
[6] L. M. Huisman, “Diagnosing Arbitrary Defects in Logic Designs Using Single Location at a Time (SLAT),” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 1, pp. 91–101, Jan. 2004.
[7] S. D. Millman, E. J. McCluskey, and J. M. Acken, “Diagnosing CMOS Bridging Faults with Stuck-At Fault Dictionaries,” in Proc. of the Int. Test Conf., pp. 860–870, Oct. 1990.
[8] Semiconductor Industry Association, “The Int. Technology Roadmap for Semiconductors,” 2005 edition.
[9] S. Venkataraman and S. B. Drummonds, “POIROT: A Logic Fault Diagnosis Tool and Its Applications,” in Proc. of the Int. Test Conf., pp. 253–262, Oct. 2000.
[10] J. A. Waicukauski and E. Lindbloom, “Failure Diagnosis of Structured VLSI,” IEEE Design & Test of Computers, pp. 49–60, Aug. 1989.