# Post-Silicon Analysis of Shielded Interconnect Delays for Useful Skew Clock Design

Binyamin Frankel, Eyal Sarfati, Shmuel Wimer, *Member, IEEE*, and Yitzhak Birk, *Senior Member, IEEE*

Abstract— Analyses and simulations have shown that interconnect shielding can replace a large fraction of the delay buffers used to achieve timing goals through a useful skew clock design methodology. Immunity from process, operation, and environmental variations is essential in nanoscale CMOS clock designs, which makes predictable delays and useful skews highly important. We examine the intradie (within-die, WID) and interdie (die-to-die, D2D) variations of interconnect shielding under a wide variety of (P, V, T) corners, and show its applicability and ability to achieve clock design timing goals. The analysis is based on post-silicon measurements of a novel shielded interconnect ring oscillator in a 16-nm test chip, supported by a rigorous, provable estimation methodology.

*Index Terms*—Clock trees, delay tuning, interconnections, process variations, ring oscillator (RO), useful skew, wire shielding.

## I. INTRODUCTION

Very large scale integration (VLSI) designers have studied delay distribution in *clock trees* extensively. *Time-borrowing* relaxes timing constraints, thus allowing higher clock speeds [1], [2]. These techniques, known collectively as *useful clock skew* (useful skew for short), shift the arrival time of the clock signal at the sequential circuits by some prescribed amount relative to a nominal clock referred to as zero. The shifts are usually obtained by inserting delay buffers in the clock trees. It is important to distinguish the useful skew, which is intentional and aimed at speeding up the clock, from the inherent ordinary skew caused by the *RC* delays of a nonideal clock tree, which slows down the clock.

The internal delay of the buffers is subject to wide, unpredictable changes due to *process variation*, which has been aggravated by recent progress in VLSI technologies to the nanometer scale [3], [4]. Inserting delay buffers into a clock network is a delicate task and a design burden. Smaller variations ensure that the useful skew will be sustained across a broad range of operation and silicon conditions [5]. Wires are considerably less sensitive to process and operating condition variations than delay buffers. This makes the design more robust and its operation in silicon more predictable at corners [6].

Manuscript received July 8, 2019; revised August 14, 2019; accepted August 27, 2019. Date of publication September 18, 2019; date of current version October 29, 2019. This work was supported by the Israel Chief Scientist through the HiPer Consortium of the MAGNET Program. The review of this article was arranged by Editor M. S. Bakir. *(Corresponding author: Shmuel Wimer.)*

B. Frankel and S. Wimer are with the Engineering Faculty, Bar-Ilan University, Ramat Gan 52900, Israel (e-mail: binyamin.frankel@gmail.com; wimers@biu.ac.il).

E. Sarfati and Y. Birk are with the Electrical Engineering Department, Technion, Haifa 32000, Israel (e-mail: eyal.sarfati@gmail.com; birk@ee.technion.ac.il).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TED.2019.2938621

Interconnect shielding for achieving clock tree timing design goals has been described in two separate works. The first [7], [8] showed that wire shielding can provide a sufficiently large dynamic range of delay tuning via shield spacing over a wide range of interconnect widths and lengths. This dynamic range of delay tuning is essential for effective useful skew design. SPICE simulations have shown that for process, operational, and environmental (P, V, T) variations, the delay range obtained by using shield spacing is more robust than that obtained by common delay buffers.

It is important to note that the replacement of delay buffers by shields to control clock skew does not impact routing resources. Usually, the clock network is already shielded to protect signal integrity. We take advantage of this to control the underlying clock signal *RC* delays. A clock delay tuning by shielding design flow was implemented in [8] and was tested on Marvell's memory controller and ARMv7-based processor chips implemented in leading 28 nm (high-performance mobile) technology. Neither required an area increase compared with delay buffer insertion design flow. In terms of dynamic power consumption, the usage of existing shields means that there is no extra switching power. In fact, avoiding a great deal of delay buffer insertion into the clock network may eliminate some dynamic power.

The second study reported in [9] indicated that the simulation-based premise of a wide dynamic tunable delay range is achievable in silicon. For this purpose, a special shielded interconnect ring oscillator (RO) was devised, supplemented with a testing system and a rigorously validated estimation methodology. The assumptions in [7] and [8] were evaluated by measurements in silicon.

This work has two goals. The first is to demonstrate through post-silicon measurements that the wide tunable delay range and the stability of the useful skew design methodology exist in a broad range of (P, V, T) corners. The second is to show by post-silicon measurements that the delay range and primarily the useful skew, when implemented by wire shielding, are more robust than conventional delay buffers. These are proved using a shielded-wire RO supplemented by a precise measurement and estimation methodology.

0018-9383 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

RO circuits are well known and often used for the extraction of process parameter variations. The works described in [10] and [11] used RO to monitor the variations of the rise time and the fall time of an inverter. RO was used in [12] to extract ON-current variations of the pMOS and the nMOS devices, whereas [13] extracted the threshold voltage variations from different path delays. Another study reported in [14] used ROs fabricated in 65-nm process technology and an iterative method to estimate the variations in the threshold voltage and gate channel length. It was shown that the predicted values aligned closely with the measured values, proving the validity of the estimation technique. Design-dependent ROs (DDROs) were used in [15] to estimate chip delay based on measurements from multiple DDROs. The performance of DDROs and the delay estimation approach were verified on a 45-nm Silicon On Insulator (SOI) test chip.

Although ROs are widely used for process and performance monitoring, and also as an adjunct method to control the operating conditions of processors, to the best of our knowledge ROs have not been implemented to measure the delay parameters related directly to interconnect shielding. A recent work [16] addressed the lithography aspects of the wire widths in sub-20-nm processes based on GDSII mask data. Wire width variabilities affect delay uncertainties as a result of their varying resistance and their varying parasitic and cross-coupling capacitance in the presence of shields. While this type of study is useful for patterning and lithography planning, it provides little information at the chip level.

Here, we explore interconnect shielding interdie (also known as die-to-die, D2D) and intradie (also known as within-die, WID) variations under a wide variety of (P, V, T) corners, and show their applicability and effectiveness in achieving clock design timing goals as well as their superiority over commonly used delay buffers. The analysis is based on post-silicon measurements of a 16-nm test chip supplemented with a rigorous provable estimation methodology.

The remainder of this article is organized as follows. Section II describes the shielded interconnect RO measurement system. Section III presents the shielded interconnect delay estimation methodology. Section IV presents the post-silicon analysis of shielded interconnect interdie and intradie delay variations and their tunable dynamic ranges. Section V concludes the discussion.

## II. SHIELDED INTERCONNECT RING-OSCILLATOR MEASUREMENT SYSTEM

Measuring the delays of interconnects directly from silicon is very difficult, if not practically impossible. Silicon testing affords very limited probing of delay paths because there is no visibility of their constituent delay segments or internal nodes. To enable the derivation of the individual delays, a sufficiently large equation system of bulk delays is required. The RO system described in what follows was devised to form such a linear system. Each test of the system "programs" the RO via an appropriate control signal to yield a specific equation of the bulk delay. The linear system is described and solved in Section III.

Fig. 1. Five-stage shielded interconnect RO. (a) Five-stage inverting chain. (b) Shielded wires bundle connecting successive stages. (c) GDSII layout of a bundle comprising different spacing of shields.

A CMOS RO was used for the evaluation of the gate delay from silicon by indirect calculation based on counting pulses [17]. To accurately measure the shield delay from silicon, [9] devised a reconfigurable shielded interconnect RO circuit, based on the five-stage inverting chain shown in Fig. 1(a). The ring is comprised of four interconnecting shielded wire bundles, each of which consists of four differently shielded wires. The shielded wire bundles connect successive stages of the ring as illustrated in Fig. 1(b).

The inverting stage consists of a 4–1 MUX, whose four inputs are $I_0-I_3$ and whose single output is fanned out to four buffered outputs $Z_0-Z_3$. The internal design of the MUX ensures identical delays from any input to any output (see Fig. 2 in [9] and the associated discussion). The RO was implemented in a leading FinFET 16-nm standard cell library. The delay similarities were observed across many corners with the SPICE model extracted from the GDSII layout employing the StarRC Synopsys tool [18].

The wire connecting a stage to its successor is 200  $\mu$ m long. Fig. 1(c) shows the GDSII layout of the shielded wires with different spaces of  $1 \times s_{\min}$ ,  $2 \times s_{\min}$ ,  $3 \times s_{\min}$ , and  $5 \times s_{\min}$ , where  $s_{\min}$  is the minimum wire spacing of the technology. Though the  $5 \times s_{\min}$  shielding aims at representing the situation of unshielded wire, its existence is a must, since otherwise neighboring signals and shields would cause uncontrolled interference.

Fig. 2 shows a silicon photo of the shielded wires. The wire bundle colors at the top correspond to the layout implementation of Fig. 1(a), and the bottom zooms in on a single bundle whose shields are connected to the power grid, corresponding to the GDSII layout in Fig. 1(c).



Fig. 2. Photograph of the RO's shielded wires.



Fig. 3. Testing circuit.

The RO comprises four wire bundles, where each bundle consists of four differently shielded wires. There are, therefore, 16 unknown delays, denoted by $\delta_i^j$, where $i \in \{1, 2, 3, 5\}$ is the shielding distance of $i \times s_{\min}$ as shown in Fig. 1(c), and $j \in \{0, 1, 2, 3\}$ is the corresponding inverting MUX in Fig. 1(a). The input selection by S0 and S1 in each of the MUXs defines a total of $4^4 = 256$ distinct RO path compositions, yielding an oscillation frequency determined by the internal delays of the inverting stages in the ring and the specific selection of the shielded interconnections in the ring. As stated above, the inverting MUX was designed to yield identical internal delays regardless of its selected input-to-output path.

Measuring delays directly on silicon is complex and expensive, whereas measuring the frequencies of an RO (ring delay) to any desirable accuracy is relatively simple. An appropriate system to measure the oscillation frequency of the shielded RO in Fig. 1 was devised. It is comprised of the RO, a tunable aperture circuit to count the oscillation pulses, a counter, and a synchronizer. Fig. 3 illustrates this measurement system. For purposes of illustration its synchronization and control details are ignored, but the reader can find these in [9]. A test selects one of the 256 ring configurations. The RO is triggered with an enable signal synchronized with a measurement aperture of width t. The counter counts the number of pulses n incurred during the time period t, thus yielding the RO's delay  $\Delta = t/n$ .
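The counting relation $\Delta = t/n$ can be illustrated with a minimal sketch. The aperture width and pulse count below are hypothetical values chosen for illustration, not measurements from the test chip, and the function name `ro_delay` is an assumption.

```python
def ro_delay(aperture_s: float, pulse_count: int) -> float:
    """Ring delay Delta = t / n for an aperture of t seconds and n counted pulses."""
    if pulse_count <= 0:
        raise ValueError("pulse_count must be positive")
    return aperture_s / pulse_count

# Hypothetical numbers: a 10 us aperture in which 20 000 pulses are counted.
t = 10e-6
n = 20_000
delta = ro_delay(t, n)  # 0.5 ns per counted pulse

# Miscounting a single pulse shifts the result by t / (n * (n + 1)),
# which is slightly less than delta / n.
err = abs(ro_delay(t, n + 1) - delta)
```

Under these assumed numbers a one-pulse miscount perturbs the delay by a relative amount of about $1/n$, which suggests why a wide aperture makes the frequency measurement accurate.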

## III. SHIELDED INTERCONNECT DELAY ESTIMATION METHODOLOGY

Once the delays are derived from the oscillator frequencies, the question is how to deduce the effects of various shielding on the delays indirectly. Let  $0 \le k \le 255$  be the index of a RO configuration obtained by the input-to-output path selection of the four MUXs of the ring. Let  $\Delta_k$  be the corresponding delay obtained by dividing the duration *t* of the measurement aperture by the number  $n_k$  of the counted oscillations; namely

$$\Delta_k = \frac{t}{n_k}, \quad 0 \le k \le 255. \tag{1}$$

The path delays obtained in (1) are the measurements comprised of the four segments constituting the kth path. There are altogether 16 such segments, whose delays should be estimated. Since this work focused on finding the delay dynamic range obtained by shielding, the MUX delay had a negligible effect.

Let  $\delta_{i(j)}^{j}$  be one of the 16 delay segments, where  $0 \le j \le 3$  designates one of the four stages of the RO in Fig. 1(a), and  $i(j) \in \{1, 2, 3, 5\}$  designates the selection of one out of the four possible shield spacings  $i(j) \times s_{\min}$  in the corresponding stage as shown in Fig. 1(b). The following equality holds:

$$\Delta_k = \delta_{i(0)}^0 + \delta_{i(1)}^1 + \delta_{i(2)}^2 + \delta_{i(3)}^3, \quad 0 \le k \le 255 \tag{2}$$

where k is spanned over all possible selections of the ring configurations in Fig. 1 obtained by appropriate selections of the MUXs' control signals. The linear system in (2) can be written in the following matrix notation:

$$\mathbf{H}\boldsymbol{\delta} = \boldsymbol{\Delta} \tag{3}$$

where $\boldsymbol{\delta} = [\delta_1^0, \delta_2^0, \delta_3^0, \delta_5^0, \dots, \delta_1^3, \delta_2^3, \delta_3^3, \delta_5^3]^T$ is a 16 × 1 vector of unknown segment delays of the RO, and $\boldsymbol{\Delta} = [\Delta_0, \Delta_1, \dots, \Delta_{255}]^T$ is a 256 × 1 vector of the delay measurements in (1). Finally, **H** is a 256 × 16 zero-one matrix, where each row is comprised of four ones representing a specific configuration under test in (2).

The 256 equations in (2) involve 16 unknown parameters, yielding an overdetermined linear system. Note that any specific wire segment can be involved in 64 configurations, as dictated by the three other stages. If a segment had an identical impact on each of the 64 configurations, one could choose any 16 row-independent equations out of the 256 of (2) to solve the system. In reality, however, the impact of a specific segment can vary across configurations. This in turn results in some noise in the measured $\Delta_k$, thus making it impossible to obtain an exact solution. In this case, least mean squares (LMS) parameter estimation is needed [19]. The solution to (3) is obtained by the LMS estimation

$$\hat{\boldsymbol{\delta}} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \boldsymbol{\Delta} \tag{4}$$

where $\hat{\boldsymbol{\delta}}$ is the estimated solution. Unlike in post-silicon, where the measurement of an individual segment delay is impossible, the segment delay $\delta_{i(j)}^{j}$ can be simulated by the SPICE simulator and compared with its estimated value $\hat{\delta}_{i(j)}^{j}$ as obtained by the approximated solution in (4).
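The construction of **H** in (3) and a least-squares estimate in the spirit of (4) can be sketched as follows. This is a minimal illustration on synthetic data: the "true" delays, the noise levels, and all names are assumptions rather than values or code from the measurement system. One caveat applies to the sketch as built here: because every row selects exactly one wire per stage, the four columns of each stage sum to the all-ones vector, so this **H** has rank 13 and $\mathbf{H}^T\mathbf{H}$ is singular. The sketch therefore calls `np.linalg.lstsq`, which returns the minimum-norm least-squares solution; path delay predictions and within-stage delay differences do not depend on that choice.

```python
import itertools
import numpy as np

SPACINGS = (1, 2, 3, 5)   # shield spacing as multiples of s_min
N_STAGES = 4              # stages j = 0..3

def build_H():
    """One row per RO configuration; a 1 marks the wire selected in each stage."""
    rows = []
    for choice in itertools.product(range(len(SPACINGS)), repeat=N_STAGES):
        row = np.zeros(N_STAGES * len(SPACINGS))
        for j, c in enumerate(choice):
            row[j * len(SPACINGS) + c] = 1.0
        rows.append(row)
    return np.array(rows)

H = build_H()             # shape (256, 16), four ones per row

rng = np.random.default_rng(0)
# Hypothetical "true" segment delays in ps, larger for tighter shield spacing.
delta_true = np.tile([120.0, 100.0, 92.0, 85.0], N_STAGES) + rng.normal(0, 1, 16)
# Synthetic path delays with a small amount of measurement noise.
Delta = H @ delta_true + rng.normal(0, 0.2, H.shape[0])

# Least-squares estimate (minimum-norm, since H is rank deficient here).
delta_hat, *_ = np.linalg.lstsq(H, Delta, rcond=None)
Delta_pred = H @ delta_hat

# Within-stage difference delta_1^0 - delta_5^0, estimated versus true.
diff_hat = delta_hat[0] - delta_hat[3]
diff_true = delta_true[0] - delta_true[3]
```

In this synthetic setting the estimated difference between the $1 \times s_{\min}$ and $5 \times s_{\min}$ wires of a stage tracks the true difference closely, which is the quantity the tuning-range analysis relies on.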

Since there is no direct post-silicon delay measurement, the 16 estimated post-silicon segment delays cannot be compared to anything measured in silicon. How can we be confident that the linear regression in (4) yields a valid post-silicon estimation? To this end, we used Monte Carlo cross-validation. In this case, the unknown segment delays are estimated from a randomly drawn subset (say 80%) of the path delay measurements. The path delays of the remaining subset (say 20%) are first computed by using the estimated segment delays and then compared with the corresponding measurements for validation [20]. The correctness of the post-silicon validation was proven and discussed in [9].

A subset of 80% of the measurements, denoted by  $\mathbf{\Delta}_{80\%}$ , is drawn randomly from the vector  $\mathbf{\Delta} = [\Delta_0, \Delta_1, \dots, \Delta_{255}]^T$ . These measurements with their corresponding rows in the matrix **H**, denoted by  $\mathbf{H}_{80\%}$ , are used to estimate the delays of the 16 shielded wires, denoted by  $\hat{\boldsymbol{\delta}}_{80\%}$ . The following system is solved to yield the estimated delays:

$$\hat{\boldsymbol{\delta}}_{80\%} = \left( \mathbf{H}_{80\%}^T \ \mathbf{H}_{80\%} \right)^{-1} \mathbf{H}_{80\%}^T \boldsymbol{\Delta}_{80\%}. \tag{5}$$

To verify the accuracy of $\hat{\boldsymbol{\delta}}_{80\%}$, the remaining 20% of the measurements of the vector $\mathbf{\Delta} = [\Delta_0, \Delta_1, \dots, \Delta_{255}]^T$, denoted by $\mathbf{\Delta}_{20\%}$, are compared with their corresponding predicted values. A predicted delay $\hat{\Delta}$ is calculated by summing the appropriate estimated delays of $\hat{\boldsymbol{\delta}}_{80\%}$ defined by the RO configuration corresponding to $\Delta \in \mathbf{\Delta}_{20\%}$ as follows:

$$\hat{\Delta} = \hat{\delta}^{0}_{80\%,i(0)} + \hat{\delta}^{1}_{80\%,i(1)} + \hat{\delta}^{2}_{80\%,i(2)} + \hat{\delta}^{3}_{80\%,i(3)}. \tag{6}$$

If $|\Delta - \hat{\Delta}| \approx 0$ for every $\Delta \in \mathbf{\Delta}_{20\%}$, we consider the estimation to be reasonably accurate.
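The 80%/20% Monte Carlo cross-validation of (5) and (6) can be sketched as below. The data are synthetic, and the function names, the number of trials, and the noise level are assumptions made for illustration. As in any one-hot construction of **H**, the matrix is rank deficient, so the sketch uses `np.linalg.lstsq` instead of an explicit inverse; predictions of held-out path delays are unaffected as long as the training rows span the same row space.

```python
import itertools
import numpy as np

def build_H(n_stages=4, n_wires=4):
    """Zero-one matrix with one row per ring configuration."""
    configs = list(itertools.product(range(n_wires), repeat=n_stages))
    H = np.zeros((len(configs), n_stages * n_wires))
    for k, choice in enumerate(configs):
        for j, c in enumerate(choice):
            H[k, j * n_wires + c] = 1.0
    return H

def cross_validate(H, Delta, train_frac=0.8, n_trials=100, seed=0):
    """Return the worst relative prediction error seen over all random splits."""
    rng = np.random.default_rng(seed)
    n = len(Delta)
    n_train = int(round(train_frac * n))
    worst = 0.0
    for _ in range(n_trials):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        # Estimate segment delays from the training subset, as in (5).
        d_hat, *_ = np.linalg.lstsq(H[train], Delta[train], rcond=None)
        # Predict the held-out path delays, as in (6), and compare.
        rel_err = np.abs(H[test] @ d_hat - Delta[test]) / Delta[test]
        worst = max(worst, float(rel_err.max()))
    return worst

H = build_H()
rng = np.random.default_rng(1)
delta_true = np.tile([120.0, 100.0, 92.0, 85.0], 4)    # hypothetical delays [ps]
Delta = H @ delta_true + rng.normal(0, 0.2, H.shape[0])  # synthetic measurements
worst_rel_err = cross_validate(H, Delta)
```

Reporting the worst case over many random splits, rather than a single split, is one way to make the "$\approx 0$" criterion concrete; the acceptance threshold itself remains a design choice.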

## IV. POST-SILICON ANALYSIS OF SHIELDED INTERCONNECT DELAY VARIATIONS

The shielded RO shown in Fig. 1 and its accompanying testing system were fabricated in leading FinFET 16-nm technology on the Marvell Corporation's test chip shown in Fig. 4. We incorporated four RO circuits on a die, located apart from each other. This allowed us to measure the robustness of the delay tuning by shielding subject to WID variations, in addition to (P, V, T) variations. The post-silicon delays of the shielded wires were obtained by solving (4) for each test, incorporating all 256 configurations. The subsequent analyses were obtained for typical, slow, and fast silicon samples, denoted by t, s, and f, respectively.

### A. Delays and Their Tunable Dynamic Range

The rationale for delay tuning by interconnect shielding is based on the assumption that shields can provide a sufficiently large dynamic range of delays by adjusting their spacing from the signal wires. This dynamic range was the key to building the clock tree delay tuning design flow in [8]. We thus aimed at studying the delay tuning dynamic ranges achievable in silicon. This first required obtaining the raw delays of the shielded wires, since post-silicon testing only yields the delays of the entire RO and not the delays of its individual shielded wires. The latter are derived by solving (4). First, the delay of each ring configuration in (1) is obtained by averaging its delay over 50 repeated measurements. This averaging considerably mitigates the effect of the noise of counting plus or minus one pulse in the measurement aperture that may occur in the testing system due to timing synchronization.

Fig. 4. Four shielded RO circuit locations on a test chip.

Note that each of the  $\delta_1$ ,  $\delta_2$ ,  $\delta_3$ , and  $\delta_5$  segment delays is repeated four times in the ring of Fig. 1 and each of the  $\delta_i^j$ ,  $0 \le j \le 3$ ,  $i \in \{1, 2, 3, 5\}$  is estimated individually, so  $\delta_1$ ,  $\delta_2$ ,  $\delta_3$ , and  $\delta_5$  are obtained by averaging their four respective cyclic repetitions in the ring. Another averaging takes place across the four on-die ring replicas shown in Fig. 4. Finally, averaging across the dies of the same process corners also takes place. While representative delays are obtained by averaging, their WID and D2D variations are also very important and are discussed separately below.
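The averaging hierarchy described above (cyclic repetitions in the ring, then on-die replicas, then dies of the same corner) can be sketched with array reductions. The array contents and the number of dies are hypothetical; only the axis structure follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical segment-delay estimates [ps] with axes
# (die, on-die RO, ring stage j, shield spacing i); 3 dies is an assumed count.
est = 100.0 + rng.normal(0, 1, size=(3, 4, 4, 4))

per_ro = est.mean(axis=2)          # average the four cyclic repetitions -> (3, 4, 4)
per_die = per_ro.mean(axis=1)      # average the four on-die RO replicas -> (3, 4)
per_corner = per_die.mean(axis=0)  # average dies of one process corner  -> (4,)
# per_corner holds the representative delta_1, delta_2, delta_3, delta_5.
```

Keeping the intermediate arrays `per_ro` and `per_die` is deliberate in this sketch, since the WID and D2D analyses need the spread at those levels and not only the final averages.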

Fig. 5 illustrates the post-silicon delays for slow, typical, and fast corners. Every corner shows from the top (higher delays) to bottom (smaller delays) the delay surfaces of  $\delta_1$ ,  $\delta_2$ ,  $\delta_3$ , and  $\delta_5$  in temperature and voltage ranges of 25 °C–105 °C and 0.8–1.0 V, respectively. The distance between surfaces represents the delay tuning range achievable by wire shielding, whereas the tilt of the surfaces represents the delay changes incurred by temperature and voltage variations.

To introduce effective useful skew into the clock tree with wire shielding rather than a delay buffer, a sufficient dynamic range of delay tuning is required. The back-of-the-envelope calculation in [8] yielded a 44% dynamic range for wire widths of $1 \times w_{\min}$, which is the width we used for the physical layout in Fig. 1(c) and Fig. 2. It was later shown in [9] by SPICE simulations that the delay tuning ranges from 32% in the (s, 0.72 V, -40 °C, worst RC) corner up to 47% in the (f, 1.05 V, 125 °C, best RC) corner. The raw delays in Fig. 5 can be used to derive the post-silicon delay tuning range (in percentages) for various (P, V, T) corners. The delay tuning ranges are obtained by computing the ratio in (7), and the results are summarized in Table I. The post-silicon results, which lie between 30.1% and 36.2%, are conclusive and agree well with the back-of-the-envelope calculation and the SPICE simulations

$$\text{delay range} = \frac{\delta_1 - \delta_5}{0.5(\delta_1 + \delta_5)}. \tag{7}$$

Fig. 5. Delays of different wire shields across (P, V, T) corner variations.

TABLE I

Post-Silicon Delay Tuning Ranges [%]

| Temp [°C] | $V_{\rm dd} = 1.0$ V, fast | $V_{\rm dd} = 1.0$ V, typical | $V_{\rm dd} = 1.0$ V, slow | $V_{\rm dd} = 0.9$ V, fast | $V_{\rm dd} = 0.9$ V, typical | $V_{\rm dd} = 0.9$ V, slow | $V_{\rm dd} = 0.8$ V, fast | $V_{\rm dd} = 0.8$ V, typical | $V_{\rm dd} = 0.8$ V, slow |
|-----------|------|------|------|------|------|------|------|------|------|
| 25        | 34.8 | 35.2 | 34.6 | 33.6 | 34.0 | 33.0 | 32.2 | 32.3 | 30.1 |
| 50        | 35.2 | 35.6 | 35.0 | 34.1 | 34.4 | 33.5 | 32.7 | 32.8 | 31.5 |
| 85        | 35.6 | 36.0 | 35.6 | 34.6 | 34.8 | 34.1 | 33.2 | 34.8 | 32.2 |
| 105       | 35.8 | 36.2 | 35.8 | 34.8 | 35.0 | 34.4 | 33.4 | 35.1 | 32.5 |
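A short sketch of the ratio in (7) follows. The two segment delays passed to the function are hypothetical, while the `TABLE_I` values are copied from Table I as read here (rows are 25, 50, 85, and 105 °C; columns are fast, typical, slow at 1.0 V, then 0.9 V, then 0.8 V). The function and variable names are assumptions.

```python
def delay_range_pct(d1: float, d5: float) -> float:
    """Relative tuning range of (7), in percent."""
    return 100.0 * (d1 - d5) / (0.5 * (d1 + d5))

# Hypothetical segment delays in ps (not measured values): 36 / 102.
example = delay_range_pct(120.0, 84.0)

# Post-silicon tuning ranges [%] as listed in Table I.
TABLE_I = [
    [34.8, 35.2, 34.6, 33.6, 34.0, 33.0, 32.2, 32.3, 30.1],  # 25 degC
    [35.2, 35.6, 35.0, 34.1, 34.4, 33.5, 32.7, 32.8, 31.5],  # 50 degC
    [35.6, 36.0, 35.6, 34.6, 34.8, 34.1, 33.2, 34.8, 32.2],  # 85 degC
    [35.8, 36.2, 35.8, 34.8, 35.0, 34.4, 33.4, 35.1, 32.5],  # 105 degC
]
lowest = min(min(row) for row in TABLE_I)
highest = max(max(row) for row in TABLE_I)
```

Normalizing by the mean of the two delays, rather than by either one alone, makes the ratio symmetric in $\delta_1$ and $\delta_5$ up to sign.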

The delay sensitivities to (P, V, T) variations are also striking. The delay tuning range in (7), which we are interested in, can be viewed as the relative delay changes (sensitivity) incurred by shield spacing changes. The larger the sensitivity to spacing changes the better. In contrast, the delay changes across (P, V, T) corners should preferably be kept as small as possible. Below we derive the  $\delta_i$  sensitivities to (P, V, T) changes,  $i \in \{1, 2, 3, 5\}$ , where the reference corner is (t, 0.9 V, 85 °C). Examination of other reference corners yielded similar sensitivity trends

$$S_{\delta_i}^{\Delta V} \triangleq \frac{\delta_i(t, 0.8 \text{ V}, 85 \text{ °C}) - \delta_i(t, 1.0 \text{ V}, 85 \text{ °C})}{0.5[\delta_i(t, 0.8 \text{ V}, 85 \text{ °C}) + \delta_i(t, 1.0 \text{ V}, 85 \text{ °C})]} \tag{8}$$

$$S_{\delta_i}^{\Delta T} \triangleq \frac{\delta_i(t, 0.9 \text{ V}, 105 \text{ °C}) - \delta_i(t, 0.9 \text{ V}, 25 \text{ °C})}{0.5[\delta_i(t, 0.9 \text{ V}, 105 \text{ °C}) + \delta_i(t, 0.9 \text{ V}, 25 \text{ °C})]} \tag{9}$$

and

$$S_{\delta_i}^{\Delta P} \triangleq \frac{\delta_i(s, 0.9 \text{ V}, 85 \text{ °C}) - \delta_i(f, 0.9 \text{ V}, 85 \text{ °C})}{0.5[\delta_i(s, 0.9 \text{ V}, 85 \text{ °C}) + \delta_i(f, 0.9 \text{ V}, 85 \text{ °C})]}. \tag{10}$$

The sensitivities are derived from the data in Fig. 5, and are summarized in Table II.
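The three sensitivities in (8)-(10) share one form: the difference between two corner delays divided by their mean. A minimal sketch, using hypothetical delays for a single shield spacing (the numbers and names are illustrative assumptions, not values from Fig. 5), is:

```python
def sym_rel_diff_pct(a: float, b: float) -> float:
    """(a - b) divided by the mean of a and b, in percent."""
    return 100.0 * (a - b) / (0.5 * (a + b))

# Hypothetical delays [ps] keyed by (process, Vdd [V], temperature [degC]).
delay = {
    ("t", 0.8, 85): 104.0, ("t", 1.0, 85): 96.0,   # voltage extremes, as in (8)
    ("t", 0.9, 105): 103.0, ("t", 0.9, 25): 95.0,  # temperature extremes, as in (9)
    ("s", 0.9, 85): 101.0, ("f", 0.9, 85): 99.0,   # process extremes, as in (10)
}

S_V = sym_rel_diff_pct(delay[("t", 0.8, 85)], delay[("t", 1.0, 85)])
S_T = sym_rel_diff_pct(delay[("t", 0.9, 105)], delay[("t", 0.9, 25)])
S_P = sym_rel_diff_pct(delay[("s", 0.9, 85)], delay[("f", 0.9, 85)])
```

Each sensitivity varies one of P, V, or T between its extremes while holding the other two at the reference corner (t, 0.9 V, 85 °C).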

TABLE II

Delay Sensitivities [%] to (P, V, T) Corner Variations

| Sensitivity [%]           | $\delta_1$ | $\delta_2$ | $\delta_3$ | $\delta_5$ |
|---------------------------|------------|------------|------------|------------|
| $S_{\delta_i}^{\Delta T}$ | 8.9        | 8.4        | 8.0        | 7.8        |
| $S_{\delta_i}^{\Delta V}$ | 8.8        | 9.9        | 10.9       | 11.6       |
| $S_{\delta_i}^{\Delta P}$ | 1.6        | 1.6        | 2.4        | 2.1        |

It is interesting to compare the above sensitivities with the dynamic delay range in (7). While the latter is desirable, the former is not. Nevertheless, voltage and temperature sensitivities, $S_{\delta_i}^{\Delta V}$ and $S_{\delta_i}^{\Delta T}$ respectively, can be accounted for and treated at design time, first of all by simulations, and then by taking steps to avoid temperature hotspots and significant voltage drops. The variability that is least controllable at design time is the process sensitivity $S_{\delta_i}^{\Delta P}$, but as observed in Table II, its value is fortunately very small.

Table II also reveals the trends of $S_{\delta_i}^{\Delta V}$ and $S_{\delta_i}^{\Delta T}$ as the shielding spacing increases. $S_{\delta_i}^{\Delta V}$ increases because the current driving strength of a driver decreases as the voltage decreases in (8), whereas the wire's resistance remains unchanged. Therefore, in $\delta_5$, where the impact of the driver is greater than in $\delta_1$, the sensitivity to voltage change is higher. In contrast, $S_{\delta_i}^{\Delta T}$ decreases as the shielding spacing increases because the wire resistance in (9) increases with temperature. Since $\delta_1$ carries a higher cross-coupling capacitive load than $\delta_5$, the wire's resistance contributes more heavily to $\delta_1$ than to $\delta_5$.



Fig. 6. Variation ranges for the shield and buffer delays in 28 nm [8].



Fig. 7. Useful skew by (a) delay buffers and (b) shielding.

Recall that the main goal of shield spacing tuning is to control the clock signal timing, notably for useful skew purposes. A useful skew is defined as the difference between the arrival time (delay) of the clock signal at the flip-flop (FF) driving a combinational logic path and the arrival time (delay) of the clock signal at its terminating FF. As such, its sensitivity to (P, V, T) should be as small as possible. Clock delays per se may be subject to significant changes in various corners, so in order to maintain a stable useful skew, the delays comprising the skew should change similarly (and not in opposite directions). By comparing their sensitivities to (P, V, T) variations, we show below that a shielding-based useful skew is not worse, and is usually better, than a buffer-based skew.

A simulation-based study comparing the variabilities of useful skew generated by delay buffers and by shielding was conducted in [8]. The SPICE results of a leading 28-nm technology for interconnects of three widths are shown in Fig. 6. The relative skew ranges of the nominal delays and their variations (in percentages) were obtained for buffers (blue dots) and shields (red squares). The vertical bars are the delay variations obtained across all simulated corners. It is clear that the variations of the shielding-based skews correspond to about half of those of the delay buffer-based skews. We show below that a similar behavior is observed in silicon.

TABLE III

Delay Skew Sensitivities [%] to (P, V, T) Corner Variations

| Skew sensitivity [%]          | $\Delta t_1$ | $\Delta t_2$ | $\Delta t_3$ | $\Delta t_1 / \Delta t_3$ |
|-------------------------------|--------------|--------------|--------------|---------------------------|
| $S_{\Delta t_i}^{\Delta T}$   | 11.5         | 11.3         | 10.6         | 1.09                      |
| $S_{\Delta t_i}^{\Delta V}$   | 2.09         | 2.89         | 5.33         | 0.392                     |
| $S_{\Delta t_i}^{\Delta P}$   | 2.09         | 2.90         | 8.60         | 0.242                     |

To this end, useful skew design scenarios are illustrated in Fig. 7, where a time difference $\Delta t = t'' - t'$ between the arrival times of two signals is the relevant measure. Fig. 7(a) presents a design using delay buffers to achieve $\Delta t$, whereas in Fig. 7(b) $\Delta t$ is achieved by shielding. The robustness of these two scenarios under (P, V, T) variations can be deduced from the RO raw delay data as follows. Consider the following delay difference expression, representing a type of delay skew

$$\Delta t_i = \delta_i - \delta_5, \quad i = 1, 2, 3 \tag{11}$$

in which $\delta_5$ may be considered as marginally affected by shielding. While $\Delta t_1$ is more highly dominated by the shielding effect, and hence corresponds to the scenario of Fig. 7(b), $\Delta t_3$ is more highly dominated by the buffer properties, and hence is more appropriate for the scenario of Fig. 7(a). We derive the $\Delta t_i$ sensitivities to (P, V, T) changes for $i \in \{1, 2, 3\}$ as follows, where the reference corner is (t, 0.9 V, 85 °C). Examination of other reference corners yielded similar sensitivity trends

$$S_{\Delta t_{i}}^{\Delta V} \triangleq \frac{|\Delta t_{i}(t, 0.8 \text{ V}, 85 \text{ °C}) - \Delta t_{i}(t, 1.0 \text{ V}, 85 \text{ °C})|}{0.5[\Delta t_{i}(t, 0.8 \text{ V}, 85 \text{ °C}) + \Delta t_{i}(t, 1.0 \text{ V}, 85 \text{ °C})]} \tag{12}$$

$$S_{\Delta t_{i}}^{\Delta T} \triangleq \frac{|\Delta t_{i}(t, 0.9 \text{ V}, 105 \text{ °C}) - \Delta t_{i}(t, 0.9 \text{ V}, 25 \text{ °C})|}{0.5[\Delta t_{i}(t, 0.9 \text{ V}, 105 \text{ °C}) + \Delta t_{i}(t, 0.9 \text{ V}, 25 \text{ °C})]} \tag{13}$$

and

$$S_{\Delta t_i}^{\Delta P} \triangleq \frac{|\Delta t_i(s, 0.9 \text{ V}, 85 \text{ °C}) - \Delta t_i(f, 0.9 \text{ V}, 85 \text{ °C})|}{0.5[\Delta t_i(s, 0.9 \text{ V}, 85 \text{ °C}) + \Delta t_i(f, 0.9 \text{ V}, 85 \text{ °C})]}. \tag{14}$$

The delay skew sensitivities are derived from the data in Fig. 5, which was averaged over all the on-die ROs shown in Fig. 4 and all their internal wire segments shown in Fig. 1. Table III presents the sensitivities derived in (12)–(14).
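The skew sensitivities of (12)-(14) and the ratio column of Table III can be checked with a few lines. The `TABLE_III` entries are copied from Table III; the delays used in the worked skew example are hypothetical, and all names are assumptions. Because the table entries are themselves rounded, the recomputed ratios are expected to agree with the listed ones only to within rounding.

```python
def skew_sensitivity_pct(dt_a: float, dt_b: float) -> float:
    """|dt_a - dt_b| over the mean of the two skews, as in (12)-(14), in percent."""
    return 100.0 * abs(dt_a - dt_b) / (0.5 * (dt_a + dt_b))

# Hypothetical delays [ps]: skew dt_1 = delta_1 - delta_5 at two supply voltages.
dt1_low_v = 130.0 - 95.0    # at 0.8 V
dt1_high_v = 118.0 - 84.0   # at 1.0 V
s_v_example = skew_sensitivity_pct(dt1_low_v, dt1_high_v)

# Table III rows: (S for dt_1, S for dt_2, S for dt_3, listed ratio).
TABLE_III = {
    "T": (11.5, 11.3, 10.6, 1.09),
    "V": (2.09, 2.89, 5.33, 0.392),
    "P": (2.09, 2.90, 8.60, 0.242),
}
# Recompute the rightmost column from the rounded sensitivities.
ratios = {name: row[0] / row[2] for name, row in TABLE_III.items()}
max_ratio_gap = max(abs(ratios[name] - row[3]) for name, row in TABLE_III.items())
```

A ratio below one in this column indicates that the shielding-dominated skew $\Delta t_1$ is less sensitive than the buffer-dominated skew $\Delta t_3$ for that type of variation.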

It is interesting to compare $S_{\Delta t_1}^*$, where $\Delta t_1 = \delta_1 - \delta_5$ suits the delay skew as shown in Fig. 7(b), with $S_{\Delta t_3}^*$, where $\Delta t_3 = \delta_3 - \delta_5$, which suits the delay skew as shown in Fig. 7(a). The rightmost column of Table III shows that the shielding-based useful skew is more robust than the buffer-based one for voltage and process variations. Where temperature variations are concerned, the delay skew sensitivities are nearly the same. In conclusion, wire shielding not only provides sufficient tunable dynamic delay ranges (Table I) which are robust under corner variations (Table II) but can also be safely used to obtain comparable or more stable useful skews than by using delay buffers (Table III).

### B. Intradie (WID) and Interdie (D2D) Delay Variations

Section IV-A discussed both delay range and delay skew sensitivities under (P, V, T) changes. While ROs are a typical way to study variability, their lumped structure does not capture the trend of WID variations increasing as process feature sizes shrink. From a design perspective, the WID delay skew variations, which are addressed in this section, are the main concern.

Fig. 8. Delay skew variation methodology. (a) WID. (b) D2D.

Given a die of process type $P^*$ working at a specific voltage-temperature operating point $(V^*, T^*)$, let $k \in \{1, 2, 3, 4\}$ index the four on-die ROs shown in Fig. 4. Given the delay skew $\Delta t_i$ as defined in (11), the WID sensitivity $S_{\Delta t_i}^{\text{WID}}$ of $\Delta t_i$ is given by considering $\Delta t_i$ in all four ROs as follows:

$$S_{\Delta t_i}^{\text{WID}} \triangleq \frac{\max_k \Delta t_i(P^*, V^*, T^*) - \min_k \Delta t_i(P^*, V^*, T^*)}{0.5[\max_k \Delta t_i(P^*, V^*, T^*) + \min_k \Delta t_i(P^*, V^*, T^*)]}.$$
(15)

Fig. 8(a) illustrates the WID variation measurement methodology and depicts a specific segment of the ring over all its four replicas, for which the maximum and the minimum in (15) are used. $S_{\Delta t_i}^{\text{WID}}$ is computed 48 times, stemming from twelve $(V^*, T^*)$ operating points and four cyclic segments in each oscillator. Fig. 9(a) shows the average of $S_{\Delta t_i}^{\text{WID}}$, its standard deviation, and the extreme measured values; each is shown for typical, fast, and slow process corners. It can be seen that the difference between the maximum and the average WID variations falls within $3\sigma$ of the average. It is clear that the delay skew $S_{\Delta t_1}^{\text{WID}}$ (blue bar), which is appropriate for the useful skew obtained by shielding, is far more stable than the delay skew $S_{\Delta t_3}^{\text{WID}}$ (yellow bar), which suits the useful skew obtained by delay buffers.
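The WID evaluation described above can be sketched in Python as follows: for each operating point and each cyclic segment, (15) is applied to the delay skews of the four ROs, giving 12 × 4 = 48 samples that are then summarized. The nested-list layout, the function names, and the synthetic values are assumptions made for illustration and do not reproduce the measured data.

```python
from statistics import mean, pstdev


def wid_sensitivity(skew_per_ro):
    """Spread in the form of (15): (max - min) over the mean of max and min."""
    hi, lo = max(skew_per_ro), min(skew_per_ro)
    return (hi - lo) / (0.5 * (hi + lo))


def wid_samples(skew):
    """skew[op][seg] holds the delay skews of one segment, one value per on-die RO."""
    return [wid_sensitivity(ros) for per_op in skew for ros in per_op]


N_OPS, N_SEGS = 12, 4  # twelve (V*, T*) operating points, four cyclic segments

# Synthetic placeholder data: the same four RO values at every point and segment.
synthetic = [[[10.0, 10.2, 9.9, 10.1] for _ in range(N_SEGS)] for _ in range(N_OPS)]

samples = wid_samples(synthetic)
summary = {
    "count": len(samples),
    "mean": mean(samples),
    "std": pstdev(samples),  # population standard deviation; a choice of this sketch
    "max": max(samples),
    "min": min(samples),
}
```

The summary fields mirror the quantities plotted in Fig. 9(a): the average, the standard deviation, and the extreme values.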

Fig. 9. Delay skew variation measurements. (a) WID. (b) D2D.

As illustrated in Fig. 8(b), $S_{\Delta t_i}^{\text{D2D}}$ can be evaluated by using a methodology similar to that in (15), adapted to repetitions of the same segment in the same oscillator across different dies of the same process, rather than different oscillators on the same die as in Fig. 8(a). $S_{\Delta t_i}^{\text{D2D}}$ is computed 192 times, stemming from twelve $(V^*, T^*)$ operating points, four cyclic segments in each oscillator as shown in Fig. 1, and the four ROs shown in Fig. 4. Whereas a specific $S_{\Delta t_i}^{\text{D2D}}$ is associated with dies of the same process corner, Fig. 9(b) shows the average of $S_{\Delta t_i}^{\text{D2D}}$, its standard deviation, and the extreme values, comprising the averages of the D2D delay skew variations across typical, slow, and fast process corners. As in the WID variations, the shielding-based $S_{\Delta t_1}^{\text{D2D}}$ delay skew (blue bar) is far more stable than the buffer-based $S_{\Delta t_3}^{\text{D2D}}$ (yellow bar) under D2D variations.
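The D2D grouping can be sketched in the same spirit: measurements of the same segment in the same oscillator at the same operating point are collected across dies, and the spread in the form of (15) is taken over each group, giving 12 × 4 × 4 = 192 values. The record layout, the number of dies (three here), and the synthetic skews are hypothetical and serve only to show the regrouping.

```python
from collections import defaultdict


def spread(values):
    """(max - min) over the mean of max and min, as in (15)."""
    hi, lo = max(values), min(values)
    return (hi - lo) / (0.5 * (hi + lo))


def d2d_sensitivities(records):
    """records: iterable of (die, ro, op, seg, skew) tuples.

    Groups the same (ro, op, seg) across dies and returns one spread per group.
    """
    groups = defaultdict(list)
    for die, ro, op, seg, skew in records:
        groups[(ro, op, seg)].append(skew)
    return {key: spread(vals) for key, vals in groups.items()}


N_DIES, N_ROS, N_OPS, N_SEGS = 3, 4, 12, 4  # N_DIES is an assumed value

# Synthetic placeholder data: the skew depends only on the die index.
records = [
    (die, ro, op, seg, 10.0 + 0.5 * die)
    for die in range(N_DIES)
    for ro in range(N_ROS)
    for op in range(N_OPS)
    for seg in range(N_SEGS)
]

d2d = d2d_sensitivities(records)
```

Keying the groups by (RO, operating point, segment) keeps the comparison between identical structures on different dies, which is the distinction from the WID case.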

#### V. CONCLUSION

This article explored the delays obtained by interconnect shielding in post-silicon, and their intradie (WID) and interdie (D2D) variations under a wide variety of (P, V, T) corners. It is shown that such delays are applicable and effective in achieving clock design timing goals and can replace delay buffers. Wire shielding is shown to provide a sufficient dynamic delay range, which is stable under corner variations and can also be safely implemented to obtain more stable useful skews than with delay buffers.

#### ACKNOWLEDGMENT

The authors would like to thank the Marvell Corporation for supporting this article. They would also like to thank the anonymous reviewers whose comments helped to improve this article.

#### REFERENCES

- [1] J. P. Fishburn, "Clock skew optimization," *IEEE Trans. Comput.*, vol. 39, no. 7, pp. 945–951, Jul. 1990.
- [2] R. B. Deokar and S. S. Sapatnekar, "A graph-theoretic approach to clock skew optimization," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May/Jun. 1994, pp. 407–410.
- [3] A. Agarwal, D. Blaauw, and V. Zolotov, "Statistical timing analysis for intra-die process variations with spatial correlations," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design*, Nov. 2003, p. 900.
- [4] C. Constantinescu, "Trends and challenges in VLSI circuit reliability," *IEEE Micro*, vol. 23, no. 4, pp. 14–19, Jul. 2003.
- [5] J. Kim, D. Joo, and T. Kim, "An optimal algorithm of adjustable delay buffer insertion for solving clock skew variation problem," in *Proc. 50th Annu. Design Automat. Conf.*, May/Jun. 2013, pp. 1–6.
- [6] M. Alioto, G. Palumbo, and M. Pennisi, "Understanding the effect of process variations on the delay of static and domino logic," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 18, no. 5, pp. 697–710, May 2010.
- [7] B. Frankel and S. Wimer, "Optimal VLSI delay tuning by wire shielding," *J. Optim. Theory Appl.*, vol. 170, no. 3, pp. 1060–1067, 2016.
- [8] E. Sarfati, B. Frankel, Y. Birk, and S. Wimer, "Optimal VLSI delay tuning by space tapering with clock-tree application," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 64, no. 8, pp. 2160–2170, Aug. 2017.
- [9] E. Sarfati, B. Frankel, Y. Birk, and S. Wimer, "Accurate shielded interconnect delay estimation by reconfigurable ring oscillator," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 10, pp. 3435–3444, Oct. 2018.
- [10] A. Ghosh, R. M. Rao, J.-J. Kim, C.-T. Chuang, and R. B. Brown, "On-chip process variation detection using slew-rate monitoring circuit," in *Proc. 21st Int. Conf. VLSI Design (VLSID)*, Jan. 2008, pp. 143–149.

- [11] T. Iizuka, J. Jeong, T. Nakura, M. Ikeda, and K. Asada, "All-digital on-chip monitor for PMOS and NMOS process variability measurement utilizing buffer ring with pulse counter," in *Proc. ESSCIRC*, Sep. 2010, pp. 182–185.
- [12] H. Notani, M. Fujii, H. Suzuki, H. Makino, and H. Shinohara, "On-chip digital I<sub>dn</sub> and I<sub>dp</sub> measurement by 65 nm CMOS speed monitor circuit," in *Proc. IEEE Asian Solid-State Circuits Conf.*, Nov. 2008, pp. 405–408.
- [13] T. Takahashi, T. Uezono, M. Shintani, K. Masu, and T. Sato, "On-die parameter extraction from path-delay measurements," in *Proc. IEEE Asian Solid-State Circuits Conf.*, Nov. 2009, pp. 101–104.
- [14] I. A. K. M. Mahfuzul, A. Tsuchiya, K. Kobayashi, and H. Onodera, "Variation-sensitive monitor circuits for estimation of global process parameter variation," *IEEE Trans. Semicond. Manuf.*, vol. 25, no. 4, pp. 571–580, Nov. 2012.
- [15] T.-B. Chan, P. Gupta, A. B. Kahng, and L. Lai, "Synthesis and analysis of design-dependent ring oscillator (DDRO) performance monitors," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 10, pp. 2117–2130, Oct. 2014.
- [16] D. Prasad, C. Pan, and A. Naeemi, "Modeling interconnect variability at advanced technology nodes and potential solutions," *IEEE Trans. Electron Devices*, vol. 64, no. 3, pp. 1246–1253, Mar. 2017.
- [17] Y. A. Eken and J. P. Uyemura, "A 5.9-GHz voltage-controlled ring oscillator in 0.18-µm CMOS," *IEEE J. Solid-State Circuits*, vol. 39, no. 1, pp. 230–233, Jan. 2004.
- [18] Synopsys. (2017). StarRC Parasitic Extraction. [Online]. Available: https://www.synopsys.com/content/dam/synopsys/implementation&signoff/datasheets/starrc-ds.pdf
- [19] F. van der Heijden, R. P. Duin, D. de Ridder, and D. M. J. Tax, "Parameter estimation," in *Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB.* Hoboken, NJ, USA: Wiley, 2017, pp. 77–113.
- [20] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in *Proc. Int. Joint Conf. Artif. Intell.* (*IJCAI*), 1995, pp. 1137–1145.