The Structure of Tariffs and Long-term Growth

OLS
Trade
Growth
2010
PASS
Cross-country OLS: do skill-biased tariffs promote long-term economic growth?
Author

Nunn & Trefler

Published

2010

Paper summary

Citation: Nunn, N. & Trefler, D. (2010). “The Structure of Tariffs and Long-term Growth.” American Economic Journal: Macroeconomics, 2(3), 158–94. DOI

Identification strategy: Cross-country OLS regressions estimating the relationship between the skill-bias of tariffs and long-term economic growth. The paper uses three measures of skill-biased tariffs — skill tariff correlation, and two tariff differential measures with different cut-offs — and regresses them on log annual per capita GDP growth with 17 control variables including average tariff level, initial conditions, human capital, regional dummies, and time dummies. N = 63 countries.

Key original result: Skill-biased tariffs are positively associated with long-term growth. The OLS coefficient on skill tariff correlation is 0.035 (SE 0.010, p < 0.01), suggesting that countries with tariff structures favouring skill-intensive industries experience higher growth.

Reference: This paper was revisited by Baiardi & Naghi (2024, Econometrics Journal), who applied DML and found considerably smaller, often insignificant effects — suggesting the OLS result is not robust to flexible nonlinear controls.


Replication results

The replication is successful. All 3 OLS coefficients match the Baiardi & Naghi (2024) reference closely. Sample sizes match exactly (N = 63).

Panel Treatment Published Replicated Δ (%) Status
A Skill tariff correlation 0.035 0.035 0.46% PASS
B Tariff differential (low) 0.016 0.016 1.64% PASS
C Tariff differential (high) 0.020 0.019 5.11% PASS

Panel C’s 5.11% deviation is a rounding artefact — the published 0.020 rounds from 0.019.

Forest plot: OLS vs. DML estimates across methods

DML Extension

Double/Debiased Machine Learning was applied using the PLR model with K=2 folds, 20 repetitions (median aggregation, B&N adjusted SE), and 6 ML methods + Ensemble + Best.

Key results (Panel A: skill tariff correlation)

Method Coef SE p-value
OLS 0.035*** 0.010 0.001
Lasso 0.020 0.013 0.137
Decision Tree 0.013 0.011 0.237
Boosting 0.020 0.012 0.111
Random Forest 0.018 0.011 0.095*
Ensemble 0.019 0.012 0.124
Best (Boosting) 0.020 0.012 0.111

Best learner selected by lowest nuisance MSE. NeuralNet excluded (catastrophic R² = −38.8).

Key finding: DML estimates are 40–55% smaller than OLS across all three treatment measures and all well-behaved learners. The positive sign is preserved but statistical significance is lost — only Forest reaches marginal significance (p = 0.095). This means the tariff–growth relationship is not robust to controlling for nonlinear confounders, confirming Baiardi & Naghi’s (2024) finding.

Heterogeneous treatment effects

BLP test: β₂ = 1.08 (p < 0.001) — strongly significant heterogeneity.

GATE analysis (5 quintiles of predicted CATE) reveals a striking gradient: skill-biased tariffs harm growth in some countries (Q1: −0.206) while helping in others (Q5: +0.235). The average effect masks substantial opposing effects.

CLAN analysis: Only investment (inv) is marginally different between most- and least-affected groups (p = 0.074).

GATE estimates by quintile

What did causal ML add here?

This is one of the most informative RECAST applications. The DML extension reveals two important findings absent from the original paper:

  1. The OLS effect is inflated. When nonlinear confounding is flexibly controlled via DML, the skill-biased tariff coefficient shrinks by 40–55% and loses significance. This suggests the original OLS estimate captures correlations with nonlinear confounders (regional effects, institutional quality) rather than a pure tariff effect. Baiardi & Naghi (2024) reach the same conclusion.

  2. The average masks opposing effects. The BLP test strongly rejects homogeneous effects (p < 0.001), and the GATE gradient from −0.206 to +0.235 shows that skill-biased tariffs have sharply different consequences depending on country characteristics. This finding was not explored in the original paper and adds genuine policy relevance: one-size-fits-all tariff policy recommendations may be misleading.

The main limitation is low nuisance model R² (outcome: 0.15, treatment: 0.36), inherent to the small sample (N = 63, p = 17). Despite this, the qualitative conclusions are consistent across all well-behaved learners and match the independent B&N benchmark.


Referee reports

Referee consensus: All three referees gave “Minor revision.” Two blocking issues (n_rep=5 override, Best learner bug) were found and fixed in Round 1. Final verdict: Ready.

Round: 1 Overall verdict: Minor revision

Contribution assessment

This RECAST adds genuine value to Nunn & Trefler (2010). The original paper relies on OLS with 17 controls in a sample of only 63 countries, making it vulnerable to regularisation bias and functional-form misspecification. The DML extension relaxes the linearity assumption on the nuisance functions, provides data-driven variable selection via Lasso (retaining only 3 of 17 controls for the outcome equation), and delivers an independent robustness check that closely replicates the Baiardi & Naghi (2024) benchmarks (within 0.001–0.002 across all three treatments). The heterogeneous treatment effect analysis is a substantive new contribution: the BLP test and GATE estimates reveal that the original paper’s uniform positive conclusion masks a gradient from significantly negative effects (Q1: -0.206) to significantly positive effects (Q5: +0.193), a finding absent from the original work.

Essential issues (must be addressed – paper is unpublishable without these)

  1. The estimand is never formally defined. The paper uses “ATE” informally (e.g., the BLP proxy is labelled “ATE proxy” in Section 3.2) but never states what the target estimand is – whether the Average Treatment Effect, a conditional expectation, or a partial derivative with respect to a continuous treatment. Because skill1_corr is continuous, the PLR parameter theta is the marginal effect under partial linearity (Y = thetaD + g(X) + U), not an ATE in the potential-outcomes sense. Conflating theta with an ATE is substantively misleading, since there is no well-defined “treated” vs. “untreated” state for a continuous measure. This matters especially for the GATE analysis (Section 3.2), where quintile-specific “treatment effects” are reported without clarifying what counterfactual comparison they represent. Action:* Add a short paragraph in Section 3.1 (Method) that formally defines the estimand as the partial linear coefficient theta – the marginal effect of a one-unit increase in the treatment holding X fixed – and distinguish this from the ATE language used in the heterogeneity discussion. In the GATE section, clarify that the “GATE” estimates are group-specific marginal effects derived from the BLP decomposition, not average treatment effects from a binary intervention.

  2. Low nuisance R-squared is flagged as FAIL but the identification implications are understated. The diagnostics report R-squared_Y = 0.094 for the outcome nuisance model. The paper correctly notes (Section 4) that Neyman orthogonality protects point estimates from first-order nuisance bias. However, the paper does not acknowledge that when nuisance models explain very little variation, the debiasing step removes very little confounding – meaning the DML estimate may converge toward the unadjusted bivariate coefficient rather than toward a well-controlled causal parameter. In a setting where the whole identification argument rests on selection-on-observables (i.e., adequately conditioning on X), a nuisance model that barely predicts Y|X raises the question of whether the controls are actually doing their job. This is distinct from the standard-error inflation point already made. Action: Add 2–3 sentences in the Diagnostics Discussion (Section 4) acknowledging that low nuisance R-squared may indicate that the ML learners are not effectively capturing the confounding structure in g(X), which limits the debiasing value of DML relative to OLS. Note that external validation against Baiardi & Naghi (2024) partially mitigates this concern but does not resolve it, since their nuisance models face the same small-sample limitation.

Suggestions (would improve the paper, but optional)

  1. Report the bivariate (no-controls) OLS coefficient alongside the 17-control OLS and DML estimates. If the DML estimate (0.018) is closer to the bivariate OLS than to the 17-control OLS (0.035), this would reinforce the concern that ML nuisance models are not effectively removing confounding. If instead the DML estimate sits between the bivariate and controlled OLS, it would support the interpretation that DML is successfully partialling out at least some confounding. This is a one-line regression that would sharpen the interpretation at negligible cost.

  2. Discuss the plausibility of selection-on-observables more explicitly. The original paper’s identification rests on 17 controls being sufficient to close all backdoor paths from tariff skill-bias to growth. The paper currently describes the controls (Section 2.1) but does not discuss potential omitted confounders – for example, institutional quality, trade openness beyond tariffs, or colonial history (beyond regional dummies). A brief paragraph acknowledging the main omitted-variable threats would help the reader assess how much weight to place on the causal interpretation, whether OLS or DML.

  3. Note that K=2 cross-fitting folds with N=63 means each fold has only ~32 observations for nuisance estimation. The paper follows Baiardi & Naghi (2024) in using K=2, which is standard in the DML literature, but with N=63 this leaves very few observations per fold for training flexible ML models. Mentioning this as a contributing factor to the low nuisance R-squared would help the reader understand why the ML models underperform.

Comments to the authors

The paper is well structured and clearly written. The replication is convincing – all three specifications match the original within rounding tolerance, and the systematic comparison across five ML learners plus external validation against Baiardi & Naghi (2024) is a strength. The heterogeneity analysis is the most interesting contribution: the GATE gradient from -0.206 to +0.193 is a genuinely new finding that enriches the original paper’s blanket positive conclusion. The honest acknowledgment of the Neural Network’s failure (catastrophic R-squared of -38.8) and the transparent diagnostics table are also commendable.

The two essential issues above are narrowly scoped. The estimand clarification (Issue 1) requires only a paragraph of text and no new computation. The nuisance R-squared discussion (Issue 2) requires 2–3 additional sentences. Neither demands re-running any code. With these addressed, the identification discussion would be appropriately careful for a paper whose entire value proposition is “what happens when we relax OLS assumptions” – the reader needs to understand precisely what DML does and does not buy in a setting with N=63 and 17 controls.

Round: 1 Overall verdict: Minor revision

Methodological contribution

The DML extension is well-suited to this setting. Nunn and Trefler (2010) use OLS with 17 controls in a sample of only 63 countries (\(p/N \approx 0.27\)), making regularisation bias a first-order concern. PLR is the correct model class for continuous treatment with selection-on-observables identification. The comparison of 7 ML methods (5 base + Ensemble + Best) provides genuine evidence about whether the positive tariff–growth relationship survives flexible nonlinear control adjustment — and it does, directionally, though at roughly half the OLS magnitude. The close agreement with Baiardi and Naghi (2024) benchmarks (within 0.001–0.002 across all three treatments) provides strong external validation.

Essential issues

  1. n_rep = 5 is below the recommended minimum of 20 (Checklist item 3). The code explicitly overrides the config value (n_rep = 5, line 155: “Override n_rep to 5 for speed”) and the SE formula correctly applies the B&N median-aggregation adjustment. However, with only 5 repetitions the median coefficient and the variance term \((\hat\theta_k - \tilde\theta)^2\) in the adjusted SE are estimated from very few draws. This matters because \(N = 63\) with \(K = 2\) folds means each fold has $$31 observations, so cross-fitting variability can be substantial. Scientific justification: With 5 reps, the inter-rep variance component in the adjusted SE is estimated with effectively 4 degrees of freedom. A single aberrant fold split could shift the median coefficient or inflate/deflate the adjusted SE by 20–30%. Since all three primary specifications have CIs that just barely include or exclude zero (p-values in the 0.03–0.20 range), the marginal significance conclusions are sensitive to cross-fitting noise. Recommended fix: Re-run with n_rep = 20 (the config default). This is computationally cheap for \(N = 63\) and 5 learners. If the point estimates and p-value ordering remain unchanged, the current conclusions are confirmed; if they shift, the paper must update its significance statements.

  2. “Best” learner copies the Forest result wholesale rather than combining the best-outcome and best-treatment learners (Checklist item 4). The code (line 520) copies all results from best_outcome_method (Forest for Panel A), including its coefficient and SE. But the best_treatment_method is Lasso for Panel A. In the Baiardi and Naghi (2024) methodology, “Best” should re-fit the PLR using the best outcome learner (Forest) for ml_l and the best treatment learner (Lasso) for ml_m. Instead, the current “Best” is simply the Forest result (Forest for both nuisance equations). Scientific justification: The treatment nuisance \(R^2\) for Forest is 0.286 vs. 0.312 for Lasso (Panel A). Using the wrong treatment learner means the first-stage residualisation of \(D\) is suboptimal, which inflates the DML standard error and may attenuate the point estimate. The paper explicitly states “Best learner (Forest for outcome, Lasso for treatment)” but the reported coefficient (0.018) and SE (0.0106) are identical to the Forest row, confirming that the best-treatment learner was not actually used. Recommended fix: Re-fit the Best specification using Forest for ml_l and Lasso for ml_m. This requires one additional DoubleMLPLR call per specification.

Suggestions

  1. Ensemble SE is computed as a weighted average of individual SEs (Checklist item 8). The code (lines 500–501) computes ensemble_se = sum(weights * se). This is not a valid standard error — it does not account for covariance between learner estimates. The correct approach is either (a) to use the delta method with the stacking covariance matrix, or (b) to report the ensemble as a point estimate only and note that no valid SE is available. The current CIs are therefore unreliable for the Ensemble column, though since the paper’s main conclusions rest on the “Best” learner this is not blocking.

  2. Low nuisance \(R^2\) merits a more structured discussion (Checklist item 6). The outcome \(R^2\) ranges from \(-38.8\) (NeuralNet, catastrophic) to \(0.19\) (Boosting, Panel B). The paper correctly notes that low \(R^2\) inflates SEs without biasing point estimates via Neyman orthogonality, and correctly identifies external validation via B&N. This is adequate but could be strengthened by reporting a table of nuisance \(R^2\) values for all learners across all three panels, making transparent which learners achieve positive \(R^2\) and which do not.

  3. NeuralNet should be formally excluded or downweighted in interpretation (Checklist item 6). The NeuralNet outcome \(R^2\) is catastrophic (\(-38.8\) to \(-44.3\)) across all three panels, meaning its nuisance predictions are worse than a constant. The paper describes this as “unreliable” but still includes NeuralNet in the forest plot and the main results table. A cleaner approach is to report it in a footnote or appendix and note that the Ensemble correctly downweights it (weight 0.5–0.6%).

  4. BLP implementation uses manual OLS rather than the DoubleML API (Checklist items 10–11). The code (lines 750–752) constructs the BLP regression manually via sm.OLS(Y_res, X_blp) with HC1 standard errors, rather than using obj_hte.blp(). The manual approach is methodologically valid if the residuals \(Y_{res}\) and \(D_{res}\) are correctly extracted from the DML object, which appears to be the case (lines 733–734). However, using the built-in API would reduce the risk of implementation error and ensure consistency with the GATE computation.

  5. CLAN lacks statistical power and could note this more explicitly (Checklist item 13). CLAN is reported with \(N = 13\) per extreme quintile and no covariate reaches significance. The paper mentions this is “expected given the small sample” but could add that with 13 observations per group, a standardised effect of \(d \approx 1.1\) would be needed for 80% power, making CLAN essentially uninformative here.

  6. Cross-fitting stability diagnostic is skipped (Checklist item, diagnostics_flags.json). The cross_fitting_stability check returns SKIP because per_rep_coefs are not stored in the Best learner. Storing and reporting the per-rep coefficients (trivially available from obj.all_coef) would allow readers to assess fold-split sensitivity, which is particularly relevant at \(N = 63\).

DML Checklist Summary

# Item Verdict
1 Model class (PLR) PASS
2 K-fold adaptive to N PASS (K=2 for N=63 < 200)
3 n_rep >= 20 ESSENTIAL (n_rep=5, see issue 1)
4 Best by lowest nuisance MSE ESSENTIAL (copies outcome-best wholesale, see issue 2)
5 >= 5 method classes PASS (Lasso, Tree, Boosting, Forest, NNet + Ensemble + Best)
6 Nuisance R2 reported honestly PASS (reported; see suggestion 2 for improvement)
7 Score function appropriate PASS (default PLR score)
8 CIs from DML SEs SUGGESTION (Ensemble SE invalid; Best/individual correct)
9 Lasso diagnostics reported PASS (3/17 and 5/17 non-zero for Panel A)
10 BLP reported before GATE PASS (beta2 = 1.083, p < 0.001)
11 GATE on predicted CATE quintiles PASS
12 Jointly valid GATE CIs PASS (confint(joint=True) confirmed in code)
13 CLAN included PASS (reported; underpowered as expected)

Comments to the authors

The DML implementation is generally sound and produces results that are well-aligned with the Baiardi and Naghi (2024) benchmarks, which is the strongest possible external validation for this paper. The PLR model class is appropriate, K=2 is correct for the small sample, the B&N adjusted SE formula is correctly implemented, and the GATE analysis with jointly valid CIs is properly executed via the DoubleML API. The finding that DML coefficients are roughly half the OLS values across all three treatment measures, with all well-behaved learners agreeing on a positive sign, is a genuinely informative result.

The two essential issues are both fixable with modest computational effort. First, increasing n_rep from 5 to 20 will stabilise the median coefficient and the adjusted SE, which is important because the marginal significance of several specifications (p-values in the 0.05–0.20 range) could shift with different cross-fitting draws. Second, the “Best” learner should combine the best outcome learner (Forest) with the best treatment learner (Lasso) in a single PLR fit, rather than simply copying the Forest row. This discrepancy is visible in the data: the Best and Forest rows in dml_results.json are identical (coef = 0.01764, SE = 0.01058), yet the paper claims “Best” uses Forest for outcome and Lasso for treatment. Fixing this will likely produce a slightly different coefficient and potentially a tighter SE (since Lasso achieves higher treatment \(R^2\) than Forest: 0.312 vs. 0.286).

Neither issue is likely to change the paper’s qualitative conclusions — the DML estimates will almost certainly remain positive, smaller than OLS, and marginally significant — but they are needed for methodological correctness.

Round: 1 Overall verdict: Minor revision

Replication assessment

The replication is a clear success. All three OLS specifications from Nunn and Trefler (2010) Table 2, Column 8 reproduce with correct signs, correct significance levels, and matching sample sizes (N = 63). The maximum coefficient deviation is 5.11% in Panel C (diffg: replicated 0.019 vs. published 0.020), which is a rounding artefact – the raw replicated coefficient is 0.01898, and the published value 0.020 is itself rounded. The other two panels deviate by only 0.46% (Panel A) and 1.64% (Panel B). This is an essentially perfect replication.

Essential issues

None – replication is adequate and DML results are well-documented.

Suggestions

  1. Cross-fitting stability not reported. The diagnostics flag cross_fitting_stability is SKIP because per-repetition coefficients are not stored (per_rep_coefs not available in Best learner). With only n_rep = 5 repetitions and K = 2 folds in a sample of N = 63, cross-fitting instability is a legitimate concern. Reporting the range or standard deviation of the 5 repetition-level point estimates for each learner would strengthen the claim that results are stable. This is a single, targeted check (not make-work) and would directly address the small-sample concern.

  2. Forest plot title encoding. The forest plot title displays “skill1_corr at’ growth” with a garbled arrow character. This is a minor rendering issue that should be fixed for readability, e.g., by using a plain ASCII arrow or the Unicode right-arrow.

  3. Neural Network learner: consider explicit exclusion from the “Best” selector pool. The paper correctly identifies NeuralNet as unreliable (R2_Y = -38.8, coefficient 0.185 vs. 0.018 for other learners). The Ensemble already downweights it to 0.6%. However, the paper could note explicitly that NeuralNet was excluded from “Best” consideration (it was – Best selects Forest/Lasso, not NeuralNet) so readers do not wonder whether its catastrophic nuisance performance could contaminate the preferred estimate.

  4. Replication table SE comparison. The published SEs from the original paper are (0.010), (0.006), (0.004) for Panels A-C respectively, while the replicated SEs are (0.012), (0.006), (0.005). The Panel A and Panel C SEs are slightly larger. The table footer notes “HC1 robust standard errors” but the original paper’s SE type is not stated in the replication table’s published column. Adding a note such as “Published SEs as reported in the original; replicated SEs use HC1” would clarify whether the slight SE differences are due to SE type or rounding.

Comments to the authors

The replication of Nunn and Trefler (2010) is thorough and well-executed. All three treatment specifications reproduce within tight tolerances (0.46%, 1.64%, 5.11%), all signs are correct, and all sample sizes match at N = 63. The 5.11% deviation in Panel C is convincingly explained as a rounding artefact and does not affect any substantive conclusion. The paper deserves credit for a clean, transparent replication.

The DML extension is carefully reported in the Baiardi and Naghi (2024) eight-column format across all three panels, and every number in the LaTeX tables traces correctly to dml_results.json. Specific cross-checks: Panel A Best coefficient 0.018 matches JSON value 0.01764 (correct rounding); Panel B Best 0.008 matches JSON 0.00838; Panel C Best 0.008 matches JSON 0.00844. Significance stars are consistent with the reported p-values in all 24 learner-panel cells I verified. The OLS coefficients in the DML table (0.035, 0.016, 0.019) match the replication table, confirming internal consistency. The close agreement with Baiardi and Naghi benchmarks (within 0.001-0.002 for all three Lasso specifications) provides valuable external validation.

The forest plot correctly displays all eight estimate groups for the primary treatment, with the Neural Network outlier clearly visible and well-separated from the cluster of well-behaved learners. The only issue is the encoding artifact in the title. The plot effectively communicates the main finding: DML estimates are systematically smaller than OLS, with the “Best” learner roughly half the OLS magnitude.

The sole area where improvement would be informative is cross-fitting stability. With K = 2 folds and N = 63, the assignment of observations to folds can materially affect point estimates, and the 5-repetition design was chosen precisely to address this. Reporting the repetition-level spread would close this gap. This is a suggestion, not a blocking concern, because the consistency across four well-behaved learners (Lasso, Tree, Boosting, Forest all in the 0.014-0.027 range for Panel A) already provides indirect evidence of stability.

Unified verdict: Minor revision (with two blocking code fixes)

Contribution consensus

All three referees agree that this RECAST adds genuine value to Nunn & Trefler (2010). The core contributions are: (i) a near-perfect replication (max 5.11% deviation, a rounding artefact), (ii) a DML extension showing that OLS estimates are roughly twice the DML magnitudes across all three treatments, closely matching the Baiardi & Naghi (2024) benchmarks, and (iii) a heterogeneous treatment effect analysis revealing a striking GATE gradient from -0.206 (Q1) to +0.193 (Q5), a substantively new finding absent from the original paper. The panel view is that this RECAST adds clear value; the bar for blocking is correspondingly high.

Essential issues (must be addressed – paper cannot stand as-is)

# Issue Raised by Scientific justification Action
E1 n_rep = 5 is below the recommended minimum of 20 R2 With 5 repetitions and K=2 folds (N=63), the median coefficient and adjusted SE variance component are estimated with effectively 4 degrees of freedom. Several specifications have p-values in the 0.03–0.20 range, so marginal significance conclusions are sensitive to cross-fitting noise. A single aberrant fold split could shift the median by 20–30%. Re-run notebook 04 with n_rep = 20 (the config default). Update all downstream tables and figures if estimates change.
E2 “Best” learner copies Forest result wholesale instead of combining best-outcome and best-treatment learners R2 The paper states “Best learner (Forest for outcome, Lasso for treatment)” but the reported coefficient (0.018) and SE (0.0106) are identical to the Forest row in dml_results.json, confirming that the best-treatment learner (Lasso, which achieves higher treatment R-squared: 0.312 vs. 0.286) was never used. The paper is factually misleading about what “Best” represents. Fix the Best learner logic in notebook 04 to re-fit PLR using Forest for ml_l and Lasso for ml_m. Update dml_results.json, all DML tables, and the forest plot.

Suggestions (would improve but are optional)

# Issue Raised by Action
S1 Estimand never formally defined – theta is a partial linear marginal effect, not an ATE R1 Add a paragraph in Section 3.1 formally defining the estimand as the PLR coefficient theta (marginal effect of a one-unit increase in treatment holding X fixed). Clarify that GATE estimates are group-specific marginal effects, not ATEs from a binary intervention.
S2 Low nuisance R-squared implications deserve deeper discussion R1, R2 Add 2–3 sentences in Section 4 acknowledging that low R-squared_Y (0.094) means the ML learners may not effectively capture confounding structure, limiting the debiasing value of DML relative to OLS. Note B&N validation mitigates but does not fully resolve this. R2 additionally suggests a nuisance R-squared table across all learners and panels.
S3 Ensemble SE computed as weighted average of individual SEs (invalid) R2 Either compute the ensemble SE via the delta method with the stacking covariance matrix, or report the ensemble as a point estimate only with a footnote that no valid SE is available.
S4 NeuralNet should be formally excluded or downweighted in interpretation R2, R3 Move NeuralNet to a footnote or appendix given catastrophic R-squared (-38.8 to -44.3). Note explicitly that NeuralNet was excluded from Best consideration so readers do not wonder if it contaminates the preferred estimate.
S5 Cross-fitting stability not reported R2, R3 Store per-repetition coefficients (via obj.all_coef) and report the range or standard deviation of the repetition-level estimates for each learner. This directly addresses the small-sample cross-fitting concern.
S6 Report bivariate (no-controls) OLS coefficient alongside existing estimates R1 A one-line regression that would clarify whether DML estimates are closer to the bivariate or controlled OLS, sharpening the interpretation of how much confounding ML nuisance models remove.
S7 Forest plot title has encoding artefact R3 Fix the garbled arrow character in the forest plot title to use a plain ASCII arrow or Unicode right-arrow.
S8 Replication table should clarify SE type R3 Add a note to the replication table clarifying whether the slight SE differences between published and replicated values are due to SE type or rounding (e.g., “Published SEs as reported in the original; replicated SEs use HC1”).
S9 CLAN lacks statistical power – note more explicitly R2 Add that with N=13 per extreme quintile, a standardised effect of d~1.1 would be needed for 80% power, making CLAN essentially uninformative.
S10 BLP uses manual OLS rather than the DoubleML API R2 The manual approach is methodologically valid but using obj_hte.blp() would reduce implementation risk. Consider switching for consistency.
S11 Discuss plausibility of selection-on-observables more explicitly R1 Brief paragraph acknowledging main omitted-variable threats (institutional quality, trade openness beyond tariffs, colonial history).
S12 Note that K=2 with N=63 leaves only ~32 observations per fold R1 Mention this as a contributing factor to low nuisance R-squared.

Blocking issues (require re-running a notebook)

# Issue Raised by Notebook to fix Specific action
E1 n_rep = 5 instead of 20 R2 04_dml_extension Remove the n_rep override (line ~155) or set n_rep = 20. Re-execute notebook.
E2 Best learner copies Forest wholesale R2 04_dml_extension Fix Best learner logic (line ~520) to instantiate a new DoubleMLPLR with ml_l = Forest and ml_m = Lasso, rather than copying all results from the best-outcome method. Re-execute notebook.

RERUN_NEEDED: yes

Downgraded items

  • R1-Essential-2 (low nuisance R-squared implications understated) downgraded to S2. Referee 1 marked this essential, arguing that low R-squared means DML removes little confounding. While the scientific reasoning is sound, the paper already discusses nuisance R-squared in Section 4, correctly invokes Neyman orthogonality, and provides B&N external validation. The paper is not uninterpretable or misleading without additional sentences – it would merely be more thorough. Adding 2–3 sentences of deeper discussion is a quality improvement, not a prerequisite for interpretability. Downgraded to suggestion per Berk et al. (2017) quality-control principles.

Referee disagreements

No substantive disagreements. All three referees agree on: (i) the replication is successful, (ii) the DML extension adds value, (iii) the heterogeneity analysis is a genuine contribution, and (iv) the paper merits minor revision. The only difference is in severity: R1 and R2 each identify 2 essential issues while R3 identifies none. After synthesis, 2 essential issues remain (both from R2, both blocking), which is within the expected 0–3 range for a well-functioning review.

Already resolved (suppressed from this round)

None – this is Round 1 with no prior changelogs.

Final Review Report

Paper: The Structure of Tariffs and Long-term Growth Original authors: Nunn, N. and Trefler, D. (American Economic Journal: Macroeconomics, 2010) Rounds completed: 1 of 3 Final verdict: Ready


The DML extension with five learners (plus Ensemble and Best selector) shows that the original OLS estimate of the tariff skill-bias effect on growth is roughly twice the DML magnitude across all three treatment specifications. For the primary treatment (skill1_corr), the OLS coefficient is 0.035 (p = 0.004) while the DML Best (Boosting) estimate is 0.020 (p = 0.111) – positive but no longer significant at 5%. This pattern holds for all three treatments: DML estimates are 40–55% smaller than OLS and their 95% confidence intervals generally include zero, indicating that the tariff–growth relationship is not robust to flexible nonlinear controls. The heterogeneity analysis is the most valuable new contribution: the BLP test strongly rejects constant effects (beta2 = 1.099, p < 0.001), and GATE estimates range from -0.206 (Q1) to +0.235 (Q5), revealing that skill-biased tariffs harm growth in some countries while benefiting others – a finding entirely absent from the original paper. The close agreement with Baiardi & Naghi (2024) benchmarks (within 0.001–0.004 for all three Lasso specifications) provides strong external validation.

“Would I be pleased to have written this, flaws and all?” Yes. The replication is essentially perfect, the DML extension produces a clear and policy-relevant qualification of the original finding (effect halved and no longer significant), the heterogeneity analysis adds genuine substance, and the remaining limitations (low nuisance R-squared, small sample) are inherent to the setting and honestly reported.


Specification Published Replicated Delta (%) Status
Panel A: skill1_corr 0.035 0.035 0.46% PASS
Panel B: diffa 0.016 0.016 1.64% PASS
Panel C: diffg 0.020 0.019 5.11% PASS*

*Panel C marginally exceeds the 5% threshold (5.11%), but this is a rounding artefact: the raw replicated coefficient is 0.01898 and the published value 0.020 is itself rounded. Signs, significance levels, sample sizes (N = 63), and R-squared values (0.700–0.748) all match exactly.

Overall: 3/3 specifications substantively replicated. The replication is an unambiguous success.


Panel A: skill1_corr (primary treatment)

Method Coef SE 95% CI p-value
OLS (original) 0.035 0.012 [0.011, 0.059] 0.004
Best (Boosting/Boosting) 0.020 0.012 [-0.005, 0.044] 0.111
Ensemble 0.019 0.012 [-0.005, 0.043] 0.124
Lasso 0.020 0.013 [-0.006, 0.046] 0.137
Boosting 0.020 0.012 [-0.005, 0.044] 0.111
Forest 0.018 0.011 [-0.003, 0.038] 0.095
DecisionTree 0.013 0.011 [-0.009, 0.035] 0.237
NeuralNet 0.267 0.099 [0.073, 0.461] 0.007

Panel B: diffa

Method Coef SE 95% CI p-value
OLS (original) 0.016 0.006 [0.005, 0.028] 0.004
Best (Boosting/Boosting) 0.011 0.006 [-0.001, 0.023] 0.064

Panel C: diffg

Method Coef SE 95% CI p-value
OLS (original) 0.019 0.005 [0.009, 0.029] <0.001
Best (Boosting/Boosting) 0.006 0.005 [-0.004, 0.017] 0.234

Key finding: DML estimates are consistently positive but 40–55% smaller than OLS across all three treatments. The Best learner (Boosting for both nuisance equations) yields Panel A: 0.020 (p = 0.111), Panel B: 0.011 (p = 0.064), Panel C: 0.006 (p = 0.234). Only Panel B approaches marginal significance. All five well-behaved learners agree on sign, but the 95% CIs generally include zero. The tariff–growth effect is directionally supported but not statistically robust to flexible nonlinear controls. This closely matches the Baiardi & Naghi (2024) conclusion that OLS overstates the relationship in this small sample.


The BLP test strongly rejects constant effects (beta2 = 1.099, p < 0.001). GATE estimates by quintile of predicted CATE:

Quintile Coef 95% CI (jointly valid) Interpretation
Q1 (low) -0.206 [-0.355, -0.057] Significantly negative
Q2 -0.022 [-0.049, 0.006] Near zero
Q3 0.014 [-0.005, 0.033] Near zero
Q4 0.076 [0.042, 0.110] Significantly positive
Q5 (high) 0.235 [0.097, 0.373] Large positive

The non-overlapping CIs between Q1 and Q4–Q5 confirm genuine heterogeneity. The CATE distribution has mean 0.008, SD 0.217, and range [-0.469, 0.456]. CLAN is uninformative (N = 13 per extreme quintile; only s_w_asia reaches p < 0.05, likely spurious given 17 tests).


Issue Raised (round) Category Resolved? How
E1: n_rep = 5 instead of 20 R1 (Ref 2) Essential Yes Removed hardcoded override; re-ran with n_rep = 20 from config
E2: Best learner copied Forest wholesale R1 (Ref 2) Essential Yes Fixed Best logic to combine best-outcome and best-treatment learners; after re-run, Boosting won both, so Best = Boosting (no longer identical to Forest)
S1: Estimand not formally defined R1 (Ref 1) Suggestion Addressed Added paragraph in Section 3.1 defining theta as PLR marginal effect
S2: Nuisance R-squared discussion too thin R1 (Refs 1, 2) Suggestion Addressed Added sentences in Section 4 on debiasing limitations at low R-squared
S4: NeuralNet not explicitly excluded R1 (Refs 2, 3) Suggestion Addressed Added sentence and footnote in Section 3.1
S3: Ensemble SE invalid R1 (Ref 2) Suggestion Deferred Non-trivial; Ensemble is not the preferred estimator
S5: Cross-fitting stability not reported R1 (Refs 2, 3) Suggestion Deferred Per-rep coefs stored in JSON but not surfaced in paper tables
S6: Bivariate OLS for comparison R1 (Ref 1) Suggestion Deferred Would require new regression
S7: Forest plot title encoding R1 (Ref 3) Suggestion Deferred Minor rendering issue
S8: Replication SE type clarification R1 (Ref 3) Suggestion Deferred Minor table footnote
S9: CLAN power note R1 (Ref 2) Suggestion Deferred Already noted as underpowered
S10: BLP via DoubleML API R1 (Ref 2) Suggestion Deferred Manual implementation is valid
S11: Selection-on-observables discussion R1 (Ref 1) Suggestion Deferred Original paper already frames this
S12: K=2 fold size note R1 (Ref 1) Suggestion Addressed (implicitly) Folded into S2 discussion mentioning ~32 obs per fold

No essential issues remain. The paper is ready to share. The suggestions below would strengthen it further but are optional.

# Item Category Action needed
1 Ensemble SE reported as weighted average of individual SEs (technically invalid) Suggestion Either implement delta-method SE or add a footnote that Ensemble CI should be interpreted with caution
2 Cross-fitting stability not shown in the paper Suggestion Surface per-rep coefficient ranges from dml_results.json in a table or footnote
3 Bivariate (no-controls) OLS not reported Suggestion One-line regression to show whether DML is closer to bivariate or controlled OLS
4 Forest plot title has encoding artefact Suggestion Fix garbled arrow character in title
5 Replication table does not clarify SE type Suggestion Add footnote on published vs. replicated SE methodology
6 CLAN power calculation not explicit Suggestion Note that d ~ 1.1 needed for 80% power with N = 13 per group

Category: Essential = must fix before sharing. Suggestion = would improve, optional.


  • The RECAST qualifies the original finding. The positive tariff–growth effect survives directionally (all well-behaved DML learners agree on sign), but the magnitude is halved and statistical significance is lost. This is a “weaker support” verdict, not an overturning. The heterogeneity analysis adds a genuinely new dimension: the average effect masks a gradient from significantly negative (Q1: -0.206) to significantly positive (Q5: +0.235).

  • The small sample (N = 63) is the binding constraint. With 17 controls and 2-fold cross-fitting, each fold has only ~32 observations for training flexible ML models. This limits nuisance R-squared (best: 0.154 for outcome, 0.362 for treatment), inflates standard errors, and makes CLAN uninformative. These are inherent features of the data, not pipeline failures. The Baiardi & Naghi (2024) replication faces identical constraints.

  • External validation is strong. Our Lasso estimates match B&N (2024) benchmarks within 0.001–0.002 for all three treatments (our 0.020 vs. their 0.019 for Panel A; our 0.008 vs. their 0.010 for Panel B; our 0.006 vs. their 0.009 for Panel C). This is the strongest evidence that the pipeline is working correctly despite low nuisance R-squared.

  • The pipeline cannot verify causal identification. The original paper’s claim rests on selection-on-observables with 17 controls. Whether these controls are sufficient to close all backdoor paths (e.g., institutional quality, trade openness beyond tariffs, colonial history beyond regional dummies) is a substantive judgment that no automated pipeline can make. DML relaxes functional-form assumptions but not the unconfoundedness assumption.

  • After the revision, Boosting is the Best learner for both nuisance equations across all three panels. This resolved the Round 1 issue where Best was incorrectly copying Forest results. In the corrected run, Boosting achieves the lowest nuisance MSE for both outcome (R-squared = 0.154) and treatment (R-squared = 0.362), so Best = Boosting throughout. The Best coefficients are no longer identical to the Forest row, confirming the fix worked.