Getting Started with hetid • hetid

library(hetid)

Note: For enhanced table formatting in analysis output, consider installing the knitr package: install.packages("knitr"). The package works without it, but provides nicer formatted tables when available.

Introduction

The hetid package implements identification through heteroskedasticity methods for models with endogenous regressors. These methods provide ways to identify and estimate causal effects when traditional instrumental variables are not available, using heteroskedasticity in the error terms as a source of identification. The package includes implementations of Klein and Vella (2010), Lewbel (2012), Rigobon (2003), and Prono (2014) methods.

The Lewbel (2012) Method

Consider a triangular system of equations:

\begin{aligned} Y_1 &= \beta_{1,0} + \beta_{1,1}X + \gamma_1 Y_2 + \epsilon_1 \\ Y_2 &= \beta_{2,0} + \beta_{2,1}X + \epsilon_2 \end{aligned}

Where: - Y_2 is endogenous in the first equation (i.e., \text{Cov}(\epsilon_1, \epsilon_2) \neq 0) - X is exogenous - \gamma_1 is the parameter of interest

Lewbel’s method uses heteroskedasticity in \epsilon_2 to construct instruments. The key identifying assumptions are: 1. \text{Cov}(Z, \epsilon_1\epsilon_2) = 0 where Z = X^2 - E[X^2] 2. \text{Cov}(Z, \epsilon_2^2) \neq 0 (heteroskedasticity)

Generating Data

The package provides a function to generate data that satisfies Lewbel’s assumptions:

# Set parameters for the data generating process
params <- list(
  beta1_0 = 0.5, # Intercept in first equation
  beta1_1 = 1.5, # Coefficient on X in first equation
  gamma1 = -0.8, # Endogenous parameter (true value)
  beta2_0 = 1.0, # Intercept in second equation
  beta2_1 = -1.0, # Coefficient on X in second equation
  alpha1 = -0.5, # Factor loading for error correlation
  alpha2 = 1.0, # Factor loading for error correlation
  delta_het = 1.2 # Heteroskedasticity strength
)

# Generate a sample
set.seed(123)
data <- generate_lewbel_data(n_obs = 1000, params = params)

# Display basic information about the generated data
cat("Generated data dimensions:", dim(data), "\n")
#> Generated data dimensions: 1000 6
cat("Variable names:", names(data), "\n\n")
#> Variable names: Y1 Y2 epsilon1 epsilon2 Xk Z
head(data)
#>           Y1          Y2   epsilon1    epsilon2        Xk           Z
#> 1  0.8859851 -0.59332384 -0.5200403 -1.30574632 0.2875775 -0.20970028
#> 2  1.8355644  0.04560669  0.1895921 -0.16608818 0.7883051  0.29102733
#> 3 -0.9492333  0.80901016 -1.4154905  0.21798709 0.4089769 -0.08830088
#> 4  2.0246892  0.06421877  0.2515381 -0.05276383 0.8830174  0.38573960
#> 5  3.9894287 -0.25473690  1.8749383 -0.31426962 0.9404673  0.44318948
#> 6  2.3482261  0.49361990  2.1747873 -0.46082360 0.0455565 -0.45172130

Verifying Assumptions

Before proceeding with estimation, it’s important to verify that the data satisfies Lewbel’s key assumptions:

# Verify the assumptions using the generated data
# The function tests three key conditions:
# 1. Cov(Z, ε₁ε₂) = 0 (covariance restriction)
# 2. Cov(Z, ε₂²) ≠ 0 (instrument relevance)
# 3. Cov(ε₁, ε₂) ≠ 0 (endogeneity)
verification_result <- verify_lewbel_assumptions(
  data = data, config = params, verbose = TRUE
)
#> 
#> --- Verifying Lewbel's Key Assumptions ---
#> Test sample size: 1000
#> Cov(Z, e1*e2) = 0.015249 (should be approx. 0)
#> Cov(Z, e2^2) = 0.159949 (should be != 0)
#> Cov(e1, e2) = -0.384956 (should be != 0 for endogeneity)
#> Test H0: Cov(Z, e1*e2) = 0, p-value = 0.3390
#> Key assumptions appear to be satisfied.

Estimation

Point Identification

When the covariance restriction holds exactly, we can obtain point estimates using 2SLS with Lewbel instruments:

# Run a single simulation to demonstrate estimation
# This compares OLS (biased) vs 2SLS with Lewbel instruments (consistent)
result <- run_single_lewbel_simulation(
  sim_id = 1,
  params = c(params, list(
    sample_size = 1000,
    tau_set_id = 0.2,
    bootstrap_reps = 100
  ))
)

# Display results
cat("True gamma1:", params$gamma1, "\n")
#> True gamma1: -0.8
cat("OLS estimate:", round(result$ols_gamma1, 3), "\n")
#> OLS estimate: -1.009
cat("2SLS (Lewbel) estimate:", round(result$tsls_gamma1, 3), "\n")
#> 2SLS (Lewbel) estimate: -0.81
cat("First-stage F-statistic:", round(result$first_stage_F, 1), "\n")
#> First-stage F-statistic: 33.6

# Check if instruments are strong (F > 10 is conventional threshold)
if (result$first_stage_F > 10) {
  cat("✓ Strong instruments (F > 10)\n")
} else {
  cat("⚠ Weak instruments (F ≤ 10)\n")
}
#> ✓ Strong instruments (F > 10)

Set Identification

When we relax the covariance restriction to allow for some correlation between Z and \epsilon_1\epsilon_2, we obtain set identification:

# Calculate bounds under exact identification (tau = 0)
# When tau = 0, we get point identification
bounds_exact <- calculate_lewbel_bounds(data, tau = 0)
cat("Bounds under exact identification (tau = 0):\n")
#> Bounds under exact identification (tau = 0):
cat(
  "[", round(bounds_exact$bounds[1], 3), ",",
  round(bounds_exact$bounds[2], 3), "]\n"
)
#> [ -0.709 , -0.709 ]

# Calculate bounds under relaxed assumption (tau = 0.2)
# When tau > 0, we allow some violation of the covariance restriction
bounds_relaxed <- calculate_lewbel_bounds(data, tau = 0.2)
cat("\nBounds under relaxed assumption (tau = 0.2):\n")
#> 
#> Bounds under relaxed assumption (tau = 0.2):
cat(
  "[", round(bounds_relaxed$bounds[1], 3), ",",
  round(bounds_relaxed$bounds[2], 3), "]\n"
)
#> [ -0.806 , -0.593 ]

# With bootstrap standard errors for inference
bounds_with_se <- calculate_lewbel_bounds(
  data,
  tau = 0.2,
  compute_se = TRUE,
  b_reps = 100
)
cat("\nBounds with bootstrap standard errors:\n")
#> 
#> Bounds with bootstrap standard errors:
cat(
  "Lower bound:", round(bounds_with_se$bounds[1], 3),
  "(SE:", round(bounds_with_se$se[1], 3), ")\n"
)
#> Lower bound: -0.806 (SE: 0.103 )
cat(
  "Upper bound:", round(bounds_with_se$bounds[2], 3),
  "(SE:", round(bounds_with_se$se[2], 3), ")\n"
)
#> Upper bound: -0.593 (SE: 0.129 )

# Check if true value is contained in the bounds
true_gamma <- params$gamma1
in_bounds_exact <- (true_gamma >= bounds_exact$bounds[1] &&
  true_gamma <= bounds_exact$bounds[2])
in_bounds_relaxed <- (true_gamma >= bounds_relaxed$bounds[1] &&
  true_gamma <= bounds_relaxed$bounds[2])

cat("\nTrue value γ₁ =", true_gamma, "\n")
#> 
#> True value γ₁ = -0.8
cat("Contained in exact bounds:", ifelse(in_bounds_exact, "✓", "✗"), "\n")
#> Contained in exact bounds: ✗
cat("Contained in relaxed bounds:", ifelse(in_bounds_relaxed, "✓", "✗"), "\n")
#> Contained in relaxed bounds: ✓

Monte Carlo Simulation

The package includes functions for running Monte Carlo simulations to assess the performance of the estimators:

# Example: Run a small Monte Carlo study
# Note: eval=FALSE to avoid long computation time in vignette building.
# Users can run this code locally for actual Monte Carlo analysis.
library(purrr)
library(dplyr)

# Run 100 simulations to assess estimator performance
n_sims <- 100
sim_params <- c(params, list(
  sample_size = 500,
  tau_set_id = 0.2,
  bootstrap_reps = 50
))

cat("Running", n_sims, "Monte Carlo simulations...\n")

# Option 1: Using map_dfr with inline progress reporting
results <- map_dfr(1:n_sims, function(i) {
  if (i %% 10 == 0) cat("Completed:", i, "of", n_sims, "simulations\n")
  run_single_lewbel_simulation(i, sim_params)
})


# Summarize performance metrics
summary_stats <- results |>
  summarise(
    # Bias (should be close to 0 for consistent estimators)
    ols_bias = mean(ols_gamma1 - params$gamma1, na.rm = TRUE),
    tsls_bias = mean(tsls_gamma1 - params$gamma1, na.rm = TRUE),

    # Coverage rates (should be close to 0.95 for 95% confidence intervals)
    ols_coverage = mean(ols_coverage, na.rm = TRUE),
    tsls_coverage = mean(tsls_coverage, na.rm = TRUE),

    # Instrument strength
    avg_first_stage_F = mean(first_stage_F, na.rm = TRUE),
    weak_iv_rate = mean(first_stage_F < 10, na.rm = TRUE)
  )

cat("\nMonte Carlo Results:\n")
print(summary_stats)

# Additional analysis: Distribution of estimates
cat("\nDistribution of 2SLS estimates:\n")
cat("Mean:", round(mean(results$tsls_gamma1, na.rm = TRUE), 3), "\n")
cat("Std Dev:", round(sd(results$tsls_gamma1, na.rm = TRUE), 3), "\n")
cat("Min:", round(min(results$tsls_gamma1, na.rm = TRUE), 3), "\n")
cat("Max:", round(max(results$tsls_gamma1, na.rm = TRUE), 3), "\n")

Conclusion

The hetid package provides a complete implementation of Lewbel’s (2012) identification through heteroskedasticity method. Key features include:

Data generation that satisfies the identifying assumptions
Functions to verify these assumptions
Point estimation via 2SLS with Lewbel instruments
Set identification when the covariance restriction is relaxed
Bootstrap inference for the identified sets
Monte Carlo simulation tools for performance assessment

Next Steps

For theoretical background, see the Theory and Methods vignette
For GMM estimation, see the GMM Estimation vignette
For alternative methods, see Klein & Vella, Rigobon and Prono vignettes
For software comparisons, see Comparing Implementations

For more details on the original method, see Lewbel, A. (2012). “Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models.” Journal of Business & Economic Statistics, 30(1), 67-80.