Contents

Setting up the Development Environment

  1. Install Julia and GitHub Desktop. GitHub Desktop is not strictly required, but it never hurts to have it!
  2. Install VS Code and follow the basic instructions in https://github.com/ubcecon/tutorials/blob/master/vscode.md
    • In particular, follow https://github.com/ubcecon/tutorials/blob/master/vscode.md#julia, making sure to complete the code formatter step.
    • Also apply the git settings in https://github.com/ubcecon/tutorials/blob/master/vscode.md#general-packages-and-setup
  3. Clone the repo by either:
    • Clicking on Code and then Open with GitHub Desktop.
    • Alternatively, running ] dev https://github.com/HighDimensionalEconLab/VarianceComponentsHDFE.jl in a Julia REPL, which clones it into the .julia/dev folder.
    • If you want, you can then drag that folder back into GitHub Desktop.
  4. Open it in VS Code by right-clicking on the folder it was cloned to and opening it as a VS Code project.
  5. Open the Julia REPL in VS Code (Ctrl+Shift+P, then search for "Julia: Start REPL").
  6. Type ] instantiate to install all of the packages. Get coffee.
  7. In the REPL, run ] test and it should run the full unit-test suite (see the sketch after this list).
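
For reference, the package-manager commands from steps 3, 6, and 7 look roughly like this in the Julia REPL (press ] at the julia> prompt to enter Pkg mode; the prompts shown here are illustrative):

# Press ] at the julia> prompt to enter Pkg mode
pkg> dev https://github.com/HighDimensionalEconLab/VarianceComponentsHDFE.jl   # step 3: clone into ~/.julia/dev
# Once the package's environment is active (the VS Code Julia extension typically activates it for the open folder):
pkg> instantiate   # step 6: install all dependencies
pkg> test          # step 7: run the full unit-test suite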

Functions in this package

Main Function

VarianceComponentsHDFE.leave_out_KSS (Method)
leave_out_KSS(y, first_id, second_id; controls, do_lincom, Z_lincom, lincom_labels, settings)

Computes the KSS leave-out bias-corrected variance components on the leave-out connected set described in Kline, Saggio, and Sølvsten. As illustrated in the workflow example at the end of this page, it returns the bias-corrected variance of the first identifier effects, the variance of the second identifier effects, and their covariance.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • controls: covariates that will be partialled out from the outcome before the KSS procedure is applied.
  • do_lincom: boolean indicating whether to run lincom inference.
  • Z_lincom: matrix of covariates used in the lincom inference.
  • lincom_labels: vector of labels for the columns of Z_lincom.
  • settings: settings based on the VCHDFESettings data type.
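
A minimal sketch of calling the main function, mirroring the Typical Julia Workflow at the end of this page (y, id, firmid, and controls are placeholder names for your own data):

# Bias-corrected variance components with no controls
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid)

# The same estimation after partialling out a (sparse) matrix of controls
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; controls)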

Auxiliary Functions

VarianceComponentsHDFE.find_connected_set (Method)
find_connected_set(y, first_idvar, second_idvar, settings)

Returns a tuple of observations belonging to the largest connected set, together with the corresponding identifiers and outcomes. The data must be sorted by first identifier and time period (e.g. by worker id and year). This is also the set on which AKM models can be run with the original data.

Arguments

  • y: outcome (e.g. log wage)
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on the VCHDFESettings data type. Please see the reference provided below.
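
A sketch of a direct call, assuming the data are already sorted by worker id and year as required; the settings object below is built as in the workflow example, and the return value is the tuple described in the docstring above:

# Settings with the JLA leverage algorithm (num_simulations = 0 uses its documented default)
mysettings = VCHDFESettings(leverage_algorithm = JLAAlgorithm(num_simulations = 0), first_id_effects = true, cov_effects = true)

# Observations, identifiers, and outcomes restricted to the largest connected set
connected_set = find_connected_set(y, id, firmid, mysettings)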
VarianceComponentsHDFE.get_leave_one_out_set (Method)
get_leave_one_out_set(y, first_id, second_id, settings, controls)

Returns a tuple with the observation numbers of the original dataset that belong to the leave-out connected set as described in Kline, Saggio, and Sølvsten. It also provides the corresponding outcomes and identifiers in this connected set.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on VCHDFESettings
  • controls: in this version only controls = nothing is supported.
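
A sketch of a direct call, reusing the mysettings object from the find_connected_set sketch above (controls must be nothing in this version):

# Observation numbers, outcomes, and identifiers of the leave-out (KSS) connected set
leave_out_set = get_leave_one_out_set(y, id, firmid, mysettings, nothing)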
VarianceComponentsHDFE.leave_out_estimation (Method)
leave_out_estimation(y, first_id, second_id, controls, settings)

Returns the bias-corrected variance components, the vector of coefficients, the corresponding fixed effects for every observation, and the diagonal matrices containing the Pii and Bii terms.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on VCHDFESettings
  • controls: matrix of control variables. In this version it does not work properly for very large datasets.
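
A sketch of a direct call on data already restricted to the leave-out connected set. The controls matrix could be, for example, the sparse year-dummy matrix built in the workflow example below; since the packaging of the return values is only described informally above, a single assignment is used:

# Bundles the bias-corrected components, coefficients, fixed effects, and the Pii/Bii diagonals
results = leave_out_estimation(y, id, firmid, controls, mysettings)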
VarianceComponentsHDFE.compute_movers (Method)
compute_movers(first_id, second_id)

Returns a vector indicating whether each first_id (e.g. worker) is a mover across second_id (e.g. firms), as well as a vector with the number of periods in which each first_id appears.

Arguments

  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
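
A sketch of a call; the docstring says it returns both a mover indicator and the number of periods per first identifier, so the single assignment below stays agnostic about how the two vectors are packaged:

# Mover indicator and periods-per-worker information (see the docstring above)
mover_info = compute_movers(id, firmid)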
VarianceComponentsHDFE.compute_matchid (Method)
compute_matchid(second_id, first_id)

Computes a match identifier for every combination of first and second identifier. For example, this can be the match identifier of worker-firm combinations.

Arguments

  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
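
A sketch of a call; note that the signature takes the second identifier first:

# Match identifier for each worker-firm combination (argument order is second_id, first_id)
match_id = compute_matchid(firmid, id)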
VarianceComponentsHDFE.lincom_KSS (Method)
lincom_KSS(y, X, Z, Transform, sigma_i; lincom_labels)

This function regresses the estimated fixed effects on a set of observables. See the appendix in KSS for more information.

Arguments

  • y: outcome variable.
  • X: the design matrix in the linear model.
  • Z: matrix of observables to use in regression.
  • Transform: matrix to compute fixed effects (e.g. Transform = [0 F] recovers second fixed effects).
  • sigma_i: unbiased estimate of the variance of observation i.
  • lincom_labels: labels of the columns of Z.
  • settings: settings based on data type VCHDFESettings. Please see the reference provided below.
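
In typical use this routine is not called directly; lincom inference is triggered from the main function by setting do_lincom = true, as in the last step of the workflow example (region is a placeholder column of observables):

# Runs leave_out_KSS and performs lincom inference on the region dummy
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; do_lincom = true, Z_lincom = region, lincom_labels = ["Region Dummy"])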

Datatypes in this package

VarianceComponentsHDFE.JLAAlgorithm (Type)
struct JLAAlgorithm <: AbstractLeverageAlgorithm

Data type to pass to the VCHDFESettings type to indicate the JLA algorithm.

Fields

  • num_simulations: number of simulations in the estimation. If num_simulations = 0, it defaults to 100 * log(#total fixed effects).
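
Constructing the algorithm object and passing it to the settings, exactly as in the workflow example below:

JL = JLAAlgorithm(num_simulations = 300)   # num_simulations = 0 would fall back to 100 * log(#total fixed effects)
mysettings = VCHDFESettings(leverage_algorithm = JL, first_id_effects = true, cov_effects = true)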
VarianceComponentsHDFE.VCHDFESettings (Type)
struct VCHDFESettings{LeverageAlgorithm}

The VCHDFESettings type passes information to the methods, in particular which leverage algorithm to use.

Fields

  • cg_maxiter: maximum number of iterations (default = 300)
  • leave_out_level: leave-out level (default = match)
  • leverage_algorithm: which type of algorithm to use (default = JLAAlgorithm())
  • first_id_effects: includes first id effects. In this version it is required to include the first_id_effects. (default = true)
  • cov_effects: includes the covariance of the first and second id effects. In this version it is required to include the cov_effects. (default = true)
  • print_level: prints the state of the program to standard output. If print_level = 0, the app prints nothing to standard output. (default = 1)
  • first_id_display_small: name of the first id in lowercase (default = person)
  • first_id_display: name of the first id (default = Person)
  • second_id_display_small: name of the second id in lowercase (default = firm)
  • second_id_display: name of the second id (default = Firm)
  • outcome_id_display_small: name of the outcome in lowercase (default = wage)
  • outcome_id_display: name of the outcome (default = Wage)
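
A sketch of a settings object that overrides a few of the documented defaults (all keywords are fields listed above; the display names are assumed here to be plain strings):

mysettings = VCHDFESettings(
    leverage_algorithm = JLAAlgorithm(num_simulations = 300),   # JLA leverage approximation
    cg_maxiter = 300,             # maximum number of iterations
    print_level = 0,              # 0 = print nothing to standard output
    first_id_display = "Person", first_id_display_small = "person",   # assumed to be strings
    second_id_display = "Firm", second_id_display_small = "firm",
)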

Typical Julia Workflow

#Load the required packages
using VarianceComponentsHDFE, DataFrames, CSV, SparseArrays

#Load dataset
data = DataFrame(CSV.File("test.csv"; header=false))

#Extract vectors of outcome, workerid, firmid
id = data[:,1]
firmid = data[:,2]
year = data[:, 3]
y = data[:,4]

#You can define the settings using our structures
JL = JLAAlgorithm(num_simulations = 300)
mysettings = VCHDFESettings(leverage_algorithm = JL, first_id_effects=true, cov_effects=true)

#Run KSS with no controls 
θ_first, θ_second, θCOV = leave_out_KSS(y,id,firmid)

#Create some controls and run the routine where we partial out them
controls = indexin(year,unique(sort(year)))
controls = sparse(collect(1:size(y,1)), controls, 1, size(y,1), maximum(controls))
controls = controls[:,1:end-1]

θ_first, θ_second, θCOV = leave_out_KSS(y,id,firmid; controls)

#Perform Lincom Inference using a Region Dummy
data = DataFrame(CSV.File("lincom.csv"; header=false))
id = data[:,1]
firmid = data[:,2]
y = data[:,5]
region = data[:,4] 
region[findall(region.==-1)].=0

θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; do_lincom = true, Z_lincom = region, lincom_labels = ["Region Dummy"])