Contents

Setting up the Development Environment

  1. Install Julia and GitHub Desktop. GitHub Desktop is not strictly required, but it never hurts to have it!
  2. Install VS Code and follow the basic instructions in https://github.com/ubcecon/tutorials/blob/master/vscode.md
    • In particular, follow https://github.com/ubcecon/tutorials/blob/master/vscode.md#julia, making sure to complete the code formatter step.
    • Also apply the git settings in https://github.com/ubcecon/tutorials/blob/master/vscode.md#general-packages-and-setup
  3. Clone the repo by either:
    • Clicking on Code and then Open with GitHub Desktop.
    • Alternatively, running ] dev https://github.com/HighDimensionalEconLab/VarianceComponentsHDFE.jl in a Julia REPL, which clones it into the .julia/dev folder.
    • If you want, you can then drag that folder back into GitHub Desktop.
  4. Open it in VS Code by right-clicking on the folder it was cloned to and opening it as a VS Code project.
  5. Open the Julia REPL in VS Code (Ctrl+Shift+P, then search for "Julia: Start REPL").
  6. Type ] instantiate to install all of the packages. Get coffee.
  7. In the REPL, run ] test and it should run the full unit-test suite (see the sketch after this list).
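
For reference, the package-manager commands from steps 3, 6, and 7 look roughly like this in the Julia REPL (press ] at the julia> prompt to enter Pkg mode; the prompts shown here are illustrative):

# Press ] at the julia> prompt to enter Pkg mode
pkg> dev https://github.com/HighDimensionalEconLab/VarianceComponentsHDFE.jl   # step 3: clone into ~/.julia/dev
# Once the package's environment is active (the VS Code Julia extension typically activates it for the open folder):
pkg> instantiate   # step 6: install all dependencies
pkg> test          # step 7: run the full unit-test suite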

Functions in this package

Main Function

VarianceComponentsHDFE.leave_out_KSS (Method)
leave_out_KSS(y, first_id, second_id; controls, do_lincom, Z_lincom, lincom_labels, settings)

Computes the KSS leave-out bias-corrected variance components on the leave-out connected set described in Kline, Saggio, and Sølvsten. As illustrated in the workflow example at the end of this page, it returns the bias-corrected variance of the first identifier effects, the variance of the second identifier effects, and their covariance.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • controls: covariates that will be partialled out from the outcome before the KSS procedure is applied.
  • do_lincom: boolean indicating whether to run lincom inference.
  • Z_lincom: matrix of covariates used in the lincom inference.
  • lincom_labels: vector of labels for the columns of Z_lincom.
  • settings: settings based on the VCHDFESettings data type.
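
A minimal sketch of calling the main function, mirroring the Typical Julia Workflow at the end of this page (y, id, firmid, and controls are placeholder names for your own data):

# Bias-corrected variance components with no controls
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid)

# The same estimation after partialling out a (sparse) matrix of controls
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; controls)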

Auxiliary Functions

VarianceComponentsHDFE.find_connected_set (Method)
find_connected_set(y, first_idvar, second_idvar, settings)

Returns a tuple of observations belonging to the largest connected set, together with the corresponding identifiers and outcomes. The data must be sorted by first identifier and time period (e.g. by worker id and year). This is also the set on which AKM models can be run with the original data.

Arguments

  • y: outcome (e.g. log wage)
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on the VCHDFESettings data type. Please see the reference provided below.
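
A sketch of a direct call, assuming the data are already sorted by worker id and year as required; the settings object below is built as in the workflow example, and the return value is the tuple described in the docstring above:

# Settings with the JLA leverage algorithm (num_simulations = 0 uses its documented default)
mysettings = VCHDFESettings(leverage_algorithm = JLAAlgorithm(num_simulations = 0), first_id_effects = true, cov_effects = true)

# Observations, identifiers, and outcomes restricted to the largest connected set
connected_set = find_connected_set(y, id, firmid, mysettings)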
VarianceComponentsHDFE.get_leave_one_out_set (Method)
get_leave_one_out_set(y, first_id, second_id, settings, controls)

Returns a tuple with the observation numbers of the original dataset that belong to the leave-out connected set as described in Kline, Saggio, and Sølvsten. It also provides the corresponding outcomes and identifiers in this connected set.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on VCHDFESettings
  • controls: in this version only controls = nothing is supported.
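
A sketch of a direct call, reusing the mysettings object from the find_connected_set sketch above (controls must be nothing in this version):

# Observation numbers, outcomes, and identifiers of the leave-out (KSS) connected set
leave_out_set = get_leave_one_out_set(y, id, firmid, mysettings, nothing)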
VarianceComponentsHDFE.leave_out_estimation (Method)
leave_out_estimation(y, first_id, second_id, controls, settings)

Returns the bias-corrected variance components, the vector of coefficients, the corresponding fixed effects for every observation, and the diagonal matrices containing the Pii and Bii terms.

Arguments

  • y: outcome vector
  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
  • settings: settings based on VCHDFESettings
  • controls: matrix of control variables. In this version it does not work properly for very large datasets.
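
A sketch of a direct call on data already restricted to the leave-out connected set. The controls matrix could be, for example, the sparse year-dummy matrix built in the workflow example below; since the packaging of the return values is only described informally above, a single assignment is used:

# Bundles the bias-corrected components, coefficients, fixed effects, and the Pii/Bii diagonals
results = leave_out_estimation(y, id, firmid, controls, mysettings)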
VarianceComponentsHDFE.compute_movers (Method)
compute_movers(first_id, second_id)

Returns a vector indicating whether each first_id (e.g. worker) is a mover across second_id (e.g. firms), as well as a vector with the number of periods in which each first_id appears.

Arguments

  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
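
A sketch of a call; the docstring says it returns both a mover indicator and the number of periods per first identifier, so the single assignment below stays agnostic about how the two vectors are packaged:

# Mover indicator and periods-per-worker information (see the docstring above)
mover_info = compute_movers(id, firmid)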
VarianceComponentsHDFE.compute_matchid (Method)
compute_matchid(second_id, first_id)

Computes a match identifier for every combination of first and second identifier. For example, this can be the match identifier of worker-firm combinations.

Arguments

  • first_id: first identifier (e.g. worker id)
  • second_id: second identifier (e.g. firm id)
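
A sketch of a call; note that the signature takes the second identifier first:

# Match identifier for each worker-firm combination (argument order is second_id, first_id)
match_id = compute_matchid(firmid, id)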
VarianceComponentsHDFE.lincom_KSS (Method)
lincom_KSS(y, X, Z, Transform, sigma_i; lincom_labels)

This function regresses the estimated fixed effects on a set of observables. See the appendix in KSS for more information.

Arguments

  • y: outcome variable.
  • X: the design matrix in the linear model.
  • Z: matrix of observables to use in regression.
  • Transform: matrix to compute fixed effects (e.g. Transform = [0 F] recovers second fixed effects).
  • sigma_i: unbiased estimate of the variance of observation i.
  • lincom_labels: labels of the columns of Z.
  • settings: settings based on data type VCHDFESettings. Please see the reference provided below.
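
In typical use this routine is not called directly; lincom inference is triggered from the main function by setting do_lincom = true, as in the last step of the workflow example (region is a placeholder column of observables):

# Runs leave_out_KSS and performs lincom inference on the region dummy
θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; do_lincom = true, Z_lincom = region, lincom_labels = ["Region Dummy"])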

Datatypes in this package

VarianceComponentsHDFE.JLAAlgorithm (Type)
struct JLAAlgorithm <: AbstractLeverageAlgorithm

Data type to pass to the VCHDFESettings type to indicate the JLA algorithm.

Fields

  • num_simulations: number of simulations in the estimation. If num_simulations = 0, it defaults to 100 * log(#total fixed effects).
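
Constructing the algorithm object and passing it to the settings, exactly as in the workflow example below:

JL = JLAAlgorithm(num_simulations = 300)   # num_simulations = 0 would fall back to 100 * log(#total fixed effects)
mysettings = VCHDFESettings(leverage_algorithm = JL, first_id_effects = true, cov_effects = true)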
VarianceComponentsHDFE.VCHDFESettings (Type)
struct VCHDFESettings{LeverageAlgorithm}

The VCHDFESettings type passes information to the methods, in particular which leverage algorithm to use.

Fields

  • cg_maxiter: maximum number of iterations (default = 300)
  • leave_out_level: leave-out level (default = match)
  • leverage_algorithm: which type of algorithm to use (default = JLAAlgorithm())
  • first_id_effects: includes first id effects. In this version it is required to include the first_id_effects. (default = true)
  • cov_effects: includes the covariance of the first and second id effects. In this version it is required to include the cov_effects. (default = true)
  • print_level: prints the state of the program to standard output. If print_level = 0, the app prints nothing to standard output. (default = 1)
  • first_id_display_small: name of the first id in lowercase (default = person)
  • first_id_display: name of the first id (default = Person)
  • second_id_display_small: name of the second id in lowercase (default = firm)
  • second_id_display: name of the second id (default = Firm)
  • outcome_id_display_small: name of the outcome in lowercase (default = wage)
  • outcome_id_display: name of the outcome (default = Wage)
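
A sketch of a settings object that overrides a few of the documented defaults (all keywords are fields listed above; the display names are assumed here to be plain strings):

mysettings = VCHDFESettings(
    leverage_algorithm = JLAAlgorithm(num_simulations = 300),   # JLA leverage approximation
    cg_maxiter = 300,             # maximum number of iterations
    print_level = 0,              # 0 = print nothing to standard output
    first_id_display = "Person", first_id_display_small = "person",   # assumed to be strings
    second_id_display = "Firm", second_id_display_small = "firm",
)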

Typical Julia Workflow

#Load the required packages
using VarianceComponentsHDFE, DataFrames, CSV, SparseArrays

#Load dataset
data = DataFrame(CSV.File("test.csv"; header=false))

#Extract vectors of outcome, workerid, firmid
id = data[:,1]
firmid = data[:,2]
year = data[:, 3]
y = data[:,4]

#You can define the settings using our structures
JL = JLAAlgorithm(num_simulations = 300)
mysettings = VCHDFESettings(leverage_algorithm = JL, first_id_effects=true, cov_effects=true)

#Run KSS with no controls 
θ_first, θ_second, θCOV = leave_out_KSS(y,id,firmid)

#Create some controls and run the routine where we partial out them
controls = indexin(year,unique(sort(year)))
controls = sparse(collect(1:size(y,1)), controls, 1, size(y,1), maximum(controls))
controls = controls[:,1:end-1]

θ_first, θ_second, θCOV = leave_out_KSS(y,id,firmid; controls)

#Perform Lincom Inference using a Region Dummy
data = DataFrame(CSV.File("lincom.csv"; header=false))
id = data[:,1]
firmid = data[:,2]
y = data[:,5]
region = data[:,4] 
region[findall(region.==-1)].=0

θ_first, θ_second, θCOV = leave_out_KSS(y, id, firmid; do_lincom = true, Z_lincom = region, lincom_labels = ["Region Dummy"])