When estimating causal effects, the main challenge is dealing with the selection bias caused by omitted variables. Eliminating, or at least mitigating, this selection bias is the key to obtaining unbiased estimates (Angrist and Pischke 2009). Since I do not have access to a randomized experiment (or a quasi-experiment), two approaches will be used: individual fixed effects and matching, both relying on the *conditional independence assumption* (CIA).

### 3.1 Differences-in-Differences

An ideal econometric approach would have been to identify an exogenous variation in the take-up of treatment, which would have ensured causal inference. Since the focus is on the causal effect of TWA employment (taking treatment) on the subsequent regular-employment probability, I instead turn to the difference-in-differences model. When estimating this model it is crucial that (i) both groups exhibit parallel trends, (ii) selection into treatment is conditionally random, or at least not correlated with the outcome, and (iii) nothing else that affects the outcome variable occurs at the same time as the treatment. Since I have not been able to identify an exogenous instrument that selects individuals into treatment, I instead control for individual effects (exploiting the panel structure of the data), which might be correlated with both selection into treatment and the outcome. Individual trends, however, cannot be captured this way. Another way to deal with this endogeneity is to employ matching techniques, which rely on the CIA. The CIA states that, conditional on observable characteristics, treatment is as good as randomly assigned; more formally:

\begin{array}{l}{Y}_{0i}\perp T\mid X\end{array}

(1)

Here, *T* is getting treatment, *X* is a set of confounders, and *Y*_{0i} is the potential outcome when not taking treatment. Having a rich set of observables is, as previously stated, crucial for claiming that the CIA holds and thus that the conditional differences-in-differences (cDiD) is valid. More specifically, observables such as labor market performance prior to treatment might control for the unobservable characteristics that cause selection bias. One way of applying the matching approach is to weight the most crucial variables in the estimation in order to balance the two groups so that they look very similar along observable dimensions. This can be done by *coarsened exact matching* (CEM), where the reweighting is done separately on the different confounders depending on relevance^{10}.

The outcome variable of main interest is the long-run probability of getting employed in the regular sector, relying on employment status in several subsequent periods, which captures long-run performance in the labor market^{11}. The outcome is defined as *P*(*Y* = 1), the probability of getting employed in the regular sector. An initial problem is that *P*(*Y* = 1) is not observed in the pre-treatment year 2001, since the selection criterion in that year is that the individuals under study are unemployed. However, outcome data for 2001 are not needed as long as outcomes for earlier years, such as 1999 and 2000, are available. With 1998 as the reference year, the parallel trends assumption can be tested using data for 1999 and 2000^{12}. The main iDiD and cDiD^{13} models estimated in this paper are specified as

\begin{array}{l}{Y}_{i,t}={\gamma}_{t}+{a}_{i}+\sum _{\rho =1999}^{2000}{\delta}_{\rho}{T}_{i,\rho}+\sum _{\tau =2002}^{2008}{\delta}_{\tau}{T}_{i,\tau}+{\beta}^{\prime}{\mathbf{X}}_{i,t}+{\nu}_{i,t}\end{array}

(2)

and

\begin{array}{l}{Y}_{i,t}={\gamma}_{t}+\sum _{\rho =1999}^{2000}{\delta}_{\rho}{T}_{i,\rho}+\sum _{\tau =2002}^{2008}{\delta}_{\tau}{T}_{i,\tau}+{\beta}^{\prime}{\mathbf{X}}_{i,t}+{\nu}_{i,t}.\end{array}

(3)

Here, the *a*_{i} are the individual dummies, *γ*_{t} is a set of time dummies (absorbing the intercept), **X**_{i,t} is a set of confounders, *ν*_{i,t} is the error term, *T*_{i,ρ} = 1[if will be getting treatment], and *T*_{i,τ} = 1[if treated]. Given that there are no anticipation effects (by construction impossible in this setting), the coefficients of the leads (*δ*_{ρ}) should be zero (i.e., parallel trends), strengthening the causal interpretation. Since the matching is performed before estimating the equation, any treatment-group dummy is omitted from Equation (3): its coefficient will be zero if the matching was successful, and including it would only inflate the standard errors while not contributing to the model. The timing of the treatment is chosen to be 2002 to ensure a long follow-up.
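As a rough illustration of how a specification like Equation (3) can be taken to data, the sketch below estimates the lead and lag coefficients as an event-study regression on simulated panel data. The data-generating process, variable names, and sample sizes are all hypothetical, not the paper's actual data:

```python
# Illustrative event-study DiD in the spirit of Equation (3), on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, years = 500, list(range(1998, 2009))
ids = np.repeat(np.arange(n), len(years))
year = np.tile(years, n)
treated = np.repeat(rng.integers(0, 2, n), len(years))  # treatment-group flag
# Outcome: a common time trend plus a 0.15 effect for the treated from 2002 on.
y = 0.3 * (year - 1998) / 10 + 0.15 * treated * (year >= 2002) \
    + rng.normal(0, 0.1, n * len(years))
df = pd.DataFrame({"id": ids, "year": year, "treated": treated, "y": y})

# Leads (1999-2000) and lags (2002-2008); 1998 is the omitted reference year,
# and 2001 is excluded because everyone is unemployed in that year.
event_years = [t for t in years if t not in (1998, 2001)]
for t in event_years:
    df[f"d{t}"] = ((df["year"] == t) & (df["treated"] == 1)).astype(int)
formula = "y ~ C(year) + " + " + ".join(f"d{t}" for t in event_years)
fit = smf.ols(formula, data=df).fit(cov_type="cluster",
                                    cov_kwds={"groups": df["id"]})
print(fit.params[[f"d{t}" for t in event_years]])
```

With this data-generating process, the lead coefficients come out near zero (parallel pre-trends) while the lag coefficients recover the 0.15 treatment effect.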

Members of either the control or the treatment group are permitted to join a TWA after the treatment year; selection is thus not conditioned on future outcomes. Violating this, and in effect conditioning on future outcomes, would bias the estimated effect (Fredriksson and Johansson 2003).

Both groups are identified by being unemployed in 2001; thus, a form of matching on pre-treatment labor market outcomes is already performed at this stage. The control group is defined by not joining a TWA in 2002 and instead engaging in something else or staying unemployed. The counterfactual path is then, e.g., taking up studies, dropping out of the labor force, taking a regular job, etc. No restrictions were put on outcomes from 2003 to 2008.

The reason for using both iDiD and cDiD (described in the following section) is that iDiD can be expected to give more precise estimates than a regular DiD by construction, since it controls for individual effects rather than two group effects. Moreover, controlling for unobserved and observed heterogeneity with fixed effects does not prune observations the way *coarsened exact matching* (CEM)^{14} does: CEM might result in very few observations. On the other hand, cDiD can, by balancing the two groups, mitigate the bias that occurs when, for instance, the two groups have different age compositions, which can give rise to diverging income progressions (steeper for younger people). With well-balanced groups it is also more convincing to point to the control group's outcomes as the actual counterfactual outcomes, since the groups are the same in all observed respects. The drawback is, of course, the low number of observations that can result from tight matching criteria. Contrasting the two methods also gives the reader a sense of how large the self-selection bias might be in this application.

### 3.2 Matching

Matching is a technique for overcoming the selection bias that threatens causal inference. The approach is, however, not uncontroversial. Evidence in favor of the technique comes from, e.g., Dehejia and Wahba (1999), who report a successful non-experimental analysis of the data in LaLonde (1986): using matching, they replicate the experimental impact estimates. Smith and Todd (2005) criticize Dehejia and Wahba (1999), but conclude that the matching technique is best embedded in a DiD design, which is what is done in the present paper (*conditional DiD*). A principal conceptual difference between regular regression estimation and matching estimation is that the latter gives the researcher greater flexibility in choosing how to aggregate heterogeneous effects, especially when using the specific technique *coarsened exact matching*. Since previous work has shown that the impact of TWAs differs greatly across groups of individuals, this is of great importance. Due to the explicit and easily manipulated weighting procedure, which is in the hands of the researcher rather than implicit in the estimator (as in OLS), matching makes it easier to estimate parameters of interest such as the ATT in a stratified way (Cobb-Clark and Crossley 2003).

The basic idea of matching estimators is that we try to find a 'twin' for each individual taking treatment. This is done by matching on observable characteristics. The idea is that if the individuals are very similar in the observables related to the outcome and the selection process, the risk that they differ in unobservables correlated with outcome and selection is reduced or even eliminated. In practice, we explicitly try to calculate the counterfactual untreated outcome *E*[*Y*_{0i}]:

\begin{array}{l}{\delta}_{i}^{\ast}={Y}_{1i}-E\left[{Y}_{0i}\right].\end{array}

(4)
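Aggregated over the treated on matched data, Equation (4) reduces to the treated mean outcome minus a weighted control mean. A minimal sketch, with made-up outcomes and matching weights (treated units carry weight one):

```python
# Sketch: the (S)ATT as a treated mean minus a weighted control mean.
# Outcomes, treatment indicator, and weights are illustrative.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)  # observed outcomes
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])               # treatment indicator
w = np.array([1, 1, 1, 1, 0.5, 1.5, 0.5, 1.5])       # matching weights

att = y[t == 1].mean() - np.average(y[t == 0], weights=w[t == 0])
print(att)  # 0.75 - 0.125 = 0.625
```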

Matching estimators rely on the CIA, as discussed earlier. Furthermore, Rosenbaum and Rubin (1983) noted that an additional condition is needed, *common support*: if we define *P*(*x*) as the probability of getting treatment (*T*) for an individual with characteristics *x*, then the common support condition requires 0 < *P*(*x*) < 1, ∀*x*. This is also called the *overlap condition*; it rules out perfect predictability of *T* given *x*. Without this assumption, we have no information from which to construct the counterfactuals. CEM takes care of this by construction.
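The overlap requirement can be checked mechanically: a stratum containing only treated or only control units carries no counterfactual information and is dropped, which is exactly what CEM does by construction. A small sketch on hypothetical strata:

```python
# Sketch: keep only strata where both treated and control units are present.
import pandas as pd

df = pd.DataFrame({
    "stratum": ["a", "a", "b", "b", "c", "c"],
    "treated": [1, 0, 1, 1, 0, 0],
})
counts = df.groupby("stratum")["treated"].agg(["sum", "count"])
# On support: at least one treated (sum > 0) and one control (sum < count).
on_support = counts.index[(counts["sum"] > 0) & (counts["sum"] < counts["count"])]
matched = df[df["stratum"].isin(on_support)]
print(sorted(on_support))  # only stratum "a" has both treated and controls
```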

*Coarsened exact matching* is a member of the Monotonic Imbalance Bounding (MIB) class of matching methods (further described in Iacus et al. (2008)). It is a method for pre-processing data which deals with the 'curse of dimensionality'^{15} by coarsening continuous variables into bins, where the researcher, using in-depth knowledge of the variables at hand, can choose the bin sizes so as to preserve information and maximize the number of matches. Once a continuous variable is coarsened into bins, matching takes place on the resulting strata, and observations are finally re-weighted according to the size of their strata. The bin width can be constant (*ε*_{j}) within the variable *j*, or it can vary within the variable, {\epsilon}_{j}^{v}, where the *v* are the cut-off points. Essentially any type of regression can then be performed on the uncoarsened data while including the new weights. If the matching is exact in a variable (as is done for, e.g., the educational level), then this confounder is not needed in the regression, since the balancing is perfect, unless the variable is time varying. If the matching is exact only on the coarsened values and/or the variable is time varying, then the confounder should be included in the regression to control for the within-bin correlation, which most likely will be very small if the bin width (*ε*_{j}) is tightly defined.
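The coarsen-match-reweight steps can be sketched as follows. The bin edges, variable names, and simulated data are illustrative, not the paper's actual specification; the weighting rule shown (controls in stratum *s* receive (*m*_{T}^{s}/*m*_{C}^{s})·(*M*_{C}/*M*_{T})) is the standard CEM reweighting:

```python
# Sketch of CEM: coarsen a continuous confounder into bins, match exactly on
# the resulting strata, and reweight controls to mirror the treated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "treated": rng.integers(0, 2, 200),
    "age": rng.integers(18, 65, 200),
})
# Coarsen age into 5-year bins (an illustrative epsilon_age = 5).
df["stratum"] = pd.cut(df["age"], bins=range(15, 70, 5))

# Keep only strata with both treated and control units (common support).
counts = df.groupby(["stratum", "treated"], observed=True).size().unstack(fill_value=0)
matched_strata = counts.index[(counts[0] > 0) & (counts[1] > 0)]
m = df[df["stratum"].isin(matched_strata)].copy()

M_T = (m["treated"] == 1).sum()  # matched treated, overall
M_C = (m["treated"] == 0).sum()  # matched controls, overall
per_stratum = m.groupby("stratum", observed=True)["treated"].agg(
    m_T="sum", m_C=lambda s: (s == 0).sum())
m = m.join(per_stratum, on="stratum")
# Treated units get weight 1; controls are reweighted stratum by stratum.
m["w"] = np.where(m["treated"] == 1, 1.0,
                  (m["m_T"] / m["m_C"]) * (M_C / M_T))
```

After this step, the weighted control counts match the treated distribution across strata, and any regression can be run on the uncoarsened data with weights `w`.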

The rationale for using CEM instead of, e.g., *propensity score matching*, is that CEM is more transparent and straightforward, deals with *common support* by construction, gives priority to balancing (thus reducing bias and model dependence) over variance (high precision), meets the congruence principle, is computationally efficient, and reduces sensitivity to measurement error (which would otherwise lead to biased estimates of the ATT; see Iacus et al. (2008)). In Figures 5 and 6 in the Appendix, two kernel density plots of aggregated days in unemployment from 1998 to 2001 and two histograms of age are graphed before and after matching, to give a visual representation of the matching process. The treatment group has a higher density over the left region of *days in unemployment* and a lower density over the right region; this skewness is adjusted through the matching. A similar adjustment takes place for age, where the treatment group has a lower density in the left region and a higher density in the right. Notably, the sample under study becomes quite young compared with the population. Since balancing the two groups against each other changes the average sample characteristics relative to the population, applying CEM means that we in effect measure the *sample average treatment effect on the treated* (SATT). Matching in 2001 was exact (unless otherwise specified) on gender, level of education, and marital status. Coarsened exact matching was performed on aggregate days in unemployment from 1998 to 2001^{16} (*ε*_{unemp.} = 2 days), age (*ε*_{age} = 5 years), and annual earnings in 2000 ({\epsilon}_{\mathit{\text{earnings}}}^{v} where *v* = [0, 5000, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 60,000, 100,000, 200,000]). For non-western immigrants, the income distribution was completely different, and the following break points were chosen to obtain a sufficient number of matches: *v* = [0, 3000, 10,000, 50,000]. Coarsened exact matching was also applied to the number of children over age groups ({\epsilon}_{{\mathit{\text{children}}}_{\mathit{\text{age}}}}^{v} where *v* = [0.5, 1.5, 2.5] and *age* = [0–3, 4–6, 7–10, 11–15, 16–17, 18+]). When matching the non-western sample, *marital status* was not included, since it reduced the number of matches and did not help to establish parallel trends. *Region of birth* was not included, since it reduced the number of matches severely; if *region of birth* were included in the regression, an *F*-test would not be able to reject the hypothesis that its coefficients are zero (at the 5% significance level)^{17}.