### 3.1 Definition of overeducation

To determine whether someone is overeducated the norm within the occupation has to be determined. Each individual’s educational attainment is then compared to the norm in that occupation. Individuals with higher educational attainment than the norm are defined as overeducated, while individuals with lower educational attainment than the norm are defined as undereducated. Individuals whose education is the same as the norm in the occupation are defined as correctly matched, or as having the required level of education.

There are different methods to determine the occupational norm. One is the so-called realized matches approach, in which the norm is defined as the range of years of schooling within one standard deviation of the occupational mean; individuals are classified as undereducated, overeducated or having the required education in relation to this norm (Verdugo and Verdugo, 1989). A second method uses the most frequently occurring number of years of schooling within the occupation, i.e. the modal value, instead of the mean. These two approaches are used in the current study.
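The two realized-matches norms can be sketched in a few lines of Python. This is only an illustration: the function names and the toy occupation are invented for the example, and two details that the text does not specify are assumed here, namely the use of the population standard deviation and breaking ties between several modes by taking the lowest one.

```python
from statistics import mean, multimode, pstdev


def classify_mean_sd(years, occupation_years):
    """Classify a worker against the mean +/- one SD band of the occupation."""
    m = mean(occupation_years)
    sd = pstdev(occupation_years)  # assumption: population SD, not sample SD
    if years > m + sd:
        return "overeducated"
    if years < m - sd:
        return "undereducated"
    return "required"


def classify_mode(years, occupation_years):
    """Classify a worker against the modal years of schooling in the occupation."""
    norm = min(multimode(occupation_years))  # assumption: lowest mode breaks ties
    if years > norm:
        return "overeducated"
    if years < norm:
        return "undereducated"
    return "required"


# Hypothetical occupation with ten workers (years of schooling).
occupation = [9, 11, 12, 12, 12, 12, 13, 14, 15, 16]

for years in (9, 12, 14, 15):
    print(years, classify_mean_sd(years, occupation), classify_mode(years, occupation))
```

In this toy occupation a worker with 14 years of schooling falls inside the one-standard-deviation band but lies above the modal value of 12, so the two norms classify the same worker differently.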

A third method is to define the norm by using job analysis. Professional job analysts determine the educational requirements for a job and the individual’s educational attainment is compared to this. A fourth method is worker self-assessment where workers are asked in surveys about the educational requirements of their job.

All methods have their weaknesses and strengths (see Hartog 2000 for a discussion), but in many cases the choice of method is driven by data availability. In our case, we do not have access to survey data, so we cannot use self-assessed educational requirements for an individual’s job as a way of measuring overeducation. On the other hand, we have detailed register data on all employees, and workers have been divided into more than 110 distinct occupations. Results from ORU earnings equations have been found to be robust to whether the reference level of education is measured according to realized matches or worker self-assessment (Chiswick and Miller 2010c).

### 3.2 Data and sample restrictions

We use Swedish register data for the period 2001–2008. The data are collected primarily for administrative purposes and maintained by Statistics Sweden. They cover all individuals aged 16–64 who were registered as residing in Sweden at the end of December each year. In the main analysis, regressions are estimated separately by gender and separately for natives (those born in Sweden, regardless of where their parents were born), Western immigrants (those born in a Nordic country, within the EU15 or in North America), and non-Western immigrants (those born elsewhere). The population used in our analysis is restricted to those aged 25–57 who were employed in November each year and for whom we have information on both occupation and education. The year 2001 has been chosen as the starting year since it is the first year for which information on occupations exists in the registers. Occupations are classified using the SSYK code in the Swedish registers. We define occupations at the three-digit level, which leaves us with 113 occupations. Occupations with fewer than 100 workers are excluded, as are military personnel. Following previous literature, we also exclude the self-employed.

When defining the norm we include workers who are between 25 and 57 years of age, who have not been enrolled in education during the year and who have been in Sweden for three years or more. The most recently arrived immigrants are excluded when we calculate the norm but are included in the analysis of overeducation.

In section 5 we analyze the wage effects of over-, under- and required education. Information on wages exists in the Swedish registers for all employees in the public sector and for a sample of about 45 percent of those employed in the private sector.

The probability of being overeducated is analyzed using the whole sample while the ORU-regressions used for analyzing wage effects of overeducation are based on the sample of employees for whom information on monthly wages exists. The analysis of state dependence in overeducation is based on a balanced panel for the period 2001–2008.

Information on both occupation and education has to be present for an individual to be included in the sample. Information on education is missing for more immigrants than natives, especially newly arrived immigrants: it is missing for less than 0.1 percent of natives but for around 3 percent of non-Western immigrants. Information on education is collected in different ways for different segments of the population. For those educated in Sweden, whether native or foreign-born, the information stems from reports that Statistics Sweden continuously receives from the educational institutions. This information is generally of high quality. For those with education dating back to before 1990, the 1990 census (the latest census in Sweden) has been used.^{2}

One challenge is that the quality of the education variable may be less consistent for the foreign-born who immigrated after completing their education in their home country or in a country other than Sweden. Those registered as new immigrants in Sweden are asked by Statistics Sweden to fill out a questionnaire about their education, but many who receive the questionnaire do not answer it, which means that information is lacking for many newly arrived immigrants. However, the information received through the questionnaire is gradually complemented by other data sources: the Public Employment Service for those who have been searching for work through an employment office, and the National Health Board for those who apply for a permit to work as medical doctors, dentists, nurses etc. Still, individuals for whom we have complete information on education may differ from those for whom it is lacking, which creates a potential selection problem. This selection problem is in part mitigated by omitting the most recently arrived immigrants when constructing the educational norms.

Note that measurement errors in education also exist for those who have grown up and been educated in Sweden if they have also studied abroad. Such individuals are not covered by the questionnaire sent out by Statistics Sweden. For example, a person with a BA from a university in Sweden and a PhD from a university in the US will have a BA recorded as their highest degree according to the statistics, as the Swedish degree is the only recorded one. There could be a difference, however, between the foreign-born growing up in Sweden and native Swedes regarding how often they study abroad and receive their highest degree from a country other than Sweden.

A final issue in comparing an education from Sweden with one received in another country is that the quality could differ. An education acquired in another country could be of higher or lower quality even if it is labeled as the same. Even if two educations are at the same level, employers may prefer the one acquired in Sweden. Some educations acquired in another country cannot be used in Sweden directly but have to be validated by an authority. Validation can be complicated and take time, especially for those coming from countries outside the EU/EEA. It may also be the case that immigrants arriving in Sweden have qualifications in occupations with an excess supply in the Swedish labor market. Finally, some occupations may require good knowledge of the Swedish language, making it impossible to get a job before this requirement is fulfilled. We will address these issues via the empirical specification.

### 3.3 Econometric analysis

In the empirical part of the paper, we first present the incidence of overeducation among natives and immigrants in Sweden. Second, we analyze the wage effects of over-, under- and required education by estimating the ORU model first developed by Duncan and Hoffman (1981):

$$
\ln w_{it} = \beta_0 + \beta_1 \mathit{UE}_{it} + \beta_2 \mathit{RE}_{it} + \beta_3 \mathit{OE}_{it} + \delta X_{it} + u_{it} \tag{1}
$$

Undereducation is measured as years of deficit education in relation to the “norm” in the occupation, which we derive either from the mean plus/minus one standard deviation or from the modal years of schooling. Overeducation is in turn measured as the number of excess years of schooling an individual has. Years of undereducation are set to zero for all except those who are defined as undereducated, and years of overeducation are set to zero for all except those who are overeducated. Required education corresponds to the number of years that is the norm within the occupation.
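As a minimal sketch of how the three ORU regressors relate to actual schooling, the snippet below splits a worker's years of schooling into required, over- and undereducation years. It assumes a single norm value per occupation (as with the modal norm); how the band-based norm is translated into years is not spelled out in the text and is not modeled here. The function name `oru_years` is hypothetical.

```python
def oru_years(actual_years, required_years):
    """Split actual schooling into required (RE), over- (OE) and under- (UE) years.

    Assumption: the occupational norm is a single number of years.
    """
    over = max(0, actual_years - required_years)
    under = max(0, required_years - actual_years)
    return {"RE": required_years, "OE": over, "UE": under}


# Hypothetical workers in an occupation whose norm is 12 years.
for actual in (10, 12, 15):
    parts = oru_years(actual, 12)
    # Under this definition actual schooling equals RE + OE - UE.
    assert actual == parts["RE"] + parts["OE"] - parts["UE"]
    print(actual, parts)
```

At most one of OE and UE is non-zero for any worker, which matches the description that each is set to zero for everyone outside the corresponding group.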

There are several problems related to the ORU specification of the wage equation, in particular omitted variables and measurement error. In the literature on the return to acquired education, the endogeneity problem has long been acknowledged, and specifications that try to correct for it are almost always estimated, primarily using instrumental variables. In the ORU specification, the problem arises if sorting into years of under-, over- or required education is correlated with the error term, i.e. with some unobservable variable that is also correlated with wages. If this is not taken into account in the empirical specification, we cannot claim to have estimated a causal effect of overeducation on wages. Although the focus of this paper is not primarily on estimating a causal wage effect, we estimate wage regressions controlling for individual fixed effects. This will not, however, correct for unobservables that change over time; the individual fixed effects only take care of time-invariant unobservables that are correlated with both wages and years of overeducation.

A second problem in the ORU specification is measurement error. We have already discussed the problems surrounding the education variable in the Swedish registers, which is used to determine both an individual’s acquired years of schooling and the years of schooling required for a job. Leuven and Oosterbeek (2011) point out that measurement error in the key variables of the ORU analysis is likely to be even more severe, since both over- and undereducation are defined as the difference between acquired and required years of schooling. We have also discussed the possibility that measurement error is more severe for immigrants than for natives, in particular if their education was received in the country of origin.

In spite of these problems, many researchers have estimated the ORU model, and the results are remarkably consistent over both time and space (see Hartog 2000): (1) The returns to actual years of schooling are lower than the returns to required years of schooling; (2) The returns to overeducation are positive, but smaller than the returns to required education, i.e. β_{3} > 0 but β_{3} < β_{2}. This means that overeducated workers earn more than correctly matched workers in the same types of jobs but less than correctly matched workers with the same years of schooling; (3) The returns to undereducation are negative, but the estimate is smaller in absolute value than the estimate for the returns to required education, i.e. β_{1} < 0 but |β_{1}| < β_{2}.
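A small numerical illustration may help with findings (2) and (3). The coefficient values below are purely hypothetical, chosen only to satisfy β_{3} > 0, β_{3} < β_{2}, β_{1} < 0 and |β_{1}| < β_{2}; they are not estimates from this or any other study, and the controls in X are ignored.

```python
def log_wage(re_years, oe_years, ue_years, b0=0.0, b1=-0.03, b2=0.08, b3=0.04):
    """Predicted log wage from the ORU equation with hypothetical coefficients."""
    return b0 + b1 * ue_years + b2 * re_years + b3 * oe_years


# Overeducated worker: 15 years of schooling in a job requiring 12.
overeducated = log_wage(re_years=12, oe_years=3, ue_years=0)
matched_same_job = log_wage(re_years=12, oe_years=0, ue_years=0)
matched_same_schooling = log_wage(re_years=15, oe_years=0, ue_years=0)
print(matched_same_job < overeducated < matched_same_schooling)

# Undereducated worker: 10 years of schooling in a job requiring 12.
undereducated = log_wage(re_years=12, oe_years=0, ue_years=2)
matched_with_10_years = log_wage(re_years=10, oe_years=0, ue_years=0)
print(matched_with_10_years < undereducated < matched_same_job)
```

With these illustrative values, the overeducated worker is predicted to earn more than a matched colleague in the same job but less than a matched worker with 15 years of schooling, while the undereducated worker earns less than matched colleagues in the same job but more than a matched worker with the same 10 years of schooling.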

One concern that has been raised in previous studies is whether unobserved heterogeneity can influence the results (e.g. Chevalier 2003; Bauer 2002; Korpi and Tåhlin 2009; Nielsen 2011). Bauer (2002) argues that controlling for unobserved heterogeneity might be important if individuals with lower ability need more education to acquire a job for which they are formally overeducated. He further argues that if there is a negative correlation between the probability of being overeducated and ability, then one would expect to underestimate the returns to overeducation and overestimate the returns to undereducation when not controlling for unobserved heterogeneity.

In the case of immigrants, it can also be argued that some employers might require a stronger signal, i.e. more formal education for the same job from an immigrant applicant than from a native one. In the hiring process, a high level of education is an indication of high ability and conscientiousness, but this may be offset by a general skepticism towards people with a foreign background. Thus, it is not a priori clear how the results are expected to change by controlling for unobserved heterogeneity, in particular for immigrants, given that many studies point to a tendency of immigrants being discriminated against in the hiring process in the Swedish labor market (Carlsson and Rooth 2007; Bursell 2007; Arai et al. 2010).

Leuven and Oosterbeek (2011) are critical of the attempts that have been made to control for unobserved heterogeneity, both using fixed-effect models and instrumental variables. As a result, they argue that it is very difficult to get a credible estimate of the causal wage-effect of being over or undereducated.

The discussion above has mostly been about selection into over- and undereducation and how it may be correlated with ability, given employment. Another type of selection stems from the fact that we observe the occupation only for those who are employed. In Sweden, one of the main issues in the debate about integration of immigrants is that their employment rates are substantially lower than those of natives. A general tendency in the Swedish labor market is that employment increases with educational attainment (Eriksson 2011). Among highly educated individuals, education (and thereby overeducation) may therefore be positively correlated with the probability of being employed.

In our sample, almost 90 percent of native men were employed in November 2008. Among native women, employment is slightly lower, except for those with higher education of three years or more and those with post-graduate education, where employment rates for women and men are about the same. Employment is about 25 percentage points higher among native men and women than among immigrants. A number of factors affect immigrants’ probability of getting a job given their education: where they live (Zenou et al. 2010), which type of job they apply for (Carlsson and Rooth 2007)^{3}, and the period of arrival in Sweden (Åslund and Rooth 2007).

### 3.4 Estimating state dependence in overeducation

In the introduction it was argued that state dependence in overeducation might be a more severe problem than the incidence of overeducation. If it exists and is higher among immigrants than natives, this indicates that a high incidence of overeducation among newly arrived immigrants is not only an initial problem but can have long-lasting negative effects on their labor market integration. Therefore, it is important to estimate the effect of earlier overeducation on future overeducation.

Following Mavromaras and McGuinness (2012), the model to be estimated is

$$
\mathit{OE}_{it} = X_{it}'\beta + \gamma\,\mathit{OE}_{it-1} + \epsilon_i + u_{it} \tag{2}
$$

where ϵ_{i} is the unobserved heterogeneity, which together with *u*_{it}, assumed to be *iid*, makes up the error term. The dependent variable is a dummy variable taking the value one if the individual is overeducated in period *t* and zero if not. Since the left-hand side variable is a dummy variable we would like to estimate a probit model, and since we use panel data we could choose between a fixed effects and a random effects model. However, since we are not only interested in the effect of time-varying covariates on the outcome, and the variable of interest is in fact a lagged dependent variable, a random effects model is the one we should estimate. But estimating a simple random effects probit model would lead to biased estimates of the effect of previous overeducation on present overeducation.

To establish whether there is a direct effect of lagged overeducation on present overeducation, net of all factors that affect the probability of being overeducated in the first place, we need to address two problems. The first is the so-called initial conditions problem, which occurs because the lagged dependent variable is likely to be correlated with the individual effect, ϵ_{i}. Unobservables that are correlated with the outcome will in almost all cases be correlated with the lagged dependent variable. Three different methods to correct for this have been suggested, developed by Heckman (1981), Orme (2001) and Wooldridge (2005). A comparison of these three estimators has shown that none of them outperforms the other two, and all three display satisfactory results in most cases. However, the Heckman estimator, for which Stewart (2006) has developed Stata code, is more time consuming than the other two (Arulampalam and Stewart 2009). We have therefore chosen to follow Wooldridge (2005), where the relationship between the individual effect and the lagged dependent variable is modeled conditional on the initial value of overeducation and exogenous explanatory variables.

The second problem arises because of the assumption of independence between the covariates and the error term. This is resolved by applying the Mundlak correction, which in practice means that we include individual means of each of the time-varying variables that are assumed to be correlated with the unobserved heterogeneity (Mundlak 1978). In our case, individual means of age, number of children, years of schooling, and years in Sweden (for immigrants) are included. The model to be estimated then becomes:

$$
\mathit{OE}_{it} = X_{it}'\beta + \gamma\,\mathit{OE}_{it-1} + \delta\,\mathit{OE}_{it=0} + \overline{X}_{i}'\alpha + \epsilon_i + u_{it} \tag{3}
$$
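The extra regressors in equation (3) can be built from a person's panel before estimation. The sketch below is an illustration only: the function and key names are hypothetical, the individual means are taken over all observed periods including the initial one (an assumption, since the text does not say which periods enter the means), and the random effects probit itself is not estimated here.

```python
from statistics import mean


def wooldridge_mundlak_rows(panel, time_varying):
    """Build estimation rows for one individual.

    panel: list of dicts ordered by period, each with "OE" (0/1) and covariates.
    time_varying: names of covariates whose individual means are added.
    """
    initial_oe = panel[0]["OE"]  # OE in the first observed period
    # Assumption: means are computed over all periods, including the first.
    means = {f"mean_{v}": mean(row[v] for row in panel) for v in time_varying}
    rows = []
    for prev, cur in zip(panel, panel[1:]):
        row = dict(cur)
        row["OE_lag"] = prev["OE"]      # lagged dependent variable
        row["OE_initial"] = initial_oe  # Wooldridge initial condition
        row.update(means)               # Mundlak individual means
        rows.append(row)
    return rows


# Hypothetical individual observed for three periods.
person = [
    {"OE": 1, "age": 30, "children": 0},
    {"OE": 0, "age": 31, "children": 1},
    {"OE": 1, "age": 32, "children": 2},
]
rows = wooldridge_mundlak_rows(person, ["age", "children"])
for r in rows:
    print(r)
```

The first period is used only to supply the initial condition and the first lag, so a person observed in T periods contributes T − 1 rows to the estimation sample.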

There are basically two ways in which the incidence of overeducation can change: (i) if the individual’s years of schooling change or (ii) if the norm within the individual’s occupation changes. The norm can in turn change for two reasons: (a) if the individual stays in the same occupation and the norm within that occupation changes or (b) if the individual changes occupation. The primary source of changes in overeducation is likely to be job changes. However, changing jobs is no guarantee of improving the match: one could end up in another job for which one is overeducated. This motivates us to investigate the effect of a job change between period *t-1* and period *t* on overeducation in period *t*. The full model to be estimated then becomes

$$
\mathit{OE}_{it} = X_{it}'\beta + \gamma\,\mathit{OE}_{it-1} + \delta\,\mathit{OE}_{it=0} + \rho\,\Delta\mathit{job}_{it} + \overline{X}_{i}'\alpha + \epsilon_i + u_{it} \tag{4}
$$

where ∆*job*_{it} is defined as *job*_{it} − *job*_{it−1}. This is a dummy variable that takes the value one if the worker has changed job between period *t-1* and period *t* and zero otherwise.
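One way to construct such a dummy is sketched below. It assumes, for illustration only, that a job change is proxied by a change in the recorded occupation code between two consecutive periods; the text does not state how job changes are identified in the registers, and the function name and codes are hypothetical.

```python
def job_change_dummies(occupation_codes):
    """Return one dummy per period t >= 1: 1 if the code differs from period t-1.

    Assumption: a change of occupation code stands in for a job change.
    """
    return [
        int(cur != prev)
        for prev, cur in zip(occupation_codes, occupation_codes[1:])
    ]


# Hypothetical three-digit occupation codes for one worker over four periods.
codes = ["213", "213", "241", "241"]
print(job_change_dummies(codes))  # [0, 1, 0]
```

No value is produced for the first period, consistent with the lagged specification in which the first observation only provides initial values.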