
Theme: Design of imputation

0. General information

0.2 Module code

Theme-design of imputation

0.3 Version history

Version  Date        Description of changes    Author          Institute
1.0      13-12-2011  First version             Andrzej Młodak  GUS (PL)
1.1      15-12-2011  First corrected version   Andrzej Młodak  GUS (PL)
2.0      15-02-2012  Revised version           Andrzej Młodak  GUS (PL)
3.0      30-05-2012  Third version             Andrzej Młodak  GUS (PL)

0.4 Template version and print date

Template version used 1.0 p 3 d.d. 28-6-2011

Print date 25-5-2023 17:12


Contents

General description – Theme: ...................................................................................................................3

1. Summary .......................................................................................................................................3

2. General description .......................................................................................................................3

2.1. Whether to use imputation .........................................................................................................3

2.2. Preliminary verification of the target variable ...........................................................................6

2.3. The choice of auxiliary variables ...............................................................................................7

2.4. The choice of the imputation method ........................................................................................8

2.5. Quality control .........................................................................................................................15

2.6. Disclosure of the output ...........................................................................................................17

3. Design issues ..............................................................................................................................18

4. Available software tools .............................................................................................................19

5. Decision tree of methods ............................................................................................................19

6. Glossary ......................................................................................................................................20

7. Literature ....................................................................................................................................20

A.1 Relationship with other modules ............................................................................................23

A.1.1 Related themes described in other modules ...................................................................23

A.1.2 Methods explicitly referred to in this module ................................................................23

A.1.3 Mathematical techniques explicitly referred to in this module ......................................23

A.1.4 GSBPM phases explicitly referred to in this module .....................................................23

A.1.5 Tools explicitly referred to in this module .....................................................................23

A.1.6 Process steps explicitly referred to in this module .........................................................23



General description – Theme1:

1 The author of this module would like to express his gratitude to Mrs. Eva Elvers and Mr. Stefan Berg (Statistics Sweden) for very valuable comments and suggestions.


1. Summary

As the theme module of this chapter states, two reasons for imputation (rather than other ways of estimation) are convenience and quality. Convenience is connected with using complete data files – but some care must be taken (for instance, one has to consider how to estimate the variance). The quality of the imputation depends on the information available, e.g. auxiliary variables that can be used (and how well they can be used). When we impute one or more values for a unit we have to consider the internal consistency for that unit. Imputation is often conducted together with editing. Imputation depends on a model and should generally not be used for very influential units. In such cases we need to re–contact the unit or possibly have an expert conduct the imputation. These problems, as well as the choice between re–weighting and imputation of unit non–response, also depend on convenience and quality. All these aspects should be taken into account when we design the imputation of statistical data.

In this module the particular stages of such an imputation process design are described. Strictly speaking, when planning this process we must take into account several important aspects which affect the final quality of the imputed values and of the estimates of aggregated descriptive statistics for the population under investigation. That is, we have to address the following issues:

recommendations for the application of imputation,
the choice of auxiliary variables for a particular imputation model,
recommendations for the use of different methods,
assessment of the model fit and accuracy,
optimization of the cost, timeliness and formal quality components of the process,
the effect of adding or not adding disturbance terms to some models of imputation,
relationships between particular methods.

These topics will be discussed in more detail later in the module. We investigate various aspects of imputation, such as: the necessity to perform imputation at all, a preliminary review of the target variable, the choice of auxiliary variables, the selection of the class of imputation methods and of a particular method within the class, the predicted quality of the imputed values in terms of their deviation from the ‘true’ values and – in terms of mean square error, coefficients of bias and components of total variance – the precision of estimation using available and imputed data, the identification of possible additional bias, and the output format and release of the final results. In successive sections all these topics are described in detail. Together they provide a design scheme for the imputation process.

2. General description

2.1. Whether to use imputation


In the main theme module of this chapter two main premises were presented as arguments for conducting imputation. The first one points out the convenience of complete data files: incompletely filled databases are undesirable from the point of view of further data processing, analysis and presentation, as missing data can cause discrepancies in distributions presented in relevant contingency tables or make it impossible to compute direct estimates. According to the second premise, imputation can be used to increase the quality of estimates of relevant parameters for the population and the quality of the modelling of variable distributions, as well as the quality of microdata. Access to high–quality population data is necessitated by one of the fundamental objectives of statistics: to provide reliable and consistent information for users.

It is worth noting that units for which data on the target variable are unavailable can be significantly different from the remaining ones. As a result, the means in both groups will also clearly differ. If so, within the subset of units for which data on the target variable are available, no units may be ‘similar’ to the ones to be imputed. Recall that imputation should be performed mainly to improve the quality of estimation of basic descriptive statistics, i.e. the mean, quantiles, variance, skewness, kurtosis, etc., for a higher level of aggregation or for the whole population, or for convenience. This means that using only the available information in such situations can lead to seriously biased (practically false) results.

At this point we would like to repeat the opinion expressed in the main theme module that even if missing values do occur, a decision not to impute can be taken; weighting is then used instead. That is, instead of imputation a researcher can choose to perform estimation or analysis directly. In this situation the researcher should be sure that this choice will have no significant consequences for the quality of the intended statistical output.

First of all, we should verify whether providing a given piece of information is necessary at all. In this context we should be aware of the important distinction between item non–response and unit non–response. Imputation is used more frequently in the former case than in the latter (for a non–responding unit we have practically no basis for estimating the data with satisfactory quality). It is essential to be sure that a given value is really missing, i.e. that a blank does not mean zero, for instance. The other parts of the survey process should therefore be designed in such a way that data can be processed and stored while keeping the distinction between missing values and zeros.


If the missing values concern only items which do not have to be filled (such as international exchange in the case of a company operating only within one country – see the main theme module of this chapter), imputation makes no sense (in this case blanks mean zeros). But it is seldom the case that a survey consists only of such questions. If the variables are necessary but have missing data, then to take a decision about imputation we must focus on their properties. Thus, one should consider primarily whether imputation is necessary at all. More precisely, the use of several criteria is recommended. The first of them is, of course, the amount of missing data. If it is very small in relation to the total number of units (e.g. the sample size), one can expect that imputation may not significantly improve the population estimates. In other words, the distribution of the target variable is in this case so regular that imputation of missing data will not significantly improve the population statistics, and using such statistics computed from the available data will be quite sufficient. Therefore, to avoid additional costs, it may be better not to use imputation. At face value, this statement may seem to contradict opinions presented in the main theme module, but the contradiction is only apparent. In most cases it is much more convenient to have a full dataset with imputed data than only a subset with full information. This is especially true if the distribution of the missing values is unknown and cannot be assessed. Otherwise, i.e. if we can suppose with high probability that the distribution of the unknown values will be strictly concentrated around ‘typical’ values (and hence their impact on population statistics, like the mean, will be very small), we can drop the imputation. Such a situation can occur, for example, if we know the revenues of businesses and we would like to know what amount of CIT tax they have paid: if the revenues of the companies with missing data on CIT are very close to each other (and concentrated around the overall mean), then – due to the fixed rates of this tax – we can suppose that the distribution of the amount of CIT paid will also be concentrated around the relevant mean. Hence, their impact on the total mean for the entire population is minimal and can be neglected.

If we have any doubts whether the unknown distribution will really be insignificant, imputation is necessary. A second (and supplementary) source of advice about the possible use of imputation is the analysis of archival data from an analogous, previously conducted survey, or of relevant data for earlier periods. That is, we can analyze the distribution of the target variable from a historical perspective; given a high probability of outliers or of data gaps observed for units located in the ‘tails’ of these distributions, we can conclude that imputation will be necessary. On the other hand, the opposite recommendation can be formulated if we observe that the distribution of the available data was usually almost identical to that of the global values of the variable. This analysis is strictly connected with the verification of the target variable made at the start of the design of imputation, which is described in Section 2.2.
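A minimal sketch of this screening step may help to fix ideas. It is not part of the original module: the data, the 2% missing-rate threshold and the use of a Kolmogorov–Smirnov comparison with historical data are illustrative assumptions only.

```python
# Illustrative screening: share of missing values plus a comparison of the
# currently available data with an analogous historical distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
current = rng.lognormal(mean=10, sigma=1, size=500)      # hypothetical target variable
current[rng.random(500) < 0.04] = np.nan                 # 4% item non-response
historical = rng.lognormal(mean=10, sigma=1, size=500)   # complete data from an earlier round

missing_rate = np.isnan(current).mean()
available = current[~np.isnan(current)]

# Compare the distribution of the available values with the historical one.
stat, p_value = ks_2samp(available, historical)

if missing_rate < 0.02 and p_value > 0.10:
    decision = "imputation probably unnecessary (few gaps, regular distribution)"
else:
    decision = "imputation recommended"
print(f"missing rate = {missing_rate:.1%}, KS p-value = {p_value:.3f} -> {decision}")
```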

Summing up, imputation should be conducted if at least one of these conditions is satisfied:

1. it is subjectively convenient (i.e. can make the rest of the survey process less complex) and useful to have a completely filled dataset,

2. the amount of non–response is relatively large,

3. the distribution of the available data is expected to be considerably different from that of the global variable,


4. the unavailable data are expected (based on past experience) to belong to the ‘tail’ of the distribution of the target variable,

5. we would like to have consistency within the survey and perhaps also in relation to other surveys,

6. we would like to increase the quality of microdata.

Under the circumstances described in conditions two to four, imputation is necessary to guarantee the proper quality of the estimates of population parameters. Another problem lies in the choice between re–weighting applied in the sampling design and imputation of unit non–response. As noted at the beginning of this module, in the latter case the imputed values (and, in consequence, the final population estimates) can be seriously biased – especially if there are many non–responding units or if a non–responding unit is supposed to be dominant in the population. Re–weighting can reduce the impact of the non–responding units, but it can simultaneously lead to a deformation of the distribution of the analyzed variable in the entire population. Hence, the decision in this matter should be made with great caution. It would be good if re–weighting could be based on the opinions and assistance of highly specialized experts.

Experts can also be used to perform the entire imputation process. That is, their opinion can be useful when the database with the variables to be imputed (and possibly also the variables which can be used as auxiliary ones) is very complex and the mutual dependencies between them are hard to recognize for a non–specialist. If the choice of the imputation method is ambiguous, the assistance of an experienced person is also useful.

One can also perform manual imputation. This approach consists in a data correction performed as a result of a necessary re–contact with the corresponding respondent, or in the estimation of a missing or erroneous value on the basis of relevant subject matter knowledge. The latter case is strictly connected with deductive imputation (see the relevant module), because the estimation is made using basic dependencies by the staff directly involved in data collection and processing. That is, if some dependencies between the collected data hold by definition (e.g. if the number of employees is 0, then the labour cost is 0, or the sum of the numbers of employees in the different units of an enterprise must be equal to the total number of employees of this enterprise), then they should be used to impute missing data as early as possible, i.e. at the collection and editing stages of a statistical survey. Such an approach can be sufficient or will at least significantly reduce the costs of further activities.
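The following sketch illustrates how such ‘by definition’ dependencies can be applied as deductive edit rules at the processing stage. The variable names and rules are hypothetical and only mirror the examples given above.

```python
# A minimal sketch of deductive imputation using edit rules.
import pandas as pd

df = pd.DataFrame({
    "employees":   [0,    12,   None],
    "labour_cost": [None, 340., 125.],
    "unit_a_emp":  [0,    5,    4],
    "unit_b_emp":  [0,    7,    6],
})

# Rule 1: if the number of employees is 0, the labour cost must be 0.
rule1 = (df["employees"] == 0) & df["labour_cost"].isna()
df.loc[rule1, "labour_cost"] = 0.0

# Rule 2: the total number of employees must equal the sum over the
# enterprise's units, so a missing total can be derived deductively.
rule2 = df["employees"].isna()
df.loc[rule2, "employees"] = df.loc[rule2, ["unit_a_emp", "unit_b_emp"]].sum(axis=1)

print(df)
```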

Of course, another option is to do nothing. As noted earlier, in some cases this course of action can be motivated by a small number of gaps or by an even distribution of the existing data together with an allegedly even distribution of the missing values. In other situations it would be unfavourable because of the large bias of the estimated summary statistics.


At the end of this subsection it is important to consider the situation of a statistical office (as opposed to an individual researcher), where a survey is run repeatedly and where there are many variables and many units in the sample. In this case there is no time to make detailed studies of missing values each time. On the other hand, the replication of such a survey can be a good opportunity to recognize the most important and regularly occurring problems, including non–response. On the basis of such experience and experts’ opinions one can choose a single good procedure, which will be convenient for future survey rounds, efficient, and sufficiently robust to the occasional outliers which can occur in some replications.

2.2. Preliminary verification of the target variable

When we have decided to conduct imputation, we have to perform a preliminary analysis of the target variable. This step is aimed at avoiding unnecessary effort in dealing with empty data cells which should not be imputed. First of all, we should check whether the population (or sample) contains units which should not be included in the survey. Completing this task is necessary because in some situations only performing the survey can verify the true status of a unit. For example, although an economic entity is registered in the business register, only a direct interview with its representative may reveal that the unit is no longer active. It might have forgotten to submit an update to the register, or possibly the updated information was not entered into the database for personal or technical reasons.

The next stage of verification can involve items which logically should not be filled, or those whose values are implied by the data for other variables. For instance, if we analyze a one–person business (and this size is known e.g. from the business register or from a declaration in the relevant section of the questionnaire), the subsequent cells concerning employees (e.g. the number of persons employed on the basis of a labour contract) have to be empty. Otherwise there is obviously a mistake (on the part of the interviewer or the respondent) and we must explain the problem and remove the false data. We will not develop this problem further here, because it belongs to the issues of editing presented in another chapter of this handbook. In the second situation, a concrete value of a given (and available) variable or variables may imply the relevant value of the target variable (e.g. if the income is 0, the corresponding paid income tax must also be equal to zero). The use of such dependencies leads to deductive imputation, which is described in the relevant module of this chapter.

2.3. The choice of auxiliary variables

Many specialists in the field of imputation (e.g. T. de Waal et al. (2011)) believe that the design of the imputation process depends on the kind of imputation procedure to be followed. As we have shown in the method modules within this chapter, there exist many methods of imputation relying on various computational algorithms and scopes of auxiliary data. To support the selection of the most effective algorithm for a given situation we first have to analyze the available data for the target variable and recognize which useful auxiliary data we can collect.


An analysis of the target variable should concern the diversification of the available data and of its complete records from previous periods or surveys. If the absolute value of the coefficient of variation is in all cases relatively small, then we can consider using quick and simple (but, in general, weakly effective) methods with significant random factors; in this case even looking for additional variables might prove unnecessary. Otherwise, i.e. if the distribution of the current available data is diversified or significantly skewed, or differs considerably from those obtained in previous periods or surveys, then more sophisticated methods supported by auxiliary data are required.
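As a simple illustration of this screening, one could compute the coefficient of variation of the available data and of a previous round. The code below is a sketch with assumed data and an arbitrary threshold of 0.2, not a prescription from the module.

```python
# Illustrative screening of the target variable by the coefficient of variation.
import numpy as np

def coefficient_of_variation(values):
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    return np.std(values, ddof=1) / abs(np.mean(values))

available_now = np.array([98., 101., 103., np.nan, 99., 102.])   # current round, with a gap
previous_round = np.array([100., 97., 104., 101., 99., 103.])    # complete earlier data

cvs = [coefficient_of_variation(v) for v in (available_now, previous_round)]
if max(abs(cv) for cv in cvs) < 0.2:
    print("low variation -> quick and simple methods can be considered")
else:
    print("diversified/skewed distribution -> methods using auxiliary data advised")
```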

Auxiliary variables should be chosen on the basis of reasonable priorities. Of course, all data used for imputation have to be available. The most important of the remaining requirements is a strong connection of the auxiliary variables with the target one. This connection can be verified by analyzing logical relations (first of all topic–related and informative connections) and relevant correlation coefficients, usually using Pearson’s formula, although if we are interested only in the similarity of ranks Spearman’s formula is recommended, and in the case of categorical variables Kendall’s τ (see e.g. R. Lupton (1993) or A. D. Aczel (1998)).
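A short sketch of such a correlation check is given below; the data are simulated, and the choice of coefficient per measurement scale follows the text above.

```python
# Illustrative association check between a candidate auxiliary variable and the target.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(0)
turnover = rng.lognormal(12, 0.8, 200)                   # candidate auxiliary variable
net_profit = 0.1 * turnover + rng.normal(0, 2e4, 200)    # target (observed part)

# Categorized versions for the ordinal/categorical case.
size_class = np.digitize(turnover, np.quantile(turnover, [0.33, 0.66]))
profit_class = np.digitize(net_profit, np.quantile(net_profit, [0.33, 0.66]))

print("Pearson :", pearsonr(turnover, net_profit)[0])        # linear association
print("Spearman:", spearmanr(turnover, net_profit)[0])       # similarity of ranks
print("Kendall :", kendalltau(size_class, profit_class)[0])  # ordinal/categorical case
```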

In this context categorical or categorizable variables have priority. That is, we look for a variable which is strongly correlated with the target one and which can be used as a means of dividing units into several classes of similarity. It would be best if such a variable were categorical (e.g. the level of satisfaction of an enterprise with its output), but in business statistics this is hardly attainable. Much more often we have at our disposal interval or ratio variables, e.g. net profit, number of employees, turnover, etc. In such cases these variables should be categorized, i.e. the range of values should be divided into several intervals (with disjoint interiors) and coded accordingly. Next, every unit should be classified into exactly one of these intervals. For example, we can divide economic entities by the number of employees into small (up to 9, code: 1), medium (10–49, code: 2) and large (50 and more, code: 3) enterprises and classify each unit into the relevant interval. Such an operation facilitates donor imputation by restricting the set of possible donors to a relevant subclass according to this classification, and it can also contribute to improving the quality of other types of imputation. There exist many forward, backward and other automatic procedures for this purpose (cf. e.g. R. L. Chambers et al. (2001a, 2001b)). This option is recommended first of all in the context of hot–deck donor imputation.
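The sketch below illustrates this categorization and its use as imputation cells for a random hot–deck; the data follow the size-class example above but are otherwise assumed.

```python
# Illustrative categorization of a continuous auxiliary variable into size classes,
# used as imputation cells for a random hot-deck.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "employees": rng.integers(1, 200, 60),
    "turnover":  rng.lognormal(10, 1, 60).round(1),
})
df.loc[rng.choice(df.index.to_numpy(), 8, replace=False), "turnover"] = np.nan  # item non-response

# Code the size classes: small (<=9) -> 1, medium (10-49) -> 2, large (50+) -> 3.
df["size_class"] = pd.cut(df["employees"], bins=[0, 9, 49, np.inf], labels=[1, 2, 3])

# Random hot-deck within each imputation cell.
for cls, group in df.groupby("size_class", observed=True):
    donors = group["turnover"].dropna()
    recipients = group.index[group["turnover"].isna()]
    if len(donors) and len(recipients):
        df.loc[recipients, "turnover"] = rng.choice(donors.values, size=len(recipients))

print(df.head())
```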


Sometimes (e.g. in deterministic or regression imputation) we need more auxiliary variables. Of course, the criterion of a strong correlation with the target variable must be retained. But to conduct these types of imputation effectively, each such variable should provide a unique and large information resource: duplicating information should be avoided. This means that we have to establish a set of variables which are mutually weakly correlated and simultaneously retain their multivariate connection with the target variable. This goal may be achieved by using the inverse correlation matrix (see e.g. A. Malina and A. Zeliaś (1998)). That is, we determine the inverse of the correlation matrix and analyze its diagonal elements. They belong to the interval [1, +∞). If one of them is too large (i.e. exceeds an arbitrarily established threshold – usually 10.0), then the corresponding variable is too strongly correlated with the others and should be eliminated. A slightly more complex situation occurs when more than one variable shows a bad diagonal entry. In this case we should analyze different variants of elimination and choose the best one. Sometimes, to achieve the desired low level of mutual correlation, it is sufficient to eliminate only one of the ‘suspect’ variables, so that as much of the original information as possible is saved and the cost of the process is also reduced. Therefore, if two or more variables are ‘bad’ in this context, some degree of subjectivity in the elimination of variables should be exercised. This method is very useful in practice, because it takes all possible connections, including hidden ones, into account. It is also very quick from the computational point of view, although for larger numbers of ‘bad’ diagonal entries the elimination will be more complicated. The best solution seems to be to rely on the experience of previous survey rounds (if there were any) or to consult the problem with experts.
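The following sketch illustrates the screening based on the diagonal of the inverse correlation matrix; the simulated variables and the 10.0 threshold are used only for illustration.

```python
# Illustrative screening of auxiliary variables via the inverse correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 300
turnover = rng.lognormal(10, 0.5, n)
employment = turnover * rng.normal(1.0, 0.05, n)       # almost collinear with turnover
fixed_assets = rng.lognormal(9, 0.6, n)                # weakly related to the others
X = pd.DataFrame({"turnover": turnover, "employment": employment,
                  "fixed_assets": fixed_assets})

R_inv = np.linalg.inv(X.corr().values)
diagnostics = pd.Series(np.diag(R_inv), index=X.columns)
print(diagnostics)                                     # diagonal elements lie in [1, +inf)

too_correlated = diagnostics[diagnostics > 10.0].index.tolist()
print("candidates for elimination:", too_correlated)
```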

It is worth noting, however, that the number of auxiliary variables cannot be too large. For example, if the number of auxiliary variables exceeds the number of units in the sample, the quality of imputation will significantly decrease (because empty classes may occur, or too many interconnections between variables can lead to an artificial reduction of the differences between sampled units).

The variables can be weighted. That is, we can specify which of them are more important and which less; the weights are then used during imputation. A weight can be constructed as the ratio of the coefficient of variation of a given variable to the total sum of such coefficients over all auxiliary variables. Otherwise (especially for categorical variables) the weights may be established on the basis of opinions expressed by experts (e.g. as an average rating computed from scores proposed by several independent specialists). The weighting can improve the quality of estimation because it enables us to focus on the information which is most important from the point of view of the target variable.

2.4. The choice of the imputation method

The most difficult decision to be made in the imputation process, and the most important point of the design of imputation, is the choice of the method to be used. It depends on data availability, the character of the variables (including the target variable) and the type of use of the completed data. Below we describe the theoretical aspects of this choice and then present examples of practical decisions motivated by the type of survey.

2.4.1. Theoretical basis of choice


The type of tools and algorithms used affects the quality of the final results. We should, however, simultaneously consider the problem of cost optimization, timeliness and the formal quality of the imputation process. Thus, if we have several options to choose from, a simulation study on a small sample is recommended. That is, we can use the available data for the target variables (or a subset of them, if the number of responding units is large) and simulate gaps in the data, which are then filled by the relevant imputed values in order to assess their deviation from the true values and the deviation of the aggregate statistics from their actual level. The methods described in the section “Quality control” can be applied here. This exercise can help to recognize the scale of the problem, the speed of the algorithms, the capacity of the IT tools, etc. Of course, the lower the number of data gaps, the less expensive the method used can be – both in terms of the amount of time needed, the necessary computing power and the financial costs of the operations. In any case, however, the optimization of these factors should be taken into account. Below we present the methodological premises which underlie the use of particular imputation methods.
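A very small example of such a simulation study is sketched below; the data, the 10% simulated non–response and the ratio-type imputation rule are assumptions made only for illustration.

```python
# Illustrative simulation study: punch artificial gaps into complete records,
# impute them, and measure the deviation of the imputed values and of the mean.
import numpy as np

rng = np.random.default_rng(4)
n = 400
employment = rng.integers(5, 500, n).astype(float)
turnover = 12.0 * employment + rng.normal(0, 150, n)     # complete "true" data

mask = rng.random(n) < 0.10                              # simulate 10% item non-response
ratio = np.nansum(np.where(~mask, turnover, np.nan)) / employment[~mask].sum()
imputed = np.where(mask, ratio * employment, turnover)   # ratio imputation of the gaps

rmse = np.sqrt(np.mean((imputed[mask] - turnover[mask]) ** 2))
bias_of_mean = imputed.mean() - turnover.mean()
print(f"RMSE of imputed values: {rmse:.1f}; bias of the estimated mean: {bias_of_mean:.2f}")
```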

Suppose that we have no efficient auxiliary variables at our disposal. In the context of business statistics within the EU framework, the situation of having no access to any auxiliary information is rather unlikely, because the Business Register usually provides information on the economic activity and size of an economic entity (although we can imagine a situation when the information on the size of an entity was – e.g. by mistake – not entered into the Business Register and should be imputed). On the other hand, the data collected in the Business Register or in another way may not be useful in some situations. For instance, if we would like to impute net profitability and we only have data on the size of the enterprise, this auxiliary information can be of little use, because profitability is usually very weakly correlated with the size of a company (especially if the tax policy in a given country is restrictive, larger firms can move their profits abroad, to so–called “tax havens”). The method which seems immediately applicable in this situation is the deductive approach. Its main advantage is the use of edit rules (or restrictions) which should be satisfied by the collected data (e.g. if the sum of quantities for lower–level units should be equal to the relevant quantity for a higher–level unit and there is a gap in one of the smaller units, we can easily compute it using this simple dependency). Therefore – as the author of the general theme module and of the module on deductive imputation argues – it is much preferred to other algorithms (even if there are auxiliary variables). In many cases, however, this approach cannot be used and we have to look for another, more efficient alternative (e.g. when more items are missing and cannot be filled using the available rules). The considerations presented below are related only to such a situation. It is assumed that the data are cross–sectional, i.e. much of the information necessary for imputation can be provided by units other than those for which data are to be imputed.


If the target variable is categorical, only random hot–deck imputation can be applied. The lack of other data which could improve the efficiency of imputation excludes the use of cold–deck, deterministic and model–based approaches (even mean imputation is impossible, because arithmetic operations such as averaging cannot be performed on data measured on the nominal or ordinal scale). Our choice is restricted to the unweighted or weighted option. The weighted random hot–deck method can be used if we analyze a sample with known weights associated with the units belonging to the sample (recall that in practice they are usually established as inverse selection probabilities). If the target variable is interval or ratio, we can use mean imputation (with relevant weights, if possible).

Let us now assume that we have some additional information at our disposal. Let the target variable be categorical (e.g. the level of education of an employee). In this case, methods which rely on auxiliary data are recommended. If some of these are categorical, we can use some types of donor imputation: the random hot–deck (with imputation cells) or the distance hot–deck (with a distance formula which takes this character into account, e.g. Gower’s or Walesiak’s measure – the latter is also called the Generalized Distance Measure, GDM). The latter algorithm seems especially recommended if we have more than one auxiliary variable and they vary in character (e.g. if the set of auxiliary variables contains both categorical and interval or ratio ones). If there is only one such variable (but not a categorical one), we can also use the rank hot–deck approach.
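The sketch below illustrates a distance hot–deck with a simple Gower-type distance over mixed auxiliary variables; the data and the equal weighting of the two distance components are illustrative assumptions.

```python
# Illustrative distance hot-deck for a categorical target with mixed-type auxiliaries.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "nace":      ["C", "C", "G", "G", "C", "G"],                # categorical auxiliary
    "employees": [12,   15,  80,  95,  11,  90],                 # numeric auxiliary
    "education": ["med", "med", "high", "high", None, "high"],   # categorical target
})

def gower_distance(a, b, emp_range):
    d_cat = 0.0 if a["nace"] == b["nace"] else 1.0               # simple matching for the category
    d_num = abs(a["employees"] - b["employees"]) / emp_range      # range-normalized difference
    return (d_cat + d_num) / 2.0                                  # equal weights (assumed)

emp_range = df["employees"].max() - df["employees"].min()
donors = df[df["education"].notna()]
for i in df.index[df["education"].isna()]:
    dist = donors.apply(lambda row: gower_distance(df.loc[i], row, emp_range), axis=1)
    df.loc[i, "education"] = donors.loc[dist.idxmin(), "education"]  # closest donor

print(df)
```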

Let us consider the case when the target variable is expressed on the interval or ratio measurement scale (e.g. net profit rate or average wages and salaries). If some of the auxiliary variables are categorical, then the distance hot–deck (with a specially adjusted distance formula) is recommended. Otherwise, we can use regression imputation. If we observe that some respondents have a larger propensity to respond than others, then instead of hot–deck or regression imputation we can use propensity score methods (which seem more appropriate in this case, because they use this important information); and if there are strong dependencies between current and previous data, cold–deck or sequential imputation would be the optimum solutions.

In the context of the choice of the imputation method, the type of use of the imputed data is also important. If we would like to have a completed data set at the unit level (the first possible reason for imputation mentioned in the main theme module), then donor imputation is recommended, because it analyses the individual properties of each unit with complete data and shows which of them is closest to the recipient. If we are interested in imputation for the second reason (i.e. improving the quality of microdata or of parameter estimates), methods which generate imputed values on the basis of probabilistic premises can be used (mean, regression or multiple imputation, depending also on the form and quality of the auxiliary variables). T. de Waal et al. (2011) note that for categorical data there exists a convenient alternative to such approaches: no values for dummy variables are imputed; instead, a completed full–dimensional contingency table is created in which all units are classified, and each cell total in a supplemental margin is distributed over the cells of the full table that contribute to that cell. They also describe another way of imputing categorical data, based on probabilities of particular answers generated from a multinomial distribution fitted to the available data.


As mentioned in the main theme module, for an accurate estimation of a distribution it is advisable to impute values by adding a random disturbance to the best possible prediction according to the model. The methods of modelling the disturbances may vary, but it is important that the dispersion of the distribution of the target variable be retained.

M. Hu and S. Salvucci (2001) note that small random disturbances are sometimes added to imputed values to increase variability. They believe that such disturbances may be drawn using one of the following methods (a small illustrative sketch follows the list):

draw random ‘noise’ from a regular distribution such as the normal N(0, σ²), with mean 0 and variance σ² estimated on the basis of the observed data,

draw a random disturbance from respondents’ residuals of the regression model (in this case additional analysis such as ANOVA is required to identify random effects and separate them from other, e.g. model, effects),

draw a random disturbance from residuals of those respondents who have similar values for some selected auxiliary variables (this step prevents non-linearity and non-additivity in regression models).
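A small sketch (assumed data and model, not taken from Hu and Salvucci) of the first two options is given below: regression predictions are disturbed either with normal noise or with randomly drawn respondent residuals.

```python
# Illustrative regression imputation with an added random disturbance.
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.lognormal(3, 0.4, n)                      # auxiliary variable
y = 2.5 * x + rng.normal(0, 4, n)                 # target variable
missing = rng.random(n) < 0.15

# Fit y = a + b*x on the respondents only.
b, a = np.polyfit(x[~missing], y[~missing], 1)
residuals = y[~missing] - (a + b * x[~missing])

pred = a + b * x[missing]
imp_normal = pred + rng.normal(0, residuals.std(ddof=1), missing.sum())  # N(0, sigma^2) noise
imp_resid = pred + rng.choice(residuals, size=missing.sum())             # drawn respondent residuals

print("variance without noise:", pred.var(),
      "with normal noise:", imp_normal.var(),
      "with residual noise:", imp_resid.var())
```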

2.4.2. On the consistency of data from the point of view of imputation


Coherence plays an important role in the imputation process. This problem is discussed in detail in the module “Coherence and consistency” included in the chapter “Quality Assessment”; here we concentrate only on the aspects that are relevant for imputation. Coherence is a general term referring to the consistency between a set of statistical variables describing finite population parameters in terms of various, but – from the methodological point of view – mutually connected, social or economic phenomena in business reality. In business statistics, two types of coherence are usually distinguished: internal coherence, i.e. coherence within a uniform set of annual and short–term business statistics or between data derived from different sources (surveys, administrative data, INTRASTAT, etc.), and external coherence, i.e. consistency between business statistics and the main macroeconomic indicators, e.g. national accounts, statistics on prices and wages, external trade, etc. In terms of imputation, the level of coherence informs us which statistics can be analyzed together (both as auxiliary and as target variables) and how to indicate their ‘optimum’ levels. In this chapter we usually analyze models with one target variable, but it should be remembered that when it is necessary to impute two (or more) variables for a statistical unit, these imputations should not be made independently. Instead, they should be considered at the same time; otherwise the relationships between the imputed values may be unrealistic. For example, one should consider simultaneously production output and net profitability, fixed assets and employment, production output and foreign trade turnover, sold production and wages and salaries, etc. There are two main ways to process them. The first method consists in the use of the same set of auxiliary variables and the same imputation method. To improve the quality of imputation one can support it by some additional actions. That is, if the non–response items in the target variables vary, then one can construct an additional regression model for each of them, including the other in the collection of auxiliary variables (and using the relevant available data); then, by using it, we can impute the relevant data gaps. The target variables may also be combined into one synthetic result for the units for which data are available for both of them; using such a synthetic variable as a meta–variable, one can construct a regression model with the auxiliary variables as the independent variables. On the basis of such a model, one can impute the missing values.

On the other hand, given a survey with variables similar to bookkeeping data, we certainly have a number of relationships and restrictions to fulfil for each unit when we impute values. Then we will try, for instance, to find similar units to base the models upon. This action may provide essential support for imputation. One should keep in mind that there are differences between short–term statistics, which are produced quickly with a limited set of variables and domains of estimation, and annual statistics, with a large number of related variables and detailed domains of estimation (see Subsection 2.4.3). Imputation is often done together with editing, where consistency is crucial (see the chapter on “Editing”). Imputation depends on a model and should generally not be used for very influential units; then we need to re–contact the unit or possibly have an expert conduct the imputation.

2.4.3. Examples of the impact of survey type on the choice of imputation method


The choice of a method can also depend on the survey type. J. Pannekoek (2009) notes that data sets of the Structural Business Statistics (SBS) are often complex because the variables are related via a hierarchical system of balance edits, and a number of inequality edits must hold as well; these constraints should be taken into account when editing the data. In his opinion, currently applied imputation methods at SN and at other NSIs do not take edit constraints into account; if the information contained in the constraints can be incorporated in the imputation model, the imputations are expected to become more accurate and the process can be simplified. For SBS the most popular imputation method is a regression model with a single predictor for each target variable. A disadvantage of this approach is its boundary character, i.e. it is applied to each target variable separately. Hence, the interconnections between variables are omitted, which often violates the edit rules and leads to biased estimates based on the imputed variables. One can solve these problems by using multivariate models that allow one to impute all variables simultaneously and to take their relations defined by the edit constraints into account. In this context a generic method can be used, based on a truncated multivariate normal distribution, which uses all observed variables as predictors and takes the constraints into account. One problem connected with this method is that it requires complex iterative algorithms for estimating the parameters, which can often be non–convergent. Another possibility in this context is sequential imputation. It is attractive because the algorithms used are relatively simple, the univariate conditional distributions can be modelled much more easily and more flexibly (not necessarily with the same normal linear regression model) than the multivariate distribution, and constraints are easily taken into account. J. Pannekoek (2009) performed preliminary tests which confirmed the usefulness and efficiency of this method, but also showed that the models used should yield fairly accurate predictions, because the predicted values are used as predictors for the next variable. Therefore, he recommends improving it over the standard linear regression model by transformations of the variables or by modelling the conditional distributions of strongly related groups of variables. There is also the possibility of constructing a complex ‘meta–variable’ from the auxiliary variables and using it as a single explanatory variable in the model of regression imputation. The construction of such an index is described e.g. in the paper by A. Malina and A. Zeliaś (1998). This solution leads to a simplification and greater clarity of imputation without any significant loss of the information contained in the auxiliary variables. Moreover, if special verification or normalization methods are applied, all (even hidden) connections between them will be exploited (cf. A. Młodak (2006a, 2006b)).
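The following sketch illustrates the sequential idea on a toy balance edit (costs = wages + other costs): the first variable is imputed by regression and truncated to satisfy non–negativity, and the second is derived conditionally so that the edit holds. Data, model and edit are assumptions for illustration only, not Pannekoek’s implementation.

```python
# Illustrative sequential imputation under a balance edit and a non-negativity constraint.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 200
wages = rng.lognormal(5, 0.3, n)
other = (0.4 * wages + rng.normal(0, 5, n)).clip(min=0)
df = pd.DataFrame({"wages": wages, "other_costs": other})
df["costs"] = df["wages"] + df["other_costs"]            # balance edit: costs = wages + other_costs

both_missing = rng.choice(df.index.to_numpy(), 20, replace=False)
df.loc[both_missing, ["wages", "other_costs"]] = np.nan   # total is observed, components are not

# Step 1: regression of wages on costs, fitted on the complete records.
obs = df.dropna()
slope, intercept = np.polyfit(obs["costs"], obs["wages"], 1)

# Step 2: impute wages first, truncating so that 0 <= wages <= costs.
gaps = df["wages"].isna()
df.loc[gaps, "wages"] = (intercept + slope * df.loc[gaps, "costs"]).clip(0, df.loc[gaps, "costs"])

# Step 3: impute other_costs conditionally on the already imputed wages,
# so the balance edit holds by construction.
df.loc[gaps, "other_costs"] = df.loc[gaps, "costs"] - df.loc[gaps, "wages"]

assert np.allclose(df["wages"] + df["other_costs"], df["costs"])
print(df.loc[both_missing].head())
```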


The second important family of business statistics are surveys aimed at obtaining short–term statistics (STS) data. In this case, using data from previous short periods in the imputation process is natural and strongly recommended. Data can be examined cross–sectionally (data for all units in a fixed time period) or longitudinally (data for a fixed unit over time). However, there are many problems which can occur in this specific situation. The first one is the usually low number of available variables (which could potentially be used as carriers of auxiliary information). Most often only simple descriptive statistics for single variables are to be estimated using imputed values. Hence, there is usually a short list of variables that should be controlled, and the variables are usually continuous with non–negative values. Older databases can also be obsolete, with some advanced statistical functions missing. This fact implies the necessity of using computationally and informatively weaker imputation methods. In short–term statistics, when performing sample surveys (e.g. the collection of data on employment, wages and salaries in small enterprises), an undercoverage of some categories for smaller spatial areas can occur, which seriously limits the set of possible donors and the quality of regression adjustment. Therefore, some restrictions (e.g. in terms of NACE levels) have to be introduced. An alternative solution can be to group small categories (or units) into larger ones (if this is logically and methodologically motivated). One should take into account that, due to the little time before a data release, the imputation should be as automatic as possible (efficient software is necessary) and should provide results which are, to a large extent, robust (i.e. not affected by the small number of responding units which provided their reports on schedule, nor by data volatility). Therefore, assuming that sufficiently powerful software is at our disposal, cold–deck donor imputation can be a good tool here. E. Wein (2009) presents another solution, used in Germany. According to this approach, applied to short–term statistics, for each data set with a missing value in the current reporting month, estimates for previous months are computed using various possible tools. Next, the best of them (i.e. the one for which the distance of the estimate from the true known value in the relevant past month is smallest) is selected. The distances are compiled to form the mean estimated difference. The missing value for the current reporting month is imputed using the best method, adjusted by the mean estimated difference if this mean is smaller than half of the mean estimated value. For variables concerning e.g. changes in turnover, the seasonal component (based on data from some previous months and from the corresponding months of the previous year) is also taken into account.

R. Seljak and T. Špeh (2005) present the experience of the Statistical Office of the Republic of Slovenia (SORS), which began electronic processing of such information in 2004 and has developed many interesting methods. They propose the use of the Fellegi–Holt method for error localization and data imputation (F–H method) in the case of two short–term business surveys – the Monthly Survey on Turnover, New Orders and Value of Stocks in Industry and the Monthly Survey on Wages. According to this approach, the smallest set of variables to be imputed is found (that is, first we look for simple methods other than imputation to fill the gaps and, therefore, the number of non–response items is reduced). Next, the range of acceptance for each of these variables is determined using a special linear combination of auxiliary variables. Finally, the imputed value (from this interval) can be found in several different ways, e.g. by means of the hot–deck method (used horizontally or vertically, i.e. using previously imputed data or not) or the nearest neighbour method.


The acceptance interval is adjusted to the relation between the target variable and the best (i.e. most strongly correlated with it) auxiliary variable. Thus, if an imputed value falls into this acceptance interval, it is accepted; otherwise it is rejected and the search for the next allowable value continues. This method enables us to control the quality of imputation at any step of the imputation procedure.
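A toy illustration of the acceptance–interval idea is sketched below; the interval rule (±20% of the best auxiliary variable) and the candidate values are purely illustrative and are not taken from the SORS procedure.

```python
# Illustrative acceptance-interval check for candidate imputed values.
import numpy as np

def acceptance_interval(aux_value, ratio_low=0.8, ratio_high=1.2):
    # Assumed rule: the target is expected to lie within +/-20% of the
    # best auxiliary variable (e.g. last month's turnover).
    return ratio_low * aux_value, ratio_high * aux_value

donor_values = np.array([95., 250., 102., 110.])   # candidate values from potential donors
aux_for_recipient = 100.0                           # best auxiliary variable of the recipient
low, high = acceptance_interval(aux_for_recipient)

accepted = None
for candidate in donor_values[np.argsort(np.abs(donor_values - aux_for_recipient))]:
    if low <= candidate <= high:                    # keep searching until an allowable value is found
        accepted = candidate
        break
print(f"acceptance interval [{low}, {high}], imputed value: {accepted}")
```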

Imputation also plays a very important role in the INTRASTAT system (i.e. the system of collecting intra–community data on trade). In the handbook by IMF (2009) it is argued that imputations of prices in foreign trade should be undertaken on the basis of what are considered to be reliable price changes of groups expected to have a similar price change. The reliability is based on the sample size, the variability of price changes within the product group (less variable, more reliable) and the robustness of the data source. The authors of this handbook argue that imputation is widely used in this case. Because values or price indices are not available for some units, some proportion of the weight of the commodity group cannot be covered. This is often due to poor data for unit value indices or sampling for price indices; then, imputation is a means by which the coverage is factored up to the weight of the commodity group as a whole. Imputation may also be used for temporarily missing items. Imputation of data processed within the INTRASTAT system can also be necessary in terms of the volume of trade (i.e. the number of exported or imported pieces of goods) and its value. Missing items may therefore be imputed using hot–deck methods on the basis of data for similar entities (where ‘similarity’ is established using a criterion concerning the structure and results of the economic entity) – in this case the nearest neighbour method is desirable – or using data for the same unit from previous periods (cold–deck imputation). For example (cf. CSO (2010)), because data about the value of exports and imports on the VAT3 return for a given company are often missing, such information is imputed as a forecast based on historical data held about the company. This approach is used mainly with respect to entities for which trade figures are below the threshold established by the relevant INTRASTAT regulations. For the remaining units, if the company has a history of returns, that series is used to forecast a value for its missing arrivals/dispatches at the total level.
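The following sketch illustrates such a history–based imputation for a single company; the series and the naive linear trend forecast are assumptions for illustration, not the CSO procedure itself.

```python
# Illustrative imputation of a missing monthly trade value from the company's own history.
import numpy as np
import pandas as pd

history = pd.Series(
    [120., 125., 131., 128., 134., np.nan],        # current month missing
    index=pd.period_range("2011-07", periods=6, freq="M"),
    name="dispatch_value",
)

observed = history.dropna()
t = np.arange(len(observed))
slope, intercept = np.polyfit(t, observed.values, 1)   # simple linear trend on past returns
forecast = intercept + slope * len(observed)            # one step ahead

history.iloc[-1] = forecast
print(history)
```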


To end this subsection, let us analyze some aspects of imputation for the PRODCOM2 system, which provides statistics on the production of manufactured goods. Eurostat (2001) reviewed the current practices applied in this project in various countries. The review also addresses issues connected with the imputation of missing data. It looks at differences in the threshold of enterprise size (expressed as the number of employees or turnover) used to qualify companies for particular surveys within the system. For example, in Iceland all enterprises with 3 or more employees and/or exceeding a certain level of turnover are included in the PRODCOM survey; Greece collects data from enterprises with fewer than 20 employees; and Spain relies on comparisons with the industry survey and takes appropriate action, such as increasing the sample size by including establishments belonging to enterprises with 10–19 employees. Most countries include estimates of missing data in the PRODCOM output data and estimate the missing data (such as turnover, value of sold production, etc.) by relying on an enterprise’s product history or on the sold production trend of other enterprises which manufacture the same product or have similar other parameters (cold–deck and nearest neighbour methods; sometimes deductive imputation is also applied). The latter case occurs e.g. when a unit is newly entered into the survey. It is worth noting, however, that there are special cases which deserve special attention. Many solutions applied in such situations in various countries are discussed in the paper by Eurostat (2001).

As mentioned in the main theme module, instead of constructing a unique imputation model for the entire population, one can use different models for each of the specially established imputation classes. This choice is reasonable if the classes are internally homogeneous (which means that the variance within them is relatively low) and mutually heterogeneous (they differ significantly from one another, i.e. their cross-variance is sufficiently high). Cluster analysis (cf. B. S. Everitt et al. (2011)) may be very useful in this context as a tool for constructing the classes.
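As an illustration of using cluster analysis to form imputation classes, the following R sketch groups units into classes on the basis of two auxiliary variables. The data, the number of classes and the use of k-means are illustrative assumptions only, not a recommendation of a particular clustering method.

    # hypothetical auxiliary data for the units of a business population
    set.seed(1)
    aux <- data.frame(turnover = rlnorm(200, 10, 1), employees = rpois(200, 25))
    # standardise the auxiliary variables so that neither dominates the distance
    aux_std <- scale(aux)
    # form, say, four imputation classes by k-means clustering
    classes <- kmeans(aux_std, centers = 4, nstart = 25)$cluster
    # each unit now carries a class label; a separate imputation model can be fitted per class
    table(classes)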

2 The term comes from the French notion "PRODuction COMmunautaire" (Community Production) used for mining, quarrying and manufacturing, i.e. units performing economic activity classified to sections B and C of the Statistical Classification of Economic Activities in the European Community (NACE Rev. 2).


We will now consider the problem of weighting. It can refer both to sampling and inclusion probabilities and to the weighting of variables, which can be useful e.g. in donor imputation if we assume that particular information resources are of different importance. Weighted versions of the mean and random hot-deck imputation algorithms were indicated earlier, in the modules devoted to model-based and donor imputation. Recall that in donor imputation the weights are also important for the choice of the donor: the larger the weight of a record from the set of potential donors, the greater the probability that it will actually be selected as a donor. The author of the main theme module argues that – except for the aforementioned cases – weighting seems to be unnecessary, because it can even lead to an increase in the imputation error. On the other hand, weighting is necessary to obtain an estimator that is unbiased with respect to the sampling design. How can a compromise be reached between these positions? One option is to make the probability of being a donor proportional to the inclusion probability (cf. G. Kalton (1983), C. J. Skinner et al. (1989)). Another option is to introduce two systems of weights: one for the inclusion of a unit in the sample and one for the selection of a unit as a donor. The latter should be optimized to minimize the possible errors and maximize utility.
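A minimal R sketch of the compromise mentioned above: a donor is drawn with probability proportional to its design weight (the inverse of the inclusion probability). The donor pool, values and weights are hypothetical.

    # hypothetical donor pool with design weights (inverse inclusion probabilities)
    donors <- data.frame(id = 1:6,
                         y = c(12.0, 15.5, 9.8, 20.1, 14.2, 11.7),
                         design_weight = c(1.2, 3.5, 2.0, 5.1, 1.8, 2.4))
    # select one donor with probability proportional to its design weight
    set.seed(123)
    chosen <- sample(donors$id, size = 1, prob = donors$design_weight)
    imputed_value <- donors$y[donors$id == chosen]
    imputed_value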

The second possibility is to do with the weighting of auxiliary variables (cf. R. R. Andridge and R. J. A. Little (2009)). That is, we assume that they are of different importance. Such a premise can be justified by properties of the target variable. For example, if we estimate net profitability, turnover or sold production seem to be more significant in this context than employment. There are various methods of determining weights for particular variables. One of them is to compute the share of a given variable in the sum of the absolute values of the coefficients of variation of all analyzed auxiliary variables. However, this approach rather favors diversification of values (which is very important e.g. in a taxonomic context). Other options are to use experts' opinions or a weighted regression coefficient of the target variable with respect to the auxiliary variable, determined on the basis of the data available for all these variables. Another criterion for the selection of the imputation method can be the minimization of the mean square errors or of the total variance of estimators in the forms described in Section 2.5 ("Quality control"). In other words, we can consider several possibilities and – if they satisfy all remaining subjective and objective criteria – choose the option with the smallest values of these quality indices. A preliminary simulation study can be very useful here.
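The following R sketch illustrates the first of these options, i.e. weights for the auxiliary variables computed as shares of the absolute coefficients of variation. The data are hypothetical.

    # hypothetical auxiliary variables observed for the complete part of the sample
    aux <- data.frame(turnover = c(250, 310, 190, 420, 280),
                      sold_production = c(200, 290, 150, 400, 260),
                      employment = c(25, 30, 22, 41, 27))
    # coefficient of variation of each auxiliary variable
    cv <- sapply(aux, function(x) sd(x) / mean(x))
    # weight of a variable = its share in the sum of absolute CVs
    weights <- abs(cv) / sum(abs(cv))
    round(weights, 3)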

2.5. Quality control


The main purpose of the quality control is to assess whether the imputed values fit as well as possible and how the precision of estimating the population statistics from the imputed data is influenced by adding or not adding disturbance terms to some imputation models. This assessment can be done either at the stage of a preliminary simulation study (see the subsection "The choice of the imputation method") or ex post, i.e. after performing the whole imputation process. Of course, in the latter case it is much more difficult due to a shortage of material for effective comparisons.

Let us now consider the preliminary simulation study. The simplest solution is to compare the true and imputed values using a distance formula. The optimal option should highlight the most important problems in this case. For this purpose, one can use the metrics or other distance formulas presented in the "Donor imputation" module, but the best option from this point of view seems to be the maximum of the absolute differences between the imputed and true values, i.e.

d = \max_{i \in R} |\hat{y}_i - y_i| ,    (1)

where R is the set of units for which data on the target variable Y have to be imputed, and \hat{y}_i and y_i are the imputed and the true value for the unit i \in R, respectively. If sampling is repeated q times (where q \in \mathbb{N}), as a measure of precision we will take the average of the precision over the particular trials, i.e.

\bar{d} = \frac{1}{q} \sum_{k=1}^{q} d_k , \quad \text{where} \quad d_k = \max_{i \in R_k} |\hat{y}_i - y_i| ,    (2)

where R_k is the set of recipients in the k-th trial. To ensure comparability and logical consistency we assume that the cardinality of this set is the same in every trial.
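A small R sketch of formulas (1) and (2) in a simulated setting. The data-generating process and the 'imputation' step (true value plus noise) are illustrative stand-ins for an actual imputation method.

    # hypothetical simulation: q trials, in each trial some values are removed and imputed
    set.seed(7)
    q <- 50
    d_k <- numeric(q)
    y <- rnorm(500, mean = 100, sd = 15)       # 'true' values of the target variable
    for (k in 1:q) {
      R_k <- sample(seq_along(y), size = 50)   # recipients: units with missing values
      y_imp <- y[R_k] + rnorm(50, 0, 5)        # stand-in for any imputation method
      d_k[k] <- max(abs(y_imp - y[R_k]))       # formula (1) within trial k
    }
    d_bar <- mean(d_k)                         # formula (2)
    d_bar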

However, as we have indicated earlier, the main aim of imputation is to estimate descriptive aggregated (most often population) statistics. So we should also assess their quality, because the distances (1) and (2) give no information on e.g. the diversification or skewness of the imputed values, which significantly affect the quality of the estimates of aggregated descriptive statistics within a trial and reduce their impact across trials. Let \theta be the parameter to be estimated and \hat{\theta} its estimate obtained using the relevant estimator and the imputed data. Denote by \hat{\theta}_k the estimate obtained using \hat{\theta} in the k-th trial of the simulation study, k = 1, 2, \ldots, q. The basic measure of the precision of the estimation of \theta is the between-trial Mean Square Error (MSE) of \hat{\theta}, defined as

MSE(\hat{\theta}) = \frac{1}{q^2} \sum_{k=1}^{q} (\hat{\theta}_k - \theta)^2 .    (3)


It can be supplemented with the relative bias (RB), measured by the relative deviation of the mean of the estimates of the analyzed parameter obtained in successive trials from its true value. Both formulas were used e.g. by K. Kim (2000) and G. Chauvet et al. (2011). C. Arcaro and W. Yung (2001) also propose the empirical relative bias (ERB), which, in some sense, combines these two approaches and is based on dividing EB by MSE. In all these cases, the smaller the relevant index is, the better the quality of the estimator. An estimator which minimizes all three indices will be regarded as optimal. The practical efficiency of such an assessment depends on the computational complexity of the estimator used and on the size of the data set (including the number of non-response items). Once the algorithm for the estimator has been prepared, the additional construction of algorithms for quality assessment is not very time-consuming. Moreover, in short-term business statistics with a relatively small scope of data, the computation should be quick and the additional costs seem to be very low.
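A minimal R sketch computing the between-trial MSE of formula (3) and the relative bias (RB) from a set of simulated estimates. The simulated estimates and the true parameter value are hypothetical.

    # hypothetical simulation results: q estimates of a population parameter theta
    set.seed(42)
    q <- 100
    theta <- 50                                 # true parameter value
    theta_hat <- rnorm(q, mean = 50.4, sd = 2)  # estimates from q imputation trials
    mse <- sum((theta_hat - theta)^2) / q^2     # between-trial MSE as in formula (3)
    rb  <- (mean(theta_hat) - theta) / theta    # relative bias (RB)
    c(MSE = mse, RB = rb)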

The most frequently estimated population parameters are the mean and the variance. Both well-known and new methods of estimating the former are presented in the modules devoted to the relevant types of imputation. D. Rubin (1978) demonstrates that the simulation variance of the parameter \theta in successive trials can be estimated as a special combination of the average within-imputation and between-imputation variances. All these approaches are described in detail in the chapter "Quality Assessment" of this handbook.

Let \hat{\theta}_A be an estimator of \theta computed using all sample data on the target variable. The variance estimation is strictly connected with \hat{\theta}. In general, C. E. Särndal (1992) showed that the total variance can be decomposed into sampling, imputation and mixed effect components:

V = V_{SAM} + V_{IMP} + 2 V_{MIX} ,    (4)

where V_{SAM} = E(\hat{\theta}_A - \theta)^2, V_{IMP} = E(\hat{\theta} - \hat{\theta}_A)^2 and V_{MIX} = E((\hat{\theta}_A - \theta)(\hat{\theta} - \hat{\theta}_A)) are the aforementioned components, respectively. Of course, in a simulation study these expected values can be approximated by the arithmetic means of the relevant deviations obtained by consecutive replications of the sampling. J. K. Kim et al. (2006) analyze the problem of variance estimation in a complex sample design when imputation is repeated q times and prove that the difference between the expected value of the multiple imputation variance estimator and the variance of the estimator \hat{\theta} can be expressed asymptotically as minus double the expected value of the conditional covariance of the imputation effect and the estimated parameter, given the data available in the current sample. They have investigated such models for subpopulations (also called domains) and for linear regression models. C. Arcaro and W. Yung (2001) propose approximately unbiased statistics for V_{SAM}, V_{IMP} and V_{MIX} using weighted mean and weighted ratio imputation with weights derived from the traditional generalized regression estimator (GREG) of the mean based on relevant auxiliary data. Assuming that \hat{\theta} is unbiased, K. Kim (2000) proposes unbiased unweighted variance estimators for regression and ratio imputation models. W. A. Fuller and J. K. Kim (2005) prove a formula for a fully efficient fractionally imputed variance estimator based on the squared deviation of the mean estimator and the response probabilities, in particular imputation cells and subpopulations. Similar research for balanced random imputation has been conducted by G. Chauvet et al. (2011).
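The components of decomposition (4) can be approximated in a simulation study by averaging over replications, as in the following R sketch. The simulated estimates are hypothetical and merely illustrate the arithmetic.

    # hypothetical simulation: for each replication we record the full-sample estimate
    # (theta_A_hat) and the estimate after imputation (theta_hat); theta is the true value
    set.seed(11)
    q <- 200
    theta <- 100
    theta_A_hat <- rnorm(q, mean = 100, sd = 3)              # estimator using complete sample data
    theta_hat   <- theta_A_hat + rnorm(q, mean = 0, sd = 1)  # estimator using imputed data
    V_SAM <- mean((theta_A_hat - theta)^2)                   # sampling component
    V_IMP <- mean((theta_hat - theta_A_hat)^2)               # imputation component
    V_MIX <- mean((theta_A_hat - theta) * (theta_hat - theta_A_hat))  # mixed component
    V_total <- V_SAM + V_IMP + 2 * V_MIX                     # decomposition (4)
    c(V_SAM = V_SAM, V_IMP = V_IMP, V_MIX = V_MIX, V = V_total)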


In the case of ex post quality control, we face a much more serious problem than in the situation where the distribution of the target variable is at least roughly known (e.g. from the preliminary simulation study). Because there is no exact reference platform (the 'true' distribution of the target variable is unknown), it is necessary to rely only on an approximate estimation of the imputation precision, using MSE, RB or ERB as the basis. We can use a generalization proposed by R. R. Andridge and R. J. A. Little (2010), who divide a sample into several complete data sets: instead of repeated sampling, we investigate a division of the population into q complete and disjoint data sets and then estimate the variance using statistics computed according to Rubin's approach. This division might be based on an auxiliary variable (preferably a categorical one) strictly connected with the target variable. The classes should have approximately equal numbers of elements. Hence we can obtain an estimate of the error.
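A rough R sketch of the idea: the data are divided into q disjoint groups on the basis of an auxiliary categorical variable and the within- and between-group variances are combined in a Rubin-type formula. This is an illustrative simplification under hypothetical data, not the exact estimator of Andridge and Little (2010).

    # hypothetical data set with an auxiliary categorical variable used to form q groups
    set.seed(3)
    n <- 400
    q <- 5
    dat <- data.frame(y = rnorm(n, 200, 30),
                      aux_class = sample(1:q, n, replace = TRUE))  # roughly equal-sized classes
    # estimate of the mean and its variance within each complete, disjoint group
    group_means <- tapply(dat$y, dat$aux_class, mean)
    group_vars  <- tapply(dat$y, dat$aux_class, function(x) var(x) / length(x))
    W <- mean(group_vars)          # average within-group variance
    B <- var(group_means)          # between-group variance
    T_var <- W + (1 + 1/q) * B     # Rubin-type combination of the two components
    T_var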

In the chapter "Quality Assessment" we will describe how to estimate the particular components of the decomposition (4) in the case of imputation. The author of this module has conducted a relevant simulation study, which showed that the quality of the estimation of the variance components using the statistics presented there is satisfactory.

Summarizing our presentation, the complete quality control of imputation results should involve assessing:

– the efficiency of the sampling scheme (including the calibration of weights, if applicable),
– the precision of the imputation method,
– the impact of mixed effects (interactions between sampling and imputation factors),
– the occurrence of disturbances.

The first three aspects will affect the choice of the optimal sampling design and imputation algorithm. That is, the expected efficiency of the sampling design and the level of interaction between it and the chosen imputation method will be important factors in designing an efficient sample. On the other hand, the precision of the imputation methods and the forecasted distribution of the disturbances which can occur in this process are very important for the choice of the algorithm for filling the data gaps. Thus we are faced with the following dilemma: which problem should be decided first, the choice of the sampling design or the choice of the imputation method? If we have experience from past rounds of a given survey (conducted in previous time periods), we can use these past data to assess the quality of the imputation method and decide whether to maintain or change it. Next, the sampling design can be adjusted accordingly, if necessary. Otherwise, it is recommended to conduct a preliminary simulation study with several options for the sampling and imputation methods. Disturbances are usually random and appear in regression imputation models, which can also be constructed e.g. for ratio imputation.

As mentioned in the main theme module, for an accurate estimation of a distribution it is advisable to impute values by adding a random disturbance to the best possible prediction according to the model. The disturbances may be modelled in various ways, but it is important that the dispersion of the distribution of the target variable should be retained.

M. Hu and S. Salvucci (2001) note that small random disturbances are sometimes added to imputed values to increase variability. They believe that small random disturbances may be drawn using one of the following methods:


– draw random 'noise' from a regular distribution such as the normal N(0, σ²) with mean 0 and variance σ², where the variance is estimated on the basis of the observed data,

– draw a random disturbance from the respondents' residuals of the regression model (in this case additional analysis such as ANOVA is required to identify the random effects and separate them from the others, e.g. the model effects),

– draw a random disturbance from the residuals of those respondents who have similar values for some selected auxiliary variables (this step prevents non-linearity and non-additivity in regression models).
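The first two options can be sketched in R as follows, assuming a simple regression imputation model. The data and the model are hypothetical.

    # hypothetical regression imputation of y from x with a random disturbance added
    set.seed(99)
    x <- runif(100, 1, 10)
    y <- 5 + 2 * x + rnorm(100, 0, 1.5)
    y[sample(100, 20)] <- NA                     # introduce missing values
    obs <- !is.na(y)
    fit <- lm(y ~ x, subset = obs)
    pred <- predict(fit, newdata = data.frame(x = x[!obs]))
    # option 1: normal noise with variance estimated from the observed residuals
    y_imp1 <- pred + rnorm(sum(!obs), 0, sd = summary(fit)$sigma)
    # option 2: disturbances drawn from the respondents' residuals
    y_imp2 <- pred + sample(residuals(fit), sum(!obs), replace = TRUE)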

M. Rueda and S. González (2008) consider a situation where two variables can have missing values for various units and propose a class of imputed estimators using a ratio technique with a random disturbance. They have studied their asymptotic properties and determined the optimal one. This method is especially effective when order statistics such as the median or quartiles are estimated.

Recall (see the main theme module) that, even after adding a random disturbance, the variance of the target variable can still be underestimated, because the uncertainty of the imputation model is not taken into account (cf. D. B. Rubin (1987)). The inconveniences caused by this situation can be assessed by using multiple imputation, more precisely by creating multiple imputed values for each missing value, based on different parameter estimates, random disturbances or models. An important element of such an algorithm is the variance between the imputations per record. This term contributes to an estimate of the level of uncertainty and, by the same token, to broader knowledge about the actual quality of the imputation that has been carried out.
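As a small illustration of the between-imputation variance per record, the following R sketch generates m imputed values for each missing record and computes the spread of these values record by record. The numbers are hypothetical.

    # hypothetical multiple imputation: m imputed values for each of r missing records
    set.seed(5)
    m <- 5; r <- 10
    imputations <- matrix(rnorm(m * r, mean = 50, sd = 4), nrow = r, ncol = m)
    # between-imputation variance per record: spread of the m imputed values for that record
    between_per_record <- apply(imputations, 1, var)
    # records with a large between-imputation variance are the most uncertain ones
    round(between_per_record, 2)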

2.6. Disclosure of the output

The last two steps in the design of imputation involve preparing the proper output product. Firstly, preliminary tables need to be generated. They might have various forms, but they should contain all the information necessary to assess the quality of the imputation. That is, such tables should be generated at the lowest possible level of aggregation. The required descriptive statistics should be accompanied by the identification of individual outliers and by summary quality indices in the form described in the previous section. It is desirable to print comparative draft tables for individual imputed values, but due to the large number of units this step can often be too expensive and may hamper the analysis. If we use several options for the auxiliary variables, sampling methods or imputation algorithms, they should be precisely described, so that we can assess which options lead to the most effective and precise results. The identification of outliers should help to perform relevant case studies. If we have auxiliary data at our disposal, verification should consist in validating the auxiliary data in terms of outliers and – if the result is positive – in an analysis of the mechanism applied and of the 'noise'. Information which is protected under statistical confidentiality should be clearly indicated.

The second and last stage of our procedure is the generation of the final tables. It should be done only after the data have passed all verification steps, and the resulting tables should only contain data which can be disclosed, which are of high quality with respect to the imputation process and which satisfy user expectations. The tabular presentation should be followed by a precise description of the methods used, their quality, the scope of the bias and the interpretation of the statistics.


3. Design issues

Although this module is entirely devoted to design issues, some supplementary remarks are needed. In general, the imputation process is mainly the domain of statistical institutions and researchers. That is, the people conducting the surveys and analysing their results are primarily responsible for properly conducting this procedure and obtaining high-quality imputed values and, in consequence, precise estimators of population statistics. However, it is worth noting that one can also encourage respondents to perform simple imputation activities. For example, if a unit consists of many LKAUs and is a respondent unit, i.e. it collects data from its units, processes them and transfers them collectively to the NSI (as a set of reports for particular units together with a relevant summary or aggregated report), it would be desirable for it to carry out imputation on its own data, if necessary. Such an action can reduce the effort on the part of the NSI and shorten the data processing time, especially if many such large entities with a complex structure take part in the survey. Of course, in such cases deductive imputation is preferred, although some other, more advanced methods (e.g. donor imputation) can also be applied. In the latter case, qualified staff are necessary; therefore, training addressed to such respondents would be recommended. It is clear that all imputations at the level of units should be conducted before the imputation made at the NSI for the entire database from the survey.

4. Available software tools

Imputation can be performed using e.g. the StatMatch package within the R software or procedures within SAS Enterprise Guide written by the specialized staff of NSIs. In simple cases, an Excel spreadsheet can also serve the purpose. The Generalized Distance Measure (GDM) can be computed using the clusterSim package of the R software.
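A hedged sketch of how these tools might be called from R. The data are hypothetical, and the function names and arguments (NND.hotdeck and create.fused in StatMatch; dist.GDM in clusterSim) are assumptions based on the packages' documented interfaces and should be checked against the installed versions.

    # nearest neighbor hot-deck with StatMatch (hypothetical data)
    library(StatMatch)
    donors     <- data.frame(turnover = c(120, 340, 210, 560), y = c(10, 32, 19, 51))
    recipients <- data.frame(turnover = c(150, 500))
    nnd <- NND.hotdeck(data.rec = recipients, data.don = donors, match.vars = "turnover")
    imputed <- create.fused(data.rec = recipients, data.don = donors,
                            mtc.ids = nnd$mtc.ids, z.vars = "y")
    imputed

    # the Generalized Distance Measure via clusterSim (GDM1 for metric variables)
    library(clusterSim)
    gdm <- dist.GDM(donors, method = "GDM1")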

5. Decision tree of methods

Below we present a draft scheme of the design of the imputation process.

[Decision tree: draft scheme of the design of the imputation process]


6. Glossary

Term – Definition – Source of definition (link) – Synonyms (optional)

ANOVA – Analysis of variance, usually part of regression analysis, consisting of the analysis of the distribution model and the random errors of prediction and of their use in assessing the quality of the fit (usually performed using the F test). Source: M. Hu and S. Salvucci (2001).

Between-imputation variance – Variance of an estimator between imputation cells after data imputation. Source: C. E. Särndal (1992).

Components of total variance – The total variance of the imputation usually consists of the sampling effect (the impact of the sampling scheme), the imputation effect (the impact of the chosen imputation method) and the mixed effect (the effect of the two factors combined). Source: C. E. Särndal (1992).

ERB – Empirical relative bias: a complex quality indicator combining bias and MSE, namely the ratio of an estimate to the MSE, decreased by 1. Source: C. Arcaro and W. Yung (2001).

Implant – The value imputed for a missing value of a given variable. This notion is particularly relevant in donor imputation, because the imputed value is taken from the 'true' value of a donor, whereas e.g. in model-based imputation it is usually artificial. In other words, we take the value from a donor (so the value is true for the donor and created artificially for the 'receiver'). It can be perceived as a special case of an imputed value. Source: original. Synonyms: imputed value.

MSE – Mean Square Error: the most common quality indicator of an estimator, i.e. the mean squared deviation of the value of the estimator from the estimated parameter. Source: K. Kim (2000) and G. Chauvet et al. (2011).

RB – Relative bias: the ratio of the difference between an estimate and the estimated parameter to this estimated parameter. Source: K. Kim (2000) and G. Chauvet et al. (2011).

Within-imputation variance – Average variance of an estimator within an imputation cell after data imputation. Source: C. E. Särndal (1992).


7. Literature

Aczel A. D. (1998), Complete business statistics. Richard D. Irwin, Inc.

Andridge R. R., Little R. J. A. (2010), A Review of Hot Deck Imputation for Survey Non-response, International Statistical Review, vol. 78, pp. 40–64.

Andridge, R. R., Little R. J. A. (2009), The Use of Sampling Weights in Hot Deck Imputation. Journal of Official Statistics 25, pp. 21-36.

Arcaro C., Yung W. (2001), Variance estimation in the presence of imputation, SSC Annual Meeting, Proceedings of the Survey Method Section, pp. 75 – 80.

Chambers R. L., J. Hoogland, S. Laaksonen, D. M. Mesa, J. Pannekoek, P. Piela, P. Tsai, and T. de Waal (2001a), The AUTIMP-Project: Evaluation of Imputation Software. Report, Statistics Netherlands, Voorburg, the Netherlands.

Chambers R. L., T. Crespo, S. Laaksonen, P. Piela, P. Tsai, and T. de Waal (2001b), The AUTIMP-Project: Evaluation of WAID. Report, Statistics Netherlands, Voorburg, the Netherlands.

Chauvet G., Deville J.–C., Haziza D. (2011), On Balanced Random Imputation in Surveys, Biometrika, vol. 98, pp. 459 – 471.

CSO (2010), Standard Report on Methods and Quality for External Trade, Central Statistics Office, Cork, Ireland, available in the Internet at the following webpage: http://www.cso.ie/en/media/csoie/surveysandmethodologies/documents/pdfdocs/externaltrade2008.pdf

Eurostat (2001), National PRODCOM Methodologies, Office for Official Publications of the Euro-pean Communities, Luxembourg.

Everitt B. S., Landau S., Leese M., Stahl D. (2011), Cluster Analysis, 5th Edition, Wiley Series in Probability and Statistics, John Wiley & Sons, Ltd., Chichester, UK.

Fuller W. A., Kim J. K. (2005), Hot Deck Imputation for the Response Model, Survey Methodology, vol. 31, pp. 139–149.

Hu M., Salvucci S. (2001), A Study on Imputation Algorithms, Working Paper No. 2001–17 (Project Officer: Ralph Lee), National Center for Education Statistics, U.S. Department of Education, Office of Educational Research and Improvement, Washington, D.C., U.S.A., available at http://nces.ed.gov/pubs2001/200117.pdf.

IMF (2009), Export and Import Price Index Manual. Theory and Practice, International Labour Office, International Monetary Fund, Organisation for Economic Co-operation and Development, Statistical Office of the European Communities (Eurostat), United Nations Economic Commission for Europe, The World Bank, Ed. by IMF Multimedia Services Section, Washington, DC, U.S.A.

ISCM (2003), International Standard Cost Model Manual, Measuring and reducing administrative burdens for businesses, SCM Network to reduce administrative burdens, http://www.administrative-burdens.com/filesystem/2005/11/international_scm_manual_final_178.doc

Kalton G. (1983), Compensating for Missing Survey Data. Survey Research Center Institute for Social Research, The University of Michigan.


Kim K. (2000), Variance estimation under regression imputation model, Proceedings of the Survey Research Methods Section, American Statistical Association.

Kim J. K., Brick M., Fuller W. A., Kalton G. (2006), On the bias of the multiple-imputation variance estimator in survey sampling, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, pp. 509–521.

Lupton R. (1993), Statistics in Theory and Practice, Princeton University Press, Princeton, New Jersey, U.S.A.

Malina A., Zeliaś A. (1998), On Building Taxonometric Measures on Living Conditions, Statistics in Transition, vol. 3, No. 3, pp. 523–544.

Młodak A. (2006 a), Taxonomic analysis in regional statistics, ed. by DIFIN – Advisory and Information Centre, Warszawa, Poland (in Polish).

Młodak A. (2006 b), Multilateral normalisations of diagnostic features, Statistics in Transition, vol. 7, pp. 1125–1139.

Pannekoek J. (2009), Research on edit and imputation methodology: the throughput programme, Discussion paper (09022), Statistics Netherlands, The Hague/Heerlen, Netherlands.

Rubin D. B. (1987), Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

Rubin D. B. (1978), Multiple imputation in sample surveys – a phenomenological Bayesian approach to non-response, Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–34.

Rueda M., González S. (2008), A new ratio-type imputation with random disturbance, Applied Mathematics Letters, vol. 21, pp. 978–982.

Särndal C. E. (1992), Methods for estimating the precision of survey estimates when imputation has been used, Survey Methodology, vol. 18, pp. 241–252.

Seljak R., Špeh T. (2005), Automatic Editing System For Two Short–Term Business Surveys, United Nations Statistical Commission and Economic Commission For Europe, Conference of European Statisticians, Work Session on Statistical Data Editing, Ottawa, Canada, 16–18 May 2005, http://www.unece.org/fileadmin/DAM/stats/documents/2005/05/sde/wp.43.e.pdf .

Skinner C. J., Holt D., Smith T. M. F. (eds.) (1989), Analysis of Complex Surveys. John Wiley & Sons, Chichester.

de Waal T., Pannekoek J., Scholtus S. (2011), Handbook of Statistical Data Editing and Imputation, John Wiley & Sons, Inc., Hoboken, New Jersey.

Wein E. (2009), Automatic Imputation For Short Term Statistics, United Nations Statistical Commission and Economic Commission for Europe, Conference of European Statisticians, Work Session on Statistical Data Editing (Neuchâtel, Switzerland, 5–7 October 2009), Topic: Automated editing and imputation and software applications, Invited Paper, document available at http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2009/wp.2.e.pdf.


WHO/UNESCAP (2008), Training Manual on Disability Statistics, Department of Measurement and Health Information Systems of the World Health Organization (WHO), Statistics Division of the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP), United Nations Statistical Institute for Asia and the Pacific, Subsidiary Body of the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP), available on the official webpage of the UN Statistics Division at http://unstats.un.org/unsd/censuskb20/Attachments/2008ESCAP_TrainManDisabty-GUID7c07895389164cdab0b0b7609136f117.pdf.


Specific description – Theme:

A.1 Relationship with other modules

A.1.1 Related themes described in other modules

1. User needs

2. Design of Data Collection Methodology

3. Questionnaire design

4. Sample selection

5. Imputation

6. Weighting

7. Editing

A.1.2 Methods explicitly referred to in this module

1. Use of administrative data

2. Sample selection

3. Data imputation

A.1.3 Mathematical techniques explicitly referred to in this module

1. Effective sample selection algorithms

2. Imputation algorithms

3. Weighting algorithms

4. Quality indicators

A.1.4 GSBPM phases explicitly referred to in this module

1. GSBPM Phases 4.1, 5.2 – 5.6.

A.1.5 Tools explicitly referred to in this module

1. Reporting portals, e-questionnaires

2. Software for statistical analysis (e.g. R, SAS).

A.1.6 Process steps explicitly referred to in this module

1. n/a
