# StatMatch

## Description

StatMatch is an add-on package for the R environment that provides functions for data integration through statistical matching and, as a by-product, for imputing missing values in a data set.

The package contains functions that implement methods of matching and functions to support matching (calculation of the distances, etc.). There are three functions dedicated to the application of nonparametric methods of matching at the micro level:

- NND.hotdeck: minimum distance donor selection; implements many distance functions; allows the definition of donation classes. A constraint can be introduced to prevent a donor from being selected more than once (constrained matching).
- RNDwNND.hotdeck: random selection of the donor in fixed or “moving” classes. In the latter case, the donor can be drawn at random from the k closest units, from the units whose distance is below a given threshold, etc. The donor can be selected with probability proportional to a weighting variable. Different distance functions are implemented.
- rankNND.hotdeck: selects the closest donor based on the distance between the percentiles of the empirical cumulative distribution of a continuous variable available in both data sets. The estimation of the empirical cumulative distribution can account for different weights assigned to the units, and can be carried out within given classes of units.

These functions can be used to impute the missing values in a data set through the corresponding hot-deck methods.
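The idea behind minimum distance donor selection can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the package's implementation: the function names `manhattan` and `nnd_hotdeck`, the toy records, and the variable names (`age`, `income`, `spend`) are all hypothetical. It covers only the unconstrained case with the Manhattan distance and no donation classes.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nnd_hotdeck(recipients, donors, match_vars, donated_var):
    """For each recipient, copy donated_var from the closest donor.

    Unconstrained matching: the same donor may be used more than once.
    Ties are resolved by taking the first closest donor in the list.
    """
    fused = []
    for rec in recipients:
        rec_x = [rec[v] for v in match_vars]
        best = min(
            donors,
            key=lambda don: manhattan(rec_x, [don[v] for v in match_vars]),
        )
        # The recipient record is extended with the donated value.
        fused.append({**rec, donated_var: best[donated_var]})
    return fused


# Hypothetical toy data: "spend" is observed only in the donor file.
recipients = [{"age": 25, "income": 30}, {"age": 52, "income": 61}]
donors = [
    {"age": 24, "income": 28, "spend": 12},
    {"age": 50, "income": 65, "spend": 30},
    {"age": 40, "income": 45, "spend": 20},
]

fused = nnd_hotdeck(recipients, donors, ["age", "income"], "spend")
for row in fused:
    print(row)
```

The same loop imputes missing values when the recipients are the records with a missing item and the donors are the complete records of the same data set.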

A single function, mixed.mtc, implements matching methods in the parametric macro or mixed framework. The function assumes that the variables *X*, *Y* and *Z* follow a multivariate normal distribution. The parameters can be estimated by maximum likelihood or by a method based on sample estimates of the parameters of interest (means and variances).
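In the parametric setting, the one quantity that cannot be estimated directly is the covariance between *Y* and *Z*, because the two variables are never observed together. The sketch below shows how it can be derived under an additional assumption that is not stated above: that *Y* and *Z* are conditionally independent given a single common variable *X*. Under normality this gives cov(*Y*, *Z*) = cov(*X*, *Y*) · cov(*X*, *Z*) / var(*X*). The helper names and the toy data are hypothetical, and the code is a plain Python illustration, not the package's estimator.

```python
def cov(u, v):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)


def cov_yz_under_cia(x_a, y_a, x_b, z_b):
    """Covariance of Y and Z implied by conditional independence given X.

    File A observes (X, Y), file B observes (X, Z).
    The variance of X is estimated on the pooled X values of both files.
    """
    x_all = x_a + x_b
    var_x = cov(x_all, x_all)
    return cov(x_a, y_a) * cov(x_b, z_b) / var_x


# Hypothetical toy files: Y is observed only in A, Z only in B.
x_a, y_a = [1, 2, 3, 4], [2, 4, 6, 8]
x_b, z_b = [1, 2, 3, 4], [3, 6, 9, 12]

sigma_yz = cov_yz_under_cia(x_a, y_a, x_b, z_b)
print(sigma_yz)
```

When conditional independence is not plausible, the covariance is not identified by the two files alone, which is what motivates uncertainty analysis.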

Two functions implement methods for matching at the macro level in the presence of data from complex surveys. These functions are based on a series of calibrations of the weights associated with the units in the source data sets, following the methods suggested by Renssen (1998). At present, these methods apply only when *Y* and *Z* are both categorical and the goal is to estimate the contingency table *Y* vs. *Z*. In particular, the function harmonize.x harmonizes the marginal or joint distribution of the chosen *X* variables so as to make it coherent between the two initial data sets; the function comb.samples then estimates the table *Y* vs. *Z* through the methodologies proposed by Renssen, with or without an additional data source *C* in which *Y* and *Z*, or *X*, *Y* and *Z*, are jointly observed.

Finally, the functions Frechet.bounds.cat and Fbwidths.by.x allow the exploration of uncertainty when all the variables (*X*, *Y*, and *Z*) are categorical. The first function estimates the uncertainty bounds for all the cells in the contingency table *Y* vs. *Z*. In the presence of many common variables *X*, the function Fbwidths.by.x helps identify the variables that are most related to *Y* and *Z* and therefore reduce the width of the uncertainty bounds.
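The uncertainty bounds for a cell of the *Y* vs. *Z* table can be illustrated with the classical Fréchet inequalities applied within each category of *X*: the lower bound is the sum over *x* of P(x) · max(0, P(y|x) + P(z|x) − 1), and the upper bound is the sum over *x* of P(x) · min(P(y|x), P(z|x)). The sketch below is plain Python, not the package's code; the function name `frechet_bounds`, the dictionary layout and the probabilities are hypothetical.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Frechet bounds for every cell P(Y=j, Z=k), conditioning on X.

    p_x:          {x: P(X=x)}
    p_y_given_x:  {x: {j: P(Y=j | X=x)}}
    p_z_given_x:  {x: {k: P(Z=k | X=x)}}
    Returns {(j, k): (lower, upper)}.
    """
    y_cats = list(next(iter(p_y_given_x.values())))
    z_cats = list(next(iter(p_z_given_x.values())))
    bounds = {}
    for j in y_cats:
        for k in z_cats:
            lower = sum(
                p * max(0.0, p_y_given_x[x][j] + p_z_given_x[x][k] - 1.0)
                for x, p in p_x.items()
            )
            upper = sum(
                p * min(p_y_given_x[x][j], p_z_given_x[x][k])
                for x, p in p_x.items()
            )
            bounds[(j, k)] = (lower, upper)
    return bounds


# Hypothetical distributions with one common variable X in two categories.
p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"y1": 0.8, "y2": 0.2}, "b": {"y1": 0.3, "y2": 0.7}}
p_z_given_x = {"a": {"z1": 0.9, "z2": 0.1}, "b": {"z1": 0.4, "z2": 0.6}}

bounds = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
for cell, (lower, upper) in bounds.items():
    print(cell, round(lower, 3), round(upper, 3))
```

In this toy example the cell (y1, z1) has bounds of about 0.35 and 0.55, whereas ignoring *X* and using only the marginals P(y1) = 0.55 and P(z1) = 0.65 gives 0.20 and 0.55. This shows how a common variable related to *Y* and *Z* can narrow the bounds.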

## Information

- Status: validated
- Author: Istat
- Licence: GPL-2 | GPL-3
- GSBPM code: 5.1 Integrate data; 5.4 Edit and impute
- Programming language: R
- Keywords: statistical matching, data fusion, hot deck imputation, uncertainty analysis
- Contact: Marcello D’Orazio (madorazi@istat.it)

## Software and documentation

**SOFTWARE DEPENDENCIES**

R (≥ 2.7.0)

R packages: proxy, lpSolve, survey.

In some cases, the additional R packages optmatch, SDaA, simPopulation and MASS may also be required. Note that use of the optmatch package is subject to some limitations.

**COPYRIGHT**

Copyright 2016 Marcello D’Orazio

Licensed under the GNU General Public License (GPL), version 2 or later. You may not use this work except in compliance with the Licence. You may obtain a copy of the Licence at: http://www.gnu.org/licenses/licenses.en.html. Unless required by applicable law or agreed to in writing, software distributed under the Licence is distributed on an “AS IS” basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

**DISCLAIMER**

Istat assumes no responsibility for results arising from any use of the tool that is inconsistent with the methodological guidance contained in the available documentation.

**DOWNLOAD**

**TECHNICAL AND METHODOLOGICAL DOCUMENTATION**