International Development Research Centre (IDRC) Canada     
 People
Rodrigo Bonilla

ID: 104010
Added: 2006-09-28 14:42
Modified: 2006-09-28 21:26
Refreshed: 2010-03-07 15:50


15. Non-parametric estimation in DAD

15.1 Density estimation

15.1.1 Univariate density estimation

It is often useful to visualize the shapes of income distributions. There are essentially two main approaches to doing so, and a mixture of the two. The first approach uses parametric models of income distributions. These models assume that the income distribution follows a known functional form, but with unknown parameters. Popular examples of such functional forms include the log-normal, the Pareto, and variants of the beta or gamma distributions. The main statistical challenge is then to estimate the unknown parameters of that functional form, and to test whether a given functional form fits the observed distribution of income better than another functional form.

The second approach does not posit a particular functional form and does not require the estimation of functional parameters. Instead, it lets the data "speak for themselves". It is therefore said to be non-parametric. The method is most easily understood by starting with a review of the density estimation performed by traditional histograms. Histograms provide an estimate of the density of a variable $y$ by counting how many observations fall into "bins", and by dividing that number by the width of the bin times the number of observations in the sample. To see this more clearly, denote the origin of the bins by $y_0$ and the bins of the histogram by $[y_0 + mh, y_0 + (m+1)h)$ for positive or negative integers $m$. For instance, if we take $m = 0$, then the bin is described by the interval ranging from the origin to the origin plus $h$. Also, let $\{y_i\}_{i=1}^{n}$ be a sample of $n$ observations of income $y_i$. The value of the histogram over each of the bins is then defined by

$$\hat{f}(y) = (nh)^{-1} \sum_{i=1}^{n} I\left(y_0 + mh \le y_i < y_0 + (m+1)h\right) \quad \text{for } y_0 + mh \le y < y_0 + (m+1)h, \qquad (15.1)$$

where $I(\cdot)$ is an indicator function that equals 1 when its argument is true and 0 otherwise.

Such a histogram is shown on Figure 15.1 by the rectangles of varying heights over identical widths, starting with origin $y_0$. For bins defined by $[y_0 + mh, y_0 + (m+1)h)$, the bin width is indeed a constant set to $h$, but we can also allow the widths to vary across the bins of the histogram. The choice of $h$ controls the amount of smoothing performed by the histogram. A small bin width $h$ will lead to significant fluctuations in the value of the histogram, and a very large width (one bin containing all of the observations) will set the histogram to the constant $h^{-1}$. Choosing an appropriate value for such a smoothing parameter is in fact a pervasive preoccupation in non-parametric estimation procedures, as we will discuss later. The choice of the origin can also be important, especially when $n$ is not very large. There is, however, little guidance on that latter choice, except perhaps when the nature of the data suggests a natural value for $y_0$. One way to avoid choosing such a $y_0$ is by constructing what will appear soon to be a "naive" kernel density estimator, that is, one in which the point $y$ in $\hat{f}(y)$ is always at the center of the bin:

$$\hat{f}(y) = (2hn)^{-1} \sum_{i=1}^{n} I\left(y - h \le y_i < y + h\right). \qquad (15.2)$$

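This naive estimator can also be obtained from the use of a weight function $w(u)$, defined as:

$$w(u) = \begin{cases} 0.5 & \text{if } |u| < 1, \\ 0 & \text{otherwise,} \end{cases} \qquad (15.3)$$

and by defining

$$\hat{f}(y) = (nh)^{-1} \sum_{i=1}^{n} w\left(\frac{y - y_i}{h}\right). \qquad (15.4)$$

This frees the density estimation from the choice of $y_0$. This naive estimator can also be improved statistically by choosing weighting functions that are smoother than $w(u)$ in (15.3). For this, we can think of replacing the weight function $w(u)$ by a general "kernel function" $K(u)$, such that [1]

$$\hat{f}(y) = (nh)^{-1} \sum_{i=1}^{n} K\left(\frac{y - y_i}{h}\right). \qquad (15.5)$$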

A smooth kernel estimate of the density function that generated the histogram is shown on Figure 15.1.
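The histogram and the naive estimator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `incomes` values are hypothetical, bins are taken as half-open intervals, and sampling weights are ignored.

```python
import math

def histogram_density(y, sample, origin, h):
    """Histogram estimate at y: observations in y's bin, divided by n*h."""
    m = math.floor((y - origin) / h)             # index of the bin containing y
    lower, upper = origin + m * h, origin + (m + 1) * h
    count = sum(1 for yi in sample if lower <= yi < upper)
    return count / (len(sample) * h)

def box_weight(u):
    """Weight function w(u): 0.5 inside (-1, 1) and 0 elsewhere."""
    return 0.5 if abs(u) < 1 else 0.0

def naive_density(y, sample, h):
    """Naive estimate at y: the window of half-width h is centred on y itself."""
    return sum(box_weight((y - yi) / h) for yi in sample) / (len(sample) * h)

incomes = [12.0, 15.0, 15.5, 18.0, 22.0, 30.0]   # hypothetical incomes

hist_at_16 = histogram_density(16.0, incomes, origin=10.0, h=5.0)   # bin [15, 20)
naive_at_16 = naive_density(16.0, incomes, h=5.0)                   # window (11, 21)
one_big_bin = histogram_density(16.0, incomes, origin=0.0, h=100.0) # every income in one bin
```

With a bin so wide that it contains every observation, the histogram collapses to the constant 1/h, as noted in the text; the naive estimator needs no origin at all.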


[1] DAD: Distribution | Density Function.

Figure 15.1: Histograms and density functions


In general, we would wish $\int_{-\infty}^{\infty} K(u)\,du = 1$, since we would then have $\int_{-\infty}^{\infty} \hat{f}(y)\,dy = 1$. For $\hat{f}(y)$ to qualify fully as a probability density function, we would also require $K(u) \ge 0$, since we would then be guaranteed that $\hat{f}(y) \ge 0$, although there are sometimes reasons to allow for negativity of the kernel function. The parameter $h$ is usually referred to as the window width, the bandwidth or the smoothing parameter of kernel estimation procedures. There are also arguments for adjusting the window width that applies to observation $y_i$ for the number of observations that surround $y_i$, making $h$ larger for areas where there are fewer observations. This is done for instance by the nearest neighbor and the adaptive kernel methods. As in the use of the naive density estimator, each observation will provide a box or a "bump" to the density estimation of $f(y)$, and that bump will have a shape and a width determined by the shape of $K(u)$ and the size of $h$ respectively [2].

E: 18.5.1

The definition of $\hat{f}(y)$ in (15.5) makes it inherit the continuity and differentiability properties of $K(u)$. It is often sound and convenient to choose a kernel function that is symmetric around 0, with $\int u K(u)\,du = 0$ and $\int u^2 K(u)\,du = \sigma_K^2 > 0$. One such kernel function that has nice continuity and differentiability properties is the Gaussian kernel, defined by

$$K(u) = (2\pi)^{-0.5} \exp\left(-0.5\,u^2\right). \qquad (15.6)$$

The "bumps" provided by the Gaussian kernel have the familiar bell shapes, are smoothly differentiable up to any desired level, and are such that $\sigma_K^2 = 1$.
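As a rough numerical check of these properties, the sketch below evaluates a Gaussian kernel density estimate, that is, the average over observations of K((y - y_i)/h)/h, and verifies by a simple midpoint rule that the kernel and the resulting estimate both integrate to one. The helper `integrate`, the sample `incomes` and the window width are assumptions made for the illustration only.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(y, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate at y: each observation adds a bump of width h."""
    return sum(kernel((y - yi) / h) for yi in sample) / (len(sample) * h)

def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

incomes = [12.0, 15.0, 15.5, 18.0, 22.0, 30.0]   # hypothetical incomes
h = 2.0                                          # hypothetical window width

kernel_mass = integrate(gaussian_kernel, -10.0, 10.0)
kernel_variance = integrate(lambda u: u * u * gaussian_kernel(u), -10.0, 10.0)
density_mass = integrate(lambda y: kernel_density(y, incomes, h), -10.0, 55.0)
```

Because the kernel is non-negative and integrates to one, the estimate is itself a proper density; the second moment of the Gaussian kernel is one.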

15.1.2 Statistical properties of kernel density estimation

The efficiency of non-parametric estimation procedures is usually measured by the mean square error (MSE) in estimating the function $f(y)$ at a point $y$. The MSE in estimating $f(y)$ by $\hat{f}(y)$ is defined by

$$\mathrm{MSE}_y(\hat{f}) = E\left[\left(\hat{f}(y) - f(y)\right)^2\right]. \qquad (15.7)$$

The most common way of defining a measure of global accuracy simply integrates the mean square error across values of $y$. This yields the mean integrated square error (or MISE), a measure of the accuracy of estimating $f(y)$ over the whole range of $y$:

$$\mathrm{MISE}(\hat{f}) = E\left[\int \left(\hat{f}(y) - f(y)\right)^2 dy\right]. \qquad (15.8)$$


[2] DAD: Distribution | Density Function.

The relative efficiency of a particular choice of a kernel function $K(u)$ can then be assessed relative to the kernel function that would minimize the MISE. The Gaussian kernel function has very good efficiency properties, although they are not quite as good as those of some other (less smooth) kernel functions, such as the (efficiency-optimal) Epanechnikov, the biweight or the triangular kernels, which are described and discussed for instance in Silverman (1986) (see in particular Table 3.1).
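To give a feel for how close these kernels are to one another, the sketch below computes the efficiency measure used by Silverman (1986, Section 3.3.2), namely 3/(5√5) divided by the product of the kernel's standard deviation and the integral of its square. The kernels are written on the support [-1, 1] (the measure does not depend on the scale of the kernel), and the `integrate` helper and dictionary layout are assumptions of this illustration.

```python
import math

def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

# Each entry: (kernel function, half-width of the range integrated over).
KERNELS = {
    "gaussian": (lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi), 10.0),
    "epanechnikov": (lambda u: 0.75 * (1.0 - u * u), 1.0),
    "biweight": (lambda u: (15.0 / 16.0) * (1.0 - u * u) ** 2, 1.0),
    "triangular": (lambda u: 1.0 - abs(u), 1.0),
}

def efficiency(kernel, half_width):
    """Efficiency relative to the Epanechnikov kernel (which scores 1)."""
    second_moment = integrate(lambda u: u * u * kernel(u), -half_width, half_width)
    roughness = integrate(lambda u: kernel(u) ** 2, -half_width, half_width)
    return 3.0 / (5.0 * math.sqrt(5.0)) / (math.sqrt(second_moment) * roughness)

efficiencies = {name: efficiency(k, s) for name, (k, s) in KERNELS.items()}
```

On this measure the Gaussian kernel comes out at roughly 0.95, and the biweight and triangular kernels at roughly 0.99, so the choice among them matters little for accuracy.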

15.1.3 Choosing a window width

Even if we were to agree on a particular shape for an argument-centered kernel function, there would still remain the question of which window width to choose. Again, conditional on the choice of a particular form for $K(u)$, we can choose the window width that minimizes the MISE. To see what this implies, note first that we can decompose the MSE at $y$ as the sum of the square of the bias and of the variance of $\hat{f}(y)$:

$$\mathrm{MSE}_y(\hat{f}) = \left(E\left[\hat{f}(y)\right] - f(y)\right)^2 + \mathrm{var}\left(\hat{f}(y)\right). \qquad (15.9)$$

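For symmetric kernel functions, the bias can be shown to be approximately equal to

$$E\left[\hat{f}(y)\right] - f(y) \approx 0.5\,h^2 \sigma_K^2 f^{(2)}(y), \qquad (15.10)$$

where, as before, $f^{(i)}(y)$ stands for the $i$th-order derivative of $f(y)$. The variance is approximately equal to

$$\mathrm{var}\left(\hat{f}(y)\right) \approx n^{-1} h^{-1} c_K f(y), \qquad (15.11)$$

where $c_K = \int K(u)^2\,du$. Substituting (15.10) and (15.11) in (15.9) then gives:

$$\mathrm{MSE}_y(\hat{f}) \approx 0.25\,h^4 \sigma_K^4 \left(f^{(2)}(y)\right)^2 + n^{-1} h^{-1} c_K f(y). \qquad (15.12)$$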

Hence, considering (15.10), we find that the bias of $\hat{f}(y)$ will be low if the kernel function has a low variance, since it is then the observations that are "closer" to $y$ that will count more, and since it is those observations that provide the least biased estimate of the density at $y$. But the bias also depends on the curvature of $f(y)$: in the absence of such a curvature, the density function is linear and the bias provided by using observations on the left of $y$ is just (locally) offset by the bias provided by using observations on the right of $y$. When $f^{(2)}(y) = 0$, therefore, there is asymptotically no bias in using kernel density estimation.

Looking at (15.11), we find ceteris paribus that a flatter kernel (i.e., with a lower $c_K$) decreases the variance of $\hat{f}(y)$. A flatter kernel weights more equally the observations found around $y$, and that reduces the variance of an estimator such as (15.5). We also obtain the familiar result that the variance of the estimator falls in inverse proportion to the size of the sample.

An increase in $h$ plays an offsetting role on the precision of $\hat{f}(y)$, as is shown by (15.12). When $f^{(2)}(y) \ne 0$, a large $h$ increases the bias by making the estimator too smooth: too much use is made of those observations that are not so close to $y$. Conversely, a large $h$ reduces the variance of $\hat{f}(y)$ by making it less variable and less dependent on the particular value of those observations that are very close to $y$. Hence, in choosing $h$ in an attempt to minimize $\mathrm{MISE}(\hat{f})$, a compromise needs to be struck between the competing virtues of bias and variance reductions. The precise nature of this compromise will depend on the shape of the kernel function as well as on the true population density function. For instance, if the Gaussian kernel is used and if the true density function is normal with variance $\sigma^2$, then the choice of $h$ that minimizes the MISE is given by (see for instance Silverman 1986, p.45):

$$h^* = 1.06\,\sigma\,n^{-0.2}. \qquad (15.13)$$

This value of h* is conditional on both K(u) and f(y) being normal density functions. Silverman (1986) also argues for a more robust choice of h*, given by

$$h^* = 0.9\,A\,n^{-0.2}, \qquad (15.14)$$

where A = min(standard deviation, interquartile range/1.34). This is because (15.14)

(...) will yield a mean integrated square error within 10% of the optimum for all the t-distributions considered, for the log-normal with skewness up to about 1.8, and for the normal mixture with separation up to 3 standard deviations. (...) For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine tuning. (Silverman 1986, p.48)
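The two rules of thumb from Silverman (1986), 1.06 σ n^(-0.2) and 0.9 A n^(-0.2), are simple to compute. The sketch below is only an illustration: the function names and data are hypothetical, sampling weights are ignored, and the interquartile range uses the default quantile convention of Python's `statistics.quantiles`, which is one of several in use.

```python
import statistics

def normal_reference_bandwidth(sample):
    """Window width 1.06 * sigma * n^(-0.2), best suited to near-normal data."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** -0.2

def robust_bandwidth(sample):
    """Window width 0.9 * A * n^(-0.2), A = min(std. dev., IQR / 1.34)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)   # quartiles (default convention)
    spread = min(statistics.stdev(sample), (q3 - q1) / 1.34)
    return 0.9 * spread * len(sample) ** -0.2

# Hypothetical incomes with one very large value.
incomes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 200.0]

h_normal = normal_reference_bandwidth(incomes)
h_robust = robust_bandwidth(incomes)
```

In this example the single large income inflates the standard deviation, so the first rule returns a much wider window than the second, which falls back on the interquartile range.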

Further (asymptotic) results show that, under some mild assumptions (in particular, that the density function $f(y)$ is continuous at $y$, and that $h \to 0$ and $nh \to \infty$ as $n \to \infty$), the kernel estimator $\hat{f}(y)$ converges to $f(y)$ as $n \to \infty$. When $h$ is chosen optimally, it is of the order of $n^{-1/5}$, and by (15.12) the MISE is then of the order of $n^{-4/5}$, so that the estimator itself converges at the rate $n^{-0.4}$. This is slightly slower than the usual rate of convergence of parametric estimators, which is $n^{-0.5}$.
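The orders of magnitude quoted above, and the constant 1.06 in the normal case, can be traced by integrating the approximate MSE over $y$ and minimizing with respect to $h$. The following is a sketch of that standard argument; the shorthand $R(f)$ and the label AMISE (for the approximate MISE) are introduced here for convenience and are not notation from the text.

```latex
\begin{align*}
\mathrm{AMISE}(h) &= 0.25\,h^{4}\sigma_K^{4}R(f) + n^{-1}h^{-1}c_K,
  \qquad R(f) = \int \left(f^{(2)}(y)\right)^{2}dy, \\
\frac{\partial\,\mathrm{AMISE}}{\partial h} &= h^{3}\sigma_K^{4}R(f) - n^{-1}h^{-2}c_K = 0
  \;\Longrightarrow\;
  h^{*} = \left(\frac{c_K}{\sigma_K^{4}R(f)}\right)^{1/5} n^{-1/5}, \\
\mathrm{AMISE}(h^{*}) &= 1.25\,\left(\sigma_K c_K\right)^{4/5} R(f)^{1/5}\, n^{-4/5}.
\end{align*}
% Gaussian kernel and normal density with standard deviation \sigma:
% \sigma_K^2 = 1, c_K = 1/(2\sqrt{\pi}), R(f) = 3/(8\sqrt{\pi}\,\sigma^{5}), so that
\[
h^{*} = \left(\frac{4}{3}\right)^{1/5}\sigma\,n^{-1/5} \approx 1.06\,\sigma\,n^{-0.2}.
\]
```

The minimized criterion is of order $n^{-4/5}$, which corresponds to an error of order $n^{-0.4}$ on the scale of the estimator itself (the square root of the MISE).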

15.1.4 Multivariate density estimation

Kernel estimation can also be used for multivariate density estimation. Let $u$, $y$ and $y_i$ be $d$-dimensional vectors. We can estimate a $d$-dimensional density function as [3]:

$$\hat{f}(y) = \left(nh^d\right)^{-1} \sum_{i=1}^{n} K\left(\frac{y - y_i}{h}\right), \qquad (15.15)$$

where $h$ is a window width common to all of the dimensions. The multivariate Gaussian kernel is given by $K(u) = (2\pi)^{-d/2} \exp\left(-0.5\,u'u\right)$. The issues of kernel function and window width selection are similar to those discussed above for univariate density estimation. The approximately optimal window width converges to zero at the rate $n^{-1/(d+4)}$, and the optimal window width for the Gaussian kernel and a multivariate normal density $f(y)$ with unit variance is given by $h^* = \left(4/(d+2)\right)^{1/(d+4)} n^{-1/(d+4)}$.

15.1.5 Simulating from a density estimate

Simulations from an estimated density are sometimes needed to compute estimates of functionals of the unknown true density function. This is the case, for instance, for the estimation in DAD of indices of classical horizontal inequity (HI). The estimation of such indices requires information on the net income distribution of those who have the same gross incomes, and such information cannot be gathered directly from sample observations of net and gross incomes since very few (if any) exact equals can be observed in random samples of finite sizes. Another use of simulated distributions is for computing bootstrap estimates of the sampling distribution of some estimators. The usual bootstrap procedure proceeds by conducting successive random sampling (with replacement) from the original sample $\{y_i\}_{i=1}^{n}$. This constrains the new samples to contain only those observations $y_i$ that were contained in the original sample. Those new samples could instead be generated from a non-parametric estimate of the density of the original sample of incomes, which would yield a bootstrap estimate that would be smoother and less dependent on the precise values that the observations $y_i$ took in the original sample.

[3] DAD: Distribution | Joint Density Function.

Consider first the case of generating $J$ independent realizations, $\{y_j^*\}_{j=1}^{J}$, in a univariate case, and suppose that a non-negative kernel function $K(u)$ with window width $h$ is used to estimate $f(y)$. Also assume that observation $i$ has sampling weight $w_i$, and suppose for simplicity that the initial observations $\{y_i\}_{i=1}^{n}$ were drawn independently from each other. The following simple algorithm is adapted slightly from Silverman (1986), p.143. For $j = 1, \ldots, J$, we then:

Step 1 Choose $i$ with replacement from $\{1, \ldots, n\}$ with probability $w_i / \sum_{k=1}^{n} w_k$;

Step 2 Choose $\varepsilon$ randomly using the probability density function $K$;

Step 3 Set $y_j^* = y_i + h\varepsilon$.

Note that this algorithm does not even require computing $\hat{f}(y)$ directly.
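The three steps translate almost line for line into code. The sketch below assumes a Gaussian kernel (so that Step 2 is a standard normal draw) and uses hypothetical names and data; it is meant as an illustration of the algorithm rather than as the routine used in DAD.

```python
import random

def simulate_from_kernel_density(sample, weights, h, size, rng):
    """Draw `size` values from a Gaussian-kernel density estimate of `sample`."""
    draws = []
    for _ in range(size):
        y_i = rng.choices(sample, weights=weights, k=1)[0]  # Step 1: weighted pick
        eps = rng.gauss(0.0, 1.0)                            # Step 2: draw from K
        draws.append(y_i + h * eps)                          # Step 3: perturb by h * eps
    return draws

incomes = [12.0, 15.0, 15.5, 18.0, 22.0, 30.0]   # hypothetical incomes
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]         # hypothetical sampling weights

rng = random.Random(12345)                       # seeded for reproducibility
smooth_draws = simulate_from_kernel_density(incomes, weights, h=1.0, size=20000, rng=rng)
plain_draws = simulate_from_kernel_density(incomes, weights, h=0.0, size=100, rng=rng)
```

Setting `h=0.0` reduces the procedure to the usual bootstrap, in which every simulated value is one of the original observations.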

For the multivariate case, the above algorithm becomes just slightly more complicated. For instance, for the estimation of classical HI at gross income $x$, we need to generate a random sample of net incomes, $\{y_j^*\}_{j=1}^{J}$, that follows the estimated kernel conditional density $\hat{f}(y|x)$. For this, we use the original sample $\{(x_i, y_i)\}_{i=1}^{n}$ with sampling weights $w_i$. For $j = 1, \ldots, J$, we then:

Step 1 Choose $i$ with replacement from $\{1, \ldots, n\}$ with probability

$$\frac{w_i K\left(\frac{x - x_i}{h}\right)}{\sum_{k=1}^{n} w_k K\left(\frac{x - x_k}{h}\right)};$$

Step 2 Choose $\varepsilon$ randomly using the probability density function $K$;

Step 3 Set $y_j^* = y_i + h\varepsilon$.

This gives a simulated sample of net incomes $\{y_j^*\}_{j=1}^{J}$, conditional upon gross income being exactly equal to $x$. A local index of classical HI at $x$ can then be computed using this simulated sample, and global indices of classical HI can be estimated simply by repeating this procedure for each of the observed values of gross incomes, $\{x_i\}_{i=1}^{n}$.
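A sketch of the conditional version follows, again assuming a Gaussian kernel and a single window width h for both variables, as in the steps above. The function names and the gross/net income data are hypothetical.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def simulate_conditional(x, xs, ys, weights, h, size, rng):
    """Draw net incomes from a kernel estimate of the density of y given x."""
    # Step 1 probabilities: sampling weight times kernel weight of x_i around x.
    probs = [w * gaussian_kernel((x - xi) / h) for w, xi in zip(weights, xs)]
    if sum(probs) == 0.0:
        raise ValueError("no observation close enough to x for this window width")
    draws = []
    for _ in range(size):
        y_i = rng.choices(ys, weights=probs, k=1)[0]   # Step 1
        eps = rng.gauss(0.0, 1.0)                      # Step 2
        draws.append(y_i + h * eps)                    # Step 3
    return draws

gross = [0.0, 100.0]           # hypothetical gross incomes, far apart
net = [1.0, 50.0]              # hypothetical net incomes
weights = [1.0, 1.0]           # hypothetical sampling weights

rng = random.Random(2024)
near_zero = simulate_conditional(0.0, gross, net, weights, h=1.0, size=200, rng=rng)
```

Here the observation with gross income 100 receives a kernel weight of essentially zero at x = 0, so all simulated net incomes are built around the first observation.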

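Because they follow an estimated density function that is on average smoother than the true one, the simulated samples generated by the above algorithms will have a variance that is generally larger than both the variance observed in the sample and the true population variance. Let for instance the sample variance of the $y_i$ be denoted as $\hat{\sigma}_y^2$. In the univariate case, the variance of the simulated $y_j^*$ will equal $\hat{\sigma}_y^2 + h^2 \sigma_K^2$. This can be a problem if, as is the case for the measurement of indices of classical HI, the quantity of interest is intimately linked to the dispersion of income. There may also be a wish to constrain the simulated samples of net incomes to have precisely the same sample mean, $\hat{\mu}_y$, as the original sample. Constraining the simulated samples to have the same mean and variance as the original sample can be done by translating and re-scaling the simulated samples. This involves replacing Step 3 above by

Step 3' Set $y_j^* = \hat{\mu}_y + \left(y_i - \hat{\mu}_y + h\varepsilon\right) \left(1 + h^2 \sigma_K^2 / \hat{\sigma}_y^2\right)^{-1/2}$

in the univariate case. For the bivariate case, we also use Step 3', but replace $\hat{\mu}_y$ by $\hat{\mu}_{y|x}$ and $\hat{\sigma}_y^2$ by $\hat{\sigma}_{y|x}^2$, which can be respectively computed as:

$$\hat{\mu}_{y|x} = \frac{\sum_{i=1}^{n} w_i K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} w_i K\left(\frac{x - x_i}{h}\right)} \qquad (15.16)$$

and

$$\hat{\sigma}_{y|x}^2 = \frac{\sum_{i=1}^{n} w_i K\left(\frac{x - x_i}{h}\right) \left(y_i - \hat{\mu}_{y|x}\right)^2}{\sum_{i=1}^{n} w_i K\left(\frac{x - x_i}{h}\right)}. \qquad (15.17)$$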

Equation (15.16) is in fact an example of a kernel regression of y on x, a procedure to which we now turn.
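The kernel-weighted conditional mean and variance, and the rescaled Step 3', can be sketched as follows. The code assumes a Gaussian kernel (whose variance is 1, hence the default `kernel_var=1.0`); the function names and data are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_conditional_moments(x, xs, ys, weights, h):
    """Kernel-weighted mean and variance of y among observations with x_i near x."""
    k = [w * gaussian_kernel((x - xi) / h) for w, xi in zip(weights, xs)]
    total = sum(k)
    mean = sum(ki * yi for ki, yi in zip(k, ys)) / total
    variance = sum(ki * (yi - mean) ** 2 for ki, yi in zip(k, ys)) / total
    return mean, variance

def rescaled_draw(y_i, eps, h, mean, variance, kernel_var=1.0):
    """Step 3': shrink the perturbed draw towards the mean to undo the added variance."""
    shrink = math.sqrt(1.0 + h * h * kernel_var / variance)
    return mean + (y_i - mean + h * eps) / shrink

gross = [0.0, 0.0, 100.0]      # hypothetical gross incomes
net = [2.0, 4.0, 50.0]         # hypothetical net incomes
weights = [1.0, 1.0, 1.0]      # hypothetical sampling weights

mean_at_0, var_at_0 = kernel_conditional_moments(0.0, gross, net, weights, h=1.0)
example_draw = rescaled_draw(5.0, 0.0, h=1.0, mean=mean_at_0, variance=var_at_0)
```

Dividing by the square root of (1 + h² σ_K² / variance) exactly offsets the extra h² σ_K² that the kernel perturbation adds to the variance of the draws.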

15.2 Non-parametric regressions

The estimation of an expected relationship between variables is the second most important sphere of recent applications of kernel estimation techniques. Non-parametric regressions offer several useful applications in distributive analysis. An example of such an application is the estimation of the relationship between expenditures and calorie intake. Regressing expenditure non-parametrically on calorie intake does not impose a fixed functional relationship between those two variables along the entire range of calorie intake. On the contrary, it allows a fair amount of flexibility by estimating the link between the two variables through a local weighting procedure. The local weighting procedure essentially considers the expenditures of those individuals with a calorie intake in the "region" of the specified calorie intake. It weights those values with weights that decrease rapidly with the distance from the specified calorie intake. Hence, those with calorie intake far from the specified level will contribute little to the estimation of the expenditure needed to attain that level. The results using this method are thus less affected by the presence of "outliers" in the distribution of incomes, and less prone to biases stemming from an incorrect specification of the link between spending and calorie intake.

Basically, then, one is interested in estimating the predicted response, m(x), of a variable y at a given value of a (possibly multivariate) variable x, that is,

$$m(x) = E\left[y \mid x\right]. \qquad (15.18)$$

Alternatively, if the joint density f(x, y) exists and if f(x) > 0, m(x) can also be defined as:

$$m(x) = \frac{\int y\,f(x, y)\,dy}{f(x)}. \qquad (15.19)$$

The difficulty in estimating the function m(x) is that we typically do not observe in a sample a response of y at that particular value of x. Furthermore, even if we do, there are rarely other observations with exactly the same value of x that will allow us to compute reliably the expected response in which we are interested.

Let then $\{(x_i, y_i)\}_{i=1}^{n}$ be a sample of $n$ observed realizations jointly of $x$ and $y$. The response information that is provided by the sample can be expressed as:

$$y_i = m(x_i) + \varepsilon_i. \qquad (15.20)$$

To estimate $m(x)$, kernel regression techniques use a local averaging procedure that involves weights $K(u)$ that are analogous to those used in Section 15.1 for density estimation. Recalling (15.5) and (15.19), this leads to the following Nadaraya-Watson non-parametric estimator of $m(x)$ [4]:

E:18.8.1

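$$\hat{m}(x) = \frac{(nh)^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) y_i}{\hat{f}(x)} = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}. \qquad (15.21)$$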

To reduce the bias of using neighboring $y_i$'s, the kernel weights $K\left((x - x_i)/h\right)$ typically decline with the distance between $x$ and $x_i$. They also depend on the window width $h$.
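The Nadaraya-Watson estimator is a kernel-weighted average of the observed responses, which the following sketch makes explicit. It assumes a Gaussian kernel and ignores sampling weights; the function names and the calorie and expenditure figures are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Kernel-weighted average of the y_i, with weights K((x - x_i) / h)."""
    k = [kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    if total == 0.0:
        raise ValueError("no observation close enough to x for this window width")
    return sum(ki * yi for ki, yi in zip(k, ys)) / total

calories = [1800.0, 2000.0, 2200.0]      # hypothetical daily calorie intakes
spending = [40.0, 50.0, 60.0]            # hypothetical expenditures

# Expected expenditure at an intake of 2000 calories, with a window of 200 calories.
spending_at_2000 = nadaraya_watson(2000.0, calories, spending, h=200.0)
# With a very small window, only the nearest observation carries any weight.
spending_local = nadaraya_watson(1800.0, calories, spending, h=1.0)
```

In this symmetric example the estimate at 2000 calories is the middle expenditure, while a very narrow window simply reproduces the nearest observation; in practice the window width trades off these two extremes.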

As in the case of the kernel density estimators, the kernel smoother $\hat{m}(x)$ can be shown to be consistent under relatively weak conditions, including that $m(x)$ and $f(x)$ are both continuous functions of $x$, and that $h \to 0$ and $nh \to \infty$ as $n \to \infty$ (see for instance Härdle 1990, Proposition 3.1.1). Again, the variance of $\hat{m}(x)$ alone does not fully capture the convergence of $\hat{m}(x)$ to $m(x)$ since we must also take into account the bias of $\hat{m}(x)$, which comes from the smoothing of the $y_i$ in (15.21). Under suitable regularity conditions, including that $h \sim n^{-0.2}$, the asymptotic distribution of the kernel estimator $\hat{m}(x)$ can be shown to be normal, with its center shifted by its asymptotic bias (see Härdle 1990, Theorem 4.2.1, for a demonstration). This asymptotic bias is a function of the form of the kernel $K(u)$ and of the derivatives of $m(x)$ and $f(x)$. It is given by:

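$$\mathrm{bias}\left(\hat{m}(x)\right) \approx h^2 \sigma_K^2 \left(0.5\,m^{(2)}(x) + \frac{m^{(1)}(x)\,f^{(1)}(x)}{f(x)}\right). \qquad (15.22)$$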

This asymptotic bias can be estimated consistently using estimates of $m^{(2)}(x)$, $m^{(1)}(x)$, $f^{(1)}(x)$ and $f(x)$. Such an estimation, however, significantly complicates the computation of the sampling distribution of $\hat{m}(x)$, and it can be avoided if we can expect (or can make) the bias to be small compared to the variance. This will be the case if $m(x)$ is relatively constant, or if we make $h$ fall just a bit faster than its optimal speed of $n^{-0.2}$; again, see the discussion of this in Härdle (1990), pp.100-102.


[4] DAD: Distribution | Non-Parametric Regression.

The variance of $\hat{m}(x)$ is given by:

$$\mathrm{var}\left(\hat{m}(x)\right) \approx n^{-1} h^{-1} c_K\,\sigma_{y|x}^2\,f(x)^{-1}. \qquad (15.23)$$

The conditional variance $\sigma_{y|x}^2$ can be estimated consistently as in (15.17). As in the case of kernel density estimation, note again that the smoothing process makes the rate of convergence of the kernel estimator $\hat{m}(x)$ equal to $n^{-0.4}$, instead of the usual, slightly faster, parametric convergence rate of $n^{-0.5}$.

15.3 References

This chapter draws significantly from Silverman (1986) and Härdle (1990), to which readers are referred for more details and in-depth analysis.






