--- title: "MM Algorithms for Smoothed Absolute Values" author: "Jan de Leeuw" date: "Version May 04, 2018" output: pdf_document: keep_tex: yes number_sections: yes toc: yes word_document: toc: yes html_document: keep_md: yes number_sections: yes toc: yes fontsize: 12pt graphics: yes bibliography: absapp.bib abstract: We discuss two different even and convex non-negative smooth approximations of the absolute value function and apply them to construct MM algorithms for least absolute deviation regression. Both uniform and sharp quadratic majorizations are constructed. As an example we use the Boston housing data. In our example sharp quadratic majorization is typically 10-20 times as fast as uniform quadratic majorization. --- {r function_code, echo = FALSE} source("l1fn2.R") source("l1fu2.R") source("l1gu2.R") source("l1gn2.R") source("computeThatRate.R")  {r packages, echo = FALSE} options (digits = 10) suppressPackageStartupMessages (library (numDeriv, quietly = TRUE)) suppressPackageStartupMessages (library (captioner, quietly = TRUE)) suppressPackageStartupMessages (library (L1pack, quietly = TRUE)) suppressPackageStartupMessages (library (mlbench, quietly = TRUE)) figure_nums <- captioner (prefix = "Figure") figure_nums(name = "f", caption = "Square root approximation: functipn", display = FALSE) figure_nums(name = "df", caption = "Square root approximation: first derivative", display = FALSE) figure_nums(name = "ddf", caption = "Square root approximation: second derivative", display = FALSE) figure_nums(name = "g", caption = "Convolution approximation: function", display = FALSE) figure_nums(name = "dg", caption = "Convolution approximation: first derivative", display = FALSE) figure_nums(name = "ddg", caption = "Convolution approximation: second derivative", display = FALSE) figure_nums(name = "plot_f_eps", caption = "Square root approximation: different epsilon", display = FALSE) figure_nums(name = "plot_g_eps", caption = "Convolution approximation: different epsilon", display = FALSE) mprint <- function (x, d = 2, w = 5) { print (noquote (formatC (x, di = d, wi = w, fo = "f"))) }  **Note:** This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/absapp](http://gifi.stat.ucla.edu/absapp) has a pdf version, the bib file, the complete Rmd file with the code chunks, and the R source code. # Introduction Various problems in computational statistics involve absolute values. We mention $\ell_1$ regression, with the median as a special case, unidimensional scaling, support vector machines, and regression with $\ell_1$ penalty terms. Using absolute values can create problems for optimization algorithms used in fitting, because the absolute value function is not differentiable at zero. Smooth parametric approximations to the absolute value function have been suggested by, among many others, @ramirez_sanchez_kreinovich_argaez_14, @voronin_ozkaya_yoshida_15, and @bagul_17. It seemed to be useful exercise to use some of these approximations to construct majorization (nowadays MM) algorithms for various statistical techniques. # Using square root ## Function The simplest, and most commonly used, approximation to $|x|$ is $$\label{E:f} f_\epsilon(x)=\sqrt{x^2+\epsilon^2},$$ where $\epsilon$ is some, presumably small, positive number. It is computationally optimal in the somewhat convoluted sense explained by @ramirez_sanchez_kreinovich_argaez_14. Plots for $\epsilon^2$ equal to 1, .5, .1, and .01 are in figure r figure_nums("f", display = "n").
{r f, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) f <- function (x, epsilon) sqrt(x^2+epsilon) plot (x, abs(x),type="l",col = "RED", lwd = 3, ylab = "f") lines (x, f(x, 1), col = "BLUE", lwd = 2) lines (x, f(x, .5), col = "BLUE", lwd = 2) lines (x, f(x, .1), col = "BLUE", lwd = 2) lines (x, f(x, .01), col = "BLUE", lwd = 2) 
r figure_nums("f")

Clearly $f_\epsilon(x)>|x|$ for all $x$, and the maximum of $f_\epsilon(x)-|x|$ is $\epsilon$, attained at zero. Also $$\lim_{x\rightarrow\infty} f_\epsilon(x)-|x|=\lim_{x\rightarrow-\infty} f_\epsilon(x) - |x| = 0.$$ ## First Derivative The first derivative is $$f'_\epsilon(x)=\frac{x}{\sqrt{x^2+\epsilon^2}}$$ This is plotted, for the same values of $\epsilon$, in figure r figure_nums("df", display= "n").
{r df, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) df <- function (x, epsilon) x/sqrt(x^2+epsilon) plot (x, df(x, 1), col = "BLUE", type = "l", lwd = 2, ylab = "Df") lines (x, df(x, .5), col = "BLUE", lwd = 2) lines (x, df(x, .1), col = "BLUE", lwd = 2) lines (x, df(x, .01), col = "BLUE", lwd = 2) 
r figure_nums("df")

We see that $f'_\epsilon$ is an increasing function of $x$, and thus $f_\epsilon$ is convex. ## Second Derivative The second derivative, plotted in figure r figure_nums("ddf", display = "n"), is $$f''_\epsilon(x)=\frac{1}{\sqrt{x^2+\epsilon^2}}\frac{\epsilon^2}{x^2+\epsilon^2}.$$
{r ddf, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) ddf <- function (x, epsilon) (1/sqrt(x^2+epsilon))*(epsilon/(x^2+epsilon)) plot (x, ddf(x, .01), col = "BLUE", type = "l", lwd = 2, ylab = "DDf") lines (x, ddf(x, .1), col = "BLUE", lwd = 2) lines (x, ddf(x, .5), col = "BLUE", lwd = 2) lines (x, ddf(x, 1), col = "BLUE", lwd = 2) 
r figure_nums("ddf")

The second derivative is non-negative, with the horizontal axis as its asymptote. It attains its maximum, equal to $\epsilon^{-1}$, at zero. This bound on the second derivative is the main reason for using $\epsilon^2$ in the definition of $f_\epsilon$. {r check_f, echo = FALSE, eval = FALSE, cache = TRUE} x<-seq(-3,3,length=1000) dfn <- grad (f, x, epsilon = .1) ddfn <- rep(0, 1000) print (max(abs(dfn - df(x, .1)))) for (i in 1:1000) { ddfn[i] <- hessian (f, x[i], epsilon = .1) } print (max(abs(ddfn - ddf (x, .1))))  # Using convolution ## Function The following convolution approximation to $|x|$ was suggested by @voronin_ozkaya_yoshida_15. $$\label{E:g} g_\epsilon(x)\mathop{=}\limits^\Delta\frac{1}{\epsilon\sqrt{2\pi}}\int_{-\infty}^{+\infty}|x-y|\exp\{-\frac12\frac{y^2}{\epsilon^2}\}dy.$$ Carrying out the integration gives $$g_\epsilon(x)=x\left(2\Phi\left(\frac{x}{\epsilon}\right)-1\right)+2\epsilon\phi\left(\frac{x}{\epsilon}\right).$$ Because $g_\epsilon$ is a weighted sum (integral) of a set of convex functions it is convex. We have $g_\epsilon(x)>|x|$, and the maximum of $g_\epsilon(x)-|x|$ is equal to $\epsilon\sqrt{\frac{2}{\pi}}$, attained at zero. @voronin_ozkaya_yoshida_15 show that as $\epsilon\rightarrow 0$ we have both uniform and $L_1$ convergence of $g_\epsilon$ to $|x|$. Of course other convolution approximation with similar scale families (a.k.a. *mollifiers*), which have support converging to zero while maintaining mass equal to one, if scale goes to zero, can also be used. Figure r figure_nums("g", display = "n") show the approximation for $\epsilon$ equal to 1, .5, .1. and .01.
{r g, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) g <- function (x, epsilon) x*(2*pnorm(x/epsilon)-1)+2*epsilon*dnorm(x/epsilon) plot (x, abs(x),type="l",col = "RED", lwd = 3, ylab = "g") lines (x, g(x, 1), col = "BLUE", lwd = 2) lines (x, g(x, .5), col = "BLUE", lwd = 2) lines (x, g(x, .1), col = "BLUE", lwd = 2) lines (x, g(x, .01), col = "BLUE", lwd = 2) 
r figure_nums("g")

## First derivative The first derivative, in figure r figure_nums("dg", display = "n") is $$g'_\epsilon(x)=2\Phi\left(\frac{x}{\epsilon}\right)-1.$$
{r dg, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) dg <- function (x, epsilon) 2*pnorm(x/epsilon)-1 plot (x, dg(x, 1), type="l", col = "BLUE", lwd = 2, ylab = "Dg") lines (x, dg(x, .5), col = "BLUE", lwd = 2) lines (x, dg(x, .1), col = "BLUE", lwd = 2) lines (x, dg(x, .01), col = "BLUE", lwd = 2) 
r figure_nums("dg")

## Second derivative The second derivative, in figure r figure_nums("ddg", display = "n"), is $$g''_\epsilon(x)=\frac{2}{\epsilon}\phi\left(\frac{x}{\epsilon}\right).$$
{r ddg, fig.align = "center", echo = FALSE} x<-seq(-3,3,length=1000) ddg <- function (x, epsilon) 2*dnorm(x/epsilon)/epsilon plot (x, ddg(x, .1), type="l", col = "BLUE", lwd = 2, ylab = "DDg") lines (x, ddg(x, .5), col = "BLUE", lwd = 2) lines (x, ddg(x, 1), col = "BLUE", lwd = 2) 
r figure_nums("ddg")

The second derivative is positive and unimodal, with a maximum equal to $\frac{1}{\epsilon}\sqrt{\frac{2}{\pi}}$ attained at zero. {r check_g, echo = FALSE, cache = TRUE, eval = FALSE} epsilon = .1 x<-seq(-3,3,length=1000) dgn <- grad (g, x, epsilon = epsilon) ddgn <- rep(0, 1000) print (max(abs(dgn - dg(x, epsilon)))) for (i in 1:1000) { ddgn[i] <- hessian (g, x[i], epsilon = epsilon) } print (max(abs(ddgn - ddg (x, epsilon))))  # Applications to $\ell_1$ Regression Consider the problem of minimizing $$\label{E:l1loss} \sigma(\beta)=\sum_{i=1}^nw_i|y_i-x_i'\beta|$$ This problem has been studied in statistics for many, many years. See, for example, @dielman_05 or @farebrother_13. Our contribution to the computational aspects of the problem is modest, because we are mainly interested in its connection to majorization theory. ## AM/GM approach Let's start with the square root approximation $f_\epsilon$. The first majorization we construct is based on the AM/GM inequality. Because for each $\beta,\tilde\beta$ we have $$\sqrt{\{(y_i-x_i'\beta)^2+\epsilon^2\}\{(y_i-x_i'\tilde\beta)^2+\epsilon^2\}} \leq\frac12\{(y_i-x_i'\beta)^2+\epsilon^2\}+\{(y_i-x_i'\tilde\beta)^2+\epsilon^2\},$$ it follows that $$\sum_{i=1}^nw_if_\epsilon(y_i-x_i'\beta)\leq\frac12\sum_{i=1}^nw_i\frac{1}{\sqrt{(y_i-x_i'\tilde\beta)^2+\epsilon^2}}\{(y_i-x_i'\beta)^2+(y_i-x_i'\tilde\beta)^2+2\epsilon^2\}.$$ Thus if $V(\beta)$ is the diagonal matrix with diagonal elements $$v_{ii}(\beta)=\frac{1}{\sqrt{(y_i-x_i'\beta)^2+\epsilon^2}}$$ then majorization leads to an iteratively reweighted least squares algorithm. The update formula is $$\beta^{(k+1)}=(X'WV(\beta^{(k)})X)^{-1}X'WV(\beta^{(k)})y.$$ We apply this to the Boston housing data from the UCI repository (@newman_hettich_blake_merz_98), loaded from the mlbench package (@leisch_dimitriadou_10). {r boston_data, cache = TRUE} data("BostonHousing") y <- BostonHousing$medv z <- BostonHousing[, 1:13] x <- cbind (1, apply(z, 2, as.numeric)) w <- rep (1, length(y))  The R function l1fn2() is used. {r boston_amgm, cache = TRUE} h1 <- l1fn2 (x, y, aeps = 1e-2, itmax = 10000)  We have convergence of function values, to a change of less than r 1e-10, in r h1$itel iterations. The output below gives both $\sigma_\epsilon$, the approximating loss function that we minimize, and $\sigma$ from $\eqref{E:l1loss}$, the actual $\ell_1$ loss function that we approximate.
{r boston_amgm_out, echo = FALSE} cat( "eps:", formatC(1e-3,digits = 1, format = "e"), "itel:",formatC(h1$itel, width = 2, format = "d"), "sigma_e:", formatC(h1$f,digits = 6, format = "f"), "sigma:", formatC(h1$ff,digits = 6, format = "f"),"\n", "beta:", formatC(h1$beta,digits = 6, format = "f"), fill = TRUE) 
## Uniform quadratic majorization The second majorization uses an upper bound for the second derivative, as in @voss_eckhardt_80, @bohning_lindsay_88, @lange_16, p. 100, or @deleeuw_B_16b, chapter 11. It can be applied both to our square root $f_\epsilon$ and convolution $g_\epsilon$ approximations. First the square root. Because $$f_\epsilon(x)\leq f_\epsilon(y)+\frac{y}{\sqrt{y^2+\epsilon^2}}(x-y)+\frac{1}{2\epsilon}(x-y)^2$$ we have $$\sum_{i=1}^nw_if_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^nw_if_\epsilon(y_i-x_i'\tilde\beta)-\sum_{i=1}^nw_i\left\{\frac{y_i-x_i'\tilde\beta}{\sqrt{(y_i-x_i'\tilde\beta)^2+\epsilon^2}}\right\}x_i'(\beta-\tilde\beta)+\frac{1}{2\epsilon}\sum_{i=1}^nw_i(x_i'(\beta-\tilde\beta))^2.$$ which gives the majorization algorithm $$\beta^{(k+1)}=(X'WX)^{-1}X'W(X\beta^{(k)}+\epsilon\ F_\epsilon'(\beta^{(k)})),$$ where $F_\epsilon'(\beta)$ is a vector with elements $$\mathcal{D}f_\epsilon(y_i-x_i'\beta)=\frac{y_i-x_i'\beta}{\sqrt{(y_i-x_i'\beta)^2+\epsilon^2}}.$$ The corresponding expression for the convolution approximation $g_\epsilon$ is $$\sum_{i=1}^nw_ig_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^nw_ig_\epsilon(y_i-x_i'\tilde\beta)-\sum_{i=1}^nw_i\left\{2\Phi(\frac{y_i-x_i'\tilde\beta}{\epsilon})-1\right\}x_i'(\beta-\tilde\beta)+\frac12\frac{1}{\epsilon}\sqrt{\frac{2}{\pi}}\sum_{i=1}^nw_i(x_i'(\beta-\tilde\beta))^2.$$ The majorization algorithm is $$\beta^{(k+1)}=(X'WX)^{-1}X'W(X\beta^{(k)}+\epsilon\sqrt{\frac{\pi}{2}}G'_\epsilon(\beta^{(k)}))$$ Given the shape of the second derivatives of the approximating functions, we expect this algorithm to converge slowly. The uniform upper bound is only good very close to the origin, and horribly bad everywhere else. The update function only makes a very small move in the descent direction. We illustrate the slow convergence with two analyses of the Boston housing data, one using the square root and one using the convolution approximation, using the R routines l1fu2() and l1gu2(). For the $f_\epsilon$ approximation we find {r boston_qb_f, cache = TRUE} h2 <- l1fu2 (x, y, aeps = 1e-2, itmax= 100000) 
{r boston_qb_f_out, echo = FALSE, size = "footnotesize"} cat( "eps:", formatC(1e-3,digits = 1, format = "e"), "itel:",formatC(h2$itel, width = 2, format = "d"), "sigma_e:", formatC(h2$f,digits = 6, format = "f"), "sigma:", formatC(h2$ff,digits = 6, format = "f"),"\n", "beta:", formatC(h2$beta,digits = 6, format = "f"),fill=TRUE) 
We see that convergence to our default precision requires r formatC(h2$itel, format = "d") iterations. The results for the convolution approximation are similar. {r boston_qb_g, cache = TRUE} h3 <- l1gu2 (x, y, aeps = 1e-2, itmax= 100000)  {r boston_qb_g_out, echo = FALSE} cat( "eps:", formatC(1e-2,digits = 1, format = "e"), "itel:",formatC(h3$itel, width = 2, format = "d"), "sigma_e:", formatC(h3$f,digits = 6, format = "f"), "sigma:", formatC(h3$ff,digits = 6, format = "f"),"\n", "beta:", formatC(h3$beta,digits = 6, format = "f"), fill=TRUE)  ## Sharp quadratic majorization Uniform quadratic majorization fails rather miserably. But we have to realize that the AM/GM approach also leads to quadratic majorization, and in that case it works quite well. In order to improve the speed of convergence of the quadratic majorizations we turn to sharp non-uniform quadratic approximation, discussed in @deleeuw_lange_A_09. We use their theorem 4.5. **Theorem 4.5.** *Suppose f(x) is an even, differentiable function on$\mathbb{R}$such that the ratio$f'(x)/x$is decreasing on$(0,\infty)$. Then the even quadratic $$g(x)=\frac{f'(y)}{2y}(x^2-y^2)+f(y)$$ is the best quadratic majorizer of$f(x)$at the point$y$.* Both approximations$f_\epsilon$and$g_\epsilon$satisfy the conditions of the theorem. For$f_\epsilon$we find $$\sqrt{x^2+\epsilon^2}\leq\frac12\frac{1}{\sqrt{y^2+\epsilon^2}}(x^2-y^2)+\sqrt{y^2+\epsilon^2}.$$ Or $$\sum_{i=1}^n w_if_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^n w_if_\epsilon(y_i-x_i'\tilde\beta)+ \frac12\sum_{i=1}^n\frac{1}{\sqrt{(y_i-x_i'\tilde\beta)^2+\epsilon^2}}\{(y_i-x_i'\beta)^2-(y_i-x_i'\tilde\beta)^2\}.$$ But this is exactly the majorization provided by the AM/GM approach. Consequently the AM/GM approach provides us with the sharpest quadratic majorization of$f_\epsilon$. We see that in the Boston housing example sharp quadratic majorization reduces the number of iterations from r formatC(h2$itel, format = "d") for uniform quadratic majorization to r formatC(h1$itel, format = "d"). The relation between AM/GM and sharp quadratic majorization is both new and interesting, but it does not give us an alternative algorithm. We can just use AM/GM and the R function l1f1(). For the convolution approximation$g_\epsilon$, however, the result in the theorem gives a new non-uniform quadratic majorization. From theorem 4.5 $$g_\epsilon(x)\leq g_\epsilon(y)+\frac{2\Phi(\frac{y}{\epsilon})-1}{2y}(x^2-y^2),$$ or $$\sum_{i=1}^n w_ig_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^n w_ig_\epsilon(y_i-x_i'\tilde\beta)+\frac12\sum_{i=1}^nw_i\frac{2\Phi(\frac{y_i-x_i'\tilde\beta}{\epsilon})-1}{y_i-x_i'\tilde\beta}\{(y_i-x_i'\beta)^2-(y_i-x_i'\tilde\beta)^2\}.$$ We analyze the data with the R routine l1gn2(). {r boston_qb_g_n} h4 <- l1gn2 (x, y, aeps = 1e-2, itmax= 1000)  {r boston_qb_g_n_out, echo = FALSE} cat( "eps:", formatC(1e-2,digits = 1, format = "e"), "itel:",formatC(h4$itel, width = 2, format = "d"), "sigma_e:", formatC(h4$f,digits = 6, format = "f"), "sigma:", formatC(h4$ff,digits = 6, format = "f"),"\n", "beta:", formatC(h4$beta,digits = 6, format = "f"),fill=TRUE)  For uniform quadratic majorization we needed r formatC(h3$itel, format="d") iterations, for sharp majorization we only need r h4$itel. ## Rate of Convergence Suppose$\eta$is a majorization scheme for$\tau$, i.e.$\eta:\mathbb{R}^n\otimes\mathbb{R}^n\rightarrow\mathbb{R}$and$\tau:\mathbb{R}^n\rightarrow\mathbb{R}$such that$\eta(x,y)\geq\tau(x)$for all$x,y$and$\eta(x,x)=\tau(x)$for all$x$. Also assume sufficient differentiability and uniqueness of the minima in majorization iterations. @deleeuw_C_94c (also see @deleeuw_B_16b) shows that the convergence rate of the majorization algorithm is the largest eigenvalue of the matrix $$I-\{\mathcal{D}_{11}\eta(x,x)\}^{-1}\mathcal{D}^2\tau(x)=-\{\mathcal{D}_{11}\eta(x,x)\}^{-1}\mathcal{D}_{12}\eta(x,x).$$ evaulated at the fixed point$x$. This is the Jacobian of the update function, that maps$x^{(k)}$to its successor$x^{(k+1)}$. Note that in our majorization example$\eta$is a quadratic in$x$. Thus the second derivatives$\mathcal{D}_{11}$do not depend on$x$and are simply the matrix of the quadratic form. If$\sigma_\epsilon^f(\beta)\mathop{=}\limits^\Delta\sum_{i=1}^n w_if_\epsilon(y_i-x_i'\beta)$then $$\mathcal{D}^2\sigma_\epsilon^f(\beta)=\epsilon^2\sum_{i=1}^n w_i\left\{(y_i-x_i'\beta)^2+\epsilon^2\right\}^{-\frac32}x_i^{\ }x_i'$$ If$\sigma_\epsilon^g(\beta)\mathop{=}\limits^\Delta\sum_{i=1}^n w_ig_\epsilon(y_i-x_i'\beta)$then $$\mathcal{D}^2\sigma_\epsilon^g(\beta)=\frac{2}{\epsilon}\sum_{i=1}^n w_i\phi(\frac{y_i-x_i'\beta}{\epsilon})x_i^{\ }x_i'$$ {r check_D2_tau_f, echo = FALSE, eval = FALSE} eps <- .9 beta <- h1$beta ffunc <- function(beta, z, y, w, eps) { return (sum (w * sqrt((y - drop (z %*% beta)) ^ 2 + eps ^ 2))) } hn <- hessian (ffunc, beta, z = x, y = y, w = w, eps = eps) r <- y - drop (x %*% beta) v <- 1 / ((r ^ 2 + eps ^ 2) ^ (3/2)) d2_tau_f <- crossprod (x, v * w * x) * (eps ^ 2)  {r check_D2_tau_g, echo = FALSE, eval = FALSE} eps <- .25 beta <- h1$beta gfunc <- function(beta, z, y, w, eps) { r <- y - drop (z %*% beta) return (sum (w * (r * (2 * pnorm (r / eps) - 1) + 2 * eps * dnorm (r / eps)))) } gfunc (beta, z = x, y = y, w = w, eps = eps) hn <- hessian (gfunc, beta, z = x, y = y, w = w, eps = eps) r <- y - drop (x %*% beta) v <- dnorm (r / eps) d2_tau_g <- (2 / eps) * crossprod (x, v * w * x)  The function computeThatRate() gives the theoretical convergence rates at the$\ell_1$solution for the Boston housing data for various values of$\epsilon$and for all four quadratic majorizations (uniform-f, uniform-g, non-uniform-F, and non-uniform-g). {r compute_those_rates} epss <- c (5, 2, 1, .5, .1, .01, .005) beta <- h1$beta computeThatRate (x, y, w, beta, epss)  From the output we see that the convergence rate tends to one if $\epsilon\downarrow 0$. Thus asymptotically our quadratic majorization algorithms will have a sublinear convergence rate. As we have already seen in our example, the non-uniform methods converge much faster than the uniform ones. Throughout this report we have chosen $\epsilon$ as some conventional numbers, without taking into account the scale of the data or the closeness of the fit. This is probably not a good idea, and especially in this large example our conventional $\epsilon$'s could very well be too small. We check this more closely by using non-uniform majorization for seven different values of $\epsilon$. The results are below. Epsilon is in the first column, followed by the results for the number of iterations, the value of the approximation function at the minimum, and the value of the least abosolute deviation function itself. {r those_f_eps, echo = FALSE} epss <- c (5, 2, 1, .5, .1, .05, .01) beta <- h1$beta he1 <- l1fn2 (x, y, aeps = epss[1], itmax = 100000) print (c(epss[1], he1$itel, he1$f, he1$ff)) he2 <- l1fn2 (x, y, aeps = epss[2], itmax = 100000) print (c(epss[2], he2$itel, he2$f, he2$ff)) he3 <- l1fn2 (x, y, aeps = epss[3], itmax = 100000) print (c(epss[3], he3$itel, he3$f, he3$ff)) he4 <- l1fn2 (x, y, aeps = epss[4], itmax = 100000) print (c(epss[4], he4$itel, he4$f, he4$ff)) he5 <- l1fn2 (x, y, aeps = epss[5], itmax = 100000) print (c(epss[5], he5$itel, he5$f, he5$ff)) he6 <- l1fn2 (x, y, aeps = epss[6], itmax = 100000) print (c(epss[6], he6$itel, he6$f, he6$ff)) he7 <- l1fn2 (x, y, aeps = epss[7], itmax = 100000) print (c(epss[7], he7$itel, he7$f, he7$ff))  For larger values of $\epsilon$ the values of $\sigma(\beta)$ and $\sigma_\epsilon^f(\beta)$ are very different. But the value of $\sigma(\beta)$ is already quite close to its minimum value, and that is of course what we are interested in. This is illustrated also in figure r figure_nums("plot_f_eps", display = "n"), where the 7 solutions for $\beta$ are plotted. Only the first one, for $\epsilon$ equal to 5, is any different, the rest are virtually indistinguishable. Thus if you want 10 decimals precision in your regression analysis then by all means choose $\epsilon$ small. But seeing a regression analysis with 10 decimals precision tell us more about the data analyst than it tells us about the data.
{r plot_f_eps, echo = FALSE, fig.align = "center"} r <- rainbow (7) plot (1:14, he1$beta, type = "l", lwd = 2, col =r[1], xlab = "predictor", ylab = "beta") lines (1:14, he2$beta, lwd = 2, col =r[2]) lines (1:14, he3$beta, lwd = 2, col =r[3]) lines (1:14, he4$beta, lwd = 2, col =r[4]) lines (1:14, he5$beta, lwd = 2, col =r[5]) lines (1:14, he6$beta, lwd = 2, col =r[6]) lines (1:14, he7$beta, lwd = 2, col =r[7])  r figure_nums ("plot_f_eps") For completeness we repeat the analysis for the convolution approximation. The results are very much the same. {r those_g_eps, echo = FALSE} epss <- c (5, 2, 1, .5, .1, .05, .01) beta <- h1$beta hf1 <- l1gn2 (x, y, aeps = epss[1], itmax = 100000) print (c(epss[1], hf1$itel, hf1$f, hf1$ff)) hf2 <- l1gn2 (x, y, aeps = epss[2], itmax = 100000) print (c(epss[2], hf2$itel, hf2$f, hf2$ff)) hf3 <- l1gn2 (x, y, aeps = epss[3], itmax = 100000) print (c(epss[3], hf3$itel, hf3$f, hf3$ff)) hf4 <- l1gn2 (x, y, aeps = epss[4], itmax = 100000) print (c(epss[4], hf4$itel, hf4$f, hf4$ff)) hf5 <- l1gn2 (x, y, aeps = epss[5], itmax = 100000) print (c(epss[5], hf5$itel, hf5$f, hf5$ff)) hf6 <- l1gn2 (x, y, aeps = epss[6], itmax = 100000) print (c(epss[6], hf6$itel, hf6$f, hf6$ff)) hf7 <- l1gn2 (x, y, aeps = epss[7], itmax = 100000) print (c(epss[7], hf7$itel, hf7$f, hf7ff))  {r plot_g_eps, echo = FALSE, fig.align = "center"} k <- matrix (0, 7, 14) r <- rainbow (7) plot (1:14, hf1beta, type = "l", lwd = 2, col =r[1], xlab = "predictor", ylab = "beta") lines (1:14, hf2$beta, lwd = 2, col =r[2]) lines (1:14, hf3$beta, lwd = 2, col =r[3]) lines (1:14, hf4$beta, lwd = 2, col =r[4]) lines (1:14, hf5$beta, lwd = 2, col =r[5]) lines (1:14, hf6$beta, lwd = 2, col =r[6]) lines (1:14, hf7$beta, lwd = 2, col =r[7]) 
r figure_nums ("plot_g_eps")

# Discussion In future reports we may consider applications to nonlinear $\ell_1$, to the lasso, to support vector machines, and to unidimensional scaling. # Code ## l1fu2.R {r file_auxilary, code = readLines("l1fu2.R")}  ## l1fn2.R {r file_auxilary, code = readLines("l1fn2.R")}  ## l1gu2.R {r file_auxilary, code = readLines("l1gu2.R")}  ## l1gn2.R {r file_auxilary, code = readLines("l1gn2.R")}  ## computeThatRate.R {r file_auxilary, code = readLines("computeThatRate.R")}  # References