Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation

Originating from a system theory and an input/output point of view, I introduce a new class of generalized distributions. A parametric nonlinear transformation converts a random variable $X$ into a so-called Lambert $W$ random variable $Y$, which all…

Authors: Georg M. Goerg

The Annals of Applie d Statistics 2011, V ol. 5, No. 3, 2197–223 0 DOI: 10.1214 /11-A OAS457 c  Institute of Mathematical Statistics , 2 011 LAMBER T W RANDOM V ARIABLES —A NEW F AMIL Y OF GENERALIZED SKEWE D DISTRIBUTIONS WITH APPLICA TIONS TO RISK ESTIMA TION By Georg M. Goerg Carne gie Me l lon University Originating from a system theory a nd an input/output p oin t of view, I introduce a n ew class of generalized distributions. A para- metric nonlinear transforma tion con verts a random v ariable X into a so-called Lambert W random v ariable Y , which allo ws a very flex- ible approac h to mo del skew ed data. Its shap e d epen ds on the shap e of X a nd a skew ness p arameter γ . In particular, for symmetric X and nonzero γ the outp ut Y is skew ed. I t s distribution and densit y function are particular v arian t s of their input counterparts. Maxi- mum lik eliho o d and metho d of moments estimators are presented, and sim ulations show th at in t h e symmetric case additional estima- tion of γ does not affect the qu alit y of other p arameter estimates. Applications in finance and biomedicine show the relev ance of this class of distributions, whic h is particularly useful for slig htly skew ed data. A practical b y- result of the Lambert W framew ork: data can b e “unskew ed.” The R pac k age LambertW develo p ed by the auth or is p ublicly av ail- able ( CRAN ). 1. In tro d uction. Exploratory data analysis regarding asymmetry in data is usually based on histograms and n onparametric density estimates, and statemen ts suc h as “this d ata set lo oks almost Gaussian, but it is skew ed to the right” or “these asset returns hav e hea vy tail s, b ut they are too sk ewe d that a student- t w ould make sens e” are fairly common. It is therefore natural to generalize symmetric distributions to allo w for asymmetry . A prominent generalizat ion is the sk ew-n ormal distribution [Azzalini ( 1985 )], whic h includ es the Gauss ian as a sp ecial case. A skew-normal r an- dom v ariable (R V) is d efined by ha ving the p robabilit y densit y function Received F ebruary 2010; revised January 2011. Key wor ds and phr ases. F amily of skew ed distributions, skewness, transformation of random v ariables, Lambert W , la tent v ariables, stylized facts of asse t returns, v alue at risk, GARCH. This is an electr onic reprint of the original a r ticle published b y the Institute of Mathematical Statistics in The A nn als of A pplie d Statistics , 2011, V ol. 5, No. 3, 2197–223 0 . This reprint differs from the original in pagina tion and t yp ogr aphic deta il. 1 2 G. M. GOER G (p.d.f.) f ( x ) = 2 φ ( x )Φ( αx ), where Φ( · ) is the cum ulative distribution func- tion (c .d.f.) of a standard Gaussian, and α ∈ R is the sk ew parameter. This approac h to skewness has not only led to su bstan tial r esearch in the skew- normal case [Azzalini and Capitanio ( 1999 ), Arellano-V alle and Azzalini ( 2006 )], but the same concept has also b e en u sed for s tuden t- t [Azza lini and Capitanio ( 2003 )] and Cauc hy distributions [Arnold and Bea ver ( 2000 ), Be- h b oo dian, Jamalizadeh and Balakrishnan ( 2006 )], among others. In all these cases, a parametric manipulation of the original sym metric p .d.f. introdu ces sk ewn ess. Not withstanding the huge success of this app roac h to mo del sk ew ed data, manipulating the p.d.f. to in tro d u ce s kewness seems like p utting the cart b efore the horse: densities are sk ewed, b ecause the random v ariable is—n ot the other w ay around. Also, app lied research starts with data, not with histograms. Motiv ated by this data-driv en view on sk ewn ess, I pr op ose a no vel ap- proac h to asymmetry: Lam b ert W × F X distributions emerge naturally b y mo deling the obs erv able R V Y as the output of a sys tem S d r iv en by random input X with c.d.f. F X ( x ). Here S can either b e a real chemica l, ph ys ical, or biologica l system, or refer to an y kind of mec hanism in a broader sense. In statistica l mo deling suc h a sy s tem is simply represent ed by tr an s formations of R Vs. As there are no restricti ons on F X ( x ), this is a v ery general frame- w ork that can b e analyzed in detail for a particularly c hosen inp ut c.d.f. Figure 1 ill ustrates the metho d ology . F or ins tance, consider S b eing the sto c k m ark et, wh ere p eople bu y and sell an asset according to its exp ected su ccess in the futur e. Asset returns, that is, the p ercen tage c han ge in price, typica lly exhibit negativ e skewness and p ositiv e excess kur tosis—so-cal led stylize d facts [Y an ( 2005 ), Cont ( 2001 )]. The left panel of Figure 2 sho w s daily lo g-returns (in p ercent) y := { y t | t = 1 , . . . , 1,413 } of an equit y fund inv esting in Latin America (LA T AM 1 ). Also, these retur ns are clearly non-Gaussian given their excess kur tosis (1 . 201) and large negativ e skewness ( − 0 . 433)—see T able 1 . The excess kur tosis is t ypically addressed b y a s tu den t t -distribu tion, bu t here a Kolmogoro v– Smirnov test still rejects y ∼ t 6 . 22 on a 5% lev el (ev en for the estimated ν ), as the empirical sk ewness is to o large. Th us, to mo del the pr obabilistic prop erties of such data, asymmetric distrib utions must b e used. Using Lam b ert W × F R Vs to mo d el the asymmetry in asset returns is p erfectly suitable not only giv en empirical eviden ce of “almost stu den t- t , but a little s kew ed data,” but also b y a more fu ndament al viewp oin t. Price c hanges are commonly considered as the r esult of b ad an d go o d news hitting 1 Data from Jan uary 1, 2002 unti l May 31, 2 007: R pack age fEcof in , data set equityFund s , series LATAM . LAMBER T W RANDO M V A RIABLES 3 Fig. 1. Schematic view of the L am b ert W appr o ach to asymmetry: (left) an i nput/outp ut system S tr ansforms (solid arr ows) i nput X ∼ F X to output Y ∼ L amb ert W × F X and her ewith intr o duc es skewness; (right) infer enc e ab out skewe d data y = ( y 1 , . . . , y N ) : (1) unskew observe d y to latent symmetric data b x , (2) use m etho ds of your choic e (r e gr es- sion, time series mo dels, quantile estimation, hyp othesis tests, etc.) for statistic al infer enc e b ase d on b x , and (3) c onvert r esults b ack to the “skewe d world” of y . the market: bad n ews, negativ e return s; go o d news, p ositive returns. The empirical evidence of negativ e sk ewn ess ev ok es the follo w ing question: why should news p er se b e negativ ely skew ed? Or p u t in other w ords : do really bad things happ en more often than really go o d things? In the Lam b ert W framew ork this news ↔ return relation is mod eled under the assum ption that the p robabilit y of getti ng negativ e news is ab out the same as of getting p ositiv e news , but t ypically p eople react far more drastically facing negativ e than p ositive ones. Thus, news X ∼ F X ( x ) are symmetrically distributed, th e mark et S acts as an asymmetric filter, an d the measurable/observ able outcome is a skew ed R V Y /data y . Fig. 2. Daily e quity fund r eturns (LA T AM): (left) observe d r eturns y and estimate d latent news b x b τ MLE ; (right) Gaus sian QQ plot and L amb ert W × t QQ pl ot of y . 4 G. M. GOER G T able 1 LA T AM r eturns y and b ack-tr ansforme d series b x b τ : (top) summary st atistics; (b ottom) Shapir o–Wilk (SW) & Jar que–Ber a (JB) normality tests, and Kolmo gor ov–Smirnov (KS) test for stu dent- t wi th ν de gr e es of fr e e dom LA T AM y b x b τ IGMM b x b τ MLE Min − 6 . 064 − 5 . 073 − 4 . 985 Max 5 . 660 7 . 036 7 . 268 Mean 0 . 121 0 . 190 0 . 198 Median 0 . 138 0 . 138 0 . 138 St. dev . 1 . 468 1 . 456 1 . 457 Skewness − 0 . 433 0 . 000 0 . 053 Kurtosis 1 . 201 1 . 100 1 . 159 b ν 6 . 220 7 . 093 7 . 048 SW 0 . 000 0 . 000 0 . 000 JB 0 . 000 0 . 000 0 . 000 KS ( t b ν ) 0 . 028 0 . 088 0 . 102 Last, the right part of Figure 1 also illustrates a very pragmatic, y et use- ful w a y to exploit the Lambert W framew ork for (slightl y) sk ew ed data. If a certain statistical pr o cedure or mo d el assumes a symmetric—a Gaus- sian, as often is the case—distribution and n o sk ewed imp lemen tation of this metho d is a v ailable, then in stead of applying it to the skew ed y , it is advisable to work w ith the “symmetrized” data b x , make statistical inference ab out X based on b x , and then transform the obtained r esults bac k to the “sk ewed w orld” of Y . Although this is only an appr o ximation to the tr uth, at least this approac h tak es sk ewness in to consideration in stead o f ignorin g it. Section 2 defin es Lam b ert W R Vs a nd their basic prop erties are studied. Section 3 pr esents analytic exp ressions of the c.d.f. G Y ( y ) and p.d.f. g Y ( y ), whic h are particular v ariant s of their input counterparts. After stu d ying Gaussian input in Section 4 , v arious estimators for the parameter v ector of Lam b ert W × F R Vs are introd uced in Section 5 . Section 6 compares their finite sample prop erties and shows that additional estimation of the sk ew- ness parameter γ d o es not affect the qualit y of other parameter estimates. This new class of distribu tion functions is particularly useful for d ata with sligh tly negat iv e sk ewn ess, th u s, Section 7 demonstrates its adequacy on an Australian athlet es data set and on the LA T AM return ser ies. In particular, S ection 7.2 sho ws that th e in put-output system (Figure 1 ) with studen t- t input X is a prop er mo d el f or these returns. A deta iled com- parison of quanti le estimates, wh ich are essential to get appropriate risk measures of an asset, confirms the aptness of Lam b ert W distributions (see Lam b ert W QQ plot in Figure 2 ). Empirical evidence for the signifi cance of LAMBER T W RANDO M V A RIABLES 5 conditional heterosk edastic time series mo dels using Lam b ert W × F in no- v ations concludes Sectio n 7.2 . Finally , Section 8 establishes a d irect link of this new class of distributions to the existing statistics literature, noting that the square of a R V havi ng T ukey’s h distribu tion [T uk ey ( 1977 )] has a Lam b ert W × χ 2 1 distribution. Computations, figures and simula tions w ere realized with the op en-source statistics pac k age R [R Dev elopment Core T eam ( 2008 )]. F unctions used in the analysis are a v ailable as the R pac k age LambertW , which pro vides m an y other methods to p erform L amb ert W inference in practice. 2. Lam b ert W random v ariables. The general notion of a s ystem S with random input and output as sho wn in Figure 1 translates to a v ariable transformation in statistical terminology . Definition 2.1 (Noncen tral, nonscaled Lam b ert W × F R V). Let U b e a co ntin uous R V with c.d.f. F U ( u ) = P ( U ≤ u ) , u ∈ R , (2.1) and p.d.f. f U ( u ). Then Z := U exp( γ U ) , γ ∈ R , (2.2) is a nonc entr al, nonsc ale d L amb ert W × F R V with ske w ness parameter γ . If U is from a parametric family F U ( u | β ), where β parametrizes the F U , then Z is a noncentral , nonscaled Lambert W × F R V with parameter v ector θ = ( β , γ ). The ke y of this family of R Vs is γ , whic h can tak e any v alue on the real line. As exp( · ) is alwa ys p ositiv e, U and Z h a ve the same sign. F or readabilit y let H γ ( u ) := u exp( γ u ). F or γ = 0 transformation ( 2.2 ) reduces to th e identit y Z ≡ U ; thus, Z p ossesses the exact same pr op erties as U . By con tinuit y of H γ ( · ), one can exp ect for γ 6 = 0 bu t close, also Z 6 = U but close. T rans formation ( 2.2 ) indeed describ es a system S with an asymmetry prop erty: let U ∼ F U ( u ) b e a symmetric zero-mean R V, then Z is a sk ewed v ersion of U —dep ending on the sign of γ . F or γ < 0 negativ e U are amplified b y th e factor exp( γ U ) > 1 and p ositive U are damp ed b y 0 < exp( γ U ) < 1: Z is sk ew ed to the left. F or γ > 0 the same reasoning sho ws that Z is a p os- itiv ely . The noncen tral momen ts E ( Z n ) equal ψ ( n ) := Z u n e γ un f U ( u ) d u. (2.3) 6 G. M. GOER G If the momen t-generating fun ction M U ( t ) := E e tU exists for t = γ n , then ( 2.3 ) can b e rewritten to get a more tractable form u la. As ∂ n ∂ γ n e γ U n = ( U n ) n e γ U n , in terchanging d ifferen tiation and the inte gral sign yields ψ ( n ) = n − n ∂ n ∂ γ n M U ( γ n ) . (2.4) If M U ( t ) do es not exist (e.g., for stud en t- t U ), then ( 2.3 ) m ust b e calc u- lated explicitly . 2.1. Sc ale family input. In a typica l in put/output system S suc h as a mi- crophone/loudsp eake r setting, the loudsp eake r will b e louder if sp eak ers raise their voi ce. I n this sense it is stable with resp ect to scaling: doub ling the v olume of the input doub les the v olume of the louds p eak ers—th e signal is not affecte d in any ot her w a y . Viewing this system as a Lam b ert W × F R V system (where the signal is considered as a R V), multiplying X by a fac- tor κ , sh ould—ceteris paribu s—only affect th e output Y b y multiplying by κ ; other prop erties, such as sk ewness or kur tosis, should n ot b e altered. T rans formation ( 2.2 ), how ev er, do es not ha v e this scaling prop ert y of U . Hence, to allo w a comparable system c h aracterizat ion via γ among different scalable data sets define a scale d Lam b ert W R V. Definition 2 .2 (Scale Lambert W × F R V). Le t U := X/σ x b e the unit-v ariance version of a contin uous R V X from a scale f amily F X ( x | β ), where β is the parameter (v ector) of F X and σ x the standard deviatio n of X . Then Y := { U exp( γ U ) } σ x = X exp( γ X/σ x ) , γ ∈ R , σ x > 0 , (2.5) is a sc ale L amb ert W × F R V with parameter v ector θ = ( β , γ ). T rans formation ( 2.5 ) is in v arian t to scaling of the inp ut, for example, a differen t measuremen t unit f or the input do es not mo dify the asymmetry prop erty of the system, but ju st scales the output acco r dingly . Here σ x is a function of β : for an exp onen tially d istr ibuted inpu t X ∼ exp( λ ), β = λ and σ x ( β ) = λ − 1 ; an input X h a ving a Gamma distribution with s hap e α and rate β giv es β = ( α, β ) and σ x ( β ) = √ α/β . 2.2. L o c ation-sc ale family i nput. The focus of this w ork lies in in tro du c- ing sk ewness to symmetric R Vs with supp ort on ( − ∞ , ∞ ), su ch as a Gaus- sian or stud en t- t . These d istributions are n ot only scale, but also shift in v ariant, a prop ert y Lam b ert W × F distr ib ution should also ha ve for lo cation-family in put. Ho w ever, transform ation ( 2.5 ) is not sh if t-inv arian t. F or example, consid er a zero-mean and unit v ariance inpu t R V U 0 : Ω → R , U 10 := U 0 + 10, and let γ = 0 . 1. If U 0 ( ω ) is cl ose to 0, then the s hifted U 10 ( ω ) LAMBER T W RANDO M V A RIABLES 7 will b e close to 10. F or the corresp onding Z 0 ( ω ) and Z 10 ( ω ) this do es not hold: Z 0 ( ω ) is close to 0, but Z 10 ( ω ) w ill not b e shifted b y 10, but lies close to 10 exp(1) ≈ 27 . 183. Definition 2.3 (Lo cation-sca le Lambert W × F R V). Let X b e a R V from a lo cation-scale f amily w ith c.d.f. F X ( x | β ) with mean µ x and standard deviation σ x ; again β p arametrizes F X . Let U = ( X − µ x ) /σ x b e the zero- mean, unit-v ariance v ersion of X . Th en Y := { U exp ( γ U ) } σ x + µ x , γ ∈ R , (2.6) is a lo c ation-sc ale L amb ert W × F R V with parameter v ector θ = ( β , γ ). As b efore, the parameter γ regulates the closeness b etw een X and its sk ewed version Y . F or a full parametrization of a Lam b ert W × F distribution it is necessary to kno w θ = ( β , γ ); viewing ( 2.6 ) only as a transformation from X to Y , it is more natural—and in practice more useful—to only consider µ x , σ x and γ , ignoring the particular str u cture of X give n its parametrization by β . In order to distingu ish these t wo cases in the r emaining part of this work let τ := ( µ x , σ x , γ ) ∈ T = R × R + × R . Clearly , τ can b e computed from θ , τ = τ ( θ ), but n ot n ecessarily vice-v ersa. F or example, for a Gaussian X τ ≡ θ sin ce µ x ( β ) = µ x and σ x ( β ) = σ x . In contrast , for a lo cation-sca le student- t inpu t with β = ( c, s, ν )—wh er e c is the location, s the scale and ν the degrees o f freedom paramete r— τ 6 = θ : µ x ( β ) = c and σ x ( β ) = s q ν ν − 2 if ν > 2 . Th us, b elo w I use either θ if the full p arametrizatio n is im p ortan t or τ if it is sufficien t to consider ( 2.6 ) only as a transformation rather than a fully sp ecified parametric distrib ution. Not a tion 2.4 (Lam b ert W × F R V). F or simplicit y I will refer to all Y in Definitions 2.1 , 2.2 and 2.3 as a L amb ert W × F R V. Which one of the three transformations ( 2.2 ), ( 2.5 ) or ( 2.6 ) is used to generate Y will b e clear f r om the t yp e of inpu t X . F or example, since a χ 2 k distribution do es not ha ve lo cation or scale parameters, a Lambert W × χ 2 k R V refers to Y in Definition 2.1 ; the exp onen tial distribution is a scale family , th us, a Lam b ert W × exp ( λ ) R V Y is defined in Definition 2.2 ; and f or Gaussian input X , the co rresp on d ing Lam b ert W × Gaussian Y refers to Defin ition 2.3 . 2 2.3. L atent variables. So far attenti on has b een dra wn to Y and its pr op - erties giv en X and θ (or τ ). No w consider the inv er s e pr oblem: giv en Y and θ (or only τ ), what does X look lik e? 2 Although technically n ot correct, one can think of a scale Lambert W × F transfor- mation having τ = (0 , σ x , γ ), and a n oncentral , n on scaled Lambert W × F transformation having τ = (0 , 1 , γ ) . This is especially useful for empirical w ork and implementation of the metho d s inv olving Lambert W × F R Vs. 8 G. M. GOER G Fig. 3. L amb ert W function: tr ansformation H ( u ) and the two i nverse br anches of W ( z ) for z < 0 : princip al br anch (dashe d curve) and nonprincip al br anch (dotte d curve). This is not only in teresting for a latent v ariable in terpretation of X , bu t the inv ers e of a transformation is essent ial to derive the c.d.f. of the trans- formed v ariable. Before analyzi ng transformation ( 2.6 ), consider the nonlin - ear transform ation H : C → C , u 7→ u e xp u =: z [Figure 3 sho w s H ( u ) only for u ∈ R ]. F or p ositiv e u the f unction is bijectiv e and resem bles exp( u ) v ery closely . F or n egativ e u , ho wev er, H ( u ) is quite different fr om exp( u ): it take s on negativ e v alues, its minim um v alue equals − 1 e , and—most imp ortan tly— it is nonbijectiv e. Although H ( u ) has no analytical inv er s e [Rosenlic h t ( 1969 )], its imp licitly defined inv erse fun ction is we ll kno wn in mathematics and p h ys ics. Definition 2.5 (Lam b ert W fun ction). The many- v alued fun ction W ( z ) is the ro ot of W ( z ) e W ( z ) = z , z ∈ C , (2.7) and is commonly denoted as the L amb ert W function . Generally the Lamb er t W function is d efined for any z ∈ C . Since Lam b ert W R Vs are only d efined for real-v alued outcomes, in this work the domain and image of the Lam b ert W fun ction is r estricted to th e reals. F or z ∈ [ −∞ , − 1 /e ) no real solution exists; for z ∈ [ − 1 /e, ∞ ) W ( z ) is a real-v alued function. If z ∈ [ − 1 /e, 0), there are tw o r eal solutions: the prin cipal branc h W 0 ( z ) ≥ − 1 and the nonprin cipal branc h W − 1 ( z ) ≤ − 1; f or z ∈ [0 , ∞ ) only one real- v alued solution exists, W 0 ( z ) = W − 1 ( z ) (see Figure 3 ). F or a detailed review in clud ing useful p rop erties and functional identit ies of W ( z ) s ee Corless et al . ( 1996 ), V allur i, Jeffrey and Corless ( 2000 ) and the references therein. LAMBER T W RANDO M V A RIABLES 9 Figure 3 also sho ws how sk ewness is in tro du ced via transformation ( 2.6 ). Symmetric inp u t X ( x -axis) is map p ed to asymmetric output Y ( y -axis) due to the curv ature of H ( u ) . Analogously , mapping v alues from the y - axis to the x -axis “unskews” them. Figure 3 sho ws H γ ( u ) : u 7→ z for γ = 1, th u s, its inv erse is Lam b ert’s W function ( W ( z ) : z 7→ u ). Th e curv ature of H γ ( u ) = u exp( γ u ) dep ends on the skewness parameter: for γ = 0 no cur- v atur e is pr esen t [ H 0 ( u ) = u ]; h igher γ results in more curv ature, and thus more sk ewness in Y . It can b e easily v erifi ed that W γ ( Z ) := W ( γ Z ) /γ is the inv erse function of transf orm ation ( 2.2 ). Hence, giv en Y and τ , the unobs erv able in p ut X can b e reco v ered via W γ  Y − µ x σ x  σ x + µ x = U σ x + µ x = X . (2.8) F or empirical wo rk it is imp ortant to p oin t out that ( 2.8 ) do es not require sp ecific kno wledge ab out F X or β ; µ x and σ x (and γ ) suffice. This will b ecome esp ecially u seful for estimating the optimal inv erse transformation— see S ection 5.2 . Remark 2.6 (Nonpr incipal branch). The Lam b ert W fu nction has t wo branc hes on the negativ e real line (Fig ure 3 ), so transformation ( 2.8 ) is not unique. F or examp le, consid er z = − 0 . 25 and γ = 1. The t w o r eal-v alued so- lutions are W 0 ( − 0 . 25) ≈ − 0 . 357 and W − 1 ( − 0 . 25) ≈ − 2 . 153. Assu ming a sta- ble input/output sys tem, only the principal branc h mak es sens e 3 —denoted b y W γ , 0 ( · ). If the nonprincipal solution is r equired, W γ , − 1 ( · ) will b e u s ed. The probabilit y p − 1 that the observ ed v alue Z ( ω ) w as ind eed caused b y the nonprincipal solution is at most P { U < − 1 / | γ |} , since H γ ( u ) changes its monotonicit y at u = − 1 /γ . F or Ga ussian X and γ = 0 . 1—a v ery large v alue giv en emp irical evidence—this probabilit y equals 7 . 62 · 10 − 24 . F or an input with stud en t t -distribution and ν = 4 degrees of freedom p − 1 ≈ 7 . 26 · 10 − 5 . Hence, ignorin g the n onprincipal ro ot to obtain un ique latent data sh ould not matt er too m uch in practice. Algorithm 1 describ es the empirical ve rsion of ( 2.8 ). The s o obtained x n = W γ , 0  y n − µ x σ x  σ x + µ x , n = 1 , . . . , N , (2.9) is the input data b x τ generating the observ ed y and should ha ve c.d.f. F X ( x ). Here c ( · ) do es not stand for an estimate of τ , b ut s ince W γ , 0 ( · ) ignores the 3 The output is assumed to be similar to the input, but skew ed. Theref ore, the input v alues causing the output should lie close to them: observing z = − 0 . 25, it is more rea - sonable to assume that this correspond s to the close in p ut of W 0 ( − 0 . 25) ≈ − 0 . 3 57 rather than t he very ex treme W − 1 ( − 0 . 25) ≈ − 2 . 153. 10 G. M. GOER G Algorithm 1 Get input b x τ : fun ction get.input ( · ) in the LambertW pac k age. Input: data v ector y ; parameter v ector τ = ( µ x , σ x , γ ). Output: inpu t v ector b x τ . 1: z = ( y − µ x ) /σ x . 2: bac k-transform z via th e p rincipal b ranc h to u = W γ , 0 ( z ). 3: return b x τ = u σ x + µ x . nonprincipal branch, Algorithm 1 need not return the “true” inpu t data x — ev en if τ is k n o wn. 4 F or small γ , b x τ will most lik ely equal the tru e x for all n ; for large γ some y j ’s m ight b e falsely assigned to the pr incipal x j ’s, although these y j ’s w ere actually caused b y nonpr incipal x j ’s. F or an estimate b τ the notation b x b τ will b e used, whic h itself is an appr o ximation to b x τ . 3. Distribution and densit y f unction. F or ease of notation and readabil- it y let z := y − µ x σ x , u 0 := W γ , 0 ( z ) , u − 1 := W γ , − 1 ( z ) , (3.1) x 0 := u 0 σ x + µ x , x − 1 := u − 1 σ x + µ x . By definition, G Y ( y ) = P ( Y ≤ y ) = P ( { U exp( γ U ) } σ x + µ x ≤ y ) = P ( U exp γ U ≤ z ) . The transformation H γ ( u ) c h anges its m onotonicit y at u = − 1 /γ and its in ve rse W γ , 0 ( z ) at z = − 1 / ( γ e ). Consequ en tly , the even t { Y < ( > ) y } [for γ > ( < ) 0] has to b e sp lit u p into separate ev ent s in U to deriv e the distribution of Y . Theorem 3.1 (D istribution of a lo cation-scal e Y ). The c.d.f. of a lo c a- tion-sc ale L amb ert W × F R V Y e quals (for γ > 0 ) G Y ( y | β , γ ) =          0 , if y < − σ x γ e + µ x , F X ( x 0 | β ) − F X ( x − 1 | β ) , if − σ x γ e + µ x ≤ y ≤ µ x , F X ( x 0 | β ) , if y ≥ µ x . (3.2) The c ase γ < 0 c an b e obtaine d analo gously and for γ = 0 it is cle ar that G Y ( y | β , 0) = F X ( y | β ) . 4 This only applies if p − 1 = P ( U < − 1 | γ | ) > 0, as otherwise the b ac k-transformation W γ ( z ) = W γ , 0 ( z ) is bijective. In p articular, if U ≥ 0—for ex amp le, for scale family input X ≥ 0—then x τ ≡ x , not just an approximation. See also Co rollary 3.3 . LAMBER T W RANDO M V A RIABLES 11 Pr oof. F ollo w s b y matc hing the ev ent s in Z with the corresp onding ev ents in U [Glen, Leemis and Drew ( 1997 )]; see Figure 3 .  F or z = − 1 / ( γ e ) b oth b ranc h es of W ( z ) coincide, th us, u 0 = u − 1 . There- fore, F X ( u 0 σ x + µ x ) − F X ( u − 1 σ x + µ x ) ≡ 0 at z = − 1 / ( γ e ), which imp lies con tinuit y of G Y ( y ) at y = µ x − σ x γ e ; the s ame reaso ning s ho ws conti n uit y at y = µ x ( z = 0). Theorem 3.2 (Densit y of a lo cation-scale Y ). The p.d.f. of a lo c ation- sc ale L amb ert W × F R V Y e quals (for γ > 0 ) g Y ( y | β , γ ) =                0 , if y < − σ x γ e + µ x , f X ( x 0 | β ) · W ′ 0 ( γ z ) − f X ( x − 1 | β ) · W ′ − 1 ( γ z ) , if − σ x γ e + µ x ≤ y ≤ µ x , f X ( x 0 | β ) · W ′ 0 ( γ z ) , if y ≥ µ x . (3.3) A gain, γ < 0 c an b e obtaine d analo gously, and g Y ( y | β , 0) = f X ( y | β ) . Pr oof. Using that d dz W γ ( z ) = W ′ ( γ z ), the first deriv ativ e of G Y ( y ) with resp ect to y equals ( 3.3 ). The same argum ents as for G Y ( y ) show that g Y ( y ) is contin uous at y = − σ x / ( γ e ) + µ x and y = µ x .  In general, the sup p ort of Y dep en d s on τ = τ ( θ ) ∈ T if γ 6 = 0. Ho wev er, restricting τ to the subs pace S c := { τ ∈ T | − σ x / ( γ e ) + µ x = c } give s the same su pp ort [ c, ∞ ) for all τ ∈ S c ⊆ T [or ( −∞ , c ] for γ < 0]. O f particular empirical imp ortance are S 0 := { θ ∈ Θ | µ x = σ x / ( γ e ) } and S ±∞ = { θ ∈ Θ | γ = 0 } . (3.4) F or (a scale family) X ∼ F X ( x | β ) taking v alues in [0 , ∞ ) and γ ≥ 0 , the supp ort of the corresp onding (scale) Lam b ert W × F R V Y do es not dep end on τ b ut alw a ys equals [0 , ∞ ). Corollar y 3.3 (C.d.f. and p.d.f. of a scale Lam b ert W × F R V). If X is a nonne gative R V taking v alues in [0 , ∞ ) and γ ≥ 0 , then the inverse tr ansformation W γ ( y /σ x ) is unique . Henc e, the c.d.f. and p.d.f. of a sc ale L amb ert W × F R V Y e qu al G Y ( y | β , γ ) = F X  W γ , 0  y σ x  σ x    β  (3.5) and g Y ( y | β , γ ) = f X  W γ , 0  y σ x  · σ x    β  · W ′ 0  γ y σ x  . (3.6) 12 G. M. GOER G Fig. 4. The p.d.f . and c.d.f . of (a) a “nonc entr al , nonsc ale d,” (b) a “sc ale” and (c) a “lo- c ation-sc ale” L amb ert W × R V Y for differ ent de gr e es of skewness. Pr oof. F ollo w s by setting µ x = 0 in ( 3.1 ), ( 3.2 ) and ( 3.3 ), and n oting that the case u < 0 do es not exist since X ≥ 0.  F or the c.d.f. and p.d.f. of a noncen tral, nonscaled Lam b ert W × F R V Y (Definition 2.1 ) with X taking v alues in [0 , ∞ ) set σ x = 1 in ( 3.5 ) and ( 3 .6 ). Theorems 3.1 and 3.2 d emonstrate the great flexibilit y of the Lambert W setting, sin ce the closed form expr essions for G Y ( y | β , γ ) and g Y ( y | β , γ ) hold f or any w ell-defined in put F X ( x | β ) and f X ( x | β ), resp ectiv ely . Thus, researc hers can easily create Lam b ert W v ariants of their fa v orite d istri- bution F X , b y simply plugging F X and f X in ( 3.2 ) and ( 3.3 ). Figure 4 sho w s the p.d.f. an d c.d.f. of the three Lamb ert × F R Vs d iscu ssed in No- tation 2.4 for four degrees of ske wness, γ = (0 , 0 . 1 , 0 . 2 , 0 . 4). F or γ = 0 th e output Y equ als the inpu t X , thus, also their p.d.f.s/c.d.f.s coincide (solid blac k lin es). With incr easing γ , the R V Y —and thus its distrib ution and densit y— b ecome more and more ske w ed to the right (since γ > 0 ). Although Lam b ert W R Vs are defin ed by transformation ( 2.6 ), they can b e also considered as a particular v arian t of an arbitrary F X —indep end en t of this transformation. S ometimes the input/output aspect migh t b e m ore in- sigh tfu l (e.g., stoc k r etur ns), whereas ot herwise solely the generalize d distri- bution suffices to analyze a giv en d ata set. Esp ecially , if the laten t v ariable X do es not hav e any suitable in terpr etation (see BMI data in Section 7 ), one can co ncen trate on the pr obabilistic pr op erties of Y , ignoring the inpu t X . LAMBER T W RANDO M V A RIABLES 13 3.1. Quantile function. Equation ( 3.2 ) and an insp ection of Figure 3 directly relat e µ x to Y . Corollar y 3.4 (Median of Y ). F or a lo c ation-sc ale L amb ert W R V Y , P ( Y ≤ µ x ) = P ( X ≤ µ x ) for al l γ ∈ R . In p articular, µ x e q u als the me dian of Y , if X is symmetr ic. Pr oof. The transformation H γ ( u ) = u exp ( γ u ) passes through (0 , 0) for all γ ∈ R . F u rthermore, exp( γ u ) > 0 for all γ and all u ∈ R . Therefore, P ( Y ≤ µ x ) = P ( Z ≤ 0) = P ( U exp( γ U ) ≤ 0) = P ( U ≤ 0) = P ( X ≤ µ x ) . F or symmetric input P ( X ≤ µ x ) = 1 2 , therefore, µ x is the median of Y .  Corollary 3.4 n ot only giv es a m eanin gfu l in terpretation of the parame- ter µ x for symmetric input, but the samp le median of y also yields a robust estimate o f µ x . In ge neral, the α -quan tile y α of Y s atisfies α ! = P ( Y ≤ y α ) = P  U exp( γ U ) ≤ y α − µ x σ x  = P ( U exp( γ U ) ≤ z α ) . F or γ > 0 ( γ < 0 analo gously) and z α > 0 the function W γ , 0 ( · ) is bijectiv e. Th us, P ( U exp( γ U ) ≤ z α ) = P ( U ≤ W γ , 0 ( z α )) = P ( X ≤ W γ , 0 ( z α ) σ x + µ x ) and b y definition of the α -quan tile of X , x α = W γ , 0 ( z α ) σ x + µ x ⇔ z α = u α e γ u α , (3.7) where u α = x α − µ x σ x . F or z α < 0 an d γ > 0, ho w ev er, W γ ( · ) is not bijectiv e, th us , z α cannot be computed explicitly as in ( 3.7 ), but must b e obtained b y solving the imp licit equation α ! = F X ( W γ , 0 ( z α ) σ x + µ x ) − F X ( W γ , − 1 ( z α ) σ x + µ x ) . In eit her case, the α -quantile of Y equals y α = z α σ x + µ x . 4. Gaussian input. The results so far hold for an y con tinuous input R V. T o get a b etter insigh t consider Gaussian inp ut U ∼ N ( µ u , σ 2 u ) as a sp ecial case; here β = ( µ u , σ u ). Its moment generating function M U ( t ) equals E ( e tU ) = e tµ u + t 2 / 2 σ 2 u for all t ∈ R . 14 G. M. GOER G Therefore, noncentral momen ts of Z can b e computed explicitly [see ( 2.4 )] b y ψ ( n ) = n − n ∂ n ∂ γ n exp  γ nµ u + γ 2 n 2 σ 2 u 2  . In particular, µ z = ( µ u + γ σ 2 u ) e γ µ u +( γ 2 / 2) σ 2 u , σ 2 z = 2 − 2 ( e 2 γ µ u +2 γ 2 σ 2 u ((4 γ σ 2 u + 2 µ u ) 2 + 4 σ 2 u )) − ( µ u + γ σ 2 u ) 2 e 2 γ µ u + γ 2 σ 2 u = e 2 γ µ u +2 γ 2 σ 2 u ((2 γ σ 2 u + µ u ) 2 + σ 2 u ) − ( µ u + γ σ 2 u ) 2 e 2 γ µ u + γ 2 σ 2 u . As already men tioned in Section 2 , this is an u nstable system, in the sense that a sm all p er tu rbation in ( µ u , σ u ) resu lts in a completely different ( µ z , σ z ) for γ 6 = 0. In con trast, the cen tral momen ts of a lo cation-scale Lam b ert W × Gaussian R V Y with inp ut X ∼ N ( µ x , σ 2 x ) ha v e a m uc h simpler and s table form µ y = µ x + σ x E ( U e γ U ) = µ x + σ x γ e γ 2 / 2 , (4.1) since U = ( X − µ x ) /σ x ∼ N (0 , 1). Usin g ( 4.1 ), the k th cen tral momen t of Y can b e expressed by the k th cen tral momen t of U e γ U , E ( Y − µ y ) k = σ k x E ( U e γ U − γ e γ 2 / 2 ) k . In particular, σ 2 y = σ 2 x e γ 2 ((4 γ 2 + 1) e γ 2 − γ 2 ) , (4.2) whic h only dep ends on the inp u t v ariance and the sk ewness parameter γ . The main m otiv e to in tro duce Lam b ert W R Vs is to accurately mo del sk ewed data. Th e sk ewness coefficient of Y is defined as γ 1 ( Y ) := ( E ( Y − µ y ) 3 ) /σ 3 y . Analogously , the ku rtosis equals γ 2 ( Y ) := ( E ( Y − µ y ) 4 ) /σ 4 y and measures th e thic kn ess of tails of Y . Lemma 4.1. F or a lo c ation-sc ale L amb ert W × Gaussian R V with input X ∼ N ( µ x , σ 2 x ) , γ 1 ( γ ) = γ  e 3 γ 2 (9 + 27 γ 2 ) − e γ 2 (3 + 12 γ 2 ) + 5 γ 2 ( e γ 2 (1 + 4 γ 2 ) − γ 2 ) 3 / 2  (4.3) and γ 2 ( γ ) = e 6 γ 2 (3 + 96 γ 2 + 256 γ 4 ) − e 3 γ 2 (30 γ 2 + 60 γ 4 − 96 γ 6 ) − 3 γ 4 ( e γ 2 (1 + 4 γ 2 ) − γ 2 ) 2 . (4.4) LAMBER T W RAN DOM V ARIABLES 1 5 Fig. 5. Pe arson skewness (a) and kurtosis (c) c o efficient for γ ∈ [ − 1 , 1] and its first or der T aylor appr oximation (dashe d line); (b) and (d) : zo om to the interva l [ − 0 . 15 , 0 . 15] . Pr oof. Dividing the th ird and four th d eriv ative of the momen t ge ner- ating functions for a standard Gaussian U at t = γ n with r esp ect to γ by n n giv es 1 3 3 d 3 dγ 3 e 9 γ 2 / 2 = 9 γ e 9 γ 2 / 2 + 27 γ 3 e 9 γ 2 / 2 = 9 γ e 9 γ 2 / 2 (1 + 3 γ 2 ) , 1 4 4 d 4 dγ 4 e 8 γ 2 = 3 e 8 γ 2 + 96 γ 2 e 8 γ 2 + 256 γ 4 e 8 γ 2 = e 8 γ 2 (3 + 96 γ 2 + 256 γ 4 ) . The rest follo ws by expanding E ( U e γ U − E U e γ U ) 3 and E ( U e γ U − E U e γ U ) 4 via th e binomial formula and using the a b o v e expressions.  As exp ected, the sk ewness co efficient is an o dd function in γ with the same sign as γ . On the con trary , γ 2 ( γ ) is ev en. A first and second order T ayl or approximati on arou n d γ = 0 yields γ 1 ( γ ) = 6 γ + O ( γ 3 ) and γ 2 ( γ ) = 3 + 60 γ 2 + O ( γ 4 ), resp ectiv ely . Although γ can tak e any v alue in R , in practice, it r arely exceeds 0 . 15 in absolute v alue. In this in terv al th e T a y- lor app ro ximation is almost ind istinguishable f rom the true fu n ction (Fig- ure 5 ). This first order appro ximation to γ 1 ( γ ) offers a rule of th u m b b γ T aylor := b γ 1 ( y ) 6 , (4.5) whic h can b e used as a starting v alue for b etter algo rithms su c h as IGMM and MLE (see Section 5 ). Corollar y 4.2. The skewness and k u rtosis c o efficient ar e unb ounde d for γ → ±∞ , that is, lim γ →±∞ γ 1 ( γ ) = ±∞ and lim γ →±∞ γ 2 ( γ ) = ∞ . 16 G. M. GOER G Pr oof. Omitting − γ 2 in the denominator and 5 γ 2 in the numerato r of the sk ewness coefficien t can b e b ounded from b elo w 9 e 3 γ 2 + 27 γ 2 e 3 γ 2 − 3 e γ 2 − 12 γ 2 e γ 2 + 5 γ 2 ( e γ 2 + 4 γ 2 e γ 2 − γ 2 ) 3 / 2 ≥ e 3 γ 2 (9 + 27 γ 2 ) − e γ 2 (3 + 12 γ 2 ) e (3 / 2) γ 2 (1 + 4 γ 2 ) 3 / 2 = e (3 / 2) γ 2 9 + 27 γ 2 (1 + 4 γ 2 ) 3 / 2 − e − γ 2 / 2 3 + 12 γ 2 (1 + 4 γ 2 ) 3 / 2 . As the exp onen tial fu n ction d ominates r ational f unctions, the first term tends to ∞ , wh ereas th e second one go es to 0 for γ to ∞ . In case of th e kurtosis co efficien t, the term e 6 γ 2 in the numerator domi- nates all other terms for large γ and thus d etermines the asymptotic b eha v ior of γ 2 ( γ ) for γ to ±∞ .  This r esu lt shows th at th e Lambert W × Gaussian distribu tions can b e used to mo del a larger v ariet y of sk ew ed data th an a sk ew-normal distribu- tion, since its skewness co efficien t is restricted to the in terv al ( − 0 . 9 95 , 0 . 995) [Azzalini ( 1985 )]. 5. P arameter estimation. F or a sample of N indep en den t ident ically d is- tributed (i.i.d.) observ ations y = ( y 1 , . . . , y N ), wh ic h presu mably originates from transformation ( 2.6 ), θ = ( β , γ ) has to b e estimated from the data. In add ition to the commonly used maxim um likel iho o d estimator (MLE) for θ , I also presen t a metho d of moments esti mator for τ that builds on the input/output relatio n in Fig ure 1 . 5.1. Maximum likeliho o d estimation. The log-lik eliho o d fu n ction in the i.i.d. ca se equals ℓ ( θ | y ) = N X i =1 log g Y ( y i | θ ) , (5.1) where g Y ( · | θ ) is the p.d.f. of Y —see ( 3.3 ). The MLE is that θ = ( β , γ ) whic h maximizes the log-lik eliho o d b θ MLE = arg max θ ℓ ( θ | y ) . Since g Y ( y i | θ ) is a function of f X ( x i | β ), the MLE dep ends on the sp ec- ification of the inpu t densit y . In general, this m u ltiv ariate, nonlin ear opti- mization problem must b e carried out by numerical metho ds, as the t wo branc hes of W ( z ) for y ≤ µ x do not allo w an y further simp lifi cation. LAMBER T W RANDO M V A RIABLES 17 F or (scale) Lamber t W × F X with s upp ort in (0 , ∞ ) and γ ≥ 0, h o wev er, g Y ( y | β , γ ) = f X ( W γ , 0 ( y σ x ) · σ x | β ) · W ′ 0 ( γ y σ x ) (Corollary 3.3 ). Th us , ( 5.1 ) can b e rewritten as ℓ ( β , γ | y ) = ℓ ( β | x (0 ,σ x ,γ ) ) + N X i =1 log W ′ 0  γ y i σ x  , (5.2) where ℓ ( β | x (0 ,σ x ,γ ) ) = N X i =1 log f X  W γ , 0  y i σ x  · σ x    β  (5.3) is the log-lik eliho o d of the b ac k-transformed data x (0 ,σ x ,γ ) = W γ , 0 ( y σ x ) · σ x [no c ( · ) since the in verse is u nique in this case]. Note that W ′ 0 ( γ y i σ x ) only dep ends on σ x ( β ) (and γ ), bu t not necessarily on ev ery co ordinate of β . The equiv alence ( 5.2 ) sh ows th e relation b et ween the exact MLE ( b β , b γ ) based on y and the appro ximate MLE b β based on x (0 ,σ x ,γ ) : if w e would kno w σ x and γ b eforehand, then we co uld just bac k-transform y to x (0 ,σ x ,γ ) and compute b β MLE based on x (0 ,σ x ,γ ) [maximize ( 5.3 )]; ho w ever, in pr actice, σ x and γ ha ve to b e estimated from y and this u ncertain ty en ters the log- lik eliho o d ( 5.2 ) b y the additional term P n i =1 log W ′ 0 ( γ y i σ x ). F or z > 0 it can b e easily sho wn that W ′ ( z ) = W ( z ) z (1+ W ( z )) > 0 as w ell as W ′ ( z ) < 1 since W ′ (0) = 1 and W ′′ ( z ) = − W ′ ( z ) exp( − W ( z )) W ( z )+2 ( W ( z )+1) 2 < 0. Hence, P n i =1 log W ′ 0 ( γ y i σ x ) < 0 for γ > 0 and can b e though t of as a p en alt y for tr ansforming y to the “nicer” x (0 , b σ x , b γ ) with estimated p arameters: th e larger γ , the bigger the p enalt y on the log-lik eliho o d ℓ ( β | x (0 ,σ x ,γ ) ) of the “nice” bac k -trans formed data, since W ′′ ( z ) < 0. Par ameter-dep endent supp ort. F or lo cation-scale L am b ert W × F R Vs the supp ort of g Y ( y ) d ep ends on τ = τ ( θ ) and therefore violates a cru cial assumption of most results related to (asymptotic) prop erties of the MLE. Only for γ = 0 the supp ort of g Y ( y ) = f X ( y ) do es not d ep end on θ . F or X ∼ N (0 , 1) it can b e sho wn that the Fisher information matrix I N ( γ ) = − E ( d 2 dγ 2 ℓ ( θ | y )) = 8 N . Hence, for the symmetric Gaussian case √ N b γ MLE → N (0 , 1 8 ). S im ulations in Section 6 confirm this asymptotic r esult and suggest that also for the general Gaussian case b θ MLE is we ll b eha ved, that is, it is √ N -consisten t and asymptoticall y efficien t. A theoretica l analysis of the asymptotic b eha vior of the MLE for γ 6 = 0 is b ey ond the scop e of this s tudy , but simulati ons sh o w that also for parameter dep endent supp ort b θ MLE is an unbiased estimator with r o ot mean square errors co mparable to the γ = 0 case. 18 G. M. GOER G 5.2. Iter ative gener alize d metho d of moments (IGMM). A disadv anta ge of the MLE is the mandatory a-priori sp ecification of the input distribution. In practice, ho we v er, it is rarely kno wn what kind of d istribution is a go o d fit to the data, even more so if the d ata is transformed v ia a nonlinear trans- formation. Th u s , here I presen t an iterativ e metho d to esti mate the optimal in verse-transformation ( 2.8 ) by estimating τ directly , instead of estimating θ and then computing τ ( b θ ). Th is metho d builds on the in put/output asp ect and only relies up on the sp ecification of the th eoretica l ske wness of X . The prop osed estimator for τ w orks as follo w s (see b elo w for a more detailed discussion): (1) set starting v alues τ (0) = τ 0 . Set k = 0; (2) assume µ ( k ) x and σ ( k ) x are kno wn and estimate γ fr om z ( k ) = y − µ ( k ) x σ ( k ) x to obtain γ ( k +1) (Algorithm 2 ); (3) assume γ ( k +1) is kno wn and get estima tes µ ( k +1) x and σ ( k +1) x from the bac k-transformed data b x ( µ ( k ) x ,σ ( k ) x ,γ ( k +1) ) (Algorithm 3 ). S et k = k + 1; (4) iterate b et we en ( 5.2 ) and ( 5.2 ) unti l conv ergence of the sequence τ ( k ) . F or a momen t assume that µ x and σ x are kno w n and only γ has to b e estimated. Since µ x and σ x are kn o wn, w e ca n consider z = y − µ x σ x . A natural c hoice f or γ is the one that results in bac k-transf ormed data b u γ = W γ ( z ) with sample skewness equal to the theoretical ske wness of U , which equals the theoreti cal sk ewn ess of X . F ormally , b γ GMM = arg min γ k γ 1 ( X ) − b γ 1 ( b u γ ) k , (5.4) where k · k is a prop er norm in R , f or example, k s k = s 2 or k s k = | s | . Discussion of Algor ithm 2 . F or example, let y b e p ositiv ely ske w ed data, b γ 1 ( y ) > 0, and the input x causing th e observ ed y is assu med/kno wn to b e symmetric, th us, γ 1 ( X ) = 0. By the nature of transformation H γ ( u ), the Algorithm 2 Find optimal γ : function gam ma GMM( · ) in the LambertW pac k- age. Input: standardized data ve ctor z ; theoretica l sk ewness γ 1 ( X ) . Output: b γ GMM as in ( 5.4 ). 1: Compute lo wer and upp er b ound for γ : lb = − 1 exp(1) max ( z ) and ub = − 1 exp(1) min ( z ) . 2: b γ GMM = arg min γ k b γ 1 ( b u ) − γ 1 ( X ) k where b u = W γ ( z ) sub ject to γ ∈ [ lb , ub ] . 3: return b γ GMM . LAMBER T W RANDO M V A RIABLES 19 sk ewn ess parameter γ m ust b e also p ositiv e and the T a ylor appr o xima- tion of γ 1 ( γ ) for Gaussian input [see ( 4.5 )] giv es a goo d initial estimate γ 0 = b γ 1 ( y ) / 6 > 0. I n the same wa y as the mapping u 7→ u exp( γ u ) in tro- duces sk ewn ess, the in v erse transformation W γ ( z ) results in less p ositiv ely sk ewed b u γ due to th e curv ature in W γ ( · ) (see Figure 3 ). As the initial guess γ 0 rarely giv es exactly symmetric input, Algorithm 2 searches for a γ suc h that the empirical sk ewn ess of b u γ is as close as p ossible to the “true” sk ew- ness γ 1 ( X ) . There are natural b ound s f or γ to guaran tee the observ abilit y of y , for example, a γ to o large make s large n egativ e observ ations in y imp ossible (due to the m inim u m at z = − 1 /e ; see Figure 3 ). How ev er, since y has actually b een observed, the searc h space f or γ m us t b e limited to the in - terv al O z := [ − 1 exp(1) max( z ) , − 1 exp(1) min( z ) ]. If there exists a ˜ γ ∈ O z suc h that b γ 1 ( b u ˜ γ ) = γ 1 ( X ) , then Algorithm 2 will return b γ = ˜ γ due to the monotoni- cally increasing curv ature of H γ ( u ) an d W γ ( z ) resp ectiv ely; if th er e is n o suc h ˜ γ ∈ O z , then Algorithm 2 r eturns either the lo wer or upp er b oun d of O z , dep endin g on w h ether z is negativ ely or positive ly sk ewed. This univ ariate minimization problem with constrain ts can b e carried out b y standard optimiza tion algorithms. In practice, µ x and σ x are rarely k n o wn b ut also ha v e to b e estimated from the data. As y is shifted and scaled ahe ad of the b ac k-transformation W γ , 0 ( · ), the initia l c h oice of µ x and σ x affects th e optimal c hoice of γ . Therefore, the optimal triple ( b µ x , b σ x , b γ ) m ust b e obtained iterativ ely . Discussion of Algorithm 3 . Algorithm 3 first computes z ( k ) = ( y − µ ( k ) x ) / σ ( k ) x using µ ( k ) x and σ ( k ) x from th e pr evious s tep. T h is normalized outp ut ca n then b e passed to Algorithm 2 to ob tain an up dated γ ( k +1) := b γ GMM . Us- ing this new γ ( k +1) , one can bac k-transf orm z ( k ) to th e pr esumably zero- mean, unit-v ariance in p ut u ( k +1) = W γ ( k +1) ( z ( k ) ). Herewith w e can obtain a b etter appro ximation to the “true” laten t x b y x ( k +1) = u ( k +1) σ ( k ) x + µ ( k ) x . Ho we v er, γ ( k +1) —and therefore x ( k +1) —has b een obtained u sing µ ( k ) x and σ ( k ) x whic h are not necessarily th e most accurate estimates in ligh t of the up d ated appro ximation b x ( µ ( k ) x ,σ ( k ) x ,γ ( k +1) ) . Thus, Algorithm 3 computes new estimates µ ( k +1) x and σ ( k +1) x b y the sample mean a nd standard deviatio n of b x ( µ ( k ) x ,σ ( k ) x ,γ ( k +1) ) , and starts another iteration by passing the up d ated nor- malized output z ( k +1) = y − µ ( k +1) x σ ( k +1) x to Algorithm 2 to obtain a new γ ( k +2) . The algo rithm returns the optimal b τ IGMM once the estimated parameter triple d o es not c hange an y m ore from one iteration to the next, that is, if k τ ( k ) − τ ( k +1) k < tol . A great adv antag e of the IGMM estimator is that it do es not require an y fu rther sp ecification of the input except its skewness. F or example, no 20 G. M. GOER G Algorithm 3 Iterativ e generalized metho d of momen ts: f unction IGMM( · ) in the LambertW pac k age. Input: data v ector y ; tolerance lev el tol ; theoret ical sk ewness γ 1 ( X ) . Output: IGMM parameter estimate b τ IGMM = ( b µ x , b σ x , b γ ). 1: Set τ ( − 1) = (0 , 0 , 0). 2: Starting v alues: τ (0) = ( µ (0) x , σ (0) x , γ (0) ), wher e µ (0) x = ˜ y and σ (0) x = σ y are the sample median and standard d eviation of y , resp ectiv ely . γ (0) = b γ 1 ( y ) − γ 1 ( X ) 6 → see ( 4.5 ) for details. 3: k = 0 . 4: while k τ ( k ) − τ ( k − 1) k > tol do 5: z ( k ) = ( y − µ ( k ) x ) /σ ( k ) x , 6: P ass z ( k ) to Algorithm 2 − → γ ( k +1) , 7: bac k-transform z ( k ) to u ( k +1) = W γ ( k +1) ( z ( k ) ); compute x ( k +1) = u ( k +1) σ ( k ) x + µ ( k ) x , 8: up date parameters: µ ( k +1) x = x k +1 and σ ( k +1) x = b σ x k +1 , 9: τ ( k +1) = ( µ ( k +1) x , σ ( k +1) x , γ ( k +1) ), 10: k = k + 1. 11: return τ IGMM = τ ( k ) . matter if the inp u t is normally , student- t , Laplace or uniformly distrib u ted, the IGMM estimat or find s a τ that giv es symmetric b x b τ indep end en t of the particular c hoice of (symmetric) F X ( · ). A disadv ant age of IGMM from a probabilistic p oin t of view is its deter- mination. In general, Algorithm 3 will lead to bac k-transformed data w ith sample sk ewness identica l to γ 1 ( X ) and so n o sto c hastic ele men t remains in the nature of the estimator. 5 Note th at IGMM does n ot provi de an estimate of β (except for Gaussian input); if n ecessary , an estimate of β must b e obtained in a separate step, for examp le, b y estimating β from the bac k- transformed d ata b x b τ . How ever, in general, b β MLE estimated only fr om b x b τ is (sligh tly) d ifferent fr om b β MLE using Lam b ert W MLE on the original data y : in the first case b τ is assumed to b e kno wn and fixed, whereas in the second case β and τ are estimated join tly [see ( 5.2 )]. The un derlying inpu t data x = ( x 1 , . . . , x N ) ca n be appro ximated via Al- gorithm 1 u sing b τ IGMM . The so obtained b x b τ IGMM ma y th en b e used to c h ec k if X has c haracteristics of a kn o wn parametric distribution F X ( x | β ), and th u s is an easy , bu t h eu ristic chec k if y follo ws a particular Lam b ert W × F X distribution. Ho wev er, suc h a test can only serv e as a ru le of th umb f or v ar- 5 If γ 1 ( X ) dep en d s on one or more p arameters of the d istribution of X (e.g., Gamma), then the IGMM algorithm must b e adapted to this very p roblem. LAMBER T W RANDO M V A RIABLES 21 ious reasons: (i) b τ 6 = τ , th us tests are to o optimistic as b x b τ will ha ve “nicer” prop erties regarding F X than the true x w ould hav e; (ii) ignoring the non- principal branc h alte rs the sample distribu tion of the input—pu tting n o observ ations to the far left (or right): not so m uch of a problem for small γ , the distribu tion can b e trun cated co nsiderably for large γ . F or Gaussian in- put v arious tests are a v ailable [Jarque–Bera, S hapiro–Wilk, among others; see Th o de ( 2002 )], for other distribu tions a K olmogoro v–Sm irno v test can b e used. 6 5.2.1. Gaussian IGMM. F or Gaussian X th e system o f equat ions µ y ( γ ) = µ x + σ x γ e γ 2 / 2 , (5.5) σ 2 y ( γ ) = σ 2 x e γ 2 ((4 γ 2 + 1) e γ 2 − γ 2 ) (5.6) has a unique solution f or ( µ x , σ x ). Giv en b γ IGMM and the sample moments µ y and σ y , the input parameters µ x and σ 2 x can b e obtained b y b σ 2 x ( b γ IGMM ) = σ 2 y e b γ 2 IGMM ((4 b γ 2 IGMM + 1) e b γ 2 IGMM − γ 2 IGMM ) , (5.7) b µ x ( b γ IGMM ) = µ y − b σ 2 x ( b γ IGMM ) b γ IGMM e b γ 2 IGMM / 2 . (5.8) Hence, line 8 o f Algorithm 3 can b e altered to 8 b: µ ( k +1) x = b µ x ( µ y , σ y , γ k +1 ) and σ ( k +1) x = b σ x ( σ y , γ k +1 ), giv en b y ( 5.7 ) and ( 5.8 ). (5.9) Ev en though this s im p lification would lead to a faster estimatio n of τ , it is m ostly of theoretical interest, as it cannot b e guarantee d that X indeed is Gaussian; the more general Algorithm 3 should b e used in practice. 7 6. Sim u lations. Although the c.d.f., p.d.f. and moment s of a Lam b ert W R Vs are non trivial expressions, their s im ulation is straigh tforw ard (Algo- rithm 4 ). This section explores the fi nite-sample prop erties of estimators for θ = ( µ x , σ x , γ ) un d er Gaussian inp ut X ∼ N ( µ x , σ 2 x ). 8 In particular, con ven tional Gaussian MLE (esti mation of µ y and σ y only; γ ≡ 0), IGMM and Lam b ert W × Gaussian MLE, and — f or a sk ew comp etitor—the ske w-normal MLE 9 6 If the data does not represent an in d ep endent sample (as usual for financial data), then critical v alues of several test statistics need n ot b e v alid anymore and adapted tests should b e u sed [see W eiss ( 1978 )]. 7 All n umerical estimates b τ IGMM rep orted in Section 6 w ere obtained using the more general algorithm with line 8 , n ot 8 b. 8 F or the sp ecial case of Gauss ian input τ ≡ θ , thus, IGMM estimates b τ IGMM = b θ IGMM can b e compared directly to b θ MLE . 9 F unction sn.mle in t h e R p ac kag e sn . 22 G. M. GOER G Algorithm 4 R an d om s ample generation: fun ction rLamber tW( · ) in the LambertW p ac k age. Input: num b er of observ ations n ; parameter vec tor β ; sp ecification of the input d istribution F X ( x | β ); sk ewness parameter γ . Output: random sample ( y 1 , . . . , y n ) of a Lam b ert W × F R V. 1: Sim u late n samples x = ( x 1 , . . . , x n ) ∼ F X ( x | β ). 2: Compute µ x ( β ) and σ x ( β ) giv en th e typ e of Lam b ert W × F distrib ution (noncen tral, nonscale; scale ; location-scale). 3: u = ( x − µ x ( β )) /σ x ( β ). 4: z = u exp( γ u ) . 5: return y = z σ x ( β ) + µ x ( β ). are stud ied. Whereas a comparison of accuracy and efficiency in b γ do es not mak e sen s e, it is meaningfu l to analyze b µ y and b σ y of sk ew-normal ve rsus Lam b ert W × Gaussian MLE. Sc enarios. Each estimator is applied to 3 kind s of sim ulated data sets for 4 differen t sample sizes of N = 50 , 100 , 250 and 1,000: γ = 0: Data is sampled fr om a symmetric R V Y = X ∼ N (0 , 1). Do es additional estimatio n of γ affect the p r op erties of b µ y or b σ y ? γ = − 0 . 05: A typical v alue for financial data, su c h as the LA T AM return s in tro duced in Section 1 . γ = 0 . 3: T h is large v alue reve als the imp ortance of th e t w o branches of the Lam b ert W function. How do es the skew-normal MLE handle extremely sk ewed d ata [ γ (0 . 3) = 1 . 9397]? Sim ulations are based on n = 1,000 replications. The inp ut m ean µ x and standard deviation σ x are c hosen such that the observed R V has µ y ( γ ) = 0 and σ y ( γ ) = 1 for all γ . These functional relatio ns can b e obtained b y ( 5.7 ) and ( 5.8 ). F or IGMM the tolerance leve l w as set to tol = 10 − 6 and the Euclidean norm w as used. Remark 6.1. T he Gaussian and sk ew-normal MLE estimate the mea n and standard deviation of Y . Both Lam b ert W metho d s estimate the mean and standard deviation of the laten t v ariable X plus the skewness p aram- eter γ . T h u s, for a meaningful comparison the implied estimates b σ y ( b µ x , b γ ) and b µ y ( b µ x , b σ x , b γ ) giv en b y ( 5.5 ) and ( 5.6 ) are rep orted b elo w. 6.1. Symmetric data: γ = 0 . Th is parameter c hoice in vestig ates if imp os- ing the Lam b ert W framewo rk, ev en th ou gh its u se is sup erfluous , causes a qualit y loss in the estimation. F urthermore, critic al v alues can b e obtained for the finite s ample b eha vior of b γ und er the n ull hyp othesis of a symmetric distribution. LAMBER T W RANDO M V A RIABLES 23 T able 2 Bias and RMSE of b θ for γ = 0 and X ∼ N (0 , 1) Bias RMSE · √ N N γ = 0 µ y = 0 σ y = 1 γ = 0 µ y = 0 σ y = 1 Gaussian ML 50 0 . 0000 0 . 0054 − 0 . 0175 0 . 00 00 0 . 9943 0 . 7053 100 0 . 0000 0 . 0016 − 0 . 0084 0 . 0000 0 . 9812 0 . 7410 250 0 . 0000 − 0 . 0029 − 0 . 0009 0 . 0000 0 . 99 97 0 . 6917 1,000 0 . 0000 0 . 0005 − 0 . 0013 0 . 00 00 0 . 9788 0 . 7105 IGMM 50 − 0 . 0015 0 . 0 054 − 0 . 0060 0 . 4567 0 . 9945 0 . 705 9 100 − 0 . 001 2 0 . 0015 − 0 . 0030 0 . 436 8 0 . 9813 0 . 74 05 250 0 . 0001 − 0 . 0017 − 0 . 0009 0 . 4210 0 . 99 97 0 . 6919 1,000 0 . 0003 0 . 0005 − 0 . 0008 0 . 40 14 0 . 9788 0 . 7102 Lam b ert W ML 50 − 0 . 0013 0 . 0 054 − 0 . 0126 0 . 5144 0 . 9951 0 . 721 0 100 − 0 . 001 3 0 . 0016 − 0 . 0072 0 . 467 0 0 . 9813 0 . 74 07 250 0 . 0002 − 0 . 0017 − 0 . 0027 0 . 4333 0 . 99 97 0 . 6922 1,000 0 . 0003 0 . 0005 − 0 . 0012 0 . 40 39 0 . 9788 0 . 7106 Skew-normal ML 50 NA 0 . 0052 − 0 . 0135 NA 0 . 9928 0 . 7149 100 NA 0 . 0015 − 0 . 0073 NA 0 . 982 1 0 . 7409 250 NA − 0 . 0018 − 0 . 0027 NA 1 . 0004 0 . 6925 1,000 NA 0 . 0000 − 0 . 0013 NA 0 . 9788 0 . 7105 T able 2 disp la ys the bias and r o ot mean squ are error (RMSE) of b θ . Not only are all estimators unbiased, b u t they also ha ve essen tially equal RMSE for b µ y and b σ y . It is w ell kno wn that the Gaussian MLE of σ x is only asymp- totical ly unbiased, bu t for small samples it underestimates the standard deviation, wh ereas a metho d of momen ts estimator suc h as IGMM do es not ha v e that p roblem (see N = 50). F or b γ the IGMM estimato r has sligh tly smaller RMSE th an MLE for small N ; for large N the d ifference disapp ears. This can also b e explained b y an on ly asymptotically unbiased MLE for σ x , and the fun ctional relation ( 4.2 ) of γ , σ x and σ y . Ov erall, estimating γ has no effect on the qualit y of the remaining param- eter estimates, if the data comes from a truly (symmetric) Gaussian distribu - tion. A Shapir o Wi lk Gaussianit y test on the n = 1,000 estimat es of b γ IGMM and b γ MLE giv es p -v alues of 68 . 91% and 68 . 25%, resp ectiv ely ( N = 1,000), and th us confirms the asymptotic n ormalit y of b γ as stated in S ection 5.1 . 6.2. Slightly skewe d dat a: γ = − 0 . 05 . This c hoice of γ is motiv ated b y real world data—in particular, asset returns typicall y exhibit sligh tly nega- tiv e sk ewness [ γ 1 ( − 0 . 05) = − 0 . 3006 3 ]. T able 3 presents the effect of ignorin g small asymmetry in data. Gaussian MLE is by d efinition biased f or γ , bu t b µ y and b σ y are still go o d estimates. Neither IGMM nor Lam b ert W MLE giv es biased b θ , but the R MS E of b σ y in- 24 G. M. GOER G T able 3 Bias and RMSE of b θ f or γ = − 0 . 05 and X ∼ N ( µ x ( γ ) , σ 2 x ( γ )) Bias RMSE · √ N N γ = − 0 . 05 µ y = 0 σ y = 1 γ = − 0 . 05 µ y = 0 σ y = 1 Gaussian ML 50 0 . 0500 0 . 0057 − 0 . 0176 0 . 3536 0 . 99 52 0 . 7350 100 0 . 0500 0 . 0016 − 0 . 0083 0 . 5000 0 . 9826 0 . 7741 250 0 . 0500 − 0 . 0029 − 0 . 0009 0 . 7906 0 . 9981 0 . 7095 1,000 0 . 0500 0 . 0005 − 0 . 0014 1 . 5811 0 . 9781 0 . 7281 IGMM 50 − 0 . 0008 0 . 0046 − 0 . 0057 0 . 4582 0 . 9954 0 . 7410 100 − 0 . 0008 0 . 0011 − 0 . 0026 0 . 4389 0 . 982 8 0 . 7753 250 0 . 0002 − 0 . 0019 − 0 . 0007 0 . 4189 0 . 9982 0 . 7102 1,000 0 . 0003 0 . 0005 − 0 . 0009 0 . 3986 0 . 9780 0 . 7276 Lam b ert W ML 50 − 0 . 0043 0 . 0052 − 0 . 0116 0 . 5113 0 . 9961 0 . 7570 100 − 0 . 0029 0 . 0015 − 0 . 0062 0 . 4701 0 . 982 9 0 . 7802 250 − 0 . 0006 − 0 . 0017 − 0 . 0024 0 . 4282 0 . 9981 0 . 7114 1,000 0 . 0001 0 . 0005 − 0 . 0013 0 . 3992 0 . 9781 0 . 7284 Skew-normal ML 50 NA 0 . 0073 − 0 . 0136 NA 1 . 0011 0 . 7490 100 NA 0 . 0026 − 0 . 0067 NA 0 . 9834 0 . 7811 250 NA − 0 . 0014 − 0 . 0025 NA 0 . 99 90 0 . 7109 1,000 NA 0 . 0000 − 0 . 0012 NA 0 . 9796 0 . 7281 creases for all estimators and all sample sizes. Again IGMM presents sm aller RMSE for b γ than MLE for small N , but not for large N —for the same rea- son as in the γ = 0 case. Notably , the ske w-normal MLE for µ y and σ y is also unbiased and has the s ame RMSE as the Lam b ert W and Gaussian comp etitors, ev en though the true distribu tion is a Lambert W × Gaussian, not a ske w-normal. 6.3. Extr emely skewe d data: γ = 0 . 3 . In this case, the Lam b ert W MLE should work b etter than the skew-normal MLE, since the skewness coefficient γ 1 (0 . 3) = 1 . 9397 lies outside th e theoretica lly p ossible v alues of sk ew-normal distributions. F urthermore, the n onprincipal branch of the L am b ert W func- tion b ecomes more imp ortan t as p − 1 ≈ 4 . 29 · 10 − 4 , so the Lam b er t W MLE should also outp erform IGMM, w hic h ignores th e nonp rincipal solution. Only the sk ew-normal MLE fails to pro vide accurate estimates of lo cation and scale for h ea vily ske w ed d ata sets; all other estimators are p ractically unbiase d (T able 4 ). The RMSE for b σ y almost doubled compared to the symmetric case, and for Gaussian as w ell as sk ew-normal MLE it is increasing with samp le size instead of decreasing. While b γ IGMM has less bias, b γ MLE has a muc h smaller RMSE: not ignoring the nonp r incipal branch more than comp ensates the finite sample b ias in b σ x . Surp risingly , th e RMSE for b γ has diminished by ab out 35% o ve r all sample sizes compared to the symmetric case. LAMBER T W RANDO M V A RIABLES 25 T able 4 Bias and RMSE of b θ for γ = 0 . 3 and X ∼ N ( µ x ( γ ) , σ 2 x ( γ )) Bias RMSE · √ N N γ = 0 . 3 µ y = 0 σ y = 1 γ = 0 . 3 µ y = 0 σ y = 1 Gaussian ML 50 − 0 . 3000 0 . 0029 − 0 . 0336 2 . 1213 0 . 9851 1 . 2941 100 − 0 . 3000 0 . 00 06 − 0 . 0194 3 . 0000 0 . 9863 1 . 3957 250 − 0 . 3000 − 0 . 0076 − 0 . 0056 4 . 7434 1 . 0045 1 . 4499 1,000 − 0 . 3000 0 . 0002 − 0 . 0013 9 . 4868 0 . 9883 1 . 4954 IGMM 50 − 0 . 0076 0 . 0081 − 0 . 0057 0 . 4374 0 . 9917 1 . 2417 100 − 0 . 0055 0 . 00 28 − 0 . 0063 0 . 4005 0 . 9853 1 . 2440 250 − 0 . 0032 − 0 . 0012 − 0 . 0056 0 . 3647 1 . 0009 1 . 2204 1,000 − 0 . 0026 − 0 . 0003 − 0 . 0049 0 . 3197 0 . 9820 1 . 1992 Lam b ert W ML 50 0 . 0180 0 . 0221 0 . 0266 0 . 3844 1 . 0152 1 . 2385 100 0 . 0115 0 . 0131 0 . 0168 0 . 3241 1 . 0095 1 . 2218 250 0 . 0055 0 . 0053 0 . 0054 0 . 2747 1 . 0102 1 . 1535 1,000 0 . 0000 0 . 0021 − 0 . 0021 0 . 2349 0 . 9818 1 . 1383 Skew-normal ML 50 NA 0 . 0695 − 0 . 0938 NA 1 . 3638 1 . 1965 100 NA 0 . 0558 − 0 . 0834 NA 1 . 350 8 1 . 3182 250 NA 0 . 0520 − 0 . 0748 NA 1 . 486 5 1 . 5577 1,000 NA 0 . 0560 − 0 . 0704 NA 2 . 1588 2 . 4585 Discussion. Estimation of µ y is unaffecte d b y the v alue of γ ; the qualit y of b σ y , how ev er, dep ends on γ : the larger γ , the greater the RMSE of b σ y . F or γ = 0 the Lamb er t W metho ds p erform equally well as Ga ussian MLE, whereas for nonzero γ Gaussian and —to s ome extent—sk ew-normal MLE ha ve inferior qualities compared to the Lamb ert W alte rnativ es. In particu- lar, the RMSE for b σ y increases with sample size. Hence, there is no gain restricting analysis to the (symmetric) Gaussian case, as th e Lambert W framew ork extends this d istr ibution to a broader class, without losing the go o d prop erties of Gaussian MLE. F or little asym- metry in the data, b oth the Lam b ert W a nd the skew-normal approac h giv e accurate and precise estimates of location, scale and sk ewn ess. Y et f or hea vily sk ew ed data (sk ewness greater than 0 . 995 in absolute v alue), the sk ew-norm al framewo rk fails not only in theory , but also in practice to pro- vide a go o d appro ximation. T able 5 shows the a v erage n um b er of iterations the IGMM algo rithm needed to conv erge: f or increasing sample size it needs less iterations— sample moments can b e estimated more accurately; more iterations are needed for larger γ —as the starting v alue for γ is based on the T a y lor expansion around γ = 0 and mo ving a wa y from the orig in mak es the initial estimate γ (0) := b γ T aylor less pr ecise. A closer look at the t wo sub-tables (top and b ottom) sh o ws that find - ing the optimal γ (Algorithm 2 ) b ecomes muc h more d ifficult for increas- 26 G. M. GOER G T able 5 Av er age numb er of iter ations ( tol = 10 − 6 ): (top) IGMM Algorithm 3 including the iter ations in A lgorithm 2 ; (b ottom) IGMM only (not c ounting iter ations in A l gorithm 2 ). (left) Gaussian input; (right ) student- t input with ν = 4 de gr e es of fr e e dom. Base d on n = 1 , 000 r epli c ations γ N 0 − 0 . 05 0 . 3 0 , ν = 4 − 0 . 05 , ν = 4 0 . 3 , ν = 4 50 8.39 10 . 24 34 . 65 15 . 82 16 . 76 26 . 91 100 6.05 8 . 16 3 5 . 01 17 . 52 19 . 12 21 . 45 250 4.37 6 . 45 2 7 . 51 17 . 83 21 . 34 13 . 23 1,000 3.58 4 . 96 18 . 43 17 . 20 2 4 . 58 6 . 49 50 4.43 4 . 60 6 . 2 4 4 . 56 4 . 64 6 . 23 100 4.10 4 . 44 6 . 44 4 . 46 4 . 44 5 . 78 250 3.90 4 . 26 5 . 92 4 . 15 4 . 19 5 . 45 1,000 3.58 4 . 11 5 . 9 1 3 . 78 4 . 04 5 . 36 ing γ and sample s ize N than findin g the optimal µ x and σ x giv en the optimal b γ GMM (Algorithm 3 ). F or γ = 0 an d large N th er e is almost no difference b etw een the total n u m b er of iterations (top) and the num b er of iterations in Algorithm 3 only (b ottom). F or large γ , ho w ever, the total num- b er of iterations is appro ximately 5 times as large. The right panel sho ws the v alues for simulations of a Lam b ert W × t R V with ν = 4 degrees of freedom. F or sm all γ , finding b γ GMM tak es m uch longer than for Gaussian input; surprisin gly , f or large γ con verge nce is reac hed faster. This is proba- bly a result of the constrained optimization: due to more extreme v alues for a t -distribution, Algorithm 2 often return s one o f the tw o b ou n dary v alues for b γ GMM without ev en starting the optimizat ion pro cess. Giv en its go o d empirical prop erties, fairly general assu mptions ab out the input v ariable X , and its fast computation time, the IGMM algorithm can b e used as a quic k Lambert W c hec k. F or a particular Lam b ert W × F distribution, the Lam b ert W × F MLE gives more accurate results, esp ecially for h ea vily sk ewed d ata. 7. Applications. This section demonstrates the u sefulness of th e pre- sen ted methodology on r eal world data. In the first example I analyze parts of the Australian Athlet es data set 10 whic h can b e t ypically found in the literature on mo deling sk ewed data [Gen ton ( 2005 ), Azzalini and Dalla -V alle ( 1996 )]. The second example reexamines the LA T AM return s introdu ced in Sec- tion 1 . A Lam b ert W × t -distribution is f ound to giv e an appropr iate fit, 10 R pack age Lambe rtW , data set AA . LAMBER T W RANDO M V A RIABLES 27 Fig. 6. Au str al i an A thletes BMI: (left) observe d data y (dots) and b ack-tr ansforme d data b x b τ MLE (triangles); (right) histo gr am pl us density estimates. b oth for th e raw data as well as the standardized residuals of an auto- regressiv e cond itional heterosk edastic time series mo del (see Section 7.2.1 for d etails). In p articular, a comparison of risk estimators (V alue at Risk) demonstrates th e suitabilit y of th e Lam b ert W × F distribu tions to mo del financial data. 7.1. BMI of Austr alian athletes. Figure 6 sho w s the Bo dy Mass I n dex (BMI) of 100 female Australian athletes (dots) and T able 6 lists seve ral statistica l pr op erties (column 1). Although the data app ear fairly Gaus- sian, its large p ositiv e sk ewn ess mak es b oth tests reject normalit y on a 5% lev el. After 5 iterations b τ IGMM = (21 . 735 , 2 . 570 , 0 . 099), which implies b µ y = 21 . 992, b σ y = 2 . 633, and γ 1 ( b γ IGMM ) = 0 . 601, assum ing Gaussian input. T able 6 BMI ( y ) and b ack-tr ansforme d data b x b τ IGMM and b x b τ MLE : (top) summ ary statistics; (b ottom) Shapir o–Wilk (SW), Jar que–Ber a (JB) normality tests BMI y b x b τ IGMM b x τ ( b θ MLE ) Min 16 . 750 15 . 356 15 . 406 Max 31 . 930 29 . 335 29 . 384 Mean 21 . 989 21 . 735 21 . 742 Median 21 . 815 21 . 815 21 . 815 St. dev . 2 . 640 2 . 570 2 . 56 9 Skewness 0 . 683 0 . 000 0 . 01 7 Kurtosis 1 . 093 0 . 186 0 . 18 7 SW 0 . 035 0 . 958 0 . 95 9 JB 0 . 001 0 . 877 0 . 874 28 G. M. GOER G T able 7 L amb ert W × G aussian MLE for the BMI data Estimate S td. error t v alue Pr( > | t | ) µ x 21 . 742 0.274 79 . 494 0.000 σ x 2 . 556 0.188 13 . 618 0.000 γ 0 . 096 0.039 2 . 481 0.013 The BMI data set consists of exactly n = 100 i.i.d. samples and T able 2 lists fin ite sample p rop erties of b γ IGMM for this case. 11 Th us, if Y = BM I w as Gaussian, then √ 100 b γ IGMM 0 . 4368 ∼ N (0 , 1) . (7.1) Plugging b γ IGMM = 0 . 099 into ( 7.1 ) gi v es 2 . 279 and a corresp ond ing p -v a- lue of 0 . 0113. Thus, b γ IGMM is signifi can t on a 5% lev el, yielding an indeed p ositiv ely skew ed distribution for the BMI data y . As b oth tests cannot reject Gaussianit y for b x b τ IGMM , a Lamb ert W × Gaussian approac h seems r easonable. T able 7 sho w s that all estimates are highly significan t, wh ere stand ard er r ors are obtained by numerical ev al- uation of the Hessian at the optim um. As n ot one single test can reject normalit y of b x b τ MLE (triangles in Figure 6 ), an adequate mo del to capture the stat istical prop erties of the BMI data is BMI = ( U e 0 . 099 U )2 . 556 + 21 . 742 , U = X − 21 . 742 2 . 556 ∼ N (0 , 1) . F or b θ MLE the supp ort of BMI lies in the half-op en in terv al [11 . 967 , ∞ ) . As all observ ations lie within these b oun daries, b θ MLE is indeed a (lo cal) max- im u m . Figure 6 sh ows the closeness of the L amb ert W × Gaussian d ensit y to the histogram and kernel density estimate, whereas the b est Ga ussian is apparen tly an improp er appro ximation. Although a more detailed study of athlete t yp e and other health indica- tors might explain the p rev alent sk ewn ess, the Lam b ert W resu lts at least supp ort common sense: th e h uman b o d y has a n atural physiolo gical lo wer b ound 12 for the BMI, whereas outlie rs on the righ t ta il—alb eit, in principle, also ha ving an up p er b oun d—are more lik ely . 11 Although y is clearly not N (0 , 1), the location-scale in vari ance of Lambert W × Gaussian R Vs makes this difference to scenario 1 in the simulations [ Y ≡ X ∼ N (0 , 1) ] irrelev ant; finite sample prop erties of γ do not c hange b etw een X ∼ N (0 , 1) and general X ∼ N ( µ x , σ 2 x ), since in b oth cases µ x and σ x are also estimated. 12 The low er trun cation of the BMI at 11 . 967 corresp onds to a 180 cm tall athlete only w eighing 38 . 88 kg. LAMBER T W RAN DOM V ARIABLES 2 9 T able 8 L amb ert W × t MLE for the LA T AM series Estimate Std. error t v alu e Pr( > | t | ) c 0 . 197 0.037 5 . 270 0.000 s 1 . 240 0.057 21 . 854 0.000 ν 7 . 047 2.196 3 . 208 0.001 γ − 0 . 053 0.014 − 3 . 860 0.000 7.2. Asset r eturns. A lot of financial data, also the LA T AM return series in tro duced in Section 1 (T able 1 and Figure 2 ), d ispla y negativ e skewness and excess kurtosis. These so-called stylize d facts are well known and t yp - ically addressed via (generalized) auto-regressiv e conditional h eterosk edas- tic (GAR CH) [Engle ( 1982 ), Bollerslev ( 198 6 )] or stochasti c v olatilit y (SV) mo dels [M elino and T ur nbull ( 1990 ), De o, Hurvic h and Lu ( 2006 )]. A theo- retical analysis of Lam b ert W × F time series mo dels, ho wev er, is far b ey on d the scope and fo cus of this w ork. F or empirical evidence regarding the use- fulness and significance of Lam b ert W × F distr ib utions in GAR CH mo dels and p ossible fu ture researc h directions see S ection 7.2.1 . It is also w orth noting that the Lam b ert W × F transformation ( 2.2 ) resem bles SV models v ery closely , and co nnections b et ween the t wo can b e m ade in future work. Based on the news ↔ return interpretation in a sto c k marke t S , it mak es sense to assume a symm etric in put d istr ibution F X ( x ) for the laten t n ews R V X . Without sp ecifying the symm etric F X ( x ) any f urther, the IGMM algorithm giv es a robust estimate for τ : here b τ IGMM = ( − 0 . 048 , 0 . 190 , 1 . 456). Column 2 of T able 1 sho ws that the unskew ed data b x b τ IGMM —here in terp reted as news hitting th e marke t—is non-Gaussian, but a t -distribution cannot b e rejected. In consequence, Y is mo deled as a Lam b ert W × lo cation-sca le t - distribution w ith β = ( c, s, ν ), where c is the location, s the scale and ν the degrees of freedom p arameter. T able 8 sho ws that all coefficients of b θ MLE are highly significan t; in particular, b γ increased substantia lly (in absolute v alue), as γ n ow sole ly addresses asymmetry in the data, an d ν can capture excess ku rtosis. Thus, the prev alen t negativ e sk ewness in the LA T AM daily returns is not an artifact of large outliers in the left tail of an otherwise symmetric distribution, b u t a significan t c haracteristic of the data. In order to c h ec k if the Lamb ert W × t -distrib ution is in deed an ap- propriate m o del for y , it is useful to stu dy the b ac k-transformed data b x b τ ; here b τ MLE := τ ( b θ MLE ) = (0 . 197 , 1 . 465 , − 0 . 053). Not surprisin gly , the sk ew- ness of b x b τ MLE reduced to al most 0 (column 3 of T able 1 ). As a Kolmogoro v– Smirnov test cannot reject a stu den t t -distribution for b x b τ MLE , the Lam- b ert W × t -distribution Y = ( U e − 0 . 053 U )1 . 465 + 0 . 197 , 30 G. M. GOER G Fig. 7. News b x b τ MLE ↔ r eturn y sc atter plot plus histo gr ams; soli d 45 ◦ line: γ = 0 . Dashe d vertic al and horizontal li nes r epr esent the sample me an of b x b τ MLE and y , r esp e ctively. U = X − 0 . 197 1 . 465 , U r 7 . 047 7 . 047 − 2 ∼ t ν =7 . 047 is an ad equ ate unconditional probabilistic mod el for the LA T AM return s y . The effect of news x t in the marke t S is clearly sho w n in a scatter plot of b x b τ MLE v ersu s y . F or example, consider the lo we r-left p oint ( x 1346 , y 1346 ) ≈ ( − 4 . 8 , − 6 . 1) in Figure 7 . Here, the observed nega tiv e retur n equ als − 6 . 1%, but as b γ = − 0 . 053 < 0, this outcome wa s an o verreacti on to bad news that w as only “w orth” − 4 . 8%. F or lo cation-sca le Lam b ert W R Vs the sk ew- ness parameter γ is a p o werful, y et easy wa y to c haracterize different m ar- k ets/assets. Th e negativ e b γ sho w s that this sp ecific market (system) is ex- aggerati ng bad news, and dev alues p ositiv e news. V alue at risk (V aR). The V aR is a p opular measur e in financial statistics to estimate the p oten tial loss f or an in v estmen t in an asset o v er a fixed time p erio d. That is, th e maxim u m p ercent age an inv estor can exp ect to lose—with a confidence of 1 − α —o ver a fi xed time p erio d. Statistica lly this corresp onds to the α -quantile of the d istribution. T he V aR can b e obtained in v arious w a ys: the simp lest are emp irical and theoretical qu an tiles giv en the estimated parameter v ector of a parametric distribution (which are suffi cien t for comparative purp oses). LAMBER T W RANDO M V A RIABLES 31 T able 9 V aR c omp arison for the LA T AM series α Metho d 0.005 0.01 0.05 0.5 0.95 0.99 0.995 empirical − 4 . 562 − 4 . 078 − 2 . 478 0 . 138 2 . 344 3 . 192 3 . 818 Gaussian − 3 . 660 − 3 . 294 − 2 . 293 0 . 121 2 . 535 3 . 535 3 . 901 t − 4 . 297 − 3 . 634 − 2 . 214 0 . 121 2 . 455 3 . 875 4 . 538 Lam b ert W × t − 4 . 871 − 4 . 049 − 2 . 358 0 . 197 2 . 351 3 . 437 3 . 893 Skew- t − 4 . 715 − 3 . 973 − 2 . 364 0 . 201 2 . 346 3 . 465 3 . 957 As exp ected, a Gaussian distribu tion und erestimates b oth the lo w and high quan tiles, as it lacks th e capabilit y to capture excess kurtosis (see T a- ble 9 ). T he t -d istribution with b ν MLE = 6 . 22 degrees of freedom has hea vier tails, but un d erestimates lo w and o ve restimates high quant iles: clearly an indication of the prev alen t skewness in th e data. The Lambert W × t and the sk ew t -distribu tion 13 are the b est appr oximati on to the empirical quan tiles: b oth hea vy tails and negativ e sk ewn ess are ca p tured (see also the Lam b ert W × t QQ plot in Figure 2 ). There is no clear “winner” b et ween the tw o sk ewed distr ibutions: skew- t quantil es are closer to the empirical ones for small α , Lam b ert W × t q u an tiles are closer for large α . Around the median ( α = 0 . 5) b oth skew ed d istributions are far a wa y from the true v alue: the reason b eing a high concen tration of close to 0 r eturns in financial assets, so-calle d “inliers” [see Breidt and Carriquiry ( 1995 )]. 7.2.1. Nonindep endenc e of financial data. It is well kno w n that fi nancial return series y t t ypically exhibit p ositiv e auto-correlatio n in their squares y 2 t , whic h violates the in dep endence assump tion of the MLE p r esen ted in Sec- tion 5.1 . A s tand ard p arametric w ay to capture this dep endence is a GARCH mo del [Bollerslev ( 1986 ), Engle ( 1982 )], which mo dels the v ariance at time t , σ 2 t , as a fun ction of its o wn p ast. A simple, ye t very successful mo del for an uncorrelated y t is a GAR CH(1 , 1), y t = µ + ε t σ t , σ 2 t = ω + αy 2 t − 1 + β σ 2 t − 1 , where ε t is a zero-mean, unit-v ariance i.i.d. sequen ce [for tec h nical details see Nelson ( 1990 ), Engle ( 19 82 )]. T ypically , ε t ∼ N (0 , 1), but also student t - or sk ew t -distributions are us ed for more flexibilit y in the cond itional distr i- bution of ε t giv en the information set Ω t − 1 a v ailable at time t − 1 [Bau w ens and Laur en t ( 2005 )]. F renc h , Sc h w ert and Stam b au gh ( 1987 ) also foun d that 13 MLE estimates are (0 . 917 , 1 . 4 22 , − 0 . 799 , 7 . 156) for the lo cation, scale, shap e and degrees of fre edom parameter respectively; function st.m le in the sn pack age. 32 G. M. GOER G the standard ized residuals ( y t − b µ ) / b σ t —whic h ca n b e considered as a n i.i.d. sequence—still exhibit negativ e skewness after fitting a Gaussian GARCH mo del to S & P 500 returns. After fitting a studen t- t GAR CH (1 , 1) m o del 14 to th e LA T AM return series y , th e Lambert W × t MLE fit for the standardized residu als—whic h are appro ximately i.i.d. and th us d o not violate the MLE assumptions— still giv es a highly significan t b γ = − 0 . 048 w ith a p -v alue of 0 . 000113 (other estimates are not sho wn here). While I will not study Lam b ert W × student- t GAR CH m o dels in detail, this example and the great flexibilit y of Lambert W × F distribu tion co m - bined with the p ossibilit y to sym metrize sk ewe d data suggest that Lam b ert W × F GARCH (and SV) models are a promising area of fu ture researc h . This analysis confi r ms pr evious findings that negativ e sk ewn ess is an im- p ortan t feature of asset r eturns. F or example, optimal p ortfolio mo dels b ased on sk ew ed distributions lead to b etter suited decision rules to reac t to asym- metric price mo ve men ts. It also sh o ws that Lam b ert W distributions mo del the charact eristics of financial returns as well as sk ew t -distributions, with the additional option to r eco v er symmetric laten t data, which is n ot p os- sible for R Vs based on a manipu lation of the p.d.f. r ather than a v ariable transformation. 8. Relation to T uk ey’s h distribu tion. Du ring th e fin al review p ro cess, Professor An drew F. Siegel suggested p ossible connections of Lam b ert W distributions to T uk ey’s g – h distrib u tion [T uk ey ( 1977 )] Z = exp( g U ) − 1 g exp  h 2 U 2  , h ≥ 0 , (8.1) where U ∼ N ( 0 , 1). Here g is the sk ew parameter and h con trols th e tail b eha vior of Z . Although the underlying idea to in tro d u ce skewness is the same, th e sp e- cific transformations to get the skewness effects are d ifferen t, and so are the prop erties of the transformed R Vs. F or g → 0, Z = U exp  h 2 U 2  (8.2) b ecomes symmetric. The R V Z has T ukey’ s h distribution and is commonly used to mo del hea vy-tails [Fisc h er ( 2006 ), Field ( 2004 )]. Equation ( 8. 2 ) re- v eals a close link of Lambert W × F R Vs to the existing statistics literature b y noting that if Z ∼ h , then Z 2 = U 2 e hU 2 has a n oncen tral, nonscaled Lam- b ert W × χ 2 1 distribution with γ = h . 14 F unction garchFit( · ) in the fGarch pack age. LAMBER T W RANDO M V A RIABLES 33 F or further imp ortan t connections b et wee n the Lam b ert W fu nction and T ukey’s h distribu tion see Goerg ( 2011 ). 9. Discussion and outlo ok. Whereas the Lamb ert W f unction p la ys an imp ortant r ole in mathematics, p h ysics, c hemistry , biology and other fields, it has n ot y et b een used in statistics. Here I introd uce it in an input/output setting to ske w and “unskew” R Vs and d ata, resp ectiv ely . Successful application to biomedica l and fi nancial data together with the great flexibilit y with resp ect to the type of inp ut R V X of Lam b ert W × F R Vs pr omise a wide range of applications as w ell as theoretical studies for particularly c hosen input d istributions. Last bu t not least, a v ery pragmatic adv anta ge of the transformation- based Lam b ert W × F R Vs compared to other appr oac hes to asymmetry: data can b e “unsk ewed” using Lam b ert’s W fun ction. Ac kn o wledgment s . I am grateful to Professor Wilfredo P alma for giving me the opp ortunit y to wo r k at the Department of Statist ics, Pon tificia Uni- v ersid ad Cat´ olica de Chile, S an tiago, where I co mpleted imp ortan t parts of this study . F urth er m ore, I wan t to thank Professor Reinaldo Arellano-V alle, Professor Cosma Shalizi, the Editor P r ofessor Stephen Fien b erg and t wo anon ymous referees for helpful commen ts and s u ggestions on the m an uscr ip t. REFERENCES Arellano-V alle, R. B. an d Azza lini, A. (2006). On the u n ification of families of skew normal d istributions. Sc and. J. Stat. 33 561–574. MR2298065 Arnold, B. C. and Bea ver, R. J. (2000). The ske w-Cauc hy distribution. Statist. Pr ob ab. L ett. 49 285– 290. MR1794746 Azzalini , A. (198 5). A clas s of distributions whic h includes the normal ones. Sc and. J. Stat. 12 17 1–178. MR0808153 Azzalini , A. and C apit anio, A. (1999). Statistical applications of the multiv ariate skew normal d istributions. J. R. Stat . So c. Ser. B Stat . Metho dol. 61 579–60 2. MR1707862 Azzalini , A. and Capi t ani o, A. (2003). D istributions generated by p erturb ation of sym- metry with emphasis on a multiv ariate skew t distribution. J. R oy. Statist. So c. Ser. B 65 367–389. MR198 3753 Azzalini , A. and Dalla-V alle, A. (1996). The m u ltiv ariate skew -normal distribution. Biometrika 83 715– 726. MR1440039 Bauwens, L. and La urent, S. (2005 ). A new class of multiv ariate sk ew densities, with application to generalized autoregressiv e conditional heteroscedasticit y mo dels. J. Bus. Ec onom. Statist. 23 34 6–354. MR2159 684 Behboodian, J. , Jam alizadeh, A. and B alakrishnan, N. (2006). A new class of skew- Cauc hy distributions. Statist. Pr ob ab. L ett. 76 14 88–1493 . MR2245569 Bollerslev, T. (1986). Generalized autoregressiv e conditional heteroskedasti city . J. Ec onometrics 31 307–327. MR0853051 34 G. M. GOER G Breidt, J. and Carriquir y, A. L. (1995). I mp ro ved quasi-maximum lik elihoo d estima- tion for sto c hastic v olatilit y models. In M o del l ing and Pr e diction: Honoring Seymour Geisser . S pringer, New Y ork. Cont, R. (2001). Empirical p roperties of asset retu rns: Styli zed facts and statistical issues. Quant. Financ e 1 223–236. Corless, R. M. , Gonnet, G. H. , Hare, D. E. G. and Je ffrey, D. J. (1996). On the Lam b ert W function. A dv. Comput. Math. 5 329–359. MR1414285 Deo, R. , Hur vich, C. and Lu, Y. (2006). F orecasting realized vola tilit y using a long memory stochastic vol atilit y mo del: Estimation, prediction and seasonal adjustment. J. Ec onometrics 131 29–58. MR2275995 Engle, R. F. (1982). A utoregressiv e conditional heteroskedasticit y wi th estimates of the v ariance of United Kingdom inflation. Ec onometric a 50 987–1007. MR0666121 Field, C. A. (2004). Using the g h distribution to mo del extreme wind speeds. J. Statist. Plann. Infer enc e 122 15–22 . MR2057911 Fischer, M. ( 2006). Generalized T ukey-type distributions with application to fi n an- cial and teletraffic data. Av ailable at http://eco npap ers.repec.org/RePEc:zbw :faucse: 722006 . French, K. R. , Schwer t, G . W. and St a mba ugh, R. F. (1987). Exp ected stock returns and volatilit y. Journal of Fi nancial Ec onomics 19 3–29. Genton, M. G. (2005). Discussion of “The sk ew-normal.” Sc and. J. Stat i st. 32 189–198. Glen, A. G. , Leemis, L. and Drew, J. H. (199 7). A general ized un iv ariate change-of - v ariable transformation tec hnique. INFORMS J. Comput. 9 288–295. Goerg, G . M. (2011). The Lam b ert W a y to Gaussianize sk ewed, hea v y tailed d ata with the inv erse of T ukey’s h transformation as a sp ecial case. Unp ublished manuscript. Av ailable at http://arx iv.org/abs /1010.2265 . Melino, A. and Turnbull, S. M. (1990). Pricing foreign currency options with sto chastic vol atilit y . J. Ec onometrics 45 239–265 . Nelson, D. B. (1990). Stationarity and persistence in the GARCH(1 , 1) mod el. Ec ono- metric The ory 6 318–334. MR1085577 R Developme nt Core Team (2008). R: A L anguage and Envir onment for Statistic al Computing . R F oundation for Statistical Computing, V ienna, Au stria. IS BN 3-900051- 07-0. Avai lable at http://w ww.R-pro ject.org . Ro senlicht, M. (1969). On the exp licit solv ab ility of certai n transcenden tal eq u ations. Inst. Hautes ´ Etudes Sci. Publ. Math. 36 15–22. MR0258808 Thode, H. C. , Jr. ( 2002). T esting f or Normality . Statistics: T extb o oks and Mono gr aphs 164 . Dekker, New Y ork. MR1989476 Tukey, J. W. (1977). Explor atory Data Analysis . Addison-W esley , Reading. V alluri, S. R. , Jeffrey, D. J. and Corless, R. M. (200 0). Some app lications of the Lam b ert W function to physics. C anad. J. Phys. 78 823 –831. Weiss, M . S. (1978). Modifi cation of the Kolmogoro v–Smirnov statistic for use with correlated data. J. Amer. Statist. Asso c. 73 872–87 5. Y an, J. (2005 ). Asymmetry , fat-tail, and autoregressiv e conditional density in financial return data with systems of frequ ency curves. Av ailable at http://ci teseerx.ist.psu.edu/ viewdoc/summary?doi=10.1.1.76.27 41 . Dep ar tment of St atis tics Carnegie Mellon University Pittsburgh, Pennsyl v ania 1 5213 USA E-mail: gmg@stat.cm u.edu

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment