Multi-Element Long Distance Dependencies: Using SPk Languages to Explore the Characteristics of Long-Distance Dependencies

Abhijit Mahalunkar, Applied Intelligence Research Center, Technological University Dublin, Dublin, Ireland. abhijit.mahalunkar@mydit.ie
John D. Kelleher, ADAPT Research Center, Technological University Dublin, Dublin, Ireland. john.d.kelleher@dit.ie

Abstract

In order to successfully model Long Distance Dependencies (LDDs) it is necessary to understand the full range of the characteristics of the LDDs exhibited in a target dataset. In this paper, we use Strictly k-Piecewise languages to generate datasets with various properties. We then compute the characteristics of the LDDs in these datasets using mutual information and analyze the impact of factors such as (i) k, (ii) length of LDDs, (iii) vocabulary size, (iv) forbidden subsequences, and (v) dataset size. This analysis reveals that the number of interacting elements in a dependency is an important characteristic of LDDs. This leads us to the challenge of modelling multi-element long-distance dependencies. Our results suggest that attention mechanisms in neural networks may aid in modeling datasets with multi-element long-distance dependencies. However, we conclude that there is a need to develop more efficient attention mechanisms to address this issue.

1 Introduction

Long Distance Dependencies (LDDs) describe an interaction between two (or more) elements in a sequence that are separated by an arbitrary number of positions. LDDs are related to the rate of decay of statistical dependence between two points as the time interval or spatial distance between them increases. For example, in English there is a requirement for subjects and verbs to agree; compare "The dog in that house is aggressive" with "The dogs in that house are aggressive". This dependence can be computed using an information-theoretic measure, namely mutual information (Cover and Thomas, 1991; Paninski, 2003; Bouma, 2009; Lin and Tegmark, 2017).

To date, most research on LDDs has focused on the distance the dependency spans within the sequence. However, as our analysis will show, the complexity of LDDs arises not only from this distance but also from a number of other factors, including: (i) the number of unique symbols in a dataset, (ii) the size of the dataset, (iii) the number of interacting symbols within an LDD, and (iv) the distance between the interacting symbols. In this paper we use SPk languages to explore the complexity of LDDs. The motivation for using the SPk language modelling task is that the standard sequential benchmark datasets provide little to no control over the factors which directly contribute to LDD characteristics. By contrast, using SPk languages we can generate benchmark datasets with varying degrees of LDD complexity by modifying the grammar of the SPk language (Rogers et al., 2010; Fu et al., 2011; Avcu et al., 2017).

One aspect of LDDs that has been neglected in the research on LDDs is the complexity that arises from a multi-element dependency (i.e., a dependency that involves interactions between more than 2 elements). By controlling k in the SPk grammar, it is possible to generate datasets with varying degrees of multi-element dependency. These multi-element dependencies pose specific challenges to neural architectures and may require these architectures to be augmented with pointer or attention mechanisms.
We explore whether attention mechanisms can help with multi-element LDDs using two models, Transformer-XL (Dai et al., 2019) and AWD-LSTM (Merity et al., 2018). Transformer-XL employs a multi-head attention mechanism along with a recurrence mechanism, whereas AWD-LSTM is a weight-dropped LSTM which does not employ any attention/pointer mechanism.

The Transformer-XL and AWD-LSTM models are both language models. A language model accepts a sequence of symbols and predicts the next symbol in the sequence. The accuracy of a language model depends on the capacity of the model to capture the LDDs in the data on which it is evaluated. The standard evaluation metric for language models is perplexity. Perplexity measures the confusion or uncertainty of a language model as it predicts the next symbol in a sequence (for a model with a cross-entropy of H bits per symbol, the perplexity is 2^H), and the lower the perplexity of a model the better its performance.

2 Related Work: Neural Networks and Artificial Grammars

Formal Language Theory, primarily developed to study the computational basis of human language, is now being used extensively to analyze any rule-governed system (Chomsky, 1956, 1959; Fitch and Friederici, 2012). Formal languages have previously been used to train RNNs and investigate their inner workings. The Reber grammar (Reber, 1967) was used to train various first-order RNNs (Casey, 1996; Smith and Zipser, 1989). The Reber grammar was also used as a benchmarking dataset for LSTM models (Hochreiter and Schmidhuber, 1997). The regular languages studied by Tomita (Tomita, 1982) were used to train second-order RNNs to learn the grammatical structure of strings (Watrous and Kuhn, 1991; Giles et al., 1992).

Regular languages are the simplest grammars (type-3 grammars) within the Chomsky hierarchy and are driven by regular expressions. For neural network research an interesting subclass of regular languages is the Strictly k-Piecewise (SPk) languages. Strictly k-Piecewise languages are natural and can express some of the kinds of LDDs found in natural languages (Jager and Rogers, 2012; Heinz and Rogers, 2010). This presents an opportunity to use SPk grammars to generate benchmarking datasets (Avcu et al., 2017; Mahalunkar and Kelleher, 2018). In Avcu et al. (2017), LSTM networks were trained to recognize valid strings generated using SP2, SP4 and SP8 grammars. The LSTMs could recognize valid strings generated using the SP2 and SP4 grammars but struggled to recognize strings generated using the SP8 grammar, exposing a performance bottleneck of LSTM networks. It has also been observed that the performance of LSTMs on SP2 datasets degraded when the length of the LDDs in the datasets was increased, which was done by increasing the maximum length of the generated SP2 strings (Mahalunkar and Kelleher, 2018).

3 Preliminaries

3.1 Strictly k-Piecewise Languages (SPk)

SPk languages form a subclass of regular languages. Subregular languages can be identified by mechanisms much less complicated than finite-state automata. Many aspects of human language, such as local and non-local dependencies, are similar to subregular languages (Jager and Rogers, 2012). More importantly, there are certain types of long-distance (non-local) dependencies in human language which allow finite-state characterization (Heinz and Rogers, 2010). These types of LDDs can easily be characterized by SPk languages, and the approach can easily be extended to other processes.
A language L is described by a finite set of unique symbols Σ, and Σ* (the free monoid) is the set of finite sequences or strings of zero or more elements from Σ.

Example 3.1. Consider Σ = {σ1, σ2, σ3, σ4}, where σ1, σ2, σ3, σ4 are the unique symbols. The free monoid over Σ contains all concatenations of these unique symbols. Thus, Σ* = {λ, σ1, σ1σ2, σ1σ3, σ1σ4, σ3σ2, σ3σ1σ3, σ2σ1σ4σ3, ...}.

Definition 3.1. Let u denote a string, e.g. u = σ3σ2. The length of a string u is denoted by |u|; if u = σ3σ2 then |u| = 2. A string of length zero is denoted by λ.

Definition 3.2. A string v is a subsequence of a string w iff v = σ1σ2...σn and w ∈ Σ*σ1Σ*σ2Σ*...Σ*σnΣ*, where σ ∈ Σ. A subsequence of length k is called a k-subsequence. Let subseq_k(w) denote the set of subsequences of w up to length k.

Example 3.2. Consider Σ = {a, b, c, d}, w = [acbd], u = [bd], v = [acd] and x = [db]. String u is a subsequence of length k = 2, or a 2-subsequence, of w. String v is a 3-subsequence of w. However, string x is not a subsequence of w, since d and b do not occur in w in that order.

SPk languages are defined by a grammar G_SPk given as a set of permissible k-subsequences. Here, k indicates the number of elements in a dependency. Datasets simulating dependencies between 2 elements are generated using SP2; this is the simplest dependency structure. More complex chained-dependency structures require grammars with higher k.

Example 3.3. Consider L, where Σ = {a, b, c, d}. Let G_SP2 be the SPk grammar comprised of the permissible 2-subsequences G_SP2 = {aa, ac, ad, ba, bb, bc, bd, ca, cb, cc, cd, da, db, dc, dd}. The grammar G_SP2 is employed to generate SP2 datasets.

Figure 1: The automaton for G_SP2 where n_l = 6.

Definition 3.3. Subsequences which are not in the grammar G are called forbidden subsequences. (See section 5.2, "Finding the shortest forbidden subsequences", in Fu et al. (2011) for a method to compute forbidden subsequences for an SPk language.)

Example 3.4. Consider Example 3.3: although {ab} is a possible 2-subsequence, it is not part of the grammar G_SP2. Hence, {ab} is a forbidden subsequence.

Example 3.5. Consider strings u, v, w: u = [bbcbdd], v = [bbdbbbcbddaa] and w = [bbabbbcbdd], where |u| = 6, |v| = 12 and |w| = 10. Strings u and v are valid SP2 strings because they are composed of subsequences that are in G_SP2. However, w is an invalid SP2 string because w contains the forbidden subsequence {ab}. These constraints apply to strings x of any length |x|.

Example 3.6. Let G_SP3 = {aaa, aab, abb, baa, bab, bba, bbb, ...}, with forbidden subsequence {aba}, be an SP3 grammar comprised of permissible 3-subsequences. Then u = [aaaaaaab], where |u| = 8, is a valid SP3 string, and v = [aaaaabaab], where |v| = 9, is an invalid SP3 string as defined by the grammar G_SP3.

The maximum extent of the LDDs exhibited by a given SPk language is equal to the length of the strings generated according to the grammar. However, as per Definition 3.2, the strings generated using this method will also exhibit dependencies of shorter lengths. It should be noted that the length of an LDD is not the same as k. The length of the LDD is the maximum distance between two elements in a dependency, whereas k specifies the number of elements in the dependency (as defined in the SPk grammar).
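To make these definitions concrete, the following Python sketch is our own illustration (it is not part of the SPk tooling used in the paper): it builds the permissible 2-subsequences of Example 3.3, extracts k-subsequences per Definition 3.2, and checks the strings of Example 3.5.

from itertools import combinations

SIGMA = "abcd"
# G_SP2 from Example 3.3: all 2-subsequences over {a, b, c, d} except the forbidden {ab}.
FORBIDDEN = {"ab"}
G_SP2 = {x + y for x in SIGMA for y in SIGMA} - FORBIDDEN

def k_subsequences(w, k):
    # All k-subsequences of w (Definition 3.2): symbols kept in order, positions skipped freely.
    return {"".join(c) for c in combinations(w, k)}

def is_valid(w, grammar, k):
    # A string is a valid SP_k string iff every one of its k-subsequences is permissible.
    return k_subsequences(w, k) <= grammar

# Strings from Example 3.5: u and v are valid SP2 strings, w contains the forbidden {ab}.
for s in ["bbcbdd", "bbdbbbcbddaa", "bbabbbcbdd"]:
    print(s, is_valid(s, G_SP2, k=2))

Filtering candidate strings with such a check (or, more efficiently, traversing the automaton of Figure 1) is one way to obtain valid strings of a chosen length.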
Example 3.7. As per Example 3.5, consider v = [bbdbbbcbddaa] and the b in the first position: the subsequence {ba} exhibits dependencies of distance 10 and 11. Similarly, the subsequence {bd} exhibits dependencies of distance 2, 9, 10, etc.

Figure 1 depicts a finite-state diagram of G_SP2, which generates strings of synthetic data. Every string x generated using the grammar G_SP2 in this figure has |x| = 6. The forbidden subsequence for this grammar is {ab}. Since {ab} is a forbidden subsequence, the state diagram has no path (from state 0 to state 11) that would permit the generation of strings with {ab} as a subsequence, e.g. {abcccc}. Traversing the state diagram generates valid strings, e.g. {accdda, caaaaa}. Various grammars G_SPk could be used to define an SPk language, depending on the set of forbidden subsequences chosen. Thus, we can construct rich datasets with different properties for any SPk language. Forbidden subsequences allow for the elimination of certain logically possible sequences, simulating a real-world dataset in which the probability of occurrence of that particular sequence is highly unlikely. Every SPk grammar is defined with at least one forbidden subsequence.

3.2 Plotting LDD Characteristics

Mutual information has previously been used to analyse LDDs in datasets. For example, in Ebeling and Poeschel (2002), mutual information was used to analyse the maximum length of the LDDs in two English literary texts, Moby Dick by H. Melville and Grimm's tales. Another example is the work of Lin and Tegmark (2017), who analyzed the LDD characteristics of the enwik8 dataset.

Mutual information measures the dependence between random variables X and Y. These random variables have marginal distributions p(x) and p(y) and are jointly distributed as p(x, y) (Cover and Thomas, 1991; Li, 1990). Mutual information, I(X;Y), is defined as:

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}    (1)

If X and Y are not correlated, in other words if they are independent of each other, then p(x)p(y) = p(x,y) and I(X;Y) = 0. However, if X and Y are fully dependent on each other, then p(x) = p(y) = p(x,y), which results in the maximum value of I(X;Y).

Mutual information can also be expressed using the entropies of X and Y, i.e. H(X) and H(Y), and their joint entropy H(X,Y), as given in the equations below:

I(X;Y) = H(X) + H(Y) - H(X,Y)    (2)

H(X) = -\sum_{x} p(x) \log p(x)    (3)

In our algorithm, we compute H(X) using Grassberger's correction (Grassberger, 2003):

H(X) = \log N - \frac{1}{N} \sum_{i=1}^{K} N_i \, \psi(N_i)    (4)

where N_i is the frequency of unique symbol i, N = \sum_i N_i, K is the number of unique symbols, and \psi(N_i) is the logarithmic derivative of the gamma function (the digamma function) evaluated at N_i.

In order to measure the dependence between any two symbols at a distance D in a sequence, we construct random variables X and Y so that X holds the subsequence of the original sequence from index 0 to |dataset| - 1 - D, and Y holds the subsequence from index D to |dataset| - 1, where D is the spacing between the symbols and |dataset| (or LEN) is the size of the dataset.
Next we define a random variable XY that contains the sequence of paired symbols, one from X and one from Y, where the symbols in a pair share the same index in X and Y. Algorithm 1 gives the details. Using this information, and Equations 2 and 4, we calculate the mutual information I(X;Y) at a distance D in a sequence. We define the LDD characteristics of any given sequential dataset as the function mapping the distance D to the mutual information I(X;Y). Once we have calculated the mutual information within a dataset at the different distances D, which range from 1 to |dataset|, we can plot the LDD characteristics as a graph of distance D versus mutual information at D. The LDD characteristics are plotted on log-log axes: the x-axis gives the distance between a pair of symbols and the y-axis the mutual information.

Algorithm 1: Computing LDD Characteristics
for D ← 1 to |dataset| do
    X ← dataset[0 : |dataset| − D]
    Y ← dataset[D : |dataset|]
    XY ← zero matrix of size (K_X, K_Y)
    for i ← 0 to |X| do
        increment XY[X[i], Y[i]]
    end for
    compute N_{X_i}, N_X, K_X for X
    compute N_{Y_i}, N_Y, K_Y for Y
    compute N_{XY_i}, N_XY, K_XY for XY
    compute H(X), H(Y) and H(X,Y) using Eq. 4
    I[D] ← H(X) + H(Y) − H(X,Y)
end for
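A minimal Python sketch of Algorithm 1 is given below; it is our own illustration (the released scripts may differ) and uses scipy's digamma for ψ in the Grassberger-corrected entropy of Equation 4.

import numpy as np
from collections import Counter
from scipy.special import digamma

def entropy_grassberger(counts):
    # Grassberger-corrected entropy estimate (Eq. 4), computed in nats.
    counts = np.asarray(list(counts), dtype=float)
    n = counts.sum()
    return np.log(n) - np.dot(counts, digamma(counts)) / n

def ldd_characteristics(seq, max_d):
    # Mutual information I(X;Y) (Eq. 2) between symbols separated by distance D = 1..max_d.
    mi = {}
    for d in range(1, max_d + 1):
        x, y = seq[:-d], seq[d:]               # X and Y as defined in Section 3.2
        h_x = entropy_grassberger(Counter(x).values())
        h_y = entropy_grassberger(Counter(y).values())
        h_xy = entropy_grassberger(Counter(zip(x, y)).values())
        mi[d] = h_x + h_y - h_xy
    return mi

# Toy usage: a random sequence over a 4-symbol vocabulary; real runs would use SP_k datasets.
seq = "".join(np.random.choice(list("abcd"), size=100000))
mi = ldd_characteristics(seq, max_d=200)       # plot on log-log axes to view the LDD profile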
4 LDD Characteristics of SPk Datasets

Natural datasets present little to no control over the factors that affect LDDs. This limits our ability to understand LDDs in more detail. SPk languages exhibit some of the types of LDDs occurring in natural datasets. Moreover, by modifying the SPk grammar we can control the LDD characteristics of a dataset generated by the grammar. To understand and validate the interaction between an SPk grammar and the characteristics of the data it generates, we generated a number of SPk datasets and analyzed their properties. Every dataset is a collection of strings that strictly follow the grammar; hence the size of the dataset (|dataset|) is the sum of the sizes of all the strings. The datasets were generated using foma (Hulden, 2009) and python (Avcu et al., 2017; Mahalunkar and Kelleher, 2018). (The scripts and details of these datasets are available at https://github.com/silentknight/DelFol-ACL-2019.) Below we analyze the impact of various factors on the resulting LDD characteristics.

4.1 Impact of k

A dependency may arise within a dataset due to two or more interacting elements. A two-element dependency is the simplest dependency structure and is the one usually analyzed by models addressing LDDs. However, multi-element dependencies may not be rare and, if not modeled correctly, may well contribute to the errors of a model. Lack of knowledge of the dependency structures within benchmark datasets presents a significant limitation when comparing the performance of different models aimed at addressing LDDs. Different model architectures may be able to represent one type of LDD characteristic to a greater extent than another (e.g., distance versus k). However, unless the experimenter is able to control the LDD characteristics present in a dataset, it is not possible to disentangle which characteristics a given model struggles with based solely on the model's performance on the data. The SPk grammar addresses this problem by providing control over both dependency distance and k.

We use SP2, SP4 and SP16 grammars to generate a set of datasets. With k = {2, 4, 16}, we generate datasets with different dependency structures involving 2, 4 and 16 interacting elements respectively. Figure 2 plots the LDD characteristics of the SP2, SP4 and SP16 datasets. They contain a uniform distribution of strings of length l where 60 ≤ l ≤ 100. The maximum length of strings in each of these datasets is 100; hence, we observe a steeper mutual information decay beyond D > 100. k defines the number of correlated or dependent elements in a dependency rule. As k increases, the grammar becomes more complex and there is an overall reduction in the frequency of the dependent elements in a given sequence (due to the lower probability of these elements occurring in a given sequence); hence, the mutual information is lower. This can be seen in the SP16 dataset as compared to SP2 and SP4. It is worth noting that datasets with lower mutual information curves tend to be more difficult to model (Mahalunkar and Kelleher, 2018).

Figure 2: LDD characteristics of datasets of SP2, SP4 and SP16 grammars exhibiting LDDs of length 100.

4.2 Impact of LDD length

The distance between two interacting elements presents a significant challenge in modeling LDDs, because the model is required to store the context of the first interacting element persistently. The success of a model depends on whether it is capable of storing contexts of the length dictated by the dataset. We generated strings of maximum length 20 (2 ≤ l ≤ 20), 100 (21 ≤ l ≤ 100), 200 (101 ≤ l ≤ 200) and 500 (201 ≤ l ≤ 500) using the SP2 grammar. As explained in Example 3.6, by increasing the length of the generated strings, the distance between dependent elements is also increased, resulting in longer LDDs. Consequently, using these string lengths we can simulate LDD lengths of 20, 100, 200 and 500. Figure 3 plots the LDD characteristics of SP2 languages with maximum string lengths of 20, 100, 200 and 500. The point where the mutual information decay becomes faster, the inflection point, lies at roughly the same point on the x-axis as the maximum length of the LDD. This confirms that SPk grammars can generate datasets with varying lengths of LDDs.

Figure 3: LDD characteristics of datasets of SP2 grammar exhibiting LDDs of length 20, 100, 200 and 500.

4.3 Impact of Vocabulary Size

To analyze the impact of vocabulary size on LDD characteristics, we generate SP2 grammars over Σ1 = {a, b, c, d} (V = 4) and Σ2 = {a, b, c, d, ..., x, y, z} (V = 26), where V is the vocabulary size. The impact of vocabulary size can be seen in Figure 4. Both datasets contain strings of maximum length 20; hence the mutual information decays at 20. Both curves have identical decay, indicating a similar grammar. However, the overall mutual information of the dataset with V = 26 is much lower than that of the dataset with V = 4. This is because a smaller vocabulary results in an increase in the probability of occurrence of each individual element.

Figure 4: LDD characteristics of datasets of SP2 grammar with vocabularies of 4 and 26.

4.4 Impact of forbidden subsequences

Forbidden subsequences control the complexity of a given grammar. We choose two sets of forbidden subsequences for the SP2 grammar, {ab, bc} and {ab, bc, cd, dc}. Figure 5 plots the LDD characteristics of the SP2 grammar with these two sets of forbidden subsequences.

Figure 5: LDD characteristics of datasets of SP2 grammar with varying forbidden subsequences.
It can be seen that the dataset with more forbidden subsequences exhibits a mutual information decay tending towards a power-law decay, as compared to the exponential decay of the dataset with fewer forbidden subsequences. As explained in Lin and Tegmark (2017), datasets with exponential decay tend to exhibit Markovian behavior and are thus easier to model than datasets with power-law decay. Complex LDDs in a dataset result in power-law decay. Thus, by controlling the forbidden subsequences, one can introduce more complex LDDs.

4.5 Impact of dataset size

Another factor to analyze is the impact of the size of the dataset (|dataset|) on the LDDs of a given grammar. We generate two datasets of different sizes from the same SP2 grammar, where one dataset is twice the size of the other. In Figure 6 we observe that the LDD characteristics of datasets sampled from the same grammar are largely unaffected by the size of the dataset.

Figure 6: LDD characteristics of datasets of SP2 grammar with varying dataset sizes.

5 Multi-Element Long Distance Dependencies: Attention Mechanisms and k

As discussed above in Section 4.1, LDDs may arise due to multiple interacting elements; we refer to such a dependency as a multi-element long-distance dependency (ME-LDD). Current LDD research primarily focuses on developing models which are capable of retaining contextual information across long distances in sequential datasets. This approach, which focuses solely on dependency distance, may be insufficient for addressing the problems arising from ME-LDDs. However, recent advances in attention mechanisms and memory networks may be able to represent ME-LDDs. In this section we investigate two models, Transformer-XL (Dai et al., 2019), with an attention mechanism, and AWD-LSTM (ASGD Weight-Dropped LSTM) (Merity et al., 2018), without an attention mechanism, in order to analyze how attention mechanisms help in modeling datasets with ME-LDDs.

The Transformer-XL model augments the vanilla Transformer by introducing a recurrence mechanism to the Transformer architecture. This recurrence effectively encodes an arbitrarily long context into a fixed-size representation under constrained memory and computation. A vanilla Transformer is made up of multi-head attention and feed-forward layers, with positional information added to the embedded representation. The other model we tested, the AWD-LSTM, uses a weight-drop mechanism to aid regularization of the LSTM network; hence this model does not explicitly use an attention mechanism.

We trained both models on variants of the SPk grammar with k = {2, 4, 8, 16} for vocabulary size V = 4 and k = {2, 4, 6, 8} for vocabulary size V = 26; hence there are 8 datasets in total. Every dataset contains a uniform distribution of strings of length l where 60 ≤ l ≤ 100.
The string length ordering is not maintained, so as not to bias the models. In order to train the language models, every dataset is split into training/validation/test sets. All the training sets for vocabulary size V = 4 contain ≈195,000 strings and are ≈24 MB in size. Similarly, all the training sets for vocabulary size V = 26 contain ≈222,000 strings and are ≈40 MB in size. All the test and validation sets contain ≈16,000 strings and are ≈2 MB in size. (The datasets and scripts used for training the models can be found at https://github.com/silentknight/DelFol-ACL-2019.) The hyperparameters for both models were reused from the Penn Treebank character models (Dai et al., 2019; Merity et al., 2018).

Table 1 lists the test perplexity scores of both models in bits per character for the datasets with vocabulary size V = 4, and Table 2 lists the test perplexity scores of both models for the datasets with vocabulary size V = 26.

Table 1: Test perplexity (bpc) of Transformer-XL and AWD-LSTM on SP2, SP4, SP8 and SP16 datasets with vocabulary size V = 4.

Model            SP2      SP4      SP8      SP16
Transformer-XL   1.6855   1.8038   1.9611   2.0759
AWD-LSTM         1.413    1.486    1.658    1.708

Table 2: Test perplexity (bpc) of Transformer-XL and AWD-LSTM on SP2, SP4, SP6 and SP8 datasets with vocabulary size V = 26.

Model            SP2      SP4      SP6      SP8
Transformer-XL   4.6846   4.7320   4.7384   4.7385
AWD-LSTM         4.525    4.635    4.707    4.708

We plot the test perplexity scores on all the datasets as a function of k in Figure 7. Here we observe two distinct sets of perplexity growth curves. It can be seen that, as the size of the vocabulary changes, the test perplexity scores across all the models also change accordingly. This confirms that the vocabulary size of the dataset does impact the perplexity score of the models; we attribute this to the lower mutual information in the LDD characteristics, as explained in Section 4.3.

Figure 7: Test perplexity scores of Transformer-XL and AWD-LSTM models on SP2, SP4, SP8 and SP16 datasets.

Switching our focus to the growth in perplexity as k increases, we examine how the presence of an attention mechanism affects the ability of the neural architectures to model ME-LDDs as k increases. For the datasets with V = 26, the impact of the attention mechanism in Transformer-XL is apparent: its test perplexity score remains almost unchanged as k increases, which we attribute to the presence of an attention mechanism. The AWD-LSTM model, by contrast, struggles to maintain its test perplexity score as k increases on the V = 26 (higher vocabulary size) datasets. For the datasets with V = 4, the test perplexity of the Transformer-XL model grows with k but appears to saturate at higher k, which again suggests that the attention mechanism of the Transformer helps the model as k increases in the SPk grammar. For the AWD-LSTM model, the test perplexity also increases with k, at a rate similar to that of Transformer-XL. This could be attributed to the smaller vocabulary size, which leads to a less complex dependency structure even at higher values of k; such datasets can be modeled reasonably well without attention mechanisms, as seen in Figure 7.

Comparing the test perplexity scores in Figure 7 for both models, it can be observed that the overall test perplexity of the Transformer-XL model across all the datasets is higher than that of the AWD-LSTM model. This could be attributed to over-parameterization of the Transformer-XL model: Transformer-XL uses ≈44 million parameters, compared to the ≈18 million used by the AWD-LSTM model, to model these subregular languages.
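For reference, the bits-per-character scores reported in Tables 1 and 2 relate to the per-symbol probabilities assigned by a model as in the following sketch; the function and example values are illustrative only and are not taken from the released training scripts.

import math

def bits_per_character(probs):
    # Average negative log2-probability assigned by the model to each observed symbol.
    return -sum(math.log2(p) for p in probs) / len(probs)

# Toy example with hypothetical per-symbol probabilities from a character-level model.
probs = [0.5, 0.25, 0.5, 0.125]
bpc = bits_per_character(probs)   # 1.75 bits per character
perplexity = 2 ** bpc             # the corresponding per-character perplexity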
It should also be noted that the height of the LDD characteristic curves of all these datasets is significantly lower than for natural datasets (Mahalunkar and Kelleher, 2018). Consequently the mutual information is very small at any given distance D, which leads to higher perplexity in language models modeling them. We also observed no apparent impact on the test perplexity scores of any of the models when trained on training sets with twice the number of strings. This is consistent with the LDD characteristics observed in Section 4.5. Limitations of the foma tool in generating datasets with various properties prevented us from exploring more complex datasets; e.g., we were unable to generate an SP16 dataset with vocabulary size V = 26 due to a stack full error.

6 Discussion

The LDD characteristics of a dataset are indicative of the presence of a certain type of grammar in the dataset. Our experiments reveal that even though a specific grammar does induce similar LDD characteristics, there are subtle variations. These variations depend on a number of factors, such as the size of the vocabulary, the length of contextual relations, and the dependency structure (e.g. k and the forbidden subsequences). This analysis improves our understanding of the complex nature of LDDs, and it can be extended to natural datasets in an effort to better understand them. Thus, when a sequential model such as a recurrent neural architecture is intended to model a dataset, knowing these factors would greatly benefit the selection of the best hyper-parameters of the sequential model. By training Transformer-XL and AWD-LSTM models on datasets possessing various properties, it was possible to observe the impact of those properties on the perplexity score. Also, the interaction between the multi-head attention mechanism and the vocabulary size is quite evident. Our results suggest that the Transformer-XL performs much better as the vocabulary size increases. This is also evidenced by its state-of-the-art results on the WikiText-103 dataset (V = 267,735) (Dai et al., 2019).

7 Conclusion

The majority of neural network research on sequential models focuses solely on modelling dependencies across long distances. However, the dependencies that occur in sequential data can also be multi-element. Furthermore, the vocabulary size and the forbidden subsequences within a grammar also contribute to the difficulty of modelling the dependencies within a dataset.

In natural datasets all of these factors interact and can confound the analysis of model performance. However, using SPk languages it is possible to synthesize sequential datasets and control the type of dependencies exhibited in these datasets. Using a mutual-information-based analysis of SPk-synthesized datasets, we examined how the different language characteristics (vocabulary size, forbidden subsequences) and dependency characteristics (length, k) are reflected in the datasets generated by SPk grammars. Furthermore, our results suggest that attention mechanisms in neural networks may aid in modeling datasets with multi-element long-distance dependencies, although we encourage the development of more efficient models.
The potential impacts of this work for neural network research include: an appreciation of the multifaceted nature of LDDs; a procedure for measuring the LDD characteristics within a dataset; and an understanding of how different hyper-parameter settings of an SPk-based dataset synthesis process (string length, k, forbidden subsequences, vocabulary size) affect the mutual information profile of the resulting dataset.

Acknowledgments

This research was partly supported by the ADAPT Research Centre, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund. The research was also supported by an IBM Shared University Research Award. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU, under an NVIDIA GPU Grant, used for this research.

References

Enes Avcu, Chihiro Shibata, and Jeffrey Heinz. 2017. Subregular complexity and deep learning. In Proceedings of the Conference on Logic and Machine Learning in Natural Language (LaML).

Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction.

M. Casey. 1996. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6):1135-1178.

N. Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113-124.

Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2(2):137-167.

Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. Wiley-Interscience, New York, NY, USA.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Language modeling with longer-term dependency.

Werner Ebeling and Thorsten Poeschel. 2002. Entropy and long-range correlations in literary English. 26(2):241-246.

W. Tecumseh Fitch and Angela D. Friederici. 2012. Artificial grammar learning meets formal language theory: an overview. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1598):1933-1955.

Jie Fu, Jeffrey Heinz, and Herbert G. Tanner. 2011. An algebraic characterization of strictly piecewise languages. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6648 LNCS:252-263.

C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and Y. C. Lee. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405.

P. Grassberger. 2003. Entropy estimates from insufficient samplings. ArXiv Physics e-prints.

Jeffrey Heinz and James Rogers. 2010. Estimating strictly piecewise distributions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 886-896, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Mans Hulden. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29-32. Association for Computational Linguistics.

G. Jager and J. Rogers. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1598):1956-1970.

Wentian Li. 1990. Mutual information functions versus correlation functions. Journal of Statistical Physics, 60(5):823-837.

Henry W. Lin and Max Tegmark. 2017. Critical behavior in physics and probabilistic formal languages. Entropy, 19(7).

Abhijit Mahalunkar and John D. Kelleher. 2018. Understanding recurrent neural architectures by analyzing and synthesizing long distance dependencies in benchmark sequential datasets. arXiv e-prints.

Abhijit Mahalunkar and John D. Kelleher. 2018. Using regular languages to explore the representational capacity of recurrent neural architectures. In Artificial Neural Networks and Machine Learning - ICANN 2018, pages 189-198, Cham. Springer International Publishing.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.

Liam Paninski. 2003. Estimation of entropy and mutual information. Neural Computation, 15(6):1191-1253.

Arthur S. Reber. 1967. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6):855-863.

James Rogers, Jeffrey Heinz, Gil Bailey, Matt Edlefsen, Molly Visscher, David Wellcome, and Sean Wibel. 2010. On languages piecewise testable in the strict sense. In The Mathematics of Language, pages 255-265, Berlin, Heidelberg. Springer Berlin Heidelberg.

A. W. Smith and D. Zipser. 1989. Encoding sequential structure: experience with the real-time recurrent learning algorithm. In International 1989 Joint Conference on Neural Networks, pages 645-648, vol. 1.

Masaru Tomita. 1982. Learning of construction of finite automata from examples using hill-climbing: RR: Regular set Recognizer. Proceedings of the Fourth International Cognitive Science Conference, pages 105-108.

Raymond L. Watrous and Gary M. Kuhn. 1991. Induction of finite-state automata using second-order recurrent networks. In NIPS.