Seq2Seq RNN based Gait Anomaly Detection from Smartphone Acquired Multimodal Motion Data



Riccardo Bonetto∗, Mattia Soldan§, Alberto Lanaro‡, Simone Milani†, Michele Rossi†

∗Institute of Communication Technology, Technische Universität Dresden, 01062, Dresden, Germany, riccardo.bonetto@gmail.com
§IVUL Lab, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia, mattia.soldan@kaust.edu.sa
†Department of Information Engineering, University of Padova, 35131, Padova, Italy, mattia.soldan.ms@gmail.com, {simone.milani, michele.rossi}@dei.unipd.it
‡M31, 35131, Padova, Italy, alb.lanaro@gmail.com

Abstract—Smartphones and wearable devices are fast growing technologies that, in conjunction with advances in wireless sensor hardware, are enabling ubiquitous sensing applications. Wearables are suitable for indoor and outdoor scenarios, can be placed on many parts of the human body and can integrate a large number of sensors capable of gathering physiological and behavioral biometric information. Here, we are concerned with gait analysis systems that extract meaningful information from a user's movements to identify anomalies and changes in their walking style. The solution that is put forward is subject-specific, as the designed feature extraction and classification tools are trained on the subject under observation. A smartphone mounted on an ad-hoc chest support is utilized to gather inertial data and video signals from its built-in sensors and rear-facing camera. The collected video and inertial data are preprocessed, combined and then classified by means of a Recurrent Neural Network (RNN) based Sequence-to-Sequence (Seq2Seq) model, which is used as a feature extractor, followed by a Convolutional Neural Network (CNN) classifier.
This architecture provides excellent results, being able to correctly assess anomalies in 100% of the cases for the considered tests, surpassing the performance of support vector machine classifiers.

I. INTRODUCTION

The automated evaluation of human motion has proven to be a key functionality in different fields, such as ambient-assisted living, remote health monitoring and rehabilitation [1], biometric identification [2], and well-being and fitness applications [3]–[5]. A common trait of these applications is that the effectiveness of Human Activity Recognition (HAR) strategies greatly depends on the continuous and seamless diarization of people's motion and daily activities in different conditions. This can be obtained through automatic and non-invasive measuring tools [6], which are capable of monitoring and recording the motion activity of people with a minimum level of discomfort. As a target application, in the present work we consider wearable sensor technology to acquire movement data from heterogeneous sources. Such acquisition setups involve the gathering of a large amount of motion signals, which are to be processed, refined and analyzed to infer high level information about the condition, the motor status or the identity of a person. Towards this, it is necessary to design flexible and accurate classification tools that are able to include new subjects in the analysis (open set) and characterize their motion accurately. Motion analysis is a vivid research field, and people's gait is usually investigated using 3D cameras, which allow for an accurate tracking of the trajectory of the skeleton and the joints. In this work, we depart from the previous literature, as our objective is to use low-cost and non-specialized sensing hardware, i.e., a smartphone.
The amount of information that we gather is much more limited than with 3D video systems; nevertheless, we would like to infer some useful information about the motion behavior of the monitored subject. Our technology may eventually be useful for the quick assessment of whether a certain motion disorder is emerging and/or for personalized sport applications (e.g., running). Automatic gait analysis approaches can be divided into three main classes depending on the characteristics of the sensing devices [7]. A first class concerns vision-based systems, where subjects are monitored and analyzed by a set of fixed cameras in a controlled environment [8], [9]. Although these solutions are extremely accurate in characterizing the motion of human subjects, they lack flexibility, since they are heavily affected by changes in illumination and occlusions, and they also need highly specialized and expensive equipment, which requires adequate space for its deployment and expert personnel for its operation. A second class of techniques involves environment interactive sensors [10], [11], i.e., sensing devices that depend on the specific equipment used by the analyzed subject, e.g., sport gear, infra-red reflective tags, pressure sensing mats for plantar pressure analysis, etc. Their use is often appropriate for indoor spaces and for a restricted number of activities [12]. For example, in [13] the gait of a monitored subject is captured using infrared (IR) cameras in conjunction with reflective tags worn on several parts of the body (usually at the joints), detecting the following anomalous patterns: hemiplegia, Parkinson (shuffling walk), leg and back pain. A third class of solutions entails the use of wearable motion sensors [14], which can be employed anywhere, anytime, and with limited discomfort for the wearer.
In this case, accelerometers, gyroscopes, magnetometers, GPS and other kinds of devices are placed on different parts of the body [15]; during the walking activity, signals are measured and stored for their subsequent classification [16]. The study in [17] uses a Shimmer 2 tri-axial acceleration sensing device worn at the subject's lower back to detect mild and severe knee conditions. Walking patterns are extracted, segmented into subsequent walk cycles, and assessed using a binary k-Nearest Neighbors (k-NN) classifier. Despite their flexibility, the accuracy of these solutions is limited by high noise levels and by the fact that the reference system often depends on the orientation of the sensing device, which is not fixed and may have to be (re)estimated at measurement time. A large body of work has recently been published on multimodal data processing for gait analysis [18], [19], especially investigating the supervised classification of walking activity. Most of the solutions transform the input data through Principal Component Analysis (PCA) [16] and noise-removal strategies [20], moving it into a space representation that is suitable for classification. Then, data is classified using machine learning strategies like k-d trees [16], Support Vector Machines (SVMs) [21], and Artificial Neural Networks (ANNs) [22]. Recently, Recurrent Neural Networks (RNNs) and RNN-based sequence to sequence models have become popular for applications such as speech to text [23], [24], text to speech [25], sentiment analysis [26], [27], and neural machine translation [28], [29]. Their increasing adoption is mostly due to the architectural flexibility of neural network models in conjunction with the ability of RNNs to encode into state vectors the temporal information underpinning multi-dimensional timeseries.
To the best of the authors' knowledge, the aforementioned advances in RNN based Seq2Seq learning have not yet been applied to the classification of gait signals. In this paper, we aim at filling this gap. The proposed classification strategy identifies anomalous gaits that may occur during a normal walk. The detected anomalies can then be the subject of further analysis, i.e., further split into a number of anomaly classes by means of standard techniques [30]–[32].

Contributions of the paper. We propose a subject-specific gait anomaly detection framework that combines recent advances in recurrent neural network based sequence to sequence (Seq2Seq) [33], [34] models with the largely proven classification abilities of Convolutional Neural Networks (CNNs). We adopt a multi-modal gait analysis approach, which integrates inertial and visual data. Motion is extracted using accelerometric and gyroscopic measurements from a smartphone device that moves integrally with the analyzed subject. These data are merged with the optical flow information obtained from an ego vision system corresponding to the smartphone built-in camera. Then, the gathered data traces are automatically segmented into gait cycles, which undergo filtering, detrending, and normalization. The refined multi-variate timeseries are then embedded into a feature space obtained through a sequence-to-sequence architecture based on a recurrent neural network (herein referred to as RNN-Seq2Seq). These embeddings are then fed to a CNN binary classifier. The RNN-Seq2Seq model is trained to echo the input sequence, like an autoencoder. Once a sequence has been fully processed, the final state of the RNN encoder is extracted and reshaped into a multidimensional matrix. This, in turn, is fed to a CNN based classifier that has the task of separating "normal" gaits from "anomalous" ones.
In this work, anomalies are defined with respect to the normal walking style of the subject under analysis. For this reason, preexisting and known conditions affecting the walking pattern of the analyzed subject (for example, old knee or hip injuries) are not treated as anomalies that have to be detected. The RNN-Seq2Seq is only trained on "normal" (i.e., usual) gaits, with the purpose of making the encoder unable to correctly embed "anomalous" gaits in its feature space. Hence, the differences between the features of "normal" and "anomalous" gaits are expected to be amplified, thus facilitating the classification task. The output of the trained recurrent model is subsequently fed into the CNN classifier, which is trained on a different dataset containing pre-labeled "normal" and "anomalous" gaits (with equal cardinality). Once the training procedure for the RNN and the CNN is concluded, the full system is tested for classification accuracy on a validation dataset. As a means of comparison, we also implemented a non-linear Support Vector Machine (SVM) model, which has been trained on the same training set used for the CNN classifier and has been fed with complete gait cycles. Experimental results show that the proposed RNN-Seq2Seq pre-encoding and the subsequent CNN-based classification outperform the baseline accuracy obtained by the SVM, achieving an accuracy of 100% in the detection of anomalous gaits on our own collected dataset. The source code of our framework for gait analysis, along with preprocessed data and the signal acquisition system for Android smartphones, is publicly released and available at: https://github.com/Soldelli/gait_anomaly_detection. The rest of this paper is organized as follows. The signal acquisition strategy and the data pre-processing operations are presented in Section II-A. The proposed RNN-Seq2Seq model is discussed in Section II-B. The proposed CNN based classifier is detailed in Section II-C.
The dataset creation, its characteristics, along with implementation details and parameters of all the considered approaches, are presented in Section II-D. The experimental results are shown in Section III and some final considerations are drawn in Section IV.

II. MATERIALS AND METHODS

Here, we introduce our anomaly detection approach and define the experimental setup used for its validation.

A. Data Processing

As with previous work on automatic gait analysis, the proposed system requires the acquisition of accurate motion data for the subject that is to be monitored. To gather this information, we adopted a multi-modal approach that combines video and inertial signals. As a relatively cheap and convenient way to acquire these data, we opted for a smartphone application, since camera and inertial sensors are already integrated into the device. Hence, we developed a chest support for smartphones (shown in Fig. 1(a)) and a motion sensing application that records accelerometer, gyroscope, and magnetometer measurements, while also recording a video sequence from the rear-facing camera. A block diagram of the smartphone data acquisition and processing system is shown in Fig. 1(b).

Fig. 1. (a) Smartphone chest support; (b) functional block diagram.

Note that, prior to signal classification and clustering, the data coming from the different sensors are aligned, denoised, and processed to extract salient motion information. In the following, the processing blocks are described in detail.

1) Inertial Data Acquisition and Synchronization: Inertial signals are provided by the built-in gyroscopic, magnetometric and accelerometric sensors.
At each sampling epoch, each of these sensors returns a three-dimensional real sample (3 axes per sensor, for a total of 9 axes across the three sensors), related to the motion of the device along the three dimensions of the smartphone reference system. Inertial samples are labelled with a timestamp that is relative to the smartphone's system clock. Although the sampling frequency is high (typically around 100 to 200 samples/s, depending on the device), the time interval between consecutive samples is not constant. This is consistent with the findings of [35]. To cope with this, interpolation and resampling are performed prior to data analysis to convert the signals to the common sampling frequency of 200 Hz, as done in prior work [36]. Moreover, the power spectral density of the accelerometer signals shows that sensor samples are affected by a significant amount of noise, due to the irregularities of motion and to the sensitivity of the sensing platform. This noise is largely removed through a low pass filter with a cut-off frequency of 40 Hz.

2) Video Data Extraction: Besides inertial measurements, motion data is also extracted from a video sequence that is concurrently acquired during each walk. At time instant n, the phone/camera pose can be represented by a location vector t_n and an orientation matrix R_n, which can be incrementally obtained through the relative rotation R̂_n and translation t̂_n, i.e., R_n = R̂_n R_{n-1} and t_n = R̂_n t_{n-1} + t̂_n. As a result, the same points in any two adjacent frames I_n and I_{n-1} appear modified by an affine transformation characterized by R̂_n and t̂_n, which can be estimated as follows.

Fig. 2. Relation between corresponding points in adjacent video frames.
For every video frame I_n, acquired at time instant n, the video processing unit computes a set D_n of keypoints that can be easily tracked across subsequent frames (see Fig. 2 for a graphical example), i.e.,

D_n = { p_{n,k} = (x_{n,k}, y_{n,k}, 1) },    (1)

where the p_{n,k} are expressed in homogeneous normalized coordinates. Salient points are identified using the SIFT algorithm [37], which allows one to select a set of pixel patches that are scale-invariant across frames. The Farneback optical flow algorithm [38] is then applied to frames I_n and I_{n-1}; as a result, the point p_{n,k} of frame I_n is associated with a point p_{n-1,k} of the previous frame I_{n-1}. These correspondences allow the estimation of the 3 × 3 essential matrix E_n, which satisfies the Longuet-Higgins equation,

0 = p_{n,k}^T E_n p_{n-1,k} = p_{n,k}^T [t̂_n]_× R̂_n p_{n-1,k}.    (2)

In this case, the matrix E_n is factored into the product of the vector product matrix [t̂_n]_× (associated with the relative translation vector t̂_n) and the relative rotation matrix R̂_n. Following this, E_n is estimated through Nistér's 5-point algorithm combined with a RANSAC optimization procedure to remove false matches and outliers. The matrix E_n can be decomposed into t̂_n and R̂_n via an SVD-based factorization. This makes it possible to estimate the location and the orientation of the smartphone camera at instant n, composing them with R_{n-1} and t_{n-1}. From the resulting rotation matrix R̂_n = [r_{i,j}], we obtain the roll, pitch, and yaw angles of the camera, α_n, β_n and γ_n, which are defined with respect to the reference system associated with the camera pose at instant n. These angles are obtained from the rotation matrix R̂_n as:

α_n = tan⁻¹( r_{2,1} / r_{1,1} ),
γ_n = tan⁻¹( r_{3,2} / r_{3,3} ),
β_n = tan⁻¹( -r_{3,1} / sqrt( (r_{3,2})² + (r_{3,3})² ) ).    (3)
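As an illustration, the angle extraction of Eq. (3) can be sketched in a few lines of NumPy. The function name and the assumed ZYX (yaw-pitch-roll) factorization of the rotation matrix are ours, not taken from the paper; atan2 is used in place of atan to preserve the correct quadrant.

```python
import numpy as np

def angles_from_rotation(R):
    """Extract the three camera angles of Eq. (3) from a 3x3 rotation
    matrix R = [r_ij] (indices are 1-based in the paper, 0-based here):
    alpha from r21/r11, gamma from r32/r33, and beta from -r31 and the
    norm of (r32, r33)."""
    alpha = np.arctan2(R[1, 0], R[0, 0])
    gamma = np.arctan2(R[2, 1], R[2, 2])
    beta = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    return alpha, beta, gamma
```

In a complete pipeline, R̂_n itself could be recovered from the matched keypoints with, e.g., OpenCV's `findEssentialMat` and `recoverPose`; that step is omitted here.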
Fig. 3. Simplified diagram of walking phases; Initial Contact (IC) and Final Contact (FC) instants are marked for right and left steps.

The estimation accuracy depends on the capture frequency of video frames: for our experiments we used a frame rate of 30 Hz, which is consistent with that adopted in previous studies, e.g., [39]. At the beginning of the video acquisition, a timestamp τ_0 is recorded (whose value depends on the smartphone's system clock). Since the sampling frequency is constant, a rather precise acquisition time, computed as τ_0 + nΔ, can be associated with every acquired frame, i.e., with every estimated triplet of angles (α_n, β_n, γ_n), where Δ is the sampling period. At this point, it is necessary to synchronize the estimated angles with the signals acquired by the inertial sensors. Since the time resolution of the inertial data is finer, video-related samples are interpolated to a sampling frequency of 200 Hz and time synchronized with the inertial data (estimating the optimal delay-shift that aligns the video and inertial traces).

3) Gait Cycle Extraction: The human gait follows a cyclic behavior featuring a periodic repetition of a pattern delimited by two consecutive steps. The stance phase starts at the instant when contact is made with the ground (usually occurring with the heel touching the ground first); this instant is called Initial Contact (IC). After that, the foot becomes flat on the ground and supports the full body weight (foot flat). Then, the heel begins to lift off the ground in preparation for the forward propulsion of the body, and we finally have the take off phase, which ends the stance and is delimited by the instant of Final Contact (FC) of the foot with the ground. Afterwards, the weight of the body is moved to the other foot until the next IC occurs (swing time).
A gait cycle (also referred to as stride or walking cycle) is defined as the interval between two consecutive Initial Contacts (ICs) of the same foot. A pictorial representation of ICs and FCs is given in Fig. 3. IC and FC instants can be identified by analyzing the vertical component of the accelerometer data. To this end, the signal is processed by a Difference of Gaussians (DoG) filter followed by a wavelet transform. IC instants correspond to the local minima of the transformed signals, while FC ones are found through a second differentiation and a search for local maxima. To avoid false detections, only IC and FC events within specific time intervals are considered [40]. After this processing phase, gait cycles can be reliably identified. In fact, a generic walking cycle i starts at IC(i) and ends at IC(i+2). It is thus possible to locate the walking cycle vectors in all the available signals. Each gait cycle is then normalized to a fixed length of 200 samples, stored in a descriptor and classified with the technique of Section II-B. An example of IC and FC detection is shown in Fig. 4.

Fig. 4. Initial (IC) and final (FC) contact instants for an example accelerometer signal. GCWT means Gaussian Continuous Wavelet Transform.

4) Detrending: The signals extracted from the reference video have trend components on each of the three axes, which can heavily impact the data normalization (see Fig. 5 for examples). To remove such trends and obtain semi-stationary timeseries, we fitted a linear model to each gait cycle. Then, the slope of the computed model was used to remove the trend from the corresponding gait cycle. Therefore, the trend affecting each cycle is approximated by a linear model. This makes it possible to remove the global trend without affecting the gait cycles.
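The per-cycle linear detrending can be sketched as follows. The function name is ours; whether only the slope or the full fitted line (slope and intercept) is subtracted is an implementation choice, since the subsequent normalization step removes the mean anyway.

```python
import numpy as np

def detrend_cycle(cycle):
    """Per-cycle linear detrending (sketch of Section II-A4): fit a
    degree-1 polynomial to one gait-cycle signal and subtract the
    fitted line, leaving a semi-stationary residual."""
    t = np.arange(len(cycle))
    slope, intercept = np.polyfit(t, cycle, 1)
    return cycle - (slope * t + intercept)
```

Applied independently to each of the 9 signals of a cycle, this removes the drift visible in the video-derived angles without distorting the within-cycle pattern.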
The result of the detrending procedure can be seen by comparing Fig. 5 and Fig. 6. Fig. 5 shows an example of the trends that affect the roll, pitch, and yaw signals extracted from the video; Fig. 6 shows the result of detrending. It is worth noting that spurious peaks still remain after this operation, but they can be promptly removed using an additional threshold based peak detection algorithm.

Fig. 5. Example of trends in the Roll, Pitch, and Yaw signals extracted from the video acquisition. The vertical lines identify the gait segments.

Fig. 6. Example of the detrended Roll, Pitch, and Yaw signals extracted from the video acquisition. The vertical lines identify the gait segments.

5) Normalization: Data normalization is a standard step in data preparation for neural network based learning. It is performed by removing the mean and dividing the data by its standard deviation, computing these measures over the entire dataset. Here, this operation is performed for each of the 9 signals in a gait cycle.

B. Gait RNN Based Embedding

Next, we present an RNN based architecture to embed the gait signals into a feature space containing a fixed size, higher order representation of the gaits.

Introduction to Sequence to Sequence Models: RNN based Sequence to Sequence (RNN-Seq2Seq) models have gained a lot of attention lately (see, e.g., [34]), mainly thanks to the work on Neural Machine Translation (NMT) by Google [33]. NMT models are based on a two-block architecture featuring an Encoder (first block) and a Decoder (second block).
The encoder is an RNN that, when fed with an input sequence of words (i.e., a sentence, possibly of variable length), embeds it into a fixed size feature space. This embedding, being generated by an RNN, captures temporal correlations between different portions of the input sequence. The embedded input sequence is then fed to the decoder RNN that, in turn, generates an output sequence that corresponds to the translation of the input. In this work, we design an RNN-Seq2Seq encoder-decoder architecture to embed multi-dimensional fixed-length timeseries into a feature space that is suitable for a subsequent classification task. To achieve this, we utilize a deep RNN as the encoder and a shallow NN with linear activations as the decoder. In Fig. 7, we show the architecture of the designed Seq2Seq model. First, the multi-dimensional timeseries corresponding to a gait cycle is fed into the RNN encoder. Then, the output of the encoder is fed to the shallow linear decoder. The whole neural network chain is trained to reproduce the input sequence at its output. The final state of the encoder is then utilized as a feature vector representing the embedding of the input sequence in a feature space that retains temporal information about the processed gait cycle.

Deep RNN Encoder: To obtain a finite-dimensional representation of the temporal characteristics of the gait multivariate timeseries, we utilized a deep bidirectional Long Short Term Memory (LSTM) based design [41], [42]. Each recurrent layer produces a higher order encoding (i.e., spanning longer temporal information) of the output of the previous one. This is due to the fact that each subsequent layer, apart from the first one, receives as input an already time-dependent and encoded representation of the original multivariate timeseries. The bidirectional connections allow updating the past network states according to what will happen in the future.
The number of dimensions of each encoding is determined by the size of the corresponding layer. Hence, the depth of the encoder determines the order of the statistics represented by the features that are extracted from the input sequence, while the size of the last layer determines the size of the space in which the original sequence is finally embedded. Formally, let X = [x_0, x_1, ..., x_N], X ∈ R^{9×(N+1)}, be the timeseries representing a gait cycle, where x_i ∈ R^9, i = 0, ..., N, is the preprocessed motion vector at sampling instant i, combining the inertial signals (accelerometer and gyroscope) and the motion angles (roll, pitch and yaw) obtained from the video. X is fed to the encoder, one sample at a time, until x_N is processed. When sample x_i is processed, layer ℓ = 0, ..., L-1 of the encoder produces a representation R_ℓ(X_{0,i}) of the subsequence X_{0,i} = [x_0, ..., x_i]. Given the recurrent nature of this architecture, the representation vector obtained at the output of the RNN encoder can be expressed as:

R_ℓ(X_{0,i}) = f( ... f(X_{0,i}, W_0, S_0, b_0), ..., W_ℓ, S_ℓ, b_ℓ ),    (4)

where W_ℓ is the weight matrix, S_ℓ is the state matrix, and b_ℓ is the vector of biases associated with layer ℓ. Moreover, f(·, W_ℓ, S_ℓ, b_ℓ) is the stack of the unit activations in layer ℓ:

f(·, W_ℓ, S_ℓ, b_ℓ) = [ f(·, w_ℓ^0, s_ℓ^0, b_ℓ^0), ..., f(·, w_ℓ^U, s_ℓ^U, b_ℓ^U) ]^T,   ℓ = 0, 1, ..., L-1,    (5)

where U+1 is the number of neural units in layer ℓ, whereas f(·, w_ℓ^u, s_ℓ^u, b_ℓ^u), for u = 0, ..., U, is the activation value of the corresponding neural unit u. Moreover, w_ℓ^u, s_ℓ^u, and b_ℓ^u respectively represent the weight vector, the state vector, and the bias associated with unit u in layer ℓ. It can then be seen that, applying Eq.
(4) to the full sequence X, the representation obtained at the last layer L-1 is a feature matrix S_{L-1} encoding all the temporal information contained in X.

Shallow Decoder: a shallow feed-forward decoder is utilized to decode the RNN output, obtaining the autoencoder depicted in Fig. 7.

Fig. 7. RNN-based embedding architecture. "Acc" stands for Accelerometer, "Gro" stands for Gyroscope, and "Vdo" stands for Video.

This feed-forward layer decodes the embedded timeseries at the output of the RNN, and the decoding process succeeds when the output sequence matches the input RNN sequence X. For this decoder, we use a linear layer with a number of units matching the size of the input vectors (i.e., 9). These units have linear activation functions, and each sample of the decoded sequence is obtained as a linear combination of the RNN output. Formally, let W_D and b_D be the weight matrix and the bias vector of the decoder, respectively. Then, the output sequence X̂ = [x̂_0, x̂_1, ..., x̂_N], X̂ ∈ R^{9×(N+1)}, is generated as shown in the following Eq. (6):

x̂_i = [ w_D^0 · R_{L-1}(X_{0,i}) + b_D^0, ..., w_D^8 · R_{L-1}(X_{0,i}) + b_D^8 ]^T,   i = 0, 1, ..., N,    (6)

where w_D^k and b_D^k are the weight vector and the bias of unit k in the decoder, respectively. According to Eq. (6), once the sample x̂_N has been obtained, the encoder has processed the entire input sequence X. If X̂ ≃ X, then the information (features) contained in the state matrix S_{L-1} of the output layer of the encoder nicely captures the input sequence X.
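A compact PyTorch sketch of this encoder-decoder pair is given below, assuming a two-layer bidirectional LSTM encoder and a single linear decoder layer; the layer sizes are illustrative placeholders, not the paper's actual settings.

```python
import torch
import torch.nn as nn

class GaitSeq2Seq(nn.Module):
    """Sketch of the RNN-Seq2Seq autoencoder of Section II-B: a deep
    bidirectional LSTM encoder followed by a shallow linear decoder
    that reconstructs the 9-channel input gait cycle."""

    def __init__(self, n_channels=9, hidden=64, layers=2):
        super().__init__()
        self.encoder = nn.LSTM(n_channels, hidden, num_layers=layers,
                               bidirectional=True, batch_first=True)
        # Linear activations only, as in the paper's shallow decoder.
        self.decoder = nn.Linear(2 * hidden, n_channels)

    def forward(self, x):              # x: (batch, 200, 9)
        out, (h, c) = self.encoder(x)  # out: (batch, 200, 2 * hidden)
        x_hat = self.decoder(out)      # reconstruction: (batch, 200, 9)
        return x_hat, h                # h: final encoder state
```

Training would minimize `nn.MSELoss()(x_hat, x)` on "normal" cycles only; the final state `h`, suitably reshaped, would then serve as the feature tensor passed to the CNN classifier.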
To improve the quality of this information, and hence that of the reconstructed sequence X̂, a training phase is needed, as we discuss next.

Training: To obtain encodings that are suitable for the subsequent classification task, the RNN-Seq2Seq module is only trained on "normal" walking cycles. This is because, by learning to reproduce only normal gaits, the module should be unable to determine a correct embedding when anomalous gaits are given as input, and this is detected with high probability by the subsequent classifier. The objective function to minimize during the training phase is the Mean Squared Distance (MSD). Indeed, by minimizing the MSD, one maximizes the match between the input (X) and the output (X̂) sequences. This ensures that the embedding of the input gait cycles obtained by the RNN encoder is a good higher order representation of the original timeseries. The training is performed by means of the Back Propagation Through Time (BPTT) algorithm with no truncation [43]. This is because the time span (i.e., 200 samples in our application) of the considered timeseries does not make truncation a computational necessity. By not truncating the BPTT, we also make sure that the temporal correlation contained in the input gait cycles is fully captured and reflected in the updates of weights and biases. To update the RNN-Seq2Seq weights and biases, we use a Stochastic Gradient Descent (SGD) algorithm with an exponentially decaying learning rate. SGD is a common choice for training Seq2Seq models, see for example [28]. To overcome the lack of flexibility of standard SGD, we introduced an exponentially decaying factor that gradually reduces the learning rate with the number of training steps. This strategy performs a fast minimization of the objective function at the beginning and slows it down at later iterations, making these last updates more accurate.
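The exponentially decaying schedule can be written as lr(t) = lr_0 · r^(t/k): large steps early in training, increasingly fine updates later. A minimal sketch follows; all constants are illustrative placeholders, not the values used in the paper.

```python
def decayed_lr(step, lr0=0.1, decay_rate=0.96, decay_steps=1000):
    """Exponentially decaying learning rate for SGD (Section II-B,
    Training): the rate is multiplied by decay_rate once every
    decay_steps training steps, in a smooth (non-staircase) fashion."""
    return lr0 * decay_rate ** (step / decay_steps)
```

At each SGD update, the current step's `decayed_lr(step)` would scale the gradient before it is applied to the Seq2Seq weights and biases.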
Formally, the cost function to be minimized using a gradient descent strategy is:

min Σ_{i=0}^{N} Σ_{j=0}^{8} (x_{i,j} - x̂_{i,j})²,
with respect to: W_ℓ, S_ℓ, b_ℓ, for ℓ = 0, ..., L-1, and W_D, b_D (decoder),    (7)

where x_{i,j} and x̂_{i,j} indicate the j-th elements of x_i and x̂_i, respectively. Index i runs over the samples in a gait cycle, i.e., vector x_i, and j spans the elements of this vector.

C. CNN Classifier

Next, we present the binary classification architecture that we implemented to identify anomalous gaits. Once the trained encoder has processed the whole sequence X, as explained in Section II-B, its final state S_{L-1} contains the higher order representation of the input timeseries. S_{L-1} is a multidimensional vector (i.e., a tensor). By suitably reshaping it, we obtain a multidimensional matrix S̃_{L-1} ∈ R^{W×Y×Z}, where the three dimensions W, Y, Z are implementation dependent. By performing this step, on top of the temporal dependencies encoded in the original state, we determine spatial relations between the individual features of the embedding produced by the encoder. These spatial relations can be exploited by the architecture shown in Fig. 8.

Fig. 8. CNN-based binary classifier architecture.

The multidimensional matrix S̃_{L-1} is fed into a multilayer CNN with Rectified Linear Unit (ReLU) activations. Between each pair of layers, a max-pooling step is performed to extract the most relevant features from the kernels and to reduce the computational complexity of the model [43]. The CNN output is then flattened and fed to a 2-unit layer with logistic activation functions. Hence, the final classification is obtained by means of a softmax layer. Formally, let CNN(S̃_{L-1}) = c ∈ R^M be the flattened output of the CNN, where M is implementation dependent.
Then, according to the notation used in Section II-B, at the output of the classification layer we obtain:

$$s = \begin{bmatrix} s_0 \\ s_1 \end{bmatrix} = \begin{bmatrix} \sigma(c \cdot w_\sigma^0 + b_\sigma^0) \\ \sigma(c \cdot w_\sigma^1 + b_\sigma^1) \end{bmatrix}, \qquad (8)$$

where indices 0 and 1 identify the "normal" and "anomalous" gaits, respectively, and σ(·) is the sigmoid function. According to Eq. (8), s represents the scores associated with each of the two classes for a given input gait. To obtain a probability distribution of the class assignment for a given input, a softmax operation is performed as shown in Eq. (9):

$$p(X \in \text{normal gait}) = \frac{\exp(s_0)}{\exp(s_0) + \exp(s_1)}, \qquad p(X \in \text{anomalous gait}) = \frac{\exp(s_1)}{\exp(s_0) + \exp(s_1)}, \qquad (9)$$

where s_0 and s_1 are the first and second components of s, respectively. The result of Eq. (9) is a probability distribution and, hence, X belongs to the class with the highest probability, i.e., either "0", meaning "normal gait", or "1", meaning "anomalous gait".

To train the classifier, we used both "normal" and "anomalous" gaits that were labeled at acquisition time. The cross-entropy loss applied to the softmax output was selected as the function to minimize during training. As in Section II-B, we performed the minimization by means of Stochastic Gradient Descent (SGD). The use of RELU activations and the cross-entropy objective function has the advantage of mitigating the vanishing gradient issue that arises when training deep classifiers.

D. Experimental Setup and Implementation Details

Next, we detail the experimental setup utilized to assess the performance of our system. This setup includes: the developed signal acquisition system (Section II-D1), used to gather motion and video data from Android smartphones, the pre-processed dataset (Section II-D2), and the RNN/CNN based algorithms for motion learning and classification (respectively treated in Sections II-D3 and II-D4).
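The two-stage scoring of Eqs. (8) and (9) above, sigmoid scores followed by a two-way softmax, can be sketched as follows (the weights, biases, and CNN output dimension M = 32 are random illustrative placeholders, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.standard_normal(32)        # flattened CNN output (M = 32, illustrative)
w = rng.standard_normal((2, 32))   # rows: w_sigma^0 and w_sigma^1
b = rng.standard_normal(2)         # b_sigma^0 and b_sigma^1

s = 1.0 / (1.0 + np.exp(-(w @ c + b)))  # Eq. (8): per-class sigmoid scores
p = np.exp(s) / np.exp(s).sum()         # Eq. (9): softmax over the two scores

label = "normal" if p[0] >= p[1] else "anomalous"
print(p.sum())  # 1.0 up to floating-point error
```

Note that the softmax only normalizes the two sigmoid scores into a probability distribution; the class decision is simply the argmax of p.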
The framework is publicly available at: https://github.com/Soldelli/gait_anomaly_detection.

1) Activity Logger and Video Application: Inertial and video data are collected by means of a custom made Android application called "Activity Logger and Video" (ALV). ALV was tested on an Asus Zenfone 2 featuring a 2.3 GHz quad-core Intel CPU, 4 GB of RAM, and the Android 5 "Lollipop" operating system. ALV requires a minimum API Level of 8 and requests permissions for audio and video recording, camera access, Internet, and external storage. The application is used to set the acquisition parameters, collect information about the user (age, height, gender, etc.), collect and save data into the smartphone non-volatile memory and, optionally, send them to a File Transfer Protocol (FTP) server. During the acquisition phase, the smartphone is carried with the rear-facing camera looking forward and is mounted on an ad-hoc made chest support, as shown in Fig. 1(a).

2) Dataset Creation: Raw data were collected from a single healthy male subject in his twenties. We recall that our application is meant to be subject-specific, i.e., to tune itself to the gait patterns of a specific subject in order to recognize whether their walking style changes. The individual performed 141 "normal" walks and 61 "anomalous" walks. For the "normal" walks, the subject followed their natural walking habit. The corresponding patterns may contain speed variations, rotations and/or lateral oscillations, which are however to be considered normal for the purpose of this study. In fact, our objective amounts to detecting subtle variations in the walking patterns, which may be a proxy to a disease, injury, etc., that leads to a gradual or sudden degradation of motor functions. Upon collecting the normal walks, the monitored subject was instructed to include anomalies in their walking style.
This is in line with previous studies from the literature, e.g., [13], [17], where the authors emulated anomalies such as: left/right knee condition, shuffling walk (Parkinson), hemiplegia, leg pain or back pain.

Fig. 9. ALV homescreen.

The anomalies considered in our present study are: shuffling walk and sliding feet, which is a typical walking attitude of a person affected by Parkinson's disease; hemiplegic gait, which may result from a stroke [44]; and unusual, subtle speed variations such as the ones that arise in Alzheimer's disease. For instance, cautious gait is seen in early Alzheimer's disease: gait changes may be initially subtle and difficult to visually detect, arising with a reduction in the speed and stride of walking [45]. Finally, we emulated anomalous trunk postures, such as those that may arise in the diabetic gait [46]. Our purpose in the present study is to propose a very accurate detection engine, assessing whether gait anomalies of some sort are present in the walking patterns. Their further classification into the corresponding anomaly type is not considered. We remark that this task can be accomplished by replacing the binary CNN classifier with a CNN (or any other classification scheme) that uses the RNN state to discriminate among the different anomaly classes. This further assessment would require the use of clinical data and a focus on one or several specific pathologies, which is left as a future endeavor.

For each walk, accelerometric, gyroscopic, and video data were collected and time-synchronized. The collected data and timestamps were then processed, segmented, and normalized as described in Section II-A, resulting in 7,941 "normal" gait cycles and 2,744 "anomalous" gait cycles. Each gait is represented through a 9 × 200 real matrix. From the "normal" gait cycles, 4,966 have been used for training and testing the RNN-Seq2Seq model discussed in Section II-B.
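The dataset bookkeeping above can be reconciled with a short sketch. Assuming the remaining normal cycles plus all anomalous ones feed the classifier stage (as described next), and simple floor rounding for the 90/10 train-test splits:

```python
normal, anomalous = 7941, 2744  # segmented gait cycles
seq2seq_pool = 4966             # normal cycles reserved for the RNN-Seq2Seq model

# 90/10 split for the RNN-Seq2Seq model (floor rounding assumed)
seq2seq_train = int(seq2seq_pool * 0.9)       # 4469
seq2seq_test = seq2seq_pool - seq2seq_train   # 497

# Remaining normal cycles plus all anomalous ones feed the classifiers
classifier_pool = (normal - seq2seq_pool) + anomalous   # 5719
classifier_train = int(classifier_pool * 0.9)           # 5147
classifier_test = classifier_pool - classifier_train    # 572
```

These counts match the figures reported in the synopsis tables of Section II-D6.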
The remaining "normal" gait cycles and all the "anomalous" ones have been used for training and testing the classifiers.

3) RNN-Seq2Seq model implementation and training: The RNN-Seq2Seq model has been implemented in Python 3.6, using the TensorFlow library. The encoder is made of 2 bidirectional LSTM layers. Each layer has 512 LSTM cells with forget gates and peephole connections. To implement this model, we used the tf.contrib.rnn.LSTMBlockCell class with the use_peepholes flag set to True. The decoder is made of a single dense layer with 9 units with linear activation functions. It has been implemented by means of the tf.layers.Dense class with the activation parameter set to None. For training purposes, the RNN encoder has been wrapped using a tf.contrib.rnn.DropoutWrapper wrapper with an output keep probability of 0.8. The objective function has been defined by means of the tf.losses.mean_squared_error loss, and the training phase lasted for 21 epochs. The learning rate has been halved every 1,000 training steps on mini-batches of 16 samples. Moreover, we implemented a gradient clipping step to prevent the exploding gradient issue. The gradients have been clipped with respect to a maximum norm of 5. The training was executed on a desktop PC equipped with an NVidia Titan-X GPU. The training dataset was made of 90% of the "normal" gait cycles; the remaining 10% has been used for testing purposes. It is worth noting that these data have been used only to train the RNN-Seq2Seq model, and have not been used for the subsequent training and testing of the classifier. This is to prevent the encoder from generating an overly good representation of the "normal" gait cycles, hence biasing the following classification step.

4) CNN classifier implementation and training: The CNN classifier has been implemented in Python 3.6, utilizing the TensorFlow library. We implemented one convolutional layer with the tf.nn.conv2d module.
For this, we used [10, 6, 8, 16] shaped filters initialized with tf.truncated_normal_initializer with standard deviation stddev = 0.05. The output of the convolutional layer is then processed by a max pooling layer implemented through tf.nn.max_pool. For this, we used kernels of size [1, 4, 4, 1] and strides of length [1, 2, 2, 1]. The resulting tensor is then flattened and processed by a 2-unit tf.layers.dense layer with activations set to None (i.e., with linear activations). This is because, for computing the cross-entropy loss, we used the tf.nn.softmax_cross_entropy_with_logits function, which applies the logistic function directly to the output of a linear layer. The loss has been minimized using SGD with an exponentially decaying learning rate. To this end, the learning rate has been halved every 1,000 training steps. The training has been performed for 11 epochs with mini-batches of size 16. The classification dataset consists of mixed "normal" and "anomalous" gait cycles, with a ratio of about 1:1. The training process involved a first encoding phase by means of the pre-trained RNN-Seq2Seq model. The resulting final state is then reshaped as described in Section II-C and fed to the classifier. For the training, we used 90% of the previously shuffled dataset. The remaining 10% has been used for testing.

5) SVM classifier implementation and training: The SVM classifier has been implemented in Python 3.6 with the support of the ScikitLearn library. We chose a Gaussian RBF kernel, and we fed the model with flattened gait cycles. The model has been trained on the same dataset used for the CNN classifier. The training set was composed of 90% of the data. The trained model has then been tested on the remaining 10% of the data. It is worth noting that, usually, SVM models require less training data than NN-based ones.
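For reference, the Gaussian RBF kernel compares two flattened gait cycles as k(x, z) = exp(−γ‖x − z‖²). A minimal sketch (the value of γ below is a hypothetical placeholder; the paper does not report the one used):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1e-3):
    """Gaussian RBF kernel between two flattened gait cycles."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(9 * 200)  # a 9x200 gait cycle, flattened
z = rng.standard_normal(9 * 200)

print(rbf_kernel(x, x))  # 1.0: identical cycles have maximum similarity
print(rbf_kernel(x, z))  # < 1.0: similarity decays with squared distance
```

Feeding flattened cycles means the kernel treats each 9 × 200 matrix as a single 1,800-dimensional point, with no explicit temporal modeling.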
Hence, by using the same train-test split ratio for all the architectures, we are actually slightly advantaging the SVM classifier. We remark that the two classifiers (i.e., the CNN-based one and the SVM one) have been compared against the same previously unseen test set. The considered test set is composed of 572 gait cycles, split in the following way:
• 49.3% normal gait cycles;
• 50.7% anomalous gait cycles.

6) Synopsis of the implementation parameters: Here, we recap the main parameters of the implemented models.

TABLE I
RNN-SEQ2SEQ MODEL (SECTION II-B) IMPLEMENTATION PARAMETERS

Training Data           4,469 "normal" gait cycles
Test Data               497 "normal" gait cycles
Recurrent Layers        2, bidirectional
Units Per Layer         512 LSTM cells
Initial Learning Rate   0.01
Decay Steps             1,000
Training Epochs         21
Max Gradient Norm       5

TABLE II
CONVOLUTIONAL CLASSIFIER (SECTION II-C) IMPLEMENTATION PARAMETERS

Training Data           5,147 mixed gait cycles
Test Data               572 mixed gait cycles
Convolutional Layers    1
Convolutional Filters   [10, 6, 8, 16]
Convolutional Strides   [1, 2, 2, 1]
Max Pooling Layers      1
Max Pooling Kernel      [1, 4, 4, 1]
Max Pooling Strides     [1, 2, 2, 1]
Initial Learning Rate   0.01
Decay Steps             1,000
Training Epochs         11

TABLE III
SVM CLASSIFIER (SECTION II-C) IMPLEMENTATION PARAMETERS

Training Data   5,147 mixed gait cycles
Test Data       572 mixed gait cycles
SVM Kernel      Gaussian RBF

III. RESULTS

To assess the classification performance of the proposed method, we compared it to that of the SVM classifier of Section II-D. The SVM model achieved a classification accuracy of 98.077% on the test set. The corresponding confusion matrix is shown in Tab. IV, from which two main conclusions can be drawn. First, even a standard model achieves a high classification accuracy. Second, the SVM provides "conservative" results: it is more likely that a "normal" gait is classified as anomalous than the opposite. This is a desirable feature for healthcare monitoring applications, because no potentially dangerous conditions are misclassified.

TABLE IV
SVM CLASSIFIER: CONFUSION MATRIX ON THE TEST SET

                 predicted "normal"   predicted "anomalous"
"normal"               47.38%                 1.92%
"anomalous"             0.00%                50.70%

The classification accuracy on the test set that we obtained with the proposed 2-step model (RNN as feature extractor followed by a CNN classifier) is 100%. This improvement, together with the ability of neural networks to perform online learning (i.e., updating the model parameters while being used), makes the proposed architecture a viable choice for gait anomaly detection in free living conditions. Moreover, it is worth noting that, once the initial training of the RNN-Seq2Seq model and the CNN-based classifier has been completed, the resulting model can be run on off-the-shelf smartphones such as the Asus Zenfone 2 used in this work.

a) Discussion: We found the problem of detecting anomalies in gait cycles to be a difficult one. Although here we only describe the final design, which worked satisfactorily in all our experiments, prior to this we tried several other ways of combining neural networks and classifiers. For example, we experimented with a system architecture where the RNN is trained to predict the next sample x_i ∈ R^9, given the (observed) previous samples in the current cycle x_0, x_1, ..., x_{i−1}. We thus evaluated the prediction error e_i = x_i − x̂_i for each sample i, with i = 0, 1, ..., N + 1. Hence, for each gait cycle, we evaluated the Mean Square Error (MSE) and its standard deviation for each element of e_i, leading to a cycle descriptor of size 2 × 9.
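The 2 × 9 cycle descriptor described above (per-channel MSE of the prediction error, plus its standard deviation) can be sketched as follows; the predicted samples here are simulated with additive noise for illustration:

```python
import numpy as np

def cycle_descriptor(x, x_hat):
    """Per-channel prediction-error statistics over one gait cycle.

    x, x_hat: arrays of shape (N, 9) holding observed and predicted samples.
    Returns a (2, 9) descriptor: row 0 = MSE, row 1 = error std, per channel.
    """
    e = x - x_hat                   # prediction error per sample
    mse = np.mean(e ** 2, axis=0)   # one MSE value per channel
    std = np.std(e, axis=0)         # one std value per channel
    return np.stack([mse, std])

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 9))            # observed gait cycle (9 channels)
x_hat = x + 0.1 * rng.standard_normal((200, 9))  # simulated RNN predictions
d = cycle_descriptor(x, x_hat)
print(d.shape)  # (2, 9)
```

This compact descriptor was then fed to the SVM mentioned next, instead of the full encoder state used in the final design.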
This cycle descriptor was then used to discriminate between normal and anomalous cycles using an SVM classifier. In the test phase, this technique led to 5.8% of normal cycles being misclassified as anomalous, and 2.9% of anomalous cycles being misclassified as normal. The percentages of correctly classified cycles were 42.7% and 48.6% for normal and anomalous walks, respectively.

IV. CONCLUSIONS

In this work, we have explored new methods for the automated classification of gait cycles from multimodal motion data consisting of inertial (accelerometer and gyroscope) and video signals. The proposed solution is subject-specific, as we purposely learn the way in which a subject walks, with the goal of correctly distinguishing normal gaits from anomalous ones, where "anomalous" means containing subsequences that are usually not present in the normal walks. To the best of our knowledge, this is the first work that systematically addresses this problem through the fusion of inertial and video data. Also, we utilize advanced deep neural network models, namely Recurrent Neural Networks (RNN), to capture the statistics underpinning the motion data, through an architecture that combines recurrent and convolutional designs. The system that we put forward is lightweight, as data can be conveniently acquired from a smartphone device with no user configuration required. The feature extraction block is trained in an unsupervised manner (thanks to the RNN), and labeled data is only needed for the training of the final classifier. Our results reveal that the final architecture is very effective, being able to correctly capture the correlations in the walking patterns of the monitored subject and to detect all the anomalous gaits. Our work can be extended in several ways. For instance, we may assess which type of anomaly affects the data, or use a similar design for activity recognition tasks.
Also, this trajectory learning approach can be very useful to quantify the progress of a disease affecting motor functions or the benefits of rehabilitation therapies.

ACKNOWLEDGMENT

This work has been supported by the University of Padova through the project CPDA 151221 "IoT-SURF". Any opinions, findings, and conclusions herein are those of the authors and do not necessarily represent those of the funding institution.

ETHICS STATEMENT

The present study does not involve the use of clinical data. The learning and classification algorithms have been trained using walking patterns from a healthy subject, who emulated walking anomalies as explained in Section II-D. The subject has provided written consent to use their walking data for the numerical assessment that we carried out in Section III.

REFERENCES

[1] S. Schülein, J. Barth, A. Rampp, R. Rupprecht, B. M. Eskofier, J. Winkler, K.-G. Gaßmann, and J. Klucken, "Instrumented gait analysis: a measure of gait improvement by a wheeled walker in hospitalized geriatric patients," Journal of NeuroEngineering and Rehabilitation, vol. 14, no. 1, pp. 1–11, Feb. 2017.
[2] M. Rossi and M. Gadaleta, "IDNet: Smartphone-based Gait Recognition with Convolutional Neural Networks," Elsevier Pattern Recognition, vol. 74, pp. 25–37, Feb. 2018.
[3] S.-M. Lee, S. M. Yoon, and H. Cho, "Human activity recognition from accelerometer data using convolutional neural network," in IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Korea, Feb. 2017.
[4] S. K. Dhar, M. M. Hasan, and S. A. Chowdhury, "Human activity recognition based on gaussian mixture model and directive local binary pattern," in IEEE International Conference on Electrical, Computer, and Telecommunication Engineering (ICECTE), Rajshahi, Bangladesh, Dec. 2016.
[5] V. Ghasemi and A. A. Pouyan, "Human activity recognition in ambient assisted living environments using a convex optimization problem," in IEEE International Conference of Signal Processing and Intelligent Systems (ICSPIS), Tehran, Iran, Dec. 2016.
[6] H. Yu, S. Cang, and Y. Wang, "A review of sensor selection, sensor devices and sensor deployment for wearable sensor-based human activity recognition systems," in IEEE International Conference on Software, Knowledge, Information Management, and Applications (SKIMA), Chengdu, China, Dec. 2016.
[7] A. Muro-de-la Herran, B. Garcia-Zapirain, and A. Mendez-Zorrilla, "Gait Analysis Methods: An Overview of Wearable and Non-Wearable Systems, Highlighting Clinical Applications," MDPI Sensors, vol. 14, no. 2, pp. 3362–3394, Feb. 2014.
[8] M. Goffredo, I. Bouchrika, J. N. Carter, and M. S. Nixon, "Performance analysis for automated gait extraction and recognition in multi-camera surveillance," Multimedia Tools and Applications, vol. 50, no. 1, pp. 75–94, Oct. 2010.
[9] J. Little and J. E. Boyd, "Recognizing people by their gait: The shape of motion," Videre: Journal of Computer Vision Research, vol. 1, pp. 1–32, 1996.
[10] D. Geerse, M. Roerdink, B. Coolen, J. Marinus, and J. van Hilten, "The interactive walkway: Towards assessing gait-environment interactions in a clinical setting," Movement Disorders, vol. 31, no. 2, pp. 1–1, Jun. 2016.
[11] L. Middleton, A. A. Buss, A. Bazin, and M. S. Nixon, "A floor sensor system for gait recognition," in IEEE Workshop on Automatic Identification Advanced Technologies (AutoID), Buffalo, NY, USA, Oct. 2005.
[12] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, "Device-free Human Activity Recognition Using Commercial WiFi Devices," IEEE Journal on Selected Areas in Communications, vol. 35, no. 5, pp. 1118–1131, May 2017.
[13] B. Pogorelc and M. Gams, "Discovery of Gait Anomalies from Motion Sensor Data," in IEEE International Conference on Pattern Recognition (ICPR'06), Hong Kong, Aug. 2010.
[14] W. Tao, T. Liu, R. Zheng, and H. Feng, "Gait Analysis Using Wearable Sensors," Sensors, vol. 12, no. 2, pp. 2255–2283, 2012.
[15] F. Lin, A. Wang, Y. Zhuang, M. R. Tomita, and W. Xu, "Smart Insole: A Wearable Sensor Device for Unobtrusive Gait Monitoring in Daily Life," IEEE Transactions on Industrial Informatics, vol. 12, no. 6, pp. 2281–2291, Dec. 2016.
[16] B. Mariani, S. Rochat, C. J. Bla, and K. Aminian, "Heel and Toe Clearance Estimation for Gait Analysis Using Wireless Inertial Sensors," IEEE Transactions on Biomedical Engineering, vol. 59, no. 11, pp. 3162–3168, Nov. 2012.
[17] G. Cola, M. Avvenuti, A. Vecchio, G.-Z. Yang, and B. Lo, "An On-Node Processing Approach for Anomaly Detection in Gait," IEEE Sensors Journal, vol. 15, no. 11, pp. 6640–6649, Nov. 2015.
[18] M. ElSayed, A. Alsebai, A. Salaheldin, N. E. Gayar, and M. ElHelw, "Ambient and wearable sensing for gait classification in pervasive healthcare environments," in IEEE International Conference on e-Health Networking, Applications and Services, Lyon, France, Jul. 2010.
[19] E. Hossain and G. Chetty, "A multi-modal gait based human identity recognition system based on surveillance videos," in International Conference on Signal Processing and Communication Systems, Gold Coast, Australia, Dec. 2012.
[20] R. Soangra, T. Lockhart, and N. Van De Berge, An approach for identifying gait events using wavelet denoising technique and single wireless IMU, Sep. 2011, vol. 55.
[21] T. Nakano, B. T. Nukala, S. Zupancic, A. Rodriguez, D. Y. C. Lie, J. Lopez, and T. Q. Nguyen, "Gaits classification of normal vs. patients by wireless gait sensor and support vector machine (svm) classifier," in IEEE/ACIS International Conference on Computer and Information Science (ICIS), Okayama, Japan, Jun. 2016.
[22] B. T. Nukala, T. Nakano, A. Rodriguez, J. Tsay, J. Lopez, T. Q. Nguyen, S. Zupancic, and D. Y. C. Lie, "Real-Time Classification of Patients with Balance Disorders vs. Normal Subjects Using a Low-Cost Small Wireless Wearable Gait Sensor," Biosensors, vol. 6, no. 4, pp. 1–22, May 2016.
[23] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, Dec. 2017, pp. 193–199.
[24] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Pudong, Shanghai, China, Mar. 2016, pp. 4945–4949.
[25] E. Song, F. K. Soong, and H. G. Kang, "Effective spectral and excitation modeling techniques for lstm-rnn-based speech synthesis systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2152–2161, Nov. 2017.
[26] D. Wu and M. Chi, "Long Short-Term Memory With Quadratic Connections in Recursive Neural Networks for Representing Compositional Semantics," IEEE Access, vol. 5, pp. 16077–16083, Jan. 2017.
[27] G. Preethi, P. V. Krishna, M. S. Obaidat, V. Saritha, and S. Yenduri, "Application of deep learning to sentiment analysis for recommender system on cloud," in IEEE International Conference on Computer, Information and Telecommunication Systems (CITS), Dalian, China, Jul. 2017, pp. 93–97.
[28] D. Britz, A. Goldie, T. Luong, and Q. Le, "Massive exploration of neural machine translation architectures," in Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, Sep. 2017.
[29] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, vol. abs/1609.08144, 2016. [Online]. Available: http://arxiv.org/abs/1609.08144
[30] A. K. Rao, L. Muratori, E. D. Louis, C. B. Moskowitz, and K. S. Marder, "Spectrum of gait impairments in presymptomatic and symptomatic huntington's disease," Movement Disorders, vol. 23, no. 8, pp. 1100–1107, 2008.
[31] B. Salzman, "Gait and balance disorders in older adults," American Family Physician, vol. 82, no. 1, pp. 61–68, Jul. 2010.
[32] J. Verghese, R. B. Lipton, C. B. Hall, G. Kuslansky, M. J. Katz, and H. Buschke, "Abnormality of Gait as a Predictor of Non-Alzheimer's Dementia," New England Journal of Medicine, vol. 347, no. 22, pp. 1761–1768, 2002.
[33] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," CoRR, vol. abs/1609.08144, 2016. [Online]. Available: http://arxiv.org/abs/1609.08144
[34] C.-C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018.
[35] C. Eduardo, L. Rafael, and M. María-José, "Analysis of Android Device-Based Solutions for Fall Detection," Sensors, vol. 15, no. 8, pp. 17827–17894, 2015.
[36] C. Mizuike, S. Ohgi, and S. Morita, "Analysis of stroke patient walking dynamics using a tri-axial accelerometer," Gait & Posture, vol. 30, no. 1, pp. 60–64, 2009.
[37] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[38] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis, Halmstad, Sweden, Jun. 2003.
[39] I. Bouchrika, M. Goffredo, J. Carter, and M. Nixon, "On Using Gait in Forensic Biometrics," Journal of Forensic Sciences, vol. 56, no. 4, pp. 882–889, Oct. 2011.
[40] S. D. Din, A. Godfrey, and L. Rochester, "Validation of an accelerometer to quantify a comprehensive battery of gait characteristics in healthy older adults and parkinson's disease: Toward clinical and at home use," IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 3, pp. 838–847, May 2016.
[41] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[42] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, Apr. 2015.
[43] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[44] Q. Li, Y. Wang, A. Sharf, Y. Cao, C. Tu, B. Chen, and S. Yu, "Classification of gait anomalies from kinect," The Visual Computer, vol. 34, no. 2, pp. 229–241, Feb. 2018.
[45] J. R. Merory, J. E. Wittwer, C. C. Rowec, and K. E. Webster, "Quantitative gait analysis in patients with dementia with Lewy bodies and Alzheimer's disease," Gait & Posture, vol. 26, no. 3, pp. 414–419, Sep. 2007.
[46] Z. Sawacha, G. Gabriella, G. Cristoferi, A. Guiotto, A. Avogaro, and C. Cobelli, "Diabetic gait and posture abnormalities: a biomechanical investigation through three dimensional gait analysis," Clinical Biomechanics, vol. 24, no. 9, pp. 722–728, Nov. 2009.
