Gesture Recognition with a Focus on Important Actions by Using a Path Searching Method in Weighted Graph
Kazumoto Tanaka
Dept. of Information and Systems Engineering, Kinki University, Higashi-hiroshima, 739-2116, Japan

Abstract

This paper proposes a method of gesture recognition with a focus on important actions for distinguishing similar gestures. The method generates a partial action sequence by using optical flow images, expresses the sequence in the eigenspace, and checks the feature vector sequence by applying an optimum path-searching method on a weighted graph to focus on the important actions. Also presented are the results of an experiment on the recognition of similar sign language words.

Keywords: Gesture Recognition, Sign Language, Optical Flow, Appearance-based, Eigenspace, Weighted Graph, Path Searching.

1. Introduction

There have been a number of studies on automatic gesture recognition methods. One of the main purposes of gesture recognition is to realize sign language recognition: for ordinary people to communicate with deaf people, a method is needed that translates sign language into natural language. Many sign language words resemble one another, and some parts of the gestures (hereafter called "partial actions") play an important role in differentiating those words. For example, among Japanese sign language words, an action of the thumb in the word "lend" is very significant for distinguishing it from the word "return". However, little attention has been given to gesture recognition that focuses on such partial actions. This study therefore develops a recognition method that focuses on important partial actions. The method promises to improve the recognition rate of gestures whose meanings depend on local actions.

Most studies on gesture recognition have been conducted using Hidden Markov Models [1][2][3], Dynamic Time Warping [4][5] or Neural Networks [6][7], while a novel recognition method that utilizes string matching based on a path-searching method has also been proposed [8]. The target of that recognition method was relatively large motions such as a wide hand swing, not the local actions often used in sign language words. By modifying the string-matching method, this study realizes a matching method that focuses on important partial actions to recognize sign language words successfully.

In this paper, a method for generating time-series feature vectors of gestures is described first; these feature vectors are employed in the "modified" string matching. Next, an optimum path-searching method in a weighted graph, in which the cost of the graph edges corresponding to important partial actions is reduced, is proposed to realize our matching method. Also presented are the results of an experiment on the recognition of similar sign language words.

2. Gesture recognition by using a path searching method in weighted graph

Since the interest of this study rests on the effects of a matching method in which the feature vectors of important partial actions are focused to distinguish similar gestures, we restrict the discussion to single gestures as the study subject. Thus, each gesture is assumed to be in a stationary state at the start and end of the action, and word spotting is not addressed.
In general, gesture recognition methods utilize time-series feature vectors for pattern matching against a gesture dictionary. Feature vector composition can be divided broadly into two categories: one utilizes geometric characteristics in images (model-based methods) and the other utilizes appearance characteristics of images (appearance-based methods) [9][10]. Appearance-based methods have the merit of representing objects whose geometric characteristics are difficult to express. In this study, feature vectors of the complex optical flow images produced by sign language gestures are composed using the appearance-based method. As is well known, appearance-based methods often utilize the eigenspace method to reduce the dimension of the feature vector. The feature vector generation method, an appearance-based method with the eigenspace, is described in 2.1 and 2.2. The gesture representation used to construct the gesture dictionary and the matching method obtained by modifying the string-matching method are described in 2.3 and 2.4, respectively.

2.1 Generation of partial action sequence

The method generates several partial action images from the optical flow images, which are calculated from motion images by using a local correlation operation. The criteria for the generation are the position of the flow vectors and their directional change. The generation procedure is as follows (a Python sketch is given after the steps):

Step 1: A raster scan is performed to search for already labeled flow vectors on the optical flow image Image_i (the suffix i is the frame number; the initial value is 1). If such a flow vector is found, labeling by 8-neighbours is conducted until no unlabeled vector remains within the connected component of the flow. However, if the angle between the labeled vector and its neighbour vector exceeds a given threshold, no labeling is conducted. Once this raster scan is completed, a second raster scan is performed on Image_i to search for unlabeled flow vectors. If an unlabeled flow vector is found, a new label is attached and, in the same manner as above, labeling by 8-neighbours is conducted.

Step 2: If the angle between the flow vector v := (vx, vy) at the coordinate (m, n) of Image_i and the flow vector u at the coordinate (m + vx*dt, n + vy*dt) of Image_i+1 is less than the threshold, the same label as that of v is attached to u. Here, dt is the sampling time span. This is performed for all flow vectors of Image_i.

Step 3: Set i := i + 1; if i does not exceed the last frame number of the image series, return to Step 1.

Step 4: Each group of flow vectors that have the same label is extracted and put in a new frame as a new optical flow image (see Fig. 1). This is performed for all labels in each optical flow image.

Step 5: Among all the images created in Step 4, if the number of images that have the same labeled flow vectors is less than a given threshold, those images are removed for noise canceling.

Step 6: The images that have the same labeled flow vectors are superimposed, in time order, on the first image among them for each label. As a result, the images are combined into one image per label (see Fig. 1). The combined image is called a partial action image in this paper.

Step 7: The partial action sequence is generated by arranging the partial action images in label order.
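The following Python sketch is one minimal reading of Steps 1-7, assuming dense optical flow fields of shape (H, W, 2) per frame (e.g. from cv2.calcOpticalFlowFarneback). The angle and noise thresholds, the magnitude cutoff for "active" flow vectors, and all function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

ANGLE_THRESH = np.deg2rad(30.0)  # assumed directional-change threshold
MIN_FRAMES = 3                   # assumed Step 5 noise-canceling threshold
DT = 1.0                         # sampling time span dt

def angle(v, u):
    """Angle between two 2D flow vectors (pi if either is near zero)."""
    nv, nu = np.linalg.norm(v), np.linalg.norm(u)
    if nv < 1e-6 or nu < 1e-6:
        return np.pi
    return np.arccos(np.clip(np.dot(v, u) / (nv * nu), -1.0, 1.0))

def grow(flow, labels, active, y, x):
    """8-neighbour flood fill from labeled seed (y, x), stopping where
    the flow direction changes more than the threshold (Step 1)."""
    H, W = labels.shape
    stack = [(y, x)]
    while stack:
        cy, cx = stack.pop()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = cy + dy, cx + dx
                if (0 <= ny < H and 0 <= nx < W and active[ny, nx]
                        and labels[ny, nx] == 0
                        and angle(flow[cy, cx], flow[ny, nx]) < ANGLE_THRESH):
                    labels[ny, nx] = labels[cy, cx]
                    stack.append((ny, nx))

def partial_action_sequence(flows):
    """Steps 1-7: returns the partial action images in label order."""
    H, W, _ = flows[0].shape
    labels = [np.zeros((H, W), dtype=int) for _ in flows]
    next_label = 1
    for i, flow in enumerate(flows):
        active = np.linalg.norm(flow, axis=2) > 1e-3
        # Step 2: carry labels from the previous frame along its flow vectors
        if i > 0:
            prev_flow, prev_lab = flows[i - 1], labels[i - 1]
            for y, x in zip(*np.nonzero(prev_lab)):
                ny = int(round(y + prev_flow[y, x, 1] * DT))
                nx = int(round(x + prev_flow[y, x, 0] * DT))
                if (0 <= ny < H and 0 <= nx < W and active[ny, nx]
                        and angle(prev_flow[y, x], flow[ny, nx]) < ANGLE_THRESH):
                    labels[i][ny, nx] = prev_lab[y, x]
        # Step 1, first scan: grow labels already present in this frame
        for y, x in zip(*np.nonzero(labels[i])):
            grow(flow, labels[i], active, y, x)
        # Step 1, second scan: start new labels for unlabeled flow vectors
        for y, x in zip(*np.nonzero(active)):
            if labels[i][y, x] == 0:
                labels[i][y, x] = next_label
                grow(flow, labels[i], active, y, x)
                next_label += 1
    # Steps 4 and 6: superimpose same-labeled vectors, earliest frame first
    images, frames_seen = {}, {}
    for flow, lab in zip(flows, labels):
        for m in np.unique(lab[lab > 0]):
            mask = (lab == m)[..., None]
            img = images.setdefault(m, np.zeros((H, W, 2)))
            empty = np.linalg.norm(img, axis=2, keepdims=True) == 0
            images[m] = np.where(mask & empty, flow, img)
            frames_seen[m] = frames_seen.get(m, 0) + 1
    # Step 5 (noise canceling) and Step 7 (arrange in label order)
    return [images[m] for m in sorted(images) if frames_seen[m] >= MIN_FRAMES]
```

In this sketch labels are kept globally unique across frames, so sorting by label in the last line also orders the partial actions roughly by the time at which each action first appears.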
Fig. 1: Generation of partial action images. The m-labeled and n-labeled flow vectors are extracted from the time-series optical flow images and superimposed into one partial action image per label.

2.2 Generation of feature vector sequence by the eigenspace method

Because the eigenspace method is sensitive to any position shift of the subject of recognition, it is assumed that the gesture is performed in a nearly identical position inside a window set in the image. First, the operation described in 2.1 is performed for all dynamic images of each gesture prepared for constructing the gesture dictionary. After the operation, for each partial action image, the method generates a vector v by arranging the flow vector components in raster scanning order, and thereby obtains a set of vectors v. The components of the flow vector at each pixel are laid out from component x to component y in the vector. Next, from all the vectors v obtained, the matrix V := [v_1, v_2, \ldots, v_N] is formed. The eigenspace is spanned by the eigenvectors corresponding to the k largest eigenvalues, determined by solving the characteristic equation (1) of the covariance matrix U of the matrix obtained by subtracting the mean vector \bar{v} of the column vectors of V from each column vector:

U e_i = \lambda_i e_i   (1)

Each vector v of a partial action is then projected by Formula (2) to calculate the feature vector u in the eigenspace:

u = [e_1, e_2, \ldots, e_k]^T (v - \bar{v})   (2)

Lastly, the feature vector sequence in the eigenspace is obtained from the partial action sequence.
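As a sketch of 2.2, the eigenspace can be computed from a singular value decomposition of the centered data matrix, whose left singular vectors are exactly the eigenvectors of the covariance matrix in Formula (1); this avoids forming the very large matrix U explicitly. The function names and the choice of SVD are illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_eigenspace(partial_action_images, k=4):
    """Form V = [v_1, ..., v_N] by raster-scanning each flow image
    (component x then y per pixel) and solve Formula (1) via SVD:
    the left singular vectors of the centered matrix are the
    eigenvectors of U, ordered by decreasing eigenvalue."""
    V = np.stack([img.reshape(-1) for img in partial_action_images], axis=1)
    v_bar = V.mean(axis=1)
    E, _, _ = np.linalg.svd(V - v_bar[:, None], full_matrices=False)
    return E[:, :k], v_bar

def project(v, E_k, v_bar):
    """Formula (2): u = [e_1, ..., e_k]^T (v - v_bar)."""
    return E_k.T @ (v.reshape(-1) - v_bar)
```

A feature vector sequence is then obtained by applying project to each partial action image of a gesture in turn.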
2.3 Gesture representation for constructing gesture dictionary

Each gesture in the gesture dictionary is represented by a cluster sequence in the eigenspace. Each cluster is formed from plural patterns of a feature vector obtained from variations (discrepancies) of a partial action, in order to cope with such variation. Each cluster is approximated by the k-dimensional normal distribution of Formula (3):

f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]   (3)

Here, \mu is the mean vector of the cluster and \Sigma is the covariance matrix of the distribution of the cluster. Thus, each gesture in the dictionary is represented as a sequence of clusters c, each represented by a k-dimensional normal distribution, as seen in Fig. 2.

Fig. 2: Gesture representation in the dictionary: {c_1, c_2, \ldots} \equiv {f(x; \mu_1, \Sigma_1), f(x; \mu_2, \Sigma_2), \ldots}, where f(x; \mu_1, \Sigma_1) and f(x; \mu_2, \Sigma_2) are the normal distributions of partial actions P_1 and P_2.

2.4 Matching method by a path-searching method in weighted graph

First, Fig. 3 shows the outline of the matching method [8] for one-dimensional feature sequences that uses the string-matching method [11] based on a path-searching method. The graph in Fig. 3 has the one-dimensional sequence X = {A, B, C, A, E, F, G} on the horizontal axis and Y = {A, H, C, I, F, J} on the vertical axis, and a white circle is attached to each intersection where sequence elements match. From each such intersection, an edge is added in the lower-left diagonal direction. A cost is assigned to each edge: the cost is 0 for the added diagonal edges, and 1 for all other edges. The minimum-cost path from node V_00 to V_76 then yields {A, C, F}, the elements corresponding to the white circles on the diagonal edges included in that path, which is the LCS (Longest Common Subsequence) of X and Y; thus the number of its elements, length(LCS(X, Y)), is 3. From this and Formula (4), the degree of similarity Sim(X, Y) is obtained. The method uses Dijkstra's algorithm to obtain the LCS.

Sim(X, Y) := length(LCS(X, Y)) / max(length(X), length(Y))   (4)

Fig. 3: A graph for the matching method of 1D feature sequences (horizontal axis: X = {A, B, C, A, E, F, G}; vertical axis: Y = {A, H, C, I, F, J}; nodes from V_00 to V_76).

Next, the matching method that enables focusing on important partial actions is described. For the cluster sequence in the dictionary and the feature vector sequence of the recognition subject, the individual elements (corresponding to partial actions) are laid out to compose a graph G_PQ as in Fig. 3. Let the cluster sequence P and the feature vector sequence Q be

P := {c_1, c_2, \ldots, c_m}   (5)

Q := {u_1, u_2, \ldots, u_n}   (6)

where m and n are the numbers of elements of P and Q. Which elements match each other is determined by thresholding the Mahalanobis distance defined in Formula (7):

d_i^{c_j} \equiv (u_i - \mu_{c_j})^T \Sigma_{c_j}^{-1} (u_i - \mu_{c_j})   (7)

Here, \mu_{c_j} is the mean vector of the cluster c_j and \Sigma_{c_j} is the covariance matrix of the distribution of the cluster. From this matching result, the diagonal edges are added to G_PQ, and a cost of 1 or 0 is allocated to each edge according to the rules above.

The focus on an important partial action is conducted as follows. Let a cluster c_k in the sequence P correspond to an important partial action. With P laid out on the horizontal axis of G_PQ, the edge costs are changed by Formula (8):

For all j \in {0, 1, \ldots, n}: cost(V_{k-1,j}, V_{k,j}) := m + n   (8)

Here, V_ij denotes a node of the graph. As a result, if Q has an element that is included in c_k, the path is restricted so that the diagonal edge toward that matching node is selected. In the example in Fig. 3, if the fourth element of sequence X, element A, corresponds to an important partial action, the LCS becomes {A, F} and length(LCS(X, Y)) is 2.

From the graph composed above, one can obtain LCS(P, Q) using Dijkstra's algorithm and search the dictionary based on Sim(P, Q). In this study, the creator of the dictionary designates the important partial actions by attaching a flag to the corresponding data.
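Under the definitions above, matching reduces to a shortest-path problem on an edit graph. The sketch below is one possible reading, not the paper's implementation: matches(...) thresholds the squared Mahalanobis distance of Formula (7) (the threshold value is an assumption, since the paper does not state it), and lcs_length(...) runs Dijkstra's algorithm with the cost reweighting of Formula (8), where `important` holds the 1-based positions of flagged clusters c_k.

```python
import heapq
import numpy as np

def matches(u, cluster, thresh=9.0):
    """Formula (7): squared Mahalanobis distance of feature vector u
    to cluster c, thresholded (thresh is an assumed value)."""
    d = u - cluster["mu"]
    return d @ np.linalg.inv(cluster["sigma"]) @ d < thresh

def lcs_length(P, Q, match, important=()):
    """Dijkstra from V_00 to V_mn. Diagonal edges (cost 0) exist where
    Q[j] matches P[i]; other edges cost 1, except that edges consuming
    an important cluster c_k unmatched cost m + n (Formula (8))."""
    m, n = len(P), len(Q)
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]
    done = set()
    while heap:
        c, i, j = heapq.heappop(heap)
        if (i, j) in done:
            continue
        done.add((i, j))
        if (i, j) == (m, n):
            # Each diagonal edge saves 2 over skipping both elements,
            # so the minimum cost is m + n - 2 * |LCS|.
            return max(0, (m + n - c) // 2)
        steps = []
        if i < m:                          # skip P[i] (1-based position i + 1)
            steps.append((i + 1, j, m + n if i + 1 in important else 1))
        if j < n:                          # skip Q[j]
            steps.append((i, j + 1, 1))
        if i < m and j < n and match(Q[j], P[i]):
            steps.append((i + 1, j + 1, 0))  # matching diagonal edge
        for ni, nj, w in steps:
            if (ni, nj) not in done and c + w < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = c + w
                heapq.heappush(heap, (c + w, ni, nj))
    return 0

def sim(P, Q, match, important=()):
    """Formula (4): Sim(P, Q) = length(LCS(P, Q)) / max(|P|, |Q|)."""
    return lcs_length(P, Q, match, important) / max(len(P), len(Q))
```

A dictionary search then evaluates sim for the input sequence Q against each stored cluster sequence P (with its flagged important positions) and returns the word with the highest similarity; if the important partial action cannot be matched, the reweighted edges make the path cost blow up and the similarity collapses toward zero.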
3. Experiment

To verify the efficacy of the method, a recognition test on Japanese sign language words was conducted. The words used in this experiment included "say", "order", "return (home)" and "lend". Of these, "say"/"order" and "return (home)"/"lend" are very similar in action. Two subjects performed each word 30 times, and the optical flow images were obtained to create a gesture dictionary for 16 basic words including these four. The eigenspace was set to 4 dimensions to represent each feature vector of the words. The equipment used for image processing was a personal computer (CPU: Intel(R) Core(TM)2 Duo, 2.20 GHz; system memory: 3.07 GB; OS: Windows XP). The OpenCV library was used for image processing. The resolution for image processing was 320 x 240. The entire upper body of the subject was imaged in front of a blackout curtain. The video rate was 60 fps.

Fig. 4 shows an example of the optical flow image obtained when performing the gesture of "lend"; the extracted optical flow image of the thumb and that of the hand without the thumb are also shown in the figure. The partial action of the thumb is an important feature for distinguishing "lend" from "return (home)". An example of the partial action image of the thumb is shown in Fig. 5; it was generated by combining the same-labeled optical flow images of the thumb.

Fig. 4: Optical flow images when performing a sign language word ("lend"): the whole optical flow image, the extracted optical flow image of the thumb, and the optical flow image of the hand without the thumb.

Fig. 5: Partial action image of the thumb.

For the recognition experiment, a total of four people, including the two subjects who provided data for the gesture dictionary, performed each word 20 times. Table 1 shows the recognition rates of the four words and the average recognition rate over the 16 basic words using the proposed method. Table 2 shows the recognition rates when important features were not focused. In the tables, A denotes the group of subjects who created the dictionary and B denotes the other group.

Table 1: Recognition rates with the proposed method

              A        B
"say"       69.0%    63.5%
"order"     70.5%    65.0%
"return"    77.0%    70.5%
"lend"      76.0%    67.5%
Average     79.5%    68.5%

Table 2: Recognition rates without important feature focusing

              A        B
"say"       58.0%    51.0%
"order"     60.5%    49.0%
"return"    62.5%    52.5%
"lend"      64.5%    57.0%
Average     76.5%    65.5%

4. Discussion

The results of the experiment show that the recognition rate with important features focused was greater than that without focusing; thus, the efficacy of the method was confirmed. Furthermore, the following points were clarified.

The recognition rates of "say" and "order" were lower than those of the other words. The difference between these two words is the difference in the action near the starting location; since the variance of the starting location was large, the matching of this important partial action often failed. Improving robustness against location shift is therefore needed in future work.

The recognition rate of Group B was lower than that of Group A because of differences in individual "habits". For this reason, it will be necessary to obtain gesture patterns from a large number of people and improve the accuracy of the partial action cluster models. For such a model, a distribution expression of "habits" by a Gaussian Mixture Model can be used.

The overall recognition rates are lower than those in previous studies of sign language word recognition. This is because the study did not use the shapes of the hands and fingers, which are important pieces of information. However, the purpose of this study was to make it possible to recognize gestures that have important local actions; in this respect, the author believes that the purpose has been achieved.

5. Conclusion

For the identification of similar gestures, this paper proposed a gesture recognition method with a focus on important local actions.
The method generates a partial action sequence by using optical flow images, expresses the sequence in the eigenspace, and checks the feature vector sequence by applying an optimum path-searching method on a weighted graph. The paper also showed the efficacy of the method through a recognition test on similar sign language words. In addition to improving robustness against location shift, the author plans to apply the method to sign language and gestures that include body parts other than the hands (shaking of the head, etc.). Future work will also involve further verification of the method with various sign language words.

Acknowledgments

This work was partially supported by a Grant-in-Aid for Scientific Research (C) 20500856 from the Japan Society for the Promotion of Science.

References

[1] R. Yang and S. Sarkar, "Gesture Recognition Using Hidden Markov Models from Fragmented Observations", in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, Vol. 1, pp. 766-773.
[2] C. Vogler and D. Metaxas, "Handshapes and Movements: Multiple-Channel American Sign Language Recognition", Lecture Notes in Computer Science, Vol. 2915, 2004, pp. 247-258.
[3] T. Starner and A. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models", in International Workshop on Automatic Face and Gesture Recognition, 1995, pp. 189-194.
[4] J. F. Lichtenauer, E. A. Hendriks and M. J. T. Reinders, "Sign Language Recognition by Combining Statistical DTW and Independent Classification", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 11, 2008, pp. 2040-2046.
[5] J. Alon, V. Athitsos et al., "Simultaneous Localization and Recognition of Dynamic Hand Gestures", in IEEE Workshop on Motion and Video Computing, 2005, Vol. 2, pp. 254-260.
[6] Q. Munib, M. Habeeb et al., "American Sign Language (ASL) Recognition Based on Hough Transform and Neural Networks", Expert Systems with Applications, Vol. 32, 2007, pp. 24-37.
[7] M. V. Lamar, M. S. Bhuiyan and A. Iwata, "Hand Gesture Recognition Using T-CombNET: A New Neural Network Model", IEICE Transactions, Vol. E83-D, No. 11, 2000, pp. 1986-1995.
[8] K. Koara, A. Nishikawa et al., "Gesture Recognition Based on 1-Dimensional Encoding of Motion Changes", in the 10th International Conference on Advanced Robotics, 2001, pp. 639-644.
[9] T. B. Moeslund and E. Granum, "A Survey of Computer Vision-Based Human Motion Capture", Computer Vision and Image Understanding, Vol. 81, 2001, pp. 231-268.
[10] A. Bobick and J. Davis, "An Appearance-Based Representation of Action", in the 13th International Conference on Pattern Recognition, 1996, Vol. 1, pp. 307-312.
[11] E. W. Myers, "An O(ND) Difference Algorithm and Its Variations", Algorithmica, Vol. 1, No. 2, 1986, pp. 251-266.

Kazumoto Tanaka is an associate professor in the Faculty of Engineering, Kinki University, Japan. He received a B.S. degree from Chiba University in 1981 and a Ph.D. degree from Tokushima University in 2002. In 1981 he joined Mazda Motor Corporation, where he was engaged in research and development of robot vision technology. His current interest is mainly gaming technology. He is a member of the Japanese Society for Information and Systems in Education and the Institute of Electronics, Information and Communication Engineers.