WAVELET PACKET DECOMPOSITION AND ARTIFICIAL NEURAL NETWORKS BASED RECOGNITION OF SPOKEN DIGITS

SONIA SUNNY1*, DAVID PETER S.2, POULOSE JACOB K.3
1Department of Computer Science, Cochin University of Science & Technology, Kochi, India
2School of Engineering, Cochin University of Science & Technology, Kochi, India
3Department of Computer Science, Cochin University of Science & Technology, Kochi, India
* Corresponding Author : sonia.deepak@yahoo.co.in

Received : 06-11-2011     Accepted : 09-12-2011     Published : 12-12-2011
Volume : 3     Issue : 4       Pages : 318 - 321
Int J Mach Intell 3.4 (2011):318-321
DOI : http://dx.doi.org/10.9735/0975-2927.3.4.318-321

Conflict of Interest : None declared

Cite - MLA : SONIA SUNNY, et al "WAVELET PACKET DECOMPOSITION AND ARTIFICIAL NEURAL NETWORKS BASED RECOGNITION OF SPOKEN DIGITS." International Journal of Machine Intelligence 3.4 (2011):318-321. http://dx.doi.org/10.9735/0975-2927.3.4.318-321

Cite - APA : SONIA SUNNY, DAVID PETER S., POULOSE JACOB K. (2011). WAVELET PACKET DECOMPOSITION AND ARTIFICIAL NEURAL NETWORKS BASED RECOGNITION OF SPOKEN DIGITS. International Journal of Machine Intelligence, 3 (4), 318-321. http://dx.doi.org/10.9735/0975-2927.3.4.318-321

Cite - Chicago : SONIA SUNNY, DAVID PETER S., and POULOSE JACOB K. "WAVELET PACKET DECOMPOSITION AND ARTIFICIAL NEURAL NETWORKS BASED RECOGNITION OF SPOKEN DIGITS." International Journal of Machine Intelligence 3, no. 4 (2011):318-321. http://dx.doi.org/10.9735/0975-2927.3.4.318-321

Copyright : © 2011, SONIA SUNNY, et al, Published by Bioinfo Publications. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

Abstract

This paper introduces an efficient method for recognizing spoken digits using a combination of Wavelet Packet Decomposition (WPD) and an Artificial Neural Network (ANN) classifier. Speech recognition is a fascinating application of Digital Signal Processing, and a lot of research has been done on speech recognition for different languages. Digits in Malayalam, one of the four major Dravidian languages of Southern India, are used to create the database. Wavelet Packet Decomposition is used for feature extraction in the time-frequency domain. Training, testing and pattern recognition are performed using the ANN. Because of their multi-resolution characteristics and efficient time-frequency localization, wavelets are well suited to processing non-stationary signals such as speech. ANNs are used in this work because of their parallel distributed processing, distributed memory, error stability, and ability to learn and distinguish patterns. The experimental results show the effectiveness of this hybrid architecture in recognizing speech.

Keywords

Speech Recognition, Digits database, Feature Extraction, Wavelet Packet Decomposition, Classification, Artificial Neural Networks.

Introduction

Speech recognition is an area of intensive research because of its versatile applications, but recognizing speech is a very difficult task because of the variability in the way people speak. Since speech is the primary means of communication between people, research in automatic speech recognition and speech synthesis by machine has attracted a great deal of attention over the past five decades [1]. Automatic recognition of spoken digits is one of the challenging tasks in the field of speech recognition [2]. Spoken digit recognition is needed in many applications that take numbers as input, such as automated banking systems, airline reservations, voice-dialing telephones and automatic data entry [3].
Recognition accuracy is an important measure of the performance of a speech recognition system, and many parameters affect it. Recent technological advances have brought much progress in the recognition of complex speech patterns, but much more research and development is still needed in this field. In this work, the speech recognition process is divided into three stages: creating the database, feature extraction and classification. Among these stages, feature extraction is key, because better features improve the recognition rate. The paper is organized as follows. The spoken digits database is explained in section 2. Section 3 reviews the theory of feature extraction, followed by the concepts of wavelet packet decomposition employed during this stage. Section 4 describes the classification stage using artificial neural networks. Section 5 presents the detailed analysis of the experiments done and the results obtained. Conclusions are given in the last section.

Digits Database for Malayalam

A spoken digit database is created for the Malayalam language using 45 speakers: twenty male and twenty-five female. The samples stored in the database are recorded using a high-quality studio-recording microphone at a sampling rate of 8 kHz (band-limited to 4 kHz). Recognition is performed on the ten Malayalam digits from 0 to 9 under the same configuration. Our database consists of a total of 450 utterances of the digits. The spoken digits are preprocessed, numbered and stored in the appropriate classes in the database. The spoken digits and their International Phonetic Alphabet (IPA) format are shown in [Table-1].

Feature Extraction

Feature extraction plays a vital role in any speech recognition process because good features increase the recognition rate. During feature extraction, the relevant features, called feature vectors, are extracted from the input signals for further processing. Researchers have experimented with many different methods for speech recognition. Most speech-based studies rely on Fourier Transforms (FTs), Short Time Fourier Transforms (STFTs), Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding (LPC), and prosodic parameters. The literature reveals that, for these parameters, the feature vector dimensions and computational complexity are comparatively high. Moreover, many of these methods assume that the signal is stationary within a given time frame, so it is difficult to analyze localized events correctly.

The Wavelet Packet Decomposition

Wavelet Packet Decomposition is a relatively recent and computationally efficient technique for extracting information from non-stationary signals such as audio. It uses finite-duration wavelets instead of periodic sinusoidal waves and has become a very useful tool for describing non-stationary processes. WPD allows time-domain information to be combined with frequency-domain information using multiple window durations: long time intervals are used where fine frequency resolution is needed, that is, for low-frequency information, and short time intervals where fine time resolution is needed, that is, for high-frequency information [4].
Wavelet packet decomposition is a more detailed method than the wavelet transform. In WPD, the original signal passes through two complementary filters, a low-pass and a high-pass filter, and emerges as two signals called the approximation coefficients and the detail coefficients [5]. Wavelet packet decomposition is based on the wavelet transform and decomposes a signal into frequency bands of equal width [6]. The wavelet decomposition procedure is shown in [Fig-1].
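The single filtering step described above can be sketched as follows. This is a minimal illustration only: it uses the two-tap Haar filters for brevity, whereas this work uses db4, and the function name `split_once` and the padding rule for odd-length input are assumptions made for the example.

```python
import numpy as np

# Haar analysis filters (an assumption for brevity; the paper uses db4).
LOW_PASS = np.array([1.0, 1.0]) / np.sqrt(2.0)
HIGH_PASS = np.array([1.0, -1.0]) / np.sqrt(2.0)


def split_once(signal):
    """Filter a 1-D signal with both filters, then keep every second sample."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        # Assumed padding rule: repeat the last sample to get an even length.
        x = np.append(x, x[-1])
    approx = np.convolve(x, LOW_PASS)[1::2]    # approximation coefficients
    detail = np.convolve(x, HIGH_PASS)[1::2]   # detail coefficients
    return approx, detail


approx, detail = split_once([4.0, 6.0, 10.0, 12.0])
```

Each output has half the length of the input, and because the Haar filters are orthonormal the two outputs together carry the same energy as the input.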
At the next level, the low-frequency sub-bands are decomposed into lower and higher frequency parts. Unlike in the ordinary wavelet transform, the high-frequency sub-bands are also decomposed into lower and higher frequency parts. The same decomposition is continued until the desired level is reached. Wavelet packet decomposition thus provides a multi-level time-frequency decomposition and more precise information about the signal. The WPD decomposition tree is shown in [Fig-2].
Using wavelets reduces the computational complexity, since the feature vector is much smaller than with the other methods. Wavelets also provide good time and frequency localization.
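The recursive splitting of both the low- and high-frequency sub-bands can be sketched as below. This is not the implementation used in this work: it again assumes Haar filters in place of db4, zero-pads the signal so that its length divides evenly, and returns the leaf nodes in natural (tree) order, not sorted by frequency. The names `haar_split` and `wavelet_packet_leaves` are hypothetical.

```python
import numpy as np


def haar_split(x):
    """One analysis step: return (low band, high band), each half as long."""
    pairs = x.reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return low, high


def wavelet_packet_leaves(signal, level):
    """Split every sub-band at every level; return the 2**level leaf nodes."""
    x = np.asarray(signal, dtype=float)
    pad = (-len(x)) % (2 ** level)     # assumed: zero-pad to a multiple of 2**level
    x = np.pad(x, (0, pad))
    nodes = [x]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])   # both halves are split again
        nodes = next_nodes
    return nodes                             # natural (tree) order


leaves = wavelet_packet_leaves(np.arange(16.0), level=3)
print(len(leaves), [len(leaf) for leaf in leaves])
```

For practical work with db4 filters, a wavelet library such as PyWavelets, which provides a wavelet packet class, would be the more natural choice; the sketch only makes the tree structure explicit.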

Classification

The classification stage is first trained on known patterns and then assigns an unknown pattern to a class on the basis of similarity measures computed against what it has learned. Speech recognition is basically a pattern recognition problem, and pattern recognition is becoming increasingly important in the age of automation and information handling and retrieval. Since neural networks are good at pattern recognition, many early researchers applied them to speech pattern recognition, and this study also uses neural networks as the classifier. Neural networks handle incomplete data and variability well. Artificial neural networks are well suited to speech recognition because of their fault tolerance and non-linearity.

Artificial Neural Networks

Artificial neural networks have been investigated for many years in the hope that machines can recognize speech in a way similar to human beings. A neural network is a massively parallel distributed processor made up of simple processing units. It can store experiential knowledge and make it available for use. Inspired by the structure of the brain, a neural network consists of a set of highly interconnected entities, called nodes, designed to mimic their biological counterparts, the neurons. Each neuron accepts a weighted set of inputs and produces an output [7]. Neural networks have become a very important method for pattern recognition because of their ability to deal with uncertain, fuzzy, or insufficient data. An ANN is an adaptive system that changes its structure based on external or internal information that flows through the network [8]. Algorithms based on neural networks are well suited to speech recognition tasks.
Multi Layer Perceptron (MLP) networks are well known for their adaptive learning property and have been proven to be universal approximators. This work therefore uses the MLP architecture, which consists of an input layer, one or more hidden layers, and an output layer, trained with the back propagation algorithm. In this type of network, the input is presented to the network and moves through the weights and nonlinear activation functions towards the output layer, and the error is propagated backwards to correct the weights. After extensive training, the network establishes the input-output relationships through the adjusted weights. After training, the network is tested with the dataset reserved for testing. The structure of an MLP network is given in [Fig-3].
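The forward pass and the backward error correction described above can be illustrated with a small NumPy sketch. The paper does not state the layer sizes, activation functions, loss or learning rate, so everything here is an assumption for illustration: one hidden layer with sigmoid units, a softmax output with cross-entropy loss, and plain full-batch gradient descent. The class name `TinyMLP` is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


class TinyMLP:
    """One hidden layer; sizes, activations and learning rate are assumed."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)        # hidden activations
        self.p = softmax(self.h @ self.W2 + self.b2)   # class probabilities
        return self.p

    def train_step(self, X, y_onehot, lr=0.5):
        p = self.forward(X)
        n = X.shape[0]
        loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
        # Backward pass: the output error is propagated towards the input.
        d_out = (p - y_onehot) / n
        d_hid = (d_out @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= lr * (self.h.T @ d_out)
        self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * (X.T @ d_hid)
        self.b1 -= lr * d_hid.sum(axis=0)
        return loss

    def predict(self, X):
        return self.forward(X).argmax(axis=1)


# Toy data (not speech features): the class equals the first input coordinate.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
net = TinyMLP(n_in=2, n_hidden=4, n_out=2)
for _ in range(2000):
    net.train_step(X, np.eye(2)[labels])
```

For the digit task the input size would be the length of the WPD feature vector and the output layer would have ten units, one per digit.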

Experiments and Results

The choice of wavelet and the number of decomposition levels play an important role in obtaining good recognition accuracy with WPD. Among the various wavelet bases, the Daubechies wavelets, the most popular wavelets and a foundation of wavelet-based Digital Signal Processing, are used here because of their orthogonality and efficient filter implementation [9]. Recent research has pointed out that the Daubechies order-4 (db4) wavelet is an appropriate basis for speech recognition, so db4 is used here as the mother wavelet for feature extraction. Daubechies wavelets are also called Maxflat wavelets, as their frequency responses have maximum flatness at frequencies 0 and π. The speech samples in the database are successively decomposed into approximation and detail coefficients. Each speech signal is decomposed up to 8 levels using WPD to obtain the feature vector coefficients.
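As a quick check on what an 8-level decomposition implies, the number of terminal sub-bands and their nominal width can be computed from the sampling rate. This is an idealized calculation that assumes the sub-bands tile the range from 0 Hz to the Nyquist frequency uniformly; real wavelet filters overlap at the band edges. The function name `wpd_band_layout` is hypothetical.

```python
def wpd_band_layout(sampling_rate_hz, level):
    """Return (number of leaf sub-bands, nominal width of each in Hz)."""
    nyquist_hz = sampling_rate_hz / 2.0
    n_bands = 2 ** level               # every node splits in two at each level
    return n_bands, nyquist_hz / n_bands


n_bands, width_hz = wpd_band_layout(8000, 8)
print(n_bands, width_hz)
```

For the 8 kHz recordings used here, level 8 gives 256 sub-bands, each nominally 15.625 Hz wide.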
The feature vectors obtained from wavelet packet decomposition are given as the input to the artificial neural network classifier. The database is divided into three parts: 70% of the data is used for training, 15% for validation and 15% for testing. The MLP architecture is used for classification. Using this network, the classifier could successfully recognize the spoken digits. After testing, the accuracy for each spoken digit is obtained. The results clearly show the efficiency of neural networks in classifying the extracted coefficients.
The results obtained using WPD and ANN are given below. The original signal and the coefficient values after decomposition of digits 0 to 9 at level 8 are shown in [Fig-4], and the performance analysis based on error percentage is given in [Table-2].
With wavelet packet decomposition as the feature extraction tool and an artificial neural network as the classifier, the overall recognition accuracy obtained is 80%.
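A 70/15/15 partition of the 450 utterances can be sketched as follows. The paper does not say how the utterances were assigned to the three sets, so the random shuffle, the fixed seed and the name `split_indices` are assumptions; a split grouped by speaker or by digit would be an equally plausible reading.

```python
import numpy as np


def split_indices(n_samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle sample indices and cut them into train, validation and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = n_samples * train_pct // 100     # integer arithmetic, no rounding
    n_val = n_samples * val_pct // 100
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]             # whatever remains
    return train, val, test


train_idx, val_idx, test_idx = split_indices(450)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because 15% of 450 is not a whole number, the validation and test sets cannot both be exactly 15%; with the integer arithmetic used here they hold 67 and 68 utterances, with 315 left for training.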
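An error percentage per digit and an overall accuracy of the kind reported here can be computed from the true and predicted labels of the test set. The sketch below is only an illustration of that calculation: the function name `per_digit_error` is hypothetical, and the labels in the usage example are made-up toy values, not results from this work.

```python
import numpy as np


def per_digit_error(y_true, y_pred, n_classes=10):
    """Return ({digit: error %}, overall accuracy %) for labelled test data."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    errors = {}
    for digit in range(n_classes):
        mask = y_true == digit
        if mask.any():                 # skip digits absent from the test set
            errors[digit] = 100.0 * float(np.mean(y_pred[mask] != digit))
    overall = 100.0 * float(np.mean(y_pred == y_true))
    return errors, overall


# Toy labels for illustration only.
errors, overall = per_digit_error([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
```

The overall accuracy is the fraction of all test utterances classified correctly, so it equals 100 minus the average of the per-digit error percentages only when every digit has the same number of test utterances.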

Conclusion

In this paper, an automatic speech recognition system is designed for spoken digits in Malayalam using a hybrid combination of WPD and ANN. The computational complexity and feature vector size are reduced to a great extent by using wavelet packet decomposition, and an overall recognition accuracy of 80% is obtained. This experiment used a limited number of samples; the recognition rate may be improved by increasing the number of samples. The experimental results show that this hybrid architecture using wavelet packet decomposition and neural networks can effectively extract features from the speech signal for automatic speech recognition.

References

[1] Rabiner L. and Juang B.H. (1993) Prentice-Hall, Englewood Cliffs, NJ.  

[2] Ajami Alotaibi Y. (2005) Information Sciences, Vol 173, 115-139.  

[3] Cini K. and Kannan B. (2009) World Congress on Nature and Biologically Inspired Computing, 1475-1479.  

[4] Ting W., Guo-zheng Y., Banghua Y. and Hong S. (2008) Measurement, 41(6), 618-625.  

[5] Chan Woo S., Peng Lin C. and Osman R. (2001) Proceedings of International Symposium on Intelligent Multimedia, Video and Speech processing, 413-416.  

[6] Electronics Industrial Press, Beijing.  

[7] Freeman J.A. and Skapura D.M. (2006) Pearson Education.  

[8] Vimal Krishnan V.R., Jayakumar A. and Babu Anto P. (2008) 4th IEEE International Symposium on Electronic Design, Test and Applications, 240-243.

[9] Hu Dingyin, Li Wei and Chen Xi (2011) Proceedings of the 2011 IEEE International Conference on Complex Medical Engineering, 694-697.  

Images
Fig. 1- Wavelet Decomposition
Fig. 2- WPD decomposition Tree
Fig. 3- Structure of an MLP network
Fig. 4- Level 8 decomposition of digits 0-9 using WPD
Table 1- Numbers stored in the Database and their IPA Format:
Table 2- Performance Analysis Based On Error Percentage