International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056, p-ISSN: 2395-0072 | Volume: 06 Issue: 05 | May 2019 | www.irjet.net

SPEECH RECOGNITION USING CONVOLUTIONAL NEURAL NETWORK

Nidhi A. Kulkarni1, Satish P. Deshpande2
1Student, Dept. of Electronics & Communication, KLSGIT College Belgaum, Karnataka, India
2Assistant Professor, Dept. of Electronics & Communication, KLSGIT College Belgaum, Karnataka, India

Abstract - Language is one of the most important means of communication, and speech is its main medium: it is how two people most naturally communicate. In the present era, speech technologies are widely used because their applications are nearly unlimited. Speech recognition, as a man-machine interface, plays a vital role in the field of artificial intelligence, where accuracy remains a major challenge. This paper reviews the speech recognition process, its basic models, and its applications, and discusses different techniques and approaches to speech recognition using the Convolutional Neural Network (CNN). It also summarizes some of the well-known methods used in the various stages of the speech recognition process. A convolutional neural network is one that reduces spectral variations and models the spectral correlations present in speech signals. The main objective of this review is to bring to light the progress made in CNN-based speech recognition across different languages and from the technological viewpoint of different countries.

Key Words: Speech Recognition, Man-Machine, Artificial Intelligence, Convolutional Neural Network, Spectral Variation.
1. INTRODUCTION

Speech is one of the most important media of communication between a person and his environment. A speech recognition system translates spoken words into text; the text can be in the form of words, sequences of words, or syllables. Speech recognition with a convolutional neural network is a challenging task: human speech signals are highly variable due to different speaker attributes and speaking styles, and may include environmental noise, which is why a convolutional neural network is used. In speech recognition tasks, many parameters affect the accuracy of recognition: continuous versus discrete word recognition, vocabulary size, environmental conditions, the models used, and so on. Further problems arise because one speaker may pronounce the same word in different ways, and the same word uttered by two different speakers may also vary. Resolving these problems is the main step towards the aim. The main advantage of speech recognition is that a person can save time by speaking to a device directly rather than typing.
1.1 CONVOLUTIONAL NEURAL NETWORK

A convolutional neural network is a class of deep neural network that requires little preprocessing: the network learns its filters rather than having them hand-engineered before the real classification. It consists of one or more layers and can do a great deal when fed a collection of signals as input. Since it is difficult for a computer system to consider a whole signal at once, a convolutional neural network operates on parts of the signal instead of the entire signal. A CNN is a type of neural network in which the input variables are spatially related to each other; CNNs were developed specifically to take these spatial positions into account.

Filtering is the main mathematics behind the matching. Matching is done by lining a feature up with a patch of the signal: the corresponding values are compared and multiplied one by one, the products are added, and the sum is divided by the total number of values. This step is repeated at every position. The act of convolving a signal with a bank of filters, or features, is called the convolutional layer: in convolution, one signal becomes a stack of filtered signals, one per filter.

Pooling is another main building block of a CNN. Its function is to reduce the number of parameters and the amount of computation used in the network. The pooling layer operates on each feature map independently; the most common approach is max pooling. Once pooling is done, the next step is normalization: any negative value is replaced with zero, and this is applied to each and every filtered signal. The last step is to stack these layers so that the present output becomes the input for the next; the final layer is the fully connected layer.
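The match-by-filtering, replace-negatives-with-zero, and max-pooling steps described above can be sketched in a few lines of NumPy. This is a one-dimensional toy illustration of the operations only, not the code of any system surveyed here; the example signal and feature values are invented.

```python
import numpy as np

def convolve(signal, kernel):
    """Slide the feature over the signal: multiply element-wise,
    add the products, and divide by the number of values
    (the averaging step described in the text)."""
    n = len(kernel)
    return np.array([np.dot(signal[i:i + n], kernel) / n
                     for i in range(len(signal) - n + 1)])

def relu(x):
    """Normalization step: negative values are replaced with zeros."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep the maximum of each non-overlapping window."""
    trimmed = x[:len(x) - len(x) % size]
    return trimmed.reshape(-1, size).max(axis=1)

# One filtered signal passing through the whole stack of layers.
signal = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0])
feature = np.array([1.0, -1.0, 1.0])
out = max_pool(relu(convolve(signal, feature)))  # -> [1. 1. 1.]
```

A perfect match yields 1.0 at that position; mismatches give smaller or negative scores, which the rectification step then zeroes out before pooling shrinks the map.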
Figure 1.1: CNN structure

2. LITERATURE SURVEY

D. Nagajyothi and P. Siddaiah present speech recognition using a convolutional neural network, with a system developed natively for the Telugu language. The database was created from frequently asked questions in an airport inquiry, and a CNN was used for training and testing on it. The approach gave the best results because a CNN benefits from weight sharing, local connectivity, and pooling; experiments on speech signals showed an improvement in performance compared to traditional systems. In this system, tanh is used as the activation function for the first layer, while the rectified linear unit is used for the next two layers. The input data is normalized to increase the learning speed, and the network structure is improved by using max pooling in the last layer. They concluded that the CNN-based system achieves better performance than a conventional system.

Akhilesh Halageri, Amrita Bidappa, Arjun C, Madan Mukund Sarathy and Shabana Sultana state that speech recognition usually consists of the following steps: first the sound waves are captured and digitized, then they are converted into phonemes. The main purpose of their paper was to review the pattern-matching abilities of neural networks on speech signals. In feed-forward networks, the activation function is used essentially to stabilize the output layer, whereas this mechanism is not present in recurrent networks. They propose a system that uses learning algorithms to learn the features without any other assumptions, with neural networks increasing the computational power of the proposed system.
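The layer-wise activation scheme reported above (normalized inputs, tanh in the first layer, rectified linear units in the following layers) can be shown schematically. The layer sizes and random weights below are placeholders for illustration, not the authors' trained Telugu network.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Input normalization, reported to speed up learning.
    return (x - x.mean()) / (x.std() + 1e-8)

def forward(x, weights):
    # First layer: tanh. Remaining layers: rectified linear unit.
    h = np.tanh(weights[0] @ x)
    for w in weights[1:]:
        h = np.maximum(w @ h, 0.0)
    return h

x = normalize(rng.normal(size=16))
weights = [rng.normal(size=(8, 16)),   # tanh layer
           rng.normal(size=(8, 8)),    # ReLU layer
           rng.normal(size=(4, 8))]    # ReLU layer
out = forward(x, weights)              # shape (4,), all values >= 0
```

The point of the sketch is only the ordering of the activation choices; real layer widths, convolutional structure, and trained weights would come from the dataset.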
Ying Zhang et al. proposed an end-to-end speech framework inspired by the advantages of both the CNN and CTC approaches, combining the two without using any recurrent connections. Evaluating this approach on the TIMIT phoneme recognition task, they showed that the proposed model is not only computationally efficient but also competitive with baseline systems, and that in a CNN of sufficient depth the higher-layer features have the capacity to capture suitable temporal dependencies. The model consists of 10 convolutional layers and 3 fully connected hidden layers. The first convolutional layer is followed by a pooling layer of size 3x1, which means pooling is applied only over the frequency axis; a filter size of 3x5 is used across the layers. Maxout with 2 linear pieces is used as the activation function, and the model is optimized with the Adam optimizer at a learning rate of 10^-4. The model achieved an 18.2% error rate on the core test set, better than the LSTM baseline, and the CNN model takes less time to train than the LSTM model.

Akhil Thomas presents a survey of a feature extraction technique for speaker recognition using Mel-frequency cepstral coefficients (MFCC), and evaluates experimental results from each step of the MFCC process. The MFCC coefficients are retrieved using the DCT, and the DCT is computed with the CORDIC algorithm. One disadvantage of this system is that performance degrades in the presence of noise; the system gives its most accurate results in the environment where it was actually trained, and its performance increases with the number of training iterations. The accuracy of the detected speech is very high because the MFCC feature extraction technique is used. The advantage of the CORDIC algorithm is that it reduces the number of gates and thus also decreases the gate delay.
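CORDIC saves gates because it computes rotations, and hence the cosines a DCT needs, using only shifts, adds, and one constant scaling. A floating-point sketch of the rotation mode follows; it is illustrative only, since a hardware implementation would use fixed-point arithmetic where the 2^-i factors really are bit shifts.

```python
import math

def cordic(theta, iterations=32):
    """Rotate the vector (1, 0) toward angle theta (in radians,
    |theta| <~ 1.74) using only shift-and-add style updates.
    Returns approximately (cos(theta), sin(theta))."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of the micro-rotations; divided out at the end.
    k = 1.0
    for a in angles:
        k *= math.cos(a)
    x, y, z = 1.0, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0        # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * k, y * k

c, s = cordic(0.5)  # close to (cos 0.5, sin 0.5)
```

Each iteration needs only a sign test, two shifted additions, and one table lookup, which is the gate saving the survey refers to; with 32 iterations the result agrees with the library cosine to better than 1e-6.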
Du Guiming, Wang Xia, Wang Guangyan, Zhang Yan and Li Dan present a brief survey of the convolutional neural network trained with backpropagation. In their experiments they use, as training data, a set of speech recordings made personally by a group of speakers. The experimental results show that a CNN can implement word recognition efficiently and with less complexity. The backpropagation algorithm is used to adjust and maintain the parameters of the network: input signals are passed through the input layer and the middle layer, and the result emerges from the output layer; if the output does not equal the desired output, an error signal results, and the backpropagation algorithm adjusts the parameters accordingly. The network is trained on the voices of 30 people and then tested on five other people's voices. Lastly, they compared CNN with DNN and concluded that the CNN greatly reduces the complexity of the system.

Narendra D. Londhe, Ghanshyam B. Kshirsagar, and Hitesh Tekchandani proposed a Deep Convolutional Neural Network (DCNN) based ASR system for the Chhattisgarhi dialect, because speaker-dependent speech recognition systems built with conventional machine learning techniques were incapable of handling the spectral variations and spectral correlations of acoustic signals; the DCNN handles both efficiently and with a smaller computational burden. For the experiment, a self-recorded dataset acquired from 170 subjects was used for word recognition. They used 8 layers of convolution, pooling, and fully connected layers, respectively, with the rectified linear unit as the activation function. Five-fold cross-validation was used for training and testing of the CNN model: the data was partitioned in 4:1 form, i.e. four parts were used for training and the remaining part for testing. The implemented algorithm achieved 99.49% accuracy, and the proposed DCNN model gave good results compared to conventional techniques.
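The 5-fold protocol described above, a 4:1 train/test partition per fold, can be sketched as follows; the dataset size and seed are illustrative placeholders, not details of the Chhattisgarhi experiment.

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Yield five (train_idx, test_idx) pairs. Each fold holds out a
    different fifth of the shuffled data, giving the 4:1 partition."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test

# Every sample is tested exactly once across the five folds.
splits = list(five_fold_splits(100))
```

Accuracy would then be averaged over the five held-out folds, which is how a single figure such as 99.49% is reported.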
Xuejiao Li and Zixuan Zhou aimed to build an accurate, low-latency speech command recognition system capable of detecting predefined words. The speech command dataset provided by Google's TensorFlow and AIY teams, consisting of 65,000 WAVE audio files of thirty different spoken words, is used for training and testing. Three models are compared: a vanilla single-layer softmax model, a deep neural network, and a convolutional neural network; the CNN outperforms the other two and achieves an accuracy of 95.1%. The first observation was that the vanilla model is not a good one, as it achieved only about 56% accuracy. The experiments show that the CNN is more effective than the DNN and the vanilla model, giving 18.6% relative improvement over the DNN and 72.3% over the vanilla model on precision. A simple 2-layer ConvLayer CNN outperforms the vanilla model and the DNN, achieving 31.43% and 66.67% relative improvement in test accuracy with regard to the DNN and vanilla models, and 82% and 94.6% in loss.
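The vanilla baseline in this comparison is a single softmax layer. A minimal sketch of softmax classification and of the accuracy metric used to compare the models (the logits and labels below are invented, not the authors' data):

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability, then normalize.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy(logits, labels):
    # Predicted class = highest-probability column of the softmax output.
    return float(np.mean(softmax(logits).argmax(axis=-1) == labels))

logits = np.array([[2.0, 0.5, 0.1],   # predicted class 0
                   [0.2, 3.0, 0.4]])  # predicted class 1
labels = np.array([0, 1])
acc = accuracy(logits, labels)  # 1.0 on this toy batch
```

Such a single linear-plus-softmax layer has no notion of local time-frequency structure, which is consistent with its weak 56% showing against the convolutional models.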
Jui-Ting Huang, Jinyu Li, and Yifan Gong aimed to provide a detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, they showed that edge detectors along various directions can be learned automatically. The CNN provides advantages over fully connected layers in four domains: channel-mismatched training/test conditions, noise robustness, distant speech recognition, and low-footprint models. A CNN structure combined with maxout units gave a relative 9.3% word error rate reduction (WERR). All experiments are performed under the context-dependent deep neural network framework, where a CNN or DNN is used to classify the input features into classes. The CNN architecture uses one convolutional layer followed by one max-pooling layer and four fully connected layers. About 1000 hours of audio recorded by the Kinect device is used as training data, while the test data consists of 18,683 utterances recorded at 1, 2, or 3 meters from Kinect devices. A CNN with random initialization can learn various sets of edge detectors to extract low-level features, and for distant speech recognition a CNN trained on 1000 hours of Kinect distant speech data obtains a relative 4% word error rate reduction.
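Maxout units, as used in the architecture above, output the maximum over a small group of linear units; paired with that is the relative WERR metric quoted throughout this survey. A minimal sketch of both (the grouping layout along the last axis is an assumption made for illustration):

```python
import numpy as np

def maxout(x, pieces=2):
    """Maxout activation: take the max over groups of `pieces`
    consecutive linear units along the last axis."""
    shape = x.shape[:-1] + (x.shape[-1] // pieces, pieces)
    return x.reshape(shape).max(axis=-1)

def relative_werr(wer_baseline, wer_new):
    # Relative word-error-rate reduction, e.g. 10% -> 9.07% is ~9.3%.
    return (wer_baseline - wer_new) / wer_baseline

halved = maxout(np.array([1.0, 3.0, -2.0, 0.0]))  # -> [3. 0.]
```

Unlike ReLU, maxout learns its own piecewise-linear shape from the competing linear pieces, which is what makes it attractive in combination with convolutional front ends.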
Ossama Abdel-Hamid et al. conducted experiments evaluating the effectiveness of CNNs on two speech recognition tasks: small-scale phone recognition on TIMIT and a large-vocabulary voice search (VS) task. The results show that the CNN reduces the error rate by 6%-10% compared with DNNs on both TIMIT phone recognition and the large-vocabulary voice search task. The paper also discusses how to organize speech feature vectors into feature maps suitable for CNN processing. The building block of the CNN consists of a pair of hidden plies, a convolutional ply and a pooling ply, and the input layer consists of localized features organized as feature maps. A hybrid ANN-HMM framework with a softmax output layer is used to compute the posterior probabilities of all HMM states, and the likelihoods of the HMM states are passed through a Viterbi decoder to recognize the continuous stream of speech. Finally, pretraining the CNN with convolutional RBMs gives better performance in large-vocabulary voice search than in the phone recognition experiment.

William Song and Jim Cai implemented an end-to-end deep learning system that uses mel-filter features to output spoken phonemes directly, without a traditional hidden Markov model for decoding. The TIMIT dataset, which comprises 630 speakers with 6300 utterances in 8 dialects, is used for training and testing. They implemented the CNN model using the Caffe framework and obtained an error rate of 22.1% with it; the decoded phone sequence achieved an error rate of 29.4%. An end-to-end system is obtained by replacing the HMM with an RNN and CTC: the traditional HMM is replaced with a network capable of outputting sequential labels for the input data. Training is done with the CNN model, and the last convolutional layer is then fed into the CTC network for training so that actual phone sequences are predicted; in other words, the CNN is trained first and its parameters are then frozen while the network is fine-tuned for the other task. Lastly, they concluded that the CNN model performs better than the CTC model.

3. SUMMARY

Speech is one of the primary means of communication between two humans, and speech recognition using convolutional neural networks is an emerging field of research with a wide range of applications. This paper surveys various algorithms and models used for speech recognition. The survey shows that the accuracy of speech recognition depends on the dataset used and the model trained on it, and that the error rate depends greatly on environmental conditions. It also indicates that CNN-based systems give better accuracy and boost system performance than conventional systems, thanks to unique features such as local connectivity and weight sharing. CNNs are widely used in applications related to computer vision, image classification, and pattern matching because they mitigate most of the problems of conventional systems.

REFERENCES

[1] D. Nagajyothi and P. Siddaiah, "Speech Recognition Using Convolutional Neural Networks", International Journal of Engineering & Technology, 7(4.6), 2018.
[2] Akhil Thomas, "Speaker Recognition using MFCC and CORDIC Algorithm", International Journal of Innovative Research in Science, Engineering and Technology, Vol. 7, Issue 5, May 2018.
[3] Du Guiming et al., "Speech Recognition Based on Convolutional Neural Networks", IEEE International Conference on Signal and Image Processing, 2016.
[4] Akhilesh Halageri et al., "Speech Recognition using Deep Learning", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 6(3), 2015, 3206-3209.
[5] Ying Zhang et al., "Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks", arXiv:1701.02720v1 [cs.CL], 10 Jan 2017.
[6] Narendra D. Londhe et al., "Deep Convolution Neural Network Based Speech Recognition for Chhattisgarhi", 5th International Conference on Signal Processing and Integrated Networks (SPIN), 2018.
[7] Xuejiao Li and Zixuan Zhou, "Speech Command Recognition with Convolutional Neural Network".
[8] Jui-Ting Huang et al., "An Analysis of Convolutional Neural Networks for Speech Recognition", Microsoft Corporation, One Microsoft Way, Redmond, WA 98052.
[9] Ossama Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 10, October 2014.
[10] William Song and Jim Cai, "End-to-End Deep Neural Network for Automatic Speech Recognition", Department of Computer Science, Stanford University.

© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal