International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
SPEECH RECOGNITION USING CONVOLUTIONAL NEURAL NETWORK
Nidhi A. Kulkarni1, Satish P. Deshpande2
1Student, Dept. of Electronics & Communication, KLSGIT College Belgaum, Karnataka, India
2Assistant Professor, Dept. of Electronics & Communication, KLSGIT College Belgaum, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Language is one of the most important means of communication, and speech is the primary medium through which two people communicate. In the present era, speech technologies are widely used and have countless applications. Speech recognition, as a man-machine interface, plays a vital role in the field of artificial intelligence, where accuracy is a major challenge. In this paper, a review of the speech recognition process, its basic models, and its applications is presented. Different techniques and approaches to speech recognition using the Convolutional Neural Network (CNN) are discussed. This paper also summarizes some of the well-known methods used in the various stages of the speech recognition process. A convolutional neural network is one that reduces spectral variations and models the spectral correlations present in speech signals. The main objective of this review is to bring to light the progress made in CNN-based speech recognition across different languages and from the technological viewpoints of different countries.

Key Words: Speech Recognition, Man-Machine, Artificial Intelligence, Convolutional Neural Network, Spectral Variation.

1. INTRODUCTION

Speech is one of the most important media of communication between people and their environment. A speech recognition system is capable of translating spoken words into text; the text can be in the form of words, sequences of words, or syllables. Speech recognition using a convolutional neural network is a challenging task. Human speech signals are highly variable owing to different speaker attributes and different speaking styles, and they can include environmental noise; hence speech recognition is performed using a convolutional neural network. In speech recognition tasks, many parameters affect the accuracy of recognition: continuous or discrete word recognition, vocabulary size, environmental conditions, the models used, and so on. There are also problems such as the same word being pronounced in different manners, or the same word varying between two different speakers. Resolving these problems is the main step towards the aim. The main advantage of using speech recognition is that a person can save time by speaking to a device directly rather than typing continuously.

1.1 CONVOLUTIONAL NEURAL NETWORK

A convolutional neural network is a class of deep neural network that requires little preprocessing, meaning that the network learns its filters before performing the actual classification. It consists of one or more layers, and a CNN can do many things when it is fed with a set of signals as input. It is difficult for a computer system to consider a whole signal at once, so a convolutional neural network operates on parts of the signal instead of the entire signal. A CNN is a type of neural network in which the input variables are spatially related to each other; CNNs were developed specifically to take spatial positions into account.

Filtering is the main mathematics behind the matching. Matching is done by considering the features that are lined up with a patch of the signal: the aligned values are compared and multiplied one by one, then these products are added and divided by the total number of values, and this step is repeated for every position. The act of convolving signals with a bank of filters, or a set of features, is called the convolutional layer. It is a layer whose operation produces a stack: in convolution, one signal becomes a stack of filtered signals. The convolutional layer is one of the building blocks of a CNN.

Pooling is another important building block of a CNN. Its function is to reduce the number of parameters and the amount of computation used in the network. A pooling layer operates on each feature independently, and the most common approach is max pooling. Once pooling is done, the next step is normalization: if a value is negative, it is replaced with zero. This process is applied to each and every filtered signal. The last step is to stack up these three layers so that the present output becomes the input for the next; the final layer is the fully connected layer.
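The three operations just described (convolution by multiply-add-divide, replacement of negative values with zero, and max pooling) can be sketched in a few lines of plain Python. The signal and filter values below are invented purely for illustration and are not taken from any system discussed in this paper.

```python
def convolve(signal, filt):
    """Slide the filter over the signal; at each position multiply the
    aligned values one by one, add them up, and divide by the filter size."""
    n = len(filt)
    return [sum(signal[i + j] * filt[j] for j in range(n)) / n
            for i in range(len(signal) - n + 1)]

def relu(values):
    """Normalization step: replace every negative value with zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, width=2):
    """Keep only the maximum of each non-overlapping window, reducing
    the number of values passed to the next layer."""
    return [max(values[i:i + width]) for i in range(0, len(values), width)]

signal = [0.0, 1.0, -1.0, 2.0, -2.0, 1.0, 0.0, 1.0]
filt = [1.0, -1.0]  # a simple edge-like filter (illustrative assumption)

filtered = convolve(signal, filt)   # one signal becomes a filtered signal
rectified = relu(filtered)          # negative values replaced with zeros
pooled = max_pool(rectified)        # fewer values; strongest responses kept
```

Stacking these stages, with the output of one stage feeding the next, mirrors the layer stack described above, with a fully connected layer at the end.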
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 4577
Figure 1.1: CNN structure

2. LITERATURE SURVEY

D. Nagajyothi and P. Siddaiah made a brief survey of speech recognition using a convolutional neural network. The system was developed natively for the Telugu language, and the database was created from the questions frequently asked at an airport inquiry counter. A CNN was used for training and testing on this database and gave the best results, as a CNN is well suited to weight sharing, local connectivity, and pooling. Experiments performed on speech signals showed an improvement in the performance of the system compared to traditional systems.

Akhil Thomas presents a survey of a feature extraction technique used for speaker recognition based on Mel Frequency Cepstral Coefficients (MFCC). He further evaluated experimental results obtained along each step of the MFCC process. The MFCC coefficients were retrieved using the DCT, which was computed using the CORDIC algorithm. One disadvantage of this system is that its performance degrades in the presence of noise. The system gave the most accurate results when deployed in the environment in which it was actually trained, and its performance increased with the number of training iterations. The accuracy of the detected speech was very high because the MFCC feature extraction technique was used. The advantage of using the CORDIC algorithm is that it reduces the number of gates and thus also decreases the gate delay.

Du Guiming, Wang Xia, Wang Guangyan, Zhang Yan and Li Dan present a brief survey of a convolutional neural network that uses backpropagation to train the network. In this work, they used a group of speech recordings made personally by some people as training data. The experimental results show that a CNN can efficiently implement word recognition with lower complexity. The backpropagation algorithm is used to adjust and maintain the parameters of the neural network: input signals are first passed to the input layer and then through the middle layer, and the output is produced at the output layer; if the output is not equal to the desired output, an error signal results, from which the parameters are adjusted by backpropagation. In this paper, the neural network is trained on 30 people, and once training is done, the voices of five other people are used as test data. Lastly, they compared the CNN with a DNN and concluded that the CNN greatly reduces the complexity of the system.

Akhilesh Halageri, Amrita Bidappa, Arjun C, Madan Mukund Sarathy and Shabana Sultana state that speech recognition usually consists of the following steps: the first step is to capture and digitize the sound waves, which are then converted into phonemes. The main purpose of this paper was to review the pattern matching abilities of neural networks on speech signals. In feed-forward networks, the activation function is used mainly to stabilize the output layer, but this process is not present in recurrent networks. A system is proposed that uses learning algorithms to learn the features without any other assumptions; the algorithms use neural networks to increase the computational power of the proposed system. In this system, tanh is used as the activation function for the first layer, while the rectified linear unit is used for the next two layers. The input data was normalized to increase the learning speed, and the network structure was improved by using max pooling in the last layer. They concluded that the CNN-based system achieves better performance than a conventional system.

Ying Zhang et al. proposed an end-to-end speech framework inspired by the advantages of both the CNN and CTC approaches; the two were combined without using any recurrent connections. By evaluating this approach on the TIMIT phoneme recognition task, they showed that the proposed model is not only computationally efficient but also competitive with baseline systems. They also showed that in a CNN of sufficient depth, higher-layer features have the capacity to capture temporal dependencies with suitable information. The model consists of 10 convolutional layers and 3 fully connected hidden layers. The first convolutional layer is followed by a pooling layer of size 3×1, which means pooling is applied only over the frequency axis, and a filter size of 3×5 is used across the layers. Maxout with 2-piece linear functions is used as the activation function, and the Adam optimizer with a learning rate of 10^-4 is used to optimize the model. The model achieved an 18.2% error rate on the core test dataset, which is better than the LSTM baseline model; the CNN model is also preferred because it takes less time to train on the dataset than the LSTM model.

Narendra D. Londhe, Ghanshyam B. Kshirsagar, and Hitesh Tekchandani proposed a Deep Convolutional Neural Network (DCNN) based ASR for the Chhattisgarhi dialect, because speaker-dependent speech recognition systems using conventional machine learning techniques were incapable of handling the spectral variations and spectral correlations of acoustic signals. The DCNN efficiently handles the spectral variation and spectral correlation of speech signals with a smaller computational burden. For this experiment, a self-recorded dataset acquired from 170 subjects was used for word recognition. They used 8 layers of convolution, pooling, and fully connected layers, respectively, with the rectified linear unit as the activation function. In this experiment, 5-fold cross-validation is used for training and testing of the CNN model: the data was partitioned in a 4:1 ratio, i.e. 4 parts were used for training and the remaining part for testing. The implemented algorithm achieved 99.49% accuracy, and the proposed DCNN model gave better results than conventional techniques.

Xuejiao Li and Zixuan Zhou aimed to build an accurate, low-latency speech command recognition system capable of detecting predefined words. The speech command dataset provided by Google's TensorFlow and AIY teams, which consists of 65,000 WAVE audio files of thirty different spoken words, is used for training and testing. Models such as a vanilla single-layer softmax model, a deep neural network, and a convolutional neural network are used, and the convolutional neural network proves to outperform the other two models, achieving an accuracy of 95.1%. The first observation made was that the vanilla model is not a good model, because its accuracy was only about 56%. The experimental results showed that the CNN is more effective than the DNN and the vanilla model, giving an 18.6% relative improvement over the DNN and 72.3% over the vanilla model in precision. A simple 2-layer ConvLayer CNN network outperforms the vanilla model and the DNN, achieving 31.43% and 66.67% relative improvement in test accuracy with regard to the DNN and the vanilla model, and 82% and 94.6% in loss.

Jui-Ting Huang, Jinyu Li, and Yifan Gong aimed to provide a detailed analysis of CNNs. They showed that edge detectors along various directions can be learned automatically, by visualizing the localized filters learned in the convolutional layer. The CNN provides advantages over fully connected layers in four domains: channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. A CNN structure combined with maxout units gave a relative 9.3% word error rate reduction (WERR). All experiments in this paper are performed under the context-dependent deep neural network framework, where a CNN or DNN is used to classify the input features into classes. The CNN architecture uses one convolutional layer followed by one max-pooling layer and four fully connected layers. Training data of about 1000 hours of audio recorded by the Kinect device is used for the experiment, while the test data consist of 18,683 utterances recorded at 1, 2, or 3 meters away from Kinect devices. A CNN with random initialization can be used to learn various sets of edge detectors that extract low-level features. For distant speech recognition, the CNN trained on 1000 hours of Kinect distant speech data obtains a relative 4% word error rate reduction.

Ossama Abdel-Hamid et al. conducted experiments on two speech recognition tasks to evaluate the effectiveness of CNNs in automatic speech recognition: small-scale phone recognition on TIMIT and a large-vocabulary voice search (VS) task. The results show that the CNN reduces the error rate by 6%-10% compared with DNNs on both TIMIT phone recognition and the large-vocabulary voice search task. The paper also discusses how to organize speech feature vectors into feature maps suitable for CNN processing. The building block of the CNN consists of a pair of hidden plies, a convolutional ply and a pooling ply, and the input layer consists of localized features organized as feature maps. A hybrid ANN-HMM framework with a softmax output layer is used to compute the posterior probabilities of all HMM states, and the likelihoods of all HMM states are passed through a Viterbi decoder to recognize the continuous stream of speech. Finally, pretraining the CNNs based on convolutional RBMs gives a larger performance gain in large-vocabulary voice search than in the phone recognition experiment.

William Song and Jim Cai implemented an end-to-end deep learning system that uses mel-filter features to output spoken phonemes directly, without using a traditional hidden Markov model for decoding. In this paper, the TIMIT dataset, which comprises 630 speakers with 6300 utterances in 8 dialects, is used for testing and training. They first implemented a CNN model using the Caffe framework, obtaining an error rate of 22.1%; the decoded phone sequence achieved an error rate of 29.4%. An end-to-end system is then achieved by replacing the HMM model with an RNN and CTC; that is, they replaced the traditional HMM system with a network capable of outputting sequential labels for the input data. Training is done with the CNN model, and the last convolutional layer is fed into the CTC network for training, such that the actual phone sequences are predicted. This is essentially a procedure in which the CNN is trained first and its parameters are then frozen while the network is fine-tuned for other tasks. Lastly, they concluded that the CNN performs better than the CTC model.

3. SUMMARY

Speech is one of the prominent and primary means of communication between two humans. Speech recognition using the convolutional neural network is an emerging field of research with a wide range of applications. In this paper, a survey is made of various algorithms and models used for speech recognition. The survey shows that the accuracy of speech recognition depends on the dataset used and the model on which it is trained, and that the error rate depends greatly on the environmental conditions. The survey also points out that a CNN-based system gives better accuracy than a conventional system and boosts performance, owing to its unique features such as local connectivity and weight sharing. CNNs are also widely used in applications related to computer vision, image classification, and pattern matching, because they mitigate most of the problems of conventional systems.
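Several of the surveyed results are quoted as relative improvements or relative word error rate reductions (WERR) rather than absolute differences. As a worked illustration, a relative reduction is the absolute reduction divided by the baseline; the baseline and improved error rates below are hypothetical and are not taken from any of the surveyed papers.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline

# A baseline word error rate of 20% improved to 18.14% is a 9.3% relative
# reduction, even though the absolute reduction is only 1.86 points.
wer_baseline = 0.20
wer_improved = 0.1814
print(round(100 * relative_reduction(wer_baseline, wer_improved), 1))
```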
REFERENCES

[1] D. Nagajyothi and P. Siddaiah, "Speech Recognition Using Convolutional Neural Networks", International Journal of Engineering & Technology, 7(4.6), 2018.
[2] Akhil Thomas, "Speaker Recognition using MFCC and CORDIC Algorithm", International Journal of Innovative Research in Science, Engineering and Technology, Vol. 7, Issue 5, May 2018.
[3] Du Guiming et al., "Speech Recognition Based on Convolutional Neural Networks", IEEE International Conference on Signal and Image Processing, 2016.
[4] Akhilesh Halageri et al., "Speech Recognition using Deep Learning", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6(3), 2015, 3206-3209.
[5] Ying Zhang et al., "Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks", arXiv:1701.02720v1 [cs.CL], 10 Jan 2017.
[6] Narendra D. Londhe et al., "Deep Convolution Neural Network Based Speech Recognition for Chhattisgarhi", 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN).
[7] Xuejiao Li and Zixuan Zhou, "Speech Command Recognition with Convolutional Neural Network".
[8] Jui-Ting Huang, Jinyu Li, and Yifan Gong, "An Analysis of Convolutional Neural Networks for Speech Recognition", Microsoft Corporation, One Microsoft Way, Redmond, WA 98052.
[9] Ossama Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 10, October 2014.
[10] William Song and Jim Cai, "End-to-End Deep Neural Network for Automatic Speech Recognition", Department of Computer Science, Stanford University.