SPEECH RECOGNITION USING THE MFCC FEATURE EXTRACTION ALGORITHM AND VECTOR QUANTIZATION ON TI'S TMS320C6713 DSK KIT

Hoàng Trang (1), Huỳnh Lâm Đồng (2)
(1) Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology, VNU-HCM
(2) Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology and Education

ABSTRACT: This paper evaluates the MFCC feature extraction method combined with vector quantization (VQ) for speech recognition. The evaluation considers the effect of two parameters, the codebook size and the number of samples per audio frame, on the accuracy of the recognition process. Two models are built: the first uses a codebook of size 16 and the second a codebook of size 8. In each model, the number of samples per frame and the number of overlapping samples between consecutive frames are, respectively: 160 samples per frame with 80 overlapping samples; 200 samples per frame with 100 overlapping samples; and 256 samples per frame with 156 overlapping samples. The recognition system is built on TI's TMS320C6713 DSP kit. The evaluation is performed on 16 Vietnamese words, each evaluated 100 times.
Keywords: MFCC, VQ, DSK TMS320C6713.

1. INTRODUCTION
In recent decades, control through automatic speech recognition (ASR) has received great attention thanks to its many applications. Many algorithms have been proposed for feature extraction, training, and recognition. Speech recognition based on Mel frequency cepstral coefficients (MFCC) for feature extraction and on codebooks built by vector quantization is one of the approaches that have been widely used in many applications. In this paper, we evaluate these algorithms by implementing them on an embedded DSP kit, specifically the DSK TMS320C6713, which is built around a TMS320C6000-family processor. The paper is divided into two main parts. The first part presents the MFCC feature extraction and vector quantization algorithms for speech recognition, together with MATLAB simulation software that gives an overview of the recognition process. The second part describes the embedded programs built on the DSK TMS320C6713, which consist of a training phase and a recognition phase.
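Since the recognition phase is only summarized here, the following C sketch shows how the VQ-based decision is typically made in such a system: every word is represented by its trained codebook, and an utterance is assigned to the word whose codebook yields the smallest average quantization distortion over the utterance's feature vectors. The function names vq_distortion and recognize_word, the fixed array dimensions, and the edge handling are illustrative assumptions, not code taken from the implementation described in this paper.

```c
#include <float.h>

#define DIM        39   /* 12 cepstra + energy + delta + delta-delta (configuration described later) */
#define CB_SIZE    16   /* codewords per word model (size-16 codebook case)                          */
#define NUM_WORDS  16   /* ten digits plus Trái, Phải, Trên, Dưới, Trước, Sau                        */

/* Average quantization distortion of an utterance against one word's codebook:
 * for every frame take the squared Euclidean distance to the nearest codeword,
 * then average over all frames. */
static float vq_distortion(const float feat[][DIM], int n_frames,
                           const float codebook[CB_SIZE][DIM])
{
    float total = 0.0f;
    for (int t = 0; t < n_frames; t++) {
        float best = FLT_MAX;
        for (int c = 0; c < CB_SIZE; c++) {
            float d = 0.0f;
            for (int k = 0; k < DIM; k++) {
                float diff = feat[t][k] - codebook[c][k];
                d += diff * diff;
            }
            if (d < best)
                best = d;
        }
        total += best;
    }
    return total / (float)n_frames;
}

/* Recognition decision: pick the word whose codebook gives the smallest
 * average distortion for the observed feature vectors. */
int recognize_word(const float feat[][DIM], int n_frames,
                   const float codebooks[NUM_WORDS][CB_SIZE][DIM])
{
    int   best_word = 0;
    float best_dist = FLT_MAX;
    for (int w = 0; w < NUM_WORDS; w++) {
        float d = vq_distortion(feat, n_frames, codebooks[w]);
        if (d < best_dist) {
            best_dist = d;
            best_word = w;
        }
    }
    return best_word;
}
```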
2. Building the speech recognition system on the TMS320C6713 DSK kit
Fig. 1. Block diagram of the DSK TMS320C6713.
Designed by Spectrum Digital, the DSK TMS320C6713 kit is a comprehensive evaluation platform for the Texas Instruments TMS320C6713 digital signal processor. The kit can be used as a reference design for interfacing the DSP with devices such as SDRAM, Flash memory, and codecs. An on-board JTAG emulator allows debugging from Code Composer Studio through the computer's USB port. The model is built on the DSP kit through the following steps.
Step 1: Recording
Recording is performed on the DSK through the AIC23 codec IC, a stereo audio codec with 32-bit input and output. The on-board AIC23 codec uses sigma-delta techniques for the ADC and DAC processes and is driven by a 12 MHz system clock. The sampling rate can be configured from 8 to 96 kHz; in this work the sampling frequency is 8 kHz. Data are stored as 32-bit integers.
Step 2: Framing and windowing
The audio signal is divided into frames, giving a 100x256 matrix (100 frames of 256 samples each); consecutive frames overlap by 156 samples. Overlapping and windowing reduce abrupt changes in the spectrum of the sampled signal. A Hamming window with 256 coefficients, matching the number of samples in a frame, is used. The Hamming coefficients are calculated by the formula
h(l) = 0.54 − 0.46·cos(2π(l − 1)/(L − 1)), l = 1, 2, ..., L
Fig. 2. Block diagram of framing and windowing (blocks: BLOCK_DC, DETECT_ENVELOP, FRAMING, WINDOWING).
Step 3: Energy coefficient
The energy coefficient is calculated by taking the logarithm of the sum of squares of the samples in an audio frame. It becomes the 13th coefficient, after the 12 cepstral coefficients have been calculated.
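As a minimal sketch of Steps 2 and 3, the C fragment below splits the recorded signal into overlapping frames, applies the Hamming window defined above, and computes each frame's log energy (used later as the 13th coefficient). The 256-sample frame with 156 overlapping samples is one of the configurations evaluated in this paper; the zero-padding of the last frames, the small offset inside the logarithm, and the name frame_and_window are assumptions made for illustration.

```c
#include <math.h>

#define FRAME_LEN  256   /* samples per frame (one of the tested configurations) */
#define FRAME_HOP  100   /* 256 - 156 overlapping samples                        */
#define NUM_FRAMES 100   /* frames kept per utterance, as in the 100x256 matrix  */

/* Frame the signal, apply h(l) = 0.54 - 0.46*cos(2*pi*(l-1)/(L-1)),
 * and compute the log energy of each frame. */
void frame_and_window(const float *signal, long n_samples,
                      float frames[NUM_FRAMES][FRAME_LEN],
                      float log_energy[NUM_FRAMES])
{
    const float two_pi = 6.2831853f;
    float hamming[FRAME_LEN];

    for (int l = 0; l < FRAME_LEN; l++)
        hamming[l] = 0.54f - 0.46f * cosf(two_pi * l / (FRAME_LEN - 1));

    for (int f = 0; f < NUM_FRAMES; f++) {
        long  start  = (long)f * FRAME_HOP;
        float energy = 0.0f;
        for (int l = 0; l < FRAME_LEN; l++) {
            long  idx = start + l;
            float s   = (idx < n_samples) ? signal[idx] : 0.0f;  /* zero-pad the tail */
            energy      += s * s;
            frames[f][l] = s * hamming[l];
        }
        log_energy[f] = logf(energy + 1e-10f);  /* offset avoids log(0) on silent frames */
    }
}
```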
Step 4: FFT
The FFT function computes 256 points using an eight-stage butterfly diagram; each stage consists of two phases: calculating the twiddle factors and performing the 256 butterflies. Bit reversal of the data sequence is then performed. The output is a 100x256 matrix describing the audio signal in the frequency domain.
Step 5: Power spectrum
After the FFT, the power spectrum is obtained through the power_spectrum function, which sums the squares of the real and imaginary parts of the signal spectrum. The resulting power spectrum is real-valued and is stored as a 100x256 matrix of 32-bit floats.
Step 6: Mel frequency spectrum
The Mel frequency spectrum is calculated by multiplying the signal power spectrum with triangular filters designed on the Mel frequency scale. If f is a frequency in Hz, the corresponding Mel frequency is given by
B(f) = 2595·log10(1 + f/700) mel
and if m is a Mel frequency, the corresponding frequency in Hz is obtained by the inverse formula
B^(-1)(m) = 700·(10^(m/2595) − 1) Hz
To build the filter bank, the boundary points that determine the transfer functions of the triangular filters are computed as
f(m) = (N/fs)·B^(-1)(B(fl) + m·(B(fh) − B(fl))/(M + 1))
where fs is the sampling frequency, fs = 8 kHz; the high frequency boundary is fh = fs/2 = 4 kHz; the low frequency boundary is fl = fs/256 = 31.25 Hz; N is the total number of samples, N = 256; M is the number of filters; and m is the filter index, 0 ≤ m ≤ M + 1, with f(m − 1) < f(m) < f(m + 1).
After passing the data through the triangular filters, the result is a 100x20 matrix, where the number of columns equals the number of filters; 20 triangular Mel-scale filters are used in this work. As shown in the references, other filter shapes, such as Gaussian filters, can also be used to compute the Mel frequency cepstral coefficients. The result is stored in the variable coeff, a 100x20 matrix whose elements are 32-bit floats representing the power-spectrum energy of the corresponding filter.
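The boundary formula above can be turned into a filter bank as in the C sketch below, which places NFILT + 2 boundary points uniformly on the Mel scale between fl and fh and accumulates one frame's power spectrum under 20 triangular filters. Only the Mel-scale formulas and the constants fs = 8 kHz, N = 256, fl = 31.25 Hz, fh = 4 kHz come from the text; the helper names (hz_to_mel, mel_to_hz, mel_filterbank) and the exact slope handling are assumptions for illustration.

```c
#include <math.h>

#define NFFT   256       /* FFT length                        */
#define FS     8000.0f   /* sampling frequency in Hz          */
#define NFILT  20        /* number of triangular Mel filters  */

static float hz_to_mel(float f) { return 2595.0f * log10f(1.0f + f / 700.0f); }
static float mel_to_hz(float m) { return 700.0f * (powf(10.0f, m / 2595.0f) - 1.0f); }

/* Apply the Mel-spaced triangular filter bank to one frame's power spectrum.
 * power[] holds NFFT power-spectrum values; mel_energy[] receives one value
 * per triangular filter. */
void mel_filterbank(const float power[NFFT], float mel_energy[NFILT])
{
    const float fl = FS / NFFT;   /* 31.25 Hz low boundary  */
    const float fh = FS / 2.0f;   /* 4 kHz high boundary    */
    const float ml = hz_to_mel(fl);
    const float mh = hz_to_mel(fh);

    /* FFT-bin positions f(m) of the NFILT + 2 boundary points, equally spaced in Mel. */
    float bin[NFILT + 2];
    for (int m = 0; m <= NFILT + 1; m++)
        bin[m] = (NFFT / FS) * mel_to_hz(ml + m * (mh - ml) / (NFILT + 1));

    for (int m = 1; m <= NFILT; m++) {
        float acc = 0.0f;
        for (int k = (int)bin[m - 1]; k <= (int)bin[m + 1] && k < NFFT / 2; k++) {
            /* Rising slope up to the centre bin[m], falling slope after it. */
            float w = (k < bin[m])
                          ? (k - bin[m - 1]) / (bin[m] - bin[m - 1])
                          : (bin[m + 1] - k) / (bin[m + 1] - bin[m]);
            if (w > 0.0f)
                acc += w * power[k];
        }
        mel_energy[m - 1] = acc;
    }
}
```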
Step 7: Log energy
The logarithm of the filter outputs is then taken. The result is stored in the variable co_eff as a 100x20 matrix.
Step 8: Discrete Cosine Transform (DCT)
The DCT is similar to the inverse FFT but is more efficient because it works with real numbers. The cepstral coefficients are computed as
C(n,p) = Σ_{k=1..K} log(S'(n,k))·cos(p·π·(k − 0.5)/K), p = 1, 2, ..., P
where K is the number of filters, S'(n,k) is the signal power spectrum on the Mel frequency scale, n is the frame index, P is the number of cepstral coefficients used for training and recognition, and p is the order of the cepstral coefficient. The result of this step is the acoustic vector that characterizes the processed voice. As presented in the theoretical part, only the first coefficients of the acoustic vector are kept; in this paper 12 coefficients are used.
Step 9: Vector quantization (VQ)
The last, indispensable step of the program is building the codebook by vector quantization. The LBG algorithm, described in the references, is used to build the codebook. After this step the data form a codebook of size 16x12. The energy, delta, and delta-delta coefficients are then added, giving a codebook of size 16x39. The delta coefficients d_t are calculated from the vectors c_t, each containing the energy and the 12 cepstral coefficients of a frame:
d_t = Σ_{n=1..N} n·(c_(t+n) − c_(t−n)) / (2·Σ_{n=1..N} n²)
These data are then saved as a .DAT file to be used in the recognition process.
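A compact C sketch of the two formulas above follows: the DCT of the log Mel energies that produces the 12 cepstral coefficients of a frame, and the delta regression applied to the 13-element vectors (energy plus 12 cepstra). The regression half-width N = 2 and the clamping of frame indices at the utterance boundaries are assumptions, since they are not specified in the text.

```c
#include <math.h>

#define NFILT   20   /* Mel filters (K in the DCT formula)     */
#define NCEP    12   /* cepstral coefficients kept (P)         */
#define NDIM    13   /* energy + 12 cepstra per frame          */
#define DELTA_N 2    /* regression half-width (assumed value)  */

/* c[p] = sum_{k=1..K} log(S'[k]) * cos(p*pi*(k - 0.5)/K), p = 1..P */
void mel_cepstrum(const float mel_energy[NFILT], float cep[NCEP])
{
    const float pi = 3.1415927f;
    for (int p = 1; p <= NCEP; p++) {
        float acc = 0.0f;
        for (int k = 1; k <= NFILT; k++)
            acc += logf(mel_energy[k - 1] + 1e-10f) *
                   cosf((float)p * pi * ((float)k - 0.5f) / NFILT);
        cep[p - 1] = acc;
    }
}

/* d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
 * computed for every dimension of the per-frame vectors. */
void delta_coeffs(const float c[][NDIM], int n_frames, float d[][NDIM])
{
    float denom = 0.0f;
    for (int n = 1; n <= DELTA_N; n++)
        denom += (float)(n * n);
    denom *= 2.0f;

    for (int t = 0; t < n_frames; t++) {
        for (int k = 0; k < NDIM; k++) {
            float num = 0.0f;
            for (int n = 1; n <= DELTA_N; n++) {
                int tp = (t + n < n_frames) ? t + n : n_frames - 1;  /* clamp at the ends */
                int tm = (t - n >= 0)       ? t - n : 0;
                num += n * (c[tp][k] - c[tm][k]);
            }
            d[t][k] = num / denom;
        }
    }
}
```

The delta-delta coefficients mentioned above can be obtained by applying delta_coeffs a second time to the delta vectors.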
3. Assessing the role of two parameters, the codebook size and the number of samples per audio frame (after breaking a word into small frames), in the accuracy of speech recognition using MFCC feature extraction and vector quantization (VQ). The test vocabulary consists of 16 Vietnamese words: the digits 1 to 10 and the words Trái, Phải, Trên, Dưới, Trước, Sau.

Table 1: Recognition results with a codebook of size 16, 160 samples per frame, 80 overlapping samples; average recognition rate 87.1%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 81   93   85   90   95   80   79   81   87   88    88   82   93   90    93   89

Table 2: Recognition results with a codebook of size 16, 256 samples per frame, 156 overlapping samples; average recognition rate 89.8%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 90  100   90   88   80   87   80   90   85   89    92   90   95   98    93   90

Table 3: Recognition results with a codebook of size 16, 200 samples per frame, 100 overlapping samples; average recognition rate 86.9%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 82   90   85   92   90   81   87   86   88   82    89   90   93   95    81   79

Table 4: Recognition results with a codebook of size 8, 200 samples per frame, 100 overlapping samples; average recognition rate 78.8%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 86   79   79   89   86   63   50   65   92   89    83   79   65   80    91   85
Table 5: Recognition results with a codebook of size 8, 256 samples per frame, 156 overlapping samples; average recognition rate 81.3%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 87   88   79   88   89   63   65   65   92   94    83   79   65   85    92   87

Table 6: Recognition results with a codebook of size 8, 160 samples per frame, 80 overlapping samples; average recognition rate 80.9%. Correct recognitions per word out of 100 trials:
Word:        1    2    3    4    5    6    7    8    9   10  Trái Phải Trên Dưới Trước  Sau
Correct (%): 88   89   79   90   89   62   55   65   92   95    83   80   65   84    93   86

4. Conclusion
Based on the results of the recognition process carried out on the TMS320C6713 DSK hardware, the following conclusions can be drawn. The recognition program achieves its best results with the codebook of size 16: the model using 256 samples per audio frame and 156 overlapping samples reaches a recognition rate of 89.8%.
The results obtained with the codebook of size 8 are lower; the worst case, a recognition rate of 78.8%, occurs with the size-8 codebook, 200 samples per audio frame, and 100 overlapping samples. This can be explained by the reduction of the codebook size: the quantization error increases, which leads to a worse recognition rate. As in the digitization of a signal, the recognition rate does not increase linearly with the codebook size; beyond a certain value, further increasing the codebook size changes the recognition rate very little. Within the models using the same codebook size, the influence of the number of samples per frame and of the overlap can be seen: 256 samples per frame with 156 overlapping samples gives the best result in both models, while 200 samples per frame with 100 overlapping samples gives the worst. This shows that the frame size and the sample overlap affect the recognition results.
Possible directions for extending this work: the evaluation can be carried out over other parameters, such as the FFT size, the Mel filter bank (number and type of filters), and the window type; a joint evaluation of these parameters would identify the best recognition model. Speech recognition could also be implemented with other methods, such as HMMs or neural networks, in order to learn the strengths and weaknesses of each method and to decide which method should be used in each case.

REFERENCES
1. Mohamed D., Jean-Paul H., Amrane H. Improved vector quantization approach for discrete HMM speech recognition system. The International Arab Journal of Information Technology.
2. Lawrence R., Biing-Hwang J. Fundamentals of Speech Recognition. Prentice-Hall International, Inc.
3. The HTK Book (Version 3.4). Cambridge University Engineering Department.
4. Andrew W. Hidden Markov Models. School of Computer Science, Carnegie Mellon University.
5. TMS320C6000 Chip Support Library API Reference Guide. Texas Instruments Incorporated, 2004.
6. Jeremy Bradbury. Linear Predictive Coding. Prentice-Hall International Inc., December 5, 2000.
7. L. Tien Thuong and H. Dinh Chien. Vietnamese Speech Recognition Applied to Robot Communications. National University of Ho Chi Minh City, Jan. 2004.
8. Rulph Chassaing. Digital Signal Processing and Applications with the C6713 and C6416 DSK. A John Wiley & Sons, Inc., Publication, 2004.
9. Bạch Hưng Khang. Nghiên cứu phát triển công nghệ nhận dạng, tổng hợp và xử lý ngôn ngữ tiếng Việt. Viện Công nghệ Thông tin.
10. Bài giảng Xử lý tiếng nói. Trường Đại học Hàng hải Việt Nam, Khoa Công nghệ Thông tin, Bộ môn Hệ thống Thông tin, 2011.
11. Lê Bá Dũng. Tài liệu tham khảo môn học Xử lý tiếng nói. Khoa Công nghệ Thông tin, Trường Đại học Hàng hải Việt Nam.
12. Hoàng Đình Chiến. Nhận dạng tiếng Việt dùng mạng neuron kết hợp trích đặc trưng dùng LPC và AMDF, 2005.
13. Hồ Tú Bảo, Lương Chi Mai. Về xử lý tiếng Việt trong công nghệ thông tin. Viện Công nghệ Thông tin; Viện Khoa học và Công nghệ Tiên tiến Nhật Bản.
14. Nguyễn Quốc Đính. Luận văn Thiết kế bộ nhận dạng tiếng nói dựa trên nền tảng DSP TMS320C2812. ĐH Bách Khoa TP.HCM.