A Comprehensive Video Dataset for Multi-Modal Recognition Systems

Open Access | Nov 2019

Figures & Tables

Subject area: Image Processing, Computer Vision, Machine Learning and Deep Learning.
More specific subject area: Feature Extraction, Speech Recognition and Text Recognition.
Type of data: Images, audio files, tables and figures.
How data was acquired (experimental setup): Original videos were captured at University Institute of Engineering Technology, Kanpur, using a Canon EOS 1200D 18 MP digital SLR camera with 18–55 mm and 55–250 mm lenses in a highly sophisticated, noise-free experimental laboratory.
Data format: Videos are in .MOV format, frames are in .jpg format, audio files are in .wav format, and waveform graphs are in .png format.
Experimental factors: The video samples generated for the various subjects were denoised using Neat Video (Other, 2019).
Experimental features: Various biometric traits were extracted for every subject: frames, boundary box coordinates, the audio of the subject's entire video, the audio waveform for the entire video length, split audio of the text spoken by the subject, and the split audio waveforms.
Data source location: University Institute of Engineering Technology, Kanpur, India.
Data accessibility: The dataset is publicly and freely available for any research and educational purposes.
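
The .wav audio files listed in the specifications can be inspected with the Python standard library alone. The following is a minimal sketch, not the authors' tooling: it synthesises a short tone in memory as a stand-in for one of the dataset's audio files and reads back the amplitude series that a waveform graph would display. The sample rate, tone frequency, 16-bit mono layout, and the function names `read_mono_pcm16` and `make_test_tone` are all assumptions made for illustration.

```python
import io
import math
import struct
import wave


def read_mono_pcm16(wav_file):
    """Return (sample_rate, samples) for a 16-bit mono PCM .wav path or file object."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # "<h" is little-endian signed 16-bit, the layout of PCM16 .wav data.
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return rate, samples


def make_test_tone(rate=8000, seconds=0.5, freq=440.0):
    """Build an in-memory .wav holding a sine tone (assumed stand-in for a dataset file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        n = int(rate * seconds)
        frames = b"".join(
            struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(n)
        )
        wav.writeframes(frames)
    buffer.seek(0)
    return buffer


rate, samples = read_mono_pcm16(make_test_tone())
duration = len(samples) / rate  # length of the clip in seconds
```

Passing a real file path instead of the in-memory buffer should work the same way, provided the file is 16-bit mono PCM; other encodings would need a different unpacking step.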
Figure 1

Frames generated for a sample video DSC_0020.MOV.

Figure 2

Boundary box for the frames generated for a sample video.

Table 1

.csv format for the boundary box coordinates of each frame for sample video DSC_0020.MOV.

Frames | Lower Left (X) | Lower Left (Y) | Upper Left (X) | Upper Left (Y) | Upper Right (X) | Upper Right (Y) | Lower Right (X) | Lower Right (Y)
frame0.jpg | 821 | 134 | 821 | 450 | 1137 | 450 | 1137 | 134
frame1.jpg | 822 | 135 | 822 | 448 | 1135 | 448 | 1135 | 135
frame1010.jpg | 811 | 108 | 811 | 421 | 1124 | 421 | 1124 | 108
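
As an illustration of how the corner coordinates in Table 1 might be consumed, the sketch below parses the three sample rows and derives each box's width and height. It is a minimal example under stated assumptions: the comma delimiter and exact header spellings of the dataset's .csv files are not confirmed here, and `load_boxes` and `box_size` are hypothetical helper names.

```python
import csv
import io

# The three rows shown in Table 1, written out as comma-separated text.
# Assumption: the real .csv files use commas and these header names.
CSV_TEXT = """Frames,Lower Left (X),Lower Left (Y),Upper Left (X),Upper Left (Y),Upper Right (X),Upper Right (Y),Lower Right (X),Lower Right (Y)
frame0.jpg,821,134,821,450,1137,450,1137,134
frame1.jpg,822,135,822,448,1135,448,1135,135
frame1010.jpg,811,108,811,421,1124,421,1124,108
"""


def load_boxes(text):
    """Map each frame name to (x_min, y_min, x_max, y_max)."""
    boxes = {}
    for row in csv.DictReader(io.StringIO(text)):
        xs = [int(v) for k, v in row.items() if k.endswith("(X)")]
        ys = [int(v) for k, v in row.items() if k.endswith("(Y)")]
        boxes[row["Frames"]] = (min(xs), min(ys), max(xs), max(ys))
    return boxes


def box_size(box):
    """Return (width, height) of an axis-aligned box."""
    x_min, y_min, x_max, y_max = box
    return x_max - x_min, y_max - y_min


boxes = load_boxes(CSV_TEXT)
sizes = {name: box_size(box) for name, box in boxes.items()}
```

For the three rows above this gives a 316 × 316 box for frame0.jpg and 313 × 313 boxes for frame1.jpg and frame1010.jpg. Reducing the four corners to a min/max pair keeps the code independent of which corner the file treats as "lower" or "upper".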
Figure 3

Waveform of the audio from sample video DSC_0020.MOV.

Figure 4

Waveform for digit 1 recited in DSC_0020.MOV.

Figure 5

Waveform for digit 2 recited in DSC_0020.MOV.

Table 2

Configuration of Convolutional Neural Network.

Layers | Filter Size | Strides | No. of filters
Convolution Layer 1 | 5 × 5 | 1 | 32
Pooling Layer 1 | 2 × 2 | 2 | -
Convolution Layer 2a | 1 × 1 | 2 | 64
Convolution Layer 2a_1 | 3 × 3 | 1 | 64
Convolution Layer 2b | 3 × 1 | 1 | 64
Convolution Layer 2b_1 | 1 × 3 | 1 | 64
Pool 2b | 2 × 2 | 2 | -
Convolution Layer 2c | 1 × 1 | 2 | 64
Concatenate | - | - | 192
Pool 2 | 2 × 2 | 2 | -
Fully Connected Layer 1 | - | - | 1024
Fully Connected Layer 2 | - | - | 1024
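
One way to sanity-check the configuration in Table 2 is to trace tensor shapes through it. The sketch below is not the authors' model code. It assumes that the 2a, 2b and 2c layers form three parallel branches fed by Pooling Layer 1 (inferred from the layer names and the 192-channel concatenation), that every layer uses "same" padding, and that the input is a hypothetical 64 × 64 single-channel map, since the table does not state an input size.

```python
import math

# Table 2 encoded as (name, kernel, stride, filters); filters is None for pooling.
STEM = [("Convolution Layer 1", (5, 5), 1, 32), ("Pooling Layer 1", (2, 2), 2, None)]
# Assumption: 2a/2b/2c are three parallel branches fed by Pooling Layer 1.
BRANCHES = {
    "2a": [("Convolution Layer 2a", (1, 1), 2, 64),
           ("Convolution Layer 2a_1", (3, 3), 1, 64)],
    "2b": [("Convolution Layer 2b", (3, 1), 1, 64),
           ("Convolution Layer 2b_1", (1, 3), 1, 64),
           ("Pool 2b", (2, 2), 2, None)],
    "2c": [("Convolution Layer 2c", (1, 1), 2, 64)],
}
TAIL = [("Pool 2", (2, 2), 2, None)]


def trace(layers, shape):
    """Propagate (height, width, channels), assuming 'same' padding throughout."""
    h, w, c = shape
    for _name, _kernel, stride, filters in layers:
        h, w = math.ceil(h / stride), math.ceil(w / stride)
        if filters is not None:
            c = filters
    return h, w, c


def trace_network(input_shape):
    """Return the shapes after the concatenation and after Pool 2."""
    stem_out = trace(STEM, input_shape)
    branch_outs = [trace(layers, stem_out) for layers in BRANCHES.values()]
    # Channel-wise concatenation requires matching spatial sizes.
    spatial = {(h, w) for h, w, _ in branch_outs}
    if len(spatial) != 1:
        raise ValueError("branches disagree on spatial size: %s" % spatial)
    h, w = spatial.pop()
    concat = (h, w, sum(c for _, _, c in branch_outs))
    return concat, trace(TAIL, concat)


# Hypothetical 64 x 64 single-channel input (not specified in Table 2).
concat_shape, pooled_shape = trace_network((64, 64, 1))
```

Under these assumptions each branch halves the spatial size exactly once and ends with 64 channels, so the concatenation yields 3 × 64 = 192 channels, which agrees with the "Concatenate" row of the table.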
Figure 6

Architecture of CNN for speech recognition model (Zhao et al. 2017).

Table 3

Face Recognition Model Results on Our Dataset.

Dataset | Training/Testing Percentage | Accuracy | Training Loss
Our Dataset (Handa, Agarwal, and Kohli, 2018) | 70% and 30% | 99.14% | 0.56%
Table 4

Speech Recognition Model Results on Our Dataset.

Dataset | Training/Testing Percentage | Accuracy | Training Loss
Our Dataset (Handa, Agarwal, and Kohli, 2018) | 70% and 30% | 96.42% | 0.67%
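
The results above are reported for a 70%/30% training/testing division. The page does not say how that division was drawn (for example at random, or per subject), so the following is only a sketch of one common approach: a seeded random shuffle followed by a cut at 70%, plus a plain accuracy measure. The names `split_70_30` and `accuracy`, the seed, and the sample identifiers are all hypothetical.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of items and return (train, test) in a 70/30 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(actual):
        raise ValueError("length mismatch")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical sample identifiers standing in for frames or audio clips.
samples = ["sample_%03d" % i for i in range(100)]
train, test = split_70_30(samples)
```

Fixing the seed makes the division reproducible. When several frames or clips come from the same subject, a per-subject split may be preferable so that near-duplicate samples do not land on both sides.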
Figure 7

Accuracy and training loss results graph on our dataset for face recognition.

Figure 8

Accuracy and training loss results graph on our dataset for speech recognition.

Table 5

Face Recognition Model Results on JAFFE Dataset.

Dataset | Training/Testing Percentage | Accuracy | Training Loss
JAFFE Dataset (Lyons et al., 1998) | 70% and 30% | 92.1% | 0.78%
Table 6

Speech Recognition Model Results on FSDD Dataset.

Dataset | Training/Testing Percentage | Accuracy | Training Loss
FSDD Dataset (Jackson et al., 2018) | 70% and 30% | 89.2% | 0.81%
Figure 9

Accuracy and training loss results graph on JAFFE dataset for face recognition.

Figure 10

Accuracy and training loss results graph on FSDD dataset for speech recognition.

Table 7

Face Recognition test results of our trained model for JAFFE dataset.

Training Dataset | Testing Dataset | Training/Testing Percentage | Accuracy
Our Dataset (Handa, Agarwal, and Kohli, 2018) | JAFFE Dataset (Lyons et al., 1998) | 70% and 30% | 93.04%
Table 8

Speech Recognition test results of our trained model for FSDD dataset.

Training Dataset | Testing Dataset | Training/Testing Percentage | Accuracy
Our Dataset (Handa, Agarwal, and Kohli, 2018) | FSDD Dataset (Jackson et al., 2018) | 70% and 30% | 90.11%
Figure 11

Test accuracy of face and speech recognition model trained on our dataset.

Language: English
Submitted on: Nov 20, 2018
Accepted on: Oct 21, 2019
Published on: Nov 8, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Anand Handa, Rashi Agarwal, Narendra Kohli, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.