Blog Post 5
This week I began using the TensorFlow library with my data to generate some preliminary neural networks and to test which models performed best.
After plotting my WAV file data set I realised that the majority of the useful information was contained in the first 4 seconds (or 4*44100 samples) of each file, so I wrote a quick bit of Python to truncate all of the files to 4 seconds long.
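The truncation step can be sketched with Python's built-in `wave` module. This isn't my original script; it's a minimal version of the same idea, and the function name and paths are placeholders:

```python
import wave

SAMPLE_RATE = 44100
SECONDS = 4

def truncate_wav(src_path, dst_path, seconds=SECONDS):
    """Copy only the first `seconds` of audio from src_path to dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), seconds * src.getframerate())
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # keep channels, sample width and rate
        dst.writeframes(frames)  # frame count in the header is fixed on close
```

Running this over every file in the data set leaves each clip at most 4*44100 frames long.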
Then, using the IR that was used to create each file as its label, I built a data set containing the 583 WAV files from each of the IR_2 and IR_3 sets to test my neural nets with.
Before adding the other 7 or so IRs into the mix, I wanted to know whether a neural net could be trained to differentiate between just two rooms.
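Loading the clips and labelling them by IR can be sketched as follows. The directory layout (one folder per IR under a root directory) is an assumption, as are the helper names:

```python
import os
import wave

import numpy as np

CLIP_SAMPLES = 4 * 44100  # 4-second clips at 44.1 kHz

def load_clip(path):
    """Read a 16-bit mono WAV file into a float array in [-1, 1]."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(CLIP_SAMPLES)
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    # Zero-pad any clip that is shorter than 4 seconds.
    return np.pad(samples, (0, CLIP_SAMPLES - len(samples)))

def build_dataset(root, ir_names=("IR_2", "IR_3")):
    """Label each clip by the IR folder it came from (0, 1, ...)."""
    clips, labels = [], []
    for label, ir in enumerate(ir_names):
        folder = os.path.join(root, ir)
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(".wav"):
                clips.append(load_clip(os.path.join(folder, name)))
                labels.append(label)
    return np.stack(clips), np.array(labels)
```

With two IRs this gives a binary classification problem: every clip becomes a fixed-length array and its label is simply the index of the room it was convolved with.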
The Power Of CNNs
This person does not exist. This image was generated using a convolutional neural network.
After reading a few papers on the subject, I found that a sound file can be thought of as a one-dimensional temporal data set: the data essentially represents the amplitude of a signal evolving over time. Recently, projects such as WaveNet and this PyData talk (https://www.youtube.com/watch?v=nMkqWxMjWzg) have had a lot of success working with this type of temporal data by using one-dimensional convolutional neural networks.
I designed a simple 1D convolutional NN based on the ‘Best Network’ CNN developed by Nathan Janos and Jeff Roach in the video linked above: 6 filters wide, 3 layers deep, with a window of 7350 samples (they used a 24-hour window over 8 weeks of data, so I scaled mine to the same percentage of the 4*44100 samples I was working with). I then trained it on my data set for about 100 epochs.
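A model along those lines can be sketched in Keras. This is not my exact architecture; anything beyond the "6 filters, 3 layers, 7350-sample window" description (the stride, the later kernel sizes, the pooling and output layers) is an assumption made to keep the sketch runnable:

```python
import tensorflow as tf

CLIP_SAMPLES = 4 * 44100  # 4-second clips at 44.1 kHz

def build_model(n_classes=2):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(CLIP_SAMPLES, 1)),
        # First layer reads a 7350-sample window; the large stride is an
        # assumption to keep the sequence length manageable.
        tf.keras.layers.Conv1D(6, kernel_size=7350, strides=735,
                               activation="relu"),
        tf.keras.layers.Conv1D(6, kernel_size=3, activation="relu"),
        tf.keras.layers.Conv1D(6, kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With two output classes this directly matches the two-room experiment; raising `n_classes` is the obvious route to the multi-room version discussed below.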
After about 50 epochs the validation accuracy began to plateau at around 90%, with the classification accuracy at about 85%.
I think I will be able to increase the accuracy of this NN by feeding it a larger training set and by tweaking the model. After testing that hypothesis, my plan is to extend the model to support more than 2 outputs, so that it can identify even more types of rooms.
Furthermore, I’ve noticed that the majority of projects involving sound and ANNs convert the raw audio into Mel-frequency cepstra (MFCCs) before analysing it. I plan to see whether this has any effect on the model’s accuracy compared to training it on the raw WAV files.
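As a sketch of what that transform involves, here is a minimal NumPy construction of the mel filterbank and the log-mel energies of one frame; taking the DCT of those energies gives the cepstral coefficients. The filter count and FFT size are typical defaults, not values from any specific library:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=2048, sr=44100):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def log_mel_spectrum(frame, fb):
    """Log mel energies of one frame; the DCT of these gives the MFCCs."""
    power = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1))) ** 2
    return np.log(fb @ power + 1e-10)
```

The appeal of this representation for room classification is that it compresses each frame of raw samples into a handful of perceptually spaced energy bands, which should make the spectral colouration each IR imposes easier for the network to pick out.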