Sound Classification with TensorFlow



Tools to detect, decipher, and act on human speech are a dime a dozen, but when looking for a tool to identify sounds such as speech, animal noises, or music, we were hard-pressed to find something that worked. So we made our own. In this article, we'll walk you through how we built some sample sound classification projects using TensorFlow machine learning algorithms.

In this article, we describe which tools we chose, what challenges we faced, how we trained a model for TensorFlow, and how to run our open-source project. We also supply the recognition results to DeviceHive, an IoT platform, so that cloud services and third-party applications can use them. Hopefully, you can learn from our experience and put our tool to good use.

First, we needed to choose software to work with neural networks. The first suitable solution that we found was Python Audio Analysis.

The main problem in machine learning is having a good training dataset. There are many datasets for speech recognition and music classification, but not a lot for random sound classification. After some research, we found the urban sound dataset.

After some testing, however, we ran into a number of problems with this dataset.

The next solution that we found was Google AudioSet. It is based on labeled YouTube video segments and can be downloaded in two formats:

- CSV files that describe, for each segment, the YouTube video ID, start and end times, and one or more labels.
- Extracted audio features that are stored as frame-level TFRecord files.

These features are compatible with YouTube-8M models. This solution also offers the TensorFlow VGGish model as a feature extractor. It covered a big part of our requirements, and was therefore the best choice for us.

The next task was to figure out how the YouTube-8M interface works. It's designed to work with videos, but fortunately it can work with audio as well. This library is pretty flexible, but it has a hardcoded number of sample classes, so we modified it a little to pass the number of classes as a parameter.
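To illustrate the kind of change involved, here is a minimal sketch of accepting the class count as a command-line parameter. This is not the project's actual code: argparse stands in for the TensorFlow flags that YouTube-8M really uses, and the flag name --num_classes and the default of 527 (the AudioSet label vocabulary size) are our assumptions.

```python
import argparse

# Hypothetical sketch: accept the number of target classes as a parameter
# instead of hardcoding it (YouTube-8M was built around its own video
# classes, while Google AudioSet has a different label vocabulary).
parser = argparse.ArgumentParser()
parser.add_argument("--num_classes", type=int, default=527,
                    help="Number of target classes (527 for Google AudioSet).")

def output_layer_size(argv=None):
    """Return the class count used to size the model's final layer."""
    args, _ = parser.parse_known_args(argv)
    return args.num_classes
```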

YouTube-8M can work with two types of data: aggregated features and frame features. As noted before, Google AudioSet provides data in the form of features. A little more research revealed that those features are in the frame format. Next, we needed to choose a model to train.

GPUs are a more suitable choice for machine learning than CPUs. You can find more info about this here, so we will skip this point and go directly to our setup. For our experiments, we used a PC with a single NVIDIA GTX 970 (4 GB).

In our case, the training time didn’t really matter. We should mention that 1-2 hours of training was enough to make an initial decision about the chosen model and its accuracy.

Of course, we want the best accuracy possible. But training a more complex model (with potentially better accuracy) requires more RAM (video RAM, in the case of a GPU) to fit it.

A full list of YouTube-8M models with descriptions is available here. Because our training data was in the frame format, frame-level models had to be used. Google AudioSet provides a dataset split into three parts: balanced train, unbalanced train, and evaluation. You can get more info about them here.

A modified version of YouTube-8M was used for training and evaluation. It's available here.

The training command looks like:
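The command itself did not survive in this copy of the article, but based on the standard YouTube-8M starter code it would look roughly like the sketch below. Paths are placeholders, the --num_classes flag is the modification mentioned earlier, and the feature name/size are the ones AudioSet uses for its frame-level records; check the modified repository for the exact command.

```shell
# Sketch of a YouTube-8M training invocation (paths are placeholders and
# some flag values are assumptions).
python train.py \
  --train_data_pattern=/path/to/audioset/train*.tfrecord \
  --frame_features=True \
  --model=LstmModel \
  --feature_names=audio_embedding \
  --feature_sizes=128 \
  --num_classes=527 \
  --train_dir=/path/to/train_logs
```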

For LstmModel, we changed the base learning rate to 0.001, as the documentation suggested. We also changed the default value of lstm_cells to 256 because we didn't have enough RAM for more.

As we can see, we got good results during the training step, but good training results don't guarantee good results on the full evaluation.

Let's try the unbalanced train dataset. It has many more samples, so we changed the number of training epochs to 10 (it should be reduced to at least 5, because training on this dataset takes a significant amount of time).

If you want to examine our training logs, download and extract train_logs.tar.gz, then run tensorboard --logdir /path_to_train_logs/ and go to http://127.0.0.1:6006.

YouTube-8M takes many parameters and a lot of them affect the training process.

For example, you can tune the learning rate and number of epochs that will change the training process a lot. There are also three different functions for loss calculation and many other useful variables that you can tune and change to improve the results.


Now that we have some trained models, it's time to add some code to interact with them.

We need to somehow capture audio data from a microphone. We will use PyAudio. It provides a simple interface and can work on most platforms.
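A minimal capture sketch with PyAudio might look like this. The 16 kHz mono, 16-bit format matches what VGGish expects; the chunk size and capture length are assumed values, not necessarily the project's defaults.

```python
SAMPLE_RATE = 16000   # VGGish expects 16 kHz mono input
CHUNK = 1024          # frames per read from the stream
CAPTURE_SECONDS = 5   # amount of audio fed to the classifier at a time

def chunks_for(seconds, rate=SAMPLE_RATE, chunk=CHUNK):
    """Number of CHUNK-sized reads that cover `seconds` of audio."""
    return int(rate / chunk * seconds)

def record(seconds=CAPTURE_SECONDS):
    """Capture `seconds` of raw 16-bit mono audio from the default mic."""
    import pyaudio  # deferred import: requires portaudio and a microphone
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1,
                     rate=SAMPLE_RATE, input=True,
                     frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(chunks_for(seconds))]
    stream.stop_stream()
    stream.close()
    pa.terminate()
    return b"".join(frames)
```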

As we mentioned before, we will use the TensorFlow VGGish model as the feature extractor. Here is a short explanation of the transformation process:

“Dog bark” example from the UrbanSound dataset was used for visualization.

Compute spectrogram using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.

Compute mel spectrogram by mapping the spectrogram to 64 mel bins.

Compute stabilized log mel spectrogram by applying log(mel-spectrum + 0.01) where an offset is used to avoid taking a logarithm of zero.

These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.

These examples are then fed into the VGGish model to extract embeddings.
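The steps above can be sketched in NumPy to make the shapes concrete. This is a simplified illustration, not the project's actual code: the mel filterbank is omitted, and np.hanning is symmetric where VGGish uses a periodic Hann window.

```python
import numpy as np

SAMPLE_RATE = 16000
WIN = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms hop   -> 160 samples
N_MELS = 64                      # mel bands per frame
FRAMES_PER_EXAMPLE = 96          # 96 frames * 10 ms = 0.96 s per example

def stft_magnitudes(signal):
    """Magnitudes of the Short-Time Fourier Transform (Hann-windowed)."""
    window = np.hanning(WIN)
    n_frames = 1 + (len(signal) - WIN) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def frame_examples(log_mel):
    """Split an (n_frames, N_MELS) log-mel matrix into non-overlapping
    0.96 s examples of shape (FRAMES_PER_EXAMPLE, N_MELS)."""
    n = log_mel.shape[0] // FRAMES_PER_EXAMPLE
    return log_mel[:n * FRAMES_PER_EXAMPLE].reshape(n, FRAMES_PER_EXAMPLE,
                                                    N_MELS)

# The mel filterbank itself is omitted for brevity; given a filterbank
# matrix mel_fb of shape (n_fft_bins, N_MELS), the stabilized log mel
# spectrogram is np.log(magnitudes @ mel_fb + 0.01).
```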

And finally, we need an interface to feed the data to the neural network and get the results.

We will use the YouTube-8M interface as an example but will modify it to remove the serialization/deserialization step.

Here you can see the result of our work. Let's take a closer look.

PyAudio uses libportaudio2 and portaudio19-dev so you need to install them to make it work.

Some Python libraries are required. You can install them using pip.
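On a Debian/Ubuntu-style system, the setup described above might look like this. The package and dependency names are assumptions; the project's own README and requirements file are authoritative.

```shell
# System libraries needed by PyAudio (assumed Debian/Ubuntu package names)
sudo apt-get install libportaudio2 portaudio19-dev

# Python dependencies (the exact list may differ; prefer the project's
# requirements file if one is provided)
pip install pyaudio numpy tensorflow
```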

You also need to download the archive with the saved models and extract it to the project root. You can find it here.

Our project provides three interfaces to use.

Simply run python parse_file.py path_to_your_file.wav and you will see something like Speech: 0.75, Music: 0.12, Inside, large room or hall: 0.03 in the terminal.

The result depends on the input file. These values are the predictions that the neural network has made. A higher value means a higher chance of the input file belonging to that class.

python capture.py starts a process that will capture data from your microphone indefinitely. It feeds data to the classification interface every 5-7 seconds (by default). The results appear in the same format as in the previous example.

You can run it with --save_path=/path_to_samples_dir/; in this case, all captured data will be stored in the provided directory as wav files. This is useful if you want to try different models on the same sample(s). Use the --help parameter to get more info.

python daemon.py implements a simple web interface that is available on http://127.0.0.1:8000 by default. We use the same code as for the previous example. You can see the last ten predictions on the events page (http://127.0.0.1:8000/events).

Last but not least is integration with the IoT infrastructure. If you run the web interface that we mentioned in the previous section, then you can find the DeviceHive client status and configuration on the index page. As long as the client is connected, predictions will be sent to the specified device as notifications.

TensorFlow is a very flexible tool, as you can see, and can be helpful in many machine learning applications like image and sound recognition. Having such a solution together with an IoT platform allows you to build a smart solution over a very wide area.

Smart cities could use this for security purposes, continuously listening for broken glass, gunfire, and other sounds related to crimes. Even in rainforests, such a solution could be used to track wild animals or birds by analyzing their voices.

The IoT platform can then deliver all such notifications. This solution can be installed on local devices (though it can still be deployed as a cloud service) to minimize traffic and cloud expenses, and it can be customized to deliver only notifications instead of raw audio. Do not forget that this is an open-source project, so please feel free to use it.
