Welcome to shabda’s documentation!¶
About¶
Introduction¶
In the recent past, Deep Learning models have proven their potential in many application areas; however, their entry into the embedded world comes with its own twists and practical difficulties.
Problem Statement¶
To come up with a framework that enables fast prototyping of Deep Learning models for audio (to start with!) and provides an easy way to port the models onto Android using TFLite.
Proposed Solution¶
Come up with the following modular components, which can then be used as plug-and-play components:
- Dataset modules with preprocessing modules
- DataIterator modules
- Tensorflow Models (Estimators)
- Engine to run the models
- Tensorflow model serving using TFLite
- Web app
- Mobile
Architecture¶
Dataset¶
- FreeSound from Kaggle
- Speech Recognition
Python Environment¶
conda create -n shabda python=3.6
source activate shabda
pip install -r requirements.txt
Audio Basics¶
What Does the Unit kHz Mean in Digital Music?
kHz is short for kilohertz, and is a measurement of frequency (cycles per second).
In digital audio, this measurement describes the number of data chunks used per second to represent an analog sound in digital form. The number of chunks per second is known as the sampling rate or sampling frequency.
This is often confused with another popular term in digital audio, bitrate (measured in kbps). The difference between the two is that bitrate measures how much data is used to encode each second (the size of the chunks), whereas the sampling rate measures the number of chunks per second (the frequency).
Note: in this context, kHz is sometimes referred to as the sampling rate, sampling frequency, or cycles per second.
What is the Mel scale?
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear.
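The conversion between Hertz and Mel is typically implemented with a log formula. A minimal sketch, assuming the common HTK-style convention (this particular formula is an illustration, not something defined by Shabda):

import numpy as np

def hz_to_mel(f_hz):
    # HTK-style Hz -> Mel conversion
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse conversion: Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps on the Mel scale are much narrower in Hz at low frequencies
print(hz_to_mel(np.array([100.0, 1000.0, 8000.0])))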
Audio Features
- We start with a speech signal; assume it is sampled at 16 kHz.
- Frame the signal into 20-40 ms frames. 25ms is standard.
- This means the frame length for a 16kHz signal is 0.025*16000 = 400 samples.
- The frame step is usually something like 10 ms (160 samples), which allows some overlap between frames.
- The first 400-sample frame starts at sample 0, the next 400-sample frame starts at sample 160, and so on until the end of the speech file is reached.
- If the speech file does not divide into an even number of frames, pad it with zeros so that it does.
- Audio signal file : 0 to N seconds
- Audio frame : interval of 20 - 40 ms -> default 25 ms -> 0.025 * 16000 = 400 samples
- Frame step : default 10 ms -> 0.010 * 16000 = 160 samples
- First frame : samples 0 to 400
- Second frame : samples 160 to 560, and so on (see the NumPy framing sketch after the diagram below)
25ms    25ms    25ms    25ms    ...   Frames
400     400     400     400     ...   Samples/Frame
|-------|-------|-------|-------|-------|-------|-------|-------|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
10      10      10      10      ...   Frame Step (10 ms = 160 samples)
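A minimal NumPy sketch of this framing (the signal here is a random placeholder standing in for a real 16 kHz waveform):

import numpy as np

sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # 400 samples per 25 ms frame
frame_step = int(0.010 * sample_rate)  # 160 samples per 10 ms step

signal = np.random.randn(sample_rate)  # placeholder: 1 second of noise

# Pad with zeros so the signal divides into a whole number of frames
num_frames = 1 + int(np.ceil((len(signal) - frame_len) / frame_step))
pad_len = (num_frames - 1) * frame_step + frame_len
padded = np.append(signal, np.zeros(pad_len - len(signal)))

# Row i of `indices` selects samples [i * 160, i * 160 + 400)
indices = (np.arange(frame_len)[None, :] +
           frame_step * np.arange(num_frames)[:, None])
frames = padded[indices]
print(frames.shape)  # (num_frames, 400)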
Still not clear? Think of the framed audio signal as a time series of 25 ms windows taken every 10 ms (the frame step).
Check out this Jupyter notebook @ https://timsainb.github.io/spectrograms-mfccs-and-inversion-in-python.html
Forked version @ https://github.com/dhiraa/python_spectrograms_and_inversion
Mel Frequency Cepstral Coefficient (MFCC) tutorial
Reading Audio Files¶
The audio files are pulse-code modulated (PCM) with a bit depth of 16 and a sampling rate of 44.1 kHz
- Bit-depth = 16: The amplitude of each sample in the audio is one of 2^16 (= 65536) possible values.
- Sampling rate = 44.1 kHz: Each second of the audio consists of 44100 samples. So, if the duration of the audio file is 3.2 seconds, the audio will consist of 44100 * 3.2 = 141120 values.
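A minimal sketch of reading such a file with SciPy (the file name is a placeholder; any 16-bit PCM WAV works):

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("audio_sample.wav")  # placeholder path

print(rate)        # e.g. 44100
print(data.dtype)  # int16 for 16-bit PCM
print(data.shape)  # roughly 44100 * duration_in_seconds samples (per channel)

# Scale to floats in [-1, 1] before feature extraction
signal = data.astype(np.float32) / np.iinfo(np.int16).max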
Audio Features¶
- Zero Cross Rate
- Energy
- Entropy of Energy
- Spectral Centroid
- Spectral Spread
- Spectral Entropy
- Spectral Flux
- Spectral Roll off
- MFCC
- Chroma Vector
- Chroma Deviation
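Several of these features can be computed with librosa. A minimal sketch, assuming librosa is installed and using a placeholder file path:

import librosa

y, sr = librosa.load("audio_sample.wav", sr=None)  # mono float signal, native rate

zcr = librosa.feature.zero_crossing_rate(y)                # Zero Cross Rate
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # Spectral Centroid
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)     # Spectral Roll off
chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # Chroma Vector
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCC

print(zcr.shape, centroid.shape, mfccs.shape)  # each is (n_coefficients, n_frames)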
Introduction to MFCC¶
Before the Deep Learning era, people developed techniques to extract features from audio signals, and it turns out that these techniques are still useful. One such technique is computing the MFCC (Mel Frequency Cepstral Coefficients) from the raw audio. Before we jump to MFCC, let's talk about extracting features from sound.
If we just want to classify some sound, we should build features that are speaker independent. Any feature that only gives information about the speaker (like the pitch of their voice) will not be helpful for classification. In other words, we should extract features that depend on the “content” of the audio rather than the nature of the speaker. Also, a good feature extraction technique should mimic the human speech perception. We don’t hear loudness on a linear scale. If we want to double the perceived loudness of a sound, we have to put 8 times as much energy into it. Instead of a linear scale, our perception system uses a log scale.
Taking these things into account, Davis and Mermelstein came up with MFCC in the 1980s. MFCC mimics the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics. The underlying mathematics is quite involved, so we will skip it here. For those interested, here is the detailed explanation.
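A minimal sketch of computing MFCCs with the python_speech_features package (assumed to be installed; the file path is a placeholder), using the 25 ms / 10 ms framing discussed earlier:

from python_speech_features import mfcc
from scipy.io import wavfile

rate, signal = wavfile.read("audio_sample.wav")  # placeholder path, mono PCM

# 13 cepstral coefficients per 25 ms frame, hopped every 10 ms;
# nfft=2048 is large enough to cover a 25 ms window at 44.1 kHz
mfcc_feat = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                 numcep=13, nfft=2048)
print(mfcc_feat.shape)  # (num_frames, 13)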
FFT/STFT Cheat Sheet:¶
- FFT: Fast Fourier transform. A method for computing the discrete Fourier transform of a signal. Its "fastness" relies on the transform size being a power of 2.
- STFT: Short-time Fourier transform. A method for analyzing a signal whose frequency content changes over time. The signal is broken into small, often overlapping frames, and the FFT is computed for each frame (i.e., the frequency content is assumed not to change within a frame, but subsequent analysis frames can be compared to understand how the frequency content changes over time).
- IFFT: Inverse Fast Fourier transform. Takes a spectrum buffer (a complex vector) of N bins and transforms it into N audio samples.
- FFT size: The number of samples over which the FFT is computed; also the number of “bins” that comprise the analysis output.
- Bin: The content of a bin denotes the magnitude (and phase) of the frequency corresponding to the bin number. The N bins of an N-sample FFT evenly (linearly) partition the spectrum from 0Hz to the sample rate. Note that for real signals (including audio), we can discard the latter half of the bins, using only the bins from 0Hz to the Nyquist frequency.
- Window function: Before computing the FFT, the signal is multiplied by a window function. The simplest window is a rectangular window, which multiplies everything inside the frame by 1 and everything outside the frame by 0. However, in practice, we choose a smoother window function that is 1 in the middle of the window and tapers to 0 or near-0 at the edges. The choice of window depends on the application.
- Zero-padding: It is common practice to use a smaller window size than the FFT size, then "zero-pad" all the samples that lie between the edges of the window and the edges of the FFT frame.
- Hop size: In STFT, you must decide how frequently to perform FFT computations on the signal. If your FFT size is 512 samples, and you have a hop size of 512 samples, you are sliding the analysis frame along the signal with no overlap, nor any space between analyses. If your hop size is 256 samples, you are using 50% overlap. The hop size can be small (high overlap) if you want to very faithfully recreate the sound using an IFFT, or very large if you’re only concerned about the spectrum’s or spectral features’ values every now and then.
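A minimal STFT sketch with SciPy that ties these terms together: FFT size 512, a Hann window, and a hop size of 256 samples (50% overlap). The test tone is a synthetic placeholder:

import numpy as np
from scipy import signal as sig

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * (440 + 220 * t) * t)  # 1-second sweep from 440 Hz to 880 Hz

f, times, Zxx = sig.stft(x, fs=fs, window="hann", nperseg=512, noverlap=256)

magnitude = np.abs(Zxx)  # per-bin magnitudes; bins span 0 Hz to the Nyquist frequency
print(f.shape, times.shape, magnitude.shape)  # (257,), (n_frames,), (257, n_frames)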
Videos¶
- https://youtu.be/1RIA9U5oXro
- https://youtu.be/PjlIKVnKe8I
Setup¶
Ubuntu Environment Setup¶
Java¶
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
sudo apt-get install unzip
#check java version
java -version
Gradle¶
Before downloading the Gradle package, check the official site for the latest version here
For tutorials check here!
sudo mkdir /opt/gradle
wget https://services.gradle.org/distributions/gradle-4.10.2-bin.zip
sudo unzip -d /opt/gradle gradle-4.10.2-bin.zip
or
curl -s https://get.sdkman.io | bash   # install SDKMAN
#install gradle 3.5 (or any version 3.0+ or 4.0+)
sdk install gradle 3.5
#check the installed version
gradle -v
#switching between versions
sdk use gradle 4.0
Add the Gradle binaries to the user environment path:
vim ~/.bashrc
# add following to the file
export PATH=$PATH:/opt/gradle/gradle-4.10.2/bin
source ~/.bashrc
#test the installation
gradle -v
Python Environment¶
conda create -n shabda python=3.6
source activate shabda
cd path/to/shabda/
pip install -e .[tensorflow-cpu]
#or
pip install -e .[tensorflow-gpu]
pip install -r requirements.txt
Git Configure¶
https://help.github.com/articles/setting-your-commit-email-address-in-git/
git clone https://github.com/dhiraa/shabda
#to push without entering the password every time
git remote rm origin
git remote add origin https://USERNAME:PASSWORD@github.com/dhiraa/shabda.git
#checkout remote branch
git checkout -b branch_name origin/branch_name
IDE Setup¶
A decent IDE integration is a good start for any development.
Since we aim to support different platforms, we naturally end up dealing with different programming languages:
- Python for Tensorflow
- Kotlin for Android
- Java for Audio libraries
- Scala for Web development or Big data, if any.
- C++ for any performance requirements
Build Tool¶
- Gradle has been identified as the build tool for the JVM languages. (Android uses Gradle by default.)
Python Support¶
IntelliJ module configuration is provided along with this repo @
/path/to/shabda/src/main/python/shabda/python.iml
IntelliJ
File -> Project Structure -> Modules
Select Shabda (2nd one)
Go to the Dependencies tab -> Select a Python environment, e.g. shabda
Shabda Project Structure¶
The project has three components:
- Python-based model development framework using TensorFlow
- Java/Kotlin framework to imitate the Python pre-processing and post-processing steps
- Android application that runs the model using TFLite and the JAR from the step above.
|
|- android : Android Applications
|- bin : Any runnable
|- build : Gradle build output
|- data : Open Datasets
|- docs : Documentation
|- gradle : Gradle Wrapper executable JAR & configuration properties
|- intellij : IntelliJ IDE configuration
|- notebooks : Jupyter notebooks
|- shabda : Python source code
|- src : JVM source code
|- .gitignore :
|- .pylint.rc :
|- .travis.yml : Travis CI build script
|- build.gradle : Gradle build script for configuring the current project
|- gradlew : Gradle Wrapper script for Unix-based systems
|- gradlew.bat : Gradle Wrapper script for Windows
|- LICENSE
|- README.md
|- readthedocs.yml : Readthedocs build script
|- requirements.txt : Python library requirements
|- settings.gradle : Gradle settings script for configuring the Gradle build
|- setup.py
API¶
Executor¶
class shabda.run.Executor(model, data_iterator, config, model_hparams=None, train_hooks=None, eval_hooks=None, session_config=None)[source]¶
Bases: object
Class that executes training, evaluation, prediction, export, and other actions of tf.estimator.Estimator.
Args:
- model: An instance of a subclass of ModelBase.
- data_hparams: A dict or an instance of HParams containing the hyperparameters of data. It must contain train and/or eval fields for the relevant processes. For example, for train_and_evaluate(), both fields are required.
- config: An instance of tf.estimator.RunConfig, used as the config argument of Estimator.
- model_hparams (optional): A dict or an instance of HParams containing the hyperparameters of the model. If None, uses model.hparams. Used as the params argument of Estimator.
- train_hooks (optional): Iterable of tf.train.SessionRunHook objects to run during training.
- eval_hooks (optional): Iterable of tf.train.SessionRunHook objects to run during evaluation.
- session_config (optional): An instance of tf.ConfigProto, used as the config argument of the tf session.
Example:
TODO
See bin/train.py for the usage in detail.
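Since the example above is still a TODO, here is a hypothetical sketch based only on the constructor and methods documented on this page; MyModel and MyDataIterator are placeholder names, not real Shabda classes:

import tensorflow as tf
from shabda.run import Executor

# Placeholders for a concrete ModelBase subclass and a data iterator;
# substitute real Shabda classes here.
model = MyModel()
data_iterator = MyDataIterator()

run_config = tf.estimator.RunConfig(model_dir="/tmp/shabda_model")

executor = Executor(model=model,
                    data_iterator=data_iterator,
                    config=run_config)

# Train for at most 10000 steps, evaluating for 100 steps at a time
executor.train_and_evaluate(max_train_steps=10000, eval_steps=100)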
evaluate(steps=None, checkpoint_path=None)[source]¶
Evaluates the model. See tf.estimator.Estimator.evaluate for more details.
Args:
- steps (int, optional): Number of steps for which to evaluate the model. If None, evaluates until the eval data raises an OutOfRange exception.
- checkpoint_path (str, optional): Path of a specific checkpoint to evaluate. If None, the latest checkpoint in config.model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized variables instead of ones restored from a checkpoint.
train(max_steps=None)[source]¶
Trains the model. See tf.estimator.Estimator.train for more details.
Args:
- max_steps (int, optional): Total number of steps for which to train the model. If None, trains forever or until the train data generates the OutOfRange exception. If OutOfRange occurs in the middle, training stops before max_steps steps.
train_and_evaluate(max_train_steps=None, eval_steps=None)[source]¶
Trains and evaluates the model. See tf.estimator.train_and_evaluate for more details.
Args:
- max_train_steps (int, optional): Total number of steps for which to train the model. If None, trains forever or until the train data generates the OutOfRange exception. If OutOfRange occurs in the middle, training stops before max_train_steps steps.
- eval_steps (int, optional): Number of steps for which to evaluate the model. If None, evaluates until the eval data raises an OutOfRange exception.
DatasetFactory¶
AudioDatasetBase¶
FreeSoundAudioDataset¶
HParams¶
class shabda.hyperparams.HParams(hparams, default_hparams, allow_new_hparam=False)[source]¶
Bases: object
A class that maintains hyperparameters for configuring Shabda modules.
The class has several useful features:
Auto-completion of missing values. Users can specify only the subset of hyperparameters they care about; other hyperparameters will automatically take their default values. The auto-completion is performed recursively, so that hyperparameters taking dict values will also be auto-completed. All Shabda modules provide a default_hparams() containing the allowed hyperparameters and their default values. For example:

## Recursive auto-completion
default_hparams = {"a": 1, "b": {"c": 2, "d": 3}}
hparams = {"b": {"c": 22}}
hparams_ = HParams(hparams, default_hparams)
hparams_.todict() == {"a": 1, "b": {"c": 22, "d": 3}}
# "a" and "d" are auto-completed

## All Shabda modules have built-in `default_hparams`
hparams = {"dropout_rate": 0.1}
emb = tx.modules.WordEmbedder(hparams=hparams, ...)
emb.hparams.todict() == {
    "dropout_rate": 0.1,  # provided value
    "dim": 100            # default value
}
Automatic typecheck. For most hyperparameters, the provided value must have the same or a compatible dtype as the default value. HParams performs the necessary typechecks and raises an error if an improper dtype is provided. Also, hyperparameters not listed in default_hparams are not allowed, except for "kwargs" as detailed below.
Flexible dtype for specified hyperparameters. Some hyperparameters may allow different dtypes of values.
- Hyperparameters named "type" are not typechecked. For example, in get_rnn_cell(), hyperparameter "type" can take the value of an RNNCell class, its string name or module path, or an RNNCell class instance. (A string name or module path is allowed so that users can specify the value in YAML config files.)
- For other hyperparameters, list them in the "@no_typecheck" field in default_hparams to skip the typecheck. For example, in get_rnn_cell(), hyperparameter "_keep_prob" can be set to either a float or a tf.placeholder.
Special flexibility for keyword argument hyperparameters. Hyperparameters named "kwargs" are used as keyword arguments for a class constructor or a function call. Such hyperparameters take a dict, and users can add arbitrary valid keyword arguments to the dict. For example:

default_rnn_cell_hparams = {
    "type": "BasicLSTMCell",
    "kwargs": {
        "num_units": 256
    }
    # Other hyperparameters
}

my_hparams = {
    "kwargs": {
        "num_units": 123,
        "forget_bias": 0.0,
        "activation": "tf.nn.relu"
        # Other valid keyword arguments for the BasicLSTMCell constructor
    }
}

_ = HParams(my_hparams, default_rnn_cell_hparams)
Rich interfaces. An HParams instance provides rich interfaces for accessing, updating, or adding hyperparameters.

hparams = HParams(my_hparams, default_hparams)

# Access
hparams.type == hparams["type"]

# Update
hparams.type = "GRUCell"
hparams.kwargs = {"num_units": 100}
hparams.kwargs.num_units == 100

# Add new
hparams.add_hparam("index", 1)
hparams.index == 1

# Convert to `dict` (recursively)
type(hparams.todict()) == dict

# I/O
with open("hparams.dump", 'wb') as f:
    pickle.dump(hparams, f)
with open("hparams.dump", 'rb') as f:
    hparams_loaded = pickle.load(f)
Args:
- hparams: A dict or an HParams instance containing hyperparameters. If None, all hyperparameters are set to default values.
- default_hparams (dict): Hyperparameters with default values. If None, hyperparameters are fully defined by hparams.
- allow_new_hparam (bool): If False (default), hparams cannot contain hyperparameters that are not included in default_hparams, except for the case of "kwargs" as above.