Abhiroop Talasila 2021-03-28 22:52:17 +05:30
Parent 0f1168d78f
Commit cd31393a94
11 changed files with 1091 additions and 0 deletions

21
autosub/LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Abhiroop Talasila

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

93
autosub/README.md Normal file

@@ -0,0 +1,93 @@
# AutoSub
- [AutoSub](#autosub)
- [About](#about)
- [Motivation](#motivation)
- [Installation](#installation)
- [How-to example](#how-to-example)
- [How it works](#how-it-works)
- [TO-DO](#to-do)
- [Contributing](#contributing)
- [References](#references)
## About

AutoSub is a CLI application that generates a subtitle file (.srt) for any video file using [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech). I use the DeepSpeech Python API to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio on silent segments, producing multiple smaller files.
## Motivation

In the age of OTT platforms, there are still some of us who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream them. I am one of them, and on one such occasion I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it.
## Installation
* Clone the repo. All further steps should be performed while in the `AutoSub/` directory
```bash
$ git clone https://github.com/abhirooptalasila/AutoSub
$ cd AutoSub
```
* Create a Python virtual environment and install the required packages
```bash
$ python3 -m venv sub
$ source sub/bin/activate
$ pip3 install -r requirements.txt
```
* Download the model and scorer files from the DeepSpeech repo. The scorer file is optional, but it greatly improves inference results.
```bash
# Model file (~190 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
# Scorer file (~950 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
```
* Create two folders, `audio/` and `output/`, to store the audio segments and the final SRT file
```bash
$ mkdir audio output
```
* Install FFmpeg. If you're running Ubuntu, this should work fine.
```bash
$ sudo apt-get install ffmpeg
$ ffmpeg -version # I'm running 4.1.4
```
* [OPTIONAL] If you would like the subtitles to be generated faster, you can use the GPU package instead. Make sure to install the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version.
```bash
$ source sub/bin/activate
$ pip3 install deepspeech-gpu
```
## How-to example
* After following the installation instructions, you can run `autosub/main.py` as shown below. The `--model` and `--scorer` arguments take the absolute paths of the respective files, and the `--file` argument is the video file for which the SRT file is to be generated.
```bash
$ python3 autosub/main.py --model /home/AutoSub/deepspeech-0.9.3-models.pbmm --scorer /home/AutoSub/deepspeech-0.9.3-models.scorer --file ~/movie.mp4
```
* After the script finishes, the SRT file is saved in `output/`
* Open the video file and add this SRT file as a subtitle, or just drag and drop it into VLC.
## How it works
Mozilla DeepSpeech is an amazing open-source speech-to-text engine with support for fine-tuning on custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. When you first run the script, I use FFmpeg to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech is configured to accept 16kHz audio samples for inference, so while extracting I make FFmpeg use a 16kHz sampling rate.
Then, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal: it takes the large audio file extracted initially and splits it wherever silent regions are encountered, resulting in smaller audio segments which are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio segments are stored in `audio/`. Then, for each audio segment, I perform DeepSpeech inference on it and write the inferred text to an SRT file. After all segments are processed, the final SRT file is stored in `output/`.
When I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video**. My config is a dual-core i5 @ 2.5 GHz with 8 GB of RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file.
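
For reference, here is a minimal sketch of that pipeline using the project's own modules (run from the directory containing `main.py`; the file names below are placeholders):

```python
# Minimal sketch of the AutoSub pipeline; "movie.mp4" and the model path are placeholders.
from deepspeech import Model
from audioProcessing import extract_audio
from segmentAudio import silenceRemoval

ds = Model("deepspeech-0.9.3-models.pbmm")     # load the acoustic model
extract_audio("movie.mp4", "audio/movie.wav")  # FFmpeg -> 16kHz mono WAV in audio/
silenceRemoval("audio/movie.wav")              # writes audio/movie_<start>-<end>.wav segments
# main.py then runs ds.stt() on each segment and appends the text to output/movie.srt
```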
## TO-DO
* Pre-process inferred text before writing to file (prettify)
* Add progress bar to `extract_audio()`
* GUI support (?)
## Contributing
I would love to follow up on any suggestions/issues you find :)
## References
1. https://github.com/mozilla/DeepSpeech/
2. https://github.com/tyiannak/pyAudioAnalysis
3. https://deepspeech.readthedocs.io/

48
autosub/autosub/audioProcessing.py Normal file

@@ -0,0 +1,48 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import shlex
import subprocess
import numpy as np
from shlex import quote


def extract_audio(input_file, audio_file_name):
    """Extract audio from input video file and save to audio/ in root dir

    Args:
        input_file: input video file
        audio_file_name: save audio WAV file with same filename as video file
    """
    command = "ffmpeg -hide_banner -loglevel warning -i {} -b:a 192k -ac 1 -ar 16000 -vn {}".format(input_file, audio_file_name)
    try:
        ret = subprocess.call(command, shell=True)
        print("Extracted audio to audio/{}".format(audio_file_name.split("/")[-1]))
    except Exception as e:
        print("Error: ", str(e))
        exit(1)


def convert_samplerate(audio_path, desired_sample_rate):
    """Convert extracted audio to the format expected by DeepSpeech
    ***WON'T be called as extract_audio() converts the audio to 16kHz while saving***

    Args:
        audio_path: audio file path
        desired_sample_rate: DeepSpeech expects 16kHz

    Returns:
        numpy buffer: audio signal stored in numpy array
    """
    sox_cmd = "sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - ".format(
        quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(
            shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("SoX returned non-zero status: {}".format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, "SoX not found, use {}hz files or install it: {}".format(
            desired_sample_rate, e.strerror))

    return np.frombuffer(output, np.int16)
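
A hedged usage sketch of the two helpers above; the media paths are placeholders, and `convert_samplerate()` additionally assumes SoX is installed:

```python
# Illustrative only: extract 16kHz audio, then (optionally) resample some other WAV.
from audioProcessing import extract_audio, convert_samplerate

extract_audio("movie.mp4", "audio/movie.wav")           # FFmpeg writes a 16kHz mono WAV
samples = convert_samplerate("other_clip.wav", 16000)   # fallback path for non-16kHz audio
print(samples.dtype, len(samples))                      # int16 numpy buffer, ready for ds.stt()
```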

413
autosub/autosub/featureExtraction.py Normal file

@@ -0,0 +1,413 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import math
import numpy as np
from scipy.fftpack import fft
from scipy.signal import lfilter
from scipy.fftpack.realtransforms import dct

eps = 0.00000001


def zero_crossing_rate(frame):
    """Computes zero crossing rate of frame
    """
    count = len(frame)
    count_zero = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_zero) / np.float64(count - 1.0)


def energy(frame):
    """Computes signal energy of frame
    """
    return np.sum(frame ** 2) / np.float64(len(frame))


def energy_entropy(frame, n_short_blocks=10):
    """Computes entropy of energy
    """
    # total frame energy
    frame_energy = np.sum(frame ** 2)
    frame_length = len(frame)
    sub_win_len = int(np.floor(frame_length / n_short_blocks))

    if frame_length != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]

    # sub_wins is of size [n_short_blocks x L]
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # Compute normalized sub-frame energies:
    s = np.sum(sub_wins ** 2, axis=0) / (frame_energy + eps)

    # Compute entropy of the normalized sub-frame energies:
    entropy = -np.sum(s * np.log2(s + eps))
    return entropy


""" Frequency-domain audio features """


def spectral_centroid_spread(fft_magnitude, sampling_rate):
    """Computes spectral centroid of frame (given abs(FFT))
    """
    ind = (np.arange(1, len(fft_magnitude) + 1)) * \
          (sampling_rate / (2.0 * len(fft_magnitude)))

    Xt = fft_magnitude.copy()
    Xt = Xt / Xt.max()
    NUM = np.sum(ind * Xt)
    DEN = np.sum(Xt) + eps

    # Centroid:
    centroid = (NUM / DEN)

    # Spread:
    spread = np.sqrt(np.sum(((ind - centroid) ** 2) * Xt) / DEN)

    # Normalize:
    centroid = centroid / (sampling_rate / 2.0)
    spread = spread / (sampling_rate / 2.0)

    return centroid, spread


def spectral_entropy(signal, n_short_blocks=10):
    """Computes the spectral entropy
    """
    # number of frame samples
    num_frames = len(signal)

    # total spectral energy
    total_energy = np.sum(signal ** 2)

    # length of sub-frame
    sub_win_len = int(np.floor(num_frames / n_short_blocks))
    if num_frames != sub_win_len * n_short_blocks:
        signal = signal[0:sub_win_len * n_short_blocks]

    # define sub-frames (using matrix reshape)
    sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # compute spectral sub-energies
    s = np.sum(sub_wins ** 2, axis=0) / (total_energy + eps)

    # compute spectral entropy
    entropy = -np.sum(s * np.log2(s + eps))

    return entropy


def spectral_flux(fft_magnitude, previous_fft_magnitude):
    """Computes the spectral flux feature of the current frame

    Args:
        fft_magnitude : the abs(fft) of the current frame
        previous_fft_magnitude : the abs(fft) of the previous frame
    """
    # compute the spectral flux as the sum of square distances:
    fft_sum = np.sum(fft_magnitude + eps)
    previous_fft_sum = np.sum(previous_fft_magnitude + eps)
    sp_flux = np.sum(
        (fft_magnitude / fft_sum - previous_fft_magnitude /
         previous_fft_sum) ** 2)

    return sp_flux


def spectral_rolloff(signal, c):
    """Computes spectral roll-off
    """
    energy = np.sum(signal ** 2)
    fft_length = len(signal)
    threshold = c * energy

    # Find the spectral rolloff as the frequency position
    # where the respective spectral energy is equal to c*totalEnergy
    cumulative_sum = np.cumsum(signal ** 2) + eps
    a = np.nonzero(cumulative_sum > threshold)[0]
    if len(a) > 0:
        sp_rolloff = np.float64(a[0]) / (float(fft_length))
    else:
        sp_rolloff = 0.0

    return sp_rolloff


def mfcc_filter_banks(sampling_rate, num_fft, lowfreq=133.33, linc=200 / 3,
                      logsc=1.0711703, num_lin_filt=13, num_log_filt=27):
    """Computes the triangular filterbank for MFCC computation
    (used in the stFeatureExtraction function before the stMFCC function call)
    This function is taken from the scikits.talkbox library (MIT Licence):
    https://pypi.python.org/pypi/scikits.talkbox
    """
    if sampling_rate < 8000:
        nlogfil = 5

    # Total number of filters
    num_filt_total = num_lin_filt + num_log_filt

    # Compute frequency points of the triangle:
    frequencies = np.zeros(num_filt_total + 2)
    frequencies[:num_lin_filt] = lowfreq + np.arange(num_lin_filt) * linc
    frequencies[num_lin_filt:] = frequencies[num_lin_filt - 1] * logsc ** \
                                 np.arange(1, num_log_filt + 3)
    heights = 2. / (frequencies[2:] - frequencies[0:-2])

    # Compute filterbank coeff (in fft domain, in bins)
    fbank = np.zeros((num_filt_total, num_fft))
    nfreqs = np.arange(num_fft) / (1. * num_fft) * sampling_rate

    for i in range(num_filt_total):
        low_freqs = frequencies[i]
        cent_freqs = frequencies[i + 1]
        high_freqs = frequencies[i + 2]

        lid = np.arange(np.floor(low_freqs * num_fft / sampling_rate) + 1,
                        np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        lslope = heights[i] / (cent_freqs - low_freqs)
        rid = np.arange(np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        np.floor(high_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        rslope = heights[i] / (high_freqs - cent_freqs)
        fbank[i][lid] = lslope * (nfreqs[lid] - low_freqs)
        fbank[i][rid] = rslope * (high_freqs - nfreqs[rid])

    return fbank, frequencies


def mfcc(fft_magnitude, fbank, num_mfcc_feats):
    """Computes the MFCCs of a frame, given the fft mag

    Args:
        fft_magnitude : fft magnitude abs(FFT)
        fbank : filter bank (see mfccInitFilterBanks)

    Returns:
        ceps : MFCCs (13 element vector)

    Note: MFCC calculation is, in general, taken from the
          scikits.talkbox library (MIT Licence),
          with a small number of modifications to make it more
          compact and suitable for the pyAudioAnalysis Lib
    """
    mspec = np.log10(np.dot(fft_magnitude, fbank.T) + eps)
    ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:num_mfcc_feats]
    return ceps


def chroma_features_init(num_fft, sampling_rate):
    """This function initializes the chroma matrices used in the calculation
    of the chroma features
    """
    freqs = np.array([((f + 1) * sampling_rate) /
                      (2 * num_fft) for f in range(num_fft)])
    cp = 27.50
    num_chroma = np.round(12.0 * np.log2(freqs / cp)).astype(int)
    num_freqs_per_chroma = np.zeros((num_chroma.shape[0],))
    unique_chroma = np.unique(num_chroma)

    for u in unique_chroma:
        idx = np.nonzero(num_chroma == u)
        num_freqs_per_chroma[idx] = idx[0].shape

    return num_chroma, num_freqs_per_chroma


def chroma_features(signal, sampling_rate, num_fft):
    # TODO: 1 complexity
    # TODO: 2 bug with large windows
    num_chroma, num_freqs_per_chroma = \
        chroma_features_init(num_fft, sampling_rate)
    chroma_names = ['A', 'A#', 'B', 'C', 'C#', 'D',
                    'D#', 'E', 'F', 'F#', 'G', 'G#']
    spec = signal ** 2

    if num_chroma.max() < num_chroma.shape[0]:
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma] = spec
        C /= num_freqs_per_chroma[num_chroma]
    else:
        I = np.nonzero(num_chroma > num_chroma.shape[0])[0][0]
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma[0:I - 1]] = spec
        C /= num_freqs_per_chroma

    final_matrix = np.zeros((12, 1))
    newD = int(np.ceil(C.shape[0] / 12.0) * 12)
    C2 = np.zeros((newD,))
    C2[0:C.shape[0]] = C
    C2 = C2.reshape(int(C2.shape[0] / 12), 12)

    # for i in range(12):
    #     finalC[i] = np.sum(C[i:C.shape[0]:12])
    final_matrix = np.matrix(np.sum(C2, axis=0)).T
    final_matrix /= spec.sum()

    return chroma_names, final_matrix


""" Windowing and feature extraction """


def feature_extraction(signal, sampling_rate, window, step, deltas=True):
    """This function implements the short-term windowing process.
    For each short-term window a set of features is extracted.
    This results in a sequence of feature vectors, stored in a np matrix.

    Args:
        signal : the input signal samples
        sampling_rate : the sampling freq (in Hz)
        window : the short-term window size (in samples)
        step : the short-term window step (in samples)
        deltas : (opt) True/False if delta features are to be computed

    Returns:
        features (numpy.ndarray) : contains features
                                   (n_feats x numOfShortTermWindows)
        feature_names (numpy.ndarray) : contains feature names
                                        (n_feats x numOfShortTermWindows)
    """
    window = int(window)
    step = int(step)

    # signal normalization
    signal = np.double(signal)
    signal = signal / (2.0 ** 15)

    dc_offset = signal.mean()
    signal_max = (np.abs(signal)).max()
    signal = (signal - dc_offset) / (signal_max + 0.0000000001)

    number_of_samples = len(signal)  # total number of samples
    current_position = 0
    count_fr = 0
    num_fft = int(window / 2)

    # compute the triangular filter banks used in the mfcc calculation
    fbank, freqs = mfcc_filter_banks(sampling_rate, num_fft)

    n_time_spectral_feats = 8
    n_harmonic_feats = 0
    n_mfcc_feats = 13
    n_chroma_feats = 13
    n_total_feats = n_time_spectral_feats + n_mfcc_feats + n_harmonic_feats + \
                    n_chroma_feats
    # n_total_feats = n_time_spectral_feats + n_mfcc_feats +
    #                 n_harmonic_feats

    # define list of feature names
    feature_names = ["zcr", "energy", "energy_entropy"]
    feature_names += ["spectral_centroid", "spectral_spread"]
    feature_names.append("spectral_entropy")
    feature_names.append("spectral_flux")
    feature_names.append("spectral_rolloff")
    feature_names += ["mfcc_{0:d}".format(mfcc_i)
                      for mfcc_i in range(1, n_mfcc_feats + 1)]
    feature_names += ["chroma_{0:d}".format(chroma_i)
                      for chroma_i in range(1, n_chroma_feats)]
    feature_names.append("chroma_std")

    # add names for delta features:
    if deltas:
        feature_names_2 = feature_names + ["delta " + f for f in feature_names]
        feature_names = feature_names_2

    features = []

    # for each short-term window to end of signal
    while current_position + window - 1 < number_of_samples:
        count_fr += 1

        # get current window
        x = signal[current_position:current_position + window]

        # update window position
        current_position = current_position + step

        # get fft magnitude
        fft_magnitude = abs(fft(x))

        # normalize fft
        fft_magnitude = fft_magnitude[0:num_fft]
        fft_magnitude = fft_magnitude / len(fft_magnitude)

        # keep previous fft mag (used in spectral flux)
        if count_fr == 1:
            fft_magnitude_previous = fft_magnitude.copy()
        feature_vector = np.zeros((n_total_feats, 1))

        # zero crossing rate
        feature_vector[0] = zero_crossing_rate(x)

        # short-term energy
        feature_vector[1] = energy(x)

        # short-term entropy of energy
        feature_vector[2] = energy_entropy(x)

        # sp centroid/spread
        [feature_vector[3], feature_vector[4]] = \
            spectral_centroid_spread(fft_magnitude,
                                     sampling_rate)

        # spectral entropy
        feature_vector[5] = \
            spectral_entropy(fft_magnitude)

        # spectral flux
        feature_vector[6] = \
            spectral_flux(fft_magnitude,
                          fft_magnitude_previous)

        # spectral rolloff
        feature_vector[7] = \
            spectral_rolloff(fft_magnitude, 0.90)

        # MFCCs
        mffc_feats_end = n_time_spectral_feats + n_mfcc_feats
        feature_vector[n_time_spectral_feats:mffc_feats_end, 0] = \
            mfcc(fft_magnitude, fbank, n_mfcc_feats).copy()

        # chroma features
        chroma_names, chroma_feature_matrix = \
            chroma_features(fft_magnitude, sampling_rate, num_fft)
        chroma_features_end = n_time_spectral_feats + n_mfcc_feats + \
                              n_chroma_feats - 1
        feature_vector[mffc_feats_end:chroma_features_end] = \
            chroma_feature_matrix
        feature_vector[chroma_features_end] = chroma_feature_matrix.std()

        if not deltas:
            features.append(feature_vector)
        else:
            # delta features
            if count_fr > 1:
                delta = feature_vector - feature_vector_prev
                feature_vector_2 = np.concatenate((feature_vector, delta))
            else:
                feature_vector_2 = np.concatenate((feature_vector,
                                                   np.zeros(feature_vector.shape)))
            feature_vector_prev = feature_vector
            features.append(feature_vector_2)

        fft_magnitude_previous = fft_magnitude.copy()

    features = np.concatenate(features, 1)
    return features, feature_names
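
As a rough illustration of how `segmentAudio.py` drives this module, the sketch below runs `feature_extraction` on one second of synthetic audio with the same 50 ms window/step the project uses (all values are placeholders):

```python
# Illustrative only: short-term features for one second of fake 16kHz PCM audio.
import numpy as np
import featureExtraction as FE

sampling_rate = 16000
signal = (np.random.randn(sampling_rate) * 3000).astype(np.int16)   # stand-in for real samples

feats, names = FE.feature_extraction(signal, sampling_rate,
                                     0.05 * sampling_rate,          # 50 ms window
                                     0.05 * sampling_rate)          # 50 ms step
print(feats.shape)   # (n_feats x num_windows); row 1 is the short-term energy used downstream
print(names[:8])     # the eight time/spectral feature names
```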

135
autosub/autosub/main.py Normal file

@@ -0,0 +1,135 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re
import sys
import wave
import shutil
import argparse
import subprocess
import numpy as np
from tqdm import tqdm
from deepspeech import Model, version

from segmentAudio import silenceRemoval
from audioProcessing import extract_audio, convert_samplerate
from writeToFile import write_to_file

# Line count for SRT file
line_count = 0


def sort_alphanumeric(data):
    """Sort function to sort os.listdir() alphanumerically
    Helps to process audio files sequentially after splitting

    Args:
        data : file name
    """
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]

    return sorted(data, key=alphanum_key)


def ds_process_audio(ds, audio_file, file_handle):
    """Run DeepSpeech inference on each audio file generated after silenceRemoval
    and write to file pointed by file_handle

    Args:
        ds : DeepSpeech Model
        audio_file : audio file
        file_handle : SRT file handle
    """
    global line_count

    fin = wave.open(audio_file, 'rb')
    fs_orig = fin.getframerate()
    desired_sample_rate = ds.sampleRate()

    # Check if sampling rate is required rate (16000)
    # won't be carried out as FFmpeg already converts to 16kHz
    if fs_orig != desired_sample_rate:
        print("Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition".format(fs_orig, desired_sample_rate), file=sys.stderr)
        audio = convert_samplerate(audio_file, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    fin.close()

    # Perform inference on audio segment
    infered_text = ds.stt(audio)

    # File name contains start and end times in seconds. Extract that
    limits = audio_file.split(os.sep)[-1][:-4].split("_")[-1].split("-")

    if len(infered_text) != 0:
        line_count += 1
        write_to_file(file_handle, infered_text, line_count, limits)


def main():
    global line_count
    print("AutoSub v0.1\n")

    parser = argparse.ArgumentParser(description="AutoSub v0.1")
    parser.add_argument('--model', required=True,
                        help='DeepSpeech model file')
    parser.add_argument('--scorer',
                        help='DeepSpeech scorer file')
    parser.add_argument('--file', required=True,
                        help='Input video file')
    args = parser.parse_args()

    ds_model = args.model
    if not ds_model.endswith(".pbmm"):
        print("Invalid model file. Exiting\n")
        exit(1)

    # Load DeepSpeech model
    ds = Model(ds_model)

    if args.scorer:
        ds_scorer = args.scorer
        if not ds_scorer.endswith(".scorer"):
            print("Invalid scorer file. Running inference using only model file\n")
        else:
            ds.enableExternalScorer(ds_scorer)

    input_file = args.file
    print("\nInput file:", input_file)

    base_directory = os.getcwd()
    output_directory = os.path.join(base_directory, "output")
    audio_directory = os.path.join(base_directory, "audio")
    video_file_name = input_file.split(os.sep)[-1].split(".")[0]
    audio_file_name = os.path.join(audio_directory, video_file_name + ".wav")
    srt_file_name = os.path.join(output_directory, video_file_name + ".srt")

    # Extract audio from input video file
    extract_audio(input_file, audio_file_name)

    print("Splitting on silent parts in audio file")
    silenceRemoval(audio_file_name)

    # Output SRT file
    file_handle = open(srt_file_name, "a+")

    print("\nRunning inference:")
    for file in tqdm(sort_alphanumeric(os.listdir(audio_directory))):
        audio_segment_path = os.path.join(audio_directory, file)

        # Don't run inference on the original audio file
        if audio_segment_path.split(os.sep)[-1] != audio_file_name.split(os.sep)[-1]:
            ds_process_audio(ds, audio_segment_path, file_handle)

    print("\nSRT file saved to", srt_file_name)
    file_handle.close()

    # Clean audio/ directory
    shutil.rmtree(audio_directory)
    os.mkdir(audio_directory)


if __name__ == "__main__":
    main()
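
One detail worth illustrating: `sort_alphanumeric` keeps the split segments in playback order, which plain `sorted()` would not (the file names below are made up but follow the `<name>_<start>-<end>.wav` pattern used by `silenceRemoval`):

```python
# Illustrative only: numeric-aware ordering of segment file names.
from main import sort_alphanumeric

files = ["movie_10.500-12.000.wav", "movie_2.000-3.250.wav", "movie_1.000-1.900.wav"]
print(sorted(files))             # lexicographic: the 10.5s segment sorts before the 2.0s one
print(sort_alphanumeric(files))  # numeric-aware: 1.0s, 2.0s, then 10.5s
```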

204
autosub/autosub/segmentAudio.py Normal file

@@ -0,0 +1,204 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import numpy as np
from pydub import AudioSegment
import scipy.io.wavfile as wavfile

import featureExtraction as FE
import trainAudio as TA


def read_audio_file(input_file):
    """This function returns a numpy array that stores the audio samples of a
    specified WAV file

    Args:
        input_file : audio from input video file
    """
    sampling_rate = -1
    signal = np.array([])
    try:
        audiofile = AudioSegment.from_file(input_file)
        data = np.array([])
        if audiofile.sample_width == 2:
            data = np.fromstring(audiofile._data, np.int16)
        elif audiofile.sample_width == 4:
            data = np.fromstring(audiofile._data, np.int32)

        if data.size > 0:
            sampling_rate = audiofile.frame_rate
            temp_signal = []
            for chn in list(range(audiofile.channels)):
                temp_signal.append(data[chn::audiofile.channels])
            signal = np.array(temp_signal).T
    except:
        print("Error: file not found or other I/O error. (DECODING FAILED)")

    if signal.ndim == 2 and signal.shape[1] == 1:
        signal = signal.flatten()

    return sampling_rate, signal


def smooth_moving_avg(signal, window=11):
    window = int(window)
    if signal.ndim != 1:
        raise ValueError("Input signal must be a 1-dimensional array.")
    if signal.size < window:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window < 3:
        return signal

    s = np.r_[2 * signal[0] - signal[window - 1::-1],
              signal, 2 * signal[-1] - signal[-1:-window:-1]]
    w = np.ones(window, 'd')
    y = np.convolve(w / w.sum(), s, mode='same')

    return y[window:-window + 1]


def stereo_to_mono(signal):
    """This function converts the input signal to MONO (if it is STEREO)

    Args:
        signal: audio file stored in a Numpy array
    """
    if signal.ndim == 2:
        if signal.shape[1] == 1:
            signal = signal.flatten()
        else:
            if signal.shape[1] == 2:
                signal = (signal[:, 1] / 2) + (signal[:, 0] / 2)

    return signal


def silence_removal(signal, sampling_rate, st_win, st_step, smooth_window=0.5,
                    weight=0.5):
    """Event Detection (silence removal)

    Args:
        signal : the input audio signal
        sampling_rate : sampling freq
        st_win, st_step : window size and step in seconds
        smooth_window : (optional) smoothing window (in seconds)
        weight : (optional) weight factor (0 < weight < 1); the higher, the more strict

    Returns:
        seg_limits : list of segment limits in seconds (e.g. [[0.1, 0.9],
                     [1.4, 3.0]] means that the resulting segments
                     are (0.1 - 0.9) seconds and (1.4 - 3.0) seconds)
    """

    if weight >= 1:
        weight = 0.99
    if weight <= 0:
        weight = 0.01

    # Step 1: feature extraction
    signal = stereo_to_mono(signal)
    st_feats, _ = FE.feature_extraction(signal, sampling_rate,
                                        st_win * sampling_rate,
                                        st_step * sampling_rate)

    # Step 2: train binary svm classifier of low vs high energy frames
    # keep only the energy short-term sequence (2nd feature)
    st_energy = st_feats[1, :]
    en = np.sort(st_energy)

    # number of 10% of the total short-term windows
    st_windows_fraction = int(len(en) / 10)

    # compute "lower" 10% energy threshold
    low_threshold = np.mean(en[0:st_windows_fraction]) + 1e-15

    # compute "higher" 10% energy threshold
    high_threshold = np.mean(en[-st_windows_fraction:-1]) + 1e-15

    # get all features that correspond to low energy
    low_energy = st_feats[:, np.where(st_energy <= low_threshold)[0]]

    # get all features that correspond to high energy
    high_energy = st_feats[:, np.where(st_energy >= high_threshold)[0]]

    # form the binary classification task and ...
    features = [low_energy.T, high_energy.T]

    # normalize and train the respective svm probabilistic model
    # (ONSET vs SILENCE)
    features_norm, mean, std = TA.normalize_features(features)
    svm = TA.train_svm(features_norm, 1.0)

    # Step 3: compute onset probability based on the trained svm
    prob_on_set = []
    for index in range(st_feats.shape[1]):
        # for each frame
        cur_fv = (st_feats[:, index] - mean) / std

        # get svm probability (that it belongs to the ONSET class)
        prob_on_set.append(svm.predict_proba(cur_fv.reshape(1, -1))[0][1])
    prob_on_set = np.array(prob_on_set)

    # smooth probability:
    prob_on_set = smooth_moving_avg(prob_on_set, smooth_window / st_step)

    # Step 4A: detect onset frame indices:
    prog_on_set_sort = np.sort(prob_on_set)

    # find probability Threshold as a weighted average
    # of top 10% and lower 10% of the values
    nt = int(prog_on_set_sort.shape[0] / 10)
    threshold = (np.mean((1 - weight) * prog_on_set_sort[0:nt]) +
                 weight * np.mean(prog_on_set_sort[-nt::]))

    max_indices = np.where(prob_on_set > threshold)[0]

    # get the indices of the frames that satisfy the thresholding
    index = 0
    seg_limits = []
    time_clusters = []

    # Step 4B: group frame indices to onset segments
    while index < len(max_indices):
        # for each of the detected onset indices
        cur_cluster = [max_indices[index]]
        if index == len(max_indices) - 1:
            break
        while max_indices[index + 1] - cur_cluster[-1] <= 2:
            cur_cluster.append(max_indices[index + 1])
            index += 1
            if index == len(max_indices) - 1:
                break
        index += 1
        time_clusters.append(cur_cluster)
        seg_limits.append([cur_cluster[0] * st_step,
                           cur_cluster[-1] * st_step])

    # Step 5: Post process: remove very small segments:
    min_duration = 0.2
    seg_limits_2 = []
    for s_lim in seg_limits:
        if s_lim[1] - s_lim[0] > min_duration:
            seg_limits_2.append(s_lim)
    seg_limits = seg_limits_2

    return seg_limits


def silenceRemoval(input_file, smoothing_window=1.0, weight=0.2):
    """Remove silence segments from an audio file and split on those segments

    Args:
        input_file : audio from input video file
        smoothing_window : smoothing window size in seconds. Defaults to 1.0.
        weight : weight factor in (0, 1). Defaults to 0.2.
    """
    if not os.path.isfile(input_file):
        raise Exception("Input audio file not found!")

    [fs, x] = read_audio_file(input_file)
    segmentLimits = silence_removal(x, fs, 0.05, 0.05, smoothing_window, weight)

    for i, s in enumerate(segmentLimits):
        strOut = "{0:s}_{1:.3f}-{2:.3f}.wav".format(input_file[0:-4], s[0], s[1])
        wavfile.write(strOut, fs, x[int(fs * s[0]):int(fs * s[1])])


# if __name__ == "__main__":
#     silenceRemoval("video.wav")
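
A hedged example of calling the splitter directly; the WAV path is a placeholder, and the resulting segments are written next to the input file:

```python
# Split an extracted 16kHz WAV on silence; one file is written per detected speech segment.
from segmentAudio import silenceRemoval

silenceRemoval("audio/movie.wav", smoothing_window=1.0, weight=0.2)
# -> audio/movie_0.350-2.100.wav, audio/movie_2.800-5.450.wav, ... (times in seconds)
```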

104
autosub/autosub/trainAudio.py Normal file

@@ -0,0 +1,104 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import csv
import sys
import glob
import signal
import ntpath
import numpy as np
import sklearn.svm

shortTermWindow = 0.050
shortTermStep = 0.050
eps = 0.00000001


def train_svm(features, c_param, kernel='linear'):
    """Train a multi-class probabilistic SVM classifier.
    Note: This function is simply a wrapper to the sklearn functionality
          for SVM training. See function trainSVM_feature() to use a wrapper on
          both the feature extraction and the SVM training
          (and parameter tuning) processes.

    Args:
        features : a list ([numOfClasses x 1]) whose elements contain
                   np matrices of features; each matrix features[i] of
                   class i is [n_samples x numOfDimensions]
        c_param : SVM parameter C (cost of constraints violation)

    Returns:
        svm : the trained SVM variable

    NOTE:
        This function trains a linear-kernel SVM for a given C value.
        For a different kernel, other types of parameters should be provided.
    """
    feature_matrix, labels = features_to_matrix(features)
    svm = sklearn.svm.SVC(C=c_param, kernel=kernel, probability=True,
                          gamma='auto')
    svm.fit(feature_matrix, labels)

    return svm


def normalize_features(features):
    """This function normalizes a feature set to 0-mean and 1-std
    Used in most classifier training cases

    Args:
        features : list of feature matrices (each one of them is a np matrix)

    Returns:
        features_norm : list of NORMALIZED feature matrices
        mean : mean vector
        std : std vector
    """
    temp_feats = np.array([])

    for count, f in enumerate(features):
        if f.shape[0] > 0:
            if count == 0:
                temp_feats = f
            else:
                temp_feats = np.vstack((temp_feats, f))
            count += 1

    mean = np.mean(temp_feats, axis=0) + 1e-14
    std = np.std(temp_feats, axis=0) + 1e-14

    features_norm = []
    for f in features:
        ft = f.copy()
        for n_samples in range(f.shape[0]):
            ft[n_samples, :] = (ft[n_samples, :] - mean) / std
        features_norm.append(ft)

    return features_norm, mean, std


def features_to_matrix(features):
    """This function takes a list of feature matrices as argument and returns
    a single concatenated feature matrix and the respective class labels.

    Args:
        features : a list of feature matrices

    Returns:
        feature_matrix : a concatenated matrix of features
        labels : a vector of class indices
    """
    labels = np.array([])
    feature_matrix = np.array([])

    for i, f in enumerate(features):
        if i == 0:
            feature_matrix = f
            labels = i * np.ones((len(f), 1))
        else:
            feature_matrix = np.vstack((feature_matrix, f))
            labels = np.append(labels, i * np.ones((len(f), 1)))

    return feature_matrix, labels
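
A small self-contained sketch of how `segmentAudio.py` uses these helpers; the random matrices below stand in for the low/high-energy frames, and the feature dimension is arbitrary:

```python
# Illustrative only: normalize two synthetic feature classes and train the probabilistic SVM.
import numpy as np
import trainAudio as TA

low_energy = np.random.randn(50, 34)          # rows = frames, columns = short-term features
high_energy = np.random.randn(60, 34) + 2.0   # shifted so the two classes are separable

features_norm, mean, std = TA.normalize_features([low_energy, high_energy])
svm = TA.train_svm(features_norm, 1.0)        # linear-kernel SVM with probability estimates

frame = (np.random.randn(34) - mean) / std    # normalize a new frame the same way
print(svm.predict_proba(frame.reshape(1, -1))[0][1])   # probability of the "onset" class
```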

32
autosub/autosub/writeToFile.py Normal file

@@ -0,0 +1,32 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import datetime


def write_to_file(file_handle, inferred_text, line_count, limits):
    """Write the inferred text to SRT file
    Follows a specific format for SRT files

    Args:
        file_handle : SRT file handle
        inferred_text : text to be written
        line_count : subtitle line count
        limits : starting and ending times for text
    """
    d = str(datetime.timedelta(seconds=float(limits[0])))
    try:
        from_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        from_dur = "0" + str(d) + "," + "00"

    d = str(datetime.timedelta(seconds=float(limits[1])))
    try:
        to_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        to_dur = "0" + str(d) + "," + "00"

    file_handle.write(str(line_count) + "\n")
    file_handle.write(from_dur + " --> " + to_dur + "\n")
    file_handle.write(inferred_text + "\n\n")
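
A hedged sketch of what a single call produces, using an in-memory buffer in place of the real SRT file handle:

```python
# Write one subtitle entry to an in-memory buffer to show the SRT format produced.
import io
from writeToFile import write_to_file

buf = io.StringIO()
write_to_file(buf, "hello world", 1, ["1.500", "3.250"])
print(buf.getvalue())
# 1
# 00:00:01,50 --> 00:00:03,25
# hello world
```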

13
autosub/requirements.txt Normal file

@@ -0,0 +1,13 @@
cycler==0.10.0
Cython==0.29.21
numpy
deepspeech==0.9.3
joblib==0.16.0
kiwisolver==1.2.0
pydub==0.23.1
pyparsing==2.4.7
python-dateutil==2.8.1
scikit-learn==0.21.3
scipy==1.4.1
six==1.15.0
tqdm==4.44.1

28
autosub/setup.py Normal file

@@ -0,0 +1,28 @@
import os
from setuptools import setup

DIR = os.path.dirname(os.path.abspath(__file__))
INSTALL_PACKAGES = open(os.path.join(DIR, 'requirements.txt')).read().splitlines()

with open("README.md", "r") as fh:
    README = fh.read()

setup(
    name="AutoSub",
    packages=["autosub"],
    version="0.0.1",
    author="Abhiroop Talasila",
    author_email="abhiroop.talasila@gmail.com",
    description="CLI application to generate subtitle file (.srt) for any video file using STT",
    long_description=README,
    install_requires=INSTALL_PACKAGES,
    long_description_content_type="text/markdown",
    url="https://github.com/abhirooptalasila/AutoSub",
    keywords=['speech-to-text', 'deepspeech', 'machine-learning'],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.5',
)
)