diff --git a/autosub/LICENSE b/autosub/LICENSE
new file mode 100644
index 0000000..f3eb50b
--- /dev/null
+++ b/autosub/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2020 Abhiroop Talasila
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/autosub/README.md b/autosub/README.md
new file mode 100644
index 0000000..0ee1eb1
--- /dev/null
+++ b/autosub/README.md
@@ -0,0 +1,93 @@
+# AutoSub
+
+- [AutoSub](#autosub)
+  - [About](#about)
+  - [Motivation](#motivation)
+  - [Installation](#installation)
+  - [How-to example](#how-to-example)
+  - [How it works](#how-it-works)
+  - [TO-DO](#to-do)
+  - [Contributing](#contributing)
+  - [References](#references)
+
+## About
+
+AutoSub is a CLI application to generate a subtitle file (.srt) for any video file using [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech). I use the DeepSpeech Python API to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio on silent segments, producing multiple small files.
+
+
+## Motivation
+
+In the age of OTT platforms, there are still some who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream. I am one of them, and on one such occasion I couldn't find the subtitle file for a particular movie I had downloaded. That's when the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it.
+
+
+## Installation
+
+* Clone the repo. All further steps should be performed while in the `AutoSub/` directory
+  ```bash
+  $ git clone https://github.com/abhirooptalasila/AutoSub
+  $ cd AutoSub
+  ```
+* Create a pip virtual environment to install the required packages
+  ```bash
+  $ python3 -m venv sub
+  $ source sub/bin/activate
+  $ pip3 install -r requirements.txt
+  ```
+* Download the model and scorer files from the DeepSpeech repo. The scorer file is optional, but it greatly improves inference results.
+  ```bash
+  # Model file (~190 MB)
+  $ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
+  # Scorer file (~950 MB)
+  $ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
+  ```
+* Create two folders, `audio/` and `output/`, to store the audio segments and the final SRT file
+  ```bash
+  $ mkdir audio output
+  ```
+* Install FFmpeg. If you're running Ubuntu, this should work fine.
+  ```bash
+  $ sudo apt-get install ffmpeg
+  $ ffmpeg -version   # I'm running 4.1.4
+  ```
+
+* [OPTIONAL] If you would like the subtitles to be generated faster, you can use the GPU package instead. Make sure to install the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version.
+  ```bash
+  $ source sub/bin/activate
+  $ pip3 install deepspeech-gpu
+  ```
+
+## How-to example
+
+* After following the installation instructions, you can run `autosub/main.py` as shown below. The `--model` and `--scorer` arguments take the absolute paths of the respective files. The `--file` argument is the video file for which the SRT file is to be generated
+  ```bash
+  $ python3 autosub/main.py --model /home/AutoSub/deepspeech-0.9.3-models.pbmm --scorer /home/AutoSub/deepspeech-0.9.3-models.scorer --file ~/movie.mp4
+  ```
+* After the script finishes, the SRT file is saved in `output/`
+* Open the video file and add this SRT file as a subtitle, or just drag and drop it into VLC.
+
+
+## How it works
+
+Mozilla DeepSpeech is an amazing open-source speech-to-text engine with support for fine-tuning on custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. When you first run the script, I use FFmpeg to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech is configured to accept 16kHz audio samples for inference, so I make FFmpeg extract the audio at a 16kHz sampling rate.
+
+Then, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal: it takes the large audio file extracted initially and splits it wherever silent regions are encountered, resulting in smaller audio segments which are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio segments are stored in `audio/`. Then, for each audio segment, I run DeepSpeech inference and write the inferred text to an SRT file. After all segments are processed, the final SRT file is stored in `output/`.
+
+When I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video**. My machine has a dual-core i5 @ 2.5 GHz and 8 GB of RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file.
+
+
+## TO-DO
+
+* Pre-process inferred text before writing to file (prettify)
+* Add progress bar to `extract_audio()`
+* GUI support (?)
+
+
+## Contributing
+
+I would love to follow up on any suggestions/issues you find :)
+
+
+## References
+1. https://github.com/mozilla/DeepSpeech/
+2. https://github.com/tyiannak/pyAudioAnalysis
+3.
https://deepspeech.readthedocs.io/ diff --git a/autosub/autosub/__init__.py b/autosub/autosub/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/autosub/autosub/audioProcessing.py b/autosub/autosub/audioProcessing.py new file mode 100644 index 0000000..99a7098 --- /dev/null +++ b/autosub/autosub/audioProcessing.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import subprocess +import numpy as np + + +def extract_audio(input_file, audio_file_name): + """Extract audio from input video file and save to audio/ in root dir + + Args: + input_file: input video file + audio_file_name: save audio WAV file with same filename as video file + """ + + command = "ffmpeg -hide_banner -loglevel warning -i {} -b:a 192k -ac 1 -ar 16000 -vn {}".format(input_file, audio_file_name) + try: + ret = subprocess.call(command, shell=True) + print("Extracted audio to audio/{}".format(audio_file_name.split("/")[-1])) + except Exception as e: + print("Error: ", str(e)) + exit(1) + + +def convert_samplerate(audio_path, desired_sample_rate): + """Convert extracted audio to the format expected by DeepSpeech + ***WONT be called as extract_audio() converts the audio to 16kHz while saving*** + + Args: + audio_path: audio file path + desired_sample_rate: DeepSpeech expects 16kHz + + Returns: + numpy buffer: audio signal stored in numpy array + """ + + sox_cmd = "sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - ".format( + quote(audio_path), desired_sample_rate) + try: + output = subprocess.check_output( + shlex.split(sox_cmd), stderr=subprocess.PIPE) + except subprocess.CalledProcessError as e: + raise RuntimeError("SoX returned non-zero status: {}".format(e.stderr)) + except OSError as e: + raise OSError(e.errno, "SoX not found, use {}hz files or install it: {}".format( + desired_sample_rate, e.strerror)) + + return np.frombuffer(output, np.int16) \ No newline at end of file diff --git a/autosub/autosub/featureExtraction.py b/autosub/autosub/featureExtraction.py new file mode 100644 index 0000000..800e379 --- /dev/null +++ b/autosub/autosub/featureExtraction.py @@ -0,0 +1,413 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import math +import numpy as np +from scipy.fftpack import fft +from scipy.signal import lfilter +from scipy.fftpack.realtransforms import dct + +eps = 0.00000001 + +def zero_crossing_rate(frame): + """Computes zero crossing rate of frame + """ + + count = len(frame) + count_zero = np.sum(np.abs(np.diff(np.sign(frame)))) / 2 + return np.float64(count_zero) / np.float64(count - 1.0) + + +def energy(frame): + """Computes signal energy of frame + """ + + return np.sum(frame ** 2) / np.float64(len(frame)) + + +def energy_entropy(frame, n_short_blocks=10): + """Computes entropy of energy + """ + + # total frame energy + frame_energy = np.sum(frame ** 2) + frame_length = len(frame) + sub_win_len = int(np.floor(frame_length / n_short_blocks)) + if frame_length != sub_win_len * n_short_blocks: + frame = frame[0:sub_win_len * n_short_blocks] + + # sub_wins is of size [n_short_blocks x L] + sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy() + + # Compute normalized sub-frame energies: + s = np.sum(sub_wins ** 2, axis=0) / (frame_energy + eps) + + # Compute entropy of the normalized sub-frame energies: + entropy = -np.sum(s * np.log2(s + eps)) + + return entropy + + +""" Frequency-domain audio features """ + + +def spectral_centroid_spread(fft_magnitude, sampling_rate): + 
"""Computes spectral centroid of frame (given abs(FFT)) + """ + + ind = (np.arange(1, len(fft_magnitude) + 1)) * \ + (sampling_rate / (2.0 * len(fft_magnitude))) + + Xt = fft_magnitude.copy() + Xt = Xt / Xt.max() + NUM = np.sum(ind * Xt) + DEN = np.sum(Xt) + eps + + # Centroid: + centroid = (NUM / DEN) + + # Spread: + spread = np.sqrt(np.sum(((ind - centroid) ** 2) * Xt) / DEN) + + # Normalize: + centroid = centroid / (sampling_rate / 2.0) + spread = spread / (sampling_rate / 2.0) + + return centroid, spread + + +def spectral_entropy(signal, n_short_blocks=10): + """Computes the spectral entropy + """ + + # number of frame samples + num_frames = len(signal) + + # total spectral energy + total_energy = np.sum(signal ** 2) + + # length of sub-frame + sub_win_len = int(np.floor(num_frames / n_short_blocks)) + if num_frames != sub_win_len * n_short_blocks: + signal = signal[0:sub_win_len * n_short_blocks] + + # define sub-frames (using matrix reshape) + sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy() + + # compute spectral sub-energies + s = np.sum(sub_wins ** 2, axis=0) / (total_energy + eps) + + # compute spectral entropy + entropy = -np.sum(s * np.log2(s + eps)) + + return entropy + + +def spectral_flux(fft_magnitude, previous_fft_magnitude): + """Computes the spectral flux feature of the current frame + + Args: + fft_magnitude : the abs(fft) of the current frame + previous_fft_magnitude : the abs(fft) of the previous frame + """ + + # compute the spectral flux as the sum of square distances: + fft_sum = np.sum(fft_magnitude + eps) + previous_fft_sum = np.sum(previous_fft_magnitude + eps) + sp_flux = np.sum( + (fft_magnitude / fft_sum - previous_fft_magnitude / + previous_fft_sum) ** 2) + + return sp_flux + + +def spectral_rolloff(signal, c): + """Computes spectral roll-off + """ + + energy = np.sum(signal ** 2) + fft_length = len(signal) + threshold = c * energy + # Ffind the spectral rolloff as the frequency position + # where the respective spectral energy is equal to c*totalEnergy + cumulative_sum = np.cumsum(signal ** 2) + eps + a = np.nonzero(cumulative_sum > threshold)[0] + if len(a) > 0: + sp_rolloff = np.float64(a[0]) / (float(fft_length)) + else: + sp_rolloff = 0.0 + + return sp_rolloff + +def mfcc_filter_banks(sampling_rate, num_fft, lowfreq=133.33, linc=200 / 3, + logsc=1.0711703, num_lin_filt=13, num_log_filt=27): + """Computes the triangular filterbank for MFCC computation + (used in the stFeatureExtraction function before the stMFCC function call) + This function is taken from the scikits.talkbox library (MIT Licence): + https://pypi.python.org/pypi/scikits.talkbox + """ + + if sampling_rate < 8000: + nlogfil = 5 + + # Total number of filters + num_filt_total = num_lin_filt + num_log_filt + + # Compute frequency points of the triangle: + frequencies = np.zeros(num_filt_total + 2) + frequencies[:num_lin_filt] = lowfreq + np.arange(num_lin_filt) * linc + frequencies[num_lin_filt:] = frequencies[num_lin_filt - 1] * logsc ** \ + np.arange(1, num_log_filt + 3) + heights = 2. / (frequencies[2:] - frequencies[0:-2]) + + # Compute filterbank coeff (in fft domain, in bins) + fbank = np.zeros((num_filt_total, num_fft)) + nfreqs = np.arange(num_fft) / (1. 
* num_fft) * sampling_rate + + for i in range(num_filt_total): + low_freqs = frequencies[i] + cent_freqs = frequencies[i + 1] + high_freqs = frequencies[i + 2] + + lid = np.arange(np.floor(low_freqs * num_fft / sampling_rate) + 1, + np.floor(cent_freqs * num_fft / sampling_rate) + 1, + dtype=np.int) + lslope = heights[i] / (cent_freqs - low_freqs) + rid = np.arange(np.floor(cent_freqs * num_fft / sampling_rate) + 1, + np.floor(high_freqs * num_fft / sampling_rate) + 1, + dtype=np.int) + rslope = heights[i] / (high_freqs - cent_freqs) + fbank[i][lid] = lslope * (nfreqs[lid] - low_freqs) + fbank[i][rid] = rslope * (high_freqs - nfreqs[rid]) + + return fbank, frequencies + + +def mfcc(fft_magnitude, fbank, num_mfcc_feats): + """Computes the MFCCs of a frame, given the fft mag + + Args: + fft_magnitude : fft magnitude abs(FFT) + fbank : filter bank (see mfccInitFilterBanks) + + Returns: + ceps : MFCCs (13 element vector) + + Note: MFCC calculation is, in general, taken from the + scikits.talkbox library (MIT Licence), + # with a small number of modifications to make it more + compact and suitable for the pyAudioAnalysis Lib + """ + + mspec = np.log10(np.dot(fft_magnitude, fbank.T) + eps) + ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:num_mfcc_feats] + return ceps + + +def chroma_features_init(num_fft, sampling_rate): + """This function initializes the chroma matrices used in the calculation + of the chroma features + """ + + freqs = np.array([((f + 1) * sampling_rate) / + (2 * num_fft) for f in range(num_fft)]) + cp = 27.50 + num_chroma = np.round(12.0 * np.log2(freqs / cp)).astype(int) + + num_freqs_per_chroma = np.zeros((num_chroma.shape[0],)) + + unique_chroma = np.unique(num_chroma) + for u in unique_chroma: + idx = np.nonzero(num_chroma == u) + num_freqs_per_chroma[idx] = idx[0].shape + + return num_chroma, num_freqs_per_chroma + + +def chroma_features(signal, sampling_rate, num_fft): + # TODO: 1 complexity + # TODO: 2 bug with large windows + + num_chroma, num_freqs_per_chroma = \ + chroma_features_init(num_fft, sampling_rate) + chroma_names = ['A', 'A#', 'B', 'C', 'C#', 'D', + 'D#', 'E', 'F', 'F#', 'G', 'G#'] + spec = signal ** 2 + if num_chroma.max() < num_chroma.shape[0]: + C = np.zeros((num_chroma.shape[0],)) + C[num_chroma] = spec + C /= num_freqs_per_chroma[num_chroma] + else: + I = np.nonzero(num_chroma > num_chroma.shape[0])[0][0] + C = np.zeros((num_chroma.shape[0],)) + C[num_chroma[0:I - 1]] = spec + C /= num_freqs_per_chroma + final_matrix = np.zeros((12, 1)) + newD = int(np.ceil(C.shape[0] / 12.0) * 12) + C2 = np.zeros((newD,)) + C2[0:C.shape[0]] = C + C2 = C2.reshape(int(C2.shape[0] / 12), 12) + # for i in range(12): + # finalC[i] = np.sum(C[i:C.shape[0]:12]) + final_matrix = np.matrix(np.sum(C2, axis=0)).T + final_matrix /= spec.sum() + + # ax = plt.gca() + # plt.hold(False) + # plt.plot(finalC) + # ax.set_xticks(range(len(chromaNames))) + # ax.set_xticklabels(chromaNames) + # xaxis = np.arange(0, 0.02, 0.01); + # ax.set_yticks(range(len(xaxis))) + # ax.set_yticklabels(xaxis) + # plt.show(block=False) + # plt.draw() + + return chroma_names, final_matrix + +""" Windowing and feature extraction """ + +def feature_extraction(signal, sampling_rate, window, step, deltas=True): + """This function implements the shor-term windowing process. + For each short-term window a set of features is extracted. + This results to a sequence of feature vectors, stored in a np matrix. 
+ + Args: + signal : the input signal samples + sampling_rate : the sampling freq (in Hz) + window : the short-term window size (in samples) + step : the short-term window step (in samples) + deltas : (opt) True/False if delta features are to be computed + + Returns: + features (numpy.ndarray) : contains features + (n_feats x numOfShortTermWindows) + feature_names (numpy.ndarray) : contains feature names + (n_feats x numOfShortTermWindows) + """ + + window = int(window) + step = int(step) + + # signal normalization + signal = np.double(signal) + signal = signal / (2.0 ** 15) + dc_offset = signal.mean() + signal_max = (np.abs(signal)).max() + signal = (signal - dc_offset) / (signal_max + 0.0000000001) + + number_of_samples = len(signal) # total number of samples + current_position = 0 + count_fr = 0 + num_fft = int(window / 2) + + # compute the triangular filter banks used in the mfcc calculation + fbank, freqs = mfcc_filter_banks(sampling_rate, num_fft) + + n_time_spectral_feats = 8 + n_harmonic_feats = 0 + n_mfcc_feats = 13 + n_chroma_feats = 13 + n_total_feats = n_time_spectral_feats + n_mfcc_feats + n_harmonic_feats + \ + n_chroma_feats + # n_total_feats = n_time_spectral_feats + n_mfcc_feats + + # n_harmonic_feats + + # define list of feature names + feature_names = ["zcr", "energy", "energy_entropy"] + feature_names += ["spectral_centroid", "spectral_spread"] + feature_names.append("spectral_entropy") + feature_names.append("spectral_flux") + feature_names.append("spectral_rolloff") + feature_names += ["mfcc_{0:d}".format(mfcc_i) + for mfcc_i in range(1, n_mfcc_feats + 1)] + feature_names += ["chroma_{0:d}".format(chroma_i) + for chroma_i in range(1, n_chroma_feats)] + feature_names.append("chroma_std") + + # add names for delta features: + if deltas: + feature_names_2 = feature_names + ["delta " + f for f in feature_names] + feature_names = feature_names_2 + + features = [] + # for each short-term window to end of signal + while current_position + window - 1 < number_of_samples: + count_fr += 1 + # get current window + x = signal[current_position:current_position + window] + + # update window position + current_position = current_position + step + + # get fft magnitude + fft_magnitude = abs(fft(x)) + + # normalize fft + fft_magnitude = fft_magnitude[0:num_fft] + fft_magnitude = fft_magnitude / len(fft_magnitude) + + # keep previous fft mag (used in spectral flux) + if count_fr == 1: + fft_magnitude_previous = fft_magnitude.copy() + feature_vector = np.zeros((n_total_feats, 1)) + + # zero crossing rate + feature_vector[0] = zero_crossing_rate(x) + + # short-term energy + feature_vector[1] = energy(x) + + # short-term entropy of energy + feature_vector[2] = energy_entropy(x) + + # sp centroid/spread + [feature_vector[3], feature_vector[4]] = \ + spectral_centroid_spread(fft_magnitude, + sampling_rate) + + # spectral entropy + feature_vector[5] = \ + spectral_entropy(fft_magnitude) + + # spectral flux + feature_vector[6] = \ + spectral_flux(fft_magnitude, + fft_magnitude_previous) + + # spectral rolloff + feature_vector[7] = \ + spectral_rolloff(fft_magnitude, 0.90) + + # MFCCs + mffc_feats_end = n_time_spectral_feats + n_mfcc_feats + feature_vector[n_time_spectral_feats:mffc_feats_end, 0] = \ + mfcc(fft_magnitude, fbank, n_mfcc_feats).copy() + + # chroma features + chroma_names, chroma_feature_matrix = \ + chroma_features(fft_magnitude, sampling_rate, num_fft) + chroma_features_end = n_time_spectral_feats + n_mfcc_feats + \ + n_chroma_feats - 1 + 
feature_vector[mffc_feats_end:chroma_features_end] = \ + chroma_feature_matrix + feature_vector[chroma_features_end] = chroma_feature_matrix.std() + if not deltas: + features.append(feature_vector) + else: + # delta features + if count_fr > 1: + delta = feature_vector - feature_vector_prev + feature_vector_2 = np.concatenate((feature_vector, delta)) + else: + feature_vector_2 = np.concatenate((feature_vector, + np.zeros(feature_vector. + shape))) + feature_vector_prev = feature_vector + features.append(feature_vector_2) + + fft_magnitude_previous = fft_magnitude.copy() + + features = np.concatenate(features, 1) + + return features, feature_names \ No newline at end of file diff --git a/autosub/autosub/main.py b/autosub/autosub/main.py new file mode 100644 index 0000000..ea365e2 --- /dev/null +++ b/autosub/autosub/main.py @@ -0,0 +1,135 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +import re +import sys +import wave +import shutil +import argparse +import subprocess +import numpy as np +from tqdm import tqdm +from deepspeech import Model, version +from segmentAudio import silenceRemoval +from audioProcessing import extract_audio, convert_samplerate +from writeToFile import write_to_file + +# Line count for SRT file +line_count = 0 + +def sort_alphanumeric(data): + """Sort function to sort os.listdir() alphanumerically + Helps to process audio files sequentially after splitting + + Args: + data : file name + """ + + convert = lambda text: int(text) if text.isdigit() else text.lower() + alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)] + + return sorted(data, key = alphanum_key) + + +def ds_process_audio(ds, audio_file, file_handle): + """Run DeepSpeech inference on each audio file generated after silenceRemoval + and write to file pointed by file_handle + + Args: + ds : DeepSpeech Model + audio_file : audio file + file_handle : SRT file handle + """ + + global line_count + fin = wave.open(audio_file, 'rb') + fs_orig = fin.getframerate() + desired_sample_rate = ds.sampleRate() + + # Check if sampling rate is required rate (16000) + # won't be carried out as FFmpeg already converts to 16kHz + if fs_orig != desired_sample_rate: + print("Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition".format(fs_orig, desired_sample_rate), file=sys.stderr) + audio = convert_samplerate(audio_file, desired_sample_rate) + else: + audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16) + + fin.close() + + # Perform inference on audio segment + infered_text = ds.stt(audio) + + # File name contains start and end times in seconds. Extract that + limits = audio_file.split(os.sep)[-1][:-4].split("_")[-1].split("-") + + if len(infered_text) != 0: + line_count += 1 + write_to_file(file_handle, infered_text, line_count, limits) + + +def main(): + global line_count + print("AutoSub v0.1\n") + + parser = argparse.ArgumentParser(description="AutoSub v0.1") + parser.add_argument('--model', required=True, + help='DeepSpeech model file') + parser.add_argument('--scorer', + help='DeepSpeech scorer file') + parser.add_argument('--file', required=True, + help='Input video file') + args = parser.parse_args() + + ds_model = args.model + if not ds_model.endswith(".pbmm"): + print("Invalid model file. Exiting\n") + exit(1) + + # Load DeepSpeech model + ds = Model(ds_model) + + if args.scorer: + ds_scorer = args.scorer + if not ds_scorer.endswith(".scorer"): + print("Invalid scorer file. 
Running inference using only model file\n") + else: + ds.enableExternalScorer(ds_scorer) + + input_file = args.file + print("\nInput file:", input_file) + + base_directory = os.getcwd() + output_directory = os.path.join(base_directory, "output") + audio_directory = os.path.join(base_directory, "audio") + video_file_name = input_file.split(os.sep)[-1].split(".")[0] + audio_file_name = os.path.join(audio_directory, video_file_name + ".wav") + srt_file_name = os.path.join(output_directory, video_file_name + ".srt") + + # Extract audio from input video file + extract_audio(input_file, audio_file_name) + + print("Splitting on silent parts in audio file") + silenceRemoval(audio_file_name) + + # Output SRT file + file_handle = open(srt_file_name, "a+") + + print("\nRunning inference:") + + for file in tqdm(sort_alphanumeric(os.listdir(audio_directory))): + audio_segment_path = os.path.join(audio_directory, file) + + # Dont run inference on the original audio file + if audio_segment_path.split(os.sep)[-1] != audio_file_name.split(os.sep)[-1]: + ds_process_audio(ds, audio_segment_path, file_handle) + + print("\nSRT file saved to", srt_file_name) + file_handle.close() + + # Clean audio/ directory + shutil.rmtree(audio_directory) + os.mkdir(audio_directory) + +if __name__ == "__main__": + main() diff --git a/autosub/autosub/segmentAudio.py b/autosub/autosub/segmentAudio.py new file mode 100644 index 0000000..9259e60 --- /dev/null +++ b/autosub/autosub/segmentAudio.py @@ -0,0 +1,204 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +import numpy as np +from pydub import AudioSegment +import scipy.io.wavfile as wavfile +import featureExtraction as FE +import trainAudio as TA + + +def read_audio_file(input_file): + """This function returns a numpy array that stores the audio samples of a + specified WAV file + + Args: + input_file : audio from input video file + """ + + sampling_rate = -1 + signal = np.array([]) + try: + audiofile = AudioSegment.from_file(input_file) + data = np.array([]) + if audiofile.sample_width == 2: + data = np.fromstring(audiofile._data, np.int16) + elif audiofile.sample_width == 4: + data = np.fromstring(audiofile._data, np.int32) + + if data.size > 0: + sampling_rate = audiofile.frame_rate + temp_signal = [] + for chn in list(range(audiofile.channels)): + temp_signal.append(data[chn::audiofile.channels]) + signal = np.array(temp_signal).T + except: + print("Error: file not found or other I/O error. 
(DECODING FAILED)") + + if signal.ndim == 2 and signal.shape[1] == 1: + signal = signal.flatten() + + return sampling_rate, signal + +def smooth_moving_avg(signal, window=11): + window = int(window) + if signal.ndim != 1: + raise ValueError("") + if signal.size < window: + raise ValueError("Input vector needs to be bigger than window size.") + if window < 3: + return signal + s = np.r_[2 * signal[0] - signal[window - 1::-1], + signal, 2 * signal[-1] - signal[-1:-window:-1]] + w = np.ones(window, 'd') + y = np.convolve(w/w.sum(), s, mode='same') + + return y[window:-window + 1] + +def stereo_to_mono(signal): + """This function converts the input signal to MONO (if it is STEREO) + + Args: + signal: audio file stored in a Numpy array + """ + + if signal.ndim == 2: + if signal.shape[1] == 1: + signal = signal.flatten() + else: + if signal.shape[1] == 2: + signal = (signal[:, 1] / 2) + (signal[:, 0] / 2) + + return signal + +def silence_removal(signal, sampling_rate, st_win, st_step, smooth_window=0.5, + weight=0.5): + """Event Detection (silence removal) + + Args: + signal : the input audio signal + sampling_rate : sampling freq + st_win, st_step : window size and step in seconds + smoothWindow : (optinal) smooth window (in seconds) + weight : (optinal) weight factor (0 < weight < 1) the higher, the more strict + plot : (optinal) True if results are to be plotted + + Returns: + seg_limits : list of segment limits in seconds (e.g [[0.1, 0.9], + [1.4, 3.0]] means that the resulting segments + are (0.1 - 0.9) seconds and (1.4, 3.0) seconds + """ + + if weight >= 1: + weight = 0.99 + if weight <= 0: + weight = 0.01 + + # Step 1: feature extraction + signal = stereo_to_mono(signal) + st_feats, _ = FE.feature_extraction(signal, sampling_rate, + st_win * sampling_rate, + st_step * sampling_rate) + + # Step 2: train binary svm classifier of low vs high energy frames + # keep only the energy short-term sequence (2nd feature) + st_energy = st_feats[1, :] + en = np.sort(st_energy) + # number of 10% of the total short-term windows + st_windows_fraction = int(len(en) / 10) + + # compute "lower" 10% energy threshold + low_threshold = np.mean(en[0:st_windows_fraction]) + 1e-15 + + # compute "higher" 10% energy threshold + high_threshold = np.mean(en[-st_windows_fraction:-1]) + 1e-15 + + # get all features that correspond to low energy + low_energy = st_feats[:, np.where(st_energy <= low_threshold)[0]] + + # get all features that correspond to high energy + high_energy = st_feats[:, np.where(st_energy >= high_threshold)[0]] + + # form the binary classification task and ... 
+ features = [low_energy.T, high_energy.T] + # normalize and train the respective svm probabilistic model + + # (ONSET vs SILENCE) + features_norm, mean, std = TA.normalize_features(features) + svm = TA.train_svm(features_norm, 1.0) + + # Step 3: compute onset probability based on the trained svm + prob_on_set = [] + for index in range(st_feats.shape[1]): + # for each frame + cur_fv = (st_feats[:, index] - mean) / std + # get svm probability (that it belongs to the ONSET class) + prob_on_set.append(svm.predict_proba(cur_fv.reshape(1, -1))[0][1]) + prob_on_set = np.array(prob_on_set) + + # smooth probability: + prob_on_set = smooth_moving_avg(prob_on_set, smooth_window / st_step) + + # Step 4A: detect onset frame indices: + prog_on_set_sort = np.sort(prob_on_set) + + # find probability Threshold as a weighted average + # of top 10% and lower 10% of the values + nt = int(prog_on_set_sort.shape[0] / 10) + threshold = (np.mean((1 - weight) * prog_on_set_sort[0:nt]) + + weight * np.mean(prog_on_set_sort[-nt::])) + + max_indices = np.where(prob_on_set > threshold)[0] + # get the indices of the frames that satisfy the thresholding + index = 0 + seg_limits = [] + time_clusters = [] + + # Step 4B: group frame indices to onset segments + while index < len(max_indices): + # for each of the detected onset indices + cur_cluster = [max_indices[index]] + if index == len(max_indices)-1: + break + while max_indices[index+1] - cur_cluster[-1] <= 2: + cur_cluster.append(max_indices[index+1]) + index += 1 + if index == len(max_indices)-1: + break + index += 1 + time_clusters.append(cur_cluster) + seg_limits.append([cur_cluster[0] * st_step, + cur_cluster[-1] * st_step]) + + # Step 5: Post process: remove very small segments: + min_duration = 0.2 + seg_limits_2 = [] + for s_lim in seg_limits: + if s_lim[1] - s_lim[0] > min_duration: + seg_limits_2.append(s_lim) + seg_limits = seg_limits_2 + + return seg_limits + +def silenceRemoval(input_file, smoothing_window = 1.0, weight = 0.2): + """Remove silence segments from an audio file and split on those segments + + Args: + input_file : audio from input video file + smoothing : Smoothing window size in seconds. Defaults to 1.0. + weight : Weight factor in (0,1). Defaults to 0.5. + """ + + if not os.path.isfile(input_file): + raise Exception("Input audio file not found!") + + [fs, x] = read_audio_file(input_file) + segmentLimits = silence_removal(x, fs, 0.05, 0.05, smoothing_window, weight) + + for i, s in enumerate(segmentLimits): + strOut = "{0:s}_{1:.3f}-{2:.3f}.wav".format(input_file[0:-4], s[0], s[1]) + wavfile.write(strOut, fs, x[int(fs * s[0]):int(fs * s[1])]) + +#if __name__ == "__main__": +# silenceRemoval("video.wav") \ No newline at end of file diff --git a/autosub/autosub/trainAudio.py b/autosub/autosub/trainAudio.py new file mode 100644 index 0000000..dbd4745 --- /dev/null +++ b/autosub/autosub/trainAudio.py @@ -0,0 +1,104 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +import csv +import sys +import glob +import signal +import ntpath +import numpy as np +import sklearn.svm + +shortTermWindow = 0.050 +shortTermStep = 0.050 +eps = 0.00000001 + + +def train_svm(features, c_param, kernel='linear'): + """Train a multi-class probabilitistic SVM classifier. + Note: This function is simply a wrapper to the sklearn functionality + for SVM training + See function trainSVM_feature() to use a wrapper on both the + feature extraction and the SVM training + (and parameter tuning) processes. 
+ Args: + features : a list ([numOfClasses x 1]) whose elements + containt np matrices of features each matrix + features[i] of class i is + [n_samples x numOfDimensions] + c_param : SVM parameter C (cost of constraints violation) + + Returns: + svm : the trained SVM variable + + NOTE: + This function trains a linear-kernel SVM for a given C value. + For a different kernel, other types of parameters should be provided. + """ + + feature_matrix, labels = features_to_matrix(features) + svm = sklearn.svm.SVC(C=c_param, kernel=kernel, probability=True, + gamma='auto') + svm.fit(feature_matrix, labels) + + return svm + +def normalize_features(features): + """This function normalizes a feature set to 0-mean and 1-std + Used in most classifier trainning cases + + Args: + features : list of feature matrices (each one of them is a np matrix) + + Returns: + features_norm : list of NORMALIZED feature matrices + mean : mean vector + std : std vector + """ + + temp_feats = np.array([]) + + for count, f in enumerate(features): + if f.shape[0] > 0: + if count == 0: + temp_feats = f + else: + temp_feats = np.vstack((temp_feats, f)) + count += 1 + + mean = np.mean(temp_feats, axis=0) + 1e-14 + std = np.std(temp_feats, axis=0) + 1e-14 + + features_norm = [] + for f in features: + ft = f.copy() + for n_samples in range(f.shape[0]): + ft[n_samples, :] = (ft[n_samples, :] - mean) / std + features_norm.append(ft) + return features_norm, mean, std + + +def features_to_matrix(features): + """This function takes a list of feature matrices as argument and returns + a single concatenated feature matrix and the respective class labels. + + Args: + features : a list of feature matrices + + Returns: + feature_matrix : a concatenated matrix of features + labels : a vector of class indices + """ + + labels = np.array([]) + feature_matrix = np.array([]) + for i, f in enumerate(features): + if i == 0: + feature_matrix = f + labels = i * np.ones((len(f), 1)) + else: + feature_matrix = np.vstack((feature_matrix, f)) + labels = np.append(labels, i * np.ones((len(f), 1))) + + return feature_matrix, labels diff --git a/autosub/autosub/writeToFile.py b/autosub/autosub/writeToFile.py new file mode 100644 index 0000000..8c85f98 --- /dev/null +++ b/autosub/autosub/writeToFile.py @@ -0,0 +1,32 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +import datetime + +def write_to_file(file_handle, inferred_text, line_count, limits): + """Write the inferred text to SRT file + Follows a specific format for SRT files + + Args: + file_handle : SRT file handle + inferred_text : text to be written + line_count : subtitle line count + limits : starting and ending times for text + """ + + d = str(datetime.timedelta(seconds=float(limits[0]))) + try: + from_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2]) + except: + from_dur = "0" + str(d) + "," + "00" + + d = str(datetime.timedelta(seconds=float(limits[1]))) + try: + to_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2]) + except: + to_dur = "0" + str(d) + "," + "00" + + file_handle.write(str(line_count) + "\n") + file_handle.write(from_dur + " --> " + to_dur + "\n") + file_handle.write(inferred_text + "\n\n") \ No newline at end of file diff --git a/autosub/requirements.txt b/autosub/requirements.txt new file mode 100644 index 0000000..26d052a --- /dev/null +++ b/autosub/requirements.txt @@ -0,0 +1,13 @@ +cycler==0.10.0 +Cython==0.29.21 +numpy +deepspeech==0.9.3 +joblib==0.16.0 +kiwisolver==1.2.0 +pydub==0.23.1 +pyparsing==2.4.7 
+python-dateutil==2.8.1
+scikit-learn==0.21.3
+scipy==1.4.1
+six==1.15.0
+tqdm==4.44.1
diff --git a/autosub/setup.py b/autosub/setup.py
new file mode 100644
index 0000000..f5cfed5
--- /dev/null
+++ b/autosub/setup.py
@@ -0,0 +1,28 @@
+import os
+from setuptools import setup
+
+DIR = os.path.dirname(os.path.abspath(__file__))
+INSTALL_PACKAGES = open(os.path.join(DIR, 'requirements.txt')).read().splitlines()
+
+with open("README.md", "r") as fh:
+    README = fh.read()
+
+setup(
+    name="AutoSub",
+    packages=["autosub"],
+    version="0.0.1",
+    author="Abhiroop Talasila",
+    author_email="abhiroop.talasila@gmail.com",
+    description="CLI application to generate subtitle file (.srt) for any video file using STT",
+    long_description=README,
+    install_requires=INSTALL_PACKAGES,
+    long_description_content_type="text/markdown",
+    url="https://github.com/abhirooptalasila/AutoSub",
+    keywords=['speech-to-text','deepspeech','machine-learning'],
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires='>=3.5',
+)
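
To make the per-segment inference step described in the README's "How it works" section easier to follow, here is a minimal standalone sketch of what `ds_process_audio()` in `autosub/main.py` does for one audio segment. The model, scorer, and segment paths are placeholders; only `deepspeech` and `numpy` calls already used in this diff are assumed.

```python
import wave
import numpy as np
from deepspeech import Model

# Placeholder paths -- substitute the files downloaded during installation
ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional, improves accuracy

# A segment written by silenceRemoval(), named <video>_<start>-<end>.wav (placeholder name)
with wave.open("audio/movie_12.500-15.250.wav", "rb") as fin:
    # extract_audio() already resamples to 16 kHz mono, matching ds.sampleRate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

print(ds.stt(audio))  # inferred text for this segment
```

`main.py` does exactly this for every segment in `audio/`, then converts the start and end times embedded in the segment's filename into SRT timestamps via `write_to_file()`.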