Abhiroop Talasila 2021-03-28 22:52:17 +05:30
Parent 0f1168d78f
Commit cd31393a94
11 changed files with 1091 additions and 0 deletions

21
autosub/LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Abhiroop Talasila

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

93
autosub/README.md Normal file

@@ -0,0 +1,93 @@
# AutoSub
- [AutoSub](#autosub)
- [About](#about)
- [Motivation](#motivation)
- [Installation](#installation)
- [How-to example](#how-to-example)
- [How it works](#how-it-works)
- [TO-DO](#to-do)
- [Contributing](#contributing)
- [References](#references)
## About

AutoSub is a CLI application that generates a subtitle file (.srt) for any video file using [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech). I use the DeepSpeech Python API to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio on silent segments, producing multiple smaller files.
## Motivation

In the age of OTT platforms, there are still some of us who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream them. I am one of them, and on one such occasion I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it.
## Installation
* Clone the repo. All further steps should be performed while in the `AutoSub/` directory
```bash
$ git clone https://github.com/abhirooptalasila/AutoSub
$ cd AutoSub
```
* Create a Python virtual environment and install the required packages
```bash
$ python3 -m venv sub
$ source sub/bin/activate
$ pip3 install -r requirements.txt
```
* Download the model and scorer files from the DeepSpeech repo. The scorer file is optional, but it greatly improves inference results.
```bash
# Model file (~190 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
# Scorer file (~950 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
```
* Create two folders, `audio/` and `output/`, to store the audio segments and the final SRT file
```bash
$ mkdir audio output
```
* Install FFmpeg. If you're running Ubuntu, this should work fine.
```bash
$ sudo apt-get install ffmpeg
$ ffmpeg -version # I'm running 4.1.4
```
* [OPTIONAL] If you would like the subtitles to be generated faster, you can use the GPU package instead. Make sure to install the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version.
```bash
$ source sub/bin/activate
$ pip3 install deepspeech-gpu
```
## How-to example
* After following the installation instructions, you can run `autosub/main.py` as shown below. The `--model` and `--scorer` arguments take the absolute paths of the respective files, and the `--file` argument is the video file for which the SRT file is to be generated.
```bash
$ python3 autosub/main.py --model /home/AutoSub/deepspeech-0.9.3-models.pbmm --scorer /home/AutoSub/deepspeech-0.9.3-models.scorer --file ~/movie.mp4
```
* After the script finishes, the SRT file is saved in `output/`
* Open the video file and add this SRT file as a subtitle, or just drag and drop it into VLC.
## How it works
Mozilla DeepSpeech is an amazing open-source speech-to-text engine with support for fine-tuning on custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. When you first run the script, I use FFmpeg to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech is configured to accept 16kHz audio samples for inference, so while extracting I make FFmpeg use a 16kHz sampling rate.
Then, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal: it takes the large audio file extracted initially and splits it wherever silent regions are encountered, resulting in smaller audio segments which are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio segments are stored in `audio/`. Then, for each audio segment, I perform DeepSpeech inference on it and write the inferred text to an SRT file. After all segments are processed, the final SRT file is stored in `output/`.
When I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video**. My config is a dual-core i5 @ 2.5 GHz with 8 GB of RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file.
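
For reference, here is a minimal sketch of that pipeline using the project's own modules (run from the directory containing `main.py`; the file names below are placeholders):

```python
# Minimal sketch of the AutoSub pipeline; "movie.mp4" and the model path are placeholders.
from deepspeech import Model
from audioProcessing import extract_audio
from segmentAudio import silenceRemoval

ds = Model("deepspeech-0.9.3-models.pbmm")     # load the acoustic model
extract_audio("movie.mp4", "audio/movie.wav")  # FFmpeg -> 16kHz mono WAV in audio/
silenceRemoval("audio/movie.wav")              # writes audio/movie_<start>-<end>.wav segments
# main.py then runs ds.stt() on each segment and appends the text to output/movie.srt
```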
## TO-DO
* Pre-process inferred text before writing to file (prettify)
* Add progress bar to `extract_audio()`
* GUI support (?)
## Contributing
I would love to follow up on any suggestions/issues you find :)
## References
1. https://github.com/mozilla/DeepSpeech/
2. https://github.com/tyiannak/pyAudioAnalysis
3. https://deepspeech.readthedocs.io/

48
autosub/autosub/audioProcessing.py Normal file

@@ -0,0 +1,48 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import shlex
import subprocess
import numpy as np
from shlex import quote


def extract_audio(input_file, audio_file_name):
    """Extract audio from input video file and save to audio/ in root dir

    Args:
        input_file: input video file
        audio_file_name: save audio WAV file with same filename as video file
    """
    command = "ffmpeg -hide_banner -loglevel warning -i {} -b:a 192k -ac 1 -ar 16000 -vn {}".format(input_file, audio_file_name)
    try:
        ret = subprocess.call(command, shell=True)
        print("Extracted audio to audio/{}".format(audio_file_name.split("/")[-1]))
    except Exception as e:
        print("Error: ", str(e))
        exit(1)


def convert_samplerate(audio_path, desired_sample_rate):
    """Convert extracted audio to the format expected by DeepSpeech
    ***WON'T be called as extract_audio() converts the audio to 16kHz while saving***

    Args:
        audio_path: audio file path
        desired_sample_rate: DeepSpeech expects 16kHz

    Returns:
        numpy buffer: audio signal stored in numpy array
    """
    sox_cmd = "sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - ".format(
        quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(
            shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("SoX returned non-zero status: {}".format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, "SoX not found, use {}hz files or install it: {}".format(
            desired_sample_rate, e.strerror))

    return np.frombuffer(output, np.int16)
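
A hedged usage sketch of the two helpers above; the media paths are placeholders, and `convert_samplerate()` additionally assumes SoX is installed:

```python
# Illustrative only: extract 16kHz audio, then (optionally) resample some other WAV.
from audioProcessing import extract_audio, convert_samplerate

extract_audio("movie.mp4", "audio/movie.wav")           # FFmpeg writes a 16kHz mono WAV
samples = convert_samplerate("other_clip.wav", 16000)   # fallback path for non-16kHz audio
print(samples.dtype, len(samples))                      # int16 numpy buffer, ready for ds.stt()
```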

413
autosub/autosub/featureExtraction.py Normal file

@@ -0,0 +1,413 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import math
import numpy as np
from scipy.fftpack import fft
from scipy.signal import lfilter
from scipy.fftpack.realtransforms import dct

eps = 0.00000001


def zero_crossing_rate(frame):
    """Computes zero crossing rate of frame
    """
    count = len(frame)
    count_zero = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_zero) / np.float64(count - 1.0)


def energy(frame):
    """Computes signal energy of frame
    """
    return np.sum(frame ** 2) / np.float64(len(frame))


def energy_entropy(frame, n_short_blocks=10):
    """Computes entropy of energy
    """
    # total frame energy
    frame_energy = np.sum(frame ** 2)
    frame_length = len(frame)
    sub_win_len = int(np.floor(frame_length / n_short_blocks))

    if frame_length != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]

    # sub_wins is of size [n_short_blocks x L]
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # Compute normalized sub-frame energies:
    s = np.sum(sub_wins ** 2, axis=0) / (frame_energy + eps)

    # Compute entropy of the normalized sub-frame energies:
    entropy = -np.sum(s * np.log2(s + eps))
    return entropy


""" Frequency-domain audio features """


def spectral_centroid_spread(fft_magnitude, sampling_rate):
    """Computes spectral centroid of frame (given abs(FFT))
    """
    ind = (np.arange(1, len(fft_magnitude) + 1)) * \
          (sampling_rate / (2.0 * len(fft_magnitude)))

    Xt = fft_magnitude.copy()
    Xt = Xt / Xt.max()
    NUM = np.sum(ind * Xt)
    DEN = np.sum(Xt) + eps

    # Centroid:
    centroid = (NUM / DEN)

    # Spread:
    spread = np.sqrt(np.sum(((ind - centroid) ** 2) * Xt) / DEN)

    # Normalize:
    centroid = centroid / (sampling_rate / 2.0)
    spread = spread / (sampling_rate / 2.0)

    return centroid, spread


def spectral_entropy(signal, n_short_blocks=10):
    """Computes the spectral entropy
    """
    # number of frame samples
    num_frames = len(signal)

    # total spectral energy
    total_energy = np.sum(signal ** 2)

    # length of sub-frame
    sub_win_len = int(np.floor(num_frames / n_short_blocks))
    if num_frames != sub_win_len * n_short_blocks:
        signal = signal[0:sub_win_len * n_short_blocks]

    # define sub-frames (using matrix reshape)
    sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # compute spectral sub-energies
    s = np.sum(sub_wins ** 2, axis=0) / (total_energy + eps)

    # compute spectral entropy
    entropy = -np.sum(s * np.log2(s + eps))

    return entropy


def spectral_flux(fft_magnitude, previous_fft_magnitude):
    """Computes the spectral flux feature of the current frame

    Args:
        fft_magnitude : the abs(fft) of the current frame
        previous_fft_magnitude : the abs(fft) of the previous frame
    """
    # compute the spectral flux as the sum of square distances:
    fft_sum = np.sum(fft_magnitude + eps)
    previous_fft_sum = np.sum(previous_fft_magnitude + eps)
    sp_flux = np.sum(
        (fft_magnitude / fft_sum - previous_fft_magnitude /
         previous_fft_sum) ** 2)

    return sp_flux


def spectral_rolloff(signal, c):
    """Computes spectral roll-off
    """
    energy = np.sum(signal ** 2)
    fft_length = len(signal)
    threshold = c * energy

    # Find the spectral rolloff as the frequency position
    # where the respective spectral energy is equal to c*totalEnergy
    cumulative_sum = np.cumsum(signal ** 2) + eps
    a = np.nonzero(cumulative_sum > threshold)[0]
    if len(a) > 0:
        sp_rolloff = np.float64(a[0]) / (float(fft_length))
    else:
        sp_rolloff = 0.0

    return sp_rolloff


def mfcc_filter_banks(sampling_rate, num_fft, lowfreq=133.33, linc=200 / 3,
                      logsc=1.0711703, num_lin_filt=13, num_log_filt=27):
    """Computes the triangular filterbank for MFCC computation
    (used in the stFeatureExtraction function before the stMFCC function call)
    This function is taken from the scikits.talkbox library (MIT Licence):
    https://pypi.python.org/pypi/scikits.talkbox
    """
    if sampling_rate < 8000:
        nlogfil = 5

    # Total number of filters
    num_filt_total = num_lin_filt + num_log_filt

    # Compute frequency points of the triangle:
    frequencies = np.zeros(num_filt_total + 2)
    frequencies[:num_lin_filt] = lowfreq + np.arange(num_lin_filt) * linc
    frequencies[num_lin_filt:] = frequencies[num_lin_filt - 1] * logsc ** \
                                 np.arange(1, num_log_filt + 3)
    heights = 2. / (frequencies[2:] - frequencies[0:-2])

    # Compute filterbank coeff (in fft domain, in bins)
    fbank = np.zeros((num_filt_total, num_fft))
    nfreqs = np.arange(num_fft) / (1. * num_fft) * sampling_rate

    for i in range(num_filt_total):
        low_freqs = frequencies[i]
        cent_freqs = frequencies[i + 1]
        high_freqs = frequencies[i + 2]

        lid = np.arange(np.floor(low_freqs * num_fft / sampling_rate) + 1,
                        np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        lslope = heights[i] / (cent_freqs - low_freqs)
        rid = np.arange(np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        np.floor(high_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        rslope = heights[i] / (high_freqs - cent_freqs)
        fbank[i][lid] = lslope * (nfreqs[lid] - low_freqs)
        fbank[i][rid] = rslope * (high_freqs - nfreqs[rid])

    return fbank, frequencies


def mfcc(fft_magnitude, fbank, num_mfcc_feats):
    """Computes the MFCCs of a frame, given the fft mag

    Args:
        fft_magnitude : fft magnitude abs(FFT)
        fbank : filter bank (see mfccInitFilterBanks)

    Returns:
        ceps : MFCCs (13 element vector)

    Note: MFCC calculation is, in general, taken from the
          scikits.talkbox library (MIT Licence),
          with a small number of modifications to make it more
          compact and suitable for the pyAudioAnalysis Lib
    """
    mspec = np.log10(np.dot(fft_magnitude, fbank.T) + eps)
    ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:num_mfcc_feats]
    return ceps


def chroma_features_init(num_fft, sampling_rate):
    """This function initializes the chroma matrices used in the calculation
    of the chroma features
    """
    freqs = np.array([((f + 1) * sampling_rate) /
                      (2 * num_fft) for f in range(num_fft)])
    cp = 27.50
    num_chroma = np.round(12.0 * np.log2(freqs / cp)).astype(int)
    num_freqs_per_chroma = np.zeros((num_chroma.shape[0],))
    unique_chroma = np.unique(num_chroma)

    for u in unique_chroma:
        idx = np.nonzero(num_chroma == u)
        num_freqs_per_chroma[idx] = idx[0].shape

    return num_chroma, num_freqs_per_chroma


def chroma_features(signal, sampling_rate, num_fft):
    # TODO: 1 complexity
    # TODO: 2 bug with large windows
    num_chroma, num_freqs_per_chroma = \
        chroma_features_init(num_fft, sampling_rate)
    chroma_names = ['A', 'A#', 'B', 'C', 'C#', 'D',
                    'D#', 'E', 'F', 'F#', 'G', 'G#']
    spec = signal ** 2

    if num_chroma.max() < num_chroma.shape[0]:
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma] = spec
        C /= num_freqs_per_chroma[num_chroma]
    else:
        I = np.nonzero(num_chroma > num_chroma.shape[0])[0][0]
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma[0:I - 1]] = spec
        C /= num_freqs_per_chroma

    final_matrix = np.zeros((12, 1))
    newD = int(np.ceil(C.shape[0] / 12.0) * 12)
    C2 = np.zeros((newD,))
    C2[0:C.shape[0]] = C
    C2 = C2.reshape(int(C2.shape[0] / 12), 12)

    # for i in range(12):
    #     finalC[i] = np.sum(C[i:C.shape[0]:12])
    final_matrix = np.matrix(np.sum(C2, axis=0)).T
    final_matrix /= spec.sum()

    return chroma_names, final_matrix


""" Windowing and feature extraction """


def feature_extraction(signal, sampling_rate, window, step, deltas=True):
    """This function implements the short-term windowing process.
    For each short-term window a set of features is extracted.
    This results in a sequence of feature vectors, stored in a np matrix.

    Args:
        signal : the input signal samples
        sampling_rate : the sampling freq (in Hz)
        window : the short-term window size (in samples)
        step : the short-term window step (in samples)
        deltas : (opt) True/False if delta features are to be computed

    Returns:
        features (numpy.ndarray) : contains features
                                   (n_feats x numOfShortTermWindows)
        feature_names (numpy.ndarray) : contains feature names
                                        (n_feats x numOfShortTermWindows)
    """
    window = int(window)
    step = int(step)

    # signal normalization
    signal = np.double(signal)
    signal = signal / (2.0 ** 15)

    dc_offset = signal.mean()
    signal_max = (np.abs(signal)).max()
    signal = (signal - dc_offset) / (signal_max + 0.0000000001)

    number_of_samples = len(signal)  # total number of samples
    current_position = 0
    count_fr = 0
    num_fft = int(window / 2)

    # compute the triangular filter banks used in the mfcc calculation
    fbank, freqs = mfcc_filter_banks(sampling_rate, num_fft)

    n_time_spectral_feats = 8
    n_harmonic_feats = 0
    n_mfcc_feats = 13
    n_chroma_feats = 13
    n_total_feats = n_time_spectral_feats + n_mfcc_feats + n_harmonic_feats + \
                    n_chroma_feats
    # n_total_feats = n_time_spectral_feats + n_mfcc_feats +
    #                 n_harmonic_feats

    # define list of feature names
    feature_names = ["zcr", "energy", "energy_entropy"]
    feature_names += ["spectral_centroid", "spectral_spread"]
    feature_names.append("spectral_entropy")
    feature_names.append("spectral_flux")
    feature_names.append("spectral_rolloff")
    feature_names += ["mfcc_{0:d}".format(mfcc_i)
                      for mfcc_i in range(1, n_mfcc_feats + 1)]
    feature_names += ["chroma_{0:d}".format(chroma_i)
                      for chroma_i in range(1, n_chroma_feats)]
    feature_names.append("chroma_std")

    # add names for delta features:
    if deltas:
        feature_names_2 = feature_names + ["delta " + f for f in feature_names]
        feature_names = feature_names_2

    features = []

    # for each short-term window to end of signal
    while current_position + window - 1 < number_of_samples:
        count_fr += 1

        # get current window
        x = signal[current_position:current_position + window]

        # update window position
        current_position = current_position + step

        # get fft magnitude
        fft_magnitude = abs(fft(x))

        # normalize fft
        fft_magnitude = fft_magnitude[0:num_fft]
        fft_magnitude = fft_magnitude / len(fft_magnitude)

        # keep previous fft mag (used in spectral flux)
        if count_fr == 1:
            fft_magnitude_previous = fft_magnitude.copy()
        feature_vector = np.zeros((n_total_feats, 1))

        # zero crossing rate
        feature_vector[0] = zero_crossing_rate(x)

        # short-term energy
        feature_vector[1] = energy(x)

        # short-term entropy of energy
        feature_vector[2] = energy_entropy(x)

        # sp centroid/spread
        [feature_vector[3], feature_vector[4]] = \
            spectral_centroid_spread(fft_magnitude,
                                     sampling_rate)

        # spectral entropy
        feature_vector[5] = \
            spectral_entropy(fft_magnitude)

        # spectral flux
        feature_vector[6] = \
            spectral_flux(fft_magnitude,
                          fft_magnitude_previous)

        # spectral rolloff
        feature_vector[7] = \
            spectral_rolloff(fft_magnitude, 0.90)

        # MFCCs
        mffc_feats_end = n_time_spectral_feats + n_mfcc_feats
        feature_vector[n_time_spectral_feats:mffc_feats_end, 0] = \
            mfcc(fft_magnitude, fbank, n_mfcc_feats).copy()

        # chroma features
        chroma_names, chroma_feature_matrix = \
            chroma_features(fft_magnitude, sampling_rate, num_fft)
        chroma_features_end = n_time_spectral_feats + n_mfcc_feats + \
                              n_chroma_feats - 1
        feature_vector[mffc_feats_end:chroma_features_end] = \
            chroma_feature_matrix
        feature_vector[chroma_features_end] = chroma_feature_matrix.std()

        if not deltas:
            features.append(feature_vector)
        else:
            # delta features
            if count_fr > 1:
                delta = feature_vector - feature_vector_prev
                feature_vector_2 = np.concatenate((feature_vector, delta))
            else:
                feature_vector_2 = np.concatenate((feature_vector,
                                                   np.zeros(feature_vector.shape)))
            feature_vector_prev = feature_vector
            features.append(feature_vector_2)

        fft_magnitude_previous = fft_magnitude.copy()

    features = np.concatenate(features, 1)
    return features, feature_names
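
As a rough illustration of how `segmentAudio.py` drives this module, the sketch below runs `feature_extraction` on one second of synthetic audio with the same 50 ms window/step the project uses (all values are placeholders):

```python
# Illustrative only: short-term features for one second of fake 16kHz PCM audio.
import numpy as np
import featureExtraction as FE

sampling_rate = 16000
signal = (np.random.randn(sampling_rate) * 3000).astype(np.int16)   # stand-in for real samples

feats, names = FE.feature_extraction(signal, sampling_rate,
                                     0.05 * sampling_rate,          # 50 ms window
                                     0.05 * sampling_rate)          # 50 ms step
print(feats.shape)   # (n_feats x num_windows); row 1 is the short-term energy used downstream
print(names[:8])     # the eight time/spectral feature names
```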

135
autosub/autosub/main.py Normal file

@@ -0,0 +1,135 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re
import sys
import wave
import shutil
import argparse
import subprocess
import numpy as np
from tqdm import tqdm
from deepspeech import Model, version

from segmentAudio import silenceRemoval
from audioProcessing import extract_audio, convert_samplerate
from writeToFile import write_to_file

# Line count for SRT file
line_count = 0


def sort_alphanumeric(data):
    """Sort function to sort os.listdir() alphanumerically
    Helps to process audio files sequentially after splitting

    Args:
        data : file name
    """
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]

    return sorted(data, key=alphanum_key)


def ds_process_audio(ds, audio_file, file_handle):
    """Run DeepSpeech inference on each audio file generated after silenceRemoval
    and write to file pointed by file_handle

    Args:
        ds : DeepSpeech Model
        audio_file : audio file
        file_handle : SRT file handle
    """
    global line_count

    fin = wave.open(audio_file, 'rb')
    fs_orig = fin.getframerate()
    desired_sample_rate = ds.sampleRate()

    # Check if sampling rate is required rate (16000)
    # won't be carried out as FFmpeg already converts to 16kHz
    if fs_orig != desired_sample_rate:
        print("Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition".format(fs_orig, desired_sample_rate), file=sys.stderr)
        audio = convert_samplerate(audio_file, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    fin.close()

    # Perform inference on audio segment
    infered_text = ds.stt(audio)

    # File name contains start and end times in seconds. Extract that
    limits = audio_file.split(os.sep)[-1][:-4].split("_")[-1].split("-")

    if len(infered_text) != 0:
        line_count += 1
        write_to_file(file_handle, infered_text, line_count, limits)


def main():
    global line_count
    print("AutoSub v0.1\n")

    parser = argparse.ArgumentParser(description="AutoSub v0.1")
    parser.add_argument('--model', required=True,
                        help='DeepSpeech model file')
    parser.add_argument('--scorer',
                        help='DeepSpeech scorer file')
    parser.add_argument('--file', required=True,
                        help='Input video file')
    args = parser.parse_args()

    ds_model = args.model
    if not ds_model.endswith(".pbmm"):
        print("Invalid model file. Exiting\n")
        exit(1)

    # Load DeepSpeech model
    ds = Model(ds_model)

    if args.scorer:
        ds_scorer = args.scorer
        if not ds_scorer.endswith(".scorer"):
            print("Invalid scorer file. Running inference using only model file\n")
        else:
            ds.enableExternalScorer(ds_scorer)

    input_file = args.file
    print("\nInput file:", input_file)

    base_directory = os.getcwd()
    output_directory = os.path.join(base_directory, "output")
    audio_directory = os.path.join(base_directory, "audio")
    video_file_name = input_file.split(os.sep)[-1].split(".")[0]
    audio_file_name = os.path.join(audio_directory, video_file_name + ".wav")
    srt_file_name = os.path.join(output_directory, video_file_name + ".srt")

    # Extract audio from input video file
    extract_audio(input_file, audio_file_name)

    print("Splitting on silent parts in audio file")
    silenceRemoval(audio_file_name)

    # Output SRT file
    file_handle = open(srt_file_name, "a+")

    print("\nRunning inference:")
    for file in tqdm(sort_alphanumeric(os.listdir(audio_directory))):
        audio_segment_path = os.path.join(audio_directory, file)

        # Don't run inference on the original audio file
        if audio_segment_path.split(os.sep)[-1] != audio_file_name.split(os.sep)[-1]:
            ds_process_audio(ds, audio_segment_path, file_handle)

    print("\nSRT file saved to", srt_file_name)
    file_handle.close()

    # Clean audio/ directory
    shutil.rmtree(audio_directory)
    os.mkdir(audio_directory)


if __name__ == "__main__":
    main()
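
One detail worth illustrating: `sort_alphanumeric` keeps the split segments in playback order, which plain `sorted()` would not (the file names below are made up but follow the `<name>_<start>-<end>.wav` pattern used by `silenceRemoval`):

```python
# Illustrative only: numeric-aware ordering of segment file names.
from main import sort_alphanumeric

files = ["movie_10.500-12.000.wav", "movie_2.000-3.250.wav", "movie_1.000-1.900.wav"]
print(sorted(files))             # lexicographic: the 10.5s segment sorts before the 2.0s one
print(sort_alphanumeric(files))  # numeric-aware: 1.0s, 2.0s, then 10.5s
```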

204
autosub/autosub/segmentAudio.py Normal file

@@ -0,0 +1,204 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import numpy as np
from pydub import AudioSegment
import scipy.io.wavfile as wavfile

import featureExtraction as FE
import trainAudio as TA


def read_audio_file(input_file):
    """This function returns a numpy array that stores the audio samples of a
    specified WAV file

    Args:
        input_file : audio from input video file
    """
    sampling_rate = -1
    signal = np.array([])
    try:
        audiofile = AudioSegment.from_file(input_file)
        data = np.array([])
        if audiofile.sample_width == 2:
            data = np.fromstring(audiofile._data, np.int16)
        elif audiofile.sample_width == 4:
            data = np.fromstring(audiofile._data, np.int32)

        if data.size > 0:
            sampling_rate = audiofile.frame_rate
            temp_signal = []
            for chn in list(range(audiofile.channels)):
                temp_signal.append(data[chn::audiofile.channels])
            signal = np.array(temp_signal).T
    except:
        print("Error: file not found or other I/O error. (DECODING FAILED)")

    if signal.ndim == 2 and signal.shape[1] == 1:
        signal = signal.flatten()

    return sampling_rate, signal


def smooth_moving_avg(signal, window=11):
    window = int(window)
    if signal.ndim != 1:
        raise ValueError("Input signal must be a 1-dimensional array.")
    if signal.size < window:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window < 3:
        return signal

    s = np.r_[2 * signal[0] - signal[window - 1::-1],
              signal, 2 * signal[-1] - signal[-1:-window:-1]]
    w = np.ones(window, 'd')
    y = np.convolve(w / w.sum(), s, mode='same')

    return y[window:-window + 1]


def stereo_to_mono(signal):
    """This function converts the input signal to MONO (if it is STEREO)

    Args:
        signal: audio file stored in a Numpy array
    """
    if signal.ndim == 2:
        if signal.shape[1] == 1:
            signal = signal.flatten()
        else:
            if signal.shape[1] == 2:
                signal = (signal[:, 1] / 2) + (signal[:, 0] / 2)

    return signal


def silence_removal(signal, sampling_rate, st_win, st_step, smooth_window=0.5,
                    weight=0.5):
    """Event Detection (silence removal)

    Args:
        signal : the input audio signal
        sampling_rate : sampling freq
        st_win, st_step : window size and step in seconds
        smooth_window : (optional) smoothing window (in seconds)
        weight : (optional) weight factor (0 < weight < 1); the higher, the more strict

    Returns:
        seg_limits : list of segment limits in seconds (e.g. [[0.1, 0.9],
                     [1.4, 3.0]] means that the resulting segments
                     are (0.1 - 0.9) seconds and (1.4 - 3.0) seconds)
    """

    if weight >= 1:
        weight = 0.99
    if weight <= 0:
        weight = 0.01

    # Step 1: feature extraction
    signal = stereo_to_mono(signal)
    st_feats, _ = FE.feature_extraction(signal, sampling_rate,
                                        st_win * sampling_rate,
                                        st_step * sampling_rate)

    # Step 2: train binary svm classifier of low vs high energy frames
    # keep only the energy short-term sequence (2nd feature)
    st_energy = st_feats[1, :]
    en = np.sort(st_energy)

    # number of 10% of the total short-term windows
    st_windows_fraction = int(len(en) / 10)

    # compute "lower" 10% energy threshold
    low_threshold = np.mean(en[0:st_windows_fraction]) + 1e-15

    # compute "higher" 10% energy threshold
    high_threshold = np.mean(en[-st_windows_fraction:-1]) + 1e-15

    # get all features that correspond to low energy
    low_energy = st_feats[:, np.where(st_energy <= low_threshold)[0]]

    # get all features that correspond to high energy
    high_energy = st_feats[:, np.where(st_energy >= high_threshold)[0]]

    # form the binary classification task and ...
    features = [low_energy.T, high_energy.T]

    # normalize and train the respective svm probabilistic model
    # (ONSET vs SILENCE)
    features_norm, mean, std = TA.normalize_features(features)
    svm = TA.train_svm(features_norm, 1.0)

    # Step 3: compute onset probability based on the trained svm
    prob_on_set = []
    for index in range(st_feats.shape[1]):
        # for each frame
        cur_fv = (st_feats[:, index] - mean) / std

        # get svm probability (that it belongs to the ONSET class)
        prob_on_set.append(svm.predict_proba(cur_fv.reshape(1, -1))[0][1])
    prob_on_set = np.array(prob_on_set)

    # smooth probability:
    prob_on_set = smooth_moving_avg(prob_on_set, smooth_window / st_step)

    # Step 4A: detect onset frame indices:
    prog_on_set_sort = np.sort(prob_on_set)

    # find probability Threshold as a weighted average
    # of top 10% and lower 10% of the values
    nt = int(prog_on_set_sort.shape[0] / 10)
    threshold = (np.mean((1 - weight) * prog_on_set_sort[0:nt]) +
                 weight * np.mean(prog_on_set_sort[-nt::]))

    max_indices = np.where(prob_on_set > threshold)[0]

    # get the indices of the frames that satisfy the thresholding
    index = 0
    seg_limits = []
    time_clusters = []

    # Step 4B: group frame indices to onset segments
    while index < len(max_indices):
        # for each of the detected onset indices
        cur_cluster = [max_indices[index]]
        if index == len(max_indices) - 1:
            break
        while max_indices[index + 1] - cur_cluster[-1] <= 2:
            cur_cluster.append(max_indices[index + 1])
            index += 1
            if index == len(max_indices) - 1:
                break
        index += 1
        time_clusters.append(cur_cluster)
        seg_limits.append([cur_cluster[0] * st_step,
                           cur_cluster[-1] * st_step])

    # Step 5: Post process: remove very small segments:
    min_duration = 0.2
    seg_limits_2 = []
    for s_lim in seg_limits:
        if s_lim[1] - s_lim[0] > min_duration:
            seg_limits_2.append(s_lim)
    seg_limits = seg_limits_2

    return seg_limits


def silenceRemoval(input_file, smoothing_window=1.0, weight=0.2):
    """Remove silence segments from an audio file and split on those segments

    Args:
        input_file : audio from input video file
        smoothing_window : smoothing window size in seconds. Defaults to 1.0.
        weight : weight factor in (0, 1). Defaults to 0.2.
    """
    if not os.path.isfile(input_file):
        raise Exception("Input audio file not found!")

    [fs, x] = read_audio_file(input_file)
    segmentLimits = silence_removal(x, fs, 0.05, 0.05, smoothing_window, weight)

    for i, s in enumerate(segmentLimits):
        strOut = "{0:s}_{1:.3f}-{2:.3f}.wav".format(input_file[0:-4], s[0], s[1])
        wavfile.write(strOut, fs, x[int(fs * s[0]):int(fs * s[1])])


# if __name__ == "__main__":
#     silenceRemoval("video.wav")
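
A hedged example of calling the splitter directly; the WAV path is a placeholder, and the resulting segments are written next to the input file:

```python
# Split an extracted 16kHz WAV on silence; one file is written per detected speech segment.
from segmentAudio import silenceRemoval

silenceRemoval("audio/movie.wav", smoothing_window=1.0, weight=0.2)
# -> audio/movie_0.350-2.100.wav, audio/movie_2.800-5.450.wav, ... (times in seconds)
```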

104
autosub/autosub/trainAudio.py Normal file

@@ -0,0 +1,104 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import csv
import sys
import glob
import signal
import ntpath
import numpy as np
import sklearn.svm

shortTermWindow = 0.050
shortTermStep = 0.050
eps = 0.00000001


def train_svm(features, c_param, kernel='linear'):
    """Train a multi-class probabilistic SVM classifier.
    Note: This function is simply a wrapper to the sklearn functionality
          for SVM training. See function trainSVM_feature() to use a wrapper on
          both the feature extraction and the SVM training
          (and parameter tuning) processes.

    Args:
        features : a list ([numOfClasses x 1]) whose elements contain
                   np matrices of features; each matrix features[i] of
                   class i is [n_samples x numOfDimensions]
        c_param : SVM parameter C (cost of constraints violation)

    Returns:
        svm : the trained SVM variable

    NOTE:
        This function trains a linear-kernel SVM for a given C value.
        For a different kernel, other types of parameters should be provided.
    """
    feature_matrix, labels = features_to_matrix(features)
    svm = sklearn.svm.SVC(C=c_param, kernel=kernel, probability=True,
                          gamma='auto')
    svm.fit(feature_matrix, labels)

    return svm


def normalize_features(features):
    """This function normalizes a feature set to 0-mean and 1-std
    Used in most classifier training cases

    Args:
        features : list of feature matrices (each one of them is a np matrix)

    Returns:
        features_norm : list of NORMALIZED feature matrices
        mean : mean vector
        std : std vector
    """
    temp_feats = np.array([])

    for count, f in enumerate(features):
        if f.shape[0] > 0:
            if count == 0:
                temp_feats = f
            else:
                temp_feats = np.vstack((temp_feats, f))
            count += 1

    mean = np.mean(temp_feats, axis=0) + 1e-14
    std = np.std(temp_feats, axis=0) + 1e-14

    features_norm = []
    for f in features:
        ft = f.copy()
        for n_samples in range(f.shape[0]):
            ft[n_samples, :] = (ft[n_samples, :] - mean) / std
        features_norm.append(ft)

    return features_norm, mean, std


def features_to_matrix(features):
    """This function takes a list of feature matrices as argument and returns
    a single concatenated feature matrix and the respective class labels.

    Args:
        features : a list of feature matrices

    Returns:
        feature_matrix : a concatenated matrix of features
        labels : a vector of class indices
    """
    labels = np.array([])
    feature_matrix = np.array([])

    for i, f in enumerate(features):
        if i == 0:
            feature_matrix = f
            labels = i * np.ones((len(f), 1))
        else:
            feature_matrix = np.vstack((feature_matrix, f))
            labels = np.append(labels, i * np.ones((len(f), 1)))

    return feature_matrix, labels
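
A small self-contained sketch of how `segmentAudio.py` uses these helpers; the random matrices below stand in for the low/high-energy frames, and the feature dimension is arbitrary:

```python
# Illustrative only: normalize two synthetic feature classes and train the probabilistic SVM.
import numpy as np
import trainAudio as TA

low_energy = np.random.randn(50, 34)          # rows = frames, columns = short-term features
high_energy = np.random.randn(60, 34) + 2.0   # shifted so the two classes are separable

features_norm, mean, std = TA.normalize_features([low_energy, high_energy])
svm = TA.train_svm(features_norm, 1.0)        # linear-kernel SVM with probability estimates

frame = (np.random.randn(34) - mean) / std    # normalize a new frame the same way
print(svm.predict_proba(frame.reshape(1, -1))[0][1])   # probability of the "onset" class
```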

32
autosub/autosub/writeToFile.py Normal file

@@ -0,0 +1,32 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import datetime


def write_to_file(file_handle, inferred_text, line_count, limits):
    """Write the inferred text to SRT file
    Follows a specific format for SRT files

    Args:
        file_handle : SRT file handle
        inferred_text : text to be written
        line_count : subtitle line count
        limits : starting and ending times for text
    """
    d = str(datetime.timedelta(seconds=float(limits[0])))
    try:
        from_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        from_dur = "0" + str(d) + "," + "00"

    d = str(datetime.timedelta(seconds=float(limits[1])))
    try:
        to_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        to_dur = "0" + str(d) + "," + "00"

    file_handle.write(str(line_count) + "\n")
    file_handle.write(from_dur + " --> " + to_dur + "\n")
    file_handle.write(inferred_text + "\n\n")
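
A hedged sketch of what a single call produces, using an in-memory buffer in place of the real SRT file handle:

```python
# Write one subtitle entry to an in-memory buffer to show the SRT format produced.
import io
from writeToFile import write_to_file

buf = io.StringIO()
write_to_file(buf, "hello world", 1, ["1.500", "3.250"])
print(buf.getvalue())
# 1
# 00:00:01,50 --> 00:00:03,25
# hello world
```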

13
autosub/requirements.txt Normal file

@@ -0,0 +1,13 @@
cycler==0.10.0
Cython==0.29.21
numpy
deepspeech==0.9.3
joblib==0.16.0
kiwisolver==1.2.0
pydub==0.23.1
pyparsing==2.4.7
python-dateutil==2.8.1
scikit-learn==0.21.3
scipy==1.4.1
six==1.15.0
tqdm==4.44.1

28
autosub/setup.py Normal file

@@ -0,0 +1,28 @@
import os
from setuptools import setup

DIR = os.path.dirname(os.path.abspath(__file__))
INSTALL_PACKAGES = open(os.path.join(DIR, 'requirements.txt')).read().splitlines()

with open("README.md", "r") as fh:
    README = fh.read()

setup(
    name="AutoSub",
    packages=["autosub"],
    version="0.0.1",
    author="Abhiroop Talasila",
    author_email="abhiroop.talasila@gmail.com",
    description="CLI application to generate subtitle file (.srt) for any video file using STT",
    long_description=README,
    install_requires=INSTALL_PACKAGES,
    long_description_content_type="text/markdown",
    url="https://github.com/abhirooptalasila/AutoSub",
    keywords=['speech-to-text', 'deepspeech', 'machine-learning'],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.5',
)
)