Add AutoSub
Parent: 0f1168d78f
Commit: cd31393a94
LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Abhiroop Talasila

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
@@ -0,0 +1,93 @@
# AutoSub

- [AutoSub](#autosub)
  - [About](#about)
  - [Motivation](#motivation)
  - [Installation](#installation)
  - [How-to example](#how-to-example)
  - [How it works](#how-it-works)
  - [TO-DO](#to-do)
  - [Contributing](#contributing)
  - [References](#references)

## About

AutoSub is a CLI application to generate a subtitle file (.srt) for any video file using [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech). I use the DeepSpeech Python API to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio on silent segments, producing multiple small files.
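For context, the core recognition step is only a few lines of the DeepSpeech Python API. A minimal sketch (the model/scorer paths and WAV file below are placeholders):

```python
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")                 # acoustic model
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional language model

with wave.open("segment.wav", "rb") as fin:                # 16 kHz, 16-bit mono WAV
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

print(ds.stt(audio))                                       # inferred text
```
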
## Motivation

In the age of OTT platforms, there are still some who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream. I am one of them, and on one such occasion, I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it.

## Installation

* Clone the repo. All further steps should be performed while in the `AutoSub/` directory
    ```bash
    $ git clone https://github.com/abhirooptalasila/AutoSub
    $ cd AutoSub
    ```
* Create a pip virtual environment to install the required packages
    ```bash
    $ python3 -m venv sub
    $ source sub/bin/activate
    $ pip3 install -r requirements.txt
    ```
* Download the model and scorer files from the DeepSpeech repo. The scorer file is optional, but it greatly improves inference results.
    ```bash
    # Model file (~190 MB)
    $ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    # Scorer file (~950 MB)
    $ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
    ```
* Create two folders `audio/` and `output/` to store audio segments and the final SRT file
    ```bash
    $ mkdir audio output
    ```
* Install FFmpeg. If you're running Ubuntu, this should work fine.
    ```bash
    $ sudo apt-get install ffmpeg
    $ ffmpeg -version   # I'm running 4.1.4
    ```
* [OPTIONAL] If you would like the subtitles to be generated faster, you can use the GPU package instead. Make sure to install the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version.
    ```bash
    $ source sub/bin/activate
    $ pip3 install deepspeech-gpu
    ```

## How-to example

* After following the installation instructions, you can run `autosub/main.py` as given below. The `--model` and `--scorer` arguments take the absolute paths of the respective files. The `--file` argument is the video file for which the SRT file is to be generated
    ```bash
    $ python3 autosub/main.py --model /home/AutoSub/deepspeech-0.9.3-models.pbmm --scorer /home/AutoSub/deepspeech-0.9.3-models.scorer --file ~/movie.mp4
    ```
* After the script finishes, the SRT file is saved in `output/`
* Open the video file and add this SRT file as a subtitle, or you can just drag and drop it in VLC.

## How it works

Mozilla DeepSpeech is an amazing open-source speech-to-text engine with support for fine-tuning using custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. So, when you first run the script, I use FFmpeg to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech is configured to accept 16kHz audio samples for inference, hence while extracting I make FFmpeg use a 16kHz sampling rate.
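Concretely, the FFmpeg invocation assembled in `autosub/audioProcessing.py` is equivalent to the following (file names here are placeholders):

```python
import subprocess

# same flags as extract_audio(): mono, 16 kHz sampling rate, video stream dropped
command = "ffmpeg -hide_banner -loglevel warning -i {} -b:a 192k -ac 1 -ar 16000 -vn {}".format(
    "movie.mp4", "audio/movie.wav")
subprocess.call(command, shell=True)
```
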
Then, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal - which basically takes the large audio file initially extracted, and splits it wherever silent regions are encountered, resulting in smaller audio segments which are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio files are stored in `audio/`. Then, for each audio segment, I perform DeepSpeech inference on it and write the inferred text to an SRT file. After all files are processed, the final SRT file is stored in `output/`.
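The start and end times of each segment are encoded in the split file's name (e.g. `movie_12.350-15.275.wav`), which is how the timing information reaches the SRT writer. Roughly, `autosub/main.py` recovers them like this (illustrative file name):

```python
audio_file = "audio/movie_12.350-15.275.wav"

# strip the directory and ".wav", keep the part after the last "_", split on "-"
limits = audio_file.split("/")[-1][:-4].split("_")[-1].split("-")
print(limits)   # ['12.350', '15.275'] -> start and end times in seconds
```
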
When I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video file**. My config is an i5 dual-core @ 2.5 GHz and 8 gigs of RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file.

## TO-DO

* Pre-process inferred text before writing to file (prettify)
* Add progress bar to `extract_audio()`
* GUI support (?)

## Contributing

I would love to follow up on any suggestions/issues you find :)

## References

1. https://github.com/mozilla/DeepSpeech/
2. https://github.com/tyiannak/pyAudioAnalysis
3. https://deepspeech.readthedocs.io/

autosub/audioProcessing.py
@@ -0,0 +1,48 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import shlex
import subprocess
import numpy as np
from shlex import quote


def extract_audio(input_file, audio_file_name):
    """Extract audio from input video file and save to audio/ in root dir

    Args:
        input_file: input video file
        audio_file_name: save audio WAV file with same filename as video file
    """

    command = "ffmpeg -hide_banner -loglevel warning -i {} -b:a 192k -ac 1 -ar 16000 -vn {}".format(input_file, audio_file_name)
    try:
        ret = subprocess.call(command, shell=True)
        print("Extracted audio to audio/{}".format(audio_file_name.split("/")[-1]))
    except Exception as e:
        print("Error: ", str(e))
        exit(1)


def convert_samplerate(audio_path, desired_sample_rate):
    """Convert extracted audio to the format expected by DeepSpeech
    ***WON'T be called as extract_audio() converts the audio to 16kHz while saving***

    Args:
        audio_path: audio file path
        desired_sample_rate: DeepSpeech expects 16kHz

    Returns:
        numpy buffer: audio signal stored in numpy array
    """

    sox_cmd = "sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - ".format(
        quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(
            shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("SoX returned non-zero status: {}".format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, "SoX not found, use {}hz files or install it: {}".format(
            desired_sample_rate, e.strerror))

    return np.frombuffer(output, np.int16)

autosub/featureExtraction.py
@@ -0,0 +1,413 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import math
import numpy as np
from scipy.fftpack import fft
from scipy.signal import lfilter
from scipy.fftpack.realtransforms import dct

eps = 0.00000001


def zero_crossing_rate(frame):
    """Computes zero crossing rate of frame
    """

    count = len(frame)
    count_zero = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_zero) / np.float64(count - 1.0)


def energy(frame):
    """Computes signal energy of frame
    """

    return np.sum(frame ** 2) / np.float64(len(frame))


def energy_entropy(frame, n_short_blocks=10):
    """Computes entropy of energy
    """

    # total frame energy
    frame_energy = np.sum(frame ** 2)
    frame_length = len(frame)
    sub_win_len = int(np.floor(frame_length / n_short_blocks))
    if frame_length != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]

    # sub_wins is of size [n_short_blocks x L]
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # Compute normalized sub-frame energies:
    s = np.sum(sub_wins ** 2, axis=0) / (frame_energy + eps)

    # Compute entropy of the normalized sub-frame energies:
    entropy = -np.sum(s * np.log2(s + eps))

    return entropy


""" Frequency-domain audio features """


def spectral_centroid_spread(fft_magnitude, sampling_rate):
    """Computes spectral centroid of frame (given abs(FFT))
    """

    ind = (np.arange(1, len(fft_magnitude) + 1)) * \
        (sampling_rate / (2.0 * len(fft_magnitude)))

    Xt = fft_magnitude.copy()
    Xt = Xt / Xt.max()
    NUM = np.sum(ind * Xt)
    DEN = np.sum(Xt) + eps

    # Centroid:
    centroid = (NUM / DEN)

    # Spread:
    spread = np.sqrt(np.sum(((ind - centroid) ** 2) * Xt) / DEN)

    # Normalize:
    centroid = centroid / (sampling_rate / 2.0)
    spread = spread / (sampling_rate / 2.0)

    return centroid, spread


def spectral_entropy(signal, n_short_blocks=10):
    """Computes the spectral entropy
    """

    # number of frame samples
    num_frames = len(signal)

    # total spectral energy
    total_energy = np.sum(signal ** 2)

    # length of sub-frame
    sub_win_len = int(np.floor(num_frames / n_short_blocks))
    if num_frames != sub_win_len * n_short_blocks:
        signal = signal[0:sub_win_len * n_short_blocks]

    # define sub-frames (using matrix reshape)
    sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # compute spectral sub-energies
    s = np.sum(sub_wins ** 2, axis=0) / (total_energy + eps)

    # compute spectral entropy
    entropy = -np.sum(s * np.log2(s + eps))

    return entropy


def spectral_flux(fft_magnitude, previous_fft_magnitude):
    """Computes the spectral flux feature of the current frame

    Args:
        fft_magnitude : the abs(fft) of the current frame
        previous_fft_magnitude : the abs(fft) of the previous frame
    """

    # compute the spectral flux as the sum of square distances:
    fft_sum = np.sum(fft_magnitude + eps)
    previous_fft_sum = np.sum(previous_fft_magnitude + eps)
    sp_flux = np.sum(
        (fft_magnitude / fft_sum - previous_fft_magnitude /
         previous_fft_sum) ** 2)

    return sp_flux


def spectral_rolloff(signal, c):
    """Computes spectral roll-off
    """

    energy = np.sum(signal ** 2)
    fft_length = len(signal)
    threshold = c * energy
    # Find the spectral rolloff as the frequency position
    # where the respective spectral energy is equal to c*totalEnergy
    cumulative_sum = np.cumsum(signal ** 2) + eps
    a = np.nonzero(cumulative_sum > threshold)[0]
    if len(a) > 0:
        sp_rolloff = np.float64(a[0]) / (float(fft_length))
    else:
        sp_rolloff = 0.0

    return sp_rolloff


def mfcc_filter_banks(sampling_rate, num_fft, lowfreq=133.33, linc=200 / 3,
                      logsc=1.0711703, num_lin_filt=13, num_log_filt=27):
    """Computes the triangular filterbank for MFCC computation
    (used in the stFeatureExtraction function before the stMFCC function call)
    This function is taken from the scikits.talkbox library (MIT Licence):
    https://pypi.python.org/pypi/scikits.talkbox
    """

    if sampling_rate < 8000:
        nlogfil = 5

    # Total number of filters
    num_filt_total = num_lin_filt + num_log_filt

    # Compute frequency points of the triangle:
    frequencies = np.zeros(num_filt_total + 2)
    frequencies[:num_lin_filt] = lowfreq + np.arange(num_lin_filt) * linc
    frequencies[num_lin_filt:] = frequencies[num_lin_filt - 1] * logsc ** \
        np.arange(1, num_log_filt + 3)
    heights = 2. / (frequencies[2:] - frequencies[0:-2])

    # Compute filterbank coeff (in fft domain, in bins)
    fbank = np.zeros((num_filt_total, num_fft))
    nfreqs = np.arange(num_fft) / (1. * num_fft) * sampling_rate

    for i in range(num_filt_total):
        low_freqs = frequencies[i]
        cent_freqs = frequencies[i + 1]
        high_freqs = frequencies[i + 2]

        lid = np.arange(np.floor(low_freqs * num_fft / sampling_rate) + 1,
                        np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        lslope = heights[i] / (cent_freqs - low_freqs)
        rid = np.arange(np.floor(cent_freqs * num_fft / sampling_rate) + 1,
                        np.floor(high_freqs * num_fft / sampling_rate) + 1,
                        dtype=np.int)
        rslope = heights[i] / (high_freqs - cent_freqs)
        fbank[i][lid] = lslope * (nfreqs[lid] - low_freqs)
        fbank[i][rid] = rslope * (high_freqs - nfreqs[rid])

    return fbank, frequencies


def mfcc(fft_magnitude, fbank, num_mfcc_feats):
    """Computes the MFCCs of a frame, given the fft mag

    Args:
        fft_magnitude : fft magnitude abs(FFT)
        fbank : filter bank (see mfccInitFilterBanks)

    Returns:
        ceps : MFCCs (13 element vector)

    Note: MFCC calculation is, in general, taken from the
        scikits.talkbox library (MIT Licence),
        with a small number of modifications to make it more
        compact and suitable for the pyAudioAnalysis Lib
    """

    mspec = np.log10(np.dot(fft_magnitude, fbank.T) + eps)
    ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:num_mfcc_feats]
    return ceps


def chroma_features_init(num_fft, sampling_rate):
    """This function initializes the chroma matrices used in the calculation
    of the chroma features
    """

    freqs = np.array([((f + 1) * sampling_rate) /
                      (2 * num_fft) for f in range(num_fft)])
    cp = 27.50
    num_chroma = np.round(12.0 * np.log2(freqs / cp)).astype(int)

    num_freqs_per_chroma = np.zeros((num_chroma.shape[0],))

    unique_chroma = np.unique(num_chroma)
    for u in unique_chroma:
        idx = np.nonzero(num_chroma == u)
        num_freqs_per_chroma[idx] = idx[0].shape

    return num_chroma, num_freqs_per_chroma


def chroma_features(signal, sampling_rate, num_fft):
    # TODO: 1 complexity
    # TODO: 2 bug with large windows

    num_chroma, num_freqs_per_chroma = \
        chroma_features_init(num_fft, sampling_rate)
    chroma_names = ['A', 'A#', 'B', 'C', 'C#', 'D',
                    'D#', 'E', 'F', 'F#', 'G', 'G#']
    spec = signal ** 2
    if num_chroma.max() < num_chroma.shape[0]:
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma] = spec
        C /= num_freqs_per_chroma[num_chroma]
    else:
        I = np.nonzero(num_chroma > num_chroma.shape[0])[0][0]
        C = np.zeros((num_chroma.shape[0],))
        C[num_chroma[0:I - 1]] = spec
        C /= num_freqs_per_chroma
    final_matrix = np.zeros((12, 1))
    newD = int(np.ceil(C.shape[0] / 12.0) * 12)
    C2 = np.zeros((newD,))
    C2[0:C.shape[0]] = C
    C2 = C2.reshape(int(C2.shape[0] / 12), 12)
    # for i in range(12):
    #     finalC[i] = np.sum(C[i:C.shape[0]:12])
    final_matrix = np.matrix(np.sum(C2, axis=0)).T
    final_matrix /= spec.sum()

    # ax = plt.gca()
    # plt.hold(False)
    # plt.plot(finalC)
    # ax.set_xticks(range(len(chromaNames)))
    # ax.set_xticklabels(chromaNames)
    # xaxis = np.arange(0, 0.02, 0.01);
    # ax.set_yticks(range(len(xaxis)))
    # ax.set_yticklabels(xaxis)
    # plt.show(block=False)
    # plt.draw()

    return chroma_names, final_matrix


""" Windowing and feature extraction """


def feature_extraction(signal, sampling_rate, window, step, deltas=True):
    """This function implements the short-term windowing process.
    For each short-term window a set of features is extracted.
    This results in a sequence of feature vectors, stored in a np matrix.

    Args:
        signal : the input signal samples
        sampling_rate : the sampling freq (in Hz)
        window : the short-term window size (in samples)
        step : the short-term window step (in samples)
        deltas : (opt) True/False if delta features are to be computed

    Returns:
        features (numpy.ndarray) : contains features
            (n_feats x numOfShortTermWindows)
        feature_names (numpy.ndarray) : contains feature names
            (n_feats x numOfShortTermWindows)
    """

    window = int(window)
    step = int(step)

    # signal normalization
    signal = np.double(signal)
    signal = signal / (2.0 ** 15)
    dc_offset = signal.mean()
    signal_max = (np.abs(signal)).max()
    signal = (signal - dc_offset) / (signal_max + 0.0000000001)

    number_of_samples = len(signal)  # total number of samples
    current_position = 0
    count_fr = 0
    num_fft = int(window / 2)

    # compute the triangular filter banks used in the mfcc calculation
    fbank, freqs = mfcc_filter_banks(sampling_rate, num_fft)

    n_time_spectral_feats = 8
    n_harmonic_feats = 0
    n_mfcc_feats = 13
    n_chroma_feats = 13
    n_total_feats = n_time_spectral_feats + n_mfcc_feats + n_harmonic_feats + \
        n_chroma_feats
    # n_total_feats = n_time_spectral_feats + n_mfcc_feats +
    # n_harmonic_feats

    # define list of feature names
    feature_names = ["zcr", "energy", "energy_entropy"]
    feature_names += ["spectral_centroid", "spectral_spread"]
    feature_names.append("spectral_entropy")
    feature_names.append("spectral_flux")
    feature_names.append("spectral_rolloff")
    feature_names += ["mfcc_{0:d}".format(mfcc_i)
                      for mfcc_i in range(1, n_mfcc_feats + 1)]
    feature_names += ["chroma_{0:d}".format(chroma_i)
                      for chroma_i in range(1, n_chroma_feats)]
    feature_names.append("chroma_std")

    # add names for delta features:
    if deltas:
        feature_names_2 = feature_names + ["delta " + f for f in feature_names]
        feature_names = feature_names_2

    features = []
    # for each short-term window to end of signal
    while current_position + window - 1 < number_of_samples:
        count_fr += 1
        # get current window
        x = signal[current_position:current_position + window]

        # update window position
        current_position = current_position + step

        # get fft magnitude
        fft_magnitude = abs(fft(x))

        # normalize fft
        fft_magnitude = fft_magnitude[0:num_fft]
        fft_magnitude = fft_magnitude / len(fft_magnitude)

        # keep previous fft mag (used in spectral flux)
        if count_fr == 1:
            fft_magnitude_previous = fft_magnitude.copy()
        feature_vector = np.zeros((n_total_feats, 1))

        # zero crossing rate
        feature_vector[0] = zero_crossing_rate(x)

        # short-term energy
        feature_vector[1] = energy(x)

        # short-term entropy of energy
        feature_vector[2] = energy_entropy(x)

        # sp centroid/spread
        [feature_vector[3], feature_vector[4]] = \
            spectral_centroid_spread(fft_magnitude,
                                     sampling_rate)

        # spectral entropy
        feature_vector[5] = \
            spectral_entropy(fft_magnitude)

        # spectral flux
        feature_vector[6] = \
            spectral_flux(fft_magnitude,
                          fft_magnitude_previous)

        # spectral rolloff
        feature_vector[7] = \
            spectral_rolloff(fft_magnitude, 0.90)

        # MFCCs
        mffc_feats_end = n_time_spectral_feats + n_mfcc_feats
        feature_vector[n_time_spectral_feats:mffc_feats_end, 0] = \
            mfcc(fft_magnitude, fbank, n_mfcc_feats).copy()

        # chroma features
        chroma_names, chroma_feature_matrix = \
            chroma_features(fft_magnitude, sampling_rate, num_fft)
        chroma_features_end = n_time_spectral_feats + n_mfcc_feats + \
            n_chroma_feats - 1
        feature_vector[mffc_feats_end:chroma_features_end] = \
            chroma_feature_matrix
        feature_vector[chroma_features_end] = chroma_feature_matrix.std()
        if not deltas:
            features.append(feature_vector)
        else:
            # delta features
            if count_fr > 1:
                delta = feature_vector - feature_vector_prev
                feature_vector_2 = np.concatenate((feature_vector, delta))
            else:
                feature_vector_2 = np.concatenate((feature_vector,
                                                   np.zeros(feature_vector.
                                                            shape)))
            feature_vector_prev = feature_vector
            features.append(feature_vector_2)

        fft_magnitude_previous = fft_magnitude.copy()

    features = np.concatenate(features, 1)

    return features, feature_names
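
A minimal usage sketch of `feature_extraction()` (assuming a 16 kHz mono signal in a NumPy array, mirroring the 50 ms window/step that `segmentAudio.py` uses; the signal here is a random placeholder):

```python
import numpy as np
from featureExtraction import feature_extraction

fs = 16000
signal = np.random.randn(fs * 3)                    # 3 s of placeholder audio
feats, names = feature_extraction(signal, fs, 0.05 * fs, 0.05 * fs)
print(feats.shape)                                  # (68, n_windows): 34 features + 34 deltas
print(names[:3])                                    # ['zcr', 'energy', 'energy_entropy']
```
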
autosub/main.py
@@ -0,0 +1,135 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re
import sys
import wave
import shutil
import argparse
import subprocess
import numpy as np
from tqdm import tqdm
from deepspeech import Model, version
from segmentAudio import silenceRemoval
from audioProcessing import extract_audio, convert_samplerate
from writeToFile import write_to_file

# Line count for SRT file
line_count = 0


def sort_alphanumeric(data):
    """Sort function to sort os.listdir() alphanumerically
    Helps to process audio files sequentially after splitting

    Args:
        data : file name
    """

    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]

    return sorted(data, key=alphanum_key)


def ds_process_audio(ds, audio_file, file_handle):
    """Run DeepSpeech inference on each audio file generated after silenceRemoval
    and write to file pointed by file_handle

    Args:
        ds : DeepSpeech Model
        audio_file : audio file
        file_handle : SRT file handle
    """

    global line_count
    fin = wave.open(audio_file, 'rb')
    fs_orig = fin.getframerate()
    desired_sample_rate = ds.sampleRate()

    # Check if sampling rate is required rate (16000)
    # won't be carried out as FFmpeg already converts to 16kHz
    if fs_orig != desired_sample_rate:
        print("Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition".format(fs_orig, desired_sample_rate), file=sys.stderr)
        audio = convert_samplerate(audio_file, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    fin.close()

    # Perform inference on audio segment
    infered_text = ds.stt(audio)

    # File name contains start and end times in seconds. Extract that
    limits = audio_file.split(os.sep)[-1][:-4].split("_")[-1].split("-")

    if len(infered_text) != 0:
        line_count += 1
        write_to_file(file_handle, infered_text, line_count, limits)


def main():
    global line_count
    print("AutoSub v0.1\n")

    parser = argparse.ArgumentParser(description="AutoSub v0.1")
    parser.add_argument('--model', required=True,
                        help='DeepSpeech model file')
    parser.add_argument('--scorer',
                        help='DeepSpeech scorer file')
    parser.add_argument('--file', required=True,
                        help='Input video file')
    args = parser.parse_args()

    ds_model = args.model
    if not ds_model.endswith(".pbmm"):
        print("Invalid model file. Exiting\n")
        exit(1)

    # Load DeepSpeech model
    ds = Model(ds_model)

    if args.scorer:
        ds_scorer = args.scorer
        if not ds_scorer.endswith(".scorer"):
            print("Invalid scorer file. Running inference using only model file\n")
        else:
            ds.enableExternalScorer(ds_scorer)

    input_file = args.file
    print("\nInput file:", input_file)

    base_directory = os.getcwd()
    output_directory = os.path.join(base_directory, "output")
    audio_directory = os.path.join(base_directory, "audio")
    video_file_name = input_file.split(os.sep)[-1].split(".")[0]
    audio_file_name = os.path.join(audio_directory, video_file_name + ".wav")
    srt_file_name = os.path.join(output_directory, video_file_name + ".srt")

    # Extract audio from input video file
    extract_audio(input_file, audio_file_name)

    print("Splitting on silent parts in audio file")
    silenceRemoval(audio_file_name)

    # Output SRT file
    file_handle = open(srt_file_name, "a+")

    print("\nRunning inference:")

    for file in tqdm(sort_alphanumeric(os.listdir(audio_directory))):
        audio_segment_path = os.path.join(audio_directory, file)

        # Don't run inference on the original audio file
        if audio_segment_path.split(os.sep)[-1] != audio_file_name.split(os.sep)[-1]:
            ds_process_audio(ds, audio_segment_path, file_handle)

    print("\nSRT file saved to", srt_file_name)
    file_handle.close()

    # Clean audio/ directory
    shutil.rmtree(audio_directory)
    os.mkdir(audio_directory)


if __name__ == "__main__":
    main()

autosub/segmentAudio.py
@@ -0,0 +1,204 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import numpy as np
from pydub import AudioSegment
import scipy.io.wavfile as wavfile
import featureExtraction as FE
import trainAudio as TA


def read_audio_file(input_file):
    """This function returns a numpy array that stores the audio samples of a
    specified WAV file

    Args:
        input_file : audio from input video file
    """

    sampling_rate = -1
    signal = np.array([])
    try:
        audiofile = AudioSegment.from_file(input_file)
        data = np.array([])
        if audiofile.sample_width == 2:
            data = np.fromstring(audiofile._data, np.int16)
        elif audiofile.sample_width == 4:
            data = np.fromstring(audiofile._data, np.int32)

        if data.size > 0:
            sampling_rate = audiofile.frame_rate
            temp_signal = []
            for chn in list(range(audiofile.channels)):
                temp_signal.append(data[chn::audiofile.channels])
            signal = np.array(temp_signal).T
    except:
        print("Error: file not found or other I/O error. (DECODING FAILED)")

    if signal.ndim == 2 and signal.shape[1] == 1:
        signal = signal.flatten()

    return sampling_rate, signal


def smooth_moving_avg(signal, window=11):
    window = int(window)
    if signal.ndim != 1:
        raise ValueError("")
    if signal.size < window:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window < 3:
        return signal
    s = np.r_[2 * signal[0] - signal[window - 1::-1],
              signal, 2 * signal[-1] - signal[-1:-window:-1]]
    w = np.ones(window, 'd')
    y = np.convolve(w/w.sum(), s, mode='same')

    return y[window:-window + 1]


def stereo_to_mono(signal):
    """This function converts the input signal to MONO (if it is STEREO)

    Args:
        signal: audio file stored in a Numpy array
    """

    if signal.ndim == 2:
        if signal.shape[1] == 1:
            signal = signal.flatten()
        else:
            if signal.shape[1] == 2:
                signal = (signal[:, 1] / 2) + (signal[:, 0] / 2)

    return signal


def silence_removal(signal, sampling_rate, st_win, st_step, smooth_window=0.5,
                    weight=0.5):
    """Event Detection (silence removal)

    Args:
        signal : the input audio signal
        sampling_rate : sampling freq
        st_win, st_step : window size and step in seconds
        smooth_window : (optional) smooth window (in seconds)
        weight : (optional) weight factor (0 < weight < 1); the higher, the more strict

    Returns:
        seg_limits : list of segment limits in seconds (e.g. [[0.1, 0.9],
            [1.4, 3.0]] means that the resulting segments
            are (0.1 - 0.9) seconds and (1.4 - 3.0) seconds)
    """

    if weight >= 1:
        weight = 0.99
    if weight <= 0:
        weight = 0.01

    # Step 1: feature extraction
    signal = stereo_to_mono(signal)
    st_feats, _ = FE.feature_extraction(signal, sampling_rate,
                                        st_win * sampling_rate,
                                        st_step * sampling_rate)

    # Step 2: train binary svm classifier of low vs high energy frames
    # keep only the energy short-term sequence (2nd feature)
    st_energy = st_feats[1, :]
    en = np.sort(st_energy)
    # number of 10% of the total short-term windows
    st_windows_fraction = int(len(en) / 10)

    # compute "lower" 10% energy threshold
    low_threshold = np.mean(en[0:st_windows_fraction]) + 1e-15

    # compute "higher" 10% energy threshold
    high_threshold = np.mean(en[-st_windows_fraction:-1]) + 1e-15

    # get all features that correspond to low energy
    low_energy = st_feats[:, np.where(st_energy <= low_threshold)[0]]

    # get all features that correspond to high energy
    high_energy = st_feats[:, np.where(st_energy >= high_threshold)[0]]

    # form the binary classification task and ...
    features = [low_energy.T, high_energy.T]
    # normalize and train the respective svm probabilistic model

    # (ONSET vs SILENCE)
    features_norm, mean, std = TA.normalize_features(features)
    svm = TA.train_svm(features_norm, 1.0)

    # Step 3: compute onset probability based on the trained svm
    prob_on_set = []
    for index in range(st_feats.shape[1]):
        # for each frame
        cur_fv = (st_feats[:, index] - mean) / std
        # get svm probability (that it belongs to the ONSET class)
        prob_on_set.append(svm.predict_proba(cur_fv.reshape(1, -1))[0][1])
    prob_on_set = np.array(prob_on_set)

    # smooth probability:
    prob_on_set = smooth_moving_avg(prob_on_set, smooth_window / st_step)

    # Step 4A: detect onset frame indices:
    prog_on_set_sort = np.sort(prob_on_set)

    # find probability Threshold as a weighted average
    # of top 10% and lower 10% of the values
    nt = int(prog_on_set_sort.shape[0] / 10)
    threshold = (np.mean((1 - weight) * prog_on_set_sort[0:nt]) +
                 weight * np.mean(prog_on_set_sort[-nt::]))

    max_indices = np.where(prob_on_set > threshold)[0]
    # get the indices of the frames that satisfy the thresholding
    index = 0
    seg_limits = []
    time_clusters = []

    # Step 4B: group frame indices to onset segments
    while index < len(max_indices):
        # for each of the detected onset indices
        cur_cluster = [max_indices[index]]
        if index == len(max_indices)-1:
            break
        while max_indices[index+1] - cur_cluster[-1] <= 2:
            cur_cluster.append(max_indices[index+1])
            index += 1
            if index == len(max_indices)-1:
                break
        index += 1
        time_clusters.append(cur_cluster)
        seg_limits.append([cur_cluster[0] * st_step,
                           cur_cluster[-1] * st_step])

    # Step 5: Post process: remove very small segments:
    min_duration = 0.2
    seg_limits_2 = []
    for s_lim in seg_limits:
        if s_lim[1] - s_lim[0] > min_duration:
            seg_limits_2.append(s_lim)
    seg_limits = seg_limits_2

    return seg_limits


def silenceRemoval(input_file, smoothing_window=1.0, weight=0.2):
    """Remove silence segments from an audio file and split on those segments

    Args:
        input_file : audio from input video file
        smoothing_window : Smoothing window size in seconds. Defaults to 1.0.
        weight : Weight factor in (0, 1). Defaults to 0.2.
    """

    if not os.path.isfile(input_file):
        raise Exception("Input audio file not found!")

    [fs, x] = read_audio_file(input_file)
    segmentLimits = silence_removal(x, fs, 0.05, 0.05, smoothing_window, weight)

    for i, s in enumerate(segmentLimits):
        strOut = "{0:s}_{1:.3f}-{2:.3f}.wav".format(input_file[0:-4], s[0], s[1])
        wavfile.write(strOut, fs, x[int(fs * s[0]):int(fs * s[1])])

# if __name__ == "__main__":
#     silenceRemoval("video.wav")

autosub/trainAudio.py
@@ -0,0 +1,104 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import csv
import sys
import glob
import signal
import ntpath
import numpy as np
import sklearn.svm

shortTermWindow = 0.050
shortTermStep = 0.050
eps = 0.00000001


def train_svm(features, c_param, kernel='linear'):
    """Train a multi-class probabilistic SVM classifier.
    Note: This function is simply a wrapper to the sklearn functionality
        for SVM training.
        See function trainSVM_feature() to use a wrapper on both the
        feature extraction and the SVM training
        (and parameter tuning) processes.

    Args:
        features : a list ([numOfClasses x 1]) whose elements
            contain np matrices of features; each matrix
            features[i] of class i is
            [n_samples x numOfDimensions]
        c_param : SVM parameter C (cost of constraints violation)

    Returns:
        svm : the trained SVM variable

    NOTE:
        This function trains a linear-kernel SVM for a given C value.
        For a different kernel, other types of parameters should be provided.
    """

    feature_matrix, labels = features_to_matrix(features)
    svm = sklearn.svm.SVC(C=c_param, kernel=kernel, probability=True,
                          gamma='auto')
    svm.fit(feature_matrix, labels)

    return svm


def normalize_features(features):
    """This function normalizes a feature set to 0-mean and 1-std
    Used in most classifier training cases

    Args:
        features : list of feature matrices (each one of them is a np matrix)

    Returns:
        features_norm : list of NORMALIZED feature matrices
        mean : mean vector
        std : std vector
    """

    temp_feats = np.array([])

    for count, f in enumerate(features):
        if f.shape[0] > 0:
            if count == 0:
                temp_feats = f
            else:
                temp_feats = np.vstack((temp_feats, f))
            count += 1

    mean = np.mean(temp_feats, axis=0) + 1e-14
    std = np.std(temp_feats, axis=0) + 1e-14

    features_norm = []
    for f in features:
        ft = f.copy()
        for n_samples in range(f.shape[0]):
            ft[n_samples, :] = (ft[n_samples, :] - mean) / std
        features_norm.append(ft)
    return features_norm, mean, std


def features_to_matrix(features):
    """This function takes a list of feature matrices as argument and returns
    a single concatenated feature matrix and the respective class labels.

    Args:
        features : a list of feature matrices

    Returns:
        feature_matrix : a concatenated matrix of features
        labels : a vector of class indices
    """

    labels = np.array([])
    feature_matrix = np.array([])
    for i, f in enumerate(features):
        if i == 0:
            feature_matrix = f
            labels = i * np.ones((len(f), 1))
        else:
            feature_matrix = np.vstack((feature_matrix, f))
            labels = np.append(labels, i * np.ones((len(f), 1)))

    return feature_matrix, labels
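
A minimal usage sketch of the helpers above, mirroring how `segmentAudio.py` trains its low-vs-high-energy classifier (the feature matrices here are random placeholders):

```python
import numpy as np
from trainAudio import normalize_features, train_svm

# two hypothetical classes of 34-dimensional feature vectors
low_energy = np.random.randn(50, 34)
high_energy = np.random.randn(60, 34) + 1.0

features_norm, mean, std = normalize_features([low_energy, high_energy])
svm = train_svm(features_norm, 1.0)

sample = (np.random.randn(34) - mean) / std
print(svm.predict_proba(sample.reshape(1, -1)))   # [[P(class 0), P(class 1)]]
```
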
autosub/writeToFile.py
@@ -0,0 +1,32 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import datetime


def write_to_file(file_handle, inferred_text, line_count, limits):
    """Write the inferred text to SRT file
    Follows a specific format for SRT files

    Args:
        file_handle : SRT file handle
        inferred_text : text to be written
        line_count : subtitle line count
        limits : starting and ending times for text
    """

    d = str(datetime.timedelta(seconds=float(limits[0])))
    try:
        from_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        from_dur = "0" + str(d) + "," + "00"

    d = str(datetime.timedelta(seconds=float(limits[1])))
    try:
        to_dur = "0" + str(d.split(".")[0]) + "," + str(d.split(".")[-1][:2])
    except:
        to_dur = "0" + str(d) + "," + "00"

    file_handle.write(str(line_count) + "\n")
    file_handle.write(from_dur + " --> " + to_dur + "\n")
    file_handle.write(inferred_text + "\n\n")
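
A minimal usage sketch (hypothetical values) showing the kind of entry this produces; note that the fractional part is written with two digits rather than the three-digit milliseconds the SRT spec prescribes:

```python
from writeToFile import write_to_file

with open("example.srt", "w") as fh:
    write_to_file(fh, "hello world", 1, ["1.250", "3.700"])

# example.srt now contains:
# 1
# 00:00:01,25 --> 00:00:03,70
# hello world
```
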
requirements.txt
@@ -0,0 +1,13 @@
cycler==0.10.0
Cython==0.29.21
numpy
deepspeech==0.9.3
joblib==0.16.0
kiwisolver==1.2.0
pydub==0.23.1
pyparsing==2.4.7
python-dateutil==2.8.1
scikit-learn==0.21.3
scipy==1.4.1
six==1.15.0
tqdm==4.44.1
setup.py
@@ -0,0 +1,28 @@
import os
from setuptools import setup

DIR = os.path.dirname(os.path.abspath(__file__))
INSTALL_PACKAGES = open(os.path.join(DIR, 'requirements.txt')).read().splitlines()

with open("README.md", "r") as fh:
    README = fh.read()

setup(
    name="AutoSub",
    packages=["autosub"],
    version="0.0.1",
    author="Abhiroop Talasila",
    author_email="abhiroop.talasila@gmail.com",
    description="CLI application to generate subtitle file (.srt) for any video file using STT",
    long_description=README,
    install_requires=INSTALL_PACKAGES,
    long_description_content_type="text/markdown",
    url="https://github.com/abhirooptalasila/AutoSub",
    keywords=['speech-to-text', 'deepspeech', 'machine-learning'],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.5',
)