Mirror of https://github.com/mozilla/kaldi.git
trunk: merging sandbox/pawel to add the AMI recipe.
git-svn-id: https://svn.code.sf.net/p/kaldi/code/trunk@4276 5e6a8d80-dfce-4ca6-a32a-6e07a63d50c8
This commit is contained in:
Commit b02ad40bf1
@ -0,0 +1,32 @@

About the AMI corpus:

WEB: http://groups.inf.ed.ac.uk/ami/corpus/
LICENCE: http://groups.inf.ed.ac.uk/ami/corpus/license.shtml

"The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers." See http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml for more details.

About the recipe:

s5)

The scripts under this directory build systems using AMI data only; this includes the training, development and evaluation sets (following the Full ASR split on http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml). This is different from the RT evaluation campaigns, which usually combined several different meeting datasets from multiple sources. In general, the recipe reproduces the baseline systems built in [1], but without proprietary components*; that means we use CMUDict [2] and in the future will try to use open texts to estimate the background language model.

Currently, one can build systems for the close-talking scenario, which we refer to as
-- IHM (Individual Headset Microphones)
and two variants of distant speech:
-- SDM (Single Distant Microphone) using the 1st microphone array, and
-- MDM (Multiple Distant Microphones) where the microphones are combined using the BeamformIt [3] toolkit.

To run all sub-recipes, the following (non-standard) software is expected to be installed:
1) SRILM - to build language models (see KALDI_ROOT/tools/install_srilm.sh)
2) BeamformIt (for the MDM scenario; installed with the Kaldi tools)
3) Java (optional, but if available it will be used to extract transcripts from XML)

[1] "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition", Pawel Swietojanski, Arnab Ghoshal and Steve Renals, In Proc. ASRU, December 2013
[2] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[3] "Acoustic beamforming for speaker diarization of meetings", Xavier Anguera, Chuck Wooters and Javier Hernando, IEEE Transactions on Audio, Speech and Language Processing, September 2007, volume 15, number 7, pp. 2011-2023.

*) There is still an optional dependency on Fisher transcripts (LDC2004T19, LDC2005T19) to build the background language model and closely reproduce [1].
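As a hypothetical convenience (not part of the recipe), the tool list above can be sanity-checked before starting; `ngram-count` is SRILM's main binary and `BeamformIt` is the beamforming executable. The helper function name is made up for this sketch:

```shell
#!/bin/bash
# Hypothetical pre-flight helper: report which of the tools listed
# above are visible on PATH (ngram-count = SRILM, BeamformIt, java).
check_tools() {
  local missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=$((missing+1))
    fi
  done
  return $missing
}

check_tools ngram-count BeamformIt java || true
```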
@ -0,0 +1,15 @@

dev
exp/ihm/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev.ctm.filt.dtl:Percent Total Error = 38.0% (35925)
exp/ihm/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_14/dev.ctm.filt.dtl:Percent Total Error = 35.3% (33329)
exp/ihm/tri4a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev.ctm.filt.dtl:Percent Total Error = 32.1% (30364)
exp/ihm/tri4a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_12/dev.ctm.filt.dtl:Percent Total Error = 29.9% (28220)

eval
exp/ihm/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval.ctm.filt.dtl:Percent Total Error = 43.7% (39330)
exp/ihm/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_14/eval.ctm.filt.dtl:Percent Total Error = 40.4% (36385)
exp/ihm/tri4a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 35.0% (31463)
exp/ihm/tri4a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_12/eval_o4.ctm.filt.dtl:Percent Total Error = 31.7% (28518)
@ -0,0 +1,15 @@

#Beamforming of 8 microphones, WER scores with up to 4 overlapping speakers

dev
exp/mdm8/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 58.8% (55568)
exp/mdm8/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 57.0% (53855)
exp/mdm8/tri3a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/dev_o4.ctm.filt.dtl:Percent Total Error = 54.9% (51926)

eval
exp/mdm8/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 64.4% (57916)
exp/mdm8/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 61.9% (55738)
exp/mdm8/tri3a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/eval_o4.ctm.filt.dtl:Percent Total Error = 59.3% (53370)
@ -0,0 +1,14 @@

#the below are WER scores with up to 4 overlapping speakers

dev
exp/sdm1/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 66.9% (63190)
exp/sdm1/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 64.5% (60963)
exp/sdm1/tri3a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/dev_o4.ctm.filt.dtl:Percent Total Error = 62.2% (58772)

eval
exp/sdm1/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 71.8% (64577)
exp/sdm1/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_12/eval_o4.ctm.filt.dtl:Percent Total Error = 69.5% (62576)
exp/sdm1/tri3a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/eval_o4.ctm.filt.dtl:Percent Total Error = 67.2% (60447)
@ -0,0 +1,17 @@

# "queue.pl" uses qsub. The options to it are options to qsub.
# If you have GridEngine installed, change this to a queue you
# have access to. Otherwise, use "run.pl", which will run jobs
# locally (make sure your --num-jobs options are no more than
# the number of cpus on your machine).

# On Eddie use:
#export train_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=08:00:00"
#export decode_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=05:00:00 -pe memory-2G 4"
#export highmem_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=05:00:00 -pe memory-2G 4"
#export scoring_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=00:20:00"

# To run locally, use:
export train_cmd=run.pl
export decode_cmd=run.pl
export highmem_cmd=run.pl
@ -0,0 +1,50 @@

#BeamformIt sample configuration file for AMI data (http://groups.inf.ed.ac.uk/ami/download/)

# scrolling size to compute the delays
scroll_size = 250

# cross correlation computation window size
window_size = 500

# maximum number of points of the cross-correlation taken into account
nbest_amount = 4

# flag whether to apply an automatic noise thresholding
do_noise_threshold = 1

# percentage of frames with lower xcorr taken as noisy
noise_percent = 10

######## acoustic modelling parameters

# transition probabilities weight for multichannel decoding
trans_weight_multi = 25
trans_weight_nbest = 25

###

# flag whether to print the features after setting them, or not
print_features = 1

# flag whether to use the bad frames in the sum process
do_avoid_bad_frames = 1

# flag to use the best channel (by SNR) as a reference
# defined from command line
do_compute_reference = 1

# flag whether to use a uem file or not (process the whole file)
do_use_uem_file = 0

# flag whether to use an adaptive weights scheme or fixed weights
do_adapt_weights = 1

# flag whether to output the sph files or just run the system to create the auxiliary files
do_write_sph_files = 1

####directories where to store/retrieve info####
#channels_file = ./cfg-files/channels

#show normally needs to be passed as an argument; a default one is given here just in case
#show_id = Ttmp
@ -0,0 +1,3 @@

beam=11.0 # beam for decoding. Was 13.0 in the scripts.
first_beam=8.0 # beam for 1st-pass decoding in SAT.
@ -0,0 +1,10 @@

--window-type=hamming # disable Dan's window, use the standard Hamming window
--use-energy=false # only fbank outputs
--sample-frequency=16000 # AMI is sampled at 16kHz

#--low-freq=64 # typical setup from Frantisek Grezl
#--high-freq=3800
--dither=1

--num-mel-bins=40 # use 40 mel bins for the 16kHz audio
--htk-compat=true # try to make it compatible with HTK
@ -0,0 +1,2 @@

--use-energy=false # only non-default option.
--sample-frequency=16000
@ -0,0 +1,74 @@

#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# Apache 2.0

wiener_filtering=false
nj=4
cmd=run.pl

# End configuration section
echo "$0 $@"  # Print the command line for logging

[ -f ./path.sh ] && . ./path.sh; # source the path.
. parse_options.sh || exit 1;

if [ $# != 3 ]; then
  echo "Wrong #arguments ($#, expected 3)"
  echo "Usage: steps/ami_beamform.sh [options] <num-mics> <ami-dir> <wav-out-dir>"
  echo "main options (for others, see top of script file)"
  echo "  --nj <nj>                        # number of parallel jobs"
  echo "  --cmd <cmd>                      # command to run the jobs with"
  echo "  --wiener-filtering <true/false>  # cancel noise with a Wiener filter prior to beamforming"
  exit 1;
fi

numch=$1
sdir=$2
odir=$3
wdir=data/local/beamforming

mkdir -p $odir
mkdir -p $wdir/log

meetings=$wdir/meetings.list

cat local/split_train.orig local/split_dev.orig local/split_eval.orig | sort > $meetings

# select every (8/numch)-th channel of the 8-mic array
ch_inc=$((8/$numch))
bmf=
for ch in `seq 1 $ch_inc 8`; do
  bmf="$bmf $ch"
done

echo "Will use the following channels: $bmf"

# make the channel file
if [ -f $wdir/channels_$numch ]; then
  rm $wdir/channels_$numch
fi
touch $wdir/channels_$numch

while read line; do
  channels="$line "
  for ch in $bmf; do
    channels="$channels $line/audio/$line.Array1-0$ch.wav"
  done
  echo $channels >> $wdir/channels_$numch
done < $meetings

# do noise cancellation
if [ $wiener_filtering == "true" ]; then
  echo "Wiener filtering not yet implemented."
  exit 1;
fi

# do beamforming
echo -e "Beamforming\n"

$cmd JOB=1:$nj $wdir/log/beamform.JOB.log \
  local/beamformit.sh $nj JOB $numch $meetings $sdir $odir
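The `ch_inc=$((8/$numch))` arithmetic in the script above subsamples the 8-mic array with an even stride; a standalone sketch of the channel lists it produces for each supported mic count:

```shell
# Reproduce the channel selection logic from the beamforming script:
# stride = 8 / numch, so numch channels are picked evenly from 1..8.
for numch in 1 2 4 8; do
  ch_inc=$((8/numch))
  bmf=
  for ch in $(seq 1 $ch_inc 8); do bmf="$bmf $ch"; done
  echo "numch=$numch -> channels:$bmf"
done
# numch=2 selects channels 1 5; numch=4 selects channels 1 3 5 7
```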
@ -0,0 +1,95 @@

#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski, Jonathan Kilgour)

if [ $# -ne 2 ]; then
  echo "Usage: $0 <mic> <ami-dir>"
  echo " where <mic> is either ihm, sdm or mdm and <ami-dir> is the download space."
  exit 1;
fi

mic=$1
adir=$2
amiurl=http://groups.inf.ed.ac.uk/ami
annotver=ami_public_manual_1.6.1
wdir=data/local/downloads

if [[ ! "$mic" =~ ^(ihm|sdm|mdm)$ ]]; then
  echo "$0. Wrong <mic> option."
  exit 1;
fi

mics="1 2 3 4 5 6 7 8"
if [ "$mic" == "sdm" ]; then
  mics=1
fi

mkdir -p $adir
mkdir -p $wdir/log

# download annotations
annot="$adir/$annotver"
if [[ ! -d $adir/annotations || ! -f "$annot" ]]; then
  echo "Downloading annotations..."
  wget -nv -O $annot.zip $amiurl/AMICorpusAnnotations/$annotver.zip &> $wdir/log/download_ami_annot.log
  mkdir -p $adir/annotations
  unzip -o -d $adir/annotations $annot.zip &> /dev/null
fi
[ ! -f "$adir/annotations/AMI-metadata.xml" ] && echo "$0: File AMI-metadata.xml not found under $adir/annotations." && exit 1;

# download waves
cat local/split_train.orig local/split_eval.orig local/split_dev.orig > $wdir/ami_meet_ids.flist

wgetfile=$wdir/wget_$mic.sh
manifest="wget -O $adir/MANIFEST.TXT http://groups.inf.ed.ac.uk/ami/download/temp/amiBuild-04237-Sun-Jun-15-2014.manifest.txt"
license="wget -O $adir/LICENCE.TXT http://groups.inf.ed.ac.uk/ami/download/temp/Creative-Commons-Attribution-NonCommercial-ShareAlike-2.5.txt"

echo "#!/bin/bash" > $wgetfile
echo $manifest >> $wgetfile
echo $license >> $wgetfile
while read line; do
  if [ "$mic" == "ihm" ]; then
    extra_headset= # some meetings have 5 speakers (headsets)
    for mtg in EN2001a EN2001d EN2001e; do
      [ "$mtg" == "$line" ] && extra_headset=4;
    done
    for m in 0 1 2 3 $extra_headset; do
      echo "wget -nv -c -P $adir/$line/audio $amiurl/AMICorpusMirror/amicorpus/$line/audio/$line.Headset-$m.wav" >> $wgetfile
    done
  else
    for m in $mics; do
      echo "wget -nv -c -P $adir/$line/audio $amiurl/AMICorpusMirror/amicorpus/$line/audio/$line.Array1-0$m.wav" >> $wgetfile
    done
  fi
done < $wdir/ami_meet_ids.flist

chmod +x $wgetfile
echo "Downloading audio files for the $mic scenario."
echo "Look at $wdir/log/download_ami_$mic.log for progress."
$wgetfile &> $wdir/log/download_ami_$mic.log

# rough check whether the number of wavs is as expected;
# the data prep stage will fail later anyway if it is not
if [ "$mic" == "ihm" ]; then
  num_files=`find $adir -iname '*Headset*' | wc -l`
  if [ $num_files -ne 687 ]; then
    echo "Warning: Found $num_files headset wavs but expected 687. Check $wdir/log/download_ami_$mic.log for details."
    exit 1;
  fi
else
  num_files=`find $adir -iname '*Array1*' | wc -l`
  if [[ $num_files -lt 1352 && "$mic" == "mdm" ]]; then
    echo "Warning: Found $num_files distant Array1 wavs but expected 1352 for mdm. Check $wdir/log/download_ami_$mic.log for details."
    exit 1;
  elif [[ $num_files -lt 169 && "$mic" == "sdm" ]]; then
    echo "Warning: Found $num_files distant Array1 wavs but expected 169 for sdm. Check $wdir/log/download_ami_$mic.log for details."
    exit 1;
  fi
fi

echo "Downloads of the AMI corpus completed successfully. The license can be found under $adir/LICENCE.TXT"
exit 0;
@ -0,0 +1,64 @@

#!/bin/bash
#

if [ -f path.sh ]; then . path.sh; fi

if [ $# -ne 1 ]; then
  echo "Usage: $0 <arpa-lm>"
  exit 1
fi

silprob=0.5
arpa_lm=$1

[ ! -f $arpa_lm ] && echo "No such file $arpa_lm" && exit 1;

cp -r data/lang data/lang_test

# grep -v '<s> <s>' etc. is only for future-proofing this script. Our
# LM doesn't have these "invalid combinations". These can cause
# determinization failures of CLG [ends up being epsilon cycles].
# Note: remove_oovs.pl takes a list of words in the LM that aren't in
# our word list. Since our LM doesn't have any, we just give it
# /dev/null [we leave it in the script to show how you'd do it].
gunzip -c "$arpa_lm" | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl /dev/null | \
  utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=data/lang_test/words.txt \
    --osymbols=data/lang_test/words.txt --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon > data/lang_test/G.fst

echo "Checking how stochastic G is (the first of these numbers should be small):"
fstisstochastic data/lang_test/G.fst

## Check lexicon.
## just have a look and make sure it seems sane.
echo "First few lines of lexicon FST:"
fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head

echo "Performing further checks"

# Checking that G.fst is determinizable.
fstdeterminize data/lang_test/G.fst /dev/null || echo "Error determinizing G."

# Checking that L_disambig.fst is determinizable.
fstdeterminize data/lang_test/L_disambig.fst /dev/null || echo "Error determinizing L."

# Checking that disambiguated lexicon times G is determinizable.
# Note: we do this with fstdeterminizestar not fstdeterminize, as
# fstdeterminize was taking forever (presumably related to a bug
# in this version of OpenFst that makes determinization slow for
# some cases).
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst | \
  fstdeterminizestar >/dev/null || echo "Error composing L and G."

# Checking that LG is stochastic:
fsttablecompose data/lang/L_disambig.fst data/lang_test/G.fst | \
  fstisstochastic || echo "LG is not stochastic"

echo "AMI_format_data succeeded."
@ -0,0 +1,95 @@

#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus training data preparation
# Apache 2.0

# To be run from one directory above this script.

. path.sh

# check existing directories
if [ $# != 1 ]; then
  echo "Usage: ami_ihm_data_prep.sh /path/to/AMI"
  exit 1;
fi

AMI_DIR=$1

SEGS=data/local/annotations/train.txt
dir=data/local/ihm/train
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
  echo "Error: $AMI_DIR directory does not exist."
  exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
  echo "Error: File $SEGS not found (run ami_text_prep.sh)."
  exit 1;
fi

# find headset wav audio files only
find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
  echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions; the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# e.g. AMI_ES2011a_H00_FEE041_0003415_0003484
# we use uniq as some (rare) entries are duplicated in the transcripts

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
 printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
 for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1b) Make segment files from the transcript

awk '{
 segment=$1;
 split(segment,S,"[_]");
 audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
 print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

# (1c) Make the wav.scp file.

sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
  paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the train part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
 | perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
              print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;

awk '{print $1}' $dir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# Copy stuff into its final location
mkdir -p data/ihm/train
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
  cp $dir/$f data/ihm/train/$f || exit 1;
done

utils/validate_data_dir.sh --no-feats data/ihm/train || exit 1;

echo "AMI IHM data preparation succeeded."
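The awk one-liner in the transcription step above encodes start/end times as zero-padded centisecond counts; a standalone sketch using the sample values from the comment (34.15 s to 34.84 s, token "HELLO" is made up):

```shell
# Same formatting as the data prep pipeline: seconds -> 7-digit
# zero-padded centiseconds, rounded to the nearest unit.
echo "ES2011a H00 FEE041 34.15 34.84 HELLO" | \
  awk '{printf("AMI_%s_%s_%s_%07.0f_%07.0f", $1, $2, $3, int(100*$4+0.5), int(100*$5+0.5));
        for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}'
# -> AMI_ES2011a_H00_FEE041_0003415_0003484 HELLO
```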
@ -0,0 +1,118 @@

#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation

. path.sh

# check existing directories
if [ $# != 2 ]; then
  echo "Usage: ami_*_scoring_data_prep_edin.sh /path/to/AMI set-name"
  exit 1;
fi

AMI_DIR=$1
SET=$2
SEGS=data/local/annotations/$SET.txt

dir=data/local/ihm/$SET
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
  echo "Error: $AMI_DIR directory does not exist."
  exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
  echo "Error: File $SEGS not found (run ami_text_prep.sh)."
  exit 1;
fi

# find headset wav audio files only; here we again get all
# the files in the corpus and filter only the specific sessions
# while building segments

find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
  echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions; the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# e.g. AMI_ES2011a_H00_FEE041_0003415_0003484

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
 printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
 for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1c) Make segment files from the transcript
# segments file format is: utt-id side-id start-time end-time

awk '{
 segment=$1;
 split(segment,S,"[_]");
 audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
 print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

# prepare wav.scp
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
  paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the $SET part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
 | perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
              print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;

awk '{print $1}' $dir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "segments: bad label $_";
             print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# check and correct the case when segment timings for a given speaker overlap
# (important for simultaneous asclite scoring to proceed).
# There is actually only one such case for the devset and automatic segmentations.
join $dir/utt2spk $dir/segments | \
  perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
    if ($pu eq $_[1] && $pt > $_[3]) {
      print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
    }
    $pu=$_[1]; $pt=$_[4];
  }' > $dir/segments_to_fix
if [ `cat $dir/segments_to_fix | wc -l` -gt 0 ]; then
  echo "$0. Applying the following fixes to segments:"
  cat $dir/segments_to_fix
  while read line; do
    p1=`echo $line | awk -F'>' '{print $1}'`
    p2=`echo $line | awk -F'>' '{print $2}'`
    sed -ir "s!$p1!$p2!" $dir/segments
  done < $dir/segments_to_fix
fi

# Copy stuff into its final locations
fdir=data/ihm/$SET
mkdir -p $fdir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
  cp $dir/$f $fdir/$f || exit 1;
done

# Produce STMs for sclite scoring
local/convert2stm.pl $dir > $fdir/stm
cp local/english.glm $fdir/glm

utils/validate_data_dir.sh --no-feats $fdir || exit 1;

echo "AMI $SET set data preparation succeeded."
@ -0,0 +1,102 @@

#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus MDM training data preparation

# To be run from one directory above this script.

. path.sh

# check existing directories
if [ $# != 2 ]; then
  echo "Usage: ami_data_prep.sh </path/to/AMI-MDM> <mic>"
  exit 1;
fi

AMI_DIR=$1
mic=$2

SEGS=data/local/annotations/train.txt
dir=data/local/$mic/train
odir=data/$mic/train
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
  echo "Error: $AMI_DIR directory does not exist."
  exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
  echo "Error: File $SEGS not found (run ami_text_prep.sh)."
  exit 1;
fi

# find the MDM wav files
find $AMI_DIR -iname "*${mic}.wav" | sort > $dir/wav.flist

n=`cat $dir/wav.flist | wc -l`
echo "In total, $n wav files were found."
[ $n -ne 169 ] && \
  echo "Warning: expected 169 data files, found $n"

# (1a) Transcriptions preparation
# here we start with rt09 transcriptions, hence not much to do

awk '{meeting=$1; channel="MDM"; speaker=$3; stime=$4; etime=$5;
 printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
 for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1c) Make segment files from the transcript
# segments file format is: utt-id side-id start-time end-time, e.g.:
# AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
 segment=$1;
 split(segment,S,"[_]");
 audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
 print segment " " audioname " " startf/100 " " endf/100 " "
}' < $dir/text > $dir/segments

#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
#  > $dir/wav.scp

sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\_.*/; print "AMI_$1_MDM\n"' | \
  paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the training part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp | sort -o $dir/wav2.scp
# Two distant recordings are missing; reconcile segments with wav.scp
awk '{print $1}' $dir/wav2.scp | join -2 2 - $dir/segments | \
  awk '{print $2" "$1" "$3" "$4" "$5}' > $dir/s; mv $dir/s $dir/segments
# ...and text with segments
awk '{print $1}' $dir/segments | join - $dir/text > $dir/t; mv $dir/t $dir/text

# replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# prep reco2file_and_channel
cat $dir/wav.scp | \
  perl -ane '$_ =~ m:^(\S+MDM).*\/([IETB].*)\.wav.*$: || die "bad label $_";
             print "$1 $2 A\n"; ' > $dir/reco2file_and_channel || exit 1;

# we assume we adapt to the session only
awk '{print $1}' $dir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1\n";' \
  > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# Copy stuff into its final locations
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
  cp $dir/$f $odir/$f || exit 1;
done

utils/validate_data_dir.sh --no-feats $odir

echo "AMI MDM data preparation succeeded."
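The sox rewrite used above turns each wav.scp entry into a piped-decode command (Kaldi accepts entries ending in `|` and runs them as pipelines). A standalone illustration with a made-up recording id and path:

```shell
# Input  (wav2.scp style):  <recording-id> <path-to-wav>
# Output (wav.scp style):   <recording-id> sox ... - |
echo "AMI_EN2001a_MDM /data/amicorpus/EN2001a_MDM8.wav" | \
  awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}'
# -> AMI_EN2001a_MDM sox -c 1 -t wavpcm -s /data/amicorpus/EN2001a_MDM8.wav -t wavpcm - |
```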
@ -0,0 +1,126 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
|
||||
# AMI Corpus dev/eval data preparation
|
||||
|
||||
. path.sh
|
||||
|
||||
#check existing directories
|
||||
if [ $# != 3 ]; then
|
||||
echo "Usage: ami_mdm_scoring_data_prep.sh /path/to/AMI-MDM mic-name set-name"
|
||||
exit 1;
|
||||
fi
|
||||
|
||||
AMI_DIR=$1
|
||||
mic=$2
|
||||
SET=$3
|
||||
|
||||
SEGS=data/local/annotations/$SET.txt
|
||||
tmpdir=data/local/$mic/$SET
|
||||
dir=data/$mic/$SET
|
||||
|
||||
mkdir -p $tmpdir
|
||||
|
||||
# Audio data directory check
|
||||
if [ ! -d $AMI_DIR ]; then
|
||||
echo "Error: run.sh requires a directory argument"
|
||||
exit 1;
|
||||
fi
|
||||
|
||||
# And transcripts check
|
||||
if [ ! -f $SEGS ]; then
|
||||
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
|
||||
exit 1;
|
||||
fi
|
||||
|
||||
# find selected mdm wav audio files only
|
||||
find $AMI_DIR -iname "*${mic}.wav" | sort > $tmpdir/wav.flist
|
||||
n=`cat $tmpdir/wav.flist | wc -l`
|
||||
if [ $n -ne 169 ]; then
|
||||
echo "Warning. Expected to find 169 files but found $n."
|
||||
fi
|
||||
|
||||
# (1a) Transcriptions preparation
|
||||
awk '{meeting=$1; channel="MDM"; speaker=$3; stime=$4; etime=$5;
|
||||
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
|
||||
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $tmpdir/text
|
||||
|
||||
# (1c) Make segment files from transcript
|
||||
#segments file format is: utt-id side-id start-time end-time, e.g.:
|
||||
#AMI_ES2011a_H00_FEE041_0003415_0003484
|
||||
awk '{
|
||||
segment=$1;
|
||||
split(segment,S,"[_]");
|
||||
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
|
||||
print segment " " audioname " " startf/100 " " endf/100 " "
|
||||
}' < $tmpdir/text > $tmpdir/segments

#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
#  > $dir/wav.scp

sed -e 's?.*/??' -e 's?.wav??' $tmpdir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\_.*/; print "AMI_$1_MDM\n"' | \
  paste - $tmpdir/wav.flist > $tmpdir/wav1.scp

# Keep only the devset part of waves
awk '{print $2}' $tmpdir/segments | sort -u | join - $tmpdir/wav1.scp > $tmpdir/wav2.scp

# Replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $tmpdir/wav2.scp > $tmpdir/wav.scp
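For illustration (the recording id and path below are hypothetical), the awk one-liner above turns a plain `<recording-id> <path>` pair into Kaldi's piped-command wav.scp form, so features are extracted from a single sox-downmixed channel:

```shell
# Hypothetical wav2.scp entry -> piped-command wav.scp entry.
line="AMI_EN2001a_MDM /corpus/EN2001a/audio/EN2001a.Array1-01.wav"
echo "$line" | awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}'
# AMI_EN2001a_MDM sox -c 1 -t wavpcm -s /corpus/EN2001a/audio/EN2001a.Array1-01.wav -t wavpcm - |
```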

# Prepare reco2file_and_channel
cat $tmpdir/wav.scp | \
  perl -ane '$_ =~ m:^(\S+MDM)\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
             print "$1 $2 A\n"; ' > $tmpdir/reco2file_and_channel || exit 1;

# We assume we adapt to the session only
awk '{print $1}' $tmpdir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1\n";' \
  > $tmpdir/utt2spk || exit 1;

sort -k 2 $tmpdir/utt2spk | utils/utt2spk_to_spk2utt.pl > $tmpdir/spk2utt || exit 1;

# But we want to properly score the overlapped segments, hence we generate the extra
# utt2spk_stm file containing speaker ids used to generate the stms for the mdm/sdm case
awk '{print $1}' $tmpdir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1$2\n";' > $tmpdir/utt2spk_stm || exit 1;

# Check and correct the case when segment timings for a given speaker overlap
# (important for simultaneous asclite scoring to proceed).
# There is actually only one such case for devset and automatic segmentations.
join $tmpdir/utt2spk_stm $tmpdir/segments | \
  perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
             if ($pu eq $_[1] && $pt > $_[3]) {
               print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
             }
             $pu=$_[1]; $pt=$_[4];
            }' > $tmpdir/segments_to_fix
if [ `cat $tmpdir/segments_to_fix | wc -l` -gt 0 ]; then
  echo "$0. Applying the following fixes to segments:"
  cat $tmpdir/segments_to_fix
  while read line; do
    p1=`echo $line | awk -F'>' '{print $1}'`
    p2=`echo $line | awk -F'>' '{print $2}'`
    sed -ir "s!$p1!$p2!" $tmpdir/segments
  done < $tmpdir/segments_to_fix
fi

# Copy stuff into its final locations [this has been moved from the format_data script]
mkdir -p $dir
for f in spk2utt utt2spk utt2spk_stm wav.scp text segments reco2file_and_channel; do
  cp $tmpdir/$f $dir/$f || exit 1;
done

cp local/english.glm $dir/glm
# Note: although utt2spk contains mappings to the whole meetings for simultaneous scoring,
# we need to know which speakers overlap at the meeting level, hence the extra utt2spk_stm file.
local/convert2stm.pl $dir utt2spk_stm > $dir/stm

utils/validate_data_dir.sh --no-feats $dir

echo AMI $SET set data preparation succeeded.
@ -0,0 +1,69 @@
#!/bin/bash

# Adapted from the Fisher dict preparation script. Author: Pawel Swietojanski

dir=data/local/dict
mkdir -p $dir
echo "Getting CMU dictionary"
svn co https://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict $dir/cmudict

# silence phones, one per line.
for w in sil laughter noise oov; do echo $w; done > $dir/silence_phones.txt
echo sil > $dir/optional_silence.txt

# For this setup we're discarding stress.
cat $dir/cmudict/cmudict.0.7a.symbols | sed s/[0-9]//g | \
  perl -ane 's:\r::; print;' | sort | uniq > $dir/nonsilence_phones.txt

# An extra question will be added by including the silence phones in one class.
cat $dir/silence_phones.txt | awk '{printf("%s ", $1);} END{printf "\n";}' > $dir/extra_questions.txt || exit 1;

grep -v ';;;' $dir/cmudict/cmudict.0.7a | \
  perl -ane 'if(!m:^;;;:){ s:(\S+)\(\d+\) :$1 :; s:  : :; print; }' | \
  sed s/[0-9]//g | sort | uniq > $dir/lexicon1_raw_nosil.txt || exit 1;

#cat eddie_data/rt09.ami.ihmtrain09.v3.dct | sort > $dir/lexicon1_raw_nosil.txt

# Limit the vocabulary to the predefined 50k words
wget -nv -O $dir/wordlist.50k.gz http://www.openslr.org/resources/9/wordlist.50k.gz
gunzip -c $dir/wordlist.50k.gz > $dir/wordlist.50k
join $dir/lexicon1_raw_nosil.txt $dir/wordlist.50k > $dir/lexicon1_raw_nosil_50k.txt

# Add prons for laughter, noise, oov
for w in `grep -v sil $dir/silence_phones.txt`; do
  echo "[$w] $w"
done | cat - $dir/lexicon1_raw_nosil_50k.txt > $dir/lexicon2_raw_50k.txt || exit 1;

# Add some specific words; each of these has 100 or more occurrences missing from the lexicon
( echo "MM M"; \
  echo "HMM HH M"; \
  echo "MM-HMM M HH M"; \
  echo "COLOUR K AH L ER"; \
  echo "COLOURS K AH L ER Z"; \
  echo "REMOTES R IH M OW T Z"; \
  echo "FAVOURITE F EY V ER IH T"; \
  echo "<unk> oov" ) | cat - $dir/lexicon2_raw_50k.txt \
  | sort -u > $dir/lexicon3_extra_50k.txt

cp $dir/lexicon3_extra_50k.txt $dir/lexicon.txt

[ ! -f $dir/lexicon.txt ] && exit 1;

# This is just for diagnostics:
cat data/ihm/train/text | \
  awk '{for (n=2;n<=NF;n++){ count[$n]++; } } END { for(n in count) { print count[n], n; }}' | \
  sort -nr > $dir/word_counts

awk '{print $1}' $dir/lexicon.txt | \
  perl -e '($word_counts)=@ARGV;
           open(W, "<$word_counts")||die "opening word-counts $word_counts";
           while(<STDIN>) { chop; $seen{$_}=1; }
           while(<W>) {
             ($c,$w) = split;
             if (!defined $seen{$w}) { print; }
           } ' $dir/word_counts > $dir/oov_counts.txt

echo "*Highest-count OOVs are:"
head -n 20 $dir/oov_counts.txt

utils/validate_dict_dir.pl $dir
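The diagnostic word-count awk above can be tried on a toy Kaldi `text` file (utterance id in field 1, words after it; the data below is made up):

```shell
# Count every word token across all utterances, most frequent first.
printf 'utt1 HELLO WORLD\nutt2 HELLO AGAIN\n' | \
  awk '{for (n=2;n<=NF;n++){ count[$n]++; } } END { for(n in count) { print count[n], n; }}' | \
  sort -nr
# 2 HELLO      (followed by the two count-1 words; tie order may vary)
```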

@ -0,0 +1,100 @@
#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation

. path.sh

# Check existing directories
if [ $# != 2 ]; then
  echo "Usage: ami_sdm_data_prep.sh <path/to/AMI> <dist-mic-num>"
  exit 1;
fi

AMI_DIR=$1
MICNUM=$2
DSET="sdm$MICNUM"

SEGS=data/local/annotations/train.txt
dir=data/local/$DSET/train
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
  echo "Error: run.sh requires a directory argument"
  exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
  echo "Error: File $SEGS not found (run ami_text_prep.sh)."
  exit 1;
fi

# As the SDM we treat the first mic from the array
find $AMI_DIR -iname "*.Array1-0$MICNUM.wav" | sort > $dir/wav.flist

n=`cat $dir/wav.flist | wc -l`

echo "In total, $n files were found."
[ $n -ne 169 ] && \
  echo "Warning: expected 169 data files, found $n."

# (1a) Transcriptions preparation
# Here we start with already-normalised transcripts, just make the ids.
# Note, we set here SDM rather than, for example, SDM1 as we want to easily use
# the same alignments across different mics.

awk '{meeting=$1; channel="SDM"; speaker=$3; stime=$4; etime=$5;
      printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
      for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1c) Make segment files from transcript
# segments file format is: utt-id side-id start-time end-time, e.g.:
# AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
  segment=$1;
  split(segment,S,"[_]");
  audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
  print segment " " audioname " " startf/100 " " endf/100 " "
}' < $dir/text > $dir/segments

#EN2001a.Array1-01.wav

sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\..*/; print "AMI_$1_SDM\n"' | \
  paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the training part of waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp | sort -o $dir/wav2.scp
# Two distant recordings are missing; agree segments with wav.scp
awk '{print $1}' $dir/wav2.scp | join -2 2 - $dir/segments | \
  awk '{print $2" "$1" "$3" "$4" "$5}' > $dir/s; mv $dir/s $dir/segments
# ...and text with segments
awk '{print $1}' $dir/segments | join - $dir/text > $dir/t; mv $dir/t $dir/text
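The `join -2 2` step above can be shown on toy data (ids and times made up): segments whose recording is absent from wav2.scp are dropped, and the trailing awk restores the original segments column order after `join` has moved the join field to the front.

```shell
# File 1: recording ids present in wav2.scp; file 2: a segments file keyed
# on field 2. Both inputs must be sorted on their join fields.
join -2 2 <(printf 'AMI_EN2001a_SDM\n') \
          <(printf 'utt1 AMI_EN2001a_SDM 0.0 1.5\nutt2 AMI_EN2009b_SDM 2.0 3.0\n') | \
  awk '{print $2" "$1" "$3" "$4" "$5}'
# utt1 AMI_EN2001a_SDM 0.0 1.5   (with a trailing space from the empty $5)
```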

# Replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# This file, reco2file_and_channel, maps each recording-id to <file> <channel>
cat $dir/wav.scp | \
  perl -ane '$_ =~ m:^(\S+SDM)\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
             print "$1 $2 A\n"; ' > $dir/reco2file_and_channel || exit 1;

# Assumption: for SDM we adapt to the session only
awk '{print $1}' $dir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1\n";' | sort > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# Copy stuff into its final locations
mkdir -p data/$DSET/train
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
  cp $dir/$f data/$DSET/train/$f || exit 1;
done

utils/validate_data_dir.sh --no-feats data/$DSET/train

echo AMI $DSET data preparation succeeded.
@ -0,0 +1,131 @@
#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation

. path.sh

# Check existing directories
if [ $# != 3 ]; then
  echo "Usage: ami_sdm_scoring_data_prep.sh <path/to/AMI> <mic-id> <set-name>"
  exit 1;
fi

AMI_DIR=$1
MICNUM=$2
SET=$3
DSET="sdm$MICNUM"

SEGS=data/local/annotations/$SET.txt
tmpdir=data/local/$DSET/$SET
dir=data/$DSET/$SET

mkdir -p $tmpdir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
  echo "Error: run.sh requires a directory argument"
  exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
  echo "Error: File $SEGS not found (run ami_text_prep.sh)."
  exit 1;
fi

# Find the wav audio files for the selected mic only; here we again get all
# the files in the corpora and filter only specific sessions
# while building segments

find $AMI_DIR -iname "*.Array1-0$MICNUM.wav" | sort > $tmpdir/wav.flist

n=`cat $tmpdir/wav.flist | wc -l`
echo "In total, $n files were found."

# (1a) Transcriptions preparation
# Here we start with normalised transcripts

awk '{meeting=$1; channel="SDM"; speaker=$3; stime=$4; etime=$5;
      printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
      for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $tmpdir/text

# (1c) Make segment files from transcript
# segments file format is: utt-id side-id start-time end-time, e.g.:
# AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
  segment=$1;
  split(segment,S,"[_]");
  audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
  print segment " " audioname " " startf/100 " " endf/100 " "
}' < $tmpdir/text > $tmpdir/segments

#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
#  > $dir/wav.scp

sed -e 's?.*/??' -e 's?.wav??' $tmpdir/wav.flist | \
  perl -ne 'split; $_ =~ m/(.*)\..*/; print "AMI_$1_SDM\n"' | \
  paste - $tmpdir/wav.flist > $tmpdir/wav1.scp

# Keep only the devset part of waves
awk '{print $2}' $tmpdir/segments | sort -u | join - $tmpdir/wav1.scp > $tmpdir/wav2.scp

# Replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $tmpdir/wav2.scp > $tmpdir/wav.scp

# Prepare reco2file_and_channel
cat $tmpdir/wav.scp | \
  perl -ane '$_ =~ m:^(\S+SDM).*\/([IETB].*)\.wav.*$: || die "bad label $_";
             print "$1 $2 A\n"; '\
  > $tmpdir/reco2file_and_channel || exit 1;

# We assume we adapt to the session only
awk '{print $1}' $tmpdir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1\n";' \
  > $tmpdir/utt2spk || exit 1;

sort -k 2 $tmpdir/utt2spk | utils/utt2spk_to_spk2utt.pl > $tmpdir/spk2utt || exit 1;

# But we want to properly score the overlapped segments, hence we generate the extra
# utt2spk_stm file containing speaker ids used to generate the stms for the mdm/sdm case
awk '{print $1}' $tmpdir/segments | \
  perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
             print "$1$2$3 $1$2\n";' \
  > $tmpdir/utt2spk_stm || exit 1;

# Check and correct the case when segment timings for a given speaker overlap
# (important for simultaneous asclite scoring to proceed).
# There is actually only one such case for devset and automatic segmentations.
join $tmpdir/utt2spk_stm $tmpdir/segments | \
  perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
             if ($pu eq $_[1] && $pt > $_[3]) {
               print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
             }
             $pu=$_[1]; $pt=$_[4];
            }' > $tmpdir/segments_to_fix
if [ `cat $tmpdir/segments_to_fix | wc -l` -gt 0 ]; then
  echo "$0. Applying the following fixes to segments:"
  cat $tmpdir/segments_to_fix
  while read line; do
    p1=`echo $line | awk -F'>' '{print $1}'`
    p2=`echo $line | awk -F'>' '{print $2}'`
    sed -ir "s!$p1!$p2!" $tmpdir/segments
  done < $tmpdir/segments_to_fix
fi

# Copy stuff into its final locations [this has been moved from the format_data
# script]
mkdir -p $dir
for f in spk2utt utt2spk utt2spk_stm wav.scp text segments reco2file_and_channel; do
  cp $tmpdir/$f $dir/$f || exit 1;
done

local/convert2stm.pl $dir utt2spk_stm > $dir/stm
cp local/english.glm $dir/glm

utils/validate_data_dir.sh --no-feats $dir

echo AMI $DSET scenario and $SET set data preparation succeeded.
@ -0,0 +1,218 @@
#!/usr/bin/perl

# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)

# The script - based on punctuation times - splits segments longer than #words (input parameter)
# and produces a somewhat more normalised form of the transcripts, as follows:
# MeetID Channel Spkr stime etime transcripts

#use List::MoreUtils 'indexes';
use strict;
use warnings;

sub split_transcripts;
sub normalise_transcripts;

sub merge_hashes {
  my ($h1, $h2) = @_;
  my %hash1 = %$h1; my %hash2 = %$h2;
  foreach my $key2 ( keys %hash2 ) {
    if( exists $hash1{$key2} ) {
      warn "Key [$key2] is in both hashes!";
      next;
    } else {
      $hash1{$key2} = $hash2{$key2};
    }
  }
  return %hash1;
}

sub print_hash {
  my ($h) = @_;
  my %hash = %$h;
  foreach my $k (sort keys %hash) {
    print "$k : $hash{$k}\n";
  }
}

sub get_name {
  #no warnings;
  my $sname = sprintf("%07d_%07d", $_[0]*100, $_[1]*100) || die 'Input undefined!';
  #use warnings;
  return $sname;
}

sub split_on_comma {

  my ($text, $comma_times, $btime, $etime, $max_words_per_seg)= @_;
  my %comma_hash = %$comma_times;

  print "Btime, Etime : $btime, $etime\n";

  my $stime = ($etime+$btime)/2; #split time
  my $skey = "";
  my $otime = $btime;
  foreach my $k (sort {$comma_hash{$a} cmp $comma_hash{$b} } keys %comma_hash) {
    print "Key : $k : $comma_hash{$k}\n";
    my $ktime = $comma_hash{$k};
    if ($ktime==$btime) { next; }
    if ($ktime==$etime) { last; }
    if (abs($stime-$ktime)/2<abs($stime-$otime)/2) {
      $otime = $ktime;
      $skey = $k;
    }
  }

  my %transcripts = ();

  if (!($skey =~ /[\,][0-9]+/)) {
    print "Cannot split into less than $max_words_per_seg words! Leaving : $text\n";
    $transcripts{get_name($btime, $etime)}=$text;
    return %transcripts;
  }

  print "Splitting $text on $skey at time $otime (stime is $stime)\n";
  my @utts1 = split(/$skey\s+/, $text);
  for (my $i=0; $i<=$#utts1; $i++) {
    my $st = $btime;
    my $et = $comma_hash{$skey};
    if ($i>0) {
      $st=$comma_hash{$skey};
      $et = $etime;
    }
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      my $nm = get_name($st, $et);
      print "SplitOnComma[$i]: $nm : $utts1[$i]\n";
      $transcripts{$nm} = $utts1[$i];
    } else {
      print 'Continue splitting!';
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_hash, $st, $et, $max_words_per_seg);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
    }
  }
  return %transcripts;
}

sub split_transcripts {
  @_ == 4 || die 'split_transcripts: transcript btime etime max_words_per_seg';

  my ($text, $btime, $etime, $max_words_per_seg) = @_;
  my (@transcript) = @$text;

  my (@punct_indices) = grep { $transcript[$_] =~ /^[\.,\?\!\:]$/ } 0..$#transcript;
  my (@time_indices) = grep { $transcript[$_] =~ /^[0-9]+\.[0-9]*/ } 0..$#transcript;
  my (@puncts_times) = delete @transcript[@time_indices];
  my (@puncts) = @transcript[@punct_indices];

  if ($#puncts_times != $#puncts) {
    print 'Oops, different number of punctuation signs and timestamps! Skipping.';
    return ();
  }

  #first split on full stops
  my (@full_stop_indices) = grep { $puncts[$_] =~ /[\.\?]/ } 0..$#puncts;
  my (@full_stop_times) = @puncts_times[@full_stop_indices];

  unshift (@full_stop_times, $btime);
  push (@full_stop_times, $etime);

  my %comma_puncts = ();
  for (my $i=0, my $j=0;$i<=$#punct_indices; $i++) {
    my $lbl = "$transcript[$punct_indices[$i]]$j";
    if ($lbl =~ /[\.\?].+/) { next; }
    $transcript[$punct_indices[$i]] = $lbl;
    $comma_puncts{$lbl} = $puncts_times[$i];
    $j++;
  }

  #print_hash(\%comma_puncts);

  print "InpTrans : @transcript\n";
  print "Full stops: @full_stop_times\n";

  my @utts1 = split (/[\.\?]/, uc join(' ', @transcript));
  my %transcripts = ();
  for (my $i=0; $i<=$#utts1; $i++) {
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      print "ReadyTrans: $utts1[$i]\n";
      $transcripts{get_name($full_stop_times[$i], $full_stop_times[$i+1])} = $utts1[$i];
    } else {
      print "TransToSplit: $utts1[$i]\n";
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_puncts, $full_stop_times[$i], $full_stop_times[$i+1], $max_words_per_seg);
      print "Hash TR2:\n"; print_hash(\%transcripts2);
      print "Hash TR:\n"; print_hash(\%transcripts);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
      print "Hash TR_NEW : \n"; print_hash(\%transcripts);
    }
  }
  return %transcripts;
}

sub normalise_transcripts {
  my $text = $_[0];

  #DO SOME ROUGH AND OBVIOUS PRELIMINARY NORMALISATION, AS FOLLOWS
  #remove the remaining punctuation labels e.g. some text ,0 some text ,1
  $text =~ s/[\.\,\?\!\:][0-9]+//g;
  #there are some spurious punctuations without spaces, e.g. UM,I; replace them with a space
  $text =~ s/[A-Z']+,[A-Z']+/ /g;
  #split word combinations, i.e. ANTI-TRUST to ANTI TRUST (none of them appears in cmudict anyway)
  #$text =~ s/(.*)([A-Z])\s+(\-)(.*)/$1$2$3$4/g;
  $text =~ s/\-/ /g;
  #substitute X_M_L with X. M. L. etc.
  $text =~ s/\_/. /g;
  #normalise and trim spaces
  $text =~ s/^\s*//g;
  $text =~ s/\s*$//g;
  $text =~ s/\s+/ /g;
  #some transcripts are empty with -, nullify (and ignore) them
  $text =~ s/^\-$//g;
  $text =~ s/\s+\-$//;
  # apply a few exceptions for dashed phrases, Mm-Hmm, Uh-Huh, etc.; these are frequent in AMI
  # and will be added to the dictionary
  $text =~ s/MM HMM/MM\-HMM/g;
  $text =~ s/UH HUH/UH\-HUH/g;

  return $text;
}

if (@ARGV != 2) {
  print STDERR "Usage: ami_split_segments.pl <meet-file> <out-file>\n";
  exit(1);
}

my $meet_file = shift @ARGV;
my $out_file = shift @ARGV;
my %transcripts = ();

open(W, ">$out_file") || die "opening output file $out_file";
open(S, "<$meet_file") || die "opening meeting file $meet_file";

while(<S>) {

  my @A = split(" ", $_);
  if (@A < 9) { print "Skipping line @A\n"; next; }

  my ($meet_id, $channel, $spk, $channel2, $trans_btime, $trans_etime, $aut_btime, $aut_etime) = @A[0..7];
  my @transcript = @A[8..$#A];
  my %transcript = split_transcripts(\@transcript, $aut_btime, $aut_etime, 30);

  for my $key (keys %transcript) {
    my $value = $transcript{$key};
    my $segment = normalise_transcripts($value);
    my @times = split(/\_/, $key);
    if ($times[0] >= $times[1]) {
      print "Warning, $meet_id, $spk, $times[0] > $times[1]. Skipping. \n"; next;
    }
    if (length($segment)>0) {
      print W join " ", $meet_id, "H0${channel2}", $spk, $times[0]/100.0, $times[1]/100.0, $segment, "\n";
    }
  }

}
close(S);
close(W);

print STDERR "Finished.\n";
@ -0,0 +1,32 @@
#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski), Apache 2.0

if [ $# -ne 1 ]; then
  echo "Usage: $0 <ami-dir>"
  exit 1;
fi

amidir=$1
wdir=data/local/annotations

# Extract text from AMI XML annotations
local/ami_xml2text.sh $amidir

[ ! -f $wdir/transcripts1 ] && echo "$0: File $wdir/transcripts1 not found." && exit 1;

echo "Preprocessing transcripts..."
local/ami_split_segments.pl $wdir/transcripts1 $wdir/transcripts2 &> $wdir/log/split_segments.log

# Make final train/dev/eval splits
for dset in train eval dev; do
  [ ! -f local/split_$dset.final ] && cp local/split_$dset.orig local/split_$dset.final
  grep -f local/split_$dset.final $wdir/transcripts2 > $wdir/$dset.txt
done
@ -0,0 +1,176 @@
#!/bin/bash

# Copyright 2013  Arnab Ghoshal, Pawel Swietojanski

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.

# To be run from one directory above this script.

# Begin configuration section.
fisher=
order=3
swbd=
google=
web_sw=
web_fsh=
web_mtg=
# End configuration section.

help_message="Usage: "`basename $0`" [options] <train-txt> <dev-txt> <dict> <out-dir>
Train language models for AMI, and optionally for Switchboard, Fisher, and web data from the University of Washington.\n
options:
  --help          # print this message and exit
  --fisher DIR    # directory for Fisher transcripts
  --order N       # N-gram order (default: '$order')
  --swbd DIR      # directory for Switchboard transcripts
  --web-sw FILE   # University of Washington (191M) Switchboard web data
  --web-fsh FILE  # University of Washington (525M) Fisher web data
  --web-mtg FILE  # University of Washington (150M) CMU+ICSI+NIST meeting data
";

. utils/parse_options.sh

if [ $# -ne 4 ]; then
  printf "$help_message\n";
  exit 1;
fi

train=$1    # data/ihm/train/text
dev=$2      # data/ihm/dev/text
lexicon=$3  # data/ihm/dict/lexicon.txt
dir=$4      # data/local/lm

for f in "$train" "$dev" "$lexicon"; do
  [ ! -f $f ] && echo "$0: No such file $f" && exit 1;
done

set -o errexit
mkdir -p $dir
export LC_ALL=C

cut -d' ' -f2- $train | gzip -c > $dir/train.gz
cut -d' ' -f2- $dev | gzip -c > $dir/dev.gz
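The `cut` invocations above strip the leading utterance id from each Kaldi `text` line, leaving only the words for LM training (toy line for illustration):

```shell
# A Kaldi text line: utterance id in field 1, transcript words after it.
echo "AMI_ES2011a_H00_FEE041_0003415_0003484 OKAY SO WELCOME" | cut -d' ' -f2-
# OKAY SO WELCOME
```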

awk '{print $1}' $lexicon | sort -u > $dir/wordlist.lex
gunzip -c $dir/train.gz | tr ' ' '\n' | grep -v ^$ | sort -u > $dir/wordlist.train
sort -u $dir/wordlist.lex $dir/wordlist.train > $dir/wordlist

ngram-count -text $dir/train.gz -order $order -limit-vocab -vocab $dir/wordlist \
  -unk -map-unk "<unk>" -kndiscount -interpolate -lm $dir/ami.o${order}g.kn.gz
echo "PPL for AMI LM:"
ngram -unk -lm $dir/ami.o${order}g.kn.gz -ppl $dir/dev.gz
ngram -unk -lm $dir/ami.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 >& $dir/ppl2
mix_ppl="$dir/ppl2"
mix_tag="ami"
mix_lms=( "$dir/ami.o${order}g.kn.gz" )
num_lms=1

if [ ! -z "$swbd" ]; then
  mkdir -p $dir/swbd

  find $swbd -iname '*-trans.text' -exec cat {} \; | cut -d' ' -f4- \
    | gzip -c > $dir/swbd/text0.gz
  gunzip -c $dir/swbd/text0.gz | swbd_map_words.pl | gzip -c \
    > $dir/swbd/text1.gz
  ngram-count -text $dir/swbd/text1.gz -order $order -limit-vocab \
    -vocab $dir/wordlist -unk -map-unk "<unk>" -kndiscount -interpolate \
    -lm $dir/swbd/swbd.o${order}g.kn.gz
  echo "PPL for SWBD LM:"
  ngram -unk -lm $dir/swbd/swbd.o${order}g.kn.gz -ppl $dir/dev.gz
  ngram -unk -lm $dir/swbd/swbd.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 \
    >& $dir/swbd/ppl2

  mix_ppl="$mix_ppl $dir/swbd/ppl2"
  mix_tag="${mix_tag}_swbd"
  mix_lms=("${mix_lms[@]}" "$dir/swbd/swbd.o${order}g.kn.gz")
  num_lms=$[ num_lms + 1 ]
fi

if [ ! -z "$fisher" ]; then
  [ ! -d "$fisher/part1/data/trans" ] \
    && echo "Cannot find transcripts in Fisher directory: '$fisher'" \
    && exit 1;
  mkdir -p $dir/fisher

  find $fisher -path '*/trans/*fe*.txt' -exec cat {} \; | grep -v ^# | grep -v ^$ \
    | cut -d' ' -f4- | gzip -c > $dir/fisher/text0.gz
  gunzip -c $dir/fisher/text0.gz | fisher_map_words.pl \
    | gzip -c > $dir/fisher/text1.gz
  ngram-count -debug 0 -text $dir/fisher/text1.gz -order $order -limit-vocab \
    -vocab $dir/wordlist -unk -map-unk "<unk>" -kndiscount -interpolate \
    -lm $dir/fisher/fisher.o${order}g.kn.gz
  echo "PPL for Fisher LM:"
  ngram -unk -lm $dir/fisher/fisher.o${order}g.kn.gz -ppl $dir/dev.gz
  ngram -unk -lm $dir/fisher/fisher.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 \
    >& $dir/fisher/ppl2

  mix_ppl="$mix_ppl $dir/fisher/ppl2"
  mix_tag="${mix_tag}_fsh"
  mix_lms=("${mix_lms[@]}" "$dir/fisher/fisher.o${order}g.kn.gz")
  num_lms=$[ num_lms + 1 ]
fi

if [ ! -z "$google" ]; then
  mkdir -p $dir/google
  wget -O $dir/google/cantab.lm3.bz2 http://vm.cantabresearch.com:6080/demo/cantab.lm3.bz2
  wget -O $dir/google/150000.lex http://vm.cantabresearch.com:6080/demo/150000.lex

  ngram -unk -limit-vocab -vocab $dir/wordlist -lm $dir/google/cantab.lm3.bz2 \
    -write-lm $dir/google/google.o${order}g.kn.gz

  mix_ppl="$mix_ppl $dir/google/ppl2"
  mix_tag="${mix_tag}_google"
  mix_lms=("${mix_lms[@]}" "$dir/google/google.o${order}g.kn.gz")
  num_lms=$[ num_lms + 1 ]
fi

## The University of Washington conversational web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/191M_conversational_web-filt+periods.gz
if [ ! -z "$web_sw" ]; then
  echo "Interpolating web-LM not implemented yet"
fi

## The University of Washington Fisher conversational web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/525M_fisher_conv_web-filt+periods.gz
if [ ! -z "$web_fsh" ]; then
  echo "Interpolating web-LM not implemented yet"
fi

## The University of Washington meeting web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/150M_cmu+icsi+nist-meetings.gz
if [ ! -z "$web_mtg" ]; then
  echo "Interpolating web-LM not implemented yet"
fi
if [ $num_lms -gt 1 ]; then
  echo "Computing interpolation weights from: $mix_ppl"
  compute-best-mix $mix_ppl >& $dir/mix.log
  grep 'best lambda' $dir/mix.log \
    | perl -e '$_=<>; s/.*\(//; s/\).*//; @A = split; for $i (@A) {print "$i\n";}' \
    > $dir/mix.weights
  weights=( `cat $dir/mix.weights` )
  cmd="ngram -lm ${mix_lms[0]} -lambda ${weights[0]} -mix-lm ${mix_lms[1]}"
  for i in `seq 2 $((num_lms-1))`; do
    cmd="$cmd -mix-lm${i} ${mix_lms[$i]} -mix-lambda${i} ${weights[$i]}"
  done
  cmd="$cmd -unk -write-lm $dir/${mix_tag}.o${order}g.kn.gz"
  echo "Interpolating LMs with command: \"$cmd\""
  $cmd
  echo "PPL for the interpolated LM:"
  ngram -unk -lm $dir/${mix_tag}.o${order}g.kn.gz -ppl $dir/dev.gz
fi

# Save the LM name for further use
echo "${mix_tag}.o${order}g.kn" > $dir/final_lm
@ -0,0 +1,47 @@
#!/bin/bash

# Copyright, University of Edinburgh (Pawel Swietojanski and Jonathan Kilgour)

if [ $# -ne 1 ]; then
  echo "Usage: $0 <ami-dir>"
  exit 1;
fi

adir=$1
wdir=data/local/annotations

[ ! -f $adir/annotations/AMI-metadata.xml ] && echo "$0: File $adir/annotations/AMI-metadata.xml not found." && exit 1;

mkdir -p $wdir/log

JAVA_VER=$(java -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')

if [ "$JAVA_VER" -ge 15 ]; then
  if [ ! -d $wdir/nxt ]; then
    echo "Downloading NXT annotation tool..."
    wget -O $wdir/nxt.zip http://sourceforge.net/projects/nite/files/nite/nxt_1.4.4/nxt_1.4.4.zip &> /dev/null
    unzip -d $wdir/nxt $wdir/nxt.zip &> /dev/null
  fi

  if [ ! -f $wdir/transcripts0 ]; then
    echo "Parsing XML files (can take several minutes)..."
    nxtlib=$wdir/nxt/lib
    java -cp $nxtlib/nxt.jar:$nxtlib/xmlParserAPIs.jar:$nxtlib/xalan.jar:$nxtlib \
      FunctionQuery -c $adir/annotations/AMI-metadata.xml -q '($s segment)(exists $w1 w):$s^$w1' -atts obs who \
      '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent,global_name, 0)'\
      '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent, channel, 0)' \
      transcriber_start transcriber_end starttime endtime '$s' '@extract(($w w):$s^$w & $w@punc="true", starttime,0,0)' \
      1> $wdir/transcripts0 2> $wdir/log/nxt_export.log
  fi
else
  echo "$0. Java not found. Will download an exported version of the transcripts."
  annots=ami_manual_annotations_v1.6.1_export
  wget -O $wdir/$annots.gzip http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/$annots.gzip
  gunzip -c $wdir/${annots}.gzip > $wdir/transcripts0
fi

# Remove NXT logs dumped to stdio
grep -e '^Found' -e '^Obs' -i -v $wdir/transcripts0 > $wdir/transcripts1

exit 0;
@ -0,0 +1,33 @@
#!/bin/bash

# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)

. ./path.sh

nj=$1
job=$2
numch=$3
meetings=$4
sdir=$5
odir=$6
wdir=data/local/beamforming

utils/split_scp.pl -j $nj $((job-1)) $meetings $meetings.$job

while read line; do

  mkdir -p $odir/$line
  BeamformIt -s $line -c $wdir/channels_$numch \
    --config_file `pwd`/conf/ami.cfg \
    --source_dir $sdir \
    --result_dir $odir/$line
  mv $odir/$line/${line}.del $odir/$line/${line}_MDM$numch.del
  mv $odir/$line/${line}.del2 $odir/$line/${line}_MDM$numch.del2
  mv $odir/$line/${line}.info $odir/$line/${line}_MDM$numch.info
  mv $odir/$line/${line}.ovl $odir/$line/${line}_MDM$numch.ovl
  mv $odir/$line/${line}.weat $odir/$line/${line}_MDM$numch.weat
  mv $odir/$line/${line}.wav $odir/$line/${line}_MDM$numch.wav

done < $meetings.$job

@ -0,0 +1,101 @@
#!/usr/bin/perl

# Copyright 2012  Johns Hopkins University (Author: Daniel Povey). Apache 2.0.
#           2013  University of Edinburgh (Author: Pawel Swietojanski)

# This takes as its argument a directory containing all the usual
# data files - segments, text, utt2spk and reco2file_and_channel - and creates an stm file.

if (@ARGV < 1 || @ARGV > 2) {
  print STDERR "Usage: convert2stm.pl <data-dir> [<utt2spk_stm>] > stm-file\n";
  exit(1);
}

$dir=shift @ARGV;
$utt2spk_file=shift @ARGV || 'utt2spk';

$segments = "$dir/segments";
$reco2file_and_channel = "$dir/reco2file_and_channel";
$text = "$dir/text";
$utt2spk_file = "$dir/$utt2spk_file";

open(S, "<$segments") || die "opening segments file $segments";
while(<S>) {
  @A = split(" ", $_);
  @A > 3 || die "convert2stm: Bad line in segments file: $_";
  ($utt, $recording_id, $begin_time, $end_time) = @A[0..3];
  $utt2reco{$utt} = $recording_id;
  $begin{$utt} = $begin_time;
  $end{$utt} = $end_time;
}
close(S);

open(R, "<$reco2file_and_channel") || die "open reco2file_and_channel file $reco2file_and_channel";
while(<R>) {
  @A = split(" ", $_);
  @A == 3 || die "convert2stm: Bad line in reco2file_and_channel file: $_";
  ($recording_id, $file, $channel) = @A;
  $reco2file{$recording_id} = $file;
  $reco2channel{$recording_id} = $channel;
}
close(R);

open(T, "<$text") || die "open text file $text";
while(<T>) {
  @A = split(" ", $_);
  $utt = shift @A;
  $utt2text{$utt} = "@A";
}
close(T);

open(U, "<$utt2spk_file") || die "open utt2spk file $utt2spk_file";
while(<U>) {
  @A = split(" ", $_);
  @A == 2 || die "convert2stm: Bad line in utt2spk file: $_";
  ($utt, $spk) = @A;
  $utt2spk{$utt} = $spk;
}
close(U);

# Now generate the stm file
foreach $utt (sort keys(%utt2reco)) {

  # lines look like:
  # <File> <Channel> <Speaker> <BeginTime> <EndTime> [ <LABEL> ] transcript
  $recording_id = $utt2reco{$utt};
  if (!defined $recording_id) { die "Utterance-id $utt not defined in segments file $segments"; }
  $file = $reco2file{$recording_id};
  $channel = $reco2channel{$recording_id};
  if (!defined $file || !defined $channel) {
    die "convert2stm: Recording-id $recording_id not defined in reco2file_and_channel file $reco2file_and_channel";
  }

  $speaker = $utt2spk{$utt};
  $transcripts = $utt2text{$utt};

  if (!defined $speaker) { die "convert2stm: Speaker-id for utterance $utt not defined in utt2spk file $utt2spk_file"; }
  if (!defined $transcripts) { die "convert2stm: Transcript for $utt not defined in text file $text"; }

  $b = $begin{$utt};
  $e = $end{$utt};
  $line = "$file $channel $speaker $b $e $transcripts \n";

  print $line; # goes to stdout.
}

__END__

# Test example
# ES2011a.Headset-0 A AMI_ES2011a_H00_FEE041 34.27 37.14 HERE WE GO
mkdir tmpdir
echo utt reco 10.0 20.0 > tmpdir/segments
echo utt word > tmpdir/text
echo reco file A > tmpdir/reco2file_and_channel
echo utt spk > tmpdir/utt2spk
echo file A spk 10.0 20.00 word > stm_tst
utils/convert2stm.pl tmpdir | cmp - stm_tst || echo error
rm -r tmpdir stm_tst

Diff not shown because the file is too large.
@ -0,0 +1,83 @@
#!/usr/bin/perl -w

# Copyright 2013  Arnab Ghoshal

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.


# This script cleans up the Fisher English transcripts and maps the words to
# be similar to the Switchboard Mississippi State transcripts.
# Reads from STDIN and writes to STDOUT.

use strict;

while (<>) {
  chomp;

  $_ = lc($_);  # few things aren't lowercased in the data, e.g. I'm
  s/\*//g;  # *mandatory -> mandatory
  s/\(//g; s/\)//g;  # Remove parentheses
  next if /^\s*$/;  # Skip empty lines

  # In one conversation people speak some German phrases that are tagged as
  # <german (( ja wohl )) > -- we remove these
  s/<[^>]*>//g;

  s/\.\_/ /g;  # Abbreviations: a._b._c. -> a b c.
  s/(\w)\.s( |$)/$1's /g;  # a.s -> a's
  s/\./ /g;  # Remove remaining .
  s/(\w)\,(\w| )/$1 $2/g;  # commas don't appear within numbers, but still

  s/( |^)\'(blade|cause|course|frisco|okay|plain|specially)( |$)/ $2 /g;
  s/\'em/-em/g;

  # Remove an opening ' if there is a matching closing ' since some word
  # fragments are annotated as: 'kay, etc.
  # The substitution is done twice, since matching once doesn't capture
  # consecutive quoted segments (the space in between is used up).
  s/(^| )\'(.*?)\'( |$)/ $2 /g;
  s/(^| )\'(.*?)\'( |$)/ $2 /g;

  s/( |^)\'(\w)( |-|$)/$1 /g;  # 'a- -> a
  s/( |^)-( |$)/ /g;  # Remove dangling -
  s/\?//g;  # Remove ?
  s/( |^)non-(\w+)( |$)/ non $2 /g;  # non-stop -> non stop

  # Some words that are annotated as fragments are actual dictionary words
  s/( |-)(acceptable|arthritis|ball|cause|comes|course|eight|eighty|field|giving|habitating|heard|hood|how|king|ninety|okay|paper|press|scripts|store|till|vascular|wood|what|york)(-| )/ $2 /g;

  # Remove [[skip]] and [pause]
  s/\[\[skip\]\]/ /g;
  s/\[pause\]/ /g;

  # [breath], [cough], [lipsmack], [sigh], [sneeze] -> [noise]
  s/\[breath\]/[noise]/g;
  s/\[cough\]/[noise]/g;
  s/\[lipsmack\]/[noise]/g;
  s/\[sigh\]/[noise]/g;
  s/\[sneeze\]/[noise]/g;

  s/\[mn\]/[vocalized-noise]/g;  # [mn] -> [vocalized-noise]
  s/\[laugh\]/[laughter]/g;  # [laugh] -> [laughter]

  $_ = uc($_);
  # Now, mapping individual words
  my @words = split /\s+/;
  for my $i (0..$#words) {
    my $w = $words[$i];
    $w =~ s/^'/-/;
    $words[$i] = $w;
  }
  print join(" ", @words) . "\n";
}
@ -0,0 +1,42 @@
#!/bin/bash

# Copyright Johns Hopkins University (Author: Daniel Povey) 2012
# Copyright University of Edinburgh (Author: Pawel Swietojanski) 2014
# Apache 2.0

orig_args=
for x in "$@"; do orig_args="$orig_args '$x'"; done

# begin configuration section. we include all the options that score_sclite.sh or
# score_basic.sh might need, or parse_options.sh will die.
cmd=run.pl
stage=0
min_lmwt=9
max_lmwt=20
reverse=false
asclite=true
# end configuration section.

[ -f ./path.sh ] && . ./path.sh
. parse_options.sh || exit 1;

if [ $# -ne 3 ]; then
  echo "Usage: local/score.sh [options] <data-dir> <lang-dir|graph-dir> <decode-dir>"
  echo " Options:"
  echo "    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes."
  echo "    --stage (0|1|2)                 # start scoring script from part-way through."
  echo "    --min_lmwt <int>                # minimum LM-weight for lattice rescoring "
  echo "    --max_lmwt <int>                # maximum LM-weight for lattice rescoring "
  echo "    --reverse (true/false)          # score with time reversed features "
  echo "    --asclite (true/false)          # score with asclite instead of sclite (overlapped speech)"
  exit 1;
fi

data=$1

if [ -f $data/stm ]; then # use sclite scoring.
  eval local/score_asclite.sh --asclite $asclite $orig_args
else
  echo "$data/stm does not exist: using local/score_basic.sh"
  eval local/score_basic.sh $orig_args
fi

@ -0,0 +1,97 @@
#!/bin/bash
# Copyright Johns Hopkins University (Author: Daniel Povey) 2012. Apache 2.0.
#           2014, University of Edinburgh (Author: Pawel Swietojanski)

# begin configuration section.
cmd=run.pl
stage=0
min_lmwt=9
max_lmwt=20
reverse=false
asclite=true
overlap_spk=4
# end configuration section.

[ -f ./path.sh ] && . ./path.sh
. parse_options.sh || exit 1;

if [ $# -ne 3 ]; then
  echo "Usage: local/score_asclite.sh [--cmd (run.pl|queue.pl...)] <data-dir> <lang-dir|graph-dir> <decode-dir>"
  echo " Options:"
  echo "    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes."
  echo "    --stage (0|1|2)                 # start scoring script from part-way through."
  echo "    --min_lmwt <int>                # minimum LM-weight for lattice rescoring "
  echo "    --max_lmwt <int>                # maximum LM-weight for lattice rescoring "
  echo "    --reverse (true/false)          # score with time reversed features "
  exit 1;
fi

data=$1
lang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.
dir=$3

model=$dir/../final.mdl # assume model one level up from decoding dir.

hubscr=$KALDI_ROOT/tools/sctk-2.4.0/bin/hubscr.pl
[ ! -f $hubscr ] && echo "Cannot find scoring program at $hubscr" && exit 1;
hubdir=`dirname $hubscr`

for f in $data/stm $data/glm $lang/words.txt $lang/phones/word_boundary.int \
    $model $data/segments $data/reco2file_and_channel $dir/lat.1.gz; do
  [ ! -f $f ] && echo "$0: expecting file $f to exist" && exit 1;
done

name=`basename $data`; # e.g. eval2000

mkdir -p $dir/ascoring/log

if [ $stage -le 0 ]; then
  if $reverse; then
    $cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/get_ctm.LMWT.log \
      mkdir -p $dir/ascore_LMWT/ '&&' \
      lattice-1best --lm-scale=LMWT "ark:gunzip -c $dir/lat.*.gz|" ark:- \| \
      lattice-reverse ark:- ark:- \| \
      lattice-align-words --reorder=false $lang/phones/word_boundary.int $model ark:- ark:- \| \
      nbest-to-ctm ark:- - \| \
      utils/int2sym.pl -f 5 $lang/words.txt \| \
      utils/convert_ctm.pl $data/segments $data/reco2file_and_channel \
      '>' $dir/ascore_LMWT/$name.ctm || exit 1;
  else
    $cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/get_ctm.LMWT.log \
      mkdir -p $dir/ascore_LMWT/ '&&' \
      lattice-1best --lm-scale=LMWT "ark:gunzip -c $dir/lat.*.gz|" ark:- \| \
      lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \| \
      nbest-to-ctm ark:- - \| \
      utils/int2sym.pl -f 5 $lang/words.txt \| \
      utils/convert_ctm.pl $data/segments $data/reco2file_and_channel \
      '>' $dir/ascore_LMWT/$name.ctm || exit 1;
  fi
fi

if [ $stage -le 1 ]; then
  # Remove some stuff we don't want to score, from the ctm.
  for x in $dir/ascore_*/$name.ctm; do
    cp $x $dir/tmpf;
    cat $dir/tmpf | grep -i -v -E '\[noise|laughter|vocalized-noise\]' | \
      grep -i -v -E '<unk>' > $x;
    # grep -i -v -E '<UNK>|%HESITATION' > $x;
  done
fi

if [ $stage -le 2 ]; then
  if [ "$asclite" == "true" ]; then
    oname=$name
    [ ! -z $overlap_spk ] && oname=${name}_o$overlap_spk
    $cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/score.LMWT.log \
      cp $data/stm $dir/ascore_LMWT/ '&&' \
      cp $dir/ascore_LMWT/${name}.ctm $dir/ascore_LMWT/${oname}.ctm '&&' \
      $hubscr -G -v -m 1:2 -o$overlap_spk -a -C -B 8192 -p $hubdir -V -l english \
      -h rt-stt -g $data/glm -r $dir/ascore_LMWT/stm $dir/ascore_LMWT/${oname}.ctm || exit 1;
  else
    $cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/score.LMWT.log \
      cp $data/stm $dir/ascore_LMWT/ '&&' \
      $hubscr -p $hubdir -V -l english -h hub5 -g $data/glm -r $dir/ascore_LMWT/stm $dir/ascore_LMWT/${name}.ctm || exit 1
  fi
fi

exit 0

@ -0,0 +1,5 @@
The splits in this directory follow the official AMI Corpus Full-ASR split
into train, dev and eval sets.

If for some reason you need to use a different split, create split_*.final
versions in this directory and run the recipe.
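The note above can be illustrated with a minimal sketch. The filename `split_train.final` and the fact that the recipe prefers such files are assumptions inferred from the `split_*.final` pattern mentioned in the README, not confirmed behaviour:

```shell
# Hypothetical example: overriding the official training split.
# Write your own list of meeting IDs (one per line) under the
# split_*.final name the README says the recipe will pick up.
printf 'ES2002a\nES2002b\n' > split_train.final
# the custom split now contains 2 meetings
wc -l < split_train.final
```

The dev and eval lists would presumably be overridden the same way, keeping the three sets disjoint.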
@ -0,0 +1,18 @@
ES2011a
ES2011b
ES2011c
ES2011d
IB4001
IB4002
IB4003
IB4004
IB4010
IB4011
IS1008a
IS1008b
IS1008c
IS1008d
TS3004a
TS3004b
TS3004c
TS3004d
@ -0,0 +1,16 @@
EN2002a
EN2002b
EN2002c
EN2002d
ES2004a
ES2004b
ES2004c
ES2004d
IS1009a
IS1009b
IS1009c
IS1009d
TS3003a
TS3003b
TS3003c
TS3003d
@ -0,0 +1,137 @@
EN2001a
EN2001b
EN2001d
EN2001e
EN2003a
EN2004a
EN2005a
EN2006a
EN2006b
EN2009b
EN2009c
EN2009d
ES2002a
ES2002b
ES2002c
ES2002d
ES2003a
ES2003b
ES2003c
ES2003d
ES2005a
ES2005b
ES2005c
ES2005d
ES2006a
ES2006b
ES2006c
ES2006d
ES2007a
ES2007b
ES2007c
ES2007d
ES2008a
ES2008b
ES2008c
ES2008d
ES2009a
ES2009b
ES2009c
ES2009d
ES2010a
ES2010b
ES2010c
ES2010d
ES2012a
ES2012b
ES2012c
ES2012d
ES2013a
ES2013b
ES2013c
ES2013d
ES2014a
ES2014b
ES2014c
ES2014d
ES2015a
ES2015b
ES2015c
ES2015d
ES2016a
ES2016b
ES2016c
ES2016d
IB4005
IN1001
IN1002
IN1005
IN1007
IN1008
IN1009
IN1012
IN1013
IN1014
IN1016
IS1000a
IS1000b
IS1000c
IS1000d
IS1001a
IS1001b
IS1001c
IS1001d
IS1002b
IS1002c
IS1002d
IS1003a
IS1003b
IS1003c
IS1003d
IS1004a
IS1004b
IS1004c
IS1004d
IS1005a
IS1005b
IS1005c
IS1006a
IS1006b
IS1006c
IS1006d
IS1007a
IS1007b
IS1007c
IS1007d
TS3005a
TS3005b
TS3005c
TS3005d
TS3006a
TS3006b
TS3006c
TS3006d
TS3007a
TS3007b
TS3007c
TS3007d
TS3008a
TS3008b
TS3008c
TS3008d
TS3009a
TS3009b
TS3009c
TS3009d
TS3010a
TS3010b
TS3010c
TS3010d
TS3011a
TS3011b
TS3011c
TS3011d
TS3012a
TS3012b
TS3012c
TS3012d
@ -0,0 +1,36 @@

export LC_ALL=C # For expected sorting and joining behaviour

KALDI_ROOT=/gpfs/scratch/s1136550/kaldi-code
#KALDI_ROOT=/disk/data1/software/kaldi-trunk-atlas
#KALDI_ROOT=/disk/data1/pbell1/software/kaldi-trunk-mkl/

KALDISRC=$KALDI_ROOT/src
KALDIBIN=$KALDISRC/bin:$KALDISRC/featbin:$KALDISRC/fgmmbin:$KALDISRC/fstbin
KALDIBIN=$KALDIBIN:$KALDISRC/gmmbin:$KALDISRC/latbin:$KALDISRC/nnetbin
KALDIBIN=$KALDIBIN:$KALDISRC/sgmmbin:$KALDISRC/tiedbin

FSTBIN=$KALDI_ROOT/tools/openfst/bin
LMBIN=$KALDI_ROOT/tools/irstlm/bin
SRILM=$KALDI_ROOT/tools/srilm/bin/i686-m64
BEAMFORMIT=$KALDI_ROOT/tools/BeamformIt-3.51
#BEAMFORMIT=/disk/data1/s1136550/BeamformIt-3.51

[ -d $PWD/local ] || { echo "Error: 'local' subdirectory not found."; }
[ -d $PWD/utils ] || { echo "Error: 'utils' subdirectory not found."; }
[ -d $PWD/steps ] || { echo "Error: 'steps' subdirectory not found."; }

export kaldi_local=$PWD/local
export kaldi_utils=$PWD/utils
export kaldi_steps=$PWD/steps
SCRIPTS=$kaldi_local:$kaldi_utils:$kaldi_steps

export PATH=$PATH:$KALDIBIN:$FSTBIN:$LMBIN:$SCRIPTS:$BEAMFORMIT:$SRILM

#CUDA_VER='cuda-5.0.35'

#export PATH=$PATH:/opt/$CUDA_VER/bin
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/$CUDA_VER/lib64:/opt/$CUDA_VER/lib

@ -0,0 +1,204 @@
#!/bin/bash -u

. ./cmd.sh
. ./path.sh

# INITIAL COMMENTS
# To run the whole recipe you will need:
# 1) SRILM
# 2)

# 1) some settings
# do not change this; it allows copy-pasting of training commands between ihm, sdm and mdm
mic=ihm
# path where AMI should be downloaded, or where it is locally available
AMI_DIR=/disk/data2/amicorpus/
# path to Fisher transcripts for the background language model;
# when not set, only an in-domain LM will be built
FISHER_TRANS=`pwd`/eddie_data/lm/data/fisher

norm_vars=false

# 1)

# in case you want to download the AMI corpus, uncomment this line
# you need around 130GB of free space to get the whole data (ihm+mdm)
local/ami_download.sh ihm $AMI_DIR || exit 1;

# 2) Data preparation

local/ami_text_prep.sh $AMI_DIR

local/ami_ihm_data_prep.sh $AMI_DIR || exit 1;

local/ami_ihm_scoring_data_prep.sh $AMI_DIR dev || exit 1;

local/ami_ihm_scoring_data_prep.sh $AMI_DIR eval || exit 1;

local/ami_prepare_dict.sh

utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang

local/ami_train_lms.sh --fisher $FISHER_TRANS data/ihm/train/text data/ihm/dev/text data/local/dict/lexicon.txt data/local/lm

final_lm=`cat data/local/lm/final_lm`
LM=$final_lm.pr1-7
nj=16

prune-lm --threshold=1e-7 data/local/lm/$final_lm.gz /dev/stdout | \
  gzip -c > data/local/lm/$LM.gz

utils/format_lm.sh data/lang data/local/lm/$LM.gz data/local/dict/lexicon.txt data/lang_$LM

#local/ami_format_data.sh data/local/lm/$LM.gz

# 3) Building systems
# here starts the normal recipe, which is mostly shared across mic scenarios;
# one difference is that for sdm and mdm we do not adapt per speaker but per environment only

mfccdir=mfcc_$mic
(
  steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
)&
(
  steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
)&
(
  steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
)&

wait;

for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done

# 4) Train systems
nj=16

mkdir -p exp/$mic/mono
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 --norm-vars $norm_vars \
  data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;

mkdir -p exp/$mic/mono_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
  exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;

mkdir -p exp/$mic/tri1
steps/train_deltas.sh --cmd "$train_cmd" --norm-vars $norm_vars \
  5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
  >& exp/$mic/tri1/train.log || exit 1;

mkdir -p exp/$mic/tri1_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;

mkdir -p exp/$mic/tri2a
steps/train_deltas.sh --cmd "$train_cmd" --norm-vars $norm_vars \
  5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
  >& exp/$mic/tri2a/train.log || exit 1;

for lm_suffix in $LM; do
#  (
  graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
  $highmem_cmd $graph_dir/mkgraph.log \
    utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir

  steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}

  steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}

#  ) &
done

mkdir -p exp/$mic/tri2a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2_ali || exit 1;

# Train tri3a, which is LDA+MLLT
mkdir -p exp/$mic/tri3a
steps/train_lda_mllt.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" \
  5000 80000 data/$mic/train data/lang exp/$mic/tri2_ali exp/$mic/tri3a \
  >& exp/$mic/tri3a/train.log || exit 1;

for lm_suffix in $LM; do
  (
    graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
    $highmem_cmd $graph_dir/mkgraph.log \
      utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir

    steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
      $graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}

    steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
      $graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
  )
done

# Train tri4a, which is LDA+MLLT+SAT
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;

mkdir -p exp/$mic/tri4a
steps/train_sat.sh --cmd "$train_cmd" \
  5000 80000 data/$mic/train data/lang exp/$mic/tri3a_ali \
  exp/$mic/tri4a >& exp/$mic/tri4a/train.log || exit 1;

for lm_suffix in $LM; do
  (
    graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
    $highmem_cmd $graph_dir/mkgraph.log \
      utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri4a $graph_dir

    steps/decode_fmllr.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
      $graph_dir data/$mic/dev exp/$mic/tri4a/decode_dev_${lm_suffix}

    steps/decode_fmllr.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
      $graph_dir data/$mic/eval exp/$mic/tri4a/decode_eval_${lm_suffix}
  )
done
exit;
# MMI training starting from the LDA+MLLT+SAT systems
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_ali || exit 1

steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
  --transform-dir exp/$mic/tri4a_ali \
  data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_denlats || exit 1;

# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
  data/$mic/train data/lang exp/$mic/tri4a_ali exp/$mic/tri4a_denlats \
  exp/$mic/tri4a_mmi_b0.1 || exit 1;

for lm_suffix in $LM; do
  (
    graph_dir=exp/$mic/tri4a/graph_${lm_suffix}

    for i in `seq 1 4`; do
      decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
      steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
        --transform-dir exp/$mic/tri4a/decode_dev_${lm_suffix} --iter $i \
        $graph_dir data/$mic/dev $decode_dir
    done

    i=3 # simply assumed
    decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
    steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
      --transform-dir exp/$mic/tri4a/decode_eval_${lm_suffix} --iter $i \
      $graph_dir data/$mic/eval $decode_dir
  )
done

# here goes hybrid stuff
# in the ASRU paper we used different python nnet code, so someone needs to copy&adjust nnet or nnet2 switchboard commands

@ -0,0 +1,156 @@
|
|||
#!/bin/bash -u
|
||||
|
||||
. ./cmd.sh
|
||||
. ./path.sh
|
||||
|
||||
# MDM - Multiple Distant Microphones
|
||||
# Assuming text preparation, dict, lang and LM were build as in run_ihm
|
||||
|
||||
nmics=8 #we use all 8 channels, possible other options are 2 and 4
|
||||
mic=mdm$nmics #subdir name under data/
|
||||
AMI_DIR=/disk/data2/amicorpus #root of AMI corpus
|
||||
MDM_DIR=/disk/data1/s1136550/ami/mdm #directory for beamformed waves
|
||||
|
||||
#1) Download AMI (distant channels)
|
||||
|
||||
local/ami_download.sh mdm $AMI_DIR
|
||||
|
||||
#2) Beamform
|
||||
|
||||
local/ami_beamform.sh --nj 16 $nmics $AMI_DIR $MDM_DIR
|
||||
|
||||
#3) Prepare mdm data directories
|
||||
|
||||
local/ami_mdm_data_prep.sh $MDM_DIR $mic || exit 1;
|
||||
local/ami_mdm_scoring_data_prep.sh $MDM_DIR $mic dev || exit 1;
|
||||
local/ami_mdm_scoring_data_prep.sh $MDM_DIR $mic eval || exit 1;
|
||||
|
||||
#use the final LM
|
||||
final_lm=`cat data/local/lm/final_lm`
|
||||
LM=$final_lm.pr1-7
|
||||
|
||||
DEV_SPK=`cut -d" " -f2 data/$mic/dev/utt2spk | sort | uniq -c | wc -l`
|
||||
EVAL_SPK=`cut -d" " -f2 data/$mic/eval/utt2spk | sort | uniq -c | wc -l`
|
||||
nj=16
|
||||
|
||||
#GENERATE FEATS
|
||||
mfccdir=mfcc_$mic
|
||||
(
|
||||
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
|
||||
steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
|
||||
)&
|
||||
(
|
||||
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
|
||||
steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
|
||||
)&
|
||||
(
|
||||
steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
|
||||
steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
|
||||
)&
|
||||
|
||||
wait;
|
||||
for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done
|
||||
|
||||
# Build the systems
|
||||
|
||||
# TRAIN THE MODELS
|
||||
mkdir -p exp/$mic/mono
|
||||
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 \
|
||||
data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;
|
||||
|
||||
mkdir -p exp/$mic/mono_ali
|
||||
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
|
||||
exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;
|
||||
|
||||
mkdir -p exp/$mic/tri1
|
||||
steps/train_deltas.sh --cmd "$train_cmd" \
|
||||
5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
|
||||
>& exp/$mic/tri1/train.log || exit 1;
|
||||
|
||||
mkdir -p exp/$mic/tri1_ali
|
||||
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
|
||||
data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;
|
||||
|
||||
mkdir -p exp/$mic/tri2a
|
||||
steps/train_deltas.sh --cmd "$train_cmd" \
|
||||
5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
|
||||
>& exp/$mic/tri2a/train.log || exit 1;
|
||||
|
||||
for lm_suffix in $LM; do
|
||||
(
|
||||
graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
|
||||
$highmem_cmd $graph_dir/mkgraph.log \
|
||||
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir
|
||||
|
||||
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
|
||||
$graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}
|
||||
|
||||
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
|
||||
$graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}
|
||||
)
|
||||
done
|
||||
|
||||
#THE TARGET LDA+MLLT+SAT+BMMI PART GOES HERE:
|
||||
mkdir -p exp/$mic/tri2a_ali
|
||||
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
|
||||
data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2_ali || exit 1;
|
||||
|
||||
# Train tri3a, which is LDA+MLLT
|
||||
mkdir -p exp/$mic/tri3a
|
||||
steps/train_lda_mllt.sh --cmd "$train_cmd" \
|
||||
--splice-opts "--left-context=3 --right-context=3" \
|
||||
5000 80000 data/$mic/train data/lang exp/$mic/tri2_ali exp/$mic/tri3a \
|
||||
>& exp/$mic/tri3a/train.log || exit 1;
|
||||
|
||||
for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
  $highmem_cmd $graph_dir/mkgraph.log \
    utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir

  steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}

  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
  )
done

# skip SAT, and build MMI models
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_denlats || exit 1;

mkdir -p exp/$mic/tri3a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;

# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
  data/$mic/train data/lang exp/$mic/tri3a_ali exp/$mic/tri3a_denlats \
  exp/$mic/tri3a_mmi_b0.1 || exit 1;

|
||||
for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri3a/graph_${lm_suffix}

  for i in `seq 1 4`; do
    decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
    steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
      --iter $i $graph_dir data/$mic/dev $decode_dir
  done

  i=3 # simply assumed to be a good iteration; check the dev results above
  decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    --iter $i $graph_dir data/$mic/eval $decode_dir
  )
done
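The eval iteration is hard-coded to `i=3` above. A hedged sketch of picking the dev-best MMI iteration instead; it assumes Kaldi-style `%WER ...` scoring files named `wer_<LMWT>` under each dev decode directory, and the `decode_dev_*` paths here are illustrative stand-ins, not the real experiment directories:

```shell
# Sketch only: choose the MMI iteration with the lowest dev WER instead of
# assuming i=3.  The wer_* files and their "%WER <num> ..." first line follow
# Kaldi's scoring output format; the directories below are toy stand-ins.
mkdir -p decode_dev_1 decode_dev_2
echo "%WER 45.3 [ 100 / 221, ... ]" > decode_dev_1/wer_12
echo "%WER 43.1 [ 95 / 221, ... ]"  > decode_dev_2/wer_12
best_i=0; best=999.0
for i in 1 2; do
  # best WER across language-model weights for this iteration
  w=$(awk '{print $2}' decode_dev_${i}/wer_* | sort -n | head -n1)
  if [ "$(awk -v a="$w" -v b="$best" 'BEGIN{print (a<b)?1:0}')" -eq 1 ]; then
    best=$w; best_i=$i
  fi
done
echo "best dev iteration: $best_i (WER $best)"
```

The selected `$best_i` could then replace the hard-coded `i=3` for the eval decode.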

# the hybrid (DNN) part goes here
# in the ASRU paper we used different python nnet code, so someone needs to copy & adjust the nnet or nnet2 Switchboard commands

@ -0,0 +1,214 @@
#!/bin/bash -u

. ./cmd.sh
. ./path.sh

#SDM - Single Distant Microphone
#Assuming the initial transcripts, dict, lang and LM were built in run_ihm.sh

micid=1 #which mic from the array should be used?
mic=sdm$micid
AMI_DIR=/disk/data2/amicorpus/
norm_vars=false

#1) Download AMI (single distant channel)

local/ami_download.sh sdm $AMI_DIR

#2) Prepare sdm data directories

local/ami_sdm_data_prep.sh $AMI_DIR $micid
local/ami_sdm_scoring_data_prep.sh $AMI_DIR $micid dev
local/ami_sdm_scoring_data_prep.sh $AMI_DIR $micid eval

#use the final LM
final_lm=`cat data/local/lm/final_lm`
LM=$final_lm.pr1-7

#jobs for SDM/MDM decodes - one per meeting on a 16-core local machine
DEV_SPK=$(cut -d" " -f2 data/$mic/dev/utt2spk | sort | uniq -c | wc -l)
EVAL_SPK=$(cut -d" " -f2 data/$mic/eval/utt2spk | sort | uniq -c | wc -l)
echo $DEV_SPK $EVAL_SPK
nj=16
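The `DEV_SPK`/`EVAL_SPK` values above are just the number of unique speakers in the set's `utt2spk` file, so that one decoding job runs per speaker. A small self-contained illustration with a made-up `utt2spk` (the utterance and speaker names are invented; the real files live under `data/$mic/{dev,eval}/`):

```shell
# utt2spk maps each utterance to its speaker, one "utt spk" pair per line.
# Toy data for illustration: three utterances, two distinct speakers.
printf 'AMI_ES2011a_u1 spkA\nAMI_ES2011a_u2 spkA\nAMI_ES2011b_u1 spkB\n' > utt2spk
# Same pipeline as above: count distinct second-column (speaker) values.
NSPK=$(cut -d" " -f2 utt2spk | sort | uniq -c | wc -l)
echo $NSPK
```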

#GENERATE FEATS
mfccdir=mfcc_$mic
(
  steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
)&
(
  steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
)&
(
  steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
)&

wait;
for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done
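The three parenthesised blocks above run as background subshells over the eval, dev and train sets, and `wait` blocks until all of them finish before the data directories are fixed up. The same pattern in miniature:

```shell
# Each ( ... )& runs in its own background subshell, like the three
# MFCC/CMVN blocks above; wait pauses until every background job has exited.
: > done.txt
(sleep 0.1; echo eval  >> done.txt) &
(sleep 0.1; echo dev   >> done.txt) &
(sleep 0.1; echo train >> done.txt) &
wait
wc -l < done.txt   # all three lines are present only after wait returns
```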

# TRAIN THE MODELS
mkdir -p exp/$mic/mono
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 \
  data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;

mkdir -p exp/$mic/mono_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
  exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;

mkdir -p exp/$mic/tri1
steps/train_deltas.sh --cmd "$train_cmd" \
  5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
  >& exp/$mic/tri1/train.log || exit 1;

mkdir -p exp/$mic/tri1_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;

mkdir -p exp/$mic/tri2a
steps/train_deltas.sh --cmd "$train_cmd" \
  5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
  >& exp/$mic/tri2a/train.log || exit 1;

for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
  $highmem_cmd $graph_dir/mkgraph.log \
    utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir

  steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}

  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}
  )
done

#THE TARGET LDA+MLLT+SAT+BMMI PART GOES HERE:
mkdir -p exp/$mic/tri2a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2a_ali || exit 1;

# Train tri3a, which is LDA+MLLT
mkdir -p exp/$mic/tri3a
steps/train_lda_mllt.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" \
  5000 80000 data/$mic/train data/lang exp/$mic/tri2a_ali exp/$mic/tri3a \
  >& exp/$mic/tri3a/train.log || exit 1;

for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
  $highmem_cmd $graph_dir/mkgraph.log \
    utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir

  steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}

  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
  )
done

# skip SAT, and build MMI models
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_denlats || exit 1;

mkdir -p exp/$mic/tri3a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;

# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
  data/$mic/train data/lang exp/$mic/tri3a_ali exp/$mic/tri3a_denlats \
  exp/$mic/tri3a_mmi_b0.1 || exit 1;

for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri3a/graph_${lm_suffix}

  for i in `seq 1 4`; do
    decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
    steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --iter $i --config conf/decode.conf \
      $graph_dir data/$mic/dev $decode_dir
  done

  i=3 # simply assumed to be a good iteration; check the dev results above
  decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --iter $i --config conf/decode.conf \
    $graph_dir data/$mic/eval $decode_dir
  )
done

#By default we do not build systems adapted to sessions for AMI in distant scenarios, as this does not help much (around 1%)
#But one can do so by running the code below
exit;

# Train tri4a, which is LDA+MLLT+SAT
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;

mkdir -p exp/$mic/tri4a
steps/train_sat.sh --cmd "$train_cmd" \
  5000 80000 data/$mic/train data/lang exp/$mic/tri3a_ali \
  exp/$mic/tri4a >& exp/$mic/tri4a/train.log || exit 1;

for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
  $highmem_cmd $graph_dir/mkgraph.log \
    utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri4a $graph_dir

  steps/decode_fmllr.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/dev exp/$mic/tri4a/decode_dev_${lm_suffix}

  steps/decode_fmllr.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    $graph_dir data/$mic/eval exp/$mic/tri4a/decode_eval_${lm_suffix}
  )
done

# MMI training starting from the LDA+MLLT+SAT systems
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
  data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_ali || exit 1

steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
  --transform-dir exp/$mic/tri4a_ali \
  data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_denlats || exit 1;

# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
  data/$mic/train data/lang exp/$mic/tri4a_ali exp/$mic/tri4a_denlats \
  exp/$mic/tri4a_mmi_b0.1 || exit 1;

for lm_suffix in $LM; do
  (
  graph_dir=exp/$mic/tri4a/graph_${lm_suffix}

  for i in `seq 1 4`; do
    decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
    steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
      --transform-dir exp/$mic/tri4a/decode_dev_${lm_suffix} \
      $graph_dir data/$mic/dev $decode_dir
  done

  wait;
  i=3 # simply assumed to be a good iteration; check the dev results above
  decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
  steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
    --transform-dir exp/$mic/tri4a/decode_eval_${lm_suffix} \
    $graph_dir data/$mic/eval $decode_dir
  )&
done

# the hybrid (DNN) part goes here
# in the ASRU paper we used different python nnet code, so someone needs to copy & adjust the nnet or nnet2 Switchboard commands

@ -0,0 +1 @@
../../wsj/s5/steps

@ -0,0 +1 @@
../../wsj/s5/utils

@ -159,3 +159,15 @@ fortran_opt = $(shell gcc -v 2>&1 | perl -e '$$x = join(" ", <STDIN>); if($$x =~
openblas_compiled:
	-git clone git://github.com/xianyi/OpenBLAS
	$(MAKE) PREFIX=`pwd`/OpenBLAS/install FC=gfortran $(fortran_opt) DEBUG=1 USE_THREAD=0 -C OpenBLAS all install

beamformit: beamformit-3.51

.PHONY: beamformit-3.51

beamformit-3.51: BeamformIt-3.51.tgz
	tar -xozf BeamformIt-3.51.tgz; \
	cd BeamformIt-3.51; cmake . ; make

BeamformIt-3.51.tgz:
	wget -c -T 10 http://www.xavieranguera.com/beamformit/releases/BeamformIt-3.51.tgz

@ -0,0 +1,7 @@
#!/bin/bash

# to be run from ..
# this script just exists to tell you how you'd make beamformit - we actually did it
# via Makefile rules, but it's not a default target.

make beamformit