trunk: merging sandbox/pawel to add the AMI recipe.

git-svn-id: https://svn.code.sf.net/p/kaldi/code/trunk@4276 5e6a8d80-dfce-4ca6-a32a-6e07a63d50c8
Dan Povey 2014-08-07 22:03:57 +00:00
Parents: 519109f42f f72d2a7d86
Commit: b02ad40bf1
41 changed files with 4791 additions and 0 deletions

32
egs/ami/README.txt Normal file

@ -0,0 +1,32 @@
About the AMI corpus:
WEB: http://groups.inf.ed.ac.uk/ami/corpus/
LICENCE: http://groups.inf.ed.ac.uk/ami/corpus/license.shtml
"The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers." See http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml for more details.
About the recipe:
s5)
The scripts under this directory build systems using AMI data only, including the training, development and evaluation sets (following the Full ASR split on http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml). This is different from the RT evaluation campaigns, which usually combined several meeting datasets from multiple sources. In general, the recipe reproduces the baseline systems built in [1], but without proprietary components*; that means we use CMUDict [2] and in the future will try to use open texts to estimate the background language model.
Currently, one can build systems for the close-talking scenario, which we refer to as
-- IHM (Individual Headset Microphones)
and two variants of distant speech:
-- SDM (Single Distant Microphone) using the 1st microphone of the array, and
-- MDM (Multiple Distant Microphones) where the mics are combined using the BeamformIt [3] toolkit.
To run all sub-recipes, the following (non-standard) software is expected to be installed:
1) SRILM - to build language models (look at KALDI_ROOT/tools/install_srilm.sh)
2) BeamformIt (for the MDM scenario, installed with the Kaldi tools)
3) Java (optional, but if available it will be used to extract transcripts from XML)
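A minimal sketch of installing the prerequisites, assuming a standard Kaldi checkout (the exact BeamformIt build step may differ between Kaldi versions; see KALDI_ROOT/tools/INSTALL):
  cd $KALDI_ROOT/tools
  ./install_srilm.sh   # builds SRILM for LM training (requires the SRILM source tarball)
  # BeamformIt ships with the Kaldi tools; see tools/INSTALL for the exact build step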
[1] "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition", Pawel Swietojanski, Arnab Ghoshal and Steve Renals, In Proc. ASRU, December 2013
[2] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[3] "Acoustic beamforming for speaker diarization of meetings", Xavier Anguera, Chuck Wooters and Javier Hernando, IEEE Transactions on Audio, Speech and Language Processing, September 2007, volume 15, number 7, pp.2011-2023.
*) there is still an optional dependency on the Fisher transcripts (LDC2004T19, LDC2005T19) to build the background language model and closely reproduce [1].

15
egs/ami/s5/RESULTS_ihm Normal file

@ -0,0 +1,15 @@
dev
exp/ihm/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev.ctm.filt.dtl:Percent Total Error = 38.0% (35925)
exp/ihm/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_14/dev.ctm.filt.dtl:Percent Total Error = 35.3% (33329)
exp/ihm/tri4a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev.ctm.filt.dtl:Percent Total Error = 32.1% (30364)
exp/ihm/tri4a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_12/dev.ctm.filt.dtl:Percent Total Error = 29.9% (28220)
eval
exp/ihm/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval.ctm.filt.dtl:Percent Total Error = 43.7% (39330)
exp/ihm/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_14/eval.ctm.filt.dtl:Percent Total Error = 40.4% (36385)
exp/ihm/tri4a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 35.0% (31463)
exp/ihm/tri4a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_12/eval_o4.ctm.filt.dtl:Percent Total Error = 31.7% (28518)

15
egs/ami/s5/RESULTS_mdm Normal file

@ -0,0 +1,15 @@
#Beamforming of 8 microphones, WER scores with up to 4 overlapping speakers
dev
exp/mdm8/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 58.8% (55568)
exp/mdm8/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 57.0% (53855)
exp/mdm8/tri3a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/dev_o4.ctm.filt.dtl:Percent Total Error = 54.9% (51926)
eval
exp/mdm8/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 64.4% (57916)
exp/mdm8/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 61.9% (55738)
exp/mdm8/tri3a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/eval_o4.ctm.filt.dtl:Percent Total Error = 59.3% (53370)

14
egs/ami/s5/RESULTS_sdm Normal file

@ -0,0 +1,14 @@
#the below are WER scores with up to 4 overlapping speakers
dev
exp/sdm1/tri2a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 66.9% (63190)
exp/sdm1/tri3a/decode_dev_ami_fsh.o3g.kn.pr1-7/ascore_13/dev_o4.ctm.filt.dtl:Percent Total Error = 64.5% (60963)
exp/sdm1/tri3a_mmi_b0.1/decode_dev_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/dev_o4.ctm.filt.dtl:Percent Total Error = 62.2% (58772)
eval
exp/sdm1/tri2a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_13/eval_o4.ctm.filt.dtl:Percent Total Error = 71.8% (64577)
exp/sdm1/tri3a/decode_eval_ami_fsh.o3g.kn.pr1-7/ascore_12/eval_o4.ctm.filt.dtl:Percent Total Error = 69.5% (62576)
exp/sdm1/tri3a_mmi_b0.1/decode_eval_3.mdl_ami_fsh.o3g.kn.pr1-7/ascore_10/eval_o4.ctm.filt.dtl:Percent Total Error = 67.2% (60447)

17
egs/ami/s5/cmd.sh Normal file

@ -0,0 +1,17 @@
# "queue.pl" uses qsub. The options to it are
# options to qsub. If you have GridEngine installed,
# change this to a queue you have access to.
# Otherwise, use "run.pl", which will run jobs locally
# (make sure your --num-jobs options are no more than
# the number of CPUs on your machine).
# On Eddie use:
#export train_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=08:00:00"
#export decode_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=05:00:00 -pe memory-2G 4"
#export highmem_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=05:00:00 -pe memory-2G 4"
#export scoring_cmd="queue.pl -P inf_hcrc_cstr_nst -l h_rt=00:20:00"
# To run locally, use:
export train_cmd=run.pl
export decode_cmd=run.pl
export highmem_cmd=run.pl

50
egs/ami/s5/conf/ami.cfg Normal file

@ -0,0 +1,50 @@
#BeamformIt sample configuration file for AMI data (http://groups.inf.ed.ac.uk/ami/download/)
# scrolling size to compute the delays
scroll_size = 250
# cross correlation computation window size
window_size = 500
#maximum number of points of the cross-correlation taken into account
nbest_amount = 4
#flag whether to apply an automatic noise thresholding
do_noise_threshold = 1
#Percentage of frames with lower xcorr taken as noisy
noise_percent = 10
######## acoustic modelling parameters
#transition probabilities weight for multichannel decoding
trans_weight_multi = 25
trans_weight_nbest = 25
###
#flag whether to print the features after setting them, or not
print_features = 1
#flag whether to use the bad frames in the sum process
do_avoid_bad_frames = 1
#flag to use the best channel (SNR) as a reference
#defined from command line
do_compute_reference = 1
#flag whether to use a uem file or not (otherwise process the whole file)
do_use_uem_file = 0
#flag whether to use an adaptive weights scheme or fixed weights
do_adapt_weights = 1
#flag whether to output the sph files or just run the system to create the auxiliary files
do_write_sph_files = 1
####directories where to store/retrieve info####
#channels_file = ./cfg-files/channels
#show needs to be passed as argument normally, here a default one is given just in case
#show_id = Ttmp


@ -0,0 +1,3 @@
beam=11.0 # beam for decoding. Was 13.0 in the scripts.
first_beam=8.0 # beam for 1st-pass decoding in SAT.


@ -0,0 +1,10 @@
--window-type=hamming # disable Dan's default window, use the standard Hamming one
--use-energy=false # only fbank outputs
--sample-frequency=16000 # AMI is sampled at 16kHz
#--low-freq=64 # typical setup from Frantisek Grezl
#--high-freq=3800
--dither=1
--num-mel-bins=40 # 40 mel bins for 16kHz audio
--htk-compat=true # try to make it compatible with HTK


@ -0,0 +1,2 @@
--use-energy=false # only non-default option.
--sample-frequency=16000


@ -0,0 +1,74 @@
#!/bin/bash
#Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
#Apache 2.0
wiener_filtering=false
nj=4
cmd=run.pl
# End configuration section
echo "$0 $@" # Print the command line for logging
[ -f ./path.sh ] && . ./path.sh; # source the path.
. parse_options.sh || exit 1;
if [ $# != 3 ]; then
echo "Wrong #arguments ($#, expected 4)"
echo "Usage: steps/ami_beamform.sh [options] <num-mics> <ami-dir> <wav-out-dir>"
echo "main options (for others, see top of script file)"
echo " --nj <nj> # number of parallel jobs"
echo " --cmd <cmd> # Command to run in parallel with"
echo " --wiener-filtering <true/false> # Cancel noise with Wiener filter prior to beamforming"
exit 1;
fi
numch=$1
sdir=$2
odir=$3
wdir=data/local/beamforming
mkdir -p $odir
mkdir -p $wdir/log
meetings=$wdir/meetings.list
cat local/split_train.orig local/split_dev.orig local/split_eval.orig | sort > $meetings
ch_inc=$((8/$numch)) # channel stride: use every (8/num-mics)-th mic of the 8-mic array
bmf=
for ch in `seq 1 $ch_inc 8`; do
bmf="$bmf $ch"
done
echo "Will use the following channels: $bmf"
#make the channel file
if [ -f $wdir/channels_$numch ]; then
rm $wdir/channels_$numch
fi
touch $wdir/channels_$numch
while read line;
do
channels="$line "
for ch in $bmf; do
channels="$channels $line/audio/$line.Array1-0$ch.wav"
done
echo $channels >> $wdir/channels_$numch
done < $meetings
#do noise cancellation
if [ $wiener_filtering == "true" ]; then
echo "Wiener filtering not yet implemented."
exit 1;
fi
#do beamforming
echo -e "Beamforming\n"
$cmd JOB=1:$nj $wdir/log/beamform.JOB.log \
local/beamformit.sh $nj JOB $numch $meetings $sdir $odir
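A typical invocation, following the usage message above (paths are illustrative; we assume the script sits under local/ like its beamformit.sh helper):
  local/ami_beamform.sh --nj 4 8 /path/to/amicorpus /path/to/ami-beamformed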


@ -0,0 +1,95 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski, Jonathan Kilgour)
if [ $# -ne 2 ]; then
echo "Usage: $0 <mic> <ami-dir>"
echo " where <mic> is either ihm, sdm or mdm and <ami-dir> is download space."
exit 1;
fi
mic=$1
adir=$2
amiurl=http://groups.inf.ed.ac.uk/ami
annotver=ami_public_manual_1.6.1
wdir=data/local/downloads
if [[ ! "$mic" =~ ^(ihm|sdm|mdm)$ ]]; then
echo "$0. Wrong <mic> option."
exit 1;
fi
mics="1 2 3 4 5 6 7 8"
if [ "$mic" == "sdm" ]; then
mics=1
fi
mkdir -p $adir
mkdir -p $wdir/log
#download annotations
annot="$adir/$annotver"
if [[ ! -d $adir/annotations || ! -f "$annot" ]]; then
echo "Downloading annotiations..."
wget -nv -O $annot.zip $amiurl/AMICorpusAnnotations/$annotver.zip &> $wdir/log/download_ami_annot.log
mkdir -p $adir/annotations
unzip -o -d $adir/annotations $annot.zip &> /dev/null
fi
[ ! -f "$adir/annotations/AMI-metadata.xml" ] && echo "$0: File AMI-Metadata.xml not found under $adir/annotations." && exit 1;
#download waves
cat local/split_train.orig local/split_eval.orig local/split_dev.orig > $wdir/ami_meet_ids.flist
wgetfile=$wdir/wget_$mic.sh
manifest="wget -O $adir/MANIFEST.TXT http://groups.inf.ed.ac.uk/ami/download/temp/amiBuild-04237-Sun-Jun-15-2014.manifest.txt"
license="wget -O $adir/LICENCE.TXT http://groups.inf.ed.ac.uk/ami/download/temp/Creative-Commons-Attribution-NonCommercial-ShareAlike-2.5.txt"
echo "#!/bin/bash" > $wgetfile
echo $manifest >> $wgetfile
echo $license >> $wgetfile
while read line; do
if [ "$mic" == "ihm" ]; then
extra_headset= #some meetings have 5 speakers (headsets)
for mtg in EN2001a EN2001d EN2001e; do
[ "$mtg" == "$line" ] && extra_headset=4;
done
for m in 0 1 2 3 $extra_headset; do
echo "wget -nv -c -P $adir/$line/audio $amiurl/AMICorpusMirror/amicorpus/$line/audio/$line.Headset-$m.wav" >> $wgetfile
done
else
for m in $mics; do
echo "wget -nv -c -P $adir/$line/audio $amiurl/AMICorpusMirror/amicorpus/$line/audio/$line.Array1-0$m.wav" >> $wgetfile
done
fi
done < $wdir/ami_meet_ids.flist
chmod +x $wgetfile
echo "Downloading audio files for $mic scenario."
echo "Look at $wdir/log/download_ami_$mic.log for progress"
$wgetfile &> $wdir/log/download_ami_$mic.log
#do a rough check whether the number of wavs is as expected; it will fail anyway in the data prep stage if it isn't
if [ "$mic" == "ihm" ]; then
num_files=`find $adir -iname '*Headset*' | wc -l`
if [ $num_files -ne 687 ]; then
echo "Warning: Found $num_files headset wavs but expected 687. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
fi
else
num_files=`find $adir -iname '*Array1*' | wc -l`
if [[ $num_files -lt 1352 && "$mic" == "mdm" ]]; then
echo "Warning: Found $num_files distant Array1 waves but expected 1352 for mdm. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
elif [[ $num_files -lt 169 && "$mic" == "sdm" ]]; then
echo "Warning: Found $num_files distant Array1 waves but expected 169 for sdm. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
fi
fi
echo "Downloads of AMI corpus completed succesfully. License can be found under $adir/LICENCE.TXT"
exit 0;
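Example usage (the local/ami_download.sh name is assumed from the recipe layout; the arguments follow the usage message above):
  local/ami_download.sh ihm /path/to/amicorpus   # fetches the annotations plus the Headset-* wavs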


@ -0,0 +1,64 @@
#!/bin/bash
#
if [ -f path.sh ]; then . path.sh; fi
if [ $# -ne 1 ]; then
echo "Usage: $0 <arpa-lm>"
exit 1;
fi
silprob=0.5
arpa_lm=$1
[ ! -f $arpa_lm ] && echo No such file $arpa_lm && exit 1;
cp -r data/lang data/lang_test
# grep -v '<s> <s>' etc. is only for future-proofing this script. Our
# LM doesn't have these "invalid combinations". These can cause
# determinization failures of CLG [ends up being epsilon cycles].
# Note: remove_oovs.pl takes a list of words in the LM that aren't in
# our word list. Since our LM doesn't have any, we just give it
# /dev/null [we leave it in the script to show how you'd do it].
gunzip -c "$arpa_lm" | \
grep -v '<s> <s>' | \
grep -v '</s> <s>' | \
grep -v '</s> </s>' | \
arpa2fst - | fstprint | \
utils/remove_oovs.pl /dev/null | \
utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=data/lang_test/words.txt \
--osymbols=data/lang_test/words.txt --keep_isymbols=false --keep_osymbols=false | \
fstrmepsilon > data/lang_test/G.fst
echo "Checking how stochastic G is (the first of these numbers should be small):"
fstisstochastic data/lang_test/G.fst
## Check lexicon.
## just have a look and make sure it seems sane.
echo "First few lines of lexicon FST:"
fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head
echo Performing further checks
# Checking that G.fst is determinizable.
fstdeterminize data/lang_test/G.fst /dev/null || echo Error determinizing G.
# Checking that L_disambig.fst is determinizable.
fstdeterminize data/lang_test/L_disambig.fst /dev/null || echo Error determinizing L.
# Checking that disambiguated lexicon times G is determinizable
# Note: we do this with fstdeterminizestar not fstdeterminize, as
# fstdeterminize was taking forever (presumably relates to a bug
# in this version of OpenFst that makes determinization slow for
# some cases).
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst | \
fstdeterminizestar >/dev/null || echo Error
# Checking that LG is stochastic:
fsttablecompose data/lang/L_disambig.fst data/lang_test/G.fst | \
fstisstochastic || echo LG is not stochastic
echo AMI_format_data succeeded.
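A hedged example invocation (the ami_format_data.sh name is inferred from the final echo; the LM file is whatever ami_train_lms.sh wrote to data/local/lm):
  local/ami_format_data.sh data/local/lm/ami.o3g.kn.gz   # builds data/lang_test/G.fst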


@ -0,0 +1,95 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus training data preparation
# Apache 2.0
# To be run from one directory above this script.
. path.sh
#check existing directories
if [ $# != 1 ]; then
echo "Usage: ami_ihm_data_prep.sh /path/to/AMI"
exit 1;
fi
AMI_DIR=$1
SEGS=data/local/annotations/train.txt
dir=data/local/ihm/train
mkdir -p $dir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: $AMI_DIR directory does not exists."
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# find headset wav audio files only
find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"
# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484
# we use uniq as some (rare) entries are doubled in transcripts
awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text
# (1b) Make segment files from transcript
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments
# (1c) Make wav.scp file.
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp
#Keep only train part of waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp
# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;
sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;
# Copy stuff into its final location
mkdir -p data/ihm/train
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f data/ihm/train/$f || exit 1;
done
utils/validate_data_dir.sh --no-feats data/ihm/train || exit 1;
echo AMI IHM data preparation succeeded.
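Example, per the usage message (path illustrative):
  local/ami_ihm_data_prep.sh /path/to/amicorpus   # writes data/ihm/train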


@ -0,0 +1,118 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation
. path.sh
#check existing directories
if [ $# != 2 ]; then
echo "Usage: ami_*_scoring_data_prep_edin.sh /path/to/AMI set-name"
exit 1;
fi
AMI_DIR=$1
SET=$2
SEGS=data/local/annotations/$SET.txt
dir=data/local/ihm/$SET
mkdir -p $dir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# find headset wav audio files only; here we again get all
# the files in the corpus and filter only the specific sessions
# while building segments
find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"
# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text
# (1c) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
#AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments
#prepare wav.scp
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp
#Keep only the $SET part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp
# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "segments: bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;
sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;
#check and correct the case when segment timings for a given speaker overlap themselves
#(important for simultaneous asclite scoring to proceed).
#There is actually only one such case for the dev set and automatic segmentations
join $dir/utt2spk $dir/segments | \
perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
if ($pu eq $_[1] && $pt > $_[3]) {
print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
}
$pu=$_[1]; $pt=$_[4];
}' > $dir/segments_to_fix
if [ `cat $dir/segments_to_fix | wc -l` -gt 0 ]; then
echo "$0. Applying following fixes to segments"
cat $dir/segments_to_fix
while read line; do
p1=`echo $line | awk -F'>' '{print $1}'`
p2=`echo $line | awk -F'>' '{print $2}'`
sed -ir "s!$p1!$p2!" $dir/segments
done < $dir/segments_to_fix
fi
# Copy stuff into its final locations
fdir=data/ihm/$SET
mkdir -p $fdir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $fdir/$f || exit 1;
done
#Produce STMs for sclite scoring
local/convert2stm.pl $dir > $fdir/stm
cp local/english.glm $fdir/glm
utils/validate_data_dir.sh --no-feats $fdir || exit 1;
echo AMI $SET set data preparation succeeded.
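Example, per the usage message (script name hypothetical; set-name is dev or eval):
  local/ami_ihm_scoring_data_prep.sh /path/to/amicorpus dev   # writes data/ihm/dev, including stm and glm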


@ -0,0 +1,102 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation
# To be run from one directory above this script.
. path.sh
#check existing directories
if [ $# != 2 ]; then
echo "Usage: ami_data_prep.sh </path/to/AMI-MDM> <mic>"
exit 1;
fi
AMI_DIR=$1
mic=$2
SEGS=data/local/annotations/train.txt
dir=data/local/$mic/train
odir=data/$mic/train
mkdir -p $dir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# find MDM mics
find $AMI_DIR -iname "*${mic}.wav" | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 169 ] && \
echo Warning: expected 169 data data files, found $n
# (1a) Transcriptions preparation
# here we start with rt09 transcriptions, hence not much to do
awk '{meeting=$1; channel="MDM"; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text
# (1c) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
#AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf/100 " " endf/100 " "
}' < $dir/text > $dir/segments
#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
# > $dir/wav.scp
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\_.*/; print "AMI_$1_MDM\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp
#Keep only training part of waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp | sort -o $dir/wav2.scp
#Two distant recordings are missing; reconcile segments with wav.scp
awk '{print $1}' $dir/wav2.scp | join -2 2 - $dir/segments | \
awk '{print $2" "$1" "$3" "$4" "$5}' > $dir/s; mv $dir/s $dir/segments
#...and text with segments
awk '{print $1}' $dir/segments | join - $dir/text > $dir/t; mv $dir/t $dir/text
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp
#prep reco2file_and_channel
cat $dir/wav.scp | \
perl -ane '$_ =~ m:^(\S+MDM).*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1 $2 A\n"; ' > $dir/reco2file_and_channel || exit 1;
# we assume we adapt to the session only
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1\n";' \
> $dir/utt2spk || exit 1;
sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;
# Copy stuff into its final locations
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $odir/$f || exit 1;
done
utils/validate_data_dir.sh --no-feats $odir
echo AMI MDM data preparation succeeded.
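Example, pointing at the beamformed output and using the mic tag from RESULTS_mdm (script name hypothetical):
  local/ami_mdm_data_prep.sh /path/to/ami-beamformed mdm8   # writes data/mdm8/train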


@ -0,0 +1,126 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation
. path.sh
#check existing directories
if [ $# != 3 ]; then
echo "Usage: ami_mdm_scoring_data_prep.sh /path/to/AMI-MDM mic-name set-name"
exit 1;
fi
AMI_DIR=$1
mic=$2
SET=$3
SEGS=data/local/annotations/$SET.txt
tmpdir=data/local/$mic/$SET
dir=data/$mic/$SET
mkdir -p $tmpdir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# find selected mdm wav audio files only
find $AMI_DIR -iname "*${mic}.wav" | sort > $tmpdir/wav.flist
n=`cat $tmpdir/wav.flist | wc -l`
if [ $n -ne 169 ]; then
echo "Warning. Expected to find 169 files but found $n."
fi
# (1a) Transcriptions preparation
awk '{meeting=$1; channel="MDM"; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $tmpdir/text
# (1c) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
#AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf/100 " " endf/100 " "
}' < $tmpdir/text > $tmpdir/segments
#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
# > $dir/wav.scp
sed -e 's?.*/??' -e 's?.wav??' $tmpdir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\_.*/; print "AMI_$1_MDM\n"' | \
paste - $tmpdir/wav.flist > $tmpdir/wav1.scp
#Keep only devset part of waves
awk '{print $2}' $tmpdir/segments | sort -u | join - $tmpdir/wav1.scp > $tmpdir/wav2.scp
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $tmpdir/wav2.scp > $tmpdir/wav.scp
#prep reco2file_and_channel
cat $tmpdir/wav.scp | \
perl -ane '$_ =~ m:^(\S+MDM)\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1 $2 A\n"; ' > $tmpdir/reco2file_and_channel || exit 1;
# we assume we adapt to the session only
awk '{print $1}' $tmpdir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1\n";' \
> $tmpdir/utt2spk || exit 1;
sort -k 2 $tmpdir/utt2spk | utils/utt2spk_to_spk2utt.pl > $tmpdir/spk2utt || exit 1;
# but we want to properly score the overlapped segments, hence we generate the extra
# utt2spk_stm file containing speaker ids used to generate the stms for the mdm/sdm case
awk '{print $1}' $tmpdir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1$2\n";' > $tmpdir/utt2spk_stm || exit 1;
#check and correct the case when segment timings for a given speaker overlap themselves
#(important for simultaneous asclite scoring to proceed).
#There is actually only one such case for the dev set and automatic segmentations
join $tmpdir/utt2spk_stm $tmpdir/segments | \
perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
if ($pu eq $_[1] && $pt > $_[3]) {
print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
}
$pu=$_[1]; $pt=$_[4];
}' > $tmpdir/segments_to_fix
if [ `cat $tmpdir/segments_to_fix | wc -l` -gt 0 ]; then
echo "$0. Applying following fixes to segments"
cat $tmpdir/segments_to_fix
while read line; do
p1=`echo $line | awk -F'>' '{print $1}'`
p2=`echo $line | awk -F'>' '{print $2}'`
sed -ir "s!$p1!$p2!" $tmpdir/segments
done < $tmpdir/segments_to_fix
fi
# Copy stuff into its final locations [this has been moved from the format_data
# script]
mkdir -p $dir
for f in spk2utt utt2spk utt2spk_stm wav.scp text segments reco2file_and_channel; do
cp $tmpdir/$f $dir/$f || exit 1;
done
cp local/english.glm $dir/glm
#note, although utt2spk contains mappings to the whole meetings for simultaneous scoring,
#we need to know which speakers overlap at the meeting level, hence we generate an extra utt2spk_stm file
local/convert2stm.pl $dir utt2spk_stm > $dir/stm
utils/validate_data_dir.sh --no-feats $dir
echo AMI $SET set data preparation succeeded.
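Example (set-name is dev or eval; script name per the usage message):
  local/ami_mdm_scoring_data_prep.sh /path/to/ami-beamformed mdm8 dev   # writes data/mdm8/dev with stm and glm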


@ -0,0 +1,69 @@
#!/bin/bash
#adapted from fisher dict preparation script, Author: Pawel Swietojanski
dir=data/local/dict
mkdir -p $dir
echo "Getting CMU dictionary"
svn co https://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict $dir/cmudict
# silence phones, one per line.
for w in sil laughter noise oov; do echo $w; done > $dir/silence_phones.txt
echo sil > $dir/optional_silence.txt
# For this setup we're discarding stress.
cat $dir/cmudict/cmudict.0.7a.symbols | sed s/[0-9]//g | \
perl -ane 's:\r::; print;' | sort | uniq > $dir/nonsilence_phones.txt
# An extra question will be added by including the silence phones in one class.
cat $dir/silence_phones.txt| awk '{printf("%s ", $1);} END{printf "\n";}' > $dir/extra_questions.txt || exit 1;
grep -v ';;;' $dir/cmudict/cmudict.0.7a | \
perl -ane 'if(!m:^;;;:){ s:(\S+)\(\d+\) :$1 :; s: : :; print; }' | \
sed s/[0-9]//g | sort | uniq > $dir/lexicon1_raw_nosil.txt || exit 1;
#cat eddie_data/rt09.ami.ihmtrain09.v3.dct | sort > $dir/lexicon1_raw_nosil.txt
# limit the vocabulary to the predefined 50k words
wget -nv -O $dir/wordlist.50k.gz http://www.openslr.org/resources/9/wordlist.50k.gz
gunzip -c $dir/wordlist.50k.gz > $dir/wordlist.50k
join $dir/lexicon1_raw_nosil.txt $dir/wordlist.50k > $dir/lexicon1_raw_nosil_50k.txt
# Add prons for laughter, noise, oov
for w in `grep -v sil $dir/silence_phones.txt`; do
echo "[$w] $w"
done | cat - $dir/lexicon1_raw_nosil_50k.txt > $dir/lexicon2_raw_50k.txt || exit 1;
# add some specific words; these have 100 or more missing occurrences each
( echo "MM M"; \
echo "HMM HH M"; \
echo "MM-HMM M HH M"; \
echo "COLOUR K AH L ER"; \
echo "COLOURS K AH L ER Z"; \
echo "REMOTES R IH M OW T Z"; \
echo "FAVOURITE F EY V ER IH T"; \
echo "<unk> oov" ) | cat - $dir/lexicon2_raw_50k.txt \
| sort -u > $dir/lexicon3_extra_50k.txt
cp $dir/lexicon3_extra_50k.txt $dir/lexicon.txt
[ ! -f $dir/lexicon.txt ] && exit 1;
# This is just for diagnostics:
cat data/ihm/train/text | \
awk '{for (n=2;n<=NF;n++){ count[$n]++; } } END { for(n in count) { print count[n], n; }}' | \
sort -nr > $dir/word_counts
awk '{print $1}' $dir/lexicon.txt | \
perl -e '($word_counts)=@ARGV;
open(W, "<$word_counts")||die "opening word-counts $word_counts";
while(<STDIN>) { chop; $seen{$_}=1; }
while(<W>) {
($c,$w) = split;
if (!defined $seen{$w}) { print; }
} ' $dir/word_counts > $dir/oov_counts.txt
echo "*Highest-count OOVs are:"
head -n 20 $dir/oov_counts.txt
utils/validate_dict_dir.pl $dir


@ -0,0 +1,100 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation
. path.sh
#check existing directories
if [ $# != 2 ]; then
echo "Usage: ami_sdm_data_prep.sh <path/to/AMI> <dist-mic-num>"
exit 1;
fi
AMI_DIR=$1
MICNUM=$2
DSET="sdm$MICNUM"
SEGS=data/local/annotations/train.txt
dir=data/local/$DSET/train
mkdir -p $dir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# as the SDM we take the first mic from the array
find $AMI_DIR -iname "*.Array1-0$MICNUM.wav" | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n files were found."
[ $n -ne 169 ] && \
echo "Warning: expected 169 data files, found $n"
# (1a) Transcriptions preparation
# here we start with already normalised transcripts, just make the ids
# Note, we set here SDM rather than, for example, SDM1 as we want to easily use
# the same alignments across different mics
awk '{meeting=$1; channel="SDM"; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text
# (1c) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
#AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf/100 " " endf/100 " "
}' < $dir/text > $dir/segments
#EN2001a.Array1-01.wav
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*/; print "AMI_$1_SDM\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp
#Keep only training part of waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp | sort -o $dir/wav2.scp
#Two distant recordings are missing, agree segments with wav.scp
awk '{print $1}' $dir/wav2.scp | join -2 2 - $dir/segments | \
awk '{print $2" "$1" "$3" "$4" "$5}' > $dir/s; mv $dir/s $dir/segments
#...and text with segments
awk '{print $1}' $dir/segments | join - $dir/text > $dir/t; mv $dir/t $dir/text
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp
# reco2file_and_channel maps recording-id to <file> <channel> (used for scoring)
cat $dir/wav.scp | \
perl -ane '$_ =~ m:^(\S+SDM)\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1 $2 A\n"; ' > $dir/reco2file_and_channel || exit 1;
# Assumption: for sdm we adapt to the session only
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1\n";' | sort > $dir/utt2spk || exit 1;
sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;
# Copy stuff into its final locations
mkdir -p data/$DSET/train
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f data/$DSET/train/$f || exit 1;
done
utils/validate_data_dir.sh --no-feats data/$DSET/train
echo AMI $DSET data preparation succeeded.
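Example, taking the first array microphone as in RESULTS_sdm (script name hypothetical):
  local/ami_sdm_data_prep.sh /path/to/amicorpus 1   # writes data/sdm1/train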


@ -0,0 +1,131 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
# AMI Corpus dev/eval data preparation
. path.sh
#check existing directories
if [ $# != 3 ]; then
echo "Usage: ami_sdm_scoring_data_prep.sh <path/to/AMI> <mic-id> <set-name>"
exit 1;
fi
AMI_DIR=$1
MICNUM=$2
SET=$3
DSET="sdm$MICNUM"
SEGS=data/local/annotations/$SET.txt
tmpdir=data/local/$DSET/$SET
dir=data/$DSET/$SET
mkdir -p $tmpdir
# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi
# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi
# find the wav files for the selected array mic only; here we again get all
# the files in the corpus and filter only the specific sessions
# while building segments
find $AMI_DIR -iname "*.Array1-0$MICNUM.wav" | sort > $tmpdir/wav.flist
n=`cat $tmpdir/wav.flist | wc -l`
echo "In total, $n files were found."
# (1a) Transcriptions preparation
# here we start with normalised transcripts
awk '{meeting=$1; channel="SDM"; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $tmpdir/text
# (1c) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
#AMI_ES2011a_H00_FEE041_0003415_0003484
awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf/100 " " endf/100 " "
}' < $tmpdir/text > $tmpdir/segments
#EN2001a.Array1-01.wav
#sed -e 's?.*/??' -e 's?.sph??' $dir/wav.flist | paste - $dir/wav.flist \
# > $dir/wav.scp
sed -e 's?.*/??' -e 's?.wav??' $tmpdir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*/; print "AMI_$1_SDM\n"' | \
paste - $tmpdir/wav.flist > $tmpdir/wav1.scp
#Keep only devset part of waves
awk '{print $2}' $tmpdir/segments | sort -u | join - $tmpdir/wav1.scp > $tmpdir/wav2.scp
#replace path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -s "$2" -t wavpcm - |"}' $tmpdir/wav2.scp > $tmpdir/wav.scp
#prep reco2file_and_channel
cat $tmpdir/wav.scp | \
perl -ane '$_ =~ m:^(\S+SDM).*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1 $2 A\n"; '\
> $tmpdir/reco2file_and_channel || exit 1;
# we assume we adapt to the session only
awk '{print $1}' $tmpdir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1\n";' \
> $tmpdir/utt2spk || exit 1;
sort -k 2 $tmpdir/utt2spk | utils/utt2spk_to_spk2utt.pl > $tmpdir/spk2utt || exit 1;
# but we want to properly score the overlapped segments, hence we generate the extra
# utt2spk_stm file containing speakers ids used to generate the stms for mdm/sdm case
awk '{print $1}' $tmpdir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1$2\n";' \
> $tmpdir/utt2spk_stm || exit 1;
#check and correct the case when segment timings for a given speaker overlap themselves
#(important for simultaneous asclite scoring to proceed).
#There is actually only one such case for the dev set and automatic segmentations
join $tmpdir/utt2spk_stm $tmpdir/segments | \
perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
if ($pu eq $_[1] && $pt > $_[3]) {
print "$_[0] $_[2] $_[3] $_[4]>$_[0] $_[2] $pt $_[4]\n"
}
$pu=$_[1]; $pt=$_[4];
}' > $tmpdir/segments_to_fix
if [ `cat $tmpdir/segments_to_fix | wc -l` -gt 0 ]; then
echo "$0. Applying following fixes to segments"
cat $tmpdir/segments_to_fix
while read line; do
p1=`echo $line | awk -F'>' '{print $1}'`
p2=`echo $line | awk -F'>' '{print $2}'`
sed -ir "s!$p1!$p2!" $tmpdir/segments
done < $tmpdir/segments_to_fix
fi
# Copy stuff into its final locations [this has been moved from the format_data
# script]
mkdir -p $dir
for f in spk2utt utt2spk utt2spk_stm wav.scp text segments reco2file_and_channel; do
cp $tmpdir/$f $dir/$f || exit 1;
done
local/convert2stm.pl $dir utt2spk_stm > $dir/stm
cp local/english.glm $dir/glm
utils/validate_data_dir.sh --no-feats $dir
echo AMI $DSET scenario and $SET set data preparation succeeded.


@ -0,0 +1,218 @@
#!/usr/bin/perl
# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)
# The script - based on punctuation times - splits segments longer than #words (input parameter)
# and produces a somewhat more normalised form of the transcripts, as follows:
# MeetID Channel Spkr stime etime transcripts
#use List::MoreUtils 'indexes';
use strict;
use warnings;
sub split_transcripts;
sub normalise_transcripts;
sub merge_hashes {
my ($h1, $h2) = @_;
my %hash1 = %$h1; my %hash2 = %$h2;
foreach my $key2 ( keys %hash2 ) {
if( exists $hash1{$key2} ) {
warn "Key [$key2] is in both hashes!";
next;
} else {
$hash1{$key2} = $hash2{$key2};
}
}
return %hash1;
}
sub print_hash {
my ($h) = @_;
my %hash = %$h;
foreach my $k (sort keys %hash) {
print "$k : $hash{$k}\n";
}
}
sub get_name {
#no warnings;
my $sname = sprintf("%07d_%07d", $_[0]*100, $_[1]*100) || die 'Input undefined!';
#use warnings;
return $sname;
}
sub split_on_comma {
my ($text, $comma_times, $btime, $etime, $max_words_per_seg)= @_;
my %comma_hash = %$comma_times;
print "Btime, Etime : $btime, $etime\n";
my $stime = ($etime+$btime)/2; #split time
my $skey = "";
my $otime = $btime;
foreach my $k (sort {$comma_hash{$a} cmp $comma_hash{$b} } keys %comma_hash) {
print "Key : $k : $comma_hash{$k}\n";
my $ktime = $comma_hash{$k};
if ($ktime==$btime) { next; }
if ($ktime==$etime) { last; }
if (abs($stime-$ktime)/2<abs($stime-$otime)/2) {
$otime = $ktime;
$skey = $k;
}
}
my %transcripts = ();
if (!($skey =~ /[\,][0-9]+/)) {
print "Cannot split into less than $max_words_per_seg words! Leaving : $text\n";
$transcripts{get_name($btime, $etime)}=$text;
return %transcripts;
}
print "Splitting $text on $skey at time $otime (stime is $stime)\n";
my @utts1 = split(/$skey\s+/, $text);
for (my $i=0; $i<=$#utts1; $i++) {
my $st = $btime;
my $et = $comma_hash{$skey};
if ($i>0) {
$st=$comma_hash{$skey};
$et = $etime;
}
my (@utts) = split (' ', $utts1[$i]);
if ($#utts < $max_words_per_seg) {
my $nm = get_name($st, $et);
print "SplittedOnComma[$i]: $nm : $utts1[$i]\n";
$transcripts{$nm} = $utts1[$i];
} else {
print 'Continue splitting!';
my %transcripts2 = split_on_comma($utts1[$i], \%comma_hash, $st, $et, $max_words_per_seg);
%transcripts = merge_hashes(\%transcripts, \%transcripts2);
}
}
return %transcripts;
}
sub split_transcripts {
@_ == 4 || die 'split_transcripts: transcript btime etime max_word_per_seg';
my ($text, $btime, $etime, $max_words_per_seg) = @_;
my (@transcript) = @$text;
my (@punct_indices) = grep { $transcript[$_] =~ /^[\.,\?\!\:]$/ } 0..$#transcript;
my (@time_indices) = grep { $transcript[$_] =~ /^[0-9]+\.[0-9]*/ } 0..$#transcript;
my (@puncts_times) = delete @transcript[@time_indices];
my (@puncts) = @transcript[@punct_indices];
if ($#puncts_times != $#puncts) {
print 'Ooops, different number of punctuation signs and timestamps! Skipping.';
return ();
}
#first split on full stops
my (@full_stop_indices) = grep { $puncts[$_] =~ /[\.\?]/ } 0..$#puncts;
my (@full_stop_times) = @puncts_times[@full_stop_indices];
unshift (@full_stop_times, $btime);
push (@full_stop_times, $etime);
my %comma_puncts = ();
for (my $i=0, my $j=0;$i<=$#punct_indices; $i++) {
my $lbl = "$transcript[$punct_indices[$i]]$j";
if ($lbl =~ /[\.\?].+/) { next; }
$transcript[$punct_indices[$i]] = $lbl;
$comma_puncts{$lbl} = $puncts_times[$i];
$j++;
}
#print_hash(\%comma_puncts);
print "InpTrans : @transcript\n";
print "Full stops: @full_stop_times\n";
my @utts1 = split (/[\.\?]/, uc join(' ', @transcript));
my %transcripts = ();
for (my $i=0; $i<=$#utts1; $i++) {
my (@utts) = split (' ', $utts1[$i]);
if ($#utts < $max_words_per_seg) {
print "ReadyTrans: $utts1[$i]\n";
$transcripts{get_name($full_stop_times[$i], $full_stop_times[$i+1])} = $utts1[$i];
} else {
print "TransToSplit: $utts1[$i]\n";
my %transcripts2 = split_on_comma($utts1[$i], \%comma_puncts, $full_stop_times[$i], $full_stop_times[$i+1], $max_words_per_seg);
print "Hash TR2:\n"; print_hash(\%transcripts2);
print "Hash TR:\n"; print_hash(\%transcripts);
%transcripts = merge_hashes(\%transcripts, \%transcripts2);
print "Hash TR_NEW : \n"; print_hash(\%transcripts);
}
}
return %transcripts;
}
sub normalise_transcripts {
my $text = $_[0];
#DO SOME ROUGH AND OBVIOUS PRELIMINARY NORMALISATION, AS FOLLOWS
#remove the remaining punctuation labels e.g. some text ,0 some text ,1
$text =~ s/[\.\,\?\!\:][0-9]+//g;
#there are some extra spurious punctuations without spaces, e.g. UM,I, replace with space
$text =~ s/[A-Z']+,[A-Z']+/ /g;
#split word combinations, i.e. ANTI-TRUST to ANTI TRUST (none of them appears in cmudict anyway)
#$text =~ s/(.*)([A-Z])\s+(\-)(.*)/$1$2$3$4/g;
$text =~ s/\-/ /g;
#substitute X_M_L with X. M. L. etc.
$text =~ s/\_/. /g;
#normalise and trim spaces
$text =~ s/^\s*//g;
$text =~ s/\s*$//g;
$text =~ s/\s+/ /g;
#some transcripts are empty with -, nullify (and ignore) them
$text =~ s/^\-$//g;
$text =~ s/\s+\-$//;
# apply a few exceptions for dashed phrases, Mm-Hmm, Uh-Huh, etc.; those are frequent in AMI
# and will be added to the dictionary
$text =~ s/MM HMM/MM\-HMM/g;
$text =~ s/UH HUH/UH\-HUH/g;
return $text;
}
if (@ARGV != 2) {
print STDERR "Usage: ami_split_segments.pl <meet-file> <out-file>\n";
exit(1);
}
my $meet_file = shift @ARGV;
my $out_file = shift @ARGV;
my %transcripts = ();
open(W, ">$out_file") || die "opening output file $out_file";
open(S, "<$meet_file") || die "opening meeting file $meet_file";
while(<S>) {
my @A = split(" ", $_);
if (@A < 9) { print "Skipping line @A"; next; }
my ($meet_id, $channel, $spk, $channel2, $trans_btime, $trans_etime, $aut_btime, $aut_etime) = @A[0..7];
my @transcript = @A[8..$#A];
my %transcript = split_transcripts(\@transcript, $aut_btime, $aut_etime, 30);
for my $key (keys %transcript) {
my $value = $transcript{$key};
my $segment = normalise_transcripts($value);
my @times = split(/\_/, $key);
if ($times[0] >= $times[1]) {
print "Warning, $meet_id, $spk, $times[0] > $times[1]. Skipping. \n"; next;
}
if (length($segment)>0) {
print W join " ", $meet_id, "H0${channel2}", $spk, $times[0]/100.0, $times[1]/100.0, $segment, "\n";
}
}
}
close(S);
close(W);
print STDERR "Finished."


@ -0,0 +1,32 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski), Apache 2.0
if [ $# -ne 1 ]; then
echo "Usage: $0 <ami-dir>"
exit 1;
fi
amidir=$1
wdir=data/local/annotations
#extract text from AMI XML annotations
local/ami_xml2text.sh $amidir
[ ! -f $wdir/transcripts1 ] && echo "$0: File $wdir/transcripts1 not found." && exit 1;
echo "Preprocessing transcripts..."
local/ami_split_segments.pl $wdir/transcripts1 $wdir/transcripts2 &> $wdir/log/split_segments.log
#make final train/dev/eval splits
for dset in train eval dev; do
[ ! -f local/split_$dset.final ] && cp local/split_$dset.orig local/split_$dset.final
grep -f local/split_$dset.final $wdir/transcripts2 > $wdir/$dset.txt
done
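Example (the ami_text_prep.sh name matches the "run ami_text_prep.sh" hints in the data prep scripts; the argument is the corpus download directory):
  local/ami_text_prep.sh /path/to/amicorpus   # produces data/local/annotations/{train,dev,eval}.txt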

176
egs/ami/s5/local/ami_train_lms.sh Executable file

@ -0,0 +1,176 @@
#!/bin/bash
# Copyright 2013 Arnab Ghoshal, Pawel Swietojanski
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# To be run from one directory above this script.
# Begin configuration section.
fisher=
order=3
swbd=
google=
web_sw=
web_fsh=
web_mtg=
# end configuration sections
help_message="Usage: "`basename $0`" [options] <train-txt> <dev-txt> <dict> <out-dir>
Train language models for AMI and optionally for Switchboard, Fisher and web-data from University of Washington.\n
options:
--help # print this message and exit
--fisher DIR # directory for Fisher transcripts
--order N # N-gram order (default: '$order')
--swbd DIR # Directory for Switchboard transcripts
--web-sw FILE # University of Washington (191M) Switchboard web data
--web-fsh FILE # University of Washington (525M) Fisher web data
--web-mtg FILE # University of Washington (150M) CMU+ICSI+NIST meeting data
";
. utils/parse_options.sh
if [ $# -ne 4 ]; then
printf "$help_message\n";
exit 1;
fi
train=$1 # data/ihm/train/text
dev=$2 # data/ihm/dev/text
lexicon=$3 # data/ihm/dict/lexicon.txt
dir=$4 # data/local/lm
for f in "$text" "$lexicon"; do
[ ! -f $x ] && echo "$0: No such file $f" && exit 1;
done
set -o errexit
mkdir -p $dir
export LC_ALL=C
cut -d' ' -f2- $train | gzip -c > $dir/train.gz
cut -d' ' -f2- $dev | gzip -c > $dir/dev.gz
awk '{print $1}' $lexicon | sort -u > $dir/wordlist.lex
gunzip -c $dir/train.gz | tr ' ' '\n' | grep -v ^$ | sort -u > $dir/wordlist.train
sort -u $dir/wordlist.lex $dir/wordlist.train > $dir/wordlist
ngram-count -text $dir/train.gz -order $order -limit-vocab -vocab $dir/wordlist \
-unk -map-unk "<unk>" -kndiscount -interpolate -lm $dir/ami.o${order}g.kn.gz
echo "PPL for AMI LM:"
ngram -unk -lm $dir/ami.o${order}g.kn.gz -ppl $dir/dev.gz
ngram -unk -lm $dir/ami.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 >& $dir/ppl2
mix_ppl="$dir/ppl2"
mix_tag="ami"
mix_lms=( "$dir/ami.o${order}g.kn.gz" )
num_lms=1
if [ ! -z "$swbd" ]; then
mkdir -p $dir/swbd
find $swbd -iname '*-trans.text' -exec cat {} \; | cut -d' ' -f4- \
| gzip -c > $dir/swbd/text0.gz
gunzip -c $dir/swbd/text0.gz | swbd_map_words.pl | gzip -c \
> $dir/swbd/text1.gz
ngram-count -text $dir/swbd/text1.gz -order $order -limit-vocab \
-vocab $dir/wordlist -unk -map-unk "<unk>" -kndiscount -interpolate \
-lm $dir/swbd/swbd.o${order}g.kn.gz
echo "PPL for SWBD LM:"
ngram -unk -lm $dir/swbd/swbd.o${order}g.kn.gz -ppl $dir/dev.gz
ngram -unk -lm $dir/swbd/swbd.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 \
>& $dir/swbd/ppl2
mix_ppl="$mix_ppl $dir/swbd/ppl2"
mix_tag="${mix_tag}_swbd"
mix_lms=("${mix_lms[@]}" "$dir/swbd/swbd.o${order}g.kn.gz")
num_lms=$[ num_lms + 1 ]
fi
if [ ! -z "$fisher" ]; then
[ ! -d "$fisher/part1/data/trans" ] \
&& echo "Cannot find transcripts in Fisher directory: '$fisher'" \
&& exit 1;
mkdir -p $dir/fisher
find $fisher -path '*/trans/*fe*.txt' -exec cat {} \; | grep -v ^# | grep -v ^$ \
| cut -d' ' -f4- | gzip -c > $dir/fisher/text0.gz
gunzip -c $dir/fisher/text0.gz | fisher_map_words.pl \
| gzip -c > $dir/fisher/text1.gz
ngram-count -debug 0 -text $dir/fisher/text1.gz -order $order -limit-vocab \
-vocab $dir/wordlist -unk -map-unk "<unk>" -kndiscount -interpolate \
-lm $dir/fisher/fisher.o${order}g.kn.gz
echo "PPL for Fisher LM:"
ngram -unk -lm $dir/fisher/fisher.o${order}g.kn.gz -ppl $dir/dev.gz
ngram -unk -lm $dir/fisher/fisher.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 \
>& $dir/fisher/ppl2
mix_ppl="$mix_ppl $dir/fisher/ppl2"
mix_tag="${mix_tag}_fsh"
mix_lms=("${mix_lms[@]}" "$dir/fisher/fisher.o${order}g.kn.gz")
num_lms=$[ num_lms + 1 ]
fi
if [ ! -z "$google1B" ]; then
mkdir -p $dir/google
wget -O $dir/google/cantab.lm3.bz2 http://vm.cantabresearch.com:6080/demo/cantab.lm3.bz2
wget -O $dir/google/150000.lex http://vm.cantabresearch.com:6080/demo/150000.lex
ngram -unk -limit-vocab -vocab $dir/wordlist -lm $dir/google/cantab.lm3.bz2 \
-write-lm $dir/google/google.o${order}g.kn.gz
echo "PPL for Google LM:"
ngram -unk -lm $dir/google/google.o${order}g.kn.gz -ppl $dir/dev.gz -debug 2 \
>& $dir/google/ppl2
mix_ppl="$mix_ppl $dir/google/ppl2"
mix_tag="${mix_tag}_goog"
mix_lms=("${mix_lms[@]}" "$dir/google/google.o${order}g.kn.gz")
num_lms=$[ num_lms + 1 ]
fi
## The University of Washington conversational web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/191M_conversational_web-filt+periods.gz
if [ ! -z "$web_sw" ]; then
echo "Interpolating web-LM not implemented yet"
fi
## The University of Washington Fisher conversational web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/525M_fisher_conv_web-filt+periods.gz
if [ ! -z "$web_fsh" ]; then
echo "Interpolating web-LM not implemented yet"
fi
## The University of Washington meeting web data can be obtained as:
## wget --no-check-certificate http://ssli.ee.washington.edu/data/150M_cmu+icsi+nist-meetings.gz
if [ ! -z "$web_mtg" ]; then
echo "Interpolating web-LM not implemented yet"
fi
if [ $num_lms -gt 1 ]; then
echo "Computing interpolation weights from: $mix_ppl"
compute-best-mix $mix_ppl >& $dir/mix.log
grep 'best lambda' $dir/mix.log \
| perl -e '$_=<>; s/.*\(//; s/\).*//; @A = split; for $i (@A) {print "$i\n";}' \
> $dir/mix.weights
weights=( `cat $dir/mix.weights` )
cmd="ngram -lm ${mix_lms[0]} -lambda 0.715759 -mix-lm ${mix_lms[1]}"
for i in `seq 2 $((num_lms-1))`; do
cmd="$cmd -mix-lm${i} ${mix_lms[$i]} -mix-lambda${i} ${weights[$i]}"
done
cmd="$cmd -unk -write-lm $dir/${mix_tag}.o${order}g.kn.gz"
echo "Interpolating LMs with command: \"$cmd\""
$cmd
echo "PPL for the interolated LM:"
ngram -unk -lm $dir/${mix_tag}.o${order}g.kn.gz -ppl $dir/dev.gz
fi
#save the lm name for further use
echo "${mix_tag}.o${order}g.kn" > $dir/final_lm


@ -0,0 +1,47 @@
#!/bin/bash
# Copyright, University of Edinburgh (Pawel Swietojanski and Jonathan Kilgour)
if [ $# -ne 1 ]; then
echo "Usage: $0 <ami-dir>"
exit 1;
fi
adir=$1
wdir=data/local/annotations
[ ! -f $adir/annotations/AMI-metadata.xml ] && echo "$0: File $adir/annotations/AMI-metadata.xml not found." && exit 1;
mkdir -p $wdir/log
JAVA_VER=$(java -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')
if [ "$JAVA_VER" -ge 15 ]; then
if [ ! -d $wdir/nxt ]; then
echo "Downloading NXT annotation tool..."
wget -O $wdir/nxt.zip http://sourceforge.net/projects/nite/files/nite/nxt_1.4.4/nxt_1.4.4.zip &> /dev/null
unzip -d $wdir/nxt $wdir/nxt.zip &> /dev/null
fi
if [ ! -f $wdir/transcripts0 ]; then
echo "Parsing XML files (can take several minutes)..."
nxtlib=$wdir/nxt/lib
java -cp $nxtlib/nxt.jar:$nxtlib/xmlParserAPIs.jar:$nxtlib/xalan.jar:$nxtlib \
FunctionQuery -c $adir/annotations/AMI-metadata.xml -q '($s segment)(exists $w1 w):$s^$w1' -atts obs who \
'@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent,global_name, 0)'\
'@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent, channel, 0)' \
transcriber_start transcriber_end starttime endtime '$s' '@extract(($w w):$s^$w & $w@punc="true", starttime,0,0)' \
1> $wdir/transcripts0 2> $wdir/log/nxt_export.log
fi
else
echo "$0. Java not found. Will download exported version of transcripts."
annots=ami_manual_annotations_v1.6.1_export
wget -O $wdir/$annots.gzip http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/$annots.gzip
gunzip -c $wdir/${annots}.gzip > $wdir/transcripts0
fi
#remove NXT logs dumped to stdio
grep -e '^Found' -e '^Obs' -i -v $wdir/transcripts0 > $wdir/transcripts1
exit 0;

33
egs/ami/s5/local/beamformit.sh Executable file

@ -0,0 +1,33 @@
#!/bin/bash
# Copyright 2014, University of Edinburgh (Author: Pawel Swietojanski)
. ./path.sh
nj=$1
job=$2
numch=$3
meetings=$4
sdir=$5
odir=$6
wdir=data/local/beamforming
utils/split_scp.pl -j $nj $((job-1)) $meetings $meetings.$job
while read line; do
mkdir -p $odir/$line
BeamformIt -s $line -c $wdir/channels_$numch \
--config_file `pwd`/conf/ami.cfg \
--source_dir $sdir \
--result_dir $odir/$line
mv $odir/$line/${line}.del $odir/$line/${line}_MDM$numch.del
mv $odir/$line/${line}.del2 $odir/$line/${line}_MDM$numch.del2
mv $odir/$line/${line}.info $odir/$line/${line}_MDM$numch.info
mv $odir/$line/${line}.ovl $odir/$line/${line}_MDM$numch.ovl
mv $odir/$line/${line}.weat $odir/$line/${line}_MDM$numch.weat
mv $odir/$line/${line}.wav $odir/$line/${line}_MDM$numch.wav
done < $meetings.$job

101
egs/ami/s5/local/convert2stm.pl Executable file

@ -0,0 +1,101 @@
#!/usr/bin/perl
# Copyright 2012 Johns Hopkins University (Author: Daniel Povey). Apache 2.0.
# 2013 University of Edinburgh (Author: Pawel Swietojanski)
# This takes as input the path to a directory containing all the usual
# data files - segments, text, utt2spk and reco2file_and_channel - and creates an stm file
if (@ARGV < 1 || @ARGV > 2) {
print STDERR "Usage: convert2stm.pl <data-dir> [<utt2spk_stm>] > stm-file\n";
exit(1);
}
$dir=shift @ARGV;
$utt2spk_file=shift @ARGV || 'utt2spk';
$segments = "$dir/segments";
$reco2file_and_channel = "$dir/reco2file_and_channel";
$text = "$dir/text";
$utt2spk_file = "$dir/$utt2spk_file";
open(S, "<$segments") || die "opening segments file $segments";
while(<S>) {
@A = split(" ", $_);
@A > 3 || die "convert2stm: Bad line in segments file: $_";
($utt, $recording_id, $begin_time, $end_time) = @A[0..3];
$utt2reco{$utt} = $recording_id;
$begin{$utt} = $begin_time;
$end{$utt} = $end_time;
}
close(S);
open(R, "<$reco2file_and_channel") || die "open reco2file_and_channel file $reco2file_and_channel";
while(<R>) {
@A = split(" ", $_);
@A == 3 || die "convert2stm: Bad line in reco2file_and_channel file: $_";
($recording_id, $file, $channel) = @A;
$reco2file{$recording_id} = $file;
$reco2channel{$recording_id} = $channel;
}
close(R);
open(T, "<$text") || die "open text file $text";
while(<T>) {
@A = split(" ", $_);
$utt = shift @A;
$utt2text{$utt} = "@A";
}
close(T);
open(U, "<$utt2spk_file") || die "open utt2spk file $utt2spk_file";
while(<U>) {
@A = split(" ", $_);
@A == 2 || die "convert2stm: Bad line in utt2spk file: $_";
($utt, $spk) = @A;
$utt2spk{$utt} = $spk;
}
close(U);
# Now generate the stm file
foreach $utt (sort keys(%utt2reco)) {
# lines look like:
# <File> <Channel> <Speaker> <BeginTime> <EndTime> [ <LABEL> ] transcript
$recording_id = $utt2reco{$utt};
if (!defined $recording_id) { die "Utterance-id $utt not defined in segments file $segments"; }
$file = $reco2file{$recording_id};
$channel = $reco2channel{$recording_id};
if (!defined $file || !defined $channel) {
die "convert2stm: Recording-id $recording_id not defined in reco2file_and_channel file $reco2file_and_channel";
}
$speaker = $utt2spk{$utt};
$transcripts = $utt2text{$utt};
if (!defined $speaker) { die "convert2stm: Speaker-id for utterance $utt not defined in utt2spk file $utt2spk_file"; }
if (!defined $transcripts) { die "convert2stm: Transcript for $utt not defined in text file $text"; }
$b = $begin{$utt};
$e = $end{$utt};
$line = "$file $channel $speaker $b $e $transcripts \n";
print $line; # goes to stdout.
}
__END__
# Test example
# ES2011a.Headset-0 A AMI_ES2011a_H00_FEE041 34.27 37.14 HERE WE GO
mkdir tmpdir
echo utt reco 10.0 20.0 > tmpdir/segments
echo utt word > tmpdir/text
echo reco file A > tmpdir/reco2file_and_channel
echo utt spk > tmpdir/utt2spk
echo file A spk 10.0 20.0 word > stm_tst
local/convert2stm.pl tmpdir | cmp - stm_tst || echo error
rm -r tmpdir stm_tst

2023
egs/ami/s5/local/english.glm Normal file

Diff for this file is not shown because of its large size.


@ -0,0 +1,83 @@
#!/usr/bin/perl -w
# Copyright 2013 Arnab Ghoshal
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED ON AN *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# This script cleans up the Fisher English transcripts and maps the words to
# be similar to the Switchboard Mississippi State transcripts
# Reads from STDIN and writes to STDOUT
use strict;
while (<>) {
chomp;
$_ = lc($_); # a few things aren't lowercased in the data, e.g. I'm
s/\*//g; # *mandatory -> mandatory
s/\(//g; s/\)//g; # Remove parentheses
next if /^\s*$/; # Skip empty lines
# In one conversation people speak some German phrases that are tagged as
# <german (( ja wohl )) > -- we remove these
s/<[^>]*>//g;
s/\.\_/ /g; # Abbreviations: a._b._c. -> a b c.
s/(\w)\.s( |$)/$1's /g; # a.s -> a's
s/\./ /g; # Remove remaining .
s/(\w)\,(\w| )/$1 $2/g; # remove commas (they don't appear within numbers in this data)
s/( |^)\'(blade|cause|course|frisco|okay|plain|specially)( |$)/ $2 /g;
s/\'em/-em/g;
# Remove an opening ' if there is a matching closing ' since some word
# fragments are annotated as: 'kay, etc.
# The substitution is done twice, since matching once doesn't capture
# consecutive quoted segments (the space in between is used up).
s/(^| )\'(.*?)\'( |$)/ $2 /g;
s/(^| )\'(.*?)\'( |$)/ $2 /g;
s/( |^)\'(\w)( |-|$)/$1 /g; # 'a- -> a
s/( |^)-( |$)/ /g; # Remove dangling -
s/\?//g; # Remove ?
s/( |^)non-(\w+)( |$)/ non $2 /g; # non-stop -> non stop
# Some words that are annotated as fragments are actual dictionary words
s/( |-)(acceptable|arthritis|ball|cause|comes|course|eight|eighty|field|giving|habitating|heard|hood|how|king|ninety|okay|paper|press|scripts|store|till|vascular|wood|what|york)(-| )/ $2 /g;
# Remove [[skip]] and [pause]
s/\[\[skip\]\]/ /g;
s/\[pause\]/ /g;
# [breath], [cough], [lipsmack], [sigh], [sneeze] -> [noise]
s/\[breath\]/[noise]/g;
s/\[cough\]/[noise]/g;
s/\[lipsmack\]/[noise]/g;
s/\[sigh\]/[noise]/g;
s/\[sneeze\]/[noise]/g;
s/\[mn\]/[vocalized-noise]/g; # [mn] -> [vocalized-noise]
s/\[laugh\]/[laughter]/g; # [laugh] -> [laughter]
$_ = uc($_);
# Now, mapping individual words
my @words = split /\s+/;
for my $i (0..$#words) {
my $w = $words[$i];
$w =~ s/^'/-/;
$words[$i] = $w;
}
print join(" ", @words) . "\n";
}
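# Illustrative example of the overall mapping (assumed input, not taken from the corpus):
#   in:  i'm (( okay )) non-stop [laugh] a._b._c.
#   out: I'M OKAY NON STOP [LAUGHTER] A B C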

42
egs/ami/s5/local/score.sh Executable file

@ -0,0 +1,42 @@
#!/bin/bash
# Copyright Johns Hopkins University (Author: Daniel Povey) 2012
# Copyright University of Edinburgh (Author: Pawel Swietojanski) 2014
# Apache 2.0
orig_args=
for x in "$@"; do orig_args="$orig_args '$x'"; done
# begin configuration section. we include all the options that score_sclite.sh or
# score_basic.sh might need, or parse_options.sh will die.
cmd=run.pl
stage=0
min_lmwt=9
max_lmwt=20
reverse=false
asclite=true
#end configuration section.
[ -f ./path.sh ] && . ./path.sh
. parse_options.sh || exit 1;
if [ $# -ne 3 ]; then
echo "Usage: local/score.sh [options] <data-dir> <lang-dir|graph-dir> <decode-dir>" && exit;
echo " Options:"
echo " --cmd (run.pl|queue.pl...) # specify how to run the sub-processes."
echo " --stage (0|1|2) # start scoring script from part-way through."
echo " --min_lmwt <int> # minumum LM-weight for lattice rescoring "
echo " --max_lmwt <int> # maximum LM-weight for lattice rescoring "
echo " --reverse (true/false) # score with time reversed features "
echo " --asclite (true/false) # score with ascltie instead of sclite (overlapped speech)"
exit 1;
fi
data=$1
if [ -f $data/stm ]; then # use asclite/sclite scoring.
eval local/score_asclite.sh --asclite $asclite $orig_args
else
echo "$data/stm does not exist: using local/score_basic.sh"
eval local/score_basic.sh $orig_args
fi
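# Example usage (a sketch; the data, graph and decode paths are assumed):
#   local/score.sh --min-lmwt 9 --max-lmwt 20 \
#     data/ihm/dev exp/ihm/tri4a/graph_${LM} exp/ihm/tri4a/decode_dev_${LM}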


@ -0,0 +1,97 @@
#!/bin/bash
# Copyright Johns Hopkins University (Author: Daniel Povey) 2012. Apache 2.0.
# 2014, University of Edinburgh, (Author: Pawel Swietojanski)
# begin configuration section.
cmd=run.pl
stage=0
min_lmwt=9
max_lmwt=20
reverse=false
asclite=true
overlap_spk=4
#end configuration section.
[ -f ./path.sh ] && . ./path.sh
. parse_options.sh || exit 1;
if [ $# -ne 3 ]; then
echo "Usage: local/score_asclite.sh [--cmd (run.pl|queue.pl...)] <data-dir> <lang-dir|graph-dir> <decode-dir>"
echo " Options:"
echo " --cmd (run.pl|queue.pl...) # specify how to run the sub-processes."
echo " --stage (0|1|2) # start scoring script from part-way through."
echo " --min_lmwt <int> # minumum LM-weight for lattice rescoring "
echo " --max_lmwt <int> # maximum LM-weight for lattice rescoring "
echo " --reverse (true/false) # score with time reversed features "
exit 1;
fi
data=$1
lang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.
dir=$3
model=$dir/../final.mdl # assume model one level up from decoding dir.
hubscr=$KALDI_ROOT/tools/sctk-2.4.0/bin/hubscr.pl
[ ! -f $hubscr ] && echo "Cannot find scoring program at $hubscr" && exit 1;
hubdir=`dirname $hubscr`
for f in $data/stm $data/glm $lang/words.txt $lang/phones/word_boundary.int \
$model $data/segments $data/reco2file_and_channel $dir/lat.1.gz; do
[ ! -f $f ] && echo "$0: expecting file $f to exist" && exit 1;
done
name=`basename $data`; # e.g. dev or eval
mkdir -p $dir/ascoring/log
if [ $stage -le 0 ]; then
if $reverse; then
$cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/get_ctm.LMWT.log \
mkdir -p $dir/ascore_LMWT/ '&&' \
lattice-1best --lm-scale=LMWT "ark:gunzip -c $dir/lat.*.gz|" ark:- \| \
lattice-reverse ark:- ark:- \| \
lattice-align-words --reorder=false $lang/phones/word_boundary.int $model ark:- ark:- \| \
nbest-to-ctm ark:- - \| \
utils/int2sym.pl -f 5 $lang/words.txt \| \
utils/convert_ctm.pl $data/segments $data/reco2file_and_channel \
'>' $dir/ascore_LMWT/$name.ctm || exit 1;
else
$cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/get_ctm.LMWT.log \
mkdir -p $dir/ascore_LMWT/ '&&' \
lattice-1best --lm-scale=LMWT "ark:gunzip -c $dir/lat.*.gz|" ark:- \| \
lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \| \
nbest-to-ctm ark:- - \| \
utils/int2sym.pl -f 5 $lang/words.txt \| \
utils/convert_ctm.pl $data/segments $data/reco2file_and_channel \
'>' $dir/ascore_LMWT/$name.ctm || exit 1;
fi
fi
if [ $stage -le 1 ]; then
# Remove some stuff we don't want to score, from the ctm.
for x in $dir/ascore_*/$name.ctm; do
cp $x $dir/tmpf;
cat $dir/tmpf | grep -i -v -E '\[noise|laughter|vocalized-noise\]' | \
grep -i -v -E '<unk>' > $x;
# grep -i -v -E '<UNK>|%HESITATION' > $x;
done
fi
if [ $stage -le 2 ]; then
if [ "$asclite" == "true" ]; then
oname=$name
[ ! -z $overlap_spk ] && oname=${name}_o$overlap_spk
$cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/score.LMWT.log \
cp $data/stm $dir/ascore_LMWT/ '&&' \
cp $dir/ascore_LMWT/${name}.ctm $dir/ascore_LMWT/${oname}.ctm '&&' \
$hubscr -G -v -m 1:2 -o$overlap_spk -a -C -B 8192 -p $hubdir -V -l english \
-h rt-stt -g $data/glm -r $dir/ascore_LMWT/stm $dir/ascore_LMWT/${oname}.ctm || exit 1;
else
$cmd LMWT=$min_lmwt:$max_lmwt $dir/ascoring/log/score.LMWT.log \
cp $data/stm $dir/ascore_LMWT/ '&&' \
$hubscr -p $hubdir -V -l english -h hub5 -g $data/glm -r $dir/ascore_LMWT/stm $dir/ascore_LMWT/${name}.ctm || exit 1
fi
fi
exit 0
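# Example: after scoring has finished, WERs can be grepped from the sclite/asclite
# output under $dir/ascore_<LMWT>/ (exact file names depend on the sctk version), e.g.
#   grep 'Percent Total Error' exp/ihm/tri4a/decode_dev_*/ascore_*/*.dtl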


@ -0,0 +1,5 @@
The splits in this directory follow the official AMI Corpus Full-ASR split
into train, dev and eval sets.
If for some reason one needs to use a different split, the way to do so is
to create split_*.final versions in this directory and run the recipe, as
sketched below.
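A minimal sketch, assuming the official lists in this directory are named
split_train.orig, split_dev.orig and split_eval.orig:

  for s in train dev eval; do cp split_$s.orig split_$s.final; done
  # then edit the split_*.final files to move meetings between the sets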


@ -0,0 +1,18 @@
ES2011a
ES2011b
ES2011c
ES2011d
IB4001
IB4002
IB4003
IB4004
IB4010
IB4011
IS1008a
IS1008b
IS1008c
IS1008d
TS3004a
TS3004b
TS3004c
TS3004d


@ -0,0 +1,16 @@
EN2002a
EN2002b
EN2002c
EN2002d
ES2004a
ES2004b
ES2004c
ES2004d
IS1009a
IS1009b
IS1009c
IS1009d
TS3003a
TS3003b
TS3003c
TS3003d


@ -0,0 +1,137 @@
EN2001a
EN2001b
EN2001d
EN2001e
EN2003a
EN2004a
EN2005a
EN2006a
EN2006b
EN2009b
EN2009c
EN2009d
ES2002a
ES2002b
ES2002c
ES2002d
ES2003a
ES2003b
ES2003c
ES2003d
ES2005a
ES2005b
ES2005c
ES2005d
ES2006a
ES2006b
ES2006c
ES2006d
ES2007a
ES2007b
ES2007c
ES2007d
ES2008a
ES2008b
ES2008c
ES2008d
ES2009a
ES2009b
ES2009c
ES2009d
ES2010a
ES2010b
ES2010c
ES2010d
ES2012a
ES2012b
ES2012c
ES2012d
ES2013a
ES2013b
ES2013c
ES2013d
ES2014a
ES2014b
ES2014c
ES2014d
ES2015a
ES2015b
ES2015c
ES2015d
ES2016a
ES2016b
ES2016c
ES2016d
IB4005
IN1001
IN1002
IN1005
IN1007
IN1008
IN1009
IN1012
IN1013
IN1014
IN1016
IS1000a
IS1000b
IS1000c
IS1000d
IS1001a
IS1001b
IS1001c
IS1001d
IS1002b
IS1002c
IS1002d
IS1003a
IS1003b
IS1003c
IS1003d
IS1004a
IS1004b
IS1004c
IS1004d
IS1005a
IS1005b
IS1005c
IS1006a
IS1006b
IS1006c
IS1006d
IS1007a
IS1007b
IS1007c
IS1007d
TS3005a
TS3005b
TS3005c
TS3005d
TS3006a
TS3006b
TS3006c
TS3006d
TS3007a
TS3007b
TS3007c
TS3007d
TS3008a
TS3008b
TS3008c
TS3008d
TS3009a
TS3009b
TS3009c
TS3009d
TS3010a
TS3010b
TS3010c
TS3010d
TS3011a
TS3011b
TS3011c
TS3011d
TS3012a
TS3012b
TS3012c
TS3012d

36
egs/ami/s5/path.sh Normal file

@ -0,0 +1,36 @@
export LC_ALL=C # For expected sorting and joining behaviour
KALDI_ROOT=/gpfs/scratch/s1136550/kaldi-code # change this to point to your Kaldi checkout
#KALDI_ROOT=/disk/data1/software/kaldi-trunk-atlas
#KALDI_ROOT=/disk/data1/pbell1/software/kaldi-trunk-mkl/
KALDISRC=$KALDI_ROOT/src
KALDIBIN=$KALDISRC/bin:$KALDISRC/featbin:$KALDISRC/fgmmbin:$KALDISRC/fstbin
KALDIBIN=$KALDIBIN:$KALDISRC/gmmbin:$KALDISRC/latbin:$KALDISRC/nnetbin
KALDIBIN=$KALDIBIN:$KALDISRC/sgmmbin:$KALDISRC/tiedbin
FSTBIN=$KALDI_ROOT/tools/openfst/bin
LMBIN=$KALDI_ROOT/tools/irstlm/bin
SRILM=$KALDI_ROOT/tools/srilm/bin/i686-m64
BEAMFORMIT=$KALDI_ROOT/tools/BeamformIt-3.51
#BEAMFORMIT=/disk/data1/s1136550/BeamformIt-3.51
[ -d $PWD/local ] || { echo "Error: 'local' subdirectory not found."; }
[ -d $PWD/utils ] || { echo "Error: 'utils' subdirectory not found."; }
[ -d $PWD/steps ] || { echo "Error: 'steps' subdirectory not found."; }
export kaldi_local=$PWD/local
export kaldi_utils=$PWD/utils
export kaldi_steps=$PWD/steps
SCRIPTS=$kaldi_local:$kaldi_utils:$kaldi_steps
export PATH=$PATH:$KALDIBIN:$FSTBIN:$LMBIN:$SCRIPTS:$BEAMFORMIT:$SRILM
#CUDA_VER='cuda-5.0.35'
#export PATH=$PATH:/opt/$CUDA_VER/bin
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/$CUDA_VER/lib64:/opt/$CUDA_VER/lib

204
egs/ami/s5/run_ihm.sh Executable file

@ -0,0 +1,204 @@
#!/bin/bash -u
. ./cmd.sh
. ./path.sh
#INITIAL COMMENTS
#To run the whole recipe you are going to need
# 1) SRILM
# 2)
#1) some settings
#do not change this; it keeps the training commands copy-pasteable between ihm, sdm and mdm
mic=ihm
#path where AMI should be downloaded to, or where it is already available locally
AMI_DIR=/disk/data2/amicorpus/
# path to Fisher transcripts for background language model
# when not set, only the in-domain LM will be built
FISHER_TRANS=`pwd`/eddie_data/lm/data/fisher
norm_vars=false
#1)
#in case you want to download the AMI corpus, run the line below
#(you need around 130GB of free space to get the whole data, ihm+mdm)
local/ami_download.sh ihm $AMI_DIR || exit 1;
#2) Data preparation
local/ami_text_prep.sh $AMI_DIR
local/ami_ihm_data_prep.sh $AMI_DIR || exit 1;
local/ami_ihm_scoring_data_prep.sh $AMI_DIR dev || exit 1;
local/ami_ihm_scoring_data_prep.sh $AMI_DIR eval || exit 1;
local/ami_prepare_dict.sh
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang
local/ami_train_lms.sh --fisher $FISHER_TRANS data/ihm/train/text data/ihm/dev/text data/local/dict/lexicon.txt data/local/lm
final_lm=`cat data/local/lm/final_lm`
LM=$final_lm.pr1-7
nj=16
prune-lm --threshold=1e-7 data/local/lm/$final_lm.gz /dev/stdout | \
gzip -c > data/local/lm/$LM.gz
utils/format_lm.sh data/lang data/local/lm/$LM.gz data/local/dict/lexicon.txt data/lang_$LM
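# e.g. with a (hypothetical) final_lm=ami_fsh.o3g.kn, the pruned LM ends up in
# data/local/lm/ami_fsh.o3g.kn.pr1-7.gz and the lang dir in data/lang_ami_fsh.o3g.kn.pr1-7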
#local/ami_format_data.sh data/local/lm/$LM.gz
# 3) Building systems
# here starts the normal recipe, which is mostly shared across mic scenarios
# one difference is that for sdm and mdm we do not adapt per speaker but per environment only
mfccdir=mfcc_$mic
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
)&
wait;
for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done
# 4) Train systems
nj=16
mkdir -p exp/$mic/mono
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 --norm-vars $norm_vars \
data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;
mkdir -p exp/$mic/mono_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;
mkdir -p exp/$mic/tri1
steps/train_deltas.sh --cmd "$train_cmd" --norm-vars $norm_vars \
5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
>& exp/$mic/tri1/train.log || exit 1;
mkdir -p exp/$mic/tri1_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;
mkdir -p exp/$mic/tri2a
steps/train_deltas.sh --cmd "$train_cmd" --norm-vars $norm_vars \
5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
>& exp/$mic/tri2a/train.log || exit 1;
for lm_suffix in $LM; do
# (
graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}
# ) &
done
mkdir -p exp/$mic/tri2a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2a_ali || exit 1;
# Train tri3a, which is LDA+MLLT
mkdir -p exp/$mic/tri3a
steps/train_lda_mllt.sh --cmd "$train_cmd" \
--splice-opts "--left-context=3 --right-context=3" \
5000 80000 data/$mic/train data/lang exp/$mic/tri2a_ali exp/$mic/tri3a \
>& exp/$mic/tri3a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
)
done
# Train tri4a, which is LDA+MLLT+SAT
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;
mkdir -p exp/$mic/tri4a
steps/train_sat.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/tri3a_ali \
exp/$mic/tri4a >& exp/$mic/tri4a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri4a $graph_dir
steps/decode_fmllr.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri4a/decode_dev_${lm_suffix}
steps/decode_fmllr.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri4a/decode_eval_${lm_suffix}
)
done
exit; # by default stop here; remove this line to also run the MMI training below
# MMI training starting from the LDA+MLLT+SAT systems
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_ali || exit 1
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a_ali \
data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_denlats || exit 1;
# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
data/$mic/train data/lang exp/$mic/tri4a_ali exp/$mic/tri4a_denlats \
exp/$mic/tri4a_mmi_b0.1 || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
for i in `seq 1 4`; do
decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a/decode_dev_${lm_suffix} --iter $i \
$graph_dir data/$mic/dev $decode_dir
done
i=3 #simply assumed
decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a/decode_eval_${lm_suffix} --iter $i \
$graph_dir data/$mic/eval $decode_dir
)
done
# here goes the hybrid stuff
# in the ASRU paper we used different python nnet code, so someone needs to copy & adjust the nnet or nnet2 Switchboard commands

156
egs/ami/s5/run_mdm.sh Executable file

@ -0,0 +1,156 @@
#!/bin/bash -u
. ./cmd.sh
. ./path.sh
# MDM - Multiple Distant Microphones
# Assuming text preparation, dict, lang and LM were built as in run_ihm.sh
nmics=8 #we use all 8 channels; other possible options are 2 and 4
mic=mdm$nmics #subdir name under data/
AMI_DIR=/disk/data2/amicorpus #root of AMI corpus
MDM_DIR=/disk/data1/s1136550/ami/mdm #directory for beamformed waves
#1) Download AMI (distant channels)
local/ami_download.sh mdm $AMI_DIR
#2) Beamform
local/ami_beamform.sh --nj 16 $nmics $AMI_DIR $MDM_DIR
#3) Prepare mdm data directories
local/ami_mdm_data_prep.sh $MDM_DIR $mic || exit 1;
local/ami_mdm_scoring_data_prep.sh $MDM_DIR $mic dev || exit 1;
local/ami_mdm_scoring_data_prep.sh $MDM_DIR $mic eval || exit 1;
#use the final LM
final_lm=`cat data/local/lm/final_lm`
LM=$final_lm.pr1-7
#number of decoding jobs - one per speaker
DEV_SPK=`cut -d" " -f2 data/$mic/dev/utt2spk | sort | uniq | wc -l`
EVAL_SPK=`cut -d" " -f2 data/$mic/eval/utt2spk | sort | uniq | wc -l`
nj=16
#GENERATE FEATS
mfccdir=mfcc_$mic
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
)&
wait;
for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done
# Build the systems
# TRAIN THE MODELS
mkdir -p exp/$mic/mono
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 \
data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;
mkdir -p exp/$mic/mono_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;
mkdir -p exp/$mic/tri1
steps/train_deltas.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
>& exp/$mic/tri1/train.log || exit 1;
mkdir -p exp/$mic/tri1_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;
mkdir -p exp/$mic/tri2a
steps/train_deltas.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
>& exp/$mic/tri2a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}
)
done
#THE TARGET LDA+MLLT+SAT+BMMI PART GOES HERE:
mkdir -p exp/$mic/tri2a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2a_ali || exit 1;
# Train tri3a, which is LDA+MLLT
mkdir -p exp/$mic/tri3a
steps/train_lda_mllt.sh --cmd "$train_cmd" \
--splice-opts "--left-context=3 --right-context=3" \
5000 80000 data/$mic/train data/lang exp/$mic/tri2a_ali exp/$mic/tri3a \
>& exp/$mic/tri3a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
)
done
# skip SAT, and build MMI models
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_denlats || exit 1;
mkdir -p exp/$mic/tri3a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;
# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
data/$mic/train data/lang exp/$mic/tri3a_ali exp/$mic/tri3a_denlats \
exp/$mic/tri3a_mmi_b0.1 || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
for i in `seq 1 4`; do
decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
--iter $i $graph_dir data/$mic/dev $decode_dir
done
i=3 #simply assumed
decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
--iter $i $graph_dir data/$mic/eval $decode_dir
)
done
# here goes the hybrid stuff
# in the ASRU paper we used different python nnet code, so someone needs to copy & adjust the nnet or nnet2 Switchboard commands

214
egs/ami/s5/run_sdm.sh Executable file

@ -0,0 +1,214 @@
#!/bin/bash -u
. ./cmd.sh
. ./path.sh
#SDM - Single Distant Microphone
#Assuming initial transcripts, dict, lang and LM were built in run_ihm.sh
micid=1 #which mic from array should be used?
mic=sdm$micid
AMI_DIR=/disk/data2/amicorpus/
norm_vars=false
#1) Download AMI (single distant channel)
local/ami_download.sh sdm $AMI_DIR
#2) Prepare sdm data directories
local/ami_sdm_data_prep.sh $AMI_DIR $micid
local/ami_sdm_scoring_data_prep.sh $AMI_DIR $micid dev
local/ami_sdm_scoring_data_prep.sh $AMI_DIR $micid eval
#use the final LM
final_lm=`cat data/local/lm/final_lm`
LM=$final_lm.pr1-7
#jobs for SDM/MDM decodes - one per meeting on a 16-core local machine
DEV_SPK=$((`cut -d" " -f2 data/$mic/dev/utt2spk | sort | uniq -c | wc -l`))
EVAL_SPK=$((`cut -d" " -f2 data/$mic/eval/utt2spk | sort | uniq -c | wc -l`))
echo $DEV_SPK $EVAL_SPK
nj=16
#GENERATE FEATS
mfccdir=mfcc_$mic
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/eval exp/$mic/make_mfcc/eval $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 5 --cmd "$train_cmd" data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/dev exp/$mic/make_mfcc/dev $mfccdir || exit 1
)&
(
steps/make_mfcc.sh --nj 16 --cmd "$train_cmd" data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$mic/train exp/$mic/make_mfcc/train $mfccdir || exit 1
)&
wait;
for dset in train eval dev; do utils/fix_data_dir.sh data/$mic/$dset; done
# TRAIN THE MODELS
mkdir -p exp/$mic/mono
steps/train_mono.sh --nj $nj --cmd "$train_cmd" --feat-dim 39 \
data/$mic/train data/lang exp/$mic/mono >& exp/$mic/mono/train_mono.log || exit 1;
mkdir -p exp/$mic/mono_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/$mic/train data/lang exp/$mic/mono \
exp/$mic/mono_ali >& exp/$mic/mono_ali/align.log || exit 1;
mkdir -p exp/$mic/tri1
steps/train_deltas.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/mono_ali exp/$mic/tri1 \
>& exp/$mic/tri1/train.log || exit 1;
mkdir -p exp/$mic/tri1_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri1 exp/$mic/tri1_ali || exit 1;
mkdir -p exp/$mic/tri2a
steps/train_deltas.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/tri1_ali exp/$mic/tri2a \
>& exp/$mic/tri2a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri2a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri2a $graph_dir
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri2a/decode_dev_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri2a/decode_eval_${lm_suffix}
)
done
#THE TARGET LDA+MLLT+SAT+BMMI PART GOES HERE:
mkdir -p exp/$mic/tri2a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri2a exp/$mic/tri2a_ali || exit 1;
# Train tri3a, which is LDA+MLLT
mkdir -p exp/$mic/tri3a
steps/train_lda_mllt.sh --cmd "$train_cmd" \
--splice-opts "--left-context=3 --right-context=3" \
5000 80000 data/$mic/train data/lang exp/$mic/tri2a_ali exp/$mic/tri3a \
>& exp/$mic/tri3a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri3a $graph_dir
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri3a/decode_dev_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri3a/decode_eval_${lm_suffix}
)
done
# skip SAT, and build MMI models
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_denlats || exit 1;
mkdir -p exp/$mic/tri3a_ali
steps/align_si.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;
# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
data/$mic/train data/lang exp/$mic/tri3a_ali exp/$mic/tri3a_denlats \
exp/$mic/tri3a_mmi_b0.1 || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri3a/graph_${lm_suffix}
for i in `seq 1 4`; do
decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --iter $i --config conf/decode.conf \
$graph_dir data/$mic/dev $decode_dir
done
i=3 #simply assumed
decode_dir=exp/$mic/tri3a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --iter $i --config conf/decode.conf \
$graph_dir data/$mic/eval $decode_dir
)
done
#By default we do not build systems adapted to sessions for AMI in distant scenarios, as this does not help much (around 1%)
#But one can do this by running the code below
exit; # remove this line to also build the session-adapted (SAT) systems below
# Train tri4a, which is LDA+MLLT+SAT
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri3a exp/$mic/tri3a_ali || exit 1;
mkdir -p exp/$mic/tri4a
steps/train_sat.sh --cmd "$train_cmd" \
5000 80000 data/$mic/train data/lang exp/$mic/tri3a_ali \
exp/$mic/tri4a >& exp/$mic/tri4a/train.log || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
$highmem_cmd $graph_dir/mkgraph.log \
utils/mkgraph.sh data/lang_${lm_suffix} exp/$mic/tri4a $graph_dir
steps/decode_fmllr.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/dev exp/$mic/tri4a/decode_dev_${lm_suffix}
steps/decode_fmllr.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
$graph_dir data/$mic/eval exp/$mic/tri4a/decode_eval_${lm_suffix}
)
done
# MMI training starting from the LDA+MLLT+SAT systems
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_ali || exit 1
steps/make_denlats.sh --nj $nj --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a_ali \
data/$mic/train data/lang exp/$mic/tri4a exp/$mic/tri4a_denlats || exit 1;
# 4 iterations of MMI seems to work well overall. The number of iterations is
# used as an explicit argument even though train_mmi.sh will use 4 iterations by
# default.
num_mmi_iters=4
steps/train_mmi.sh --cmd "$train_cmd" --boost 0.1 --num-iters $num_mmi_iters \
data/$mic/train data/lang exp/$mic/tri4a_ali exp/$mic/tri4a_denlats \
exp/$mic/tri4a_mmi_b0.1 || exit 1;
for lm_suffix in $LM; do
(
graph_dir=exp/$mic/tri4a/graph_${lm_suffix}
for i in `seq 1 4`; do
decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_dev_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $DEV_SPK --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a/decode_dev_${lm_suffix} \
$graph_dir data/$mic/dev $decode_dir
done
wait;
i=3 #simply assumed
decode_dir=exp/$mic/tri4a_mmi_b0.1/decode_eval_${i}.mdl_${lm_suffix}
steps/decode.sh --nj $EVAL_SPK --cmd "$decode_cmd" --config conf/decode.conf \
--transform-dir exp/$mic/tri4a/decode_eval_${lm_suffix} \
$graph_dir data/$mic/eval $decode_dir
)&
done
# here goes the hybrid stuff
# in the ASRU paper we used different python nnet code, so someone needs to copy & adjust the nnet or nnet2 Switchboard commands

1
egs/ami/s5/steps Symbolic link

@ -0,0 +1 @@
../../wsj/s5/steps

1
egs/ami/s5/utils Symbolic link

@ -0,0 +1 @@
../../wsj/s5/utils


@ -159,3 +159,15 @@ fortran_opt = $(shell gcc -v 2>&1 | perl -e '$$x = join(" ", <STDIN>); if($$x =~
openblas_compiled:
-git clone git://github.com/xianyi/OpenBLAS
$(MAKE) PREFIX=`pwd`/OpenBLAS/install FC=gfortran $(fortran_opt) DEBUG=1 USE_THREAD=0 -C OpenBLAS all install
beamformit: beamformit-3.51
.PHONY: beamformit-3.51
beamformit-3.51: BeamformIt-3.51.tgz
tar -xozf BeamformIt-3.51.tgz; \
cd BeamformIt-3.51; cmake . ; make
BeamformIt-3.51.tgz:
wget -c -T 10 http://www.xavieranguera.com/beamformit/releases/BeamformIt-3.51.tgz


@ -0,0 +1,7 @@
#!/bin/bash
# to be run from ..
# this script just exists to tell you how you'd make BeamformIt; we actually did it via Makefile rules,
# but it's not a default target.
make beamformit