# LightLDA usage

Running `lightlda --help` gives the usage information.

```
LightLDA usage:

-num_vocabs <arg>        Size of dataset vocabulary
-num_topics <arg>        Number of topics. Default: 100
-num_iterations <arg>    Number of iterations. Default: 100
-mh_steps <arg>          Metropolis-Hastings steps. Default: 2
-alpha <arg>             Dirichlet prior alpha. Default: 0.1
-beta <arg>              Dirichlet prior beta. Default: 0.01
-num_blocks <arg>        Number of blocks on disk. Default: 1
-max_num_document <arg>  Max number of documents in a data block
-input_dir <arg>         Directory of input data, containing
                         files generated by dump_block
-num_servers <arg>       Number of servers. Default: 1
-num_local_workers <arg> Number of local training threads. Default: 4
-num_aggregator <arg>    Number of local aggregation threads. Default: 1
-server_file <arg>       Server endpoint file. Used by MPI-free version
-warm_start              Warm start
-out_of_core             Use out-of-core computing
-data_capacity <arg>     Memory pool size (MB) for data storage;
                         should be larger than any data block
-model_capacity <arg>    Memory pool size (MB) for local model cache
-alias_capacity <arg>    Memory pool size (MB) for alias table
-delta_capacity <arg>    Memory pool size (MB) for local delta cache
```
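
For example, a single-machine run might look like the following (the flag values here are illustrative placeholders, not recommendations; adjust them to your dataset):

```bash
# Illustrative single-machine invocation; all values are placeholders
lightlda \
  -num_vocabs 100000 \
  -num_topics 1000 \
  -num_iterations 100 \
  -alpha 0.1 -beta 0.01 \
  -mh_steps 2 \
  -num_local_workers 4 \
  -num_blocks 1 \
  -max_num_document 300000 \
  -input_dir ./data \
  -data_capacity 800 \
  -model_capacity 512 \
  -alias_capacity 512 \
  -delta_capacity 128
```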

# Note on the input data

The input data is placed in a folder specified by the command-line argument `input_dir`.

This folder should contain files named `block.<id>` and `vocab.<id>`, where `id` ranges from 0 to N-1 and N is the number of data blocks.

The input data should be generated by the tool `dump_binary` (released along with LightLDA), which converts the libsvm format into a binary format. This is done for training efficiency.
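
For example, with two data blocks the input folder would contain something like the following (file names follow the `block.<id>`/`vocab.<id>` convention described above):

```bash
ls input_dir/
# block.0  block.1  vocab.0  vocab.1
```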

# Note on the capacity arguments

In LightLDA, almost all memory chunks are pre-allocated, and LightLDA uses this fixed-capacity memory as a memory pool.

For data capacity, you should assign a value at least as large as the largest of your binary training block files (generated by `dump_binary`; see the note on the input data above).

For model/alias/delta capacity, you can assign any value. LightLDA handles the big-model challenge under limited memory by model scheduling: it loads only a slice of the needed parameters that fits into the pre-allocated memory and schedules only the related tokens for training. To reduce the wait time, the next slice is prefetched in the background. Empirically, model capacity and alias capacity are of the same order, while delta capacity can be much smaller than either. The logs report the actual memory sizes used at the beginning of the program; you can use this information to adjust these arguments for better computation/memory efficiency.
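
A rough way to pick these values (a sketch with illustrative numbers, not recommendations):

```bash
# -data_capacity must be larger than the biggest binary block file;
# check the block sizes (in MB) before choosing a value
ls -l --block-size=M input_dir/block.*

# If the largest block is around 750 MB, one plausible setting is:
#   -data_capacity 800 -model_capacity 512 -alias_capacity 512 -delta_capacity 128
# (model/alias of the same order, delta much smaller, per the note above)
```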

# Note on distributed running

Data should be distributed across the nodes.

To run with MPI, you just need to run `mpiexec --machinefile machine_file lightlda -lightlda_arguments...`.
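
For instance, a minimal machine file just lists one host per line (the host names below are placeholders):

```bash
# machine_file: one host per line (placeholder host names)
cat > machine_file <<'EOF'
node01
node02
EOF

# The lightlda arguments are the same as in the single-machine case
mpiexec --machinefile machine_file lightlda -lightlda_arguments...
```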

To run without MPI, you need to prepare a server endpoint file that contains the ip:port information for each server process and pass it via `-server_file`.
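
A sketch of such a file, assuming one ip:port entry per server process (the exact format expected by `-server_file` may differ; verify against the LightLDA documentation):

```bash
# Hypothetical server endpoint file: one ip:port per server process
cat > server_endpoint_file <<'EOF'
10.0.0.1:5555
10.0.0.2:5555
EOF

# Pass the file to the MPI-free version via -server_file
lightlda -num_servers 2 -server_file server_endpoint_file -lightlda_arguments...
```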