Early Detection of Fake News with Multi-source Weak Social Supervision

Multi-source Weak Social Supervision for Fake News Detection

Authors: Guoqing Zheng (zheng@microsoft.com), Yichuan Li (yli29@wpi.edu), Kai Shu (kshu@iit.edu)

This repository contains code for fake news detection with Multi-source Weak Social Supervision (MWSS), from the ECML-PKDD 2020 paper "Early Detection of Fake News with Multi-source Weak Social Supervision".

Model Structure

The structure of the MWSS model: see the model-structure figure in the `figure` directory.

The training process of the MWSS model: see the training-process figure in the `figure` directory.

Requirements

torch=1.x

transformers=2.4.0
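
To double-check your environment against the versions listed above, a quick sanity check:

    import torch
    import transformers

    print(torch.__version__)         # expected: 1.x
    print(transformers.__version__)  # expected: 2.4.0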

Prepare dataset

The weakly labeled dataset is in the following format (one column of news text, followed by one label column per weak source):

    news                         Weak_Label_1   Weak_Label_2   ...   Weak_Label_K
    Hello this is MWSS readme    1              0              ...   1

The clean (gold) labeled dataset is in the following format:

    News                         label
    Hello this is MWSS readme    1
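
As an illustration only, a minimal sketch of writing the two files with pandas. The column layout follows the formats above; the file names and the use of plain CSV are assumptions for this sketch, not something fixed by the repository:

    import pandas as pd

    # Weakly labeled data: the news text plus one label column per weak source.
    weak = pd.DataFrame({
        "news": ["Hello this is MWSS readme"],
        "Weak_Label_1": [1],
        "Weak_Label_2": [0],
        "Weak_Label_3": [1],
    })
    weak.to_csv("weak_train.csv", index=False)    # hypothetical file name

    # Clean (gold) labeled data: the news text plus a single gold label.
    clean = pd.DataFrame({
        "News": ["Hello this is MWSS readme"],
        "label": [1],
    })
    clean.to_csv("clean_train.csv", index=False)  # hypothetical file name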

Usage

a. "--train_type" selects the training data: {0: "clean", 1: "noise", 2: "clean+noise"}.

b. "--meta_learn" enables learning an instance weight for each noisy sample.

c. "--multi_head" sets the number of weak sources; if you have three different weak sources, set it to 3.

d. "--group_opt": the optimizer used for the group weights. You can choose Adam or SGD.

e. "--gold_ratio": float ratio of gold (clean) data used for training. The default of 0 runs all of the gold ratios [0.02, 0.04, 0.06, 0.08, 0.1]; to use a single ratio such as 0.02, pass "--gold_ratio 0.02".

  • Finetune on RoBERTa Group Weight

    python3 run_classifiy.py \
    --model_name_or_path roberta-base \
    --evaluate_during_training --do_train --do_eval \
    --num_train_epochs 15 \
    --output_dir ./output/ \
    --logging_steps 100 \
    --max_seq_length 256 \
    --train_type 0 \
    --per_gpu_eval_batch_size 16 \
    --g_train_batch_size 5 \
    --s_train_batch_size 5 \
    --clf_model "robert" \
    --meta_learn \
    --weak_type "none" \
    --multi_head 3 \
    --use_group_net \
    --group_opt "adam" \
    --train_path "./data/political/weak" \
    --eval_path "./data/political/test.csv" \
    --fp16 \
    --fp16_opt_level O1 \
    --learning_rate 1e-4 \
    --group_adam_epsilon 1e-9 \
    --group_lr 1e-3 \
    --gold_ratio 0.04 \
    --id "ParameterGroup1"

The log information will be stored in:

~/output

  • CNN Baseline Model

    python run_classifiy.py \
    --model_name_or_path distilbert-base-uncased \
    --evaluate_during_training --do_train --do_eval --do_lower_case \
    --num_train_epochs 30 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --train_type 0 \
    --weak_type most_vote \
    --per_gpu_train_batch_size 256 \
    --per_gpu_eval_batch_size 256 \
    --learning_rate 1e-3 \
    --clf_model cnn

  • CNN Instance Weight Model with multiple classification heads

    python run_classifiy.py \
    --model_name_or_path distilbert-base-uncased \
    --evaluate_during_training --do_train --do_eval --do_lower_case \
    --num_train_epochs 256 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --train_type 0 \
    --per_gpu_eval_batch_size 256 \
    --g_train_batch_size 256 \
    --s_train_batch_size 256 \
    --learning_rate 1e-3 \
    --clf_model cnn \
    --meta_learn \
    --weak_type "none"

  • CNN Group Weight

    python run_classifiy.py \
    --model_name_or_path distilbert-base-uncased \
    --evaluate_during_training --do_train --do_eval --do_lower_case \
    --num_train_epochs 256 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --train_type 0 \
    --per_gpu_eval_batch_size 256 \
    --g_train_batch_size 256 \
    --s_train_batch_size 256 \
    --learning_rate 1e-3 \
    --clf_model cnn \
    --meta_learn \
    --weak_type "none" \
    --multi_head 3 \
    --use_group_weight \
    --group_opt "SGD" \
    --group_momentum 0.9 \
    --group_lr 1e-5

  • RoBERTa Baseline Model

    python run_classifiy.py \
    --model_name_or_path roberta-base \
    --evaluate_during_training \
    --do_train --do_eval --do_lower_case \
    --num_train_epochs 30 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --train_type 0 \
    --weak_type most_vote \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --learning_rate 5e-5 \
    --clf_model robert

  • RoBERTa Instance Weight with Multi Head Classification

    python run_classifiy.py \
    --model_name_or_path roberta-base \
    --evaluate_during_training --do_train --do_eval --do_lower_case \
    --num_train_epochs 256 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --weak_type most_vote \
    --per_gpu_eval_batch_size 16 \
    --g_train_batch_size 16 \
    --s_train_batch_size 16 \
    --learning_rate 5e-5 \
    --clf_model robert \
    --meta_learn \
    --weak_type "none" \
    --multi_head 3

  • RoBERTa Group Weight

    python run_classifiy.py \
    --model_name_or_path roberta-base \
    --evaluate_during_training --do_train --do_eval --do_lower_case \
    --num_train_epochs 256 \
    --output_dir ./output/ \
    --logging_steps 10 \
    --max_seq_length 256 \
    --weak_type most_vote \
    --per_gpu_eval_batch_size 16 \
    --g_train_batch_size 16 \
    --s_train_batch_size 16 \
    --learning_rate 5e-5 \
    --clf_model robert \
    --meta_learn \
    --weak_type "none" \
    --multi_head 3 \
    --use_group_weight \
    --group_opt "SGD" \
    --group_momentum 0.9 \
    --group_lr 1e-5

  • Finetune on RoBERTa Group Weight (hyperparameter search over comma-separated values)

    python3 run_classifiy.py \
    --model_name_or_path roberta-base \
    --evaluate_during_training --do_train --do_eval \
    --num_train_epochs 15 \
    --output_dir ./output/ \
    --logging_steps 100 \
    --max_seq_length 256 \
    --train_type 0 \
    --per_gpu_eval_batch_size 16 \
    --g_train_batch_size 1 \
    --s_train_batch_size 1 \
    --clf_model "robert" \
    --meta_learn \
    --weak_type "none" \
    --multi_head 3 \
    --use_group_net \
    --group_opt "adam" \
    --train_path "./data/political/weak" \
    --eval_path "./data/political/test.csv" \
    --fp16 \
    --fp16_opt_level O1 \
    --learning_rate "1e-4,5e-4,1e-5,5e-5" \
    --group_adam_epsilon "1e-9, 1e-8, 5e-8" \
    --group_lr "1e-3,1e-4,3e-4,5e-4,1e-5,5e-5" \
    --gold_ratio 0.04

The log information will be stored in:

~/ray_results/GoldRatio_{}_GroupNet

You can run the following command to extract the best result, ranked by the average of accuracy and F1:

export LOG_FILE=~/ray_results/GoldRatio_{}_GroupNet
python read_json.py --file_name $LOG_FILE --save_dir ./output

You can also visualize the training logs with TensorBoard:

tensorboard --logdir $LOG_FILE