Anthony Li Medicine | Engineering | Data Science

One Paper A Day: Reviewing dataset paper "A Dataset for Multilingual Epidemiological Event Extraction"

Categorised under:  OPAD

Preface

This is the first article in the series of One Paper A Day. In this series, I will commit to reviewing and summarising one article every day. Papers reviewed likely to fall under the category of machine learning, software engineering or medicine.

TLDR

This paper essentially provides a corpus for development of NLP tools and techniques for processing text related to Emerging Infectious Diseases (EID). Paper can be found here.

Why should you be interested in this paper?

  • Suitable for people developing NLP tools for surveillance of EID.
  • There are few other EID corpuses available on the web.
  • As paper was published in 2020, it has a fairly recent publication (relative to this blogpost) with a comprehensive literature review of both NLP techniques and corpus.

Key learning points

Literature review

  • Event extraction methods applied on EID news sources can be divided into three types:
    • Pattern based: Rules and template to extract events from text through representation and explotation of expert knowledge
    • Data driven: Statistical techniques to discover the relations in text
    • Hybrid of pattern based and data driven approaches
  • Review of existing event extraction methods published:
  • NLP applied on internet search data to look for specific EID trends e.g. Influenza
    • Sources of internet search data: Google, Twitter, Yahoo

Methods

  • Corpus creation:
    • Data extracted from online Promed articles from 1 Aug 2013 to 31 Aug 2018
    • Data cleaning done to remove boilerplate content in each of the articles
    • Data filtering to extract languages of interest
    • K means clustering applied on articles
    • Deduplication of data was done to remove potential duplicates. Tool used is ONION.
    • Control dataset created with Huffpost news articles from 2012 to 2018
  • Corpus statistics:
    • Training set comprises of 10,000 English articles and 2,996 French articles
    • Human annotated datasets, consisting of 444 English articles and 2,722 French articles, provided the ground truth to evaluate models developed.
  • DANIEL for extraction:
    • Segmentation
    • Event detection
    • Event localisation
    • Output is disease location pair
  • Text classification model:
    • Supervised classification methods evaluated: naive bayes, random forest, neural network
  • Performance metrics considered:
    • Recall and precision conisdered for evaluation of DANIEL
    • Recall, precision, F measure metrics for evaluation of text classification models

Results

  • DANIEL:
    • F score of 75%, precision of 60% for English articles
    • Precision of 74%, recall of 83% for French articles
  • Text classification model (English)
    • Highest precision of 80% by Random Forest
    • Highest recall of 76% by Neural Network
    • Highest F measure of 74% by Random Forest
  • Text classification model (English)
    • Highest precision of 80% by Random Forest
    • Highest recall of 74% by Naive Bayes
    • Highest F measure of 67% by Random Forest

Key contributions

  • Large multilingual EID dataset available here
  • Baseline performance of NLP techniques for event extraction and text classification methods