
Named Entity Recognition on Noisy Social Media Texts

Updated: Mar 9

1. Context

Named Entity Recognition (NER) aims to identify different types of entities within a given text, such as names of people, companies, and locations. For example, in “Going to San Diego”, “San Diego” refers to a specific location; compare this with “Going to the city”, where the destination is not named but left as a generic city. This information is useful for higher-level Natural Language Processing (NLP) applications such as information extraction, summarization, and data mining.
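To make the distinction concrete, the short Python sketch below runs an off-the-shelf NER model (spaCy's pretrained English pipeline, which is not part of this project and uses its own label set) on the two phrases; it only illustrates what NER output looks like.

# Illustrative only: spaCy is an off-the-shelf NER tool, not this project's model.
import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

for text in ["Going to San Diego", "Going to the city"]:
    doc = nlp(text)
    # Each recognized entity is a span with a surface form and a type label.
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])

# Expected (roughly): "San Diego" comes back as a location-like entity
# (GPE in spaCy's scheme), while "the city" yields no entity at all.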

This project focuses on NER on noisy social media texts. Social media texts are notorious for grammatical inconsistency, a wide range of writing styles, and rapidly shifting topic domains. NER on such texts therefore cannot rely on formal grammatical tools, assume a particular writing style, or depend on topic-specific resources.

2. Objective

The challenge of identifying unusual, previously unseen entities in noisy social media texts is addressed by the Emerging and Rare Entity Recognition shared task at the 3rd Workshop on Noisy User-generated Text (WNUT 2017). The organizers annotated and made available 2,295 texts drawn from several sources (Reddit, Twitter, YouTube, and StackExchange comments) that focus on entities that are emerging (i.e. not present in data from n years ago) and rare (i.e. not appearing more than k times in the data). There are six entity types (an example of the annotation format follows the list):

  • person – Names of people (e.g. Virginia Wade)

  • location – Names that are locations (e.g. France)

  • corporation – Names of corporations (e.g. Google)

  • product – Names of products (e.g. iPhone)

  • creative-work – Names of creative works (e.g. Bohemian Rhapsody)

  • group – Names of groups (e.g. Nirvana, San Diego Padres)
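For a concrete sense of the annotation, the snippet below shows one made-up sentence in a BIO-tagged, token-per-line layout (the sentence and the exact column formatting are illustrative assumptions, not an excerpt from the corpus):

The        O
new        O
Star       B-creative-work
Wars       I-creative-work
trailer    O
from       O
Disney     B-corporation
looks      O
amazing    O

Here B- marks the first token of an entity, I- marks its continuation, and O marks tokens outside any entity.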

The shared task evaluates against two measures:

  • The classical entity-level precision, recall, and their harmonic mean, F1.

  • The set of unique surface forms in the gold data and the submission are compared, and their precision, recall, and F1 are measured as well.

These two measures are denoted F1 (entity) and F1 (surface). The latter measures how well systems recognize a diverse range of entities, rather than just the most frequent surface forms. For instance, the classical measure would reward a system that always recognizes London accurately, so such a system would score highly on a corpus where 50% of the location entities are just London. The surface measure, though, rewards London only once, regardless of how many times it appears in the text.
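The difference is easy to see in code. The sketch below computes both scores from lists of (surface form, type) mentions; the function names and the simplified occurrence-level matching (ignoring exact span offsets) are assumptions for illustration, not the official WNUT scorer.

from collections import Counter

def prf(tp, n_pred, n_gold):
    # Precision, recall, and their harmonic mean F1.
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def entity_f1(gold_mentions, pred_mentions):
    # Classical measure: every occurrence counts, so frequent entities dominate.
    gold, pred = Counter(gold_mentions), Counter(pred_mentions)
    tp = sum(min(gold[m], pred[m]) for m in pred)
    return prf(tp, sum(pred.values()), sum(gold.values()))

def surface_f1(gold_mentions, pred_mentions):
    # Surface measure: each unique surface form is rewarded at most once.
    gold, pred = set(gold_mentions), set(pred_mentions)
    return prf(len(gold & pred), len(pred), len(gold))

gold = [("London", "location")] * 5 + [("Virginia Wade", "person")]
pred = [("London", "location")] * 5   # always finds London, misses the rare name
print(entity_f1(gold, pred))          # looks strong: recall 5/6
print(surface_f1(gold, pred))         # London counted once: recall 1/2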

3. Methods and Techniques


  • Character embedding, word embedding: 1D-CNN, GloVe

  • Topic modeling: LDA

  • NER model: BiLSTM, fully-connected layers, and CRF (a sketch combining these components follows below)
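A minimal PyTorch sketch of how these pieces could be wired together: a 1D-CNN over characters concatenated with (GloVe-initialized) word embeddings and optional LDA topic features, fed through a BiLSTM and a fully-connected layer into per-tag emissions, with a CRF on top. It assumes the third-party pytorch-crf package, and every dimension and name is illustrative rather than the project's exact configuration.

# Hedged architecture sketch (illustrative sizes, not the project's exact setup).
# Requires torch and the pytorch-crf package.
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party CRF layer (assumption)

class CharCNNBiLSTMCRF(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=25, char_filters=30,
                 topic_dim=0, hidden=200):
        super().__init__()
        # Word embeddings; in practice these would be initialized from GloVe.
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        # Character embeddings + 1D-CNN giving a per-word character feature.
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        feat_dim = word_dim + char_filters + topic_dim
        self.bilstm = nn.LSTM(feat_dim, hidden // 2, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(hidden, n_tags)      # fully-connected emission layer
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, words, chars, topics=None):
        # chars: (batch, seq_len, max_word_len) character ids per token
        b, s, w = chars.shape
        c = self.char_emb(chars).view(b * s, w, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        if topics is not None:                   # optional LDA topic features
            x = torch.cat([x, topics], dim=-1)
        out, _ = self.bilstm(x)
        return self.fc(out)                      # per-token tag scores

    def loss(self, words, chars, tags, mask, topics=None):
        # CRF negative log-likelihood over BIO tag sequences.
        return -self.crf(self._emissions(words, chars, topics), tags, mask=mask)

    def decode(self, words, chars, mask, topics=None):
        # Highest-scoring tag path per sentence.
        return self.crf.decode(self._emissions(words, chars, topics), mask=mask)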

4. Publications

1. Derczynski, L., Nichols, E., van Erp, M., & Limsopatham, N. (2017). Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text.


2. Jansson, P., & Liu, S. (2017, December). Topic modelling enriched LSTM models for the detection of novel and emerging named entities from social media. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 4329-4336). IEEE.



