Approaches and Challenges in Annotating a Closed Domain NER Dataset

Author

Zikun Fu

1 Introduction

1.1 Named Entity Recognition

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying entities in text and classifying them into predefined categories such as person names, organizations, and locations.

1.2 Closed Domain NER

Closed Domain Named Entity Recognition (CD-NER) involves extracting entities from text that correspond to elements of a structured database, such as table names, column names, or partial tuple values. This domain-specific entity set can contain billions of entries, making extraction a significant challenge: the system must handle specialized vocabulary, leverage domain-specific context, and accurately pick entities out of a large but fixed pool.

1.3 Benchmarks

In the field of text-to-SQL translation, benchmark datasets like BIRD and Spider have advanced research and established baselines. However, the lack of high-quality CD-NER benchmark datasets limits progress in this area. This article addresses this gap by converting text-to-SQL benchmarks into CD-NER benchmarks. By leveraging structured features from text-to-SQL datasets, we aim to provide a reliable evaluation resource for closed-domain entity extraction.

2 BIRD Dataset

We’ll be working with the BIRD dataset, which contains natural language questions paired with SQL queries.

import sys
from pathlib import Path

# Relative project root path
project_root = Path("../..")
src_path = project_root / "5-API" / "src"
dataset_path = project_root / "data" / "BIRD"

# Add src directory to Python path
sys.path.append(str(src_path.resolve()))
from cdner.datasets import BirdDataset
from cdner.annotators.pglast_annotator import PglastAnnotator

# Initialize the dataset
dataset = BirdDataset(root=dataset_path, train=True).load()
examples_list = list(dataset.examples) 
print(f"Number of samples in the dataset: {len(examples_list)}")
Number of samples in the dataset: 9428
from pprint import pprint
# Display the first sample
pprint(examples_list[0].model_dump())
{'db_id': 'movie_platform',
 'id': 'bird:train.json:0',
 'query': 'SELECT movie_title FROM movies WHERE movie_release_year = 1945 '
          'ORDER BY movie_popularity DESC LIMIT 1',
 'question': 'Name movie titles released in year 1945. Sort the listing by the '
             'descending order of movie popularity.'}

3 Building CD-NER Benchmarks

To transform BIRD into a CD-NER benchmark, we map each sentence (the question text) to a set of lexemes (entities extracted from the SQL query).

This is broken down into the following steps:

3.1 Extracting Lexemes

We begin by parsing the SQL queries using pglast, a Python library that parses PostgreSQL SQL statements into an Abstract Syntax Tree (AST). This AST representation allows us to navigate the structure of the SQL queries and extract:

  • Tables: Identified by navigating RangeVar nodes in the AST.
  • Columns: Extracted from ColumnRef nodes.
  • Values: Retrieved from A_Const nodes representing constants in the query.

This provides a candidate list that needs to be matched with the question text. We call the extracted entities lexemes.
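To make the extraction concrete, here is a minimal visitor sketch built on pglast's Visitor API (attribute names such as sval follow the pglast 5+ AST, so older versions may differ). This is an illustration of how the three node types map to lexemes, not the cdner implementation:

from pglast import parse_sql
from pglast.visitors import Visitor

class LexemeSketch(Visitor):
    """Collect (label_type, lexeme) pairs from a parsed SQL statement."""
    def __init__(self):
        super().__init__()
        self.lexemes = []

    def visit_RangeVar(self, ancestors, node):
        # Table references appear as RangeVar nodes
        self.lexemes.append(('table', node.relname))

    def visit_ColumnRef(self, ancestors, node):
        # fields holds String parts (or A_Star for '*'); keep the column name
        parts = [f.sval for f in node.fields if hasattr(f, 'sval')]
        if parts:
            self.lexemes.append(('column', parts[-1]))

    def visit_A_Const(self, ancestors, node):
        # Constants wrap an Integer/Float/String node; render the value as text
        if getattr(node, 'isnull', False):
            return
        value = next((getattr(node.val, a)
                      for a in ('ival', 'fval', 'sval', 'boolval')
                      if hasattr(node.val, a)), None)
        self.lexemes.append(('value', str(value)))

sql = ("SELECT movie_title FROM movies WHERE movie_release_year = 1945 "
       "ORDER BY movie_popularity DESC LIMIT 1")
visitor = LexemeSketch()
visitor(parse_sql(sql))
print(visitor.lexemes)

Note that the LIMIT constant is itself an A_Const, so this sketch also yields ('value', '1'); the annotated output in Section 4 suggests the real annotator filters out such clause-specific constants.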

3.2 Matching Lexemes

Once we have the list of lexemes, the next step is to match them with substrings in the corresponding natural language question or sentence. Direct string matching is often insufficient due to variations in phrasing, synonyms, or differences in tokenization. To address this, we use a convolutional search with fuzzy string matching:

  • Tokenization: The question text is tokenized, preserving the position of each token for accurate mapping.
  • Convolutional Search: We slide a window over the tokens to consider all possible substrings of varying lengths.
  • Fuzzy Matching: For each substring, we compute a similarity score with the entity using metrics like the token sort ratio from the thefuzz library.
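The library's conv_match_substring (used in Section 4) performs this search. As a rough, self-contained sketch of the idea (the real function's tokenization and scoring details differ, so scores will not reproduce the Section 4 trace exactly; token_sort_ratio comes from thefuzz and returns 0-100, scaled here to 0-1):

import re
from thefuzz import fuzz

def conv_match_sketch(sentence, lexemes, threshold=0.8, max_window=5):
    """Slide token windows of decreasing size over the sentence and keep,
    per lexeme, the best-scoring window above the threshold."""
    # Whitespace tokenization, keeping character offsets for span mapping
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]
    matches = []
    for label_type, lexeme in lexemes:
        target = lexeme.replace('_', ' ')  # compare the space-joined form
        best = None  # (score, start, end)
        for size in range(min(max_window, len(tokens)), 0, -1):
            for i in range(len(tokens) - size + 1):
                start, end = tokens[i][0], tokens[i + size - 1][1]
                score = fuzz.token_sort_ratio(target, sentence[start:end]) / 100
                if best is None or score > best[0]:
                    best = (score, start, end)
        if best and best[0] >= threshold:
            matches.append({'label_type': label_type, 'lexeme': lexeme,
                            'start': best[1], 'end': best[2],
                            'similarity': round(best[0], 2)})
    return matches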

3.3 Annotating the Sentence

We annotate each sentence with:

  • Start and End Positions: Indicating the exact location of the entity in the question.
  • Label Type: Denoting whether the entity is a table, column, or value.
  • Lexeme: The original entity extracted from the SQL query.
  • Similarity Score: Reflecting the confidence of the match.
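The model_dump() calls in Section 4 suggest these annotations are pydantic models. As a hypothetical illustration of the record shape (the real cdner classes may differ), one might define:

from typing import Optional
from pydantic import BaseModel

class Entity(BaseModel):
    # Hypothetical shape mirroring the fields above, plus the
    # schema_element slot visible in the Section 4 output
    start: int
    end: int
    label_type: str            # 'table', 'column', or 'value'
    lexeme: str
    similarity: float
    schema_element: Optional[str] = None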

3.4 Applying the BIO Tagging

Finally, we convert the annotated entities into a BIO tagging format:

  • B-Label: Marks the beginning of an entity.
  • I-Label: Marks tokens inside an entity.
  • O: Marks tokens outside any entity.
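A minimal sketch of this conversion (a hypothetical helper, not the cdner implementation; it assumes whitespace tokens and non-overlapping character spans):

import re

def to_bio(sentence, entities):
    """Tag each whitespace token B-/I-/O from entity character spans."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\S+", sentence)]
    tags = ['O'] * len(tokens)
    for ent in entities:
        # A token belongs to an entity if their character spans overlap
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s < ent['end'] and e > ent['start']]
        for pos, i in enumerate(inside):
            tags[i] = ('B-' if pos == 0 else 'I-') + ent['label_type'].upper()
    return list(zip((tok for tok, _, _ in tokens), tags))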

4 Step-by-Step Example

4.1 Extracting Lexemes

Using pglast, the annotator parses the SQL query, extracts the lexemes, and matches each one against the question text:

annotator = PglastAnnotator()

annotated_example = annotator.annotate(examples_list[0])
pprint(annotated_example.model_dump())
{'entities': [{'end': 34,
               'label_type': 'column',
               'lexeme': 'movie_release_year',
               'schema_element': None,
               'similarity': 0.77,
               'start': 5,
               'substring': 'movie titles released in year'},
              {'end': 101,
               'label_type': 'column',
               'lexeme': 'movie_popularity',
               'schema_element': None,
               'similarity': 1.0,
               'start': 85,
               'substring': 'movie popularity'},
              {'end': 17,
               'label_type': 'column',
               'lexeme': 'movie_title',
               'schema_element': None,
               'similarity': 0.96,
               'start': 5,
               'substring': 'movie titles'},
              {'end': 10,
               'label_type': 'table',
               'lexeme': 'movies',
               'schema_element': None,
               'similarity': 0.91,
               'start': 5,
               'substring': 'movie'},
              {'end': 39,
               'label_type': 'value',
               'lexeme': '1945',
               'schema_element': None,
               'similarity': 1.0,
               'start': 35,
               'substring': '1945'}],
 'id': 'bird:train.json:0',
 'question': 'Name movie titles released in year 1945. Sort the listing by the '
             'descending order of movie popularity.'}

4.2 Matching Lexemes

We use convolutional search with fuzzy matching to align lexemes with segments of the sentence (question text). The matching process identifies the most similar substring within a sliding window across the sentence, based on a similarity threshold:

sentence = "Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity."
lexemes = [
    ('column', 'movie_release_year'),
    # ('column', 'movie_popularity'),
    # ('column', 'movie_title'),
    # ('table', 'movies'),
    # ('value', '1945')
]
print("Sentence =",sentence)
print("lexemes =", lexemes)
# Set a similarity threshold
threshold = 0.8

# Perform matching
print("Starting the matching process:")
entities = conv_match_substring(sentence, lexemes, threshold=threshold)

# Display the matched entities
print("\nMatched entities:")
for entity in entities:
    print(f"Entity Type: {entity.label_type}")
    print(f"Matched Text: '{sentence[entity.start:entity.end]}'")
    print(f"Lexeme: {entity.lexeme}")
    print(f"Similarity: {entity.similarity}\n")
Sentence = Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity.
lexemes = [('column', 'movie_release_year')]
Starting the matching process:

Matching lexeme 'movie_release_year' of type 'column'

Searching for best match for phrase 'movie_release_year' in sentence.
Window 'Name movie titles released in' (Tokens 0-5): Similarity = 0.64
Window 'movie titles released in year' (Tokens 1-6): Similarity = 0.77
Window 'titles released in year 1945' (Tokens 2-7): Similarity = 0.61
Window 'released in year 1945. Sort' (Tokens 3-8): Similarity = 0.64
Window 'in year 1945. Sort the' (Tokens 4-9): Similarity = 0.46
Window 'year 1945. Sort the listing' (Tokens 5-10): Similarity = 0.41
Window '1945. Sort the listing by' (Tokens 6-11): Similarity = 0.24
Window 'Sort the listing by the' (Tokens 7-12): Similarity = 0.29
Window 'the listing by the descending' (Tokens 8-13): Similarity = 0.3
Window 'listing by the descending order' (Tokens 9-14): Similarity = 0.29
Window 'by the descending order of' (Tokens 10-15): Similarity = 0.32
Window 'the descending order of movie' (Tokens 11-16): Similarity = 0.43
Window 'descending order of movie popularity' (Tokens 12-17): Similarity = 0.41
Window size 5: Best match 'movie titles released in year' with similarity 0.77
Window 'Name movie titles released' (Tokens 0-4): Similarity = 0.68
Window 'movie titles released in' (Tokens 1-5): Similarity = 0.71
Window 'titles released in year' (Tokens 2-6): Similarity = 0.68
Window 'released in year 1945' (Tokens 3-7): Similarity = 0.72
Window 'in year 1945. Sort' (Tokens 4-8): Similarity = 0.46
Window 'year 1945. Sort the' (Tokens 5-9): Similarity = 0.44
Window '1945. Sort the listing' (Tokens 6-10): Similarity = 0.26
Window 'Sort the listing by' (Tokens 7-11): Similarity = 0.27
Window 'the listing by the' (Tokens 8-12): Similarity = 0.33
Window 'listing by the descending' (Tokens 9-13): Similarity = 0.28
Window 'by the descending order' (Tokens 10-14): Similarity = 0.34
Window 'the descending order of' (Tokens 11-15): Similarity = 0.29
Window 'descending order of movie' (Tokens 12-16): Similarity = 0.42
Window 'order of movie popularity' (Tokens 13-17): Similarity = 0.51
Window size 4: Best match 'released in year 1945' with similarity 0.72
Window 'Name movie titles' (Tokens 0-3): Similarity = 0.57
Window 'movie titles released' (Tokens 1-4): Similarity = 0.77
Window 'titles released in' (Tokens 2-5): Similarity = 0.61
Window 'released in year' (Tokens 3-6): Similarity = 0.82
Window 'in year 1945' (Tokens 4-7): Similarity = 0.4
Window 'year 1945. Sort' (Tokens 5-8): Similarity = 0.44
Window '1945. Sort the' (Tokens 6-9): Similarity = 0.26
Window 'Sort the listing' (Tokens 7-10): Similarity = 0.29
Window 'the listing by' (Tokens 8-11): Similarity = 0.31
Window 'listing by the' (Tokens 9-12): Similarity = 0.31
Window 'by the descending' (Tokens 10-13): Similarity = 0.34
Window 'the descending order' (Tokens 11-14): Similarity = 0.32
Window 'descending order of' (Tokens 12-15): Similarity = 0.32
Window 'order of movie' (Tokens 13-16): Similarity = 0.56
Window 'of movie popularity' (Tokens 14-17): Similarity = 0.49
Window size 3: Best match 'released in year' with similarity 0.82
Window 'Name movie' (Tokens 0-2): Similarity = 0.57
Window 'movie titles' (Tokens 1-3): Similarity = 0.6
Window 'titles released' (Tokens 2-4): Similarity = 0.55
Window 'released in' (Tokens 3-5): Similarity = 0.62
Window 'in year' (Tokens 4-6): Similarity = 0.48
Window 'year 1945' (Tokens 5-7): Similarity = 0.37
Window '1945. Sort' (Tokens 6-8): Similarity = 0.22
Window 'Sort the' (Tokens 7-9): Similarity = 0.31
Window 'the listing' (Tokens 8-10): Similarity = 0.28
Window 'listing by' (Tokens 9-11): Similarity = 0.21
Window 'by the' (Tokens 10-12): Similarity = 0.17
Window 'the descending' (Tokens 11-13): Similarity = 0.31
Window 'descending order' (Tokens 12-14): Similarity = 0.35
Window 'order of' (Tokens 13-15): Similarity = 0.38
Window 'of movie' (Tokens 14-16): Similarity = 0.46
Window 'movie popularity' (Tokens 15-17): Similarity = 0.53
Window size 2: Best match 'released in' with similarity 0.62
Window 'Name' (Tokens 0-1): Similarity = 0.18
Window 'movie' (Tokens 1-2): Similarity = 0.43
Window 'titles' (Tokens 2-3): Similarity = 0.33
Window 'released' (Tokens 3-4): Similarity = 0.54
Window 'in' (Tokens 4-5): Similarity = 0.1
Window 'year' (Tokens 5-6): Similarity = 0.36
Window '1945' (Tokens 6-7): Similarity = 0.0
Window 'Sort' (Tokens 7-8): Similarity = 0.18
Window 'the' (Tokens 8-9): Similarity = 0.1
Window 'listing' (Tokens 9-10): Similarity = 0.16
Window 'by' (Tokens 10-11): Similarity = 0.1
Window 'the' (Tokens 11-12): Similarity = 0.1
Window 'descending' (Tokens 12-13): Similarity = 0.21
Window 'order' (Tokens 13-14): Similarity = 0.35
Window 'of' (Tokens 14-15): Similarity = 0.1
Window 'movie' (Tokens 15-16): Similarity = 0.43
Window 'popularity' (Tokens 16-17): Similarity = 0.29
Window size 1: Best match 'released' with similarity 0.54

Best overall match: 'released in year' with similarity 0.82
Matched 'released in year' in sentence with similarity 0.82

Matched entities:
Entity Type: column
Matched Text: 'released in year'
Lexeme: movie_release_year
Similarity: 0.82

5 Challenges

  • Alignment Issues:
    • Natural language questions often use varied phrasing that doesn’t directly match the lexemes (e.g., table names, column names) in the database schema.
  • Overlapping Entities:
    • When multiple entities are mentioned closely together in a question, their textual representations can overlap.

We can approach these challenges in the following ways:

  • Alignment
    1. Continuous Annotation: Only continuous (adjacent) substrings in the sentence can be annotated as entities. This means that the words corresponding to an entity must be next to each other without any interruptions.
    2. Non-continuous Annotation: Allows for the annotation of entities even if the corresponding words are not adjacent in the sentence. This approach is more flexible and can capture entities that are mentioned in a scattered manner throughout the sentence.
  • Overlap
    1. Overlap Annotation: Annotations are allowed to overlap in the sentence; that is, a word or phrase can be part of multiple entity annotations. This is useful when different entities share common words in the question.
    2. Non-overlap Annotation: Annotations cannot overlap; each word or phrase can be assigned to at most one entity. This constraint ensures that once a word is part of an entity annotation, it cannot be part of another.
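To make the non-overlap constraint concrete, one possible resolution strategy (a greedy sketch, not necessarily what cdner does) is to keep entities in descending similarity order and discard any whose span collides with one already kept:

def resolve_non_overlapping(entities):
    """Greedy sketch: keep the highest-similarity entities whose
    character spans do not collide with an already-kept entity."""
    kept = []
    for ent in sorted(entities, key=lambda e: -e['similarity']):
        if all(ent['end'] <= k['start'] or ent['start'] >= k['end']
               for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e['start'])

Applied to the Section 4 output, this would keep ‘movie titles’ (0.96), ‘1945’ (1.0), and ‘movie popularity’ (1.0), and drop the overlapping matches for the ‘movies’ table (0.91) and the ‘movie_release_year’ column (0.77).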

5.1 Example

Consider the following sentence and lexemes:

sentence = "Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity."
lexemes = [
    ('column', 'movie_release_year'),
    ('column', 'movie_popularity'),
    ('column', 'movie_title'),
    ('table', 'movies'),
    ('value', '1945')
]

5.2 Continuous Overlapping

Definition: Only continuous substrings in the sentence can be annotated as entities, and annotations are allowed to overlap (i.e., a word or phrase can be part of multiple annotations).

Sentence: “Name [{movie titles} released in year] [1945]. Sort the listing by the descending order of [movie popularity].”

  • [{movie titles} released in year] matches the ‘movie_release_year’ column (the continuous span found in Section 4, with similarity 0.77).
  • [movie titles] (the braced tokens) also matches the ‘movie_title’ column; with overlap allowed, these tokens belong to both annotations.
  • [1945] matches the ‘1945’ value.
  • [movie popularity] matches the ‘movie_popularity’ column.

5.3 Continuous Non-Overlapping

Definition: Only continuous substrings in the sentence can be annotated as entities, and annotations cannot overlap (i.e., each word or phrase can be part of at most one annotation).

Sentence: “Name [movie titles] [released in year] [1945]. Sort the listing by the descending order of [movie popularity].”

  • [movie titles] matches the ‘movie_title’ column.
  • [released in year] matches the ‘movie_release_year’ column.
  • [1945] matches the ‘1945’ value.
  • [movie popularity] matches the ‘movie_popularity’ column.
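Of the four variants, only continuous non-overlapping maps directly onto the BIO scheme from Section 3.4, since BIO can encode neither shared nor scattered tokens. Using the hypothetical to_bio helper sketched there, with character spans matching the bracketing above, the result would be roughly:

entities = [
    {'label_type': 'column', 'start': 5,  'end': 17},   # movie titles
    {'label_type': 'column', 'start': 18, 'end': 34},   # released in year
    {'label_type': 'value',  'start': 35, 'end': 39},   # 1945
    {'label_type': 'column', 'start': 85, 'end': 101},  # movie popularity
]
print(to_bio(sentence, entities))
# Name/O movie/B-COLUMN titles/I-COLUMN released/B-COLUMN in/I-COLUMN
# year/I-COLUMN 1945./B-VALUE Sort/O ... of/O movie/B-COLUMN popularity./I-COLUMN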

5.4 Non-Continuous Overlapping

Definition: Substrings can be non-continuous (i.e., words corresponding to an entity do not need to be adjacent), and annotations are allowed to overlap.

Sentence: “Name [{movie} titles] [released] in [year] [1945]. Sort the listing by the descending order of [{movie} popularity].”

  • {movie} matches the ‘movies’ table.
  • [{movie} titles] matches the ‘movie_title’ column.
  • {movie}, [released], and [year] together correspond to the ‘movie_release_year’ column, even though they are not adjacent in the sentence.
  • [1945] matches the ‘1945’ value.
  • [{movie} popularity] corresponds to the ‘movie_popularity’ column.

5.5 Non-Continuous Non-Overlapping

Definition: Substrings can be non-continuous, and annotations cannot overlap.

Sentence: “Name [movie] [titles] [released] in [year] [1945]. Sort the listing by the descending order of [movie] [popularity].”

  • The first [movie] matches either the ‘movies’ table or the ‘movie_title’ column; non-overlap forces a single choice.
  • [titles] matches the ‘movie_title’ column.
  • [released] matches the ‘movie_release_year’ column.
  • [year] matches the ‘movie_release_year’ column.
  • [1945] matches the ‘1945’ value.
  • [movie] and [popularity] together match the ‘movie_popularity’ column.