As a data scientist, I have found that most educational content about Artificial Intelligence is pretty theoretical and hard to grasp at first. As a result, newcomers to the field of AI might be frightened by all its complexity, especially when it comes to NLP. Still, I think that one can truly enjoy the beauty and the simplicity of building an NLP system from simple NLP blocks that are already built and well recognized in the community.
In this article, we will explore the vast world of Natural Language Processing (NLP) through “fill-in-the-blank trivia question generation”. More precisely, we will generate Multiple Choice Questions (MCQs) from texts. We will cover the extraction of data from Wikipedia, the transformation of data into sentences, and finally the NLP models used to achieve this task.
What is Natural Language Processing?
A natural language is a language that has developed naturally through use, in contrast with the formal languages used for computer code or logic. One can think of a natural language as a language spoken by humans.
Natural language processing emerged in the 1950s, notably with the famous Turing test, which proposed a criterion of intelligence for a machine. This test evaluates the machine on a task consisting of interpreting and generating natural language. The goal of NLP is to enable computers to deal with natural language data, whether it is spoken or written, and to respond with their own text or speech as a human would do. Currently, NLP has many fields of application, from medical research to recommendation engines. Typical tasks achieved by NLP tools are speech recognition, named entity recognition, sentiment analysis, and of course natural language generation.
What do fill-in-the-blank questions have to do with the Fundamentals of NLP?
NLP tools are more and more efficient when it comes to understanding the syntax as well as the semantics of a given language. My idea in this article is to use those abilities to test my general knowledge by making the machine generate and ask me fill-in-the-blank trivia questions. More precisely, in this article, I will be interested in testing my knowledge about the different countries of the world.
Extract data from Wikipedia with SPARQL
I chose to build a dataset consisting of country names and the abstract corresponding to each country. That is to say, for each country, I retrieve the abstract part of the Wikipedia page about that country. To do so, I used a SPARQL query on DBpedia.
SPARQL is a query language. More precisely, it is used to retrieve data that is stored under the Resource Description Framework (RDF) format. In our case, as the Wikipedia resources are structured under the RDF format in DBpedia, we are able to use SPARQL to retrieve them from the World Wide Web quite easily. For country data, the query that I used was the following:
This query retrieves all country URIs (line 3) and, for each URI, it looks for the country name (line 4) and the country abstract (line 5). Lines 7 and 8 restrict the results to English.
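As an illustration, here is a minimal sketch of what such a query and its execution could look like in Python. It is not necessarily the exact query used for this article: the `dbo:Country` class, the `rdfs:label` and `dbo:abstract` predicates, the endpoint URL, the `format` parameter and the helper names are all assumptions, and the line numbers mentioned above do not apply to this sketch.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia SPARQL endpoint (assumed to be reachable at this URL).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: the class and predicates below are assumptions.
COUNTRY_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?country ?name ?abstract
WHERE {
  ?country a dbo:Country .
  ?country rdfs:label ?name .
  ?country dbo:abstract ?abstract .
  FILTER (lang(?name) = "en")
  FILTER (lang(?abstract) = "en")
}
"""


def fetch_results(query, endpoint=DBPEDIA_ENDPOINT):
    """Send the query over HTTP and return the decoded JSON response."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{endpoint}?{params}") as response:
        return json.load(response)


def parse_countries(results):
    """Turn SPARQL JSON results into a list of {uri, name, abstract} dicts."""
    return [
        {
            "uri": row["country"]["value"],
            "name": row["name"]["value"],
            "abstract": row["abstract"]["value"],
        }
        for row in results["results"]["bindings"]
    ]
```

Calling `parse_countries(fetch_results(COUNTRY_QUERY))` would then give one dictionary per country, provided the endpoint is available.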
Transform data into sentences
Now that we have the abstract of each country, we would like to separate them into sentences in which we will look for keywords. We have several options to achieve this task.
a. Regular expressions
One can use regular expressions to look for patterns in the text and split it accordingly. For instance, with a “starts with a capital letter and ends with a punctuation mark” pattern, we might expect to split a given abstract into sentences, and it does work for a lot of texts.
Such a regular expression looks for the following pattern: any uppercase letter, followed by anything but “.”, “!” or “?” (or even nothing), then ending with “.”, “!” or “?”.
It will split the text “Hello World! I enjoy learning about the Fundamentals of NLP. Do you like it too?” into the following 3 strings: “Hello World!”, “I enjoy learning about the Fundamentals of NLP.” and “Do you like it too?”. However, the text “I live in the U.S.A.” will be split into “I live in the U.”, “S.” and “A.”, which is not what we want. One could argue that we could add other rules to the regex in order to handle such edge cases, but it is hard to cover all of them exhaustively, and the regex would quickly become complex and tedious to write.
Does it mean that we cannot split text into sentences easily? No, we can use an NLP technique called sentencizing to do the job! The spaCy library allows us to use a pre-trained sentencizer that relies on dependency parsing. In NLP, a dependency parser aims at finding how words relate to each other by analyzing the grammatical structure of the text they are in. To do so, the model first tokenizes the text, i.e. it separates the text into smaller units called tokens. Then it looks for binary relationships between said tokens. For instance, the model will find a “direct object” link describing that “computers” is the direct object of “like” in “I like computers”.
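The regex approach and its limits can be reproduced with a few lines of Python. This is a sketch: the expression is reconstructed from the description above and is therefore an assumption, as are the names `SENTENCE_PATTERN` and `split_sentences`.

```python
import re

# An uppercase letter, then any run of characters other than ".", "!" or "?"
# (possibly empty), then one of ".", "!" or "?".
SENTENCE_PATTERN = re.compile(r"[A-Z][^.!?]*[.!?]")


def split_sentences(text):
    """Return every substring of the text that matches the sentence pattern."""
    return SENTENCE_PATTERN.findall(text)


if __name__ == "__main__":
    print(split_sentences("Hello World! I enjoy learning about the Fundamentals of NLP. Do you like it too?"))
    print(split_sentences("I live in the U.S.A."))
```

Because every “.” closes a match, the abbreviation “U.S.A.” is cut into three pieces, which is exactly the edge case described above.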
Moreover, it can also classify each token of the text to indicate whether or not it is a sentence starting token which is what we use in a sentencizer to detect sentences.
Implementing a sentencizer is quite straightforward in Python, as libraries like spaCy already put pre-trained models at our disposal. The following code will instantiate a sentencizer from the spaCy library. This model, contrary to the regex method, was able to accurately sentencize the “U.S.A.” example!
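Here is a minimal sketch of how this could look with spaCy. It assumes that spaCy and the `en_core_web_sm` model are installed; the helper names (`load_pipeline`, `split_into_sentences`, `direct_objects`) are mine and not part of spaCy. The sketch relies on `doc.sents`, which exposes the sentence boundaries set by the pipeline, and on `token.dep_` and `token.head`, which expose the dependency parse.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a pre-trained spaCy pipeline (spaCy and the model must be installed)."""
    import spacy  # imported lazily so the helpers below do not require spaCy

    return spacy.load(model_name)


def split_into_sentences(doc):
    """Return the sentences found by the pipeline as plain strings."""
    return [sentence.text.strip() for sentence in doc.sents]


def direct_objects(doc):
    """Return (head, object) pairs for every 'dobj' dependency in the doc."""
    return [(token.head.text, token.text) for token in doc if token.dep_ == "dobj"]


def main():
    nlp = load_pipeline()
    doc = nlp("I live in the U.S.A. I like computers.")
    print(split_into_sentences(doc))  # sentences detected by the parser
    print(direct_objects(doc))        # direct-object links found by the parser


if __name__ == "__main__":
    main()
```

The exact output depends on the model version, so it is worth printing the results on a few abstracts before relying on them.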
Load sentences in NLP models
Now that we have sentences, we can manipulate them to create fill-in-the-blank trivia MCQs. To do so, we will need to detect specific words of importance in the sentence and generate similar words to create wrong answers, called distractors, for our MCQ. For instance, let’s take the sentence “France’s capital city is Paris”. Our system would need to detect that the word “Paris” is important and generate distractors like “Marseille”, “Lyon” and “Nantes” so that the following MCQ is generated:
Fill in the blank: France’s capital city is <blank>.
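To make the target concrete, here is a small sketch of how a sentence, its answer and its distractors could be assembled into such an MCQ. The function names and the dictionary layout are illustrative assumptions, not the code of the final application.

```python
import random

BLANK = "<blank>"


def build_mcq(sentence, answer, distractors, rng=None):
    """Blank out the answer in the sentence and shuffle it among the distractors."""
    if answer not in sentence:
        raise ValueError("the answer must appear in the sentence")
    rng = rng or random.Random()
    question = sentence.replace(answer, BLANK, 1)  # hide the first occurrence only
    options = [answer, *distractors]
    rng.shuffle(options)
    return {
        "question": f"Fill in the blank: {question}",
        "options": options,
        "answer_index": options.index(answer),
    }


def format_mcq(mcq):
    """Render the question followed by one lettered line per option."""
    lines = [mcq["question"]]
    for letter, option in zip("abcdefgh", mcq["options"]):
        lines.append(f"{letter}) {option}")
    return "\n".join(lines)


if __name__ == "__main__":
    mcq = build_mcq(
        "France's capital city is Paris.", "Paris", ["Marseille", "Lyon", "Nantes"]
    )
    print(format_mcq(mcq))
```

Shuffling the options matters: if the right answer always came first, the quiz would not be much of a challenge.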
a. Keyword extraction
In my use case, I want to detect information about countries. One way to detect such information is to use the Named Entity Recognition (NER) technique to detect the names of countries and important people in the abstract.
To understand which concepts lie behind a NER system, one should first understand what an embedding is. In NLP, and more generally in Machine Learning & Deep Learning, an embedding is the representation of an input in a high-dimensional vector space. A word like “apple” could be seen by the machine as a vector of 300 numbers between 0 and 1, for instance. This process of transforming data into vectors is very complex, but doing it well helps obtain good results when classifying said vectors into specific classes (in our case, entities) in the end.
Similarly to the sentencizer, spaCy also provides pre-trained NER models that allow us to classify the tokens of a text into entities and to retrieve the tokens belonging to a given entity type. In the sentence “Napoleon Bonaparte fought in the battle of Waterloo”, “Napoleon Bonaparte” will be classified as a PERSON while “Waterloo” will be classified as a GPE (Geopolitical Entity).
The following code will instantiate a NER model from spaCy.
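A minimal sketch of such an instantiation, assuming spaCy and the `en_core_web_sm` model are installed, could be the following. The helper names and the choice to keep only `GPE` and `PERSON` entities are my assumptions; `doc.ents`, `ent.text` and `ent.label_` are the spaCy attributes that expose the detected entities.

```python
def load_ner_pipeline(model_name="en_core_web_sm"):
    """Load a pre-trained spaCy pipeline that includes a NER component."""
    # The model is typically installed beforehand with:
    #   python -m spacy download en_core_web_sm
    import spacy  # imported lazily so extract_entities does not require spaCy

    return spacy.load(model_name)


def extract_entities(doc, labels=("GPE", "PERSON")):
    """Return (text, label) pairs for the entities whose label is of interest."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def main():
    nlp = load_ner_pipeline()
    doc = nlp("Napoleon Bonaparte fought in the battle of Waterloo.")
    for text, label in extract_entities(doc):
        print(text, label)


if __name__ == "__main__":
    main()
```

The labels actually predicted depend on the model and its version, so a small model may occasionally tag an entity differently than expected.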
The “en_core_web_sm” parameter specifies which of the available spaCy language models should be loaded. One can easily download this spaCy model with this line of code:
python -m spacy
Of course, to deal with larger and more complex problems, there are bigger models that could be a better fit. We can mention transformer models for instance.
In the event that one would want to train their own NER model, I invite them to read this article about NER to get started with training using the NLTK Python library.
b. Synonyms, Similarity & Embeddings
Now that we know how to detect specific words in a sentence, we only have to create alternative answers for our MCQ, that is to say: distractors! To generate distractors, a good strategy is to find words similar to the answer. For instance, if we detect the word “France”, our model could propose “Quebec”, “Germany” or even “Switzerland” as other answers alongside the right one.
To find words similar to the original one, the most common NLP technique is to compute the embedding of the word and search for nearby vectors in the embedding space. This can be done using clustering or cosine similarity.
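The cosine similarity idea can be illustrated without any NLP library. In the sketch below, the 3-dimensional vectors are toy values invented for the example (real embeddings have hundreds of dimensions and are learned from data), and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings, made up for illustration only.
TOY_EMBEDDINGS = {
    "France": [0.9, 0.8, 0.1],
    "Germany": [0.85, 0.75, 0.2],
    "Switzerland": [0.8, 0.9, 0.15],
    "banana": [0.1, 0.2, 0.95],
}


def most_similar(word, embeddings, n=2):
    """Return the n words whose vectors are closest to the given word's vector."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    print(most_similar("France", TOY_EMBEDDINGS))
```

With these toy values, the two country vectors point in almost the same direction as “France”, while “banana” points elsewhere and is ranked last.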
In this article, the similar words come from sense2vec, a library from the makers of spaCy that provides pre-trained word vectors. To instantiate a downloaded sense2vec model, the following code should do the job.
sense2vec provides two pre-trained models, listed on its PyPI page under the “Pretrained vectors” section. The one used in this article is the smallest one, the “s2v_reddit_2015_md” model.
Moreover, the model used in this article is also able to detect the type of the words it deals with, whether it be a NOUN, a VERB, a PROPN, etc. Therefore, we can use this classification to detect the class of our answer and force the generated distractors to have the same class, so that if the answer is a proper noun, all the distractors are too.
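As a sketch, loading the vectors and querying them could look like this. It assumes the sense2vec package is installed and that the pre-trained vectors have been downloaded and unpacked locally; the path in `main` is a placeholder and the helper names are mine. `from_disk`, `get_best_sense`, `most_similar` and `split_key` are the sense2vec calls the sketch relies on.

```python
def load_sense2vec(path):
    """Load pre-trained sense2vec vectors from an unpacked model directory."""
    from sense2vec import Sense2Vec  # imported lazily; requires the sense2vec package

    return Sense2Vec().from_disk(path)


def similar_words(s2v, word, n=5):
    """Return up to n words close to the given word in the embedding space."""
    key = s2v.get_best_sense(word)  # e.g. a "word|SENSE" key, or None if unknown
    if key is None:
        return []
    return [s2v.split_key(similar_key)[0] for similar_key, _score in s2v.most_similar(key, n=n)]


def main():
    # Placeholder path: point it to wherever the vectors were unpacked.
    s2v = load_sense2vec("path/to/s2v_reddit_2015_md")
    print(similar_words(s2v, "France"))


if __name__ == "__main__":
    main()
```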
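The same-class constraint can be sketched as a filter over the most similar keys. This is an illustration under assumptions: `s2v` is expected to be a loaded sense2vec object, `generate_distractors` and its `pool` parameter are hypothetical names, and the actual application may apply the constraint differently.

```python
def generate_distractors(s2v, answer, n=3, pool=20):
    """Pick up to n similar words that share the answer's sense (NOUN, GPE, ...)."""
    key = s2v.get_best_sense(answer)
    if key is None:  # the answer is unknown to the model
        return []
    _, answer_sense = s2v.split_key(key)

    distractors = []
    # Look at a larger pool of neighbours, since some of them will be filtered out.
    for candidate_key, _score in s2v.most_similar(key, n=pool):
        word, sense = s2v.split_key(candidate_key)
        word = word.replace("_", " ")
        if sense != answer_sense:
            continue  # different class than the answer
        if word.lower() == answer.lower() or word in distractors:
            continue  # same word as the answer, or already selected
        distractors.append(word)
        if len(distractors) == n:
            break
    return distractors
```

Asking for a pool larger than `n` is a simple way to still end up with enough distractors after filtering.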
Question Generation (assembling our NLP tools)
Now that we have all the bricks necessary to make our question generator, we can make a little Streamlit application so that we can play with all of this. In this repo I made, one will be able to run a Streamlit application that will implement a fill-in-the-blank MCQ generator about countries in the world and test their knowledge!
This repository implements the question generation as described above. Still, if one looks more closely at the code, they will notice that some custom rules have been added, such as “The answer should not be a substring of the country name”. Such rules aim at making MCQs less obvious and more challenging.
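Such a rule can be expressed as a small filter applied to the candidate answers. The sketch below is my own reading of the rule quoted above, with hypothetical function names; the repository may implement it differently.

```python
def is_valid_answer(answer, country):
    """Reject empty answers and answers contained in the country name."""
    normalized_answer = answer.strip().lower()
    normalized_country = country.strip().lower()
    return bool(normalized_answer) and normalized_answer not in normalized_country


def filter_answers(candidates, country):
    """Keep only the candidate answers that pass the substring rule."""
    return [candidate for candidate in candidates if is_valid_answer(candidate, country)]


if __name__ == "__main__":
    print(filter_answers(["Paris", "France", "Emmanuel Macron"], "France"))
```

Without this rule, a question about France could have “France” itself as its blank, which the surrounding sentence would often give away.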
Conclusion to Fundamentals of NLP
This project allowed me to learn about the Fundamentals of NLP by introducing me to several tools such as tokenisers, sentencizers, NER models, as well as embeddings and similarity computation. Even if NLP may seem frightening at first, I hope this article makes the subject a little more accessible and comprehensible. To continue with more NLP-related tasks, I recommend this article about text summarisation.
To keep growing and learning, I greatly recommend Sicara’s blog as a resource to learn about other techniques whether they be about NLP, computer vision, learning methods, cloud computing or even good code craftsmanship. For any questions about Sicara’s projects, please contact us here.
Fundamentals of NLP: References
Here are some links and papers that give more insights into how models work.
- spaCy documentation on Dependency Parser
- Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.