Wikipedia Q&A

Project Info

  • Location: Pittsburgh, PA
  • Start Date: September 2016
  • End Date: December 2016
  • Related: 11-411, Natural Language Processing

Project Features

  • An asking program which takes in a Wikipedia article text file and a non-negative integer n, and produces n questions based on the article.
  • An answering program which takes in a Wikipedia article text file and a text file of questions based on the article, and produces answers to those questions.
  • Completed in a team of 2.


Siri, Google Assistant, Cortana, Bixby, Alexa... What do they have in common? I'll give you a hint - no, they are not tiny magical creatures that live inside our electronics and answer to our every demand. :P Instead, they are smart assistants that are becoming more and more prevalent in today's smart devices, and they rely heavily on a field of computer science called Natural Language Processing (NLP). I've always been interested in how these smart assistants are able to help us the way they do, and with this project I was able to gain some insight and get a little taste of NLP.

I mainly worked on the answer-generation portion of the project. The tools and concepts I used include:

  • Python
  • NLTK
  • scikit-learn
  • tf-idf and cosine similarity
  • AMALGrAM 2.0 Supersense tagger
  • Nodebox English Linguistics library
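
To give a feel for how tf-idf and cosine similarity can help with answering, here is a minimal sketch that picks the article sentence most similar to a question. It is an illustration only, not the project's actual code: the function name best_matching_sentence, the sample sentences, and the choice to drop English stop words are all assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def best_matching_sentence(sentences, question):
    """Return (index, score) of the sentence most similar to the question.

    Hypothetical helper for illustration; not taken from the project.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    # Learn the vocabulary and idf weights from the article sentences,
    # then map the question into the same vector space.
    sentence_vectors = vectorizer.fit_transform(sentences)
    question_vector = vectorizer.transform([question])
    # One cosine similarity score per sentence.
    scores = cosine_similarity(question_vector, sentence_vectors)[0]
    best = int(scores.argmax())
    return best, float(scores[best])


# Made-up stand-in for sentences extracted from an article.
sentences = [
    "Pittsburgh is a city in the state of Pennsylvania.",
    "The city is known for its many bridges.",
    "Carnegie Mellon University was founded in 1900.",
]
index, score = best_matching_sentence(
    sentences, "When was Carnegie Mellon University founded?"
)
print(sentences[index], round(score, 3))
```

A retrieval step like this only narrows the search down to a candidate sentence; turning that sentence into a short answer would still need further processing.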

Content of the video report:

  • Question generation: 0:05
    • Sentence preprocessing: 0:11
    • Making questions: 1:11
    • Question scoring: 2:29
  • Question answering: 3:03
    • Source and question preprocessing: 3:10
    • Answer generation: 3:56