first system commit
husseinmozannar committed Jun 11, 2019
1 parent b40416c commit 37b667a
Showing 53 changed files with 11,420 additions and 6 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.p
*.pyc
90 changes: 84 additions & 6 deletions README.md
@@ -1,10 +1,88 @@
# SOQAL: Neural Arabic Question Answering
This repository includes the code and datasets described in our WANLP 2019 paper, Neural Arabic Question Answering, by Hussein Mozannar, Karl El Hajal, Elie Maamary, and Hazem Hajj.

Quick Links:
* [Datasets](data/README.md)
* [BERT](bert/README.md)
* [Document Retrievers](retriever/README.md)
* [Getting Arabic Wikipedia](arwiki/README.md)
* [Tools for Creating our datasets](dataset_creation/README.md)
## Arabic Open Domain Question Answering
![](system_fig.jpg)
This work builds a system for open-domain factual Arabic question answering (QA) using Wikipedia as our knowledge source. This constrains the answer to any question to be a span of text in Wikipedia, but it also lets us use neural reading comprehension models for our end goal.

Open-domain QA for Arabic entails three challenges: the lack of annotated QA datasets in Arabic, large-scale efficient information retrieval, and machine reading comprehension. To deal with the lack of Arabic QA datasets we present the Arabic Reading Comprehension Dataset (ARCD; data/arcd.json), composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of half of the Stanford Question Answering Dataset (Arabic-SQuAD; data/arabic-SQuAD.json), containing 48,344 questions.
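
Both files ship as JSON. Below is a minimal sketch of reading them, assuming they follow the SQuAD v1.1 schema (`data` → `paragraphs` → `qas`); the relative path is illustrative:

```python
# Minimal sketch: iterate over a SQuAD-style JSON file such as data/arcd.json.
# Assumes the SQuAD v1.1 schema: data -> paragraphs -> qas -> answers.
import json

with open("data/arcd.json", encoding="utf-8") as f:
    dataset = json.load(f)

num_questions = 0
for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            num_questions += 1
            question = qa["question"]          # Arabic question text
            answer = qa["answers"][0]["text"]  # answer span taken from the paragraph

print(num_questions, "questions loaded")
```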

Our system for open-domain question answering in Arabic (SOQAL) is based on three components: (1) a document retriever using a hierarchical TF-IDF approach, (2) a neural reading comprehension model using the pre-trained bidirectional transformer BERT, and (3) a linear answer-ranking module that combines the two to obtain the final answer.
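
At a high level the three components fit together as sketched below; the class and method names (`retriever.retrieve`, `reader.predict`) and the mixing weight `alpha` are illustrative assumptions, not the actual API of this repository:

```python
# Conceptual sketch of the SOQAL pipeline; names and signatures are illustrative.
def answer_question(question, retriever, reader, alpha=0.5, top_k=5):
    # (1) hierarchical TF-IDF retrieval: fetch candidate paragraphs with scores
    paragraphs, retriever_scores = retriever.retrieve(question, top_k)

    candidates = []
    for paragraph, r_score in zip(paragraphs, retriever_scores):
        # (2) BERT reading comprehension: extract an answer span and its confidence
        span, reader_score = reader.predict(question, paragraph)
        # (3) linear answer ranking: mix retriever and reader confidences
        candidates.append((alpha * r_score + (1 - alpha) * reader_score, span))

    # return the span with the highest combined score
    return max(candidates)[1]
```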

Credit: This work draws inspiration from [DrQA](https://github.com/facebookresearch/DrQA).

## Platform
Tested with Python 3.6 on Windows 8 and 10.

## Installing SOQAL
Create a new virtual environment (install virtualenv first if you do not already have it) and activate it:
```shell
virtualenv venv
venv\Scripts\activate
```
You are now inside the virtual environment you created; all of SOQAL's dependencies will be installed there.


Run the following commands to clone the repository and install SOQAL:
```shell
git clone https://github.com/husseinmozannar/SOQAL.git
cd SOQAL
pip install -r requirements.txt
```


## Demo
(Trained models will be provided soon; for now this requires you to train BERT and build the retriever yourself.)

To interactively ask SOQAL open-domain questions in Arabic, follow the instructions below:

```shell
python demo_open.py ^
-c bert/multilingual_L-12_H-768_A-12/bert_config.json ^
-v bert/multilingual_L-12_H-768_A-12/vocab.txt ^
-o bert/runs/ ^
-r retriever/tfidfretriever.p
```

Then, in your browser, go to:
```
localhost:9999
```
## Citation

(pending ACL release)

Please cite our paper if you use our datasets or code:

```
@inproceedings{mozannar2019soqal,
title={Neural Arabic Question Answering},
author={Mozannar, Hussein and El Hajal, Karl and Maamary, Elie and Hajj, Hazem},
booktitle={Association for Computational Linguistics (ACL)},
year={2019}
}
```
34 changes: 34 additions & 0 deletions arwiki/README.md
@@ -0,0 +1,34 @@
## Obtaining Wikipedia as a Python dictionary

We adapt the Wikipedia extractor available at https://github.com/attardi/wikiextractor (all code is in the arwiki folder). Starting from a Wikipedia dump, we build a Python dictionary so that articles can be accessed as:

```
wikipedia['لبنان'] = ["لبنان أو (رسمياً: الجمهوريّة اللبنانيّة)، هي دولة عربية واقعة في الشرق الأوسط في غرب القارة الآسيوية.", ... ]
```
**Steps**:
All scripts here are located in the **arwiki** folder.

* First, download the Wikipedia dump available at https://dumps.wikimedia.org/arwiki/20190520/arwiki-20190520-pages-articles-multistream.xml.bz2 and decompress it to .xml (older dumps also work).
* Create a temporary **empty** folder; say its location is TEMP_DIRECTORY. Use arwiki/wikiextractor.py to do a first-pass extraction of the dump into your TEMP_DIRECTORY (on Linux, write '\' instead of '^' for line continuations):

**Note:** This command will create a number of folders in your TEMP_DIRECTORY named AA, AB, ... and can take up to 10 minutes (there are about 660k articles in total).
```shell
python WikiExtractor.py ^
 arwiki-20190520-pages-articles-multistream.xml ^
 --processes 16 ^
 -o TEMP_DIRECTORY ^
 --no-templates ^
 --json
```
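
For reference, with `--json` each extracted file under the AA, AB, ... folders (e.g. `AA/wiki_00`) contains one JSON object per line with `id`, `url`, `title`, and `text` fields. A short sketch to peek at the output (the path is a placeholder):

```python
# Sketch: inspect WikiExtractor's --json output (one JSON object per line).
import json

with open("TEMP_DIRECTORY/AA/wiki_00", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article["title"], len(article["text"]))
        break  # only the first article
```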

* Now, using the output of WikiExtractor, we build a Python dictionary of Arabic Wikipedia and save it in pickle form (if you are not familiar with pickle, see https://wiki.python.org/moin/UsingPickle; we use it extensively here). Pick an OUTPUT_DIRECTORY:

```shell
python arwiki_to_dict.py ^
-i TEMP_DIRECTORY ^
-o OUTPUT_DIRECTORY
```
This command will create a file called arwiki.p (about 1.2 GB) in your output directory; this is your pickled Wikipedia (see the loading sketch below).
* You can now safely delete your TEMP_DIRECTORY.
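
To sanity-check the result, the pickle can be loaded back and queried just like the dictionary example at the top of this page; the paths are placeholders:

```python
# Sketch: load the pickled Arabic Wikipedia and look up an article by title.
# Assumes arwiki.p maps article titles to lists of paragraph strings.
import pickle

with open("OUTPUT_DIRECTORY/arwiki.p", "rb") as f:
    wikipedia = pickle.load(f)

paragraphs = wikipedia["لبنان"]  # paragraphs of the article on Lebanon
print(paragraphs[0][:100])
```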