Commit 37b667a (1 parent: b40416c): 53 changed files with 11,420 additions and 6 deletions.
@@ -0,0 +1,2 @@
*.p
*.pyc
@@ -1,10 +1,88 @@
# SOQAL: Neural Arabic Question Answering
This repository includes the code and datasets described in our WANLP 2019 paper Neural Arabic Question Answering by Hussein Mozannar, Karl El Hajal, Elie Maamary and Hazem Hajj.

Quick Links:
* [Datasets](data/README.md)
* [BERT](bert/README.md)
* [Document Retrievers](retriever/README.md)
* [Getting Arabic Wikipedia](arwiki/README.md)
* [Tools for Creating our datasets](dataset_creation/README.md)

## Arabic Open Domain Question Answering

This work builds a system for open domain factual Arabic question answering (QA) using Wikipedia as our knowledge source. This constrains the answer of any question to be a span of text in Wikipedia. However, it also enables us to use neural reading comprehension models for our end goal.

Open domain QA for Arabic entails three challenges: annotated QA datasets in Arabic, large-scale efficient information retrieval, and machine reading comprehension. To deal with the lack of Arabic QA datasets we present the Arabic Reading Comprehension Dataset (ARCD), composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD) containing 48,344 questions.

In the data folder you can find the Arabic Reading Comprehension Dataset (arcd.json): a crowdsourced Arabic reading comprehension dataset composed of paragraphs and accompanying questions and answers. You will additionally find Arabic-SQuAD (arabic-SQuAD.json): a machine translation (Google Translate API) of half of the Stanford Question Answering Dataset.

Our system for open domain question answering in Arabic (SOQAL) is based on three components: (1) a document retriever using a hierarchical TF-IDF approach, (2) a neural reading comprehension model using the pre-trained bi-directional transformer BERT, and finally (3) a linear answer ranking module to obtain the final answer.
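
To make the three-stage pipeline concrete, here is a minimal sketch of how the components could fit together. This is an illustration only: the `retriever`, `reader`, and `ranker` objects, their methods, and the ranking weights are hypothetical stand-ins rather than SOQAL's actual API.

```python
# Illustrative sketch only: the retriever, reader, and ranker objects and their
# methods are hypothetical stand-ins, not SOQAL's actual classes.

def answer_question(question, retriever, reader, ranker, top_k=10):
    # (1) Hierarchical TF-IDF retrieval: narrow Arabic Wikipedia down to a few documents
    docs = retriever.retrieve(question, k=top_k)

    # (2) BERT reading comprehension: extract a candidate answer span from each document
    candidates = []
    for doc in docs:
        span, reader_score = reader.predict(question, doc.text)
        candidates.append((span, reader_score, doc.retrieval_score))

    # (3) Linear answer ranking: combine retriever and reader scores and keep the best span
    best = max(candidates, key=lambda c: ranker.w_reader * c[1] + ranker.w_retriever * c[2])
    return best[0]
```
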
Credit: This work draws inspiration from [DrQA](https://github.com/facebookresearch/DrQA).

## Platform
Tested with Python 3.6 on Windows 8 and 10.

## Installing SOQAL
Create a new virtual environment (install virtualenv first if you don't already have it) and activate it (on Linux, use `source venv/bin/activate` instead of the second command):
```shell
virtualenv venv
venv\Scripts\activate
```
You are now inside the virtual environment you created; everything below will be installed into it.

Run the following commands to clone the repository and install SOQAL:
```shell
git clone https://github.com/husseinmozannar/SOQAL.git
cd SOQAL
pip install -r requirements.txt
```

## Demo
(We will soon provide trained models; for now this relies on you training BERT and building the retriever yourself.)

To interactively ask Arabic open-domain questions to SOQAL, follow the instructions below:

```shell
python demo_open.py ^
  -c bert/multilingual_L-12_H-768_A-12/bert_config.json ^
  -v bert/multilingual_L-12_H-768_A-12/vocab.txt ^
  -o bert/runs/ ^
  -r retriever/tfidfretriever.p
```

Then in your browser go to:
```
localhost:9999
```
## Citation

(pending ACL release)

Please cite our paper if you use our datasets or code:

```
@inproceedings{mozannar2019soqal,
  title={Neural Arabic Question Answering},
  author={Mozannar, Hussein and El Hajal, Karl and Maamary, Elie and Hajj, Hazem},
  booktitle={Association for Computational Linguistics (ACL)},
  year={2019}
}
```
@@ -0,0 +1,34 @@
## Obtaining Wikipedia as a Python dictionary

We adapt the Wikipedia extractor available at https://github.com/attardi/wikiextractor (all code is available in the arwiki folder).
From the Wikipedia dump we build a Python dictionary so that articles can be accessed as:

```
wikipedia['لبنان'] = ["لبنان أو (رسمياً: الجمهوريّة اللبنانيّة)، هي دولة عربية واقعة في الشرق الأوسط في غرب القارة الآسيوية.", ... ]
```
**Steps**:
All scripts here are located in the **arwiki** folder.

* First, download the Wikipedia dump available at https://dumps.wikimedia.org/arwiki/20190520/arwiki-20190520-pages-articles-multistream.xml.bz2 and decompress it to .xml (you can also use older versions).
* Create a temporary **empty** folder, say its location is TEMP_DIRECTORY, and use arwiki/wikiextractor.py to do a first-pass extraction of the dump into your TEMP_DIRECTORY (if you use Linux, write '\' instead of '^' for line continuations):

**Note:** This command will create a number of folders in your TEMP_DIRECTORY named AA, AB, ... and will take up to 10 minutes (there are 660k articles in total).
```shell
python WikiExtractor.py ^
  arwiki-20190520-pages-articles-multistream.xml ^
  --processes 16 ^
  -o TEMP_DIRECTORY ^
  --no-templates ^
  --json
```
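
As a quick sanity check you can peek at what WikiExtractor produced. The sketch below assumes the default layout of the `--json` output: folders AA, AB, ... containing files named wiki_00, wiki_01, ..., each holding one JSON object per line with at least a "title" and a "text" field per article.

```python
import json
import os

# Peek at one of WikiExtractor's output files (the path and field names assume
# the default --json layout; adjust if your folders look different).
sample_path = os.path.join("TEMP_DIRECTORY", "AA", "wiki_00")
with open(sample_path, encoding="utf-8") as f:
    first_article = json.loads(f.readline())

print(first_article["title"])       # article title
print(first_article["text"][:200])  # first 200 characters of the article body
```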

* Now, using the output of WikiExtractor, we will build a Python dictionary of Arabic Wikipedia and save it in pickle form (if you are not familiar with pickle, check https://wiki.python.org/moin/UsingPickle; we will use it extensively here). Pick an OUTPUT_DIRECTORY and run:

```shell
python arwiki_to_dict.py ^
  -i TEMP_DIRECTORY ^
  -o OUTPUT_DIRECTORY
```
This command will create a file called arwiki.p of size 1.2 GB in your output directory; this is your pickled Wikipedia.
* You can now safely delete your TEMP_DIRECTORY.
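
If you are new to pickle, loading arwiki.p back takes only a couple of lines. A minimal sketch, assuming the file sits in your OUTPUT_DIRECTORY and, as described above, maps each article title to the list of its paragraphs:

```python
import pickle

# Load the pickled Arabic Wikipedia dictionary produced by arwiki_to_dict.py
# (adjust the path to wherever your OUTPUT_DIRECTORY is).
with open("OUTPUT_DIRECTORY/arwiki.p", "rb") as f:
    wikipedia = pickle.load(f)

print(len(wikipedia))         # number of articles (roughly 660k)
print(wikipedia['لبنان'][0])   # first paragraph of the article on Lebanon
```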