first system commit
husseinmozannar committed Jun 11, 2019
1 parent b40416c commit 37b667a
Showing 53 changed files with 11,420 additions and 6 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.p
*.pyc
90 changes: 84 additions & 6 deletions README.md
@@ -1,10 +1,88 @@
# SOQAL: Neural Arabic Question Answering
This repository includes the code and datasets described in our WANLP 2019 paper, Neural Arabic Question Answering, by Hussein Mozannar, Karl El Hajal, Elie Maamary, and Hazem Hajj.

Quick Links:
* [Datasets](data/README.md)
* [BERT](bert/README.md)
* [Document Retrievers](retriever/README.md)
* [Getting Arabic Wikipedia](arwiki/README.md)
* [Tools for Creating our datasets](dataset_creation/README.md)
## Arabic Open Domain Question Answering
![](system_fig.jpg)
This work builds a system for open-domain factual Arabic question answering (QA) using Wikipedia as our knowledge source. This constrains the answer to any question to be a span of text in Wikipedia, but it also lets us use neural reading comprehension models for our end goal.

Open-domain QA for Arabic entails three challenges: the lack of annotated QA datasets in Arabic, large-scale efficient information retrieval, and machine reading comprehension. To deal with the lack of Arabic QA datasets we present the Arabic Reading Comprehension Dataset (ARCD; data/arcd.json), composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of half of the Stanford Question Answering Dataset (Arabic-SQuAD; data/arabic-SQuAD.json), containing 48,344 questions.
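
Both files ship as JSON. Below is a minimal sketch of reading them, assuming they follow the SQuAD v1.1 schema (`data` → `paragraphs` → `qas`); the relative path is illustrative:

```python
# Minimal sketch: iterate over a SQuAD-style JSON file such as data/arcd.json.
# Assumes the SQuAD v1.1 schema: data -> paragraphs -> qas -> answers.
import json

with open("data/arcd.json", encoding="utf-8") as f:
    dataset = json.load(f)

num_questions = 0
for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            num_questions += 1
            question = qa["question"]          # Arabic question text
            answer = qa["answers"][0]["text"]  # answer span taken from the paragraph

print(num_questions, "questions loaded")
```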

Our system for open-domain question answering in Arabic (SOQAL) is based on three components: (1) a document retriever using a hierarchical TF-IDF approach, (2) a neural reading comprehension model using the pre-trained bidirectional transformer BERT, and (3) a linear answer-ranking module that combines the two to obtain the final answer.
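
At a high level the three components fit together as sketched below; the class and method names (`retriever.retrieve`, `reader.predict`) and the mixing weight `alpha` are illustrative assumptions, not the actual API of this repository:

```python
# Conceptual sketch of the SOQAL pipeline; names and signatures are illustrative.
def answer_question(question, retriever, reader, alpha=0.5, top_k=5):
    # (1) hierarchical TF-IDF retrieval: fetch candidate paragraphs with scores
    paragraphs, retriever_scores = retriever.retrieve(question, top_k)

    candidates = []
    for paragraph, r_score in zip(paragraphs, retriever_scores):
        # (2) BERT reading comprehension: extract an answer span and its confidence
        span, reader_score = reader.predict(question, paragraph)
        # (3) linear answer ranking: mix retriever and reader confidences
        candidates.append((alpha * r_score + (1 - alpha) * reader_score, span))

    # return the span with the highest combined score
    return max(candidates)[1]
```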

Credit: This work draws inspiration from [DrQA](https://github.com/facebookresearch/DrQA).

## Platform
Tested with Python 3.6 on Windows 8 and 10.

## Installing SOQAL
Create a new virtual environment (install virtualenv first if you do not already have it) and activate it:
```shell
virtualenv venv
venv\Scripts\activate
```
You are now inside the virtual environment you created; all of SOQAL's dependencies will be installed there.


Run the following commands to clone the repository and install SOQAL:
```shell
git clone https://github.com/husseinmozannar/SOQAL.git
cd SOQAL
pip install -r requirements.txt
```


## Demo
(Trained models will be provided soon; for now this requires you to train BERT and build the retriever yourself.)

To interactively ask SOQAL open-domain questions in Arabic, follow the instructions below:

```shell
python demo_open.py ^
-c bert/multilingual_L-12_H-768_A-12/bert_config.json ^
-v bert/multilingual_L-12_H-768_A-12/vocab.txt ^
-o bert/runs/ ^
-r retriever/tfidfretriever.p
```

Then, in your browser, go to:
```
localhost:9999
```
## Citation

(pending ACL release)

Please cite our paper if you use our datasets or code:

```
@inproceedings{mozannar2019soqal,
title={Neural Arabic Question Answering},
author={Mozannar, Hussein and El Hajal, Karl and Maamary, Elie and Hajj, Hazem},
booktitle={Association for Computational Linguistics (ACL)},
year={2019}
}
```
34 changes: 34 additions & 0 deletions arwiki/README.md
@@ -0,0 +1,34 @@
## Obtaining Wikipedia as a Python dictionary

We adapt the Wikipedia extractor available at https://github.com/attardi/wikiextractor (all code is in the arwiki folder). Starting from a Wikipedia dump, we build a Python dictionary so that articles can be accessed as:

```
wikipedia['لبنان'] = ["لبنان أو (رسمياً: الجمهوريّة اللبنانيّة)، هي دولة عربية واقعة في الشرق الأوسط في غرب القارة الآسيوية.", ... ]
```
**Steps**:
All scripts here are located in the **arwiki** folder.

* First, download the Wikipedia dump available at https://dumps.wikimedia.org/arwiki/20190520/arwiki-20190520-pages-articles-multistream.xml.bz2 and decompress it to .xml (older dumps also work).
* Create a temporary **empty** folder; say its location is TEMP_DIRECTORY. Use arwiki/wikiextractor.py to do a first-pass extraction of the dump into your TEMP_DIRECTORY (on Linux, write '\' instead of '^' for line continuations):

**Note:** This command will create a number of folders in your TEMP_DIRECTORY named AA, AB, ... and can take up to 10 minutes (there are about 660k articles in total).
```shell
python WikiExtractor.py ^
 arwiki-20190520-pages-articles-multistream.xml ^
 --processes 16 ^
 -o TEMP_DIRECTORY ^
 --no-templates ^
 --json
```
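
For reference, with `--json` each extracted file under the AA, AB, ... folders (e.g. `AA/wiki_00`) contains one JSON object per line with `id`, `url`, `title`, and `text` fields. A short sketch to peek at the output (the path is a placeholder):

```python
# Sketch: inspect WikiExtractor's --json output (one JSON object per line).
import json

with open("TEMP_DIRECTORY/AA/wiki_00", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article["title"], len(article["text"]))
        break  # only the first article
```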

* Now, using the output of WikiExtractor, we build a Python dictionary of Arabic Wikipedia and save it in pickle form (if you are not familiar with pickle, see https://wiki.python.org/moin/UsingPickle; we use it extensively here). Pick an OUTPUT_DIRECTORY:

```shell
python arwiki_to_dict.py ^
-i TEMP_DIRECTORY ^
-o OUTPUT_DIRECTORY
```
This command will create a file called arwiki.p (about 1.2 GB) in your output directory; this is your pickled Wikipedia (see the loading sketch below).
* You can now safely delete your TEMP_DIRECTORY.
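
To sanity-check the result, the pickle can be loaded back and queried just like the dictionary example at the top of this page; the paths are placeholders:

```python
# Sketch: load the pickled Arabic Wikipedia and look up an article by title.
# Assumes arwiki.p maps article titles to lists of paragraph strings.
import pickle

with open("OUTPUT_DIRECTORY/arwiki.p", "rb") as f:
    wikipedia = pickle.load(f)

paragraphs = wikipedia["لبنان"]  # paragraphs of the article on Lebanon
print(paragraphs[0][:100])
```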