This document lists all the things I should look at more carefully in the future.
- If an element ID contains `_` then the sneaking part in `relation_matrix` won't work.
- Insert some sample files in input and output folders.
- Update the code to support the latest version of the NLTK ParentedTree implementation.
- The writers should use an xml library instead of writing strings to a file.
- Installation script.
- Implement an error measurement framework (in ManTIME class) to get statistics from the models.
- Implement a shuffle method and k-fold cross-validation for the data.
- Do I really need to load Stanford Core NLP every time for every document? Once the problem with long texts (dasmith/stanford-corenlp-python#13) is solved, I should switch to the new stanford-core-nlp.
- Unit-test the code with a proper testing framework (py.test).
- Comment the code better and more verbosely, using Google Commenting Style.
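
For the item about writers using an XML library, here is a minimal sketch with the standard library's `xml.etree.ElementTree`. The function names, the argument list and the tiny TimeML-like layout are assumptions made for illustration; they are not taken from the existing writers.

```python
import xml.etree.ElementTree as ET


def build_timex_document(text_before, timex_text, text_after,
                         tid="t1", timex_type="DATE"):
    """Build a small TimeML-like tree (hypothetical layout, illustration only)."""
    root = ET.Element("TimeML")
    body = ET.SubElement(root, "TEXT")
    body.text = text_before                # text preceding the annotation
    timex = ET.SubElement(body, "TIMEX3", {"tid": tid, "type": timex_type})
    timex.text = timex_text                # annotated expression
    timex.tail = text_after                # text following the annotation
    return root


def serialise(root):
    """Serialise the tree to a string; special characters are escaped by the library."""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    tree = build_timex_document("Profits rose & fell on ", "Monday", ".")
    print(serialise(tree))
```

The main gain over writing strings by hand is that characters such as `&` and `<` in the source text are escaped automatically, so the output stays well-formed.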
Done:
- Make the code general with respect to different annotation standards for CRF (IO, BIO, WIO, WBIO, WBIOE, BIOE).
- Can the same two objects be connected by two different types of temporal relations? No.
- Can an event be anchored to two different MAKEINSTANCE tags? Yes. (not supported yet.)
- Move the output folder up.
- Complete the InTextEntity class development.
- Implement features for the temporal relation extraction task looking at my notes from the literature.
- Implement the classifier for Temporal Links.
- Adapt the writers to output temporal links too.
- Implement the feature extractor for Temporal Links.
- Probably some variables in Document and Sentence objects can be deleted.
- What's `id_token` in Word class?
- Do we need EventInstance? Yes, we do.
- In the attribute training phase, the multi-word expressions should be represented as one sample. The features will be merged according to the order of appearance.
- Add useful folders (models, output, buffer) to the Git repo.
- Pickle the num2py arrays and remove the dependency.
- Activate the post processing pipeline.
- Fix the logging messages (info, warning, debug)
- The method search_subsequence is called many times. A more adequate ADT should be used.
- Implement an HTML (CSS3) writer (timesheet.js, TimelineJS).
- Make the features as light as possible (in terms of storage space).
- Show the progress as #_files_processed/#files.
- Convert the gazetteers to Unicode.
- Should the attribute data matrix be made of positive samples only?
- Correct some morphological gazetteer features according to English grammar. Are all the things called prepositions actually prepositions? (ask Marilena Di Bari)
- Implement the bufferisation at feature level.
- Fix unicode-related bug at utilities.py:76.
- Have a look at the argparse setup: it's not correct right now.
- Filter out useless features such as female gazetteers, male gazetteers, US cities. (commented)
- Look carefully at all the features and possibly cut them. (commented)
- Instead of the settings.py file, use environment variables (os.environ).
- Implement the i2b2 reader.
- Implement the i2b2 writer.
- Implement a caching system for Stanford Core NLP.
- Remove the output produced by CRF++ in the training phase.
- Integrate [Norma](https://github.com/filannim/timex-normaliser).
- Introduce model folders instead of files.
- Fix and connect the post-processing pipeline.
- Attributes models should include identification feature (heavier but hopefully better).
- Split identification models (TIMEXes and EVENTs).
- CRF based attributes extraction.
- There are some print statements somewhere (WARNING cases). I should use something more appropriate for them (log).
- Remove the output produced by Stanford Parser on stdout/stderr (if everything goes OK).
- Implement AttributeDataMatrix writer.
- Implement TempEval-3 writer.
- Implement TempEval-3 reader.
- Implement the classifier for events and timexes.
- Implement the universal feature extractor for events and timexes.
- Find documentation about how to comment the code so that nice Python-doc style web pages can be automatically generated.
- Love ManTIME and refactor it!
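
As a reference for the item on annotation standards for CRF, the sketch below shows how the two simplest schemes in that list, IO and BIO, label the same span. The function name `label_tokens` and the `(start, end, tag)` span format are hypothetical and are not taken from the ManTIME code; the remaining schemes (WIO, WBIO, WBIOE, BIOE) are deliberately left out.

```python
def label_tokens(n_tokens, spans, scheme="BIO"):
    """Return one label per token for the given annotation scheme.

    spans: list of (start, end, tag) tuples with `end` exclusive,
           assumed to be non-overlapping (hypothetical input format).
    scheme: "IO" or "BIO" only; other schemes are not covered here.
    """
    if scheme not in ("IO", "BIO"):
        raise ValueError("unsupported scheme: %s" % scheme)
    labels = ["O"] * n_tokens              # every token starts outside any span
    for start, end, tag in spans:
        for i in range(start, end):
            # BIO marks the first token of a span with B; IO uses I throughout
            prefix = "B" if (scheme == "BIO" and i == start) else "I"
            labels[i] = "%s-%s" % (prefix, tag)
    return labels


if __name__ == "__main__":
    tokens = ["Due", "next", "Monday", "at", "noon"]
    spans = [(1, 3, "TIMEX")]              # "next Monday"
    print(label_tokens(len(tokens), spans, "IO"))
    print(label_tokens(len(tokens), spans, "BIO"))
```

With IO, two adjacent spans of the same type cannot be told apart, which is the usual reason for preferring BIO or one of the richer schemes.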