README.md
+24-15
Original file line number
Diff line number
Diff line change
@@ -3,7 +3,7 @@ Text-analysis support for *Django* clients, talking through HTTP API to an exten
3
3
4
4
This is a **Django** app implementing a repertoire of **Text Analysis** functions with general objectives of linguistic education, to be used in the context of both L1 and L2, by learners and teachers and by editors of text resources.
5
5
6
-
*commons-textanalysis* currently supports, quite in similar way, **8 European languages**: English, Italian, Spanish, Greek, French, Portuguese, Croatian and Lithuanian.
6
+
*commons-textanalysis* currently supports, in a quite similar way, **8 European languages**: English, Italian, Spanish, Greek, French, Croatian, Danish and Lithuanian.
7
7
Since this largely depends on the availability of *spaCy* (statistical) *language models* and on other *language resources* needed by different text analysis methods, support for additional languages is expected to be added as those resources become available. This, in turn, will depend on the interest shown by users and contributors.
8
8
9
9
***Origin***
@@ -27,24 +27,33 @@ At present, CommonSpaces hosts also a few *mini-sites* dedicated to the communit
27
27
28
28
It also includes many **language resources**, mostly concerning specific languages, which are available as *open data*. The role of these resources is to make the analysis methods, and often even the algorithms, able to work in a very similar way for different languages.
29
29
30
-
**Functionality**
30
+
***Functionality***
31
31
32
32
Currently the following output views are implemented:
33
-
1. <u>Keywords in Context</u> thanks to the exploitation of a function of **tmtoolkit** in *commons-language*;
34
-
2. <u>ord lists by POS</u>; sorted lists are produced based on lexical resources concerning word frequencies and/or *CEFR* vocabulary levels;
35
-
3. <u>Annotated text<u>, interactively showing individual attributes of the text *tokens* and the result of *Named Entity Recognition* (NER); at present it reuses some code of **NlpBuddy**;
36
-
4. <u>Noun chunks</u> this comes directly from *spaCy*;
37
-
5. <u>Text readability</u> this is a provisional view putting together some raw (shallow) text features - mainly counts and means -, lexical features and syntactic features, with the results of classical *readability formulas* also based on raw text features;
38
-
6. <u>Text Cohesion</u> this view puts together *text coherence* scores computed with the *entity graph method* (Guinaudeau and Strube), as implemented by **TRUNAJOD**, with *local cohesion* scores based on the lexicon shared among contiguous paragraphs (visual detail is provided) and on *similarity scores* coming directly from spaCy;
39
-
7. <u>Text Summarization</u> this is the result of a very simple extractive algorithm;
40
-
8. <u>Text Analysis Dashboard</u> this is a tentative view putting together some results from 2, 3, 5 and 7; it also includes a sophisticated visualisation of the text structure derived with *dependency parsing*.
41
-
42
-
**Plans**
43
-
44
-
This package is work in progress; main activities planned are:
33
+
1. *Keywords in Context*; this relies on a function of **tmtoolkit** used in *commons-language*;
34
+
2. *Word lists by POS*; sorted lists are produced based on lexical resources concerning word frequencies and/or *CEFR* vocabulary levels;
35
+
3. *Annotated text*; interactively shows individual attributes of the text *tokens* and the result of *Named Entity Recognition* (NER); at present it reuses some code of **NlpBuddy**;
36
+
4. *Noun chunks*; this comes directly from *spaCy*;
37
+
5. *Text readability*; this is a provisional view that combines some raw (shallow) text features (mainly counts and means), lexical features and syntactic features with the results of classical *readability formulas*, which are also based on raw text features;
38
+
6. *Text Cohesion*; this view puts together *text coherence* scores computed with the *entity graph method* (Guinaudeau and Strube), as implemented by **TRUNAJOD**, with *local cohesion* scores based on the lexicon shared among contiguous paragraphs (visual detail is provided) and on *similarity scores* coming directly from spaCy;
39
+
7. *Text Summarization*; this is the result of a very simple extractive algorithm;
40
+
8. *Text Analysis Dashboard*; this is a tentative view putting together some results from 2, 3, 5 and 7; it also includes a sophisticated visualisation of the text structure derived with *dependency parsing*.
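
To give an idea of what the first view computes, here is a minimal, plain-Python sketch of a *Keywords in Context* listing. It is only an illustration: the package itself relies on **tmtoolkit** through *commons-language*, and the function name `keywords_in_context`, its parameters and the regex-based tokenization are assumptions made for this example.

```python
import re


def keywords_in_context(text, keyword, window=3):
    """Return (left context, match, right context) for each occurrence of keyword.

    Hypothetical helper for illustration; not part of commons-textanalysis.
    """
    # Naive tokenization: runs of word characters (an assumption of this sketch).
    tokens = re.findall(r"\w+", text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


sample = "The cat sat on the mat. A second cat slept under the table."
for left, match, right in keywords_in_context(sample, "cat", window=2):
    # Right-align the left context so the matches line up in a column.
    print(f"{left:>15} [{match}] {right}")
```

A real implementation would use the tokens produced by the language model rather than a regular expression, so that the same code behaves consistently across the supported languages.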
41
+
42
+
***Interfaces***
43
+
44
+
There are 3 levels of interfaces in *commons-textanalysis*, corresponding to the components of its architecture:
45
+
- an **upward** interface utilizing the generic HTTP API exposed by the *commons-language service*;
46
+
- the **interactive** (user) interface for selecting a text analysis function and executing it on the text inserted in the *input box* (possibly by *copy-and-paste*), or specified through a URL;
47
+
- a **downward** interface exposing an application-level API through a list of *url patterns*, for the convenience of other applications, such as the collaborative learning platform *CommonSpaces*.
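
As a rough illustration of the upward interface, the sketch below builds (without sending) a JSON request of the kind a client could address to the *commons-language service*. The service address, the endpoint path and the payload fields are hypothetical, since the API is not documented here; only the standard library is used.

```python
import json
from urllib.request import Request

# Hypothetical values: the real service address and endpoint are not given in this README.
LANGUAGE_SERVICE_URL = "http://localhost:8001"
ANALYZE_ENDPOINT = "/api/analyze/"


def build_analysis_request(text, language="en"):
    """Prepare a POST request carrying the text to analyze as a JSON body."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return Request(
        LANGUAGE_SERVICE_URL + ANALYZE_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analysis_request("A short sample text.")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would actually send it; the sketch stops earlier so that it runs without a live service.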
48
+
49
+
Moreover, *commons-textanalysis* acts as a *pass-through* for a set of functions provided by *commons-language*, aimed at building and exploiting **corpora** of texts; here the term *corpus* is strictly related to the *DocBin* object type in *spaCy*, which "lets you efficiently serialize the information from a collection of *Doc* objects". Currently, this functionality isn't available interactively through *commons-textanalysis*; it is used only by *CommonSpaces*.
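
For readers unfamiliar with *DocBin*, the following minimal sketch shows the serialization round trip it offers in *spaCy*. It says nothing about how *commons-language* actually organizes its corpora: the sample texts are invented, and a blank English pipeline is assumed so that no language model has to be downloaded.

```python
import spacy
from spacy.tokens import DocBin

# Blank pipeline: tokenizer only, so no trained model is required (assumption of this sketch).
nlp = spacy.blank("en")
texts = ["The first short text.", "A second, slightly longer text."]

# Collect the Doc objects of the "corpus" into a single DocBin.
doc_bin = DocBin()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

# Serialize to bytes, e.g. for storage or transfer between services.
data = doc_bin.to_bytes()

# Restore the Doc objects; the vocabulary of a compatible pipeline is needed.
restored = DocBin().from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
print(len(docs), [doc.text for doc in docs])
```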
50
+
51
+
***Plans***
52
+
53
+
This package is a *work in progress*; the main planned activities are:
45
54
- complete the restructuring of the software stack, in order to make *commons-textanalysis* completely independent from the software of the *Commons Platform*, of which it was originally part;
46
55
- document the API;
47
-
- retrieve language resources allowing to enable additional languages;
56
+
- retrieve and adapt the language resources needed to enable additional languages;
48
57
- improve and extend the current functionality;
49
58
- reorganize the output views to improve their usability;
50
59
- clean up the code, also to make contributions easier.