Linas Vepstas, December 2013. Updated May 2018 by Claudia Castillo.
Obsolete. This file describes the language-learning project as it existed in the 2015-2018 timeframe. It was enough to demonstrate the basic ideas. It is missing multiple important parts:
- It does not provide any way of learning grammatical classes (for clustering words into grammatically similar categories).
- It does not employ a calibrated learner. The quality of results depends strongly on the various parameters that control learning. Setting these blindly will give results, but of unknown quality.
- It uses the Postgres backend. The RocksDB backend is simpler to deploy and manage.
See the language learning wiki for a general overview of the project.
- Summary
- Setting up the AtomSpace
- Bulk Text Parsing
- Mutual Information of Word Pairs
- Minimum Spanning Tree Parsing
- Exploring Connector Sets
- Setting up a Docker Container
The goal of the project is to build a system that can learn parse
dictionaries for different languages, and possibly do some rudimentary
semantic extraction. The primary design point is that the learning is
to be done in an unsupervised fashion. A sketch of the theory that
enables this can be found in the paper "Language Learning", B. Goertzel
and L. Vepstas (2014) on ArXiv abs/1401.3372.
A shorter sketch is given below. Most of this README concerns the
practical details of configuring and operating the system, and some
diary-like notes about system configuration and operation. A diary of
scientific notes and results is in the nlp/learn/learn-lang-diary
directory.
The basic algorithmic steps, as implemented so far, are as follows:
- A) Ingest a lot of raw text, such as novels and narrative literature, and count the occurrence of nearby word-pairs.
- B) Compute the mutual information (mutual entropy) between the word-pairs.
- C) Use a Minimum-Spanning-Tree algorithm to obtain provisional parses of sentences. This requires ingesting a lot of raw text, again (independently of step A).
- D) Extract linkage disjuncts from the parses, and count their frequency.
- E) Use maximum-entropy principles to merge similar linkage disjuncts. This will also result in sets of similar words. Presumably, the sets will roughly correspond to nouns, verbs, adjectives, and so on.
Currently, the software implements steps A, B, C and D. Step E is a topic of current research; it's not entirely clear what the best merging algorithm might be, or how it will work.
Steps A-C are "well-known" in the academic literature, with results
reported by many researchers over the last two decades. The results
from Steps D & E are new, and have never been published before.
(Results from Step D can be found in the drafts/connector-sets.lyx file, inside the nlp/learn/learn-lang-diary directory; the PDF of it was posted to the mailing lists.)
All of the statistics gathering is done within the OpenCog AtomSpace, where counts and other statistical quantities are associated with various different hypergraphs. The contents of the atomspace are saved to an SQL (Postgres) server for storage. The system is fed with raw text using assorted ad-hoc scripts, which include link-grammar as a central component of the processing pipeline. Most of the data analysis is performed with an assortment of scheme scripts.
Thus, operating the system requires three basic steps:
- Setting up the atomspace with the SQL backing store,
- Setting up the misc scripts to feed in raw text, and
- Processing the data after it has been collected.
Each of these is described in greater detail in separate sections below.
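To make this concrete, a single word-pair count is held in a small hypergraph roughly like the sketch below. The predicate name is an assumption chosen for illustration; the actual structures used by each counting mode are defined in scm/link-pipeline.scm, and the count itself lives in a CountTruthValue attached to the EvaluationLink.

;; Rough sketch only; see scm/link-pipeline.scm for the real structures.
(EvaluationLink (ctv 1 0 42)                 ; CountTruthValue: the pair was seen 42 times
   (PredicateNode "*-Sentence Word Pair-*")  ; predicate name is an assumption
   (ListLink
      (WordNode "this")
      (WordNode "is")))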
As an alternative to word-pair counting and MI calculation, we have also set up the system to be able to generate MST parses for sentences by providing the weights of their instance-pairs in a file (instead of MI). This allows for the use of different algorithms that estimate the relationship between words, e.g. neural networks. See the MST section below for more details.
This section describes how to set up the atomspace to collect statistics. Most of it revolves around setting up postgres, and for this you have two options:
- You can choose to install everything directly on your machine, in which case you should just continue reading this section, or
- You can follow the instructions in the Setting up a Docker Container section below to set up a docker container with the whole environment needed to run the ULL pipeline, ready for you to use.
If you choose the second option, go to the section Bulk Text Parsing once you have your container working.
Pre-installations:
0.0) Optional. If you plan to run the pipeline on multiple different languages, it can be convenient, for various reasons, to run the processing in an LXC container. If you already know LXC, then do it. If not, or this is your first time, then don't bother.
0.1) Probably mandatory. Chances are good that you'll work with large datasets; in this case, you also need to do the below. Skipping this step will lead to the error Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS. So:
git clone https://github.com/ivmai/bdwgc
cd bdwgc
git checkout release-7_6
./autogen.sh
./configure --enable-large-config
make; sudo make install
0.2) The atomspace MUST be built with guile version 2.2.2.1 or newer, which can only be obtained from git: that is, by
git clone git://git.sv.gnu.org/guile.git
git checkout stable-2.2
Earlier versions have problems of various sorts. Version 2.0.11 will quickly crash with the error message:
guile: hashtab.c:137: vacuum_weak_hash_table: Assertion 'removed <= len' failed.
Also par-for-each hangs: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26616 (in guile-2.2, it doesn't hang, but still behaves very badly).
0.3) Opencog should be built with link-grammar-5.4.3 or newer. You can check it by running link-parser --version. If not, this version is available here.
Now, let's set up the text-ingestion pipeline:
- Set up and configure postgres, as described in atomspace/opencog/persist/sql/README.md
- Test that your previous step was successful. Create and initialize a database. Pick any name you want; here it is learn_pairs.
createdb learn_pairs
cat $ATOMSPACE_SOURCE_DIR/opencog/persist/sql/multi-driver/atom.sql | psql learn_pairs
- Create/edit the ~/.guile file and add the following content:
(use-modules (ice-9 readline))
(activate-readline)
(debug-enable 'backtrace)
(read-enable 'positions)
(add-to-load-path ".")
- Start the REPL server. Eventually, you can use the run-multiple-terminals.sh script in the run directory to do this, which creates a byobu session with different processes in different terminals so you can keep an eye on them. However, the first time through, it is better to do it by hand. So, for now, start the REPL server in a terminal by writing:
guile -l launch-cogserver -- --mode pairs --lang en --db learn_pairs --user opencog_user --password cheese
The --user option is needed only if the database owner is different from the current user. --password is also optional; it is not needed if no password was set up for the database.
- Verify that the language processing pipeline works. Try sending it input by running the following in a second terminal:
rlwrap telnet localhost 17005
opencog-en> (observe-text "this is a test")
Or better yet (in a third terminal):
echo -e "(observe-text \"this is a another test\")" |nc -N localhost 17005
echo -e "(observe-text \"Bernstein () (1876\")" |nc -N localhost 17005
echo -e "(observe-text \"Lietuvos žydų kilmės žurnalistas\")" |nc -N localhost 17005
If this command shows the error nc: invalid option -- 'N', open process-one.sh and remove the -N option from the nc commands (some versions of netcat require this option).
Note: 17005 is the default port for the REPL server in English. This should result in activity on the cogserver and on the database: the "observe text" scheme code sends the text for parsing, counts the returned word-pairs, and stores them in the database.
- Verify that the above resulted in data sent to the SQL database. For example, log into the database, and check:
psql learn_pairs
learn_pairs=# SELECT * FROM atoms;
learn_pairs=# SELECT COUNT(*) FROM atoms;
learn_pairs=# SELECT * FROM valuations;
The above shows that the database now contains word-counts for pair-wise linkages for the input sentences. If the above worked without trouble you are ready to use the pipeline and continue to the next section, but if the above are empty, something is wrong: go back to step zero!
- Finally, there are some parameters you can optionally adjust before starting to feed the pipeline with real input. Take some time to read and understand the scripts in the run-ull (current) and the scm directories, in particular the use of the params.txt configuration file (more on it will come in the next sections; an illustrative sketch follows this step). For instance, you might want to tune the forced garbage collection parameters. Currently, garbage collection is forced whenever the guile heap exceeds 750 MBytes; this helps keep RAM usage down on small-RAM machines. However, it does cost CPU time.
You can adjust the count-reach parameter passed to observe-text (e.g. in process-one.sh) to suit your whims in RAM usage and time spent in GC.
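As an illustration only, the parameters referred to throughout this README look roughly like the sketch below. The variable names are the ones quoted elsewhere in this document, but the authoritative list, syntax and defaults are whatever your config/params.txt actually contains.

# Illustrative sketch only; consult config/params.txt for the real contents.
cnt_mode="clique"    # "any", "clique" or "clique-dist" (see Bulk Text Parsing below)
cnt_reach=6          # max pair distance (clique) or number of linkages (any)
mst_dist="#f"        # weight word-pair MI by 1/distance during MST parsing
exp_parses="#t"      # export MST parses to the mst-parses directory
split_sents="#t"     # split the input text into sentences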
This section describes how to feed text into the pipeline. To do that, you first need to find some adequate text corpora. It would be best to get text that consists of narrative literature, adventure and young-adult novels, and newspaper stories. These contain a good mix of common nouns and verbs, which is needed for conversational natural language.
It turns out that Wikipedia is a poor choice for a dataset. That's because the "encyclopedic style" means it contains few pronouns, and few action-verbs (hit, jump, push, take, sing, love), because it is mostly describing objects and events (is, has, was). It also contains large numbers of product names, model numbers, geographical place names, and foreign language words, which do little or nothing for learning grammar. Finally, it has large numbers of tables and lists of dates, awards, ceremonies, locations, sports-league names, battles, etc. that get mistaken for sentences, and lead to unusual deductions of grammar. Thus, Wikipedia is not a good choice for learning text.
There are various scripts in the learn/download directory for downloading and pre-processing texts from Project Gutenberg, Wikipedia, and the "Archive of Our Own" fan-fiction website. Once you are sure you have the right material to start, follow the next steps:
- To avoid using an ad-hoc tokenizer in the process, the current pipeline assumes that the texts have been pre-tokenized in the user's preferred way, and it only does naive tokenization on them (splitting by space). As such, we need to empty the affix-punc file in the link-grammar/any/ directory. It's recommended to make a copy of the original, to restore for other possible uses of the link-parser in "any" mode. If you didn't install link-grammar in a different location, this should work:
sudo mv /usr/local/share/link-grammar/any/affix-punc /usr/local/share/link-grammar/any/affix-punc_original
sudo touch /usr/local/share/link-grammar/any/affix-punc
- Set up the working directory by running the following commands from the root of your learn clone, if you haven't already:
~/learn$ mkdir build
~/learn$ cd build
~/learn/build$ rm -rf *
~/learn/build$ cmake ..
~/learn/build$ make
~/learn/build$ sudo make install
~/learn/build$ make run-ull
Review the file README-run.md if you want to have a general understanding of what each of these scripts/files does.
- Put all the training plain-text files of the same language in a separate directory inside your working directory, at ~/learn/build/run-ull/. The scripts used in this section use by default the name beta-pages for such a directory, so if you want to use a different name make sure you change the respective path inside the text-process.sh script. Also, keep in mind that the files will be removed from the folder after being processed, so make sure you keep a back-up of them somewhere else (you don't want to mess up the original files after all the work done to get them). If you used the provided example scripts, you should have a test file in the alpha-pages folder. Make a copy of this folder with the desired name:
cp -pr alpha-pages beta-pages
- Set up distinct databases, one for each language you will work with:
createdb fr_pairs lt_pairs pl_pairs en_pairs
cat $ATOMSPACE_SOURCE_DIR/opencog/persist/sql/multi-driver/atom.sql | psql ??_pairs
- If you are familiar with the counting and parsing methods used in the pipeline, open config/params.txt and choose the counting mode you want to use. Otherwise, just leave the default values. There are currently three observing modes, set by cnt_mode and taking another integer parameter (a stand-alone sketch of the clique pairing is shown after this list):
- any: counts pairs of words linked by the LG parser in the 'any' language. In this case, 'count-reach' specifies how many linkages from the LG parser to use. If cnt_reach is set to 0, then it uses all returned parses.
- clique: iterates over each word in the sentence and pairs it with every word located within distance 'count-reach' to its right. Distance is defined as the difference between word positions in the sentence, so neighboring words have a distance of 1.
- clique-dist: same word-pairs as 'clique', but the count of each word pair is incremented by 'cnt_reach / distance'.
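The following stand-alone guile sketch illustrates what the 'clique' mode pairs up; it is not part of the pipeline (the real counting lives in scm/link-pipeline.scm), just a worked example of the distance rule described above.

(use-modules (srfi srfi-1))   ; for take

;; Pair each word with every word up to CNT-REACH positions to its right;
;; adjacent words are at distance 1.
(define (clique-pairs words cnt-reach)
  (if (null? words) '()
      (let* ((rest (cdr words))
             (near (take rest (min cnt-reach (length rest)))))
        (append (map (lambda (w) (list (car words) w)) near)
                (clique-pairs rest cnt-reach)))))

;; (clique-pairs '("this" "is" "a" "test") 2)
;;   => (("this" "is") ("this" "a") ("is" "a") ("is" "test") ("a" "test"))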
- In your working directory at ~/learn/build/run-ull, run the following:
./run-multiple-terminals.sh pairs lang ??_pairs your_user your_password
This starts the cogserver and sets a default prompt; these defaults are chosen to avoid conflicts and confusion, and to allow multiple languages to be processed at the same time.
Replace the arguments above with the ones that apply to the language you are using and your database credentials. User and password are optional, as previously explained. For example, for English run:
./run-multiple-terminals.sh pairs en en_pairs opencog_user cheese
- In the parse tab of the byobu (you can navigate with the F3 and F4 keys), run the following:
./text-process.sh pairs en
You can change 'en' to the respective language initials. If this command shows the error nc: invalid option -- 'N', open process-one.sh and remove the -N option from the nc commands (some versions of netcat require this option).
- Wait some time, possibly a few days. When finished, stop the cogserver.
- Verify that the information was correctly saved in the database.
Some handy SQL commands:
SELECT count(uuid) FROM atoms;
SELECT count(uuid) FROM atoms WHERE type = 123;
Type 123 is WordNode for me; verify with:
SELECT * FROM Typecodes;
The total count accumulated is:
SELECT sum(floatvalue[3]) FROM valuations WHERE type = 7;
where type 7 is CountTruthValue.
Some extra notes:
The submit-one.pl script here is called with the "observe-text" instruction for word-pair counting when text is sent to the cogserver.
Obtaining word-pair counts requires digesting a lot of text, and
counting the word-pairs that occur in the text. There are several
ways of doing this at the moment. The original way was to parse the
text with link-grammar using the "any" language. This pseudo-language
will link any word to any other as long as the links do not cross each
other, thus extracting word pairs from the parsed text. An alternative
method (although not necessarily more efficient) is to find all possible
word-pair combinations (a clique) that lie within a specified distance.
You may want to reduce the amount of data that is collected. Currently, the observe-text function in scm/link-pipeline.scm can collect counts on four different kinds of structures:
- Word counts -- how often a word is seen.
- Clique pairs, and pair-lengths -- This counts pairs using the "clique pair" counting method. The max length between words must be specified. Optionally, the lengths of the pairs can be recorded. Caution: enabling length recording will result in 6x or 20x more data to be collected, if you've set the length to 6 or 20. That's because any given word pair will be observed at almost any length apart, and each of these lengths is a separate atom with its own count. Watch out!
- Lg "ANY" word pairs -- how often a word-pair is observed.
- Disjunct counts -- how often the random ANY disjuncts are used. You almost surely do not need this. This is for my own personal curiosity.
Edit the comments in that file if you want to vary the collected data.
This pipeline requires postgres 9.3 or newer, for multiple reasons. One reason is that older versions of postgres don't automatically VACUUM. The other is that the list membership functions are needed.
Also, be sure to perform the postgres tuning recommendations found in various online postgres performance wikis, or in the atomspace/opencog/persist/sql/README.md file. See 'Performance' section below.
When you're done observing, restore the original affix-punc file that we modified in step 0 of this section.
After accumulating a few million word pairs, we're ready to compute the mutual entropy between them. Follow the next steps to do so. Note that if the parsing is interrupted, you can restart the various scripts; they will automatically pick up where they left off.
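For orientation, the quantity being computed is the mutual information of an ordered word pair, roughly as sketched below; the exact counting and normalization are whatever is implemented in scm/batch-word-pair.scm and the (opencog matrix) code, so treat this only as a schematic.

% Schematic definition; N(l,r) is the observed count of the ordered pair (l,r),
% and N(l,*), N(*,r), N(*,*) are the corresponding wildcard (marginal and total) sums.
\[
  MI(l,r) \;=\; \log_2 \frac{p(l,r)}{p(l,*)\, p(*,r)},
  \qquad
  p(l,r) = \frac{N(l,r)}{N(*,*)},\quad
  p(l,*) = \frac{N(l,*)}{N(*,*)},\quad
  p(*,r) = \frac{N(*,r)}{N(*,*)}
\]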
- Set up the working directory by running the following commands from the root of your learn clone, if you haven't already:
~/learn$ mkdir build
~/learn$ cd build
~/learn/build$ rm -rf *
~/learn/build$ cmake ..
~/learn/build$ make
~/learn/build$ sudo make install
~/learn/build$ make run-ull
- In your working directory at ~/learn/build/run-ull, run the following:
./run-multiple-terminals.sh cmi lang ??_pairs your_user your_password
This starts the cogserver and sets a default prompt; these defaults are chosen to avoid conflicts and confusion, and to allow multiple languages to be processed at the same time.
Replace the arguments above with the ones that apply to the language you are using and your database credentials. User and password are optional, as previously explained. For example, for English run:
./run-multiple-terminals.sh cmi en en_pairs opencog_user cheese
- In one of the unused byobu windows (you can navigate with the F3 and F4 keys), run the following:
./process-word-pairs.sh cmi en
You can change 'en' to the respective language initials. If this command shows the error nc: invalid option -- 'N', open process-word-pairs.sh and remove the -N option from the nc commands (some versions of netcat require this option).
- Wait some time, possibly a few days. When finished, you can export the word-pair MI values to a file if you want. Start by loading the file in the cogserver:
(load "export-mi.scm")
and running the export command (change "any" to the mode used when pair counting):
(export-mi "any")
This will generate a file called mi-pairs.txt in your working directory. Then stop the cogserver.
These scripts use commands from the scripts in the scm directory. The code for computing word-pair MI is in batch-word-pair.scm. It uses the (opencog matrix) subsystem to perform the core work; a rough sketch of the core calls is shown below.
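As a rough sketch of what those scripts do under the hood (names as of the 2018-era code; your checkout may differ), the core calls look something like this:

;; Hedged sketch; see scm/batch-word-pair.scm for the real entry points.
(define pair-obj (make-any-link-api))        ; word pairs counted in "any" mode
(define star-obj (add-pair-stars pair-obj))  ; adds wildcard (marginal) sums
(batch-all-pair-mi star-obj)                 ; computes and stores MI for every pair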
General remarks:
- The system might not be robust enough at this stage, so if you find an error while executing this code, run each command from the function in compute-mi.scm directly and separately on the cogserver to trace the error.
- Batch-counting might take hours or longer, depending on your dataset size. The batching routine will print to stdout, giving a hint of the rate of progress.
Example stats and performance:
- The current fr_pairs db has 16785 words and 177960 pairs. This takes 17K + 2x 178K = 370K total atoms loaded. These load up in 10-20 seconds or so.
- New fr_pairs has 225K words, 5M pairs (10.3M atoms). Loading 10.3M atoms takes about 10 minutes of cpu time and 20-30 minutes of wall-clock time (500K atoms per minute, 9K/second on an overloaded server).
- RSS for cogserver: 436MB, holding approx 370K atoms. So this is about 1.2KB per atom, all included. Atoms are a bit fat... but loading all pairs is very manageable even for modest-sized machines.
- RSS for cogserver: 10GB, holding 10.3M atoms. So this is just under 1KB per atom. (By comparison, direct measurement of atom size, i.e. class Atom: typical atom size is 4820384 / 35444 = 136 Bytes/atom; this is NOT counting indexes, etc.)
- For dataset (fr_pairs) with 225K words, 5M pairs: current rate is 150 words/sec or 9K words/min. After the single-word counts complete, an all-pair count is done. This is fast, taking a couple of minutes.
- Next: batch-logli takes 540 seconds for 225K words.
- Finally, an MI compute stage. Current rate is 60 words/sec = 3.6K per minute. This rate is per-word, not per word-pair.
Update Feb 2014: fr_pairs now contains 10.3M atoms.
SELECT count(uuid) FROM Atoms; gives 10324863 (10.3M atoms)
SELECT count(uuid) FROM atoms WHERE type = 77; gives 226030 (226K words)
SELECT count(uuid) FROM atoms WHERE type = 8; gives 5050835 (5M pairs, ListLink)
SELECT count(uuid) FROM atoms WHERE type = 27; gives 5050847 (5M pairs, EvaluationLink)
The MST parser discovers the minimum spanning tree that connects the words together in a sentence, using the provided link weights. The link-cost used can be (minus) the mutual information between word-pairs (so we are maximizing MI). In this case, MST parsing cannot be started before the above steps to compute word-pair MI have been accomplished. Alternatively, one can obtain the weights between word-instance-pairs from a different source (e.g. a neural-network-generated language model) and feed them to the MST algorithm (see instructions after step 7).
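Schematically, when the link-cost is minus the MI, the parser searches for the tree with the largest total MI among all trees that connect every word of the sentence without crossing links; the details live in scm/mst-parser.scm.

% Schematic objective only; T ranges over the candidate (planar) spanning trees.
\[
  T^{*} \;=\; \arg\max_{T} \sum_{(l,r)\in T} MI(l,r)
  \;=\; \arg\min_{T} \sum_{(l,r)\in T} \bigl(-MI(l,r)\bigr)
\]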
The minimum spanning tree code is called from scm/mst-parser.scm and run-poc/redefine-mst-parser.scm. The current version works well. To run it using MI calculated as explained in the previous sections, follow the next steps (see after step 7 below for using other types of weights):
- Set up the working directory by running the following commands from the root of your learn clone, if you haven't already:
~/learn$ mkdir build
~/learn$ cd build
~/learn/build$ rm -rf *
~/learn/build$ cmake ..
~/learn/build$ make
~/learn/build$ sudo make install
~/learn/build$ make run-ull
Review the file README-run.md if you want to have a general understanding of what each of these scripts/files does.
- Copy again all your text files (pre-tokenized if you wish), now to the gamma-pages directory (or edit text-process.sh and change the corresponding directory name). Once again, keep in mind that during processing, text files are removed from this directory.
- (Optional but suggested) Make a copy of your word-pair database, "just in case". You can copy databases by saying:
createdb -T existing_dbname backup_dbname
- Tweak your parser parameters accordingly. Open config/params.txt and make sure the cnt_mode matches the one used for pair-counting. Also, if you want to assign distance weight to the word-pairs' MI values (adding a factor 1 / distance), assign mst_dist="#t". Set the export-parses variable to true (exp_parses="#t") if you want to export the sentence parses for each corpus file into the directory mst-parses. The parameter cnt_reach does not have an effect at this stage; you can leave it as is.
- In your working directory at ~/learn/build/run-ull, run the following:
./run-multiple-terminals.sh mst lang dbname your_user your_password
Replace the arguments above with the ones that apply to the language you are using and your database credentials. User and password are optional, as previously explained. For example, for English run:
./run-multiple-terminals.sh mst en en_pairs opencog_user cheese
- In an unused tab of the byobu (you can navigate with the F3 and F4 keys), run the following:
./process-word-pairs.sh mst en
You can change 'en' to the respective language initials. Once again, if this command shows the error nc: invalid option -- 'N', open process-word-pairs.sh and remove the -N option from the nc commands.
Wait 10 to 60+ minutes for the guile prompt to appear. This script opens a connection to the database, and then loads all word-pairs into the atomspace. This can take a long time, depending on the size of the database. The word-pairs are needed to get the pair-costs that are used to perform the MST parse.
- Once the above has finished loading, the parse script can be started. In the parse tab of the byobu run:
./text-process.sh mst en
Remember to change 'en' to the respective language if it applies. Wait a few days for data to accumulate. Once again, if the above command shows the error nc: invalid option -- 'N', open process-one.sh and remove the -N option from the nc commands.
If the process is stopped for any reason, you can just re-run these scripts; they will pick up where they left off. When finished, remember to stop the cogserver.
To use link weights calculated in some other way (instead of MI), you need to provide them in files with the following format:
First Sentence (prefixed with ###LEFT-WALL###)
0 ###LEFT-WALL### 1 First-word-in-sentence link-weight
0 ###LEFT-WALL### 2 Second-word-in-sentence link-weight
...
1 First-word-in-sentence 2 Second-word-in-sentence link-weight
1 First-word-in-sentence 3 Third-word-in-sentence link-weight
...
Second Sentence (prefixed with ###LEFT-WALL###)
0 ###LEFT-WALL### 1 First-word-in-sentence link-weight
0 ###LEFT-WALL### 2 Second-word-in-sentence link-weight
...
where each sentence block (note that the sentences need to include the initial ###LEFT-WALL### token) together with all its word-instance-pair lines is separated from the next sentence by an empty line. A small illustrative example is shown below.
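For example, a hypothetical file for the single two-word sentence "Hello world" (the weights here are made up, purely for illustration) would contain one block like the following:

###LEFT-WALL### Hello world
0 ###LEFT-WALL### 1 Hello 0.7
0 ###LEFT-WALL### 2 world 0.2
1 Hello 2 world 1.5

followed by an empty line before the next sentence block.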
The 7 steps above still apply, with the following modifications:
In step 2), place the special-format files in gamma-pages, instead of the plain text files.
In step 4), you need to set cnt_mode="file", to indicate you're using file-based weights, and make sure split_sents="#f".
All other parameters still apply.
Step 6) is not needed.
Once this is done (either using MI or file-based weights), you can move to the next step, which is explained in the next section. If you activated the option, you can check out the sentence parses generated in the folder mst-parses/.
Once you have a database with some fair number of connector sets in it, you can start exploring. For ideas, check out the original version of this README in opencog/nlp/learn.
Here are some links to other useful resources for understanding:
- Structure of the atomspace: atoms and their types
- The basic operations on atoms: README.
- Atom structure used for NLP: sentence-representation-wiki.
After this, clustering and feedback steps should be performed, but for now you are on your own. Good luck!!
Before you follow the next steps, make sure you have cloned the repositories from OpenCog (opencog, atomspace, cogutil) on your machine.
- Download and set up docker from here. If you are using Linux, also install docker-compose from here, or by:
~$ sudo pip install -U docker-compose
- Clone the docker repository:
~$ git clone https://github.com/singnet/docker.git
- Make sure your user is in the docker group (getent group docker), otherwise you will get the error Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: ... If this happens, follow the instructions here. Enter the opencog directory and build your images:
~$ cd docker/opencog/
~/docker/opencog$ ./docker-build.sh -a
- Create a directory on your machine to store the ccache output, which will make rebuilding the containers faster (change the path to your own, e.g. $HOME/.ccache):
~$ mkdir -p $HOME/path/to/where/you/want/to/save/ccache/output
Optionally, you can instead just comment out the $CCACHE_DIR:/home/opencog/.ccache line in the common.yml file.
- If you don't have opencog installed and running in your OS, you need to first install opencog. Please follow all instructions here and make sure you have a running opencog.
- Add these lines to ~/.bashrc at $HOME of your host OS (change paths to your own) and run source ~/.bashrc:
export OPENCOG_SOURCE_DIR=$HOME/path/to/opencog
export ATOMSPACE_SOURCE_DIR=$HOME/path/to/atomspace
export COGUTIL_SOURCE_DIR=$HOME/path/to/cogutil
export CCACHE_DIR=$HOME/path/to/where/you/want/to/save/ccache/output
If you are trying to install the container on a shared server, insert these lines in the file ~/.profile instead, and restart your session before continuing.
- Run the container:
~$ cd docker/opencog/
~/docker/opencog$ docker-compose run dev
- Inside the container, clone the learn repository with the ULL pipeline into your home directory:
repository with the ULL pipeline into your home directory:git clone https://github.com/singnet/learn /home/opencog/learn
- Set up the working directory by running the following commands from the newly cloned repository:
cd ~/learn
~/learn$ mkdir build
~/learn$ cd build
~/learn/build$ rm -rf *
~/learn/build$ cmake ..
~/learn/build$ make
~/learn/build$ sudo make install
~/learn/build$ make run-ull
- Test that everything is working:
a) Create and format a database (password is cheese):
$ createdb learn_pairs
$ cat atom.sql | psql learn_pairs
Use tmux to create parallel sessions of the container. If you are not familiar with it, you can use the cheatsheet here.
b) In a separate session start the REPL server.
$ cd ~/learn/build/run-ull/
$ guile -l launch-cogserver.scm -- --mode pairs --lang en --db learn_pairs --user opencog_user --password cheese
c) Send input to the pipeline:
$ echo -e "(observe-text-mode \"This is a test\" \"any\" 24)" |nc -N localhost 17005
d) Check that the input was registered in the database:
$ psql learn_pairs
learn_pairs=# SELECT * FROM atoms;
The words from the sentence "This is a test" should appear in the table, under the name column.
IF EVERYTHING WORKED FINE YOU ARE READY TO WORK (go to Bulk Text Parsing), OTHERWISE GO BACK TO STEP 0 (or fix your bug if you happen to know what went wrong)!!
Note 1: If something went wrong when trying to connect to the cogserver consider making a clean build and re-installing inside the container all three: cogutil, atomspace, opencog. For example for the first one:
~$ cd /cogutil/build
/cogutil/build$ rm -rf *
/cogutil/build$ cmake ..
/cogutil/build$ make && sudo make install
Note 2: If you make changes to the code in your installed repos, you can update those in your current container by cd-ing to the mount directory inside the container and running /tmp/octool -bi
Note 3: Steps 1-5 are only necessary the first time you install the docker container and images. Afterwards, you just need to follow steps 6 and 7 every time you want to create a new opencog container, or you might want to access directly your already existing container (see next note).
Note 4: Keep in mind that every time you run docker-compose run dev it will create a new instance of opencog, but the same instances of postgres and relex will be running in the background. Use (Ctrl+D) to exit a container.
Some useful commands for managing your containers on your local machine are listed below:
- docker ps : to see the list of all the active containers (it shows container_ID).
- docker ps -a : to see the list of all the existing containers.
- docker start container_ID : to start an inactive existing container (for example an existing instance of opencog).
- docker attach container_ID : to "log-in" to a running (existing & active) container in a terminal.
- docker stop container_ID : to stop a running container.
- docker stop $(docker ps -q) : to stop all running containers.
- docker kill container_ID : to kill a running container (forces the stop).
- docker rm container_ID : to delete an existing but inactive container.
- docker rm -f container_ID : to delete a running container (it will kill it first).
- docker cp container_ID:/path file : to copy a file from the container to the host.
DO NOT try to delete all running containers unless strictly necessary because it will delete the postgres instance as well, which means losing all your databases!!!
Note 5: Remember to always close any cogserver (Ctrl+D) sessions you have started before continuing, otherwise you will have problems accessing your databases later.
Note 6: If you prefer to avoid specifying the username and password for the postgres databases, you can follow these instructions.