The discovery service is an application for dataset discovery with three components:
- It exposes a set of services via a REST API.
- It automatically ingests newly added datasets using a scheduler implemented with Celery. The scheduler can be configured using the `DATA_INGESTION_INTERVAL` variable in `.env-default`; the default value is 60 seconds (a configuration sketch follows this list).
- It provides services for Jupyter Notebook via a dedicated plugin: https://github.com/Archer6621/jupyterlab-daisy
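As a rough illustration of how the Celery scheduler can be wired up, the sketch below registers a periodic ingest task whose interval is read from `DATA_INGESTION_INTERVAL`. The broker URL, task name, and module layout are assumptions for illustration, not the project's actual code:

```python
# Minimal sketch of a Celery beat schedule for periodic ingestion.
# App name, broker URL, and task name are hypothetical.
import os

from celery import Celery

app = Celery("daisy", broker=os.getenv("CELERY_BROKER_URL", "amqp://localhost:5672"))

app.conf.beat_schedule = {
    "auto-ingest-datasets": {
        "task": "tasks.ingest_data",  # hypothetical task name
        # Interval in seconds, as documented for DATA_INGESTION_INTERVAL.
        "schedule": float(os.getenv("DATA_INGESTION_INTERVAL", "60")),
    },
}


@app.task(name="tasks.ingest_data")
def ingest_data():
    """Scan DATA_ROOT_PATH for newly added datasets and ingest them."""
    ...  # the real ingestion logic lives in the project's src folder
```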
The entire project is containerised, so the only requirement is Docker. You can browse the full OpenAPI documentation once the application is up (see below). The discovery service can be run in both development and production mode.
The environment variables can be found in `.env-default` (an example file is shown after the list below). Always delete the auto-generated `.env` file after changing something in `.env-default`.
- `DAISY_PRODUCTION` - `TRUE` to run in production mode, `FALSE` to run in development mode. Default: `FALSE`.
- `DATA_INGESTION_INTERVAL` - The time interval in seconds between runs of the auto-ingest pipeline. It should reflect how often new data is uploaded/received. Default: `60`.
- `DATA_ROOT_PATH` - The location of the datasets.
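For reference, a minimal `.env-default` covering the documented variables might look like this; the `DATA_ROOT_PATH` value is only a placeholder:

```
DAISY_PRODUCTION=FALSE
DATA_INGESTION_INTERVAL=60
DATA_ROOT_PATH=./data
```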
Run `docker_start.sh` to start the containers. Based on the `DAISY_PRODUCTION` variable, it will automatically use the appropriate docker-compose file. Once the application is up, visit the API documentation at `localhost:443`.
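If you script against the service, a quick liveness check might look like the sketch below. It assumes the API answers plain HTTP on port 443 and that the root path responds once the application is up:

```python
# Hypothetical liveness probe for the discovery service API.
import time

import requests

BASE_URL = "http://localhost:443"  # port from the docs above; the scheme is an assumption

for _ in range(30):
    try:
        # Any response at all means the server is accepting connections.
        requests.get(BASE_URL, timeout=2)
        print("Discovery service is up")
        break
    except requests.RequestException:
        time.sleep(2)
else:
    print("Discovery service did not come up in time")
```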
- Run the `/ingest-data` endpoint (example calls for these steps are sketched after this list).
  - The data should be in the `data` folder and has to follow this structure: `{id}/resources/{file-name}.csv`
  - This endpoint can take a while to run: the more data there is to process, the longer it takes.
- Run `/filter-connections` to remove extra edges.
- Run `/purge` to remove all the data from neo4j and redis.
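Assuming these endpoints accept plain HTTP requests on `localhost:443` (the HTTP method is not documented here, so POST is a guess), the pipeline could be driven from a small script like this:

```python
# Hypothetical driver for the ingestion/maintenance endpoints.
# Endpoint paths come from the list above; using POST is an assumption.
import requests

BASE_URL = "http://localhost:443"

# 1. Ingest everything under data/{id}/resources/{file-name}.csv.
#    This can take a long time when there is a lot of data to process.
requests.post(f"{BASE_URL}/ingest-data").raise_for_status()

# 2. Remove extra edges left over from ingestion.
requests.post(f"{BASE_URL}/filter-connections").raise_for_status()

# 3. Danger zone: wipes all data from neo4j and redis.
#    Uncomment only if you really want to start from scratch.
# requests.post(f"{BASE_URL}/purge").raise_for_status()
```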
- Get joinable tables - get all assets that share a column (key) with the specified asset: `/get-joinable`, with input `asset_id`.
- Get related assets - given a source and a target, show how and whether the assets are connected: `/get-related`, with two input variables, `from_asset_id` and `to_asset_id` (example calls below).
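The query endpoints can be called the same way; the sketch below assumes GET requests with query parameters and JSON responses, and the asset IDs are placeholders:

```python
# Hypothetical client calls for the query endpoints described above.
import requests

BASE_URL = "http://localhost:443"

# All assets sharing a column (key) with the given asset.
joinable = requests.get(f"{BASE_URL}/get-joinable", params={"asset_id": "123"})
joinable.raise_for_status()
print(joinable.json())

# How (and whether) two assets are connected.
related = requests.get(
    f"{BASE_URL}/get-related",
    params={"from_asset_id": "123", "to_asset_id": "456"},
)
related.raise_for_status()
print(related.json())
```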
(Development) The following admin panels are exposed for inspecting the services:
- RabbitMQ: `localhost:15672`
- Neo4j: `localhost:7474`
- Celery Flower: `localhost:5555`
- Redis: `localhost:8001`
You can edit any Python file in the `src` folder with your favorite text editor and it will live-update while the container is running (and in the case of the API, restart/reload automatically). If you get an error about file sharing on Windows, visit this thread.