This is a Trino connector to access the GitHub API using SQL.
Because most endpoints have some required parameters, there are two ways of getting data:
- reading from functions - required parameters are passed explicitly as function arguments, including the page number;
  functions return an array of rows, which must be unnested:
  `SELECT * FROM unnest(pulls(:token, :owner, :repo, :page))`
- reading from tables - required parameters are inferred from query conditions (the `WHERE` clause); if there are multiple
  pages of results, multiple requests are made to fetch all of them before returning the data (see the example after this list)
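For example, a table read equivalent to the function call above could look like this (the exact filter column names, like `owner` and `repo`, are illustrative):

```
-- Required parameters come from the WHERE clause; all result pages are fetched
-- before any rows are returned:
SELECT *
FROM pulls
WHERE owner = 'trinodb' AND repo = 'trino';
```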
Not all API endpoints are mapped yet, here's a list of the available tables:
- `orgs` - Organizations
- `users` - Users
- `repos` - Repositories
- `issues` and `issue_comments` - Issues
- `pulls`, `pull_commits`, `pull_stats`, `reviews`, `review_comments` - Pull requests
- `workflows`, `runs`, `jobs`, `steps`, `artifacts`, `runners` - Actions
The following configuration options are recognized by this connector:
- `token` - the personal access token to authenticate with; it is required, but doesn't have to be valid; see the Authentication and rate limits section below
- `client-cache-path` - setting this enables caching of HTTP responses
- `client-cache-max-size` - defaults to 10MB
- `client-connect-timeout` and `client-read-timeout` - default to 10 seconds
- `client-max-binary-download-size` - ignore artifacts larger than this size
- `min_splits` and `min_split_tables` - for tables matching endpoints that don't return the total number of elements, a minimum number of splits can be used, where every split fetches the next page until no more results are available; defaults to 1 for the `issues`, `pulls`, `runs` and `check_runs` tables
- `max-requests-per-second` - rate-limit HTTP requests; this applies to the HTTP client, so the limit is independent on every worker
- `max-splits-per-second` - rate-limit the number of splits; this effectively limits the number of requests for the whole Trino cluster
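For reference, a catalog properties file using some of these options could look like the example below. The file name, the `connector.name` value, and all sample values are assumptions here, so adjust them to your setup:

```
# /etc/trino/catalog/github.properties (hypothetical example)
connector.name=github
# Required, but doesn't have to be valid; see Authentication and rate limits below
token=ghp_xxxxxxxxxxxxxxxxxxxx
# Optional: enable and size the HTTP response cache
client-cache-path=/tmp/github-cache
client-cache-max-size=100MB
# Optional: throttle requests (per worker) and splits (per cluster)
max-requests-per-second=10
max-splits-per-second=5
```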
Build, test and create a deployable zip file:
mvn clean package
An example command to run the Trino server with the github plugin and catalog enabled:
src=$(git rev-parse --show-toplevel)
docker run \
-v $src/trino-rest-github/target/trino-rest-github-0.156-SNAPSHOT:/usr/lib/trino/plugin/github \
-v $src/catalog:/etc/trino/catalog \
-p 8080:8080 \
--name trino \
-d \
trinodb/trino:462
Connect to that server using:
docker run -it --rm --link trino trinodb/trino:462 trino --server trino:8080 --catalog github --schema default
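Once connected, a quick sanity check is to list the tables exposed by the connector:

```
SHOW TABLES;
```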
The GitHub API doesn't require authentication, but unauthenticated requests have very low rate limits.
Generate a personal auth token in Settings -> Developer settings -> Personal access tokens.
For more details, see the Authentication section in GitHub's API docs.
Some tables fetch data from endpoints that return the total number of items. In that case, this connector generates one split for every page of data (with 100 items per page). Such splits can be distributed evenly among all available nodes and/or threads, resulting in concurrent HTTP requests.
For other tables, a configurable minimum number of splits will be created (defaults to 1), where each fetches pages until no more are available.
Note: rate limits are not taken into account when generating splits.
Caching is enabled in the HTTP client used. API responses are cached in a temporary directory, and the max cache size is 50MB.
To better utilize the cache, avoid query conditions that change often, like timestamps with seconds. When using relative time
to filter recent records, instead of `now() - interval '14' day`, use a date: `current_date - interval '14' day`.
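For example, assuming the `runs` table accepts `owner` and `repo` filters and has a `created_at` column, a cache-friendly query could look like:

```
-- current_date only changes once a day, so repeated executions of this query
-- can reuse cached HTTP responses:
SELECT id, created_at
FROM runs
WHERE owner = 'trinodb'
  AND repo = 'trino'
  AND created_at > current_date - interval '14' day;
```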
Caching can prevent fetching the latest data. This could be mitigated by adding a condition to only get records older than the current timestamp,
but most endpoints only support a `since` filter, which is the opposite. Another solution is to disable caching,
and copy the data into another, persistent catalog. There is a Sync program that does this in an incremental fashion,
so it can be run at regular intervals, for example from a cron job.
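As a rough illustration of the idea (not the actual Sync implementation), an incremental copy into a hypothetical persistent catalog could look like this; the target catalog and schema, and the filter columns, are assumptions:

```
-- Copy only runs newer than what is already in the persistent table;
-- for the very first load, drop the created_at condition:
INSERT INTO hive.github.runs
SELECT r.*
FROM github.default.runs r
WHERE r.owner = 'trinodb'
  AND r.repo = 'trino'
  AND r.created_at > (SELECT max(created_at) FROM hive.github.runs);
```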
To run the `Sync` utility in `trino-rest-github`:
java -cp "trino-rest-github/target/trino-rest-github-0.156-SNAPSHOT/*" pl.net.was.rest.github.Sync
Check how much data the `Sync` utility collected by running a query like:
SELECT COUNT(DISTINCT id), COUNT(*), MIN(created_at), MAX(created_at) FROM runs;
To add a new table:
- Register one or more endpoints in `GithubService`. This will be used to build HTTP requests sent to the API.
- Create a new model in the `model` directory. This class is the data model for the API response.
  Annotations in the constructor arguments map JSON fields into properties.
  The `writeRow()` method is used to serialize the object into a Trino row, so this is where the mapping of JSON objects to SQL rows happens.
  Note that some properties might not be set from the API response, but from the request.
- If the endpoint has some required filters, like `owner` or `repo` (most do), create a new filter in the `filter` directory.
  It defines which columns' constraints will be pushed down and sent in HTTP requests.
- In `GithubRest`:
  - add the table definition in the `columns` property; make sure they match exactly the write methods in the model;
  - add the filter created in the previous step to the `filterAppliers` property;
  - create a function to get table rows and call it in the `getRows()` method.
To add a new function, assuming there's a model and columns mapping from the previous points:
- Add a row type string constant in the `GithubRest` class.
- Create a new function class in the `function` directory.