ctbk python library

CLI for generating datasets (derived from Citi Bike public data in s3://tripdata).

Data flow

flowchart LR;
agg_sc["AggregatedMonths(YYYYMM, 's', 'c')\ns3://ctbk/aggregated/s_c_YYYYMM.parquet"]
agg_sec["AggregatedMonths(YYYYMM, 'se', 'c')\ns3://ctbk/aggregated/se_c_YYYYMM.parquet"]
agg_ymrgtb["AggregatedMonths(YYYYMM, 'ymrgtb', 'cd')\ns3://ctbk/aggregated/ymrgtb_cd_YYYYMM.parquet"]
smh_in["StationMetaHists(YYYYMM, 'in')\ns3://ctbk/stations/meta_hists/in_YYYYMM.parquet"]
smh_il["StationMetaHists(YYYYMM, 'il')\ns3://ctbk/stations/meta_hists/il_YYYYMM.parquet"]

z --> c --> n
n --> agg_sc
n --> agg_sec
n --> agg_ymrgtb
n --> smh_in
n --> smh_il
smh_in --> sm
smh_il --> sm
agg_sc --> sm
sm --> spj
agg_sec --> spj

TripdataZips (a.k.a. zips): public Citi Bike files

  • Released as NYC and JC files at s3://tripdata
  • See s3://tripdata

TripdataCsvs (a.k.a. csvs): unzipped and gzipped CSVs

NormalizedMonths (a.k.a. norms): normalize csvs

  • Merge regions (NYC, JC) for the same month, harmonize columns, drop duplicate data, etc.
  • Writes <root>/ctbk/normalized/YYYYMM.parquet
  • See also: s3://ctbk/normalized
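
The normalization step can be pictured with a small standard-library sketch. This is not ctbk's implementation: the column aliases, the sample rows, and the normalize helper are all hypothetical, and only the "Rideable Type" column with its "unknown" fallback is taken from the log output shown later in this README.

```python
# Hypothetical mapping from raw CSV headers to one canonical name per column.
COLUMN_ALIASES = {
    "starttime": "Start Time",
    "started_at": "Start Time",
    "rideable_type": "Rideable Type",
}

def normalize(rows_by_region):
    """Merge regions, harmonize column names, and drop exact duplicate rows."""
    seen, out = set(), []
    for region, rows in rows_by_region.items():
        for row in rows:
            norm = {COLUMN_ALIASES.get(k, k): v for k, v in row.items()}
            # Older months lack this column; fill it with a placeholder value.
            norm.setdefault("Rideable Type", "unknown")
            key = tuple(sorted(norm.items()))
            if key not in seen:
                seen.add(key)
                out.append(norm)
    return out

# Made-up rows: the first JC row duplicates the NYC row after harmonization.
normalized = normalize({
    "NYC": [{"starttime": "t1"}],
    "JC": [{"started_at": "t1"}, {"started_at": "t2", "rideable_type": "ebike"}],
})
```

Here the two regions contribute three raw rows, and two distinct rows remain once the headers are harmonized and the duplicate is dropped.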

AggregatedMonths (a.k.a. aggs): compute histograms over each month's rides:

  • Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike type, start and end station, …})
  • Produce any "sum keys" ({ride counts, duration in seconds})
  • Writes <root>/ctbk/aggregated/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/aggregated/*.parquet
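
As a rough illustration of the grouping described above (a sketch, not ctbk's actual code), the following groups a few made-up rides by a chosen set of aggregation keys and sums both ride counts and durations. The ride records, field names, and the aggregate helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical ride records; field names are illustrative, not ctbk's schema.
rides = [
    {"start": "A", "end": "B", "duration": 300},
    {"start": "A", "end": "C", "duration": 600},
    {"start": "B", "end": "A", "duration": 120},
]

def aggregate(rides, group_keys):
    """Group rides by `group_keys`; sum ride counts and durations per group."""
    out = defaultdict(lambda: {"count": 0, "duration": 0})
    for ride in rides:
        key = tuple(ride[k] for k in group_keys)
        out[key]["count"] += 1
        out[key]["duration"] += ride["duration"]
    return dict(out)

by_start = aggregate(rides, ["start"])        # analogous to keys "s"
by_pair = aggregate(rides, ["start", "end"])  # analogous to keys "se"
```

Each output row is one distinct combination of the aggregation keys, so adding keys makes the histogram finer-grained.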

StationMetaHists (a.k.a. smhs): compute station {id,name,lat/lng} histograms:

  • Similar to aggs, but counts station {id,name,lat/lng} tuples that appear as each ride's start and end stations (whereas agg's rows are 1:1 with rides)
  • "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys" (only counting is supported)
  • Writes <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/stations/meta_hists
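
The difference from aggs is that every ride contributes two station observations, one for its start and one for its end. A minimal sketch of that counting, using hypothetical ride records and (id, name) as the keys, might look like this:

```python
from collections import Counter

# Hypothetical rides; field names are illustrative only.
rides = [
    {"start_id": "1", "start_name": "Main St", "end_id": "2", "end_name": "Pier 1"},
    {"start_id": "2", "start_name": "Pier 1", "end_id": "1", "end_name": "Main Street"},
    {"start_id": "1", "start_name": "Main St", "end_id": "2", "end_name": "Pier 1"},
]

def station_meta_hist(rides):
    """Count (id, name) tuples over both the start and the end of each ride."""
    hist = Counter()
    for ride in rides:
        hist[(ride["start_id"], ride["start_name"])] += 1
        hist[(ride["end_id"], ride["end_name"])] += 1
    return hist

hist_in = station_meta_hist(rides)  # analogous to keys "in" (id, name)
```

The counts sum to twice the number of rides, and a station ID can appear under more than one name (here "Main St" and "Main Street").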

StationModes (a.k.a. sms): canonical {id,name,lat/lng} info for each station:

  • Computed from StationMetaHists:
    • name is chosen as the "mode" (most commonly listed name for that station ID)
    • lat/lng is taken to be the mean of the lat/lngs reported for each ride's start and end station
  • Writes <root>/ctbk/aggregated/YYYYMM/stations.json
  • See also: s3://ctbk/aggregated/YYYYMM/stations.json
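
The two rules above (most common name, mean lat/lng) can be sketched as follows. The histogram contents and the station_modes helper are hypothetical, and the sketch assumes the mean is weighted by how often each lat/lng was reported and that every station appears in both histograms.

```python
from collections import Counter, defaultdict

# Hypothetical histograms: (station id, name) -> count, (station id, lat, lng) -> count.
name_hist = {("1", "Main St"): 5, ("1", "Main Street"): 2, ("2", "Pier 1"): 4}
latlng_hist = {("1", 40.0, -74.0): 3, ("1", 40.2, -74.2): 1, ("2", 41.0, -73.0): 4}

def station_modes(name_hist, latlng_hist):
    """Pick each station's most common name and its count-weighted mean lat/lng."""
    names = defaultdict(Counter)
    for (sid, name), n in name_hist.items():
        names[sid][name] += n
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of lat, sum of lng, total count]
    for (sid, lat, lng), n in latlng_hist.items():
        acc = sums[sid]
        acc[0] += lat * n
        acc[1] += lng * n
        acc[2] += n
    return {
        sid: {
            "name": names[sid].most_common(1)[0][0],
            "lat": sums[sid][0] / sums[sid][2],
            "lng": sums[sid][1] / sums[sid][2],
        }
        for sid in sorted(names)  # output ordered by station ID
    }

modes = station_modes(name_hist, latlng_hist)
```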

StationPairJsons (a.k.a. spjs): counts of rides between each pair of stations:

  • JSON formatted as { <start idx>: { <end idx>: <count> } }
  • idxs are based on order of appearance in StationModes / stations.json above (which is also sorted by station ID)
  • Values are read from AggregatedMonths(<ym>, 'se', 'c'):
    • group by station start ("s") and end ("e"),
    • sum ride counts ("c")
  • Writes <root>/ctbk/aggregated/YYYYMM/se_c.json
  • See also: s3://ctbk/stations/YYYYMM/se_c.json
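
To make the nested JSON shape concrete, here is a small sketch that converts station-pair counts into the { <start idx>: { <end idx>: <count> } } layout. The station IDs, the counts, and the station_pairs_json helper are made up for illustration.

```python
import json

# Hypothetical inputs: station IDs in stations.json order (sorted by ID),
# and ride counts per (start, end) pair as read from an "se"/"c" aggregation.
station_ids = ["A", "B", "C"]
pair_counts = {("A", "B"): 10, ("A", "C"): 3, ("C", "A"): 7}

def station_pairs_json(station_ids, pair_counts):
    """Re-key pair counts by each station's position in `station_ids`."""
    idx = {sid: i for i, sid in enumerate(station_ids)}
    nested = {}
    for (start, end), count in sorted(pair_counts.items()):
        nested.setdefault(idx[start], {})[idx[end]] = count
    return json.dumps(nested)  # JSON object keys are always strings

spj = station_pairs_json(station_ids, pair_counts)
```

Using positional indices rather than full station IDs keeps the JSON compact, at the cost of needing stations.json to decode it.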


Clone this repo and install this library:

git clone
pip install -e ctbk

Then the ctbk executable will be available, which exposes a subcommand for each of the stages above:


Usage: ctbk [OPTIONS] COMMAND [ARGS]...

  CLI for generating datasets (derived from Citi Bike public data in `s3://`).
  ## Data flow
  ### `TripdataZips` (a.k.a. `zip`s): Public Citi Bike files
  - Released as NYC and JC files at s3://tripdata
  - See
  ### `TripdataCsvs` (a.k.a. `csv`s): unzipped and gzipped CSVs
  - Writes `<root>/ctbk/csvs/YYYYMM.csv`
  - See also:
  ### `NormalizedMonths` (a.k.a. `norm`s): normalize `csv`s
  - Merge regions (NYC, JC) for the same month, harmonize columns drop duplicate data, etc.
  - Writes `<root>/ctbk/normalized/YYYYMM.parquet`
  - See also:
  ### `AggregatedMonths` (a.k.a. `agg`s): compute histograms over each month's rides:
  - Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike
    type, start and end station, …})
  - Produce any "sum keys" ({ride counts, duration in seconds})
  - Writes `<root>/ctbk/aggregated/KEYS_YYYYMM.parquet`
  - See also:
  ### `StationMetaHists` (a.k.a. `smh`s): compute station {id,name,lat/lng} histograms:
  - Similar to `agg`s, but counts station {id,name,lat/lng} tuples that appear as each
    ride's start and end stations (whereas `agg`'s rows are 1:1 with rides)
  - "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys"
    (only counting is supported)
  - Writes `<root>/ctbk/stations/meta_hists/YYYYMM/KEYS.parquet`
  - See also:
  ### `StationModes` (a.k.a. `sm`s): canonical {id,name,lat/lng} info for each station:
  - Computed from `StationMetaHist`s:
    - `name` is chosen as the "mode" (most commonly listed name for that station ID)
    - `lat/lng` is taken to be the mean of the lat/lngs reported for each ride's start
      and end station
  - Writes `<root>/ctbk/aggregated/YYYYMM/stations.json`
  - See also:
  ### `StationPairJsons` (a.k.a. `spj`s): counts of rides between each pair of stations:
  - JSON formatted as `{ <start idx>: { <end idx>: <count> } }`
  - `idx`s are based on order of appearance in `StationModes` / `stations.json` above
    (which is also sorted by station ID)
  - Values are read from `AggregatedMonths(YYYYMM, 'se', 'c')`:
    - group by station start ("s") and end ("e"),
    - sum ride counts ("c")
  - Writes `<root>/ctbk/aggregated/YYYYMM/se_c.json`
  - See also:

  -r, --read TEXT   Set "read" behavior for `HasRoot` subclasses, `<alias>=<value>` to set specific classes by
                    alias, just `<value>` to set a global default. `<value>`s are `memory`, `disk`, and their
                    aliases, indicating whether to return disk-round-tripped versions of newly-computed
  -t, --root TEXT   Path- or URL-prefixes for `HasRoot` subclasses to write to and read from. `<alias>=<value>`
                    to set specific classes by alias, just `<value>` to set a global default. `<value>`s are
                    `memory`, `disk`, and their aliases, indicating whether to return disk-round-tripped
                    versions of newly-computed datasets.
  -w, --write TEXT  Set "write" behavior for `HasRoot` subclasses, `<alias>=<value>` to set specific classes by
                    alias, just `<value>` to set a global default. `<value>`s are `never`, `ifabsent`, `always`,
                    and their aliases, indicating how to handle each dataset type already existing on disk
                    (under its `root`) vs. not.
  --s3              Alias for `--root s3:/`, pointing all classes' "root" dirs at S3
  --help            Show this message and exit.

  zip                 Read files from s3://tripdata
  csv                 Extract CSVs from "tripdata" .zip files.
  normalized          Normalize "tripdata" CSVs (combine regions for each...
  partition           Separate pre-2024 parquets (`normalized/v0`) by...
  consolidate         Consolidate `normalized/YM/YM_YM.parquet` files...
  aggregated          Aggregate normalized ride entries by various...
  ymrgtb-cd           Read aggregated...
  station-meta-hist   Aggregate station name, lat/lng info from ride...
  station-modes-json  Compute canonical station names, lat/lngs from...
  station-pairs-json  Write station-pair ride_counts keyed by...
  yms                 Print one or more YM (year-month) ranges, e.g.:

ctbk zip --help
Usage: ctbk zip [OPTIONS] COMMAND [ARGS]...

  Read files from s3://tripdata

  --help  Show this message and exit.

  urls  Print URLs for selected datasets

ctbk csv --help
Usage: ctbk csv [OPTIONS] COMMAND [ARGS]...

  Extract CSVs from "tripdata" .zip files. Writes to <root>/ctbk/csvs.

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets
  sort    Sort one or more `.csv{,.gz}`'s in-place, remove empty lines

ctbk normalized --help
Usage: ctbk normalized [OPTIONS] COMMAND [ARGS]...

  Normalize "tripdata" CSVs (combine regions for each month, harmonize column
  names, etc. Populates directory `<root>/ctbk/normalized/YYYYMM/` with files
  of the form `YYYYMM_YYYYMM.parquet`, for each pair of (start,end) months
  found in a given month's CSVs.

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets

ctbk partition --help
Usage: ctbk partition [OPTIONS] [YM_RANGES_STR]

  Separate pre-2024 parquets (`normalized/v0`) by {src,start,end} months.

  --help  Show this message and exit.

ctbk consolidate --help
Usage: ctbk consolidate [OPTIONS] [YM_RANGES_STR]

  Consolidate `normalized/YM/YM_YM.parquet` files into a single
  `normalized/YM.parquet`, containing all rides ending in the given month.

  -c, --col TEXT  Columns to backfill; default: ['Birth Year', 'Gender', 'Bike
  -n, --dry-run   Print stats about fields that would be backfilled, but don't
                  perform any writes
  --help          Show this message and exit.

ctbk aggregated --help
Usage: ctbk aggregated [OPTIONS] COMMAND [ARGS]...

  Aggregate normalized ride entries by various columns, summing ride counts or
  durations. Writes to <root>/ctbk/aggregated/KEYS_YYYYMM.parquet.

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets

ctbk station-meta-hist --help
Usage: ctbk station-meta-hist [OPTIONS] COMMAND [ARGS]...

  Aggregate station name, lat/lng info from ride start and end fields. Writes
  to <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet.

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets

ctbk station-modes-json --help
Usage: ctbk station-modes-json [OPTIONS] COMMAND [ARGS]...

  Compute canonical station names, lat/lngs from StationMetaHists. Writes to

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets

ctbk station-pairs-json --help
Usage: ctbk station-pairs-json [OPTIONS] COMMAND [ARGS]...

  Write station-pair ride_counts keyed by StationModes' JSON indices. Writes
  to <root>/ctbk/aggregated/YYYYMM/se_c.json.

  --help  Show this message and exit.

  urls    Print URLs for selected datasets
  create  Create selected datasets

Subcommands: urls, create

Most of the ctbk commands above support two further subcommands:

  • urls: print the URLs that would be read from or written to
  • create: compute and save the relevant data to those URLs (optionally no-op'ing if already present, overwriting, or failing if not present)
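
The three outcomes listed for create line up with the never, ifabsent, and always values of the -w/--write option. The sketch below is one possible reading of those modes, not ctbk's code: the create function, its return values, and the mapping of "never" to "fail if not present" are assumptions.

```python
import os
import tempfile

def create(path, compute, write="ifabsent"):
    """Hypothetical write policy: returns what happened to `path`."""
    if write not in ("never", "ifabsent", "always"):
        raise ValueError(f"unknown write mode: {write}")
    exists = os.path.exists(path)
    if write == "never":
        # Assumed meaning: only existing data may be used; never compute or write.
        if not exists:
            raise FileNotFoundError(path)
        return "read"
    if write == "ifabsent" and exists:
        return "skipped"  # no-op: the data is already present
    with open(path, "w") as f:
        f.write(compute())  # "always", or "ifabsent" with nothing on disk yet
    return "written"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "202101.txt")
    results = [
        create(path, lambda: "v1", write="ifabsent"),  # nothing on disk yet
        create(path, lambda: "v2", write="ifabsent"),  # already present
        create(path, lambda: "v3", write="always"),    # overwrite
        create(path, lambda: "v4", write="never"),     # present, so just read
    ]
```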


urls: print URLs

Print URLs for 3 months of normalized data on S3:

ctbk --s3 normalized -d 202206-202209 urls
# s3://ctbk/normalized/202206.parquet
# s3://ctbk/normalized/202207.parquet
# s3://ctbk/normalized/202208.parquet

Print URLs for 3 months of normalized data in the local folder s3/:

ctbk normalized -d 202206-202209 urls
# s3/ctbk/normalized/202206.parquet
# s3/ctbk/normalized/202207.parquet
# s3/ctbk/normalized/202208.parquet
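
Both examples show that the range 202206-202209 yields three months, so the end month is excluded. A minimal sketch of that expansion follows; expand_ym_range is a hypothetical helper that handles only the closed YYYYMM-YYYYMM form, not the other forms ctbk accepts.

```python
def expand_ym_range(spec):
    """Expand 'YYYYMM-YYYYMM' into a list of months, excluding the end month."""
    start, end = spec.split("-")
    y, m = int(start[:4]), int(start[4:])
    end_y, end_m = int(end[:4]), int(end[4:])
    months = []
    while (y, m) < (end_y, end_m):
        months.append(f"{y:04d}{m:02d}")
        # Roll over to January of the next year after December.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

urls = [f"s3://ctbk/normalized/{ym}.parquet" for ym in expand_ym_range("202206-202209")]
```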

create: create+save data

Compute one month of normalized ride data:

ctbk --s3 -tnorm=tmproot normalized -d202101 create
# Writing tmproot/ctbk/normalized/202101.parquet
# Reading s3://ctbk/csvs/202101-citibike-tripdata.csv.gz
# s3://ctbk/csvs/202101-citibike-tripdata.csv.gz: "Rideable Type" column not found; setting to "unknown" for all rows
# Reading s3://ctbk/csvs/JC-202101-citibike-tripdata.csv.gz
# s3://ctbk/csvs/JC-202101-citibike-tripdata.csv.gz: "Rideable Type" column not found; setting to "unknown" for all rows

Upstream data is read from S3 (--s3 flag).

Output normalized data is written under local folder tmproot/ (-tnorm=tmproot):

tree -sh tmproot
# [  96]  tmproot
# └── [  96]  ctbk
#     └── [  96]  normalized
#         └── [ 30M]  202101.parquet
# 3 directories, 1 file

stderr messages about Rideable Type not being found are due to the chosen month predating the addition of that column in February 2021.

Generate all the derived data in a local s3/ctbk directory (mirroring s3://ctbk):

ctbk spj create

  • spj stands for station-pair-json (the final derived data product in the diagram above)
  • creating spj requires creating all predecessor datasets
    • Default "root" for each dataset is the local folder s3/
      • No existing data will be found there (on your computer), so it will be computed and saved
      • One exception to this is the initial TripdataZips, which are read from s3://tripdata by default
    • Previous examples use --s3 to point datasets to S3 locations, where they already exist
      • --s3 is equivalent to -ts3:/ or --root s3:/
      • This is in turn equivalent to --root csvs=s3:/ --root norm=s3:/ --root agg=s3:/ ...; all stages' URLs are prefixed with s3:/
      • Point at your own bucket with -ts3://my_bucket, which will result in files written under s3://my_bucket/ctbk/
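
The way a global root and per-alias overrides combine into URLs can be sketched as below. The resolve_roots and url_for helpers are hypothetical stand-ins, not ctbk's API; the sketch only assumes that each dataset lives at <root>/ctbk/<relative path> and that the default root is the local folder s3.

```python
def resolve_roots(root_args, default="s3"):
    """Split `--root` values into a global default and per-alias overrides."""
    overrides = {}
    for arg in root_args:
        alias, sep, value = arg.partition("=")
        if sep:
            overrides[alias] = value  # `<alias>=<value>`: one dataset type
        else:
            default = arg             # bare `<value>`: global default
    return default, overrides

def url_for(alias, rel_path, root_args):
    """Build `<root>/ctbk/<rel_path>` using the root that applies to `alias`."""
    default, overrides = resolve_roots(root_args)
    root = overrides.get(alias, default)
    return f"{root}/ctbk/{rel_path}"

# Mirrors `ctbk --s3 -tnorm=tmproot ...`: everything on S3 except "norm".
args = ["s3:/", "norm=tmproot"]
norm_url = url_for("norm", "normalized/202101.parquet", args)
csv_url = url_for("csvs", "csvs/202101-citibike-tripdata.csv.gz", args)
```

Note how the root s3:/ joined with /ctbk/ produces s3://ctbk/, which is why the alias uses a single slash.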

⚠️ takes O(hours), streams ≈7GB of .csv.zips from s3://tripdata, writes ≈12GiB under s3/ctbk/ locally.

Abbreviated command names

Abbreviations for each subcommand are supported, e.g. n for normalized:

ctbk n -d2022- urls

See HasRoot for more info about top-level -r/--read, -w/--write, and -t/--root parameters.

GitHub Actions

ci.yml breaks each derived dataset into a separate job, for example:

ctbk dev gha dag

It also includes a final call to generate JSON used by the main plot at

ctbk ymrgtb-cd

Any changes are pushed to the www branch, which triggers the www.yml GHA.

The www.yml GHA:

The code for the site is under ../www.