Skip to content

Commit

Permalink
Remove "salt" from sessions
Browse files Browse the repository at this point in the history
This doesn't really add anything, not really. The way this has worked
for a long time (since 2020, a1aa1c5) is:

1. A sessionHash is created as hash(salt + User-Agent + IP).
2. Store this in memory as a sessionHash→UUIDv4 map.
3. Use this UUIDv4 in the database and such.
4. The salt changes every few hours, resulting in new hashes and new
   UUIDs

So it never hits the disk, and the only thing the salt really adds is
rotation of the sessionHashes. This is different from before, where the
session ID was stored directly, but that was a long time ago and in the
very early versions only. So now it just expire the sessionHash after a
few hours.
  • Loading branch information
arp242 committed Apr 22, 2024
1 parent 84cec37 commit 533d0b7
Show file tree
Hide file tree
Showing 14 changed files with 150 additions and 328 deletions.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ Features

[privacy]: https://www.goatcounter.com/privacy
[gdpr]: https://www.goatcounter.com/gdpr
[sessions]: https://github.com/arp242/goatcounter/blob/master/docs/sessions.md
[sessions]: http://www.goatcounter.com/help/sessions


Getting data in to GoatCounter
Expand Down
3 changes: 2 additions & 1 deletion cmd/goatcounter/help.go
Original file line number Diff line number Diff line change
Expand Up @@ -237,6 +237,7 @@ commas.
all Show debug logs for all of the below.
acme ACME certificate creation.
cli-trace Show stack traces in errors on the CLI.
cron Background "cron" jobs.
cron-acme Cron jobs for ACME certificate creations.
dashboard Dashboard view.
Expand All @@ -245,7 +246,7 @@ commas.
import-api Imports from the API.
memstore Storing of pageviews in the database.
monitor Additional logs in "goatcounter monitor" .
cli-trace Show stack traces in errors on the CLI.
session Internal "session" generation to track visitors .
startup Some additional logs during startup.
vacuum Deletion of old deleted sites and old pageviews.
`
24 changes: 12 additions & 12 deletions cron/size_stat_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -39,12 +39,12 @@ func TestSizeStats(t *testing.T) {
want := `{
"more": false,
"stats": [
{"count": 0, "id": "phone", "name": "Phones"},
{"count": 0, "id": "largephone", "name": "Large phones, small tablets" },
{"count": 0, "id": "tablet", "name": "Tablets and small laptops"},
{"count": 2, "id": "desktop", "name": "Computer monitors"},
{"count": 0, "id": "desktophd", "name": "Computer monitors larger than HD"},
{"count": 0, "id": "unknown", "name": "(unknown)"}
{"count": 0, "id": "phone", "name": ""},
{"count": 0, "id": "largephone", "name": "" },
{"count": 0, "id": "tablet", "name": ""},
{"count": 2, "id": "desktop", "name": ""},
{"count": 0, "id": "desktophd", "name": ""},
{"count": 0, "id": "unknown", "name": ""}
]
}`

Expand All @@ -71,12 +71,12 @@ func TestSizeStats(t *testing.T) {
want = `{
"more": false,
"stats": [
{"count": 0, "id": "phone", "name": "Phones"},
{"count": 0, "id": "largephone", "name": "Large phones, small tablets" },
{"count": 0, "id": "tablet", "name": "Tablets and small laptops"},
{"count": 3, "id": "desktop", "name": "Computer monitors"},
{"count": 0, "id": "desktophd", "name": "Computer monitors larger than HD"},
{"count": 1, "id": "unknown", "name": "(unknown)"}
{"count": 0, "id": "phone", "name": ""},
{"count": 0, "id": "largephone", "name": "" },
{"count": 0, "id": "tablet", "name": ""},
{"count": 3, "id": "desktop", "name": ""},
{"count": 0, "id": "desktophd", "name": ""},
{"count": 1, "id": "unknown", "name": ""}
]
}`
if d := ztest.Diff(zjson.MustMarshalString(have), want, ztest.DiffJSON); d != "" {
Expand Down
1 change: 0 additions & 1 deletion cron/tasks.go
Original file line number Diff line number Diff line change
Expand Up @@ -217,6 +217,5 @@ func vacuumDeleted(ctx context.Context) error {

func sessions(ctx context.Context) error {
goatcounter.Memstore.EvictSessions()
goatcounter.Memstore.RefreshSalt()
return nil
}
164 changes: 0 additions & 164 deletions docs/sessions.md
Original file line number Diff line number Diff line change
@@ -1,164 +0,0 @@
Session tracking
================

*2022 preface*: this document mostly served as a design document for when I
wrote the session tracking code back in 2019. It's not intended as an overview
of "all possible techniques one could use in $current_year"; the examination of
existing solutions is likely incomplete, and sometimes probably outdated as
platforms evolve.

In short: I don't keep it updated.

---


"Session tracking" allows more advanced tracking than just the "pageview"
counter we have now. A "session" is a single browsing session people have on a
website.

Right now, every pageview shows up as-is in the dashboard, including things like
page refreshes. There is also no ability to determine things like conversion
rates and the like.

Goals:

- Avoid requiring GDPR consent notices.

- The ability to view the number of "unique visitors" rather than just
"pageviews".

- Basic "bounce rate" and "conversion rate"-like statistics; for example, if
someone enters on /foo we want to be able to see how many leave after visiting
just that page, or how many end up on /signup.

Non-goals:

- Track beyond a single browsing session.


Existing solutions
------------------

An overview of existing solutions that I'm aware or with roughly the same goals.

Ackee
-----

https://github.com/electerious/Ackee/blob/master/docs/Anonymization.md

> Uses a one-way salted hash of the IP, User-Agent, and sites.ID. The hash changes
> daily and is never stored.
>
> This way a visitor can be tracked for one day at the most.
This seems like a decent enough approach, and it doesn't require storing any
information in the browser with e.g. a cookie.

It does generate a persistent device-unique identifier though, and I'm not sure
this is enough anonymisation in the context of the GDPR (although it may be?
It's hard to say anything conclusive about this at the moment)

Fathom
------

https://usefathom.com/blog/anonymization

> Unique siteviews are tracked by a hash which includes the site.ID; unique
> pageviews are tracked by as hash which includes the site.ID and path being
> tracked.
>
> To mark previous requests "finished" (not sure what that means) the current
> pageview's hash is removed and moved to the newest pageview.
I'm not entirely sure if it's actually better or more "private" than Ackee's
simpler method. The Fathom docs mention that they "can’t put together an
anonymous, individual user’s browsing habits", but is seeing which path people
take on your website really tracking someone's "browsing habits", or can this
lead to identifying a "natural person"?

Or, to give an analogy: I'm not sure if there's anything wrong with just seeing
where your customers go in your store. The problems start when you start
creating profiles of those people on recurring visits, or when you see where
they go to other stores, too.


SimpleAnalytics
---------------

https://docs.simpleanalytics.com/uniques

> If the Referer header is another.site or missing it's counted as a unique
> visit; if it's mysite.com then it's counted as a recurring visit.
A lot of browsers/people don't send a Referer header (somewhere between ~30% and
~50%); this number is probably higher since Referer is set more often for
requests in the same domain, but probably not 100%.

This is a pretty simple method, but it doesn't allow showing bounce or
conversion rates or other slightly more advanced statistics.


Simple Web Analytics
--------------------

https://simple-web-analytics.com/

Uses the browser cache to achieve season tracking: The endpoint being called by
the tracking code sets the `Expire` header to the next calendar day of the
user's timezone.

This ensures the server only gets hit once per day per session; subsequent
requests are not tracked at all.

In practice there are cases where a session is counted more than once. Firefox
for example ignores the HTTP cache when the user hits the reload button.

It's a simple and un-complex approach, and doesn't require storing any
information about the user (hashed or otherwise) on the server. The downside is
that intermediate requests are not tracked at all, which would make it
unsuitable for GoatCounter.


GoatCounter's solution
----------------------

- Create a server-side hash: hash(site.ID, User-Agent, IP, salt) to identify
the client without storing any personal information directly.

- Don't persist the hash to disk; this isn't really needed as we just want to
track the "browsing session" rather than re-identify someone coming back the
next day.

- The salt is rotated every 4 hour on a sliding schedule; when a new pageview
comes in we try to find an existing session based on the current and previous
salt. This ensures there isn't some arbitrary cut-off time when the salt is
rotated. After 8 hours, the salt is permanently deleted.

- If a user visits the next time, they will have the same hash, but the system
has forgotten about it by then.

The whole hashing thing is *kind of* superfluous since the data is never stored
to disk with one exception: it's temporarily stored on shutdown, to be read and
deleted on startup. It doesn't hurt to hash the data though, and better safe
than sorry.

I considered generating the ID on the client side as a session cookie or
localStorage, but this is tricky due to the ePrivacy directive, which requires
that *"users are provided with clear and precise information in accordance with
Directive 95/46/EC about the purposes of cookies"* and should be offered the
*"right to refuse"*, making exceptions only for data that is *"strictly
necessary in order to provide a [..] service explicitly requested by the
subscriber or user"*.

Ironically, using a cookie would not only make things simpler but also *more*
privacy friendly, as there would be no salt stored on the server, and the user
has more control. It is what it is 🤷

I'm not super keen on adding the IP address in the hash, as IP addresses are
quite ephemeral; think about moving from WiFi to 4G for example, or ISPs who
recycle IP addresses a lot. There's no clear alternatives as far as I know
though, but it may be replaced with something else in the future.

Fathom's solution with multiple hashes seems rather complex, without any clear
advantages; using just a single hash like this already won't store more
information than before, and the hash is stored temporarily.
34 changes: 6 additions & 28 deletions handlers/count_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -270,30 +270,6 @@ func TestBackendCountSessions(t *testing.T) {
}
}

rotate := func(ctx context.Context) {
now = now.Add(12 * time.Hour)
oldCur, _ := goatcounter.Memstore.GetSalt()

goatcounter.Memstore.RefreshSalt()

_, prev := goatcounter.Memstore.GetSalt()
if string(prev) != string(oldCur) {
t.Fatalf("salts not cycled?\noldCur: %s\nprev: %s\n", string(oldCur), string(prev))
}
}

// Ensure salts aren't cycled before they should.
beforeCur, beforePrev := goatcounter.Memstore.GetSalt()
now = now.Add(1 * time.Hour)
goatcounter.Memstore.RefreshSalt()
afterCur, afterPrev := goatcounter.Memstore.GetSalt()

before := string(beforeCur) + " → " + string(beforePrev)
after := string(afterCur) + " → " + string(afterPrev)
if before != after {
t.Fatalf("salts cycled too soon\nbefore: %s\nafter: %s", before, after)
}

send(ctx1, "test")
send(ctx1, "test")
send(ctx1, "other")
Expand All @@ -308,17 +284,19 @@ func TestBackendCountSessions(t *testing.T) {
want := []int{1, 1, 2, 3, 3, 1, 2}
checkSess(append(hits1, hits2...), want)

// Rotate, should still use the same sessions.
rotate(ctx1)
// Should still use the same sessions.
goatcounter.SessionTime = 1 * time.Second
goatcounter.Memstore.EvictSessions()
send(ctx1, "test")
send(ctx2, "test")
hits1 = checkHits(ctx1, 6)
hits2 = checkHits(ctx2, 3)
want = []int{1, 1, 2, 3, 3, 1, 2, 1, 3}
checkSess(append(hits1, hits2...), want)

// Rotate again, should use new sessions from now on.
rotate(ctx1)
// Should use new sessions from now on.
now = time.Date(2019, 6, 18, 14, 42, 2, 0, time.UTC)
goatcounter.Memstore.EvictSessions()
send(ctx1, "test")
send(ctx2, "test")
hits1 = checkHits(ctx1, 7)
Expand Down
2 changes: 1 addition & 1 deletion handlers/http_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ func TestMain(m *testing.M) {
// Don't need tests.
"", "bosmang.gohtml", "bosmang_site.gohtml", "bosmang_cache.gohtml",
"bosmang_bgrun.gohtml", "bosmang_metrics.gohtml", "bosmang_sites.gohtml",
"i18n_list.gohtml", "i18n_show.gohtml",
"i18n_list.gohtml", "i18n_show.gohtml", "i18n_manage.gohtml",

// Tested in tpl_test.go
"email_export_done.gotxt", "email_forgot_site.gotxt",
Expand Down
24 changes: 12 additions & 12 deletions hit_stats_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -135,32 +135,32 @@ func TestHitStats(t *testing.T) {
"stats": [
{
"id": "phone",
"name": "Phones",
"name": "",
"count": 0
},
{
"id": "largephone",
"name": "Large phones, small tablets",
"name": "",
"count": 1
},
{
"id": "tablet",
"name": "Tablets and small laptops",
"name": "",
"count": 0
},
{
"id": "desktop",
"name": "Computer monitors",
"name": "",
"count": 1
},
{
"id": "desktophd",
"name": "Computer monitors larger than HD",
"name": "",
"count": 0
},
{
"id": "unknown",
"name": "(unknown)",
"name": "",
"count": 0
}
]
Expand Down Expand Up @@ -288,32 +288,32 @@ func TestListSizes(t *testing.T) {
"stats": [
{
"id": "phone",
"name": "Phones",
"name": "",
"count": 1
},
{
"id": "largephone",
"name": "Large phones, small tablets",
"name": "",
"count": 1
},
{
"id": "tablet",
"name": "Tablets and small laptops",
"name": "",
"count": 1
},
{
"id": "desktop",
"name": "Computer monitors",
"name": "",
"count": 1
},
{
"id": "desktophd",
"name": "Computer monitors larger than HD",
"name": "",
"count": 3
},
{
"id": "unknown",
"name": "(unknown)",
"name": "",
"count": 1
}
]
Expand Down
Loading

0 comments on commit 533d0b7

Please sign in to comment.