feat(smart-autocomplete): add migration for smart autocomplete view on eap_items #6912
base: master
This PR has a migration; here is the generated SQL for -- start migrations
-- forward migration events_analytics_platform : 0030_smart_autocomplete_items
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_1_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) ENGINE ReplicatedReplacingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_1_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_1_dist (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_1_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_1_local ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_1_mv TO eap_item_co_occurring_attrs_1_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) AS
SELECT
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toMonday(timestamp) AS date,
retention_days as retention_days,
arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0030_smart_autocomplete_items
-- backward migration events_analytics_platform : 0030_smart_autocomplete_items
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_mv;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_local;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_dist;
-- end backward migration events_analytics_platform : 0030_smart_autocomplete_items
@kylemumma asked:
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toDate(timestamp) AS date,
I think it would help to do a toMonday on this to keep even less data.
Done
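To illustrate the suggestion above: rounding each timestamp down to the Monday of its week means up to seven daily rows carrying the same attribute-key set collapse into one weekly row. Below is a minimal Python sketch of that bucketing, assuming it mirrors what toMonday does for dates. The helper name `to_monday` and the sample dates are hypothetical and not part of this PR.

```python
from datetime import date, timedelta


def to_monday(d: date) -> date:
    """Round a date down to the Monday of its week (Monday has weekday() == 0)."""
    return d - timedelta(days=d.weekday())


# Seven consecutive sample days, Monday 2025-03-03 through Sunday 2025-03-09.
days = [date(2025, 3, 3) + timedelta(days=i) for i in range(7)]

# Daily bucketing keeps one distinct date per day ...
daily_buckets = set(days)
# ... while weekly bucketing maps all seven days to the same Monday.
weekly_buckets = {to_monday(d) for d in days}
```

With daily bucketing the sample week yields seven distinct `date` values; with weekly bucketing it yields one, so identical key sets seen on different days of the week can be deduplicated into a single row.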
engine=table_engines.ReplacingMergeTree(
storage_set=self.storage_set_key,
primary_key="(organization_id, project_id, date, key_hash)",
order_by="(organization_id, project_id, date, key_hash)",
Should we keep retention_days in the sort key so that two otherwise identical rows with short and long retention periods are not deduplicated into one? I wouldn't want the value to vanish when it's still valid for a span stored for a longer period.
Done
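The concern in this thread can be shown with a simplified model of ReplacingMergeTree, which (without a version column) keeps the last inserted row among rows sharing the same sorting key. The sketch below is a toy simulation, not Snuba or ClickHouse code: `Row`, `replacing_merge`, and the sample values are hypothetical, and it ignores partitioning and the timing of background merges.

```python
from typing import NamedTuple


class Row(NamedTuple):
    organization_id: int
    project_id: int
    date: str
    item_type: int
    key_hash: int
    retention_days: int


def replacing_merge(rows, sort_key):
    """Toy merge: keep only the last inserted row for each sorting-key value."""
    merged = {}
    for row in rows:
        merged[tuple(getattr(row, col) for col in sort_key)] = row
    return list(merged.values())


# Two rows that are identical except for their retention period (sample data).
rows = [
    Row(1, 2, "2025-03-03", 1, 0xABC, 90),
    Row(1, 2, "2025-03-03", 1, 0xABC, 30),
]

base_key = ("organization_id", "project_id", "date", "item_type", "key_hash")

# Without retention_days in the key, only the 30-day row survives the merge.
without_retention = replacing_merge(rows, base_key)
# With retention_days in the key, both rows are kept.
with_retention = replacing_merge(rows, base_key + ("retention_days",))
```

In the first case the surviving row would expire after 30 days even though a 90-day item with the same keys still exists; keeping retention_days in the sorting key avoids that.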
I'm a bit confused by these answers; I think it could be because I don't have a good understanding of how your table is used. Could you please explain or link a doc about autocomplete?
storage_set_key = StorageSetKey.EVENTS_ANALYTICS_PLATFORM
granularity = "8192"

local_table_name = "eap_item_attrs_1_local"
Should the name of the table be more related to autocomplete? It may be hard to distinguish between this table and the item_attrs MV I am making.
good call, done
engine=table_engines.ReplacingMergeTree(
storage_set=self.storage_set_key,
primary_key="(organization_id, project_id, date, item_type, key_hash)",
order_by="(organization_id, project_id, date, item_type, key_hash, retention_days)",
the purpose of key_hash is to make ReplacingMergeTree de-duplication more efficient here compared to using attributes_string or attributes_float?
correct
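A rough sketch of the idea behind key_hash: hashing the sorted concatenation of attribute keys yields one fixed-width value per distinct key set, so the sorting key compares a single 64-bit integer instead of two string arrays. The Python below only illustrates that shape. It uses `hashlib.blake2b` as a stand-in, so its values will not match ClickHouse's cityHash64, and the sample attribute names are made up.

```python
import hashlib
import json


def key_hash(attributes_string, attributes_float):
    """64-bit digest of the sorted, concatenated attribute keys.

    Stand-in for cityHash64(arraySort(arrayConcat(...))); the digest
    algorithm here is blake2b, NOT cityHash64.
    """
    keys = sorted(attributes_string + attributes_float)
    # JSON-encode the sorted list so key boundaries stay unambiguous.
    payload = json.dumps(keys).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


# The same key set in a different arrival order maps to the same hash ...
h1 = key_hash(["http.method", "user.id"], ["duration"])
h2 = key_hash(["user.id", "http.method"], ["duration"])
# ... while a different key set is hashed from a different payload.
h3 = key_hash(["http.method"], ["duration"])
```

Because the keys are sorted before hashing, rows whose attributes arrive in a different order still share a sorting-key value and can be collapsed by the merge.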
631bdf3 to 213c1ac
Consolidating the learnings from the previous smart autocomplete experiments into one view:
The following was tested with locally inserted data: