Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(smart-autocomplete): add migration for smart autocomplete view on eap_items #6912

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

volokluev
Copy link
Member

Consolidating the learnings from the previous smart autocomplete experiments into one view:

  • only indexing the keys for greater compaction
  • storing a hash array for numerical and string attrbiutes
  • bloom filter index on hash columns

The following was tested with locally inserted data:

  1. De-dupication of rows with the same attribute keys
  2. Searching by attribute hash
  3. bloom filter index used when searching the same hash column

@volokluev volokluev requested review from a team as code owners February 25, 2025 18:54
Copy link

github-actions bot commented Feb 25, 2025

This PR has a migration; here is the generated SQL for ./snuba/migrations/groups.py ()

-- start migrations

-- forward migration events_analytics_platform : 0030_smart_autocomplete_items
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_1_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) ENGINE ReplicatedReplacingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_1_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_1_dist (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_1_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_1_local ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_1_mv TO eap_item_co_occurring_attrs_1_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayConcat(attributes_string, attributes_float)), attributes_string Array(String), attributes_float Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayConcat(attributes_string, attributes_float)))) AS 
SELECT
    organization_id AS organization_id,
    project_id AS project_id,
    item_type as item_type,
    toMonday(timestamp) AS date,
    retention_days as retention_days,
    arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
    arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0030_smart_autocomplete_items




-- backward migration events_analytics_platform : 0030_smart_autocomplete_items
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_mv;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_local;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_1_dist;
-- end backward migration events_analytics_platform : 0030_smart_autocomplete_items

@volokluev
Copy link
Member Author

volokluev commented Feb 25, 2025

@kylemumma asked:

smart autocomplete is your attempt to solve the issues with the TraceItemAttributes and TraceItemValues endpoints right? If thats the case, how come we still need the str_attrs and num_attrs mvs that im building rn?

  1. If the autocomplete is empty, how do I suggest any attributes?
  2. These views are not good at checking whether an attribute exists for a specific project, it has to search all the events to do that
  3. This view does not support any value autocomplete

organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toDate(timestamp) AS date,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help to do a toMonday on this to keep even less data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

engine=table_engines.ReplacingMergeTree(
storage_set=self.storage_set_key,
primary_key="(organization_id, project_id, date, key_hash)",
order_by="(organization_id, project_id, date, key_hash)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the retention days in the sort key to deduplicate between 2 identical rows with short and long retention period? I wouldn't want the value to vanish when it's actually a valid value for a span stored for a longer period.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kylemumma
Copy link
Member

@kylemumma asked:

smart autocomplete is your attempt to solve the issues with the TraceItemAttributes and TraceItemValues endpoints right? If thats the case, how come we still need the str_attrs and num_attrs mvs that im building rn?

  1. If the autocomplete is empty, how do I suggest any attributes?
  2. These views are not good at checking whether an attribute exists for a specific project, it has to search all the events to do that
  3. This view does not support any value autocomplete

Im a bit confused by these answers, I think it could be because I don't have a good understanding of how your table is used. Could you please explain or link a doc about autocomplete?

storage_set_key = StorageSetKey.EVENTS_ANALYTICS_PLATFORM
granularity = "8192"

local_table_name = "eap_item_attrs_1_local"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the name of the table be more related to autocomplete? it may be hard to distinguish the difference between this table and the item_attrs mv I am making.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, done

engine=table_engines.ReplacingMergeTree(
storage_set=self.storage_set_key,
primary_key="(organization_id, project_id, date, item_type, key_hash)",
order_by="(organization_id, project_id, date, item_type, key_hash, retention_days)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the purpose of key_hash is to make ReplacingMergeTree de-duplication more efficient here compared to using attributes_string or attributes_float?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

@volokluev volokluev force-pushed the volo/smart_autocomplete/items_mv branch from 631bdf3 to 213c1ac Compare February 27, 2025 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants