The do_analytics_sessionization task is used to sessionize the Analytics data in the analytical database.
task:
type: do_analytics_sessionization
start_date: today -3 days
transforms:
# IP address filter
- type: filter
field: ip_address.raw
values:
- ::ffff:127.0.0.1 # Locahost IPv6
- 127.0.0.1 # Locahost IPv4
- 123.4.56.0 # We receive a hit every minute from this IP address
# Development URL's
- type: filter_regexp
field: page.url.raw
values:
- file:///.*
# The following filter EXCLUDES hits from the bot filter that have account_id=osb.direct
- type: bot_filter_exception
field: system.account_id
values:
- osb.direct
load:
table_id: hits_v5
table_id_trash: trash_v5
table_id_user_history: user_history_v5
table_id_profiles: profiles_v5
dataset_id: osb_analytics
# Which ids are stitching keys?
ids:
- key: ga
regexp: ^GA[0-9]\\.[0-9]\\.[0-9]+\\.[0-9]+$
stitching: no
- key: email
regexp: ^[0-9a-f]{64}$
stitching: yes
- key: account_id
regexp: ^(tvg_)*[a-z0-9]{32}$
stitching: yes
# The validation below checks if all hits up to and including yesterday are sessionized
validate:
- query: |
SELECT DATE(DATETIME(tstamp, 'Europe/Amsterdam')) AS date_local,
COUNT(CASE WHEN session.number IS NULL THEN 1 END) AS hits_no_session,
COUNT(distinct CASE WHEN session.number IS NULL THEN user_id END) AS users_no_session,
COUNT(CASE WHEN session.number IS NOT NULL THEN 1 END) AS hits_session
FROM `{{ load.project_id }}.{{ load.dataset_id }}.{{ load.table_id }}` h
WHERE DATE(tstamp, 'Europe/Amsterdam') = CURRENT_DATE('Europe/Amsterdam')-1
AND hit_type <> 'ids'
AND coalesce(event.interaction, TRUE) = TRUE
GROUP BY 1
type: number
operator: '=='
value: 0
field: hits_no_session
- type: number
operator: '>'
value: 0
field: hits_session
The properties below can be used at root level.
property | type | required | description |
---|---|---|---|
type | enumerator | yes | Set to do_sessionize |
start_date | relative or absolute date or date & time | yes | Start date of the period that will be selected in the datasource. Can be filled with an absolute or relative date. Read more about relative date and time here. It is used by this task to scan the hits table for unsessionized hits. The start_date is then corrected to timestamp of the earliest non-sessionized hit. This is done to reduce data scanning to a minimum. |
The `transforms` part of the configuration enables you to manipulate data before you load it in your data lake. You have to add all transform as an array.
transforms:
# IP address filter
- type: filter
exclude: yes
field: ip_address.raw
values:
- ::ffff:127.0.0.1 # localhost IPv6
- 127.0.0.1 # Locahost IPv4
# Test browsers (e.g. HeadlessChrome)
- type: filter_regexp
exclude: yes
field: user_agent.raw
values:
- .*HeadlessChrome.*
# The following filter EXCLUDES hits from the bot filter that have account_id=osb.direct
- type: bot_filter_exception
field: system.account_id
values:
- osb.direct
Below are the types of transforms. You have to add them as an array. Below is an explanation per transform on how to use it exactly.
Filter out the hits that match the string filter and move them to the `trash` table.
property | type | required | description |
---|---|---|---|
type | string | no | Set to filter |
field | string | no | Name of the table field. Only works on fields at root level. |
values | array of strings | no | Filter values. If filter matches positive on any of the values, the complete hit is moved to the trash table. The value of the system.status field will be 1 and the run_id will be put in the system.run_id. |
Filter out the hits that match the regular expression filter and move them to the `trash` table.
property | type | required | description |
---|---|---|---|
type | string | yes | Set to filter |
field | string | yes | Name of the table field. Only works on fields at root level. |
values | array of regular expressions | yes | Filter regular expression values. If filter matches positive on any of the values, the complete hit is moved to the trash table. The value of the system.status field will be 2 and the run_id will be put in the system.run_id. |
Makes sure that the bot filter does NOT check the hits in this filter. So if field = system.account_id
and the
value = osb.direct
, this means that the bot filter will be applied to all hits except the hits where
system.account_id=osb.direct
. All other hits from bots are moved to the `trash` table. A hit from a bot is where the user_agent.device_class
or user_agent.class
contains robot
.
property | type | required | description |
---|---|---|---|
type | string | yes | Set to filter |
field | string | yes | Name of the table field. |
values | array of strings | yes | Filter string values. If filter matches positive on any of the values, the complete hit is NOT checked if it is a hit from a bot. All the rest is moved to the trash table. The value of the system.status field will be 5 and the run_id will be put in the system.run_id. |
property | type | required | description |
---|---|---|---|
type | enumerator(redo,reset) | no | redo = Use if you want to redo the sessionization partially. In combination with a start_date in task it will first undo the sessionization of the previous runs and the re-sessionize the hits as usual.
reset = Use if you want to completely redo the sessionization. It will drop the profiles, sessions, user_profiles and trash tables before undoing all sessionization. |
start_date | relative or absolute date or date & time | yes | Start date of the period that will be selected in the datasource. Can be filled with an absolute or relative date. Read more about relative date and time here. It is used to determine the run_id that sessionized the hit at the start_date. All hits from that run_id and the run_ids after will be re-sessionized. Handle with care. It might lead to new numbers in some cases, especially with changed filters. |
property | type | required | description |
---|---|---|---|
project_id | string | no | BigQuery only. Contains the ID of the project that holds all tables. |
dataset_id | string | no | BigQuery only. Contains the ID of the dataset that has all the tables. |
table_id | string | yes | BigQuery only. Contains the ID of the table that holds the (short-term) hits. This table is time partitioned by the hour. It contains hits including and later than 160 days ago. |
table_id_longterm | string | yes | BigQuery only. Contains the ID of the table that holds the longterm hits. It is time partitioned by day. It contains hit up till (and does not include) 160 days ago. |
table_id_trash | string | yes | BigQuery only. Contains the ID of the trash table. Holds all the hits that matched the filters. |
table_id_user_history | string | yes | BigQuery only. Contains the ID of the user history table. Contains the last session number per cookie of the last X runs. |
table_id_profiles | string | yes | BigQuery only. Contains the ID of the profiles table. |
site_id | string | no | Only sessionizes hits that have this site_id. Leaves all other hits in the hits table. It does not move hits to the trash. For that purpose, you should use a filter. |
stitching | array of objects | no | Use key , include and regexp . Stitch users on these keys. Only consider key values that match the regexp . |
ids | array of strings | no | Ids to add to the profiles table. |