
DON'T collect PII with onsite search. DO collect onsite search data responsibly.

Google recommends avoiding sending PII to GA4 when collecting Analytics data. With privacy in mind and at the forefront of data collection best practices, this is common sense: privacy first, and privacy by design.


So, when using a rock-solid tactic like on-site search that can drive results for SEO, product optimisation, recommendation algorithms, and experimentation, why is it so often overlooked as a major risk vector for ingesting PII?


Believe it.

It happens.


Users will search for their own email address, their login, their post code, their account number. If they've lost key details, quite sensibly, they'll search for help documents, and related pages. Don't let their normal behaviour put you at risk of ingesting personally identifiable information.


Here's a privacy by design approach to collecting site search data without the risk of having your GA4 property deleted.


GA4 Configuration

Admin -> Data Streams -> Data Stream -> Redact Data:

To start with, we might choose to turn on the obvious "don't ingest email addresses" option:

However, there's still quite a lot of scope for PII ingest: credit card numbers, postcodes, addresses, first and last names.


To be properly safe, we need to go further. We add the name of the query string parameter that holds our search keyword:

That's right, we're excluding all search terms from GA4. We'll happily collect the search event to use in reporting and the workloads mentioned earlier, but we can't afford to collect PII here...


Onsite Search Reporting

Let's not mess around: just go and follow this excellent guide on AnalyticsMania.com to set up your reports.


The goal is to be able to see the session key event rate, and revenue for sessions where searches happened compared to those where searches did not. Using an exploration to get a breakdown by search page is useful.


This is as far as the GA4 reporting interface can safely go - a keyword breakdown isn't possible as keywords are now redacted to defend against PII ingestion.


Onsite Search Keyword Data Redaction

The redaction of the search term happens at the point of data collection. We can see this in preview mode. If I hit this page to simulate searching for a fake email address, sending that to GA4 would be bad, right? https://www.dugadigital.com/about?search_keyword=duga@dugadigital.com


Here's what the data looks like in GTM on the client side in preview mode:


But the Network tab tells us exactly what's being sent from the browser to our sGTM endpoint - notice the last entry - ep.search_keyword: (redacted):

And in the sGTM Preview Mode, checking the incoming HTTP request confirms the search_keyword is redacted at source: ep.search_keyword=(redacted)


This means anything on the search_keyword event parameter being sent by a GA4 tag will be redacted.


 

It doesn't mean the original data is gone, though. We just need to send it to sGTM without using a GA4 tag. Then we know it won't land in GA4, but we can still use and store the data safely in another repository.
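Conceptually, that's all the low-code tag in the next section does: read the keyword from the page URL and POST it straight to a custom sGTM path, never touching a GA4 tag. A minimal sketch - the payload field names and endpoint path here are illustrative assumptions, not the tag's actual internals:

```javascript
// Sketch: build a raw search payload from the page URL and send it directly
// to the sGTM endpoint, bypassing any GA4 tag (and therefore GA4 redaction
// and ingestion). Field names and the endpoint path are assumptions.
function buildSearchPayload(pageUrl) {
  const url = new URL(pageUrl);
  return {
    event_name: "search",
    search_keyword: url.searchParams.get("search_keyword"),
    page_location: pageUrl
  };
}

// In the browser, this payload would be POSTed like:
// fetch("https://sgtm.example.com/data/search", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildSearchPayload(location.href))
// });
```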


Safe Onsite Search Keyword Data Collection

Client Side

Let's use some 'off-the-shelf' components to build our solution. It's relatively easy to write your own, but it's hard to do it really well so we'll go low-code.


We'll use the stape.io Data Tag:

We'll send a search event:

Here's our event Data payload:

Very importantly, we need to send some metadata with the request:

That means we include the page path, client id, and session id information, which is necessary for joining the data later on.
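Those identifiers live in the standard GA4 cookies: _ga holds the client id, and the property-specific _ga_XXXXXXX cookie holds the raw session state. A hedged sketch of reading them, where the _ga_XXXXXXX suffix is a placeholder for your property's measurement id:

```javascript
// Sketch: read the raw GA4 identifiers from their cookies so they can ride
// along with the search keyword. "_ga_XXXXXXX" is a placeholder - the suffix
// depends on your GA4 property.
function getCookie(name, cookieString) {
  const entry = cookieString.split("; ").find(c => c.startsWith(name + "="));
  return entry ? entry.slice(name.length + 1) : null;
}

// In the browser:
// const clientId  = getCookie("_ga", document.cookie);         // e.g. "GA1.1.123456789.1700000000"
// const sessionId = getCookie("_ga_XXXXXXX", document.cookie); // e.g. "GS1.1.1700000000.5.1.1700000300.0.0.0"
```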


And we'll send the request to:

The data payload sent to our sGTM endpoint looks like this. It's got the search_keyword (not redacted!), the page location where the search happened, and the client id and session id in a very raw state from the GA4 cookies:

We need a non-GA4 server side component to receive this request and do something useful with it.


Server Side

It's logical to receive a request from the stape.io tag with a stape.io client, so that's what we'll do, having imported the Data Tag Client from https://stape.io/solutions/data-tag-client:


Notice our specific /data/search accepted path setting.


Now we can use the "Write to BigQuery" tag template by TRKKN to easily leverage the BigQuery API. This tag will write the chosen fields to our BQ table like this:


We need to trigger the Write to BigQuery tag on the search event, and only when the request comes from the stape.io Data Tag. We've created the stape.io client to claim the request, so we'll create a basic trigger to fire when the client name matches:


Now, in preview mode on the sGTM container we can see the GA4 search event, and the stape search event arriving:

Each event is claimed by the specific client, and the correct tag is fired. Redacted data is sent to GA4, the raw subset is written to BigQuery. Now we can get to work and start to use the data:

 

Using safely collected onsite search data combined with GA4 analytics data

Okay - so armed with our BigQuery table of safely siphoned search terms, what next?

This table in isolation has multiple uses. "Glom" your search keywords en masse to classify them, analyse volumes, categories and common topics, catch misspellings, or even apply natural language processing for sentiment analysis of long-form questions.


With our siphoned data containing client_id and session_id raw cookie values, we have the ability to join our search terms with the exact sessions where they happened.
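The join key needs a little massaging first: the session id sits inside the raw GS1-format cookie value as the first numeric field after the "GS1.1." prefix. A sketch of that extraction, mirroring the regexp_extract used in the SQL join:

```javascript
// Sketch: pull the numeric GA4 session id out of a raw "GS1.1...." cookie
// value - it's the first numeric field after the "GS1.1." prefix.
function extractSessionId(rawValue) {
  const match = rawValue.match(/^GS1\.1\.(\d+)\./);
  return match ? match[1] : null;
}
```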


It's a short conversation with the AI of your choice to pull the session id values where searches happened. Let's put those in a table called search_sessions and join our two data sets like so:

select
  search_keyword,
  page_location
from
  `search_siphon`
where
  -- the session id is the first numeric field after the "GS1.1." cookie prefix
  regexp_extract(session_id, r"^GS1\.1\.(\d+)\.") in (
    select
      cast(session_id as string) as search_session_id
    from
      `search_sessions`
  )

And there we have it.

Now we've got the ability to pivot by search keyword for session key event rates, ecommerce conversion rates, revenue, even profit per search keyword.


Hey, let's even pull some channel data in there, device graphs, maybe local weather.


You get the picture. And we only touched code in the SQL part - everything else in this solution is approachable off-the-shelf.

