“Not Provided” and Google’s Conflicting Messages

In the not-too-distant past, you could see the search terms people used to reach your site through Google. So, to give an entirely random example, I know that three people visited my site yesterday having typed in ‘Myra Hindley’ (on account of my having a page on the subject).

I know those people stayed on the site for an average of 3 minutes and looked at 3 or 4 pages. Looking back over the last couple of months that seems to be fairly standard for keywords relating to Myra Hindley. From this, I can gauge that actually the content must be pretty decent and that I’ve done a good job.

That aligns closely with what Google have been telling me I should be doing for many years now: building great quality pages. I have a perfect set of metrics by which to measure this.

  1. I know that the site ranks in the first 3 pages for “Myra Hindley”
  2. I know that people who visit through that keyword rarely bounce
  3. I know that people who visit through that keyword stay on the site for a good length of time and explore further.
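For anyone who wants to see the sums behind those three points, here is a rough sketch in Python. It is not how Analytics works internally; the Visit record, the keyword_metrics function and the sample figures are all made up for illustration, and a "bounce" is assumed to mean a visit that viewed exactly one page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Visit:
    keyword: str          # search phrase that brought the visitor in
    pages_viewed: int     # pages seen during the visit
    seconds_on_site: int  # total time spent on the site


def keyword_metrics(visits):
    """Group visits by keyword and summarise engagement for each one."""
    grouped = defaultdict(list)
    for visit in visits:
        grouped[visit.keyword].append(visit)

    summary = {}
    for keyword, group in grouped.items():
        count = len(group)
        # Assumption: a bounce is a visit that saw exactly one page.
        bounces = sum(1 for v in group if v.pages_viewed == 1)
        summary[keyword] = {
            "visits": count,
            "bounce_rate": bounces / count,
            "avg_pages": sum(v.pages_viewed for v in group) / count,
            "avg_seconds": sum(v.seconds_on_site for v in group) / count,
        }
    return summary


# Invented sample data, for illustration only.
visits = [
    Visit("myra hindley", 3, 170),
    Visit("myra hindley", 4, 200),
    Visit("myra hindley", 3, 170),
    Visit("piss dungeon", 1, 10),
    Visit("piss dungeon", 1, 20),
    Visit("piss dungeon", 2, 60),
]

for keyword, stats in keyword_metrics(visits).items():
    print(keyword, stats)
```

The whole exercise depends on each visit carrying its keyword; without that field there is nothing to group by.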

This is what Google has always said it wants: to serve great quality pages to its users. By giving me the necessary data to work with through Analytics, Google helps me to do that.

By contrast, the page for the recently discovered “London piss dungeon” news story fares worse. While traffic is higher due to the better position the page holds (what a claim! To be on the first page of Google for ‘piss dungeon’!) the metrics are worse. People tend to bounce pretty quickly and not explore further. There are a number of possible reasons for that, and if I were so minded, I could work harder to do more with that specific tranche of traffic.

But Google have started on the road to anonymising chunks of that search data. The problem is only going to get worse. Already, the site I run as my day job is seeing this anonymised data rapidly becoming the biggest ‘search phrase’ we have after branded traffic (and will, on current trends, soon surpass that).

Herein lies the dilemma. It’s clear that the traffic lumped into “not provided” is fairly decent in terms of search interaction, but how can I tell what lies within, or how to improve things further? Perhaps 900 of those visits come from some amazing first-page term I know nothing about, where people bounce after a couple of seconds. Or maybe it’s a tonne of great long tail phrases with good dwell times.
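A small worked example shows why the lump cannot be unpicked. The sketch below is in Python, with entirely hypothetical numbers (including the total of 1,000 visits) and a made-up aggregate function: two very different mixes of hidden keywords collapse into exactly the same “not provided” row.

```python
def aggregate(segments):
    """Collapse per-keyword (visits, bounces) pairs into one combined row,
    which is all a single 'not provided' entry can show."""
    visits = sum(v for v, _ in segments.values())
    bounces = sum(b for _, b in segments.values())
    return visits, bounces / visits


# Scenario A: one hidden first-page term that bounces badly,
# plus a small long tail that performs well. Figures are invented.
scenario_a = {
    "hidden first-page term": (900, 540),
    "long tail phrases": (100, 10),
}

# Scenario B: nothing but long tail phrases with middling engagement.
scenario_b = {
    "long tail phrases": (1000, 550),
}

print(aggregate(scenario_a))
print(aggregate(scenario_b))
```

Both scenarios report the same visit count and the same bounce rate, so the combined figure alone cannot say which one is happening.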

Shorn of that valuable bit of data, there is no way to tell, and experimenting with layouts and content and site structures becomes increasingly a shot in the dark – and a risky one at that.

Google is a company with internal factions. And at the minute, the privacy faction is beating the user experience faction. There’s more of a conflict between those aims than first meets the eye. On current trends, we’re all going to be working in the dark. Not only will that affect the work that site owners do, but also the quality of Google itself.

Update: further fascinating discussion on Google and privacy has arisen on Gizmodo. Go read!

Why you should ignore Schema

Once upon a time, knowledge was codified in books. As a method of information storage, books have several problems aside from physical storage, but the primary one is codification. This problem can be most commonly seen in a library.

If you’ve been to a library, you’ll know that shelves are arranged according to categories. Over here, Fiction. Over there Crime Fiction. On those shelves British History. On that shelf European Travel.

And as you walk around, it seems to make some kind of sense. I’m interested in British History, so that’s where I should head. But actually, it is an extraordinarily arbitrary assignation in a lot of cases. For example, if I’m interested in Henry VIII, I might find him not in ‘History’ but in ‘Biography.’ Similarly, the Hundred Years’ War might fall within “British History”, “European History”, general “History” or even “French History” if the library is so-minded. But the key characters might be covered in more detail in “Biography” and there may be further reading in the travel sections.

Often there are no clear boundaries. The Woman in White by Wilkie Collins could be straight ‘fiction’ or ‘crime fiction’ depending on how you see it. And it is the personal nature of our views that colours attempts to categorise content. In Britain, we tend to consider ‘British History’ as a distinct entity from ‘European History’ and yet to the French, that distinction may make no logical sense.

The internet freed us from the tyranny of categorisation.

No longer did a book have to exist where a bibliographer chose to assign it; it could sit within a web of links and be made accessible through search. Perhaps the best example is Wikipedia. To find the information there, you don’t spend a fruitless hour seeking through a categorisation system, but simply type whatever you want into the search box… et voilà! The Hundred Years’ War entry comes up. And throughout, there are links to ancillary information – places, people, dates, battles – none of them needing anything more than a click to access and without any recourse to esoteric knowledge about how some librarian has decided things ought to be ordered.

Meta-engines like Google allow us to access incalculable amounts of data from innumerable sources swiftly and accurately with nothing more than Pidgin English and the click of a button. People decry the creation of what is snobbishly called ‘shallow knowledge’ but these are often the voices of gatekeepers: the kind of people who would protect knowledge under what they see as their own professional or expert curatorship.

In fact, early attempts to impose order on the web are still with us: DMOZ, Yahoo Directory, Best of the Web and so on. There, either through automation or human intervention, an attempt is made to assign every website into whatever category seems best.

Why is this so bad? Consider a site which sells gardening products, offers gardening tips and has a gardening forum. It could easily sit within some shopping category, information category or community category – but why would you place it into any particular one? Actually, you are interested in why your lawn isn’t green and what you can do about it. The answer could lie in a handy hint, a product or a friendly forum and most likely in a journey that takes in all three. There is no simple category into which an answer or site can simply be put.

The proof of the pudding: when would you ever use a service like Yahoo Directory to find an answer to that question, when Google can offer you a hundred suggestions a minute?

And yet, for the past few years there have been numerous attempts to reimpose the spurious idea of curatorship onto the internet. The so-called ‘semantic web’ is one example, XML sitemaps another. Not content with having to fill your web page with content, the people behind the ‘semantic web’ wish you to add additional markup to that content.

The reason? Not so that humans can better understand your content, but that machines might be able to make decisions about how to treat it. Here is a piece of markup for ‘author’ so that machines can identify who wrote what. Here is a piece of markup for ‘navigation’, so that machines can tell which bit of the piece can be ignored for indexing purposes. What utter rot it is. We have escaped the dogma of categorisation, only to find it being reimposed by stealth by technologists who would prefer us to solve a problem that only exists in their heads.

The latest bright spark is ‘schema’. The merest look at the list of items immediately highlights the problem with such schemes: they are inherently limited by the imagination of the person creating the categorisation. Take the scope for the “AutomotiveBusiness” item:

  • AutoBodyShop
  • AutoDealer
  • AutoPartsStore
  • AutoRental
  • AutoRepair
  • AutoWash
  • GasStation
  • MotorcycleDealer
  • MotorcycleRepair

Pop quiz: would a branch of KwikFit belong under ‘AutoRepair’, ‘AutoBodyShop’ or ‘AutoPartsStore’? Well, depending on what’s happened to your car, it could be any of them.

Drill down further, and the ludicrous nature of what is being proposed becomes clearer still. Having decided that KwikFit is, for the sake of argument, an AutoPartsStore, we are then intended to add markup for the contact person, the geographic co-ordinates, currencies accepted and a slew of other ancillary information which in all probability is already there: in the content.
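To make the burden concrete, here is a rough sketch in Python of the sort of markup being asked for. It assumes the JSON-LD syntax, which is one of the formats schema.org accepts (microdata and RDFa are others), and it uses the schema.org property names currenciesAccepted, geo and contactPoint. The branch name, phone number and co-ordinates are invented placeholders, not details of any real branch.

```python
import json

# Hypothetical branch details; none of these values describe a real listing.
branch = {
    "@context": "https://schema.org",
    "@type": "AutoPartsStore",  # the category we were forced to pick
    "name": "KwikFit (example branch)",
    "currenciesAccepted": "GBP",
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": 51.5,
        "longitude": -0.12,
    },
    "contactPoint": {
        "@type": "ContactPoint",
        "contactType": "customer service",
        "telephone": "+44 0000 000000",  # placeholder number
    },
}

# Wrap the JSON in the script tag used to embed JSON-LD in a page.
markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(branch, indent=2)
    + "\n</script>"
)
print(markup)
```

That is around twenty lines of extra code per branch to restate a name, a location and a phone number that any visitor can already read on the page.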

And once our branch of KwikFit is so assigned, what happens when someone searches for something related to “AutoRepair”? Is our KwikFit branch excluded or included? Either way, it makes a mockery of the supposed benefit of categorisation, when you think about it.

What the people behind Schema (and those who proselytise about it) are trying to do is to get you to shortcut their problems by adding more and more code to describe various on-page elements. Their problem is how to understand context, relevancy and importance. Their ‘solution’ is to try and get you to tag your content so they don’t have to work on the extremely hard problem of doing the same via algorithmic means.

Luckily, it will ultimately become an irrelevance. The number of people who enact Schema will never be anything more than vanishingly small and any short-term boost in the search rankings that people see will soon come to an end as the inevitable sharks move in and start to abuse the idea of Schema (fake review scores, fake identities, misattributions and so on).

Don’t waste your brain energy on this walking dud for anything other than short term gain.