We rely on Google Analytics to tell us how many readers we have, and what articles our readers click through to. We don’t seek or possess any data about the reading habits of individual readers; only the aggregates. So far, we haven’t tried to make any practical use of these analytics; but we’re wondering if they might help us to make The Browser better. So we’ve asked our friend Tony to consult with us on what our data might tell us. This is the first of his notes. — Robert
THERE IS quite a surprising finding in a first view of the stats. As the graph shows, when the paywall was tightened in early September, from ten down to five free click-throughs, a whole lot of articles – dots on the graph – stopped being clicked through by anyone much (getting fewer than 100 views or so), while the other articles kept on much as before.
One might have expected every article to suffer a bit from the tightening, but in fact a small number took most of the hit. This can be seen in the second graph, which shows, for each week, the mean readership and the bands 1 and 2 standard deviations either side of it. The spread downwards increases after September, but the upper distribution hardly moves – popular articles are just as popular under a tightened paywall.
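For anyone wanting to reproduce the second graph's numbers, here is a minimal sketch of the weekly bands. The week labels and view counts are invented for illustration, and the `(week, views)` layout is an assumption rather than the shape of the actual Analytics export.

```python
from collections import defaultdict
from statistics import mean, pstdev

def weekly_bands(rows):
    """rows: iterable of (week_label, views) pairs, one per article.

    Returns {week_label: {"mean", "sd", "minus_1sd", "plus_1sd",
    "minus_2sd", "plus_2sd"}} using the population standard deviation.
    """
    by_week = defaultdict(list)
    for week, views in rows:
        by_week[week].append(views)

    bands = {}
    for week, views in sorted(by_week.items()):
        m = mean(views)
        sd = pstdev(views)
        bands[week] = {
            "mean": m,
            "sd": sd,
            "minus_1sd": m - sd,
            "plus_1sd": m + sd,
            "minus_2sd": m - 2 * sd,
            "plus_2sd": m + 2 * sd,
        }
    return bands

# Invented sample: one week before the tightening, one after.
sample = [
    ("2013-W35", 400), ("2013-W35", 420), ("2013-W35", 380),
    ("2013-W37", 400), ("2013-W37", 60), ("2013-W37", 440),
]
bands = weekly_bands(sample)
for week, b in bands.items():
    print(week, round(b["mean"]), round(b["sd"]))
```

In the invented sample the second week has one article collapsing while the others hold up, which is the pattern described above: the spread widens mostly because the bottom falls away.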
I have tried to see if I can extract any information about the less popular articles. In summary: the paywall cut average views by about 200, but as we’ve seen, this was concentrated in a subset of articles that took bigger hits. “Arts” articles tend to do 200 views worse than average; media, society and world are also poor performers. After the introduction of the paywall, energy, society and politics have been above-average performers, while science, people and media have done badly. Society seems to be the odd case here, since it has changed from being below average to being above average after the tightening of the paywall. This might be a statistical anomaly (though it is, in good old Fisherian terms, significant); or it might be that the composition of readers has changed because of the tightening.
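A rough way to check this kind of category reversal, without running a regression, is to compare each category's mean views with the overall mean for the same period. The sketch below does only that: it uses plain group means in place of the regressions described in these notes, and the `(period, category, views)` rows are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def category_gaps(rows):
    """rows: iterable of (period, category, views).

    Returns {(period, category): category mean minus the overall
    mean for that period}. Negative means below average.
    """
    by_period = defaultdict(list)
    by_cell = defaultdict(list)
    for period, category, views in rows:
        by_period[period].append(views)
        by_cell[(period, category)].append(views)
    return {
        key: mean(values) - mean(by_period[key[0]])
        for key, values in by_cell.items()
    }

# Invented numbers, chosen so that "society" flips sign.
rows = [
    ("pre", "arts", 200), ("pre", "society", 300), ("pre", "science", 700),
    ("post", "arts", 100), ("post", "society", 400), ("post", "science", 100),
]
gaps = category_gaps(rows)
for key in sorted(gaps):
    print(key, gaps[key])
```

A sign change between the two periods, as with society here, is the thing to look for. Whether it is significant is a separate question that this sketch does not answer.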
I’ve included the bottom 20 performers post-paywall in case you can think of some commonality that might be tested. It strikes me that the paywall may have eliminated a particular subset of readers with shared tastes. The list seems unusually “techie”. That would go with Science having turned into a poor performer post paywall. So … maybe that was traffic via the RSS feed, when the full RSS went paid?
As an analysis of paywalls, it is already an encouraging start.
There is much more to be done with all this data, and I suspect that the article-level material will be very rich. The category information is particularly valuable because it provides a fixed point within which variation can be explained. Extracting unusual words from the title and standfirst could do the same sort of thing, as could tag-extraction tools like OpenCalais. We might try one out to see if it adds useful category data. In any case, we should tag posts generously.
Simple frequency distributions of authors and publications show mostly unsurprising results in the top echelon of each. Aeon is perhaps the most surprising, being up there with much more established publications. Matter performs even more strongly, but it is hard for the moment to say whether it should be treated as a publication or a platform.
Maybe we could produce two rankings, split by the date the publication was established: say, pre and post 2001. That would allow us to bring more of the smaller publications to the surface. I have tried some regressions to establish the value of a publication to click-throughs but have yet to get anything significant out – again with the exception of Aeon and Matter, which had an above-average effect. I’ll try something with top authors to see if I find an effect. But I suspect I’ll have to do something more involved with individual-level data to extract this sort of information.
I am still groping around rather with the data and how to think about it. For example, I think that my next attempt ought to be to consider that each day, the articles published on that day are in a competition for attention. That will essentially give me many separate markets over which to try to isolate the effects of the different variables, and should yield much more information. At the moment, my best R squared is 17%, implying that there is still over 80% of the variation in the data that I am not capturing!
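One simple way to set up the "each day is a market" idea is to convert raw views into each article's share of that day's total, so that articles are compared only with their same-day rivals. This is a sketch under assumptions: the field names `date`, `title` and `views` are hypothetical, and the numbers are invented.

```python
from collections import defaultdict

def daily_shares(articles):
    """articles: list of dicts with 'date', 'title' and 'views'.

    Returns a list of (date, title, share), where share is the
    article's fraction of all views on its publication day.
    """
    totals = defaultdict(int)
    for a in articles:
        totals[a["date"]] += a["views"]

    shares = []
    for a in articles:
        total = totals[a["date"]]
        # Guard against a day with no recorded views at all.
        share = a["views"] / total if total else 0.0
        shares.append((a["date"], a["title"], share))
    return shares

# Invented sample: two articles on one day, one on the next.
articles = [
    {"date": "2013-10-01", "title": "A", "views": 300},
    {"date": "2013-10-01", "title": "B", "views": 100},
    {"date": "2013-10-02", "title": "C", "views": 50},
]
shares = daily_shares(articles)
for row in shares:
    print(row)
```

Shares take out day-to-day swings in total traffic, which may be part of the variation a regression on raw views fails to capture. Note that a day with a single article always yields a share of 1, so such days carry no information in this setup.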
Most frequently recommended publications post paywall*
*paywall was introduced on March 28th 2013
Titles of the poorest performers post paywall
Embracing the void
Everyone on The Couch
Review: Autobiography, By Morrissey
The Culture That Gave Birth To The Personal Computer
Ukraine, Russia and Europe
George Orwell: Animal Farm
Kennedy and After
The Plum in the Golden Vase
David Eggers: The Circle
If This Toaster Could Talk
Yahoo’s Geek Goddess
Cli-Fi: Birth Of A Genre
A Time Of Hugs And Kisses: XOXO 2013
Obituary: Hiroshi Yamauchi, President Of Nintendo
How Britain Exported Next-Generation Surveillance
Facebook Must Win The Grown-Up Vote
When Condé Nast Was A Force For Good
Tor Is Less Anonymous Than You Think
1. It doesn’t help, readership-wise, to be in the top 10 most referred-to publications. That is, once an article has been through the Browser selection process, it doesn’t matter if it came from the New Yorker or from myhomespunblog – that’s a rather good testament to the power of the editorial selection on the Browser, I think – it is an equalising force.
2. There is a small but significant effect of word length – an additional 1000 words gets you 9 more views. I presume that this is the result of some hidden variable rather than readers actually responding to word length. But it is nice to know that there is a small bias towards lengthier writing!
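To make the word-length figure concrete: 9 extra views per 1000 words is a slope of 0.009 views per word. The sketch below fits a one-variable least-squares line to invented data constructed to have exactly that slope. It is a single-predictor fit with no controls, so it would not separate word length from any hidden variable of the kind suspected above.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (one predictor plus intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Invented data: views rise by 9 for every extra 1000 words.
word_counts = [1000, 2000, 3000, 4000]
views = [309, 318, 327, 336]

slope = ols_slope(word_counts, views)
views_per_1000_words = slope * 1000
print(round(views_per_1000_words, 3))
```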