Ceci n’est pas le Web


In René Magritte’s painting of a tobacco pipe, “The Treachery of Images” (La trahison des images, 1928–29), he adds the words “Ceci n’est pas une pipe” (this is not a pipe). We are invited to move past the seemingly obvious error in the inscription and realise that what we are looking at is simply a picture of a pipe, not a real pipe that can be used in the real world.

Magritte wrote: “The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture “This is a pipe,” I’d have been lying!”



In Web Science we are trying to establish models (representations) of behaviour and structure on the Web through a combination of mathematical rigour, engineering principles and interdisciplinary expertise from a range of sources in the humanities and social sciences.

Despite this rigour we must remember that the models we develop remain models and may be missing key features of the real thing, hence: “Ceci n’est pas le Web”. There are a number of pitfalls and challenges that must be considered as part of good Web Science research design. This paper places them in the context of what we can expect to know about reality from a model, and of the implications of becoming part of the system being studied. It is based on observations of themes raised during a series of Web Science and Web Observatory workshops held over the past two years under the auspices of the Web Science Trust (www.webscience.org). It contributes a synthesis of those themes, highlights implications for Web Science research and makes recommendations for planning future research in this area.

Finally we consider the case of Web Observatories (WO), which combine data and analytics to support research about the Web. The implications for interoperability between WOs, and ultimately for a future World Wide Web Observatory (W3O) as an instrument for Web Science research, are discussed.

1.    Introduction: Web Science, Web Observatories and Social Machines

Shadbolt, Hall et al. present the fabric of Web Science as a rich combination of technical and social processes comprising the study of networks, mathematical models, law, business, psychology, social policy, education and many other fields. In the few short years since coining the term Web Science, Hall, Shadbolt, Hendler et al. have created a global network of laboratories (WSTnet) studying Web Science as a new interdisciplinary field.

With the persistent and accelerating growth of data sources and volumes driven by more bandwidth, more sharing, Big Data, “Broad Data” and an emerging Internet of Things, the need to look at data in new ways has become self-evident.

In his historical account “Weaving the Web”, Berners-Lee discusses the interaction between Web, users and data to form social machines, and Shadbolt, Berners-Lee, Hall et al. are attempting to give formal structure to the theory and practice of these Social Machines (www.sociam.org). It has been suggested that Web Science can be thought of as the study of Social Machines.

Tiropanis and Hall et al. outline the development of instruments to record/analyse data for Web Science research in the form of Web Observatories: software-based tools akin to virtual astronomical observatories which, by contrast, observe the Web rather than the external universe. Below we will further unpack the idea of “observing the Web”.

2.    Perspectives of the Web

In the context of Web Science research, de Roure has observed that the Web takes on multiple roles, which may complicate our understanding.

Expanding on this at WebSci 2013 (and also via blog http://www.scilogs.com/eresearch/social-machines/) de Roure casts the Web into three simultaneous roles:

  • The Web as infrastructure (a medium through which we wish to express other activities)
  • The Web as an artefact (an external system whose operation can be studied in its own right)
  • The Web as a lens (a tool through which we can observe the first two: structure and behaviour)

Though covered in part by de Roure’s triad, from a cognitive perspective we would also suggest an additional perspective:

  • The Web as a reason or driver in and of itself. The Web offers affordances/opportunities which may become attractive only when they are web-based. Thus the Web may not simply permit/enable such activities but encourage them. This may form part of the underlying drivers for the observed sociotechnical effects and emergent properties.

The notion of Web as an artefact allows us to look at the structure of the scale-free networks that underpin the Web as described by Barabasi, the implications for the expansion of the Web and its operation, and our stewardship responsibilities in terms of maintaining/improving the Web in the future (Hall). The digital footprints left by using the Web are typically captured/stored on the Web artefact. A key distinction here is that these are DATA ABOUT THE WEB and are highly central to Web Science research into structures and the technical part of sociotechnical effects and patterns of behaviour.

The notion of Web as infrastructure allows us to look at social phenomena (in government, health, crime, education, law, business etc.) which were typically expressed via other media before the Web, and to understand how their expression via the Web makes a difference. Equally, this expression of social phenomena via a technical platform is increasingly studied in terms of sociotechnical interplays and effects. The key distinction here is that these may be DATA ON THE WEB and are highly central to Web Science research into motivations, business models and the social part of sociotechnical processes.

The notion of Web as a lens allows us to look AT the Web (both structure and sociotechnical effects) VIA the Web and typically includes DATA ABOUT USING THE WEB (how we use it and how the users we focus on use it).  Whilst the Web as a lens or research tool is by no means the only approach to Web Science research it can throw up some meta-level questions pertinent to the design of research and the impact the research methods may have on the outcome.

In a thought experiment in which we are looking through a telescope to observe far-off people who themselves are using telescopes, someone observing us might wish to know:

  • What is the nature of our telescope? What are we using it for? How? Why?
  • What is the nature of the telescopes used by the far-off people? What are they doing? How is it being done? Why are they doing it?

The answer to the first set of questions may impact the second set.

Finally the idea that the Web may be an inherently attractive way to participate/interact creates a separate viewpoint in which we should consider the distinctions between certain actions in the physical world and their analogue on the Web.

Consider cases of cyber-bullying/stalking {Lazuras:2013vl}, access to illegal/offensive material, online shopping, social interaction via networks – all of which may be perceived to be considerably easier and/or more anonymous/private (sic) or generally to involve fewer consequences/costs than via alternative channels. If there are aspects of the Web which influence behaviour {Suler:2004kv}, then data/behaviour on the Web may not simply be considered a like-for-like expression of social processes alongside alternative technologies or expressions in the physical world.

Would we physically “friend” and share information with as many unfamiliar people in reality as we may feel is appropriate on social networks?

Do we express ourselves as negatively where we can be identified versus (apparently) anonymous encounters?

2.1    Research issues / pitfalls

The following section looks at the structural/methodological challenges around Web Science and borrows from work on cognitive bias.

2.1.1    Selection bias problems

Whilst the predominance of published research in this area is currently in the English language, some other languages and cultural models (particularly Chinese, Russian, Arabic and Spanish) are growing online at much higher, near-exponential rates (source: internetworldstats.com). Hence, in addition to encouraging research exchange, advanced translation tools and the growth of Web Science research in countries where these languages are spoken natively, researchers must consider the validity of samples based on a single culture and consider how to contrast results (particularly in semiotic analysis) across cultures in a resource-efficient way. Equally, whilst historical events/themes may be represented in the body of available data, we may naturally expect to find more data from/about current themes and groups than about those which no longer operate, and must correct for this survivorship bias.

Equally Web Science will naturally find most of its digital data from countries/locations that are served with good connectivity. Whilst organisations such as the Web Foundation are attempting to highlight the digital divide through tools such as the Web index (thewebindex.org) and Ushahidi are pioneering new technologies such as BRCK (www.brck.com) to provide robust communications to remote communities who are underserved with digital infrastructure, researchers will need to be aware of placing Web Science research in a developed vs. developing world context and to account for non-response issues by those who cannot participate.

2.1.2    The Observer/Participation bias problem

In addition to the documented insight that observation can affect behaviour, an additional issue for Web Science is not only to decide how obtrusive the observations themselves may be (“All observation is a form of participation”) but also to decide what level of disclosure/feedback is required ethically vs. the need to avoid inadvertently modifying behaviour at Web scale through a feedback mechanism.

Imagine, for example, we are studying the quality/movement of air in a room. As we move into the room to take measurements we are changing not only the movement of the air but also consuming/transforming the air even as we measure its properties.

This is not the remote, consequence-free, observation of distant stars or galaxies but a case of changing the outcome of the experiment by performing that experiment.

We risk changing the outcome of our Web Science experiments not only by running them (potentially introducing changes into the system by virtue of the measurement process) but also by revealing the results of the measurement to the population/system that is being measured. Two examples follow.

The CS-Indiana observatory Truthy (www.truthy.indiana.edu/) seeks to gauge sentiment through an analysis of Tweets around social and political issues in the US and provides near real-time feedback/visualisation of sentiment around these topics. The ability of Truthy to track sentiment in the real world has been discussed. Despite being a politically neutral service, the system reportedly ceased operations during the recent Obama US election amid concerns around unduly influencing the outcome of the election. Web Science must consider not only our own theories/reactions around our research results but also those of the observed. In Logik der Forschung (1935) Popper claimed all observation (and arguably the awareness of observation) is theory-laden.

In a recent discussion of health social networks for SOCIAM, a partner shared an example of a tele-health project (intended to help reduce the need for medical intervention by increasing patient engagement) which had apparently caused an improvement in patient symptoms. Further analysis of the data, however, revealed that the additional digital data about symptoms available to healthcare professionals had led to an increase in treatment/prescribing, and it was this, rather than improved patient behaviours, that was underpinning the results.

The effect here is two-fold:

  • The act of observing/measuring may have an effect on the system being measured. (An Observer/Expectation effect)
  • The measurer/experiment may have to be considered part of the system being measured. (A participation effect)

As the scope of our Web experience extends to encompass pervasive computing and a global Internet of Things, the key implication to consider is that it may be increasingly difficult (or perhaps impossible) to observe the Web from outside. If we are drawn into the growing context of the “black box” we are observing (i.e. if the box simply grows ever larger until it encompasses us) then these participation effects may no longer be avoidable.

2.1.3    Data-centric problems

As the volume of data gatherable from online networks has become “big” (difficult to process due to volume, velocity and/or lack of structure) and increasingly proprietary, researchers are faced with a twofold problem:

  • Wrangling enough of the right data from the right sources (vs. ALL data)
  • Interpreting what the data means vs. a data mining approach

In terms of data wrangling, currently few organisations are able to handle the full feed of even one source such as Twitter (let alone correlate multiple feeds from different sources), and so research may be based on a sample chosen at random from a provider whose data may be easiest to access/wrangle. An anchoring bias may emerge from an undue reliance on a narrow set of data sources. Additionally, the data gathered from a provider is often gathered under strict license and may not be shared or re-distributed, leading to problems around verifying, repeating or extending the work done by other research groups.

Where data sets are very large it is tempting to think of them as exhaustive and authoritative – they may not, however, be complete, random or representative. Multiple smaller data collections from different sources may be preferable to validate results through triangulation.

Unlike some other forms of social science research where the raw data gathered may be very specific/unique, much time/bandwidth/storage is potentially wasted in Web Science research reacquiring data sets, which already exist elsewhere. Agreements to share data sets for research are needed to avoid this huge duplication of efforts/storage.

In terms of data interpretation, Web Scientists with access to ever more data are certain to find correlations and hence may be faced with illusory correlation problems: namely that correlations revealed by analysis may not inherently be valuable without a way to validate the causation or predict something from the correlation, if indeed causation can be established. The simple revelation that some dataset yields a particularly visually interesting pattern or exhibits an XYZ factor of 1.234, whilst descriptive, must surely fail the test of usefulness for Web Science unless an insight into underlying behaviour, cause or social process is illuminated in the process.
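The point about illusory correlation can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from the workshops: it generates series of pure random noise that are unrelated by construction and counts how many pairs nonetheless look "strongly" correlated. The function names, the number of variables, the sample size and the threshold (roughly the conventional 5% significance level for 20 samples) are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_variables=40, n_samples=20,
                              threshold=0.444, seed=1):
    """Count pairs of independent noise series whose |r| exceeds threshold.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Every series is independent random noise: no real relationship exists.
    series = [[rng.gauss(0, 1) for _ in range(n_samples)]
              for _ in range(n_variables)]
    pairs = 0
    strong = 0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            pairs += 1
            if abs(pearson(series[i], series[j])) > threshold:
                strong += 1
    return strong, pairs


if __name__ == "__main__":
    strong, pairs = count_chance_correlations()
    print(f"{strong} of {pairs} unrelated pairs exceed the threshold")
```

Because the number of pairs grows roughly with the square of the number of variables, adding more data columns adds more chance "findings", which is why a correlation alone says little without a way to validate or predict from it.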

2.1.4    The impersonation / provenance problem

Following on from gathering the “right data” comes the issue of ensuring that the data is right (correct and what it appears to be). In a paper on observing social machines, De Roure et al. highlight the additional complexity of observing not only Human/Human interactions via the machine but also interaction WITH the machine (bots) and even Machine/Machine interaction where proxies interact (presumably unknowingly) with other proxies. The implications for Web Science are non-trivial if researchers are unable to determine whether input/sentiment is human or machine-generated. In a reductio ad absurdum scenario researchers could find themselves unknowingly studying groups of non-human users on social networks with no live people actually posting.

Google have patented a service (US Patent 8589401) for “Automated generation of suggestions for personalised reactions in a social network”.

This goes beyond workflow services such as IFTTT (If This Then That) and Zapier, which simply reproduce/pipe content between applications, and offers to create novel content in the user’s name based on an analysis of previous posts by the user.

Whilst (contrary to the implications of various press reports) the patent does not suggest that the responses will be SENT automatically, only SUGGESTED automatically, there is nonetheless potentially only a trivial extension needed to make this possible. A new class of provenance issue for researchers is thereby created, in which users may be impersonating themselves via a bot.

Further potential research questions become apparent around the right to know if you are interacting with a person or machine, the nature of our interactions with machines vs. people and how our knowledge / expectations colour the exchange.  Indiana CS’s work in detecting so-called “Astroturf” may become increasingly important in this context even where the Astroturf is intentional (vs. fraudulent) and generated based on our own material/style. A new area of application for the Turing test is potentially born.

3.    Implications for Web Observatories

The discussions above highlight a number of core challenges (structural rather than thematic) for Web Science:

  • Provenance and Quality of Data
  • Reuse/Sharing of Data
  • Accurate sampling of Data in context
  • Observing from within the system

Whilst there might be individual/localised solutions to these problems, Tiropanis, Hall et al have proposed the development of Web Observatories – a general class of web repositories to observe targeted phenomena and to then make the data, analysis and tools associated with this research available to others via known interfaces and standards.

It has been argued that this approach differs from existing tools and will need to reflect more complex processes and requirements and in turn will offer different and potentially richer affordances.

By building the known challenges for Web Science into the design of Web Observatories (WO) we will enable significant progress to address several of the challenges above. This includes presenting trusted repositories of curated data and tools, the services and metadata required to discover and select appropriate data sets, to offer clear terms of use (licenses) and to do this within a defined research context (linking to open research notes, papers and experimental methodologies). Access not only to published papers but to the research data, the research processes and the tools behind the papers will allow a broader debate and clarity around context, interpretation and potentially the acceleration of research process.

Whilst no single tool/approach is ever likely to be a “magic bullet” solving all problems in a single stroke, the move towards collaborative research approaches, shared data, shared processes and a system of provenance and Trust are core concepts for Web Observatories and could represent a unifying set of tools and principles for future Web Science research.


Web Science research is naturally not unique in needing to account for research bias and be rigorous in research design but Web Science may have a unique combination of bias issues when dealing with capturing, analysing and then publishing data at Web scale. These specific challenges require further research culminating in the provision of research models and tools such as Web Observatories. We await not only the growth of individual Web Observatories with specific aims to address key topics but also the future integration and interplay of these repositories.

The interoperation of individual WO repositories may ultimately form the basis of a worldwide web observatory (W3O) through which we may not only aggregate datasets to produce novel insights from synthetic data but may even learn to observe ourselves as we do the research.

Web Science and the Web Observatory: the changing remit of web curation for research, enrichment and cultural preservation.

Web Science has been defined as the study of social machines – the hybrid human/virtual solutions and processes that result from the use in society of information and information systems on the Web. Tim Berners-Lee (2009) described them thus:

Real life is and must be full of all kinds of social constraint – the very processes from which society arises. Computers can help if we use them to create abstract social machines on the Web: processes in which the people do the creative work and the machine does the administration. . . The stage is set for an evolutionary growth of new social engines. The ability to create new forms of social process would be given to the world at large, and development would be rapid.

The Web has evolved beyond a collection of static HTML pages for academic research to a platform for human interaction in all its forms and an open conduit for publishing and self-expression. These social machines exist in Government, Science, Art, Crime, Health and in many virtual categories beyond.

The desire to retain such data, which has been deliberately/explicitly put onto the Web, may fit within a widening remit for archivists to preserve works of art, literature, science and other traditional “publications”. Yet an increasing body of data is being added that is not explicitly published by any individual but instead comes via a technical platform or channel and is about a topic, about “society” and/or about specific groups globally. This is an immensely valuable resource and may help us to model evolving trends and behaviour in society through the expression of activities on the Web. Not least, this resource may help historians understand the 21st Century through more detailed records than have ever been available before the advent of the Web.

This form of publication may be a problematic fit with traditional policies and libraries globally are reacting quickly to expand and re-define what it is to be a library, what constitutes an artefact for preservation and what deserves space in growing (but ultimately limited) digital collections.

This change is not without challenges and not least the donation of the entire Twitter corpus (growing at some 500 million messages per day) to the Library of Congress has highlighted the huge operational issues which come with data at Web scale.

How then can we decide which data to preserve and which to discard? Most recently the Web has been leveraged as a source of huge amounts of information/metadata known as “big data”, much of which may be considered the “exhaust” of other human activities on telephone, social media and other networks. In the coming decade growing portions of the electronic fabric of society, ranging from cars to cameras to pacemakers and household devices, may be brought on-line to form an Internet of Things whose data output is predicted to dwarf even the huge volumes of web data we collect today.

Such volumes of data cannot currently be captured and stored in their entirety using available technologies and beyond this the challenges of curation, access, licensing and presentation may take many more years beyond a future storage solution. Museums and libraries have faced the selection challenge for as long as we have had limited shelf space but with the virtually unlimited digital shelf the choices may become less about inclusion/exclusion and more about the resolution of the data stored.

For example: one copy of a particular newspaper vs. weekly examples of selected newspapers vs. all copies of all newspapers.
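The idea of choosing a storage resolution rather than making an include/exclude decision can be sketched in a few lines. This is an illustrative sketch only: the function name, the three resolution labels and the sample archive are assumptions, and a real collection policy would also select which titles to keep.

```python
from datetime import date, timedelta


def sample_by_resolution(items, resolution):
    """Reduce an archive of (title, publication_date) pairs.

    resolution (labels are illustrative assumptions):
      "single" - keep one example issue only
      "weekly" - keep the first issue of each title in each ISO week
      "all"    - keep everything
    """
    ordered = sorted(items, key=lambda item: (item[0], item[1]))
    if resolution == "all":
        return ordered
    if resolution == "single":
        return ordered[:1]
    if resolution == "weekly":
        kept = []
        seen = set()
        for title, day in ordered:
            iso = day.isocalendar()
            key = (title, iso[0], iso[1])  # title, ISO year, ISO week
            if key not in seen:
                seen.add(key)
                kept.append((title, day))
        return kept
    raise ValueError(f"unknown resolution: {resolution!r}")


# Hypothetical archive: two daily titles over two full weeks.
start = date(2024, 1, 1)  # a Monday
archive = [(title, start + timedelta(days=n))
           for title in ("Paper A", "Paper B")
           for n in range(14)]

for level in ("single", "weekly", "all"):
    print(level, len(sample_by_resolution(archive, level)))
```

The same archive yields very different storage footprints depending on the level chosen, which is the trade-off a curator would be weighing.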

In Web Science we are happy to embrace the value that may come from analysing Big Data without assuming that more data is necessarily better per se or that Big Data is inherently insightful or even meaningful! Our chosen tool for Web Science is the development of an instrument akin to a stethoscope in medicine or a telescope in Astronomy – namely the Web Observatory. Something to help us observe the digital footprints left by society as a whole rather than directly observing/intruding into personal spaces.

A single web observatory is a data repository in which data ON the Web or data ABOUT the Web is collected and in turn made available to other users via portals, interfaces or visualisations and users may in turn add or return data to the Observatory based on their own findings, research or experiences.

Whilst some have argued that only data ABOUT the Web (webometrics, cybermetrics, bibliometrics etc.) should be considered as a focus for study, it seems apparent that as data is put ON the Web the resulting behaviours and responses may often be detected through data ABOUT the Web (locations, times, tagging/classification etc.), which are metadata items akin to the metadata held ABOUT library collections and are distinct from the books or artefacts themselves. Thus the argument about data vs. metadata becomes a chicken/egg question of socio-technical effects: does the technical change the social or vice versa?

One of the key challenges around the curation of so much data across so many perspectives is to encourage a reduction of waste/repetition so that data gathered by Observatory A may be discovered and re-used/re-purposed by Observatory B. We are looking forward to a collection of interoperating observatories that will form the fabric of a World-wide Web Observatory. These many distributed repositories/collections may be based on varied technologies and approaches, but we are calling for shared standards of identification, metadata, licensing and collaboration. Such an emergent system may accelerate the efforts of academic researchers by linking publications to underlying research data and to notes about usage and methodology, and may allow the study of hybrid data assets and synthetic artefacts that are available nowhere else.

The pressing need is to co-ordinate the skills of the archivists with the needs of the researchers and the understanding of the technologists into an interdisciplinary view of how the Web could be sampled, preserved and made available as an invaluable tool for current researchers, historians and for future generations.


The WAISFest (Web and Internet Science) – pun intended – is an opportunity to step away from your normal routine and hack something together typically with other members from the Research group.

This year I was a one-man band 🙁 but had a lot of fun hacking visualisations together from a few different sources, and learned quite a bit.

Finding a way to integrate standalone datascopes – a GloWORM

GloWORM – Global Web Observatory Resource Monitor

The Web Observatory is not a singular piece of software or a single repository but rather emerges as a function of accessibility to a set of distributed data telescopes (datascopes) that make their data and services available.
There are major challenges around automated discovery, content conversion, licensing and operational factors to deliver a *perfect* Web Observatory but to at least know what resources are available and what they address would be a great start.
There are currently hundreds of individual web pages, wikis or links to Open Data sets, standalone repositories, linked open data or data sets that are privately shared as part of a project.
Nothing currently pulls all this together to allow visualisation / browsing by region, data-type, date range or topic. The ability to code or classify a dataset or tool by hand (crowd-sourcing the metadata) may be a much more pragmatic way of pulling together the resources in the short term so this should be considered where automated discovery or classification is not possible.
Challenges here include:

  • How to discover data vs static links
  • How to identify topics and QUICKLY enhance metadata through tags
  • How to combine public data with personal/project links
  • How to visualise/browse in an engaging way

This project will help to highlight the technical challenges in Observatory development and provide a user-level tool to show how the type/quantity of content changes over time.
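As a rough illustration of the hand-classified catalogue described above, here is a minimal sketch of a metadata record with crowd-sourced tags and a browse function filtering by region, data type, tag and date. It is not the GloWORM implementation: every class, field and function name, and the example entries and URLs, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogueEntry:
    """One hand-classified resource (field names are illustrative assumptions)."""
    title: str
    url: str
    region: str
    data_type: str
    start: date
    end: date
    public: bool = True
    tags: set = field(default_factory=set)


def add_tags(entry, *tags):
    """Crowd-sourced enrichment: normalise and merge free-text tags."""
    entry.tags.update(tag.strip().lower() for tag in tags if tag.strip())


def browse(entries, region=None, data_type=None, tag=None, active_on=None):
    """Return the entries matching every filter that was supplied."""
    matches = []
    for entry in entries:
        if region is not None and entry.region != region:
            continue
        if data_type is not None and entry.data_type != data_type:
            continue
        if tag is not None and tag.lower() not in entry.tags:
            continue
        if active_on is not None and not (entry.start <= active_on <= entry.end):
            continue
        matches.append(entry)
    return matches


# Hypothetical entries mixing a public data set with a private project link.
catalogue = [
    CatalogueEntry("City air quality feed", "https://example.org/air", "Europe",
                   "open data", date(2012, 1, 1), date(2013, 12, 31)),
    CatalogueEntry("Project tweet sample", "https://example.org/tweets", "Global",
                   "social media", date(2013, 6, 1), date(2013, 9, 30),
                   public=False),
]
add_tags(catalogue[0], "Environment", " health ")
add_tags(catalogue[1], "Politics")

for entry in browse(catalogue, tag="health"):
    print(entry.title, sorted(entry.tags))
```

Keeping the record flat and the tags free-text is a deliberate simplification here: it favours quick manual classification over a formal vocabulary, which matches the short-term, pragmatic approach suggested above.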

The truth of the long tail

I’ve recently been reading “The Long Tail”, in which, back in 2004/2005, the author describes teenagers who abandon traditional TV, albums and other content in favour of an unlimited, self-selected micro-niche of culture/content.

I realised that both my children now do exactly this .. I cancelled our cable TV service today. They didn’t seem very worried.