Ceci n’est pas le Web


In his painting of a tobacco pipe, “The Treachery of Images” (La trahison des images, 1928–29), René Magritte adds the words “Ceci n’est pas une pipe” (this is not a pipe). We are invited to move past the seemingly obvious error in the inscription and realise that what we are looking at is simply a picture of a pipe, not a real pipe that can be used in the real world.

Magritte wrote: “The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe,’ I’d have been lying!”



In Web Science we are trying to establish models (representations) of behaviour and structure on the Web through a combination of mathematical rigour, engineering principles and interdisciplinary expertise from a range of sources in the humanities and social sciences.

Despite this rigour we must remember that the models we develop remain models and may be missing key features of the real thing, hence: “Ceci n’est pas le Web”. There are a number of pitfalls and challenges that must be considered as part of good Web Science research design. This paper places them in the context of what we can expect to know about reality from a model, and of the implications of becoming part of the system being studied. It is based on observations of themes raised during a series of Web Science and Web Observatory workshops held over the past two years under the auspices of the Web Science Trust (www.webscience.org). It contributes a synthesis of those themes, highlights their implications for Web Science research and makes recommendations for planning future research in this area.

Finally we consider the case of Web Observatories (WO), which combine data and analytics to support research about the Web. The implications for interoperability between WOs, and ultimately for a future World Wide Web Observatory (W3O) as an instrument for Web Science research, are discussed.

1.    Introduction: Web Science, Web Observatories and Social Machines

Shadbolt, Hall et al. present the fabric of Web Science as a rich combination of technical and social processes comprising the study of networks, mathematical models, law, business, psychology, social policy, education and many other fields. In the few short years since coining the term Web Science, Hall, Shadbolt, Hendler et al. have created a global network of laboratories (WSTnet) studying Web Science as a new interdisciplinary field.

With the persistent and accelerating growth of data sources and volumes driven by more bandwidth, more sharing, Big Data, “Broad Data” and an emerging Internet of Things, the need to look at data in new ways has become self-evident.

In his historical account “Weaving the Web”, Berners-Lee discusses the interaction between Web, users and data to form social machines, and Shadbolt, Berners-Lee, Hall et al. are attempting to give formal structure to the theory and practice of these Social Machines (www.sociam.org). It has been suggested that Web Science can be thought of as the study of Social Machines.

Tiropanis, Hall et al. outline the development of instruments to record/analyse data for Web Science research in the form of Web Observatories: software-based tools akin to virtual astronomical observatories, which observe the Web rather than the external universe. Below we will further unpack the idea of “observing the Web”.

2.    Perspectives of the Web

In the context of Web Science research, de Roure has observed that the Web takes on multiple roles, which may complicate our understanding.

Expanding on this at WebSci 2013 (and also via blog http://www.scilogs.com/eresearch/social-machines/) de Roure casts the Web into three simultaneous roles:

  • The Web as infrastructure (a medium through which we wish to express other activities)
  • The Web as an artefact (an external system whose operation can be studied in its own right)
  • The Web as a lens (a tool through which we can observe the first two: structure and behaviour)

Though covered in part by de Roure’s triad, from a cognitive perspective we would also suggest an additional perspective:

  • The Web as a reason or driver in and of itself. The Web offers affordances/opportunities that may become attractive only when they are Web-based. Thus the Web may not simply permit/enable such activities but encourage them. This may form part of the underlying drivers for the observed sociotechnical effects and emergent properties.

The notion of Web as an artefact allows us to look at the structure of the scale-free networks that underpin the Web as described by Barabasi, the implications for the expansion of the Web and its operation, and our stewardship responsibilities in terms of maintaining/improving the Web in the future (Hall). The digital footprints left by using the Web are typically captured/stored on the Web artefact. A key distinction here is that these are DATA ABOUT THE WEB and are highly central to Web Science research into structures and the technical part of sociotechnical effects and patterns of behaviour.

The notion of Web as infrastructure allows us to look at social phenomena (in government, health, crime, education, law, business etc.) that were typically expressed via other media before the Web, and to understand how their expression via the Web makes a difference. Equally, this expression of social phenomena via a technical platform is increasingly studied in terms of sociotechnical interplays and effects. The key distinction here is that these may be DATA ON THE WEB and are highly central to Web Science research into motivations, business models and the social part of sociotechnical processes.

The notion of Web as a lens allows us to look AT the Web (both structure and sociotechnical effects) VIA the Web and typically includes DATA ABOUT USING THE WEB (how we use it and how the users we focus on use it). Whilst the Web as a lens or research tool is by no means the only approach to Web Science research, it can throw up some meta-level questions pertinent to the design of research and the impact the research methods may have on the outcome.

In a thought experiment in which we are looking through a telescope to observe far-off people who themselves are using telescopes, someone observing us might wish to know:

  • What is the nature of our telescope? What are we using it for? How? Why?
  • What is the nature of the telescopes used by the far-off people? What are they doing? How is it being done? Why are they doing it?

The answer to the first set of questions may impact the second set.

Finally the idea that the Web may be an inherently attractive way to participate/interact creates a separate viewpoint in which we should consider the distinctions between certain actions in the physical world and their analogue on the Web.

Consider cases of cyber-bullying/stalking (Lazuras 2013), access to illegal/offensive material, online shopping and social interaction via networks, all of which may be perceived to be considerably easier and/or more anonymous/private (sic) or generally to involve fewer consequences/costs than via alternative channels. If there are aspects of the Web which influence behaviour (Suler 2004), then data/behaviour on the Web may not simply be considered a like-for-like expression of social processes alongside alternative technologies or expressions in the physical world.

Would we physically “friend” and share information with as many unfamiliar people in reality as we may feel is appropriate on social networks?

Do we express ourselves as negatively where we can be identified versus (apparently) anonymous encounters?

2.1    Research issues / pitfalls

The following section looks at the structural/methodological challenges around Web Science and borrows from work on cognitive bias.

2.1.1    Selection bias problems

Whilst the predominance of published research in this area is currently in the English language, some other languages and cultural models (particularly Chinese, Russian, Arabic and Spanish) are growing at far higher, near-exponential rates (source: internetworldstats.com). Hence, in addition to encouraging research exchange, advanced translation tools and the growth of Web Science research in countries where these languages are spoken natively, researchers must consider the validity of samples based on a single culture and how to contrast results (particularly in semiotic analysis) across cultures in a resource-efficient way. Whilst historical events/themes may be represented in the body of available data, we may naturally expect to find more data from/about current themes and groups than about those which no longer operate, and must correct for this survivorship bias.

Equally, Web Science will naturally find most of its digital data from countries/locations that are served with good connectivity. Organisations such as the Web Foundation are attempting to highlight the digital divide through tools such as the Web Index (thewebindex.org), and Ushahidi are pioneering new technologies such as BRCK (www.brck.com) to provide robust communications to remote communities who are underserved with digital infrastructure. Researchers will nonetheless need to be aware of placing Web Science research in a developed vs. developing world context and to account for non-response issues among those who cannot participate.

2.1.2    The Observer/Participation bias problem

In addition to the documented insight that observation can affect behaviour, an additional issue for Web Science is to decide not only how obtrusive the observations themselves may be (“All observation is a form of participation”) but also what level of disclosure/feedback is required ethically vs. the need to avoid inadvertently modifying behaviour at Web scale through a feedback mechanism.

Imagine, for example, we are studying the quality/movement of air in a room. As we move into the room to take measurements we are changing not only the movement of the air but also consuming/transforming the air even as we measure its properties.

This is not the remote, consequence-free, observation of distant stars or galaxies but a case of changing the outcome of the experiment by performing that experiment.

We risk changing the outcome of our Web Science experiments not only by running them (potentially introducing changes into the system by virtue of the measurement process) but also by revealing the results of the measurement to the population/system being measured. Two examples follow below.

The CS-Indiana observatory Truthy (www.truthy.indiana.edu/) seeks to gauge sentiment through an analysis of Tweets around social and political issues in the US and provides near real-time feedback/visualisation of sentiment around these topics. The ability of Truthy to track sentiment in the real world has been discussed. Despite being a politically neutral service, the system reportedly ceased operations during the recent Obama US election amid concerns around unduly influencing the outcome of the election. Web Science must consider not only our own theories/reactions around our research results but also those of the observed. In Logik der Forschung (1935) Popper claimed all observation (and arguably the awareness of observation) is theory-laden.

In a recent discussion of health social networks for SOCIAM, a partner shared an example of how a tele-health project (intended to help reduce the need for medical intervention by increasing patient engagement) had apparently caused an improvement in patient symptoms. Further analysis of the data, however, revealed that the additional digital data about symptoms available to healthcare professionals had led to an increase in treatment/prescribing, and that this increase, rather than improved patient behaviours, was underpinning the results.

The effect here is two-fold:

  • The act of observing/measuring may have an effect on the system being measured (an Observer/Expectation effect)
  • The measurer/experiment may have to be considered part of the system being measured (a Participation effect)

As the scope of our Web experience extends to encompass pervasive computing and a global Internet of Things, the key implication to consider is that it may be increasingly difficult (or perhaps impossible) to observe the Web from outside. If we are drawn into the growing context of the “black box” we are observing (i.e. if the box simply grows ever larger until it encompasses us) then these participation effects may no longer be avoidable.

2.1.3    Data-centric problems

As the volume of data gatherable from online networks has become “big” (difficult to process due to volume, velocity and/or lack of structure) and increasingly proprietary, researchers are faced with a twofold problem:

  • Wrangling enough of the right data from the right sources (vs. ALL data)
  • Interpreting what the data means vs. a data mining approach

In terms of data wrangling, few organisations are currently able to handle the full feed of even one source such as Twitter (let alone correlate multiple feeds from different sources), and so research may be based on a sample chosen at random from a provider whose data may be easiest to access/wrangle. An anchoring bias may emerge from an undue reliance on a narrow set of data sources. Additionally, the data gathered from a provider is often gathered under strict license and may not be shared or re-distributed, leading to problems around verifying, repeating or extending the work done by other research groups.

Where data sets are very large it is tempting to think of them as exhaustive and authoritative – they may not, however, be complete, random or representative. Multiple smaller data collections from different sources may be preferable to validate results through triangulation.
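The contrast between a large but skewed collection and several small random ones can be illustrated with a toy simulation. The sketch below is illustrative only: the population, the “heavy user” group, its 10% share, the sentiment scores and the tenfold over-representation are all hypothetical assumptions chosen for the example, and the small samples are drawn from one simulated population rather than from genuinely different sources.

```python
import random
import statistics

def make_population(n, heavy_share, rng):
    """Return (is_heavy_user, sentiment) pairs; all values are hypothetical."""
    population = []
    for _ in range(n):
        heavy = rng.random() < heavy_share
        # Assumption: heavy users are more positive on average than light users.
        mean = 0.6 if heavy else 0.2
        population.append((heavy, rng.gauss(mean, 0.1)))
    return population

def biased_sample(population, k, heavy_weight, rng):
    """Draw k items with replacement, over-representing heavy users."""
    weights = [heavy_weight if heavy else 1.0 for heavy, _ in population]
    return rng.choices(population, weights=weights, k=k)

def mean_sentiment(sample):
    """Average sentiment score of a sample of (is_heavy_user, sentiment) pairs."""
    return statistics.fmean([score for _, score in sample])

rng = random.Random(42)
population = make_population(50_000, heavy_share=0.1, rng=rng)
true_mean = mean_sentiment(population)

# One large collection skewed towards the users who are easiest to capture.
big_biased = biased_sample(population, 20_000, heavy_weight=10.0, rng=rng)
# Three small, independent random collections used for triangulation.
small_random = [rng.sample(population, 500) for _ in range(3)]

big_error = abs(mean_sentiment(big_biased) - true_mean)
small_errors = [abs(mean_sentiment(s) - true_mean) for s in small_random]

print(f"error of large biased sample: {big_error:.3f}")
print("errors of small random samples:", [f"{e:.3f}" for e in small_errors])
```

Under these assumptions the large sample is confidently wrong, whereas the small samples agree with one another and with the population. Agreement between independent collections is the signal that triangulation looks for.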

Unlike some other forms of social science research where the raw data gathered may be very specific/unique, much time/bandwidth/storage is potentially wasted in Web Science research reacquiring data sets that already exist elsewhere. Agreements to share data sets for research are needed to avoid this huge duplication of effort/storage.

In terms of data interpretation, Web Scientists with access to ever more data are certain to find correlations and hence may be faced with illusory correlation problems: namely that correlations revealed by analysis may not inherently be valuable without a way to validate the causation or predict something from the correlation, if indeed causation can be established. The simple revelation that some dataset yields a particularly visually interesting pattern or exhibits an XYZ factor of 1.234, whilst descriptive, must surely fail the test of usefulness for Web Science unless an insight into underlying behaviour, cause or social process is illuminated in the process.
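The claim that more data makes correlations certain to appear can be checked with a short simulation. The sketch below is a minimal illustration, not a method from the workshops: it generates independent random series that have no relationship by construction and counts how many pairs nonetheless appear strongly correlated. The number of variables, the number of observations and the threshold of 0.5 are arbitrary assumptions.

```python
import random
import statistics
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def count_strong_pairs(n_variables, n_observations, threshold, seed=0):
    """Count pairs with |r| >= threshold among independent random series."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_observations)]
              for _ in range(n_variables)]
    return sum(1 for a, b in combinations(series, 2)
               if abs(pearson(a, b)) >= threshold)

# The same noise process, examined through few and through many variables.
few = count_strong_pairs(n_variables=5, n_observations=20, threshold=0.5)
many = count_strong_pairs(n_variables=100, n_observations=20, threshold=0.5)

print(f"5 variables (10 pairs): {few} strong correlations")
print(f"100 variables (4950 pairs): {many} strong correlations")
```

Every “strong” pair found here is illusory, since the series are independent. The count grows with the number of pairs examined, which is why a correlation mined from a very wide dataset needs independent validation before it is read as evidence of an underlying social process.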

2.1.4    The impersonation / provenance problem

Following on from gathering the “right data” comes the issue of ensuring that the data is right (correct and what it appears to be). In a paper on observing social machines, De Roure et al. highlight the additional complexity of observing not only Human/Human interactions via the machine but also interaction WITH the machine (bots) and even Machine/Machine interaction where proxies interact (presumably unknowingly) with other proxies. The implications for Web Science are non-trivial if researchers are unable to determine whether input/sentiment is human or machine-generated. In a reductio ad absurdum scenario, researchers could find themselves unknowingly studying groups of non-human users on social networks with no live people actually posting.

Google have patented a service (US Patent 8589401) for “Automated generation of suggestions for personalised reactions in a social network”.

This goes beyond workflow services such as IFTTT (If This Then That) and Zapier, which simply reproduce/pipe content between applications, and offers to create novel content in the user’s name based on an analysis of previous posts by the user.

Contrary to the implications of various press reports, the patent does not suggest that the responses will be SENT automatically, only SUGGESTED automatically. There is nonetheless potentially only a trivial extension required to make this possible, and a new class of provenance issue for researchers is created where users may be impersonating themselves via a bot.

Further potential research questions become apparent around the right to know whether you are interacting with a person or a machine, the nature of our interactions with machines vs. people and how our knowledge/expectations colour the exchange. Indiana CS’s work in detecting so-called “Astroturf” may become increasingly important in this context, even where the Astroturf is intentional (vs. fraudulent) and generated based on our own material/style. A new area of application for the Turing test is potentially born.

3.    Implications for Web Observatories

The discussions above highlight a number of core challenges (structural rather than thematic) for Web Science:

  • Provenance and Quality of Data
  • Reuse/Sharing of Data
  • Accurate sampling of Data in context
  • Observing from within the system

Whilst there might be individual/localised solutions to these problems, Tiropanis, Hall et al. have proposed the development of Web Observatories: a general class of Web repositories to observe targeted phenomena and then to make the data, analysis and tools associated with this research available to others via known interfaces and standards.

It has been argued that this approach differs from existing tools and will need to reflect more complex processes and requirements and in turn will offer different and potentially richer affordances.

By building the known challenges for Web Science into the design of Web Observatories (WO) we will enable significant progress on several of the challenges above. This includes presenting trusted repositories of curated data and tools, providing the services and metadata required to discover and select appropriate data sets, offering clear terms of use (licenses) and doing all of this within a defined research context (linking to open research notes, papers and experimental methodologies). Access not only to published papers but also to the research data, the research processes and the tools behind the papers will allow a broader debate, greater clarity around context and interpretation, and potentially the acceleration of the research process.

Whilst no single tool/approach is ever likely to be a “magic bullet” solving all problems in a single stroke, the move towards collaborative research approaches, shared data, shared processes and a system of provenance and Trust are core concepts for Web Observatories and could represent a unifying set of tools and principles for future Web Science research.


Web Science research is naturally not unique in needing to account for research bias and be rigorous in research design but Web Science may have a unique combination of bias issues when dealing with capturing, analysing and then publishing data at Web scale. These specific challenges require further research culminating in the provision of research models and tools such as Web Observatories. We await not only the growth of individual Web Observatories with specific aims to address key topics but also the future integration and interplay of these repositories.

The interoperation of individual WO repositories may ultimately form the basis of a World Wide Web Observatory (W3O), through which we may not only aggregate datasets to produce novel insights from synthetic data but may even learn to observe ourselves as we do the research.