Web Science and the Web Observatory: the changing remit of web curation for research, enrichment and cultural preservation.

Web Science has been defined as the study of social machines: the hybrid human/virtual solutions and processes that result from the use in society of information and information systems on the Web. Tim Berners-Lee (2009) described them thus:

Real life is and must be full of all kinds of social constraint – the very processes from which society arises. Computers can help if we use them to create abstract social machines on the Web: processes in which the people do the creative work and the machine does the administration. . . The stage is set for an evolutionary growth of new social engines. The ability to create new forms of social process would be given to the world at large, and development would be rapid.

The Web has evolved beyond a collection of static HTML pages for academic research into a platform for human interaction in all its forms and an open conduit for publishing and self-expression. These social machines exist in Government, Science, Art, Crime, Health and in many virtual categories beyond.

The desire to retain such data, which has been deliberately and explicitly put onto the Web, may fit within a widening remit for archivists to preserve works of art, literature, science and other traditional “publications”. Yet an increasing body of data is being added that is not explicitly published by any individual. Instead it arrives via a technical platform or channel and is about a topic, about “society” and/or about specific groups globally. This is an immensely valuable resource and may help us to model evolving trends and behaviour in society through the expression of activities on the Web. Not least, this resource may help historians understand the 21st Century through more detailed records than were ever available before the advent of the Web.

This form of publication may be a problematic fit with traditional policies, and libraries globally are reacting quickly to expand and redefine what it is to be a library, what constitutes an artefact for preservation and what deserves space in growing (but ultimately limited) digital collections.

This change is not without challenges. Not least, the donation of the entire Twitter corpus (growing at some 500 million messages per day) to the Library of Congress has highlighted the huge operational issues that come with data at Web scale.

How then can we decide which data to preserve and which to discard? Most recently the Web has been leveraged as a source of huge amounts of information and metadata known as “big data”, much of which may be considered the “exhaust” of other human activities on telephone, social media and other networks. In the coming decade growing portions of the electronic fabric of society, ranging from cars to cameras to pacemakers and household devices, may be brought on-line to form an Internet of Things whose data output is predicted to dwarf even the huge volumes of web data we collect today.

Such volumes of data cannot currently be captured and stored in their entirety using available technologies. Even once a storage solution exists, the challenges of curation, access, licensing and presentation may take many more years to resolve. Museums and libraries have faced the selection challenge for as long as we have had limited shelf space, but with the virtually unlimited digital shelf the choices may become less about inclusion/exclusion and more about the resolution of the data stored.

For example: one example of a particular newspaper vs. weekly examples of selected newspapers vs. all copies of all newspapers.
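The three resolutions in the newspaper example can be made concrete with a small sketch. It is illustrative only: the titles, dates and function names below are invented for the purpose and do not describe any real collection policy or Observatory interface.

```python
from datetime import date, timedelta

# Hypothetical corpus: one issue per day, for four weeks, of three invented titles.
TITLES = ["Title A", "Title B", "Title C"]
START = date(2014, 1, 6)
issues = [(title, START + timedelta(days=d)) for title in TITLES for d in range(28)]


def single_example(items, title):
    """Coarsest resolution: keep one issue of one particular title."""
    for item in items:
        if item[0] == title:
            return [item]
    return []


def weekly_sample(items, selected_titles):
    """Middle resolution: keep the earliest issue per ISO week for selected titles."""
    kept, seen = [], set()
    for title, day in sorted(items):
        if title not in selected_titles:
            continue
        week_key = (title, tuple(day.isocalendar()[:2]))  # (ISO year, ISO week)
        if week_key not in seen:
            seen.add(week_key)
            kept.append((title, day))
    return kept


def everything(items):
    """Finest resolution: keep all copies of all titles."""
    return list(items)


if __name__ == "__main__":
    for label, sample in [
        ("single example", single_example(issues, "Title A")),
        ("weekly sample", weekly_sample(issues, {"Title A", "Title B"})),
        ("everything", everything(issues)),
    ]:
        print(label, len(sample))
```

The point of the sketch is that all three functions draw on the same source; what differs is how finely that source is sampled, not whether it is included at all.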

In Web Science we are happy to embrace the value that may come from analysing Big Data without assuming that more data is necessarily better per se or that Big Data is inherently insightful or even meaningful! Our chosen tool for Web Science is the development of an instrument akin to a stethoscope in medicine or a telescope in astronomy, namely the Web Observatory: something to help us observe the digital footprints left by society as a whole rather than directly observing or intruding into personal spaces.

A single Web Observatory is a data repository in which data ON the Web or data ABOUT the Web is collected and in turn made available to other users via portals, interfaces or visualisations. Users may in turn add or return data to the Observatory based on their own findings, research or experiences.

Whilst some have argued that only data ABOUT the Web (webometrics, cybermetrics, bibliometrics etc.) should be considered as a focus for study, it seems apparent that as data is put ON the Web the resulting behaviours may be detected in the responses of users, often through data ABOUT the Web (locations, times, tagging/classification etc.). These are metadata items akin to the metadata held ABOUT library collections and are distinct from the books or artefacts themselves. Thus the argument about data vs. metadata becomes a chicken/egg question of socio-technical effects: does the technical change the social or vice versa?

One of the key challenges around the curation of so much data across so many perspectives is to encourage a reduction of waste and repetition so that data gathered by Observatory A may be discovered and re-used or re-purposed by Observatory B. We are looking forward to a collection of interoperating observatories that will form the fabric of a World-wide Web Observatory. These many distributed repositories and collections may be based on varied technologies and approaches, but we are calling for shared standards of identification, metadata, licensing and collaboration. Such an emergent system may accelerate the efforts of academic researchers by linking publications to underlying research data and to notes about usage and methodology, and may allow the study of hybrid data assets and synthetic artefacts that are available nowhere else.
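As a rough illustration of why shared identification, metadata and licensing matter for discovery, the sketch below shows two observatories describing their holdings with the same minimal record and a third party searching across both. The record fields, identifiers, licence labels and the discover function are all assumptions made for this example; they are not a proposed standard or an existing Observatory API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """A hypothetical minimal descriptor shared by all observatories."""
    identifier: str       # invented identifier scheme, for illustration only
    observatory: str      # which observatory holds the data
    title: str
    licence: str          # invented licence labels
    keywords: frozenset   # free-text topic tags


# Invented catalogues published by two observatories.
catalogue_a = [
    DatasetRecord("obs-a:001", "Observatory A", "Public posts on flooding",
                  "open", frozenset({"flooding", "social media"})),
    DatasetRecord("obs-a:002", "Observatory A", "Petition signatures over time",
                  "restricted", frozenset({"government", "petitions"})),
]
catalogue_b = [
    DatasetRecord("obs-b:001", "Observatory B", "Flood warning page views",
                  "open", frozenset({"flooding", "web traffic"})),
]


def discover(catalogues, keyword, allowed_licences):
    """Search every catalogue for records on a topic under an acceptable licence."""
    return [
        record
        for catalogue in catalogues
        for record in catalogue
        if keyword in record.keywords and record.licence in allowed_licences
    ]


if __name__ == "__main__":
    for record in discover([catalogue_a, catalogue_b], "flooding", {"open"}):
        print(record.identifier, record.observatory, record.title)
```

Because both catalogues use the same fields, a researcher working with Observatory B can find and re-use the relevant holding of Observatory A without knowing how either repository is implemented.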

The pressing need is to co-ordinate the skills of the archivists with the needs of the researchers and the understanding of the technologists into an interdisciplinary view of how the Web could be sampled, preserved and made available as an invaluable tool for current researchers, historians and for future generations.