The arrival of OSOs …

I was talking to my son about an edition of Mythbusters where they were looking at, of all exotic things, Anvils.  The piece was about making the distinction between REAL anvils (which can be used for iron-work) and apparently similar artefacts (that are only for decoration).

The meme became Anvils vs. ASOs or Anvil-shaped Objects.

I realise this is an idea I’ve been waiting for in the world of Web Observatories: a way to talk about systems that may be similar or even identical in function to Observatories but don’t use the name and may not be focussed on this approach.

Thus I’ve coined the term OSO – the Observatory-shaped Object – to denote systems which are close to the sort of WO system we are talking about, even if they were not designed/intended to be an observatory, but which have the potential to act as an Observatory or be extended to become one.

A classic example of an OSO is the Southampton University ePrints system, which started life as a document repository, but which has been extended to harvest data sets (e.g. from Twitter), to host data sets and link them to academic papers and, critically, to locate and index the existence of other repositories with other data and docs.

So now we have WOs and OSOs !!!

(With thanks to Harrison Brown and Mythbusters)

So what is Web Science anyway …

Reproduced from a Quora question …

I am currently working on a Web Science PhD and I work for the Web Science Trust which comprises many groups who research in this area. This does not make my opinion or definitions any more true/valid than anyone else’s who may comment here but I think it is fair to say I am close to the subject.

Whilst exact definitions are always risky I will see if a few observations are helpful…

It is all about browsers ….?

Web Science was defined (in the original paper and red book) to refer to the study of “decentralised information systems” (which does not mean exclusively the WWW, though this is the most obvious example right now). This also (I hope) makes it clear that if you are looking at data moving between apps on phones, or data between networked physical objects, rather than pages in traditional HTML browsers, this may still be Web Science.

What are we trying to discover …?

Web Science is seeking to understand how the WWW works both structurally and socially, and the way in which these two elements interact and create emergent properties and behaviours at Web scale. As such, it is a study of socio-technical systems or social machines. How we change and evolve the Web is fairly apparent to most people, but how the Web changes us may be less obvious and just as important.

“On the Web” or “About the Web”..?

There is a question about whether Web Science studies data ON the Web (i.e. *any* kind of data that is available on the Web) or if it should focus only on data ABOUT the Web (i.e. metadata and technical/networking data).

My own view on this is that “it depends”. Web Science attracts researchers from many disciplines and I personally work with researchers with backgrounds in Law, Maths, Education, Politics, Astronomy, Philosophy, Sociology, Medicine, Marketing as well as Computing and others. Hence their specific interests and perspectives are highly diverse. The lawyers may (for example) be less interested in the network structures and mathematical properties of a data set than the behaviours they represent ..

What’s the difference between Web Science and ….?

Web Science potentially overlaps with elements of several other disciplines such as Internet Science, Data Science, Computational Social Science, Network Science and others – simply because almost ALL disciplines are affected/mediated by the Web in some way and have received attention from researchers over recent years.

My hope (and largely my observation) is that Web Scientists are not trying to claim this area as their own but recognise that, say, Psychologists studying the Web may overlap with Web Scientists looking at psychological effects – and it is perhaps less important to decide who is doing “Web Science” or “Web Psychology” etc and more important that such groups should collaborate, share data and enrich each other’s understanding of best practice.

It’s all about the data …

Given the sheer scale of the Web, datasets may end up being “big” (by which I mean difficult to process with the systems/technologies at hand) but if they aren’t this doesn’t exclude them from Web Science. You may interview a dozen people about their experience of bullying on-line and the result, whilst not a big data set, may be valid and central to an interesting line of Web Science research on how the nature of the Web affects social behaviour.

A colleague at the Web Science Trust, Jim Hendler, wrote a very interesting paper about “Broad Data” which I recommend.

Ultimately the data may be about a range of things, user behaviour, the shape/structure of graph data, information cascades etc so I would have to say that almost any type of data (open/closed, UGC/machine-generated, big/broad etc) may be used in Web Science.

I hope at least parts of this are helpful …

Thinking about DNA AND NDA …

The really interesting insight I got from a DNA view of social machines is this: whilst DNA was initially meant to be evocative of the idea of building blocks and diversity from simpler atomic elements, it now occurs to me that DNA (in its presentation) suggests an order D->N->A, which we might read as the defined technology driving the choice/selection of methods/processing, which in turn affects the behavioural Archetypes in a technologically deterministic way ..

Whereas AND (A->N->D) might be more representative of a socially constructed paradigm in which Archetypes and their drives/ambitions lead to a choice of methods/expressions which shapes the functionality of technological systems.

NDA is even possible (N->D->A), suggesting a scenario in which processes/expressiveness may be presented (particularly in the form of legislation) such that archetypal ambitions are shaped by the systems which are defined in legislation.

From Search to Observation: an update

Background

Often the first question I get asked in this space is WHAT IS an Observatory? – In the fullness of time I have come to believe this is not a particularly engaging or rewarding question. IT technologies can be highly similar across a diverse set of platforms and applications ranging from Apple watches to warehouse management systems.

For example, things which appear to share a moderate level of common building blocks can be radically different in their appearance and function. Whilst it is popular to point out that your DNA is 99% similar to that of a chimp – what is less often pointed out is that your DNA is 30% similar to a Daffodil! … and so instead of building blocks, I find myself focussing much more on two other questions: HOW IS an Observatory? and WHY IS an Observatory?

Here the variations of what is done with the Observatory technology and who is applying it for what reasons seem to be much more relevant and, in any case, much more interesting.

With this in mind I have recently revisited an earlier paper called “From Search to Observation” in which I argue that even though Search Engines and Observatories share many common architectural elements (databases, APIs, graphics/analytics etc.), the essence of what they are trying to do is NOT identical.

I initially identified more than a dozen processes that seemed to emerge from an analysis of the literature and the wider dialogue in this space and published these for community feedback – not much dissent so far.

More recently I expanded this analysis on the back of a series of interviews with more than 50 participants and the resulting process list more than tripled what we had seen – particularly as I started to distinguish between input/output factors and internal processes.

An updated paper has been prepared but is not yet published. Given the restrictions on page count, a list of more than 60 factors could not easily be reproduced there, and so I have included these here for those who wish to comment on, refute or discuss the existence or classification of these factors/processes.

Overview

Brief Process and Factor definitions

Celebrity

Describes the existence of widespread recognition of a theme, person/group, resource/tool or other entity which sets expectations around priority, importance and inclusion of the said entity.

e.g. there may be no evidence to support the idea that Tweets are particularly more accurate, enlightening or relevant than micro-posts from any other source, and yet the immense cultural impact Twitter has had almost certainly skews expectations of its inclusion in analysis and consideration.

Cost

Impact of the cost of usage/operation of WO systems and tying into later emergent Cost-Benefit assessment.

Corporate Structure

Which may affect how/where organisations (not only commercial organisations) are able to participate in terms of authority, jurisdiction, charter, stakeholder impact.

Community

The aspect of connecting to existing groups or creation of new groups via the use of WO – especially where available data/tools align with the objectives and interests of a community.

Convenience

Technical barriers to entry/participation vs simpler user experience are naturally likely to impact the quantity and quality of participation in WO systems.

Collegiality

Describes the ambient level of interaction that draws homogeneous and heterogeneous groups together.

Commercial interests

Describes the existing tendencies around market-share, intellectual property and control which may affect what users are prepared to share and under which conditions.

Collection

Describes the process of collecting data/metadata about the WO interaction which might comprise information on the data, the data-source etc

Consign (data to WO)

Describes the process of depositing/linking a data set or tool to the WO

Conspicuous collection

Deals with the impact of making explicit that data is being collected and, potentially, publishing the data or analysis of the data such that previous behaviour may be affected by the disclosure.

Conflict (+ conflict resolution)

Describes the process of resolution regarding some asserted fact in the WO (ownership, value, usage, permission etc)

Communication

In contrast to a single request/response from a known search engine, the process of observation may be characterised as one or more communication processes across several repositories, starting with discovery of sources, the disclosure of metadata, the negotiation/establishment of technical data exchange and the grant (either manual/technical) of licenses.

Canonical Sources

Where more than one repository offers the same or overlapping datasets there will be the requirement to establish a de facto or canonical source.

Clarification

Clarification is a multi-step process to ascertain values through supplementary enquiry. This may apply to questions of provenance, usage or cost.

Connection

Observers will typically need to access the raw data from the one or more repositories which their search has identified – hence further individual connection protocols and processes will be required

Certification

Where the observer’s process requires confirmation of the source (publisher) of the data to be explicitly documented a certificate format and certification process may be required

Charging Models

It is not anticipated that all data that will be observed will necessarily be open data and hence provided free of charge. It is anticipated that observations may involve the payment of a license fee with a mechanism to grant the permissions associated with the license vs those without a license or with a different license

Confidentiality

Aspects of access and privacy must be addressed to ensure that data/services are accessible according to legal and ethical standards.

Calibration

Allows for the process of adjusting/modifying some data/service in line with a known size of external effect such as error or bias.
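
As a rough illustration of Calibration, the sketch below removes a known additive bias and applies a known scale correction to a list of readings. The function name `calibrate`, its parameters and the sample numbers are all assumptions made up for this example; they do not describe any existing WO system.

```python
def calibrate(readings, bias=0.0, scale=1.0):
    """Remove a known additive bias, then apply a known scale correction.

    `bias` and `scale` stand in for a "known size of external effect";
    both are hypothetical parameters chosen for this sketch.
    """
    return [(r - bias) * scale for r in readings]


# Hypothetical raw counts known to over-report by 5 units each.
raw_counts = [105.0, 98.0, 110.0]
corrected = calibrate(raw_counts, bias=5.0)
print(corrected)  # [100.0, 93.0, 105.0]
```

In practice the size of the error or bias would itself need a documented source, which ties Calibration back to provenance and Commentary.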

Commentary

It is anticipated that meta data including commentary by both users and curators of the data will provide a richer environment for a qualitative understanding of data beyond stored value

Capture / Charge / Crowdsource

The process of providing data to the system through a series of individual events (capture), through the bulk upload of a dataset (charging) or through manual input (crowd-sourcing)

Collection

Observation will often be associated with longitudinal datasets from one or more sources. Whilst it is not envisaged that all observatories will seek to store all data, it is anticipated that each observatory would store some data and hence a process of regular collection, snapshotting or processing of streaming data would be required

Computation

Each data set (or group of data sets) may form part of a larger analysis or visualisation requiring a series of one or more computations

Contextualise

The process by which an accurate output/outcome may depend on a set of meta-data (relating to the user or the problem statement) as well as the data itself.

Conversion

Each repository may hold datasets in a variety of formats – metadata associated with the dataset will allow the observer to invoke appropriate format conversion services

Correlation

A composite data set comprising heterogeneous data will allow for the possibility of correlation analysis across disjoint datasets

Classification

The data/service which is identified from a repository as part of an observation service would be formally classified according to topics using some knowledge classification schema, some access schema and may also be linked to other data or services in the Observatory

Co-creation

The process by which results are created between users and/or between users and machines, i.e. Construction involving more than one participant.

Construction

Datasets addressing specific research questions may typically be assembled from more than one data source with either homogeneous or heterogeneous structures, allowing for richer analysis of trends and correlation. This may fall into the area of big data or broad data.

Catalogues

The creation of directories of links to locally/remotely hosted data and services

Citation

Addressing the exchange of academic credit for the use of the materials or work of others through a formal reference subject to bibliometric analysis.

Compartmentalisation

The creation of logical service/data sub-structures based on community membership, permissions, licenses, confidentiality, jurisdiction or other suitable frameworks

Contextualise

Each series of observations may be made in the context of a research question which informs the relevant curation, commentary and collaboration addressing the research question. The context also informs the services/datasets that are published out to external users of the Observatory

Constrain(ts)

To allow partial access to data and/or services by means of user permissions, grant of license, time/date restrictions or other contextualisations of the entities (data, services or users)
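
A minimal sketch of how such constraints might be checked is shown below. Everything here is hypothetical: the `Constraint` and `User` classes, the `may_access` function and the example permission/license names are invented for illustration and are not part of any real Observatory API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Constraint:
    """Hypothetical access rule attached to a dataset or service."""
    required_permission: Optional[str] = None
    required_license: Optional[str] = None
    not_before: Optional[datetime] = None
    not_after: Optional[datetime] = None


@dataclass
class User:
    """Hypothetical user record holding permissions and granted licenses."""
    permissions: frozenset = frozenset()
    licenses: frozenset = frozenset()


def may_access(user, constraint, when):
    """Return True only if every constraint that is set is satisfied."""
    if (constraint.required_permission is not None
            and constraint.required_permission not in user.permissions):
        return False
    if (constraint.required_license is not None
            and constraint.required_license not in user.licenses):
        return False
    if constraint.not_before is not None and when < constraint.not_before:
        return False
    if constraint.not_after is not None and when > constraint.not_after:
        return False
    return True


# Example: access limited to licensed researchers during 2020.
rule = Constraint(
    required_permission="research",
    required_license="cc-by",
    not_before=datetime(2020, 1, 1),
    not_after=datetime(2020, 12, 31),
)
alice = User(permissions=frozenset({"research"}), licenses=frozenset({"cc-by"}))
print(may_access(alice, rule, datetime(2020, 6, 1)))  # True
print(may_access(alice, rule, datetime(2021, 1, 1)))  # False
```

Treating an unset field as "no restriction" is a design choice of this sketch; a real system might prefer to deny by default.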

Circulation

The process of sharing data/services on a periodic basis with a specific community for the purposes of general information and synchronisation of understanding

Conflation-Compression

Allowing for the reduction in volume and/or resolution of data by techniques such as arithmetic averaging, periodic sampling, aggregation and interpolation.
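
Two of the techniques named above, periodic sampling and arithmetic averaging, can be sketched in a few lines. The function names `periodic_sample` and `block_average` and the sample series are assumptions introduced purely for illustration.

```python
from statistics import mean


def periodic_sample(values, step):
    """Keep every `step`-th value: fewer points, original readings preserved."""
    return values[::step]


def block_average(values, size):
    """Replace each run of `size` values with its arithmetic mean.

    The final block may be shorter than `size` if the lengths do not divide.
    """
    return [mean(values[i:i + size]) for i in range(0, len(values), size)]


# Hypothetical series of six observations reduced to two points each way.
series = [2, 4, 6, 8, 10, 12]
sampled = periodic_sample(series, 3)
averaged = block_average(series, 3)
print(sampled)   # [2, 8]
print(averaged)  # [4, 10]
```

Sampling keeps real observed values but discards the rest, whereas averaging uses every value but returns numbers that may never have been observed; which is appropriate depends on the research question.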

Consumption

The process by which a data feed or analytics service is used by the WO as part of one/more larger processes

Collaboration

Observations may involve the exchange of data between two or more parties for the achievement of common or complementary goals. This exchange may not involve a charging structure but may nonetheless comprise a formal agreement with a delineation of responsibilities

Curation

Datasets, which may be generated/harvested automatically, may require a post-hoc (semi-)manual process of selection, deletion, annotation and re-classification.

Choreography

For data sets which need to be refreshed or multiple streaming services there will be the requirement to co-ordinate or orchestrate the updates and staging of the data (potentially feeding into a new cycle of discovery, assembly and execution)

Confirmation

Addressing the ultimate purpose of WO usage which is to provide distributed solutions to decision support problems.

Cost Benefit

Addressing the resulting conclusions around the operational economics of operating WOs

Coherence

Addressing the need for results from different sources to be non-contradictory.

Conformity (vs subversion)

Addressing the extent to which WO operations are blocked or enabled through the adoption of standard processes, licenses and methods of recognition and exchange.

Compulsion

The irrational need to start all WO processes with the letter C 😉

More seriously – this addresses legal directives/frameworks which influence the behaviour of participants for fear of legal redress

Cohesion

Addressing the extent to which distributed components of the larger WO eco-system align to support an overall work-flow assuming a trusted position in providing a range of services and sources.

Contract (contractual agreement)

The creation of formal agreements/rules etc regulating the use/operation of WO systems and services

Conclusions

Support for hypothesis testing and decision modelling

Consistency

Addresses the tendency of consistent standards for sources, services and processes to emerge over time (certain aspects of the system become less chaotic/dynamic over time)

Cascades

Addresses the creation of dynamic patterns of themes, and activities in the form of meta-data about the operation/usage of the WO. This belongs to “observing the observatory” or “Observing the Observers” and might be thought of as “observometrics”.

Credit

Distinct from academic citation, Credit describes attribution for discovery, participation or sharing within the eco-system resulting generally in positive (vs negative) reputation.

Catalysing

Addressing the meme effect of propagating interest and research on the WO through the community

Collective Action

Addressing the effect of providing a platform around which entities may act collectively based on communities of interest

Commercialisation

Accounting for the effect that certain sources, and services may not be free-of-charge but rather offered on a freemium or full commercial basis as a funding model to address the costs of providing the service.

Consensus (Convergence)

Accounting for the effect that understanding around certain topics may converge over time as a result of discussion/collaboration across the WO

Convention

Accounting for the effect where particular patterns of usage, behaviour and operation may become de facto rather than de jure over time as an expression of the wishes, style and preferences of the community of users and providers.

Credibility

Accounting for the effect of increasing or decreasing reputation in terms of the accuracy, quality, contribution etc of a WO entity (Source, Service or User)

Confidence

Observers may wish to base sensitive calculations/decisions on the observed data and hence trust + provenance will be required – particularly for automated/unattended processes.

Complexity

The effect of operation and interoperation of many different datasets, tools, experiments and participants. The WO not only studies complex social machines but is itself, potentially, a complex social machine.

Consequence

The results (positive/negative) resulting from the identification, attribution and accountability around the use of data/services under particular agreements

Culture (Cultural norms)

The emerging typical behaviours and standards that informally appear over the lifetime of a social machine. i.e. not expressed through formal contracts but as modus vivendi practice

The Web is something like a piano ..

You can understand everything about a piano’s construction (the materials, the engineering, the scale lengths and tunings) and indeed those things are relevant at some level. All this information, however, tells you almost NOTHING about the essence of music nor the great works of jazz,  gospel and soul.

The Web offers a similar problem – the components and engineering describe clearly what the technical vocabulary/grammar is and can allow us to look for meaningful patterns in the oceans of data – but tell us almost nothing about all the things, both great and terrible, that the Web may produce.