TERRA - The Existential Risk Research Assessment

Key concepts

TERRA (The Existential Risk Research Assessment) is a semi-automated process for systematically reviewing the relevance of academic research to Existential Risk Studies.

Purpose of the tool

In a systematic literature review, one of the many time-consuming tasks is reading the titles and abstracts of research publications to check whether they meet the inclusion criteria. TERRA provides a means of both sharing this task between multiple people, using crowdsourcing, and partially automating it, using machine learning.

The resulting bibliography can be used to provide an evidence base for policy and risk analysis.

History and background

There is now a deep division, called the “synthesis gap”, between all the research that has been published and the subset of research that has been systematically reviewed, synthesized, and used for decision-making. This is true across many different fields, including Existential Risk Studies (ERS); however, ERS faces particular challenges due to the relatively small size of the field, the breadth of potentially relevant research across numerous disciplines, and the long history of people thinking about existential risk and how to reduce it (going back to at least the 1940s).

Crowdsourcing can be used when screening publications: by sharing the workload between multiple people, the time and/or money required can be reduced. If the evidence base can be updated and reused, then crowdsourcing can also save time and/or money by sharing the workload between past, present, and future reviewers. Crowdsourcing is used by Cochrane (the collaboration for systematic reviews in medicine that has set the standard for other fields of research), in the form of the “Cochrane Crowd” (http://crowd.cochrane.org). Crowdsourcing is also used in futures studies, as a method of horizon scanning for emerging threats.

Machine learning can be used to predict the relevance of publications to a systematic review, using text mining. Based on a training set of publications that have been labelled as “relevant” or “irrelevant” by humans, a machine-learning classifier can be trained to predict which publications are relevant, using the text in their titles and/or abstracts. The accuracy of the classifier can be tested on a test set of publications that have also been labelled by humans, and the classifier can then predict the relevance of new publications that have not yet been screened. By using text mining, the human workload of screening publications for systematic reviews can be reduced by 30–70%.
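As an illustrative sketch of this train/test/predict workflow (not TERRA’s actual pipeline), the example below uses scikit-learn; the file names and column names ("labelled.csv", "unscreened.csv", "abstract", "relevant") are assumptions made for the example:

```python
# Illustrative sketch only (not TERRA's actual pipeline). Assumes a file
# "labelled.csv" with columns "abstract" and "relevant" (1 = relevant,
# 0 = irrelevant), and a file "unscreened.csv" with an "abstract" column.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

labelled = pd.read_csv("labelled.csv")

# Split the human-labelled publications into a training set and a test set.
train_texts, test_texts, train_y, test_y = train_test_split(
    labelled["abstract"], labelled["relevant"], test_size=0.2, random_state=0
)

# Train a classifier on bag-of-words features extracted from the abstracts.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_y)

# Test the classifier's accuracy on the held-out, human-labelled test set.
print(classification_report(test_y, model.predict(test_texts)))

# Predict the relevance of publications not yet screened by humans.
unscreened = pd.read_csv("unscreened.csv")
scores = model.predict_proba(unscreened["abstract"])[:, 1]
```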

CSER’s involvement

CSER researchers used crowdsourcing, machine learning, and evidence accumulation in an open-access online database to create a bibliography of publications about existential risk. We called this process “The Existential Risk Research Assessment” (TERRA). This followed the principles of subject-wide evidence synthesis, in which a wide-ranging search strategy is used to find publications that are relevant to a whole subject.

TERRA is a web application that is hosted at terra.cser.ac.uk and is based on the Django framework for Python (www.djangoproject.com). When using the web app, each participant is shown titles and abstracts from the results of a wide-ranging search of the Scopus academic publication database, based on many keywords associated with Existential and Global Catastrophic Risk. Results are shown to participants in a random order, to minimize bias, and participants are asked to assess the relevance of each publication based on the inclusion criteria. To recruit participants from outside of CSER, we promoted TERRA on social media, on the CSER website, and in a workshop at the Cambridge Conference on Catastrophic Risk. Participation was open to anyone.
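As a minimal sketch of how the random ordering might be implemented in Django (the model names and the helper function are hypothetical, not TERRA’s actual code):

```python
# Hypothetical sketch, not TERRA's actual code. Shows each participant
# unassessed publications in random order; lives inside a Django app,
# e.g. screening/models.py.
from django.conf import settings
from django.db import models

class Publication(models.Model):
    title = models.TextField()
    abstract = models.TextField()

class Assessment(models.Model):
    publication = models.ForeignKey(Publication, on_delete=models.CASCADE)
    assessor = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    relevant = models.BooleanField()

def next_publication_for(user):
    """Return a random publication the participant has not yet assessed."""
    assessed = Assessment.objects.filter(assessor=user).values_list(
        "publication_id", flat=True
    )
    # order_by("?") asks the database for a random ordering, so the
    # sequence of publications differs between participants.
    return Publication.objects.exclude(id__in=assessed).order_by("?").first()
```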

We used an artificial neural network, implemented in the TensorFlow library for Python (www.tensorflow.org), to predict the relevance of publications that had not yet been assessed by humans, based on the abstracts of publications that had been assessed. We then generated three different models by setting three different probability thresholds, to control the unavoidable trade-off between “precision” (the percentage of publications predicted to be relevant by the machine that were judged relevant by participants) and “recall” (the percentage of publications judged relevant by participants that were correctly predicted to be relevant by the machine). The “low-recall”, “medium-recall”, and “high-recall” models aim for 50%, 75%, and 95% recall, respectively.
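The sketch below illustrates how probability thresholds could be chosen to hit such target recall levels. It is not TERRA’s code: the labels and scores here are simulated stand-ins for a human-labelled test set and a classifier’s predicted probabilities.

```python
# Illustrative sketch of picking probability thresholds for target recall
# levels (50%, 75%, 95%). The data below is simulated; in practice, test_y
# would be human labels and scores the classifier's predicted probabilities.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
test_y = rng.integers(0, 2, size=500)                      # simulated labels
scores = np.clip(test_y * 0.4 + rng.random(500) * 0.6, 0, 1)  # simulated scores

precision, recall, thresholds = precision_recall_curve(test_y, scores)

for target in (0.50, 0.75, 0.95):
    # Recall decreases as the threshold rises, so take the highest
    # threshold that still meets the target recall.
    idx = np.where(recall[:-1] >= target)[0][-1]
    print(f"recall >= {target:.0%}: threshold={thresholds[idx]:.3f}, "
          f"precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")
```

A higher threshold predicts fewer publications as relevant, typically raising precision at the cost of recall, which is why the high-recall model must accept lower precision.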

Where to get started

Anyone can participate in TERRA, access its database of relevant research, and sign up for monthly updates on relevant publications at https://terra.cser.ac.uk/.

The researchers involved in the initial setup and training of TERRA wrote a paper summarizing its development, early operations, and lessons learned: “Accumulating evidence using crowdsourcing and machine learning: A living bibliography about existential risk and global catastrophic risk”.

CSER's 3 top tips for using TERRA

  1. TERRA is a tool for identifying literature potentially relevant to Existential Risk Studies, but this is only one step in a systematic literature review, and its outputs will require further analysis to be useful for the field.
  2. When assessing publications, we have found that participants can sometimes be too permissive, especially at first, out of a desire not to exclude publications that someone might see as relevant in some respect. This has likely led to a greater-than-desired false positive rate in the resulting data, so we have sought to clarify the inclusion criteria for participants (available at https://terra.cser.ac.uk/methods/) and to manually remove some false positives from the database in order to improve performance.
  3. In the short term, machine learning seems most useful for rapid evidence synthesis, in which timeliness is more important than comprehensiveness. In the long term, if crowdsourcing and evidence accumulation can be used to share the workload between multiple people and multiple years, then machine learning seems less useful, unless there is an improvement in both precision and recall at the same time (using a larger or better training set or a better algorithm). There is some evidence that more recent developments in Large Language Models may provide such improvements; however, we are keen to ensure that such technologies are assessed and implemented responsibly, and feel it may be too soon for them to be adopted for this purpose.