Data and Databases: Scoping a Database

Emily Genatowski; James Baille

Topics:

Data management

Introduction

The scope of a project, as we define it for this class, is the answer to the question “what gets into your dataset and what doesn’t?” For database projects in the humanities and social sciences, a concrete idea of your project scope matters: it determines what others can expect to find in the data, how much work the project involves, and which research questions the data can answer.

Learning outcomes

After completing this resource, learners should be able to:

Describe the main reasons for formal scoping
Distinguish between source scoping and model scoping
Detail the questions a researcher must ask themselves when performing formal scoping of a database

Setting a Scope

There are three main reasons for setting out a formal scope:

It tells other people what they can find in your database.
- This is extremely important for making databases other people can subsequently use – they need to know on what basis people get into your database, so they know how much their own work overlaps with yours. Even when the database is just for your own use, you need to know what is and is not present in the input data in order to explain any analyses of it.
It makes your project manageable in scale and planning.
- Without any limits or boundaries, a prosopography can extend almost infinitely – you need to be able to plan the work you can feasibly do, or at least plan a set of meaningful units of work. One of the most common failure points for humanities and social sciences (SHS) database projects is making them either too small to analyse meaningfully or so large that coverage ends up extremely patchy, which likewise prevents effective analysis.
It interacts with your research questions.
- You will be making a database because you have questions you want to answer: if you haven’t formulated those questions, you should do so. The scope is closely tied to those questions: you need to have the data necessary for the question you want to answer in your database, and so referring back to your questions throughout the process is very important.

For example, someone studying law and human geography might want to look at changes in the incidence and conviction rates for different sorts of crime in different locations or areas over time. The scope of such a project would have to include the different types of crime being discussed, the locations or areas used, and the time frame. These aspects of the scope would in turn affect what questions the data set could be used to answer: if the data set’s areas were a representative sample of the whole country, or actually covered the whole country, it could be used to discuss national trends, whereas if the areas were all of one particular type (for example, all rural), that might improve the utility of the data set for looking at rural crime but at the expense of being able to draw national conclusions.
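
One way to keep such a scope explicit is to write it down as data and test every candidate record against it. The following is only a minimal sketch of that idea: the `Scope` class, the field names, and all the example values (crime types, area names, years) are invented for illustration and do not come from any real dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """A written-down scope: which crime types, areas and years are included."""
    crime_types: frozenset
    areas: frozenset
    first_year: int
    last_year: int

    def includes(self, record: dict) -> bool:
        # A record is in scope only if it satisfies all three boundaries.
        return (
            record["crime_type"] in self.crime_types
            and record["area"] in self.areas
            and self.first_year <= record["year"] <= self.last_year
        )


# Hypothetical scope and records, invented purely for illustration.
scope = Scope(
    crime_types=frozenset({"burglary", "fraud"}),
    areas=frozenset({"Area A", "Area B"}),
    first_year=1990,
    last_year=2010,
)

records = [
    {"crime_type": "burglary", "area": "Area A", "year": 1995},  # in scope
    {"crime_type": "arson", "area": "Area A", "year": 1995},     # crime type out of scope
    {"crime_type": "fraud", "area": "Area C", "year": 2000},     # area out of scope
    {"crime_type": "fraud", "area": "Area B", "year": 2015},     # year out of scope
]

in_scope = [r for r in records if scope.includes(r)]
print(len(in_scope), "of", len(records), "records fall inside the scope")
```

Because the scope is a single object, it can be published alongside the data, and any change to it during the project is visible in one place.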

A historian, meanwhile, might start with a question like “how were non-Chinese generals or officials seen by Han Chinese during the T’ang dynasty?” To answer this, building a persons database of generals and officials under the T’ang dynasty might well be desirable: there would be no problem for this question in including all records, even if they conflicted in the detail, because this adds another perspective and strengthens the dataset for answering the question. To narrow the focus, the scope might include or exclude specific types of source material, or only look at documents written in Chinese, or only cover part of the possible time period (something it would be then important to flag up when discussing how representative the data were for the T’ang dynasty more widely).

It is often necessary and valid to argue that a limited dataset represents a wider pattern in reality: data are usually small samples of the possible total of information. However, this has to be justified, and it is important both when working on your own data and when working with other people’s to consider whether the claims made for the data are genuinely justifiable given both its scope and the extent to which the data fulfill that scope. The scope has to be written – and where necessary, adjusted through the project – such that it matches the final dataset. An example of this problem would be the following map, produced from a Wikipedia dataset of historical battles and presented with the claim that “historically Europe has been overwhelmingly a centre of global conflict”.

This, however, is a case where the claimed scope – worldwide, throughout human history – does not match the available data. Wikipedia’s data is compiled on an ad-hoc basis by contributors, not on a systematic basis: what is presented is not a representative sample of the whole. Not only that, but there are significant biases in data collection, with more contributions to the dataset and therefore more records from Europe and North America, and with fewer contributions from areas that either lack historical record-keeping or have significant language barriers to accessing those records for people writing articles on an English-language wiki. The result is a map that says nothing about the relative likelihood of battles in different parts of the world: the data gathered were simply not appropriate in their scope to draw worldwide, or relative regional, conclusions. Worse, the map ends up reflecting a different disparity from the one it claims: global disparities in access to information and technology, and in the retention of knowledge and records from the precolonial period, have shaped the map, such that the result does not merely fail to show what it intends but actually suggests a spurious pattern produced by those disparities. Wrong results gained in this way can lead to incorrect and indeed damaging outputs: we have both a practical and an ethical imperative as researchers to avoid misleading people by failing to understand our data and its scope.

We have previously (in section two, From Source to Data) discussed the difference between using a database to compile and model source material and using a database to model an actual situation. This division of research questions helps inform what parameters you use to define your scope. These can broadly be split into two categories, which tend towards, whilst not necessarily mapping neatly onto, the above division of database type and research question. Firstly there are scope boundaries that are created by features of your source material and data gathered, and secondly there are those that are created by features of the system or model you want to look at via that data.

The former category, source scoping, defines “entry to the club” by questions about source material – does the person appear in a particular source or type of source, or do they appear in a certain language group’s body of sources?

Here are some examples of criteria:

Document type – Whilst any SHS database will need a list of sources for its data, some databases are primarily attempts to work with a certain document type or types.
Language – Many areas of history have source material from a number of different language groups available. For indexing, division by language can be sensible to allow specialists on different language areas an angle to work on their material.
Authorship or provenance – This could be texts by a particular author or authors, but could also be for example reports from a particular business or legal decisions from a particular judicial circuit.
Genre – A more difficult area as many texts lack formal categorisation, but such categorisations can be imposed or assessed as part of a project in order to narrow down the scope to a more specific set of sources.
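
The source-scoping criteria above can be sketched as a simple filter over a list of sources. This is an illustrative assumption about how one might record them, not a prescribed schema: the `Source` class, the `in_source_scope` function, and every example value (titles, archives, genres) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    title: str
    document_type: str
    language: str
    provenance: str
    genre: str


def in_source_scope(source, *, document_types=None, languages=None,
                    provenances=None, genres=None):
    """Return True if the source passes every criterion that has been set.

    A criterion left as None is treated as 'not restricted'.
    """
    checks = [
        (document_types, source.document_type),
        (languages, source.language),
        (provenances, source.provenance),
        (genres, source.genre),
    ]
    return all(allowed is None or value in allowed for allowed, value in checks)


# Hypothetical sources, invented purely for illustration.
sources = [
    Source("Source A", "charter", "Latin", "Archive X", "legal"),
    Source("Source B", "chronicle", "Latin", "Archive X", "narrative"),
    Source("Source C", "charter", "Greek", "Archive Y", "legal"),
]

# Scope: only charters written in Latin.
selected = [
    s.title
    for s in sources
    if in_source_scope(s, document_types={"charter"}, languages={"Latin"})
]
print(selected)
```

Genre is the hardest of these fields to fill in, since, as noted above, the categorisation often has to be assessed by the project itself before it can be used as a filter.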

The latter category, model scoping, defines “entry to the club” by questions about the real things that the data nominally represents – where they were, when they were, and who they were. This is more necessary for modelling questions where the scholar wishes to test questions about a specific group of people and the society and world they live or lived in. A non-exhaustive list of categories follows:

Chronology – this is relatively self-explanatory, setting dated bounds for the data objects.
Space – setting a spatial bound. This could be a simple set of lines on a map, but it can also interact with chronology. For example, some data on public opinion or income levels in Sudan might reasonably want to set its space boundary as being the state of Sudan – and this would mean that it would include data from the area that is now the Republic of South Sudan in parts of the dataset collected before South Sudan’s independence in 2011, and not afterwards.
- Note that space and chronology boundaries may or may not apply within records as well as to entire people. For example, a database about European colonists in the Americas might include their activities in the Americas but not their lives or activities prior to the migration, colonisation or conquest they were engaged in.
Role – a formally defined category based on action or external categorisations. For example, only people with particular offices are included, or businesses operating in a particular market, or legal cases concerning a specific set of crimes.
Identity – for example, only people with a certain ethnicity, gender, race, sexuality, etc., are included. This functions somewhat similarly to the role category, except that identity is often written about in more oblique ways and is therefore harder to categorise (a problem we will cover in more depth later in this course).
Connection – if one particular figure or institution or place is at the core of the study, the category could be people with known connections to that core element or elements. The difficulty of this sort of categorisation is that it requires more circular working as one is less likely to know about these connections in advance – one may for example need to produce more data than needed and then refine it after doing analyses of the networks involved.
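
The interaction between space and chronology in the Sudan example can be sketched as a spatial rule that depends on the date of collection. This is a minimal illustration under stated assumptions: the function name and the region labels are hypothetical, the labels are assumed to refer to present-day territories, and the cut-off uses the date of South Sudan's independence, 9 July 2011.

```python
from datetime import date

# South Sudan became independent on 9 July 2011.
SOUTH_SUDAN_INDEPENDENCE = date(2011, 7, 9)


def in_sudan_scope(region: str, collected_on: date) -> bool:
    """Apply the space boundary 'the state of Sudan', which changes over time.

    `region` is assumed to be a present-day territory label (hypothetical
    field), and `collected_on` the date the data point was gathered.
    """
    if region == "Sudan":
        return True
    if region == "South Sudan":
        # Part of the state of Sudan only until independence.
        return collected_on < SOUTH_SUDAN_INDEPENDENCE
    return False


print(in_sudan_scope("South Sudan", date(2005, 1, 1)))
print(in_sudan_scope("South Sudan", date(2012, 1, 1)))
```

The other model-scoping criteria (role, identity, connection) could be added as further checks in the same way, although, as discussed below, they usually need more academic judgement than a simple date comparison.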

Of course, all of these categories may have edge cases. One of the difficulties is working out where to draw the line and setting yourself rules for doing so, as you may find out when producing data. One notable feature of the model scoping categories is that they often require more academic judgement.

These scoping features, including both textual and historical categories, can be, and usually are, combined, and the neat distinctions drawn here will break down to a noticeable extent when you practically scope out a project. There are some areas where it is rare for scholars to actually be able to consult all the extant material, so even if the scope is not nominally textual, clarity on which sources were incorporated into the project is still an absolute necessity. Meanwhile, in source-scoped databases there are often still some limitations on inclusion – for example, if in the course of some legal research one wanted to look at all of a particular judge’s rulings, it might be necessary to apply a practical filter within that, restricting it to particular types of case or periods in that person’s career. As another example, medieval chronicles often have opening sections that give a brief summary of events from the creation of the world to the main point of the chronicle, or insert analogies and stories about classical heroes: this means a chronological requirement on compiling person data can be useful to avoid Adam and Eve needing to be detailed in every database.

One way of summarising a lot of these rules and how to think about them practically is what may be thought of as the Richard the Lionheart Problem. King Richard I of England had a varied career that involved stretches in England, France, Cyprus, Outremer, and Austria, among others. So if someone made a database of people in medieval Cyprus, how much about Richard should they include? They could choose to only represent what appears in local or regional sources, or they could present material from all sources but only cover his actions in Cyprus, or they could cover his actions in Cyprus and one or two connected actions (to cover for example where he was before and after his stay there), or they could say that as a person who was at some point on Cyprus, every bit of Richard’s life is valid for inclusion.
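
The four options above can be sketched as alternative inclusion policies applied to the same chronologically ordered list of events. Everything here is hypothetical: the `Policy` names, the `select_events` function, the field names, and the event list, which is a set of schematic placeholders and not a record of Richard's actual itinerary or of any real sources.

```python
from enum import Enum


class Policy(Enum):
    LOCAL_SOURCES_ONLY = 1        # only what local or regional sources report
    ACTIONS_IN_PLACE = 2          # all sources, but only actions in the place
    IN_PLACE_PLUS_NEIGHBOURS = 3  # as above, plus the events just before and after
    WHOLE_LIFE = 4                # everything known about the person


def select_events(events, place, policy):
    """Select events for one person; `events` must be in chronological order."""
    if policy is Policy.WHOLE_LIFE:
        return list(events)
    if policy is Policy.LOCAL_SOURCES_ONLY:
        return [e for e in events if e["source_region"] == place]
    in_place = [i for i, e in enumerate(events) if e["place"] == place]
    if policy is Policy.ACTIONS_IN_PLACE:
        return [events[i] for i in in_place]
    # IN_PLACE_PLUS_NEIGHBOURS: keep each in-place event and its neighbours.
    keep = set()
    for i in in_place:
        keep.update({i - 1, i, i + 1})
    return [e for i, e in enumerate(events) if i in keep]


# Schematic placeholder events, invented purely for illustration.
events = [
    {"label": "event 1", "place": "England", "source_region": "England"},
    {"label": "event 2", "place": "France", "source_region": "France"},
    {"label": "event 3", "place": "Cyprus", "source_region": "Cyprus"},
    {"label": "event 4", "place": "Cyprus", "source_region": "England"},
    {"label": "event 5", "place": "Outremer", "source_region": "Cyprus"},
    {"label": "event 6", "place": "Austria", "source_region": "Austria"},
]

for policy in Policy:
    chosen = select_events(events, "Cyprus", policy)
    print(policy.name, [e["label"] for e in chosen])
```

Each policy yields a different dataset from the same material, which is why the chosen rule needs to be stated as part of the scope.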

Conclusion

Having worked through this resource, the reader should understand how to answer the question “what gets into your dataset and what doesn’t?” The reader should also understand why scope is so important: it helps other researchers, supports a research question, and keeps a project manageable. Finally, the reader should now be familiar with examples of criteria for scoping by source, including document type, language, provenance and genre, as well as criteria for scoping by model, including chronology, space, role, identity and connection.
