Introduction

In essence, bibliodiversity refers to how different communities manage knowledge creation and dissemination. Consequently, the information needs of these communities will also vary. This article will investigate whether evidence for this can be found by looking at frequently downloaded open access books in 100 countries. The subjects of these books are indicated by their Thema classification, which aims to be global in scope. Deploying a clustering algorithm will help to find patterns in the combination of classification codes and countries. This helps to determine whether the readers within a region also share an interest in the same topics.

When looking at global information needs and open access books, it is obvious that these titles will not just be written in English. The set of books used in this investigation contains text in 20 different languages. The subject classification is, however, language independent, which enables grouping all of them based on shared subjects. The next section will further explore the Thema classification and bibliodiversity.

Background and literature review

The data collected for this article comes from the OAPEN Library. Launched in 2010, the OAPEN Library has been set up to host and disseminate open access books and chapters.1 In May 2024, its collection contained over 34,000 titles, written in more than 50 languages and published by over 400 publishers. The collection is used globally, and COUNTER-conformant usage statistics are provided in collaboration with IRUS-UK.2 Since its launch, the books’ subjects have been coded using the BIC Standard Subject Categories Scheme. However, the BIC classification was permanently frozen in 2017 and was officially declared obsolete in February 2024.3 Its replacement is the Thema classification.

Before examining Thema in more detail, it is useful to start with a definition of classifications. As Barbara Kwasnik has written, ‘Classification is the meaningful clustering of experience’.4 It is a way to arrange and link subject descriptions, making it possible to see how these subjects are connected. This can be done in several ways: through hierarchies, trees, paradigms or facets. Hierarchies assume that subjects can be divided into smaller elements that inherit aspects of the overarching subject. For instance, the Thema classification assumes that ‘Comedic plays’ is a subset of ‘Plays, playscripts’, which itself is a subset of ‘Biography, Literature and Literary studies’.

A slightly different way to organize subjects is by trees. Here, the connection between subjects does not imply that one subject is part of a broader subject. Paradigms are quite different: they are based on a matrix comparing two sets of aspects. An example is the so-called ‘Alignment chart’,5 which compares law versus chaos and good versus evil – the chart has been used in several memes and jokes.6 Finally, facets are additional aspects that can be linked to the subjects. The Thema classification contains several of them, such as place, language or time period.

The Thema classification was launched in 2013, and its current version dates from 2022.7 In contrast to older schemas, Thema aims to be global in scope, and also to be useful throughout the whole of the book supply chain, from the publisher to the acquiring library or retailer. Its subjects are hierarchically organized, where 20 top-level categories are subdivided into 3,000 subject headings, completed with several facets – called ‘qualifiers’.

An important aspect of classifications – and Thema is no exception – is their independence from language. This means that Thema can be applied in a multilingual collection such as the OAPEN Library. The diversity of languages versus the dominance of English is seen by many as an important topic in bibliodiversity. Others have focused on the means of knowledge production and dissemination, where the strong position of a few large and commercial players is criticized.

These two aspects are related, as the majority of academic books are published in the English language by a relatively small number of publishers. Several of those companies have been expanding their services to incorporate (almost) all aspects of scholarly publication, a process described as ‘platform capitalism’.8 Based on the publications, these companies provide consultancy and data analysis to universities.9 Measuring how well publications and institutions are performing is an important part of this analysis. Two recent articles illustrate the issues around analysis and bibliodiversity: the dominance of English in the Journal Impact Factor,10 and the change in university rankings when open access sources are incorporated.11

Multilingualism in scholarly communication is – obviously – a global affair. It encompasses local indigenous knowledge systems,12 but also Chinese government policies regarding publications in English.13 In the literature, non-English academic publications have been linked to regional issues. This is visible in a European context, in research investigating multilingual publication patterns in the humanities and social sciences.14 A recent literature study by Balula and Leão concludes that a ‘balanced multilingualism’ is vital for a diverse academic publication landscape.15 Bibliodiversity as a way to enhance local knowledge is also the subject of a study by Mkhize and Ndimande-Hlongwa.16 The authors argue that local African languages and indigenous knowledge systems are indispensable for higher education. The dominance of English versus the role of regional languages is also debated by Flowerdew and Li, who investigated the publication choices of Chinese humanities and social science scholars.17 And finally, the study of Argentinian research output conducted by Chinchilla et al. links Spanish language publications to regional subjects and English language articles to more global issues.18

The current literature discusses the production of knowledge, but there is little on the connection between bibliodiversity and the consumption of knowledge. An older article by Snijder examined whether the books most downloaded from the OAPEN Library in non-English speaking countries were written in English. That was mostly not the case, and even when English language titles were part of the top ten, many mentioned regional concerns.19 Bibliodiversity is linked to languages, and the Thema classification scheme aims to be more international than BIC. Does this new classification align with the OAPEN Library’s international community of readers?

Methodology

Social network analysis and clustering

Social network analysis is used to answer this question. Each book in the OAPEN Library has been assigned one or more classifications. The classification codes are the same, regardless of the language the books have been written in. Furthermore, the hierarchical nature of Thema combined with its large number of possibilities allows for a fine-tuned description of the subject – or subjects – of the book. This can be combined with another aspect of OAPEN Library usage: the country from which the books have been downloaded. Thus, it is possible to find the ten most downloaded books from each country, during a certain period: January to December 2023. The dataset used for this article is based on the top ten of 100 countries. This allows us to cluster all the different classification codes with the countries. The dataset contains a network of 114 different classifications.

The full dataset is available using the link in the data accessibility statement at the end of the article.

The dataset consists of a network with two types of entities or modes: classifications and countries. Faust and Wasserman use the term ‘two-mode network’ to describe this.20 Furthermore, the relationship between the classifications and the countries is not reciprocal. The readers in the different countries have selected the books and their connected classification codes, but the other way around is impossible: the classification codes cannot select the readers. The technical term for a network of relations that only act in one way is ‘directed’. So, this two-mode network is directed. We have also seen that the classification codes cannot act independently. Networks consisting of actors and passive elements are called affiliation or membership networks.

In other words, the social network analysis in this article is based on affiliations of actors to the passive elements. In this case, the analysis tries to find communities of actors – the readers of the books who are represented by the country names – and passive elements: the subjects of the books. Newman and Girvan21 wrote an algorithm that repeatedly removes the connection between the elements in the network that function as a ‘bridge’ between other groups of elements. This results in dividing the elements of the network into closely connected groups. This article utilizes an updated version of the algorithm, written by Wakita and Tsurumi,22 using the application NodeXL.23
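The idea of dividing a network into closely connected groups can be illustrated with a short sketch. It is not the Wakita and Tsurumi implementation in NodeXL and not the edge-removal procedure of Newman and Girvan: it is a naive greedy modularity merge, written only for illustration. It treats the links as undirected and uses standard rather than two-mode modularity, so its output may differ from the groups reported in this article. The function name and the sample links are assumptions.

```python
def greedy_modularity(edges):
    """Naive agglomerative clustering (illustrative assumption, not NodeXL):
    repeatedly merge the two groups whose merger raises modularity most,
    and stop when no merger raises it. Edges must be unique pairs."""
    m = len(edges)
    nodes = sorted({n for edge in edges for n in edge})
    groups = {n: {n} for n in nodes}      # group id -> member nodes
    degree = {n: 0 for n in nodes}        # group id -> summed degree
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    while True:
        owner = {n: g for g, members in groups.items() for n in members}
        # Count the links that run between each pair of distinct groups.
        between = {}
        for a, b in edges:
            ga, gb = owner[a], owner[b]
            if ga != gb:
                key = (min(ga, gb), max(ga, gb))
                between[key] = between.get(key, 0) + 1
        best_gain, best_pair = 0.0, None
        for (ga, gb), e_ab in sorted(between.items()):
            # Modularity change when groups ga and gb are merged.
            gain = e_ab / m - degree[ga] * degree[gb] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (ga, gb)
        if best_pair is None:
            break
        ga, gb = best_pair
        groups[ga] |= groups.pop(gb)
        degree[ga] += degree.pop(gb)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


# Hypothetical country-to-classification links, not taken from the dataset.
sample_edges = [
    ("Kenya", "JNB"), ("Kenya", "KJ"), ("Uganda", "JNB"), ("Uganda", "KJ"),
    ("Fiji", "1M"), ("Samoa", "1M"), ("Fiji", "JB"), ("Samoa", "JB"),
]
clusters = greedy_modularity(sample_edges)
```

Country names are spelled out here because two-letter country codes can coincide with Thema codes: ‘JP’, for instance, is both the ISO code for Japan and the stem of ‘JP: Politics and government’.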

The data selection process was guided by multiple decisions. The first question was: which countries should be included in the study? To accurately assess global usage, it was crucial to analyse a substantial group of countries. However, both the quantity of downloaded books and the total download figures varied significantly among nations, with usage patterns differing considerably from one country to another. Focusing on the 100 countries with the highest download volumes addressed this issue. The following decision involved determining the number of books to examine. Ultimately, the ten most popular titles for each selected country were analysed.
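The two selection steps can be sketched as follows. This is a minimal illustration under assumed inputs: the record layout, the function name `select_top` and the sample figures are hypothetical and do not reflect the actual IRUS-UK export.

```python
from collections import Counter, defaultdict

# Hypothetical download records: (country, book id, downloads).
records = [
    ("NL", "book-a", 120),
    ("NL", "book-b", 80),
    ("NL", "book-c", 15),
    ("KE", "book-a", 40),
    ("KE", "book-d", 95),
    ("NZ", "book-e", 5),
]


def select_top(records, n_countries, n_books):
    """Keep the n_countries with the highest download totals, then the
    n_books most downloaded titles within each of those countries."""
    totals = Counter()
    per_country = defaultdict(Counter)
    for country, book, downloads in records:
        totals[country] += downloads
        per_country[country][book] += downloads
    top_countries = [c for c, _ in totals.most_common(n_countries)]
    return {
        c: [b for b, _ in per_country[c].most_common(n_books)]
        for c in top_countries
    }


# The article uses 100 countries and ten titles; smaller values fit the sample.
top = select_top(records, n_countries=2, n_books=2)
```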

The procedure starts with ten books per country, each linked to a varying number of classification codes. This combination of classification codes and countries forms the basis of the analysis. The analysis is based on unique values, so repeating combinations of classification and country are reduced to one. The clustering algorithm then creates groups of countries and codes. In Figure 1, each relation between a country and a classification is shown as a line. Applying the clustering algorithm results in 21 groups of countries and classifications. The groups are numbered by their total number of elements: group 1 contains the largest number of countries and classification codes, and group 21 the smallest.
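As a rough sketch of these two bookkeeping steps, the snippet below reduces repeated country and classification combinations to one and numbers groups by size. All identifiers and sample values are hypothetical.

```python
# Hypothetical inputs: top titles per country and Thema codes per title.
top_titles = {"KE": ["book-a", "book-d"], "ZA": ["book-a", "book-f"]}
book_codes = {
    "book-a": ["JNB", "KJ"],
    "book-d": ["JNB"],
    "book-f": ["NHW", "KJ"],
}


def unique_edges(top_titles, book_codes):
    """Return each (country, classification) combination exactly once."""
    return sorted({
        (country, code)
        for country, books in top_titles.items()
        for book in books
        for code in book_codes.get(book, [])
    })


def number_groups(groups):
    """Number groups from 1 (most elements) upwards."""
    ordered = sorted(groups, key=len, reverse=True)
    return {number: group for number, group in enumerate(ordered, start=1)}


edges = unique_edges(top_titles, book_codes)
numbered = number_groups([{"NZ", "1M"}, {"KE", "ZA", "JNB", "KJ"}])
```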

Figure 1 

Clustered countries and subjects

Analysing the clusters

To explore possible relations between communities and subjects, it is important to establish clear guidelines to ensure that each group is analysed based on the same principles. The analysis is focused on regions. For this investigation, a region is defined as a group of countries that share borders. Within a group, the majority of countries – more than half – should be part of the same region. All groups that contain just one country will not be considered.
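The selection rule can be expressed as a small check. The sketch assumes a ready-made lookup from country to region, which simplifies the shared-border definition used above; the lookup table, the region labels and the function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical lookup; in the article a region is a set of bordering countries.
REGION = {
    "Germany": "Central Europe",
    "Austria": "Central Europe",
    "Switzerland": "Central Europe",
    "Brazil": "South America",
}


def dominant_region(countries, region_of):
    """Return the region that holds more than half of the group's countries,
    or None for single-country groups and groups without such a majority."""
    if len(countries) < 2:
        return None
    region, count = Counter(region_of[c] for c in countries).most_common(1)[0]
    return region if count > len(countries) / 2 else None


majority = dominant_region(["Germany", "Austria", "Switzerland", "Brazil"], REGION)
tie = dominant_region(["Germany", "Brazil"], REGION)
single = dominant_region(["Germany"], REGION)
```

In this sample, three of the four countries share a region, so the first group qualifies; the evenly split pair and the single-country group do not.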

When a group contains several countries and the majority of those countries share borders, the next step is to look at the classifications linked to that group. This will give an indication of a shared interest in one or more subjects.

The classifications in the dataset are all unique, but due to their hierarchical nature, it is possible to group them based on the stem. For example, the first group contains ‘JNB: History of education’, ‘JNDG: Curriculum planning and development’, ‘JNRV: Industrial or vocational training’. All these can be linked to ‘JN: Education’, and, on top of that, can be grouped in the overarching classification ‘J: Society and Social Sciences’. Creating larger sets of classifications in this way helps to visualize the overarching theme. What will also be apparent is that certain classification stems are visible in several groups. For instance, there are 12 instances of the classification ‘J: Society and Social Sciences’ in Group 1, but also 3 instances in Group 3. To put this in perspective, the total dataset contains 86 unique classifications that all stem from ‘J: Society and Social Sciences’.

Furthermore, the focus lies on the largest sets of classifications per group. For instance, Group 1 contains 12 classification codes that belong to ‘J: Society and Social Sciences’, and on the other end of the scale there is one code that is part of ‘P: Mathematics and Science’. The next step is to look within the classifications stemming from the same main category. As mentioned before, the classifications are hierarchical. This can be used to find subgroups. In the case of Group 1, nine of the 12 ‘J classifications’ stem from ‘JN: Education’. This procedure has been used in all groups in order to find clusters of classifications, highlighting specific areas of interest. Of course, when the classification codes are very diverse, only the highest classification is shown, see for instance ‘T: Technology, Engineering, Agriculture, Industrial processes’ in Group 2.
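Because each Thema code begins with the code of its parent, grouping by stem amounts to counting prefixes. The sketch below shows the idea; the list of codes is made up for illustration (only ‘JNB’, ‘JNDG’ and ‘JNRV’ are quoted from Group 1 above) and the function name is an assumption.

```python
from collections import Counter

# Illustrative codes for one group; not the actual contents of any group.
group_codes = ["JNB", "JNDG", "JNRV", "JBCC", "KJ", "PB"]


def stem_counts(codes, length):
    """Count codes by their first `length` characters (the stem)."""
    return Counter(code[:length] for code in codes)


# Step 1: find the largest top-level category in the group.
top_level = stem_counts(group_codes, 1)
largest_stem = top_level.most_common(1)[0][0]

# Step 2: look for subgroups within that category.
subgroups = stem_counts(
    [code for code in group_codes if code.startswith(largest_stem)], 2
)
```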

The next sections briefly describe the groups. As this is a first exploration, there will not be an extensive analysis.

Language

In the literature, bibliodiversity is strongly connected to language. Yet, in this analysis language does not play an important role. Firstly, the focus lies on the subjects of the books, and these subjects are represented by a code that is independent of language. The Thema classification aims to be international in scope and should therefore be language independent. Secondly, this is a first attempt to use social network analysis to determine if regions share an interest in the same subjects. Consequently, the process itself is still in development, and adding language as a further aspect is better left to future research.

Results

This section lists all the groups of countries and classifications that fall within the parameters set in the Methodology section. For each group, the region and the main subjects will be described briefly. Each group description also contains a graphic with the total number of classifications. The corresponding description for the classification stems is listed in Table 1.

Table 1

Classification stems and descriptions

Stem: Classification

1: Place qualifiers
2: Language qualifiers
3: Time period qualifiers
5: Interest qualifiers
6: Style qualifiers
A: The Arts
C: Language and Linguistics
D: Biography, Literature and Literary studies
G: Reference, Information and Interdisciplinary subjects
J: Society and Social Sciences
K: Economics, Finance, Business and Management
L: Law
M: Medicine and Nursing
N: History and Archaeology
P: Mathematics and Science
Q: Philosophy and Religion
R: Earth Sciences, Geography, Environment, Planning
T: Technology, Engineering, Agriculture, Industrial processes
U: Computing and Information Technology
W: Lifestyle, Hobbies and Leisure

Group 1

The majority of countries in Group 1 are in Southern and Eastern Africa – see Figure 2.

Figure 2 

Clusters and countries of Group 1

The most frequently occurring classifications – see Figure 3 – stem from ‘JN: Education’ or ‘K: Economics, Finance, Business and Management’. The third most common classifications are part of ‘NH: History’.

Figure 3 

Classifications of Group 1

Group 3

The majority of countries in Group 3 are located in Eastern Asia – see Figure 4.

Figure 4 

Clusters and countries of Group 3

Here the most common classifications stem from ‘P: Mathematics and Science’ and ‘T: Technology, Engineering, Agriculture, Industrial processes’ – see Figure 5.

Figure 5 

Classifications of Group 3

Group 4

The majority of countries in Group 4 are close to or neighbouring Russia – see Figure 6.

Figure 6 

Clusters and countries of Group 4

As shown in Figure 7, most classifications stem from ‘J: Society and Social Sciences’ and ‘N: History and Archaeology’. Within the latter classification cluster, half are based on ‘NHW: Military history’.

Figure 7 

Classifications of Group 4

Group 5

In this group, three of the four countries are the so-called DACH Länder: Germany, Austria, Switzerland – see Figure 8.

Figure 8 

Clusters and countries of Group 5

The most common classifications in Figure 9 stem from ‘CF: Linguistics’. The second largest groups are connected to ‘KJ: Business and Management’ and ‘M: Medicine and Nursing’.

Figure 9 

Classifications of Group 5

Group 8

This group consists of Southern and Eastern Asian countries – see Figure 10.

Figure 10 

Clusters and countries of Group 8

Most classifications in this group stem from ‘R: Earth Sciences, Geography, Environment, Planning’, and can be divided into ‘RG: Geography’ and ‘RN: Environment’. What is not immediately clear from Figure 11 is that the GT classification is divided between ‘GTM: Regional/International studies’ and ‘GTQ: Globalization’.

Figure 11 

Classifications of Group 8

Group 11

The three countries in Group 11 are all based in Oceania – see Figure 12.

Figure 12 

Clusters and countries of Group 11

Apart from classifications stemming from ‘JB: Society and culture: general’ and ‘JP: Politics and government’, this group contains a relatively large number of classifications based on a region: ‘1M: Australasia, Oceania, Pacific Islands, Atlantic Islands’ – see Figure 13.

Figure 13 

Classifications of Group 11

Group 12

In this group, most of the countries are in the Middle East – see Figure 14.

Figure 14 

Clusters and countries of Group 12

The most used classifications in Figure 15 stem from ‘JB: Society and culture: general’, ‘M: Medicine and Nursing’ and ‘T: Technology, Engineering, Agriculture, Industrial processes’.

Figure 15 

Classifications of Group 12

Group 15

The final group consists of the United Kingdom and Ireland – see Figure 16.

Figure 16 

Clusters and countries of Group 15

The most common classifications stem from ‘JB: Society and culture: general’; the second most common stem is ‘1D: Europe’ – see Figure 17.

Figure 17 

Classifications of Group 15

Discussion

The goal of this article is to explore a different aspect of bibliodiversity: the consumption of knowledge in different regions. While it makes sense to assume that if different regions publish research in distinctive ways this would also lead to differences in the consumption of knowledge, there is not much evidence to support this idea. A possible way to address this gap is to look at the preferences of a global audience. In this study, the most popular titles of the OAPEN Library among readers in 100 countries were examined. A large collection of freely accessible books allows readers to select titles on many subjects, and these subjects have been described using a classification that aims to perform well in a global environment. Because the classification is independent of language, the different languages of the books should not interfere with the results.

Instead of focusing on books, the level of analysis is the classification code. Bibliodiversity is the notion that different regions have different publication cultures. This has mostly been analysed along the axis of language. Bibliodiversity is not just about languages but also about localized concerns: what subject plays an important role in a particular region?

Social network analysis tools were used to make sense of the data. This type of analysis requires a network that links the items under investigation. Here, the country from which the readers of the OAPEN Library have downloaded books is one such item, and the other is the classification of the books. The clustering algorithm helped to minimize the number of links between the countries and the unique classifications, leading to distinct groups. The next step was to select those groups that contained several neighbouring countries; the underlying assumption is that these countries share common traits and thus will share the same interests.

The results are encouraging: it makes sense that countries close to Russia have an interest in war and defence, while countries in Oceania share an interest in their continent. Also, the hierarchical nature of the Thema classification makes it possible to look at overarching concepts. For instance, the classifications in Group 3 stemming from ‘T: Technology, Engineering, Agriculture, Industrial processes’ encompass, among other subjects, biochemical engineering, electronics, traffic and environmental engineering.

This article aims to show that this kind of research is suitable to broadly explore the notion of shared interest in global regions. However, the limitations of the dataset must be considered. The publishers have provided each book with one or more classification codes, but this has not always been done with the same level of granularity: some books are described with one code, while others have five. Additionally, the classification can be as broad as ‘J: Society and Social Sciences’ or as narrow as ‘JHMC: Social and cultural anthropology’.

The goal of this article is not to exactly pinpoint whether a specific subject is more ‘popular’ in East Asia compared with West Africa. Instead, it aims to investigate a rather unexplored aspect of bibliodiversity: regional differences in knowledge consumption. Hopefully, more research will follow.

Data accessibility statement

The full dataset can be downloaded from https://doi.org/10.5281/zenodo.12680030