
The electronic theses and dissertations topic modeling project was conducted by the Chinese University of Hong Kong Library.


michaelkinfu/etd-topic-modeling


etd-topic-modeling

The electronic theses and dissertations (ETD) topic modeling project was conducted by the Chinese University of Hong Kong Library (CUHK Library). This digital scholarship project is strictly intended for nonprofit and academic purposes.

Research Cycle

  • Data Collection: The records had been cataloged in previous years. Our team extracted them from Alma using simple queries.
  • Data Processing: Using the thesis titles and the subject headings assigned by catalogers, our team selected the titles related to Hong Kong.
  • Topic Modeling: Based on sentence embeddings, discover similar titles.
  • Clustering: Divide all the titles into five different clusters using the K-means algorithm.
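
The processing and clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the helper names `is_hk_related` and `kmeans`, the sample records, and the two-dimensional vectors are all hypothetical. In practice the vectors would come from a sentence-embedding model, and a library implementation of k-means would likely replace the pure Python version shown here.

```python
import math
import random


def is_hk_related(title, subjects):
    """Keep a record if 'Hong Kong' appears in its title or any subject heading."""
    text = " ".join([title, *subjects]).lower()
    return "hong kong" in text


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up records: (title, subject headings, stand-in embedding vector).
records = [
    ("Housing policy in Hong Kong", ["Housing"], [0.9, 0.1]),
    ("Retail marketing strategies", ["Marketing -- China -- Hong Kong"], [0.1, 0.9]),
    ("Land use in the New Territories", ["Land use -- China -- Hong Kong"], [0.8, 0.2]),
    ("Protein folding dynamics", ["Proteins"], [0.5, 0.5]),
    ("Brand loyalty among Hong Kong consumers", ["Consumer behavior"], [0.2, 0.8]),
]

hk_records = [r for r in records if is_hk_related(r[0], r[1])]
labels = kmeans([r[2] for r in hk_records], k=2)
for (title, _, _), label in zip(hk_records, labels):
    print(label, title)
```

The project used five clusters. The toy example uses two only because it has four Hong Kong-related records.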

Observation

Trends in Hong Kong-related Theses

Based on the extracted data, our team found a total of 20,423 theses in the ETD collection. Of these, 3,878 were related to Hong Kong studies, which means that approximately 19% of all postgraduate theses are associated with the theme of Hong Kong. Based on the dataset, our team made the following findings:

  • The total number of theses showed an increasing trend over time. (Figure 2)
  • The proportion of Hong Kong-related theses among all theses decreased over time. (Figure 1)
  • A change point was identified in the year 1995. (Figure 2)
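
The overall share follows directly from the two counts reported above, and a per-year proportion like the one plotted in Figure 1 can be computed the same way. The sketch below is illustrative only: `yearly_share` is a hypothetical helper, and the per-year counts are made-up placeholders, not values from the dataset.

```python
total_theses = 20_423
hk_theses = 3_878

# Overall share of Hong Kong-related theses in the ETD collection.
share = hk_theses / total_theses
print(f"{share:.1%}")  # 19.0%


def yearly_share(total_by_year, hk_by_year):
    """Return {year: Hong Kong-related share}, skipping years with no theses."""
    return {
        year: hk_by_year.get(year, 0) / count
        for year, count in sorted(total_by_year.items())
        if count
    }


# Placeholder counts for illustration only.
example = yearly_share({1990: 10, 2000: 40}, {1990: 5, 2000: 8})
print(example)
```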

In conclusion, the gap between the number of Hong Kong-related theses and the total number of theses widened. This indicates a divergence in research topics, with less emphasis on Hong Kong-related subjects after 1995. Our team hypothesizes that this may be related to two factors:

  • There has been a significant increase in the number of research theses across various disciplines in recent years.
  • The CUHK Business School has shown less interest in Hong Kong-related fields.

Figure 1 - Proportion of Hong Kong-Related Theses


Figure 2 - Comparison of the Trends


Five Clusters

Using k-means clustering, the 3,878 Hong Kong-related theses were grouped into five distinct clusters (Figure 3 - HTML):

  • Marketing and Business: 1,115 titles
  • Cultural and Political: 792 titles
  • Urbanization and Land Use: 788 titles
  • Population and Chinese: 656 titles
  • School and Education: 527 titles
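
As a quick consistency check, the five cluster sizes listed above add up to the 3,878 Hong Kong-related theses. The short sketch below reproduces that sum and derives each cluster's share. The variable names are illustrative.

```python
cluster_sizes = {
    "Marketing and Business": 1115,
    "Cultural and Political": 792,
    "Urbanization and Land use": 788,
    "Population and Chinese": 656,
    "School and Education": 527,
}

total = sum(cluster_sizes.values())
print(total)  # 3878

# Share of each cluster among the clustered titles, largest first.
shares = {name: size / total for name, size in cluster_sizes.items()}
for name, fraction in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {fraction:.1%}")
```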

Figure 3 - Five Clusters of ETD Collection


Acknowledgement

CUHK Digital Repository (ETD Collection): https://repository.lib.cuhk.edu.hk/en/collection/etd

BERTopic website: https://maartengr.github.io/BERTopic/index.html
