publications | Center for Science of Science

2024

Do more heads imply better performance? An empirical study of team thought leaders’ impact on scientific team performance

Yi Zhao, Yuzhuo Wang, Heng Zhang, Donghun Kim, Chao Lu, Yongjun Zhu, and Chengzhi Zhang

Information Processing & Management, Apr 2024

Thought leadership plays a crucial role in boosting team performance; thus, teams with more thought leaders may perform better. However, the impact of the number of thought leaders on team performance in a scientific context remains understudied. In this study, we consider the authors of a publication as a scientific team and define the authors responsible for conceptual tasks, i.e., “conceived and designed the experiments” (one of the tasks described in the PLOS contribution statements classification system), as thought leaders. Leveraging more than 140,000 papers from PLOS journals, we examine the relationship between the number of thought leaders and two aspects of team performance (i.e., team impact and team disruptiveness) from both correlational and causal perspectives. The results showed that (1) an inverted U-shaped relationship exists between the number of thought leaders and the team’s impact, and (2) teams with more thought leaders tend to produce less disruptive ideas. We also explored the impact of international collaboration, team size, and gender diversity together with the number of thought leaders on team performance and found that (3) international collaboration improves team impact but lowers the disruptiveness of team outputs. This study advances scholarly understanding of thought leadership in scientific teams and provides valuable insights for policymakers and team managers.
Rhetorical structure parallels research topic in LIS articles: a temporal bibliometrics examination

Wen Lou, Jiangen He, Qianqian Xu, Zhijie Zhu, Qiwen Lu, and Yongjun Zhu

Library Hi Tech, Apr 2024

Purpose: The effectiveness of rhetorical structure is essential to communicating key messages in research articles (RAs). The interdisciplinary nature of library and information science (LIS) has led to unclear patterns and practice of using rhetorical structures. Understanding how RAs are constructed in LIS to facilitate effective scholarly communication is important. Numerous studies have investigated the rhetorical structure of RAs in a range of disciplines, but LIS articles have not been well studied. Design/methodology/approach: In this study, the authors encoded the rhetorical structures of 2,216 articles in the Journal of the Association for Information Science and Technology covering a period from 2001 to 2018 with the approaches of co-word analysis and visualization. The results show that the predominant rhetorical structures used by LIS researchers follow the sequence of Introduction-Literature Review-Methodology-Result-Discussion-Conclusion (ILMRDC). Findings: The authors’ temporal examination reveals shifts in the evolutionary pattern of rhetorical structure in 2008 and 2014. More importantly, the authors’ study demonstrates that rhetorical structures have varied greatly across research areas in the LIS community. For example, scholarly communication and scientometrics studies tend to exclude literature review in articles. Originality/value: The present paper offers a first systematic examination of how rhetorical structures are used in a representative sample of an LIS journal, especially from a temporal perspective.
Dependency, reciprocity, and informal mentorship in predicting long-term research collaboration: A co-authorship matrix-based multivariate time series analysis

Yongjun Zhu, Donghun Kim, Ting Jiang, Yi Zhao, Jiangen He, Xinyi Chen, and Wen Lou

Journal of Informetrics, Feb 2024

In this study, we examine the roles of dependency, reciprocity, and informal mentorship in the prediction of long-term research collaboration in five disciplines. We use co-authorship matrix-based multivariate time series features and interpretable machine learning to train long-term collaboration prediction models and interpret the feature importance of trained models. Overall, long-term research collaboration that is defined using various standards was rare across the examined disciplines, and the prediction results were moderate to good. We found dependency, reciprocity, and informal mentorship to have different roles in different disciplines. Among the three, informal mentorship was important in predicting long-term research collaboration in Agriculture, Geology, and Library and Information Science. Reciprocity, which measures the interdependence between two researchers, was important to prediction in the fields of Agriculture and Geology. Finally, dependency was important in all the disciplines with varying degrees of importance.

2023

Support behind the scenes: the relationship between acknowledgement, coauthor, and citation in Nobel articles

Wen Lou, Jiangen He, Lingxin Zhang, Zhijie Zhu, and Yongjun Zhu

Scientometrics, Aug 2023

Acknowledging individuals in research articles is known to be a personal and private expression of appreciation compared to other types of acknowledgment, such as financial support. Early studies have demonstrated the significant relationship between acknowledgement, coauthor, and citation. However, little is known about the extent of these relationships or about which of them prompts the others. We adopt a series of multivariate analyses, Bayes’ theorem, statistical analysis, and “before and after” matched-group studies to illustrate the acknowledgement patterns in 6323 research articles of 196 Nobel Prize laureates (NPL) from 2008 to 2018. Acknowledgment is consistently shown to relate significantly to co-authorship and citation, where co-authorship and citing have an approximately 10% increasing effect on acknowledgement behavior. Our study is the first to state the order of this triangle: acknowledgement is significantly ahead of co-authorship and arguably occurs before citing behavior. Moreover, acknowledgement strengthens more than half of NPL on their co-authorship for 11% and citation for 72% after they acknowledge others. We verify the substantive possibility of co-authorship and citing behavior from acknowledgement and introduce a formation of a new norm of scholarly communication. This will greatly contribute to the matter of evaluation metrics and social network detection.
An Exploratory Study of Medical Journal’s Twitter Use: Metadata, Networks, and Content Analyses

Donghun Kim, Woojin Jung, Ting Jiang, and Yongjun Zhu

Journal of Medical Internet Research, Jan 2023

Background: An increasing number of medical journals are using social media to promote themselves and communicate with their readers. However, little is known about how medical journals use Twitter and what their social media management strategies are. Objective: This study aimed to understand how medical journals use Twitter from a global standpoint. We conducted a broad, in-depth analysis of all the available Twitter accounts of medical journals indexed by major indexing services, with a particular focus on their social networks and content. Methods: The Twitter profiles and metadata of medical journals were analyzed along with the social networks on their Twitter accounts. Results: The results showed that overall, publishers used different strategies regarding Twitter adoption, Twitter use patterns, and their subsequent decisions. The following specific findings were noted: journals with Twitter accounts had a significantly higher number of publications and a greater impact than their counterparts; subscription journals had a slightly higher Twitter adoption rate (2%) than open access journals; journals with higher impact had more followers; and prestigious journals rarely followed other lesser-known journals on social media. In addition, an in-depth analysis of 2000 randomly selected tweets from 4 prestigious journals revealed that The Lancet had dedicated considerable effort to communicating with people about health information and fulfilling its social responsibility by organizing committees and activities to engage with a broad range of health-related issues; The New England Journal of Medicine and the Journal of the American Medical Association focused on promoting research articles and attempting to maximize the visibility of their research articles; and the British Medical Journal provided copious amounts of health information and discussed various health-related social problems to increase social awareness of the field of medicine. Conclusions: Our study used various perspectives to investigate how medical journals use Twitter and explored the Twitter management strategies of 4 of the most prestigious journals. Our study provides a detailed understanding of medical journals’ use of Twitter from various perspectives and can help publishers, journals, and researchers to better use Twitter for their respective purposes.
Predicting coauthorship using bibliographic network embedding

Yongjun Zhu, Lihong Quan, Pei-Ying Chen, Meen Chul Kim, and Chao Che

Journal of the Association for Information Science & Technology, Apr 2023

Coauthorship prediction applies predictive analytics to bibliographic data to predict authors who are highly likely to be coauthors. In this study, we propose an approach for coauthorship prediction based on bibliographic network embedding through a graph-based bibliographic data model that can be used to model common bibliographic data, including papers, terms, sources, authors, departments, research interests, universities, and countries. A real-world dataset released by AMiner that includes more than 2 million papers, 8 million citations, and 1.7 million authors was integrated into a large bibliographic network using the proposed bibliographic data model. Translation-based methods were applied to the entities and relationships to generate their low-dimensional embeddings while preserving their connectivity information in the original bibliographic network. We applied machine learning algorithms to embeddings that represent the coauthorship relationships of the two authors and achieved high prediction results. The reference model, which is the combination of a network embedding size of 100, the most basic translation-based method, and a gradient boosting method, achieved an F1 score of 0.9, and even higher scores are obtainable with different embedding sizes and more advanced embedding methods. Thus, the strengths of the proposed approach lie in its customizable components under a unified framework.
Structured abstract summarization of scientific articles: Summarization using full-text section information

Hanseok Oh, Seojin Nam, and Yongjun Zhu

Journal of the Association for Information Science & Technology, Feb 2023

The automatic summarization of scientific articles differs from other text genres because of the structured format and longer text length. Previous approaches have focused on tackling the lengthy nature of scientific articles, aiming to improve the computational efficiency of summarizing long text using a flat, unstructured abstract. However, the structured format of scientific articles and characteristics of each section have not been fully explored, despite their importance. The lack of a sufficient investigation and discussion of various characteristics for each section and their influence on summarization results has hindered the practical use of automatic summarization for scientific articles. To provide a balanced abstract proportionally emphasizing each section of a scientific article, the community introduced the structured abstract, an abstract with distinct, labeled sections. Using this information, in this study, we aim to understand tasks ranging from data preparation to model evaluation from diverse viewpoints. Specifically, we provide a preprocessed large-scale dataset and propose a summarization method applying the introduction, methods, results, and discussion (IMRaD) format reflecting the characteristics of each section. We also discuss the objective benchmarks and perspectives of state-of-the-art algorithms and present the challenges and research directions in this area.

2022

Understanding the Research Landscape of Deep Learning in Biomedical Science: Scientometric Analysis

Seojin Nam, Donghun Kim, Woojin Jung, and Yongjun Zhu

Journal of Medical Internet Research, Apr 2022

Background: Advances in biomedical research using deep learning techniques have generated a large volume of related literature. However, there is a lack of scientometric studies that provide a bird’s-eye view of them. This absence has led to a partial and fragmented understanding of the field and its progress. Objective: This study aimed to gain a quantitative and qualitative understanding of the scientific domain by analyzing diverse bibliographic entities that represent the research landscape from multiple perspectives and levels of granularity. Methods: We searched and retrieved 978 deep learning studies in biomedicine from the PubMed database. A scientometric analysis was performed by analyzing the metadata, content of influential works, and cited references. Results: In the process, we identified the current leading fields, major research topics and techniques, knowledge diffusion, and research collaboration. There was a predominant focus on applying deep learning, especially convolutional neural networks, to radiology and medical imaging, whereas a few studies focused on protein or genome analysis. Radiology and medical imaging also appeared to be the most significant knowledge sources and an important field in knowledge diffusion, followed by computer science and electrical engineering. A coauthorship analysis revealed various collaborations among engineering-oriented and biomedicine-oriented clusters of disciplines. Conclusions: This study investigated the landscape of deep learning research in biomedicine and confirmed its interdisciplinary nature. Although it has been successful, we believe that there is a need for diverse applications in certain areas to further boost the contributions of deep learning in addressing biomedical research problems. We expect the results of this study to help researchers and communities better align their present and future work.

2021

Gender imbalance in the productivity of funded projects: A study of the outputs of National Institutes of Health R01 grants

Chaojiang Wu, Erjia Yan, Yongjun Zhu, and Kai Li

Journal of the Association for Information Science & Technology, Nov 2021

This study examines the relationship between a team’s gender composition and the outputs of funded projects using a large data set of National Institutes of Health (NIH) R01 grants and their associated publications between 1990 and 2017. This study finds that while the women investigators’ presence in NIH grants is generally low, higher women investigator presence is on average related to a slightly lower number of publications. This study finds empirically that women investigators elect to work in fields in which fewer publications per million-dollar funding is the norm. For fields where women investigators are relatively well represented, they are as productive as men. The overall lower productivity of women investigators may be attributed to the low representation of women in high productivity fields dominated by men investigators. The findings shed light on possible reasons for gender disparity in grant productivity.
Mapping scientific profile and knowledge diffusion of Library Hi Tech

Meen Chul Kim, Yuanyuan Feng, and Yongjun Zhu

Library Hi Tech, Jun 2021

Purpose: Library Hi Tech is one of the most influential journals that publish leading research in library and information science (LIS). The present study aims to understand the scholarly communication in Library Hi Tech by profiling its historic footprint, emerging trends and knowledge diffusion. Design/methodology/approach: A total of 3,131 bibliographic records between 1995 and 2018 were collected from the Web of Science. Text mining, graph analysis and data visualization were used to analyze subject category assignment, domain-level citation trends, co-occurrence of keywords, keyword bursts, networks of document co-citation and landmark articles. Findings: Findings indicated that published research in the journal was largely influenced by the psychology, education and social domain as a unidisciplinary discipline. Knowledge of the journal has been disseminated into multiple domains such as LIS, computer science and education. Dominant thematic concentrations were also identified: (1) library services in academic libraries and related to digital libraries, (2) adoption of new information technologies and (3) information-seeking behavior in these contexts. Additionally, the journal has exhibited an increased research emphasis on mixed-method user-centered studies and investigations into libraries’ use of new media. Originality/value: This study provides a promising approach to understand scientific trends and the intellectual growth of journals. It also helps Library Hi Tech to become more self-explanatory with a detailed bibliometric profile and to identify future directions in editorship and readership. Finally, researchers in the community can better position their studies within the emerging trends and current challenges of the journal.
Analyzing China’s research collaboration with the United States in high-impact and high-technology research

Yongjun Zhu, Donghun Kim, Erjia Yan, Meen Chul Kim, and Guanqiu Qi

Quantitative Science Studies, Apr 2021

This study investigates China’s international research collaboration with the United States through a bibliometric analysis of coauthorship over time using historical research publication data. We investigate from three perspectives: overall, high-impact, and high-technology research collaborations using data from Web of Science (WoS), Nature Index, and Technology Alert List maintained by the U.S. Department of State. The results show that the United States is China’s largest research collaborator and that in all three aspects, China and the United States are each other’s primary collaborators much of the time. From China’s perspective, we have found weakening collaboration with the United States over the past 2 years. In terms of high-impact research collaboration, China has historically shared a higher percentage of its research with the United States than vice versa. In terms of high-technology research, the situation is reversed, with the United States sharing more. The percentage of the United States’ high-technology research shared with China has been continuously increasing over the past 10 years, while in China the percentage has been relatively stable.

2020

Analyzing academic mobility of U.S. professors based on ORCID data and the Carnegie Classification

Erjia Yan, Yongjun Zhu, and Jiangen He

Quantitative Science Studies, Dec 2020

This paper uses two open science data sources—ORCID and the Carnegie Classification of Institutions of Higher Education (CCIHE)—to identify tenure-track and tenured professors in the United States who have changed academic affiliations. Through a series of data cleaning and processing actions, 5,938 professors met the selection criteria of professorship and mobility. Using ORCID professor profiles and the Carnegie Classification, this paper reveals patterns of academic mobility in the United States from the aspects of institution types, locations, regions, funding mechanisms of institutions, and professors’ genders. We find that professors tended to move to institutions with higher research intensity, such as those with an R1 or R2 designation in the Carnegie Classification. They also tend to move from rural institutions to urban institutions. Additionally, this paper finds that female professors are more likely to move within the same geographic region than male professors and that when they move from a less research-intensive institution to a more research-intensive one, female professors are less likely to retain their rank or attain promotion.
Mapping scientific landscapes in UMLS research: a scientometric review

Meen Chul Kim, Seojin Nam, Fei Wang, and Yongjun Zhu

Journal of the American Medical Informatics Association, Oct 2020

Objective: The Unified Medical Language System (UMLS) is 1 of the most successful, collaborative efforts of terminology resource development in biomedicine. The present study aims to 1) survey historical footprints, emerging technologies, and the existing challenges in the use of UMLS resources and tools, and 2) present potential future directions. Materials and Methods: We collected 10 469 bibliographic records published between 1986 and 2019, using a Web of Science database. We used graph analysis, data visualization, and text mining to analyze domain-level citations, subject categories, keyword co-occurrence and bursts, document co-citation networks, and landmark papers. Results: The findings show that the development of UMLS resources and tools has been led by interdisciplinary collaboration among medicine, biology, and computer science. Efforts encompassing multiple disciplines, such as medical informatics, biochemical sciences, and genetics, were the driving forces behind the domain’s growth. The following topics were found to be the dominant research themes from the early phases to mid-phases: 1) development and extension of ontologies and 2) enhancing the integrity and accessibility of these resources. Knowledge discovery using machine learning and natural language processing and applications in broader contexts such as drug safety surveillance have recently been receiving increasing attention. Discussion: Our analysis confirms that while reaching its scientific maturity, UMLS research aims to boundary-span to more variety in the biomedical context. We also made some recommendations for editorship and authorship in the domain. Conclusion: The present study provides a systematic approach to map the intellectual growth of science, as well as a self-explanatory bibliometric profile of the published UMLS literature. It also suggests potential future directions. Using the findings of this study, the scientific community can better align the studies within the emerging agenda and current challenges.
Nine million book items and eleven million citations: a study of book-based scholarly communication using OpenCitations

Yongjun Zhu, Erjia Yan, Silvio Peroni, and Chao Che

Scientometrics, Feb 2020

Books have been widely used to share information and contribute to human knowledge. However, the quantitative use of books as a method of scholarly communication is relatively unexamined compared to journal articles and conference papers. This study uses the COCI dataset (a comprehensive open citation dataset provided by OpenCitations) to explore books’ roles in scholarly communication. The COCI data we analyzed includes 445,826,118 citations from 46,534,705 bibliographic entities. By analyzing such a large amount of data, we provide a thorough, multifaceted understanding of books. Among the investigated factors are (1) temporal changes to book citations; (2) book citation distributions; (3) years to citation peak; (4) citation half-life; and (5) characteristics of the most-cited books. Results show that books have received less than 4% of total citations, and have been cited mainly by journal articles. Moreover, 97.96% of books have been cited fewer than ten times. Books take longer than other bibliographic materials to reach peak citation levels, yet are cited for the same duration as journal articles. Most-cited books tend to cover general (yet essential) topics, theories, and technological concepts in mathematics and statistics.

2018

Joint modeling of the association between NIH funding and its three primary outcomes: patents, publications, and citation impact

Fengqing Zhang, Erjia Yan, Xin Niu, and Yongjun Zhu

Scientometrics, Jul 2018

This paper examines the impact of NIH funding on research outcomes using data from 108,803 projects funded by NIH between January 2009 and March 2017. We extend the prior knowledge on this topic by incorporating the correlation structure of multiple research outcomes, as well as a comprehensive list of grant-level features capturing information on funding size, gender composition and funding type. Specifically, we utilize partial least squares regression (PLS) to jointly model all three primary outcomes (publications, patents and citation impact) and identify the effects of grant-level features on research outputs. Our results show that joint modeling of research outcomes via PLS yields a more accurate prediction than analyzing each outcome separately. Additionally, we find that when other grant-level features are held constant, a 2-year-longer project duration would produce a similar improvement in research outputs to that achieved by $1 million in additional funding. Based on this finding, we recommend no-cost extension of funded projects instead of increased funding support to achieve a comparable increase in research outputs. Promoting multi-organizational grants is found to be more effective for increasing patents, whereas encouraging multiple-PI grants is more productive in terms of publications and citation impact. Of the various NIH grant types, program project/center grants (P series) and research training grants (T series) are the two most productive and impactful. Results also suggest that projects with a higher proportion of male PIs tend to produce more research outputs. This finding, however, needs to be interpreted with caution due to the limitation of our data set.
Understanding the research landscape of major depressive disorder via literature mining: an entity-level analysis of PubMed data from 1948 to 2017

Yongjun Zhu, Min-Hyung Kim, Samprit Banerjee, Joseph Deferio, George S Alexopoulos, and Jyotishman Pathak

JAMIA Open, Apr 2018

Objective: To analyze literature-based data from PubMed to identify diseases and medications that have frequently been studied with major depressive disorder (MDD). Materials and methods: Abstracts of 23 799 research articles about MDD that were published from 1948 to 2017 were analyzed using data and text mining approaches. Methods such as information extraction, frequent pattern mining, regression, and burst detection were used to explore diseases and medications that have been associated with MDD. Results: In addition to many mental disorders and antidepressants, we identified several nonmental health diseases and nonpsychotropic medications that have frequently been studied with MDD. Our results suggest that: (1) MDD has been studied with disorders such as Pain, Diabetes Mellitus, Wounds and Injuries, Hypertension, and Cardiovascular Diseases; (2) medications such as Hydrocortisone, Dexamethasone, Ketamine, and Lithium have been studied in terms of their side effects and off-label uses; (3) the relationships between nonmental disorders and MDD have gained increased attention from the scientific community; and (4) the bursts of Diabetes Mellitus and Cardiovascular Diseases explain the psychiatric and/or depression screening recommended by authoritative associations during the periods of the bursts. Discussion and conclusion: This study summarized and presented an overview of the previous MDD research in terms of diseases and medications that are highly relevant to MDD. The reported results can potentially facilitate hypothesis generation for future studies. The approaches proposed in the study can be used to better understand the progress and advance of the field.
Tracking word semantic change in biomedical literature

Erjia Yan, and Yongjun Zhu

International Journal of Medical Informatics, Jan 2018

Up to this point, research on written scholarly communication has focused primarily on syntactic, rather than semantic, analyses. Consequently, we have yet to understand semantic change as it applies to disciplinary discourse. The objective of this study is to illustrate word semantic change in biomedical literature. To that end, we identify a set of representative words in biomedical literature based on word frequency and word-topic probability distributions. A word2vec language model is then applied to the identified words in order to measure word- and topic-level semantic changes. We find that for the selected words in PubMed, overall, meanings are becoming more stable in the 2000s than they were in the 1980s and 1990s. At the topic level, the global distance of most topics (19 out of 20 tested) is declining, suggesting that the words used to discuss these topics are stabilizing semantically. Similarly, the local distance of most topics (19 out of 20) is also declining, showing that the meanings of words from these topics are becoming more consistent with those of their semantic neighbors. At the word level, this paper identifies two different trends in word semantics, as measured by the aforementioned distance metrics: on the one hand, words can form clusters with their semantic neighbors, and these words, as a cluster, coevolve semantically; on the other hand, words can drift apart from their semantic neighbors while nonetheless stabilizing in the global context. In relating our work to language laws on semantic change, we find no overwhelming evidence to support either the law of parallel change or the law of conformity.

2017

A natural language interface to a graph-based bibliographic information retrieval system

Yongjun Zhu, Erjia Yan, and Il-Yeol Song

Data & Knowledge Engineering, Sep 2017

With the ever-increasing volume of scientific literature, there is a need for a natural language interface to bibliographic information retrieval systems to retrieve relevant information effectively. In this paper, we propose one such interface, NLI-GIBIR, which allows users to search for a variety of bibliographic data through natural language. NLI-GIBIR makes use of a novel framework applicable to graph-based bibliographic information retrieval systems in general. This framework incorporates algorithms/heuristics for interpreting and analyzing natural language bibliographic queries via a series of text- and linguistic-based techniques, including tokenization, named entity recognition, and syntactic analysis. We find that our framework, as implemented in NLI-GIBIR, can effectively represent and address complex bibliographic information needs. Thus, the contributions of this paper are as follows: First, to our knowledge, it is the first attempt to propose a natural language interface for graph-based bibliographic information retrieval. Second, we propose a novel customized natural language processing framework that integrates a few original algorithms/heuristics for interpreting and analyzing bibliographic queries. Third, we show that the proposed framework and natural language interface provide a practical solution for building real-world bibliographic information retrieval systems. Our experimental results show that the presented system can correctly answer 39 out of 40 example natural language queries with varying lengths and complexities.
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Yongjun Zhu, Erjia Yan, and Fei Wang

BMC Medical Informatics and Decision Making, Jul 2017

Background: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec’s ability in deriving semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec. Methods: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. Performance of models trained on different subsets is compared to examine recency, size, and section effects. Results: Models trained on recent datasets did not boost the performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (i.e., 0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (i.e., 344 vs. 498 in the similarity task and 339 vs. 503 in the relatedness task). Conclusions: Increasing the size of the dataset does not always enhance the performance. Increasing the size of datasets can result in the identification of more relations of biomedical terms even though it does not guarantee better precision. As summaries of research articles, compared with article bodies, abstracts excel in accuracy but lose in coverage of identifiable relations.
Examining academic ranking and inequality in library and information science through faculty hiring networks

Yongjun Zhu, and Erjia Yan

Journal of Informetrics, May 2017


In this study, we examine academic ranking and inequality in library and information science (LIS) using a faculty hiring network of 643 faculty members from 44 LIS schools in the United States. We employ four groups of measures to study academic ranking, including adjacency, placement and hiring, distance-based measures, and hubs and authorities. Among these measures, closeness and hub measures have the highest correlation with the U.S. News ranking (r = 0.78). We study academic inequality using four distinct methods that include downward/upward placement, Lorenz curve, cliques, and egocentric networks of LIS schools and find that academic inequality exists in the LIS community. We show that the percentage of downward placement (68%) is much higher than that of upward placement (22%); meanwhile, 20% of the 30 LIS schools that have doctoral programs produced nearly 60% of all LIS faculty, with a Gini coefficient of 0.53. We also find cliques of highly ranked schools and a core/periphery structure that distinguishes LIS schools of different ranks. Overall, LIS faculty hiring networks have considerable value in deriving credible academic ranking and revealing faculty exchange within the field.
Adding the dimension of knowledge trading to source impact assessment: Approaches, indicators, and implications

Erjia Yan, and Yongjun Zhu

Journal of the Association for Information Science & Technology, May 2017


The objective of this paper is to systematically assess sources’ (e.g., journals and proceedings) impact in knowledge trading. While there have been efforts at evaluating different aspects of journal impact, the dimension of knowledge trading is largely absent. To fill the gap, this study employed a set of trading-based indicators, including weighted degree centrality, Shannon entropy, and weighted betweenness centrality, to assess sources’ trading impact. These indicators were applied to several time-sliced source-to-source citation networks that comprise 33,634 sources indexed in the Scopus database. The results show that several interdisciplinary sources, such as Nature, PLoS One, Proceedings of the National Academy of Sciences, and Science, and several specialty sources, such as Lancet, Lecture Notes in Computer Science, Journal of the American Chemical Society, Journal of Biological Chemistry, and New England Journal of Medicine, have demonstrated their marked importance in knowledge trading. Furthermore, this study also reveals that, overall, sources have established more trading partners, increased their trading volumes, broadened their trading areas, and diversified their trading contents over the past 15 years from 1997 to 2011. These results inform the understanding of source-level impact assessment and knowledge diffusion.
An investigation of the intellectual structure of opinion mining research

Yongjun Zhu, Meen Chul Kim, and Chaomei Chen

Information Research: An International Electronic Journal, Mar 2017


Introduction: Opinion mining has been receiving increasing attention from a broad range of scientific communities since the early 2000s. The present study aims to systematically investigate the intellectual structure of opinion mining research. Method: Using topic search, citation expansion, and patent search, we collected 5,596 bibliographic records of opinion mining research. Then, intellectual landscapes, emerging trends, and recent developments were identified. We also captured domain-level citation trends, subject category assignment, keyword co-occurrence, document co-citation network, and landmark articles. Analysis: Our study was guided by scientometric approaches implemented in CiteSpace, a visual analytic system based on networks of co-cited documents. We also employed a dual-map overlay technique to investigate epistemological characteristics of the domain. Results: We found that the algorithmic and linguistic aspects of opinion mining have been of greatest interest to the community as a means to understand, quantify, and apply the sentiment orientation of texts. Recent thematic trends reveal that practical applications of opinion mining such as the prediction of market value and investigation of social aspects of product feedback have received increasing attention from the community. Conclusion: Opinion mining is fast-growing and still developing, exploring the refinements of related techniques and applications in a variety of domains. We plan to apply the proposed analytics to more diverse domains and comprehensive publication materials to gain a more generalized understanding of the true structure of a science.
The use of a graph-based system to improve bibliographic information retrieval: System design, implementation, and evaluation

Yongjun Zhu, Erjia Yan, and Il-Yeol Song

Journal of the Association for Information Science & Technology, Feb 2017


In this article, we propose a graph-based interactive bibliographic information retrieval system, GIBIR. GIBIR provides an effective way to retrieve bibliographic information. The system represents bibliographic information as networks and provides a form-based query interface. Users can develop their queries interactively by referencing the system-generated graph queries. Complex queries such as “papers on information retrieval, which were cited by John’s papers that had been presented in SIGIR” can be effectively answered by the system. We evaluate the proposed system by developing another relational database-based bibliographic information retrieval system with the same interface and functions. Experimental results show that the proposed system executes the same queries much faster than the relational database-based system; on average, our system reduced the execution time by 72% (for 3-node queries), 89% (for 4-node queries), and 99% (for 5-node queries).

2016

Searching bibliographic data using graphs: A visual graph query interface

Yongjun Zhu, and Erjia Yan

Journal of Informetrics, Nov 2016


With the ever-increasing scientific literature, improving the efficiency of searching bibliographic data has become an important issue. Because current bibliographic information retrieval systems offer little support for expressing complicated information needs, getting relevant bibliographic data is a demanding task. In this paper, we propose a visual graph query interface for bibliographic information retrieval. Through this interface, users can formulate bibliographic queries by interacting with a graph. Visual graph queries use a set of nodes with constraints and links among nodes to represent explicit and precise bibliographic information needs. The proposed visual graph query interface allows users to formulate several complex bibliographic queries (e.g., bibliographic coupling) that are not attainable in current major bibliographic information retrieval systems. In addition, the proposed interface requires fewer queries to complete everyday bibliographic search tasks.
Understanding the evolving academic landscape of library and information science through faculty hiring data

Yongjun Zhu, Erjia Yan, and Min Song

Scientometrics, Jun 2016


Using a 40-year (from 1975 to 2015) hiring dataset of 642 library and information science (LIS) faculty members from 44 US universities, this research reveals the disciplinary characteristics of LIS through several key aspects including gender, rank, country, university, major, and research area. Results show that genders and ranks among LIS faculty members are evenly distributed; geographically, more than 90% of LIS faculty members received doctoral degrees in the US; meanwhile, 60% of LIS faculty received a Ph.D. in LIS, followed by Computer Science and Education; in regard to research interests, Human–Computer Interaction, Digital Librarianship, Knowledge Organization and Management, and Information Behavior are the most popular research areas among LIS faculty members. Through a series of dynamic analyses, this study shows that the educational background of LIS faculty members is becoming increasingly diverse; in addition, research areas such as Human–Computer Interaction, Social Network Analysis, Services for Children and Youth, Information Literacy, Information Ethics and Policy, and Data and Text Mining, Natural Language Processing, Machine Learning have become increasingly popular. Predictive analyses are performed to discover trends on majors and research areas. Results show that the growth rate of LIS faculty members is linearly distributed. In addition, among faculty members’ Ph.D. majors, the share of LIS is decreasing while the share of Computer Science is growing; among faculty members’ research areas, the share of Human–Computer Interaction is on the rise.
How are they different? A quantitative domain comparison of information visualization and data visualization (2000–2014)

Meen Chul Kim, Yongjun Zhu, and Chaomei Chen

Scientometrics, Jan 2016


Information visualization and data visualization are often viewed as similar but distinct domains, and they have drawn an increasingly broad range of interest from diverse sectors of academia and industry. This study systematically analyzes and compares the intellectual landscapes of the two domains between 2000 and 2014. The present study is based on bibliographic records retrieved from the Web of Science. Using a topic search and a citation expansion, we collected two sets of data in each domain. Then, we identified emerging trends and recent developments in information visualization and data visualization, captured in intellectual landscapes, landmark articles, bursting keywords, and citation trends of the domains. We found that both domains have computer engineering and applications as their shared grounds. Our study reveals that information visualization and data visualization scrutinized algorithmic concepts underlying the domains in their early years. Successive literature citing the datasets focuses on applying information and data visualization techniques to biomedical research. Recent thematic trends in the fields reflect that they are also diverging from each other. In data visualization, emerging topics and new developments cover dimensionality reduction and applications of visual techniques to genomics. Information visualization research is scrutinizing cognitive and theoretical aspects. In conclusion, information visualization and data visualization have co-evolved. At the same time, both fields are developing distinctively with their own scientific interests.

2015

Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

Erjia Yan, and Yongjun Zhu

Journal of Informetrics, Jul 2015


The objective of this study is to evaluate the performance of five entity extraction methods for the task of identifying entities from scientific publications, including two vocabulary-based methods (one keyword-based and one Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with keyword-based dictionary, and CRF with Wikipedia-based dictionary). These methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve are employed as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones, among which CRF with keyword-based dictionary has the best performance. Between the two vocabulary-based methods, the keyword-based one has a higher recall and the Wikipedia-based one has a higher precision. The findings of this study help inform the understanding of informetric research at a more granular level.
Dynamic subfield analysis of disciplines: an examination of the trading impact and knowledge diffusion patterns of computer science

Yongjun Zhu, and Erjia Yan

Scientometrics, Apr 2015


The objective of this research is to examine the dynamic impact and diffusion patterns at the subfield level. Using a 15-year citation data set, this research reveals the characteristics of the subfields of computer science from the aspects of citation characteristics, citation link characteristics, network characteristics, and their dynamics. Through a set of indicators including incoming citations, number of citing areas, cited/citing ratios, self-citation ratios, PageRank, and betweenness centrality, the study finds that subfields such as Computer Science Applications, Software, Artificial Intelligence, and Information Systems possessed higher scientific trading impact. Moreover, it also finds that Human–Computer Interaction, Computational Theory and Mathematics, and Computer Science Applications are among the subfields of computer science that showed the fastest growth in impact. Additionally, Engineering, Mathematics, and Decision Sciences form important knowledge channels with subfields in computer science.