Ground truth? Concept-based communities versus the external classification of physics manuscripts

Palchykov V;Gemmetto V;Boyarsky A;Garlaschelli D

2016

Abstract

Community detection techniques are widely used to infer hidden structures within interconnected systems. Despite their high accuracy on benchmarks, they reproduce the external classification of many real-world systems only with significant discrepancies. A widely accepted explanation for this outcome is the unavoidable loss of non-topological information (such as node attributes) that occurs when the original complex system is converted to a network. In this article we systematically show that the observed discrepancies may also have a different cause: the external classification itself. To this end we use scientific publication data that (i) exhibit a well-defined modular structure and (ii) carry an expert-made classification of research articles. Having represented the articles and the scientific concepts extracted from them both as a bipartite network and as its unipartite projection, we apply modularity optimization to uncover the inner thematic structure. The resulting clusters partly reflect the author-made classification, although some significant discrepancies are observed. A detailed analysis of these discrepancies shows that they may carry essential information about the system, mainly related to the use of similar techniques and methods across different (sub)disciplines, information that is otherwise omitted when only the external classification is considered.
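
The pipeline outlined in the abstract (bipartite article-concept network, unipartite projection, modularity optimization, comparison with an external classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy articles, concepts and external labels are hypothetical, the function names are invented here, and the naive greedy merging merely stands in for the modularity optimizer used in the paper, which this record does not specify.

```python
from itertools import combinations

# Hypothetical toy data: article id -> concepts extracted from it.
ARTICLE_CONCEPTS = {
    "a1": {"dark matter", "galaxy", "halo"},
    "a2": {"dark matter", "halo", "n-body simulation"},
    "a3": {"galaxy", "halo", "n-body simulation", "monte carlo"},
    "a4": {"spin glass", "monte carlo", "phase transition"},
    "a5": {"spin glass", "phase transition", "ising model"},
    "a6": {"monte carlo", "ising model", "phase transition"},
}
# Hypothetical external (author-made) labels; a6 is filed under "astro".
EXTERNAL_CLASS = {"a1": "astro", "a2": "astro", "a3": "astro",
                  "a4": "cond-mat", "a5": "cond-mat", "a6": "astro"}


def project_onto_articles(article_concepts):
    """Unipartite projection: edge weight = number of shared concepts."""
    weights = {}
    for u, v in combinations(sorted(article_concepts), 2):
        shared = len(article_concepts[u] & article_concepts[v])
        if shared:
            weights[(u, v)] = shared
    return weights


def modularity(weights, community_of):
    """Newman-Girvan modularity Q of a partition of a weighted graph."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    internal, strength = {}, {}
    for (u, v), w in weights.items():
        cu, cv = community_of[u], community_of[v]
        strength[cu] = strength.get(cu, 0) + w
        strength[cv] = strength.get(cv, 0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + w
    return sum(internal.get(c, 0) / total - (s / (2 * total)) ** 2
               for c, s in strength.items())


def greedy_modularity(nodes, weights):
    """Naive agglomeration: apply the best merge until Q stops improving."""
    community_of = {n: n for n in nodes}
    best_q = modularity(weights, community_of)
    while True:
        best_merge = None
        labels = sorted(set(community_of.values()))
        for a, b in combinations(labels, 2):
            trial = {n: (a if c == b else c) for n, c in community_of.items()}
            q = modularity(weights, trial)
            if q > best_q + 1e-12:
                best_q, best_merge = q, trial
        if best_merge is None:
            return community_of, best_q
        community_of = best_merge


def pair_agreement(part_a, part_b):
    """Fraction of node pairs on which two partitions agree (Rand index)."""
    pairs = list(combinations(sorted(part_a), 2))
    same = sum((part_a[u] == part_a[v]) == (part_b[u] == part_b[v])
               for u, v in pairs)
    return same / len(pairs)


weights = project_onto_articles(ARTICLE_CONCEPTS)
communities, q = greedy_modularity(sorted(ARTICLE_CONCEPTS), weights)
agreement = pair_agreement(communities, EXTERNAL_CLASS)
print(communities, round(q, 3), round(agreement, 3))
```

In this toy setting the concept-based clusters and the external labels disagree on article a6, which shares its concepts with one group but is classified with the other. It is only a schematic analogue of the kind of discrepancy the abstract discusses, not a reproduction of the paper's analysis.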

Year: 2016
Journal: EPJ Data Science
Appears in types: 1.1 Journal article

Files in this record:

File	Size	Format
Ground Truth.pdf (not available; license: not specified)	1.48 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11771/3488
