Network visualization of General Index with manuscripts as nodes connected by edges representing their similarity to one another. View of a bridge node connecting two topics: Antediluvian (blue) and Hennig86 (yellow). Source code: https://github.com/hayitsian/General-Index-Visualization

The system of research publication is the bedrock of how humanity shares and expands knowledge. Yet the current state of the art is broken. Legacy institutions such as Nature, Science, and Elsevier control a significant share of article distribution and revenue collection, and they retain much of the distribution rights to work they played a minimal role in curating. That curation includes the peer review process, considered by many to be quintessential to validating results before publication and ensuring accurate information reaches readers. The result is rampant positive publication bias and instances where published work is later found to be factually untrue. To those in the publishing field, disdain for the process is far from news, and several modifications have been proposed to rectify these issues. A growing movement has formed to decentralize the publication process away from legacy organizations and transfer power back to the producers and validators of content: the researchers.

Decentralizing publication is no easy feat. There are several pressing issues that must be addressed for such a system to be viable enough for academics to jump ship. How will manuscripts be stored and accessed? How will they be validated? How will similar articles congregate together? How will peer reviewers of similar subject matter expertise be selected? How will all of this be done transparently? Of particular note is the peer review process. Peer reviewers are typically selected by the editors of a journal who have lists of researchers with expertise in particular topics. Decentralizing this selection process is a non-trivial task that requires knowledge of researchers and their subject matter expertise.

This is where the General Index comes into play: a collection of 107 million published articles comprising a substantial subset of recent human knowledge. Created by Carl Malamud and his team at Public Resource, the General Index is a map of research content released into the public domain. In total, it contains 38 terabytes of data with 335 billion rows of ngrams and 19 billion more of keywords, extracted by the spaCy and YAKE! libraries, respectively, from the underlying 107 million articles. These algorithms pull the relevant information from manuscripts and produce phrases of one to five words that best represent the content and semantics. While the dataset is incomplete (the team hopes to update and revise areas where keyword extraction failed), it represents a starting point for mapping the library of human knowledge.
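To make the "rows of ngrams" concrete, here is a minimal sketch of what one-to-five-word ngram extraction looks like. This is a deliberate simplification with naive whitespace tokenization; the General Index itself was built with spaCy and YAKE!, not this toy function.

```python
from collections import Counter

def extract_ngrams(text, n_min=1, n_max=5):
    """Count all 1- to 5-grams in a lowercased text.
    Toy illustration only: the real index uses spaCy/YAKE!,
    not naive whitespace tokenization."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Each (ngram, count) pair corresponds roughly to one row
# of the index for a given manuscript.
ngrams = extract_ngrams("peer review of peer review")
```

Each manuscript in the index contributes many such rows, which is how a 107-million-article corpus balloons to 335 billion ngram entries.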

Such a tool has the potential to kickstart the decentralization of research publication. This map allows developers to index a substantial network of knowledge and determine the similarity between article subjects, for example, when selecting peer reviewers with relevant subject matter expertise. It can aid in the literature review process by surfacing articles relevant to a researcher's query. A model trained on the index could automatically assign metadata labels to manuscripts for publication, curating machine-readable documents. As the dataset is updated over time and enhanced with supplemental sources, increasingly complex and consequential models can be trained using advances in natural language processing. A future use case could implement a model that, given a manuscript, determines whether it cites relevant literature and flags the reviewer if not. Automating the peer review process as much as possible would help lift the burden on the community to volunteer precious time and remove the bottlenecks that exist in the process today.

Zoomed-in view of manuscripts in the General Index under the Antediluvian topic. The nodes are sized according to how frequently the topic appears in the underlying manuscript.

Getting to these awe-inspiring tools requires developing infrastructure and initial implementations for floor-level use cases. The product team at DeSci Labs is currently hosting a Request for Proposals for General Index implementations in the context of decentralized science. If you have a potential use case, join the Discord and pitch it to the community! One proposal on the table is to craft an API for interacting with the General Index. Given its size and complexity, storing 38TB on every local machine working with it is infeasible, and accessing it through the online archive has proven non-trivial. With an API, those interested in the General Index could access it locally by querying a cloud server for a particular manuscript or keyword. Additional implementations could build a system for determining the similarity between manuscripts given their ngrams or keywords. A model I find particularly enticing is one that, given a manuscript's ngrams, selects authors who have published work in similar fields and subtopics. Such a model could aid the transition to an automated, more transparent peer review selection process, in contrast to the legacy system of editors selecting reviewers from private lists.

To assist in the early development stages of General Index models, I have created a repository with a framework for working with the data. It contains all the code necessary to import a test slice of General Index data, along with a few basic natural language models for topic extraction. The repository also produced the visualizations seen in this writeup. I hope those interested in learning more find it a beneficial starting point. https://github.com/hayitsian/General-Index-Visualization
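The reviewer selection idea above can be sketched in a few lines: rank candidate authors by the overlap between a manuscript's terms and each author's published keywords. Everything here is hypothetical scaffolding, not code from the repository: the `author_profiles` structure, the overlap scoring, and the names are all illustrative assumptions.

```python
def rank_reviewers(manuscript_terms, author_profiles, top_k=3):
    """Rank candidate reviewers by keyword overlap with a manuscript.

    Hypothetical sketch: author_profiles maps an author's name to the
    set of keywords aggregated from their publications (e.g. built from
    General Index rows). Scores are Jaccard overlap; authors with no
    overlap are dropped."""
    ms = set(manuscript_terms)
    scored = [
        (len(ms & kws) / len(ms | kws), name)
        for name, kws in author_profiles.items()
        if ms | kws
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Illustrative made-up profiles:
profiles = {
    "Alice": {"phylogenetics", "cladistics", "hennig86"},
    "Bob": {"antediluvian", "geology"},
}
candidates = rank_reviewers({"hennig86", "parsimony"}, profiles)
```

A production version would need far more (deduplicated author identities, conflict-of-interest filters, recency weighting), but the core mechanic of matching expertise is this simple.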

Proposed use cases for the General Index: