ALIS (Algorithm-based Literature research In Science)
Data
- A balanced subset of publications available on arXiv (~113k papers)
- Each record includes the title, abstract, authors, categories, and publication date (see the loading sketch below)
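A minimal sketch of loading such a metadata table, assuming a pandas-readable file. The file name and column names below are illustrative, not the project's actual schema:

```python
# Illustrative loading sketch. The path and column names are assumptions;
# the real dataset schema may differ.
import pandas as pd

papers = pd.read_csv("arxiv_metadata.csv")  # hypothetical file name

# Columns expected from the description above (names are illustrative):
#   title, abstract, authors, categories, published
print(f"{len(papers):,} papers loaded")
print(papers[["title", "categories", "published"]].head())
```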
Embeddings
We used precomputed vector embeddings generated by the following models (see the encoding sketch after the list):
- MiniLM: 384-dimensional, lightweight semantic encoder [Doc]
- SPECTER: 768-dimensional, trained on scientific papers and their citation graph [Doc]
- SciBERT: 768-dimensional, trained on a full-text scientific corpus with the SciVocab vocabulary [Doc]
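As a rough illustration, embeddings like these are typically produced with the sentence-transformers library. The checkpoint name below is an assumption (all-MiniLM-L6-v2 is one common 384-dimensional MiniLM variant), not necessarily the one used here:

```python
# Hedged sketch: producing 384-dim MiniLM embeddings with sentence-transformers.
# The checkpoint is an assumption; a different variant may have been used.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [
    "Attention Is All You Need",
    "A survey of graph neural networks",
]
# normalize_embeddings=True makes dot products equal to cosine similarity
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```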
How It Works
When you submit a query, the app:
- Checks whether your query exactly matches an existing paper title
- If it does, uses that paper's embedding as the anchor for recommendations
- If not, embeds your query string and compares it against every paper's embedding using cosine similarity (see the sketch below)
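The sketch below illustrates this query flow under stated assumptions: `titles` and `embeddings` hold the precomputed corpus (embedding rows L2-normalized), and `encode_fn` embeds a free-text query with the same model. The function and variable names are illustrative, not the app's actual implementation:

```python
# Illustrative query flow, not the app's actual implementation.
import numpy as np

def recommend(query, titles, embeddings, encode_fn, top_k=5):
    """Return indices of the top_k papers most similar to the query.

    Assumes rows of `embeddings` are L2-normalized, so a dot product
    with a normalized anchor equals cosine similarity.
    """
    title_index = {t: i for i, t in enumerate(titles)}

    if query in title_index:
        # Exact title match: anchor on that paper's own embedding.
        anchor = embeddings[title_index[query]]
    else:
        # No match: embed the query string itself.
        anchor = encode_fn(query)

    anchor = anchor / np.linalg.norm(anchor)
    scores = embeddings @ anchor          # cosine similarities
    return np.argsort(scores)[::-1][:top_k].tolist()
```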
Categories
Papers are categorized according to the arXiv Category Taxonomy.
References
- MiniLM: Wang et al., MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, NeurIPS 2020 [PDF]
- SPECTER: Cohan et al., SPECTER: Document-level Representation Learning using Citation-informed Transformers, ACL 2020 [PDF]
- SciBERT: Beltagy et al., SciBERT: A Pretrained Language Model for Scientific Text, EMNLP 2019 [PDF]
Disclaimer: This tool is a prototype developed for academic research purposes. The recommendations provided are based on automated similarity computations and do not substitute for expert review.