Textoscopia
Analyse your corpus
with ease
Upload one or more PDF or Word documents (.docx) to explore your texts using a full set of corpus linguistics tools — all in your browser, nothing sent to a server.
See every occurrence of a word in context, with a configurable left and right window.
Ranked list of the most common words, with frequency bars.
Find which words habitually appear near a node word. Click any collocate to see all concordance lines.
Use regular expressions to find any linguistic pattern, shown in KWIC-style context. Includes a regex help panel.
Find the most frequent sequences of 2, 3, 4 or 5 consecutive words in the corpus.
Visualise where in the corpus a word appears — whether it clusters in one section or spreads evenly.
Measure lexical diversity with TTR and the more reliable MATTR. When the corpus contains multiple texts, a per-file breakdown is shown.
With two or more files, identifies which words are statistically distinctive in each text using log-likelihood.
Download the full extracted text of your corpus as a plain .txt file for use in other tools.
Textoscopia was designed and developed by Marc Olivier-Loiseau.
This tool was built with the assistance of Claude, an AI assistant made by Anthropic. All conceptual decisions, feature choices, design directions, and linguistic framing were made by the author. The use of AI assistance is disclosed in accordance with emerging best practices in digital humanities scholarship.
If you use Textoscopia in your research, please cite it as:
Olivier, Marc (2026). Textoscopia: A Simple Browser-Based Tool for Corpus Linguistics Analysis (v1.0). Zenodo. https://doi.org/10.5281/zenodo.19071710
If you want to share feedback or report an issue, please contact the author.
The number on each chip (e.g. 18) is how many times that word occurred within the ±span window around your node word, counting every token hit. When you click a chip, the concordance table shows how many lines contain both words — this count can be lower, because a line where the collocate appears twice in the same window contributes two token hits but only one line.
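The distinction between token hits and line hits can be sketched in a few lines of Python. This is an illustrative sketch, not Textoscopia's actual implementation; the function name and signature are invented for the example.

```python
def collocate_counts(tokens, node, collocate, span=4):
    """Return (token_hits, line_hits) for `collocate` around `node`.

    token_hits: every occurrence of the collocate inside the ±span
    window around each instance of the node word (the chip count).
    line_hits: node-centred windows (concordance lines) that contain
    the collocate at least once (the concordance count).
    """
    token_hits = 0
    line_hits = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words up to `span` positions left and right, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
        hits = window.count(collocate)
        token_hits += hits
        if hits:
            line_hits += 1
    return token_hits, line_hits
```

For example, in `["the", "cat", "sat", "cat", "on"]` with node `"sat"` and span 2, the collocate `"cat"` yields two token hits but only one concordance line.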
Tokens = total words in the corpus. Types = unique word forms. TTR = types ÷ tokens × 100. Because TTR decreases as texts get longer, the MATTR (Moving Average TTR) gives a more reliable comparison across texts of different lengths.
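The two measures can be written out directly from the definitions above. A minimal sketch (window size and function names are illustrative, not the tool's internals):

```python
def ttr(tokens):
    """Type-token ratio as a percentage: unique forms / total tokens x 100."""
    return len(set(tokens)) / len(tokens) * 100

def mattr(tokens, window=100):
    """Moving-Average TTR: the mean TTR over every consecutive
    window of `window` tokens, which corrects for text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)
```

Because every window has the same length, MATTR scores remain comparable across texts of very different sizes, which is exactly where raw TTR breaks down.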
Each file is compared against all the others combined. A word is a keyword when it appears significantly more often in one text than its frequency across the whole corpus would lead you to expect. Requires at least two files. Uses the log-likelihood (LL) statistic.