Textoscopia

Analyse your corpus with ease

Upload one or more PDF or Word documents (.docx) to explore your texts using a full set of corpus linguistics tools — all in your browser, nothing sent to a server.

KWIC

See every occurrence of a word in context, with a configurable left and right window.

Word frequency

Ranked list of the most common words, with frequency bars.

Collocations

Find which words habitually appear near a node word. Click any collocate to see all concordance lines.

Pattern search

Use regular expressions to find any linguistic pattern, shown in KWIC-style context. Includes a regex help panel.

N-grams

Find the most frequent sequences of 2, 3, 4 or 5 consecutive words in the corpus.

Dispersion

Visualise where in the corpus a word appears — whether it clusters in one section or spreads evenly.

Type / Token ratio

Measure lexical diversity with TTR and the more reliable MATTR. With multiple texts, shows a per-file breakdown.

Keyword analysis

With two or more files, identifies which words are statistically distinctive in each text using log-likelihood.

Export

Download the full extracted text of your corpus as a plain .txt file for use in other tools.

Textoscopia was designed and developed by Marc Olivier-Loiseau.

This tool was built with the assistance of Claude, an AI assistant made by Anthropic. All conceptual decisions, feature choices, design directions, and linguistic framing were made by the author. The use of AI assistance is disclosed in accordance with emerging best practices in digital humanities scholarship.

If you use Textoscopia in your research, please cite it as:
Olivier, Marc (2026). Textoscopia: A Simple Browser-Based Tool for Corpus Linguistics Analysis (v1.0). Zenodo. https://doi.org/10.5281/zenodo.19071710

If you want to share feedback or report an issue, please contact the author.

Keyword in context
See every occurrence of a word surrounded by its context.
Enter a word above and click Search.
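At its core, a KWIC display is just token slicing around each hit. A minimal sketch in JavaScript (the function name, tokeniser, and default window size are illustrative assumptions, not Textoscopia's actual implementation):

```javascript
// Build KWIC (keyword-in-context) lines for a node word.
// `window` is the number of context words kept on each side.
function kwic(text, node, window = 5) {
  // Simple Unicode-aware tokeniser: runs of letters, digits, underscore.
  const tokens = text.toLowerCase().match(/[\p{L}\p{N}_]+/gu) || [];
  const lines = [];
  tokens.forEach((tok, i) => {
    if (tok === node.toLowerCase()) {
      lines.push({
        left: tokens.slice(Math.max(0, i - window), i).join(" "),
        node: tok,
        right: tokens.slice(i + 1, i + 1 + window).join(" "),
      });
    }
  });
  return lines;
}
```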
Word frequency
Ranked list of the most common words in the corpus.
Click Analyse to generate frequency list.
Collocations
Discover words that habitually co-occur with a node word.

The number on each chip (e.g. 18) is how many times that word appeared within the ±span window around your node word. When you click a chip, the concordance table shows how many lines contain both words — this count may differ because it centres on each occurrence of the node word rather than counting individual token hits.

Enter a node word and click Find collocates.
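The window-counting logic described above can be sketched as follows. This is illustrative code under the same ±span rule, not the tool's own source:

```javascript
// Count collocates of `node` within a window of ±span tokens.
// Each token occurrence inside a window counts once, which is why a
// chip's count can exceed the number of concordance lines shown.
function collocates(tokens, node, span = 4) {
  const counts = new Map();
  tokens.forEach((tok, i) => {
    if (tok !== node) return;
    const lo = Math.max(0, i - span);
    const hi = Math.min(tokens.length, i + span + 1);
    for (let j = lo; j < hi; j++) {
      if (j === i) continue; // skip the node word itself
      counts.set(tokens[j], (counts.get(tokens[j]) || 0) + 1);
    }
  });
  // Rank collocates by frequency, most frequent first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```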
Pattern search
Use regular expressions to find patterns — results shown in KWIC-style context.
Characters & wildcards
\w        any word character (letter, digit, _)
\W        any non-word character
\d        any digit (0–9)
\s        any whitespace (space, tab…)
.         any single character except newline
[abc]     one of: a, b, or c
[a-z]     any lowercase letter
[^abc]    any character except a, b, c
Quantifiers
*         0 or more of the preceding
+         1 or more of the preceding
?         0 or 1 — makes the preceding optional
{n}       exactly n repetitions
{n,m}     between n and m repetitions
{n,}      n or more repetitions
Anchors & boundaries
\b        word boundary (start or end of a word)
\B        not a word boundary (inside a word)
^         start of string
$         end of string
Groups & alternation
(abc)     group — treat as a single unit
(?:abc)   non-capturing group
a|b       a or b (alternation)
\s+       one or more whitespace characters (gap between words)
Examples — click any row to insert into the search box
\w*ing\b                        all gerunds / present participles    [morphology]
\bun\w+                         words starting with "un"    [prefix]
\w*tion\b                       words ending in "-tion" (nominalisations)    [suffix]
\w*ment\b                       words ending in "-ment"    [suffix]
\w*r\b\s+la\b                   word ending in "r" followed by "la"    [sequence]
\bthe\s+\w+\s+of\b              "the … of" nominal pattern    [sequence]
\b(very|quite|rather|fairly)\s+\w+    degree adverb + following word    [colligation]
\b\w{10,}\b                     words of 10 or more characters    [word length]
\b[A-Z][a-z]+\b                 capitalised words (proper nouns etc.; requires case-sensitive search)    [proper nouns]
\b\d+\b                         standalone numbers    [numerals]
\bne\s+\w+\s+pas\b              French negation pattern "ne … pas"    [syntax]
\b(is|are|was|were)\s+\w+ed\b   passive constructions    [grammar]
Enter a pattern and click Search.
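Running a pattern over the corpus and cutting context around each hit can be sketched like this. The character-based context window and function name are simplifying assumptions; Textoscopia's actual behaviour may differ:

```javascript
// Run a regex over the corpus and return each match in KWIC-style context.
// `ctx` is the number of characters of context kept on each side.
function patternSearch(text, pattern, ctx = 20) {
  const re = new RegExp(pattern, "giu"); // global, case-insensitive, Unicode
  const hits = [];
  for (const m of text.matchAll(re)) {
    if (m[0] === "") continue; // skip zero-width matches
    const end = m.index + m[0].length;
    hits.push({
      left: text.slice(Math.max(0, m.index - ctx), m.index),
      match: m[0],
      right: text.slice(end, end + ctx),
    });
  }
  return hits;
}
```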
N-grams
Find the most frequent sequences of words in the corpus.
Select N and click Analyse.
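Counting n-grams amounts to sliding a window of N tokens over the corpus and tallying each sequence. A sketch (names and default limit are illustrative):

```javascript
// Count every sequence of n consecutive tokens and rank by frequency.
function topNgrams(tokens, n, limit = 10) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, limit);
}
```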
Dispersion plot
See where in the corpus a word appears — whether it clusters in one area or spreads evenly throughout.
Enter one or more words to plot their dispersion.
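A dispersion plot only needs each occurrence's relative position in the corpus. A sketch, assuming a pre-tokenised corpus of at least two tokens:

```javascript
// Return each occurrence of `word` as a relative position in [0, 1],
// ready to draw as tick marks along a dispersion bar.
function dispersion(tokens, word) {
  const positions = [];
  tokens.forEach((tok, i) => {
    if (tok === word) positions.push(i / (tokens.length - 1));
  });
  return positions;
}
```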
Type / Token ratio
A measure of lexical diversity — how varied the vocabulary is. A higher ratio means more varied language.

Tokens = total words in the corpus. Types = unique word forms. TTR = types ÷ tokens × 100. Because TTR decreases as texts get longer, the MATTR (Moving Average TTR) gives a more reliable comparison across texts of different lengths.

Click Calculate to measure lexical diversity.
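The two measures above can be computed as follows. A sketch only: the 100-token MATTR window is an assumed default, not necessarily the size Textoscopia uses:

```javascript
// TTR = types / tokens × 100.
function ttr(tokens) {
  return (new Set(tokens).size / tokens.length) * 100;
}

// MATTR slides a fixed-size window over the text, computes TTR inside
// each window, and averages, so longer texts are not penalised the way
// raw TTR penalises them.
function mattr(tokens, window = 100) {
  if (tokens.length <= window) return ttr(tokens);
  let sum = 0;
  const n = tokens.length - window + 1;
  for (let i = 0; i < n; i++) {
    sum += ttr(tokens.slice(i, i + window));
  }
  return sum / n;
}
```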
Keyword analysis
Identify which words are statistically distinctive in your corpus — words that appear more often than you would expect by chance.

Each file is compared against all the others combined. A word is a keyword when it appears significantly more often in one text than its frequency across the whole corpus would lead you to expect. Requires at least two files. Uses the log-likelihood (LL) statistic.

Click Analyse to find distinctive words.
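The log-likelihood statistic for a single word can be sketched with the standard two-corpus contingency formulation (Rayson and Garside's LL). The function name is illustrative:

```javascript
// Log-likelihood keyness: compares a word's frequency in one text
// (a hits out of c tokens) against the rest of the corpus
// (b hits out of d tokens). Values above ~3.84 are significant
// at p < 0.05 (chi-squared, 1 degree of freedom).
function logLikelihood(a, c, b, d) {
  const e1 = (c * (a + b)) / (c + d); // expected frequency in the text
  const e2 = (d * (a + b)) / (c + d); // expected frequency in the rest
  const t1 = a > 0 ? a * Math.log(a / e1) : 0;
  const t2 = b > 0 ? b * Math.log(b / e2) : 0;
  return 2 * (t1 + t2);
}
```

When the word's relative frequency is identical in both parts, LL is 0; the more a word is over- or under-used in one text, the larger LL grows.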
Export corpus
Download the full extracted text of your corpus as a plain .txt file.
Load files first, then export below.