Voynich Manuscript · Analysis Portal

31,773 tokens · 1,414 word types · 4,183 lines · 183 folios · 7 sections

Corpus statistics

31,773 total tokens
1,414 unique word types
4,183 lines
183 folios
7 sections
0.045 type/token ratio

Stars 11,411 · Herbal 10,738 · Balneological 6,359 · Unknown 1,292 · Text-only 1,113 · Cosmological 476 · Zodiac 384

Embedding quality — section separation (silhouette score)

word2vec: 64d skip-gram, window=5
FastText: char n-gram, 64d

Silhouette score ∈ [–1, 1]. Negative values are expected when sections overlap in embedding space. Run make embed then python3 portal.py to refresh after re-embedding.
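As an illustration of why the score can go negative, here is a minimal sketch of the silhouette computation in plain Python. The distance metric (Euclidean), the function name, and the toy points are assumptions for illustration; the portal's own metric and implementation are not stated on this page.

```python
import math
from collections import defaultdict
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette over all points (Euclidean distance, at least two clusters)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = defaultdict(list)
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster[other].append(math.dist(p, q))
        if not by_cluster[lab]:
            scores.append(0.0)  # singleton cluster: score is 0 by convention
            continue
        a = mean(by_cluster[lab])  # mean distance to own cluster
        b = min(mean(d) for c, d in by_cluster.items() if c != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy 2D points standing in for word vectors; labels stand in for sections.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
separated = silhouette(points, [0, 0, 1, 1])    # labels follow the geometry
overlapping = silhouette(points, [0, 1, 0, 1])  # labels cut across the geometry
```

With labels that follow the geometry the score is close to 1; with labels that cut across it the score drops below zero, which is the situation described above for overlapping sections.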

Top 10 most frequent words

Word     Freq  Section        Initial %  Ctx entropy  Next entropy  Prev entropy
daiin    744   Herbal         17.5%      9.32         8.09          7.99
chedy    465   Stars          1.3%       8.59         7.22          7.18
ol       457   Balneological  6.3%       8.65         7.08          7.61
shedy    403   Balneological  1.5%       8.60         6.93          7.12
aiin     362   Stars          0.0%       9.03         7.56          6.75
chol     351   Herbal         4.8%       8.65         6.97          7.35
chey     308   Stars          1.3%       8.60         7.25          7.16
qokeedy  300   Balneological  10.7%      8.18         6.89          6.45
qokeey   283   Stars          10.2%      8.41         7.01          6.67
or       283   Herbal         9.9%       8.78         6.49          7.31

Function-word candidates (high line-initial rate + high context entropy)

Word    Freq  Initial %  Context entropy  Section
sor     47    70.2%      7.20             Balneological
tol     38    57.9%      7.25             Herbal
sol     64    56.2%      7.30             Balneological
sain    61    50.8%      7.31             Stars
saiin   125   42.4%      8.19             Stars
sar     63    38.1%      7.57             Stars
qotchy  61    31.1%      7.30             Herbal
dor     65    30.8%      7.47             Herbal

High initial fraction = often starts a line (grammatical role?). High context entropy = appears in many different contexts.
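The two quantities can be sketched as follows. This is a minimal illustration, assuming that "context" means the immediate left and right neighbours within a line; the portal may use a wider window. The function names and toy lines are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def function_word_stats(lines):
    """Return {word: (frequency, initial_fraction, context_entropy)}."""
    freq, initial = Counter(), Counter()
    context = defaultdict(Counter)
    for line in lines:
        for i, word in enumerate(line):
            freq[word] += 1
            if i == 0:
                initial[word] += 1
            # Assumed context: immediate neighbours inside the same line.
            if i > 0:
                context[word][line[i - 1]] += 1
            if i + 1 < len(line):
                context[word][line[i + 1]] += 1
    return {w: (freq[w], initial[w] / freq[w],
                entropy(context[w]) if context[w] else 0.0)
            for w in freq}

# Toy lines for illustration only, not real transcription data.
toy_lines = [["sor", "daiin", "chol"],
             ["sor", "chedy", "daiin"],
             ["daiin", "sor", "ol"]]
stats = function_word_stats(toy_lines)
```

In the toy data, sor starts two of its three lines and is seen next to three different words, so it scores high on both axes.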

Top PMI collocations (adjacent words, count ≥ 5)

Word 1  Word 2  NPMI  Count  Section
orshyporor0.8991
opodsheetchy0.8991
octheody0.8991
cheefypykeor0.8991
checthdyqotedl0.8991
checthdyqokeeodor0.8991
dalydaltoror0.8991
sheotamshet0.8991
okeodainshet0.8991
chpcheychpchol0.8991

Top morphological word families (FastText, threshold 0.85)

Family core  Members  Top members
ched         27

Analysis Library

Embeddings & Semantic Space

Word embedding scatter
Interactive UMAP 2D projection of 64-dimensional word2vec embeddings. Seven colouring modes: section, frequency, prefix, suffix, HMM state, context entropy, and word length. Hover any point to see the word and its nearest neighbours.
Folio embeddings
Per-folio mean-pooled word2vec embeddings projected to 2D via UMAP. Shows whether pages that share content also cluster in semantic space, and reveals the manuscript's large-scale thematic structure folio by folio.
3D embedding scatter
UMAP 3D reduction of word2vec embeddings. Rotate freely; colour by section, frequency, prefix, or HMM state. Useful for spotting clusters that flatten out in 2D projections.
Embedding stability
Bootstrap reliability of the word2vec model: 15 independent training runs with different random seeds, K=10 neighbour Jaccard measured across runs. High-Jaccard words have stable neighbourhoods regardless of seed.
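The stability measure described for the embedding stability analysis can be sketched as a mean pairwise Jaccard over a word's top-K neighbour sets. This is a minimal illustration; the function names, the dict-of-lists vector format, and the toy 2D vectors are assumptions, and the portal's own aggregation may differ.

```python
import math
from itertools import combinations

def top_k_neighbours(vectors, word, k):
    """Set of the k words closest to `word` by cosine similarity."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])

def neighbour_stability(runs, word, k=10):
    """Mean pairwise Jaccard of a word's top-k neighbour sets across runs."""
    sets = [top_k_neighbours(run, word, k) for run in runs]
    scores = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(scores) / len(scores)

# Three toy "training runs" (hypothetical 2D vectors, one dict per seed).
run_a = {"daiin": [1, 0], "aiin": [0.9, 0.1], "chedy": [0, 1], "shedy": [0.1, 0.9]}
run_b = {"daiin": [0, 1], "aiin": [0.2, 0.8], "chedy": [1, 0], "shedy": [0.9, 0.3]}
run_c = {"daiin": [1, 0], "aiin": [0, 1], "chedy": [1, 0.1], "shedy": [0.5, 0.5]}
stability = neighbour_stability([run_a, run_b, run_c], "daiin", k=1)
```

Runs a and b agree on the nearest neighbour of daiin while run c does not, so one of the three run pairs matches and the score is 1/3.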
PPMI distributional vectors
Classical NLP baseline: Positive PMI co-occurrence matrix (window=5) factorised via truncated SVD (d=50). Compares PPMI vs word2vec via neighbour Jaccard and pairwise cosine correlation. Mean Jaccard ≈ 0.128: on average, the neighbours shared by the two methods make up about one eighth of their combined neighbour sets.
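The PPMI step of this baseline can be sketched in plain Python as below. This is only the co-occurrence weighting; the truncated SVD is omitted. It assumes a symmetric window that does not cross line boundaries, which this page does not specify, and the function name and toy lines are hypothetical.

```python
import math
from collections import Counter

def ppmi(lines, window=5):
    """Sparse PPMI weights {(word, context): value} from symmetric windows."""
    pair, word_total = Counter(), Counter()
    for line in lines:
        for i, w in enumerate(line):
            lo, hi = max(0, i - window), min(len(line), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair[(w, line[j])] += 1
                    word_total[w] += 1
    total = sum(pair.values())
    weights = {}
    for (w, c), n in pair.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ); keep only positive values.
        pmi = math.log2(n * total / (word_total[w] * word_total[c]))
        if pmi > 0:
            weights[(w, c)] = pmi
    return weights

# Toy lines for illustration only.
toy_matrix = ppmi([["a", "b"], ["a", "b"], ["c", "d"]])
```

Pairs that never co-occur are simply absent from the result, which corresponds to a zero entry in the PPMI matrix.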
SIF line embeddings
Smooth Inverse Frequency weighted sentence vectors with principal-component removal (Arora et al. 2017). Compares SIF vs mean-pool for line-level similarity and section separation.
Folio semantic trajectory
PCA 2D of per-folio mean embeddings ordered along the manuscript. Reveals whether semantic content drifts smoothly or jumps abruptly as you page through the manuscript.
Folio similarity matrix
Pairwise cosine similarity of SIF folio vectors, Ward-clustered. Same-section mean = 0.152, cross-section mean = −0.142: a gap of 0.294 that indicates section-level coherence.
Nearest-neighbour graph
K=4 nearest-neighbour graph for the top-200 words computed in the full 64-dimensional embedding space (not in the projected 2D). Edges connect words that are genuinely close in high-dimensional space.
Similarity matrix
Pairwise cosine similarity heatmap for the top-N most frequent words, hierarchically clustered. A quick way to see which words are semantically interchangeable.
Semantic field dendrogram
Ward hierarchical clustering of the top-200 words with cosine distance, shown as a dendrogram and flat K=12 clusters. Identifies coherent semantic fields within the vocabulary.

Morphology & Word Structure

Morphological word families
FastText character-ngram embeddings cluster words that share surface form. Treemap shows families (ke*, cho*, ched*, …) and their members. Suggests agglutinative-like prefix/suffix morphology.
Morpheme inventory
EVA morpheme candidates discovered via character n-gram log-odds segmentation. Shows which sub-word units recur with above-chance frequency and where they tend to appear within words.
Affix continuation entropy
Trie-based analysis of prefix and suffix continuation entropy (H). Low H means the same sequence always continues the same way: rc/ts/lc have H=0, which suggests they are EVA digraphs rather than free morphemes.
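Continuation entropy can be illustrated with a simplified sketch. It uses a plain dictionary instead of a trie and looks only at word-initial prefixes of one fixed length; the function name, the end-of-word marker "$", and the sample word list are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def prefix_continuation_entropy(words, length=2):
    """H in bits of the character that follows each word-initial prefix.

    "$" marks the end of the word (the prefix is the whole word).
    """
    following = defaultdict(Counter)
    for w in words:
        if len(w) >= length:
            nxt = w[length] if len(w) > length else "$"
            following[w[:length]][nxt] += 1
    result = {}
    for prefix, counts in following.items():
        total = sum(counts.values())
        result[prefix] = -sum(c / total * math.log2(c / total)
                              for c in counts.values())
    return result

# Word forms taken from the frequency table; counts are not used here.
h = prefix_continuation_entropy(["chedy", "chol", "chey", "qokeedy", "qokeey"])
```

Here qo is always followed by k, so its H is 0, while ch continues with either e or o and gets H ≈ 0.92 bits.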
Morphological paradigms
Detects suffix/prefix substitution groups via embedding vector offsets (analogous to word2vec analogy tasks). Shows how consistent the offset is across the paradigm, and the analogy accuracy.
Prefix × suffix PMI matrix
2-character prefix × suffix heatmap weighted by Pointwise Mutual Information, revealing which endings follow which beginnings non-randomly. The ai+in pair is the most frequent (PMI=2.09, freq=470).
CV skeleton analysis
Consonant/Vowel skeleton of every word in the EVA transcription. Corpus V-fraction = 0.456; CCVCV is the most frequent skeleton template. Shows phonotactic shape constraints on word form.
Character n-gram analysis
EVA character bigram transition matrix plus log-odds surprise heatmap. Reveals which character pairs are over- or under-represented relative to a unigram baseline — the building blocks of Voynich phonotactics.
EVA phonotactics
Positional character bias, trigram log-odds, bigram transitions, and CV-shape templates. Shows which characters are strongly position-biased (line-initial, word-initial, word-final) and which are free.
Character position entropy
Per-slot Shannon entropy of EVA characters across all words. Early positions are highly constrained (low H); late positions are more varied. Quantifies which positions within a word are most predictable.
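Per-slot entropy can be sketched as follows, assuming each EVA character counts as one slot (multi-character glyphs such as ch are not merged) and words are truncated at a hypothetical `max_len`. The function name and the four toy strings are illustrative only.

```python
import math
from collections import Counter, defaultdict

def position_entropy(words, max_len=8):
    """Shannon entropy (bits) of the character found at each word position."""
    slots = defaultdict(Counter)
    for w in words:
        for i, ch in enumerate(w[:max_len]):
            slots[i][ch] += 1
    entropies = []
    for i in range(len(slots)):  # slot indices are contiguous from 0
        total = sum(slots[i].values())
        entropies.append(-sum(c / total * math.log2(c / total)
                              for c in slots[i].values()))
    return entropies

# Toy two-character strings: two options in slot 0, four in slot 1.
slot_h = position_entropy(["qo", "qa", "ch", "cy"])
```

The toy result is 1 bit for the first slot and 2 bits for the second, the same "constrained early, varied late" shape the card describes.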
Word-length profile
Word length distribution by section, line position, and folio. Includes character-type composition (vowel vs consonant fraction) and the folio trend in mean word length across the manuscript.

Bigrams, Co-occurrence & Context

Bigram heatmap
Frequency heatmap of top-prefix × top-suffix word pairs. Shows which word combinations appear more or less often than chance, making systematic formulaic phrases visible at a glance.
Bigram network
Force-directed graph where nodes are words and edges are frequent adjacent pairs. Edge weight encodes bigram count; community structure reveals clusters of words that chain together.
NPMI heatmap
Normalized Pointwise Mutual Information for the top-40 most frequent words. NPMI ∈ [−1, 1]; strong positive values identify reliable collocations that co-occur more than chance predicts.
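NPMI for adjacent word pairs can be sketched as below. This is a minimal illustration, assuming pairs are counted within lines and marginals are taken from the bigram table itself (so the score stays within [−1, 1]); the function name, the `min_count` parameter, and the toy lines are hypothetical.

```python
import math
from collections import Counter

def npmi_adjacent(lines, min_count=5):
    """NPMI for adjacent word pairs seen at least `min_count` times."""
    pair, left, right = Counter(), Counter(), Counter()
    for line in lines:
        for a, b in zip(line, line[1:]):
            pair[(a, b)] += 1
            left[a] += 1    # how often a appears as the first word of a pair
            right[b] += 1   # how often b appears as the second word of a pair
    total = sum(pair.values())
    scores = {}
    for (a, b), c in pair.items():
        if c < min_count or c == total:  # skip rare pairs and the 0/0 case
            continue
        p_ab = c / total
        pmi = math.log(p_ab / ((left[a] / total) * (right[b] / total)))
        scores[(a, b)] = pmi / -math.log(p_ab)  # normalise by -log p(a, b)
    return scores

# Toy lines: two pairs that always occur together.
toy = [["qokeedy", "qokeey"]] * 5 + [["daiin", "ol"]] * 5
scores = npmi_adjacent(toy)
```

A pair whose members never appear apart reaches the maximum of 1; independent pairs sit near 0.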
Positional bigrams
G² log-likelihood enrichment of word bigrams split by line zone (beginning / middle / end). Shows which word pairs are specific to particular positions in a line rather than occurring uniformly.
Context profile heatmap
Left and right conditional context probabilities for the top-40 words: P(context | word) shown as a heatmap. Reveals which words have stereotyped neighbours and which appear in diverse surroundings.
Left-right context asymmetry
Asymmetry score A = H_left − H_right per word. A >> 0 means many possible predecessors but few successors ('closer'); A << 0 means the opposite ('opener'). dshedy/pol have H_left = 0: each has only one observed predecessor.
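The asymmetry score can be sketched directly from its definition. This is a minimal illustration, assuming predecessors and successors are counted within lines only; the function names and toy lines are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy in bits; an empty Counter gives 0."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def context_asymmetry(lines):
    """A = H_left - H_right for every word that has at least one neighbour."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for line in lines:
        for prev, nxt in zip(line, line[1:]):
            right[prev][nxt] += 1  # successors of prev
            left[nxt][prev] += 1   # predecessors of nxt
    words = set(left) | set(right)
    return {w: entropy(left[w]) - entropy(right[w]) for w in words}

# Toy lines: x has two different predecessors but a single successor.
asym = context_asymmetry([["a", "x"], ["b", "x"], ["x", "c"], ["x", "c"]])
```

In the toy data x scores A = 1 (a 'closer' in the card's terms). Note that a word with no observed predecessor at all also gets H_left = 0 in this sketch, so zero entropy alone does not distinguish "one predecessor" from "none".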
Co-occurrence network
Word co-occurrence graph with window=3 and greedy community detection. Louvain communities map loosely onto manuscript sections, suggesting section-specific vocabulary clusters.
Word network centrality
Co-occurrence graph (254 nodes, 4471 edges) with three Louvain communities. daiin has the highest PageRank. Betweenness and degree centrality reveal structural hubs vs peripheral words.
Word transition network
Directed graph of transition probabilities P(B|A) for the top-120 words, laid out with a spring algorithm. Thick edges are high-probability transitions; shows the most predictable word sequences.

Lexical Statistics & Frequency

Hapax & Zipf analysis
Zipf and Heaps' law fits for the Voynich vocabulary. Shows the hapax legomena ratio by section and breaks the vocabulary into frequency tiers (hapax, rare, medium, frequent, very frequent).
Zipf / power-law analysis
Maximum-likelihood Zipf exponent α = 1.837 (method of Clauset et al. 2009). Mandelbrot extension fit; Heaps β = 0.760; 68.5% hapax rate. Compares Voynich to natural-language baselines.
Word burstiness
Goh-Barabási burstiness B = (σ−μ)/(σ+μ) and memory coefficient M for inter-occurrence intervals. cthy is the most bursty word (B=0.61), meaning it appears in concentrated bursts rather than uniformly.
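The burstiness formula can be sketched as follows. This is a minimal illustration that treats the corpus as one flat token list and uses the population standard deviation of the gaps; both choices are assumptions, the memory coefficient M is omitted, and the function name and toy sequences are hypothetical.

```python
import statistics

def burstiness(tokens, word):
    """Goh-Barabasi B = (sigma - mu) / (sigma + mu) over inter-occurrence gaps."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None  # too few occurrences to estimate a spread
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)

# Perfectly periodic occurrences: every gap is 2, so sigma = 0 and B = -1.
periodic = burstiness(["x", "a"] * 5, "x")
# A cluster of occurrences followed by a long gap: B is positive.
clustered = burstiness(["x"] * 4 + ["a"] * 16 + ["x"], "x")
```

B ranges from −1 (perfectly regular) through 0 (Poisson-like) towards 1 (highly bursty), so the reported B = 0.61 sits well on the bursty side.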
Vocabulary drift
Word frequency tracked across sliding folio windows. Shows which words rise and fall in usage across the manuscript — a proxy for topic drift or scribal variation over time.
Folio vocabulary richness
Per-folio type-token ratio (TTR), MSTTR-50, hapax rate, Shannon entropy, and Yule's K across all 183 folios. Maps where the manuscript is lexically dense vs repetitive.
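Three of these richness measures can be sketched for a single token list. This is a minimal illustration; it assumes the hapax rate is taken over word types and uses the common form of Yule's K, K = 10⁴ · (Σ f² − N) / N². The function name and toy tokens are hypothetical.

```python
from collections import Counter

def richness(tokens):
    """Return (type-token ratio, hapax rate over types, Yule's K)."""
    freq = Counter(tokens)
    n = len(tokens)
    ttr = len(freq) / n
    hapax_rate = sum(1 for c in freq.values() if c == 1) / len(freq)
    # Sum of squared type frequencies equals sum over m of m^2 * V_m.
    yule_k = 1e4 * (sum(c * c for c in freq.values()) - n) / (n * n)
    return ttr, hapax_rate, yule_k

# Toy tokens for illustration only.
ttr, hapax_rate, yule_k = richness(["a", "a", "b", "c"])

# Corpus-level check using the counts shown at the top of this page.
corpus_ttr = round(1414 / 31773, 3)
```

The last line reproduces the 0.045 type/token ratio from the corpus statistics.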
Section-distinctive vocabulary
TF-IDF-style heatmap of words by section. High values mean the word is characteristic of that section relative to the rest of the corpus. Reveals the section-specific lexicon of each part of the manuscript.
Section distance matrix
Pairwise section distances computed six ways: embedding cosine, vocabulary Jaccard, KL divergence, Jensen-Shannon divergence, frequency correlation, and bigram overlap. Shows which sections are most similar and most different.
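One of the six distances, Jensen-Shannon divergence between two sections' word distributions, can be sketched as below. The function name and the single-token toy "sections" are hypothetical; base-2 logarithms are used so the result lies in [0, 1].

```python
import math
from collections import Counter

def jensen_shannon(tokens_a, tokens_b):
    """Jensen-Shannon divergence (bits) between two unigram distributions."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    jsd = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2  # mixture distribution
        if p:
            jsd += 0.5 * p * math.log2(p / m)
        if q:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

same = jensen_shannon(["daiin", "ol"], ["daiin", "ol"])  # identical vocabularies
disjoint = jensen_shannon(["daiin"], ["chedy"])          # no shared words
```

Unlike KL divergence, this measure is symmetric and stays finite when a word is missing from one section, which makes it convenient for small sections such as Zodiac.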
Cross-section word borrowing
Chi² exclusivity scores, TF-IDF distinctiveness, vocabulary overlap heatmap, and 'border words' — words that appear in exactly two sections. Quantifies how isolated each section's vocabulary is.
Word role profiling
GMM functional role clustering on distributional statistics (frequency, entropy, initial rate, asymmetry). BIC selects the number of clusters. Groups words into functional roles without supervision.
Function-word scatter
Scatter of initial fraction (how often a word starts a line) vs context entropy (how varied its neighbours are). High-initial + high-entropy words are function-word candidates: sol, sain, saiin, y.
Directional entropy scatter
Previous-context entropy vs next-context entropy per word. Words above the diagonal are more predictable from what follows than from what precedes, and vice versa — revealing asymmetric positional constraints.

Lines & Document Structure

Line structure analysis
First and last word distributions, line-length histograms, and section-by-section comparison. Shows that line-initial and line-final lexicons are almost completely disjoint (Jaccard ≈ 0).
Line-level entropy
Per-slot Shannon entropy, adjacent line Jaccard (2.79× random baseline), repetition rate, and first-word PMI. Quantifies how formulaic lines are relative to a randomly shuffled baseline.
Line embedding clusters
Mean-pooled line embeddings projected to 2D via UMAP, then clustered with KMeans K=7. Section recovery NMI = 0.222 — embeddings partially but not fully capture section identity at the line level.
Near-duplicate line detection
Detects 448 near-duplicate line pairs (336 cross-folio) and 40 exact copy 'formula' lines. Shows which formulaic sequences recur across the manuscript and where they appear.
Bigram LM perplexity
Kneser-Ney smoothed bigram language model perplexity per line. Identifies the most and least 'typical' lines in each section. The Balneological section has the highest per-line surprise.
Bigram surprisal map
Per-token surprisal −log₂ P(w_t | w_{t−1}) aggregated to line and folio level. Balneological has the highest mean surprisal (S = 6.14 bits) and Zodiac the lowest (S = 5.20 bits). Shows the 10-line rolling surprisal trajectory.
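Per-token surprisal can be sketched with a small bigram model. Note that this sketch substitutes add-k smoothing for Kneser-Ney to stay short, so its numbers would differ from the portal's; the function name, the `k` parameter, and the toy lines are hypothetical.

```python
import math
from collections import Counter

def bigram_surprisal(lines, k=1.0):
    """Return surprisal(prev, cur) in bits under an add-k smoothed bigram model."""
    history, bigram = Counter(), Counter()
    vocab = set()
    for line in lines:
        vocab.update(line)
        for prev, cur in zip(line, line[1:]):
            bigram[(prev, cur)] += 1
            history[prev] += 1  # times prev was followed by any word
    v = len(vocab)

    def surprisal(prev, cur):
        p = (bigram[(prev, cur)] + k) / (history[prev] + k * v)
        return -math.log2(p)

    return surprisal

# Toy lines: after "a", the word "b" is twice as common as "c".
s = bigram_surprisal([["a", "b"], ["a", "b"], ["a", "c"]])
common = s("a", "b")  # (2 + 1) / (3 + 3) = 0.5, i.e. 1 bit
rarer = s("a", "c")   # (1 + 1) / (3 + 3) = 1/3, i.e. log2(3) bits
```

Averaging these values over the tokens of a line or folio gives the aggregated surprisal the card refers to.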
Block entropy rate
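Block entropy H(n) for n = 1..5 words. The estimated entropy rate h(4) ≈ 0.009 bits is far below natural-language values. With only 31,773 tokens, block-entropy estimates at this length are biased towards zero by undersampling, so the figure should be read with caution rather than as proof that word sequences are nearly deterministic at the 4-word scale.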
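Block entropy and a difference-based entropy rate can be sketched as follows. The definition h(n) = H(n+1) − H(n) is an assumption, since this page does not say how h(4) is computed, and the function names and toy sequence are hypothetical. The example also shows the finite-sample effect: a sequence with no repeated tokens at all still yields an estimate near zero, because H(n) saturates near log₂ of the number of blocks.

```python
import math
from collections import Counter

def block_entropy(tokens, n):
    """Plug-in Shannon entropy (bits) of overlapping n-token blocks."""
    blocks = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(blocks.values())
    return -sum(c / total * math.log2(c / total) for c in blocks.values())

def entropy_rate(tokens, n):
    """Assumed definition: h(n) = H(n + 1) - H(n)."""
    return block_entropy(tokens, n + 1) - block_entropy(tokens, n)

# 1000 distinct tokens: every 4-block is unique, so H(4) = log2(997).
distinct = list(range(1000))
h4_block = block_entropy(distinct, 4)
h4_rate = entropy_rate(distinct, 4)
```

Comparing the estimate against the same statistic on a shuffled copy of the corpus is one way to separate genuine sequential structure from this saturation effect.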
Folio change-point detection
Sliding-window Maximum Mean Discrepancy (MMD) and CUSUM on folio embeddings. The top detected shift is at folio f11r (Stars→Herbal transition, MMD = 0.498). Identifies abrupt semantic breaks.
Line-position word analysis
G² slot affinity for words at each line position. Initial and final lexicons are completely disjoint (Jaccard = 0.000). -am words are strongly line-final; qo- words are strongly line-initial.

Sequence & Topic Models

HMM latent states
Unsupervised Baum-Welch Hidden Markov Model with K=6 latent states trained on word sequences. States loosely correspond to POS-like roles. Words coloured by Viterbi-decoded state on UMAP scatter.
NMF topic model (K=8)
Non-negative Matrix Factorisation on word×line TF-IDF matrix. Nine-panel dashboard: topic word loadings, per-line mixture heatmap, section alignment, and topic co-occurrence network.
Word sequence model
Successor entropy, high-PMI bigrams, boundary-zone constraints, and per-section bigram Jaccard. Maps which parts of word sequences are tightly constrained vs open.
Cluster purity analysis
ARI and NMI of KMeans clusters vs section, HMM state, prefix, and word-length labels. Tells you how much of the embedding geometry is explained by each categorical label.

Word Fingerprints

Word fingerprint: daiin
Multi-signal fingerprint card for 'daiin' (most frequent word, 744 occurrences). Shows embedding neighbourhood, context profile, positional bias, PMI collocations, and section distribution.
Word fingerprint: chedy
Multi-signal fingerprint card for 'chedy' (465 occurrences, Stars section). Same six-panel format as the daiin fingerprint.
Word fingerprint: ol
Multi-signal fingerprint card for 'ol' (457 occurrences, Balneological section). Same six-panel format.