Voynich MS · Analysis Portal

Word	Freq	Section	Initial%	Context entropy	Next entropy	Prev entropy
daiin	744	Herbal	17.5%	9.32	8.09	7.99
chedy	465	Stars	1.3%	8.59	7.22	7.18
ol	457	Balneological	6.3%	8.65	7.08	7.61
shedy	403	Balneological	1.5%	8.60	6.93	7.12
aiin	362	Stars	0.0%	9.03	7.56	6.75
chol	351	Herbal	4.8%	8.65	6.97	7.35
chey	308	Stars	1.3%	8.60	7.25	7.16
qokeedy	300	Balneological	10.7%	8.18	6.89	6.45
qokeey	283	Stars	10.2%	8.41	7.01	6.67
or	283	Herbal	9.9%	8.78	6.49	7.31

Word	Freq	Initial%	Context entropy	Section
sor	47	70.2%	7.20	Balneological
tol	38	57.9%	7.25	Herbal
sol	64	56.2%	7.30	Balneological
sain	61	50.8%	7.31	Stars
saiin	125	42.4%	8.19	Stars
sar	63	38.1%	7.57	Stars
qotchy	61	31.1%	7.30	Herbal
dor	65	30.8%	7.47	Herbal

Word 1	Word 2	NPMI	Count
orshy	poror	0.899	1
opod	sheetchy	0.899	1
oc	theody	0.899	1
cheefy	pykeor	0.899	1
checthdy	qotedl	0.899	1
checthdy	qokeeodor	0.899	1
dalydal	toror	0.899	1
sheotam	shet	0.899	1
okeodain	shet	0.899	1
chpchey	chpchol	0.899	1

Family core	Members	Top members
ched	27

Embeddings & Semantic Space

Interactive UMAP 2D projection of 64-dimensional word2vec embeddings. Seven coloring modes: section, frequency, prefix, suffix, HMM state, context entropy, and word length. Hover any point to see the word and its nearest neighbours.

Folio embeddings

Per-folio mean-pooled word2vec embeddings projected to 2D via UMAP. Shows whether pages that share content also cluster in semantic space, and reveals the manuscript's large-scale thematic structure folio by folio.

3D embedding scatter

UMAP 3D reduction of word2vec embeddings. Rotate freely; colour by section, frequency, prefix, or HMM state. Useful for spotting clusters that flatten out in 2D projections.

Embedding stability

Seed-to-seed reliability of the word2vec model: 15 independent training runs with different random seeds, with the Jaccard overlap of each word's K=10 nearest neighbours measured across runs. High-Jaccard words have stable neighbourhoods regardless of seed.
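
The neighbour-Jaccard measure described above can be sketched as follows. This is a minimal illustration rather than the portal's own code: the function names are hypothetical, cosine similarity is assumed as the neighbour metric, and the two embedding matrices are assumed to list the same words in the same row order.

```python
import numpy as np

def knn_sets(emb, k):
    """For each row, return the index set of its k nearest rows by cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # a word is not its own neighbour
    top = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar rows
    return [set(row.tolist()) for row in top]

def neighbour_jaccard(emb_a, emb_b, k=10):
    """Per-word Jaccard overlap of k-NN sets between two runs (rows aligned by word)."""
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return np.array([len(x & y) / len(x | y) for x, y in zip(sets_a, sets_b)])
```

With 15 runs, one plausible summary is to average `neighbour_jaccard` over all pairs of runs for each word.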

PPMI distributional vectors

Classical NLP baseline: Positive PMI co-occurrence matrix (window=5) factorised via truncated SVD (d=50). Compares PPMI vs word2vec via neighbour Jaccard and pairwise cosine correlation. Mean Jaccard ≈ 0.128: on average, the neighbours shared by both methods make up about one eighth of the union of the two neighbour sets.
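
A minimal sketch of the PPMI-plus-SVD baseline, assuming a square word-by-word co-occurrence count matrix has already been built with the chosen window. The function names are hypothetical, and scaling the left singular vectors by the singular values is one common convention, not necessarily the one used here.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI (base 2) from a square co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total            # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0            # zero counts carry no association
    return np.maximum(pmi, 0.0)             # keep only positive associations

def svd_embed(matrix, d):
    """Rank-d embedding from a truncated SVD (assumed convention: U * S)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :d] * s[:d]
```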

SIF line embeddings

Smooth Inverse Frequency weighted sentence vectors with principal-component removal (Arora et al. 2017). Compares SIF vs mean-pool for line-level similarity and section separation.
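
The two SIF steps (frequency-weighted averaging, then removal of the first principal component) can be sketched as below. This is an illustration under assumptions: the function names, the `vectors` and `word_prob` dictionaries, and the default `a=1e-3` are hypothetical choices, not values taken from the portal.

```python
import numpy as np

def weighted_line_vectors(lines, vectors, word_prob, a=1e-3):
    """Average word vectors per line with SIF weights a / (a + p(w))."""
    rows = []
    for tokens in lines:
        weights = np.array([a / (a + word_prob[t]) for t in tokens])
        mat = np.stack([vectors[t] for t in tokens])
        rows.append(weights @ mat / len(tokens))
    return np.stack(rows)

def remove_first_pc(X):
    """Subtract each row's projection onto the first right singular vector of X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0]
    return X - np.outer(X @ pc, pc)
```

Mean-pooling corresponds to skipping both steps, which is what makes the comparison in this view possible.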

Folio semantic trajectory

PCA 2D of per-folio mean embeddings ordered along the manuscript. Reveals whether semantic content drifts smoothly or jumps abruptly as you page through the manuscript.

Folio similarity matrix

Pairwise cosine similarity of SIF folio vectors, Ward-clustered. Same-section mean = 0.152, cross-section mean = −0.142: folios are on average more similar to folios of their own section than to folios of other sections, indicating section-level coherence.
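
The same-section and cross-section means can be computed from a folio-vector matrix roughly as follows. This is a sketch with a hypothetical function name; it assumes one vector and one section label per folio and excludes each folio's similarity with itself.

```python
import numpy as np

def section_coherence(vectors, sections):
    """Return (mean same-section cosine, mean cross-section cosine)."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    labels = np.asarray(sections)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # drop self-similarity
    return sim[same & off_diag].mean(), sim[~same].mean()
```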

Nearest-neighbour graph

K=4 nearest-neighbour graph for the top-200 words, computed in the full 64-dimensional embedding space rather than in the 2D projection. Edges connect words that are genuinely close in high-dimensional space.

Similarity matrix

Pairwise cosine similarity heatmap for the top-N most frequent words, hierarchically clustered. A quick way to see which words occur in similar contexts and may therefore be interchangeable.

Semantic field dendrogram

Ward hierarchical clustering of the top-200 words with cosine distance, shown as a dendrogram and flat K=12 clusters. Identifies coherent semantic fields within the vocabulary.

Morphology & Word Structure

Morphological word families

FastText character-ngram embeddings cluster words that share surface form. Treemap shows families (ke*, cho*, ched*, …) and their members. Suggests agglutinative-like prefix/suffix morphology.

Morpheme inventory

EVA morpheme candidates discovered via character n-gram log-odds segmentation. Shows which sub-word units recur with above-chance frequency and where they tend to appear within words.

Affix continuation entropy

Trie-based analysis of prefix and suffix continuation entropy (H). Low H means a sequence continues in only a few ways, and H=0 means it always continues the same way. rc/ts/lc have H=0, consistent with their being EVA digraphs rather than free morphemes.
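
A minimal sketch of prefix continuation entropy, using a dictionary of counters in place of an explicit trie. The function name is hypothetical, only word-initial prefixes are handled, and end-of-word is not counted as a continuation; the portal's analysis may differ on both points.

```python
from collections import Counter, defaultdict
from math import log2

def continuation_entropy(words, n):
    """Entropy in bits of the character that follows each n-character word-initial prefix."""
    following = defaultdict(Counter)
    for w in words:
        if len(w) > n:
            following[w[:n]][w[n]] += 1
    result = {}
    for prefix, counts in following.items():
        total = sum(counts.values())
        result[prefix] = sum(-(c / total) * log2(c / total) for c in counts.values())
    return result
```

The suffix version follows by applying the same function to reversed words.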

Morphological paradigms

Detects suffix/prefix substitution groups via embedding vector offsets (analogous to word2vec analogy tasks). Shows how consistent the offset is across the paradigm, and the analogy accuracy.

Prefix × suffix PMI matrix

2-character prefix × suffix heatmap weighted by Pointwise Mutual Information. The ai+in pair is the most frequent (PMI=2.09, freq=470), revealing which endings follow which beginnings non-randomly.
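
The PMI for a prefix and suffix pair compares how often the pair occurs with how often it would occur if prefixes and suffixes were chosen independently. A sketch follows, with a hypothetical function name. It assumes words shorter than twice the affix length are skipped so that prefix and suffix never overlap, and it uses log base 2, neither of which is confirmed by the source.

```python
from collections import Counter
from math import log2

def affix_pmi(words, n=2):
    """PMI (base 2) of each (n-char prefix, n-char suffix) pair over word tokens."""
    pairs = Counter((w[:n], w[-n:]) for w in words if len(w) >= 2 * n)
    total = sum(pairs.values())
    prefixes, suffixes = Counter(), Counter()
    for (pre, suf), c in pairs.items():
        prefixes[pre] += c
        suffixes[suf] += c
    # p(pre, suf) / (p(pre) * p(suf)) simplifies to c * total / (n_pre * n_suf)
    return {
        (pre, suf): log2(c * total / (prefixes[pre] * suffixes[suf]))
        for (pre, suf), c in pairs.items()
    }
```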

CV skeleton analysis

Consonant/Vowel skeleton of every word in the EVA transcription. Corpus V-fraction = 0.456; CCVCV is the most frequent skeleton template. Shows phonotactic shape constraints on word form.
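
As an illustration, a CV skeleton can be derived by mapping each EVA character to C or V. The vowel set below (a, e, i, o, y) is an assumption made for this sketch; the portal does not state which characters it treats as vowels, and the function names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeioy")  # assumed EVA vowel set, not confirmed by the source

def cv_skeleton(word):
    """Map each character to 'V' (assumed vowel) or 'C' (everything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)

def skeleton_counts(words):
    """Count how often each CV skeleton occurs over word tokens."""
    return Counter(cv_skeleton(w) for w in words)

def v_fraction(words):
    """Fraction of all characters in the corpus that are assumed vowels."""
    chars = "".join(words)
    return sum(ch in VOWELS for ch in chars) / len(chars)
```

Under this assumed vowel set, chedy and shedy both map to CCVCV.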

Character n-gram analysis

EVA character bigram transition matrix plus log-odds surprise heatmap. Reveals which character pairs are over- or under-represented relative to a unigram baseline — the building blocks of Voynich phonotactics.

EVA phonotactics

Positional character bias, trigram log-odds, bigram transitions, and CV-shape templates. Shows which characters are strongly position-biased (line-initial, word-initial, word-final) and which are free.

Character position entropy

Per-slot Shannon entropy of EVA characters across all words. Early positions are highly constrained (low H); late positions are more varied. Quantifies where within a word structure is most predictable.
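
Per-slot entropy can be sketched by counting which characters occupy each position and taking the Shannon entropy of each count table. The function names and the `max_len` cut-off are hypothetical, and positions are counted from the start of the word.

```python
from collections import Counter
from math import log2

def shannon_entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def position_entropy(words, max_len=8):
    """Entropy of the character distribution at each word position (0-based)."""
    slots = [Counter() for _ in range(max_len)]
    for w in words:
        for i, ch in enumerate(w[:max_len]):
            slots[i][ch] += 1
    return [shannon_entropy(s) for s in slots if s]
```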

Word-length profile

Word length distribution by section, line position, and folio. Includes character-type composition (vowel vs consonant fraction) and the folio trend in mean word length across the manuscript.

Bigrams, Co-occurrence & Context

Bigram heatmap

Frequency heatmap of top-prefix × top-suffix word pairs. Shows which word combinations appear more or less often than chance, making systematic formulaic phrases visible at a glance.

Bigram network

Force-directed graph where nodes are words and edges are frequent adjacent pairs. Edge weight encodes bigram count; community structure reveals clusters of words that chain together.

NPMI heatmap

Normalized Pointwise Mutual Information for the top-40 most frequent words. NPMI ∈ [−1, 1]; strong positive values identify reliable collocations that co-occur more than chance predicts.
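
NPMI divides PMI by −log p(a, b), which is what bounds it to [−1, 1]. The sketch below computes it for adjacent word pairs within lines. The function name is hypothetical, and taking the marginals from the bigram table itself (rather than from unigram counts) is an assumption that keeps the bound exact on small samples.

```python
from collections import Counter
from math import log

def bigram_npmi(lines):
    """NPMI for each adjacent word pair; lines is a list of token lists."""
    pairs = Counter()
    for tokens in lines:
        pairs.update(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c    # how often a occurs as the first word of a pair
        right[b] += c   # how often b occurs as the second word of a pair
    scores = {}
    for (a, b), c in pairs.items():
        if c == total:
            continue    # p(a, b) = 1 gives 0/0, so NPMI is undefined
        pmi = log(c * total / (left[a] * right[b]))
        scores[(a, b)] = pmi / -log(c / total)
    return scores
```

A pair whose two words only ever occur together scores 1, and a pair that co-occurs exactly as often as independence predicts scores 0.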

Positional bigrams

G² log-likelihood enrichment of word bigrams split by line zone (beginning / middle / end). Shows which word pairs are specific to particular positions in a line rather than occurring uniformly.

Context profile heatmap

Left and right conditional context probabilities for the top-40 words: P(context | word) shown as a heatmap. Reveals which words have stereotyped neighbours and which appear in diverse surroundings.

Left-right context asymmetry

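Asymmetry score A = H_left − H_right per word, where H_left is the entropy of the preceding word and H_right the entropy of the following word. A >> 0 means many possible predecessors but few successors ('closer'); A << 0 means the opposite ('opener'). dshedy/pol have H_left = 0: each is preceded by only one distinct word in the corpus.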
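
The asymmetry score can be sketched directly from its definition. The function names are hypothetical. The sketch assumes contexts do not cross line boundaries and scores only words that have at least one observed predecessor and one observed successor.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def context_asymmetry(lines):
    """A = H_left - H_right per word; lines is a list of token lists."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in lines:
        for prev, cur in zip(tokens, tokens[1:]):
            left[cur][prev] += 1    # predecessors of cur
            right[prev][cur] += 1   # successors of prev
    both = left.keys() & right.keys()
    return {w: entropy(left[w]) - entropy(right[w]) for w in both}
```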

Co-occurrence network

Word co-occurrence graph with window=3 and greedy community detection. Louvain communities map loosely onto manuscript sections, suggesting section-specific vocabulary clusters.

Word network centrality

Co-occurrence graph (254 nodes, 4471 edges) with three Louvain communities. daiin has the highest PageRank. Betweenness and degree centrality reveal structural hubs vs peripheral words.

Word transition network

Directed graph of transition probabilities P(B|A) for the top-120 words, laid out with a spring algorithm. Thick edges are high-probability transitions; shows the most predictable word sequences.
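
The edge weights P(B|A) are maximum-likelihood estimates: the count of A followed by B divided by the count of A followed by anything. A sketch follows, with a hypothetical function name and the assumption that transitions are counted within lines only.

```python
from collections import Counter, defaultdict

def transition_probs(lines):
    """Estimate P(B|A) from adjacent word pairs; lines is a list of token lists."""
    follow = defaultdict(Counter)
    for tokens in lines:
        for a, b in zip(tokens, tokens[1:]):
            follow[a][b] += 1
    probs = {}
    for a, nxt in follow.items():
        total = sum(nxt.values())
        probs[a] = {b: c / total for b, c in nxt.items()}
    return probs
```

Restricting the result to the 120 most frequent words before drawing the graph would match the description above.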

Lexical Statistics & Frequency

Hapax & Zipf analysis

Zipf's and Heaps' law fits for the Voynich vocabulary. Shows the hapax legomena ratio by section and breaks the vocabulary into frequency tiers (hapax, rare, medium, frequent, very frequent).
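
Both laws are power laws, so a simple way to fit them is ordinary least squares in log-log space. The sketch below takes that approach; the function names are hypothetical, and the portal may use a different estimator (maximum-likelihood fits are a common alternative).

```python
from collections import Counter
import numpy as np

def zipf_exponent(tokens):
    """Fit frequency ~ rank^(-s) by least squares in log-log space; return s."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

def heaps_fit(tokens):
    """Fit vocabulary size V(n) = K * n^beta over the running text; return (K, beta)."""
    seen, vocab = set(), []
    for t in tokens:
        seen.add(t)
        vocab.append(len(seen))
    n = np.arange(1, len(tokens) + 1)
    beta, log_k = np.polyfit(np.log(n), np.log(vocab), 1)
    return float(np.exp(log_k)), float(beta)

def hapax_ratio(tokens):
    """Share of word types that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)
```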