Universal semantic representation experiments over public text evidence.
Theory and experimentation documentation
Universal Semantic Representations
IOTA-1 studies whether mutable language expressions can converge on stable, reviewable concept evidence. The goal is a measured representation stack that holds up across translation, paraphrase, locale, and script changes, and across model drift, well enough for human and machine review.
A conversion is stronger when it preserves task-relevant relations among source expression, normalized text, concept identity, candidate rendering, validator behavior, provenance, and drift report under a declared policy.
Mutable languages
Difference is evidence
Translation and paraphrase can change idiom, register, politeness, tense, culture, and domain assumptions. IOTA-1 treats those shifts as measurable evidence and does not assume that the expressions are identical.
Language-agnostic embeddings
Shared space, tagged source
Vectors are useful only with model ID, dimensions, metric, source language, locale, normalization, and version. A shared vector space can support ranking and comparison, but it is not the system of record.
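A minimal sketch of what a tagged vector record could look like. The class name `TaggedVector`, its field names, and the `comparable` helper are illustrative assumptions, not part of IOTA-1; the fields simply mirror the tags listed above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaggedVector:
    """An embedding plus the tags needed to interpret it (hypothetical layout)."""
    values: Tuple[float, ...]
    model_id: str
    version: str
    dimensions: int
    metric: str            # for example "cosine"
    source_language: str   # for example "es"
    locale: str            # for example "es-MX"
    normalization: str     # for example "NFC"

    def __post_init__(self) -> None:
        # Refuse records whose declared size disagrees with the data.
        if len(self.values) != self.dimensions:
            raise ValueError("declared dimensions do not match vector length")

    def space(self) -> Tuple[str, str, int, str]:
        # The tags that decide whether two vectors share a space.
        return (self.model_id, self.version, self.dimensions, self.metric)


def comparable(a: TaggedVector, b: TaggedVector) -> bool:
    """True when both vectors declare the same model, version, size, and metric."""
    return a.space() == b.space()
```

Source language and locale are deliberately left out of `space()`: two vectors from different languages can still be ranked against each other, while two vectors from different models cannot.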
ISO/IEC 10646
Public substrate
Assigned characters, normalization, grapheme behavior, script metadata, and public names make text transport inspectable. IOTA-1 semantics remain registry-backed and evidence-backed.
Representation stack
From expression to reviewed concept
The current experiment separates layers so that each claim can be inspected: source expression, Unicode normalization, language and locale tags, segmentation, raw embedding, optional neutralization, ranked concept candidates, selected concept, canonical concept vector, public rendering candidates, safety warnings, and round-trip drift.
This keeps mutable language facts visible. A Spanish greeting, a Japanese greeting, and an Arabic greeting can resolve to the same broad concept only when the declared profile accepts that equivalence and records what was lost.
1. Sense alignment
Start with reviewed meaning
Translations enter the centroid pool only when the sense, domain, example, and negative example are attached to a reviewed concept record.
2. Same space
Average only comparable vectors
Embeddings must be produced by the same model profile, or by an explicitly aligned profile. Mixing incompatible vector spaces creates false precision.
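One way to enforce the same-space rule is to make the averaging step itself refuse mixed input. The sketch below assumes each vector arrives paired with a profile identifier string; the function name `centroid` and the `(profile_id, values)` pairing are hypothetical, and explicit alignment between profiles is not covered.

```python
from typing import List, Sequence, Tuple


def centroid(vectors: Sequence[Tuple[str, Sequence[float]]]) -> List[float]:
    """Average (profile_id, values) pairs, refusing mixed profiles or sizes."""
    if not vectors:
        raise ValueError("no vectors to average")

    profiles = {profile_id for profile_id, _ in vectors}
    if len(profiles) != 1:
        raise ValueError(f"mixed model profiles: {sorted(profiles)}")

    sizes = {len(values) for _, values in vectors}
    if len(sizes) != 1:
        raise ValueError(f"mixed dimensions: {sorted(sizes)}")

    count = len(vectors)
    size = sizes.pop()
    # Component-wise mean over all vectors.
    return [sum(values[i] for _, values in vectors) / count for i in range(size)]
```

Raising an error, instead of silently skipping the odd vector out, keeps the failure visible to whoever assembled the pool.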
3. Spread report
Store the failures
Centroid spread, language outliers, near misses, source family coverage, and model version are part of the result. A centroid without its failure modes is not reviewable evidence.
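A spread report of this kind could be assembled as follows. This is a sketch under stated assumptions: cosine distance is used as the spread measure, one vector per language is supplied, and the report keys, the `spread_report` name, and the outlier threshold are all illustrative choices.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def spread_report(vectors_by_language: Dict[str, Sequence[float]],
                  model_version: str,
                  outlier_threshold: float) -> dict:
    """Return the centroid together with the evidence needed to review it."""
    languages: List[str] = sorted(vectors_by_language)
    size = len(vectors_by_language[languages[0]])
    center = [mean(vectors_by_language[lang][i] for lang in languages)
              for i in range(size)]
    distances = {lang: cosine_distance(vectors_by_language[lang], center)
                 for lang in languages}
    return {
        "model_version": model_version,
        "languages": languages,
        "centroid": center,
        "distances": distances,
        "mean_distance": mean(list(distances.values())),
        "max_distance": max(distances.values()),
        # Languages whose vector sits farther from the centroid than allowed.
        "outliers": [lang for lang in languages
                     if distances[lang] > outlier_threshold],
    }
```

Near misses and source family coverage need concept and family metadata that this sketch does not model, so they are left out.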
4. Ranking role
Prototype, not authority
The centroid may help rank candidates, detect drift, and compare expressions. The accepted meaning still comes from concept identity, provenance, validators, and policy.
Unicode and ISO/IEC 10646 policy
Public characters only
The public IOTA-1 path uses assigned ISO/IEC 10646 and Unicode characters, public standard sequences, NFC normalization, grapheme-aware handling, and explicit script and direction metadata. It rejects private-use meaning channels, hidden controls, unsafe zero-width patterns, and secret dictionaries as public semantic authority.
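Part of this policy can be checked with Python's standard `unicodedata` module. The sketch below is an approximation, not the IOTA-1 validator: the function name `check_public_text`, the zero-width set, and the finding strings are assumptions. It flags every zero-width character for review instead of deciding which patterns are unsafe, and it does not cover grapheme segmentation, which the standard library does not provide.

```python
import unicodedata
from typing import List

# Characters that render as nothing; flagged for review, not auto-rejected,
# because joiners are legitimate in some scripts and emoji sequences.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def check_public_text(text: str) -> List[str]:
    """Return a list of policy findings; an empty list means no finding."""
    findings = []
    if unicodedata.normalize("NFC", text) != text:
        findings.append("not NFC-normalized")

    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        label = f"U+{ord(ch):04X} at index {index}"
        if category == "Co":
            findings.append(f"{label}: private-use character")
        elif category == "Cn":
            findings.append(f"{label}: unassigned code point")
        elif ch in ZERO_WIDTH:
            findings.append(f"{label}: zero-width character, needs review")
        elif category in ("Cc", "Cf") and ch not in ALLOWED_CONTROLS:
            # Includes bidirectional overrides and other invisible controls.
            findings.append(f"{label}: hidden control or format character")
    return findings
```

Which code points count as assigned depends on the Unicode version bundled with the Python interpreter, so a report should record that version alongside its findings.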
This protects the distinction between expression and concept. A glyph candidate can be beautiful, compact, or familiar, but it carries meaning only when the evidence packet ties it to a reviewed concept.
Retrieval
Top-K and rank
Track selected concept hit rate, top-K agreement, mean reciprocal rank, and false-neighbor rates across language families and scripts.
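The sketch below computes three of these figures per language family. It assumes each test case is a `(family, ranked_concept_ids, expected_concept_id)` tuple, which is a hypothetical layout, and it reads top-K agreement as "the expected concept appears in the first K candidates", which is only one possible interpretation. False-neighbor rates are omitted because they depend on how a false neighbor is declared.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Case = Tuple[str, Sequence[str], str]


def reciprocal_rank(ranked: Sequence[str], expected: str) -> float:
    """1/position of the expected concept, or 0.0 when it is absent."""
    for position, concept_id in enumerate(ranked, start=1):
        if concept_id == expected:
            return 1.0 / position
    return 0.0


def retrieval_metrics(cases: Sequence[Case], k: int) -> Dict[str, Dict[str, float]]:
    """Group cases by family and report hit rate, top-K hit rate, and MRR."""
    by_family: Dict[str, List[Case]] = defaultdict(list)
    for case in cases:
        by_family[case[0]].append(case)

    report = {}
    for family, group in by_family.items():
        total = len(group)
        top1 = sum(1 for _, ranked, exp in group if ranked and ranked[0] == exp)
        topk = sum(1 for _, ranked, exp in group if exp in ranked[:k])
        mrr = sum(reciprocal_rank(ranked, exp) for _, ranked, exp in group) / total
        report[family] = {
            "hit_rate": top1 / total,
            "top_k_hit_rate": topk / total,
            "mean_reciprocal_rank": mrr,
        }
    return report
```

Reporting per family instead of pooling all cases keeps a weak family from hiding behind a strong one.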
Drift
Retained, lost, added
Round-trip reports should separate retained concepts from lost details, added assumptions, ambiguity, and unknown segments.
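If both sides of a round trip are reduced to sets of concept identifiers, the separation can be sketched with plain set operations. The function name `drift_report`, its parameters, and the report keys are assumptions for illustration; how ambiguity and unknown segments are detected is outside this sketch, so they are passed in by the caller.

```python
from typing import Dict, Iterable, List, Set


def drift_report(source_concepts: Set[str],
                 round_trip_concepts: Set[str],
                 ambiguous: Iterable[str] = (),
                 unknown_segments: Iterable[str] = ()) -> Dict[str, List[str]]:
    """Separate what survived a round trip from what was lost or added."""
    return {
        "retained": sorted(source_concepts & round_trip_concepts),
        "lost": sorted(source_concepts - round_trip_concepts),
        "added": sorted(round_trip_concepts - source_concepts),
        "ambiguous": sorted(ambiguous),
        "unknown_segments": list(unknown_segments),
    }
```

For example, a source tagged `{"greeting", "formal-register"}` that comes back as `{"greeting", "morning"}` retains `greeting`, loses `formal-register`, and adds `morning`.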
Centroid quality
Spread and outliers
Measure centroid spread, translation family coverage, model deltas, and failure cases instead of relying on a single aggregate score.
Review
Human outcomes
Capture each reviewer decision: accept, revise, abstain, or reject. Better universal representation work makes abstention clearer, not just confidence higher.
The language converter should continue moving toward evidence-rich responses: explicit source tags, ranked concepts, canonical vector hashes, centroid provenance, public rendering candidates, Unicode safety checks, and drift reports. The public claim remains practical convergence under policy, not exact semantic sameness.