Theory and experimentation documentation

Universal Semantic Representations

IOTA-1 studies whether mutable language expressions can converge on stable, reviewable concept evidence. The goal is a measured representation stack that survives translation, paraphrase, locale, script, and model drift well enough to support human and machine review.

Semantic isomorphism

Bounded preservation

A conversion is stronger when it preserves task-relevant relations among source expression, normalized text, concept identity, candidate rendering, validator behavior, provenance, and drift report under a declared policy.

Mutable languages

Difference is evidence

Translation and paraphrase can change idiom, register, politeness, tense, culture, and domain assumptions. IOTA-1 records those shifts as measurable evidence instead of assuming that all expressions are identical.

Language-agnostic embeddings

Shared space, tagged source

Vectors are useful only with model ID, dimensions, metric, source language, locale, normalization, and version. A shared vector space can support ranking and comparison, but it is not the system of record.
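
As a rough illustration of that tagging rule, the sketch below attaches a profile to every vector and refuses to compare vectors whose profiles differ. The names `VectorProfile`, `TaggedVector`, and `cosine_similarity` are hypothetical and are not part of any published IOTA-1 interface; the field list simply mirrors the tags named above.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class VectorProfile:
    """Tags that decide whether two vectors share a space (hypothetical shape)."""
    model_id: str
    dimensions: int
    metric: str
    normalization: str
    version: str


@dataclass(frozen=True)
class TaggedVector:
    """A raw embedding plus the source tags it was produced from."""
    values: tuple
    profile: VectorProfile
    source_language: str
    locale: str


def cosine_similarity(a: TaggedVector, b: TaggedVector) -> float:
    """Compare two vectors only if they come from the same declared profile."""
    if a.profile != b.profile:
        raise ValueError("profiles differ; vectors are not comparable")
    if a.profile.metric != "cosine":
        raise ValueError("profile declares a metric other than cosine")
    for vector in (a, b):
        if len(vector.values) != vector.profile.dimensions:
            raise ValueError("vector length does not match declared dimensions")
    dot = sum(x * y for x, y in zip(a.values, b.values))
    norm_a = math.sqrt(sum(x * x for x in a.values))
    norm_b = math.sqrt(sum(y * y for y in b.values))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction to compare")
    return dot / (norm_a * norm_b)
```

Source language and locale are stored on the vector instead of the profile, so a Spanish and a Japanese expression embedded by the same model profile remain comparable while still carrying their own tags.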

ISO/IEC 10646

Public substrate

Assigned characters, normalization, grapheme behavior, script metadata, and public names make text transport inspectable. They do not define meaning: IOTA-1 semantics remain registry-backed and evidence-backed.

Representation stack

From expression to reviewed concept

The current experiment separates layers so that each claim can be inspected: source expression, Unicode normalization, language and locale tags, segmentation, raw embedding, optional neutralization, ranked concept candidates, selected concept, canonical concept vector, public rendering candidates, safety warnings, and round-trip drift.

This keeps mutable language facts visible. A Spanish greeting, a Japanese greeting, and an Arabic greeting can resolve to the same broad concept only when the declared profile accepts that equivalence and records what was lost.

1. Sense alignment

Start with reviewed meaning

Translations enter the centroid pool only when the sense, domain, example, and negative example are attached to a reviewed concept record.

2. Same space

Average only comparable vectors

Embeddings must be produced by the same model profile, or by an explicitly aligned profile. Mixing incompatible vector spaces creates false precision.

3. Spread report

Store the failures

Centroid spread, language outliers, near misses, source family coverage, and model version are part of the result. A centroid without its failure modes is not reviewable evidence.

4. Ranking role

Prototype, not authority

The centroid may help rank candidates, detect drift, and compare expressions. The accepted meaning still comes from concept identity, provenance, validators, and policy.
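
The four steps above can be sketched as one function that averages only same-profile vectors and returns the spread alongside the centroid. This is a minimal sketch under stated assumptions: `centroid_report`, the tuple layout, Euclidean distance as the spread measure, and the `outlier_factor` threshold are all illustrative choices, not values taken from IOTA-1.

```python
import math


def centroid_report(entries, outlier_factor=1.5):
    """Average comparable vectors and keep the failure evidence.

    entries: list of (language, profile_id, vector) tuples whose sense has
             already been attached to a reviewed concept record.
    outlier_factor: hypothetical multiplier of mean distance used to flag outliers.
    """
    if not entries:
        raise ValueError("no vectors to average")
    profiles = {profile_id for _, profile_id, _ in entries}
    if len(profiles) != 1:
        raise ValueError(f"mixed profiles {sorted(profiles)}; refusing to average")
    if len({len(vector) for _, _, vector in entries}) != 1:
        raise ValueError("vectors have different lengths")

    count = len(entries)
    # Component-wise mean across all vectors.
    centroid = [sum(column) / count for column in zip(*(v for _, _, v in entries))]
    # Distance of each language's vector from the centroid.
    distances = [(language, math.dist(vector, centroid)) for language, _, vector in entries]
    spread = sum(distance for _, distance in distances) / count
    outliers = sorted(lang for lang, dist in distances if dist > outlier_factor * spread)
    return {
        "profile": profiles.pop(),
        "languageCount": len({language for language, _, _ in entries}),
        "centroid": centroid,
        "spread": spread,
        "outliers": outliers,
        "role": "candidate ranking evidence",
    }
```

The report deliberately carries a `role` field so that downstream code treats the centroid as ranking evidence; the accepted meaning would still have to come from the concept record and policy.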

Unicode and ISO/IEC 10646 policy

Public characters only

The public IOTA-1 path uses assigned ISO/IEC 10646 and Unicode characters, public standard sequences, NFC normalization, grapheme-aware handling, and explicit script and direction metadata. It rejects private-use meaning channels, hidden controls, unsafe zero-width patterns, and secret dictionaries as public semantic authority.

This protects the distinction between expression and concept. A glyph candidate can be beautiful, compact, or familiar, but it is not meaning by itself unless the evidence packet ties it to a reviewed concept.
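
A minimal sketch of such a policy check, using only Python's standard `unicodedata` module, might look like the following. The function name `check_public_text` and the exact rule set are assumptions; in particular, this sketch conservatively flags every zero-width and format character, whereas a real policy would need to allow the joiners that some scripts and standard emoji sequences legitimately require.

```python
import unicodedata

# Zero-width characters flagged by this sketch (an assumed, conservative list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Control characters tolerated by this sketch (assumption, not IOTA-1 policy).
ALLOWED_CONTROLS = {"\n", "\t"}


def check_public_text(text):
    """Return (NFC-normalized text, list of policy issues found)."""
    issues = []
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text:
        issues.append("input was not NFC-normalized")
    for index, ch in enumerate(normalized):
        category = unicodedata.category(ch)
        label = f"U+{ord(ch):04X} at index {index}"
        if category == "Co":
            issues.append(f"private-use character {label}")
        elif category == "Cn":
            issues.append(f"unassigned code point {label}")
        elif category == "Cs":
            issues.append(f"lone surrogate {label}")
        elif ch in ZERO_WIDTH:
            issues.append(f"zero-width character {label}")
        elif category in ("Cc", "Cf") and ch not in ALLOWED_CONTROLS:
            issues.append(f"hidden control character {label}")
    return normalized, issues
```

Whether a code point counts as assigned depends on the Unicode version bundled with the Python interpreter, so the version in use would belong in the evidence packet alongside the result.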

Retrieval

Top-K and rank

Track selected concept hit rate, top-K agreement, mean reciprocal rank, and false-neighbor rates across language families and scripts.
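
One way to compute these figures per language family or script is sketched below. The function names and the case layout are hypothetical. "Top-K agreement" is interpreted here as the reviewed concept appearing among the first `k` candidates, which is an assumption; false-neighbor rate is left out because it needs a reviewed list of known wrong neighbors.

```python
from collections import defaultdict


def reciprocal_rank(ranked_ids, expected_id):
    """1/position of the expected concept, or 0.0 if it was not ranked."""
    for position, concept_id in enumerate(ranked_ids, start=1):
        if concept_id == expected_id:
            return 1.0 / position
    return 0.0


def retrieval_metrics(cases, k=3):
    """Summarize ranking quality per group.

    cases: list of (group, ranked_ids, expected_id), where group is a
           language family or script label chosen by the evaluator.
    """
    grouped = defaultdict(list)
    for group, ranked_ids, expected_id in cases:
        grouped[group].append((list(ranked_ids), expected_id))

    report = {}
    for group, items in grouped.items():
        total = len(items)
        report[group] = {
            "cases": total,
            # Selected concept hit rate: expected concept ranked first.
            "selected_hit_rate": sum(e in r[:1] for r, e in items) / total,
            # Expected concept appears anywhere in the first k candidates.
            "top_k_hit_rate": sum(e in r[:k] for r, e in items) / total,
            "mean_reciprocal_rank": sum(reciprocal_rank(r, e) for r, e in items) / total,
        }
    return report
```

Reporting per group instead of one pooled number keeps a weak language family from being hidden behind a strong one.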

Drift

Retained, lost, added

Round-trip reports should separate retained concepts from lost details, added assumptions, ambiguity, and unknown segments.
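
If the concepts detected before and after a round trip are available as labels, that separation reduces to set arithmetic. The sketch below assumes exactly that; `drift_report` and its parameter names are illustrative, and how ambiguity and unknown segments are detected is left to the caller.

```python
def drift_report(source_concepts, round_trip_concepts, ambiguous=(), unknown_segments=()):
    """Split a round trip into retained, lost, and added concept labels."""
    source = set(source_concepts)
    returned = set(round_trip_concepts)
    return {
        "retained": sorted(source & returned),   # present before and after
        "lost": sorted(source - returned),       # dropped by the round trip
        "added": sorted(returned - source),      # assumptions introduced on the way
        "ambiguous": sorted(ambiguous),
        "unknownSegments": list(unknown_segments),
    }
```

For example, `drift_report(["greeting", "informal register"], ["greeting", "time of day"])` keeps `greeting` as retained, reports `informal register` as lost, and reports `time of day` as added.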

Centroid quality

Spread and outliers

Measure centroid spread, translation family coverage, model deltas, and failure cases instead of relying on a single aggregate score.

Review

Human outcomes

Capture reviewer accept, revise, abstain, and reject decisions. Better universal representation work makes abstention clearer, not just confidence higher.
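
A small tally along these lines keeps abstention as a first-class outcome instead of folding it into rejection. The `Decision` enum and `summarize_reviews` function are hypothetical names used only for this sketch.

```python
from collections import Counter
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    ABSTAIN = "abstain"
    REJECT = "reject"


def summarize_reviews(decisions):
    """Count each reviewer outcome and its share of all decisions.

    decisions: iterable of strings; an unknown value raises ValueError.
    """
    counts = Counter(Decision(value) for value in decisions)
    total = sum(counts.values())
    return {
        decision.value: {
            "count": counts[decision],
            "rate": counts[decision] / total if total else 0.0,
        }
        for decision in Decision
    }
```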

Evidence packet sketch

What a semantic result must carry

{
  "profile": "iota-semantic-interlingua-v1",
  "source": {
    "text": "Hola",
    "language": "es",
    "locale": "es-ES",
    "normalization": "NFC"
  },
  "selectedConcept": {
    "id": "C0001842",
    "label": "broad greeting",
    "canonicalVectorHash": "sha256:reviewed-concept-vector"
  },
  "centroidEvidence": {
    "model": "declared-embedding-profile",
    "languageCount": 6,
    "spread": "measured",
    "role": "candidate ranking evidence"
  },
  "publicRendering": {
    "glyphCandidates": ["hello", "greeting"],
    "iso10646Policy": "assigned-public-characters-only"
  },
  "drift": {
    "retained": ["greeting"],
    "lost": ["locale-specific register"],
    "ambiguous": []
  }
}
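
A consumer could check that a packet carries every field shown in the sketch before trusting it. The checker below is illustrative only: it assumes all fields in the sketch are required, which the sketch itself does not state, and `missing_fields` is a hypothetical helper, not an IOTA-1 validator.

```python
# Fields taken from the packet sketch; treating all of them as required is an assumption.
REQUIRED_FIELDS = {
    "source": ("text", "language", "locale", "normalization"),
    "selectedConcept": ("id", "label", "canonicalVectorHash"),
    "centroidEvidence": ("model", "languageCount", "spread", "role"),
    "publicRendering": ("glyphCandidates", "iso10646Policy"),
    "drift": ("retained", "lost", "ambiguous"),
}


def missing_fields(packet):
    """Return dotted paths of every expected field the packet lacks."""
    missing = []
    if "profile" not in packet:
        missing.append("profile")
    for section, fields in REQUIRED_FIELDS.items():
        body = packet.get(section)
        if not isinstance(body, dict):
            # The whole section is absent or has the wrong shape.
            missing.append(section)
            continue
        missing.extend(f"{section}.{field}" for field in fields if field not in body)
    return missing
```

An empty list means only that the packet is structurally complete; it says nothing about whether the concept selection itself was reviewed.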

Implementation path

How this informs the converter

The language converter should continue moving toward evidence-rich responses: explicit source tags, ranked concepts, canonical vector hashes, centroid provenance, public rendering candidates, Unicode safety checks, and drift reports. The public claim remains practical convergence under policy, not exact semantic sameness.