Pipeline
Layered conversion
Source text is normalized to NFC, annotated with locale/script/direction metadata, segmented phrase-first, embedded, neutralized, resolved against the concept registry, and then rendered as public Unicode candidates.
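The first two stages can be sketched with the Python standard library. This is a minimal illustration, not the project's implementation: the `AnnotatedText` type and the `annotate` function are hypothetical names, and guessing direction from the first strong bidi character is an assumed heuristic. Script detection, segmentation, and the later stages are left out.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class AnnotatedText:
    text: str       # NFC-normalized source text
    locale: str     # caller-supplied locale tag, e.g. "fr-FR"
    direction: str  # "ltr" or "rtl" (assumed heuristic, see annotate)


def annotate(source: str, locale: str) -> AnnotatedText:
    """Normalize to NFC and attach locale/direction metadata (hypothetical helper)."""
    text = unicodedata.normalize("NFC", source)
    direction = "ltr"  # default when no strong character is found
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # strong right-to-left classes
            direction = "rtl"
            break
        if bidi == "L":          # strong left-to-right class
            break
    return AnnotatedText(text=text, locale=locale, direction=direction)
```

For example, `annotate("e\u0301", "fr-FR").text` is the single precomposed character U+00E9, so later stages see one spelling of the same text.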
Language-agnostic semantic experiments with public-symbol rendering.
IOTA-1 semantic interlingua
IOTA-1 studies whether mutable language expressions can converge on reviewable concept evidence while staying grounded in public ISO/IEC 10646 rendering. The target is bounded semantic isomorphism: measured preservation of task-relevant relations, not exact universal translation.
Converter
The language converter exposes source language, locale, top-K ranking, evidence, vector preview, quantized payload opt-in, public Unicode enforcement, and the selected concept authority in one place.
Concept registry
Concepts use stable IDs such as C0001842. Human labels, aliases, and glyphs are evidence and display material, not the system of record.
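One way to picture this separation is a registry keyed only by concept ID, with labels, aliases, and glyphs held as mutable attachments. The sketch below is an assumption about shape, not the actual registry: `Concept`, `ConceptRegistry`, and their methods are hypothetical, and the label, aliases, and glyph attached to C0001842 are illustrative values.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str                             # stable key, the system of record
    labels: dict = field(default_factory=dict)  # language -> human label (display only)
    aliases: set = field(default_factory=set)   # lookup evidence, not identity
    glyphs: list = field(default_factory=list)  # rendering candidates


class ConceptRegistry:
    def __init__(self):
        self._by_id = {}
        self._alias_index = {}

    def add(self, concept):
        if concept.concept_id in self._by_id:
            raise ValueError(f"duplicate concept ID: {concept.concept_id}")
        self._by_id[concept.concept_id] = concept
        for alias in concept.aliases:
            key = alias.casefold()
            self._alias_index.setdefault(key, set()).add(concept.concept_id)

    def get(self, concept_id):
        return self._by_id[concept_id]

    def candidates_for(self, alias):
        """An alias yields candidate IDs; it never acts as the key itself."""
        return sorted(self._alias_index.get(alias.casefold(), set()))


registry = ConceptRegistry()
registry.add(Concept("C0001842", labels={"en": "greeting"},
                     aliases={"hello", "hola"}, glyphs=["👋"]))
# Relabeling changes display material only; the ID stays the same.
registry.get("C0001842").labels["en"] = "salutation"
```

Because lookups by alias return candidates rather than a single answer, a label can be corrected or an alias can become ambiguous without touching anything that references the ID.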
Vectors
Raw expression vector hashes may differ for Hello, Hola, Bonjour, こんにちは, 你好, and مرحبا. After concept resolution they may share the same selected concept ID and canonical vector hash when the profile accepts that broad greeting equivalence.
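A toy sketch of the two hash levels follows. Everything here is assumed for illustration: the three-dimensional vectors are made up, the `RESOLVED` table stands in for real concept resolution, and hashing little-endian float32 bytes with SHA-256 is one possible scheme, not necessarily the one IOTA-1 uses.

```python
import hashlib
import struct


def vector_hash(vec):
    """Hash a vector's float32 bytes (assumed scheme, truncated for display)."""
    payload = struct.pack(f"<{len(vec)}f", *vec)
    return hashlib.sha256(payload).hexdigest()[:16]


# Made-up raw embeddings: close to each other, but not identical.
RAW = {
    "Hello": [0.91, 0.10, 0.02],
    "Hola": [0.88, 0.14, 0.01],
    "你好": [0.85, 0.09, 0.11],
}
# Stand-in for concept resolution under a profile that accepts
# broad greeting equivalence.
RESOLVED = {"Hello": "C0001842", "Hola": "C0001842", "你好": "C0001842"}
# One canonical vector per concept ID (illustrative values).
CANONICAL = {"C0001842": [1.0, 0.0, 0.0]}


def describe(expression):
    concept_id = RESOLVED[expression]
    return {
        "raw_vector_hash": vector_hash(RAW[expression]),
        "concept_id": concept_id,
        "canonical_vector_hash": vector_hash(CANONICAL[concept_id]),
    }
```

Calling `describe` on each expression gives three different `raw_vector_hash` values and one shared `canonical_vector_hash`, which is the convergence the paragraph above describes.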
Neutralization
The first profile, identity-v1, applies when no trained centering or projection profile is configured. The response still records whether neutralization ran and which version was used.
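A minimal sketch of that fallback, under stated assumptions: `neutralize`, `NeutralizationResult`, and the profile name `centering-v1` are hypothetical, mean-centering is just one example of a trained profile, and reporting `ran=False` for the identity case is a choice made for this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NeutralizationResult:
    vector: list
    ran: bool     # whether any transformation was applied (assumed meaning)
    profile: str  # which profile/version produced this vector


def neutralize(vec: Sequence[float],
               mean: Optional[Sequence[float]] = None,
               profile: str = "centering-v1") -> NeutralizationResult:
    """Apply mean-centering if a trained mean is supplied, else pass through."""
    if mean is None:
        # No trained profile configured: identity-v1 returns the input unchanged.
        return NeutralizationResult(list(vec), ran=False, profile="identity-v1")
    if len(mean) != len(vec):
        raise ValueError("mean and vector dimensions differ")
    centered = [x - m for x, m in zip(vec, mean)]
    return NeutralizationResult(centered, ran=True, profile=profile)
```

Either way the caller receives the profile name with the vector, so downstream comparisons can refuse to mix vectors produced under different profiles.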
Quantization
Quantized payloads are optional, profiled, and lossy. They can help compact transport, but they do not replace concept IDs, provenance, or evidence.
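To make the lossiness concrete, here is a sketch of symmetric int8 quantization. The scheme, the profile name `int8-sym-v1`, and the payload field names are assumptions for illustration; the source does not specify which quantizer IOTA-1 profiles use.

```python
def quantize(vec, concept_id, profile="int8-sym-v1"):
    """Map floats to integer codes in [-127, 127] with one shared scale."""
    peak = max((abs(x) for x in vec), default=0.0)
    scale = peak / 127 if peak > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    # The payload travels alongside the concept ID; it does not replace it.
    return {"concept_id": concept_id, "profile": profile,
            "scale": scale, "codes": codes}


def dequantize(payload):
    """Recover an approximation of the original vector."""
    return [c * payload["scale"] for c in payload["codes"]]


original = [0.5, -1.0, 0.2503]
payload = quantize(original, "C0001842")
restored = dequantize(payload)
# Rounding error per component is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

The restored vector is close to the original but not equal to it, which is why the payload carries the profile name and still depends on the concept ID, provenance, and evidence recorded elsewhere in the response.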
Unicode safety
Normal public-symbol mode rejects private-use characters, unsafe controls, hidden bidi controls, and zero-width risk patterns. Private-use characters are not a public IOTA meaning channel.
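A check of this kind can be sketched with `unicodedata`. The character sets below are a conservative, assumed selection, and `unicode_safety_issues` is a hypothetical name. This sketch flags every zero-width character, whereas a real policy aimed at "risk patterns" would likely allow legitimate uses such as joiners inside emoji sequences or in scripts that need them.

```python
import unicodedata

# Explicit bidi embedding/override/isolate controls and directional marks.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
    "\u200e", "\u200f", "\u061c",
}
# Zero-width characters (flagged unconditionally in this sketch).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Control characters treated as harmless layout (assumption).
ALLOWED_CONTROLS = {"\n", "\t"}


def unicode_safety_issues(text):
    """Return (index, reason) pairs for characters a public-symbol mode might reject."""
    issues = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category == "Co":
            issues.append((i, "private-use"))
        elif ch in BIDI_CONTROLS:
            issues.append((i, "bidi-control"))
        elif ch in ZERO_WIDTH:
            issues.append((i, "zero-width"))
        elif category == "Cc" and ch not in ALLOWED_CONTROLS:
            issues.append((i, "control"))
    return issues
```

Returning positions and reasons, rather than a bare boolean, keeps the rejection reviewable: `unicode_safety_issues("a\ue000")` reports a private-use character at index 1.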
Validation
Round-trip output is treated as a diagnostic gist. The protocol tracks retained, lost, added, and ambiguous concepts instead of pretending conversion is lossless.
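The four buckets can be sketched as set arithmetic over concept IDs. The function name, and the rule that an ID flagged ambiguous is reported only in the ambiguous bucket, are assumptions made for this illustration.

```python
def round_trip_report(source_ids, round_trip_ids, ambiguous_ids=()):
    """Compare concept IDs before and after a round trip (hypothetical helper)."""
    source = set(source_ids)
    back = set(round_trip_ids)
    ambiguous = set(ambiguous_ids)
    return {
        "retained": sorted((source & back) - ambiguous),
        "lost": sorted((source - back) - ambiguous),
        "added": sorted((back - source) - ambiguous),
        "ambiguous": sorted(ambiguous),
    }
```

For instance, if the source resolved to three concepts and the round trip recovered one of them plus a new one, the report lists one retained, two lost, and one added, instead of collapsing the result into a single pass/fail flag.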
Semantic isomorphism
IOTA-1 uses semantic isomorphism as a testable engineering property. A result is stronger when declared normalization, locale, registry, vector model, and validation policy preserve the intended relation across expression changes.
Mutable languages
Translations and paraphrases can carry different idioms, politeness, domain assumptions, and cultural context. They help only when the sense, task, model, and provenance are explicit.
Translation centroids
A normalized average across sense-aligned translations can be useful as a concept prototype. Its spread, outliers, model ID, language list, and near-miss failures must remain visible.
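A sketch of such a centroid with its diagnostics kept alongside it. The function name, the use of cosine distance for spread, and the returned field names are assumptions; the inputs are expected to be embeddings of sense-aligned translations from a single model.

```python
import math


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def translation_centroid(vectors_by_language, model_id):
    """Average unit vectors, renormalize, and report per-language distance."""
    if not vectors_by_language:
        raise ValueError("at least one translation vector is required")
    units = {lang: _unit(v) for lang, v in vectors_by_language.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    centroid = _unit(mean)
    # Cosine distance of each translation from the centroid.
    distance = {
        lang: 1.0 - sum(a * b for a, b in zip(u, centroid))
        for lang, u in units.items()
    }
    return {
        "model_id": model_id,
        "languages": sorted(units),
        "centroid": centroid,
        "spread": sum(distance.values()) / len(distance),
        "distance_by_language": distance,
        "farthest": max(distance, key=distance.get),
    }
```

Keeping `distance_by_language` and `farthest` in the result makes an outlier translation visible to a reviewer instead of letting it silently shift the prototype.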
Shared vector space
The experiment seeks practical invariants: retrieval intent, relation structure, validator behavior, and candidate ranking that remain stable enough across languages for reviewable AI handoff.
ISO/IEC 10646
Assigned public characters, names, normalization, grapheme behavior, and public metadata make output inspectable. They do not make a code point or glyph carry IOTA meaning by itself.
Experiment metrics
Track top-K concept hit rate, centroid spread, unknown rate, Unicode-safety warnings, drift, retrieval rank, human review outcomes, and model/version deltas.
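Two of these metrics are simple enough to sketch directly. The definitions below are assumptions: top-K hit rate is taken as the share of test items whose expected concept ID appears among the first K ranked candidates, and unknown rate as the share of segments with no selected concept.

```python
def top_k_hit_rate(ranked_candidates, expected_ids, k):
    """Fraction of items whose expected ID is within the first k candidates."""
    if len(ranked_candidates) != len(expected_ids):
        raise ValueError("one candidate list is required per expected ID")
    if not expected_ids:
        return 0.0
    hits = sum(
        1 for candidates, expected in zip(ranked_candidates, expected_ids)
        if expected in candidates[:k]
    )
    return hits / len(expected_ids)


def unknown_rate(selected_ids):
    """Fraction of segments for which no concept was selected (None)."""
    if not selected_ids:
        return 0.0
    return sum(1 for cid in selected_ids if cid is None) / len(selected_ids)
```

Reporting the hit rate at several values of K, per model and registry version, is one way to surface the model/version deltas mentioned above.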
Architecture stance
IOTA-1 promises a public, inspectable approximation pipeline. It does not promise identical raw embeddings across languages, secret-codebook compression, private-use Unicode meaning, or exact translation.
The long-term direction is concept canonicalization: different source languages may produce different raw vectors, but successful matches should converge on a stable concept ID, canonical vector hash, provenance trail, public rendering candidates, and explicit drift evidence.
API surface
The converter accepts source_language, locale, topK, includeEvidence, includeVectorPreview, includeQuantizedCode, publicUnicodeOnly, and mode values including DatabaseOnly, Hybrid, Semantic Hybrid, and Semantic Interlingua.
Responses include normalization, source language, segments, concept candidates, selected concept, glyph candidates, confidence, unknown_rate, drift, vector hashes/previews, provenance, ranking lanes, and Unicode safety checks.
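A request using these options might be assembled as follows. The option names and mode values come from the description above; the `text` field, the default values, the `build_request` helper, and the JSON-style dictionary shape are assumptions, since the source does not show the wire format.

```python
import json

# Mode values as listed in the description above.
MODES = ("DatabaseOnly", "Hybrid", "Semantic Hybrid", "Semantic Interlingua")


def build_request(text, source_language, locale, mode="Hybrid", top_k=5,
                  include_evidence=True, include_vector_preview=False,
                  include_quantized_code=False, public_unicode_only=True):
    """Assemble a converter request (hypothetical helper with assumed defaults)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if top_k < 1:
        raise ValueError("topK must be at least 1")
    return {
        "text": text,  # assumed field name for the source text
        "source_language": source_language,
        "locale": locale,
        "topK": top_k,
        "includeEvidence": include_evidence,
        "includeVectorPreview": include_vector_preview,
        "includeQuantizedCode": include_quantized_code,  # quantized payload is opt-in
        "publicUnicodeOnly": public_unicode_only,
        "mode": mode,
    }


request_body = json.dumps(
    build_request("Hola", "es", "es-ES", mode="Semantic Interlingua"),
    ensure_ascii=False,
)
```

Validating the mode and top-K value before sending keeps malformed requests out of the ranking lanes and makes the opt-in flags explicit in every call.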