A few months ago I was debugging a search-relevance pipeline backed by a state-of-the-art embedding model. The query was an unambiguous negation; the top result was an unambiguous affirmation; the cosine similarity was 0.92.
You can guess the pair. “I love it” vs “I don’t love it.” Or near enough. The model could distinguish them in some abstract sense — the vectors weren’t identical — but in the only sense that mattered to the pipeline, they were neighbors.
This is funny until you notice it is a structural problem, not a model deficit. A larger model trained on the same objective produces the same neighborhood, because the neighborhood is what the objective rewards.
Embeddings know one shape of meaning: things that appear in similar contexts. Ontologies know a different shape: the categories things belong to, and how those categories relate. These are not the same shape. You cannot, in general, recover one from the other.
What follows is a short defense of explicit ontology — operationalized as POS, NER, dependency parsing, and other structural-linguistic features — as the most under-priced lever in production AI today. With notes on the tools that make it tractable: spaCy in 2026, and Rust at the throughput layer.
The neighborhood problem
Distributional semantics rests on a single bet: words and phrases that occur in similar contexts mean similar things. That bet has paid out spectacularly. We have models that produce vectors where doctor and physician sit very close, where bank (financial) and bank (river) separate cleanly given enough context, where you can do analogical arithmetic and have it sometimes work.
But the bet doesn’t pay out evenly. There are entire categories of meaning that distributional contexts cannot disambiguate, because the words that signal those categories also appear everywhere else:
- Negation. “Not” is one of the most common tokens in English. The contexts of “the patient is responsive” and “the patient is not responsive” overlap almost completely; the entire distinction is one short, ubiquitous token.
- Role assignment. “The dog bit the man” and “the man bit the dog” share every content word. The order — encoded in syntax — carries the whole meaning. Bag-like statistics can’t recover it; even sequence models trained on similarity objectives often collapse the two.
- Numerical magnitude. “Three” and “thirty” are different categories; “3” and “30” even more so. But in most natural text their distributional contexts are nearly identical (counting nouns, age ranges, dosages).
- Modality and hedging. “May have caused” and “caused” are categorically different statements about evidence. Models often treat them as paraphrases, because in training corpora they are used as paraphrases of each other.
- Temporal frame. “Before the merger” and “after the merger” embed close — they share the merger, they share the temporal-frame construction, they share almost every co-occurrent.
Each of these is a categorical distinction that the language explicitly marks — with tokens, with position, with grammatical role — but that the embedding objective is not asked to preserve. You can throw a thousand H100s at a contrastive loss and watch SICK-R Pearson go from 0.91 to 0.92. The pairs that move are not, by and large, the ones from the bullet list above.
The bullet list is where embeddings fail. The bullet list is also where structural features succeed.
What structural features see
A POS tagger does not see “I love it” vs “I don’t love it” as two strings of indistinguishable cosine neighbors. It sees:
I/PRON love/VERB it/PRON .
I/PRON do/AUX n't/PART love/VERB it/PRON .
The presence of AUX + PART (specifically the n't token, which spaCy normalizes to not and attaches with the neg dependency) is a categorical feature. A two-line classifier on “polarity flip” (do the two sentences disagree on the count of clausal-negation markers?) separates the negation category cleanly.
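A minimal sketch of that polarity-flip check, in plain Python. The Tok tuple is a hypothetical stand-in for a parsed token; it mirrors the pos_, dep_, and norm_ attributes that spaCy tokens expose, so a parsed spaCy Doc should work as input too. The tags below are typed in by hand for illustration, not produced by a model run.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token. spaCy's Token has attributes with
# the same names, so a spaCy Doc can be passed to the functions below.
Tok = namedtuple("Tok", ["text", "pos_", "dep_", "norm_"])


def negation_count(tokens):
    """Count clausal-negation markers: a `neg` dependency, or a PART
    token whose normalized form is "not" (covers both "not" and "n't")."""
    return sum(
        1
        for t in tokens
        if t.dep_ == "neg" or (t.pos_ == "PART" and t.norm_ == "not")
    )


def polarity_flip(a, b):
    """True when the two sentences disagree on the number of negation markers."""
    return negation_count(a) != negation_count(b)


# Hand-written annotations for the two example sentences.
affirmative = [
    Tok("I", "PRON", "nsubj", "i"),
    Tok("love", "VERB", "ROOT", "love"),
    Tok("it", "PRON", "dobj", "it"),
    Tok(".", "PUNCT", "punct", "."),
]
negated = [
    Tok("I", "PRON", "nsubj", "i"),
    Tok("do", "AUX", "aux", "do"),
    Tok("n't", "PART", "neg", "not"),
    Tok("love", "VERB", "ROOT", "love"),
    Tok("it", "PRON", "dobj", "it"),
    Tok(".", "PUNCT", "punct", "."),
]

print(polarity_flip(affirmative, negated))  # True
```

Comparing raw counts treats a double negation as a flip against an affirmative sentence; comparing the counts modulo two would be a reasonable alternative if double negatives are common in your data.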
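The same holds for the rest of the list. Dependency parsing handles role inversion: who is nsubj, and who is the object (dobj in spaCy’s English models, obj in Universal Dependencies)? Quantifiers and numeric magnitudes are exposed by det and nummod attachments. Modality is right there in MD tags and the aux dependencies of modal verbs. Temporal frames hang off markers like before, after, during (labeled prep or mark in spaCy’s English scheme, case in Universal Dependencies), which sit attached to events as discoverable subtrees.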
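Role inversion can be sketched the same way. The example below assumes each token carries a lemma, a dependency label, and the index of its head; the Tok tuple and the hand-written parses are hypothetical stand-ins (with spaCy you would read token.lemma_, token.dep_, and token.head.i instead).

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token; `head` is the index of the
# governing token within the same sentence.
Tok = namedtuple("Tok", ["lemma", "dep", "head"])

OBJECT_LABELS = ("dobj", "obj")  # spaCy English label, Universal Dependencies label


def role_triples(tokens):
    """Collect (subject, verb, object) lemma triples that share a head verb."""
    subjects, objects = {}, {}
    for t in tokens:
        if t.dep == "nsubj":
            subjects.setdefault(t.head, []).append(t.lemma)
        elif t.dep in OBJECT_LABELS:
            objects.setdefault(t.head, []).append(t.lemma)
    triples = set()
    for verb_idx, subs in subjects.items():
        for s in subs:
            for o in objects.get(verb_idx, []):
                triples.add((s, tokens[verb_idx].lemma, o))
    return triples


def roles_inverted(a, b):
    """True when some triple in `a` appears in `b` with subject and object swapped."""
    triples_b = role_triples(b)
    return any(
        s != o and (o, v, s) in triples_b for (s, v, o) in role_triples(a)
    )


# Hand-written parses: "The dog bit the man" / "The man bit the dog".
dog_bit_man = [
    Tok("the", "det", 1), Tok("dog", "nsubj", 2), Tok("bite", "ROOT", 2),
    Tok("the", "det", 4), Tok("man", "dobj", 2),
]
man_bit_dog = [
    Tok("the", "det", 1), Tok("man", "nsubj", 2), Tok("bite", "ROOT", 2),
    Tok("the", "det", 4), Tok("dog", "dobj", 2),
]

print(role_triples(dog_bit_man))              # {('dog', 'bite', 'man')}
print(roles_inverted(dog_bit_man, man_bit_dog))  # True
```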
None of this is novel. Linguists have spent fifty years annotating exactly these features. The novelty is operational: we have, finally, distributional models good enough to handle the easy 90% of meaning, which leaves the categorical 10% as a tractable residual problem rather than the whole game.
The shape of work in 2026 looks like: embeddings do the heavy lifting; structural features handle the part embeddings flatten.
spaCy in 2026
I get asked, more than I’d expect, why I’m still reaching for spaCy in 2026 when there are transformer-based parsers that score higher on the Penn Treebank.
A few honest answers.
Stability of API. I have production pipelines I wrote in 2021 that import spaCy 3.x and run unchanged. Try saying that about most of the rest of the NLP stack. The pipeline contract — nlp(text).ents, .pos_, .dep_, .lemma_ — is a moat. I do not want to rewrite my consumers every time someone publishes a new parsing paper.
Cost-per-token at scale. en_core_web_sm parses something like 10k–20k short sentences per second on a modern CPU. A transformer parser, even a small one, costs 50–500 ms per sentence on a GPU. If I’m processing a million-pair training corpus and I want POS/NER/DEP/MORPH/LEMMA on every sentence, the math kills the transformer path before it starts.
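To make that math concrete, here is the arithmetic using only the figures quoted above. The assumptions are mine and deliberately crude: a million pairs means two million sentences, everything runs on one device, and the GPU handles one sentence at a time with no batching.

```python
# Worked arithmetic for the throughput comparison.
# Assumptions (illustrative, not measured): single device, no batching,
# and a million pairs counted as two million sentences.

SENTENCES = 2 * 1_000_000


def cpu_seconds(sentences, sentences_per_second):
    """Wall time at a given parsing rate."""
    return sentences / sentences_per_second


def gpu_seconds(sentences, latency_ms):
    """Wall time at a given per-sentence latency in milliseconds."""
    return sentences * latency_ms / 1000


cpu_best = cpu_seconds(SENTENCES, 20_000)   # 100 seconds
cpu_worst = cpu_seconds(SENTENCES, 10_000)  # 200 seconds
gpu_best = gpu_seconds(SENTENCES, 50)       # 100,000 seconds
gpu_worst = gpu_seconds(SENTENCES, 500)     # 1,000,000 seconds

print(f"CPU: {cpu_best:.0f} to {cpu_worst:.0f} s")                # CPU: 100 to 200 s
print(f"GPU: {gpu_best / 3600:.1f} to {gpu_worst / 3600:.1f} h")  # GPU: 27.8 to 277.8 h
```

Batching on the GPU would narrow the gap, so treat the ratio as illustrative rather than as a benchmark.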
Sufficient accuracy on the categories that matter. For the residual categories embeddings miss (negation, role, modality, numerical, temporal), the relevant spaCy features (the neg dependency, nsubj/dobj, the MD tag, the NUM part of speech, temporal prep/mark markers) are well above the noise floor. The marginal improvement of a transformer parser on the Penn Treebank does not translate into measurably better residual features. The cheap, fast parser is good enough.
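A rough sketch of what reading those residual features off a parse can look like, covering modality, number, and temporal frame. The Tok tuple is a hypothetical stand-in that mirrors spaCy's lower_, pos_, and tag_ token attributes; the marker list and the hand-written tags are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token; spaCy tokens expose the same
# attribute names.
Tok = namedtuple("Tok", ["lower_", "pos_", "tag_"])

# Assumption: a deliberately tiny marker list, taken from the examples in the text.
TEMPORAL_MARKERS = {"before", "after", "during"}


def residual_features(tokens):
    """Pull out the categorical signals that embeddings tend to flatten."""
    return {
        "modals": sorted(t.lower_ for t in tokens if t.tag_ == "MD"),
        "numbers": sorted(t.lower_ for t in tokens if t.pos_ == "NUM"),
        "temporal": sorted(t.lower_ for t in tokens if t.lower_ in TEMPORAL_MARKERS),
    }


def disagreements(a, b):
    """Names of the feature categories on which two sentences differ."""
    fa, fb = residual_features(a), residual_features(b)
    return [name for name in fa if fa[name] != fb[name]]


# Hand-written tags for "It may have caused harm" / "It caused harm".
hedged = [
    Tok("it", "PRON", "PRP"), Tok("may", "AUX", "MD"), Tok("have", "AUX", "VB"),
    Tok("caused", "VERB", "VBN"), Tok("harm", "NOUN", "NN"),
]
asserted = [
    Tok("it", "PRON", "PRP"), Tok("caused", "VERB", "VBD"), Tok("harm", "NOUN", "NN"),
]

# Hand-written tags for "before the merger" / "after the merger".
before_merger = [Tok("before", "ADP", "IN"), Tok("the", "DET", "DT"), Tok("merger", "NOUN", "NN")]
after_merger = [Tok("after", "ADP", "IN"), Tok("the", "DET", "DT"), Tok("merger", "NOUN", "NN")]

print(disagreements(hedged, asserted))             # ['modals']
print(disagreements(before_merger, after_merger))  # ['temporal']
```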
When to reach elsewhere. Stanza is what I run when I need a non-English language at production accuracy. Trankit is what I run when I need maximal accuracy on a single domain and can afford the GPU cost. spaCy is what I run when I want all of English and I want it now.
The other thing worth saying about spaCy is that the project has aged exceptionally well. Custom components register cleanly through nlp.add_pipe; you can chain a vectorizer, a custom NER, a domain lemmatizer without forking the framework. The serialization story is clean. The Hugging Face hooks land at the right boundary. None of this is exciting; all of it is load-bearing.
Where Rust earns its place
There is a moment, in every structural-feature pipeline I have ever shipped, where Python stops being adequate. The moment is not in the parsing — nlp.pipe(texts, n_process=8) will saturate your CPUs fine. The moment is in the downstream feature math: pairwise Jaccards over entity sets, hash-based fingerprint matching, dependency-pattern lookups, all running over hundreds of thousands or millions of pairs.
Python at that boundary is dying by a thousand object allocations. NumPy helps for the array-shaped subset. For the set-shaped and hash-shaped subset — which is where most ontology features live — it does not.
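For reference, the set-shaped math in question is easy to state. Below is a plain-Python sketch of pairwise Jaccard over entity sets; the sentence ids and entities are made up for illustration. Every set intersection and union here allocates a fresh Python object, which is the per-pair overhead described above.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty entity sets count as identical
    return len(a & b) / len(a | b)


def pairwise_jaccard(entity_sets):
    """Jaccard for every unordered pair of sentence ids."""
    return {
        (i, j): jaccard(entity_sets[i], entity_sets[j])
        for i, j in combinations(sorted(entity_sets), 2)
    }


# Hypothetical entity sets, keyed by sentence id, as (text, label) tuples.
example_entities = {
    "s1": frozenset({("Acme", "ORG"), ("Berlin", "GPE")}),
    "s2": frozenset({("Acme", "ORG"), ("Paris", "GPE")}),
    "s3": frozenset(),
}

scores = pairwise_jaccard(example_entities)
print(scores[("s1", "s2")])  # one shared entity out of three distinct ones
print(scores[("s1", "s3")])  # 0.0
```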
This is where I reach for Rust.
The pattern I’ve converged on: spaCy runs the parse and materializes columnar arrays into Parquet or .npy (entity strings, entity types, POS tag sequences, lemma sequences, negation counts, numeric mentions). A small Rust binary then takes the columnar features and runs the pairwise math at memory-bandwidth speed.
A specific shape of this I find useful: a binary fuse filter over content fingerprints. The construction: build a hash signature over per-pair entity-set, POS-bigram, and negation-asymmetry features, dump the signatures into a fuse filter, and query at ~10 ns per pair. You can run that filter inline with embedding cosine and decide, per pair, whether the structural-signature feature is asserting something the embedding cannot.
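A loose Python reference sketch of the signature step, with heavy caveats. The feature encoding (set differences plus a negation-count difference), the function name, and the choice of BLAKE2b are all my assumptions, and a plain frozenset stands in for the binary fuse filter: it answers the same membership question exactly, without the filter's compact memory layout or its small false-positive rate.

```python
import hashlib


def pair_signature(entity_diff, pos_bigram_diff, negation_asymmetry):
    """64-bit fingerprint of a pair's structural features (hypothetical encoding).

    entity_diff and pos_bigram_diff are sets of strings; sorting them first
    makes the signature independent of iteration order.
    """
    parts = [
        "|".join(sorted(entity_diff)),
        "|".join(sorted(pos_bigram_diff)),
        str(negation_asymmetry),
    ]
    payload = "\x1f".join(parts).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Stand-in for the fuse filter: signatures of feature patterns we want to flag.
# A real fuse filter would be built once from this collection of 64-bit keys.
flagged = frozenset({
    pair_signature(set(), {"AUX PART", "PART VERB"}, 1),
})

# Query: same features, different set ordering, so the same signature.
query = pair_signature(set(), {"PART VERB", "AUX PART"}, 1)
print(query in flagged)  # True

# A pair with no structural differences is not flagged.
print(pair_signature(set(), set(), 0) in flagged)  # False
```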
Other Rust crates that earn their keep in this stack:
- ndarray for the columnar array math that doesn’t fit NumPy’s broadcasting cleanly.
- serde_yaml + clap for the manifest-driven workflows you end up wanting: which features to extract, which cohort to score against, which dataset to consume.
- tokio and axum if any of this needs to live behind an HTTP service.
- indexmap for the ordered-by-insertion dict semantics that rank-based scoring wants.
None of these are exotic. The point is not that Rust is necessary for ontology features. The point is that Rust is what makes them cheap enough at inference latency to combine with embedding-cosine in the same per-pair forward pass. You pay 50 µs of additional latency, not 50 ms.
A pattern that works
The pattern I have ended up running, in various shapes, across several embedding pipelines:
- Run spaCy en_core_web_sm over the corpus once. Materialize POS, NER, lemma, dependency, morphology features to columnar storage.
- Train (or use off-the-shelf) embeddings as usual. Compute cosine similarity for every pair.
- Maintain a small Rust binary that, given two sentences and their precomputed structural features, computes a residual signal — a small number that says “the embedding cosine disagrees with the structural read of this pair.”
- Combine the two — usually as a discriminator at the head, sometimes as a feature in a small learned classifier, sometimes as a hard veto.
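Two of those combination styles can be sketched in a few lines. This assumes the residual is scaled to the range 0 to 1, where 1 means the structural read fully contradicts the embedding; the function names, thresholds, and weights are hypothetical defaults, not tuned values.

```python
def hard_veto(cosine, residual, veto_at=0.5, floor=0.0):
    """Discard the embedding score when the structural residual is strong enough."""
    return floor if residual >= veto_at else cosine


def soft_penalty(cosine, residual, weight=0.5):
    """Scale the embedding score down in proportion to the residual."""
    return cosine * (1.0 - weight * residual)


# The pair from the introduction: cosine 0.92, with a negation mismatch
# reported by the structural features (residual of 1.0).
print(hard_veto(0.92, 1.0))     # 0.0
print(hard_veto(0.92, 0.0))     # 0.92
print(soft_penalty(0.92, 1.0))  # 0.46
```

The hard veto is the easiest to reason about and to debug; the soft penalty, or feeding the residual into a small learned classifier, is gentler when the structural features are themselves noisy.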
In numbers I am willing to share at this granularity: on production STS-style problems with a saturated embedding baseline at Pearson 0.88, adding a structural-feature residual lifts corpus-level Pearson to 0.91+, and lifts the slices where embeddings were always going to fail (the bullet list above) into the 0.93+ range. The marginal cost is one CPU-second per thousand pairs at inference, on top of the existing GPU embedding cost. It is, in expected lift per dollar, the best engineering trade I know how to make in 2026.
The case for explicit ontology
There is a common framing in which “ontology” is an artifact of the 2000s — a Semantic Web grave good, replaced cleanly by dense vectors. I think that framing is wrong, or at least incomplete.
What changed is not that ontology became unnecessary. What changed is that we now have a complementary representation — the vector — that handles the part ontology was bad at: the soft, gradient, similarity-by-context part. The hard, categorical, structural part — the part that requires you to identify a negation as a negation, a subject as a subject, a numeral as a numeral — never went away. It moved.
It moved into the part of the system you cannot escape if you ship to production. Into the part that breaks on input that wasn’t in the training distribution. Into the part that, when it breaks, makes your customers stop trusting your search results.
There is no honest way to dodge that part with a bigger model. The objective the bigger model was trained on does not address the bullet list.
What does address the bullet list: a POS tagger, an NER tagger, a dependency parser, a negation classifier, a quantifier-scope extractor — all the boring infrastructure linguists have been refining for fifty years. With spaCy at the parse layer and Rust at the throughput layer, they ship at production latency and cost.
That is, I think, what AI ontology means in 2026. Not the return of OWL and RDF. Not knowledge graphs as standalone systems. The recovery, at production speed, of the structural distinctions that vectors flatten — and their reintegration as a complementary representation alongside the embeddings that handle the rest.
The next decade’s gains in AI, in my view, will not come from another order of magnitude of pretraining. They will come from finding the categories that embeddings flatten, building the structural features that distinguish them, and shipping the hybrid pipeline that has both.
Vectors are not ontologies. They were never meant to be. The work is in combining them with what is.