<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>5L Labs Blog</title>
        <link>https://www.5l-labs.com/frontier-research</link>
        <description>5L Labs Blog</description>
        <lastBuildDate>Mon, 23 Feb 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Learning about learning?]]></title>
            <link>https://www.5l-labs.com/frontier-research/learning-again</link>
            <guid>https://www.5l-labs.com/frontier-research/learning-again</guid>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring the limits of Transformer architectures, the Platonic Representation Hypothesis, and the role of curiosity-driven learning in the next generation of AI.]]></description>
            <content:encoded><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="continuing-beyond-llmstransformers">[Continuing] Beyond LLMs/Transformers<a href="https://www.5l-labs.com/frontier-research/learning-again#continuing-beyond-llmstransformers" class="hash-link" aria-label="Direct link to [Continuing] Beyond LLMs/Transformers" title="Direct link to [Continuing] Beyond LLMs/Transformers" translate="no">​</a></h2>
<p>The power-waste, inefficiencies, and general limits of throwing large corpora of labeled data into a blender to create probability distributions of next-token are becoming apparent to the broader world.</p>
<p>Useful to create once, and part of the solution, but not the answer.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="connected-but-disconnected-thoughts">Connected, but disconnected thoughts:<a href="https://www.5l-labs.com/frontier-research/learning-again#connected-but-disconnected-thoughts" class="hash-link" aria-label="Direct link to Connected, but disconnected thoughts:" title="Direct link to Connected, but disconnected thoughts:" translate="no">​</a></h3>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-the-data-stupid">It's the data, stupid!<a href="https://www.5l-labs.com/frontier-research/learning-again#its-the-data-stupid" class="hash-link" aria-label="Direct link to It's the data, stupid!" title="Direct link to It's the data, stupid!" translate="no">​</a></h2>
<p>Multiple papers are coming to interesting observations:</p>
<ul>
<li class=""><a href="https://arxiv.org/abs/2405.07987" target="_blank" rel="noopener noreferrer" class="">The Platonic Representation Hypothesis</a></li>
<li class=""><a href="https://arxiv.org/abs/2505.12540" target="_blank" rel="noopener noreferrer" class="">Harnessing the Universal Geometry of Embeddings</a></li>
<li class=""><a href="https://arxiv.org/abs/2512.03750" target="_blank" rel="noopener noreferrer" class="">Universally Converging Representations of Matter Across Scientific Foundation Models</a></li>
</ul>
<p>My takeaway: if you feed in the same underlying data, regardless of which blender, the models are going to encode/decode into similar spaces. This has fun implications:</p>
<ul>
<li class="">Open-source data that is less detailed and lower quality can still generalize to the same embedding concepts as proprietary labeled data.</li>
<li class="">We can (and should) be able to translate between same-dimension embedding spaces as they evolve, without much loss of the original text.</li>
<li class="">We can use <a href="https://arxiv.org/abs/2205.13147" target="_blank" rel="noopener noreferrer" class="">Matryoshka Representation Learning</a> to focus on "what's shared" between embedding spaces.</li>
</ul>
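<p>To make the Matryoshka idea concrete, here is a minimal sketch in pure Python. The vectors are made-up toy values, not real model output, and the property only truly holds for MRL-trained embeddings; the sketch just shows the mechanics: the first <em>k</em> dimensions form a usable low-dimensional embedding once re-normalized, so two spaces can be compared on their shared prefix.</p>

```python
import math

def truncate(vec, k):
    """Keep the first k (Matryoshka) dimensions and re-normalize."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

# Two toy "full" embeddings (8-dim stand-ins for 2048-dim vectors);
# values are illustrative only.
e1 = [0.9, 0.1, 0.05, 0.02, 0.01, 0.01, 0.0, 0.0]
e2 = [0.88, 0.12, 0.04, 0.03, 0.01, 0.0, 0.01, 0.0]

full_sim = cosine(truncate(e1, 8), truncate(e2, 8))
head_sim = cosine(truncate(e1, 2), truncate(e2, 2))  # "what's shared"
```

<p>For MRL-trained models the truncated similarity tracks the full similarity closely, which is exactly the "focus on what's shared" lever.</p>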
<p>See:</p>
<ol>
<li class=""><a href="https://huggingface.co/papers/2409.17146" target="_blank" rel="noopener noreferrer" class="">Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models</a></li>
<li class=""><a href="https://youtu.be/mQOK0Mfyrkk?si=-HcyNzyOl67fTLkS&amp;t=2918" target="_blank" rel="noopener noreferrer" class="">Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 16: Vision and Language</a></li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="out-of-distribution">Out-of-Distribution<a href="https://www.5l-labs.com/frontier-research/learning-again#out-of-distribution" class="hash-link" aria-label="Direct link to Out-of-Distribution" title="Direct link to Out-of-Distribution" translate="no">​</a></h2>
<p>"Slop" is the natural consequence of sampling outputs from a fitted probability distribution. More parameters can, of course, make that distribution more varied, but in the end it's probability and stats.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-a-blessing">It's a blessing<a href="https://www.5l-labs.com/frontier-research/learning-again#its-a-blessing" class="hash-link" aria-label="Direct link to It's a blessing" title="Direct link to It's a blessing" translate="no">​</a></h3>
<p>Just as most early middle-school kids want to "fit in," we can feel confident wandering into unfamiliar territory and landing near the median of the curve.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-a-curse">It's a curse<a href="https://www.5l-labs.com/frontier-research/learning-again#its-a-curse" class="hash-link" aria-label="Direct link to It's a curse" title="Direct link to It's a curse" translate="no">​</a></h3>
<p>Outside the median's boilerplate, the edges of language are where real ideas live. <strong>How can we spend our energies at the edge?</strong></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="graphql-is-not-the-answer">GraphQL is not the answer.<a href="https://www.5l-labs.com/frontier-research/learning-again#graphql-is-not-the-answer" class="hash-link" aria-label="Direct link to GraphQL is not the answer." title="Direct link to GraphQL is not the answer." translate="no">​</a></h3>
<p>While I do love talking with the technologists behind GraphQL / SurrealDB, they are not the answer, as they require manual relationship mapping. We know from papers like <a href="https://arxiv.org/abs/2507.18546" target="_blank" rel="noopener noreferrer" class="">GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface</a> that we should be able to use "stand-ins" for unique concepts with lower effort at scale. How does that change embedding-space distributions and other models?</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="on-success-outside-the-straight-line">On success outside the straight line<a href="https://www.5l-labs.com/frontier-research/learning-again#on-success-outside-the-straight-line" class="hash-link" aria-label="Direct link to On success outside the straight line" title="Direct link to On success outside the straight line" translate="no">​</a></h2>
<p>One of the more interesting books that I've come across came from the folks at <a href="https://sakana.ai/" target="_blank" rel="noopener noreferrer" class="">Sakana.ai</a>, entitled <a href="https://www.goodreads.com/book/show/25670869-why-greatness-cannot-be-planned" target="_blank" rel="noopener noreferrer" class="">"Why Greatness Cannot Be Planned: The Myth of the Objective"</a>. Similar to my academic paper backlog, it's slowly being iterated through but is indeed quite interesting, especially when you think about machine learning beyond the transformer architecture.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="on-curiosity">On Curiosity<a href="https://www.5l-labs.com/frontier-research/learning-again#on-curiosity" class="hash-link" aria-label="Direct link to On Curiosity" title="Direct link to On Curiosity" translate="no">​</a></h2>
<p>Along similarly delightful lines, I came across this video in my YouTube backlog: <a href="https://www.youtube.com/watch?v=N2nIie7K7nU" target="_blank" rel="noopener noreferrer" class="">Pierre-Yves Oudeyer on Curiosity Driven Learning</a>.</p>
<p>Fun new word:</p>
<ul>
<li class=""><strong>Autotelic</strong>: From the Greek <em>autos</em> (self) and <em>telos</em> (goal). In the context of curiosity-driven learning, an autotelic agent is one that sets its own goals and finds intrinsic reward in the process of learning itself, rather than just optimizing for an external objective.</li>
</ul>
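<p>A toy sketch of what "autotelic" can mean operationally, loosely in the spirit of Oudeyer-style intrinsic motivation: the agent prefers whichever self-set goal shows the highest recent learning progress (drop in prediction error). All class and goal names here are my own illustrations, not from any paper's codebase.</p>

```python
import random

class AutotelicSampler:
    """Toy curiosity loop: exploit the self-set goal whose prediction
    error has dropped the most recently (highest learning progress),
    with occasional random exploration. Illustrative only."""

    def __init__(self, goals):
        # one error history per goal, seeded at maximal error
        self.errors = {g: [1.0] for g in goals}

    def record(self, goal, error):
        """Log the latest prediction error after practicing a goal."""
        self.errors[goal].append(error)

    def progress(self, goal):
        """Learning progress over a short window: old error minus new."""
        hist = self.errors[goal][-5:]
        return hist[0] - hist[-1]

    def pick_goal(self):
        """Mostly exploit the fastest-improving goal; sometimes explore."""
        if random.random() < 0.2:
            return random.choice(list(self.errors))
        return max(self.errors, key=self.progress)
```

<p>The interesting design choice is rewarding the <em>derivative</em> of competence rather than competence itself: goals that are already mastered (or hopeless) show flat error and stop attracting attention.</p>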
<p>Explore more at the <a href="https://flowers.inria.fr/" target="_blank" rel="noopener noreferrer" class="">Flowers Inria</a> project.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-am-i-going-to-prove-it">How am I going to prove it?<a href="https://www.5l-labs.com/frontier-research/learning-again#how-am-i-going-to-prove-it" class="hash-link" aria-label="Direct link to How am I going to prove it?" title="Direct link to How am I going to prove it?" translate="no">​</a></h2>
<p>By building an "Autotelic Agent"—one that doesn't just respond to prompts but actively explores its environment (via the GarageCam/HomeKit mesh) to build its own internal model of reality. This requires moving beyond the "next-token prediction" blender and into true structured, curiosity-driven exploration.</p>]]></content:encoded>
            <category>blog</category>
            <category>ml</category>
            <category>embedding_models</category>
            <category>learning</category>
            <category>sakana-ai</category>
            <category>curiosity</category>
        </item>
        <item>
            <title><![CDATA[Time Decay of Information]]></title>
            <link>https://www.5l-labs.com/frontier-research/information-aging-out</link>
            <guid>https://www.5l-labs.com/frontier-research/information-aging-out</guid>
            <pubDate>Wed, 10 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring the "Time Decay" of information in AI models—how embeddings and weights handle the aging of explicit, implicit, and undated knowledge.]]></description>
            <content:encoded><![CDATA[<p>The below is a snapshot in time of my evolving thought process on how to deal with information aging out in learning models. I may periodically refresh that thinking in place or extend to a new post.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="textual-information">(Textual) Information<a href="https://www.5l-labs.com/frontier-research/information-aging-out#textual-information" class="hash-link" aria-label="Direct link to (Textual) Information" title="Direct link to (Textual) Information" translate="no">​</a></h2>
<p>Information aging out for learning (human and machine) requires thinking about the problem space from at least two different angles:</p>
<ol>
<li class="">Embeddings</li>
<li class="">Weights / Models</li>
</ol>
<p>Of which there are three scenarios:</p>
<ol>
<li class="">Explicitly dated information</li>
<li class="">Implicitly dated information</li>
<li class="">Undated information</li>
</ol>
<p>With sources falling along a separate trust axis:</p>
<ol>
<li class="">Mostly trusted</li>
<li class="">Untrusted</li>
</ol>
<p>These arrive from multiple sources including, but not limited to:</p>
<ol>
<li class="">Books (our oldest form of information) - Permanent form of information</li>
<li class="">Articles (news, blogs, journals) - Semipermanent form of information</li>
<li class="">Social Media (the most ephemeral form of information)</li>
</ol>
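<p>The taxonomy above can be sketched as a toy data model. The enum names, and especially the half-life numbers, are my own placeholder assumptions, not measured values:</p>

```python
from dataclasses import dataclass
from enum import Enum, auto

class Dating(Enum):
    EXPLICIT = auto()   # carries its own timestamp
    IMPLICIT = auto()   # date inferable from context
    UNDATED = auto()

class Trust(Enum):
    MOSTLY_TRUSTED = auto()
    UNTRUSTED = auto()

class Source(Enum):
    BOOK = auto()          # permanent
    ARTICLE = auto()       # semipermanent
    SOCIAL_MEDIA = auto()  # ephemeral

@dataclass
class InfoItem:
    text: str
    dating: Dating
    trust: Trust
    source: Source

    def base_half_life_days(self) -> float:
        """Toy half-lives: more permanent sources decay more slowly."""
        return {
            Source.BOOK: 3650.0,
            Source.ARTICLE: 365.0,
            Source.SOCIAL_MEDIA: 7.0,
        }[self.source]
```

<p>Even a crude schema like this makes the later questions tractable: decay policy becomes a function of (dating, trust, source) rather than a single global knob.</p>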
<p>In addition, we need to consider whether the model or human is aiming for general or deep knowledge of the topic at hand.</p>
<p>Deep knowledge may have less stickiness over time, while general knowledge may be more resilient to time decay.</p>
<p>Finally, we need to look at how that knowledge could decay over time. Could an initially 2048-dim embedding decay into a 128-dim embedding? This "semantic evaporation" could be a mechanism for long-term memory management, where detailed nuances are pruned while the core concept (the "centroid") remains.</p>
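<p>One way to sketch that "semantic evaporation" schedule: halve the retained Matryoshka prefix every half-life, clamping to a floor that preserves the centroid. The half-life and floor values are arbitrary assumptions for illustration:</p>

```python
import math

def dims_after_age(full_dim, age_days, half_life_days=180.0, floor_dim=128):
    """Toy 'semantic evaporation': the retained embedding prefix halves
    every half-life, rounded down to a power of two (Matryoshka-style
    nesting), never dropping below a floor that keeps the core concept."""
    kept = full_dim * 0.5 ** (age_days / half_life_days)
    kept = 2 ** int(math.log2(max(kept, floor_dim)))
    return max(int(kept), floor_dim)
```

<p>So under these made-up constants, a fresh 2048-dim memory stays whole, shrinks to 1024 dims after one half-life, and eventually settles at the 128-dim floor instead of vanishing.</p>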
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="multimodal-information">Multimodal Information?<a href="https://www.5l-labs.com/frontier-research/information-aging-out#multimodal-information" class="hash-link" aria-label="Direct link to Multimodal Information?" title="Direct link to Multimodal Information?" translate="no">​</a></h2>
<p>A lot of the public work that I've read has gone into single-modality learning, whereas humans do not learn (well) from text alone. Even if one is "book smart," the structures that we use to retain and process information rely on mapping to other concepts.</p>
<p>Textual Models like <a href="https://github.com/urchade/GLiNER" target="_blank" rel="noopener noreferrer" class="">GLiNER</a> (Generalist Named Entity Recognition) offer hints as to how that linkage might be established. Lead author <a href="https://urchade.github.io/" target="_blank" rel="noopener noreferrer" class="">Urchade Zaratiana</a> and the team are pushing this into <strong>GLiNER2</strong>, which aims for unified, schema-driven information extraction across multiple tasks like NER, classification, and structured data extraction.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-work-proving-out-the-ideas">Future Work: Proving out the ideas<a href="https://www.5l-labs.com/frontier-research/information-aging-out#future-work-proving-out-the-ideas" class="hash-link" aria-label="Direct link to Future Work: Proving out the ideas" title="Direct link to Future Work: Proving out the ideas" translate="no">​</a></h2>
<p>To prove out these ideas, we're looking at "Temporal Embeddings"—vector spaces that incorporate a time-decay function directly into the similarity calculation. This prioritizes recent information without entirely discarding historical context, much like how human memory functions. We also need to explore GLiNER's viability in zero-shot aspect-based sentiment analysis as a way to track changing sentiments over time.</p>]]></content:encoded>
            <category>blog</category>
            <category>frontier</category>
            <category>embedding_models</category>
            <category>time</category>
        </item>
    </channel>
</rss>