Training a number-aware embedding model + Text JEPA doesn’t work too well + Text auto-encoders have a strange frequency bias [R][P]


By AI Maestro · May 13, 2026 · 1 min read


  • A British researcher attempted to predict company growth from the full text of 10-K filings; the task failed despite extensive experimentation and resource investment.
  • Along the way, the author built a modified ModernBERT capable of predicting the numbers that appear in text directly, without relying on traditional tokenization or a token-prediction head, and then turned it into a sequence-embedding model for further analysis (a rough sketch of the idea follows this list).
  • When this number-aware embedding was plugged into Joint-Embedding Predictive Architecture (JEPA) and text-autoencoder setups, however, it did not perform as expected: the autoencoder suffered from a frequency bias in which high-frequency information dominated the output, which called for additional strategies such as an added contrastive loss term.

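The post itself contains no code, so the following is only a minimal sketch of what a number-aware objective could look like, not the author's implementation: instead of predicting digit tokens, a small head regresses the numeric value at a number's position straight from the encoder's hidden state, here as a log-magnitude plus a sign logit. `NumberAwareHead`, `number_loss`, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NumberAwareHead(nn.Module):
    """Hypothetical head: regress the log-magnitude and sign of a number
    from the encoder hidden state at that number's position."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),  # -> [log-magnitude, sign logit]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)

def number_loss(pred: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Illustrative loss: L1 on log1p(|value|) plus BCE on the sign."""
    mag_loss = (pred[:, 0] - torch.log1p(values.abs())).abs().mean()
    sign_loss = nn.functional.binary_cross_entropy_with_logits(
        pred[:, 1], (values > 0).float()
    )
    return mag_loss + sign_loss

# Toy usage: hidden states taken at positions where numbers occur.
head = NumberAwareHead(hidden_dim=768)
hidden = torch.randn(16, 768)      # stand-in for ModernBERT outputs
values = torch.randn(16) * 1e4     # the true numeric values in the text
loss = number_loss(head(hidden), values)
loss.backward()
```

Regressing a log-magnitude rather than the raw value keeps the target well-scaled across the huge dynamic range of financial figures; the actual formulation the author used is not given in the post.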

### Takeaways
- Leveraging the numbers inside financial text for predictive tasks remains challenging and, in this experiment, did not yield robust results.
- Models need careful tuning, especially under unexpected data distributions or inherent biases such as the frequency bias seen in the autoencoder setup (one common mitigation is sketched below).
- Building number-aware embeddings takes more than plain tokenization; it requires targeted modifications to existing architectures.
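A standard way to counteract this kind of frequency bias is to add an InfoNCE-style contrastive term to the autoencoder's reconstruction loss, so that embeddings of different texts stay separable instead of collapsing onto the dominant high-frequency content. The sketch below is an assumption about how such a term might look; the post only says a contrastive loss was incorporated, and `z`, `z_pos`, `tau`, and the pairing scheme are all illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_term(z: torch.Tensor, z_pos: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: z[i] should match z_pos[i] and no other row."""
    z = F.normalize(z, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.t() / tau                         # (B, B) similarities
    targets = torch.arange(z.size(0), device=z.device)   # diagonal positives
    return F.cross_entropy(logits, targets)

# Hypothetical combined objective for the text autoencoder:
#   total_loss = reconstruction_loss + lambda_c * contrastive_term(z, z_pos)
```

The weight on the contrastive term and the choice of positive pairs (e.g. two augmented views of the same document) are tuning decisions the post does not specify.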


Originally published at reddit.com. Curated by AI Maestro.
