Ethics & Policy
Do text embeddings perfectly encode text?
The rise of the vector database
As a result of the rapid advancement of generative AI in recent years, many companies are rushing to integrate AI into their businesses. One of the most common ways of doing this is to build AI systems that answer questions concerning information that can be found within a database of documents. Most solutions for such a problem are based on one key technique: Retrieval Augmented Generation (RAG).
This is what lots of people do now as a cheap and easy way to get started using AI: store lots of documents in a database, have the AI retrieve the most relevant documents for a given input, and then generate a response to the input that is informed by the retrieved documents.
These RAG systems determine document relevancy by using “embeddings”, vector representations of documents produced by an embedding model. These embeddings are supposed to represent some notion of similarity, so documents that are relevant for search will have high vector similarity in embedding space.
The prevalence of RAG has led to the rise of the vector database, a new type of database designed for storing and searching through large numbers of embeddings. Hundreds of millions of dollars of funding have been given out to startups that claim to facilitate RAG by making embedding search easy. And the effectiveness of RAG is the reason why lots of new applications are converting text to vectors and storing them in these vector databases.
Embeddings are hard to read
So what is stored in a text embedding? Beyond the requirement of semantic similarity, there are no constraints on which embedding must be assigned for a given text input. Numbers within embedding vectors can be anything, and vary based on their initialization. We can interpret the similarities of embedding with others but have no hope ever understanding the individual numbers of an embedding.
Now imagine you’re a software engineer building a RAG system for your company. You decide to store your vectors in a vector database. You notice that in a vector database, what’s stored are embedding vectors, not the text data itself. The database fills up with rows and rows of random-seeming numbers that represent text data but never ‘sees’ any text data at all.
You know that the text corresponds to customer documents that are protected by your company’s privacy policy. But you’re not really sending text off-premises at any time; you only ever send embedding vectors, which look to you like random numbers.
What if someone hacks into the database and gains access to all your text embedding vectors – would this be bad? Or if the service provider wanted to sell your data to advertisers – could they? Both scenarios involve being able to take embedding vectors and invert them somehow back to text.
From text to embeddings…back to text
The problem of recovering text from embeddings is exactly the scenario we tackle in our paper Text Embeddings Reveal As Much as Text (EMNLP 2023). Are embedding vectors a secure format for information storage and communication? Put simply: can input text be recovered from output embeddings?
Before diving into solutions, let’s think about the problem a little bit more. Text embeddings are the output of neural networks, sequences of matrix multiplications joined by nonlinear function operations applied to input data. In traditional text processing neural networks, a string input is split into a number of token vectors, which repeatedly undergo nonlinear function operations. At the output layer of the model, tokens are averaged into a single embedding vector.
A maxim from the signal processing community known as the data processing inequality tells us that functions cannot add information to an input, they can only sustain or decrease the amount of information available. Even though conventional wisdom tells us that deeper layers of a neural network are constructing ever-higher-order representations, they aren’t adding any information about the world that didn’t come in on the input side.
Additionally, the nonlinear layers certainly destroy some information. One ubiquitous nonlinear layer in modern neural networks is the “ReLU” function, which simply sets all negative inputs to zero. After applying ReLU throughout the many layers of a typical text embedding model, it is not possible to retain all the information from the input.
Inversion in other contexts
Similar questions about information content have been asked in the computer vision community. Several results have shown that deep representations (embeddings, essentially) from image models can be used to recover the input images with some degree of fidelity. An early result (Dosovitskiy, 2016) showed that images can be recovered from the feature outputs of deep convolutional neural networks (CNNs). Given the high-level feature representation from a CNN, they could invert it to produce a blurry-but-similar version of the original input image.
People have improved on image embedding inversion process since 2016: models have been developed that do inversion with higher accuracy, and have been shown to work across more settings. Surprisingly, some work has shown that images can be inverted from the outputs of an ImageNet classifier (1000 class probabilities).
The journey to vec2text
If inversion is possible for image representations, then why can’t it work for text? Let’s consider a toy problem of recovering text embeddings. For our toy setting we’ll restrict text inputs to 32 tokens (around 25 words, a sentence of decent length) and embed them all to vectors of 768 floating-point numbers. At 32-bit precision, these embeddings are 32 * 768 = 24,576 bits or around 3 kilobytes.
Few words represented by many bits. Do you think we could perfectly reconstruct the text within this scenario?
First things first: we need to define a measurement of goodness, to know how well we have accomplished our task. One obvious metric is “exact match”, how often we get the exact input back after inversion. No prior inversion methods have any success on exact match, so it’s quite an ambitious measurement. So maybe we want to start with a smooth measurement that measures how similar the inverted text is to the input. For this we’ll use BLEU score, which you can just think of as a percentage of how close the inverted text is to the input.
With our success metric defined, let us move on to proposing an approach to evaluate with said metric. For a first approach, we can pose inversion as a traditional machine learning problem, and we solve it the best way we know how: by gathering a large dataset of embedding-text pairs, and train a model to output the text given the embedding as input.
So this is what we did. We build a transformer that takes the embedding as input and train it using traditional language modeling on the output text. This first approach gives us a model with a BLEU score of around 30/100. Practically, the model can guess the topic of the input text, and get some of the words, but it loses their order and often gets most of them wrong. The exact match score is close to zero. It turns out that asking a model to reverse the output of another model in a single forward pass is quite hard (as are other complicated text generation tasks, like generating text in perfect sonnet form or satisfying multiple attributes).
After training our initial model, we noticed something interesting. A different way to measure model output quality is by re-embedding the generated text (we call this the “hypothesis”) and measuring this embedding’s similarity to the true embedding. When we do this with our model’s generations, we see a very high cosine similarity – around 0.97. This means that we’re able to generate text that’s close in embedding space, but not identical to, the ground-truth text.
(An aside: what if this weren’t the case? That is, what if the embedding assigned our incorrect hypothesis the same embedding as the original sequence. Our embedder would be lossy, mapping multiple inputs to the same output. If this were the case, then our problem would be hopeless, and we would have no way of distinguishing which of multiple possible sequences produced it. In practice, we never observe these types of collisions in our experiments.)
The observation that hypotheses have different embeddings to the ground truth inspires an optimization-like approach to embedding inversion. Given a ground-truth embedding (where we want to go), and a current hypothesis text and its embedding (where we are right now), we can train a corrector model that’s trained to output something that’s closer to the ground-truth than the hypothesis.
Our goal is now clear: we want to build a system that can take a ground-truth embedding, a hypothesis text sequence, and the hypothesis position in embedding space, and predict the true text sequence. We think of this as a type of ‘learned optimization’ where we’re taking steps in embedding space in the form of discrete sequences. This is the essence of our method, which we call vec2text.
After working through some details and training the model, this process works extremely well! A single forward pass of correction increases the BLEU score from 30 to 50. And one benefit of this model is that it can naturally be queried recursively. Given a current text and its embedding, we can run many steps of this optimization, iteratively generating hypotheses, re-embedding them, and feeding them back in as input to the model. With 50 steps and a few tricks, we can get back 92% of 32-token sequences exactly, and get to a BLEU score of 97! (Generally achieving BLEU score of 97 means we’re almost perfectly reconstructing every sentence, perhaps with a few punctuation marks misplaced here and there.)
Scaling and future work
The fact that text embeddings can be perfectly inverted raises many follow-up questions. For one, the text embedding vector contains a fixed number of bits; there must be some sequence length at which information can no longer be perfectly stored within this vector. Even though we can recover most texts of length 32, some embedding models can embed documents up to thousands of tokens. We leave it up to future work to analyze the relationship between text length, embedding size, and embedding invertibility.
Another open question is how to build systems that can defend against inversion. Is it possible to create models that can successfully embed text such that embeddings remain useful while obfuscating the text that created them?
Finally, we are excited to see how our method might apply to other modalities. The main idea behind vec2text (a sort of iterative optimization in embedding space) doesn’t use any text-specific tricks. It’s a method that iteratively recovers information contained in any fixed input, given black-box access to a model. It remains to be seen how these ideas might apply to inverting embeddings from other modalities as well as to approaches more general than embedding inversion.
To use our models to invert text embeddings, or to get started running embedding inversion experiments yourself, check out our Github repository: https://github.com/jxmorris12/vec2text
References
Inverting Visual Representations with Convolutional Networks (2015), https://arxiv.org/abs/1506.02753
Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers (2021), https://proceedings.mlr.press/v139/teterwak21a/teterwak21a.pdf
Text Embeddings Reveal (Almost) As Much As Text (2023), https://arxiv.org/abs/2310.06816
Language Model Inversion (2024), https://arxiv.org/abs/2311.13647
Author Bio
Jack Morris is a PhD student at Cornell Tech in New York City. He works on research at the intersection of machine learning, natural language processing, and security. He’s especially interested in the information content of deep neural representations like embeddings and classifier outputs.
Citation
For attribution in academic contexts or books, please cite this work as
Jack Morris, "Do text embeddings perfectly encode text?", The Gradient, 2024.
BibTeX citation:
@article{morris2024inversion,
author = {Jack Morris},
title = {Do text embeddings perfectly encode text?},
journal = {The Gradient},
year = {2024},
howpublished = {\url{https://thegradient.pub/text-embedding-inversion},
}
Ethics & Policy
AI and ethics – what is originality? Maybe we’re just not that special when it comes to creativity?
I don’t trust AI, but I use it all the time.
Let’s face it, that’s a sentiment that many of us can buy into if we’re honest about it. It comes from Paul Mallaghan, Head of Creative Strategy at We Are Tilt, a creative transformation content and campaign agency whose clients include the likes of Diageo, KPMG and Barclays.
Taking part in a panel debate on AI ethics at the recent Evolve conference in Brighton, UK, he made another highly pertinent point when he said of people in general:
We know that we are quite susceptible to confident bullshitters. Basically, that is what Chat GPT [is] right now. There’s something reminds me of the illusory truth effect, where if you hear something a few times, or you say it here it said confidently, then you are much more likely to believe it, regardless of the source. I might refer to a certain President who uses that technique fairly regularly, but I think we’re so susceptible to that that we are quite vulnerable.
And, yes, it’s you he’s talking about:
I mean all of us, no matter how intelligent we think we are or how smart over the machines we think we are. When I think about trust, – and I’m coming at this very much from the perspective of someone who runs a creative agency – we’re not involved in building a Large Language Model (LLM); we’re involved in using it, understanding it, and thinking about what the implications if we get this wrong. What does it mean to be creative in the world of LLMs?
Genuine
Being genuine, is vital, he argues, and being human – where does Human Intelligence come into the picture, particularly in relation to creativity. His argument:
There’s a certain parasitic quality to what’s being created. We make films, we’re designers, we’re creators, we’re all those sort of things in the company that I run. We have had to just face the fact that we’re using tools that have hoovered up the work of others and then regenerate it and spit it out. There is an ethical dilemma that we face every day when we use those tools.
His firm has come to the conclusion that it has to be responsible for imposing its own guidelines here to some degree, because there’s not a lot happening elsewhere:
To some extent, we are always ahead of regulation, because the nature of being creative is that you’re always going to be experimenting and trying things, and you want to see what the next big thing is. It’s actually very exciting. So that’s all cool, but we’ve realized that if we want to try and do this ethically, we have to establish some of our own ground rules, even if they’re really basic. Like, let’s try and not prompt with the name of an illustrator that we know, because that’s stealing their intellectual property, or the labor of their creative brains.
I’m not a regulatory expert by any means, but I can say that a lot of the clients we work with, to be fair to them, are also trying to get ahead of where I think we are probably at government level, and they’re creating their own frameworks, their own trust frameworks, to try and address some of these things. Everyone is starting to ask questions, and you don’t want to be the person that’s accidentally created a system where everything is then suable because of what you’ve made or what you’ve generated.
Originality
That’s not necessarily an easy ask, of course. What, for example, do we mean by originality? Mallaghan suggests:
Anyone who’s ever tried to create anything knows you’re trying to break patterns. You’re trying to find or re-mix or mash up something that hasn’t happened before. To some extent, that is a good thing that really we’re talking about pattern matching tools. So generally speaking, it’s used in every part of the creative process now. Most agencies, certainly the big ones, certainly anyone that’s working on a lot of marketing stuff, they’re using it to try and drive efficiencies and get incredible margins. They’re going to be on the race to the bottom.
But originality is hard to quantify. I think that actually it doesn’t happen as much as people think anyway, that originality. When you look at ChatGPT or any of these tools, there’s a lot of interesting new tools that are out there that purport to help you in the quest to come up with ideas, and they can be useful. Quite often, we’ll use them to sift out the crappy ideas, because if ChatGPT or an AI tool can come up with it, it’s probably something that’s happened before, something you probably don’t want to use.
More Human Intelligence is needed, it seems:
What I think any creative needs to understand now is you’re going to have to be extremely interesting, and you’re going to have to push even more humanity into what you do, or you’re going to be easily replaced by these tools that probably shouldn’t be doing all the fun stuff that we want to do. [In terms of ethical questions] there’s a bunch, including the copyright thing, but there’s partly just [questions] around purpose and fun. Like, why do we even do this stuff? Why do we do it? There’s a whole industry that exists for people with wonderful brains, and there’s lots of different types of industries [where you] see different types of brains. But why are we trying to do away with something that allows people to get up in the morning and have a reason to live? That is a big question.
My second ethical thing is, what do we do with the next generation who don’t learn craft and quality, and they don’t go through the same hurdles? They may find ways to use {AI] in ways that we can’t imagine, because that’s what young people do, and I have faith in that. But I also think, how are you going to learn the language that helps you interface with, say, a video model, and know what a camera does, and how to ask for the right things, how to tell a story, and what’s right? All that is an ethical issue, like we might be taking that away from an entire generation.
And there’s one last ‘tough love’ question to be posed:
What if we’re not special? Basically, what if all the patterns that are part of us aren’t that special? The only reason I bring that up is that I think that in every career, you associate your identity with what you do. Maybe we shouldn’t, maybe that’s a bad thing, but I know that creatives really associate with what they do. Their identity is tied up in what it is that they actually do, whether they’re an illustrator or whatever. It is a proper existential crisis to look at it and go, ‘Oh, the thing that I thought was special can be regurgitated pretty easily’…It’s a terrifying thing to stare into the Gorgon and look back at it and think,’Where are we going with this?’. By the way, I do think we’re special, but maybe we’re not as special as we think we are. A lot of these patterns can be matched.
My take
This was a candid worldview that raised a number of tough questions – and questions are often so much more interesting than answers, aren’t they? The subject of creativity and copyright has been handled at length on diginomica by Chris Middleton and I think Mallaghan’s comments pretty much chime with most of that.
I was particularly taken by the point about the impact on the younger generation of having at their fingertips AI tools that can ‘do everything, until they can’t’. I recall being horrified a good few years ago when doing a shift in a newsroom of a major tech title and noticing that the flow of copy had suddenly dried up. ‘Where are the stories?’, I shouted. Back came the reply, ‘Oh, the Internet’s gone down’. ‘Then pick up the phone and call people, find some stories,’ I snapped. A sad, baffled young face looked back at me and asked, ‘Who should we call?’. Now apart from suddenly feeling about 103, I was shaken by the fact that as soon as the umbilical cord of the Internet was cut, everyone was rendered helpless.
Take that idea and multiply it a billion-fold when it comes to AI dependency and the future looks scary. Human Intelligence matters
Ethics & Policy
Experts gather to discuss ethics, AI and the future of publishing
Publishing stands at a pivotal juncture, said Jeremy North, president of Global Book Business at Taylor & Francis Group, addressing delegates at the 3rd International Conference on Publishing Education in Beijing. Digital intelligence is fundamentally transforming the sector — and this revolution will inevitably create “AI winners and losers”.
True winners, he argued, will be those who embrace AI not as a replacement for human insight but as a tool that strengthens publishing’s core mission: connecting people through knowledge. The key is balance, North said, using AI to enhance creativity without diminishing human judgment or critical thinking.
This vision set the tone for the event where the Association for International Publishing Education was officially launched — the world’s first global alliance dedicated to advancing publishing education through international collaboration.
Unveiled at the conference cohosted by the Beijing Institute of Graphic Communication and the Publishers Association of China, the AIPE brings together nearly 50 member organizations with a mission to foster joint research, training, and innovation in publishing education.
Tian Zhongli, president of BIGC, stressed the need to anchor publishing education in ethics and humanistic values and reaffirmed BIGC’s commitment to building a global talent platform through AIPE.
BIGC will deepen academic-industry collaboration through AIPE to provide a premium platform for nurturing high-level, holistic, and internationally competent publishing talent, he added.
Zhang Xin, secretary of the CPC Committee at BIGC, emphasized that AIPE is expected to help globalize Chinese publishing scholarships, contribute new ideas to the industry, and cultivate a new generation of publishing professionals for the digital era.
Themed “Mutual Learning and Cooperation: New Ecology of International Publishing Education in the Digital Intelligence Era”, the conference also tackled a wide range of challenges and opportunities brought on by AI — from ethical concerns and content ownership to protecting human creativity and rethinking publishing values in higher education.
Wu Shulin, president of the Publishers Association of China, cautioned that while AI brings major opportunities, “we must not overlook the ethical and security problems it introduces”.
Catriona Stevenson, deputy CEO of the UK Publishers Association, echoed this sentiment. She highlighted how British publishers are adopting AI to amplify human creativity and productivity, while calling for global cooperation to protect intellectual property and combat AI tool infringement.
The conference aims to explore innovative pathways for the publishing industry and education reform, discuss emerging technological trends, advance higher education philosophies and talent development models, promote global academic exchange and collaboration, and empower knowledge production and dissemination through publishing education in the digital intelligence era.
yangyangs@chinadaily.com.cn
Ethics & Policy
Experts gather to discuss ethics, AI and the future of publishing
Publishing stands at a pivotal juncture, said Jeremy North, president of Global Book Business at Taylor & Francis Group, addressing delegates at the 3rd International Conference on Publishing Education in Beijing. Digital intelligence is fundamentally transforming the sector — and this revolution will inevitably create “AI winners and losers”.
True winners, he argued, will be those who embrace AI not as a replacement for human insight but as a tool that strengthens publishing”s core mission: connecting people through knowledge. The key is balance, North said, using AI to enhance creativity without diminishing human judgment or critical thinking.
This vision set the tone for the event where the Association for International Publishing Education was officially launched — the world’s first global alliance dedicated to advancing publishing education through international collaboration.
Unveiled at the conference cohosted by the Beijing Institute of Graphic Communication and the Publishers Association of China, the AIPE brings together nearly 50 member organizations with a mission to foster joint research, training, and innovation in publishing education.
Tian Zhongli, president of BIGC, stressed the need to anchor publishing education in ethics and humanistic values and reaffirmed BIGC’s commitment to building a global talent platform through AIPE.
BIGC will deepen academic-industry collaboration through AIPE to provide a premium platform for nurturing high-level, holistic, and internationally competent publishing talent, he added.
Zhang Xin, secretary of the CPC Committee at BIGC, emphasized that AIPE is expected to help globalize Chinese publishing scholarships, contribute new ideas to the industry, and cultivate a new generation of publishing professionals for the digital era.
Themed “Mutual Learning and Cooperation: New Ecology of International Publishing Education in the Digital Intelligence Era”, the conference also tackled a wide range of challenges and opportunities brought on by AI — from ethical concerns and content ownership to protecting human creativity and rethinking publishing values in higher education.
Wu Shulin, president of the Publishers Association of China, cautioned that while AI brings major opportunities, “we must not overlook the ethical and security problems it introduces”.
Catriona Stevenson, deputy CEO of the UK Publishers Association, echoed this sentiment. She highlighted how British publishers are adopting AI to amplify human creativity and productivity, while calling for global cooperation to protect intellectual property and combat AI tool infringement.
The conference aims to explore innovative pathways for the publishing industry and education reform, discuss emerging technological trends, advance higher education philosophies and talent development models, promote global academic exchange and collaboration, and empower knowledge production and dissemination through publishing education in the digital intelligence era.
yangyangs@chinadaily.com.cn
-
Funding & Business7 days ago
Kayak and Expedia race to build AI travel agents that turn social posts into itineraries
-
Jobs & Careers6 days ago
Mumbai-based Perplexity Alternative Has 60k+ Users Without Funding
-
Mergers & Acquisitions6 days ago
Donald Trump suggests US government review subsidies to Elon Musk’s companies
-
Funding & Business6 days ago
Rethinking Venture Capital’s Talent Pipeline
-
Jobs & Careers6 days ago
Why Agentic AI Isn’t Pure Hype (And What Skeptics Aren’t Seeing Yet)
-
Funding & Business4 days ago
Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30%
-
Funding & Business7 days ago
From chatbots to collaborators: How AI agents are reshaping enterprise work
-
Jobs & Careers6 days ago
Astrophel Aerospace Raises ₹6.84 Crore to Build Reusable Launch Vehicle
-
Funding & Business4 days ago
HOLY SMOKES! A new, 200% faster DeepSeek R1-0528 variant appears from German lab TNG Technology Consulting GmbH
-
Funding & Business6 days ago
Europe’s Most Ambitious Startups Aren’t Becoming Global; They’re Starting That Way