Google's New Imagen AI Outperforms DALL-E on Text-to-Image Generation Benchmarks

2022-08-20 | By Ms. Angela Sun


Researchers from Google's Brain Team have announced Imagen, a text-to-image AI model that can generate photorealistic images of a scene given a textual description. Imagen outperforms DALL-E 2 on the COCO benchmark, and unlike many similar models, its text encoder is pre-trained only on text data.

The model and several experiments were described in a paper published on arXiv. Imagen uses a Transformer language model to convert the input text into a sequence of embedding vectors. A series of three diffusion models then converts the embeddings into a 1024x1024 pixel image. As part of their work, the team developed an improved diffusion model architecture called Efficient U-Net, as well as a new benchmark suite for text-to-image models called DrawBench. On the COCO benchmark, Imagen achieved a zero-shot FID score of 7.27 (lower is better), outperforming DALL-E 2, the previous best-performing model. The researchers also discussed the potential societal impact of their work, noting:

Our primary aim with Imagen is to advance research on generative methods, using text-to-image synthesis as a test bed. While end-user applications of generative methods remain largely out of scope, we recognize the potential downstream applications of this research are varied and may impact society in complex ways...In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.
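The three-stage cascade described above can be sketched as a simple pipeline. The function names, shapes, and stand-in implementations below are illustrative assumptions, not the paper's actual API; only the overall structure (frozen text encoder, 64x64 base model, two super-resolution stages) follows the source:

```python
import numpy as np

def text_encoder(prompt, dim=4096):
    """Stand-in for the frozen T5 encoder: one embedding vector per token."""
    tokens = prompt.split()
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(tokens), dim))

def diffusion_stage(cond, out_size, low_res=None):
    """Stand-in for one conditional diffusion model: returns an RGB array.
    A real stage would run an iterative denoising loop conditioned on
    the text embeddings (and, for super-resolution, the low-res image)."""
    return np.zeros((out_size, out_size, 3))

def imagen_pipeline(prompt):
    emb = text_encoder(prompt)                      # T5 text embeddings
    base = diffusion_stage(emb, 64)                 # 64x64 base model
    sr1 = diffusion_stage(emb, 256, low_res=base)   # 64 -> 256 super-resolution
    sr2 = diffusion_stage(emb, 1024, low_res=sr1)   # 256 -> 1024 super-resolution
    return sr2

image = imagen_pipeline("A cute corgi lives in a house made out of sushi")
print(image.shape)  # (1024, 1024, 3)
```

The key design point is that the text encoder is frozen and never sees an image; only the three diffusion stages are trained on image data.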

In recent years, several researchers have investigated training multimodal AI models: systems that operate on different types of data, such as text and images. In 2021, OpenAI announced CLIP, a deep-learning model that can map both text and images into the same embedding space, allowing users to tell whether a textual description is a good match for a given image. This model has proven effective at many computer-vision tasks, and OpenAI also used it to create DALL-E, a model that can generate realistic-looking images from text descriptions. CLIP and similar models were trained on datasets of image-text pairs scraped from the internet, similar to the LAION-5B dataset that InfoQ reported on earlier this year.
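The shared-embedding-space idea behind CLIP can be illustrated with a toy example: score each caption against an image by cosine similarity and pick the best match. The embeddings below are invented stand-ins; real CLIP produces them with learned text and image encoders:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings standing in for CLIP's text and image encoders;
# real CLIP maps both modalities into the same learned space.
text_emb = {
    "a photo of a dog": np.array([0.9, 0.1, 0.0]),
    "a photo of a cat": np.array([0.1, 0.9, 0.0]),
}
image_emb = np.array([0.85, 0.15, 0.05])  # pretend this encodes a dog photo

scores = {caption: cosine(emb, image_emb) for caption, emb in text_emb.items()}
best = max(scores, key=scores.get)
print(best)  # "a photo of a dog" scores highest
```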

Instead of training its own text encoder on image-text data, the Google team used an "off-the-shelf" pre-trained text encoder, T5, to convert input text into embeddings. To convert the embeddings into an image, Imagen uses a sequence of diffusion models: generative AI models that use an iterative denoising process to convert Gaussian noise into samples from a data distribution, in this case images. The denoising can be conditioned on additional input. For the first diffusion model, the condition is the input text embedding; this model outputs a 64x64 pixel image. That image is then up-sampled by two "super-resolution" diffusion models, increasing the resolution to 1024x1024 pixels. For these models, Google developed a new deep-learning architecture called Efficient U-Net, which is "simpler, converges faster, and is more memory efficient" than previous U-Net implementations.
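The iterative denoising loop at the heart of a diffusion model can be sketched in a few lines. The linear schedule and the trivial "denoiser" below are toy stand-ins for illustration only, not Imagen's actual sampler or network:

```python
import numpy as np

def sample(denoise, cond, shape, steps=50, seed=0):
    """Toy reverse-diffusion loop: start from Gaussian noise and
    repeatedly pull the sample toward the denoiser's prediction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # pure noise at t = steps
    for t in range(steps, 0, -1):
        x_pred = denoise(x, t, cond)    # model's predicted clean image
        alpha = t / steps               # toy linear schedule
        x = alpha * x + (1 - alpha) * x_pred
    return x

# A trivial "denoiser" that always predicts the conditioning target,
# so the loop should converge to that target.
target = np.full((4, 4), 0.5)
out = sample(lambda x, t, c: c, cond=target, shape=(4, 4))
print(np.allclose(out, target))  # True: the noise has been denoised away
```

In the real model, `denoise` is a neural network conditioned on the text embeddings (and, in the super-resolution stages, on the lower-resolution image).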

"A cute corgi lives in a house made out of sushi" - image source: https://imagen.research.google

In addition to evaluating Imagen on the COCO validation set, the researchers developed a new image-generation benchmark, DrawBench. The benchmark consists of a collection of text prompts that are "designed to probe different semantic properties of models," including composition, cardinality, and spatial relations. DrawBench uses human evaluators to compare two different models. First, each model generates images from the prompts. Then, the evaluators compare the results from the two models, indicating which one produced the better image. Using DrawBench, the Brain team evaluated Imagen against DALL-E 2 and three other similar models; the team found that the judges "exceedingly" preferred the images generated by Imagen over the other models.
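The pairwise human-evaluation protocol boils down to tallying rater preferences between two models. The vote data below is invented for illustration; DrawBench's actual prompts and rater counts are in the paper:

```python
from collections import Counter

# Hypothetical rater verdicts for one DrawBench prompt category:
# each entry is the model a rater preferred ("A", "B", or "tie").
votes = ["A", "A", "B", "A", "tie", "A", "B", "A", "A", "tie"]

tally = Counter(votes)
decided = tally["A"] + tally["B"]      # ignore ties for the preference rate
pref_a = tally["A"] / decided
print(f"Model A preferred in {pref_a:.0%} of decided comparisons")
```

Reporting the preference rate over decided comparisons (with ties shown separately) is one common convention; the paper reports user-preference rates per prompt category.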

On Twitter, Google product manager Sharon Zhou discussed the work, noting that:

As always, [the] conclusion is that we need to keep scaling up [large language models]

In another thread, Google Brain team lead Douglas Eck posted a series of images generated by Imagen, all from variations on a single prompt; Eck modified the prompt by adding words to adjust the style, lighting, and other aspects of the image. Several other example images generated by Imagen can be found on the Imagen project site.


InfoQ.com and all content copyright © 2006-2022 C4Media Inc.