Image generating AIs such as DAL-E or Midjourney tend to have somewhat limited language capabilities, and require short, to-the-point prompts.

Writing such prompts can be a chore, especially since to get the right results, you may need to try different versions of the same "image prompt".

Here is where a text-based LLM such as Chat GPT can help fill the gap.

Using prompt engineering, we can instruct Chat GPT to build a prompt aimed at an image-generating AI. The image-generating could be Chat GPT's sibling, DAL-E, or it could be an image generator from another organisation.

The prompt we send to Chat GPT can generate several versions of the prompt, so we can experiment with the image generator. Also, we can ask Chat GPT to try generate a prompt such that the style of the image will suit the original text. For example, if the original text is a fantasy novel, or a financial blurb, we would want very different kinds of images for each.

Because Chat GPT is also capable of summarizing text, the with this approach the original text can be much longer that what the image generator would normally be capable of processing. Chat GPT has a total token size of about 4000, leaving about 2000 tokens for the input, including our prompt. This could be about 500 words. Via a 'chunking' step with a simple Python script, we can increase this limit, asking Chat GPT to summarise each chunk at a time, and then finally producing a "summary of the summaries".

In this way, we can automatically generate images for quite long input texts.

Summarizing the original text

During summarization, we use a system prompt to instruct Chat GPT to reduce the size of the input text, without losing its overall meaning:

SYSTEM: You are a helpful text analyzer that knows how to summarize a text. USER: Summarize this text denoted by backticks: ```{input_text}```

We can repeat this step iteratively, to create several summaries of a large text. We create the final summary by concatenating these summaries into one text, and finally summarizing that.

Building the image-generating prompt

Once the summary is ready, we can pass that along, wrapped in a "image-generating prompt builder" prompt, such as this:

Analyze the given input text, and create a series of 3 single-sentence summaries. The output text is intended for an image-generation AI and must describe an image that is appropriate to the text. The image style and theme should match the style and genre of the text. The input text is delimited by triple backticks.

The output format must be valid JSON, with one field 'summaries' that is an array of 3 summaries.

Each summary should be 50 words long.

text: ```{summary_text}```

This will output a set of 3 image-generating prompts, that try to encourage the image-generating AI to produce a suitable image matching the style and content of the summary text.

Result: illustrating an excerpt from Lord of The Rings by J.R.R. Tolkien

Input text:

At that moment Gandalf lifted his staff, and crying aloud he smote the bridge before him. The staff broke asunder and fell from his hand. A blinding sheet of white flame sprang up. The bridge cracked. Right at the Balrog's feet it broke, and the stone upon which it stood crashed into the gulf, while the rest remained, poised, quivering like a tongue of rock thrust out into emptiness. With a terrible cry the Balrog fell forward, and its shadow plunged down and vanished. But even as it fell it swung its whip, and the thongs lashed and curled about the wizard's knees, dragging him to the brink. He staggered and fell, grasped vainly at the stone, and slid into the abyss. 'Fly, you fools!' he cried, and was gone."

Generated 'caption' and images generated from that:

None
Gandalf uses his staff to break the bridge, causing a blinding sheet of white flame to appear.
None
Gandalf uses his staff to break the bridge, causing a blinding sheet of white flame to appear.

Result: illustrating "The Yellow Wallpaper" by C. Perkins

Input text:

"I really have discovered something at last. Through watching so much at night, when it changes so, I have finally found out. The front pattern does move — and no wonder! The woman behind shakes it! Sometimes I think there are a great many women behind, and sometimes only one, and she crawls around fast, and her crawling shakes it all over. Then in the very ' bright spots she keeps still, and in the very shady spots she just takes hold of the bars and shakes them hard. And she is all the time trying to climb through. But nobody could climb through that pattern — it strangles so:…"

Generated 'captions' and images generated from those:

None
In bright spots, the woman behind the pattern keeps still, but in shady spots, she shakes the bars hard and tries to climb through.
None
The pattern suppresses anyone who tries to climb through, but the woman behind it continues to shake it and crawl around.

Summary

Generating images via an AI such as DAL-E or Midjourney can be somewhat tricky and requires experimenting with various input prompts. This task can be mostly automated by using an LLM that has been trained to generate text, such as Chat GPT. The remaining manual task is to curate the final output of the image generator.