December 28, 2023

Generating MS Word files

Adding reporting functionality is almost a rite of passage, a sign that your software is all grown up.
It might be an invoice, a customer report, or an employee document. Any time the data in our system needs to be presented, printed, or processed, we need it in the form of a document.

In this blog we will build a minimal MS Word document generator.

Let’s have a word

There are many possible approaches to generating files in code. They all share roughly the same logical parts.

A straightforward (naive) way to generate files is to build them in code. That means fetching the data from our system or database and writing out whatever bytes are needed.

And while this works, it is the opposite of flexible. Any desired change in the output means we need to change the code.

Therefore, in practice we divide and conquer. First, we separate the design and static information into a template and create a language that allows us to pull data. Then we create an engine that can interpret (or run) the template.
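
To make the split concrete, here is a minimal sketch in which a plain string stands in for the template and a single regular expression plays the role of the engine. The names `render`, `template` and `data`, and the sample values, are made up for this illustration; the engine built later in this post works on Word XML rather than on a string.

```typescript
// The "template": static text plus tags that pull in data.
const template = "Invoice for <customer.name>, due on <invoice.dueDate>.";

// The "data": normally fetched from our system or database (sample values here).
const data: Record<string, string> = {
	"<customer.name>": "Alfreds Futterkiste",
	"<invoice.dueDate>": "2024-01-31",
};

// The "engine": replace every <tag> with its value, or with nothing if the tag is unknown.
function render(tpl: string, values: Record<string, string>): string {
	return tpl.replace(/<[^<>]+>/g, (tag) => values[tag] || "");
}

console.log(render(template, data));
// prints "Invoice for Alfreds Futterkiste, due on 2024-01-31."
```

Changing the wording of the output now means editing the template, not the code.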

For the template definition, we can roll our own file format.

Or, as we are going to do in this blog, we can use a Word file as the template itself!
Now we can use all the design features of MS Word, and we don’t need to build a template file editor!
Two birds, one stone.

The “only” thing left is to build the template engine, which can:

  1. read a Word file
  2. insert data into the document
  3. write a Word file

What’s the Word

Thankfully, the days of undocumented binary Office documents are long past. These days the specification is open, and an Office file is really just a bunch of files in a zip archive. See for yourself: rename any .docx file to .zip and look inside. The meat is in the word/document.xml file. For all the hairy details see here.
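
To give a feel for what is inside, here is a small sketch. The XML below is a hand-written, heavily trimmed illustration of word/document.xml (a real file carries many more namespaces, properties and attributes), and `extractTexts` is a hypothetical helper that uses a regular expression purely for demonstration. Paragraphs are w:p elements, runs are w:r elements, and the visible text lives in w:t elements.

```typescript
// Hand-written, trimmed illustration of word/document.xml content.
const sampleDocumentXml = `
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
	<w:body>
		<w:p>
			<w:r><w:t xml:space="preserve">Dear </w:t></w:r>
			<w:r><w:t>&lt;customer</w:t></w:r>
			<w:r><w:t>.name&gt;</w:t></w:r>
		</w:p>
	</w:body>
</w:document>`;

// Hypothetical helper: list the raw contents of every <w:t> element.
// A regular expression is fine for a demo, but it is not a real XML parser.
function extractTexts(xml: string): string[] {
	const pattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
	const texts: string[] = [];
	let match: RegExpExecArray | null;
	while ((match = pattern.exec(xml)) !== null) {
		texts.push(match[1]);
	}
	return texts;
}

const texts = extractTexts(sampleDocumentXml);
// texts is ["Dear ", "&lt;customer", ".name&gt;"]
console.log(texts);
```

Two things are worth noticing in this sample: a single piece of text can end up spread over several text elements, and angle brackets are stored escaped in the raw XML. A DOM parser decodes them again, so code that reads textContent sees plain < and > characters.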

Approach

Our approach will be to look for special text tags like <customer.name> in the MS Word template and replace them with data from our system (database). And we will build it completely in the browser.
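
The replacement code further down looks values up in a `dataMap` keyed by the complete tag text, angle brackets included. How that map gets filled is not shown in this post, so the following is only a sketch of one possible way: flattening a nested record into tag/value pairs. `buildDataMap`, the `DataValue` type and the sample data are all hypothetical.

```typescript
type DataValue = string | number | boolean | null | undefined | { [key: string]: DataValue };

// Hypothetical helper: flatten nested data into "<path.to.field>" => string pairs.
function buildDataMap(
	data: { [key: string]: DataValue },
	prefix = "",
	map: Record<string, string> = {},
): Record<string, string> {
	for (const key of Object.keys(data)) {
		const value = data[key];
		const path = prefix ? prefix + "." + key : key;
		if (value !== null && typeof value === "object") {
			buildDataMap(value, path, map); // descend into nested objects
		} else {
			// missing values become an empty string, everything else is stringified
			map["<" + path + ">"] = value == null ? "" : String(value);
		}
	}
	return map;
}

// Sample data, standing in for whatever our system or database returns.
const dataMap = buildDataMap({
	customer: { name: "Alfreds Futterkiste", city: "Berlin" },
	invoice: { number: 1042 },
});
console.log(dataMap["<customer.name>"]); // prints "Alfreds Futterkiste"
```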

Open

First, let's open and parse the Word template file. We will be using the JSZip library and the browser's built-in DOM XML parser. As you can see, this part is quite straightforward.

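import JSZip from "jszip";

// wordFileBlob is the .docx template, e.g. a File picked by the user
const zip = await JSZip.loadAsync(wordFileBlob);
const fname = "word/document.xml";
const docXml = await zip.file(fname)?.async("text");
if (!docXml)
	throw new Error("Invalid template format");

// parse the main document part into a DOM tree
const parser = new DOMParser();
const xmlDoc = parser.parseFromString(docXml, "text/xml");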

Find & Replace

The text in a Word file is stored in paragraphs and runs, so we will look for our special tags inside them. Because Word can (and will) break our tags across several text spans (the w:t elements inside the runs), we have to deal with it and keep looking for the rest of the tag in the following spans.

Once we find a match, we replace the tag in the first span with data from our system/database. If the match stretches over multiple spans, we clear the spans in between and cut the remainder of the tag from the last span.

let textStack: Element[] = [];
let textAccum = "";
let textProlog = "";
const resetText = () => {
	textAccum = "";
	textStack = [];
	textProlog = "";
}
let node: Element | null = xmlDoc.documentElement;
const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
while (node) {
	// paragraph -> reset search
	if (node.localName === "p" && node.namespaceURI === wordNS) {
		resetText(); // unterminated tag, ignore.
	}
	// text span 
	if (node.localName === "t" && node.namespaceURI === wordNS) {
		let text = node.textContent;
		if (text) {
			// 1. look for the beginning of the tag <...
			const start = text.indexOf("<");
			if (start >= 0) {
				resetText(); // just in case we have an extra < and then comes a real <tag>
				textProlog = text.substring(0, start); // text before < that should be preserved
				textStack = [node];
				textAccum = text.substring(start);
			}
			
			const end = text.indexOf(">", Math.max(start, 0)); // only a > after the < found in this span can close the tag
			if (textStack.length > 0 && end >= 0) {
				// 2. We have a start and end
				const e_length = ">".length;
				if (start >= 0)
					textAccum = text.substring(start, end + e_length);
				else
					textAccum += text.substring(0, end + e_length);

				// dataMap contains key=value pairs. replacing <customer.name> with "Alfred's futterkiste"
				const replace = dataMap[textAccum] || "";

				const restOfText = text.substring(end + e_length); // text following the tag that we want to preserve

				// first span will be the prefix + the data value of the tag from the dataMap.
				const textElement = textStack[0];
				textElement.textContent = textProlog + replace;

				// if the tag is fully contained in one span append the text following the tag
				if (node === textElement)
					node.textContent = node.textContent + restOfText; 
				else
					node.textContent = restOfText; // otherwise just replace the text content completely.

				// Clear all but the first span. Observe the span with the end > is never pushed to the textStack.
				for (let i = 1; i < textStack.length; i++) {
					textStack[i].textContent = "";
				}

				resetText();
			}
			else if (start < 0 && textStack.length > 0) {
				// 3. a span between begin and end, just add the text and push the node.
				textAccum += text;
				textStack.push(node);
			}
		}
	}

	// traverse the tree: go to the children first, then the siblings, otherwise back up to the parent.
	if (node.firstElementChild)
		node = node.firstElementChild;
	else { // no children: take the next sibling, climbing up until one exists
		while (node) {
			if (node.nextElementSibling) {
				node = node.nextElementSibling;
				break;
			}
			node = node.parentElement;
		}
	}
}
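
Because the tree walk and the tag matching are interleaved above, it can help to look at the matching on its own. The following is a simplified, DOM-free sketch of the same idea: each string in the array stands in for the text of one w:t element within a single paragraph. `replaceTagsInRuns` is a hypothetical name, and this sketch completes at most one tag per text element.

```typescript
// Each string models the text of one <w:t> element inside a single paragraph.
function replaceTagsInRuns(runs: string[], dataMap: Record<string, string>): string[] {
	const out = runs.slice();
	let first = -1;  // index of the element in which the current tag started
	let prolog = ""; // text before "<" in that element, to be preserved
	let accum = "";  // tag text collected so far

	for (let i = 0; i < out.length; i++) {
		const text = out[i];
		const start = text.indexOf("<");
		if (start >= 0) {
			// a new "<" always restarts the search
			first = i;
			prolog = text.substring(0, start);
			accum = text.substring(start);
		}
		// only a ">" after the "<" found in this element can close the tag
		const end = text.indexOf(">", Math.max(start, 0));
		if (first >= 0 && end >= 0) {
			accum = start >= 0 ? text.substring(start, end + 1) : accum + text.substring(0, end + 1);
			const rest = text.substring(end + 1); // text after the tag, to be preserved
			const value = dataMap[accum] || "";
			for (let j = first + 1; j < i; j++) out[j] = ""; // clear elements in between
			if (first === i) {
				out[i] = prolog + value + rest;
			} else {
				out[first] = prolog + value;
				out[i] = rest;
			}
			first = -1; prolog = ""; accum = "";
		} else if (start < 0 && first >= 0) {
			accum += text; // an element between the start and the end of the tag
		}
	}
	return out;
}

const replaced = replaceTagsInRuns(
	["Dear <cust", "omer.na", "me>, welcome"],
	{ "<customer.name>": "Alfreds Futterkiste" },
);
// replaced is ["Dear Alfreds Futterkiste", "", ", welcome"]
console.log(replaced.join(""));
```

The number of elements stays the same, so any formatting attached to the surrounding runs is left alone; only their text changes.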

Multiline bonus

A simple newline is unfortunately not enough to break text into multiple lines. Instead, we have to put each line into a separate text span and add break elements between them.

const newText = textElement.textContent ?? "";
if (newText.indexOf('\n') >= 0) {
	// MS Word does not render a plain \n as a line break. Replace each one with a break element.
	const lines = newText.split('\n');
	textElement.textContent = lines[0]; // assign current element text to first line.

	// insert new elements to our parent
	const parent = textElement.parentElement!;
	// insert before my sibling (if null this means the end).
	const insertPoint = textElement.nextElementSibling;

	for (let i = 1; i < lines.length; i++) {
		if (lines[i]) {
			// insert MS Word break element
			const brElement = xmlDoc.createElementNS(wordNS, "br");
			parent.insertBefore(brElement, insertPoint);

			// insert MS Word text span element
			const lineElement = xmlDoc.createElementNS(wordNS, "t");
			const lineTextNode = xmlDoc.createTextNode(lines[i]); // inner 'text' node
			lineElement.appendChild(lineTextNode);
			parent.insertBefore(lineElement, insertPoint);
		}
	}
}
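
To see what the result looks like as XML, here is a DOM-free sketch that turns a multi-line value into the markup that would sit inside one run: a w:t element per line with a w:br element for each newline. `escapeXml` and `linesToRunXml` are hypothetical helpers. Two choices are specific to this sketch: it emits a break for every newline, blank lines included, and it marks each w:t with xml:space="preserve" so that Word keeps leading and trailing spaces.

```typescript
// Escape the characters that cannot appear verbatim in XML text content.
function escapeXml(text: string): string {
	return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: build the children of one run for a multi-line value.
function linesToRunXml(value: string): string {
	const lines = value.split(/\r?\n/);
	const parts: string[] = [];
	lines.forEach((line, index) => {
		if (index > 0) parts.push("<w:br/>"); // one break per newline
		if (line) parts.push('<w:t xml:space="preserve">' + escapeXml(line) + "</w:t>");
	});
	return parts.join("");
}

console.log(linesToRunXml("Line one\nLine two"));
// prints <w:t xml:space="preserve">Line one</w:t><w:br/><w:t xml:space="preserve">Line two</w:t>
```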

Serialize, zip and done

The rest is quite straightforward: we serialize the document object into text and replace the document.xml in the zip.

const serializer = new XMLSerializer();
const resultXml = serializer.serializeToString(xmlDoc);

const fname = "word/document.xml";
zip.file(fname, resultXml);

const blob = await zip.generateAsync({ type: "blob" });
return blob;

Finally, we use the anchor trick to start a file download: create a temporary link that points at the blob, click it from code, and clean up afterwards.

export function download(data: BlobPart, filename: string, type: string) {
	const file = new Blob([data], { type: type });
	
	const a = document.createElement("a");
	const url = URL.createObjectURL(file);
	a.href = url;
	a.download = filename;
	document.body.appendChild(a);
	a.click();
	setTimeout(function () {
		document.body.removeChild(a);
		window.URL.revokeObjectURL(url);
	}, 0);
}

download(blob, "GeneratedWordFile.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

Word? Word

There you have it: a fully functional Word template engine.
If this piqued your interest, there are many paths from here.
You can extend the functionality and add tables or images. Or you could generate Excel output instead.

Limited as it is, our customers use the Word template engine quite extensively. Among other things, we use it as a CV file generator in the filip app.

Happy hacking!