How to Print Pretty with missing close tags.

Doctor Rashir

I am looking at a Quicken QFX log file that is in a sort of XML type format. The format has many missing End tags so this causes the XML Tools - Pretty Print to indent nearly forever.

Is there a way to align the Start and End tags that are present?

For example in the following code how do I align the bolded lines:

<OFX>
	<SIGNONMSGSRQV1>
		**<SONRQ>**
			<DTCLIENT>20250520104016.123[-7:MST]
				<USERID>anonymous00000000000000000000000
					<USERPASS>X
						<GENUSERKEY>N
							<LANGUAGE>ENG
								<APPID>QWIN
									<APPVER>2700
									**</SONRQ>**
						</SIGNONMSGSRQV1>
						<INTU.BRANDMSGSRQV1>
							<INTU.BRANDTRNRQ>
								<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
									<INTU.BRANDRQ>

I am running on Win 11, latest update and Np++ v8.8.3

PeterJones

@Doctor-Rashir said in How to Print Pretty with missing close tags.:

I am looking at a Quicken QFX log file that is in a sort of XML type format. The format has many missing End tags so this causes the XML Tools - Pretty Print to indent nearly forever.

Is there a way to align the Start and End tags that are present?

XML Tools is designed to work with well-formed XML. If the input is not well-formed (i.e., it has unclosed tags), that is just too much of an edge case. It’s doubtful that any toolmaker could figure out a way to “pretty print” a seemingly random mixture of closed and unclosed tags in any meaningful way.

If you were to unindent everything (Ctrl+A, then Shift+TAB until the indentation is gone, or search for ^\h+ and replace with nothing), and if you knew in advance which tags (like <SONRQ>) had closing pairs, you could use the zone-of-text regex formula from our FAQ, as:

FIND = (?-si:<SONRQ\b|(?!\A)\G)(?s-i:(?!</SONRQ\b).)*?\K(?-si:^(?!\h*</SONRQ))
REPLACE = \t
REPLACE ALL

If I do three steps: unindent, formula(SONRQ) and formula(SIGNONMSGSRQV1), then with your example data, I get

<OFX>
<SIGNONMSGSRQV1>
	<SONRQ>
		<DTCLIENT>20250520104016.123[-7:MST]
		<USERID>anonymous00000000000000000000000
		<USERPASS>X
		<GENUSERKEY>N
		<LANGUAGE>ENG
		<APPID>QWIN
		<APPVER>2700
	</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>
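
For anyone who wants to see the logic outside the Notepad++ regex engine, here is a minimal plain-Python sketch of those three steps. It is not what Notepad++ runs: Python's re module has no \G or \K, so the sketch matches each zone as a whole and indents the lines inside it. The function names unindent and indent_zone are made up for illustration, and this sketch only indents a zone when the closing tag is actually found.

```python
import re


def unindent(text):
    """Strip leading spaces and tabs from every line (like replacing ^\\h+ with nothing)."""
    return re.sub(r"(?m)^[ \t]+", "", text)


def indent_zone(text, tag):
    """Add one tab to every line strictly between the <tag> line and the </tag> line."""
    zone = re.compile(
        # group 1: the opening tag through the end of its line
        # group 2: everything up to (not including) the line holding the closing tag
        r"(<{0}\b[^\n]*\n)(.*?)(?=^[ \t]*</{0}\b)".format(re.escape(tag)),
        re.DOTALL | re.MULTILINE,
    )

    def add_tabs(m):
        body = "".join("\t" + line for line in m.group(2).splitlines(True))
        return m.group(1) + body

    return zone.sub(add_tabs, text)


if __name__ == "__main__":
    sample = "<OFX>\n\t<SONRQ>\n\t\t<USERPASS>X\n\t\t\t<APPVER>2700\n\t\t\t</SONRQ>\n"
    print(indent_zone(unindent(sample), "SONRQ"))
```

Calling indent_zone once per closed tag, in any order, stacks the tabs in the same cumulative way as running the Replace All once per tag.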

I don’t know how many other closed tags there are in your file, so I don’t know whether that’s practical for you or not. But it’s the best I can come up with for now without invoking a full-on programming language. At that point, it could be done on the contents of the Notepad++ window using a plugin like PythonScript, or it could just be done at the command line in whatever programming language you wanted to use, without needing the file to be open in Notepad++ (which would make it off-topic here).

I did try to make use of a numbered or named capture group in the BSR section and use a backreference to make the BSR and FR invoke those (see the FAQ for the meaning of BSR / ESR / FR), rather than having to know in advance the names of all the tags… but I couldn’t get those backreference versions to work.

Doctor Rashir

@PeterJones
I really appreciate what you’ve posted. There are many closed tags. And many open tags.
I’m just trying to analyze the error I’m encountering with Quicken. I’ll look at what you propose, but I have to determine how much work it would be to fix all of the tags versus just the ones important to my analysis of the log.

Thanks again.

PeterJones

@Doctor-Rashir ,

If you are willing to use the PythonScript plugin (instructions found in our FAQ), the script below will do the job. I only tested with PythonScript 3, but I tried to write it so that it should also be compatible with the PythonScript 2 in the Plugins Admin; I recommend PythonScript 3.

Script: PrettyPrintBadXML.py

# encoding=utf-8
"""in response to https://community.notepad-plus-plus.org/topic/27254/

This will take malformed XML (many/most tags with no closing tag) and
pretty-print it so that each layer of closed tags indents its contents
"""
from Npp import editor
import re

editor.beginUndoAction()

sEOL = ('\r\n', '\r', '\n')[editor.getEOLMode()]

# First, one tag per line, no indentation
editor.rereplace(r'\s*<', sEOL + r'<', re.MULTILINE)

# get rid of extra newlines at beginning and end (but final line will end with EOL, so N++ shows empty last line)
editor.rereplace(r'\A\s+', '', re.MULTILINE)
editor.rereplace(r'\v+\z', sEOL, re.MULTILINE)

# figure out all the closing tags `</CLOSING>`
closers = {}
def trackClosingTags(m):
    global closers
    closers[m.group(1)] = True
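# [\w.]+ rather than \w+ so that dotted names like INTU.BRANDMSGSRQV1 are also found
editor.research(r'</([\w.]+)\s*>', trackClosingTags)

for tag in closers.keys():
    t = tag.replace('.', r'\.')  # make any dot in the tag name match literally
    f = r'(?-si:<{0}\b|(?!\A)\G)(?s-i:(?!</{0}\b).)*?\K(?-si:^(?!\h*</{0}))'.format(t)
    editor.rereplace(f, '\t', re.MULTILINE)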

editor.endUndoAction()

INPUT FILE:

<OFX> <SIGNONMSGSRQV1> <SONRQ> <DTCLIENT>20250520104016.123[-7:MST] <USERID>anonymous00000000000000000000000 <USERPASS>X <GENUSERKEY>N <LANGUAGE>ENG <APPID>QWIN <APPVER>2700 </SONRQ> </SIGNONMSGSRQV1> <INTU.BRANDMSGSRQV1> <INTU.BRANDTRNRQ> <TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026 <INTU.BRANDRQ> <FAKE> <OTHER> <TAG> <FAKE> <OTHER> <EMBEDDED> <FAKE> <DEEPER> <OTHER> </DEEPER> <OTHER> </EMBEDDED> </TAG>

OUTPUT:

<OFX>
<SIGNONMSGSRQV1>
	<SONRQ>
		<DTCLIENT>20250520104016.123[-7:MST]
		<USERID>anonymous00000000000000000000000
		<USERPASS>X
		<GENUSERKEY>N
		<LANGUAGE>ENG
		<APPID>QWIN
		<APPVER>2700
	</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>
<FAKE>
<OTHER>
<TAG>
	<FAKE>
	<OTHER>
	<EMBEDDED>
		<FAKE>
		<DEEPER>
			<OTHER>
		</DEEPER>
		<OTHER>
	</EMBEDDED>
</TAG>

Essentially, what the script does:

Puts each <XYZ> or </CCCC> starting on its own line, with no indentation
Figures out all the </CCCC> closing tags (so it knows all the tags which will need to be indented)
For each of those CCCC tags, does the indentation replacement I suggested in the last post
Since the indentation it does is cumulative, it will properly nest (as shown with my TAG...EMBEDDED...DEEPER hierarchy, for example)
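
The steps above can also be sketched as a standalone Python function that works on a plain string instead of the Notepad++ editor buffer (for example, for use at the command line). This is only an illustration under some assumptions: the names split_tags, closing_tag_names, indent_between and pretty_print are hypothetical, the sketch allows dots in tag names, and it walks the lines one by one instead of using the \G and \K regex.

```python
import re


def split_tags(text):
    """Put every <TAG> or </TAG> at the start of its own line, with no indentation."""
    text = re.sub(r"\s*<", "\n<", text)
    return text.strip() + "\n"


def closing_tag_names(text):
    """Return the names that appear in a </NAME> closing tag, in first-seen order."""
    seen = []
    for name in re.findall(r"</([\w.]+)\s*>", text):
        if name not in seen:
            seen.append(name)
    return seen


def indent_between(lines, name, indent="\t"):
    """Indent each line after a <name> line, up to (not including) the </name> line."""
    opener = re.compile(r"\s*<{0}(?![\w.])".format(re.escape(name)))
    closer = re.compile(r"\s*</{0}(?![\w.])".format(re.escape(name)))
    out, inside = [], False
    for line in lines:
        if closer.match(line):
            inside = False          # the closing tag lines up with its opening tag
        out.append(indent + line if inside else line)
        if opener.match(line):
            inside = True           # following lines belong to this tag's zone
    return out


def pretty_print(text, indent="\t"):
    """One tag per line, then one cumulative indentation pass per closed tag."""
    lines = split_tags(text).splitlines()
    for name in closing_tag_names(text):
        lines = indent_between(lines, name, indent)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(pretty_print("<TAG> <FAKE> <EMBEDDED> <OTHER> </EMBEDDED> </TAG>"))
```

Because each pass only adds one level to the lines inside its own zone, the order in which the closed tags are processed does not matter. Passing indent="    " to pretty_print gives spaces instead of tabs.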

The script is designed so that after you run the script, if you do Ctrl+Z to UNDO, it will go back to the state before you ran the script.

If you would prefer to indent using spaces instead of the tab character, just change '\t' in the final editor.rereplace line to a quoted string of spaces (as many spaces as you want per indentation level), then save the script before running it.

The PythonScript FAQ explains everything you need to know for how to install the plugin (either PythonScript 2 or 3 [I recommend 3]), how to create the script by copying from this post, and how to run it.

note: the above script will also live at https://github.com/pryrt/nppStuff/blob/main/pythonScripts/nppCommunity/27xxx/p27254_PrettyPrintBadXml.py

Doctor Rashir

@PeterJones

Hey, I ran the script. The result looks much much better than before. But this file is an OFX (Open Financial Exchange) and is not truly XML. The sample I posted is only a small part. The rest contains private info so can’t be posted.

I really appreciate that you spent this time. I think it will work great for my needs.

guy038

Hello, @doctor-rashir, @peterjones and All,

I saw that you proposed a Python script to @doctor-rashir and, indeed, this is surely a better method than regexes for reaching his goal ! However, I would like to talk about my regex solution and how I got from your solution to mine !

First of all, we get rid of all leading tabulation chars with the Edit > Blank Operations > Trim Leading Space command or with the regex S/R : ^\t+ -> Nothing

Now, here is my solution :

FIND (?-si)(?:<Tag>|(?!\A)\G).*\R\K(?!</Tag>)

REPLACE \t

For example, given this INPUT text :

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20250520104016.123[-7:MST]
<USERID>anonymous00000000000000000000000
<USERPASS>X
<GENUSERKEY>N
<LANGUAGE>ENG
<APPID>QWIN
<APPVER>2700
</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20250520104016.123[-7:MST]
<USERID>anonymous00000000000000000000000
<USERPASS>X
<GENUSERKEY>N
<LANGUAGE>ENG
<APPID>QWIN
<APPVER>2700
</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20250520104016.123[-7:MST]
<USERID>anonymous00000000000000000000000
<USERPASS>X
<GENUSERKEY>N
<LANGUAGE>ENG
<APPID>QWIN
<APPVER>2700
</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

If I run, successively, these two S/R :

FIND (?-si)(?:<SONRQ>|(?!\A)\G).*\R\K(?!</SONRQ>)

REPLACE \t

FIND (?-si)(?:<SIGNONMSGSRQV1>|(?!\A)\G).*\R\K(?!</SIGNONMSGSRQV1>)

REPLACE \t

I get this OUTPUT text :

<OFX>
<SIGNONMSGSRQV1>
	<SONRQ>
		<DTCLIENT>20250520104016.123[-7:MST]
		<USERID>anonymous00000000000000000000000
		<USERPASS>X
		<GENUSERKEY>N
		<LANGUAGE>ENG
		<APPID>QWIN
		<APPVER>2700
	</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

<OFX>
<SIGNONMSGSRQV1>
	<SONRQ>
		<DTCLIENT>20250520104016.123[-7:MST]
		<USERID>anonymous00000000000000000000000
		<USERPASS>X
		<GENUSERKEY>N
		<LANGUAGE>ENG
		<APPID>QWIN
		<APPVER>2700
	</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

<OFX>
<SIGNONMSGSRQV1>
	<SONRQ>
		<DTCLIENT>20250520104016.123[-7:MST]
		<USERID>anonymous00000000000000000000000
		<USERPASS>X
		<GENUSERKEY>N
		<LANGUAGE>ENG
		<APPID>QWIN
		<APPVER>2700
	</SONRQ>
</SIGNONMSGSRQV1>
<INTU.BRANDMSGSRQV1>
<INTU.BRANDTRNRQ>
<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
<INTU.BRANDRQ>

How did I “slip” from your solution to mine ?

Let’s start from your solution : (?-si:<SONRQ\b|(?!\A)\G)(?s-i:(?!</SONRQ\b).)*?\K(?-si:^(?!\h*</SONRQ))

First, as some parts of the whole regex do not contain any regex dot, the s modifier is unnecessary in those parts :

(?-i:<SONRQ\b|(?!\A)\G)(?s-i:(?!</SONRQ\b).)*?\K(?-i:^(?!</SONRQ))

Secondly, I preferred to use the whole tag’s formulation :

(?-i:<SONRQ>|(?!\A)\G)(?s-i:(?!</SONRQ>).)*?\K(?-i:^(?!</SONRQ>))

At this point, I said to myself : we need to add a tabulation char at the beginning of all lines between the lines <Tag> and </Tag>. Thus, after detecting the tag, we just need to go through the entire current line with its line-break ( .*\R ) in order to add the \t char right after \R. This leads to the following search regex :

(?-i:<SONRQ>|(?!\A)\G)(?-si:.*\R)\K(?-i:^(?!</SONRQ>))

And, as (?-si:.*\R) can be simply changed as .*\R, with a leading (?-s) part, we get :

(?-s)(?-i:<SONRQ>|(?!\A)\G).*\R\K(?-i:^(?!</SONRQ>))

Now, the beginning of line assertion ^ is not necessary as we are at a location right after \R :

(?-s)(?-i:<SONRQ>|(?!\A)\G).*\R\K(?-i:(?!</SONRQ>))

As the (?-i) is common throughout the whole regex, the -i can be moved to the beginning of the whole regex, giving :

(?-si)(?:<SONRQ>|(?!\A)\G).*\R\K(?:(?!</SONRQ>))

And finally, as the expression (?:(?!</SONRQ>)) can be simplified to (?!</SONRQ>), here is the final step :

(?-si)(?:<SONRQ>|(?!\A)\G).*\R\K(?!</SONRQ>)

All these successive searches, with replacement by \t, do work as expected ! That’s usually how I build any regex : in small steps, one change at a time !

Best Regards,

guy038

PeterJones

@guy038 said in How to Print Pretty with missing close tags.:

Let’s start from your solution

I’d hardly call it “my” solution, since I used your generic find/replace-in-region formula and plugged in reasonable values for the “variables” from that generic formula. The whole point of that generic formula is to make it really easy for anyone to just plug their BSR/ESR/FR into the formula and have it “just work”, without having to optimize or tweak.

If I run, successively,

As @Doctor-Rashir said here, “There are many closed tags”… In other words, it’s not just SONRQ and SIGNONMSGSRQV1, and trying to manually run a separate regular expression for each of the “many closed tags” is thus not practical. That is why I went to a script to automate it.