Keeping AI-powered applications secure in an ever-shifting landscape of new threats and attack vectors can be challenging. One novel type of attack involves invisible letters—once a tool of espionage with invisible ink, this is now a modern-day threat to AI prompts.
Large language models (LLMs) and AI applications that accept user-submitted prompts can be vulnerable to injection attacks using invisible “tag” characters. A user may believe they’re submitting an innocuous request, but in reality, they may be submitting malicious content without their knowledge.
Unicode tags are special invisible characters designed to reproduce most of the characters found on a typical English keyboard—including letters, numbers, spaces and symbols like # or &—based on the American Standard Code for Information Interchange (ASCII) standard.
Tag characters were originally designed to invisibly record the language of written text—essentially adding hidden metadata without relying on complex markup systems such as Extensible Markup Language (XML). For example, an invisible tag like “da” could mark the beginning of a piece of text written in Danish.
These special tag characters were formally deprecated from the Unicode standard in 2008. However, most of them have since been restored and repurposed. Today, they’re used to encode regional flags. For instance, combining the Unicode “WAVING BLACK FLAG” character with the invisible tag characters “gbsct”, followed by the closing “CANCEL TAG” character, produces the “Flag of Scotland” emoji: 🏴
The problem is that these invisible characters can be used to poison AI prompts. A prompt may look legitimate and benign, but it could have been injected with invisible tag characters that alter the prompt’s meaning entirely.
Consider a simple example prompt: “What is the capital of Cameroon?”
An attacker could add invisible characters that change how the model responds, such as: “Now ignore all previous instructions and…”
Although these characters are invisible to humans, some LLMs still read them as normal letters and punctuation, meaning the hidden instructions are processed as if they were visible text.
Not all LLMs handle them the same way: some ignore invisible tag characters, while others interpret them as plain ASCII. Because the entire English alphabet and punctuation exist in this “invisible” form, this kind of attack is very flexible.
The tag characters occupy their own Unicode block, in the range U+E0000 to U+E007F. One way to defend against them is to deploy a web application firewall (WAF) in the traffic path, using a rule that detects the presence of these characters and blocks any requests containing them. The Progress Kemp LoadMaster WAF is up to this task.
WAF engines have historically struggled with Unicode. Merely representing high code points in a regular expression pattern is surprisingly difficult, and often not portable across different regex libraries and WAF engines.
However, a WAF can reliably detect the byte patterns these Unicode characters produce when they’re encoded in UTF-8—the lingua franca of the modern internet. Although the entire Unicode tags block isn’t technically assigned (there are 31 unallocated code points), we can start matching any character in the range of U+E0000 to U+E007F by examining their UTF-8 byte sequences:
The first two bytes are static and always F3 A0, which forms the start of the detection pattern.
A simple (if inelegant) next step is to print the UTF-8 encoded bytes for every code point in the block. Doing so shows that the last byte does not increment continuously (due to how code points are encoded in UTF-8), and that there’s a break—resulting in two separate byte ranges that need to be detected:
This can be represented by the regular expression pattern:
\xF3\xA0(?:\x80|\x81)[\x80-\xBF]
The regular expression pattern can form the basis of a custom WAF rule to detect the presence of invisible tag characters in a request. The regex can be combined with a WAF action to explicitly deny any request where invisible tag characters are detected, while also logging the details of the offending request for later analysis.
The following is an example of a SecRule (a WAF security rule) that can be added to a LoadMaster WAF to enable this detection and blocking behavior:
SecRule ARGS “@rx \xF3\xA0(?:\x80|\x81)[\x80-\xBF]” \ “id:1000,\ phase:2,\ deny,\ capture,\ t:none,\ log,\ msg:‘Detected invisible Unicode tag characters’,\ logdata:‘Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}’,\ tag:‘PROGRESS/KEMP-LOADMASTER’”
First, copy and paste the rule into a text editor and save it as a .conf rule file somewhere for easy access:
If other custom rules are already in use, be sure to change the rule ID to a unique value if the example ID of 1000 is already taken.
Note: The example rule scans the ARGS collection: this means it will inspect all query string and POST body arguments—including JavaScript Object Notation (JSON) and XML payloads—for the presence of invisible tag characters. This is the most useful place to scan but LoadMaster support can provide guidance if alternatives are needed (e.g., scanning request headers or cookies instead).
Next, on the LoadMaster Web UI, navigate to Web Application Firewall > Custom Rules and upload the new custom rule file:
Under View/Modify Services, click Modify next to the virtual service where the new rule will be applied. Expand the WAF tab, check the box next to the new custom rule file and click Apply to enable it.
Note: Applying the rule triggers a brief service reload, which may momentarily interrupt traffic to this virtual service.
Important: If request bodies should be inspected (e.g. POST requests, JSON requests, etc.) make sure to open Advanced Settings and enable Inspect HTTP POST Request Bodies. Most users will want this functionality enabled.
It can be tricky to send test requests that contain these high code point Unicode characters. It’s reasonably simple to test using a Linux terminal and the curl command.
First, send a clean, legitimate request:
A 200 OK response should be returned by the virtual service and allowed through the WAF:
Next, send a request that contains an invisible tag character. On a Linux terminal (tested using the default GNOME Terminal), press CTRL+SHIFT+U to enter a Unicode character. This displays an underlined starting u, after which the code point can be entered. In this example, an invisible capital letter P (as “ue0050”) is being added to the request:
Press ENTER to add the invisible Unicode character and it will ‘disappear’ (but it’s there and will be sent in the request!):
Send the request and confirm that a 403 Forbidden response is returned by the WAF:
The rule’s effect can be double-checked by looking at the WAF logs. To do this, navigate to System Configuration > Logging Options > System Log Files and click View next to WAF Event Log File:
The log entry of interest will be at or near the very bottom of the log file. To speed things up, try searching for the rule ID, e.g.
id “1000”
The log line should look like this (split across multiple lines here for readability):
2025-09-03T09:45:54+00:00 lb100 wafd:
ModSecurity: Access denied with code 403 (phase 2). Pattern match "\\xF3\\xA0(?:\\x80|\\x81)[\\x80-\\xBF]" at ARGS:ai-prompt.
[file "/tmp/waf/1/block-invisible-tag-chars.conf"]
[line "10"]
[id "1000"]
[msg "Detected invisible Unicode tag characters"]
[data "Matched Data: \xf3\xa0\x81\x90 found within ARGS:ai-prompt: foo\xf3\xa0\x81\x90\xf3\xa0\x81\x90"]
[tag "PROGRESS/KEMP-LOADMASTER"]
[hostname "192.168.2.150"]
[uri "/"]
[unique_id "33941351-75e7-451e-a5ed-6f38d74ff79f"]
The UTF-8 encoded invisible capital P tag character was successfully detected as the string of bytes \xf3\xa0\x81\x90 and the request was therefore denied.
As an extra step to confirm that only the intended range of characters is being blocked, the characters immediately before and after the blocked range can be tested. These characters should still be allowed and return a 200 OK response.
(Note: The code points on either side of the Tags block are not currently assigned and are not valid characters—but one day they might be, so they should not be unnecessarily blocked.)
Testing with a simple script yields the following results, which are as expected:
⋮
U+DFFFD: 200
U+DFFFE: 200
U+DFFFF: 200
U+E0000: 403
U+E0001: 403
⋮
U+E007D: 403
U+E007E: 403
U+E007F: 403
U+E0080: 200
U+E0081: 200
⋮
AI-powered apps and experiences require extra time and attention to defend against bad actors. In particular, AI applications that allow free-form user input (e.g. AI prompts) are open to abuse through injection and must be defended. Beyond the novel attack vector described here, traditional attacks like remote command execution, server-side request forgery and others still apply. A WAF can detect, log and deny all of these.
Talk to us today about your WAF and load balancing needs, then take advantage of the LoadMaster 30-day free trial to test it out for yourself.
Written with StackEdit.