SHARE

‘Trojan Source’ a Threat to All Source Code, Languages

Researchers have outlined a method that could be used by bad actors to push vulnerabilities into source code that are invisible to human code reviewers. In a paper released this week, two researchers at the University of Cambridge in the UK wrote that the method – which they dub “Trojan Source” – essentially can be […]

Written By

Jeff Burt

Nov 2, 2021

5 minute read

eSecurity Planet content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

Researchers have outlined a method that could be used by bad actors to push vulnerabilities into source code that are invisible to human code reviewers.

In a paper released this week, two researchers at the University of Cambridge in the UK wrote that the method – which they dub “Trojan Source” – essentially can be leveraged against almost every programming language in use today and could be effective in supply-chain attacks similar to the one launched against SolarWinds last year.

“As powerful supply-chain attacks can be launched easily using these techniques, it is essential for organizations that participate in a software supply chain to implement defenses,” Nicholas Boucher and Ross Anderson wrote in their paper. “We have discussed countermeasures that can be used at a variety of levels in the software development toolchain: the language specification, the compiler, the text editor, the code repository, and the build pipeline.”

On a website, they wrote that “if an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviews, downstream software will likely inherit the vulnerability.”

Long-term solutions will come from compilers, most of which already defend against a related attack, they wrote.

Unicode is the Key

The key is using Unicode control characters to reorder tokens in source code at the encoding level, they wrote. These visually reordered tokens can display logic that is semantically correct but differs from the logic presented by the logical ordering of source code tokens.

However, “compilers and interpreters adhere to the logical ordering of source code, not the visual order,” they wrote.

The Trojan Source method enables cybercriminals to create coding that is seen one way by compilers but another way by human reviewers. This puts all source code in jeopardy, the researchers said, noting that they had working proofs of concept for attacks in a range of programming languages: C, C++, C#, JavaScript, Java, Rust, Go and Python. They added that the same method would likely work with other modern languages.

Also read: Top Code Debugging and Code Security Tools

Exploiting Reordered Tokens

The researchers also outlined a number of techniques that can be used to exploit the visual reordering of source code tokens, including “early returns,” which causes “a function to short circuit by executing a return statement that visually appears to be within a comment,” they wrote.

“Commenting-out” leads to a comment visually appearing as code, which is not executed. With “stretched strings,” portions of string literals visually appear as code, which has the same impact as commenting-out and causes string comparisons to fail.

“The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic,” they wrote.

This can be done two ways. The first is by attacking the Unicode bidirectional algorithm (BiDi), which is tracked as CVD-2021-42574. The algorithm determines the order in which the text is displayed. It supports languages that run left to right, such as English and Russian, as well as right to left, including Arabic and Hebrew.

“Bidi overrides enable even single-script characters to be displayed in an order different from their logical encoding,” Boucher and Anderson wrote. “This fact has previously been exploited to disguise the file extensions of malware disseminated by email and to craft adversarial examples for NLP [natural language processing] machine-learning pipelines.”

The researchers held their work under embargo for 99 days … as part of a major coordinated disclosure effort

Another attack uses homoglyphs, or characters that visually are nearly identical. Tracked as CVE-2021-42694.

The researchers held their work under embargo for 99 days to give time for compilers, interpreters, code editors and repositories to implement defenses as part of a major coordinated disclosure effort.

Supply Chain Attacks

What the researchers uncovered could lead to novel supply-chain attacks, which are increasingly being used by cybercriminals, as illustrated by the SolarWinds attack, where a Russian-based gang pushed malicious code into an update package for the company’s Orion remote monitoring and management software used by its customers. Kaseya and its customers suffered a similar attack.

With Trojan Source, “by injecting Unicode Bidi override characters into comments and strings, an adversary can produce syntactically-valid source code in most modern languages for which the display order of characters presents logic that diverges from the real logic,” the researchers wrote. “In effect, we anagram program A into program B. Such an attack could be challenging for a human code reviewer to detect, as the rendered source code looks perfectly acceptable. If the change in logic is subtle enough to go undetected in subsequent testing, an adversary could introduce targeted vulnerabilities without being detected.”

Code Protection Steps

Jon Gaines, senior application security consultant at nVisium, told eSecurity Planet that Trojan Source shows an interesting attack surface, adding that the proof of concepts shown by the researchers are not actually malicious.

“However, in the hands of a sophisticated attacker or group who can actually weaponize it, we would definitely have a dangerous situation on our hands,” Gaines said. “This scenario demonstrates the proactive power of source code reviews and it would be a good best practice not to copy and paste code for the time being. It’s always better to rewrite it yourself and you can also enable your IDE or text editors to display Unicode.”

The issues described in the paper are serious but need to be examined in context, according to Jake Williams, co-founder and CTO at incident response firm BreachQuest.

“To exploit this, a threat actor would need to be able to change source code that is subsequently compiled into binary form by a victim,” Williams told eSecurity Planet. “This itself is already serious. The vulnerabilities here abuse Unicode processing algorithms to make backdoors less likely to be discovered during source code review. If you’re not reviewing source code for vulnerabilities and backdoors before compilation, there’s no practical change to your security after the disclosure of this vulnerability.”

He also noted that there had been concerns about Unicode processing by compilers already. This research showed widespread the issue is, he said.

Boucher and Anderson recommended some steps organizations can take, including having compilers, interpreters and build pipelines supporting Unicode “throw errors or warnings for undetermined bidirectional control characters in comments or string laterals, and for identifiers with mixed-scripts confusable characters.”

They also said language specifications should disallow unterminated bidirectional control characters in comments and string laterals and that code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.

Alexey Vishnyakov, head of Malware Detection at the Positive Technologies Expert Security Center, said the vulnerability is real – but not easily exploited.

“An attacker first needs to assemble a working payload from fragments of rearranged code, which could take weeks or even a year, depending on the length of the program code,” Vishnyakov said in a statement. “They then need to build a gadget to both make the payload work, and make it disguised – another time-intensive project.

“This absolutely affects all software, since all software is written in a programming language. But at the moment, the security community is not aware of attacks using this technique.”

Further reading: Top Vulnerability Management Tools

Jeff Burt

Jeffrey Burt has been a journalist for more than three decades, the last 20-plus years covering technology. During more than 16 years with eWEEK, he covered everything from data center infrastructure and collaboration technology to AI, cloud, quantum computing and cybersecurity. A journalist since 2017, his articles have appeared on such sites as eWEEK, eSecurity Planet, Enterprise Networking Planet, Enterprise Storage Forum, The Next Platform, ITPro Today, Channel Futures, Channelnomics, SecurityNow, and Data Breach Today.