Raw Strings in Python

In Python, escape sequences are used within strings to represent special characters. For instance, \n represents a new line, and \t represents a tab. These sequences allow for more control over the formatting of strings.

However, when you need to include literal backslashes within a string, escape sequences can complicate matters. For instance, if you define a Windows file path as a regular string, Python will misinterpret the backslashes as the start of escape sequences. Consider this example:

path = 'C:\new\text.txt'   # named path rather than str, which would shadow the built-in type
print(path)
# Output:
# C:
# ew        ext.txt

In this case, \n is interpreted as a newline and \t as a tab. The traditional fix is to escape each backslash with a second backslash:

path = 'C:\\new\\text.txt'
print(path)
# Output: C:\new\text.txt

This works, but doubling backslashes can make your code unreadable, especially when dealing with complex strings like LaTeX markup:

latex = "\\phi = \\\\ \\frac{1 + \\sqrt{5}}{2}"

Notice how the abundance of backslashes makes this string hard to read. This situation, where a string becomes cluttered with escape characters, is known as leaning toothpick syndrome.

This is where Python’s raw strings come in.

What are Raw Strings?

A raw string is a string literal in which backslashes do not start escape sequences; each backslash is kept as a literal character. This is particularly useful for strings that contain many backslashes, such as regular expressions or Windows file paths.

To create a raw string, simply prefix your string with ‘r’ or ‘R’. For example, r"\n" will be treated as a literal backslash followed by an ‘n’, rather than as a newline character.

Let’s revisit the earlier example of a file path. Instead of doubling up backslashes, you can use a raw string:

path = r'C:\new\text.txt'
print(path)
# Output: C:\new\text.txt
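As a quick check of what the r prefix changes, the sketch below compares raw and regular literals character by character. The variable names are only illustrative.

```python
# A regular literal turns \n into a single newline character.
regular = "\n"
# The raw literal keeps two characters: a backslash and the letter n.
raw = r"\n"

print(len(regular))    # 1
print(len(raw))        # 2
print(raw == "\\n")    # True
print(R"\t" == r"\t")  # True: the prefix may be upper or lower case

# The LaTeX string from earlier, written once with doubled backslashes
# and once as a raw string. Both spell the same characters.
doubled_latex = "\\phi = \\\\ \\frac{1 + \\sqrt{5}}{2}"
raw_latex = r"\phi = \\ \frac{1 + \sqrt{5}}{2}"
print(doubled_latex == raw_latex)  # True
```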

Raw Strings vs. Regular Strings

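Despite their unique behavior, raw strings are not a different type of string. The r prefix only changes how Python reads the literal in your source code; the resulting object is an ordinary str, identical to the one you would get by writing each backslash as \\ in a regular string.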

To illustrate this, you can use the repr() function. This function returns a string representation of the object, allowing you to see how Python internally handles raw strings.

regular_string = "C:\\Users\\Username\\Documents"
raw_string = r"C:\Users\Username\Documents"

print(repr(regular_string))     # Output: 'C:\\Users\\Username\\Documents'
print(repr(raw_string))         # Output: 'C:\\Users\\Username\\Documents'

Notice that repr() shows doubled backslashes for both strings. This is simply how Python displays a single literal backslash; each string contains only one backslash per separator, and the two strings are identical.
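The same point can be checked directly with the type and an equality test. This is a small sketch that reuses the two variables from the example above.

```python
regular_string = "C:\\Users\\Username\\Documents"
raw_string = r"C:\Users\Username\Documents"

# Both literals produce plain str objects.
print(type(raw_string) is str)       # True
# The two spellings describe exactly the same characters.
print(regular_string == raw_string)  # True
# print() shows the actual content: one backslash per separator.
print(raw_string)                    # C:\Users\Username\Documents
```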

Regular Expressions

Regular expressions are specially encoded text strings used as patterns for matching sets of strings. They rely heavily on the backslash character to denote metacharacters (characters with special meaning) and character classes.

Using raw strings dramatically simplifies writing regular expressions since you don’t need to double up every backslash to escape its special meaning. For example:

# Raw string for a pattern of the form ddd-dd-dddd (the US Social Security number format)
regex = r"\d{3}-\d{2}-\d{4}"

# Raw string to match variations of the 'example.com' domain
url_regex = r"https?://(www\.)?example\.com" 

# Raw string for a basic email matching pattern
email_regex = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
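To see these patterns in use, here is a minimal sketch with the standard re module. The sample inputs are made up for illustration, and the patterns are deliberately simple rather than complete validators.

```python
import re

regex = r"\d{3}-\d{2}-\d{4}"
url_regex = r"https?://(www\.)?example\.com"
email_regex = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# fullmatch() succeeds only if the whole input fits the pattern.
print(bool(re.fullmatch(regex, "123-45-6789")))  # True
print(bool(re.fullmatch(regex, "123-456-789")))  # False
print(bool(re.fullmatch(url_regex, "https://www.example.com")))  # True

# findall() returns every non-overlapping match in the text.
text = "Write to ada@example.org or bob@example.net."
print(re.findall(email_regex, text))
# ['ada@example.org', 'bob@example.net']

# Without the r prefix, every backslash has to be doubled.
print(regex == "\\d{3}-\\d{2}-\\d{4}")  # True
```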

Windows File Paths

Windows operating systems use backslashes as directory separators in their file paths. Raw strings can simplify the syntax, making the paths easier to read and less prone to errors.

For example, the raw string

file_path = r"C:\Users\Documents\report.txt"

is far more readable than its counterpart containing double backslashes.
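A raw-string path can be passed straight to the standard pathlib module. The sketch below uses PureWindowsPath, which parses Windows-style paths on any operating system and never touches the file system; the path itself is the illustrative one from above.

```python
from pathlib import PureWindowsPath

file_path = r"C:\Users\Documents\report.txt"
path = PureWindowsPath(file_path)

print(path.drive)   # C:
print(path.parent)  # C:\Users\Documents
print(path.name)    # report.txt
print(path.suffix)  # .txt

# The doubled-backslash spelling describes the same path.
print(path == PureWindowsPath("C:\\Users\\Documents\\report.txt"))  # True
```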

Quotes Inside Raw Strings

You can use both single and double quotes within a raw string, as long as they don’t match the quotes surrounding the string.

raw_string = r"We're open"       # Single quote inside double quotes: no escaping needed
raw_string = r'He said "Wow!"'   # Double quotes inside single quotes: no escaping needed

However, if you want to use the same type of quote within the raw string as the one that surrounds it, you’ll need to escape it with a backslash.

raw_string = r'They said, "It\'s complicated."'
print(raw_string)
# Output: They said, "It\'s complicated."

While this works, remember that raw strings treat backslashes as literal characters. This means the backslash will remain in your final string.

To get around this, you can use a triple-quoted raw string (''' or """). This lets you include both single and double quotes freely within the string without any escaping.

raw_string = r'''They said, "It's complicated."'''
print(raw_string)
# Output: They said, "It's complicated."
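The difference between the two spellings can be confirmed by checking for the leftover backslash. A small sketch with illustrative variable names:

```python
escaped = r'They said, "It\'s complicated."'
triple = r'''They said, "It's complicated."'''

# The backslash used to protect the quote stays in the raw string.
print("\\" in escaped)              # True
print("\\" in triple)               # False
# The escaped version is exactly one character longer.
print(len(escaped) - len(triple))   # 1
```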

Raw Strings Ending with a Backslash

A limitation of raw strings is that they cannot end with a single backslash (or, more generally, an odd number of backslashes). Even in a raw string, a backslash prevents the quote that follows it from closing the literal, so Python thinks the string hasn’t ended and raises a SyntaxError.

raw_string = r"C:\Users\Username\Documents\"
# SyntaxError: unterminated string literal
# (Python versions before 3.10 report "EOL while scanning string literal")

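If you need a backslash at the end of your string, writing two backslashes inside the raw string will not help, because a raw string keeps both of them. Instead, append the final backslash from a regular string literal:

raw_string = r"C:\Users\Username\Documents" + "\\"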
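To see what each spelling of a trailing backslash actually produces, the sketch below compares a few options. The variable names are illustrative.

```python
base = r"C:\Users\Username\Documents"

# Option 1: append one backslash written as a regular literal.
with_concat = base + "\\"
# Option 2: adjacent literals are joined into one string by Python.
with_adjacent = r"C:\Users\Username\Documents" "\\"
# For comparison: two backslashes inside a raw literal stay two characters.
doubled = r"C:\Users\Username\Documents\\"

print(with_concat)                      # C:\Users\Username\Documents\
print(with_concat == with_adjacent)     # True
print(doubled.endswith("\\\\"))         # True
print(len(doubled) - len(with_concat))  # 1
```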

Raw Strings with Unicode Characters

Another limitation of raw strings is that they do not treat the escape sequences for Unicode characters specially; they are interpreted as literal text.

For example, if you have a raw string like r"\u00A9", Python won’t convert it into its corresponding Unicode character (the copyright symbol ©). Instead, the output will remain as the literal string “\u00A9”.

raw_string = r"Copyright \u00A9 2024"
print(raw_string)
# Output: Copyright \u00A9 2024
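If a raw string does contain such an escape and you want the real character, one possible approach is to decode it with the built-in unicode_escape codec. This is a sketch that assumes the text is otherwise plain ASCII; other input may need different handling.

```python
raw_string = r"Copyright \u00A9 2024"
regular_string = "Copyright \u00A9 2024"

# The raw version keeps the six characters \u00A9 as they are.
print(len(raw_string))      # 21
print(len(regular_string))  # 16

# Decoding with unicode_escape interprets the escape after the fact.
# Assumption: raw_string contains only ASCII characters.
decoded = raw_string.encode("ascii").decode("unicode_escape")
print(decoded)                    # Copyright © 2024
print(decoded == regular_string)  # True
```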

Raw f-strings (Raw Formatted Strings)

Python introduced formatted string literals, also known as f-strings, in Python 3.6. You can use f-strings to embed expressions inside string literals.

You can also create raw f-strings by combining the f and r prefixes (e.g., rf"some_text{expression}"). This allows you to enjoy the benefits of both raw strings and f-strings, particularly useful when you need to include expressions within strings that contain many backslashes.

file_name = "data.txt"
path = rf"C:\Users\Documents\{file_name}"
print(path)
# Output: C:\Users\Documents\data.txt
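Two details are worth checking when combining the prefixes: the order of f and r does not matter, and literal braces still have to be doubled, which comes up often with regular expression quantifiers. The sketch below illustrates both; the digits variable and the sample input are made up for the example.

```python
import re

file_name = "data.txt"
path = rf"C:\Users\Documents\{file_name}"

# The prefix letters can appear in either order.
print(fr"C:\Users\Documents\{file_name}" == path)  # True

# In any f-string, {{ and }} produce literal braces, while {digits}
# is replaced by the value of the variable.
digits = 3
pattern = rf"\d{{{digits}}}-\d{{4}}"
print(pattern)                                  # \d{3}-\d{4}
print(bool(re.fullmatch(pattern, "555-0199")))  # True
```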