Lesson 7: Regular Expressions - Basics & Pattern Matching

Lesson Objectives

After completing this lesson, you will be able to:

✅ Understand regular expressions (regex)
✅ Use the re module
✅ Write basic patterns
✅ Use character classes and quantifiers
✅ Apply match, search, and findall
✅ Compile and reuse patterns

What Are Regular Expressions?

Regular expressions (regex) are patterns used to search, match, and manipulate text.

Why Use Regex?

    # Without regex - cumbersome
    def is_valid_email_manual(email):
        if '@' not in email:
            return False
        parts = email.split('@')
        if len(parts) != 2:
            return False
        if '.' not in parts[1]:
            return False
        # ... more checks ...
        return True

    # With regex - concise
    import re

    def is_valid_email_regex(email):
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
        return bool(re.match(pattern, email))

    # Test
    print(is_valid_email_regex('user@example.com'))  # True
    print(is_valid_email_regex('invalid-email'))     # False

re Module Basics

The re module provides regex operations.

Basic Functions

    import re

    text = "Hello, my phone is 123-456-7890"

    # match() - check if the pattern matches at the START of the string
    result = re.match(r'Hello', text)
    print(result)          # <re.Match object>
    print(result.group())  # Hello

    result = re.match(r'phone', text)
    print(result)  # None (not at start)

    # search() - find the FIRST occurrence anywhere
    result = re.search(r'phone', text)
    print(result)          # <re.Match object>
    print(result.group())  # phone

    # findall() - find ALL occurrences
    text2 = "I have 3 cats, 2 dogs, and 5 birds"
    numbers = re.findall(r'\d+', text2)
    print(numbers)  # ['3', '2', '5']

    # finditer() - iterator of match objects
    for match in re.finditer(r'\d+', text2):
        print(f"Found {match.group()} at position {match.start()}")
    # Found 3 at position 7
    # Found 2 at position 15
    # Found 5 at position 27

Basic Pattern Syntax

Literal Characters

    import re

    text = "Hello World"

    # Exact match
    print(re.search(r'Hello', text))   # Match
    print(re.search(r'hello', text))   # None (case-sensitive)
    print(re.search(r'World', text))   # Match
    print(re.search(r'Python', text))  # None

Special Characters (Metacharacters)

    import re

    # . - any character (except newline)
    print(re.search(r'H.llo', 'Hello'))  # Match
    print(re.search(r'H.llo', 'Hallo'))  # Match
    print(re.search(r'H.llo', 'Hxllo'))  # Match

    # ^ - start of string
    print(re.search(r'^Hello', 'Hello World'))  # Match
    print(re.search(r'^World', 'Hello World'))  # None

    # $ - end of string
    print(re.search(r'World$', 'Hello World'))  # Match
    print(re.search(r'Hello$', 'Hello World'))  # None

    # * - 0 or more occurrences
    print(re.search(r'ab*c', 'ac'))    # Match (0 b's)
    print(re.search(r'ab*c', 'abc'))   # Match (1 b)
    print(re.search(r'ab*c', 'abbc'))  # Match (2 b's)

    # + - 1 or more occurrences
    print(re.search(r'ab+c', 'ac'))    # None (0 b's)
    print(re.search(r'ab+c', 'abc'))   # Match (1 b)
    print(re.search(r'ab+c', 'abbc'))  # Match (2 b's)

    # ? - 0 or 1 occurrence
    print(re.search(r'ab?c', 'ac'))    # Match (0 b's)
    print(re.search(r'ab?c', 'abc'))   # Match (1 b)
    print(re.search(r'ab?c', 'abbc'))  # None (2 b's)
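Because these characters carry special meaning, matching one of them literally requires escaping it with a backslash. The following is a minimal sketch of that idea; the sample strings are made up for illustration, and re.escape() is the standard-library helper that escapes a whole string at once.

```python
import re

# An unescaped dot matches any character, so this pattern is too loose.
loose = re.search(r'3.9', 'Python 3x9')
print(loose.group())   # 3x9

# Escaping the dot makes it match a literal '.' only.
strict = re.search(r'3\.9', 'Python 3x9')
print(strict)          # None

# re.escape() escapes every metacharacter in a string for you.
price = '$9.99 (sale)'
found = re.search(re.escape(price), 'Now only $9.99 (sale)!')
print(found.group())   # $9.99 (sale)
```

This matters most when part of a pattern comes from user input, where a stray `.` or `(` would otherwise change what the pattern means.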

Character Classes

Built-in Character Classes

    import re

    text = "Hello123! Test@2025"

    # \d - digit ([0-9] for ASCII text)
    digits = re.findall(r'\d', text)
    print(digits)  # ['1', '2', '3', '2', '0', '2', '5']

    # \D - non-digit
    non_digits = re.findall(r'\D', text)
    print(non_digits)  # ['H', 'e', 'l', 'l', 'o', '!', ' ', 'T', 'e', 's', 't', '@']

    # \w - word character ([a-zA-Z0-9_] for ASCII text;
    #      with str patterns it also matches Unicode letters and digits)
    words = re.findall(r'\w+', text)
    print(words)  # ['Hello123', 'Test', '2025']

    # \W - non-word character
    non_words = re.findall(r'\W', text)
    print(non_words)  # ['!', ' ', '@']

    # \s - whitespace (includes [ \t\n\r\f\v])
    text2 = "Hello World\tPython\n"
    spaces = re.findall(r'\s', text2)
    print(spaces)  # [' ', '\t', '\n']

    # \S - non-whitespace
    non_spaces = re.findall(r'\S+', text2)
    print(non_spaces)  # ['Hello', 'World', 'Python']

Custom Character Classes

    import re

    # [abc] - any of a, b, or c
    text = "The cat sat on the mat"
    print(re.findall(r'[cm]at', text))  # ['cat', 'mat']

    # [a-z] - range
    text2 = "ABC123xyz"
    print(re.findall(r'[a-z]+', text2))  # ['xyz']
    print(re.findall(r'[A-Z]+', text2))  # ['ABC']
    print(re.findall(r'[0-9]+', text2))  # ['123']

    # [^abc] - NOT a, b, or c
    text3 = "hello world"
    print(re.findall(r'[^aeiou]', text3))  # ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']

    # Multiple ranges
    text4 = "Test123!@#"
    print(re.findall(r'[a-zA-Z]', text4))     # ['T', 'e', 's', 't']
    print(re.findall(r'[a-zA-Z0-9]', text4))  # ['T', 'e', 's', 't', '1', '2', '3']

Quantifiers

Exact Count

    import re

    # {n} - exactly n times
    print(re.search(r'\d{3}', '12'))    # None
    print(re.search(r'\d{3}', '123'))   # Match
    print(re.search(r'\d{3}', '1234'))  # Match (first 3)

    # {n,m} - between n and m times
    print(re.search(r'\d{2,4}', '1'))      # None
    print(re.search(r'\d{2,4}', '12'))     # Match
    print(re.search(r'\d{2,4}', '123'))    # Match
    print(re.search(r'\d{2,4}', '1234'))   # Match
    print(re.search(r'\d{2,4}', '12345'))  # Match (first 4)

    # {n,} - n or more times
    print(re.search(r'\d{3,}', '12'))     # None
    print(re.search(r'\d{3,}', '123'))    # Match
    print(re.search(r'\d{3,}', '12345'))  # Match

Greedy vs Non-greedy

    import re

    html = '<div>Hello</div><div>World</div>'

    # Greedy (default) - matches as much as possible
    greedy = re.search(r'<div>.*</div>', html)
    print(greedy.group())  # <div>Hello</div><div>World</div>

    # Non-greedy - matches as little as possible
    non_greedy = re.search(r'<div>.*?</div>', html)
    print(non_greedy.group())  # <div>Hello</div>

    # All non-greedy quantifiers
    text = "aaaaa"
    print(re.search(r'a+?', text).group())      # a (instead of aaaaa)
    print(re.search(r'a*?', text).group())      # '' (empty)
    print(re.search(r'a{2,4}?', text).group())  # aa (instead of aaaa)
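The difference between greedy and non-greedy matching becomes more visible when collecting every match rather than only the first. The sketch below reuses the same sample HTML with findall(); the capturing parentheses are there so that findall() returns only the text between the tags.

```python
import re

html = '<div>Hello</div><div>World</div>'

# Greedy: .* runs to the LAST </div>, so there is a single oversized match.
greedy_all = re.findall(r'<div>(.*)</div>', html)
print(greedy_all)  # ['Hello</div><div>World']

# Non-greedy: .*? stops at the NEAREST </div>, giving one match per element.
lazy_all = re.findall(r'<div>(.*?)</div>', html)
print(lazy_all)    # ['Hello', 'World']
```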

Match Objects

Accessing Match Information

    import re

    text = "Email: alice@example.com, Phone: 123-456-7890"

    # Match object properties
    match = re.search(r'\w+@\w+\.\w+', text)

    print(match.group())  # alice@example.com
    print(match.start())  # 7 (start index)
    print(match.end())    # 24 (end index)
    print(match.span())   # (7, 24) (start, end)
    print(match.string)   # Original text

    # Check if a match exists
    if match:
        print(f"Found: {match.group()}")
    else:
        print("Not found")

Multiple Matches

    import re

    text = "Python 3.9, Java 11, Go 1.16"

    # findall - returns a list of strings
    versions = re.findall(r'\d+\.\d+', text)
    print(versions)  # ['3.9', '1.16']

    # finditer - returns an iterator of match objects
    for match in re.finditer(r'(\w+)\s+(\d+\.?\d*)', text):
        language = match.group(1)
        version = match.group(2)
        print(f"{language}: {version}")
    # Python: 3.9
    # Java: 11
    # Go: 1.16

Compiled Patterns

Compile a pattern once so it can be reused efficiently.

    import re

    # Without explicit compilation
    text = "Hello 123, World 456"
    re.search(r'\d+', text)  # Compiled on first use, then looked up in re's cache on each call

    # With compilation - better for reuse
    pattern = re.compile(r'\d+')
    pattern.search(text)        # Uses the compiled pattern
    pattern.search("Test 789")  # Reuses the same compiled pattern

    # Example with multiple operations
    email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')

    emails = [
        "user@example.com",
        "invalid-email",
        "test.user@domain.org",
        "bad@format"
    ]

    for email in emails:
        if email_pattern.match(email):
            print(f"Valid: {email}")
        else:
            print(f"Invalid: {email}")
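One caveat with the loop above: match() anchors only at the start of the string, so a pattern without a trailing $ also accepts input that merely begins with an email address. A small sketch of the difference, using the standard fullmatch() method; the candidate string is a made-up example.

```python
import re

email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')

candidate = "alice@example.com<script>"

# match() only checks the beginning, so the trailing junk is ignored.
prefix_ok = bool(email_pattern.match(candidate))
print(prefix_ok)  # True

# fullmatch() requires the ENTIRE string to match the pattern.
whole_ok = bool(email_pattern.fullmatch(candidate))
print(whole_ok)   # False

clean_ok = bool(email_pattern.fullmatch("alice@example.com"))
print(clean_ok)   # True
```

For validation, either anchor the pattern with ^ and $ as the other examples in this lesson do, or call fullmatch().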

Common Patterns

Email Validation

    import re

    def validate_email(email):
        """Validate email address."""
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w{2,}$'
        return bool(re.match(pattern, email))

    # Test
    emails = [
        "user@example.com",      # Valid
        "test.user@domain.org",  # Valid
        "invalid",               # Invalid
        "@example.com",          # Invalid
        "user@.com"              # Invalid
    ]

    for email in emails:
        print(f"{email}: {validate_email(email)}")

Phone Number

    import re

    def extract_phone(text):
        """Extract phone numbers."""
        # Pattern: (123) 456-7890 or 123-456-7890
        pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        return re.findall(pattern, text)

    text = """
    Contact us at:
    (123) 456-7890
    or 987-654-3210
    or 555.123.4567
    """

    phones = extract_phone(text)
    print(phones)
    # ['(123) 456-7890', '987-654-3210', '555.123.4567']

URL Extraction

    import re

    def extract_urls(text):
        """Extract URLs from text."""
        pattern = r'https?://(?:www\.)?[\w\.-]+\.\w+(?:/[\w\.-]*)*'
        return re.findall(pattern, text)

    text = """
    Visit https://www.example.com or
    http://test.org/path/to/page
    and https://github.com/user/repo
    """

    urls = extract_urls(text)
    for url in urls:
        print(url)
    # https://www.example.com
    # http://test.org/path/to/page
    # https://github.com/user/repo

Date Parsing

    import re

    def parse_dates(text):
        """Parse dates in various formats."""
        patterns = [
            r'\d{4}-\d{2}-\d{2}',   # YYYY-MM-DD
            r'\d{2}/\d{2}/\d{4}',   # MM/DD/YYYY
            r'\d{2}\.\d{2}\.\d{4}'  # DD.MM.YYYY
        ]

        dates = []
        for pattern in patterns:
            dates.extend(re.findall(pattern, text))

        return dates

    text = """
    Meeting on 2025-10-27
    Birthday: 12/25/2024
    Event: 15.03.2025
    """

    dates = parse_dates(text)
    print(dates)
    # ['2025-10-27', '12/25/2024', '15.03.2025']
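These patterns check only the shape of a date, so a string such as 2025-13-45 is also reported. One possible refinement, sketched here for the YYYY-MM-DD format only, is to let the regex find candidates and let datetime.strptime() decide whether each is a real calendar date. The function name find_valid_iso_dates is hypothetical.

```python
import re
from datetime import datetime

def find_valid_iso_dates(text):
    """Return YYYY-MM-DD strings that are also real calendar dates."""
    valid = []
    for candidate in re.findall(r'\d{4}-\d{2}-\d{2}', text):
        try:
            datetime.strptime(candidate, '%Y-%m-%d')
        except ValueError:
            continue  # right shape, but not a possible date
        valid.append(candidate)
    return valid

print(find_valid_iso_dates("Due 2025-10-27, not 2025-13-45 or 2025-02-30"))
# ['2025-10-27']
```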

Username Validation

    import re

    def validate_username(username):
        """
        Validate username:
        - 3-16 characters
        - Letters, numbers, underscore, hyphen
        - Must start with letter
        """
        pattern = r'^[a-zA-Z][a-zA-Z0-9_-]{2,15}$'
        return bool(re.match(pattern, username))

    usernames = [
        "alice",      # Valid
        "user_123",   # Valid
        "test-user",  # Valid
        "ab",         # Invalid (too short)
        "123user",    # Invalid (starts with number)
        "user@name",  # Invalid (@ not allowed)
    ]

    for username in usernames:
        print(f"{username}: {validate_username(username)}")

Real-world Examples

1. Log Parser

    import re

    def parse_log_line(line):
        """Parse Apache-style log line."""
        pattern = r'(\S+) - - \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'

        match = re.search(pattern, line)
        if match:
            return {
                'ip': match.group(1),
                'timestamp': match.group(2),
                'method': match.group(3),
                'path': match.group(4),
                'status': int(match.group(5)),
                'size': int(match.group(6))
            }
        return None

    log_line = '192.168.1.1 - - [27/Oct/2025:10:30:00 +0000] "GET /index.html HTTP/1.1" 200 1234'
    result = parse_log_line(log_line)
    print(result)
    # {'ip': '192.168.1.1', 'timestamp': '27/Oct/2025:10:30:00 +0000', ...}

2. Markdown Link Extractor

    import re

    def extract_markdown_links(text):
        """Extract links from Markdown text."""
        # Pattern: [text](url)
        pattern = r'\[([^\]]+)\]\(([^)]+)\)'

        # With two groups, findall returns a list of (label, url) tuples
        matches = re.findall(pattern, text)
        return [{'text': label, 'url': url} for label, url in matches]

    markdown = """
    Check out [Python](https://python.org) and
    [GitHub](https://github.com) for more info.
    """

    links = extract_markdown_links(markdown)
    for link in links:
        print(f"{link['text']}: {link['url']}")
    # Python: https://python.org
    # GitHub: https://github.com

3. Password Strength Checker

    import re

    def check_password_strength(password):
        """
        Check password strength:
        - At least 8 characters
        - Contains uppercase
        - Contains lowercase
        - Contains digit
        - Contains special character
        """
        checks = {
            'length': len(password) >= 8,
            'uppercase': bool(re.search(r'[A-Z]', password)),
            'lowercase': bool(re.search(r'[a-z]', password)),
            'digit': bool(re.search(r'\d', password)),
            'special': bool(re.search(r'[!@#$%^&*(),.?":{}|<>]', password))
        }

        # Each passed check counts as one point
        strength = sum(checks.values())
        checks['score'] = strength

        if strength == 5:
            checks['rating'] = 'Strong'
        elif strength >= 3:
            checks['rating'] = 'Medium'
        else:
            checks['rating'] = 'Weak'

        return checks

    # Test
    passwords = ["weak", "Strong1", "Strong1!", "VeryStr0ng!Pass"]

    for pwd in passwords:
        result = check_password_strength(pwd)
        print(f"{pwd}: {result['rating']} (score: {result['score']}/5)")
    # weak: Weak (score: 1/5)
    # Strong1: Medium (score: 3/5)
    # Strong1!: Strong (score: 5/5)
    # VeryStr0ng!Pass: Strong (score: 5/5)

4. HTML Tag Stripper

    import re

    def strip_html_tags(html):
        """Remove HTML tags from text."""
        # Remove tags
        text = re.sub(r'<[^>]+>', '', html)

        # Clean up extra whitespace
        text = re.sub(r'\s+', ' ', text)

        return text.strip()

    html = """
    <div class="content">
        <h1>Title</h1>
        <p>This is a <strong>paragraph</strong> with <em>formatting</em>.</p>
    </div>
    """

    clean_text = strip_html_tags(html)
    print(clean_text)
    # Title This is a paragraph with formatting.

5. Credit Card Masking

    import re

    def mask_credit_card(text):
        """Mask credit card numbers in text."""
        # Pattern: 4 groups of 4 digits
        pattern = r'\b(\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})\b'

        def replacer(match):
            # Keep first and last 4 digits, mask middle
            return f"{match.group(1)}-****-****-{match.group(4)}"

        return re.sub(pattern, replacer, text)

    text = """
    Card 1: 1234 5678 9012 3456
    Card 2: 1111-2222-3333-4444
    Card 3: 9876543210123456
    """

    masked = mask_credit_card(text)
    print(masked)
    # Card 1: 1234-****-****-3456
    # Card 2: 1111-****-****-4444
    # Card 3: 9876-****-****-3456

Best Practices

    import re

    # 1. Use raw strings for patterns
    pattern = r'\d+'  # Good - no escaping issues
    # pattern = '\\d+'  # Works, but every backslash must be doubled

    # 2. Compile patterns for reuse
    text = "Contact: user@example.com"
    email_re = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')
    email_re.search(text)  # Efficient

    # 3. Use meaningful pattern names
    phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
    date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

    # 4. Test patterns thoroughly
    def test_email_pattern():
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

        valid = ['user@example.com', 'test.user@domain.org']
        invalid = ['invalid', '@example.com', 'user@']

        for email in valid:
            assert re.match(pattern, email), f"Should match: {email}"

        for email in invalid:
            assert not re.match(pattern, email), f"Should not match: {email}"

    # 5. Document complex patterns
    complex_pattern = r'''
        ^                   # Start of string
        [a-zA-Z]            # Must start with letter
        [a-zA-Z0-9_-]{2,15} # 2-15 chars: letters, digits, _, -
        $                   # End of string
    '''
    username_re = re.compile(complex_pattern, re.VERBOSE)

Practice Exercises

Exercise 1: IPv4 Validator

Write a function that validates IPv4 addresses.
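One possible starting point, offered as a sketch rather than a reference solution: build a pattern for a single octet (0-255) and repeat it four times. The names OCTET, IPV4_RE, and is_valid_ipv4 are hypothetical, and rejecting leading zeros such as "01" is an assumption about what "valid" should mean here.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'

# Four octets separated by literal dots; {{3}} becomes {3} in the f-string.
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

def is_valid_ipv4(address):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(address) is not None

for candidate in ['192.168.1.1', '256.1.1.1', '1.2.3', '01.2.3.4']:
    print(f"{candidate}: {is_valid_ipv4(candidate)}")
# 192.168.1.1: True
# 256.1.1.1: False
# 1.2.3: False
# 01.2.3.4: False
```

Using [0-9] instead of \d keeps the pattern limited to ASCII digits.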

Exercise 2: Hashtag Extractor

Extract hashtags from social media text.
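A minimal sketch to get started, assuming a hashtag is simply `#` followed by one or more word characters. The function name extract_hashtags is hypothetical, and this simple pattern would also pick up a `#` in the middle of a word, which a fuller solution may want to exclude.

```python
import re

def extract_hashtags(text):
    """Return hashtags (without the leading '#') in order of appearance."""
    # The group captures only the tag name, so findall returns names.
    return re.findall(r'#(\w+)', text)

print(extract_hashtags("Learning #Python and #regex_101 today!"))
# ['Python', 'regex_101']
```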

Exercise 3: Time Parser

Parse and convert time formats (12h ↔ 24h).
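As a hint for one direction only (12h to 24h), the sketch below captures hour, minute, and AM/PM with groups and then adjusts the hour. The names TIME_12H_RE and to_24h are hypothetical, and the accepted input format ("h:mm AM" with an optional space) is an assumption; the reverse conversion is left as part of the exercise.

```python
import re

# Hour 1-12 (optional leading zero), minutes 00-59, then AM or PM.
TIME_12H_RE = re.compile(r'(1[0-2]|0?[1-9]):([0-5][0-9])\s?([AaPp][Mm])')

def to_24h(time_12h):
    """Convert 'h:mm AM/PM' to 'HH:MM'; return None for malformed input."""
    match = TIME_12H_RE.fullmatch(time_12h.strip())
    if match is None:
        return None
    hour = int(match.group(1)) % 12  # 12 becomes 0 before the PM offset
    if match.group(3).upper() == 'PM':
        hour += 12
    return f"{hour:02d}:{match.group(2)}"

for sample in ['12:00 AM', '12:30 PM', '1:05 pm', '13:00 PM']:
    print(f"{sample} -> {to_24h(sample)}")
# 12:00 AM -> 00:00
# 12:30 PM -> 12:30
# 1:05 pm -> 13:05
# 13:00 PM -> None
```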

Exercise 4: Code Comment Remover

Remove comments from code (Python, JS, etc.).
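The main trap in this exercise is a comment marker inside a string literal. One common trick, sketched here for Python-style `#` comments only, is to match quoted strings first and keep them, so that only a `#` outside a string is treated as a comment. The names STRING_OR_COMMENT and strip_hash_comments are hypothetical; this is a regex approximation, not a real parser.

```python
import re

# Alternative 1 (captured): a double- or single-quoted string, with escapes.
# Alternative 2: a '#' comment running to the end of the line.
STRING_OR_COMMENT = re.compile(
    r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|#[^\n]*'''
)

def strip_hash_comments(source):
    """Remove '#' comments while leaving string literals untouched."""
    def keep_strings(match):
        # group(1) is set only when a string literal was matched.
        return match.group(1) or ''
    return STRING_OR_COMMENT.sub(keep_strings, source)

code = 'x = 1  # set x\nurl = "http://a#b"  # keep the string\n'
print(strip_hash_comments(code))
```

Trailing spaces before a removed comment are left in place; trimming them is a natural next step.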

Exercise 5: Template Variable Replacer

Replace {{variable}} placeholders in templates.
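A possible approach, in the same style as the credit card example: pass a function to re.sub() so each placeholder can be looked up individually. The names PLACEHOLDER_RE and render are hypothetical, and leaving unknown variables unchanged is an assumed design choice, not a requirement of the exercise.

```python
import re

# {{ name }} with optional spaces inside the braces; braces are escaped
# because { and } are metacharacters.
PLACEHOLDER_RE = re.compile(r'\{\{\s*(\w+)\s*\}\}')

def render(template, values):
    """Replace {{name}} placeholders; unknown names are left as they are."""
    def lookup(match):
        name = match.group(1)
        # Fall back to the original placeholder text when the name is missing.
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER_RE.sub(lookup, template)

print(render("Hello {{ name }}, you have {{count}} new {{thing}}.",
             {"name": "Alice", "count": 3}))
# Hello Alice, you have 3 new {{thing}}.
```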

Summary

✅ re module: match, search, findall, finditer
✅ Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
✅ Character classes: \d \D \w \W \s \S
✅ Quantifiers: * + ? {n} {n,m} {n,}
✅ Greedy vs Non-greedy: * vs *?, + vs +?
✅ Match objects: group(), start(), end(), span()
✅ Compiled patterns: Compile once, reuse many times
✅ Common use cases: Email, phone, URL, dates

Next Lesson

Lesson 7.2: Regular Expressions (Part 2) - Groups, lookahead/lookbehind, flags, and advanced patterns! 🚀

Remember:

Use raw strings (r'pattern')
Compile patterns for reuse
Test patterns thoroughly
Document complex patterns
Keep patterns readable! 🎯