Lesson 7: Regular Expressions - Basics & Pattern Matching
Learning Objectives
After completing this lesson, you will be able to:
- ✅ Understand regular expressions (regex)
- ✅ Use the re module
- ✅ Write basic patterns
- ✅ Use character classes and quantifiers
- ✅ Apply match, search, and findall
- ✅ Compile and reuse patterns
What Are Regular Expressions?
Regular expressions (regex) are patterns used to search, match, and manipulate text.
Why Use Regex?
    # Without regex - cumbersome
    def is_valid_email_manual(email):
        if '@' not in email:
            return False
        parts = email.split('@')
        if len(parts) != 2:
            return False
        if '.' not in parts[1]:
            return False
        # ... more checks ...
        return True

    # With regex - concise
    import re

    def is_valid_email_regex(email):
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
        return bool(re.match(pattern, email))

    # Test (placeholder address)
    print(is_valid_email_regex('user@example.com'))  # True
    print(is_valid_email_regex('invalid-email'))     # False
re Module Basics
The re module provides regex operations.
Basic Functions
    import re

    text = "Hello, my phone is 123-456-7890"

    # match() - check if pattern matches at START
    result = re.match(r'Hello', text)
    print(result)          # <re.Match object>
    print(result.group())  # Hello

    result = re.match(r'phone', text)
    print(result)  # None (not at start)

    # search() - find FIRST occurrence anywhere
    result = re.search(r'phone', text)
    print(result)          # <re.Match object>
    print(result.group())  # phone

    # findall() - find ALL occurrences
    text2 = "I have 3 cats, 2 dogs, and 5 birds"
    numbers = re.findall(r'\d+', text2)
    print(numbers)  # ['3', '2', '5']

    # finditer() - iterator of matches
    for match in re.finditer(r'\d+', text2):
        print(f"Found {match.group()} at position {match.start()}")
    # Found 3 at position 7
    # Found 2 at position 15
    # Found 5 at position 27
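Because match() only anchors at the start of the string, it accepts trailing text after the pattern. The sketch below contrasts it with re.fullmatch(), which requires the whole string to match. The PHONE pattern and the sample text are illustrative and not part of the lesson's own examples.

```python
import re

PHONE = r'\d{3}-\d{3}-\d{4}'      # illustrative pattern
text = "123-456-7890 ext. 42"     # illustrative input

# match() anchors only at the start, so the trailing " ext. 42" is ignored
starts_with_phone = bool(re.match(PHONE, text))

# fullmatch() succeeds only if the entire string matches the pattern
is_exactly_phone = bool(re.fullmatch(PHONE, text))

print(starts_with_phone)                          # True
print(is_exactly_phone)                           # False
print(bool(re.fullmatch(PHONE, "123-456-7890")))  # True
```

For validation tasks, fullmatch() is often a safer default than match(), since it avoids having to remember a trailing $ in the pattern.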
Basic Pattern Syntax
Literal Characters
    import re

    text = "Hello World"

    # Exact match
    print(re.search(r'Hello', text))   # Match
    print(re.search(r'hello', text))   # None (case-sensitive)
    print(re.search(r'World', text))   # Match
    print(re.search(r'Python', text))  # None
Special Characters (Metacharacters)
    import re

    # . - any character (except newline)
    print(re.search(r'H.llo', 'Hello'))  # Match
    print(re.search(r'H.llo', 'Hallo'))  # Match
    print(re.search(r'H.llo', 'Hxllo'))  # Match

    # ^ - start of string
    print(re.search(r'^Hello', 'Hello World'))  # Match
    print(re.search(r'^World', 'Hello World'))  # None

    # $ - end of string
    print(re.search(r'World$', 'Hello World'))  # Match
    print(re.search(r'Hello$', 'Hello World'))  # None

    # * - 0 or more occurrences
    print(re.search(r'ab*c', 'ac'))    # Match (0 b's)
    print(re.search(r'ab*c', 'abc'))   # Match (1 b)
    print(re.search(r'ab*c', 'abbc'))  # Match (2 b's)

    # + - 1 or more occurrences
    print(re.search(r'ab+c', 'ac'))    # None (0 b's)
    print(re.search(r'ab+c', 'abc'))   # Match (1 b)
    print(re.search(r'ab+c', 'abbc'))  # Match (2 b's)

    # ? - 0 or 1 occurrence
    print(re.search(r'ab?c', 'ac'))    # Match (0 b's)
    print(re.search(r'ab?c', 'abc'))   # Match (1 b)
    print(re.search(r'ab?c', 'abbc'))  # None (2 b's)
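Since metacharacters have special meaning, matching them literally requires escaping. A minimal sketch, assuming a made-up price string: it shows what goes wrong without escaping, then escaping by hand with backslashes, then letting re.escape() do it.

```python
import re

price_text = "Total: $5.00 (approx.)"  # illustrative input

# Unescaped: '$' means end-of-string and '.' means any character,
# so this pattern cannot match the literal text "$5.00"
unescaped = re.search(r'$5.00', price_text)
print(unescaped)  # None

# Escaped by hand with backslashes
by_hand = re.search(r'\$5\.00', price_text)
print(by_hand.group())  # $5.00

# re.escape() builds an escaped pattern from a literal string
literal = re.escape("$5.00")
escaped = re.search(literal, price_text)
print(escaped.group())  # $5.00
```

re.escape() is most useful when the literal text comes from user input or another variable rather than being typed into the pattern directly.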
Character Classes
Built-in Character Classes
    import re

    text = "Hello123! Test@2025"

    # \d - digit ([0-9]; str patterns also match other Unicode digits)
    digits = re.findall(r'\d', text)
    print(digits)  # ['1', '2', '3', '2', '0', '2', '5']

    # \D - non-digit
    non_digits = re.findall(r'\D', text)
    print(non_digits)  # ['H', 'e', 'l', 'l', 'o', '!', ' ', 'T', 'e', 's', 't', '@']

    # \w - word character ([a-zA-Z0-9_]; str patterns also match Unicode letters and digits)
    words = re.findall(r'\w+', text)
    print(words)  # ['Hello123', 'Test', '2025']

    # \W - non-word character
    non_words = re.findall(r'\W', text)
    print(non_words)  # ['!', ' ', '@']

    # \s - whitespace [ \t\n\r\f\v]
    text2 = "Hello World\tPython\n"
    spaces = re.findall(r'\s', text2)
    print(spaces)  # [' ', '\t', '\n']

    # \S - non-whitespace
    non_spaces = re.findall(r'\S+', text2)
    print(non_spaces)  # ['Hello', 'World', 'Python']
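In Python 3, \d and \w applied to str patterns are Unicode-aware by default, so they match more than the ASCII ranges. The sketch below uses an illustrative sample string to compare the default behaviour with the re.ASCII flag (flags are covered properly in the next lesson; this is only a preview).

```python
import re

# "café", the Arabic-Indic digits for 123, then ASCII digits (illustrative input)
sample = "caf\u00e9 \u0661\u0662\u0663 456"

unicode_digits = re.findall(r'\d+', sample)
ascii_digits = re.findall(r'\d+', sample, flags=re.ASCII)
unicode_words = re.findall(r'\w+', sample)
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_digits)  # two matches: the Arabic-Indic run and '456'
print(ascii_digits)    # ['456']
print(unicode_words)   # three matches, including 'café' as a whole word
print(ascii_words)     # ['caf', '456']
```

If a pattern is meant to accept only 0-9, writing [0-9] explicitly or passing re.ASCII avoids surprises with non-ASCII input.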
Custom Character Classes
    import re

    # [abc] - any of a, b, or c
    text = "The cat sat on the mat"
    print(re.findall(r'[cm]at', text))  # ['cat', 'mat']

    # [a-z] - range
    text2 = "ABC123xyz"
    print(re.findall(r'[a-z]+', text2))  # ['xyz']
    print(re.findall(r'[A-Z]+', text2))  # ['ABC']
    print(re.findall(r'[0-9]+', text2))  # ['123']

    # [^abc] - NOT a, b, or c
    text3 = "hello world"
    print(re.findall(r'[^aeiou]', text3))  # ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']

    # Multiple ranges
    text4 = "Test123!@#"
    print(re.findall(r'[a-zA-Z]', text4))     # ['T', 'e', 's', 't']
    print(re.findall(r'[a-zA-Z0-9]', text4))  # ['T', 'e', 's', 't', '1', '2', '3']
Quantifiers
Exact Count
    import re

    # {n} - exactly n times
    print(re.search(r'\d{3}', '12'))    # None
    print(re.search(r'\d{3}', '123'))   # Match
    print(re.search(r'\d{3}', '1234'))  # Match (first 3)

    # {n,m} - between n and m times
    print(re.search(r'\d{2,4}', '1'))      # None
    print(re.search(r'\d{2,4}', '12'))     # Match
    print(re.search(r'\d{2,4}', '123'))    # Match
    print(re.search(r'\d{2,4}', '1234'))   # Match
    print(re.search(r'\d{2,4}', '12345'))  # Match (first 4)

    # {n,} - n or more times
    print(re.search(r'\d{3,}', '12'))     # None
    print(re.search(r'\d{3,}', '123'))    # Match
    print(re.search(r'\d{3,}', '12345'))  # Match
Greedy vs Non-greedy
    import re

    html = '<div>Hello</div><div>World</div>'

    # Greedy (default) - matches as much as possible
    greedy = re.search(r'<div>.*</div>', html)
    print(greedy.group())  # <div>Hello</div><div>World</div>

    # Non-greedy - matches as little as possible
    non_greedy = re.search(r'<div>.*?</div>', html)
    print(non_greedy.group())  # <div>Hello</div>

    # All non-greedy quantifiers
    text = "aaaaa"
    print(re.search(r'a+?', text).group())      # a (instead of aaaaa)
    print(re.search(r'a*?', text).group())      # '' (empty)
    print(re.search(r'a{2,4}?', text).group())  # aa (instead of aaaa)
Match Objects
Accessing Match Information
    import re

    # Placeholder address; its length is chosen to agree with the indices below
    text = "Email: alice@example.com, Phone: 123-456-7890"

    # Match object properties
    match = re.search(r'\w+@\w+\.\w+', text)

    print(match.group())  # alice@example.com
    print(match.start())  # 7 (start index)
    print(match.end())    # 24 (end index)
    print(match.span())   # (7, 24) (start, end)
    print(match.string)   # Original text

    # Check if match exists
    if match:
        print(f"Found: {match.group()}")
    else:
        print("Not found")
Multiple Matches
    import re

    text = "Python 3.9, Java 11, Go 1.16"

    # findall - returns list of strings
    versions = re.findall(r'\d+\.\d+', text)
    print(versions)  # ['3.9', '1.16']

    # finditer - returns iterator of match objects
    for match in re.finditer(r'(\w+)\s+(\d+\.?\d*)', text):
        language = match.group(1)
        version = match.group(2)
        print(f"{language}: {version}")
    # Python: 3.9
    # Java: 11
    # Go: 1.16
Compiled Patterns
Compile patterns to reuse them efficiently.
    import re

    # Without compilation
    text = "Hello 123, World 456"
    re.search(r'\d+', text)  # Looks the pattern up in re's internal cache on every call

    # With compilation - better for reuse
    pattern = re.compile(r'\d+')
    pattern.search(text)        # Uses compiled pattern
    pattern.search("Test 789")  # Reuses same compiled pattern

    # Example with multiple operations
    email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')

    # Placeholder addresses
    emails = [
        "alice@example.com",
        "invalid-email",
        "bob@test.org",
        "bad@format"
    ]

    for email in emails:
        if email_pattern.match(email):
            print(f"Valid: {email}")
        else:
            print(f"Invalid: {email}")
Common Patterns
Email Validation
    import re

    def validate_email(email):
        """Validate email address."""
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w{2,}$'
        return bool(re.match(pattern, email))

    # Test (placeholder addresses)
    emails = [
        "user@example.com",        # Valid
        "first.last@example.org",  # Valid
        "invalid",                 # Invalid
        "@example.com",            # Invalid
        "user@example.c"           # Invalid (top-level part shorter than 2 characters)
    ]

    for email in emails:
        print(f"{email}: {validate_email(email)}")
Phone Number
    import re

    def extract_phone(text):
        """Extract phone numbers."""
        # Pattern: (123) 456-7890 or 123-456-7890
        pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        return re.findall(pattern, text)

    text = """
    Contact us at:
    (123) 456-7890
    or 987-654-3210
    or 555.123.4567
    """

    phones = extract_phone(text)
    print(phones)
    # ['(123) 456-7890', '987-654-3210', '555.123.4567']
URL Extraction
    import re

    def extract_urls(text):
        """Extract URLs from text."""
        pattern = r'https?://(?:www\.)?[\w\.-]+\.\w+(?:/[\w\.-]*)*'
        return re.findall(pattern, text)

    text = """
    Visit https://www.example.com or
    http://test.org/path/to/page
    and https://github.com/user/repo
    """

    urls = extract_urls(text)
    for url in urls:
        print(url)
    # https://www.example.com
    # http://test.org/path/to/page
    # https://github.com/user/repo
Date Parsing
    import re

    def parse_dates(text):
        """Parse dates in various formats."""
        patterns = [
            r'\d{4}-\d{2}-\d{2}',   # YYYY-MM-DD
            r'\d{2}/\d{2}/\d{4}',   # MM/DD/YYYY
            r'\d{2}\.\d{2}\.\d{4}'  # DD.MM.YYYY
        ]

        dates = []
        for pattern in patterns:
            dates.extend(re.findall(pattern, text))

        return dates

    text = """
    Meeting on 2025-10-27
    Birthday: 12/25/2024
    Event: 15.03.2025
    """

    dates = parse_dates(text)
    print(dates)
    # ['2025-10-27', '12/25/2024', '15.03.2025']
Username Validation
    import re

    def validate_username(username):
        """
        Validate username:
        - 3-16 characters
        - Letters, numbers, underscore, hyphen
        - Must start with letter
        """
        pattern = r'^[a-zA-Z][a-zA-Z0-9_-]{2,15}$'
        return bool(re.match(pattern, username))

    usernames = [
        "alice",      # Valid
        "user_123",   # Valid
        "test-user",  # Valid
        "ab",         # Invalid (too short)
        "123user",    # Invalid (starts with number)
        "user@name",  # Invalid (@ not allowed)
    ]

    for username in usernames:
        print(f"{username}: {validate_username(username)}")
Real-world Examples
1. Log Parser
    import re

    def parse_log_line(line):
        """Parse Apache-style log line."""
        pattern = r'(\S+) - - \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'

        match = re.search(pattern, line)
        if match:
            return {
                'ip': match.group(1),
                'timestamp': match.group(2),
                'method': match.group(3),
                'path': match.group(4),
                'status': int(match.group(5)),
                'size': int(match.group(6))
            }
        return None

    log_line = '192.168.1.1 - - [27/Oct/2025:10:30:00 +0000] "GET /index.html HTTP/1.1" 200 1234'
    result = parse_log_line(log_line)
    print(result)
    # {'ip': '192.168.1.1', 'timestamp': '27/Oct/2025:10:30:00 +0000', ...}
2. Markdown Link Extractor
    import re

    def extract_markdown_links(text):
        """Extract links from Markdown text."""
        # Pattern: [text](url)
        pattern = r'\[([^\]]+)\]\(([^)]+)\)'

        matches = re.findall(pattern, text)
        # link_text avoids shadowing the text parameter
        return [{'text': link_text, 'url': url} for link_text, url in matches]

    markdown = """
    Check out [Python](https://python.org) and
    [GitHub](https://github.com) for more info.
    """

    links = extract_markdown_links(markdown)
    for link in links:
        print(f"{link['text']}: {link['url']}")
    # Python: https://python.org
    # GitHub: https://github.com
3. Password Strength Checker
    import re

    def check_password_strength(password):
        """
        Check password strength:
        - At least 8 characters
        - Contains uppercase
        - Contains lowercase
        - Contains digit
        - Contains special character
        """
        checks = {
            'length': len(password) >= 8,
            'uppercase': bool(re.search(r'[A-Z]', password)),
            'lowercase': bool(re.search(r'[a-z]', password)),
            'digit': bool(re.search(r'\d', password)),
            'special': bool(re.search(r'[!@#$%^&*(),.?":{}|<>]', password))
        }

        # Count passed checks before adding the non-boolean entries
        strength = sum(checks.values())
        checks['score'] = strength

        if strength == 5:
            checks['rating'] = 'Strong'
        elif strength >= 3:
            checks['rating'] = 'Medium'
        else:
            checks['rating'] = 'Weak'

        return checks

    # Test
    passwords = ["weak", "Strong1", "Strong1!", "VeryStr0ng!Pass"]

    for pwd in passwords:
        result = check_password_strength(pwd)
        print(f"{pwd}: {result['rating']} (score: {result['score']}/5)")
    # weak: Weak (score: 1/5)
    # Strong1: Medium (score: 3/5)
    # Strong1!: Strong (score: 5/5)
    # VeryStr0ng!Pass: Strong (score: 5/5)
4. HTML Tag Stripper
    import re

    def strip_html_tags(html):
        """Remove HTML tags from text."""
        # Remove tags
        text = re.sub(r'<[^>]+>', '', html)
        # Clean up extra whitespace
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    html = """
    <div class="content">
        <h1>Title</h1>
        <p>This is a <strong>paragraph</strong> with <em>formatting</em>.</p>
    </div>
    """

    clean_text = strip_html_tags(html)
    print(clean_text)
    # Title This is a paragraph with formatting.
5. Credit Card Masking
    import re

    def mask_credit_card(text):
        """Mask credit card numbers in text."""
        # Pattern: 4 groups of 4 digits
        pattern = r'\b(\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})\b'

        def replacer(match):
            # Keep first and last 4 digits, mask middle
            return f"{match.group(1)}-****-****-{match.group(4)}"

        return re.sub(pattern, replacer, text)

    text = """
    Card 1: 1234 5678 9012 3456
    Card 2: 1111-2222-3333-4444
    Card 3: 9876543210123456
    """

    masked = mask_credit_card(text)
    print(masked)
    # Card 1: 1234-****-****-3456
    # Card 2: 1111-****-****-4444
    # Card 3: 9876-****-****-3456
Best Practices
    import re

    # 1. Use raw strings for patterns
    pattern = r'\d+'    # Good - no escaping issues
    # pattern = '\\d+'  # Works, but every backslash must be doubled

    # 2. Compile patterns for reuse
    text = "Contact: alice@example.com"  # placeholder input
    email_re = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')
    email_re.search(text)  # Efficient

    # 3. Use meaningful pattern names
    phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
    date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

    # 4. Test patterns thoroughly
    def test_email_pattern():
        pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

        # Placeholder addresses
        valid = ['user@example.com', 'first.last@example.org']
        invalid = ['invalid', '@example.com', 'user@']

        for email in valid:
            assert re.match(pattern, email), f"Should match: {email}"

        for email in invalid:
            assert not re.match(pattern, email), f"Should not match: {email}"

    test_email_pattern()

    # 5. Document complex patterns
    complex_pattern = r'''
        ^                    # Start of string
        [a-zA-Z]             # Must start with letter
        [a-zA-Z0-9_-]{2,15}  # 2-15 chars: letters, digits, _, -
        $                    # End of string
    '''
    username_re = re.compile(complex_pattern, re.VERBOSE)
Practice Exercises
Exercise 1: IPv4 Validator
Write a function that validates IPv4 addresses.
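One possible starting point, offered as a sketch rather than a reference solution. The names OCTET, IPV4_RE, and is_valid_ipv4 are made up for this example, and rejecting leading zeros (such as 01) is an assumption about what "valid" should mean here.

```python
import re

# One octet in the range 0-255, written as alternatives from most to least specific.
# Assumption: leading zeros such as "01" are rejected.
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'

# Four octets separated by dots; re.ASCII keeps \d limited to 0-9
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}', re.ASCII)

def is_valid_ipv4(address):
    """Return True if address is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(address) is not None

for candidate in ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3", "01.2.3.4"]:
    print(f"{candidate}: {is_valid_ipv4(candidate)}")
# 192.168.1.1: True
# 255.255.255.255: True
# 256.1.1.1: False
# 1.2.3: False
# 01.2.3.4: False
```

Encoding the 0-255 range in the pattern keeps everything in one regex. An alternative is to match \d{1,3} and check the numeric range in Python, which some readers find easier to follow.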
Exercise 2: Hashtag Extractor
Extract hashtags from social media text.
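A minimal sketch of one approach, assuming a hashtag is a # followed by one or more word characters. The function name and sample post are illustrative. Note that this simple pattern also picks up a # in the middle of a word (for example in a URL fragment), which a fuller solution would need to exclude.

```python
import re

HASHTAG_RE = re.compile(r'#(\w+)')  # group 1 is the tag without the leading '#'

def extract_hashtags(text):
    """Return hashtags in order of appearance, without the '#'."""
    return HASHTAG_RE.findall(text)

post = "Learning #Python and #regex today! #100DaysOfCode"
print(extract_hashtags(post))
# ['Python', 'regex', '100DaysOfCode']
```

Because the pattern has exactly one capturing group, findall() returns the group contents rather than the full match.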
Exercise 3: Time Parser
Parse and convert time formats (12h ↔ 24h).
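A sketch of one way to approach this, assuming inputs shaped like "2:05 PM" and "14:05". The function names and the choice to return None for invalid input are assumptions, not requirements of the exercise.

```python
import re

TIME_12H_RE = re.compile(r'(\d{1,2}):(\d{2})\s*([AaPp][Mm])')
TIME_24H_RE = re.compile(r'(\d{1,2}):(\d{2})')

def to_24h(time_str):
    """Convert '2:05 PM' to '14:05'; return None if the input is not valid."""
    m = TIME_12H_RE.fullmatch(time_str.strip())
    if not m:
        return None
    hour, minute, period = int(m.group(1)), int(m.group(2)), m.group(3).upper()
    if not (1 <= hour <= 12 and minute <= 59):
        return None
    # 12 AM becomes 0 and 12 PM stays 12
    hour = hour % 12 + (12 if period == 'PM' else 0)
    return f"{hour:02d}:{minute:02d}"

def to_12h(time_str):
    """Convert '14:05' to '2:05 PM'; return None if the input is not valid."""
    m = TIME_24H_RE.fullmatch(time_str.strip())
    if not m:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    if hour > 23 or minute > 59:
        return None
    period = 'AM' if hour < 12 else 'PM'
    return f"{hour % 12 or 12}:{minute:02d} {period}"

print(to_24h("12:30 AM"))  # 00:30
print(to_24h("2:05 pm"))   # 14:05
print(to_12h("00:30"))     # 12:30 AM
print(to_12h("14:05"))     # 2:05 PM
```

The regex only checks the shape of the input. The range checks on hour and minute are done in Python, which keeps the patterns short.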
Exercise 4: Code Comment Remover
Remove comments from code (Python, JS, etc.).
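A rough sketch for Python-style # comments only. The main difficulty is that a # inside a string literal is not a comment. One common technique is to match string literals and comments with a single alternation and keep only the strings. The names below are illustrative, and multi-line triple-quoted strings are not handled correctly by this version.

```python
import re

# Alternative 1 (group 1): a single- or double-quoted string literal on one line.
# Alternative 2: optional spaces or tabs, then a '#' comment up to the end of the line.
TOKEN_RE = re.compile(
    r'''("(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')|[ \t]*#[^\n]*'''
)

def strip_python_comments(source):
    """Remove '#' comments while leaving '#' inside string literals alone."""
    def keep_strings(match):
        # Put string literals back unchanged; replace comments with nothing
        literal = match.group(1)
        return literal if literal is not None else ''
    return TOKEN_RE.sub(keep_strings, source)

code = 'x = 1  # set x\nurl = "http://a/#frag"  # keep the string\n'
print(strip_python_comments(code))
# x = 1
# url = "http://a/#frag"
```

For production use, Python's tokenize module is a more robust choice, since it understands the full grammar of string literals.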
Exercise 5: Template Variable Replacer
Replace {{variable}} placeholders in templates.
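One way to sketch this uses re.sub() with a replacement function, the same technique as the credit card masking example. The names PLACEHOLDER_RE and render are made up, and leaving unknown placeholders untouched is an assumption about the desired behaviour.

```python
import re

# {{ name }} with optional whitespace inside the braces; group 1 is the variable name
PLACEHOLDER_RE = re.compile(r'\{\{\s*(\w+)\s*\}\}')

def render(template, values):
    """Replace {{name}} placeholders with values[name]."""
    def substitute(match):
        name = match.group(1)
        # Assumption: unknown placeholders are left as-is instead of raising an error
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER_RE.sub(substitute, template)

template = "Hello, {{ name }}! You have {{count}} new {{things}}."
print(render(template, {"name": "Alice", "count": 3}))
# Hello, Alice! You have 3 new {{things}}.
```

Using a function as the replacement avoids problems with backslashes in the substituted values, which a plain replacement string would interpret as escape sequences.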
Summary
✅ re module: match, search, findall, finditer
✅ Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
✅ Character classes: \d \D \w \W \s \S
✅ Quantifiers: * + ? {n} {n,m} {n,}
✅ Greedy vs Non-greedy: * vs *?, + vs +?
✅ Match objects: group(), start(), end(), span()
✅ Compiled patterns: Better performance for reuse
✅ Common use cases: Email, phone, URL, dates
Next Lesson
Lesson 7.2: Regular Expressions (Part 2) - Groups, lookahead/lookbehind, flags, and advanced patterns! 🚀
Remember:
- Use raw strings (r'pattern')
- Compile patterns for reuse
- Test patterns thoroughly
- Document complex patterns
- Keep patterns readable! 🎯