# linkify-it-py
This is a Python port of linkify-it.

A link recognition library with FULL unicode support, focused on high-quality detection of link patterns in plain text.

Why it's awesome:

- Full unicode support, with astral characters!
- International domains support.
- Allows rules extension & custom normalizers.
## Install

```bash
pip install linkify-it-py
```

or

```bash
conda install -c conda-forge linkify-it-py
```
## Usage examples

### Example 1. Simple use

```python
from linkify_it import LinkifyIt

linkify = LinkifyIt()

print(linkify.test("Site github.com!"))
# => True

print(linkify.match("Site github.com!"))
# => [linkify_it.main.Match({
#         'schema': '',
#         'index': 5,
#         'last_index': 15,
#         'raw': 'github.com',
#         'text': 'github.com',
#         'url': 'http://github.com'
#     })]
```
### Example 2. With options

```python
from linkify_it import LinkifyIt
from linkify_it.tlds import TLDS

# Reload the full tlds list & add the unofficial `.onion` domain.
linkify = (
    LinkifyIt()
    .tlds(TLDS)               # Reload with full tlds list
    .tlds("onion", True)      # Add unofficial `.onion` domain
    .add("git:", "http:")     # Add `git:` protocol as "alias"
    .add("ftp:", None)        # Disable `ftp:` protocol
    .set({"fuzzy_ip": True})  # Enable IPs in fuzzy links (without schema)
)

print(linkify.test("Site tamanegi.onion!"))
# => True

print(linkify.match("Site tamanegi.onion!"))
# => [linkify_it.main.Match({
#         'schema': '',
#         'index': 5,
#         'last_index': 19,
#         'raw': 'tamanegi.onion',
#         'text': 'tamanegi.onion',
#         'url': 'http://tamanegi.onion'
#     })]
```
### Example 3. Add twitter mentions handler

```python
import re

from linkify_it import LinkifyIt

linkify = LinkifyIt()


def validate(obj, text, pos):
    tail = text[pos:]

    if not obj.re.get("twitter"):
        obj.re["twitter"] = re.compile(
            "^([a-zA-Z0-9_]){1,15}(?!_)(?=$|" + obj.re["src_ZPCc"] + ")"
        )
    if obj.re["twitter"].search(tail):
        if pos > 2 and tail[pos - 2] == "@":
            return False
        return len(obj.re["twitter"].search(tail).group())
    return 0


def normalize(obj, match):
    match.url = "https://twitter.com/" + re.sub(r"^@", "", match.url)


linkify.add("@", {"validate": validate, "normalize": normalize})
```
## API

### LinkifyIt(schemas, options)

Creates a new linkifier instance with optional additional schemas.

By default it understands:

- `http(s)://...`, `ftp://...`, `mailto:...` & `//...` links
- "fuzzy" links and emails (google.com, foo@bar.com).

`schemas` is a dict, where each key/value pair describes a protocol/rule:

- **key** - link prefix (usually the protocol name with `:` at the end, `skype:` for example). linkify-it-py makes sure that the prefix is not preceded by an alphanumeric char.
- **value** - rule to check the tail after the link prefix
  - `str` - just an alias to an existing rule
  - `dict`
    - **validate** - either a `re.Pattern` (starting with `^`, and not including the link prefix itself), or a validator function which, given the arguments `self`, `text` and `pos`, returns the length of the match in `text` starting at index `pos`. `pos` is the index right after the link prefix. `self` can be used to access the linkify object to cache data.
    - **normalize** - optional function to normalize the text & url of the matched result (for example, for twitter mentions).

`options`:

- **fuzzy_link** - recognize URLs without the `http(s)://` head. Default `True`.
- **fuzzy_ip** - allow IPs in fuzzy links above. Can conflict with some texts like version numbers. Default `False`.
- **fuzzy_email** - recognize emails without the `mailto:` prefix. Default `True`.
- **---** - set to `True` to terminate links with `---` (if it's considered a long dash).
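The validator contract described above can be sketched in plain Python. The `tel:` schema and the regex below are hypothetical examples for illustration, not part of the library:

```python
import re

# Hypothetical pattern for a "tel:" tail: optional "+", one digit,
# then 4-14 more digits or dashes.
TEL_TAIL = re.compile(r"^\+?[0-9][0-9\-]{4,14}")


def validate_tel(obj, text, pos):
    # `pos` is the index right after the "tel:" prefix; return the
    # length of the matched tail, or 0 on fail.
    m = TEL_TAIL.match(text[pos:])
    return len(m.group(0)) if m else 0


# Registering it on a LinkifyIt instance would look like:
#   linkify.add("tel:", {"validate": validate_tel})
print(validate_tel(None, "tel:+123-456-7890", 4))  # => 13
```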
### .test(text)

Searches for a linkifiable pattern and returns `True` on success or `False` on fail.
.pretest(text)#
Quick check if link MAY BE can exist. Can be used to optimize more expensive
.test()
calls. Return False
if link can not be found, True
- if .test()
call needed to know exactly.
### .test_schema_at(text, name, position)

Similar to `.test()` but checks only the specified protocol's tail exactly at the given position. Returns the length of the found pattern (0 on fail).
.match(text)#
Returns list
of found link matches or null if nothing found.
Each match has:
schema - link schema, can be empty for fuzzy links, or
//
for protocol-neutral links.index - offset of matched text
last_index - index of next char after mathch end
raw - matched text
text - normalized text
url - link, generated from matched text
### .match_at_start(text)

Checks if a match exists at the start of the string. Returns a `Match` (see docs for `match(text)`) or `None` if there is no URL at the start. Doesn't work with fuzzy links.
### .tlds(list_tlds, keep_old=False)

Load (or merge) a new tlds list. Those are needed for fuzzy links (without a schema) to avoid false positives. By default:

- 2-letter root zones are ok.
- biz|com|edu|gov|net|org|pro|web|xxx|aero|asia|coop|info|museum|name|shop|рф are ok.
- encoded (`xn--...`) root zones are ok.

If that's not enough, you can reload the defaults with a more detailed zones list.
### .add(key, value)

Add a new schema to the schemas object. As described in the constructor definition, `key` is a link prefix (`skype:`, for example), and `value` is a `str` to alias to another schema, or a `dict` with `validate` and optionally `normalize` definitions. To disable an existing rule, use `.add(key, None)`.
### .set(options)

Override default options. Missing properties will not be changed.