python - How to make Django slugify work properly with Unicode strings?

What can I do to prevent slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2)

cnprog.com has Chinese characters in question URLs, so I looked in their code. They are not using slugify in templates; instead they call this method in the Question model to get permalinks:

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?

There is a Python package called unidecode that I've adopted for the askbot Q&A forum. It works well for the Latin-based alphabets and even looks reasonable for Greek:

>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something weird with Asian languages (each character is replaced by a romanized syllable):

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>> 

Does this make sense?

In askbot we compute slugs like so:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify. Sample code is at http://davedash.com/2011/03/24/how-we-slug-at-mozilla/

Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand the meaning of \w\s as it pertains to non-ascii characters.

This custom version is working well for me:

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ascii characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ascii
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    # NOTE: flags must be passed by keyword (Python 2.7+); the 4th positional
    # argument of re.sub() is `count`, so a positional re.UNICODE caps replacements at 32.
    txt = txt.strip()  # remove leading and trailing whitespace
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between digits with dashes
    txt = txt.replace('"', "'")  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

Note the last regex substitution. This is a workaround to a problem with the more robust expression r'\W', which seems to either strip out some non-ascii characters or incorrectly re-encode them, as illustrated in the following python interpreter session:

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (traditional Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

I am unsure what the problem is above, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database," and how that is implemented. I have heard that Python 3.x places a high priority on better Unicode handling, so this may be fixed already. Or maybe it is correct Python behavior, and I am misusing Unicode and/or the Chinese language.

For now, a work-around is to avoid character classes, and make substitutions based on explicitly defined character sets.
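A possible explanation for the session above, offered as a reading of its output rather than something the original poster confirmed: the fourth positional argument of re.sub() is count, not flags, and re.UNICODE has the integer value 32. The first run therefore replaced only the first 32 bytes of the UTF-8 byte string, and the stray question mark is the leftover third byte of a split character. The sketch below reproduces this in Python 3, with count written out explicitly; the variable names are only illustrative.

```python
import re

text = "您認識對全球社區感興趣的中國攝影師嗎"

# re.UNICODE is an integer flag. Passed as the 4th positional argument of
# re.sub() it is interpreted as `count`, i.e. "replace at most 32 matches".
flag_value = int(re.UNICODE)
print(flag_value)

# On UTF-8 bytes, every byte of a CJK glyph is a non-word byte, so a
# byte-oriented \W matches byte by byte and stops after 32 replacements.
raw = text.encode("utf-8")
limited = re.sub(rb"\W", b"X", raw, count=32)
print(limited)

# On a text (Unicode) string, with flags passed by keyword, the ideographs
# count as word characters and nothing is replaced.
unchanged = re.sub(r"\W", "X", text, flags=re.UNICODE)
print(unchanged == text)
```

Under this reading, the three-glyph string came back as nine X characters simply because its nine bytes fit under the limit of 32.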

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> from django.utils.text import slugify
>>> slugify("你好 World", allow_unicode=True)
'你好-world'

If you are on Django <= 1.8 (which has been unsupported since April 2018), you can copy the code from Django 1.9.
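For readers who cannot upgrade, here is a minimal standard-library sketch of a Unicode-preserving slug function in the same spirit. It is an approximation written for illustration, not Django's code, and the name unicode_slug is hypothetical.

```python
import re
import unicodedata


def unicode_slug(value):
    """Lowercase, drop punctuation, and join words with hyphens, keeping non-ASCII letters."""
    # NFKC folds compatibility forms (e.g. full-width letters) into canonical ones.
    value = unicodedata.normalize("NFKC", str(value))
    # In Python 3, \w on str matches Unicode word characters by default.
    value = re.sub(r"[^\w\s-]", "", value.lower())
    # Collapse runs of whitespace and hyphens, then trim separators at the ends.
    return re.sub(r"[-\s]+", "-", value).strip("-_")


print(unicode_slug("你好 World"))
print(unicode_slug("  Παίζω, τρέχω!  "))
```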

I'm afraid Django's definition of a slug means ASCII, though the Django docs don't state this explicitly. Here is the source of the slugify filter in defaultfilters; you can see that the value is converted to ASCII, with the 'ignore' option so that characters that cannot be encoded are silently dropped:

import re
import unicodedata
from django.utils.safestring import mark_safe

def slugify(value):  # Python 2 code; `value` is a unicode string
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')  # drop non-ASCII
    value = unicode(re.sub(r'[^\w\s-]', '', value).strip().lower())
    return mark_safe(re.sub(r'[-\s]+', '-', value))

Based on that, I'd guess that cnprog.com is not using an official slugify function. You may wish to adapt the django snippet above if you want a different behaviour.
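To see what that ASCII conversion step does in isolation, here is a small Python 3 sketch using only the standard library. The helper name to_ascii is made up for this example.

```python
import unicodedata


def to_ascii(value):
    # NFKD splits accented letters into a base letter plus combining marks;
    # encoding with 'ignore' then drops everything that is not plain ASCII.
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")


accented = to_ascii("Ąę café")
chinese = to_ascii("你好 World")
print(repr(accented))
print(repr(chinese))
```

Accented Latin letters survive as their base letters, while the Chinese characters have no ASCII decomposition and vanish entirely, which is why the stock filter produces empty or truncated slugs for such titles.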

Having said that, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'()) should be encoded using the %hex notation. If you look at the raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent; the browser just makes them look pretty in the address bar. I suspect this is why slugify insists on ASCII only, fwiw.
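The encoding the browser applies can be reproduced with Python 3's urllib.parse; the URL path below is invented for the example.

```python
from urllib.parse import quote, unquote

path = "/question/42/你好-world"  # hypothetical permalink

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character;
# "/" is left untouched by default.
encoded = quote(path)
print(encoded)

# unquote() reverses it, which is roughly what the address bar shows.
decoded = unquote(encoded)
print(decoded)

# A question mark inside a slug would be encoded too.
print(quote("why?"))
```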

You might want to look at: https://github.com/un33k/django-uuslug

It will take care of both "U"s for you: the U in unique and the U in Unicode.

It will do the job for you hassle free.

This is what I use:

http://trac.django-fr.org/browser/site/trunk/djangofr/links/slughifi.py

SlugHiFi is a wrapper for the regular slugify, with the difference that it replaces national characters with their English alphabet counterparts.

So instead of "Ą" you get "A", instead of "Ł" => "L", and so on.
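The general idea can be sketched as follows. This is not SlugHiFi's code: the mapping table is a deliberately tiny, hypothetical one and the function name ascii_slug is made up. An explicit table is needed because letters such as "Ł" have no Unicode decomposition and would otherwise be dropped.

```python
import re
import unicodedata

# Hypothetical mapping for letters that NFKD does not decompose.
CHAR_MAP = str.maketrans({"Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "ß": "ss"})


def ascii_slug(text):
    text = text.translate(CHAR_MAP)  # explicit replacements first
    # Strip remaining accents: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)


print(ascii_slug("Łódź Ąę"))
```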

I am interested in allowing only ASCII characters in the slug, which is why I benchmarked some of the available tools on the same string:

  • Unicode Slugify:

    In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
    37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-glo-la-fdo'
    
  • Django Uuslug:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-g-lo-la-fd-o'
    
  • Awesome Slugify:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'Paizo-trekho-kai-g-lo-la-fd-o'
    
  • Python Slugify:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-g-lo-la-fd-o'
    
  • django.utils.text.slugify with Unidecode:

    In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
    36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-glo-la-fdo'
    