Sunday, December 16, 2012

How to match Unicode characters?

Lately I have been digging through scraped movie information pages with grep, trying to extract data.
I was surprised to find that '.*' in a regexp just does not match 'Castillos de cartón' (yes, that is a movie suggestion as well. :)
After some testing I saw that matching breaks at the accented character, which is actually represented as two bytes. After some more digging I found the solution at: http://www.regular-expressions.info/unicode.html
To sum it all up, you will need a Perl regexp (grep -P), where \X matches a whole Unicode grapheme (a base character plus any combining marks), so it acts as the Unicode version of the dot. So instead of:

grep 'class="blackbigtitle">.*' 135428/.porthu.html

you should use:

grep -P 'class="blackbigtitle">\X*' 135428/.porthu.html
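If you want to try this without the scraped page, here is a minimal sketch. The sample file is a made-up stand-in for the real page, and the second grep assumes a GNU grep built with Perl regexp support. What the plain dot matches may differ with your locale and grep version, so no output is promised here:

```shell
# Hypothetical sample file standing in for the scraped page. The "ó" is
# written as octal escapes so the bytes are UTF-8 (0xC3 0xB3) no matter
# how this script itself is saved.
sample=$(mktemp)
printf 'class="blackbigtitle">Castillos de cart\303\263n\n' > "$sample"

# Count the bytes of the accented character on its own.
o_bytes=$(printf '\303\263' | wc -c | tr -d '[:space:]')
printf 'bytes in the accented o: %s\n' "$o_bytes"

# Plain dot: print only the matched part (-o) to see where matching stops.
plain=$(grep -o 'class="blackbigtitle">.*' "$sample")
printf 'dot:  %s\n' "$plain"

# Perl regexp with \X; guarded because -P is not available in every grep.
if perl_out=$(grep -oP 'class="blackbigtitle">\X*' "$sample" 2>/dev/null); then
    printf 'perl: %s\n' "$perl_out"
else
    printf 'grep -P is not available here, or it found no match\n'
fi

rm -f "$sample"
```

If the two greps print different things, comparing the output of `locale` with the encoding of the file is a good next step.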

And that is all, folks! :)
