Perl: changing a file's encoding

Updated: 30.06.2024

Encode consists of a collection of modules whose details are too extensive to fit in one document. This document explains the top-level APIs and general topics at a glance. For other topics and more details, see the documentation for the individual modules, such as Encode::Alias, Encode::Encoding, and Encode::Supported.

The Encode module provides the interface between Perl strings and the rest of the system. Perl strings are sequences of characters.

The repertoire of characters that Perl can represent is a superset of those defined by the Unicode Consortium. On most platforms the ordinal value of a character as returned by ord(S) is the Unicode codepoint for that character. The exceptions are platforms where the legacy encoding is some variant of EBCDIC rather than a superset of ASCII; see perlebcdic.

This document mostly explains the how. perlunitut and perlunifaq explain the why.

byte: A character in the range 0..255; a special case of a Perl character.

octet: 8 bits of data, with ordinal values 0..255; the term for bytes passed to or from a non-Perl context, such as a disk file, standard I/O stream, database, command-line argument, environment variable, socket, etc.

$octets = encode(ENCODING, STRING[, CHECK]) encodes the scalar value STRING from Perl's internal form into ENCODING and returns a sequence of octets.

CAVEAT: the input scalar STRING might be modified in-place depending on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left unchanged.

CAVEAT: When you run $octets = encode("UTF-8", $string), then $octets might not be equal to $string. Though both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag on the result is always off, even when it contains a completely valid UTF-8 string. See "The UTF8 flag" below.

If the $string is undef, then undef is returned.

str2bytes may be used as an alias for encode.
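As a minimal sketch of the behaviour described above, the following encodes a two-character string to UTF-8 and inspects the result. The sample string is an assumption made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# A character string: U+00E9 (e with acute) followed by U+20AC (euro sign).
my $string = "\x{E9}\x{20AC}";

# encode() returns octets; the UTF8 flag on the result is always off.
my $octets = encode("UTF-8", $string);

printf "characters: %d, octets: %d\n", length($string), length($octets);
printf "UTF8 flag on result: %s\n", is_utf8($octets) ? "on" : "off";
```

No CHECK argument is passed here, so $string itself is left unchanged.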

$string = decode(ENCODING, OCTETS[, CHECK]) decodes a sequence of octets, assumed to be in ENCODING, into Perl's internal form and returns the resulting string.

CAVEAT: the input scalar OCTETS might be modified in-place depending on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left unchanged.

CAVEAT: When you run $string = decode("UTF-8", $octets), then $string might not be equal to $octets. Though both contain the same data, the UTF8 flag for $string is on. See "The UTF8 flag" below.

If the $string is undef, then undef is returned.

bytes2str may be used as an alias for decode.
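The reverse direction can be sketched the same way. The octets below are an illustrative assumption: the Russian word "Привет" encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Twelve octets: the six Cyrillic letters of "Привет" in UTF-8.
my $octets = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# decode() returns a character string; its UTF8 flag is on.
my $string = decode("UTF-8", $octets);

printf "octets: %d, characters: %d\n", length($octets), length($string);
printf "first code point: U+%04X\n", ord($string);
printf "UTF8 flag on result: %s\n", is_utf8($string) ? "on" : "off";
```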

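find_encoding(ENCODING) returns the encoding object corresponding to ENCODING, or undef if no matching ENCODING is found. The returned object is what does the actual encoding or decoding.

In effect, decode($name, $bytes) looks up the object with find_encoding($name), croaks if the encoding is not found, and then calls $obj->decode($bytes), with more error checking.

You can therefore save time by reusing this object across calls.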

Besides "decode" and "encode", other methods are available as well. For instance, name() returns the canonical name of the encoding object.
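A short sketch of reusing one encoding object for many strings. The cp1251 records are hypothetical sample data, not taken from the original text.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look the encoding up once; find_encoding returns undef when nothing matches.
my $enc = find_encoding("cp1251")
    or die "encoding cp1251 not found";

# Hypothetical input: three short byte strings in cp1251.
my @records = ("\xC4\xE0", "\xCD\xE5\xF2", "\xCF\xF0\xE8\xE2\xE5\xF2");

# Reuse the same object for every record instead of resolving the name each time.
my @decoded = map { $enc->decode($_) } @records;

printf "%s: decoded %d records\n", $enc->name, scalar @decoded;
```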

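find_mime_encoding(MIME_ENCODING) returns the encoding object corresponding to MIME_ENCODING. It acts the same as find_encoding(), but the mime_name() of the returned object must match MIME_ENCODING. So, unlike find_encoding(), canonical names and aliases are not used when searching for the object.

from_to($octets, FROM_ENC, TO_ENC [, CHECK]) converts data in place between two encodings. For example, from_to($octets, "iso-8859-1", "cp1250") converts ISO-8859-1 data into CP1250, and from_to($octets, "cp1250", "iso-8859-1") converts it back.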

Because the conversion happens in place, the data to be converted cannot be a string constant: it must be a scalar variable.

from_to() returns the length of the converted string in octets on success, and undef on error.

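CAVEAT: The following operations may look the same, but are not: from_to($data, "iso-8859-1", "utf8") converts the octets in place and leaves the UTF8 flag off, whereas $data = decode("iso-8859-1", $data) produces a character string with the UTF8 flag on.

from_to($octets, $from, $to, $check) is equivalent to $octets = encode($to, decode($from, $octets), $check).

Yes, it does not respect the $check during decoding. It is deliberately done that way. If you need minute control, use decode followed by encode, passing CHECK to each call.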
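The difference between the in-place call and the two-step form can be sketched as follows. The sample data and the choice of FB_CROAK | LEAVE_SRC for CHECK are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to is_utf8 FB_CROAK LEAVE_SRC);

my $data = "caf\xE9";    # "cafe" with e-acute, 4 octets in ISO-8859-1
my $copy = $data;

# In-place conversion; returns the new length in octets.
# The result is still a byte string (UTF8 flag off).
my $len = from_to($data, "iso-8859-1", "UTF-8");

# Two explicit steps, with CHECK applied to the decode and to the encode.
my $chars  = decode("iso-8859-1", $copy, FB_CROAK | LEAVE_SRC);
my $octets = encode("UTF-8", $chars, FB_CROAK | LEAVE_SRC);

printf "from_to length: %d, flag: %s\n", $len, is_utf8($data) ? "on" : "off";
printf "same octets either way: %s\n", $data eq $octets ? "yes" : "no";
```

The intermediate $chars value is the only character string in this sketch; both final results are octets.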

WARNING: do not use encode_utf8 for data exchange, as it can produce $octets that are not strict UTF-8! For strictly valid UTF-8 output use $octets = encode("UTF-8", $string).

decode_utf8($octets [, CHECK]) is equivalent to $string = decode("utf8", $octets [, CHECK]). The sequence of octets represented by $octets is decoded from (loose, not strict) utf8 into a sequence of logical characters. Because not all sequences of octets are valid even in loose utf8, it is quite possible for this function to fail. For CHECK, see "Handling Malformed Data".

WARNING: do not use decode_utf8 for data exchange, as it can produce a $string whose representation is not strict UTF-8! For a strictly valid UTF-8 $string representation use $string = decode("UTF-8", $octets [, CHECK]).

CAVEAT: the input $octets might be modified in-place depending on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left unchanged.

encodings() returns a list of canonical names of available encodings that have already been loaded. To get a list of all available encodings, including those that have not yet been loaded, call Encode->encodings(":all").

Or you can give the name of a specific module, as in Encode->encodings("Encode::JP").

When "::" is not in the name, "Encode::" is assumed, so Encode->encodings("EBCDIC") refers to Encode::EBCDIC.

To find out in detail which encodings are supported by this package, see Encode::Supported.
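A small sketch of listing encodings. It only assumes that cp866 is among the encodings shipped with Encode, which is the case for the standard distribution.

```perl
use strict;
use warnings;
use Encode;

# Encodings that are already loaded.
my @loaded = Encode->encodings();

# Everything Encode knows about, loaded or not.
my @all = Encode->encodings(":all");

printf "loaded: %d, all: %d\n", scalar @loaded, scalar @all;
print "cp866 is available\n" if grep { $_ eq "cp866" } @all;
```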

To add a new alias to a given encoding, load Encode::Alias and call define_alias(NEWNAME => ENCODING).

After that, NEWNAME can be used as an alias for ENCODING. ENCODING may be either the name of an encoding or an encoding object.

Before you do that, first make sure the alias is nonexistent using resolve_alias(), which returns the canonical name thereof. For example, Encode::resolve_alias("latin1") returns "iso-8859-1", and the call returns false for a name that does not resolve to any encoding.

resolve_alias() does not need use Encode::Alias; it can be imported via use Encode qw(resolve_alias).
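The check-then-define sequence might look like this. The alias name "corp-legacy" is hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);
use Encode::Alias;    # exports define_alias

# Hypothetical alias name; any unused name would do.
my $new_name = "corp-legacy";

# Only define the alias when the name does not already resolve to something.
my $was_free = resolve_alias($new_name) ? 0 : 1;
define_alias($new_name => "cp1251") if $was_free;

print "latin1 resolves to: ", resolve_alias("latin1"), "\n";
print "$new_name resolves to: ", find_encoding($new_name)->name, "\n";
```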

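The canonical name of an encoding does not necessarily agree with the name registered with IANA and used in MIME headers (for example in Content-Type: text/plain; charset=...). As of Encode version 2.21, a new method mime_name() is therefore added.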

If your perl supports PerlIO (which is the default), you can use a PerlIO layer to decode and encode directly via a filehandle. Doing so is fully identical in functionality to reading the raw octets and calling decode and encode yourself.

In the first approach, you let the appropriate encoding layer handle the conversion. In the second, you explicitly translate from one encoding to the other.

Unfortunately, it may be that encodings are not PerlIO-savvy. You can check whether your encoding is supported by PerlIO by invoking the perlio_ok method on it, either as Encode::perlio_ok(NAME) or as find_encoding(NAME)->perlio_ok.

Fortunately, all encodings that come with Encode core are PerlIO-savvy except for hz and ISO-2022-kr. For the gory details, see Encode::Encoding and Encode::PerlIO.
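A sketch comparing the two approaches on the same input. The scratch file and its cp1251 contents are assumptions created by the example itself.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Scratch input: the word "Привет" plus a newline, stored as cp1251 octets.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xCF\xF0\xE8\xE2\xE5\xF2\n";
close $tmp or die "close: $!";

# Approach 1: the encoding layer converts while reading.
open my $in, "<:encoding(cp1251)", $path or die "open: $!";
my $via_layer = do { local $/; <$in> };
close $in;

# Approach 2: read raw octets, then decode explicitly.
open my $raw, "<:raw", $path or die "open: $!";
my $octets = do { local $/; <$raw> };
close $raw;
my $via_decode = decode("cp1251", $octets);

print "identical results\n" if $via_layer eq $via_decode;
print "cp1251 is PerlIO-savvy\n" if Encode::perlio_ok("cp1251");
```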

The optional CHECK argument tells Encode what to do when encountering malformed data. Without CHECK, Encode::FB_DEFAULT (== 0) is assumed.

As of version 2.12, Encode supports coderef values for CHECK; see below.

NOTE: Not all encodings support this feature. Some encodings ignore the CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error.

If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used. When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is used. If the data is supposed to be UTF-8, an optional lexical warning of warning category "utf8" is given.

If CHECK is 1 (Encode::FB_CROAK), methods immediately die with an error message. Therefore, when CHECK is 1, you should trap exceptions with eval {}, unless you really want to let it die.
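Trapping the exception could be sketched like this. The malformed input is an assumption, and LEAVE_SRC is added so the source scalar stays untouched.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# The octet 0xFF can never appear in well-formed UTF-8.
my $octets = "abc\xFFdef";

# With FB_CROAK, decode dies on malformed data, so wrap the call in eval {}.
my $string = eval { decode("UTF-8", $octets, FB_CROAK | LEAVE_SRC) };
my $error  = $@;

if (defined $string) {
    print "decoded ", length($string), " characters\n";
}
else {
    print "decoding failed: $error";
}
```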

FB_WARN is the same as FB_QUIET, which returns the portion of the data processed so far when an error occurs and leaves the unprocessed remainder in the data argument, except that instead of being silent on errors, it issues a warning. This is handy for when you are debugging.

CAVEAT: All warnings from the Encode module are reported independently of the warnings pragma settings. If you want Encode to follow the lexical warnings configured by the warnings pragma, also add the check value Encode::ONLY_PRAGMA_WARNINGS. This value is available since Encode version 2.99.

For encodings that are implemented by the Encode::XS module, CHECK == Encode::FB_PERLQQ puts encode and decode into perlqq fallback mode.

When you decode, \xHH is inserted for a malformed character, where HH is the hex representation of the octet that could not be decoded to utf8. When you encode, \x{HHHH} will be inserted, where HHHH is the Unicode code point (in any number of hex digits) of the character that cannot be found in the character repertoire of the encoding.
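A brief sketch of the perlqq fallback in both directions. The sample strings are assumptions, and the exact letter case of the hex digits in the escapes is not relied on here.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_PERLQQ);

# U+263A (white smiling face) has no ASCII equivalent.
my $text    = "smile: \x{263A}";
my $encoded = encode("ascii", $text, FB_PERLQQ);

# 0xFF is not valid in UTF-8.
my $octets  = "bad: \xFF";
my $decoded = decode("UTF-8", $octets, FB_PERLQQ);

print "$encoded\n";
print "$decoded\n";
```

Both results are plain ASCII, with the offending character or octet replaced by a Perl-style escape.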

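The FB_HTMLCREF and FB_XMLCREF modes work like FB_PERLQQ but insert HTML and XML character references instead. In Encode 2.10 or later, LEAVE_SRC is also implied.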

These modes are all actually set via a bitmask. You can import the FB_XXX constants via use Encode qw(:fallbacks), and you can import the generic bitmask constants via use Encode qw(:fallback_all).

As of Encode 2.12, CHECK can also be a code reference which takes the ordinal value of the unmapped character as an argument and returns octets that represent the fallback character. For instance, $ascii = encode("ascii", $utf8, sub { sprintf "<U+%04X>", shift }) acts like FB_PERLQQ but <U+XXXX> is used instead of \x{XXXX}.

Fallback for decode must return decoded string (sequence of characters) and takes a list of ordinal values as its arguments. So for example if you wish to decode octets as UTF-8, and use ISO-8859-15 as a fallback for bytes that are not valid UTF-8, you could write
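A sketch of such a decode fallback. The input octets are an assumption: a lone ISO-8859-15 byte (0xE9) followed by a space and a well-formed UTF-8 euro sign. The callback decodes every ordinal it receives, since it is called with a list.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Called for octets that are not valid UTF-8; receives their ordinal values
# and must return a decoded (character) string.
my $fallback = sub {
    return join '', map { decode("ISO-8859-15", chr($_)) } @_;
};

# "caf" + 0xE9 (e-acute in ISO-8859-15) + space + euro sign in UTF-8.
my $octets = "caf\xE9 \xE2\x82\xAC";
my $string = decode("UTF-8", $octets, $fallback);

printf "decoded %d characters\n", length($string);
```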

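To define a new encoding, use define_encoding($object, CANONICAL_NAME [, alias...]), which can be imported via use Encode qw(define_encoding).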

CANONICAL_NAME will be associated with $object. The object should provide the interface described in Encode::Encoding. If more than two arguments are provided, additional arguments are considered aliases for $object.

Before the introduction of Unicode support in Perl, the eq operator just compared the strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings with simultaneous consideration of the UTF8 flag. To explain why we made it so, I quote from page 402 of Programming Perl, 3rd ed.

Goal #1: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

Goal #2: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

Goal #3: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

Goal #4: Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

When Programming Perl, 3rd ed. was written, not even Perl 5.6.0 had been born yet, and many features documented in the book remained unimplemented for a long time. Perl 5.8 corrected much of this, and the introduction of the UTF8 flag is one of them. You can think of there being two fundamentally different kinds of strings and string-operations in Perl: one a byte-oriented mode for when the internal UTF8 flag is off, and the other a character-oriented mode for when the internal UTF8 flag is on.

[INTERNAL] is_utf8(STRING [, CHECK]) tests whether the UTF8 flag is turned on in the STRING. If CHECK is true, it also checks whether STRING contains well-formed UTF-8. Returns true if successful, false otherwise.

CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa.

As of Perl 5.8.1, utf8 also has the utf8::is_utf8 function.

NOTE: For security reasons, the internal functions that switch the UTF8 flag on or off (_utf8_on and _utf8_off) do not work on tainted values.

The former default, in which Perl would always use a loose interpretation of UTF-8, has now been overruled: on Encode 2.10 or later with Perl 5.8.7 or later, "UTF-8" and "utf8" name different encodings.

Got that? As of Perl 5.8.7, "UTF-8" means UTF-8 in its current sense, which is conservative and strict and security-conscious, whereas "utf8" means UTF-8 in its former sense, which was liberal and loose and lax. Encode version 2.10 or later thus groks this subtle but critically important distinction between "UTF-8" and "utf8".
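The distinction can be observed through the canonical names of the two encoding objects. This is a minimal sketch that only prints what find_encoding reports.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# The hyphenated name resolves to the strict encoding,
# the unhyphenated name to the loose one.
my $strict = find_encoding("UTF-8")->name;
my $loose  = find_encoding("utf8")->name;

# Encoding names are looked up case-insensitively.
my $lower  = find_encoding("utf-8")->name;

print "UTF-8 => $strict\n";
print "utf8  => $loose\n";
print "utf-8 => $lower\n";
```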

While Dan Kogai retains the copyright as a maintainer, credit should go to all those involved. See AUTHORS for a list of those who submitted code to the project.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Perldoc Browser is maintained by Dan Book (DBOOK). Please contact him via the GitHub issue tracker or email regarding any issues with the site itself, search, or rendering of documentation.

The Perl documentation is maintained by the Perl 5 Porters in the development of Perl. Please contact them via the Perl issue tracker, the mailing list, or IRC to report any issues with the contents or format of the documentation.

Many questions arise from the variety of encodings and from the terminology used. In addition, many of us have run into problems related to encodings. In this article I will try to present the subject in an understandable form. I will start with automatic detection of a text's encoding.

In Perl you can use Encode::Guess for this, but a more advanced, production-grade option is Encode::Detect::Detector. As its documentation says, it provides an interface to Mozilla's universal charset detector.

If you study the source code, pay attention to the file nsUniversalDetector.cpp and the method

nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)

This method first checks the start of the buffer for a byte order mark (BOM):

EF BB BF UTF-8 encoded BOM
FE FF 00 00 UCS-4, unusual octet order BOM (3412)
FE FF UTF-16, big endian BOM
00 00 FE FF UTF-32, big-endian BOM
00 00 FF FE UCS-4, unusual octet order BOM (2143)
FF FE 00 00 UTF-32, little-endian BOM
FF FE UTF-16, little endian BOM

For data without a BOM it uses three probers:

nsMBCSGroupProber;
nsSBCSGroupProber;
nsLatin1Prober;

each of which is responsible for analyzing a group of encodings (MB stands for multibyte, SB for single-byte).

nsMBCSGroupProber supports encodings such as "UTF8", "SJIS", "EUCJP", "GB18030", "EUCKR", "Big5", and "EUCTW".

nsSBCSGroupProber covers encodings such as Win1251, koi8r, ibm866, and others.

Detection of a single-byte encoding is based on analyzing how often two-character sequences occur in the text.

It should be said that all of these methods are probabilistic. For example, if there are too few words to work with, no algorithm can determine the encoding automatically. That is why different programming environments handle encodings in their own ways, and none of them detects everything by itself.

Following the widespread adoption of Unicode encodings in applications, the Perl developers also implemented Unicode support in Perl. In addition, the Encode module supports other encodings, both single-byte and multibyte; the list can be found in the Encode::Config package. For working with e-mail, the "MIME encodings" are supported: MIME-Header, MIME-B, MIME-Q, MIME-Header-ISO_2022_JP.

UTF-8 is very widely used as the encoding for web documents. UTF-16 is used in Java and Windows; UTF-8 and UTF-32 are used by Linux and other Unix-like systems.

Basic Unicode support first appeared in Perl 5.6.0. For more serious Unicode work, however, Perl 5.8.0 was the recommended minimum. Perl 5.14.0 is the first release in which Unicode support can be integrated (almost) seamlessly, without several of the earlier pitfalls (the exceptions are some differences in quotemeta). Version 5.14 also fixes a number of bugs and deviations from the Unicode standard.

The "Unicode Bug" in Perl. Something similar to what happens in Visual Studio also happens with a Perl program, but Perl developers can state the encoding of the application's source code explicitly. That is why beginners who open their favorite editor on a Russian-language Windows XP, write a few Cyrillic string literals in ANSI (that is, cp1251), and compare them

find that the strings in the variables are not equal, and are at first puzzled about what is going on. The same happens with regular expressions and string functions (although uc($c) will work correctly).

It is enough to tell the interpreter that the source file is encoded in cp1251 (with use encoding 'cp1251';), and everything will work correctly. More precisely, the variables $a and $b will then hold strings in Perl's internal format.
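The effect can be reproduced without depending on the encoding of the source file. The sketch below is not the article's original program: it writes the cp1251 octets with escapes and uses Encode::decode in place of the encoding pragma, which newer Perls deprecate (the warning is quoted later in this text). The variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Слово" and "слово" as cp1251 octets, which is what string literals hold
# when a cp1251 source file is read without any encoding declaration.
my $upper_bytes = "\xD1\xEB\xEE\xE2\xEE";
my $lower_bytes = "\xF1\xEB\xEE\xE2\xEE";

# On plain byte strings, lc() has no knowledge of cp1251.
my $bytes_match = lc($upper_bytes) eq $lower_bytes ? 1 : 0;

# After decoding, both values are character strings and lc() knows Cyrillic.
my $upper = decode("cp1251", $upper_bytes);
my $lower = decode("cp1251", $lower_bytes);
my $chars_match = lc($upper) eq $lower ? 1 : 0;

print "byte strings compare equal after lc: $bytes_match\n";
print "character strings compare equal after lc: $chars_match\n";
```

In a plain script with no locale and without the unicode_strings feature, lc() changes only ASCII letters in a byte string, which is the situation the article describes.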

Perl's internal string format. In reasonably recent versions of Perl, strings can be stored in the so-called internal format (Perl's internal form). Note that they can also be stored simply as a sequence of bytes. In the example above, where the source file encoding was not set explicitly (with use encoding 'cp1251';), the variables $a, $b, and $c hold plain sequences of bytes (the Perl documentation also uses the term "a sequence of octets").

The internal format differs from a byte sequence in that the UTF-8 encoding is used and the UTF8 flag is turned on for the variable. Consider an example. If we change the program slightly so that it dumps the internals of a variable (the output below is in the format printed by Devel::Peek),

this is what we get:

SV = PV(0x199ee4) at 0x19bfb4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x19316c "\321\201\320\273\320\276\320\262\320\276"\0 [UTF8 "\x{441}\x{43b}\x{43e}\x{432}\x{43e}"]
CUR = 10
LEN = 12

Note that FLAGS = (PADMY,POK,pPOK,UTF8). If we remove use encoding 'cp1251';
we get

SV = PV(0x2d9ee4) at 0x2dbfc4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d316c "\321\201\320\273\320\276\320\262\320\276"\0
CUR = 10
LEN = 12

When we state that the source file is in cp1251 or some other encoding, Perl knows that it must convert the string literals in the source code from that encoding into the internal format (here, from cp1251 into the internal UTF-8 format), and it does so.

A similar encoding problem arises when working with data that comes from outside, for example from files or the web. Let us look at each case.

Suppose we have a file in the cp866 encoding that contains the word «Когда» (capitalized in the text file). We need to open it and check every line for the word «когда». The right way to do this is to open the file with the "<:encoding(cp866)" layer (the source code itself must be in utf8).

Note that if we do not use "<:encoding(cp866)" and instead specify use encoding 'cp866', regular expressions will work, but only on the byte sequence, and /i will not work. The "<:encoding(cp866)" construct tells Perl that the data in the text file is in CP866, so Perl correctly converts it from CP866 into the internal format (CP866 -> UTF8, plus turning the UTF8 flag on).
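A self-contained sketch of this approach. It is not the article's original listing: the scratch file is created by the example, and the search word is written with \x{...} escapes so the sketch does not depend on how the script itself is saved.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding "Когда" (capital first letter) as cp866 octets.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\x8A\xAE\xA3\xA4\xA0\n";
close $tmp or die "close: $!";

# The layer decodes cp866 into Perl's internal format while reading.
open my $fh, "<:encoding(cp866)", $path or die "open: $!";

# "когда" in lower case, written as code points.
my $word = "\x{43A}\x{43E}\x{433}\x{434}\x{430}";

my $hits = 0;
while (my $line = <$fh>) {
    $hits++ if $line =~ /$word/i;    # /i works because $line holds characters
}
close $fh;

print "matching lines: $hits\n";
```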

In the next example we fetch a page with LWP::UserAgent. The correct way to do it includes an explicit decoding step.

Note the call $content = decode('utf8',$content).

LWP::UserAgent works with bytes. It does not know, and it is not its job to know, whether the page is in single-byte cp1251 or in UTF8; we must state this explicitly. Unfortunately, much of the literature contains examples written for English text and for older versions of Perl, so those examples say nothing about re-encoding.

The example of fetching external data from a web site brings us to the Encode module. Its basic API is very important in the work of any Perl programmer.

In the example where we opened a CP866 text file, we may leave out <:encoding(cp866). Each read operation will then return a sequence of bytes in CP866. We can convert them into the internal format ourselves with decode, for example $str = decode('cp866', $str),

and then continue working with the variable $str.

One might assume that it is enough to write the program source in utf8 and to convert the data from cp866 to utf8, and everything will work as expected. This is not so; consider an example (the word «Когда» is capitalized in the text file).

After Encode::from_to($str,'cp866','utf8') runs, $str contains data in utf8, but as a sequence of bytes (octets), so /i does not work. To make everything work, add a call that decodes those octets into a character string.

Of course, a simpler option is one line instead of two: decode the cp866 data directly.
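The three variants can be compared side by side. This is a sketch with assumed sample data (the cp866 octets of "Когда"), not the article's original code.

```perl
use strict;
use warnings;
use Encode qw(decode from_to);

my $word  = "\x{43A}\x{43E}\x{433}\x{434}\x{430}";    # "когда" as code points
my $bytes = "\x8A\xAE\xA3\xA4\xA0";                   # "Когда" as cp866 octets

# Variant 1: from_to only. The result is UTF-8 encoded, but still octets.
my $str = $bytes;
from_to($str, "cp866", "utf8");
my $octet_match = $str =~ /$word/i ? 1 : 0;

# Variant 2: from_to followed by decode gives a character string.
$str = decode("utf8", $str);
my $char_match = $str =~ /$word/i ? 1 : 0;

# Variant 3: a single decode straight from cp866.
my $direct = decode("cp866", $bytes);

print "match on octets: $octet_match\n";
print "match on characters: $char_match\n";
print "direct decode gives the same string\n" if $direct eq $str;
```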

Perl's internal string format in more detail. We have already said that regular expressions, some modules, and string functions work correctly with strings stored in Perl's internal representation rather than as a sequence of bytes. We also said that Perl uses UTF-8 as its internal string storage format. This encoding was not chosen by accident. The code points 0-127 in this encoding coincide with ASCII (US-ASCII), which covers the English alphabet; that is why calling uc on a string with codes from 0 to 127 works correctly, regardless of the single-byte encoding in which the source code is saved. For UTF8 everything likewise works correctly.

However, that is not all there is to know.

UTF-8 vs utf8 vs UTF8. Over time the UTF-8 encoding became stricter (for example, certain characters were prohibited). As a result, Perl's implementation of UTF-8 became outdated. Starting with Perl 5.8.7, "UTF-8" means the modern, stricter dialect, whereas "utf8" means the older, more liberal dialect.

So the hyphen between "UTF" and "8" matters: without it, Encode becomes more liberal and possibly excessively permissive.

Working with the console. Consider the console of the Windows family of operating systems. Windows distinguishes the Unicode, ANSI, and OEM encodings. The OS API itself provides two kinds of functions, which work with ANSI and with Unicode (UTF-16). ANSI depends on the OS locale; the Russian version uses CP1251. OEM is the encoding used for console input and output; on Russian-language Windows this is CP866. It is the encoding that was introduced in Russian-language MS-DOS and later carried over into Windows for backward compatibility with old software. That is why a program saved in utf-8 that simply prints a Cyrillic string

will not print the desired string: we output UTF8 where CP866 is needed. This is where the Encode::Locale module comes in. Its source code shows that on Windows it determines the ANSI and console encodings and creates the aliases console_in, console_out, locale, and locale_fs. All that remains is to change our program slightly so that it uses these aliases.

The problem is that on Russian Windows the source text is usually encoded in "windows-1251",
while the Russian Windows console window expects the Russian DOS encoding cp866, hence this effect.

So, to tell Perl what to do in this case (the source is in 1251 and standard output is in 866),
add a line at the top of the script that declares the source encoding as cp1251 and the STDOUT encoding as cp866.

Correspondingly, if the editing environment is UTF-8, the declared source encoding changes to UTF-8.

IMPORTANT ADDITION:

With Perl 5 versions > 5.18,
the following warning appears:
Use of the encoding pragma is deprecated

If that worries you, the right way in this case is to use a UTF-8 environment after all
and add lines that enable use utf8; and set the output encoding of STDOUT to cp866.
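A sketch of that setup, assuming the script file is saved as UTF-8. On a real console the key line would be binmode(STDOUT, ":encoding(cp866)"); here a hypothetical in-memory handle stands in for STDOUT so that the octets actually written can be inspected.

```perl
use strict;
use warnings;
use utf8;    # the source text of this script is UTF-8

my $text = "Привет";

# On a real console one would write: binmode(STDOUT, ":encoding(cp866)");
# An in-memory handle with the same layer is used here instead.
my $buffer = '';
open my $console, ">:encoding(cp866)", \$buffer or die "open: $!";
print {$console} $text;
close $console or die "close: $!";

# Show the octets that would have reached the console.
print join(" ", map { sprintf "%02X", ord } split //, $buffer), "\n";
```

The layer converts the character string into cp866 octets on output, which is what the Russian-language console expects.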


Addition.
(a small one)

Everything written above matters if you do intend to use use utf8;
when writing your Perl programs.

If all you need is to debug Perl scripts in a CMD console window,
then on Russian Windows the most sensible thing is to run chcp 1251 once
in that console window. Quiet happiness follows, because the default encoding then matches
across all components of Windows (text editors, file names in the file system, the input/output subsystem).

An equivalent option is to add `chcp 1251`; to your script anywhere before the first output statement.

Addition
(one more)

If you really need full Unicode characters in the console,
the single-byte 1251 encoding will not help;
you have to switch the cmd console to utf-8 (chcp 65001).

But the output then comes out as complete garbage (lines are broken and duplicated). Most likely this is a Windows bug; I have not investigated it yet.

A small theoretical digression:
the Perl core has so-called I/O layers,
which transparently perform various transformations during input and output
(such as re-encoding, adding CRLF, encryption and decryption, and so on).

These layers can be attached, detached, and reordered for your own purposes.

The use, require, binmode, and open operators can be used to manipulate layers.

The built-in layers are:
"unix" "perlio" "crlf" "mmap" "pending" "raw" "utf8"
Some of them are "empty", for example raw and utf8; empty in the sense that they merely rearrange layers on the I/O stack.

You can of course add your own layers as well.

So, the following solution turned up:

if the layers are defined in this order:

:unix :encoding(utf8) :crlf

the result with Unicode output is quite satisfactory
(switch the console to 65001 beforehand):

binmode STDOUT, ":unix:encoding(utf8):crlf"

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic, so in this document the term UTF-8 is used to mean both).

Do not use this pragma for anything other than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without use utf8;.

Because it is not possible to reliably tell UTF-8 from native 8 bit encodings, you need either a Byte Order Mark at the beginning of your source code, or use utf8;, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will effectively become a no-op.

See also the effects of the -C switch and its cousin, the PERL_UNICODE environment variable, in perlrun.

Enabling the utf8 pragma has the following effect:

Bytes in the source text that are not in the ASCII character set will be treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example embedded Latin-1 in your string literals), use utf8 will be unhappy. If you want to have such bytes under use utf8, you can disable this pragma until the end of the block (or file, if at top level) with no utf8;.

The following functions are defined in the utf8:: package by the Perl core. You do not need to say use utf8 to use these and in fact you should not say that unless you really want to have UTF-8 source code.

$num_octets = utf8::upgrade($string)

(Since Perl v5.8.0) Converts in-place the internal representation of the string from an octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The logical character sequence itself is unchanged. If $string is already upgraded, then this is a no-op. Returns the number of octets necessary to represent the string as UTF-8.

Note that this function does not handle arbitrary encodings; use Encode instead.

$success = utf8::downgrade($string[, $fail_ok])

(Since Perl v5.8.0) Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The logical character sequence itself is unchanged. If $string is already stored as native 8 bit, then this is a no-op. Can be used to make sure that the UTF-8 flag is off, e.g. when you want to make sure that the substr() or length() function works with the usually faster byte algorithm.

Fails if the original UTF-8 sequence cannot be represented in the native 8 bit encoding. On failure dies or, if the value of $fail_ok is true, returns false.

Returns true on success.

Note that this function does not handle arbitrary encodings; use Encode instead.
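A minimal sketch of upgrade and downgrade on a Latin-1 string, with assumed sample values. It shows that only the internal representation changes, not the logical characters.

```perl
use strict;
use warnings;

my $string = "caf\xE9";    # 4 characters, stored as native 8-bit octets
my $copy   = $string;

# Switch the internal representation to UTF-8.
my $num_octets    = utf8::upgrade($string);
my $flag_upgraded = utf8::is_utf8($string) ? 1 : 0;
my $still_equal   = $string eq $copy ? 1 : 0;

# Switch it back to native 8-bit storage.
utf8::downgrade($string);
my $flag_downgraded = utf8::is_utf8($string) ? 1 : 0;

# A character above 0xFF cannot be downgraded; with $fail_ok true
# the call returns false instead of dying.
my $wide = "\x{20AC}";
my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;

print "octets needed as UTF-8: $num_octets\n";
print "flag after upgrade: $flag_upgraded, after downgrade: $flag_downgraded\n";
print "still equal to the original: $still_equal\n";
print "downgrade of a wide character succeeded: $ok\n";
```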

$unicode = utf8::native_to_unicode($code_point)

(Since Perl v5.8.0) This takes an unsigned integer (which represents the ordinal number of a character (or a code point) on the platform the program is being run on) and returns its Unicode equivalent value. Since ASCII platforms natively use the Unicode code points, this function returns its input on them. On EBCDIC platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII platforms, so there is no performance hit in using it there.

$native = utf8::unicode_to_native($code_point)

(Since Perl v5.8.0) This is the inverse of utf8::native_to_unicode(), converting the other direction. Again, on ASCII platforms, this returns its input, but on EBCDIC platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII platforms, so there is no performance hit in using it there.

If you still think you need utf8::is_utf8 outside of debugging, testing or dealing with filenames, you should probably read perlunitut and "What is "the UTF8 flag"?" in perlunifaq.

To force Unicode semantics in code portable to perl 5.8 and 5.10, call utf8::upgrade($string) unconditionally.

utf8::encode is like utf8::upgrade, but the UTF8 flag is cleared. See perlunicode, and the C API functions "sv_utf8_upgrade", "sv_utf8_downgrade", "sv_utf8_encode", and "sv_utf8_decode" in perlapi, which are wrapped by the Perl functions utf8::upgrade, utf8::downgrade, utf8::encode and utf8::decode. Also, the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are actually internal, and thus always available, without a require utf8 statement.
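The in-place nature of utf8::encode and utf8::decode can be sketched as follows, using an assumed one-character sample string.

```perl
use strict;
use warnings;

my $text = "\x{20AC}";    # one character: the euro sign

# In place: the scalar now holds the three UTF-8 octets, UTF8 flag off.
utf8::encode($text);
my $octet_count = length $text;
my $flag_after_encode = utf8::is_utf8($text) ? 1 : 0;

# In place: the octets are turned back into one character.
my $ok = utf8::decode($text) ? 1 : 0;
my $char_count = length $text;

print "octets after encode: $octet_count (flag: $flag_after_encode)\n";
print "characters after decode: $char_count (decode ok: $ok)\n";
```

Neither function needs use utf8; they are always available, as noted above.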

Some filesystems may not support UTF-8 file names, or they may be supported incompatibly with Perl. Therefore UTF-8 names that are visible to the filesystem, such as module names may not work.
