সম্পাদনা: আমি Regex একটি টাইপো-বাগ সমাধান করে ফেলেছেন .. এটা একটা '\ x80` না প্রয়োজন \ 80 ।
ইউটিএফ -8-র কঠোর আনুগত্যের জন্য, অবৈধ ইউটিএফ -8 ফর্মগুলি ফিল্টার করার জন্য রেজেক্সটি নিম্নরূপ
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print'
আউটপুট (মূল লাইনগুলির। টেস্ট 1 থেকে ):
Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)
প্র: অবৈধ ইউনিকোড ফিল্টার করে এমন একটি রেইজেক্স পরীক্ষা করার জন্য কেউ কীভাবে ডেটা তৈরি করে?
উ: আপনার নিজের ইউটিএফ -8 পরীক্ষা অ্যালগরিদম তৈরি করুন, এবং এর নিয়মগুলি ভঙ্গ করুন ...
ক্যাচ -২২ .. তবে তারপরে কীভাবে আপনি আপনার পরীক্ষার অ্যালগরিদম পরীক্ষা করেন?
Regex, উপরোক্ত, পরীক্ষা করা হয়েছে (ব্যবহার iconv
থেকে পূর্ণসংখ্যা মান জন্য রেফারেন্স হিসেবে) 0x00000
থেকে 0x10FFFF
হচ্ছে .. এই ঊর্ধ্ব মান একটি ইউনিকোড কোডপয়েন্ট সর্বোচ্চ পূর্ণসংখ্যা মান
এই উইকিপিডিয়া ইউটিএফ -8 পৃষ্ঠা অনুসারে,।
- ইউটিএফ -8 ইউনিকোড অক্ষর সেটটিতে 1,112,064 কোড পয়েন্টগুলির প্রতিটি এনকোড করে এক থেকে চারটি 8-বিট বাইট ব্যবহার করে
এই numeber (1,112,064) একটি পরিসরের equates 0x000000
করার 0x10F7FF
, যা সর্বোচ্চ ইউনিকোড কোডপয়েন্ট জন্য প্রকৃত সর্বাধিক পূর্ণসংখ্যা মান এর 0x0800 লাজুক:0x10FFFF
এই পূর্ণসংখ্যার ব্লক হল UTF-16 এনকোডিং জন্য প্রয়োজন কারণ ইউনিকোড Codepoints বর্ণালী থেকে অনুপস্থিত, তার মূল নকশা অভিপ্রায় পরলোক পইঠা একটি সিস্টেম নামক মাধ্যমে ভাড়াটে জোড়া । একটি ব্লক 0x0800
পূর্ণসংখ্যার .. হল UTF-16 দ্বারা ব্যবহার করা হবে সংরক্ষিত করা হয়েছে এই ব্লক পরিসীমা ঘটনাকাল 0x00D800
থেকে 0x00DFFF
। এই ইন্টারেটারগুলির কোনওটিই আইনী ইউনিকোড মান নয় এবং তাই অবৈধ ইউটিএফ -8 মান।
ইন পরীক্ষা 1 , regex
ইউনিকোড Codepoints সীমার মধ্যে যে সংখ্যা বিরুদ্ধে পরীক্ষা করা হয়েছে, এবং তা exectly ফলাফল মিলে যায় iconv
.. অর্থাত। 0x010F7FF বৈধ মান এবং 0x000800 অবৈধ মান।
যাইহোক, সমস্যাটি এখন উঠে আসে, * রেইজেক্স কীভাবে বহিরাগত UTF-8 মান পরিচালনা করে; উপরে 0x010FFFF
(ইউটিএফ -8 সর্বাধিক 0x7FFFFFFF এর পূর্ণসংখ্যার মান সহ 6 বাইট পর্যন্ত প্রসারিত হতে পারে ?
প্রয়োজনীয় * নন-ইউনিকোড ইউটিএফ -8 বাইট মান তৈরি করতে , আমি নিম্নলিখিত কমান্ডটি ব্যবহার করেছি:
perl -C -e 'print chr 0x'$hexUTF32BE
তাদের বৈধতা পরীক্ষা করার জন্য (কিছু ফ্যাশনে), আমি Gilles'
ইউটিএফ -8 রেজেক্স ব্যবহার করেছি ...
perl -l -ne '/
^( [\000-\177] # 1-byte pattern
|[\300-\337][\200-\277] # 2-byte pattern
|[\340-\357][\200-\277]{2} # 3-byte pattern
|[\360-\367][\200-\277]{3} # 4-byte pattern
|[\370-\373][\200-\277]{4} # 5-byte pattern
|[\374-\375][\200-\277]{5} # 6-byte pattern
)*$ /x or print'
'পারেলের মুদ্রণ ক্রিয়াকলাপ' এর আউটপুট গিলসের 'রেজেক্স' এর ফিল্টারিংয়ের সাথে মেলে .. একটিতে অন্যটির বৈধতা জোরদার হয় .. আমি ব্যবহার করতে পারি না iconv
কারণ এটি কেবল বিস্তৃত (মূল) ইউটিএফ -8 এর বৈধ-ইউনিকোড স্ট্যান্ডার্ড উপসেটটি পরিচালনা করে I মান ...
জড়িত নুনবারগুলি বরং বড়, সুতরাং আমি 11111, 13579, 33333, 53441 এর মতো বর্ধনের দ্বারা শীর্ষস্থানীয়-রেঞ্জ, নীচে-পরিসীমা এবং কয়েকটি স্ক্যান পরীক্ষা করেছি ... ফলাফলগুলি সমস্ত মিলছে, তাই এখন যা কিছু অবশিষ্ট রয়েছে তা হ'ল এই সীমার বাইরে থাকা ইউটিএফ -8-স্টাইলের মানগুলির (ইউনিকোডের জন্য অবৈধ, এবং তাই নিজেই কঠোর ইউটিএফ -8 এর জন্যও অবৈধ) পরীক্ষা করা ..
পরীক্ষার মডিউলগুলি এখানে:
[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }
# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0
: regex; Rx=1 # run-test=1 do-not-test=0
((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # iconv -f UTF-32BE only)
# modebits=$strict
# UTF-8, in relation to UTF-16 has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictione
# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset ofBig-Endian integers
# ( based on 'iconv' -f UTF-32BE -f UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range
Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
#########################################################################
# 1 byte UTF-8 values: Nothing to do; no complexities.
#########################################################################
# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time \
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time \
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi
#########################################################################
# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time \
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s \
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s \
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time \
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo
fi
#########################################################################
# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s \
# user 56m44.317s |
# sys 27m29.867s /
time \
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi
########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to ge the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the intire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################
################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################
# BYTE 1 BYTE 2 BYTE 3 BYTE 4
# 1: [\x00-\x7F]
# ===========
# ([\x00-\x7F])
#
# 2: [\xC2-\xDF] [\x80-\xBF]
# =================================
# ([\xC2-\xDF][\x80-\xBF])
#
# 3: [\xE0] [\xA0-\xBF] [\x80-\xBF]
# [\xED] [\x80-\x9F] [\x80-\xBF]
# [\xE1-\xEC\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]
# ==============================================
# ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4 [\xF0] [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF4] [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]
# ===========================================================
# ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1: (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))
#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 2. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # incement by decimal integer
decMax=$((0x7FFFFFFF)) # incement by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 6-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec) # hexUTF32BE
# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [\000-\177]
| [\300-\337][\200-\277]
| [\340-\357][\200-\277]{2}
| [\360-\367][\200-\277]{3}
| [\370-\373][\200-\277]{4}
| [\374-\375][\200-\277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] \
&& rcIco32=0 || rcIco32=1
rcIco16=
elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)" \
&& rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'srict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitary bytes ASIS
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
sort -R |
tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test=2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes $${rcIco16}${rcIco32}=,$rcRegex\t: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit