Processing.js codePointAt()

After taking a small break from OSD600 for a few days to do some work in a couple of other classes, I finally came back to it today. I left off where I was about to write a test for codePointAt that could be added to the other tests that processing.js runs now. Processing.js has these tests to ensure no functions or crucial parts of the code get broken when someone implements a fix or a new function. By running these tests after you have written your code, you make sure that what you added hasn't broken or altered any existing code. When I started writing my test, I made a normal string composed of characters that I knew would work and give me the output I expected. I tested this first and it worked. After this I started adding Chinese characters such as 𧺆. This is when stuff started to get weird.

 

I tried running the test again and it failed, saying that 163462 (the Unicode value of the Chinese character) was != 63. 63? How did I get 63? I looked up which character has the value 63 and it was a question mark. After talking to Pomax and yury on IRC, I found out that when the encoding of a file is set to ANSI (which it is by default in Notepad++), characters that are not in that character set show up as ?. So one of the guys on IRC suggested I switch to UTF-8 encoding, which should solve the problem (for anyone else doing this in Notepad++: change the encoding type, don't use "convert to"). After doing this I ran the test again, and this time I got unexpected character(s) at the start of my file. What the f**k, how did those get there? Back to IRC. As soon as I mentioned the unexpected characters, I think 3 or so people all said something about a BOM. What is a BOM, you ask? BOM stands for byte order mark and is used to signal the byte order of a text file. UTF-8 has no byte-order ambiguity, so the mark is optional there, and the processing.js tester (or the shell it was running in, I don't really know) evidently didn't expect one. So after changing the encoding to UTF-8 without a BOM, it finally ran. What did it output? A whole pile of NaNs (Not a Number). Results at least!
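For anyone who wants to see those two gotchas in code, here is a rough sketch in plain JavaScript. It is only an illustration, not anything from the processing.js test suite: `stripBom` is a made-up helper name, and the sample source string is invented for the example.

```javascript
// "?" is what an ANSI save leaves behind for characters it cannot represent.
var questionMark = "?".charCodeAt(0); // 63, the value the failing test reported

// The byte order mark is the single code point U+FEFF.
// Saved as UTF-8 it becomes the three bytes EF BB BF at the start of the file.
var BOM = "\uFEFF";

// Hypothetical helper: return the text without a leading BOM, if there is one.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Invented sample: a one-line source file that was saved with a BOM.
var withBom = BOM + "var x = 1;";

console.log(questionMark);           // 63
console.log(withBom.length);         // 11 (the BOM counts as one extra character)
console.log(stripBom(withBom));      // var x = 1;
console.log(stripBom("var x = 1;")); // var x = 1;
```

Saving the file as UTF-8 without a BOM has the same effect as running the file contents through something like `stripBom` before the tester sees them.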

 

After going back to IRC and listening in on a conversation about UTF-8 and why my code wasn't working (about 80% of which I didn't understand), someone suggested I try passing in the hex values of the character I was trying to find the Unicode value for. I was skeptical of this working but tried it anyway, and to my surprise, it worked. For the one character that I gave it, the test confirmed that it had the correct Unicode value. AWESOME. I then tried adding another hex value on the end as yury suggested, and it then failed. This was because the code I got from a Mozilla fix I found online had something to do with increasing the index that was passed into codePointAt depending on whether there were surrogate pairs or not. (I'm pretty sure after doing this that a surrogate pair is for when a character is too big to fit in 16 bits: it's split into two 16-bit halves, each with its own hex value, and by checking both halves together you get the proper Unicode value.) Both Pomax and yury said this was unnecessary and I should be able to do it without it. I removed this and voila, everything worked as it should. I then tested this again later in some HTML page, and had a small blunder because I accidentally uncommented some code, but we won't go into that, ahah. After I commented the code out again, it worked in an HTML page too.
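To make the surrogate pair idea concrete, here is a minimal sketch in plain JavaScript. It is not the Mozilla code and not what ended up in processing.js; `codePointAtSketch` is a hypothetical name, and I'm assuming the "hex values" were written as \u escapes, which keeps the test file pure ASCII so the editor's encoding can't mangle it.

```javascript
// Hypothetical sketch: read the Unicode code point that starts at `index`.
function codePointAtSketch(str, index) {
  var hi = str.charCodeAt(index);
  // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
  // (0xDC00-0xDFFF) together encode one code point above 0xFFFF.
  if (hi >= 0xD800 && hi <= 0xDBFF && index + 1 < str.length) {
    var lo = str.charCodeAt(index + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  // Anything else is a single 16-bit unit (or an unpaired surrogate).
  return hi;
}

// U+27E86 written as its two surrogate halves instead of the raw character.
var sample = "\uD85F\uDE86";

console.log(sample.length);                 // 2 (two 16-bit units)
console.log(codePointAtSketch(sample, 0));  // 163462
console.log(codePointAtSketch("abc", 1));   // 98
```

Note that the index is still counted in 16-bit units here, so asking for index 1 of `sample` returns the low half on its own rather than a second character.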

 

To be honest, after I did all of this and listened to everyone's conversations in IRC, I learned quite a bit about Unicode and how it actually works. Last week when I got the code from that Mozilla webpage, I understood nearly none of it except some simple, obvious stuff. After hacking the code a bit and listening to everyone, I slowly understood more and more about how it worked, what was happening, and what was causing problems. What a satisfying night.

 

Man I love this course.