Manual: How to add UTF-8 support
elmue
Groupie | Joined: 05 June 2010 | Location: Germany
Posted: 18 July 2010 at 12:28am
Hello
I was very surprised that the Syntax Edit control is not able to show UTF-8 files! It supports ANSI and Unicode but not UTF-8, although UTF-8 is more widely used than Unicode. It is not very difficult to add UTF-8 support if your project is compiled for Unicode. If you still use the MBCS compiler switch, some more work will be necessary due to some design flaws in the code.

___________________________________

Add this line to XTPSyntaxEditBufferManager.h:

BOOL IsUnicodeFile(CFile *pFile);

___________________________________

In XTPSyntaxEditBufferManager.cpp, convert the function BOOL IsUnicodeFile(CFile *pFile) into a member function and add the bold lines:

BOOL CXTPSyntaxEditBufferManager::IsUnicodeFile(CFile *pFile)
{
    pFile->SeekToBegin();

    WORD wPrefix;
    UINT uReaded = pFile->Read(&wPrefix, 2);
    if (uReaded == 2 && wPrefix == 0xFEFF)
    {
        return TRUE;
    }

    // Check if the UTF-8 file identifier (BOM: EF BB BF) exists
    pFile->SeekToBegin();

    BYTE u8_Buf[3];
    UINT u32_Read = pFile->Read(u8_Buf, 3);
    if (u32_Read == 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
    {
        m_nCodePage = CP_UTF8;
        return FALSE;
    }

_________________________________________________

Then in void CXTPSyntaxEditBufferManager::SerializeEx() add the bold lines:

else if (ar.IsStoring())
{
    if (bUnicode == -1)
        bUnicode = m_bUnicodeFileFormat;

    if (bUnicode && bWriteUnicodeFilePrefix)
    {
        ar << (BYTE)0xFF;
        ar << (BYTE)0xFE;
    }

    if (!bUnicode && m_nCodePage == CP_UTF8)
    {
        ar << (BYTE)0xEF;
        ar << (BYTE)0xBB;
        ar << (BYTE)0xBF;
    }

    CByteArray arBuffer;
    int nCRLFStyle = GetCurCRLFType();

____________________________________________

And last but not least, in XTPSyntaxEditCtrl.cpp:

int nBytes = ::WideCharToMultiByte(uCodePage, 0, (LPWSTR)lpSource, -1, lpMBCSSource, nLen, NULL, NULL);
// lpMBCSSource[nBytes] = _T('\0');

The ASSERT in this code is complete nonsense. It is normal that a conversion from Unicode to another code page makes the string longer. There is no reason to check the length of the converted string and assert that it is not longer than the Unicode string: in UTF-8 one character may be represented by up to 4 bytes.

_______________________________________________

The buffer manager has several design flaws. Instead of working with a lot of fixed-size buffers and calling WideCharToMultiByte several times, it would have been nicer to write a class that encapsulates this stuff and automatically converts

Unicode --> ANSI
ANSI --> Unicode
UTF-8 --> Unicode
Unicode --> UTF-8
ANSI --> UTF-8
etc.

and this class should take care of allocating the buffer itself (a minimal sketch of such a class follows below). This would result in much cleaner code and fewer coding flaws.

The syntax highlighting also stops working if the user enters lines that are longer than 128 characters. Why do the Codejock programmers not use a dynamic buffer that is allocated once the file gets parsed?

I hope to see UTF-8 support in the next version.

Elmü
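For illustration, here is a minimal sketch of the kind of conversion helper described above. This is my own sketch, not Codejock code: the class and function names are made up, and it uses std::string/std::wstring instead of the library's buffers. The point is that the destination buffer is allocated dynamically by calling the Win32 conversion functions twice (first to query the required size), so no fixed-size buffers are needed.

#include <windows.h>
#include <string>

// Hypothetical helper class (not part of the Codejock sources).
// Converts between UTF-16 ("Unicode"), ANSI and UTF-8 and allocates
// the destination buffer dynamically.
class CTextConverter
{
public:
    // UTF-16 --> multi-byte (ANSI or UTF-8, depending on nCodePage)
    static std::string WideToMulti(const std::wstring& sWide, UINT nCodePage)
    {
        if (sWide.empty())
            return std::string();

        // First call: ask how many bytes the converted string needs.
        int nBytes = ::WideCharToMultiByte(nCodePage, 0, sWide.c_str(), (int)sWide.size(),
                                           NULL, 0, NULL, NULL);
        std::string sMulti(nBytes, '\0');

        // Second call: convert into the dynamically sized buffer.
        ::WideCharToMultiByte(nCodePage, 0, sWide.c_str(), (int)sWide.size(),
                              &sMulti[0], nBytes, NULL, NULL);
        return sMulti;
    }

    // multi-byte (ANSI or UTF-8) --> UTF-16
    static std::wstring MultiToWide(const std::string& sMulti, UINT nCodePage)
    {
        if (sMulti.empty())
            return std::wstring();

        int nChars = ::MultiByteToWideChar(nCodePage, 0, sMulti.c_str(), (int)sMulti.size(),
                                           NULL, 0);
        std::wstring sWide(nChars, L'\0');

        ::MultiByteToWideChar(nCodePage, 0, sMulti.c_str(), (int)sMulti.size(),
                              &sWide[0], nChars);
        return sWide;
    }

    // ANSI --> UTF-8 (and the reverse) simply go through UTF-16.
    static std::string AnsiToUtf8(const std::string& sAnsi)
    {
        return WideToMulti(MultiToWide(sAnsi, CP_ACP), CP_UTF8);
    }
};

Example: CTextConverter::WideToMulti(L"Elmü", CP_UTF8) returns a 5-byte string for 4 input characters, because the 'ü' needs two bytes in UTF-8 -- exactly the case where an "output must not be longer than input" assert would fire.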