Syntax Edit - Manual: How to add UTF-8 support


Printed From: Codejock Forums
Category: Codejock Products
Forum Name: Syntax Edit
Forum Description: Topics Related to Codejock Syntax Edit
URL: http://forum.codejock.com/forum_posts.asp?TID=16970
Printed Date: 28 April 2024 at 10:17am

Topic: Manual: How to add UTF-8 support

Posted By: elmue
Subject: Manual: How to add UTF-8 support
Date Posted: 18 July 2010 at 12:28am

Hello

I was very surprised that the Syntax Edit control is not able to show UTF-8 files!
It supports ANSI and Unicode (UTF-16) but not UTF-8, although UTF-8 is more widely used than UTF-16!

It is not very difficult to add UTF-8 support if your project is compiled as Unicode.
If you still use the MBCS compiler switch, some more work will be necessary due to some design flaws in the code.

___________________________________

Add this line to XTPSyntaxEditBufferManager.h

    BOOL IsUnicodeFile(CFile *pFile);
___________________________________

In XTPSyntaxEditBufferManager.cpp:

convert the function
BOOL IsUnicodeFile(CFile *pFile)
into a member function and add the second check, which looks for the UTF-8 identifier and sets m_nCodePage:

BOOL CXTPSyntaxEditBufferManager::IsUnicodeFile(CFile *pFile)
{
    pFile->SeekToBegin();

    WORD wPrefix;
    UINT uReaded = pFile->Read(&wPrefix, 2);
    if (uReaded == 2 && wPrefix == 0xFEFF)
    {
        return TRUE;
    }

    // Check if UTF-8 file identifier (byte order mark EF BB BF) exists
    pFile->SeekToBegin();

    BYTE u8_Buf[3];
    UINT u32_Read = pFile->Read(u8_Buf, 3);
    if (u32_Read == 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
    {
        m_nCodePage = CP_UTF8;
        return FALSE;
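    }

    // No identifier found (assumed ending, not shown in the post):
    // rewind so that the file is read from its first byte
    pFile->SeekToBegin();
    return FALSE;
}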
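
For anyone who wants to test the detection logic outside of MFC, here is a minimal sketch of the same two checks on a plain byte buffer. The names Encoding and DetectBom are made up for this example and are not part of the Codejock sources; it only assumes that the first bytes of the file are already in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names, not part of the Codejock API.
enum class Encoding { Ansi, Utf16LE, Utf8 };

// Returns the encoding implied by the leading bytes and stores the length
// of the identifier (0, 2 or 3 bytes) in bomLength so the caller can skip it.
Encoding DetectBom(const std::vector<std::uint8_t>& data, std::size_t& bomLength)
{
    // UTF-8 identifier: EF BB BF
    if (data.size() >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        bomLength = 3;
        return Encoding::Utf8;
    }
    // UTF-16 little endian identifier: FF FE (read as WORD 0xFEFF on x86)
    if (data.size() >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    {
        bomLength = 2;
        return Encoding::Utf16LE;
    }
    // No identifier: treat the data as ANSI and skip nothing
    bomLength = 0;
    return Encoding::Ansi;
}
```

A file without any identifier may still contain UTF-8 text; this sketch, like the patch above, only recognizes files that start with the identifier.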

_________________________________________________

Then in

void CXTPSyntaxEditBufferManager::SerializeEx() add the block that writes the UTF-8 identifier (the check for m_nCodePage == CP_UTF8):

    else if (ar.IsStoring())
    {
        if (bUnicode == -1)
            bUnicode = m_bUnicodeFileFormat;

        if (bUnicode && bWriteUnicodeFilePrefix)
        {
            ar << (BYTE)0xFF;
            ar << (BYTE)0xFE;
        }

        if (!bUnicode && m_nCodePage == CP_UTF8)
        {
            ar << (BYTE)0xEF;
            ar << (BYTE)0xBB;
            ar << (BYTE)0xBF;
        }

        CByteArray arBuffer;

        int nCRLFStyle = GetCurCRLFType();
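
The rule behind the storing branch can be written down in a few lines: each file format gets its own prefix, and ANSI gets none. This is only a sketch with invented names (SaveEncoding, BomFor); the real code writes the bytes to the CArchive as shown above.

```cpp
#include <string>

// Hypothetical names, not part of the Codejock API.
enum class SaveEncoding { Ansi, Utf16LE, Utf8 };

// Returns the bytes that have to be written in front of the text.
std::string BomFor(SaveEncoding enc)
{
    switch (enc)
    {
    case SaveEncoding::Utf16LE: return std::string("\xFF\xFE", 2);
    case SaveEncoding::Utf8:    return std::string("\xEF\xBB\xBF", 3);
    default:                    return std::string(); // ANSI has no prefix
    }
}
```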

____________________________________________

And last but not least, in XTPSyntaxEditCtrl.cpp remove the ASSERT that follows the call to WideCharToMultiByte:

    int nBytes = ::WideCharToMultiByte(uCodePage, 0, (LPWSTR)lpSource, -1, lpMBCSSource, nLen, NULL, NULL);

    // ASSERT(nBytes <= (int)dwBytes);   // removed: nonsense

    lpMBCSSource[nBytes] = _T('\0');

This ASSERT is complete nonsense.
A conversion from Unicode (UTF-16) to a multibyte code page can make the string longer in bytes.
There is no reason to check the length of the string and assert that it is not longer than the Unicode string.
In UTF-8 one character may be represented by up to 4 bytes; the Euro sign, for example, takes 2 bytes in UTF-16 but 3 bytes in UTF-8.
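
To make the size argument concrete, here is a small sketch that counts the bytes one code point needs in each encoding. The function names are invented for this example. In real Windows code the usual way to get the required size is to call WideCharToMultiByte once with an output size of 0, which returns the number of bytes needed, and then allocate the buffer.

```cpp
#include <cstddef>

// Bytes needed to encode one Unicode code point in UTF-8.
std::size_t Utf8Bytes(char32_t cp)
{
    if (cp < 0x80)    return 1; // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3; // rest of the Basic Multilingual Plane
    return 4;                   // supplementary planes
}

// Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair.
std::size_t Utf16Bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}
```

Every code point from U+0800 to U+FFFF needs 3 bytes in UTF-8 but only 2 in UTF-16, so a UTF-8 buffer of the same byte size as the UTF-16 source is too small for such text.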

_______________________________________________

The buffer manager has several design flaws.
Instead of working with a lot of fixed size buffers and calling WideCharToMultiByte several times
it would have been nicer to write a class that encapsulates this stuff and automatically converts from

Unicode --> ANSI
ANSI --> Unicode
UTF8 --> Unicode
Unicode --> UTF8
ANSI --> UTF8
etc.
and this class should take care to allocate the buffer.

This would result in much cleaner code and fewer coding flaws.
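
As a rough idea of what one direction of such a helper could look like, here is a portable sketch of Unicode --> UTF8 that returns a std::string, so the caller never has to supply a fixed-size buffer. The function name Utf16ToUtf8 is invented for this example and nothing here comes from the Codejock sources; in an MFC project the same interface could simply wrap WideCharToMultiByte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the Codejock sources.
// Converts UTF-16 to UTF-8. The result grows as needed.
// Unpaired surrogates are replaced by U+FFFD.
std::string Utf16ToUtf8(const std::u16string& src)
{
    std::string out;
    out.reserve(src.size() * 3); // worst case: 3 bytes per UTF-16 unit

    for (std::size_t i = 0; i < src.size(); ++i)
    {
        char32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) // high surrogate
        {
            if (i + 1 < src.size() && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
            {
                // Combine the surrogate pair into one code point
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF) // stray low surrogate
        {
            cp = 0xFFFD;
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The other directions would follow the same pattern: take a string, return a string, and keep all buffer handling inside the helper.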

The syntax highlighting also stops working if the user enters lines that are longer than 128 characters.
Why do the Codejock programmers not use a dynamic buffer that is allocated once the file gets parsed?

I hope to see UTF8 support in the next version.

Elmü