DataHighBitAnalysis

org.afcs.warts.db
Class DataHighBitAnalysis

java.lang.Object
  org.afcs.warts.db.DataHighBitAnalysis

All Implemented Interfaces:: java.lang.Comparable

public final class DataHighBitAnalysis
extends java.lang.Object
implements java.lang.Comparable

The DataHighBitAnalysis class performs "high-bit" analysis on an array of bytes, classifying each byte as ASCII, Latin-1, UTF-8 or ambiguous, and classifying the string as a whole.

One of the tricky parts of character classification is telling the difference between bytes that make up a single multibyte UTF-8 character and bytes that make up several Latin-1 characters. Such bytes are currently classified as ambiguous, with the exception of certain Eastern European characters (lower-case vowels with umlauts, etc.), which are much more likely to be a single 2 byte UTF-8 character than 2 Latin-1 characters (which would typically look something like 'ü').
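The ambiguity can be reproduced with the standard JDK alone. The following is a minimal sketch, not the class's own implementation; `AmbiguousBytesDemo` and its methods are hypothetical names introduced for illustration. It decodes the same two bytes once as UTF-8 and once as Latin-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of org.afcs.warts.db.
class AmbiguousBytesDemo {

    // Decodes the bytes as UTF-8, where a multibyte sequence forms one character.
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as Latin-1 (ISO-8859-1), where every byte is one character.
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0xBC is the UTF-8 encoding of U+00FC (lower-case u with umlaut).
        byte[] data = { (byte) 0xC3, (byte) 0xBC };
        System.out.println(asUtf8(data).length());   // number of characters as UTF-8
        System.out.println(asLatin1(data).length()); // number of characters as Latin-1
    }
}
```

Both decodings succeed, which is why the bytes alone cannot settle the question and some notion of likelihood is needed.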

A byte may be classed as illegal when it appears that a combination of bytes could not possibly be transformed into a valid sequence of characters. This most often occurs when a byte in the range 0x80 - 0x9F would have to lead off a character. This range is not in the valid Latin-1 range, so such a byte can be characterised as illegal.
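The range test described above might be sketched as follows. This is an illustration under stated assumptions, not the analysis code itself: `HighBitRangeDemo` and its method names are hypothetical. In ISO-8859-1 the values 0x80 - 0x9F hold only C1 control codes, and in UTF-8 any byte of the form 10xxxxxx can only continue a character, never start one.

```java
// Hypothetical helper class; the names are illustrative and not taken from the library.
class HighBitRangeDemo {

    // True when the high bit is set, i.e. the unsigned value is 128-255.
    static boolean hasHighBit(byte b) {
        return (b & 0x80) != 0;
    }

    // True for unsigned values 0x80-0x9F, the range the text treats as invalid Latin-1.
    static boolean inC1Range(byte b) {
        int v = b & 0xFF; // Java bytes are signed, so mask to get 0-255
        return v >= 0x80 && v <= 0x9F;
    }

    // True for bytes of the form 10xxxxxx, which cannot begin a UTF-8 character.
    static boolean isUtf8Continuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        byte suspect = (byte) 0x85;
        // A byte that is in the C1 range and is also a continuation byte
        // cannot lead off a character under either encoding.
        System.out.println(inC1Range(suspect) && isUtf8Continuation(suspect));
    }
}
```

The masking with `0xFF` matters because a Java `byte` is signed, so a raw comparison such as `b >= 0x80` would never be true.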

LICENSE: This code is released to the public domain and may be used for any purpose whatsoever without permission or acknowledgment.

Version:: Last Modified 19 September 2003
Author:: Warren Hedley ( whedley at sdsc dot edu )

Field Summary
`static byte`	`BYTE_CLASS_ASCII` The byte was classified as an ASCII byte (0-127).
`static byte`	`BYTE_CLASS_FIRST_AMBIGUOUS` The byte was classified as the first byte of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).
`static byte`	`BYTE_CLASS_ILLEGAL` The byte was classified as illegal, which can happen when a combination of bytes cannot be transformed into a collection of valid characters.
`static byte`	`BYTE_CLASS_LATIN_1` The byte was classified as a Latin-1 character (128-255).
`static byte`	`BYTE_CLASS_NOT_FIRST_AMBIGUOUS` The byte was classified as one byte (but not the first) of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).
`static byte`	`BYTE_CLASS_THREE_BYTE_UTF_8` The byte was classified as part of a 3 byte UTF-8 character.
`static byte`	`BYTE_CLASS_TWO_BYTE_UTF_8` The byte was classified as part of a 2 byte UTF-8 character.
`static int`	`DATA_CLASS_ASCII` The string consists of nothing but ASCII characters.
`static int`	`DATA_CLASS_ASCII_AND_2_BYTE_UTF_8` The string consists of a mix of ASCII and 2 byte UTF-8 characters.
`static int`	`DATA_CLASS_ASCII_AND_3_BYTE_UTF_8` The string consists of a mix of ASCII and 3 byte UTF-8 characters.
`static int`	`DATA_CLASS_ASCII_AND_LATIN_1` The string consists of a mix of ASCII and Latin-1 characters.
`static int`	`DATA_CLASS_CONTAINS_AMBIGUOUS` The string consists of a mix of ASCII and "ambiguous" characters.
`static int`	`DATA_CLASS_CONTAINS_ILLEGAL` The string contains illegal bytes.
`static int`	`DATA_CLASS_CONTAINS_MULTIPLE` The string consists of a mix of ASCII and multiple classes of non-ASCII characters.

Constructor Summary
`DataHighBitAnalysis(byte[] data, int numBytesAllowed)` Constructs a new instance with the specified data, and the size of the column that the data is in.

Method Summary
`int`	`compareTo(java.lang.Object otherObj)` Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by `getString()`.
`boolean`	`equals(java.lang.Object otherObj)` Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by `getString()`) as the current instance.
`byte[]`	`getClassifications()` Returns the array of classifications for each byte.
`byte[]`	`getData()` Returns the original byte array that was analysed.
`int`	`getDataClass()` Returns a classification for the byte array as a whole.
`int`	`getNum2ByteUtf8Chars()` Returns the number of 2 byte UTF-8 characters found during analysis.
`int`	`getNum3ByteUtf8Chars()` Returns the number of 3 byte UTF-8 characters found during analysis.
`int`	`getNumAmbiguousBytes()` Returns the number of ambiguous bytes found during analysis.
`int`	`getNumIllegalBytes()` Returns the number of illegal bytes found during analysis.
`int`	`getNumLatin1Chars()` Returns the number of Latin-1 characters found during analysis.
`java.lang.String`	`getString()` Returns a string representation of the byte data using the preferred encoding.
`java.lang.String`	`getStringAsLatin1()` Returns a string representation of the byte data where the encoding is assumed to be Latin-1.
`java.lang.String`	`getStringAsUtf16()` Returns a string representation of the byte data where the encoding is assumed to be UTF-16 (referred to here as UCS-2, strictly its fixed-width predecessor).
`java.lang.String`	`getStringAsUtf8()` Returns a string representation of the byte data where the encoding is assumed to be UTF-8.
`int`	`hashCode()` Returns a hashcode for the current instance based on the current string value (as returned by `getString()`).
`java.lang.String`	`toString()` Returns a text description of the current instance that can be used for debugging purposes.
`boolean`	`utf8Oversize()` Returns true if the UTF-8 representation of this string would overflow the column it is in (presumably the string is encoded using Latin-1 now).

Methods inherited from class java.lang.Object

clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Detail

BYTE_CLASS_ASCII

public static final byte BYTE_CLASS_ASCII

The byte was classified as an ASCII byte (0-127).

See Also:: Constant Field Values

BYTE_CLASS_LATIN_1

public static final byte BYTE_CLASS_LATIN_1

The byte was classified as a Latin-1 character (128-255).

See Also:: Constant Field Values

BYTE_CLASS_TWO_BYTE_UTF_8

public static final byte BYTE_CLASS_TWO_BYTE_UTF_8

The byte was classified as part of a 2 byte UTF-8 character.

See Also:: Constant Field Values

BYTE_CLASS_THREE_BYTE_UTF_8

public static final byte BYTE_CLASS_THREE_BYTE_UTF_8

The byte was classified as part of a 3 byte UTF-8 character.

See Also:: Constant Field Values

BYTE_CLASS_FIRST_AMBIGUOUS

public static final byte BYTE_CLASS_FIRST_AMBIGUOUS

The byte was classified as the first byte of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).

See Also:: Constant Field Values

BYTE_CLASS_NOT_FIRST_AMBIGUOUS

public static final byte BYTE_CLASS_NOT_FIRST_AMBIGUOUS

The byte was classified as one byte (but not the first) of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).

See Also:: Constant Field Values

BYTE_CLASS_ILLEGAL

public static final byte BYTE_CLASS_ILLEGAL

The byte was classified as illegal, which can happen when a combination of bytes cannot be transformed into a collection of valid characters.

See Also:: Constant Field Values

DATA_CLASS_ASCII

public static final int DATA_CLASS_ASCII

The string consists of nothing but ASCII characters.

See Also:: Constant Field Values

DATA_CLASS_ASCII_AND_LATIN_1

public static final int DATA_CLASS_ASCII_AND_LATIN_1

The string consists of a mix of ASCII and Latin-1 characters.

See Also:: Constant Field Values

DATA_CLASS_ASCII_AND_2_BYTE_UTF_8

public static final int DATA_CLASS_ASCII_AND_2_BYTE_UTF_8

The string consists of a mix of ASCII and 2 byte UTF-8 characters.

See Also:: Constant Field Values

DATA_CLASS_ASCII_AND_3_BYTE_UTF_8

public static final int DATA_CLASS_ASCII_AND_3_BYTE_UTF_8

The string consists of a mix of ASCII and 3 byte UTF-8 characters.

See Also:: Constant Field Values

DATA_CLASS_CONTAINS_AMBIGUOUS

public static final int DATA_CLASS_CONTAINS_AMBIGUOUS

The string consists of a mix of ASCII and "ambiguous" characters.

See Also:: Constant Field Values

DATA_CLASS_CONTAINS_MULTIPLE

public static final int DATA_CLASS_CONTAINS_MULTIPLE

The string consists of a mix of ASCII and multiple classes of non-ASCII characters. This usually reflects a real problem in the data.

See Also:: Constant Field Values

DATA_CLASS_CONTAINS_ILLEGAL

public static final int DATA_CLASS_CONTAINS_ILLEGAL

The string contains illegal bytes. This overrides any other data classes.

See Also:: Constant Field Values

Constructor Detail

DataHighBitAnalysis

public DataHighBitAnalysis(byte[] data,
                           int numBytesAllowed)

Constructs a new instance with the specified data, and the size of the column that the data is in.
Parameters:: data - The bytewise representation of the string to analyse.; numBytesAllowed - The number of bytes allowed in the column from which the data was taken. This determines what utf8Oversize() will return.
Throws:: java.lang.NullPointerException - If data is null.

Method Detail

getData

public byte[] getData()

Returns the original byte array that was analysed. The array returned is the same as the one used internally, so should not be modified by the caller if the reference to the analysis object is shared.

Returns:: The original byte array that was analysed.
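Because the returned array is the internal one, a caller that needs to modify the bytes can work on a copy instead. The sketch below uses a hypothetical stand-in class, `SharedArrayHolder`, that mimics the behaviour described for getData(); it is not the real class, and `safeCopy` is an illustrative helper.

```java
// Hypothetical stand-in that, like the getData() description, exposes its internal array.
class SharedArrayHolder {
    private final byte[] data;

    SharedArrayHolder(byte[] data) {
        this.data = data;
    }

    // Returns the internal array itself, not a copy.
    byte[] getData() {
        return data;
    }

    // Returns an independent copy that the caller may modify freely.
    static byte[] safeCopy(SharedArrayHolder holder) {
        return holder.getData().clone();
    }

    public static void main(String[] args) {
        SharedArrayHolder holder = new SharedArrayHolder(new byte[] { 1, 2, 3 });
        byte[] copy = safeCopy(holder);
        copy[0] = 9; // changes only the copy
        System.out.println(holder.getData()[0]);
    }
}
```

The same approach applies to any accessor that hands out its internal array.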

getClassifications

public byte[] getClassifications()

Returns the array of classifications for each byte. The array returned will be the same length as the data array specified at initialisation (and returned by getData()), and each byte in the array will be one of the BYTE_CLASS_* constants defined in this class. The array returned is the same as the one used internally, so should not be modified by the caller if the reference to the analysis object is shared.

Returns:: The array of classifications for each byte.

getNumLatin1Chars

public int getNumLatin1Chars()

Returns the number of Latin-1 characters found during analysis.

Returns:: The number of Latin-1 characters found during analysis.

getNum2ByteUtf8Chars

public int getNum2ByteUtf8Chars()

Returns the number of 2 byte UTF-8 characters found during analysis.

Returns:: The number of 2 byte UTF-8 characters found during analysis.

getNum3ByteUtf8Chars

public int getNum3ByteUtf8Chars()

Returns the number of 3 byte UTF-8 characters found during analysis.

Returns:: The number of 3 byte UTF-8 characters found during analysis.

getNumAmbiguousBytes

public int getNumAmbiguousBytes()

Returns the number of ambiguous bytes found during analysis. An ambiguous byte may be part of a multibyte UTF-8 character or may be one of several Latin-1 characters.

Returns:: The number of ambiguous bytes found during analysis.

getNumIllegalBytes

public int getNumIllegalBytes()

Returns the number of illegal bytes found during analysis. An illegal byte is flagged when a set of bytes couldn't possibly be transformed into a set of valid characters.

Returns:: The number of illegal bytes found during analysis.

getString

public java.lang.String getString()

Returns a string representation of the byte data using the preferred encoding. If the string contains any Latin-1 characters, this will be Latin-1, otherwise it will be UTF-8.

Returns:: A string representation of the byte data using the preferred encoding for the data.

getStringAsLatin1

public java.lang.String getStringAsLatin1()

Returns a string representation of the byte data where the encoding is assumed to be Latin-1. Bytes making up any UTF-8 characters will come out looking rather strange.

Returns:: A string representation of the byte data where the encoding is assumed to be Latin-1.

getStringAsUtf8

public java.lang.String getStringAsUtf8()

Returns a string representation of the byte data where the encoding is assumed to be UTF-8.

Returns:: A string representation of the byte data where the encoding is assumed to be UTF-8.

getStringAsUtf16

public java.lang.String getStringAsUtf16()

Returns a string representation of the byte data where the encoding is assumed to be UTF-16 (referred to here as UCS-2, strictly its fixed-width predecessor).

Returns:: A string representation of the byte data where the encoding is assumed to be UTF-16.

getDataClass

public int getDataClass()

Returns a classification for the byte array as a whole. The code returned will be one of the DATA_CLASS_* constants specified in this class. Note that the presence of illegal characters will always cause the data class to be DATA_CLASS_CONTAINS_ILLEGAL regardless of what other characters are in the string.

Returns:: A classification for the byte array as a whole.

utf8Oversize

public boolean utf8Oversize()

Returns true if the UTF-8 representation of this string would overflow the column it is in (presumably the string is encoded using Latin-1 now). The column size is set at initialisation.

Returns:: True if the UTF-8 representation of this string would overflow the column it is in.
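The idea behind this check can be sketched with the standard JDK. The code below is an assumption about how such a test could be written, not the method's actual implementation; `Utf8OversizeSketch`, `utf8Length` and `wouldOverflow` are hypothetical names, and the input is assumed to be Latin-1 encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; it does not reproduce the library's own logic.
class Utf8OversizeSketch {

    // Number of bytes the Latin-1 encoded data would occupy once re-encoded as UTF-8.
    static int utf8Length(byte[] latin1Data) {
        String s = new String(latin1Data, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 form would not fit in a column of numBytesAllowed bytes.
    static boolean wouldOverflow(byte[] latin1Data, int numBytesAllowed) {
        return utf8Length(latin1Data) > numBytesAllowed;
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9 (e with acute accent in Latin-1): 4 bytes as Latin-1.
        byte[] data = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        System.out.println(utf8Length(data));
        System.out.println(wouldOverflow(data, 4));
    }
}
```

Each Latin-1 byte in the range 128-255 becomes two bytes in UTF-8, so a value that exactly fills its column as Latin-1 can overflow after conversion.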

compareTo

public int compareTo(java.lang.Object otherObj)

Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by getString().

Specified by:: compareTo in interface java.lang.Comparable

Parameters:: otherObj - The object to compare this instance to.
Returns:: An integer that can be used to sort an array of DataHighBitAnalysis instances.
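The ordering this produces can be illustrated on plain strings. The sketch below is not the compareTo implementation; it assumes only that the comparison behaves like the JDK's `String.CASE_INSENSITIVE_ORDER`, and `CaseInsensitiveSortSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical sketch of a case-insensitive ordering applied to plain strings.
class CaseInsensitiveSortSketch {

    // Returns a sorted copy, ignoring case differences; the input array is left untouched.
    static String[] sortedCopy(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        String[] values = { "banana", "apple", "Cherry" };
        System.out.println(Arrays.toString(sortedCopy(values)));
    }
}
```

With natural string ordering the capitalised entry would sort ahead of the lower-case ones; the case-insensitive order places it by letter instead.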

equals

public boolean equals(java.lang.Object otherObj)

Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by getString()) as the current instance.

Parameters:: otherObj - The object to compare this instance to.
Returns:: True if the specified object is a DataHighBitAnalysis instance with the same value as the current instance.

hashCode

public int hashCode()

Returns a hashcode for the current instance based on the current string value (as returned by getString()).

Returns:: A hashcode for the current instance.

toString

public java.lang.String toString()

Returns a text description of the current instance that can be used for debugging purposes.

Returns:: A text description of the current instance that can be used for debugging purposes.
