java.lang.Object
  org.afcs.warts.db.DataHighBitAnalysis
The DataHighBitAnalysis class performs "high-bit" analysis on an array of bytes, classifying each byte as ASCII, Latin-1, UTF-8 or ambiguous, and classifying the string as a whole.
One of the tricky parts of character classification is telling the difference between bytes that make up a single multibyte UTF-8 character and bytes that make up several Latin-1 characters. Such bytes are currently classified as ambiguous, with the exception of certain Eastern European characters (lower-case vowels with umlauts, etc.), which are much more likely to be a single 2 byte UTF-8 character than 2 Latin-1 characters (which would typically look something like 'Ã¼').
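To make the ambiguity concrete, the sketch below decodes the same two bytes once as UTF-8 and once as Latin-1. It uses only the standard `java.nio.charset.StandardCharsets` constants; the class and method names (`AmbiguousBytesDemo`, `asUtf8`, `asLatin1`) are hypothetical and are not part of DataHighBitAnalysis.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows why a byte pair can be read two ways.
class AmbiguousBytesDemo {
    /** Decodes the bytes as UTF-8. */
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as Latin-1 (ISO-8859-1), one character per byte. */
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xBC};
        System.out.println(asUtf8(data).length());   // 1: a single character, U+00FC
        System.out.println(asLatin1(data).length()); // 2: U+00C3 followed by U+00BC
    }
}
```

Both decodings succeed without error, which is why the bytes alone cannot settle which encoding was intended.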
A byte may be classed as illegal when it appears that a combination of bytes could not possibly be transformed into a valid sequence of characters. This most often occurs when a byte in the range 0x80 - 0x9F would have to lead off a character. Such a byte is not in the valid Latin-1 range, so it can be characterised as illegal.
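The rules described above can be approximated in a few lines. The following is a minimal sketch under stated assumptions, not the class's actual implementation: the labels, `HighBitSketch`, `classify` and `utf8SequenceLength` are all hypothetical, the UTF-8 check is simplified (it does not reject overlong or surrogate encodings), and it does not model the special case for lower-case vowels with umlauts. It assumes a well-formed UTF-8 sequence is ambiguous only when every one of its bytes is also in the Latin-1 range 0xA0 - 0xFF.

```java
// Simplified per-byte classifier; an illustration, not the real algorithm.
class HighBitSketch {
    // Hypothetical one-letter labels used only in this sketch.
    static final char ASCII = 'a';
    static final char LATIN_1 = 'l';
    static final char UTF_8 = 'u';
    static final char AMBIGUOUS = '?';
    static final char ILLEGAL = 'x';

    /** Returns one label per input byte. */
    static String classify(byte[] data) {
        StringBuilder labels = new StringBuilder(data.length);
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int len = utf8SequenceLength(data, i);
            if (b < 0x80) {
                labels.append(ASCII);
            } else if (len > 1) {
                // Well-formed UTF-8. If every byte is also >= 0xA0, the same
                // bytes could equally be several Latin-1 characters.
                boolean alsoLatin1 = true;
                for (int k = i; k < i + len; k++) {
                    if ((data[k] & 0xFF) < 0xA0) alsoLatin1 = false;
                }
                char label = alsoLatin1 ? AMBIGUOUS : UTF_8;
                for (int k = 0; k < len; k++) labels.append(label);
            } else if (b >= 0xA0) {
                labels.append(LATIN_1);
            } else {
                // 0x80 - 0x9F would have to lead off a character.
                labels.append(ILLEGAL);
            }
            i += len;
        }
        return labels.toString();
    }

    /** Length (2 or 3) of a well-formed UTF-8 sequence starting at i, else 1. */
    static int utf8SequenceLength(byte[] data, int i) {
        int b = data[i] & 0xFF;
        int len = (b >= 0xC2 && b <= 0xDF) ? 2 : (b >= 0xE0 && b <= 0xEF) ? 3 : 1;
        if (len == 1 || i + len > data.length) return 1;
        for (int k = 1; k < len; k++) {
            int next = data[i + k] & 0xFF;
            if (next < 0x80 || next > 0xBF) return 1; // not a continuation byte
        }
        return len;
    }
}
```

With these assumptions, the bytes 0xC3 0xBC are labelled ambiguous, 0xC5 0x91 is labelled UTF-8 (0x91 cannot be a Latin-1 character), and a lone 0x85 is labelled illegal.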
LICENSE: This code is released to the public domain and may be used for any purpose whatsoever without permission or acknowledgment.
Field Summary | |
static byte |
BYTE_CLASS_ASCII
The byte was classified as an ASCII byte (0-127). |
static byte |
BYTE_CLASS_FIRST_AMBIGUOUS
The byte was classified as the first byte of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters). |
static byte |
BYTE_CLASS_ILLEGAL
The byte was classified as illegal, which can happen when a combination of bytes cannot be transformed into a collection of valid characters. |
static byte |
BYTE_CLASS_LATIN_1
The byte was classified as a Latin-1 character (128-255). |
static byte |
BYTE_CLASS_NOT_FIRST_AMBIGUOUS
The byte was classified as one byte (but not the first) of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters). |
static byte |
BYTE_CLASS_THREE_BYTE_UTF_8
The byte was classified as part of a 3 byte UTF-8 character. |
static byte |
BYTE_CLASS_TWO_BYTE_UTF_8
The byte was classified as part of a 2 byte UTF-8 character. |
static int |
DATA_CLASS_ASCII
The string consists of nothing but ASCII characters. |
static int |
DATA_CLASS_ASCII_AND_2_BYTE_UTF_8
The string consists of a mix of ASCII and 2 byte UTF-8 characters. |
static int |
DATA_CLASS_ASCII_AND_3_BYTE_UTF_8
The string consists of a mix of ASCII and 3 byte UTF-8 characters. |
static int |
DATA_CLASS_ASCII_AND_LATIN_1
The string consists of a mix of ASCII and Latin-1 characters. |
static int |
DATA_CLASS_CONTAINS_AMBIGUOUS
The string consists of a mix of ASCII and "ambiguous" characters. |
static int |
DATA_CLASS_CONTAINS_ILLEGAL
The string contains illegal bytes. |
static int |
DATA_CLASS_CONTAINS_MULTIPLE
The string consists of a mix of ASCII and multiple classes of non-ASCII characters. |
Constructor Summary | |
DataHighBitAnalysis(byte[] data,
int numBytesAllowed)
Constructs a new instance with the specified data, and the size of the column that the data is in. |
Method Summary | |
int |
compareTo(java.lang.Object otherObj)
Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by getString() . |
boolean |
equals(java.lang.Object otherObj)
Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by getString() ) as the current instance. |
byte[] |
getClassifications()
Returns the array of classifications for each byte. |
byte[] |
getData()
Returns the original byte array that was analysed. |
int |
getDataClass()
Returns a classification for the byte array as a whole. |
int |
getNum2ByteUtf8Chars()
Returns the number of 2 byte UTF-8 characters found during analysis. |
int |
getNum3ByteUtf8Chars()
Returns the number of 3 byte UTF-8 characters found during analysis. |
int |
getNumAmbiguousBytes()
Returns the number of ambiguous bytes found during analysis. |
int |
getNumIllegalBytes()
Returns the number of illegal bytes found during analysis. |
int |
getNumLatin1Chars()
Returns the number of Latin-1 characters found during analysis. |
java.lang.String |
getString()
Returns a string representation of the byte data using the preferred encoding. |
java.lang.String |
getStringAsLatin1()
Returns a string representation of the byte data where the encoding is assumed to be Latin-1. |
java.lang.String |
getStringAsUtf16()
Returns a string representation of the byte data where the encoding is assumed to be UTF-16 (the successor to the older fixed-width UCS-2, with which it is often conflated). |
java.lang.String |
getStringAsUtf8()
Returns a string representation of the byte data where the encoding is assumed to be UTF-8. |
int |
hashCode()
Returns a hashcode for the current instance based on the current string value (as returned by getString() ). |
java.lang.String |
toString()
Returns a text description of the current instance that can be used for debugging purposes. |
boolean |
utf8Oversize()
Returns true if the UTF-8 representation of this string would overflow the column it is in (presumably the string is encoded using Latin-1 now). |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
public static final byte BYTE_CLASS_ASCII
public static final byte BYTE_CLASS_LATIN_1
public static final byte BYTE_CLASS_TWO_BYTE_UTF_8
public static final byte BYTE_CLASS_THREE_BYTE_UTF_8
public static final byte BYTE_CLASS_FIRST_AMBIGUOUS
public static final byte BYTE_CLASS_NOT_FIRST_AMBIGUOUS
public static final byte BYTE_CLASS_ILLEGAL
public static final int DATA_CLASS_ASCII
public static final int DATA_CLASS_ASCII_AND_LATIN_1
public static final int DATA_CLASS_ASCII_AND_2_BYTE_UTF_8
public static final int DATA_CLASS_ASCII_AND_3_BYTE_UTF_8
public static final int DATA_CLASS_CONTAINS_AMBIGUOUS
public static final int DATA_CLASS_CONTAINS_MULTIPLE
public static final int DATA_CLASS_CONTAINS_ILLEGAL
Constructor Detail |
public DataHighBitAnalysis(byte[] data, int numBytesAllowed)
Parameters:
data - The bytewise representation of the string to analyse.
numBytesAllowed - The number of bytes allowed in the column from which the data was taken. This determines what utf8Oversize() will return.
Throws:
java.lang.NullPointerException - If data is null.
Method Detail |
public byte[] getData()
public byte[] getClassifications()
Returns the array of classifications for each byte. The array corresponds byte for byte to the one returned by getData(), and each byte in the array will be one of the BYTE_CLASS_* constants defined in this class. The array returned is the same as the one used internally, so it should not be modified by the caller if the reference to the analysis object is shared.
public int getNumLatin1Chars()
public int getNum2ByteUtf8Chars()
public int getNum3ByteUtf8Chars()
public int getNumAmbiguousBytes()
public int getNumIllegalBytes()
public java.lang.String getString()
public java.lang.String getStringAsLatin1()
public java.lang.String getStringAsUtf8()
public java.lang.String getStringAsUtf16()
public int getDataClass()
Returns a classification for the byte array as a whole. The value is one of the DATA_CLASS_* constants specified in this class. Note that the presence of illegal characters will always cause the data class to be DATA_CLASS_CONTAINS_ILLEGAL regardless of what other characters are in the string.
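One way to picture how a whole-string class could follow from the per-class counts, including the documented rule that illegal bytes always win, is sketched below. This is an assumption-laden illustration: the `DataClassSketch` class, its `DataClass` enum and the `summarise` method are hypothetical, and the ordering of every check other than the illegal-bytes rule is a guess rather than documented behaviour.

```java
// Hypothetical roll-up of per-class counts into a single whole-string class.
class DataClassSketch {
    // Stand-ins for the DATA_CLASS_* constants; the real values are ints.
    enum DataClass {
        ASCII, ASCII_AND_LATIN_1, ASCII_AND_2_BYTE_UTF_8, ASCII_AND_3_BYTE_UTF_8,
        CONTAINS_AMBIGUOUS, CONTAINS_MULTIPLE, CONTAINS_ILLEGAL
    }

    static DataClass summarise(int latin1, int twoByte, int threeByte,
                               int ambiguous, int illegal) {
        // Documented: illegal bytes override everything else.
        if (illegal > 0) return DataClass.CONTAINS_ILLEGAL;

        // Assumption: each non-ASCII category, including ambiguous, counts as one kind.
        int kinds = 0;
        if (latin1 > 0) kinds++;
        if (twoByte > 0) kinds++;
        if (threeByte > 0) kinds++;
        if (ambiguous > 0) kinds++;

        if (kinds == 0) return DataClass.ASCII;
        if (kinds > 1) return DataClass.CONTAINS_MULTIPLE;
        if (latin1 > 0) return DataClass.ASCII_AND_LATIN_1;
        if (twoByte > 0) return DataClass.ASCII_AND_2_BYTE_UTF_8;
        if (threeByte > 0) return DataClass.ASCII_AND_3_BYTE_UTF_8;
        return DataClass.CONTAINS_AMBIGUOUS;
    }
}
```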
public boolean utf8Oversize()
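The idea behind utf8Oversize() can be sketched with standard library calls alone: every Latin-1 character above 0x7F needs two bytes in UTF-8, so re-encoding can push a value past its column size. The helper below is a hypothetical illustration (`OversizeSketch` and `utf8WouldOverflow` are not part of this class), and it assumes the input is currently Latin-1.

```java
import java.nio.charset.StandardCharsets;

// Illustration of the overflow check, assuming the data is Latin-1 today.
class OversizeSketch {
    /** True if the UTF-8 form of the Latin-1 data needs more than numBytesAllowed bytes. */
    static boolean utf8WouldOverflow(byte[] latin1Data, int numBytesAllowed) {
        String text = new String(latin1Data, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8).length > numBytesAllowed;
    }
}
```

For example, the four Latin-1 bytes 0x63 0x61 0x66 0xE9 fit a 4 byte column as stored, but their UTF-8 form takes 5 bytes and would not.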
public int compareTo(java.lang.Object otherObj)
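Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by getString().
Specified by:
compareTo in interface java.lang.Comparable
Parameters:
otherObj - The object to compare this instance to.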
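Since compareTo is documented as ordering instances by a case-insensitive comparison of getString(), the resulting sort order should resemble what the standard `String.CASE_INSENSITIVE_ORDER` comparator produces on plain strings. The sketch below shows that ordering on strings only; `CaseInsensitiveSortSketch` and `sortIgnoringCase` are hypothetical names, and whether the class uses this comparator internally is not stated in the source.

```java
import java.util.Arrays;

// Demonstrates case-insensitive ordering on plain strings.
class CaseInsensitiveSortSketch {
    /** Returns a sorted copy, ignoring case; the input array is left untouched. */
    static String[] sortIgnoringCase(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }
}
```

Note that two values differing only in case compare as equal under this ordering, so their relative position after sorting depends on their original order.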
public boolean equals(java.lang.Object otherObj)
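Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by getString()) as the current instance.
Parameters:
otherObj - The object to compare this instance to.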
public int hashCode()
Returns a hashcode for the current instance based on the current string value (as returned by getString()).
public java.lang.String toString()