code page detect

1251 866 codepage encoder encoding golang koi8 utf utf-16 utf-32 utf-8


softlandia 71875ae9e4 v0.4.1 add decode UTF-16		2020-01-29 21:15:54 +04:00
.vscode	v0.2.0	2019-11-07 18:28:00 +04:00
sample	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
test_files	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
.gitignore	v0.2.4	2019-12-18 20:40:56 +04:00
HIST.md	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
README-RU.md	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
README.md	v0.4.1 add decode UTF-16	2020-01-29 21:15:54 +04:00
char_frac.xlsx	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
code_pages.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
code_pages_id.go	v0.3.0 add pairs	2020-01-03 20:14:54 +04:00
cpTable.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
cpd.go	v0.4.1 add decode UTF-16	2020-01-29 21:15:54 +04:00
cpd_test.go	v0.4.1 add decode UTF-16	2020-01-29 21:15:54 +04:00
ibm866.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
iso-8859-5.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
koi8.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
linter.md	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
utf8.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
utf16be.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
utf16le.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
utf32be.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
utf32le.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
utils.go	v0.4.0 multithreading support	2020-01-29 19:40:19 +04:00
win1251.go	v0.3.5 minor updates	2020-01-27 18:19:24 +04:00
сheckBom.go	v0.3.5 minor updates	2020-01-27 18:59:59 +04:00

README.md

code page detect

download: go get -u github.com/softlandia/cpd
install: go install

Go library for detecting the code page of text files
multibyte code pages and single-byte Russian code pages are supported:

no  ID                Name            uint16

1   ASCII             "ASCII"         3
2   ISOLatinCyrillic  "ISO-8859-5"    8
3   CP866             "CP866"         2086
4   Windows1251       "Windows-1251"  2251
5   UTF8              "UTF-8"         106
6   UTF16LE           "UTF-16LE"      1014
7   UTF16BE           "UTF-16BE"      1013
8   KOI8R             "KOI8-R"        2084
9   UTF32LE           "UTF-32LE"      1019
10  UTF32BE           "UTF-32BE"      1018

feature

the encoding is determined both by the presence of a BOM (byte order mark) and by heuristics
if a file contains only Latin characters from the first half of the code page (bytes 0x00-0x7F), it is detected as UTF-8
this is not a mistake: pure ASCII text is also valid UTF-8
detection of UTF-32 is unreliable when the text contains no Russian characters
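
The two rules above (a BOM check, and ASCII-only text counting as UTF-8) can be illustrated with the standard library alone. This is a minimal sketch, not the library's code: the names bomTable, detectByBOM and isASCII are made up for the example, and the BOM byte sequences are the standard Unicode ones.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// bomTable lists the standard Unicode byte order marks.
// UTF-32LE is checked before UTF-16LE because its BOM
// (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
var bomTable = []struct {
	name string
	bom  []byte
}{
	{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
	{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
	{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
	{"UTF-16LE", []byte{0xFF, 0xFE}},
	{"UTF-16BE", []byte{0xFE, 0xFF}},
}

// detectByBOM returns the encoding name announced by a BOM, if any.
// (Hypothetical helper for this sketch, not part of cpd.)
func detectByBOM(data []byte) (string, bool) {
	for _, e := range bomTable {
		if bytes.HasPrefix(data, e.bom) {
			return e.name, true
		}
	}
	return "", false
}

// isASCII reports whether every byte is in the range 0x00-0x7F.
func isASCII(data []byte) bool {
	for _, b := range data {
		if b >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	name, ok := detectByBOM([]byte{0xEF, 0xBB, 0xBF, 'h', 'i'})
	fmt.Println(name, ok) // UTF-8 true

	// ASCII-only text is always valid UTF-8, so reporting it
	// as UTF-8 is a correct answer.
	plain := []byte("only latin text")
	fmt.Println(isASCII(plain), utf8.Valid(plain)) // true true
}
```

Note that a UTF-16LE file whose first character is U+0000 would be indistinguishable from UTF-32LE by BOM alone; the sketch ignores that corner case.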

ATTENTION! the library supports multithreading

dependencies

"golang.org/x/text/encoding/charmap"
"golang.org/x/text/transform"

types

IDCodePage uint16 - index of code page; implements String(), so you can call fmt.Printf("code page index, name: %d, %s\n", cp, cp), where cp is a value returned by the cpd functions
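
To show why the same value can be printed with both %d and %s, here is a small sketch of a uint16-based type with a String() method. The type codePageID and its constants are hypothetical stand-ins written for this example (the numeric values are taken from the table above); the library's own IDCodePage may be implemented differently.

```go
package main

import "fmt"

// codePageID is a hypothetical stand-in for the library's IDCodePage.
type codePageID uint16

// Numeric values copied from the table of supported code pages.
const (
	asciiCP codePageID = 3
	utf8CP  codePageID = 106
	koi8rCP codePageID = 2084
)

// String makes codePageID satisfy fmt.Stringer.
func (id codePageID) String() string {
	switch id {
	case asciiCP:
		return "ASCII"
	case utf8CP:
		return "UTF-8"
	case koi8rCP:
		return "KOI8-R"
	}
	return fmt.Sprintf("unknown(%d)", uint16(id))
}

func main() {
	cp := utf8CP
	// %d formats the underlying integer, %s calls String().
	fmt.Printf("code page index, name: %d, %s\n", cp, cp)
	// prints: code page index, name: 106, UTF-8
}
```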

variables

ReadBufSize int = 1024 // number of bytes read from the input reader by default

functions

CodePageDetect(r io.Reader) (IDCodePage, error)
FileCodePageDetect(fn string, stopStr ...string) (IDCodePage, error)
DecodeUTF16be(s string) string
DecodeUTF16le(s string) string

description

func CodePageAutoDetect(content []byte) (result IDCodePage) 
  autodetect the code page from an input slice of bytes
  use this function instead of golang.org/x/net/html/charset.DetermineEncoding()

CodePageDetect(r io.Reader) (IDCodePage, error)
  detect the code page of text data from reader 'r'
  uses the 'reflect' package to check the input reader
  by default reads only the first 1024 bytes from 'r' (change var ReadBufSize to adjust this)
  the stopStr parameter is currently not used

FileCodePageDetect(fn string, stopStr ...string) (IDCodePage, error)
  detect the code page of text file "fn"; reads the first 1024 bytes (change var ReadBufSize to adjust this)
  returns an error if there is a problem with file "fn"
  returns cpd.ASCII if the code page is not detected
  returns one of the following constants (code_pages_id.go): cpd.IBM866, cpd.Windows1251, cpd.KOI8R, cpd.UTF8, UTF16LE, UTF16BE
  the file must contain characters of the Russian alphabet
  the stopStr parameter is currently not used

func StrConvertCodePage(s string, fromCP, toCP IDCodePage) (string, error)  // convert a string from one code page to another; supports Windows1251 and IBM866

func FileConvertCodePage(fileName string, fromCP, toCP IDCodePage) error    // convert the code page of the file "fileName"; supports Windows1251 and IBM866
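
For illustration only, converting between these two single-byte code pages amounts to decoding each byte to a Unicode code point and re-encoding it. The sketch below does this by hand for ASCII and the letters А-я, so that it has no external dependencies; the function names are hypothetical. Real code would more likely rely on the golang.org/x/text/encoding/charmap tables listed under dependencies, which cover every character.

```go
package main

import "fmt"

// decode1251 maps one Windows-1251 byte to a rune.
// Only ASCII and the letters А-я (0xC0-0xFF) are handled in this sketch.
func decode1251(b byte) rune {
	switch {
	case b < 0x80:
		return rune(b)
	case b >= 0xC0:
		return 0x0410 + rune(b-0xC0)
	}
	return '\uFFFD' // everything else is out of scope here
}

// encode866 maps a rune to a CP866 byte.
// Only ASCII and the letters А-я are handled in this sketch.
func encode866(r rune) byte {
	switch {
	case r >= 0 && r < 0x80:
		return byte(r)
	case r >= 0x0410 && r <= 0x043F: // А..п
		return byte(0x80 + (r - 0x0410))
	case r >= 0x0440 && r <= 0x044F: // р..я
		return byte(0xE0 + (r - 0x0440))
	}
	return '?'
}

// win1251To866 converts a Windows-1251 byte slice to CP866.
func win1251To866(in []byte) []byte {
	out := make([]byte, len(in))
	for i, b := range in {
		out[i] = encode866(decode1251(b))
	}
	return out
}

func main() {
	src := []byte{0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2} // "Привет" in Windows-1251
	fmt.Printf("% X\n", win1251To866(src))           // 8F E0 A8 A2 A5 E2
}
```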

func DecodeUTF16be(s string) string // convert the input string from UTF-16BE to UTF-8

func DecodeUTF16le(s string) string // convert the input string from UTF-16LE to UTF-8

tests and static analysis

coverage: 88% of statements
the "test_files" folder contains the files used by the tests; do not remove, change, or add files there if you want the tests to keep working

file linter.md contains the report from golangci-lint