code page detect
Перейти к файлу
softlandia 71875ae9e4 v0.4.1 add decode UTF-16 2020-01-29 21:15:54 +04:00
.vscode v0.2.0 2019-11-07 18:28:00 +04:00
sample v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
test_files v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
.gitignore v0.2.4 2019-12-18 20:40:56 +04:00
HIST.md v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
README-RU.md v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
README.md v0.4.1 add decode UTF-16 2020-01-29 21:15:54 +04:00
char_frac.xlsx v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
code_pages.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
code_pages_id.go v0.3.0 add pairs 2020-01-03 20:14:54 +04:00
cpTable.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
cpd.go v0.4.1 add decode UTF-16 2020-01-29 21:15:54 +04:00
cpd_test.go v0.4.1 add decode UTF-16 2020-01-29 21:15:54 +04:00
ibm866.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
iso-8859-5.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
koi8.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
linter.md v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
utf8.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
utf16be.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
utf16le.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
utf32be.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
utf32le.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
utils.go v0.4.0 multithreading support 2020-01-29 19:40:19 +04:00
win1251.go v0.3.5 minor updates 2020-01-27 18:19:24 +04:00
сheckBom.go v0.3.5 minor updates 2020-01-27 18:59:59 +04:00

README.md

code page detect

(c) softlandia@gmail.com

download: go get -u github.com/softlandia/cpd
install: go install

golang library for detecting code page of text files
multibyte code pages and single-byte Russian code pages are supported:

no ID Name uint16

  1. ASCII: "ASCII", 3
  2. ISOLatinCyrillic: "ISO-8859-5", 8
  3. CP866: "CP866", 2086
  4. Windows1251: "Windows-1251", 2251
  5. UTF8: "UTF-8", 106
  6. UTF16LE: "UTF-16LE", 1014
  7. UTF16BE: "UTF-16BE", 1013
  8. KOI8R: "KOI8-R", 2084
  9. UTF32LE: "UTF-32LE", 1019
  10. UTF32BE: "UTF-32BE", 1018

feature

encoding is determined both by the presence of the bom attribute and by heuristic
if file contain only latin symbols from first half of code page, this file detected as UTF-8
this is not a mistake, this is a completely correct statement
have touble with detecting UTF32 without russians char

ATTANTION! library support multithreading

dependences

"golang.org/x/text/encoding/charmap"
"golang.org/x/text/transform"

types

IDCodePage uint16 - index of code page, support String() interface, you can fmt.Printf("code page index, name: %d, %s\n", cp, cp) where var cp received from cpd functions

variables

ReadBufSize int = 1024 // count of byte to read from input reader by default

functions

  1. CodePageDetect(r io.Reader) (IDCodePage, error)
  2. FileCodePageDetect(fn string, stopStr ...string) (IDCodePage, error)
  3. DecodeUTF16be(s string) string
  4. DecodeUTF16le(s string) string

description

func CodePageAutoDetect(content []byte) (result IDCodePage) 
  autodetect code page from input slice of byte
  use this function instead golang.org/x/net/html/charset.DetermineEncoding()

CodePageDetect(r io.Reader) (IDCodePage, error)
  detect code page of ascii data from reader 'r' 
  use library 'reflect' to check input reader
  default read only first 1024 byte from 'r' (var ReadBufSize to change this setting)
  input parameter stopStr not using

FileCodePageDetect(fn string, stopStr ...string) (IDCodePage, error)
  detect code page of text file "fn", read first 1024 byte (var ReadBufSize to change this setting)
  return error if problem with file "fn"
  return cpd.ASCII if code page not detected
  return one of next constant (code_pages_id.go): cpd.IBM866, cpd.Windows1251, cpd.KOI8R, cpd.UTF8, UTF16LE, UTF16BE
  file must contain characters of the Rusian alphabet
  string stopStr now not using

func StrConvertCodePage(s string, fromCP, toCP IDCodePage) (string, error)  //convert string from one code page to another, support Windows1251 & IBM866

func FileConvertCodePage(fileName string, fromCP, toCP IDCodePage) error    //convert code page file with "fileName", support Windows1251 & IBM866

func DecodeUTF16be(s string) string // convert input string from UTF-16BE to Utf-8

func DecodeUTF16le(s string) string // convert input string from UTF-16LE to Utf-8

tests and static analiz

coverage: 88% of statements
folder "test_files" contain files for testing, do not remove/change/add if want support tests is work

file linter.md report from golangci-lint