title: How to identify the actual document type when the filename extension is not correct
description: This knowledge base article describes how to identify the actual document type when the filename extension is not correct
type: how-to
page_title: How to identify the actual document type when the filename extension is not correct
slug: how-to-identify-document-type
position: 0
tags: processing, file, filename, extension, incorrect
res_type: kb

Product Version: 2022.1.217
Product: WordsProcessing
Author: Martin Velikov

Description

This article describes how to identify the actual type of a document when its filename extension is incorrect, so that you can choose the appropriate format provider to import it.

Solution

The following example reads two documents that both have the ".doc" filename extension but are actually of different document types. A StringBuilder is used to build a hexadecimal string from the first bytes of each document (its signature, or header), which is then compared with predefined signature values. Once the actual document type is known, we can determine which format provider to use to import the document.

[C#] Example

{{region how-to-identify-document-type1}}

List<byte[]> documents = new List<byte[]>();
documents.Add(File.ReadAllBytes("rtf.doc"));
documents.Add(File.ReadAllBytes("doc.doc"));

foreach (byte[] document in documents)
{
	string headerCode = GetHeaderInfo(document).ToUpper();

	//! The signatures are taken from: https://www.filesignatures.net/index.php?page=search
	if (headerCode.StartsWith("7B5C72746631"))
	{
		//! The document is RTF: the bytes 7B 5C 72 74 66 31 are the ASCII text "{\rtf1"
		RtfFormatProvider rtfFormatProvider = new RtfFormatProvider();
		RadFlowDocument rtfDocument = rtfFormatProvider.Import(new MemoryStream(document));
	}
	else if (headerCode.StartsWith("D0CF11E0A1B11AE1"))
	{
		//! The document is DOC: D0 CF 11 E0 A1 B1 1A E1 is the OLE compound file signature used by binary .doc files
		DocFormatProvider docFormatProvider = new DocFormatProvider();
		RadFlowDocument docDocument = docFormatProvider.Import(document);
	}
}

{{endregion}}
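
The signature strings compared in the example are the hexadecimal form of the first bytes of each file. As a small illustration, the sketch below derives the RTF signature from the ASCII text that it encodes. The `SignatureDemo` class and the `ToHexSignature` method are hypothetical names introduced here and are not part of the Telerik API.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the Telerik API.
public static class SignatureDemo
{
    // Converts ASCII text to the uppercase hex string format
    // that the example compares against.
    public static string ToHexSignature(string asciiText)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(asciiText);

        // BitConverter.ToString produces "7B-5C-72-...", so the dashes are removed.
        return BitConverter.ToString(bytes).Replace("-", string.Empty);
    }
}
```

Calling `SignatureDemo.ToHexSignature(@"{\rtf1")` returns `7B5C72746631`, the value checked in the RTF branch of the example.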

[C#] Getting document header

{{region how-to-identify-document-type1}}

private static string GetHeaderInfo(byte[] documentData)
{
	byte[] buffer = documentData.Take(8).ToArray();

	StringBuilder sb = new StringBuilder();
	foreach (byte b in buffer)
	{
		sb.Append(b.ToString("X2"));
	}

	return sb.ToString();
}

{{endregion}}
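
The check can also be written as a single helper that reads only the first eight bytes from a stream instead of loading the whole file, and that reports an unknown type when neither signature matches. This is a minimal sketch under assumptions: the `DetectedFormat` enum and the `DocumentTypeDetector` class are hypothetical names introduced for illustration, and only the two signatures from the example above are checked.

```csharp
using System;
using System.IO;

// Hypothetical types for illustration only; not part of the Telerik API.
public enum DetectedFormat { Unknown, Rtf, Doc }

public static class DocumentTypeDetector
{
    // ASCII bytes of "{\rtf1".
    private static readonly byte[] RtfSignature = { 0x7B, 0x5C, 0x72, 0x74, 0x66, 0x31 };

    // OLE compound file header, used by binary .doc files.
    private static readonly byte[] DocSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    public static DetectedFormat Detect(Stream stream)
    {
        byte[] header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested,
        // so keep reading until the buffer is full or the stream ends.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, RtfSignature)) return DetectedFormat.Rtf;
        if (StartsWith(header, read, DocSignature)) return DetectedFormat.Doc;
        return DetectedFormat.Unknown;
    }

    // Returns true when the first bytes of data match the signature exactly.
    private static bool StartsWith(byte[] data, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

`Detect` advances the stream position, so a seekable stream should be reset with `stream.Position = 0` before it is passed to a format provider. The OLE compound file signature is shared by other legacy binary Office formats, so a `Doc` result identifies the container rather than proving the content is a Word document.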