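PDFlib TET: extract text, images, and metadata from PDF documents (latest official version, free download, licensed purchase, and online documentation support via 慧都网)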


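PDFlib TET (Product No. 10596)

PDFlib TET is software that reliably extracts text from any PDF document.

Tags: PDF

Developer: PDFlib

Current version: v5.4

Product type: Component

Product function: Document management

Platforms and languages: ActiveX & COM | .NET | Java | C++/MFC | Other

Source code: Not provided

The classification and description of this product are for reference only; the vendor's website is authoritative. For questions, please call 023-68661681.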

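Text and image extraction toolkit

Accepts all relevant flavors of PDF input

Works with all of the world's writing systems

Available under several licensing programs

PDF products trusted worldwide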

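PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images, and metadata from PDF documents. TET delivers the text contents of a PDF as Unicode strings, along with detailed color, glyph, and font information and the position on the page. Raster images are extracted in common image formats. TET can optionally convert PDF documents to an XML-based format called TETML, which contains the text and metadata as well as resource information. TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns, identifying table structures, and removing redundant text such as shadow text.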


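What PDFlib TET supports

Implement the PDF indexer for a search engine
Repurpose the text and images in PDFs
Convert the contents of PDFs to other formats
Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)
Check whether a specific location on the page is empty, e.g. for placing a barcode or stamp
TET also includes the pCOS interface for querying details about a PDF document, such as document info fields and XMP metadata, font lists, page sizes, and more (see the pCOS product description and the pCOS Cookbook)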

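Why choose TET for text extraction?

Dehyphenation

TET detects hyphenated words that span multiple lines, removes the hyphen, and combines the parts into a complete word. This is important to make sure that a search for the complete word succeeds even though the document contains only the hyphenated parts. Dashes (as opposed to hyphens) are treated separately because they must not be removed.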
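
The idea can be illustrated with a minimal Python sketch. It is not TET's algorithm: it is a naive heuristic that assumes the text has already been extracted as plain lines, and the function name `dehyphenate` is hypothetical. It joins a word fragment ending in a hyphen with a lowercase continuation on the next line, and leaves en and em dashes untouched.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by a
# lowercase continuation on the next line. Only the ASCII hyphen-minus and
# U+2010 HYPHEN are matched; en and em dashes are deliberately left alone.
_HYPHEN_BREAK = re.compile(r"(\w+)[-\u2010]\n\s*([a-z]\w*)")


def dehyphenate(text: str) -> str:
    """Join words that were split across lines with a hyphen (naive heuristic)."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    print(dehyphenate("reliable text extrac-\ntion from PDF"))
    print(dehyphenate("Berlin \u2013\nMunich"))  # the en dash is kept
```

A heuristic this simple also joins genuine compounds such as "well-" / "known" into "wellknown", which is one reason a production tool needs more context than a regular expression.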

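Shadow and bold text detection

TET's patented shadow detection algorithm identifies and removes redundant text instances to avoid extracting the same text more than once. Where other software extracts shadowed or artificially bolded text multiple times, TET correctly removes the redundant copies. An extra instance of a whole word would still produce a search engine hit, but if the text is duplicated character by character, a search no longer finds the word at all.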
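
The following Python sketch shows the general idea of removing duplicated glyphs. It is not TET's patented algorithm; the `Glyph` type, the function name, and the 1-point tolerance are assumptions made for illustration. A glyph is dropped when the same character has already been kept at nearly the same position.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position in points (assumed unit)
    y: float  # vertical position in points (assumed unit)


def drop_shadow_duplicates(glyphs: List[Glyph], tolerance: float = 1.0) -> List[Glyph]:
    """Keep the first instance of each glyph; drop near-coincident copies."""
    kept: List[Glyph] = []
    for g in glyphs:
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


if __name__ == "__main__":
    # Each letter is drawn twice with a small offset to create a shadow.
    shadowed = [
        Glyph("T", 100.0, 500.0), Glyph("T", 100.5, 499.5),
        Glyph("E", 110.0, 500.0), Glyph("E", 110.5, 499.5),
        Glyph("T", 120.0, 500.0), Glyph("T", 120.5, 499.5),
    ]
    print("".join(g.char for g in drop_shadow_duplicates(shadowed)))
```

Without the filter, the glyph stream above would read "TTEETT", which matches no search for "TET".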

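Accented characters

In many languages, accents and other diacritical marks are placed close to other characters to form combined characters. Some typesetting programs (most notably TeX) emit two characters separately, the base character and the accent, to create a combined character. For example, to create the character ä, the letter a is first placed on the page, and then the dieresis character ¨ is placed on top of it. TET detects this situation and recombines the two characters to form the appropriate combined character.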
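
At the Unicode level, the recombination step can be sketched with Python's standard `unicodedata` module. This is an illustration only, not TET's implementation: the mapping table is a small assumed subset, and the function name `recombine` is hypothetical. The spacing accent is replaced by its combining counterpart, and NFC normalization then composes the pair into one character.

```python
import unicodedata

# Spacing diacritics that a producer may emit as separate glyphs, mapped to
# their combining counterparts (a small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
}


def recombine(base: str, accent: str) -> str:
    """Compose a base character and a separately placed accent into one character."""
    combining = SPACING_TO_COMBINING.get(accent, accent)
    return unicodedata.normalize("NFC", base + combining)


if __name__ == "__main__":
    print(recombine("a", "\u00a8"))  # a + dieresis
    print(recombine("e", "\u00b4"))  # e + acute accent
```

Deciding which accent belongs to which base character requires the glyph positions on the page; the sketch assumes that pairing has already been made.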

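Ligatures

A ligature combines two or more characters in a single glyph. The most common ligatures are used for the combinations fi, fl, and ffi; less common ligatures exist for Th, sp, ct, st, and many other combinations. When extracting text from digital documents, ligatures must be analyzed and separated into their constituent characters so that the text can be processed correctly. TET detects ligatures and delivers two or more characters as appropriate.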
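
For ligatures that have their own Unicode code points (U+FB00 to U+FB06), the decomposition can be sketched with Python's standard `unicodedata` module. This is an illustration, not TET's method, and the function name `split_ligatures` is hypothetical. Ligatures such as Th or ct have no Unicode code point, so they cannot be handled this way and must be resolved from font information instead.

```python
import unicodedata


def split_ligatures(text: str) -> str:
    """Replace the Latin ligature characters U+FB00..U+FB06 by their letters."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


if __name__ == "__main__":
    # U+FB03 is the ffi ligature, U+FB01 is the fi ligature.
    print(split_ligatures("e\ufb03cient \ufb01le handling"))
```

Restricting the compatibility decomposition to the ligature range leaves all other characters in the text unchanged.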

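Drop caps

A drop cap is a large initial character at the start of a paragraph whose top is aligned with the top of the first line and whose body drops down over several lines. Drop caps are used to emphasize the beginning of a paragraph. If they are not handled properly, the first word is extracted in two parts: the single initial character and the rest of the word. TET correctly extracts the complete word.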

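Unicode mapping

TET's patented Unicode mapping implements a cascade of steps that uses all available information to determine Unicode values. For many problematic documents, TET extracts proper Unicode text where other products deliver only unusable garbage.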

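Bidirectional text with Arabic and Hebrew

PDF does not encode logical text; it is merely a container for glyphs on the page. Text in the Arabic and Hebrew scripts runs from right to left. Since it often contains left-to-right inserts (such as numbers or names in Western languages), the text must be interpreted in both directions, hence the term "bidirectional". TET reorders the visual mix of right-to-left and left-to-right text to create proper logical text output.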

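Repair of damaged PDF documents

PDF documents can be damaged by transmission errors or other problems. TET's repair mode recovers many kinds of damaged PDF. Sometimes a PDF document is damaged so badly that Acrobat cannot even display its pages. Even in such extreme cases, TET is often still able to deliver the page contents of the document.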

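Why choose TET for image extraction?

Color spaces and compression

Raster image data in PDF can be encoded in combinations of 11 color spaces and 9 compression filters, but common image file formats such as JPEG and TIFF support only a subset of these combinations. TET's image engine mediates between the characteristics of the PDF image and the capabilities of the image output format. Regardless of the internal structure of the PDF image, the pixel image is extracted in one of the common image file formats.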

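Spot colors

TET creates TIFF output with additional spot color channels. This suits applications that require high color fidelity and cannot accept any color conversion. If an image with DeviceN color contains only a subset of the usual CMYK process colors, the missing process channels are added so that plain CMYK output can be created. However, some applications cannot handle spot color channels and are restricted to plain TIFF output. In that case, TET can be instructed to emit individual spot color channels as grayscale TIFF for easier processing.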

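Merging fragmented images

In many PDF documents, images have been broken into small fragments by the software that generated the PDF. What looks like a single image on the page may actually consist of many small pieces. For example, Microsoft Office applications and TeX often produce heavily fragmented images consisting of hundreds or thousands of small pieces, and Adobe InDesign often splits images into fragments of varying size. TET detects fragmented images and merges them into a larger, usable image. Fragmented images can only be reused in a meaningful way after they have been merged.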

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.

With PDFlib TET you can:
Implement the PDF indexer for a search engine
Repurpose the text and images in PDFs
Convert the contents of PDFs to other formats
Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)

Accepted PDF input

TET supports all relevant flavors of PDF input:

All PDF versions up to Acrobat 9, including ISO 32000-1
Protected PDFs which do not require a password for opening the document
Damaged PDF documents will be repaired

Unicode

Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:

TET converts all text contents to Unicode. In C and other non-Unicode aware languages the text is returned in the UTF-8 or UTF-16 formats, and as native strings in Unicode-capable programming languages.
Ligatures and other multi-character glyphs are decomposed into a sequence of the corresponding Unicode characters.
Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.
TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.

Content analysis and word detection

TET includes advanced content analysis algorithms:

Patented algorithm for determining word boundaries which is required to retrieve proper words
Recombine the parts of hyphenated words (dehyphenation)
Remove duplicate instances of text, e.g. shadow and artificially bolded text
Recombine paragraphs in reading order
Correctly order text which is scattered over the page

Page Layout and Table Detection

The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.

Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
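
Once word positions are known, restricting extraction to an area comes down to a rectangle test. The Python sketch below works on already extracted words and is not the TET API; the `Box` type, the function names, and the sample coordinates are assumptions for illustration. Coordinates follow the lower-left/upper-right convention with the origin at the bottom of the page.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    llx: float  # lower left x
    lly: float  # lower left y
    urx: float  # upper right x
    ury: float  # upper right y


def inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies completely within `outer`."""
    return (
        outer.llx <= inner.llx
        and outer.lly <= inner.lly
        and inner.urx <= outer.urx
        and inner.ury <= outer.ury
    )


def words_in_area(words: List[Tuple[str, Box]], area: Box) -> List[str]:
    """Return the text of all words whose box lies inside `area`."""
    return [text for text, box in words if inside(box, area)]


if __name__ == "__main__":
    # An A4 page (595 x 842 points) with a 50-point strip cut off at the
    # top and bottom to skip the header and footer.
    body = Box(0, 50, 595, 792)
    words = [
        ("Header", Box(50, 800, 120, 815)),
        ("PDFlib", Box(111.48, 636.33, 161.14, 654.33)),
        ("Page 1", Box(280, 20, 320, 32)),
    ]
    print(words_in_area(words, body))
```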

Image Extraction

Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined into larger images to facilitate repurposing. Image fidelity is guaranteed since no downsampling or color space conversion occurs. This ensures the highest possible image quality.

PDF Analysis

The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and many more.

Configuration Options for problematic PDF

TET contains special handling and workarounds for various kinds of PDF where the text cannot be extracted correctly with other products. In addition, it includes various configuration features to improve processing of problem documents:

Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.
PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It works as a plugin for Adobe Acrobat. This plugin is freely available for Mac and Windows.
Embedded fonts are analyzed to find additional hints which are useful for Unicode mapping. External font files or system fonts are used to improve text extraction results if a font is not embedded.

Unicode Postprocessing

TET supports various Unicode postprocessing steps which can be used to improve the extracted text:

Foldings preserve, remove or replace characters, e.g. remove punctuation or characters from irrelevant scripts.
Decompositions replace a character with an equivalent sequence of one or more other characters, e.g. replace narrow, wide or vertical Japanese characters or Latin superscript variants with their respective standard counterparts.
Text can be converted to all four Unicode normalization forms, e.g. emit NFC form to meet the requirements for Web text or a database.
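
The three kinds of postprocessing can be approximated on plain strings with Python's standard `unicodedata` module. The sketch below is an analogy under stated assumptions, not TET's configurable implementation, and the function names are hypothetical. The folding removes every character in a Unicode punctuation category, the width and superscript replacement relies on NFKC compatibility mappings, and the last step emits NFC.

```python
import unicodedata


def remove_punctuation(text: str) -> str:
    """Folding: drop all characters whose general category is punctuation."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def to_standard_forms(text: str) -> str:
    """Decomposition: map wide, narrow, and superscript variants to standard forms.

    NFKC applies Unicode compatibility mappings and then recomposes.
    """
    return unicodedata.normalize("NFKC", text)


def to_nfc(text: str) -> str:
    """Normalization: emit canonical composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(remove_punctuation("Hello, world!"))
    print(to_standard_forms("\uff34\uff25\uff34 x\u00b2"))  # fullwidth TET, superscript 2
    print(to_nfc("e\u0301") == "\u00e9")
```

The other three normalization forms are available the same way by passing "NFD", "NFKC", or "NFKD" to `unicodedata.normalize`.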

Document Domains

PDF documents may contain text in other places than the page contents. While most applications will deal with the page contents only, in many situations other document domains may be relevant as well. TET extracts the text from all of the following document domains:

page contents
predefined and custom document info entries
XMP metadata on document and image level
bookmarks
file attachments and PDF portfolios can be processed recursively
form fields
comments (annotations)
general PDF properties can be queried, such as page count, conformance to standards like PDF/A or PDF/X, etc.

XMP Metadata

TET supports XMP metadata in several ways:

Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.
TETML output contains XMP document and image metadata if present in the PDF.
Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.

TETML represents PDF Contents as XML

TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.

TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.

The following fragment shows TETML output with glyph details:

<Word>
<Text>PDFlib</Text>
<Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
<Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
<Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
<Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
<Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
<Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
<Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
</Box>
</Word>
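
Because TETML is XML, a fragment like the one above can be read with any XML parser. The Python sketch below uses the standard `xml.etree.ElementTree` module and parses the fragment exactly as shown; the function name `parse_word` is hypothetical. A complete TETML file may declare an XML namespace, in which case the element names would have to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<Word>
<Text>PDFlib</Text>
<Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
<Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
<Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
<Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
<Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
<Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
<Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
</Box>
</Word>"""


def parse_word(xml_text):
    """Return (text, bounding box, glyph list) for one Word element."""
    word = ET.fromstring(xml_text)
    box = word.find("Box")
    bbox = tuple(float(box.get(key)) for key in ("llx", "lly", "urx", "ury"))
    glyphs = [
        (g.text, float(g.get("x")), float(g.get("width")))
        for g in box.findall("Glyph")
    ]
    return word.findtext("Text"), bbox, glyphs


if __name__ == "__main__":
    text, bbox, glyphs = parse_word(FRAGMENT)
    print(text, bbox)
    for char, x, width in glyphs:
        print(char, x, width)
```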

TET Connectors

TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:

TET connector for the Lucene Search Engine
TET connector for the Solr Search Server
TET connector for Oracle Text
TET connector for MediaWiki
TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes it available to search and retrieval software on Windows.

TET Cookbook

The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.

Updated: 2023-07-13 15:00:44.000 | Entered: 2006-01-18 11:46:00.000 | Editor: 胡涛
