Translation | Tutorial | Editor: Hu Tao | 2022-08-29 10:51:06.847 | 262 views
Summary: This article shows how to extract text from Word documents in Java.
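Aspose.Words is an advanced Word document processing API for performing a wide range of document management and manipulation tasks. The API supports generating, modifying, converting, rendering, and printing documents without using Microsoft Word directly in cross-platform applications. It also supports all popular word-processing file formats and can export or convert Word documents to fixed-layout file formats and the most commonly used image and multimedia formats.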
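Text extraction from Word documents comes up in many scenarios, for example analyzing text, or pulling specific parts out of documents and combining them into a single document. In this article, you will learn how to extract text from Word documents programmatically in Java. We will also cover how to dynamically extract the content between specific elements such as paragraphs and tables.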
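Aspose.Words for Java is a powerful library that lets you create MS Word documents from scratch. It also lets you manipulate existing Word documents for encryption, conversion, text extraction, and more. We will use this library to extract text from Word DOCX or DOC documents. You can download the API's JAR or install it using the following Maven configuration.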
    <repository>
        <id>AsposeJavaAPI</id>
        <name>Aspose Java API</name>
        <url>https://repository.aspose.com/repo/</url>
    </repository>
    <dependency>
        <groupId>com.aspose</groupId>
        <artifactId>aspose-words</artifactId>
        <version>22.6</version>
        <type>pom</type>
    </dependency>
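An MS Word document consists of various elements, including paragraphs, tables, and images, so the requirements for text extraction vary from one scenario to another. For example, you may need to extract the text between paragraphs, bookmarks, comments, and so on.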
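Every type of element in a Word DOC/DOCX is represented as a node, so processing a document means working with nodes. Let's look at how to extract text from Word documents in different scenarios.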
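As a rough illustration of the node idea, the sketch below models a document as a small tree in plain Java and collects its text by walking the tree in document order. The `SimpleNode` class and its fields are hypothetical stand-ins invented for this example, not Aspose.Words types; the real library exposes a much richer node hierarchy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document node; not an Aspose.Words class.
class SimpleNode {
    final String type;   // e.g. "BODY", "PARAGRAPH", "RUN"
    final String text;   // only leaf nodes carry text in this model
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    // Appends a child and returns this node so calls can be chained.
    SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    // Depth-first, left-to-right walk: visits nodes in document order
    // and concatenates the text of every leaf.
    String collectText() {
        if (children.isEmpty()) {
            return text == null ? "" : text;
        }
        StringBuilder sb = new StringBuilder();
        for (SimpleNode child : children) {
            sb.append(child.collectText());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SimpleNode body = new SimpleNode("BODY", null)
            .add(new SimpleNode("PARAGRAPH", null)
                .add(new SimpleNode("RUN", "Hello, "))
                .add(new SimpleNode("RUN", "world.")))
            .add(new SimpleNode("PARAGRAPH", null)
                .add(new SimpleNode("RUN", " Second paragraph.")));
        System.out.println(body.collectText());
    }
}
```

The point of the model is only that text lives in leaf nodes while paragraphs and the body are containers, which is why extraction has to reason about nodes rather than raw strings.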
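In this section, we implement a Java text extractor for Word documents. The workflow is: choose the start and end nodes, extract the content between them (with or without the marker nodes themselves), and use the cloned nodes, for example to build a new Word document.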
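Now let's write a method named extractContent, to which we pass the nodes and a few other parameters to perform the text extraction. This method parses the document and clones the nodes. It takes three parameters: startNode and endNode, the nodes at which extraction starts and ends, and isInclusive, which determines whether those marker nodes are included in the result.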
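To make the role of the inclusive flag concrete, here is a minimal sketch on a flat list, assuming the document's block-level nodes can be pictured as strings and the markers as list indexes. `RangeExtractDemo` and `extractBetween` are hypothetical names for this illustration and are not part of Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: blocks are plain strings, markers are list indexes.
// Not the Aspose.Words API; it only mirrors the inclusive/exclusive idea.
class RangeExtractDemo {

    static List<String> extractBetween(List<String> blocks, int start, int end, boolean isInclusive) {
        if (start < 0 || end >= blocks.size() || start > end) {
            throw new IllegalArgumentException("markers must be in range and end must not precede start");
        }
        int from = isInclusive ? start : start + 1;
        int to = isInclusive ? end + 1 : end; // exclusive upper bound
        // Same marker on both ends and not inclusive: nothing lies in between.
        if (from > to) {
            return new ArrayList<>();
        }
        // Copy the slice so the caller's list is left untouched.
        return new ArrayList<>(blocks.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("P1", "P2", "P3", "P4", "P5");
        System.out.println(extractBetween(blocks, 1, 3, true));  // markers kept
        System.out.println(extractBetween(blocks, 1, 3, false)); // markers dropped
    }
}
```

Like the cloning done by extractContent, the sketch returns copies, so the source is not modified by the extraction.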
The following is the complete implementation of the extractContent method, which extracts the content between the nodes passed to it.
    // For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java
    import com.aspose.words.*;
    import java.util.ArrayList;

    public static ArrayList<Node> extractContent(Node startNode, Node endNode, boolean isInclusive) throws Exception {
        // First check that the nodes passed to this method are valid for use.
        verifyParameterNodes(startNode, endNode);

        // Create a list to store the extracted nodes.
        ArrayList<Node> nodes = new ArrayList<Node>();

        // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
        Node originalStartNode = startNode;
        Node originalEndNode = endNode;

        // Extract content based on block level nodes (paragraphs and tables). Traverse through parent nodes to find them.
        // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
        while (startNode.getParentNode().getNodeType() != NodeType.BODY)
            startNode = startNode.getParentNode();

        while (endNode.getParentNode().getNodeType() != NodeType.BODY)
            endNode = endNode.getParentNode();

        boolean isExtracting = true;
        boolean isStartingNode = true;
        boolean isEndingNode;
        // The current node we are extracting from the document.
        Node currNode = startNode;

        // Begin extracting content. Process all block level nodes and specifically split the first and last nodes
        // when needed so paragraph formatting is retained.
        // The method is a little more complex than a regular extractor because it also has to handle inline nodes,
        // fields, bookmarks, and so on to be really useful.
        while (isExtracting) {
            // Clone the current node and its children to obtain a copy.
            CompositeNode cloneNode = null;
            if (currNode.isComposite()) {
                cloneNode = (CompositeNode) currNode.deepClone(true);
            } else if (currNode.getNodeType() == NodeType.BOOKMARK_END) {
                // A block level BookmarkEnd is not composite, so wrap its clone in a paragraph.
                Paragraph paragraph = new Paragraph(currNode.getDocument());
                paragraph.getChildNodes().add(currNode.deepClone(true));
                cloneNode = (CompositeNode) paragraph.deepClone(true);
            }

            isEndingNode = currNode.equals(endNode);

            if (isStartingNode || isEndingNode) {
                // We need to process each marker separately so pass it off to a separate method instead.
                if (isStartingNode) {
                    processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                    isStartingNode = false;
                }

                // Conditional needs to be separate as the block level start and end markers may be the same node.
                if (isEndingNode) {
                    processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                    isExtracting = false;
                }
            } else
                // Node is not a start or end marker, simply add the copy to the list.
                nodes.add(cloneNode);

            // Move to the next node and extract it. If the next node is null, the rest of the content
            // is found in a different section.
            if (currNode.getNextSibling() == null && isExtracting) {
                // Move to the next section.
                Section nextSection = (Section) currNode.getAncestor(NodeType.SECTION).getNextSibling();
                currNode = nextSection.getBody().getFirstChild();
            } else {
                // Move to the next node in the body.
                currNode = currNode.getNextSibling();
            }
        }

        // Return the nodes between the node markers.
        return nodes;
    }

The extractContent method also relies on a few helper methods to complete the text extraction, shown below.

    /**
     * Checks the input parameters are correct and can be used. Throws an exception
     * if there is any problem.
     */
    private static void verifyParameterNodes(Node startNode, Node endNode) throws Exception {
        // The order in which these checks are done is important.
        if (startNode == null)
            throw new IllegalArgumentException("Start node cannot be null");
        if (endNode == null)
            throw new IllegalArgumentException("End node cannot be null");

        if (!startNode.getDocument().equals(endNode.getDocument()))
            throw new IllegalArgumentException("Start node and end node must belong to the same document");

        if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
            throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");

        // Check the end node is after the start node in the DOM tree.
        // First check if they are in different sections, then if they're not check
        // their position in the body of the same section they are in.
        Section startSection = (Section) startNode.getAncestor(NodeType.SECTION);
        Section endSection = (Section) endNode.getAncestor(NodeType.SECTION);

        int startIndex = startSection.getParentNode().indexOf(startSection);
        int endIndex = endSection.getParentNode().indexOf(endSection);

        if (startIndex == endIndex) {
            if (startSection.getBody().indexOf(startNode) > endSection.getBody().indexOf(endNode))
                throw new IllegalArgumentException("The end node must be after the start node in the body");
        } else if (startIndex > endIndex)
            throw new IllegalArgumentException("The section of end node must be after the section start node");
    }

    /**
     * Checks if a node passed is an inline node.
     */
    private static boolean isInline(Node node) throws Exception {
        // Test if the node is a descendant of a Paragraph or Table node and is not itself
        // a paragraph or a table. A paragraph inside a comment, which is itself a descendant
        // of a paragraph, is possible.
        return ((node.getAncestor(NodeType.PARAGRAPH) != null || node.getAncestor(NodeType.TABLE) != null)
                && !(node.getNodeType() == NodeType.PARAGRAPH || node.getNodeType() == NodeType.TABLE));
    }

    /**
     * Removes the content before or after the marker in the cloned node depending
     * on the type of marker.
     */
    private static void processMarker(CompositeNode cloneNode, ArrayList<Node> nodes, Node node, boolean isInclusive,
            boolean isStartMarker, boolean isEndMarker) throws Exception {
        // If we are dealing with a block level node just see if it should be included
        // and add it to the list.
        if (!isInline(node)) {
            // Don't add the node twice if the markers are the same node.
            if (!(isStartMarker && isEndMarker)) {
                if (isInclusive)
                    nodes.add(cloneNode);
            }
            return;
        }

        // If a marker is a FieldStart node check if it's to be included or not.
        // We assume for simplicity that the FieldStart and FieldEnd appear in the same
        // paragraph.
        if (node.getNodeType() == NodeType.FIELD_START) {
            // If the marker is a start node and is not to be included then skip to the end of
            // the field.
            // If the marker is an end node and it is to be included then move to the end
            // field so the field will not be removed.
            if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive)) {
                while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                    node = node.getNextSibling();
            }
        }

        // If either marker is part of a comment then to include the comment itself we
        // need to move the pointer forward to the Comment node found after the
        // CommentRangeEnd node.
        if (node.getNodeType() == NodeType.COMMENT_RANGE_END) {
            while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
                node = node.getNextSibling();
        }

        // Find the corresponding node in our cloned node by index and return it.
        // If the start and end node are the same some child nodes might already have
        // been removed. Subtract the difference to get the right index.
        int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

        // Child node count identical.
        if (indexDiff == 0)
            node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
        else
            node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

        // Remove the nodes up to/from the marker.
        boolean isSkip;
        boolean isProcessing = true;
        boolean isRemoving = isStartMarker;
        Node nextNode = cloneNode.getFirstChild();

        while (isProcessing && nextNode != null) {
            Node currentNode = nextNode;
            isSkip = false;

            if (currentNode.equals(node)) {
                if (isStartMarker) {
                    isProcessing = false;
                    if (isInclusive)
                        isRemoving = false;
                } else {
                    isRemoving = true;
                    if (isInclusive)
                        isSkip = true;
                }
            }

            nextNode = nextNode.getNextSibling();
            if (isRemoving && !isSkip)
                currentNode.remove();
        }

        // After processing, the composite node may have become empty. If so, don't include it.
        if (!(isStartMarker && isEndMarker)) {
            if (cloneNode.hasChildNodes())
                nodes.add(cloneNode);
        }
    }

    public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes) throws Exception {
        // Create a blank document.
        Document dstDoc = new Document();
        // Remove the first paragraph from the empty document.
        dstDoc.getFirstSection().getBody().removeAllChildren();

        // Import each node from the list into the new document. Keep the original
        // formatting of the node.
        NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

        for (Node node : nodes) {
            Node importNode = importer.importNode(node, true);
            dstDoc.getFirstSection().getBody().appendChild(importNode);
        }

        // Return the generated document.
        return dstDoc;
    }
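The two while loops near the top of extractContent climb from each marker to the ancestor that sits directly under the body. The sketch below reproduces that climb on a hypothetical `BlockNode` class with a parent pointer. It is a plain-Java model for illustration, not an Aspose.Words type, and unlike the loops above it also reports a node that has no body ancestor.

```java
// Hypothetical minimal node with a parent pointer; not an Aspose.Words type.
class BlockNode {
    final String type;
    final BlockNode parent;

    BlockNode(String type, BlockNode parent) {
        this.type = type;
        this.parent = parent;
    }

    // Climbs until the node's parent is the body, i.e. until the node is block level.
    static BlockNode blockLevelAncestor(BlockNode node) {
        BlockNode current = node;
        while (current.parent != null && !"BODY".equals(current.parent.type)) {
            current = current.parent;
        }
        if (current.parent == null) {
            throw new IllegalArgumentException("node is not inside a body");
        }
        return current;
    }

    public static void main(String[] args) {
        BlockNode body = new BlockNode("BODY", null);
        BlockNode table = new BlockNode("TABLE", body);
        BlockNode row = new BlockNode("ROW", table);
        BlockNode cell = new BlockNode("CELL", row);
        BlockNode para = new BlockNode("PARAGRAPH", cell);
        BlockNode run = new BlockNode("RUN", para);
        // A run nested in a table cell resolves to the table, not to its paragraph.
        System.out.println(blockLevelAncestor(run).type);
    }
}
```

This is why a marker located inside a table cell causes the whole table to be treated as the block-level unit.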
We are now ready to use these methods to extract text from a Word document.
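Let's see how to extract the content between two paragraphs in a Word DOCX document. The steps in Java are: load the document into a Document object, get references to the start and end paragraphs, call extractContent with those nodes, then pass the returned nodes to generateDocument and save the new document.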
The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document in Java.
    // Load document
    Document doc = new Document("TestFile.doc");

    // Gather the nodes. The GetChild method uses 0-based index
    Paragraph startPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
    Paragraph endPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);

    // Extract the content between these nodes in the document. Include these
    // markers in the extraction.
    ArrayList extractedNodes = extractContent(startPara, endPara, true);

    // Insert the content into a new separate document and save it to disk.
    Document dstDoc = generateDocument(doc, extractedNodes);
    dstDoc.save("output.doc");
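You can also extract content between different types of nodes. To demonstrate, let's extract the content between a paragraph and a table. The steps in Java are: load the document, get references to the start paragraph and the end table, and call extractContent with those nodes. In this sample the extracted nodes are then inserted back into the same document after the table, and the result is saved to a new file.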
The following code sample shows how to extract the text between a paragraph and a table in Java.
    // Load documents
    Document doc = new Document("TestFile.doc");

    // Get reference of starting paragraph
    Paragraph startPara = (Paragraph) doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
    Table endTable = (Table) doc.getLastSection().getChild(NodeType.TABLE, 0, true);

    // Extract the content between these nodes in the document. Include these markers in the extraction.
    ArrayList extractedNodes = extractContent(startPara, endTable, true);

    // Lets reverse the array to make inserting the content back into the document easier.
    Collections.reverse(extractedNodes);

    while (extractedNodes.size() > 0) {
        // Insert the last node from the reversed list
        endTable.getParentNode().insertAfter((Node) extractedNodes.get(0), endTable);
        // Remove this node from the list after insertion.
        extractedNodes.remove(0);
    }

    // Save the generated document to disk.
    doc.save("output.doc");
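The reverse-then-insert loop in the sample can look odd at first. Because every node is inserted directly after the same anchor (the table), each insert pushes the previous ones down, so reversing the list first is what keeps the original order. A small plain-Java sketch of that effect, using strings instead of nodes (`InsertAfterOrderDemo` and `insertAllAfter` are hypothetical names for this illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Models "insert each item right after a fixed anchor" on a list of strings.
class InsertAfterOrderDemo {

    static List<String> insertAllAfter(List<String> document, String anchor, List<String> items) {
        List<String> result = new ArrayList<>(document);
        List<String> pending = new ArrayList<>(items);
        Collections.reverse(pending);

        int anchorIndex = result.indexOf(anchor);
        if (anchorIndex < 0) {
            throw new IllegalArgumentException("anchor not found");
        }
        while (!pending.isEmpty()) {
            // Each insert lands right after the anchor and pushes earlier inserts down.
            result.add(anchorIndex + 1, pending.remove(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList("Intro", "TABLE", "Outro");
        List<String> extracted = Arrays.asList("x", "y", "z");
        System.out.println(insertAllAfter(document, "TABLE", extracted));
    }
}
```

Without the reversal, the same loop would leave the inserted items in the opposite order after the anchor.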
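Now let's see how to extract the content between paragraphs based on their styles. To demonstrate, we will extract the content between the first "Heading 1" and the first "Heading 3" in the Word document. The steps in Java are: load the document, collect the paragraphs that use each heading style, take the first paragraph from each list as the start and end markers, call extractContent without including the markers, then build and save a new document with generateDocument.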
The following code sample shows how to extract the content between paragraphs based on their styles.
    // Load document
    // dataDir is not defined in this article; it is assumed to hold the path of the folder containing TestFile.doc.
    Document doc = new Document(dataDir + "TestFile.doc");

    // Gather a list of the paragraphs using the respective heading styles.
    // paragraphsByStyleName is a helper that is not shown in this article.
    ArrayList parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
    ArrayList parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");

    // Use the first instance of the paragraphs with those styles.
    Node startPara1 = (Node) parasStyleHeading1.get(0);
    Node endPara1 = (Node) parasStyleHeading3.get(0);

    // Extract the content between these nodes in the document. Don't include these markers in the extraction.
    ArrayList extractedNodes = extractContent(startPara1, endPara1, false);

    // Insert the content into a new separate document and save it to disk.
    Document dstDoc = generateDocument(doc, extractedNodes);
    dstDoc.save("output.doc");
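The sample calls paragraphsByStyleName, which this article does not define. A helper of that name presumably walks every paragraph and keeps those whose style name matches. The plain-Java sketch below models only that filtering step. `StyleFilterDemo` and `Para` are hypothetical stand-ins, and against the real library the comparison would be made on each paragraph's style name (for instance via getParagraphFormat().getStyle().getName(), an assumption about how the helper is written).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StyleFilterDemo {

    // Hypothetical stand-in for a paragraph: a style name plus its text.
    static class Para {
        final String styleName;
        final String text;

        Para(String styleName, String text) {
            this.styleName = styleName;
            this.text = text;
        }
    }

    // Returns the paragraphs whose style name equals the requested one, in document order.
    static List<Para> paragraphsByStyleName(List<Para> paragraphs, String styleName) {
        List<Para> matches = new ArrayList<>();
        for (Para p : paragraphs) {
            if (styleName.equals(p.styleName)) {
                matches.add(p);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
            new Para("Heading 1", "Intro"),
            new Para("Normal", "Body text"),
            new Para("Heading 3", "Details"),
            new Para("Heading 1", "Next chapter"));

        List<Para> h1 = paragraphsByStyleName(doc, "Heading 1");
        List<Para> h3 = paragraphsByStyleName(doc, "Heading 3");
        // Markers for the extraction: the first Heading 1 and the first Heading 3.
        System.out.println(h1.get(0).text + " .. " + h3.get(0).text);
    }
}
```

If no paragraph uses a given style, the returned list is empty and get(0) throws an IndexOutOfBoundsException, so it is worth checking both lists before picking the markers.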
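That is how to extract text from Word documents in Java. If you have other questions about the product, feel free to contact us or join our official technical discussion group.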