(4) Solr index data import: PDF format - 白红宇

Published: 2019-06-19

A one-off requirement came up: indexing PDF documents (text-based, not scanned). The relevant part of schema.xml is:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" required="true"/>
  <field name="size" type="slong" indexed="true" stored="true" required="true"/>
  <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>

The part of solrconfig.xml that needs to be configured is:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">content</str>
    <str name="fmap.stream_size">size</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <!-- <str name="fmap.a">links</str> -->
    <!-- <str name="fmap.div">ignored_div</str> -->
  </lst>
</requestHandler>

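Parameter notes:

fmap.source=target : mapping rule; maps a field extracted from the PDF file (source) to a field in Solr (target)

uprefix : if set, any field that is not defined in the schema is indexed under its own name prefixed with this value

defaultField : if uprefix is not set and a field cannot be found in the schema, the field name given by defaultField is used instead

captureAttr : (true|false) capture attributes; indexes the attributes of Tika XHTML elements

literal : custom metadata, i.e. assigns a value to a field defined in the schema (for example literal.id=doc2)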
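The interplay of lowernames, fmap, uprefix and defaultField can be easier to follow as code. The sketch below is a simplified model of the rules described above, not Solr's actual implementation; the function name `resolve_field_name` and its arguments are hypothetical, and dynamic fields such as `ignored_*` are deliberately not modelled.

```python
def resolve_field_name(name, schema_fields, fmap=None,
                       uprefix="", default_field="", lowernames=False):
    """Return the Solr field name an extracted field would be stored under.

    Simplified model (an assumption, not Solr source code):
      1. lowernames: lowercase the name, replace non-alphanumerics with "_"
      2. fmap: rename the field if a mapping exists for it
      3. if the result is not in the schema: prepend uprefix if set,
         otherwise fall back to default_field if set
    """
    fmap = fmap or {}
    if lowernames:
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    name = fmap.get(name, name)
    if name in schema_fields:
        return name
    if uprefix:
        return uprefix + name
    if default_field:
        return default_field
    return name


# Values taken from the schema.xml and solrconfig.xml shown in this article.
SCHEMA_FIELDS = {"id", "content", "size"}
FMAP = {"content": "content", "stream_size": "size"}

mapped = resolve_field_name("stream_size", SCHEMA_FIELDS, FMAP,
                            uprefix="ignored_", lowernames=True)
unknown = resolve_field_name("Content-Type", SCHEMA_FIELDS, FMAP,
                             uprefix="ignored_", lowernames=True)
fallback = resolve_field_name("Author", SCHEMA_FIELDS, FMAP,
                              default_field="ignored_undefined",
                              lowernames=True)
print(mapped, unknown, fallback)
```

With the configuration above, unknown metadata ends up under names starting with `ignored_`, which the `ignored_*` dynamic field neither indexes nor stores.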

Submit a document for indexing:

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=ignored_undefined" -F "commit=true" -F "file=@t2.pdf"
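When many documents have to be submitted, it may be convenient to generate the request URL rather than type it by hand. The following is a minimal sketch that only builds the URL used in the curl command above; the helper name `build_extract_url` and its parameters are hypothetical, and the file upload itself is still left to curl or an HTTP client of your choice.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, capture_attr=True,
                      default_field=None, extra_literals=None):
    """Build the /update/extract URL for one document.

    solr_base      -- e.g. "http://localhost:8983/solr" (assumed layout)
    doc_id         -- value sent as literal.id (the uniqueKey field)
    extra_literals -- optional dict of further schema fields to set,
                      each sent as literal.<field>=<value>
    """
    params = {"literal.id": doc_id}
    for field, value in (extra_literals or {}).items():
        params["literal." + field] = value
    params["captureAttr"] = "true" if capture_attr else "false"
    if default_field:
        params["defaultField"] = default_field
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url("http://localhost:8983/solr", "doc2",
                        default_field="ignored_undefined")
print(url)
```

The resulting string can be passed to curl together with the same `-F "commit=true" -F "file=@t2.pdf"` form fields.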

Detailed reference documentation:

Note: Word documents are handled in the same way as PDFs.

Reposted from: https://www.cnblogs.com/xiazh/archive/2012/10/29/2544824.html
