Undergraduate Graduation Project (Thesis): Foreign Literature Translation

Prototype of Semantic Search Engine Using Ontology

Ahmad Maziz Esa, Shakirah Mohd Taib, Nguyen Thi Hong
Computer Information Sciences
Universiti Teknologi Petronas
Tronoh, Perak
maziz.esa@gmail.com, shakita@petronas.com.my, hongutp@gmail.com

Abstract: In this paper we discuss the fundamental problem of information retrieval on the Web. Information on the Web is not semantically categorized and stored. This research focuses on applying semantic capabilities to a search engine using ontology. By using ontology, a search engine can find keywords that are conceptually linked, rather than relying only on the similarity of the words used. This paper also provides an in-depth description of the architecture design of our
proposed modified search engine. This pap…

INTRODUCTION

The Web in its infancy consisted of static pages that allowed users to open and read their contents. There was only a one-way interaction between the users and the Web. As technology advanced, Web-enabled devices became cheaper and more ubiquitous. More and more people are able to access the Web and utilize the wealth of
information in it. This triggered a paradigm shift in Web usage and in the way people interact with the Web. Experts and laymen coined this shifting in Web interac…

The large amount of information on the Web can be retrieved using a search engine. Since Web 1.0, many search engines have been developed and commercialized. Search engines such as Google [7], AskJeeves [8], Yahoo! [9], and Lycos [10] were among those that dominated in their time. Search engines help users by indexing the information on the Web and making it easily and quickly retrievable. Early search engines were not search engines at all. Instead it was a dir…

There is no doubt that the collection of information on the Web is increasing. For now, current search engines, which rely on mathematical algorithms, are able to cope. As the collection of information becomes larger, however, it will dilute the accuracy of conventional search engines, making them less accurate and less precise. The dilution of result accuracy will be further aggravated as the collection of information grows rapidly. This work aims to tackle the problem from a differ…

Section 2 describes related work. Section 3 describes the methodology used to analyze
and develop a semantic search engine. Section 4 describes the architecture and algorithm used to provide a semantic mechanism for the search engine. Section 5 concludes the paper, and Section 6 describes future improvements to this work.

RELATED WORKS

Currently, a general-purpose “semantic” search engine has been developed. It can be accessed at www.hakia.com. However, most of the mechanisms used in the search engine are patented and focused on commercial use. As Tim Berners-Lee [6] put it, “I mention patents in passing, but they are a great stumbling block for Web development”. All the technologies used in Hakia [12] are patented and are therefore trade secrets. This prevents academic circles from studying the intricate workings of the…

Many search engines have been developed throughout the years. One of the most dominant is Google [7]. Many components work together to produce its search results. In its architecture, Google implements the PageRank algorithm to identify the relevancy of results. The PageRank algorithm is explained further below. The crawlers work 24/7, traversing hyperlinks and downloading Web content into storage. All the contents are parsed, indexed, and stored in another storage are…

The PageRank algorithm ranks Web pages based on the citation principle. The more links that refer to a particular page, the higher its score. The weight of the referring Web pages is also taken into account: if a Web page with a high weight links to a page, that page receives a higher score, and a higher score makes PageRank place the page higher in the results. PageRank counts the links from all pages equally and normalizes them. The basic formula, given by Brin [3], is:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where PR(A) is the PageRank of Web page A; T1 … Tn are the pages linking to page A; PR(T1) is the PageRank value of page T1; d is the damping factor, which can be set between 0 and 1; and C(A) is the number of links going out of page A.
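
As an illustration of the formula above, the following minimal sketch iterates PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) over a tiny hand-made link graph. The graph, the function name `pagerank`, the damping value d = 0.85 and the fixed iteration count are assumptions made for this example; they are not taken from Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}          # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the contributions of every page T that links to `page`;
            # len(targets) plays the role of C(T), the out-link count of T.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page graph: A and B both link to C, C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_links)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph page C is cited by two pages and therefore ends up ranked first, while page B, which nobody links to, keeps only the base score of 1 - d.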

Nutch is an open source search engine developed by Doug Cutting [13]. Nutch is an extension of Lucene [14], an open source information retrieval system. Most of the Lucene libraries are used in Nutch.

Most of the packages in the figure provide Nutch functionality such as indexing and searching capabilities. According to Cutting [13], Nutch consists of two main components:
a) Crawler
   • Webdb
   • Fetcher
   • Indexer
   • Segments
b) Searcher

Webdb is a persistent database that tracks pages, relevant links, last-crawled dates and other facts. In addition, Webdb stores an image of the Web graph. The fetcher, on the other hand, is what makes the crawler what it is: it crawls from one Web site to another and fetches the content back to the system. The indexer uses the content fetched by the fetcher to generate an inverted index. The inverted index
is then divided into segments, which can then be used by the searcher to display query results. Searcher comp…

Nutch leverages distributed computing to process large data sets [15]. The distributed file system it uses is Hadoop [16], which is also used by Yahoo! for its search engine system. Hadoop uses a programming model called MapReduce, which was developed by Google [7]. The model takes a set of <key, value> pairs as computation input. This input is used by the map function to parse the task and generate intermediate keys. These intermediate keys become the input for the reduce function, which merges togeth…

Liyi Zhang [17] conducted research on using ontology to improve search accuracy. The retrieval system is an e-commerce product retrieval system that uses an ontology-based adaptation of the Vector Space Model. It modifies the existing vector space model to treat documents as collections of concepts instead of collections of keywords. To determine the similarity between documents and the user query, it uses weights calculated using tf-idf (term frequency, in this case concept frequenc…

User.

Our project focuses on the development of a search engine model extension named Zenith. This extension is a plug-in for Nutch [18] that enables it to function as a semantic search engine. Integrated, Zenith and Nutch work together as a hybrid semantic search engine that can be used as a proof of concept for
our research.

METHODOLOGY

Zenith development uses a combination of reusable prototyping and component-based development, as shown in Figure 1. The development begins with a literature review.

Components that can be reused in this project are also identified. This process is called domain engineering: the process of identifying the software components that are applicable to Zenith’s development [19]. Each Zenith function is compartmentalized into components. In the component sub-phase, the reusable prototyping model is implemented. In general, the whole system is basically a reusable prototype.

This methodology suits Zenith development for several reasons. The Zenith architecture is highly modular. Components from past projects can be reused in Zenith’s development. As mentioned above, Zenith development is highly unpredictable, and this methodology accommodates that unpredictability. For instance, it allows the developer to experiment with components and methods and to test selected components as a proof of concept. The way this methodology flow allo…

Figure 1. Zenith Methodology

Aside from adapting to the development requirements of the system, this methodology increases the system’s maintainability and scalability. A highly scalable system can cater to a large number of users in accordance with the system resources it can use. Maintainability is important to keep Zenith relevant in the future. When the system is maintainable and scalable, it can easily be enhanced for reliability. Fault-tolerant capabilities can be implemented by making each component redundant…

All in all, this methodology is designed and modified specifically to suit the nature of Zenith development. Even though this model tries to capture as many development activities as possible, it does not capture all of them.

ANALYSIS OF GENERIC SEARCH ENGINE SKELETON

The generic search engine skeleton describes the backbone of a search engine. It consists of
the features and functionalities necessary for a system to be identified and to function as a search engine. The Generic Search Engine Skeleton was derived from educated, experience-based conjecture. In this research, Nutch [18] is used as the conventional search engine prototype. This prototype enables a better understanding of the mechanism and nature of a search engine. From the prototype, a semantic search engine…

A search engine consists of a few major key components. These components are divided into two sections, the front end and the back end. The main components were analyzed based on the architecture proposed by Brin [20] and analyzed by Manjula [21]. Nutch [18] is designed based on a skeleton that has back-end and front-end sections. We use this skeleton as the basis of the proposed design for the semantic search engine. A few components
are added and modifications of the skeleton components…

A. Back-end

The back end is where information gathered from the Web is retrieved and stored. The majority of the core functions and search engine capabilities depend on how the back end is designed. The back end has a Web crawler, a URL server, an indexer and the storage.

Web Crawler

A Web crawler is a script that is executed to retrieve Web pages based on the URL list stored in the URL server. To make Web crawling more effective, Web crawlers must be implemented so that many crawlers can simultaneously crawl multiple Web sites from the URL server. A threading implementation is required to enable concurrent processing. Another important functionality is that the crawler must understand the robots exclusion protocol. Webmasters that want their site to be excluded from being…

The URL server acts as the storage for URL links. A list of URLs of commonly visited sites is manually stored and becomes the starting point for the crawlers.
New URLs found by the crawler are stored in the server.

Indexer

An indexer indexes the parsed data according to its type; it “organizes” the data into categories. A document found by the crawler is parsed and indexed with a unique id, its data type, file and content. The indexer must be able to parse HTML, PDF, Word and other documents found during crawling. The data from the parsed HTML is extracted and stored in a storage area such as a database or custom data storage. Most implementations of search engine will comp…

B. Front End

The front end has only one main component, the searcher. This component acts as an intermediary between the user and the system. It provides the interface through which the user’s search keyword is obtained. The keyword is then looked up in the inverted index, which points to the links to the matching sites. Multi-threading capability is required in the searcher, as many users will conduct searches simultaneously.
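
The lookup step described above can be sketched as follows. This is a minimal illustration that assumes a toy in-memory index; the documents, the URLs and the helper names `build_inverted_index` and `search` are hypothetical and do not reflect Nutch's actual searcher.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword in the query, sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


# Hypothetical crawled pages (URL -> extracted text).
documents = {
    "http://example.com/a": "online payment by credit card",
    "http://example.com/b": "bank transfer payment",
    "http://example.com/c": "book catalogue",
}
index = build_inverted_index(documents)
hits = search(index, "payment")
```

Because `search` only reads the index, a searcher built this way could in principle serve several users from the same index at once, which is the situation the multi-threading requirement addresses.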

OVERALL ARCHITECTURE

Zenith is an expansion for Nutch [18] that enables it to be a hybrid semantic search engine. The original data flow is intercepted and modified before being rechanneled to the indexer. Figure 2 shows the overall architecture of the Zenith model.

Figure 2. Nutch with Zenith Expansion Architecture

In essence, the semantic capability is implemented by utilizing ontology. The mechanism lies in how statements are extracted and manipulated to find semantic relations. Using an external framework called Jena [22], ontology information is extracted in the form of:

“Subject Relationship Predicate”

Subject represents the class or entity that the statement is describing. It could be a bank name, a payment method, etc.

Predicate represents the object or entity that the statement is describing.

Relationship describes how subjects and predicates are related.

The semantic indexer uses the subject item as the key term when searching the documents crawled by the crawler. Once a match is found, it indexes the predicate in the index. Once all subject terms have been processed, it repeats the process using the predicate as the key term and indexes the subject instead. Currently there are two types of relationship, positive and negative. A positive relationship depicts related concepts, while a negative relationship depicts unrelated ones.
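
To make the two-pass indexing concrete, the sketch below applies it to a toy document. It is only an illustration under stated assumptions: the triples, the relationship labels `related` and `unrelated` (standing in for the positive and negative relationships) and the function `semantic_terms` are hypothetical, and in Zenith the statements would come from the ontology via Jena rather than from a hard-coded list.

```python
def semantic_terms(document_text, triples):
    """Return extra terms to index for a document.

    triples is a list of (subject, relationship, predicate) statements.
    Pass 1: if the subject occurs in the document, index the predicate.
    Pass 2: if the predicate occurs in the document, index the subject.
    Only positive ("related") statements contribute terms.
    """
    words = set(document_text.lower().split())
    extra = set()
    positive = [(s, p) for s, rel, p in triples if rel == "related"]
    for subject, predicate in positive:          # pass 1: subject as key term
        if subject.lower() in words:
            extra.add(predicate.lower())
    for subject, predicate in positive:          # pass 2: predicate as key term
        if predicate.lower() in words:
            extra.add(subject.lower())
    return extra - words                         # keep only genuinely new terms


# Hypothetical statements in "Subject Relationship Predicate" form.
triples = [
    ("visa", "related", "payment"),
    ("paypal", "related", "payment"),
    ("visa", "unrelated", "book"),
]
added = semantic_terms("pay with visa today", triples)
```

Here a page that mentions only "visa" would also be indexed under "payment", so a query for "payment" could reach it even though the word never appears in the page.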

ZENITH EXPANSION ARCHITECTURE

In the Zenith expansion, a few components work together to give Nutch [18] its semantic ability. The components in the Zenith expansion are the semantic indexer, the Jena framework, the ontology, Xerces and OntoIndex.

C. Semantic Indexer

The semantic indexer contains all the core functionality and the algorithm that orchestrates the process: getting the data from the indexer, extracting information from the ontology using Jena [22], and incorporating semantic value into the data. The data is then channeled back to the indexer and passed to the searcher, to be displayed in the search results. The algorithm, as shown in Figure 3, is:

1. Get data from Nutch
2. Extract data from ontology using Jena
3. Run inference engine on the extracted ontology
4. Compare subjects from document (index predicate if yes)
5. Compare predicate from document (index subject if yes)
   Iterate if there are still more classes
6. Re-channel data to flow

This is enabled by the design of Nutch, which implements plug-in capabilities. The Zenith expansion acts as a Nutch plug-in that is called when the data is being indexed. Developing a Nutch plug-in involves extending the IndexingFilter extension interface provided by Nutch, and the Semantic Indexer implements this interface. The SemanticIndexer.java skeleton is shown in Figure 3:

Figure 3. Semantic Indexer Skeleton

The main Semantic Indexer instantiates the Jena framework and Xerces and then extracts data from the ontology and ontoIndex accordingly. These data are stored in memory for subsequent manipulation. As mentioned above, the Semantic Indexer performs its tasks by calling other components, which are explained in the next sections.
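
The plug-in arrangement described above can be pictured with the following sketch, in which an indexer calls every registered filter just before a document is written to the index. It is written in Python purely for illustration and does not use Nutch's real Java API; the class names `Indexer` and `SemanticFilter`, the `filter` method and the sample term mapping are all assumptions.

```python
class SemanticFilter:
    """Hypothetical filter that adds ontology-related terms to a document."""

    def __init__(self, related_terms):
        # related_terms maps a term to the terms it is semantically linked to.
        self.related_terms = related_terms

    def filter(self, doc):
        extra = set()
        for term in doc["terms"]:
            extra.update(self.related_terms.get(term, ()))
        # Return a new document so the caller's original data is not mutated.
        return {**doc, "terms": sorted(set(doc["terms"]) | extra)}


class Indexer:
    """Toy indexer that passes each document through its plug-in filters."""

    def __init__(self):
        self.filters = []
        self.index = {}

    def register(self, plugin):
        self.filters.append(plugin)

    def add(self, doc):
        for plugin in self.filters:      # intercept and modify the data flow
            doc = plugin.filter(doc)
        for term in doc["terms"]:        # re-channel the result to the index
            self.index.setdefault(term, []).append(doc["url"])


indexer = Indexer()
indexer.register(SemanticFilter({"visa": ["payment"]}))
indexer.add({"url": "http://example.com/a", "terms": ["visa", "card"]})
```

The point of the design is that the indexer itself needs no knowledge of ontologies: removing the registered filter restores the conventional behaviour.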

D. Jena Framework

The Jena framework is an open source Semantic Web framework [22]. It enables the Semantic Indexer to extract data from an .owl file (an ontology) for query and manipulation. It also provides the semantic indexer with a reasoning engine that infers over the knowledge contained in the ontology by adding rules to it. This establishes logical rules based on the relationships in the ontology, and hence helps the search engine to better define the semantic relationships between concepts.

E. Ontology

The ontology is the component from which the semantics are derived. It acts as a “brain”, the central location from which the semantics or “knowledge” come [23]. The broader the subject scope of the ontology, the more terms the search engine can derive to index semantically.
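
The effect of a reasoning step on the number of indexable terms can be illustrated with a small forward-chaining sketch. It does not use Jena; the single transitivity rule, the `is_a` relationship and the sample statements are assumptions chosen only to show how inferred statements add to what an ontology states explicitly.

```python
def infer_transitive(triples, relationship="is_a"):
    """Apply the rule (a r b) and (b r c) => (a r c) until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(facts):
            for b2, r2, c in list(facts):
                if r1 == r2 == relationship and b == b2:
                    new_fact = (a, relationship, c)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


# Hypothetical e-commerce statements.
stated = {
    ("visa", "is_a", "credit_card"),
    ("credit_card", "is_a", "payment_method"),
}
inferred = infer_transitive(stated) - stated
```

With the two stated facts, the rule derives one further statement linking "visa" to "payment_method", which is an extra semantic link the indexer could use.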

F. Xerces

Xerces is an open source library that enables the Semantic Indexer to extract data from .xml files [24]. The library is used to extract data from ontoIndex, which is used in retrieving semantic information from the ontology before indexing it in the index.

G. OntoIndex

The OntoIndex is an index file that acts as a point of reference for the Semantic Indexer when iterating through the ontology data. The ontology index stores information in tags with a name and a value. The value tag contains multiple values separated by “|” (without the quotes). The ontoIndex contains information such as the class names and relationships used.

PERFORMANCE EVALUATION

H. Methodology

This section highlights the test conducted to see the effectiveness of the semantic mechanism implemented in Nutch. A test dataset was developed using HTML files with hyperlinks. The search engine and the test data site are on the same machine, served by Tomcat and
Apache servers as in Figure 4. We developed an ontology of e-commerce as our subject scope for the testing.

Figure 4. Test Architecture

The hierarchy of the test data is shown in Figure 5.

Figure 5. Hierarchies of Test Data

The ontology is built based on Methontology (Fernandez) [25?]. The methodology includes steps and activities carried out in several cycles. Each cycle includes three main types of activities: management, technical, and support activities. Once the ontology is drafted, it is tested using the search engine. The ontology is then modified accordingly to best support the search engine’s efficiency. Management work is about planning of the objectives and its users. Technical activities include s…

I. Results

Two indexing runs were conducted, with the semantic indexer enabled and disabled. Table 1 shows the results of testing with the conventional model and the semantic-enabled model. The result shown is a comparison of data
indexed by the search engine.

Table 1. Difference in amount of data indexed when enabled and disabled

The results in Table 1 show an increase in disk space of 0.06% and in the number of terms of 0.03%. This shows that more data is indexed when the semantic capability is enabled than when it is disabled.

Table 2 shows the search result comparison using sample search keywords.

Table 2. Search Results Comparison

As a model, the ontology used for the search engine does not reflect real-world data and naming conventions. The keywords used can be replaced with bank names or book or movie titles to reflect real-world entities and thus real-world semantic relationships. Even though it does not reflect the real world, as a proof of concept it is adequate to show that a semantic search engine is possible. Based on Table 2, the semantic search engine is capable of returning information that a conventional search engine cant b…

J. Issues

Limited Amount of Data

Due to limited resources, only a small test case with a limited amount of data could be conducted. This could not exhibit the full potential of Zenith, or the full extent of its problems when handling large amounts of data.

Buggy Scoring System

The scoring system helps a search engine to organize search results by rank, based on the importance of the documents or sites indexed. The implementation of Zenith disrupted the scoring system in Nutch. Although the results display sites or documents that contain instances of the keyword or are semantically related to it, they are not displayed according to rank of importance. To resolve this, the scoring system may be modified.

FUTURE IMPROVEMENT

K. Artificial Intelligence

In the realm of Artificial Intelligence there are technologies such as fuzzy logic, neural networks and genetic algorithms. Fuzzy logic can be used in the Nutch searcher. Most words in English are ambiguous, with many meanings depending on context. Together with fuzzy logic, genetic algorithms can be used to determine the accuracy of the keyword and the terms stored in the documents in the Nutch index. It…