ht://Dig installation manual

Authors: Ferruca <fernando@nediam.com.mx> and Nediam <javier@nediam.com.mx>
Publication date: 2006-01-16


This manual covers the installation of htdig and its basic and customized configuration. Everything explained here is in terms of our own system, which runs Debian GNU/Linux, but it should work just as well on other systems. For further reference, visit the ht://Dig website.

Roughly, htdig works like this: first, databases are created from the text of the site being indexed, using the binaries htdig, htmerge, and htfuzzy. Then searches can be performed on the site. The search is carried out by a CGI program called htsearch, which looks up the given word(s) in the previously created databases and displays the results in the format specified by the templates footer.html, header.html, long.html, short.html, nomatch.html, syntax.html, and wrapper.html.

  1. Download from www.htdig.org the package: htdig-x.x.x.tar.gz

    Note: At the time of this writing, the most recent version of ht://Dig is 3.1.6.

  2. Copy the package to the directory /usr/src/, extract and decompress it:
    SERVER:~# cp htdig-3.1.6.tar.gz /usr/src/
    SERVER:~# cd /usr/src/
    SERVER:/usr/src# tar -zxvf htdig-3.1.6.tar.gz

  3. Create the Makefile:
    SERVER:/usr/src# cd htdig-3.1.6
    SERVER:/usr/src/htdig-3.1.6# ./configure

    Tip: If the configure script fails because it cannot find the libstdc++ library, and we are certain it is installed, we run:
    SERVER:/usr/src/htdig-3.1.6# CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure


  4. Once the configure script has been successfully executed, we need to edit the file CONFIG. In this file we will configure the paths that will be used when installing htdig (configuration files, binary files, cgi files, databases, etc.):
    SERVER:/usr/src/htdig-3.1.6# vim CONFIG

    EDITING THE CONFIG FILE
    • prefix= Root directory where we want to install htdig.
      Example: prefix=/usr/local/htdig
    • exec_prefix= Root directory where the programs installed by htdig will be located.
      Example: exec_prefix= ${prefix}
    • BIN_DIR= Directory where executable binaries will be located.
      Example: BIN_DIR= ${exec_prefix}/bin
    • CONFIG_DIR= Directory where config files will be located.
      Example: CONFIG_DIR= ${prefix}/conf
    • COMMON_DIR= Directory for the files shared by the different databases, for example header.html, footer.html, and nomatch.html, which the CGI program reads to produce the header, the footer, and the page shown when there are no matches. These files can be modified to suit our needs.
      Example: COMMON_DIR= ${prefix}/common
    • DATABASE_DIR= Directory where the databases will be (databases of what we have indexed in our site).
      Example: DATABASE_DIR= /var/htdig/db
    • DEFAULT_CONFIG_FILE= Name of the main configuration file for htdig.
      Example: DEFAULT_CONFIG_FILE= ${CONFIG_DIR}/htdig.conf
    • CGIBIN_DIR= Directory where we keep our CGI programs, or where we want the htdig CGI program (htsearch) to be installed. Make sure the web server is configured to run CGI programs from the directory specified here.
      Example: CGIBIN_DIR= /var/www/cgi-bin
    • IMAGE_DIR= Directory where the images that htdig uses by default will be located.
      Example: IMAGE_DIR= /var/www/html/images/htdig
    • IMAGE_URL_PREFIX= URL prefix written into header.html and footer.html, telling the browser where to fetch the images from. It is the URL that corresponds to IMAGE_DIR.
      Example: IMAGE_URL_PREFIX= /images/htdig
    • SEARCH_DIR= Directory where the sample file containing the search form will be located.
      Example: SEARCH_DIR= /var/www/html/
    Tip: You can skip editing the CONFIG file and pass the options directly to configure instead. Example: ./configure --prefix=/usr/local/htdig --with-cgi-bin-dir=/usr/local/apache/cgi-bin/ --with-image-dir=/usr/local/apache/htdocs/htdig/images --with-image-url-prefix=/htdig/images
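
    The relationship between the CONFIG paths and the configure options in the tip can be sketched as a small shell snippet. This is only an illustration: the variable names are our own, the paths are the example values from step 4, and the URL prefix /images/htdig assumes that the web root is /var/www/html. The command is printed, not executed.

```shell
#!/bin/sh
# Sketch: build the ./configure command line from the example paths of step 4.
# Variable names are illustrative; only the four options quoted in the tip are used.
PREFIX=/usr/local/htdig
CGIBIN_DIR=/var/www/cgi-bin
IMAGE_DIR=/var/www/html/images/htdig
IMAGE_URL_PREFIX=/images/htdig   # assumption: web root is /var/www/html

configure_cmd() {
    printf './configure --prefix=%s --with-cgi-bin-dir=%s --with-image-dir=%s --with-image-url-prefix=%s\n' \
        "$PREFIX" "$CGIBIN_DIR" "$IMAGE_DIR" "$IMAGE_URL_PREFIX"
}

# Print the command for review; run it by hand from the source directory
# (/usr/src/htdig-3.1.6 in this manual) once the paths look right.
configure_cmd
```

    Printing the command first makes it easy to compare the paths with the web server layout before anything is compiled.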

  5. Once you have finished editing the CONFIG file you are ready to compile and copy files, and thus conclude the installation.
    SERVER:/usr/src/htdig-3.1.6# make
    SERVER:/usr/src/htdig-3.1.6# make install


  6. Now we have installed htdig, but still need to make some adjustments in the main configuration file:
    SERVER:/usr/src/htdig-3.1.6# cd /usr/local/htdig/conf/
    SERVER:/usr/local/htdig/conf# vim htdig.conf

    EDITING THE htdig.conf FILE
    • This file can be very long, with many configurable attributes. The full list is available at http://www.htdig.org/confindex.html. Here we focus on the ones we had to modify for our website, which we think are enough to customize a site.

    • start_url: The address where we want htdig to start indexing, normally the main index of our site.
      Example: start_url: http://www.nediam.com.mx/index.php
    • limit_urls_to: Specifies the scope of the indexer. For example, we can limit it to our own site, so that when htdig finds a link to another site it does not try to index it. To allow more than one site, separate the sites with one or more spaces. To have no limit at all, simply comment this line out (not recommended).
      Example: limit_urls_to: http://www.nediam.com.mx  http://gacs.sourceforge.net
    • search_algorithm: Search algorithms that the CGI program will use. Each numeric value is a factor by which the weight of a matched search word is multiplied. We added the accents algorithm because our site is in Mexican Spanish.
      Example: search_algorithm: exact:1 synonyms:0.5 endings:0.1  accents:1
    • page_number_text: Images or numbers we want to appear as links to the different result pages. In our case we only used numbers (html code can be used).
      Example: page_number_text: 1 2 3 4 5 6 7 8 9 10
    • no_page_number_text: Images or numbers we want to appear for the page we are currently on. In our case we simply made the numbers bold (html code can be used).
      Example: no_page_number_text: <b>1</b> <b>2</b>\
      				<b>3</b> <b>4</b>\
      				<b>5</b> <b>6</b>\
      				<b>7</b> <b>8</b>\
      				<b>9</b> <b>10</b>
    • locale: Sets the locale (language) that htdig will use. The locale must already be installed in the operating system.
      Example: locale: es_MX
    • page_list_header: (text or html code) Header text for the list of result pages. By default it says "Pages". We changed it to "Páginas" for Spanish.
      Example: page_list_header: <hr noshade size=1>Páginas:<br />
    • allow_numbers: By default, searches for numbers return no results. To index and search numbers, set this option to true.
      Example: allow_numbers: true
    • For further reference we have made available our configuration file. You can see it by clicking here.
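
    As a compact recap, the attributes discussed in this step can be collected into a single fragment. The sketch below writes the example values from this manual to a temporary file (so nothing on the system is changed) and lists the attribute names it sets. The function name attribute_names is our own, and the page number attributes are left out for brevity.

```shell
#!/bin/sh
# Sketch: collect the attributes discussed above into one htdig.conf fragment.
# The fragment goes to a temporary file; values are the examples from this manual.
FRAGMENT=$(mktemp)

# The quoted 'EOF' keeps every character of the fragment literal.
cat > "$FRAGMENT" <<'EOF'
start_url:          http://www.nediam.com.mx/index.php
limit_urls_to:      http://www.nediam.com.mx http://gacs.sourceforge.net
search_algorithm:   exact:1 synonyms:0.5 endings:0.1 accents:1
locale:             es_MX
allow_numbers:      true
page_list_header:   <hr noshade size=1>Páginas:<br />
EOF

# List the attribute names that the fragment sets, one per line.
attribute_names() {
    sed -n 's/^\([a-z_]*\):.*/\1/p' "$1"
}
attribute_names "$FRAGMENT"
```

    After reviewing the fragment, the lines can be merged by hand into /usr/local/htdig/conf/htdig.conf.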


  7. When it is correctly configured, execute the command rundig, which indexes the site, puts together the indexed words and their matches, and puts the indexed data into the databases:
    SERVER:/usr/local/htdig/conf# cd /usr/local/htdig/bin/
    SERVER:/usr/local/htdig/bin# ./rundig

    Tip: You can see in detail what rundig is doing if you run it with the option -v, and with even more detail using the option -vv.

  8. We have created the databases that will be used by htsearch but in the configuration file we added a search algorithm (accents), which needs to have its own database created. This is how we do it:
    SERVER:/usr/local/htdig/bin# ./htfuzzy accents

    Tip: As with rundig, we can see in detail what htfuzzy is doing by running it with the option -v, or in even more detail with -vv.
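
    Steps 7 and 8 always go together, so they can be wrapped in one small script. The sketch below is an assumption-laden illustration: HTDIG_BIN is the bin directory of this manual's example installation, and DRY_RUN and the function names are our own. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: rebuild the ht://Dig databases in the order used in steps 7 and 8.
# DRY_RUN=1 only prints the commands instead of running them.
HTDIG_BIN=${HTDIG_BIN:-/usr/local/htdig/bin}
DRY_RUN=${DRY_RUN:-1}

run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reindex() {
    # rundig must finish successfully before the accents database is built.
    run_step "$HTDIG_BIN/rundig" -v &&
    run_step "$HTDIG_BIN/htfuzzy" -v accents
}

reindex
```

    Setting DRY_RUN=0 in the environment makes the same script run the real commands, for example after the site content has changed.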

  9. Once the databases have been created we are ready to search from a browser. We can try it with the search form that htdig created during installation (which, following the example, ends up in /var/www/html/). The file is named search.html. Open a browser and load it:
    Example: http://www.nediam.com.mx/search.html
    Note: It is very important to make sure that the paths to CGI programs in the Apache configuration file are correctly configured (ScriptAlias). Otherwise the search engine will not work yet.
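
    The ScriptAlias check from the note can be sketched with grep. The sample configuration file below is a stand-in created in a temporary location so the check can be tried safely; on a real server, pass the path of the actual Apache configuration file. The function name has_script_alias is our own, and the directory is inserted into the pattern as-is, so unusual characters in the path would need escaping.

```shell
#!/bin/sh
# Sketch: check that an Apache configuration file has a ScriptAlias pointing at
# the directory where htsearch was installed (CGIBIN_DIR in step 4).
SAMPLE_CONF=$(mktemp)
cat > "$SAMPLE_CONF" <<'EOF'
# ScriptAlias /old-cgi/ /srv/old/
ScriptAlias /cgi-bin/ /var/www/cgi-bin/
EOF

# has_script_alias CONF_FILE CGI_DIR
# Succeeds if an uncommented ScriptAlias line points at CGI_DIR.
has_script_alias() {
    grep -v '^[[:space:]]*#' "$1" |
        grep -E "^[[:space:]]*ScriptAlias[[:space:]]+[^[:space:]]+[[:space:]]+\"?$2/?\"?[[:space:]]*$" >/dev/null
}

if has_script_alias "$SAMPLE_CONF" /var/www/cgi-bin; then
    echo "ScriptAlias found for /var/www/cgi-bin"
else
    echo "no ScriptAlias for /var/www/cgi-bin"
fi
```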

  10. To customize our search engine further we can also edit the files header.html, footer.html, and nomatch.html, which in our installation are found in the directory /usr/local/htdig/common/.


    TO INDEX PDF FILES
  11. By default htdig will only index .html files, but external applications can be used to convert other file types to HTML. Here we show how to do this for .pdf files.

    • It is necessary to download from the web page http://www.foolabs.com/xpdf/download.html the XPDF package, which is used to get the information from the pdf file and convert it to text.

    • Install the package:
      SERVER:~# cp xpdf-3.01.tar.gz /usr/local/
      SERVER:~# cd /usr/local/
      SERVER:/usr/local# tar -zxvf xpdf-3.01.tar.gz
      SERVER:/usr/local# cd xpdf-3.01
      SERVER:/usr/local/xpdf-3.01# ./configure
      SERVER:/usr/local/xpdf-3.01# make
      SERVER:/usr/local/xpdf-3.01# make install

      This package will install various utilities but the ones we will be using are: pdfinfo and pdftotext.

    • The perl script pdf2html.pl converts the text produced by pdftotext to html format. The script is located in contrib/doc2html under the htdig source directory (in this example, /usr/src/htdig-3.1.6/). Click here to take a look at it.

    • Copy the file to the directory /usr/local/bin and edit it:
      SERVER:/usr/local/xpdf-3.01# cd /usr/src/htdig-3.1.6/contrib/doc2html/
      SERVER:/usr/src/htdig-3.1.6/contrib/doc2html# cp pdf2html.pl /usr/local/bin/
      SERVER:/usr/src/htdig-3.1.6/contrib/doc2html# cd /usr/local/bin/
      SERVER:/usr/local/bin# vim pdf2html.pl

      Look for the line that says: my $PDFTOTEXT=, and in that line write the path of the pdftotext binary (previously installed).
      Example: my $PDFTOTEXT= "/usr/local/bin/pdftotext";
      In the line that says: my $PDFINFO=, write the path of the pdfinfo binary that was previously installed.
      Example: my $PDFINFO= "/usr/local/bin/pdfinfo";

    • Now we will need to edit the htdig configuration file one more time:
      SERVER:/usr/local/bin# cd /usr/local/htdig/conf/
      SERVER:/usr/local/htdig/conf# vim htdig.conf

      At the end of the document add a line as follows:
      external_parsers: application/pdf->text/html /usr/local/bin/pdf2html.pl

    • Refresh the htdig databases:
      SERVER:/usr/local/htdig/conf# cd ../bin/
      SERVER:/usr/local/htdig/bin# ./rundig
      SERVER:/usr/local/htdig/bin# ./htfuzzy accents



    • Now htdig will also index PDF files.

      Tip1: If you have large pdf files (over 100 KB), the variable max_doc_size in the htdig configuration file has to be raised. By default this variable allows a maximum of about 100 KB, and PDFs are usually much larger.

      Tip2: For the titles of the pdf documents to appear correctly on the result page, the title and description in the file properties must be accurate. For example, with Adobe Acrobat Writer this is done through file->file properties->description.
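
      To choose a value for max_doc_size (Tip1), it helps to know how large the biggest PDF actually is. The sketch below reports the largest .pdf size in bytes under a directory. DOC_ROOT is a temporary directory with two dummy files so the snippet can be tried anywhere; on a real site it would be the web document root. The function name largest_pdf_bytes is our own.

```shell
#!/bin/sh
# Sketch: report the size in bytes of the largest .pdf under a directory,
# as a starting point for max_doc_size.
DOC_ROOT=$(mktemp -d)
printf '%s' 'small' > "$DOC_ROOT/a.pdf"                 # 5 bytes
printf '%s' 'a larger dummy file' > "$DOC_ROOT/b.pdf"   # 19 bytes

largest_pdf_bytes() {
    # wc -c prints "<bytes> <name>" for each file; awk keeps the maximum.
    find "$1" -type f -name '*.pdf' -exec wc -c {} \; |
        awk '{ if ($1 > max) max = $1 } END { print max + 0 }'
}

echo "largest pdf: $(largest_pdf_bytes "$DOC_ROOT") bytes"
```

      A max_doc_size somewhat above the reported number leaves room for documents that grow later.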


    FORMATTING THE HTSEARCH RESULTS WITH PHP
  12. Recently we ran into a problem with the new design of the site we work for. We needed to display the htdig search results in a page with a special format (a table) that also interacted with php session variables. This could not be done with htdig's ordinary behavior, so we did some research and found a solution: create a php class that captures the results produced by htsearch and manipulates them so they can be displayed as needed. Here is the whole procedure, together with the necessary scripts.

    • Edit the htdig configuration file to change the results template that htsearch uses. This template defines the format in which htsearch delivers its results. These settings are commented out by default in the configuration file, so htdig uses its built-in default; first we need to uncomment the lines and then edit them:
      SERVER:/usr/local/htdig/bin# cd ../conf/
      SERVER:/usr/local/htdig/conf# vim htdig.conf

      Uncomment first:
      # template_map: Long long ${common_dir}/long.html \
      #               Short short ${common_dir}/short.html
      # template_name: long
      
      Edit and change:
      template_map: Long long /usr/local/htdig/common/long.html
      template_name: long
      

    • Edit the files footer.html, header.html, long.html, nomatch.html, and syntax.html, all found in the directory /usr/local/htdig/common/, and change their content to this:

      First the file: footer.html:
      $(PAGEHEADER)
      $(PREVPAGE)
      $(PAGELIST)
      $(NEXTPAGE)
      
      header.html:
      $(MATCHES)
      $(FIRSTDISPLAYED)
      $(LASTDISPLAYED)
      $(LOGICAL_WORDS)
      
      long.html:
      $(TITLE)
      $(URL)
      $(STARSLEFT)
      $(PERCENT)
      $(EXCERPT)
      $(SCORE)
      
      nomatch.html:
      NOMATCH
      
      syntax.html:
      SYNTAXERROR $(SYNTAXERROR)
      
      Note: It is very important that these files have this content and exactly in that order (with a line break between each variable) so that the php class will function correctly.
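
      Because the exact content and order matter, the five files can also be generated rather than typed. This is a minimal sketch: COMMON_DIR is a temporary directory here so nothing is overwritten, while in this manual's installation the real directory is /usr/local/htdig/common.

```shell
#!/bin/sh
# Sketch: write the five templates with exactly the content listed above.
COMMON_DIR=$(mktemp -d)

# printf '%s\n' prints each argument on its own line; the single quotes keep
# $(...) literal so the shell does not try to run it as a command.
printf '%s\n' '$(PAGEHEADER)' '$(PREVPAGE)' '$(PAGELIST)' '$(NEXTPAGE)' > "$COMMON_DIR/footer.html"
printf '%s\n' '$(MATCHES)' '$(FIRSTDISPLAYED)' '$(LASTDISPLAYED)' '$(LOGICAL_WORDS)' > "$COMMON_DIR/header.html"
printf '%s\n' '$(TITLE)' '$(URL)' '$(STARSLEFT)' '$(PERCENT)' '$(EXCERPT)' '$(SCORE)' > "$COMMON_DIR/long.html"
printf '%s\n' 'NOMATCH' > "$COMMON_DIR/nomatch.html"
printf '%s\n' 'SYNTAXERROR $(SYNTAXERROR)' > "$COMMON_DIR/syntax.html"

# Show the line count of each file as a quick check.
wc -l "$COMMON_DIR"/*.html
```

      Back up the original templates before copying the generated files over them.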

    • Create the php class with the following code and save it with the name class_htdig.php in the directory where we keep the function files or in some directory where other files can also access it.

      Check the code here.

      Note: Review the file class_htdig.php. Some parts of it contain filesystem paths. Check carefully that they point to our files (ours may be somewhere else).
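
      To see why the order of the template variables matters, here is the parsing idea in miniature. This is not the php class, only a shell and awk sketch of the same approach under stated assumptions: SAMPLE is hypothetical data standing in for the body of the htsearch output (4 header lines, 6 lines per match, then footer lines), every value is assumed to fit on one line, and the function name parse_results is our own.

```shell
#!/bin/sh
# Sketch: read results laid out by the templates above, relying only on line order.
# SAMPLE is made-up data, not real htsearch output.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
2
1
2
htdig
First page
http://www.example.com/a.html
4
100
An excerpt from the first page
1200
Second page
http://www.example.com/b.html
2
50
An excerpt from the second page
600
Pages:
prev
1
next
EOF

# parse_results FILE: print "title | url | percent%" for every displayed match.
parse_results() {
    awk '
        NR == 1 { matches = $0; next }      # header.html: $(MATCHES)
        NR == 2 { first = $0; next }        # $(FIRSTDISPLAYED)
        NR == 3 { last = $0; next }         # $(LASTDISPLAYED)
        NR == 4 { words = $0; next }        # $(LOGICAL_WORDS)
        NR > 4 + 6 * (last - first + 1) { next }   # footer.html lines
        {
            field = (NR - 5) % 6            # position inside a long.html record
            if (field == 0) title = $0      # $(TITLE)
            if (field == 1) url = $0        # $(URL)
            if (field == 3) percent = $0    # $(PERCENT)
            if (field == 5) print title " | " url " | " percent "%"
        }
        END { print "matches: " matches ", words: " words }
    ' "$1"
}

parse_results "$SAMPLE"
```

      If a template gains or loses a line, the record length of 6 no longer matches and every field after it shifts, which is why the templates must be kept exactly as listed.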

    • Lastly, create your file for performing searches. In this example we will call it buscador.php:

      Check the code here.

      Note: Review the file buscador.php. Some parts of it contain filesystem paths. Check carefully that they point to our files (ours may be somewhere else).

    • The php class for htdig, as well as this part of the tutorial, are based on the page http://www.computerengineering.ca/a_way_to_use_htdig_with_php/. We made quite a few changes to the content of that page and to its php code, but our material is based on it. See that site for further reference.


    USING HTDIG WITH THE HTTPS PROTOCOL
  13. By default htdig does not support websites served over the secure http protocol (https). According to the official htdig page, this is because for a long time distributing the required encryption code was legally restricted in the United States.

    There is a patch for the htdig 3.1.x source code (keep in mind that this tutorial uses htdig version 3.1.6) that uses the OpenSSL library to support https. The patch can be downloaded from: ftp://ftp.ccsf.org/htdig-patches/3.1.6/.

    For htdig to have https support we must compile and install it again (it is important to back up our existing configuration files first):

    • We must be sure that the OpenSSL package is installed, as well as the library libssl-dev. These can be downloaded directly from the official OpenSSL website http://www.openssl.org.

    • Download the patch from the page mentioned above and put it in the htdig source directory (/usr/src/htdig-3.1.6).

      Check the patch code here.

    • Apply the patch:
      SERVER:/usr/src/htdig-3.1.6# patch -p1 -l < ssl.12
      patching file Makefile.config.in
      patching file htcommon/DocumentDB.cc
      patching file htcommon/defaults.cc
      patching file htdig/Document.cc
      patching file htdig/Images.cc
      patching file htdig/Retriever.cc
      patching file htdig/Server.cc
      patching file htdig/Server.h
      patching file htdig/htdig.cc
      patching file htlib/Connection.cc
      patching file htlib/Connection.h
      patching file htlib/URL.cc
      patching file htlib/URL.h
      

    • If no error occurred, we can continue with the htdig installation, following the steps described earlier in this tutorial.

    • Once the installation is finished and the configuration files have been restored (the ones we backed up before re-installing htdig), we edit the file htdig.conf and add the https addresses to limit_urls_to.
      Example: limit_urls_to: http://www.nediam.com.mx  http://gacs.sourceforge.net https://www.nediam.com.mx \
      http://www.nediam.com.mx:443
      Note: It is important to put both forms of the protocol (https and http:443) because htdig follows absolute links by https and relative links by http:443.
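
      The two forms required by the note can be produced from the host name, which avoids typing mistakes when several https sites are indexed. This is a small sketch; the function name https_url_forms is our own.

```shell
#!/bin/sh
# Sketch: print the two URL forms for one https host, following the note above
# (the https:// form plus the http://host:443 form).
https_url_forms() {
    printf 'https://%s http://%s:443\n' "$1" "$1"
}

echo "limit_urls_to: http://www.nediam.com.mx $(https_url_forms www.nediam.com.mx)"
```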

The latest version of this document is available at: http://nediam.com.mx/en/docs/htdig_manual/index.php