phpdig doc en


PhpDig : Documentation h1 { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 18pt; font-weight: normal} h2 { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 14pt; font-weight: normal} body { font-family: Verdana, Arial, Helvetica, sans-serif; color: #000000; font-size: 10pt; background-color: #FFFFFF;} a { font-family: Verdana, Arial, Helvetica, sans-serif; color: #004488; text-decoration: none; font-size: 10pt} a:hover { font-family: Verdana, Arial, Helvetica, sans-serif; color: #FFFFFF; background-color: #004488; font-size: 10pt} a.high { font-family: Verdana, Arial, Helvetica, sans-serif; color: #000000; background-color: #FFCC55; text-decoration: none; font-size: 10pt; font-weight: bold;} a.high:hover { font-family: Verdana, Arial, Helvetica, sans-serif; color: #FFFFDD; background-color: #003060; font-size: 10pt;} .grey {padding:5px; margin-top:8px; margin-bottom:18px; margin-right:32px; border:2px dotted #BBBBBB; background-color:#EEEEEE;} .blue {background-color:#CCDDFF; margin-top:8px; margin-bottom:8px; margin-right:32px; padding:3px; border-top:1px solid #7688A7; border-bottom:1px solid #7688A7; } PhpDig version 1.6.2 Documentation Last update : 2003-04-06 Note : Spelling corrections by Brien Louque and John Zastrow. 1. Table of contents 1. Table of contents2. Where to find the lastest PhpDig version ?3. PhpDig Features4. Installation5. Configuration6. Update PhpDig7. Indexing with web interface8. Indexing by command line interface9. Templates10. Insert PhpDig in a website11. Inside PhpDig12. Getting help with PhpDig 2. Where to find the lastest PhpDig version ? At address : http://phpdig.toiletoine.net/ 3. PhpDig Features 3.1. HTTP Spidering PhpDig follows links like any web browser, to build the pages list to index. Links can be in an AreaMap or frames. PhpDig supports relocations. Any syntax of HREF attribute is followed by Phpdig. Simple javascript links like window.open() or window.location() are followed too. PhpDig does not go out the root site you define for the indexing (but see Domain Indexing option). Spidering depth is chosen by user. All html content is listed, both static and dynamic pages. PhpDig searches the Mime-Type of the document. 3.2. Full Text Indexing PhpDig indexes all words of a document, excepting small words (less than 3 letters) an common words, those are definded in a text file. Lone numbers are not indexed, but those included in words are indexed. Words with underscores are included as well. A count of word occurances in each document is also stored. Words in the title can have a more important weight in ranking results. 3.3. File types wich can be indexed PhpDig indexes HTML and text files by itself. PhpDig could index PDF, MS-Word and MS-Excel files if you install external binaries on the spidering machines to this purpose. PhpDig is configured using catdoc, xls2csv and pstotext programs. You could find catdoc and xls2csv at this url : http://www.45.free.net/~vitus/ice/catdoc/. Choose the 0.91.5 version. The "stable" version have trouble with some encodings and does not include xls2csv program. You could find pstotext at this url : http://research.compaq.com/SRC/virtualpaper/pstotext.html. The author does not offer support on those tools. Contact the authors of those if you have trouble in compiling and/or installing them. Of course, you can use another tools to do the job. Choice is yours. Output format must be plaintext. 3.4. Other features PhpDig Tries to read a robots.txt file at the server root. It searches meta robots tags too. PhpDig can spider sites served on another port than the default 80. Password protected sites can be indexed giving to the robot an username and valid password. Be Careful ! This feature could permit to an unauthorized user reading protected informations. We recommend to create a specific instance of PhpDig, protected by the same credentials than the restricted site. You have to create a special account for the robot too. The Last-Modified header value is stored in the database to avoid redundant indexing. Also the <META> revisit-after tag. If desired, the engine can store textual content of indexed documents. In this case, relevant extracts of found pages are displayed in the results page with highlighted search keys. 3.5. Display templates A simple template system adapts search and results page to an existing site look. Making a template consists only in insert few xml-like tags in an html page. 3.6. Limits PhpDig can't perform an exact expression search. Because of the time consuming indexing process, the Apache/php web server which performs the process must not be safe_mode configured. This limit can be turn : - Using distant indexing with MySql TCP connexion and FTP connexion ; - Launching indexing process in a shell command. This can be made by a cron task. Spidering and indexing is a bit slow. On the other hand, search queries are fast, even in an extended content. 4. Installation 4.1. Prerequisites PhpDig requires a Web server (Apache is my preference) with Php (module or cgi), and a MySql database server. The following configurations were tested : Php/4.1.1, Apache/1.3.20 (Win32), Windows 2000 ; Php/4.1.2, Apache/1.3.23 (Unix) mod_ssl/2.8.7, Linux kernel 2.4.3 ; Php/4.3.0, Apache/2.0.44 (Unix) OpenSSL/0.9.6g, Linux kernel 2.4.18; Php/4.3.1, Apache/2.0.44 (Win32), Windows 2000 . 4.2. Scripts installation Unzip the archive in a directory and configure Apache to serve it. (it will be named [PHPDIG_DIR] in the following) The engine did not need a dedicated VirtualHost to operate. If PhpDig is installed on an Unix operating system server, set the file permissions to writable on the following directories, for the suid Apache server is running : [PHPDIG_DIR]/text_content [PHPDIG_DIR]/include [PHPDIG_DIR]/admin/temp 4.3. MySql database installation There are two ways to install the database. - Php install script : In your favorite browser, request the page : [PHPDIG_DIR]/admin/install.php And follow the instructions. Choose if you want to create the entire database or only the table, or update from a previous database version of PhpDig. This script uses the form datas to complete the fields of the "[PHPDIG_DIR]/include/_connect.php" script and copying it to "[PHPDIG_DIR]/include/connect.php". - Manual installation : You have to create the database (You can choose any other name than "phpdig") : #mysql mysql mysql> CREATE DATABASE phpdig; mysql> quit #mysql phpdig < [PHPDIG_DIR]/sql/init_db.sql Verify that all tables are present : #mysql phpdig mysql> SHOW TABLES; The database answer must be : +------------------+ | Tables_in_phpdig | +------------------+ | engine | | excludes | | keywords | | logs | | sites | | spider | | tempspider | +------------------+ 7 rows in set (0.00 sec) mysql> After the database was created, copy the "[PHPDIG_DIR]/include/_connect.php" file to "[PHPDIG_DIR]/include/connect.php" and edit the new one. Replace the values "<host>", "<user>", "<pass>", and "<database>" to your database server URL, the username, the password to connect to it (if required) and the name you give to the phpdig database. If you want to prefix all table names by a custom string, replace "<dbprefix>" to it and rename tables in any MySql client. If you do not, empty this value. In a local installation, the values "localhost", "root", and "" are sufficient in most cases. To verify the install is complete, open the main page [PHPDIG_DIR]/index.php with your favorite web browser. The search form must be visible. 5. Configuration After installaton, the engine will work without modification to the configuration file. The configuration steps depend on your needs. Don't forget to change the administration login and password if you use a Php compiled in an Apache dynamic or static module. Notice : Authentification doesn't operate with a CGI php. In this case, uses an .htaccess file in order to protect the [PHPDIG_DIR]/admin directory. All configuration parameters are in the [PHPDIG_DIR]/include/config.php file. Each of them is followed by a comment explaining it purpose. In the following, all statements are lines of the config.php file. The values are default values. 5.1. Configuring administrator access Change the following constants. If you don't want to see a clear password value, use the Apache authentification functions. define('PHPDIG_ADM_AUTH','1'); //Activates/deactivates the authentification functions define('PHPDIG_ADM_USER','admin'); //Username define('PHPDIG_ADM_PASS','admin'); //Password 5.2. Configuring robot and engine Change following variables and constants. define('SPIDER_MAX_LIMIT',20); //max recurse levels in sipder define('SPIDER_DEFAULT_LIMIT',3); //default value define('RESPIDER_LIMIT',4); //recurse limit for update define('LIMIT_DAYS',7); //default days before reindex a page define('SMALL_WORDS_SIZE',2); //words to not index define('MAX_WORDS_SIZE',30); //max word size define('PHPDIG_EXCLUDE_COMMENT','<!-- phpdigExclude -->'); //HTML comment to exclude //a part of a page define('PHPDIG_INCLUDE_COMMENT','<!-- phpdigInclude -->'); //HTML comment to stop exclude of //a part of a page define('PHPDIG_DEFAULT_INDEX',true); //phpDig considers /index or /default //html, htm, php, asp, phtml as the //same as '/' define('PHPDIG_SESSID_REMOVE',true); // remove SIDS from indexed URLS define('PHPDIG_SESSID_VAR','PHPSESSID');// name of the SID variable define('TITLE_WEIGHT',3); //relative title weight define('CHUNK_SIZE',2048); //chunk size for regex processing define('SUMMARY_LENGTH',500); //length of results summary define('TEXT_CONTENT_PATH','text_content/'); //Text content files path define('CONTENT_TEXT',1); //Activates/deactivates the //storage of text content. define('PHPDIG_IN_DOMAIN',false); //allows phpdig jump hosts in the same //domain. If the host is "www.mydomain.tld", //domain is "mydomain.tld" 5.3. Configure PhpDig encoding Modify the following contant. PhpDig does not support multiple encodings : The choosen applies to all indexed documents and admin interface. define('PHPDIG_ENCODING','iso-8859-1'); // iso-8859-1 and iso-8859-2 supported If you want PhpDig supports others encoding, you have to add array indexes to the following variables, taking example on existing ones. $phpdig_string_subst['iso-8859-1'] $phpdig_string_subst['iso-8859-2'] ... $phpdig_words_chars['iso-8859-1'] $phpdig_words_chars['iso-8859-2'] ... 5.4. Configuring external binaries Each external tool is defined by three constants : - INDEX (true or false) : Activate this file type indexing ; - PARSE (path) : Executable path ; - OPTION (options) : Options of program. define('PHPDIG_INDEX_MSWORD',true); define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc'); define('PHPDIG_OPTION_MSWORD','-s 8859-1'); define('PHPDIG_INDEX_PDF',true); define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext'); define('PHPDIG_OPTION_PDF','-cork'); define('PHPDIG_INDEX_MSEXCEL',true); define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv'); define('PHPDIG_OPTION_MSEXCEL',''); 5.5. Configuring templates Change following variables and constants. $phpdig_language = "en"; //GUI language $template = "$relative_script_path/templates/phpdig.html"; //Template file path define('HIGHLIGHT_BACKGROUND','yellow'); //Highlighting background color //Only for classic mode define('HIGHLIGHT_COLOR','#000000'); //Highlighting text color //Only for classic mode define('LINK_TARGET','_blank'); //Target for result links define('WEIGHT_IMGSRC','./tpl_img/weight.gif'); //Baragraph image path define('WEIGHT_HEIGHT','5'); //Baragraph height define('WEIGHT_WIDTH','50'); //Max baragraph width define('SEARCH_PAGE','index.php'); //The name of the search page define('SUMMARY_DISPLAY_LENGTH','150'); //Max chars displayed in //description define('DISPLAY_SNIPPETS',true); //Display text snippets define('DISPLAY_SNIPPETS_NUM',4); //Max snippets to display define('DISPLAY_SUMMARY',false); //Display description define('PHPDIG_DATE_FORMAT','\1-\2-\3'); // Date format for last update // \1 is year, \2 month and \3 day define('SEARCH_DEFAULT_LIMIT',10); //results per page define('SEARCH_DEFAULT_MODE','start'); // default search mode (start|exact|any) 5.6. FTP configuration (if necessary) PhpDig doesn't indexes FTP sites. Lot of PhpDig users install it on shared web servers, and on those, Php is always configured with safe_mode activated. On those shared hostings, access to the crontab is also not allowed. Another instance of Php, on a distinct server is the solution. In my case, a linux server installed at my home and plugged on a cable connexion runs the update process for the demo version of PhpDig. Your host must permit you to connect to your MySql database through TCP/IP. And what about this famous FTP connection ? It sends textual content of indexed documents to the proper directory in the remote server. If you deactivate the FTP function (in case of low-bandwidth connections, like by modem), only the summary stored in the database and not the exact document is displayed on the results page. FTP parameters are the following. define('FTP_ENABLE',0); //Activate/deactivate the ftp connection define('FTP_HOST','<ftp host dir>'); //FTP server name define('FTP_PORT',21); //FTP port define('FTP_PASV',1); //Use passive mode (PASV), recommended define('FTP_PATH','<phpdig root dir>'); //Path of the phpdig directory on server, //relative to the ftp rootdir define('FTP_TEXT_PATH','text_content'); //Text content directory (default) define('FTP_USER','<ftp username>'); //FTP username account define('FTP_PASS','<ftp password>'); //FTP password account 6. Update PhpDig 6.1. Database update The [PHPDIG_DIR]/sql/update_db_to[version].sql contains all required SQL instructions to update your existing install of PhpDig. Vous pouvez galement utiliser l'interface d'installation en choississant l'option de mise ą jour de la base. Cette fonction n'opŁre que pour la version immdiatement prcdente de la base de donnes. 6.2. Scripts update Save your configurations files, and just replace the existing scripts by the new ones. 7. Indexing with web interface 7.1. Index a new host Open the admin interface with your browser : [PHPDIG_DIR]/admin/index.php. Just fill in the url field, PhpDig reconizes if it is a new host or an existing one. You can also precise a path and/or a file, wich is the starting point of the robot. Select the maximum search depth in levels and click on the "Dig This !" button. A new page opens showing the indexing and spidering process. If a double is displayed, it means that PhpDig has detected that the current document, with a new url, is a duplicate of an existing one in the database. Each "+" sign means that a new link was detected and will be followed at the next spidering level. For each level, PhpDig displays the number of new links it has found. If no new link is found, PhpDig stops its browsing and displays the list of the documents. 7.2. Update an existing host From the admin page, you can reach the update interface by choosing a site and clicking on the [update form] button. A two parts inteface appears. On the left side of the screen is the client-side folder structure of the site. The blue arrow displays the "folder" content, in order to reindex the documents individually. The document's listing of a folder is on the right side of the screen. On both sides, the red cross deletes all the selected branch or file, including sub-folders in case of deleting a branch, from the engine. The green check mark reindexes the selected branch or document if they were indexed for more than [LIMIT_DAYS] days. It also search new links for documents wich are changed. 7.3. Index maintenance 3 scripts are used to delete useless data in the PhpDig database. The links are in the admin page. Clean index deletes index records not linked to any page. Useful if manual deletes are done in the database. Clean dictionary deletes keywords which are not used by the index. Useful for reducing the size of the dictionary, particularly when a large site contains a great deal of technical words and is deleted from the engine. Clean common words must be run when new common words are added in the [PHPDIG_DIR]/includes/common_words.txt file. It deletes all reference to those common words. 8. Indexing by command line interface Le script [PHPDIG_DIR]/admin/spider.php could be lauched by the shell in order to not overload the webserver. Launching the script : #php -f [PHPDIG_DIR]/admin/spider.php [option] List of options : - all (default) : Update all hosts ; - forceall : Force update all hosts ; - http://mondomaine.tld : Add or update the url ; - path/file : Add or update all urls listed in the given file. Examples : #php -f [PHPDIG_DIR]/admin/spider.php http://host.mydomain.com #php -f [PHPDIG_DIR]/admin/spider.php [File containing an urls list] As any shell command, the output can be redirected to a textfile. (If you want some logs.) #php -f [PHPDIG_DIR]/admin/spider.php all >> /var/log/phpdig.log The [PHPDIG_DIR]/admin/spider.php can be launch by a cron task too, in order to auto update the index. The recommended periodicity is 7 days. The updated documents you want to see immediately in the searches can be updated manually. Those pages can contain a "revisit-after" metatag with a short delay. 9. Templates 9.1. Description of templates Templates are HTML files containing some xml-like tags wich are replaced with the dynamic PhpDig content. See the provided templates source code as making templates example. The <phpdig:results></phpdig:results> tags display the results table : All content between the tags will be repeated as much time there are results in the results page. Two CSS classes are used by PhpDig : .phpdigHighlight : <SPAN/> class for highlighting of search terms. a.phpdig : <A/> class for phpdig results and navigation links. All template tags look like : <phpdig:parametre/>. Excepted the <phpdig:results></phpdig:results> tag, all are stand-alone tags. 9.2. Tags outside the results table phpdig:title_message Page title phpdig:form_head Starting the search form phpdig:form_title Form title phpdig:form_field Text field of the form phpdig:form_button Submit button of the form phpdig:form_select Select list to choose the num of results per page phpdig:form_radio Radio button to choose the parsing of search keys phpdig:form_foot Ending the search form phpdig:result_message Num of results message phpdig:ignore_message Too short words message phpdig:ignore_commess Too common words message phpdig:nav_bar Navigation bar to browse results phpdig:pages_bar Navigation bar without previous/next links phpdig:previous_link src='[img src]' "Previous" icon phpdig:next_link src='[img src]' "Next" icon 9.3. Results table tags phpdig:results Contains results list phpdig:img_tag Relevance Baragraph phpdig:weight Relevance of the page (in percents) phpdig:page_link Result title and link to the document phpdig:limit_links Links of limitation to an host / path phpdig:text Highlighted text extract or summary phpdig:n Result ranking, starting 1. phpdig:complete_path Complete URL of the document phpdig:update_date Last update of the document phpdig:filesize Size of the document (KiloBytes) 10. Insert PhpDig in a website 10.1. The index.php script The index.php script is only an example of using PhpDig with the same name template. This script can be inserted in any part of your website, assuming the configuration files and libraries are included. The $relative_script_path must contain the relative path of PhpDig's root directory from the current script. The phpdigSearch() must be called always as this : extract(phpdigHttpVars( array('query_string'=>'string', 'template_demo'=>'string', 'refine'=>'integer', 'refine_url'=>'string', 'site'=>'integer', 'limite'=>'integer', 'option'=>'string', 'search'=>'string', 'lim_start'=>'integer', 'browse'=>'integer', 'path'=>'string' ) )); phpdigSearch($id_connect, $query_string, $option, $refine, $refine_url, $lim_start, $limite, $browse, $site, $path, $relative_script_path, $template); The last parameter, $template, sets the way how PhpDig works : Use 'classic' for the static look of PhpDig. All html tags are included. Use the $template variable to use a template. The variable could be set in the config.php file or anywhere else, in order to have a different look in distincts part of your website. Use 'array' to do what you want with the search form and results. 10.2. Using 'classic' mode There is nothing to do : All display is done by the phpdigSearch() function. But you can't modify display with this. 10.3. Using templates With templates described earlier, it is very easy to insert PhpDig in an existing website look. A template could be an entire HTML page as sample templates provided, but only a part of it too. Only this part is described by the template and will appear where you call the phpdigSearch() function. You just have to add a .phpdigHighlight CSS class to you existing CSS. 10.4. Using PHP using the 'array' mode, The phpdigSearch() function returns an array containing both results and search form elements. Use this to get all content of this array, giving the script a full search results URL : print '<pre>'; print_r( phpdigSearch($id_connect, $query_string, $option, $refine, $refine_url, $lim_start, $limite, $browse, $site, $path, $relative_script_path, 'array') ); print '</pre>'; And do what you want with the results (first in big, the following three in medium size, and only the title of others at the right side, all is possible !). 11. Inside PhpDig 11.1. Spidering and Indexing PhpDig reads the fist page entered for indexing and adds found links in a list of links to follow. When no more valid link is found by the robot, it stops the process. Do decide what to do with a new link, PhpDig follows this procedure : - It requests the HTTP header for the current URI. If the returned Mime-type could be parsed by the robot, it continues its process. If the server returns a redirection, PhpDig search if the redirection go to another host or not. Then the robot compares "last-modified" header with the previous stored date. If they are the same, the URI is not processed. At least, the robot compare the URI with the exclude list. - If the document type is HTML, PhpDig reads the META Robots content to know if it is allowed to index and/or follow links from the current document. - Then PhpDig downloads the document in a temporary file. In first the document is indexed : The text content is stored in a file in order to display snippets on results page, then parsed in order to get the keywords. For an HTML document, the exclude and include comments are searched (PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT constants) to exclude parts from indexing. - At least, PhpDig reads again the temporary file (in case of HTML document) in order to extract new links. All links are tested and parsed to decide those to index, those giving a 404 error, linking to another host and so on. - Indexing process is exclusive : An host is locked by the spider during indexing, update or delete. No operation on this host is permitted (excepting search of course) as long as the host is locked. 11.2. Clearings on search The search form is so simple that it not needs lot of explanation. But it could be useful to know that : - An AND operator is applied between each search key ; - Putting a '-' sign before a word excludes it from the search results. No document containing this word would be displayed ; - Search is case-insensitive and accent-insensitive. 12. Getting help with PhpDig A small messageboard dedicated to PhpDig can be found at : http://phpdig.toiletoine.net/messageboard/ Ask there any questions you have about this script. You can also mail the author at phpdig@toiletoine.net File created by XSLT parser Php 4.3.0 - Sablotron 0.97

Wyszukiwarka

Podobne podstrony:
phpdig doc en
phpdig doc api
phpdig doc api
phpdig doc fr
phpdig doc fr
AGH Sed 4 sed transport & deposition EN ver2 HANDOUT
Blaupunkt CR5WH Alarm Clock Radio instrukcja EN i PL
readme en
pgmplatz tast en
en 48
03 PEiM Met opisu ukł elektr doc (2)
Od Pskowa do Parkan 2 02 doc
punto de cruz Cross Stitch precious moment puntotek Indios en canoa
protokół różyca doc
CW5 doc

więcej podobnych podstron