search_simplify($text, $langcode = NULL)
Simplifies and preprocesses text for searching.
Processing steps:
- Entities are decoded.
- Text is lower-cased and diacritics (accents) are removed.
- hook_search_preprocess() is invoked.
- CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
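- Punctuation is processed: punctuation between digits is removed, runs of two or more dots or dashes become a space, single dots, underscores, and dashes are removed, and every other word-boundary character becomes a space.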
- Words are truncated to 50 characters maximum.
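The steps above can be approximated in plain PHP to see their combined effect. This is only a rough sketch, not the Drupal implementation: `sketch_simplify()` is a hypothetical name, the Unicode properties `\p{P}`, `\p{S}`, and `\p{Z}` are assumed stand-ins for `Unicode::PREG_CLASS_WORD_BOUNDARY`, and diacritic removal, hook_search_preprocess(), CJK handling, and the numeric rule are left out. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical, simplified stand-in for search_simplify(); not Drupal code.
function sketch_simplify(string $text, int $max = 50): string {
  // Decode HTML entities to UTF-8 characters.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // Lower-case the text (diacritics are NOT removed in this sketch).
  $text = mb_strtolower($text, 'UTF-8');
  // Runs of two or more dots or dashes act as word boundaries.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);
  // Remaining dots, underscores, and dashes are removed outright.
  $text = preg_replace('/[._-]+/', '', $text);
  // Assumption: these Unicode properties approximate Drupal's
  // word-boundary class. Each run collapses to a single space.
  $text = preg_replace('/[\p{P}\p{S}\p{Z}\s]+/u', ' ', $text);
  // Truncate every word to at most $max characters.
  $words = array_map(
    static function (string $word) use ($max): string {
      return mb_substr($word, 0, $max, 'UTF-8');
    },
    explode(' ', $text)
  );
  return implode(' ', $words);
}
```

For example, `sketch_simplify('Caf&eacute; U.S.A. -- node_access')` yields `café usa nodeaccess`: the acronym and the underscored identifier each collapse into a single searchable word, while the double dash separates words.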
Parameters
string $text: Text to simplify.
string|null $langcode: Language code for the language of $text, if known.
Return value
string Simplified and processed text.
File
- core/modules/search/search.module, line 255
- Enables site-wide keyword searching.
Code
function search_simplify($text, $langcode = NULL) {
  // Decode entities to UTF-8
  $text = Html::decodeEntities($text);

  // Lowercase
  $text = Unicode::strtolower($text);

  // Remove diacritics.
  $text = \Drupal::service('transliteration')->removeDiacritics($text);

  // Call an external processor for word handling.
  search_invoke_preprocess($text, $langcode);

  // Simple CJK handling
  if (\Drupal::config('search.settings')->get('index.overlap_cjk')) {
    $text = preg_replace_callback('/[' . PREG_CLASS_CJK . ']+/u', 'search_expand_cjk', $text);
  }

  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . PREG_CLASS_NUMBERS . ']+)[' . PREG_CLASS_PUNCTUATION . ']+(?=[' . PREG_CLASS_NUMBERS . '])/u', '\1', $text);

  // Multiple dot and dash groups are word boundaries and replaced with space.
  // No need to use the unicode modifier here because 0-127 ASCII characters
  // can't match higher UTF-8 characters as the leftmost bit of those are 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);

  // Truncate everything to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, '_search_index_truncate');
  $text = implode(' ', $words);

  return $text;
}
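The numeric rule in the code is the least obvious step, so a small standalone sketch may help. It assumes `\p{N}` and `\p{P}` as rough substitutes for `PREG_CLASS_NUMBERS` and `PREG_CLASS_PUNCTUATION` (the real constants are explicit character lists, so edge cases may differ), and `sketch_join_numbers()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper illustrating the "([number]+)[punctuation]+(?=[number])"
// rule. Assumption: \p{N} and \p{P} stand in for Drupal's character classes.
function sketch_join_numbers(string $text): string {
  // Drop punctuation that sits between two groups of digits, so the
  // digit groups are indexed as one piece.
  return preg_replace('/(\p{N}+)\p{P}+(?=\p{N})/u', '\1', $text);
}

// Both date spellings normalize to the same token:
//   sketch_join_numbers('20/03/1984')  returns '20031984'
//   sketch_join_numbers('20-03-1984')  returns '20031984'
// Punctuation not followed by a digit is left for the later steps:
//   sketch_join_numbers('v1.2, then 3') returns 'v12, then 3'
```

Because the lookahead `(?=...)` does not consume the following digit, that digit can begin the next match, which is what lets a chain such as `192.168.0.1` collapse in a single pass.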