search_simplify($text, $langcode = NULL)
Simplifies and preprocesses text for searching.
Processing steps:
- Entities are decoded.
- Text is lower-cased and diacritics (accents) are removed.
- hook_search_preprocess() is invoked.
- CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
- Punctuation is processed (removed or replaced with spaces, depending on where it is; see code for details).
- Words are truncated to 50 characters maximum.
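The punctuation and truncation steps above can be tried outside Drupal with a rough sketch like the following. It is not the module's implementation: `sketch_simplify()`, `SKETCH_NUMBERS` and `SKETCH_PUNCTUATION` are hypothetical names, the two character classes are small ASCII stand-ins for Drupal's much larger Unicode classes, and the diacritics, hook_search_preprocess() and CJK steps are left out. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical ASCII stand-ins for Drupal's PREG_CLASS_NUMBERS and
// PREG_CLASS_PUNCTUATION; the real classes cover far more of Unicode.
const SKETCH_NUMBERS = '0-9';
const SKETCH_PUNCTUATION = '\/.,:;_\-';

// Simplified illustration of the punctuation and truncation steps only.
function sketch_simplify(string $text): string {
  // Decode entities and lower-case the text.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  $text = mb_strtolower($text, 'UTF-8');

  // Drop punctuation that sits between two groups of digits, so that
  // '20/03/1984' and '20-03-1984' end up as the same token.
  $text = preg_replace('/([' . SKETCH_NUMBERS . ']+)[' . SKETCH_PUNCTUATION . ']+(?=[' . SKETCH_NUMBERS . '])/u', '\1', $text);

  // Runs of two or more dots or dashes act as word boundaries.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // Remaining dots, underscores and dashes are removed outright.
  $text = preg_replace('/[._-]+/', '', $text);

  // Assumption: anything that is not a letter or digit is a word boundary.
  // Drupal uses Unicode::PREG_CLASS_WORD_BOUNDARY here instead.
  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);

  // Truncate each word to at most 50 characters.
  $words = array_map(static function ($word) {
    return mb_substr($word, 0, 50, 'UTF-8');
  }, explode(' ', $text));

  return implode(' ', $words);
}
```

With this sketch, `sketch_simplify('20/03/1984')` and `sketch_simplify('20-03-1984')` produce the same string, which is why a search for one date format can match the other.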
Parameters
string $text: Text to simplify.
string|null $langcode: Language code for the language of $text, if known.
Return value
string Simplified and processed text.
File
- core/modules/search/search.module, line 255
- Enables site-wide keyword searching.
Code
function search_simplify($text, $langcode = NULL) {
  // Decode entities to UTF-8
  $text = Html::decodeEntities($text);

  // Lowercase
  $text = Unicode::strtolower($text);

  // Remove diacritics.
  $text = \Drupal::service('transliteration')->removeDiacritics($text);

  // Call an external processor for word handling.
  search_invoke_preprocess($text, $langcode);

  // Simple CJK handling
  if (\Drupal::config('search.settings')->get('index.overlap_cjk')) {
    $text = preg_replace_callback('/[' . PREG_CLASS_CJK . ']+/u', 'search_expand_cjk', $text);
  }

  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . PREG_CLASS_NUMBERS . ']+)[' . PREG_CLASS_PUNCTUATION . ']+(?=[' . PREG_CLASS_NUMBERS . '])/u', '\1', $text);

  // Multiple dot and dash groups are word boundaries and replaced with space.
  // No need to use the unicode modifier here because 0-127 ASCII characters
  // can't match higher UTF-8 characters as the leftmost bit of those are 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);

  // Truncate everything to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, '_search_index_truncate');
  $text = implode(' ', $words);

  return $text;
}