tokenize
how to get data between quotes in java?
You can use a regular expression to fish out this sort of information. Pattern p = Pattern.compile("\"([^\"]*)\""); Matcher m = p.matcher(line); while (m.find()) { System.out.println(m.group(1)); } This example assumes that the language of the line being parsed doesn’t support escape sequences for double quotes within string literals, allow strings that span multiple lines, or support other … Read more
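A self-contained version of that snippet, collecting the matches into a list instead of printing them (the class and method names are illustrative, not from the original answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteExtractor {
    // Matches one double-quoted run; group(1) captures the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    public static List<String> extractQuoted(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [hello, world]
        System.out.println(extractQuoted("say \"hello\" and \"world\""));
    }
}
```

As the answer notes, this breaks down if the input language allows escaped quotes inside string literals.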
Securing my API to only work with my front-end
Apply CORS – the server specifies which domains are allowed to request your API. How does it work? The client sends a special “preflight” request (using the OPTIONS method) to the server, asking whether the domain the request comes from is among the allowed domains. It also asks whether the request method is okay (you can allow GET but deny POST, …). The server determines whether … Read more
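The allow-list decision behind that preflight response can be sketched like this (hypothetical `CorsCheck` class, not any particular framework's API; the origin and method lists are made-up examples):

```java
import java.util.Set;

public class CorsCheck {
    // Hypothetical allow-lists; in practice these come from server configuration.
    private static final Set<String> ALLOWED_ORIGINS = Set.of("https://myfrontend.example");
    private static final Set<String> ALLOWED_METHODS = Set.of("GET", "POST");

    // Decide how to answer an OPTIONS preflight: echo the origin back as the
    // Access-Control-Allow-Origin header value if it is allowed, otherwise
    // return null (no CORS headers, so the browser blocks the request).
    public static String allowOriginHeader(String origin, String requestedMethod) {
        if (ALLOWED_ORIGINS.contains(origin) && ALLOWED_METHODS.contains(requestedMethod)) {
            return origin;
        }
        return null;
    }
}
```

Note that CORS is enforced by browsers, not by the server itself: a non-browser client can still call the API directly, so CORS alone does not lock an API to one front-end.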
Tokenizer vs token filters
A tokenizer will split the whole input into tokens and a token filter will apply some transformation to each token. For instance, let’s say the input is “The quick brown fox”. If you use an edgeNGram tokenizer, you’ll get the following tokens: “T”, “Th”, “The”, “The ” (the last character is a space), “The q”, “The qu” … Read more
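The token list above is simply every prefix of the input. A minimal sketch of what an edgeNGram tokenizer emits (a hypothetical helper, not the actual Lucene/Elasticsearch implementation, which also supports min/max gram lengths):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {
    // Emit every prefix of the input, from length 1 up to the full length,
    // mimicking an edgeNGram tokenizer applied to the whole input string.
    public static List<String> edgeNGrams(String input) {
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= input.length(); len++) {
            grams.add(input.substring(0, len));
        }
        return grams;
    }
}
```

For the input "The quick brown fox", the first few results are "T", "Th", "The", "The " (trailing space), matching the token list in the answer.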
How do you extract only the date from a python datetime? [duplicate]
You can use date and time methods of the datetime class to do so: >>> from datetime import datetime >>> d = datetime.now() >>> only_date, only_time = d.date(), d.time() >>> only_date datetime.date(2015, 11, 20) >>> only_time datetime.time(20, 39, 13, 105773) Here is the datetime documentation. Applied to your example, it can give something like this: … Read more
What is more efficient a switch case or an std::map
I would suggest reading switch() vs. lookup table? from Joel on Software. In particular, this response is interesting: “Prime example of people wasting time trying to optimize the least significant thing.” Yes and no. In a VM, you typically call tiny functions that each do very little. It’s not the call/return that hurts you … Read more
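For context, the two dispatch styles being compared look like this (sketched in Java rather than C++; the opcodes and values are made up, and real performance depends on the compiler and the data):

```java
import java.util.Map;

public class DispatchSketch {
    // switch-based dispatch: the compiler may emit a jump table or a binary
    // search over the case labels, but the mapping is fixed at compile time.
    public static int viaSwitch(int opcode) {
        switch (opcode) {
            case 0: return 10;
            case 1: return 20;
            case 2: return 30;
            default: return -1;
        }
    }

    // map-based dispatch: one hash lookup per call, but the table can be
    // built or changed at runtime.
    private static final Map<Integer, Integer> TABLE = Map.of(0, 10, 1, 20, 2, 30);

    public static int viaMap(int opcode) {
        return TABLE.getOrDefault(opcode, -1);
    }
}
```

The trade-off is flexibility versus compile-time optimization, which is why the linked discussion treats the micro-difference as rarely the thing worth optimizing.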
Is it a Lexer’s Job to Parse Numbers and Strings?
The simple answer is “Yes”. In the abstract, you don’t need lexers at all. You could simply write a grammar that used individual characters as tokens (and in fact that’s exactly what SGLR parsers do, but that’s a story for another day). You need lexers because parsers built using characters as primitive elements aren’t as … Read more
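To make the point concrete, here is a minimal hand-written lexer (a hypothetical sketch) that recognizes numbers and double-quoted strings as single tokens – exactly the work that a character-level grammar would otherwise push into the parser:

```java
import java.util.ArrayList;
import java.util.List;

public class TinyLexer {
    // Split the input into NUMBER tokens (digit runs), STRING tokens
    // (double-quoted runs, no escapes), and single-character tokens;
    // whitespace is skipped entirely.
    public static List<String> lex(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i)); // one NUMBER token
            } else if (c == '"') {
                int start = i++;
                while (i < src.length() && src.charAt(i) != '"') i++;
                i++; // consume the closing quote
                tokens.add(src.substring(start, Math.min(i, src.length()))); // one STRING token
            } else {
                tokens.add(String.valueOf(c)); // single-character token
                i++;
            }
        }
        return tokens;
    }
}
```

The parser then sees three tokens for an input like `12 + "ab"` instead of nine characters, which is the practical reason lexers exist.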
How to use a Lucene Analyzer to tokenize a String?
Based on the answer above, this is slightly modified to work with Lucene 4.0. public final class LuceneUtil { private LuceneUtil() {} public static List<String> tokenizeString(Analyzer analyzer, String string) { List<String> result = new ArrayList<String>(); try { TokenStream stream = analyzer.tokenStream(null, new StringReader(string)); stream.reset(); while (stream.incrementToken()) { result.add(stream.getAttribute(CharTermAttribute.class).toString()); } } catch (IOException e) { … Read more