text-extraction – Tarik Billa

Extract all hex colors from a multiline CSS string

March 31, 2024 by Tarik

Get numeric suffix from key starting with specific substring

March 9, 2024 by Tarik

Getting URL parameter in java and extract a specific text from that URL

August 4, 2023 by Tarik

I think one of the easiest ways would be to parse the string returned by URL.getQuery() as

    public static Map<String, String> getQueryMap(String query) {
        String[] params = query.split("&");
        Map<String, String> map = new HashMap<String, String>();
        for (String param : params) {
            // split once, and tolerate parameters that have no "=value" part
            String[] pair = param.split("=", 2);
            String name = pair[0];
            String value = pair.length > 1 ? pair[1] : "";
            map.put(name, value);
        }
        return … Read more
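For comparison, here is a minimal Python sketch of the same idea: split the query on "&", then split each parameter on the first "=". The function name get_query_map is only borrowed from the Java snippet for illustration, and treating a parameter without "=" as an empty string is an assumption, not something the original answer specifies.

```python
from urllib.parse import parse_qsl


def get_query_map(query):
    """Split a raw query string into a name -> value dict (no percent-decoding)."""
    result = {}
    for param in query.split("&"):
        if not param:
            # skip empty segments such as a trailing "&"
            continue
        # partition splits on the first "=" only; value is "" when "=" is absent
        name, _, value = param.partition("=")
        result[name] = value
    return result


if __name__ == "__main__":
    print(get_query_map("id=42&lang=en&debug"))
    # The standard library can do the same and also decodes %XX escapes.
    print(dict(parse_qsl("q=hello%20world&x=", keep_blank_values=True)))
```

If the values may contain percent-encoded characters, the standard library's urllib.parse.parse_qsl is probably the safer choice, since the hand-written loop leaves them encoded.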

How to extract text from MS office documents in C#

July 22, 2023 by Tarik

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you … Read more

How can I read pdf in python? [duplicate]

July 16, 2023 by Tarik

You can use the PyPDF2 package.

    # install PyPDF2
    pip install PyPDF2

Once you have it installed:

    # importing all the required modules
    import PyPDF2

    # creating a pdf reader object
    reader = PyPDF2.PdfReader('example.pdf')

    # print the number of pages in pdf file
    print(len(reader.pages))

    # print the text of the first page
    print(reader.pages[0].extract_text())

Follow the documentation.
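If you need the text of every page rather than just the first, a small loop over reader.pages should do. This is only a sketch: the helper names extract_all_text and read_pdf_text are made up for illustration, and joining pages with a newline is an assumption about the output you want.

```python
def extract_all_text(reader):
    """Join the extracted text of every page of a PdfReader-like object."""
    # "or ''" is purely defensive, in case a page yields no text at all
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(path):
    """Open the PDF at `path` with PyPDF2 and return the text of all pages."""
    import PyPDF2  # imported here so the helper above works without PyPDF2 installed

    return extract_all_text(PyPDF2.PdfReader(path))


if __name__ == "__main__":
    # 'example.pdf' is the same placeholder file name used in the snippet above
    print(read_pdf_text("example.pdf"))
```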

Extracting text from a PDF file using PDFMiner in python?

June 26, 2023 by Tarik

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = "utf-8"
        laparams = LAParams()
        device = TextConverter(rsrcmgr, … Read more

PDF text extraction from given coordinates

May 31, 2023 by Tarik

Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in “portions” (parts of single pages). What you can do: extract the text of a certain range of pages only. First: Ghostscript’s txtwrite output device (not so good) gs … Read more

Extract all email addresses from bulk text using jquery

May 30, 2023 by Tarik

Here’s how you can approach this:

HTML

    <p id="emails"></p>

JavaScript

    var text = 'sdabhikagathara@rediffmail.com, "assdsdf" <dsfassdfhsdfarkal@gmail.com>, "rodnsdfald ferdfnson" <rfernsdfson@gmal.com>, "Affdmdol Gondfgale" <gyfanamosl@gmail.com>, "truform techno" <pidfpinfg@truformdftechnoproducts.com>, "NiTsdfeSh ThIdfsKaRe" <nthfsskare@ysahoo.in>, "akasdfsh kasdfstla" <akashkatsdfsa@yahsdfsfoo.in>, "Bisdsdfamal Prakaasdsh" <bimsdaalprakash@live.com>,; "milisdfsfnd ansdfasdfnsftwar" <dfdmilifsd.ensfdfcogndfdfatia@gmail.com> datum eternus hello+11@gmail.com';

    function extractEmails(text) {
        return text.match(/([a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);
    }

    $("#emails").text(extractEmails(text).join('\n'));

Result

    sdabhikagathara@rediffmail.com,dsfassdfhsdfarkal@gmail.com,rfernsdfson@gmal.com,gyfanamosl@gmail.com,pidfpinfg@truformdftechnoproducts.com,nthfsskare@ysahoo.in,akashkatsdfsa@yahsdfsfoo.in,bimsdaalprakash@live.com,dfdmilifsd.ensfdfcogndfdfatia@gmail.com,hello+11@gmail.com

Source: Extract email from bulk text (with … Read more
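The same regular expression carries over to other languages more or less unchanged. Here is a rough Python sketch using re.findall; the function name extract_emails and the sample string are made up for illustration. Note that the pattern is deliberately loose, as in the JavaScript version: it will, for instance, swallow a full stop that directly follows an address.

```python
import re

# Same character classes as the JavaScript pattern above; the "i" flag is
# unnecessary because both letter cases are already listed explicitly.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+")


def extract_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = 'a@b.com, "Some Name" <c.d@e.org> datum eternus hello+11@gmail.com'
    print("\n".join(extract_emails(sample)))
```

Unlike String.match in JavaScript, which returns null when nothing matches, findall returns an empty list, so the result can be joined or looped over without an extra check.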

Extract text from pdf file using javascript [duplicate]

May 29, 2023 by Tarik

Here is a nice example (http://git.macropus.org/2011/11/pdftotext/example/) of how to use pdf.js to extract the text. Of course, you will have to remove a lot of code for your purpose, but it should do the job.

How to extract just plain text from .doc & .docx files? [closed]

April 3, 2023 by Tarik

If you want the pure plain text (my requirement), then all you need is

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

which I found at command line fu. It unzips the docx file, gets the actual document, then strips all the XML tags. Obviously all formatting is lost.
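If a shell is not available, roughly the same trick can be sketched in Python with the standard library only, since a .docx file is a zip archive containing word/document.xml. The function name docx_to_text is made up for illustration, and two things differ from the one-liner above as deliberate assumptions: paragraph ends are turned into newlines, and XML entities such as &amp; are decoded.

```python
import html
import re
import zipfile


def docx_to_text(path_or_file):
    """Return the plain text of a .docx file, with all formatting discarded."""
    # A .docx is a zip archive; the main document body lives in word/document.xml
    with zipfile.ZipFile(path_or_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Turn paragraph ends into newlines so paragraphs do not run together
    xml = xml.replace("</w:p>", "\n")
    # Strip every remaining tag, like the sed expression s/<[^>]\{1,\}>//g
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; and &lt; back into plain characters
    return html.unescape(text)


if __name__ == "__main__":
    # 'some.docx' is the same placeholder file name used in the command above
    print(docx_to_text("some.docx"))
```

As with the shell version, tables, headers, footnotes, and anything stored outside word/document.xml are not handled.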