Text Absorber
Using Aspose PDF's Text Absorber to extract text from PDF files.

Using TextFragmentAbsorber to determine which PDF page a search string is located on
At work we send thousands of PDF files to an external supplier to print and post the letters. We encountered an error which meant that only the first page of multi-page letters was being sent.
We needed to find a way to identify which PDF files had important content on the second or third pages.
To do this, we needed to check whether the search string "Yours sincerely" appeared on page two or later.
The C# code below loops over the PDF files in a folder and passes each one to the check:
public static void ProcessFiles(string location)
{
    FileInfo[] files = [.. new DirectoryInfo(location)
        .EnumerateFiles("*.pdf")
        .OrderByDescending(f => f.Name)];

    foreach (FileInfo file in files)
    {
        CheckPDFForYoursSincerely(file.FullName, file.Name);
    }
}
The CheckPDFForYoursSincerely function below shows how we checked which page the text was found on.
Any letter that didn't contain this text was reprocessed as well, as we couldn't be sure whether important information had been missed.
// Requires the Aspose.PDF package
using Aspose.Pdf;
using Aspose.Pdf.Text;

// outputFolder and LogSQLtoCountCC are defined elsewhere in the class
public static void CheckPDFForYoursSincerely(string fullName, string shortName)
{
    // Document is disposable, so it is released when the method exits
    using var pdfDocument = new Document(fullName);

    // Search every page of the document for the sign-off text
    var tfa = new TextFragmentAbsorber("Yours sincerely");
    pdfDocument.Pages.Accept(tfa);
    TextFragmentCollection tfc = tfa.TextFragments;

    // if Yours sincerely is not found then copy the file over to
    // be sure as we can't guarantee where the clinical content ends
    if (tfc.Count == 0)
    {
        Console.WriteLine("INFO: Copying " + shortName + " Yours sincerely missing");
        if (!File.Exists(outputFolder + shortName))
        {
            LogSQLtoCountCC(shortName, "0");
        }
        File.Copy(fullName, outputFolder + shortName, true);
    }
    else
    {
        foreach (TextFragment tf in tfc)
        {
            // if Yours sincerely is found on page 2 or beyond
            if (tf.Page.Number >= 2)
            {
                Console.WriteLine("INFO: Copying " + shortName);
                if (!File.Exists(outputFolder + shortName))
                {
                    // Passed as a string to match the "0" call above
                    // (assumes the helper takes a string page number)
                    LogSQLtoCountCC(shortName, tf.Page.Number.ToString());
                }
                File.Copy(fullName, outputFolder + shortName, true);
                // One match on page 2 or later is enough, so stop looking
                break;
            }
            else
            {
                Console.WriteLine("INFO: Skipping " + shortName);
            }
        }
    }
}
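The copy-or-skip rule in the function above can also be pulled out into a small pure function, which makes it easy to test without any PDF files. This is only a sketch: LetterAction, LetterRules and Decide are hypothetical names that are not part of our original code or of Aspose.PDF, and it assumes page numbers start at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names for illustration; not part of Aspose.PDF.
public enum LetterAction
{
    CopySignOffMissing, // no match: we can't tell where the content ends
    CopyMultiPage,      // sign-off found on page 2 or later
    Skip                // sign-off found on page 1 only
}

public static class LetterRules
{
    // pageNumbers: the 1-based page number of every "Yours sincerely" match.
    public static LetterAction Decide(IEnumerable<int> pageNumbers)
    {
        List<int> pages = pageNumbers.ToList();

        if (pages.Count == 0)
        {
            return LetterAction.CopySignOffMissing;
        }

        // A single match on page 2 or later means the letter ran over one page.
        return pages.Max() >= 2 ? LetterAction.CopyMultiPage : LetterAction.Skip;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(LetterRules.Decide(Array.Empty<int>())); // CopySignOffMissing
        Console.WriteLine(LetterRules.Decide(new[] { 1 }));        // Skip
        Console.WriteLine(LetterRules.Decide(new[] { 1, 3 }));     // CopyMultiPage
    }
}
```

In the real check, the input would be the Page.Number of each TextFragment returned by the absorber, and the returned action would decide whether to log and copy the file.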