PDF to plain text, please

BLOB-stored PDF’s can be indexed using Oracle Text using the AUTO_FILTER. This filter can also be applied when just converting the PDF to plain text. You don’t need an index in that case. However, this eats-all-type filter might be missing out on specific PDF documents, even if the PDF version in combination with its security settings are supported. What are the alternatives…

Let me describe the problem more clearly first. I recently had to investigate an issue with converting binary stored PDF files ( BLOB column ) to plain text ( CLOB column ). Essentially the data was modeled like this:

(assume DOCS is the schema owner…)

I had started to implement this business requirement using Oracle Text, specifically using its CTX_DOC.policy_filter procedure. This procedure requires a so called named policy which has to be created in advance (on-time) using CTX_DLL.create_policy. My first approach was to create this policy specifying the previously mentioned AUTO_FILTER.

It worked like a charm…regrettably with the exception of just a few specific PDF documents. I did not receive error messages, but closer inspection of the resulting documents revealed textual properties the end-user would not have accepted. For example, some paragraphs were converted such that each word appeared in a separate line surrounded by three additional empty lines. If AUTO_FILTER would be used to text-index the document, that would not be a problem. For the specific text-based (ML) algorithm that had to read the converted text, it was.

Conversion of PDF documents using AUTO_FILTER works like this. A soon as the RDBMS needs access to the enclosed text stored in the binary document a so-called third-party filter is triggered to convert this document. For Oracle Text, this particular filter is the ctxhx executable living in the $ORACLE_HOME/ctx/bin folder. This executable ( backed-up by a long list of libraries) is very much capable of determining the type of binary document and handling that type appropiately (well mostly). As such, it is part of the Oracle Outside In Technology (OIT). Roger Ford, Oracle Product Manager, has written quite some blog posts on OIT and how that helps on Text-indexing binary documents, see here.

However, as described above, using AUTO_FILTER just wasn’t appropiate enough in the customer’s case. This is where USER_FILTER came to the rescue.

Of course, more or less at this point, I did think about raising an Oracle Support ticket. However, this would involve uploading at least one example of a PDF demonstrating this behaviour. The problem with this had to do with sharing the content of the PDF….

With USER_FILTER an alternative executable can be referenced. This executable has to be placed in the same folder as mentioned above. It must accept two parameters ( filename in, filename out) and of course some file privileges need to be in place in order for the RDBMS to be able to execute it.

I named my executable userfilterpdftotext.sh, revealing that in fact the pdftotext command-line utility is called. This is the content of the script:

#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" -q "$1" "$2"

In other words the script runs quietly outputting UTF-8 encoded text. The -q is mandatory here: any terminal output would be interpreted as if an error had occured during conversion.
Create a preference to be able to use the script above:

begin
  ctx_ddl.create_preference
  ( preference_name => 'PDF_Filter_Preference'
  , object_name     => 'USER_FILTER'
  );
  ctx_ddl.set_attribute
  ( preference_name => 'PDF_Filter_Preference'
  , attribute_name  => 'COMMAND'
  , attribute_value => 'userfilterpdftotext.sh'
  );
  ctx_ddl.create_policy
  ( policy_name     => 'PDF_Filter_Policy'
  , filter          => 'PDF_Filter_Preference'
  );
end;
/

After which, essentially ( bulk-fetch and error handling had been used in the actual implementation) the following script was run to convert the binary pdf documents:

declare
  cursor c_doc
  is
  select  doc.binary_pdf
  ,       doc.plain_text
  from    pdf_documents doc
  for update of doc.plain_text;
  r_doc c_doc%rowtype;
begin
  open c_doc;
  fetch c_doc into r_doc;
  while c_doc%found
  loop
    ctx_doc.policy_filter
    ( policy_name => 'PDF_Filter_Policy'
    , document    => r_doc.binary_pdf
    , restab      => r_doc.plain_text
    , plaintext   => TRUE
    , format      => 'BINARY'
    );
    update pdf_documents doc
    set    doc.plain_text = r_doc.plain_text
    where  current of c_doc;    
    fetch c_doc into r_doc;    
  end loop;
  commit;
end;
/

The script above has been written to a package named “EXTRACTION”. The following SQL*Plus demonstrates the result. Note that for both LOB columns, just a small portion of the is shown here.

SQL> select * from pdf_documents
  2  /

BINARY_PDF                     PLAIN_TEXT                              
------------------------------ ----------------------------------
255044462D312E340A25E2E3CFD30D                                         
255044462D312E340A25E2E3CFD30D                                         
255044462D312E340A25E2E3CFD30D                                         

SQL> call extraction.pdftotext()
  2  /

Call completed.

SQL> select * from pdf_documents
  2  /

BINARY_PDF                     PLAIN_TEXT                              
------------------------------ ----------------------------------
255044462D312E340A25E2E3CFD30D Oracle® Database                        
                               Database PL/S                           

255044462D312E340A25E2E3CFD30D Oracle® Database                        
                               SQL Language                            

255044462D312E340A25E2E3CFD30D Oracle® Text                            
                               Reference                               
                                                                       
                               19c                                     
                               E9                                      


Where did I read about that?

looking for something?

Suppose you own more than a few documents you more or less know the contents of, like a set of copies from the Oracle Documentation. At times, you’re looking for more reading about a certain topic but aren’t sure which of these documents to search for.

This blog post is about using Oracle® Text to quickly find in which (pdf) documents a search term can be found.

To start with, we need a table to store our documents. Let’s call it the GUIDES table, with a column named CONTENT to store pdf documents. We create it as follows:

create table guides 
( id number(15,0)
    generated always as identity
    constraint guides_id_not_null not null
, name varchar2(60)
    constraint guides_name_not_null not null
, content blob
) ;
create index guides_pk on guides (id);
create index guides_uk1 on guides (name);
alter table guides
  add constraint guides_pk
  primary key (id) not deferrable using index guides_pk;
alter table guides
  add constraint guides_uk1
  unique (name) not deferrable using index guides_uk1; 

As a side note, notice how the not null constraints on both the id and name column have been created such that these constraints are named while remaining to be classified as column constraints. This refers to the null vs not null discussion, more about that here.

We now insert our documents. The SQL*Developer environment for example aids in adding BLOB content, like a pdf, to our table. More on adding BLOB content can be found here. I have loaded my top favourite guides:

An Oracle® Text index is needed to be able to query the pdf column. We anticipate on using the snippet method by creating a preference with the forward_index and save_copy attribute set.

call ctx_ddl.create_preference('guides_storage', 'BASIC_STORAGE')
/
call ctx_ddl.set_attribute('guides_storage','forward_index','TRUE')
/
call ctx_ddl.set_attribute('guides_storage','save_copy','PLAINTEXT')
/
create index guides_domidx on guides(content)
  indextype is ctxsys.context 
  parameters ('storage guides_storage');

And finally we can quickly query our guides:

define searchterm = 'snippet'
SQL> select  name
   2  from    guides g
   3  where   contains
   4          ( content
   5          , '&searchterm.',1
   6          ) > 0
   7  /
 NAME                                                        
 Database SecureFiles and Large Objects Developer's Guide
 XML DB Developer's Guide
 Text Application Developer's Guide
 Text Reference
 Database PL/SQL Packages and Types Reference
 Elapsed: 00:00:00.014

Now, to be able to gauge which of these documents we were actually looking for, we need to add a snippet, like below:

select  name
,       ctx_doc.snippet
        ( 'GUIDES_DOMIDX'
        , g.id
        , '&searchterm.'
        ) snippet
from    guides g
where   contains
        ( content
        , '&searchterm.',1
        ) > 0
/ 

Displaying the results of the query using SQL Developer user defined reports both helps in re-executing the query with different search values and displaying the snippet results:

Example of SQL*Developer user defined report

Humble SQL

For those who love maths, there’s no escape. In his pleasantly entertaining style, Matt Parker has conquered the world’s stages. Many an after-dinner talk has been cheered up by his show which mixes up jokes on math with actual mathematical knowledge. On a more abstract level, the book is about sharing what can be learned from mistakes. Do find more about him on his Stand-Up Maths Youtube channel.

Humble PI Transparent BG.png
Humble Pi by
Matt Parker

Also, have a look at his recent book Humble Pi. Maybe even buy yourself a copy. I did so and can’t stop reading. OK, well, enough of this free advertising… #notspon. You now should have picked up the reason why I went for the remarkable blog post title though.

I guess I love both maths and SQL. And coming to think of it, when visiting Oracle conferences like Tech Experience or UKOUG Technology Conference, sessions tending to a comedy act on real-world SQL problems do appear to stick the best. And luckily, the SQL-subject presenters of these conferences have been doing quite well  on that!

What does it take to present on a SQL subject in a comedian style? Besides the obvious presenter skills, you would need real world SQL experience, like actually having been involved in a SQL problem, preferably in context of a bug in an application in production. Also, you would need to be able to camouflage more than a bit. That’s because problem owners obviously don’t like te be discomforted by being confronted with mistakes in their organisation in public. This is not an easy task. Let me try with one realistic example though.

The customer demanded a report with a preface mentioning the total number of “inhabitant registrations” (note the camouflage) followed by a more detailed paragraph. Which had to feature a table of subtotals representing the amount of people in a group of age ranges. Just like the example below:

The research includes 127,015,265 inhabitant registrations. These can be divided into the following age ranges:

Ages Subtotal
0-1 170023
2-5 2396859
6-10 3605589
11-20 8248278
21-50 48913234
51-100 52634295
101-200 5337274

The number mentioned in the first paragraph had been calculated by counting all the records in the REGISTRATIONS table. The subsequent table had been calculated by a query like this

SELECT  rng.range
,       COUNT(*) subtotal
FROM    registrations reg
JOIN    ranges rng
ON      (MONTHS_BETWEEN(SYSDATE, prs.date_of_birth)/12)
        BETWEEN rng.lower_bound AND rng.upper_bound
WHERE   rng.range_type = 'Age ranges'
GROUP   BY rng.range

At first it seemed there were no problems. But soon it turned out that the numbers in the table didn’t add up to the total mentioned in the preface. In fact it was off by an amount of 5,709,713 ‘missing values’, which couldn’t sufficiently be explained by the usual “bad data” excuse (like nullified or out of range values for the referenced date_of_birth column).

To give you a hint on what had happened: suppose one calculates the age of a person born just over fifty years ago. That would result in a number between 50 and 51. But wait, that individual would not fall into the range of ages between 21 and 50, nor between 51 and 100… Of course, the query join expression calculating the age suffered from a missing TRUNCATE function.

ON      TRUNC(MONTHS_BETWEEN(SYSDATE, prs.date_of_birth)/12)
        BETWEEN rng.lower_bound AND rng.upper_bound

Not so much a hilarious example, however it does demonstrate how sharing your mistakes can have a learning effect on others!

The Winner Takes It All from A3Ba

This article is about sorting and min-max aggregations on character data. When sorting matters, you end your query with an order by. You might think that if you’re sorting on character data, the first value listed would equal the aggregated minimum, and the last value listed would equal the aggregated maximum.

Sorting on character data is different from sorting on numeric data, in that there are a lot less assumptions involved in the latter. Numeric sorting is based on simple mathematic conventions which are the same anywhere on the planet. Character sorting is someting else, which will be demonstrated below.

California-Redwoods

Maximum tree height manifestation in Redwood CA

Consider this select statement:

alter session set nls_language = 'AMERICAN';

prompt American
select char_value "Chr"
, min (char_value) over () "Min"
, max (char_value) over () "Max"
, first_value(char_value) over (order by char_value) "First"
, last_value(char_value) over (order by char_value 
                               rows between unbounded preceding
                               and unbounded following) "Last"
from   ( select cast(column_value as varchar2(5)) char_value 
         from table(dbmsoutput_linesarray('A','3','B','a'))
       )
order by char_value
/

Chr   Min   Max   First Last 
----- ----- ----- ----- -----
3     3     a     3     a    
A     3     a     3     a    
B     3     a     3     a    
a     3     a     3     a    

select parameter, value
from v$nls_parameters
where parameter in ('NLS_SORT')
order by 1
/

PARAMETER  VALUE 
---------- ------
NLS_SORT   BINARY 

Now, do note I started my session with setting my NLS_LANGUAGE to America, which happens to default my sorting method to ‘BINARY’. This might not be the case for other languages, like it won’t for Dutch. Dutch sorting means that a an uppercase “B” comes after lowercase “a”, and numbers follow alfabetic characters.

As a sidenote, I converted the character found in the array [A,3,B,a] to VARCHAR2(5) just for the convenience of listing the results in an equally “spaced” fashion. No magic intended there.

To make sure that “Last_value” considers all values found in the array, I added the “unbounded both ways” analytical window.

Now, if your development environment is set up to initialize your NLS settings in “DUTCH” and your production environment in (for example) “American”, you’re in for a surprise.

When setting the language to “DUTCH”, the script above returns:

Chr   Min   Max   First Last 
----- ----- ----- ----- -----
A     3     a     A     3    
a     3     a     A     3    
B     3     a     A     3    
3     3     a     A     3    

PARAMETER VALUE                                                           
--------- ------
NLS_SORT  DUTCH

What we see is that the order of the rows listed confirm that in “DUTCH” sorting, numbers follow alfabetic characters, and lowercase characters and uppercase characters appear ‘mixed up’. Consistent with that order are the values return for the first_value and last_value functions. However, the aggregated maximum and minumum remain the same as in the previous example! It means BINARY SORTING remains in effect with respect to aggregation, regardless of the language setting.

To conclude, this article has been set up to stress the importance of consistent language settings within your development infrastructure.

How your editor treats a backslash

This post is about escaping in Notepad++, escaping from what is commonly called a backslash. It is a typographic mark also denoted by a reverse slash or backslant, among others . When coding, many programming languages recognize this special character as a token to signal the compiler to treat the character following the backslash either normally ( if it would be treated specially when not preceded by the backslash ) or specially ( if it would otherwise be treated normally).

In particular, in many programming languages, a backslash followed by a single quote would cause that quote NOT to be treated as a normal single quote.

Note that in SQL, just as in PL/SQL, a backslash does not have this special meaning. This means the following PL/SQL code will just be compiled succesfully:

Backslashcode

However when using Notepad++ to automatically apply suitable styles to recognized language code elements, like I have done above, the default configuration would tempt the Notepad++ styler to unintentionally ignore the single quote following the backslash. So by default, the code above will be displayed in Notepad++ as shown below:

Backslash Notepad++default

Fortunately, like shown in the first example, Notepad++ can be configured not to treat the backslash character like that in SQL or PL/SQL, under preferences – miscellaneous:

Backslash Notepad++

Recommended Notepad++ preference!

which concludes my post…

For counting out loud

When a whole number needs conversion to a fully spelled out format, the wellknown J/JSP “tric” is readily found. Well, in case, you didn’t, here it is. No, it has nothing to do with Java Server Pages.

select to_char(to_date( :num, 'J'), 'JSP') from dual;

The query above will return a spelled out number related to the supplied numeric parameter :num. You can play around with ‘Jsp’ / ‘jsp’ for different capitalization effects. When you’re at it, add a ‘TH’ format element to have words like ‘fourth’, ‘second’ etc. for your query result.

Ishango Bone

The to_date(..) part will convert the given number to a date value as a consequence of applying the “Julian day” datetime format element defined by the ‘J’ value for the second parameter of to_date. Subsequently, the to_char function will convert this date value back to a Julian number — if it weren’t told to also ‘SP’ell it out.

For me, however, there are two major concerns:

  1. The accepted range of numbers is –somewhat– limited, i.e. from 1 to 5373484. Well, ok, I could live with that.
  2. No matter how hard I try, like adding a third parameter to to_char defining the desired dutch language:
    select to_char(to_date( :num, 'J'), 'JSP','nls_date_language=dutch') from dual;

    the result value is always in English. That’s not a bug: it’s a documented operational note on this date format element suffix.

Of course the requirement for a translation to the Dutch langue is explained by my Dutch citizen status. The above led me to define the requirement definition below.

Requirement

Supply an API for conversion of a numeric (positive integer including nul) value to the spelled number, in the Dutch language.

Solution

A database package dutch_number_spelling has been implemented. It offers just one function, named to_char_sp, which accepts a numeric value as input and returns a (Dutch) spelled number as output.

A second (defaulted) parameter is supplied to have some control on a preference having to do with conversion of a specific range of values, which can either be phrased like a multiple of hundreds followed by the remainder ( the default ), or more puristic but less common, as a multiple of thousands followed by the remainder.

The package can be found at the end of this blog.

Examples

SQL> select num, dutch_number_spelling.to_char_sp(num) spelled_num
 2 from ( select level num from dual connect by level <= 42 );

NUM SPELLED_NUM 
---------- ------------------------------
 1 één 
 2 twee 
 3 drie 
 4 vier 
 5 vijf 
 6 zes 
 7 zeven 
 8 acht 
 9 negen 
 10 tien 
 11 elf 
 12 twaalf 
 13 dertien 
 14 veertien 
 15 vijftien 
 16 zestien 
 17 zeventien 
 18 achttien 
 19 negentien 
 20 twintig 
 21 eenentwintig 
 22 tweeëntwintig 
 23 drieëntwintig 
 24 vierentwintig 
 25 vijfentwintig 
 26 zesentwintig 
 27 zevenentwintig 
 28 achtentwintig 
 29 negenentwintig 
 30 dertig 
 31 eenendertig 
 32 tweeëndertig 
 33 drieëndertig 
 34 vierendertig 
 35 vijfendertig 
 36 zesendertig 
 37 zevenendertig 
 38 achtendertig 
 39 negenendertig 
 40 veertig 
 41 eenenveertig 
 42 tweeënveertig

42 rows selected.

Well, and some huge numbers too:

exec dbms_output.put_line(dutch_number_spelling.to_char_sp(1969));
exec dbms_output.put_line(dutch_number_spelling.to_char_sp(1969,'THOUSANDS'));
exec dbms_output.put_line(dutch_number_spelling.to_char_sp(1e9+1));
exec dbms_output.put_line(dutch_number_spelling.to_char_sp(1e38-1));

negentienhonderdnegenenzestig

duizend negenhonderdnegenenzestig

één miljard één

negenennegentig sextiljoen negenhonderdnegenennegentig quintiljard negenhonderdnegenennegentig quintiljoen negenhonderdnegenennegentig quadriljard negenhonderdnegenennegentig quadriljoen negenhonderdnegenennegentig triljard negenhonderdnegenennegentig triljoen negenhonderdnegenennegentig biljard negenhonderdnegenennegentig biljoen negenhonderdnegenennegentig miljard negenhonderdnegenennegentig miljoen negenhonderdnegenennegentigduizend negenhonderdnegenennegentig

Code

create or replace package dutch_number_spelling
is
 /*
 getallen uitschrijven volgens Genootschap Onze Taal
 d.d. 25 maart 2018
 https://onzetaal.nl/taaladvies/getallen-uitschrijven
 
 Hoe schrijf je getallen voluit, bijvoorbeeld 108, 2013, 2577 en 17.053.980?

Hele getallen in woorden worden aan elkaar geschreven, met de volgende uitzonderingen:
 - Na duizend komt een spatie.
 - Woorden als miljoen en miljard staan los.
 
 Voorbeelden:
 - 108: honderdacht
 - 678: zeshonderdachtenzeventig
 - 2016: tweeduizend zestien
 - 2577: tweeduizend vijfhonderdzevenenzeventig / vijfentwintighonderdzevenenzeventig
 - 17.053.980: zeventien miljoen drieënvijftigduizend negenhonderdtachtig

Wie aan het getal het woord en wil toevoegen, schrijft het aaneen na honderd en als apart woord na duizend, miljoen en 
 dergelijke:
 -108: honderdenacht
 -678: zeshonderdenachtenzeventig
 -2016: tweeduizend en zestien
 -2577: tweeduizend en vijfhonderdzevenenzeventig / vijfentwintighonderdenzevenenzeventig
 
 Namen van grote getallen: http://home.kpn.nl/vanadovv/Bignum.html

Who When What
 Rob van den Berg March 2018 Creation v1.1

constants
 preference_hundreds denotes a spelling preference mentioning a multiple of hundreds
 example 'vijfentwintighonderdzevenenzeventig'
 this is the default
 preference_hundreds denotes a spelling preference mentioning a multiple of thousands
 example 'tweeduizend vijfhonderdzevenenzeventig'

parameters
 num positive integer including zero to be spelled out
 accepted range is between 0 and 1e38-1 ( i.e. 39 nine's )
 spelling_preference
 defaulted to dutch_number_spelling.preference_hundreds
 modifies the spelling preference when a choice between mentioning hundreds or thousands is applicable
 
 exceptions
 e_preference_not_exists is raised when a value different from preference_hundreds and preference_thousands is passed to spelling_preference
 e_spelling_not_implemented is raised when the input is out of bounds, or not a natural number
 
*/

--constants
 preference_hundreds constant varchar2(8) default 'HUNDREDS';
 preference_thousands constant varchar2(9) default 'THOUSANDS';

--exceptions
 e_spelling_not_implemented exception;
 e_preference_not_exists exception;
 
 --functions
 function to_char_sp
 ( num in number
 , spelling_preference in varchar2 default preference_hundreds
 ) 
 return varchar2;
end dutch_number_spelling;
/

create or replace package body dutch_number_spelling
is
 c_spelling_0 constant varchar2(25) default 'nul';
 c_spelling_1 constant varchar2(25) default 'één';
 --holds spelled out numbers within the range of 1 to 14
 c_spelling_1_to_14
 constant dbmsoutput_linesarray 
 default dbmsoutput_linesarray 
 ( 'een'
 , 'twee'
 , 'drie'
 , 'vier'
 , 'vijf'
 , 'zes'
 , 'zeven'
 , 'acht'
 , 'negen'
 , 'tien'
 , 'elf'
 , 'twaalf'
 , 'dertien'
 , 'veertien'
 , 'vijftien'
 , 'zestien'
 , 'zeventien'
 , 'achttien'
 , 'negentien'
 );
 
 --holds spelled out numbers as a multiple of 10
 c_spelling_tens
 constant dbmsoutput_linesarray 
 default dbmsoutput_linesarray 
 ( 'tien'
 , 'twintig'
 , 'dertig'
 , 'veertig'
 , 'vijftig'
 , 'zestig'
 , 'zeventig'
 , 'tachtig'
 , 'negentig'
 );

c_spelling_100 constant varchar2(25) default 'honderd';
 c_spelling_1000 constant varchar2(25) default 'duizend';

c_spelling_big_numbers
 constant dbmsoutput_linesarray 
 default dbmsoutput_linesarray 
 ( 'miljoen' -- 1e6
 , 'biljoen' -- 1e12
 , 'triljoen' -- 1e18
 , 'quadriljoen' -- 1e24
 , 'quintiljoen' -- 1e30
 , 'sextiljoen' -- 1e36
 ); 
 
 function div
 ( p_number in number
 , p_divisor in number
 ) return number
 is
 begin
 return trunc(p_number/p_divisor);
 end div; 
 
 function add_en
 ( p_getal_sp in varchar)
 return varchar2
 is
 begin
 return 
 p_getal_sp||
 case substr(p_getal_sp,-1,1)
 when 'e' 
 then unistr('\00ebn') -- 'en', e met trema
 else 'en'
 end; 
 end add_en;
 
 function to_digit_and_tens
 ( p_digit in number
 , p_tens in number
 ) return varchar2
 is
 begin
 return
 case
 when p_digit = 0
 then ''
 when p_tens = 1
 then c_spelling_1_to_14(p_digit)
 else add_en(c_spelling_1_to_14(p_digit))
 end
 ||c_spelling_tens(p_tens); 
 end to_digit_and_tens;
 
 function big_number_name
 ( p_exp_of_1000 in pls_integer
 )
 return varchar2
 is
 -- p_exp_of_1000 is expected to fall in the range of natural numbers equal or higher than 2
 -- and representing a value equal to power(1000,p_exp_of_1000)
 -- if 1eN is spelled one gazil"joen", 1e(N+3) is spelled one gazil"jard"
 l_big_number_name_prefix varchar2(200);
 begin
 l_big_number_name_prefix:=regexp_substr(c_spelling_big_numbers(div(p_exp_of_1000,2)),'(.*)joen$',1,1,'i',1);
 return 
 l_big_number_name_prefix||
 case mod(p_exp_of_1000,2)
 when 0
 then 'joen'
 when 1
 then 'jard'
 end; 
 end big_number_name;
 
 function to_char_sp
 ( p_number in number
 , p_reduce_singular_form in boolean
 , p_concatenated_spelling in boolean default false
 , p_count_hundreds in boolean default true
 , p_recursive in boolean default true
 ) 
 return varchar2
 is
 l_exp_of_1000 number;
 begin 
 case
 when p_number = 0
 then 
 return case when not p_recursive then c_spelling_0 end;
 when p_number = 1 and p_reduce_singular_form
 then
 return '';
 when p_number = 1 and not p_concatenated_spelling
 then
 return c_spelling_1;
 when p_number between 1 and 14
 then return c_spelling_1_to_14(p_number);
 when p_number between 15 and 99
 then return to_digit_and_tens
 ( p_digit => mod(p_number,10)
 , p_tens => div(p_number,10)
 );
 when p_number between 100 and 999
 then return to_char_sp
 ( p_number => div(p_number,100)
 , p_reduce_singular_form => true
 )||
 c_spelling_100||
 to_char_sp
 ( p_number => mod(p_number,100)
 , p_reduce_singular_form => false
 , p_concatenated_spelling => true
 );
 when p_number between 1000 and 9999 and p_count_hundreds
 and mod(div(p_number,100),10)!=0 -- preference does not apply to first century of each millenium
 then return to_char_sp
 ( p_number => div(p_number,100)
 , p_reduce_singular_form => true
 )||
 c_spelling_100||
 to_char_sp
 ( p_number => mod(p_number,100)
 , p_reduce_singular_form => false
 , p_concatenated_spelling => true
 );
 when p_number between 1000 and 999999
 then return to_char_sp
 ( p_number => div(p_number,1000)
 , p_reduce_singular_form => true
 )||
 c_spelling_1000||
 rtrim
 ( ' '||
 to_char_sp
 ( p_number => mod(p_number,1000)
 , p_reduce_singular_form => false
 )
 );
 else 
 l_exp_of_1000 := trunc ( log ( 1000 , p_number ));
 return to_char_sp
 ( p_number => div(p_number,power(1000, l_exp_of_1000))
 , p_reduce_singular_form => false
 )||' '||
 big_number_name(l_exp_of_1000)||
 rtrim
 ( ' '||
 to_char_sp
 ( p_number => mod(p_number,power(1000, l_exp_of_1000))
 , p_reduce_singular_form => false
 )
 );
 end case; 
 end to_char_sp;

procedure check_parameters
 ( p_number in number
 , p_spelling_preference in varchar2
 )
 is
 begin
 if p_number not between 0 and 1e39-1
 then
 raise e_spelling_not_implemented;
 end if; 
 if p_number <> trunc(p_number)
 then
 raise e_spelling_not_implemented;
 end if; 
 if p_spelling_preference not in ( preference_hundreds, preference_thousands)
 then
 raise e_preference_not_exists;
 end if; 
 end;
 
 function to_char_sp
 ( num in number
 , spelling_preference in varchar2 default preference_hundreds
 ) 
 return varchar2
 is
 begin
 check_parameters
 ( p_number => num
 , p_spelling_preference => spelling_preference
 );
 return 
 to_char_sp
 ( p_number => num
 , p_reduce_singular_form => false
 , p_count_hundreds => ( spelling_preference = preference_hundreds )
 , p_recursive => false
 );
 end to_char_sp;
end dutch_number_spelling;
/

On business days

When work gets done during business days, that’s what this blog post is about. Moreover, when also the amount of business days is known upfront.

calendar-relief2

© Robert Young – Calendar at Kom Ombo

Given a date referred to as the date to start counting from, manually finding the date resulting from adding a fixed number of business days known beforehand to that start date is not too difficult, especially when aided by a carved-out in stone, nowadays printed, copy of a suitable calendar. And as long as the number of business days at hand is short enough to be manageable for a manual calculation…

Problem definition

today’s calendar

Let’s now define the problem I’d like to solve more in terms used in database software development, particularly when using SQL. I’ll be working with a specific example to show you how to solve the problem, but I’m sure you’ll be able to generalize and translate the example into your own  specific needs.

Suppose we have a table PRODUCTS defining product attributes like NAME and a whole number (construction) DAYS  indicating how many business days are needed to produce the product.

 

I would like to build a view VW_PRODUCTS listing the product id, name, construction days, today’s date, and finally the calendar date the product would be in constructed state if we would start constructing right at the first coming business day after today.

Example data, with our yet to be build PREPARED column grayed out:

SQL> select * from vw_products order by id;

  ID NAME                 TODAY      DAYS PREPARED  
---- -------------------- ---------- ---- ----------
   1 Transistor Radio     2017-07-27    1 2017-07-28
   2 Dial Telephone       2017-07-27    2 2017-07-31
   3 Popcorn Maker        2017-07-27    5 2017-08-03
   4 G.I. Joe Figure      2017-07-27    9 2017-08-09
   5 Beer Cooler          2017-07-27   11 2017-08-11

Which expression would calculate the value for PREPARED, listing the expected end-of-construction date, based on the addition of the TODAY date and the DAYS number of business days?

Assumptions

We don’t make any assumptions on the global database NLS settings regarding NLS_LANGUAGE and / or NLS_TERRITORY, more on why I mention that further on.

The TODAY date doesn’t necessarily has to be a business day itself.

Business days

We of course need to define business days, which will be as simple as this: any day of the week not being a Saturday or a Sunday is a business day.

How to recognize a business day in SQL

We could literally translate the business days definition above, like this to determine whether or not a date is to be counted as a business day:

select  case to_char(sysdate,'DAY')
          when 'SATURDAY' then 0
          when 'SUNDAY'   then 0
          else                 1
        end as business_day_ind
from    dual;

But that would give unexpected results if your session NLS settings would have been different from English. So this could be improved by supplying a value for the third, optional parameter of to_char to make sure the result of the function is in English:

select  case to_char(sysdate,'DY', 'nls_date_language=english')
          when 'SAT' then 0
          when 'SUN' then 0
          else            1
        end as business_day_ind
from    dual;

Showing that setting the NLS parameter also works for the abbreviated day format.

We can now already come up with our first implementation of the requested view:

create or replace view vw_products1 as
with function count_calendar_days(p_start_date date, p_days pls_integer)
return pls_integer
as
  l_calendar_days pls_integer;
begin
  select  min(calendar_days)
  into    l_calendar_days
  from    ( select  calendar_days
                    ,       sum(business_day_ind) 
                            over ( order by calendar_days ) nrof_bdays
                    from    ( select level calendar_days
                              ,      case to_char
                                     ( p_start_date+level
                                     ,'DY'
                                     , 'nls_date_language=english')
                                     when 'SAT' then 0
                                     when 'SUN' then 0
                                     else            1
                                     end as business_day_ind
                              from   dual
                              connect by level <= p_days * 2
                            )
          )  
  where   nrof_bdays = p_days;
  return l_calendar_days;
end count_calendar_days;
select  id
,       name
,       trunc(sysdate) today
,       days
,       trunc(sysdate)+count_calendar_days (trunc(sysdate), days) prepared
from    products
order by id;        

SQL> 
SQL> select * from vw_products1 order by id;

  ID NAME                 TODAY      DAYS PREPARED  
---- -------------------- ---------- ---- ----------
   1 Transistor Radio     2017-07-27    1 2017-07-28
   2 Dial Telephone       2017-07-27    2 2017-07-31
   3 Popcorn Maker        2017-07-27    5 2017-08-03
   4 G.I. Joe Figure      2017-07-27    9 2017-08-09
   5 Beer Cooler          2017-07-27   11 2017-08-11

The function in fact iteratively adds calendar days to the starting days, keeps track of the number of actual added business days, until the required number of business days is added. We know in advance that roughly at most two times the number of required business days is needed to be looped through, hence the “level <= p_days * 2” equation.

The prepared date is now easily derived from the addition of the starting date and the found number of calendar days, and we could be fine by that.

However, this method fails to be concise. Also, we should critique on the fact that each and every date between the start date and the sought for prepared date is being ‘probed’ for as to what day of the week it is. That’s too much information gathering. We in fact only need to know the day of the week for the starting date.

Back to the drawing board

Going back, isn’t there a date format returning the day of the week as a number, with a perspective of being able to perform calculations based on that number ? Well, in a way, there is:

alter session set nls_territory='the netherlands';
select  refdate
,       to_char(refdate,'DAY', 'nls_date_language=english') day
,       case to_char(refdate,'DY', 'nls_date_language=english')
          when 'MON' then 1
          when 'TUE' then 2
          when 'WED' then 3
          when 'THU' then 4
          when 'FRI' then 5
          when 'SAT' then 6
          when 'SUN' then 7
        end as d1
,       to_number(to_char(refdate,'D')) d2
,       trunc(refdate)-trunc(refdate,'IW')+1 d3
from    ( select  sysdate+level-1 as refdate 
          from    dual 
          connect by level <= 7
        );

REFDATE  DAY          D1   D2   D3
-------- ---------- ---- ---- ----
27-07-17 THURSDAY      4    4    4
28-07-17 FRIDAY        5    5    5
29-07-17 SATURDAY      6    6    6
30-07-17 SUNDAY        7    7    7
31-07-17 MONDAY        1    1    1
01-08-17 TUESDAY       2    2    2
02-08-17 WEDNESDAY     3    3    3

7 rows selected. 

The result of columns d1, d2 and d3 are the same, but that’s because we “cheated”: we had to make sure the territory settings weren’t set to for example ‘american’ … Unfortunately, we cannot set the territory using the same parameter of to_char we’ve used earlier to set the language. Now, credits to Frank Kulash, there’s once more an alternative to that making use of the iso week format:

select  trunc(sysdate)-trunc(sysdate,'IW')+1 as d
from    dual;

will return 1 for Monday up to 7 for Sunday, regardless of nls settings. We have already checked our results in column d3 above.

Where to start counting from

Notice that adding X business days to a date starting on a Tuesday, is the same as adding X+1 days to the Monday just before that. Why do I mention that ? Well, If you know your starting date is a Monday, the calculation is easy. Adding Y business days to a Monday would boil down to adding as many weeks to that Monday as the number of multiples of five fitting within Y, and after that adding the remainder. The first part can be expressed as TRUNC(Y/5), the second part as MOD(Y,5).

So first thing we need to determine is the OFFSET, which is the (possibly negative) amount of calendar days to add to the start date to get to the the day of the week that was “Monday”. It turns out that the expression “mod(12-d,7)-4”, with “d” being the day number as calculated above, calculates just that offset. For a Monday, it returns mod(12-1,7)-4=4-4=0, for a Friday a -4, for a Saturday a 2 returned, indicating correctly we’d not have to go back, but two calendar days forward to get to Monday.

Now we need to determine how many calendar days to add from that offset date (DAYS_FROM_OFFSET). That would be the inverse of the offset number calculated above, except for starting days being not a weekday… The expression “days+greatest(mod(d+4,5),0)” correctly calculates the amount of calendar days to be added if we start at the offset date mentioned above.

Final definition

So finally we can define our final VW_PRODUCTS view:

create or replace view vw_products as
select  id
,       name
,       today
,       days
,       today+offset+7*trunc(days_from_offset/5)+mod(days_from_offset,5) prepared
from    ( select  prd.*
          ,       mod(12-d,7)-4 offset
          ,       days+greatest(mod(d+4,5),0) days_from_offset
          from    ( select  prd.*
                    ,       trunc(sysdate) today
                    ,       trunc(sysdate)-trunc(sysdate,'IW')+1 d
                    from    products prd
                  ) prd
        ) prd       
order by id;

Function call containing out parameters in SQL

You have a procedure or function returning computed values in out parameters. And you’d like to call it in SQL. How’s that done?

First, we set up an example, like a procedure computing when someone retires based on a given birth date. To add just a bit of complexity right at the start, also the retirement age is returned. Please note the procedure doesn’t necessarily compute the correct retirement date, it’s just an example.

create or replace procedure get_retirement_details
( p_in_birth_date       in    date
, p_out_retirement_date   out date
, p_out_retirement_age    out pls_integer
)
is
  c_turning_point_date constant date        := date '1954-12-31';
  c_retirement_age     constant pls_integer := 65;
  c_months_per_year    constant pls_integer := 12;
  c_retire_factor      constant pls_integer := 36;
begin
  if ( p_in_birth_date <= c_turning_point_date )
  then 
    p_out_retirement_age := c_retirement_age;
  else 
    p_out_retirement_age := c_retirement_age + 
                            months_between 
                            ( p_in_birth_date
                            , c_turning_point_date 
                            ) / c_retire_factor;
  end if; 
  p_out_retirement_date := add_months 
                           ( p_in_birth_date
                           , p_out_retirement_age*c_months_per_year 
                           );
end;
/

Note the two out parameters. Note also the declaration of constants for each literal used. Well, declaring a constant to denote the number of months per calendar year might be overdone. However, we put our procedure to the test:

declare
  l_birth_date date := date '1969-02-25';
  l_retirement_date date;
  l_retirement_age pls_integer;
begin
  get_retirement_details
  ( p_in_birth_date => l_birth_date
  , p_out_retirement_date => l_retirement_date
  , p_out_retirement_age => l_retirement_age 
  );
  dbms_output.put_line ( l_retirement_date);
  dbms_output.put_line ( l_retirement_age);
end;
/

PL/SQL procedure successfully completed.

2039-02-25
70

Suppose we have a table with person details including their birth date. How can we, in SQL, answer a question like ‘which of these persons will be 70 when they retire’ ?

Clearly, we can’t directly call the procedure in SQL due to the out parameters. Moreover, the procedure returns two values, so we can’t just get away with transforming or overloading the procedure to a function. At least not if the function would return a singular value type…but what if we would return an object containing both computed values ?

For brevity, I’ll show the solution using a plsql_declarations clause in the with clause, and leave the exercise to write a packaged function ( with pragma udf ) to you.

The table persons shown below represents our person details.

create table persons
( first_name varchar2(25)
, birth_date date
)
/

insert into persons values ( 'Jim', date '1969-02-25' );
insert into persons values ( 'Lee', date '1979-01-16' );
insert into persons values ( 'Yaz', date '1959-11-03' );

We need an object type, and a collection type, to be able to work with them in SQL:

create type retirement_details_type
is 
object
( retirement_date date
, retirement_age number(3)
)
/

create type retirement_details_list
is
table of retirement_details_type
 /

Now, we can use the with clause to specify a function returning a collection of values containing retirement details and call that function in the query:

with function find_retirement_details
( p_in_birth_date in date
)
return retirement_details_list
as
  l_retirement_date date;
  l_retirement_age pls_integer;
begin
  get_retirement_details
  ( p_in_birth_date => p_in_birth_date
  , p_out_retirement_date => l_retirement_date
  , p_out_retirement_age => l_retirement_age 
  );
  return 
    retirement_details_list
    ( retirement_details_type
      ( l_retirement_date
      , l_retirement_age
      )
    );
end; 
select  p.first_name
,       p.birth_date
,       d.retirement_date
,       d.retirement_age
from    persons p
,       lateral
        ( select  *
          from    table
                  ( find_retirement_details 
                    ( p_in_birth_date => p.birth_date
                    )
                  )
        ) d
where   d.retirement_age = 70 
/

Which will return the desired result:

FIRST_NAME BIRTH_DATE RETIREMENT_DATE RETIREMENT_AGE
---------- ---------- --------------- --------------
Jim        1969-02-25      2039-02-25             70

 

Taming outbound call rate

Suppose you are the developer of a database service which can be called any number of times. This database service itself just registers the call, and is allowed to immediately make an outbound call to another service, for example to enhance information presented by the initial inbound call.

No suppose the interactions with this service are restricted by a limited outbound call rate, which means you should take care it doesn’t exceed a certain maximum per second. You’d then want to optimize the number of times the API for this service is called to exactly match this allowed maximum.

For simplicity, let’s assume the service API should not be called more often than 25 times per second.

Requirements

  • There’s a stored procedure which represents the referenced outbound API, we will call it my_api_request();
  • There’s a table registering each successfull call to the API by the timestamp of it’s call;
  • Querying the table should reveal how many calls per second were made, which should ideally be optimized to 25 times per second.

Solution setup

The solution I present heavily relies on the ordered non-cached sequence feature. The idea is to construct an expression consisting of both a time representation including date, hour, minute and seconds complemented with a value which cycles through 25 different values using the modulo function. Of course, the value of 25 corresponds to the required maximum of 25 calls per second.

Hence, first we create an ordered sequence

create sequence my_ordered_seq order nocache
/

And then we create the API-calls logging table

create table my_api_requests
( time_of_request    timestamp    default systimestamp not null 
, seq_id             number(15)   default my_ordered_seq.nextval
, req_id             varchar2(21) generated always as 
                                  ( substr
                                    ( to_char(time_of_request,'YYYYMMDD-HH24MISS')
                                      ||'.'
                                      ||lpad(mod(seq_id,25),3,'0')
                                    , 1
                                    ,21
                                    )
                                  )
, constraint my_api_requests_limit unique (req_id)
)

As you will see, each time a value is inserted to the table, a subsequent number is derived from the ordered sequence.
If there a more than 25 inserts within the same second, the module function will result in a previously used expression. This will cause a unique key constraint violation, effectively prohibiting the outbound call rate to exceed the required maximum.

Solution test

Contents of timertest.sql, which is parameterized script to allow us to test for the specified N number of seconds:

define p_nrof_sec = &1.
declare
  c_nrof_seconds constant pls_integer := &p_nrof_sec.;
  l_time_ends constant date := sysdate + c_nrof_seconds/(24*3600);

  l_succeeded_count pls_integer := 0;
  l_skipped_count pls_integer := 0;
 
  procedure do_api_request
  is
  pragma autonomous_transaction;
  begin
    insert into my_api_requests (time_of_request) values (default);
    l_succeeded_count:=l_succeeded_count+1;
    -- call my_api_request() here ;
    commit;
  exception
    when others
    then -- too many inserts per unit of time
    l_skipped_count:=l_skipped_count+1;
  end do_api_request;

begin
  while sysdate < l_time_ends
  loop
    do_api_request;
  end loop;
  dbms_output.put_line ( 'succeeded api requests: '||l_succeeded_count);
  dbms_output.put_line ( 'skipped api requests: '||l_skipped_count);
end;
/

select   to_char(time_of_request,'YYYYMMDD-HH24MISS')
,        count(*)
from     my_api_requests
group by to_char(time_of_request,'YYYYMMDD-HH24MISS')
/

Output

SQL> @timertest 10
succeeded api requests: 250
skipped api requests: 11679

PL/SQL procedure successfully completed.

TO_CHAR(TIME_OF_REQUEST,'YYYYMMDD-HH24MISS')  COUNT(*)
--------------------------------------------- --------
20160530-154401                                     25
20160530-154402                                     25
20160530-154403                                     25
20160530-154404                                     25
20160530-154405                                     25
20160530-154406                                     25
20160530-154407                                     25
20160530-154408                                     25
20160530-154409                                     25
20160530-154410                                     25
10 rows selected.

SQL>

Conclusion

We succesfully created a mechanism to both restrict and optimize the number of times the outbound service has been called!

Your editor as your primary PL/SQL IDE

How to invoke SQL and/or PL/SQL scripts from Notepad++

Notepad++ might well be your one and only text editor. It’s your favourite editor too, right? But can you run your Oracle SQL*Plus or Oracle PL/SQL commands or scripts from within npp++ at a click of a button ? Yes you can.

Now let me tell you, this idea is not quite new. It’s just that solutions, if any, NOT involving the installation of some Notepad++ plugin don’t seem to have been described a whote lot. So that’s the goal of this blog, just to show you real quick how to get going.

Requirements

First of all, note that you need to have these two products installed at minimum:

– Notepad++, which is a free MS Windows based source code editor which can be downloaded here. Any edition will do, including the Notepad++ Portable edition.

– Oracle SQL*Plus client like Oracle Instant Client which can be downloaded here.

Instructions

Let’s assume both products have been installed and are running correctly. Next, this is what you need to do.

Create a Windows command file npp_sqlplus.cmd (store it in your %homepath% folder) with contents similar to

set NLS_LANG=.UTF8
set TNS_ADMIN={tns_admin}
set path={instantclient};%path%
start "SQL*Plus" /D"{sqlbase}" sqlplus.exe {username}/{password}@{connection} @%1

Replace {tns_admin} with the exact path to where your tnsnames.ora file is located. Likewise, replace {instantclient} with the path to your sqlplus.exe file. Finally, replace {sqlbase} with the location your sql scripts ( like your login.sql ) are located. Specify username, password and name of the database connection. No curly brackets should remain in the end.

Finally, close Notepad++, because you need to edit a Notepad++ configuration file. Open a command line window and enter the following command

notepad %appdata%\Notepad++\shortcuts.xml

This will open the configuration file listing the user defined commands. Add a line

<Command name="SQL*Plus" Ctrl="no" Alt="no" Shift="no" Key="120">%homepath%\npp_sqlplus.cmd &quot;$(FULL_CURRENT_PATH)&quot;</Command>

to the end of the “UserDefinedCommands” section and save the file.

Now, reopen Notepad++ and note that the “Run” menu lists an extra entry called “SQL*Plus” which can be  run by hitting the F9 key (see below). Doing so now will invoke SQL*Plus and run the (saved) file you have currently open within the database connection you specified above…

NPP+SQLPLUS_003