Large multimodal models – such as, GPT-4v, GPT-4o, Claude, etc – have the ability of contextual understanding in image, and these models now have the ability to replace current lots of vision tasks with flexible manners.
But unfortunately, the quality of text extraction in these models is not as well as today’s state-of-the-art OCR (optical character recognition) models.
This post introduces you how to improve text recognition quality in such models (GPT-4v, GPT-4o, etc) with the help of existing OCR models.
I note that here I have used GPT-4 vision (shortly, GPT-4v) preview model and all examples in this post include the following system message in prompts.
System message
Each image in user prompt is followed by OCR extracted text.
Now let’s start examples.
In the first example, let’s see the quality of text extraction in dense document.
The following is the result by vanilla GPT-4v.
GPT-4v without OCR
Extract information as JSON.
Response
{ "BALANCE SHEETS": {"As of": { "September 30th": {}, "June 30th": {}},"Assets": { "Current assets": {"Cash and cash equivalents": { "September 30th": "$80,452", "June 30th": "$34,704"},"Short-term investments": { "September 30th": "$3,499", "June 30th": "$16,558"},"Total cash, cash equivalents, and short-term investments": { "September 30th": "$83,951", "June 30th": "$51,262"},"Accounts receivable, net of allowance for doubtful accounts of $512 and $650": { "September 30th": "$36,053", "June 30th": "$48,808"},"Inventories": { "September 30th": "$2,962", "June 30th": "$2,667"},"Other current assets": { "September 30th": "$24,632", "June 30th": "$21,807"},"Total current assets": { "September 30th": "$207,586", "June 30th": "$184,261"} }, "Property and equipment, net of accumulated depreciation of $89,486 and $86,251": {"September 30th": "$156,502","June 30th": "$156,376" }, "Operating lease right-of-use assets": {"September 30th": "$11,423","June 30th": "$9,869" }, "Equity investments": {"September 30th": "$5,895","June 30th": "$4,376" }, "Goodwill": {"September 30th": "$67,039","June 30th": "$67,869" }, "Intangible assets, net": {"September 30th": "$8,716","June 30th": "$9,061" }, "Other long-term assets": {"September 30th": "$38,154","June 30th": "$30,361" }, "Total assets": {"September 30th": "$445,755","June 30th": "$411,976" }},"Liabilities and stockholders' equity": { "Liabilities": {"Current liabilities": { "Accounts payable": {"September 30th": "$19,307","June 30th": "$18,095" }, "Short-term debt": {"September 30th": "$5,708","June 30th": "$0" }, "Current portion of long-term debt": {"September 30th": "$2,498","June 30th": "$1,247" }, "Accrued compensation": {"September 30th": "$3,400","June 30th": "$5,029" }, "Short-term income taxes": {"September 30th": "$4,495","June 30th": "$4,152" }, "Short-term unearned revenue": {"September 30th": "$6,472","June 30th": "$10,701" }, "Other current liabilities": {"September 30th": "$8,645","June 30th": "$5,045" }, "Total current liabilities": {"September 30th": "$124,792","June 30th": "$104,169" }},"Long-term debt": { "September 30th": "$24,926", "June 30th": "$24,540"},"Long-term income taxes": { "September 30th": "$2,474", "June 30th": "$2,199"},"Long-term unearned revenue": { "September 30th": "$2,789", "June 30th": "$2,932"},"Deferred income taxes": { "September 30th": "$470", "June 30th": "$473"},"Operating lease liabilities": { "September 30th": "$13,487", "June 30th": "$12,738"},"Other long-term liabilities": { "September 30th": "$8,674", "June 30th": "$7,281"},"Total liabilities": { "September 30th": "$225,071", "June 30th": "$205,753"} }, "Commitments and contingencies": {}, "Stockholders' equity": {"Common stock and paid-in capital – shares authorized 24,000; outstanding 7,431 and 7,432": { "September 30th": "$95,568", "June 30th": "$93,718"},"Retained earnings": { "September 30th": "$132,143", "June 30th": "$118,374"},"Accumulated other comprehensive loss": { "September 30th": "($2,937)", "June 30th": "($3,243)"},"Total stockholders' equity": { "September 30th": "$224,774", "June 30th": "$208,849"} }, "Total liabilities and stockholders' equity": {"September 30th": "$445,755","June 30th": "$411,976" }} }, "Refer to accompanying notes": "5"}
As you can see below, there exist a lot of mistakes in extracting numbers.
Now I have extracted text information with state-of-the-art OCR (here I have used Azure AI Vision), and added this extracted result in the prompt as follows.
This time, as you can see below, the result matches with the original document.
GPT-4v with OCR
**OCR text:** BALANCESHEETS
[Inmillions](Unaudited)
September30,
June30,
2023
Assets
Currentassets
Cashandcashequivalents
80.452
34,704
Short-terminvestments
63,499
76.558
Totalcash,cashequivalents,andshort-terminvestments
143,951
111,252
Accountsreceivable,netofallowancefordoubtfulaccountsof$512and$650
36,953
48,688
Inventories
3.000
2.500
Othercurrentassets
23,682
21.807
Totalcurrentassets
207,586
184.257
Propertyandequipment,netofaccumulateddepreciationof$69,486and$68,251
102,502
95.641
Operatingleaseright-of-useassets
15.435
14.346
Equityinvestments
11,423
9.879
Goodwil
67,790
67,886
Intangibleassets,net
8,895
9.366
Otherlong-termassets
32,154
30,601
Totalassets
445,785
411,976
Liabilitiesandstockholders’equity
Currentliabilities:
Accountspayable
19.307
18.095
Short-termdebt
25,808
o
Currentportionoflong-termdebt
3,748
5.247
Accruedcompensation
6.990
11,009
Short-termincometaxes
8.035
4.152
Short-termunearnedrevenue
46.429
50.901
Othercurrentliabilities
14,475
14,745
Totalcurrentliabilities
124,792
104,149
Long-termdebt
41.946
41.990
Long-termincometaxes
22.983
25,560
Long-termuneamedrevenue
2,759
2.912
Deferredincometaxes
470
Operatingleaseliabilities
433
13.487
12.728
Otherlong-termliabilities
18.634
17.981
TotalBabilties
225,071
205,753
Commitmentsandcontingencies
Stockholders’equity:
Retainedearnings
Commonstockandpaid-incapital-sharesauthorized24.000;outstanding7,431and7,432
95,508
93.718
132,143
118,848
Accumulatedothercomprehensiveloss
(6,937)
(6,343)
Totalstockholders’equity
220,714
206,223
Totalliabilitiesandstockholders”equity
445,785
411.976
Refertoaccompanyingnotes.Extract information as JSON.
Response
{ "BALANCE SHEETS": {"Date": { "September 30": "2023", "June 30": "2023"},"Assets": { "Current assets": {"Cash and cash equivalents": { "September 30": "$80,452", "June 30": "$34,704"},"Short-term investments": { "September 30": "$63,499", "June 30": "$76,558"},"Total cash, cash equivalents, and short-term investments": { "September 30": "$143,951", "June 30": "$111,252"},"Accounts receivable, net of allowance for doubtful accounts of $512 and $650": { "September 30": "$36,953", "June 30": "$48,688"},"Inventories": { "September 30": "$3,000", "June 30": "$2,500"},"Other current assets": { "September 30": "$23,682", "June 30": "$21,807"},"Total current assets": { "September 30": "$207,586", "June 30": "$184,257"} }, "Property and equipment, net of accumulated depreciation of $69,486 and $68,251": {"September 30": "$102,502","June 30": "$95,641" }, "Operating lease right-of-use assets": {"September 30": "$15,435","June 30": "$14,346" }, "Equity investments": {"September 30": "$11,423","June 30": "$9,879" }, "Goodwill": {"September 30": "$67,790","June 30": "$67,886" }, "Intangible assets, net": {"September 30": "$8,895","June 30": "$9,366" }, "Other long-term assets": {"September 30": "$32,154","June 30": "$30,601" }, "Total assets": {"September 30": "$445,785","June 30": "$411,976" }},"Liabilities and stockholders' equity": { "Current liabilities": {"Accounts payable": { "September 30": "$19,307", "June 30": "$18,095"},"Short-term debt": { "September 30": "$25,808", "June 30": "o"},"Current portion of long-term debt": { "September 30": "$3,748", "June 30": "$5,247"},"Accrued compensation": { "September 30": "$6,990", "June 30": "$11,009"},"Short-term income taxes": { "September 30": "$8,035", "June 30": "$4,152"},"Short-term unearned revenue": { "September 30": "$46,429", "June 30": "$50,901"},"Other current liabilities": { "September 30": "$14,475", "June 30": "$14,745"},"Total current liabilities": { "September 30": "$124,792", "June 30": "$104,149"} }, "Long-term debt": {"September 30": "$41,946","June 30": "$41,990" }, "Long-term income taxes": {"September 30": "$22,983","June 30": "$25,560" }, "Long-term unearned revenue": {"September 30": "$2,759","June 30": "$2,912" }, "Deferred income taxes": {"September 30": "$470","June 30": "" }, "Operating lease liabilities": {"September 30": "$13,487","June 30": "$12,728" }, "Other long-term liabilities": {"September 30": "$18,634","June 30": "$17,981" }, "Total liabilities": {"September 30": "$225,071","June 30": "$205,753" }, "Stockholders' equity": {"Retained earnings": { "September 30": "", "June 30": ""},"Common stock and paid-in capital - shares authorized 24,000; outstanding 7,431 and 7,432": { "September 30": "$95,508", "June 30": "$93,718"},"Additional paid-in capital": { "September 30": "$132,143", "June 30": "$118,848"},"Accumulated other comprehensive loss": { "September 30": "($6,937)", "June 30": "($6,343)"},"Total stockholders' equity": { "September 30": "$220,714", "June 30": "$206,223"} }, "Total liabilities and stockholders' equity": {"September 30": "$445,785","June 30": "$411,976" }},"Notes": "Refer to accompanying notes." }}
However, you could find that the value in “Short-term debt
” is still wrong (it’s alphabetic “o
“, not numeric “0
“), because OCR model has extracted this wrong character.
Next, please change user’s prompt as follows in order to fix these misreading values. (The image and OCR text are the same as previous.)
Extract information as JSON.
OCR text is sometimes wrong, and please fix values if needed.
As you can see below, this will eventually fix it to the correct value, because GPT-4v can understand the contextual information in image like humans.
The type of values is also converted into appropriate type in JSON – such like, $80,452
to 80452
, “” (blank) to null
, or “” (blank) to 0
…
This JSON may be the result you have expected.
Response
{ "Balance Sheets": {"Date": { "September 30": "2023", "June 30": "2023"},"Assets": { "Current assets": {"Cash and cash equivalents": { "September 30": 80452, "June 30": 34704},"Short-term investments": { "September 30": 63499, "June 30": 76558},"Total cash, cash equivalents, and short-term investments": { "September 30": 143951, "June 30": 111252},"Accounts receivable, net of allowance for doubtful accounts of $512 and $650": { "September 30": 36953, "June 30": 48688},"Inventories": { "September 30": 3000, "June 30": 2500},"Other current assets": { "September 30": 23682, "June 30": 21807},"Total current assets": { "September 30": 207586, "June 30": 184257} }, "Property and equipment, net of accumulated depreciation of $69,486 and $68,251": {"September 30": 102502,"June 30": 95641 }, "Operating lease right-of-use assets": {"September 30": 15435,"June 30": 14346 }, "Equity investments": {"September 30": 11423,"June 30": 9879 }, "Goodwill": {"September 30": 67790,"June 30": 67886 }, "Intangible assets, net": {"September 30": 8895,"June 30": 9366 }, "Other long-term assets": {"September 30": 32154,"June 30": 30601 }, "Total assets": {"September 30": 445785,"June 30": 411976 }},"Liabilities and stockholders' equity": { "Current liabilities": {"Accounts payable": { "September 30": 19307, "June 30": 18095},"Short-term debt": { "September 30": 25808, "June 30": 0},"Current portion of long-term debt": { "September 30": 3748, "June 30": 5247},"Accrued compensation": { "September 30": 6990, "June 30": 11009},"Short-term income taxes": { "September 30": 8035, "June 30": 4152},"Short-term unearned revenue": { "September 30": 46429, "June 30": 50901},"Other current liabilities": { "September 30": 14475, "June 30": 14745},"Total current liabilities": { "September 30": 124792, "June 30": 104149} }, "Long-term debt": {"September 30": 41946,"June 30": 41990 }, "Long-term income taxes": {"September 30": 22983,"June 30": 25560 }, "Long-term unearned revenue": {"September 30": 2759,"June 30": 2912 }, "Deferred income taxes": {"September 30": 470,"June 30": 0 }, "Operating lease liabilities": {"September 30": 13487,"June 30": 12728 }, "Other long-term liabilities": {"September 30": 18634,"June 30": 17981 }, "Total liabilities": {"September 30": 225071,"June 30": 205753 }, "Stockholders' equity": {"Retained earnings": { "September 30": null, "June 30": null},"Common stock and paid-in capital - shares authorized 24,000; outstanding 7,431 and 7,432": { "September 30": 95508, "June 30": 93718},"Accumulated other comprehensive loss": { "September 30": -6937, "June 30": -6343},"Total stockholders' equity": { "September 30": 220714, "June 30": 206223} }, "Total liabilities and stockholders' equity": {"September 30": 445785,"June 30": 411976 }} }, "Notes": "Refer to accompanying notes."}
Let’s go to the next example.
Currently (Feb 2024), OpenAI says that GPT-4v may not perform optimally when handling images with text of non-Latin alphabets – such as, Chinese, Japanese or Korean.
In the second example, we try to extract information from handwritten Japanese text as follows.
As you can see below, there’s a lot of misreading Japanese words in extracting with vanilla GPT-4v model.
GPT-4v without OCR
Extract information as markdown text.
Response
| 品名 | 数量 | 単価 | 小計 ||--------|------|------|------|| うな丼 | 2| 210 | 420 || リール | 1| 18 | 18 || 雑誌 | 60 | 89 | 5340 |
Now I have also extracted text information with OCR model (here I have also used Azure AI Vision), and added this extracted result in the prompt as follows.
As you can see below, the result exactly matches with the original image.
GPT-4v with OCR
**OCR text:** 物品
単価
数量
小計
消しゴム
98
51
4998
210
3780
18
ノート
鉛筆
60
89
5340Extract information as markdown text.
Response
| 物品 (Item) | 単価 (Unit Price) | 数量 (Quantity) | 小計 (Subtotal) ||-------------|------------------|----------------|----------------|| 消しゴム| 98 | 51 | 4998 || ノート | 210 | 18 | 3780 || 鉛筆| 60 | 89 | 5340 |
These 2 examples simply extract information by text (JSON, Markdown), but this will eventually improve accuracy in a lot of practical tasks with visual recognition – such as, extracting means, finding insights, generating reports, more and more. (Please compare the results in the following example.)
Summarize this report into 5 bullet points.
Response without OCR
- The balance sheet is for the periods ending September 30th and June 30th, with figures in millions and unaudited.
- Total assets increased from $414,976 million on June 30th to $445,785 million on September 30th.
- Total liabilities and stockholders’ equity also increased from $411,976 million on June 30th to $445,785 million on September 30th.
- The company’s cash and cash equivalents saw a significant rise from $34,704 million on June 30th to $80,452 million on September 30th.
- Common stock and paid-in capital remained relatively stable at $95,168 million on September 30th compared to $93,718 million on June 30th, while retained earnings increased and accumulated other comprehensive loss decreased slightly.
Response with OCR
- The total assets of the company as of September 30, 2023, amount to $445,785 million, an increase from $411,976 million on June 30, 2023.
- Current assets, including cash and cash equivalents, short-term investments, accounts receivable, inventories, and other current assets, total $207,586 million, up from $184,257 million in the previous quarter.
- The company has a total stockholders’ equity of $220,714 million, a rise from $206,223 million reported on June 30, 2023, with retained earnings and common stock being the primary components.
- Long-term debt stands at $41,946 million, with long-term income taxes and long-term unearned revenue at $22,983 million and $2,759 million respectively, contributing to total liabilities of $225,071 million.
- Short-term liabilities include debts like accounts payable, short-term debt, accrued compensation, and short-term unearned revenue, totaling $124,792 million, an increase from the previous quarter’s $104,149 million.
In Microsoft Azure OpenAI, AI enhancement feature by OCR (in which, Azure AI Vision is used to add OCR insights in prompts) is built in the platform to simplify this integration between GPT-4v and OCR. (See here.)You just set “enhancements: {ocr: {enabled: true} }
” when invoking API in Azure OpenAI.
(Note : This feature (enhancement) is not available in GA models.)
In this post, I have experimented with OpenAI GPT-4v, but this technique might also be applied to other multimodal models (even closed models) – such as, GPT-4o, Claude, Gemini, etc.
Moreover this technique is applied not only by OCR, but also by another manner of text-assisted input. For instance, when you request some task with full screen capture (layout) of web page, you can also add html text to improve the accuracy of response.
Categories: Uncategorized
1 reply»