[Updated] Alli User Guide - Indexing HTML Documents for Knowledge Base
Updated on 6/23/2021: You can use custom names for HTML-sourced documents now.
Just like Alli indexes files from OneDrive, it can now crawl text data from HTML pages and create Documents to build your Knowledge Base. Let's take a look.
First, go to Knowledge Base > Source. You'll see OneDrive and HTML Documents listed as possible sources for your Knowledge Base. Click the Add button in the HTML Documents section to start.
In the popup that appears, enter the URL of the HTML page you want to crawl. Please remember that the page must be public. You can also choose how the document is named: the name can be the URL, the HTML document title, or a custom title of your choice.
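To make the three naming options concrete, here is a minimal sketch in Python. It is not Alli's implementation: the function document_name, the mode strings "url", "title", and "custom", and the fallback to the URL when a page has no title are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def document_name(url, html, mode, custom_title=None):
    """Pick a document name. Mode names are hypothetical, not Alli settings."""
    if mode == "url":
        return url
    if mode == "title":
        parser = TitleParser()
        parser.feed(html)
        # Assumption: fall back to the URL when the page has no title.
        return parser.title.strip() or url
    if mode == "custom":
        if not custom_title:
            raise ValueError("a custom title is required in 'custom' mode")
        return custom_title
    raise ValueError(f"unknown naming mode: {mode}")


page_url = "https://example.allganize.ai/"
page_html = "<html><head><title> Alli Guide </title></head><body>Hi</body></html>"

name_from_url = document_name(page_url, page_html, "url")
name_from_title = document_name(page_url, page_html, "title")
name_custom = document_name(page_url, page_html, "custom", "My Product Page")
```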
Checking 'Use sub URL' under the SUB-URL Regex field lets you index the content of the pages linked from the main page. You can specify which links to crawl with a regular expression. For example, to index all the pages linked on 'https://example.allganize.ai/' whose URLs start with 'https://example.allganize.ai/', set the SUB-URL Regex to a pattern that matches that prefix.
One more example: if the SUB-URL Regex is 'https://example.allganize.ai/product.*', it crawls any links that start with 'product' under the https://example.allganize.ai/ domain, such as 'https://example.allganize.ai/product_alli' or 'https://example.allganize.ai/product/alli'.
Please remember that if you use the sub URL feature, the content in the main URL is not crawled.
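If you want to check a pattern before submitting it, the sub-URL matching described above can be approximated locally. The Python sketch below assumes the regex is matched from the start of each link's URL. That is an assumption about Alli's behavior, not something this guide confirms. The function select_sub_urls and the sample links other than the two 'product' URLs are hypothetical.

```python
import re


def select_sub_urls(links, sub_url_regex):
    """Return the links whose URL matches the pattern from its first character."""
    pattern = re.compile(sub_url_regex)
    # re.match anchors at the start of the string (an assumed matching rule).
    return [link for link in links if pattern.match(link)]


# Links as they might appear on the main page (sample data for illustration).
links = [
    "https://example.allganize.ai/product_alli",
    "https://example.allganize.ai/product/alli",
    "https://example.allganize.ai/blog/news",
    "https://other.example.com/product",
]

selected = select_sub_urls(links, r"https://example.allganize.ai/product.*")
for link in selected:
    print(link)
```

Only the two 'product' links are selected here. The main URL 'https://example.allganize.ai/' would not appear in the result either, which is consistent with the note that the main page's own content is not crawled when the sub URL feature is on.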
Click the Submit button and you'll see the item you added under the HTML Documents section. You may need to wait a few seconds (or longer if there's a lot to index) and reload the page to see the name appear. If the Status toggle is on, the documents are automatically updated every day at 12 am UTC to keep them up to date.
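Because the update time is given in UTC, it may help to work out when the next auto-update falls in your own time zone. The Python sketch below only computes the next 12 am UTC after a given moment. The function next_auto_update is a hypothetical helper and does not call any Alli API.

```python
from datetime import datetime, timedelta, timezone


def next_auto_update(now):
    """Return the next 12 am UTC strictly after `now` (a timezone-aware datetime)."""
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


# Example: a source added on the afternoon of 6/23/2021 (UTC).
added_at = datetime(2021, 6, 23, 15, 30, tzinfo=timezone.utc)
next_run = next_auto_update(added_at)

# Convert to the local time zone of the machine running the script.
print(next_run.astimezone())
```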
The documents created can be found under the Documents tab.