Solr Journey

Oh Solr.  So much power, so much fumbling.

https://solr.apache.org/guide/solr/latest/index.html


Environment


I was able to run Solr on Windows and still use the multiline curl commands from WSL by adding

SOLR_JETTY_HOST=0.0.0.0

as the last line of:

notepad C:\solr\solr-9.8.1\bin\solr.in.cmd

Stop and Start Solr.  

This also makes it available on the LAN.

curl http://192.168.0.10:8983/solr/admin/info/system   (from WSL) returns a lot of system information, verifying the setup.

Start switches

bin/solr start -p 8983     (this is the one you want for dev/testing)

-c   Core AND Collection, depending on context:

bin/solr create -c mycore                        (core name)
bin/solr create_collection -c customers          (collection name)

-p   Port; 8983 is the typical one in the docs
-e   Example (don't use)


Core vs Collection

For now use Core (Stand-alone mode)

Note in the docs you can start a collection with the create command

Command                                                     Solr Mode     Creates
bin/solr create -c mycore                                   Standalone    Core
bin/solr create -c localDocs --shards 2 -rf 2               SolrCloud     Collection
bin/solr create_collection -c localDocs --shards 2 -rf 2    SolrCloud     Collection


Faceting and Clustering

https://solr.apache.org/guide/solr/latest/getting-started/searching-in-solr.html

Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.
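As a quick sketch of what a facet request looks like in practice (the core name planning_applications and the field app_type are my own assumptions, not from the docs):

```python
# Build a faceted query: count documents per app_type value.
from urllib.parse import urlencode

params = urlencode({
    "q": "*:*",
    "rows": 0,                 # only want the facet counts, not documents
    "facet": "true",
    "facet.field": "app_type",
})
url = "http://localhost:8983/solr/planning_applications/select?" + params
print(url)
```

The response then carries a facet_counts section with one count per distinct app_type value.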


Schema

In Solr, the managed-schema.xml file (or sometimes just managed-schema) is typically user-edited or managed through the Schema API, but Solr itself can write to it under specific circumstances.

Here's the breakdown:
✅ Solr can write to managed-schema:

If your core is configured to use a managed schema (not schema.xml), then Solr will update managed-schema automatically when you use:

    The Schema API to add fields/dynamic fields/copy fields, etc.

    Field guessing from documents if schema auto-creation is enabled (update.autoCreateFields=true in solrconfig.xml).

This file is managed by Solr internally in that case — and you should avoid editing it directly because:

    Manual changes may be overwritten.

    Mistakes can cause Solr startup errors.
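For the record, a Schema API call that triggers one of those managed-schema rewrites looks roughly like this (the field name app_state is my own assumption); the payload is POSTed with Content-Type: application/json to http://localhost:8983/solr/<core>/schema:

```python
# Sketch of a Schema API "add-field" payload. Posting this makes Solr
# itself rewrite managed-schema, which is why hand-edits get clobbered.
import json

payload = json.dumps({
    "add-field": {
        "name": "app_state",
        "type": "string",
        "indexed": True,
        "stored": True,
    }
})
print(payload)
```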

✅ User-managed alternative: schema.xml

If you're using a classic schema.xml, then yes, Solr does not write to it, and it's fully user-edited. That mode is often preferred in controlled environments.  [gut feeling says this is best to get to]

To switch to classic mode:

    In solrconfig.xml, set <schemaFactory class="ClassicIndexSchemaFactory" />.

    Rename or provide schema.xml instead of managed-schema.

🔄 How to tell if you're using a managed schema:

In solrconfig.xml, check for:

<schemaFactory class="ManagedIndexSchemaFactory">

This means you're using managed-schema.

This is obviously important and 9.8 instructions should be worked through.

once you want to lock it down

add...

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">false</bool>
</schemaFactory>

To solrconfig.xml

If you're still developing and want both worlds:


<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>


Fields and Copyfields

Use copyField when your primary field is designed for full-text search but you also need exact match capabilities.
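A sketch of what that looks like in the schema (the field name description_exact is my own hypothetical; description matches the field used elsewhere in these notes):

```xml
<!-- Tokenised field for full-text search -->
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<!-- Untokenised copy for exact-match queries; no need to store it twice -->
<field name="description_exact" type="string" indexed="true" stored="false"/>
<copyField source="description" dest="description_exact"/>
```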


Gotchas

Unexpected Arrays


Arrays appear in Solr responses for single-value items when a field is implicitly multiValued:


 <field name="description" type="text_general"/>

change to

<field name="description" type="text_general" multiValued="false"/>

*.* vs *:*

Goddammit, Solr fails silently with zero results if I type *.* instead of *:* (DOS-legacy muscle memory).

Handling the Unique Key in managed-schema.xml


To set the "primary" "update key" in that file, look for <uniqueKey>id</uniqueKey> and replace it with:

  <uniqueKey>solr_id</uniqueKey>
<field name="solr_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

This part of the managed-schema is not auto-updated.


Handling location LatLonPoint.


In managed-schema.xml:


<field name="location" type="location" indexed="true" stored="true" />

type="location" = built-in for lat,lon

in vb.net:

solrDoc.Add("location", row("lat").ToString() & "," & row("lng").ToString())

Searching in Solr later:

Everything within 10km of a point:

fq={!geofilt sfield=location pt=51.5901,-1.8154 d=10}

Sort by nearest:

sort=geodist(location,51.5901,-1.8154) asc
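Putting the two together into a full select URL (the core name planning_applications and the fl field list are my own assumptions):

```python
# "Within 10 km of the point, nearest first" as one request.
from urllib.parse import urlencode

params = urlencode({
    "q": "*:*",
    "fq": "{!geofilt sfield=location pt=51.5901,-1.8154 d=10}",
    "sort": "geodist(location,51.5901,-1.8154) asc",
    "fl": "solr_id,address,location",
})
url = "http://localhost:8983/solr/planning_applications/select?" + params
print(url)
```

Note the local-params braces and commas get percent-encoded; curl is happy either way as long as the URL is quoted.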

Useful urls


http://localhost:8983/solr/planning_applications/schema

http://localhost:8983/solr/planning_applications/schema/uniquekey   (returned solr_id, which is my own unique key)

http://localhost:8983/solr/admin/cores?action=RELOAD&core=planning_applications



Move to fixed schema


in Solrconfig.xml set:


  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"


Testing

Create record.json with e.g.

[
 {
  "uid": "#24/00245/FULPP",
  "authority": "Rushmoor",
  "solr_id": "Rushmoor_2332342F30303234352F46554C5050",
  "earliest_date": "18/04/2024Z",
  "address": "6 Jubilee Close Farnborough Hampshire GU14 9TD",
  "description": "Erection of a single storey side extension",
  "lat": 51.2945,
  "lng": -0.7885,
  "app_size": "Small",
  "app_state": "Permitted",
  "app_type": "Full",
  "linked_from_another_application": false,
  "url": "https://publicaccess.rushmoor.gov.uk/online-applications/applicationDetails.do?activeTab=summary&keyVal=SC4OXKNMITN00",
  "numAppeals": 0,
  "location": "51.2945,-0.7885",
  "lc_url": "/planning-applications/local-authority/Rushmoor/uid/2332342F30303234352F46554C5050",
  "hex_uid": "2332342F30303234352F46554C5050",
  "more_data": true
}
]


then 


C:\solr\solr-9.8.1\bin>curl -X POST "http://localhost:8983/solr/planning_applications/update?commit=true" -H "Content-Type: application/json" --data-binary "@record.json"
{
  "responseHeader":{
    "status":400,
    "QTime":128
  },
  "error":{
    "metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.time.format.DateTimeParseException"],
    "msg":"ERROR: [doc=Rushmoor_2332342F30303234352F46554C5050] Error adding field 'earliest_date'='18/04/2024Z' msg=Invalid Date in Date Math String:'18/04/2024Z'",
    "code":400
  }
}
C:\solr\solr-9.8.1\bin>


This gives you the actual error (here: the date isn't in ISO-8601 format).
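Solr's date fields only accept ISO-8601 instants (e.g. 2024-04-18T00:00:00Z), so the UK-style "18/04/2024Z" value has to be converted before indexing. A minimal sketch:

```python
# Convert a dd/mm/yyyy date (with a stray trailing Z) into the ISO-8601
# instant format that Solr date fields require.
from datetime import datetime

raw = "18/04/2024Z"
dt = datetime.strptime(raw.rstrip("Z"), "%d/%m/%Y")
earliest_date = dt.strftime("%Y-%m-%dT00:00:00Z")
print(earliest_date)  # 2024-04-18T00:00:00Z
```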

Backup


For a single core: just stop the instance, then back up / zip

D:\solr\solr-9.8.1\server\solr\core_name\

e.g.

D:\solr\solr-9.8.1\server\solr\planning_applications\

This folder contains \conf  \data and the core.properties file.


Searching


Text

Spatial

There are four main field types available for spatial search:

  • LatLonPointSpatialField (has docValues enabled by default)

  • PointType

  • SpatialRecursivePrefixTreeFieldType (RPT for short), including RptWithGeometrySpatialField, a derivative

  • BBoxField

LatLonPointSpatialField is the ideal field type for the most common use-cases for lat-lon point data. RPT offers some more features for more advanced/custom use cases and options like polygons and heatmaps.

RptWithGeometrySpatialField is for indexing and searching non-point data though it can do points too. It can’t do sorting/boosting.

BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate (Intersects,Within,Contains,Disjoint,Equals), and a relevancy sort/boost like overlapRatio or simply the area.
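The "location" field type used earlier in these notes corresponds to the first of these; a typical schema declaration is a sketch like:

```xml
<!-- LatLonPointSpatialField: docValues on by default, enables geodist sorting -->
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
<field name="location" type="location" indexed="true" stored="true"/>
```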



Logs


Log Rotation

If you're running Solr as a service, make sure logrotate is set up:


sudo nano /etc/logrotate.d/solr

Add something like:


/var/solr/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}

This keeps 7 days of logs, compressed. Change daily to weekly if needed.


Disable tlogs (if not needed)

In solrconfig.xml, within your core:


<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <bool name="enabled">false</bool>  <!-- Disable tlogs -->
  </updateLog>
</updateHandler>

Only do this if you're not using SolrCloud or don’t care about crash recovery.

Auto Start


sudo systemctl is-enabled solr      check if enabled


sudo systemctl enable solr            enable it



Migration from Zookeeper and shards to a single node.  


  • Each Shard had 50% of the records
  • Logs: These are managed by log4j, and will grow unless rotation is configured. So this is important for all installs (see above)
  • I had been starting the command-line "example" with the cloud parameter etc.
  • I moved the data from /example into /var/solr/data and let the service manage Solr, i.e. sudo systemctl start solr, auto-restart etc.
  • Used this reindex.sh to copy data from one shard to the surviving one:

#!/bin/bash

SRC="http://localhost:8983/solr/parallelTextShard2"
DEST="http://localhost:8983/solr/parallelText"
ROWS=500
START=0

while true; do
  echo "Fetching $START..."
  RESP=$(curl -s "$SRC/select?q=*:*&start=$START&rows=$ROWS&wt=json")
  DOCS=$(echo "$RESP" | jq '.response.docs')

  COUNT=$(echo "$DOCS" | jq 'length')
  if [ "$COUNT" -eq 0 ]; then
    echo "Done."
    break
  fi

  echo "$DOCS" | jq 'map(del(._version_))' > docs.json   # strip the autogenerated _version_ field, which was making Solr reject the docs

  curl -s "$DEST/update?commit=true" \
       -H "Content-Type: application/json" \
       --data-binary @docs.json

  START=$((START + ROWS))
done
