Solr Journey

Oh Solr.  So much power, so much fumbling.

https://solr.apache.org/guide/solr/latest/index.html


Environment


I was able to run Solr on Windows and still use the multiline curl commands from WSL by adding

SOLR_JETTY_HOST=0.0.0.0

as the last line of:

notepad C:\solr\solr-9.8.1\bin\solr.in.cmd

Stop and Start Solr.  

This also makes it available on the LAN.

curl http://192.168.0.10:8983/solr/admin/info/system   (from WSL) returns a lot of system information, verifying the setup.

Start switches

bin/solr start -p 8983     (this is the one you want for dev/testing)

-c   Core AND Collection, depending on context:

bin/solr create -c mycore                        (core name)
bin/solr create_collection -c customers          (collection name)

-p   Port; 8983 is the typical one in the docs
-e   Example (don't use)


Core vs Collection

For now use Core (Stand-alone mode)

Note in the docs you can start a collection with the create command

Command                                                     Solr Mode     Creates
bin/solr create -c mycore                                   Standalone    Core
bin/solr create -c localDocs --shards 2 -rf 2               SolrCloud     Collection
bin/solr create_collection -c localDocs --shards 2 -rf 2    SolrCloud     Collection


Faceting and Clustering

https://solr.apache.org/guide/solr/latest/getting-started/searching-in-solr.html

Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.
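As a quick sketch of what a facet request looks like in practice (the core name planning_applications and the field app_type are my own assumptions, not from the docs):

```python
# Build a faceted query: count documents per app_type value.
from urllib.parse import urlencode

params = urlencode({
    "q": "*:*",
    "rows": 0,                 # only want the facet counts, not documents
    "facet": "true",
    "facet.field": "app_type",
})
url = "http://localhost:8983/solr/planning_applications/select?" + params
print(url)
```

The response then carries a facet_counts section with one count per distinct app_type value.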


Schema

In Solr, the managed-schema.xml file (or sometimes just managed-schema) is typically user-edited or managed through the Schema API, but Solr itself can write to it under specific circumstances.

Here's the breakdown:
✅ Solr can write to managed-schema:

If your core is configured to use a managed schema (not schema.xml), then Solr will update managed-schema automatically when you use:

    The Schema API to add fields/dynamic fields/copy fields, etc.

    Field guessing from documents if schema auto-creation is enabled (update.autoCreateFields=true in solrconfig.xml).

This file is managed by Solr internally in that case — and you should avoid editing it directly because:

    Manual changes may be overwritten.

    Mistakes can cause Solr startup errors.
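For the record, a Schema API call that triggers one of those managed-schema rewrites looks roughly like this (the field name app_state is my own assumption); the payload is POSTed with Content-Type: application/json to http://localhost:8983/solr/<core>/schema:

```python
# Sketch of a Schema API "add-field" payload. Posting this makes Solr
# itself rewrite managed-schema, which is why hand-edits get clobbered.
import json

payload = json.dumps({
    "add-field": {
        "name": "app_state",
        "type": "string",
        "indexed": True,
        "stored": True,
    }
})
print(payload)
```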

✅ User-managed alternative: schema.xml

If you're using a classic schema.xml, then yes, Solr does not write to it, and it's fully user-edited. That mode is often preferred in controlled environments.  [gut feeling says this is best to get to]

To switch to classic mode:

    In solrconfig.xml, set <schemaFactory class="ClassicIndexSchemaFactory" />.

    Rename or provide schema.xml instead of managed-schema.

🔄 How to tell if you're using a managed schema:

In solrconfig.xml, check for:

<schemaFactory class="ManagedIndexSchemaFactory">

This means you're using managed-schema.

This is obviously important and 9.8 instructions should be worked through.

once you want to lock it down

add...

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">false</bool>
</schemaFactory>

To solrconfig.xml

If you're still developing and want both worlds:


<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>


Fields and Copyfields

Use copyField when your primary field is designed for full-text search but you also need exact match capabilities.
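A sketch of what that looks like in the schema (the field name description_exact is my own hypothetical; description matches the field used elsewhere in these notes):

```xml
<!-- Tokenised field for full-text search -->
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<!-- Untokenised copy for exact-match queries; no need to store it twice -->
<field name="description_exact" type="string" indexed="true" stored="false"/>
<copyField source="description" dest="description_exact"/>
```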


Gotchas

Unexpected Arrays


Arrays appear in Solr responses for single-value items when a field is implicitly multiValued:


 <field name="description" type="text_general"/>

change to

<field name="description" type="text_general" multiValued="false"/>

*.* vs *:*

Goddammit, Solr fails silently with zero results if I type *.* instead of *:* (DOS-legacy muscle memory).

Handling the Unique Key in managed-schema.xml


To set the "primary" "update key" in that file, look for <uniqueKey>id</uniqueKey> and replace it with:

  <uniqueKey>solr_id</uniqueKey>
<field name="solr_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

This part of the managed-schema is not auto-updated.


Handling location LatLonPoint.


In managed-schema.xml:


<field name="location" type="location" indexed="true" stored="true" />

type="location" = built-in for lat,lon

in vb.net:

solrDoc.Add("location", row("lat").ToString() & "," & row("lng").ToString())

Searching in Solr later:

Everything within 10km of a point:

fq={!geofilt sfield=location pt=51.5901,-1.8154 d=10}

Sort by nearest:

sort=geodist(location,51.5901,-1.8154) asc
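Putting the two together into a full select URL (the core name planning_applications and the fl field list are my own assumptions):

```python
# "Within 10 km of the point, nearest first" as one request.
from urllib.parse import urlencode

params = urlencode({
    "q": "*:*",
    "fq": "{!geofilt sfield=location pt=51.5901,-1.8154 d=10}",
    "sort": "geodist(location,51.5901,-1.8154) asc",
    "fl": "solr_id,address,location",
})
url = "http://localhost:8983/solr/planning_applications/select?" + params
print(url)
```

Note the local-params braces and commas get percent-encoded; curl is happy either way as long as the URL is quoted.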

Useful urls


http://localhost:8983/solr/planning_applications/schema

http://localhost:8983/solr/planning_applications/schema/uniquekey   (returned solr_id, which is my own unique key)

http://localhost:8983/solr/admin/cores?action=RELOAD&core=planning_applications



Move to fixed schema


in Solrconfig.xml set:


  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"


Testing

Create record.json with e.g.

[
 {
  "uid": "#24/00245/FULPP",
  "authority": "Rushmoor",
  "solr_id": "Rushmoor_2332342F30303234352F46554C5050",
  "earliest_date": "18/04/2024Z",
  "address": "6 Jubilee Close Farnborough Hampshire GU14 9TD",
  "description": "Erection of a single storey side extension",
  "lat": 51.2945,
  "lng": -0.7885,
  "app_size": "Small",
  "app_state": "Permitted",
  "app_type": "Full",
  "linked_from_another_application": false,
  "url": "https://publicaccess.rushmoor.gov.uk/online-applications/applicationDetails.do?activeTab=summary&keyVal=SC4OXKNMITN00",
  "numAppeals": 0,
  "location": "51.2945,-0.7885",
  "lc_url": "/planning-applications/local-authority/Rushmoor/uid/2332342F30303234352F46554C5050",
  "hex_uid": "2332342F30303234352F46554C5050",
  "more_data": true
}
]


then 


C:\solr\solr-9.8.1\bin>curl -X POST "http://localhost:8983/solr/planning_applications/update?commit=true" -H "Content-Type: application/json" --data-binary "@record.json"
{
  "responseHeader":{
    "status":400,
    "QTime":128
  },
  "error":{
    "metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.time.format.DateTimeParseException"],
    "msg":"ERROR: [doc=Rushmoor_2332342F30303234352F46554C5050] Error adding field 'earliest_date'='18/04/2024Z' msg=Invalid Date in Date Math String:'18/04/2024Z'",
    "code":400
  }
}
C:\solr\solr-9.8.1\bin>


This gives you the actual error (here: the date isn't in ISO-8601 format).
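Solr's date fields only accept ISO-8601 instants (e.g. 2024-04-18T00:00:00Z), so the UK-style "18/04/2024Z" value has to be converted before indexing. A minimal sketch:

```python
# Convert a dd/mm/yyyy date (with a stray trailing Z) into the ISO-8601
# instant format that Solr date fields require.
from datetime import datetime

raw = "18/04/2024Z"
dt = datetime.strptime(raw.rstrip("Z"), "%d/%m/%Y")
earliest_date = dt.strftime("%Y-%m-%dT00:00:00Z")
print(earliest_date)  # 2024-04-18T00:00:00Z
```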

Backup


For a single core: just stop the instance, then back up / zip

D:\solr\solr-9.8.1\server\solr\core_name\

e.g.

D:\solr\solr-9.8.1\server\solr\planning_applications\

This folder contains \conf  \data and the core.properties file.


Searching


Text

Spatial

There are four main field types available for spatial search:

  • LatLonPointSpatialField (has docValues enabled by default)

  • PointType

  • SpatialRecursivePrefixTreeFieldType (RPT for short), including RptWithGeometrySpatialField, a derivative

  • BBoxField

LatLonPointSpatialField is the ideal field type for the most common use-cases for lat-lon point data. RPT offers some more features for more advanced/custom use cases and options like polygons and heatmaps.

RptWithGeometrySpatialField is for indexing and searching non-point data though it can do points too. It can’t do sorting/boosting.

BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate (Intersects,Within,Contains,Disjoint,Equals), and a relevancy sort/boost like overlapRatio or simply the area.
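The "location" field type used earlier in these notes corresponds to the first of these; a typical schema declaration is a sketch like:

```xml
<!-- LatLonPointSpatialField: docValues on by default, enables geodist sorting -->
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
<field name="location" type="location" indexed="true" stored="true"/>
```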



Logs


Log Rotation

If you're running Solr as a service, make sure logrotate is set up:


sudo nano /etc/logrotate.d/solr

Add something like:


/var/solr/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}

This keeps 7 days of logs, compressed. Change daily to weekly if needed.


Disable tlogs (if not needed)

In solrconfig.xml, within your core:


<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <bool name="enabled">false</bool>  <!-- Disable tlogs -->
  </updateLog>
</updateHandler>

Only do this if you're not using SolrCloud or don’t care about crash recovery.

Auto Start


sudo systemctl is-enabled solr      check if enabled


sudo systemctl enable solr            enable it



Migration from Zookeeper and shards to a single node.  


  • Each Shard had 50% of the records
  • Logs: These are managed by log4j, and will grow unless rotation is configured. So this is important for all installs (see above)
  • I had been starting the command-line "example" with the cloud parameter etc.
  • I moved the data from /example into /var/solr/data and let the service manage Solr, i.e. sudo systemctl start solr, auto-restart etc.
  • Used this reindex.sh to copy data from one shard to the surviving one:

#!/bin/bash

SRC="http://localhost:8983/solr/parallelTextShard2"
DEST="http://localhost:8983/solr/parallelText"
ROWS=500
START=0

while true; do
  echo "Fetching $START..."
  RESP=$(curl -s "$SRC/select?q=*:*&start=$START&rows=$ROWS&wt=json")
  DOCS=$(echo "$RESP" | jq '.response.docs')

  COUNT=$(echo "$DOCS" | jq 'length')
  if [ "$COUNT" -eq 0 ]; then
    echo "Done."
    break
  fi

  echo "$DOCS" | jq 'map(del(._version_))' > docs.json   # strip the autogenerated _version_ field, which was making Solr reject the docs

  curl -s "$DEST/update?commit=true" \
       -H "Content-Type: application/json" \
       --data-binary @docs.json

  START=$((START + ROWS))
done
